HyperLogLog compatibility with Apache Spark #24384

billcrook · 2024-12-05T11:02:54Z

I'm interested in using Spark to write HLL data structures that are compatible with Trino (and, by extension, AWS Athena). From what I can tell, Trino uses the HLL data structures from the associated Airlift project, while Spark leverages the Apache DataSketches library for HLL.

I'm hitting a dead end trying to find anyone who is doing this. Before I go down the rabbit hole of writing a custom Spark aggregator around the Airlift HLL data structure, does anyone know of prior art, or whether there are plans in the Trino pipeline to improve this compatibility?

hashhar (Collaborator) · 2024-12-06T12:59:43Z

I don't think most people actually store the serialized HLLs into a table.

cc: @electrum, in case he's thought about interop in the past.

1 reply

electrum (Maintainer) · Dec 6, 2024

This is a question for @martint

billcrook (Author) · 2024-12-06T19:57:23Z

> I don't think most people actually store the serialized HLLs into a table.

My use case is nearly 1T events aggregated down to a reasonable grain, which the BI team can then aggregate further across many dimensions. I don't believe it's an unusual use case.
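
To make the use case concrete, here is a minimal sketch of why storing sketches at a fine grain is useful. `ToyHLL` is a hypothetical, illustration-only class: its hash choice and register layout are assumptions and match neither the Airlift nor the DataSketches serialization. It only shows that merging two HLLs is an element-wise max of their registers, so sketches written per grain can be combined later across any dimension without rescanning the events.

```python
import hashlib
import math


class ToyHLL:
    """Toy HyperLogLog for illustration; NOT the Airlift or DataSketches format."""

    def __init__(self, p=12):
        self.p = p                     # number of index bits
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the first 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Union of two sketches: element-wise max of the registers.
        assert self.p == other.p, "sketches must use the same precision"
        merged = ToyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Two pre-aggregated grains with overlapping users (hypothetical data).
day1, day2, full = ToyHLL(), ToyHLL(), ToyHLL()
for user_id in range(0, 6_000):
    day1.add(user_id)
    full.add(user_id)
for user_id in range(4_000, 10_000):
    day2.add(user_id)
    full.add(user_id)

# Merging the stored sketches gives the same registers as sketching all events at once.
merged = day1.merge(day2)
print(round(merged.estimate()), round(full.estimate()))
```

The interop question in this thread is about the serialized bytes rather than the algorithm: two libraries can both implement HLL and still disagree on hash function, precision, and binary layout, in which case one engine cannot read the other's sketches.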

0 replies
