
HyperLogLog in Presto: Faster cardinality estimation - craigkerstiens
https://code.fb.com/data-infrastructure/hyperloglog/
======
craigkerstiens
HyperLogLog is a really interesting and powerful data type.

A few things seem to be missing from the article, though. HyperLogLog has been
around, I believe, for over 10 years now. It has been supported in Postgres for
over 5 years [1] as an extension [2]. It is great to see it growing further
with Facebook adding support for it in Presto.

[1]. [https://research.neustar.biz/2012/10/25/sketch-of-the-day-hy...](https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/)

[2]. [https://www.citusdata.com/blog/2017/06/30/efficient-rollup-w...](https://www.citusdata.com/blog/2017/06/30/efficient-rollup-with-hyperloglog-on-postgres/)

~~~
jamespo
I first became aware of it through its use in Redis.

~~~
agacera
Same here. And Salvatore's blog post about HLL is pretty nice:

[http://antirez.com/news/75](http://antirez.com/news/75)

------
massaman_yams
Original HLL paper from 2007:
[http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)

~~~
bulldoa
One of the key properties that HyperLogLog relies on is that the hash function
maps any input to any output with equal probability (e.g., any input has a
1/m chance of landing on a particular point in the output space, where m is the
cardinality of the output space).

I can't seem to prove this rigorously. Am I understanding the paper correctly?

~~~
massaman_yams
I think it's more a practical concern than something that can be proven
conclusively; the behavior characteristics (e.g., collision probability) are
known for a given hash function, and if you're implementing HLL, you choose an
appropriate one based on your use case (expected cardinality, etc.)

Redis uses a modified 64-bit MurmurHash2, for example.
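
To make the uniformity assumption concrete, here is a minimal Python sketch of
the estimator, not any particular library's implementation. It leans on uniform
hash bits in two places: the top p bits pick a register, and the position of
the leftmost 1-bit in the remaining bits is treated as a geometric random
variable. The class name, the default p = 12, and the use of truncated SHA-1 in
place of a fast hash like MurmurHash are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with m = 2**p registers (illustrative, not tuned)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(value):
        # Assumption: SHA-1 truncated to 64 bits stands in for a fast hash;
        # the estimator only needs the output bits to look uniform.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        x = self._hash64(value)
        width = 64 - self.p
        idx = x >> width                      # top p bits choose a register
        rest = x & ((1 << width) - 1)         # remaining (64 - p) bits
        rank = width - rest.bit_length() + 1  # 1-based position of leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)      # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)    # small-range (linear counting) correction
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(i)
    print(round(hll.count()))
```

With p = 12 the expected relative error is roughly 1.04 / sqrt(4096), or about
1.6%. A hash with skewed bits would overload some registers or inflate the
ranks, which is why the choice of hash function is a practical concern.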

------
ahurmazda
Awesome! Now that we have union, implementing intersect[1] would be killer.

[1]
[https://github.com/axiomhq/hyperminhash/tree/master](https://github.com/axiomhq/hyperminhash/tree/master)
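
For context, union is cheap because two sketches merge by taking the
register-wise maximum. A rough intersection can then be derived by
inclusion-exclusion, |A ∩ B| ≈ |A| + |B| - |A ∪ B|. The Python below is a
minimal sketch of that idea; it is not the HyperMinHash technique from the
linked repository and not Presto's API. The function names, p = 14, and the
truncated SHA-1 hash are assumptions for illustration.

```python
import hashlib
import math

P = 14          # assumed precision: 2**14 registers
M = 1 << P


def sketch(items):
    """Build HLL registers for an iterable of items."""
    regs = [0] * M
    width = 64 - P
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> width                                          # register index
        rank = width - (x & ((1 << width) - 1)).bit_length() + 1  # leftmost 1-bit
        regs[idx] = max(regs[idx], rank)
    return regs


def estimate(regs):
    """Raw HLL estimate with the small-range (linear counting) correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


def union(a, b):
    """Merging two sketches is a register-wise max; the result is exact."""
    return [max(x, y) for x, y in zip(a, b)]


def intersection_estimate(a, b):
    """Inclusion-exclusion estimate; clamped at zero since errors can go negative."""
    return max(0.0, estimate(a) + estimate(b) - estimate(union(a, b)))


if __name__ == "__main__":
    a = sketch(range(0, 60_000))
    b = sketch(range(30_000, 90_000))
    print(round(estimate(union(a, b))), round(intersection_estimate(a, b)))
```

The catch is that the three estimation errors add up, so the result gets
unreliable when the intersection is small relative to the union.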

------
mr_fraces2k
It is so cool to have the raw data type becoming accessible in Presto.
Testing it out, it is interesting to see the hex representation of the data
type change as one increases the count of elements!

