
Distributed count distinct vs. HyperLogLog in Postgres - craigkerstiens
https://www.citusdata.com/blog/2017/12/22/distributed-count-vs-hyperloglog/
======
rawnlq
> We’ve been big fans of the Postgres HyperLogLog extension for many years and
> are excited to be taking on responsibility to maintain HyperLogLog going
> forward.

Conceptually hyperloglog should be really simple (count leading zeroes) so I
wasn't sure what there was to "maintain".

Looking at the citus repo it seems like they made a lot of interesting
improvements to the original algorithm:
[https://github.com/citusdata/postgresql-
hll](https://github.com/citusdata/postgresql-hll)

Interesting how fitting a simple theory work to the real world takes 3.3k
lines of code.

------
shmerl
For those unfamiliar, log in hyperloglog is not related to logging, but rather
refers to logarithm.

~~~
potle
any good step by step derivation for this out there?

I read the wikipedia article and still not sure why this works.

~~~
teraflop
The basic idea is that for each element X in your data, you compute a function
f(X) which is equal to the number of leading zeros in a bitstring generated by
hashing X. Then you take the maximum of this value across all elements in the
dataset (which can be efficiently done on distributed or streaming datasets).

For any given value of X, the probability that its hash contains N leading
zeros is proportional to 1/2^N. We're taking the maximum, so if _any one_ of
the values has N leading zeros, then max(f(X)) >= N.

If the number of distinct values observed is significantly greater than 2^N,
it's very likely that at least one of them has N leading zeros. Conversely, if
the number of distinct values is much less than 2^N, it's relatively unlikely.
So the expected value of max(f(X)) is proportional to log(distinct values),
with a constant factor that can be derived.

This is the basic "probabilistic counting" algorithm of Flajolet and Martin
(1985). Its major flaw is that it produces an estimate with very high
variance; it only takes a single unusual bitstring to throw the result way
off. You can get a much more reliable result by partitioning the input into a
bunch of subsets (by slicing off a few bits of the hash function) and keeping
a separate counter for each subset, then averaging all the results at the end.
This improvement, plus a few other optimizations, gives you HyperLogLog.

~~~
rjeli
What kind of average? Not a math person, but median seems best?

