Hacker News new | comments | show | ask | jobs | submit login

The basic idea is that for each element X in your data, you compute a function f(X) which is equal to the number of leading zeros in a bitstring generated by hashing X. Then you take the maximum of this value across all elements in the dataset (which can be efficiently done on distributed or streaming datasets).

For any given value of X, the probability that its hash contains N leading zeros is proportional to 1/2^N. We're taking the maximum, so if any one of the values has N leading zeros, then max(f(X)) >= N.

If the number of distinct values observed is significantly greater than 2^N, it's very likely that at least one of them has N leading zeros. Conversely, if the number of distinct values is much less than 2^N, it's relatively unlikely. So the expected value of max(f(X)) is proportional to log(distinct values), with a constant factor that can be derived.

This is the basic "probabilistic counting" algorithm of Flajolet and Martin (1985). Its major flaw is that it produces an estimate with very high variance; it only takes a single unusual bitstring to throw the result way off. You can get a much more reliable result by partitioning the input into a bunch of subsets (by slicing off a few bits of the hash function) and keeping a separate counter for each subset, then averaging all the results at the end. This improvement, plus a few other optimizations, gives you HyperLogLog.

What kind of average? Not a math person, but median seems best?

Thank you for your good explanation!

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact