
How a HyperLogLog works - softwaredoug
http://opensourceconnections.com/blog/2015/02/04/its-log-its-log-its-big-its-hyper-its-good/
======
dblock
A simpler and more concise explanation is in [https://github.com/aaw/hyperloglog-redis](https://github.com/aaw/hyperloglog-redis).

"The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others)
is to apply a good hash function to each value observed in the stream and
record the longest run of zeros seen as a prefix of any hashed value. If the
hash function is good, the bits in any hashed value should be close to
statistically independent, so seeing a value that starts with exactly X zeros
should happen with probability close to 2^-(X + 1). So, if you've seen a
run of 5 zeros in one of your hash values, you're likely to have around 2^6
= 64 values in the underlying set. The actual implementation and analysis are
much more advanced than this, but that's the idea."
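The quoted idea can be sketched as a deliberately simplified single-register estimator. This is a toy, not real HyperLogLog: `leading_zeros` and `estimate_cardinality` are made-up names for illustration, and a real HLL averages many registers to tame the enormous variance of a single one.

```python
import hashlib

def leading_zeros(h: int, bits: int = 32) -> int:
    """Count leading zeros in a bits-wide hash value."""
    if h == 0:
        return bits
    n = 0
    for i in range(bits - 1, -1, -1):
        if (h >> i) & 1:
            break
        n += 1
    return n

def estimate_cardinality(stream) -> int:
    """Single-register sketch of the idea: remember only the longest
    run of leading zeros seen over all hashed values."""
    max_zeros = 0
    for value in stream:
        # 32-bit hash of the value (md5 truncated, just for illustration)
        h = int(hashlib.md5(str(value).encode()).hexdigest(), 16) & 0xFFFFFFFF
        max_zeros = max(max_zeros, leading_zeros(h))
    # A longest run of X zeros suggests roughly 2**(X + 1) distinct values.
    return 2 ** (max_zeros + 1)
```

Note that duplicates cannot change the answer: the same value always hashes to the same run of zeros, which is exactly why this counts *distinct* values in constant memory.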

~~~
rudolf0
Interesting to see Bitcoin use almost the same principle. Essentially it's "if
there is a hash with enough leading zeroes, we can assume a lot of CPU time
has been put into the network."

Of course that's just an artifact of what's really happening, which is pseudo-
random values + laws of probability.

------
cwyers
I don't think this does a very good job of explaining the HyperLogLog,
honestly. A slightly less contrived example would've been useful.

~~~
amitparikh
I agree. Additionally, the words "distinct" or "cardinality" don't appear
anywhere in this article, which is a major omission when discussing
HyperLogLog. The primary use of the algorithm is to provide a cardinality
estimate.

~~~
darkmighty
Agreed, cardinality is the key word, and it's part of why his example is
inappropriate: if you were just counting the total number of people coming
in, a simple running sum (maybe stored on a log scale) would be just as
cheap. I believe the usefulness of HLL is in estimating the cardinality of
sets -- you "sample" the set through its hashes and get a quick estimate of
the number of distinct objects.

~~~
mcherm
I disagree. I think that the article does an EXCELLENT job of describing it.
They avoided the term "cardinality" on purpose, since the target audience was
people who might well be scared off by fancy words they weren't familiar with.

And the problem posed was not to count the number of visitors, but to count
the number of different people who visited... the fancy word for that is the
"cardinality of unique visitors" but "number of different people" is just as
accurate.

Counting the number of people could NOT be done with the same amount of space.
This example required two simple counters on the blackboard... call it 30 bits
of memory, assuming you didn't expect either counter to get more than 8.
That's just about exactly enough space to store ONE social security number...
and nowhere near enough to count the unique people.

------
sergeio
Nick Johnson did a Damn Cool Algorithms post [0] on this that inspired me to
play around with HLL myself and do a little writeup [1].

[0]: [http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation](http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation)

[1]:
[https://github.com/sergeio/hyperloglog](https://github.com/sergeio/hyperloglog)

------
druzac
Here's a less contrived application. This paper uses HLLs to compute miss
rate curves for LRU caches from traces of block requests:
[https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-wires.pdf](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-wires.pdf)

The short, simplified story: a list of HLLs (a 'counterstack' in the paper,
which is an unfortunate bit of historically motivated terminology) is used to
count the number of distinct blocks requested. Each HLL is fed every block
request, and a new HLL is created after a fixed number of requests, so the
counts your HLLs report are estimates of the number of unique blocks accessed
in tails of the trace. The oldest HLL has seen every block in the trace,
while the most recently created has seen only a tail of the trace of length
<= your creation-interval parameter.
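The structure described above can be sketched like this. It's a toy version: exact Python sets stand in for the HLL sketches so the mechanics are easy to follow, whereas the paper uses HLLs precisely so each counter stays tiny.

```python
def counterstack(trace, interval):
    """Toy counterstack: exact sets stand in for HLL sketches.
    Every counter is fed every request; a new counter starts
    every `interval` requests."""
    counters = []                   # oldest counter first
    for i, block in enumerate(trace):
        if i % interval == 0:
            counters.append(set())  # start a fresh counter periodically
        for c in counters:
            c.add(block)            # every counter sees every request
    return counters

trace = [1, 2, 1, 3, 2, 4, 1, 5]
print([len(c) for c in counterstack(trace, interval=4)])
# -> [5, 4]: the oldest counter saw the whole trace (5 distinct blocks),
#    the newest saw only the last 4 requests (4 distinct blocks)
```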

It turns out that you can use this structure to compute stack distances, which
are the number of unique blocks accessed between two requests to the same
block. The details are a little hairy, so see the paper for the full story.

------
ipsin
Another good explanation, this one with a JavaScript simulation of HLL:
[http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/](http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/)

------
elihu
I thought it was a clear explanation. It was easy to understand, and I now
have a good intuition of how HLL works and why I might want to use it.

------
softwaredoug
Seriously? Someone renamed the title? Nobody likes Ren & Stimpy these days?

~~~
0xdeadbeefbabe
Surely that someone favored homework over R&S. But "a HyperLogLog"? Why does
it need an "a"? I would have expected "How HyperLogLog works".

~~~
eridius
Would you expect a title "How Hash Table works"? No, you'd expect "How a Hash
Table works". Same thing here.

