HyperLogLog – Cornerstone of a Big Data Infrastructure

armon · on May 16, 2014

My previous job was at an advertising firm, and we used HyperLogLogs for almost all of our real-time analytics infrastructure. They are incredibly space and time efficient. Each "counter" fits into about a single page of memory, and can count into the trillions with <2% error.

We developed an extremely high performance server around it (hlld): https://github.com/armon/hlld.

We were typically hitting it with tens of thousands of requests per second across about 50K counters, although it was benchmarked at >1MM ops a second.

Similarly, we also make bloomd, the equivalent server for bloom filters, which provide a more set-like abstraction: https://github.com/armon/bloomd

millisecond · on May 16, 2014

HLL also has two nice real-world optimizations possible depending on use-case.

We're storing 100,000+ unique counters, but only around 1% have more than 100 unique objects counted. Some of those 1% have millions of records, so HLL is very useful. As the HLL itself is a fixed size (~10kb for decent accuracy) regardless of the number of counted objects, in the small case you can replace the HLL with a plain set of counted values and produce an HLL when it grows beyond a bound. Because you're storing the raw values, the transition to HLL is seamless.
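A rough sketch of that set-then-HLL idea, in Python. Everything here is illustrative and not taken from any particular library: `TinyHLL`, `HybridCounter`, the threshold of 100, the SHA-1 hash, and the choice of 2**14 one-byte registers are all assumptions, and a production HLL would pack registers more tightly and apply bias correction.

```python
import hashlib
import math


class TinyHLL:
    """Minimal HyperLogLog with 2**p one-byte registers (illustrative only)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)

    def add(self, value):
        # Take 64 bits of a SHA-1 digest as the hash (an arbitrary choice).
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the first 1 bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


class HybridCounter:
    """Exact set while small; converts to an HLL past `threshold` items."""

    def __init__(self, threshold=100, p=14):
        self.threshold = threshold
        self.p = p
        self.exact = set()
        self.hll = None

    def add(self, value):
        if self.hll is not None:
            self.hll.add(value)
            return
        self.exact.add(value)
        if len(self.exact) > self.threshold:
            # Replay the raw values into a fresh HLL, then drop the set.
            self.hll = TinyHLL(self.p)
            for v in self.exact:
                self.hll.add(v)
            self.exact = None

    def count(self):
        if self.hll is None:
            return len(self.exact)
        return round(self.hll.count())
```

Below the threshold the count is exact; above it, the counter behaves like an ordinary HLL, since replaying the stored values gives the same registers as if they had been added to the HLL from the start.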

Once you've moved beyond raw storage of values, there's a harder but still space-saving technique. If you look at the raw bytes of a ~10kb HLL structure with "only" tens of thousands of counted values, around 90% of them will be zero. Below a certain bound it can save a lot of space to store a map of locations to non-zero byte values rather than a raw array of bytes.
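As a sketch of the sparse form, assuming the registers are held as a Python `bytearray` with one byte per register: the helper names and the 3-bytes-per-entry cost (a 2-byte index plus a 1-byte value when serialized, which only works for up to 65,536 registers) are hypothetical, and the real break-even point depends on how you encode the map.

```python
def to_sparse(registers):
    """Keep only non-zero registers as {index: value}."""
    return {i: r for i, r in enumerate(registers) if r}


def to_dense(sparse, m):
    """Rebuild the full m-byte register array from the sparse map."""
    dense = bytearray(m)
    for i, r in sparse.items():
        dense[i] = r
    return dense


def sparse_is_smaller(registers, bytes_per_entry=3):
    """True if the serialized sparse map would beat the raw byte array.

    bytes_per_entry is an assumed serialized cost per non-zero register.
    """
    nonzero = sum(1 for r in registers if r)
    return nonzero * bytes_per_entry < len(registers)
```

With that cost model the sparse form wins while fewer than a third of the registers are non-zero; past that point you would switch back to the raw array with `to_dense`.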

cbsmith · on May 16, 2014

One thing people forget about in all the excitement over HLLs is the effectiveness of compressed bitsets, which aren't lossy and so yield precise answers. They exploit the same "90% of them will be zero" phenomenon for space and execution efficiency, but are much more flexible... in exchange for consuming more memory and being slower than HLLs.

iaw · on May 16, 2014

I must be missing something...

hyperion2010 · on May 16, 2014

Fairly certain this would be one of those generated papers.