

HyperLogLog – Cornerstone of a Big Data Infrastructure - martyhu
http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

======
armon
My previous job was at an advertising firm, and we used HyperLogLogs for
almost all of our real-time analytics infrastructure. They are incredibly
space and time efficient. Each "counter" fits into about a single page of
memory, and can count into the trillions with <2% error.

We developed an extremely high performance server around it (hlld):
[https://github.com/armon/hlld](https://github.com/armon/hlld).

We were typically hitting with with tens of thousands of requests per second
across about 50K counters. Although it was benchmarked to >1MM ops a second.

Similarly, we also make bloomd, which is an equivalent for using bloom
filters, which provide a more set-like abstraction:
[https://github.com/armon/bloomd](https://github.com/armon/bloomd)

------
millisecond
HLL also has two nice real-world optimizations possible depending on use-case.

We're storing 100,000+ unique counters, but only around 1% have more than 100
unique objects counted. Some of those 1% have millions of records so HLL is
very useful. As the HLL itself is a fixed size (~10kb for decent accuracy)
regardless of #counted objects, in the small case you can replace the HLL with
a pure set of counted values and produce a HLL when it grows beyond a bound.
Because you're storing the raw values, the transition to HLL is seamless.

Once you've moved beyond raw storage of values there's a harder but still
space-saving technique. If you look at the raw bytes of a ~10kb HLL structure
with "only" 10's of thousands of counted values around 90% of them will be
zero. Below a certain bound it can save a lot of space to have a map of
locations and non-zero byte values rather than a raw array of bytes.

~~~
cbsmith
One thing people forget about in all the excitement over HLL's is how
effectiveness of compressed bitsets, which aren't lossy and so yield precise
answers. They exploit the same "90% of them will be zero" phenomenon for space
and execution efficiency, but are much more flexible... in exchange for
consuming more memory and being slower than HLL's.

------
iaw
I must be missing something...

~~~
hyperion2010
Fairly certain this would be one of those generated papers.

