
Redis new data structure: the HyperLogLog - antirez
http://antirez.com/news/75
======
mitchellh
This is nice. At my last job, we developed an extremely high performance
HyperLogLog server (hlld). Take a look if you're interested:

[https://github.com/armon/hlld](https://github.com/armon/hlld)

It processes over 1MM ops per second on a basic MacBook Air. In production
we saw it handle more. We never had to scale it over one server in order to
handle millions of distinct metrics.

As an ad company, my last company had a ridiculous fanout of metrics for every
request. Every ad serve API request (of which we got thousands per second)
would fan out to over 150 metrics EACH. So for 1000 requests per second we'd
have to modify 150,000 metrics. Raw SQL updates of course fall down with this
(even if we tone back their durability). Instead, we opted to accumulate
for 10 to 60 seconds in bloomd, hlld, and statsite, then aggregate and do a
single bulk SQL update. It worked incredibly well.
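
The accumulate-then-flush pattern can be sketched in a few lines of Python
(a toy; `bulk_update`, the class name, and the metric names are made up for
illustration, standing in for whatever issues the single bulk SQL statement):

```python
from collections import Counter

class MetricAccumulator:
    """Accumulate per-metric increments in memory, then emit one bulk
    update per flush instead of one SQL write per increment."""

    def __init__(self, bulk_update):
        self._bulk_update = bulk_update  # e.g. issues one multi-row UPDATE
        self._pending = Counter()

    def incr(self, metric, n=1):
        self._pending[metric] += n       # cheap in-memory op, no I/O

    def flush(self):                     # call every 10-60s from a timer
        batch, self._pending = dict(self._pending), Counter()
        if batch:
            self._bulk_update(batch)

batches = []
acc = MetricAccumulator(batches.append)
for _ in range(1000):                    # 1000 requests...
    acc.incr("ads.served")               # ...each fanning out to metrics
    acc.incr("impressions.us", 2)
acc.flush()                              # one bulk write instead of 3000
```

The trade-off is exactly the one described above: a crash between flushes
loses at most one accumulation window of data.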

We used it in combination with bloomd:
[https://github.com/armon/bloomd](https://github.com/armon/bloomd) Bloomd is
another super high performance data structure server, but this time for bloom
filters.

You can read the READMEs to learn why they're better for their specific tasks.
Both have been running in production for 2+ years. If you're not going to be
using the data structures that much then Redis is a good choice. We personally
used these servers alongside Redis for k/v.

~~~
arielweisberg
There are several databases that will do 150k replicated updates a second on
a 2-socket server. Heck, 150k multi-statement transactions have been
realistic for years. 1 million on a MacBook Air? No, not yet. There's not
enough caching, JITing and precompilation to optimize away the overhead of
SQL in anything that currently exists.

This is not to talk down using the right tool for the job, but as soon as you
want to do something a little more complicated than count you are right back
to reinventing databases. That is basically what you are doing by implementing
periodic group commit on top of a database.

~~~
mitchellh
We would've preferred to not have to write these things. If you knew me you'd
know I'm not a "build things because we can" type of person. We spent a week
tuning PostgreSQL trying to get it fast enough, and couldn't (due to the
restrictions of ACID we were trying to break). We then hired an outside DB
consultancy for a handful of hours because they claimed they could do it,
and a few hours of consulting is cheaper than us writing a DB. They
ultimately came back and said "SQL isn't the right choice for this."

I can see how you could make your comment without knowing this context. But
really, I promise you SQL didn't work AT ALL. This is primarily due to the
type of data we were putting in there: we needed aggregates like totals,
averages, standard deviation, etc. To get this data you need to have it
inserted first, and we were querying it on the fly, in real time, at a
pretty high rate. This would drive the CPU of the SQL servers up hard WHILE
they were also serving hundreds of thousands of insert/update ops per
second. We could've solved this by adding read slaves and all that, but now
we're talking expensive CapEx for something we were confident we could fix
with some C in a week or two.

Enter: the two servers linked in my parent comment.

We sacrificed durability, accuracy, and some level of safety (if the server
crashes before an accumulation period once every X seconds, you lose that
data) for raw speed. We calculated out the acceptable tolerances (max % off
the real value) and tuned our data structures to that (bloomd and hlld both
expose these tunables). We then ran both servers in parallel to our existing
SQL solution for a few weeks to verify the numbers were correct. Finally, we
switched over.

Like I said, when we built that we were doing maybe 450,000 metrics per
second. That was 2 years ago. Now they're doing much much more than that, and
both hlld and bloomd are running on a single m3.medium with no persistence
(they aggregate then issue a single large UPDATE to SQL) with no issue.

The amortized cost in operating expenses (human labor) of developing the
servers, along with the thousands saved in server costs and the effectively
zero ops overhead (they never go down, they're stateless, etc.), has made it
well worth it.

Hope that makes our decision a little clearer.

~~~
wpietri
Very well put. I think people who have only built SQL-based solutions tend to
underestimate the labor and ops overhead of SQL-based solutions for problems
that don't fit the typical SQL use cases very well. It's nice to read of a
long-running example like this.

------
dhotson
HyperLogLog counters are one of my favourite things. They're incredibly
clever.

So also, you can combine this idea with the idea of bloom filters (another
probabilistic data structure) in a really interesting way. The goal is to have
a _lot_ of counters in a fixed amount of space.

The first part is to encode your HyperLogLog counter in unary, so each
counter generally only takes up a few bits.

Quick unary tutorial: in unary you count with a sequence of 1's (followed by 0
to signify the end of the number). 3 is encoded as 1110. 5 is 111110. 7 is
11111110. You get the idea.

Also, remember we're just estimating cardinality, not keeping an exact
count: we're estimating that we've seen roughly 2^n things, not exactly n
things. So counting to a billion only needs: 1 billion ~= 2^30 ~= 31 bits.

So you can store all these counters in a big array of bits (a bitmap) and
index into it using a hash function. As you probably guessed, all these unary
counters will end up overlapping each other in weird ways.

But! The cool thing is that you can insert and retrieve multiple counters
bloom-filter-style, using different hash functions, and then average the
results. The more of these probabilistic, messed-up counters you average,
the better the results get!

It's kind of opposite of the usual programmer "junk in, junk out" problem. In
this case, the more junk you combine, the better the results. Pretty neat huh?
:)
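
A minimal Python sketch of the unary part (offsets are hard-coded here; in
the full scheme each counter's offset would come from a hash function, and
several overlapping reads would be averaged as described above):

```python
def set_bit(bitmap, i):
    bitmap[i // 8] |= 1 << (i % 8)

def get_bit(bitmap, i):
    return (bitmap[i // 8] >> (i % 8)) & 1

def unary_write(bitmap, offset, value):
    """Record `value` as `value` consecutive 1 bits starting at `offset`.
    Re-writing a smaller value is a no-op, so the counter keeps its max,
    which is exactly what a HyperLogLog register needs."""
    for j in range(value):
        set_bit(bitmap, offset + j)

def unary_read(bitmap, offset):
    """Count consecutive 1 bits from `offset`; the trailing 0 terminates."""
    v = 0
    while get_bit(bitmap, offset + v):
        v += 1
    return v

bitmap = bytearray(16)        # 128 shared bits
unary_write(bitmap, 10, 3)    # a counter at offset 10: 1110...
unary_write(bitmap, 40, 5)    # another at offset 40: 111110...
unary_write(bitmap, 10, 2)    # a smaller observation changes nothing
```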

I don't have links to the papers behind these ideas, sorry. I learnt most of
this from a presentation by Alex Sharp at PyCon Au. Edit: found link to talk -
[http://pyvideo.org/video/1621/big-data-with-
python](http://pyvideo.org/video/1621/big-data-with-python)

This article explains some of this stuff a bit better:
[http://highscalability.com/blog/2012/4/5/big-data-
counting-h...](http://highscalability.com/blog/2012/4/5/big-data-counting-how-
to-count-a-billion-distinct-objects-us.html)

~~~
ronaldx
Averaging the results correctly must be an interesting problem in itself?

I believe that a standard average wouldn't necessarily give you the
maximum-likelihood estimate.

~~~
dhotson
To be honest, I'm not an expert at this stuff—merely enthusiastic about it. I
could have some details wrong.

My intuition says that a straight average should give you fairly good
results.. although antirez talks a bit about bias in HyperLogLog counts that
he's trying to adjust for.

So maybe an average + some bias correction gives you better results?

~~~
ronaldx
On further reflection: if you do N tests, the modal result would probably be
a good enough choice, or perhaps the median.

The issue is that (e.g.) the arithmetic mean would be very influenced by the
stats that gave randomly bad results - the very results you are trying to
exclude by doing the test multiple times and taking an average.

~~~
dhotson
So I think this is roughly what antirez is talking about with 'registers' for
a counter.

From the article:

The standard error of HyperLogLog is 1.04/sqrt(m), where “m” is the number of
registers used. Redis uses 16384 registers, so the standard error is 0.81%.

My understanding is that the error is very small because you're averaging over
16,000 registers.
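
The quoted formula is easy to check (the smaller register counts here are
hypothetical, included only for comparison with Redis's 16384):

```python
import math

def hll_std_error(m):
    """Theoretical relative standard error of HLL with m registers."""
    return 1.04 / math.sqrt(m)

for m in (64, 1024, 16384):
    print(f"m = {m:5d} registers -> {hll_std_error(m):.2%} standard error")
# m = 16384 (Redis's choice) gives 0.81%, matching the article
```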

Edit: also, I think you want the mean because it'll give you back a
precise/fractional power of 2 that is pretty close to the actual count. The
median would give you the nearest integer power of two, which is probably not
a good estimate of the real count.

Double edit: looks like there's some research around improving averaging by
throwing away outliers: [http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-
Cardinali...](http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-
Estimation)

------
octo_t
With the name, I started reading thinking it was an April Fools joke; then I
went down the rabbit hole of reading the post and every linked paper.

~~~
Filligree
Plot twist: it actually is an April Fools' joke.

~~~
ohwp
Actually your post fooled me.

------
malanj
The idea of inexact computing seems to be gaining a lot of traction over the
past few years.

The HyperLogLog data structure reminded me of the research at Rice University
on inexact processors: [http://www.ece.rice.edu/~al4/lingamneni-
cf12.pdf](http://www.ece.rice.edu/~al4/lingamneni-cf12.pdf)

There's a lot to gain by being less precise!

~~~
sitkack
The future of computation is probabilistically inexact numbers over an
interval.

[http://en.wikipedia.org/wiki/Interval_arithmetic](http://en.wikipedia.org/wiki/Interval_arithmetic)

We currently have a huge over-tolerancing problem. We don't need exact math
most of the time.

[http://web.media.mit.edu/~bates/Summary_files/BatesTalk.pdf](http://web.media.mit.edu/~bates/Summary_files/BatesTalk.pdf)

~~~
malanj
Hypothesis: inexact computations will be required to make strong AI work in
the medium term.

You can't simulate a brain accurately (and you probably don't need to).
Neurons aren't very accurate themselves, and they do the job fine.

~~~
sitkack
Top-of-the-line supercomputers consume ~10 megawatts of power; a human brain
uses about 20 watts. Different problems, yes, but the difference in
magnitude is staggering.

------
ronaldx
If I've understood correctly, this relates to the German Tank Problem: using
statistical analysis to estimate a count.

With German Tanks, you have limited data available to create an estimate from
(a few serial numbers, in the canonical case). But in this case, you
_deliberately_ limit the data stored in order to keep it at a manageable
level: hash to randomise and then store only summary stats.

Pretty cool. Actually very cool, I really like it.

However, are there any interesting use cases for this? It seems to me that
it's not particularly useful until your list is long enough to make
search/sort/storage difficult: that's rare even at web-scale, no?

~~~
antirez
Hello, the use case is that you can count what you usually don't store (for
example, different IPs connecting to your server every day), or you can
count what is usually stored but is too expensive to count otherwise (like
iterating over a database table and counting unique items in a given
column).

Also since it is so cheap, you start counting even things that could be
unreasonable to count otherwise but that may provide useful insight, for
example unique users accessing the site for every hour of the day, unique
words ever used by a given user in her comments, and so forth.
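
For instance (the key names here are made up; PFADD, PFCOUNT and PFMERGE
are the actual commands, and each HLL key stays small no matter how many
items you add):

```
PFADD ips:2014-04-01 1.2.3.4 5.6.7.8 1.2.3.4     # duplicate is ignored
PFCOUNT ips:2014-04-01                           # -> 2
PFADD ips:2014-04-02 5.6.7.8 9.9.9.9
PFMERGE ips:week ips:2014-04-01 ips:2014-04-02   # union of days
PFCOUNT ips:week                                 # -> 3 (approximate at scale)
```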

------
ris
On a related note, the recently released HyperLogLog extension for postgres:
[https://github.com/aggregateknowledge/postgresql-
hll](https://github.com/aggregateknowledge/postgresql-hll)

------
e98cuenc
We used this data structure a few months ago to track an approximation to the
cardinality of the followers of a set of twitter users, which is handy to
calculate the reach of a twitter post. More recently, we used it again to
track an approximation to the number of unique images that a user generated in
a time period (each image is identified by a different URL).

Very handy data structure, and glad to see it land in Redis!

------
bochi
This blog post is a nice survey about other interesting probabilistic data
structures:
[http://highlyscalable.wordpress.com/2012/05/01/probabilistic...](http://highlyscalable.wordpress.com/2012/05/01/probabilistic-
structures-web-analytics-data-mining/)

------
dexcs
The funny thing is that, April Fools joke or not, I'm not able to verify it
because it's so well written (technically) :/

~~~
antirez
No April fool at all :-) You can build the "unstable" branch and play with it.

~~~
MasterScrat
Thanks for the clarification.

It's even more confusing for french speakers as "Philippe Flajolet" really
sounds like a made-up name ("Flajolet" means "white beans").

~~~
sbarre
Yeah it would be like seeing a paper written by "Peter Potato" or something
like that...

------
eternalban
Nick's blog is a must-read for these sorts of algorithms. See his original
writeup on the HLL here: [http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-
Cardinali...](http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-
Estimation)

------
dscrd
This is awesome on so many different levels.

Check out the implementation:
[https://github.com/antirez/redis/blob/unstable/src/hyperlogl...](https://github.com/antirez/redis/blob/unstable/src/hyperloglog.c)

------
medriscoll
The Druid.io folks recently blogged about three additional optimizations to
HyperLogLog that compact the registers and can improve query speed:

[https://groups.google.com/forum/#!searchin/druid-
development...](https://groups.google.com/forum/#!searchin/druid-
development/hyperloglog/druid-development/q1_R5yD3_DA/Dg6IC-q4L6kJ)

------
emmanueloga_
I know that projects like [http://druid.io](http://druid.io) use HLL to do
timeseries analytics. Can someone explain or link to an explanation of how
counting set cardinality can be used for time series analysis? (Do I need to
add that I haven't found a satisfactory one myself? :)

Naively I think it could be done by factoring in a timestamp on the things to
count:

    
    
      PFADD "events:#{timestamp1}" "#{event1}" "#{event2}"
      PFADD "events:#{timestamp2}" "#{event3}" "#{event2}"
    
      PFCOUNT "events:#{timestamp1}"  # -> 2
      PFCOUNT "events:#{timestamp2}"  # -> 2
    

etc..

The problem with doing that is that you would need to iterate through each
second in a given range to find out the count of specific events... Also, it
seems like a pretty wasteful way of encoding time. <edit> Thinking about it,
it's probably not wasteful in the sense that the HLL size should be the
same, but it's probably sub-optimal in some other way.
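
A common alternative is to key one HLL per coarse time bucket (per hour,
say) and merge buckets to answer range queries. A sketch, with plain Python
sets standing in for the HLLs since set-union mirrors PFMERGE (the key and
member names are made up):

```python
from collections import defaultdict

# One "HLL" per hour bucket; with Redis this would be
# PFADD events:<hour> <member>, plus PFMERGE/PFCOUNT for ranges.
buckets = defaultdict(set)

def track(hour, member):
    buckets[hour].add(member)     # PFADD events:<hour> <member>

def uniques(hours):
    merged = set()                # PFMERGE tmp events:<h1> events:<h2> ...
    for h in hours:
        merged |= buckets[h]
    return len(merged)            # PFCOUNT tmp

track("13:00", "user-1")
track("13:00", "user-2")
track("14:00", "user-1")          # same user in a later hour
```

Coarser buckets mean fewer keys to merge per range query, at the cost of
coarser time resolution.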

------
nefasti
A very good post helping explain HyperLogLog :
[http://blog.aggregateknowledge.com/2012/10/25/sketch-of-
the-...](http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-
hyperloglog-cornerstone-of-a-big-data-infrastructure/)

------
PanMan
Does anybody have a good example of this? E.g. to count unique hits over time
without storing each hit?

~~~
nerfhammer
Here's a cool example of using HLL in javascript to dynamically show set
union/intersections for huge sets:

[http://www.aggregateknowledge.com/science/blog/venn.html](http://www.aggregateknowledge.com/science/blog/venn.html)

------
jacinda
The top answer about HyperLogLog on stackoverflow links to this awesome
visualization:
[http://www.aggregateknowledge.com/science/blog/hll.html](http://www.aggregateknowledge.com/science/blog/hll.html)

SO Answer:
[http://stackoverflow.com/a/13074072/307293](http://stackoverflow.com/a/13074072/307293)

Full blog post: [http://blog.aggregateknowledge.com/2012/10/25/sketch-of-
the-...](http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-
hyperloglog-cornerstone-of-a-big-data-infrastructure/)

------
pointernil
Very likable naming scheme ;) Kudos for the extraordinary work on Redis.
Cheers!

------
yread
The HLL principle seems quite similar to this way of approximating a biased
coin with a fair one by counting the 1s in binary

[http://amakelov.wordpress.com/2013/10/10/arbitrarily-
biasing...](http://amakelov.wordpress.com/2013/10/10/arbitrarily-biasing-a-
coin-in-2-expected-tosses/)

------
jimktrains2
Oh man, I started implementing this and native bloom filters a summer ago
and never finished. These are awesome data structures when you're dealing
with really large sets and only need a cardinality count or a set membership
test, not all the members of the set. I can't wait to start using them!

------
eurleif
Cool. I wonder if bloom filters might be up next?

~~~
lifthrasiir
There already exist SETBIT (for adding items), GETBIT (for querying items)
and BITOP (for the union operation), which together can be used to implement
bloom filters. HyperLogLog requires multi-bit registers (6 bits long in
Redis), which would be inefficient with the current design, so I guess
that's why Redis includes a specialized implementation for it.
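
A toy of that bitmap approach in Python, with comments mapping each step to
the Redis command it corresponds to (the sizes and seeded-SHA-256 hashing
scheme are arbitrary choices for the sketch):

```python
import hashlib

M, K = 1 << 16, 4    # filter size in bits, number of hash functions

def _positions(item):
    # Derive K bit positions from seeded SHA-256, standing in for K
    # independent hash functions.
    for seed in range(K):
        h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

def bloom_add(bitmap, item):
    for pos in _positions(item):
        bitmap[pos // 8] |= 1 << (pos % 8)           # SETBIT key pos 1

def bloom_query(bitmap, item):
    return all((bitmap[pos // 8] >> (pos % 8)) & 1   # GETBIT key pos
               for pos in _positions(item))

bf = bytearray(M // 8)
bloom_add(bf, "alice")
bloom_add(bf, "bob")
# BITOP OR on two such bitmaps would union the filters.
```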

~~~
eurleif
Couldn't HyperLogLog be implemented efficiently using Lua scripting?

~~~
dbro
Yes it's possible. I did this a few years ago as a learning project, and the
relevant lua scripts can be seen here:
[https://gist.github.com/dbro/9920666](https://gist.github.com/dbro/9920666).

... but I wouldn't claim that it's done efficiently. From the comments in the
code's tests:

    
    
        # some results based on running speed tests on a lightweight netbook:
        # standard error = 0.01, 10kb of registers
        # < 1 ms per init()
        # < 1 ms per update()
        # < 1 ms per count() for single set
        # ~30 ms per set for count() of union of sets

------
donretag
Elasticsearch started using HyperLogLog recently for aggregating data. No
April Fools there!

------
ithkuil
How does HyperLogLog compare to Count-Min-Sketch?

~~~
Lanzaa
One basic difference is that a Count-Min Sketch estimates the frequency
(multiplicity) of each item in a multiset, while a HyperLogLog estimates the
cardinality of a set, i.e. the number of distinct items.

For example, put 3 of item 'A' (where each 'A' is indistinguishable from
another 'A') and 5 of item 'B' into both a Count-Min Sketch and a
HyperLogLog.

When you query the Count-Min Sketch for 'A', you can probably get back 3. You
can also query for 'B' and get back 5.

When you query the HyperLogLog, you ask how many unique items are in the
set, in this case 2, and it would return a number near 2. You could ask the
HyperLogLog whether or not 'A' is in the set, but HyperLogLogs are not
designed to be accurate for that type of query. It would return true in this
case, and in general there are no false negatives for set-membership queries
on a HyperLogLog. I believe it would be less accurate than a Bloom filter
for set membership queries.
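
The 'A'/'B' example above, as a toy Count-Min Sketch (the dimensions are
arbitrary; a real sketch sizes width and depth from target error bounds):

```python
import hashlib

WIDTH, DEPTH = 64, 4
table = [[0] * WIDTH for _ in range(DEPTH)]

def _cols(item):
    # One column per row, derived from seeded SHA-256 as a stand-in for
    # DEPTH independent hash functions.
    for row in range(DEPTH):
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        yield row, int.from_bytes(h[:8], "big") % WIDTH

def cms_add(item, n=1):
    for row, col in _cols(item):
        table[row][col] += n

def cms_query(item):
    # Collisions can only inflate cells, so the row-wise minimum is the
    # tightest (never-underestimating) answer.
    return min(table[row][col] for row, col in _cols(item))

cms_add("A", 3)
cms_add("B", 5)
# cms_query("A") returns 3 unless 'A' collides with 'B' in every row
```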

~~~
nerfhammer
If you made the number of registers (m) extremely large and the register width
1 then a bloom filter and HLL are almost exactly the same (except bloom
filters have multiple hashes for the same input and in HLL you only split one
hash into two parts).

You can also use a bloom filter as a cardinality counter assuming it isn't too
filled up yet:
[http://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_...](http://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter)

HLLs are designed to be useful for cardinality counting past the point where a
bloom filter would be over capacity. HLL would be useful for set membership
but after some threshold the false positive rate will approach 100% at some
super-linear rate.
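
The approximation from that link is n ~= -(m/k) * ln(1 - X/m), for m bits,
k hash functions, and X set bits. A quick seeded simulation (with arbitrary
toy parameters) shows it recovering the true count from the fill level
alone:

```python
import math
import random

m, k, n = 16384, 4, 1000          # bits, hashes, true distinct items
rng = random.Random(1234)
bits = [0] * m

# Inserting n distinct items: each sets k pseudo-random bit positions,
# the same effect k independent hash functions would have.
for _ in range(n):
    for _ in range(k):
        bits[rng.randrange(m)] = 1

x = sum(bits)                                  # observed fill
estimate = -(m / k) * math.log(1 - x / m)
print(round(estimate))                         # close to the true n
```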

------
yoyo1999
I have realized how mediocre I am.

