
How To Count A Billion Objects Using Only 1.5KB Of Memory [Probabilistically] - gruseom
http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
======
yonran
The szl language (which was used at Google for log processing) has an output
type called unique(n) [1], which uses a simple algorithm: it stores the
smallest n sorted MD5 hashes, assumes that the hash is uniformly distributed
and there are no collisions, and estimates the total unique count as highest
possible hash * n / highest stored hash.

Edit: fixed the formula.

[1]:
[http://code.google.com/p/szl/source/browse/trunk/src/emitter...](http://code.google.com/p/szl/source/browse/trunk/src/emitters/szlunique.cc)
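
A minimal sketch of that k-smallest-hashes estimator (the parameter names and
use of Python's hashlib are illustrative, not szl's actual implementation):

```python
import hashlib
import heapq

def estimate_distinct(items, n=1024):
    """k-minimum-values estimator: keep the n smallest 128-bit MD5 hashes
    seen so far, then extrapolate from how densely they pack the hash space."""
    max_hash = (1 << 128) - 1
    heap = []        # max-heap (via negation) holding the n smallest hashes
    in_heap = set()  # hashes currently retained, to skip duplicates
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest(), "big")
        if h in in_heap:
            continue
        if len(heap) < n:
            heapq.heappush(heap, -h)
            in_heap.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            in_heap.discard(evicted)
            in_heap.add(h)
    if len(heap) < n:
        return len(heap)  # fewer than n distinct hashes: the count is exact
    # total distinct ~= (highest possible hash) * n / (n-th smallest hash)
    return max_hash * n // -heap[0]
```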

------
jewel
> These 16 characters represent 128 bytes. 8192 IDs would require 1 megabyte
> of space. We receive over 3 billion events per day... Even if we assume that
> only 1 in 3 records are unique the hash set would still take 119 gigs of
> RAM, not including the overhead Java requires to store objects in memory.

The math seems wrong here. You can actually fit 16 characters in 16 bytes! In
this case they can do it in 8 since it looks like it's a hex string.

That would mean that the hash table would fit in 8 GB of RAM.

In any case, it's a neat trick. Here's an algorithm for counting arbitrarily
sized input, assuming the data is too big to fit in memory:

1) Walk through the data, writing out the IDs to a different file based on the
first n digits. Split it into enough buckets that each individual bucket is
small enough to fit comfortably in RAM.

2) Count the number of unique items in each bucket using a hash table.

This requires three times the disk I/O as the original but gives exact
answers.
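
A sketch of the two-pass scheme above, bucketing by a hash of the ID rather
than its leading digits (which behaves the same for counting purposes, and
also works when IDs aren't uniformly distributed); file handling is kept
deliberately simple:

```python
import os
import tempfile

def exact_distinct_count(ids, n_buckets=64):
    """Exact distinct count for data too large for RAM:
    pass 1 scatters each ID into one of n_buckets files;
    pass 2 counts uniques per bucket with an in-memory set."""
    tmp = tempfile.mkdtemp()
    paths = [os.path.join(tmp, "bucket_%d" % i) for i in range(n_buckets)]
    files = [open(p, "w") for p in paths]
    for id_ in ids:
        # equal IDs always land in the same bucket, so the
        # per-bucket unique counts can simply be summed
        files[hash(id_) % n_buckets].write(id_ + "\n")
    for f in files:
        f.close()
    total = 0
    for p in paths:
        with open(p) as f:
            total += len({line.rstrip("\n") for line in f})
        os.remove(p)
    return total
```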

~~~
jandrewrogers
I believe they meant 128 bits (16 characters * 8 bits).

~~~
iclelland
You'd think so, but then every other calculation in that paragraph is based on
bytes:

1MB for 8192 identifiers (as opposed to 128K as strings, or 64K as ints)

384GB for 3 billion identifiers (rather than 38GB strings, or 19GB ints)

119GiB if 1 in 3 is unique (rather than 12GiB strings, or 6GiB ints)

6GiB is not a lot of RAM anymore.

------
hcarvalhoalves
You know what's amazing? As soon as you hit bigger or more general problems
like this, you always face the compromise of "trading X resource for
accuracy". Which leads me to believe that software, so far, has only been
deterministic by pure accident.

~~~
lmkg
Experience has led me to conclude that modern software is both non-
deterministic, and prone to accidents.

Seriously, though... garbage collection, multiple processors, race conditions,
networks, OS scheduler preemptions, cache evictions, branch mis-predictions,
disk failures. Even without getting into drivers, the reason that software
appears deterministic, is because we try our damnedest to maintain that
illusion.

------
haberman
Title should be "how to _probabilistically_ count distinct objects". The given
title makes it sound like the author is doing the impossible.

~~~
pornel
It _is_ possible to get exact count with no extra memory, but it's going to be
much slower: sort the set in-place and then do equivalent of `uniq | wc -l`.
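
A sketch of that sort-then-scan idea (in Python rather than shell, so the
constant extra space is explicit; the sort itself is in place):

```python
def exact_distinct_inplace(items):
    """Sort the list in place, then count boundaries between runs of
    equal values -- the in-memory analogue of `sort | uniq | wc -l`."""
    items.sort()
    return sum(1 for i in range(len(items))
               if i == 0 or items[i] != items[i - 1])
```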

~~~
chrisaycock
You can shorten that further with just:

    uniq -c

------
pooriaazimi
I submitted a kinda related link (counting 128M distinct objects) a few weeks
ago, which might be of interest:

    Realtime Metrics for 128 Million Users with Redis: 50ms + 16MB RAM

[http://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps](http://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps)

Discussion: <http://news.ycombinator.com/item?id=3748239> &
<https://news.ycombinator.com/item?id=3292542>

~~~
Groxx
The technique is very different, though - the getspool article makes a 128
million bit bitmask and flips a bit for each user.

This article refers to probabilistic techniques for counting number-of-unique
objects, and uses _massively_ less memory to do so. The bitmask technique
gives 128 million in 16MB RAM and no error, where the HyperLogLog technique
gives 1,000 million in 0.0015MB RAM and 2% margin of error.
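
The bitmask technique is simple enough to sketch in a few lines (assuming user
IDs are dense integers below a known maximum, as in the getspool setup):

```python
def bitmap_distinct_count(user_ids, max_users):
    """Exact distinct count using one bit per possible user ID:
    flip the bit for every event, then count the set bits."""
    bitmap = bytearray((max_users + 7) // 8)  # 128M users -> 16MB
    for uid in user_ids:
        bitmap[uid >> 3] |= 1 << (uid & 7)
    return sum(bin(byte).count("1") for byte in bitmap)
```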

------
bluetshirt
Well, if you're estimating, you're not really counting, are you?

~~~
ww520
True. Misleading question. Asking the right question is half of the battle.

------
grandalf
I wrote a python library that makes it fairly easy to use a bitmap for this a
couple of years ago. Seemed to work pretty well.

<https://github.com/mmmurf/python_bitfield>

This was used as a proof of concept and surely contains a few embarrassing
bits of code, but oh well.

------
abramsma51
First off, I am very embarrassed to have made such an obvious mistake in the
opening paragraph. It has been corrected; please forgive me, gods of bits and
bytes...

I also suck at titles. So those of you who pointed out that it should be 'How
to estimate the cardinality of...' are correct, no argument from me there.

Re bloom filters: you can use a Bloom filter to estimate the number of
occurrences of a given key, but you can't use one to get the total cardinality
of the set.

One commenter suggested using Cassandra. We do use Cassandra for counting
(accurately) the number of times a URL has been shared, but we do not use it
for our primary analytics engine. Also, I'm not sure exactly what you are
proposing here. Store all IDs in C* and then count the number of keys? What if
I want to bucket the data by day, hour, week, month, etc.? Would I have to
store the key n times, once for each bucket? Coincidentally, Jonathan Ellis,
of Cassandra fame, was the inspiration for our bloom filter implementation.
Details are referenced in our GitHub project.

~~~
boxein
The bloom filter could be used in a similar way: with X hash functions and Y
total bits, a Bloom filter with Z bits set implies roughly N unique items were
hashed. Probably not as good as HyperLogLog in terms of memory footprint vs. %
error, but there you go.
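
For reference, the usual fill-ratio estimate (this comes from the standard
Bloom filter false-positive analysis, not from the article): with m total
bits, k hash functions, and X bits set, n ~= -(m/k) * ln(1 - X/m). A sketch:

```python
import math

def bloom_cardinality(bits_set, total_bits, num_hashes):
    """Estimate how many distinct items were inserted into a Bloom
    filter from its fill ratio: n ~= -(m/k) * ln(1 - X/m)."""
    fill = bits_set / total_bits
    return -(total_bits / num_hashes) * math.log(1 - fill)
```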

------
web_chops
Bitmaps always fascinate me. They are also the solution for one of the most
common interview questions:

How do you pick a single poisoned barrel out of 1000 with 10 prisoners as your
guinea pigs :)
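
The standard answer, sketched for reference (each prisoner corresponds to one
bit of the barrel's index; this assumes the 30-day version of the puzzle,
where all deaths are observed at once):

```python
def find_poisoned_barrel(barrel_is_poisoned, num_prisoners=10):
    """Label barrels 0..N-1 in binary; prisoner i sips from every barrel
    whose label has bit i set. After 30 days, the set of dead prisoners
    spells out the poisoned barrel's index."""
    dead = [False] * num_prisoners
    for barrel, poisoned in enumerate(barrel_is_poisoned):
        if poisoned:
            for i in range(num_prisoners):
                if barrel >> i & 1:
                    dead[i] = True
    return sum(1 << i for i, d in enumerate(dead) if d)
```

10 prisoners give 2^10 = 1024 distinct labels, enough to cover 1000 barrels.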

~~~
pooriaazimi
I thought about it for 5 minutes and couldn't find an answer (I found several,
but they were all half-answers). I looked at the answer below, and it turned
out that you'd missed an important part: The prisoners don't die immediately
after drinking from a poisoned barrel... They die after 30 days (in other
words, they can drink from several barrels, and if any of those barrels were
poisoned, they'd die).

But thanks anyway.

~~~
sbov
You also need to specify that you need to know by 30 days. Without both
restrictions a simple binary search works.

------
hnwh
Can we start by using the correct terminology? This isn't a count, it's an
estimate.

~~~
icefox
True, but on the other hand, how many times have you been asked to do a task,
only to find upon further investigation that the answer they needed could be
provided more easily with a better question? So maybe they asked for a count,
but they really just wanted to know whether the count was more like 1K or 100K.

~~~
smackfu
OTOH, if you can't bound your estimate, most people don't want it.

------
akadien
We use Bloom filters to do something similar.

~~~
pork
You should read the original Flajolet Martin paper if all you're doing is
distinct count estimation.

~~~
eternalban
Do you have a citation? A PDF link would be great and appreciated.

~~~
minimax
Looks like Flajolet was responsible for both the linear probabilistic method
and the hyper log log method.

Probabilistic method (1985):
<http://algo.inria.fr/flajolet/Publications/FlMa85.pdf>

Hyper log log (2007):
<http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf>

------
EvanMiller
Well, if all you care about is the space complexity you could do the same
thing with 8 bytes of memory, an integer and a floating point. Start the
floating point at zero. For each item, set the integer to 0, then compare the
item to every other item (including itself) and add one to the integer if they
are equal. After each iteration, add one divided by the integer to the
floating point, and repeat for the next item. When you're done the answer will
be stored in the floating point.

Of course this method requires O(N^2) time (that is, 1,000,000,000,000,000,000
compare operations when counting a billion objects), but who cares? It only
uses 8 bytes!
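
For concreteness, a sketch of that 8-byte scheme (the integer and the float
are the only state that persists across items):

```python
def distinct_count_two_vars(items):
    """Distinct count with one integer and one float: each distinct value
    appearing c times contributes c * (1/c) = 1 to the running total."""
    total = 0.0  # the floating point
    for x in items:
        count = 0  # the integer, reset for each item
        for y in items:  # compare against every item, including x itself
            if y == x:
                count += 1
        total += 1.0 / count
    return total
```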

~~~
subleq
If you're allowed to make multiple passes over your set, you might as well
apply an in-place comparison sort and get it down to O(n log n) time with
constant space. Their method has the advantage that the stream doesn't need to
be stored -- events can be processed as they come in and then discarded, never
storing the entire set.

------
ilaksh
This seems like an ideal solution to that particular counting problem.

But.. considering the fact that they have many many servers and may need to do
other types of analysis on their large data set, would it make sense to
consider solving this problem with a distributed system like Cassandra?

[http://stackoverflow.com/questions/3558844/how-to-increment-a-counter-in-cassandra](http://stackoverflow.com/questions/3558844/how-to-increment-a-counter-in-cassandra)

------
jasonkolb
I really love posts like this with an open-source project attached (w/ Apache
license no less!)

It's such a niche thing but if you've ever tried to do this you know what a
b*tch it is to do properly. I have never seen another open source solution for
this problem.

Thank you for the code, I will peruse and consider using it over a beer
tonight!

------
fleitz
Isn't this just called a bloom filter?

------
ww520
Let me guess, radix-tree variant with common prefixes of keys at the branches
of the tree and counters at the leaves.

~~~
pork
Nope, totally off. It's closer to probabilistic hashing where the
probabilities of picking successive buckets are exponentially decreasing.

~~~
ww520
That's not counting. That's estimating. I based my guess from the title of the
post.

~~~
jasonwatkinspdx
Which is why you should &A^*($@# read before commenting.

~~~
derleth
Do you need to draw attention to your lack of a specific swear by creating a
visually-distinct meta-swear?

------
mvanveen
I have a gripe with the opener.

> At Clearspring we like to count things. Counting the number of distinct
> elements (the cardinality) of a set is challenge when the cardinality of the
> set is large.

On a Turing machine, you're always operating on sets with the same cardinality
as the integers.

~~~
jasonwatkinspdx
This is the same as saying Turing machines cannot operate on finite sets,
which is clearly wrong.

~~~
mvanveen
Since when is the set of integers a finite set? Without understanding your
argument more fully, I don't see how this follows at all.

~~~
dpark
Since never. That's the opposite of what he was saying.

You: Turing machines only operate on sets with the same cardinality as the set
of integers.

Jason: Integers are _not_ a finite set.

Jason: Therefore Turing machines cannot operate on finite sets.

