How To Count A Billion Objects Using Only 1.5KB Of Memory [Probabilistically] 213 points by gruseom 2114 days ago | hide | past | web | favorite | 59 comments

 The szl language (which was used at Google for log processing) has a output type called unique(n) [1], which uses a simple algorithm: it stores the smallest n sorted MD5 hashes, assumes that the hash is uniformly distributed and there are no collisions, and estimates that total unique count as highest possible hash * n / highest stored hash.Edit: fixed the formula.
 > These 16 characters represent 128 bytes. 8192 IDs would require 1 megabyte of space. We receive over 3 billion events per day... Even if we assume that only 1 in 3 records are unique the hash set would still take 119 gigs of RAM, not including the overhead Java requires to store objects in memory.The math seems wrong here. You can actually fit 16 characters in 16 bytes! In this case they can do it in 8 since it looks like it's a hex string.That would mean that the hash table would fit in 8 GB of RAM.In any case, it's a neat trick. Here's an algorithm for counting arbitrarily sized input, assuming the data is too big to fit in memory:1) Walk through the data, writing out the IDs to a different file based on the first n digits. Split it into enough buckets that each individual bucket is small enough to fit comfortably in RAM.2) Count the number of unique items in each bucket using a hash table.This requires three times the disk I/O as the original but gives exact answers.
 I believe they meant 128 bits (16 characters * 8 bits).
 You'd think so, but then every other calculation in that paragraph is based on bytes:1MB for 8192 identifiers (as opposed to 128K as strings, or 64K as ints) 384GB for 3 billion identifiers (rather than 38GB strings, or 19GB ints) 119GiB if 1 in 3 is unique (rather than 12GiB strings, or 6GiB ints)6GiB is not a lot of RAM anymore.
 > 1) Walk through the data, writing out the IDs to a different file based on the first n digits. Split it into enough buckets that each individual bucket is small enough to fit comfortably in RAM.That is the same as saying "count all id's starting with 'aa', add all id's starting with 'ab', add all id's starting with 'ac'...", isn't it?
 No, you don't want a count, you actually want a list of those IDs so that you can later reopen each file and count how many distinct entries there are in it, aka. divide and conquer.
 You know what is amazing? Is that as soon you hit bigger or more general problems like this, you always face the compromise of "trading X resource for accuracy". Which leads me to believe that software, so far, has only been deterministic by pure accident.
 Experience has led me to conclude that modern software is both non-deterministic, and prone to accidents.Seriously, though... garbage collection, multiple processors, race conditions, networks, OS scheduler preemptions, cache evictions, branch mis-predictions, disk failures. Even without getting into drivers, the reason that software appears deterministic, is because we try our damnedest to maintain that illusion.
 Title should be "how to probablistically count distinct objects" The given title makes it sound like the author is doing the impossible.
 It is possible to get exact count with no extra memory, but it's going to be much slower: sort the set in-place and then do equivalent of `uniq | wc -l`.
 You can shorten that further with just:`` uniq -c``
 I submitted a kinda related link (counting 128M distinct objects) a few weeks ago, which might be of interest:`````` Realtime Metrics for 128 Million Users with Redis: 50ms + 16MB RAM `````` http://blog.getspool.com/2011/11/29/fast-easy-realtime-metri...
 The technique is very different, though - the getspool article makes a 128 million bit bitmask and flips a bit for each user.This article refers to probabilistic techniques for counting number-of-unique objects, and uses massively less memory to do so. The bitmask technique gives 128 million in 16MB RAM and no error, where the HyperLogLog technique gives 1,000 million in 0.0015MB RAM and 2% margin of error.
 Well, if you're estimating, you're not really counting, are you?
 True. Misleading question. Asking the right question is half of the battle.
 Let me guess...combinatorics background? :)
 I wrote a python library that makes it fairly easy to use a bitmap for this a couple of years ago. Seemed to work pretty well.https://github.com/mmmurf/python_bitfieldThis was used as a proof of concept and surely contains a few embarrassing bits of code, but oh well.
 First off. I am very embarrassed to have made such an obvious mistake in the opening paragraph. It has been corrected, please forgive gods of bits and bytes...I also suck at titles. So those of you who pointed out that it should be 'How to estimate the cardinality of...' are correct, no argument from me there.Re bloom filters. You can use bloom's to estimate the number of occurrences of a given key but you can't use a bloom to get the total cardinality of the set.One commenter suggested using Cassandra. We do use Cassandra for counting (accurately) the number of times a URL has been shared but we do not use it for our primary analytics engine. Also, I'm not sure exactly what you are proposing here though. Store all IDs in C* and then count the number of keys? What if I want to bucket the data by day, hour, week, month, etc? Would I have to store the key n times for each bucket? As a coincidence Jonathan Ellis, of Cassandra fame, was the inspiration for our bloom filter implementation. Details are refenced in our github project.
 The bloom filter could be used in a similar way, you have X hash functions used and Y total bits, so for a bloom filter with Z bits set you anticipate N unique items were hashed. Probably not as good as the HyperLogLog in terms of memory space / % error, but there you go.
 Bitmaps always fascinate me. They are also the solution for one of the most common interview questions:How do you pick a single poisoned barrel out of 1000 with 10 prisoners as your guinea pigs :)
 I thought about it for 5 minutes and couldn't find an answer (I found several, but they were all half-answers). I looked at the answer below, and it turned out that you'd missed an important part: The prisoners don't die immediately after drinking from a poisoned barrel... They die after 30 days (in other words, they can drink from several barrels, and if any of those barrels were poisoned, they'd die).But thanks anyway.
 You also need to specify that you need to know by 30 days. Without both restrictions a simple binary search works.
 Or your allow a cocktail to be made from the contents of several barrels.
 Ya, that's correct :) It doesn't make much sense otherwise I guess...
 I love problems like that. I'm going to go work that out for awhile, you have any links to the answer, or more questions like it?
 I'm not the OP and I don't have a solution at hand, but I'd start by thinking about error correction and parity, then realize that 10 bits is enough to represent 1000 items completely.
 SPOILER: Couldn't you just make one prisoner drink from half the barrels, and then if he dies have another prisoner drink from half of those barrels, and if not from half of the other barrels, and so forth?
 That would work for the original question, but with the delayed death and quick response conditions imposed by pooriaazimi[0], it would not work. Tying the solution back to binary, you realize that each barrel has a unique 10-bit identifier (2^10=1024, so numbering the barrels in binary gives each one a unique ID). You assign each prisoner to a bit, then have them all drink from the barrels where their bit is a 1. 30 days later, you'll have a pattern of prisoner deaths exactly matching a single barrel.Note: I have not verified this solution, and it does not match the solution that was posted elsewhere in this thread.
 can we start by using the correct terminology? This isn't a count, its an estimation
 True, but on the other hand how many times have you been asked to do a task, but upon further investigation the answer they needed could be provided easier with a better question? So maybe they asked for a count, but they really just wanted to know if the count was more like 1K or 100K.
 OTOH, if you can't bound your estimate, most people don't want it.
 The actual article uses precise terminology. I see no reason to pick at the headline.
 It's a probabilistic count. Still proper terminology.
 We use Bloom filters to do something similar.
 You should read the original Flajolet Martin paper if all you're doing is distinct count estimation.
 More than that, it's worth perusing all of Flajolet's papers, they're a wonderful intersection of analysis combinatorics and algorithms, and the recent book he had Coauthored (before his unfortunate passing last year) http://algo.inria.fr/flajolet/Publications/book.pdf is an amazing book.
 Do you have a citation? A PDF link would be great and appreciated.
 Looks like Flajolet was responsible for both the linear probabilistic method and the hyper log log method.Probabilistic method (1985): http://algo.inria.fr/flajolet/Publications/FlMa85.pdfHyper log log (2007): http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
 Thanks. I've seen it before but will give a re-read.
 Well, if all you care about is the space complexity you could do the same thing with 8 bytes of memory, an integer and a floating point. Start the floating point at zero. For each item, set the integer to 0, then compare the item to every other item (including itself) and add one to the integer if they are equal. After each iteration, add one divided by the integer to the floating point, and repeat for the next item. When you're done the answer will be stored in the floating point.Of course this method requires O(N^2) time (that is, 1,000,000,000,000,000,000 compare operations when counting a billion objects), but who cares? It only uses 8 bytes!
 If you're allowed to make multiple passes over your set, you might as apply an in-place comparison sort and get it down to O(n*log n) time with constant space. Their method has the advantage that the stream doesn't need to be stored -- events can be processed as they come in and then discarded, never storing the entire set.
 If n is large enough (and 1000000000 definitely is), at some point (possibly even the beginning) adding 1/n to the floating point value will result in no change.
 Uhh, beside that I don't get what's going on with the floating point number in your algorithm, that requires storing the N items, which is exactly what the author is trying to avoid.
 These seems like an ideal solution to that particular problem of counting.But.. considering the fact that they have many many servers and may need to do other types of analysis on their large data set, would it make sense to consider solving this problem with a distributed system like Cassandra?
 I really love posts like this with an open-source project attached (w/ Apache license no less!)It's such a niche thing but if you've ever tried to do this you know what a b*tch it is to do properly. I have never seen another open source solution for this problem.Thank you for the code, I will peruse and consider using it over a beer tonight!
 Isn't this just called a bloom filter?
 Let me guess, radix-tree variant with common prefixes of keys at the branches of the tree and counters at the leaves.
 Nope, totally off. It's closer to probabilistic hashing where the probabilities of picking successive buckets are exponentially decreasing.
 That's not counting. That's estimating. I based my guess from the title of the post.
 Which is why you should &A^*(\$@# read before commenting.
 Do you need to draw attention to your lack of a specific swear by creating a visually-distinct meta-swear?
 I have a gripe with the opener.> At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is challenge when the cardinality of the set is large.On a turing machine, you're always operating on sets with the same cardinality as the integers.
 That would be cool if you actually had a Turing machine. Most people have a deterministic linear bounded automata.
 This is the same as saying Turing machines cannot operate on finite sets, which is clearly wrong.
 since when is the set of integers a finite set? Without understanding you argument more fully, I don't see how this follows at all.
 Since never. That's the opposite of what he was saying.You: Turing machines only operate on sets with the same cardinality as the set of integers.Jason: Integers are not a finite set.Jason: Therefore Turing machines cannot operate on finite sets.
 Computers are finite, and so is Clearspring's data.

Search: