The szl language (which was used at Google for log processing) has an output type called unique(n) [1], which uses a simple algorithm: it stores the smallest n sorted MD5 hashes, assumes that the hash is uniformly distributed and that there are no collisions, and estimates the total unique count as highest possible hash * n / highest stored hash.
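For the curious, here's a rough sketch of that estimator (sometimes described as a k-minimum-values approach). The bucket size n, the 64-bit truncation of MD5, and the function name are my own choices for illustration, not szl's actual code:

```python
import hashlib
import heapq

def estimate_unique(ids, n=1024):
    # Keep the n smallest 64-bit hash values seen so far.
    # heap holds negated hashes, so -heap[0] is the largest kept hash.
    heap = []
    kept = set()  # hashes currently kept, so duplicates are skipped
    for item in ids:
        h = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], 'big')
        if h in kept:
            continue
        if len(heap) < n:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)  # drop the current largest
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < n:
        return len(heap)  # fewer than n distinct hashes seen: count is exact
    highest_possible = 2 ** 64 - 1
    highest_stored = -heap[0]  # the n-th smallest hash overall
    # If n uniformly distributed hashes fall below highest_stored, the total
    # number of distinct hashes is about highest_possible * n / highest_stored.
    return highest_possible * n // highest_stored
```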
> These 16 characters represent 128 bytes. 8192 IDs would require 1 megabyte of space. We receive over 3 billion events per day... Even if we assume that only 1 in 3 records are unique the hash set would still take 119 gigs of RAM, not including the overhead Java requires to store objects in memory.
The math seems wrong here. You can actually fit 16 characters in 16 bytes! In this case they can do it in 8 bytes, since it looks like it's a hex string.
That would mean that the hash table would fit in about 8 GB of RAM (3 billion events at 1 in 3 unique is roughly 1 billion IDs, times 8 bytes each).
In any case, it's a neat trick. Here's an algorithm for counting arbitrarily sized input, assuming the data is too big to fit in memory:
1) Walk through the data, writing out the IDs to a different file based on the first n digits. Split it into enough buckets that each individual bucket is small enough to fit comfortably in RAM.
2) Count the number of unique items in each bucket using a hash table.
This requires three times as much disk I/O as a single pass over the data, but gives exact answers.
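Something like the following sketch, which partitions by a hash of the ID rather than a literal digit prefix (same idea: every copy of a given ID lands in the same bucket). The bucket count and file layout are just illustrative:

```python
import os

def exact_distinct_count(id_iter, workdir, num_buckets=256):
    # Pass 1: write each ID to one of num_buckets files, chosen by a hash of
    # the ID, so all copies of an ID end up in the same (small) file.
    buckets = [open(os.path.join(workdir, f"bucket_{i}"), "w")
               for i in range(num_buckets)]
    for id_ in id_iter:
        buckets[hash(id_) % num_buckets].write(id_ + "\n")
    for f in buckets:
        f.close()

    # Pass 2: each bucket fits in RAM, so count it exactly with a hash set.
    total = 0
    for i in range(num_buckets):
        with open(os.path.join(workdir, f"bucket_{i}")) as f:
            total += len(set(line.rstrip("\n") for line in f))
    return total
```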
You'd think so, but then every other calculation in that paragraph is based on bytes:
1MB for 8192 identifiers (as opposed to 128K as strings, or 64K as ints)
384GB for 3 billion identifiers (rather than 38GB strings, or 19GB ints)
119GiB if 1 in 3 is unique (rather than 12GiB strings, or 6GiB ints)
> 1) Walk through the data, writing out the IDs to a different file based on the first n digits. Split it into enough buckets that each individual bucket is small enough to fit comfortably in RAM.
That is the same as saying "count all IDs starting with 'aa', add all IDs starting with 'ab', add all IDs starting with 'ac'...", isn't it?
No, you don't want a count, you actually want a list of those IDs so that you can later reopen each file and count how many distinct entries there are in it, i.e. divide and conquer.
You know what is amazing? As soon as you hit bigger or more general problems like this, you always face the compromise of "trading resource X for accuracy". Which leads me to believe that software, so far, has only been deterministic by pure accident.
Experience has led me to conclude that modern software is both non-deterministic, and prone to accidents.
Seriously, though... garbage collection, multiple processors, race conditions, networks, OS scheduler preemptions, cache evictions, branch mispredictions, disk failures. Even without getting into drivers, the reason software appears deterministic is that we try our damnedest to maintain that illusion.
It is possible to get an exact count with no extra memory, but it's going to be much slower: sort the set in place and then do the equivalent of `uniq | wc -l`.
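A minimal sketch of that idea, assuming the data is already materialized in a mutable list:

```python
def count_distinct_by_sorting(items):
    # Sort in place, then count run boundaries -- the `uniq | wc -l` idea.
    items.sort()
    count = 0
    previous = object()  # sentinel that compares unequal to any real item
    for x in items:
        if x != previous:
            count += 1
            previous = x
    return count
```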
The technique is very different, though - the getspool article makes a 128-million-bit bitmask and flips a bit for each user.
This article refers to probabilistic techniques for counting the number of unique objects, and uses massively less memory to do so. The bitmask technique counts 128 million users in 16MB of RAM with no error, whereas the HyperLogLog technique counts over 1,000 million in 0.0015MB of RAM with a 2% margin of error.
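For comparison, the bitmap approach is roughly the following. It is exact, but it only works when IDs are integers in a known, bounded range (here 128 million, which is where the 16MB comes from):

```python
def count_with_bitmap(user_ids, id_space=128_000_000):
    bits = bytearray(id_space // 8 + 1)  # one bit per possible user ID
    for uid in user_ids:
        bits[uid >> 3] |= 1 << (uid & 7)  # flip the bit for this user
    # The number of set bits is the number of distinct users seen.
    return sum(bin(b).count("1") for b in bits)
```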
First off. I am very embarrassed to have made such an obvious mistake in the opening paragraph. It has been corrected, please forgive gods of bits and bytes...
I also suck at titles. So those of you who pointed out that it should be 'How to estimate the cardinality of...' are correct, no argument from me there.
Re Bloom filters: you can use a Bloom filter to estimate the number of occurrences of a given key, but you can't use one to get the total cardinality of the set.
One commenter suggested using Cassandra. We do use Cassandra for counting (accurately) the number of times a URL has been shared, but we do not use it for our primary analytics engine. Also, I'm not sure exactly what you are proposing here. Store all IDs in C* and then count the number of keys? What if I want to bucket the data by day, hour, week, month, etc.? Would I have to store the key n times, once for each bucket? Coincidentally, Jonathan Ellis, of Cassandra fame, was the inspiration for our bloom filter implementation. Details are referenced in our github project.
A Bloom filter could be used in a similar way: you have X hash functions and Y total bits, so for a Bloom filter with Z bits set you can estimate that N unique items were hashed. Probably not as good as HyperLogLog in terms of memory versus error rate, but there you go.
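For what it's worth, the usual way to back an estimate out of the fill ratio is n ≈ -(m/k) * ln(1 - Z/m), with m total bits, k hash functions, and Z bits set. A tiny sketch of that formula (my own, not anything from the article or the linked project):

```python
import math

def bloom_cardinality_estimate(bits_set, total_bits, num_hashes):
    # Estimate how many distinct items were inserted into a Bloom filter
    # with total_bits bits and num_hashes hash functions, given that
    # bits_set bits ended up set.
    if bits_set >= total_bits:
        return float("inf")  # filter is saturated; no useful estimate
    return -(total_bits / num_hashes) * math.log(1 - bits_set / total_bits)
```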
I thought about it for 5 minutes and couldn't find an answer (I found several, but they were all half-answers). I looked at the answer below, and it turned out that you'd missed an important part: The prisoners don't die immediately after drinking from a poisoned barrel... They die after 30 days (in other words, they can drink from several barrels, and if any of those barrels were poisoned, they'd die).
I'm not the OP and I don't have a solution at hand, but I'd start by thinking about error correction and parity, then realize that 10 bits is enough to represent 1000 items completely.
SPOILER: Couldn't you just make one prisoner drink from half the barrels, and then if he dies have another prisoner drink from half of those barrels, and if not from half of the other barrels, and so forth?
That would work for the original question, but with the delayed death and quick response conditions imposed by pooriaazimi[0], it would not work. Tying the solution back to binary, you realize that each barrel has a unique 10-bit identifier (2^10=1024, so numbering the barrels in binary gives each one a unique ID). You assign each prisoner to a bit, then have them all drink from the barrels where their bit is a 1. 30 days later, you'll have a pattern of prisoner deaths exactly matching a single barrel.
Note: I have not verified this solution, and it does not match the solution that was posted elsewhere in this thread.
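For anyone who wants to convince themselves, here's a toy check of the binary-labelling scheme (my own illustration: 10 prisoners, barrels numbered 0-999):

```python
def identify_poisoned_barrel(poisoned, num_prisoners=10):
    # Prisoner `bit` drinks from every barrel whose number has that bit set,
    # so prisoner `bit` dies exactly when the poisoned barrel has that bit set.
    deaths = [(poisoned >> bit) & 1 for bit in range(num_prisoners)]
    # Reading the death pattern back as a binary number names the barrel.
    return sum(d << bit for bit, d in enumerate(deaths))

assert all(identify_poisoned_barrel(b) == b for b in range(1000))
```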
True, but on the other hand, how many times have you been asked to do a task where, upon further investigation, the answer they needed could have been provided more easily with a better question? So maybe they asked for a count, but they really just wanted to know whether the count was more like 1K or 100K.
More than that, it's worth perusing all of Flajolet's papers; they're a wonderful intersection of analytic combinatorics and algorithms, and the recent book he co-authored before his unfortunate passing last year (http://algo.inria.fr/flajolet/Publications/book.pdf) is an amazing book.
Well, if all you care about is the space complexity, you could do the same thing with 8 bytes of memory: an integer and a floating-point number. Start the floating point at zero. For each item, set the integer to 0, then compare the item to every other item (including itself) and add one to the integer if they are equal. After each item's comparisons, add one divided by the integer to the floating point, and repeat for the next item. When you're done, the answer will be stored in the floating point.
Of course this method requires O(N^2) time (that is, 1,000,000,000,000,000,000 compare operations when counting a billion objects), but who cares? It only uses 8 bytes!
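In case the trick isn't obvious, here's roughly what that looks like (my own sketch; each distinct value contributes occurrences * (1/occurrences) = 1 to the total):

```python
def distinct_count_quadratic(items):
    total = 0.0  # the "floating point"
    for x in items:
        occurrences = 0  # the "integer"
        for y in items:  # compare against every item, including itself
            if x == y:
                occurrences += 1
        total += 1.0 / occurrences
    return round(total)
```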
If you're allowed to make multiple passes over your set, you might as well apply an in-place comparison sort and get it down to O(n log n) time with constant space. Their method has the advantage that the stream doesn't need to be stored -- events can be processed as they come in and then discarded, never storing the entire set.
If n is large enough (and 1000000000 definitely is), at some point (possibly even the beginning) adding 1/n to the floating point value will result in no change.
Uhh, besides the fact that I don't get what's going on with the floating-point number in your algorithm, it requires storing the N items, which is exactly what the author is trying to avoid.
This seems like an ideal solution to that particular counting problem.
But considering the fact that they have many, many servers and may need to do other types of analysis on their large data set, would it make sense to consider solving this problem with a distributed system like Cassandra?
I really love posts like this with an open-source project attached (w/ Apache license no less!)
It's such a niche thing but if you've ever tried to do this you know what a b*tch it is to do properly. I have never seen another open source solution for this problem.
Thank you for the code, I will peruse and consider using it over a beer tonight!
> At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is a challenge when the cardinality of the set is large.
On a Turing machine, you're always operating on sets with the same cardinality as the integers.
Edit: fixed the formula.
[1]: http://code.google.com/p/szl/source/browse/trunk/src/emitter...