Edit: fixed the formula.
The math seems wrong here. You can actually fit 16 characters in 16 bytes! In this case they can do it in 8 bytes, since it looks like it's a hex string and each hex character only encodes 4 bits.
That would mean that the hash table would fit in 8 GB of RAM.
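To make the packing concrete, here is a tiny illustration (the hex ID value is made up): a 16-character hex string stored as text takes 16 bytes, but the same value packs into 8 raw bytes.

```python
hex_id = "89abcdef01234567"      # 16 hex characters = 16 bytes as ASCII text
packed = bytes.fromhex(hex_id)   # 2 hex characters per byte -> 8 bytes
print(len(hex_id), len(packed))  # 16 8
```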
In any case, it's a neat trick. Here's an algorithm for counting arbitrarily sized input, assuming the data is too big to fit in memory:
1) Walk through the data, writing out the IDs to a different file based on the first n digits. Split it into enough buckets that each individual bucket is small enough to fit comfortably in RAM.
2) Count the number of unique items in each bucket using a hash table.
This requires three times the disk I/O of the original single pass, but it gives exact answers.
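A minimal sketch of the two-pass scheme above. The bucket count and the hash-based partitioning (equivalent to splitting on leading digits, as described, since either way all copies of an ID land in the same bucket) are illustrative choices, not from the original comment.

```python
import os
import tempfile

def exact_distinct(ids, n_buckets=16):
    """Exact distinct count for data too big for RAM, via disk buckets."""
    tmp = tempfile.mkdtemp()
    paths = [os.path.join(tmp, "bucket%d" % i) for i in range(n_buckets)]
    files = [open(p, "w") for p in paths]
    # Pass 1: scatter each ID into a bucket; every copy of a given ID
    # hashes to the same bucket, so buckets can be deduped independently.
    for id_ in ids:
        files[hash(id_) % n_buckets].write(id_ + "\n")
    for f in files:
        f.close()
    # Pass 2: count uniques per bucket with an in-memory set; totals add
    # up because the buckets are disjoint.
    total = 0
    for p in paths:
        with open(p) as f:
            total += len({line.rstrip("\n") for line in f})
    return total
```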
1MB for 8192 identifiers (as opposed to 128K as strings, or 64K as ints)
384GB for 3 billion identifiers (rather than 38GB strings, or 19GB ints)
119GiB if 1 in 3 is unique (rather than 12GiB strings, or 6GiB ints)
6GiB is not a lot of RAM anymore.
That is the same as saying "count all IDs starting with 'aa', add all IDs starting with 'ab', add all IDs starting with 'ac'...", isn't it?
Seriously, though... garbage collection, multiple processors, race conditions, networks, OS scheduler preemptions, cache evictions, branch mispredictions, disk failures. Even without getting into drivers, the reason software appears deterministic is that we try our damnedest to maintain that illusion.
Realtime Metrics for 128 Million Users with Redis: 50ms + 16MB RAM
Discussion: http://news.ycombinator.com/item?id=3748239 & https://news.ycombinator.com/item?id=3292542
This article describes probabilistic techniques for counting the number of unique objects, using far less memory than exact methods. The bitmask technique counts 128 million users in 16MB of RAM with no error, whereas the HyperLogLog technique counts 1,000 million in 0.0015MB of RAM with a 2% margin of error.
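A minimal pure-Python sketch of the exact bitmask technique (the Redis version uses SETBIT/BITCOUNT; this class is illustrative): one bit per possible user ID, so 128 million users need 128M/8 = 16 MB, and duplicates cost nothing because they set an already-set bit.

```python
class BitmapCounter:
    """One bit per possible user ID: 128M IDs -> 16 MB, exact count."""

    def __init__(self, capacity):
        self.bits = bytearray((capacity + 7) // 8)

    def add(self, user_id):
        # set the bit for this ID; setting it twice is a no-op
        self.bits[user_id >> 3] |= 1 << (user_id & 7)

    def cardinality(self):
        # popcount over the bitmap = number of distinct IDs seen
        return sum(bin(b).count("1") for b in self.bits)
```

Time buckets (day, week, month) are just separate bitmaps, and bitwise AND/OR across them answers retention-style questions ("active both this week and last week") without rescanning the raw data.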
This was used as a proof of concept and surely contains a few embarrassing bits of code, but oh well.
I also suck at titles. So those of you who pointed out that it should be 'How to estimate the cardinality of...' are correct, no argument from me there.
Re Bloom filters: you can use a Bloom filter to estimate the number of occurrences of a given key, but you can't use one to get the total cardinality of the set.
One commenter suggested using Cassandra. We do use Cassandra to count (accurately) the number of times a URL has been shared, but we do not use it for our primary analytics engine. Also, I'm not sure exactly what you're proposing here. Store all IDs in C* and then count the number of keys? What if I want to bucket the data by day, hour, week, month, etc.? Would I have to store the key n times, once for each bucket? Coincidentally, Jonathan Ellis, of Cassandra fame, was the inspiration for our Bloom filter implementation. Details are referenced in our GitHub project.
How do you pick a single poisoned barrel out of 1000 with 10 prisoners as your guinea pigs :)
But thanks anyway.
Note: I have not verified this solution, and it does not match the solution that was posted elsewhere in this thread.
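For reference, the classic solution (which may or may not be the one posted elsewhere in the thread) encodes each barrel index in binary: prisoner i tastes every barrel whose i-th bit is set, and the set of prisoners who die spells out the poisoned barrel's index. 10 bits cover 2^10 = 1024 >= 1000 barrels — the same bit-indexing idea as the bitmask counting above. A sketch:

```python
def tasting_plan(n_barrels=1000, n_prisoners=10):
    # prisoner i drinks from every barrel whose index has bit i set
    return [[b for b in range(n_barrels) if (b >> i) & 1]
            for i in range(n_prisoners)]

def poisoned_barrel(dead_prisoners):
    # the dead prisoners are exactly the set bits of the barrel index
    return sum(1 << i for i in dead_prisoners)
```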
Probabilistic method (1985): http://algo.inria.fr/flajolet/Publications/FlMa85.pdf
HyperLogLog (2007): http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
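For a sense of how the 2007 paper's estimator works, here is a toy HyperLogLog sketch (register count, hash choice, and the small-range linear-counting correction are my own simplifications of the paper, not production code). Each hash contributes its leading-zero run length to one of m registers; the harmonic mean of 2^register across all registers estimates the cardinality with roughly 1.04/sqrt(m) relative error.

```python
import hashlib
import math

def hll_count(items, p=14):
    """Toy HyperLogLog: m = 2**p registers, ~1.04/sqrt(m) relative error."""
    m = 1 << p
    regs = [0] * m
    for item in items:
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = x >> (64 - p)                     # first p bits pick a register
        w = x & ((1 << (64 - p)) - 1)         # remaining 64-p bits
        rho = (64 - p) - w.bit_length() + 1   # position of leftmost 1-bit
        regs[j] = max(regs[j], rho)
    alpha = 0.7213 / (1 + 1.079 / m)          # bias correction for m >= 128
    est = alpha * m * m / sum(2.0 ** -r for r in regs)
    if est <= 2.5 * m:                        # small-range correction:
        zeros = regs.count(0)                 # fall back to linear counting
        if zeros:
            est = m * math.log(m / zeros)
    return est
```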
Of course this method requires O(N^2) time (that is, 1,000,000,000,000,000,000 compare operations when counting a billion objects), but who cares? It only uses 8 bytes!
But... considering that they have many servers and may need to do other types of analysis on their large data set, would it make sense to solve this problem with a distributed system like Cassandra?
It's such a niche thing but if you've ever tried to do this you know what a b*tch it is to do properly. I have never seen another open source solution for this problem.
Thank you for the code, I will peruse and consider using it over a beer tonight!
> At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is a challenge when the cardinality of the set is large.
On a turing machine, you're always operating on sets with the same cardinality as the integers.
You: Turing machines only operate on sets with the same cardinality as the set of integers.
Jason: Integers are not a finite set.
Jason: Therefore Turing machines cannot operate on finite sets.