
Damn Cool Algorithms: Cardinality Estimation - nicksdjohnson
http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation
======
Smerity
Just to mention a modification of the "simple and intuitive cardinality
estimator" that's far more accurate and is actually used in Google's SZL:

Given N objects, hash each object and store the lowest M (s.t. M << N) hashes,
then look at what fraction of the hash space those M smallest hashes cover.
For uniform hashes that fraction should be about M/N, so inverting it gives an
estimate of N.

This removes the need to assume the N objects themselves are evenly
distributed, since hash(object) should exhibit that property.
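
To make that concrete, here's a minimal Python sketch (the hash choice,
parameter names and heap bookkeeping are mine, not Szl's exact
implementation):

    import hashlib
    import heapq

    HASH_BITS = 64
    HASH_SPACE = 2 ** HASH_BITS

    def hash64(obj):
        # Any well-mixed hash will do; SHA-1 truncated to 64 bits is
        # close enough to uniform for estimation purposes.
        digest = hashlib.sha1(str(obj).encode('utf8')).digest()
        return int.from_bytes(digest[:8], 'big')

    def estimate_cardinality(items, m=1024):
        """K-minimum-values estimate of the number of distinct items."""
        heap = []     # max-heap (via negation) of the m smallest hashes
        kept = set()  # the same hashes, for O(1) duplicate checks
        for obj in items:
            v = hash64(obj)
            if v in kept:
                continue
            if len(heap) < m:
                heapq.heappush(heap, -v)
                kept.add(v)
            elif v < -heap[0]:
                kept.remove(-heapq.heappushpop(heap, -v))
                kept.add(v)
        if len(heap) < m:
            # Fewer than m distinct hashes seen: the count is exact.
            return len(heap)
        # The largest kept hash covers -heap[0] / HASH_SPACE of the hash
        # space; uniform hashing puts m of the N distinct values in that
        # fraction, so N is approximately m / fraction.
        return m * HASH_SPACE // -heap[0]

Duplicates are ignored for free since the estimate depends only on distinct
hash values, and the relative error shrinks roughly as 1/sqrt(m).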

[1] My naive Python implementation for tests on the NLP WSJ corpus --
[https://github.com/Smerity/Snippets/blob/master/analytics/ap...](https://github.com/Smerity/Snippets/blob/master/analytics/approx_unique.py)

[2] Szl's implementation --
[http://code.google.com/p/szl/source/browse/trunk/src/emitter...](http://code.google.com/p/szl/source/browse/trunk/src/emitters/szlunique.cc)

~~~
shardling
>This removes the need to assume the N objects themselves are evenly
distributed, since hash(object) should exhibit that property.

I'm a bit confused -- isn't that exactly how the article proceeds?

~~~
Smerity
The hashing isn't the point. The leading-zeroes method is more likely to
produce unreliable results when you can't guarantee in advance that the
dataset will be large, whilst the "keep the min M elements" method works fine
for unexpectedly small sets. That's why it's used in SZL, where the user can
request approximate unique counts on any data regardless of size.
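
For contrast, here's a bare-bones leading-zeroes estimator (my own sketch,
with none of the bucketing or bias correction a real implementation would
use), which makes the small-set problem visible:

    import hashlib

    def leading_zero_estimate(items):
        """Single probabilistic counter: 2 ** (max run of leading zero
        bits seen in any hash), with no bias correction."""
        max_run = 0
        for obj in items:
            digest = hashlib.sha1(str(obj).encode('utf8')).digest()
            v = int.from_bytes(digest[:8], 'big')  # uniform 64-bit hash
            max_run = max(max_run, 64 - v.bit_length())  # leading zeros
        # Roughly one in 2^r uniform hashes starts with r zero bits, so
        # max_run is about log2(N) and 2 ** max_run estimates N.
        return 2 ** max_run

The answer can only ever be a power of two, and one lucky hash doubles it --
that variance is what hurts on small sets, whereas the min-M sketch degrades
gracefully into an exact count.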

~~~
nicksdjohnson
Have you read the papers I linked in detail? Some of them, such as
HyperLogLog, provide corrections to give better estimates for small sets, and
although I can't follow the proof in its entirety, they claim to be more
efficient than the alternatives, including the one you propose.

------
psykotic
I can't resist mentioning a favorite algorithm of mine for solving a related
problem. It generalizes the classic algorithm for finding the majority element
(one which occurs at least 50% of the time): a single pass over the stream
using a constant amount of space, followed by a second sieving pass. It's one
of those simple gems you wonder why it took so long to discover.

[http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASim...](http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASimpleAlgorithmForFindingFrequentElementsInStreamsAndBags.pdf)
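
For the curious, a compact Python sketch of the paper's two-pass scheme (the
parameterization by k and all names are my own): using at most k - 1 counters,
it finds every element occurring more than n/k times in a stream of length n;
the classic majority problem is k = 2.

    from collections import Counter

    def frequent_elements(stream, k):
        """Find every element occurring more than len(stream)/k times."""
        stream = list(stream)  # two passes; a real stream gets replayed
        # Pass 1: generalized majority vote, at most k - 1 counters.
        # Whenever a k-th distinct candidate shows up, decrement every
        # candidate and drop the zeros. Each such event cancels k count
        # units out of n total, so it happens at most n/k times -- any
        # element occurring more than n/k times must survive the pass.
        candidates = {}
        for x in stream:
            candidates[x] = candidates.get(x, 0) + 1
            if len(candidates) >= k:
                candidates = {c: v - 1 for c, v in candidates.items()
                              if v > 1}
        # Pass 2 (the sieve): exact counts for the survivors only.
        counts = Counter(x for x in stream if x in candidates)
        return {x: v for x, v in counts.items() if v > len(stream) / k}

With k = 2 this collapses to the classic majority vote: the moment a second
distinct candidate appears, both counts are decremented and at least one
candidate is dropped.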

~~~
nicksdjohnson
Very nice, thanks! It might be too similar for a post all of its own, but I
think it'd be a worthy followup to this post.

------
_delirium
Fwiw, I believe HyperLogLog has since been improved on, with a more recent
result that is provably optimal (and slightly faster asymptotically, dropping
the log log factor) and which, perhaps more importantly, can also process
streamed data online:
<http://people.seas.harvard.edu/~minilek/papers/f0.pdf>

~~~
nicksdjohnson
A friend mentioned the existence of this, but I couldn't find it myself.
Thanks for pointing it out.

All the algorithms can process data in a streaming fashion, though: they only
require a single pass.

------
bochi
If you liked this post, you will probably find this article about other
probabilistic data structures interesting too:
[http://highlyscalable.wordpress.com/2012/05/01/probabilistic...](http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)

------
jallmann
That is mind-blowing. Intuitively, a hash digest is completely uncorrelated
with its input, but this shows that you can make inferences about a collection
of data based on its hashes.

In retrospect, it makes sense stochastically, but still a Damn Cool Algorithm
indeed.

~~~
lurker14
Isn't that the point? That hashing maps an arbitrary set of numbers to an
approximately uniform scale-invariant distribution.

~~~
jallmann
That's not what I said. Generating a predictable distribution by hashing is
old hat.

What's awesome is you can then make inferences about the original data from
those hashes -- something that good hash functions are supposed to be
resistant to, in isolation.

~~~
shardling
It works here because we only care whether two bits of data are equal -- and a
hash function had better preserve that relationship!

------
tripzilch
Very cool algorithm. I wonder, can anyone explain in what kind of real-world
application it would be used?

------
brilee
A minor nitpick... 'cardinality' in mathematics refers to a set of infinities.
The set of all integers has a certain cardinality, which is equal to the
cardinality of all rational numbers, but which is smaller than the cardinality
of all irrational numbers.

"Size estimation" is sufficient; no need to get fancy with words you don't
know the meaning of.

~~~
code0
The OP is correct. Cardinality in mathematics is the number of elements in a
set[1]. What you are probably referring to is the concept of cardinal
numbers[2]. I guess you can skip the smug tone.

[1] <http://en.wikipedia.org/wiki/Cardinality>
[2] <http://en.wikipedia.org/wiki/Cardinal_number>

~~~
Sniffnoy
He can't even claim that. Cardinal numbers include the finite cardinals, after
all.

