Damn Cool Algorithms: Cardinality Estimation (notdot.net)
233 points by nicksdjohnson on Sept 7, 2012 | 30 comments



Just to mention a modification on the "simple and intuitive cardinality estimator" that's far more accurate and actually used for Google's SZL:

Given N objects, hash each object and store the lowest M (s.t. M << N) hashes, then calculate what percentage of hash space is covered and compare to how much of the hash space would be expected to be covered.

This removes the issue of having to assume the N objects are distributed evenly as hash(object) should exhibit that property.

[1] My naive Python implementation for tests on the NLP WSJ corpus -- https://github.com/Smerity/Snippets/blob/master/analytics/ap...

[2] Szl's implementation -- http://code.google.com/p/szl/source/browse/trunk/src/emitter...
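
To make the idea concrete, here's a rough Python sketch of the min-M-hashes estimator described above (an illustration only, not Szl's actual code; M and the choice of hash are arbitrary):

    import hashlib

    def estimate_cardinality(items, M=1024):
        # keep only the M smallest 64-bit hashes seen so far
        HASH_SPACE = 2 ** 64
        smallest = set()
        for item in items:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], 'big')
            smallest.add(h)
            if len(smallest) > M:
                smallest.remove(max(smallest))  # discard the largest to keep only M
        if len(smallest) < M:
            return len(smallest)  # fewer than M distinct hashes seen: (near) exact count
        # the M-th smallest hash covers max(smallest)/HASH_SPACE of the hash space;
        # N uniform hashes would be expected to cover about M/N of it, so solve for N
        return int(M * HASH_SPACE / max(smallest))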


I actually cover that in the post (well, more or less - I talk about hashing to remove bias after discussing the 'min element' algorithm). According to the papers cited, though, taking the count of leading zeroes is more space efficient, allowing you to have more buckets in the same amount of space.
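
For illustration, a crude sketch of the bucketed leading-zeroes idea (hypothetical code; the bias corrections from the papers are reduced to a single constant here):

    import hashlib

    def leading_zero_estimate(items, b=10):
        # 2^b buckets; each bucket stores only the maximum leading-zero count
        # it has seen, which fits in a few bits - hence more buckets per byte
        m = 2 ** b
        buckets = [0] * m
        for item in items:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], 'big')
            idx = h & (m - 1)                        # low b bits pick the bucket
            rest = h >> b                            # remaining 64 - b bits
            rank = (64 - b) - rest.bit_length() + 1  # position of the first 1-bit
            buckets[idx] = max(buckets[idx], rank)
        # ~0.397 is the large-m bias-correction constant from the LogLog paper;
        # HyperLogLog uses a harmonic mean plus small/large-range corrections instead
        return int(0.397 * m * 2 ** (sum(buckets) / m))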


>This removes the issue of having to assume the N objects are distributed evenly as hash(object) should exhibit that property.

I'm a bit confused -- isn't that exactly how the article proceeds?


The hashing isn't the point. The leading zeroes method is more likely to produce unreliable results when you can't guarantee in advance that the dataset will be large, whilst the "keep min M elements" method works fine for unexpectedly small sets. That's why it's used for SZL, where the user can request approximate unique counts on any data regardless of size.


Have you read the papers I linked in detail? Some of them, such as HyperLogLog, provide corrections to give better estimates for small sets, and although I can't follow the proof in its entirety, they claim to be more efficient than the alternatives, including the one you propose.
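
For the curious, the small-set correction in HyperLogLog amounts to falling back to linear counting while many registers are still zero. Roughly (my paraphrase of the paper, not its exact pseudocode):

    import math

    def corrected_estimate(raw_estimate, registers):
        # small-range correction: if the raw estimate is small and some
        # registers are still zero, linear counting is far more accurate
        m = len(registers)
        zeros = registers.count(0)
        if raw_estimate <= 2.5 * m and zeros:
            return m * math.log(m / zeros)
        return raw_estimate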


Do you know off the top of your head if this is more or less efficient/accurate than the algorithm in the post?


It's the same algorithm.


Does this method have a name/are there any reasonable papers out there describing it?


I can't resist mentioning a favorite algorithm of mine for solving a related problem. It generalizes the algorithm for the classic problem of finding the majority element (one which occurs at least 50% of the time) with a single pass over a stream, using a constant amount of space, and a second sieving pass. It's one of those simple gems you wonder why it took so long to discover.

http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASim...
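
For anyone who hasn't seen it, here's a sketch of the classic (un-generalized) majority-vote version, second sieving pass included:

    def majority_candidate(stream):
        # single pass, constant space: unequal elements cancel each other out,
        # so whatever survives is the only possible majority element
        candidate, count = None, 0
        for x in stream:
            if count == 0:
                candidate, count = x, 1
            elif x == candidate:
                count += 1
            else:
                count -= 1
        return candidate

    def is_majority(data, candidate):
        # second (sieving) pass: confirm the candidate really occurs > 50% of the time
        return sum(1 for x in data if x == candidate) > len(data) / 2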


Very nice, thanks! It might be too similar for a post all of its own, but I think it'd be a worthy followup to this post.


Direct link to the pre-generalized version: https://www.cs.utexas.edu/~moore/best-ideas/mjrty/index.html

Note that this algorithm fails silently if there is no majority element, which limits the scope of its utility.

I assume that the reason it took so long to discover is that no one needed the solution before. Are there applications that can benefit from this algorithm?


You can do a second pass over the list to check that your solution is actually a majority element. This maintains the linear time and constant space properties.


Unfortunately, a single pass is usually the binding constraint, e.g. in real-time data processing.


Fwiw, I believe the best estimator has been improved on since HyperLogLog, with a more recent result that is provably optimal (and slightly faster asymptotically, dropping the loglog factor), which perhaps more importantly can also process streamed data online: http://people.seas.harvard.edu/~minilek/papers/f0.pdf


A friend mentioned the existence of this, but I couldn't find it myself. Thanks for pointing it out.

All the algorithms can process data in a streaming fashion, though - they only require a single pass.


If you liked this post, you will probably find this article about other probabilistic data structures interesting too: http://highlyscalable.wordpress.com/2012/05/01/probabilistic...


That is mind-blowing. Intuitively, a hash digest is completely uncorrelated with its input, but this shows that you can make inferences about a collection of data based on its hashes.

In retrospect, it makes sense stochastically, but still a Damn Cool Algorithm indeed.


[deleted]


> Other hash applications (like hash tables) don't need that property

I'm unclear on what you guys mean by correlation, but if it means what I think it does, I disagree with this. A hash table ideally has a uniform distribution regardless of the input, so any structured correlation with the input will harm this goal in real-world applications.


Python's hashes for small integers never collide:

    Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
    >>> map(hash, range(10))
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


> ... you can make inferences about a collection of data based on its hashes.

Quite a few cryptographers have learned this the hard way!


Isn't that the point? That hashing maps an arbitrary set of numbers to an approximately uniform scale-invariant distribution.


That's not what I said. Generating a predictable distribution by hashing is old hat.

What's awesome is you can then make inferences about the original data from those hashes -- something that good hash functions are supposed to be resistant to, in isolation.


It works here because we only care whether two bits of data are equal -- and a hash function had better preserve that relationship!


Very cool algorithm. I wonder, can anyone explain in what kind of real-world application it would be used?


A minor nitpick... 'cardinality' in mathematics refers to a set of infinities. The set of all integers has a certain cardinality, which is equal to the cardinality of all rational numbers, but which is smaller than the cardinality of all irrational numbers.

"Size estimation" is sufficient; no need to get fancy with words you don't know the meaning of.


His usage is correct.

Cardinality refers to the size of any set. This includes finite, countable, and uncountable sets.

The sequence of cardinal numbers is transfinite: 0, 1, 2, ... , n, ..., aleph_0, aleph_1, ...

http://en.wikipedia.org/wiki/Cardinality

http://en.wikipedia.org/wiki/Cardinal_number


The OP is correct. Cardinality in mathematics is the number of elements in a set[1]. What you are probably referring to is the concept of cardinal numbers[2]. I guess you can skip the smug tone.

[1] http://en.wikipedia.org/wiki/Cardinality [2] http://en.wikipedia.org/wiki/Cardinal_number


He can't even claim that. Cardinal numbers include the finite cardinals, after all.


> no need to get fancy with words you don't know the meaning of.

If you want to write things that leave no room for you being mistaken, it is a good idea to double check what you are saying.

https://www.google.com/webhp?#hl=en&q=cardinality&tb...


The irony of your crushingly ignorant post nearly killed me.



