Just to mention a modification of the "simple and intuitive cardinality estimator" that's far more accurate and actually used for Google's SZL:
Given N objects, hash each object and store the lowest M (s.t. M << N) hashes, then calculate what percentage of hash space is covered and compare to how much of the hash space would be expected to be covered.
This removes the need to assume the N objects themselves are evenly distributed, since hash(object) should exhibit that property.
[1] My naive Python implementation for tests on the NLP WSJ corpus -- https://github.com/Smerity/Snippets/blob/master/analytics/ap...
[2] Szl's implementation -- http://code.google.com/p/szl/source/browse/trunk/src/emitter...
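For concreteness, a minimal sketch of the idea (a toy version, much simpler than the implementations linked above; the name kmv_estimate, SHA-1 truncated to 64 bits, and M = 1024 are arbitrary choices here):

    import hashlib

    def kmv_estimate(items, m=1024):
        """Estimate the distinct count by keeping the m smallest 64-bit hashes.

        If the m smallest hashes cover a fraction f of the hash space, we
        expect roughly m / f distinct values overall.
        """
        hash_space = 2 ** 64
        smallest = set()
        for item in items:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], 'big')
            smallest.add(h)
            if len(smallest) > m:
                smallest.discard(max(smallest))  # keep only the m smallest distinct hashes
        if len(smallest) < m:
            return len(smallest)  # fewer than m distinct hashes seen: the count is essentially exact
        fraction_covered = max(smallest) / hash_space
        return int(m / fraction_covered)

With m = 1024 the standard error is on the order of 1/sqrt(m), so estimates typically land within a few percent of the true count while storing only m hash values.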
I actually cover that in the post (well, more or less - I talk about hashing to remove bias, after talking about the 'min element' algorithm). According to the papers cited, though, taking the count of leading zeroes is more space-efficient, allowing you to have more buckets in the same amount of space.
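For contrast, a rough sketch of the leading-zeroes idea -- a plain LogLog-style estimator, not the full HyperLogLog with its harmonic mean and range corrections; the 1024 buckets and truncated SHA-1 are again arbitrary illustrative choices. The space saving is that each bucket stores only a small maximum run length rather than a whole hash value:

    import hashlib

    def leading_zero_estimate(items, bucket_bits=10):
        """LogLog-style estimate: the first bits of the hash pick a bucket,
        and each bucket keeps only the maximum 'rank' (leading zeroes + 1)
        of the remaining bits -- a number that fits in about 6 bits.
        """
        num_buckets = 2 ** bucket_bits
        value_bits = 64 - bucket_bits
        registers = [0] * num_buckets
        for item in items:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], 'big')
            bucket = h >> value_bits                   # top bits choose the bucket
            rest = h & ((1 << value_bits) - 1)         # remaining bits supply the leading zeroes
            rank = value_bits - rest.bit_length() + 1  # position of the first 1-bit
            registers[bucket] = max(registers[bucket], rank)
        mean_rank = sum(registers) / num_buckets
        # 0.39701 is (approximately) the asymptotic bias-correction constant
        # from Durand and Flajolet's LogLog paper.
        return int(0.39701 * num_buckets * 2 ** mean_rank)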
The hashing isn't the point. The leading zeroes method is more likely to produce unreliable results when you can't guarantee in advance that the dataset will be large, whilst the "keep min M elements" method works fine for unexpectedly small sets. That's why it's used for SZL, where the user can request approximate unique counts on any data regardless of size.
Have you read the papers I linked in detail? Some of them, such as HyperLogLog, provide corrections to give better estimates for small sets, and although I can't follow the proof in its entirety, they claim to be more efficient than the alternatives, including the one you propose.
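To see why the small-set corrections matter, compare the two sketches above (again, illustrations only) on a tiny input: the min-hashes version simply reports the exact count when fewer than M distinct hashes have been seen, while the raw leading-zeroes formula is dominated by its empty buckets. HyperLogLog's small-range correction, which falls back to linear counting, exists precisely for this regime.

    small = ["a", "b", "c", "d", "e"]
    print(kmv_estimate(small))           # 5 -- fewer than m distinct hashes, so exact
    print(leading_zero_estimate(small))  # in the hundreds -- the uncorrected formula is badly biased here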
I can't resist mentioning a favorite algorithm of mine for solving a related problem. It generalizes the algorithm for the classic problem of finding the majority element (one which occurs at least 50% of the time) with a single pass over a stream, using a constant amount of space, and a second sieving pass. It's one of those simple gems that makes you wonder why it took so long to discover.
Note that this algorithm fails silently if there is no majority element, which limits the scope of its utility.
I assume that the reason it took so long to discover is that no one needed the solution before. Are there applications that can benefit from this algorithm?
You can do a second pass over the list to check that your solution is actually a majority element. This maintains the linear time and constant space properties.
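The base algorithm being described is the classic Boyer-Moore majority vote. A minimal sketch with both passes (it takes a plain list so the verification pass is possible; a true stream would have to be replayable):

    def majority_element(items):
        """Boyer-Moore majority vote: one pass to pick a candidate, a second
        'sieving' pass to confirm it really occurs more than half the time.
        Returns None when there is no majority element.
        """
        # Pass 1: constant extra space -- a single candidate and a counter.
        candidate, count = None, 0
        for x in items:
            if count == 0:
                candidate = x
            count += 1 if x == candidate else -1
        # Pass 2: verify, because pass 1 nominates *some* element even when
        # no majority exists (the silent-failure case).
        if candidate is not None and 2 * sum(1 for x in items if x == candidate) > len(items):
            return candidate
        return None

    print(majority_element([1, 2, 1, 1, 3, 1, 1]))  # 1
    print(majority_element([1, 2, 3]))              # None -- no element exceeds 50%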
Fwiw, I believe the state of the art has improved since HyperLogLog, with a more recent result that is provably optimal (and slightly faster asymptotically, dropping the loglog factor) and which, perhaps more importantly, can also process streamed data online: http://people.seas.harvard.edu/~minilek/papers/f0.pdf
That is mind-blowing. Intuitively, a hash digest is completely uncorrelated with its input, but this shows that you can make inferences about a collection of data based on its hashes.
In retrospect, it makes sense stochastically, but still a Damn Cool Algorithm indeed.
> Other hash applications (like hash tables) don't need that property
I'm unclear on what you guys mean by correlation, but if it means what I think it does, I disagree with this. A hash table ideally has a uniform distribution regardless of the input, so any structured correlation with the input will harm this goal in real-world applications.
That's not what I said. Generating a predictable distribution by hashing is old hat.
What's awesome is you can then make inferences about the original data from those hashes -- something that good hash functions are supposed to be resistant to, in isolation.
A minor nitpick... 'cardinality' in mathematics refers to a set of infinities. The set of all integers has a certain cardinality, which is equal to the cardinality of all rational numbers, but which is smaller than the cardinality of all irrational numbers.
"Size estimation" is sufficient; no need to get fancy with words you don't know the meaning of.
The OP is correct. Cardinality in mathematics is the number of elements in a set[1]. What you are probably referring to is the concept of cardinal numbers[2]. I guess you can skip the smug tone.