
Hashing to estimate the size of a stream - CarolineW
http://jeremykun.com/2016/01/04/hashing-to-estimate-the-size-of-a-stream/
======
droffel
This feels very similar to how the total work expended by the Bitcoin network
can be approximated (not the current hashrate, but the sum of all hashes over
time). You take all of the block hashes and find the lowest one of the lot.
From that number, you can estimate the total work performed by the network in
aggregate. The concept is basically the same here.
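
A rough sketch of the idea (the names and toy simulation are mine, not from
real Bitcoin data): if k hashes are drawn uniformly from [0, 2^256), the
minimum observed value is around 2^256 / (k + 1), so inverting the minimum
block hash gives an estimate of k.

```python
import random

# Toy sketch, NOT real Bitcoin data: model block hashes as uniform draws
# from [0, 2**256). If k hashes have been tried, the minimum is around
# 2**256 / (k + 1), so inverting the minimum estimates k.
HASH_SPACE = 2**256

def estimate_total_work(block_hashes):
    """Estimate the total number of hashes tried from the minimum seen."""
    m = min(block_hashes)
    return HASH_SPACE // (m + 1)

random.seed(0)
true_work = 10**4                      # simulated number of hashes tried
simulated = [random.randrange(HASH_SPACE) for _ in range(true_work)]
print(estimate_total_work(simulated))  # rough estimate; high variance
```

A single minimum has enormous variance, which is why the article averages
over many independent hashes; the same caveat applies to a Bitcoin estimate.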

------
Someone
I find this interesting, but the writing is somewhat sloppy. Examples:

 _" Problem: Estimate the number of distinct items in a data stream that is
too large to fit in memory"_

That is _not_ what the title promises: the size of the stream.

 _" assume the data in the stream consists of uniformly random integers
between zero and N"_

I guess the lower bound is inclusive and the upper bound exclusive (that
probably doesn't matter in practice; it would matter if you wanted to
estimate the number of different byte values in a data stream on a truly
tiny device, but I would guess that anything that can do the division to get
the estimate would have the 32 bytes to store a bitmask of 256 values).

 _" Maintain a single integer xMin [...] expected value would be 1/(xMin+1)"_

With xMin a nonnegative integer, that estimate (of the number of different
values in the stream) would never get larger than 1. The code is correct; it
multiplies by N to get a number between 1 and N (inclusive).
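
The corrected estimate can be sketched like this (variable names are mine;
the stream is assumed to be uniform integers in [0, N) as in the post):

```python
import random

# Sketch of the single-minimum estimator: track xMin over the stream, then
# multiply the expectation 1/(xMin + 1) by N, as the post's code does.
def estimate_distinct(stream, N):
    x_min = min(stream)      # the single integer xMin the post maintains
    return N / (x_min + 1)   # between 1 and N, never stuck below 1

random.seed(1)
N = 2**32
n_values = 5000
stream = [random.randrange(N) for _ in range(n_values)]
print(estimate_distinct(stream, N))  # rough estimate of n_values
```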

~~~
j2kun
Thanks for the feedback. I updated the post a bit, but I'm not sure what to
use in place of "size." Maybe "cardinality" but that's not a very appealing
word for a title. Suggestions welcome :)

------
JFlash
> So if we assume the data in the stream consists of uniformly random integers
> between zero and N

This needs to be in the problem statement.

~~~
jsnell
It shouldn't; the algorithm as described works for arbitrary data. The whole
point of the hashing is to transform a stream of arbitrary data into a
stream of uniformly distributed random integers, so that you can then apply
the reasoning from that paragraph.
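
That point can be sketched as follows (sha256 here is only a stand-in for
the hash family in the post, and the item format is made up): hashing maps
arbitrary items to uniform-looking integers, and duplicates hash
identically, so the minimum reflects only the distinct items.

```python
import hashlib

# Sketch: hash arbitrary data to (approximately) uniform 64-bit integers,
# then apply the same min-based estimate. sha256 is a stand-in hash here.
M = 2**64

def to_uniform(item):
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)

def estimate_distinct(stream):
    x_min = min(to_uniform(x) for x in stream)
    return M / (x_min + 1)

# Duplicates hash to the same value, so repeats don't move the minimum:
stream = [f"user-{i % 1000}" for i in range(50_000)]  # 1000 distinct items
print(round(estimate_distinct(stream)))
```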

------
dragontamer
I enjoyed the discussion at least. I always enjoy these estimation algorithms.
Most of the time, you don't really need a precise answer.

