
Counting hash collisions with the birthday paradox - ot
http://matt.might.net/articles/counting-hash-collisions/
======
etep
I've faced similar questions and the work of Michael Mitzenmacher (at Harvard)
has proved very valuable (e.g. modeling table load for d-left).

------
PaulAJ
Some experiments suggest that if D=n^2 then the expected number of collisions
is close to 0.5, and converges on 0.5 as D tends to infinity. So if D is large
then collisions will be a problem when n approaches sqrt(D).

~~~
oxymoron
Practically, does that mean that the rate of collissions increase
substantially once you have more than 65535 entries in a hash table with 32
bit hashes? To the point that it makes more sense to use a BST instead, even?

~~~
nly
If you have a hash table of size S with N entries in it, then the probability
of hitting a collision on the _next_ insertion is N/S. This is why many hash
tables will resize and rehash once the load factor (which is N/S) reaches some
predetermined fraction. Book-keeping using this method is cheap.

It's important to distinguish "What is the probability of the next insertion
causing a collision?" and looking at a chained hash table of occupancy N and
asking "What is the probability that this hash table already contains a
collision?" (the birthday paradox yields 50% for 23 birthdays in answer to
this), and again "How many collisions do I expect to find in this hash
table?", which is what the author is writing about, and is actually the
cumulative probability of there having been a collison over the lifetime of
all previous insertions.

The collision _rate_ will depend on how long entries live in your hash table
and the rate at which you're doing insertions.

