Counting hash collisions with the birthday paradox

etep · on July 18, 2015

I've faced similar questions and the work of Michael Mitzenmacher (at Harvard) has proved very valuable (e.g. modeling table load for d-left).

PaulAJ · on July 18, 2015

Some experiments suggest that if D=n^2 then the expected number of collisions is close to 0.5, and converges on 0.5 as D tends to infinity. So if D is large then collisions will be a problem when n approaches sqrt(D).

oxymoron · on July 18, 2015

Practically, does that mean that the rate of collissions increase substantially once you have more than 65535 entries in a hash table with 32 bit hashes? To the point that it makes more sense to use a BST instead, even?

nly · on July 18, 2015

If you have a hash table of size S with N entries in it, then the probability of hitting a collision on the next insertion is N/S. This is why many hash tables will resize and rehash once the load factor (which is N/S) reaches some predetermined fraction. Book-keeping using this method is cheap.

It's important to distinguish "What is the probability of the next insertion causing a collision?" and looking at a chained hash table of occupancy N and asking "What is the probability that this hash table already contains a collision?" (the birthday paradox yields 50% for 23 birthdays in answer to this), and again "How many collisions do I expect to find in this hash table?", which is what the author is writing about, and is actually the cumulative probability of there having been a collison over the lifetime of all previous insertions.

The collision rate will depend on how long entries live in your hash table and the rate at which you're doing insertions.