Hacker News new | past | comments | ask | show | jobs | submit login
Counting hash collisions with the birthday paradox (might.net)
48 points by ot on July 18, 2015 | hide | past | favorite | 4 comments



I've faced similar questions and the work of Michael Mitzenmacher (at Harvard) has proved very valuable (e.g. modeling table load for d-left).


Some experiments suggest that if D=n^2 then the expected number of collisions is close to 0.5, and converges on 0.5 as D tends to infinity. So if D is large then collisions will be a problem when n approaches sqrt(D).


Practically, does that mean that the rate of collissions increase substantially once you have more than 65535 entries in a hash table with 32 bit hashes? To the point that it makes more sense to use a BST instead, even?


If you have a hash table of size S with N entries in it, then the probability of hitting a collision on the next insertion is N/S. This is why many hash tables will resize and rehash once the load factor (which is N/S) reaches some predetermined fraction. Book-keeping using this method is cheap.

It's important to distinguish "What is the probability of the next insertion causing a collision?" and looking at a chained hash table of occupancy N and asking "What is the probability that this hash table already contains a collision?" (the birthday paradox yields 50% for 23 birthdays in answer to this), and again "How many collisions do I expect to find in this hash table?", which is what the author is writing about, and is actually the cumulative probability of there having been a collison over the lifetime of all previous insertions.

The collision rate will depend on how long entries live in your hash table and the rate at which you're doing insertions.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: