
Hash Functions: the modulo prime myth? - edw519
http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth
======
calambrac
For a uniformly distributed input set, it doesn't matter how many buckets you
have. When you take your modulus, you're going to mix well, regardless of
whether it's prime or not. This is what the article (accidentally, apparently)
confirms.

The win of using a prime modulus comes when you have a non-uniformly
distributed input set (though of course you could craft an especially evil
input set). Mod prime is just more likely to produce a more uniform mix across
a larger set of input distributions.

Since we're talking about a general-purpose hash table, we can't assume we
know anything about our inputs (i.e., that they're uniformly distributed), so
we use a prime number of buckets just to up the odds that we come out ahead.
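
For instance, a quick sketch of that (the stride of 4 and the bucket counts are made-up numbers, not anything from the article): keys that are all multiples of 4 only ever land in a few of 12 buckets, but reach all 13 of a prime count:

    keys = [4 * i for i in range(100000)]     # skewed input: every key is a multiple of 4

    for m in (12, 13):                        # composite vs. prime bucket count
        used = len({k % m for k in keys})
        print(m, "buckets:", used, "actually used")
    # 12 buckets: 3 actually used
    # 13 buckets: 13 actually used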

~~~
edw519
"For a uniformly distributed input set..."

A little off-subject, but another good reason to make your primary ids
sequential whenever possible. This issue just goes away.

------
barrkel
The article writer (Nathan) used uniformly distributed cryptographically
secure random numbers as his test data. Any decent hash for integers that
doesn't throw bits away is going to produce a nice even distribution for that
kind of sequence no matter what the divisor.

The problem with hash tables is bad hash functions, or hash functions which
perform badly on certain data, where "perform badly" means recurrence of
certain patterns, usually "latent periods", i.e. too many function results
that are divisible by some number or one of some set of numbers. If those
numbers have factors in common with the bucket count, then there's going to be
trouble.

For example, a poor hash function might turn up even numbers twice as often as
odd numbers; if the bucket count is also even, then the even-numbered buckets
will carry roughly twice the load of the odd-numbered ones, on average. A prime
number of buckets guarantees that there's only one factor that can clash,
rather than a bunch of them.
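
A quick sketch of that effect (the 2:1 skew and the bucket counts here are made-up numbers, not anything from the article):

    import random

    # made-up skew: even hash values occur twice as often as odd ones
    hashes = [2 * random.randrange(1 << 20) for _ in range(20000)]
    hashes += [2 * random.randrange(1 << 20) + 1 for _ in range(10000)]

    for m in (8, 7):                          # even vs. prime bucket count
        load = [0] * m
        for h in hashes:
            load[h % m] += 1
        print(m, load)
    # m = 8: the four even buckets carry about twice the load of the odd ones
    # m = 7: every bucket carries roughly the same load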

~~~
codexon
Hello, this is the author. (btw Nathan is a commenter, not me)

It was exactly my conclusion that the hash function and the data are more
important than using mod prime.

Try to think of it this way. The ideal hash function takes your not-so-random
data and outputs random data.

I = low-medium entropy data

H(I) = high entropy output equivalent to random numbers

bucket = (H(I) or random number) % m <--- experiment is here.

Now to simulate this, I used the previous model, as it is the definition of a
good hash function. If you have repetitive data in, and repetitive data out,
no modulus is going to help you.
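
In code, the model is roughly this (a sketch with made-up sizes, using random numbers as the stand-in for H(I), not the exact setup from the post):

    import random
    from collections import Counter

    n, prime, composite = 100000, 4099, 4096     # made-up sizes

    # H(I): a good hash is modeled as uniformly random output
    hashed = [random.getrandbits(32) for _ in range(n)]

    for m in (prime, composite):
        loads = Counter(h % m for h in hashed)
        histogram = Counter(loads.values())      # how many buckets hold exactly k entries
        print(m, sorted(histogram.items()))
    # both moduli produce essentially the same collision histogram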

So yes, the conclusions of your comment and mine are the same: Worry about your
data and hash function. But many people came out of the woodwork trying to
poke holes in the experiment.

~~~
calambrac
_If you have repetitive data in, and repetitive data out, no modulus is going
to help you._

Um, no. A prime modulus will often help in this case. That's the whole point.

~~~
codexon
Do you have any statistical evidence?

~~~
calambrac
In your test code, change your loop to:

    for r in [3, 5, 17, 257]:
        for i in xrange(n):
            num = r * i
            table_p[num%prime] += 1
            table_c[num%composite] += 1

On here and on your blog, you keep falling back to the observation that modulo
prime makes no difference when you have a uniform distribution of inputs, like
you get after you apply a good hash function.

Nobody who is telling you that you're wrong would dispute that. The
fundamental point that you keep ignoring is that it doesn't matter, because as
a library writer providing a hash table, _you don't control the hash function_.
Having a simple way to mitigate the ill effects of users choosing bad hash
functions is a good thing. Why do you keep ignoring this point?

------
lallysingh
Quick point: the reason to use primes is to avoid a nasty interaction between
a common fault of hash functions and the modulus. Specifically, if the hash
function puts out a lot (e.g. more than appropriate for a uniform
distribution) of hash values that have a common factor F, and if F is also a
factor of your modulus, then you've effectively divided your number of buckets
by F for those hash values.

It seems a little obscure until we look at common factors like 2 or 4. It's
pretty easy to write a hash function that, when combined with its input (say
ASCII text, or addresses returned by malloc()), puts out a lot of values
divisible by a power of two.

To avoid it, without doing any analysis of your hash function or input, a
prime bucket count adds safety. As a bonus, some primes are cheap to reduce
by: the Mersenne primes, i.e. primes of the form 2^n - 1 (per this:
<http://primes.utm.edu/notes/proofs/Theorem2.html>, if 2^n - 1 is prime then n
is prime too). Reducing modulo a Mersenne prime can be done with shifts and
adds instead of a full division.
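
A minimal sketch of that reduction, assuming inputs that fit in 64 bits and using the Mersenne prime 2^31 - 1:

    M = (1 << 31) - 1                  # 2**31 - 1, a Mersenne prime

    def mod_mersenne(x):
        # reduce x modulo 2**31 - 1 with shifts and adds instead of a division
        x = (x & M) + (x >> 31)
        x = (x & M) + (x >> 31)        # a second fold absorbs any remaining carry
        return x if x < M else x - M

    assert all(mod_mersenne(x) == x % M for x in (0, M, M + 1, 123456789012345))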

------
vomjom
I'm a little late to the conversation, but I'll explain the reasoning. You
don't want the hash and the bucket size to have a common divisor greater than
1. A prime bucket size is the best way to ensure that.

For example, let's suppose you have a bucket size of 8 and a flawed hash
function that produces multiples of 12 more often than other hashes (it can
even be only slightly more common, but will still have an effect). Then the 0
and 4 spots will be used more often than any other.
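
A quick sketch of that (the key count is arbitrary):

    from collections import Counter

    hashes = [12 * i for i in range(1000)]    # skewed hashes: all multiples of 12

    print(Counter(h % 8 for h in hashes))     # only buckets 0 and 4 are ever used
    print(Counter(h % 7 for h in hashes))     # a prime bucket count spreads them over all 7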

Obviously, if you have a super awesome hash function, then it isn't a problem.
Most of the time, though, making a hash function with good distribution
properties is hard and makes the hash function slower.

There's also a simple universal hashing solution using prime bucket sizes:

<http://en.wikipedia.org/wiki/Universal_hashing>

<http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-046JFall-2005/VideoLectures/detail/embed08.htm>
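
For reference, the construction described at those links is roughly the following (a sketch; the helper name and the choice of p = 2^31 - 1 are mine): pick random a and b once, then hash x as ((a*x + b) mod p) mod m for a prime p larger than any key.

    import random

    def make_universal_hash(m, p=(1 << 31) - 1):
        # p must be a prime larger than any key; 2**31 - 1 is a convenient choice
        a = random.randrange(1, p)
        b = random.randrange(0, p)
        return lambda x: ((a * x + b) % p) % m

    h = make_universal_hash(1024)             # the bucket count itself need not be prime
    print(h(42), h(12345))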

------
sfphotoarts
Very interesting, and worth reading all the way up to and including the
comments. Nathan appears (in the comments) to counter the conclusions, but
merely sketches out why, rather than giving anything concrete.

Then again, it would seem odd that Knuth was not only wrong, but that nobody
in 30 years of CS has spotted the error.

I'm going to continue to use primes I think, just in case.

------
kylec
It's been a few years since abstract algebra, but might it have something to
do with finite fields? I seem to recall that modulo primes held special
significance (i.e. you can't have a finite field size other than a power of a
prime).

------
stratomorph
I don't understand what that graph is showing. Which variable is independent,
or supposed to be? It can't be buckets as a function of collisions. At first I
thought he had his axes reversed (from the customary y as a function of x),
but it's not a graph of collisions as a function of buckets, either, because
reading it that way implies that a very small number of buckets produces a
small number of collisions, which is obviously false with a hundred million
inputs being distributed over ten or fewer buckets.

It looks like a graph of the moral equivalent of 1=1, by numeric methods. What
am I supposed to see there?

~~~
calambrac
It's a histogram. It's counting the number of buckets (y axis) for each number
of collisions (x axis). He's showing that the distribution is the same for
both methods; if one had a clear advantage, it should have a relatively more
uniform distribution.

The problem is that the input distribution is uniform, so you'd expect a
pretty decent distribution of collisions regardless of what your modulus was.
A prime modulus only helps for non-uniformly distributed inputs.

------
tybris
You use a prime to be able to reach all slots in open addressing if you use a
linear probe step bigger than 1. Can't think of any other reason.
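
For instance (a sketch with arbitrary sizes), a fixed probe step only visits every slot when the step and the table size are coprime, which a prime table size guarantees for any step smaller than it:

    def slots_reached(m, step, start=0):
        # follow the probe sequence start, start+step, start+2*step, ... (mod m)
        return len({(start + k * step) % m for k in range(m)})

    print(slots_reached(8, 2))    # 4 -- the step shares a factor with the table size
    print(slots_reached(7, 2))    # 7 -- prime table size: every slot is reachable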

------
illumen
Use a critbit/radix tree instead of hash tables. They are just better.

<http://cr.yp.to/critbit.html> <http://en.wikipedia.org/wiki/Radix_tree>

