
Brief Introduction To General Purpose Hash Functions - ArashPartow
http://www.partow.net/programming/hashfunctions/index.html
======
demerphq
I think you have confused "ideal hash function" and "perfect hash function". A
perfect hash function is one that maps a finite set of keys into the same
number of buckets with no collisions. They have to be tailor constructed for
the keys they are intended for and are not for general purpose use.

An ideal hash function on the other hand produces output that is
indistinguishable from choosing the hash value for every key independently and
randomly. In particular an ideal hash function should be a PRF, and it should
handle any input size, and any keyset.

They aren't the same thing _at all_ , and you should not confuse them. You
cannot substitute an ideal hash function for a perfect hash function, or vice
versa.

~~~
ArashPartow
@demerphq (Yves Orton):

> "They aren't the same thing at all, and you should not confuse them. "

You're absolutely right. The term "ideal hash function" was what I was after
all along - I just couldn't find a formal definition of it. In any case I've
updated the article accordingly. Thank you for your review and comment.

~~~
demerphq
You are welcome. I wanted to add another point: rurban does have a point. The
hash functions you referenced in your article are generally old, low quality,
and slow. FNV is neither a good hash nor a fast one. FNV_YT is very fast, but
not a high-quality hash function; it fails many tests in SMHasher, including
some I personally consider dealbreakers: failing avalanche tests is
unacceptable.

Both rurban and I have (independently) spent considerable time testing about
100 hash functions using the hash test suite smhasher. In addition I have
greatly enhanced smhasher, and included test results and graphs in my version.
Included in my fork is the --stream option so you can test different hash
functions in counter mode with tools like the dieharder RNG test suite.

[https://github.com/demerphq/smhasher](https://github.com/demerphq/smhasher)
[https://github.com/rurban/smhasher](https://github.com/rurban/smhasher)

These are the FNV test results from my version of SMHasher:

FNV1a 32 32 32 FAILED 99 of 183 tests. Avalanche: 17-41, Crc-MultiCollision:
81, 83, 85-86, 88-91, 94, 96-98, 100, Cyclic: 42-52, Hi-Lo: 157-158, HiBit-
Null: 150-152, Highbits2: 147-149, Highbits: 144-146, LowBit-Null: 153-155,
Lowbits: 142-143, Murmur2-MultiCollision: 101-110, Murmur3A-MultiCollision:
111, 113, 117, 119-120, Murmur3F-MultiCollision: 122-127, 129-130, OverAll:
183, Text: 160-165, TwoBytes: 54-56, 63, Zeroes: 167-168

FNV1a_YT 32 32 32 FAILED 67 of 181 tests. Avalanche: 15-39, Differential:
9-14, Hi-Lo: 154-156, HiBit-Null: 148-150, Highbits2: 145-147, Highbits:
142-144, Lowbits: 139-141, OverAll: 181, Sparse: 74-78, Text: 157-158,
160-163, TwoBytes: 51, 53-61

FNV64 64 64 64 FAILED 66 of 183 tests. Avalanche: 17-41, Cyclic: 43-45, 47,
49, 51-52, Hi-Lo: 157-158, HiBit-Null: 151-152, Highbits2: 148-149, Highbits:
145-146, LowBit-Null: 154-155, Lowbits: 142-143, OverAll: 183, Sparse: 65-67,
69, 71, 73, 75, 77, 79-80, Text: 160-162, 164-165, TwoBytes: 54-56, 58, 60,
62-63, Words: 182, Zeroes: 167-168

You can see text view of the full list at:

[https://github.com/demerphq/smhasher/blob/master/doc/summary...](https://github.com/demerphq/smhasher/blob/master/doc/summary.txt)

Also, after reading this thread I added some graphs comparing the performance
of other hashes, including some that I have developed using a genetic
algorithm for use in Perl. Some of those hashes have hash-time performance at
the top of the envelope (StadtX is very fast).

[https://github.com/demerphq/smhasher/blob/master/doc/graph/F...](https://github.com/demerphq/smhasher/blob/master/doc/graph/FNV_Comparison.long.svg)
[https://github.com/demerphq/smhasher/blob/master/doc/graph/F...](https://github.com/demerphq/smhasher/blob/master/doc/graph/FNV_Comparison.medium.svg)
[https://github.com/demerphq/smhasher/blob/master/doc/graph/F...](https://github.com/demerphq/smhasher/blob/master/doc/graph/FNV_Comparison.short.svg)
[https://github.com/demerphq/smhasher/blob/master/doc/graph/F...](https://github.com/demerphq/smhasher/blob/master/doc/graph/FNV_Comparison.tiny.svg)

You may also be interested in looking at

[https://github.com/demerphq/smhasher/blob/master/HashAnatomy...](https://github.com/demerphq/smhasher/blob/master/HashAnatomy.md)

Good luck!

~~~
ot
Thanks for saying this! It is incredible how widespread the belief still is
that FNV is a good hash function, when much better ones exist that are faster
even on short keys.

~~~
demerphq
I am only repeating what my tests say. Frankly, given FNV's reputation, I was
surprised to see it fail so many tests.

Also I put a lot of effort into improving the quality of testing that smhasher
does, so there are many hashes that would pass the old version which will fail
the new one.

------
rurban
Just read the first two intro paragraphs, and already stumbled on two major
mistakes:

"Hash functions are by definition and implementation pseudo random number
generators (PRNG)."

No.

> "In practice it is generally recognized that a perfect hash function is the
> hash function that produces the least amount of collisions for a particular
> set of data."

No. A perfect hash function produces no collisions. "Least collisions" =>
better but not perfect.

I briefly looked over the rest, and it looks much better.

Just the last section is a bit misleading. First he talks about hashing in
general and then he introduces his implementations of hash functions as
"Available Hash Functions". This is not a generalization as above, it's merely
his collection.

DJB: "It is one of the most efficient hash functions ever published."
Nonsense. FNV is considered the generally most efficient simple hash function.
DJB ranks in the middle of generic unaligned string hash functions.

~~~
ArashPartow
@Rurban (Reini Urban): wrt your first point, general purpose hash functions
being modeled as PRNGs, I believe that statement to be correct. It is
definitely the case when analysing hash function behaviour in contexts such as
bloom filters, hash tables, etc.

With regards to the definition of perfect hash functions - if you had read the
sentence before the one you quoted, you would have noticed that the formal
definition of a perfect hash function is given in the article as:

> "In general there is a theoretical hash function known as the perfect hash
> function for any group of data. The perfect hash function by definition
> states that no collisions will occur meaning no repeating hash values will
> arise from different elements of the group."

It then goes on to talk about the practicalities of perfect hash functions
(deriving them etc), then makes the statement that, in practice, a perfect
hash function is one that has the least collisions. I agree that this is an
intentional weakening of the formal definition, but it is generally the most
practical one, as there is no guarantee that one can always find a "perfect
hash function" for an arbitrary set of values in sub-polynomial time.

\----

wrt the collection of hash functions, take note that the article is broken
into the following sections:

1\. What is a general purpose hash function

2\. Designing and implementing general purpose hash functions

3\. Uses of hash functions

4\. Example general purpose hash functions (popular ones or written by leaders
in the field, and dummies such as myself)

5\. Downloads

So could you clarify what it is you meant by:

> " last section is a bit misleading. First he talks about hashing in general
> and then he introduces his implementations of hash functions"

What specifically was misleading in the article?

\----

wrt: > "DJB: "It is one of the most efficient hash functions ever published."

The statement is _not_ incorrect, it is indeed one of the most efficient hash
functions out there, it may not be the MOST efficient or in other words "the
number one hash function", but it's there at the top - I think you may have
jumped the gun on that one too.

To further that, your comment: "Nonsense. FNV is considered the generally
most efficient"

Let's have a look and see if that's true. The following are pseudo-code
representations of the per-byte mixes of the two hash functions:

DJB: hash = ((hash << 5) + hash) + msg_byte

FNV1: hash = (hash * FNV_prime) ^ msg_byte

Unless you have a really bad shift operation (eg: no barrel shifter present),
I can't see how FNV1 can be significantly more efficient than DJB - both do
one multiply-equivalent and one add/xor per byte. Then again, the compiler may
rewrite ((hash << 5) + hash) as a single multiply by 33 on targets where the
mul is faster than the shift-and-add...

But let's step back a bit and consider the following:

Is hash function efficiency that important when neither DJB nor FNV1 possesses
good distributive qualities?

[https://research.neustar.biz/tag/hash-
function/](https://research.neustar.biz/tag/hash-function/)

~~~
rurban
ad 1: [https://stackoverflow.com/questions/31954237/pseudo-
random-g...](https://stackoverflow.com/questions/31954237/pseudo-random-
generator-hash-function) has some answers.

ad 2: in most measurable stats fnv1 is better than djb: practical throughput,
time when inlined, cachegrind cost model (4x!). fnv also produces fewer
collisions with average keys. only in pure cycles/hash is djb a bit better.

[https://github.com/rurban/smhasher#smhasher](https://github.com/rurban/smhasher#smhasher)
[https://github.com/rurban/perl-hash-stats#perl-hash-
stats](https://github.com/rurban/perl-hash-stats#perl-hash-stats)

> ad 3: Is hash function efficiency that important when neither of the hash
> functions DJB or FNV1 posses good distributive qualities?

It depends entirely on the use case. Usage in hash tables is totally different
from usage as a file or block checksum, or in crypto. Besides looking at how
many bits of the keys are really used, or how easy it is to create collisions,
it also depends on whether the keys are random-length strings, aligned buffers
or plain integers. Both DJB and FNV1 are considered bad regarding their stats.
But they are small and fast.

------
panic
It might be worth mentioning popular functions like FNV
([http://isthe.com/chongo/tech/comp/fnv/](http://isthe.com/chongo/tech/comp/fnv/))
MurmurHash
([https://github.com/aappleby/smhasher](https://github.com/aappleby/smhasher))
and SipHash ([https://131002.net/siphash/](https://131002.net/siphash/)).

------
bogomipz
The post uses the term "internal state" pretty early on (initialize internal
state, etc.) and continues to reference it frequently in the algorithmic
steps, but nowhere is "internal state" defined. Can someone explain to me what
this "internal state" is?

~~~
psyc
Internal state is typically a small-ish bag of bits that mixes, and gets mixed
deterministically, as the hash mechanism is used. See the internals of the
Mersenne Twister[0] for an example/analogy.

[0]
[https://en.wikipedia.org/wiki/Mersenne_Twister#Initializatio...](https://en.wikipedia.org/wiki/Mersenne_Twister#Initialization)

~~~
demerphq
It is only small in older hashes. Modern hashes tend to have a state larger
than both the seed and the input block.

See:

[https://github.com/demerphq/smhasher/blob/master/HashAnatomy...](https://github.com/demerphq/smhasher/blob/master/HashAnatomy.md)

It actually turns out there is a performance benefit to having a larger state,
but not too large. (If you think about it this makes sense: consider how many
bits you can change with an average combining operation, and with a worst-case
combining operation.)

------
jnwatson
A great resource for hash function performance:
[https://www.strchr.com/hash_functions](https://www.strchr.com/hash_functions)

It includes a bunch of more modern hash functions the author missed.

------
duckqlz
+1 "Brief" Introduction lol

~~~
ArashPartow
Once upon a time it was 'brief'... that is to say circa 2001 :D

