
Which hashing algorithm is best for uniqueness and speed? - suprgeek
http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed/145633#145633
======
MichaelGG
MurmurHash2, which is pretty great, has some issues:

"MurmurHash2_x86_64 computes two 32-bit results in parallel and mixes them at
the end, which is fast but means that collision resistance is only as good as
a 32-bit hash. I suggest avoiding this variant."[1]

MurmurHash3 has a 128-bit variant, which might be more along the lines of what
he's looking for (the original post mentions SHA256).

1: <http://code.google.com/p/smhasher/wiki/MurmurHash3>

~~~
martincmartin
Computing two 32-bit results in parallel and mixing them at the end does NOT
mean collision resistance is only as good as a 32-bit hash. For that to be
true, you would need to compute ONE 32-bit result and then transform it into a
64-bit result.
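
To make that concrete, here is a small Python sketch (CRC32 standing in
as a hypothetical 32-bit hash):

    import zlib

    def h32(data: bytes) -> int:
        # Stand-in 32-bit hash; any 32-bit hash shows the same effect.
        return zlib.crc32(data)

    def widen(data: bytes) -> int:
        # 64 bits derived from ONE 32-bit result: widen(a) == widen(b)
        # exactly when h32(a) == h32(b), so collision resistance stays
        # at 32 bits despite the wider output.
        h = h32(data)
        return (h << 32) | h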

~~~
finnw
It depends on whether the two 32-bit hashes are correlated with each other. If
there is no correlation, then a pair of 32-bit hashes is no more likely to
collide than a single 64-bit hash. But this is difficult to achieve, and you
should not assume (for example) that running the same algorithm twice with
different initial states will produce uncorrelated hashes.

~~~
martincmartin
Very true.

------
memset
Question: say you use a hash which returns a 32-bit integer. If you were
actually implementing a hash table, would you need to declare a structure with
2^32 elements? `int buckets[2^32]`? This seems unwieldy!

Would an actual hash table only use, say, the first 10 bits or so of the hash
function's output (`int buckets[1024]`) to make it less sparse (albeit
increasing collisions)?

If you decide you want more buckets later on, would you have to re-hash
everything in your array and move it to a new, bigger one?

~~~
jrmg
Yeah, you've pretty much got it. A common alternative to masking bits off of
the hash is to take the hash modulo the size of the table as the index
(although you have to be careful with a modulo strategy, so as not to
introduce a bias towards certain table slots).

There are strategies to make the resize not be so expensive. Wikipedia's page
on Hash Tables covers this at
<http://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing>
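
Roughly, assuming a separate-chaining table (a Python sketch, with CRC32
standing in for the hash; the naive resize below is the expensive
baseline those strategies improve on):

    import zlib

    def index_for(key: bytes, table_size: int) -> int:
        # Map the full 32-bit hash into [0, table_size) via modulo.
        return zlib.crc32(key) % table_size

    def resized(table: list, new_size: int) -> list:
        # Growing the table changes every index, so every entry has to
        # be re-hashed into the new bucket array.
        new_table = [[] for _ in range(new_size)]
        for bucket in table:
            for key, value in bucket:
                new_table[index_for(key, new_size)].append((key, value))
        return new_table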

[edit: clarity in the first paragraph]

~~~
yason
If the hash function is _any good_, simply bitwise-AND enough bits from the
hash and use that as the index. (It's a modulo too, just modulo 2^x.)

Powers of two generally fit nicely with programming on binary computers. They
also have the nice property that the number of slots is itself always a power
of two, so the table packs naturally into, for example, a single memory page.
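
In code, the masking trick looks something like this (a Python sketch;
the table size is an arbitrary example):

    TABLE_BITS = 10
    TABLE_SIZE = 1 << TABLE_BITS      # 1024 slots, always a power of two

    def index_for(h: int) -> int:
        # h & (2**k - 1) keeps the low k bits of the hash: identical to
        # h % 2**k, but a single AND operation.
        return h & (TABLE_SIZE - 1)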

------
jorangreef
FNV, despite much popularity, is a relatively poor-quality hash. Murmur2 has a
major flaw, hence Murmur3. CRC32 is slow. Not mentioned in the post, but if
you're thinking Fletcher or Adler, they have terrible distribution. For a fast
32-bit hash, it's much better to go with Bob Jenkins's one-at-a-time hash
(<http://en.wikipedia.org/wiki/Jenkins_hash_function#one-at-a-time>), which is
simpler than Murmur3 and displays much better avalanche characteristics than
the other hashes.
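
For reference, the whole algorithm is short enough to transcribe
directly (a Python rendering of the pseudocode on the linked page):

    def one_at_a_time(data: bytes) -> int:
        # Bob Jenkins's one-at-a-time hash; the mask emulates 32-bit
        # unsigned overflow in Python.
        mask = 0xFFFFFFFF
        h = 0
        for byte in data:
            h = (h + byte) & mask
            h = (h + (h << 10)) & mask
            h ^= h >> 6
        h = (h + (h << 3)) & mask
        h ^= h >> 11
        h = (h + (h << 15)) & mask
        return h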

~~~
dkersten
According to this site[1], CrapWow seems to be the best in raw performance:
<http://www.team5150.com/~andrew/noncryptohashzoo/CrapWow.html>
(Crap8 in second place and Murmur3 in third; one-at-a-time performs quite
badly in this guy's benchmarks.)

Performance graphs:
<http://www.team5150.com/~andrew/noncryptohashzoo/speed.html>

[1] <http://www.team5150.com/~andrew/noncryptohashzoo/>

~~~
trebor
What about its randomness compared to Murmur2/3?

------
tmeasday
Why don't I see any mention of the latest research? There are people out
there who do this for a living, people!

~~~
marcusf
I find stuff like this terribly interesting, so please elaborate. Is there a
journal or the like you would recommend for someone curious?

~~~
tmeasday
I'm not active in the field right now, so I can't give specifics (the comment
was born more of frustration with people putting in so much effort
'reinventing the wheel', in an academic sense).

But if I were you, I'd start with Knuth (who, as another commenter mentioned,
covers hashes in great detail), head to the references, and then use Google
Scholar to find well-cited recent articles that reference the important papers
mentioned there.

~~~
pwaring
Unfortunately though, a lot of the papers that turn up in Google Scholar are
behind paywalls. It is _very_ expensive to get hold of academic papers if you
don't work/study somewhere with an institutional licence (anything from $10
upwards per paper).

~~~
tmeasday
Yup, that's the world of academic publishing unfortunately. A tip: often you
can get a 'preprint' copy of the paper from one of the authors' websites.
Probably not strictly legal, but it does happen a lot.

------
bmm6o
If you think this is the sort of answer Stack Overflow should strive for, note
that the author edited it so many times that it fell into "Community Wiki"
status. Until this was manually reverted, the author didn't receive any points
for his answer.

~~~
sathyabhat
Well, not entirely. Reverting the CW restores the rep, subject to the daily
rep limit. The answer's been given 3 separate bounties (100, 100, 50), which
are not affected by CW.

~~~
bmm6o
1) It still required manual intervention from a moderator.

2) I'm certainly not an expert on how SE distributes its points, but the
moderator comment in the related meta post
(<http://meta.programmers.stackexchange.com/questions/3527/remove-cw-status-for-this-answer-hashing-algorithms-testing-by-ian-boyd>) states "I
don't think there is a way to refund the reputation the answer gained while it
was CW [...]"

------
moe
Why did he omit the standards (MD5 and SHA1) from the comparison?

~~~
afsina
These are not cryptographic hashes, so the comparison would be unfair, as
cryptographic ones are rather slow. These are used in structures like hash
tables or Bloom filters, where they need to be very fast and provide
reasonable randomness (low collisions). But their collision rates are very
high compared to, say, SHA1.

~~~
bennysaurus
>@Orbling, for implementation of a hash dictionary. So collisions should be
kept to a minimal, but it has no security purpose at all. – Earlz

SHA-1 is very fast, though, so it is a good point of comparison.

~~~
IsTom
SHA-1 was designed as a cryptographic hash function; those are purposely slow.
No, SHA-1 is not as fast as the functions from the article.

~~~
pjscott
You've got it backwards; cryptographic hash functions are designed to be as
fast as possible without giving up their cryptographic properties. If you need
a slow hash (e.g. for password storage), you use something like bcrypt that's
designed to be slow.
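
A sketch of the deliberately-slow approach, using the standard library's
PBKDF2 as a stand-in for bcrypt (the iteration count here is an
illustrative choice, not a recommendation):

    import hashlib, os

    def password_hash(password: str) -> tuple:
        # Deliberately expensive: the iteration count makes each guess
        # cost real CPU time, unlike a single fast SHA-256.
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt,
                                     100_000)
        return salt, digest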

------
akg
Donald Knuth's Art of Computer Programming, Volume 3, has an excellent
exposition on hash functions and hash tables in general. If you can get
yourself a copy, I would highly recommend the read.

------
gizzlon
Found these by coincidence; they might be interesting to some of you (I can't
speak to the content):

<http://blog.aggregateknowledge.com/2011/12/05/choosing-a-good-hash-function-part-1/>
<http://blog.aggregateknowledge.com/2011/12/29/choosing-a-good-hash-function-part-2/>
<http://blog.aggregateknowledge.com/2012/02/02/choosing-a-good-hash-function-part-3/>

------
IgorPartola
Honest question: why not use 2-3 different hash algorithms to minimize
collisions if you are simply after uniqueness (e.g., verifying that a
downloaded file is correct)?

As for hash tables, can't you guarantee uniqueness by using only reversible
operations? I remember reading a lengthy post about this on HN, but can't find
the link anymore.

~~~
mseebach
> As for hash tables, can't you guarantee uniqueness by using only reversible
> operations?

This is called a perfect hash, and its appropriateness depends on the input
data.

The problem is that hashes are practical precisely in domains where a very
large space of inputs needs to fit into a finite number of buckets so that
lookups are fast.

For that to work with a perfect hash, you would need an infinite number of
buckets, which requires infinite space.

~~~
andreasvc
A perfect hash is able to avoid collisions when given the set of all possible
keys in advance; it is not related to reversibility. The latter contradicts
the very idea of a hash function, and would conceptually be a lossless
compression technique.
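
A toy illustration of what knowing all keys in advance buys you (Python;
real tools like gperf construct such functions far more cleverly than
this brute-force search):

    import zlib

    def find_perfect_seed(keys: list, table_size: int) -> int:
        # Search for a CRC32 seed that maps every key in the known,
        # fixed set to a distinct slot: a perfect hash for that set.
        for seed in range(1_000_000):
            slots = {zlib.crc32(k, seed) % table_size for k in keys}
            if len(slots) == len(keys):
                return seed
        raise ValueError("no perfect seed found; try a larger table")

    seed = find_perfect_seed([b"apple", b"banana", b"cherry"], 4)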

~~~
mseebach
Any reversible hash would be a perfect hash - not the other way around. That's
all I'm saying.

That said, there's nothing in the definition of hash functions that requires
them to be compressing or non-reversible, although they typically have to be
to be useful.

~~~
andreasvc
I agree with the first point, but I think compressing and non-reversible are
necessary conditions for a given function to be called a hash function; if
they weren't, any mathematical function would do, wouldn't it?

One could see a hash function as an (extremely) lossy compression method.
However, lossy compression only makes sense when you can exploit features of
the domain, e.g., psychoacoustics with sound, or characteristics of human
vision with photos; perceptual hashes come to mind here.

~~~
mseebach
It's really quite simple. Perfect hashes exist and are hashes. Perfect hashes
do not, as a matter of definition, compress. They typically aren't reversible,
because that's not a useful feature for a hash, but they could be, and if
nothing else, they can always be deterministically brute-forced (as they have
no collisions), which is a (very bad) form of reversibility.

It's not meaningful to study hashes as compression, since compression is by
definition reversible.

------
leif
I wish he had provided the architecture he ran on. Nehalem and later cores
have CRC32 instructions that are quite zippy.

------
p4bl0
Very interesting, thanks for sharing.

"CRC32 collisions: codding collides with gnu". At first I read "coding" and I
was all "haha this must be an easter egg of the implementation".

~~~
drucken
The irony is that codding [British English] means a practical joke or trick on
someone. :)

------
pbreit
How do folks generally take an autoincrementing database ID and generate a
hash to be used "in public" when trying to avoid revealing obviously serial
numbers? I don't think I need something airtight; in fact, in one system we
just multiplied/divided by a four-digit prime number. While this worked fine,
it seemed a little loose.

~~~
eli
HASH(<id> + "static string pad") is how I imagine most people do it. It
depends how much you care about a dedicated attacker figuring out that ID.

If this is a thing you're going to use a lot, I'd probably just add a database
column and give each record a random unique "public ID" -- then there is
literally no connection between the public ID and the private one.
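
The first idea might look like this in Python (the secret and the
truncation length are made-up examples; HMAC is substituted for plain
concatenation because it behaves better as a keyed hash):

    import hashlib, hmac

    SECRET = b"static string pad"   # hypothetical server-side secret

    def public_id(row_id: int) -> str:
        # HASH(<id> + pad), with HMAC so the pad acts as a real key
        # rather than a bare prefix.
        return hmac.new(SECRET, str(row_id).encode(),
                        hashlib.sha256).hexdigest()[:16]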

------
jorangreef
If you're doing 32-bit hashes in JavaScript and are willing to trade a bit,
then a 31-bit hash may be at least an order of magnitude faster due to VM
implementation details:
<https://groups.google.com/d/msg/v8-users/zGCS_wEMawU/6mConTiBUyMJ>

------
mlok
There is a Python wrapper for MurmurHash3 here:
<http://stackoverflow.com/a/5400389>
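
Assuming it is (or resembles) the mmh3 package on PyPI, usage is
roughly:

    import mmh3  # pip install mmh3 (assumed; may differ from the linked wrapper)

    h32 = mmh3.hash("foo")        # signed 32-bit MurmurHash3
    h128 = mmh3.hash128("foo")    # 128-bit variant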

------
kolo32
More tests: <http://www.strchr.com/hash_functions>

------
potkor
Going by just the subject line, maybe a perfect hash function generated with,
e.g., GNU gperf?

------
esbwhat
Isn't slow better in this case? I mean, if it's fast to generate, it's fast to
crack, right?

~~~
masklinn
> isn't slow better in this case?

No, the guy's looking for a hash table hash, not a cryptographic one. For a
hash table you're looking for low collisions and high throughput (so your hash
table is fast) first and foremost.

~~~
pjscott
Even for cryptographic hashes, fast is what you want most of the time. Look at
it this way: when you connect to a web site via SSL, all the data you send
will be hashed for authentication. Do you really want this to be slow?

~~~
terangdom
Cryptographic hashes should be as fast as possible while not sacrificing
collision resistance. Hashtable hashes should be as collision resistant as
possible while not sacrificing speed.

Sort of anyway.

