
Which hashing algorithm is best for uniqueness and speed? - fekberg
http://programmers.stackexchange.com/a/145633
======
ot
A more thorough test suite for hash functions is SMHasher, by the same author
as MurmurHash. It measures collisions, output uniformity, biased bits, etc.,
as well as speed, under several input distributions.

[https://code.google.com/p/smhasher/](https://code.google.com/p/smhasher/)

Here is the output for some popular hash functions:

[https://code.google.com/p/smhasher/wiki/SMHasherDump](https://code.google.com/p/smhasher/wiki/SMHasherDump)

~~~
tibbe
If someone could run smhasher for the hash functions in the original article
(and perhaps also MurmurHash 2/3) and summarize the result for the small key
sizes I will give them a cookie.

~~~
gregrata
I REALLY think we're all tired of random people giving us yet another cookie
to store...

------
ColinWright
It's worth remembering that when selecting things at random from a given pool,
there is always a chance of collision. If you think of a hash as taking a
value to a "random" result, with different objects going to results uniformly
"at random", then you have to expect collisions. This is why the term "perfect
hash" is defined so carefully.

For those who are interested, the probability of collision, the size of the
hash destination, and the number of objects chosen are related by this
formula:

    
    
        Choosing from N items, with N large, and wanting a
        probability T of having no collisions, the number of
        items you can choose at random with replacement is:
    
            k ~ sqrt( -2 N ln(T) )
    

So

    
    
            T ~ exp(-(k^2)/(2N))
    

Source:
[http://www.solipsys.co.uk/new/TheBirthdayParadox.html?HN2013...](http://www.solipsys.co.uk/new/TheBirthdayParadox.html?HN20131014)
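
As a quick sanity check, here is a minimal C++ sketch that plugs numbers into
the approximation above (assuming a 32-bit hash and roughly 200,000 keys, on
the order of the word list used in the answer):

    
    
      #include <cmath>
      #include <cstdio>
      
      int main() {
          const double N = 4294967296.0;  // size of a 32-bit hash space (2^32)
          const double k = 200000.0;      // assumed number of keys hashed
          // probability of *no* collision, per the approximation above
          const double t = std::exp(-(k * k) / (2.0 * N));
          std::printf("P(no collision) ~= %.4f\n", t);  // prints about 0.0095
          return 0;
      }
    

So with a couple of hundred thousand keys and a 32-bit output, at least one
collision is all but guaranteed even for a perfectly uniform hash.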

~~~
dllthomas
There's always a chance of a collision, but for a large enough pool relative
to the number of objects the chance may be sufficiently smaller than the
chance of hardware failure that it can be legitimately ignored. For something
where the entire space must in some way be realized (e.g. a hash table) this won't
be the case, obviously.

~~~
Someone
Most of the time, that isn't true. Hardware failures are (barring quality
control issues) independent; hash collisions are not.

For example, if you assume your hash will not have collisions, and design your
hash table without any mechanism for coping with one, and it happens to have a
clash on "Jones" and "Smith", your address book app will become unreliable for
a rather large fraction of your users.

~~~
dllthomas
But at the scales we're talking about, it _won't_ happen to have a clash on
"Jones" and "Smith". This won't be a hash table - this will be things like a
git commit hash, where generating one hash per nanosecond is likely to give
_a_ collision in 3 years, before which the storage required _just to store
hashes_ is in the yottabytes.

------
dgacmu
My research group at CMU uses a lot of hashing these days and we care mostly
about performance. We've settled on three as our basics:

* CityHash as a nice, general-purpose hash, particularly on newer machines with CRC32c support in hardware;

* SpookyHash for longer strings, particularly with no CRC32 support;

* CRC32c in hardware for a quick & dirty hash with possibly slightly poorer distribution.

It depends a lot on what you're hashing. For strings, both City and Spooky are
great. You should benchmark both on your target architecture(s). If you're
trying to take, e.g., integers to integers, we've found that CRC32c is a
pretty darn good fallback _if_ you need insanely high performance.
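
For what it's worth, the hardware route is nearly a one-liner with the SSE4.2
intrinsics; a rough sketch for a 64-bit integer key (assumes a CPU with SSE4.2
and compiling with -msse4.2):

    
    
      #include <cstdint>
      #include <nmmintrin.h>  // SSE4.2 CRC32 intrinsics
      
      // Quick & dirty integer hash using the CRC32c instruction.
      // Fast, but not flood-resistant and not cryptographic.
      inline uint32_t crc32c_hash(uint64_t key, uint32_t seed = 0) {
          return static_cast<uint32_t>(_mm_crc32_u64(seed, key));
      }
    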

If you're running a hash table exposed to _externally_ controlled input, then
you also want to consider (whether you need) algorithms that provide decent
resistance against "hash flooding". SipHash is the strongest contender there
short of throwing your hands in the air and using MD5 with a local secret
seed.

~~~
mcguire
I think it's worth reiterating your last paragraph. If your hash table is
exposed to external input, an effective DoS attack can be mounted by turning
your hash table into an expensive linked list.

[http://events.ccc.de/congress/2011/Fahrplan/events/4680.en.h...](http://events.ccc.de/congress/2011/Fahrplan/events/4680.en.html)

------
jameshart
One issue to be careful of is assuming that you don't need the guarantees a
cryptographic hash provides (preimage resistance, extension resistance, etc.)
in your application. If an adversary can force hash collisions they may,
depending on your uses for the hashing algorithm, be able to cause degradation
or denial of service, or perhaps trigger information disclosures if they're
very smart. If there's any way in which an attacker might be able to have some
control over the dataset you are hashing you should probably be looking at
using a cryptographic hash anyway.

~~~
fragsworth
Is denial of service from an attacker who has your source something you
actually consider when building a product?

I can see trying to prevent data breaches and such, but going so far as to be
wary of denial of service attacks from people who have your source code might
be wasted effort.

~~~
bcoates
The prevalence of frameworks means attackers probably have your source, or at
least enough of it to be interesting.

This specific denial of service attack hit a lot of systems in 2012, and all
you needed to know was which framework they were using (or just try all of
them, as the attack has incredible leverage).

It's as simple as crafting a query string that triggers O(n²) hash table
insertion behavior, letting you burn minutes of server CPU with each request.

~~~
fragsworth
Coming from the perspective of a startup that is building a product, future
risk/estimated damage from DoS attacks is not worth any of today's efforts to
prevent. I mean, weren't they all able to fix it _after the fact_ and not
suffer much in the end?

These are problems you only run into if you've got a successful product (which
happens very rarely) - and solving these problems before you have users might
not be a good allocation of engineering resources for startups.

I liken it to making a system that scales _in advance_ to millions of
concurrent users when you haven't even got one.

~~~
jameshart
The thing is that anywhere in your code where you are doing something which
involves hashing data to create a digest for some purpose you are almost
certainly attempting to do something a bit _clever_ (I'm going to map the
problem from the domain of entities to a domain of proxies for entities that
are smaller and easier to deal with. I'll just deal with the special
cases...). And if you're making a decision to use Murmur2 in that role, you're
probably trying to be _very clever_. Whenever a programmer tries to do
something clever, in my experience, there is something they haven't thought
about. Often the something they haven't thought about leads to a security
vulnerability.

The fact is it's often difficult to tell the difference between a place where
the security properties of the digest algorithm you use are irrelevant, and a
place where it might be critical. For example, a zip file uses a CRC32
checksum to validate each file; git uses SHA1 to uniquely identify objects;
CouchDB shard mapping uses a strategy based on CRC32... if git had used
murmur, would that cause problems for it? Is CouchDB being _clever_ by using a
fast hash for sharding at the expense of being secure?

Now, quick, make a decision which algorithm to use to generate Etag headers
for your web API HTTP response entities: are you going to go with SHA256 or
Murmur2? Murmur2 is much faster, and it's certainly a _clever_ solution... but
are you sure you didn't create a security hole?

I'm not saying "nobody ever got fired for choosing SHA256," but I am kind of
saying that you have to be very sure you know the security model you're
operating in before you build on a hash algorithm that doesn't provide strong
extension and preimage resistance.

------
ultimoo
>>FNV-1a collisions

>>altarage collides with zinke

>>altarages collides with zinkes

I found this most interesting. I have no idea how FNV-1a is constructed but
I'm sure this is an interesting (mathematical?) quirk in an otherwise great
algorithm since it had only 4 collisions for more than a couple of hundred
thousand inputs.

~~~
joshdick
Yeah, that seems to violate the goal of the avalanche effect.

[http://en.wikipedia.org/wiki/Avalanche_effect](http://en.wikipedia.org/wiki/Avalanche_effect)

~~~
dreish
That's a design goal for cryptographic hash functions, which these are not.

~~~
dllthomas
For something like English words, where you will have a _lot_ of common
prefixes and suffixes, it seems a legitimate design goal as well (though it
may not be a tradeoff worth making, to be sure).
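
For reference, FNV-1a itself is tiny, and it processes bytes strictly left to
right, so once two strings collide, appending the same suffix to both keeps
them colliding - which is exactly the altarage/zinke vs. altarages/zinkes
pattern quoted above. A sketch of the 32-bit variant:

    
    
      #include <cstdint>
      #include <string>
      
      // 32-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
      uint32_t fnv1a_32(const std::string& s) {
          uint32_t hash = 2166136261u;   // FNV offset basis
          for (unsigned char c : s) {
              hash ^= c;
              hash *= 16777619u;         // FNV prime
          }
          return hash;
      }
    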

------
rkalla
For those that love this topic, xxhash[1] has an SMHasher score of 10[2] and is
~2x as fast as MurmurHash3 -- a really interesting, incredibly fast,
high-quality hash function.

    
    
      [1] https://code.google.com/p/xxhash
      [2] https://code.google.com/p/xxhash/wiki/xxh32

~~~
tg3
I was going to say this as well. It has a Node.js port that I'm using for
static file versioning, and the speed is great.

------
edwintorok
It doesn't mention SipHash:
[https://131002.net/siphash/](https://131002.net/siphash/)

~~~
JoachimS
If you have the reputation points, please suggest that it be included in the
test on Stack Exchange.

SipHash was developed as a fast and secure PRF, intended to replace CityHash
and MurmurHash.

~~~
ygra
I added a comment to suggest it.

~~~
JoachimS
Thanks!

------
WayneS
Nice answer. However, in practice current Intel processors do CRC32 in
hardware. This blows away the rest and has really nice results. So I use
portable code that uses Intel's slicing-by-8 CRC32 code for generic machines
and then automatically uses SSE4.2 on the x86 machines most people are
actually running.

------
noelwelsh
Nice analysis.

Since that post was submitted, MurmurHash3 and a number of other notable hash
functions have been released. If you liked the linked post you should go read
Aggregate Knowledge's series of posts on hash functions for a look at some
newer algorithms: [http://blog.aggregateknowledge.com/2012/02/02/choosing-a-
goo...](http://blog.aggregateknowledge.com/2012/02/02/choosing-a-good-hash-
function-part-3/)

------
dllthomas
The best hash algorithm for uniqueness and speed uses knowledge of the
distribution of the incoming values and exploits whatever patterns are
available. It's not one-size-fits-all. If I am being fed arbitrary binary
blobs, taking the first 4 bytes might be as unique as any other approach (and
will clearly be faster than anything else that actually depends on the data).
If I am being fed English text, that approach is going to perform horribly on
uniqueness.

If the source of the data might benefit (DoS attack being a common example)
from deliberately causing collisions then there are additional constraints and
looking at cryptographic hash functions may be warranted.

------
bcoates
I often see these "plot the output of the RNG on a 2d grid" things. I
understand they're useful for demonstrating non-randomness, but if you're
searching for patterns, do they reveal anything an FFT wouldn't?

~~~
baq
isn't fft just a different way of looking at the data?

~~~
USNetizen
Essentially, yes, that was my understanding, too. I've only used it in signal
processing - converting from the time domain to the frequency domain. Not sure
how that would apply to this.

~~~
bcoates
I was thinking specifically of this:

    
    
      When i look at the FNV-1a "number" map, i think i see subtle
      vertical patterns. With Murmur i see no patterns at all. 
      What do you think?
    

Wouldn't that vertical repetition, and the other repeating patterns the other
hashes form, show up as a nice spike in the frequency domain? Whereas I would
expect uniform hashing to generate a flat, white-noise spectrum.

~~~
edwintorok
I find the avalanche maps used here easier to grasp:
[http://home.comcast.net/~bretm/hash/5.html](http://home.comcast.net/~bretm/hash/5.html)

Here is FNV, quite a lot of reds:
[http://home.comcast.net/~bretm/hash/6.html](http://home.comcast.net/~bretm/hash/6.html)
And of course all green for SHA-1:
[http://home.comcast.net/~bretm/hash/9.html](http://home.comcast.net/~bretm/hash/9.html)

------
dagss
This seems to be very out of date? My understanding was that when SipHash
arrived it knocked out all of these? (Please correct me if I'm wrong.)

This may be an interesting weakness with the StackOverflow system; if there's
an answer that used to be very good, it may take a lot to knock it out of the
first position when reality changes.

~~~
mcguire
I don't know of a performance comparison between SipHash and other secure
hashes besides [1], which compares it to MD5, "Python's new randomized
hash, with a 128-bit key and 64-bit output", and a couple of SHAs. It is
significantly slower than DJB2, though.

[1] [http://bench.cr.yp.to/results-auth.html](http://bench.cr.yp.to/results-
auth.html)

------
waynecochran
> Yes. I started writing my test program to see if hash collisions actually
> happen - and are not just a theoretical construct.

This should be no surprise. The birthday paradox, as Knuth points out in TAoCP,
makes it clear why collisions happen. Given 23 people in a room, there is a
greater than 50% chance that at least two have the same birthday.

Placing only 23 keys into a hash table with 365 entries with a perfectly
random hash function will probably result in at least one collision.

~~~
ColinWright

        Placing only 23 keys into a hash table with 365 entries with
        a perfectly random hash function will probably result in at
        least one collision.
    

No, it won't "probably" result in at least one collision; it will result in at
least one collision with probability about 1/2. As quoted elsewhere, placing
_k_ objects in a hash table with _N_ entries will result in no collisions with
probability about _exp(-(k^2)/(2N))_.

~~~
abentspoon
Interesting. I was going to complain, as I thought "probably" meant the
probability of occurrence was greater than 50%. That doesn't seem to be the
case.

    
    
      Probably: almost certainly; as far as one knows or can tell
    

I would guess waynecochran thought something similar.

~~~
ColinWright
Curious. I've just taken a straw poll in the office. I asked "if someone said
something was probably going to happen, what probability would you give it?"

The answers were:

* 80%

* 50% to 90%

* 90%

* > 80%

* 80%

* 80% to 90%

~~~
StavrosK
My guess would have been "exactly strictly higher than 50%".

~~~
crpatino
While mathematically correct, this does not seem to be how people use that
phrase. "X is probably going to happen" implies there is some evidence in
favor of X, while "X is probably not going to happen" implies evidence
_against_. Now, in regular speech, people do not feel compelled to force
themselves to choose one or the other, but are just as likely to say "I have
no idea", implying no evidence in either direction.

From an Information Theory point of view, 50% probability means zero
information, and 25%/75% probabilities mean one bit of information against/in-
favor-of the event. This sounds roughly consistent with the results of the
poll, with people assigning higher probabilities implying higher information
density of the evidence.

------
tracker1
I'm curious as to why the emphasis on speed here? I would think that hashes
are mostly for either confirming a payload, or as a security measure against
plain text passwords/keys.

I would think that, at least for the latter, a slower, more CPU/memory-intensive
hash would be more desirable. Something like scrypt may be overkill, especially
if you use it for password hashes under load with a lot of logins, but the
comparison seems to fly in the face of that.

This also doesn't really mention random seed/salt values.

~~~
bradleyjg
I'd think it's far more common to use hashes for various data structures (e.g.
hash maps) than for cryptographic or data integrity purposes.

~~~
DerpDerpDerp
Even for data structures, you generally want to use a cryptographic hash
function, to prevent collision and other similar attacks from degrading the
performance of a data structure.

For example, if I can sufficiently reverse your hashing process, I can degrade
a hash tree into a linked list for a particular set of data - something your
firewalls and other security measures will have virtually no chance of
stopping, but which will greatly increase the run time taken to process
(malicious) instructions about that data.

A common use of this is maliciously choosing new items to insert so that the
structure degenerates into a list as it is built.

~~~
tibbe
Until recently I don't think any major hash table implementation (Java, the
STL, Python, Ruby, etc.) used a cryptographically secure hashing algorithm.
They're just too slow - slow enough that you might as well use a tree-based
data structure, which doesn't suffer from hash-collision-based attacks.

This is changing a bit now that SipHash is available. Unfortunately SipHash is
still a bit too slow and forces an uncomfortable trade-off.

~~~
aweisberg
I would be really surprised to find that a tree is faster than a hash table
using a fast cryptographic hash function like SipHash. Cache misses to main
memory are worth tens of instructions, and that number is going up in many
cases. Hash functions can be pretty efficient at allowing the processor to run
multiple instructions in parallel.

If you know the tree will be cached and/or the keys you have to hash are large,
then sure, a tree might be a win, but hashing a small key might only cost the
equivalent of two or three cache misses. Comparing tree nodes is not
instruction-free either.

I think it is really a case by case sort of thing, but there is no substitute
for measuring (and nailing down the comparison to specific hash functions).

------
jorangreef
If you are hashing fixed-size keys, then tabulation hashing is simple and fast
with good theoretical properties:

[http://en.wikipedia.org/wiki/Tabulation_hashing](http://en.wikipedia.org/wiki/Tabulation_hashing)

[http://rcoh.me/pyhash.pdf](http://rcoh.me/pyhash.pdf)
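
The idea is simple enough to sketch: split the key into bytes, look each byte
up in its own table of random words, and XOR the results. (Names and seeding
below are illustrative, not taken from either link.)

    
    
      #include <cstdint>
      #include <random>
      
      // Simple tabulation hashing for fixed-size 32-bit keys:
      // each key byte indexes its own table of random 64-bit words,
      // and the looked-up words are XORed together.
      struct TabulationHash {
          uint64_t table[4][256];
      
          explicit TabulationHash(uint64_t seed) {
              std::mt19937_64 rng(seed);
              for (auto& t : table)
                  for (auto& v : t)
                      v = rng();
          }
      
          uint64_t operator()(uint32_t key) const {
              uint64_t h = 0;
              for (int i = 0; i < 4; ++i)
                  h ^= table[i][(key >> (8 * i)) & 0xFF];
              return h;
          }
      };
    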

------
svag
It is worth mentioning that libstdc++, in version 4.6.3 at least, has
implemented Murmur and FNV hash functions for user-defined key types.

If you want to use Murmur you use std::_Hash_impl::hash, and for FNV you use
std::_Fnv_hash_impl::hash (include bits/functional_hash.h).

The implementation is at libsupc++/hash_bytes.cc
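
These are internal, unversioned symbols, so they may change between releases,
but the calls look roughly like this (a sketch; <functional> pulls in
bits/functional_hash.h):

    
    
      #include <functional>  // pulls in bits/functional_hash.h
      #include <cstdio>
      
      struct Key { int a; int b; };
      
      int main() {
          Key k = {1, 2};
          // Murmur-based hashing of the raw bytes (internal libstdc++ helper)
          std::size_t h1 = std::_Hash_impl::hash(&k, sizeof(k));
          // FNV-based hashing of the same bytes
          std::size_t h2 = std::_Fnv_hash_impl::hash(&k, sizeof(k));
          std::printf("%zu %zu\n", h1, h2);
          return 0;
      }
    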

------
devx
Would the security of such algorithms improve dramatically if say 5 years from
now we'd live in a world where everyone has a 64-bit device or machine, and
the algorithms can be optimized for 64-bit? Are current algorithms limited by
32-bit a lot or not really?

~~~
ctz
Security of such algorithms now: none.

Security of such algorithms now where everyone has a 64-bit device: none.

Don't confuse hash functions like these and cryptographic hash functions.

------
ppierald
gperf -
[https://www.gnu.org/software/gperf/](https://www.gnu.org/software/gperf/)
This is a "perfect" application of a "perfect hash" generator.

------
sukaka
Seeing at least one collision in the tests makes me wonder why the chance of a
Bitcoin address collision is almost nil.
[https://en.bitcoin.it/wiki/Weaknesses#Generating_tons_of_add...](https://en.bitcoin.it/wiki/Weaknesses#Generating_tons_of_addresses)

~~~
vertex-four
Bitcoin uses a cryptographic hash function, SHA256. There are, at present, no
known collisions with SHA256.

This question is about non-cryptographic hash functions, as are commonly used
in hash tables[1] which implement associative arrays, or sharding databases
based on the hash of the record key.

Non-cryptographic hash functions are simply not in the same league as
cryptographic ones; cryptographic hash functions are optimized to make it
infeasible for an attacker to generate a different message with the same hash,
while a non-cryptographic hash function generally simply has to have uniformly
random output over its keyspace.

[1]
[http://en.wikipedia.org/wiki/Hash_table](http://en.wikipedia.org/wiki/Hash_table)

~~~
VLM
All of which is true and well written. However, it brings up the obvious point
missed in the otherwise well-written article: yes, if I use 32-bit hashes to
shard into 2 to the power of 32 databases, some will have more collisions than
others, i.e. across the full 32 bits of a 32-bit hash there are some "bad
looking" variations. But I don't care about non-randomness in the LSBs, because
I'm infinitely more likely to shard into, perhaps, 2 to the power of 4 database
machines.

The original article did not descend into that obvious area of research. For
all I know, a hash algo that has the worst randomness shoved into the least
significant byte (which I simply don't care about) is an inherent result of
smooshing the best randomness into its most significant nibble, which I do
care very much about. Given the likely use case for a sharding hash, a smart
hash designer would make sure that most of his effort is put into a smooth
distribution in the MSBs and perhaps totally ignore the LSBs for a given amount
of latency / computation / electrical power. After all, the actual users are
more likely to shard based on the first 4 bits than the first 24 bits. Although
you'll always run into people who think it's funny to pull their shard subset
out of the hash using the LSB (why?) or some random byte in the middle (why?).

I think MSB for shard key comes out of the tradition, in the really olden days,
of sharding based on raw unhashed data. Sometimes that's random enough that
the MSB of the data makes an excellent shard key and hashing would just slow
things down for a minimal gain, even today.

~~~
bcoates
Whoa, it's conventional to use the MSB? I usually just do

    
    
      bucket = hash(x) & mask
    

Should I expect non-cryptographic hashes and RNGs to produce better noise in
the MSBs? That would explain "Sun Redefines Randomness"

~~~
VLM
Well, look on the bright side, at least you're not calculating

bucket = hash(x) % buckets

Unless you're stuck supporting non power of two number of shards in which case
you need modulus. And shoving more randomness into the LSB is more important
than the MSB.
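
For the power-of-two case, the two conventions look like this (a sketch, not
from the thread; assumes a 64-bit hash value and a power-of-two bucket count):

    
    
      #include <cstdint>
      
      // Bottom-bits convention: bucket from the LSBs (buckets is a power of two).
      uint64_t bucket_lsb(uint64_t hash, uint64_t buckets) {
          return hash & (buckets - 1);
      }
      
      // Top-bits convention ("subnet style"): bucket from the MSBs.
      // log2_buckets must be between 1 and 63.
      uint64_t bucket_msb(uint64_t hash, unsigned log2_buckets) {
          return hash >> (64 - log2_buckets);
      }
    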

I've thought about it some and I think anyone with a networking background is
automatically going to "subnet" their data from MSB down out of pure raw
habit. Of course you split top down, just like IP addrs.

I've also seen non-technical popular explanations of sharding which
stereotypically use something like phone numbers and start at the most
significant digit.

And a stereotypical test question is that this does not work well with stuff
like customer IDs or transaction IDs, because they tend toward sequential-ish
values, unless you're talking about some kind of data mining thing rather than
daily live transactions.

On the other hand it works well if you shard (sorta) by date, if you use the
right (usually non-native) format. So if you have dates like 2013-10-14 and
shard by YYYYMM somehow, then it could be easy to wipe 201210 from the
database, because whichever shard it's on, it's probably not impacting the
latency figures for 201310 today. Unless you wanted to speed up the delete by
smoothing it across all shards, in which case sharding by hash ends up being
the smart idea.

Trying to do tricky things can turn into a mess when the tricky thing changes
mid project, too.

~~~
aweisberg
This is actually an important topic for me. I am implementing a sharding
distribution based on consistent hashing, using MurmurHash3 as the hash
function.

I am taking the first 4 bytes of the hash function output and using that. I
checked, and MurmurHash3 mixes the first and last 8 bytes of the hash output
as a last step, but I am not sure how much differentiation there is in the
first 4 bytes.

I guess it is something I should check.

------
arek2
>CRC32 collisions

>codding collides with gnu

accident?

------
easytiger
cityhash -
[http://code.google.com/p/cityhash/](http://code.google.com/p/cityhash/)

