
Hash Functions all the way down - based2
http://aras-p.info/blog/2016/08/02/Hash-Functions-all-the-way-down/
======
aappleby
It'd be interesting to see how the 128-bit Murmur versions compare.

Something not mentioned in the "Quality" section is that FNV and djb2 produce
may produce decent results on words/filenames/source code, but they can fail
catastrophically (produce far more collisions than expected) on certain
patterns of keys - SMHasher is designed not just to check performance, but to
try and find patterns of keys that 'break' a hash function.

Murmur3 tries to be short, simple, portable, and efficient on small keys while
still passing SMHasher. xxHash, Spooky, and the Farm/City/Metro suite all use
a lot of loop unrolling and/or platform-specific optimizations to get the
highest possible performance when hashing large chunks of data.

-Austin (author of Murmur and SMHasher)

~~~
_dps
First, thanks for all your work on Murmur and SMHasher!

I wonder if you have any recommendations for a one-byte-at-a-time hash that
survives SMHasher? I've been using Jenkins OAAT and haven't bumped into any
problems but I've seen conflicting reports about its robustness. Any thoughts
on Jenkins OAAT or an alternative one-at-a-time hash would be appreciated :)

~~~
aappleby
If you can only hash data a byte at a time, you have bigger problems. ;)

I have no real suggestions, but it would be trivial to feed Murmur one byte
per loop iteration and then ditch the 'tail' section.

------
caf
The "ZOMGBadHash" isn't unique to that codebase - it's the hash specified by
the ELF ABI (see
[http://www.sco.com/developers/gabi/latest/ch5.dynamic.html#h...](http://www.sco.com/developers/gabi/latest/ch5.dynamic.html#hash)
).

It's not just a random mush of xors and shifts, it's:

    
    
      - Add the next character value to the current value;
      - Left-shift the current value by 4 (because most of the entropy in plain text ASCII is in the lower 4 bits);
      - XOR the top 4 bits into the bottom four bits of the current value, because they're going to get shifted out next time; then clear the top four bits.
    

It's still not a great hash function (obviously). Part of the issue is that
with a 32-bit unsigned int it's effectively only a 28 bit hash.

------
based2
[https://www.reddit.com/r/programming/comments/4vxtyz/hash_fu...](https://www.reddit.com/r/programming/comments/4vxtyz/hash_functions_all_the_way_down/)

~~~
the_duke
What's the point of linking a Reddit submission of the same article?

~~~
fegu
The comments, probably.

------
Animats
From the title, I was expecting either a discussion of the CPython
implementation or an associative memory system.

It would be amusing to have a system with hardware associative memory, where
everything is a key-value tuple. Such things have been built (most of an MMU
is such an associative memory) but never caught on.

------
gregschlom
If anyone isn't familiar with him, Aras is one of the key architects behind
the Unity game engine. An excellent hacker. Whenever I see his blog come up, I
know it's going to be some awesome technical stuff.

------
MichaelGG
Great article.

1\. This seems to disagree with Rust's tests for FVN vs xxHash.[1] This
article has xxHash winning a bit after 10 bytes. The Rust one has FVN winning
until after 32 bytes. Rust recently added FVN. Is someone wrong or am I
missing something?

2\. Why is 64-bit slower on asm.js? If the host is 64-bit shouldn't it get
64-bit? I know JS doesn't have proper ints, but I thought asm.js-aware
browsers were supposed to handle this well?

Curious: Is there no uniform, fast, way to switch the algorithm for small
inputs? Require the caller to hint that the key might be small (<8 or 16
bytes), otherwise in streaming mode, just restart if it turns out you only get
<16 bytes. Or will that branching+code kill the gain?

1: [http://cglab.ca/~abeinges/blah/hash-
rs/](http://cglab.ca/~abeinges/blah/hash-rs/)

~~~
rurban
as 1) SMHasher is not the right test for performance for a hash function in a
hash table. In a hash table FNV will always win over xxHash and all those file
digests, which are optimized for aligned larger blocks. SMHasher is a good
test for network throughput or hashing files, but not for hash tables.

The reason is that an easily short inlinable hash function like FNV will
always win over a big non-inlinable digest like Murmur, xxHash, Farm, Metro
... Furthermore, even if small worse hash functions produce more collisions,
their small icache size will still outperform bigger functions with less
collisions. More work, but less cache misses. For small hash tables it is even
faster to do a linear search because of cache advantages, and also a linked-
list of buckets is better be written as array. ("Cache-Conscious Collision
Resolution in String Hash Tables" by Nikolas Askitis and Justin Zobel,
Melbourne 2005)

ad branching) Google successfully experimented with hash variants based on
alignment and size. Newer CPU's explore those branches beforehead, so the
overhead was not that big and worth it. See their sparse hashtable.

------
arielweisberg
This really needs to take into account hardware instruction support. The
hardware instructions are quite fast.

------
JosephRedfern
I'm surprised how fast the iPhone CPU is compared to the Xbox One CPU (for
this application).

~~~
aras_p
PX4/XB1 use a fairly low-end AMD CPU (Jaguar core, ~1.6GHz). Apparently it
also does not quite appreciate heavy use of 64 bit multiplies (as used in
xxHash64 for example).

Modern phones (e.g. iPhone 6) in terms of CPU are actually very impressive
indeed!

------
edwintorok
How about Siphash?

------
qwertyuiop924
Wait, don't you want minimum acceptable throughput on security-related
hashing, so that it becomes less likely that somebody can just brute-force it?

~~~
xxs
Not everything hashish ends up being salted passwords. Hashing is great for
just hashtables and partioning.

~~~
qwertyuiop924
Ah, that makes sense. Thanks.

------
ashitlerferad
How did they get to a place where there are multiple copies of the same
external codebase? Shared libraries FTW.

~~~
aras_p
Something like FNVHash is so small (like a handful of lines), that it's
extremely easy to just copy-paste the function into some place, without
realizing it's already copy-pasted somewhere else.

------
Zardoz84
it would be interesting check against Java hash implementation. I remember
that Java hash was "bad"

~~~
ucho
Was it? The only issue I remember was with String before Java 1.2, where only
few characters where used in hash (instead of all) leading to abysmal
performance in many applications.

