Which hashing algorithm is best for uniqueness and speed? (stackexchange.com)
336 points by suprgeek on May 29, 2012 | 104 comments

MurmurHash2, which is pretty great, has some issues:

"MurmurHash2_x86_64 computes two 32-bit results in parallel and mixes them at the end, which is fast but means that collision resistance is only as good as a 32-bit hash. I suggest avoiding this variant."[1]

Murmurhash3 has a 128-bit variant, which might be more along the lines of what he's looking for (the original post mentions SHA256).

1: http://code.google.com/p/smhasher/wiki/MurmurHash3

Computing two 32-bit results in parallel and mixing them at the end does NOT mean collision resistance is only as good as a 32-bit hash. For that, you need to compute ONE 32-bit result, then transform it into a 64-bit result.

Depends whether the two 32-bit hashes are correlated with each other. If there is no correlation then a pair of 32-bit hashes is no more likely to collide than a single 64-bit hash. But this is difficult to achieve, and you should not assume (for example) running the same algorithm twice with different initial states will produce uncorrelated hashes.

Very true.

I thought Google's CityHash - http://code.google.com/p/cityhash/source/browse/trunk/README - was their successor to the MurmurHash variants.

CityHash doesn't have a 32-bit variant to participate in the benchmark.


Question: say you use a hash which returns a 32-bit integer. If you were actually implementing a hash table, would you need to declare a structure with 2^32 elements? `int buckets[2^32]`? This seems unwieldy!

Would an actual hash table only use, say, the first 10 bits or something of a hash function (int buckets[1024]) to make it less sparse (albeit increase collisions?)

If you decide you want more buckets later on, would you have to re-hash everything in your array and move it to a new, bigger one?

Yeah, you've pretty much got it. A common alternative to masking bits off of the hash is to take hash modulo the size of the table as the index (although you have to be careful with a modulo strategy, so as not to introduce a bias towards certain table slots).

There are strategies to make the resize not be so expensive. Wikipedia's page on Hash Tables covers this at http://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing

[edit: clarity in the first paragraph]

If the hash function is any good, simply bitwise-and enough bits from the hash and use that as the index. (It's a modulo too, but simply modulo(2^x).)

Using powers of 2 generally fits nicely with programming on binary computers. It also has the nice property that the number of slots is always a power of 2, so the table can, for example, fit naturally into a single memory page.
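The bitwise-AND trick above can be sketched in a few lines (Python for illustration; the function name is mine):

```python
def bucket_index(h: int, num_slots: int) -> int:
    """Map a hash value to a slot index for a power-of-two table.

    When num_slots is a power of 2, h & (num_slots - 1) gives exactly
    the same result as h % num_slots, without a division.
    """
    assert num_slots & (num_slots - 1) == 0, "table size must be a power of 2"
    return h & (num_slots - 1)

# For a 1024-slot table only the low 10 bits of the hash are used:
assert bucket_index(0xDEADBEEF, 1024) == 0xDEADBEEF % 1024
```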

IIRC, there is a major advantage to using a prime number of slots: it prevents the biasing.

It's to protect against bad hashing algorithms.


Does the modulo operation add a negligible cost to these (very cheap) hashing functions?

i.e. if you're hashing a 10 character (non-unicode) word with FNV-1a you're using 10 multiplications and 10 xors. Adding a modulo (by a prime†) operation in there could feasibly double the time taken.

†Not 2...
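For reference, here is the FNV-1a loop being costed, as a Python sketch (the constants are the published 32-bit FNV offset basis and prime):

```python
FNV_OFFSET_BASIS = 2166136261
FNV_PRIME = 16777619

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: one XOR and one multiply per input byte."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the value 32-bit
    return h

# Hashing a 10-byte word costs 10 XORs and 10 multiplies:
h = fnv1a_32(b"algorithms")
```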

The modulo function is pretty much required unless you have 2^32 bytes of RAM available for each hashtable. Modulo is also ridiculously cheap, especially compared to the hashing function.

Certainly, if you are hashing a large amount of data then a single modulo operation isn't going to cost a lot. But if you're just hashing an English word, possibly even a URL, I'm not so sure.

You could shift or mask the result (as described above) to use a table size of any power of 2 to avoid the modulo.

Doing constant modulo on an integer is very fast - we're talking a few CPU operations.

It won't be a compile-time constant unless your hash table is incapable of resizing. And modulo arithmetic on arbitrary values is slow (20-94 cycles of latency on Sandy Bridge).

Hmm, good point. Is that also true for JIT compilers?

Yes, but bitwise AND is faster.

Cross that bridge when you get there. If you're really finding that your hash function is the slowest piece of code in a tightly bound for-loop, and it's not all the collisions you're having (from having a bad hashing function), then look into alternatives. Before that, you're doing premature optimization.

Please read the whole thread before replying with the "premature optimization" thing. We are talking about hash table optimizations; this commenter -- http://news.ycombinator.com/item?id=4037320 -- is asking about the cost of modulo operation.

IIRC, changing a modulo to an AND roughly doubled the performance of my hashtable.
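A rough shape of that experiment (hedged: this Python version mostly measures interpreter overhead, so the ratio will be far smaller than in compiled code, where the doubling was presumably observed):

```python
import timeit

h, size = 0x9E3779B9, 1024  # size must be a power of two
mask = size - 1

# Both expressions compute the same slot index:
assert h % size == h & mask

t_mod = timeit.timeit(lambda: h % size, number=100_000)
t_and = timeit.timeit(lambda: h & mask, number=100_000)
print(f"modulo: {t_mod:.4f}s  AND: {t_and:.4f}s")
```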

(1) The trick for resizing is, we only rehash all k elements to recalculate buckets every k inserts or so, and we double the hash table size. So, if you imagine that you just came from a resize, you had k elements and had performed N(k) hashes; then when you hit 2k elements you will have to resize again, and you will have performed N(2k) = N(k) + k + 2k total hashes. This recurrence is solved by N(k) = 3k + C for an arbitrary constant C. Averaged over the k elements you have inserted, it's easy to see that for very large dictionaries you only hash each element 2-3 times on average: three times if you trigger a resize with k, two times if you come in just before triggering the resize.

(2) Strictly speaking you don't need this overhead and you can use trees to keep it sparse as well, although as far as I know the low-n overhead slows it down too much in practice when compared to the high-n space efficiency. That is, you could declare a binary tree structure with only those buckets and pathways of the 2^32 which you need, but to find something within the structure requires checking and following ~32 pointers and storing something requires creating ~32 nodes.
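The doubling scheme from point (1) can be sketched as a toy table that counts hash invocations (illustrative Python; all names are mine):

```python
class SimpleHashTable:
    """Toy chained table that doubles when the load factor hits 1.

    Each element is hashed once on insert and once per resize, but
    doubling keeps the average number of hashes per element at 2-3.
    """
    def __init__(self, initial_slots=8):
        self.slots = [[] for _ in range(initial_slots)]
        self.count = 0
        self.hash_ops = 0  # counts calls to the hash function

    def _index(self, key, num_slots):
        self.hash_ops += 1
        return hash(key) & (num_slots - 1)  # power-of-two table

    def insert(self, key):
        if self.count >= len(self.slots):   # load factor 1: resize
            self._resize(2 * len(self.slots))
        self.slots[self._index(key, len(self.slots))].append(key)
        self.count += 1

    def _resize(self, new_size):
        new_slots = [[] for _ in range(new_size)]
        for bucket in self.slots:
            for key in bucket:              # every element rehashed once
                new_slots[self._index(key, new_size)].append(key)
        self.slots = new_slots

t = SimpleHashTable()
for i in range(10_000):
    t.insert(i)
print(t.hash_ops / t.count)  # 2.6376 for these parameters
```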

It depends on your performance use case. 2^32 is only about 540MB of RAM so you can allocate that all at once and then all subsequent actions are read/writes.

> 2^32 is only about 540MB of RAM

Yes, if you're only storing a single bit for each. Usually, you'll want to store a pointer to the head of a linked list. At 64 bits per pointer, you'll need to allocate 32 gigs of RAM - slightly less feasible.

Yeah, sure, you could waste all the space you want, but why in the world would you do that? Using a linked list is slow and inefficient with memory, especially when you can do the same thing with a byte array.

Look up an algorithm called a Bloom Filter.

Bloom filters are not suitable as stand-in replacements for hash tables when you are representing the Set abstract data type. They are not deterministic, they treat objects with the same hash as identical, and they do not actually store values.

Bloom filters are only useful (in this context) to augment a proper Set implementation, in which case they increase the memory cost, in exchange for decreasing (on average) the I/O cost.

What about bloom filter + sparse hash? You'd be able to see whether an object isn't in the hash table efficiently and pay the price of a longer lookup time. Could be useful for some situations with tiny true positive rates; say, a web browser's list of malwared sites.

FNV despite much popularity is a relatively poor quality hash. Murmur2 has a major flaw, hence Murmur3. CRC32 is slow. Not mentioned in the post, but if you're thinking Fletcher or Adler, they have terrible distribution. For a fast 32-bit hash, much better to go with Bob Jenkin's one-at-a-time hash (http://en.wikipedia.org/wiki/Jenkins_hash_function#one-at-a-...) which is simpler than Murmur3 and displays much better avalanche characteristics than the other hashes.
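For reference, a direct Python transcription of the one-at-a-time hash linked above (masking to keep 32-bit overflow semantics):

```python
def one_at_a_time(data: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash, 32-bit result."""
    h = 0
    for byte in data:
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h

# A one-byte difference changes bits throughout the 32-bit result:
assert one_at_a_time(b"hello") != one_at_a_time(b"hellp")
```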

According to this site[1], CrapWow seems to be the best in raw performance: http://www.team5150.com/~andrew/noncryptohashzoo/CrapWow.htm... (Crap8 in second place and Murmur3 in third; one-at-a-time performs quite badly in this guy's benchmarks)

Performance graphs: http://www.team5150.com/~andrew/noncryptohashzoo/speed.html

[1] http://www.team5150.com/~andrew/noncryptohashzoo/

What about its randomness compared to Murmur2/3?

Why don't I see any mentions of the latest research? There are people out there who do this for a living, people!

I find stuff like this terribly interesting, so please elaborate. Any journal or so that is recommended for someone curious?

I'm not active in the field right now, so I can't give specifics (the comment was more borne out of frustration with people putting in so much effort 'reinventing the wheel' in an academic sense).

But if I was you, I'd start with Knuth (he, as another commenter mentioned, covers hashes in great detail), head to the references, and then use Google Scholar to find well-cited recent articles that reference the important papers mentioned there.

Unfortunately though, a lot of the papers that turn up in Google Scholar are behind paywalls. It is very expensive to get hold of academic papers if you don't work/study somewhere with an institutional licence (anything from $10 upwards per paper).

Yup, that's the world of academic publishing unfortunately. A tip: often you can get a 'preprint' copy of the paper from one of the authors' websites. Probably not strictly legal, but it does happen a lot.

Thanks very much! Might be finally time to cough up the dough for TAOCP then and take it from there.

Money well spent, I'm sure!

I get the impression this is meant for more practical drop-in use for your hashtables - in which case the latest research is doubly unacceptable as it's still new and unreviewed, and there may be no suitable implementation (eg. a C binding in your distro's stable release).

If you think this is the sort of answer that Stack Overflow should strive for, note that the author edited it too many times and it fell into "Community Wiki" status. Until this was reverted in a manual process, the author didn't receive any points for his answer.

Well, not entirely. Reverting the CW restores the rep, subject to the daily rep limit. The answer's been given 3 separate bounties (100, 100, 50), which are not affected by CW.

1) It still required manual intervention from a moderator.

2) I'm certainly not an expert on how SE distributes its points, but the moderator comment in the related meta post (http://meta.programmers.stackexchange.com/questions/3527/rem...) states "I don't think there is a way to refund the reputation the answer gained while it was CW [...]"

Community Wiki is immortality.

Why did he omit the standards (MD5 and SHA1) from the comparison?

These are not cryptographic hashes. The comparison would be unfair, as cryptographic ones are rather slow. These are used in structures like hash tables or Bloom filters, etc.; they need to be very fast and provide reasonable randomness (low collision rates). But their collision rates are very high compared to, say, SHA1.

Still it would be useful to see them as a point of reference - to determine whether it's worth bothering with anything else in first place.

I thought MD5 and SHA-1 were both designed as cryptographic hashes but now deprecated for this purpose - although SHA-1 is still used in applications such as git as a general purpose hash function.

Neither are recommended for modern security needs, but MD5 is way, way more broken than SHA1. As far as I know, no SHA1 collision has ever been found, whereas any cs undergrad could implement arbitrary MD5 collisions using some work done by a Chinese team a few years back. SHA1 is deprecated because of some theoretical attacks which lower the complexity of finding a collision without really making it a tractable problem.

It'd still be interesting, particularly in a world where hardware is picking up the load for things which SSL uses. I'd expect the answer to be something along the lines of “more competitive as the data size increases” but it'd be interesting to know how great the margin is and whether the answer shifts significantly when comparing, say, a conventional hash to a cryptographic one on an embedded chip.

There are, however, applications with both hash tables and bloom filters where some cryptographic properties are still useful/necessary, or where low-collision hashes are particularly important.

>@Orbling, for implementation of a hash dictionary. So collisions should be kept to a minimal, but it has no security purpose at all. – Earlz

SHA-1 is very fast though so it is a good point for comparison.

> SHA-1 is very fast though

SHA-1 is fast for a cryptographic hash function, but it's orders of magnitude slower than "hashmap" hashes. crc32 is more than an order of magnitude faster, and as you can see in TFA's table it's ~half the speed of the best non-crypto hashes.

SHA-1 was designed as a cryptographic hash function, and those are purposely slow. No, SHA-1 is not as fast as the functions from the article.

You've got it backwards; cryptographic hash functions are designed to be as fast as possible without giving up their cryptographic properties. If you need a slow hash (e.g. for password storage), you use something like bcrypt that's designed to be slow.

Donald Knuth's Art of Computer Programming Volume 3 has an excellent exposition on hash functions and hash tables in general. If you can get yourself a copy I would highly recommend the read.

Honest question: why not use 2-3 different hash algorithms to minimize collisions if you are simply after uniqueness (e.g.: verifying that a downloaded file is correct)?

As for hash tables, can't you guarantee uniqueness by using only reversible operations? I remember reading a lengthy post about this on HN, but can't find the link anymore.

> As for hash tables, can't you guarantee uniqueness by using only reversible operations?

This is called a perfect hash and its appropriateness depends on the input data.

The trouble is that the problem domains where hashes are practical are those where a very large space of inputs needs to fit into a finite number of buckets, so that lookups are fast.

For that to work with a perfect hash, you need an infinite number of buckets which needs infinite space.

A perfect hash is able to avoid collisions when given the set of all possible keys in advance; it is not related to reversibility. The latter contradicts the very idea of a hash function, and would conceptually be a lossless compression technique.

Any reversible hash would be a perfect hash - not the other way around. That's all I'm saying.

That said, there's nothing in the definition of hash functions that require them to be compressing or non-reversible - although they would typically have to be to be useful.

I agree with the first point, but I think compressing and non-reversible are necessary conditions for a given function to be called a hash function; if they weren't, any mathematical function would do, wouldn't it?

One could see a hash function as an (extremely) lossy compression method. However, lossy compression only makes sense when you can exploit features of the domain, e.g., psychoacoustics with sound, or characteristics of human vision with photos; perceptual hashes come to mind here.

It's really quite simple. Perfect hashes exist and are hashes. Perfect hashes do not, as a matter of definition, compress. They typically aren't reversible, because that's not a useful feature for a hash, but they could be; and if nothing else, they can always be deterministically brute-forced (as they have no collisions), which is a (very bad) form of reversibility.

It's not meaningful to study hashes as compression, since compression is by definition reversible.

>As for hash tables, can't you guarantee uniqueness by using only reversible operations?

Wouldn't that make the output almost as long as the input, defeating the purpose?

actually, just as long.

Any set of possible bits is a valid input, so the input has one bit of entropy per bit. If the hash were not "just as long" then one of the following must be true:

- it collides (two possible inputs would have the same hash)

- some inputs could not hash

- the algorithm encodes more than one bit of entropy per bit.

This is not possible by the pigeonhole principle. http://en.wikipedia.org/wiki/Pigeonhole_principle

There's probably a more rigorous proof here but I'm lazy. I guess something like: imagine the input is going to be n bits (like 16 bits). Make an array of length 2^n and initialize every entry to -1. Then for each possible input (0..2^n-1), hash it; since the hash is not longer than the input, the hash's representation fits in the array if you simply interpret it as the index (subscript). Set that entry to the n that produced it, if it's not already set (otherwise abort with a collision).

After you have reached 2^n-1, you have set 2^n indices (including whatever hash 0 produced). Is it possible for you not to have made a collision? (Yes, easily: for example, anything hashes to the next integer, except binary all 1's, which hashes to all 0's.) Could some of the indices be empty? No. If you put 2^n values in 2^n holes distinctly, there is one value per hole. Conversely, if you had anything LESS than that many holes to put them into (for example, you're trying to hash 32 bits of input into just the lower half of the indices, by creating a hash that uniquely hashes to 16 bits of output), then by the pigeonhole principle you have to put two n's into the same index.

So, any hash that would be unique (that can hash anything) has to be at least exactly as long as what it's hashing. If there are any outputs that are never hashed to, it would have to be even longer.

Note that this proof depends on iterating on every possible input. If some inputs aren't possible, the hash could be unique while being shorter than the input. This is how lossless compression works.
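For tiny bit-widths the argument can be checked exhaustively; a sketch (the wrap-around "next integer" hash is the bijection from the comment above):

```python
def collision_free(hash_fn, in_bits: int, out_bits: int) -> bool:
    """True iff hash_fn maps every in_bits-wide input to a distinct
    out_bits-wide output. By the pigeonhole principle this can only
    hold when out_bits >= in_bits."""
    seen = set()
    for x in range(2 ** in_bits):
        h = hash_fn(x) & ((1 << out_bits) - 1)
        if h in seen:
            return False
        seen.add(h)
    return True

# "Anything hashes to the next integer, all-1s wraps to 0" is a
# bijection, so it is collision-free at equal width:
assert collision_free(lambda x: (x + 1) % 256, 8, 8)

# But no function maps 8-bit inputs into 7 bits without a collision;
# truncation, for instance, collides:
assert not collision_free(lambda x: x, 8, 7)
```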

>actually, just as long

Well, more accurately, at least as long, on average.

It would be possible, though, to have a guaranteed unique string which is usually shorter than the input for real world data, which is called lossless compression. Obviously, it is not possible to provide a fixed digest length though, which makes it useless for many hash function applications, like bloom filters. And, of course, a hash table implementation based on DEFLATE would be hilariously inefficient.

Fair enough. The proof is that you can't shave even a single bit off of every output, i.e. given any file that's 1024 bytes, hash it to 1023 bytes. Not possible unless you're hiding that entropy somewhere (filename etc).

The proof is straightforward application of the pigeonhole principle. Input strings are your pigeons, hashes are your holes.

what you've just said says nothing about the length of the hashes, which was my point. You would have to say something like "all possible inputs of a certain length are your pigeons, and these same possibilities(1) are the holes"

(1)this time interpreted as the hash of one and only one of the same list of possible inputs

The proof is not so straightforward, because it's not always true when you expect it to be true. For example, can you hash, reversibly and uniquely, all the numbers between 0 and 10 (real) to all the numbers between 0 and 1?

You would expect the answer to be "no" since there are "more real numbers from 0 to 10 than from 0 to 1". But that's actually not necessarily true the way you might expect, and the answer is actually "yes you can". Simply geometrically establish a coordinate plane like this


   a
   0 b 1

   0       10

so that point (a) has a well-defined coordinate, say (0,1); the 0 from the second line is at the origin; the 1 from the second line is at (1,0); and the bottom 0 is at (0,-1), with the bottom 10 being wherever the ray passing through a and through (1,0) (enough to uniquely identify the ray) has the y-value of -1.

Now to get the hash simply move 'b' to wherever between 0 and 1 you're trying to hash, solve for the ray that passes through a and that point, and solve for the x-value of the ray where it has y-value of -1.

Through an elementary proof which should be obvious, you'll see that any b casts a distinct ray (which is easily reversible by introducing a c into line 3), and you can solve for it if you like.

I'm not saying that you're wrong, but I think the proof isn't as "obvious" and straightforward as you're saying. It doesn't apply to just anything, but to the specific case of the impossibility of hashing distinct length inputs uniquely into a discrete space defined by a uniformly shorter length of bits.

From wikipedia, "A hash function is any algorithm or subroutine that maps large data sets of variable length, called keys, to smaller data sets of a fixed length."

Given the pigeonhole principle, and the observation that there are fewer strings of length K than strings of any length, you have that you cannot map every string into a unique hash. That's what I meant by "a straightforward application of the pigeonhole principle".

If you're dealing with an infinite set of hashes (I'm not sure what that would look like, but hey), then of course you need to take into account the limitations of the pigeonhole principle when dealing with infinite sets - specifically, that it only applies if you have a set of pigeons of a larger cardinality than your set of holes. [0,1] and [0,10] have the same cardinality.

>You would expect the answer to be "no" since there are "more real numbers from 0 to 10 than from 0 to 1"

Well, the quotes you had to put there say it all. You might expect the answer to be "no" if you didn't understand how infinity works.

I was indeed thinking of lossless compression: In practice it compresses almost all its input. If the the output length does not have to be constant, it's fine that some input actually gets inflated. As long as it compresses the average input.

But in this case the output length might have to be constant? If that's the case, you are of course correct.

Lossless compression doesn't compress random data. You will still have the same number of bits in and out.

Who says the input is necessarily random? ;)

Computing two 32-bit hashes is the same as computing 1 64-bit hash. In general, computing n m-bit hashes is the same as computing 1 n*m-bit hash.

Any digest algorithm that takes variable-sized data as input and produces fixed-sized output will necessarily produce collisions for some inputs that are larger than the output.

I wish he provided the architecture he ran on. Nehalem and later cores have crc32 instructions that are quite zippy.

Very interesting, thanks for sharing.

"CRC32 collisions: codding collides with gnu". At first I read "coding" and I was all "haha this must be an easter egg of the implementation".

The irony is that codding [British English] means a practical joke or trick on someone. :)

How do folks generally take an autoincrementing database ID and generate a hash to be used "in public" when trying to avoid revealing obviously serial numbers? I don't think I need something airtight, in fact in one system we just multiplied/divided by a 4 digit prime number. While this worked fine, it seemed a little loose.

HASH(<id> + "static string pad") is how I imagine most people do it. Depends how much you care about a dedicated attacker figuring out that ID.
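A hedged sketch of that idea, using HMAC rather than a bare HASH(id + pad) so the static pad acts as a proper secret key (the pad value and the truncation length are made up):

```python
import hashlib
import hmac

SECRET_PAD = b"static string pad"  # hypothetical app-wide secret

def public_id(db_id: int) -> str:
    """Derive an opaque public identifier from an auto-increment ID."""
    digest = hmac.new(SECRET_PAD, str(db_id).encode(), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for URL friendliness

# Sequential IDs map to unrelated-looking public IDs:
assert public_id(42) != public_id(43)
```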

If this is a thing you're going to use a lot, I'd probably just add a database column and give each record a random unique "public ID" -- then there is literally no connection between the public ID and the private one..

Look into format-preserving encryption. I wrote something that allows you to generate permutations for an arbitrary sized range. With it, there are no collisions.

If you're doing 32-bit hashes in Javascript and are willing to trade a bit, then a 31-bit hash may be at least an order of magnitude faster due to VM implementation: https://groups.google.com/d/msg/v8-users/zGCS_wEMawU/6mConTi...

There is a python wrapper for murmur3 here : http://stackoverflow.com/a/5400389

Going by just the subject, maybe a perfect hash function generated with eg. GNU gperf?

isn't slow better in this case? I mean if it's fast to generate, it's fast to crack, right?

> isn't slow better in this case?

No, guy's looking for a hash table hash, not a cryptographic one. For a hash table you're looking for low collisions and high throughput (so your hash table is fast) first and foremost.

Even for cryptographic hashes, fast is what you want most of the time. Look at it this way: when you connect to a web site via SSL, all the data you send will be hashed for authentication. Do you really want this to be slow?

Cryptographic hashes should be as fast as possible while not sacrificing collision resistance. Hashtable hashes should be as collision resistant as possible while not sacrificing speed.

Sort of anyway.

For cryptographic purposes you usually want fast digests and slow key derivation functions.

The person specifically requested an algorithm for a hash table. Nobody is going to attempt to crack that.

nobody? This exact problem was a massive issue within the year.


Using a slower hash would make the issue worse (linearly): the collisions would still be there, but now each insertion would take even more time due to the extra computational cost of the hash.


EDIT: parent basically said "use a cryptographic hash".

You need some kind of random salt, too - it's not that hard to find a decent number of near-collisions even in cryptographic hashes. (E.g. if your hash table doubles in size whenever two keys end up in the same bucket, an attacker needs to spend O(2^n) operations to find two keys that agree in the first n bits, which will make you do O(2^n) operations and use O(2^n) memory.)
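One simple form of that random salt: seed the hash's initial state per process (a sketch layered on 32-bit FNV-1a; real mitigations such as keyed SipHash are more careful than this):

```python
import os

_SEED = int.from_bytes(os.urandom(4), "little")  # per-process random salt

def seeded_fnv1a(data: bytes, seed: int = _SEED) -> int:
    """FNV-1a whose starting state is XORed with a secret seed, so an
    attacker can't precompute keys that collide in your table."""
    h = (seed ^ 2166136261) & 0xFFFFFFFF
    for byte in data:
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

# Stable within one process, but bucket assignments differ across runs:
assert seeded_fnv1a(b"key") == seeded_fnv1a(b"key")
```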

> No, using a slower cryptographic hash would produce better distribution

That absolutely isn't an intrinsic property of cryptographic hashes.

Using a faster cryptographic hash would be better.

Not if you're not looking for it to be hard to crack. Suppose you're generating quick checksums, using a hashtable, indexing non-malicious data, etc. In those cases you want a very fast hash function, and you're not worried about folks being able to find collisions for existing hashes, or create multiple values with the same hash.

There are times you want something where cracking isn't a concern, e.g. to randomly (but consistently) partition data (i.e. a hash table).

Given Moore's law and the prevalence of botnets and cloud computing - hash speed really isn't a very strong means of defence against cracking...

Also see: http://cyberarms.wordpress.com/2010/10/21/cracking-14-charac...

Uh? Hash speed is pretty much the only means of defense against brute-force cracking (assuming the hash function isn't broken and the attacker has no other vector available).

(low) hash speed is the whole point of PBKDF2, bcrypt or scrypt.

Can we stop calling them hashes? They're key derivation functions. I know it's a long, awkward term, but a lot of people in this thread are confusing the technical requirements for cryptographic hashes and KDFs, so I think that being precise about this is warranted.

But talking about salt in our hash sounds delicious.

Talking about stretching our hash, though, sounds like some strange tasting taffy...

You'll find that adding a long salt, unique to each password, is a far more effective protector against brute force than "hash speed".

See this: http://net.tutsplus.com/tutorials/php/understanding-hash-fun... and: http://stackoverflow.com/questions/1191112/password-hashing-...

to learn more about preventing brute-force attacks

> You'll find that adding a long salt, unique to each password, is a far more effective protector against brute force than "hash speed".

No, I'll find that you've apparently stopped your education at hashing 101 and hashing speed is covered in hashing 102.

1. It's not an either-or situation, all three hashes I mentioned not only specifically include salts in their hashing interface but go as far as mandating a salt for brypt and scrypt (I don't believe PBKDF2 mandates one though it surely is recommended)

2. Salts don't protect much against brute-force attacks; their primary role is to protect against rainbow tables (dictionary attacks) in case the hashed data is leaked. They do add cost to brute-forcing a series of passwords (if the salts are unique), but that cost is low compared to the cost a deliberately slow hash imposes on every guess.

3. Hash speed mitigates brute-forcing from both inside (leaked data) and outside (provide built-in rate-limiting, of course additional rate-limiting is still a good idea). Increasing the raw cost of computing the hash by multiple orders of magnitude and adding data dependencies within the hash (to prevent trivial parallelization) are very much vital against brute-force attacks.

4. You might have wanted to read your own bloody links, the first one specifically mentions hash speed to mitigate brute force attacks, and mentions salts only against rainbow tables; responses to the second one specifically note that salt mitigate rainbow table attacks but do not significantly mitigate brute force attacks especially on weak passwords.
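Points 1-3 above in stdlib form: PBKDF2 takes both a unique salt and an iteration count (the deliberate-slowness knob). A sketch only; the iteration count here is illustrative, not a recommendation:

```python
import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 100_000):
    """Salted, deliberately slow password hash (PBKDF2-HMAC-SHA256).

    The unique salt defeats rainbow tables; the iteration count is the
    "hash speed" defence against brute force. Both are needed.
    """
    salt = os.urandom(16)  # unique per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest

def check_password(password: str, salt: bytes, digest: bytes,
                   iterations: int = 100_000) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, digest)
```

bcrypt and scrypt add memory hardness on top of this; PBKDF2 is just what ships in the standard library.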

If you did your research, you'd know that salts are insufficient protection. Heck, read your own links: the first one lists as #7 that you need to use a slow hash function. Or you can just read http://codahale.com/how-to-safely-store-a-password/ which lays it out in black and white.

You should be glad. You learned something important today.

