MurmurHash2, which is pretty great, has some issues:
"MurmurHash2_x86_64 computes two 32-bit results in parallel and mixes them at the end, which is fast but means that collision resistance is only as good as a 32-bit hash. I suggest avoiding this variant."
Murmurhash3 has a 128-bit variant, which might be more along the lines of what he's looking for (the original post mentions SHA256).
Computing two 32-bit results in parallel and mixing them at the end does NOT mean collision resistance is only as good as a 32-bit hash. For that, you need to compute ONE 32-bit result, then transform it into a 64-bit result.
Depends whether the two 32-bit hashes are correlated with each other. If there is no correlation then a pair of 32-bit hashes is no more likely to collide than a single 64-bit hash. But this is difficult to achieve, and you should not assume (for example) running the same algorithm twice with different initial states will produce uncorrelated hashes.
Question: say you use a hash which returns a 32-bit integer. If you were actually implementing a hash table, would you need to declare a structure with 2^32 elements? `int buckets[2^32]`? This seems unwieldy!
Would an actual hash table only use, say, the first 10 bits or something of a hash function (int buckets) to make it less sparse (albeit increase collisions?)
If you decide you want more buckets later on, would you have to re-hash everything in your array and move it to a new, bigger one?
Yeah, you've pretty much got it. A common alternative to masking bits off of the hash is to take hash modulo the size of the table as the index (although you have to be careful with a modulo strategy, so as not to introduce a bias towards certain table slots).
If the hash function is any good, simply bitwise-and enough bits from the hash and use that as the index. (It's a modulo too, but simply modulo(2^x).)
Using 2's powers generally fits nicely with programming on binary computers. It also has the nice quality of producing number of slots that are always 2's powers and you can shove those naturally for example into a single memory page.
Cross that bridge when you get there. If you're really finding that your hash function is the slowest piece of code in a tightly bound for-loop, and it's not all the collisions you're having (from having a bad hashing function), then look into alternatives. Before that, you're doing premature optimization.
Please read the whole thread before replying with the "premature optimization" thing. We are talking about hash table optimizations; this commenter -- http://news.ycombinator.com/item?id=4037320 -- is asking about the cost of modulo operation.
(1) The trick for resizing is, we only rehash all k elements to recalculate buckets every k inserts or so, and we double the hash table size. So, if you imagine that you just came from a resize, you had k elements and had performed N(k) hashes, then when you hit 2k elements you will have to resize again, and you will perform N(2k) = N(k) + k + 2k total hashes. This recurrence is solved by N(k) = 3k + C for an arbitrary constant C. Averaged over the elements you have inserted k, it's easy to see that for very large dictionaries, you only hash each element on average 2-3 times -- three times if you trigger a resize with k, 2 times if you come in just before triggering the resize.
(2) Strictly speaking you don't need this overhead and you can use trees to keep it sparse as well, although as far as I know the low-n overhead slows it down too much in practice when compared to the high-n space efficiency. That is, you could declare a binary tree structure with only those buckets and pathways of the 2^32 which you need, but to find something within the structure requires checking and following ~32 pointers and storing something requires creating ~32 nodes.
Yes, if you're only storing a single bit for each. Usually, you'll want to store a pointer to the head of a linked list. At 64 bits per pointer, you'll need to allocate 32 gigs of RAM - slightly less feasible.
Yeah sure you could waste all the space you want why in the world would you do that? Using a linked list is slow and inefficient with memory usage especially when you can do the same thing with a byte array.
Bloom filters are not suitable as stand-in replacements for hash tables when you are representing the Set abstract data type. They are not deterministic, they treat objects with the same hash as identical, and they do not actually store values.
Bloom filters are only useful (in this context) to augment a proper Set implementation, in which case they increase the memory cost, in exchange for decreasing (on average) the I/O cost.
What about bloom filter + sparse hash? You'd be able to see whether an object isn't in the hash table efficiently and pay the price of a longer lookup time. Could be useful for some situations with tiny true positive rates; say, a web browser's list of malwared sites.
FNV despite much popularity is a relatively poor quality hash. Murmur2 has a major flaw, hence Murmur3. CRC32 is slow. Not mentioned in the post, but if you're thinking Fletcher or Adler, they have terrible distribution. For a fast 32-bit hash, much better to go with Bob Jenkin's one-at-a-time hash (http://en.wikipedia.org/wiki/Jenkins_hash_function#one-at-a-...) which is simpler than Murmur3 and displays much better avalanche characteristics than the other hashes.
I'm not active in the field right now, so I can't give specifics (the comment was more borne out of frustration with people putting in so much effort 'reinventing the wheel' in an academic sense).
But if I was you, I'd start with Knuth (he, as another commenter mentioned, covers hashes in great detail), head to the references, and then use Google Scholar to find well-cited recent articles that reference the important papers mentioned there.
Unfortunately though, a lot of the papers that turn up in Google Scholar are behind paywalls. It is very expensive to get hold of academic papers if you don't work/study somewhere with an institutional licence (anything from $10 upwards per paper).
Yup, that's the world of academic publishing unfortunately. A tip: often you can get a 'preprint' copy of the paper from one of the authors' websites. Probably not strictly legal, but it does happen a lot.
I get the impression this is meant for more practical drop-in use for your hashtables - in which case the latest research is doubly unacceptable as it's still new and unreviewed, and there may be no suitable implementation (eg. a C binding in your distro's stable release).
If you think this is the sort of answer that Stack Overflow should strive for, note that the author edited it too many times and it fell into "Community Wiki" status. Until this was reverted in a manual process, the author didn't receive any points for his answer.
These are not cryptographic hashes. Comparison would be unfair as cryptographic ones are rather slow . These are used in structures like Hash tables or Bloom Filters etc. they need to be very fast and provide reasonable randomness (low collision). Bu their collision rates are very high comparing to say SHA1.
Neither are recommended for modern security needs, but MD5 is way, way more broken than SHA1. As far as I know, no SHA1 collision has ever been found, whereas any cs undergrad could implement arbitrary MD5 collisions using some work done by a Chinese team a few years back. SHA1 is deprecated because of some theoretical attacks which lower the complexity of finding a collision without really making it a tractable problem.
It'd still be interesting, particularly in a world where hardware is picking up the load for things which SSL uses. I'd expect the answer to be something along the lines of “more competitive as the data size increases” but it'd be interesting to know how great the margin is and whether the answer shifts significantly when comparing, say, a conventional hash to a cryptographic one on an embedded chip.
SHA-1 is fast for a cryptographic hash function, but it's orders of magnitude slower than "hashmap" hashes. crc32 is more than an order of magnitude faster, and as you can see in TFA's table it's ~half the speed of the best non-crypto hashes.
You've got it backwards; cryptographic hash functions are designed to be as fast as possible without giving up their cryptographic properties. If you need a slow hash (e.g. for password storage), you use something like bcrypt that's designed to be slow.
A perfect hash is able to avoid collisions when given the set of all possible keys in advance; it is not related to reversibility. The latter contradicts the very idea of a hash function, and would conceptually be a lossless compression technique.
I agree with the first point, but I think compressing and non-reversible are necessary conditions for a given function to be called a hash function; if they weren't, any mathematical function would do, wouldn't it?
One could see a hash function as an (extremely) lossy compression method. However, lossy compression only makes sense when you can exploit features of the domain, e.g., psychoacccoustics with sound, or characteristics of human vision with photos; perceptual hashes come to mind here.
It's really quite simple. Perfect hashes exists and are hashes. Perfect hashes do not, as a matter of definition, compress. They typically aren't reversible, because it's not an useful feature for a hash, but they could be - and if nothing else, they can always be deterministically brute forced (as they have no collisions), which is a (very bad) form of reversibility.
It's not meaningful to study hashes as compression, since compression is by definition reversible.
Any set of possible bits is a valid input, so the input has one bit of entropy per bit. If the hash were not "just as long" then one of the following must be true:
- it collides (two possible inputs would have the same hash)
- some inputs could not hash
- the algorithm encodes more than 2 bits of entropy per bit.
There's probably a more rigorous proof here but I'm lazy. I guess something like imagine the input is going to be n bits (like 16 bits)...make an array of length 2^n and initialize them all to -1, and then for each possible input (0..2^n-1) hash it and - since the hash is not longer than the input, the representation of the hash is there in the array if you simply interpret it as the index (subscript). Set that to the n that produced it, if it's not already set (otherwise abort with a collision). After you have reached 2^n-1, you have set 2^n indices (including whatever hash 0 produced). Is it possible for you not to have made a collision? (Yes, easily: for example anything hashes to the next integer, except binary all 1's which hash to all 0's). Could some of the indices to be empty? No. If you put 2^n values in 2^n holes distinctly, there is one value per hole. Conversely, if you had anything LESS than that many holes to put it into (for example, you're trying to hash 32 bits of input into just the lower half of the indices, by creating a hash that uniquely hashes to 16 bits of output) then by the pigeonhole prinicpal you have to put two n's into the same index.
So, any hash that would be unique (that can hash anything) has to be at least exactly as long as what it's hashing. If there are any outputs that are never hashed to, it would have to be even longer.
Note that this proof depends on iterating on every possible input. If some inputs aren't possible, the hash could be unique while being shorter than the input. This is how lossless compression works.
Well, more accurately, at least as long, on average.
It would be possible, though, to have a guaranteed unique string which is usually shorter than the input for real world data, which is called lossless compression. Obviously, it is not possible to provide a fixed digest length though, which makes it useless for many hash function applications, like bloom filters. And, of course, a hash table implementation based on DEFLATE would be hilariously inefficient.
Fair enough. The proof is that you can't shave even a single bit off of every output, so that given any file that's 1024 bytes, you can hash it to 1023 bytes. Not possible unless you're hiding that entropy somewhere (filename etc).
what you've just said says nothing about the length of the hashes, which was my point. You would have to say something like "all possible inputs of a certain length are your pigeons, and these same possibilities(1) are the holes"
(1)this time interpreted as the hash of one and only one of the same list of possible inputs
The proof is not so straightforward, because it's not always true when you expect it to be true. For example, can you hash, reversibly and uniquely, all the numbers between 0 and 10 (real) to all the numbers between 0 and 1?
You would expect the answer to be "no" since there are "more real numbers from 0 to 10 than from 0 to 1". But that's actually not necessarily true the way you might expect, and the answer is actually "yes you can". Simply geometrically establish a coordinate plane like this
0 b 1
so that point (a) has a well-defined coordinate, say (0,1) the 0 from the second line is at the origin, the 1 at the second line is at (1,0) and the bottom 0 is at (0,-1) with the bottom 10 being wherever the ray passing through a and passing through (1,0) - enough to uniquely identify the ray - has the y-value of -1.
Now to get the hash simply move 'b' to wherever between 0 and 1 you're trying to hash, solve for the ray that passes through a and that point, and solve for the x-value of the ray where it has y-value of -1.
Through elementary proof which should be obvious, you'll see that any b casts a distinct ray (which is easily reversible introducing a c into line 3), and you can solve for it if you like.
I'm not saying that you're wrong, but I think the proof isn't as "obvious" and straightforward as you're saying. It doesn't apply to just anything, but to the specific case of the impossibility of hashing distinct length inputs uniquely into a discrete space defined by a uniformly shorter length of bits.
From wikipedia, "A hash function is any algorithm or subroutine that maps large data sets of variable length, called keys, to smaller data sets of a fixed length."
Given the pigeonhole principle, and the observation that there are fewer strings of length K than strings of any length, you have that you cannot map every string into a unique hash. That's what I meant by "a straightforward application of the pigeonhole principle".
If you're dealing with an infinite set of hashes (I'm not sure what that would look like, but hey), then of course you need to take into account the limitations of the pigeonhole principle when dealing with infinite sets - specifically, that it only applies if you have a set of pigeons of a larger cardinality than your set of holes. [0,1] and [0,10] have the same cardinality.
I was indeed thinking of lossless compression: In practice it compresses almost all its input. If the the output length does not have to be constant, it's fine that some input actually gets inflated. As long as it compresses the average input.
But in this case the output length might have to be constant? If that's the case, you are of course correct.
How do folks generally take an autoincrementing database ID and generate a hash to be used "in public" when trying to avoid revealing obviously serial numbers? I don't think I need something airtight, in fact in one system we just multiplied/divided by a 4 digit prime number. While this worked fine, it seemed a little loose.
HASH(<id> + "static string pad") is how I imagine most people do it. Depends how much you care about a dedicated attacker figuring out that ID.
If this is a thing you're going to use a lot, I'd probably just add a database column and give each record a random unique "public ID" -- then there is literally no connection between the public ID and the private one..
Even for cryptographic hashes, fast is what you want most of the time. Look at it this way: when you connect to a web site via SSL, all the data you send will be hashed for authentication. Do you really want this to be slow?
EDIT: parent basically said "use a cryptographic hash".
You need some kind of random salt, too - it's not that hard to find a decent number of near-collisions even in cryptographic hashes. (E.g. if your hash table doubles in size whenever two keys end up in the same bucket, an attacker needs to spend O(2^n) operations to find two keys that agree in the first n bits, which will make you do O(2^n) operations and use O(2^n) memory.)
Not if you're not looking for it to be hard to crack. Suppose you're generating quick checksums, using a hashtable, indexing non-malicious data, etc. In those cases you want a very fast hash function, and you're not worried about folks being able to find collisions for existing hashes, or create multiple values with the same hash.
Can we stop calling them hashes? They're key derivation functions. I know it's a long, awkward term, but a lot of people in this thread are confusing the technical requirements for cryptographic hashes and KDFs, so I think that being precise about this is warranted.
> You'll find that adding a long salt, unique to each password, is a far more effective protector against brute force than "hash speed".
No, I'll find that you've apparently stopped your education at hashing 101 and hashing speed is covered in hashing 102.
1. It's not an either-or situation, all three hashes I mentioned not only specifically include salts in their hashing interface but go as far as mandating a salt for brypt and scrypt (I don't believe PBKDF2 mandates one though it surely is recommended)
2. Salts don't protect much against brute-force attacks, their primary role is to protect against rainbow tables (dictionary attacks) in case the hashed data is leaked. They do add cost to brute-forcing a series of passwords (if salts are unique), but that cost is low compared to
3. Hash speed mitigates brute-forcing from both inside (leaked data) and outside (provide built-in rate-limiting, of course additional rate-limiting is still a good idea). Increasing the raw cost of computing the hash by multiple orders of magnitude and adding data dependencies within the hash (to prevent trivial parallelization) are very much vital against brute-force attacks.
4. You might have wanted to read your own bloody links, the first one specifically mentions hash speed to mitigate brute force attacks, and mentions salts only against rainbow tables; responses to the second one specifically note that salt mitigate rainbow table attacks but do not significantly mitigate brute force attacks especially on weak passwords.
If you did your research, you'd know that salts are insufficient protection. Heck, read your own link, the first link has as #7 that you need to use a slow hash function. Or you can just read http://codahale.com/how-to-safely-store-a-password/ which lays it out in black and white.
You should be glad. You learned something important today.