"MurmurHash2_x86_64 computes two 32-bit results in parallel and mixes them at the end, which is fast but means that collision resistance is only as good as a 32-bit hash. I suggest avoiding this variant."
MurmurHash3 has a 128-bit variant, which might be more along the lines of what he's looking for (the original post mentions SHA256).
Would an actual hash table only use, say, the first 10 bits or something of a hash function (int buckets) to make it less sparse (albeit increase collisions?)
If you decide you want more buckets later on, would you have to re-hash everything in your array and move it to a new, bigger one?
There are strategies to make the resize not be so expensive. Wikipedia's page on Hash Tables covers this at http://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing
[edit: clarity in the first paragraph]
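To make the rehash-on-resize step concrete, here's a rough sketch in Python (the class, load factor, and names are purely illustrative, not from the article): when the table doubles, every stored key simply gets re-bucketed against the new size.

    class TinyHashTable:
        """Toy chained hash table that doubles and rehashes when it fills up."""
        def __init__(self):
            self.buckets = [[] for _ in range(8)]
            self.count = 0

        def _slot(self, key, buckets):
            return buckets[hash(key) & (len(buckets) - 1)]  # power-of-2 size -> mask

        def put(self, key, value):
            if self.count >= 0.75 * len(self.buckets):      # load-factor threshold
                self._resize(2 * len(self.buckets))
            slot = self._slot(key, self.buckets)
            for pair in slot:
                if pair[0] == key:
                    pair[1] = value
                    return
            slot.append([key, value])
            self.count += 1

        def _resize(self, new_size):
            # The expensive step being asked about: every key is re-bucketed.
            new_buckets = [[] for _ in range(new_size)]
            for slot in self.buckets:
                for key, value in slot:
                    self._slot(key, new_buckets).append([key, value])
            self.buckets = new_buckets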
Using powers of 2 generally fits nicely with programming on binary computers. It also has the nice property that the number of slots is always a power of 2, so you can fit them naturally into, for example, a single memory page.
E.g., if you're hashing a 10-character (non-Unicode) word with FNV-1a, you're using 10 multiplications and 10 XORs. Adding a modulo (by a prime†) operation in there could feasibly double the time taken.
You could shift or mask the result (as described above) to use a table size of any power of 2 to avoid the modulo.
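As a concrete illustration in Python, using the published 32-bit FNV-1a constants and a power-of-2 table size picked arbitrarily:

    FNV_OFFSET, FNV_PRIME = 2166136261, 16777619

    def fnv1a_32(data: bytes) -> int:
        """32-bit FNV-1a: one XOR and one multiply per input byte."""
        h = FNV_OFFSET
        for byte in data:
            h ^= byte
            h = (h * FNV_PRIME) & 0xFFFFFFFF
        return h

    TABLE_SIZE = 1 << 16                                  # any power of 2
    bucket = fnv1a_32(b"example") & (TABLE_SIZE - 1)      # mask, no modulo needed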
(2) Strictly speaking you don't need this overhead; you can use trees to keep it sparse as well, although as far as I know the pointer-chasing overhead slows lookups down too much in practice to be worth the space efficiency. That is, you could declare a binary tree structure holding only those buckets and pathways of the 2^32 which you need, but finding something within the structure requires checking and following ~32 pointers, and storing something requires creating ~32 nodes.
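A minimal sketch of that tree idea in Python (the structure and names are mine): every lookup walks ~32 child pointers, and a cold insert allocates ~32 nodes.

    class _Node:
        __slots__ = ("children", "value")
        def __init__(self):
            self.children = [None, None]   # one pointer per bit value
            self.value = None

    class BitTrie:
        """Sparse map over a 32-bit key space: only the paths in use are allocated."""
        def __init__(self, bits=32):
            self.bits = bits
            self.root = _Node()

        def insert(self, key, value):
            node = self.root
            for i in range(self.bits - 1, -1, -1):         # MSB to LSB: ~32 hops
                b = (key >> i) & 1
                if node.children[b] is None:
                    node.children[b] = _Node()             # ~32 nodes on a cold path
                node = node.children[b]
            node.value = value

        def lookup(self, key):
            node = self.root
            for i in range(self.bits - 1, -1, -1):
                node = node.children[(key >> i) & 1]
                if node is None:
                    return None
            return node.value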
Yes, if you're only storing a single bit for each. Usually, you'll want to store a pointer to the head of a linked list. At 64 bits per pointer, you'll need to allocate 32 gigs of RAM - slightly less feasible.
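A quick back-of-the-envelope check of that figure:

    buckets = 2 ** 32               # one bucket per possible 32-bit hash
    bytes_needed = buckets * 8      # one 64-bit pointer per bucket
    print(bytes_needed / 2 ** 30)   # 32.0 (GiB)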
Look up an algorithm called a Bloom Filter.
Bloom filters are only useful (in this context) to augment a proper Set implementation, in which case they increase the memory cost, in exchange for decreasing (on average) the I/O cost.
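For anyone who hasn't seen one, a minimal sketch in Python (the sizes, hash choice, and names are all mine): a few bits per element, false positives possible, false negatives impossible, so you only pay the I/O cost of the real set when the filter says "maybe".

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1 << 20, k=4):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8)

        def _positions(self, item: str):
            for i in range(self.k):                        # k independent-ish hashes
                h = hashlib.sha256(b"%d:" % i + item.encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, item: str):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))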
Performance graphs: http://www.team5150.com/~andrew/noncryptohashzoo/speed.html
But if I was you, I'd start with Knuth (he, as another commenter mentioned, covers hashes in great detail), head to the references, and then use Google Scholar to find well-cited recent articles that reference the important papers mentioned there.
2) I'm certainly not an expert on how SE distributes its points, but the moderator comment in the related meta post (http://meta.programmers.stackexchange.com/questions/3527/rem...) states "I don't think there is a way to refund the reputation the answer gained while it was CW [...]"
SHA-1 is very fast though so it is a good point for comparison.
SHA-1 is fast for a cryptographic hash function, but it's orders of magnitude slower than "hashmap" hashes. crc32 is more than an order of magnitude faster, and as you can see in TFA's table it's ~half the speed of the best non-crypto hashes.
As for hash tables, can't you guarantee uniqueness by using only reversible operations? I remember reading a lengthy post about this on HN, but can't find the link anymore.
This is called a perfect hash and its appropriateness depends on the input data.
The problem is that the domains where hashes are practical are those where a very large space of inputs needs to fit into a finite number of buckets so that lookups are fast.
For that to work with a perfect hash, you need an infinite number of buckets which needs infinite space.
That said, there's nothing in the definition of a hash function that requires it to be compressing or non-reversible, although it would typically have to be in order to be useful.
One could see a hash function as an (extremely) lossy compression method. However, lossy compression only makes sense when you can exploit features of the domain, e.g., psychoacoustics with sound, or characteristics of human vision with photos; perceptual hashes come to mind here.
It's not meaningful to study hashes as compression, since compression is by definition reversible.
Wouldn't that make the output almost as long as the input, defeating the purpose?
Any string of bits is a valid input, so the input has one bit of entropy per bit. If the hash were not "just as long" then one of the following must be true:
- it collides (two possible inputs would have the same hash)
- some inputs could not hash
- the algorithm encodes more than one bit of entropy per bit of output.
This is not possible by the pigeonhole principle.
There's probably a more rigorous proof here but I'm lazy. I guess something like: imagine the input is going to be n bits (say 16 bits). Make an array of length 2^n and initialize every entry to -1, and then for each possible input (0..2^n-1) hash it, and - since the hash is not longer than the input - interpret the hash as an index into the array. Set that entry to the input that produced it, if it's not already set (otherwise abort with a collision).

After you have reached 2^n-1, you have set 2^n indices (including whatever hash 0 produced). Is it possible for you not to have made a collision? (Yes, easily: for example, everything hashes to the next integer, except binary all 1's, which hashes to all 0's.) Could some of the indices be empty? No. If you put 2^n values in 2^n holes distinctly, there is one value per hole. Conversely, if you had anything LESS than that many holes to put them into (for example, you're trying to hash 32 bits of input into just the lower half of the indices, by creating a hash that uniquely hashes to 16 bits of output), then by the pigeonhole principle you have to put two inputs into the same index.
So, any hash that would be unique (that can hash anything) has to be at least exactly as long as what it's hashing. If there are any outputs that are never hashed to, it would have to be even longer.
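That argument is easy to check exhaustively for a small n; here's a throwaway Python version (the mixing function is arbitrary, any function with a shorter output works):

    n = 8                                     # 2^8 = 256 possible inputs

    def toy_hash(x, out_bits):
        x = (x * 2654435761) & 0xFFFFFFFF     # arbitrary mixing step
        x ^= x >> 13
        return x & ((1 << out_bits) - 1)      # truncate to out_bits

    seen = {}
    for value in range(1 << n):
        h = toy_hash(value, n - 1)            # output one bit shorter than the input
        if h in seen:
            print("collision:", seen[h], "and", value, "both hash to", h)
            break
        seen[h] = value
    else:
        print("no collision")                 # unreachable: 256 pigeons, 128 holes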
Note that this proof depends on iterating on every possible input. If some inputs aren't possible, the hash could be unique while being shorter than the input. This is how lossless compression works.
Well, more accurately, at least as long, on average.
It would be possible, though, to have a guaranteed unique string which is usually shorter than the input for real world data, which is called lossless compression. Obviously, it is not possible to provide a fixed digest length though, which makes it useless for many hash function applications, like bloom filters. And, of course, a hash table implementation based on DEFLATE would be hilariously inefficient.
(1) this time interpreted as the hash of one and only one of the same list of possible inputs
The proof is not so straightforward, because it's not always true when you expect it to be true. For example, can you hash, reversibly and uniquely, all the numbers between 0 and 10 (real) to all the numbers between 0 and 1?
You would expect the answer to be "no" since there are "more real numbers from 0 to 10 than from 0 to 1". But that's actually not necessarily true the way you might expect, and the answer is actually "yes you can". Simply geometrically establish a coordinate plane like this
              a

    0         b         1

    0                              10
Now, to get the hash, simply move 'b' to wherever between 0 and 1 you're trying to hash, take the ray that passes through a and that point, and solve for the x-value of the ray where its y-value is -1.
Through an elementary proof, which should be obvious, you'll see that any b casts a distinct ray (which is easily reversible by introducing a c on line 3), and you can solve for it if you like.
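A tiny numeric sketch of that construction in Python; the projection point a = (0, 1/9) is my own choice, picked so that rays through [0, 1] at y = 0 land exactly on [0, 10] at y = -1 (with parallel segments the map collapses to plain scaling, but the round trip shows it's reversible and collision-free):

    AX, AY = 0.0, 1.0 / 9.0        # projection point 'a' (my choice)

    def unhash(b):
        """Follow the ray from a through (b, 0) down to y = -1; return its x-value."""
        s = (AY + 1.0) / AY
        return AX + s * (b - AX)

    def hash_(x):
        """Inverse: where the ray from a to (x, -1) crosses y = 0."""
        s = AY / (AY + 1.0)
        return AX + s * (x - AX)

    for x in (0.0, 2.5, 7.39, 10.0):
        assert abs(unhash(hash_(x)) - x) < 1e-12   # reversible, hence unique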
I'm not saying that you're wrong, but I think the proof isn't as "obvious" and straightforward as you're saying. It doesn't apply to just anything, but to the specific case of the impossibility of hashing distinct length inputs uniquely into a discrete space defined by a uniformly shorter length of bits.
Given the pigeonhole principle, and the observation that there are fewer strings of length K than strings of any length, you have that you cannot map every string into a unique hash. That's what I meant by "a straightforward application of the pigeonhole principle".
If you're dealing with an infinite set of hashes (I'm not sure what that would look like, but hey), then of course you need to take into account the limitations of the pigeonhole principle when dealing with infinite sets - specifically, that it only applies if you have a set of pigeons of a larger cardinality than your set of holes. [0,1] and [0,10] have the same cardinality.
Well, the quotes you had to put there say it all. You might expect the answer to be "no" if you didn't understand how infinity works.
But in this case the output length might have to be constant? If that's the case, you are of course correct.
"CRC32 collisions: codding collides with gnu". At first I read "coding" and I was all "haha this must be an easter egg of the implementation".
If this is a thing you're going to use a lot, I'd probably just add a database column and give each record a random unique "public ID" -- then there is literally no connection between the public ID and the private one.
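Something along these lines, assuming Python and a DB-side UNIQUE constraint on the column (the helper is illustrative):

    import secrets

    def new_public_id() -> str:
        # 16 random bytes -> 22-character URL-safe token; collisions are
        # astronomically unlikely, but keep a UNIQUE constraint anyway.
        return secrets.token_urlsafe(16)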
No, guy's looking for a hash table hash, not a cryptographic one. For a hash table you're looking for low collisions and high throughput (so your hash table is fast) first and foremost.
Sort of anyway.
You need some kind of random salt, too - it's not that hard to find a decent number of near-collisions even in cryptographic hashes. (E.g. if your hash table doubles in size whenever two keys end up in the same bucket, an attacker needs to spend O(2^n) operations to find two keys that agree in the first n bits, which will make you do O(2^n) operations and use O(2^n) memory.)
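One common way to get that salt, sketched in Python with FNV-1a as the base hash (my choice): perturb the offset basis with a per-process random seed so an attacker can't precompute keys that pile into one bucket. Language runtimes that randomize their built-in string hashes do essentially this, usually with a keyed hash such as SipHash.

    import os

    SEED = int.from_bytes(os.urandom(4), "big")   # per-process secret salt

    def salted_fnv1a_32(data: bytes, seed: int = SEED) -> int:
        h = (2166136261 ^ seed) & 0xFFFFFFFF      # seed folded into the offset basis
        for byte in data:
            h ^= byte
            h = (h * 16777619) & 0xFFFFFFFF
        return h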
That absolutely isn't an intrinsic property of cryptographic hashes.
Also see: http://cyberarms.wordpress.com/2010/10/21/cracking-14-charac...
(low) hash speed is the whole point of PBKDF2, bcrypt or scrypt.
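For instance, with Python's standard library (the iteration count is illustrative; tune it for your own hardware): a unique random salt per password plus a deliberately slow derivation.

    import hashlib, hmac, os

    ITERATIONS = 200_000                          # illustrative cost factor

    def hash_password(password: str):
        salt = os.urandom(16)                     # unique per password
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
        return salt, digest

    def check_password(password: str, salt: bytes, digest: bytes) -> bool:
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
        return hmac.compare_digest(candidate, digest)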
Talking about stretching our hash, though, sounds like some strange tasting taffy...
See this: http://net.tutsplus.com/tutorials/php/understanding-hash-fun...
to learn more about preventing brute-force attacks
No, I'll find that you've apparently stopped your education at hashing 101 and hashing speed is covered in hashing 102.
1. It's not an either-or situation: all three hashes I mentioned specifically include salts in their hashing interface, and bcrypt and scrypt go as far as mandating one (I don't believe PBKDF2 mandates a salt, though it is surely recommended)
2. Salts don't protect much against brute-force attacks; their primary role is to protect against rainbow tables (dictionary attacks) in case the hashed data is leaked. They do add cost to brute-forcing a series of passwords (if salts are unique), but that cost is low compared to what a deliberately slow hash adds
3. A slow hash mitigates brute-forcing from both inside (leaked data) and outside (it provides built-in rate-limiting; of course, additional rate-limiting is still a good idea). Increasing the raw cost of computing the hash by multiple orders of magnitude and adding data dependencies within the hash (to prevent trivial parallelization) are very much vital against brute-force attacks.
4. You might have wanted to read your own bloody links: the first one specifically mentions hash speed as a mitigation for brute-force attacks, and mentions salts only against rainbow tables; responses to the second one specifically note that salts mitigate rainbow-table attacks but do not significantly mitigate brute-force attacks, especially on weak passwords.
You should be glad. You learned something important today.