

Hash Functions - jacquesm
http://home.comcast.net/~bretm/hash/

======
antirez
It is worth adding that if your goal using an hash function is sharding across
different servers the CRC function can be good, especially the CRC16 function,
that has a smaller output than the CRC32, tends to perform very well with many
common "key name" patterns.

It's also worth to note that the CRC16 (and CRC32) function will map every
possible input of 16 (or 32) bits to a different output, without collisions,
similarly to a block cipher.

EDIT: I would take this web site with a bit of salt, the description of the
"One Way" concept in the home page is, as far as I can tell, not correct:

"Another important goal is that "similar" input vectors produce very different
output vectors. For the case of table lookup, table keys are usually either
human language strings, index numbers, or other data that exhibit non-
randomness. For the case of message digest functions, the goal is to make it
as difficult as possible to find two different messages that result in the
same hash. A hash function with this property is called a one-way hash
function."

~~~
pja
It's sort of almost correct. He's fumbling his way towards the avalanche
property: for a perfect cryptographic hash if you flip a single bit in the
input, on average half the bits in the output will change.

<http://en.wikipedia.org/wiki/Avalanche_effect>

~~~
antirez
Yes, but the Avalanche effect does not mean it is not invertible. It is
possible to design a one way hash function with remarkable avalanche (50% of
flipped bits on average after 1 bit change in input) effect but that is easily
invertible.

The simplest would be an hash function that outputs an N-bit hash where M bits
are a pseudo-random sequence of bits (depending on the input, let's say
SHA1(input)) that are used to key a stream cipher, and the rest N-M bits are
the XOR between the stream cipher output and part of the input of the hash
function.

Perfect Avalanche effect, but the function is invertible.

~~~
jacquesm
Only in that special case where the input <= the output of the hash function,
as in your example earlier in the thread.

~~~
antirez
An hash function is invertible enough if it will leak the first three bytes of
the input. Does not need to be invertible as F^-1 in mathematical notation.

------
eis
If you're looking for a cutting-edge non-cryptographic hash function, then I'd
suggest using MurmurHash 3 created by Austin Appleby.

Excellent performance with great collision properties.

<http://code.google.com/p/smhasher/> or more specifically
<http://code.google.com/p/smhasher/wiki/MurmurHash3>

There exist 32, 64 and 128 bit versions.

~~~
jwn
I'll wait for the winner of NIST's SHA-3 competition:
<http://csrc.nist.gov/groups/ST/hash/sha-3/Round3/index.html>

~~~
psykotic
He said non-cryptographic.

------
kenjackson
I must admit that I find it a little disappointing that an article of this
length on hash functions doesn't mention universal hashing at all.

~~~
jemfinch
As far as I know, no one anywhere _uses_ universal hashing. Are there some
modern, serious uses of universal hashing that I'm missing?

~~~
kenjackson
UMACs, for example, are based on them. And it would seem to me to understand
the power and flaws of hashing, knowing universal hashing seems important --
even if you never use them.

------
Roboprog
Hash functions are generally not reversible, unless you were to very carefully
define the input domain. Given the "similar inputs" -> "quite different
outputs" ("avalanche") effect described here and the article, I wonder this:

Is it possible to describe a given hash function as a "cyclic" function, given
some kind of transformation on the input to create a matrix/tensor out of the
input -- are there common methods to re-order the inputs in such a way that
there are discernible "axis" that exhibit cyclic behavior? I suppose such an
analysis would be a way to determine cryptographic fitness of a function, but
I am only speculating.

------
ypcx
It should be duly noted that hashing used for table lookup is a workaround
technique, a hack to overcome the inherent inefficiency of the Von Neumann
design.

~~~
mmphosis
Interesting. Would you elaborate, and/or give an example of some non-Von
Neumann designs and how they accomplish table lookups more efficiently?

------
wisty
Note, Python's hash function is a very bad hash function. It usually maps
sequential values (i.e. 1,2,3 ..) to sequential hash values (i.e. 1,2 3 ...) -
yes, hash(1) = 1. The advantage of this is, if you have sequential keys (which
is _very_ common), you minimize collisions better than more "random"
distributions.

Of course, you couldn't use it for cryptography.

~~~
slig
If it was made that way by design, for that specific use, how can it be "very
bad"?

------
deutronium
One type of hashing that I hadn't heard of before that I think is really cool,
is fuzzy hashing.

Which allows you to compare hashes, to find the similarity of data.

For an example: <http://ssdeep.sourceforge.net/>

------
thinkbohemian
This is great, thanks to Bret for writing this.

