
Fast Perfect Hashing - dmit
http://www.jandrewrogers.com/2019/02/12/fast-perfect-hashing/
======
munificent
This article is interesting but... weird. It seems to assume a bunch of
things, states some conclusions based on those assumptions, and then also
assumes the value of those conclusions is obvious.

 _> A perfect hash function is one that is collision-free. By implication, the
hash must be at least as many bytes as the key and the function is
theoretically reversible, though not always tractably so._

This is true because of the pigeonhole principle, but most people won't assume
that the key size is equivalent to the domain size. Usually when using perfect
hashing, you are hashing (much) fewer elements than the total range a key can
represent.

A perfect hash function that uniquely assigns hash values to the eight items
you need to store, but gives you back integers anywhere in the 32 bit range
isn't super helpful. Or, at least, it's not obvious to me why it would be.
They do state the algorithm can be scaled down to 8 or 16 bits, so I guess
there's that.

 _> Because the hash is no smaller than the key, the primary use case is
randomizing small values like integral types._

This just doesn't make sense to me. The key can be much larger than the hash,
even with perfect hashing, as long as the _number of keys in your data set_ is
smaller than the maximum hash value.

The author says "primary use case" but what they really mean is that this
isn't what most people would consider a perfect hash at all. It's just an
integer permuter. In other words, here is an even faster implementation of
what the author would describe as a perfect hash over small integer keys:

    uint32_t hash(uint32_t key) { return key; }

This also guarantees there are no collisions, so it is a "perfect hash". It's
just not a very useful one.

The author assumes the reader knows the value of randomizing your integer
keys. I'm not sure what that value is in cases where you know you don't have
collisions anyway.

~~~
derefr
> The author assumes the reader knows the value of randomizing your integer
> keys. I'm not sure what that value is in cases where you know you don't have
> collisions anyway.

If you're going to be doing writes of your data in order of an integer
sequence number, and you're writing to a multi-node storage system that
partitions/shards by primary-key range, then you're going to get worst-case
performance if you use your integer sequence numbers directly as the primary
keys, since you're queuing all your writes onto one shard.

Instead, you want to permute your integer sequence numbers so that successive
integers become evenly spread across the range partitions. This (along with
some write duplication and verify-on-read) is the foundation of Dynamo-based
"scale-free" storage systems (DynamoDB+S3, Riak+CS, etc.).
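As a minimal sketch of that idea (the constants, shard count, and helper
names here are illustrative, not taken from any particular system): permute
each sequence number with an invertible multiply mod 2^32 before using it as
the primary key, so consecutive IDs land on different range partitions.

```python
# Sketch only: spread sequential IDs across range-partitioned shards by
# permuting them with an invertible multiply mod 2^32.
MULT = 0x9E3779B1            # odd constant (~2^32 / golden ratio)
MASK = (1 << 32) - 1

def permute(seq_id: int) -> int:
    """Bijective on 32-bit ints because MULT is odd (invertible mod 2^32)."""
    return (seq_id * MULT) & MASK

def shard_for(key: int, num_shards: int = 8) -> int:
    """Range partitioning: each shard owns one contiguous slice of key space."""
    return key // ((1 << 32) // num_shards)
```

Used directly, the first thousand sequence numbers all fall on shard 0; after
permuting, they cover every shard.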

~~~
jnwatson
It is trivial to generate a mostly-randomized bijective mapping of small
integers to small integers with an xor and a little byte swapping, which is
what the author does with a built-in AES instruction.

The OP's point is that this isn't a particularly interesting problem. The
interesting problem is generating efficient bijective mappings of large
integers to very small integers.

~~~
derefr
I wasn't attempting to argue the OP's point, merely answering the question
posed: "what [is the] value [of randomizing your integer keys] in cases where
you know you don't have collisions."

------
nightcracker
The OP seems to be confusing terminology. What they construct is a
permutation. Everything they say is relevant to permutations, not perfect
hashing.

In particular:

> A perfect hash function is one that is collision-free. By implication, the
> hash must be at least as many bytes as the key

This is just false. Perfect hashing is often if not always performed on
subsets of the full domain, e.g. the keywords in a programming language.

~~~
ot
Yeah, the terminology is a bit misleading; it would be better to call this a
"bijective" or "invertible" integer hash function, or a mixer.

It's really nice work though! Invertible mixers are usually building blocks
for more general hash functions; for example, MurmurHash3 uses this 64-bit
mixer:

[https://github.com/aappleby/smhasher/blob/master/src/MurmurH...](https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L81)

Using this trick could produce a faster murmurhash-type function.
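For reference, that finalizer (`fmix64` in the linked source) can be sketched
in Python as follows; because every step (xor-shift, odd multiply) is
individually invertible mod 2^64, the whole mixer is a 64-bit bijection:

```python
M64 = (1 << 64) - 1

def fmix64(k: int) -> int:
    """MurmurHash3's 64-bit finalizer: alternating xor-shift and
    odd-multiply steps, each invertible mod 2^64."""
    k ^= k >> 33
    k = (k * 0xFF51AFD7ED558CCD) & M64
    k ^= k >> 33
    k = (k * 0xC4CEB9FE1A85EC53) & M64
    k ^= k >> 33
    return k
```

Note that, like most mixers, it maps 0 to 0, which is one reason it is used
as a building block rather than as a hash on its own.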

~~~
jandrewrogers
A difference is that most common mixers are not collision-free. For some
applications of runtime hashing using ordinary hash functions, where pseudo-
randomness is important regardless, being able to prove the hash function is
collision-free enables useful optimizations without any prior knowledge of the
keys to be hashed.

The genesis of this was database internals, which have many use cases for
strongly randomized bijective functions for integral types. It is used in
places where most people would use an ordinary hash function.

This particular hash is not trivially invertible, since internal AES state
necessary to reverse it is discarded.

~~~
ot
> A difference is that most common mixers are not collision-free.

That is not true, most modern hash implementations use invertible mixers. See
Bob Jenkins' notes on hash function design:

[http://burtleburtle.net/bob/hash/evahash.html#funneling](http://burtleburtle.net/bob/hash/evahash.html#funneling)

------
gumby
> A perfect hash function is one that is collision-free. By implication, the
> hash must be at least as many bytes as the key

I don't understand this lemma. An indexed array is a degenerate example in
which the "hash" (pointer) is presumably a _lot_ smaller than the key (index).

BTW, for those not familiar with perfect hash functions: they are great for
things like symbol tables that are immutable at runtime. You can make a very
fast lookup that requires no probing (bucket size is always one), so you can
also eliminate a lot of implementation, both code and data structure.
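As a toy illustration of that kind of immutable symbol table (not a
production construction; the keyword set, table size, and string hash here
are all made up for the example), one can simply brute-force a salt until a
cheap string hash maps every keyword to a distinct slot:

```python
# Toy sketch: search for a salt that makes a simple polynomial string
# hash collision-free over a fixed keyword set, giving probe-free lookup.
KEYWORDS = ["if", "else", "while", "for", "return", "break", "def", "class"]
TABLE_SIZE = 16              # a little slack makes the search terminate fast

def slot(word: str, salt: int) -> int:
    """Simple polynomial string hash, seeded with the salt."""
    h = salt
    for ch in word:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h % TABLE_SIZE

def find_salt() -> int:
    """Brute-force the smallest salt with no collisions among KEYWORDS."""
    salt = 0
    while len({slot(w, salt) for w in KEYWORDS}) != len(KEYWORDS):
        salt += 1
    return salt
```

The search is done once, offline; at runtime a lookup is one hash and one
array index, with no probing.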

~~~
da_chicken
> _An indexed array is a degenerate example in which the "hash" (pointer) is
> presumably a lot smaller than the key (index)._

No, it isn't.

The "0" in arr[0] is (in most cases) a 32-bit signed integer. If you're using
a signed 32-bit int to index your array, you're using a hash length of
32-bits. You won't be able to index more than 2.1 billion unique items without
changing your data type. Even if you're not using all those bits, you're still
using 32-bits to index them. The fact that you're storing something that may
have more than 32 bits of information is irrelevant, because _your array (a
naive hash) doesn't consider them real values_.

If you want to store a set of N elements in an array or perfectly hash a set
of N elements, then the set of all possible indexes or hashes _must_ have at
least N elements.

~~~
gumby
It's the lemma I challenge, not the common case on certain of today's
architectures. The use of the word "must" implies ∀, which is plainly not the
case.

~~~
bjourne
If it is not the case, can you present a counter-example? Can you code it up
in say Python?

------
inciampati
Coincidentally, I was just testing an actual perfect hash function generator
[https://github.com/rizkg/BBHash](https://github.com/rizkg/BBHash). It
requires 2-3 bits per entry irrespective of key or table size. They show that
they can build the table for 10^12 entries.

One problem is that identifying keys that aren't in the table can be as
expensive as storing all the keys. I was thinking that it would be interesting
to build two tables with different hash functions for the same set of keys and
then record the mapping between them. This would double the table size and add
n log n bits to record the relationship, but it would let you make
probabilistic (but pretty confident) estimates of what keys weren't in the
table, all with a fixed number of bits per key.

To clarify, if you map a key into the table with both functions, by
construction it will map to positions in each table that are linked together.
If a key maps to unlinked positions in each table then we know with
probability ~(1-1/n) that the key isn't in the input to the table. When keys
are big and n is big you could be very confident in this estimate and save a
huge amount of space over typical structures to do this.

------
chubot
_The algorithms presented below are much faster than most non-perfect hash
functions on recent CPUs._

This is a very vague statement. I would like to see some numbers, as well as
real-world use cases.

Last month there was a thread about perfect hashing in the postgres parser,
but I wonder if a DFA / trie is better, since you don't have to always look at
every byte.

[https://news.ycombinator.com/item?id=18880374](https://news.ycombinator.com/item?id=18880374)

------
CoolGuySteve
This is a nice simple mixing function on newer Intel chips.

But since the XOR doesn't increase entropy, I think you can save a cacheline
load of the 0xDEADBEEF constant by passing _mm_setzero_si128() instead of
_mm_set1_epi32(0xDEADBEEF).

------
Piezoid
A good blog post about xor-shift-multiply reversible hash functions:
[https://nullprogram.com/blog/2018/07/31/](https://nullprogram.com/blog/2018/07/31/)
The inverse functions are easily found by computing the modular inverses of
the odd multiplicative factors.
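A minimal sketch of that inversion (multiplier borrowed from MurmurHash3,
but any odd 64-bit constant works; `pow(m, -1, n)` needs Python 3.8+):

```python
MULT = 0xFF51AFD7ED558CCD
MASK = (1 << 64) - 1
MULT_INV = pow(MULT, -1, 1 << 64)   # modular inverse exists since MULT is odd

def mix(x: int) -> int:
    x ^= x >> 33                    # xor-shift; self-inverse when shift >= 32
    return (x * MULT) & MASK

def unmix(y: int) -> int:
    y = (y * MULT_INV) & MASK       # undo the multiply
    return y ^ (y >> 33)            # undo the xor-shift
```

For shifts smaller than half the word size, undoing the xor-shift takes a few
repeated applications instead of one, but the multiply is always undone by
the single modular-inverse multiplication.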

Real perfect hashing is more about hashing a non-contiguous set of elements
with no collisions, as opposed to the set of all integers that fit in n bits.

A different problem is Minimal Perfect Hashing. In addition to being
injective, the function should have a minimal image of the input set: the set
of keys is mapped to consecutive integers without vacant slots.

The implementations in optimal space are impractical, but some libraries
construct such functions with a bit more memory (2-3 bits/key), like emphf or
BBHash (both on GitHub). The latter uses a series of 1-hash Bloom filters
where keys that collide on one layer are removed and handled by deeper
layers. The overall rank() of the bit where a key ultimately lands gives its
hash.

------
berbc
As pointed out by some of you, this looks more like a permutation of
sequences.
Also, since a symmetric cipher is used, I'm surprised the author didn't
mention Black and Rogaway [1].

Their algorithm is a permutation (int -> int) that works on a domain of any
size up to a limit. A typical application for this is encrypting credit card
numbers so that the ciphertext still looks like a credit card number (non-
trivial because the size of domain isn't a power of two) or efficiently
shuffling sequences, randomly in appearance but repeatably if you know the
seed.
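One of the Black and Rogaway constructions (cycle-walking) can be sketched as
follows; the bijection `f`, the domain size, and the constants are all
illustrative, not from their paper. The idea: take any bijection on the
smallest power-of-two superset of the domain, and re-apply it until the value
lands back inside the domain.

```python
# Cycle-walking sketch: a permutation of [0, N) built from a bijection
# on [0, 2^K), where N need not be a power of two.
K = 14                       # smallest power of two covering N below
MASK = (1 << K) - 1
N = 10000                    # domain size, not a power of two

def f(x: int) -> int:
    """An invertible mix on K-bit integers (odd multiply + xor-shift)."""
    x = (x * 0x2545) & MASK  # any odd constant is a bijection mod 2^K
    return x ^ (x >> 7)      # xor-shift is invertible for shift > 0

def perm(x: int) -> int:
    """Cycle-walk: re-apply f until the value falls back inside [0, N)."""
    y = f(x)
    while y >= N:
        y = f(y)
    return y
```

The walk always terminates because `f` is a permutation, so following the
cycle from any x < N must eventually return to an element below N; the
result is itself a permutation of [0, N).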

For instance, this is used by Masscan to randomize the order in which IP
addresses and ports are scanned [2]. I've built a Python package that could
help you use this algorithm [3] (mostly for fun, but maybe that's useful, let
me know :)).

[1]: [https://en.wikipedia.org/wiki/Format-preserving_encryption#The_FPE_constructions_of_Black_and_Rogaway](https://en.wikipedia.org/wiki/Format-preserving_encryption#The_FPE_constructions_of_Black_and_Rogaway)

[2]: [https://github.com/robertdavidgraham/masscan/blob/6c15edc280645b59aadb562333511cebd3405246/src/rand-blackrock.c](https://github.com/robertdavidgraham/masscan/blob/6c15edc280645b59aadb562333511cebd3405246/src/rand-blackrock.c)

[3]: [https://github.com/bbc2/shuffled](https://github.com/bbc2/shuffled)

------
jemfinch
Seems to accomplish the same purpose as Fibonacci Hashing[0], but with more
steps.

[0] [https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/](https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/)

------
norswap
And to think I was excited to finally get a good explanation of what people
usually mean when they say "perfect hashing" (i.e, given a set of keys,
construct a hashing function that will assign each key a unique hash so as to
avoid collision, with a hash-range that has typically little overhead over the
number of keys).

If someone has good resources on that, I'm still interested. I found it hard
to fully internalize the stuff I've found when looking a while back. In
particular, I'm confused about the (complexity, time) bounds of such a hash
construction process.

~~~
norswap
Well, it looks like I found something that could qualify:
[https://www.cs.cmu.edu/~avrim/451f11/lectures/lect1004.pdf](https://www.cs.cmu.edu/~avrim/451f11/lectures/lect1004.pdf)

------
beyondCritics
As already mentioned, what he describes should be called a bijective mixing
function. Given n different values, I can mix them with his function and get
n other pairwise-distinct values. That doesn't help me at all in practice
with the task of perfect hashing. Technically, I would say he does indeed
describe a perfect hash function, but one of questionable quality.

------
senderista
This article is about integer permutations, not perfect hashing. That said,
integer permutations have many nice applications. One is in hash tables with
integer keys: one can store the "hash code" in lieu of the key, since
permutations are reversible.

------
bmh
This is awesome! In case you're wondering, the latency of _mm_aesenc_si128 on
Skylake and above is only 4 clock cycles.

------
convolvatron
This seems extremely cool(?). Does someone have a reference or pointer for
further study on how this is 'perfect' and not just 'really good'?

~~~
lern_too_spel
A perfect hash function has no collisions. Typically, people are interested
in perfect hashes constructed for key types that cannot themselves be
represented in the number of output bits, but whose input set has cardinality
less than or equal to the output's cardinality.

What the author is describing here is more commonly called a mixing function.
There are many fast mixing functions.
[https://gist.github.com/badboy/6267743](https://gist.github.com/badboy/6267743)

~~~
zuzun
If someone is looking for modern mix functions, the ones in Murmur are a good
choice.

------
samirm
cryptographic hash^

