
O(1) Data Lookups with Minimal Perfect Hashing - ingve
http://blog.demofox.org/2015/12/14/o1-data-lookups-with-minimal-perfect-hashing/
======
nly
Ilan Schnell's AOT generator for perfect hash functions[0] has served me well.
It takes a code template so you can generate code for any language. It also
uses the much more elegant CHM algorithm[1] (which is also implemented by the
venerable CMPH library[2] if you want to compute these things dynamically).

If I were doing this dynamically in C++, like the author of this post, and
didn't want to add a runtime dependency, I might be tempted to implement CHM
using Boost Graph.

[0] [http://ilan.schnell-web.net/prog/perfect-hash/](http://ilan.schnell-web.net/prog/perfect-hash/)

[1] [http://ilan.schnell-web.net/prog/perfect-hash/algo.html](http://ilan.schnell-web.net/prog/perfect-hash/algo.html)

[2] [http://cmph.sourceforge.net/](http://cmph.sourceforge.net/)

~~~
mtdewcmu
I'm thinking about how I'd implement CHM. I think I'd use a union-find data
structure to generate the acyclic graph, as in Kruskal's algorithm. This would
fit into an N-element integer vector. After finding the acyclic graph
(forest), I'd represent the forest implicitly in another N-element integer
vector, for finding the vertex weights.

You can take advantage of the graph being just a forest of trees and use
simpler data structures.
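
A minimal sketch of that plan in Python (my illustration, not mtdewcmu's code;
the hash seeding and the 2n+1 vertex count are assumptions): union-find catches
cycles as the key edges are added, the hash seeds are re-drawn until the graph
is a forest, and a walk over each tree assigns the vertex values g[] so that
(g[h1(k)] + g[h2(k)]) mod n maps the i-th key to i.

    import random

    def build_chm(keys):
        # Sketch of CHM: each key becomes an edge (h1(k), h2(k)) in a
        # graph with m vertices.  If that graph is acyclic, we can pick
        # vertex values g[] so that (g[h1(k)] + g[h2(k)]) % n enumerates
        # the keys 0..n-1.
        n = len(keys)
        m = 2 * n + 1                       # >2n vertices: acyclic w.h.p.
        while True:
            s1, s2 = random.random(), random.random()
            h1 = lambda k: hash((s1, k)) % m
            h2 = lambda k: hash((s2, k)) % m
            edges = [(h1(k), h2(k)) for k in keys]

            # union-find cycle check over one integer vector, as above
            parent = list(range(m))
            def find(x):
                while parent[x] != x:
                    parent[x] = parent[parent[x]]   # path halving
                    x = parent[x]
                return x
            acyclic = True
            for u, v in edges:
                ru, rv = find(u), find(v)
                if ru == rv:                # self-loop or cycle: retry
                    acyclic = False
                    break
                parent[ru] = rv
            if not acyclic:
                continue

            # walk each tree of the forest to assign the vertex values
            adj = [[] for _ in range(m)]
            for i, (u, v) in enumerate(edges):
                adj[u].append((v, i))
                adj[v].append((u, i))
            g = [0] * m
            seen = [False] * m
            for root in range(m):
                if seen[root] or not adj[root]:
                    continue
                seen[root] = True
                stack = [root]
                while stack:
                    u = stack.pop()
                    for v, i in adj[u]:
                        if not seen[v]:
                            seen[v] = True
                            g[v] = (i - g[u]) % n   # edge i now hashes to i
                            stack.append(v)
            return g, h1, h2

    # usage: mph(k) is a distinct index in 0..n-1 for each key
    g, h1, h2 = build_chm(["apple", "banana", "cherry"])
    mph = lambda k: (g[h1(k)] + g[h2(k)]) % 3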

------
rurban
Oh my, again a blog post talking about MPH being O(1) without talking about
the constant factors, which make this simple approach useful only for >100,000
keys.

People, those two-way hashing schemes are already available in the cmph
library, even compressed. But even the fastest of the six cmph algorithms is
much slower than any other perfect hash algorithm for a normal number of keys
(i.e. <100,000): there is a high constant overhead and a high run-time
overhead from the two hashes.

For comparisons see [https://github.com/rurban/Perfect-Hash#benchmarks](https://github.com/rurban/Perfect-Hash#benchmarks)

------
FullyFunctional
Interesting. AFAICT, Cuckoo hashing:

0) Cuckoo hashing is much simpler to implement (correctly).

1) Lookup and delete are also O(1), but Cuckoo is faster (especially if you
can exploit the inherent parallelism of the two table probes).

2) Insert for Cuckoo is O(1) amortized. It's unclear how this compares.

3) Cuckoo can mix and match insert, delete, lookup ops.

4) Cuckoo uses more memory (as the OP's scheme is, by definition, minimal).

Someone please correct me if I'm wrong, but I think Cuckoo compares pretty
favorably with this for most usage. EDIT: formatting.
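
For reference, a minimal two-table cuckoo sketch in Python (my illustration,
not from the article; the eviction bound of 32 and the doubling-and-reseeding
rehash are arbitrary choices): lookup and delete touch exactly two slots,
while insert kicks residents between the two tables and rehashes with fresh
seeds if the chain runs too long.

    import random

    class Cuckoo:
        # Sketch of a two-table cuckoo hash.  Note: random.random() is
        # not cryptographically secure, which matters against an active
        # opponent (see the chosen-key discussion below).
        def __init__(self, cap=8):
            self.cap, self.n = cap, 0
            self.t1, self.t2 = [None] * cap, [None] * cap
            self.s1, self.s2 = random.random(), random.random()

        def _h1(self, k): return hash((self.s1, k)) % self.cap
        def _h2(self, k): return hash((self.s2, k)) % self.cap

        def lookup(self, key):
            # O(1) worst case: the key can only live in two slots, and
            # the two probes are independent (point 1 above)
            for slot in (self.t1[self._h1(key)], self.t2[self._h2(key)]):
                if slot is not None and slot[0] == key:
                    return slot[1]
            return None

        def delete(self, key):
            # likewise O(1) worst case
            i, j = self._h1(key), self._h2(key)
            if self.t1[i] is not None and self.t1[i][0] == key:
                self.t1[i] = None; self.n -= 1; return True
            if self.t2[j] is not None and self.t2[j][0] == key:
                self.t2[j] = None; self.n -= 1; return True
            return False

        def insert(self, key, value):
            if self.lookup(key) is not None:
                self.delete(key)            # overwrite semantics
            item = (key, value)
            for _ in range(32):             # bounded eviction chain
                i = self._h1(item[0])
                item, self.t1[i] = self.t1[i], item
                if item is None:
                    self.n += 1; return
                j = self._h2(item[0])
                item, self.t2[j] = self.t2[j], item
                if item is None:
                    self.n += 1; return
            self._rehash(item)              # likely cycle: start over

        def _rehash(self, pending):
            items = [x for x in self.t1 + self.t2 if x is not None]
            items.append(pending)
            self.cap *= 2; self.n = 0
            self.t1, self.t2 = [None] * self.cap, [None] * self.cap
            self.s1, self.s2 = random.random(), random.random()
            for k, v in items:
                self.insert(k, v)

With two tables, inserts only stay fast while the load factor is below
roughly 50%, which is the extra memory that point 4) refers to.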

~~~
beagle3
Cuckoo's insert is, in fact, hard to implement correctly. Every implementation
I looked at was vulnerable to a "chosen key" attack which made it either O(n)
or worse: it would run into an infinite loop or fail to insert.

~~~
FullyFunctional
Good point. Isn't this exclusively a problem with the hash functions? Can't
you avoid it by picking the two hash functions at random every time you
rehash?

(I acknowledge that my original comment was misplaced given the off-/on-line
difference).

~~~
beagle3
You can if you're just avoiding bad luck; you often cannot if you're avoiding
an active opponent, in which case your randomness would need to be
cryptographically secure AND not leak any data through timing or other side
channels.

Here's an observation: Quicksort is similarly vulnerable to an active
attacker[0], and can similarly be "protected" with random pivot selection. How
many quicksort implementations have you seen in production-quality libraries
that actually use random pivot selection? I have seen none. As a result, I
avoid quicksort (and cuckoo, and any other basic algorithm which depends on
good crypto to perform well).

[0] I'm not referring to McIlroy's "antiquicksort" here, which makes quicksort
O(n^2) even in the presence of completely unpredictable pivot selection -- I'm
assuming the data to be sorted is laid out before the sorting starts.
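
For concreteness, a sketch of the randomized-pivot mitigation in Python (mine,
not from any library under discussion); as noted above, random.randint is not
cryptographically secure, so this only defends against inputs laid out before
the sort starts:

    import random

    def quicksort(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if lo >= hi:
            return
        # random pivot: an attacker who must commit to the input before
        # the sort starts can no longer force the O(n^2) worst case; a
        # non-crypto PRNG does not stop the adaptive opponent above
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot, i = a[hi], lo
        for j in range(lo, hi):             # Lomuto partition
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        quicksort(a, lo, i - 1)
        quicksort(a, i + 1, hi)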

------
daveguy
AKA:
[https://en.wikipedia.org/wiki/Dynamic_perfect_hashing](https://en.wikipedia.org/wiki/Dynamic_perfect_hashing)

~~~
ahomescu1
There doesn't seem to be anything dynamic about the algorithm in the article.
They still need to precompute the perfect hash function(s) from the entire
input data, whereas the hashing approach you linked to can handle adding data
incrementally (hence the "dynamic" part).

~~~
daveguy
Good point. The article just does one round of generating the layered hash
structure for lookups and does not address growth of the structure, whereas
the dynamic perfect hashing technique does.

