

Minimal Perfect Hashing - fendrak
http://stevehanov.ca/blog/index.php?id=119

======
danieldk
The article takes a very weird turn:

 _If we need extra information about the words, we can use an additional data
structure along with the MA-FSA. We can store it in a hash table._

This is not necessary at all, since an acyclic finite state automaton can be
used as a perfect hash function. If you sum the right language cardinalities
of the transitions followed, you get a unique hash code (1..n, where n is the
number of strings in the automaton). For efficiency, you can precompute the
cardinalities in the transitions or states. You can then use a normal array
for whatever data you want to associate with a word.
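The counting scheme described here can be sketched with a plain trie (a dictionary automaton without suffix sharing); all names below are illustrative, not from any particular library. Each node precomputes how many words its subtree contains, and a word's hash is the number of stored words that sort strictly before it:

```python
# Illustrative sketch of perfect hashing over a trie. Each node precomputes
# how many words its subtree contains; a word's index is the number of stored
# words that sort strictly before it, so the indices are exactly 0..n-1.

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False
        self.count = 0       # number of words in this subtree (precomputed)

def build(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def precompute(node):
        node.count = int(node.is_word) + sum(
            precompute(c) for c in node.children.values())
        return node.count

    precompute(root)
    return root

def perfect_hash(root, word):
    """Return a unique index in 0..n-1, or None if word is not stored."""
    index, node = 0, root
    for ch in word:
        if node.is_word:
            index += 1  # a stored word that is a proper prefix sorts before us
        for label in sorted(node.children):
            if label < ch:
                index += node.children[label].count
        if ch not in node.children:
            return None
        node = node.children[ch]
    return index if node.is_word else None
```

An array of n values indexed by this hash then replaces the hash table the article reaches for.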

With respect to parts 1 & 2 of the series, one can construct a Levenshtein
automaton for a given word in linear time[1]. Such an automaton matches any
string within the given edit distance of the word. You can then compute the
intersection (language) of the Levenshtein automaton and the dictionary
automaton to get all words within the given edit distance.
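The intersection can also be simulated lazily: walk the dictionary trie while maintaining one Levenshtein DP row per node, instead of materializing the Levenshtein automaton up front. A rough Python sketch (illustrative names, not the cited library's API):

```python
# Sketch: intersect a dictionary trie with a lazily simulated Levenshtein
# automaton by carrying an edit-distance DP row down each trie branch,
# pruning any branch whose entire row already exceeds the distance bound.

def within_distance(words, query, max_dist):
    """Return (word, distance) pairs for all stored words within max_dist of query."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, prefix, prev_row):
        if "$" in node and prev_row[-1] <= max_dist:
            results.append((prefix, prev_row[-1]))
        if min(prev_row) > max_dist:
            return  # no extension of this prefix can get back under the bound
        for ch, child in node.items():
            if ch == "$":
                continue
            row = [prev_row[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,           # insertion
                               prev_row[i] + 1,          # deletion
                               prev_row[i - 1] + cost))  # substitution
            walk(child, prefix + ch, row)

    walk(trie, "", first_row)
    return results
```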

(Another approach would be to make the dictionary a transducer that returns
the words within the given edit distance. I tried this once, but the resulting
automaton would be too large for realistic dictionaries.)

I wrote a Java library for dictionary automata, perfect hash automata, and
Levenshtein automata. It also provides immutable Map containers for String
keys that store the keys in an automaton and use a flat array to store the
values:

[https://github.com/danieldk/dictomaton](https://github.com/danieldk/dictomaton)

[1]
[http://csi.ufs.ac.za/resres/files/Schultz.pdf](http://csi.ufs.ac.za/resres/files/Schultz.pdf)

~~~
fendrak
This is a great solution!

What would be the reason to do it this way if you don't already have the FSA?

If you just need to hash a small-ish set of known items and keep memory usage
low, I'd think starting with the table would make more sense. Wouldn't the FSA
approach require more memory than the table, since it has to maintain all the
intermediate nodes?

~~~
danieldk
_Wouldn't the FSA approach require more memory than the table, since it has
to maintain all the intermediate nodes?_

What do you mean by 'a table'? If you want a set without false
positives/negatives or a perfect hash function, you have to represent the
possible elements in some way. FSAs are compact when you store a set of
sequences where there is a certain amount of redundancy in the form of
subsequences that occur in multiple strings. In such cases, FSAs are far more
compact than a flat table of strings.

Of course, in an actual compact implementation, you would not implement states
and transitions using classes. E.g. Dictomaton stores states as an array of
transitions (which are unboxed), which is far more compact (see the numbers on
the Github page for some comparisons with comparable data structures).
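As a rough illustration of the "no state/transition objects" point, a tiny automaton for the two-word language {"ac", "bd"} can be laid out as parallel flat arrays, with states as implicit row indices. This mirrors the general idea only; Dictomaton's actual layout may differ:

```python
# Flat-array automaton layout: each state's outgoing transitions are a
# contiguous slice of two parallel arrays, located via an offset table,
# instead of per-state objects holding transition objects.

state_offset = [0, 2, 3, 4, 4]  # state i owns transitions [offset[i], offset[i+1])
labels       = ["a", "b", "c", "d"]   # transition labels, grouped by state
targets      = [1, 2, 3, 3]           # target state of each transition
accepting    = {3}

def step(state, ch):
    """Follow the transition labeled ch out of state, or return None."""
    for i in range(state_offset[state], state_offset[state + 1]):
        if labels[i] == ch:
            return targets[i]
    return None

def accepts(word):
    state = 0
    for ch in word:
        state = step(state, ch)
        if state is None:
            return False
    return state in accepting
```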

------
afsina
For what it's worth, I have a quite fast MPHF implementation that uses an
average of 3.2 bits per key.

[https://github.com/ahmetaa/zemberek-nlp/blob/master/core/src...](https://github.com/ahmetaa/zemberek-nlp/blob/master/core/src/main/java/zemberek/core/hash/MultiLevelMphf.java)

It is used for language model compression. There is also a Dart implementation
as part of a small library:

[https://github.com/ahmetaa/poppy](https://github.com/ahmetaa/poppy)

------
TheCoreh
I had the luck to have Fabiano Botelho — the person behind the Minimal Perfect
Hash thesis — as an Algorithms and Data Structures professor back when I was
studying Computer Engineering at CEFET-MG. He is absolutely brilliant, and has
some incredible knowledge on both the theoretical and practical aspects of
using hash tables.

IIRC, he had just finished publishing his thesis, and had each of us write our
own implementations of minimal perfect hashing for extra credit. The data
structure is really useful for indexing large, immutable data sets.

------
thomasahle
For what it's worth, if you have on the order of N things to store, and you
hash them to a space of size N^3, the probability of a collision is less than
N^(-1). Basically nonexistent. Hence, if you want to save space, just use a
normal hash table implementation, but store a more precise hash instead of the
keys. This is very useful when your keys are document sized.
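A sketch of that fingerprinting idea, assuming a SHA-256-based hash truncated to a space of size n^3 (class and function names here are made up for illustration):

```python
# Store a wide hash ("fingerprint") of each key instead of the key itself,
# so document-sized keys never need to be kept or compared in full. With n
# keys and a fingerprint space of n**3, the chance of any collision is < 1/n.

import hashlib

def fingerprint(key: str, space: int) -> int:
    """Map a (possibly huge) key to an integer in [0, space)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest, "big") % space

class FingerprintDict:
    """Maps keys to values via fingerprints drawn from a space of size n**3."""
    def __init__(self, expected_n):
        self.space = expected_n ** 3
        self.table = {}

    def put(self, key, value):
        self.table[fingerprint(key, self.space)] = value

    def get(self, key):
        return self.table.get(fingerprint(key, self.space))
```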

~~~
avoid3d
I don't understand what you mean?

10 things to store, 1000 buckets to store in, less than 0.1 probability of a
collision? That seems neither very impressive nor even correct?

~~~
thomasahle
It's just a rough (but certain for all n) upper bound. The precise probability
for your case is about 0.044. The point is that you can replace a precise,
expensive equality test with a second hash with a greater range.
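Both the 1/n bound and the exact value for the 10-keys-into-1000-buckets example are easy to check numerically with the standard birthday computation:

```python
# Exact "birthday" collision probability for n keys hashed uniformly at
# random into a given number of buckets.

def collision_probability(n, buckets):
    """P(at least two of n uniform keys land in the same bucket)."""
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (buckets - i) / buckets
    return 1.0 - p_no_collision

p = collision_probability(10, 1000)  # about 0.044, under the 1/10 bound
```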

------
freeqaz
Change that dang font!

~~~
zaroth
Feel free to.

