

Throw away the keys: Easy, Minimal Perfect Hashing - rcfox
http://stevehanov.ca/blog/?id=119

======
Irfaan
Looks like the site's overloaded. Anyone have a link to a cached copy of the
post?

~~~
Stormbringer
Looks like they didn't pay enough attention to that article from yesterday
about server load from Delicious going tits up.

------
thelema314
Article Text:

In part 1 of this series, I described how to find the closest match in a
dictionary of words using a Trie. Such searches are useful because users often
mistype queries. But tries can take a lot of memory -- so much that they may
not even fit in the 2 to 4 GB limit imposed by 32-bit operating systems.

In part 2, I described how to build an MA-FSA (also known as a DAWG). The
MA-FSA greatly reduces the number of nodes needed to store the same
information as a trie. It is quick to build, and you can safely substitute
an MA-FSA for a trie in the fuzzy search algorithm.

There is a problem. Since the last node in a word is shared with other words,
it is not possible to store data in it. We can use the MA-FSA to check if a
word (or a close match) is in the dictionary, but we cannot look up any other
information about the word!

If we need extra information about the words, we can store it in a hash
table alongside the MA-FSA. Here's an example of a hash table that uses
separate chaining. To look up a word, we run it through a hash function,
H(), which returns a number. We then look at all the items in that "bucket"
to find the data. Since there should be only a small number of words in
each bucket, the search is very fast.
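
For concreteness, here is a minimal sketch of such a table in Python.
Python's built-in hash() stands in for H(); the class name and bucket count
are made up for illustration:

    # A minimal sketch of separate chaining. Each bucket is a list of
    # (key, value) pairs; lookups only scan the one bucket the key
    # hashes to.
    class ChainedHashTable:
        def __init__(self, num_buckets=16):
            self.buckets = [[] for _ in range(num_buckets)]

        def put(self, key, value):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for i, (k, _) in enumerate(bucket):
                if k == key:                 # key already present: overwrite
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))

        def get(self, key):
            # Only the few items in this key's bucket need to be scanned.
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for k, v in bucket:
                if k == key:
                    return v
            raise KeyError(key)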

Notice that the table needs to store the keys (the words that we want to look
up) as well as the data associated with them. It needs them to resolve
collisions -- when two words hash to the same bucket. Sometimes these keys
take up too much storage space. For example, you might be storing information
about all of the URLs on the entire Internet, or parts of the human genome. In
our case, we already store the words in the MA-FSA, and it is redundant to
duplicate them in the hash table as well. If we could guarantee that there
were no collisions, we could throw away the keys of the hash table.

Minimal perfect hashing

Perfect hashing is a technique for building a hash table with no
collisions. It is only possible to build one when we know all of the keys
in advance. Minimal perfect hashing implies that the resulting table
contains one entry for each key, and no empty slots.
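
Put another way, a minimal perfect hash for n keys is a bijection onto the
slots 0 through n-1. As a quick check (hash_fn here is any candidate
function mapping a key to a slot):

    # A function is a minimal perfect hash for `keys` exactly when it maps
    # them one-to-one onto slots 0 .. len(keys)-1: no two keys share a
    # slot (perfect), and no slot is left empty (minimal).
    def is_minimal_perfect(hash_fn, keys):
        return {hash_fn(k) for k in keys} == set(range(len(keys)))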

We use two levels of hash functions. The first one, H(key), gives a
position in an intermediate array, G. The second function, F(d, key), uses
the extra information from G to find the unique position for the key. The
scheme always returns a value, so it works only as long as we know for sure
that what we are searching for is in the table; otherwise, it will return
bad information. Fortunately, our MA-FSA can tell us whether a word is in
the dictionary. If we did not have this information, then we could also
store the keys with the values in the value table.
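
Sketched in Python, the lookup side might look like this. The FNV-style
constant and the function names are illustrative choices, not necessarily
the article's exact code:

    FNV_PRIME = 0x01000193

    def F(d, key):
        # d seeds the hash, so each d value behaves like an independent
        # hash function over the same key.
        if d == 0:
            d = FNV_PRIME
        for c in key:
            d = ((d * FNV_PRIME) ^ ord(c)) & 0xFFFFFFFF
        return d

    def lookup(G, values, key):
        # Level 1: H(key) = F(0, key) selects an entry of the intermediate
        # array G; level 2: that entry's d-value selects the final slot.
        d = G[F(0, key) % len(G)]
        return values[F(d, key) % len(values)]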

In the example below, the words "blue" and "cat" both hash to the same
position using the H() function. However, the second level hash, F, combined
with the d-value, puts them into different slots.
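
Using the F sketched above, finding such a separating d-value is a direct
search (the two-slot table size is illustrative):

    # If two keys collide at the first level, try d = 1, 2, 3, ... until
    # F(d, key) spreads them into different slots. Assumes F from the
    # previous snippet.
    keys = ["blue", "cat"]
    d = 1
    while len({F(d, k) % len(keys) for k in keys}) < len(keys):
        d += 1
    print(d)  # the first d that separates "blue" and "cat"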

How do we find the intermediate table, G? By trial and error. But don't worry,
if we do it carefully, according to this paper, it only takes linear time. In
step 1, we place the keys into buckets according to the first hash function,
H.

In step 2, we process the buckets largest first and try to place all the
keys each bucket contains into empty slots using F(d=1, key). If that is
unsuccessful, we keep trying with successively larger values of d. It
sounds like it would take a long time, but in reality it doesn't: because
we handle the buckets with the most items while the table is still mostly
empty, they are likely to find free slots. By the time we get to buckets
with just one item, we can simply place them into the next unoccupied slot.

Here's some Python code to demonstrate the technique. In it, we use
H(key) = F(0, key) to simplify things.

<SNIP>
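
The snipped code isn't reproduced here, but a sketch along the lines the
article describes might look like the following. The FNV constant, the
function names, and the negative-entry shortcut for single-key buckets are
illustrative choices, not necessarily the original code:

    FNV_PRIME = 0x01000193

    def F(d, key):
        # Same mixing function as the earlier snippet, repeated so this
        # block runs on its own.
        if d == 0:
            d = FNV_PRIME
        for c in key:
            d = ((d * FNV_PRIME) ^ ord(c)) & 0xFFFFFFFF
        return d

    def create_minimal_perfect_hash(data):
        # Build (G, values) for the dict `data` so that lookup() below
        # finds each value without storing the keys.
        size = len(data)

        # Step 1: group the keys by the first-level hash H = F(0, key).
        buckets = [[] for _ in range(size)]
        for key in data:
            buckets[F(0, key) % size].append(key)

        G = [0] * size
        values = [None] * size
        occupied = [False] * size

        # Step 2: largest buckets first, search for a d that drops every
        # key in the bucket into a slot that is still empty.
        buckets.sort(key=len, reverse=True)
        for bucket in buckets:
            if len(bucket) <= 1:
                break
            d, slots, item = 1, [], 0
            while item < len(bucket):
                slot = F(d, bucket[item]) % size
                if occupied[slot] or slot in slots:
                    d, slots, item = d + 1, [], 0   # collision: try next d
                else:
                    slots.append(slot)
                    item += 1
            G[F(0, bucket[0]) % size] = d
            for key, slot in zip(bucket, slots):
                occupied[slot] = True
                values[slot] = data[key]

        # Step 3: single-key buckets go straight into the next free slot.
        # A negative G entry stores the slot directly (offset by one so
        # slot 0 is distinguishable from an unused entry).
        singles = [b for b in buckets if len(b) == 1]
        free = [i for i in range(size) if not occupied[i]]
        for bucket, slot in zip(singles, free):
            G[F(0, bucket[0]) % size] = -slot - 1
            values[slot] = data[bucket[0]]

        return G, values

    def lookup(G, values, key):
        # Returns garbage for keys that were never inserted; the MA-FSA is
        # what tells us a key is actually in the dictionary.
        d = G[F(0, key) % len(G)]
        if d < 0:
            return values[-d - 1]
        return values[F(d, key) % len(values)]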

Experimental Results

I prepared separate lists of randomly selected words to test whether the
runtime is really linear as claimed.

    Number of items    Time (s)
    100000             2.24
    200000             4.48
    300000             6.68
    400000             9.27
    500000             11.71
    600000             13.81
    700000             16.72
    800000             18.78
    900000             21.12
    1000000            24.816
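
The experiment is easy to reproduce with a sketch like this, timing the
create_minimal_perfect_hash sketch above; the word generation here is
synthetic, whereas the article used words from a dictionary:

    import random
    import string
    import time

    def random_word(rng):
        # A synthetic "word": 4 to 12 random lowercase letters.
        return "".join(rng.choice(string.ascii_lowercase)
                       for _ in range(rng.randint(4, 12)))

    rng = random.Random(42)
    for n in range(100000, 1000001, 100000):
        data = {}
        while len(data) < n:      # loop handles the occasional duplicate
            data[random_word(rng)] = len(data)
        start = time.time()
        create_minimal_perfect_hash(data)
        print(n, round(time.time() - start, 2))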

CMPH

CMPH is an LGPL library that contains really fast implementations of
several perfect hash algorithms. In addition, it compresses the G array so
that it can still be used without decompressing it. It was created by the
authors of the paper that I cited. Botelho's thesis is a great introduction
to perfect hashing theory and algorithms.

gperf

Gperf is another open-source solution. However, it is designed to work with
small sets of keys, not large dictionaries of millions of words. It expects
you to embed the resulting tables directly in your C source files.

~~~
mquander
Please don't do this. This is worthless, especially without the diagrams. If
you felt like it made sense to copy and paste the article because the server
was down, then you are wrong; that's why Google's cache and Coral exist.

------
Stormbringer
Hmm... just map everything to 404. Genius! Why didn't I think of that?! :D

