
Implementing Levenshtein Automata - fulmicoton
http://fulmicoton.com/posts/levenshtein/
======
danieldk
FWIW, my Dictomaton Java library also has support for Levenshtein automata
(besides finite state dictionaries, perfect hash automata, and String mappings
based on perfect hash automata):

[https://github.com/danieldk/dictomaton](https://github.com/danieldk/dictomaton)

~~~
burntsushi
> besides finite state dictionaries, ..., String mappings based on perfect
> hash automata

I'm implementing a similar library in Rust. One of the key things I've found
missing in Daciuk's great work on the topic is the handling of the registry.
In your code[1], it looks like you're using a straight a hash table. I thought
you'd like to know that I've had great success sacrificing complete DFA
minimization in exchange for a small _constant_ memory overhead by using an
LRU cache of only some classes of states.

My code still needs a good deal of polish, and I intend to write a blog post,
but I bet you can make sense of the general strategy from the code and reuse
it.[2]

N.B. Lucene's implementation permits sacrificing complete minimization, but
they don't use an LRU cache and I think still permit the registry to grow
without bound.

[1] -
[https://github.com/danieldk/dictomaton/blob/master/src/main/...](https://github.com/danieldk/dictomaton/blob/master/src/main/java/eu/danieldk/dictomaton/DictionaryBuilder.java#L292-L296)

[2] -
[https://github.com/BurntSushi/fst/blob/master/src/fst/regist...](https://github.com/BurntSushi/fst/blob/master/src/fst/registry.rs)
(The data type is a map from node hash to LRU cache. That is, there is a small
LRU cache for each bin in the map.)

~~~
danieldk
Interesting, thanks for the pointers! I'll look at it when we are done moving
:).

BTW, for those who are interested, Jan's code is also available online:

[http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciu...](http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/fsa.html)

[http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciu...](http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/fadd.html)

His FSA utilities and the fadd library are used in the Alpino parser:

[http://www.let.rug.nl/vannoord/alp/Alpino/](http://www.let.rug.nl/vannoord/alp/Alpino/)

------
nxb
So what's the best OSS search autocomplete these days? Solr's,
Elasticsearch's, Linkedin CLEO?

~~~
fulmicoton
You can add Duck Duck go's implementation to your list
[https://github.com/duckduckgo/cpp-libface](https://github.com/duckduckgo/cpp-
libface).

------
unhammer
Great write-up :-) I hope you'll considering explaining that last bit of code
some more later? I think there may be something missing from the parenthesis
here:

    
    
         Well the trick here is to normalize our states into two parts
    
        * a global offset that is the minimum offset of the states
        * the shape of the shifted states (we already saw that there was around )
    
        With this parametric DFA, transition will tell you, given a “shape”, what shape to transition two, as well as how much you should add to the global offset.

------
kylebgorman
This is a great implementation, thanks. For those of you who need this to be
fast and achieve the theoretical bounds, you may want to do something like
this using a finite-state transducer toolkit. Here's an example with OpenFst:
[http://www.openfst.org/twiki/bin/view/FST/FstExamples](http://www.openfst.org/twiki/bin/view/FST/FstExamples)
(halfway down under "Edit Distance"). (And OpenFst has a Python wrapper now.)

