
Levenshtein Automata (2010) - beau
http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
======
lorenzhs
Levenshtein automata seem to pop up on here every once in a while. They are
quite interesting from a theory perspective, but (like many things devised by
the theory community) incredibly complex in practice. Lucene 4.0 uses them for
fuzzy queries; you can read the full story of how they struggled to get them
working somewhere on the Lucene blog.

If you want to implement fuzzy string matching, I would look at something like
[http://arxiv.org/abs/1008.1191](http://arxiv.org/abs/1008.1191). The
experiments look impressively fast.

~~~
rcsorensen
Something of a story at
[http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is...](http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html)

~~~
jules
The difficulty of Levenshtein automata is highly overstated. When I read that
Lucene blog post I wrote an implementation in an hour, and prior to reading
that post I hadn't even heard of Levenshtein automata.

~~~
lorenzhs
Unless you are the mythical 100x programmer, I doubt that you wrote a full
implementation of general Levenshtein automata in an hour. I read the paper
that introduced them
([http://link.springer.com/article/10.1007/s10032-002-0082-8](http://link.springer.com/article/10.1007/s10032-002-0082-8))
and they are quite the complex beast. Not to mention that the paper is very
technical and you need to keep a dozen definitions in your head.

That said, there seems to be a fairly readable implementation at
[https://github.com/universal-automata/liblevenshtein](https://github.com/universal-automata/liblevenshtein)

I'm currently working on implementing fast Levenshtein queries in C++ with a
friend, and we intend to implement the paper I linked in my original post. So
far, our dynamic programming Levenshtein already beats Lucene++ (C++
implementation of Lucene), which is a bit embarrassing [1]. If you're
interested, more advanced stuff will hit
[https://github.com/xhochy/libfuzzymatch](https://github.com/xhochy/libfuzzymatch)
when we get around to implementing it.

[1] Lucene++ spends more time converting strings between UTF-8 and UTF-32 than
it does computing Levenshtein distances, says the profiler.
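
As a point of reference, the plain dynamic-programming baseline mentioned above fits in a few lines. This is a minimal Python sketch of the textbook two-row algorithm, not the C++ code from libfuzzymatch; the function name `levenshtein` is made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance, keeping only two rows of the DP table."""
    # prev[j] = distance between the part of `a` consumed so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

This costs O(|a|·|b|) time for every pair of strings compared, which is the work the automaton-based approaches try to avoid when one query is matched against a whole index.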

~~~
jules
I'm not a 100x programmer, I just did a couple of things that drastically
reduced the time:

1\. I didn't follow that paper. Even trying to understand that paper would
have taken way more time, so after 5 minutes of trying to understand it I gave
up on that approach. See this comment for what I did do:
[https://news.ycombinator.com/item?id=9699870](https://news.ycombinator.com/item?id=9699870)
That saved maybe 20x.

2\. I used Python instead of C++ or Java. This saved 5x.

3\. The code was throwaway quality code. This saved 2x.

Together that's 200x, but I'm at least a 2x worse programmer than them, so
that gives you the 100x ;-)

~~~
lorenzhs
(see my other comment as well)

An algorithmicist would say that all this saved you a constant factor of work
for a linear slowdown ;)

~~~
jules
That's a nice soundbite but it's not correct. The worst-case performance with
the DFA is linear, the same as theirs.

~~~
lorenzhs
No, that's just not true. Your step function takes time linear in the length of
the string. For example, `newstate = [0 for x in state]` takes Θ(|state|) time,
and because you initialise the state with `range(len(string)+1)`, that's
linear in the string length.
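
To make the cost argument concrete, here is a minimal sketch of an automaton whose state is one row of the dynamic-programming table, which is what the quoted `range(len(string)+1)` initialisation suggests. The linked comment is not reproduced in this thread, so the class, method, and helper names below are assumptions for illustration, not jules's actual code.

```python
class LevenshteinAutomaton:
    """Hypothetical sketch: the state is a capped row of the DP table."""

    def __init__(self, string, max_edits):
        self.string = string
        self.max_edits = max_edits

    def start(self):
        # Distance from the empty input to each prefix of `string`.
        return list(range(len(self.string) + 1))

    def step(self, state, c):
        # Consume one input character; this loop visits every row entry.
        new_state = [state[0] + 1]
        for i, ch in enumerate(self.string):
            cost = 0 if ch == c else 1
            new_state.append(min(new_state[i] + 1,
                                 state[i] + cost,
                                 state[i + 1] + 1))
        # Values above max_edits are all equivalent, so cap them.
        return [min(x, self.max_edits + 1) for x in new_state]

    def is_match(self, state):
        return state[-1] <= self.max_edits

    def can_match(self, state):
        return min(state) <= self.max_edits


def matches(automaton, word):
    """Feed `word` through the automaton one character at a time."""
    state = automaton.start()
    for c in word:
        state = automaton.step(state, c)
    return automaton.is_match(state)
```

Each call to `step` walks the whole row, so feeding one character costs time proportional to `len(string)`, which is the linear factor under discussion.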

~~~
jules
Now you're talking about the cost of _constructing_ the DFA, not searching the
index with the resulting DFA. The cost of constructing the DFA is irrelevant,
and even then you can construct the DFA in O(n) with my method for a fixed max
edit distance and a fixed alphabet. Same as that paper.
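
One way to read this: the per-character row scan only happens while the DFA is being built, and afterwards matching is one table lookup per input character. The sketch below assumes a DP-row state and uses hypothetical names (`build_dfa`, `dfa_accepts`); it explores every reachable row once and stores the transitions. It is a naive construction: each `step` still scans the whole row, so on its own it does not reach the O(n) construction bound claimed above.

```python
from collections import deque


def build_dfa(string, max_edits):
    """Turn the DP-row automaton for `string` into explicit transition tables."""
    cap = max_edits + 1  # every value above max_edits is equivalent

    def step(state, c):
        # c is a character of `string`, or None meaning "any other character".
        new_state = [state[0] + 1]
        for i, ch in enumerate(string):
            cost = 0 if ch == c else 1
            new_state.append(min(new_state[i] + 1,
                                 state[i] + cost,
                                 state[i + 1] + 1))
        return tuple(min(x, cap) for x in new_state)

    symbols = sorted(set(string)) + [None]
    start = tuple(min(i, cap) for i in range(len(string) + 1))
    ids = {start: 0}   # DP row -> state number, in discovery order
    transitions = []   # transitions[state_number][symbol] -> state_number
    accepting = []     # accepting[state_number] -> bool
    queue = deque([start])
    while queue:
        state = queue.popleft()  # popped in the same order ids were assigned
        row = {}
        for c in symbols:
            nxt = step(state, c)
            if nxt not in ids:
                ids[nxt] = len(ids)
                queue.append(nxt)
            row[c] = ids[nxt]
        transitions.append(row)
        accepting.append(state[-1] <= max_edits)
    return transitions, accepting


def dfa_accepts(dfa, word):
    """Run `word` through the tables: one dictionary lookup per character."""
    transitions, accepting = dfa
    s = 0
    for c in word:
        row = transitions[s]
        s = row.get(c, row[None])  # unseen characters share the None edge
    return accepting[s]
```

A call such as `dfa_accepts(build_dfa("food", 1), "fod")` then never touches the DP row again; the cost per query character no longer depends on the length of the pattern.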

------
billwashere
I suggest everyone check out the rest of the algorithms on that site. They are
cool.
[http://blog.notdot.net/tag/damn-cool-algorithms](http://blog.notdot.net/tag/damn-cool-algorithms)

------
unhammer
And in OCaml:
[https://github.com/c-cube/spelll/blob/master/spelll.ml#L124](https://github.com/c-cube/spelll/blob/master/spelll.ml#L124)

