

Levenshtein automata can be simple and fast - jules
http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html

======
lorenzhs
Not to bring up our previous discussion again, but your complexity analysis is
still misleading. Your "O(length of string) version" takes O(length of string)
_per step_, not for the total running time. To match an entire word (which you
have to do to confirm a match), you have to call step O(length of word) times,
so you're back to quadratic complexity.
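
For concreteness, here is roughly what that step function looks like (a Python
sketch paraphrasing the article, not its exact code). The state is a row of
len(word) + 1 distances, so each call is O(length of string):

    def step(state, c, word):
        # state[i] = edit distance between the query prefix consumed so far
        # and word[:i]; stepping on character c recomputes the whole row
        new = [state[0] + 1]
        for i in range(1, len(word) + 1):
            cost = 0 if word[i - 1] == c else 1
            new.append(min(new[i - 1] + 1,       # skip word[i-1]
                           state[i] + 1,         # skip c
                           state[i - 1] + cost)) # substitution or match
        return new

    def distance_at_most(word, candidate, k):
        # one call to step per candidate character, so the total is
        # O(len(word) * len(candidate)), i.e. the quadratic bound above
        state = list(range(len(word) + 1))
        for c in candidate:
            state = step(state, c, word)
        return state[-1] <= k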

Also, it would have been nice if you had done your analysis with k as a
variable instead of saying linear when you actually mean "O(nk) for a very
small k". Furthermore, if you have a maximum edit distance k, you can also
implement the dynamic programming approach for a pair of strings in O(nk) time
[1].
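
For reference, a banded-DP sketch of that idea in Python (hypothetical code,
not the C++ implementation linked in [1]). Only cells within k of the diagonal
are computed, for O(nk) total:

    def levenshtein_at_most(a, b, k):
        # returns the distance if it is <= k, else None; cells outside the
        # band |i - j| <= k can never end up <= k, so they are skipped
        if abs(len(a) - len(b)) > k:
            return None
        INF = k + 1
        prev = {j: j for j in range(min(k, len(b)) + 1)}
        for i in range(1, len(a) + 1):
            cur = {}
            for j in range(max(0, i - k), min(len(b), i + k) + 1):
                if j == 0:
                    cur[j] = i
                    continue
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[j] = min(prev.get(j, INF) + 1,     # skip a[i-1]
                             cur.get(j - 1, INF) + 1,  # skip b[j-1]
                             prev.get(j - 1, INF) + cost)
            prev = cur
        d = prev.get(len(b), INF)
        return d if d <= k else None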

Still, the method you describe prunes the trie nicely. It's a nice and simple
method that's a lot faster than the naive implementation and good enough for
nearly all applications. Have you done any benchmarking, by any chance?

[1] A quite fast C++ implementation I wrote is at
https://github.com/xhochy/libfuzzymatch/blob/master/src/libfuzzymatch/levenshteinLimit.cpp
- the repo will at some point in the future also contain other, faster
methods for calculating Levenshtein distance.

~~~
adamtj
> O(length of string) per step

The big win here is not that it's as fast as the optimal algorithm (it's not),
but that it's nearly as fast and vastly simpler.

You're technically right that the given code is O(n^2), but the optimization
discussed later in the article can easily be applied to turn that into O(nk).
Further, for the Lucene problem, k=2. That's small and effectively constant,
making it "basically O(n)". That's still theoretically worse than optimal, but
in practice it just doesn't matter. Even with the slightly slower algorithm,
Lucene's cost of searching should be dominated by repeatedly _using_ the DFA
and not by _building_ it.
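
Concretely, the optimization just drops every position whose distance has
exceeded k, so the state keeps at most 2k+1 entries and each step is O(k).
Roughly (a Python sketch of the article's sparse representation, not its exact
code):

    def sparse_step(state, c, word, k):
        # state is (indices, values): the positions i in word whose edit
        # distance is still <= k, in increasing order
        indices, values = state
        if indices and indices[0] == 0 and values[0] < k:
            new_indices, new_values = [0], [values[0] + 1]  # skip c
        else:
            new_indices, new_values = [], []
        for j, i in enumerate(indices):
            if i == len(word):
                break
            cost = 0 if word[i] == c else 1
            val = values[j] + cost                          # match/substitute
            if new_indices and new_indices[-1] == i:
                val = min(val, new_values[-1] + 1)          # skip word[i]
            if j + 1 < len(indices) and indices[j + 1] == i + 1:
                val = min(val, values[j + 1] + 1)           # skip c
            if val <= k:
                new_indices.append(i + 1)
                new_values.append(val)
        return (new_indices, new_values)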

What's really important is that the O(nk) construction algorithm is much
simpler than the optimal version. I don't know what the Lucene guys were
originally trying to do, but if they had figured out this method instead, they
could have avoided the super complicated implementation of the true O(n)
algorithm.

> you can also implement the dynamic programming approach for a pair of
> strings in O(nk) time

Yes, but that's for every pair of strings you want to test. A DFA costs
exactly as much to build as solving a single problem with DP, but when a large
number of pairs all have one string in common (the query string), a pre-built
DFA can be used repeatedly in only O(n+k) each time. Of course, whether that's
enough to make a practical difference depends on the problem at hand.
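
As a sketch of that reuse pattern (using the sparse_step above; a fully built
DFA would make the per-character cost O(1) instead of O(k)):

    def fuzzy_matches(word, candidates, k):
        # the start state is built once per query and reused for every
        # candidate; each test then costs O(len(candidate) * k)
        n = min(k, len(word))
        start = (list(range(n + 1)), list(range(n + 1)))
        for cand in candidates:
            state = start
            for c in cand:
                state = sparse_step(state, c, word, k)
                if not state[0]:       # state died: distance already > k
                    break
            else:
                indices, _ = state
                if indices and indices[-1] == len(word):
                    yield cand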

------
__john
Can someone explain what a DFA is?

Edit: DFA stands for deterministic finite automaton.

https://en.wikipedia.org/wiki/Deterministic_finite_automaton
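
For a toy illustration (Python; unrelated to the article's automata), a DFA is
just a transition table with exactly one successor per (state, symbol). Here
is one that accepts binary strings containing an even number of 1s:

    # states: "even"/"odd" number of 1s seen so far; "even" is accepting
    DELTA = {("even", "0"): "even", ("even", "1"): "odd",
             ("odd",  "0"): "odd",  ("odd",  "1"): "even"}

    def accepts(s):
        state = "even"
        for ch in s:
            state = DELTA[(state, ch)]  # deterministic: one choice per step
        return state == "even"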

------
DannyBee
BTW, here are some implementations of the algorithm they seem to have had
trouble implementing, in Java:

https://github.com/universal-automata/liblevenshtein-java/tree/master/src

Note there is also an algorithm in the same paper to do the calculation
_without actually constructing the automaton_, which is even cooler.

There are other implementations of the algorithm, like Moman, etc. (I believe
Lucene eventually used Moman as a reference implementation.)

The paper is certainly "not easy" to understand, but it's doable.

~~~
jules
I think that is an algorithm from a different paper, namely this one:
https://www.fmi.uni-sofia.bg/fmi/logic/theses/mitankin-en.pdf

It looks like that one is still quite a bit more complicated: 40 lines vs. 40+
_files_ ;-)

~~~
DannyBee
That one may be (I think it's the later 2004 paper by one of the same
authors), but Moman definitely is an implementation of the algorithm in
question (it turns out they mention this in a different blog post).

------
jstimpfle
The title is a reference to Russ Cox's excellent article series on
implementing regular expressions:

http://swtch.com/~rsc/regexp/

------
jhallenworld
It's fun to think of the regular expression that matches all strings within
Levenshtein distance 1. For "hello" it's something like:
hello|ello|hllo|helo|hell|.hello|h.ello|he.llo|hel.lo|hell.o|hello.|.ello|h.llo|he.lo|hel.o|hell.

It seems pretty easy to generate the NFA for this and then compile it to a DFA
as usual.
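
A quick generator for such a pattern (Python sketch; the names are made up):

    import re

    def lev1_regex(word):
        # alternation over the word itself plus every single insertion,
        # substitution, and deletion; "." stands for "any one character"
        alts = [re.escape(word)]
        for i in range(len(word) + 1):
            alts.append(re.escape(word[:i]) + "." + re.escape(word[i:]))  # insert
            if i < len(word):
                rest = re.escape(word[i + 1:])
                alts.append(re.escape(word[:i]) + "." + rest)             # substitute
                alts.append(re.escape(word[:i]) + rest)                   # delete
        return "(?:" + "|".join(alts) + ")"

    # re.fullmatch(lev1_regex("hello"), "helloo") matches via "hello."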

~~~
fenollp
Or: hello|.?ello|h.?llo|he.?lo|hel.?o|hell.?

~~~
cousin_it
That doesn't match "helloo".

------
darklajid
Thanks a lot for the follow-up/expanded article, jules. The previous
discussion was a bit hard to follow for me.

------
jhallenworld
Why not generate an index containing all strings plus all strings within
Levenshtein distance 1 (or N)? This is what I assumed Google would do, but I
don't know the index size.

~~~
TheLoneWolfling
Because that is absurdly large, and gets larger as the character set grows.

~~~
jhallenworld
Yeah, realized this later. Maybe the trie (or whatever) index can have a
concept of "any single character" built into it. This should be easier than
turning it into a full automaton.

~~~
DannyBee
http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

See the part about how you would intersect a trie and a Levenshtein automaton
and find the strings that are in both. It's not as hard as one would think.

(Tries are already DFAs.)

You can also do it for other types of indexes.
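
A sketch of that intersection (Python; hypothetical, reusing the sparse_step
from the earlier comment and a trie of nested dicts where "" marks
end-of-word):

    def fuzzy_search(trie, word, k):
        # walk the trie, stepping the Levenshtein state along each edge;
        # when the state empties, the whole subtree is pruned
        n = min(k, len(word))
        start = (list(range(n + 1)), list(range(n + 1)))
        results = []

        def walk(node, state, prefix):
            indices, values = state
            if "" in node and indices and indices[-1] == len(word):
                results.append(prefix)  # dictionary word within distance k
            for c, child in node.items():
                if c == "":
                    continue
                nxt = sparse_step(state, c, word, k)
                if nxt[0]:              # prune dead subtrees
                    walk(child, nxt, prefix + c)

        walk(trie, start, "")
        return results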

------
TheLoneWolfling
Alternative version:

You build the DFA on the fly from an NFA that is itself built on the fly. This
is also reasonably simple, and it doesn't require constructing the entire DFA
unless the entire DFA is actually needed.
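
A rough sketch of that, with the sparse state standing in for the NFA
(hypothetical; it caches transitions on raw characters, where a careful
version would cache on character classes):

    def lazy_dfa(word, k, sparse_step):
        # DFA states are frozen sparse NFA states; each transition is
        # determinized the first time it is taken and cached thereafter
        cache = {}
        n = min(k, len(word))
        start = (tuple(range(n + 1)), tuple(range(n + 1)))

        def step(state, c):
            key = (state, c)
            if key not in cache:
                ind, val = sparse_step((list(state[0]), list(state[1])),
                                       c, word, k)
                cache[key] = (tuple(ind), tuple(val))
            return cache[key]

        return start, step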

