Not to bring up our previous discussion again, but your complexity analysis is still misleading. Your "O(length of string) version" takes O(length of string) per step, not for the total time. To match an entire word (which you have to do to ensure that you have a match), you have to call step O(length of word) many times, so you're back to quadratic complexity.
Also, it would have been nice if you had done your analysis with k as a variable instead of saying linear when you actually mean "O(nk) for a very small k". Furthermore, if you have a maximum edit distance k, you can also implement the dynamic programming approach for a pair of strings in O(nk) time [1].
Still, your description prunes the trie nicely. It's a nice and simple method that's a lot faster than the naive implementation and good enough for nearly all applications. Have you done any benchmarking by any chance?
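For concreteness, here is a minimal sketch of that banded O(nk) DP in Python (my own code, not the C++ from [1]): with a maximum edit distance k, only cells within k of the main diagonal can ever hold a value <= k, so each row has at most 2k+1 interesting cells.

    def bounded_levenshtein(a, b, k):
        """Edit distance between a and b if it is <= k, otherwise None."""
        if abs(len(a) - len(b)) > k:
            return None
        INF = k + 1                      # stands in for "already more than k"
        prev = list(range(len(b) + 1))   # row 0 of the usual DP table
        for i in range(1, len(a) + 1):
            cur = [INF] * (len(b) + 1)
            cur[0] = i
            # only the band of columns within k of the diagonal is computed;
            # anything derived from an out-of-band INF cell stays > k, so the
            # final <= k test is unaffected
            for j in range(max(1, i - k), min(len(b), i + k) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # a[i-1] unmatched
                             cur[j - 1] + 1,      # b[j-1] unmatched
                             prev[j - 1] + cost)  # match or substitution
            prev = cur
        return prev[-1] if prev[-1] <= k else None

    print(bounded_levenshtein("lucene", "lucane", 2))   # 1
    print(bounded_levenshtein("lucene", "apache", 2))   # None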
The big win here is not that it's as fast as the optimal algorithm (it's not), but that it's nearly as fast and vastly simpler.
You're technically right that the given code is O(n^2), but the optimization discussed later in the article can easily be applied to turn that into O(nk). Further, for the Lucene problem, k=2. That's small and effectively constant, making it "basically O(n)". That's still theoretically worse than optimal, but in practice it just doesn't matter. Even with the slightly slower algorithm, Lucene's cost of searching should be dominated by repeatedly _using_ the DFA and not by _building_ it.
What's really important is that the O(nk) construction algorithm is much simpler than the optimal version. I don't know what the Lucene guys were originally trying to do, but if they had figured out this method instead, they could have avoided the super complicated implementation of the true O(n) algorithm.
> you can also implement the dynamic programming approach for a pair of strings in O(nk) time
Yes, but that's for every pair of strings you want to test. A DFA costs exactly as much to build as solving a single problem with DP, but when a large number of pairs all have one string in common (the query string), a pre-built DFA can be reused at a cost of only O(n+k) each time. Of course, whether that's enough to make a practical difference depends on the problem at hand.
Yes, the O(length of string) is per step in the automaton. Sorry if that was unclear; it was not my intention to mislead. Note, however, that that is for the naive version. The only point of the naive version is as a pedagogical stepping stone to the second version. I put the section on building the DFA after the naive version because it provides a nice logical progression: when you try to build a DFA you want the number of states to be finite, which leads to the idea of setting all entries >k to k+1. That in turn leads to the idea of the sparse vector. If you were actually building a DFA, though, you would use the fast automaton to generate the transitions rather than the naive one.
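In code, that naive-but-capped version might look something like the following rough Python sketch (names made up, not the article's code): the automaton state is one row of the Levenshtein DP table with every entry clamped to k+1, which is what makes the number of distinct states finite.

    def start_state(query, k):
        return tuple(min(i, k + 1) for i in range(len(query) + 1))

    def step(state, char, query, k):
        """Consume one character of the candidate word; O(len(query)) per step."""
        new = [min(state[0] + 1, k + 1)]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == char else 1
            new.append(min(state[i] + 1,         # candidate char unmatched
                           new[i - 1] + 1,       # query char unmatched
                           state[i - 1] + cost,  # match or substitution
                           k + 1))               # clamp: anything > k is "dead"
        return tuple(new)

    def is_match(state, k):
        return state[-1] <= k

    def can_match(state, k):
        # the row minimum never decreases from step to step, so once every
        # entry is > k no extension of the candidate can ever match
        return min(state) <= k

    # e.g. feeding "lucane" into the automaton for "lucene" with k = 1:
    s = start_state("lucene", 1)
    for c in "lucane":
        s = step(s, c, "lucene", 1)
    print(is_match(s, 1))   # True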
I tried to stay close to the terminology of the paper. They assume that k is fixed. The complexity of one step in the non-naive automaton is O(k), where k is the max edit distance (1 or 2 in Lucene). Building the DFA is O(nk), where n is the number of states in the DFA. Note, however, that while the number of states in the DFA is linear in the size of the string, it's exponential in k. This is unavoidable: a Levenshtein DFA simply needs an exponential number of states as a function of k. So if k is large it's not a good idea to build the DFA; it's better to use the step()-based automaton directly. In fact I'm pretty sure it's better to do that in any case, because it's unlikely that the initial cost of building the DFA will pay off. If k=2 you're basically updating an array of 5 numbers at every step, and stepping in a DFA isn't going to be much faster than that. You would need a huge number of steps before it starts to pay off, and the whole point of a Levenshtein automaton is that you search only about 200 positions in the index.
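A sketch of that sparse representation (again my own code, not Lucene's): a cell at row i, column j is always at least |i - j|, so only the 2k+1 columns around the diagonal can hold values <= k. A row can therefore be stored as an offset plus at most 2k+1 numbers, which for k = 2 is the array of 5 numbers mentioned above, and a step touches only O(k) cells.

    def sparse_start(query, k):
        # columns 0..k of row 0; everything further right is already > k
        return (0, list(range(min(k, len(query)) + 1)))

    def sparse_step(state, char, query, k, row):
        # 'row' = number of candidate characters consumed after this step
        off, vals = state
        INF = k + 1                                  # stands in for "more than k"
        def old(j):                                  # previous row, column j
            return vals[j - off] if 0 <= j - off < len(vals) else INF
        new_off = max(0, row - k)
        new_vals, left = [], INF
        for j in range(new_off, min(len(query), row + k) + 1):
            val = min(old(j) + 1, left + 1)          # the two "unmatched" moves
            if j > 0:                                # match or substitution
                val = min(val, old(j - 1) + (0 if query[j - 1] == char else 1))
            left = min(val, INF)
            new_vals.append(left)
        return (new_off, new_vals)

    def sparse_is_match(state, query, k):
        off, vals = state
        j = len(query) - off
        return 0 <= j < len(vals) and vals[j] <= k

    s = sparse_start("lucene", 1)
    for row, c in enumerate("lucane", start=1):
        s = sparse_step(s, c, "lucene", 1, row)
    print(sparse_is_match(s, "lucene", 1))   # True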
> Furthermore, if you have a maximum edit distance k, you can also implement the dynamic programming approach for a pair of strings in O(nk) time
I think you can even do that without knowing the maximum distance beforehand. Maintain a priority queue of entries in the matrix, and repeatedly propagate the minimum. That way you avoid computing most of the matrix if the edit distance turns out to be small. If insert/delete/substitute costs are all 1 you don't even need a full priority queue.
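For what it's worth, a small sketch of that idea (my own Python, all costs 1): treat the DP matrix as a graph and run Dijkstra from (0, 0), so only cells whose value is at most the actual distance ever get expanded.

    import heapq

    def edit_distance_dijkstra(a, b):
        """Dijkstra over the DP grid; cell (i, j) = i chars of a and j chars of b consumed."""
        target = (len(a), len(b))
        dist = {(0, 0): 0}
        heap = [(0, 0, 0)]
        while heap:
            d, i, j = heapq.heappop(heap)
            if (i, j) == target:
                return d
            if d > dist[(i, j)]:
                continue                               # stale heap entry
            moves = []
            if i < len(a) and j < len(b):
                moves.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))  # match/substitute
            if i < len(a):
                moves.append((i + 1, j, 1))            # a[i] unmatched
            if j < len(b):
                moves.append((i, j + 1, 1))            # b[j] unmatched
            for ni, nj, c in moves:
                nd = d + c
                if nd < dist.get((ni, nj), nd + 1):
                    dist[(ni, nj)] = nd
                    heapq.heappush(heap, (nd, ni, nj))

    print(edit_distance_dijkstra("hello", "hallo"))   # 1

Since the edge costs are only 0 and 1, a deque (0-1 BFS) would indeed do instead of the heap, as the comment above says.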
Note there is also an algorithm in the same paper to do the calculation without actually constructing the automaton, which is even cooler.
There are other implementations of the algorithm, Moman among them. (I believe Lucene eventually used Moman as a reference implementation when implementing it.)
The paper is certainly "not easy" to understand, but it's doable.
That one may be (I think this is the later 2004 paper by one of the same authors), but Moman definitely is an implementation of their algorithm (it turns out they mention this in a different blog post).
You can do all Levenshtein edits while looking up in a (minimised, acyclic) FSA, which is fairly easy to implement [1], but it means that instead of having one FSA state per index in the word as you look it up, you now have a set of states, each with its own N_edits counter (rough sketch below).
[1] If you want transpositions to count as a single edit you need some hackery though.
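Roughly what that looks like, using a plain dict-based trie in place of a minimised FSA (the traversal is the same; all names are made up): each frontier entry carries the current node, how many query characters have been consumed, and the edits used so far.

    END = object()   # marks "a word ends here" inside a trie node

    def build_trie(words):
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node[END] = w
        return root

    def fuzzy_lookup(trie, query, max_edits):
        results = set()
        stack = [(trie, 0, 0)]            # (node, query position, edits used)
        while stack:
            node, i, edits = stack.pop()
            if i == len(query) and END in node:
                results.add(node[END])
            if i < len(query) and edits < max_edits:
                stack.append((node, i + 1, edits + 1))             # query char dropped
            for ch, child in node.items():
                if ch is END:
                    continue
                if edits < max_edits:
                    stack.append((child, i, edits + 1))            # extra char in word
                if i < len(query):
                    cost = 0 if ch == query[i] else 1
                    if edits + cost <= max_edits:
                        stack.append((child, i + 1, edits + cost)) # match / substitute
        return results

    trie = build_trie(["hello", "hallo", "help", "world"])
    print(fuzzy_lookup(trie, "helo", 1))   # {'hello', 'help'}

As [1] notes, transpositions count as two edits here; making them cost one needs extra bookkeeping.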
It's fun to think of the regular expression which matches all strings off by Levenshtein=1. For "hello" it's something like: hello|ello|hllo|helo|hell|.hello|h.ello|he.llo|hel.lo|hell.o|hello.|.ello|h.llo|he.lo|hel.o|hell.
It seems pretty easy to generate the NFA for this and then compile it to a DFA as usual.
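A quick sketch of generating that alternation programmatically (Python, names made up). The resulting pattern could then be fed to whatever NFA-to-DFA machinery you like; here it is just compiled with the stock re module to check that it behaves as expected.

    import re

    def lev1_pattern(word):
        """Regex matching every string within Levenshtein distance 1 of word."""
        esc = [re.escape(c) for c in word]
        variants = {"".join(esc)}
        n = len(word)
        for i in range(n):
            variants.add("".join(esc[:i] + esc[i + 1:]))          # delete char i
            variants.add("".join(esc[:i] + ["."] + esc[i + 1:]))  # substitute char i
        for i in range(n + 1):
            variants.add("".join(esc[:i] + ["."] + esc[i:]))      # insert at position i
        return re.compile("^(?:" + "|".join(sorted(variants)) + ")$")

    pat = lev1_pattern("hello")
    print(bool(pat.match("helo")), bool(pat.match("hxllo")),
          bool(pat.match("heillo")), bool(pat.match("hexxo")))
    # True True True False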
Why not generate an index containing all strings + all strings off by Levenshtein distance 1 (or N)? This is what I assumed google would do, but I don't know the index size.
It's already been answered why this isn't a good idea, but there is a related approach that does work: instead of all words within Levenshtein distance N, you store all words with exactly N characters deleted from them. So for each word in your dictionary, store every string that can appear when you delete N characters from it. That's a much more reasonable number, and there are tricks to make it manageable for long words. Then when you do a query, do the same with your query word and check whether any of the results is in the index.
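A minimal sketch of that delete-variant idea (essentially the trick used by tools like SymSpell; my own code, here with up to N deletions so short words still work). Note that a few of the candidates can still be more than N edits away, so each one should be verified with a real edit-distance check afterwards.

    from itertools import combinations

    def deletes(word, n):
        """Every string obtainable by deleting up to n characters from word."""
        out = {word}
        for k in range(1, min(n, len(word)) + 1):
            for positions in combinations(range(len(word)), k):
                out.add("".join(c for i, c in enumerate(word) if i not in positions))
        return out

    def build_delete_index(words, n):
        index = {}
        for w in words:
            for d in deletes(w, n):
                index.setdefault(d, set()).add(w)
        return index

    def candidates(index, query, n):
        found = set()
        for d in deletes(query, n):
            found |= index.get(d, set())
        # verify each candidate with a real edit-distance check before using it
        return found

    index = build_delete_index(["hello", "hallo", "help", "world"], 1)
    print(candidates(index, "helo", 1))   # {'hello', 'help'}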
Yeah, realized this later. Maybe the trie (or whatever) index can have a concept of "any single character" built into it. That should be easier than turning it into a full automaton.
See the part about how you would intersect a trie and a Levenshtein automaton and find the strings that must be in both. It's not as hard as one would think.
The simple way to do this is to have a default child in the trie. But even then, it can get absurdly large. Look at something like "aaaa"(4), for instance.
That's basically what this is, except you do it for the query rather than for the index, and you build the structure more intelligently than just inserting all possible insertions/deletions/substitutions. If you wanted to make an index like that you could build a Levenshtein DFA for each term in the index, and then union all those together.
You build the DFA on-the-fly from an NFA built on-the-fly. Also reasonably simple, and it doesn't require building the entire DFA unless the entire DFA is actually needed.
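The caching part of that is tiny; a sketch (made-up names) that wraps any start-state/step pair, such as the step() sketch further up the thread, and only materialises DFA transitions when they are first requested:

    class LazyDFA:
        """Memoise the transitions of an automaton given by a step function."""
        def __init__(self, start_state, step_fn):
            self.start = start_state
            self._step_fn = step_fn
            self._trans = {}                 # (state, char) -> next state

        def step(self, state, char):
            key = (state, char)
            if key not in self._trans:
                self._trans[key] = self._step_fn(state, char)
            return self._trans[key]

    # e.g. with the start_state()/step() sketch above and a fixed query, k = 2:
    #   dfa = LazyDFA(start_state(query, 2), lambda s, c: step(s, c, query, 2))

Nothing is ever built for transitions you never take, which is the point when only a small fraction of the full DFA is reachable from the dictionary paths you actually explore.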
[1] A quite fast C++ implementation I wrote is at https://github.com/xhochy/libfuzzymatch/blob/master/src/libf... - the repo will at some point in the future also contain other, faster methods for calculating Levenshtein distance.