Not to bring up our previous discussion again, but your complexity analysis is still misleading. Your "O(length of string) version" takes O(length of string) per step, not for the total time. To match an entire word (which you have to do to ensure that you have a match), you have to call step O(length of word) many times, so you're back to quadratic complexity.
Also, it would have been nice if you had done your analysis with k as a variable instead of saying linear when you actually mean "O(nk) for a very small k". Furthermore, if you have a maximum edit distance k, you can also implement the dynamic programming approach for a pair of strings in O(nk) time [1].
Still, your description prunes the trie nicely. It's a nice and simple method that's a lot faster than the naive implementation and good enough for nearly all applications. Have you done any benchmarking by any chance?
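For concreteness, here is a minimal sketch of that banded O(nk) DP in Python (my own code, not the C++ from [1]): with a maximum edit distance k, only cells within k of the main diagonal can ever hold a value <= k, so each row has at most 2k+1 interesting cells.

    def bounded_levenshtein(a, b, k):
        """Edit distance between a and b if it is <= k, otherwise None."""
        if abs(len(a) - len(b)) > k:
            return None
        INF = k + 1                      # stands in for "already more than k"
        prev = list(range(len(b) + 1))   # row 0 of the usual DP table
        for i in range(1, len(a) + 1):
            cur = [INF] * (len(b) + 1)
            cur[0] = i
            # only the band of columns within k of the diagonal is computed;
            # anything derived from an out-of-band INF cell stays > k, so the
            # final <= k test is unaffected
            for j in range(max(1, i - k), min(len(b), i + k) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # a[i-1] unmatched
                             cur[j - 1] + 1,      # b[j-1] unmatched
                             prev[j - 1] + cost)  # match or substitution
            prev = cur
        return prev[-1] if prev[-1] <= k else None

    print(bounded_levenshtein("lucene", "lucane", 2))   # 1
    print(bounded_levenshtein("lucene", "apache", 2))   # None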
The big win here is not that it's as fast as the optimal algorithm (it's not), but that it's nearly as fast and vastly simpler.
You're technically right that the given code is O(n^2), but the optimization discussed later in the article can easily be applied to turn that into O(nk). Further, for the Lucene problem, k=2. That's small and effectively constant, making it "basically O(n)". That's still theoretically worse than optimal, but in practice it just doesn't matter. Even with the slightly slower algorithm, Lucene's cost of searching should be dominated by repeatedly _using_ the DFA and not by _building_ it.
What's really important is that the O(nk) construction algorithm is much simpler than the optimal version. I don't know what the Lucene guys were originally trying to do, but if they had figured out this method instead, they could have avoided the super complicated implementation of the true O(n) algorithm.
> you can also implement the dynamic programming approach for a pair of strings in O(nk) time
Yes, but that's for every pair of strings you want to test. A DFA costs exactly as much to build as solving a single problem with DP, but when a large number of pairs all have one string in common (the query string), a pre-built DFA can be reused at a cost of only O(n+k) each time. Of course, whether that's enough to make a practical difference depends on the problem at hand.
Yes, the O(length of string) is per step in the automaton. Sorry if that was unclear; it was not my intention to mislead. Note, however, that that is for the naive version. The only point of the naive version is as a pedagogical stepping stone to the second version. I put the section on building the DFA after the naive version because it provides a nice logical progression: when you try to build a DFA you want the number of states to be finite, which leads to the idea of setting all entries >k to k+1. That in turn leads to the idea of the sparse vector. If you were actually building a DFA, though, you would use the fast automaton to generate the transitions rather than the naive one.
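In code, that naive-but-capped version might look something like the following rough Python sketch (names made up, not the article's code): the automaton state is one row of the Levenshtein DP table with every entry clamped to k+1, which is what makes the number of distinct states finite.

    def start_state(query, k):
        return tuple(min(i, k + 1) for i in range(len(query) + 1))

    def step(state, char, query, k):
        """Consume one character of the candidate word; O(len(query)) per step."""
        new = [min(state[0] + 1, k + 1)]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == char else 1
            new.append(min(state[i] + 1,         # candidate char unmatched
                           new[i - 1] + 1,       # query char unmatched
                           state[i - 1] + cost,  # match or substitution
                           k + 1))               # clamp: anything > k is "dead"
        return tuple(new)

    def is_match(state, k):
        return state[-1] <= k

    def can_match(state, k):
        # the row minimum never decreases from step to step, so once every
        # entry is > k no extension of the candidate can ever match
        return min(state) <= k

    # e.g. feeding "lucane" into the automaton for "lucene" with k = 1:
    s = start_state("lucene", 1)
    for c in "lucane":
        s = step(s, c, "lucene", 1)
    print(is_match(s, 1))   # True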
I tried to stay close to the terminology of the paper. They assume that k is fixed. The complexity of one step in the non-naive automaton is O(k), where k is the max edit distance (1 or 2 in Lucene). Building the DFA is O(nk), where n is the number of states in the DFA. Note, however, that while the number of states in the DFA is linear in the size of the string, it's exponential in k. This is unavoidable: a Levenshtein DFA simply needs an exponential number of states as a function of k. So if k is large it's not a good idea to build the DFA; it's better to use the step()-based automaton directly. In fact I'm pretty sure it's better to do that in any case, because it's unlikely that the initial cost of building the DFA will pay off. If k=2 you're basically updating an array of 5 numbers at every step, and stepping in a DFA isn't going to be much faster than that. You would need a huge number of steps before it starts to pay off, and the whole point of a Levenshtein automaton is that you search only about 200 positions in the index.
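A sketch of that sparse representation (again my own code, not Lucene's): a cell at row i, column j is always at least |i - j|, so only the 2k+1 columns around the diagonal can hold values <= k. A row can therefore be stored as an offset plus at most 2k+1 numbers, which for k = 2 is the array of 5 numbers mentioned above, and a step touches only O(k) cells.

    def sparse_start(query, k):
        # columns 0..k of row 0; everything further right is already > k
        return (0, list(range(min(k, len(query)) + 1)))

    def sparse_step(state, char, query, k, row):
        # 'row' = number of candidate characters consumed after this step
        off, vals = state
        INF = k + 1                                  # stands in for "more than k"
        def old(j):                                  # previous row, column j
            return vals[j - off] if 0 <= j - off < len(vals) else INF
        new_off = max(0, row - k)
        new_vals, left = [], INF
        for j in range(new_off, min(len(query), row + k) + 1):
            val = min(old(j) + 1, left + 1)          # the two "unmatched" moves
            if j > 0:                                # match or substitution
                val = min(val, old(j - 1) + (0 if query[j - 1] == char else 1))
            left = min(val, INF)
            new_vals.append(left)
        return (new_off, new_vals)

    def sparse_is_match(state, query, k):
        off, vals = state
        j = len(query) - off
        return 0 <= j < len(vals) and vals[j] <= k

    s = sparse_start("lucene", 1)
    for row, c in enumerate("lucane", start=1):
        s = sparse_step(s, c, "lucene", 1, row)
    print(sparse_is_match(s, "lucene", 1))   # True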
> Furthermore, if you have a maximum edit distance k, you can also implement the dynamic programming approach for a pair of strings in O(nk) time
I think you can even do that without knowing the maximum distance beforehand. Maintain a priority queue of entries in the matrix, and repeatedly propagate the minimum. That way you avoid computing most of the matrix if the edit distance turns out to be small. If insert/delete/substitute costs are all 1 you don't even need a full priority queue.
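For what it's worth, a small sketch of that idea (my own Python, all costs 1): treat the DP matrix as a graph and run Dijkstra from (0, 0), so only cells whose value is at most the actual distance ever get expanded.

    import heapq

    def edit_distance_dijkstra(a, b):
        """Dijkstra over the DP grid; cell (i, j) = i chars of a and j chars of b consumed."""
        target = (len(a), len(b))
        dist = {(0, 0): 0}
        heap = [(0, 0, 0)]
        while heap:
            d, i, j = heapq.heappop(heap)
            if (i, j) == target:
                return d
            if d > dist[(i, j)]:
                continue                               # stale heap entry
            moves = []
            if i < len(a) and j < len(b):
                moves.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))  # match/substitute
            if i < len(a):
                moves.append((i + 1, j, 1))            # a[i] unmatched
            if j < len(b):
                moves.append((i, j + 1, 1))            # b[j] unmatched
            for ni, nj, c in moves:
                nd = d + c
                if nd < dist.get((ni, nj), nd + 1):
                    dist[(ni, nj)] = nd
                    heapq.heappush(heap, (nd, ni, nj))

    print(edit_distance_dijkstra("hello", "hallo"))   # 1

Since the edge costs are only 0 and 1, a deque (0-1 BFS) would indeed do instead of the heap, as the comment above says.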
Note there is also an algorithm in the same paper to do the calculation without actually constructing the automaton, which is even cooler.
There are other implementations of the algorithm, Moman among them. (I believe Lucene eventually used Moman as a reference implementation when implementing it.)
The paper is certainly "not easy" to understand, but it's doable.
That one may be (I think this is the later 2004 paper by one of the same authors), but Moman definitely is an implementation of their algorithm (it turns out they mention this in a different blog post).
You can do all Levenshtein edits while looking up in a (minimised, acyclic) FSA, which is fairly easy to implement [1], but it means that instead of having one FSA state per index in the word as you look it up, you now have a set of states, each with its own N_edits counter (rough sketch below).
[1] If you want transpositions to count as a single edit you need some hackery though.
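Roughly what that looks like, using a plain dict-based trie in place of a minimised FSA (the traversal is the same; all names are made up): each frontier entry carries the current node, how many query characters have been consumed, and the edits used so far.

    END = object()   # marks "a word ends here" inside a trie node

    def build_trie(words):
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node[END] = w
        return root

    def fuzzy_lookup(trie, query, max_edits):
        results = set()
        stack = [(trie, 0, 0)]            # (node, query position, edits used)
        while stack:
            node, i, edits = stack.pop()
            if i == len(query) and END in node:
                results.add(node[END])
            if i < len(query) and edits < max_edits:
                stack.append((node, i + 1, edits + 1))             # query char dropped
            for ch, child in node.items():
                if ch is END:
                    continue
                if edits < max_edits:
                    stack.append((child, i, edits + 1))            # extra char in word
                if i < len(query):
                    cost = 0 if ch == query[i] else 1
                    if edits + cost <= max_edits:
                        stack.append((child, i + 1, edits + cost)) # match / substitute
        return results

    trie = build_trie(["hello", "hallo", "help", "world"])
    print(fuzzy_lookup(trie, "helo", 1))   # {'hello', 'help'}

As [1] notes, transpositions count as two edits here; making them cost one needs extra bookkeeping.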
It's fun to think of the regular expression which matches all strings off by Levenshtein=1. For "hello" it's something like: hello|ello|hllo|helo|hell|.hello|h.ello|he.llo|hel.lo|hell.o|hello.|.ello|h.llo|he.lo|hel.o|hell.
It seems pretty easy to generate the NFA for this and then compile it to a DFA as usual.
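A quick sketch of generating that alternation programmatically (Python, names made up). The resulting pattern could then be fed to whatever NFA-to-DFA machinery you like; here it is just compiled with the stock re module to check that it behaves as expected.

    import re

    def lev1_pattern(word):
        """Regex matching every string within Levenshtein distance 1 of word."""
        esc = [re.escape(c) for c in word]
        variants = {"".join(esc)}
        n = len(word)
        for i in range(n):
            variants.add("".join(esc[:i] + esc[i + 1:]))          # delete char i
            variants.add("".join(esc[:i] + ["."] + esc[i + 1:]))  # substitute char i
        for i in range(n + 1):
            variants.add("".join(esc[:i] + ["."] + esc[i:]))      # insert at position i
        return re.compile("^(?:" + "|".join(sorted(variants)) + ")$")

    pat = lev1_pattern("hello")
    print(bool(pat.match("helo")), bool(pat.match("hxllo")),
          bool(pat.match("heillo")), bool(pat.match("hexxo")))
    # True True True False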
Why not generate an index containing all strings + all strings off by Levenshtein distance 1 (or N)? This is what I assumed google would do, but I don't know the index size.
It's already been answered why this isn't a good idea, but there is a related approach that does work: instead of all words within Levenshtein distance N, you store all words with exactly N characters deleted from them. So for each word in your dictionary, store every string that can appear when you delete N characters from it. That's a much more reasonable number, and there are tricks to make it manageable for long words. Then when you do a query, do the same with your query word and check whether any of the results is in the index.
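A minimal sketch of that delete-variant idea (essentially the trick used by tools like SymSpell; my own code, here with up to N deletions so short words still work). Note that a few of the candidates can still be more than N edits away, so each one should be verified with a real edit-distance check afterwards.

    from itertools import combinations

    def deletes(word, n):
        """Every string obtainable by deleting up to n characters from word."""
        out = {word}
        for k in range(1, min(n, len(word)) + 1):
            for positions in combinations(range(len(word)), k):
                out.add("".join(c for i, c in enumerate(word) if i not in positions))
        return out

    def build_delete_index(words, n):
        index = {}
        for w in words:
            for d in deletes(w, n):
                index.setdefault(d, set()).add(w)
        return index

    def candidates(index, query, n):
        found = set()
        for d in deletes(query, n):
            found |= index.get(d, set())
        # verify each candidate with a real edit-distance check before using it
        return found

    index = build_delete_index(["hello", "hallo", "help", "world"], 1)
    print(candidates(index, "helo", 1))   # {'hello', 'help'}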
Yeah, realized this later. Maybe the trie (or whatever) index can have a concept of "any single character" built into it. That should be easier than turning it into a full automaton.
See the part about how you would intersect a trie and a Levenshtein automaton and find the strings that must be in both. It's not as hard as one would think.
The simple way to do this is to have a default child in the trie. But even then, it can get absurdly large. Look at something like "aaaa"(4), for instance.
That's basically what this is, except you do it for the query rather than for the index, and you build the structure more intelligently than just inserting all possible insertions/deletions/substitutions. If you wanted to make an index like that you could build a Levenshtein DFA for each term in the index, and then union all those together.
You build the DFA on-the-fly from an NFA built on-the-fly. Also reasonably simple, and it doesn't require building the entire DFA unless the entire DFA is actually needed.
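The caching part of that is tiny; a sketch (made-up names) that wraps any start-state/step pair, such as the step() sketch further up the thread, and only materialises DFA transitions when they are first requested:

    class LazyDFA:
        """Memoise the transitions of an automaton given by a step function."""
        def __init__(self, start_state, step_fn):
            self.start = start_state
            self._step_fn = step_fn
            self._trans = {}                 # (state, char) -> next state

        def step(self, state, char):
            key = (state, char)
            if key not in self._trans:
                self._trans[key] = self._step_fn(state, char)
            return self._trans[key]

    # e.g. with the start_state()/step() sketch above and a fixed query, k = 2:
    #   dfa = LazyDFA(start_state(query, 2), lambda s, c: step(s, c, query, 2))

Nothing is ever built for transitions you never take, which is the point when only a small fraction of the full DFA is reachable from the dictionary paths you actually explore.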
[1] A quite fast C++ implementation I wrote is at https://github.com/xhochy/libfuzzymatch/blob/master/src/libf... - the repo will at some point in the future also contain other, faster methods for calculating Levenshtein distance.