
Fast Levenshtein distance using a Trie - rcfox
http://stevehanov.ca/blog/?id=114
======
a1k0n
Coincidentally, just today I implemented the greedy packed trie algorithm from
Liang's thesis on LaTeX hyphenation [1]. The idea is to store the trie into a
big array, overlapping the storage of your child nodes with one another where
possible. This reduces the memory footprint of the trie (when compared to
using a separate hash table for each node) and increases the lookup speed as
you don't need a hash table, you just index directly for each character. You
can also collapse common suffixes into the same state, saving a bunch more
space. Would be a nice enhancement to this algorithm; probably more so if it
were written in C/C++.

[1] <http://www.tug.org/docs/liang/liang-thesis.pdf>
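For the curious, here is a minimal sketch of the packing idea in Python. It is essentially the double-array formulation of the same trick; the function names, the `"\0"` end-of-word marker, and the lowercase-ASCII assumption are mine, not Liang's.

```python
from collections import deque

def build_trie(words):
    """Nested-dict trie; "\0" marks end of word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["\0"] = None
    return root

def code(ch):
    # Assumes lowercase ASCII words; "\0" gets slot offset 0.
    return 0 if ch == "\0" else ord(ch) - ord("a") + 1

def pack(root):
    """Greedily pack every node's edges into one shared array.
    A transition from state s on ch lands at slot base[s] + code(ch),
    and is valid only if check[slot] == s (the ownership test)."""
    base, check = {}, {}
    queue = deque([(0, root)])  # state 0 is the root
    while queue:
        s, node = queue.popleft()
        chars = sorted(node, key=code)
        b = 1
        # First offset whose needed slots are all free -- nodes with
        # disjoint edge sets end up overlapping in the same region.
        while any(b + code(c) in check for c in chars):
            b += 1
        base[s] = b
        for c in chars:
            t = b + code(c)
            check[t] = s
            if node[c] is not None:
                queue.append((t, node[c]))
    return base, check

def lookup(base, check, word):
    s = 0
    for ch in word + "\0":
        t = base[s] + code(ch)
        if check.get(t) != s:
            return False
        s = t
    return True
```

With `["cat", "car", "do"]`, several states end up sharing base offset 1, which is exactly the storage overlap described above.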

------
cletus
The author makes a factual error regarding Norvig's version. He writes:

> Peter Norvig approaches the problem using a different way of thinking. He
> first stores his dictionary in a hash-table for fast lookup. Then he goes
> through hundreds, or even thousands of combinations of spelling mutations of
> the target word and checks if each one is in the dictionary. This system is
> clever, but breaks down quickly if you want to find words with an error
> greater than 1.

Norvig writes [1]:

> The literature on spelling correction claims that 80 to 95% of spelling
> errors are an edit distance of 1 from the target. As we shall see shortly, I
> put together a development corpus of 270 spelling errors, and found that
> only 76% of them have edit distance 1. Perhaps the examples I found are
> harder than typical errors. Anyway, I thought this was not good enough, so
> we'll need to consider edit distance 2. That's easy: just apply edits1 to
> all the results of edits1:
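The edits1/edits2 scheme he describes, condensed from his post:

```python
# Condensed from Norvig's post: every string one edit away, then two.
def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # "just apply edits1 to all the results of edits1"
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}
```

For a word of length n, edits1 already generates 54n + 25 candidates, which is why edits2 blows up so quickly.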

I wrote some code for this a few months ago and tried tries, which I
ultimately rejected in favor of a fast Levenshtein implementation. On a side
note, the author misses one other possible avenue: a family of algorithms
that use bit-vector operations [2] [3] to measure edit distance.
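The best-known member of that family is Myers' bit-parallel algorithm; here is a sketch of my own transcription, masked to m bits because Python integers are unbounded:

```python
# Myers' (1999) bit-parallel edit distance: the DP column is encoded in
# two bit vectors (Pv = positions that increase, Mv = positions that
# decrease), giving O(n * m / w) time on w-bit machine words.
def myers(a, b):
    m = len(a)
    if m == 0:
        return len(b)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    # Peq[ch]: bitmask of the positions where ch occurs in a
    Peq = {}
    for i, ch in enumerate(a):
        Peq[ch] = Peq.get(ch, 0) | (1 << i)
    Pv, Mv, score = mask, 0, m
    for ch in b:
        Eq = Peq.get(ch, 0)
        Xv = Eq | Mv
        Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq
        Ph = (Mv | ~(Xh | Pv)) & mask
        Mh = Pv & Xh
        if Ph & high:          # bottom cell of the column went up
            score += 1
        elif Mh & high:        # bottom cell of the column went down
            score -= 1
        Ph = ((Ph << 1) | 1) & mask
        Mh = (Mh << 1) & mask
        Pv = (Mh | ~(Xv | Ph)) & mask
        Mv = Ph & Xv
    return score
```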

Anyway, IIRC my code went the opposite way: it traversed the word, then
looked in the trie for matches. This quickly broke down as edit distance
increased, and I suspect this is what the author is observing in the other
implementation.

The author's approach of going the other way is uninteresting idea and I might
retrofit that to my code to see how it performs.

[1]: <http://norvig.com/spell-correct.html>

[2]: <http://portal.acm.org/citation.cfm?id=846095>

[3]: <http://www.odinlake.net/papers/onsjo-cudastrmatch-sigal09.pdf>

~~~
wahnfrieden
Interesting, thanks.

BTW, do you use speech recognition? I think you typod "an interesting" as
"uninteresting" at the end there. :)

~~~
cletus
iPad. I spotted that later, too late to fix. :(

------
mmaunder
This is great. For a prepackaged implementation that works with any language
with a Sphinx API check out misc/suggest in the sphinx tarball.

<http://sphinxsearch.com/downloads/release/>

It also lets you feed in your own dictionary from your index and works great
for spelling suggestions. I haven't benchmarked it but it seems extremely fast
- as sphinx is with most things.

Here's the suggest text file from the sphinx 0.9.9 source:

0) What's this. This sample shows how to implement simple keyword correction
suggestions (ie. "did you mean") using Sphinx.

1) Requirements. You will need Sphinx, MySQL, and PHP CLI.

2) Quickstart. (Skip first indexer command to use bundled sample.)

    indexer YOURINDEX --config YOURCONFIG.CONF --buildstops dict.txt 100000 --buildfreqs
    cat dict.txt | php suggest.php --builddict > dict.sql
    mysql -u root test < dict.sql
    indexer --config suggest.conf --all
    searchd --config suggest.conf
    php suggest.php --query sphynx

------
srean
Is the upper bound

    O( <number of words> * <max word length> ^2 )

correct? It seems to me that the upper bound should be

    O( <number of words> * <max word length> ) ^2

Looks like a typo.

In fact, there is another fairly simple optimization one can make if one is
interested only in computing the distance, not the actual edit sequence. If
you look at the algorithm for filling in the table, you will see that each
row refers only to the previous row, so one can work with just two
consecutive rows, substantially reducing the space complexity (though not,
unfortunately, the time complexity).
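The two-row version looks like this (a generic sketch, not the article's code):

```python
# Classic two-row Levenshtein: each DP row only reads the row above it,
# so two rows of length len(b)+1 are all the storage needed.
def levenshtein_two_rows(a, b):
    prev = list(range(len(b) + 1))   # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                   # distance from a[:i] to empty b
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```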

The time complexity can, however, be reduced to O(n * d) on average, where n
is the length of the longer word and d is the edit distance between the two
words; this optimization is due to Ukkonen. There have been other
optimizations since then; the best I believe gives a time complexity of
O(n + d^2).
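A sketch of the band idea behind Ukkonen's optimization: run the DP only on a diagonal band of half-width k, and double k until the answer fits (the function names are mine):

```python
# Sketch of Ukkonen's band-doubling scheme.
def banded(a, b, k):
    """Edit distance if it is <= k, else None; only DP cells with
    |i - j| <= k are computed."""
    if abs(len(a) - len(b)) > k:
        return None
    INF = k + 1  # any path through an out-of-band cell costs > k
    prev = [j if j <= k else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [INF] * (len(b) + 1)
        if i <= k:
            curr[0] = i
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            curr[j] = min(prev[j] + 1,
                          curr[j - 1] + 1,
                          prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = curr
    return prev[-1] if prev[-1] <= k else None

def ukkonen_distance(a, b):
    k = 1
    while True:
        d = banded(a, b, k)
        if d is not None:
            return d
        k *= 2  # total work is dominated by the final O(n * d) pass
```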

~~~
xtacy
The consecutive row optimisation is an interesting one and in practice, it has
a large impact on the runtime, even though the theoretical time complexity
does not change.

Having a smaller table => it can fit in the cache => better temporal locality
of reference => faster runtime.

~~~
srean
Very true. Fast matrix and linear algebra libraries like BLAS and LAPACK are
fast not because they use different algorithms but because they optimize for
cache reuse. It's the same reason why binary search trees sometimes perform
better than (unoptimized) hash tables.

------
batterseapower
It looks like you get this algorithm almost directly if you just memoise a
very simple recursive Levenshtein distance function.

(This insight could be implemented directly in Haskell by using e.g. the
MemoTrie package)

The standard imperative Levenshtein algorithm (with dynamic programming) is
also basically just a memoised version of that recursive version -- except
that the memoisation is cleared immediately after the result you originally
asked for is returned.
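A Python rendering of the same observation, with functools.lru_cache standing in for MemoTrie:

```python
from functools import lru_cache

# Memoising the naive recursion recovers the standard DP: each distinct
# (i, j) prefix pair is computed once, giving O(len(a) * len(b)) time.
def lev_memo(a, b):
    @lru_cache(maxsize=None)
    def d(i, j):
        # distance between prefixes a[:i] and b[:j]
        if i == 0:
            return j
        if j == 0:
            return i
        return min(d(i - 1, j) + 1,                           # deletion
                   d(i, j - 1) + 1,                           # insertion
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))  # substitution
    return d(len(a), len(b))
```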

------
jamii
I had a post on HN a while back about an extension of this problem: matching
the search term against all substrings of the corpus strings (eg searching for
a misquote in a list of books).

<http://news.ycombinator.com/item?id=1982351>
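For reference, the substring variant needs only a small change to the standard DP (the trick usually credited to Sellers). This is my own sketch, not jamii's code:

```python
# Approximate substring matching: a zero first row lets a match start at
# any position in the text, and taking the minimum over the last row lets
# it end at any position.
def best_substring_distance(pattern, text):
    prev = [0] * (len(text) + 1)     # a match may begin anywhere: cost 0
    for i, pc in enumerate(pattern, 1):
        curr = [i]
        for j, tc in enumerate(text, 1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (pc != tc)))
        prev = curr
    return min(prev)                 # a match may end anywhere
```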

------
martinkallstrom
This post impressed me a lot. Steve has improved a famous algorithm for a
special but very common case by combining it with one of the coolest data
structures out there. And (applying a pinch of poetic freedom here) yielded
results that surpass those of the academics and giants of the field in
versatility and power. All with a very terse piece of code that fits within
a blog post. Very inspiring!

~~~
mfukar
> And (applying a pinch of poetic freedom here) yielded results that surpass
> those of the academics and giants of the field in versatility and power.

I (along with a dozen other students) was taught an IR course where we had
to implement a search engine from scratch. We used tries both to provide
spelling suggestions and to build inverted indices for text. That was 7
years ago.

Not to discredit Steve, he's done great work and his article is a great
resource, but it's not the groundbreaking discovery you make it out to be.

PS. Tries are awesome. <3

------
djhworld
Awesome.

I love tries. I implemented one last year for a bog-standard auto-complete
suggestion box. While they may not be the fastest or most efficient data
structures in the world, they're simple to write and have very reasonable
performance.
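A toy version of such an autocomplete (my own sketch, not djhworld's code):

```python
# Toy autocomplete trie: nested dicts, with "$" marking a complete word.
def insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True

def complete(trie, prefix):
    # Walk down to the node for the prefix, then enumerate its subtree.
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results = []
    def walk(n, acc):
        for key, child in sorted(n.items()):
            if key == "$":
                results.append(prefix + acc)
            else:
                walk(child, acc + key)
    walk(node, "")
    return results
```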

------
ma2rten
Did I misunderstand what he said or is he hosting rhymebrain.com on a netbook
??

~~~
brown9-2
Yes <http://stevehanov.ca/blog/index.php?id=71>

