
Fast and easy Levenshtein distance using a trie (2011) - sytelus
http://stevehanov.ca/blog/index.php?id=114
======
fchollet
Here's an alternative that is more efficient at large scales (e.g. if you have
one order of magnitude more entries than just an English dictionary):
performing Approximate Nearest Neighbor search [1] over a pre-computed
Levenshtein space.

\- hash all your dictionary entries through a Levenshtein-locality-sensitive
hash function [2] with a well-chosen hash size. Any two words within a
Levenshtein distance D of each other will have the same hash (you pick D).

\- store the hashes into an efficient DB for quick lookup. Each hash will map
to one or several (levenshtein-similar) dictionary entries.

\- when presented with a new input, compute its hash, retrieve the entry for
the hash in your DB (if it exists).

\- finish by sorting the retrieved dictionary entries by Levenshtein distance
to your input... you just recovered all dictionary entries closer than
distance D to your input (approximately).

Pretty much unbeatable, and works over spaces that have hundreds of millions
of dictionary entries.

[1]
[https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approx...](https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximate_nearest_neighbor)

[2] Can't find one? Build your own by computing the pairwise distance matrix
of your entries, then reducing this matrix through PCA to an appropriate
number of components (the length of your hash). The coordinates of each entry
in this new space are the entry's hash (a rough sketch of this pipeline
appears below).
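
A minimal sketch of this pipeline, with some assumptions that are mine rather than the comment's: distances to a small fixed set of pivot words stand in for the full pairwise matrix, scikit-learn's PCA does the reduction, and the "hash" is a coarse quantization of the projected coordinates.

    # Hedged sketch of the hash -> bucket -> re-rank pipeline described above.
    # Assumptions (mine): pivot-word distances approximate the pairwise
    # distance matrix, and the hash is a rounded quantization of the
    # PCA-projected coordinates.
    from collections import defaultdict
    import numpy as np
    from sklearn.decomposition import PCA

    def levenshtein(a, b):
        # Classic O(len(a) * len(b)) dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    class LevenshteinLSH:
        def __init__(self, words, pivots, n_components=2, bucket_width=2.0):
            self.pivots = pivots
            self.bucket_width = bucket_width
            # Embed each word as its vector of distances to the pivots,
            # then reduce with PCA.
            X = np.array([[levenshtein(w, p) for p in pivots] for w in words],
                         dtype=float)
            self.pca = PCA(n_components=n_components).fit(X)
            self.buckets = defaultdict(list)
            for word, coords in zip(words, self.pca.transform(X)):
                self.buckets[self._hash(coords)].append(word)

        def _hash(self, coords):
            # Coarse quantization: nearby projections share a bucket.
            return tuple(np.round(coords / self.bucket_width).astype(int))

        def query(self, word, max_dist):
            x = np.array([[levenshtein(word, p) for p in self.pivots]],
                         dtype=float)
            candidates = self.buckets.get(self._hash(self.pca.transform(x)[0]), [])
            # Re-rank the retrieved bucket by exact Levenshtein distance.
            hits = sorted((levenshtein(word, c), c) for c in candidates)
            return [(d, c) for d, c in hits if d <= max_dist]

    index = LevenshteinLSH(words=["cart", "card", "care", "dart", "tart"],
                           pivots=["aaaa", "zzzz", "cart"])
    print(index.query("carf", max_dist=2))
    # e.g. [(1, 'card'), (1, 'care'), (1, 'cart'), (2, 'dart'), (2, 'tart')]

As discussed downthread, a single quantized hash cannot actually guarantee that every pair within distance D collides; pairs near bucket boundaries get split, which is why LSH schemes typically query several offset hash tables.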

~~~
srean
> Build your own by computing the pairwise distance matrix of your entries,
> then reducing this matrix through PCA to an appropriate number of components
> (the length of your hash).

You are kidding, right? I can't imagine anyone seriously suggesting PCA on a
half-million by half-million matrix, or evaluating this matrix in the first
place.

This is actually a fairly actively researched topic: whether edit distance can
be embedded, even approximately, in a vector space. It has been surprisingly
resistant to such efforts. Yes, there are some results, but nothing very
satisfactory.

Edit distance is a metric, so it was believed it would be easy to map
individual strings to vectors such that the distance between two vectors was
the same as, or close to, the edit distance between the corresponding strings.
What we want, then, is an (approximately) distance-preserving map, an isometric
embedding, between these two metric spaces. In addition, in the interest of
size, we would like these vectors to be of small dimension. If the required
dimension is high, it defeats the purpose, particularly if the string-to-vector
map needs to be stored. Unfortunately, it seems edit distance is resistant to
embedding into a low-dimensional vector space. There is also the question of
which metric to use in the vector space: Euclidean, Manhattan...? There are
some results for L1, i.e. embedding into a vector space with the Manhattan
distance.
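
For concreteness, the property being chased can be written as a low-distortion embedding into L1 (my formalization, not part of the original comment):

    f : (\Sigma^*, d_{\mathrm{edit}}) \to (\mathbb{R}^m, \lVert \cdot \rVert_1),
    \qquad
    d_{\mathrm{edit}}(x, y) \;\le\; \lVert f(x) - f(y) \rVert_1 \;\le\; c \cdot d_{\mathrm{edit}}(x, y)

The aim is to keep both the distortion c and the dimension m small at the same time; that combination is what has proved hard for edit distance.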

Fun story: I was an intern on Google's spelling-correction team once upon a
time, long ago. I was explaining what I was doing, and the person I was
explaining it to was thoroughly unimpressed: "So what's the big deal? You
know/hear the word and write down the spelling." It made me realize how
difficult spelling in English is compared to some other languages. The
variability in the map from phonemes to spellings is quite high in English.
This person was Italian; he told me Italian has a lot less variability in this
regard.

~~~
StefanKarpinski
PCA is basically just SVD, and computing an SVD of that size is not
unreasonable, even if it's dense: 0.5e6^2 = 0.25e12 entries, i.e. 1 TB of
32-bit floats. That's a lot of memory, but there are reasonably priced machines
($50k or so) with that much RAM these days. If it's sparse, this can be done on
a laptop, and it might not even be necessary to materialize the matrix at all
if you use Lanczos iteration.
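
A toy sketch of that last point, under my own assumptions (SciPy's ARPACK-backed svds standing in for the Lanczos-style iteration, and a tiny word list standing in for the dictionary): the solver only needs matrix-vector products, so the distance matrix is generated a row at a time and never stored.

    # Sketch: leading singular triplets of the pairwise Levenshtein matrix
    # via an iterative solver, without materializing the matrix.
    import numpy as np
    from scipy.sparse.linalg import LinearOperator, svds

    words = ["cat", "cart", "card", "dog", "dodge", "dot"]
    n = len(words)

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def matvec(v):
        # D @ v, generating each row of D on the fly (this recomputation is
        # exactly the cost being argued about in this subthread).
        return np.array([sum(levenshtein(wi, wj) * v[j]
                             for j, wj in enumerate(words))
                         for wi in words])

    # The distance matrix is symmetric, so the adjoint product is the same.
    D = LinearOperator((n, n), matvec=matvec, rmatvec=matvec, dtype=np.float64)
    U, s, Vt = svds(D, k=2)   # two leading singular values/vectors
    print(s)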

~~~
srean
Let's put it this way, then: we disagree on what's a reasonable expenditure.
1 TB of RAM? $50k? Could you fund me please :) ok, adding _pretty_ please.

Note it's not just memory, it's computation too. Every entry of that matrix is
the solution to a dynamic program; if you don't store it, you have to recompute
it. The DP scales as the product of the lengths of the two strings, and those
lengths have a fat tail.

You could still get away with it if this matrix were close to a low-rank
matrix with large gaps between adjacent singular values. It's not, and that's a
problem. Note you have to store the SVD factors too.

EDIT: Interesting! My comments seem to have annoyed someone. I did not expect
downvotes on my comments here; I thought they were pretty uncontroversial. If
you, dear reader, can leave some clues, it would be helpful and appreciated. I
upvote those who explain what they found unpalatable about my comments,
regardless of agreement.

~~~
StefanKarpinski
It's a lot of computation, granted, but I just wanted to point out that it
wasn't completely ludicrous. The proposed algorithm doesn't work anyway since
there are no non-degenerate hash functions that map all words within D edits
of each other to the same hash value (as I pointed out here:
[https://news.ycombinator.com/item?id=8158942](https://news.ycombinator.com/item?id=8158942)).

------
danieldk
Shameless plug: I have a Java library that does this even more efficiently. It
constructs a Levenshtein automaton of a word in linear time and then computes
the intersection between a dictionary (stored in a deterministic acyclic
minimal finite state automaton) and the Levenshtein automaton. It also has
other features, such as:

\- Perfect hash automata (an automaton that gives a unique hash for every
word/sequence).

\- String-keyed maps that use a finite state automaton to store keys.

More info:

[https://github.com/danieldk/dictomaton](https://github.com/danieldk/dictomaton)

[http://search.maven.org/#artifactdetails%7Ceu.danieldk.dicto...](http://search.maven.org/#artifactdetails%7Ceu.danieldk.dictomaton%7Cdictomaton%7C1.1.1%7Cjar)
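
Not the library's linear-time DFA construction, but a toy Python sketch of the same intersection idea (my own simplification): simulate the Levenshtein automaton for the query as a set of (characters-matched, errors) NFA states while walking a trie of the dictionary, pruning any branch whose state set dies.

    # Toy sketch: intersect a simulated Levenshtein NFA with a dictionary trie.
    class TrieNode:
        def __init__(self):
            self.children = {}
            self.is_word = False

    def build_trie(words):
        root = TrieNode()
        for w in words:
            node = root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True
        return root

    def _close(states, w, k):
        # Epsilon closure: deletions of characters of the query word w.
        out, stack = set(states), list(states)
        while stack:
            i, e = stack.pop()
            if i < len(w) and e < k and (i + 1, e + 1) not in out:
                out.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return out

    def _step(states, ch, w, k):
        # Consume one character ch of a candidate dictionary word.
        nxt = set()
        for i, e in _close(states, w, k):
            if i < len(w) and w[i] == ch:
                nxt.add((i + 1, e))          # match
            if e < k:
                nxt.add((i, e + 1))          # insertion of ch
                if i < len(w):
                    nxt.add((i + 1, e + 1))  # substitution
        return nxt

    def _accepts(states, w, k):
        # The remaining characters of w can still be deleted within budget.
        return any(len(w) - i + e <= k for i, e in states)

    def fuzzy_search(root, word, k):
        results = []

        def walk(node, prefix, states):
            if node.is_word and _accepts(states, word, k):
                results.append(prefix)
            for ch, child in node.children.items():
                nxt = _step(states, ch, word, k)
                if nxt:                      # prune dead branches
                    walk(child, prefix + ch, nxt)

        walk(root, "", {(0, 0)})
        return results

    trie = build_trie(["food", "good", "goad", "gold", "mood", "yield"])
    print(fuzzy_search(trie, "good", 1))
    # e.g. ['food', 'good', 'goad', 'gold', 'mood'] (order follows the trie)

A real implementation typically builds a deterministic Levenshtein automaton up front rather than simulating the NFA character by character, which is part of what makes the intersection fast.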

------
andrewvc
Also interesting: the cutting-edge work Lucene has done to make this possible
in 4.x, using universal automata to speed up fuzzy, regex, and other searches.

[http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is...](http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html)

------
sassyalex
Here is a smart implementation of the Levenshtein algorithm:
[https://github.com/jefarmstrong/sortzzy](https://github.com/jefarmstrong/sortzzy)

I use it in "common pattern matching" for HTTP routes and it works pretty well.

------
Goopplesoft
Here are some sources on non-standard Python data structures (including a text
trie) that I've found very useful:
[http://kmike.ru/python-data-structures/](http://kmike.ru/python-data-structures/)

I've also used this C implementation of the Levenshtein distance with much
success:
[https://code.google.com/p/pylevenshtein/](https://code.google.com/p/pylevenshtein/)

------
dangirsh
Either it's not that well known or I'm missing the point, but the BK-tree [1]
seems relevant here. I haven't seen it mentioned yet.

[1]: [http://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data...](http://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/)
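
A small sketch of the idea, assuming the standard BK-tree construction: children are keyed by their distance to the parent, and queries prune subtrees with the triangle inequality.

    # Minimal BK-tree sketch: at each node, only children whose edge distance
    # lies in [d - k, d + k] can contain matches (triangle inequality).
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    class BKTree:
        def __init__(self, words):
            it = iter(words)
            self.root = [next(it), {}]          # [word, {distance: child}]
            for w in it:
                self.add(w)

        def add(self, word):
            node = self.root
            while True:
                d = levenshtein(word, node[0])
                if d == 0:
                    return                      # already present
                if d in node[1]:
                    node = node[1][d]
                else:
                    node[1][d] = [word, {}]
                    return

        def query(self, word, k):
            hits, stack = [], [self.root]
            while stack:
                node_word, children = stack.pop()
                d = levenshtein(word, node_word)
                if d <= k:
                    hits.append((d, node_word))
                stack.extend(child for dist, child in children.items()
                             if d - k <= dist <= d + k)
            return sorted(hits)

    tree = BKTree(["book", "books", "cake", "boo", "cape", "cart"])
    print(tree.query("bo", 2))   # [(1, 'boo'), (2, 'book')]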

------
dmritard96
Nice write-up and a cool solution. Tries are also an awesome way to implement
a Boggle solver; I did that for a course back in school.
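
A rough sketch of how that works, with my own tiny example board and word list: the trie lets the depth-first search abandon any path whose letters are not a prefix of some dictionary word.

    # Sketch of trie-pruned Boggle solving: DFS from every cell, descending
    # into the trie in lockstep with the board walk.
    def build_trie(words):
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node["$"] = True                   # end-of-word marker
        return root

    def solve_boggle(board, words, min_len=3):
        trie = build_trie(words)
        rows, cols = len(board), len(board[0])
        found = set()

        def dfs(r, c, node, prefix, seen):
            ch = board[r][c]
            if ch not in node:
                return                         # no word has this prefix: prune
            node = node[ch]
            prefix += ch
            if "$" in node and len(prefix) >= min_len:
                found.add(prefix)
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols \
                            and (nr, nc) not in seen:
                        dfs(nr, nc, node, prefix, seen | {(nr, nc)})

        for r in range(rows):
            for c in range(cols):
                dfs(r, c, trie, "", {(r, c)})
        return found

    board = ["cat", "rse", "dog"]              # 3x3 grid, one string per row
    print(solve_boggle(board, ["cat", "cats", "dog", "set", "rose", "cart"]))
    # {'cat', 'cats', 'dog', 'set', 'rose'} -- "cart" needs a non-adjacent step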

------
elchief
Levenshtein is pretty useless compared to Damerau-Levenshtein or Jaro-Winkler.

~~~
adwf
The advantage of basic Levenshtein is that it's a heck of a lot quicker (last
time I checked) than Damerau-L or Jaro-Winkler in real-world applications.

And I wouldn't go as far as to say it's useless. Basic Levenshtein only misses
transpositions in comparison to Damerau-L, but it still catches them as
insert/delete combos (at a cost of two edits instead of one).
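
To make that concrete, here is a small comparison under my assumptions, using the optimal-string-alignment variant of Damerau-Levenshtein (the one that adds a transposition operation to the usual DP):

    # Plain Levenshtein prices a swapped pair as two edits; the OSA variant
    # of Damerau-Levenshtein prices it as a single transposition.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def osa_damerau(a, b):
        # Full DP table so the transposition case can look two rows back.
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = a[i - 1] != b[j - 1]
                d[i][j] = min(d[i - 1][j] + 1,           # deletion
                              d[i][j - 1] + 1,           # insertion
                              d[i - 1][j - 1] + cost)    # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[-1][-1]

    print(levenshtein("receive", "recieve"))    # 2: the swap costs two edits
    print(osa_damerau("receive", "recieve"))    # 1: counted as one transposition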

------
rasz_pl
> searches contain mispelled words, and users will expect these searches to
> magically work

HELL NO, I expect searches to return "result not found". This is important
information, as important as the actual found results. I don't mind clearly
labelled suggestions, but NEVER EVER give me what you think I want instead of
what I actually asked for.

~~~
Gigablah
You're not the typical user then.

~~~
twic
I don't think we actually know what the typical user wants. Or is one of us
aware of some really good user research on this subject?

One of our common failings as programmers is making incorrect assumptions
about what the user wants. rasz_pl is assuming that the user is like him, and
values precision. Gigablah is assuming that the user is not like rasz_pl, and
does not value precision. Both of those are simply assumptions.

I hear this all the time. "But of course the user will want to be able to
choose exactly which columns of data are in the table!". "But of course the
user would rather see one nice simple summary number than the actual
details!". Both of those are scary.

