

Faster Spelling Correction algorithm (2012) - tyn
http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/

======
smoyer
Pre-processing a fixed data-set in order to optimize search is a pretty well
known technique; the phone companies have been using "preordering"
(specifically Modified Preorder Tree Traversal [1]) as a means of routing
phone calls and diagnosing network issues for a very long time.
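
For anyone unfamiliar, here's a bare-bones sketch of the MPTT idea in Python
(my own illustration, not taken from [1]): a preorder walk assigns each node a
(left, right) pair, so subtree-membership queries become simple range checks.

    # Bare-bones MPTT numbering: a preorder walk gives each node a
    # (left, right) pair, so "is b inside a's subtree?" becomes a range check.

    def mptt_number(tree, counter=None, out=None):
        """tree is (name, [children]); returns {name: (left, right)}."""
        if counter is None:
            counter, out = [1], {}
        name, children = tree
        left = counter[0]
        counter[0] += 1
        for child in children:
            mptt_number(child, counter, out)
        out[name] = (left, counter[0])
        counter[0] += 1
        return out

    nums = mptt_number(("root", [("a", [("a1", [])]), ("b", [])]))
    a, a1 = nums["a"], nums["a1"]
    print(a[0] < a1[0] and a1[1] < a[1])  # True: a1 falls inside a's range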

The downside to any preorder algorithm is that you must rerun the preorder
generation code anytime the input changes (in this case a dictionary) and
often you must allocate extra storage to hold the pre-processed input.

This is a really interesting algorithm but, as always, you have to know the
pros and cons and apply it where it fits (i.e. not somewhere that needs a
compressed dictionary).

[1] [http://www.codeproject.com/Articles/31669/Hierarchical-Tree-Represented-by-Modified-Preorder](http://www.codeproject.com/Articles/31669/Hierarchical-Tree-Represented-by-Modified-Preorder)

------
philjohn
See also the Levenshtein automata used by Lucene 4.0 for fuzzy matching ...
fun story: they had to treat some Python code as a "black box" when
implementing it ...

[http://java.dzone.com/news/lucenes-fuzzyquery-100-times](http://java.dzone.com/news/lucenes-fuzzyquery-100-times)

Looks like some interesting research and dev going on in this space at the
moment.

------
TheLoneWolfling
Hmm...

They don't mention the approach I'd take for a naive spellcheck:

Generate a (giant) trie from the dictionary. Then perform a DFS on the trie,
allowing at most <the current closest word's number of edits> edits from the
input word and keeping track of the edits done so far - try no change at each
node first, then a deletion, then a substitution, then an insertion. This
works better with a smaller alphabet size.
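
Roughly, in Python (my own sketch of the walk described above; names are made
up, and ties go to whichever word the DFS reaches first):

    # Toy trie + bounded-edit DFS. The budget tightens as better matches
    # are found, which is the <current closest word> pruning from above.

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.is_word = False

    def build_trie(words):
        root = TrieNode()
        for word in words:
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True
        return root

    def closest_word(root, target, max_edits=2):
        """Return (word, edits) for one closest dictionary word within max_edits."""
        best = [None, max_edits + 1]  # best word so far, and what it cost

        def dfs(node, i, edits, prefix):
            if edits >= best[1]:
                return  # prune: can't beat the current closest word
            if i == len(target) and node.is_word:
                best[0], best[1] = prefix, edits
            # no change: follow the child matching the next input character
            if i < len(target) and target[i] in node.children:
                dfs(node.children[target[i]], i + 1, edits, prefix + target[i])
            # deletion: drop one character of the input
            if i < len(target):
                dfs(node, i + 1, edits + 1, prefix)
            for ch, child in node.children.items():
                # substitution: spend the next input character on another letter
                if i < len(target) and ch != target[i]:
                    dfs(child, i + 1, edits + 1, prefix + ch)
                # insertion: take a trie letter without consuming input
                dfs(child, i, edits + 1, prefix + ch)

        dfs(root, 0, 0, "")
        return best[0], best[1]

    trie = build_trie(["hello", "help", "world"])
    print(closest_word(trie, "helo"))  # ('hello', 1); 'help' ties but loses to the first hit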

An optimization: a DAWG is potentially even better space-wise than a hashmap
(effectively it ends up compressing the input data), but takes a while to
precompute.

------
dvirsky
It's a nice approach; I implemented it in Go a while back. It's really fast,
but the memory cost is huge, especially if you store bigrams and not just
single words. But it's really elegant in its simplicity.

BTW - if anyone's interested in the Go implementation, let me know and I'll
try to find it and post it somewhere.

------
robert_tweed
Interesting, though I'm curious whether this type of algorithm is state of
the art for accuracy. Spell checkers these days are pretty much fast enough,
but they aren't always terribly good at suggesting replacements. I would
imagine that systems based on n-grams are more effective.

~~~
philjohn
The article mentions it's for search engines, in which case you may be
checking against a massive wordlist taken from a search index, so accuracy
isn't (quite) as important as speed.

------
discardorama
Previous discussions:

[https://news.ycombinator.com/item?id=4080791](https://news.ycombinator.com/item?id=4080791)

[https://news.ycombinator.com/item?id=8201598](https://news.ycombinator.com/item?id=8201598)

[https://news.ycombinator.com/item?id=7048225](https://news.ycombinator.com/item?id=7048225)

------
taway2012
To me, it seems to be the same as the Norvig algorithm, except that only
deletes are considered, which allows them to pre-calculate misspellings at a
modest storage cost.

Am I mistaken?
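
If that reading is right, the pre-calculation would look roughly like this
(my own sketch, not the article's code; delete-distance hits still need a
true edit-distance check afterwards):

    from itertools import combinations

    def deletes(term, d=2):
        """All strings reachable from term by deleting up to d characters."""
        out = {term}
        for k in range(1, min(d, len(term)) + 1):
            for idx in combinations(range(len(term)), k):
                out.add("".join(ch for i, ch in enumerate(term) if i not in idx))
        return out

    # Pre-calculation: map every delete-variant back to the words producing it.
    index = {}
    for word in ["hello", "help", "world"]:
        for variant in deletes(word):
            index.setdefault(variant, set()).add(word)

    # Lookup: a query matches if any of its own delete-variants is indexed.
    def candidates(query, d=2):
        found = set()
        for variant in deletes(query, d):
            found |= index.get(variant, set())
        return found

    print(candidates("helo"))  # {'hello', 'help'}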

------
mumrah
Anyone know how this compares to the FST/DFA stuff in Lucene 4.x?

------
dang
More or less a repost of
[https://news.ycombinator.com/item?id=7048225](https://news.ycombinator.com/item?id=7048225).

------
piqufoh
Note: this article is from June 2012.

------
jheriko
hmmm... 1000x? how does it scale though?

~~~
simias
It's easy to say "1000x" when they don't mention the reference point :)

Actually, even after reading the article I'm not completely certain where
that 1000x comes from; I suppose it's compared to the naive approach.

~~~
gorhill
I think it's compared to Peter Norvig's approach:

> This is three orders of magnitude less expensive (36 terms for n=9 and d=2)

Three orders of magnitude is 1000. Peter Norvig's approach was described above
with:

> still expensive at search time (114,324 terms for n=9, a=36, d=2)

So 114,324 / 36 ≈ 3,176, hence "three orders of magnitude", and I suppose he
went conservative by saying "1000x" rather than "3000x".
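
A quick sanity check on that ratio (both counts are the figures reported in
the article, not recomputed here):

    # Candidate-term counts for n=9, a=36, d=2, as reported in the article.
    norvig_terms = 114324  # generate-all-edits candidates at search time
    delete_terms = 36      # delete-only candidates
    print(norvig_terms / delete_terms)  # ~3175.7, i.e. three orders of magnitude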

------
grogenaut
1000x is still O(1000N) == O(N). So technically it's not asymptotically faster.

~~~
JoeAltmaier
Except that 'spelling' is a bounded problem. There aren't going to be more
words in the dictionary, at least not orders more, any time soon. 'Scaling as
the word list grows' is meaningless. So the only definition of 'faster' that
matters is the one they used.

~~~
toolslive
That's a rather "English only" view on the topic. Spelling is not a bounded
problem in every language.

Some compounding languages (Dutch, German) have an effectively unbounded
number of words, since you can take 2 nouns (e.g. a and b) and concatenate
them to form a new one (ab means something other than ba). Some languages
also have inflection....
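
A toy illustration of why compounding keeps the word list open-ended (made-up
concatenations; real Dutch and German compounds often insert linking
elements, which this ignores):

    # Naive noun concatenation: 3 nouns already yield 6 ordered compounds,
    # and compounds can compound again, so the word list never closes.
    nouns = ["boek", "kast", "deur"]  # Dutch: book, cabinet, door
    compounds = [a + b for a in nouns for b in nouns if a != b]
    print(compounds)  # 'boekkast' and 'kastboek' would be different words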

