
1000x Faster Spelling Correction: Source Code released - boyter
http://blog.faroo.com/2012/06/24/1000x-faster-spelling-correction-source-code-released/
======
abecedarius
If you compare to Norvig's first corrector you should also compare to the much
faster [http://norvig.com/ngrams](http://norvig.com/ngrams) (which follows an
algorithm I sent him).

Added: I'd guess the newer Norvig corrector is still slower, though it is more
accurate (taking into account probabilities of different typos). There's a
curiosity not mentioned in his article: the new algorithm computes a very
slightly different set of candidates. I only noticed the difference when I got
around to checking these sets against each other for all the words in a sample
document; I'd meant them to be the same. Might be fun to check the OP's code
this way too.
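
(For concreteness, that check is just a word-by-word set diff. In this Python
sketch, `candidates_old` and `candidates_new` are hypothetical stand-ins for
the two correctors' candidate generators:)

    def compare_candidates(words, candidates_old, candidates_new):
        # Report every word where the two generators disagree.
        for w in words:
            a, b = set(candidates_old(w)), set(candidates_new(w))
            if a != b:
                print(w, "old-only:", a - b, "new-only:", b - a)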

------
haberman
The more interesting blog post that actually describes the algorithm is here:
[http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/](http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/)

~~~
somethingnew
This is really cool! It reminds me of bidirectional BFS for pathfinding. In
this case, searching from both directions cuts down the search space by three
orders of magnitude. Genius.
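
(To make the meet-in-the-middle idea concrete, here is a minimal Python
sketch. The names are mine, not FAROO's, and a real implementation would also
verify each candidate with a true edit-distance check, since delete-matching
alone admits some false positives:)

    from collections import defaultdict

    def deletes(word, max_dist):
        # All strings reachable from `word` by deleting up to max_dist chars.
        results, frontier = {word}, {word}
        for _ in range(max_dist):
            frontier = {w[:i] + w[i+1:] for w in frontier for i in range(len(w))}
            results |= frontier
        return results

    def build_index(dictionary, max_dist=2):
        # Precompute: map every delete-variant to the words producing it.
        index = defaultdict(set)
        for word in dictionary:
            for d in deletes(word, max_dist):
                index[d].add(word)
        return index

    def candidates(term, index, max_dist=2):
        # Deletes of the input meet the precomputed deletes of dictionary words.
        found = set()
        for d in deletes(term, max_dist):
            found |= index.get(d, set())
        return found

    index = build_index(["hello", "help", "world"])
    print(sorted(candidates("helo", index)))  # ['hello', 'help']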

------
jxf
I notice the authors don't actually talk about accuracy. It's 1,000x faster,
but what is gained or lost in terms of accuracy?

We can build a spelling corrector which is accurate most of the time and much
faster than the Faroo implementation:

    
    
        def corrections_for(word)
          return [word]
        end
    

P.S. I added a gist here with some reformatting because I found it hard to
read the code on the author's site:

[https://gist.github.com/fj/8393399](https://gist.github.com/fj/8393399)

The raw source is here:

[https://gist.github.com/fj/8393399/raw/7d4acd14db21dd0c7d802...](https://gist.github.com/fj/8393399/raw/7d4acd14db21dd0c7d802c7aee3000d8bffeeff2/gistfile1.cs)

~~~
pyduan
It looks like what they've done is more along the lines of finding a clever
trick to improve the runtime performance of _one of the common building
blocks_ of a spell correction system (the part that, given an input, computes
the set of candidate words close to it in terms of edit distance). It doesn't
even seem to say anything about how to pick the best candidate once you've
generated this set, so you can't really evaluate its accuracy as if it were a
complete spell correction system. Norvig originally just used the
straightforward "pick the most common word that has the smallest edit
distance", but I didn't see any mention of a selection strategy in their blog
post.
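
(For reference, that selection rule is short enough to sketch in Python. Here
`known_by_distance` and `counts` are hypothetical inputs: precomputed
candidate sets per edit distance, and corpus word frequencies:)

    def correct(word, known_by_distance, counts):
        # Prefer the smallest edit distance; break ties by corpus frequency.
        for dist in (0, 1, 2):
            hits = known_by_distance.get(dist, set())
            if hits:
                return max(hits, key=lambda w: counts.get(w, 0))
        return word  # nothing close enough: leave the input alone

    counts = {"the": 80000, "they": 9000, "then": 7000}
    print(correct("thay", {1: {"they"}, 2: {"the", "then"}}, counts))  # 'they'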

~~~
IanCal
In essence it appears to be a pre-filtering step before you look at the actual
edit distance.

I'm also not sure how they benchmarked the other solution, as that clearly
isn't optimised for runtime but for readability (it's a nice bit of code).

------
jlemoine
This approach is far from being perfect. A simple recursive implementation of
Levenshtein over a trie representing the dictionary would give better
performance!
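
(Roughly like this, as a Python sketch: walk the trie computing one
Levenshtein DP row per node, and prune any branch whose row minimum already
exceeds the limit. The structure is mine; the comment above doesn't include
code:)

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.word = None  # set to the full word at terminal nodes

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def search(root, term, max_dist):
        # All dictionary words within max_dist edits of term.
        results = []
        first_row = list(range(len(term) + 1))

        def walk(node, ch, prev_row):
            row = [prev_row[0] + 1]
            for c in range(1, len(term) + 1):
                cost = 0 if term[c - 1] == ch else 1
                row.append(min(row[c - 1] + 1,           # insertion
                               prev_row[c] + 1,          # deletion
                               prev_row[c - 1] + cost))  # substitution
            if node.word is not None and row[-1] <= max_dist:
                results.append((node.word, row[-1]))
            if min(row) <= max_dist:  # prune hopeless branches
                for next_ch, child in node.children.items():
                    walk(child, next_ch, row)

        for ch, child in root.children.items():
            walk(child, ch, first_row)
        return results

    root = TrieNode()
    for w in ["hello", "help", "shell"]:
        insert(root, w)
    print(search(root, "helo", 1))  # [('hello', 1), ('help', 1)]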

~~~
colanderman
I think you mean a trie representing the dictionary _plus_ edit-distance-2
(like the OP's algorithm does). There's no easy way to "skip" characters in a
trie.

~~~
ghusbands
You can skip characters in a trie, at least toward the right-hand end of a
word, by recursing down every path. I've done so to good (and performant)
effect before.

------
darklajid
I'm really intrigued. Working in the OCR field, this is something that we do a
lot, using in-house libraries or even dedicated 3rd party solutions. I'll
definitely give this a spin in the next couple of days.

I do wonder what happened to normal C# style here, though. I had a hard time
reading this, because names like 'suggestItem' or 'editItem' don't _look_ like
types/classes. A single uppercase/lowercase difference, and I stumbled a
couple of times.

~~~
jeltz
This algorithm seems to be based on edit distance, so it should be a poor fit
for the OCR field, since OCR rarely swaps letters.

~~~
darklajid
I'm not quite sure what you're saying here. Swap letters as in transpositions?
Yes, correct. Usually the errors are simple replacements or deletes/inserts
(which arguably might include 'swapping' a 1 for an l).

But the greater field I'm working in doesn't just hand you raw OCR output.
Most projects here include a way for typists to correct recognition mistakes
or fill in the missing pieces of information on a document. In that setting (a
human typist, often with a database of valid/expected values for each field),
transpositions aren't rare at all.

------
chatman
This is exactly the approach described in:
[http://link.springer.com/chapter/10.1007%2F978-3-642-36973-5_34](http://link.springer.com/chapter/10.1007%2F978-3-642-36973-5_34)

This is not new; it was previously known. Sensational title.

------
humbledrone
Ugh, in the first comment block one line is 128 characters long. The whole
thing is abhorrent to read in the nicely-columnized blog post; you have to
scroll to the right constantly.

~~~
joosters
Agreed. Bring back line-wrap, it's just stupid web design otherwise.

------
nullc
Has anyone explored using a covering code for spelling correction? (e.g.
[https://eprint.iacr.org/2012/731.pdf](https://eprint.iacr.org/2012/731.pdf))

------
alok-g
Not directly related, but does any algorithm take into account the layout of
the keyboard? E.g., letters closer together may be more likely candidates for
replacement...
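
(Sketching the idea in Python: a Levenshtein variant that discounts
substitutions between physically adjacent keys. The QWERTY table and the 0.5
weight here are illustrative assumptions, not anything from the thread:)

    QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

    KEY_POS = {ch: (r, c)
               for r, row in enumerate(QWERTY_ROWS)
               for c, ch in enumerate(row)}

    def sub_cost(a, b):
        # Cheap (0.5) for neighbouring keys, full cost (1.0) otherwise.
        if a == b:
            return 0.0
        pa, pb = KEY_POS.get(a), KEY_POS.get(b)
        if pa and pb and abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1:
            return 0.5
        return 1.0

    def weighted_distance(s, t):
        # Levenshtein with keyboard-aware substitution costs.
        prev = [float(j) for j in range(len(t) + 1)]
        for i, a in enumerate(s, 1):
            row = [float(i)]
            for j, b in enumerate(t, 1):
                row.append(min(row[j - 1] + 1,                 # insertion
                               prev[j] + 1,                    # deletion
                               prev[j - 1] + sub_cost(a, b)))  # substitution
            prev = row
        return prev[-1]

    print(weighted_distance("cat", "vat"))  # 0.5: c and v are adjacent
    print(weighted_distance("cat", "pat"))  # 1.0: c and p are far apart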

~~~
afsina
Mine does, but only for Turkish so far:

[https://github.com/ahmetaa/leblebi/blob/master/src/main/java...](https://github.com/ahmetaa/leblebi/blob/master/src/main/java/leblebi/spelling/SingleWordSpellChecker.java)

------
muxxa
Is there such a thing as 'typo checking'? I'd like something that is able to
correct accidentally swapped adjacent letters. I'm imagining something that
looks at the sequence and timing of keydown and keyup events to make a
correction rather than just the finished word.
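
(The keystroke-timing part would need support from the editor or OS, but the
pure text side, undoing one adjacent swap, is easy to sketch in Python:)

    def adjacent_transpositions(word):
        # All variants of `word` with one pair of adjacent letters swapped.
        return {word[:i] + word[i+1] + word[i] + word[i+2:]
                for i in range(len(word) - 1)}

    def fix_transposition(word, dictionary):
        # If `word` is unknown but one swap yields a known word, use that.
        if word in dictionary:
            return word
        hits = adjacent_transpositions(word) & dictionary
        return min(hits) if hits else word  # deterministic tie-break

    print(fix_transposition("teh", {"the", "ten"}))  # 'the'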

~~~
gaius
Every phone for the last 10 years has had this.

------
ar7hur
That's interesting from an academic point of view, but in my opinion spell
checkers that don't take into account the context of the sentence (preceding
words, syntax, context) are so 1990...

I'd prefer to see research in that direction rather than speed improvements on
a method we all know doesn't really work. Just because your typo is close to
one entry (an existing word) in a list (the dictionary), a word you may never
have heard of before, doesn't mean you intended to type it.

------
afsina
I have a toy project (in Java and Dart) that does spell checking. With a
10,000-word test dictionary, it finds around 11,000 distance-1 corrections per
second, and about 1,000 distance-2 checks per second. It does not use language
models yet, and there is room for performance improvement.

This is the Dart version: [https://github.com/ahmetaa/dart-spell](https://github.com/ahmetaa/dart-spell)

------
jheriko
I can't help but wonder how this would fare in a fast language with some
optimisation and against real world spell check implementations...

Would we see the same performance increase in practice?

------
mrcactu5
Where is the bottleneck in such code? Why is this faster than, say, Norvig's
spell checker? [http://norvig.com/spell-correct.html](http://norvig.com/spell-correct.html)

