

How to Write a Spelling Corrector (2007) - hashx
http://norvig.com/spell-correct.html

======
StefanKarpinski
When I found myself writing a spell corrector for Etsy's search a couple years
ago, I thought "yawn, solved problem". As it turns out, it's not as solved as
this blog post makes it seem. The basics are correct: you need a language
model and an error model. The problem is that both approaches presented here
are pretty naïve. Indeed, the resulting accuracy of 67% is completely
unacceptable for a real spell corrector – an earlier version of
this blog post had an evaluation bug that made it seem that the accuracy was
much higher – over 90% if I recall correctly.

What's wrong with the language model? Counting occurrences in corpora of text
is not really good enough. There are very rare words that are nevertheless
real words, and there are misspellings of common words (even in books) that
are far more common than many actual words. You can easily find yourself in a
situation where a correctly spelled rare word is corrected to a common
misspelling of some other, much more common word.
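To make that concrete, here's a toy illustration. The words and counts below are invented for the example, but they show how a pure count-based model goes wrong:

```python
# Toy illustration (invented counts) of the pitfall above: a pure
# count-based language model prefers whichever candidate is more
# frequent in the corpus, even when that candidate is itself a
# common misspelling of some other word.
counts = {
    "calendar": 900,   # common word, spelled correctly
    "calender": 120,   # frequent misspelling of "calendar"
    "colander": 15,    # rare but perfectly real word
}

def correct(word, candidates):
    """Pick the candidate with the highest corpus count."""
    return max(candidates, key=lambda c: counts.get(c, 0))

# "colander" is one edit away from "calender", so a naive corrector
# may "fix" the rare real word into the frequent misspelling.
print(correct("colander", ["colander", "calender"]))  # -> calender
```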

What's wrong with the error model? Edit distance just isn't a very good error
model. There are many common misspellings that aren't very close via additions
or deletions. There are also one letter edits that completely change the
meaning of something. These should be treated as much bigger errors than
modifications that have no effect on meaning.
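One way to improve on plain edit distance is to weight edits by plausibility. The sketch below charges less for substitutions between keyboard-adjacent keys; the cost values and adjacency pairs are illustrative only (a real error model would estimate them from a confusion matrix of observed typos):

```python
# Sketch of a weighted edit distance. Costs here are illustrative,
# not derived from real data.
ADJACENT = {("e", "r"), ("r", "e"), ("a", "s"), ("s", "a"),
            ("i", "o"), ("o", "i")}  # a few keyboard-adjacent pairs

def sub_cost(a, b):
    """Cheaper substitution for keyboard-adjacent letters."""
    if a == b:
        return 0.0
    return 0.5 if (a, b) in ADJACENT else 1.0

def weighted_distance(a, b):
    """Levenshtein-style dynamic programming with weighted substitutions."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [float(i)]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1.0,             # delete ca
                            curr[j - 1] + 1.0,         # insert cb
                            prev[j - 1] + sub_cost(ca, cb)))
        prev = curr
    return prev[-1]

# "wesr" (s is next to a on the keyboard) scores as a smaller error
# than "wezr" (z is not):
print(weighted_distance("wear", "wesr"))  # -> 0.5
print(weighted_distance("wear", "wezr"))  # -> 1.0
```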

For special contexts like Etsy where words like "jewelry" (also correctly
spelled as "jewellery" in the UK), Swarovski, and "ACEO" are extremely common,
you really need a custom language model. I just wanted to put this out there
lest people be under the misapprehension that spell correction is quite as
easy as this blog post makes it out to be.

~~~
abhgh
If you are doing a production grade spellchecker (or for that matter, anything
ML) I would suggest that you spend some time getting familiar with the
literature survey (Google Scholar is your friend). I have done various
implementations over time (and also, as it happens, specifically a
spellchecker, but that was not meant to be production grade), and I find this
to be the most rewarding way in the long run.

~~~
StefanKarpinski
Yes, reading the literature is often helpful.

------
ColinWright
Readers might be interested in seeing comments from other HN readers from
previous discussions:

[https://news.ycombinator.com/item?id=10967](https://news.ycombinator.com/item?id=10967)

[https://news.ycombinator.com/item?id=42587](https://news.ycombinator.com/item?id=42587)

[https://news.ycombinator.com/item?id=665544](https://news.ycombinator.com/item?id=665544)

[https://news.ycombinator.com/item?id=2034981](https://news.ycombinator.com/item?id=2034981)

There are more ...

========

Other submissions without comment:

[https://news.ycombinator.com/item?id=1225574](https://news.ycombinator.com/item?id=1225574)

[https://news.ycombinator.com/item?id=2036600](https://news.ycombinator.com/item?id=2036600)

[https://news.ycombinator.com/item?id=4609321](https://news.ycombinator.com/item?id=4609321)

------
msoad
The easiest way to suggest correction is to have a list of correct words and
compare it with Levenshtein distance[1].

To do it really well, you have to look at context (previous words) and add
learning behavior. This is done nicely in the SwiftKey keyboard for Android.

The next step is looking at characters close to the mistyped character in the
keyboard layout. This is also done in SwiftKey.

[1][http://en.wikipedia.org/wiki/Levenshtein_distance](http://en.wikipedia.org/wiki/Levenshtein_distance)
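For reference, the word-list comparison described above might look like this minimal sketch (the dictionary here is a tiny stand-in for a real word list):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def suggest(word, dictionary, max_dist=2):
    """Dictionary words within max_dist edits of `word`, nearest first."""
    scored = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_dist]

# A stand-in word list; a real corrector would use a full dictionary.
words = ["spelling", "spieling", "sapling", "hello"]
print(suggest("speling", words))  # -> ['spelling', 'spieling', 'sapling']
```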

------
mrcactu5
I remember the spell-checker being harder than it looks. You have to train a
Naive Bayes classifier to "recognize" misspelled words.

I can't access the link right now... OK:
[http://web.archive.org/web/20140217174744/http://norvig.com/...](http://web.archive.org/web/20140217174744/http://norvig.com/spell-correct.html)

------
pwf
Is there a bug in the train function? Using 'lambda: 1' for the default dict
along with '+=' means that the first time a feature is encountered, the value
is set to 2.

    
    
      In [1]: from collections import defaultdict
    
      In [2]: d = defaultdict(lambda: 1)
    
      In [3]: d['foo'] += 1
    
      In [4]: d['foo']
      Out[4]: 2

~~~
jonas21
I think the code is correct. To implement smoothing, you want to add 1 to the
count of every word, regardless of whether it appears in the training data or
not.

That is to say, a word that appears once should get a count of 2, and word
that doesn't appear at all should get a count of 1.
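In other words, starting every count at 1 bakes add-one (Laplace) smoothing into the counts themselves. A small sketch of the pattern under discussion:

```python
from collections import defaultdict

# defaultdict(lambda: 1) means every word's count starts at 1, so
# unseen words never get a count (and hence probability) of zero.
def train(words):
    model = defaultdict(lambda: 1)
    for w in words:
        model[w] += 1
    return model

model = train(["the", "the", "cat"])
print(model["the"])    # seen twice  -> 3
print(model["cat"])    # seen once   -> 2
print(model["zzyzx"])  # never seen  -> 1
```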

~~~
pwf
That makes sense, thanks!

I should learn to read...

~~~
justinpombrio
This is a perfect example of where a (short) code comment would be helpful.
The "lambda: 1" is a notable piece of code, but it's hard to tell that at a
glance.

~~~
hnriot
no it's not, it's perfectly clear in the code. Smoothing like this is common
practice in this area of software engineering.

~~~
dljsjr
And yet this is written for people who _aren't_ familiar with this area of
software engineering. It's stated right at the top of the article.

~~~
hnriot
nonsense, any software dev that can't follow a lambda that adds 1 should be
taken outside and shot.

When Norvig talks of regular folks, he means people with 1/10th his IQ,
which is still the top 1%. Norvig is so far out there on the IQ scale that I
find it funny when some noob says he's found a bug!

------
gabemart
Google cached version:
[http://webcache.googleusercontent.com/search?q=cache%3Ahttp%...](http://webcache.googleusercontent.com/search?q=cache%3Ahttp%3A%2F%2Fnorvig.com%2Fspell-correct.html)

------
init0
awk needed just 15 lines, whereas Java needed 372!

~~~
jonas21
Well... to be fair, there's also a 35-line Java version there.

------
coldcode
Argh, another blog where there are no posting dates. Is it that hard to tell
people when your post was written? The post starts off with "In the past
week..." but we have no idea if this was written 10 years ago or recently. It
may be interesting content, but without that context it's hard to know whether
it's new or old.

