

How to Write a Spelling Corrector - lpsz
http://www.norvig.com/spell-correct.html

======
kylebgorman
This is posted here about every three months.

Norvig's approach assumes a very small fixed alphabet (so it won't scale well
to the basic multilingual plane), and can only recover errors with an edit
distance of two or fewer.

If you are interested in a vastly superior yet still very intuitive approach
to spelling correction, I recommend the classic paper on the noisy channel
model corrector by Kernighan, Church, and Gale, which fuses a probability
distribution over possible errors with a probability distribution over the
words themselves.

[http://www.aclweb.org/anthology/C90-2036.pdf](http://www.aclweb.org/anthology/C90-2036.pdf)
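
To make the fusion concrete, here's a minimal sketch of the scoring step (my
own toy code with made-up word counts and a crude stand-in error model, not
the paper's implementation; the paper estimates the error model from confusion
matrices of single-character edits):

    from collections import Counter

    # Toy word-frequency prior P(w), as if counted from a large corpus.
    WORD_COUNTS = Counter({"the": 5000, "they": 800, "then": 600, "thaw": 20})
    TOTAL = sum(WORD_COUNTS.values())

    def error_prob(typed, intended):
        # Crude stand-in for P(typed | intended): penalise each apparent edit.
        if typed == intended:
            return 0.95
        mismatches = sum(a != b for a, b in zip(typed, intended))
        return 0.05 ** max(1, mismatches + abs(len(typed) - len(intended)))

    def correct(typed, candidates):
        # Noisy channel: argmax over candidates of P(typed | w) * P(w).
        return max(candidates,
                   key=lambda w: error_prob(typed, w) * WORD_COUNTS[w] / TOTAL)

    print(correct("thw", ["the", "thaw", "they"]))  # -> 'the'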

~~~
antics
I actually think Norvig is making a totally different point than you are. I
also think (with respect) that it's a much more important point to learn.

Where your point is that there is a better but almost-as-intuitive approach,
Norvig's point is basically that a naive application of probability theory is
often dramatically easier to implement, and also "good enough" when compared
to a (usually much more complicated) rule-based implementation.

That's really a great lesson to learn. It's (IMHO) much less valuable to learn
that there are better approaches out there -- of course there are.

------
joshuak
Do not write a spell checker like this! I will hunt you down and beat you with
a dictionary.

I find spell checking in almost every application that I've tried to be
spectacularly bad for anything other than trivial typos. Google is quite good,
but it is leveraging data not always available in other applications and is a
special case optimized for search applications. Spell correcting for general
writing is somewhat different in requirements and optimization vectors.

Spelling errors that can be classified in terms of edit distance are far
better characterized as "typos", not spelling errors. A typo is a transcription
error, whereas incorrect spelling is a representation error. Both can be
present in a particular word, but they are different phenomena.

People make incorrect spelling choices when they _intentionally_ choose the
wrong letters to represent phonemes, or apply spelling rules incorrectly.

A better approach in all respects is to do what's often referred to as a
grapheme-to-phoneme transformation, which is basically to `compile` a word
into a smaller set of characters representing the phonemes of the language.
With the reduced set of symbols, which in and of itself eliminates trivial
typos, statistical models can perform better, and faster. Further, unknown
words can often be corrected via the corrected spelling of the phonemes.
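
As a toy illustration of the reduced-symbol idea (my own sketch: a crude
Soundex-style collapse standing in for a real grapheme-to-phoneme system, with
an invented lexicon), map each word to a short phonetic key and match
misspellings against dictionary words that share the key:

    # Crude Soundex-style phonetic key: first letter plus up to three digit codes.
    def phonetic_key(word):
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = word.lower()
        key, prev = word[0].upper(), codes.get(word[0], "")
        for ch in word[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                key += code
            if ch not in "hw":  # h and w do not reset the previous code
                prev = code
        return (key + "000")[:4]

    LEXICON = ["necessary", "separate", "definitely", "fizzle"]
    BY_KEY = {}
    for w in LEXICON:
        BY_KEY.setdefault(phonetic_key(w), []).append(w)

    print(phonetic_key("definately") == phonetic_key("definitely"))  # True
    print(BY_KEY.get(phonetic_key("seperate")))                      # ['separate']

With keys like these the statistical model works over a handful of symbols per
word instead of raw letter strings.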

Why most spelling checkers fail to do grapheme-to-phoneme transformation, when
it is easy, reduces the symbol set for statistical analysis, and is
demonstrably more accurate (with the possible exception of Indian languages),
is beyond me. It's like watching people try to solve dehydration with a
"better(TM)" formula for Coke. You're thinking about this problem wrong; just
drink some water.

~~~
Ntrails
I had always assumed that spellcheckers took into account likely mistypes
based on keyboard layout when working on spelling correction.

For example, "RRROR" (error) could easily have been me fat fingering the keys,
whilst "MRROR" seems far less likely unless I have truly gargantuan fingers or
a really really small keyboard. So you might infer I'd missed the "i" instead
of mistyped the "e".
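
One way to encode that intuition (purely a sketch; the adjacency table is
hand-rolled and only covers the keys in the example) is to make the
substitution cost in an ordinary dynamic-programming edit distance cheaper
when the two letters are physical neighbours:

    # Hand-rolled QWERTY adjacency for a few keys (illustrative only).
    ADJACENT = {
        "e": set("wrsd34"), "r": set("etdf45"),
        "m": set("njk,"), "i": set("uojk89"), "o": set("ipkl90"),
    }

    def sub_cost(a, b):
        if a == b:
            return 0.0
        # Neighbouring keys are a likely fat-finger slip, so charge them less.
        near = b in ADJACENT.get(a, set()) or a in ADJACENT.get(b, set())
        return 0.5 if near else 1.0

    def weighted_distance(s, t):
        # Standard Levenshtein dynamic programme, keyboard-aware substitutions.
        prev = [float(j) for j in range(len(t) + 1)]
        for i, a in enumerate(s, 1):
            cur = [float(i)]
            for j, b in enumerate(t, 1):
                cur.append(min(prev[j] + 1,                    # deletion
                               cur[j - 1] + 1,                 # insertion
                               prev[j - 1] + sub_cost(a, b)))  # substitution
            prev = cur
        return prev[-1]

    print(weighted_distance("rrror", "error"))  # 0.5: 'r' sits next to 'e'
    print(weighted_distance("mrror", "error"))  # 1.0: 'm' is nowhere near 'e'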

Is that ever used as part of statistical analysis?

~~~
gcb0
Probably, if the spell checker owns the keyboard (e.g. an Android keyboard
app), and probably not on most desktop ones. And even if it does use that on
desktop, it's probably assuming a layout and missing, giving worse results.

~~~
liotier
Why not make the spellchecker keyboard-aware? Just have it query 'xkb-switch'
or 'xkblayout-state print "%s"' or whatever other API is available.

Doesn't any spellchecker use keyboard layout-aware Damerau–Levenshtein
distance metrics?
[https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_di...](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)
doesn't mention any keyboard-layout-aware variants...

~~~
hurin
It's not nearly as simple as it sounds in theory. Consider that you have typos
that come from hitting the wrong key, and also typos from hitting the next
letter before the previous one. Also, when looking at typos, you'd likely find
statistically significant differences between hitting the letter to the left
versus the one to the right, or in the probability of a typo given the
letter's general position within a layout, etc.

------
hacknat
This: [http://blog.faroo.com/2012/06/07/improved-edit-distance-base...](http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/)

is an interesting attempt at fast spellchecking, for anyone that's interested.
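
If I'm reading the post right, the core trick is to index only deletions of
dictionary words and intersect them with deletions of the input, so you never
have to enumerate the whole alphabet. A rough sketch of that idea (my own toy
code, not the post's implementation):

    WORDS = ["hello", "help", "spell", "spelling"]

    def deletions(word):
        return {word[:i] + word[i + 1:] for i in range(len(word))}

    # Index every dictionary word, plus each of its single-character deletions,
    # back to the original word.
    INDEX = {}
    for w in WORDS:
        for key in {w} | deletions(w):
            INDEX.setdefault(key, set()).add(w)

    def lookup(term):
        # Deletions of the input meet deletions of dictionary words in the
        # index, which covers insertions, deletions and substitutions at
        # distance 1 without generating alphabet-sized candidate sets.
        candidates = set()
        for key in {term} | deletions(term):
            candidates |= INDEX.get(key, set())
        return candidates

    print(lookup("helo"))  # {'hello', 'help'} (order may vary)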

------
tempodox
The tech blogger's worst sin: not dating your article. This article starts
out,

    In the past week, ...

but then it leaves no clue whatsoever as to when it was written. That mistake
is made by almost every blogger out there. We programmers are the worst
offenders when it comes to accelerating the pace of technological change. Yet
our blog articles leave no hint as to their date, with rare exceptions.

I maintain it is _VERY_ bad form to post undated blog articles to the
interwebs, even more so in the realm of software technology.

As always, you are cordially invited to vote me down generously for not
wearing pink glasses.

~~~
DanBC
> Please don't bait other users by inviting them to downmod you.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

~~~
tempodox
Thanks for the link, and the reminder. Sometimes, temptation is just too hard
to resist.

------
darklajid
I work with OCR data and rankings of likely matches ('fuzzy search'), which is
closely related.

Yes, a neat demo of the general idea, but really, really naive. As the page
states: Edit distances are restricted to 1 or 2 and that thing isn't exactly
fast or efficient. Look at it, think about it, don't copy it. :)

------
biomimic
Can you provide a side by side comparison with Google's spellchecker?

~~~
hnriot
This is a very old page and it uses a very basic approach. It's highly
computationally expensive to support edit distances beyond 2 (exponentially
expensive), so the word length has to be kept small. This makes
phraseslikethishardtobreakup.
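
To put a rough number on that, here's a quick count using an edits1-style
candidate generator reimplemented from memory (treat it as a sketch of the
article's approach):

    def edits1(word):
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    word = "something"
    e1 = edits1(word)
    e2 = {w2 for w1 in e1 for w2 in edits1(w1)}
    print(len(e1))  # roughly 500 candidate strings at distance 1
    print(len(e2))  # over 100,000 at distance 2; each extra unit of distance
                    # multiplies the count by roughly another 500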

Today's Google spelling correction uses vastly more sophisticated approaches,
with gobs of data at its disposal.

------
biomimic
Have you or anyone else thought about, or imagined, applications for a spell
checker that are not language-based spell checking?

~~~
kylebgorman
Not quite sure if this is what you want, but the noisy channel model (call it
"Bayesian" if you must) spell checker I linked to above is inspired by the
classic approach to speech recognition (and codebreaking, and machine
translation, and natural language parsing).

Norvig's algorithm essentially computes a censored form of Levenshtein
distance. Dynamic programming algorithms for computing Levenshtein distance
play a major role in gene sequencing, and also pop up in time series form as
dynamic time warping.
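
For the time-series case, it's the same fill-in-a-table recurrence, just over
numeric samples instead of characters. A minimal sketch (toy series, not a
production DTW implementation):

    import math

    def dtw(a, b):
        # Dynamic time warping between two numeric series: the same table-filling
        # idea as Levenshtein, with |a[i] - b[j]| as the local cost.
        n, m = len(a), len(b)
        cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i][j] = d + min(cost[i - 1][j],       # step in a only
                                     cost[i][j - 1],       # step in b only
                                     cost[i - 1][j - 1])   # step in both
        return cost[n][m]

    # The second series is the first one stretched in time; DTW still aligns them.
    print(dtw([0, 1, 2, 3, 2, 1], [0, 1, 1, 2, 3, 3, 2, 1]))  # 0.0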

------
pmoriarty
Does anyone else get depressed when they see Norvig programming in Python
instead of Lisp?

~~~
antics
No. He has explicitly said that he does this because he thinks Python makes
better pseudocode, and is easier for beginners to understand, as empirically
measured by the questions he got from students of his AI books.

I understand your heart was probably in the right place, but honestly, your
tone here is kind of hard to take.

