
How to Write a Spelling Corrector - colobas
http://norvig.com/spell-correct.html
======
ChicagoBoy11
I couldn't help but read this and think about all the "coding" initiatives
I've seen in K-12 and shake my head.

What Norvig is doing is what we should be teaching. He is tackling this
seemingly REALLY hard problem by thinking about it methodically, translating
some intuition into code, carefully constructing an argument about how to
solve it, and considering ways it could be extended. This is what actual
engineering looks like.

Everything I've seen around "coding" though has become a masochistic exercise
in teaching kids random syntax details and then calling them Coders and
Geniuses and Computer Scientists when they successfully copy what the teacher
showed them.

When you read Norvig's code (big fan of his Sudoku one as well), you realize
how the actual "code" is secondary in the sense that what it is really doing
is expressing an idea. A very nuanced, elegant idea, but ultimately the
product of doing some hard thinking and exploration on a problem domain.

If we taught kids to just think about problems in this way, ohh what a world
it would be!

~~~
drauh
By that reasoning, we should not teach kids how to spell, or about
punctuation, and just aim for them writing essays/stories/novels.

~~~
shanusmagnus
"If you want to build a ship, don't drum up the men to gather wood, divide the
work and give orders. Instead, teach them to yearn for the vast and endless
sea."

~~~
kybernetikos
That's a lovely idea although I can't help noticing that I know of no
shipbuilding companies based on the teaching of yearning while I do know of
some that seem based on organising a workforce. I suspect that this is one of
those ideas that, while beautiful, is far from being literally true.

How about "if you want your citizens to fund a space program, don't gather the
workforce to build a spaceship but teach them to yearn for the vastness and
potential of the void."

------
lb1lf
Spell chequer

Martha Snow

Eye halve a spelling chequer It came with my pea sea It plainly marques four
my revue Miss steaks eye kin knot sea.

Eye strike a quay and type a word And weight four it two say Weather eye am
wrong oar write It shows me strait a weigh.

As soon as a mist ache is maid It nose bee fore two long And eye can put the
error rite It's rare lea ever wrong.

Eye have run this poem threw it I am shore your pleased two no It's letter
perfect awl the weigh My chequer tolled me sew.

~~~
agd
In a way, I'm surprised I can read this so easily. Do our brains read by
converting text to sounds, and then parsing the sounds?

~~~
fallous
I think many people do have brains that operate this way, probably because
their understanding of language is grounded in conversation as a first
experience. I suspect this is the reason you see confusion with "your, you're"
and "there, their, they're."

I don't parse language in quite that manner, and even in speaking I have a
different "mouth feel" for those homonyms. This creates a modest inversion of
the usual spelling-error problem for me: when people pronounce homonyms in a
way that sounds exactly the same to my ear, I have to parse for context.

~~~
schoen
> I don't parse language in quite that manner, and even in speaking I have a
> different "mouth feel" for those homonyms.

I believe in the (somewhat controversial) non-subvocalization-based text
processing and even think I do it myself, but I'm wondering if you could
describe more about the "mouth feel" issue. Do you mean that you believe that
you pronounce them using different phonology (that another person would
potentially be able to hear), that your muscles are doing something different
but not in a way that makes an auditory difference, or simply that you're
subjectively aware of the spelling while speaking but not necessarily in a way
that makes a physically-observable difference? Or is it not quite clear which
of these is the case?

I think this is an interesting question in the philosophy of language and also
in the psychology of reading. (I've thought about this myself but haven't
studied the academic literature about it.) If your answer is the first one, I
wonder if you'd be willing to make an audio recording of yourself pronouncing
these words that might show what difference you experience.

By the way, there are documented cases where spelling differences have created
new pronunciation differences that didn't previously exist in the spoken
language. Maybe something like that has been happening in your idiolect?

~~~
khedoros
I have a similar feeling when I pronounce a homonym/homophone. The term
"mouth feel" resonates for me, although I suspect that sometimes it's just an
awareness of the spelling of the word. Other times, I'll consciously choose to
pronounce a word differently to try to disambiguate it. I'll say "aunt" as
"ant"/"ont" (read those as phonetically spelled Californian American
pronunciations) depending on the situation and flow of the sentence. "They're"
is usually "They-er", "their" is sometimes "thur" (like "fur" with a /ð/), and
"there" feels like it's pronounced the expected way. "To" might be "tə", but
"two" and "too" never are.

Of course, that's all when I'm paying close attention to what I say. I'm sure
there are times that I go against those. Also, sorry for the mix of layman
phonetic spelling and IPA.

------
bsenftner
Although not a spell checker specifically, I wrote an offensive language
filter for a chat system used by the National Hockey League for a period when
they hosted communal chat rooms during televised games.

The architecture I used is completely different from what is described here,
but the goals are very similar. I had to handle any curse word in any
language, including curses from one language translated into another language,
as well as offensive phrases, and their translated equals, as well as
offensive slang, and mispelt offensive slang translated from other languages.

I ended up with an offense dictionary of about 700K words and phrases. This
was back in '99, so my memory may not be 100% here, but I remember using
Perfect Hash to generate a compiled hash table for the dictionary, and then a
trie to organize the dictionary lookups. The entire system was about 150K of a
downloaded exe to access the NHL simulcast chat, as all the offensive language
filtering occurred on the client side. Chatting anything potentially
offensive turned it into a series of words with their interiors replaced by
asterisks ('*'), and the whole check ran in something like 500 ms. Fun times.
That company and project died
with the dot com bust.
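
In rough sketch form, the idea was something like this (a toy Python
reconstruction from memory, single words only; the real system also handled
phrases and used a perfect hash):

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.terminal = False  # marks the end of a banned word

    def build_trie(banned_words):
        root = TrieNode()
        for word in banned_words:
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.terminal = True
        return root

    def in_trie(root, word):
        node = root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal

    def mask(word):
        # keep the first and last letters, asterisk the interior
        return word if len(word) <= 2 else word[0] + '*' * (len(word) - 2) + word[-1]

    def filter_chat(text, root):
        return ' '.join(mask(w) if in_trie(root, w.lower()) else w
                        for w in text.split())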

~~~
ThePawnBreak
> but I remember using Perfect Hash to generate a compiled hash table for the
> dictionary, and then a trie to organize the dictionary lookups

But... but programmers on HN told me that tries are just useless pieces of
trivia used to fail people in interviews, and if you need them a library is
just going to do it for you in the most efficient possible way.

~~~
OskarS
Tries are lovely little data structures; don't let anyone tell you any
different. It's not like you'll use them every day, but they're nice to have
when needed!

------
abecedarius
The code's been updated to more modern Python and no longer tries to smooth
the probabilities. (Also, it computes probabilities instead of frequencies now,
though that shouldn't affect the result.)

------
misiti3780
His Python code styling is really awesome. So concise. Probably inspired by
all the LISP he wrote in the past.

Although he does seem to be using docstrings incorrectly.

~~~
leblancfg
Came here to say this. I've combed through his pieces many times over just to
glean the way he structures his code. Here, he got my head spinning for a
minute with

      return set(w for w in words if w in WORDS)

and made a mental note to use that idiom in the future.

As for docstrings, well, this is just a toy piece of code after all. The main
purpose of the code is to be read, not actually used. _something something
hobgoblin of little minds_

Edit: formatting.

~~~
misiti3780
That is a great idiom. I remember the first time I saw it (possibly in this
very essay), and I thought it was awesome. I use it all over the place now. It
is similar to the JavaScript ternary operator [1], which I also find to be
very useful:

       var varX = (boolean) ? 'X' : 'Y'

[1] [https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Conditional_Operator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Conditional_Operator)

~~~
nommm-nommm
Pretty much every language has a ternary conditional operator: C++, Java,
Python, C#, Ruby....
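
Python's version is spelled as a conditional expression rather than with "?:".
A tiny illustrative example (hypothetical names):

    is_weekend = True
    label = 'X' if is_weekend else 'Y'  # Python's conditional expression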

------
1024core
The last 9 times it was submitted:
[http://goo.gl/mVSi7W](http://goo.gl/mVSi7W)

~~~
gist
I want to see someone do a post on how they analyzed all of the top stories on
HN (ones that got more than "x" upvotes) and calculated which ones could be
resubmitted after a "y" time period had gone by. For extra bonus points, figure
out whether any particular HN handle has resubmitted old stories in the past
using a similar technique and analysis.

~~~
dvirsky
In Go.

------
acdimalev

      >>> import spell
      >>> spell.correction('ducking')
      'fucking'
      >>>
    

Yup, it works.

------
bpchaps
I love it. Spell checking was pretty relevant to some of my work recently: I
needed to correct heavily typo'd address text from Chicago parking ticket
data. It uses difflib to recursively find similarly typo'd addresses until it
finds a matching address within a list of correct addresses. Definitely a lot
more to polish, but I'm kinda proud of the naive approach.

[https://github.com/red-bin/tyop_fixer](https://github.com/red-bin/tyop_fixer)

      coumbia,columbia,0.933333333333
      argy,argyle,0.8
      menomee,menomonee,0.875
      newladn,newland,0.857142857143
      boulevard way,boulevard,0.818181818182
      sherwn,sherwin,0.923076923077
      lawrencec,lawrence,0.941176470588
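
The core matching step looks roughly like this (a toy sketch under
assumptions, not the exact code in the repo):

    import difflib

    def best_address(typo, correct_addresses, cutoff=0.8):
        # find the closest correct address and report the similarity ratio
        matches = difflib.get_close_matches(typo, correct_addresses, n=1, cutoff=cutoff)
        if not matches:
            return None
        match = matches[0]
        return match, difflib.SequenceMatcher(None, typo, match).ratio()

    # e.g. best_address('coumbia', ['columbia', 'argyle']) -> ('columbia', 0.933...)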

------
727374
If people like this, also check out his course at Udacity -
[https://www.udacity.com/course/design-of-computer-programs--cs212](https://www.udacity.com/course/design-of-computer-programs--cs212)

------
WalterBright
The D language compiler consults a simple spell corrector when encountering an
unknown identifier. The dictionary is the list of names currently in scope.
It's a nice improvement to the error message about an unknown identifier.
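
The same idea fits in a few lines of Python (a hypothetical sketch, not the
actual DMD implementation):

    import difflib

    def suggest_identifier(unknown, names_in_scope):
        # suggest the in-scope name closest to the unknown identifier, if any
        matches = difflib.get_close_matches(unknown, names_in_scope, n=1, cutoff=0.6)
        return matches[0] if matches else None

    # e.g. suggest_identifier('lenght', ['length', 'width', 'depth']) -> 'length'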

------
r-c-r
I added a spelling corrector to an application a while back. I tried a couple
of libraries that implement these ideas. The problem is that they can be slow
for long words.

I stumbled across this algorithm, which is much faster if you allow some time
to pre-process your dictionary:
[http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/](http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/)

I implemented it here for fun in Common Lisp. Excuse the ugly code.
[https://github.com/RyanRiddle/lispell](https://github.com/RyanRiddle/lispell)
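
The core trick, as I understand it (a toy Python sketch of the idea, not the
library's API): precompute all delete-variants of every dictionary word; at
lookup time, generate delete-variants of the input and intersect, so no
alphabet-sized candidate expansion is needed.

    from collections import defaultdict
    from itertools import combinations

    def deletes(word, max_d=1):
        # all strings reachable by deleting up to max_d characters
        results = {word}
        for d in range(1, max_d + 1):
            for idxs in combinations(range(len(word)), d):
                results.add(''.join(ch for i, ch in enumerate(word) if i not in idxs))
        return results

    def build_index(dictionary, max_d=1):
        index = defaultdict(set)
        for w in dictionary:
            for key in deletes(w, max_d):
                index[key].add(w)
        return index

    def lookup(word, index, max_d=1):
        # candidates share a delete-variant; verify them with a real edit
        # distance, since sharing a variant is necessary but not sufficient
        candidates = set()
        for key in deletes(word, max_d):
            candidates |= index.get(key, set())
        return candidates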

------
nxzero
The reverse of the logic presented could be used to inject typos into a
document, a different set per distributed copy, to help identify anyone
sharing documents online; basically, each copy is unique, which allows for
attribution. A hash of each copy could even be given to a third party and
archived to provide, if needed, independent verification of the claim. The
document itself could also be encrypted and loaded to another third party that
logs the IP and fingerprint of the download, providing another independent
verification of the exchange.

~~~
ksk
What is the application where typos in a document are acceptable?

~~~
dsr_
Classified reports. This technique is called the Canary Trap, with the name
(but not the technique) coined by Tom Clancy.

[https://en.wikipedia.org/wiki/Canary_trap](https://en.wikipedia.org/wiki/Canary_trap)

~~~
robryk
For that purpose you'd want to add typos that would be "corrected" into the
wrong word by a typical spellchecker, or that are actually correct words but
atypical in context. Otherwise, passing the document through a spellchecker
would remove the watermark.

~~~
ksk
Heh, perhaps an easier way would be to have unique phrases in each document.
Typos risk exposing you because they will stand out to the person who is
reading the document - assuming such a person is more discerning than most,
given that they're accessing classified docs.

------
jedberg
The date on the top says February 2007 to August 2016.

Does anyone know which parts are new in August 2016? I've read this before
and the changes aren't sticking out to me.

~~~
benhoyt
The code has been updated stylistically, and I'm not sure what else has
changed. Here's the old (Python 2.5) version from
[http://web.archive.org/web/20160110135619/http://norvig.com/spell-correct.html](http://web.archive.org/web/20160110135619/http://norvig.com/spell-correct.html):

        import re, collections
    
        def words(text): return re.findall('[a-z]+', text.lower()) 
    
        def train(features):
            model = collections.defaultdict(lambda: 1)
            for f in features:
                model[f] += 1
            return model
    
        NWORDS = train(words(file('big.txt').read()))
    
        alphabet = 'abcdefghijklmnopqrstuvwxyz'
    
        def edits1(word):
           splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
           deletes    = [a + b[1:] for a, b in splits if b]
           transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
           replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
           inserts    = [a + c + b     for a, b in splits for c in alphabet]
           return set(deletes + transposes + replaces + inserts)
    
        def known_edits2(word):
            return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
    
        def known(words): return set(w for w in words if w in NWORDS)
    
        def correct(word):
            candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
            return max(candidates, key=NWORDS.get)
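
For comparison, the core of the updated version goes roughly like this
(paraphrased from memory, so check the article for the authoritative code;
edits1 and edits2 are essentially unchanged):

    import re
    from collections import Counter

    def words(text): return re.findall(r'\w+', text.lower())

    WORDS = Counter(words(open('big.txt').read()))

    def P(word, N=sum(WORDS.values())):
        "Probability of `word`."
        return WORDS[word] / N

    def correction(word):
        "Most probable spelling correction for word."
        return max(candidates(word), key=P)

    def candidates(word):
        "Generate possible spelling corrections for word."
        return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

    def known(words):
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in WORDS)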

------
vcool07
Is there a decent C++ version of a spell checker? I've been looking for some
time (mainly out of curiosity); most are either amateur school projects or too
academic/PhD-ish. I'd love to go through a good open-source C++ spell checker
that can be used in practice (with few modifications if required).

~~~
abecedarius
You could make a good start with the ideas from
[http://norvig.com/ngrams/](http://norvig.com/ngrams/) (a fancier corrector in
the vein of the OP).

------
taneq
This is a beautiful example of pithy high-level coding. I feel compelled to
mention, however, that the example word, 'thew', is in fact a(n archaic)
English word:
[https://www.google.com.au/#q=thew](https://www.google.com.au/#q=thew) :P

------
elchief
I wrote one in PLV8 (Postgres) just for fun:

[http://blog.databasepatterns.com/2014/08/postgresql-spelling-correction-norvig-plv8.html](http://blog.databasepatterns.com/2014/08/postgresql-spelling-correction-norvig-plv8.html)

------
anonfunction
I wrote a direct port in Golang with benchmark comparisons:

[https://github.com/montanaflynn/toy-spelling-corrector](https://github.com/montanaflynn/toy-spelling-corrector)

------
gravypod
This is really cool, and I'm wondering if you could improve it by adding a
Markov chain/tree structure of common word-usage patterns and doing a
contextual search for your word. You wouldn't need your wordlist, and you
could compress and package this.

The way this would work is by looking at the previous word and, if available,
the next word. It would find every word combination that looks like that and
then compute a Levenshtein distance for all of the words that come between
these two items.
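
Roughly, as a Python sketch (assuming big.txt and an edits1-style candidate
generator like the article's):

    import re
    from collections import Counter

    # bigram counts from big.txt stand in for a real language model
    TOKENS = re.findall(r'[a-z]+', open('big.txt').read().lower())
    BIGRAMS = Counter(zip(TOKENS, TOKENS[1:]))

    def context_score(prev, cand, nxt):
        # add-one smoothing so unseen bigrams don't zero out a candidate
        return (BIGRAMS[(prev, cand)] + 1) * (BIGRAMS[(cand, nxt)] + 1)

    def correct_in_context(prev, word, nxt, candidates):
        # candidates could come from the article's known()/edits1()
        return max(candidates, key=lambda c: context_score(prev, c, nxt))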

Is this the way "big" spelling correction methods work or is it by other
means?

~~~
jomamaxx
Yes, using language models is how it's done.

Theoretically, it's not that hard; in practice, it's _really_ hard.

There is an online language modelling course from Stanford that you should
check out if you want to take a stab.

------
Animats
That's not how Google does it. Try spelling errors with Google Search. Google
does multi-word spelling correction.

~~~
majewsky
I guess that Google looks at each word individually. And what Google does is
intriguingly simple: when they see two similar searches in a row from the same
person, where the second search has way more results, they record the first
search as the "wrong spelling" and the second one as the "correction".
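
A toy sketch of that mining step (Python, with a purely hypothetical log
format):

    import difflib

    def mine_corrections(session, min_sim=0.8):
        # session: one user's consecutive (query, result_count) pairs
        pairs = []
        for (q1, n1), (q2, n2) in zip(session, session[1:]):
            similar = difflib.SequenceMatcher(None, q1, q2).ratio() >= min_sim
            if similar and n2 > n1:
                pairs.append((q1, q2))  # (suspected misspelling, correction)
        return pairs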

------
vpanghal
I wrote a Rust version of the spelling corrector some time back to explore
the nitty-gritty of the language:
[https://github.com/vpanghal/spellcorrector](https://github.com/vpanghal/spellcorrector)

------
tootie
Is this how modern spell checkers actually work? I assumed they would use a
heuristic trying to match common misspellings to their frequent corrections.
That or a combination of heuristic and Bayes.

~~~
anonred
For one, I'm fairly sure that modern spell checkers use n-gram language models
rather than a 1-gram model to decrease perplexity. Disclaimer: I skimmed the
article.

------
yomritoyj
Versions in C, C++, Java, Haskell and Racket:
[https://github.com/jmoy/norvig-spell](https://github.com/jmoy/norvig-spell)

------
WhitneyLand
Any article like this for grammar correction?

I've been interested to know why grammar checking and correction can't be
more accurate.

~~~
WhitneyLand
An open-source grammar project from CMU with an LGPL license and links to
current and future research:
[http://www.abisource.com/projects/link-grammar](http://www.abisource.com/projects/link-grammar)

------
evanjacobs
"why should they know about something so far outisde their specialty"

------
Halienja
Neat code - expressing high-level concepts in such a concise manner.

------
amelius
They don't seem to address keyboard layouts.

~~~
billconan
This is a good point, I think.

Sometimes people type the wrong words because certain keys are too close
together on the keyboard.

Another problem I found with the naive spelling corrector is that it doesn't
take pronunciation into account. Certain wrong spellings look quite different
from the correct version by edit distance, but they sound similar.
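
One way to fold the keyboard in, as a toy Python sketch (assuming a plain US
QWERTY layout): weight substitutions by physical key distance, so that 'cat'
-> 'car' (t and r are adjacent) is treated as a likelier typo than 'cat' ->
'cap'.

    # approximate key coordinates on a US QWERTY layout (assumed)
    QWERTY = ['qwertyuiop', 'asdfghjkl', 'zxcvbnm']
    KEY_POS = {ch: (row, col)
               for row, keys in enumerate(QWERTY)
               for col, ch in enumerate(keys)}

    def key_distance(a, b):
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        return abs(r1 - r2) + abs(c1 - c2)

    def substitution_cost(a, b):
        # adjacent keys are likelier typos, so charge them less in a
        # weighted edit distance
        return 0.5 if key_distance(a, b) <= 1 else 1.0

The pronunciation issue is usually attacked separately, with phonetic
encodings like Soundex or Metaphone.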

------
laxatives
Norvig's code is beautiful.

------
realworldview
Carefuly.

------
jomamaxx
I've done this before.

Sadly - the problem is _way_ harder.

First - you get much better results by using language models, n-grams, etc.
to predict the likelihood of words given previous words. That can be hard.

The _really_ hard part comes down to the language that people actually use.

Colloquialisms, proper names, and mixed languages... make this stuff really,
really hard.

Getting it 'mostly right' is not hard. Getting it really good is very
difficult and depends a lot on context.

'Srinivas' will not be in most people's dictionaries, but it's a common name
in India. 'Le' and 'la' are words in French, with similar ones in Spanish -
and a lot of writers jam these in all over the place. Over-correction and beta
errors become a huge problem.

It's a really interesting problem, and difficult for engineers, because it's
purely probabilistic: _there is no way to build a perfect spellchecker_ unless
you can agree 100% on what 'language' is, precisely... and trust me, there is
no agreement on that. Not even close.

~~~
mrweasel
Writing spell checkers is also hard in languages that, for example, don't put
spaces between composite nouns. In essence, that makes it possible to make up
a perfectly valid new word, and we expect the spell checker to recognise that
word because it knows the two words the composite is made of.

Finnish is probably even worse because of its conjugation.
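
A crude sketch of the splitting idea in Python (toy code; real compounding,
e.g. in German or Finnish, also involves linking morphemes and inflection):

    def is_valid_compound(word, dictionary, min_part=3):
        # accept a word if it is in the dictionary outright, or if it
        # splits into two known words of reasonable length
        if word in dictionary:
            return True
        return any(word[:i] in dictionary and word[i:] in dictionary
                   for i in range(min_part, len(word) - min_part + 1))

    # e.g. is_valid_compound('bookshelf', {'book', 'shelf'}) -> True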

