
How to write a spelling corrector (2016) - partycoder
http://norvig.com/spell-correct.html
======
lb1lf
I know a spelling corrector is not the same thing as a spelling checker, but
this is too good an opportunity to pass to promote Martha Snow's hilarious
poem 'Spell Chequer':

Eye halve a spelling chequer
It came with my pea sea
It plainly marques four my revue
Miss steaks eye kin knot sea.

Eye strike a quay and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me strait a weigh.

As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
It's rare lea ever wrong.

Eye have run this poem threw it
I am shore your pleased two no
It's letter perfect awl the weigh
My chequer tolled me sew.

~~~
madacoo
Thanks for sharing this.

I found it difficult to parse the initial couple of lines because I was
constantly attempting to read by attaching meaning to the spellings: "Eye
halve" is a bit frightening in that sense. But then I realised I can read the
text as sounds and almost ignore the spellings. Listening to what the sounds
made in my head allowed for a much faster pace of comprehension because I
didn't have to keep correcting myself.

This poem demonstrated for me that it is possible to read and listen
simultaneously, and that I don't usually do that. It seems that it would
probably add to the experience of other poems to read them in this manner.

~~~
scott_s
I think this is how _all_ poems should be read, and I didn't realize until
just now that I automatically did that.

~~~
astine
There is apparently a big divide between people who subvocalize when they read
and those who don't. Those who don't tend to read much faster than those who
do, which is why speedreading techniques tend to focus on eliminating
subvocalization. The problem is that people who subvocalize tend to need to do
so in order to understand the text.

[https://en.wikipedia.org/wiki/Subvocalization](https://en.wikipedia.org/wiki/Subvocalization)

~~~
derefr
I wonder whether the people who _do_ subvocalize when they read tend to be
better at writing poetry (or songwriting, or just writing beautiful prose.) I
would expect that they'd have been subconsciously training themselves to the
"feel" of good meter.

~~~
earenndil
There might be a correlation, but it's not a hard-and-fast rule. I've been
told that my prose -- and poetry -- is good, and I definitely don't
subvocalize.

------
apendleton
While doing a bunch of research on exactly this problem space recently for a
project, I stumbled onto this improvement on the Norvig corrector idea
[http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/](http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/)
that's one of those things so deceptively simple you kick yourself for not
thinking of it: it turns out you can model the same "generate all the
variations" effect by generating only the deletes, rather than also the
insertions and substitutions, if you symmetrically apply the same
transformation on the lookup side. So if your dictionary contains "cat" and
you want to match queries for "cast," rather than adding "cast" to the corpus,
you drop the s (and every other single letter) on the lookup side instead, and
still match. It turns out to be much faster, requires a much smaller index,
and as a bonus, doesn't tie you to a specific alphabet like Norvig's approach
does.
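The trick is easy to sketch in Python (a toy version with names of my own
choosing; real SymSpell also verifies candidates with a true edit-distance
check, since delete-overlap alone can admit words slightly beyond the target
distance, and it stores precomputed frequencies for ranking):

```python
from collections import defaultdict

def deletes(word, max_dist=1):
    """All strings reachable from `word` by deleting up to max_dist characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_index(dictionary, max_dist=1):
    """Map every delete-variant of every dictionary word back to that word."""
    index = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_dist):
            index[variant].add(word)
    return index

def lookup(query, index, max_dist=1):
    """Generate deletes of the query too; any overlap with the index is a candidate."""
    candidates = set()
    for variant in deletes(query, max_dist):
        candidates |= index.get(variant, set())
    return candidates
```

Looking up "cast" against a dictionary containing "cat" now matches through
the shared delete form "cat", without ever generating inserts or
substitutions on the indexing side.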

~~~
wolfgarbe
Here is the link to the SymSpell Github repository:
[https://github.com/wolfgarbe/SymSpell](https://github.com/wolfgarbe/SymSpell)

And here is a benchmark comparing Norvig's spelling corrector, a BK-tree, and SymSpell:
[https://towardsdatascience.com/symspell-vs-bk-tree-100x-faster-fuzzy-string-search-spell-checking-c4f10d80a078](https://towardsdatascience.com/symspell-vs-bk-tree-100x-faster-fuzzy-string-search-spell-checking-c4f10d80a078)

~~~
apendleton
Ah, I should have shared the repo. And thanks for publishing it! We're
experimenting now with adapting this idea but using a directed acyclic FSA to
store the index-time variations instead of a hashtable like in your version,
with the idea that we might be able to search for all of the query-time
variations in a single pass rather than one at a time (as for obvious reasons
they'll be textually similar to one another so there should be some shared
work between the lookups).

------
karterk
This has been an area of interest for me as part of working on Typesense[1].
If you are looking to implement spelling correction, you cannot help but
stumble on this excellent post by Peter Norvig (most of it written during a
boring flight!).

While it's clever and concise, in terms of raw speed, indexing your vocabulary
in a trie and then traversing it while computing Levenshtein distance is way
faster[2]. By using a trie, you eliminate a lot of the unwanted look-ups that
Peter Norvig's brute-force approach makes.
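That trie traversal can be sketched like this (my own toy code, not
Typesense's implementation): walk the trie depth-first, carrying one row of
the Levenshtein DP table per edge, and prune any branch whose row minimum
already exceeds the distance budget.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set to the full word at terminal nodes

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def search(root, query, max_dist):
    """Collect all trie words within `max_dist` edits of `query`."""
    first_row = list(range(len(query) + 1))
    results = []
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_dist, results)
    return results

def _walk(node, ch, query, prev_row, max_dist, results):
    # Advance one Levenshtein DP row for the character on this trie edge.
    row = [prev_row[0] + 1]
    for c in range(1, len(query) + 1):
        insert_cost = row[c - 1] + 1
        delete_cost = prev_row[c] + 1
        replace_cost = prev_row[c - 1] + (query[c - 1] != ch)
        row.append(min(insert_cost, delete_cost, replace_cost))
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    # Recurse only while some cell is still within budget.
    if min(row) <= max_dist:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, max_dist, results)
```

The pruning is what saves the work: whole subtrees of the vocabulary are
skipped as soon as every prefix alignment is already too expensive.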

The other interesting thing for me personally about this problem is that there
is a statistical angle as well. You will often find words that are of the same
edit distance from a given target word -- you will need to rely on some form
of "popularity metric" to rank those tied corrections. E.g. how many times did
people type this word and then change it to another word - that's precisely
the kind of data that Google has and which makes its search suggestions and
spell checking so intuitive.

[1]: [https://typesense.org/](https://typesense.org/) [2]:
[http://stevehanov.ca/blog/index.php?id=114](http://stevehanov.ca/blog/index.php?id=114)

~~~
fnord123
>indexing your vocabulary in a Trie

fst (finite state transducer) is even smaller. And for some languages like
portuguese where a lot of suffixes are extremely common, the size reduction is
dramatic!

>The other interesting thing for me personally about this problem is that
there is a statistical angle as well. You will often find words that are of
the same edit distance from a given target word

Indeed.

[http://www.ling.helsinki.fi/~klinden/pubs/PirinenLrec2010.pdf](http://www.ling.helsinki.fi/~klinden/pubs/PirinenLrec2010.pdf)

------
shdon
Spelling correction rather than spell checking and suggestions really bugs me.
As soon as you have a somewhat international audience (and given that this is
the internet - you probably do) or have any non-English content, automatic
correction can be an actively user-hostile "feature".

Especially on search engines, sometimes I am searching for words in another
language, for mixed-language results, or even doing an intentional search for
misspelt words. In my native Dutch, his first example would already cause
problems: "spelling" means the same as in English, but "speling" means
margin/give (the noun), and I wouldn't want one to be auto-corrected to the
other. The fuzzy search for names on Facebook makes the search pretty much
worthless, especially if you have an unusual variant of a common name in
your search - you can be sure the actual person you're looking for is listed
after half a billion bogus results.

The whole trend of second-guessing the user and saying "you tried to do A, but
we assume you meant B, so we're gonna do B instead" is one of my biggest pet
peeves.

~~~
rietta
Not only there. Automatic word replacement is my absolute least favorite iOS
mobile feature. I cannot tell you how many times I painstakingly, painfully
thumb typed exactly what I meant just to have it changed out from under me.
And I have gone back to see nonsense messages where I know I didn't type what
was sent. Very frustrating.

~~~
crispyporkbites
I think this could be corrected if the spell corrector factored in the time
taken to write the word out. If the user takes longer than average to write
the word they are probably deliberately spelling it out, compared to just
mashing the screen to get the words through.

~~~
rasz
Throw an RNN at it, like this toy example
[http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
and make it figure out which spelling correction is the most probable.

You could even train it on past messages, making it more tailored towards a
particular user.

------
falsedan
The unit tests worry me:

    
    
        assert len(WORDS) == 32192
        assert sum(WORDS.values()) == 1115504
        assert WORDS.most_common(10) == [
         ('the', 79808),
         ('of', 40024),
         ('and', 38311),
         ('to', 28765),
         ('in', 22020),
         ('a', 21124),
         ('that', 12512),
         ('he', 12401),
         ('was', 11410),
         ('it', 10681)]
        assert WORDS['the'] == 79808
    

Those aren't testing the file open, or Counter, or read, but instead are
tightly-coupling the tests to the exact corpus. The code really should have
not hard-coded the corpus, and the tests should have supplied a smaller corpus
of known length where each word's frequency+probability was obvious from
inspection.
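For contrast, a decoupled test might look something like this (a sketch of
mine, not from the article; `words` is re-implemented from the article's
one-liner, and the tiny corpus is made up so every count is obvious by eye):

```python
from collections import Counter
import re

def words(text):
    # the article's tokenizer: lowercased runs of word characters
    return re.findall(r"\w+", text.lower())

def test_counts_from_tiny_corpus():
    # A corpus supplied by the test itself, small enough that each
    # expected number is obvious by inspection, not a magic constant
    # tied to the contents of big.txt.
    corpus = "the cat sat on the mat"
    counts = Counter(words(corpus))
    assert counts["the"] == 2
    assert sum(counts.values()) == 6
    assert counts.most_common(1) == [("the", 2)]
```

The test still exercises tokenizing and counting, but survives any change to
the shipped corpus.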

There's another overfitted test just before this:

    
    
        assert Counter(words('This is a test. 123; A TEST this is.')) == (
               Counter({'123': 1, 'a': 2, 'is': 2, 'test': 2, 'this': 2}))
    

What is this verifying? We know how a Counter works, we don't need to test it;
and the previous test already established the correctness of words().

I see a lot of junior engineers following this style of tightly-coupled,
overfitted tests, and I've seen how a few years of growth can lead to tens of
thousands of tests whose primary value is to provide Amazon with a predictable
stream of AWS revenue (as we run all the tests) and to make the engineer feel
better for writing tests.

~~~
cdancette
This code is not written for production, it's just written to make you
understand how the basic of this technology works. So I would say the unit
tests have the exact same purpose: make the reader understand what the
functions are doing (and not a real unit testing).

Since you speak of value, the value Peter Norvig is trying to provide is
making readers understand the principles, he's not trying to provide some
monetary value. So I don't think your criticisms apply here, those tests fit
the purpose very well.

~~~
falsedan
> _the value Peter Norvig is trying to provide is making readers understand
> the principles, he's not trying to provide some monetary value_

I think he's trying to show _how to approach solving this kind of problem_.
Juniors will copy it, and will then write tests in a business-logic
environment which have dubious value.

The code shows how to solve the problem very well. The tests do not.

------
slx26
This is a really good introduction to some natural language processing
concepts, and the code as other people mentions is very beautiful and easy to
read too.

Just as a curiosity: I once did some text correction applied to tweets, using
word embeddings (word vectors, word2vec), which are a language model too. In
real life contexts, where you don't limit yourself to a dictionary, it's
really interesting because word vectors can include many foreignisms and other
"incorrect" words that you might want to keep in the text.

The article also mentions edit distances, the distances between words (in this
case, the "incorrect" word and its possible corrections). In general, these
algorithms are based on common errors at the keyboard: missing or adding a
letter, transposing two of them, etc. Algorithms like Damerau-Levenshtein
(pretty much the one applied in the article, even though I think it's not
explicitly mentioned, and there are many variations) keep it simple by doing
this. But it's interesting to note that you can also fine-tune distances by
considering which keys are closer to other keys on the keyboard (on QWERTY
it's easier to accidentally type "a" instead of "s" than instead of "u"), and
when doing more general text correction you can also use phonetic distances
or even more sophisticated techniques like finite-state transducers that
apply language-specific rules to suggest corrections. (I know the article is
only meant as an introduction, but I thought someone might find it
interesting.)
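The keyboard-aware idea can be sketched by weighting the substitution cost in
the usual DP (a toy model of my own: the coordinates below treat QWERTY as a
plain grid and ignore row stagger, and the 0.5 cost for adjacent keys is an
arbitrary choice, not a standard constant):

```python
# Crude QWERTY layout: (row, column) coordinates for each letter key.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS)
           for c, ch in enumerate(row)}

def sub_cost(a, b):
    """Cheaper substitution for physically adjacent keys."""
    if a == b:
        return 0.0
    ra, ca = KEY_POS.get(a, (9, 9))
    rb, cb = KEY_POS.get(b, (9, 9))
    return 0.5 if abs(ra - rb) <= 1 and abs(ca - cb) <= 1 else 1.0

def weighted_distance(s, t):
    """Levenshtein DP with keyboard-aware substitution costs."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        row = [float(i)]
        for j, b in enumerate(t, 1):
            row.append(min(row[j - 1] + 1,                   # insertion
                           prev[j] + 1,                      # deletion
                           prev[j - 1] + sub_cost(a, b)))    # substitution
        prev = row
    return prev[-1]
```

With this, "cst" sits closer to "cat" (s is next to a) than "cut" does, even
though both are one substitution away under the unweighted metric.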

------
laurencei
I always wish I could just plug a Google spell-check API into things like
Microsoft Word instead of using the inbuilt ones.

i.e. the number of times I can't spell a word, and MS Word has no idea what
I'm trying to say, so I "copy and paste" the wrong word into Google and it
instantly knows what I meant and shows me the correct spelling.

I appreciate part of that is probably related to my previous searches on
Google, and that it guesses based upon my history - but I'd be OK with my
history being used for "good" like this, i.e. for my spell checking.

~~~
richardboegli
You could make a very simple AutoHotkey script to do this.

~~~
Jaruzel
...or a Word macro that you assign to a toolbar button, that takes the
currently selected word and 'googles' it for the correction.

------
sp332
A couple of years ago, Emily Short made an "interactive fiction" game [that's
what they call text adventures these days] with letter-removal as a major
mechanic. It's called Counterfeit Monkey and it's a lot of fun.
[http://emshort.com/counterfeit_monkey/](http://emshort.com/counterfeit_monkey/)

Install one of the interpreters at the bottom of the page, then download the
"story" file and open it in said interpreter.

~~~
currymj
it is an extremely fun game!

especially for all the oldheads around here, if you ever played Zork or
Hitchhiker's Guide or whatever, give it a shot. it's the same style, just as
funny, with helpful features to make it far less frustrating than the old
games could be, and it really takes advantage of not having to run on a
TRS-80.

~~~
sp332
Here's her blog post about which interpreters to use to play the game.
(Spoilers in the comments but not in the main post.)
[https://emshort.blog/2012/12/31/counterfeit-monkey/](https://emshort.blog/2012/12/31/counterfeit-monkey/)
Also there's a cheat/hint guide [PDF]
[https://emshort.blog/2013/01/24/making-of-counterfeit-monkey-puzzles-and-toys/](https://emshort.blog/2013/01/24/making-of-counterfeit-monkey-puzzles-and-toys/).

There's a spoiler-filled post about the design of the puzzles here.
[https://emshort.blog/2013/01/24/making-of-counterfeit-monkey-puzzles-and-toys/](https://emshort.blog/2013/01/24/making-of-counterfeit-monkey-puzzles-and-toys/)

It won Best Game, Best Setting, Best Puzzles, Best Individual Player
Character, and Best Implementation in 2012!
[http://www.ifwiki.org/index.php/Counterfeit_Monkey](http://www.ifwiki.org/index.php/Counterfeit_Monkey)
There's a nod to this game in [https://xkcd.com/1975/](https://xkcd.com/1975/)
Right-click the image, go to games -> advent.exe and start exploring :)

------
yomritoyj
I started a project to implement Norvig's corrector in different languages:
[https://github.com/jmoy/norvig-spell](https://github.com/jmoy/norvig-spell)

Had a pleasant surprise when Matthias Felleisen responded to a request on the
Racket mailing list and practically rewrote the Racket version.

Also got to learn about the HAT-trie[1] data structure.

[1] [https://en.wikipedia.org/wiki/HAT-trie](https://en.wikipedia.org/wiki/HAT-trie)

------
graup
I wish articles about NLP wouldn't assume by default the language to be
English. Many points of this article can also apply to other languages, but
may require some more thought. Either other languages should be mentioned, or
the title should be "How to Write an English Spelling Corrector".

~~~
mjburgess
The assumption is about the audience, not the language.

The author is an English speaker, internal to an English community, writing
about spell checking.

It's unreasonable to expect every group on the planet to factor in the
concerns of every other group, or to add their group identifier to everything
they write.

------
celerity
The article doesn't mention it explicitly, but this is a nice example of how
using Bayes theorem helps you ignore the hard-to-compute normalization term of
the input space. In the article, this is the P(w) term of

P(c|w) = P(c)P(w|c)/P(w),

where c is a correction, and w is the original word.

The author does implicitly talk about this when he explains that P(c|w)
conflates the two factors, but it's also not that hard to see that getting a
handle on P(w) -- the probability space of misspellings -- is harder than
getting a hold of P(c) -- the probability space of actual words, and Bayes
lets us get rid of the former during optimization.
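Concretely, the ranking step never touches P(w), because it's constant across
all candidates c for a given w (a toy sketch with made-up counts, not
Norvig's corpus; his actual code also narrows candidates to the smallest edit
distance first, which stands in for the P(w|c) error model):

```python
from collections import Counter

# Hypothetical corpus counts standing in for the language model P(c).
WORDS = Counter({"the": 80, "they": 20, "then": 10})
TOTAL = sum(WORDS.values())

def P(word):
    return WORDS[word] / TOTAL

def correction(word, candidates):
    # argmax_c P(c|w) = argmax_c P(c) * P(w|c) / P(w)
    #                 = argmax_c P(c) * P(w|c)   (P(w) is the same for every c)
    # With a uniform error model over equally-distant candidates,
    # only P(c) matters.
    return max(candidates, key=P)
```

So the hard-to-estimate probability space of misspellings drops out of the
optimization entirely, exactly as the comment above describes.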

------
hellofunk
Here is Rich Hickey's port of this Norvig exercise to Clojure:

[https://en.wikibooks.org/wiki/Clojure_Programming/Examples/N...](https://en.wikibooks.org/wiki/Clojure_Programming/Examples/Norvig_Spelling_Corrector)

~~~
abecedarius
It turns out Hickey added some functions to Clojure to be able to write that
code. Nothing wrong with that -- they're useful functions! -- but I found it
kind of amusing. It can be good to be a king.

~~~
hellofunk
I don't see any functions in that code that wouldn't have existed in Clojure
1.0. It's all pretty standard Clojure stuff, very basic 'for' constructs and
generalized lazy sequences with 'range' and others.

Which functions were you thinking of that were explicitly added to the
language and are in use here?

~~~
abecedarius
"Along the way, Clojure got subs(tring), slurp (a file), max-key and min-
key..." at
[https://groups.google.com/forum/#!topic/clojure/YDQdmAsUaFU](https://groups.google.com/forum/#!topic/clojure/YDQdmAsUaFU)

~~~
hellofunk
Thanks, that is definitely interesting. We have Norvig to thank for everyday
Clojure functions.

------
sewer_bird
"But they didn't, and come to think of it, why should they know about
something so far outisde their specialty?"

Outside ;)

~~~
emmelaich
Nice catch!

------
partycoder
Disclaimer:

- Couldn't determine the exact date of that article

- Previous discussion here:
[https://news.ycombinator.com/item?id=12453535](https://news.ycombinator.com/item?id=12453535)

I really enjoyed the code, it is incredibly clever and idiomatic.

~~~
petters
You should check out more code by Norvig then. He has many Python notebooks
and, as you say, writes very beautiful code.

------
crispyporkbites
I wrote a spell checker for my undergraduate degree back in 2011, to help
with automation research. It allowed the researcher to do things like insert
words that would show as incorrect when they were actually correct, and see
if people would notice at varying levels of difficulty. The idea was to try
to measure how some automated systems can improve or reduce net performance.

I actively tried to avoid using libraries, and ended up writing most of it in
vanilla JavaScript - including a bunch of DOM manipulations, which was suicide
at the time. I even wrote my own XML-based DSL for defining the input text.

Would have loved to actually follow through with the research, but PhDs don't
pay bills.

------
sevensor
I see where Norvig gets his reputation for writing readable code. I don't know
that I've ever had so little difficulty understanding somebody else's code.

------
iask
Thanks for sharing! I just started researching how to analyze text for a
project. I am tasked with analyzing content to see if it matches my client's
products or something similar. Matching by the product name is easy, but
matching by similarity is what's throwing me off.

Is there an API service for this type of text analysis?

------
ciconia
> I thought Dean and Bill, being highly accomplished engineers and
> mathematicians, would have good intuitions about how this process works. But
> they didn't, and come to think of it, why should they know about something
> so far outside their specialty?

That kind of speaks volumes about modern academics.

------
known
I've used [http://php.net/manual/en/function.pspell-suggest.php](http://php.net/manual/en/function.pspell-suggest.php)
in one of my projects. It's simple and scalable.

------
chmike
What I would like to know is why the spelling correction suggestions on Apple
iOS are so bad. How can they get away with this? I used to disable it.

It can't even make a suggestion when there is only one missing character.

~~~
slfnflctd
I turned off Autocorrect in short order when I had an iPhone for this reason
(and others). Now, my Android speech recognition - which seems to be getting
worse lately - pulls completely incorrect spelling from thin air constantly,
sometimes not even giving me the proper spelling in the 'correction' drop-down
list.

I'm tempted to wonder whether speech-to-text is actually saving me enough time
after all the fixing I need to do now to be worth it. It's more than a little
frustrating.

------
nl
This is a great technique to know. I’ve used a variant to segment Twitter hash
tags into meaningful words, which is a surprisingly hard thing to do.

~~~
abecedarius
He has that covered too!
[http://norvig.com/ngrams/](http://norvig.com/ngrams/)

But yeah, I tackled the same problem (for Flickr tags) and did not at first
use the "obvious" algorithm; I did something slower and suboptimal.

------
meitham
No PEP8! Code rejected </sarcasm>

The most elegant code I've seen in a while, and you can see the Lisp thinking
apparent in the style.

------
dblotsky
This has been my favourite Python program for many years.

I also recommend his Sudoku solver:
[http://norvig.com/sudoku.html](http://norvig.com/sudoku.html), and his XKCD
regex golf code:
[http://nbviewer.jupyter.org/url/norvig.com/ipython/xkcd1313....](http://nbviewer.jupyter.org/url/norvig.com/ipython/xkcd1313.ipynb).

------
michaelmior
Can mods add (2016) to the title?

