Hacker News new | more | comments | ask | show | jobs | submit login
How to write a spelling corrector (2016) (norvig.com)
346 points by partycoder 10 months ago | hide | past | web | favorite | 84 comments



I know a spelling corrector is not the same thing as a spelling checker, but this is too good an opportunity to pass to promote Martha Snow's hilarious poem 'Spell Chequer':

Eye halve a spelling chequer It came with my pea sea It plainly marques four my revue Miss steaks eye kin knot sea.

Eye strike a quay and type a word And weight four it two say Weather eye am wrong oar write It shows me strait a weigh.

As soon as a mist ache is maid It nose bee fore two long And eye can put the error rite It's rare lea ever wrong.

Eye have run this poem threw it I am shore your pleased two no It's letter perfect awl the weigh My chequer tolled me sew.


Thanks for sharing this.

I found it difficult to parse the initial couple of lines because I was constantly attempting to read by attaching meaning to the spellings: "Eye halve" is a bit frightening in that sense. But then I realised I can read the text as sounds and almost ignore the spellings. Listening to what the sounds made in my head allowed for a much faster pace of comprehension because I didn't have to keep correcting myself.

This poem demonstrated for me that it is possible to read and listen simultaneously, and that I don't usually do that. It seems that it would probably add to the experience of other poems to read them in this manner.


I think this is how all poems should be read, and I didn't realize until just now that I automatically did that.


There is apparently a big divide between people who subvocalize when they read and those who don't. Those who don't tend to read much faster than those who do which is why speedreading techniques tend to focus on eliminated subvocalization. The problem is that people who subvocalize tend to need to do so in order to understand the text.

https://en.wikipedia.org/wiki/Subvocalization


I wonder whether the people who do subvocalize when they read tend to be better at writing poetry (or songwriting, or just writing beautiful prose.) I would expect that they'd have been subconsciously training themselves to the "feel" of good meter.


There might be a correlation, but it's not a hard-and-fast rule. I've been told that my prose -- and poetry -- is good, and I definitely don't subvocalize.


Thanks for the term - I think I typically don't subvocalize, but I do when I read poems.


One part I can't "translate" is this:

    It plainly marques four my revue


"It plainly marks for my review"


It does! A lot, imho.


A fun story in the same vein, Ladle Rat Rotten Hut, by H.L. Chace, can be found here: http://www.exploratorium.edu/files/exhibits/ladle/


Check out the book "Tahoe Leap Eyeball" at Amazon. It's the entire KJV bible in that style. (Tahoe Leap Eyeball = The Holy Bible).

Clearly the process was automated. Ingenious program. More complex than mere homophone substritution... it works syllable by phonetic syllable to string together words that, when spoken, sound pretty dang close to the original.


https://www.hyperborea.org/humor/tweeze.phtml was done mostly by computer. It's a fun challenge to write your program to make these. To tie it back to the OP, you could "spell-correct" the source text into a sequence of words that come closest to reproducing the original sound (but with the source sequence blacklisted).


I'm learning a little Icelandic right now, and it's eerie how similar reading this poem feels. The words make no sense at first glance, but there are sounds and a language I understand somewhere underneath.


Judging by how the iOS speech synthesizer pronounces them, that is a lot better than “Ladle Rat Rotten Hut” and “Tweeze Denied Beef Worker Isthmus” (both mentioned in comments to the post mentioning “Spell Chequer”)


My chequer marks chequer as wrong.


Damn spellcheck reverting to "English (US)".


Obligatory party-pooping: this doesn't really seem to be a poem about a spelling checker or corrector.

Naive spelling [auto]correctors correct by text distance, because people almost always make mistakes in text input by typing the right words but making the wrong motions to do so.

Slightly-less-naive autocorrect takes this approach further, and understands that e.g. "yjr" should become "the" because it's the same letters all shifted over by one.

And, as in the submitted article, the best autocorrectors just look at the whole sentence context and try to predict what word the input "should" have been given what it looks like—this is essentially a kind of compressed sensing (like fMRIs use!), though in practice it tends to be baked down to something like markov chains of levenstein automata with back-propagation on w>n.

...whereas this poem is more like the result of a very naive speech to text algorithm, one which parses each word independently without the context of the sentence-so-far.

(Side-note: I'm surprised that speech-to-text algorithms still work mostly in "real time" with only a limited buffer, unable to go back and change anything more than a few words in the past. It's why they fail to recognize names, for example. If speech-to-text algorithms would buffer the entire audio stream for a dictated document, showing an estimate of the output text so far, but continuing to re-estimate the entire document after every word/sentence/paragraph, they'd perform much better. The difference would be on the level of a standard web-renderer's line-at-a-time reflow algorithm, vs. TeX's whole-document reflow.)


I typed the poem in to MS Word, and the only word flagged was 'chequer'. I guess Word's is a 'naive' spelling checker.


Obligatory "Well Actually": the poem isn't trying to demonstrate the technology behind spellcheckers (or correctors). It's demonstrating a near-universal weakness of algorithmic checking/correcting that isn't shared with human proofreaders: when the mistaken word is the correct spelling of another word. The poem takes this to absurd extremes, but it's quite possible to make such mistakes through typos (e.g., "pun" instead of "pin").


While doing a bunch of research on exactly this problem space recently for a project, I stumbled onto this improvement on the Norvig corrector idea http://blog.faroo.com/2012/06/07/improved-edit-distance-base... that's one of those things that's so deceptively simple you kick yourself for not thinking of it: it turns out you can model the same "generate all the variations" effect but generating only the deletes, rather than also the insertions and substitutions, if you symmetrically apply the same transformation to the lookup side. So if your dictionary contains "cat" and you want to match queries for "cast," rather than adding "cast" to the corpus, you drop the s (and every other single letter) on the lookup side instead, and still match. Turns out it's much faster and, requires a much smaller index, and as a bonus, doesn't tie you to a specific alphabet like Norvig's approach does.


The latest version of SymSpell (by the author of that blog post) handles compound words too, so it's pretty capable along with speed.

https://github.com/wolfgarbe/SymSpellCompound/blob/master/RE...

I'm certainly not knocking that achievement, but the two issues I ran into fairly quickly were: 1. It (currently) doesn't handle words that are genuine words but which are contextually wrong 2. You need a high quality dictionary that's also well aligned with your domain or you'll have poor corrections (this last point is merely a matter of effort, so less of a concern)


Here is the link to the SymSpell Github repository: https://github.com/wolfgarbe/SymSpell

An here a benchmark between Norvig's spelling corrector, BK-tree and SymSpell: https://towardsdatascience.com/symspell-vs-bk-tree-100x-fast...


Ah should have shared the repo, and thanks for publishing it! We're experimenting now with adapting this idea but using a directed acyclic FSA to store the index-time variations instead of a hashtable like in your version, with the idea that we might be able to search for all of the query-time variations in a single pass rather than one at a time (as for obvious reasons they'll be textually similar to one another so there should be some shared work between the lookups).


This has been an area of interest for me as part of working on Typesense[1]. If you are looking to implement spelling correction, you cannot but not stumble on this excellent post by Peter Norvig (most of it written on a bored flight journey!).

While it's clever and concise, in terms of raw speed, indexing your vocabulary in a Trie, and then doing a traversal on it using a Levenshtein distance is way faster[2]. By using a trie, you eliminate lot of unwanted look-ups that Peter Norvig's brute-force approach takes.

The other interesting thing for me personally about this problem is that there is a statistical angle as well. You will often find words that are of the same edit distance from a given target word -- you will need to rely on some form of "popularity metric" to rank those tied corrections. E.g. how many times did people type this word and then change it to another word - that's precisely the kind of data that Google has and which makes its search suggestions and spell checking so intuitive.

[1]: https://typesense.org/ [2]: http://stevehanov.ca/blog/index.php?id=114


>indexing your vocabulary in a Trie

fst (finite state transducer) is even smaller. And for some languages like portuguese where a lot of suffixes are extremely common, the size reduction is dramatic!

>The other interesting thing for me personally about this problem is that there is a statistical angle as well. You will often find words that are of the same edit distance from a given target word

Indeed.

http://www.ling.helsinki.fi/~klinden/pubs/PirinenLrec2010.pd...


Norvig also wrote about using a fancier error model and a faster algorithm at http://norvig.com/ngrams/ -- similar to a trie, but a literal trie would not be so efficient in Python.


Spelling correction rather than spell checking and suggestions really bugs me. As soon as you have a somewhat international audience (and given that this is the internet - you probably do) or have any non-English content, automatic correction can be an actively user-hostile "feature".

Especially on search engines, sometimes I am searching for words in another language, mixed-language results, or even doing an intentional search for misspelt words. In my native Dutch his first example would already cause problems: "spelling" means the same as in English, but "speling" means margin/give (the noun) and I wouldn't want one to be auto-corrected to the other. The fuzzy search for names on Facebook makes the search pretty much worthless especially in case you have an unusual variant of a common name in your search - you can be sure the actual person you're looking for is listed after half a billion bogus results.

The whole trend of second-guessing the user and saying "you tried to do A, but we assume you meant B, so we're gonna do B instead" is one of my biggest pet peeves.


Not only there. Automatic word replacement is my absolute least favorite iOS mobile feature. I cannot tell you how many times I painstakingly, painfully thumb typed exactly what I meant just to have it changed out from under me. And I have gone back to see nonsense messages where I know I didn't type what was sent. Very frustrating.


I think this could be corrected if the spell corrector factored in the time taken to write the word out. If the user takes longer than average to write the word they are probably deliberately spelling it out, compared to just mashing the screen to get the words through.


Throw RNN at it, like this toy example http://karpathy.github.io/2015/05/21/rnn-effectiveness/ and make it figure out which spelling correction is the most probable.

You could even train it on past messages making it more tailored towards particular user.


Agreed. I am a touch typist on a regular computer. I literally hate writing on the phone. I find it an utterly odious process. So much for user friendliness from my point of view.


I distinctly remember the time when all of a sudden many of my friends started sending me sms messages with really odd mistakes in them. I didn't jump on the smartphone wagon for a number of years so I just couldn't understand how these errors were being made at the time.


Agreed !! Actually, it's a real problem that's not limited to spelling. Look at Siri for instance, granted I haven't used it in a few years, back in the days it was impossible to use for most of my contacts. Indeed, I have both English and French contacts and it's a nightmare to use with Siri.

My Siri is in English because I figured it was more reliable for everyday us then in French. But my mom's contact is as 'maman', can't use Siri to call her. Or I have to try and pronounce 'maman' with an english accent and hope Siri know what I'm talking about. And that's an easy one, try a name like 'Jean Brentois'.... But I assume setting up Siri in French will lead to the same problem, but with my non French contacts now. I have not tried however.

So obviously I get the challenge, but it makes it very hard to use (I can't use Siri like tools in parts because of this). Especially since so many of my contacts are international.

EDIT: a quick fix for contact pronounciation might be to ask the user how he pronounces a particular contact when adding it.

Similarly, why not allow for a special syntax allowing for multiple languages in a sentence ? Say my first language is English, my second is French, and my third is Spanish. Why not something like that: "Let's go to the $$casa $demain $avec $toi, what do you think ?"


I almost added autocorrection to a translation pipeline once. Curated a subset of unambiguous safe cases to reduce the manual workload. I parked it away because I didn't see a way to reliably convey this was just lowering the number of manual corrections needed.

A bad part about user-end autocorrection is that the end user is looking only at the case they have in front, not the overall average — if they have a case where it worked very well, they'll start overly trusting it and failing at producing good results. If they have a case where it wasn't useful, they'll be annoyed by it and feel they are doing more than they should. In all cases someone is going to be unhappy.

If I can't expect all ends to understand the value I'd mostly just look for pipelines where a-c could be an internal process and there was no need for the data provider and end user to be aware that it's happening. But then, ascertaining this is a problem on its own.


The unit tests worry me:

    assert len(WORDS) == 32192
    assert sum(WORDS.values()) == 1115504
    assert WORDS.most_common(10) == [
     ('the', 79808),
     ('of', 40024),
     ('and', 38311),
     ('to', 28765),
     ('in', 22020),
     ('a', 21124),
     ('that', 12512),
     ('he', 12401),
     ('was', 11410),
     ('it', 10681)]
    assert WORDS['the'] == 79808
Those aren't testing the file open, or Counter, or read, but instead are tightly-coupling the tests to the exact corpus. The code really should have not hard-coded the corpus, and the tests should have supplied a smaller corpus of known length where each word's frequency+probability was obvious from inspection.

There's another overfitted test just before this:

    assert Counter(words('This is a test. 123; A TEST this is.')) == (
           Counter({'123': 1, 'a': 2, 'is': 2, 'test': 2, 'this': 2}))
What is this verifying? We know how a Counter works, we don't need to test it; and the previous test already established the correctness of words().

I see a lot of junior engineers following this style of tightly-coupled, overfitted tests, and seen how a few years of growth can lead to tens of thousands of tests whose primary value is to provide Amazon with a predictable stream of AWS revenue (as we run all th tests) & make the engineer feel better for writing tests.


This code is not written for production, it's just written to make you understand how the basic of this technology works. So I would say the unit tests have the exact same purpose: make the reader understand what the functions are doing (and not a real unit testing).

Since you speak of value, the value Peter Norvig is trying to provide is making readers understand the principles, he's not trying to provide some monetary value. So I don't think your criticisms apply here, those tests fit the purpose very well.


> the value Peter Norvig is trying to provide is making readers understand the principles, he's not trying to provide some monetary value

I think he's trying to show how to approach solving this kind of problem. Juniors will copy it, and will then write tests in a business-logic environment which have dubious value.

The code shows how to solve the problem very well. The tests do not.


The value he is trying to provide is also showing people by example how to write readable, idiomatic, functional Python code.

From that perspective, I agree the tightly coupled unit tests are a bit grating. Though they would not look out of place as doctests.


"They" are Peter Norvig. I mention this not to appeal to authority, so much as to remind you that you're not reading the code of a junior engineer.


This is a fair criticism, I think. Anyone want to write a blog post about the tests they would write for this task? Maybe it'd be especially interesting as going from "here's what you might do for the first version coded on a plane flight, now here's how it might evolve as it's polished for publication". I'm not volunteering as I'm not especially good at writing tests.


His tests are appropriate because they are task-oriented. He is not writing a general spellchecker, but a spellchecker designed to work with a specific corpus. His tests ensure that the corpus remains unchanged during development (say, doesn't accidentally remove some words). What's implicit here is that if the corpus changes, he may need to change his approach.


> What's implicit here is that if the corpus changes, he may need to change his approach.

Will that be obvious to a junior engineer? (I think that's the target audience for this article)

If I was writing such an article, I would (maybe) call that out as explicit, or (extremely likely) remove those tests so as not to accidentally implicitly recommend overfitted tests as a useful technique.


1. Read blog post.

2. Opine on how blog code is not ready for production.

3. Post to HN.

4. Propser.


This is a 1% engineer? The guy who can't spell "prosper"?


hahaha man. Thank you. I upvoted :)


Some tests are similar to the "texas sharpshooter fallacy".


Those asserts seem like a pre-Jupyter form of narrative.


This is a really good introduction to some natural language processing concepts, and the code as other people mentions is very beautiful and easy to read too.

Just as a curiosity: I once did some text correction applied to tweets, using word embeddings (word vectors, word2vec), which are a language model too. In real life contexts, where you don't limit yourself to a dictionary, it's really interesting because word vectors can include many foreignisms and other "incorrect" words that you might want to keep in the text.

The article also mentions edit distances, the distances between words (in this case, the "incorrect" word and its possible corrections). In general, algorithms are based on common errors at the keyboard. Missing or adding a letter, transposing two of them, etc. Algorithms like Damerau-Levenshtein (pretty much the one applied in the article, even though it's not explicitly mentioned I think, and there are many variations) keep it simple by doing this, but it's interesting to notice that you can also fine-tune distances by considering which keys are closer to other keys in the keyboard (it's easier to accidentally type "a" instead of "s" in qwerty than "u"), but when doing more general text correction you can also use phonetic distances or even more sophisticated techniques like finite-state transducers that apply language-specific rules to suggest corrections. (I know the article is only meant as an introduction, but thought someone might find it interesting)


I always wish I could just plugin a Google spelling check API into things like Microsoft Word instead of using the inbuilt ones.

i.e. the amount of times I cant spell a word, and MS Word has no idea what I'm trying to say, so I "copy and paste" the wrong word into Google and it instantly knows what I meant and shows me the correct spelling.

I appreciate part of that is probably related to my previous searches on Google and it guesses based upon my history - but I'd be ok with using this for "good" like my spell checking.


You could make a very simple AutoHotkey script to do this.


...or a Word macro that you assign to a toolbar button, that takes the currently selected word and 'googles' it for the correction.


A couple of years ago, Emily Short made an "interactive fiction" game [that's what they call text adventures these days] with letter-removal as a major mechanic. It's called Counterfeit Monkey and it's a lot of fun. http://emshort.com/counterfeit_monkey/

Install one of the interpreters at the bottom of the page, then download the "story" file and open it in said interpreter.


There's an updated version here - I didn't realize that page had an old release on it. https://emshort.blog/2017/12/26/counterfeit-monkey-release-7...


it is an extremely fun game!

especially for all the oldheads around here, if you ever played Zork or Hitchhiker's Guide or whatever, give it a shot. it's the same style, just as funny, with helpful features to make it far less frustrating than the old games could be, and it really takes advantage of not having to run on a TRS-80.


Here's her blog post about which interpreters to use to play the game. (Spoilers in the comments but not in the main post.) https://emshort.blog/2012/12/31/counterfeit-monkey/ Also there's a cheat/hint guide [PDF] https://emshort.blog/2013/01/24/making-of-counterfeit-monkey....

There's a spoiler-filled post about the design of the puzzles here. https://emshort.blog/2013/01/24/making-of-counterfeit-monkey...

It won Best Game, Best Setting, Best Puzzles, Best Individual Player Character, and Best Implementation in 2012! http://www.ifwiki.org/index.php/Counterfeit_Monkey There's a nod to this game in https://xkcd.com/1975/ Right-click the image, go to games -> advent.exe and start exploring :)


I started a project to implement Norvig's corrector in different languages: https://github.com/jmoy/norvig-spell

Had a pleasant surprise when Matthias Felleisen responded to a request on the Racket mailing list and practically rewrote the Racket version.

Also got to learn about the HAT-trie[1] data structure.

[1] https://en.wikipedia.org/wiki/HAT-trie


I wish articles about NLP wouldn't assume by default the language to be English. Many points of this article can also apply to other languages, but may require some more thought. Either other languages should be mentioned, or the title should be "How to Write an English Spelling Corrector".


The assumption is about the audience, not the language.

The author is an English speaker, internal to an English community, writing about spell checking.

Its unreasonable to wish every group on the planet to factor in the concerns of every other group, or add their group identifier to everything they write.


As this is an issue that concerns you, perhaps you could take a fraction of the time the author spent on this article to share some insights on the problems specific to other languages. I, for one, would be interested to read about them.


The article doesn't mention it explicitly, but this is a nice example of how using Bayes theorem helps you ignore the hard-to-compute normalization term of the input space. In the article, this is the P(w) term of

P(c|w) = P(c)P(w|c)/P(w),

where c is a correction, and w is the original word.

The author does implicitly talk about this when he explains that P(c|w) conflates the two factors, but it's also not that hard to see that getting a handle on P(w) -- the probability space of misspellings -- is harder than getting a hold of P(c) -- the probability space of actual words, and Bayes lets us get rid of the former during optimization.


Here is Rich Hickey's port of this Norvig exercise to Clojure:

https://en.wikibooks.org/wiki/Clojure_Programming/Examples/N...


It turns out Hickey added some functions to Clojure to be able to write that code. Nothing wrong with that -- they're useful functions! -- but I found it kind of amusing. It can be good to be a king.


I don't see any functions in that code that wouldn't have existed in Clojure 1.0. It's all pretty standard Clojure stuff, very basic 'for' constructs and generalized lazy sequences with 'range' and others.

Which function were you thinking were explicitly added to the language that are in use here?


"Along the way, Clojure got subs(tring), slurp (a file), max-key and min-key..." at https://groups.google.com/forum/#!topic/clojure/YDQdmAsUaFU


Thanks, that is definitely interesting. We have Norvig to thank for every-day Clojure functions.


Looks pretty unreadable and verbose.


"But they didn't, and come to think of it, why should they know about something so far outisde their specialty?"

Outside ;)


Nice catch!


Disclaimer:

- Couldn't determine the exact date of that article

- Previous discussion here: https://news.ycombinator.com/item?id=12453535

I really enjoyed the code, it is incredibly clever and idiomatic.


You should check out more code by Norvig then. He has many Python notebooks and, as you say, writes very beautiful code.


I wrote a spell checker for my undergraduate to help with automation research back in 2011, it allowed the researcher to do things like insert words that would show as incorrect when they were actually correct, and see if people would notice with varying levels of difficulty. The idea was to try to measure how some automated systems can improve or reduce net performance.

I actively tried to avoid using libraries, and ended up writing most of it in vanilla JavaScript - including a bunch of DOM manipulations, which was suicide at the time. I even wrote my own XML-based DSL for defining the input text.

Would have loved to actually follow through with the research but phd's don't pay bills.


I see where Norvig gets his reputation for writing readable code. I don't know that I've ever had so little difficulty understanding somebody else's code.


Thanks for sharing! I just started researching how to analyze text for a project. I am tasked with analyzing content to see if they match my client products or something similar. Matching my the product name is easy, but matching by similar is what’s throwing me off.

Is there an api service for these type of text analysis?


> I thought Dean and Bill, being highly accomplished engineers and mathematicians, would have good intuitions about how this process works. But they didn't, and come to think of it, why should they know about something so far outside their specialty?

That kind of speaks volumes about modern academics.


I've used http://php.net/manual/en/function.pspell-suggest.php in one of my projects. It's simple and scalable;


What I would like to know is why the spelling correction suggestion of Apple iOS are so bad. Why can they get along with this ? I used to disable it.

It can't even make a suggestion when there is only one missing character.


I turned off Autocorrect in short order when I had an iPhone for this reason (and others). Now, my Android speech recognition - which seems to be getting worse lately - pulls completely incorrect spelling from thin air constantly, sometimes not even giving me the proper spelling in the 'correction' drop-down list.

I'm tempted to wonder whether speech-to-text is actually saving me enough time after all the fixing I need to do now to be worth it. It's more than a little frustrating.


Personally, I get confused why the spell-checker for Chrome is so bad. I so often type words that it recognizes are wrong, but has no suggestions at all for, when the correct word is just a single character off.

elipsis (ellipsis), elipses (ellipses), independant (independent), disserning (discerning)...

How can it not find the correct word?


This is a great technique to know. I’ve used a variant to segment Twitter hash tags into meaningful words, which is a surprisingly hard thing to do.


He has that covered too! http://norvig.com/ngrams/

But yeah, I tackled the same problem (for Flickr tags) and did not at first use the "obvious" algorithm; I did something slower and suboptimal.


No PEP8! Code rejected </sarcasm>

Most elegant code I've actually seen, and can see the lisp thinking apparent on the style.


This has been my favourite Python program for many years.

I also recommend his Sudoku solver: http://norvig.com/sudoku.html, and his XKCD regex golf code: http://nbviewer.jupyter.org/url/norvig.com/ipython/xkcd1313.....


Can mods add (2016) to the title?




Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: