

How to Write a Spelling Corrector - dlnovell
http://www.norvig.com/spell-correct.html

======
RiderOfGiraffes
Previous discussions:

<http://news.ycombinator.com/item?id=42587>

<http://news.ycombinator.com/item?id=327897>

Many great thoughts and comments have already been posted, so it's worth
reading the thoughts of HN contributors as well as this classic from Norvig.
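For readers who haven't seen the linked article, its core idea can be sketched in a few lines of Python: generate every string within edit distance 1 of the misspelled word, then pick the candidate that occurs most often in a training corpus. The tiny `WORDS` counter below is a toy stand-in for Norvig's big.txt counts.

```python
# Sketch of the approach from the linked Norvig article (toy corpus,
# edit-distance-1 candidates only; the article also uses distance 2).
from collections import Counter

WORDS = Counter("the quick brown fox the lazy dog the end".split())

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Known word as-is, else the most frequent known word one edit away."""
    candidates = ({w for w in [word] if w in WORDS}
                  or {w for w in edits1(word) if w in WORDS}
                  or {word})
    return max(candidates, key=lambda w: WORDS[w])
```

With the toy corpus above, `correction("teh")` returns `"the"` because the transpose edit produces it and it is the most frequent known candidate.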

~~~
fizx
Google has also released a full-web 5-gram corpus that contains "counts for
all 1,176,470,663 five-word sequences that appear at least 40 times." It's fun
to try training on that.

[http://googleresearch.blogspot.com/2006/08/all-our-n-gram-
ar...](http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-
to-you.html)
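Training on a corpus like that mostly means loading tab-separated count files into a frequency table. A minimal sketch, assuming the commonly described "n-gram&lt;TAB&gt;count" line layout (the sample lines and counts here are made up):

```python
# Hypothetical loader for Web-1T-style n-gram files; assumes each line
# is "w1 w2 ... wn<TAB>count". Sample data is invented for illustration.
from collections import Counter
from io import StringIO

sample = StringIO(
    "ceramics collectables collectibles for sale\t130\n"
    "ceramics collectables fine antiques and\t45\n"
)

def load_ngrams(lines):
    counts = Counter()
    for line in lines:
        ngram, _, count = line.rstrip("\n").rpartition("\t")
        counts[tuple(ngram.split())] = int(count)
    return counts

counts = load_ngrams(sample)
```

In practice the real files are tens of gigabytes, so you would stream them and keep only the n-grams you need rather than hold everything in memory.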

~~~
raffi
As the comments (on the google page) illustrate, this corpus is close to
useless unless you're a university researcher. Commercial use of this data
costs big $$$. And if you're an independent you still have to pay $150 or so
to get LDC to mail the six DVDs to you.

One of the painful things about commercial NLP work is the lack of good
datasets without restrictions. There are some treasures like WordNet (which is
a lexical db). For tagged data I found the American National Corpus, but the
excerpt they release without restrictions is too little data and covers only a
few styles of writing.

I had to spend a lot of time gathering, cleaning, and marking up my own data
to make After the Deadline.

~~~
lsb
Start with Wikipedia? They've got a few billion words there, and it's a 5GB
download.

~~~
raffi
That is what I ended up doing. Except this is raw data. Usually NLP
researchers work from corpora that are already segmented into sentences and
tagged with part-of-speech tags. I had to go through the hassle of segmenting
and marking up the Wikipedia and Project Gutenberg texts myself.
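To give a sense of what "segmenting" involves, here is a deliberately naive sentence splitter: it breaks on sentence-final punctuation followed by whitespace and a capital letter. A real segmenter (the kind the corpora above ship with) must also handle abbreviations, quotes, and initials, which is exactly the hassle being described.

```python
# Naive sentence segmenter -- an illustrative sketch, not what a real
# NLP pipeline should use (it will mis-split on "Dr. Smith", etc.).
import re

def naive_sentences(text):
    # Split after . ! or ? when followed by whitespace and a capital letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [s.strip() for s in parts if s.strip()]
```

For example, `naive_sentences("Hello world. It works! Right?")` yields three sentences, but `"Dr. Smith arrived."` would be wrongly split in two.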

~~~
mattrepl
Would you consider making it available to others?

Perhaps the UCI Machine Learning Repository would accept it. If not, I'm sure
I could find a way to have it hosted at my university.

~~~
bravura
archive.org will host datasets.

To preprocess wikipedia, I have used the following software:
[http://sourceforge.net/apps/mediawiki/wikiprep/index.php?tit...](http://sourceforge.net/apps/mediawiki/wikiprep/index.php?title=Main_Page)

To remove boilerplate from gutenberg requires painfully constructed
heuristics. It would be great to have software released to do that.
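One common heuristic is to keep only the text between Gutenberg's `*** START OF ...` and `*** END OF ...` marker lines. A minimal sketch, assuming those markers are present (older e-texts use different headers, which is why heuristics get painful):

```python
# Heuristic Gutenberg boilerplate stripper: keeps the text between the
# "*** START OF ..." and "*** END OF ..." marker lines when both exist.
# Older e-texts use other markers, so this is a sketch, not a solution.
def strip_gutenberg_boilerplate(text):
    lines = text.splitlines()
    start = end = None
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1
        elif line.startswith("*** END OF"):
            end = i
            break
    if start is None:
        return text  # markers not found; fall back to the raw text
    return "\n".join(lines[start:end]).strip()
```

Even with both markers found, license blocks and transcriber notes can survive inside the body, so a production pipeline layers several such heuristics.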

------
abecedarius
Norvig expanded on this theme in a chapter in
<http://oreilly.com/catalog/9780596157111/> (not yet out) -- the draft I read
applied the Google n-gram corpus to word segmentation, decryption, and a
faster spelling corrector. Lovely and instructive code, as always.

~~~
raffi
That will probably be a really good book. I read Programming Collective
Intelligence by Toby Segaran and took a lot from it.

~~~
abecedarius
Could be! I've only seen the one chapter (because it used a couple suggestions
I'd made for the original article linked above).

------
Create
In case you:

\- do not have the luxury of such a large ecological footprint (taking into
account all the externalities too),

\- are not always connected, and

\- are not granted access to the full UN corpus etc.,

then you can still do quite well, cheaper and smarter.

<http://hunspell.sourceforge.net/>

Hunspell is the default spell checker of OpenOffice.org and Mozilla Firefox 3
& Thunderbird. Google hasn't beaten that yet.

~~~
raffi
If we're going to yackety yack about spell checkers, After the Deadline is
really accurate because it looks at context and uses trained neural networks
to sort recommendations. Hunspell does sophisticated stuff, but as Norvig said
(somewhere else), the simplest technique with 10x the data will beat the most
complicated technique every time.

Oh and AtD does grammar, style checking, and misused words. Embed it in your
application today :) <http://www.afterthedeadline.com>
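"Looks at context" can be made concrete with a small sketch: instead of ranking spelling candidates by raw word frequency alone, rank them by how often each follows the preceding word in a bigram table. The counts below are invented for illustration; AtD's actual models are not public in this thread, so this shows the general idea, not its implementation.

```python
# Context-aware candidate ranking via bigram counts -- a toy sketch of
# the general technique, with invented counts, not AtD's actual model.
from collections import Counter

BIGRAMS = Counter({
    ("world", "peace"): 120,
    ("world", "piece"): 3,
    ("a", "piece"): 95,
    ("a", "peace"): 2,
})

def rank_by_context(prev_word, candidates):
    """Prefer the candidate seen most often after prev_word."""
    return max(candidates, key=lambda w: BIGRAMS[(prev_word, w)])
```

Given the typo "peice", a plain frequency model can't choose between "peace" and "piece", but context can: after "world" it picks "peace", after "a" it picks "piece".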

------
jcromartie
Interesting: a Java implementation of this algorithm is 372 lines, while a
Clojure one is 18!

~~~
nikblack
Actually, 35 lines in Java; the other one is a poor implementation.

Execution speed is more important than lines of code in this case.

