

Does a small startup have a chance on hard problems? AtD vs. The Spell Checker Poem - raffi
http://killall.dashnine.org/2009/05/the-spell-checker-poem-shootout/

======
raffi
This is a work in progress. I'm working on grammar rules now. Last night though
I pasted the spell checker poem in just to see how the system would respond to
it and I was really surprised (sometimes we're our own harshest critics). I
have a plugin for WordPress too at <http://www.afterthedeadline.com>

~~~
jcl
Did it really get the correct word for the first "weigh"? (which is "away",
not "way")

~~~
raffi
Hah, ok, score -1 for me. It didn't. It came up with "way" (which was preceded
by "a"). I think you'll forgive me, as that piece is something of a landmine.

------
lacker
This is pretty neat to see. But this is probably the wrong metric to use to
measure the performance of a spell checker. In the real world you don't want to
spell check writing that has 0.5 spelling mistakes per word, and many of these
misspellings are not representative of how people actually misspell words.
(Who would actually type "mist ache" for "mistake"?) Also, this may just
reflect the fact that you have your thresholds for tagging a misspelling set
lower. For good metrics you would want to have a large corpus of real
documents and measure both the false positive and false negative rates.
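
A minimal sketch of the measurement described above, assuming each word in the corpus has already been hand-labeled as an error or not. The `Token` type and `error_rates` function are hypothetical names for illustration, not part of AtD or any other checker.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    is_error: bool  # gold label: is this word actually wrong in context?
    flagged: bool   # did the spell checker flag it?


def error_rates(tokens):
    """Return (false_positive_rate, false_negative_rate) for a labeled corpus."""
    false_pos = sum(1 for t in tokens if t.flagged and not t.is_error)
    false_neg = sum(1 for t in tokens if not t.flagged and t.is_error)
    correct_words = sum(1 for t in tokens if not t.is_error)
    wrong_words = sum(1 for t in tokens if t.is_error)
    # Guard against empty classes so a tiny corpus does not divide by zero.
    fpr = false_pos / correct_words if correct_words else 0.0
    fnr = false_neg / wrong_words if wrong_words else 0.0
    return fpr, fnr


sample = [
    Token("I", False, False),
    Token("have", False, False),
    Token("a", False, False),
    Token("Hunspell", False, True),   # correct word wrongly flagged
    Token("chequer", True, True),     # real-word error that was caught
    Token("sea", True, False),        # real-word error that was missed
]
fpr, fnr = error_rates(sample)
```

Lowering the flagging threshold trades one rate for the other, which is why both need to be reported together.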

(To be fair, the author probably knows this stuff already, and is just having
fun.)

~~~
raffi
I'm the author. There are two types of spelling errors: real word and non-real
word errors.

A non-real word error is what most spellcheckers look for. They check if a
word is in a dictionary and if it isn't they generate a set of potential words
from that error and then rack and stack them. AtD's non-real word spelling
corrector uses a dictionary gleaned from the intersection of many wordlists
and a corpus of text. The rating of each correction is handled by neural
networks. It's all fun and I have carefully measured, tweaked, and remeasured
this part of the technology. I also reproduced an experiment conducted by some
Polish researchers in 2005 and used their data to find out how I compare to
Office 2003. I learned my dictionary needs more work but my accuracy is on par
with Office here.
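
The check, generate, and rank loop for non-real word errors can be sketched roughly as follows. This is not AtD's implementation: the comment says AtD rates corrections with neural networks, while this sketch swaps in plain word frequency as the ranking signal. The `WORD_COUNTS` table is a made-up toy dictionary, and `edits1` and `suggest` are hypothetical helper names.

```python
import string
from collections import Counter

# Toy dictionary with made-up frequencies; a real one would be built from
# wordlists and a corpus of text.
WORD_COUNTS = Counter({
    "the": 500, "there": 150, "their": 120,
    "spell": 40, "spelling": 30, "mistake": 25, "checker": 20,
})


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, counts=WORD_COUNTS, limit=3):
    """Return ranked corrections, or [] if the word is in the dictionary."""
    word = word.lower()
    if word in counts:
        return []  # a known word is not a non-real word error
    candidates = [w for w in edits1(word) if w in counts]
    # Rank by frequency (most common first), breaking ties alphabetically.
    return sorted(candidates, key=lambda w: (-counts[w], w))[:limit]
```

With this toy dictionary, `suggest("teh")` proposes `"the"`, while `suggest("the")` returns an empty list because the word is already known. That second case is exactly why a dictionary lookup alone cannot catch real word errors.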

The real word errors are just as serious but somehow ignored by most spelling
correctors. A real word error is where you use a word that is in the
dictionary but really meant some other word. You can imagine why this is
difficult. I use a statistical model to try to pick the best word from
pre-made confusion sets. While I was impressed with the accuracy of this, I had
to bias heavily against false positives, so it catches fewer errors than I'd
like to see. I'm currently working to have the grammar checker take over or
augment the handling of some words where this technology does poorly. That
said, this poem is all about real word detection and correction, and I was
quite delighted to see how well my code did against it anyway (even with the
heavy bias against false positives).
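
One way to picture the confusion set approach is the sketch below. It is an illustration under stated assumptions, not AtD's model: the statistical model here is a tiny bigram table with made-up counts, the confusion sets are hand-picked, and `min_ratio` is a hypothetical knob standing in for the bias against false positives.

```python
from collections import Counter

# Hand-picked confusion sets (assumption: real ones would be much larger).
CONFUSION_SETS = [{"eye", "i"}, {"sea", "see"}, {"their", "there"}]

# Made-up bigram counts of (previous word, word); "<s>" marks sentence start.
BIGRAMS = Counter({
    ("<s>", "i"): 900, ("<s>", "eye"): 5,
    ("i", "have"): 400, ("eye", "have"): 1,
    ("can", "see"): 300, ("can", "sea"): 1,
    ("the", "sea"): 80, ("the", "see"): 2,
})


def score(prev, word, nxt):
    """Add-one smoothed fit of word between its left and right neighbors."""
    return (BIGRAMS[(prev, word)] + 1) * (BIGRAMS[(word, nxt)] + 1)


def check(words, min_ratio=10.0):
    """Return (index, original, suggestion) for likely real word errors.

    A suggestion is only made when the alternative scores at least
    min_ratio times better than the word as written, which biases the
    checker against false positives.
    """
    padded = ["<s>"] + [w.lower() for w in words] + ["</s>"]
    suggestions = []
    for i in range(1, len(padded) - 1):
        word, prev, nxt = padded[i], padded[i - 1], padded[i + 1]
        for confusion_set in CONFUSION_SETS:
            if word not in confusion_set:
                continue
            alternatives = sorted(confusion_set - {word})
            best = max(alternatives, key=lambda alt: score(prev, alt, nxt))
            if score(prev, best, nxt) >= min_ratio * score(prev, word, nxt):
                suggestions.append((i - 1, words[i - 1], best))
    return suggestions
```

Raising `min_ratio` makes the checker quieter: it flags fewer correct words, but it also lets more genuine errors through, which is the trade-off described above.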

So in short, yeah I'm just having fun.

------
gregk
I tried the demo at <http://www.polishmywriting.com/> and it found some
mistakes I missed in my writing. I am impressed with its suggestions, although
it flagged many technical words as spelling errors. Check it out on your
writing.

------
dangoldin
It seems in this case you can just break them down into sets of phonemes and
do a frequency analysis to get the most common usages.

That would take care of the "Eye have".

------
hwijaya
Just fyi, there are other startups in Australia that also work on this
spelling problem <http://spellr.us/>

~~~
raffi
Judging by the results it looks like they are using Hunspell or Aspell to do
the actual checking. I applaud their well-executed effort. AtD is more of a
complete writing improvement solution that someone like spellr.us could use to
provide their service.

------
alain94040
Sounds like a great OEM play: license your technology so it becomes
widespread.

