
Spelling Corrector in 21 lines of Python - l0nwlf
http://norvig.com/spell-correct.html
======
rayval
Interesting that the shortest (and among the most readable) is the 15-line
version written in old-school AWK:
[http://pacman.blog.br/wiki/index.php?title=Um_Corretor_Ortog...](http://pacman.blog.br/wiki/index.php?title=Um_Corretor_Ortográfico_em_GAWK)

It's a surprise that the Perl implementation weighs in at 63 lines; I would
have expected much less. A much shorter version is probably possible, relying
on idiomatic constructs at the expense of readability.

~~~
tedunangst
If you run a2p on the AWK version, then remove some boilerplate and reflow it
to match the original's formatting, you end up with only about 18 lines of Perl.

------
timrobinson
I'm no Python expert, but I liked doing this in Haskell:

[http://www.partario.com/blog/2009/10/a-spelling-corrector-in...](http://www.partario.com/blog/2009/10/a-spelling-corrector-in-haskell.html)

[https://github.com/timrobinson/spell-correct/blob/master/Cor...](https://github.com/timrobinson/spell-correct/blob/master/Correct.hs)

~~~
cormullion
I liked your Haskell version. It makes me want to learn the language!

------
clvv
I have seen this one before somewhere, and what amazes me is how you can
solve problems without a hassle if you get the "trick" right. Another case I
read about was that Google uses (or at least used) two vectors, each
consisting of many 0s and 1s indicating whether a web page contains a given
keyword, to represent web pages, and calculates the angle between the vectors
to get a single similarity value between the pages.
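
A minimal sketch of that angle-between-vectors idea, i.e. plain cosine
similarity over binary keyword vectors (the vocabulary and page vectors below
are made up for illustration):

    import math

    def cosine_similarity(a, b):
        """Cosine of the angle between two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    # Hypothetical vocabulary: ["python", "spell", "corrector", "haskell", "perl"]
    # 1 means the page contains that keyword, 0 means it does not.
    page_a = [1, 1, 1, 0, 0]
    page_b = [1, 1, 0, 0, 1]
    print(cosine_similarity(page_a, page_b))  # ~0.667: fairly similar pages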

~~~
tyler
It sounds like you're conflating two techniques here. The first (as others
have mentioned) is cosine similarity, which measures the angle between the
vectors. However, the bit about 0s and 1s sounds like you're talking about
locality-sensitive hashing
(<http://en.wikipedia.org/wiki/Locality_sensitive_hashing>). LSH is often used
to estimate cosine similarity, as cosine similarity can be quite expensive to
calculate. I know Google and others use it for exactly that.
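
For concreteness, a rough sketch of the random-hyperplane flavor of LSH (one
standard way to approximate cosine similarity; the dimension, seed, and bit
count here are arbitrary illustration choices):

    import math
    import random

    def random_hyperplanes(num_bits, dim, seed=0):
        """One random Gaussian hyperplane normal per signature bit."""
        rng = random.Random(seed)
        return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

    def signature(vec, planes):
        """Bit i records which side of hyperplane i the vector falls on."""
        return [1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                for plane in planes]

    def estimated_cosine(sig_a, sig_b):
        """Pr[bits differ] ~ angle/pi, so recover the angle and take cos."""
        frac = sum(a != b for a, b in zip(sig_a, sig_b)) / len(sig_a)
        return math.cos(math.pi * frac)

    planes = random_hyperplanes(256, 5)
    a = signature([1, 1, 1, 0, 0], planes)
    b = signature([1, 1, 0, 0, 1], planes)
    print(estimated_cosine(a, b))  # roughly 0.67, the exact cosine of the pair

Comparing short bit signatures like this is far cheaper than computing dot
products over full-vocabulary vectors, which is the point of the technique.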

------
prs
Scrolling to the bottom of the article gives you a list of similar
implementations in other languages.

Very interesting to see how others have tackled this programming problem.

~~~
gregschlom
I was somewhat amused to see that most other implementations take around
30-60 lines of code, except Java: 372.

But the comparison may not be fair, as the Java author may have tweaked the
algorithm or added new features.

~~~
freakwit
There are two implementations listed for Java. The first is 35 LOC and the
second is 372.

------
tgflynn
Anyone else having trouble parsing this list comprehension?

    e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS

~~~
marcinw
Written as traditional nested for loops:

    def known_edits2(word):
        L = []
        # Every edit-distance-1 variant of each edit-distance-1 variant,
        # i.e. every candidate at edit distance 2 from the original word.
        for e1 in edits1(word):
            for e2 in edits1(e1):
                if e2 in NWORDS:  # keep only words we actually know
                    L.append(e2)
        return set(L)

~~~
tgflynn
Thanks for the clarification. I didn't realize that list comprehensions mapped
so directly onto nested loops; I was particularly confused by the variable
scoping rules.

------
veb
In PHP, you can use the levenshtein() function.

<http://php.net/manual/en/function.levenshtein.php>
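
For comparison, a minimal Python sketch of the same edit-distance computation
(the textbook dynamic-programming formulation, not taken from the article):

    def levenshtein(a, b):
        """Minimum number of single-character insertions, deletions,
        and substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))   # row 0: "" -> prefixes of b
        for i, ca in enumerate(a, 1):
            curr = [i]                   # first column: prefixes of a -> ""
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # delete ca
                                curr[j - 1] + 1,      # insert cb
                                prev[j - 1] + cost))  # substitute / match
            prev = curr
        return prev[-1]

    print(levenshtein("speling", "spelling"))  # 1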

~~~
adamzochowski
Levenshtein is good for many languages.

However, for English nothing beats a Soundex-type algorithm. I believe the
major SQL databases and PHP both provide soundex().

Soundex is an old algorithm, about a century old, designed to find immigrants
by their last name no matter how they transliterated it from their native
language into English. For example: whether someone wrote their name as
Szczybliewski or as Shcheeblevsky, Soundex should return a close match.

Metaphone is an improved version of Soundex (also available in PHP), and if
you look carefully you can find Double Metaphone out there as well.
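
For the curious, a compact Python sketch of classic American Soundex (an
illustration written for this thread, not PHP's or any database's
implementation):

    def soundex(name):
        """Classic American Soundex: first letter plus three digits."""
        codes = {}
        for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                             ("l", "4"), ("mn", "5"), ("r", "6")]:
            for c in group:
                codes[c] = digit
        name = name.lower()
        digits = [codes.get(c, "") for c in name]  # "" for a,e,i,o,u,y,h,w
        result = name[0].upper()
        prev = digits[0]
        for c, d in zip(name[1:], digits[1:]):
            if c in "hw":        # h and w do not separate a run of one code
                continue
            if d and d != prev:
                result += d
            prev = d
        return (result + "000")[:4]

    print(soundex("Robert"))  # R163
    print(soundex("Rupert"))  # R163: same code, so a plausible match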

Kind regards

~~~
Terretta
I dropped by to say this.

The examples in Future Work 2 would all have been resolved by checking results
against Soundex, a simple check with significant improvement.

There are two classes of errors: misspellings and typos. Edit distance
(Levenshtein) is reasonable for typos, while the examples in Future Work 2 are
misspellings.

Another trivial improvement is to weight edit distance by typo distance:
substitutions between keys that sit close together on the keyboard are more
likely, so they should cost less.
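
A rough sketch of that weighting (hypothetical key coordinates and cost
constants; a Levenshtein variant where substituting physically adjacent keys
is cheaper):

    # Approximate (column, row) positions of letters on a QWERTY keyboard.
    KEY_POS = {c: (x, row)
               for row, line in enumerate(["qwertyuiop", "asdfghjkl", "zxcvbnm"])
               for x, c in enumerate(line)}

    def sub_cost(a, b):
        """Substitution cost scaled by key distance (made-up constants)."""
        if a == b:
            return 0.0
        (ax, ay), (bx, by) = KEY_POS[a], KEY_POS[b]
        dist = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
        return min(1.0, 0.4 + 0.2 * dist)  # neighbors ~0.6, distant keys 1.0

    def typo_distance(a, b):
        """Levenshtein distance with keyboard-weighted substitutions."""
        prev = [float(j) for j in range(len(b) + 1)]
        for i, ca in enumerate(a, 1):
            curr = [float(i)]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1.0,      # deletion
                                curr[j - 1] + 1.0,  # insertion
                                prev[j - 1] + sub_cost(ca, cb)))
            prev = curr
        return prev[-1]

    print(typo_distance("cat", "cay"))  # 0.6: t and y are neighbors
    print(typo_distance("cat", "caq"))  # 1.0: t and q are far apart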

------
SlyShy
If anyone is interested in other applications of this technique, this is a
Ruby library I wrote to do sentence tokenization:
<https://github.com/SlyShy/Tactful_Tokenizer>

------
singular
I found it fascinating that you can actually implement something seemingly so
magical in such a small amount of code.

I'm down as 22 lines of C#, though to be fair I am cheating vastly and the
lines are huge :)

[http://www.codegrunt.co.uk/2010/11/02/C-Sharp-Norvig-Spellin...](http://www.codegrunt.co.uk/2010/11/02/C-Sharp-Norvig-Spelling-Corrector.html)

C# does offer some nice features for succinctness, but there's no getting
away from the verbosity of a Java-like language.

~~~
Someone
Clean, but I think your 'idiomatic' version should use words.TryGetValue(key,
out value) instead of the duplicated dictionary lookup of known words via
words.Contains(key) and words[key].

~~~
singular
Thanks, good point. Fixed :)

