Spelling Corrector in 21 lines of Python (norvig.com)
181 points by l0nwlf on Dec 23, 2010 | 27 comments


Interesting that the shortest (and among the most readable) is the 15-line version written in old-school AWK: http://pacman.blog.br/wiki/index.php?title=Um_Corretor_Ortog...

It's surprising that the Perl implementation weighs in at 63 lines; I would have expected far fewer. A much shorter version is surely possible, relying on idiomatic constructs at the expense of readability.


If you run a2p on the AWK code, then remove some boilerplate and compress the formatting as in the original, you end up with only about 18 lines of Perl.


I like awk (and used it a lot 15-20 years ago), but that code to my eye is oversqueezed and hard to read.


That's perl6, btw...



I liked your Haskell version. It makes me want to learn the language!


I have seen this one before somewhere, and what amazes me is how you can solve problems without a hassle if you get the "trick" right. Another case I read about: Google uses (or at least used) two vectors (each consisting of 0s and 1s indicating whether the web page contains a given keyword) to represent web pages, and calculates the angle between the vectors to figure out the similarity (a single value) between the pages.


It sounds like you're conflating two techniques here. The first (as others have mentioned) is cosine similarity, which measures the angle between the vectors. However, the bit about 0s and 1s sounds like you're talking about locality-sensitive hashing (http://en.wikipedia.org/wiki/Locality_sensitive_hashing). LSH is often used to estimate cosine similarity, since exact cosine similarity can be quite expensive to compute at scale; I know Google and others use it for that purpose.
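If it helps to see it concretely, here is a minimal sketch of the random-hyperplane flavour of LSH (the function names and toy vectors below are mine, purely for illustration, not anything from Google's systems): each bit of a signature records which side of a random hyperplane the vector falls on, and the fraction of matching bits between two signatures estimates the angle, hence the cosine.

  import numpy as np

  def lsh_signatures(vectors, n_bits=256, seed=0):
      # One random hyperplane per bit; a bit is True when the vector lies
      # on the positive side of that hyperplane (sign of the dot product).
      planes = np.random.RandomState(seed).randn(n_bits, vectors.shape[1])
      return vectors @ planes.T > 0

  def estimated_cosine(sig_a, sig_b):
      # P(bits differ) approximates angle / pi, so recover the angle and take cos.
      theta = np.pi * np.mean(sig_a != sig_b)
      return np.cos(theta)

  # Two toy "pages" as 0/1 keyword-presence vectors.
  pages = np.array([[1, 0, 1, 1, 0, 1],
                    [1, 0, 1, 0, 0, 1]], dtype=float)
  sigs = lsh_signatures(pages, n_bits=4096)
  print(estimated_cosine(sigs[0], sigs[1]))   # should land close to the exact value of ~0.866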


You are probably talking about cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity), a commonly used technique in the field of IR/NLP.


That would be the cosine metric, then. I don't think it's that simple: the position (or dimension, if you will) of a keyword in the data vector is all-important. Two vectors that contain the same keywords will still show a large cosine distance the moment those keywords don't sit in the same positions.


That wouldn't work well, because you need to take into account the relevance of a keyword for those web pages.

Check out the tf-idf weight to get an idea of how to do that: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
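Roughly, and glossing over the many variants (the corpus and function below are made up for illustration; real systems usually normalise and smooth these weights):

  import math
  from collections import Counter

  def tf_idf(docs):
      # docs: list of token lists. Returns one {term: weight} dict per document.
      # tf  = how often the term occurs in this document,
      # idf = log(total docs / docs containing the term), so ubiquitous terms score ~0.
      n = len(docs)
      df = Counter(term for doc in docs for term in set(doc))
      return [{t: count * math.log(n / df[t]) for t, count in Counter(doc).items()}
              for doc in docs]

  corpus = [["the", "spelling", "corrector"],
            ["the", "search", "engine"],
            ["the", "spelling", "bee"]]
  print(tf_idf(corpus)[0])   # "the" gets weight 0.0; "corrector" scores highest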

A useful trick based on the technique you describe: build vectors where each dimension represents a user in the system (1 means that user visited the web page). Since people tend to be interested in a narrow range of topics, pages visited by the same users are likely to be related. Furthermore, you can extend this to find similar users.

Amazon does this to find similar or complementary products.


Scrolling to the bottom of the article gives you a list of similar implementations in other languages.

Very interesting to see how others have tackled this programming problem.


I was somewhat amused to see that most other implementations take around 30-60 lines of code, except Java: 372.

But the comparison may not be fair, as the Java author may have tweaked the algorithm or added new features.


There are two implementations listed for Java. The first is 35 LOC and the second is 372.


Anyone else having trouble parsing this list comprehension?

  e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS


In traditional for-loop fashion:

  def known_edits2(word):
    L = []

    for e1 in edits1(word):    # every string one edit away from `word`
      for e2 in edits1(e1):    # and one further edit away from each of those
        if e2 in NWORDS:       # keep only strings that are known words
          L.append(e2)

    return set(L)


Thanks for the clarification. I didn't realize that list comprehensions mapped so directly onto nested loops; I was particularly confused by the variable scoping rules.


The double "comprehension" loop is like a double nested "regular" loop here.
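A tiny illustration of how the clauses nest (the names here are made up for the example): the for clauses read left to right, the leftmost being the outermost loop, and each later clause can use the variables bound by the earlier ones.

  pairs = [(x, y) for x in range(3) for y in range(x)]

  # ...is the same as:
  pairs = []
  for x in range(3):        # first 'for' clause = outer loop
      for y in range(x):    # second clause sees x, just like the nested version
          pairs.append((x, y))

  print(pairs)              # [(1, 0), (2, 0), (2, 1)] either way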


In PHP, you can use the levenshtein() function.

http://php.net/manual/en/function.levenshtein.php


Levenshtein is good for many languages.

However, for English nothing beats a Soundex-type algorithm. I believe the major SQL databases and PHP ship with Soundex.

Soundex is an old algorithm, about a century old, designed to find immigrants by their last name no matter how they had transliterated it from their native language into English. For example, whether someone wrote their name as Szczybliewski or as Shcheeblevsky, Soundex should return a close match.

Metaphone is an improved version of Soundex (also available in PHP), and if you look carefully you can find Double Metaphone out there too.
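For the curious, a minimal sketch of classic American Soundex in Python (my own simplification of the published rules, so treat it as illustrative rather than a drop-in replacement for the PHP/SQL built-ins):

  def soundex(word):
      # Letters in the same group sound alike and share a digit;
      # vowels and y separate runs, while h and w are transparent.
      codes = {}
      for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                             ("l", "4"), ("mn", "5"), ("r", "6")):
          for ch in letters:
              codes[ch] = digit
      word = word.lower()
      result = word[0].upper()
      prev = codes.get(word[0])       # the first letter's code suppresses an immediate repeat
      for ch in word[1:]:
          if ch in "hw":              # transparent: doesn't break a run of equal codes
              continue
          code = codes.get(ch)
          if code is None:            # vowel or y: breaks the run but isn't coded
              prev = None
          elif code != prev:
              result += code
              prev = code
      return (result + "000")[:4]     # pad/truncate to one letter plus three digits

  print(soundex("Robert"), soundex("Rupert"))               # R163 R163
  print(soundex("Szczybliewski"), soundex("Shcheeblevsky")) # S142 S141 (adjacent codes, not identical)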

Kind regards


I dropped by to say this.

The examples in Future Work 2 would all have been resolved by checking results against Soundex, a simple addition that yields a significant improvement.

There are two classes of errors: misspellings and typos. Edit distance (Levenshtein) is reasonable for typos, while the examples in Future Work 2 are misspellings.

Another trivial improvement is to weight edit distance by typo distance.
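As a sketch of what that could look like (assuming "typo distance" means keyboard proximity; the row layout, the 0.5 cost and the names below are my own choices): substitutions between neighbouring QWERTY keys are made cheaper than arbitrary substitutions, so slips of the finger rank closer than genuinely different letters.

  QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
  POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}

  def sub_cost(a, b):
      # Half-price substitution when the two keys are adjacent on the keyboard.
      if a == b:
          return 0.0
      (r1, c1), (r2, c2) = POS.get(a, (9, 9)), POS.get(b, (9, 9))
      return 0.5 if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1 else 1.0

  def weighted_edit_distance(s, t):
      # Standard Levenshtein recurrence, but with the weighted substitution cost.
      prev = [float(j) for j in range(len(t) + 1)]
      for i, a in enumerate(s, 1):
          cur = [float(i)]
          for j, b in enumerate(t, 1):
              cur.append(min(prev[j] + 1.0,                  # deletion
                             cur[j - 1] + 1.0,               # insertion
                             prev[j - 1] + sub_cost(a, b)))  # substitution
          prev = cur
      return prev[-1]

  print(weighted_edit_distance("spelk", "spell"))   # 0.5: k sits next to l
  print(weighted_edit_distance("spelq", "spell"))   # 1.0: q is nowhere near l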


Levenshtein/edit distance isn't that useful for spell checking on its own, though. Edit distance gives you the absolute difference between two words, but your goal is to find the dictionary words with the smallest such distance from the word in question. Simply calculating the edit distance between the word in question and every other dictionary word isn't good enough; you'll have to be a bit more clever.


Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance) is taught in algorithms courses as a standard example of dynamic programming.

Some discussion here: http://stackoverflow.com/questions/1564184/edit-distance-alg...
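The textbook formulation, for reference (plain Python, my own variable names): cell (i, j) of the table holds the edit distance between the first i characters of one word and the first j characters of the other.

  def levenshtein(a, b):
      # table[i][j] = edit distance between a[:i] and b[:j]
      table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
      for i in range(len(a) + 1):
          table[i][0] = i                    # delete all of a[:i]
      for j in range(len(b) + 1):
          table[0][j] = j                    # insert all of b[:j]
      for i in range(1, len(a) + 1):
          for j in range(1, len(b) + 1):
              table[i][j] = min(table[i - 1][j] + 1,                           # deletion
                                table[i][j - 1] + 1,                           # insertion
                                table[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
      return table[-1][-1]

  print(levenshtein("kitten", "sitting"))   # 3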


If anyone is interested in other applications of this technique, this is a Ruby library I wrote to do sentence tokenization: https://github.com/SlyShy/Tactful_Tokenizer


I found it fascinating that you can actually implement something seemingly so magical in such a small amount of code.

I'm down as 22 lines of C#, though to be fair I am cheating vastly and the lines are huge :)

http://www.codegrunt.co.uk/2010/11/02/C-Sharp-Norvig-Spellin...

C# does offer some nice features that buy you succinctness, but there's no getting away from the verbosity of a Java-like language.


Clean, but I think your 'idiomatic' version should use words.TryGetValue(key, out value) instead of the duplicated dictionary lookup of known words in words.Contains(key) and words[key].


Thanks, good point. Fixed :)



