Spelling Corrector in 21 lines of Python (norvig.com)
181 points by l0nwlf on Dec 23, 2010 | 27 comments


Interesting that the shortest (and among the most readable) is the 15-line version written in old-school AWK: http://pacman.blog.br/wiki/index.php?title=Um_Corretor_Ortog...

It's surprising that the Perl implementation weighs in at 63 lines; I would have expected far fewer. A much shorter version is surely possible, relying on idiomatic constructs at the expense of readability.


If you run a2p on the AWK code, then remove some boilerplate and compress the formatting as in the original, you end up with only about 18 lines of Perl.


I like awk (and used it a lot 15-20 years ago), but that code to my eye is oversqueezed and hard to read.


That's perl6, btw...



I liked your Haskell version. It makes me want to learn the language!


I have seen this one before somewhere, and what amazes me is how you can solve problems without a hassle if you get the "trick" right. Another case I read about: Google uses (or at least used) two vectors (each consisting of 0s and 1s indicating whether the web page contains a given keyword) to represent web pages, and calculates the angle between the vectors to figure out the similarity (a single value) between the pages.


It sounds like you're conflating two techniques here. The first (as others have mentioned) is cosine similarity, which measures the angle between the vectors. However, the bit about 0s and 1s sounds like you're talking about locality-sensitive hashing (http://en.wikipedia.org/wiki/Locality_sensitive_hashing). LSH is often used to estimate cosine similarity, since exact cosine similarity can be quite expensive to compute at scale; I know Google and others use it for that purpose.
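If it helps to see it concretely, here is a minimal sketch of the random-hyperplane flavour of LSH (the function names and toy vectors below are mine, purely for illustration, not anything from Google's systems): each bit of a signature records which side of a random hyperplane the vector falls on, and the fraction of matching bits between two signatures estimates the angle, hence the cosine.

  import numpy as np

  def lsh_signatures(vectors, n_bits=256, seed=0):
      # One random hyperplane per bit; a bit is True when the vector lies
      # on the positive side of that hyperplane (sign of the dot product).
      planes = np.random.RandomState(seed).randn(n_bits, vectors.shape[1])
      return vectors @ planes.T > 0

  def estimated_cosine(sig_a, sig_b):
      # P(bits differ) approximates angle / pi, so recover the angle and take cos.
      theta = np.pi * np.mean(sig_a != sig_b)
      return np.cos(theta)

  # Two toy "pages" as 0/1 keyword-presence vectors.
  pages = np.array([[1, 0, 1, 1, 0, 1],
                    [1, 0, 1, 0, 0, 1]], dtype=float)
  sigs = lsh_signatures(pages, n_bits=4096)
  print(estimated_cosine(sigs[0], sigs[1]))   # should land close to the exact value of ~0.866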


You are probably talking about cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity), a commonly used technique in the field of IR/NLP.


That would be the cosine metric, then. I don't think it's that simple: the position (or dimension, if you will) of a keyword in the data vector is all-important. Two vectors that contain the same keywords will still show a large cosine distance the moment those keywords don't sit in the same positions.


That wouldn't work well, because you need to take into account the relevance of a keyword for those web pages.

Check out the tf-idf weight to get an idea of how to do that: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
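Roughly, and glossing over the many variants (the corpus and function below are made up for illustration; real systems usually normalise and smooth these weights):

  import math
  from collections import Counter

  def tf_idf(docs):
      # docs: list of token lists. Returns one {term: weight} dict per document.
      # tf  = how often the term occurs in this document,
      # idf = log(total docs / docs containing the term), so ubiquitous terms score ~0.
      n = len(docs)
      df = Counter(term for doc in docs for term in set(doc))
      return [{t: count * math.log(n / df[t]) for t, count in Counter(doc).items()}
              for doc in docs]

  corpus = [["the", "spelling", "corrector"],
            ["the", "search", "engine"],
            ["the", "spelling", "bee"]]
  print(tf_idf(corpus)[0])   # "the" gets weight 0.0; "corrector" scores highest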

A useful trick based on the technique you describe: build vectors where each dimension represents a user in the system (1 means that user visited the web page). Since people tend to be interested in a narrow range of topics, pages visited by the same users are likely to be related. Furthermore, you can extend this to find similar users.

Amazon does this to find similar or complementary products.


Scrolling to the bottom of the article gives you a list of similar implementations in other languages.

Very interesting to see how others have tackled this programming problem.


I was somewhat amused to see that most other implementations take around 30-60 lines of code, except Java: 372.

But the comparison may not be fair, as the Java author may have tweaked the algorithm or added new features.


There are two implementations listed for Java. The first is 35 LOC and the second is 372.


Anyone else having trouble parsing this list comprehension?

  e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS


In traditional for-loop fashion:

  def known_edits2(word):
    L = []

    for e1 in edits1(word):    # every string one edit away from `word`
      for e2 in edits1(e1):    # and one further edit away from each of those
        if e2 in NWORDS:       # keep only strings that are known words
          L.append(e2)

    return set(L)


Thanks for the clarification. I didn't realize that list comprehensions mapped so directly onto nested loops; I was particularly confused by the variable scoping rules.


The double "comprehension" loop is like a double nested "regular" loop here.
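A tiny illustration of how the clauses nest (the names here are made up for the example): the for clauses read left to right, the leftmost being the outermost loop, and each later clause can use the variables bound by the earlier ones.

  pairs = [(x, y) for x in range(3) for y in range(x)]

  # ...is the same as:
  pairs = []
  for x in range(3):        # first 'for' clause = outer loop
      for y in range(x):    # second clause sees x, just like the nested version
          pairs.append((x, y))

  print(pairs)              # [(1, 0), (2, 0), (2, 1)] either way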


In PHP, you can use the levenshtein() function.

http://php.net/manual/en/function.levenshtein.php


Levenshtein is good for many languages.

However, for English nothing beats a Soundex-type algorithm. I believe the major SQL databases and PHP ship with Soundex.

Soundex is an old algorithm, about a century old, designed to find immigrants by their last name no matter how they had transliterated it from their native language into English. For example, whether someone wrote their name as Szczybliewski or as Shcheeblevsky, Soundex should return a close match.

Metaphone is an improved version of Soundex (also available in PHP), and if you look carefully you can find Double Metaphone out there too.
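For the curious, a minimal sketch of classic American Soundex in Python (my own simplification of the published rules, so treat it as illustrative rather than a drop-in replacement for the PHP/SQL built-ins):

  def soundex(word):
      # Letters in the same group sound alike and share a digit;
      # vowels and y separate runs, while h and w are transparent.
      codes = {}
      for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                             ("l", "4"), ("mn", "5"), ("r", "6")):
          for ch in letters:
              codes[ch] = digit
      word = word.lower()
      result = word[0].upper()
      prev = codes.get(word[0])       # the first letter's code suppresses an immediate repeat
      for ch in word[1:]:
          if ch in "hw":              # transparent: doesn't break a run of equal codes
              continue
          code = codes.get(ch)
          if code is None:            # vowel or y: breaks the run but isn't coded
              prev = None
          elif code != prev:
              result += code
              prev = code
      return (result + "000")[:4]     # pad/truncate to one letter plus three digits

  print(soundex("Robert"), soundex("Rupert"))               # R163 R163
  print(soundex("Szczybliewski"), soundex("Shcheeblevsky")) # S142 S141 (adjacent codes, not identical)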

Kind regards


I dropped by to say this.

The examples in Future Work 2 would all have been resolved by checking results against Soundex, a simple addition that yields a significant improvement.

There are two classes of errors: misspellings and typos. Edit distance (Levenshtein) is reasonable for typos, while the examples in Future Work 2 are misspellings.

Another trivial improvement is to weight edit distance by typo distance.
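As a sketch of what that could look like (assuming "typo distance" means keyboard proximity; the row layout, the 0.5 cost and the names below are my own choices): substitutions between neighbouring QWERTY keys are made cheaper than arbitrary substitutions, so slips of the finger rank closer than genuinely different letters.

  QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
  POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}

  def sub_cost(a, b):
      # Half-price substitution when the two keys are adjacent on the keyboard.
      if a == b:
          return 0.0
      (r1, c1), (r2, c2) = POS.get(a, (9, 9)), POS.get(b, (9, 9))
      return 0.5 if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1 else 1.0

  def weighted_edit_distance(s, t):
      # Standard Levenshtein recurrence, but with the weighted substitution cost.
      prev = [float(j) for j in range(len(t) + 1)]
      for i, a in enumerate(s, 1):
          cur = [float(i)]
          for j, b in enumerate(t, 1):
              cur.append(min(prev[j] + 1.0,                  # deletion
                             cur[j - 1] + 1.0,               # insertion
                             prev[j - 1] + sub_cost(a, b)))  # substitution
          prev = cur
      return prev[-1]

  print(weighted_edit_distance("spelk", "spell"))   # 0.5: k sits next to l
  print(weighted_edit_distance("spelq", "spell"))   # 1.0: q is nowhere near l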


Levenshtein/edit distance isn't that useful for spell checking on its own, though. Edit distance gives you the absolute difference between two words, but your goal is to find the dictionary words with the smallest such distance from the word in question. Simply calculating the edit distance between the word in question and every other dictionary word isn't good enough; you'll have to be a bit more clever.


Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance) is taught in algorithms courses as a standard example of dynamic programming.

Some discussion here: http://stackoverflow.com/questions/1564184/edit-distance-alg...
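The textbook formulation, for reference (plain Python, my own variable names): cell (i, j) of the table holds the edit distance between the first i characters of one word and the first j characters of the other.

  def levenshtein(a, b):
      # table[i][j] = edit distance between a[:i] and b[:j]
      table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
      for i in range(len(a) + 1):
          table[i][0] = i                    # delete all of a[:i]
      for j in range(len(b) + 1):
          table[0][j] = j                    # insert all of b[:j]
      for i in range(1, len(a) + 1):
          for j in range(1, len(b) + 1):
              table[i][j] = min(table[i - 1][j] + 1,                           # deletion
                                table[i][j - 1] + 1,                           # insertion
                                table[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
      return table[-1][-1]

  print(levenshtein("kitten", "sitting"))   # 3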


If anyone is interested in other applications of this technique, this is a Ruby library I wrote to do sentence tokenization: https://github.com/SlyShy/Tactful_Tokenizer


I found it fascinating that you can actually implement something seemingly so magical in such a small amount of code.

I'm down as 22 lines of C#, though to be fair I am cheating vastly and the lines are huge :)

http://www.codegrunt.co.uk/2010/11/02/C-Sharp-Norvig-Spellin...

C# does offer some nice features that buy you succinctness, but there's no getting away from the verbosity of a Java-like language.


Clean, but I think your 'idiomatic' version should use words.TryGetValue(key, out value) instead of the duplicated dictionary lookup of known words in words.Contains(key) and words[key].


Thanks, good point. Fixed :)



