A surprise is that the Perl implementation weighs in at 63 lines; I would have expected much less. A far shorter version is probably possible, relying on idiomatic constructs at the expense of readability.
I have seen this one before somewhere, and what amazes me is how you can solve problems without a hassle once you get the "trick" right. Another case I read about: Google uses (or at least used) vectors of 0s and 1s, where each position records whether a web page contains a given keyword, and calculates the angle between two pages' vectors to get a similarity score.
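For concreteness, here is a minimal sketch of that idea in Python (the keyword vectors are made up for illustration):

    import math

    def cosine_similarity(a, b):
        # a, b: equal-length 0/1 keyword vectors (1 = page contains keyword i).
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    page1 = [1, 0, 1, 1, 0]   # made-up keyword-presence vectors
    page2 = [1, 1, 1, 0, 0]
    print(cosine_similarity(page1, page2))  # ~0.67, i.e. fairly similar pages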
It sounds like you're conflating two techniques here. The first (as others have mentioned) is cosine similarity, which measures the angle between the vectors. The bit about 0s and 1s, though, sounds like locality-sensitive hashing (http://en.wikipedia.org/wiki/Locality_sensitive_hashing). LSH is often used to estimate cosine similarity, since exact cosine similarity can be quite expensive to compute at scale, and I know Google and others use it for exactly that.
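Roughly, the random-hyperplane flavor of LSH looks like this (a toy sketch; the dimension and signature length here are arbitrary choices):

    import math, random

    def random_hyperplanes(dim, n_bits, seed=0):
        # One random Gaussian hyperplane per signature bit.
        rng = random.Random(seed)
        return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def signature(vec, planes):
        # Each bit records which side of a hyperplane the vector falls on.
        return [1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                for plane in planes]

    def estimated_cosine(sig_a, sig_b):
        # P[bits agree] = 1 - angle/pi, so the angle (hence the cosine)
        # can be recovered from the fraction of agreeing bits.
        agree = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
        return math.cos(math.pi * (1 - agree))

    planes = random_hyperplanes(dim=5, n_bits=64)
    sig1 = signature([1, 0, 1, 1, 0], planes)
    sig2 = signature([1, 1, 1, 0, 0], planes)
    print(estimated_cosine(sig1, sig2))  # should land near the true ~0.67

Comparing short bit signatures this way is much cheaper than taking dot products over huge sparse vectors, which is the whole point.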
That would be the cosine metric, then. I don't think it's quite that simple, though: the position (or dimension, if you will) a keyword occupies in the vector is all-important. Two pages containing the same keywords will still show a large cosine distance if those keywords are not mapped to the same positions, so the keyword-to-dimension mapping has to be consistent across all vectors.
A useful trick based on the technique you describe: build vectors where each dimension represents a user in the system (1 means the user visited that web page). Since people tend to be interested in a narrow range of topics, pages with similar visitor sets are likely about similar things. Furthermore, you can flip this around to find similar users.
Amazon does this to find similar or complementary products.
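A toy sketch of both directions, with a made-up visit matrix (rows are pages, columns are users):

    import math

    def cos_sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    # Made-up visit matrix: visits[page][i] == 1 means user i visited the page.
    visits = {
        "pageA": [1, 1, 0, 1],
        "pageB": [1, 1, 0, 0],
        "pageC": [0, 0, 1, 1],
    }
    # Page-to-page similarity: pages that share an audience.
    print(cos_sim(visits["pageA"], visits["pageB"]))
    # User-to-user similarity: transpose so each vector is one user's history.
    users = [list(col) for col in zip(*visits.values())]
    print(cos_sim(users[0], users[1]))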
Thanks for the clarification. I didn't realize that list comprehensions mapped so directly onto nested loops; I was particularly confused by the variable-scoping rules.
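For anyone else who was confused, the correspondence is mechanical: the loop variables nest left to right. The one wrinkle is that in Python 3 the comprehension's loop variables don't leak into the enclosing scope, while the explicit loop's do.

    # A comprehension and its loop equivalent.
    pairs = [(x, y) for x in range(3) for y in range(3) if x != y]

    pairs_loop = []
    for x in range(3):
        for y in range(3):
            if x != y:
                pairs_loop.append((x, y))

    assert pairs == pairs_loop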
However, for English nothing beats a Soundex-type algorithm. I believe the major SQL databases and PHP have Soundex built in.
Soundex is an old algorithm, about a century old, designed to find immigrants by their last names no matter how they had been transliterated from the native language into English. For example, whether someone arrived with the name Szczybliewski or Shcheeblevsky, Soundex should return a close match.
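For reference, a compact sketch of American Soundex (it handles the h/w rule but assumes a non-empty alphabetic input and skips edge cases real implementations cover):

    def soundex(word):
        codes = {}
        for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                               ("l", "4"), ("mn", "5"), ("r", "6")]:
            for ch in letters:
                codes[ch] = digit
        word = word.lower()
        result = word[0].upper()
        prev = codes.get(word[0], "")
        for ch in word[1:]:
            digit = codes.get(ch, "")
            if digit and digit != prev:
                result += digit
            if ch not in "hw":       # h and w do not reset the previous code
                prev = digit
        return (result + "000")[:4]

    # The two example spellings code near-identically: S142 vs. S141.
    print(soundex("Szczybliewski"), soundex("Shcheeblevsky"))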
Metaphone is an improved version of Soundex (also available in PHP), and if you look carefully you can find Double Metaphone implementations out there too.
The examples in Future Work 2 would all have been resolved by checking candidates against their Soundex codes, a simple check that yields a significant improvement.
There are two classes of errors: misspellings and typos. Edit distance (Levenshtein) is reasonable for typos, while the examples in Future Work 2 are misspellings.
Another trivial improvement is to weight edit distance by typo distance, i.e. make substitutions between physically adjacent keys cheaper than substitutions between distant ones.
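A sketch of what that weighting could look like, using QWERTY key adjacency to discount substitutions between neighbouring keys (the 0.5 cost is an arbitrary choice):

    QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
    KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS)
               for c, ch in enumerate(row)}

    def sub_cost(a, b):
        if a == b:
            return 0.0
        (r1, c1) = KEY_POS.get(a, (9, 9))
        (r2, c2) = KEY_POS.get(b, (9, 9))
        # Adjacent keys cost 0.5; anything farther costs the full 1.0.
        return 0.5 if abs(r1 - r2) + abs(c1 - c2) <= 1 else 1.0

    def weighted_edit_distance(s, t):
        m, n = len(s), len(t)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1): d[i][0] = float(i)
        for j in range(n + 1): d[0][j] = float(j)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
        return d[m][n]

    # 'cat' -> 'cst' (a and s are neighbours) scores 0.5; 'cat' -> 'cpt' scores 1.0.
    print(weighted_edit_distance("cat", "cst"), weighted_edit_distance("cat", "cpt"))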
Levenshtein/edit distance isn't that useful for spell checking on its own, though. Edit distance gives you the difference between two particular words, but the goal is to find the dictionary words with the smallest such distance to the word in question. Simply computing the edit distance between that word and every word in the dictionary isn't good enough; you'll have to be a bit more clever.
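One standard way to be more clever is a BK-tree, which indexes the dictionary by edit distance so a lookup only has to examine a fraction of the words. A rough sketch with a toy dictionary:

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    class BKTree:
        def __init__(self, words):
            it = iter(words)
            self.root = (next(it), {})      # each node: (word, {distance: child})
            for w in it:
                self.add(w)

        def add(self, word):
            node = self.root
            while True:
                d = levenshtein(word, node[0])
                if d == 0:
                    return
                if d in node[1]:
                    node = node[1][d]
                else:
                    node[1][d] = (word, {})
                    return

        def query(self, word, max_dist):
            hits, stack = [], [self.root]
            while stack:
                w, children = stack.pop()
                d = levenshtein(word, w)
                if d <= max_dist:
                    hits.append((d, w))
                # Triangle inequality: only children whose edge distance lies
                # within [d - max_dist, d + max_dist] can possibly match.
                stack.extend(child for dist, child in children.items()
                             if d - max_dist <= dist <= d + max_dist)
            return sorted(hits)

    tree = BKTree(["spelling", "spewing", "selling", "swelling", "smelling"])
    print(tree.query("speling", 2))   # only nearby dictionary words come back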
Clean, but I think your 'idiomatic' version should use words.TryGetValue(key, out value) instead of the duplicated dictionary lookup of known words in words.Contains(key) and words[key].
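For comparison, the same micro-fix in Python terms (made-up frequency dict; dict.get plays the role of TryGetValue):

    words = {"the": 52, "of": 31}    # made-up word-frequency dict
    key = "the"

    # Two hash lookups:
    count = words[key] if key in words else 0

    # One lookup, same result:
    count = words.get(key, 0)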