
How I Trie to Make Spelling Suggestions - raffi
http://blog.afterthedeadline.com/2010/01/29/how-i-trie-to-make-spelling-suggestions/
======
mrshoe
Rather than re-implement this, it might be better to use an existing library.
In Python, I've used pylevenshtein:

<http://code.google.com/p/pylevenshtein/>

However, I've found the Jaro-Winkler distance to be more useful than the
Levenshtein. You can find good implementations for Jaro out there as well.

<http://en.wikipedia.org/wiki/Jaro-Winkler_distance>
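For reference, plain Levenshtein distance is short enough to implement directly if you'd rather skip the dependency. A generic DP sketch (not pylevenshtein's API):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance:
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]
```

It runs in O(len(a) * len(b)) time with O(len(b)) memory; the C implementations in those libraries do the same thing, just much faster.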

~~~
raffi
These algorithms solve a different problem. They calculate the edit distance
between two given strings A and B. The algorithm in the post finds all strings
B within an edit distance N of string A.
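The distinction can be sketched like this (Norvig-style candidate generation, not the trie code from the post): rather than scoring one pair, you enumerate every string within N edits and intersect with the dictionary.

```python
import string

def edits1(word):
    # All strings within edit distance 1 of `word`:
    # deletions, transpositions, substitutions, insertions.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R
                for c in string.ascii_lowercase]
    inserts = [L + c + R for L, R in splits for c in string.ascii_lowercase]
    return set(deletes + transposes + replaces + inserts)

def known_within(word, dictionary, n=1):
    # All dictionary words within edit distance n of `word`.
    candidates = {word}
    for _ in range(n):
        candidates |= {e for w in candidates for e in edits1(w)}
    return candidates & dictionary
```

The candidate set blows up fast (roughly 54n+25 strings per word at distance 1), which is exactly what the trie walk in the post avoids materializing.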

------
elblanco
Clever title. In my experience, tries tend to be very memory intensive with
only a marginal increase in speed over alternatives like simple hash lookups.
However, this is a very interesting and more typical use case for a trie. I've
always liked it better than Hamming distance, which still seems to get
recommended in undergraduate texts.
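Hamming distance only counts substitutions between equal-length strings, which is why it falls short for spelling. A quick sketch to illustrate:

```python
def hamming(a, b):
    # Substitution-only distance; undefined for unequal lengths.
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))
```

A one-letter insertion like "speling" vs. "spelling" is out of reach entirely, while edit distance handles it in a single operation.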

Oh, and I glossed over the Perl code, until I reread it and realized it was
Sleep code. First time hearing about this language...interesting.

<http://sleep.dashnine.org/>

~~~
raffi
AtD uses a lot of memory, but this technique made a difference speeding up the
spell checker. For things best done with a hash lookup, we use hashtables.
We're still using this for generating edits (I wrote most of this post in
August '09 and finally finished it today).

[Edit: I wrote Sleep :) I don't promote it much but it's been with me for many
years and I've applied it to a lot of problems including the NLP in AtD]

~~~
elblanco
Sweet. I tried doing a similar experiment targeting C++ years ago but never
finished it.

Nicely done! How far along is it? Writing Perl-ish code that targets the JVM
seems inspired to me.

Object oriented? Can you write threaded code and all that? (There have been a
number of times where Perl's shortcomings in those areas have prevented me
from building better code.)

~~~
raffi
It's not OO but you can roll your own. See:
<http://sleep.dashnine.org/manual/functions.html#3>

It does threading using a fork(&function) abstraction. You can pass an initial
set of variables to share but otherwise it's shared nothing. The fork returns
an I/O handle which you can use to communicate values back and forth (even
serialized objects). You can also wait(fork(&function)) to get a return value
when the thread finishes.
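For comparison, a rough Python analogue of that shared-nothing pattern, using multiprocessing and a Pipe in place of Sleep's fork() I/O handle (illustrative only, not Sleep's API):

```python
from multiprocessing import Process, Pipe

def worker(conn, start):
    # Shared-nothing: `start` is copied in, and results come back
    # over the pipe, much like values over Sleep's fork() handle.
    conn.send(start * 2)
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child, 21))
    p.start()
    result = parent.recv()   # read a value back from the "fork"
    p.join()                 # wait for the worker to finish
    print(result)
```

The join-then-collect step plays the role of wait(fork(&function)) getting the return value when the thread finishes.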

The manual covers the whole language: <http://sleep.dashnine.org/manual/>

An article talking about some of the fun things to do with continuations:
<http://today.java.net/pub/a/today/2008/07/24/fun-with-continuations.html>

Blog with Sleep examples: <http://www.jroller.com/sleepsnip/feed/entries/rss>

There is also a web app server for it.
<http://www.hick.org/~raffi/moconti.html>

------
abecedarius
Norvig posted a similar algorithm using a hashtable instead of a trie: edits()
in ngrams.py at <http://norvig.com/ngrams/>

It uses a table of all prefixes of all dictionary words. This might be more or
less efficient than a trie, depending on implementation; but in interpreted
Python the built-in hashtables are bound to win.
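The prefix-table idea can be sketched roughly like this (not the actual ngrams.py code): build a set of every prefix of every dictionary word, then abandon any partially-built candidate whose head isn't a known prefix.

```python
def build_prefixes(words):
    # Every prefix of every dictionary word, including "".
    prefixes = set()
    for w in words:
        for i in range(len(w) + 1):
            prefixes.add(w[:i])
    return prefixes

def within_one_edit(word, words, prefixes):
    # Depth-first search over (built-so-far, remaining, edits-used),
    # pruning any branch whose head isn't in the prefix table.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    results = set()

    def go(head, tail, edits):
        if head not in prefixes:
            return                          # prune: no word starts this way
        if not tail and head in words:
            results.add(head)
        if edits < 1:
            for c in alphabet:
                go(head + c, tail, edits + 1)         # insert
        if tail:
            go(head + tail[0], tail[1:], edits)       # match (free)
            if edits < 1:
                go(head, tail[1:], edits + 1)         # delete
                for c in alphabet:
                    go(head + c, tail[1:], edits + 1) # substitute

    go("", word, 0)
    return results
```

The pruning does the same job as following edges in a trie; whether a hashtable of prefixes or a real trie wins is an implementation question, as you say.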

(It's descended from code I sent in response to
<http://norvig.com/spell-correct.html>, rewritten to return a dict of
candidates each paired with a
description of how it's different from the original word. There's a bug of
sorts in that the result set misses a very few candidates his first article's
code finds, unless you extend the edit-distance cutoff; I only noticed the
problem after I'd mailed him the code. I'm not sure if the OP's algorithm has
the same shortcoming -- I haven't read it closely.)

------
tom_pinckney
Careful what dictionary you use...my /usr/share/dict/words has things in it
that might not be polite/acceptable to suggest.

------
llimllib
<http://en.wikipedia.org/wiki/Patricia_tree> would be an improvement.

~~~
raffi
Possibly. I think it's the difference between a plain BST and an AVL or
Red/Black tree: more work for some gain, but ultimately the data has something
to say about whether it matters. I think this scheme would certainly balloon
the logic for generating edits for a word, which is the whole purpose of this
post.

------
yan
Do people pronounce trie as 'try' or 'tree'? I pronounce it as 'tree,' which
is why it took me a second to catch the pun.

~~~
bmm6o
The etymology suggests that you should pronounce it "tree". The argument for
pronouncing it "try" is to avoid creating a homophone with the other (more
popular) data structure. I would pronounce it "try", but honestly it doesn't
come up in conversation as much as I would like.

------
indigoviolet
I've used Burkhard-Keller trees to do something like this:

<http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>

I don't know off the top of my head which one is more efficient.
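For anyone unfamiliar, a BK-tree is small enough to sketch in a few lines. A minimal version using plain Levenshtein distance as the metric (a generic sketch, not the code from that article):

```python
def distance(a, b):
    # Plain Levenshtein edit distance (dynamic programming).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    # Each node keys its children by their distance to the node's
    # word; the triangle inequality lets queries skip whole subtrees.
    def __init__(self, word):
        self.word = word
        self.children = {}

    def add(self, word):
        d = distance(word, self.word)
        if d in self.children:
            self.children[d].add(word)
        else:
            self.children[d] = BKTree(word)

    def query(self, word, n):
        d = distance(word, self.word)
        matches = [self.word] if d <= n else []
        # Only children at distance d-n .. d+n can contain matches.
        for k in range(d - n, d + n + 1):
            if k in self.children:
                matches += self.children[k].query(word, n)
        return matches
```

Unlike the trie approach, it works with any metric (so Damerau or keyboard-weighted distances drop in), but each query still computes the full distance at every visited node.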

------
kristianp
What is the algorithmic complexity of your trie code compared to Norvig's
example corrector?

