

The Edit Distance Problem - Rod
http://rjlipton.wordpress.com/2009/03/22/bellman-dynamic-programming-and-edit-distance

======
Todd
I thought I would post a practical application, FWIW. I recently applied edit
distance to a problem I was facing on my free dating site. Being free, we
attract an inordinate number of scammers. I've been developing an arsenal of
tools to help identify and track suspects in order to delete their accounts
before they do too much damage. The most successful to date has used edit
distance.

Scammers tend to post identical or slightly altered messages to a large number
of users. As you might expect, this has the effect of quickly polluting the
valid ecosystem. The solution I employed was to delay sending messages between
users long enough to capture spammer patterns and compare the messages to one
another.

I use the Levenshtein distance algorithm (which has O(n^2) performance). If I
get a similarity over a certain threshold. I quarantine the message for
further analysis. To work around the performance, I bound the problem by
capping the number of messages used for analysis.

It has been incredibly effective.

------
carterschonwald
Dick Lipton's blog has a huge collection of really amazing kiddie pool tastes
and pointers within awesome nuggets of theoretical computer science, one of
his posts that I want to walk through carefully when i have the time is the
one on the quantum method, for proving various nifty algorithmic complexity
results etc.

[http://rjlipton.wordpress.com/2009/03/28/erds-and-the-
quantu...](http://rjlipton.wordpress.com/2009/03/28/erds-and-the-quantum-
method/)

sexy stuff

~~~
amichail
It is indeed amazing. But why put so much effort into a blog? Makes you wonder
whether this blog is intended primarily to attract more PhD students to the
field.

~~~
ced
It makes a lot more sense to publish scientific results on the web than on
paper. Hyperlinks, more graphs, more data, less costs, more feedback.
Scientists still publish in journals because of inertia and an antiquated
incentive structure (grant money is proportional to the number of published
papers, esp. in high-profile journals). I hope we'll eventually make the
transition, and possibly, so does the author of that blog.

------
mish4
What a coincidence! I have been looking at the edit distance problem, because
it can be applied to computer vision. One of the uses is if you have a stereo
image pair you can figure out the depth of the objects in the scene, by
matching corresponding features. The edit distance problem helps with figuring
out which pixels in one stereo image correspond to pixels in the other stereo
image.

I am hoping with a bit more effort I can understand the dynamic programming
side of it better.

------
amichail
For a bit-parallel approach to this problem, see:
[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.4...](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.4342)

------
keefe
I spent a year and a half as an undergraduate converting a custom C edit
distance implementation into Java for a computer assisted language instruction
system. This is a really nice article discussing the topic. I think the
interesting facet of dynamic programming is how it shows the dynamic between
space and time. In his creative book on algorithms, Udi Manber gives a nice
description of a general approach to designing algorithms with dynamic
programming.

~~~
ovi256
A year and a half ? How many SLOCs was that ?

~~~
keefe
I haven't a clue this was like 99-01, it wasn't just edit distance algorithm
but also all the UI code surrounding it and optimizing things for learning a
foreign language. Basically it was a quiz program in (e.g.) German which
compared the student's answer to the proper answer and then generated
suggestions when appropriate. We did a lot of tweaking of that code, iterating
through versions starting with an attempt to just port some C code and onward
- I was pretty green at that time and it wasn't a full time job, so it's not
like implementing edit distance took you know, 70 man weeks...

------
jrschulz
A completely different approach to the problem is to minimize the number of
necessary distance computation in the first place. There are various indexing
algorithms which make finding nearest neighbours feasible in large datasets.
Incidentally, I implemented two of them. :)

<http://well-adjusted.de/mspace.py>

For real usage, you would probably want to reimplement them with performance
in mind (and in a different language), though. My implementation is dog slow.
But on the plus side, it is very readable and heavily documented.

------
albertcardona
The edit distance and the algorithms to solve it have been of great use in my
research work in neuroscience. The places where I can apply them are as
surprising as 3D mesh reconstruction from a stack of outlines (
[http://www.blender.org/community/blender-
conference/blender-...](http://www.blender.org/community/blender-
conference/blender-conference-2006/proceedings/albert-cardona/) ), to neuronal
recognition (I'm currently finishing a paper on that; see some images here:
<http://albert.rierol.net/doodle_programming.html#11> ).

I know of other labs using edit distance to compare people's timelines (to
estimate cost of shifting from one lifestyle to another). And the most widely
use application of edit distances is in the comparison of any DNA/RNA/protein
sequence for similarity, both for identification purposes and for the study of
their evolution.

