

Damn Cool Algorithms: BK-Trees - VeXocide
http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees

======
lpolovets
Cool idea.

Both the article and other sites
(<https://issues.apache.org/jira/browse/LUCENE-2230> is an example) suggest
that 3-20x performance improvements are typical.

I've worked on this problem before, and it seems like there are also simpler
(and less memory intensive) ways of speeding up the computation. For example
you can use properties like string lengths (or character distributions within
strings) to get results that are just as good if not better. If you have a
corpus of 1 million strings and you want to find all strings within an edit
distance of 2 to a query, you can automatically eliminate any strings among
the corpus whose length differs by > 2 from the length of the query string.
Also, if you are willing to store 1-2 bytes of metadata about each string, you
can use that metadata to very quickly generate lower bounds on edit distances,
avoiding the O(N^2) distance computation whenever the lower bound is greater
than your max tolerable distance. A good example of metadata would be to store
(for each string in the corpus) the number of characters with ascii codes that
are 0 mod 5, 1 mod 5, ... 4 mod 5.
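The length-based pruning described above is a one-liner; a minimal sketch (function name is hypothetical):

```python
def candidates_by_length(corpus, query, max_dist):
    """Cheap pre-filter: |len(a) - len(b)| is a lower bound on the
    Levenshtein distance between a and b, so any corpus string whose
    length differs from the query's by more than max_dist can be
    discarded without computing the full distance."""
    qlen = len(query)
    return [s for s in corpus if abs(len(s) - qlen) <= max_dist]
```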

~~~
rubyrescue
What information about 'character distributions within strings' would be
useful?

The idea of Levenshtein distance as constituting a metric space is something
I've never thought of. I implemented a simple "Did You Mean?" using L.D. in
Ruby/C but didn't build this type of structure, and now that I've read this it
seems obvious.

edit: clarification

~~~
lpolovets
Let's say you plan to keep 2 bytes/16 bits of metadata for each string in the
corpus. You can use these 16 bits to store five 3-bit counters for the number
of characters in a string whose ascii values are 0, 1, ... 4 mod 5. Each
counter can go up to 7 (0b111), at which point further increments are no-ops
(so a string with 7 characters that have an ascii value of 0 mod 5 will have
the same value in the first counter as a string with 20 characters that have
an ascii value of 0 mod 5).

When you want to see if the Levenshtein distance between two strings is less
than some threshold (say 2), you first compare the two sets of 5 counters. The
difference in values for corresponding counters can help you determine a lower
bound on the Levenshtein distance. For example, if the first string has six
characters that are 1 mod 5, and the second string has two characters that are 1
mod 5, then your edit distance will be at least 2, so if you were only looking
for strings that are within a distance of 1, you no longer need to bother
computing Levenshtein for your two strings. Comparing counters is just a few
bit operations, so it's super fast.
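A rough sketch of the scheme described above (names hypothetical). The lower bound works because a single insert or delete changes one counter by at most 1, and a substitution moves one character between two residue classes, so each edit changes any individual counter by at most 1 and the sum of absolute differences by at most 2:

```python
def pack_counters(s):
    """Pack five saturating 3-bit counters (number of chars whose
    ASCII code is 0..4 mod 5) into a 16-bit integer."""
    counts = [0] * 5
    for ch in s:
        counts[ord(ch) % 5] += 1
    meta = 0
    for i, c in enumerate(counts):
        meta |= min(c, 7) << (3 * i)   # saturate at 7 (0b111)
    return meta

def lev_lower_bound(meta_a, meta_b):
    """Lower bound on Levenshtein distance from the packed counters.
    Each edit changes any single counter by at most 1 (so the max
    per-counter difference is a bound), and changes the sum of
    differences by at most 2 (so half the total is also a bound)."""
    diffs = []
    for i in range(5):
        a = (meta_a >> (3 * i)) & 7
        b = (meta_b >> (3 * i)) & 7
        diffs.append(abs(a - b))
    return max(max(diffs), (sum(diffs) + 1) // 2)
```

If the bound exceeds your threshold, you skip the full distance computation; saturation only ever shrinks the differences, so the bound stays valid.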

~~~
chronomex
That reminds me of the Boyer-Moore string search algorithm. One way to
implement B-M is with a bitmap of the characters in the query string, so that
regions of the search space that can't possibly match the query will be
skipped efficiently.
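A simplified sketch of that idea (not full Boyer-Moore, just the character-set skip it describes): if the last character of the current window appears nowhere in the pattern, no occurrence can overlap that position, so the whole window is skipped in one jump.

```python
def bitmap_search(text, pattern):
    """Boyer-Moore-flavored scan using a set ("bitmap") of the
    pattern's characters to skip unmatchable regions wholesale."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    chars = set(pattern)              # the "bitmap" of query characters
    i = 0
    while i + m <= n:
        if text[i + m - 1] not in chars:
            i += m                    # tail char can't match anywhere: big skip
        elif text[i:i + m] == pattern:
            return i
        else:
            i += 1
    return -1
```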

------
stromhold
The quick-select algorithm is a really simple and effective way of selecting
the k-th smallest element and one that would be nice to cover. It's especially
useful in situations where you need O(n) selection of the median element.
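For reference, quick-select in a few lines: partition around a random pivot and recurse only into the side containing index k, giving expected O(n) time.

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element (0-indexed) in expected
    O(n) time. Only the partition containing index k is explored."""
    xs = list(xs)
    while True:
        pivot = random.choice(xs)
        lo = [x for x in xs if x < pivot]
        eq = [x for x in xs if x == pivot]
        if k < len(lo):
            xs = lo                       # answer is left of the pivot
        elif k < len(lo) + len(eq):
            return pivot                  # pivot is the k-th smallest
        else:
            k -= len(lo) + len(eq)        # answer is right of the pivot
            xs = [x for x in xs if x > pivot]
```

`quickselect(xs, len(xs) // 2)` gives the median in expected linear time.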

~~~
eru
Yes. And you can unify quick-select and tree-sort (i.e. quicksort, but not in-
place) in a lazy language easily in the same algorithm.
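Python isn't lazy, but generators can approximate the unification: write quicksort as a generator and the right-hand partitions are never computed until someone consumes past their pivots, so pulling only the k-th item does roughly quick-select's amount of work.

```python
from itertools import islice

def lazy_sorted(xs):
    """Quicksort as a generator: partitions to the right of the
    consumption point are not materialized until needed."""
    if not xs:
        return
    pivot, rest = xs[0], xs[1:]
    yield from lazy_sorted([x for x in rest if x < pivot])
    yield pivot
    yield from lazy_sorted([x for x in rest if x >= pivot])

def kth_smallest(xs, k):
    """Selection falls out of the same code: consume k+1 items."""
    return next(islice(lazy_sorted(list(xs)), k, None))
```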

------
jbw
Quad trees would be a cool algorithm to cover.

~~~
varikin
There is one that includes quad trees.

[http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves](http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves)

I haven't read it yet though.

~~~
jbw
Nice, thanks.

I'll have to look around the site a bit more.

