

Fuzzy string search - Animus7
http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html

======
jphackworth
Another area where fuzzy string search is really useful is computational
biology. You often have a particular DNA or RNA sequence and you want to
figure out what it is. However, the version on file can be different by a few
mutations. For this sort of search where the database can be quite large,
there are specialized algorithms like BLAST that index by n-grams.

Just thought I'd mention it because if you find it interesting, computational
biology is a real growth area right now - the cost of sequencing genomes has
been dropping faster than Moore's law, and it's quite possible that within our
lifetimes we will be using this sort of technology to genetically engineer
people.

~~~
dhruvbird
Yes, its useful for alignment. However, I've noticed that such alignment
techniques are quite slow (since many of them resort to brute force on O(n^2)
on sufficiently large strings - a few thousand maybe).

I've found that a better way to approach this problem is to use suffix arrays
to generate the best overlaps (ranges) from the reference and then perform a
"range compression" (not of the audible kind) to coalesce adjacent ranges that
differ by very little and insert/delete the offending characters.

I'm using a similar technique for genome compression against a reference and
am able to get a 700x compression on 2 Korean genomes and 130x compression on
hg18(build 36.1) and another build/assembly(number 36.3) of the human
reference genome.

------
timtadh
I have been working on a Ternary Search Trie implementation for Python on and
off for a while. This is definitely a good read for some other methods for
fuzzy string search. The benefits of the TST (versus other Trie formulations)
is summarized on my blog.[1] I think I will do a follow up post covering their
characteristics in comparison to the methods Nikita covered in this post.

The author mentions at the end that he didn't consider Tries. However, I think
they are valuable data structure.

[1] [http://blog.hackthology.com/ternary-search-tries-for-fast-
fl...](http://blog.hackthology.com/ternary-search-tries-for-fast-flexible-
string)

------
rkalla
REF: <http://code.google.com/p/fuzzy-search-tools/>

For those that don't make it all the way through the article, the author has
implemented all the algorithms in Java and is available at his project page
(above).

------
leeoniya
i've used Damerau–Levenshtein + Jaro-Winkler for database record
matching/customer dedupe. these things _work magic_. found 2x as many matches
vs exact comparison.

------
devs1010
I'm currently using a Patricia trie in an open source project I'm working on,
still pretty new to trees / tries but now that I have begun to use them I can
see how widely used they are in a lot of modern software products

------
aquarin
I have implemented TST (ternary search tree) in C to create fast dictionary
(histogram) for counting words in Wikipedia. It is fast and memory efficient.

~~~
Cieplak
Any intentions of open-sourcing it?

------
olihb
We use aho-corasick to extract keywords from terabyte size corpus. Works
really well but you have to build your matching tree before.

