

Cool Algorithm - Fast text search using BWT - varun729
http://blog.avadis-ngs.com/2012/04/elegant-exact-string-match-using-bwt-2/

======
bazzargh
BWT is a really neat trick. I first came across it in Andrew Tridgell's thesis
on rsync, which is worth a read
(<http://www.samba.org/~tridge/phd_thesis.pdf>). I changed PMD's copy-paste
detector (CPD) to use it, which at the time was a massive improvement over its
brute-force approach:
[http://onjava.com/pub/a/onjava/2003/03/12/pmd_cpd.html?page=...](http://onjava.com/pub/a/onjava/2003/03/12/pmd_cpd.html?page=last&x-maxdepth=0)
Fairly obviously, the sorted rotations that BWT builds let you simply read off
duplicates, because repeated runs end up next to each other; I was using
rotations of tokens, not characters.

CPD now uses Rabin-Karp searching, which is faster still. However, writing a
copy-paste detector with BWT is fairly trivial and I still keep that script in
my head for languages CPD can't handle.
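
A minimal sketch of that idea, not the actual CPD code: sort all rotations of
the token list (as the BWT matrix does) and compare neighbours in sorted
order. The function name `find_duplicates`, the `min_len` parameter and the
empty-string sentinel are assumptions made for illustration; the sort is the
naive one, so this only suits small inputs.

```python
def find_duplicates(tokens, min_len=3):
    """Report repeated token runs as (first_start, second_start, length)."""
    # Assumption: tokens are non-empty strings, so "" is a unique sentinel
    # that sorts first and stops matches from wrapping around the end.
    seq = list(tokens) + [""]
    n = len(seq)

    def rotation(i):
        return seq[i:] + seq[:i]

    # Naive sort of all rotations; a real tool would build a suffix array.
    order = sorted(range(n), key=rotation)

    found = []
    for a, b in zip(order, order[1:]):
        # Length of the common prefix of two adjacent rotations.
        k = 0
        while k < n and seq[(a + k) % n] == seq[(b + k) % n]:
            k += 1
        # Skip runs that could be extended to the left; the longer run
        # is reported by another pair of neighbours.
        left_extendable = seq[a - 1] == seq[b - 1]
        if k >= min_len and not left_extendable:
            found.append((min(a, b), max(a, b), k))
    return found


tokens = "a = b + c ; x = 1 ; a = b + c ; y = 2 ;".split()
print(find_duplicates(tokens))
```

Only neighbouring rotations are compared, so a run that occurs three or more
times shows up as several pairs rather than one group.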

------
WilhelmJ
As far as I can tell, the author is talking about the FM-index. It compresses
the search data into an index with a much smaller memory footprint. I tried
using it a few times, but never figured out how to use it as a key-value data
store. If anybody is interested, here is the code:
[http://pizzachili.di.unipi.it/indexes/FM-indexV2/fmindexV2.t...](http://pizzachili.di.unipi.it/indexes/FM-indexV2/fmindexV2.tgz)

~~~
superbobry
I guess the FM-index is just not the right thing to use when you need a
key-value data store. It's a _full text_ index: a data structure that allows
_fast_ substring queries over a _fixed_ text corpus.
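
To make "substring queries over a fixed text" concrete, here is a rough,
uncompressed sketch of FM-index style backward search. It is not the Pizza&Chili
implementation: the names `bwt_index` and `find` are made up for this example,
the suffix array is built naively, and the occurrence table is stored in full
rather than sampled, so none of the space savings appear here.

```python
from collections import Counter


def bwt_index(text):
    """Build (suffix array, C table, occurrence table) for a fixed text."""
    text += "\0"  # assumption: the text itself contains no NUL character
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix array
    last = [text[i - 1] for i in sa]  # BWT: character before each suffix

    # C[c] = number of characters in the text strictly smaller than c.
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in last[:i].
    occ = {c: [0] * (n + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def find(pattern, index):
    """Return sorted start positions of pattern, matching it back to front."""
    sa, C, occ = index
    lo, hi = 0, len(sa)
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = bwt_index("abracadabra")
print(find("abra", index))  # [0, 7]
```

The index is built once for the whole text; any change to the text means
rebuilding it, which is the "fixed corpus" part.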

~~~
arethuza
Perhaps it might make sense if you want to tag substrings of the text with
stored data?

~~~
superbobry
Yup, that might work, but it's still a weird idea for a key-value store;
maybe a DAWG or a radix tree would do better.
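
For comparison, a small sketch of the key-value side. This is a plain
character trie, the uncompressed relative of a radix tree (a radix tree would
merge single-child chains into one edge), and the class and method names are
invented for the example.

```python
class Trie:
    """Plain character trie used as a string-keyed map."""

    def __init__(self):
        self.children = {}
        self.value = None
        self.has_value = False

    def put(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value, node.has_value = value, True

    def get(self, key, default=None):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return default
        return node.value if node.has_value else default

    def keys_with_prefix(self, prefix):
        # Walk down to the node for the prefix, then collect stored keys.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        out, stack = [], [(prefix, node)]
        while stack:
            key, cur = stack.pop()
            if cur.has_value:
                out.append(key)
            for ch, child in cur.children.items():
                stack.append((key + ch, child))
        return sorted(out)


store = Trie()
store.put("bwt", 1)
store.put("bwa", 2)
store.put("fm", 3)
print(store.get("bwt"), store.keys_with_prefix("bw"))
```

Unlike the FM-index, this supports inserts at any time, but it only answers
exact-key and prefix lookups, not arbitrary substring queries.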

------
dunham
Also see Alex Bowe's blog for a description of this:

    http://www.alexbowe.com/tag/datastructures

He describes how you can store the FM-index in less space than the original
text.

------
techfiltered
This is awesome!

