
New string search algorithm - akaus
http://volnitsky.com/project/str_search/
======
dasht
Interesting.

Note that the worst case of complexity for this algorithm is much, much worse
than the worst case complexity for Boyer Moore. Do not use this algorithm
carelessly. For example, if you use it in a thoughtless way in your web
server, you may open yourself to a DoS attack.

Note that the author nicely characterizes it as of potential use for small
alphabets and possibly multiple substrings (in a single search). That
immediately made me think he might have devised it for genomics research. In
most applications I would think you'd also want regexp features.
Interestingly, DNA research and use in a regexp engine is something he goes on
to suggest. (If you are searching for a very large number of regexps in a big
genome database, I would not use this algorithm. I found that some simple
variants on classic NFA techniques work very well for a wide class of typical
regexps (e.g., regexps modeling SNPs, small read position errors, small
numbers of read errors, etc.). There probably isn't any one obviously right
answer, though, and a lot depends on your particular hardware situation, data
set sizes, etc.).

The HN headline is very bogus hype. "X2 times faster than Boyer-Moore" is far
from true in the general case. "breakthrough" is a gross exaggeration: this is
a technique that anyone with a good algorithms course or two under the belt
should be able to think of and, for most applications, decide not to use
because of the limitations of the thing. I can definitely see it being nice
for some applications tolerant of its limitations but... breakthrough it
ain't.

~~~
thisisnotmyname
For sequence alignment, the state of the art is BWA, which first compresses
the "haystack", then builds a trie.

See <http://en.wikipedia.org/wiki/Burrows–Wheeler_transform> or
[http://bioinformatics.oxfordjournals.org/content/early/2009/...](http://bioinformatics.oxfordjournals.org/content/early/2009/05/18/bioinformatics.btp324.full.pdf)

~~~
dasht
Thanks. That's interesting. I'm pretty confident that you don't really need to
compress the reference that way. It doesn't make a lot of intuitive sense that
you would, in a way: streaming over the reference can be very fast and the
question is how many reads you can align per pass, how flexibly, and with how
low a pre-processing cost. I think my stuff (which doesn't count since we
didn't get to publication stage for other reasons, etc.) was probably faster
and more flexible.

~~~
epistasis
You'd be pretty confident in the other direction once you understand the
problem better. The BWT is not compression as much as an extremely clever way
of rearranging the haystack. There are many, many different string alignment
(not string search) algorithms that are useful with DNA, and where the BWT is
used, your algorithm is not going to be in the realm of useful. BWT-based
aligners run in time completely independent of the haystack length. When you
have 2 billion needles of size 50-200, and the haystack is 3 billion long, it
makes a ton of sense to pay the preprocessing cost of O(n lg n), since it only
has to be done on the order of once a year.
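
To make the "rearranging" point concrete, here is a toy version of the BWT
(the naive sort-all-rotations construction; real aligners like BWA and Bowtie
build it via a suffix array rather than materializing the rotations):

    # Toy Burrows-Wheeler transform: sort every rotation of the text and
    # keep the last column. Illustrative only -- quadratic memory this way.
    def bwt(text, sentinel="$"):
        s = text + sentinel                    # sentinel marks the end of the text
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    print(bwt("GATTACA"))   # last column of the sorted rotation matrix

Once you have that rearrangement (plus rank/count indexes over it), exact
matching costs a couple of table lookups per needle character, which is why
the query time doesn't depend on the 3-billion-long haystack.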

~~~
dasht
Given a (not even very large) cluster, we (with the fairly brute force-ish but
cleverly tuned NFA) can populate hash tables for the read-based NFA of your 2
billion needles just about as fast as you can transmit the data and give you
optimal alignment (by flexible criteria) just about as fast as you can then
stream through the 3 billion base pairs. We are down to, at most, single or
double digit dollar differences either way, but mine is vastly simpler and
easier to tweak. Probably, we can do my way cheaper as the commodity computing
market improves.

If you want to tell me that BWT is more useful for something like searching an
FBI or CIA DNA database, and that intel types want to encourage commercial
development of BWT (even if irrational for the customers) to subsidize its
covert use in intel -- that I can believe. I can see where it would be helpful
not for resequencing (a way to read off interesting details of an individual's
genotype based on a lot of reads) but rather for fingerprinting of suspects
and surveillance targets against a large database of reference genomes.

~~~
atjj
The Burrows-Wheeler transform is not just an academic idea, but is the basis
for one of the standard tools for aligning short reads: <http://bowtie-
bio.sourceforge.net/index.shtml> . There was a mention of compression earlier
in the thread, but this is a red herring: the BWT is also used in bzip2's
compression, but its use with DNA sequences is not to compress, but as a pre-
processing step on the haystack. As epistasis said, when you have billions of
needles, it makes sense to pre-process the haystack so that the per-needle
cost is as low as possible.

As for the FBI's DNA database, this does not consist of sequence data, but, I
believe, just microsatellite data. Even if they had full sequence data, it
wouldn't make sense to search for new samples against the whole database, but
to align the sample with a reference genome and then match the variations
across a database of variations.

~~~
dasht
I get that it's the standard, but see my comment above. If you can come close
to I/O-saturating the input of read data at only the cost of streaming the
reference near I/O saturation, that's optimal -- and I don't think you need
the hair of BWT to do that (and the BWT memory cost might even hurt you).

It's different for an imagined database where you are trying to align reads
against a lot of personal genomes, e.g., to identify someone given a pretty
complete genomic database of the usual suspects plus a few reads from the
crime scene -- more an around-the-corner "GATTACA" scenario than today's FBI,
perhaps. There, perhaps, it's economical to dedicate one server to every N
"usual suspects", loading the server with the BWT of the suspect database,
then stream reads out to the servers rather than loading servers with a set
of reads and streaming a single-human reference over that. (Of course, maybe
you could just align the crime scene reads against a single human reference
and then .... So even in GATTACA land, use of BWT seems like dubious hair
against a lightweight opponent like MAQ.)

------
mayank
No DBLP profile for the author, no proofs on site, fastest algorithm known "to
_me_ " qualifier, no results for "suffix tree" on page, not a good sign.

EDIT: Am I missing something???

Complexity analysis according to the author: m = length of the search term,
n = length of the text.

O(m) preprocessing -- that's right, O(search term). And O(n*m) worst-case
query string search, so the worst case traverses the whole text.

Now compare that to suffix trees:

O(n) preprocessing, O(m) string search,

where the worst-case complexity is linear in the _search term_.

~~~
kragen
He's a Russian hacker. That's where awesome new algorithms come from these
days, including things like Dual-Pivot Quicksort. (And QuickLZ may have been
invented by some Scandinavian guy, but I'm pretty sure it was announced on
encode.ru.) He's not an academic, but that doesn't mean he can't do competent
algorithmic analysis.

I concur with the other commenters that it's silly for you to complain about
his not comparing his online string-search algorithm against offline string-
search algorithms that search an index of the text, such as suffix-tree
algorithms.

~~~
cperciva
Dual-Pivot Quicksort is an "awesome new algorithm"? Hardly. It's a very small
step in the direction of Samplesort -- which is asymptotically optimal and has
been around for four decades.

Dual-Pivot Quicksort is a demonstration that someone didn't read the existing
literature; nothing more.

~~~
kragen
Unfortunately I don't have access to the Samplesort paper, so I don't know
which sense of "asymptotically optimal" you're using here. Quicksort is
already asymptotically optimal in the sense that its asymptotic average
performance is Θ(N log N), which is the best a comparison-based sorting
algorithm can do.

It's true that dual-pivot quicksort is a step in the direction of Samplesort.
(I _do_ understand Samplesort well enough to say that.) That doesn't mean it's
not a worthwhile contribution in its own right. Jon Bentley (advisor of Brian
Reid, Ousterhout, Josh Bloch, and Gosling while at CMU; later at Bell Labs in
its glory days) was quoted as having a substantially different opinion from
yours:

[http://permalink.gmane.org/gmane.comp.java.openjdk.core-
libs...](http://permalink.gmane.org/gmane.comp.java.openjdk.core-
libs.devel/2628)

> I think that Vladimir's contributions to Quicksort go way beyond
> anything that I've ever done, and rank up there with Hoare's original
> design and Sedgewick's analysis. I feel so privileged to play a very,
> very minor role in helping Vladimir with the most excellent work!

I am not sure Bentley will be persuaded by your claim that he didn't read the
existing literature.

~~~
anatoly
When that thread came up originally, I was intrigued by the idea of the dual-
pivot quicksort, and went on to browse Robert Sedgewick's Ph.D. thesis,
_Quicksort_, published in 1978. Sedgewick considers and analyzes several
variants of quicksort in his thesis, and I quickly discovered that he
analyzes and dismisses the dual-pivot variant under a different name. I don't
know whose analysis is better, but it does follow that we shouldn't treat
dual-pivot quicksort as an original invention (not to detract from Vladimir
Yaroslavskiy's undoubted ingenuity).

Here's an excerpt from the email I wrote to Prof. Sedgewick back then; he
didn't reply, so I've no idea what he thinks about this:

"I believe that your PhD thesis studies this under the name of "double
partition modification". I could find no real difference between the two
algorithms. Yaroslavsky claims to achieve a factor of 0.8 improvement over
the standard quicksort on the number of exchanges (neutral on comparisons),
while your analysis led you to reject the technique after concluding it
required many more comparisons than the standard quicksort. I've only
attempted very cursorily to reconcile the math; it seems that both your thesis
and Yaroslavsky arrive at 0.8n(log n) as the average number of swaps, but you
have (1/3) n(log n) as the average number of swaps in the classic algorithm
while Yaroslavsky has 1.0 n(log n) for the same thing, so what he sees as an
improvement you viewed as a disaster. My interpretation of this may be wrong,
and I haven't attempted to verify either analysis yet."

The "classic algorithm" in the above is the classic quicksort as presented by
Sedgewick in the thesis, which is slightly different from, e.g., the standard
Java implementation. I lost interest soon thereafter and stopped investigating
this, but it shouldn't be too difficult to find out who was right in that
algorithm comparison, if someone's interested enough.
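
For anyone who wants to try, here's a minimal sketch of the dual-pivot
partitioning under discussion (my reading of Yaroslavskiy's scheme, without
any of the engineering in the JDK version); instrumenting it with swap and
comparison counters would be one way to check the two analyses:

    # Dual-pivot quicksort sketch: two pivots p <= q split the array into
    # three parts (< p, between p and q, > q), then recurse on each part.
    def dual_pivot_quicksort(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if lo >= hi:
            return
        if a[lo] > a[hi]:
            a[lo], a[hi] = a[hi], a[lo]
        p, q = a[lo], a[hi]
        lt, gt, i = lo + 1, hi - 1, lo + 1
        while i <= gt:
            if a[i] < p:
                a[i], a[lt] = a[lt], a[i]
                lt += 1
            elif a[i] > q:
                while a[gt] > q and i < gt:
                    gt -= 1
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
                if a[i] < p:
                    a[i], a[lt] = a[lt], a[i]
                    lt += 1
            i += 1
        lt -= 1
        gt += 1
        a[lo], a[lt] = a[lt], a[lo]       # put the pivots into place
        a[hi], a[gt] = a[gt], a[hi]
        dual_pivot_quicksort(a, lo, lt - 1)
        dual_pivot_quicksort(a, lt + 1, gt - 1)
        dual_pivot_quicksort(a, gt + 1, hi)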

~~~
cperciva
Times change. The relative cost of a compare is considerably higher now than
it was in 1978, thanks to the cost of branch (mis)predictions.

It's entirely possible that a modification which was a significant loss in
1978 is a significant gain in 2008.

~~~
kragen
Yes, but the number of comparisons and the number of swaps wouldn't change.
Sedgewick and Yaroslavskiy can't both be right, unless anatoly misread
something.

~~~
cperciva
Oops, I misread what he wrote.

------
matt4711
Pattern matching performance also depends on the alphabet size of the text. In
his experiment he doesn't report the alphabet size of the text nor does he
provide results for different text collections.

The algorithm itself looks very similar to the one used in agrep proposed by
Wu and Manber [1].
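
For context, here is a rough, simplified sketch of the Wu-Manber idea
(block-based bad-character shifts over multiple patterns); parameter names
and structure are illustrative, not taken from the agrep source:

    def wu_manber(patterns, text, B=2):
        """Report (pattern, position) hits for all patterns in text."""
        m = min(len(p) for p in patterns)       # length of the shortest pattern
        assert m >= B, "patterns must be at least B characters long"
        default = m - B + 1

        shift = {}                              # block -> how far we may skip
        bucket = {}                             # block -> patterns ending in it
        for p in patterns:
            for q in range(B, m + 1):           # blocks inside the first m chars
                block = p[q - B:q]
                s = m - q                       # 0 when the block ends the window
                shift[block] = min(shift.get(block, default), s)
                if s == 0:
                    bucket.setdefault(block, []).append(p)

        hits = []
        i = m - 1                               # index of the window's last char
        while i < len(text):
            block = text[i - B + 1:i + 1]
            s = shift.get(block, default)
            if s:
                i += s
                continue
            start = i - m + 1
            for p in bucket.get(block, []):     # verify candidates in full
                if text.startswith(p, start):
                    hits.append((p, start))
            i += 1
        return hits

Wu-Manber uses the block hash to drive a Boyer-Moore-style skip, which is the
similarity the parent comment is pointing at.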

I also found the book "Flexible Pattern Matching in Strings" to be a very good
reference on all things related to pattern matching [2].

[1] S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report
TR-94-17, Department of Computer Science, University of Arizona, 1994.

[2] [http://www.amazon.com/Flexible-Pattern-Matching-Strings-
Line...](http://www.amazon.com/Flexible-Pattern-Matching-Strings-
Line/dp/0521039932)

~~~
dasht
He talks a bit about how to pick the right number of successive letters to use
as hash keys - which is where you can get a handle on alphabet sizes. I would
guess (maybe it actually says) that the Wikipedia dump in the benchmark was
UTF-8 or ASCII and, either way, treated as an alphabet of 8-bit characters.
The DNA case is kind of interesting (2 bits min but more likely 3 or 4 in a
typical genome record).

~~~
matt4711
An illustration from the book I cited above showing the importance of the
alphabet size (y-axis) and the pattern length (x-axis):

<http://i.imgur.com/KGOZW.jpg>

In the experiment he used patterns of different length on the same text
collection. As you can see in the graph, different algorithms perform best for
a certain alphabet size.

He describes the text collection as "text corpus taken from wikipedia text
dump" so I'm guessing the alphabet size is around 90?

It's also probably not a good thing that all the strings he is searching for
are prefixes of the same pattern.

References:

Shift-Or: [http://www-igm.univ-
mlv.fr/~lecroq/string/node6.html#SECTION...](http://www-igm.univ-
mlv.fr/~lecroq/string/node6.html#SECTION0060)

BNDM: [http://www-igm.univ-
mlv.fr/~lecroq/string/bndm.html#SECTION0...](http://www-igm.univ-
mlv.fr/~lecroq/string/bndm.html#SECTION00300)

BOM: [http://www-igm.univ-
mlv.fr/~lecroq/string/bom.html#SECTION00...](http://www-igm.univ-
mlv.fr/~lecroq/string/bom.html#SECTION00245)

------
bnoordhuis
The author states that preprocessing takes O(m) time but that is on average.

A quick review of the code makes me think that its worst case is actually on
the order of O((s * (s + 1)) / 2), where s = m / 2.

The Achilles heel is the hash function. It's trivial to create collisions and
have the insertion time for word w turn from O(1) to O(w).
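
A generic illustration of that failure mode (not the author's actual hash or
table code): if every word of the needle lands in the same bucket and each
insert walks the chain before appending, preprocessing degrades from O(m) to
about s*(s+1)/2 steps.

    # Adversarial chained-hash insertions: the i-th insert scans i-1 existing
    # entries plus one append, so s inserts cost 1 + 2 + ... + s in total.
    def insert_with_chain_scan(table, key, value, hash_fn):
        bucket = table.setdefault(hash_fn(key), [])
        steps = len(bucket) + 1          # walk the existing chain, then append
        bucket.append((key, value))
        return steps

    bad_hash = lambda word: 0                        # everything collides
    needle = "".join(chr(c) for c in range(128))     # 64 distinct two-byte words
    table, total = {}, 0
    for i in range(0, len(needle) - 1, 2):
        total += insert_with_chain_scan(table, needle[i:i + 2], i, bad_hash)
    s = len(needle) // 2
    assert total == s * (s + 1) // 2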

~~~
martincmartin
Um, O((s * (s + 1)) / 2) = O(m^2). Quadratic, not exponential.

~~~
bnoordhuis
Sorry, I updated my comment just as you posted yours.

But - and I don't want to sound pedantic - how is m^2 not exponential growth?

Edit: mea culpa guys, I carelessly translated from Dutch. You're all right:
quadratic, not exponential growth.

~~~
bigiain
Because 2^m and m^2 are very different...

Using terminology like "exponential" and "quadratic" correctly in a discussion
of algorithms is not pedantic...

------
yhlasx
If I am gonna need a string search algorithm for something serious, I would
definitely use KMP (Knuth-Morris-Pratt): linear worst-case complexity
(wouldn't risk it).
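
For reference, a minimal textbook KMP (nothing here is specific to the
submission; it's just the standard failure-table version, O(n + m) worst case):

    def kmp_search(needle, haystack):
        """Return the index of the first occurrence of needle, or -1."""
        if not needle:
            return 0
        # fail[i] = length of the longest proper prefix of needle[:i+1]
        # that is also a suffix of it
        fail = [0] * len(needle)
        k = 0
        for i in range(1, len(needle)):
            while k and needle[i] != needle[k]:
                k = fail[k - 1]
            if needle[i] == needle[k]:
                k += 1
            fail[i] = k

        k = 0
        for i, ch in enumerate(haystack):
            while k and ch != needle[k]:
                k = fail[k - 1]
            if ch == needle[k]:
                k += 1
            if k == len(needle):
                return i - k + 1         # start of the first match
        return -1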

------
zitterbewegung
This seems like a very practical website about the algorithm, but where are
the theory and the proofs of the algorithm's time complexity??

~~~
dasht
I don't mean to be a turd, but the proofs are kind of obvious on the face of
this one. He's claiming expected linear time in the string being searched for
"natural" texts and worst case O(M*N). Proof of the worst case is pretty
trivial by construction (of examples of that complexity) and contradiction
(showing that no worse case exists). One can't be casual about proofs, of
course, but: try thinking of an O(M*N) example and then you can probably see
from there why you can't do worse than that. Hint: if you can construct an
example where you have to do the length-M check for nearly every position in
the length-N haystack, aaaaaaaah... hmmmmm...., the rest should be clear.
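
Concretely, something along these lines (sizes kept small so it runs quickly;
this demonstrates the generic verify-at-almost-every-position blowup, not the
author's exact code path):

    # Adversarial case: the needle almost matches at every position, so any
    # search that falls back to a full length-M verification per position
    # does on the order of M*N character comparisons.
    M, N = 100, 50_000
    needle   = "a" * (M - 1) + "b"       # matches M-1 characters everywhere...
    haystack = "a" * N                   # ...but never actually occurs

    def verify_everywhere(needle, haystack):
        comparisons = 0
        for start in range(len(haystack) - len(needle) + 1):
            for j, ch in enumerate(needle):
                comparisons += 1
                if haystack[start + j] != ch:
                    break
        return comparisons

    print(verify_everywhere(needle, haystack))   # close to M * N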

------
tansey
From the site:

>Preprocessing phase in O(M) space and time complexity. Searching phase
average O(N) time complexity and O(N*M) worst case complexity.

I don't trust the analysis of someone referring to "average O(N) time"; Big O
notation refers to boundary times.

Edit: Okay, based on arguments here and on [1], I'm going to accept that maybe
he's just bastardizing the notation.

[1] [http://stackoverflow.com/questions/3905355/meaning-of-
averag...](http://stackoverflow.com/questions/3905355/meaning-of-average-
complexity-when-using-big-o-notation)

~~~
mayank
> Big O notation refers to boundary times.

No it doesn't. You can have an O(N) amortized time. Big-O is a bounding
function up to a constant factor, not necessarily a boundary (as in worst-
case) time.

<http://en.wikipedia.org/wiki/Amortized_analysis>

~~~
pjscott
To say that something runs in amortized O(n) time guarantees an upper bound on
the average time per operation in a worst-case sequence of operations. It does
_not_ deal with average-case time on random or typical data.

~~~
mayank
I didn't say it did. On the other hand, unless the author of the algorithm is
really clueless (edit: or knowingly making a probabilistic statement), I'm
sure he meant amortized time.

------
aristus
Skeptical but excited. Will definitely be studying this at the weekend. I had
been working on a long writeup on string matching but stopped the project for
lack of recent progress.

------
b0b0b0b
It seems that his algorithm is faster because it exploits the model of
computation (memory-aligned accesses and multi-byte operations). He gets up to
a constant factor more comparisons for free.
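
My rough reading of the general idea, as a sketch (this is the sampled-word
scheme as I understand it from the page, not the author's code): index the
2-byte words of the needle, sample the haystack at a stride of about the
needle's length, and only verify in full when a sampled word belongs to the
needle.

    def word_skip_search(needle: bytes, haystack: bytes, W: int = 2):
        """Find all occurrences of needle by sampling W-byte words."""
        m = len(needle)
        if m < W:
            raise ValueError("needle shorter than the word size")
        # every W-byte word of the needle, with the offsets where it occurs
        words = {}
        for i in range(m - W + 1):
            words.setdefault(needle[i:i + W], []).append(i)

        step = m - W + 1          # any occurrence must contain a sampled word
        hits = []
        for pos in range(m - W, len(haystack) - W + 1, step):
            w = haystack[pos:pos + W]
            for off in words.get(w, ()):
                start = pos - off
                if 0 <= start <= len(haystack) - m and \
                        haystack[start:start + m] == needle:
                    hits.append(start)
        return hits

In C the dictionary lookup becomes a single table access keyed by a 16-bit
value read straight out of the haystack, which is where the aligned,
multi-byte memory accesses turn into the constant-factor win.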

