Hacker News
New string search algorithm (volnitsky.com)
181 points by akaus on Feb 12, 2011 | 59 comments


Note that the worst-case complexity of this algorithm is much, much worse than the worst-case complexity of Boyer-Moore. Do not use this algorithm carelessly. For example, if you use it in a thoughtless way in your web server, you may open yourself to a DoS attack.

Note that the author nicely characterizes it as of potential use for small alphabets and possibly multiple substrings (in a single search). That immediately made me think he might have devised it for genomics research. In most applications I would think you'd also want regexp features. Interestingly, DNA research and use in a regexp engine is something he goes on to suggest. (If you are searching for a very large number of regexps in a big genome database, I would not use this algorithm. I found that some simple variants on classic NFA techniques work very well for a wide class of typical regexps (e.g., regexps modeling SNPs, small read position errors, small numbers of read errors, etc.). There probably isn't any one obviously right answer, though, and a lot depends on your particular hardware situation, data set sizes, etc.)

The HN headline is very bogus hype. "X2 times faster than Boyer-Moore" is far from true in the general case. "Breakthrough" is a gross exaggeration: this is a technique that anyone with a good algorithms course or two under the belt should be able to think of and, for most applications, decide not to use because of its limitations. I can definitely see it being nice for some applications tolerant of its limitations but... breakthrough it ain't.

For sequence alignment, the state of the art is BWA, which first compresses the "haystack", then builds a trie.

See http://en.wikipedia.org/wiki/Burrows–Wheeler_transform or http://bioinformatics.oxfordjournals.org/content/early/2009/...

Thanks. That's interesting. I'm pretty confident that you don't really need to compress the reference that way. It doesn't make a lot of intuitive sense that you would, in a way: streaming over the reference can be very fast and the question is how many reads you can align per pass, how flexibly, and with how low a pre-processing cost. I think my stuff (which doesn't count since we didn't get to publication stage for other reasons, etc.) was probably faster and more flexible.

You'd be pretty confident in the other direction once you understood the problem better. The BWT is not compression so much as an extremely clever way of rearranging the haystack. There are many, many different string alignment (not string search) algorithms that are useful with DNA, and where the BWT is used your algorithm is not going to be in the realm of useful. BWT-based aligners run in time completely independent of the haystack length. When you have 2 billion needles of size 50-200 and the haystack is 3 billion long, it makes a ton of sense to pay the preprocessing cost of O(n lg n), since it only has to be done on the order of once a year.

Given a (not even very large) cluster, we (with the fairly brute force-ish but cleverly tuned NFA) can populate hash tables for the read-based NFA of your 2 billion needles just about as fast as you can transmit the data and give you optimal alignment (by flexible criteria) just about as fast as you can then stream through the 3 billion base pairs. We are down to, at most, single or double digit dollar differences either way, but mine is vastly simpler and easier to tweak. Probably, we can do my way cheaper as the commodity computing market improves.

If you want to tell me that BWT is more useful for something like searching an FBI or CIA DNA database, and that intel types want to encourage commercial development of BWT (even if irrational for the customers) to subsidize its covert use in intel -- that I can believe. I can see where it would be helpful not for resequencing (a way to read off interesting details of an individuals genotype based on a lot of reads) but rather for fingerprinting of suspects and surveillance targets against a large database of reference genomes.

The Burrows-Wheeler transform is not just an academic idea; it is the basis for one of the standard tools for aligning short reads: http://bowtie-bio.sourceforge.net/index.shtml . There was a mention of compression earlier in the thread, but this is a red herring: the BWT is also used in bzip2's compression, but with DNA sequences it is not a compression step, but a pre-processing step on the haystack. As epistasis said, when you have billions of needles, it makes sense to pre-process the haystack so that the per-needle cost is as low as possible.
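For intuition, the transform itself is easy to sketch: sort all rotations of the string and read off the last column. (A naive illustration; real aligners construct it via suffix arrays rather than materializing rotations.)

```python
def bwt(s):
    # Naive Burrows-Wheeler transform: append a sentinel, sort all
    # rotations, and take the last character of each. Characters that
    # precede similar contexts end up adjacent, which is what makes
    # the rearranged haystack so friendly to indexing and compression.
    s = s + "$"  # sentinel: unique and lexicographically smallest
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("banana"))  # → "annb$aa"
```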

As for the FBI's DNA database, this does not consist of sequence data, but, I believe, just microsatellite data. Even if they had full sequence data, it wouldn't make sense to search for new samples against the whole database, but to align the sample with a reference genome and then match the variations across a database of variations.

I get that it's the standard, but see my comment above. If you can come close to I/O-saturating the input of read data at only the cost of streaming the reference near I/O saturation, that's optimal -- and I don't think you need the hair of BWT to do that (and the BWT memory cost might even hurt you).

It's different for an imagined database where you are trying to align reads against a lot of personal genomes, e.g., to identify someone given a pretty complete genomic database of the usual suspects plus a few reads from the crime scene -- more an around-the-corner "GATTACA" scenario than today's FBI, perhaps. There, perhaps, it's economical to dedicate one server to every N "usual suspects", loading the server with the BWT of the suspect database, then streaming reads out to the servers rather than loading servers with a set of reads and streaming a single-human reference over that. (Of course, maybe you could just align the crime-scene reads against a single human reference and then .... So even in GATTACA land the use of BWT seems like dubious hair against a lightweight opponent like MAQ.)

You have just been shown an amazing invention: the BWT. Rather than dismiss it and the entire field of performance obsessed people that find it more useful than any hash based algorithm, perhaps you should study it. The speculation in your comment is laughable, from a cluster costing double digit dollar difference over a single machine, to sequence data being used in forensic DNA. Sure, you throw around some introductory jargon that may fool an outsider into thinking you know what you're talking about, and that's why I'm bothering to comment so that no one is deceived.

BWT =~ suffix tree with far far greater memory efficiency

Please relax a bit, geeze! For resequencing, you have to input all of the reads, and this should be a significantly larger number of bytes than the number of bytes in the unindexed reference. So let's look at the I/O costs of inputting the read data.

Essentially, any alignment technique that can come close enough to I/O saturating reading both the dominant factor (the reads) and the minor factor (the reference) is optimal.

When (for a slightly different kind of gapped, error-prone read) I wrote alignment software, I also beat MAQ by 10-20X, like the BWA algorithm but without using anything like the BWT hair. I contemplated how the gapping rules and error rules looked in a brute force NFA translation. I found simple ways to optimize that NFA very cheaply. I found that it paid to maximize read-data throughput by devoting all of memory to the NFA built from as many reads as possible. It was not hard, this way, to get in the ballpark (small enough constant factor) of I/O bounding read processing (where "not hard" means something like 3-6 months of staring at walls and scribbling stuff on paper until I found some solutions that were simple enough it is only bad luck I didn't find them right off the bat). Building a prefix or suffix tree or any other kind of index to the reference was the approach my boss suggested at the start and that I explicitly set out to beat (and did!). Streaming the reference is not the bottleneck if you can do it in an I/O bound way because (for resequencing, at least), the size of the read data is much larger and you can do alignment while coming within a small constant of I/O saturating the input of that read data using a simpler NFA approach.

Looking at the description of MAQ, besting its performance by any number of means is not so surprising. Reaching for the hair of using BWT does surprise me.

> Note that the worst-case complexity of this algorithm is much, much worse than the worst-case complexity of Boyer-Moore.

Can you explain your reasoning for why the worst-case complexity is worse than the other's? Did you do any measurements? I believe you are incorrect. I am the author of this page. I just did a quick test for SS = "aaa ...aaaBaaa...aaa", with SS_size=240. My algorithm is faster than 3 of the 4 BM implementations tested.

> Do not use this algorithm carelessly. For example, if you use it in a thoughtless way in your web server, you may open yourself to a DoS attack.

The same can be said about all the BM variants, naive search, and BSD's memmem/strstr. The possibility of DoS for substring search algorithms has been known for a long time, but it never materialized. The cure is trivial: limit the substring size.

> That immediately made me think he might have devised it for genomics research.

No, it wasn't. It was devised when I was preparing for a Google interview (which I failed). And of course, algorithms which pre-index the haystack will always be faster. I myself do not consider haystack pre-indexing algorithms to be "text search" algorithms (maybe incorrectly).

> gross exaggeration

If you count pre-indexing algorithms and edge cases, then you are correct. For the most common case: can you show me something faster (from a student or even yourself)?

> Can you explain your reasoning for why the worst-case complexity is worse than the other's? Did you do any measurements? I believe you are incorrect. I am the author of this page. I just did a quick test for SS = "aaa ...aaaBaaa...aaa", with SS_size=240. My algorithm is faster than 3 of the 4 BM implementations tested.

You write on the page that your algorithm has "O(NM) worst case complexity." Compare Boyer-Moore's time complexity. The grandparent was interested in asymptotics, not in particular examples.

> Same can be said about all BM, naive and BSD's memmem/strstr.

Red herring.

> Possibility of DOS for substring search algorithm was known for a long time - but it never materialized.

What are you talking about?

> If you count pre-indexing algorithms and edge cases, then you are correct. For the most common case: can you show me something faster (from a student or even yourself)?

What do you mean by edge cases? All the cases that make your program run slow?

Using your terminology, KMP has complexity O(N + M). That's better than O(MN).
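For reference, a sketch of KMP showing where the O(N + M) bound comes from: the failure table is built in O(M), and the scan is O(N) because the text index never moves backwards.

```python
def kmp_search(text, pattern):
    # Knuth-Morris-Pratt substring search. Worst case O(N + M) total,
    # unlike the O(N*M) naive search.
    if not pattern:
        return 0
    # failure[i]: length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    k = 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = failure[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1  # start index of the first match
    return -1

print(kmp_search("aaaaaaaaaaaab", "aaab"))  # → 9
```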

No DBLP profile for the author, no proofs on the site, a fastest-algorithm-known-"to me" qualifier, no results for "suffix tree" on the page: not a good sign.

EDIT: Am I missing something???

Complexity analysis according to the author: m = search term, n = text

O(m) preprocessing -- that's right, O(search term). And O(n*m) worst-case search, so the worst case traverses the whole text.

Now compare that to suffix trees:

O(n) preprocessing, O(m) string search

where the worst-case complexity is linear in the search term.
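To make the trade-off concrete, here is a sketch using a suffix array (a simpler, more memory-friendly cousin of the suffix tree; my illustration, not from the page): preprocess the text once, then each query costs O(m log n) instead of touching the whole text.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by suffix text.
    # This is O(n^2 log n) in the worst case; production code uses
    # dedicated O(n log n) or O(n) suffix array construction algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_contains(text, sa, pattern):
    # Binary search for a suffix starting with pattern: O(m log n).
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern

sa = build_suffix_array("banana")
print(sa)                                # → [5, 3, 1, 0, 4, 2]
print(sa_contains("banana", sa, "ana"))  # → True
```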

He's a Russian hacker. That's where awesome new algorithms come from these days, including things like Dual-Pivot Quicksort. (And QuickLZ may have been invented by some Scandinavian guy, but I'm pretty sure it was announced on encode.ru.) He's not an academic, but that doesn't mean he can't do competent algorithmic analysis.

I concur with the other commenters that it's silly for you to complain about his not comparing his online string-search algorithm against offline string-search algorithms that search an index of the text, such as suffix-tree algorithms.

His being Russian has no bearing on the issue. nginx and the pivot improvement to quicksort don't imply that Russian mathematicians no longer need to prove their claims. Not being an academic doesn't exempt you from having to do rigorous analysis either.

As for the online algorithm issue, see my reply below.

Dual-Pivot Quicksort is an "awesome new algorithm"? Hardly. It's a very small step in the direction of Samplesort -- which is asymptotically optimal and has been around for four decades.

Dual-Pivot Quicksort is a demonstration that someone didn't read the existing literature; nothing more.

Unfortunately I don't have access to the Samplesort paper, so I don't know which sense of "asymptotically optimal" you're using here. Quicksort is already asymptotically optimal in the sense that its asymptotic average performance is Θ(N log N), which is the best a sorting algorithm can do.

It's true that dual-pivot quicksort is a step in the direction of Samplesort. (I do understand Samplesort well enough to say that.) That doesn't mean it's not a worthwhile contribution in its own right. Jon Bentley (advisor of Brian Reid, Ousterhout, Josh Bloch, and Gosling while at CMU; later at Bell Labs in its glory days) was quoted as having a substantially different opinion from yours:


> I think that Vladimir's contributions to Quicksort go way beyond

> anything that I've ever done, and rank up there with Hoare's original

> design and Sedgewick's analysis. I feel so privileged to play a very,

> very minor role in helping Vladimir with the most excellent work!

I am not sure Bentley will be persuaded by your claim that he didn't read the existing literature.

When that thread came up originally, I was intrigued by the idea of the dual-pivot quicksort, and went on to browse Robert Sedgewick's Ph.D. thesis, _Quicksort_, published in 1978. Sedgewick considers and analyzes several variants of quicksort in his thesis, and I quickly discovered that he analyzes and dismisses the dual-pivot variant under a different name. I don't know whose analysis is better, but it does follow that we shouldn't treat dual-pivot qsort as an original invention (not to detract from Vladimir Yaroslavskiy's doubtless ingenuity).

Here's an excerpt from the email I wrote to Prof. Sedgewick back then; he didn't reply, so I've no idea what he thinks about this:

"I believe that your PhD thesis studies this under the name of "double partition modification". I could find no real difference between the two algorithms. Yaroslavsky claims to achieve a factor of 0.8 improvements over the standard quicksort on the number of exchanges (neutral on comparisons), while your analysis led you to reject the technique after concluding it required many more comparisons than the standard quicksort. I've only attempted very cursorily to reconcile the math; it seems that both your thesis and Yaroslavsky arrive at 0.8n(log n) as the average number of swaps, but you have (1/3) n(log n) as the average number of swaps in the classic algorithm while Yaroslavsky has 1.0 n(log n) for the same thing, so what he sees as an improvement you viewed as a disaster. My interpretation of this may be wrong, and I haven't attempted to verify either analysis yet."

The "classic algorithm" in the above is the classic quicksort as presented by Sedgewick in the thesis, which is slightly different than e.g. the standard Java implementation. I lost interest soon thereafter and stopped investigating this, but it shouldn't be too difficult to find out who was right in that algorithm comparison, if someone's interested enough.

Times change. The relative cost of a compare is considerably higher now than it was in 1978, thanks to the cost of branch (mis)predictions.

It's entirely possible that a modification which was a significant loss in 1978 is a significant gain in 2008.

Yes, but the number of comparisons and the number of swaps wouldn't change. Sedgewick and Yaroslavskiy can't both be right, unless anatoly misread something.

Oops, I misread what he wrote.

The asymptotic optimality of samplesort is in the sense that it takes (1 + o(1)) N log N / log 2 comparisons, i.e., you can't beat it by any constant factor.

Bentley's comments from that email don't make any sense to me. I can't understand what he was thinking, unless he's just unusually prone to flattering other people's work (I don't know Bentley personally, but I've met other people who do this, much to my irritation).

> The asymptotic optimality of samplesort is in the sense that it takes (1 + o(1)) N log N / log 2 comparisons, i.e., you can't beat it by any constant factor.

Why isn't it used everywhere that people currently use Quicksort? I've never actually heard of anyone using Samplesort except in a parallel context.

Two reasons: 1. It's a PITA to implement; 2. You need a large N to overcome the o(1), and at that point you're usually dominated by the performance of your memory hierarchy rather than by the cost of doing compares.

Last I heard, python's built-in sort was a type of samplesort, though, so it isn't completely unused.

The previous Python sort was a type of samplesort, but the current one is an “adaptive, stable, natural mergesort” by Tim Peters (“timsort”). On random data I think it performs about the same or slightly worse, but when data has some structure, it performs much better. See http://bugs.python.org/file4451/timsort.txt

Your #2 suggests that it would be slower than ordinary Quicksort on normal-sized datasets. But dual-pivot quicksort is faster than ordinary Quicksort on normal-sized datasets. So in at least one sense, dual-pivot is not a step in the direction of samplesort.

Python's built-in sort is timsort, last I heard, which is a variant of mergesort that looks for existing sorted or backwards runs in order to run faster in the common cases of mostly-sorted or mostly-backwards data. It's true that Python's built-in sort was a samplesort variant until 2002, though, which I didn't know.

(It does seem like Samplesort's virtue of minimizing comparisons would be particularly desirable in an environment like Python, where every comparison potentially involves a call into slow interpreted code.)

I just read the original paper on samplesort, so this might not be the most up to date description, but the basic idea is as follows:

- From an input sequence of length n, choose a random subsequence of length k.

- Sort the subsequence.

- Partition the remaining (n-k) elements of the original sequence into the (k+1) partitions given by the subsequence.

- Sort the (k+1) partitions.

In the paper, the partitioning and sorting are always done using quicksort. Any n*log(n) sorting algorithm should work, though, which includes using samplesort recursively.

If you use an optimal value for k (and this might be a significant fraction of n), you can prove that lim_{n->infty} E(C_n)/log(n!) = 1, where E(C_n) is the expected number of comparisons for a sequence of length n.

Now, using samplesort recursively, quicksort is samplesort with k = 1. Double-Pivot quicksort is samplesort with k = 2.
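The steps above can be sketched as follows (a rough illustration, not the paper's tuned version; bucket placement uses binary search against the sorted pivots):

```python
import bisect
import random

def samplesort(seq, k=7):
    # Sketch of samplesort: sort k randomly sampled pivots, partition
    # the elements into the k+1 buckets the pivots define, then sort
    # each bucket by recursing. k=1 gives classic quicksort; k=2 is
    # the dual-pivot variant.
    if len(seq) <= k + 1:
        return sorted(seq)
    pivots = sorted(random.sample(seq, k))
    buckets = [[] for _ in range(k + 1)]
    for x in seq:
        # bucket index = number of pivots <= x (binary search)
        buckets[bisect.bisect_right(pivots, x)].append(x)
    if max(len(b) for b in buckets) == len(seq):
        return sorted(seq)  # degenerate case, e.g. all elements equal
    out = []
    for b in buckets:
        out.extend(samplesort(b, k))
    return out
```

(The paper picks k as a function of n to reach the (1 + o(1)) N log N comparison bound; a fixed small k, as here, just recovers quicksort-family behavior.)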

I guess it depends on what type of queries you expect; if you want to find the same (fixed) substring across a body of (dynamic) texts, the O(n) cost of preprocessing (suffix trees/arrays) is terrible.

If, on the other hand, you have a fixed "corpus" and a dynamic query, O(n) search time (this algo, purportedly) is terrible.

Hmm...there's no mention of it being suited for a dynamic or "online" setting as another commenter notes (not sure why my other comment was downvoted for that). I don't know what to make of this from the description:

"This algorithm especially well suited for long S, multi-substrings, small alphabet, regex/parsing and search for common substrings."

Long source text: the O(n*m) worst-case time per search kills it. Even for a single search, it's O(n*m) worst case here versus O(n + m) for suffix trees.

multi-substrings: suffix trees do each substring of length m in O(m), but are not compared.

search for common substrings: again, suffix trees would be more appropriate.

I'm assuming he is only comparing online algorithms, which preprocess only the pattern, not the text.

Nevertheless, it's inexcusable to be proposing a string matching algorithm without at least mentioning why suffix trees can't do the job. At the very least, showing that your algorithm beats suffix trees in any instantiation does wonders for credibility. For the record, suffix trees can be built online as well:


Sure, but they use a lot more space (even if you use suffix arrays instead of suffix trees) and are generally only worth it if you search for more than one pattern in the same text.

In most academic papers I have read that deal with online pattern matching, suffix arrays/trees are generally not compared.

Pattern matching performance also depends on the alphabet size of the text. In his experiment he doesn't report the alphabet size of the text nor does he provide results for different text collections.

The algorithm itself looks very similar to the one used in agrep proposed by Wu and Manber [1].

I also found the book "Flexible Pattern Matching in Strings" to be a very good reference on all things related to pattern matching [2].

[1] S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, 1994.

[2] http://www.amazon.com/Flexible-Pattern-Matching-Strings-Line...

He talks a bit about how to pick the right number of successive letters to use as hash keys - which is where you can get a handle on alphabet sizes. I would guess (maybe it actually says) that the Wikipedia dump in the benchmark was UTF-8 or ASCII and, either way, treated as an alphabet of 8-bit characters. The DNA case is kind of interesting (2 bits minimum, but more likely 3 or 4 in a typical genome record).

An illustration from the book I cited above showing the importance of the alphabet size (y-axis) and the pattern length (x-axis):


In the experiment he used patterns of different length on the same text collection. As you can see in the graph, different algorithms perform best for a certain alphabet size.

He describes the text collection as "text corpus taken from wikipedia text dump" so I'm guessing the alphabet size is around 90?

It's also probably not a good thing that all the strings he is searching for are prefixes of the same pattern.


Shift-Or: http://www-igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION...

BNDM: http://www-igm.univ-mlv.fr/~lecroq/string/bndm.html#SECTION0...

BOM: http://www-igm.univ-mlv.fr/~lecroq/string/bom.html#SECTION00...

The author states that preprocessing takes O(m) time, but that is only the average case.

A quick review of the code makes me think that its worst case is actually on the order of O((s * (s + 1)) / 2), where s = m / 2.

The Achilles heel is the hash function. It's trivial to create collisions and have the insertion time for word w turn from O(1) to O(w).
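A toy illustration of the failure mode (a stand-in, not the page's actual hash function): when every 2-byte word lands in the same bucket, each insertion scans an ever-longer chain and preprocessing degrades toward quadratic.

```python
import string

def build_table(words, hash_fn, nbuckets=1024):
    # Chained hash table; `probes` counts chain entries scanned before
    # each append, i.e. the total insertion work.
    table = [[] for _ in range(nbuckets)]
    probes = 0
    for w in words:
        bucket = table[hash_fn(w) % nbuckets]
        probes += len(bucket)
        bucket.append(w)
    return probes

# All 676 distinct two-letter words.
words = [a + b for a in string.ascii_lowercase for b in string.ascii_lowercase]

good = lambda w: hash(w)  # spreads keys across buckets
bad = lambda w: 0         # adversarial: everything collides

print(build_table(words, bad))   # 676*675/2 = 228150 probes: one long chain
print(build_table(words, good))  # orders of magnitude fewer
```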

Um, O((s * (s + 1)) / 2) = O(m^2). Quadratic, not exponential.

Sorry, I updated my comment just as you posted yours.

But - and I don't want to sound pedantic - how is m^2 not exponential growth?

Edit: mea culpa guys, I carelessly translated from Dutch. You're all right: quadratic, not exponential growth.

Because 2^m and m^2 are very different...

Using terminology like "exponential" and "quadratic" correctly in a discussion of algorithms is not pedantic...

Exponential growth is O(2^m). This is not pedantic -- it's definition. O(m^k) is polynomial, O(k^m) is exponential.

how is m^2 not exponential growth

m^2 is exponential -- in the value 2. But the value 2 doesn't change very rapidly, so we tend to focus on the m component of the expression. :-)

Because 2, the exponent in that expression, is a constant.

Exponential growth would be 2^m.

x^2 is quadratic, 2^x (for example) is exponential.

Have a look at http://en.wikipedia.org/wiki/Time_complexity , it's pretty comprehensive.
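The gap is easy to see numerically:

```python
# Quadratic (m^2) versus exponential (2^m) growth.
for m in (1, 2, 4, 8, 16, 32, 64):
    print(m, m**2, 2**m)
```

By m = 64 the quadratic column is at 4096 while the exponential one has passed 10^19.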

I am the author. You are right. I've updated the page.

Universal hashing will prevent collision problems.

If I am gonna need a string search algorithm for something serious, I would definitely use KMP (Knuth-Morris-Pratt). Linear worst-case complexity (I wouldn't risk anything else).

This seems like a very practical website about the algorithm, but where are the theory and proofs of the algorithm's time complexity?

I don't mean to be a turd, but the proofs are kind of obvious on the face of it. He's claiming expected linear time in the string being searched for "natural" texts and worst case O(MN). Proof of the worst case is pretty trivial by construction (of examples of that complexity) and contradiction (showing no worse case can exist). One can't be casual about proofs, of course, but: try thinking of an O(MN) example and then you can probably see from there why you can't do worse than that. Hint: if you can construct an example where you have to do the length-M check for nearly every position in the length-N haystack, aaaaaaaah... hmmmmm...., the rest should be clear.
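Following the hint, a quick sketch (my construction, not from the page): an all-a text against an a...ab pattern forces the full length-M check at nearly every position, so comparisons grow as roughly M*N.

```python
def naive_count(text, pattern):
    # Naive substring search, counting character comparisons.
    comparisons = 0
    for i in range(len(text) - len(pattern) + 1):
        for j in range(len(pattern)):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
    return comparisons

N, M = 1000, 100
comparisons = naive_count("a" * N, "a" * (M - 1) + "b")
print(comparisons)  # (N - M + 1) * M = 90100: the O(MN) blowup
```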

From the site:

>Preprocessing phase in O(M) space and time complexity. Searching phase average O(N) time complexity and O(N*M) worst case complexity.

I don't trust the analysis of someone referring to "average O(N) time"; Big O notation refers to boundary times.

Edit: Okay, based on arguments here and on [1], I'm going to accept that maybe he's just bastardizing the notation.

[1] http://stackoverflow.com/questions/3905355/meaning-of-averag...

> Big O notation refers to boundary times.

No it doesn't. You can have an O(N) amortized time. Big-O is a bounding function up to a constant factor, not necessarily a boundary (as in worst-case) time.


To say that something runs in amortized O(n) time guarantees an upper bound on the average time per operation in a worst-case sequence of operations. It does not deal with average-case time on random or typical data.

I didn't say it did. On the other hand, unless the author of the algorithm is really clueless (edit: or knowingly making a probabilistic statement), I'm sure he meant amortized time.

This is not an amortized analysis, it is a probabilistic analysis.

Big O notation is frequently used to refer to the average case bounds of an algorithm. Haven't you seen an analysis of quicksort?

It could be just an abuse of notation, in the same way that people say that Quicksort is O(n lg n) on average. Sure, on perverse data the best time bound you can prove is O(n^2), but on random data you get expected O(n lg n) time, and on typical data with good partition selection you can generally expect to not go quadratic.

(You could also use the O(n) time median algorithm to construct a truly O(n lg n) Quicksort, but that's just a fun theoretical side-note here.)

> ...an abuse of notation, in the same way that people say that Quicksort is O(n lg n) on average.

I think I may have misunderstood what you meant by this. If I have, I apologise, if I haven't...

Talking about the average-case complexity of algorithms with Big-Oh notation is in no way an "abuse" of that notation. If the average time complexity of an algorithm is `O(f(n))`, then its running time averaged over all inputs of size `n` is bounded above (in some sense) by `f(n)`.

I guess that statement doesn't say that the number of comparisons or swaps or "steps" the algorithms must make to complete is `O(n lg n)`. Perhaps that's what you meant.

I love how beautiful that median algorithm is. :-D

Asymptotic bounds can be given in the worst case, the best case, the average case ("in expectation"), or with high probability. All have rigorous definitions and you should go learn about them. There is nothing bastardized about this notation.

Skeptical but excited. Will definitely be studying this at the weekend. I had been working on a long writeup on string matching but stopped the project for lack of recent progress.

It seems that his algorithm is faster because it exploits the model of computation (memory-aligned accesses and multi-byte operations). He gets up to a constant factor more comparisons for free.
