
Faster string searching and matching with Rabin-Karp algorithm - freeatnet
http://www.stoimen.com/blog/2012/04/02/computer-algorithms-rabin-karp-string-searching/
======
ratzkewatzke
This article is a mess, and people need to stop voting it up. For example, it
contains this perplexing passage:

"Let’s say we have the string “hello world”, and let’s assume that its hash is
hash(‘hello world’) = 12345. So if hash(‘he’) = 1 we can say that the pattern
“he” is contained in the text “hello world”"

What the heck is he talking about here?

His hash function is O(m), where m is the length of the pattern being searched
for. This makes the implementation trivially O(mn), which he notes, writing in
another passage:

"The Rabin-Karp algorithm has the complexity of O(nm) where n, of course, is
the length of the text, while m is the length of the pattern. [...] However
it’s considered that Rabin-Karp’s complexity is O(n+m) in practice"

Next time I submit a code review, I'm going to "consider" my complexity this
way.

This isn't a good guide to Rabin-Karp, which is effectively discussed on
Wikipedia, in Cormen, etc. It is probably a very effective way of attracting
ad dollars for the banners all over the page.

Avoid, avoid.

~~~
ahelwer
Hmm? Expected running time is a valid and important concept. In Quicksort, for
example, picking the pivot element randomly still gives O(n^2) runtime in the
worst case, but the expected runtime is O(n log(n)). The O(n log(n)) bound can
be enforced by using the median-of-medians method for pivot selection, but
that's a pain to program and has a much higher constant factor than generating
a random number; the only benefit being that theoretical computer scientists
sleep well at night.
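
To make that concrete, here's a minimal sketch of the randomized-pivot
Quicksort being described (Python; illustrative, not from the article):

    import random

    def quicksort(a):
        # Random pivot: O(n^2) in the worst case, O(n log n) expected.
        if len(a) <= 1:
            return a
        pivot = random.choice(a)
        return (quicksort([x for x in a if x < pivot])
                + [x for x in a if x == pivot]
                + quicksort([x for x in a if x > pivot]))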

------
albertzeyer
It lacks an important detail: for an arbitrary hash function, the complexity is
O(n * m), just as in the naive implementation. You need a rolling hash function
that gives you the hash of the next substring in constant time.

The implementation is also wrong. First of all, it gets that detail wrong, so
it is O(n * m). But the really broken part is the comparison: it compares only
the hashes and never the actual substrings, so hash collisions produce false
matches.
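
For reference, a minimal sketch of the corrected search (Python; the base and
modulus are arbitrary illustrative choices, and the explicit substring
comparison on a hash hit is what guards against collisions):

    def rabin_karp(text, pattern, base=256, mod=1_000_003):
        n, m = len(text), len(pattern)
        if m > n:
            return -1
        p_hash = t_hash = 0
        for i in range(m):                       # O(m) once, not per window
            p_hash = (p_hash * base + ord(pattern[i])) % mod
            t_hash = (t_hash * base + ord(text[i])) % mod
        high = pow(base, m - 1, mod)             # weight of the leading char
        for i in range(n - m + 1):
            # Verify the actual substring on a hash hit to rule out collisions.
            if t_hash == p_hash and text[i:i+m] == pattern:
                return i
            if i < n - m:                        # roll the hash in O(1)
                t_hash = ((t_hash - ord(text[i]) * high) * base
                          + ord(text[i+m])) % mod
        return -1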

More about Rabin-Karp:
<http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm>

(Btw., are those tags auto-generated? It's funny that it got the 'rolling
hash' tag but the article doesn't mention it at all.)

------
kevingadd
No discussion of whether it's actually faster (no benchmarks, etc.) other than
some hand-waving about algorithmic complexity. Meh. :(

Clever algorithm, but I would be shocked if it were actually faster than
Boyer-Moore or KMP on a modern computer for average search cases.

~~~
robrenaud
For plain string search, it's definitely much slower than Boyer-Moore or KMP.

... but the cool thing about Karp-Rabin is that it generalizes well to
different problems.

* 2D pattern matching (requires a little bit of thinking to implement sliding hash blocks quickly).

* matching against a set of equal-length strings instead of a single string, by replacing the hash equality check with a set containment check. Aho-Corasick, a generalization of KMP to this problem, will outperform it, though. Note that this composes with the previous modification: you can do 2D pattern matching against sets of 2D patterns, as long as they share the same dimensions.

* You can use it to build a data structure that lets you quickly answer queries like substring(a, b) == substring(c, d) (see the sketch after this list).
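
That last one can be done with precomputed prefix hashes; a minimal sketch
(Python, half-open indices; the names and constants are my own choices):

    class SubstringHasher:
        # Prefix hashes of s, so equal-length substring comparisons are O(1)
        # (correct with high probability; hash collisions are possible).
        def __init__(self, s, base=256, mod=(1 << 61) - 1):
            self.mod = mod
            self.pre = [0] * (len(s) + 1)   # pre[i] = hash of s[:i]
            self.pw = [1] * (len(s) + 1)    # pw[i] = base**i % mod
            for i, ch in enumerate(s):
                self.pre[i + 1] = (self.pre[i] * base + ord(ch)) % mod
                self.pw[i + 1] = (self.pw[i] * base) % mod

        def hash(self, a, b):               # hash of s[a:b]
            return (self.pre[b] - self.pre[a] * self.pw[b - a]) % self.mod

        def equal(self, a, b, c, d):        # s[a:b] == s[c:d], w.h.p.
            return (b - a) == (d - c) and self.hash(a, b) == self.hash(c, d)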

My favorite use is to pick content-aware variable block sizes for a block-
deduping file system. Imagine that before you write a block, you compute a
checksum and re-use an existing block if you've seen it already. You want to
pick block boundaries so that you don't store redundant copies of content you
already have. Imagine you have a 1GB file A, and some modified version A' that
has a few character insertions at the beginning. Ideally, you would reuse most
of the blocks between the two files, since they are basically identical. The
naive strategy of cutting a block every 8k, say, isn't going to re-use any of
the contents, because the two files aren't lined up. Instead, imagine you keep
a rolling hash of length 16. Whenever the rolling hash mod 8k == 0, you start
a new block. Now you'll miss some of the commonality early in the file, but
eventually the same window will cause the blocks to restart at an aligned
place in A and A', so you'll only store one copy of most of the data. See
section 3.1.1 in this paper for more details:
<http://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf>
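
A toy sketch of that boundary-picking rule (Python; the window size, base, and
modulus are illustrative stand-ins, and LBFS itself uses Rabin fingerprints):

    def chunk_boundaries(data, window=16, base=257,
                         mod=(1 << 31) - 1, avg=8192):
        # Cut a block whenever the rolling hash of the last `window` bytes
        # is 0 mod `avg`, so boundaries realign after an insertion upstream.
        bw = pow(base, window, mod)   # weight of the byte leaving the window
        h = 0
        boundaries = []
        for i, b in enumerate(data):  # data is a bytes object
            h = (h * base + b) % mod
            if i >= window:
                h = (h - data[i - window] * bw) % mod
            if i + 1 >= window and h % avg == 0:
                boundaries.append(i + 1)    # block ends after byte i
        return boundaries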

~~~
tptacek
This is such a great comment. Thank you.

------
tomkinstinch
We shouldn't forget about Boyer-Moore or Knuth-Morris-Pratt either.

[http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_sear...](http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm)

[http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pr...](http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm)

~~~
ahelwer
Those are for matching a single pattern against a text. The intended usage of
Rabin-Karp is to match a set of patterns against a text, more efficiently than
running Boyer-Moore or KMP on each individual pattern. An algorithm with a
similar use-case to Rabin-Karp is the Aho-Corasick algorithm, which is very
simple to learn if you know KMP already: the preprocessing of the set of
patterns is just a generalization of the preprocessing of the single pattern
in KMP.
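
A minimal sketch of that set-of-patterns variant of Rabin-Karp (Python; all
patterns must share one length m, and the constants are illustrative):

    def rabin_karp_multi(text, patterns, base=256, mod=1_000_003):
        m = len(next(iter(patterns)))
        assert all(len(p) == m for p in patterns)

        def h(s):
            v = 0
            for ch in s:
                v = (v * base + ord(ch)) % mod
            return v

        targets = {h(p) for p in patterns}   # set containment, not equality
        pats = set(patterns)
        n = len(text)
        if n < m:
            return []
        high = pow(base, m - 1, mod)
        t, hits = h(text[:m]), []
        for i in range(n - m + 1):
            # Verify against the actual pattern set on a hash hit.
            if t in targets and text[i:i+m] in pats:
                hits.append(i)
            if i < n - m:                    # roll the hash in O(1)
                t = ((t - ord(text[i]) * high) * base
                     + ord(text[i+m])) % mod
        return hits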

------
perlgeek
If I want to match multiple patterns at the same time, wouldn't it be more
efficient to build a (possibly tagged) DFA from them? Then the performance
would be independent of the number of patterns to match (modulo some startup
cost to construct the DFA in the first place).
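
That is essentially what Aho-Corasick gives you: a trie over the patterns plus
failure links, which behaves like the tagged DFA described above. A compact
sketch (Python; the structure and names are mine):

    from collections import deque

    def build_aho_corasick(patterns):
        goto, fail, out = [{}], [0], [set()]
        for p in patterns:                       # 1. trie over the patterns
            s = 0
            for ch in p:
                if ch not in goto[s]:
                    goto.append({}); fail.append(0); out.append(set())
                    goto[s][ch] = len(goto) - 1
                s = goto[s][ch]
            out[s].add(p)                        # "tag" the accepting state
        q = deque(goto[0].values())              # depth-1 states fail to root
        while q:                                 # 2. BFS for failure links
            s = q.popleft()
            for ch, t in goto[s].items():
                q.append(t)
                f = fail[s]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[t] = goto[f].get(ch, 0)
                out[t] |= out[fail[t]]           # inherit shorter matches
        return goto, fail, out

    def search(text, automaton):
        goto, fail, out = automaton
        s, hits = 0, []
        for i, ch in enumerate(text):
            while s and ch not in goto[s]:
                s = fail[s]
            s = goto[s].get(ch, 0)
            for p in out[s]:
                hits.append((i - len(p) + 1, p)) # match ending at i
        return hits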

------
tocomment
Does this beat Burrows-Wheeler?

~~~
pjscott
The Burrows-Wheeler transform is something completely unrelated.

[http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transfo...](http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform)

Maybe you were thinking of something like Aho-Corasick or Knuth-Morris-Pratt
or Boyer-Moore? We really need better names for these algorithms. For example,
by the authority vested in me by my own hubris, I hereby rename Rabin-Karp to
"rolling hash search". Aho-Corasick is now "substring-trie search". Much
easier to remember!

~~~
epistasis
There is a string search based on the BWT which is used for most DNA read
mapping these days, due to its favorable performance compared to existing
hash-based algorithms.

~~~
ahelwer
You are probably thinking of Smith-Waterman:

[http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorith...](http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)

The company I work at (in HPC) was in negotiations to implement it for one of
those big plant-biotech firms long before I was around, or so I have been
told.

~~~
epistasis
Smith-Waterman is the gold standard, but it's too slow for most applications,
so many heuristics have been developed.

BWA is the most popular tool these days for short-read mapping, and has a
Smith-Waterman-based approach for long reads as well:

<http://bio-bwa.sourceforge.net/bwa.shtml>

