
Knuth–Morris–Pratt algorithm - rbcoffee
https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
======
kevingadd
I've always liked the elegance of this algorithm. It's a fun little example of
how you can make things incredibly fast by thinking about your problem.

And a related anecdote: I was interviewing at Apple for a systems development
related role (graphics drivers, I think?) and one of the senior-level folks
asked me to write strstr on the whiteboard. I started with a naive, working
implementation, then he asked me how I'd optimize it. I said 'knuth morris
pratt' and gave a basic overview of the algorithm and explained how it's
faster.

He insisted the algorithm couldn't possibly work. I spent a few more minutes
trying to explain it, but I couldn't convince him. The dark magic of efficient
string searches evades us all sometimes, I suppose. I always like coming away
from an interview feeling like I learned something, so I hope he googled the
algorithm later. :-)
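
For anyone who wants the whiteboard version spelled out, here is a minimal sketch of both steps: a naive strstr, then KMP. The function names (`naive_find`, `failure_table`, `kmp_find`) are made up for illustration and this is not the code from the interview, just one common way to write it.

```python
def naive_find(haystack, needle):
    """Naive strstr: try every start position. Returns first index or -1."""
    n, m = len(haystack), len(needle)
    for start in range(n - m + 1):
        if haystack[start:start + m] == needle:
            return start
    return -1


def failure_table(needle):
    """fail[i] = length of the longest proper prefix of needle[:i+1]
    that is also a suffix of needle[:i+1]."""
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k > 0 and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find(haystack, needle):
    """KMP search: never moves backwards in haystack. Returns first index or -1."""
    if not needle:
        return 0
    fail = failure_table(needle)
    k = 0  # number of needle characters matched so far
    for i, ch in enumerate(haystack):
        while k > 0 and ch != needle[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            return i - k + 1
    return -1
```

The point that tends to be hard to believe on a whiteboard is that `kmp_find` never re-reads a haystack character: after a mismatch it only changes `k`, using what the needle already tells it about its own prefixes.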

~~~
jiggy2011
That's the risk with interview questions: the candidate suggests a better
solution than the interviewer anticipated, leaving the interviewer with the
task of ascertaining whether the solution is correct.

~~~
foobarian
I _love_ to be surprised like that in interviews, let me tell you. Makes my
decision much easier.

~~~
noir_lord
> Makes my decision much easier.

You didn't say which way. I've worked with and for people who wouldn't hire
someone smarter than them; the human psyche is a dark place.

~~~
gedrap
That's true. On the other hand, it's just for the interviewee's own good - you
don't want to work with people like that ;) except at megacorps with
hundreds and thousands of devs

------
volaski
Why is KMP on the front page of HN? Not complaining, just confused. Is this
related to some other news?

~~~
DanBC
It's a weekend and other material surfaces.

~~~
devcpp
Yup, welcome to the slow hours of the weekend on HN. Personally, I like the
change.

------
chaoxu
I have implemented KMP in Haskell. This version doesn't use any index! It is
built purely functionally by realizing KMP's failure table is just a finite
state automaton (well, almost...). However, it is much longer than the C++
version...

Code: [https://github.com/Mgccl/haskell-
algorithm/blob/master/KMP.h...](https://github.com/Mgccl/haskell-
algorithm/blob/master/KMP.hs) Description:
[http://www.chaoxuprime.com/posts/2014-04-11-the-kmp-
algorith...](http://www.chaoxuprime.com/posts/2014-04-11-the-kmp-algorithm-in-
haskell.html)
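
To illustrate the automaton view in an imperative language, here is a rough Python sketch that builds a full deterministic automaton for a pattern, in the style of common textbook presentations. It is not a translation of the Haskell code linked above; `build_dfa`, `dfa_find` and the explicit `alphabet` parameter are assumptions made for the example, and the pattern is assumed to be non-empty and drawn from `alphabet`.

```python
def build_dfa(pattern, alphabet):
    """dfa[state][ch] = next state; state j means 'j pattern chars matched'."""
    m = len(pattern)
    dfa = [dict.fromkeys(alphabet, 0) for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    x = 0  # state the automaton would be in after reading pattern[1:j]
    for j in range(1, m + 1):
        for ch in alphabet:
            dfa[j][ch] = dfa[x][ch]       # on mismatch, behave like state x
        if j < m:
            dfa[j][pattern[j]] = j + 1    # on match, advance
            x = dfa[x][pattern[j]]
    return dfa


def dfa_find(text, pattern, alphabet):
    """Return the first index of pattern in text, or -1."""
    dfa = build_dfa(pattern, alphabet)
    state = 0
    for i, ch in enumerate(text):
        state = dfa[state].get(ch, 0)     # unknown characters reset to state 0
        if state == len(pattern):
            return i - len(pattern) + 1
    return -1
```

The search loop has no inner fallback loop at all: each text character is one table lookup. The price is a table of size (pattern length + 1) times alphabet size, which is why the failure-table form is usually preferred for large alphabets.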

Actually, KMP is a little harder to program purely functionally than the MP
algorithm. An extremely elegant MP algorithm is implemented here:
[http://twanvl.nl/blog/haskell/Knuth-Morris-Pratt-in-
Haskell](http://twanvl.nl/blog/haskell/Knuth-Morris-Pratt-in-Haskell) (Note it
says the algorithm is KMP, but it is actually the MP algorithm).
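
For readers unsure what separates MP from KMP, here is a small sketch of the two preprocessing tables, using the usual convention of a table of length m + 1 with -1 as a sentinel. The names `mp_next` and `kmp_next` are chosen for illustration. The only difference is the extra check in `kmp_next`: if the character after the border equals the character that just failed, the shift would fail again, so it is skipped in advance.

```python
def mp_next(x):
    """Morris-Pratt table: nxt[i] = length of the longest border of x[:i]."""
    m = len(x)
    nxt = [0] * (m + 1)
    i, j = 0, -1
    nxt[0] = -1
    while i < m:
        while j > -1 and x[i] != x[j]:
            j = nxt[j]
        i += 1
        j += 1
        nxt[i] = j
    return nxt


def kmp_next(x):
    """Knuth-Morris-Pratt table: like mp_next, but skips borders that are
    followed by the same character that just mismatched."""
    m = len(x)
    nxt = [0] * (m + 1)
    i, j = 0, -1
    nxt[0] = -1
    while i < m:
        while j > -1 and x[i] != x[j]:
            j = nxt[j]
        i += 1
        j += 1
        if i < m and x[i] == x[j]:
            nxt[i] = nxt[j]   # same character would fail again, jump further
        else:
            nxt[i] = j
    return nxt
```

On the pattern "GCAGAGAG" the two tables differ exactly at the positions where the pattern repeats its first character, which is where KMP saves comparisons over MP.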

The Aho–Corasick string matching algorithm is a generalization of the MP
algorithm, which I also coded in Haskell, inspired by the MP code above.
[https://github.com/Mgccl/haskell-
algorithm/blob/master/AhoCo...](https://github.com/Mgccl/haskell-
algorithm/blob/master/AhoCorasick.hs)
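
As a rough illustration of the generalization (in Python, and not derived from the Haskell code above): the pattern becomes a trie of patterns, and the failure table becomes a failure link per trie node. The names `build_automaton` and `find_all` are made up for this sketch, and patterns are assumed to be non-empty.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]   # goto[state][ch] -> child state (trie edges)
    fail = [0]    # failure link per state, the analogue of the MP table
    out = [[]]    # patterns recognized when a state is reached
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence, in text order."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

With a single pattern the trie is a straight line and the failure links reduce to the MP border table, which is the sense in which this generalizes MP.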

~~~
platz
Looks like there's also a KMP implementation here specialized to ByteStrings:
[http://hackage.haskell.org/package/stringsearch](http://hackage.haskell.org/package/stringsearch)

------
bitL
KMP is conceptually very cool, as are other clever string searching
algorithms (BM, RK, AC), though one question I've always had is whether even
its most efficient implementation would still be slower than executing a
brute-force combination of REP CMPSL/CMPSB instructions (x86) for the vast
majority of searched strings?

~~~
ekr
The time complexity of KMP is O(n), while the complexity of your idea is
O(n^2). Considering the fact that KMP is not even doing a lot of things in its
inner loop, (few memory accesses), a rep cmpsb approach is really no match for
it, even in the trivial cases.

So no, it's much faster.
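
To put rough numbers on the asymptotic gap, here is a small sketch that counts character comparisons for both approaches on an adversarial input (a long run of one character, with a needle that mismatches only on its last character). The function names and the input are illustrative assumptions, and comparison counts say nothing about cache or branch-prediction behaviour.

```python
def naive_comparisons(haystack, needle):
    """Count character comparisons made by a brute-force search."""
    count = 0
    n, m = len(haystack), len(needle)
    for start in range(n - m + 1):
        for j in range(m):
            count += 1
            if haystack[start + j] != needle[j]:
                break
    return count


def kmp_comparisons(haystack, needle):
    """Count haystack/needle comparisons made by the KMP search phase."""
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k > 0 and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    count = 0
    k = 0
    for ch in haystack:
        while True:
            count += 1
            if ch == needle[k]:
                k += 1
                break
            if k == 0:
                break
            k = fail[k - 1]
        if k == len(needle):
            k = fail[k - 1]  # keep scanning for further matches
    return count


haystack = "a" * 1000
needle = "a" * 9 + "b"
print(naive_comparisons(haystack, needle), kmp_comparisons(haystack, needle))
```

On this input the brute-force count is (n - m + 1) * m, while the KMP count stays below 2n regardless of the needle. On typical text, where most start positions fail on the first character, the two counts are much closer.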

~~~
bitL
I know about the time complexity, but my argument points to the CPU
architecture and cache/branch prediction efficiency.

KMP has a lot of branching, which is more expensive than a simple cache (line)
hit. REP CMPSx doesn't have any branching and aborts immediately after a
mismatch, with only a single loop over the starting position, so the search is
also pretty much linear. This is not a Turing machine where it is being executed
;-)

So in theory KMP is faster, in practice for not very large strings I am really
not sure... There are plenty of optimal algorithms where the fixed cost is too
high for the majority of useful cases compared to less optimal algorithms with
very low fixed costs. Does KMP have enough "mechanical sympathy" to overcome
specific machine code instructions for the most frequent cases?

~~~
dalke
This depends completely on your needle and your haystack. If it's the string
"ABC" in "QABC" then certainly the naive algorithm will be the fastest.

"Mechanical sympathy" unfortunately is easily misinterpreted to mean to only
listen to the machine. Remember though that Martin Thompson is alluding to
Jackie Stewart and Formula One racing. All of the cars competed on the same
track. But you wouldn't enter a Formula One car for the Baja 1000.

In fact, I don't think you should use the term unless you have a specific goal
in mind. KMP is only best for certain classes of searches. Boyer-Moore is used
for others. And there are plenty of other algorithms with their own pros and
cons. See [http://www-igm.univ-mlv.fr/~lecroq/string/index.html](http://www-
igm.univ-mlv.fr/~lecroq/string/index.html) for descriptions.

Python, for example, uses (or used?) the algorithm described at
[http://effbot.org/zone/stringlib.htm](http://effbot.org/zone/stringlib.htm) ,
for the reasons listed therein.

------
mavam
KMP starts with the first character of the pattern (or substring/needle) and
then jumps forward by the length of the mismatch. A related algorithm is
Boyer-Moore
(BM)([http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algor...](http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm)),
which operates the other way round: it begins with the last character of the
pattern and then compares backwards until the full pattern matches. The
advantage of BM is that it allows for bigger jumps.

It seems that KMP works well for small alphabets (e.g., DNA), whereas BM
shines for larger alphabets (e.g., plain English).
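
To show where the bigger jumps come from, here is a minimal sketch of the Horspool simplification of Boyer-Moore, which keeps only the bad-character shift. Full BM also uses a good-suffix rule that is omitted here, and the name `horspool_find` is made up for the example.

```python
def horspool_find(haystack, needle):
    """Boyer-Moore-Horspool: compare right to left, shift by the last
    character of the current window. Returns first index or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # For each character in needle[:-1], distance from its last occurrence
    # to the end of the needle. Characters not listed allow a full shift of m.
    shift = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and haystack[pos + j] == needle[j]:
            j -= 1
        if j < 0:
            return pos
        pos += shift.get(haystack[pos + m - 1], m)
    return -1
```

With a large alphabet most window-ending characters do not occur in the needle at all, so the search often advances by the full needle length after a single comparison. With a four-letter alphabet like DNA that rarely happens, which fits the observation above.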

~~~
yxhuvud
One difference is that BM is O(m*n) while KMP is O(n + m). Depending on what
the input string looks like, this can matter - especially in small alphabets
where the likelihood of patterns repeating themselves is higher.

~~~
IsTom
BM is provably O(n + m) too (and without dependence on alphabet size, unlike
KMP) if you apply two heuristics that I don't really remember. For some reason
it's not mentioned on Wikipedia.

~~~
jasode
Turbo BM?

[http://www-igm.univ-mlv.fr/~lecroq/string/node15.html](http://www-igm.univ-
mlv.fr/~lecroq/string/node15.html)

------
Cieplak
Here is the boost implementation:
[http://www.boost.org/doc/libs/1_55_0/libs/algorithm/doc/html...](http://www.boost.org/doc/libs/1_55_0/libs/algorithm/doc/html/the_boost_algorithm_library/Searching/KnuthMorrisPratt.html)

~~~
Intermernet
Here's one in Go :-)

[http://play.golang.org/p/chYGT69vBc](http://play.golang.org/p/chYGT69vBc)

------
wslh
I always wondered why most common KMP and RE implementations don't take into
account the case of using streams instead of strings. That's why I ended up
writing this article (with code) "Searching for Substrings in Streams: a
Slight Modification of the Knuth-Morris-Pratt Algorithm in Haxe" [1] and
adding information about a currently unsupported RE lib that takes streams
into account.

Hope this helps.

[1] [http://blog.databigbang.com/searching-for-substrings-in-
stre...](http://blog.databigbang.com/searching-for-substrings-in-streams-a-
slight-modification-of-the-knuth-morris-pratt-algorithm-in-haxe/)
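
As a rough idea of why KMP suits streams (a Python sketch, not the Haxe code from the article): the only state carried between characters is the current match length, so the input can arrive in chunks of any size and nothing needs to be buffered or re-read. The name `stream_find` and the chunked-iterable interface are assumptions made for this example.

```python
def stream_find(chunks, needle):
    """Yield the start offset of every occurrence of needle in the
    concatenation of chunks, reading each character exactly once."""
    if not needle:
        raise ValueError("needle must be non-empty")
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k > 0 and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    k = 0        # characters of needle matched so far
    offset = 0   # total characters consumed from the stream
    for chunk in chunks:
        for ch in chunk:
            while k > 0 and ch != needle[k]:
                k = fail[k - 1]
            if ch == needle[k]:
                k += 1
            offset += 1
            if k == len(needle):
                yield offset - k
                k = fail[k - 1]  # allow overlapping matches
```

Because `chunks` can be any iterable, this works the same for a list of strings, lines from a file object, or a generator reading from a socket, and matches that straddle chunk boundaries are found without special handling.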

------
reledi
A few months ago I stumbled on James Morris' GitHub profile [1] by accident.
There's not much activity, but he has dabbled with Ruby on Rails.

1: [https://github.com/jhm15217/](https://github.com/jhm15217/)

------
deckar01
The FM-Index changed the way I think about searching.

[http://alexbowe.com/fm-index/](http://alexbowe.com/fm-index/)

------
kinow
I learned about this algorithm recently, because of an exercise in the
Rosalind bioinformatics problems list:
[http://rosalind.info/problems/kmp/](http://rosalind.info/problems/kmp/)

There are tons of other interesting algorithms put in practice in different
bioinformatics problems.

------
ladon86
Here's an easy to follow video explaining the algorithm:
[https://www.youtube.com/watch?v=rfisBOOLN9M](https://www.youtube.com/watch?v=rfisBOOLN9M)

