The explanation is a bit lengthy, but the idea is simple.
Consider the normal naive way to search for a string in a document. First you search for the next instance of the first letter of your search string, then try to match the entire string at that position. If the match fails then you advance your search position by one character and repeat (looking for next occurrence of first letter, etc).
The Knuth-Morris-Pratt algorithm improves upon this naive approach by usually advancing by more than one character after a failed match, thereby speeding up the search. It does this by taking advantage of its knowledge of the search string and the position at which the match failed.
To get the idea, consider searching for the string "explosion" .. first we find the next "e", then try matching the rest of the word. Say we match "explo" then fail (perhaps the document had "explode"), so now we want to start over and find the next "e" in the document... What KMP would do here is note that the document matched the first 5 letters "explo", none of which (other than 1st letter) are an "e", so it can advance by 5 characters, not just 1, to start looking for the next "e".
The amounts it can advance at each failed match position are pre-calculated to be efficient.
That's the idea, but what makes it a bit more complicated is that a string can contain a prefix of itself, like "mama" in the article, or in the worst case "aaaaaaa".
Yes, so for example if you were searching for "elephant" and matched "ele", then you could only advance by 2 since the 2nd "e" might be beginning of "elephant", but if you matched "elep" than you could advance by 4 since "ep" is NOT a prefix of "elephant".
But still, it all comes down to how far can you advance at any given match failure position.
Another fun one is the Aho-Corasick algorithm [0] which is great (i.e. more efficient) when you have multiple different target strings you want to find and you don't want to do multiple scans of the document.
Like KMP, it involves a preliminary step of analyzing the search term(s), and from them it builds a directed graph representing rules for what to do while consuming the stream of document-characters.
To find all occurrences, it's something like O(document_length + all_target_words_combined_length + number_of_hits) .
Thanks for this - I need to solve exactly this problem on a toy project I've been working on; you've saved me the time to hunt down the appropriate algorithm!
I hope it helps, but if you are just looking for text strings, you might want to benchmark your local regex libraries first. Among other optimizations, they might use that algorithm under the hood for long (foo|bar|baz|...) expressions.
Learning some Haskell is very educational; it changes the way you think about programming, including daily code (much like a Lisp, Erlang, SQL, Smalltalk also do).
These examples use pretty minimal features of Haskell, mostly expressions, function definitions with argument pattern matching, and algebraic type definitions. These would take an hour or two to get acquainted with. Nothing fancy is used in the code, in particular, nothing outright monadic :)
For comparison, you can check out a Lisp version at the end. It's much longer, and Lisp (well, Racket here) is usually a very expressive language.
If Haskell seems too intimidating I think Standard ML is easier to wrap one's head around as it has fewer features. But it still delivers that FP way of thinking insight.
Looks problematic. Is a string in Haskell a list? I would think that in Racket a string is a vector. FIRST and REST operations then would have very different implementations.
Much like in Rust or C++, there are several ways to represent sequences of characters in Haskell. The default way is a List, which is like a classic Lisp list,va chain of cons cells.
Yes, the String is a list of characters. The Haskell code uses head and tail on it. In Racket, they provided string-first and string-rest (at the end) to replace those.
I had the same reaction! I was excited about this article before realizing that "only elementary functional programming techniques" promised in the abstract actually means a full programming language used by 1% of programmers.
-- The any function determines if some element of the input list satisfies the given predicate:
any f [] = False
any f (x:xt) = f x || any f xt
In the meantime here's a more palatable version of 'any' for the 99% of programmers who are put off by the confusing Haskell:
// Returns whether any elements of this stream match the provided predicate. May not evaluate the predicate on all elements if not necessary for determining the result. If the stream is empty then false is returned and the predicate is not evaluated.
public final boolean anyMatch(Predicate<? super P_OUT> predicate) {
return evaluate(MatchOps.makeRef(predicate, MatchOps.MatchKind.ANY));
}
final <R> R evaluate(TerminalOp<E_OUT, R> terminalOp) {
assert getOutputShape() == terminalOp.inputShape();
if (linkedOrConsumed)
throw new IllegalStateException(MSG_STREAM_LINKED);
linkedOrConsumed = true;
return isParallel()
? terminalOp.evaluateParallel(this, sourceSpliterator(terminalOp.getOpFlags()))
: terminalOp.evaluateSequential(this, sourceSpliterator(terminalOp.getOpFlags()));
}
public static <T> TerminalOp<T, Boolean> makeRef(Predicate<? super T> predicate,
MatchKind matchKind) {
Objects.requireNonNull(predicate);
Objects.requireNonNull(matchKind);
class MatchSink extends BooleanTerminalSink<T> {
MatchSink() {
super(matchKind);
}
@Override
public void accept(T t) {
if (!stop && predicate.test(t) == matchKind.stopOnPredicateMatches) {
stop = true;
value = matchKind.shortCircuitResult;
}
}
}
return new MatchOp<>(StreamShape.REFERENCE, matchKind, MatchSink::new);
}
It would help if it was clear what tokens are part of the language or standard library, and what tokens are implementation of this algorithm (as in, some syntax highlighting would be nice :).
I mean this is the Journal of Functional Programming... That being said, I wish more people had your response. Curiosity and openness are beautiful things.
Nice. I tried visualizing this algorithm here - https://visuallyexplain.pages.dev/kmp/algorithm-(wip) . The sidebar has a bunch of completed algos -- DFS, BFS, binary heap queries, dijkstra's, union find, topological sort, quicksort, etc.
I abandoned this project without completing KMP. If anyone is interested in this sort of algorithm explanations, I'd love to collaborate / finish this project.
Memories from algorithm class.. I recall going in seeing the course plan the intimidating words of dynamic programming, linear programming, weighted graph search, etc. But string algorithms looked much easier on a surface level – how hard can it be? Turns out, pretty hard. Who’d have thought?
It’s really easy to come up with KMP. Just write an algorithm that searches for a substring p in string s. Make your algorithm efficient by sliding two pointers down s to find the biggest possible match, often called “sliding window”.
You want to grow the window as big as possible to match the substring. The data structure you naturally come up with to do checks efficiently here is the one you use in KMP.
I'm not surprised the pool can make the article reappear but I'm a bit surprised it adjusts the submission date of the article and I'm very surprised that the comment dates have changed.
There has to be multiple time fields with differing view contexts using different fields.
My submissions page shows the first submission time, the "actual" submission comments page shows the second post time.
There's a fair bit of lispy magic going on around here; dang and other mods can merge comments on seperate submissions, view comment voting history, and generally do a wide range of back end housekeeping | forensic | anti troll operations and views .. very probably nothing really changes .. just our end user perception of the HN world is filtered by transformation.
That’s funny. I remember the wake up after doing all the complexity analysis and such of academia, that once you’re engineering for performance, it’s often very different from theory. For instance, a lot of heap-allocations and the resulting memory fragmentation can really tank performance on real hardware. Naive linked lists for instance, is a good example of something that looks super useful in CS class, but turns out is quite niche in practice, and also quite differently implemented.
To be fair, it is largely the Big O analysis that is very different. And that is easily understood as a case of Goodhart's Law. It remains a good way of comparing the growth of algorithms, of course; but a full analysis would include more than only the top end growth numbers.
We, humankind managed to get a good optimisation for this problem by using spaces between words.
When trying these algorithms for searching a word in a string of text, I was surprised how little they could improve vs just skipping to the next word.
I think you would be surprised about how much of a performance hit that would be over existing state of the art. The thing you are missing is that the human visual system evaluates existence of spaces in parallel, a single threaded algorithm would need to check every character to see if it's a space, in addition to checking the letters of the search string. Also that fails to generalize to languages without spaces, search strings with spaces, etc.
Consider the normal naive way to search for a string in a document. First you search for the next instance of the first letter of your search string, then try to match the entire string at that position. If the match fails then you advance your search position by one character and repeat (looking for next occurrence of first letter, etc).
The Knuth-Morris-Pratt algorithm improves upon this naive approach by usually advancing by more than one character after a failed match, thereby speeding up the search. It does this by taking advantage of its knowledge of the search string and the position at which the match failed.
To get the idea, consider searching for the string "explosion" .. first we find the next "e", then try matching the rest of the word. Say we match "explo" then fail (perhaps the document had "explode"), so now we want to start over and find the next "e" in the document... What KMP would do here is note that the document matched the first 5 letters "explo", none of which (other than 1st letter) are an "e", so it can advance by 5 characters, not just 1, to start looking for the next "e".
The amounts it can advance at each failed match position are pre-calculated to be efficient.