For Part I, though I get that the point is to demonstrate proofs with CBMC, this is a pretty contrived example. E.g. using state machines for a search like this makes even an on-line, very lazily written implementation human-verifiable at a read. I would like to see an example that doesn't have such a contrived nature.
Edit: Now I see in part II/III where the title actually comes to fruition and a state-machine-oriented generator is made and used. That still doesn't satisfy me about part I; honestly I would have just skipped it in retrospect.
for (unsigned i=0; i<len; i++)
{
    if (s[i]=='o' && s[i+1]=='k')
        return i; // found
};
On the last iteration of this loop, it accesses s[len-1] and s[len]. If the string is null-terminated, that's OK. If it's really just a length-qualified string, that's not OK.
The later implementations treat len differently.
EDIT: It's worse than I said. The later version won't detect "ok" at the end of the string, because it is off by one in the OPPOSITE direction.
Then why are we providing a length? Plenty of string-handling functions in the C standard library can leave no terminating null when the max length is reached -- e.g. strncpy.
Note the second function is off by one in the opposite direction-- it doesn't find "ok" in "ok".
Because the writer of the function doesn't want to bother to check the null terminator. I'm not saying it's a good function, it just doesn't have an issue as long as the input is a valid string with a correct length.
strncpy, despite what its name implies, indeed does not necessarily produce null-terminated strings.
Regardless, a C string is defined to have a null terminator. All bets are off if you don't provide a string.
The issue is: the two functions that we're comparing for correctness treat lengths very differently: one scans len-1 bytes, the other scans len+1 bytes.
Even with a null terminator, there isn't a length that would work for both without overrunning the buffer.
printf("%d %d\n", search_ok_1("mok", 2), search_ok_2("mok", 2));
printf("%d %d\n", search_ok_1("mok", 3), search_ok_2("mok", 3)); /* Overruns without null term */
printf("%d %d\n", search_ok_1("mok", 4), search_ok_2("mok", 4)); /* Overruns even WITH null term */
"You see, there are two memory accesses per one character of the input string."
No, there's one memory access on average even if your compiler is maximally dumb (short circuit AND is something C compilers need to know how to do to be conformant). And in a smarter compiler this will be one memory access per byte anyways. Not to mention on any modern architecture, even if your compiler doesn't squash the second memory access, the microarchitecture will.
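To make the short-circuit point concrete, the condition in the quoted loop behaves exactly like the nested form below (a sketch, not the article's code):

/* s[i] == 'o' && s[i+1] == 'k' evaluates left to right;
   the right operand is only evaluated if the left one is true. */
if (s[i] == 'o') {
    if (s[i + 1] == 'k')   /* second access only after 'o' matched */
        return i;          /* found */
}

So the second load only happens at positions that already matched 'o'.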
Also all the semicolons after his brackets bug me. Not to mention that the thing looks past the end of the string. (We have a length specified, so I assume it's not null terminated... indeed, we can get a match past the end).
> It doesn't matter whether we have a worst case or not. They claim two memory accesses per character.
Yes, a worst case of two memory accesses per character. This is obvious, no?
When people analyze algorithm runtime it is not always explicitly stated whether the author is doing worst-case or average-case analysis. So, as a reader, you sometimes have to figure it out.
It seems much more plausible to me that the author does understand short-circuit evaluation, but did not feel like explaining it in such detail to satisfy the tedious pedantry of HN comments.
My experience with writing answers on Stack Overflow is that people in the comments will always figure out a way to interpret an answer as being incorrect, and latch onto it.
There is a tradeoff between clarity and pedantry. You don’t want to simplify something so much that it is no longer insightful, and you don’t want to write something with irrelevant detail that obscures the point you are trying to make.
I really don't think the author is fluent in C. They consistently put semicolons after brackets, and they have off-by-1 errors in array indexing, etc (the first search_ok looks past the end of string; the second search_ok won't see ok in the string "ok"). That's part of why I read it that way.
My experience is that C programmers with many years of experience will still make off-by-one errors. Most of the examples have the correct bounds. The semicolons don’t belong but… so what?
If you wish to report errors to the author, there is an email address at the bottom:
> Please drop me email about bug(s) and/or suggestion(s):
"worst" versus "average". He probably could've phrased the sentence better, but his conclusion isn't wrong. If the string is all 'o' characters we'd hit this worst case memory access.
Isn't the "homebrew" result at the end exactly the state machine from the DFA?
It has been a while, but I remember KMP calculating a prefix table of some sort beforehand, to skip the repeated checking on yet-unclear partial matches?
Yeah, back in university I really only learned the version that is presented in part 3. I mean, I feel like the prefix table is just a more succinct representation of the DFA, but I do find it interesting to see one way that approach could've been arrived at.
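For reference, a small sketch of that prefix (failure) table computation -- the textbook KMP preprocessing step, not anything from the article (pat and fail are just illustrative names):

#include <stddef.h>

/* fail[i] = length of the longest proper prefix of pat[0..i]
   that is also a suffix of pat[0..i].  On a mismatch after
   matching i+1 characters, the search resumes with fail[i]
   characters already matched instead of restarting from 0,
   which is the same information the DFA encodes. */
static void build_fail(const char *pat, size_t m, size_t *fail)
{
    size_t k = 0;                  /* length of the current border */
    fail[0] = 0;
    for (size_t i = 1; i < m; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];       /* fall back to a shorter border */
        if (pat[i] == pat[k])
            k++;
        fail[i] = k;
    }
}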
Thank you for the write-up, I love the way you use CBMC to construct your algorithm, it seems like a way to synthesize code :)
Thanks also for pointing out CBMC, I had been looking for something like that for years!
Long ago, I wrote a text search that used the 8088 XLAT instruction in a very short loop; the carry flag got set when there was a hit.
On modern machines, it would all end up in cache quickly, the branch would be well predicted, and it would just zip through text.
It handled upper/lower case (you just set the bits corresponding to both upper and lower case of the letter)... but it couldn't do regex.
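Not the original assembly, but the table idea translates roughly into C like this (a sketch: a 256-entry table with both cases of the target letter flagged, one lookup per input byte):

#include <stddef.h>

/* Scan for 'k'/'K' via a 256-entry lookup table, roughly the
   XLAT trick: one table lookup per byte instead of compares. */
static long find_k(const unsigned char *s, size_t len)
{
    static unsigned char hit[256];  /* zero-initialized */
    hit['k'] = 1;                   /* flag both cases of the letter */
    hit['K'] = 1;

    for (size_t i = 0; i < len; i++) {
        if (hit[s[i]])              /* the "carry flag" moment */
            return (long)i;
    }
    return -1;                      /* no hit */
}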
Part 1 is a fairly naïve hand-rolled solution. Part 2 shows KMP generating a DFA, which does reduce branching (specifically in the search, where the branches in the first part become array lookups in the second part).
It is and it isn't. The DFA is not deliberately constructed, it's more a "happy accident". A renaming/reframing of the `seen` variable would make it obvious that what is constructed is a manually conceived DFA. As presented, `seen` is "how many characters have we seen of the target string?". Reframing it as "There exists a state for each character in the target, `seen` represents which of these states we are presently in" makes it clear that it is a DFA. But as constructed it's an accidental DFA, rather than a deliberate one.
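To illustrate the reframing (a sketch in the spirit of the article's "ok" search; the explicit state numbering and transition table are mine, not the article's):

#include <stddef.h>

/* States: 0 = matched nothing, 1 = just saw 'o', 2 = saw "ok" (accept).
   Same information as the `seen` counter, written as a table. */
static int find_ok_dfa(const char *s, size_t len)
{
    /* Columns: input is 'o', 'k', or anything else. */
    static const int next[2][3] = {
        /* state 0 */ { 1, 0, 0 },
        /* state 1 */ { 1, 2, 0 },
    };
    int state = 0;
    for (size_t i = 0; i < len; i++) {
        int c = (s[i] == 'o') ? 0 : (s[i] == 'k') ? 1 : 2;
        state = next[state][c];
        if (state == 2)
            return (int)(i - 1);   /* accept: match began one byte back */
    }
    return -1;
}

The branches in the naive version become the next[][] lookups here, which is the reduction in branching mentioned above.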
Does avoiding memory accesses for short strings really provide much performance improvement? Surely the adjacent characters, which need to be read at least once anyway, will be in cache. I can see the benefit when the searched-for string is long.
It makes the algorithm (if not using a series of unoptimized if-else-ifs) linear in the length of the string being searched. You go from O(n*k) to O(n).
I mean, it may not matter much, but it is more efficient. A more impactful optimization is knowing how far to skip in the string being searched based on which state you're in and what character appears.
For those who are interested, there's a good tool to specifically test string searching algorithms here:
https://github.com/smart-tool/smart
There are so many string searching algorithms now, with different best and worst cases. Some work better on low alphabets (eg DNA), some are better for text or high entropy data, some take advantage of CPU instructions, some are generic. The real challenge is picking the right algorithm.
I've implemented a few of them in java here, and extended them to support multi byte matching at any position:
https://github.com/nishihatapalmer/byteseek