For those who are interested, there's a good tool here specifically for testing string searching algorithms:
There are so many string searching algorithms now, with different best and worst cases. Some work better on small alphabets (e.g. DNA), some are better for text or high-entropy data, some take advantage of CPU instructions, some are generic. The real challenge is picking the right algorithm.
I've implemented a few of them in Java here, and extended them to support multi-byte matching at any position:
Edit: Now I see in parts II/III where the title actually comes to fruition: a state-machine-oriented generator is made and used. That still doesn't satisfy me about part I; honestly, in retrospect I would have just skipped it.
for (unsigned i = 0; i < len; i++)
    if (s[i] == 'o' && s[i+1] == 'k')
        return i; // found
The later implementations treat len differently.
EDIT: It's worse than I said. The later version won't detect "ok" at the end of the string, because it is off by one in the OPPOSITE direction.
Note the second function is off by one in the opposite direction-- it doesn't find "ok" in "ok".
strncpy, despite what its name implies, does not necessarily produce strings.
Regardless, a C string is defined to have a null terminator. All bets are off if you don't provide a string.
Even with a null terminator, there isn't a length that would work for both without overrunning the buffer.
printf("%d %d\n", search_ok_1("mok", 2), search_ok_2("mok", 2));
printf("%d %d\n", search_ok_1("mok", 3), search_ok_2("mok", 3)); /* Overruns without null term */
printf("%d %d\n", search_ok_1("mok", 4), search_ok_2("mok", 4)); /* Overruns even WITH null term */
No, there's one memory access on average even if your compiler is maximally dumb (short-circuit AND is something C compilers must implement to be conformant). And a smarter compiler will make this one memory access per byte anyway. Not to mention that on any modern architecture, even if your compiler doesn't squash the second memory access, the microarchitecture will.
Also all the semicolons after his brackets bug me. Not to mention that the thing looks past the end of the string. (We have a length specified, so I assume it's not null terminated... indeed, we can get a match past the end).
vs. what I said:
> > No, there's one memory access on *average*
Even if it didn’t say that, I think we can figure out from context that the article is talking about worst-case memory accesses.
"At worst" calculation is just calculating the worst case (string not found) where it has to scan the whole string and... access each byte twice.
It doesn't matter whether we have a worst case or not. They claim two memory accesses per character.
Yes, a worst case of two memory accesses per character. This is obvious, no?
When people analyze algorithm runtime it is not always explicitly stated whether the author is doing worst-case or average-case analysis. So, as a reader, you sometimes have to figure it out.
The article says “len*2 at worst”. There is no intervening text, that’s a direct quote.
> It heavily implies that there's 2 accesses per byte always.
This interpretation is incorrect. The article does not imply this.
It says this, DIRECTLY, not qualifying it as "up to" or "worst case". The next sentence says it can be 2*len, e.g. if you scan the entire length.
Those two sentences feel like something that someone would write if they didn't understand short-circuit evaluation.
You read it differently, but your opinion of how it is intended isn't sufficient to refute my interpretation of the text.
My experience with writing answers on Stack Overflow is that people in the comments will always figure out a way to interpret an answer as being incorrect, and latch onto it.
There is a tradeoff between clarity and pedantry. You don't want to simplify something so much that it is no longer insightful, and you don't want to write something with irrelevant detail that obscures the point you are trying to make.
I understand the concern over excessive pedantry.
If you wish to report errors to the author, there is an email address at the bottom:
> Please drop me email about bug(s) and/or suggestion(s):
It has been a while, but I remember KMP calculating a prefix table of some sort beforehand, to avoid re-checking characters on partial matches that don't pan out?
Long ago, I wrote a text search that used the 8088 XLAT instruction in a very short loop; the carry flag got set when there was a hit.
On modern machines, it would all end up in cache quickly, the branch would be well predicted, and it would just zip through text.
It handled upper/lower case (you just set the bits corresponding to both upper and lower case of the letter)... but it couldn't do regex.
I mean, it may not matter much, but it is more efficient. A more impactful optimization is knowing how far to skip in the string being searched based on which state you're in and what character appears.