Knuth-Morris-Pratt string-searching algorithm: DFA-less version (yurichev.com)
109 points by ingve on Jan 22, 2021 | 38 comments



That was a fun read; I liked the use of CBMC to validate the algorithm.

For those who are interested, there's a good tool to specifically test string searching algorithms here:

https://github.com/smart-tool/smart

There are so many string-searching algorithms now, with different best and worst cases. Some work better on small alphabets (e.g. DNA), some are better for text or high-entropy data, some take advantage of CPU instructions, some are generic. The real challenge is picking the right algorithm.

I've implemented a few of them in Java here, and extended them to support multi-byte matching at any position:

https://github.com/nishihatapalmer/byteseek


For Part I, though I get the point of demonstrating proofs with CBMC, this is a pretty contrived example. E.g. using state machines for the cocos search makes even a very lazily written online implementation human-verifiable at a glance. I would like to see an example that isn't so contrived.

Edit: Now I see that in parts II/III the title actually comes to fruition, and a state-machine-oriented generator is made and used. That still doesn't satisfy me about part I; honestly, I would have just skipped it in retrospect.


Note that the use of CBMC here was not thorough enough to spot that his reference implementation reads past the end of the string.


I don't think it does, assuming the passed len is correct.


        for (unsigned i=0; i<len; i++)
        {
                if (s[i]=='o' && s[i+1]=='k')
                        return i; // found
        };

On the last iteration of this loop, it accesses s[len-1] and s[len]. If the string is null-terminated, that's OK. If it's really just a length-qualified string, that's not OK.

The later implementations treat len differently.

EDIT: It's worse than I said. The later version won't detect "ok" at the end of the string, because it is off by one in the OPPOSITE direction.


A string in C is, by definition, null terminated.


Oh I have some bad news for you, buddy.


Then why are we providing a length? Plenty of string-handling functions in the C standard library produce no terminating null when the maximum length is reached -- e.g. strncpy.
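
For example (a minimal illustration, not from the article):

    char dst[2];
    strncpy(dst, "ok", sizeof dst);  /* dst holds {'o','k'}: NO terminator */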

Note the second function is off by one in the opposite direction-- it doesn't find "ok" in "ok".


Because the writer of the function doesn't want to bother checking for the null terminator. I'm not saying it's a good function; it just doesn't have an issue as long as the input is a valid string with a correct length.

strncpy, despite what its name implies, does not necessarily produce strings.

Regardless, a C string is defined to have a null terminator. All bets are off if you don't provide a string.


The issue is that the two functions we're comparing for correctness treat lengths very differently: one scans len-1 bytes, the other len+1 bytes.

Even with a null terminator, there isn't a length that would work for both without overrunning the buffer.

        printf("%d %d\n", search_ok_1("mok", 2), search_ok_2("mok", 2));
        printf("%d %d\n", search_ok_1("mok", 3), search_ok_2("mok", 3));   /* Overruns without null term */
        printf("%d %d\n", search_ok_1("mok", 4), search_ok_2("mok", 4));   /* Overruns even WITH null term */
Prints out:

    1 2
    1 3
    1 1
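
(For anyone who wants to reproduce this, here's a hypothetical reconstruction of the two versions, inferred from the outputs above rather than copied from the article; in particular the return-len-on-no-match convention is a guess.)

    /* Version 1: checks s[i] and s[i+1] for every i < len, so it can
       read one byte PAST the given length (hypothetical reconstruction). */
    int search_ok_1(const char *s, unsigned len)
    {
            for (unsigned i = 0; i < len; i++)
                    if (s[i] == 'o' && s[i+1] == 'k')
                            return i;   /* found */
            return len;                 /* not found (guessed convention) */
    }

    /* Version 2: stops early in the other direction, so a match ending
       at s[len-1] is missed (hypothetical reconstruction). */
    int search_ok_2(const char *s, unsigned len)
    {
            for (unsigned i = 0; i + 2 < len; i++)
                    if (s[i] == 'o' && s[i+1] == 'k')
                            return i;   /* found */
            return len;                 /* not found (guessed convention) */
    }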


That's why strncpy is deprecated in favor of strlcpy.


"You see, there are two memory accesses per one character of the input string."

No, there's one memory access on average even if your compiler is maximally dumb (short-circuit AND is something C compilers need to know how to do to be conformant). And in a smarter compiler this will be one memory access per byte anyway. Not to mention that on any modern architecture, even if your compiler doesn't squash the second memory access, the microarchitecture will.

Also, all the semicolons after his brackets bug me. Not to mention that the thing looks past the end of the string. (We have a length specified, so I assume it's not null-terminated... indeed, we can get a match past the end.)


    search_ok("ooooooooooooooo", 15)
30 memory accesses. length times two. The article is correct.
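
A quick instrumented sketch (counting reads through a helper; what the compiler actually emits may differ) shows both extremes:

    #include <stdio.h>

    static unsigned accesses;  /* counts reads of the input string */

    static char get(const char *s, unsigned i) { accesses++; return s[i]; }

    static int search_ok(const char *s, unsigned len)
    {
            for (unsigned i = 0; i < len; i++)
                    if (get(s, i) == 'o' && get(s, i+1) == 'k')
                            return i;
            return len;
    }

    int main(void)
    {
            accesses = 0;
            search_ok("ooooooooooooooo", 15);  /* every byte is 'o' */
            printf("%u\n", accesses);          /* 30: two reads per byte */

            accesses = 0;
            search_ok("xxxxxxxxxxxxxxx", 15);  /* short circuit skips s[i+1] */
            printf("%u\n", accesses);          /* 15: one read per byte */
            return 0;
    }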


> 30 memory accesses. length times two. The article is correct.

vs. what I said:

> > No, there's one memory access on *average*


Ah, you must have missed the part where the article says “at worst”.

Even if it didn’t say that, I think we can figure out from context that the article is talking about worst-case memory accesses.


No, it clearly says two memory accesses per character.

"At worst" calculation is just calculating the worst case (string not found) where it has to scan the whole string and... access each byte twice.

It doesn't matter whether we have a worst case or not. They claim two memory accesses per character.


> It doesn't matter whether we have a worst case or not. They claim two memory accesses per character.

Yes, a worst case of two memory accesses per character. This is obvious, no?

When people analyze algorithm runtime it is not always explicitly stated whether the author is doing worst-case or average-case analysis. So, as a reader, you sometimes have to figure it out.


It says "at worst" later, talking about the case where there is no match. It heavily implies that there are 2 accesses per byte always.


> It says at worst later, talking about if there is no match.

The article says “len*2 at worst”. There is no intervening text, that’s a direct quote.

> It heavily implies that there's 2 accesses per byte always.

This interpretation is incorrect. The article does not imply this.


"You see, there are two memory accesses per one character of the input string."

It says this, DIRECTLY, not qualifying it as "up to" or "worst case". The next sentence says it can be 2*len, e.g. if you scan the entire length.

Those two sentences feel like something that someone would write if they didn't understand short-circuit evaluation.

You read it differently, but your opinion of how it is intended isn't sufficient to refute my interpretation of the text.


It seems much more plausible to me that the author does understand short-circuit evaluation, but did not feel like explaining it in such detail to satisfy the tedious pedantry of HN comments.

My experience with writing answers on Stack Overflow is that people in the comments will always figure out a way to interpret an answer as being incorrect, and latch onto it.

There is a tradeoff between clarity and pedantry. You don't want to simplify something to the point where it is no longer insightful, and you don't want to write something with irrelevant detail that obscures the point you are trying to make.


I really don't think the author is fluent in C. They consistently put semicolons after brackets, they have off-by-one errors in array indexing, etc. (the first search_ok looks past the end of the string; the second search_ok won't see "ok" in the string "ok"). That's part of why I read it that way.

I understand the concern over excessive pedantry.


My experience is that C programmers with many years of experience will still make off-by-one errors. Most of the examples have the correct bounds. The semicolons don’t belong but… so what?

If you wish to report errors to the author, there is an email address at the bottom:

> Please drop me email about bug(s) and/or suggestion(s):


"worst" versus "average". He probably could've phrased the sentence better, but his conclusion isn't wrong. If the string is all 'o' characters we'd hit this worst case memory access.


Isn't the "homebrew" result at the end exactly the state machine from the DFA?

It has been a while, but I remember KMP calculating a prefix table of some sort beforehand, to skip the repeated checking on yet-unclear partial matches?


Yes. The author discusses the relation between the homebrew solution and a DFA solution in part 2, and more about partial matches in part 3.


Yeah, back in university I really only learned the version that is presented in part 3. I feel like the prefix table is just a more succinct representation of the DFA, but I do find it interesting to see one way that approach could've been arrived at.
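
For concreteness, the textbook construction of that table looks something like this (the standard KMP failure function, not the article's code):

    #include <string.h>

    /* fail[i] = length of the longest proper prefix of pat[0..i] that is
       also a suffix of it. On a mismatch after matching i characters, the
       search resumes at fail[i-1] instead of restarting from zero. */
    void kmp_table(const char *pat, size_t *fail)
    {
            size_t m = strlen(pat);
            size_t k = 0;                   /* current prefix-suffix length */
            fail[0] = 0;
            for (size_t i = 1; i < m; i++) {
                    while (k > 0 && pat[i] != pat[k])
                            k = fail[k-1];  /* fall back to a shorter prefix */
                    if (pat[i] == pat[k])
                            k++;
                    fail[i] = k;
            }
    }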


Thank you for the write-up. I love the way you use CBMC to construct your algorithm; it seems like a way to synthesize code :) Thanks also for pointing out CBMC, I've been looking for something like that for years!



Wouldn't it be better to do less branching?

Long ago, I wrote a text search that used the 8088 XLAT instruction in a very short loop; the carry flag got set when there was a hit.

On modern machines, it would all end up in cache quickly, the branch would be well predicted, and it would just zip through text.

It handled upper/lower case (you just set the bits corresponding to both upper and lower case of the letter)... but it couldn't do regex.
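In C rather than 8088 assembly, the trick is roughly this: one 256-entry table, one lookup per input byte (a sketch, assuming c is an ASCII letter; XLAT is essentially this lookup in a single instruction):

    #include <stddef.h>

    int find_letter(const unsigned char *s, size_t len, unsigned char c)
    {
            unsigned char table[256] = {0};
            table[c] = 1;
            table[c ^ 0x20] = 1;            /* the other case of an ASCII letter */

            for (size_t i = 0; i < len; i++)
                    if (table[s[i]])
                            return (int)i;  /* hit */
            return -1;
    }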


Part 1 is a fairly naïve hand-rolled solution. Part 2 shows KMP generating a DFA, which does reduce branching (specifically in the search, where the branches in the first part become array lookups in the second part).
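Concretely, for the pattern "ok" the table-driven search looks something like this (a sketch in the spirit of part 2, not the article's code): each input byte costs one table lookup instead of a chain of comparisons.

    #include <stddef.h>

    int search_ok_dfa(const unsigned char *s, size_t len)
    {
            unsigned char next[2][256];     /* next[state][byte] */
            for (int b = 0; b < 256; b++) { /* in real code, built once per pattern */
                    next[0][b] = (b == 'o') ? 1 : 0;
                    next[1][b] = (b == 'k') ? 2 : (b == 'o') ? 1 : 0;
            }
            unsigned state = 0;             /* chars of "ok" matched so far */
            for (size_t i = 0; i < len; i++) {
                    state = next[state][s[i]];
                    if (state == 2)
                            return (int)(i - 1); /* match starts at i-1 */
            }
            return -1;
    }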



Is this really DFA-less? Isn't the look-back table you construct just encoding the DFA?


It is and it isn't. The DFA is not deliberately constructed, it's more a "happy accident". A renaming/reframing of the `seen` variable would make it obvious that what is constructed is a manually conceived DFA. As presented, `seen` is "how many characters have we seen of the target string?". Reframing it as "There exists a state for each character in the target, `seen` represents which of these states we are presently in" makes it clear that it is a DFA. But as constructed it's an accidental DFA, rather than a deliberate one.
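For the article's two-character example, the reframing looks something like this (my sketch, not the article's code):

    #include <stddef.h>

    /* search for "ok", with `seen` read as the DFA state:
       0 = nothing matched yet, 1 = just saw an 'o' */
    int search_ok_seen(const char *s, size_t n)
    {
            size_t seen = 0;
            for (size_t i = 0; i < n; i++) {
                    if (seen == 1 && s[i] == 'k')
                            return (int)(i - 1);    /* reached the accept state */
                    seen = (s[i] == 'o') ? 1 : 0;   /* transition on current byte */
            }
            return -1;
    }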


Does avoiding memory accesses for short strings really provide much performance improvement? Surely the adjacent characters, which need to be read at least once anyway, will be in cache. I can see the benefit when the searched-for string is long.


It makes the algorithm (if not using a series of unoptimized if-else-ifs) linear in the length of the string being searched. You go from O(n*k) to O(n).

I mean, it may not matter much, but it is more efficient. A more impactful optimization is knowing how far to skip in the string being searched based on which state you're in and what character appears.
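That skip idea is essentially Boyer-Moore; a minimal sketch of the Horspool variant (not from the article):

    #include <string.h>

    int horspool(const char *text, size_t n, const char *pat)
    {
            size_t m = strlen(pat);
            if (m == 0)
                    return 0;               /* empty pattern matches at 0 */

            size_t skip[256];
            for (int b = 0; b < 256; b++)
                    skip[b] = m;            /* byte absent from pattern */
            for (size_t j = 0; j + 1 < m; j++)
                    skip[(unsigned char)pat[j]] = m - 1 - j;

            /* look at the text byte under the END of the pattern and jump */
            for (size_t i = 0; i + m <= n; i += skip[(unsigned char)text[i + m - 1]])
                    if (memcmp(text + i, pat, m) == 0)
                            return (int)i;  /* match */
            return -1;
    }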


I took a look at part 1, and the use of CBMC was cool!


This sounds interesting. I wanted to read the documentation, but the link in the README file returns 404.





