Boyer-Moore string search algorithm explanation and formal verification

37ef_ced3 · on April 23, 2021

Here is a twist on Boyer-Moore that is obviously correct, much simpler, and much faster in practice:

https://www-igm.univ-mlv.fr/~lecroq/string/node19.html

It uses the FOLLOWING character to do big jumps, after failing a vectorized forward compare (memcmp)

This is the good shit, my friends

userbinator · on April 23, 2021

Isn't that just BMH? It's a good choice in practice and no doubt many others have reinvented it or some small variant: https://old.blog.phusion.nl/2010/12/06/efficient-substring-s...

lofi_lory · on April 23, 2021

Could you explain it? To me this falls into a long list of bad BM explanations.

When I had to implement it for an assignment I thought I was going mad. E.g. everybody mentions the good suffix rule can be created in O(n), but no one explained how to, or even sold their O(n²) as the real deal. Then they talk about borders and such... I had to come up with my own thing based on the Z-algorithm... and then realized that what the others were doing as well, but can't fucking spell. And yeah, if they drawn that connection they would have the O(n) connection at hand too. Btw. it matters for real life performance.

edflsafoiewq · on April 23, 2021

Start with

  ADCBCADCBA  <- haystack
  CBA         <- needle

If the needle doesn't match, shift the needle over so the next character in haystack, B, lines up with the last B in needle. Repeat until done.

So here we shift by 2 so the Bs line up.

  ADCBCADCBA
    CBA

No match. Shift by 1 so the As line up.

  ADCBCADCBA
     CBA

No match. The next character is D. Since there are no Ds in needle to line up, shift needle past the D.

  ADCBCADCBA
         CBA

Match!

The main point is the amount to shift depends only on the next character so it can be precomputed and stored in an array (qsBc). You always shift by 1 to line up an A, 2 to line up a B, and 3 to line up a C. Anything else you shift by 4.

lofi_lory · on April 23, 2021

I am not sure I follow either.

i: Index of comparison, running 0..n

p: Index in pattern P, running m..0

t: Index in text T, running 0..n, t=i at cycle beginning.

Is the trick to not seek the character mismatching at T[i], but T[t+1], in front of the pattern-text-alignment?

I think that's the Sundance (?) variance or something. It's quite neat because you can do the simple BC preprocessing in O(m) and access in O(1) and still lose nothing over the extended BC, if the text is random (at least I can then prove it via expected value for character occurrences). So you simplify BC pre-processing, access times and gain one char possible shift distance.

However I am not sure how this goes with repetitive texts theoretically. And e.g. DNA is not random, although at our level of understanding it probably is practically.

In my implementation cutting GS didn't improve performance. Tho, I wrote it in Rust and then the compiler really weirdly optimize for cache misses in a not predictable manner; it mattered way too much where I put functionally equivalent code (do A, then so B wasn't the same as do B, then do A, for the most trivial and independent things).

ganafagol · on April 23, 2021

This article does not actually get anywhere with the "formal verification" part. The authors dumps it into a bounded model checker, observes that it does not work and then does some crude manual bounded loop unrolling. Does this increase confidence in the algorithm? Not sure.

rurban · on April 24, 2021

Then you missed the point that CBMC is a formal verification tool (the best practical one btw), which also does bounded model checking. But it has of course its limitations, esp. with unbounded models, such as linked lists or recursion.

Here he explains how to overcome its limitations in one case. A likely bug in the loop unroller. With --unwind N you can tell CBMC how far to unroll automatically, and --depth is for recursion. In this case he just unrolled it manually. Note: should check if it works out of the box with a current version. CBMC is getting better and better lately.

I often use satabs then, which doesn't struggle with unbounded models, but has other limitations.

faeyanpiraat · on April 23, 2021

Where is this knowledge useful? Will I ever benefit from going into this rabbit hole as a web developer?