
Could you explain it? To me this falls into a long list of bad BM explanations.

When I had to implement it for an assignment I thought I was going mad. E.g. everybody mentions that the good suffix rule can be computed in O(n), but nobody explained how, or they even sold their O(n²) as the real deal. Then they talk about borders and such... I had to come up with my own thing based on the Z-algorithm... and then realized that that's what the others were doing as well, they just can't fucking spell it out. And yeah, if they had drawn that connection they would have had the O(n) construction at hand too. Btw. it matters for real-life performance.
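
For what it's worth, here is the Z-route spelled out, since that is what finally made it click for me. A minimal sketch in Rust (Rust only because it comes up later in the thread); z_function, suffix_lengths and good_suffix_table are my own names, and this is the textbook suff/good-suffix construction read off the reversed pattern's Z-values, so treat it as an illustration rather than anybody's reference implementation:

  // Z-function: z[i] = length of the longest common prefix of s and s[i..].
  // Standard O(n) two-pointer construction.
  fn z_function(s: &[u8]) -> Vec<usize> {
      let n = s.len();
      let mut z = vec![0usize; n];
      if n == 0 {
          return z;
      }
      z[0] = n;
      let (mut l, mut r) = (0usize, 0usize);
      for i in 1..n {
          if i < r {
              z[i] = (r - i).min(z[i - l]);
          }
          while i + z[i] < n && s[z[i]] == s[i + z[i]] {
              z[i] += 1;
          }
          if i + z[i] > r {
              l = i;
              r = i + z[i];
          }
      }
      z
  }

  // suff[i] = length of the longest common suffix of pat[..=i] and pat,
  // read straight off the Z-function of the reversed pattern. O(m).
  fn suffix_lengths(pat: &[u8]) -> Vec<usize> {
      let m = pat.len();
      let rev: Vec<u8> = pat.iter().rev().copied().collect();
      let z = z_function(&rev);
      (0..m).map(|i| z[m - 1 - i]).collect()
  }

  // gs[i] = shift to apply when pat[i+1..] matched the text but pat[i]
  // mismatched. Both passes are O(m), so the whole table is O(m).
  // e.g. good_suffix_table(b"abcab") == [3, 3, 3, 5, 1]
  fn good_suffix_table(pat: &[u8]) -> Vec<usize> {
      let m = pat.len();
      let suff = suffix_lengths(pat);
      let mut gs = vec![m; m];

      // Case: only a prefix of the pattern can be realigned with the
      // matched suffix (suff[i] == i + 1 means pat[..=i] is such a prefix).
      let mut j = 0;
      for i in (0..m).rev() {
          if suff[i] == i + 1 {
              while j < m - 1 - i {
                  if gs[j] == m {
                      gs[j] = m - 1 - i;
                  }
                  j += 1;
              }
          }
      }
      // Case: the matched suffix reoccurs inside the pattern, ending at i.
      for i in 0..m.saturating_sub(1) {
          gs[m - 1 - suff[i]] = m - 1 - i;
      }
      gs
  }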




Start with

  ADCBCADCBA  <- haystack
  CBA         <- needle
If the needle doesn't match, take the next character in haystack, i.e. the one just past the needle (here B), and shift the needle over so it lines up with the last B in needle. Repeat until done.

So here we shift by 2 so the Bs line up.

  ADCBCADCBA
    CBA
No match. Shift by 1 so the As line up.

  ADCBCADCBA
     CBA
No match. The next character is D. Since there are no Ds in needle to line up, shift needle past the D.

  ADCBCADCBA
         CBA
Match!

The main point is that the amount to shift depends only on the next character, so it can be precomputed and stored in an array (qsBc). You always shift by 1 to line up an A, 2 to line up a B, and 3 to line up a C. Anything else you shift by 4.
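
To make that concrete, here is a rough sketch of the table and the scan loop as I read the description above (a minimal Rust sketch; qs_bc and quick_search are my own names, and this is the usual Sunday-style quick-search bad-character rule, not code from the article):

  // qsBc table: for the character just past the current window, how far to
  // shift so its rightmost occurrence in the needle lines up, or m + 1 to
  // jump the whole window past it if it does not occur at all.
  fn qs_bc(needle: &[u8]) -> [usize; 256] {
      let m = needle.len();
      let mut table = [m + 1; 256]; // default: character not in needle
      for (i, &c) in needle.iter().enumerate() {
          table[c as usize] = m - i; // rightmost occurrence wins
      }
      table
  }

  fn quick_search(haystack: &[u8], needle: &[u8]) -> Option<usize> {
      let (n, m) = (haystack.len(), needle.len());
      if m == 0 {
          return Some(0);
      }
      if m > n {
          return None;
      }
      let table = qs_bc(needle);
      let mut pos = 0;
      while pos + m <= n {
          if &haystack[pos..pos + m] == needle {
              return Some(pos);
          }
          if pos + m == n {
              break; // no "next character" left to look at
          }
          pos += table[haystack[pos + m] as usize];
      }
      None
  }

  fn main() {
      // The walkthrough above: shifts of 2 (B), 1 (A), 4 (D), match at index 7.
      assert_eq!(quick_search(b"ADCBCADCBA", b"CBA"), Some(7));
  }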


I am not sure I follow either.

i: Index of comparison, running 0..n

p: Index in pattern P, running m..0

t: Index in text T, running 0..n, t=i at cycle beginning.

Is the trick not to look up the character mismatching at T[i], but the one at T[t+1], ahead of the pattern-text alignment?

I think that's the Sunday (?) variant or something. It's quite neat because you can do the simple BC preprocessing in O(m) with O(1) access and still lose nothing over the extended BC, if the text is random (at least I can then prove it via the expected value of character occurrences). So you simplify the BC preprocessing and access times, and gain one extra character of possible shift distance.

However, I am not sure how this plays out theoretically on repetitive texts. And e.g. DNA is not random, although at our level of understanding it probably is in practice.

In my implementation, cutting GS didn't improve performance. Though I wrote it in Rust, and the compiler really weirdly optimized for cache misses in an unpredictable manner; it mattered way too much where I put functionally equivalent code (do A, then do B wasn't the same as do B, then do A, even for the most trivial and independent things).



