Hacker News new | past | comments | ask | show | jobs | submit login

Start with

  ADCBCADCBA  <- haystack
  CBA         <- needle
If the needle doesn't match, shift the needle over so the next character in haystack, B, lines up with the last B in needle. Repeat until done.

So here we shift by 2 so the Bs line up.

No match. Shift by 1 so the As line up.

No match. The next character is D. Since there are no Ds in needle to line up, shift needle past the D.


The main point is the amount to shift depends only on the next character so it can be precomputed and stored in an array (qsBc). You always shift by 1 to line up an A, 2 to line up a B, and 3 to line up a C. Anything else you shift by 4.

I am not sure I follow either.

i: Index of comparison, running 0..n

p: Index in pattern P, running m..0

t: Index in text T, running 0..n, t=i at cycle beginning.

Is the trick to not seek the character mismatching at T[i], but T[t+1], in front of the pattern-text-alignment?

I think that's the Sundance (?) variance or something. It's quite neat because you can do the simple BC preprocessing in O(m) and access in O(1) and still lose nothing over the extended BC, if the text is random (at least I can then prove it via expected value for character occurrences). So you simplify BC pre-processing, access times and gain one char possible shift distance.

However I am not sure how this goes with repetitive texts theoretically. And e.g. DNA is not random, although at our level of understanding it probably is practically.

In my implementation cutting GS didn't improve performance. Tho, I wrote it in Rust and then the compiler really weirdly optimize for cache misses in a not predictable manner; it mattered way too much where I put functionally equivalent code (do A, then so B wasn't the same as do B, then do A, for the most trivial and independent things).

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact