The prime here is needlessly small, I believe-- which will inflate the false positive rate. It's using 64-bit registers and 64-bit multiplies, so the prime could be 0xfffffffffffffb. For 32-bit registers and multiplies the prime could be 0xfffffd.
I'm guessing it's using an overly small value because it's based on java code and java is a little brain-damaged about unsigned numbers. :)
You could get a better use of your comparison register by using the syndrome of BCH code over GF(256)-- e.g. the full 64-bits for a 64-bit register instead of 56, though given how fast multipliers are these days using a prime modulus is probably fastest (in hardware or maybe in SIMD a characteristic-2 BCH code would almost certainly be faster and better!).
The code here also has the issues that permutations of the characters of the needle will match-- this may fairly bad performance for some plausible inputs. A BCH code would be less bad in that respect.
With the right choice of BCH code, you could be guaranteed that any false positive would differ in at least X characters out of Y, if Y is equal to or less than your window size, that may reduce the amount of comparisons you do for false positives (mostly interesting for very small hashes, I suppose).
> though given how fast multipliers are these days using a prime modulus
They are really fast on a single thread, but at least on Xeons I use the mul and modulo seems to have a weak spot with thread parallelism though.
To fix a set of threading scaling issue with LIKE "%pattern%" SQL queries (something like TPC-H Query13), I spent a week digging through the algorithms which work better when all the threads are doing the same thing.
I narrowed down on BoyerMooreHorspool[1] as the algorithm which doesn't seem to be hit by other threads running the same algorithm (of course, everything utf-8 matching turns unicode char patterns into exact byte lookups).
If the modulo is a compile time constant it should get converted into a multiply. If its not, for whatever reason the performance will be less than amazing.
This is a great blog explaining various ways to force collisions if you're not careful about how you pick your base and mod: https://codeforces.com/blog/entry/60442
Fair enough, but the strlen is not part of the search algorithm, it’s just figuring out the input. It could easily be an argument, but that would complicate the test code below.
If removing it would complicate the code, then it is part of the algorithm.
String APIs are critical to the discussion of algorithm costs, e.g. removing the final character from a string is O(n) if they are null terminated, or O(1) if the string stores its size. Assuming the best of both worlds by ignoring strlen costs is hand-waving away a lot of the real world performance.
The fact that C strings are presented in this format is totally irrelevant for the discussion of string search algorithms. They all have to know the length. Strings in most languages are not in that format, and it is an entirely reasonable assumption that the length is a known quantity.
If you wanna be annoying about this, just imagine he's implementing `strnstr` instead of `strstr`.
To me it looks like you could quite easily convert this strstr function to look for 0 bytes. You don't even need to know the length up front. It's just used to make some of the loop bounds more clear.
I did once and was promptly told the interview was over and they had security escort me out of the building. Apparently they really do not like smoking or smoking paraphernalia. Dunno if they ever found those two guys strings.
I'm guessing it's using an overly small value because it's based on java code and java is a little brain-damaged about unsigned numbers. :)
You could get a better use of your comparison register by using the syndrome of BCH code over GF(256)-- e.g. the full 64-bits for a 64-bit register instead of 56, though given how fast multipliers are these days using a prime modulus is probably fastest (in hardware or maybe in SIMD a characteristic-2 BCH code would almost certainly be faster and better!).
The code here also has the issues that permutations of the characters of the needle will match-- this may fairly bad performance for some plausible inputs. A BCH code would be less bad in that respect.
With the right choice of BCH code, you could be guaranteed that any false positive would differ in at least X characters out of Y, if Y is equal to or less than your window size, that may reduce the amount of comparisons you do for false positives (mostly interesting for very small hashes, I suppose).