Hacker News new | past | comments | ask | show | jobs | submit login
Rolling hash; Rabin-Karp string search (yurichev.com)
131 points by kulikov009 on Aug 24, 2021 | hide | past | favorite | 20 comments



The prime here is needlessly small, I believe-- which will inflate the false positive rate. It's using 64-bit registers and 64-bit multiplies, so the prime could be 0xfffffffffffffb. For 32-bit registers and multiplies the prime could be 0xfffffd.

I'm guessing it's using an overly small value because it's based on java code and java is a little brain-damaged about unsigned numbers. :)

You could get a better use of your comparison register by using the syndrome of BCH code over GF(256)-- e.g. the full 64-bits for a 64-bit register instead of 56, though given how fast multipliers are these days using a prime modulus is probably fastest (in hardware or maybe in SIMD a characteristic-2 BCH code would almost certainly be faster and better!).

The code here also has the issues that permutations of the characters of the needle will match-- this may fairly bad performance for some plausible inputs. A BCH code would be less bad in that respect.

With the right choice of BCH code, you could be guaranteed that any false positive would differ in at least X characters out of Y, if Y is equal to or less than your window size, that may reduce the amount of comparisons you do for false positives (mostly interesting for very small hashes, I suppose).


> though given how fast multipliers are these days using a prime modulus

They are really fast on a single thread, but at least on Xeons I use the mul and modulo seems to have a weak spot with thread parallelism though.

To fix a set of threading scaling issue with LIKE "%pattern%" SQL queries (something like TPC-H Query13), I spent a week digging through the algorithms which work better when all the threads are doing the same thing.

I narrowed down on BoyerMooreHorspool[1] as the algorithm which doesn't seem to be hit by other threads running the same algorithm (of course, everything utf-8 matching turns unicode char patterns into exact byte lookups).

[1] - https://github.com/apache/hive/blob/master/storage-api/src/j...


> and modulo

If the modulo is a compile time constant it should get converted into a multiply. If its not, for whatever reason the performance will be less than amazing.


There is a trick to turn a modulo info a multiplication even if it's not known at compile time:

https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-...


The blog post you linked is about fair range mapping.

I don't see how that applies to the problem of computing a modulo while avoiding expensive cpu instructions.


Yeah, that is inapplicable.

This is the relevant reference:

Arch D. Robison, "{N}-bit Unsigned Division via {N}-bit Multiply-Add" (2005)

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.51...


> mul and modulo seems to have a weak spot with thread parallelism

is that only with hyperthreading enabled or mul/modulo affect other unrelated cores as well?


Implementation-wise, if you multiply two 64 bit numbers you need to hold the result in a __int128 before taking the mod which makes it less portable.


The larger prime I gave is 56 bits for that reason. It will not overflow 64 bits. (-- unless I've messed something up, of course!)


Unsigned numbers are usable in Java, they just require a bit more care.


This is a great blog explaining various ways to force collisions if you're not careful about how you pick your base and mod: https://codeforces.com/blog/entry/60442


I'll admit I found it very amusing to read

> Only one byte is loaded for each character of string.

right after he calls strlen on the input on the first line of his function :P


Fair enough, but the strlen is not part of the search algorithm, it’s just figuring out the input. It could easily be an argument, but that would complicate the test code below.


If removing it would complicate the code, then it is part of the algorithm.

String APIs are critical to the discussion of algorithm costs, e.g. removing the final character from a string is O(n) if they are null terminated, or O(1) if the string stores its size. Assuming the best of both worlds by ignoring strlen costs is hand-waving away a lot of the real world performance.


The fact that C strings are presented in this format is totally irrelevant for the discussion of string search algorithms. They all have to know the length. Strings in most languages are not in that format, and it is an entirely reasonable assumption that the length is a known quantity.

If you wanna be annoying about this, just imagine he's implementing `strnstr` instead of `strstr`.


To me it looks like you could quite easily convert this strstr function to look for 0 bytes. You don't even need to know the length up front. It's just used to make some of the loop bounds more clear.


Now implement this in an interview.


I did once and was promptly told the interview was over and they had security escort me out of the building. Apparently they really do not like smoking or smoking paraphernalia. Dunno if they ever found those two guys strings.


I have done this? I'm not particularly great at interviewing either.


Without preparing for it?




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: