A much faster solution would be the same approach rsync takes with its "rolling checksum." To calculate the weak checksum of each block as you roll along the length of the file, you only have to do a calculation involving the bytes rolling out and in. http://samba.anu.edu.au/rsync/tech_report/node3.html
Then, for only the blocks that match this weak checksum, do memcmp or a strong digest check.
Absolutely. The right way to do it is to use the memmem library call.
A much faster solution would be the same approach rsync takes with its "rolling checksum."
That's one way to improve performance, yes. (But it far predates rsync -- I think it might even predate Tridge.)
In many cases, a "look and skip" algorithm like KMP or BM will outperform a rolling checksum search; but of course if you've got 328MB of data on disk, nothing will make the search faster than the time it takes to read the data into RAM.
I know this function is attributed to Bernstein; it's called the Bernstein hash. You're describing the Rabin-Karp string matching algorithm.
EDIT: A quick test shows that adding a comparison of the first byte reduces execution time here from 2.6 seconds to 0.6 seconds.
(this is in reply to tptacek, I can't reply to his post?)
Let us consider five test cases:
memcmp0 - Calling memcmp() at each byte in the file looking for a match, as in the original article.
memcmp1 - As above, but optimized to check the first character and only call memcmp() if it matches.
memmem0 - A single call to the glibc memmem() function. Not portable.
kmp0 - The Knuth-Morris-Pratt algorithm as implemented at "Exact String Matching Algorithms", http://www-igm.univ-mlv.fr/~lecroq/string/index.html
bm0 - The Boyer-Moore algorithm, also from the site above
Looking for a 1k needle which lies at the end of a 100MB haystack of random bytes, with the data in file system buffers, on a 1.6GHz Atom processor running Linux 2.6.29 (times in milliseconds):
memcmp0 2400 ==================================
memcmp1 530 ========
memmem0 460 =======
kmp0 970 ==============
bm0 70 =
If we assume the data is out on a hard drive, then we get these numbers (same machine, a WD Green Power HD, which is an evil thing to benchmark with because it has variable spindle and seek speeds; suffice it to say it is not terribly fast). All OS caches flushed before each test.
memcmp0 3500 ===================================
memcmp1 1300 =============
memmem0 1200 ============
kmp0 1700 =================
bm0 1100 ===========
Testing on a beefier machine, with a Core 2 Duo 2.8GHz, OS X gives this:
memcmp0 1100 =========================
memcmp1 180 ====
memmem0 not portable
kmp0 360 ========
bm0 88 ==
Update: well, crap. This fell off the front page before I finished testing. No one will see these results, but I enjoyed running the tests anyway. It was useful for updating my rules of thumb and preconceived notions.