
Exploring memcmp - jmtulloss
http://justin.harmonize.fm/index.php/2009/05/exploring-memcmp/
======
rarrrrrr
memcmp is not the way to do it.

A much faster solution would be the same approach rsync takes with its
"rolling checksum." To compute the weak checksum of each block as you roll
along the length of the file, you only have to do a constant-time update
involving the bytes rolling out and in.
<http://samba.anu.edu.au/rsync/tech_report/node3.html>

Then, for only the blocks that match this weak checksum, do memcmp or a strong
digest check.
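
A minimal sketch of that two-level idea in C, using a Rabin-Karp-style
polynomial rolling hash rather than rsync's actual Adler-style weak checksum
(the base and the function names here are illustrative):

    #include <stddef.h>
    #include <string.h>
    
    /* Rolling-hash search: the weak hash updates in O(1) per byte; memcmp()
     * only verifies positions whose weak hash matches the needle's. */
    static const unsigned char *roll_search(const unsigned char *hay, size_t hlen,
                                            const unsigned char *nee, size_t nlen)
    {
        const unsigned long B = 257;      /* hash base (illustrative) */
        unsigned long hn = 0, hh = 0, Bn = 1;
        size_t i;
    
        if (nlen == 0 || hlen < nlen) return NULL;
        for (i = 0; i < nlen; i++) {
            hn = hn * B + nee[i];         /* hash of the needle */
            hh = hh * B + hay[i];         /* hash of the first window */
            if (i) Bn *= B;               /* B^(nlen-1), used to roll out */
        }
        for (i = 0; ; i++) {
            if (hh == hn && memcmp(hay + i, nee, nlen) == 0)
                return hay + i;           /* weak match, confirmed strongly */
            if (i + nlen >= hlen) return NULL;
            /* roll: drop hay[i], take in hay[i+nlen]; wraps mod 2^bits */
            hh = (hh - hay[i] * Bn) * B + hay[i + nlen];
        }
    }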

~~~
cperciva
_memcmp is not the way to do it._

Absolutely. The right way to do it is to use the memmem library call.
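
A minimal usage sketch (memmem() is a GNU extension, which is why it needs
_GNU_SOURCE and gets flagged as non-portable in the benchmarks below):

    #define _GNU_SOURCE               /* memmem() is a GNU extension */
    #include <stdio.h>
    #include <string.h>
    
    int main(void)
    {
        const char hay[] = "a large buffer with a needle somewhere in it";
        const char nee[] = "needle";
        /* searches raw bytes, not NUL-terminated strings */
        void *hit = memmem(hay, sizeof hay - 1, nee, sizeof nee - 1);
        if (hit)
            printf("found at offset %td\n", (char *)hit - hay);
        else
            printf("not found\n");
        return 0;
    }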

 _A much faster solution would be the same approach rsync takes with its
"rolling checksum."_

That's one way to improve performance, yes. (But it far predates rsync -- I
think it might even predate Tridge.)

In many cases, a "look and skip" algorithm like KMP or BM will outperform a
rolling checksum search; but of course if you've got 328MB of data on disk,
nothing will make the search faster than the time it takes to read the data
into RAM.
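
To illustrate the "look and skip" idea, here is a minimal sketch of the
Horspool simplification of Boyer-Moore (illustrative names, not anyone's
production code): the last byte of each window often lets you skip ahead by
the needle's whole length.

    #include <stddef.h>
    #include <string.h>
    
    /* Boyer-Moore-Horspool: check a whole window at a time, then skip
     * ahead based on the window's last byte -- often by nlen bytes. */
    static const unsigned char *bmh_search(const unsigned char *hay, size_t hlen,
                                           const unsigned char *nee, size_t nlen)
    {
        size_t skip[256], i;
    
        if (nlen == 0 || hlen < nlen) return NULL;
        for (i = 0; i < 256; i++) skip[i] = nlen;  /* default: full skip */
        for (i = 0; i + 1 < nlen; i++)             /* all but the last byte */
            skip[nee[i]] = nlen - 1 - i;
    
        for (i = 0; i + nlen <= hlen; i += skip[hay[i + nlen - 1]])
            if (memcmp(hay + i, nee, nlen) == 0)
                return hay + i;
        return NULL;
    }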

~~~
brtzsnr
""" A much faster solution would be the same approach rsync takes with its
"rolling checksum." """

I know this function is attributed to Bernstein and is called the Bernstein
hash [0]. What you're describing is the Rabin-Karp string matching algorithm
[1].

[0] <http://fr.wikipedia.org/wiki/Table_de_hachage#Fonction_de_Hachage>

[1] <http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm>
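
For reference, a minimal sketch of that Bernstein hash (djb2). Note that as
written it is a running hash, not a rolling one; Rabin-Karp adds the trick of
subtracting the outgoing byte's contribution as the window slides:

    #include <stddef.h>
    
    /* djb2, the classic Bernstein hash: h = h * 33 + c, seeded with 5381. */
    unsigned long djb2(const unsigned char *s, size_t len)
    {
        unsigned long h = 5381;
        while (len--)
            h = h * 33 + *s++;
        return h;
    }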

------
tptacek
I don't have the time or energy to make a valuable comment about this, but my
hunch is that in the real world, cache effects dominate instruction selection
in how fast in-memory searching is.

~~~
jws
A 7% speedup by going with SIMD instructions. I think I'd have guarded the
memcmp() call with a check of the first byte and saved the call overhead on
99.5% of the bytes instead (insert assumptions about the data).

~~~
jmtulloss
memcmp returns as soon as it finds a byte that doesn't match.

~~~
pieter
The point is that calling a function is much more expensive than comparing a
single byte: you have to set up the call, jump to another location, check the
first byte, unwind the stack, and continue your program.
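
A minimal sketch of that guard (hypothetical hay/nee buffers; this is the
same trick as the memcmp1 case benchmarked below):

    #include <stddef.h>
    #include <string.h>
    
    /* An inline first-byte check means the memcmp() call -- and its stack
     * setup and teardown -- is only paid at candidate offsets. */
    static const unsigned char *guarded_search(const unsigned char *hay, size_t hlen,
                                               const unsigned char *nee, size_t nlen)
    {
        size_t i;
    
        if (nlen == 0 || hlen < nlen) return NULL;
        for (i = 0; i + nlen <= hlen; i++)
            if (hay[i] == nee[0] && memcmp(hay + i, nee, nlen) == 0)
                return hay + i;
        return NULL;
    }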

EDIT: A quick test shows adding comparison of the first byte reducing
execution time here from 2.6 seconds to 0.6 seconds.

(this is in reply to tptacek, I can't reply to his post?)

~~~
tptacek
You're right, but this is such a "meh" micro-optimization. Compared to the
cost of a real memcmp(), a simple function call is below the noise floor.

~~~
jws
I, and my data, disagree. Most memcmp() calls look at a single byte and
return.

Let us consider five test cases:

memcmp0 - Calling memcmp() at each byte in the file looking for a match, as in
the original article.

memcmp1 - As above, but optimize by checking the first character and only
calling if it matches.

memmem0 - A single call to the glibc memmem() function. Not portable.

kmp0 - The Knuth-Morris-Pratt algorithm as implemented at "Exact String
Matching Algorithms", <http://www-igm.univ-mlv.fr/~lecroq/string/index.html>

bm0 - The Boyer-Moore algorithm, also from the above site

Looking for a 1k needle which lies at the end of a 100MB haystack of random
bytes, with the data in file system buffers on a 1.6GHz Atom processor running
Linux 2.6.29:

      algorithm  milliseconds
        memcmp0      2400     ==================================
        memcmp1       530     ========
        memmem0       460     =======
        kmp0          970     ==============
        bm0            70     =

On the whole codesink wins with the Boyer-Moore suggestion. In broader terms,
algorithm selection beats optimization and all is right with the CS universe.
(KMP's poor showing surprises me, but I've not used it before and don't know
what to expect.)

If we assume the data is out on a hard drive, then we get these numbers (same
machine, with a WD Green Power HD, which is an evil thing to benchmark because
it has variable spindle and seek speeds; suffice it to say it is not terribly
fast). All OS caches were flushed before each test.

      algorithm  milliseconds
        memcmp0      3500     ===================================
        memcmp1      1300     =============
        memmem0      1200     ============
        kmp0         1700     =================
        bm0          1100     ===========

Here disk read time sets a lower bound; the algorithm is still important, but
not as much as before. I have a sneaking suspicion that mmap() is a poor
choice here: there doesn't seem to be much overlap between the I/O and the
compute when looking at the memcmp0 times.

Testing on a beefier machine, a 2.8GHz Core 2 Duo running OS X, gives this:

      algorithm  milliseconds
        memcmp0      1100     =========================
        memcmp1       180     ====
        memmem0       not portable
        kmp0          360     ========
        bm0            88     ==

The processor is several times faster, but bringing pages in from the OS is
slightly slower. I don't know how to flush OS buffers on this machine so I
can't do the disk reading tests.

Update: well, crap. This fell off the front page before I finished testing. No
one will see these results, but I enjoyed the tests anyway. It is useful to
update my rules of thumb and preconceived notions.

~~~
tptacek
I'm impressed with the benchmark, but it seems to me like using the right
algorithm for the job (in this case, Boyer-Moore or memmem) beats hacking
memcmp to do a job it shouldn't be employed to do.

------
codesink
The Boyer-Moore algorithm should have been used. It would probably outperform
any other implementation for this kind of problem.

