
Faster memory comparison in C - feelix
https://macosxfilerecovery.com/faster-memory-comparison-in-c/
======
jepler
The code is not exactly memcmp, because the sign of its result doesn't always
match the sign of memcmp's. I verified this with my favorite open source C
program prover, CBMC [http://www.cprover.org/cbmc/](http://www.cprover.org/cbmc/)

You can see that CBMC proves that (1) a zero result from mycmp guarantees a
zero result from memcmp, and (2) a nonzero result from mycmp guarantees a
nonzero result from memcmp. But (3) a negative result from mycmp doesn't
guarantee a negative result from memcmp, and (4) a positive result doesn't
guarantee a positive result.

So, modulo problems with "strict aliasing" in modern C standards, you _can_
use this code to compare blocks of memory for equality, but you _cannot_ use
it to order blocks according to a less-than or greater-than predicate.

[https://gist.github.com/jepler/760cfcd4b326d6d11241256a6a4c7...](https://gist.github.com/jepler/760cfcd4b326d6d11241256a6a4c7a48)
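
For the equality-only use described above, here is a minimal sketch (the name `mem_equal_words` is hypothetical; this is not the article's mycmp). It compares 4 bytes at a time and loads each word through memcpy, which sidesteps the strict-aliasing problem. The leftover 0-3 bytes are then compared one at a time:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the first n bytes of a and b are equal,
 * 0 otherwise. It never reports an ordering, so the byte order of the word
 * loads is irrelevant. */
static int mem_equal_words(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a;
    const unsigned char *q = b;

    /* Compare 4 bytes at a time while at least 4 bytes remain. memcpy into
     * a uint32_t avoids strict-aliasing and unaligned-access issues; compilers
     * typically turn it into a single load. */
    while (n >= sizeof(uint32_t)) {
        uint32_t x, y;
        memcpy(&x, p, sizeof x);
        memcpy(&y, q, sizeof y);
        if (x != y)
            return 0;
        p += sizeof x;
        q += sizeof y;
        n -= sizeof x;
    }

    /* Compare the remaining 0-3 bytes one at a time. */
    while (n > 0) {
        if (*p != *q)
            return 0;
        p++;
        q++;
        n--;
    }
    return 1;
}
```

Because it only answers "equal or not", it corresponds to properties (1) and (2) above and makes no claim about (3) or (4).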

~~~
feelix
Yes, that's true: memcmp's sign is determined by the first differing byte,
while this code compares four bytes at a time, so on little-endian
architectures it may return a different sign from the built-in memcmp(). That
should be mentioned along with the caveats: it's not a suitable drop-in
replacement unless you are simply testing whether the result is 0 (which I
imagine covers the majority of uses).
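
To see the sign problem concretely, take the bytes {0x01, 0x00, 0x00, 0x00} and {0x00, 0x01, 0x00, 0x00}. memcmp() is positive because the first byte is larger. Loaded as little-endian 32-bit integers, though, they are 1 and 256, so a plain integer comparison says "less". One way to keep memcmp's sign is to use the word loads only to skip equal regions and then locate the first differing byte. The following is a sketch; `mem_compare_ordered` is a hypothetical name:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word-at-a-time comparison whose sign agrees with memcmp. */
static int mem_compare_ordered(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a;
    const unsigned char *q = b;

    /* Skip over equal 4-byte words. */
    while (n >= sizeof(uint32_t)) {
        uint32_t x, y;
        memcpy(&x, p, sizeof x);
        memcpy(&y, q, sizeof y);
        if (x != y)
            break;          /* the first difference is inside this word */
        p += sizeof x;
        q += sizeof y;
        n -= sizeof x;
    }

    /* Byte scan over the differing word (or the 0-3 byte tail). As in memcmp,
     * the first differing byte, compared as unsigned char, decides the sign. */
    for (size_t i = 0; i < n; i++) {
        if (p[i] != q[i])
            return p[i] < q[i] ? -1 : 1;
    }
    return 0;
}
```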

------
wahern
All of the difference comes from 1) inlining and 2) avoiding the alignment
test. Once you adjust for that it's actually slower.

Regardless of alignment penalties, memcmp always needs to test for alignment
so that it doesn't read past the end of a page and cause a segfault. This hack
isn't appropriate for general-purpose use, only where you know for sure an
unaligned read won't overflow a page boundary.
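
An alignment test of this kind can be as simple as masking the low address bits. The sketch below is hypothetical. It is neither the article's mycmp nor any libc's memcmp, which are usually more elaborate and may, for example, align one pointer and handle the other specially. It takes the 4-byte path only when both pointers are 4-byte aligned. Page sizes are multiples of 4, so an aligned 4-byte load never straddles a page boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns nonzero if p sits on a 4-byte boundary. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & (sizeof(uint32_t) - 1)) == 0;
}

/* Hypothetical equality test that pays for an alignment check up front:
 * word compares only when both pointers are aligned, bytes otherwise. */
static int mem_equal_checked(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a;
    const unsigned char *q = b;

    if (is_word_aligned(p) && is_word_aligned(q)) {
        while (n >= sizeof(uint32_t)) {
            uint32_t x, y;
            memcpy(&x, p, sizeof x);
            memcpy(&y, q, sizeof y);
            if (x != y)
                return 0;
            p += sizeof x;
            q += sizeof y;
            n -= sizeof x;
        }
    }

    /* Byte path: the whole buffer if unaligned, otherwise the 0-3 byte tail. */
    for (size_t i = 0; i < n; i++) {
        if (p[i] != q[i])
            return 0;
    }
    return 1;
}
```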

~~~
feelix
It checks that it won't overflow beforehand by making sure the length is >= 4
(and it's still faster).

And I'm not sure why you think inlining accounts for the speed increase; I
believe the compiler will inline the standard memcmp too.

~~~
wahern
You're right about the page overflow. Sorry.

Regarding inlining: using GCC 6.2 with "-O3 -march=native" on macOS, I first
had to remove "111111" and "222222" as constants; otherwise GCC precomputed
both loops.

When I used __attribute__((noinline)) on mycmp, half of the difference went
away. When I forced mycmp to check the alignment, mycmp became slower by
almost a full second.

That was on a 2011 Mac Mini. On a newer box with a Haswell chip (Xeon
E3-1230 v3) and GCC 5.4.0, performance is about the same, rather than slower,
when inlining is prevented and the alignment check is added. In both cases
the assembly confirms that neither mycmp nor memcmp was inlined.

------
jepler
Hm, if the task is to identify a set of fixed headers at a variety of file
offsets, what you really want is a rolling hash
([https://en.wikipedia.org/wiki/Rolling_hash](https://en.wikipedia.org/wiki/Rolling_hash)).
Select a rolling hash of size N, no bigger than the smallest fixed header. At
each step, update the rolling hash with the next byte seen. If it matches one
of the known header hashes, branch to code to check if it actually is.
Otherwise, continue to the next byte. As long as the rolling hash function
isn't too low quality, and you can quickly check whether a hash matches a
known header's hash, you can find all the headers in the block in O(F) time
for an F-byte file, rather than O(F*P) time for an F-byte file and P patterns.
That is, on average you inspect each byte of the file far fewer times with a
rolling hash.
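
Here is a minimal sketch of the rolling-hash idea in C (Rabin-Karp style). The specifics are assumptions for illustration, not part of the comment above:

- a 4-byte window;
- a polynomial hash with base 257, computed mod 2^32 through unsigned wraparound;
- a 65536-bit table as a quick pre-filter on header hashes;
- memcmp to confirm candidates.

`find_headers` and `struct header` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW 4             /* N: must not exceed the shortest header */
#define BASE 257u            /* polynomial base; uint32_t math wraps mod 2^32 */
#define FILTER_BITS 65536u   /* size of the quick pre-filter bit table */
#define MAX_HEADERS 64

struct header {
    const unsigned char *bytes;
    size_t len;              /* must be >= WINDOW */
};

/* Polynomial hash of the WINDOW bytes starting at s. */
static uint32_t window_hash(const unsigned char *s)
{
    uint32_t h = 0;
    for (size_t i = 0; i < WINDOW; i++)
        h = h * BASE + s[i];
    return h;
}

/* Stores up to max_out match offsets in offsets[] and returns the total
 * number of header occurrences found in file[0..flen). */
static size_t find_headers(const unsigned char *file, size_t flen,
                           const struct header *hdrs, size_t nhdrs,
                           size_t *offsets, size_t max_out)
{
    uint32_t hdr_hash[MAX_HEADERS];
    unsigned char filter[FILTER_BITS / 8] = {0};
    size_t count = 0;

    if (flen < WINDOW || nhdrs > MAX_HEADERS)
        return 0;

    /* Hash the first WINDOW bytes of each header and mark it in the filter. */
    for (size_t k = 0; k < nhdrs; k++) {
        if (hdrs[k].len < WINDOW)
            return 0;
        hdr_hash[k] = window_hash(hdrs[k].bytes);
        uint32_t slot = hdr_hash[k] % FILTER_BITS;
        filter[slot / 8] |= (unsigned char)(1u << (slot % 8));
    }

    /* BASE^(WINDOW-1): weight of the byte that leaves the window. */
    uint32_t top = 1;
    for (size_t i = 1; i < WINDOW; i++)
        top *= BASE;

    uint32_t h = window_hash(file);
    for (size_t off = 0; ; off++) {
        uint32_t slot = h % FILTER_BITS;
        if (filter[slot / 8] & (1u << (slot % 8))) {
            /* Candidate: confirm against every header with this hash. */
            for (size_t k = 0; k < nhdrs; k++) {
                if (hdr_hash[k] == h && hdrs[k].len <= flen - off &&
                    memcmp(file + off, hdrs[k].bytes, hdrs[k].len) == 0) {
                    if (count < max_out)
                        offsets[count] = off;
                    count++;
                }
            }
        }
        if (off + WINDOW >= flen)
            break;
        /* Roll the window: drop file[off], append file[off + WINDOW]. */
        h = (h - file[off] * top) * BASE + file[off + WINDOW];
    }
    return count;
}
```

Each byte costs one O(1) hash update and one bit-table lookup. Only filter hits pay for memcmp verification. With the ZIP (`PK\x03\x04`) and PDF (`%PDF-`) signatures as example headers, scanning `"xxPK\x03\x04yy%PDF-zz%PDF"` reports offsets 2 and 8. The trailing `%PDF` is rejected because the full 5-byte header doesn't fit.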

~~~
feelix
It uses qsort and bsearch in practice.
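
The article's code isn't shown here, but a qsort/bsearch lookup over fixed-length signatures might look like the sketch below. It assumes 4-byte signatures, and `signature`, `sig_cmp`, `sort_signatures` and `is_known_signature` are hypothetical names. Note that qsort and bsearch need a comparator that gives a consistent less-than/greater-than ordering, not just equality. An equality-only comparison would not be enough here; memcmp provides such an ordering.

```c
#include <stdlib.h>
#include <string.h>

#define SIG_LEN 4   /* assumed fixed signature length */

typedef struct {
    unsigned char bytes[SIG_LEN];
} signature;

/* Shared comparator for qsort and bsearch: must order, not just test equality. */
static int sig_cmp(const void *a, const void *b)
{
    return memcmp(a, b, SIG_LEN);
}

static void sort_signatures(signature *sigs, size_t n)
{
    qsort(sigs, n, sizeof *sigs, sig_cmp);
}

/* Returns 1 if the SIG_LEN bytes at p match one of the sorted signatures. */
static int is_known_signature(const unsigned char *p,
                              const signature *sigs, size_t n)
{
    return bsearch(p, sigs, n, sizeof *sigs, sig_cmp) != NULL;
}
```

Sorting once costs O(P log P). Each lookup at a file offset then costs O(log P) comparisons instead of P.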

------
blacksqr
A Study in Memcmp:

[http://www.picklingtools.com/study.pdf](http://www.picklingtools.com/study.pdf)

~~~
feelix
Interesting. They also use the same method in their "fast_memeq"
implementation:

  if (*src1_as_int++ != *src2_as_int++) return 0; /* compare as ints */

