
Memcmp is the most efficient way to compare large chunks of data - arash9k
http://mymicrocontroller.com/2018/04/04/memcmp-is-the-most-efficient-way-to-compare-large-chucks-of-data/
======
eesmith
The author's proposed for-loop version is:

    
    
      int *p = (int *)0x80000000;
      int *q = (int *)0x90000000;
      while(p < q){
    	if(*p++ != *q++){
    		flag = 1;
    	}	
      }
    

There must have been a transcription error as that while loop doesn't look
like it will exit until q reaches the end of memory and wraps back to 0.
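For reference, a plausible corrected form (the `end` pointer and the byte count are my assumptions, not from the article) bounds the walk by the end of the first region rather than by the start of the second:

```c
#include <stddef.h>

/* Hypothetical corrected loop: compare two buffers of `size` bytes,
   stopping at the end of the first buffer instead of looping until
   p catches up with an ever-advancing q. */
static int regions_differ(const int *p, const int *q, size_t size)
{
    const int *end = p + size / sizeof(int);
    int flag = 0;
    while (p < end) {
        if (*p++ != *q++) {
            flag = 1;
        }
    }
    return flag;
}
```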

On my machine, the following takes about 161 ms (that's the best of 5 timings
on a non-quiet machine):

    
    
      result = memcmp(p, q, SIZE);
    

The author's for-loop is optimistic, and doesn't short-circuit exit on
mismatch, so I'll replace it with a non-branching equivalent:

    
    
      result = 0;
      for (int i=0; i<SIZE/sizeof(int); i++) {
          result = result || (p[i] != q[i]);
      }
    

That takes 310 ms (compiled with Apple LLVM version 8.0.0, using -O3), about
half the performance of the built-in.

Replace the int with a long and it's 200 ms. Manually unroll by 4 and it's
175 ms. My final code is:

    
    
      result = 0;
      for (int i=0; i<SIZE/sizeof(long); i+=4) {
         result = result
           || (p[i] != q[i])
           || (p[i+1] != q[i+1])
           || (p[i+2] != q[i+2])
           || (p[i+3] != q[i+3]);
      }
    
    

That's only somewhat slower than the memcmp, not the 10x that the author
reports.

Of course, memcmp does short-circuit evaluation, and I know the size is a
multiple of sizeof(long), so this test case is biased in favor of my code.
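A sketch of that same word-wise comparison with the short-circuit restored (exit on first mismatch, as memcmp can), still assuming the size is a multiple of sizeof(long):

```c
#include <stddef.h>

/* Word-wise compare with an early exit. Assumes `size` is a multiple
   of sizeof(long); returns 1 at the first mismatching word. */
static int words_differ(const long *p, const long *q, size_t size)
{
    for (size_t i = 0; i < size / sizeof(long); i++) {
        if (p[i] != q[i])
            return 1;   /* bail out early, like memcmp can */
    }
    return 0;
}
```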

I would, of course, always favor the tested, often optimized library code over
rolling my own unless there are good reasons otherwise.

~~~
rurban
OP implemented a poor version of timingsafe_memcmp, which is of course slow.
His other HN post today is also full of such errors.

I don't even agree with the major premise that the libc version should always
be preferred. In fact memcmp and memcpy can be made much faster.

At http://blogs.perl.org/users/rurban/2014/08/perfect-hashes-and-faster-than-memcmp.html
I wrote a JIT-like memcmp that switches on word-sized chunks, which is 200x
faster than memcmp.
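The idea, roughly, is that when the key length is known you can switch on word-sized chunks so each case compiles down to a few fixed-width loads and compares instead of a generic byte loop. A simplified illustration (not rurban's actual code; `fast_eq` is a name I made up):

```c
#include <stdint.h>
#include <string.h>

/* Simplified sketch: dispatch on the length so common sizes become
   straight-line word loads; fall back to memcmp for everything else.
   memcpy into a local is the portable way to do an unaligned load;
   compilers turn it into a single mov. */
static int fast_eq(const char *a, const char *b, size_t len)
{
    switch (len) {
    case 4: {
        uint32_t u, v;
        memcpy(&u, a, 4);
        memcpy(&v, b, 4);
        return u == v;
    }
    case 8: {
        uint64_t x, y;
        memcpy(&x, a, 8);
        memcpy(&y, b, 8);
        return x == y;
    }
    default:
        return memcmp(a, b, len) == 0;
    }
}
```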

And I benchmarked my safeclib implementation of memcpy_s with a good compiler,
clang >4, beating the hand-optimized and unsafe glibc assembler version by ~5%.
gcc stinks there.
https://github.com/rurban/safeclib/blob/master/tests/perf_memcpy_s.c
Let the compiler optimize it.

But I've seen much faster memcpy implementations than this.

~~~
eesmith
Sure, they can be made faster.

I'm not going to spend my life on those sorts of optimizations unless there
are good reasons.

~~~
rurban
You don't see the point. The point is to write proper loops in C which the
compiler can optimize. That is the fastest way and benefits all platforms.
Special cases that are actually slower should be removed. Compilers that
cannot optimize proper loop patterns should be fixed and documented as such.
I haven't checked gcc-8 yet, but older versions should not be used for such
core functions; libc should be compiled with clang.

