
Vector addition benchmark in C, C++ and Fortran - octopus
http://solarianprogrammer.com/2012/04/11/vector-addition-benchmark-c-cpp-fortran/
======
hermanhermitage
The inner loop the author is benchmarking is doing different things between
the versions.

In the C++ version they are hot-loading c[] into the cache outside of the
timing loop with: vector<double> c(MM);

In the C version that only happens inside the timing loop, so your C version
is doing much more work inside the timed loop.

Try running the timing loops on both, or pre-initializing the c[] array in the
C version as in the C++ version, and you will see the same performance.
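
For readers who don't have the post open, here is a minimal sketch of the
shape being described; it is not the blog's exact code, and the names (MM, a,
b, c) and the clock() bracketing are stand-ins. The point is simply that the C
version's very first write to c[] happens inside the timed region, whereas
vector<double> c(MM) in the C++ version has already touched every element
before timing starts.

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define MM 90000000

        int main(void) {
            double *a = malloc(MM * sizeof *a);
            double *b = malloc(MM * sizeof *b);
            double *c = malloc(MM * sizeof *c);   /* allocated, but no element written yet */

            for (long i = 0; i < MM; ++i) {
                a[i] = 1.0 / (double)(i + 1);
                b[i] = a[i];
                /* note: c[i] is not touched here, unlike vector<double> c(MM) in C++ */
            }

            clock_t t0 = clock();
            for (long i = 0; i < MM; ++i)
                c[i] = a[i] + b[i];               /* first touch of c[] happens in the timed region */
            clock_t t1 = clock();

            printf("%f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
            free(a); free(b); free(c);
            return 0;
        }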

In both cases C/C++ alias analysis can determine that the arrays do not
overlap, so the same optimized loop can be used; i.e. the Fortran, C and C++
versions are all working on the same playing field for this example.
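
(As an aside, and purely as my own illustration rather than anything from the
post: the compiler can see this here because the arrays are distinct local
allocations. If they were instead passed in as pointer parameters, it could
not prove non-overlap on its own, and C99's restrict is the usual way to
supply that guarantee:)

        #include <stddef.h>

        /* Hypothetical helper, not from the benchmark: the restrict qualifiers
           promise the compiler that a, b and c never overlap, so it can vectorize
           the loop without emitting a runtime overlap check. */
        void vec_add(double *restrict c, const double *restrict a,
                     const double *restrict b, size_t n)
        {
            for (size_t i = 0; i < n; ++i)
                c[i] = a[i] + b[i];
        }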

~~~
enqk
(not the author)

What do you mean by hot loading? In both cases we only write to c; I don't
think c needs to be in the cache at all. (Even though I didn't believe this
was the difference, I also tested pre-initialization and it does not make a
difference.)

However, I checked the inner loops' asm and there are differences. The
addition part is exactly the same, but the C code has an extra instruction
whose origin I do not understand:

c version:

    
    
            0x1e84          movsd    -8(%edi, %eax, 8), %xmm0       !       Loop start[3], SSE      
            0x1e8a          movl     -32(%ebp), %edx                        
            0x1e8d          addsd    -8(%edx, %eax, 8), %xmm0                       
            0x1e93          movsd    %xmm0, -8(%esi, %eax, 8)                       
            0x1e99          incl     %eax                   
            0x1e9f          jnz      0x00001e84     <main+154>              Loop end[3]     
    

c++ version:

    
    
            0x2ac2          movsd    (%ecx, %eax, 8), %xmm0         !       Loop start[1], SSE
            0x2ac7          addsd    (%edi, %eax, 8), %xmm0
            0x2acc          movsd    %xmm0, (%esi, %eax, 8)
            0x2ad1          incl     %eax
            0x2ad4          jnz      0x00002ac2     <main+312>              Loop end[1]
    

~~~
hermanhermitage
vector<double> c(MM); => initializes every c[i] to 0.0. It therefore writes to
each entry (causing a cache fill or cache allocate, depending on whether the
cache is configured as copy-back or write-through), so in the timed C++ loop
access to c[] is mostly at minimal latency.

To reproduce the behaviour, change the C init to:

    
    
        for(i = 0; i < MM; ++i){
            a[i] = 1.0/(double)(i+1);
            b[i] = a[i];
            c[i] = 0;   /* touch c[] up front, as vector<double> c(MM) does in C++ */
        }
    

Then your times will coincide.

On my machine here the inner loops are identical:

gcc -O9 test.c -S:

    
    
        callq _clock
        movq  %rax, %r13
        .align  4, 0x90
      LBB2_3:
        movsd (%rbx,%r12,8), %xmm0
        addsd (%r14,%r12,8), %xmm0
        movsd %xmm0, (%r15,%r12,8)
        incq  %r12
        cmpq  $90000000, %r12
        jne LBB2_3
        callq _clock
    

g++ -O9 test.cc -S

    
    
        callq _clock
        movq  %rax, %r13
        .align  4, 0x90
      LBB5_19:
        movsd (%rbx,%r12,8), %xmm0
        addsd (%r14,%r12,8), %xmm0
        movsd %xmm0, (%r15,%r12,8)
        incq  %r12
        cmpq  $90000000, %r12
        jne LBB5_19
        callq _clock
    

(gcc/g++ 4.2 on Mac OS X)

Furthermore, compiling with Intel Fortran, with the same change to pre-load c
into cache:

    
    
        do i = 1,MM
            a(i) = 1.0/dble(i)
            b(i) = a(i)
            c(i) = 0.0
        enddo
    

and compiling with ifort -fast test.f90, I get performance 1.33x better in
Fortran than in C/C++.

    
    
      C:   0.168993 
      C++: 0.169194
      F90: 0.122215
    

Intel Fortran compiles with more aggressive detection and use of AVX etc. out
of the box. A modern version of GCC, or some extra compiler switches, might
get the C/C++ builds to meet the F90 figures.
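
For example (my suggestion, not something tested in this thread), building
with something like

        gcc -O3 -march=native test.c

lets GCC auto-vectorize for whatever SIMD extensions the host CPU supports
(SSE/AVX), which is roughly what ifort -fast enables by default, and might
close that gap.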

So this all highlights how difficult microbenchmarks are to get right. It's
much easier to measure sustained throughput than to create accurate
microbenchmarks.

