
GCC vs. Clang/LLVM: An In-Depth Comparison of C/C++ Compilers - flipchart
https://medium.com/@alitech_2017/gcc-vs-clang-llvm-an-in-depth-comparison-of-c-c-compilers-899ede2be378
======
ncmncm
One key metric where Gcc and Clang differ is in their willingness to produce
'cmov', "conditional move", instructions. ('cmov' is an instruction that
implements '(a=c?b:a)' without a conditional branch.)

Partition is an essential component of numerous key algorithms; almost half of
C++ STL algorithms depend on it. A key primitive to implement partition is the
'swap_if' operation. Performance of this operation thus directly determines
the performance of many important algorithms. For best performance with word-
sized arguments, 'swap_if' must be implemented with a pair of back-to-back
'cmov' instructions.

Gcc got its fingers burned a few years ago from emitting 'cmov' in loop
contexts where the condition was predictable and the result immediately
depended upon in the next iteration, creating dependency chains that slowed
execution of some benchmarks by a factor of 2. Now, Gcc will never, under any
circumstances, emit two 'cmov' instructions in one basic block.

Clang is happy to emit a pair of back-to-back 'cmov' instructions, which
enables it to perform partition operations very substantially faster than Gcc.
However, it fails to generate 'cmov' instructions in important cases.

The fully general expression of 'swap_if' looks like:

    
    
      template <typename T>
      bool swap_if(
        bool c, T& a, T& b)
      {
        T v[2] = { a, b };
        b = v[1-c], a = v[c];
        return c;
      }
    

It may be used in the inner loop of quicksort's partition operations as

    
    
      right += swap_if(
        *left < pivot,
        *left, *right);
    

A 'swap_if' implemented with 'cmov' makes quicksort fully twice as fast as,
e.g., current 'std::sort'. However, neither Gcc nor Clang recognizes this
'swap_if' as a place to substitute 'cmov' instructions for the pointer/offset
accesses. (The latter are slower, probably because they produce more L1 bus
traffic.)

Clang will happily emit 'cmov' instructions for a specialization like

    
    
      template <>
      bool swap_if(
        bool c, int& a, int& b)
      {
        int v[2] = { a, b };
        a =  -c&v[1] | 1-c&v[0];
        b = 1-c&v[0] |  -c&v[1];
        return c;
      }
    

where Gcc emits exactly the 'and' and 'or' instructions. (Interestingly, Gcc's
code for both definitions, despite failing to use 'cmov', runs quicksort quite
a lot faster than std::sort; just not 2x as fast.) Clang will also emit 'cmov'
instructions for

    
    
      bool swap_if(
        bool c, int& a, int& b)
      {
        int ta = a, tb = b;
        a = c ? tb : ta;
        b = c ? ta : tb;
        return c;
      }
    

but Gcc will only produce very slow (i.e. badly predicted) branches for this
case.

If we had 'std::swap_if' in the C++ Standard Library, probably all
implementations would produce good code for it.

(As an aside, I find it odd that changing the second line of the template
'swap_if' above to

    
    
      a = v[c], b = v[!c];
    

makes it much, much slower when compiled with Gcc for recent Intel. I would
welcome any insight into why this is.)

~~~
ncmncm
Apologies, the "1-c&v[.]" above should be "c-1&v[.]".

