
Can we do better than our C compiler? - signa11
https://briancallahan.net/blog/20200812.html
======
celrod
For the limited scope of rectangular loop nests without loop carried
dependencies, my Julia library LoopVectorization.jl is able to vectorize and
achieve often significantly better performance than Clang, GCC, or the Intel
compilers; see the different examples here:
[https://chriselrod.github.io/LoopVectorization.jl/latest/exa...](https://chriselrod.github.io/LoopVectorization.jl/latest/examples/matrix_multiplication/#Matrix-Multiplication-1)

However, that's only one component of vectorization. It's still relying on
LLVM to do lots of clean up, among other things.

I feel like there's a lot of reverence surrounding compilers and their
capabilities, but at least when it comes to SIMD, it's not hard to find
examples where a human can do much better, or can at least come up with
algorithms that will do better for that and related problems.

I think saying compilers produce better asm than a programmer is like saying
statistical libraries like `GLM.jl` are better at doing statistics than
statisticians. True in that you should let the routines crunch the numbers.
The skill is in deciding which routines/analyses to run, how to organize them,
and -- when necessary -- developing your own.

~~~
gnufx
I don't know what to make of results without any idea how to reproduce them.

For what it's worth, BLIS' generic C DGEMM code gave around 60% of the hand-
coded kernel on Haswell with GCC (8, from memory) on large matrices without
any real effort to speed it up. Obviously that's not a simple triply-nested
loop, since just vectorizing is useless for large dimensions, and the generic
code does get well auto-vectorized.

There are results from Polly/LLVM showing it doing well against tuned GEMM.
However, Polly apparently pattern-matches the loop nest and replaces it with a
Goto BLAS-like implementation, and I don't know whether it's now in
clang/flang. I also don't know how well the straight Pluto optimization does
relative to BLAS, and whether that's what original Polly uses; it's a pity
GCC's version isn't useful.

~~~
celrod
I should include more detailed instructions. Dependencies include: eigen, gcc,
clang, and (harder) the Intel compilers, as well as a recent version of Julia.

If you have those, you should be able to instantiate the Manifest:
[https://github.com/chriselrod/LoopVectorization.jl/tree/mast...](https://github.com/chriselrod/LoopVectorization.jl/tree/master/benchmark)
and then `include("/path/to/LoopVectorization/benchmark/driver.jl")` to run
the benchmarks.

There are other examples. Even for code as simple as a dot product, clang
showed some pretty bad performance characteristics, while gcc needed a few
infrequently used compiler options to do well there.

Image filtering / convolutions without compile-time known kernel sizes are
another example that showed dramatic speedup.

When talking about optimized gemm, LoopVectorization currently only does
register tiling. Therefore its matmul performance will continue to degrade
beyond 200x200 or so on my desktop. An optimized BLAS implementation would
still need to take care of packing arrays to fit into cache, multithreading,
etc.
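For reference, register tiling here means accumulating a small block of the
output matrix in registers so that each load from A and B is reused several
times. A minimal 2x2 sketch in C (purely illustrative, nothing like
LoopVectorization's actual output; n is assumed even to keep it short):

```c
#include <stddef.h>

/* C += A*B for square n x n row-major matrices with a 2x2 register
   tile: each pass over k reuses two loads from A and two from B across
   four accumulators, quadrupling arithmetic per load vs the naive loop.
   Edge handling for odd n is omitted. */
void matmul_2x2(float *C, const float *A, const float *B, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        for (size_t j = 0; j < n; j += 2) {
            float c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (size_t k = 0; k < n; k++) {
                const float a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                const float b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0; c01 += a0 * b1;
                c10 += a1 * b0; c11 += a1 * b1;
            }
            C[i * n + j] += c00;       C[i * n + j + 1] += c01;
            C[(i + 1) * n + j] += c10; C[(i + 1) * n + j + 1] += c11;
        }
    }
}
```

A real kernel would use SIMD registers and much larger tiles; the point is
only the load reuse, which is what stops working once operands fall out of
cache and packing becomes necessary.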

Maybe I should add pluto. I did test Polly before, but it took several minutes
to compile the simple c file, and the only benchmark that seemed to improve
was the gemm without any transposes. And it still didn't perform particularly
well there -- worse than the Intel compilers, IIRC.

PRs are welcome.

------
jacb
Have there been attempts to measure modern compilers' ability to optimize
large programs against "perfect" machine code? The few times I've poked
through demo scene binaries, I've felt a sense of awe at the complexity
described by just a few instructions, but demos aren't starting from a
specification and trying to find a compressed representation, they're trying
to create small binaries that expand into something complex and interesting.
Clearly this problem is NP-hard as hell (and most software doesn't need to be
optimized instruction-by-instruction), but it would be incredibly neat to have
an estimate of _how much worse_ our compilers might be.

I'm reminded of the old story about someone who tried to evolve an FPGA
configuration to perform signal processing, and the final generation exploited
imperfections in the chip to use fewer gates.

~~~
tomsmeding
That would indeed be very neat. I think it's also impossible to do in a
reasonable way, because how well a compiler does probably depends heavily on
the particular program. There is a huge class of program transformations that
are technically correct but far too specific to implement in a general-purpose
compiler, and I think a compiler's showing mostly comes down to how many of
those transformations happen to apply to the program at hand.

For example: suppose you have a program that renders a 3D scene from a
hardcoded scene description. A compiler might do a lot of stuff to that
program, but what it will certainly not do is notice that the whole hardcoded
scene can be replaced with a small representation in the form of a distance
estimation function, if the rendering algorithm is also changed from
raytracing to distance estimation.

A human in a demo scene situation will certainly perform (or at least try to
perform) that transformation, if only because distance estimation is well
known to often provide compact representations of certain, a priori very
complex, scenes.

Also, nitpick:

> Clearly this problem is NP-complete as hell

I think it's almost certainly not even in NP; that would require a
polynomial-time algorithm that, given some witness you can choose, proves that
a particular program really is optimal. I think that's still impossible (but I
would gladly be proven wrong). If I'm correct, the problem is not NP-complete,
only NP-hard.

~~~
jacb
> how well a compiler does is mostly dependent on how many of those
> transformations happen to apply to your program at hand

Right, there's an infinite number of distinct useful code optimizations,
there's a cost to checking if any given optimization can be applied, and some
optimizations have _massive_ costs for very rare and specific savings, so any
given compiler is making a practical decision to include some optimizations
and omit others. There was some discussion in the Rust community about LLVM
having a period where they weren't tracking regressions in compile times - so
"the perfect optimizing compiler" isn't really a coherent goal. But I still
wonder how much faster Optimal Chromium would be. Just another interesting
number I'll never know, I suppose.

> I think it's almost certainly not even in NP

Yeah, that was sloppy of me, I think this is undecidable by Rice's theorem.
Nice catch!

------
alephnil
At the level the article looks at, it is not even worth doing better, since the
potential gain is minuscule compared to the available disk and memory, and
this has been the case at least since the late 80s. That does not mean that it
is always better to let the compiler do all optimizations.

There are still cases, especially where clever uses of SIMD instructions can
pay off. This is especially the case for video processing, deep learning
engines and numerical code. It is still hard to beat the compiler even in
those cases. You have to know about branch prediction, about the fact that
modern processors execute many instructions in parallel (superscalar), and
about instruction latency. You also need to know about the cache hierarchy,
but that's the case even when you let the compiler generate the code. Writing
highly optimized code like that is not something every programmer will do, but
most will use code written by others that is partly implemented in optimized
assembly. An example of that is ffmpeg.

~~~
PaulDavisThe1st
Within Ardour, a cross-platform DAW, we introduced hand-written SSE-based asm
for metering, which is the most expensive computational operation in the
application (without plugins): compare every single (floating point) audio
sample that passes through to some recent max and min, and if larger or
smaller, store it as the new max or min. The savings were dramatic, though
dependent on the size of the buffers that get processed. With larger buffers,
adding this asm reduced DSP load by 30%.

This was done 15 years ago, with a few minor updates about 6 years ago.
Recently, we've revisited it to add support for AVX on Linux, which was
missing.

It turns out that gcc can now do more or less as good a job as the
hand-crafted asm. The hand-crafted version understood (mis)alignment better
than the gcc of 15 years ago, but at this point, gcc's inner loop is one
instruction shorter than the hand-crafted version, although its setup before
reaching the inner loop is several instructions longer.

There is still some human skill required to understand the best way to suggest
to gcc what is going on, but it increasingly seems as though it's likely best
to let the compiler do the actual code generation, even for SIMD.
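The metering operation described can be sketched in plain C (names are
illustrative, not Ardour's actual code); at -O3, recent gcc vectorizes the
branchless min/max into packed compares along the lines the comment suggests:

```c
#include <stddef.h>

/* Track the running trough (min) and peak (max) across an audio buffer
   in a single pass. The callers seed *min_out and *max_out with the
   values seen so far. The conditional expressions are the form gcc's
   vectorizer recognizes as min/max (candidates for minps/maxps). */
void find_peaks(const float *buf, size_t n, float *min_out, float *max_out)
{
    float mn = *min_out, mx = *max_out;
    for (size_t i = 0; i < n; i++) {
        const float s = buf[i];
        mn = s < mn ? s : mn;
        mx = s > mx ? s : mx;
    }
    *min_out = mn;
    *max_out = mx;
}
```

Writing the reduction this way, rather than with branches, is the "suggesting
to gcc what is going on" part; the compiler then handles alignment and tails
itself.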

~~~
compiler-guy
Compilers get better every year. Hand-tuned assembly only gets better if
someone pays attention to it.

So even if you can beat the compiler today, you might not be able to beat it
tomorrow.

Therefore it is often useful to have two switchable implementations:
high-level code that is basically straightforward, and hand-tuned assembly.
You can then compare the two with every new compiler release.

~~~
gnufx
Not only that, but your micro-architecture-specific hand-tuning probably isn't
good for tomorrow's micro-architecture, and definitely not for a new
architecture. You still need to craft kernels in cases like BLAS, but you
really
want to reduce them, e.g. BLIS in the case of BLAS.

------
andreaorru
The article is missing benchmarks. Fewer instructions != faster execution.

~~~
rurban
Normally it is

~~~
Google234
How many instructions is an infinite loop :p More seriously, processors are
superscalar and can execute out of order.

~~~
saagarjha
On Intel, basically one.

------
RcouF1uZ4gsC
If you want to see the assembly code a compiler produces from high level
languages, Compiler Explorer is awesome for that
([https://gcc.godbolt.org/](https://gcc.godbolt.org/))

You type your C/C++ code in the left pane, and it generates assembly in the
right pane as you type. There is color coding to show you which C/C++ code
correlates to the assembly code. If you hover over an assembly instruction
such as mov, a tooltip appears telling you what it does.

It makes exploring what the compiler is doing easy and a lot of fun.
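For instance, pasting a small throwaway function like this one (not from the
article) into the left pane makes it easy to watch the loop change shape
between -O0, -O2, and -O3:

```c
#include <stddef.h>

/* Sum an array of floats. At -O0 this is a scalar loop you can read
   instruction by instruction; at -O3 gcc and clang emit packed adds,
   and the color coding maps each source line to its assembly. */
float sum(const float *a, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}
```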

------
ritter2a
There will probably always be particular code transformations (algorithmic or
target-machine-specific) that speed up your code and that your general purpose
compilers will not find. (And that's good, because otherwise the compile time
would be horrible!)

The good thing about compiler optimizations is not that they produce actual
"optimal" binaries, but that they allow you to not care too much about
performance when writing code (e.g. using abstractions that will be inlined,
etc.) and still get a decent binary. If you then require more performance
(which will probably not be the case for most applications) you can still
search for bottlenecks and start optimizing by hand there.

------
okaleniuk
It's not that people do better than compilers; it's that compilers often do
badly. That's why checking the disassembly in tight spots is an integral
procedure on our team. Compilers and people together, by definition, do no
worse than compilers alone.

Also, this is an odd choice of program to compare. Compilers are often
suboptimal at superscalar scheduling, inlining, or simplifying control flow.
In this example, there is not enough room to go wrong for either a compiler or
a programmer.

~~~
gnufx
Isn't the first thing to do to look at what the compiler can tell you
(-fopt-info for GCC)? That can be helpful without deep knowledge of the
target, or knowing its assembler. There are also dumps from stages before
assembly generation which may be helpful. Is it deep knowledge of the whole
system that makes going straight to assembly best?
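As a sketch of that workflow: compiling a loop like the following with
`gcc -O3 -fopt-info-vec` (the file name and function are made up for
illustration) makes GCC report which loops it vectorized, with source
locations, before you ever look at assembly:

```c
/* Build with: gcc -O3 -fopt-info-vec -c saxpy.c
   GCC then prints a note like "loop vectorized" pointing at the loop
   below; -fopt-info-vec-missed instead explains why a loop was NOT
   vectorized, which is usually the more useful report. */
#include <stddef.h>

void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```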

~~~
okaleniuk
Yes, GCC intermediates are quite nice. Lispy and readable. And indeed, you can
take a look at every stage. But LLVM intermediate code is basically just
another assembly language. And as for MSVC, I'm not sure if it's even
accessible. So it's not always an option.

Disassembly is either way our end result so unless you're looking for
something particular, it makes sense to browse that first just to check that
the compiler gets your intention just right.

------
Google234
Could we actually get a benchmark? CPUs don't just execute one instruction at
a time serially...

------
Snild
I find that manual assembly is mostly useful when I know something about the
data or desired operation that is not possible to communicate to the compiler.

~~~
dvfjsdhgfv
But it doesn't happen often these days, does it?

~~~
10000truths
In cases of parallel processing where code executing in the critical path is a
significant bottleneck, hand-rolled SIMD can sometimes outperform what gcc or
clang spits out by a few clock cycles/iteration. But even then, the sanest way
to do that is to let the compiler do most of the legwork and then
incrementally tweak its output.

~~~
hansvm
IME gcc is extraordinarily bad at anything but the most basic of auto-
vectorizations, but its output from SIMD intrinsics is decent -- not usually
quite as good as hand-rolled assembly, but much faster to write (for me
anyway, YMMV) and still within 2x performance.
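A small sketch of that intrinsics middle ground, assuming x86 with SSE: each
`_mm_*` call maps to roughly one instruction, so you choose the vector shape
without writing assembly, and the compiler still handles register allocation
and scheduling:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86 only */

/* Sum floats four lanes at a time. n is assumed to be a multiple of 4
   to keep the sketch short; a real version needs a scalar tail loop. */
float simd_sum(const float *a, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
    /* Horizontal reduction of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```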

------
kazinator
Here is an iovec version, doing it all in a single writev.

    
    
      #include <unistd.h>
      #include <stdlib.h>
      #include <string.h>
      #include <alloca.h>
      #include <sys/uio.h>
    
      int main(int argc, char **argv)
      {
        if (argc > 1) {
          int nstr = argc - 1;
          char **svec = argv + 1;
          int i, j;
          struct iovec *iov = alloca(sizeof *iov * nstr * 2);
          for (i = 0, j = 0; i < nstr; i++, j += 2)
          {
            iov[j].iov_base = svec[i];
            iov[j].iov_len = strlen(svec[i]);
            iov[j + 1].iov_base = " ";
            iov[j + 1].iov_len = 1;
          }
          iov[j - 1].iov_base = "\n";
          writev(STDOUT_FILENO, iov, j);
        }
    
        return 0;
      }
    

#include <numerous_caveats.h>. We are not checking that everything got
written, let alone handling a partial write. Systems have some limit on how
many elements you can have in the iov array passed to writev.
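For illustration, handling the short-write caveat could look something like
this hypothetical helper (still ignoring the limit on the number of iov
elements):

```c
#include <sys/uio.h>
#include <errno.h>

/* Keep calling writev until every byte in iov[0..iovcnt) is written or
   a real error occurs. Mutates the iovec array as it goes. Returns 0 on
   success, -1 on error with errno set by writev. */
int writev_all(int fd, struct iovec *iov, int iovcnt)
{
    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* Skip the iovecs that were written completely... */
        while (iovcnt > 0 && (size_t)n >= iov->iov_len) {
            n -= iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* ...and advance partway into the one that wasn't. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + n;
            iov->iov_len -= (size_t)n;
        }
    }
    return 0;
}
```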

------
ManDeJan
213 bytes! :) Surely there are a few more to shave

    
    
      7f454c4602010100000000000000000002003e0001000000820001000000
      000040000000000000000000000000000000000000004000380001004000
      000000000100000005000000800000000000000080000100000000008000
      010000000000550000000000000055000000000000001000000000000000
      0000000000000000200a4c8b0c2449ffc94d39d17441eb0eb00140b701be
      80000100b2010f054e8b44d41041807c100100488d520175f4b00140b701
      4c89c60f0549ffc24d39ca75cfb00140b701be81000100b2010f05b03c31
      ff0f05

~~~
aweinstock
It's definitely the case that some of the ELF header fields can be repurposed
to store additional data, the following articles all address that
optimization:

[http://www.muppetlabs.com/~breadbox/software/tiny/teensy.htm...](http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html)
[https://www.pimzero.com/2020/04/19/golf_so.html](https://www.pimzero.com/2020/04/19/golf_so.html)
[https://rpis.ec/blog/plaidctf-golfso/](https://rpis.ec/blog/plaidctf-golfso/)

There are also a few instructions in the article that are size-suboptimal
(e.g. "movl $4, %eax" is 5 bytes (b804000000), while the equivalent "xor %eax,
%eax; movb $4, %al" is only 4 bytes (31c0b004)).

~~~
ManDeJan
I tried to make it as optimal as possible, but I think there is indeed more to
be saved by putting code in the ELF header; I'm not very familiar with the ELF
format. This is as small as I could get it without doing that, just using
yasm, ld, and strip.

------
fxtentacle
It depends on how much information you give to the compiler. A modern GCC can
do magic like AVX vectorization and OpenMP parallelization if you give your
function the right restrict and pragma annotations.
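A minimal sketch of that kind of annotation (function name and flags chosen
for illustration): restrict tells GCC the arrays don't alias, so it can
vectorize freely at -O3, and the OpenMP pragma (enabled with -fopenmp) layers
thread parallelism on top; without -fopenmp the pragma is simply ignored:

```c
#include <stddef.h>

/* out = a + k*b. The restrict qualifiers promise no aliasing between
   the three arrays, which is what lets GCC emit packed AVX code at
   -O3 -mavx2 without runtime overlap checks. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + k * b[i];
}
```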

~~~
gnufx
Something like #pragma simd may actually be worse than letting GCC do its
thing. It may not use FMA, just AVX, for instance. You may also want (a subset
of) -ffast-math for NEON, reductions, etc., if your code is OK with it.

