
Comparing Compiler Optimizations - rcfox
http://blog.regehr.org/archives/320
======
CountHackulus
I work in compiler development, and have worked on a C/C++ compiler. Let me
first say that while these side-by-side comparisons are excellent for pointing
out the weak points in compilers and showing where extra optimizations could
be made, they are, as presented in this article, not very useful.

When's the last time you saw a program that was just a single function? In
real production environments, you're going to be running programs with
hundreds of modules, thousands of functions, and millions of instructions.
Optimizations like inlining, partial inlining, partial redundancy elimination,
and inter-procedural analysis become much more important.

When you're a compiler developer looking for more performance, the typical
process starts with profiling an important benchmark (like SPEC2006) and
looking at where the program is spending most of its time. Then you take a
look at the offending code and see if it can be distilled down to one of these
microbenchmarks. From there you can decide how best to optimize it. Without
this context, though, these types of microbenchmarks are nearly pointless.
They show
some optimization that could be done, but who knows if it'll be worth it, or
even make any sort of impact on the final performance of a production
workload.

~~~
jules
Do you also use real world applications for benchmarking? Or is that too
difficult (too big)?

~~~
CountHackulus
Absolutely. I can't talk too much about what we actually benchmark, but real-
world applications are always used for benchmarking a compiler, especially if
it's a JIT compiler. Though it's all done in a repeatable environment, with
scripts and such to minimize variation and allow us to measure more
accurately.

------
ronnix
Most compilers typically do a good job, but I was surprised to see such a
difference on some examples.

On those examples, Clang usually seems to be better than GCC at static code
evaluation and local optimizations.

However, GCC beats Clang in examples 2 and 3 due to better loop optimization
capabilities (such as unrolling), allowing it to use SSE instructions.

Example 3 shows that the Intel compiler has even stronger unrolling and
vectorization capabilities. Not unexpected, as their compiler has to be
especially good at this in order to get good performance on the Itanium
architecture (IA-64).

I'm also impressed by the trick used by ICC in example 7 (using a bittest
against a statically computed bitmap to implement the comparisons).

It will be quite interesting to see what results the researchers get on many
more examples.

------
jefffoster
Acovea (<http://www.coyotegulch.com/products/acovea/>) uses a genetic
algorithm to analyse how various optimization flags affect the performance of
a piece of code.

There's a practical example of this being used on a Haskell program at
[http://donsbot.wordpress.com/2010/03/01/evolving-faster-hask...](http://donsbot.wordpress.com/2010/03/01/evolving-faster-haskell-programs-now-with-llvm/)

------
kristianp
As someone who finds this kind of optimisation story interesting, but for whom
optimisation in everyday life means adding an index or caching a query, I
wonder: wouldn't it be equally important to analyse the worst-case
optimisations as well?

Then again, that is probably done as part of regression tests by the compiler
makers.

------
jedbrown
FWIW, -mfpmath=sse is the default on x86-64.

