
The Death of Optimizing Compilers - mpweiher
http://blog.cr.yp.to/20150314-optimizing.html
======
danso
On the topic of compiler optimization, Peter Seibel's interview with Fran
Allen, a Turing Award recipient for her "pioneering contributions to the
theory and practice of optimizing compiler techniques", is a good read:

[http://books.google.com/books?id=nneBa6-mWfgC&lpg=PA9&pg=PA5...](http://books.google.com/books?id=nneBa6-mWfgC&lpg=PA9&pg=PA501#v=onepage&q&f=false)

She considers the development of C to be a "big blow", saying that "C has
destroyed our ability to advance the state of the art in automatic
optimization, automatic parallelization, automatic mapping of a high-level
language to the machine. This is one of the reasons compilers...are basically
not taught much anymore in the colleges and universities"

Her whole interview is great...in fact the whole book is fantastic, one of the
best books about programming I've ever read.

~~~
Animats
_" C has destroyed our ability to advance the state of the art in automatic
optimization, automatic parallelization, automatic mapping of a high-level
language to the machine."_

She's right. As I occasionally point out when discussing why C programs are
so buggy, the three big problems in C are "how big is it", "who owns it", and
"who locks it". The language doesn't help with any of those. Those problems
also inhibit optimization.

One optimization that's been lost to history is optimizing subscript checking.
This was done in some Pascal compilers, and about 95% of subscript checks can
be optimized out. Where there's a loop involved, it's often possible to make
one check at loop entry, rather than a check at every reference. Sometimes
that one check can be proven true from FOR statement bounds, and can be
eliminated. This is usually the case for numeric matrix work, where it really
matters.
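The hoisting he describes can be imitated by hand in Rust today. A minimal
sketch (function and names are illustrative, not from any particular
compiler): slicing once before the loop performs the single check at loop
entry, after which the optimizer can prove every index in range.

```rust
// One bounds check at loop entry instead of one per element.
fn sum_first_n(a: &[i64], n: usize) -> i64 {
    // The single check: this panics here if n > a.len()...
    let head = &a[..n];
    let mut total = 0;
    for i in 0..n {
        // ...so i < head.len() is provable and the per-iteration
        // check can be eliminated.
        total += head[i];
    }
    total
}

fn main() {
    let v = [1, 2, 3, 4, 5];
    println!("{}", sum_first_n(&v, 3)); // 6
}
```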

I had a talk with one of the designers of Rust about this last Wednesday.
They're not optimizing subscript checks yet. They hope to do so. They're
currently limited by being a front-end to LLVM, which doesn't really
comprehend subscript checks. The Go compiler does some subscript check
optimizations, but only on FOR statements of restricted form.

C, Go, and Rust all lack multidimensional arrays. This prevents a whole range
of array-related optimizations which FORTRAN compilers have had all the way
back to Backus's first FORTRAN compiler in 1956. This is part of why the
supercomputer crowd still uses FORTRAN.

I was trying to push the Go crowd towards multidimensional arrays. People in
love with the slicing syntax wanted to extend it to multidimensional arrays.
There were long arguments over how far to take that. Should you be able to
slice out an arbitrary sub-hypercube of an N-dimensional array? If you allow
that, you have to have a slice representation which is general but
inefficient. After much discussion, nothing happened.

Meanwhile, the number-crunching crowd is trying to compile Matlab, which has
multi-dimensional arrays but not much else.

~~~
kibwen
It's not a priority for Rust to hoist subscript checks in loops, because Rust
uses iterators pervasively and iterators are guaranteed to check the bounds of
the array no more than once.
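A toy illustration of that style (names are mine, not from the Rust
standard-library internals): the iterator carries the bound itself, so the
loop body never subscripts at all.

```rust
// Iterator-style dot product: `zip` stops at the shorter slice,
// so there is no per-element subscript check to hoist.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];
    println!("{}", dot(&a, &b)); // 32
}
```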

~~~
Animats
Write a matrix multiply. Or a linear equation solver.

~~~
pcwalton
The bounds checks for matrix multiplication over fixed length arrays are
completely optimized out by LLVM, and LLVM will also optimize the bounds
checks for non-fixed-length arrays in many cases. The slice length is an SSA
value, so it's subject to constant propagation and SCCP.
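A sketch of the kind of code meant here (a toy 2x2 case): because the
dimensions are part of the type, every index is provably in range, which is
what lets LLVM delete the checks.

```rust
// Matrix multiply over fixed-length arrays: the loop bounds match the
// array dimensions in the type, so the bounds checks can be removed.
fn matmul(a: &[[f64; 2]; 2], b: &[[f64; 2]; 2]) -> [[f64; 2]; 2] {
    let mut c = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            for k in 0..2 {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

fn main() {
    let a = [[1.0, 2.0], [3.0, 4.0]];
    let id = [[1.0, 0.0], [0.0, 1.0]];
    // Multiplying by the identity returns the original matrix.
    assert_eq!(matmul(&a, &id), a);
    println!("ok");
}
```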

------
DannyBee
I'd be interested to see this talk, because as far as i know, it's pretty far
off base.

1\. He says: "As computation has become cheaper, users have correspondingly
expanded the volume of data that they are handling, and optimization remains
a critical challenge for the occasional "hot spots" in the code."

Except, uh, a lot of people have applications whose profiles are mostly flat,
because they've spent a lot of time optimizing them. We build optimizing
compilers that can speed up _these_ apps _anyway_.

2\. "Have compilers become so smart that they automatically turn clean high-
level code for these hot spots into optimized code, removing the need for
humans to be optimization experts? The reality, unfortunately, is very much
the opposite: general-purpose "optimizing" compilers are falling farther and
farther behind the actual capabilities of modern processors."

This is just flat out wrong.

That said, Daniel is a very smart (and obviously pretty opinionated :P) guy,
so i'm really interested to see what he has to say.

~~~
nkurz

      > The reality, unfortunately, is very much the opposite: 
      > general-purpose "optimizing" compilers are falling farther 
      > and farther behind the actual capabilities of modern 
      > processors.
    
      This is just flat out wrong.
    

Saying this is wrong implies that the gap between ideal processor-specific
assembly and generated code is closing, and that compilers today can achieve a
higher percentage of potential performance than they could in the past. Do you
have evidence that suggests this is true? Are you possibly conflating
"optimizing the code as written, including all unintentionally specified
corner cases" with "optimizing the production of the desired answer"?

In my personal experience (which is mostly compression and scientific
computing), I find that I can get significantly higher x64 performance from
Haswell than from earlier Intel generations --- frequently more than twice as
fast as I can do for Sandy Bridge, due to a combination of better memory
interactions, fewer scheduling gotchas, and wider integer vectors. Very
rarely do I find that compiler generated code (Clang, GCC, icc) achieves this
level of speedup, and even more rarely do I find that the same code gets
optimal performance on both of these generations. Even more rarely does the
generated assembly surprise me by exceeding my expectations of maximal
performance.

I'm not sure where Daniel is going with his presentation, but what I often
find myself wanting is a non-brain-dead non-optimizing compiler. It makes me
sad to know how to make code run fast, but be unable to keep the compiler from
slowing it down. Obviously, this is a rare case, since only a tiny percentage
of code will ever be allowed to be this highly optimized. But I sure wish that
there was a way to return to "glorified assembly", and that I had a way to
tell the compiler to keep its grubby little fingers off certain small sections
of my consciously crafted code.

~~~
DannyBee
"Saying this is wrong implies that the gap between ideal processor-specific
assembly and generated code is closing, and that compilers today can achieve a
higher percentage of potential performance than they could in the past. Do you
have evidence that suggests this is true? "

Yes. We now do things like
[http://drona.csa.iisc.ernet.in/~uday/publications/pluto+.pdf](http://drona.csa.iisc.ernet.in/~uday/publications/pluto+.pdf)

Which will fully, automatically, parallelize and optimize loop nests for the
target architecture, including tiling, interchange, cache blocking, blah blah
blah. Basically anything you can think of. _You_ can't do this by hand. You
can do something by hand, maybe tiling, maybe unrolling, maybe interchange,
but the math is too complex for you to get it right in the general case and
to _combine all these things at once_. We couldn't do this 20 years ago. We
couldn't even approach the performance of most architectures.
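To give a feel for one of the transformations mentioned, here is a hand-tiled
transpose sketch (the tile size is an illustrative guess; a polyhedral tool
like the one linked would derive tile sizes from the cache parameters and
combine tiling with interchange and parallelization automatically):

```rust
// Hand-tiled transpose of an n x n matrix stored row-major.
const TILE: usize = 4; // illustrative; real tools pick this from cache sizes

fn transpose(src: &[f64], dst: &mut [f64], n: usize) {
    for ii in (0..n).step_by(TILE) {
        for jj in (0..n).step_by(TILE) {
            // Work on one TILE x TILE block at a time so both the reads
            // and the strided writes stay within a few cache lines.
            for i in ii..(ii + TILE).min(n) {
                for j in jj..(jj + TILE).min(n) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}

fn main() {
    let n = 6;
    let src: Vec<f64> = (0..n * n).map(|x| x as f64).collect();
    let mut dst = vec![0.0; n * n];
    transpose(&src, &mut dst, n);
    // dst[j][i] must equal src[i][j].
    assert_eq!(dst[1 * n + 0], src[0 * n + 1]);
    println!("ok");
}
```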

We can also vectorize whatever you want. The only hard part is calculating the
cost vs benefit.

Past that, we now actually have the opposite problem. We come so close to
optimal on most architectures that we can't do much more without using NP-
complete algorithms instead of heuristics. We can only try to get little
niggles here and there where the heuristics get slightly wrong answers. We
decided we care more about compile speed than doing stuff like this.

"Very rarely do I find that compiler generated code (Clang, GCC, icc) achieves
this level of speedup, and even more rarely do I find that the same code gets
optimal performance on both of these generations."

I don't know your code, so i really can't speak to this at all, but it's the
opposite experience for us. In fact, moving platforms now makes little
performance difference. Getting the compiler up to snuff for that architecture
does. If what you say is true, it just tells me you need to spend a small
amount of time figuring out why the compiler is doing the wrong thing. It's
almost certainly just a small bug or tweak somewhere.

~~~
nkurz
_Yes. We now do things like
[http://drona.csa.iisc.ernet.in/~uday/publications/pluto+.pdf](http://drona.csa.iisc.ernet.in/~uday/publications/pluto+.pdf)_

Thanks, I'll read this over.

 _In fact, moving platforms now makes little performance difference._

Interestingly, we are saying the same thing but coming to opposite
conclusions. I agree that compiler generated code produces approximately the
same performance on Sandy Bridge vs Haswell. But I also know that the
performance potential on Haswell can be much higher than that of Sandy Bridge,
thus see this as a drop in compiler performance. Whereas you seem to be
concluding that because the compiler achieves comparable performance, there
has been little hardware improvement across generations.

 _We decided we care more about compile speed than doing stuff like this._

I've never really understood the emphasis on compile speed --- it seems
misguided. Sure, you don't want to wait for the compiler to slowly redo the
same suboptimal "optimization" on every recompilation, but shouldn't there be
some way to cache the previous "best" so you can start from there? And
wouldn't it be more efficient if there was some way that you could specify to
the compiler where you want it to spend its time, maybe even offline so it
won't hold up everything else? Not just the spectrum from "don't optimize and
while you are at it please pass all function variables on the stack" to "do
the best you can as long as it doesn't take longer than a millisecond", but
"do what it takes and I'll tell you when it's good enough"?[1]

 _It's almost certainly just a small bug or tweak somewhere._

Perhaps I can try to send you one the next time I come across a compact
example. I'm dubious that increasing the automation is the right approach
--- what I think I want is easier access to "manual mode". And I'm equally
dubious that it will be an easy fix, but it would be wonderful if it were.

As a quick example of the sort of issue, Haswell allows two 32B loads and one
32B store per cycle, and has three address generation ports to support this.
stores. Two of the AGPs can be used for stores or loads, and one (Port 7) is
only for
stores. But Port 7 can only generate "simple" addresses --- fixed offsets from
a register.

I've yet to find a way to convince a compiler to generate two loads and a
store that can be executed in a single cycle without resorting to assembly.
Effectively, the cost of an instruction varies depending on what other
instructions are executed the same cycle. They can't be scored independently,
and the compiler gets stuck in a "local optimum" rather than finding the
"global optimum".

Just the ability to say "No, really, I want both of these to be loop
variables, and this one to be incremented, and this one to be decremented, and
I want the loop to terminate when the decrement hits zero" would be
tremendous. Or at least, I'd like to be able to specify this as a starting
point for optimization, and have some guarantee that the compiler won't
"simplify" by combining the registers unless the "simplification" actually has
better performance.

[1] I feel like Dan Luu's excellent and under-discussed article on Software
Testing is very applicable here:
[http://danluu.com/testing/](http://danluu.com/testing/) Like software
testing, compilers seem to have settled into the strange expectation that they
should do the same thing over and over again very quickly, rather than doing
something once and well.

~~~
vardump
I pretty much share your sentiment on this subject. My conclusion was to
play with specialized JITs, although so far my attempts have been more like
code generators that just concatenate instructions and take care of loop
target alignment and so on: SSE2/SSE4.2/AVX2 depending on the runtime CPU
architecture. Achieved performance has been very good. There seems to be
huge potential; a shame I have so little time to work on this.

Large pages (2MB+) can sometimes contribute a nice amount of extra
performance, depending of course on access patterns. They can also have a
negative effect under some circumstances, like some very random access
patterns. Gigabyte pages could help there, but support isn't great.

Another thing I've investigated is memory channel interleaving. Local memory
seems to be mostly 64 bytes per channel, round robin, but I guess it can be
more complicated too. NUMA systems seem to be either round robin at 4096
bytes per NUMA region, or all CPU-local memory in one multi-gigabyte (?)
chunk. Understanding memory interleaving can help balance the work between
different memory channels.

------
jeffreyrogers
For those who aren't aware, this is an abstract for a talk to be given next
month by Daniel Bernstein, who has written several very high quality C
programs, including qmail, daemontools, and djbfft (a very fast library for
computing FFTs). He presumably knows what he is talking about.

------
amelius
I think what we need is a programming language that addresses performance
orthogonally to correctness/functionality. What this means is that a
programmer could write his code in a very easy-to-understand functional style
(say), and then write a library of rewrite rules that allows the program to
be executed more efficiently.

~~~
Dewie
That sounds like library writers being able to bake optimizations (rewrite
rules) into their libraries. Which makes sense; you delegate the
optimizations to domain experts (e.g. in data structures) instead of relying
only on the general optimizations that the compiler writers had time to add.

------
mike_hearn
And the counter argument:

[http://www.chrisseaton.com/rubytruffle/cext/](http://www.chrisseaton.com/rubytruffle/cext/)

~~~
dalke
I read it, but I don't see the counter-argument. It seems to say that people
who want a C extension for Ruby, for performance reasons over native Ruby, can
use a system which dynamically compiles the C extension code.

While this proposal seems to be that fewer and fewer programmers know how to
optimize against the machine.

~~~
mike_hearn
So optimising compilers are becoming more important, not less, then.

Edit: to clarify, the interesting thing about that blog is not the cross-
language interop magic, though that is pretty cool; it's that a smart compiler
that can compile C and Ruby together simultaneously can delete vast swathes of
abstraction and overhead to produce radical speedups, just through better
optimisation. So whilst things might not be moving very fast in the world of
more traditional compilers, research compilers with a focus on dynamic
languages are still showing big speedups. It's a bit early to proclaim the
death of optimising compilers just yet, though perhaps I've just misunderstood
the abstract. I'm not sure the lowest-hanging fruit for compiler developers is
using more and more exotic machine instructions.

~~~
dalke
I view it as two different things.

There are many optimizations which work on a simple machine model that hasn't
changed much since the VAX. There's memory with constant time lookup, some
registers, and assembly code.

Real hardware changed long ago. As a simple example, on traditional hardware
it's best to save and re-use intermediate values, but on modern hardware RAM
lookup is so slow that it can be faster to recompute rather than doing a
random lookup. An optimizing compiler isn't going to do much if the code
wasn't designed to match the hardware, or expects a different hardware model.
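A toy illustration of that trade-off (the table and function names are made
up for the example): both paths produce the same answer, but one is a memory
access, potentially a cache miss, and the other a few ALU cycles. Choosing
between them means restructuring the code, which is exactly the kind of
change an optimizing compiler won't make for you.

```rust
// Lookup path: a load from a precomputed table (possibly a cache miss).
fn square_lookup(table: &[u64], x: usize) -> u64 {
    table[x]
}

// Recompute path: one multiply, no memory traffic.
fn square_recompute(x: usize) -> u64 {
    (x as u64) * (x as u64)
}

fn main() {
    let table: Vec<u64> = (0..16u64).map(|x| x * x).collect();
    // Same result either way; only the hardware cost differs.
    assert_eq!(square_lookup(&table, 5), square_recompute(5));
    println!("ok");
}
```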

Lowering the impedance mismatch between two systems, whether between two
different languages as in your example, or serialization/deserialization, or
call thunking, belongs to an old and well-understood set of optimizations
which are nearly insensitive to the machine model.

RAM lookup, and cache performance, and cache lines, are important for high
performance code. Though as the abstract says, most people only work with
"freezingly cold" code, so aren't aware of the limitations in the conceptual
machine model vs. the actual machine. Fixing this almost always requires deep
changes to the code, which an optimizing compiler can't do.

------
101914
"...most users still spend time waiting for computers."

------
ufo
I'm only seeing the abstract here. Is there a link for the full tutorial?

~~~
morcheeba
The presentation hasn't happened yet. It will take place at ETAPS 2015, 11-18
April 2015, in London, UK.

------
Dewie
I don't have much faith in _future processors_ if it really is impractical to
make optimizing compilers for them.

~~~
Animats
Well, that's happened in the past. Itanium and the Cell made it to volume
production while presenting extremely hard problems for a compiler.

I once met the group from HP who were trying to write an optimizing Itanium
compiler. It wasn't going well at all. The Itanium could execute several
instructions simultaneously, but the compiler had to block them together and
specify this. About a hundred instructions at a time had to be scheduled
together to get decent optimization.

