
The death of optimizing compilers [pdf] - fcambus
http://cr.yp.to/talks/2015%2E04%2E16/slides-djb-20150416-a4.pdf
======
juliansimioni
I wish this were given in a better format instead of just slides; hopefully
the talk itself is better. Here's what I think the talk is about:

The pervasive view of software performance is that compilers are better than
humans at optimizing code, but the few humans who optimize important bits of
code to the maximum extent disagree. Similarly, computer programs today are
increasingly diverging into a state where there is a tiny amount of extremely
performance-critical code, and a large amount of code where performance is so
good on our hardware today that even horribly unoptimized code has no
noticeable effect on performance.

Thus, optimizing compilers are useless on the first type of code (humans are
better), and useless on the second (performance doesn't matter). So what good
are they at all?

If optimizing compilers aren't useful, what system should we use instead for
making performant code? The author and collaborators' experience suggests that
the reasons a compiler can't optimize code as well as a human when it matters
is that our current programming languages don't give the compiler enough
information about the intent of the code to optimize it. Therefore we should
design programming languages that on the surface look very unoptimized, but
specify enough information that compilers can do a really good job. It sounds
like no one knows what such a programming language would look like.

~~~
ch
Haskell? :)

~~~
eru
Haskell is a good starting point. It's rather high level.

I'd like to see a variant that banishes bottom _|_ to the same corner that
unsafePerformIO already occupies today. I want to see total Haskell. If all
programs had to halt by default, the compiler would be much freer in choosing
evaluation models---strict vs lazy would make no semantic difference.

~~~
gamegoblin
See the Morte language proposed (not sure if actively developed) by Gabriel
Gonzalez, author of many popular Haskell libraries.

~~~
agumonkey
Found this discussion on the subject pretty interesting:
[http://www.reddit.com/r/haskell/comments/2g6wsx/haskell_for_...](http://www.reddit.com/r/haskell/comments/2g6wsx/haskell_for_all_morte_an_intermediate_language/)

------
ezyang
The talk was given to a packed room at ETAPS, a European computer science
conference leaning more on the theoretical side (I suppose everyone was
curious how optimizing compilers were dying). All in all, the audience did not
bust out the pitchforks, although one might say that "domain-specific
compilers" is basically the direction academia has already been heading. I
doubt any of the people in the audience who were working on compilers/JITs are
planning to stop working on them, though it did make for some fun dinner
discussion.

One mid-talk exchange had a professor asking djb upfront whether or not he
thought that, in ten years, Mike Pall (author of LuaJIT) would be out of a
job--after all, JITs are basically optimizing compilers. Well, the original
question was more diplomatic than that, but eventually he pushed it enough
that he got djb to not deny that this would be the case.

The talk was somewhat marred by a very large digression into an undergrad
level primer of computer architecture (it probably would have been better
served by an extended Q&A session), although the sorting example he finally
built up to was pretty cute.

------
robmccoll
I really like the conversation with the compiler approach. I had the good
fortune to write some code for one of these:
[http://en.m.wikipedia.org/wiki/Cray_XMT](http://en.m.wikipedia.org/wiki/Cray_XMT)
which is a multi-socket, TB-scale shared-memory machine with 128 hardware
thread contexts per socket. It has an autoparallelizing C compiler that
attempts to parallelize, mostly for-loops where it thinks it can (really quite
a clever thing), but you can also tell it where to do things and give it hints
through compiler pragmas. The compiler infrastructure will print out annotated
copies of the code that tell you where it did and didn't parallelize and why.
The effect is that you have this conversation with the compiler in which each
of you tries to tell the other how to make the code more parallel (which in
the case of the XMT means better and faster). It's very simple but the result
can be orders of magnitude improvement.

~~~
krylon
A couple of years back I was working as a programmer on an application
written in C. The application was always built with all optimizations
disabled, because a bug in the compiler's optimizer (is that the correct
term?) had bitten the programmers, so they had become really careful.

One day, as an experiment, I built the application with pretty much all safe
optimizations enabled and did a simple benchmark that showed a performance
improvement of ... (drum-roll) two percent. This left me scratching my head,
wondering if that meant the compiler really sucked at optimizing the code or
if the code was so badly written the optimizer could not really do anything
about it (or maybe it was so well-written it already was close to optimal
without "optimizing" it? Probably not.).

How nice would it be, I thought, if the compiler could give me some kind of
feedback on which parts of my source code it did or did not optimize, and why.

I am relieved to hear (well, read, technically speaking) that I am not the
only person to have had that idea.

Many programmers have all kinds of ideas in their heads about how writing
their code in a certain way will make it easier for the compiler to optimize
the heck out of it, but I guess few actually go and check whether their
assumptions have any basis in reality.

It would be so incredibly nice to have a compiler that told you about such
things, so you could _know_ if your assumptions were true or bogus.

 _Sigh_

~~~
jleyank
SGI's compilers provided this feedback back in the 90's. I have been working
with domain-specific languages for years, but I would have thought that what
was relatively common back then had persisted...

In my experience (computationally-intensive numerical simulations), the
optimizers helped, although we had to use a language that stayed close to the
algebra (Fortran derivatives), write C-tran with #pragmas to help out the
optimizer, or use hand-coded libraries like LAPACK.

~~~
kragen
I'm guessing that the XMT compiler mentioned upthread came from Tera, rather
than SGI, but it's still interesting that significant parts of SGI (maybe
including those same compilers?) ended up at the same company as the XMT
compilers.

------
p932
Some bits of Fran Allen in Coders At Work book:
[http://www.codersatwork.com/fran-allen.html](http://www.codersatwork.com/fran-allen.html)

"We were making so much good progress on optimizations and transformations. We
were getting rid of just one nice problem after another. When C came out, at
one of the SIGPLAN compiler conferences, there was a debate between Steve
Johnson from Bell Labs, who was supporting C, and one of our people, Bill
Harrison, who was working on a project that I had at that time supporting
automatic optimization. The nubbin of the debate was Steve’s defense of not
having to build optimizers anymore because the programmer would take care of
it. That it was really a programmer’s issue"

"We have seriously regressed, since C developed. C has destroyed our ability
to advance the state of the art in automatic optimization, automatic
parallelization, automatic mapping of a high-level language to the machine.
This is one of the reasons compilers are . . . basically not taught much
anymore in the colleges and universities. "

------
mafribe
I attended djb's ETAPS talk and am still not sure whether he was deliberately
provocative or genuine. Assuming the latter, I disagree with several of his
points. Here I want to bring one to the readers' attention and it's to do with
the economics of correctness proofs.

One of his key arguments was that compiler optimisations are difficult to
prove correct, and that's one of the reasons why optimising compilers will be
replaced by a combination of simple compilers + assembly hand-written by
domain experts. It is true that such proofs are (currently) expensive, but
this misses the point of the economics of correctness proofs: correctness
proofs are difficult, but proving a compiler correct amortises that cost over
all subsequent uses. In contrast, program-specific correctness proofs are
typically of comparable difficulty, but don't amortise in this way. Therefore
it seems cheaper in the long run to focus on the correctness of optimising
compilers. Moreover, compilers and optimisations are quite a restricted class
of algorithms, so it is more likely that we can reuse (parts of) correctness
proofs and prover technology for compilers.

~~~
sanxiyn
As I understand, the point is precisely that optimizations are not a
restricted class of algorithms, because you need entirely new optimization
techniques for new domains.

~~~
mafribe
I'm not sure I understand what you mean. Those "entirely new optimization
techniques" cannot be baked into the compiler, they will have to be invented
and implemented by a human programmer on an ad-hoc basis. Compilers only use
general purpose optimisations.

------
peapicker
To finish the Don Knuth quote: "There is no doubt that the grail of efficiency
leads to abuse. Programmers waste enormous amounts of time thinking about, or
worrying about, the speed of noncritical parts of their programs, and these
attempts at efficiency actually have a strong negative impact when debugging
and maintenance are considered. We should forget about small efficiencies, say
about 97% of the time: premature optimization is the root of all evil. "

"Yet we should not pass up our opportunities in that critical 3%. A good
programmer will not be lulled into complacency by such reasoning, he will be
wise to look carefully at the critical code; but only after that code has been
identified."

~~~
zinxq
Sadly, at least some of the points this quote makes are outdated.

For example, it would behoove a large company to spend a great deal of time
optimizing, say, their JSON parsing library. Although it may not show up as a
hotspot in any one place in their immense codebase, its extreme prevalence
causes performance degradation subtly but pervasively.

I also measured injected object creation using Guice to be 40x slower than a
simple constructor in Java (agree or disagree with the 40x, but using
reflection to set a variable instead of simple object construction is
intuitively far slower).

Guice may not show up on any profiler as a problem - but if you slow down
object creation by a factor of 40x, something you may do thousands of times
per second for the life of your program, you are degrading performance across
the board. Rather the same as if you simply clocked your CPU down a few
hundred MHz.

~~~
al2o3cr
"Guice may not show up on any profiler as problem - but if you slow down
object creation by a factor of 40x, something you may do thousands of times
per second for the life of your program, you are degrading performance across
the board."

If your profiler doesn't point right at the thing that you're doing 40x slower
thousands of times, either your profiler is BROKEN or that isn't actually a
bottleneck.

~~~
thrownaway2424
If you have ten thousand different kinds of objects then the creation of
instances of any given class will not rise to the top of the profile.

------
nullc
I often boggle at people who claim that compilers are magic and outperform
humans-- perhaps that's true for unimportant code that you'd pay no attention
to, or with developers who aren't familiar with the underlying micro-
architecture at all.

It's pretty usual for me to see a factor-of-2 performance difference for the
same algorithm implemented in the same manner when moving from SIMD intrinsics
(which map almost directly to the underlying platform) to hand-coded ASM.

Even non-SIMD code can result in some pretty stark changes.

A non-SIMD example from a crypto library I work on which isn't (yet) very well
optimized for ARM, benchmarked on my novena (with GCC 4.9.1 -O3):

ecdsa_verify: min 1927us / avg 1928us / max 1929us

And a hand conversion of two hot inner functions (which are straight-line
multiply-and-add lattices, no control flow) into ARM assembly:

ecdsa_verify: min 809us / avg 810us / max 811us

Again, same algorithm.

(The parallel change for x86_64 is still significant, but somewhat less
extreme; in that case converting the same functions is only a 15% speedup
overall, partially because the 64-bit algorithm is different.)

When that's a difference which results in 2x the amount of hardware (or 15%
for that matter) for a big deployment, it can justify a LOT of optimization
time.

(Or in my case, the performance of this code will have a material effect on
the achievable scale and decentralization of the Bitcoin network.)

From a straight development-time vs. performance perspective I'd use even more
hand-written code ... but there is a maintenance/auditing/review/verification
overhead too. And often the same code that you cannot tolerate being slow you
also cannot tolerate being wrong.

------
tel
Copied from lobste.rs since I think it'd be interesting to the audience here
as well

\---

This “dialogue with the compiler” bit that djb lands on is in some sense
obviously the right way to go forward. I’ve found this to be the case not in
optimization—though I’m not in the least bit surprised to see it there—but
instead in the world of dependent types. The language that the program writer
writes in is often just a skeleton of the genuine information of the program.
For instance, in a dependently typed program it’s often very difficult for an
author to immediately write all of the complex intermediate types required to
drive proofs forward, but it’s much easier to achieve this in collaboration
with the compiler (really the typechecker, e.g. via a tactics language and an
interactive Proof General-like interface). The ultimate result, the
“elaborated” program, contains much, much more information than the skeletal
program the programmer originally wrote. It has been annotated by the
collaboration of the compiler and the program writer to extract more of the
programmer’s intention.

The same kind of thing could clearly arise from a “collaboration” over
optimization. It’s even quite possible that these are the same dialogue as
dependent types certainly provide much more information to the compiler about
the exact properties the code ought to substantiate—in a nice, machine
readable format even.

~~~
tel
It's worth adding for the archive that this link is highly relevant:
[http://cs.ru.nl/~freek/courses/tt-2010/tvftl/epigram-notes.pdf](http://cs.ru.nl/~freek/courses/tt-2010/tvftl/epigram-notes.pdf)

------
dfbrown
An optimizing compiler may not be better than me at optimizing hot code paths,
but my time is a very limited resource. The compiled version may only be 75%
as fast as my hand optimized version, but writing that hand optimized version
will likely take several times longer. Sometimes it is worth spending the
extra time for that performance, but usually it is not.

~~~
danieldk
Indeed, plus you need multiple hand-optimized versions. Not only per
architecture, but also e.g. pre-AVX and AVX. An optimizing compiler will give
you optimizations for all current and future platforms for free.

Another problem is that the number of people who can write good general hand-
optimized assembly is small. E.g. I used a numeric Go library (which I will
not name, because I should've submitted an issue) that used
assembly-'optimized' routines. Replacing those with simple C loops and turning
on auto-vectorization beat those hand-written routines handsomely.

------
corysama
Very related, but surprisingly not covered here, is what Mike Acton covered in
depth in his CppCon14 keynote "Data-Oriented Design and C++":

[https://www.youtube.com/watch?v=rX0ItVEVjHc](https://www.youtube.com/watch?v=rX0ItVEVjHc)

[http://www.slideshare.net/cellperformance/data-oriented-design-and-c](http://www.slideshare.net/cellperformance/data-oriented-design-and-c)

And that is: because of the ever-growing disparity between ALU and IO speeds,
the vast majority of time spent waiting on computers is due to issues that
_the compiler can not optimize_. In general, compilers have very few
opportunities to rearrange your data structures without your explicit, manual
input. They can't do much to make your CPU stall less on memory/disc/network
IO. They can only help when your CPU actually has the data it needs to
proceed--which is often less than 20% of total execution time.

In that case, no matter how smart GCC gets, it probably can't ever speed up
your existing code by more than 20% over what it does today. It's not allowed
to by the spec. I'm not aware of any general-purpose language where this is an
option to any significant degree (silent AOS-to-SOA, hot-vs-cold data
segregation, tree clustering, etc...)

If your program is too slow, it's almost certainly because you haven't done
the hard, still-manual work of optimizing your data access patterns. Not just
your Big-O's, (N^2) vs (NlogN), but also your Big-O's hidden, implicit K. The
K that academia actively ignores and that most people rarely think about,
because it is mostly composed of cache misses that are implicit and invisible
in your code. x = sqrt(y) is super cheap compared to x = *y. But the same
people who fret over explicit ALU costs usually think very little of x->y->z.

~~~
taliesinb
> I'm not aware of any general-purpose language where this is an option to any
> significant degree (silent AOS-to-SOA, hot-vs-cold data segregation, tree
> clustering, etc...)

Not yet, of course, because it hasn't been released, but Jonathan Blow's Jai
language offers transparent switching between AOS and SOA, as well as many
other features that make it easy to explore the 'optimization space' of a
program manually.

[https://www.youtube.com/watch?v=ZHqFrNyLlpA](https://www.youtube.com/watch?v=ZHqFrNyLlpA)

(for those who don't know, Jonathan Blow was the creator of Braid)

------
mightybyte
If you agree with the author's conclusion that we need better languages that
allow us to give the compiler more information about the optimization needs of
our program, then I think you have to look in the direction of languages like
Haskell, Idris, etc. Fortran can be faster than C because C has aliasing that
limits the optimizations that the compiler can perform. Similarly, strongly
typed and pure languages like Haskell give you even more. You can do a lot
with types, but on top of that Haskell allows you to define rewrite rules that
tell the compiler how things can be optimized. This allows the compiler to
automatically convert things like this:

    map f (map g items)

...into this:

    map (f . g) items

~~~
e12e
Assuming f and g are not evaluated for their side-effects...

While I get that Haskell allows/demands you to write what you mean, consider
an app that adjusts reverb and volume up, from 1..10 (here items is the range
1..10). You could argue (rightfully) that the code should be structured
differently (in this imaginary case, f and g would be identity functions on
integers -- or perhaps Maybe integers: return the int if volume/reverb was
set to the requested number, error/nil otherwise).

I realize this is a strange and contrived example -- the main point is that
while functional programming is great -- iff what is wanted is a single
multidomain language, it needs to function (heh!) in many different paradigms.

Btw, what would be the appropriate way to define functions for their side
effects in Haskell - eg a function like this that takes an integer, performs
some kind of i/o, interrupt, writes to a hw address -- to set some external
state -- and returns the new value on success? Anyone have some pointers to
that at the top of their head? Maybe there's a chapter in "Real World Haskell"
that deals with it?

[ed: I just realized that the behaviour of (f . g) and map-map might be
similar with the "Maybe identity" behaviour I sketched above - modulo error
handling/recovery (eg: do we want to try f on a given int i, even if g
failed?). Oh well :-) ]

~~~
mightybyte
> Assuming f and g are not evaluated for their side-effects...

And that is the whole point. In Haskell we can know that f and g have no side
effects. If they had side effects you would not be able to compose them with
the dot operator.

I don't quite understand your example but I assure you that there are elegant
ways to accomplish that in Haskell. There was recently a presentation [1] at
the NY Haskell meetup that talked about a new Haskell sound synthesis library
called vivid.

Making functions that perform side effects is also easy. Here's a simple
example:

    intToFile :: Int -> IO ()
    intToFile anInt = do
        writeFile "myfile.txt" (show anInt)

The idea that you have to be multiparadigm to function in real world settings
is a fallacy. Haskell is simply more expressive, which equips it to solve just
about any problem with more clean and concise code than you're likely to get
in other languages. Even if the code size stayed the same the extra safety you
get from purity and strong types would be worth it alone.

[1]
[https://www.youtube.com/watch?v=xo3zUvPsizo](https://www.youtube.com/watch?v=xo3zUvPsizo)

~~~
guipsp
>The idea that you have to be multiparadigm to function in real world settings
is a fallacy. Haskell is simply more expressive, which equips it to solve just
about any problem with more clean and concise code than you're likely to get
in other languages. Even if the code size stayed the same the extra safety you
get from purity and strong types would be worth it alone.

Haskell is also not used in any major project. It's an academia language.

~~~
e12e
I don't think it's quite right to say it's (still) _just_ an academia language
-- but it's absolutely marginal. I use two (three) programs implemented in
haskell on a (semi)regular basis: xmonad (and xmobar) and pandoc.

Other than that, I know of git-annex, which is both an actual program _and_
in actual use.

That's still more than the number of programs I use that are implemented in
Arc...

Haskell is academic in the sense that it is _very_ opinionated about a few
things (as this sub-thread illustrates).

Let's see how Shen, Open Dylan and Clojure end up doing ... (and of those,
probably only Clojure has meaningful _contemporary_ systems implemented in it
- so far). Maybe there's even more life left in Common Lisp (sbcl etc) or
Scheme (Racket, guile w/guix nix (haskell! yay!) ...).

Then there's OCaml of course...

------
sp332
More discussion from a month ago
[https://news.ycombinator.com/item?id=9202858](https://news.ycombinator.com/item?id=9202858)

------
Animats
There are a few basic optimizations we should routinely have in compilers
today, but often don't.

\-- Multidimensional array optimizations. Basically, get to the level of
subscript calculation overhead seen in FORTRAN compilers of 50 years ago.

\-- Subscript checking optimization. Work on Pascal compilers in the 1980s
showed that about 95% of subscript checks could be eliminated or hoisted out
of inner loops without loss of safety. This was forgotten during the C era,
because C is vague about array sizes. Go optimizes out checks for the simple
cases; Rust should and probably will in time. Optimization goal: 2D matrix
multiply and matrix inversion should have all subscript checks hoisted out of
inner loops or eliminated. Compilers that don't have this feature lead to
users demanding a way to turn off subscript checking. That leads to buffer
overflows. (Compilers must know that it's OK to perform a subscript check
early; it's OK to abort before entering a loop if there will inevitably be a
subscript error at iteration 10000.)

\-- Automatic inlining. If the call is expensive relative to the code being
called, inline, then optimize. Ideally, this should work across module
boundaries.

\-- Atomic operations and locking. The compiler needs to know how to do those
efficiently. Calling a subroutine just to set a lock is bad. Making a system
call is worse. Atomic operations often require special instructions, so the
compiler needs to know about them.

\-- Common subexpression elimination for pure functions. Recognize pure
functions (where x=y => f(x) = f(y) and there are no side effects) and
routinely optimize. This is essential in code with lots of calls to trig
functions.

~~~
dbaupp
_> Go optimizes out checks for the simple cases; Rust should and probably will
in time._

rustc uses the industrial-strength LLVM optimiser, which is perfectly capable
of eliminating bounds checks: almost certainly more capable than the main Go
compiler in any case.

This has been pointed out to you several times, and so your repeated
assertions otherwise are now almost malicious. Maybe you could be more
concrete (e.g. with an instance of a subscript check eliminated by the Go
compiler but not rustc)?

~~~
danieldk
_" Rust essentially never wastes cycles on bounds checking thanks to the
design of its iterators. The Servo team tells me that bounds checking has
never shown up in any performance profiles."_

Source:
[https://news.ycombinator.com/item?id=9392131](https://news.ycombinator.com/item?id=9392131)

~~~
dbaupp
That's... not exactly relevant. :)

Rust's iterators definitely resolve a lot of problems one might have with
bounds checking, but there's still times when one is essentially required to
do explicit indexing (e.g. iterating down a column of a multidimensional
array) but in such a way that the bounds checks should be removable.

My point was that, modulo bugs, rustc (more specifically, LLVM) _will_
eliminate such bounds checks. With or without iterators.

~~~
Animats
I really need to write a matrix multiply and see what code Rust generates.

------
AKrumbach
"For some reason we all (especially me) had a mental block about optimization,
namely that we always regarded it as a behind-the-scenes activity, to be done
in the machine language, which the programmer isn't supposed to know. This
veil was first lifted from my eyes when I ran across a remark by Hoare that,
ideally, a language should be designed so that an optimizing compiler can
describe its optimizations in the source language. Of course!"

That sounds like he wants some sort of homoiconic assembly or machine language
to target. Does such a thing even exist?

~~~
justincormack
Some optimisations can be described in any language (eg dead code
elimination). But something that maps 1:1 to LLVM IR would work for pretty
much everything.

~~~
avmich
"Lisp is... a good assembly language because it is possible to write Lisp in a
style that directly reflects the operations available on modern computers."

Peter Norvig, Paradigms of Artificial Intelligence Programming.

------
DannyBee
He actually quotes my rebuttal comment - "Except, uh, a lot of people have
applications whose profiles are mostly flat, because they've spent a lot of
time optimizing them."

and his response is "this view is obsolete, and to the degree it isn't, flat
profiles are dying".

Oh great, that's nice, I guess I can stop worrying about the thousands of C++
applications Google has built that display this property, and ignore the fact
that, in fact, the profiles _have gotten more flat over time_, not less flat.
Pack it up boys, time to go home!

Basically, he's just asserting I'm wrong, with little to no data presented,
when I'm basing mine on the results of not only thousands of Google programs
(which I know with incredible accuracy), but _the thousands of others at other
companies that have found the same_. I'm not aware of him poring over
performance bugs for many many thousands of programs for the past 17 years. I
can understand if he's done it for his open source programs (which are
wonderful, BTW :P)

He then goes on to rebut other comments with simple bald assertions (like the
luajit author's one) with again, no actual data.

So here's some more real data: GCC spent quite a while optimizing interpreter
loops, and in fact did a better job than "the experts" or whatever on every
single one it has been handed.

So far, as far as I can tell, the record is: if GCC didn't beat an expert at
optimizing interpreter loops, it was because they didn't file a bug and give
us code to optimize.

There have been entire projects about using compilers/jits to supplant hand-
written interpreter loops.

Here's one: [https://code.google.com/p/unladen-swallow/wiki/ProjectPlan](https://code.google.com/p/unladen-swallow/wiki/ProjectPlan)

While the project was abandoned for other reasons, it produced 25+% speedups
over the hand written interpreter versions of the same loop by doing nothing
but using compilers.

Wake me up when this stops happening ....

He then goes on to make further assertions misunderstanding compiler authors
and what they do: "A compiler will not change an implementation of bubble sort
to use mergesort. ... they only take responsibility for machine-specific
optimization".

This is so false I don't know where to begin. Compilers would, if they could,
happily change algorithms, and plenty do. They change the time bounds of
algorithms. They do, in fact, replace sorts. Past that, the problem there is
not compilers, but that the semantics of languages often do not allow them to
safely do it.

But that is usually a programming language limitation, and not a "compilers
don't do this" problem.

For example, the user may be able to change the numerical stability of an
algorithm, but the programming language may not allow the compiler to do so.

Additionally, it's also generally not friendly to users.

As an example: ICC will happily replace your code with Intel performance
primitives where it can. It knows how to do so. These are significant
algorithm changes.

But because users by and large don't want the output of ICC to depend on
Intel's Math Kernel Library or anything similar, they don't usually turn it
on by default.

GCC doesn't perform quite as much here, because even things like replacing
"printf" with "puts" has caused tremendous amounts of annoyed users. Imagine
the complaints if it started replacing algorithms.

Past that, I'd simply suggest he hasn't looked far enough into the history of
optimizing compilers, because there has been _tons_ of work done on this.
There are plenty of high level language optimizers that have been built that
will completely rewrite or replace your code with rewritten algorithms, etc.

I stopped reading at page 50.

~~~
mpweiher
Maybe you should have continued to the end, because as far as I can tell, you
are almost completely missing the point.

You are saying "look at all the wonderful things optimizing compilers can do".
And he is saying "that's cool, but it doesn't really matter".

Now I am pretty sure he would concede immediately that there are _some_
examples where it matters, but my experience matches what he is saying very
closely.

A little bit on my background: hired in 2003 by the BBC for the express
purpose of making one of their systems faster, succeeded at around 100x -
1000x with simpler code[1], hired by Apple to work on their Mac OS X
performance team, similar successes with Spotlight, Mail indexing, PDF, etc.
Also worked with/on Squeak. Squeak runs a bytecode interpreter, with a bunch
of primitives. The interpreter is perfectly adequate for the vast majority of
the system. Heck, the central page imposition routine in my BookLightning PDF
page imposition program[2] is written in Objective-Smalltalk[3], or more
precisely an interpreter for the language that's probably the slowest computer
language implementation currently in existence. Yes it's slower than Ruby, by
a wide margin. And yet BookLightning runs rings around similar programs that
are based on Apple's highly optimized Quartz PDF engine. And by "rings" I mean
order(s) of magnitude.

Why is BookLightning so fast? Simple: it knows that it doesn't have to unpack
the individual PDF pages to impose them, and is built on top of a PDF engine
that allows it to do that[4]. The benefit of an optimizing compiler in this
scenario? Zilch.

At the BBC, the key insight was to move from a 20+ machine, distributed,
SQL-database-backed system to a single-JAR event logger working in-memory
with a filesystem-based log[5]. How would a compiler make this
transformation? And after the transformation, we probably could have run
interpreted bytecode and still have been fast enough, though the JIT probably
helped us a little in not having to worry about the performance of the few
algorithmic parts.

As to changing a bubblesort to a mergesort, you hand-wave around that, but I
think the point is that this is not a safe transformation, because the author
may have specifically chosen bubblesort because he likes its characteristics
for the data he knows will be encountered (or cares more about that case).

When I worked at Apple, I saw the same pattern: low-level optimizations of the
type you discuss generally gained a few percent here and there, and we were
happy to get them in the context of machine/system wide optimizations.
However, the application-specific optimizations that were possible when you
consider the whole stack and opportunities for crossing or collapsing layer
boundaries were a few factors x or orders of magnitude.

And here is where the "optimizing compiler" mindset actually starts to get
downright harmful for performance: in order for the compiler to be allowed to
do a better job, you typically need to nail down the semantics much tighter,
giving less leeway for application programmers. So you make the automatic
%-optimizations that don't really matter easier by reducing or eliminating
opportunities for the order-of-magnitude improvements.

And yes, I totally agree that optimizations should be user-visible and
controllable, rather than automagically applied by the compiler. Again, the
opportunities are much bigger for the code that matters, and the stuff that is
applied automatically doesn't matter for most of the code it is applied to.

[1]
[http://link.springer.com/chapter/10.1007%2F978-1-4614-9299-3...](http://link.springer.com/chapter/10.1007%2F978-1-4614-9299-3_11)

[2] [http://www.metaobject.com/Products/](http://www.metaobject.com/Products/)

[3] [http://objective.st](http://objective.st)

[4]
[http://www.metaobject.com/Technology/](http://www.metaobject.com/Technology/)

[5]
[http://martinfowler.com/bliki/EventPoster.html](http://martinfowler.com/bliki/EventPoster.html)

~~~
haberman
Is it your position that all of OS X could run on Squeak, or gcc -O0, with
performance comparable to what it does now?

Because that seems to be the logical conclusion of your position.

Yes, it may be that low-level optimizations are fighting for single-digit
percentage improvements. And of course re-architecting big systems at a high
level can yield much bigger gains. But you have to remember that the baseline
is an entire system that's already using an optimizing compiler.

If you want to argue that optimizing compilers are dead, you'd have to show
that you can _remove_ optimizing compilers from your toolchain, and have
nobody notice the difference. That's a pretty hard argument to make.

~~~
Rapzid
This is where my mind keeps coming back to. I keep hearing "it doesn't
matter," and yet nearly every compiler I come into contact with is an
optimizing compiler. Heck, even with some pretty straight-up "performant"
code, the difference between optimizations on and off can be significant.

C#, C++, C, Go, V8.

Just about everything. I can only imagine what would happen if every process
on all the systems I deal with were re-compiled with optimizations off :|

------
pron
His vision:

 _The time is clearly ripe for program-manipulation systems... The programmer
using such a system will write his beautifully-structured, but possibly
inefficient, program P; then he will interactively specify transformations
that make it efficient._

But what if the answers the programmer gives the compiler turn out not to
match reality, and some weird bug is introduced that has no representation in
the source? The compiler's decisions need to be somehow spelled out in
debuggable form.

There is another approach (which can also be complementary). The programmer
specifies various specific scenarios in advance, and a JIT compiler guesses
which of the scenarios is in effect and optimizes for that (e.g. that a
certain condition, like the input size being small, always holds), but adds a
guard (which hopefully adds negligible overhead). If the scenario does not
match reality, the JIT deoptimizes and tries another. This process, of
course, adds overhead, but it's warmup overhead, and the result is more
robust. This is the approach being tried by Graal, HotSpot's experimental
next-gen JIT (and Truffle, Graal's complementary programming-language
construction toolkit aimed at optimization):
[https://wiki.openjdk.java.net/display/Graal/Publications+and...](https://wiki.openjdk.java.net/display/Graal/Publications+and+Presentations)

------
acqq
Even though DJB has written some very effective code, when he "goes meta" he
somehow ends up in the strange territory of being "not even wrong." Or maybe
we miss his ideas when we read the slides instead of hearing him at the talk.

People who make production compilers know: if naive users claim that
"optimizing compilers don't matter," it's because optimizing compilers are so
good at what they do.

There's an argument buried deep in the discussions here which I think DJB
failed to address, nicely stated by haberman:

"If you want to argue that optimizing compilers are dead, you'd have to show
that you can remove optimizing compilers from your toolchain, and have nobody
notice the difference."

------
spiritplumber
So, Mel Kaye got the last word in?

[http://en.wikipedia.org/wiki/The_Story_of_Mel](http://en.wikipedia.org/wiki/The_Story_of_Mel)

------
nitwit005
Interacting with a compiler sounds horrible. Think of the questions it might
need to ask: "Hey! I could use AVX2 instructions here, after I inline a bunch
of stuff and eliminate some dead code, but it requires doing a bunch of memory
copying. Is that a good idea?". How would you answer a question like that?

And then, since optimizations are target dependent, you would need to go
through this exercise for each target. Sounds fun.

~~~
pjscott
How you'd answer depends very much on what you're doing -- which is the point
of asking. You might be able to remove the memory copying by fiddling with
alignment somewhere in calling code and adding an annotation saying you did
that. Bam, problem solved. Or you might not be able to do that, in which case
it would still be nice to know what tradeoffs you're making.

~~~
nitwit005
But consider, for it to be worth asking, it's got to be something that's very
difficult to infer, or the compiler wouldn't need to ask. That means it will
probably be difficult for you to infer as well.

In practice, you might have to profile all possible options, which would
quickly turn into a major project if you're asked more than a handful of
questions.

And, of course, even if you can figure it out, that alignment annotation you
just made is only optimal on the one target. Even on x86 it will depend on
what instructions are available.

------
phkahler
I'd be happy if C and C++ had 2-, 3-, and 4-element vectors as built-in
types, along with cross and dot product operations. There are intrinsics, and
GCC has its own intrinsics that can be used across architectures. But the
languages need to have these. They are so fundamental to so many things.

There are many more things to wish for, but I'm starting with one of the
simplest.

~~~
adrianN
Why would you need this in the language? It seems to me that these things can
be provided by a library without any loss.

~~~
phkahler
_Why would you need this in the language? It seems to me that these things
can be provided by a library without any loss._

No loss in function. C++ will certainly let you create classes, but you need
to be extremely good to make them low overhead. You also can't pass them by
value (the intrinsics can). And lastly, classes don't exist in C.

Vector math is so common it should have direct support and operators.

------
anewhnaccount
Here's a compiler which uses program synthesis to target a mesh network type
architecture:
[http://pl.eecs.berkeley.edu/projects/chlorophyll/](http://pl.eecs.berkeley.edu/projects/chlorophyll/)
. It uses a guided process like the one implied in the last slides.

------
zurn
Compiler optimization is obviously a dormant field at the moment. Nothing has
made it into practical compilers to address the bottlenecks shifting from ALU
work to data storage and layout considerations.

Consider all the gnashing of teeth and wringing of hands that goes on in C++
circles about inefficient data layout & representation by inferior
programmers, and the stories of victorious manual data layout refactorings by
performance heroes.

DJB's slides don't address the data side because he only does crypto, and
that's one of the fields where the ALU twiddling is still relevant. But crypto
is also rarely a bottleneck.

------
troydj
The original abstract for the talk (which summarizes the slides) was posted
by the author here:

[http://blog.cr.yp.to/20150314-optimizing.html](http://blog.cr.yp.to/20150314-optimizing.html)

------
carapace
This is very good. I've been working towards something similar to what DJB is
talking about. (Nice to know I'm in good company. :)

In a nutshell, although automated systems will be good (are already and
getting better) there will always be aspects that require humans in the loop
(unless and until the machines actually become sentient, defined in this
context as gaining that _je ne sais quoi_ that humans do seem to have.)

------
copsarebastards
The cases where optimizing compilers aren't good enough are where the Java
HotSpot compiler and similar techniques really shine. Combined with novel
superoptimization techniques, hotspot optimization could far outperform hand-
optimization (although AFAIK that hasn't happened in practice yet).

~~~
jerven
I completely agree, although HotSpot is really hampered by the fact that
objects are allocated willy-nilly on the heap instead of together for
pipeline efficiency. So while HotSpot (and Graal) generate lovely assembly,
the lack of data locality kills a lot of possible performance. Hoping
objectlayout.org changes that!

Yet I think that HotSpot and intrinsics are a nice case study of why
optimising compilers are not dead, even for performance-critical code.
[https://news.ycombinator.com/item?id=9368137](https://news.ycombinator.com/item?id=9368137)
discusses in part how intrinsics (hand-optimised) at some point get beaten by
the optimiser (SIMD/superword) and are then hand-optimised again to beat the
optimiser once more. Mostly because machines change.

A whole problem with static binaries as produced by C and Go compilers is
that they assume machines are static. That leads to lowest-common-denominator
optimiser settings :( When the optimisers are humans this gets even worse:
you end up with optimisations in your C code that were a good idea 15 years
ago, but that make no use of SIMD today (or use too-short SIMD, e.g. SSE2
when AVX-512 is available).

Of course real optimisations happen not by doing the same thing faster but by
doing a faster thing. Take for example the HMMER
([https://news.ycombinator.com/item?id=9368137](https://news.ycombinator.com/item?id=9368137))
family of algorithms. HMMER2 is part of SPEC CPU, and the compiler guys
doubled the speed of this algorithm in about 5 years. Then the
bioinformaticians redid it as HMMER3, which works quite differently at a
global level and gets a 100x speedup in practice.

~~~
copsarebastards
You might find this interesting (it's what I was talking about when I
mentioned superoptimization):

[http://superoptimization.org/wiki/Superoptimizing_Compilers](http://superoptimization.org/wiki/Superoptimizing_Compilers)

------
rurban
So just for a start: Which optimizing compiler actually properly solves the
optimization problems? I know of none.

When I look at the list of optimization solvers for constrained linear or non-
linear programming models (i.e.
[https://en.wikipedia.org/wiki/List_of_optimization_software](https://en.wikipedia.org/wiki/List_of_optimization_software))
and the list of compilers the intersection is still zero.

All optimizers are still using their own ad-hoc versions of oversimplified
solvers, which never fully explore their problem space. Current optimizing
compilers are still just toys without a real-world solver. And it's clear why
so. Current optimizable languages are still just toys without a real world
solvable optimization goal.

You can think of strictly typed, sound declarative languages where solvers
would make sense, or you can think of better languages, like fortress or
functional languages, which are not encumbered by not-properly optimizable
side-effects and aliasing which harm most modern language designs.

~~~
astrange
> or you can think of better languages, like fortress or functional languages,
> which are not encumbered by not-properly optimizable side-effects and
> aliasing which harm most modern language designs.

Remember, if you ban aliasing then you can't express any problems requiring
aliasing! That kind of sucks for writing programs that are fast in the first
place.

I've never had to use a language with really no aliasing (Fortran?), but many
languages forbid possibly overlapping arrays, because of the pain interior
pointers make for GC.

If you were ever to write an image decoder in Java, you'd see some steps
involve reading and writing from different parts of the same image (usually
called intra prediction), but now you have to pass all these extra coordinates
along with the image array. And the optimizer is not good enough to make up
for all the extra work it was just given.

------
jokoon
Isn't that why people advocate C? Isn't C just the type of language where you
can tell the compiler how to optimize?

C might not give very explicit information on how to optimize, but isn't it
simple and bare enough to let the compiler do a better job?

------
wolf550e
audio of the talk:
[http://cr.yp.to/talks/2015.04.16/audio.ogg](http://cr.yp.to/talks/2015.04.16/audio.ogg)

------
raverbashing
What a waste of time.

Yes, specialists can squeeze out the last performance improvements in ASM
compared to C. That doesn't mean that -O2/-O3 and auto-vectorization can't do
a nice job and get to 90% of that.

Optimizations DO matter. Just compare -O0 and -O1. Really. CPUs being fast is
no reason for people not to do that, or for compilers not to optimize at
least a bare minimum.

It's _even better_ that the compiler does it, because optimizing ASM by hand
is very error-prone.

And compilers get better every day. Just look at LLVM.

------
faragon
Room for optimizing compilers = distance between programming languages and CPU
instructions/microarchitecture

------
asgard1024
He is soooo spot on! This "dialogue with the compiler" is going to be really
big in the next decades, but it's in no way a death of automatic optimization,
it's just the beginning of it.

Here's a simple example of how I expect it to work: you write code that uses
a list-like data structure. The compiler then instruments the code and you
run some tests. The tests are evaluated, and the (post?)compiler selects what
kind of data structure to use (to keep the example simple, the choices are
array vs. list). For instance, if you look up elements a lot (based on the
evidence from testing), an array will be chosen as the underlying data
structure.

And you actually get (if you want to see it; normally this information will
be hidden) a little box in the IDE, where the variable is used, that tells
you: "Here, an array will be used." You can then say with one click: "I don't
want an array, make it a list." So for every possible optimization, two
viewpoints are presented: the viewpoint of the compiler (based on evidence
from the tests or static analysis) and the viewpoint of the programmer (which
allows for confirmation or override in case there are some unknown
assumptions).

And if the specifications change (say, we chose list earlier but now we have
actually a lot of direct access to elements), you can just recompile the same
code with the previously-agreed compiler choices removed! And without changing
any line of code, a different and more fitting data structure will be used.

You can easily see this can apply to many things, not just data structures.
You can also see that different ways of implementing the dialogue are
possible, once we syntactically separate the "what" from the "how" in the
programming language. In the future, I believe, we will program just with
abstract data types, and the concrete type will be selected based on evidence
from the running program (or static analysis augmented with that
information). So the dialogue will not happen just with the programmer; the
compiler will also observe the real-world behavior of the program and
facilitate adaptation to it.

In this way, it's even possible to input assumptions that don't have to be
provably correct. This approach can potentially bridge the static-vs-dynamic
types divide, among others.

Finally, Haskell and functional languages are very nice, but I don't think
they are the final word in programming. If we wanted the above, they have
syntactic problems, such as the mixing of concrete and abstract types (type
classes). Also, there are limits to static analysis in the real world. The
future will be a lot more interesting.

------
jingo
The birth of an optimizing assembler.

~~~
astrange
If you're curious, here is an optimizing assembler.

[https://code.google.com/p/mao/](https://code.google.com/p/mao/)

------
bcheung
I would prefer death to the font-size: 2^1000px

------
MrPatan
Trying to read this gave me cancer

