
What Challenges and Trade-Offs Do Optimising Compilers Face? - mpweiher
http://tratt.net/laurie/blog/entries/what_challenges_and_trade_offs_do_optimising_compilers_face.html
======
pizlonator
> However, dynamic compilation has to be even more mindful of the costs of
> optimisation, since time spent dynamically compiling is bad for the user
> (draining their battery or simply taking potential CPU time away from the
> user). Thus while dynamic compilation is the most effective way of
> determining which parts of a program will benefit most from optimisation, it
> has much less leeway to use expensive optimisations than static compilation.

WebKit used LLVM as its optimizing JIT compiler for a couple years, and it was
fine. We only switched away from LLVM when we wrote a compiler that generated
code that was as good as LLVM (for our narrow use case) but in less compile
time; we would probably switch back to LLVM (or any other heavy AOT-style
compiler) if it gave us more throughput than our current compiler. In general,
we're happy to pay compile time to get peak throughput.

The reason why we have this attitude is that we have more than one JIT. WebKit
has three JIT tiers for JS and two for WebAssembly. The top tier does crazy
expensive optimizations (the most expensive of which is a must-points-to
analysis for removing object allocations; in second place is the Yorktown-
style regalloc) and has long compile times, but it's OK because by the time we
run it, we've already done a pretty good job of optimizing the code with the
JIT tier just below the top one. The second-to-top JIT tier in WebKit, called
the DFG JIT, runs very quickly.

Overall this is a great post! I just think that there is even more potential
for perf with JITs than what the post implies.

~~~
amelius
Interesting. Does WebKit run its JIT compiler in a separate thread from the
actual execution?

> In general, we're happy to pay compile time to get peak throughput.

Shouldn't a program which runs a UI (which WebKit basically is) optimize for
lowest latency instead?

~~~
chrisseaton
> Shouldn't a program which runs a UI (which WebKit basically is) optimize for
> lowest latency instead?

Do you want to achieve low latency with low latency, or do you want low
latency and it's so important that you don't mind high latency to get it :)

What I mean is: yes, people want the lowest latency. But you can either
compile quickly, to reach a reasonably low latency when using the app and to
reach that point itself with low latency, or you can compile with
optimisations, and get lower latency once compilation is done but wait with
high latency to reach that point.

Both approaches are 'optimising for latency' so you'd need to be more
specific.

The practical compromise is tiers of compilation, each taking higher latency
to apply but producing lower latency when they're done.
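
A minimal sketch of that compromise (all names and thresholds are made up;
real VMs tune them carefully): each function starts in the cheapest tier and
only pays for a more expensive compile once it has proven itself hot.

    #include <cstdint>

    enum class Tier { Interpreter, Baseline, Optimizing };

    struct Function {
        Tier tier = Tier::Interpreter;
        std::uint64_t calls = 0;
    };

    // Hypothetical tier-up policy, checked on every entry: cheap compiles
    // happen early (low latency to reasonable code), expensive compiles
    // happen late (high latency, but only for code that runs long enough
    // to repay it).
    void on_entry(Function& f) {
        ++f.calls;
        if (f.tier == Tier::Interpreter && f.calls >= 100)
            f.tier = Tier::Baseline;      // quick compile, modest speedup
        else if (f.tier == Tier::Baseline && f.calls >= 10000)
            f.tier = Tier::Optimizing;    // slow compile, peak throughput
    }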

------
reikonomusha
One neat thing about Common Lisp is that you can specify compilation
directives in an extremely granular way: per file, per function, and even in a
scope-delimited fashion within a part of a function. You can do this with
Lisp's DECLAIM, DECLARE, and LOCALLY DECLARE syntax respectively.

Compiler directives are given as numerical levels of speed, safety,
compilation speed, code size, and debuggability. There are other more
specialized directives beyond those.

If you're building a web app and the backend has 2 or 3 really hot code paths,
you can do something like:

    
    
        (defun hot-path (request)
          (declare (optimize (speed 3) (safety 0) (debug 0)))
          ...)
    

This roughly reads: "For ONLY the function HOT-PATH, compile it only with
execution speed in mind. Eliminate extra runtime checks like array bounds
checking or overflow, because I have meticulously checked and proved that not
to be an issue. While you're at it, don't mind how debuggable it is at
runtime; don't record anything about the call or its internal workings,
perform tail call elimination, etc."

I'll usually abstract away the DECLARE with more semantically meaningful
terms: OPTIMIZE-DANGEROUSLY-FAST, OPTIMIZE-FOR-SCRUTINY, etc.
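
One way to spell that abstraction (a sketch; the helper functions in the
second form are hypothetical), matching the example above:

    (defmacro optimize-dangerously-fast (&body body)
      `(locally (declare (optimize (speed 3) (safety 0) (debug 0)))
         ,@body))

    ;; Scope-delimited use: only the hot part loses its safety checks.
    (defun handle (request)
      (validate-request request)        ; compiled with default settings
      (optimize-dangerously-fast
        (crunch-numbers request)))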

In general, in Lisp, I would say it's even considered bad hygiene to flip the
global optimization switch; as djb and the article say, most of a large
program doesn't need it. You also often sacrifice debuggability across the
entirety of your program if you ask the compiler to do what it can to make it
fast.

~~~
greglindahl
The problem with directives is not on day 1, they're awesome on day 1. The
problem is usually that the directives are in the wrong place 10 years later.

Loop unrolling directives went through that process in just a few years, from
giving a nice speedup, to compilers being able to mostly match the directive,
to the directive causing slowdowns, to compilers ignoring that directive.

~~~
fao_
... at which point you can edit a single point in the source code to fix it,
either defining hot-path to be equivalent to eval, removing it outright, or
altering it to be better. Isn't that the point of abstraction?

------
chrisaycock
I've been fascinated with recent advances in compiler optimization.

The article mentioned superoptimization. Google's Souper uses an SMT solver
(like Z3) to make LLVM peephole optimizations.

[https://github.com/google/souper](https://github.com/google/souper)

Alternatively, Alive allows users to specify _rewrite rules_ for peephole
optimizations. The rules are verified with Z3 and then compiled into an LLVM
pass. If the rewrite can't be verified, then Alive won't allow it.

[https://github.com/nunoplopes/alive](https://github.com/nunoplopes/alive)
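
To give the flavour of such a rule (stated as C++ source for illustration;
Alive itself works on LLVM IR with its own rule syntax): a verifier accepts
the first rewrite below unconditionally and rejects the second unless a
precondition is attached.

    #include <cstdint>

    // Unconditionally valid peephole: multiply-by-two becomes an add.
    std::uint32_t before_mul(std::uint32_t x) { return x * 2; }
    std::uint32_t after_mul(std::uint32_t x)  { return x + x; }

    // NOT unconditionally valid: signed division rounds towards zero while
    // an arithmetic shift rounds towards negative infinity, so these
    // disagree for odd negative x. The solver catches this and demands a
    // precondition such as x >= 0.
    std::int32_t before_div(std::int32_t x) { return x / 2; }
    std::int32_t after_div(std::int32_t x)  { return x >> 1; }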

And then there's STOKE, which uses stochastic search to investigate possible
program transformations of x86 to find non-obvious code sequences. STOKE also
verifies the resulting transformation with Z3.

[https://github.com/StanfordPL/stoke](https://github.com/StanfordPL/stoke)

------
taneq
> [O]ptimising compilers are operating in a region between a problem that’s
> not worth solving (most parts of most programs) and a problem they can’t
> solve (the really performance critical parts).

I know this isn't the position being put forward or defended by the author but
I feel it's worth addressing nonetheless.

The flaw in this logic is that whether a problem is worth solving or not
depends not only on the benefits of doing so, but also the costs. In this
case, a compiler which optimises for size can, for almost zero cost, provide a
modest benefit. This is absolutely worth doing.

------
lmm
What we need is not to drop the optimising compiler entirely, but make the
parts compilers are good at available for programmers to use in a more
controlled way. Something along the lines of what
[https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/](https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/)
does to the database.

~~~
mhh__
Optimizing compilers are purely an implementation detail of a language: this
shouldn't leak into the language (if that's what you meant).

~~~
lmm
I used to think that, but I no longer do. Most code is plenty fast enough
already, sure, but for the cases where performance is a requirement, it needs
to be exposed and accessible in a way that the developer can work with it, the
same way as for any other feature. Compare what's happened with 3D graphics
APIs - OpenGL tried to abstract over the hardware capabilities and give the
illusion that your card was always capable of everything, but actually that
just resulted in developers having to almost reverse engineer what the OpenGL
implementation was doing so that they could find out what the actual hardware
fast path was and use it. So newer APIs concentrate on exposing more of the
hardware details and giving the developer more precise control over what's
going on.

~~~
mhh__
That's the language design/philosophy, not specific details of how it was
implemented.

~~~
lmm
What distinction are you drawing? I think concerns like "this operation is
O(n^2) in the length of the passed list" are absolutely the responsibility of
the language, and probably even lower-level details. I'm not sure how to
expose "this operation takes n machine-code instructions" without sacrificing
cross-platformness, but I do think that's something we're going to need for
important use cases (e.g. cryptographic libraries need to be able to do
various things in constant time, which no existing language really has first-
class support for).
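
To make the constant-time point concrete (a minimal sketch, not any
particular library's code): comparing two MACs byte-by-byte with an early
exit leaks, via timing, how many leading bytes matched, so crypto code
accumulates the differences instead. Note that nothing in the language
guarantees a compiler preserves this property, which is exactly the missing
first-class support:

    #include <cstddef>
    #include <cstdint>

    // Compare n bytes without branching on the data: the loop always runs
    // to the end and the only data-dependent decision happens once, at the
    // return. An optimizer is still free to rewrite this, which is the
    // problem described above.
    bool equal_constant_time(const std::uint8_t* a, const std::uint8_t* b,
                             std::size_t n) {
        std::uint8_t diff = 0;
        for (std::size_t i = 0; i < n; ++i)
            diff |= static_cast<std::uint8_t>(a[i] ^ b[i]);
        return diff == 0;
    }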

------
joosters
Does profile-guided optimisation really 'hone in on the parts of a program
which will most benefit from optimisation'?

I thought that all it really did, at least on gcc, was record which branches
were taken and optimise the code to reflect the common path. Do compilers
really make use of the profile timings to decide to spend more time optimising
'hot' parts of the code with more involved/expensive transformations?

~~~
bzbarsky
Last I checked, gcc in PGO mode can basically do things like "-O2 for hot
code, -Os for everything else".

MSVC in PGO/LTO mode will do all sorts of exciting things. For example, in hot
code it will detect that a virtual function call is always or nearly always
happening with the same vtable pointer and emit a guard on the vtable pointer
followed by an inlining of that one particular implementation (with a fallback
to a normal virtual call if the guard fails). I'm pretty sure it doesn't do
this for all virtual function calls in the program. Even if it only does it
for ones where it has "enough data", that would bias towards the hot ones.
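
Written out by hand at the source level, that transformation looks roughly
like this (hypothetical types; the real guard compares the vtable pointer
directly in the compiler's IR, which is cheaper and stricter than the
dynamic_cast stand-in used here, since dynamic_cast also accepts subclasses):

    struct Shape {
        virtual ~Shape() = default;
        virtual double area() const = 0;
    };

    struct Circle : Shape {
        double r;
        double area() const override { return 3.14159 * r * r; }
    };

    // Before PGO: an indirect call through the vtable.
    double before(const Shape& s) { return s.area(); }

    // After PGO: the profile says s is almost always a Circle, so guard on
    // the dynamic type, inline Circle::area, and keep the virtual call as
    // a fallback for the rare miss.
    double after(const Shape& s) {
        if (const Circle* c = dynamic_cast<const Circle*>(&s))
            return 3.14159 * c->r * c->r;   // inlined body, no indirect call
        return s.area();                    // fallback: normal virtual call
    }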

------
StillBored
Not all code has small single locations where the CPU spends all its time.
Take the Linux kernel: what part is the 10% that needs optimization first?
Depending on which benchmark you run, you will get a different answer, so
having really good code generation everywhere helps considerably. So, yes,
there is a place for hand-optimized compression routines, but there is also a
place for compilers to optimize the remaining 99% of code.

These days, with CPU vendors putting in massive effort for <10% performance
improvements, getting a 2x speedup in applications that compile to hundreds of
MBs is more "free" performance than you are likely to gain over the next
decade from the hardware vendors.

Lastly, the kinds of code paths most suited to these gains are also the
hardest to benchmark. That is why I fall into the camp that believes that, for
anything more complex than throwaway code, picking a fairly performant
language is more valuable than the small bump in developer productivity you
get from a little syntactic sugar.

~~~
tomsmeding
> Take the linux kernel, what part is the 10% that needs optimization first?

Well syscall entry/exit would be a contender, but those are already in hand-
written assembly anyway, so that doesn't count.

Then I'd go for the read()/write() syscalls; I bet (not verified) that they're
called really often in comparison to the other ones, with all the
file/socket/IPC I/O that's going on.

I'm in the same camp about language choice though; there's plenty of room for
slower and easier languages, like Python in the role of gluing together high-
performance C code, but there are still also lots of scenarios where an
across-the-board performant language seems a good fit. Compilers spring to
mind, though they are an extreme example.

~~~
Retra
Aren't read()/write() going to be waiting on memory most of the time? Wouldn't
you rather have faster context switching first?

~~~
tomsmeding
Very true, didn't think of that one.

------
azakai
The article focuses on throughput performance, which is obviously very
important, but maybe it's also worth mentioning that code size matters a lot
in some cases, like delivering code on the web.

If an optimizing compiler can reduce code size by 50%, that's extremely
useful, and as opposed to throughput optimization for which perhaps only a few
small methods matter, for code size the whole program matters, so humans can't
compete with compilers there.

~~~
mpweiher
Not just that, for many, many applications these days, latency and
predictability are more important than throughput.

And throughput is actually easier (see: _Latency Lags Bandwidth_ ). So we're
often/usually getting help with the less important problem that's easier to
solve, while at the same time making the more important and harder problems
more difficult.

YMMV.

------
lukego
Paul Graham once made a point about designing languages for yourself, or for
people who are not as smart as you are, or for people who are smarter than you
are. [https://youtu.be/agw-wlHGi0E?t=485](https://youtu.be/agw-wlHGi0E?t=485)

Somehow this blog and the debate with djb remind me of this point. What kind
of person are the compiler optimizations intended to benefit?

------
avar
The author mentions how Chrome & Firefox gained a 10% speedup from Profile-
Guided Optimization with GCC et al.

I.e. you compile your program with profiling enabled, then run e.g. a test
suite, and then compile your release binary with input from that profiling so
the compiler knows what to optimize.
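
With GCC the loop looks roughly like this (the flags are real GCC flags; the
file and input names are illustrative):

    gcc -O2 -fprofile-generate -o app app.c   # instrumented build
    ./app < representative-input              # training run writes .gcda data
    gcc -O2 -fprofile-use -o app app.c        # rebuild using that profile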

Has any C compiler closed the loop on that by shipping the full compiler with
the program to do runtime optimization? With on-the-fly symbol replacement
you'd effectively have a C JIT compiler.

~~~
sanxiyn
Google has this in production. According to
[https://research.google.com/pubs/pub45290.html](https://research.google.com/pubs/pub45290.html)
AutoFDO is deployed to hundreds of binaries at Google. Over half of CPU cycles
used are now spent in some flavor of FDO-optimized binaries. And that was in
2016.

~~~
twic
HP's Dynamo did something comparable in 2000:

[http://archive.arstechnica.com/reviews/1q00/dynamo/dynamo-1.html](http://archive.arstechnica.com/reviews/1q00/dynamo/dynamo-1.html)

------
phkahler
Regarding undefined behavior, one thing I very often assume is 2's-complement
math with wraparound on overflow. That assumption has never failed for me, so
it really bothers me that this is undefined in the standard.

I have used processors that have saturating arithmetic, and I even used that
feature, but even there the compiler defaulted to the more common rollover
behavior.

~~~
usefulcat
Assuming you're using gcc or something with similar options, you should have a
look at -fwrapv if you haven't already.
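
As a minimal sketch of what is at stake (the function is made up): without
-fwrapv, signed overflow is undefined, so the optimizer may assume x + 1 is
always greater than x and legally fold this whole function to `return false;`.
With -fwrapv, the wraparound behaviour phkahler assumes is guaranteed and the
check survives.

    #include <climits>

    // With -fwrapv: returns true exactly when x == INT_MAX, because
    // two's-complement wraparound makes INT_MAX + 1 negative.
    // Without -fwrapv: signed overflow is UB, so a compiler at -O2 may
    // compile this to an unconditional `return false;`.
    bool will_overflow(int x) {
        return x + 1 < x;
    }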

------
nickpsecurity
"but the C standard isn't even freely available (though one can get drafts,
which are “close” to the final standard)"

I assumed it was free. If it's not, then it isn't even a fully open language.
Can anyone refute this with a link to a free standard?

~~~
FigBug
This is true for any ISO-standardized language: C, C++, ECMAScript/JavaScript,
COBOL, Ada, Forth, Fortran, Pascal, Ruby, SQL and more. C# dropped its ISO
standard after 2.0.

~~~
krylon
I am not sure if they use any special legal construct, but last time I played
around with Ada, the full Ada standard was available for free. The actual
standard, not some revised draft. On top of that the rationale document which
explains many of the design decisions was also available for free.

(It's been a while, but I found these rationale documents fascinating
reading.)

------
jerven
Just as a data point, I wanted to see what the benefit of the JVM's HotSpot
is over the JVM just interpreting. Code that normally runs in 22 minutes with
HotSpot is still running 99 minutes later in interpreter-only mode!

I am starting to think that the complaints about optimizing compilers are due
to C-family undefined behavior leading to C programmers having a love-hate
relationship with their compiler writers. In contrast, Java programmers trust
their JIT compilers not to screw up. But then, UB is something that can cause
issues even if there were no optimizing compiler at all.

So to conclude I am extremely grateful that there are optimizing compiler
writers out there.

~~~
jerven
Another option is to compare a minimally optimizing compiler with a more
optimizing compiler. In Java land one can do this with the client/server
flags. For our workload, client takes 15% more time to run than server.

Now what is that 15% worth? Take the new AMD Epyc: going from 2GHz to 2.2GHz,
a 10% clock bump, costs 800USD (3200->4000). Scaling that up, a 15% speedup is
worth roughly 1.5 x 800USD = 1200USD, so running the JVM with -server gives
value worth about 1/3 of the server CPU's cost over running the same JVM with
-client. Of course that is at the top of the price range and would be less in
the middle.

------
lloydjatkinson
Title is wrong

------
pwdisswordfish
> In practise, therefore, trying to optimise all parts of a _program_ means
> doing a so-so job on all parts of a _programme_.

This reads as if the author had suddenly remembered he's British halfway
through the sentence. Quite jarring.

(Not to mention that 'practise' is a verb.)

~~~
ltratt
Mea culpa! Fixed with thanks.

------
otempomores
Is there a compiler which can be attached to a debug build and optimize from
the runtime-codepath information generated during usage?

