
Make LLVM Fast Again - notriddle
https://nikic.github.io/2020/05/10/Make-LLVM-fast-again.html
======
londons_explore
Am I the only one who wants to see a split into a "fast compile" mode and a
"spend hours making every optimization possible" mode?

Most code is executed a lot more frequently than it is compiled, so if I can
get a 1% speed increase with a 100x compile slowdown, I'll take it.

I don't want to see good PRs that improve LLVM delayed simply because they
cause a speed regression.

~~~
embrassingstuff
How many people would use a cloud compiler?

Let's set aside technicalities and assume it's a real 5x improvement and all
the files are mirrored seamlessly.

~~~
mrich
If you are referring to a caching distributed compile cluster, most companies
with large codebases are using one (or should). It helps a lot and can make
the difference between taking half a day for a scratch make (i.e. unusable) to
getting down to 10 min.

There are open-source ones but I'm also aware of at least two internal,
custom-developed ones.

~~~
jingwen
More here: [https://docs.bazel.build/versions/master/remote-execution.html](https://docs.bazel.build/versions/master/remote-execution.html)

------
DerSaidin
This work is critical to improving compile times.

As the author of one of the changes which could have unknowingly caused a 1%
regression, I really appreciate this work measuring and monitoring compile
times. Thanks to nikic for noticing the regression and finding a solution to
avoid it.

~~~
fluffything
I really hope this type of infrastructure gets moved into LLVM itself, that
people start adding more benchmarks for all the frontends, and that this
somehow gets integrated into the CI infrastructure, so that merging PRs that
accidentally regress LLVM's performance can be blocked, as is currently the
case for rustc.

But I guess the LLVM project should probably start by making code-reviews
mandatory, gating PRs on passing tests so that master doesn't get broken all
the time, etc. I really hate it when I update my LLVM locally from git master,
and it won't even build because somebody pushed to master without even testing
that their changes compile...

For Rust, I hope Cranelift really takes off someday, so we can ditch LLVM as
the default and make it opt-in, only for those cases in which you are willing
to trade huge compile times for that last 1% of run-time reduction.

~~~
bluGill
Unfortunately, sometimes losing performance is the correct tradeoff. However,
it needs to be carefully considered before you make it.

~~~
fluffything
Accepting a performance loss is almost always the right trade-off. If it
weren't, everybody would be writing their code in assembly at 5 LOC/day.

------
drivebycomment
Good read. One thought-provoking bit for me was:

> Waymarking was previously employed to avoid explicitly storing the user (or
> “parent”) corresponding to a use. Instead, the position of the user was
> encoded in the alignment bits of the use-list pointers (across multiple
> pointers). This was a space-time tradeoff and reportedly resulted in major
> memory usage reduction when it was originally introduced. Nowadays, the
> memory usage saving appears to be much smaller, resulting in the removal of
> this mechanism. (The cynic in me thinks that the impact is lower now,
> because everything else uses much more memory.)

Any seasoned programmer will remember a few such things: you undo a decision
made years ago because the assumptions have changed.

Programmers often make these kinds of trade-off choices based on the current
state of things (the typical machines the program runs on, the typical inputs
it deals with, and the current version of everything else in the program). But
all of those environmental factors change over time, which can make the inputs
to the trade-off quite different. Yet it's difficult to revisit all those
decisions systematically, as they require too much human analysis. If we could
encode those trade-offs in the code itself, in a form accessible to a
programmatic API, one could imagine a machine learning system that makes those
trade-off decisions automatically over time, as everything else changes, by
traversing the search space of those parameters. Today's programming languages
unfortunately don't allow encoding such high-level semantics, but maybe it's
possible to start small - e.g. which associative data structure to use can be
chosen relatively easily, and the initial size of a data structure could be
chosen automatically based on benchmarks or even metrics from the real world.

~~~
adrianN
I don't think that exploding the state space of your program by making the
history of your design decisions programmatically accessible (and changing
them regularly to reflect new assumptions) would be good for the quality of
the result.

~~~
gugagore
I don't think it's as simple as saying "the state space explodes, and that's
bad".

When you say state space, I think about what is dynamically changing. If you
can select one of two design decisions e.g. at compile time then, yes, your
state space is bigger, but you don't have to reason about the whole state
space jointly. The decision isn't changing at run time.

~~~
adrianN
You have to have tests for all combinations though. At least those
combinations that you actually want to use. You get the same problem when your
code is a big ifdef-hell.

~~~
gugagore
Testing is important, for sure, but just because you have two parameters with
n choices each, does not mean you have to test n^2 combinations. You can aim
to express parameterization at a higher level than ifdefs.

For example, template parameters in C++. The STL defines map<K, V>; you don't
have to test every possible type of key and value.

~~~
adrianN
I'm pretty sure that you need n^2 tests if you have two parameters with n
non-equivalent choices each. For maps, many types are equivalent, so you don't
need an infinite number of tests.

~~~
smolder
If the two hypothetical parameters only affect disparate program logic for
some or all of their possible choices, they could require as few as 2n tests
instead of the full n^2... If I'm understanding the hypothetical right. (It
depends on their potential for interaction.)

------
nh2
> I’m not sure whether this has been true in the past

Phoronix.com has a lot of Clang benchmarks over the years.

I recall seeing some benchmark that showed that as Clang approached GCC in
performance of compiled output, the compile speed also went down to approach
GCC levels.

But I haven't managed to find that exact benchmark yet.

~~~
baybal2
An expected result when they copy GCC's features and functionality, isn't it?

Pretty much the sole point of Clang/LLVM for the corporate sponsors is to get
GCC, but without the GPL.

~~~
the_pwner224
I've heard this said before. Why would someone want this? AFAIK the GPL isn't
really relevant unless you're modifying and redistributing GCC itself. Even if
you use modified GCC internally for compiling for-profit software you just
need to allow the employees who use GCC to see your modified code, which
doesn't seem like a big deal since you already trust them with your
application code.

~~~
redis_mlc
Because Apple is intensely allergic to the GPL.

The reasons are:

\- As somebody else mentioned, Apple redistributes developer tools, clang
being the poster child

\- Since they release OS products, they don't want to co-mingle their
software with GPL code. (So they ship an older bash on Mac OS X.)

\- fear of an Apple developer quietly copying GPL source into a commercial
product (well-founded, actually)

\- Apple Legal exerting an "abundance of caution" on IP

\- at this point, it's institutional. When I worked there, Linux and MySQL
were forbidden, for example, but that has relaxed recently.

Also, I think you misunderstand the GPL. If you distribute modified gcc,
anybody receiving it can ask for sources. So employees plus end-users.

(One of the strangest examples is that Yamaha uses real-time linux in their
synths, and you can download the GPL portions. I can't imagine a musician ever
wanting to do that!)

Source: ex-Apple.

~~~
sjwright
> Apple is intensely allergic to the GPL.

Their actual marketplace behaviour demonstrates that they're allergic to GPL
version 3 specifically, not the GPL or copyleft in general.

~~~
saagarjha
They’re fairly allergic to copyleft these days; I can’t recall them adopting a
new project with any version of GPL for quite a while.

~~~
sjwright
There aren't so many GPL 2.0–only projects out there to adopt.

------
ndesaulniers
We plan on starting to track compile times for Linux kernel builds with llvm.
If you have ideas for low hanging fruit in LLVM, we'd love to collaborate.

------
Myrmornis
Shameless plug:
[https://github.com/dandavison/chronologer](https://github.com/dandavison/chronologer)
runs a benchmark (using hyperfine) over every commit in a repository (or
specified git revision range) and produces a boxplot-time-series graph using
vega-lite. It works but is rough and I haven't tried to polish it -- does
another tool exist that does this?

~~~
The_Amp_Walrus
This is interesting. I'm working in epidemiological modelling atm, and
something like this would be pretty useful to run in a GitHub Action, CI-style,
to find performance regressions over time.

I did a quick Google and found this:
[https://github.com/marketplace/actions/continuous-benchmark](https://github.com/marketplace/actions/continuous-benchmark)

~~~
mhh__
This is something that I've been meaning to work on for a while - I'm doing a
hardware project ATM so it won't happen soon - but there seems to be a strong
use case for a generic software performance-tracker app.

Lots of projects have "are we fast yet" type graphs but I'm not aware of a
generic tool that also allows you to set alerts for fine grained benchmarks (I
made a toy that alerts you to _x_ -sigma increases in cache misses when
testing compiler backend patches for example).

One of those projects that I actually want to build, but that's slightly too
dull to finish.

------
aogl
Well, this is interesting. I thought I was the only one who noticed things
getting slower. For a couple of releases now I've been thinking I was going
crazy, as if something were only getting slower on my own machines. Glad to
see someone else has put together data to prove it. Thanks, I'll definitely
watch this conversation play out as others realise the obvious.

------
jeffbee
Pretty cool improvements. For any large project, profiling it, making it
faster, and preventing or reverting regressions can be a full-time job.
Perhaps the LLVM project needs someone in such a role. Still, I question the
utility of timing optimized builds. Usually when I have to wait for the
compiler it's an incremental fastbuild to execute unit tests; optimized builds
usually happen while I'm busy doing something else.

~~~
tbodt
The problem with that is that LLVM's non-optimized codegen is so bad that many
projects build with -O2 even in debug mode.

~~~
jeffbee
Really? It's good enough that Google used clang for fastbuild (tests) for many
years before switching release builds off GCC.

------
schlupa
Building llvm+clang from source is also ludicrous: 70 GB of disk space usage
and an hour of build time, ridiculous. Static linking is the culprit here;
binaries hundreds of MB in size are a catastrophe for the cache and memory
subsystem. The funny thing is that my project also uses modules in D, and
building the D compiler takes 10 seconds, including unpacking the tarball.
~~~
jcelerier
I build LLVM+Clang regularly and it definitely does not take 70GB.

~~~
schlupa
llvm+clang v10 on Linux builds 8 GB in bin, 13 GB in lib, and >5 GB in tools,
plus things here and there. That's ~30 GB. Then you need that much again for
the install, so a hard requirement of 2 x 30 GB plus some slack. The last time
I built it before this was around version 4, and it did not need that much
disk space.

~~~
jcelerier
> llvm+clang v10 on Linux builds 8 GB in bin, 13 GB in lib, and >5 GB in
> tools, plus things here and there.

just checked, and my build folder with llvm & clang is 3 GB. That's a release
build (pass -DCMAKE_BUILD_TYPE=Release!) - you don't need a debug build unless
you're hacking on llvm itself.

> Then you need that much for the install

you want make install/strip, not make install (but why do you need to install?
you can run clang from the build dir just fine)
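For reference, a minimal release-mode build along these lines might look like
this (the project list and paths are just examples, and `install/strip` is the
CMake-generated target the parent refers to):

```shell
# From an llvm-project checkout: configure only llvm + clang, release mode.
cmake -S llvm -B build -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang"
cmake --build build

# Run clang straight from the build tree; no install step needed:
./build/bin/clang --version

# If you do install, strip the binaries to save disk space:
cmake --build build --target install/strip
```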

~~~
schlupa
Thank you for your tips. It's a pity that this information is not very visible
on the quick start page.

------
moonchild
> I can’t say a 10% improvement is making LLVM fast again, we would need a 10x
> improvement for it to deserve that label. But it’s a start…

It’s a shame, one of the standout feature of llvm/clang used to be that it was
faster than GCC. Today, an optimized build with gcc is faster than a debug
build with clang. I don’t know if a 10x improvement is feasible, though; tcc
is between 10-20x faster than gcc and clang, and part of the reason is that it
does a lot less. The architecture of such a compiler may by necessity be too
generic.

Here’s a table listing build times for one of my projects with and without
optimizations in gcc, clang, and tcc. Tcc w/optimizations shown only for
completeness; the time isn’t appreciably different. 20 runs each.

    
    
      ┌─────────────────────────────┬──────────┬──────────┬──────────┬─────────┬────────────┬────────────┐
      │                             │Clang -O2 │Clang -O0 │GCC -O2   │GCC -O0  │TCC -O2     │TCC -O0     │
      ├─────────────────────────────┼──────────┼──────────┼──────────┼─────────┼────────────┼────────────┤
      │Average time (s)             │1.49 ±0.11│1.24 ±0.08│1.06 ±0.08│0.8 ±0.04│0.072 ±0.011│0.072 ±0.014│
      ├─────────────────────────────┼──────────┼──────────┼──────────┼─────────┼────────────┼────────────┤
      │Speedup compared to clang -O2│        - │     1.20 │     1.40 │    1.86 │      20.59 │      20.69 │
      ├─────────────────────────────┼──────────┼──────────┼──────────┼─────────┼────────────┼────────────┤
      │Slowdown compared to TCC     │    20.68 │    17.20 │    17.72 │   11.12 │          - │          - │
      └─────────────────────────────┴──────────┴──────────┴──────────┴─────────┴────────────┴────────────┘
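
A measurement along these lines can be reproduced with hyperfine (the tool
mentioned elsewhere in the thread); the source file name here is a
placeholder:

```shell
# 20 timed runs per command, as in the table above.
hyperfine --runs 20 \
    'clang -O2 -o /dev/null main.c' \
    'clang -O0 -o /dev/null main.c' \
    'gcc -O2 -o /dev/null main.c' \
    'gcc -O0 -o /dev/null main.c' \
    'tcc -o /dev/null main.c'
```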

~~~
judofyr
> Today, an optimized build with gcc is slower than a debug build with clang.

Did you mean "an optimized build with gcc is _faster_ than a debug build with
clang"?

~~~
fluffything
If that's the case, an optimized build with clang is also often faster than a
debug build with... clang itself.

The reason is that many of the optimization passes that run first, like dead
code elimination, can remove a lot of code early on, so "optimized" builds end
up processing significantly less code, which is inherently faster.

The OP might just not be aware of what a "debug build" is. The goal of a debug
build is for the binary to execute your code as closely to how you wrote it as
possible, so that you can easily debug it.

Its goal isn't fast compile times. If you want fast compile times, try using
-O1. At that level, both clang and gcc do optimizations that are known to be
cheap and that remove a lot of code, which speeds up compile times
significantly. Another trick to speed up compile times is to use -g0, and, if
you do not need exceptions, -fno-exceptions, since those make the front-end
emit much less data, which results in less data having to be processed by
the backends.
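
As a rough sketch, the suggestions above amount to something like the
following (the file name is a placeholder, and the measured effect varies a
lot by project):

```shell
# Baseline debug-style compile:
time clang++ -O0 -c big_translation_unit.cpp

# Cheap optimizations that shrink the IR early, which can speed up
# the compile overall:
time clang++ -O1 -c big_translation_unit.cpp

# Additionally skip debug info and exception machinery:
time clang++ -O1 -g0 -fno-exceptions -c big_translation_unit.cpp
```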

~~~
moonchild
In my testing, -O1 results in slower compile times than -O0.

Emitting debug symbols doesn't change compile times.

~~~
fluffything
You might have an interesting project, for all my C++ projects, -O1 is
significantly faster than -O0 (~2x faster).

Or maybe my projects are the interesting ones :D

~~~
moonchild
Ah - you are using C++.

My project is C, which is why I can use tcc.

Anyway, that makes sense; in C++, there's a lot of 'extra' stuff - single
lines of code that add up to much more than they would seem. I bet -O1 lets
the compiler inline a lot of std::move and smart-pointer semantics; elide
monomorphisations, copy constructors (RVO); etc. Which just means less code
for the backend to spit out.

~~~
fluffything
Ah right, for some reason I thought you were talking about C++.

Yes, for C, what you mention makes perfect sense.

I agree with you about C++ as well. In particular, C++ templates end up
expanding into a lot of duplicate code, and at -O1 the compiler can remove
much of it.

------
nickcw
I think this is a worthy effort :-) I find the compile times of rust to be
quite a big negative point.

However:

> For every tested commit, the programs are compiled in three different
> configurations: O3, ReleaseThinLTO and ReleaseLTO-g. All of these use -O3 in
> three different LTO configurations (none, thin and fat), with the last one
> also enabling debuginfo generation.

I would have thought that, for developer productivity, tracking -O1 compile
times would be better, wouldn't it?

I'm happy for the CI to spend ages crunching out the best possible binary, but
taking time out of the edit-compile-test loop would really help developers.

~~~
bluGill
Both are worth tracking. If O3 is doing useless work, then I'll take the
compile-time speedup. If it takes twice as long for a 1% run-time improvement,
I'll take that too.

------
gameswithgo
to Nikic, thank you for this effort.

------
thu2111
Hmm. This is one of the unexpected upsides to systems using JIT compilation
that I guess we tend to take for granted. The very fact that a JITC runs in
parallel to the app means the compiler developers care intensely about the
performance of the compiler itself - any regression increases warmup time
which is a closely tracked metric.

As long as you can tolerate the warmup, and at least for Java it's not really
a big deal for many apps these days because C1/C2 are just _so_ fast, you get
fast iteration speeds with pretty good code generation too. The remaining
performance pain points in Java apps are things like the lack of explicit
vectorisation, value types etc, which are all being worked on.

------
RX14
I would greatly greatly appreciate an effort to benchmark builds without
optimizations too. We've seen some LLVM-related slowdowns in Crystal, and
--release compile times are far less important than non-release builds to us.

------
NCG_Mike
A couple of things a C++ developer can do: put template instantiation code
into a .cpp file, where possible.

"#pragma once" in the header files helps as does using a pre-compiled header
file.

Obviously, removing header files that aren't needed makes a difference too.

~~~
brandmeyer
`pragma once` doesn't do anything that a well-written header guard doesn't.

------
The_rationalist
The root cause of the issue is that they should make it mandatory for each
pull request merged into llvm to (almost) not regress performance. The CI
should have a bunch of canonical performance tests. If this had been mandatory
from the start, llvm could have been far faster. It is not too late, but it's
time to put an end to this mediocrity.

~~~
yjftsjthsd-h
Although tracking it would help, I see 2 issues (in opposite directions): I
think there are times when slower performance is an acceptable cost. And, I
think that if you allow tiny slowdowns, over time we'll get back here. There's
judgment involved.

~~~
m463
You could make the argument that:

\- a faster build time should not come at the expense of a more extensible
compiler - one that can be modified easily to add capabilities and features to
the build output

\- a slower build time is acceptable if the build result executes faster or
more efficiently. One slower compile vs. one million faster executions is
keeping your eye on the prize.

~~~
streb-lo
The argument is simple IMO:

* release target build times aren't an issue. They can be done overnight and aren't part of the work cycle.

* un-optimized build times are part of the work cycle and should be as speedy as possible.

~~~
gpm
> * release target build times aren't an issue. _They can be done overnight
> and aren't part of the work cycle._

Emphasis added. This isn't true for many use cases. There are times when
release build + single run is faster than debug build because run time is
relatively long (e.g. scientific sims with small code bases + big loops).
There are times when debug builds simply aren't sufficient (e.g. when
optimizing code).

~~~
streb-lo
OK, that's true, but I think my point still stands. Someone doing very heavy
scientific computation with long run times will still prefer a release build
that optimizes for run-time speedup over compile-time speedup, within reason
of course.

~~~
gpm
I agree the point still largely stands, that's why I added the emphasis. Maybe
I should have made the intent of that clearer.

------
xvilka
Reimplementing LLVM in Rust could make a big difference as well.

~~~
adrianN
Why do you think that? Rust and C++ are reasonably close in performance.

~~~
tom_mellior
Not the OP here, but AFAIK LLVM has been struggling with parallelizing
compilation of independent functions due to some shared global context that
ideally wouldn't be there. A full rewrite would allow a redesign of this part
in a more concurrency-friendly way. So it's conceivable that a concurrency-
oriented rewrite would bring nice speedups in wall time, not total CPU time.
And Rust might give some more guarantees that there really aren't any hidden
shared corners.

~~~
gizmondo
Given that parallelizing rustc is also kind of struggling, it seems unfair to
assume that C++ is necessarily the culprit in LLVM's case.

~~~
est31
In Firefox there were two failed attempts to make CSS styling parallel while
using C++. Only the parallel Rust rewrite, Stylo, succeeded.

Rust doesn't replace careful planning of parallel infrastructure or
performance optimization, but it makes it possible to maintain the parallel
system.

------
dangwu
LLVM has devolved into complete garbage in Xcode for large Swift projects.
Slowness aside, at least half the time it won't display values for variables
after hitting a breakpoint, and my team has to resort to using print
statements to debug issues.

~~~
MobiusHorizons
Are you debugging optimized builds? I have definitely had variables be
unavailable at debug time due to optimization, in C under lldb. I would guess
that the same could be true for Swift's integration.

~~~
loeg
Clang loses a lot of information for values that are still available or
computable in its optimization passes. It's not purely the values being
completely lost to optimization.

GCC continues to emit relatively better debuginfo at similar optimization
levels. Samy Al Bahra has written and talked about this a couple of times over
the years.

~~~
brandmeyer
I dunno, I see the same issue in GCC, even at -Og. Both compilers will
aggressively mark variables dead and reuse their storage (memory, registers)
as soon as possible. Just because it's in scope doesn't mean it's still live.

~~~
loeg
Yes, scope and liveness are not exactly the same. No, that does not mean the
scoped value cannot be recovered cheaply. DWARF can express the value of a
variable in terms of expressions that do computations on other registers
and/or access memory; it does not have to be as simplistic as "this value
lives in this register for some period of time." Clang (and to a lesser
extent, GCC) fails to do that for non-live, in-scope variables much of the
time. Clang in particular just loses that metadata in many optimization
passes.

------
dgentile
The "Rust is 10% slower" metric is unfair, I think. If you look on godbolt,
the LLVM IR that rustc emits isn't that great, so LLVM has to take some extra
time to optimize it, compared to the output of clang.

~~~
est31
It's not 10% slower compared to clang, but compared to the prior version of
LLVM. That comparison IS fair, as LLVM specifically invites people to target
it.

------
simonw
As a general rule, "Make X Y again" is worth avoiding - it has connotations
that are likely to distract from the message you are trying to put across.

This is a great post full of really interesting technical details. Don't be
put off by the title!

~~~
Thorentis
It's become a well-known catch phrase now. It's only a distraction if you find
yourself offended by it, which frankly with everything that's going on right
now, is pretty thin-skinned.

~~~
skavi
As it currently stands, 15 out of 24 comments in this thread are about the
phrasing in the title. It has demonstrably become a distraction, at least in
this post.

