
How to speed up the Rust compiler one last time - nnethercote
https://blog.mozilla.org/nnethercote/2020/09/08/how-to-speed-up-the-rust-compiler-one-last-time/
======
__s
Nicholas Nethercote didn't just speed up Rust. He went in & did the dirty work
of dredging through Firefox profiling

> It’s rare that a single micro-optimization is a big deal, but dozens and
> dozens of them are. Persistence is key

Persistence is work. Mozilla is cutting the people who put in the work of
staving off bitrot

~~~
nnethercote
Thank you for the kind words.

To clarify: I am still at Mozilla! But I will be working fully on Firefox for
the foreseeable future. I have edited the opening paragraph of the post to
make this clearer.

~~~
qzw
Rust's loss is Firefox's gain. I'm sorry to see you leave your Rust work, but
I think Mozilla is right to have their best engineers focus on their core
product. If we hope to see Firefox survive and remain relevant, then Mozilla
really needed to refocus their energies onto it. Also, I assume someone of
Nicholas's caliber has a great deal of agency over their own career path, so
perhaps a return to Firefox is not entirely unwelcomed by him.

~~~
Ygg2
I'd honestly rather see Rust survive than Mozilla. Mozilla is a few years away
from going Blink, and on its deathbed.

Rust is an up-and-coming language.

~~~
pedrocr
While I'd begrudgingly agree, the two are not independent. I moved back to
Firefox from Chrome exactly because Firefox became performance-competitive
once again thanks to Quantum. Most (all?) of the big Firefox improvements were
integrations of Rust code pioneered in Servo. The only chance I see for
Firefox is doubling down on that and replacing more and more components with
those Rust rewrites. Without that I don't see Firefox remaining competitive
with Chrome for long.

~~~
Ygg2
Didn't Mozilla, like, fire all of their Rust devs? There are no plans for
supporting Rust in Firefox, AFAIK.

~~~
steveklabnik
There is already Rust in Firefox, and has been for years.

~~~
Ygg2
True, but that's not what the parent is saying.

> The only chance I see for Firefox is doubling down on that and replacing
> more and more components with those Rust rewrites

If Firefox has no more Rust developers now, it can't "double down" on
replacing components, and all software is prone to bitrot, so after a few
years, it will be an unmaintainable, buggy mess.

~~~
nnethercote
I think there is confusion here over the meaning of "Rust developer".

Mozilla did lay off most of its employees that were working directly on the
Rust language and its implementation. This was a handful of people.

Mozilla also laid off some employees that were using Rust, such as the Servo
team.

But Mozilla still has plenty of employees that know and use Rust, both in
Firefox (e.g. the WebRender team), and in code relating to Firefox such as
services. This is a much larger number of people.

------
ndesaulniers
> The improvements I did are mostly what could be described as “bottom-up
> micro-optimizations”.

> I also did two larger “architectural” or “top-down” changes

My summer intern started doing profiling work on compile times with clang:
https://lists.llvm.org/pipermail/llvm-dev/2020-July/143012.html

Some things we found:

* For a large C codebase like the Linux kernel, we're spending way more time in the front-end (clang) than the backend (llvm). This was surprising given rustc's experience with llvm. Experimental patches simplifying header-inclusion dependencies in the kernel's sources can potentially cut build times by ~30% with EITHER gcc or clang.

* There's a fair amount of low-hanging fruit that stands out from bottom-up profiling. We've just started fixing these, but the most immediate was 13% of a Linux kernel build spent recomputing target information for every inline assembly statement, in a way that was accidentally quadratic and not memoized when it could have been (in fact, my intern wrote patches to compute these at compile time, even). Fixed in clang-11. That was just the first found+fixed, but we have a good list of what to look at next. The only real samples showing up in the llvm namespace (vs. clang) are llvm's StringMap bucket lookups, but those come from clang's preprocessor.

* GCC beats the crap out of Clang in compile times of the Linux kernel; we need to start looking for top down optimizations to do less work overall. I suspect we may be able to get some wins out of lazy parsing at the cost of missing diagnostics (warnings and errors) in dead code.

* Don't speculate on what could be slow; profiles _will_ surprise you.
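The accidentally quadratic target-info recomputation described above boils down to a classic memoization fix. Here is a minimal Rust sketch of the pattern; the names and the "target info" computation are illustrative stand-ins, not clang's actual code:

```rust
use std::collections::HashMap;

// Stand-in for the expensive per-target computation that was being
// redone for every inline-asm statement.
fn compute_target_info(triple: &str) -> usize {
    triple.len()
}

fn main() {
    let mut cache: HashMap<&str, usize> = HashMap::new();
    let mut computations = 0u32;

    // Many inline-asm sites, but only one distinct target.
    let statements = ["x86_64-linux-gnu"; 1000];
    for &t in statements.iter() {
        // or_insert_with only runs the closure on a cache miss,
        // so the expensive work happens once per distinct key.
        cache.entry(t).or_insert_with(|| {
            computations += 1;
            compute_target_info(t)
        });
    }

    assert_eq!(computations, 1); // computed once, not once per statement
}
```

The same shape applies whether the cache lives in a pass, a context object, or (as the comment notes) is hoisted all the way to compile time.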

> Using instruction counts to compare the performance of two entirely
> different programs (e.g. GCC vs clang) would be foolish, but it’s reasonable
> to use them to compare the performance of two almost-identical programs

Agree. We prefer cycle counts via LBR, but only for comparing diffs of the
same program, as you describe.

~~~
epage
> for a large C codebase like the Linux kernel, we're spending way more time
> in the front-end (clang) than the backend (llvm). This was surprising based
> on rustc's experience with llvm.

rustc sends large, generally unoptimized chunks of IR to llvm, compared to
clang. In Rust the translation unit is the crate, which leaves llvm with more
analysis to do. MIR is also still relatively new, and I think there is still
work to be done on MIR-level optimizations to reduce how much IR gets sent to
llvm.

~~~
ndesaulniers
Clang doesn't do optimizations, though. It generates LLVM IR via a simple tree
walk.

~~~
fluffything
What's big in C or C++ translation units are header files, but since these
mainly contain declarations, and declarations don't require code generation,
they don't create any work for the compiler backend.

Rust translation units don't have the header-file problem (so the frontend
does less work), and they are also much larger in terms of definitions, often
spanning multiple files, so there is more for the backend to do per
translation unit.

The consequence is that, relatively speaking, Rust spends more time in LLVM
than C and C++ do.

The naive solution to this problem in Rust is simple: write smaller
translation units.

Rust programmers just want to structure their code however they want and still
have good compile times. That is kind of the opposite of how C and C++
programmers typically structure their code in the largest projects, because
they value faster compile times over that kind of "ergonomics"/code
organization.
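One concrete, hedged knob in this direction: rustc already splits a crate into multiple "codegen units" before handing it to LLVM, and the count is tunable per profile in Cargo. A sketch (the values shown are the current defaults, used here purely for illustration):

```toml
# Cargo.toml profile settings (illustrative, these are the defaults)
[profile.dev]
# More codegen units = smaller chunks handed to LLVM and more parallelism,
# usually faster builds at some cost in runtime performance.
codegen-units = 256

[profile.release]
# Fewer codegen units = more cross-function optimization, slower builds.
codegen-units = 16
```

This doesn't change how the source is organized, only how the backend's work is chunked, which is part of why the crate-sized-TU cost is less dire in practice than it sounds.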

~~~
ndesaulniers
Ah. Same problem as static inline functions defined in headers in the kernel.
They may not end up generating code, but clang doesn't know that, and will
still generate llvm IR that the backend must determine is dead, after clang
has already done lots of work performing semantic analysis on it.

Orthogonal to your point:

Also, we found that aggressively marking functions
__attribute__((always_inline)) was blowing up compile times (reoptimizing the
same code again and again).

Finally, expansions of function-like macros containing GNU C statement
expressions could cause the preprocessed source to bloat very quickly
(megabytes of input, IIRC).

~~~
fluffything
Rust doesn't have the "static in header" problem, but it does have the "always
inline" and "macros" problems.

The "always inline" problem is smaller in Rust, because it does ThinLTO by
default, compiles partially to bitcode that can be partially inlined when
necessary, etc. So essentially the Rust toolchain is, by default, a bit better
at inlining when it's profitable than C and C++ toolchains are (you can tune a
C or C++ toolchain like clang to be as good as Rust).
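For reference, Rust spells these inlining hints as attributes rather than GNU C extensions. A minimal sketch (the functions themselves are made up for illustration):

```rust
// Rust analogue of __attribute__((always_inline)): a hard request that
// carries the same re-optimize-at-every-call-site cost discussed above.
#[inline(always)]
fn square(x: u64) -> u64 {
    x * x
}

// Plain #[inline] is only a hint; the backend (and ThinLTO) still decide
// whether inlining is profitable.
#[inline]
fn cube(x: u64) -> u64 {
    x * square(x)
}

fn main() {
    assert_eq!(square(12), 144);
    assert_eq!(cube(3), 27);
}
```

The usual guidance is to prefer `#[inline]` (or nothing) and reserve `#[inline(always)]` for cases backed by profiles, precisely because of the compile-time blowup described above.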

In C and C++, macros are generally frowned upon, not because of these compile-
time issues, but because they are dangerous, tricky, and non-hygienic;
powerful usage requires complex patterns; and they interact badly with header
files, PCHs, etc.

Rust macros are awesome, super useful, widely used, etc. So people end up
using them a lot, and this is by design, not a flaw of the language. This
results in a lot of duplicated code being expanded, which leads to the problem
you mention, but much worse, because there are just many more macros in Rust.
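A tiny illustration of that duplication: every use site of a declarative macro gets its own expanded copy of the body, so the front-end sees the body multiplied by the number of call sites. Sketch (the macro is hypothetical):

```rust
// Each invocation of this macro expands to a full copy of the match
// expression below -- ten call sites means ten copies for the compiler
// to parse, type-check, and lower.
macro_rules! checked_get {
    ($v:expr, $i:expr) => {
        match $v.get($i) {
            Some(x) => *x,
            None => 0,
        }
    };
}

fn main() {
    let v = vec![10, 20, 30];
    assert_eq!(checked_get!(v, 1), 20); // in bounds
    assert_eq!(checked_get!(v, 9), 0);  // out of bounds -> default
}
```

`cargo expand` is a handy way to see exactly how much code a given macro use actually generates.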

This kind of applies to templates/traits as well. There are many C++
programmers that don't write templates, but all Rust programmers use generics,
because they are great. Rust can pre-compile generics, so the cost is not as
bad as in C++, where the same template might be instantiated by multiple TUs.
But still, just because they are more widely used, the size of the problem
grows.

------
cesarb
> I was surprised by how many people said they enjoyed reading this blog post
> series. The appetite for “I squeezed some more blood from this stone” tales
> is high.

There's something satisfying about seeing code get cleaned up and optimized. I
also enjoyed following the LibreOffice commits back when they were in their
"heavy cleanup" phase after it became clear OpenOffice was dead (which meant
they didn't have to worry about diverging from the upstream anymore).

~~~
zem
the early neovim posts were very absorbing too

------
vlovich123
> Contrary to what you might expect, instruction counts have proven much
> better than wall times when it comes to detecting performance changes on CI,
> because instruction counts are much less variable than wall times (e.g.
> ±0.1% vs ±3%; the former is highly useful, the latter is barely useful).
> Using instruction counts to compare the performance of two entirely
> different programs (e.g. GCC vs clang) would be foolish, but it’s reasonable
> to use them to compare the performance of two almost-identical programs
> (e.g. rustc before PR #12345 and rustc after PR #12345). It’s rare for
> instruction count changes to not match wall time changes in that situation.
> If the parallel version of the rustc front-end ever becomes the default, it
> will be interesting to see if instruction counts continue to be effective in
> this manner.

This is a supremely surprising conclusion, especially in 2020. Is instruction
count really still tied to wall-clock time? I would have thought that some
instructions could be slower than others (especially on x86), so that using
more of the faster individual instructions could beat one slower instruction.
Similarly, cache effects and data dependencies can make more instructions run
faster than fewer instructions.

I _think_ what the author is saying is that when evaluating micro-
optimizations, instruction counts are still pretty valuable, because you're
making a small intentional change and evaluating its impact, and _usually_ the
correlation holds. The dashboard clearly still measures wall-clock time, since
comparing only instruction counts over time would be misleading.

I'm curious whether the Rust team has evaluated Stabilizer, to be more robust
about the optimizations they choose:
https://emeryberger.com/research/stabilizer/

~~~
nnethercote
> This is a supremely surprising conclusion

That's why I started the paragraph with "Contrary to what you might expect".

As for Stabilizer: "Stabilizer eliminates measurement bias by comprehensively
and repeatedly randomizing the placement of functions, stack frames, and heap
objects in memory." Those placements can affect cycle counts and wall times a
lot, but don't affect instruction counts.

~~~
vlovich123
So have you not found, in practice, any data dependencies or cache issues
showing up as bottlenecks? Or do current tools just make this more of a blind
spot for optimization?

Also, is there any work to multi-thread the Rust compiler at a more fine-
grained level, like the recent GCC work? I know you allude to the possibility
that it would make instruction counts less reliable, so I'm wondering if
that's something being explored.

Finally, while I have you, I'm wondering if there's been any exploration of
keeping track of information across builds so that incremental compilation is
faster (i.e. only recompiling/relinking the parts of the code impacted by a
change). I've always thought that should almost completely eliminate
compilation/linking times (at least for debug builds, where full optimization
is less important).

~~~
nnethercote
I mentioned in the post several areas I myself haven't looked at, including
cache misses. There may be room for improvements there.

There is an experimental parallel rustc front-end; see e.g.
https://internals.rust-lang.org/t/help-test-parallel-rustc/11503

> any exploration of the idea of keeping track of information across builds so
> that incremental compilation is faster

That's exactly what incremental compilation does.
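For intuition, the query-based caching that rustc's incremental compilation is built on can be caricatured in a few lines: cache each query's result tagged with a revision of its inputs, and recompute only when that revision has changed. Everything below is a toy sketch with made-up names, not rustc's actual machinery:

```rust
use std::collections::HashMap;

// Toy "database" in the style of a query-based incremental compiler:
// queries are memoized per input revision and recomputed only when the
// input they depend on has changed.
struct Db {
    source: String,
    revision: u64,
    cache: HashMap<&'static str, (u64, usize)>,
    computations: u32, // counts how often the query body actually ran
}

impl Db {
    fn set_source(&mut self, s: &str) {
        self.source = s.to_string();
        self.revision += 1; // bumping the revision invalidates cached results
    }

    fn line_count(&mut self) -> usize {
        if let Some(&(rev, val)) = self.cache.get("line_count") {
            if rev == self.revision {
                return val; // input unchanged: reuse the previous result
            }
        }
        self.computations += 1;
        let val = self.source.lines().count(); // the "expensive" query body
        self.cache.insert("line_count", (self.revision, val));
        val
    }
}

fn main() {
    let mut db = Db {
        source: String::new(),
        revision: 0,
        cache: HashMap::new(),
        computations: 0,
    };
    db.set_source("a\nb");
    assert_eq!(db.line_count(), 2);
    assert_eq!(db.line_count(), 2); // second call is served from cache
    assert_eq!(db.computations, 1);

    db.set_source("a\nb\nc");
    assert_eq!(db.line_count(), 3); // input changed, so it recomputed
    assert_eq!(db.computations, 2);
}
```

Real systems refine this with per-query dependency tracking (so an unrelated edit doesn't invalidate everything), which is exactly the ground the Salsa blog posts linked downthread cover.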

~~~
vlovich123
There's an effort to track which functions and modules changed, and what the
downstream implications are in terms of needing recompilation? Are there any
links to technical descriptions? Super interested in reading up on the
technical details involved.

~~~
Muximize
You might want to look into Salsa:

http://smallcultfollowing.com/babysteps/blog/2019/01/29/salsa-incremental-recompilation/

http://smallcultfollowing.com/babysteps/blog/2020/04/09/libraryification/

------
est31
It's sad to see your rustc contributions stop, nnethercote. I guess rustc now
has to run an experiment on how quickly performance improves without you :(.

IMO compiler speed still remains the main ergonomics hurdle in developing Rust
software.

------
steveklabnik
Thanks for all you've done over the years here. I'm sad you won't be able to
do more of it.

------
The_rationalist
nnethercote owns the best blog on performance profiling that I've ever seen.
Congrats on your huge skill set; Firefox, Chromium, and programming languages
need more people like you.

------
Ar-Curunir
Thank you for your excellent work over the years! Your efforts have gone a
long way to making Rust enjoyable to write =)

If there are any smart Rust-using companies out there, they should definitely
hire nnethercote to continue his excellent work!

~~~
dblohm7
Considering that nnethercote is still with us at Mozilla, I sure hope that he
_doesn’t_ get hired away! :-)

~~~
Ar-Curunir
Oops, missed that part :)

------
alex_reg
> ... Perhaps this relates to the high level of interest in Rust ...

I would have loved these blog posts regardless of what code was actually being
optimised.

They offer a fascinating glimpse into a workflow that requires expertise,
experimentation and creativity.

Sadly something that most developers can't engage in very often, due to the
nature of their work or time constraints.

------
oshea64bit
This is a fascinating blog series. I've been dabbling in Rust lately and
really appreciate how powerful and helpful the compiler is even to beginners.

> Due to recent changes at Mozilla my time working on the Rust compiler is
> drawing to a close.

This sort of statement makes me a bit worried though. I don't mean to echo
what a lot of the community has said over the past month, but I really hope
that development on Rust doesn't stagnate because of the layoffs.

------
jimbob45
How hasn’t Google taken over and hired the Rust team? Weren’t they practically
funding them by funding their parent, Mozilla?

~~~
steveklabnik
Well, most of the Rust team was not employed by Mozilla, so that’s one reason
why they have not.

~~~
nanagojo
What about the Servo team? Wouldn't they have an impact on Rust development?

~~~
steveklabnik
They certainly contributed, yes, but there are like two hundred people total
on all of the Rust teams. Losing them hurts, they’re fantastic folks, but Rust
is just way bigger these days.

~~~
jimbob45
It’s just that the team is _right there_ for the taking, and it would be so
easy for Google to hold control over the most exciting language of the next
five years, instead of letting MS EEE Rust soon.

------
stackzero
gg man. Really enjoyed your posts since starting rust.

------
xiaodai
Hmmm... Rust needs a lot more of this, given its reputation for slow compiles.

~~~
xiaodai
I don't know why I'm getting downvotes for pointing out the slowness of Rust,
which is well known.

~~~
lucozade
Probably because you just stated something that is well known without adding
anything constructive?

------
k__
The title made it sound like the Rust compiler is at its performance limit and
they're doing the last possible optimization.

