How to speed up the Rust compiler some more in 2018 (blog.mozilla.org)
183 points by nnethercote on June 5, 2018 | hide | past | favorite | 15 comments


I liked this post, not only for showing an overview of all the big and small wins but also for showing approaches that were tried and didn't work out.

It's nice to see someone write about failures in a very neutral and informative way, as it is a daily part of our jobs and life but most of what you read online is (logically) about successes.


Good point. I'm working on a project at the moment that's 100% performance-oriented, and it's sort of a pain to think about writing about it responsibly - i.e. keeping all the intermediate steps and half-good ideas around (and bug-free, and sorta-usable) long enough to do performance analysis on all of them.

It's tempting to just sweep all the bad ideas and half-way steps under the rug and hit the world with an "aren't I clever"-looking final product, without really helping anyone learn how to get there themselves on their own projects.


One of the other benefits is keeping those notes around in case some assumptions change in the future and the optimization you picked is no longer viable.


Absolutely, yes. Or the optimization landscape suddenly changes. I had a super-cool trick for doing state transitions in a DFA at the rate of the throughput of a shuffle instruction rather than at the rate of the latency of a shuffle instruction, and smugly congratulated myself about how well it worked on Ivy Bridge (latency = 1, reciprocal throughput = 0.5). Then Haswell came along and took over the second shuffle capability to do 256-bit shuffles (latency = 1, throughput = 1). So the clever trick went obsolete overnight. :-(
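(To illustrate the latency-vs-throughput point for anyone following along: in a table-driven DFA, each transition needs the previous state before it can start, so the loop is a serial dependency chain bound by the *latency* of whatever instruction does the lookup, regardless of how many of them the CPU could issue per cycle. A minimal scalar sketch - the DFA and names here are made up, not the shuffle-based trick itself:)

```rust
/// Hypothetical 2-state DFA: ends in state 1 iff the last byte was b'a'.
fn build_table() -> [[u8; 256]; 2] {
    let mut t = [[0u8; 256]; 2];
    for s in 0..2 {
        t[s][b'a' as usize] = 1; // from either state, 'a' goes to state 1
    }
    t
}

fn run_dfa(table: &[[u8; 256]; 2], input: &[u8]) -> u8 {
    let mut state = 0u8;
    for &b in input {
        // Serial dependency: this lookup can't start until the previous
        // one finishes, so only the lookup's latency matters, not its
        // reciprocal throughput - exactly why a shuffle going from
        // throughput 0.5 to 1 per cycle doesn't hurt, but any trick that
        // relied on issuing two shuffles per cycle does.
        state = table[state as usize][b as usize];
    }
    state
}
```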


I remember a few stories like that back in the P3/P4/AMD era where a researcher ended up ripping out his hand-tuned assembly because the C reference implementation was increasingly faster. It was really good that they were very conscientious about testing both implementations regularly so there was zero concern about subtle incompatibility.


You might also enjoy the recently featured:

"Sloc Cloc and Code – What Happened on the Way to Faster Cloc" https://news.ycombinator.com/item?id=17232884


> Cachegrind’s output showed that the trivial methods for the simple BytePos and CharPos types in the parser are (a) extremely hot and (b) not being inlined.

To me, this points to a deficiency in the rustc compiler. All trivial methods (trivial in that they will always be cheaper than a function call) should be automatically treated by the compiler as if they had been marked #[inline]. Currently a Rust programmer has to annotate every single trivial getter and setter with that attribute, which is both tiresome and easy to forget.
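(For the unfamiliar, this is what that annotation burden looks like - a sketch loosely modeled on the BytePos mentioned in the post; the method names here are illustrative, not rustc's actual API:)

```rust
/// A newtype wrapping a u32 offset, like the parser's position types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BytePos(pub u32);

impl BytePos {
    // Without #[inline], this one-instruction method may not be inlined
    // when called from another crate (absent LTO), so every trivial
    // accessor needs the attribute spelled out by hand.
    #[inline]
    pub fn to_usize(self) -> usize {
        self.0 as usize
    }

    #[inline]
    pub fn offset(self, n: u32) -> BytePos {
        BytePos(self.0 + n)
    }
}
```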


This choice came out of a desire to have some guarantees around ABI stability. The idea was that changing the body of a non-inlined, non-generic function should not require recompilation of the caller. Since then, ABI stability has been eroded for various reasons, and it may well be time to revisit this choice.


We've experimented with MIR-only rlibs - https://github.com/rust-lang/rust/issues/38913#issuecomment-... has some recent data on it - which seem to me like the best approach for solving this, but it needs more work.

Eventually, we might compile everything at once, making the whole "crate" separation a tad bit obsolete (or even silly), other than for code organization (it has other complications, like trait coherence).


I'd love someone more knowledgeable to correct me, but it's my understanding that rustc does inline properly inside a crate; across crate boundaries, though, you must mark a method as #[inline]...


That sounds like using gcc for C or C++ without link-time optimization.

Edit for clarification: functions not marked as inline cannot be inlined when used in another translation unit unless LTO is enabled.


As far as I know, that is true.


Isn't that only true for the default "release" profile? AFAIK you have to turn LTO on, if you want to get the most performance out of it (LTO isn't on in any profile by default), but it's also been some time since I last optimized to that extent.
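(For reference, turning LTO on is a one-line Cargo profile setting - shown here as a sketch; "thin" trades some optimization scope for faster compiles versus full "fat" LTO:)

```toml
[profile.release]
lto = "thin"   # or `true` / "fat" for whole-program LTO
```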


I’m not sure to be honest, this is a corner of the language I find hard to remember.


Informative link



