
How to speed up the Rust compiler in 2020 - est31
https://blog.mozilla.org/nnethercote/2020/04/24/how-to-speed-up-the-rust-compiler-in-2020/
======
ChrisSD
> ...giving incremental compilation speed-ups of up to 13% across many
> benchmarks. It also added a lot more comments to explain what is going on in
> that code, and _removed multiple uses of unsafe_.

I wanted to emphasize that last point because there's sometimes a
misconception that using `unsafe` always means faster code. In truth you can
have both speed and safety. `unsafe` is needed only when it's necessary to
break from Rust's memory model or to otherwise write code that Rust cannot
reason about.

I mention this because I've noticed that even people familiar with Rust
sometimes imply `safe == slow` and `unsafe == fast`. At the extreme end you
get some new users who think merely adding an `unsafe` block around code will
provide a speed boost. It will not. In fact `unsafe` doesn't do anything by
itself. It simply allows calling certain functions and performing a few other
operations that would otherwise be forbidden. Sometimes this is indeed
necessary, but often it isn't.

~~~
lidHanteyk
So why is `unsafe` even in the language, then? It would seem to me that it
simultaneously undermines safety guarantees, and it can't even ensure
performance improvements.

~~~
umanwizard
Because performance improvements are not why `unsafe` exists. It exists to
disable various conservative compile-time checks in order to let you write
code that the compiler is unable to prove correctness properties about.

One reason to do this _might_ be performance. An example: if you can prove
lifetime properties but the compiler can’t, you might be able to just use raw
pointers without the borrow checker, which is unsafe. This can let you avoid
the performance penalty of smart pointers or cloning.

But there are other reasons to do it, too. Some things simply can’t be
reasoned about by the compiler, like calling external C code, so they always
need `unsafe`.
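A minimal sketch of the FFI case (using libc's `abs` as a trivial stand-in
for "external C code"):

```rust
// Declare a function from the C standard library. The compiler has to take
// our word that this signature is correct; it cannot check the C side.
extern "C" {
    fn abs(input: i32) -> i32;
}

fn main() {
    // The call site must be inside `unsafe`: Rust cannot reason about
    // what the foreign code does with memory.
    let magnitude = unsafe { abs(-3) };
    assert_eq!(magnitude, 3);
}
```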

~~~
steveklabnik
> It exists to disable various conservative compile-time checks

There is a subtle but important distinction here. Unsafe does not disable
_any_ compile time checks. It adds additional constructs which are not
checked.

~~~
umanwizard
Thank you, you are right. It enables certain constructs, e.g. dereferencing
raw pointers, which have fewer compile-time checks than safe constructs, e.g.
references.
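A minimal sketch of that distinction (the pointer and values here are just
illustrative):

```rust
fn main() {
    let x: u32 = 42;
    let p = &x as *const u32; // creating a raw pointer is safe

    // Dereferencing a raw pointer is one of the extra constructs that
    // `unsafe` enables; nothing about the surrounding safe code is checked
    // any less because of this block.
    let y = unsafe { *p };
    assert_eq!(y, 42);

    // The borrow checker, type checker, etc. still run on everything inside
    // the block: `unsafe` adds capabilities, it removes no checks.
}
```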

------
mkj
"The size of rlibs on disk has also shrunk by roughly 15-20%."

This is very welcome! The size of building a few rust projects on a laptop
disk is fairly incredible sometimes. I guess cargo needs to learn to clean up
prior-to-latest deps directories too.

~~~
Benmcdonald__
A 'cargo clean' every once in a while helps

~~~
mkj
Yeah, but then you have to wait 5 minutes to build the deps again. If it
cleared the stale ones it'd be fine.

~~~
jynelson
You might be interested in
[https://github.com/holmgr/cargo-sweep](https://github.com/holmgr/cargo-sweep)

------
david_draco
I love that they include a "Failures" section. It is most interesting to learn
from.

~~~
yinyang_in
Can you share a link to the Failures section? I couldn't find it on the
article's website.

~~~
dwaltrip
It's at the bottom of the post.

------
IshKebab
I have a question about LEB128. I use a similar format in my code, but with
two changes:

1. It's big endian.

2. The "last byte" bit is inverted (so it's 00001 instead of 11110).

I changed the last byte flag so that it is easy to count the number of values
in an array - you just do this:

    
    
      let mut count: u32 = 0;
      for &byte in bytes {
        count += u32::from(byte >> 7);
      }
    

And the reason I used big endian is because it makes decoding simpler.
Encoding is more difficult but data is only ever encoded once, and is commonly
decoded multiple times.

    
    
      let mut value: u64 = 0;
      for &byte in bytes {
        value = (value << 7) | u64::from(byte & 0x7F);
        if byte & 0x80 != 0 {
          return value;
        }
      }
    

Vs the original (little-endian LEB128) implementation, which basically has to
do this:

    
    
      let mut value: u64 = 0;
      let mut shift = 0;
      for &byte in bytes {
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
          return value;
        }
        shift += 7;
      }
    

Here is the generated assembly in both cases:
[https://godbolt.org/z/6FbX_E](https://godbolt.org/z/6FbX_E)

No idea if it would be faster than the new code but surely it is faster than
the old code?
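For reference, here's a runnable sketch of the scheme described above (big
endian, 7 data bits per byte, inverted "last byte" flag); the function names
are mine:

```rust
// Encode: most significant 7-bit group first; the high bit is 0 on
// continuation bytes and 1 on the final byte (the inverse of LEB128).
fn encode_be(mut value: u64, out: &mut Vec<u8>) {
    let mut groups = [0u8; 10]; // a u64 needs at most ceil(64 / 7) = 10 groups
    let mut n = 0;
    loop {
        groups[n] = (value & 0x7F) as u8;
        n += 1;
        value >>= 7;
        if value == 0 { break; }
    }
    // groups[0] is the least significant; emit in reverse, flagging the last.
    for i in (0..n).rev() {
        let flag = if i == 0 { 0x80 } else { 0x00 };
        out.push(groups[i] | flag);
    }
}

// Decode: no shift counter needed, because the groups arrive MSB-first.
fn decode_be(bytes: &[u8]) -> Option<u64> {
    let mut value = 0u64;
    for &byte in bytes {
        value = (value << 7) | u64::from(byte & 0x7F);
        if byte & 0x80 != 0 {
            return Some(value); // inverted flag: high bit set = last byte
        }
    }
    None
}

// Counting values is just counting the set "last byte" bits.
fn count_values(bytes: &[u8]) -> u64 {
    bytes.iter().map(|&b| u64::from(b >> 7)).sum()
}

fn main() {
    let mut out = Vec::new();
    encode_be(300, &mut out);
    assert_eq!(out, vec![0x02, 0xAC]);
    assert_eq!(decode_be(&out), Some(300));

    let mut buf = Vec::new();
    for v in [5u64, 300, 1 << 40] {
        encode_be(v, &mut buf);
    }
    assert_eq!(count_values(&buf), 3);
}
```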

~~~
KMag
You should put all of those "has another byte" bits together, so you can look
at just the first two bytes to figure out what to do, and then perform your
load. In your most common case, you can perform an 8-byte MOVBE, an LZCNT/BSR,
and if the number of leading 0s/1s < 8, just shift and mask to get your value.

Have a look at how UTF-8 is encoded for a similar scheme. The space taken is
exactly the same as LEB128, but your code is less branchy and needs fewer
masks and shifts.

~~~
IshKebab
> The space taken is exactly the same as LEB128

No because values under 128 (very common in my data) take 2 bytes instead of
1. Unless I'm misunderstanding you.

~~~
KMag
You are misunderstanding me. Values less than 128 take only one byte.

The obvious implementation fast path is:

1. Check that the current position is at least 8 bytes from the end of the
buffer. Otherwise, go to the short-buffer slow path.

2. Count the number of leading 1s (or 0s) using BSR/LZCNT.

3. If the number of leading 1s (or 0s) indicates more than 8 bytes are
necessary, use the multi-word slow path.

4. Shift and mask out your value.

5. Increment your position pointer by the length of your varint.

Values less than 128 will have no leading 1s (or no leading zeros, if your
scheme uses leading zeroes instead). The shift value is 56 bits. Unless you
have a 1-byte optimized path, your calculated mask value is (((uint64_t)-1LL)
>> 57). The position pointer is incremented by one byte at the end of the
decode.
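A sketch of that fast path in Rust, using `u8::leading_ones` in place of
BSR/LZCNT and handling only encodings up to 8 bytes (the function name and
`Option`-based signature are mine; the slow paths are elided):

```rust
/// Decode one prefix-coded varint: the count of leading 1s in the first byte
/// gives the number of continuation bytes (a UTF-8-like scheme).
/// Returns (value, bytes consumed), or None for the slow paths.
fn decode_prefix_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let first = *bytes.first()?;
    let extra = first.leading_ones() as usize; // continuation byte count
    if extra >= 8 || bytes.len() < 1 + extra {
        return None; // multi-word or short-buffer slow path, not sketched here
    }
    // Payload bits left in the first byte after the "1...10" length prefix.
    let mut value = u64::from(first & (0x7F >> extra));
    for &b in &bytes[1..1 + extra] {
        value = (value << 8) | u64::from(b);
    }
    Some((value, 1 + extra))
}

fn main() {
    // 0 0000101 -> one byte, value 5.
    assert_eq!(decode_prefix_varint(&[0x05]), Some((5, 1)));
    // 10 000001 00000000 -> two bytes, value 256.
    assert_eq!(decode_prefix_varint(&[0x81, 0x00]), Some((256, 2)));
    // A first byte of 0xFF means 8+ continuation bytes: slow path.
    assert_eq!(decode_prefix_varint(&[0xFF; 9]), None);
}
```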

~~~
KMag
Now that I think about it, for the uint128_t encoding case, you actually need
to inspect up to the first 3 bytes. I've only implemented this for
encoding/decoding uint64_ts, in which case there's a little trick that saves
one byte, but the trick only works because 64 % 7 == 1.

Maybe it's clearest if I spell out the 19 cases necessary to encode any
uint128_t (for a system using BSR (leading 1s) instead of LZCNT (leading 0s)).

    
    
      0 xxxxxxx -> 0 to 127
      10 xxxxxx yyyyyyyy -> 128 to 2**14 - 1
      110 xxxxx yyyyyyyy zzzzzzzz -> 2**14 to 2**21-1
      1110 xxxx yyyyyyyy zzzzzzzz aaaaaaaa -> 2**21 to 2**28-1
      11110 xxx yyyyyyyy zzzzzzzz ... 2 more bytes -> 2**28 to 2**35-1
      111110 xx yyyyyyyy zzzzzzzz ... 3 more bytes -> 2**35 to 2**42-1
      1111110 x yyyyyyyy zzzzzzzz ... 4 more bytes -> 2**42 to 2**49-1
      11111110  yyyyyyyy zzzzzzzz ... 5 more bytes -> 2**49 to 2**56-1
      11111111 0 yyyyyyy zzzzzzzz ... 6 more bytes -> 2**56 to 2**63-1
      11111111 10 yyyyyy zzzzzzzz ... 7 more bytes -> 2**63 to 2**70-1
      11111111 110 yyyyy zzzzzzzz ... 8 more bytes -> 2**70 to 2**77-1
      11111111 1110 yyyy zzzzzzzz ... 9 more bytes -> 2**77 to 2**84-1
      11111111 11110 yyy zzzzzzzz ... 10 more bytes -> 2**84 to 2**91-1
      11111111 111110 yy zzzzzzzz ... 11 more bytes -> 2**91 to 2**98-1
      11111111 1111110 y zzzzzzzz ... 12 more bytes -> 2**98 to 2**105-1
      11111111 1111111 0 zzzzzzzz ... 13 more bytes -> 2**105 to 2**112-1
      11111111 11111111 0zzzzzzz ... 14 more bytes -> 2**112 to 2**119-1
      11111111 11111111 10zzzzzz ... 15 more bytes -> 2**119 to 2**126-1
      11111111 11111111 110zzzzz ... 16 more bytes -> 2**126 to 2**133-1

------
sysk
Stupid question but could Rust skip monomorphization during development builds
and use dynamic dispatch? I heard generics were often to blame for slow
builds.

~~~
Ygg2
It's a tradeoff. Do you want your generic code to run as fast as if each
instantiation had been written (i.e. duplicated) by hand? Then
monomorphization is best.

For Rust, it makes a lot of sense to pay for some compile speed penalty to not
have any performance penalty.

~~~
crazyjncsu
> during development builds

That's the key phrase in his statement.

I'm not familiar with Rust, but if compilation speed is a major issue, and
there are aspects of the compilation that are avoidable to trade-off for
runtime performance, it seems to be a good idea to make those configurable
per-build-type. Does Rust not offer this?

~~~
snippy
I've written an image processing tool in Rust, where running a large image
downscaling test with a release build takes 2.0 seconds, and with a debug
build it takes 18.7 seconds. So most of the time during development I compiled
in release mode, because it was actually faster overall.

For many applications debug builds are fast enough, but their runtime
performance is already so slow that I hope they don't get even slower.

~~~
hobofan
It could certainly be a config flag on the profile that can be optionally
enabled, and wouldn't need to be enabled by default.
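Cargo does already expose per-profile optimization settings; a sketch of what
this could look like (the values are illustrative):

```toml
# Cargo.toml — opt into some optimization for dev builds
[profile.dev]
opt-level = 1          # light optimization for your own code

# Fully optimize all dependencies even in dev builds
# (per-package profile overrides, stable since Rust 1.41)
[profile.dev.package."*"]
opt-level = 3
```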

------
jonny383
How much slower is Rust as opposed to C or C++ for production-sized builds?

Traditionally people have thrown more hardware at these kinds of problems. As
someone who has yet to use Rust for anything more than small hobby-sized
projects, I am naive about rustc in terms of speed.

~~~
DougBTX
I had a quick search, but finding examples where the same project has been
implemented in C++ and Rust with reported compile times seems to be rare...

There is a benchmark from way back in 2016 here:
[https://users.rust-lang.org/t/are-there-any-compilation-time-benchmarks-of-rust-vs-g-vs-clang/4895/14](https://users.rust-lang.org/t/are-there-any-compilation-time-benchmarks-of-rust-vs-g-vs-clang/4895/14)

Dev build was 2.91s for Rust vs 8.48s for C++, release build was 5.97s for
Rust vs 9.79s for C++.

Would need to find much more recent examples for the comparison to be
worthwhile.

~~~
slavik81
That's a somewhat misleading comparison even for the time. All his C++ files
could be compiled in parallel and recompiled incrementally, and his need for
-flto to get good performance is really just due to how he structured his
program. He seems to regard cpp files as the default location for functions
and inlining as an optimization, but IMO it's the opposite. An empty function
goes in the header, and every function starts off empty. You move it to a cpp
file once it requires including a header. Moving a heavy function from a
header to a cpp file is an action to optimize the compile time of your
program.

On the other hand, it could be worse. There are some monstrously slow template
libraries out there, and he's not using any of them. His code isn't very
optimized, but it's sane.

------
vhbit
When will all of those changes be in stable? Will it be 1.44, or more likely
1.45?

~~~
steveklabnik
Depends on the change. The earliest one looks to be in 1.41, the latest one
landed two days ago, so should be 1.45.

