Hacker News new | past | comments | ask | show | jobs | submit login
How to speed up the Rust compiler in 2020 (blog.mozilla.org)
247 points by est31 44 days ago | hide | past | web | favorite | 114 comments

> ...giving incremental compilation speed-ups of up to 13% across many benchmarks. It also added a lot more comments to explain what is going on in that code, and removed multiple uses of unsafe.

I wanted to emphasize that last point because there's sometimes a misconception that using `unsafe` always means speeding up the code. In truth you can have speed and safety. Where unsafe is needed is when it's necessary to break from Rust's memory model or to otherwise write code that Rust cannot reason about.

I mention this because I've noticed even people familiar with Rust who sometimes imply `safe == slow` and `unsafe == fast`. At the extreme end you get some new users who think merely adding an `unsafe` block around code will provide a speed boost. It will not. In fact `unsafe` doesn't do anything itself. It simply allows calling some functions and performing a few other operations that would otherwise be forbidden. Sometimes this is indeed necessary but often it isn't.

It's great you emphasise it and I would generalise it even further. Any perf optimisation should be considered after reliably measuring (controlling for hot/cold cache, overall system load among others) its effect, instead of blindly adding unsafe everywhere.

Similar to people who sprinkle their Cpp code with inline thinking it will force the compiler to inline.

In fact, sometimes you can improve perf by adding an additional safe operation to your code! http://troubles.md/rust-optimization/#assert-conditions-befo...

The key takeaway from all perf optimisation articles to me is the methodology used for realiably verifying the effect of suggested changes. It's awesome to see the author include failed experiments.

It is shocking how often people suggest dubious code changes to "improve performance" with exactly zero benchmarks, or solely artificial benchmarks. Computers and compilers are far more complicated than your naive code analysis implies, you _must_ prove any performance changes with robust, real-world benchmarks, especially if your changes make the code more unsafe or harder to understand.

When I was younger, I implemented the 8 queens problem on my 386, and micro optimized the hell out of it. Assembler, cycle counting, disable interrupts, etc... Didn't switch to protected mode, but it came close.

Then I found a smarter algorithm, and my qbasic proof of concept was magnitudes faster. Ouch!

This kid learned something about the nature of optimization that day. Now Im an old fart but even older farts than me stumble into this.

That’s a great lesson to have early in your career, too.

My favorite example of that was a memory-bound app where a key function had been implemented in both C and x86 assembly. Someone tested and found that the assembler version had been slight performance loss since (IIRC) after the Pentium Pro, and since nobody had wanted to substantially change the assembly code going C-only meant they actually refactored the data access patterns and saw a substantial improvement.

Not to excuse people, who provide no metrics or benchmarks for their "performance optimisations", I want to acknowledge that benchmarking is hard.

I hope something like NixOS will enable people to write and publish "reproducible" (in quotes because precise numbers are impossible) benchmarks by enabling you to setup your OS, user space and compilation flags exactly like in the benchmark.

Oh that would be really compelling. I only tried NixOS briefly a while ago, but I think I recall you can switch (actually I think literally `switch`) environment like that pretty easily and back again?

So your benchmark environment could actually be pretty minimal, or if you were sharing a bug reproduction it could be devoid of unrelated packages/config in your usual environment.

In modern C++, inline is used to prevent 'multiple-definitons' errors, not to actually inline the function.

That’s terrifying. I will choose to avoid C++ projects if given the chance.

On a similar note: some people seem to think immutable is slower than mutable. I have seen this many times, especially in my language of choice (scheme). A lot of people coming from python or c write loops using (while ...) and mutation, which in most schemes (and in many dynamic languages) is a nice way to get boxing penalties. Instead, a recursive function is almost always preferred. I have since since long preferred recursion.

Which (not strictly relevant) leads me to my biggest gripe (which is pretty small) with rust. I like rust. I really do. But a couple of times a week I have to use loop when tail recursion would have been a lot clearer. In debug builds , rust disables sibling recursion (a less general form of tail recursion) which means I have to either use a trampoline macro (simple and slow) or some kind of syntax rewriting macros (complex and fast). Neither seem like a good idea.

Should I just accept the ugliness of loop+mutation or is there a nice alternative?

Consider contributing to the RFC process. This has been proposed a number of times, and I don't think anyone's against it. It's just not been considered priority.

I would advise that TCE will probably be more acceptable to the Rust community with an explicit opt-in syntax. And that `become` is a reserved keyword in Rust that is semi-earmarked for this purpose.

I doubt I have the know-how it needs to navigate the issues raised in the previous discussions. I could maybe write a proposal where sibling recursion (i.e. self recursion and calls to equally-typed procedures) is guaranteed via the becomes keyword, but i dont really have the motivation to spend that much time discussing it or write the proposal.

Such a proposal would in any way be "future proof" in the sense that general becomes-TCO could be added later in a compatible way.

Rust has controlled mutable state, which is sort of a best of both worlds (vs pervasively everything-is-mutable and functional approaches)

Yeah! I have found that writing rust as I write scheme (with some limits, of course) usually means code runs fast without having to even think about the borrow checker.

Sometimes iterators aren't enough though, and if I had reliable sibling recursion in debug mode, most of my use of loop would disappear.

Recursive functions should be avoided whenever reasonably possible, for several reasons (performance, safety, optimizer shenanigans...).

Immutable may be slower or faster than mutable, depending on what one is writing and optimizer.

Tail recursion is not slower. A properly tail recursive function can be transformed into a jump, and not mutating loop variables makes flow analysis less complex.

Scheme implementations usually doesn't even bother doing type inference once you start using set!, and just implicitly boxes everything that you mutate. Of course you _can_ optimize it, but once you have a language with multi-shot continuations you are probably better off encouraging immutability.

Tail call optimization often doesn't happen when you think it should (in Rust) - when it does work it is not slower, but you can't rely on it. In practice recursive functions, even tail recursive functions, are a performance footgun that will likely result in slower code.

There was a proposal many years ago to add a keyword like "return" that guaranteed tail call optimization to rust. Unfortunately that RFC was never accepted and implemented.

The other way in which recursion is a BIG footgun is in reading Profiler output.

In particular it is hard to determine whether the body of your function is slow or whether it is recursing too much.

How is that different from loops?

I know that it is almost impossible to even hope for in c++ due to finalizers and things like that. I don't know enough about the borrow checker to know if there are similar problems in rust. I have noticed that LLVM does sibling recursion (LLVM's name for self and like-typed function TCO) quite often for release builds by inspecting the assembly output. I have never actually had it not happen, which speaks for it being possible, at least for sibling recursion that does not pass into extern functions.

"become" became reserved before rust 1.0. One could maybe just let it guarantee LLVM sibling recursion as a starting point and maybe extend it with proper TCO later.

Tail recursion is a trivial subset of recursion. If you are in that case, you are better off writing a loop and avoiding to rely on the optimizer.

As for Scheme, that is a detail of Scheme, not general mutability vs. immutability.

Which is what I said I'm doing. What I said is that I would to not have to do that. I just said I would love to be able to (explicitly) guarantee self- and/or sibling recursion, maybe using a compiler directive. I explicitly mentioned the tail in tail recursion in my post.

General TCO would be nice as well, say when writing a state machine/recursive descent parser, but that is a hassle, especially if you want C interop. Since many of the rust devs are old ocamlers, I doubt I need to explain the virtues of (tail) recursion to them. Which is also why I never bothered to bring it up: they probably have very good reasons for not supporting it. The reserved "become" keyword is testament to them at least giving it some thought.

My understanding is that they couldn't find a way to get TCO to play nicely with destructors.

Unsafe does not imply fast, but safe does restrict what you can do.

In other words, the optimal safe solution can be slower than the optimal unsafe one.

On the other hand, it's possible that safe code gives the compiler more guarantees, so it has more room to optimize.

Whether your code is safe or unsafe can have a performance impact, but what the impact is depends on the specific situation.

Reading the article made me wonder when the remove of unsafe was due to refactoring the code and when it was enabled by enhancements to the language since the code was originally written.

I heard that C/C++ game engine devs would throw everything on the stack and do their own thing in terms of memory management.

How does this approach play with Rust?

Not a Rust dev, but from what I understand, the lifetime tracking of Rust means you can safely give out references to stack allocated memory without worrying about corruption/segfaults/use-after-free errors or the like, because the compiler will catch it if that is possible, whereas in C++ you'd usually want to do defensive copying in situations like that. So in some cases, Rust actually allows you to fearlessly do things the most efficient way in a way you'd not be able to do in C++.

You can do this in Rust as well, and it's in fact being done in the compiler: https://github.com/rust-lang/rust/search?p=2&q=TypedArena&un...


In general, there is bad support in the standard for custom allocators, but there are crates that provide generic custom containers.

No problem at all with using the stack a lot? But it would be explicit in most cases. Much less chances of memory corruption issues than in C/C++ thanks to lifetime tracking. And much less copying due to default move semantics, and the assurances given by lifetime tracking.

Adding unsafe can even slowdown your code. The unsafe block prevents the compiler from making certain assumptions, which can prevent some optimizations from happening.

'unsafe' is like telling the compiler 'trust me on this', and the compiler (not fully understanding what is going on in there) wisely decides not mess with what you wrote.

Adding an unsafe block changes nothing about compiler assumptions. Some unsafe constructs do not have the same optimization potential as their safe counterparts.

So why is `unsafe` even in the language, then? It would seem to me that it simultaneously undermines safety guarantees, and it can't even ensure performance improvements.

Because performance improvements are not why `unsafe` exists. It exists to disable various conservative compile-time checks in order to let you write code that the compiler is unable to prove correctness properties about.

One reason to do this might be performance. An example: if you can prove lifetime properties but the compiler can’t, you might be able to just use raw pointers without the borrow checker, which is unsafe. This can let you avoid the performance penalty of smart pointers or cloning.

But there are other reasons to do it, too. Some things simply can’t be reasoned about by the compiler, like calling external C code, so they always need `unsafe`.

> It exists to disable various conservative compile-time checks

There is a subtle but important distinction here. Unsafe does not disable any compile time checks. It adds additional constructs which are not checked.

Thank you, you are right. It enables certain constructs, e.g. dereferencing raw pointers, which have fewer compile-time checks than safe constructs, e.g. references.

At some point a Rust program does have to run on a real platform. Anything interacting with that platform will be `unsafe` because Rust cannot reason about things outside itself. You'd usually wrap these unsafe things in safe functions. Note this isn't about performance at all.

Also in Rust's model there are only unique or shared pointers (and the owned resource that they point to). This can't be used to represent all data structures. You need some unsafe code (ideally wrapped as Safe structures) to create these structures.

And as noted, `unsafe` can be used for performance improvements if benchmarking shows that breaking out of Rust's model does indeed improve performance. But you can't simply assume safe Rust incurs a performance hit or that your unsafe code will optimize better than the safe code.

In short, my point was one of emphasis and framing. When thinking about `unsafe` it shouldn't be thought of as a performance feature per se. That may be one effect of using certain features in certain situations but its an effect that needs to be proved by robust benchmarking.

`unsafe` is in the language for compatibility, not performance. Rust programs run between countless other chunks of software, most of which do not have comparable safety features, same as the communication channels inbetween. But you still want Rust programs to be fully compatible with that "outside world" when needed.

Using 'unsafe' isn't about improving performance, though it may be required to enable some performance improvements. It allows you to use constructs that the Rust safety checking infrastructure can't reason about. Memory used as a DMA target (ubiquitous in database engines) is an example.

There's a great new talk by on Gjengset on youtube that tackles just this question:

Rust NYC: Jon Gjengset - Demystifying unsafe code


Unsafe can't ensure performance improvements, but you can sometimes get performance boosts using it.

It should be a last resort if you have a bottle neck that is a result of memory restrictions.

It's also how FFI works. If you're calling a c library, you cannot guarantee memory safety anymore since its outside of rusts domain, so it needs to be unsafe.

> Unsafe can't ensure performance improvements, but you can sometimes get performance boosts using it.

Maybe to back this up with a concrete example, let's say I'm parsing something and I have a &[u8] (a string of bytes) which I have already verified contains only ASCII digits. I wanted to get the numerical value of that number. I don't want to write my own number parsing code, but std only has a parsing function for &str (UTF-8 strings), not for &[u8]. I could do

  let string = str::from_utf8(bytes);
  let value: u64 = string.parse()?;
But then str::from_utf8 would iterate over the bytestring again to verify that it's valid UTF-8. This check is useless because I already know the string only contains ASCII digits. So in this case, I can improve performance with an unsafe block:

  //SAFETY: `bytes` was already proven to only contain ASCII digits
  let string = unsafe { str::from_utf8_unchecked(bytes) };
  let value: u64 = string.parse()?;
The performance gain comes not from unsafe per se, it comes from using a different function that skips the UTF-8 check. Since this function does not guarantee on its own that str's invariants are upheld, it's marked unsafe.

In one of my crates, std::str::from_utf8 is indeed the only reason it uses unsafe.

    let (result, len) = match std::str::from_utf8(buf) {
        Ok(s) => (s, buf.len()),
        Err(err) => match (err.valid_up_to(), err.error_len()) {
            (0, Some(_)) => return Err(Error::Utf8(err)),
            (0, None) => return Err(Error::NeedMoreData),
            (valid_up_to, _) => (
                unsafe { std::str::from_utf8_unchecked(buf.get_unchecked(..valid_up_to)) },
The intention is to extract as much valid &str from the given buf as possible, and only fail if there is no more valid &str to read from the buf. Unfortunately std::str::from_utf8's error does not also yield a &str corresponding to the part that did parse successfully ( * ), so I have to compute it myself. Using std::str::from_utf8 for this second pass unfortunately does end up verifying the UTF8-ness of the buf again, as verified from the asm, because the compiler isn't sufficiently smart. Similarly slicing buf with valid_up_to safely also reruns the bounds check, which is why the code also uses get_unchecked for that instead.

( * ) Nothing prevents it from doing that, and one of these days I might get around to PRing it.

Isn't this a bit fragile though? It's totally fine as long as you own and understand the code. But what's to stop someone new and/or inexperienced from coming along and violating the 'already proven' claim? How will you know?

Isn't there a risk of Rust code having "Don't ever change this" comments?

That's a general problem of coding, that the next edit can violate an invariant. The advantage with Rust is that, for several types of invariants, you only need to check places with an unsafe block during code review or audits. Everywhere else, the compiler upholds these invariants for you.

Why not lift the encoding to the type system? The type system could remember which strings are ASCII and which are UTF-8. More generally, why not lift such proofs to types?

That's exactly the difference between [u8] and str. The unsafety here arises from skipping the standard proof over a more domain-specific one.

> So why is `unsafe` even in the language, then?

Because you cannot do "Hello world" without unsafe code somewhere in the middle.

What? Plenty of languages lack an unsafe mode. Java, for instance.

The unsafe code is in the JVM, and Rust does not have an equivalent.

(Also, Java had an unsafe mode for a long time, and its removal caused quite the upheaval, from what I understand.)

It's not the case that Rust's unsafe feature is there to facilitate the standard library. I'm still not clear if that's what you were saying.

If you're trying to print "hello world" on the screen, you need to make a system call. Doing so is unsafe, by Rust's definition.

There are two main solutions to this:

1. have some sort of "unsafe" language construct so that you can make syscalls yourself.

2. Disallow arbitrary syscalls. put the code for making said syscall inside of some trusted code, whether that be the the runtime or the standard library.

Rust has chosen option #1. Java has chosen #2.

(Also, virtually every language has an "unsafe mode" in whatever C FFI exists. I was referring to sun.misc.Unsafe earlier, but JNI still exists, of course.)

Sure, that all sounds about right.

To rephrase my point: including an unsafe mode is an option when designing a language. Your original comment appeared be suggesting that it was a necessity, and that one could not write a Hello World program in a language that lacks an unsafe mode. Which is not the case, of course. JavaScript, for instance.

It would seem that a takeaway from this thread is that languages should not have C FFI if they want to be safe.

unsafe lets you exit the small subset of code patterns whose safety the compiler can prove.

Not everything that you write in unsafe is necessarily unsafe: it only means that the compiler can't prove it. Another way of thinking: if unsafe let you only write unsafe code, then it would be useless because that unsafe code would eventually segfault.

"The size of rlibs on disk has also shrunk by roughly 15-20%."

This is very welcome! The size of building a few rust projects on a laptop disk is fairly incredible sometimes. I guess cargo needs to learn clean up prior-to-latest deps directories too.

A 'cargo clean' every once in a while helps

Yeah, but then you have to wait 5 minutes to build the deps again. If it cleared the stale ones it'd be fine.

You might be interested in https://github.com/holmgr/cargo-sweep

I love that they include a "Failures" section. It is most interesting to learn from.

Can you share the link of Failures section, couldn't find in articles website.

It's at the bottom of the post.

I have a question about LEB128. I use a similar format in my code, but with two changes:

1. It's big endian.

2. The "last byte" bit is inverted (so it's 00001 instead of 11110).

I changed the last byte flag so that it is easy to count the number of values in an array - you just do this:

  let mut count = 0;
  for byte in bytes {
    count += byte >> 7;
And the reason I used big endian is because it makes decoding simpler. Encoding is more difficult but data is only ever encoded once, and is commonly decoded multiple times.

  let mut value = 0;
  for byte in bytes {
    value = (value << 7) | (byte & 0x7F);
    if value & 0x80 != 0 {
       return value;
Vs the original implementation which has do to this basically:

  let mut value = 0;
  let mut shift = 0;
  for byte in bytes {
    value |= (byte & 0x7F) << shift;
    if value & 0x80 != 0 {
      return value;
Here is the generated assembly in both cases: https://godbolt.org/z/6FbX_E

No idea if it would be faster than the new code but surely it is faster than the old code?

You should put all of those "has another byte" bits together, so you can look at just the first two bytes to figure out what to do, and then perform your load. In your most common case, you can perform an 8-bite MOVBE, an LZCNT/BSR, and if the number of leading 0s/1s < 8, just shift and mask to get your value.

Have a look at how UTF-8 is encoded for a similar scheme. The space taken is exactly the same as LEB128, but your code is less branchy and needs fewer masks and shifts.

> The space taken is exactly the same as LEB128

No because values under 128 (very common in my data) take 2 bytes instead of 1. Unless I'm misunderstanding you.

You are misunderstanding me. Values less than 128 take only one byte.

The obvious implementation fast path is:

1. Check that the current position is at least a 8 bytes from the end of the buffer. Otherwise, go to the short buffer slow path.

2. Count the number of leading 1s (or 0s) using BSR/LZCNT.

3. If the number of leading 1s (or 0s) indicates more than 8 bytes are necessary, use the multi-word slow path.

4. Shift and mask out your value.

5. Increment your position pointer by the length of your varint.

Values less than 128 will have no leading 1s (or no leading zeros, if your scheme uses leading zeroes instead). The shift value is 56 bits. Unless you have a 1-byte optimized path, your calculated mask value is (((uint64_t)-1LL) >> 57). The position pointer is incremented by one byte at the end of the decode.

Now that I think about it, for the uint126_t encoding case, you actually need to inspect up to the first 3 bytes. I've only implemented this for encoding/decoding uint64_ts, in which case there's a little trick that saves one byte, but the trick only works because 64 % 7 == 1.

Maybe it's most clear if I spell out the 19 cases necessary to encode any uint128_t (system using BSR (leading 1s) insteod of LZCNT (leading 0s)).

  0 xxxxxxx -> 0 to 127
  10 xxxxxx yyyyyyyy -> 128 to 2**14 - 1
  110 xxxxx yyyyyyyy zzzzzzzz -> 2**14 to 2**21-1
  1110 xxxx yyyyyyyy zzzzzzzz aaaaaaaa -> 2**21 to 2**28-1
  11110 xxx yyyyyyyy zzzzzzzz ... 2 more bytes -> 2**28 to 2**35-1
  111110 xx yyyyyyyy zzzzzzzz ... 3 more bytes -> 2**35 to 2**42-1
  1111110 x yyyyyyyy zzzzzzzz ... 4 more bytes -> 2**42 to 2**49-1
  11111110  yyyyyyyy zzzzzzzz ... 5 more bytes -> 2**49 to 2**56-1
  11111111 0 yyyyyyy zzzzzzzz ... 6 more bytes -> 2**56 to 2**63-1
  11111111 10 yyyyyy zzzzzzzz ... 7 more bytes -> 2**63 to 2**70-1
  11111111 110 yyyyy zzzzzzzz ... 8 more bytes -> 2**70 to 2**77-1
  11111111 1110 yyyy zzzzzzzz ... 9 more bytes -> 2**77 to 2**84-1
  11111111 11110 yyy zzzzzzzz ... 10 more bytes -> 2**84 to 2**91-1
  11111111 111110 yy zzzzzzzz ... 11 more bytes -> 2**91 to 2**98-1
  11111111 1111110 y zzzzzzzz ... 12 more bytes -> 2**98 to 2**105-1
  11111111 1111111 0 zzzzzzzz ... 13 more bytes -> 2**105 to 2**112-1
  11111111 11111111 0zzzzzzz ... 14 more bytes -> 2**112 to 2**119-1
  11111111 11111111 10zzzzzz ... 15 more bytes -> 2**119 to 2**126-1
  11111111 11111111 110zzzzz ... 16 more bytes -> 2**126 to 2**133-1

Ah I see, that makes sense.

I would guess that Rust wants to stick to one standard encoding, and LEB128 is already in WebAssembly, DWARF, and LLVM. Adding BEB128 would add more complexity. Also, since Rust is generating a bunch of output files, it's definitely possible that it encodes more often than it decodes.

You missed a "shift += 7;" in your original implementation code, though it's present at the godbolt link.

It's not clear what point 2 is supposed to specify. LEB128 sets the 0x80 bit if there's a continuation. I'm not sure what "(00001 instead of 11110)" is supposed to specify. Does BEB128 leave the top bit clear when there's a continuation? Or does it set the lowest bit when there's a continuation? The code for counting would work for LEB128, but the text around it makes it seem like it's an improvement over LEB128.

Yeah I mean the continuation bit is just inverted so 0=continue, 1=end-of-value. You can still count values efficiently with 1=continue, 0=end-of-value, but you have an extra subtraction at the end (`return number_of_bytes - number_of_bytes_with_top_bit_set;`)

Good point about not adding another format. I guess they want to avoid confusion.

I think its widely understood that a little-endian prefix VarInt decodes much faster as the leading alternative to LEB128. You can implement the whole thing without a loop, thus you don't have to exercise branch and loop prediction resources at all.

- count leading zeros (or ones, depending on the first byte's tagging preference)

- Unaligned load expressed as memcpy. Let the compiler's instruction scheduler peephole it out into a single unaligned load instruction.

- shift the loaded value by an amount indicated by the leading zero count.

- Zig-zag decode for signed integers

What do you mean by "little-endian prefix varint"? Like this?

``` enum VarInt { Byte(u8), Short(u16), Long(u32), LongLong(u64), } ```

I.e. one byte tag followed by the value? It's definitely an option for me but the main issue is that values under 128 now take 2 bytes instead of 1. I guess I could do something like what CBOR does and use 0-251 = literal value, 252 = u8 follows, 253 = u16 follows, 254 = u32 follows, 255 = u64 follow.

It can be considered a variation on LEB128 that packs all of the continuation bits into the first byte.

Ie, two leading ones set in the first byte mean that there are two trailing bytes.

https://news.ycombinator.com/item?id=11263378 is one description.

https://github.com/stoklund/varint has some benchmarks.

Ah ok I understand. Might be a little awkward to implement in Javascript given its lack of 64-bit ints, etc, although it does at least have CLZ32!

Stupid question but could Rust skip monomorphization during development builds and use dynamic dispatch? I heard generics were often to blame for slow builds.

It could potentially do that, but there are a lot of things that "normal" monomorphized generic code can do (and does routinely) that would be very very difficult to implement dynamically.

This post goes into a lot of detail on the subject: https://gankra.github.io/blah/swift-abi/

There are perhaps other tradeoffs that could be made instead, like doing more implicit boxing in debug builds, but it's not clear to me that those would be any easier to reconcile with all of the low-level details that Rust exposes. E.g. what happens to `mem::size_of` when an object of a polymorphic type has been boxed?

Related to this area is ongoing "polymorphization" work to detect pieces of generic code that actually do not depend on their type parameters, to avoid duplicating it: https://github.com/rust-lang/rust/pull/69749

This could potentially be made into a much finer-grained analysis to reduce the duplication further, though that may or may not actually help compile times depending on how expensive the analysis is.

It's a tradeoff. Do you want your code to run as if it was written by hand (i.e. duplicated by hand)? Then monomorphization is best.

For Rust, it makes a lot of sense to pay for some compile speed penalty to not have any performance penalty.

> during development builds

That's the key phrase in his statement.

I'm not familiar with Rust, but if compilation speed is a major issue, and there are aspects of the compilation that are avoidable to trade-off for runtime performance, it seems to be a good idea to make those configurable per-build-type. Does Rust not offer this?

I've written an image processing tool in Rust, where running a large image downscaling test with a release build takes 2.0 seconds, and with a debug build it takes 18.7 seconds. So most of the time during development I compiled in release mode, because it was actually faster overall.

For many applications debug builds are fast enough, but their runtime performance is already so slow, that I hope they don't get even slower.

It could certainly be a config flag on the profile that can be optionally enabled, and wouldn't need to be enabled by default.

Not currently, but it's been discussed a few times. Nobody has tried to write a patch yet, that I'm aware of.

> Do you want your code to run as if it was written by hand (i.e. duplicated by hand)? Then monomorphization is best.

That's not always true because monomorphization where the vast majority of the object code is the same in all actual cases means bloating the text of the program, which means putting pressure on the Icache, which means more cache misses, which... is slow.

In some cases maybe, but I don't think it could work in all cases. One of the things that monomorphization does for you, is statically determine the size of your types. For example, if you have a `Vec<u32>` and a `Vec<u64>`, the implementation of `Vec` needs to multiply by the size of the element type (4 bytes vs 8 bytes) to figure out the offset of each element. I don't think regular dynamic dispatch provides size information, so the compiler might need to "invent" a new flavor of dynamic dispatch internally if it wanted to do this?

With dynamic dispatch, they'd all be fat pointer sized. It would move to Vec<Box<dyn SomeTrait>> where SomeTrait is auto-generated and implemented for u32 and u64.

Ah I was thinking about "keep things allocated where they were before, but try to dynamically dispatch implementation code, to avoid compiling it N different times." It sounds like you're talking about "heap allocate everything by default." If everything was implicitly boxed, no type could be `Copy`, right? Is that strategy possible without breaking existing code?

Ah, you might be able to get away with a &dyn Trait too, not sure.

You can box Copy types.

You can Box Copy types, yes, but the result is !Copy. So for example, maybe I have a slice of ints, and I'm calling ptr::copy_nonoverlapping on that slice. (Maybe via safe code, like slice::copy_from_slice.) That no longer works; I need new heap allocations for each int. And my slices of ints no longer have a memory layout that's compatible with C, so all ffi breaks?

Ah yeah.

I do think you're correct here, which is that this idea sounds simple but has a lot of edge cases.

Is it a goal of production vs debug build that the data structures that you define in your program are always the same?

I could see some kinds of funny differences in behavior between debug / production builds, if for a debug build a particular thing is boxed, but in production it isn't.

If it's a program-visible kind of box, yes that would be a problem.

But if you're generating polymorphic dispatch code at a lower level than expressible in Rust itself, there is no reason why hidden box types cannot be used to simulate unboxed types. That's just a memory representation difference. A bit like the way V8 creates specialised types for JavaScript objects at run time, but it's invisible to JavaScript except for speed.

The trouble is that the memory representation is fixed and observable in many cases. If you have arrays of primitive or repr(C) types, those are guaranteed to have the same contiguous layout as an array in C would. In particular, you often need to pass pointers to those things to actual C code. Copy types are similar; even if you can't necessarily assume all the details about their layout, you can copy them byte for byte and assume the result is valid. But that doesn't work if the compiler secretly inserts an owning heap pointer that needs to remain unique. The compiler could try to deduce when the memory layout is actually observed in practice, and do whatever it wants in cases where it can prove the difference doesn't matter, but I feel like that would end up making everything more complicated rather than less.

Yes, I do also think this is a big risk with this strategy.

Somebody wrote a macro for the simple cases: https://llogiq.github.io/2019/05/18/momo.html

How much slower is rust as opposed to C or C++ for production sized builds?

Traditionally people have thrown more hardware at these kinds of problems. As someone who has yet to use rust for anything more than small hobby sized projects, I am naive about rustic in terms of speed

I think the range of compilation times for C and different styles of C++ is so huge that this question cannot be answered without going into a lot of details about how the C or C++ project is structured.

You can compile hundreds-of-thousands of lines of "user-side" C++ code in a few seconds, or in a few hours. The range is much smaller in most C projects, but can also differ by orders of magnitude.

In short, splitting a C++ project into many small source files (e.g. "one file per class") combined with using a lot of C++ stdlib headers in each small source file is the worst case for a C++ project because of all the complex stdlib template code that needs to be compiled over and over again. This can increase the "under-the-hood" line count by several hundred times (in Google Chrome, the amount of "project C++ code" is only 0.125% of all the code the compiler needs to compile when rebuilding the project because of included headers):


AFAIK Rust solves part of this problem with its module system, but the compilation time problems lurk elsewhere.

Here's an interesting webpage which looks at compilation time of C++ stdlib headers and some popular 3rd-party libs:


IF I understand correctly, the borrow checker comes with some pretty severe compilation costs, since it requires a pretty wide level of code analysis, and might be single threaded.

Not generally, no. Or rather, there are other things that are a much bigger cost.

Rust's type system takes up a surprisingly small portion of compilation time. See the comment rooted at [0] for some discussion.

tl;dr: Profiling and tests show that codegen and LLVM work takes up the lion's share of compilation time. The borrow checker itself isn't that expensive in the grand scheme of things, and languages like OCaml show that it's very feasible to have a complex type system that compiles quickly.

[0]: https://news.ycombinator.com/item?id=22935795

From what I've heard they're roughly comparable, some projects will swing one way, some the other.

Rust has solid incremental compilation now though so you shouldn't be paying the price of a full rebuild often.

I had a quick search, but finding examples where the same project has been implemented in C++ and Rust with reported compile times seems to be rare...

There is a benchmark from way back in 2016 here: https://users.rust-lang.org/t/are-there-any-compilation-time...

Dev build was 2.91s for Rust vs 8.48s for C++, release build was 5.97s for Rust vs 9.79s for C++.

Would need to find much more recent examples for the comparison to be worthwhile.

That's a somewhat misleading comparison even for the time. All his C++ files could be compiled in parallel and recompiled incrementally, and his need for -flto to get good performance is really just due to how he structured his program. He seems to regard cpp files as the default location for functions and inlining as an optimization, but IMO it's the opposite. An empty function goes in the header, and every function starts off empty. You move it to a cpp file once it requires including a header. Moving a heavy function from a header to a cpp file is an action to optimize the compile time of your program.

On the other hand, it could be worse. There are some monstrously slow template libraries out there, and he's not using any of them. His code isn't very optimized, but it's sane.

The Rust compiler was far different in 2016, across a number of fronts. It's gotten significantly faster, as well as added a number of features to tweak compile times in various scenarios, like incremental compilation.


This is a handy resource, though no benchmarks are ever perfect. Rust comes out faster on about half the algorithms. They're pretty close though.

I don't know why people feel the need to downvote instead of clarify, but this page is about runtime speed. The article is about compilation speed.

True, although for example:

n-body Rust #7 program 16.37s to complete and log all make actions

n-body C++ g++ #2 program 6.22s to complete and log all make actions



Interesting - I didn't know the site gave those numbers at all. Obviously they are not the ones in the main comparisons.

Something seemed a bit weird with those numbers to me. My heavily loaded 2015 MacBook Air, which is also overheating, took <5s to compile nbody with the same Rust version and command line. The ratio remains about the same though - a similar GCC version is also much faster. Turns out, the hardware is really, really old ("Measured on a quad-core 2.4Ghz Intel® Q6600® with 3.8 GiB of RAM and 250GB SATA II disk drive; using Ubuntu™ 19.10 Linux x64 5.3.0-29-generic.").

> The ratio remains about the same though…

So for that back-of-the-envelope comparison of compile times (of 300 line tiny tiny programs) it really really doesn't matter that "the hardware is really, really old" :-)

It matters a lot for the absolute numbers. Apparently not for the relative ones (my measurements were super noisy though).

> How much slower is rust as opposed to C or C++ for production sized builds?

"relative ones" :-)

I'm not sure what you mean. The absolute difference appears to have shrunk considerably vs old hardware, based on this single, unreliable example. Probably most people care more about that than the relative difference? It's not explicit in the question (and neither are the benchmarks production-sized builds).

> It's not explicit in the question…


Although if jonny383 actually wanted to know how many seconds then shouldn't our answer be — it depends.

Thanks, I inferred incorrectly.

when all of those changes will be in stable? will it be 1.44 or more likely 1.45?

Depends on the change. The earliest one looks to be in 1.41, the latest one landed two days ago, so should be 1.45.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact