Hacker News

In general, if Rust is significantly slower than equivalent C, it's a bug. We do care about bugs, and if you try some stuff and find them, please file them. (We track them via this tag: https://github.com/rust-lang/rust/labels/I-slow)

Embedded can be tricky at times; we've had a working group pushing on making stuff great all year. There's still a lot of work to be done. Sometimes you need to opt into nightly-only features. We'll get there!

(One major example is platform support: since we're built off of LLVM, we may not have a backend for more obscure architectures. ARM stuff works really well, generally. AVR is down to some codegen bugs...)




Are there theoretical boundaries that prevent Rust from ever reaching C performance or is this just a practical question of compiler optimization?


It really depends on exactly what you mean.

There's one significant thing that's in C that's not in Rust: alloca. If you need that, well, we've talked about it, but haven't accepted a design, and some people never want to gain it directly.
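To illustrate what Rust code does instead — a sketch (the function and size threshold are hypothetical): where C would `alloca` a runtime-sized stack buffer, Rust typically either bounds the size and uses a fixed stack array, or falls back to the heap.

```rust
// Where C might use alloca(len), Rust usually picks between a bounded
// stack buffer and a heap allocation, chosen at runtime.
fn sum_zeroed(len: usize) -> u32 {
    const MAX_STACK: usize = 256;
    if len <= MAX_STACK {
        let buf = [0u8; MAX_STACK];       // stack: fixed upper bound
        buf[..len].iter().map(|&b| u32::from(b)).sum()
    } else {
        let buf = vec![0u8; len];         // heap: truly runtime-sized
        buf.iter().map(|&b| u32::from(b)).sum()
    }
}
```

Crates like smallvec package up exactly this pattern.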

EDIT: another comment pointed out goto; I have no idea what the performance implications are but I’ll have to add that to my list of “stuff C has that Rust doesn’t.”

Sometimes, you need nightly-only features. This means that, in some sense, today's Rust is slower, but tomorrow's Rust may not be.

Sometimes, "equivalence" is the issue. You can turn off bounds checking for array access, for example, but most Rust code doesn't, for hopefully good reason. Sometimes those checks are elided, and so it's exactly the same. Sometimes they're not. Sometimes they're not and that inhibits other optimizations. Is unsafe Rust "equivalent" to the C? Or does that not count? (Unsafe Rust is not always faster than Safe Rust...)
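To make the "equivalence" question concrete, here's a sketch (functions are hypothetical): the same loop with checked indexing, and with the check opted out via `unsafe`.

```rust
// Safe version: xs[i] carries a bounds check, though LLVM can often
// prove i < xs.len() here and drop it.
fn sum_checked(xs: &[u32]) -> u32 {
    let mut total = 0;
    for i in 0..xs.len() {
        total += xs[i];
    }
    total
}

// "Equivalent to C" version: no check is ever emitted, and an
// out-of-range index would be undefined behavior.
fn sum_unchecked(xs: &[u32]) -> u32 {
    let mut total = 0;
    for i in 0..xs.len() {
        total += unsafe { *xs.get_unchecked(i) };
    }
    total
}
```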

At the theory level, I can't think of anything off the top of my head that would inherently make Rust slower than C, generally speaking. There's also some degree of argument that in theory, Rust should be faster than C, thanks to how much more we know about aliasing, etc. That's a whole other can of worms...


> Sometimes those checks are elided, and so it's exactly the same. Sometimes they're not.

This is something I've personally thought about a lot and I have no idea how I'd actually want it implemented either on the backend or in the syntax, but I think it'd be useful to have some way to say "I want these bound checks to be elided, if they can't be then make that an error". Similar to how you'd decorate a pure function in other languages to say that it can't cause side effects or depend on side effects for what it's doing.

I think that's a high-level ask that's probably significantly more difficult than it initially seems, too.
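There's no "elide or error" annotation today, but a well-known partial workaround is to hand the optimizer a single up-front fact it can use to kill the per-iteration checks (sketch; the function is hypothetical):

```rust
// The assert gives LLVM one branch to check up front; after it, both
// a[i] and b[i] are provably in range, so the per-iteration bounds
// checks can usually be removed.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = 0.0;
    for i in 0..a.len() {
        acc += a[i] * b[i];
    }
    acc
}
```

The catch is that whether elision actually happened is only visible in the generated assembly — which is exactly the gap an "error if not elided" feature would close.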


You probably just want a dependently typed language. Proving bounds at compile time is an introductory exercise for the likes of Idris.


Quite possibly that's what would work. I haven't tried any of them in any kind of serious capacity before.

Edit: Looks like there's been an RFC in the past to add it to rust, but it got punted because the const generics stuff went in first and they wanted to see how that would shake out before trying to fully tackle this. https://github.com/ticki/rfcs/blob/pi-types-ext-2/text/0000-...


ats [http://www.ats-lang.org/] is a C-level language with dependent types


Note that ATS has a limited version of dependent types as compared to Idris/Agda/Coq/etc. As I understand it, you can't run arbitrary (recursive, etc) functions at the type level. But even this 'lite' version lets you express preconditions like "n is a multiple of 4" or "the source array and destination array must not overlap in memory".


Yeah that'd be very interesting.


Can this sort of information be gotten out of LLVM?


> EDIT: another comment pointed out goto; I have no idea what the performance implications are but I’ll have to add that to my list of “stuff C has that Rust doesn’t.”

On that front, one interesting thought would be to reimplement the CPython bytecode interpreter in Rust and see if you can match the performance of C using computed goto.

In principle, there's no fundamental reason Rust couldn't optimize "loop around match" exactly the same way, without needing the computed goto. (For that matter, so could C.) Doing that would help cases like this.
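A minimal sketch of that "loop around match" dispatch shape (the opcode set here is made up, far simpler than CPython's):

```rust
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Halt,
}

// Classic interpreter shape in Rust: one loop, one match on the opcode.
// A computed-goto interpreter would instead jump directly from the end
// of each handler to the next one.
fn run(code: &[Op]) -> i64 {
    let mut pc = 0;
    let mut stack: Vec<i64> = Vec::new();
    loop {
        match code[pc] {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Halt => return stack.pop().unwrap_or(0),
        }
        pc += 1; // falls off the end (and panics) if code lacks a Halt
    }
}
```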


Well, the answer should be "no" for unsafe Rust, since you can mechanically translate any C to unsafe Rust and vice versa.

For safe Rust, sure, there are limitations: you can't turn off bounds checks, for instance. But all Rust is partially unsafe Rust, because the standard library uses unsafe code*. So it's a fuzzy distinction.

* You wouldn't want it any other way. Implementing low-level functionality as unsafe code allows us to write code, not compiler-code-that-generates-code.


In a sense you can mechanically translate between any Turing complete languages, but there's not a "naive" translation, so to speak, because C can have irreducible CFGs and Rust cannot. It's not clear to me that reflowing a CFG can always be done without a performance impact.


You can convert an irreducible CFG to a reducible one with a loop and a switch (or a match, in Rust's case).

In pseudocode:

  Goto = Start;
  while (Goto != End) {
    match (Goto) {
    ...
    }
  }
It's a pretty basic obfuscation technique, and compilers will happily unroll it, leaving you with the original irreducible CFG.
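Concretely, in Rust, that shape might look like this (block names and the arithmetic are invented for illustration):

```rust
#[derive(PartialEq)]
enum State {
    A,
    B,
    C,
    End,
}

// An (imagined) irreducible CFG with blocks A, B, C, re-encoded as a
// state machine: each match arm is a basic block that returns the
// successor block.
fn reloop_demo(mut n: i32) -> i32 {
    let mut state = State::A;
    while state != State::End {
        state = match state {
            State::A => if n > 10 { State::C } else { State::B },
            State::B => { n += 3; State::C }
            State::C => { n += 1; if n > 10 { State::End } else { State::B } }
            State::End => unreachable!(),
        };
    }
    n
}
```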


You have more faith in the compiler than I do. Here's a stupid test I made.

* In C++, using goto: https://godbolt.org/z/YqCMbU

* In Rust, with relooped CFG: https://godbolt.org/z/IIXtKo

The compiler was obviously not able to unroll the relooped code into the original CFG.

A specific example: in principle, the compiler could see the exact target of every continue in the Rust code, so the continue on line 34 could go directly to line 27 (or, better, as in C++ the basic blocks could just be laid out adjacently), but the compiler does not actually do this and there are a bunch of unnecessary tests on the path between those two lines.


It's unfortunate that the enum-based solution doesn't work, but that particular example can be made "optimal" (at least, as good control flow as LLVM can get this code: the same as clang, not GCC), by using a rather unpleasant series of loops and labelled breaks: https://godbolt.org/z/sQ1nH8 (rustfmt'd version: https://play.rust-lang.org/?gist=b1091e0c583b88f27bf1eaeae3a...).

This doesn't work in general, and is ... impossible to maintain, but it's not an unreasonable approach for generated code.


Note that Rust has "forward goto" in the form of "labelled break" (which can even carry a value), so I suspect some cases might not even need converting to a loop-match "state machine".
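For example (hypothetical function), a labelled break jumps forward out of nested loops directly, with no state-machine encoding needed:

```rust
// Find indices i < j with xs[i] + xs[j] == target.
fn find_pair(xs: &[i32], target: i32) -> Option<(usize, usize)> {
    let mut found = None;
    'outer: for i in 0..xs.len() {
        for j in (i + 1)..xs.len() {
            if xs[i] + xs[j] == target {
                found = Some((i, j));
                break 'outer; // "forward goto": skips the rest of both loops
            }
        }
    }
    found
}
```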


True, I forgot about goto.


Not only true goto but also switch and computed goto. AFAIK C/C++ are essentially the only games in town for these. No modern language has even corrected the purely syntactic limitation that prevents nested switches (by, e.g., allowing you to attach label names to switch blocks and cases).


note that when you use iterators, bounds checks do get elided.
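For instance, zipping two slices bounds the loop by the shorter length up front, so there's no per-element index left to check (sketch):

```rust
// zip caps iteration at the shorter slice, so every access is known
// to be in bounds and no per-element check is emitted.
fn add_into(dst: &mut [f32], src: &[f32]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d += *s;
    }
}
```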


Not qualified to answer, but I did notice recently that Rust appears to generate more compiled code than equivalent C, though I don't think this realistically has a negative performance impact. You can check this out on godbolt (https://godbolt.org/).


It's possible to set compiler flags to optimize for size. That said, I've also seen large binaries, and monomorphization can contribute to that. Some projects are taking that more seriously; for example, miniserde is explicitly designed to minimize monomorphization and thus improve both code size and compile time.
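For reference, a size-oriented release profile might look like this (these are the standard Cargo knobs; the particular values are just one reasonable combination):

```toml
[profile.release]
opt-level = "z"   # optimize for size instead of speed
lto = true        # link-time optimization: cross-crate inlining, dead-code removal
codegen-units = 1 # trade compile time for better optimization
panic = "abort"   # drop the unwinding machinery
```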


Hadn't heard of miniserde before, awesome find, thanks!!


Bounds checking will inhibit vectorization, and panicking will bloat code. Maybe I'll compare a Rust coreutils binary against something like sbase.


Regarding performance: For my first foray into rust, I recently tried reimplementing a simple C++ tool in rust. It reads pcap data from stdin, finds the vlan tag for each packet and writes packets to one of a few different files (or stdout) depending on the vlan id.

I know C++ quite well, and I'm confident that the C++ version is reasonably well optimized, though it probably has a little room for improvement.

In rust, I think I'm doing things in a reasonable fashion but so far the performance is only half of the C++ version. So, not bad, but I was hoping it would be closer. I'd like to know if anyone has any suggestions for resources related to rust optimization.


You can use your usual tools, like perf, on Rust.

What’s the situation with the IO? Are you buffering? Are you holding the lock for a long time or rapidly locking and unlocking?

If you can share the code I can take a look.


I realized that stdout was acquiring and releasing a mutex on every write operation; fixing that improved things. The good news is that when writing to /dev/null, it's now faster than the C++ version, but writing to stdout is still ~25% slower. I suppose there's something else suboptimal about the way I'm using stdout, but I haven't figured it out yet.

Code is here: https://gist.github.com/usefulcat/56f334bc58c97edb073b457b68...

There is actually one other file but it only contains a couple of struct definitions for the libpcap file and packet headers. Thanks for having a look!


Update: I wrapped the locked stdout handle in a BufWriter and now the rust version is >25% faster than the C++ version in all cases.

I didn't think there was room for that much improvement; I'm really impressed.
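The fix amounts to locking stdout once and wrapping the lock in a BufWriter — a sketch (`write_report` is a hypothetical stand-in for the packet-writing loop):

```rust
use std::io::{self, BufWriter, Write};

// Generic over Write, so the same loop serves stdout, a file, or a
// test buffer.
fn write_report<W: Write>(mut out: W) -> io::Result<()> {
    for i in 0..3 {
        writeln!(out, "packet {}", i)?;
    }
    out.flush() // BufWriter also flushes on drop, but drop swallows errors
}

// Lock stdout once and buffer it: each write becomes a memcpy into an
// internal buffer instead of a mutex acquisition plus a write syscall.
fn report_to_stdout() -> io::Result<()> {
    let stdout = io::stdout();
    write_report(BufWriter::new(stdout.lock()))
}
```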


If you have a blog, that would make for an amazing entry, going from 50% slower to 25% faster.

Edit: HOLY CRAP! I have a program written in Rust, ppbert[1], and I just tried wrapping my StdoutLock object in a BufWriter, and I improved the performance of my pretty-printing by a factor of 2x! I knew to use BufReader for files, I didn't know it was helpful for stdin and stdout! Thank you _so much_ for sharing your experience, I've certainly benefited!

    Benchmark #1: ppbert -2 *.bert2
      Time (mean ± σ):      3.816 s ±  0.115 s    [User: 2.494 s, System: 1.321 s]
      Range (min … max):    3.688 s …  4.028 s

    Benchmark #2: ppbert-dev -2 *.bert2
      Time (mean ± σ):      1.728 s ±  0.045 s    [User: 1.493 s, System: 0.234 s]
      Range (min … max):    1.678 s …  1.843 s

    Summary
      'ppbert-dev -2 *.bert2' ran 2.21x faster than 'ppbert -2 *.bert2'
[1] https://github.com/gnuvince/ppbert


Awesome! Glad you got it sorted. Those two things are always footguns...


Can you share the code? Are you using the `--release` flag to build/run it?


Yes I'm using --release.


Hey Steve, thanks for commenting here. As a person who is trying to get into embedded development but who knows very little, can you explain the challenges of having a language support most/all embedded platforms? I'm guessing you would need to support old versions of gcc, or something like that? Or be able to compile things with each platform's special flags?

It seems like even C projects have a hard time being portable. People even avoid cmake because of the large dependency and the fact that it is cross platform until it isn’t.


The first issue is to have a compiler backend for that target. Someone has to write that. gcc supports more targets than LLVM.

Then, you have to make sure that it doesn't break; this means running on that target in CI, somehow. Given that you're already talking about devices that may not even have an OS... emulation can work sometimes?

Then, there's ecosystem stuff. You probably want some sort of HAL and support for not just one platform, but all of the platforms you're deploying to. So that's more work...

Finally, some platforms are proprietary and basically give you their own fork of gcc and so C is pretty much your only option anyway.


Not Steve, but I do know a thing or two about embedded and portable C code.

First of all, most embedded development makes use of bare metal, where the libraries take the role of an OS, or they use a specialized OS from the hardware vendor.

Just using pure ANSI C isn't possible, because the standard does not expose the hardware features from the underlying platform, so the alternatives are to use Assembly, or language extensions.

Naturally language extensions are more convenient to use, so that is what most developers end up doing.

Also there are many types of embedded platforms; you can be targeting anything from a tiny PIC with 8 KB of flash to a powerful multi-core ARMv8-A with 8 GB of RAM.

So the toolchain must allow for customization of what actually gets linked into the final binary, and the runtime must be as thin as possible.

Then there is the drivers story: each vendor gives you their own SDK, which most of the time is the only way to access their devices.

It is typical for open source projects to reverse engineer some of those SDKs to get the necessary information for linker maps, compiler flags and driver information.

Regarding Rust, there is an ongoing effort to create an embedded library for hardware drivers, as a means to write portable code.

https://github.com/rust-embedded/awesome-embedded-rust


For some of the unsupported targets there is the Rust->C compiler. Haven't used it myself but I've seen that, for example, the ESP32/ESP8266/Xtensa is supported by it.


That's true, but the issue with it as a real development option is that its job was to bootstrap the compiler, and so it implements Rust 1.19 (I believe...), and nothing further. So you're missing out on a lot of useful stuff.



