I'm amused that the author reached a satisfactory result of "only" 7 hours, only to later realize that they had forgotten to compile their Rust code with optimizations; fixing that brought the runtime down to 30 minutes. I've never used Julia, but unoptimized Rust binaries contain such naive and suboptimal codegen that I assume something must have been wrong with their Julia code; my ballpark for unoptimized Rust is that it's roughly on par with interpreted (not JITted) Ruby or Python (parallelism notwithstanding, which bears much more fruit in Rust than in Ruby/Python even without optimizations).
Still, at the same time, we can judge a language based on how easy it is to fall into the pit of success rather than the pit of failure; if their Julia code had a problem, then maybe the Julia developers could tweak some defaults (or change some documentation) so that people in the future do not fall afoul of the same result. The same sort of thing exists in Rust as well, where the fact that I/O is unbuffered by default makes it perform poorly in microbenchmarks that involve printing a zillion times in a loop: https://nnethercote.github.io/perf-book/io.html#buffering
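To illustrate, the standard fix is to wrap stdout in a `BufWriter` so that many small writes coalesce into one large write instead of one syscall per line (a minimal sketch; the helper name is my own):

```rust
use std::io::{self, BufWriter, Write};

// Writing through a BufWriter batches many small writes into one
// large write to the underlying sink, avoiding a syscall per line.
fn write_lines<W: Write>(sink: W, n: usize) -> io::Result<()> {
    let mut out = BufWriter::new(sink);
    for i in 0..n {
        writeln!(out, "{}", i)?;
    }
    // Flush explicitly so nothing is left sitting in the buffer.
    out.flush()
}

fn main() -> io::Result<()> {
    // Lock stdout once and buffer it, instead of locking and
    // flushing on every println! call.
    write_lines(io::stdout().lock(), 5)
}
```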
The biggest performance trap they had was copying all their strings in a really hot loop to a vector of characters. I'm not sure what we could do to steer people away from that...
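I don't know the exact code, but the shape of the trap is presumably something like this hypothetical sketch: collecting each string into a fresh `Vec<char>` inside the hot loop, versus iterating the chars in place:

```rust
// Hypothetical sketch of the trap: allocating a fresh Vec<char>
// for every word in a hot loop.
fn count_matches_slow(words: &[&str], target: char) -> usize {
    let mut n = 0;
    for w in words {
        // Allocates and copies the whole string on every iteration.
        let chars: Vec<char> = w.chars().collect();
        n += chars.iter().filter(|&&c| c == target).count();
    }
    n
}

// The same loop, iterating the chars in place with no allocation.
fn count_matches_fast(words: &[&str], target: char) -> usize {
    words
        .iter()
        .map(|w| w.chars().filter(|&c| c == target).count())
        .sum()
}

fn main() {
    let words = ["crane", "slate", "adieu"];
    assert_eq!(count_matches_slow(&words, 'a'), count_matches_fast(&words, 'a'));
}
```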
I've been teaching people to use Rust since 2011. It is quite common for someone coming from an interpreted language background to not realize that they have to manually ask for optimizations when building their code. In such circumstances they often come asking for help saying "I must be doing something wrong, I rewrote this Python code in Rust and it appears to be slower somehow?" For a long time on the /r/rust sidebar we even had a note saying "before asking for help, have you tried compiling with --release?" And indeed, once they turn on optimizations their code goes from as slow as Python to as fast as C. The OP's case is completely representative; a 14x improvement merely by compiling in release mode is entirely in line with what I have observed.
This may be surprising to people coming from C or C++, where unoptimized code is slower, but not that slow. But the point of Rust is that it has selectively provided as many zero-cost abstractions as it can, which, when paired with a modern, sufficiently smart backend like LLVM, boil away into nothing. It's a great feeling the first time you crack open the assembly output for a highly abstract chain of iterators and closures, only to find that the whole shebang has been reduced to a handful of fixed-form arithmetic operations without a loop or a function call in sight. But the tradeoff is that you do have to perform the step of boiling those abstractions away.
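For a concrete taste, here's the kind of chain I mean; with optimizations on, LLVM routinely reduces a pipeline like this to straight-line arithmetic, or a constant when the bounds are known (a sketch, not a guarantee for every chain):

```rust
// A highly abstract pipeline: range -> map -> filter -> sum.
// In release builds LLVM commonly compiles this down to
// straight-line arithmetic with no loop or closure calls left.
fn sum_of_even_squares(n: u64) -> u64 {
    (1..=n).map(|x| x * x).filter(|x| x % 2 == 0).sum()
}

fn main() {
    // Even squares up to 10^2: 4 + 16 + 36 + 64 + 100 = 220.
    assert_eq!(sum_of_even_squares(10), 220);
}
```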
This is not a new problem, actually. When IBM built their PL.8 research compiler for RISC and safe systems programming, they took exactly the same approach as Rust, just in 1982.
It's quite an interesting paper, in case you haven't read it: "An Overview of the PL.8 Compiler".
I don’t think there are C compilers out there that do not optimize at all.
AFAIK, all of them will do some register assignment, some constant folding, etc.
For example, if you write

    void f(int a)
    {
        int b = a;
        b = b + b;
        g(b);
    }

are there compilers out there that compile that to

    - load a
    - store into b
    - load b
    - load b into another register
    - add
    - store into b
    - load b
    - call g
    - return

?
Languages that do overflow and range checks, such as Julia, and Rust in debug mode, would have to add lots of those checks there if they did not do any optimization.
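To make that concrete: in a Rust debug build, plain integer `+` compiles to a checked add that panics on overflow, while release builds omit the check by default and wrap. The explicit methods below spell out both behaviors in any build:

```rust
fn main() {
    // In debug builds, `a + b` on integers is an overflow-checked
    // add that panics on wrap; in release builds the check is
    // omitted by default and the value wraps modulo 2^8.
    let a: u8 = 200;
    let b: u8 = 100;

    assert_eq!(a.checked_add(b), None);  // overflow detected
    assert_eq!(a.wrapping_add(b), 44);   // 300 mod 256
}
```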
I like Rust a lot but this is a big frustration with it. I have a hard time seriously imagining Rust taking off in game development when debug code is so bad, at least without some style guidelines that avoid the worst problems with it.
I'm glad Zig is providing some competition here that's more in line with what one would expect from C.
There is also a trick to compile your own program without optimizations while optimizing all dependencies.
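If I remember correctly, the trick is a Cargo profile override; something like this in `Cargo.toml` (assuming a reasonably recent Cargo) keeps your own crate unoptimized for fast edit-compile cycles while building every dependency at full optimization:

```toml
# Optimize every dependency even in dev builds,
# while leaving the workspace's own crates unoptimized.
[profile.dev.package."*"]
opt-level = 3
```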
I too had the "Un-optimized Rust code is slow!" experience when I started out, because my first project required PNG encoding and decoding, and the png crate was incredibly slow without optimizations.
By contrast, it's typical for C and C++ programs to be un-optimized at the level of `main` while calling into heavily-optimized system libraries like ffmpeg.
One promising avenue of Rust development is that, while --release implies -O2 and the default debug profile implies -O0, the -O1 middle ground hasn't received a lot of attention. In the future, the contexts you mention seem like they would benefit from such a sweet spot.
One of the early advent of code challenges triggered this situation for many people (including myself). The same program compiled in release mode was more than 4000x faster than debug mode.
Although it's worth mentioning that LLVM understands math and might be replacing an inefficient sum calculation with an efficient algorithm, so it might not be all down to codegen.
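The classic instance is a naive summation loop, which LLVM's loop analyses can rewrite into the closed-form Gauss formula, making it O(1) regardless of n (a sketch; whether the rewrite fires depends on the exact loop shape):

```rust
// A deliberately naive running sum. With optimizations on, LLVM's
// induction-variable analysis commonly replaces loops like this
// with the closed form n * (n + 1) / 2, so the "loop" disappears.
fn naive_sum(n: u64) -> u64 {
    let mut total = 0u64;
    for i in 1..=n {
        total += i;
    }
    total
}

fn main() {
    let n = 1_000;
    // Matches the closed-form result either way.
    assert_eq!(naive_sum(n), n * (n + 1) / 2);
}
```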
That’s true; I hadn’t realized how large the gap between debug and release builds is in Rust. Most of what I do in Rust is not performance-sensitive enough for debug performance to be an issue.
A tradeoff in the other direction, though, is compilation time. Currently the master branch takes 15 minutes to compile in release mode. It’s probably due to the hardcoded root choices, and I assume compilation would be faster if they were loaded from a string instead of a structure, but it is a bit of a gotcha.
Surely you don't mean the code in the OP takes you 15 minutes to compile in release mode; it's a 100-line file with no dependencies and no macros or generics to hide codegen behind, and it shouldn't take more than an instant to compile. To get a 15-minute release build, you'd either need a very large amount of code, or you may be attempting extensive type-level programming that rustc was not designed to support. You may also want to try swapping out your linker; replacing the system linker with gold or lld can have dramatic results if you're bottlenecked on the link stage.
Not the code linked from the article; the goal of that code was to generate the scores of the guesses at the root of the search tree, so that I could hardcode them in.
It is possible that there is a recommended way to do it differently which I missed. I tried lazy_static!, but ended up having to fight the type system, and the related GitHub issues didn’t bring me hope that I could overcome it easily.
Interesting, you appear to be hitting some kind of pathological case with the `vec!` macro. Apparently it doesn't like being used with a 15,000-line literal. :P Fortunately you're right, there is a different way to do this, which AFAICT doesn't suffer from the same pathology. Swapping out the `vec!` invocation brought the time of `cargo clean && cargo build --release` down from 345 seconds to 13 seconds. I consider this a compiler bug; I'll file it if I can't find an existing issue.
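A sketch of the kind of replacement that avoids the macro (the names and data here are invented for illustration): move the literal into a `static` array, which the compiler handles as plain data, and convert it to a `Vec` at runtime:

```rust
// Hypothetical shape of the data; the real table had ~15,000 entries.
// A `static` array is emitted as plain data in the binary,
// sidestepping whatever the huge `vec!` expansion was doing
// at compile time.
static ROOT_SCORES: [(u16, u32); 4] = [(0, 3), (1, 17), (2, 5), (3, 9)];

fn root_scores() -> Vec<(u16, u32)> {
    // One allocation and one memcpy at startup, instead of a
    // pathological compile-time expansion.
    ROOT_SCORES.to_vec()
}

fn main() {
    assert_eq!(root_scores().len(), 4);
}
```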