Data-driven performance optimization with Rust and Miri (medium.com/source-and-buggy)
88 points by dmit on Dec 9, 2022 | 51 comments



> The most surprising thing for me is how unintuitive it is to optimize Rust code given that it’s honestly hard to find a Rust project that doesn’t loudly strive to be “blazingly fast”. No language is intrinsically fast 100% of the time, at least not when a mortal like me is behind the keyboard. It takes work to optimize code, and too often that work is guess-and-check.

I get the feeling that a lot of Rust projects claim to be "blazingly fast" just because they are written in Rust, and not because they've made any attempt to actually optimize them. I rarely see any realistic benchmarks, and the few times I've looked deeply into the designs, they are not implemented with execution speed in mind, or in some cases are prematurely optimized in a way that is actively detrimental [1].

Personally I think it's because so many of the new Rust programmers are coming from scripting languages, so everything feels fast. I don't have any problems with that, but I'd advise anyone seeing a "blazingly fast" Rust project to check if the project has even a single reasonable benchmark to back that up.

[1] https://jackson.dev/post/rust-coreutils-dd/



Good examples, but one of those languages is not like the others, no matter how well it tries to blend in.


So wait, are you complaining that Rust projects advertise themselves as "blazingly fast" but actually aren't? Or are you complaining that not all Rust projects are fast? If it's the former, I don't think uutils advertises itself as "blazingly fast." And if it's the latter, then that... kind of seems unreasonable?


The former, that some projects say "it's blazingly fast because it's in Rust" but don't seem to have any benchmarks.

And admittedly I do agree that uutils is not really a good example.


Can you give any good examples?

Surely there are some people that advertise their projects as fast but actually aren't, or at least aren't in some cases, whether they be written in Rust or anything else. (Even in ripgrep's case, of which I have published benchmarks, GNU grep is still sometimes faster.) But are there any substantial programs with such false advertising? It's hard to qualify what "substantial" means, but perhaps "is available in Debian" is a reasonable starting point.

Although it seems you aren't necessarily complaining about false advertising, but rather, missing benchmarks. Most programs, regardless of advertising, don't publish carefully curated benchmarks. I myself have advertised Rust's regex crate as "fast" (albeit not "blazingly fast," I am not one for such flowery language), but I have never published any benchmarks. Of course, benchmarks exist and others can run them, but they are more for internal development than public consumption.

Much benchmarking proceeds in an ad hoc fashion. I'm not aware of even venerable programs such as GNU grep publishing benchmarks, for example.

To be clear, I agree it would be nice for programs to publish benchmarks, flowery advertising or not. But solid benchmarks are incredibly difficult to do. Have you ever published any? It takes enormous effort. Yet, I would excuse the words "blazingly fast" if at least some ad hoc benchmarks were run and at least some people are able to reproduce them, at least in common cases.

I think my bottom line here is that you seem to be complaining about a pattern, but I'm not entirely sure it is warranted.


> the few times I've looked deeply into the designs they are not implemented with execution speed in mind, or in some cases prematurely optimized in a way that is actively detrimental [1].

I'm not sure citing uutils/coreutils as an example is fair. I love that project; I learned Rust contributing to it. However, as the blog entry you cite itself notes:

> I saw the maintainers themselves mention that a lot of the code quality isn’t great since a lot of contributions are from people who are very new to Rust

I'm sure plenty of my slow code is still in `ls` and `sort` and that's okay?


“blazing fast” often means “I want to say it’s fast but I’ve never benchmarked it”


Every Linux C/C++/Rust developer should know about https://github.com/KDAB/hotspot. It's convenient and fast. I use it for Rust all the time, and it provides all of these features on the back of regular old `perf`.


What perf record settings do you use? Trying to use DWARF has never worked well for me with Rust, so I've been using LBR, but even then it seems like it gets which instructions are part of which function wrong a significant portion of the time.


I've had no problems with DWARF.


With optimizations and debug symbols turned on and the arch specified, perf report very often puts pieces of functions in the calling function for me when I use DWARF. Do you do anything specific in the build?


You need to activate the (somewhat slow) inline-stack-aware addr2line integration for optimized builds.

perf report doesn't place them there; they really are there, due to inlining.
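
To picture it (a made-up sketch, not from the earlier posts): after inlining, the callee's instructions are physically part of the caller's machine code, so a profiler can only attribute samples back to the callee if it does inline-aware symbolization.

    // Hypothetical example: `hot_inner` is inlined into `caller`, so its
    // instructions live inside `caller`'s machine code. Without inline-aware
    // symbolization, samples taken in `hot_inner` get reported as `caller`.
    #[inline(always)]
    fn hot_inner(x: u64) -> u64 {
        x.wrapping_mul(2654435761)
    }

    fn caller(data: &[u64]) -> u64 {
        data.iter().map(|&x| hot_inner(x)).sum()
    }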


Ah, very nice! I’ve been looking for something like Instruments on Linux since I like to click around and this looks cool.


Miri is not really meant for performance profiling; it runs on unoptimized MIR, which has very different performance from LLVM-optimized machine code.


This is really important. I think the author found the results surprising and unintuitive because they were looking at the wrong thing.

Rust libraries are designed to be fast when optimized with LLVM. Rust has a lot of layers of abstraction, and they're zero cost only when everything is fully optimized. If you look at unoptimized Miri execution, or insert code-level instrumentation that gets in the way of the optimizer, these abstractions aren't zero cost anymore, and overheads add up where they normally don't exist.
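
As a rough sketch of what that means in practice (my own example, not the article's code): with optimizations on, the iterator chain below typically compiles to the same loop as the hand-written version; in a debug build or under Miri, every adapter is a real call with real overhead.

    // Sketch: roughly equivalent after LLVM optimization, very different
    // when the abstraction layers are left unoptimized.
    fn sum_even_iter(values: &[u64]) -> u64 {
        values.iter().filter(|v| **v % 2 == 0).sum()
    }

    fn sum_even_manual(values: &[u64]) -> u64 {
        let mut total = 0;
        for &v in values {
            if v % 2 == 0 {
                total += v;
            }
        }
        total
    }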


I don't understand why you felt the need to point this out when the linked article explicitly mentioned it.

> Important note: Miri is not intended to accurately replicate optimized Rust runtime code. Optimizing for Miri can sometimes make your real code slower, and vice versa. It’s a helpful tool to guide your optimization, but you should always benchmark your changes with release builds, not with Miri.


The author was gently panned when this was originally posted on Reddit, and they added that note, but now that the article is being reposted uncritically where people may not know the difference, it's worth pointing out again. The way the author got a profile out of Miri is creative, but Miri was never a helpful guide for profiling. It seems the second benchmark they used for confirmation was also unoptimised (run without --release, which is a rookie mistake). They then drew wrong conclusions from flawed observations about the costs of abstractions like Range::contains.
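
For reference, the pattern in question looks roughly like this (a sketch, not the article's exact code); in a release build the two forms are generally expected to optimize to the same comparisons, which is why conclusions drawn from unoptimized runs are suspect.

    // Sketch: under Miri or in a debug build the contains() call goes
    // through extra layers (RangeInclusive, PartialOrd) and looks costly;
    // after optimization it is normally just two comparisons.
    fn in_range_abstraction(x: u32) -> bool {
        (10..=20).contains(&x)
    }

    fn in_range_manual(x: u32) -> bool {
        x >= 10 && x <= 20
    }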


I think they undersell the difference. You'd be laughed out of the room if you profiled debug code.


As a caveat, it can still be useful for profiling constant-evaluated code to improve compile time, since Miri is built on the same evaluator that the compiler uses.
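
A minimal sketch of the sort of compile-time evaluation this applies to (the lookup table is my own example): everything below runs in the compiler's const evaluator, the same engine Miri is built on, so if it were slow it would cost compile time rather than runtime.

    // Sketch: POPCOUNT is computed entirely at compile time by the const
    // evaluator; profiling slow const code like this is where Miri's view
    // can still be representative.
    const fn popcount_table() -> [u8; 256] {
        let mut table = [0u8; 256];
        let mut i = 0;
        while i < 256 {
            table[i] = (i as u8).count_ones() as u8;
            i += 1;
        }
        table
    }

    const POPCOUNT: [u8; 256] = popcount_table();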


Whaaat, Chrome has a built-in flamegraph profiler that you can use with profiling data from languages like Rust (and presumably others)?!

Sweet tip.


Similarly, py-spy is a sampling profiler for Python programs. It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way. py-spy is extremely low overhead: it is written in Rust for speed and doesn't run in the same process as the profiled Python program. This means py-spy is safe to use against production Python code.

I'm not sure if it exports results in a format Chrome can render but it does produce great interactive SVGs and is compatible with speedscope.app

https://github.com/benfred/py-spy

https://github.com/jlfwong/speedscope




While yes, it exists, it's really meh compared to something like Hotspot. Interactivity and availability do not favor JavaScript+DOM.


> In order to get any useful results I had to run my Advent of Code solution 1000 times in a loop

Yeah, that's my general problem with all these flamegraphs and other time-based tools. There's a bunch of noise!

I'd imagine for something with deterministic GC (or, hell, no GC) you should be able to get an "instruction count" based approach that'd be much more deterministic as to which version of the code is fastest (for that workflow).


An instruction count would be a decent first-order approximation, but I don't think it will be very useful for optimization.

The bottleneck of modern hardware (generally) is memory. You can get huge speedups by tweaking the way your program structures and operates on data to make it more cache friendly. This won't really affect instruction count but could make your program run 2x faster.
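
A sketch of the kind of thing meant here (names and shapes are arbitrary): both functions execute roughly the same instructions, but one walks memory sequentially and the other strides across it, and that difference is invisible to a pure instruction count.

    // Sketch: same arithmetic, very different cache behaviour.
    fn sum_row_major(m: &[Vec<u64>], cols: usize) -> u64 {
        let mut total = 0;
        for row in 0..m.len() {
            for col in 0..cols {
                total += m[row][col]; // sequential within each row
            }
        }
        total
    }

    fn sum_col_major(m: &[Vec<u64>], cols: usize) -> u64 {
        let mut total = 0;
        for col in 0..cols {
            for row in 0..m.len() {
                total += m[row][col]; // strides across rows, cache unfriendly
            }
        }
        total
    }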


I have been wondering for a while if the solution to the memory bottleneck is to stop treating cache memory as an abstraction and make it directly addressable. Removing L1 cache is probably impossible, but I suspect removing L2 and L3 caches and replacing them with working memory might fare better.

The latest iteration of this thought process was wondering what would happen if you exposed microcode for memory management to the kernel, and it ran a moral equivalent of eBPF directly in a beefed-up MMU. Legacy code and code that doesn't deign to do its own management would elect to use routines that maintain the cache abstraction. The kernel could also segment the caches for different processes, reducing the surface area for cache-related bugs.


Somebody else pointed out that I probably want to use Valgrind. It seems to have the ability to count instructions as well as cache misses [1] [2].

[1]: https://web.stanford.edu/class/archive/cs/cs107/cs107.1202/r...

[2]: http://www.codeofview.com/fix-rs/2017/01/24/how-to-optimize-...


Yup! Valgrind is an amazing tool. Just keep in mind that the program itself will be slower when run under Valgrind. It's great for debugging, but your benchmarks should be run raw.


And when all else fails, you reach for Intel VTune. :) Which has provided a free community version for a couple years now, I might add.


Something about flamegraphs has been bugging me for a while, and as more languages adopt async/await semantics it's coming to a head: flamegraphs work well for synchronous, sequential code. I don't think it's an accident that a lot of people first encountered them for JavaScript, in the era immediately before Promises became a widely known technique. They actually worked then, but now they're quickly becoming more trouble than they're worth.

A long time ago when people still tried to charge for profilers, I remember one whose primary display was a DAG, not unlike the way some microservices are visualized today. Only the simplest cases of cause and effect can be adequately displayed as a flamegraph. For anything else it's, as you say, all noise.


Your asynchronous code is made up of synchronous segments between the yield points. The flamegraph measures those just fine.


It tells you something happened, it doesn't tell you why.

The value of the flamegraph over the previous iteration was that it more directly showed the chain of events. So no, it doesn't measure it 'just fine'. It's accomplishing absolutely nothing.


>The value of the flamegraph over the previous iteration was that it more directly showed the chain of events.

Yes, and it continues to show that.

A perf trace for:

    async fn foo() -> i32 {
        bar().await
    }

    async fn bar() -> i32 {
        baz().await;
        whatever() // synchronous, CPU-bound segment between yield points
    }
... that happens to be captured while `whatever` is executing sees a callstack that is:

    #1 whatever() (somewhere inside it)
    #2 bar::poll() (at the whatever() call)
    #3 foo::poll() (at the bar.await line)
    #4 runtime executor ... (internals we don't care about)
The flamegraph records the CPU usage accordingly.

>So no, it doesn't measure it 'just fine'. It's accomplishing absolutely nothing.

How about you actually try to use it instead of assuming?


> Yes, and it continues to show that.

That’s a toy example and if you can’t see the bottleneck in a toy example then good luck to you.

If you're trying to figure out why 5 service calls followed by two more to merge the data together are taking 70ms, it's not going to be very helpful. If you're trying to speed up a batch process that makes 400 calls, it's only going to tell you about the cold start and the long tail. Both important problems, but not your biggest issue.


I use flamegraphs to investigate perf issues in a distributed system at my day job. But it's okay. Keep being confident about your ignorance.


I also use flamegraphs. And log files. And stats (duration, counts, memory, etc). And a debugger, microbenchmarks, htop, {io,vm,net}stat, human behavior (Steve's code tends to have this sort of problem), dead reckoning, educated guesses, staring at the code, and even on occasion brushing my teeth, going for a walk, or washing my hair. Getting the job done in spite of bad signal to noise ratio from tools is just part of the job description - for a very senior person. The SNR for flamegraphs is dropping, doubly so for junior and mid-level developers. And as I said in another thread a few days ago, we're also hiding data in 'modern' perf tools, which indicates to me that even the tool writers are missing some tricks. These days it's just easier to bump the cluster size from 18 to 21 machines and move on.


Rust has libraries like https://lib.rs/criterion that help with running code 1000 times in a loop, with proper timing, elimination of outliers, etc.
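
A minimal Criterion benchmark looks roughly like this (function and file names are placeholders); it also needs criterion as a dev-dependency and a [[bench]] entry with harness = false in Cargo.toml.

    // benches/fib_bench.rs -- a minimal Criterion sketch. Criterion handles
    // warm-up, repeated runs, outlier elimination and statistics.
    use std::hint::black_box;
    use criterion::{criterion_group, criterion_main, Criterion};

    fn fibonacci(n: u64) -> u64 {
        (0..n).fold((0u64, 1u64), |(a, b), _| (b, a + b)).0
    }

    fn bench_fib(c: &mut Criterion) {
        c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
    }

    criterion_group!(benches, bench_fib);
    criterion_main!(benches);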


Valgrind (and “friends”) are exactly this, they can measure cache misses, branch mispredictions, etc.


One nice thing about Go's built-in benchmarking is that it automatically takes care of running benchmarks repeatedly until runtime is stable and then averaging the time over a number of hot runs.


Talking of data-driven, I think I read that the Rust compiler team checks itself against some massive list of popular crates to make sure it doesn't break anything.

Would it be a reasonable use of resources to run all those test suites and identify hot spots for community-wide optimization?


> I read that the Rust compiler team checks itself against some massive list of popular crates to make sure it doesn't break anything.

Yes, that "massive list" being every single crate in the crates.io repository.

> Would it be a reasonable use of resources to run all those test suites and identify hot spots for community-wide optimization?

I believe the approach on the perf side of things has been to take reports on crates that are particularly slow (even at one particular part of the compiler) and create benchmarks from those. I believe this is partly because running against every single crate would be too slow, and partly because it would be a moving target (as new versions of crates are released) and thus would make it hard to track performance accurately.


As crater was already linked to in a sibling comment, the performance suite dashboard can be seen at https://perf.rust-lang.org/, and the suite itself at https://github.com/rust-lang/rustc-perf/tree/master/collecto...


I think what ZeroGravitas was getting at was the idea of optimising the actual software rather than the compiler.

For example you could imagine noticing that people make a lot of Vecs with 400 or 800 things in them, but not so many with 500 or 1000 things in them and so maybe the Vec growth rule needs tweaking to better accommodate that.
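
To make that concrete (a sketch; the exact growth sequence is an implementation detail of the standard library, not a guarantee):

    // Sketch: growing push-by-push reallocates and copies whenever capacity
    // runs out, whereas reserving up front allocates once.
    fn build_grown(n: u64) -> Vec<u64> {
        let mut v = Vec::new();
        for i in 0..n {
            v.push(i); // may reallocate when len == capacity
        }
        v
    }

    fn build_reserved(n: u64) -> Vec<u64> {
        let mut v = Vec::with_capacity(n as usize); // single allocation
        for i in 0..n {
            v.push(i);
        }
        v
    }

If real programs really did cluster around sizes like 400 and 800, data like that is what would justify tuning the growth rule.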


The tool you're referring to is called Crater: https://github.com/rust-lang/crater.


An often overlooked option for profiling Rust is Apple's Instruments.app. It's amazing and usually the first thing I reach for when I need a profiler on Mac OS X.


Yeah, Instruments is pretty great and it can even do that thing the article mentions, showing the hot lines of code annotated with their percentage of runtime. It’s not always perfect but at least it’s a quick way to know where to start looking.


My key takeaway from this is different - be very sceptical of third-party packages! Both performance issues were traced back to them, and his replacements for their functionality - while not being "battle tested" and surely constituting "re-inventing the wheel" - were faster, easier to read, and easier to understand.

Any front-end devs reading this? :)


I'm surprised there aren't CPU emulators that are just for collecting performance information.

Edit: Maybe something like this: https://github.com/guillon/run-qemu-profile


This is kind of like what LLVM MCA does.



