
Lolbench: automagically and empirically discovering Rust performance regressions - anp
https://blog.anp.lol/rust/2018/09/29/lolbench/
======
MikeHolman
Do you have any plans to better distinguish between noise and regressions?
I run a similar performance testing infrastructure for Chakra, and found that
comparing against the previous run makes the results noisy. That means more
manual review of results, which gets old fast.

What I do now is run a script that averages results from the preceding 10 runs
and compares that to the average of the following 5 runs to see if the
regression is consistent or anomalous. If the regression is consistent, then
the script automatically files a bug in our tracker.

There is still some noise in the results, but it cuts down on those one-off
issues.
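
A rough sketch of that kind of windowed check (the window sizes match the
above; the 5% threshold is just an illustrative value, not what we actually
use):

```rust
/// Flag a suspected regression at run index `i` by comparing the mean of
/// the 10 runs before it to the mean of the 5 runs starting at it.
fn is_consistent_regression(times: &[f64], i: usize, threshold: f64) -> bool {
    if i < 10 || i + 5 > times.len() {
        return false; // not enough history or follow-up runs yet
    }
    let mean = |xs: &[f64]| xs.iter().sum::<f64>() / xs.len() as f64;
    let before = mean(&times[i - 10..i]); // average of the preceding 10 runs
    let after = mean(&times[i..i + 5]); // average of the following 5 runs
    // "Consistent" here means the later window is slower by more than
    // `threshold`, e.g. 0.05 for a 5% slowdown.
    after > before * (1.0 + threshold)
}

fn main() {
    // Seconds per run: 10 quiet runs, then a step up in the last 5.
    let times = vec![
        1.00, 1.01, 0.99, 1.00, 1.02, 1.00, 0.98, 1.01, 1.00, 0.99,
        1.20, 1.19, 1.21, 1.20, 1.18,
    ];
    println!("{}", is_consistent_regression(&times, 10, 0.05)); // true
}
```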

~~~
mkl
Do you mean 10 preceding versions, or 10 repeated timings of the same version?
If you repeat the timing for each version many times, why is that not
enough to smooth out the noise?

~~~
dsamarin
Instead of averaging, I can recommend my go-to L-estimator for this sort of
thing: the midsummary. Take the average of the 40th and 60th percentiles as
your measure of central tendency for performance.
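
A rough sketch of that estimator (nearest-rank percentiles just to keep it
short; a real implementation would interpolate):

```rust
/// Midsummary: the mean of the 40th and 60th percentiles, a measure of
/// central tendency that ignores the tails of the distribution.
fn midsummary(samples: &[f64]) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank percentile lookup.
    let pct =
        |p: f64| sorted[((p / 100.0) * (sorted.len() - 1) as f64).round() as usize];
    (pct(40.0) + pct(60.0)) / 2.0
}

fn main() {
    // One slow outlier barely moves the midsummary, unlike the mean.
    let timings = [1.00, 1.02, 0.98, 1.01, 0.99, 1.03, 0.97, 1.00, 5.00];
    println!("midsummary = {:.3}", midsummary(&timings)); // ~1.005
}
```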

------
chriswarbo
For those wanting to do similar tracking of benchmarks across commits, I've
found Airspeed Velocity to be quite nice (
[https://readthedocs.org/projects/asv](https://readthedocs.org/projects/asv)
). It allows (but doesn't require) benchmarks to be kept separate from the
project's repo, can track different configurations separately (e.g. using
alternative compilers, dependencies, flags, etc.), keeps results from
different machines separated, generates JSON data and HTML reports, performs
step detection to find regressions, etc.

It was intended for use with Python (virtualenv or anaconda), but I created a
plugin (
[http://chriswarbo.net/projects/nixos/asv_benchmarking.html](http://chriswarbo.net/projects/nixos/asv_benchmarking.html)
) which allows using Nix instead, so we can provide any commands/tools/build-
products we like in the benchmarking environment (so far I've used it
successfully with projects written in Racket and Haskell).

------
anp
hi! author here if you want to ask questions or (nicely pls) let me know where
I've made mistakes!

------
valarauca1
How do you determine baseline load of the test machine in order to qualify the
correctness of the benchmark?

Assuming the compiling and testing are done in the cloud, how do you ensure
the target platform (processor) doesn't change, and that you aren't being
subjected to neighbors stealing RAM bandwidth or CPU cache resources from your
VM and impacting the results?

~~~
anp
Each benchmark result is only compared against values from running on
literally the same machine, actually. I agree that good results here would be
extremely difficult to produce on virtualized infra, so I rented a few cheap
dedicated servers from Hetzner. I'm glad that I decided to pin results to a
single machine, because even between these identically binned machines from
Hetzner I saw 2-4% variance when I ran some Phoronix benches to compare.

I go into a little bit of detail on this in the talk I link to towards the
bottom of the post, here's a direct link for convenience:
[https://www.youtube.com/watch?v=gSFTbJKScU0](https://www.youtube.com/watch?v=gSFTbJKScU0).

~~~
usefulcat
A suggestion: consider using callgrind to measure performance (instructions
retired, cache misses, branch mispredictions, whatever) instead of wall clock
time. It will be much slower per run, but since it will also be precise you
shouldn't need to do multiple runs, and you should be able to run a bunch of
different benchmarks concurrently without them interfering with each other or
having anything else interfere with them.

~~~
anp
I currently do something pretty similar by using the perf subsystem in the
Linux kernel to track the behavior of each benchmark function. In my early
measurements I found concurrent benchmarking to introduce unacceptable noise
even with this measurement tool and with cgroups/cpusets used to pin the
different processes to their own cores. Instead of trying to tune the system
to account for this, I chose to build tooling for managing a single runner per
small cheap machine.

~~~
usefulcat
No such 'noise' is possible with callgrind, as it's basically simulating the
hardware. If you're using a VM it seems like you could still get variation
between different runs due to other activity on the host system.

~~~
claudius
The problem with callgrind is
([http://valgrind.org/docs/manual/cg-manual.html#branch-sim](http://valgrind.org/docs/manual/cg-manual.html#branch-sim)):

> Cachegrind simulates branch predictors intended to be typical of mainstream
> desktop/server processors of around 2004.

In other words, the data produced by Callgrind may be suitable for finding
obvious regressions, but there may still be further regressions that are only
relevant on more modern CPUs.

------
panic
The "More Like Rocket Science Rule of Software Engineering" has been WebKit
policy for a while:
[https://web.archive.org/web/20061011203328/http://webkit.org...](https://web.archive.org/web/20061011203328/http://webkit.org/projects/performance/index.html)
(now at [https://webkit.org/performance/](https://webkit.org/performance/)).

~~~
twtw
> Common excuses people give when they regress performance are, “But the new
> way is cleaner!” or “The new way is more correct.” We don’t care. No
> performance regressions are allowed, regardless of the reason. There is no
> justification for regressing performance. None.

This seems a bit extreme. Would they accept a regression to fix a critical
security vulnerability? Code can be infinitely fast if it need not be correct.

~~~
bloomer
A better version...

Common excuses people give when introducing security vulnerabilities are, “but
the new way is faster” or “the new way is more clever”. We don’t care. No
security vulnerabilities are allowed, regardless of the reason. There is no
justification for security vulnerabilities. None.

I love that “correct” is a “justification” in the original. I would be
embarrassed to be associated with such a juvenile page. Move fast with broken
things...

------
habitue
This project looks awesome, but as a complete aside:

How long do we expect it to take before "automagically" completely replaces
"automatically" in English?

I am guessing less than a decade to go now

~~~
anp
I use this word the way we did when I worked as a PC technician and help
desker, where there's a lot of automation but then we sneak a bit of manual
labor in to make it actually useful. Like how user accounts would be
maintained in the correct state automagically.

~~~
LukeShu
automagically: /aw·toh·maj´i·klee/, adv. Automatically, but in a way that, for
some reason (typically because it is too complicated, or too ugly, or perhaps
even too trivial), the speaker doesn't feel like explaining to you.

[http://www.catb.org/jargon/html/A/automagically.html](http://www.catb.org/jargon/html/A/automagically.html)

~~~
anp
Well this interpretation certainly explains some of the reactions that word
has gotten in my use of it!

------
hsivonen
Very nice!

Do you track opt_level=2 (the Firefox Rust opt level) in addition to the
default opt_level=3?

~~~
anp
Thanks!

Not yet, but I am tracking this as a desired feature:
[https://github.com/anp/lolbench/issues/9](https://github.com/anp/lolbench/issues/9).
The benchmark plan generation, storage keys, and results presentation will at
a minimum need to be extended to support a matrix of inputs to each benchmark
function. Right now there are a number of implicit assumptions that each
benchmark function is tracked as a single series of results.

------
thsowers
This is really cool, love the project and the writeup! I regularly use nightly
(I work with Rocket) and I had always wondered about this. Thank you!

------
Twirrim
Can I suggest you consider putting
[https://github.com/anp/lolbench/issues/1](https://github.com/anp/lolbench/issues/1)
into the README.md file, so people can easily see where to look for some TODO
items?

~~~
dikaiosune
great idea!

------
awake
Is there any equivalent project for Java?

~~~
dralley
Java has a JIT, right? Seems more difficult to get consistent results.

~~~
anothergoogler
You can benchmark reliably by warming up the JVM and disabling garbage
collection.

[https://www.ibm.com/developerworks/java/library/j-benchmark1...](https://www.ibm.com/developerworks/java/library/j-benchmark1/index.html)

[https://www.ibm.com/developerworks/library/j-benchmark2/inde...](https://www.ibm.com/developerworks/library/j-benchmark2/index.html)

[http://www.ellipticgroup.com/html/benchmarkingArticle.html](http://www.ellipticgroup.com/html/benchmarkingArticle.html)

Java now ships with microbenchmarking helpers:

[http://openjdk.java.net/projects/code-tools/jmh/](http://openjdk.java.net/projects/code-tools/jmh/)

~~~
LukeShu
[https://tratt.net/laurie/blog/entries/why_arent_more_users_m...](https://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_1.html)

