
How to get consistent results when benchmarking on Linux? - walterbell
https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux
======
ltratt
Benchmarking is hard -- I've spent far more time investigating it than I ever
wanted to. Much of the advice in this article matches the advice I give when
people are foolish enough to ask me (though CPU pinning can have very odd
results for systems which expect to use all of a CPU's cores). However, there
are two related points where I respectfully disagree with the author:
disabling ASLR is probably not a good idea for most people; and using minimum
values is almost never a good idea. Ultimately, the traditional idea that
programs have a single performance number doesn't generally hold true on
modern systems: one needs to accept that programs have a set of performance
numbers. The Stabilizer paper [1] does a very good job of articulating this,
IMHO, and does so particularly in the context of ASLR. As soon as you accept
the idea that programs don't have a single performance point, it then becomes
fairly clear that the best way to get a good idea of a program's performance
is to run it as many times as possible and compute some basic statistics over
that. ASLR is one aspect where simply running a program enough times gives you
some idea of the overall performance of the program. Some modern benchmarking
tools make calculating reasonable statistics fairly painless (e.g. things like
Criterion for Rust). But, overall, this is still a painful area and there's a
lot of work we could do to make life easier for people measuring software
IMHO.

[1] http://www.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf
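
A minimal sketch of that approach (./bench, the run count of 30, and
times.txt are all placeholders; it assumes ./bench is quiet on stderr), using
GNU time to collect wall-clock times and awk for the summary statistics:

    # Collect one wall-clock measurement per run (GNU time prints to stderr).
    for i in $(seq 1 30); do
        /usr/bin/time -f "%e" ./bench 2>> times.txt
    done
    # Report mean and standard deviation over all runs.
    awk '{ s += $1; q += $1 * $1 }
         END { m = s / NR
               printf "mean=%.4fs sd=%.4fs over %d runs\n", m, sqrt(q / NR - m * m), NR }' times.txt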

~~~
rrss
> using minimum values is almost never a good idea

Care to explain why? I find the argument in
http://blog.kevmod.com/2016/06/benchmarking-minimum-vs-average/ for using the
minimum to be pretty compelling.

Also, FWIW, Julia's automated performance tracking determines likely
regressions based on the minimum. Their reasoning is explained in
http://math.mit.edu/~edelman/publications/robust_benchmarking.pdf.

~~~
ltratt
In my experience people think/hope that the minimum tells them the performance
of their program when all noise is removed. However, this is based on the
assumption that programs have a single performance point which, as the
Stabilizer paper shows, is rarely the case on modern systems. Rather, what the
minimum tells you is the behaviour of your program at one given point in the
non-deterministic performance space. This might seem like a minor, perhaps
even a pedantic, difference, but I have found it to be a profound observation.

Imagine, for example, your program has 100 possible ASLR states: in 99 of
those states it takes 1s to run and in the 1 remaining state it takes 0.8s to
run. Using the minimum, you will think your program takes 0.8s to run even
though there's only a 1% chance of observing that performance in practice.
That's bad, IMHO, but the problem compounds when you compare different
versions of the same program. Imagine that I optimise the program such that in
those 99 slow states the program now takes 0.9s to run, but in the remaining
state it still takes 0.8s. The minimum will tell me that my optimisation made
no difference (and thus should probably be removed) even though in 99% of
cases I've improved performance by 10%.
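
Making the arithmetic explicit (the numbers are just the hypothetical ones
above):

    # Expected (mean) run-time over 100 equally likely ASLR states,
    # before and after the optimisation described above.
    awk 'BEGIN {
        before = (99 * 1.0 + 1 * 0.8) / 100;  # 0.998s
        after  = (99 * 0.9 + 1 * 0.8) / 100;  # 0.899s
        printf "mean: %.3fs -> %.3fs (~10%% faster); min: 0.8s -> 0.8s (no change)\n", before, after
    }'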

~~~
manwe150
It says something like "without interference, some execution path has proven
to be as fast as x". This probably isn't exactly what you wanted to know (e.g.
the expected time), but it's likely to be much faster to compute and is
usually strongly correlated with what you wanted.

The problem with the existence of 100 independent states is that you need a
very large number of trials to get a performance measurement of each -- and
you're still left with the problem of which statistic to reduce each one to
(min? mean? and how?). Attempting to take the mean and rely on the CLT should
eventually work, but it might need a very long time. Instead, the minimum lets
you pick whichever mode was fastest and compare the time of that mode. Sure,
that might not be great for all cases, but if that's the worst problem with
the benchmark suite, I'd say you're in pretty good shape. I've run into issues
where the CPU appears to have memorized the random number sequence in the test
data -- how are you supposed to pick the better algorithm when the CPU won't
let you run the 99% case under a benchmarking harness...

Yeah, the minimum isn't perfect, but it's at least pretty clear what it says,
as long as there isn't an incentive to abuse it (I might say the same about
p-values vs Bayesian methods).

disclosure: worked with the author of the Julia paper cited above.

~~~
SomeOldThrow
> Yeah, minimum isn't perfect, but it's at least pretty clear what it says

You’ve lost me. What is the single, obvious meaning of the minimum
measurement, and what are you benchmarking against, and what do you intend to
get out of the benchmark? Typically the minimum just means “program ran x
input as fast as y”, which doesn’t actually help with most questions about
performance.

If x is constant, and time is the only thing you care about, then maybe that
works.

~~~
jcelerier
> which doesn’t actually help with most questions about performance.

well, from a marketing standpoint it means that you are able to write

OBSERVED MAXIMUM SPEED : 789634 gigabrouzoufs per second on a core i5-xxxx

on your brochure, and you wouldn't be lying, and it would be better than the
majority of products, which don't even give an observed speed but just a
theoretical one.

------
kccqzy
When Intel benchmarks the performance of certain instructions, they compile
the code into a kernel module and run it with interrupts turned off [0].
Even then there are additional considerations like OoO execution, pipelining,
etc. This could be applicable in some narrow scenarios but for most it's
overkill. So the question is, how much of this consistency is really needed?

[0]: See sample code in
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

~~~
amluto
I’m a bit afraid to mention this, but you can do iopl(3) and then use CLI and
STI from user code. I try to keep this from breaking the kernel _too_ badly.

------
ThrowawayR2
Good information but it rather depends on what the goal of the benchmarking
is. Many (most?) of those changes aim for consistency at the expense of making
the benchmark server vastly different from the configuration of a normal end-
user server, raising the question of how to translate the benchmark results
into usable information for the end-user.

~~~
dendibakh
Yep. And I think I stated that in the beginning of the article. :)

This one:

> It is important that you understand one thing before we start. If you use
> all the advice in this article, it is not how your application will run in
> practice. If you want to compare two different versions of the same program,
> you should use the suggestions described above. If you want to get absolute
> numbers to understand how your app will behave in the field, you should not
> do any artificial tuning of the system, as the client might have default
> settings.

------
qznc
Temci is a tool which tries to provide such techniques and more in an easy
way: https://github.com/parttimenerd/temci

~~~
parttimenerd
(I'm the author of temci.) After you have installed temci, you can just open a
shell via `temci short shell --sudo` and get an environment that implements
the guidelines mentioned in the article.

------
sfink
A couple of minor typos and nits:

The for loop body should use > $i

Inconsistent use of redirection (which requires a root shell) and sudo
<command> | tee, which doesn't.

I think the S in SMT is for _symmetric_

And a comment - I liked the mention of mean vs minimum, though I would have
liked the median to be included as well, along with more info on when and why
each may be more appropriate.

Minimum is good when you want to answer "how much time is needed to do just
what this code is trying to accomplish". The minimum is the closest
approximation you can get; the "true time required" can never be more than the
minimum. Mean is good for something you'll be doing a lot of and want to
predict how much total time it'll take. Median gives you a realistic measure
of how long a task will take; if you tell someone that phase X will take 3
seconds, and 3 is the median, then you'll be too high half the time and too
low the other half, which is better than reporting the minimum, which will be
too low nearly 100% of the time.

But then again, your article is more about eliminating noise when measuring
the effects of changes, so minimum is probably generally the most reasonable
thing to be looking at. (Though if it's a large phase with many constituent
sources of variance, you have to be careful because your minimum will keep
shrinking the more repetitions you do!)
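
For what it's worth, all three statistics are cheap to compute from a file
with one timing per line (times.txt is a placeholder):

    sort -n times.txt | awk '
        { a[NR] = $1; s += $1 }
        END {
            med = (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2
            printf "min=%.4f mean=%.4f median=%.4f (n=%d)\n", a[1], s / NR, med, NR
        }'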

~~~
mh-
>The minimum is the closest approximation you can get; the "true time
required" can never be more than the minimum.

I think you meant to say "never be _less_ than the minimum." Agree about its
utility here.

~~~
lixtra
If the performance bounces around, how do you know you hit the _true_ minimum?
Therefore the true time can be less than the observed minimum.

On the other hand, the true time required fits within the observed minimum, so
it cannot be more than it.

------
parttimenerd
Benchmarking is hard, and the best combination of configuration settings for
one benchmark, measuring one property (e.g. run-time or branch misses), might
not be useful for another benchmark.

If we look, for example, at disabling turbo boost (the first point on the list
in the article), there is some evidence that this might increase the variance
of benchmarks that measure the number of cache references. This might also be
the case for dropping the file system caches.

The same can be seen for disabling address space randomization: it seems to
increase the variability when measuring wall-clock time but to decrease the
variability when measuring the number of branch misses.

Furthermore, setting the process priority or the CPU governor might not have
any effect at all. Setting the process priority might even increase the
variability of your measurements, due to e.g. priority inversion.

Setting the CPU affinity, preferably using CPU sets
(http://man7.org/linux/man-pages/man7/cpuset.7.html), seems to be helpful most
of the time, but it only really pays off if your benchmarking system is
running other resource-intensive processes.
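
A quick way to experiment with this, without setting up full CPU sets, is
plain taskset; a rough sketch (the core numbers and ./bench are arbitrary):

    # Pin the benchmark to cores 2-3 so the scheduler cannot migrate it.
    taskset -c 2-3 ./bench

    # CPU sets go further: they can also keep *other* processes off those
    # cores. With the cset front-end (if installed), something like:
    #   cset shield --cpu 2-3 --kthread on
    #   cset shield --exec -- ./bench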

On a busy system disabling turbo boost might increase the variability of your
benchmarks too.

For the evidence: I ran the lean4 benchmarking suite with different
environment configurations multiple times and looked at the standard
deviation; a presentation covering the results can be found at
https://pp.ipd.kit.edu/~bechberger/2m.pdf.

Overall the best environment setup seems to really depend on the state of your
benchmarking system, the benchmarks themselves and the properties you focus
on.

P.S.: Three configurations that are not in the article but can be helpful:
preheating your CPU with a CPU-bound task (the thermal capacity of CPUs…),
letting your system rest quietly between program runs, and syncing the file
system and disabling swap; see the sketch below.

P.P.S.: An automatic environment configuration selector is currently being
developed for temci (https://temci.readthedocs.io/en/latest/, I'm its author);
this will help tailor the configuration to your setup and benchmarks.
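
A rough shell sketch of the P.S. suggestions (the warm-up command, the
durations and ./bench are all arbitrary choices):

    # Preheat the CPU with a CPU-bound task for a minute.
    timeout 60 sha256sum /dev/zero > /dev/null
    sync              # flush dirty pages to disk
    sudo swapoff -a   # take swap out of the picture
    sleep 5           # let the system rest quietly before the run
    ./bench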

~~~
dendibakh
Thanks for the great comment. Where can I read more about the impact of each
item on variability?

------
viraptor
> Drop file system cache

It's mentioned in the text, but... This should be "consider the impact of fs
cache". Sometimes you want a cold run, sometimes you want to explicitly get a
warm run. The only case you never want is mixing them up.
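
A sketch of both variants (./bench and input.dat are placeholders):

    # Cold run: flush dirty pages, then drop page cache, dentries and inodes.
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
    ./bench input.dat

    # Warm run: one untimed pass to populate the cache, then measure.
    ./bench input.dat > /dev/null
    ./bench input.dat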

------
nqzero
1, 3, 6 and 8 are generally good

for the rest, a lot depends on what your goal in doing the benchmarking is and
what your target and testing machines are. if you're developing software to
run in server environments under load then the rest don't make sense to me. 5
means you may be able to continue to use your machine for other light tasks
and still get semi-reasonable numbers, so it's a maybe

one addition. if your tests are relatively short (and shorter is better in
that it allows you to iterate quicker), they won't be run under thermal steady
state. fixing the cpu freq, or setting a conservative max, will help
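
for example, with cpupower (the frequency values are illustrative and
hardware-dependent):

    # Pin min and max to the same value so short runs see a steady clock...
    sudo cpupower frequency-set -d 2000MHz -u 2000MHz

    # ...or just cap the maximum below the turbo range.
    sudo cpupower frequency-set -u 2000MHz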

~~~
dendibakh
Thanks for the comment. As answered to the previous comment, yes, I agree.
Likely I did a poor job of explicitly saying that those advises are not
generally applicable. It usually makes sense to have a dedicated machine (pool
of machines) that a properly configured to run perf tests. And nothing else is
executed on them.

------
shmerl
Hyper-threading is Intel's marketing term; simultaneous multi-threading (SMT)
is the general one (the article uses both in the same section).

------
rurban
He forgot the most trivial one of all: check your load. If it's higher than x,
don't continue. That's what I do in my bench scripts.

x being somewhere between 0.1 and 1, depending on the accuracy you need.
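
A minimal version of that check (the threshold and ./bench are placeholders):

    # Refuse to benchmark if the 1-minute load average exceeds the threshold.
    max_load=0.5
    load=$(cut -d' ' -f1 /proc/loadavg)
    if awk -v l="$load" -v m="$max_load" 'BEGIN { exit !(l + 0 > m + 0) }'; then
        echo "load $load > $max_load, refusing to benchmark" >&2
        exit 1
    fi
    ./bench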

------
Fnoord
How is Phoronix Test Suite these days? Back when it was announced, people
looked at it and grilled it (IIRC at Slashdot). No idea how it is these days.

------
grapeli23
All these proposals have their justification, but they are surprisingly
incomprehensible complications. Isn't it easier to use

cpupower frequency-set -g performance

and the intel-speed-select tool?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/power/x86/intel-speed-select/isst-config.c

~~~
fourier_mode
Why did this get downvoted?

------
vthriller
re: ASLR, there's probably no need to disable it globally, since you can do
that on a per-process basis with `setarch -R`
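
For example (./bench is a placeholder):

    # Disable address space randomization for this one process, leaving the
    # system-wide /proc/sys/kernel/randomize_va_space setting untouched.
    setarch "$(uname -m)" -R ./bench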

------
d33
How does disabling ASLR help performance?

~~~
mehrdadn
It doesn't help performance. It helps you get consistent results.

~~~
stordoff
If it makes you get less consistent results, doesn't that imply that in some
cases it hurts performance?

~~~
mehrdadn
Sure.

------
mistrial9
in a recent set of math-related dot-so performance benchmarks on Linux, the
following was included:

for more consistent testing results, test-app builds here assume the x86_64
environment and explicitly enable Intel AVX support

CXXFLAGS -march=sandybridge -mtune=sandybridge

see AVX
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
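
In a build that would look something like this (the file names are made up):

    # Fixing -march/-mtune makes every build emit the same instruction mix,
    # so results stay comparable across AVX-capable x86_64 machines.
    g++ -O2 -march=sandybridge -mtune=sandybridge -o bench bench.cpp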

