Hacker News
How to get consistent results when benchmarking on Linux? (easyperf.net)
108 points by walterbell on Aug 4, 2019 | 38 comments

Benchmarking is hard -- I've spent far more time investigating it than I ever wanted to. Much of the advice in this article matches the advice I give when people are foolish enough to ask me (though CPU pinning can have very odd results for systems which expect to use all of a CPU's cores). However, there are two related points where I respectfully disagree with the author: disabling ASLR is probably not a good idea for most people; and using minimum values is almost never a good idea.

Ultimately, the traditional idea that programs have a single performance number doesn't generally hold true on modern systems: one needs to accept that programs have a set of performance numbers. The Stabilizer paper [1] does a very good job of articulating this, IMHO, and does so particularly in the context of ASLR. As soon as you accept the idea that programs don't have a single performance point, it becomes fairly clear that the best way to get a good idea of a program's performance is to run it as many times as possible and compute some basic statistics over the results. ASLR is one aspect where simply running a program enough times gives you some idea of its overall performance. Some modern benchmarking tools make calculating reasonable statistics fairly painless (e.g. Criterion for Rust). But, overall, this is still a painful area and there's a lot of work we could do to make life easier for people measuring software, IMHO.

[1] http://www.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf
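
The "run it many times and compute basic statistics" approach can be sketched in a few lines of Python (the workload below is a stand-in for the program under test, not a real harness):

```python
import statistics
import time

def bench(fn, runs=30):
    """Time fn repeatedly and summarize the distribution of results,
    rather than collapsing everything into a single number."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
    }

# Toy workload standing in for the program under test.
stats = bench(lambda: sum(range(10_000)))
print(stats)
```

Tools like Criterion do considerably more (outlier classification, confidence intervals via bootstrapping), but reporting the whole distribution rather than one number is the core idea.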

> using minimum values is almost never a good idea

Care to explain why? I find the argument in http://blog.kevmod.com/2016/06/benchmarking-minimum-vs-avera... for using the minimum to be pretty compelling.

Also, FWIW, Julia's automated performance tracking determines likely regressions based on the minimum. Their reasoning is explained in http://math.mit.edu/~edelman/publications/robust_benchmarkin....

In my experience people think/hope that the minimum tells them the performance of their program when all noise is removed. However, this is based on the assumption that programs have a single performance point which, as the Stabilizer paper shows, is rarely the case on modern systems. Rather, what the minimum tells you is the behaviour of your program at one given point in the non-deterministic performance space. This might seem like a minor, perhaps even a pedantic, difference, but I have found it to be a profound observation.

Imagine, for example, your program has 100 possible ASLR states: in 99 of those states it takes 1s to run and in the 1 remaining state it takes 0.8s to run. Using the minimum, you will think your program takes 0.8s to run even though there's only a 1% chance of observing that performance in practice. That's bad, IMHO, but the problem compounds when you compare different versions of the same program. Imagine that I optimise the program such that in those 99 slow states the program now takes 0.9s to run, but in the remaining state it still takes 0.8s. The minimum will tell me that my optimisation made no difference (and thus should probably be removed) even though in 99% of cases I've improved performance by 10%.
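
The arithmetic is easy to check; a small sketch using the numbers from the example above:

```python
import statistics

# 100 hypothetical ASLR states: 99 slow, 1 fast.
before = [1.0] * 99 + [0.8]
after  = [0.9] * 99 + [0.8]  # the optimisation speeds up the 99 slow states

# The minimum reports "no change"...
assert min(before) == min(after) == 0.8

# ...while the mean reflects the ~10% improvement users would actually see.
print(round(statistics.mean(before), 3), round(statistics.mean(after), 3))
```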

It says something like "without interference, some execution path has proven to be as fast as x". That probably isn't exactly what you wanted to know (e.g. the expected time), but it is likely much cheaper to compute for a quantity that is usually strongly correlated with it.

The problem with the existence of 100 independent states is that you need a very large number of trials to get a performance measurement for each, and you're still left with the problem of which statistic to reduce each one to (min? mean? and how?). Attempting to take the mean and appeal to the CLT should eventually work, but it might need a very long time. Instead, the minimum lets you pick whichever mode was fastest and compare the time of that mode. Sure, that might not be great for all cases, but if that's the worst problem with your benchmark suite, I'd say you're in pretty good shape. I've run into issues where the CPU appears to have memorized the random number sequence in the test data; how are you supposed to pick the better algorithm when the CPU won't let you run the 99% case under a benchmarking harness...

Yeah, minimum isn't perfect, but it's at least pretty clear what it says, as long as there isn't incentive to abuse it (I might say the same about p-value vs bayesian).

disclosure: worked with the author of the Julia paper cited above.

> Yeah, minimum isn't perfect, but it's at least pretty clear what it says

You’ve lost me. What is the single, obvious meaning of the minimum measurement, and what are you benchmarking against, and what do you intend to get out of the benchmark? Typically the minimum just means “program ran x input as fast as y”, which doesn’t actually help with most questions about performance.

If X is constant, and time is the only thing you care about, then maybe that works.

> which doesn’t actually help with most questions about performance.

well, from a marketing standpoint it means that you are able to write

OBSERVED MAXIMUM SPEED : 789634 gigabrouzoufs per second on a core i5-xxxx

on your brochure, and you wouldn't be lying, and it would be better than the majority of products which don't even give an observed speed but just a theoretical one.

That would be a very unusual performance distribution. I've seen a lot of distributions, and they included a load of right-skewed distributions, some normal distributions, but I honestly can't think of any case with a left-skewed distribution like this one.

The problem is more-or-less as bad for a normal distribution as it is for a skewed distribution (I'd have said the distribution from my examples is right skewed, not that it makes any difference to my argument): the minimum gives you no information about the distribution. As Figure 1 of [1] (bias alert: I'm a co-author) shows, it seems that, on modern systems, you can find pretty much any distribution you want if you have enough benchmarks.

[1] https://soft-dev.org/pubs/html/barrett_bolz-tereick_killick_...

When Intel benchmarks the performance of certain instructions, they compile the code into a kernel module, and run them with interrupts turned off[0]. Even then there are additional considerations like OoO execution, pipelining, etc. This could be applicable in some narrow scenarios but for most it's overkill. So the question is, how much of this consistency is really needed?

[0]: See sample code in https://www.intel.com/content/dam/www/public/us/en/documents...

I’m a bit afraid to mention this, but you can do iopl(3) and then use CLI and STI from user code. I try to keep this from breaking the kernel too badly.

Good information but it rather depends on what the goal of the benchmarking is. Many (most?) of those changes aim for consistency at the expense of making the benchmark server vastly different from the configuration of a normal end-user server, raising the question of how to translate the benchmark results into usable information for the end-user.

Yep. And I think I stated that in the beginning of the article. :)

This one:

> It is important that you understand one thing before we start. If you use all the advice in this article, it is not how your application will run in practice. If you want to compare two different versions of the same program, you should use the suggestions described above. If you want to get absolute numbers to understand how your app will behave in the field, you should not do any artificial tuning of the system, as the client might have default settings.

Temci is a tool which tries to provide such techniques and more in an easy way: https://github.com/parttimenerd/temci

(I'm the author of temci.) After you have installed temci, you can just open a shell via `temci short shell --sudo` and get a shell in an environment that implements the guidelines mentioned in the article.

A couple of minor typos and nits:

The for loop body should use > $i

Inconsistent use of redirection (which requires a root shell) and sudo <command> | tee, which doesn't.

I think the S in SMT is for symmetric

And a comment - I liked the mention of mean vs minimum, though I would have liked the median to be included as well, and more info on when and why each may be more appropriate.

Minimum is good when you want to answer "how much time is required to do just what is required for what this code is trying to accomplish". The minimum is the closest approximation you can get; the "true time required" can never be more than the minimum.

Mean is good for something you'll be doing a lot of, and want to predict how much total time it'll take.

Median gives you a realistic measure of how long a task will take; if you tell someone that phase X will take 3 seconds, and 3 is the median, then you'll be too high half the time and too low the other half, which is better than reporting the minimum, which will be too low nearly 100% of the time.

But then again, your article is more about eliminating noise when measuring the effects of changes, so minimum is probably generally the most reasonable thing to be looking at. (Though if it's a large phase with many constituent sources of variance, you have to be careful, because your minimum will keep shrinking the more repetitions you do!)
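
The contrast between the three statistics is easy to see on a synthetic right-skewed sample (the numbers below are made up purely for illustration):

```python
import statistics

# Synthetic timings in seconds: mostly fast runs plus a couple of slow
# outliers, the typical right-skewed shape of benchmark noise.
samples = [1.00, 1.01, 1.02, 1.01, 1.00, 1.03, 1.01, 1.02, 1.50, 2.10]

lo  = min(samples)                # lower bound on the "true time required"
med = statistics.median(samples)  # what a typical single run looks like
avg = statistics.mean(samples)    # what to use when predicting total time

# The outliers drag the mean well above the median; the minimum ignores them.
print(f"min={lo:.3f} median={med:.3f} mean={avg:.3f}")
```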

Minor nitpick: I think you meant to write "<command> | sudo tee".

And a question on the comment - in general (I don't have a lot of benchmarking experience; I did a few benchmarks and used the lower quartile to reduce the impact of interference), is it hard to determine that the minimum is no longer shrinking? I suppose we could run a benchmark until we have a certain number (say 3, but it will depend on the distribution) of relatively close minimum values? That seems like a good way to have a good probability of obtaining the "true time required" you mentioned.
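
One way to implement that stopping rule is to keep sampling until the running minimum has stopped improving for a while. A sketch (the function name, window size and tolerance below are all invented for illustration):

```python
import random

def stable_min(sample_fn, window=5, tol=1e-3, max_runs=1000):
    """Sample until the running minimum has not improved by more than
    `tol` for `window` consecutive runs, then return it."""
    best = float("inf")
    stale = 0
    for runs in range(1, max_runs + 1):
        t = sample_fn()
        if t < best - tol:
            best, stale = t, 0
        else:
            stale += 1
            if stale >= window:
                break
    return best, runs

# Synthetic "benchmark": true cost 1.0 plus non-negative noise.
rng = random.Random(42)
best, runs = stable_min(lambda: 1.0 + abs(rng.gauss(0, 0.05)))
print(best, runs)
```

Note the caveat from the comment above: with many constituent sources of variance, `tol` and `window` trade off run time against how close you get to the true minimum.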

> I think the S in SMT is for symmetric

It's definitely for simultaneous. You are probably thinking of SMP (symmetric multiprocessing).

>The minimum is the closest approximation you can get; the "true time required" can never be more than the minimum.

I think you meant to say "never be less than the minimum." Agree about its utility here.

If the performance bounces around, how do you know you hit the true minimum? Therefore the true time can be less than the observed minimum.

On the other hand, the true time required is contained within the observed minimum, so it cannot be more than it.

Benchmarking is hard, and the best combination of configuration settings for one benchmark, measuring one property (e.g. run-time or branch misses), might not be useful for another benchmark.

If we look, for example, at disabling turbo boost (the first point on the list in the article), there is some evidence that this might increase the variance of benchmarks that measure the number of cache references. This might also be the case for dropping the file system caches.

The same can be seen for disabling address space randomization: it seems to increase the variability when measuring wall-clock time but to decrease it when measuring the number of branch misses.

Furthermore, setting the process priority or the CPU governor might not have any effect at all. Setting the process priority might even increase the variability of your measurements, due to e.g. priority inversion.

Setting the CPU affinity, preferably using cpusets (http://man7.org/linux/man-pages/man7/cpuset.7.html), seems to be helpful most of the time, but it only really pays off if your benchmarking system is running other resource-intensive processes.
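
On Linux you can also pin from inside the process; for example, Python exposes the underlying sched_setaffinity(2) call (the CPU number here is illustrative, and cpusets give stronger isolation than plain affinity):

```python
import os

# Restrict this process to CPU 0 so the scheduler cannot migrate it
# between cores mid-benchmark (a common source of timing noise).
os.sched_setaffinity(0, {0})   # pid 0 = the calling process
print(os.sched_getaffinity(0))
```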

On a busy system disabling turbo boost might increase the variability of your benchmarks too.

For the evidence: I ran the lean4 benchmarking suite with different environment configurations multiple times and looked at the standard deviation, a presentation covering the results can be found at https://pp.ipd.kit.edu/~bechberger/2m.pdf.

Overall the best environment setup seems to really depend on the state of your benchmarking system, the benchmarks themselves and the properties you focus on.

P.S.: A few configurations that are not in the article but can be helpful: preheating your CPU with a CPU-bound task (CPUs have thermal capacity), letting your system rest quietly between program runs, syncing the file system, and disabling swap.

P.P.S.: An automatic environment configuration selector is currently being developed for temci (https://temci.readthedocs.io/en/latest/, I'm its author); it will help tailor the configuration to your setup and benchmarks.

Thanks for the great comment. Where can I read more about the impact of each item on variability?

> Drop file system cache

It's mentioned in the text, but... This should be "consider the impact of fs cache". Sometimes you want a cold run, sometimes you want to explicitly get a warm run. The only case you never want is mixing them up.

1, 3, 6 and 8 are generally good

for the rest, a lot depends on what your goal in doing the benchmarking is and what your target and testing machines are. if you're developing software to run in server environments under load then the rest don't make sense to me. 5 means you may be able to continue to use your machine for other light tasks and still get semi-reasonable numbers, so it's a maybe

one addition. if your tests are relatively short (and shorter is better in that it allows you to iterate quicker), they won't be run under thermal steady state. fixing the cpu freq, or setting a conservative max, will help

Thanks for the comment. As I answered on the previous comment: yes, I agree. I likely did a poor job of explicitly saying that this advice is not generally applicable. It usually makes sense to have a dedicated machine (or pool of machines) that is properly configured to run perf tests, with nothing else executed on it.

Hyper threading is Intel's marketing term, simultaneous multi-threading (SMT) is the general one (the article uses both in the same section).

He forgot the most trivial of all: Check your load. If higher than x don't continue. That's what I do in my bench scripts.

x being something between 0.1 and 1, depending on your needed accuracy.
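
The check itself is a one-liner on Linux; a sketch (the threshold below is illustrative):

```python
import os

MAX_LOAD = 0.5  # illustrative threshold; tune to your needed accuracy

load1, load5, load15 = os.getloadavg()  # 1-, 5- and 15-minute load averages
too_busy = load1 > MAX_LOAD
print(f"load={load1:.2f} -> {'skip benchmark' if too_busy else 'ok to run'}")
```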

How is Phoronix Test Suite these days? Back when it was announced, people looked at it and grilled it (IIRC at Slashdot). No idea how it is these days.

All these proposals have their justification, but some seem like needlessly complicated workarounds. Isn't it easier to use

cpupower frequency-set -g performance

and the intel-speed-select tool? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

Why did this get downvoted?

re: ASLR, there's probably no need to disable it globally, since you can do that on a per-process basis with `setarch -R`

How does disabling ASLR help performance?

ASLR has the same problem as the size of the environment (ENV). It can lead to certain locations being aligned or misaligned, causing significant variations at run time. He forgot to mention ENV changes as well.

Check the size of env with wc -c, and ensure it's the same size on later runs.
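
The same check from inside a program, roughly what `env | wc -c` measures (sketch):

```python
import os

# Approximate size of the environment block: each "KEY=VALUE" entry
# plus a terminating byte, roughly what `env | wc -c` reports.
env_bytes = sum(len(k) + 1 + len(v) + 1 for k, v in os.environ.items())
print(env_bytes)
```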

It doesn't help performance. It helps you get consistent results.

If it makes you get less consistent results, doesn't that imply that in some cases it hurts performance?


if the target production environment doesn't disable it, then your performance tests also shouldn't

Depends what you're trying to measure.

In a recent set of math-related .so (shared library) performance benchmarks on Linux, the following was included:

for more consistent testing results, test-app builds here assume the x86_64 environment and explicitly enable Intel AVX support

CXXFLAGS -march=sandybridge -mtune=sandybridge

see AVX https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
