Care to explain why? I find the argument in http://blog.kevmod.com/2016/06/benchmarking-minimum-vs-avera... for using the minimum to be pretty compelling.
Also, FWIW, Julia's automated performance tracking determines likely regressions based on the minimum. Their reasoning is explained in http://math.mit.edu/~edelman/publications/robust_benchmarkin....
Imagine, for example, that your program has 100 possible ASLR states: in 99 of those states it takes 1s to run and in the 1 remaining state it takes 0.8s. Using the minimum, you will conclude your program takes 0.8s to run even though there's only a 1% chance of observing that performance in practice.

That's bad, IMHO, but the problem compounds when you compare different versions of the same program. Imagine that I optimise the program such that in those 99 slow states it now takes 0.9s to run, but in the remaining state it still takes 0.8s. The minimum will tell me that my optimisation made no difference (and thus should probably be removed) even though in 99% of cases I've improved performance by 10%.
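To make that concrete, here's a throwaway sketch of the same made-up numbers (one sample per ASLR state, 100 states):

    # toy model of the 100-state example: state 0 is the lucky fast one
    awk 'BEGIN {
      for (s = 0; s < 100; s++) {
        old = (s == 0) ? 0.8 : 1.0   # before the optimisation
        new = (s == 0) ? 0.8 : 0.9   # after the optimisation
        old_sum += old; new_sum += new
        if (s == 0 || old < old_min) old_min = old
        if (s == 0 || new < new_min) new_min = new
      }
      printf "before: min=%.2fs  mean=%.3fs\n", old_min, old_sum / 100
      printf "after : min=%.2fs  mean=%.3fs\n", new_min, new_sum / 100
    }'
    # before: min=0.80s  mean=0.998s
    # after : min=0.80s  mean=0.899s

The minimum is identical before and after; only the mean shows the ~10% improvement.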
The problem with the existence of 100 independent states is that you need a very large number of trials to get a performance measurement for each, and you're still left with the problem of which statistic to reduce each one to (min? mean? and how?). Attempting to take the mean and lean on the CLT should eventually work, but it might need a very long time. Instead, the minimum lets you pick whichever mode was fastest and compare the time of that mode. Sure, that might not be great for all cases, but if that's the worst problem with the benchmark suite, I'd say you're in pretty good shape. I've run into issues where the CPU appears to have memorized the random number sequence in the test data: how are you supposed to pick the better algorithm when the CPU won't let you run the 99% case under a benchmarking harness...
Yeah, the minimum isn't perfect, but at least it's pretty clear what it says, as long as there isn't an incentive to abuse it (I might say the same about p-values vs Bayesian methods).
Disclosure: I worked with the author of the Julia paper cited above.
You’ve lost me. What is the single, obvious meaning of the minimum measurement, what are you benchmarking against, and what do you intend to get out of the benchmark? Typically the minimum just means "the program ran input x in as little as time y", which doesn't actually help with most questions about performance.
If X is constant, and time is the only thing you care about, then maybe that works.
well, from a marketing standpoint it means that you are able to write
OBSERVED MAXIMUM SPEED : 789634 gigabrouzoufs per second on a core i5-xxxx
on your brochure, and you wouldn't be lying, and it would be better than the majority of products, which don't even give an observed speed but just a theoretical one.
See the sample code in https://www.intel.com/content/dam/www/public/us/en/documents...
> It is important that you understand one thing before we start. If you use all the advices in this article it is not how your application will run in practice. If you want to compare two different versions of the same program you should use suggestions described above. If you want to get absolute numbers to understand how your app will behave in the field, you should not make any artificial tuning to the system, as the client might have default settings.
The for loop body should use > $i
Inconsistent use of redirection (which requires a root shell) and sudo <command> | tee, which doesn't.
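For example, if the loop in question is over the per-CPU scaling_governor files (an assumption on my part; I don't have the article in front of me), a consistent sudo/tee version would be something like:

    for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance | sudo tee "$i" > /dev/null
    done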
I think the S in SMT is for symmetric
And a comment - I liked the mention of mean vs minimum, though I would have liked the median to be included as well, along with more info on when and why each may be more appropriate.

Minimum is good when you want to answer "how much time is required to do just the work this code is trying to accomplish?". The minimum is the closest approximation you can get; the "true time required" can never be more than the minimum.

Mean is good for something you'll be doing a lot of and want to predict how much total time it'll take.

Median gives you a realistic measure of how long a task will take; if you tell someone that phase X will take 3 seconds, and 3 is the median, then you'll be too high half the time and too low the other half, which is better than reporting the minimum, which will be too low nearly 100% of the time.

But then again, your article is more about eliminating noise when measuring the effects of changes, so minimum is probably generally the most reasonable thing to be looking at. (Though if it's a large phase with many constituent sources of variance, you have to be careful, because your minimum will keep shrinking the more repetitions you do!)
And a question on the comment - in general (I don't have a lot of benchmarking experience; I did a few and used the lower quartile to reduce the impact of interference), is it hard to determine that the minimum is not shrinking anymore? I suppose we could run a benchmark until we have a certain number (say 3, but it will depend on the distribution) of relatively close minimum values? That seems like a good way to have a decent probability of obtaining the "true time required" you mentioned.
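One rough way to automate that (a sketch; ./bench is a placeholder that prints one wall-clock time in seconds per run, and the 5-run/1% thresholds are arbitrary):

    # stop once the minimum hasn't improved by more than 1% for 5 runs in a row
    best=""; stable=0
    while [ "$stable" -lt 5 ]; do
        t=$(./bench)                  # one timing sample, in seconds
        if [ -z "$best" ] || awk -v t="$t" -v b="$best" 'BEGIN { exit !(t < 0.99 * b) }'; then
            best=$t; stable=0         # noticeably better minimum: reset the counter
        else
            stable=$((stable + 1))
        fi
    done
    echo "estimated minimum: ${best}s"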
It's definitely for simultaneous. You are probably thinking of SMP (symmetric multiprocessing).
I think you meant to say "never be less than the minimum." Agree about its utility here.
On the other hand, the true time required is contained in every observed run, including the minimum (noise only ever adds time), so it cannot be more than the minimum.
If we look, for example, at disabling turbo boost (the first point on the list in the article), there is some evidence that this can increase the variance of benchmarks that measure the number of cache references. The same might be the case for dropping the file system caches.
The same can be seen for disabling address space randomization: it seems to increase the variability when measuring wall-clock time but to decrease it when measuring the number of branch misses.
Furthermore, setting the process priority or the CPU governor might not have any effect at all. Setting the process priority might even increase the variability of your measurements due to e.g. priority inversion.
Setting the CPU affinity, preferably using CPU sets (http://man7.org/linux/man-pages/man7/cpuset.7.html), seems to be helpful most of the time, but it only really pays off if your benchmarking system is running other resource-intensive processes.
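The simplest form of that is plain affinity pinning (a one-line sketch; proper CPU sets need a bit more setup, see the linked man page; ./bench and the core numbers are placeholders):

    taskset -c 2,3 ./bench    # pin the benchmark to cores 2 and 3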
On a busy system disabling turbo boost might increase the variability of your benchmarks too.
For the evidence: I ran the lean4 benchmarking suite with different environment configurations multiple times and looked at the standard deviation; a presentation covering the results can be found at https://pp.ipd.kit.edu/~bechberger/2m.pdf.
Overall the best environment setup seems to really depend on the state of your benchmarking system, the benchmarks themselves and the properties you focus on.
P.S.: A few settings that are not in the article but can be helpful: preheating your CPU with a CPU-bound task (CPUs have thermal capacity…), letting your system rest quietly between program runs, syncing the file system, and disabling swap.
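A rough sketch of those (stress-ng is just one way to generate CPU load, and the durations are placeholders):

    stress-ng --cpu "$(nproc)" --timeout 60s   # preheat: load every core for a minute
    sleep 30                                   # let the system rest quietly before the run
    sync                                       # flush dirty file system buffers to disk
    sudo swapoff -a                            # disable swap for the benchmark session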
P.P.S.: An automatic environment configuration selector is currently being developed for temci (https://temci.readthedocs.io/en/latest/, I'm its author); this will help tailor the configuration to your setup and benchmarks.
It's mentioned in the text, but... This should be "consider the impact of fs cache". Sometimes you want a cold run, sometimes you want to explicitly get a warm run. The only case you never want is mixing them up.
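For an explicitly cold run, one common approach is to flush and drop the page cache right before the measured run (root needed); for an explicitly warm run, do an untimed throwaway iteration first:

    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null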
For the rest, a lot depends on what your goal in doing the benchmarking is and what your target and testing machines are. If you're developing software to run in server environments under load, then the rest don't make sense to me. Point 5 means you may be able to continue using your machine for other light tasks and still get semi-reasonable numbers, so it's a maybe.
One addition: if your tests are relatively short (and shorter is better in that it allows you to iterate quicker), they won't be run at thermal steady state. Fixing the CPU frequency, or setting a conservative max, will help.
x being something between 0.1 and 1, depending on the accuracy you need.
cpupower frequency-set -g performance
and use the intel-speed-select tool
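A sketch combining these (the 2.0GHz value is a placeholder; pick something at or below your chip's sustained all-core frequency):

    sudo cpupower frequency-set -g performance        # performance governor
    sudo cpupower frequency-set -d 2.0GHz -u 2.0GHz   # pin min = max frequency
    echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo > /dev/null   # Intel: turbo off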
Check the size of env with wc -c, and ensure it's the same size on later runs.
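Something like:

    env | wc -c    # environment size in bytes; keep it identical across runs

The environment block sits on the initial stack, so changing its size shifts stack addresses and can perturb timings much like ASLR does.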
For more consistent testing results, the test-app builds here assume an x86_64 environment and explicitly enable Intel AVX support:
CXXFLAGS="-march=sandybridge -mtune=sandybridge"
see AVX https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
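For instance (a sketch assuming a plain make-based build that compiles C++ with $(CXXFLAGS)):

    make clean
    make CXXFLAGS="-O2 -march=sandybridge -mtune=sandybridge"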