He says, "it’s absolutely vital to report something about the distribution as a whole. In general, confidence intervals are a good technique for doing this".
But confidence intervals are a way of exhibiting uncertainty in the estimate of a parameter (such as the mean), not a way of describing a distribution. Hopefully, what he actually meant is that giving the 5% and 95% quantiles of the distribution (for example) would be useful. The 90% confidence interval for the mean would be much smaller than the range from the 5% to the 95% quantile, and would collapse to a point in the limit as you did more repetitions - not at all the desired behaviour.
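A quick sketch of the distinction, in Python, using a skewed lognormal distribution as a stand-in for run times (the particular distribution is just an assumption for illustration): the 5%-95% quantile range describes the data and stays roughly constant as n grows, while the confidence interval describes only the mean estimate and shrinks toward zero.

```python
import random
import statistics

random.seed(0)

def ci90_halfwidth(xs):
    # Normal-approximation half-width of a 90% confidence interval
    # for the mean: 1.645 * (standard error of the mean).
    return 1.645 * statistics.stdev(xs) / len(xs) ** 0.5

for n in (100, 10_000):
    # Simulated run times: lognormal, i.e. skewed like real benchmarks.
    xs = sorted(random.lognormvariate(0, 0.5) for _ in range(n))
    q05, q95 = xs[int(0.05 * n)], xs[int(0.95 * n)]
    print(f"n={n:6d}  5%-95% range={q95 - q05:.3f}  "
          f"90% CI half-width={ci90_halfwidth(xs):.4f}")
```

The quantile range hovers near the same value at both sample sizes; the CI half-width drops by about a factor of ten, exactly the collapse-to-a-point behaviour described above.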
Also, I disagree that it's always useful to report something about the whole distribution rather than just the mean. If the benchmark is for a small part of the program, its time will be combined with the times of many other parts (perhaps including repetitions of this part), so variation in any single part gets averaged out.
In this regard, he makes another mistake in a footnote: "I’ve sometimes heard it said that the central limit theorem saves us when benchmarking. In my experience, this hope is based on an oversimplification of the theorem, which is that if you run a program enough times any randomness (including noise) will converge towards a normal distribution. That is only true if the various random variables are close to being independent and identically distributed."
This again seems to confuse the distribution of the data with the distribution of an estimate for the mean. Only the latter is subject to the central limit theorem. The distribution of the run times will not approach a normal distribution regardless of how many times you run the program (unless it's normal to start with, of course), even if the run times are independent and identically distributed. This should be obvious to anyone who hasn't heard of the central limit theorem. It should also be obvious to those who have heard it and understood it.
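A small simulation makes the point concrete (Python, with an exponential distribution assumed as the skewed "run time" distribution): the data stay skewed no matter how many runs you collect; only means of repeated samples behave normally.

```python
import random
import statistics

random.seed(1)

# 100,000 "run times" drawn from a skewed (exponential) distribution.
data = [random.expovariate(1.0) for _ in range(100_000)]

# The data remain skewed regardless of sample size: for an exponential
# distribution the median sits well below the mean.
print("data mean:  ", statistics.mean(data))
print("data median:", statistics.median(data))

# But means of repeated samples of size 50 cluster tightly and
# symmetrically around the true mean -- that is what the CLT says.
means = [statistics.mean(random.choices(data, k=50)) for _ in range(2000)]
print("spread of sample means:", statistics.stdev(means))
```

The gap between mean and median never closes as the sample grows; only the distribution of the sample means tightens up (roughly as 1/sqrt(50) here).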
If 0.5% of requests take 100x the average time you want that to be as obvious as possible. With quantiles those extreme outliers are easy to miss. With confidence interval error bars the outliers become more obvious.
Further, running the test 1 billion times would not reduce those error bars, because that unusual event would occur ~5 million times.
Your comment about "confidence interval error bars" revealing outliers makes no sense. I think you still don't understand that a confidence interval (for some parameter estimate, maybe the mean) is not a description of the distribution of the data, but rather is about the distribution of an estimate obtained from the data - not at all the same thing.
Running a test 1 billion times certainly would reduce the confidence interval for the mean, even if 5 million of those runs are 100x outliers.
So, running the benchmark test again keeps n the same when calculating the confidence interval. This means the error bars show the calculated mean and the standard deviation * an arbitrary scaling factor, all lined up on an easy to read graph.
PS: A lot of statistical tools are poor fits, as the data is rarely normally distributed. But rigour is not the goal here.
If you have a better idea feel free to suggest it. Bonus points if it's easy to calculate and display.
If nothing changes from one time to another, you might as well combine all the results, and display the overall mean, or a histogram, or whatever. If you think that something may have changed from one time to another (eg, ambient temperature, which affects what clock frequency the processor may have been throttled down to), then you've got a complicated problem, if you're really trying to get to the bottom of the situation.
But once again, if you're looking at how the sets of runs at different times differ, confidence intervals do not indicate the spread of the data. Perhaps you're computing quantiles instead, which would be sensible, but then incorrectly referring to them as "confidence intervals", when they're not?
Something changes, but not necessarily something important for every code path.
A test could be to run optimization A x 1,000, then optimization B x 1,000.
So now I run the test, edit optimization B, and just run the full test again. Running A again might seem wasteful, but what if something else were to affect both A and B? If I just test B, then it's not obvious that something other than my code changed. Perhaps, as you say, the heat sink is failing and my CPU is now down-clocking.
The goal is to have a simple test that shows if I am improving things while also verifying that I am measuring what I think I am measuring. That's not to say I will actually run several long tests repeatedly, but I may re-run a test expecting similar results.
Anyway, all of the above has nothing to do with why I want a confidence interval. It's useful for getting a ballpark feeling for the underlying data.
"Benchmarks with random performance tend to mislead"
Both of these "benchmarks" have exactly the wrong amount of randomness. A little more, and you could get a useful mean out of the results.
When I advocate for the use of minimum time in benchmarking, the scenario in mind is running periodic regression testing on a set of known good benchmarks that don't have intrinsic randomness (or it is averaged out) and yet vary by 0-10% as a result of actual noise. In these cases, it is very useful to have a single number (and if that single number surprises you, then by all means take a look at the distribution).
Neither of the examples here show the impact of actual system or measurement noise, only random performance.
Linear search: 0.92 +- less than 0.0001 s
Binary search: 0.15 s (also small variance, but didn't compute it)
We can talk as long as we want about mean vs minimum vs histogram vs etc, but let's go back to the beginning: you get what you measure.
I mostly advocate for the min in the specific context of benchmarking to improve the performance of a specific piece of code, on the theory that it's the best measure to cut away all the random bits you can't control with changes to the algorithm under test. If you've got a 5x variance in performance, and you make a 10% improvement in your algorithm, min seems like one of the better ways to figure that out, but you've still got a fundamental problem that that amount of non-determinism is just too much noise to see through via any mechanism very reliably, without a lot of runs.
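For what it's worth, this min-of-repeated-batches approach is exactly what Python's standard timeit module makes easy; a minimal sketch (the benchmarked statement is just a placeholder):

```python
import timeit

# timeit.repeat runs the statement in several independent batches;
# taking the min of the batch times discards scheduler interference,
# cache warm-up, and other one-sided noise you can't control.
setup = "xs = list(range(1000))"
times = timeit.repeat("sum(xs)", setup=setup, number=1000, repeat=5)
best = min(times)
print(f"best of {len(times)} batches: {best:.6f} s for 1000 calls")
```

The rationale is that timing noise is almost entirely additive: interference only ever makes a run slower, so the minimum is the closest observation to the code's intrinsic cost.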
In terms of measuring "real" performance, yeah, min isn't a very good choice, being equivalent to just measuring the 0th percentile, which anyone doing real measuring knows isn't all that useful in the "real world". But when I'm optimizing, I use "min". I also, as others have said, try to just eliminate any variance as high as what is discussed in the article, because that's difficult-to-impossible to work with no matter how you slice it. (You can in principle always do "enough runs" to get through any random variance, but that doesn't mean you're willing to wait for 1000 runs of your minute-long computation just to see if this particular improvement was statistically significant, and then do it another fifty times.)
Your idea works for comparing machines though.