Minimum Times Tend to Mislead When Benchmarking
47 points by kristianp 63 days ago | 13 comments

 The author seems to have two statistical misconceptions (which don't, however, detract from his main points).

He says, "it’s absolutely vital to report something about the distribution as a whole. In general, confidence intervals are a good technique for doing this". But confidence intervals are a way of exhibiting uncertainty in the estimate of a parameter (such as the mean), not a way of describing a distribution. Hopefully, what he actually meant is that giving the 5% and 95% (for example) quantiles of the distribution would be useful. The 90% confidence interval for the mean would be much smaller than the range from the 5% to 95% quantiles, and would collapse to a point in the limit as you did more repetitions - not at all the desired behaviour.

Also, I disagree that it's always useful to report something about the whole distribution rather than just the mean. If the benchmark is for a small part of the program, its time will be combined with the times of many other parts (perhaps including repetitions of this part), so variation in a single part gets averaged out.

In this regard, he makes another mistake in footnote [4]: "I’ve sometimes heard it said that the central limit theorem saves us when benchmarking. In my experience, this hope is based on an over simplification of the theorem, which is that if you run a program enough times any randomness (including noise) will converge towards a normal distribution. That is only true if the various random variables are close to being independent and identically distributed."

This again seems to confuse the distribution of the data with the distribution of an estimate of the mean. Only the latter is subject to the central limit theorem. The distribution of the run times will not approach a normal distribution no matter how many times you run the program (unless it's normal to start with, of course), even if the run times are independent and identically distributed. This should be obvious to anyone who hasn't heard of the central limit theorem. It should also be obvious to those who have heard of it and understood it.
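To make the distinction concrete, here is a small, purely illustrative Python simulation (using a made-up log-normal "timing" distribution, not real measurements): the 90% confidence interval for the mean collapses as repetitions grow, while the 5%-95% quantile range of the data stays put.

```python
import math
import random
import statistics

random.seed(0)

def fake_times(n):
    # Skewed, decidedly non-normal "run times" (log-normal), sorted.
    return sorted(math.exp(random.gauss(0.0, 0.5)) for _ in range(n))

ci_width = {}
quantile_range = {}
for n in (100, 10_000):
    ts = fake_times(n)
    sem = statistics.stdev(ts) / math.sqrt(n)
    ci_width[n] = 2 * 1.645 * sem                    # ~90% CI for the mean
    quantile_range[n] = ts[int(0.95 * n)] - ts[int(0.05 * n)]
    print(f"n={n:>6}: CI width {ci_width[n]:.4f}, "
          f"5%-95% range {quantile_range[n]:.4f}")
```

The CI width shrinks roughly as 1/sqrt(n); the quantile range, a property of the distribution itself, does not.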
 I think you missed his point. If 0.5% of requests take 100x the average time, you want that to be as obvious as possible. With quintiles those extreme outliers are easy to miss. With confidence interval error bars the outliers become more obvious.

Further, running the test 1 billion times would not reduce those error bars, because that unusual event would occur ~5 million times.
 If you want to know about the 99.5% quantile, then of course you have to look at that, not the 95% quantile. (Note that "quintiles" refers to the 20%, 40%, 60%, and 80% quantiles, but you can look at whatever quantile you like.)

Your comment about "confidence interval error bars" revealing outliers makes no sense. I think you still don't understand that a confidence interval (for some parameter estimate, maybe the mean) is not a description of the distribution of the data, but rather is about the distribution of an estimate obtained from the data - not at all the same thing.

Running a test 1 billion times certainly would reduce the confidence interval for the mean, even if 5 million of those runs are 100x outliers.
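A sketch of both claims at once, using a hypothetical mixture in which 99.5% of runs take 1 unit and 0.5% take 100 units: the 95% quantile misses the outliers entirely, a 99.9% quantile catches them, and the confidence interval for the mean keeps shrinking as n grows despite the outliers.

```python
import math
import random
import statistics

random.seed(1)

results = {}
for n in (10_000, 1_000_000):
    # Hypothetical mixture: 0.5% of runs take 100x the normal time.
    ts = sorted(100.0 if random.random() < 0.005 else 1.0 for _ in range(n))
    mean = statistics.mean(ts)
    half_width = 1.96 * statistics.stdev(ts) / math.sqrt(n)  # ~95% CI half-width
    results[n] = (half_width, ts[int(0.95 * n)], ts[int(0.999 * n)])
    print(f"n={n}: mean={mean:.3f} +- {half_width:.3f}, "
          f"95%q={ts[int(0.95 * n)]}, 99.9%q={ts[int(0.999 * n)]}")
```

The error bars keep narrowing around the mean (~1.5 units here) no matter how many outliers are in the sample; only the high quantiles reveal them.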
 I think I figured out the confusion. Benchmarking is a comparison, and each test uses a fixed sample size. A test might collect 100 or 100,000 samples by running the task repeatedly, but the data is not aggregated across multiple tests, as the data is generally not saved.

So, running the benchmark test again keeps n the same when calculating the confidence interval. This means the error bars show the calculated mean and the standard deviation times an arbitrary scaling factor, all lined up on an easy-to-read graph.

PS: A lot of statistical tools are poor fits here, as the data is rarely normally distributed. But rigor is not the goal here.
 You're still not getting it. Confidence intervals are not what you want to look at. Neither are error bars. They would be relevant if you wanted to know the mean performance, and wondered how accurately you had estimated this mean. But as far as I can tell, that's not what you want to know.
 It's not about how accurately you've estimated the mean, but about how representative the mean is of what's going on.

If you have a better idea, feel free to suggest it. Bonus points if it's easy to calculate and display.
 I'm not sure I understand what you're suggesting, but maybe you're thinking that you will run the benchmark for some number of repetitions, then at some later time run it again for that number of repetitions, and so forth, some number of times?

If nothing changes from one time to another, you might as well combine all the results and display the overall mean, or a histogram, or whatever. If you think that something may have changed from one time to another (e.g., ambient temperature, which affects what clock frequency the processor may have been throttled down to), then you've got a complicated problem, if you're really trying to get to the bottom of the situation.

But once again, if you're looking at how the sets of runs at different times differ, confidence intervals do not indicate the spread of the data. Perhaps you're computing quantiles instead, which would be sensible, but then incorrectly referring to them as "confidence intervals", when they're not?
 > If nothing changes

Something changes, but not necessarily something important for every code path.

A test could be to run optimization A x 1,000, then optimization B x 1,000. So now I run the test, edit optimization B, and just run the full test again. Running A again might seem wasteful, but what if something else were to affect both A and B? If I only test B, it's not obvious that something other than my code changed. Perhaps, as you say, the heat sink is failing and my CPU is now down-clocking.

The goal is to have a simple test that shows whether I am improving things while also confirming that I am measuring what I think I am measuring. That's not to say I will actually run several long tests repeatedly, but I may re-run a test expecting similar results.

Anyway, all of the above has nothing to do with why I want a confidence interval. It's useful for getting a ballpark feeling for the underlying data.
 I'm pretty sure I was part of the "resurgence in people advocating for the use of the minimum time recently," and I'll tell you right now that I did not have a "benchmark" that varied by 7x from run to run in mind.

"Benchmarks with random performance tend to mislead"

Both of these "benchmarks" have exactly the wrong amount of randomness. A little more, and you could get a useful mean out of the results.

When I advocate for the use of minimum time in benchmarking, the scenario I have in mind is running periodic regression testing on a set of known-good benchmarks that don't have intrinsic randomness (or it is averaged out) and yet vary by 0-10% as a result of actual noise. In these cases, it is very useful to have a single number (and if that single number surprises you, then by all means take a look at the distribution).

Neither of the examples here shows the impact of actual system or measurement noise, only random performance.
 I went ahead and modified the author's second benchmark (used to compare linear vs binary search) to iterate over various choices of words, rather than just searching for one word 10,000 times (which may be the first word on one run and the last on another).

Linear search: 0.92 +- less than 0.0001 s
Binary search: 0.15 s (also small variance, but I didn't compute it)

We can talk as long as we want about mean vs minimum vs histogram vs etc., but let's go back to the beginning: you get what you measure.
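The shape of that modified benchmark is roughly the following (a hypothetical Python reconstruction with a synthetic word list, not the author's actual harness):

```python
import bisect
import time

# Synthetic stand-in for the word list; the real benchmark reads a dictionary.
words = sorted(f"word{i:05d}" for i in range(10_000))
targets = words[::7]                 # many targets spread across the list

def linear_search(ws, w):
    for i, x in enumerate(ws):
        if x == w:
            return i
    return -1

def binary_search(ws, w):
    i = bisect.bisect_left(ws, w)
    return i if i < len(ws) and ws[i] == w else -1

elapsed = {}
for name, fn in (("linear", linear_search), ("binary", binary_search)):
    t0 = time.perf_counter()
    for w in targets:
        fn(words, w)
    elapsed[name] = time.perf_counter() - t0
    print(f"{name} search: {elapsed[name]:.4f}s")
```

Averaging over many target positions removes the dependence on where a single fixed word happens to sit in the list, which is exactly the variance the original harness was measuring.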
 I kinda think this article proves too much; similar arguments could be made against any statistical simplification of performance. It's not that it's "wrong", it's that it's "too right".

I mostly advocate for the min in the specific context of benchmarking to improve the performance of a specific piece of code, on the theory that it's the best measure for cutting away all the random bits you can't control with changes to the algorithm under test. If you've got a 5x variance in performance and you make a 10% improvement in your algorithm, min seems like one of the better ways to detect that, but you've still got a fundamental problem: that amount of non-determinism is just too much noise to see through reliably via any mechanism without a lot of runs.

In terms of measuring "real" performance, yeah, min isn't a very good choice, being equivalent to just measuring the 0th percentile, which anyone doing real measuring knows isn't all that useful in the "real world". But when I'm optimizing, I use "min". I also, as others have said, try to just eliminate any variance as high as what is discussed in the article, because that's difficult-to-impossible to work with no matter how you slice it. (You can in principle always do "enough runs" to get through any random variance, but that doesn't mean you're willing to wait for 1000 runs of your minute-long computation just to see if this particular improvement was statistically significant, and then do it another fifty times.)
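A toy illustration of that theory, with made-up numbers: if noise is strictly additive (interference only ever slows a run down), the min converges on the true cost and exposes a 10% improvement that a noise-inflated mean would blur.

```python
import random

random.seed(2)

def noisy_run(true_cost):
    # Noise only adds time: OS interruptions, cache pollution, etc.
    return true_cost * (1.0 + random.expovariate(10.0))

before = [noisy_run(1.00) for _ in range(50)]
after = [noisy_run(0.90) for _ in range(50)]   # a genuine 10% improvement

print(f"min:  {min(before):.4f} -> {min(after):.4f}")
print(f"mean: {sum(before) / 50:.4f} -> {sum(after) / 50:.4f}")
```

The assumption that noise is one-sided is doing all the work here; with the 5x bimodal variance from the article, no single summary statistic rescues you.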
 I don't know. When I see that my benchmark has nondeterministic performance, I consider that a bug and just remove the nondeterminism. For example, if Rust's hashmaps have a random seed, I'd find a way to initialize the hashmap with the same seed every time I run my benchmark. But I guess you can model the performance as a random variable instead, in which case you'll be benchmarking the HashMap and not your code.
 Ah, but that wouldn't be representative benchmarking. Consider the later example about finding a random word in the large list -- if you fix the random seed and that turns out to search for 'aardvark', then the linear search will perform well, but it won't be a good measure of how fast the program is under more general conditions.

Your idea works for comparing machines, though.
