
Minimum Times Tend to Mislead When Benchmarking - kristianp
https://tratt.net/laurie/blog/entries/minimum_times_tend_to_mislead_when_benchmarking.html
======
radford-neal
The author seems to have two statistical misconceptions (which don't, however,
detract from his main points).

He says, "it’s absolutely vital to report something about the distribution as
a whole. In general, confidence intervals are a good technique for doing
this".

But confidence intervals are a way of exhibiting uncertainty in the estimate
of a parameter (such as the mean), _not_ a way of describing a distribution.
Hopefully, what he actually meant is that giving the 5% and 95% (for example)
_quantiles_ of the distribution would be useful. The 90% confidence interval
for the mean would be much smaller than the range from 5% to 95% quantiles,
and would collapse to a point in the limit as you did more repetitions - not
at all the desired behaviour.
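
To make the distinction concrete, here's a rough sketch (made-up exponential "run times", a normal-approximation CI): the 5%-95% quantile range stays roughly fixed as you take more samples, while the 90% confidence interval for the mean collapses toward a point.

```python
# Sketch: quantile range vs. confidence interval for the mean.
# Assumes exponentially distributed "run times"; numbers are illustrative only.
import random
import statistics

random.seed(0)

def quantile_range(xs, lo=0.05, hi=0.95):
    s = sorted(xs)
    return s[int(lo * len(s))], s[int(hi * len(s))]

def ci90_for_mean(xs):
    # Normal-approximation 90% CI for the mean: mean +/- 1.645 * s / sqrt(n)
    m = statistics.mean(xs)
    half = 1.645 * statistics.stdev(xs) / len(xs) ** 0.5
    return m - half, m + half

for n in (100, 10_000, 1_000_000):
    xs = [random.expovariate(1.0) for _ in range(n)]
    q = quantile_range(xs)
    ci = ci90_for_mean(xs)
    print(f"n={n:>9}: 5-95% quantile width = {q[1] - q[0]:.3f}, "
          f"90% CI width = {ci[1] - ci[0]:.5f}")
```

The quantile width hovers near its true value (about 2.9 for this distribution) no matter how large n gets; the CI width shrinks like 1/sqrt(n).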

Also, I disagree that it's always useful to report something about the whole
distribution rather than just the mean. If the benchmark is for a small part
of the program, the time will be combined with many other parts (perhaps
repetitions of this part), with variation in a single part averaged out.

In this regard, he makes another mistake in footnote [4]: "I’ve sometimes
heard it said that the central limit theorem saves us when benchmarking. In my
experience, this hope is based on an over simplification of the theorem, which
is that if you run a program enough times any randomness (including noise)
will converge towards a normal distribution. That is only true if the various
random variables are close to being independent and identically distributed."

This again seems to confuse the distribution data with the distribution of an
estimate for the mean. Only the latter is subject to the central limit
theorem. The distribution of the run times will not approach a normal
distribution regardless of how many times you run the program (unless it's
normal to start with, of course), even if the run times are independent and
identically distributed. This should be obvious to anyone who hasn't heard of
the central limit theorem. It should also be obvious to those who have heard
it and understood it.
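
A quick simulation makes the point (all numbers invented): the skewness of the raw "run times" stays far from zero however many runs you collect, while the distribution of *batch means* is pulled toward normal (skewness near zero), exactly as the CLT says.

```python
# Sketch of the distinction: the *data* distribution stays skewed no matter
# how many runs you collect; the distribution of *sample means* does not.
import random
import statistics

random.seed(1)

def skewness(xs):
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

# "Run times": exponential, population skewness 2, however many we take.
data = [random.expovariate(1.0) for _ in range(100_000)]
print(f"skewness of raw times:   {skewness(data):.2f}")   # stays near 2

# Means of batches of 500 runs: the CLT pulls these toward normal.
means = [statistics.mean(random.expovariate(1.0) for _ in range(500))
         for _ in range(1_000)]
print(f"skewness of batch means: {skewness(means):.2f}")  # near 0
```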

~~~
Retric
I think you missed his point.

If 0.5% of requests take 100x the average time you want that to be as obvious
as possible. With quintiles those extreme outliers are easy to miss. With
confidence interval error bars the outliers become more obvious.

Further, running the test 1 Billion times would not reduce those error bars
because that unusual event would occur ~5 million times.

~~~
radford-neal
If you want to know about the 99.5% quantile, then of course you have to look
at that, not the 95% quantile. (Note that "quintiles" refers to the 20%, 40%,
60%, and 80% quantiles, but you can look at whatever quantile you like.)

Your comment about "confidence interval error bars" revealing outliers makes
no sense. I think you still don't understand that a confidence interval (for
some parameter estimate, maybe the mean) is not a description of the
distribution of the data, but rather is about the distribution of an estimate
obtained from the data - _not at all_ the same thing.

Running a test 1 billion times certainly _would_ reduce the confidence
interval for the mean, even if 5 million of those are 100x outliers.
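
That claim is easy to check with a sketch (made-up numbers mirroring the example above): even when 0.5% of runs take 100x the usual time, the width of the normal-approximation CI for the mean still shrinks like 1/sqrt(n).

```python
# Sketch: 0.5% of runs take 100x as long, yet the CI for the mean
# still narrows as n grows. All numbers are illustrative.
import random

random.seed(2)

def run_time():
    return 100.0 if random.random() < 0.005 else 1.0

def ci_width(n):
    xs = [run_time() for _ in range(n)]
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return 2 * 1.96 * (var / n) ** 0.5   # 95% CI width, normal approximation

for n in (10_000, 1_000_000):
    print(f"n={n:>9}: 95% CI width for the mean = {ci_width(n):.5f}")
```

The outliers inflate the variance (so the CI is wider than for clean data), but they don't stop it from narrowing by ~10x for every 100x more runs.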

~~~
Retric
I think I figured out the confusion. Benchmarking is a comparison and each
test is a fixed sample size. A test might collect 100 or 100,000 samples by
running the task repeatedly, but the data is not aggregated across multiple
tests as the data is generally not saved.

So, running the benchmark test again keeps n the same when calculating the
confidence interval. This means error bars are showing the calculated mean and
standard deviation * arbitrary scaling factor all lined up on an easy to read
graph.

PS: A lot of statistical tools are poor fits as the data is rarely a normal
distribution. But, rigor is not the goal here.

~~~
radford-neal
You're still not getting it. Confidence intervals are _not_ what you want to
look at. Neither are error bars. They would be relevant if you wanted to know
the mean performance, and wondered how accurately you had estimated this mean.
But as far as I can tell, that's not what you want to know.

~~~
Retric
It's not _how accurately you had estimated this mean_, but how representative
the mean is for what's going on.

If you have a better idea feel free to suggest it. Bonus points if it's easy
to calculate and display.

~~~
radford-neal
I'm not sure I understand what you're suggesting, but maybe you're thinking
that you will run the benchmark for some number of repetitions, and then at
some later time run it again, for that number of repetitions, and so forth for
some number of times?

If nothing changes from one time to another, you might as well combine all the
results, and display the overall mean, or a histogram, or whatever. If you
think that something may have changed from one time to another (eg, ambient
temperature, which affects what clock frequency the processor may have been
throttled down to), then you've got a complicated problem, if you're really
trying to get to the bottom of the situation.

But once again, if you're looking at how the sets of runs at different times
differ, confidence intervals _do not_ indicate the spread of the data. Perhaps
you're computing quantiles instead, which would be sensible, but then
incorrectly referring to them as "confidence intervals", when they're not?

~~~
Retric
> If nothing changes

Something changes, but not necessarily something important for every code
path.

Test could be to run optimization A x 1,000 then optimization B x 1,000.

So now I run the test, edit optimization B, and just run the full test again.
Running A again might seem wasteful, but what if something else were to
affect A and B? If I just test B then it's not obvious something other than my
code changed. Perhaps as you say the heat sink is failing and my CPU is now
down-clocking.

The goal is to have a simple test that shows if I am improving things while
also isolating that I am measuring what I think I am measuring. That’s not to
say I will actually run several long tests repeatedly, but I may re run a test
expecting similar results.

Anyway, all of the above has nothing to do with why I want a confidence
interval. It's useful for getting a ballpark feeling of the underlying data.

------
rrss
I'm pretty sure I was part of the "resurgence in people advocating for the use
of the minimum time recently," and I'll tell you right now that I did not have
a "benchmark" that varied by 7x from run to run in mind.

"Benchmarks with random performance tend to mislead"

Both of these "benchmarks" have exactly the wrong amount of randomness. A
little more, and you could get a useful mean out of the results.

When I advocate for the use of minimum time in benchmarking, the scenario in
mind is running periodic regression testing on a set of known good benchmarks
that don't have intrinsic randomness (or it is averaged out) and yet vary by
0-10% as a result of actual noise. In these cases, it is very useful to have a
single number (and if that single number surprises you, then by all means take
a look at the distribution).
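
That single-number workflow is roughly what Python's own `timeit` module encourages: repeat the measurement a few times and keep the minimum (the workload below is a stand-in, purely illustrative).

```python
# Sketch: one number for regression tracking, via min of repeated timings.
import timeit

def bench():                      # stand-in workload, purely illustrative
    return sum(i * i for i in range(10_000))

times = timeit.repeat(bench, number=100, repeat=5)
best = min(times)
print(f"min of 5 runs: {best:.4f} s (spread: {max(times) - best:.4f} s)")
# Compare `best` against a stored baseline; only dig into the full
# distribution if the minimum regresses surprisingly.
```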

Neither of the examples here show the impact of actual system or measurement
noise, only random performance.

~~~
rrss
I went ahead and modified the author's second benchmark (used to compare
linear vs binary search) to iterate over various choices of words, rather
than just searching for one word 10000 times (which may be the first on one
run and the last on another).

Linear search: 0.92 +- less than 0.0001 s

Binary search: 0.15 s (also small variance, but didn't compute it)

We can talk as long as we want about mean vs minimum vs histogram vs etc, but
let's go back to the beginning: you get what you measure.
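
For the curious, a sketch of that kind of modification (hypothetical word list, `bisect`-based binary search, not the author's actual code): time each strategy over a spread of targets so one lucky or unlucky word can't dominate the result.

```python
# Sketch: benchmark searches over many targets, not one fixed word.
import bisect
import time

words = sorted(f"word{i:05d}" for i in range(10_000))
targets = words[::97]                       # a spread of targets, not just one

def linear_search(ws, w):
    for i, x in enumerate(ws):
        if x == w:
            return i
    return -1

def binary_search(ws, w):
    i = bisect.bisect_left(ws, w)
    return i if i < len(ws) and ws[i] == w else -1

for name, fn in (("linear", linear_search), ("binary", binary_search)):
    t0 = time.perf_counter()
    for w in targets:
        fn(words, w)
    print(f"{name}: {time.perf_counter() - t0:.4f} s over {len(targets)} targets")
```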

------
jerf
I kinda think this article proves too much; similar arguments could be made
against any statistical simplification of performance. It's not that it's
"wrong", it's that it's "too right".

I mostly advocate for the min in the specific context of benchmarking to
improve the performance of a specific piece of code, on the theory that it's
the best measure to cut away all the random bits you can't control with
changes to the algorithm under test. If you've got a 5x variance in
performance, and you make a 10% improvement in your algorithm, min seems like
one of the better ways to figure that out, but you've still got a fundamental
problem that that amount of non-determinism is just too much noise to see
through via any mechanism very reliably, without a _lot_ of runs.

In terms of measuring "real" performance, yeah, min isn't a very good choice,
being equivalent to just measuring the 0%th percentile, which anyone doing
real measuring knows isn't _all_ that useful in the "real world". But when I'm
_optimizing_, I use "min". I also, as others have said, try to just eliminate
any variance as high as what is discussed in the article because that's
difficult-to-impossible to work with no matter how you slice it. (You can in
principle always do "enough runs" to get through any random variance, but that
doesn't mean you're willing to wait for a 1000 runs of your minute-long
computation just to see if this particular improvement was statistically-
significant, and then do it another fifty times.)
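
A rough simulation of that setup (all numbers invented): B's algorithm is 10% faster than A's, and every run picks up additive noise. With only 30 runs each, the minimums tend to recover the 10% gap much more cleanly than the means do.

```python
# Sketch: min vs mean for detecting a 10% improvement under noise.
# Hypothetical noise model: true cost times (1 + exponential slowdown).
import random

random.seed(3)

def runs(base, n=30):
    return [base * (1 + random.expovariate(2.0)) for _ in range(n)]

a, b = runs(1.00), runs(0.90)            # B is truly 10% faster
print(f"min  ratio B/A: {min(b) / min(a):.3f}")                      # ~0.90
print(f"mean ratio B/A: {(sum(b) / len(b)) / (sum(a) / len(a)):.3f}")  # noisy
```

The minimums sit close to the true costs (the smallest slowdown out of 30 runs is tiny), while the means carry the full weight of the noise.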

------
KKKKkkkk1
I don't know. When I see that my benchmark has nondeterministic performance, I
consider that as a bug and just remove the nondeterminism. For example, if
Rust's hashmaps have a random seed, I'd find a way to initialize the hashmap
with the same seed every time I run my benchmark. But I guess you can model
the performance as a random variable instead, in which case you'll be
benchmarking the Hashmap and not your code.
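
A Python-flavoured analogue of that idea (the comment's example is Rust, so this is only a loose sketch): derive the benchmark's inputs from a fixed seed, so every run inserts and looks up exactly the same keys.

```python
# Sketch: pin the benchmark's randomness so repeated runs do identical work.
import random
import time

rng = random.Random(42)            # fixed seed: same keys on every run
keys = [rng.getrandbits(64) for _ in range(100_000)]

t0 = time.perf_counter()
table = {k: None for k in keys}
hits = sum(k in table for k in keys)
print(f"inserted {len(table)} keys, {hits} lookups hit, "
      f"in {time.perf_counter() - t0:.4f} s")
```

(Note this pins the *inputs*; hash-order randomization of `str` keys would additionally need `PYTHONHASHSEED` set before the interpreter starts.)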

~~~
kardos
Ah but that wouldn't be representative benchmarking. Consider the later
example about finding a random word on the large list -- if you fix the random
seed and that turns out to search for 'aardvark' then the linear search will
perform well, but it won't be a good measure of how fast the program is in
more general conditions.

Your idea works for comparing machines though.

