What Metric to Use When Benchmarking? (tratt.net)
30 points by dannas on July 1, 2022 | 13 comments



In interactive programs (for example games) it's often more important to be consistent than to be fast.

For example, a game which runs at 120 fps but every 10 seconds has 1 frame that takes 1/30th of a second feels awful.

A game that runs at constant 60 fps feels much better.

In this case it's better to just count the frames that took too long and by how much.
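As a rough illustration, a minimal sketch of that kind of metric (Python; the 60 fps budget and the render_frame stub are assumptions, not anything from the article):

    import time

    FRAME_BUDGET = 1.0 / 60.0              # assumed target: 60 fps

    def render_frame():
        pass                               # stand-in for the real per-frame work

    slow_frames = 0
    worst_overrun = 0.0

    for _ in range(1000):
        start = time.perf_counter()
        render_frame()
        elapsed = time.perf_counter() - start
        if elapsed > FRAME_BUDGET:
            slow_frames += 1
            worst_overrun = max(worst_overrun, elapsed - FRAME_BUDGET)

    print(f"{slow_frames} frames over budget, worst overrun {worst_overrun * 1000:.2f} ms")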


This. When optimizing realtime apps, focus on worst cases.

Also, FPS is an inconvenient metric because framerate is not a linear scale. You can't say a task takes x FPS to compute, but you can say it takes x milliseconds, since time is a linear scale.

For example: if a task takes 10 ms and other stuff takes 1 ms, the FPS delta for the task is ~909 FPS (1000 FPS - 91 FPS). But if the other stuff takes 20 ms, we get only a ~17 FPS difference (50 FPS - 33 FPS).
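A quick sketch of that arithmetic (Python, using the same numbers as above):

    def fps(total_ms):
        return 1000.0 / total_ms

    task_ms = 10.0
    for other_ms in (1.0, 20.0):
        without_task = fps(other_ms)
        with_task = fps(other_ms + task_ms)
        print(f"other work {other_ms:.0f} ms: {without_task:.0f} fps -> {with_task:.0f} fps "
              f"(delta {without_task - with_task:.0f} fps, task still costs {task_ms:.0f} ms)")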


CPU utilization. A modern CPU can execute something like 4-6 uops per cycle, IIRC. Multiply that by the clock frequency and the number of cores and you get a theoretical max. Then take your executed instructions per second and divide by this theoretical max. The better the ratio, the more efficient your program is.
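A back-of-the-envelope sketch of that ratio (Python; the issue width, clock, core count and measured instruction rate below are all made-up example numbers):

    uops_per_cycle = 4                      # assumed issue width
    clock_hz = 3.5e9                        # assumed 3.5 GHz
    cores = 8                               # assumed core count

    theoretical_max = uops_per_cycle * clock_hz * cores    # instructions/second ceiling

    measured_instr_per_sec = 2.0e10         # e.g. taken from hardware performance counters

    utilization = measured_instr_per_sec / theoretical_max
    print(f"theoretical max {theoretical_max:.2e} instr/s, utilization {utilization:.1%}")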


How can the 99% confidence interval for time in the first example be 7.391 +- 0.26? Most of the values listed lie outside of that.

I got a mean of 7.395 and a sigma of 0.533 (this is without the DoF adjustment because these are guessed from the histogram). 2.576 * sigma is the 99% confidence interval if we assume a normal distribution, i.e. 1.373.

In any case we'd also have to consider that we estimated the sigma from the distribution, so we'd have to do an upward correction here: https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics....
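A rough sketch of the two intervals in question (Python; the samples are invented to roughly match the mean/sigma above, not the article's data):

    import math
    import statistics

    # invented samples, roughly matching the mean/sigma guessed above
    samples = [7.0, 7.4, 8.1, 6.9, 7.5, 7.2, 8.0, 6.8, 7.6, 7.4]

    n = len(samples)
    mean = statistics.mean(samples)
    sigma = statistics.stdev(samples)            # Bessel-corrected (n - 1)

    z99 = 2.576                                  # 99% z-value for a normal distribution
    spread = z99 * sigma                         # 99% interval of the distribution itself
    ci_of_mean = z99 * sigma / math.sqrt(n)      # 99% confidence interval of the mean

    print(f"mean {mean:.3f}, sigma {sigma:.3f}")
    print(f"99% of samples:     +- {spread:.3f}")
    print(f"99% CI of the mean: +- {ci_of_mean:.3f}")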


This is an old problem resulting from lack of education.

People apply standard deviation without first learning that it only makes sense for data that has a normal distribution.

Standard deviation is a prediction and characterisation tool. Knowing that the data set is normally distributed, you can characterise the entire data set very easily by giving just a few parameters of the distribution, which then allow you to predict other information. A bit like being able to tell everything about a black hole by just stating its mass, charge and angular momentum.

This only makes sense when the distribution belongs to a special class with that common characteristic. Specifying the mass, charge and angular momentum of a chair does not let you predict everything about the chair; it only works for black holes.

If you are not convinced, try calculating the standard deviation of the number of testicles on humans. Based on it, infer how many humans have one testicle. How many have more than 5?
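A toy sketch of that point (Python; the bimodal "population" is invented purely for illustration): fit a mean and standard deviation, then watch the normal-distribution model predict counts the real data simply doesn't have.

    import statistics
    from math import erf, sqrt

    # invented population: 98% have 2, 2% have 0, nobody has exactly 1
    population = [2] * 980 + [0] * 20

    mu = statistics.mean(population)
    sigma = statistics.stdev(population)

    def normal_cdf(x):
        return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

    predicted_one = normal_cdf(1.5) - normal_cdf(0.5)    # normal model's P(exactly one)
    actual_one = population.count(1) / len(population)

    print(f"mean {mu:.2f}, sigma {sigma:.2f}")
    print(f"normal model: P(one) ~ {predicted_one:.3f}; actual: {actual_one:.3f}")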


Maybe it is the confidence interval on the estimate of the mean, not the 99% interval of the whole distribution

Update: yeah, the next paragraph describes the uncertainty of the estimate of the mean, which is not the same as the spread of the distribution discussed in the initial motivation


Ah, that is very misleading though since they're discussing confidence intervals. Maybe some methodology section would be good.


> My rule of thumb is that when I'm looking at the performance of an individual function, CPU instructions executed are probably more appropriate

Yet a positive correlation between CPU time and wall-clock time in real-world programs isn't guaranteed. An improvement at the micro level (a function or some excerpt of the program) doesn't imply or necessitate an improvement at the macro level (end-to-end wall-clock performance of the whole program).

As a matter of fact, improving something at the micro level can even hurt end-to-end performance, or it may have no impact at all, i.e. performance remains stable.

That said, this is still an interesting article because it talks about the stuff that most benchmarketing blogs and engineers never mention. Getting performance figures that are both statistically significant _and_ reproducible is an amazingly difficult feat.

Means and confidence intervals cannot help you: they can remain stable (and so can many other similar statistical metrics), yet you could still be measuring the consistently degraded performance of the system for some reason that is very difficult to identify. More often than not engineers will easily dismiss such benchmarks because they diverge so much from other measurements, but the thing is that you don't really know what's behind such experiment results: it could be a measurement error, it could be a "glitch" in the system, whatever that might be (network, disk, kernel bug, etc.), but it could very well be an artifact of the software you're actually benchmarking. Or it could be a combination of these things. Knowing which of these is the culprit for the results you are observing is a very difficult task. I haven't yet managed to find a robust approach which doesn't involve manual investigation. Think NUMA-aware systems, where access to a non-local (remote) global variable can cost you dozens of cycles, so it turns out the underlying problem actually stems from your code _and_ the way you're running the experiments! E.g. https://www.anandtech.com/show/16315/the-ampere-altra-review...


The whole blog is excellent.

One issue with looking at instructions retired for small functions is that the performance of small functions may be dominated by cache misses (and a lack of branch-predictor data), so two versions may execute a similar number of instructions but have quite different perf due to fewer branch mispredictions or better memory access patterns. But I guess if you're optimising that then you'll know to look at that instead. I guess the moral of 'get a measurement setup that is good enough to reliably measure the thing you actually care about' still holds.
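A crude way to see that in action (Python sketch; interpreter overhead blunts the effect and the sizes are invented, but the pattern usually shows): two loops doing the same work and executing essentially the same instructions, differing only in memory-access locality.

    import random
    import time

    N = 2_000_000
    data = list(range(N))

    seq_idx = list(range(N))                 # sequential access pattern
    rand_idx = list(range(N))
    random.shuffle(rand_idx)                 # same length, same ops, poor locality

    def checksum(indices):
        total = 0
        for i in indices:
            total += data[i]
        return total

    for name, idx in (("sequential", seq_idx), ("shuffled", rand_idx)):
        start = time.perf_counter()
        checksum(idx)
        print(f"{name}: {(time.perf_counter() - start) * 1000:.1f} ms")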


The missing dimension here is the percentage of time the program is running on the core. Even if the benchmark is using 3 seconds to run 11 billion instructions, that doesn't tell you if the core is still idle 90% of the time because it's blocking on I/O. CPU-bound work should pin the core. This is especially true on server-side stuff, because if you are not maximizing the time on core, you are paying for CPU that is not being used.
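One simple way to sketch that check (Python; time.sleep stands in for blocking I/O): compare CPU time against wall-clock time and see how far from 100% on-core the workload really is.

    import time

    def workload():
        total = sum(i * i for i in range(2_000_000))   # CPU-bound part
        time.sleep(0.5)                                # stand-in for blocking I/O
        return total

    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    workload()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start

    print(f"wall {wall:.2f} s, cpu {cpu:.2f} s, time on core {cpu / wall:.0%}")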


You should not even consider microbenchmarking I/O. If you go with I/O, you need a full-blown load tester.


The SI-metric system please. Imperial units are so impractical.


Watts



