
Statistically rigorous Java performance evaluation - 0wl3x
https://blog.acolyer.org/2017/11/06/statistically-rigorous-java-performance-evaluation/
======
scott_s
I prefer reporting the mean and the standard deviation - the paper advocates a
confidence interval instead of standard deviation. Typically, I'm more
concerned with the _spread_ of obtained performance values than I am with how
likely it is that our measured mean lies within some interval. I generally
don't think of that spread of obtained values as noise or random errors, but
as systematic consequences of using real computing systems. The reason I don't
consider that systematic _error_ is that the sources of variation in real
computer systems are often the result of things like memory hierarchies and
system buffers that will exist in practice. Real systems will have these
things, so I want my experiments to have them as well - so long as our
benchmark has them in the same way a real production system will have them.

For example, see Table 2 in a recent paper I am a co-author on (page 8 of the
pdf, page 73 using the proceedings numbering):
[http://www.scott-a-s.com/files/debs2017_daba.pdf](http://www.scott-a-s.com/files/debs2017_daba.pdf)
In this paper, we care about latency, and we
report the average latency along with the standard deviation. Here, a tighter
standard deviation is more important than confidence that the mean falls
within a particular range. And the variation in latencies is caused by both
software and hardware realities of the memory hierarchy.
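
Not numbers from the paper, just a minimal sketch in plain Java of the kind
of summary I mean, with made-up latencies:

    import java.util.Arrays;

    public class LatencySummary {
        public static void main(String[] args) {
            // Made-up per-item latencies in microseconds; in practice these
            // would come from the benchmark runs themselves.
            double[] latencies = {12.1, 11.8, 12.4, 55.0, 12.0, 11.9, 13.2, 12.3};

            double mean = Arrays.stream(latencies).average().orElse(Double.NaN);
            double variance = Arrays.stream(latencies)
                                    .map(x -> (x - mean) * (x - mean))
                                    .sum() / (latencies.length - 1); // sample variance
            double std = Math.sqrt(variance);

            System.out.printf("mean = %.2f us, std = %.2f us%n", mean, std);
        }
    }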

~~~
srean
Mean and standard deviation seem downright dangerous to use, especially to
understand latencies. Both mean and std are very sensitive to extreme values,
and these extreme values tend to be frequent enough that shoving them under
the 'outlier' carpet does not instill much confidence. Looking at multiple
quantiles (if not all of them) would be more sound. But if one is compelled
to reduce the data to two numbers, one can go with trimmed means, L-moments,
or some other robust estimates of location and scale (EDIT: for example, the
median and interquartile range; it's just that the trimmed mean is more
efficient, both statistically and computationally).
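
A minimal plain-Java sketch of those estimators, with made-up numbers (a
careful implementation would interpolate quantiles rather than use the
nearest rank):

    import java.util.Arrays;

    public class RobustSummary {
        // Trimmed mean: drop the lowest and highest `trim` fraction, average the rest.
        static double trimmedMean(double[] xs, double trim) {
            double[] s = xs.clone();
            Arrays.sort(s);
            int k = (int) (s.length * trim);
            double sum = 0;
            for (int i = k; i < s.length - k; i++) sum += s[i];
            return sum / (s.length - 2 * k);
        }

        // p-th quantile by nearest rank; good enough for a sketch.
        static double quantile(double[] xs, double p) {
            double[] s = xs.clone();
            Arrays.sort(s);
            int idx = (int) Math.min(s.length - 1, Math.floor(p * s.length));
            return s[idx];
        }

        public static void main(String[] args) {
            // Made-up latencies with one extreme value.
            double[] xs = {12.1, 11.8, 12.4, 550.0, 12.0, 11.9, 13.2, 12.3, 12.2, 12.5};

            System.out.printf("10%% trimmed mean = %.2f%n", trimmedMean(xs, 0.10));
            System.out.printf("median = %.2f, IQR = %.2f%n",
                    quantile(xs, 0.50), quantile(xs, 0.75) - quantile(xs, 0.25));
        }
    }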

Deficiencies of mean and std are easy to demonstrate. Just draw multiple
samples of different sizes from a Pareto distribution (some other long-tailed
distribution would work too) and observe how drastically the mean and std
vary among those samples.
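
For instance, a quick plain-Java sketch (inverse-CDF sampling, with an
assumed shape of 1.5, which gives a finite mean of 3 but infinite variance):

    import java.util.Arrays;
    import java.util.Random;

    public class ParetoDemo {
        public static void main(String[] args) {
            Random rng = new Random(42);
            double alpha = 1.5; // shape: mean exists (= 3), variance does not
            int[] sizes = {100, 1_000, 10_000, 100_000};

            for (int n : sizes) {
                double[] xs = new double[n];
                for (int i = 0; i < n; i++) {
                    // Inverse-CDF sample from Pareto(x_min = 1, alpha)
                    double u = 1.0 - rng.nextDouble(); // uniform in (0, 1]
                    xs[i] = Math.pow(u, -1.0 / alpha);
                }
                double mean = Arrays.stream(xs).average().getAsDouble();
                double variance = Arrays.stream(xs)
                                        .map(x -> (x - mean) * (x - mean))
                                        .sum() / (n - 1);
                Arrays.sort(xs);
                // The sample mean wanders around 3 and the std blows up,
                // while the median sits near 2^(1/1.5) ~ 1.59.
                System.out.printf("n=%7d  mean=%8.3f  std=%12.3f  median=%6.3f%n",
                        n, mean, Math.sqrt(variance), xs[n / 2]);
            }
        }
    }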

Gaussian distribution based methods, for example ANOVA, had their moment of
glory in centuries past (good stats has since moved on). They are still
fantastic tools when the assumptions they make are true (this is rather rare)
but they break badly when those assumptions are violated a tiny wee bit. One
can easily end up making a wrong decision.

At the very least one should verify that the Gaussian assumption holds before
believing those statistics. There are many post-Gaussian statistics that one
can use, if it turns out that the data is not Gaussian.
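
One rough way to do that, sketched in plain Java with made-up numbers: a
Jarque-Bera-style look at skewness and excess kurtosis (a proper analysis
would also use a Q-Q plot or a real test from a statistics package):

    import java.util.Arrays;

    public class NormalityCheck {
        public static void main(String[] args) {
            // Made-up measurements; substitute your own benchmark numbers.
            double[] xs = {10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 25.7, 10.1, 9.7};

            int n = xs.length;
            double mean = Arrays.stream(xs).average().getAsDouble();
            double m2 = 0, m3 = 0, m4 = 0;
            for (double x : xs) {
                double d = x - mean;
                m2 += d * d;
                m3 += d * d * d;
                m4 += d * d * d * d;
            }
            m2 /= n; m3 /= n; m4 /= n;

            double skew = m3 / Math.pow(m2, 1.5);
            double excessKurtosis = m4 / (m2 * m2) - 3.0;

            // Jarque-Bera statistic: roughly chi-square(2) under normality
            // (a large-sample approximation), so values well above ~6 are a
            // red flag that Gaussian-based stats may mislead.
            double jb = n / 6.0 * (skew * skew + excessKurtosis * excessKurtosis / 4.0);

            System.out.printf("skewness=%.2f  excess kurtosis=%.2f  JB=%.2f%n",
                    skew, excessKurtosis, jb);
        }
    }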

~~~
scott_s
In my work, I have never discarded outlier values. We investigate them to make
sure we're not doing something boneheaded, but once I'm confident they're
legit, I always include them.

I agree that trimmed means and L-moments are better at characterizing a
distribution. But! I had to look them up just now, and I have never seen a
computer systems paper which used them. I don't know if most readers would be
able to intuit meaning from them. My claim is not that mean and standard
deviation is the best way to characterize computer systems performance data,
but that for my work, it's better than mean and a confidence interval.

~~~
srean
> In my work, I have never discarded outlier values.

This is great to hear, hope everyone follows this.

> I have never seen a computer systems paper which used them. I don't know if
> most readers would be able to intuit meaning from them.

That's rather unfortunate, is it not? Current practice is broken, so should
one not strive to fix it, introduce the community to better tools, and
champion them? I am aware I am being harsh here, but "correctness be damned,
let's follow the crowd" does not show the community in stellar light. If the
trimmed mean is too exotic, even the median and interquartile range would be
fine.

> My claim is not that mean and standard deviation is the best way to
> characterize computer systems performance data, but that for my work, it's
> better than mean and a confidence interval.

I have two comments to make. (i) It's not the case that mean and std are
mostly OK but merely suboptimal; I would be fine with that. The thing is,
they can be horribly wrong and unrepresentative. Seemingly innocuous-looking
distributions can have infinite variance. Since you use a lot of
Gaussian-based tools, you will be familiar with the t-test and the
t-distribution. The variance of a t-distribution with 2 degrees of freedom is
infinite, so any finite number one reports for the std is wrong by an
arbitrarily large amount.
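
A quick plain-Java sketch of (i), using the fact that a t(2) variate can be
generated as Z / sqrt(E) with Z standard normal and E standard exponential:
the sample std never settles down as the sample grows, while a quantile-based
spread does:

    import java.util.Arrays;
    import java.util.Random;

    public class StudentT2Demo {
        public static void main(String[] args) {
            Random rng = new Random(7);
            int[] sizes = {1_000, 100_000, 1_000_000};

            for (int n : sizes) {
                double[] xs = new double[n];
                for (int i = 0; i < n; i++) {
                    // t with 2 degrees of freedom: Z / sqrt(Chi2(2)/2),
                    // and Chi2(2)/2 is a standard exponential.
                    double z = rng.nextGaussian();
                    double e = -Math.log(1.0 - rng.nextDouble());
                    xs[i] = z / Math.sqrt(e);
                }
                double mean = Arrays.stream(xs).average().getAsDouble();
                double variance = Arrays.stream(xs)
                                        .map(x -> (x - mean) * (x - mean))
                                        .sum() / (n - 1);
                Arrays.sort(xs);
                // The true variance is infinite, so the sample std keeps
                // jumping around and creeping up; the IQR converges to ~1.63.
                System.out.printf("n=%9d  std=%8.3f  IQR=%6.3f%n",
                        n, Math.sqrt(variance), xs[3 * n / 4] - xs[n / 4]);
            }
        }
    }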

> it's better than mean and a confidence interval.

(ii) I am having trouble with this one. Assuming the confidence interval has
been computed under the Gaussian assumption, its width is just a rescaled
standard deviation, nothing fundamentally different: the half-width of a 95%
interval on the mean is roughly 1.96 * s / sqrt(n). When the std is a bad
summary, the interval is equally bad.

I am not against the use of Gaussian, sometimes it is very reasonable, but if
correctness is a goal, then it needs to be demonstrated that it is not a wrong
thing to use. There are standard techniques to verify that using Gaussian
based stats would be fine.

I would be sympathetic if it were the case that we don't know how to handle
departures from Gaussianity, or that handling them is very expensive. Neither
of these is true, and neither has been true since the 80s, in fact earlier.

------
igouy
More recently:

"Quantifying performance changes with effect size confidence intervals" Tomas
Kalibera and Richard Jones Technical Report 4-12, University of Kent, June
2012.

[https://www.cs.kent.ac.uk/pubs/2012/3233/](https://www.cs.kent.ac.uk/pubs/2012/3233/)

Kalibera, Tomas and Jones, Richard E. (2013) "Rigorous Benchmarking in
Reasonable Time"

[https://kar.kent.ac.uk/33611/](https://kar.kent.ac.uk/33611/)

------
filereaper
SPECjvm98 is an outdated measure of both system and JVM performance; the
benchmark to look at is SPECjbb2015, which very aggressively taxes JVM
subsystems like the GC and the JIT.

~~~
efferifick
Yes, but this paper is old, so using SPECjvm98 was the right thing to do at
the time it was written.

