
Show HN: Hyperfine – a command-line benchmarking tool - sharkdp
https://github.com/sharkdp/hyperfine
======
kqr
Most -- nearly all -- benchmarking tools like this work from a normality
assumption, i.e. they assume that results follow the normal distribution, or
are close to it. Some do this on blind faith, others argue from the CLT that
"with infinite samples, the mean is normally distributed, so surely it must
also be with a finite number of samples, at least a little?"

In fact, performance numbers (latencies) often follow a heavy-tailed
distribution. For these, you need a literal shitload of samples before the
sample mean looks even slightly normal. Until then, the sample mean, the
sample variance, the sample centiles -- they all severely underestimate the
true values.
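
A quick simulation illustrates this (just a sketch; the Pareto shape
parameter and the sample sizes are arbitrary choices, not taken from any real
benchmark):

```python
import numpy as np

# Draw from a classical Pareto(shape=1.1) distribution: the true mean
# exists (shape/(shape-1) = 11) but the variance is infinite, so the
# sample mean converges very slowly and usually falls short of it.
rng = np.random.default_rng(0)
shape = 1.1
true_mean = shape / (shape - 1)

# 10,000 simulated "benchmarks" of 100 runs each.
samples = rng.pareto(shape, size=(10_000, 100)) + 1
sample_means = samples.mean(axis=1)

print(f"true mean: {true_mean:.1f}")
print(f"benchmarks whose sample mean underestimates it: "
      f"{(sample_means < true_mean).mean():.0%}")
```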

What's worse is when these tools start to remove "outliers". With a heavy-
tailed distribution, the majority of samples don't contribute very much at all
to the expectation. The strongest signal is found in the extreme values. The
strongest signal is found in the stuff that is thrown out. The junk that's
left is the noise, the stuff that doesn't tell you very much about what you're
dealing with.
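
The same toy distribution shows what throwing out the tail does (again only a
sketch, with an arbitrary 5% cutoff):

```python
import numpy as np

# Classical Pareto(1.1) again: true mean = 11. Discard the top 5% of
# samples as "outliers" and the estimate collapses, because most of the
# expectation lives in exactly that tail.
rng = np.random.default_rng(1)
runs = rng.pareto(1.1, size=100_000) + 1

cutoff = np.quantile(runs, 0.95)
trimmed = runs[runs <= cutoff]

print(f"full-sample mean: {runs.mean():.1f}")     # in the vicinity of 11
print(f"trimmed mean:     {trimmed.mean():.1f}")  # roughly 2-3: the signal is gone
```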

I stand firm in my belief that unless you can show that the CLT applies to
your input distributions, you should not assume normality.

And if you don't know what you are doing, stop reporting means. Stop reporting
centiles. Report the maximum value. That's a really boring thing to hear, but
it is nearly always statistically and analytically meaningful, so it is a good
default.

~~~
JoshTriplett
> Report the maximum value.

In benchmarks, assuming you run the same workload each time, you often want
the _minimum_ value. Anything else just tells you how much system overhead you
encountered.

(Complete agreement that applying statistics without knowing anything about
the distribution can mislead.)

~~~
dzamlo
[This
article](https://tratt.net/laurie/blog/entries/minimum_times_tend_to_mislead_when_benchmarking.html)
explains why using the minimum time may not be such a great idea.

~~~
virgilp
Just looking at the chart, the linked article makes a case that the minimum
is much better than the maximum; in fact, if you report one number, the
minimum is the best number to report (in that particular example).

If we go into details -- sure, one number is not ideal, but neither is a
confidence interval, because people will assume a normal distribution for it.
If you report two numbers, perhaps instead of a confidence interval (which
falsely implies a normal distribution) it's better to report the mode and the
median.
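
For continuous timing data the mode has to be estimated, e.g. from a
histogram. A rough sketch (the bin count and the log-normal test data are
arbitrary choices):

```python
import numpy as np

def median_and_mode(times, bins=30):
    # Median is direct; the mode is estimated as the midpoint of the
    # fullest histogram bin. The bin count is an arbitrary choice.
    times = np.asarray(times)
    counts, edges = np.histogram(times, bins=bins)
    i = counts.argmax()
    return np.median(times), (edges[i] + edges[i + 1]) / 2

# Skewed fake timings: the mode sits left of the median.
times = np.random.default_rng(2).lognormal(mean=0.0, sigma=0.5, size=500)
median, mode = median_and_mode(times)
print(f"median: {median:.3f}, mode: {mode:.3f}")
```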

------
sharkdp
I submitted "hyperfine" 1.5 years ago, when it had just come out. Since then,
the program has gained functionality (statistical outlier detection, result
export, parametrized benchmarks) and maturity.
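
For example, the JSON export makes the raw timings available for your own
post-processing. A minimal sketch (the two commands and the file name are
just placeholders):

```python
import json
import subprocess

# Benchmark two throwaway commands, write the raw timings to JSON via
# hyperfine's --export-json flag, and post-process the results.
subprocess.run(
    ["hyperfine", "--export-json", "results.json", "sleep 0.3", "sleep 0.5"],
    check=True,
)

with open("results.json") as f:
    for result in json.load(f)["results"]:
        times = result["times"]  # one wall-clock measurement per run, in seconds
        print(f"{result['command']}: min={min(times):.3f}s, max={max(times):.3f}s")
```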

Old discussion:
[https://news.ycombinator.com/item?id=16193225](https://news.ycombinator.com/item?id=16193225)

Looking forward to your feedback!

~~~
dwohnitmok
Since you cite bench as an inspiration, have you ever thought about including
the nice graphical HTML page with graphs that bench outputs? In a similar
vein, what are your thoughts on directly depending on and using criterion
(the Rust port)?

~~~
sharkdp
Thank you for the feedback.

No, we never thought about HTML output. However, there are multiple other
export options, and we also ship Python scripts that can be used to plot the
benchmark results. The collection of scripts is not very large so far, but we
are happy to add new ones if the need should arise. What kind of diagrams
would you like to see?

Also, I have never thought about using criterion.rs. My feeling was that it is
suited for benchmarks with thousands of iterations, while we typically only
have tens of iterations in hyperfine (as we typically benchmark programs with
execution times > 10 ms). Do you have any specific criterion feature in mind
that we could benefit from?

~~~
dwohnitmok
Ah, those were earnest questions, not feature requests in disguise :),
especially the second one. I haven't so far had any need for either of those
things while using hyperfine (although admittedly I've only used hyperfine
for about 3 small projects).

I asked purely for personal edification, as a comparison with some of the
design choices bench made.

That being said, one of the hypothetical advantages of using criterion is that
you get to piggy-back on its visualizations and statistical analyses, which
are quite useful for showing friends and coworkers. I'm not sure of the
specifics of how you're doing outlier detection, but Criterion's method is
quite nice and I find the choice of linear regression for linearly increasing
iterations to be an interesting sanity check.
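
The regression idea, roughly: time batches with linearly increasing iteration
counts and fit a line, so the slope estimates the per-iteration cost and a
large intercept (or a poor fit) flags overhead. A sketch of the general
technique, not criterion's actual implementation:

```python
import time
import numpy as np

def regress_benchmark(workload, batch_sizes=range(100, 1100, 100)):
    # Time batches of linearly increasing iteration counts, then regress
    # total time on the count: the slope estimates per-iteration cost,
    # the intercept captures fixed per-batch overhead.
    counts, totals = [], []
    for n in batch_sizes:
        start = time.perf_counter()
        for _ in range(n):
            workload()
        totals.append(time.perf_counter() - start)
        counts.append(n)
    slope, intercept = np.polyfit(counts, totals, 1)
    return slope, intercept

per_iter, overhead = regress_benchmark(lambda: sum(range(1_000)))
print(f"~{per_iter * 1e6:.2f} µs/iteration (batch overhead {overhead * 1e3:.3f} ms)")
```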

Criterion can work for lower iterations as well (although tens is definitely
getting into the lower bound).

------
mplanchard
I started using hyperfine a few months ago on a colleague’s recommendation
and I really like it.

In the past, I’ve cobbled together quick bash pipelines to run time in a loop,
awk out timings, and compute averages, but it was always a pain. Hyperfine has
a great interface and really useful reports. It actually reminds me quite a
bit of Criterion, the benchmarking suite for Rust.

I also use fd and bat extensively, so thanks for making such useful tools!

~~~
sharkdp
Thank you very much for your feedback!

------
breck
This is great! I was looking for something like this a year ago for
benchmarking imputation scripts as part of a paper. This would have been
awesome to use. Will keep it in mind for the future.

------
Myrmornis
hyperfine is really nice!

FWIW I wrote a rough first version of a tool that runs a hyperfine benchmark
over all commits in a repo and plots the results in order to see which commits
cause performance changes:
[https://github.com/dandavison/chronologer](https://github.com/dandavison/chronologer)
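
The idea, roughly (a sketch of the concept only, not how chronologer actually
works; the benchmarked command is a placeholder):

```python
import json
import subprocess

COMMAND = "./my-program"  # hypothetical command to benchmark at each commit

# Walk the history oldest-to-newest, benchmark at each commit, and print
# the median timing so regressions stand out. A real tool would also
# rebuild the project at every step and restore the original checkout.
commits = subprocess.run(
    ["git", "rev-list", "--reverse", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.split()

for sha in commits:
    subprocess.run(["git", "checkout", "--quiet", sha], check=True)
    subprocess.run(["hyperfine", "--export-json", "run.json", COMMAND], check=True)
    with open("run.json") as f:
        print(f"{sha[:8]}  {json.load(f)['results'][0]['median']:.3f}s")
```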

~~~
sharkdp
Very cool! I'd love to reference this in the hyperfine README.

~~~
Myrmornis
Great, please do. I wanted to add zoom capabilities to the boxplot
visualization, but it didn’t seem to be available in vega(-lite) yet.

