
Lies, Damned Lies, and Averages: Perc50, Perc95 Explained for Programmers - kqr
https://www.schneems.com/2020/03/17/lies-damned-lies-and-averages-perc50-perc95-explained-for-programmers/
======
kqr
> Well, when it comes to performance - you can’t use the average if you don’t
> know the distribution.

...and if you have the distribution, you no longer need the average!

Latency as experienced by the end user is dominated by the fat tail, for both
technical and psychological reasons.

Technical ones are probably the most convincing: it is very rare to have users
submit single requests and then be done. Especially in the cloudified,
service-oriented stacks of today, even single requests lead to a cascade of
requests inside the system. Whenever you have tens or hundreds of requests for
a single user, it starts becoming very likely that they hit that fat tail at
some point in their journey.

Given that latency is dominated by "outliers", looking at anything but p99
_and beyond_ is meaningless.

What's worse is this: since most people at best look at p95 or p99, they tend
to optimise for "the common case" at the cost of tail latencies! They
introduce huge variance in latencies that makes benchmarks look better, while
things actually get worse for real users.

Sorry, this is a pet peeve of mine.

------
mjb
> Well, when it comes to performance - you can’t use the average if you don’t
> know the distribution.

This is frankly wrong. Performance comes in multiple flavors. Latency is one
of those, and there we know that percentiles really matter (see Andrew
Certain's section of this talk:
[https://www.youtube.com/watch?v=sKRdemSirDM&feature=youtu.be...](https://www.youtube.com/watch?v=sKRdemSirDM&feature=youtu.be&t=175)
for Amazon's experience).

But for others, like throughput and scale, you don't need to know the
distribution. In fact for throughput, the only thing that really matters is
the long-term mean latency. For concurrency, it's that and long-term mean
arrival rate. I wrote a blog post about it a while back
([http://brooker.co.za/blog/2017/12/28/mean.html](http://brooker.co.za/blog/2017/12/28/mean.html)).
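
To make the Little's law point concrete (L = lambda * W, which is what the
mean-concurrency claim rests on), here is a minimal sketch with invented
numbers: two latency distributions with the same mean but very different tails
imply the same long-term mean concurrency at the same arrival rate.

```python
# Sketch of Little's law (L = lambda * W): long-term mean concurrency depends
# only on the mean arrival rate and the mean latency, not on the tail shape.
# All numbers here are invented for illustration.
import random

random.seed(0)

arrival_rate = 1000.0  # requests per second (long-term mean)

def constant_latency():
    return 0.050  # every request takes 50 ms

def heavy_tailed_latency():
    # 99% of requests take 30 ms, 1% take 2.03 s -> mean is still ~50 ms
    return 0.030 if random.random() < 0.99 else 2.030

for name, draw in [("constant", constant_latency), ("heavy-tailed", heavy_tailed_latency)]:
    samples = [draw() for _ in range(100_000)]
    mean_latency = sum(samples) / len(samples)
    mean_concurrency = arrival_rate * mean_latency  # Little's law
    print(f"{name:12s}: mean latency {mean_latency * 1000:5.1f} ms "
          f"-> mean concurrency ~{mean_concurrency:5.1f}")
```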

The core point here is that _all_ summary statistics are misleading. You need
to be clear on what you care about, and making absolute statements about the
mean isn't a good way to do that.

Edit: This came across a bit more confrontational than I had intended. The OP
makes some good points, but I think his point about the mean is overly broad.

~~~
apk
> The core point here is that all summary statistics are misleading. You need
> to be clear on what you care about

I couldn't agree more. A few months ago I gave a talk that tried in part to
emphasize this point
([https://www.youtube.com/watch?v=EG7Zhd6gLiw](https://www.youtube.com/watch?v=EG7Zhd6gLiw)).
mjb, I hadn't seen your post until just now but I wish I'd known about it
earlier.

Another hard-earned lesson from many teams I've worked with is that humans
just aren't very good at judging the variance that's intrinsic to many [summary]
statistics. Even when your system is operating in what a human would consider
a steady-state, summary statistics are naturally going to bounce around a bit
over time. The variance is often higher for tail percentiles just because the
density of the PDF is lower in that region. When faced with a question like
"did the behavior of my system get worse?" in response to an external change
(such as a config change, a code deploy, a traffic increase, etc.), it can be
difficult to come up with a reliable answer just by eyeballing a squiggly time
series line.
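
To make that bounce visible, here is a small numpy sketch (the lognormal
stand-in and all parameters are arbitrary): repeated windows drawn from the
very same distribution give p99 estimates that wander far more than the p50
estimates.

```python
# Sketch: per-window percentile estimates jitter even when the underlying
# system never changes, and tail percentiles jitter more than the median.
# The lognormal parameters are arbitrary stand-ins.
import numpy as np

rng = np.random.default_rng(42)

p50s, p99s = [], []
for _ in range(200):  # e.g. 200 one-minute windows of ~1000 requests each
    window = rng.lognormal(mean=np.log(50), sigma=0.5, size=1000)
    p50s.append(np.percentile(window, 50))
    p99s.append(np.percentile(window, 99))

print(f"p50 estimates: {np.mean(p50s):6.1f} ms, spread (std) {np.std(p50s):4.2f}")
print(f"p99 estimates: {np.mean(p99s):6.1f} ms, spread (std) {np.std(p99s):4.2f}")
# The p99 time series wiggles much more than the p50 one, with no real change.
```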

------
heinrichhartman
Shameless plug. Just wrote a paper about this:

[https://arxiv.org/abs/2001.06561](https://arxiv.org/abs/2001.06561)

It contains a survey of the most popular latency aggregation methods used in
the industry (Prometheus histograms, t-digest, HDR histogram, DDSketch).

~~~
jmacd333
The Circllhist algorithm is interesting, but I get an uneasy feeling from the
use of a _relative error_ measure to evaluate the performance of quantile
estimation. Note that other authors don't use this measure.

Dunning uses Mean Absolute Error in his latest T-digest paper:
[https://arxiv.org/pdf/1902.04023.pdf](https://arxiv.org/pdf/1902.04023.pdf)

Cohen uses Normalized Root-Mean-Squared Error to evaluate sampling schemes,
which are equally capable of estimating latency quantiles:
[https://dl.acm.org/doi/abs/10.1145/3234338](https://dl.acm.org/doi/abs/10.1145/3234338)

The problem with Relative Error as a measure of accuracy is that it depends on
the location of the distribution. The same size absolute error becomes a large
relative error near zero and becomes a small relative error farther up the
number line.

Another thing about this study is that only one value for the T-digest quality
is tested. Of course, the T-digest quality parameter equates directly with
compressed size, so it's unsurprising that T-digest's size is fixed throughout
the experiment. I also suspect that the choice of data set matters quite a lot
in this study. If your latency values were clustered around a small range,
then algorithms like DDSketch and Circllhist will indeed have relative error
of less than 5% (as they prove), but T-digest will be significantly more
accurate.

~~~
heinrichhartman
Thanks for your comment!

Relative error is a practical choice, since it lets you cover an extremely
large value range (essentially all floating-point numbers) with a small size
(O(log(range))) and zero configuration. You can't have that if you bound the
absolute error.

Also, the relative error is what you are interested in most of the time as a
practitioner (200ms +/- 10ms; 1 year +/- 15 days).

DDSketch uses relative error for estimation as well.
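
To illustrate the idea, here is a toy log-bucketing sketch (not the actual
circllhist or DDSketch scheme): geometric bucket boundaries bound the relative
error of every recorded value and cover an enormous range with only
O(log(range)) buckets, with nothing to configure except the growth factor.

```python
# Toy log-bucketed histogram (illustration only, not the circllhist or
# DDSketch scheme): bucket boundaries grow geometrically, so the relative
# error of any recorded value is bounded by ~(GAMMA - 1) / 2, and an enormous
# value range needs only O(log(range)) buckets.
import math
from collections import Counter

GAMMA = 1.05  # each bucket is 5% wider than the previous -> <= ~2.5% rel. error

def bucket_index(value):
    return math.ceil(math.log(value, GAMMA))

def bucket_midpoint(index):
    lower, upper = GAMMA ** (index - 1), GAMMA ** index
    return (lower + upper) / 2

counts = Counter()
for latency_ms in [0.4, 3.2, 3.3, 47.0, 51.0, 1200.0, 90000.0]:
    counts[bucket_index(latency_ms)] += 1

for idx in sorted(counts):
    print(f"bucket {idx:4d}: ~{bucket_midpoint(idx):10.2f} ms  x{counts[idx]}")
```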

> If your latency values were clustered around a small range, [...] T-digest
> will be significantly more accurate.

That is correct!

One point of this example was to demonstrate that merged t-digests can have
unbounded errors. The t-digest paper speculated that merged digests have
bounded error and that the proof was just more difficult. As it turns out, if
you have heavy merges and an extremely large value range, you can get
unbounded errors.

------
contravariant
Using the median in lieu of the average isn't always a good idea either. A
service could completely fail to respond almost 50% of the time and you'd
still get a low median. Same holds for perc95, but to a lesser extent.
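
A made-up illustration of that failure mode, assuming the failed requests
come back quickly (instant 500s, connections refused, and so on):

```python
# Made-up numbers: a service that fails fast on almost half of all requests
# can still report a perfectly healthy median (and even p95) latency.
import statistics

# 51% of requests succeed in ~80 ms; 49% fail immediately in ~2 ms.
latencies_ms = [80] * 510 + [2] * 490

print("median:", statistics.median(latencies_ms), "ms")                 # 80 ms
print("p95:   ", statistics.quantiles(latencies_ms, n=100)[94], "ms")   # 80 ms
print("error rate:", 490 / len(latencies_ms))                           # 0.49
```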

The main problem is that people try to summarize their data too early. What
you want is a measure for how good or bad a _single_ datum is and only _then_
can you summarize the end result. And usually averages aren't a bad choice at
that stage.

Averages have some particularly nice properties when dealing with dependent
variables, sums of variables, and when you want to minimize the distance
between your estimate and the actual value. However, to take advantage of
that, your measure actually needs to make sense. For companies, the holy grail
is being able to directly express how much money you make or lose because of
that single datum; failing that, you'll need to find something that's at least
somewhat proportional to it.
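
As a rough sketch of "score the single datum first, then summarize": the cost
function below is entirely hypothetical, but it is the kind of per-request
measure you would want before averaging anything.

```python
# Sketch: attach a per-request score first (here a purely hypothetical
# "revenue lost" function of latency and success), then average the scores.
def request_cost_dollars(latency_ms, succeeded):
    """Hypothetical business cost of a single request (not a real model)."""
    if not succeeded:
        return 1.00  # a failed request costs us a sale, say
    return 0.10 * max(0.0, (latency_ms - 200) / 1000)  # slow pages bleed revenue

requests = [(120, True), (180, True), (950, True), (3200, True), (50, False)]
costs = [request_cost_dollars(latency, ok) for latency, ok in requests]

print(f"mean cost per request: ${sum(costs) / len(costs):.3f}")
```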

~~~
alexhutcheson
My normal approach is to measure and set up separate alerts for "error rate"
and possibly "timeout rate". You definitely want to know about those, but
"mean latency" mixes those metrics with the latency metrics for successful
requests, which makes it less sensitive to changes in either one.

In general I agree that averages aren't always bad. One additional advantage
is that it's often possible to generate robust confidence intervals for
averages, but it's often not valid to generate CIs for medians/percentiles
without introducing additional, possibly flawed assumptions.
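
For the mean, a normal-approximation interval takes a couple of lines (sketch
below, on synthetic data); for medians/percentiles you would typically have to
reach for order-statistic or bootstrap methods instead.

```python
# Sketch: a CLT-based 95% confidence interval for the mean latency, on
# synthetic data. (For medians/percentiles you would typically need
# order-statistic or bootstrap machinery instead.)
import numpy as np

rng = np.random.default_rng(1)
latencies = rng.lognormal(mean=np.log(50), sigma=0.8, size=5000)  # made-up data

mean = latencies.mean()
sem = latencies.std(ddof=1) / np.sqrt(len(latencies))  # standard error of the mean
print(f"mean latency {mean:.1f} ms, 95% CI [{mean - 1.96 * sem:.1f}, {mean + 1.96 * sem:.1f}]")
```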

~~~
contravariant
Error rate and timeout rate are good examples of an average that measures the
thing you're actually interested in.

The whole point is that averaging the latency first and only then trying to
figure out what it means is backwards.

------
imtringued
Every time I hear about percentiles I think: why not just show the whole
distribution instead of picking a few values? I was immediately thinking of
showing the latency distribution as a histogram, and I was pleased that the
article does exactly that. Of course, graphing percentiles over time is much
easier because each one is just a single value. Percentiles are very useful
for finding latency spikes but not that good for analyzing them.
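
For what it's worth, plotting the whole distribution is only a few lines
(matplotlib sketch on synthetic data):

```python
# Sketch: plot the full latency distribution (synthetic data). A log-scaled
# x-axis usually makes the tail visible next to the bulk of the requests.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=np.log(60), sigma=0.7, size=50_000)

bins = np.logspace(np.log10(latencies_ms.min()), np.log10(latencies_ms.max()), 80)
plt.hist(latencies_ms, bins=bins)
plt.xscale("log")
plt.xlabel("latency (ms)")
plt.ylabel("requests")
for p in (50, 95, 99):
    plt.axvline(np.percentile(latencies_ms, p), linestyle="--")  # mark p50/p95/p99
plt.show()
```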

~~~
TeMPOraL
> _Of course graphing percentiles over time is much easier because they just
> represent a single value._

Ridgeline plots (joyplots) are severely underutilized.

~~~
mncharity
R: [https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html](https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html)
d3: [https://observablehq.com/@d3/ridgeline-plot](https://observablehq.com/@d3/ridgeline-plot)
Python: [https://github.com/sbebo/joypy](https://github.com/sbebo/joypy)

------
kasey_junk
I honestly think most teams would be better off measuring the max of the
latency curve. It at least isn’t subject to the many transform errors
introduced by most metric pipelines, and it is easy to explain to people
without getting into why p95 vs. p99 vs. p99.9.
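
A toy sketch of why max survives aggregation where percentiles don't (the
numbers are invented, with one deliberately slow host): the max of per-host
maxima is exactly the global max, while averaging per-host p99s, which is
roughly what many pipelines end up doing, can be far from the true global p99.

```python
# Toy sketch: max composes exactly across hosts/windows; percentiles do not.
# Ten hosts with invented latencies, one of them deliberately slow.
import numpy as np

rng = np.random.default_rng(7)
hosts = [rng.lognormal(np.log(40), 0.6, size=10_000) for _ in range(9)]
hosts.append(rng.lognormal(np.log(400), 0.6, size=10_000))  # the slow host

combined = np.concatenate(hosts)
true_p99 = np.percentile(combined, 99)
avg_of_host_p99s = np.mean([np.percentile(h, 99) for h in hosts])  # a common pipeline mistake

print(f"global p99: {true_p99:7.1f} ms   mean of per-host p99s: {avg_of_host_p99s:7.1f} ms")
print(f"global max: {combined.max():7.1f} ms   max of per-host maxes: {max(h.max() for h in hosts):7.1f} ms")
```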

~~~
lonelappde
Max is always infinity/timeout/highly variable, isn't it?

And it doesn't tell you when you just made all your requests 1s slower.

~~~
kasey_junk
It’s not infinite; it had better be the timeout, but frequently isn’t. And
yes, there is variance to it, but usually (handwave) no more than in the tail
percentile that teams choose.

I’ve had the experience multiple times where simply shifting a graph to max
shows that timeouts/load shedding isn’t working; then teams discover they are
hitting timeouts way more than they thought. Only after working through those
issues do you get to actually improving latency.

The upside of the number’s simplicity is only outweighed by its downsides
when you start chasing real-time constraints in systems that don’t need them.

------
jedberg
I talk about this a lot when giving talks or working with folks on their
reliability. This article does a great (if a bit long-winded) job of
explaining why it's important to know your p50, p95, and p99.

But what rarely gets mentioned is that there is no one right answer as to
which to use.

It's a business decision on a per product basis.

In some cases, it's totally fine if 5% of the customers get an awful response.
In some cases, p99 must be sub 5ms or your customers will leave.

This is one of the key areas where engineering and management need to work
together -- deciding which percentile is key for which metrics.

------
lmeyerov
I may have missed it, but in many scenarios the reason p95 etc. matters isn't
that it is 5% of cases ('of course 3G users are slow') but that each user may
issue many requests. For example, if a session fetches 20 assets, most users
will be hit with a great p50 and a bad p95.

Troubleshooting at the session-of-individual-requests level is tough, so
being able to zoom in and out is the power of correlation IDs and
observability stacks (vs. this kind of monitoring view, AFAICT).

~~~
bobbiechen
And similarly, within a microservice architecture, tail latency is amplified
through all the downstream requests, or wherever there is fan-out:

 _Consider a system where each server typically responds in 10ms but with a
99th-percentile latency of one second. If a user request is handled on just
one such server, one user request in 100 will be slow (one second). The figure
here outlines how service-level latency in this hypothetical scenario is
affected by very modest fractions of latency outliers. If a user request must
collect responses from 100 such servers in parallel, then 63% of user requests
will take more than one second (marked “x” in the figure)._

"The tail at scale", Jeffrey Dean and Luiz André Barroso:
[https://dl.acm.org/doi/abs/10.1145/2408776.2408794?download=...](https://dl.acm.org/doi/abs/10.1145/2408776.2408794?download=true)
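
The 63% figure is just the independence arithmetic, which is easy to check:

```python
# Quick check of the arithmetic: with 100 parallel backend calls, each slow
# with probability 1%, the chance that at least one is slow is 1 - 0.99**100.
p_slow_single = 0.01
fanout = 100

p_request_slow = 1 - (1 - p_slow_single) ** fanout
print(f"{p_request_slow:.2%}")  # ~63.4%
```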

------
allanrbo
The "how not to measure latency" talk by Gil Tene is another really good
explanation of this topic:
[https://youtu.be/lJ8ydIuPFeU](https://youtu.be/lJ8ydIuPFeU)

