
Another point often missed is the diagnostic value of tail measurements. One of the first things I do at any job is replace the 90th percentile with the maximum in all plots.

Sure, it gets messier, and definitely less visually appealing, but the reaction by others has uniformly been "Did we have this data available all along and just never showed it?!"

It's also worth mentioning that even in a system where technically tail latencies aren't a big problem, psychologically they are. If you visit a site 20 times and just one of those is slow, you're likely to mentally associate it with "slow site" rather than "fast site".




My experience has always been people dismissing it when I show them the worst case. "Well, yeah, but that's timeouts downstream and etc etc and so of course it's hitting worst case". They then never have an answer when I ask "How do you know that?" or, the times it has happened, "How do you explain it taking longer than the configured request timeout?"

In every case it has been useful and actionable.


Generally I want 90th, 99th, and max on the same graph/plot. Otherwise there's less sense of the overall distribution.

90th percentile is the fast path. You notice overall effects of growth, load, regressions, etc. 99th percentile (or sometimes 99.9th) is often the actual SLA/SLO so it should be on the plot. Max, as you point out, is necessary to actually see the long tail.

The other thing that's important is to make sure the timeseries and plotting software isn't averaging the max value; it's easy to miss spikes of latency because someone thought the graphs looked nicer with avg(1 min) vs max(1 min) or a tsdb is automatically rewindowing/aggregating with the wrong function. To a lesser extent even percentile buckets can be deceiving if they're too big. 99th percentile with 5-min buckets can miss significant but brief (~3 second) degradations.
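A minimal sketch of that re-aggregation trap, with invented numbers: per-minute maxima rolled up into coarser windows with avg() melt the spike away, while max() preserves it.

```python
# Invented per-minute max latencies (ms); minute 4 had a 950 ms spike.
minute_max_ms = [40, 42, 41, 950, 43, 40]

def rewindow(values, size, agg):
    """Roll fine-grained points up into coarser windows with some aggregate."""
    return [agg(values[i:i + size]) for i in range(0, len(values), size)]

# Rolled up with max(), the spike survives; averaged, it shrinks to ~344 ms.
assert max(rewindow(minute_max_ms, 3, max)) == 950
assert max(rewindow(minute_max_ms, 3, lambda w: sum(w) / len(w))) < 400
```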


Max, as you point out, is necessary to actually see the long tail.

What a strange way to put it. No one wants to see the long tail. The long tail is the means, not the end.

To a lesser extent even percentile buckets can be deceiving if they're too big. 99th percentile with 5-min buckets can miss significant but brief (~3 second) degradations.

Now that's the actual reason to care about max. Otherwise you have values (and often negative user experiences) getting lost in the aggregation.
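To put rough numbers on that (all invented): at a steady 100 req/s, a 3-second blip where every request takes 3 s is exactly 1% of a 5-minute bucket, so a nearest-rank p99 over that bucket still reads healthy while the max does not.

```python
import math

def pctile(sorted_vals, q):
    """Nearest-rank percentile of an ascending list."""
    return sorted_vals[math.ceil(q / 100 * len(sorted_vals)) - 1]

# 5-minute bucket at 100 req/s: 297 s of 20 ms responses,
# plus a 3 s blip where every request takes 3000 ms.
bucket = sorted([20.0] * 29_700 + [3000.0] * 300)

assert pctile(bucket, 99) == 20.0    # p99 of the bucket looks fine
assert max(bucket) == 3000.0         # only the max exposes the blip
```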


One of the first things I do at any job is replace the 90th percentile with the maximum in all plots.

Really? I find percentiles to be more informative. Looking at the maximum is like looking at an error log: it basically throws out all performance data except data from a tiny slice of time. A bad maximum shows you that a service failed at least once, but you often already knew or expected that. A bad 90th percentile tells you that a lot of users are experiencing poor performance.


With fat tailed distributions, like the ones our latencies tend to look like, the signal is in the tail. The central/common observations are just noise and tell you little to nothing about how the system performs.

That said, I do look at percentiles in addition to the max. But I find the 99th, 99.9th, and the max together tells me a lot more about the system performance than the uninformative stuff at the 90th and below.
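One way to see why the bulk is uninformative here (a deterministic sketch with an invented Pareto shape, not real data): in a fat-tailed distribution, a disproportionate share of total time served lives in the worst 1% of requests, which p90 never sees.

```python
# Deterministic inverse-CDF samples from a Pareto(alpha=1.2) toy
# "latency" distribution -- shape chosen purely for illustration.
n, alpha = 100_000, 1.2
samples = [(1 - (i + 0.5) / n) ** (-1 / alpha) for i in range(n)]  # ascending

top_1pct = samples[-n // 100:]
share = sum(top_1pct) / sum(samples)
assert share > 0.3   # the worst 1% of requests carries over 30% of total time
```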


When you care about it, use max. For enterprise, use 95 perc and charge 5x.


min(max_time, const_reasonable_max) is also a good graph if your software supports it. It stops the outliers from polluting the view. After all, your user will leave or refresh after a few seconds, so all that matters is that your response took longer than a minute; it's irrelevant that it took 20 minutes.
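As a sketch (the threshold is invented): clamp each window's max before plotting, so a single 20-minute outlier doesn't flatten the rest of the y-axis.

```python
CAP_MS = 60_000  # past a minute the user is long gone; exact cutoff is arbitrary

def plotted_max(window_max_ms):
    """Clamp a window's max latency for plotting."""
    return min(window_max_ms, CAP_MS)

assert plotted_max(1_200_000) == 60_000   # 20-minute outlier clamped to the cap
assert plotted_max(850) == 850            # ordinary values pass through untouched
```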


Due to the systems in question themselves timing out relatively soon, this is what I end up looking at in practice anyway.

It's a good point, even though it cuts both ways: given some assumptions about the tail behaviour of latencies, the 20 minute extreme event is a treasure trove for estimating the probabilities of smaller tail events.


For most of the services I've looked this closely at, maximum would be a proxy for load. Or in the case of a benchmark suite you'd expect max to increase with the number of iterations. What do you find the maximum useful for?


Aren't all performance numbers a proxy for load, almost by definition?

I tend to look at them per request, per user, per iteration, and so on, to control for that effect.



