
Another point often missed is the diagnostic value of tail measurements. One of the first things I do at any job is replace the 90th percentile with the maximum in all plots.

Sure, it gets messier, and definitely less visually appealing, but the reaction by others has uniformly been "Did we have this data available all along and just never showed it?!"

It's also worth mentioning that even in a system where technically tail latencies aren't a big problem, psychologically they are. If you visit a site 20 times and just one of those is slow, you're likely to mentally associate it with "slow site" rather than "fast site".




My experience has always been people dismissing it when I show them the worst case. "Well, yeah, but that's timeouts downstream and etc etc and so of course it's hitting worst case". They then never have an answer when I ask "How do you know that?" or, the times it has happened, "How do you explain it taking longer than the configured request timeout?"

In every case it has been useful and actionable.


Generally I want 90th, 99th, and max on the same graph/plot. Otherwise there's less sense of the overall distribution.

90th percentile is the fast path. You notice overall effects of growth, load, regressions, etc. 99th percentile (or sometimes 99.9th) is often the actual SLA/SLO so it should be on the plot. Max, as you point out, is necessary to actually see the long tail.

The other thing that's important is to make sure the timeseries and plotting software isn't averaging the max value; it's easy to miss spikes of latency because someone thought the graphs looked nicer with avg(1 min) vs max(1 min) or a tsdb is automatically rewindowing/aggregating with the wrong function. To a lesser extent even percentile buckets can be deceiving if they're too big. 99th percentile with 5-min buckets can miss significant but brief (~3 second) degradations.
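A minimal sketch of that re-aggregation trap, with invented numbers: per-minute maxima rolled up into coarser windows with avg() melt the spike away, while max() preserves it.

```python
# Invented per-minute max latencies (ms); minute 4 had a 950 ms spike.
minute_max_ms = [40, 42, 41, 950, 43, 40]

def rewindow(values, size, agg):
    """Roll fine-grained points up into coarser windows with some aggregate."""
    return [agg(values[i:i + size]) for i in range(0, len(values), size)]

# Rolled up with max(), the spike survives; averaged, it shrinks to ~344 ms.
assert max(rewindow(minute_max_ms, 3, max)) == 950
assert max(rewindow(minute_max_ms, 3, lambda w: sum(w) / len(w))) < 400
```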


Max, as you point out, is necessary to actually see the long tail.

What a strange way to put it. No one wants to see the long tail. The long tail is the means, not the end.

To a lesser extent even percentile buckets can be deceiving if they're too big. 99th percentile with 5-min buckets can miss significant but brief (~3 second) degradations.

Now that's the actual reason to care about max. Otherwise you have values (and often negative user experiences) getting lost in the aggregation.
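To put rough numbers on that (all invented): at a steady 100 req/s, a 3-second blip where every request takes 3 s is exactly 1% of a 5-minute bucket, so a nearest-rank p99 over that bucket still reads healthy while the max does not.

```python
import math

def pctile(sorted_vals, q):
    """Nearest-rank percentile of an ascending list."""
    return sorted_vals[math.ceil(q / 100 * len(sorted_vals)) - 1]

# 5-minute bucket at 100 req/s: 297 s of 20 ms responses,
# plus a 3 s blip where every request takes 3000 ms.
bucket = sorted([20.0] * 29_700 + [3000.0] * 300)

assert pctile(bucket, 99) == 20.0    # p99 of the bucket looks fine
assert max(bucket) == 3000.0         # only the max exposes the blip
```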


One of the first things I do at any job is replace the 90th percentile with the maximum in all plots.

Really? I find percentiles to be more informative. Looking at the maximum is like looking at an error log: it basically throws out all performance data except data from a tiny slice of time. A bad maximum shows you that a service failed at least once, but you often already knew or expected that. A bad 90th percentile tells you that a lot of users are experiencing poor performance.


With fat tailed distributions, like the ones our latencies tend to look like, the signal is in the tail. The central/common observations are just noise and tell you little to nothing about how the system performs.

That said, I do look at percentiles in addition to the max. But I find the 99th, 99.9th, and the max together tells me a lot more about the system performance than the uninformative stuff at the 90th and below.
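One way to see why the bulk is uninformative here (a deterministic sketch with an invented Pareto shape, not real data): in a fat-tailed distribution, a disproportionate share of total time served lives in the worst 1% of requests, which p90 never sees.

```python
# Deterministic inverse-CDF samples from a Pareto(alpha=1.2) toy
# "latency" distribution -- shape chosen purely for illustration.
n, alpha = 100_000, 1.2
samples = [(1 - (i + 0.5) / n) ** (-1 / alpha) for i in range(n)]  # ascending

top_1pct = samples[-n // 100:]
share = sum(top_1pct) / sum(samples)
assert share > 0.3   # the worst 1% of requests carries over 30% of total time
```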


When you care about it, use max. For enterprise, use 95 perc and charge 5x.


min(max_time, const_reasonable_max) is also a good graph if your software supports it. It stops the outliers from polluting the view. After all, your user will leave or refresh after a few seconds, so all that matters is that your response took longer than a minute; it's irrelevant that it took 20 minutes.
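As a sketch (the threshold is invented): clamp each window's max before plotting, so a single 20-minute outlier doesn't flatten the rest of the y-axis.

```python
CAP_MS = 60_000  # past a minute the user is long gone; exact cutoff is arbitrary

def plotted_max(window_max_ms):
    """Clamp a window's max latency for plotting."""
    return min(window_max_ms, CAP_MS)

assert plotted_max(1_200_000) == 60_000   # 20-minute outlier clamped to the cap
assert plotted_max(850) == 850            # ordinary values pass through untouched
```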


Due to the systems in question themselves timing out relatively soon, this is what I end up looking at in practice anyway.

It's a good point, even though it cuts both ways: given some assumptions about the tail behaviour of latencies, the 20 minute extreme event is a treasure trove for estimating the probabilities of smaller tail events.


For most of the services I've looked this closely at, maximum would be a proxy for load. Or in the case of a benchmark suite you'd expect max to increase with the number of iterations. What do you find the maximum useful for?


Aren't all performance numbers a proxy for load, almost by definition?

I tend to look at them per request, per user, per iteration, and so on, to control for that effect.



