
We can do better than percentile latencies - kiyanwang
https://medium.com/theburningmonk-com/we-can-do-better-than-percentile-latencies-2257d20c3b39
======
imtringued
Just plot the latency distribution in a histogram [1] and be done with this
boring topic. The problem you're complaining about is that aggregating
everything into a single number doesn't give you the information you want to
know. It's impossible by design. If you want a different type of information
you will also need a different type of diagram or statistic.

[1] [https://rethinkdb.com/assets/images/docs/performance-report/...](https://rethinkdb.com/assets/images/docs/performance-report/w-a-reads-latency.png)

~~~
jcrites
I think you may be overlooking the challenges that the author is trying to
tackle with their proposal.

First, the author is looking for a characteristic that can be monitored
automatically - for example, alarm if P99 latency is over 2 s. Visualizations,
while useful, don't help with that.

Second, the author is looking for a solution that can run in soft real time,
so that it can be used for system monitoring.

Third, they’re looking for a solution that does not have to aggregate the full
raw data set from across the fleet. It is implied that they are working with
reasonably large fleets such that full aggregation is impractical; or maybe
just too costly or too slow.

If you were able to aggregate the full raw data set in real time and compute
the Nth percentile, then that statistic would meet the author’s needs. Their
point is that actually computing the Nth percentile is expensive and not
commonly done in real-time monitoring (hence the statistic is usually an
average of host-level Nth percentile).

The challenge they’ve proposed is to define a statistic that is more useful
for alarming while still avoiding the need to aggregate the entire raw data
set.

I thought this was a thoughtful article with a clever suggestion. “Percent of
requests over threshold” meets these criteria. One criticism of this approach
however is that the threshold needs to be known ahead of time, prior to
aggregation.
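The arithmetic behind why "percent of requests over threshold" aggregates cleanly can be sketched as follows (a minimal illustration, not from the article; the function name and report format here are made up):

```python
# Hypothetical sketch: each host reports only two counters per interval,
# (total requests, requests over threshold). The fleet-wide percentage is
# then exact, unlike an "average of per-host P99s", because counts sum.

def percent_over_threshold(host_reports):
    """host_reports: iterable of (total, over) pairs, one per host."""
    total = sum(t for t, _ in host_reports)
    over = sum(o for _, o in host_reports)
    return 100.0 * over / total if total else 0.0

# Two hosts with very different traffic volumes:
reports = [(1000, 10), (10, 5)]  # 1% and 50% over threshold locally
print(percent_over_threshold(reports))  # exact fleet-wide: 15/1010, ~1.49%
```

Note how the low-traffic host's 50% does not distort the fleet-wide figure the way averaging two percentile values would.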

~~~
heinrichhartman
> Second, the author is looking for a solution that can run in soft real time,
> so that it can be used for system monitoring.

Histograms can absolutely be used for alerting. We have done this at Circonus
for ages:
[https://www.circonus.com/features/analytics/](https://www.circonus.com/features/analytics/)

I wrote up the case of latency monitoring 6 weeks ago here:
[https://www.circonus.com/2018/08/30/latency-slos-done-right/](https://www.circonus.com/2018/08/30/latency-slos-done-right/)

------
latch
Worth watching is "How NOT to Measure Latency":
[https://www.youtube.com/watch?v=lJ8ydIuPFeU](https://www.youtube.com/watch?v=lJ8ydIuPFeU)

The speaker, Gil Tene, is also the author of HdrHistogram, which addresses
this article's point:
[https://hdrhistogram.github.io/HdrHistogram/](https://hdrhistogram.github.io/HdrHistogram/)

~~~
romed
I don't get hdrhistogram at all. It's really just a histogram with a shitload
of logarithmic buckets.

~~~
sbanach
The compression format is really smart, there's a neat trick to make the
logarithm calculation fast, and there's a concurrent thread handoff mechanism
so you can swap out a histogram without disturbing the thread you're measuring
(though the last two are probably only in the Java version). Those three make
it super useful for very low-impact performance measurements.
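The "fast logarithm" trick amounts to finding the position of a value's highest set bit. A minimal sketch of that idea (illustrative only, not HdrHistogram's actual bucket layout, which also adds linear sub-buckets for precision):

```python
# Illustrative sketch: logarithmic bucketing where the bucket index is
# derived from the position of the value's highest set bit -- a cheap
# stand-in for actually computing a logarithm on every sample.

def bucket_index(value: int) -> int:
    """Return floor(log2(value)) for positive integers via bit_length."""
    if value < 1:
        raise ValueError("value must be >= 1")
    return value.bit_length() - 1

# Values within the same power-of-two range land in the same bucket:
print(bucket_index(1))     # 0
print(bucket_index(1023))  # 9
print(bucket_index(1024))  # 10
```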

------
WorkLifeBalance
Stop trying to re-invent statistics. Use a box and whisker plot of latency.
You quickly get to see the mean, the quartiles, and all the outliers and you
get it in a format which is familiar and easy to understand. You can even plot
box and whisker plots next to each other for quick meaningful comparisons
between different things.

~~~
scott_s
I've recently grown to like violin plots for latency
([https://en.wikipedia.org/wiki/Violin_plot](https://en.wikipedia.org/wiki/Violin_plot)).
I've also added 99th-percentile tick marks, which, with the already-present
median mark, give a relatively full picture of latency that is easily digestible.

------
mkesper
The article states that almost everyone is doing percentile latencies wrong by
averaging at the agent level (thus creating nonsensical data) and proposes
instead using the percentage of requests that are over the threshold, a metric
that can be averaged properly. He additionally suggests always using
actionable dashboards tailored to their users (dev/ops/manager).

------
vvern
I agree with the premise, but it seems that there are more solutions out
there. As other commenters noted, you can collect histograms or
hdrhistograms. Those have the problem of needing to be preconfigured, and of
not being mergeable unless they are configured the same way.
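The merge constraint can be illustrated with a small sketch (hypothetical functions, not from any particular library): fixed-bucket histograms merge by adding counts elementwise, which only makes sense if the bucket boundaries match exactly.

```python
# Sketch of the merge constraint: two histograms combine by summing
# per-bucket counts, but only when their bucket boundaries are identical.

def merge_histograms(counts_a, counts_b, bounds_a, bounds_b):
    if bounds_a != bounds_b:
        raise ValueError("histograms must share bucket boundaries to merge")
    return [x + y for x, y in zip(counts_a, counts_b)]

bounds = [10, 50, 100, 500]   # bucket upper bounds in ms
host1 = [120, 40, 8, 2]
host2 = [300, 90, 15, 5]
print(merge_histograms(host1, host2, bounds, bounds))  # [420, 130, 23, 7]
```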

Instead you can use the t-digest
([https://github.com/tdunning/t-digest](https://github.com/tdunning/t-digest)),
a very cool online quantile estimation data structure from Ted Dunning (which
he has recently improved with the Merging approach). There are a number of
implementations out there. It is not unreasonable to serialize them and merge
them. Unfortunately there's no easy way to set this up in Prometheus, but
making that easy could be a fun project.

------
henridf
I'm not sure which tools the author has tried, but the Prometheus monitoring
system supports both histograms and quantiles.

There's a good discussion of the respective merits of each at
[https://prometheus.io/docs/practices/histograms/#quantiles](https://prometheus.io/docs/practices/histograms/#quantiles)

~~~
deathanatos
Histograms require you to configure buckets into which your samples are
allocated; to allocate the buckets appropriately, you need to know what your
expected values are — that is, to measure latency, you need to know your
latency. While this can work (I think most of us have a clear idea, or can
obtain one, of what our typical latencies are, and can configure buckets around
that), it is inelegant. I feel like I would rather have X=percentile,
Y=latency, but such a bucketing gives you X=latency, Y=request count. Still
useful, but only as informative as you are good at choosing buckets. (There is
the histogram_quantile function, but I am unclear that its assumption of
linear distribution within buckets really makes much sense, since most things
would be long-tail distributions, and thus I would think that once you get
past the main "hump" of typical latencies, most samples would cluster towards
the lower end of any particular bucket.)
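For reference, the kind of interpolation being questioned here works roughly like this (a simplified sketch, not Prometheus's actual implementation): find the bucket containing the target rank, then interpolate linearly inside it, which is where the uniform-within-bucket assumption comes in.

```python
# Simplified sketch of a histogram_quantile-style estimator. The linear
# interpolation step assumes samples are spread uniformly within a bucket;
# for long-tail data clustered at the low end of a bucket, this biases
# the estimate upward -- the commenter's concern.

def histogram_quantile(q, upper_bounds, counts):
    """upper_bounds: sorted bucket upper bounds; counts: per-bucket counts."""
    total = sum(counts)
    rank = q * total
    cumulative = 0
    lower = 0.0
    for bound, count in zip(upper_bounds, counts):
        if count and cumulative + count >= rank:
            # linear interpolation within [lower, bound]
            frac = (rank - cumulative) / count
            return lower + (bound - lower) * frac
        cumulative += count
        lower = bound
    return upper_bounds[-1]

# P90 of 100 samples: rank 90 lands in the 100-500 ms bucket
print(histogram_quantile(0.9, [100, 500, 1000], [85, 10, 5]))  # 300.0
```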

I am not clear on how Summaries actually _work_; they appear to report count
and _sum_ of the thing they're monitoring; that is, if one were to use them
for latencies (and the docs do indeed suggest this), it would report a value
like "3" and "2000ms", indicating that 3 requests took a total of 2000ms
_together_; how is one supposed to derive a latency histogram/profile from
that?

Prometheus's fatal flaw here, IMO, is that it requires sampling of metrics.
That works for things like CPU, which are essentially a continuous function
that you're sampling over time. But its collection method/format doesn't seem
to work that well when you have an event-based metric, such as request
latency, which only happens at discrete points. (If no requests are being
served, what is the latency? It makes no sense to ask, _unlike_ CPU usage or
RAM usage.)

To me, ideally, you want to collect up all the samples in a central location
and then compute percentiles. Anything else seems to run afoul of the very
"doing percentiles on the agents, then 'averaging' percentiles at the
monitoring system" critique pointed out in the video posted in this sibling
comment:
[https://news.ycombinator.com/item?id=18194507](https://news.ycombinator.com/item?id=18194507)

~~~
tyldum
Your points are largely valid, but Prometheus is a monitoring solution, not a
scientific or financial tool. Certain tradeoffs are made since the monitoring
aspect comes first and scientific correctness comes second. Hence poll vs
push, for instance.

------
digikata
For diagnosing, I like building up a cumulative distribution function (CDF)
plot. If you're collecting data either for percentiles or thresholds you
likely have the data already. If you're setting thresholds, it's a useful plot
to judge how likely a given threshold might trigger an alarm.
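A minimal sketch of the idea (illustrative code, not tied to any particular tool): build the empirical CDF from raw latency samples and read off how often a candidate threshold would fire.

```python
# Build an empirical CDF from raw latency samples; the same data also
# answers "what fraction of requests would exceed threshold X?", which is
# exactly the alarm-rate question when picking a threshold.

def empirical_cdf(samples):
    """Return (sorted values, cumulative fractions) suitable for plotting."""
    xs = sorted(samples)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

def fraction_over(samples, threshold):
    """Estimated alarm rate for a candidate threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)

latencies = [12, 15, 18, 22, 30, 45, 80, 120, 400, 900]  # ms
xs, ys = empirical_cdf(latencies)
print(fraction_over(latencies, 100))  # 0.3 -> 30% of requests exceed 100 ms
```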

------
amarant
I really like the idea of displaying what percentage is over a certain
threshold. At my work, we kinda sorta simulate this by having separate alarms
for many percentiles (with increasing thresholds). The approach suggested by
the article seems to be quite obviously better, tbh.

------
jordanthoms
The approach here sounds a bit like the Apdex metric New Relic has been
offering for years? Is there something different I'm missing?

------
afpx
Reservoir sampling with outliers?

~~~
krona
Correct.

------
asplake
Mean excess delay?

