First, the author is looking for a characteristic that can be monitored automatically - for example, alarming if P99 latency exceeds 2 s. Visualizations, while useful, don't help with that.
Second, the author is looking for a solution that can run in soft real time, so that it can be used for system monitoring.
Third, they’re looking for a solution that does not have to aggregate the full raw data set from across the fleet. It is implied that they are working with fleets large enough that full aggregation is impractical, or at least too costly or too slow.
If you were able to aggregate the full raw data set in real time and compute the Nth percentile, then that statistic would meet the author’s needs. Their point is that actually computing the Nth percentile this way is expensive and not commonly done in real-time monitoring (hence the reported statistic is usually an average of host-level Nth percentiles).
The challenge they’ve proposed is to define a statistic that is more useful for alarming while still avoiding the need to aggregate the entire raw data set.
I thought this was a thoughtful article with a clever suggestion: “Percent of requests over threshold” meets these criteria. One criticism of this approach, however, is that the threshold needs to be known ahead of time, prior to aggregation.
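A minimal sketch of that statistic (the names and the 2 s threshold are my own, not the article's): each host keeps two counters, and because counters merge by simple addition, the fleet-wide percentage never requires shipping raw samples.

    # Hypothetical per-host counters for "percent of requests over threshold".
    # The threshold must be fixed up front - the criticism noted above.
    THRESHOLD_S = 2.0

    class ThresholdCounter:
        def __init__(self):
            self.total = 0
            self.over = 0

        def observe(self, latency_s):
            self.total += 1
            if latency_s > THRESHOLD_S:
                self.over += 1

    def fleet_percent_over(counters):
        # Aggregation is just two integers per host, summed centrally.
        total = sum(c.total for c in counters)
        over = sum(c.over for c in counters)
        return 100.0 * over / total if total else 0.0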
Histograms can absolutely be used for alerting.
We have done this at Circonus for ages: https://www.circonus.com/features/analytics/
I wrote up the case of latency monitoring 6 weeks ago here:
Store counts of events in each bucket.
A few hundred integers per server isn't hard to store and aggregate.
Prometheus does this out of the box.
Now you can recreate any of the charts you want!
As long as your histogram has the final long tail bucket (>99%) included, you'll be fine.
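As a concrete sketch, instrumentation with the official Python client might look like this (the bucket bounds are my own guess; pick ones that match your latency range):

    from prometheus_client import Histogram

    # Per-bucket cumulative counters; Prometheus adds an implicit +Inf
    # bucket, which is the final long-tail bucket mentioned above.
    REQUEST_LATENCY = Histogram(
        'request_latency_seconds',
        'Request latency in seconds',
        buckets=(.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10),
    )

    def handle_request():
        with REQUEST_LATENCY.time():  # records elapsed seconds into the buckets
            ...  # actual request handling

On the server side, histogram_quantile(0.99, sum(rate(request_latency_seconds_bucket[5m])) by (le)) then estimates a fleet-wide p99 from the merged buckets.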
After all, the 100th percentile latencies are what your users will experience as the worst case. That's what they will perceive and remember, and that matters for usability. While there is no sane way to eliminate the most obscene outliers entirely, you can target the worst-case behaviour and find ways to limit how badly it impacts your users.
Anecdote from work: our exchange team (who routinely consider a 4 ms service response too slow) monitor p99 for general performance and p100 for the nastiest outliers. They want to know exactly how bad the performance is in the observed worst-case scenarios.
The speaker, Gil Tene, is also the author of HdrHistogram, which addresses this article's point: https://hdrhistogram.github.io/HdrHistogram/
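Roughly, with the Python port (the hdrhistogram package - I'm going from memory on its API; the Java original is analogous): you record values at a fixed precision, read off any percentile, and histograms from different hosts can be added together losslessly.

    from hdrh.histogram import HdrHistogram

    # Track values from 1 us to 1 hour (in microseconds) at 3 significant digits.
    hist = HdrHistogram(1, 60 * 60 * 1000 * 1000, 3)

    hist.record_value(42_000)  # one 42 ms request, recorded in microseconds
    print(hist.get_value_at_percentile(99.9))

    # Per-host histograms merge without losing precision, so fleet-wide
    # percentiles stay honest.
    fleet = HdrHistogram(1, 60 * 60 * 1000 * 1000, 3)
    fleet.add(hist)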
In this particular case the challenge is aggregating statistics from a very large fleet & having automated alarms. Visualization tools don't help with any of that. More specifically, the reporting tools out there apparently have a very common & persistent flaw: reporting an average of percentiles across agents, which is a statistically meaningless metric. It makes no difference how you visualize it - the data is bunk.
This article flips it so that agents simply report how many requests they got & how many exceeded the required threshold. This lets them report the percentage of users having a worse experience than the desired SLA, and you can build reliable tooling on top of this metric. It's not a universal solution, but it's a neat trick that keeps the nice property of not needing to pull full logs from all agents while still giving a meaningful representation of the latency your users see.
Instead you can use the t-digest (https://github.com/tdunning/t-digest), a very cool online quantile estimation data structure from Ted Dunning (which he has recently improved with the MergingDigest approach). There are a number of implementations out there, and it is not unreasonable to serialize and merge them. Unfortunately there's no easy way to set this up in Prometheus, but making that easy could be a fun project.
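For illustration, with the tdigest Python package (one of several implementations; I'm assuming its API here): each agent maintains a small fixed-size digest, and digests combine before you query quantiles.

    from tdigest import TDigest

    # Each host feeds its own latencies into a compact digest.
    host_a, host_b = TDigest(), TDigest()
    host_a.batch_update([0.12, 0.30, 1.8, 0.05])
    host_b.batch_update([0.20, 0.09, 2.4])

    # Digests merge, so a fleet-wide p99 can be estimated without
    # shipping raw samples anywhere.
    fleet = host_a + host_b
    print(fleet.percentile(99))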
There's a good discussion of the respective merits of each at https://prometheus.io/docs/practices/histograms/#quantiles
I am not clear on how Summaries actually work. They appear to report the count and sum of the thing they're monitoring; that is, if one were to use them for latencies (and the docs do indeed suggest this), they would report values like "3" and "2000ms", indicating that 3 requests took 2000 ms in total. How is one supposed to derive a latency histogram/profile from that?
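To make that concrete: count and sum alone only recover a mean, never a distribution. Two very different latency profiles can be indistinguishable (toy numbers of my own):

    # Same count (3) and same sum (2.0 s), wildly different profiles:
    a = [0.5, 0.5, 1.0]    # seconds
    b = [0.01, 0.01, 1.98]

    for samples in (a, b):
        count, total = len(samples), sum(samples)
        print(count, total, total / count)  # 3, 2.0, ~0.667 both times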
Prometheus's fatal flaw here, IMO, is that it requires sampling of metrics. That works for things like CPU, which is essentially a continuous function that you're sampling over time. But its collection method/format doesn't seem to work that well when you have an event-based metric, such as request latency, which only happens at discrete points. (If no requests are being served, what is the latency? It makes no sense to ask, unlike CPU usage or RAM usage.)
To me, ideally, you want to collect all the samples in a central location and then compute percentiles. Anything else seems to run afoul of the very "doing percentiles on the agents, then 'averaging' percentiles at the monitoring system" critique pointed out in the video posted in this sibling comment: https://news.ycombinator.com/item?id=18194507
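A toy demonstration of why averaging per-host percentiles goes wrong (the numbers are invented for illustration):

    import random

    random.seed(0)
    # Ten hosts: nine fast, one pathologically slow.
    hosts = [[random.uniform(0.01, 0.1) for _ in range(1000)] for _ in range(9)]
    hosts.append([random.uniform(1.0, 5.0) for _ in range(1000)])

    def p99(xs):
        xs = sorted(xs)
        return xs[int(0.99 * len(xs)) - 1]  # crude nearest-rank estimate

    avg_of_p99s = sum(p99(h) for h in hosts) / len(hosts)
    true_p99 = p99([x for h in hosts for x in h])
    print(avg_of_p99s, true_p99)  # ~0.6 s vs ~4.6 s - the average hides the slow host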