
Why Percentiles Don’t Work the Way We Think - dmit
https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think
======
lotyrin
Quantile Digest (Q-digest) or something similar is what I believe is desired
here.

From what I understand, it's a fixed-size data structure that represents
quantiles as a tree of histogram bands, pruning the nodes whose densities vary
the least from their parents' in order to trade error against size. Q-digests
also have the property that you can merge them together and re-compress, to
turn second-level data into minute-level data, or to compress accurate (large)
archival digests into smaller ones to, say, support stable streaming of a
varying number of metrics across a channel of varying bandwidth by sacrificing
quality.

They're pretty simple because they're designed for sensor networks, but I
think you could design similar structures with a dynamic rather than fixed
value range, and with a variable size (pruning nodes based on an error
threshold instead of, or in addition to, a desired size).

If anyone knows of a time-series system using something like this, I'd love to
learn about it.
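
In case it helps, here's a rough Python sketch of the merge-and-recompress idea. It just collapses the least-populated bucket into a neighbor rather than doing q-digest's tree-based pruning, so treat it as an illustration of the mergeability property, not a faithful q-digest:

```python
import random
from collections import Counter

class MergeableHistogram:
    """Toy fixed-size mergeable histogram. Values are bucketed, and when
    the size budget is exceeded, the least-populated bucket is collapsed
    into a neighboring bucket. A simplification of q-digest's tree
    pruning, kept only to show the merge/re-compress property."""

    def __init__(self, max_buckets=8):
        self.max_buckets = max_buckets
        self.buckets = Counter()  # bucket lower bound -> count

    def add(self, value, width=1.0):
        self.buckets[int(value // width) * width] += 1
        self._compress()

    def merge(self, other):
        # Merging is just summing counts and re-compressing -- the property
        # that lets you roll second-level data up into minute-level data.
        self.buckets.update(other.buckets)
        self._compress()

    def _compress(self):
        while len(self.buckets) > self.max_buckets:
            keys = sorted(self.buckets)
            smallest = min(keys, key=lambda k: self.buckets[k])
            i = keys.index(smallest)
            neighbor = keys[i - 1] if i > 0 else keys[1]
            self.buckets[neighbor] += self.buckets.pop(smallest)

    def quantile(self, q):
        # Walk buckets in order until the cumulative count crosses q.
        total = sum(self.buckets.values())
        running = 0
        for k in sorted(self.buckets):
            running += self.buckets[k]
            if running >= q * total:
                return k
        return max(self.buckets)

# Roll 60 per-second digests up into one per-minute digest:
random.seed(0)
minute = MergeableHistogram(max_buckets=8)
for _ in range(60):
    second = MergeableHistogram(max_buckets=8)
    for _ in range(100):
        second.add(random.expovariate(1.0), width=0.1)
    minute.merge(second)
print(minute.quantile(0.99))
```

The variable-size variant described above would just change `_compress` to prune only while some per-bucket error estimate stays under a threshold, instead of while the bucket count exceeds a fixed budget.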

------
Xcelerate
You know what I've always found really useful? Entire distributions. Forget
means, medians, percentiles, etc. — give me the complete distribution (along
with sample size) so I can understand all of the nuances of the data.

(Better yet, just give me the raw data so I can analyze it myself. I find it
hard to blindly trust someone else's conclusions considering all of the
p-hacking going on nowadays.)

~~~
secabeen
Yeah. We're starting to contemplate a new monitoring/performance/etc. system,
and I'm thinking I don't want to do any of the long-term storage aggregation
that RRD does. We'll store the full data for each item forever. Not sure if we
can afford the storage, but I want to try it for a while and see if it works.

~~~
jedberg
We did something similar at Netflix. We had all the aggregations but also
stored all the raw data. The raw data would be pushed out of the live system
and replaced with aggregates, then stored as flat text in S3. If for some
reason you needed the old data, you just put in a ticket with the monitoring
team to load the data back into the live system (I think this is even self
service now).

The system would then load the data from S3 back into the live system via
Hadoop. Turns out it was pretty cheap to store highly compressible files in
S3.

~~~
halayli
I've built a similar system. Raw data is compressed and stored in S3, but
aggregations are stored in Postgres. The data stored in Postgres is a
compressed binary representation of a histogram, and I added a few C functions
to Postgres so I can do things like select x, y,
histogram_percentile(histogram_merge(data), 99) ... group by x, y.
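
For illustration, here's a hypothetical plain-Python sketch of what those two aggregates do (the names `histogram_merge` and `histogram_percentile` are taken from the comment above; the actual C implementations aren't shown here):

```python
def histogram_merge(histograms):
    # Merge per-row histograms (dicts of bucket -> count) by summing counts.
    merged = {}
    for h in histograms:
        for bucket, count in h.items():
            merged[bucket] = merged.get(bucket, 0) + count
    return merged

def histogram_percentile(histogram, p):
    # Walk the buckets in order until the cumulative count crosses p percent.
    total = sum(histogram.values())
    running = 0
    for bucket in sorted(histogram):
        running += histogram[bucket]
        if running >= total * p / 100.0:
            return bucket
    raise ValueError("empty histogram")

# Roughly: SELECT x, y, histogram_percentile(histogram_merge(data), 99)
#          FROM metrics GROUP BY x, y
rows = [{10: 90, 50: 9, 500: 1},   # histogram for one time slice
        {10: 80, 50: 15, 500: 5}]  # histogram for another
print(histogram_percentile(histogram_merge(rows), 99))  # -> 500
```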

------
swk82
I've recently written a page about "averaging" percentiles correctly by
approximating the combined histogram of two distributions. It's demonstrated
in a live diagram with a logarithmic time axis here:

http://www.siegfried-kettlitz.de/blog/posts/2015/11/28/linlog_plot_quantiles/

If you have questions or comments, feel free to reply or email.

------
nostromo
A more accurate title would be something like: "Percentiles are great, but you
may not be calculating them correctly."

------
torinmr
Great article; I ran into this exact problem at work today. You can also hit a
very similar problem if the timeseries aggregation system you're using does
any pre-aggregation before you calculate the percentile. For example, if you
sample your servers every 90 seconds, then any latency number a server reports
is likely already averaged over the requests it received during that period,
meaning your 99th-percentile number is really the latency of the
99th-percentile server, not the 99th-percentile request. Using latency buckets
solves this problem as well, however.
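
A minimal sketch of the bucket approach, with made-up bounds and counts: sum the per-server bucket counts first, then read off the quantile, so the per-server pre-aggregation never gets averaged.

```python
# Hypothetical latency bucket upper bounds, in ms; each server exports
# one request count per bucket for the sampling window.
BOUNDS = [5, 10, 25, 50, 100, 250, 500, 1000]

def p99_from_buckets(per_server_counts):
    # Sum counts across servers *before* taking the percentile, so the
    # result reflects the 99th-percentile request, not the
    # 99th-percentile server.
    totals = [sum(counts[i] for counts in per_server_counts)
              for i in range(len(BOUNDS))]
    n = sum(totals)
    running = 0
    for bound, count in zip(BOUNDS, totals):
        running += count
        if running >= 0.99 * n:
            return bound  # upper bound of the bucket holding the p99 request
    return BOUNDS[-1]

fast_server = [500, 300, 150, 30, 10, 7, 2, 1]          # 1000 requests
slow_server = [100, 100, 100, 100, 100, 100, 200, 200]  # 1000 requests
print(p99_from_buckets([fast_server]))               # -> 100
print(p99_from_buckets([fast_server, slow_server]))  # -> 1000
```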

------
stdbrouw
The article is useful because it outlines many different ways in which to
monitor the performance of a system, many of which are better than just
looking at the mean and P99. However, the main thesis that "an average of a
percentile is meaningless" is just plain wrong. If the distribution is fixed,
then averaging different P99 measurements will give you the best possible
estimate of what P99 is in the population (as opposed to your sample). If the
distribution is moving (you're making performance improvements or your user
base is growing) then a moving average of a percentile will move with it.

~~~
swk82
You are right when looking at the true P99 of the whole distribution; then the
P99 will always have the same value. The problem arises when estimating a
distribution from incomplete data.

E.g. estimating the P99 from batches of 200 data points each will give you
large variation, and in that case averaging will be wrong. This gets worse if
you experience bunching: a server will have a high ping on several packets
while the load is high, and then a low ping for thousands of packets while the
load is low. There, averaging just doesn't work.

Instead, you can get a better result by adding the (estimated) histograms and
resampling the quantiles from the combined distribution function.
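
For a quick demonstration of the effect in Python (the data here is made up; the `** 4` is just there to skew the distribution):

```python
import random

random.seed(1)

# 100 batches of 200 points each, all drawn from the same skewed distribution.
batches = [[random.random() ** 4 for _ in range(200)] for _ in range(100)]

def p99(values):
    s = sorted(values)
    return s[int(round(0.99 * len(s))) - 1]  # simple empirical quantile

# Averaging the per-batch P99 estimates...
avg_of_p99 = sum(p99(b) for b in batches) / len(batches)

# ...versus taking P99 of the combined data, which is what merging the
# (exact) histograms and resampling the quantile amounts to.
combined_p99 = p99([v for b in batches for v in b])

print(avg_of_p99, combined_p99)  # the averaged estimate comes out lower
```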

Here I've written how that works with a live demonstration:
http://www.siegfried-kettlitz.de/blog/posts/2015/11/28/linlog_plot_quantiles/

~~~
stdbrouw
Even P99 estimates averaged from many tiny samples are at worst inaccurate;
they won't be biased.

The problem you talk about is one of granularity: you don't want to know how
bad things are in general, you want to know how bad things are when they are
actually bad. That's perfectly legitimate, but it's a different metric
altogether that just happens to also be based off of the 99th percentile. It
depends on your definition of "the distribution".

This probably sounds like nitpicking, but people should know what they're
talking about when they call out others for being wrong.

~~~
swk82
> Even P99 estimates averaged from many tiny samples are at most inaccurate,
> they won't be biased.

Are you sure about this?

~~~
stdbrouw
Sure, simulate it if you don't believe me.

    
    
        x <- rnorm(1000, mean=50, sd=10)
        # P99 of the full sample
        quantile(x, 0.99)
        # mean of 20 P99 estimates, each from a batch of 50 points
        # (note the parentheses: x[i-49:i] would parse as x[i - (49:i)])
        mean(sapply(1:20, function(i) {
          i <- i * 50
          quantile(x[(i - 49):i], 0.99)
        }))

~~~
swk82
Matlab / GNU/Octave code:

    
    
        rand('seed', 123);

        outer_rounds = 100;
        inner_rounds = 100;
        items = 500;

        q99_mean_vec = [];
        q99_complete_vec = [];

        for ior = 1:outer_rounds

          q99_vec = [];

          % one long run of samples, split into inner_rounds batches
          x = rand(1, inner_rounds*items);

          for iir = 1:inner_rounds
            q99_vec(iir) = quantile(x((iir-1)*items + (1:items))', 0.99);
          end

          q99_complete_vec(ior) = quantile(x', 0.99);
          q99_mean_vec(ior) = mean(q99_vec');

        end

        hist([q99_complete_vec; q99_mean_vec]', [0.988:0.0001:0.991]);
        xlim([0.988, 0.991]);
        legend('from complete data', 'from means');
    

This looks biased to me even though the distribution is uniform. I did not
expect that. Maybe the quantile function is broken.

EDIT: I've installed R and run your code. You use a very low number of samples
in the mean, so I changed it a bit. 1. I use the uniform distribution, so we
know the true value of the 99th percentile is 0.99. 2. I've increased the
number of samples:

    
    
        x <- runif(100000)
        quantile(x, 0.99)
        mean(sapply(1:2000, function(i) {
          i <- i * 50
          quantile(x[(i - 49):i], 0.99)
        }))
    
    

When run multiple times, I've found that the mean is always lower than the
quantile over the full sample.

~~~
stdbrouw
Actually, doh, you're right, my bad: it turns out sample quantiles are only
asymptotically unbiased, and a better estimator would take some sort of
weighted average of P99 and P99.5, with the weights depending on the sample
size. (You still don't need the full population, and you don't have to merge
histograms, though.)

~~~
swk82
I find this part of statistics very hard to argue about. Simulations only work
with well-known distributions, and even then not with all of them. Even when
you have a simulation that confirms hypothesis A1 using distribution D1, it
says little about some pathological distribution D2.

Getting the mathematics precise might take years, and often there is no easy
answer.

What I've been looking for is something that is fast and works "good enough".
Estimating the 95th percentile and averaging it did not work well in my use
case. Using the histogram method does work well although it certainly is not
perfect.

~~~
GFK_of_xmaspast
Nonparametric estimates of the CDF have a long history and plenty of theory:
http://www.win.tue.nl/~rmcastro/AppStat2013/files/lecture1.pdf
http://www.mathworks.com/help/stats/examples/nonparametric-estimates-of-cumulative-distribution-functions-and-their-inverses.html

~~~
swk82
Thanks, this is informative and pretty similar to what I've been implementing
in JavaScript. I guess I'll have to take a deeper look into literature when
doing an update to that topic.

~~~
GFK_of_xmaspast
You may also be interested in
http://www.amazon.com/All-Nonparametric-Statistics-Springer-Texts/dp/0387251456
and https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

------
mgalka
Gave the post an upvote because it is interesting from a theoretical
perspective, but I have a hard time imagining a real-life scenario where
averaging a 99th percentile will lead you to the wrong conclusion.

Perhaps I'm wrong, but whenever I'm looking at the tail of a distribution,
it's usually just to understand the order of magnitude, not to reach a precise
number.

------
guidedlight
The problem with percentiles is that you are discarding data points. e.g.
Measuring the 98th percentile involves discarding the top 2% of data.

The problem with that is, sometimes the top 2% you are discarding might
correlate to your top 2% of customers...and you are literally throwing their
data away by using percentiles. Not good.

My recommendation is to pick two aggregation types. Maybe Percentile and
Maximum, or Maximum and Mean, or Percentile and Mean. You can't really go
wrong with that approach.

------
adamcath
Does anyone know the math behind exactly how wrong the averaged percentiles
are? My dim understanding of stats makes me think the central limit theorem is
at play here; the averaged p99 values will tend towards a normal distribution,
which is obviously wrong. Would love to be schooled on it.

------
kabouseng
I think the lesson is: don't just blindly calculate some numbers/metrics; look
at your data (visually!) and see whether it makes sense (for instance, whether
to use the 99th / 95th / 90th percentile).

------
Lightbody
A day late, but I wanted to add: at New Relic (where I work) we ended up just
deciding to store all the data points. We literally store every transaction
and page view for our customers and then go back and read every data point
when they are queried.

------
frik
Why not let the user decide? Let them switch the graph between average,
median, 95%, 99% and histogram. And everyone will be happy!

------
knughit
Headline is utterly false.

~~~
johncolanduoni
I mean, "Why Percentile Calculations Don't Work the Way you Think" is better,
but I don't think the current one is "utterly false".

~~~
anthony_d
It depends who's reading it. The whole genre of titles "* you don't know",
"you're wrong about *" is condescending and frequently false.

I suspect that's what the parent post was referring to and I would agree:
percentile calculations do work the way I think, and the article didn't say
anything controversial.

~~~
xaprb
I'm the article's author. I don't think the headline HN chose is great but I
don't mind if people have different interpretations of what I write. Hopefully
all the commenters here are actually reading the original article and judging
it on its merits (and flaws), right? right?

