Hacker News new | past | comments | ask | show | jobs | submit login
How percentile approximation works and why it's more useful than averages (timescale.com)
685 points by od0 8 days ago | hide | past | favorite | 163 comments





Awhile ago I wrote a Python library called LiveStats[1] that computed any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm I found in an old paper[2] called P^2. It uses a polynomial to find good approximations.

The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed size sorted buffer and randomly evict items from it and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.

[1]: https://github.com/cxxr/LiveStats

[2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf


There are some newer data structures that take this to the next level such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%)

[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest


NB: Post author here.

Yeah, that was one of the reasons we chose it as one of the ones to implement, seemed like that was a really interesting tradeoff, we also used uddsketch[1] which provides relative error guarantees, which is pretty nifty. We thought they provided different enough tradeoffs that we wanted to implement both.

[1]: https://arxiv.org/abs/1908.10693


Is it using https://github.com/tvondra/tdigest under the hood, or a separate implementation?

If folks are interested:

https://github.com/timescale/timescaledb-toolkit/blob/main/e...

(The TimescaleDB Toolkit is also implemented in Rust)


Facebook seems to have an even better performance implementation using sqrt. Might make sense to port that over to Rust. https://github.com/facebook/folly/blob/master/folly/stats/TD...

a separate implementation

Hi, in an unrelated nitpick: The relative error should be calculated by dividing the error by the true value, not by it's approximation. Still, very nice writeup!

Thanks! messed up the formula but had it right in the text :( Fixed now.

Not an expert on this topic but I noticed that the KLL algorithm (published in 2016) was not mentioned in this thread, which provides theoretically optimal performance for streaming quantiles with guaranteed worst case performance: http://courses.csail.mit.edu/6.854/20/sample-projects/B/stre... (And is pretty fast in practice).

NB: Post author here.

Interesting will have to take a look! Thanks for sharing!


Also relevant: Single-Pass Online Statistics Algorithms

[1] http://www.numericalexpert.com/articles/single_pass_stat/


That's pretty neat! Can these be used to efficiently compute rolling percentiles (over windows of the data), or just incremental?

The UDDSketch (default) implementation will allow rolling percentiles, though we still need a bit of work on our end to support it. There isn't a way to do this with TDigest however.

Sure there is. You simply maintain N phases of digests, and every T time you evict a phase and recompute the summary (because T-digests are easily merged).

I think this would be a tumbling window rather than a true "rolling" tdigest. I suppose you could decrement the buckets, but it gets a little weird as splits can't really be unsplit. The tumbling window one would probably work, though Tdigest is a little weird on merge etc as it's not completely deterministic with respect to ordering and merging (Uddsketch is) so it's likely you get something that is more than good enough, but wouldn't be the same as if you just calculated it directly so it gets a little confusing and difficult.

(NB: Post author here).


This is what I do, it's not a true rolling digest but it works well enough for my purposes.

yep I had to implement t-digest in a monitoring library. another alternative (although older) that the prometheus libraries use is CKMS quantiles [0].

[0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-icde.pd...


i think the new ones started wtih Greenwald-Khanna. but i definately agree - p^2 can be a little silly and misleading. in particular it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.

That sounds familiar, I remember reading about Greenwald-Khanna before I found T-Digest after I ran into the "how to find a percentile of a massive data set" problem myself.

> Find the median ... randomly evict items

So, not find, but approximate. That's a different thing.


> without sorting it... have a fixed size sorted buffer

(that you sort yourself)


That doesn't really make sense to me at all. Don't sort it, just have it?

Is the storage restriction the point?


Yes. The goal is to find (or approximate) the median without storing all the elements. Instead, it approximates the median by finding the median of randomly selected samples from the elements.

NB: Post author here.

Thanks for sharing! Hadn't heard of that algorithm, have seen a number of other ones out there, we chose a couple that we knew about / were requested by users. (And we are open to more user requests if folks want to use other ones! https://github.com/timescale/timescaledb-toolkit and open an issue!)


Did the candidate get an offer? Genuinely curious.

I had a basic screening call fail once because the expected answer was (in my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.


Whether they got it or not probably isn't useful information. Having a good/brilliant answer probably isn't the only point of the question, this probably wasn't the only question of the interview, and this probably wasn't the only interview

> if we estimate the median and move it a small amount for each new data point, it would be pretty close.

Yeah, this is gradient descent on the absolute loss.


> The question was basically, "Find the median of a huge data set without sorting it," [...]

You can use eg QuickSelect (https://en.wikipedia.org/wiki/Quickselect) or Median of Medians.

They don't sort the data, but they do need linear amount of storage.


> The question was basically, "Find the median of a huge data set without sorting it,"

Isn't this done using a min heap and a max heap in conjuction?


The real constraint here is probably "find the median of a huge data set without holding the entire data set in memory".

'Estimate the median of an arbitrary sized data set using a constant amount of memory'.

No, for 2 reasons,

1. huge data set: heaps requires storing the whole set, but "huge" means "more than you can store"

2. without sorting it: heaps are basically semi-sorted, so you are breaking the rules


You can actually construct the heap from unsorted data in O(n) time, so constructing the heap is definitely not sorting. However, yeah, to actually use the heap to find median in O(n) time, you need to do something similar to magic-five (median of medians) algorithm.

Something like QuickSelect is probably better in practice than median-of-medians.

> but "huge" means "more than you can store"

Really? Where's it coming from?


I think they mean having more than you can store simultaneously on a single device.

With a few exceptions this is pretty common scenario.


More than you can store.

And possibly it's live data.


The two heap method helps you maintain the current median given an incoming stream of “online” data, and technically is not “sorting”.

The original problem is probably more accurately described as “what if you have too much data to sort” though.


It's funny that this is often left out from a data & algorithm class.

Is the minheap-maxheap approach faster than sorting the data? The obvious approach (process each element, one by one, into the appropriate heap, and rebalance the heaps so they are of equal size) takes n log n time and linear space. You can use the same resources to just produce a sorted copy of the input, which is a much better thing to have than two heaps that center on the median.

The minheap-maxheap approach is better for streaming data, to get the median as data comes in.

I agree that if you have the whole thing, doing heapsort and pulling a[N/2] or a[1 + N/2] is simpler.


> The minheap-maxheap approach is better for streaming data, to get the median as data comes in.

I see that it's better if you need to know "what is the median of the amount of the list that I've consumed so far?"

But if what you want is the median of the whole list, which might be in a random order, the medians of random prefixes of the list don't seem especially relevant. And if you do have an indefinite amount of data coming in, so that you need a "well, this is what we've seen so far" data point, the minheap-maxheap approach doesn't seem very well suited since it requires you to remember the entirety of the data stream so far.

My first instinct is to divide the possible data values into buckets, and just count the number of datapoints that fall into each bucket. This gives you a histogram with arbitrary resolution. You won't know the median value, but you will know which bucket contains the median value, and your storage requirements depend only on the number of buckets you want to use.


Because unlike many dynamic programming algorithms, it is something anyone running a large system will need.

How does this get you the median?

One way to think about why we tend to use averages instead of medians is that it is related to a really deep theorem in probability: The Central Limit Theorem.

But I think we can twist our heads and see in a way that this is backwards. Mathematically, the mean is much easier to work with because it is linear and we can do algebra with it. That's how we got the Central Limit Theorem. Percentiles and the median, except for symmetric distributions, are not as easy to work with. They involve solving for the inverse of the cumulative function.

But in many ways, the median and percentiles are a more relevant and intuitive number to think about. Especially in contexts where linearity is inappropriate!


i think of it as: if the data is gaussian, use a mean, otherwise go non-parametric (medians/percentiles).

or put another way, if you can't model it, you're going to have to sort, or estimate a sort, because that's all that's really left to do.

this shows up in things from estimating centers with means/percentiles to doing statistical tests with things like the wilcoxon tests.


Assume up front none of your measured latencies from a software networked system will be Gaussian, or <exaggereation> you will die a painful death </exaggeration>. Even ping times over the internet have no mean. The only good thing about means is you can combine them easily, but since they are probably a mathematical fiction, combining them is even worse. Use T-Digest or one of the other algorithms being highlighted here.

This is why I try to plot a proper graph of times from any "optimization" I see in a PR. Too many times I see people making this assumption for example, and even if they're right they usually forget to take the width of the gaussian into account (i.e. wow your speedup is 5% of a standard deviation!)

yep, have made that mistake before. even turned in a write-up for a measurement project in a graduate level systems course that reported network performance dependent measurements with means over trials with error bars from standard deviations.

sadly, the instructor just gave it an A and moved on. (that said, the amount of work that went into a single semester project was a bit herculean, even if i do say so myself)


> the median and percentiles are a more relevant and intuitive number to think about

A good example of this is when people say "more than half of drivers say they are above average drivers! Idiots!" Of course that's perfectly possible if most drivers are basically fine and some are really really bad. For example, 99.99% of people have greater than the average number of legs.

The correct impossible statement would be if more than half of drivers are better than the median driver.


It's more related to the law of large numbers than the CLT

For some things, you can't even sensibly measure the mean. For example, if you're measuring the mean response time for a service, a single failure/timeout makes the mean response time infinite (because 100 years from now the response still hasn't been received).

"Why Averages Suck and Percentiles are Great": https://www.dynatrace.com/news/blog/why-averages-suck-and-pe...


It's indeed always worth pointing that a mean may or may not exist. Same with variances/standard-deviation.

The central limit theorem seems to have given people the slightly wrong idea that things will always average out eventually.


Coughs Cauchy

totally. that blog was also one of the sources I mentioned in the post! Good stuff

NB: Post author here.


Gamers have an intuitive sense of this. Your average framerate can be arbitrarily high, but if you have a big stutter every second between the smooth moments, then a lower but more consistent framerate may be preferable, typically expressed as the 1% and 0.1% slowest frames, which at a relatively typical 100fps, represents the slowest frame every second and every 10 seconds.

I have no hope of finding a cite for this, but a long time ago I read some command line UI research that found if you had a system where commands ranged from instant to taking a small but noticeable time and you introduced delays in the faster commands to make it so all commands took the same small but noticeable time people would think that the system was now faster overall.

I guess that’s because our minds (and animal minds as well) are always aware of the pace of repetitive events. If something is off, the anxiety alarm rings. One old book on the brain machinery described an example of a cat that was relaxing near the metronome and became alert when it was suddenly stopped. Unpredictable delays are disturbing, because a mispredicted event means you may be in a dangerous situation and have to recalibrate now.

I think the explanation may be even more low level than that. Iirc, even with a single neuron (or maybe if it was very small clusters, sorry recollection is a bit hazy) you can see that it learns to tune out a repetitive signal.

Presumably, also the reason we have those fake queue lines at airports (not sure the correct word is for it).

My understanding is that "fake queues" at airports are for security. To spread people out so that they are not all waiting in the same place.

Long queues are a prime target for terrorists, and if something bad happens, small groups of people are more manageable than large groups.


Similarly, I would rather have no wifi than bad/spotty wifi for this reason

NB: Post author here.

Love this example. Might have to use that in a future post. Feel like a lot of us are running into a similar thing with remote work and video calls these days...


Gamers do not have a more intuitive sense for this than movie watchers, video/voice talkers, not to mention users who type text into bloated web browsers or lagged remote login sessions.

I would rather have a rock-steady consistent 500 ms reponse time when typing text, than to have a 100 ms average response time which randomly spikes to outliers that go past one second.

A rock-steady, though poor, event rate in a paint application is better for drawing a freehand curve (especially if the program interpolates well) than a really fast rate that suddenly has a glitch in it, spoiling your work.


> Gamers do not have a more intuitive sense for this than movie watchers, video/voice talkers, not to mention users who type text into bloated web browsers or lagged remote login sessions

I would be shocked if gamers didn't on average have a more intuitive sense of this than any of the groups you mentioned.


Yah, I don't think anyone's arguing that ONLY gamers will observe the phenomena, but I would be shocked if they weren't more sensitive to it than most groups.

"more sensitive" != "more intuitive"

Maybe not all, but ones who create and consume analysis like this

https://www.techpowerup.com/review/msi-geforce-rtx-3090-gami...


"Slow is smooth and smooth is fast"

There was a “Latency” multiplayer setting in Starcraft 1. I think it was there precisely to give players control over smoothness of network latency instead of graphics and processing latency. I think this just adds yet another example of your observation.

For games, one compounding factor is that you need to estimate how long the next frame will take in order to know how much time to simulate. If the variance is greater, the prediction will be worse leading the perceived game "speed" to vary from frame to frame.

Yep, I often find it useful to limit framerate to a number I know my GPU can handle so that I don't get as much stuttering. It's better to run at 60 FPS all the time than 80 FPS most of the time but with stuttering.

Looking particularly at latency measurements, I found the "How NOT to Measure Latency" [1] talk very illuminating. It goes quite deep into discussing how percentiles can be used and abused for measurement.

[1]: https://www.infoq.com/presentations/latency-response-time/


I watch this video once a year and send it to my co-workers whenever averages or medians shows up in a graph for public consumption.

Are the points written in a readable format anywhere?


What are some of the ways percentiles can be abused?

# Power Statement

!*Plowser*!* is in the 99th percentile of browsers by global user count!

# Fact Sheet

- !*Plowser*! is used by 2,050 people

- sample size of browsers is 10,000 (this includes toy apps on GitHub and almost all browsers in the sample have no record of active users)

- those using !*Plowser*! have no choice as the developing company $*DendralRot Inc*$ forces all its employees, contractors and users of their enterprise '?_shuiteware_?' (a name derived by mashing |-->shite|software|suite<--| into one word) to use their browser

- if we place the number of global browser users at a conservative 1,000,000,000, !*Plowser*! actually has 0.00000205% of users


Good opportunity to plug https://en.wikipedia.org/wiki/Anscombe%27s_quartet : if you don't know much about the underlying distribution, simple statistics don't describe it well.

From Wikipedia description: Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it, and the effect of outliers and other influential observations on statistical properties.


There's a fun paper by Autodesk where they make datasets that look whatever way you want them to.

https://www.autodesk.com/research/publications/same-stats-di...


Yes, a both funny and insightful lesson on how weak basic indicators (mean, standard deviation, correlation) can be, the interest and limits of box-plot with quartiles and whiskers, the benefit of the violin-plot.

Definitely worth a quick look.


NB: Post author here.

This is great! So fun...will have to use in the future...


While I have your attention...

> For the median and average to be equal, the points less than the median and greater than the median must have the same distribution (i.e., there must be the same number of points that are somewhat larger and somewhat smaller and much larger and much smaller).

[0, 2, 5, 9, 9] has both median and mean = 5, but the two sides don't really have the same distribution.


Totally true...thoughts on how I could rephrase? I guess it's more the "weight" of points greater than and less than the median should be the same, so symmetric distributions definitely have it, asymmetric may or may not. Definitely open to revising...

> symmetric distributions definitely have it, asymmetric may or may not

Doesn't have to be any more complicated than that. It's more a curio than an important point anyway :)


This is excellent information, thank you for posting this! I was not familiar with this example previously, but it is a perfect example of summary statistics not capturing certain distributions well. It's very approachable, even if you had to limit the discussion to mean and variance alone. Bookmarked, and much appreciated.

NB: Post author here.

That's really nifty, wish I'd heard about it earlier. Might go back and add a link to it in the post at some point too! Very useful. Definitely know I wasn't breaking new ground or anything, but fun to see it represented so succinctly.


>importance of graphing data before analyzing it

Very discouraging if one is trying to analyze data algorithmically. Often when faced with a problem in statistics, the answer is: "Look at the graph and use intuition!".


Interesting, thanks for the concept. What's one to do for high dimensional data then?

I just recently tried giving a presentation to my department (they're developers, I'm architect) about this stuff and they all just sort of blinked at me. It brought in Little's Law and Kingman's Formula, in an attempt to underscore why we need to limit variation in the response times of our requests.

There are a bunch of queuing theory formulas that are really cool but don't exactly apply if your responses vary a lot like the article describes. I think the assumption is that response time distributions are exponential distributions, which I don't think is a good assumption (is an Erlang distribution an exponential distribution?) Nevertheless, hooking the equations up to some models is a good way to get directional intuition. I didn't realize how steep the performance drop-off is for server utilization until I started moving sliders around.

Our ops team doesn't really follow this either. We're not a huge department though - is this the kind of stuff that an SRE team usually pays attention to?


NB: Post author here.

I found it surprisingly difficult to explain well. Took a lot of passes and a lot more words than I was expecting. It seems like such a simple concept. I thought the post was gonna be the shortest of my recent ones, and then after really explaining it and getting lots of edits and rewriting, it was 7000 words and ...whoops! But I guess it's what I needed to explain it well (hope you thought so anyway).

It's somewhat exponential, but yeah, not necessarily, it's definitely long-tailed in some way and it sort of doesn't matter what the theoretical description is (at least in my mind) the point is these types of distributions really don't get described well by a lot of typical statistics.

Can't talk too much about SREs at the ad analytics company I mentioned, we were the backend team that wrote a lot of the backend stores / managed databases / ran the APIs and monitored this stuff a bit (and probably not all that well). It was a bit more ad hoc I guess, probably now that company is large enough they have a dedicated team for that...


The article hit the pocket pretty exactly for how I felt I needed to explain it. (Actually I had the same experience where I thought I would be able to go over it quickly and then I felt like I was getting mired.) The graphs look great too - I've been able to explore it pretty well with JupyterLab but I can't export that into something super-interactive that is read-only. I've thought about creating an idyll page out of it to help others explore but that's a bit overkill for the work style our team has.

I think the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.


> I think the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.

The best method I've found for digging into latency is profiling RPC traces to see where time is spent. This at least separates the code and parameters that have simple latency distributions from those that don't. Some distributions will be very data-dependent (SQL queries).


This is exactly the algorithm we developed at LogNormal (now part of Akamai) 10 years ago for doing fast, low-memory percentiles on large datasets.

It's implemented in this Node library: https://github.com/bluesmoon/node-faststats

Side note: I wish everyone would stop using the term Average to refer to the Arithmetic mean. "Average" just means some statistic used to summarize a dataset. It could be the Arithmetic Mean, Median, Mode(s), Geometric Mean, Harmonic Mean, or any of a bunch of other statistics. We're stuck with AVG because that's the function used by early databases and Lotus 123.


No we’re stuck with it because average was used colloquially for arithmetic mean for decades.

I wish people would stop bad-mouthing the arithmetic mean. If you have to convey information about a distribution and you’ve got only one number to do it, the arithmetic mean is for you.


I think it still depends on the nature of the data and your questions about it, and median is often the better choice if you have to pick only one.

“It depends” is always right but which function is better for arbitrary, unknown data?

At least the arithmetic mean is fine for Gaussian distributions, and coneys a sense about the data even on non-Gaussian ones. but the median doesn’t even work at all on some common distributions like scores of very difficult exams (where median=0)

For the mean, at least every data point contributes to the final value.

Just my 2c


Yes, Lotus 123 came out 38 years ago :)

```To calculate the 10th percentile, let’s say we have 10,000 values. We take all of the values, order them from largest to smallest, and identify the 1001st value (where 1000 or 10% of the values are below it), which will be our 10th percentile.```

Isn't this contradictory? If we order the values from largest to smallest and take the 1001st value, then 10 % of the values are above/larger and not below/smaller. I believe it should say order from smallest to largest.


NB: Post author here.

Oops, yep, that should probably be order from smallest to largest. Thanks for the correction!


Fixed!

Yes, this looks like a typo in the article. It should be smallest to largest.

Thanks, we corrected this quite quickly. Appreciated!

One of the difficulties when dealing with percentiles is estimating error of the estimated percentile values, which is not necessarily trivial compared to estimating error of mean approximation (e.g. see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6294150/)

This is a really important topic if you're doing web services. Especially if you're doing parallel processing services. When I was at Blekko the "interesting" queries were the ones above the 95th percentile because they always indicated "something" that hadn't worked according to plan. Sometimes it was a disk going bad on one of the bucket servers, sometimes it was a network port dropping packets, and sometimes it was a corrupted index file. But it was always something that needed to be looked at and then (usually) fixed.

It also always separated the 'good' Ad networks from the 'bad' ones as the bad ones would take to long to respond.


I often see HN articles crop up soon after a related post, in this case this Ask HN poster [0] being driven crazy by people averaging percentiles and them not seeing that it's a big deal. It's pretty funny to see such tuples of posts appearing.

https://news.ycombinator.com/item?id=28518795


Be careful translating percentiles of requests to percentiles of users; if less than 10% of your requests take over 1 second, but a typical user makes 10 requests, it's possible that the majority your users are seeing a request take over 1 second.

NB: Post author here.

Yep! Briefly noted that in the post, but deserves re-stating! it's definitely a more complex analysis to figure out the percentage of users affected (though often more important) could be majority could also be one user who has some data scientist programmatically making hundreds of long API calls for some task...(can you tell that I ran into that? Even worse it was one of our own data scientists ;) ).


I really enjoyed this post! The author also wrote an interactive demonstration of the concepts (using Desmos). It's super helpful.

https://www.desmos.com/calculator/ty3jt8ftgs


NB: Post author here.

Glad you liked it! I was so excited to actually be able to get to use Desmos for something work wise, I've been wanting to do it for years!


There’s something called a five number summary in statistics: mean, median, standard deviation, 25th percentile and 75th percentile.

The bonus is that the 75th - 50th gives you the interquartile range.

Mean is not a robust measure and as such you need to look at variety to truly understand the spread of your data.


If you're going to use multiple quantities to summarize a distribution, wouldn't using percentiles for all of them give you the most information? The mean and standard deviation could then be estimated from that data.

IQR is 75th - 25th, aka, the middle-50%

Ok this, box plots are a good way to visualize and show distribution esp to a not so stat heavy audience.

You’re right I mistyped.

Whenever this topic comes up, I always encourage folks to watch this 2011 classic 15m talk at the Velocity conference by John Rauser:

https://www.youtube.com/watch?v=coNDCIMH8bk


Percentiles for sure are better than average if you want to explore distribution: there are several percentiles comparing to single average. However median is harder to use if you want to do calculations based on this metric. For example distribution of sample median could be a problem, if you want to understand confidence interval for it for example.

NB: Post author here.

Totally can be true. In our case, we use these approximation methods that allow you to get multiple percentiles "for free" definitely need to choose the right ones for the job. (We talk a bit more about the whole approach where we do the aggregation then the accessor thing in the previous post on two-step aggregation [1]). But there are definitely times when averages/stddev and potentially the 3rd and 4th moments are more useful etc.

[1]: https://blog.timescale.com/blog/how-postgresql-aggregation-w...


Thanks for the refresher! I am using the timescaledb-toolkit with much success. LTTB, for example. Excellent.

Ive been trying to get the marketing team to always include a std deviation with averages. Average alone is simply not useful, standard deviation is a simple way to essentially include percentiles.

They regularly compare experiments to the mean but dont use a T test to ensure the results are actually different from the mean.


I heavily caution against the feeling that "standard deviation is a simple way to essentially include percentiles." The usefulness of the standard deviation depends on the distributions that you are working with. Heavy tailed distributions appear a fair amount in practice, and the combo of summary statistics mentioned would not do well on those. Also, Madars' comment in this thread is a beautiful example of this: 4 completely different distributions, with identical mean and standard deviation (among other things). Histograms and percentiles, and if necessary their approximations, are more desirable for the above reasons.

I assume most of the distributions a marketing department would be dealing with are generally normal in which case stddev is a great way to analyze the data. This can be easily verified by just plotting said data and making sure the tails don't look weird.

I can't help but idly wonder what humans are doing when they are eyeballing the tails, to see if things look good. Like lets say we wanted to do the eyeball test but automatically. Would the best way be to use an image classifier on the plot? Is there something magic about the plot representation that would make it good even for computers to use?

One thing I like about this post is that it explains things in an accessible way before getting into a deep dive. Might be worth sharing with the marketing team as they'll "get" long tail in the context of web search, so the concept is fairly transferable to stuff that they would know about.

NB: Post author here.

Std deviation definitely helps a lot, still often not as good as percentiles, was actually thinking about adding some of that in the post but it was already getting so long. It's funny how things you think are simple sometimes take the most effort to explain, definitely found that on this one.


Yeah -- std deviation has a similar problem to the mean in that it doesn't give you a full picture unless the distribution is close to normal / gaussian.

Pretty much why summary statistics often give the IQR, which gives some idea to the skew and shape of the distribution as well.

Unfortunately, BD and marketing just want a single number to show that the value is bigger and hate anything more complicated than a barchart.


NB: Post author here.

We've been meaning to add IQR as an accessor function for these, may have to go back and do it...the frequency trails [1] stuff from Brendan Gregg also goes into some of this and it's really cool as a visualization.

[1]: https://www.brendangregg.com/FrequencyTrails/mean.html


Barchart is basically your percentiles (just more of them) so why not show it? Bars and whiskers could be more complicated for them but still the same sort of data

Barcharts across categorical data :P

That is, the first bar is "Our Number" and the second bar is "Competitor's number."


Someone clearly gets it. Variability viz and spread detracts from that clear message.

Recovering product manager in me sees 90th percentile queries with outlier high latency and starts asking instead of how to reduce it, how we can to spin it out into a dedicated premium query feature, as if they're willing to wait, they're probably also willing to pay.

Highly recommend modelling your solution using queueing theory with this: https://queueing-tool.readthedocs.io/en/latest/

As an exercise in discovering the economics of how your platform works, even just thinking about it in these terms can save a great deal of time.


If you want to calculate exact percentiles there's a simple in-place algorithm that runs in expected O(n) time. You basically do quicksort but ignore the "wrong" partition in each recursive step. For instance if your pivot is at the 25% percentile and you're looking for the 10% percentile you only recurse on the "left" partition at that point. It's pretty easy to implement. (And rather straightforward to change to a loop, if necessary.)

Can someone bake this down to a sentence? I think I understand what they are saying since I have been faced with using these metrics and in having been tempted to use average response times I recognized that average is not a good baseline since it moves in relation to the outliers (which are usually the problem requests and there can be many or few).

How do you calculate these percentiles and use them as a trigger for alerts?


looking at a histogram is probably the best thing you can do to understand how data is distributed

I had to explain the data usage of an interface that looked extremely busy from the standard graphs

I did a percentile graph of the usage - the data was only typically using 5% of the maximum throughput, no-one could really understand the graph though

so I did a zoomed-in version of the normal data usage graph and it looked like a blip lasting 1/20 of the time - everyone got that - eg it was peaking every few seconds and then doing nothing for ages


Just as a note, same topic: https://news.ycombinator.com/item?id=19552112

""" Averages Can Be Misleading: Try a Percentile (2014) (elastic.co) 199 points by donbox on April 2, 2019 | | 55 comments """


Side note, but I love the animations, code snippet design and typography in this blog post.

Will think about how I can improve my own blog with these ideas.


Thank you! Huge shout out to Shane, Jacob and others on our team who helped with the graphics / design elements!

If you are doing this sort of work I highly recommend the Datasketches library. It's used in Druid, which is a database specifically designed for these sorts of aggregations over ridiculously large datasets.

FYI, typo:

> “In the below graph, half of the data is to the left (shaded in blue), and a half is to the right (shaded in purple), with the 50th percentile directly in the center.”

But the half on the right is actually shaded yellow.


Thanks for the feedback we've made the change. Appreciated!

Thanks! Will get that fixed!

Sorry for the minor nitpick, but I find it a bit unusual (disappointing?) that there's an image of a candlestick chart at the very top, but the article only uses API response times as examples...

If you think the data is normal-ish but want to account for skew and kurtosis, you can try fitting a distribution such as the skewed Student's t -- there are R packages for this.

Toss the percentiles to the curb and just use a histogram.

Shouldn't averages&variance be enough for start?

Surprisingly, many software engineers I know never used percentiles and keep using mean average. True story.

Bah, I'll be happy if I could even get correct averages. I see pipelines getting value X1 from a server that served 100 requests, another value X2 from a server that served one request, and then it returns (X1+X2)/2.

The mean is something you can easily compute progressively and with trivial resources. Median and percentiles, on the other hand, can be super expensive and potentially unsuitable for some real-time applications, since you need to maintain a sorted list of all relevant samples.

Not surprising, because computing mean is O(n) and median is O(n log n).

Lack of resources or pure laziness doesn't make it the right measure to use though.


Introselect is O(n), right?

Is there anyone who still uses averages ?

Statistics 101. Mean, median and mode.

Looks like timescale did a big marketing push this morning only for their whole service to go down minutes later. lol.

I've skimmed some of the literature here when I've spent time trying to help people with their bucket boundaries for Prometheus-style instrumentation of things denominated in "seconds", such as processing time and freshness.

My use case is a little different from what's described here or in a lot of the literature. Some of the differences:

(1) You have to pre-decide on bucket values, often hardcoded or stored in code-like places, and realistically won't bother to update them often unless the data look unusably noisy.

(2) Your maximum number of buckets is pretty small -- like, no more than 10 or 15 histogram buckets probably. This is because my metrics are very high cardinality (my times get recorded alongside other dimensions that may have 5-100 distinct values, things like server instance number, method name, client name, or response status).

(3) I think I know what percentiles I care about -- I'm particularly interested in minimizing error for, say, p50, p95, p99, p999 values and don't care too much about others.

(4) I think I know what values I care about knowing precisely! Sometimes people call my metrics "SLIs" and sometimes they even set an "SLO" which says, say, I want no more than 0.1% of interactions to take more than 500ms. (Yes, those people say, we have accepted that this means that 0.1% of people may have an unbounded bad experience.) So, okay, fine, let's force a bucket boundary at 500ms and then we'll always be measuring that SLO with no error.

(5) I know that the test data I use as input don't always reflect how the system will behave over time. For example I might feed my bucket-designing algorithm yesterday's freshness data and that might have been a day when our async data processing pipeline was never more than 10 minutes backlogged. But in fact in the real world every few months we get a >8 hour backlog and it turns out we'd like to be able to accurately measure the p99 age of processed messages even if they are very old... So despite our very limited bucket budget we probably do want some buckets at 1, 2, 4, 8, 16 hours, even if at design time they seem useless.

I have always ended up hand-writing my own error approximation function which takes as input like

(1) sample data - a representative subset of the actual times observed in my system yesterday

(2) proposed buckets - a bundle of, say, 15 bucket boundaries

(3) percentiles I care about

then returns as output info about how far off (%age error) each estimated percentile is from the actual value for my sample data.

Last time I looked at this I tried using libraries that purport to compute very good bucket boundaries but they give me, like, 1500 buckets with very nice tiny error, but no clear way to make real-world choice about collapsing this into a much smaller set of buckets with comparatively huge, but manageable, error.

I ended up just advising people to

* set bucket boundaries at SLO boundaries, and be sure to update when the SLO does

* actually look at your data and understand the data's shape

* minimize error for the data set you have now; logarithmic bucket sizes with extra buckets near the distribution's current median value seems to work well

* minimize worst-case error if the things you're measuring grow very small or very large and you care about being able to observe that (add extra buckets)


One thing you could try to use are variance optimal histograms. These are histograms which set bucket boundaries such that the weighted average variance in the buckets is minimized. Unfortunately, this algorithm is quadratic with the number of data points, but you could try approximating the optimum by taking random subsets of the data, computing the histogram, then seeing how well the histogram does on the whole dataset.

http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/pap...


NB: Post author here.

This is interesting and I totally get at least some of the problems you're facing. I wonder if you could take some of the strategies from t-digest and modify a bit to accomplish...I'd be interested in seeing some sort of implementation of this and would love to see if we can get it into our toolkit if you do...or you can also open up a ticket for us and we'll see if we can prioritize to work on something like it ourselves.

I do think there are some interesting ways you can cut corners if you know things about the "SLO" or other sort of cutoff values, but I'd have to think more deeply about it to say more. Essentially we want a variable error rate based on the distance away from a known value, where you have little error in the values relatively close to the known value, care little if at all for fine distinctions on the low end and, once you get past some high end value you could probably bucket everything above it into a "large outliers" bucket or something too...meaning you p999 could get out of hand if you start getting lots of stuff above a threshold, but, if that starts happening, probably everything's already burning down, so might not be that useful to know it's burning down in a very specific way...


Are you following the work on sparse high resolution histograms in both Prometheus and OpenTelemetry?

It's coming together and will be available soon.

What this means is that you won't have to compromise and pick histogram bucket boundaries anymore. And each bucket will be much narrower.


look at Greenwald-Khanna (and the followon work). It adapts the bucket size to minimize the total error (and the number of buckets is proportional to log-epsilon I think).

with all the competition and 'innovation', you would think the data dogs of this world would grow up beyond time series.


Spent several years in venture capital investing and averages were always misleading - as Nassim Taleb says "Never cross a river that is on average 4 feet deep"

Median/50 percentile isn't a whole lot better in that case.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: