The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed size sorted buffer and randomly evict items from it and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.
: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest
Yeah, that was one of the reasons we chose it as one of the ones to implement, seemed like that was a really interesting tradeoff, we also used uddsketch which provides relative error guarantees, which is pretty nifty. We thought they provided different enough tradeoffs that we wanted to implement both.
(The TimescaleDB Toolkit is also implemented in Rust)
Interesting will have to take a look! Thanks for sharing!
(NB: Post author here).
So, not find, but approximate. That's a different thing.
(that you sort yourself)
Is the storage restriction the point?
Thanks for sharing! Hadn't heard of that algorithm, have seen a number of other ones out there, we chose a couple that we knew about / were requested by users. (And we are open to more user requests if folks want to use other ones! https://github.com/timescale/timescaledb-toolkit and open an issue!)
I had a basic screening call fail once because the expected answer was (in my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.
Yeah, this is gradient descent on the absolute loss.
You can use eg QuickSelect (https://en.wikipedia.org/wiki/Quickselect) or Median of Medians.
They don't sort the data, but they do need linear amount of storage.
Isn't this done using a min heap and a max heap in conjuction?
1. huge data set: heaps requires storing the whole set, but "huge" means "more than you can store"
2. without sorting it: heaps are basically semi-sorted, so you are breaking the rules
Really? Where's it coming from?
With a few exceptions this is pretty common scenario.
And possibly it's live data.
The original problem is probably more accurately described as “what if you have too much data to sort” though.
I agree that if you have the whole thing, doing heapsort and pulling a[N/2] or a[1 + N/2] is simpler.
I see that it's better if you need to know "what is the median of the amount of the list that I've consumed so far?"
But if what you want is the median of the whole list, which might be in a random order, the medians of random prefixes of the list don't seem especially relevant. And if you do have an indefinite amount of data coming in, so that you need a "well, this is what we've seen so far" data point, the minheap-maxheap approach doesn't seem very well suited since it requires you to remember the entirety of the data stream so far.
My first instinct is to divide the possible data values into buckets, and just count the number of datapoints that fall into each bucket. This gives you a histogram with arbitrary resolution. You won't know the median value, but you will know which bucket contains the median value, and your storage requirements depend only on the number of buckets you want to use.
But I think we can twist our heads and see in a way that this is backwards. Mathematically, the mean is much easier to work with because it is linear and we can do algebra with it. That's how we got the Central Limit Theorem. Percentiles and the median, except for symmetric distributions, are not as easy to work with. They involve solving for the inverse of the cumulative function.
But in many ways, the median and percentiles are a more relevant and intuitive number to think about. Especially in contexts where linearity is inappropriate!
or put another way, if you can't model it, you're going to have to sort, or estimate a sort, because that's all that's really left to do.
this shows up in things from estimating centers with means/percentiles to doing statistical tests with things like the wilcoxon tests.
sadly, the instructor just gave it an A and moved on. (that said, the amount of work that went into a single semester project was a bit herculean, even if i do say so myself)
A good example of this is when people say "more than half of drivers say they are above average drivers! Idiots!" Of course that's perfectly possible if most drivers are basically fine and some are really really bad. For example, 99.99% of people have greater than the average number of legs.
The correct impossible statement would be if more than half of drivers are better than the median driver.
"Why Averages Suck and Percentiles are Great":
The central limit theorem seems to have given people the slightly wrong idea that things will always average out eventually.
NB: Post author here.
Long queues are a prime target for terrorists, and if something bad happens, small groups of people are more manageable than large groups.
Love this example. Might have to use that in a future post. Feel like a lot of us are running into a similar thing with remote work and video calls these days...
I would rather have a rock-steady consistent 500 ms reponse time when typing text, than to have a 100 ms average response time which randomly spikes to outliers that go past one second.
A rock-steady, though poor, event rate in a paint application is better for drawing a freehand curve (especially if the program interpolates well) than a really fast rate that suddenly has a glitch in it, spoiling your work.
I would be shocked if gamers didn't on average have a more intuitive sense of this than any of the groups you mentioned.
Discussed previously: https://news.ycombinator.com/item?id=10334335
!*Plowser*!* is in the 99th percentile of browsers by global user count!
# Fact Sheet
- !*Plowser*! is used by 2,050 people
- sample size of browsers is 10,000 (this includes toy apps on GitHub and almost all browsers in the sample have no record of active users)
- those using !*Plowser*! have no choice as the developing company $*DendralRot Inc*$ forces all its employees, contractors and users of their enterprise '?_shuiteware_?' (a name derived by mashing |-->shite|software|suite<--| into one word) to use their browser
- if we place the number of global browser users at a conservative 1,000,000,000, !*Plowser*! actually has 0.00000205% of users
From Wikipedia description: Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it, and the effect of outliers and other influential observations on statistical properties.
Definitely worth a quick look.
This is great! So fun...will have to use in the future...
> For the median and average to be equal, the points less than the median and greater than the median must have the same distribution (i.e., there must be the same number of points that are somewhat larger and somewhat smaller and much larger and much smaller).
[0, 2, 5, 9, 9] has both median and mean = 5, but the two sides don't really have the same distribution.
Doesn't have to be any more complicated than that. It's more a curio than an important point anyway :)
That's really nifty, wish I'd heard about it earlier. Might go back and add a link to it in the post at some point too! Very useful. Definitely know I wasn't breaking new ground or anything, but fun to see it represented so succinctly.
Very discouraging if one is trying to analyze data algorithmically. Often when faced with a problem in statistics, the answer is: "Look at the graph and use intuition!".
There are a bunch of queuing theory formulas that are really cool but don't exactly apply if your responses vary a lot like the article describes. I think the assumption is that response time distributions are exponential distributions, which I don't think is a good assumption (is an Erlang distribution an exponential distribution?) Nevertheless, hooking the equations up to some models is a good way to get directional intuition. I didn't realize how steep the performance drop-off is for server utilization until I started moving sliders around.
Our ops team doesn't really follow this either. We're not a huge department though - is this the kind of stuff that an SRE team usually pays attention to?
I found it surprisingly difficult to explain well. Took a lot of passes and a lot more words than I was expecting. It seems like such a simple concept. I thought the post was gonna be the shortest of my recent ones, and then after really explaining it and getting lots of edits and rewriting, it was 7000 words and ...whoops! But I guess it's what I needed to explain it well (hope you thought so anyway).
It's somewhat exponential, but yeah, not necessarily, it's definitely long-tailed in some way and it sort of doesn't matter what the theoretical description is (at least in my mind) the point is these types of distributions really don't get described well by a lot of typical statistics.
Can't talk too much about SREs at the ad analytics company I mentioned, we were the backend team that wrote a lot of the backend stores / managed databases / ran the APIs and monitored this stuff a bit (and probably not all that well). It was a bit more ad hoc I guess, probably now that company is large enough they have a dedicated team for that...
I think the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.
The best method I've found for digging into latency is profiling RPC traces to see where time is spent. This at least separates the code and parameters that have simple latency distributions from those that don't. Some distributions will be very data-dependent (SQL queries).
It's implemented in this Node library: https://github.com/bluesmoon/node-faststats
Side note: I wish everyone would stop using the term Average to refer to the Arithmetic mean. "Average" just means some statistic used to summarize a dataset. It could be the Arithmetic Mean, Median, Mode(s), Geometric Mean, Harmonic Mean, or any of a bunch of other statistics. We're stuck with AVG because that's the function used by early databases and Lotus 123.
I wish people would stop bad-mouthing the arithmetic mean. If you have to convey information about a distribution and you’ve got only one number to do it, the arithmetic mean is for you.
At least the arithmetic mean is fine for Gaussian distributions, and coneys a sense about the data even on non-Gaussian ones. but the median doesn’t even work at all on some common distributions like scores of very difficult exams (where median=0)
For the mean, at least every data point contributes to the final value.
Just my 2c
Isn't this contradictory? If we order the values from largest to smallest and take the 1001st value, then 10 % of the values are above/larger and not below/smaller. I believe it should say order from smallest to largest.
Oops, yep, that should probably be order from smallest to largest. Thanks for the correction!
It also always separated the 'good' Ad networks from the 'bad' ones as the bad ones would take to long to respond.
Yep! Briefly noted that in the post, but deserves re-stating! it's definitely a more complex analysis to figure out the percentage of users affected (though often more important) could be majority could also be one user who has some data scientist programmatically making hundreds of long API calls for some task...(can you tell that I ran into that? Even worse it was one of our own data scientists ;) ).
Glad you liked it! I was so excited to actually be able to get to use Desmos for something work wise, I've been wanting to do it for years!
The bonus is that the 75th - 50th gives you the interquartile range.
Mean is not a robust measure and as such you need to look at variety to truly understand the spread of your data.
Totally can be true. In our case, we use these approximation methods that allow you to get multiple percentiles "for free" definitely need to choose the right ones for the job. (We talk a bit more about the whole approach where we do the aggregation then the accessor thing in the previous post on two-step aggregation ). But there are definitely times when averages/stddev and potentially the 3rd and 4th moments are more useful etc.
They regularly compare experiments to the mean but dont use a T test to ensure the results are actually different from the mean.
Std deviation definitely helps a lot, still often not as good as percentiles, was actually thinking about adding some of that in the post but it was already getting so long. It's funny how things you think are simple sometimes take the most effort to explain, definitely found that on this one.
Unfortunately, BD and marketing just want a single number to show that the value is bigger and hate anything more complicated than a barchart.
We've been meaning to add IQR as an accessor function for these, may have to go back and do it...the frequency trails  stuff from Brendan Gregg also goes into some of this and it's really cool as a visualization.
That is, the first bar is "Our Number" and the second bar is "Competitor's number."
Highly recommend modelling your solution using queueing theory with this: https://queueing-tool.readthedocs.io/en/latest/
As an exercise in discovering the economics of how your platform works, even just thinking about it in these terms can save a great deal of time.
How do you calculate these percentiles and use them as a trigger for alerts?
I did a percentile graph of the usage - the data was only typically using 5% of the maximum throughput, no-one could really understand the graph though
so I did a zoomed-in version of the normal data usage graph and it looked like a blip lasting 1/20 of the time - everyone got that - eg it was peaking every few seconds and then doing nothing for ages
Averages Can Be Misleading: Try a Percentile (2014) (elastic.co)
199 points by donbox on April 2, 2019 | | 55 comments
Will think about how I can improve my own blog with these ideas.
> “In the below graph, half of the data is to the left (shaded in blue), and a half is to the right (shaded in purple), with the 50th percentile directly in the center.”
But the half on the right is actually shaded yellow.
Lack of resources or pure laziness doesn't make it the right measure to use though.
My use case is a little different from what's described here or in a lot of the literature. Some of the differences:
(1) You have to pre-decide on bucket values, often hardcoded or stored in code-like places, and realistically won't bother to update them often unless the data look unusably noisy.
(2) Your maximum number of buckets is pretty small -- like, no more than 10 or 15 histogram buckets probably. This is because my metrics are very high cardinality (my times get recorded alongside other dimensions that may have 5-100 distinct values, things like server instance number, method name, client name, or response status).
(3) I think I know what percentiles I care about -- I'm particularly interested in minimizing error for, say, p50, p95, p99, p999 values and don't care too much about others.
(4) I think I know what values I care about knowing precisely! Sometimes people call my metrics "SLIs" and sometimes they even set an "SLO" which says, say, I want no more than 0.1% of interactions to take more than 500ms. (Yes, those people say, we have accepted that this means that 0.1% of people may have an unbounded bad experience.) So, okay, fine, let's force a bucket boundary at 500ms and then we'll always be measuring that SLO with no error.
(5) I know that the test data I use as input don't always reflect how the system will behave over time. For example I might feed my bucket-designing algorithm yesterday's freshness data and that might have been a day when our async data processing pipeline was never more than 10 minutes backlogged. But in fact in the real world every few months we get a >8 hour backlog and it turns out we'd like to be able to accurately measure the p99 age of processed messages even if they are very old... So despite our very limited bucket budget we probably do want some buckets at 1, 2, 4, 8, 16 hours, even if at design time they seem useless.
I have always ended up hand-writing my own error approximation function which takes as input like
(1) sample data - a representative subset of the actual times observed in my system yesterday
(2) proposed buckets - a bundle of, say, 15 bucket boundaries
(3) percentiles I care about
then returns as output info about how far off (%age error) each estimated percentile is from the actual value for my sample data.
Last time I looked at this I tried using libraries that purport to compute very good bucket boundaries but they give me, like, 1500 buckets with very nice tiny error, but no clear way to make real-world choice about collapsing this into a much smaller set of buckets with comparatively huge, but manageable, error.
I ended up just advising people to
* set bucket boundaries at SLO boundaries, and be sure to update when the SLO does
* actually look at your data and understand the data's shape
* minimize error for the data set you have now; logarithmic bucket sizes with extra buckets near the distribution's current median value seems to work well
* minimize worst-case error if the things you're measuring grow very small or very large and you care about being able to observe that (add extra buckets)
This is interesting and I totally get at least some of the problems you're facing. I wonder if you could take some of the strategies from t-digest and modify a bit to accomplish...I'd be interested in seeing some sort of implementation of this and would love to see if we can get it into our toolkit if you do...or you can also open up a ticket for us and we'll see if we can prioritize to work on something like it ourselves.
I do think there are some interesting ways you can cut corners if you know things about the "SLO" or other sort of cutoff values, but I'd have to think more deeply about it to say more. Essentially we want a variable error rate based on the distance away from a known value, where you have little error in the values relatively close to the known value, care little if at all for fine distinctions on the low end and, once you get past some high end value you could probably bucket everything above it into a "large outliers" bucket or something too...meaning you p999 could get out of hand if you start getting lots of stuff above a threshold, but, if that starts happening, probably everything's already burning down, so might not be that useful to know it's burning down in a very specific way...
It's coming together and will be available soon.
What this means is that you won't have to compromise and pick histogram bucket boundaries anymore. And each bucket will be much narrower.
with all the competition and 'innovation', you would think the data dogs of this world would grow up beyond time series.