For example, in a user session with 5 page views that
load 40 resources per page, how many users will not
experience something worse than the 95%'ile of HTTP requests?
That's 200 requests, every one of which has to land below the
95th percentile, so the chance of never seeing a slow one is ~0.003%.
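To sanity-check that number (a minimal sketch; the 5 pages and 40 resources are just the figures from the example above):

    # A user avoids the slow tail only if all n requests land below the
    # 95th percentile; assuming independent requests, that chance is 0.95^n.
    page_views = 5
    resources_per_page = 40
    n = page_views * resources_per_page   # 200 requests per session

    p_never_slow = 0.95 ** n
    print(f"{p_never_slow:.4%}")           # ~0.0035%, i.e. practically nobody escapes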
The Google and Amazon example in particular is misleading. Most of those requests are for assets loaded from a CDN, which is highly reliable compared to the application code.
The actual latency you should measure for such a download is the total time taken to get to a usable view for the user. How to do that is left as an exercise for the reader ;-)
- Google (you hit many search nodes to get the results for a single search, and Google goes to some effort to make sure you get back both accurate and fast responses)
- Facebook (there are many distinct services that have to be queried to render any FB page)
- YouTube (union of the above)
Now, I hope the engineers building these sites understand the points he is making, but those points stand true either way.
If half of the slow requests go to a few unlucky users and half are scattered across everyone else, the chance of avoiding a worse-than-95%'ile request is about 0.6% for the "lucky" users and 0 for the unlucky ones.
This split is a bit unrealistic; who would go on for five whole page views if performance were that bad? But it doesn't matter: 99.4% or 99.997% both mean "practically all users experience slow requests".
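A quick check of the 0.6% figure (same 200-requests-per-session assumption as above; the even half/half split is just the one posited here):

    n = 5 * 40                       # 200 requests per session, as above
    scattered_tail = 0.05 / 2        # "lucky" users only see half of the slow 5%
    p_lucky_all_fast = (1 - scattered_tail) ** n
    print(f"{p_lucky_all_fast:.1%}") # ~0.6%; the unlucky users have no chance at all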
Seriously, one of the best tools we use is counting thresholds and buckets. On a per-route basis, identify a threshold that you deem worrisome (say, 50ms), and then simply count the number of responses above it. Similarly, create buckets: 0-9ms, 10-19, 20-29... I say "best" because it's also very easy to implement (even against streaming data) and doesn't take a lot of memory. Not sure if there's a better way, but we sample our percentiles, which makes me uneasy (though it keeps us to a fixed amount of memory).
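A minimal sketch of that counting approach (the route names, 50ms threshold, and 10ms bucket width are just illustrative):

    from collections import defaultdict

    THRESHOLD_MS = 50        # per-route "worrisome" threshold (illustrative)
    BUCKET_WIDTH_MS = 10     # buckets: 0-9ms, 10-19ms, 20-29ms, ...

    over_threshold = defaultdict(int)                 # route -> count of slow responses
    buckets = defaultdict(lambda: defaultdict(int))   # route -> bucket index -> count

    def record(route, duration_ms):
        # O(1) per observation; memory is one counter per (route, bucket),
        # so it works fine against streaming data.
        if duration_ms > THRESHOLD_MS:
            over_threshold[route] += 1
        buckets[route][int(duration_ms // BUCKET_WIDTH_MS)] += 1

    # e.g. record("CheckoutController", 72.3)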
They also don't let you exclude static assets from the graphs, so the numbers are fairly unhelpful when trying to understand performance bottlenecks of a dynamic application.
I think you might have a misunderstanding of how our stuff works. While we originally only captured aggregates, the last ~2 years or so we've been capturing and reporting on every transaction/request taking place.
As such, when you do a 95th percentile chart in our product, we're not "averaging the percentile" like many monitoring tools do. We are literally looking at every single record during that time.
We also allow filtering by transaction, which means you can check out just the percentile for "CheckoutController" or "AddToCartController" -- definitely not just the aggregate application.
And if you're a customer and want to verify this yourself, just pop over to New Relic Insights (insights.newrelic.com) and run a query like this to really see the power that comes from not pre-aggregating anything:
SELECT count(*), percentile(duration, 50, 90, 87, 95)
FROM Transaction
SINCE 1 day ago
FACET name
It's definitely gotten better; however, I think the default view still just averages all requests, which isn't very helpful.
For a blocking server (like most Rails apps use) the key insight has to do with which controller actions need to be optimized to prevent slow user-facing aggregate response times.
I don't currently have it in use on any of my apps, but next time I'll give that query a try.
Feel free to drop me an email at firstname.lastname@example.org if you decide to check it out again and have more feedback.
Defining critical transactions and measuring them against their SLAs is really the only valid way of summarizing total application performance (x% of traffic is meeting the SLA, y% is 1 stddev above it, z% is 2 stddev above, etc.).
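As a rough illustration of that kind of summary (a minimal sketch assuming you already have per-request durations for one critical transaction; the 200ms SLA and the function name are placeholders):

    import statistics

    def sla_summary(durations_ms, sla_ms=200):
        # Share of traffic meeting the SLA, and bands of 1 and 2 stddev above it.
        sd = statistics.pstdev(durations_ms)
        n = len(durations_ms)
        meeting = sum(d <= sla_ms for d in durations_ms)
        within_1sd = sum(sla_ms < d <= sla_ms + sd for d in durations_ms)
        within_2sd = sum(sla_ms + sd < d <= sla_ms + 2 * sd for d in durations_ms)
        beyond = n - meeting - within_1sd - within_2sd
        return {
            "meeting SLA": meeting / n,
            "within 1 stddev above": within_1sd / n,
            "within 2 stddev above": within_2sd / n,
            "worse than 2 stddev above": beyond / n,
        }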
"More shocking: 99% of users experience ~99.995%’ile HTTP response times. You need to know this number from Akamai if you want to predict what 99% of your users will experience. You only know a tiny bit when you know the 99%’ile."
What does this mean? Can you explain this better? How is this a truth? I'm not convinced.
If I'm loading a page that requests 100 assets, sure... But the worst asset doesn't dictate the user experience. We're primarily concerned with the API response time, not every individual static asset. Assets don't block rendering, so if one little image is slow I wouldn't treat that as a deal-breaker. We don't even measure entire page load times, just how long it takes for the user to see what they actually care about, especially the content above the fold.
A good APM tool will measure this as well for ajax requests by measuring how long it takes from click to the end of the callbacks executed after receiving the response (i.e. displaying the refreshed content).
Instead of looking at normalised stats, look at the worst offenders. Be it the URLs or the heaviest queries. Then find out what is using most of the resources and focus on those first.
However, the biggest asset in useful monitoring is focusing on the right events and data. Is an average latency across all of your requests useful? Probably much less so than averages per API or per page.
It feels like the article has taken a philosophical position, then gone through a lot of confirmation bias to support it.
His 5% / 95% observation is simply that you should not be focusing on the average, or even the majority case, if that's not what your users actually experience. One interpretation of the specific example you don't like is that more requests fell into a given range, thereby giving a distribution that the graph reflects. Another potential interpretation is just what he describes, where you have a set of especially poorly performing requests at that point in time which effectively skew the distribution so that the 95th is "pulled up". His broader point is that you don't know, based on this graph, which of those you're dealing with, and he demonstrates a better visualization technique to determine the precise distribution of responses without having to look at every request.