
How Not to Measure Latency - toddh
http://highscalability.com/blog/2015/10/5/your-load-generator-is-probably-lying-to-you-take-the-red-pi.html
======
jefftk

        For example, in a user session with 5 page views that
        load 40 resources per page, how many users will not
        experience worse than 95%’ile of HTTP requests? The
        chances of not seeing it is ~.003%.
    

This is assuming complete independence, which is very wrong.
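
For reference, the quoted ~.003% is just this calculation (5 pages x 40 resources = 200 requests), and it only holds if every request is an independent draw from the same latency distribution:

    # A minimal sketch of the article's arithmetic. Independence is exactly
    # the questionable premise: requests in one session share servers,
    # caches, and network conditions.
    p_all_fast = 0.95 ** 200
    print(f"{p_all_fast:.5%}")  # ~0.00351% of sessions avoid any >95%'ile request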

~~~
ma2rten
This!

The Google and Amazon example in particular is misleading. Most of those
requests are for assets served from a CDN, which is highly reliable compared
to the application code.

------
latch
I like how the article talks about the uselessness of percentiles, when a lot
of people are still using averages!

Seriously, one of the best tools we use is counting thresholds and buckets. On
a per-route basis, identify a threshold that you deem worrisome (say, 50ms),
and then simply count the number of responses above it. Similarly, create
buckets: 0-9ms, 10-19ms, 20-29ms... I say "best" because it's also very easy
to implement (even against streaming data, as in the sketch below) and doesn't
take a lot of memory. Not sure if there's a better way, but we sample our
percentiles, which makes me uneasy (though it keeps us to a fixed amount of
memory).
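
A minimal sketch of the idea (names and thresholds are illustrative, not our actual code):

    # Fixed-memory threshold + bucket counters, updatable from a stream.
    from collections import defaultdict

    BUCKET_MS = 10        # bucket width: 0-9ms, 10-19ms, 20-29ms, ...
    THRESHOLD_MS = 50     # per-route "worrisome" threshold

    bucket_counts = defaultdict(int)   # (route, bucket index) -> count
    over_threshold = defaultdict(int)  # route -> count of slow responses

    def record(route, duration_ms):
        bucket_counts[(route, int(duration_ms // BUCKET_MS))] += 1
        if duration_ms > THRESHOLD_MS:
            over_threshold[route] += 1

    record("/checkout", 12.3)
    record("/checkout", 87.0)
    print(over_threshold["/checkout"])  # -> 1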

~~~
lpage
There's a highly tuned set of libraries available for computing HDR histograms
with minimal overhead:
[https://github.com/HdrHistogram](https://github.com/HdrHistogram). This gets
you a little closer to a continuous distribution, and does away with the
branching needed for explicit buckets.
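
Usage is small, too. This assumes the Python port (the `hdrhistogram` package, imported as `hdrh`); check its README for the exact API:

    # Track values from 1us up to 1 hour, with 3 significant digits.
    from hdrh.histogram import HdrHistogram

    histogram = HdrHistogram(1, 60 * 60 * 1000 * 1000, 3)
    for latency_us in (850, 1200, 43000, 980):   # made-up sample values
        histogram.record_value(latency_us)

    print(histogram.get_value_at_percentile(99.0))  # 99th percentile value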

------
grandalf
NewRelic graphs are another highly misleading example. They average the
response time across all endpoints and make it very hard to understand the
95th percentile for a specific endpoint.

They also don't let you exclude static assets from the graphs, so the numbers
are fairly unhelpful when trying to understand performance bottlenecks of a
dynamic application.

~~~
Lightbody
Disclaimer: I'm part of New Relic's Product Management team.

I think you might have a misunderstanding of how our stuff works. While we
originally captured only aggregates, for the last ~2 years or so we've been
capturing and reporting on _every_ transaction/request taking place.

As such, when you do a 95th percentile chart in our product, we're not
"averaging the percentile" like many monitoring tools do. We are literally
looking at every single record during that time.

We also allow filtering by transaction, which means you can check out just the
percentile for "CheckoutController" or "AddToCartController" -- definitely
not just the aggregate application.

And if you're a customer and want to verify this yourself, just pop over to
New Relic Insights (insights.newrelic.com) and run a query like this to really
see the power that comes from not pre-aggregating anything:

      SELECT count(*), percentile(duration, 50, 90, 87, 95)
      FROM Transaction
      SINCE 1 day ago
      FACET name

That will return all transactions in the last day, grouped by transaction name
(ex: "CheckoutController"), along with their respective count and median,
90th, and 95th percentile. I threw in the 87th percentile just to show that
you can do that too if that's your kind of thing.

~~~
grandalf
> every transaction/request taking place.

It's definitely gotten better; however, I think the default view is still just
an average across all requests, which isn't very helpful.

For a blocking server (like the ones most Rails apps use), the key insight is
which controller actions need to be optimized to prevent slow user-facing
aggregate response times.

I don't currently have it in use on any of my apps, but next time I'll give
that query a try.

~~~
Lightbody
Yes, the first graph is for the entire app. But we (try to) make it easy to
quickly drop down to a view that shows you the most time-consuming controllers
and then lets you quickly see performance for just a single one.

Feel free to drop me an email at lightbody@newrelic.com if you decide to check
it out again and have more feedback.

Take care!

~~~
grandalf
I appreciate the response here, and appreciate the offer.

------
ninjakeyboard
I don't know if I understand the point.

"More shocking: 99% of users experience ~99.995%’ile HTTP response times. You
need to know this number from Akamai if you want to predict what 99% of your
users will experience. You only know a tiny bit when you know the 99%’ile."

What does this mean? Can you explain it better? Why is this true? I'm not
convinced.
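
The only way I can make the quoted number work is the independence arithmetic from upthread, assuming 200 requests per session (a sketch of the article's premise, not anything from Akamai):

    # Percentile p that a 200-request session stays entirely under 99% of
    # the time, assuming independent requests: solve p**200 = 0.99.
    p = 0.99 ** (1 / 200)
    print(f"{p:.5%}")  # ~99.995% -> the ~99.995%'ile in the quote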

If a page makes requests for 100 assets, sure... But the worst asset doesn't
dictate the user experience. We're primarily concerned with the API response
time, not every individual static asset. Assets don't block rendering, so if
one little image is slow I wouldn't treat that as a deal-breaker. We don't
even measure entire page load times - just what the user needs to see to stay
engaged, especially content loading above the fold.

------
joshwa
The most important thing to measure is the user-experienced response time. If
your app isn't ready until some assets have loaded and your UI is actually
displayed and usable, then you need to measure and instrument THAT. (domready
is a poor but often usable proxy).

A good APM tool will capture this for ajax requests as well, by measuring how
long it takes from the click to the end of the callbacks executed after
receiving the response (i.e. displaying the refreshed content).

------
velox_io
Interesting article; I dislike vanity metrics too. It seems most stats are
designed to tell the best story rather than the nitty-gritty truth.

Instead of looking at normalised stats, look at the worst offenders, be it the
URLs or the heaviest queries. Then find out what is using the most resources
and focus on that first.

------
wpeterson
Averages can be useful; percentiles can be more useful.

However, the biggest asset in useful monitoring is focusing on the right
events and data. Is an average latency across all of your requests useful?
Probably much less so than per-API or per-page averages.

------
divan
I keep one of the previous versions of this talk in a bookmarks folder named
'Must see videos'.

------
dang
This is a summary of
[https://www.youtube.com/watch?v=lJ8ydIuPFeU](https://www.youtube.com/watch?v=lJ8ydIuPFeU).
It looks like a good summary, so I guess we won't replace the URL above, even
though HN prefers original sources. We did change the title to that of the
talk though.

------
vacri
In addition to some of the other comments, another claim of dubious validity
is _"that 5 percent had to be so bad it pulled the other 95 percent of
observations up into bad territory"_. No, that is demonstrably not the case;
the article itself notes earlier that the sample graph only saw movement in
the 90% and 95% lines.

It feels like the article has taken a philosophical position, then gone
through a lot of confirmation bias to support it.

~~~
sulam
He's taking a very pragmatic position that you will experience directly if
you're ever responsible for the SLA of a "web scale" service.

His 5% / 95% observation is simply that you should not be focusing on the
average, or even the majority case, if that's not what your users actually
experience. One interpretation of the specific example you don't like is that
more requests fell into a given range, yielding the distribution the graph
reflects. Another potential interpretation is just what he describes, where
you have a set of especially poorly performing requests at that point in time
which effectively skews the distribution so that the 95th is "pulled up".

His broader point is that you don't know, based on this graph, which one
you're dealing with, and he demonstrates a better visualization technique to
determine the precise distribution of responses without having to look at
every request.
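
A rough sketch of that kind of visualization (synthetic data; the 'logit' x-scale is one way to give the tail as much room as the body, similar in spirit to HdrHistogram's percentile plots):

    # Plot latency against percentile, stretching the tail so 90, 99,
    # 99.9, ... each get visible width. Data here is made up.
    import numpy as np
    import matplotlib.pyplot as plt

    latencies_ms = np.random.lognormal(mean=3.0, sigma=0.5, size=100_000)
    fractions = 1 - np.logspace(-0.3, -6, num=200)   # ~50% .. 99.9999%
    values = np.percentile(latencies_ms, fractions * 100)

    plt.plot(fractions, values)
    plt.xscale("logit")
    plt.xlabel("percentile (as fraction)")
    plt.ylabel("latency (ms)")
    plt.show()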

