

Why Average Latency Is a Terrible Way To Track Website Performance - mvolo
http://mvolo.com/why-average-latency-is-a-terrible-way-to-track-website-performance-and-how-to-fix-it/

======
aetherson
TL;DR: Average anything is a terrible way to track anything. (And median or
mode are bad, too). Any single-scalar value that compresses information that
is best expressed as a graph (or multiple graphs!) is immensely lossy to the
point where arguably it obfuscates more than it makes clear.

Back when we had to live with sort of printing-press methods of displaying
information (i.e., where anything that wasn't pure text was very difficult to
display), mean/median/mode numbers were a necessary evil. But if you're
looking at a computer screen, there's really no reason to subject yourself to
an abstraction that throws out 90% of your data.

~~~
stephencanon
> Average anything is a terrible way to track anything.

Came here to say exactly this. And averages are _especially_ insidious when
used for data that doesn't have a symmetric distribution, like most latencies.
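
To make that concrete, a tiny example with made-up numbers (Python): a single
slow tail drags the mean an order of magnitude above what the typical request
actually experiences.

    import statistics

    # Hypothetical response times in ms: most requests are fast,
    # a couple hit a slow code path.
    latencies = [80, 90, 95, 100, 105, 110, 120, 130, 5000, 9000]

    print(statistics.mean(latencies))    # 1483.0 ms - looks catastrophic
    print(statistics.median(latencies))  # 107.5 ms - what most users see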

~~~
mvolo
hi Steve,

Author here. I think most people on HN would echo your sentiment about
averages wholesale, but I wanted to go a little deeper into selecting a
better alternative for operational monitoring.

It's easy to say "averages are bad" but harder to say "use X instead" and
explain why. We tried. Do you think we succeeded?

~~~
chetanahuja
Well, the title seems a bit childish (since obviously everybody on HN knows
it's a terrible idea). Why don't you change the post title to more
appropriately reflect what you were trying to propose as an alternative?

------
spitfire
Michael Abrash talked about this in his Graphics Programming Black Book.

When he was writing Quake, they could trade off between lightning-fast
graphics (40fps+ on a 486) 99% of the time, with the occasional horrible
slowdown to less than 5fps, vs. a steady frame rate that never changed much
but wasn't terribly fast.

Turns out people notice the occasional horrible lag much more than they notice
things performing uniformly.

When tuning a performance critical service, focus on the outliers.

~~~
jamesaguilar
I think you should not ignore either one. By default, think about 99%ile and
50%ile when tuning and optimizing. Depending on the context (e.g. games), even
99%ile might not be enough, and you might need to ask: 99%ile _of what_?
Frames? Scenes? Seconds of gameplay?
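
To make the "of what" question concrete, here's a quick sketch (Python, with
invented frame times): the same data gives a very different 99%ile depending
on whether you take it over individual frames or over per-second averages.

    import statistics

    # 59 fast frames plus one 250 ms hitch, repeated for 10 "seconds"
    frames = ([16] * 59 + [250]) * 10

    def pct(data, p):
        # simple nearest-rank percentile, good enough for illustration
        s = sorted(data)
        return s[int(p / 100 * (len(s) - 1))]

    per_second = [statistics.mean(frames[i:i + 60])
                  for i in range(0, len(frames), 60)]

    print(pct(frames, 99))      # 250  - the hitch shows up per frame
    print(pct(per_second, 99))  # 19.9 - averaged away per second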

Also, back to the topic of the article at hand, I hope that their "T" is not
really two seconds. That is already way too slow for most web purposes.

------
taylorbuley
One problem I have with this approach is that it requires you to pick a
threshold after which the response is "too slow." This number can change a lot
over the course of an application's lifetime, and would be hard to pick
objectively anyway.

Median latency -- perhaps with (the smoothing effect of) a rolling median --
would be more robust to outliers without having to resort to hardcoding
"too slow" thresholds. It would still require a human to connect the dots
(e.g. median latency of >200ms is "too slow"), but it's an improvement on the
mere average response time for the reasons noted.
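
A rolling median is also cheap to sketch; here's a naive version (Python,
re-sorting each window, which a real monitoring pipeline would replace with a
streaming quantile estimator):

    import statistics
    from collections import deque

    def rolling_median(samples, window=100):
        # Yield the median of the last `window` samples.
        buf = deque(maxlen=window)
        for s in samples:
            buf.append(s)
            yield statistics.median(buf)

    # A short burst of slow responses barely moves the rolling median.
    latencies = [100] * 95 + [5000] * 5
    print(list(rolling_median(latencies))[-1])  # 100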

~~~
pla3rhat3r
I agree with this. It's hard to gauge what is acceptable because it really
depends on the application. There are so many other dependencies when dealing
with latency and how it affects performance.

~~~
mvolo
hi, author here.

Unfortunately, you HAVE TO do it. If you do not set a threshold for what is
acceptable, how do you determine whether or not you are providing an
acceptable experience to your users?

No amount of aggregate metrics can help you answer this question unless you
know what's acceptable and what isn't - for each important set of URLs in your
app.

I agree that it's "hard" to do. In our own product
(<https://www.leansentry.com>), we solve this problem by grouping URLs, and by
using good defaults / making it easy for users to override the thresholds.
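
For readers who haven't seen the article, the underlying idea is an
Apdex-style satisfaction score. A rough sketch of that calculation with
per-URL-group thresholds is below; the group names and threshold values are
invented for illustration, not LeanSentry's actual defaults.

    # Hypothetical per-group thresholds (T) in ms; meant to be overridden.
    THRESHOLDS = {"search": 500, "checkout": 1000, "static": 100}

    def satisfaction(latencies_ms, t):
        # Standard Apdex formula: satisfied <= T, tolerating <= 4T, else frustrated.
        satisfied = sum(1 for l in latencies_ms if l <= t)
        tolerating = sum(1 for l in latencies_ms if t < l <= 4 * t)
        return (satisfied + tolerating / 2) / len(latencies_ms)

    # 1.0 means everyone satisfied, 0.0 means everyone frustrated.
    print(satisfaction([200, 450, 900, 2500, 6000], THRESHOLDS["search"]))  # 0.5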

~~~
pla3rhat3r
There are so many moving parts that this could either be a great tool to
analyze data, or it could open up a can of worms that leads down the road to
network re-architecture. If this takes into account best-effort or SLA-based
ISPs, network topology, QoS, packet prioritization, etc., then I think it
could be a useful tool. Without that, it's just a tool that spits out pretty
pictures. If your main selling point is data, then it has to be more than just
what latency can show.

~~~
mvolo
hi there,

The post is about selecting a top-level metric for monitoring website
performance. Once a problem is indicated, you would definitely need to drill
in to figure out what part of your app is affected, when, and what caused it.

LeanSentry (our own application monitoring product,
<https://www.leansentry.com>) does this. However, describing this was outside
the scope of my post (but you can see the demo of it on the website).

------
edouard1234567
I think there's something to be said for keeping some key metrics super
simple, so that "everybody" can understand them without having to refer to a
formula or arbitrarily-set thresholds. I've been using 99th and 90th
percentile performance. It captures enough information in most cases and
doesn't require any explanation.

~~~
mvolo
Hi edouard,

I completely agree! Keeping top-level metrics SIMPLE is the key. Of course,
they should be simple but also not mislead you into any wrong beliefs.

While we liked the 95th percentile approach, we decided against it. It's still
too focused on the actual response time itself, which we thought was less
relevant than the number of users experiencing bad performance.

I think for us the bottom line was:

A) If you are having a site-wide performance issue, the 95th percentile is a
good metric.

B) However, if you have more isolated issues (we find this happens more often
with more mature sites), the satisfaction score is better.

Best, Mike

~~~
taproot
I'm seeing a lot of "averages are bad" etc., but I think you come closest to
what I had in mind: there isn't anything inherently wrong with using simple
metrics. The caveat is that you need to keep their limitations in mind and
understand where they fall down. I think a lot of people understand that about
using 99th or 95th percentiles and whatnot, but just failed to lay the
reasoning out.

------
bluesmoon
Posted this comment on the article, but thought it would be useful here as
well:

Good points on why average latency is a bad metric, and while the idea behind
Apdex was good, it never ended up being the right measure. The Apdex score
still depends on a HiPPO (Highest Paid Person's Opinion) to determine what T
should be, and this can change over time.

At SOASTA (and previously at LogNormal), we borrowed the concept of LD50 (the
median lethal dose) from biology. The LD50 value has the property of adapting
to what your audience thinks is a good experience, rather than what your HiPPO
thinks.

We described the method at the Velocity conferences (Santa Clara and London)
last year, and wrote it up in a blog post here:
http://www.lognormal.com/blog/2012/10/03/the-3.5s-dash-for-attention/#ld50

Hope you find it interesting.

I should also mention that it's useful to apply some kind of smoothing to
timeseries data (like latency over time). Holt-Winters double-exponential
smoothing is particularly good at this. What it does is smooth out temporary
glitches and show you when things turn unexpectedly bad. If you've ever
received a page and said, "Oh yeah, that one... that goes away in 3 seconds.
Happens every day.", then you'll find this useful. H-W D-E smoothing only
shows you the ones that don't go away after 3 seconds.
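
For anyone who wants to try this, the double-exponential (level + trend) part
is only a few lines; a bare sketch follows (Python, with placeholder
alpha/beta values - a real deployment would tune these and typically add the
seasonal Holt-Winters term and confidence bands):

    def holt_smooth(series, alpha=0.3, beta=0.1):
        # Double-exponential smoothing: track a level and a trend.
        level, trend = float(series[0]), 0.0
        out = [level]
        for x in series[1:]:
            prev_level = level
            level = alpha * x + (1 - alpha) * (level + trend)
            trend = beta * (level - prev_level) + (1 - beta) * trend
            out.append(level)
        return out

    # A brief 900 ms glitch in an otherwise flat 100 ms series barely
    # registers in the smoothed curve, which peaks well under 900.
    latency = [100] * 50 + [900, 900, 900] + [100] * 50
    print(max(holt_smooth(latency)))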

------
Jabbles
I thought looking at the 99th (or other) percentile was pretty standard
practice?

~~~
acdha
Depends on the definition of standard - within the clued-in performance
community, yes, but there are major, major companies still pushing averages,
and that causes a lot of people, particularly those without much of a stats /
engineering background, to expect it everywhere.

To use one example which is prevalent throughout marketing, advertising, etc.:
Google Analytics reports only averages – this makes the results unreliable
enough that I'm now advising people to simply pretend that field does not
exist, as it's completely untrustworthy. A while back I blogged about an
example where 3 samples out of 200K threw the average off by a full order of
magnitude:
http://chris.improbable.org/2012/05/18/google-analytics-deceptive-site-speed-report/
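
To get a feel for the arithmetic (illustrative numbers, not the actual figures
from the linked post): for 3 samples to drag the mean of 200K ordinary
measurements up by 10x, each one has to be absurdly large, which suggests
bogus measurements rather than real page loads.

    # Illustrative: how big must 3 outliers be to shift the mean of
    # 200,000 half-second page loads up by an order of magnitude?
    n, true_mean = 200_000, 0.5              # seconds
    target_mean = true_mean * 10
    outlier_total = target_mean * (n + 3) - true_mean * n
    print(outlier_total / 3)                 # ~300,000 s each, i.e. ~3.5 days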

~~~
Jabbles
Very interesting, thank you. I especially like the replies from the Google
analytics team, 8 months apart, that both acknowledge the issue and say
they'll fix it...

~~~
acdha
Also that in addition to not having fixed it, accurate stats are apparently
less of a priority than, say, a gigantic fixed-position toolbar. That's been
disappointing…

------
joshfraser
I prefer histograms...
http://highscalability.com/blog/2012/5/23/averages-web-performance-data-and-how-your-analytics-product.html

------
nateabele
Searched the page for "standard deviation". Didn't find it. Hit the back
button.

~~~
rm999
Standard deviation isn't the problem, skew is. Yes, skew will increase the
standard deviation, but the heart of the issue here is how fat the right tail
of the distribution is.

Standard deviation is often a useful metric, but it's at least as flawed as
the mean in skewed distributions, because it doesn't treat either direction
around the (already flawed) mean any differently.
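
A tiny illustration (invented numbers): with one fat-tail outlier, the mean ±
one standard deviation band stretches from an impossible negative latency up
to a value that describes neither the typical request nor the outlier.

    import statistics

    latencies = [100] * 99 + [60000]     # 99 fast requests, one 60 s straggler
    mu = statistics.mean(latencies)      # 699.0 ms
    sd = statistics.pstdev(latencies)    # ~5960 ms
    print(mu - sd, mu + sd)              # roughly -5261 .. 6659 - not useful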

