
The Most Misleading Measure of Response Time: Average - dsiroker
http://blog.optimizely.com/2013/12/11/why-cdn-balancing/
======
JackFr
Liked it better when it was called "Programmers Need To Learn Statistics Or I
Will Kill Them All"
[http://zedshaw.com/essays/programmer_stats.html](http://zedshaw.com/essays/programmer_stats.html),
but still kind of wrong.

If you need to reduce a distribution to a single number, the most informative
number is going to be the mean.

I understand their point about the 99th percentile, but consider that it's
possible to improve the 99th percentile measure, while increasing the mean and
degrading the performance of all but 1% of the users.

The real issue is reducing a distribution to one number.
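To make that tradeoff concrete, here is a toy sketch (all numbers hypothetical) in which an "optimization" improves the 99th percentile while worsening the mean and slowing down 98% of requests:

```python
import statistics

# Hypothetical latencies (ms): 980 fast requests plus 20 slow outliers.
before = [10] * 980 + [1000] * 20
# An "optimization" that tames the tail but slows everyone else down.
after = [50] * 1000

def p99(xs):
    """99th percentile via the nearest-rank method."""
    return sorted(xs)[int(0.99 * len(xs)) - 1]

print(statistics.mean(before), p99(before))  # 29.8, 1000
print(statistics.mean(after), p99(after))    # 50, 50
# p99 improved (1000 ms -> 50 ms), but the mean got worse and 98% of
# requests are now five times slower than before.
```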

~~~
tomp
No, the most appropriate "average" number is the _median_ - the number that
is at exactly 50% of the population. At least, that's what most people think
about when they hear "average".

For example, most people intuitively perceive the claim that "the majority of
drivers consider themselves above average" as human stupidity; however,
mathematically it makes perfect sense if average == arithmetic mean.

If you are not convinced, consider Bill Gates walking into a room full of
college students - suddenly, almost everybody becomes below-average wealthy
(if average == mean).
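The Bill Gates thought experiment is easy to check numerically (the figures below are made up):

```python
import statistics

# Hypothetical net worths ($) of 20 college students in a room.
room = [15_000] * 20
print(statistics.mean(room), statistics.median(room))  # 15000 15000

room.append(100_000_000_000)  # Bill Gates walks in

print(statistics.mean(room))    # mean jumps to several billion
print(statistics.median(room))  # median barely notices: still 15000
# Now 20 of 21 people are "below average" wealthy (if average == mean).
```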

~~~
gngeal
_most people intuitively perceive the fact that "the majority of drivers
consider themselves above average" as human stupidity_

I intuitively perceive it as good drivers being clustered together near the
top, with a few outstandingly lousy drivers near the bottom.

 _If you are not convinced, consider Bill Gates walking into a room full of
college students - suddenly, almost everybody becomes below-average wealthy
(if average == mean)._

That's true in my country even if Bill Gates doesn't walk in. Hell, perhaps
it's true even in the US.

~~~
tomp
I meant "below room-average wealthy", i.e. taking only the people in the room
into account.

------
ender7
Using FPS as a measure of your UI's performance is equally problematic. FPS is
a great measurement for games, since performance dips usually occur over a
span of many frames, but in UIs a lot of work tends to get concentrated into
a single frame. A single frame that takes 110ms (or, heaven forbid, 500ms) to
render won't move the needle on your FPS meter, but it will be instantly
noticeable to the user.

I've complained about this before; use maximum frame delay [0] instead of FPS
when measuring UI responsiveness.

[0] The maximum time elapsed between any two sequential frames during your
test.
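A minimal sketch of that metric, using made-up frame timestamps: a single 110 ms hitch barely dents the average FPS, but max frame delay flags it immediately.

```python
# Made-up timestamps (seconds): two bursts of 30 frames at 60 fps,
# with a single 110 ms hitch in the middle.
timestamps = [i / 60 for i in range(30)]
timestamps += [timestamps[-1] + 0.110 + i / 60 for i in range(30)]

# Elapsed time between each pair of sequential frames.
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]

fps = len(deltas) / (timestamps[-1] - timestamps[0])
max_delay = max(deltas)

print(f"average FPS: {fps:.1f}")                     # still looks acceptable
print(f"max frame delay: {max_delay * 1000:.0f} ms")  # exposes the 110 ms hitch
```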

~~~
stonemetal
Actually, max frame delay is a good metric for games too. AMD had a driver
defect for years that caused stutter, and it wasn't until their rival, Nvidia,
released a max-frame-delay test tool and rubbed AMD's nose in it that they
realised there was a problem.

------
programminggeek
Why not break the numbers down more granularly, e.g. 25%, 50%, 70%, 90%, 95%,
99%?

Understanding where your users are on the curve is probably more interesting
than a single number. Worrying about that last 1% really only makes a
meaningful difference if your user base is huge enough that fixing something
for 1% of your users can jump revenue by a significant multiple.
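A quick sketch of such a breakdown (the data is synthetic; exponentially distributed response times with a ~100 ms mean are an assumption for illustration):

```python
import random

# Synthetic response times: mostly fast, with a long tail.
random.seed(42)
samples = [random.expovariate(1 / 100) for _ in range(10_000)]  # mean ~100 ms

def percentile(xs, p):
    """Nearest-rank percentile, p in 0..100."""
    s = sorted(xs)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

# Print the whole curve rather than a single number.
for p in (25, 50, 70, 90, 95, 99):
    print(f"p{p}: {percentile(samples, p):7.1f} ms")
```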

Mentally, I try to think of the 80 or 90% of users with a similar experience,
needs, etc., and make it better for them. In this case, speed is good for
everybody, but I care very little about the needs of that last 1% if your
customers are all paying the same. There's no sense in putting the needs of a
small number of users in front of the needs of a much larger set of users.

~~~
jamesaguilar
50-90-99 is my go-to set. I really don't give a rip about the ops that are
faster than the median. Seeing those is just me patting myself on the back.

------
krmmalik
Big fan of Optimizely here. I've used their product with a handful of my
clients. What struck me most about the article was how well it was written:
very engaging all the way through. That quality of writing is, I would say,
quite rare.

Anyway, I'm really glad that they've improved the load times for their
snippet, because this issue is always a genuine concern that needs resolving.

------
res0nat0r
Anyone at AWS should know the phrase TP99. That is used all of the time to
measure the 1% and is something they are very concerned with.

~~~
alttab
Amazon is all about bringing the best value to the most customers, so even
accepting a worse mean in order to bring down the p99 is worth it for most
customers, especially when that correlates directly to sales.

------
durbatuluk
Sadly, it's hard to say anything without values on the axes. Is the difference
between the mean and the 99th percentile 5 s or 20 ms? As someone said here, a
threshold for "slow loading" should be chosen before picking a metric to
measure it. Plot the distribution, draw a line where users start complaining
about slow loading, and check how many fall below and above it. If the number
of users below the threshold is sufficiently greater than the number above it,
you shouldn't be so worried. How much greater counts as "sufficient" can be
derived from the standard error of the threshold measurement.
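The thresholding idea can be sketched in a few lines (the 1000 ms threshold and the load times below are made up for illustration):

```python
# Hypothetical: users start complaining above 1000 ms.
threshold_ms = 1000
load_times_ms = [120, 250, 300, 480, 520, 700, 950, 1100, 2400, 5000]  # made up

below = sum(1 for t in load_times_ms if t <= threshold_ms)
above = len(load_times_ms) - below
print(f"{below} users under the threshold, {above} over")  # 7 under, 3 over
```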

------
noelwelsh
Regarding mean vs. 99th percentile etc.: in this case all you care about is
whether loading the script delayed page rendering enough to be perceptible to
the user. It's basically a step function, so the 99th percentile is
appropriate here.

Want to do it yourself? This talk by Etsy a few weeks ago has some detail on
how they did a similar thing:

[http://www.slideshare.net/marcusbarczak/integrating-multiple...](http://www.slideshare.net/marcusbarczak/integrating-multiple-cdn-providers-at-etsy)

Some links at the end of the talk. Infrastructure wise, I think you have to be
prepared to pay for some expensive DNS before this kind of thing is viable.

~~~
dsiroker
You are exactly right and this presentation is spot on.

------
josephscott
The problem with using average for many performance stats is that it hides
issues. There is a great paper on the topic -
[http://method-r.com/downloads/doc_details/44-thinking-clearl...](http://method-r.com/downloads/doc_details/44-thinking-clearly-about-performance)

It is only about 13 pages, making it a quick but very informative read. I
highly recommend it for anyone trying to measure performance, throughput,
response time, efficiency, skew and load.

------
michaelbuckbee
I wish they talked more about how they had combined Akamai and Edgecast -
seems like a very useful and effective technique.

~~~
dsiroker
Check out the whitepaper, it has a lot of details about that. :)

Direct link:
[http://pages.optimizely.com/CDNBalancingWhitepaper_GeneralLP...](http://pages.optimizely.com/CDNBalancingWhitepaper_GeneralLP.html)

~~~
daurnimator
Care to give a summary so I don't have to give them personal details?

~~~
rcsorensen
[http://pages.optimizely.com/rs/optimizely/images/CDN_Balanci...](http://pages.optimizely.com/rs/optimizely/images/CDN_Balancing_Whitepaper.pdf)

""" At the highest level, a “balanced” CDN architecture is one that leverages
two or more CDNs hosting identical content to increase (a) the overall number
of physical Points of Presence for the network (PoPs) and (b) the proximity of
those PoPs to the end users accessing them around the world """

------
odonnellryan
That's not true.

I worked at a place that ONLY cared about the longest response time. Imagine!
They ignored everything else!

~~~
evincarofautumn
Well, bringing down the maximum response time is a good goal. Really, if _all_
your responses are fast enough, then it doesn’t matter if most of them are on
the slow end of that range.

------
amikula
I think in Optimizely's case, the most important factor is making sure that
there's no statistically significant correlation between higher response times
and the A/B testing itself. In other words, if the higher response times
result in an imbalanced impact on the test, the test is invalid.

~~~
dsiroker
You are right that if one variation takes noticeably longer to load, it is
likely to cause a lower conversion rate. However, with Optimizely the
response time (on average and at the 99th percentile) is the same between
control and test variations, since our implementation is a single snippet of
code regardless of which bucket you are put into.

What we are optimizing for here is the end-user experience: minimizing the
chance that users see a higher-than-tolerable response time regardless of the
variation they see.

------
d4rti
I've used Apdex [1] before to get a better measure of response times for
user experience.

1: [http://apdex.org/](http://apdex.org/)
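The Apdex score is simple enough to compute by hand: for a target threshold T, responses up to T count as "satisfied", responses up to 4T count as "tolerating" at half weight, and the rest are "frustrated". A small sketch with made-up response times:

```python
def apdex(times, t):
    """Apdex score for threshold t: (satisfied + tolerating/2) / total."""
    satisfied = sum(1 for x in times if x <= t)
    tolerating = sum(1 for x in times if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(times)

# Made-up response times in seconds, scored against a 1-second target.
times = [0.2, 0.3, 0.5, 0.9, 1.4, 2.1, 3.8, 6.0]
print(apdex(times, t=1.0))  # (4 + 3/2) / 8 = 0.6875
```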

------
zcarter
Step one (always): Look at your data.

Only then should you choose the statistic(s) you 'care' about.

------
tairizzle
This was a very insightful read.

------
binarymax
The most misleading measure of almost everything: Average

