
How Not to Measure Latency [pdf] - bracewel
http://www.azulsystems.com/sites/default/files/images/HowNotToMeasureLatency_LLSummit_NYC_12Nov2013.pdf
======
KayEss
Video of a more recent version:
[https://www.youtube.com/watch?v=lJ8ydIuPFeU](https://www.youtube.com/watch?v=lJ8ydIuPFeU)

------
lordnacho
Can someone explain in words what the coordinated omission problem is?

Is it that long samples tend to kick other samples out of the window, messing
up your stats?

~~~
meteorfox
The explanations here give a lot of detail on the effect but, IMHO, not as much on the cause of Coordinated Omission (CO). Most of what I'll be saying here comes from a CMU paper titled "Open vs Closed: A Cautionary Tale"[1] and from Gil Tene's talk.

First, some terminology that I think is important for the discussion. When I say 'job', this could be something like a user, an HTTP request, an RPC call, a network packet, or some other task the system is asked to do and can accomplish in some finite amount of time.

Closed-loop system, aka closed system - a system where new job arrivals are only triggered by job completions. Examples are an interactive terminal, or batch systems like a CI build system.

Open-loop system, aka open system - a system where new job arrivals are independent of job completions. Examples are requests for the front page of Hacker News, or packets arriving at a network switch.

Partly-open system - a system where new jobs arrive by some outside process as in an open system, and every time a job completes there is a probability _p_ that it makes a follow-up request, and a probability _(1 - p)_ that it leaves the system. Examples are web applications, where users request a page and make follow-up requests, but each user is independent, and new users arrive and leave on their own.
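
A toy sketch of that partly-open arrival process (the function and parameter names are mine, purely for illustration):

    import random

    def partly_open_arrivals(arrival_rate_hz, p, duration_s):
        # Outside arrivals come in at a fixed rate (the open part).
        # Each completed job issues a follow-up request with
        # probability p (the closed part), so every arrival spawns a
        # geometric number of follow-ups.
        jobs = 0
        for _ in range(int(arrival_rate_hz * duration_s)):
            jobs += 1                     # a new user arrives
            while random.random() < p:    # follow-up request chain
                jobs += 1
        return jobs

With p = 0 this degenerates into a pure open system, and as p approaches 1 the behavior looks more and more like a closed one.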

Second, workload generators (e.g. JMeter, ab, Gatling, etc.) can be classified similarly. Workload generators that issue a request and then block waiting for the response before issuing the next request are based on a closed system (e.g. JMeter[2], ab). Generators that keep issuing requests on their own schedule, regardless of the system's throughput, are based on an open system (e.g. Gatling, wrk2[3]).
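
The difference is easiest to see as two load loops. A minimal sketch, where send_request and send_request_async are hypothetical stand-ins for a real client (and where, in the open loop, each response would be timed whenever it eventually arrives):

    import time

    def closed_loop(send_request, n):
        # Closed system: the next request goes out only after the
        # previous response arrives, so a slow server slows the
        # generator down with it.
        latencies = []
        for _ in range(n):
            start = time.time()
            send_request()               # blocks until the response
            latencies.append(time.time() - start)
        return latencies

    def open_loop(send_request_async, n, rate_hz):
        # Open system: requests go out on a fixed schedule, no matter
        # how slowly responses come back.
        start = time.time()
        for i in range(n):
            wake = start + i / rate_hz
            time.sleep(max(0.0, wake - time.time()))
            send_request_async()         # does not wait for a response

The crucial detail is that the open loop sleeps until each request's scheduled send time, not until the previous response arrives, and (as wrk2 does) latency should be measured from that scheduled time.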

Now, CO happens whenever a workload generator based on a closed system is used against an open or partly-open system, and the throughput of the system under load drops below the intended injection rate of the workload generator.

For the sake of simplicity, assume we have an open system, say a simple web page, where users arrive according to some probability distribution, request the page, and then 'leave'. Assume the arrival process is as simple as possible: the probability _p_ that a request arrives each second is 1.0.

In this example, if we use a workload generator based on a closed system to simulate this workload for 100 seconds, and the system under load never slows down, continuing to serve every response in under 1 second, say always in exactly 500 ms, then there is no CO. In the end we will have 100 response-time samples of 500 ms, and all the statistics (min, max, avg, etc.) will be 500 ms.

Now, say we use the same workload generator at an injection rate of 1 request/s, but this time the system under load behaves as before for the first 50 seconds, with responses taking 500 ms, and then stalls completely for the last 50 seconds.

Since the system under load is an open system, we should expect 50 samples of 500 ms, and 50 samples whose response times decrease linearly from 50 s down to 1 s: the requests that arrive during the stall all complete when it ends, so the request sent at t=50 s waited 50 s and the one sent at t=99 s waited 1 s. The statistics then would be

min=500ms, max=50s, avg=13s, median=0.75s, 90%ile=40.1s

But because we used a closed-system workload generator, our samples are skewed. Instead, we get 50 samples of 500 ms and only 1 sample of 50 seconds! This happens because the injection rate is throttled down by the response rate of the system. As you can see, this is not even the workload we intended, because our workload generator essentially backed off when the system stalled. The stats now look like this:

min=500ms, max=50s, avg=1.47s, median=500ms, 90%ile=500ms.
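
A quick sketch to reproduce both sets of numbers (the percentile helper interpolates linearly between ranks, the same convention numpy.percentile uses):

    def pct(samples, p):
        # p-th percentile with linear interpolation between ranks.
        s = sorted(samples)
        i = (p / 100.0) * (len(s) - 1)
        lo = int(i)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (i - lo) * (s[hi] - s[lo])

    # What an ideal open (constant-rate) generator would record:
    open_samples = [0.5] * 50 + [float(t) for t in range(50, 0, -1)]
    # What the blocking, closed generator actually records:
    closed_samples = [0.5] * 50 + [50.0]

    for name, s in [("open", open_samples), ("closed", closed_samples)]:
        print(name, min(s), max(s), sum(s) / len(s),
              pct(s, 50), pct(s, 90))

The closed run produces 51 samples instead of 100: the 49 requests that should have gone out during the stall were never issued. That is the omission, and it is coordinated because the back-off happens exactly when the system is at its worst.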

[1] [pdf] [http://repository.cmu.edu/cgi/viewcontent.cgi?article=1872&context=compsci](http://repository.cmu.edu/cgi/viewcontent.cgi?article=1872&context=compsci)

[2] [http://jmeter.512774.n5.nabble.com/Coordinated-Omission-CO-possible-strategies-td5718456.html](http://jmeter.512774.n5.nabble.com/Coordinated-Omission-CO-possible-strategies-td5718456.html)

[3] [https://github.com/giltene/wrk2](https://github.com/giltene/wrk2)

~~~
hyperpape
Thanks for the paper!

On classifying testing tools as open/closed: I've been able to use JMeter to simulate open requests against very heavy endpoints [1] by not having individual threads loop, increasing the number of threads, and using the ramp-up feature.

I suspect that this wouldn't work for testing something that can handle large
amounts of traffic, but there are cases where you can fit a squarish peg into
a somewhat round hole.

[1] Hundreds of milliseconds, or whole seconds, for endpoints that do a lot of work (both CPU and IO) per request.

------
federico3
Is what the author calls a "Percentile Distribution" the same thing as a probability density function?

