Is it that long samples tend to kick other samples out of the window, messing up your stats?
However, this statistic is misleading. Imagine a system that responds in 1 second to 98 of the requests, in 2 seconds to the second-worst request ("99th percentile latency") and in 100 seconds to the worst request. If this system were to process a continuous stream of requests - e.g. 100 requests from visitors to a web site - half of all visitors would arrive while the system was processing the worst-case request (which takes 100s of a total 200s of run time). So the customer with the 99th-percentile worst experience will experience about 100s of latency, not 2s!
E.g. the author's https://github.com/giltene/wrk2 tries to avoid this "coordinated omission" problem by sending requests at a constant rate, instead of constant-time-after-response.
(Azul Systems competes against HotSpot JVMs with garbage collectors tuned for throughput, which suffer from occasional long pauses; Azul sells a rather impressive JVM which largely avoids such long pauses, and needs to educate its customers on how to benchmark it properly.)
For example, if I have a system that alternates between 100 requests/second for 1 second and 5 request/second for 1 second, reporting distribution per request would yield numbers like:
Average latency: 2000 ms / 105 ~ 19 ms;
90th percentile: 10 ms.
These metrics aren't wrong… but they're not very useful if you're interested in latency rather than throughput. From a latency standpoint, I'd rather know the distribution of latency based on something like: if I choose to kick off a request at a random point in the next minute/hour, what latency can I expect on average? what's the 90th percentile in terms of random request time? I'd find something like:
Average latency: .5 * 10 ms + .5 * 200 ms = 105 ms;
90th percentile: 200 ms.
The difference with percentiles is key, in my view. With event-based sampling, you can significantly improve percentiles by focusing on the cases that are already good (and that's exactly what we try to avoid with percentiles): a system with 200 rps for 1 sec, then 5 rps for 1 sec now has a 90th percentile latency of... 1000 / 200 = 5 ms! Even worse, we can also improve the 99th percentile simply by processing fewer events in the slow region.
In the extreme case, the system is so completely useless during its "slow" phase that it processes 0 events… and the information we gather now represents only the fast phase, even when the system is in slow mode 99% of the time.
TL;DR: Pay attention to the X axis in distribution data! Is it actually what you want to sample from to characterise your system?
...Except, it turns out that the requests were spending 6330ms queued in your load balancer because ... well, it doesn't really matter why...
Something is coordinating the flow so that what you're measuring is not going to see the full picture.
RapGenius v. Heroku comes to mind: http://genius.com/James-somers-herokus-ugly-secret-annotated
First, some terminology which I think is important for the discussion, also when I say 'job' this could be something like a user, HTTP request, RPC call, network packet, or some sort of task the system is asked to do, and can accomplish in some finite amount of time.
Closed-loop system, aka closed system - is a system where new job arrivals are only triggered by job completions, some examples are interactive terminal, batch systems like a CI build system.
Open-loop system, aka open system - is a system where new job arrivals are independent of job completions, some examples are the requesting the front page of Hacker news, or arriving packets to a network switch.
Partly-open system - is a system where new jobs arrive by some outside process as in an open system, and every time a job completes there is a probability p it makes a follow-up request, or probability (1 - p) it leaves the system. Some examples are web applications, where users request a page, and make follow-up requests, but each user is independent, and new users are arriving and leaving in their own.
Second, workload generators (e.g. JMeter, ab, Gatling, etc) can also be classified similarly. Workload generators that issue a request, and then block to wait for a response before making the next request are based on a closed system (e.g. JMeter, ab). Those generators that continue to issue requests independently of the response rate, regardless of the system throughput, are based on an open system (e.g. Gatling, wrk2)
Now, CO happens whenever a workload generator based on a closed system is used against an open system or partly open system, and the throughput of the system under load is slower than the injection rate of the workload generator.
For the sake of simplicity, assume we have an open system, say a simple web page, where multiple users arrive by some probability distribution and simply request the page, and then 'leave'. Assume the arrival probability distribution is uniform, where the p is 1.0 that a request will arrive every second.
In this example, if we use a workload generator based on a closed system to simulate this workload for 100 seconds, and the system under load never slows downs so it continuous to serve a response under 1 second, say that is always 500 ms. Then there's no CO here. In the end, we will have 100 samples of response times of 500ms, all the statistics (min, max, avg, etc) will be 500ms.
Now, say we are using the same workload generator at an injection rate of 1 request/s, but this time the system under load for the first 50 seconds will behave as before with responses taking 500 ms, and for the later 50 seconds the system stalls.
Since the system under load is an open system, we should expect 50 samples of response times with 500 ms, and 50 samples where response times linearly decrease from 50s to 1s. The statistics then would be
min=500ms, max=50s, avg=13s, median=0.75s, 90%ile=45.05s
But because we used a closed system workload generator, our samples are skewed. Instead, we get 50 samples of 500ms and only 1 samples of 50 seconds! This happens because the injection rate is slowed down by the response rate of the system. As you can see this is not even the workload we intended because essentially our workload generator backed off when the system stalled. The stats now look like this:
min=500ms, max=50s, avg=1.47s, median=500ms, 90%ile=500ms.
On classifying testing tools as open/closed, I've been able to use JMeter to simulate open requests against very heavy endpoints  by not having individual threads loop, increasing the number of threads and using the ramp-up feature.
I suspect that this wouldn't work for testing something that can handle large amounts of traffic, but there are cases where you can fit a squarish peg into a somewhat round hole.
 100s of ms or seconds of time for endpoints that do a lot of work (both CPU and IO) per request).