
Throughput vs. Latency and Lock-Free vs. Wait-Free - ingve
http://concurrencyfreaks.blogspot.com/2016/08/throughput-vs-latency-and-lock-free-vs.html
======
ChuckMcM
This is an exceptionally important concept when building web services. I had
tacitly understood this but the lead engineer at Blekko really helped me
internalize what it meant. My first test question now to someone who asserts
they understand the performance of their web-based product is: what is the 99th
percentile latency? Answers start with "huh?" (ok, this person hasn't thought
too hard about the problem), "the average is ..." (this person measures latency
but doesn't really understand that you can't infer very much useful
information from the average), "the median is ..." (this person at least gets
that latency can be all over the map and so they are trying to identify the
midpoint), and "it's xx ms", which is someone who realizes that by starting at
the 99th percentile value they can reason about how the service is working in
its entirety.
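
(A minimal sketch of the point, with made-up numbers: the mean of a tail-heavy
latency distribution looks reassuring, while the p99 shows what the slowest
one-in-a-hundred request actually experiences.)

    import random

    # Hypothetical latency samples in ms: mostly fast, with a heavy tail.
    latencies = ([random.expovariate(1 / 20) for _ in range(9900)]
                 + [random.uniform(500, 2000) for _ in range(100)])

    def percentile(samples, p):
        """Return the p-th percentile (0-100) by sorting the samples."""
        ordered = sorted(samples)
        return ordered[int(round(p / 100 * (len(ordered) - 1)))]

    print("mean  ", sum(latencies) / len(latencies))  # looks harmless
    print("median", percentile(latencies, 50))
    print("p99   ", percentile(latencies, 99))        # tells the real story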

~~~
nkurz
Maybe, but even this still feels incomplete. Missing from this (and ironically
missing from the linked article, given that it's part of the blog's title) is
the notion of concurrency. Whether an xx ms 99th percentile latency is good or
bad depends very strongly on how many parallel requests are in flight. In a
single-threaded situation where a slow response holds up all other requests, a
few slow responses might be a disaster, but if it just means that an
individual user has to wait half a second longer, it might be a non-issue.

I agree with the blog author's point that characterizing the entire latency
distribution is more valuable than looking at an individual statistic, but
it's difficult for me to understand how this blog post can avoid talking
about the equally important degree of parallelism. I'm probably a broken
record, but I think Little's Law is one of the most under-utilized concepts in
system analysis and design.
[http://web.mit.edu/sgraves/www/papers/Little's%20Law-Publish...](http://web.mit.edu/sgraves/www/papers/Little's%20Law-Published.pdf).

~~~
infinite8s
Actually, concurrency makes it worse. Say your site has a median latency of
under a second but a p99 of 10 seconds for any particular request, and your
site needs to make 100 requests (not unheard of with modern sites) for any
particular page. This means that now almost 99% of your users will be affected
by the p99 latency. This is usually a big issue because most systems are
designed with median or mean latency in mind, so the p99 is usually orders of
magnitude greater.

~~~
devbug
Mathematically, 1-(1-0.99)^100 = ~1.

~~~
dmit
Not quite. What you calculated is the probability that visitors will _not_ hit
the p99 case on _all_ their requests. In other words, the chance of getting a
p99 response on all 100 requests is infinitesimal.

The chance of getting at least one p99 response out of a hundred is 1 -
0.99^100 ≈ 63%.

[https://latencytipoftheday.blogspot.com/2014/06/latencytipof...](https://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-most-page-loads.html)

~~~
forgotpwtomain
This only holds if the requests are independent - but they almost certainly are not.

If you need X requests served for a full page load (or however many are needed
for the primary functionality of the page), you should be totaling those
request times into a single entry for your histogram; otherwise you are
calculating the distribution of something potentially useless (e.g. 50% of the
requests for page_is_broken_without_it.js might actually land in the overall
p99 even though that file is only 1% of your requests, and the generic
histogram might still look deceptively good).
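
A sketch of the difference with invented numbers (summing assumes the requests
are effectively serialized; use the max or the critical path if they run in
parallel):

    import random

    def request_latency():
        # Hypothetical per-request latency in ms: 99% fast, 1% slow tail.
        return random.uniform(1, 10) if random.random() < 0.99 else random.uniform(200, 500)

    per_request, per_page_load = [], []
    for _ in range(1000):                     # 1000 page loads, 100 requests each
        times = [request_latency() for _ in range(100)]
        per_request.extend(times)             # generic histogram: one entry per request
        per_page_load.append(sum(times))      # one entry per full page load

    def p99(samples):
        return sorted(samples)[int(0.99 * (len(samples) - 1))]

    print("per-request p99  :", p99(per_request))    # looks fine
    print("per-page-load p99:", p99(per_page_load))  # what the user actually feels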

------
matt4077
And wrk2 ([https://github.com/giltene/wrk2](https://github.com/giltene/wrk2))
is the way to measure it.

~~~
Groxx
So most benchmarking tools _don't_ do this? Any idea if `ab`[1] does? I've
been assuming that they all fired off requests at constant rates, 'cuz yeah,
otherwise it's pretty inaccurate compared to real-world load.

[1]:
[https://httpd.apache.org/docs/2.2/programs/ab.html](https://httpd.apache.org/docs/2.2/programs/ab.html)

~~~
matt4077
I just checked wrt and as it turns out, they did indeed add it in the
meantime. Wrt2 was about the only one when I need it a year or two ago. Most
people here seem to be working with so much traffic that they primarily care
about throughput, but I live of about 300 pageviews/day. With the way that
people and google react to latency, it's a good day when I can shave 10ms of
every request.

(Can't wait for ruby->elixir transition to be complete. So far it seems to
safe about 100ms).

------
emmelaich

       "the inverse of the mean of the latency
        distribution is [...] the throughput!
    

i.e. Little's Law?

~~~
honkhonkpants
No, that's not Little's Law. Little's Law describes the depth of a queue, not
its throughput.

In the original article, this assertion is wrong except under special
conditions that aren't described in the article. In computer systems you can
have a latency of X and a throughput of Y, Z, or any number with no particular
relation to X. Consider for example if I have a system with a throughput of
1000/s and a latency of 1ms, which satisfies the given relation but for no
particular reason. Now suppose I add at the beginning of every request a
1-hour delay. Now my latency is 3600001ms, but my throughput is still 1000/s.
The only thing that has changed is the number of requests in flight.

~~~
rawnlq
If you assume the queue is bounded (which is true for real systems), then
Little's Law does hold:

Number of requests in queue = latency * throughput

I.e., the left-hand side is a constant if you assume the system is fully
utilized or constantly under max load.
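
Plugging in the numbers from the comment above (1000/s throughput, 1 ms
latency, then an added 1-hour delay) makes the point concrete:

    # Little's Law: requests in flight L = throughput (lambda) * latency (W)
    throughput = 1000             # requests per second, unchanged in both cases
    print(throughput * 0.001)     # 1 ms latency       -> 1 request in flight
    print(throughput * 3600.001)  # 1 h + 1 ms latency -> ~3.6 million in flight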

------
trienism
I'm thinking about this problem in the context of a data ingestion API where
the requests are processed asynchronously.

The latency from the client to the backend is not really an issue so long as
the request is processed at some point. In this situation, I'm looking at the
throughput and cpu usage to see how far an instance can scale.

Am I missing something, or does latency not really play a large role in this
particular scenario?

~~~
vikiomega9
Well, for your setup latency has no effect. The constraint here is how quickly
you can process incoming requests. Throughput on the system would simply
depend on how many resources you have processing those requests.
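
A rough back-of-the-envelope with invented numbers, assuming CPU is the only
bottleneck (it often isn't) and throughput scales roughly linearly with CPU:

    current_throughput = 500   # requests/second an instance handles right now (hypothetical)
    cpu_utilization = 0.40     # fraction of CPU used at that rate

    estimated_capacity = current_throughput / cpu_utilization
    print(estimated_capacity)  # ~1250 requests/second before the instance saturates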

------
zzzcpan
> there is typically a tradeoff when it comes to throughput vs latency

I don't think it's that simple. Unless the algorithm does something
periodically that blows up tail latency, the one with better throughput might
do better on tail latency as well - at least under the same load, which matters
more than its latency under the higher load it can handle.

~~~
Jweb_Guru
I think it would be more accurate to say one can often break a problem into a
deterministic one with commuting stages, producing a fork-join algorithm that
would be optimal if each step could be guaranteed to synchronously take
exactly the same amount of time, all hardware resources were truly saturated,
and there was no communication overhead between parallel workers.

For all of the above to be true, one would probably have to run the algorithm
in serial on data that fit in L1, on a processor with no pipelining. In
practice, a common strategy is to make the "horizontal" length of the stages
larger (fitting more problem instances into each stage, for example) to reduce
the overall percentage of the time that's taken up by contention and
communication overhead (or correlated problems like spurious aborts in the
optimistic concurrency paradigm).

Obviously, there are other tradeoffs that can be made to improve both latency
_and_ throughput (especially if you're dealing with a streaming problem and
can employ latency-hiding techniques like prefetching that exploit
underutilized hardware), but I hope you'll agree that that particular sort of
tradeoff (which is extremely common) does manifest as a latency-for-throughput
tradeoff; pipelining itself is a great example of that. Another way to say the
same thing is that many algorithms improve throughput by deprioritizing
fairness; usually extremely simple algorithms with small constants are going
to beat more complicated ones with larger constants unless you make the
problem sizes large enough, so to make those more complex algorithms useful
you often have to "artificially" increase the problem size.
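
A toy model of that latency-for-throughput tradeoff (all numbers invented):
amortizing a fixed per-dispatch overhead over a larger batch raises throughput,
but every item now waits for the whole batch to complete:

    # Toy batching model: each dispatch pays a fixed overhead (synchronization,
    # communication, cache misses, ...) plus per-item work; items in a batch
    # all wait until the whole batch finishes.
    overhead_ms = 1.0   # hypothetical fixed cost per dispatch
    work_ms = 0.1       # hypothetical cost per item

    for batch in (1, 10, 100, 1000):
        batch_time_ms = overhead_ms + batch * work_ms
        throughput = batch * 1000 / batch_time_ms   # items per second
        latency_ms = batch_time_ms                  # what each item observes
        print(f"batch={batch:5d}  throughput={throughput:8.0f}/s  latency={latency_ms:7.1f} ms")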

> the one with better throughput might do better on tail latency as well - at
> least under the same load, which matters more than its latency under the
> higher load it can handle.

If algorithm A performs worse at "the same load" than algorithm B, what sane
benchmark is going to say algorithm A has better latency? Either there's a
point where that ceases to be the case, or we can just say B has better
latency _and_ throughput.

------
jwatte
The whole reason to use lock-free/wait-free algorithms is to minimize latency.
Else you can use cheap locks and be done with it!

And for latency, the worst case is what matters. With 10 occurrences per macro
operation, about half of all operations will see 90th percentile latency at
least once!
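
The arithmetic behind that claim, assuming the 10 sub-operations are independent:

    # Probability that a macro operation of 10 sub-operations sees at least
    # one p90 (slowest-10%) latency:
    print(1 - 0.9 ** 10)   # ~0.65, i.e. roughly half to two thirds of macro operations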

