I know it's not the main point of the article, but the "Bigger Holding Space" results are surprising to me. The idle time barely changes at all, and even the response time change is very moderate (<10% increase), while the rejection rate drops three orders of magnitude(!). I suppose this is somehow an artifact of the particular distributions chosen in the example? Also, intuitively, tail latency (like p99) would suffer more when you add queues; I guess figuring that out analytically is more difficult...
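For what it's worth, a quick back-of-the-envelope check with the textbook M/M/1/K formulas reproduces exactly that pattern (parameters are made up by me, not necessarily the ones from the article):

```python
# Closed-form M/M/1/K metrics (standard textbook results, assuming rho != 1),
# with service rate mu = 1 and utilization rho = lambda / mu.
# Illustrative parameters only, not taken from the article.

def mm1k(rho, K):
    norm = 1 - rho ** (K + 1)
    p0 = (1 - rho) / norm               # P(system empty) = server idle fraction
    pK = (1 - rho) * rho ** K / norm    # P(system full)  = rejection rate (PASTA)
    # Mean number in system, then mean response time via Little's law,
    # using the effective (accepted) arrival rate rho * (1 - pK).
    L = rho / (1 - rho) - (K + 1) * rho ** (K + 1) / norm
    W = L / (rho * (1 - pK))
    return p0, pK, W

for K in (5, 20):  # small vs. large holding space
    p0, pK, W = mm1k(0.5, K)
    print(f"K={K:2d}  idle={p0:.3f}  reject={pK:.2e}  response={W:.3f}")

# K= 5  idle=0.508  reject=1.59e-02  response=1.839
# K=20  idle=0.500  reject=4.77e-07  response=2.000
```

Under these made-up numbers, anyway, the pattern holds: once the buffer is big enough that overflow is rare, extra slots barely move the mean response time (+9% here) but keep cutting the rejection rate geometrically.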
Most queues are nearly empty most of the time. (Consequently, most queues have oversized limits.)
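To quantify "nearly empty": assuming a plain M/M/1 model (my assumption, matching the exponential setup mentioned below), the number-in-system distribution is geometric in the utilization $\rho$:

$$P(N \ge n) = \rho^n$$

so at $\rho = 0.5$ the system holds five or more requests only about 3% of the time.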
The only times I've experienced queues grow beyond "nearly empty" are either:
1. When there is high request-to-request variability. E.g. one request triggers a garbage collection or something pathological in the database or another shared resource that blocks all other requests from processing. This can quickly build up a serious queue.
2. When the system is deliberately under-provisioned for peak loads (e.g. afternoons for online stores), with the expectation that once the peak subsides, things will be quiet enough to work through the backlog.
This model does not deal with case 1 (since the exponential distribution has comparatively little variance) and is meant to represent a system where case 2 isn't a factor either.
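To put a rough number on how much case 1 matters, the Pollaczek–Khinchine formula for an M/G/1 queue relates the mean wait to the variability of service times. This is a sketch with assumed parameters, not the model from the article:

```python
# Pollaczek-Khinchine mean waiting time for an M/G/1 queue:
#   W_q = rho * E[S] * (1 + C^2) / (2 * (1 - rho))
# where C is the coefficient of variation of the service time S.
# Exponential service times have C = 1; parameters are illustrative.

def pk_wait(rho, c, mean_service=1.0):
    return rho * mean_service * (1 + c ** 2) / (2 * (1 - rho))

print(pk_wait(0.5, 1))   # exponential service:         mean wait 1.0
print(pk_wait(0.5, 10))  # occasional pathological job: mean wait 50.5
```

At the same utilization, a tenfold increase in service-time variability multiplies the mean wait by roughly 50x, which is why an occasional GC pause or pathological query can build a serious queue that the exponential model never shows.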
----
In terms of tail latency: I agree! That might end up being a future article because it would have been too much for this one.
"If what you need is a fast system, don't start with a slow system and try to load balance or queue your way out of your problems. Design a fast system from the start."
No, but I am curious and will once I can find the time! It doesn't appear to bear much resemblance to the TL;DR, though, since it assumes the service rate is inflexible.