Queueing theory is an interesting discipline because the benefits of being a queue-theory expert are extremely subtle. There are some pressures that make it hard to show off:
- There aren't really any flashy results (the whole thing could be sold as restatements of "if the average processing rate is close to the arrival rate, any variance in arrivals will lead to long queues").
- As a corollary, most queues encountered in practice can also be dealt with using very simple techniques or by making the queue negligible, neither of which requires the study of queue theory.
- The quick recommendations are really boring (if you want the queue to get shorter, you need to process faster or add more servers).
But studying queue theory is handy nonetheless, because it turns out that queues are, in practical systems, about as common as list data structures or associative maps in programming. They are everywhere -- every time a stream meets a buffer, in fact. Being able to look at a situation and reduce a lot of the noise to (M/M/n, lambda = 0.2, mu = 0.4) can free up a lot of thinking horsepower for more interesting problems. Then there is no need to try to reason about queue lengths vs serving times vs variance vs how those change with the addition of servers. An expert understands that a lot of results flow from a few simple variables, and doesn't have to remember the details because they are just symptoms of a few key observations.
So, in a sense, the reward for knowing a lot of queue theory is not having to think very much about queues.
Something not as obvious is the relationship between service time, utilisation, and response time. Even rough estimates of that can be very helpful, but most people I speak to don't even know there is a useful relationship -- some even think service time and response time are the same.
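A minimal sketch of that relationship for the simplest case -- an M/M/1 queue with mean service time S and utilisation rho (this is just the textbook formula; the specific numbers are picked only for illustration):

    R = S / (1 - rho)        (mean response time)

    e.g. S = 2.5s (mu = 0.4), lambda = 0.2  =>  rho = 0.5, R = 5s
         same S = 2.5s at rho = 0.9         =>  R = 25s

The service time never changed; the response time blew up purely because of utilisation.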
Come to think of it, the obvious general principles of queueing theory have several consequences that aren't immediately intuitive to people:
- A system where concurrency is limited (practically all systems) often bottlenecks on its slowest component, meaning an upgrade to almost any other component will do nothing to improve its performance.
- The waiting time experienced by the average task is longer than the time-averaged waiting time, once there's significant variation (greater than Poisson) in arrivals -- more tasks show up during the busy stretches, so more tasks see the long waits.
- Based on only local measurements in an auxiliary, asynchronous component you can determine global throughput for the whole system.
If we're intellectually honest, how many hours of studying Markov chains does any one insight there really justify? And what are the odds that any one insight is useful even while dealing with an honest-to-goodness queue? We're not exactly talking e^{i\pi} levels of "wow!" which almost justify teaching complex numbers just to hit people with the one equation.
The power is in the sheer number of semi-trivial observations that a queue theorist can start making after seeing only a small part of the system. And that is impressive - but mostly because you don't need to consider all those individual things as variables once the basic theory is understood. So the theorist can start ignoring all those variables really quickly and move on to dealing with the problem at hand.
You should really check out the recent Aperture[1] project on GitHub that applies all these ideas in practice to protect services from cascading failures.
1. Aperture automatically detects queue buildup based on metrics such as latency.
2. It adjusts the concurrency limit on the service accordingly.
3. It applies weighted fair scheduling of workloads (i.e. APIs) based on their labels.
Another fun fact: if instead of considering a distribution on the inputs you consider the worst case, then Queueing Theory becomes Online Algorithms. The techniques there seem very different from Queueing Theory, but I've also heard some Computer Scientists call Online Algorithms boring because the techniques used there are all very similar and the results aren't too surprising (I can't say I share this sentiment).
Back when I was in college, an online algorithm was one where you had to make scheduling decisions as the jobs came in. There would be an optimal schedule (say, for jobs of differing sizes arriving to n CPUs) if you knew the future, and you'd prove things like "this algorithm can hit that optimal time with 1.x as many CPUs" or "this algorithm is within 1.x of optimal".
This also applies to stuff like swapping / expiring cache - if you knew all of the future disk / memory accesses you could hit optimal, otherwise you need some heuristics.
It's been a couple of decades, but I think some of the proofs involved playing the part of an omniscient adversary.
Indeed. One impressive result I remember from that area (if I'm not mistaken) is that the Least Recently Used cache policy with cache size S incurs at most roughly twice as many misses as the optimal offline algorithm with cache size S/2.
I don't remember the details now, but the proof wasn't too complicated, they gave it to us as an exercise.
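For reference, the general bound (Sleator and Tarjan's competitive analysis of paging, as far as I recall -- the additive term may be off) is, for LRU with cache size k against an optimal offline algorithm with cache size h <= k, on every request sequence sigma:

    misses_LRU(k, sigma) <= k / (k - h + 1) * misses_OPT(h, sigma) + const

With k = S and h = S/2 the factor is just under 2, which is roughly the "half the cache" form of the statement.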
Theme parks have made queueing a multi-billion dollar business.
I just spent a nightmarish time in Orlando. The theme parks there are insane, and I don't believe I'll ever go back. 2 of my friends and I literally stood in a two and a half hour queue for one ride and it cost us $150 each.
The ride was terrible, and the queue length was constantly being hidden. The companies have no incentive to increase throughput because building extremely long lines is cheaper than building efficient rides.
It can also be applied in creative ways! In a company's annual report they state revenue for the year as well as a snapshot of accounts receivable. Divide the latter by the former and you get a snapshot of how long it takes the company to collect on its debts!
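That's just Little's law (L = lambda * W) with the receivables as the "queue". A sketch with made-up numbers:

    L      = accounts receivable snapshot  = $50M       (work sitting in the system)
    lambda = revenue for the year          = $600M/yr   (rate at which receivables are created)
    W      = L / lambda = 50/600 of a year ~= 1 month average time to collect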
I feel like this would go hand-in-hand with graph theory, as many problems are something like a graph with a queue at each node, with the goal being to manage traffic through the graph so that each queue is kept as short as possible.
+100. If people are interested in quick-win practical usage, consider its efficacy in system design interviews. E.g. what many people don't realise is that Google system design interviews are subtly testing your ability to model queues :)
Erlang’s formulas are incredibly applicable to things we work on today - and shockingly unknown to many developers.
I’ve used them several times to show that a proposed system is mathematically impossible: “if the backend processes n requests/sec with a max response time of X, and the P95 of the backend is Y, the queue satisfaction is 0%”. People think you’re a wizard while you’re just plugging numbers into a century-old formula.
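The simplest version of that check -- a sketch of the bare stability condition, not necessarily the exact formula being quoted above -- for c servers, arrival rate lambda and mean service time E[S]:

    rho = lambda * E[S] / c
    rho >= 1  =>  the backlog grows without bound, so no response-time SLA can hold
    rho <  1  =>  waits are finite, and Erlang C (or Kingman's formula) tells you how bad they get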
I worked for a Usenet provider for a number of years designing and writing the software that powered their Usenet feeding infrastructure. The Usenet feed is basically a giant stream of articles, and we used queuing theory to scale out the systems that read it in, process it, and distribute it to readers.
My experience with the formulas is a little different from yours. People thought what I showed them couldn't be true. "If the feed is X articles/second and the system's P95 is Y, the backlog will continue to grow," would often be met with, "That can't be true ..."
Totally! I worked for a time, in the mid 00s, at a company that built realtime dashboards for old school telephone call centres (contact centers) and it was there that I became aware of the other (original) meaning of Erlang, beyond the nifty cool up and coming programming language.
For a while, for me, everything was about queuing theory. Since then I keep forgetting and re-remembering things about it.
It strikes me that I would like to have an easy-to-use, generic Rust library with the various formulas etc. in it.
What's the other original meaning? Googling I find no meanings except the programming language, although it was originally used by telecom. But it seems to be the same language originally used by telecom that is now getting use elsewhere, not two languages with the same name or two meanings, no? https://en.wikipedia.org/wiki/Erlang_(programming_language)
I studied queueing theory at school 20 years ago for my CompSci degree. Last year I implemented a QT model (a relatively simple version of M/M/1) to control the queueing at a very large chain of hypermarkets (core mechanism: use historic lambda and mu timeseries to predict future lambda and mu, then reverse engineer the formula to get the number of checkout counters needed to keep the average queueing time below the desired number of seconds). The thing is now used across many countries. Very satisfying payoff after 20 years...
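A minimal sketch of that kind of inversion -- my own toy version in Python, not the poster's production model: Erlang C for an M/M/c queue, then scan the number of counters upwards until the predicted average wait is under the target.

    import math

    def erlang_c(c, a):
        """P(an arrival has to wait) for M/M/c with offered load a = lam/mu (needs a < c)."""
        b = 1.0
        for n in range(1, c + 1):          # Erlang B via the standard recurrence
            b = a * b / (n + a * b)
        rho = a / c
        return b / (1 - rho + rho * b)     # convert Erlang B to Erlang C

    def mean_wait(c, lam, mu):
        """Average time spent queueing in an M/M/c (same time units as 1/lam)."""
        a = lam / mu
        if a >= c:
            return math.inf                # unstable: the queue grows without bound
        return erlang_c(c, a) / (c * mu - lam)

    def counters_needed(lam, mu, target_wait):
        """Smallest number of checkout counters keeping the average wait under target_wait."""
        c = max(1, math.ceil(lam / mu))
        while mean_wait(c, lam, mu) > target_wait:
            c += 1
        return c

    # e.g. predicted 3 customers/min, 2 min average checkout, target 1 min average wait
    print(counters_needed(lam=3.0, mu=0.5, target_wait=1.0))   # -> 8

Feed it the predicted lambda and mu for each upcoming time slot and you get a staffing schedule.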
I'm a queueing theory researcher, just completing my PhD. My advisor, Mor Harchol-Balter, wrote one of the most popular textbooks in the area: http://performancemodeling.org/. If anyone has questions about the field, please ask, I'd love to answer any questions.
That really is a great book, and very approachable! All the examples are practical, the results are derived clearly, and there's a nice build-up.
Here's my question for you: queue theory is a nice tool, but most of the classic results are about systems (like M/M/c) that don't match the real world in important ways. Are there good rules of thumb for thinking about how different changes (e.g. burstier than Poisson, seasonality, constrained queue lengths, etc) map to these results?
Obviously, simulation is a powerful tool for more general systems, but being able to reason about effects quickly is super useful.
For thinking about bursty arrivals, a good rule of thumb is to look at the variance of the inter-arrival times. The key number is the variance of the inter-arrival times divided by the squared mean inter-arrival time (the squared coefficient of variation). The waiting time in a system with bursty arrivals will roughly be larger than in the M/M/c by this multiplicative factor. Kingman's formula is the equivalent for the single-server setting: https://en.wikipedia.org/wiki/Kingman%27s_formula
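For reference, Kingman's approximation for the mean queueing time in a G/G/1 queue is:

    E[Wq] ~= ( rho / (1 - rho) ) * ( (c_a^2 + c_s^2) / 2 ) * E[S]

    rho   = utilisation
    c_a^2 = Var(inter-arrival time) / E[inter-arrival time]^2
    c_s^2 = the same ratio for service times
    E[S]  = mean service time

With Poisson arrivals and exponential service both squared coefficients of variation are 1, and you recover the usual M/M/1 waiting time.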
For seasonality, if the arrival rates fluctuate over a long time period relative to the typical waiting time, it makes sense to just do separate calculations for the different conditions you experience. If the fluctuation is very fast, just use the average arrival rate.
For constrained queue lengths, there are a lot of theoretical results in this area, such as the M/M/c/c model: https://en.wikipedia.org/wiki/M/M/c_queue. The second "c" is the total number of jobs allowed in the system, so M/M/c/c has no waiting room at all and arrivals that find every server busy are dropped; M/M/c/K is the general version with a finite buffer.
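For the pure-loss M/M/c/c case the quantity of interest is the blocking probability, given by the Erlang B formula with offered load a = lambda/mu:

    B(c, a) = (a^c / c!) / sum_{k=0}^{c} (a^k / k!)

i.e. the long-run fraction of arrivals that find all c servers busy and are dropped.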
> For seasonality, if the arrival rates fluctuate over a long time period relative to the typical waiting time, it makes sense to just do separate calculations for the different conditions you experience. If the fluctuation is very fast, just use the average arrival rate.
Thanks, that makes sense. More quantitatively, about where would you set the bar on "very fast"? Is it ~1x the mean interarrival time, or ~1 million x?
By the way, I really enjoyed your "Nudge" paper from last year. The result about FCFS was very surprising to me!
There's a transition zone from "fast" to "slow" fluctuations, around the mean waiting time, that's more complicated and is an area of active research. If the fluctuation timescale is 5x shorter than the mean waiting time, I'd guess the effects of the fluctuation will be gone.
One more: my understanding is that with Nudge I need to know processing times, but with FCFS I don't. How sensitive is your optimality result to errors in processing time estimates? (I don't recall this being covered in the paper, but if it is, feel free to tell me.)
In the cloud services and databases settings, we seldom have accurate processing time estimates until we're quite far down processing a request (post-auth, post-parse, at least, but also for databases post-query-plan).
The only thing we use processing time for in the Nudge paper is to classify jobs as "Large", "Small" or "Other". If instead of exact processing times we had estimates, the result would still work as long as a job that was estimated to be large was typically longer than a job estimated to be small. So Nudge totally works in these more realistic settings.
If the estimates were super noisy, you might be better off using Nudge very sparingly, only when you're more confident about the relative sizes.
I had actually bought Mor's book, and she's absolutely one of the best authors in the field. She takes the notions of giants like Moshe Haviv and Gallager and translates them in a way that is super intuitive and simple. She's one of my role models in academia. Pass her my greetings. I intend to teach a course based on her textbook after I'm done with my PhD (but it's mostly IEEE articles here, so not real science like ACM's SIGs. I wish my PhD were in the awesome topics you guys are working on. I see the papers from Mor and Ziv Scully and the whole lab -- really awesome work! Kudos to you guys. I won't hide my envy of having Mor as a supervisor :D)
What are the most common types of queues for large data center applications (e.g. a stateless app that handles RPCs, running on N machines)? Are they mostly M/M/N? For example, is memcached different from nginx, or Postgres?
While an M/M/n is a good starting point, there are several extra important features to capture.
First, many of these applications have a load-balancing step, where arriving jobs are dispatched to queues at each of the machines. The performance will depend on how this load-balancing is done.
Second, some applications will parallelize across many cores or machines, while others will run each job on a single core or machine. This obviously has major implications for performance.
Third, your application may have variance in interarrival times or in job completion lengths. This is also important to measure and incorporate into a model. Something like Kingman's formula can be useful: https://en.wikipedia.org/wiki/Kingman%27s_formula
As a regular programmer, I got really into queuing theory thinking I was going to learn secrets of performance tuning, then was slightly disappointed.
But it turns out the simple parts of it go a long way! E.g. at work we have a single deployment queue for a monorepo. To a first approximation this is an M/D/1 queue (a deploy job takes roughly the same amount of time every time, though arrivals are actually way spikier than Poisson), and I realized that the wait time is proportional to utilization/(1 - utilization), i.e. inversely proportional to the remaining headroom.
While the infra team was saying “we do X deploys per day out of 4X and are only at 25% capacity”, I realized even hitting 2X would more than double the already bad wait time.
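For the record, the M/D/1 waiting-time formula behind this (D = the fixed deploy duration, rho = utilization) is:

    Wq = ( rho / (1 - rho) ) * D / 2

    rho = 0.25  =>  Wq ~= 0.17 D
    rho = 0.50  =>  Wq  = 0.50 D    (3x worse)
    rho = 0.90  =>  Wq  = 4.5 D

And that assumes Poisson arrivals and perfectly uniform deploy durations; the spikier arrivals we actually had only make it worse.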
What happened is that a few initiatives were under way to increase capacity, but then out of nowhere the queue got logjammed (because of high arrival-rate variability) and we had to switch to GitLab merge trains, which run CI concurrently on the optimistic result of merging. I wrote about it here: https://engineering.outschool.com/posts/doubling-deploys-git...
I’m planning on writing a blog post about the math of CI/CD deploy queues as G/D/1 queues.
For a programmer's view of queuing theory and great performance-testing foundations, I highly recommend "Analyzing Computer System Performance with Perl::PDQ" (don't worry about the Perl, the book is very relevant) http://www.perfdynamics.com/iBook/ppa_new.html, which shows examples of queues inside computer systems and how to model them.
The author has a nice little library to model different computer systems you come across.
I liked realizing that the dreaded “coordinated omission” problem in load testing (when you can only generate X rps and the server can handle more than that, your numbers are bunk) is actually when you think you are modeling an open system but you don’t have enough resources and end up seeing the closed system behaviour.
That's not what coordinated omission is. CO happens when your load generator can do, say, 500 requests/second, and you're aiming for something well below that, e.g. 100, but! the top throughput for the server is 80 requests per second, and instead of building up an infinite queue, the load generator throttles itself to roughly 80 requests per second.
Why would you build a load generator like this? Normally because you run out of threads -- you have 800 parallel requests in flight and you can't open a new one until one has returned.
Correcting for CO takes a mathematical sleight of hand.
That was a neat blog post! Can you clarify this bit: "If a pipeline fails, the associated MR is kicked from the train and all preceding pipelines are restarted." - Do you mean all the pipelines _already started_ are restarted? Because I'm guessing that if a pipeline fails, the preceding pipelines shouldn't be affected, strictly speaking.
I love this book but its focus is primarily on practical applications and not as much on mathematical foundations, which was the question, if I read it right.
The book is comparable to CLRS in mathematical level, in my opinion -- which makes it either an intro book or very dense, depending on a person's math background.
Where does one start with understanding and applying queueing theory to real world problems? I manage a system that receives a highly variable number of requests that are queued for processing by a horizontally scalable number of VMs up to a resource cap. Success of this system is based on the 95p of aggregate processing times over a specified time period being below a certain threshold value. It’s my responsibility to manage the system such that this threshold is always met. I also need to pre-allocate additional system capacity to the right bottlenecks based on predicted increases in demand. So it sounds like knowing queuing theory could be a big help. Where do I begin?
I've implemented queuing simulators in the past, but there are two questions I've always been bugged by (which I feel would be easily answered if I had ever taken proper math courses...):
1. I don't understand how you "know" you're in the steady state, beyond looking at the graph -- I've always just made it run for like 10k steps to force it, but this seems needlessly wasteful. I imagine you could track the derivative, but what if it oscillates?
2. I don't know how to verify whether the distributions are Poisson/exponential in reality, except by plotting them.
On a more relevant note, I've always been surprised by how little exists online on the subject, and how little code it actually requires; I forget the terms, but a multiple-event-class, multiple-agent, infinite-buffer sim took maybe 100 LoC of C# from the stdlib. Maybe 60 LoC in Python -- probably 25% of it was just tracking the stats I wanted, like wait time.
The biggest performance trick is that Poisson arrivals have exponential inter-arrival times, so you can generate all arrival times up front; then instead of simulating every second, you can just skip forward in time to when the events actually occur... so your sim scales with the number of events occurring, rather than the length of time simulated. Pretty sure you can even do arbitrary-length simulations in constant memory, but I never tried.
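A minimal sketch of that style of event-skipping simulation -- my own toy M/M/1 in Python, constant memory because inter-arrival times are drawn on the fly instead of stored up front:

    import random

    def simulate_mm1(lam, mu, n_jobs, seed=0):
        """Event-driven M/M/1 (single FIFO server): returns the average time spent queueing."""
        rng = random.Random(seed)
        clock = 0.0            # time of the current arrival
        server_free_at = 0.0   # when the server finishes its current backlog
        total_wait = 0.0
        for _ in range(n_jobs):
            clock += rng.expovariate(lam)        # jump straight to the next arrival
            start = max(clock, server_free_at)   # wait only if the server is still busy
            total_wait += start - clock
            server_free_at = start + rng.expovariate(mu)
        return total_wait / n_jobs

    # rho = 0.5: theory says the mean queueing time is rho / (mu - lam) = 2.5
    print(simulate_mm1(lam=0.2, mu=0.4, n_jobs=200_000))

The sim only ever touches events, so a year of simulated low-traffic time costs the same as a busy hour with the same number of arrivals.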
Correct; in real life queued objects (or people) are "agents" that exhibit imperfect behavior that doesn't really adhere to a perfect queuing equation. A classic example is motor vehicle throughput: queuing equations might be helpful in broad strokes for coarse estimates, but you need agent-level simulation to accurately predict outcomes.
Simulation, as kqr said. I like to use the classic queue theory results (e.g. results on M/M/1 and M/M/c) as unit tests for the simulators I build. I've found that application of queue theory very helpful in avoiding silly simulation bugs.
This has a lot of very complex mathematics involving stochastic processes and probabilities but results in a bunch of rather simple equations that can help you solve problems like:
"How many operators I need to have at the same time when my telephone center receives 10 calls per minute with an average duration of 30 seconds so that the wait time won't be over 5 minutes."
The fact that such problems can be solved but I still need to wait 2 hours on the phone angers me very much.
Q: How many operators do I need working at the same time, when my telephone center receives 10 calls per minute with an average duration of 30 seconds, so that the wait time won't be over 5 minutes?
A: X, and then the operators will be busy Y% of their time.
Especially if Y is low, many call centers will rethink what SLAs they want to give callers.
Many call centers optimise for keeping their operators busy, aka paying as few operators as possible (and some of them then show their personnel the queue length or average waiting time and expect them to kick out callers more rapidly when that gets too long).
For most people, it’s the same with doctor visits. There is no inherent need to have to wait at a dentist, for example, but if dentists plan to have their day filled with paid work, there has to be. They may even schedule their first patient (possibly even the second) at a time before they plan to start working.
It’s different for people whose time is worth (much) more than that of the dentist. They (effectively) pay their dentist to wait for them.
Anyone building web-scale apps should check out Aperture[1] - a flow control platform that dynamically adjusts concurrency limits on services when queue buildup is detected. Aperture has a weighted fair queuing scheduler that prioritizes workloads when concurrency limits are adjusted.
I learnt about queueing practice on an IBM mainframe (MVS/CICS), mixing it with theory at university. From that point I was very disappointed with RabbitMQ and other technologies way back, because these systems were overpromoted as message brokers but didn't have the basic capabilities of a message broker (contention) [1]. My takeaway from this is: stay alert to overpromotion, and keep real implementations (yes, IBM...) and theory in mind.
Heh. I'm curious how a link to a wikipedia entry made it to the front page, or why it was even submitted at all! Is it Poisson's birthday or something?
Yea, are we going to see a link to Wikipedia's page for Calculus next?
HN is for new information, not wikipedia pages that happen to be interesting but are already common knowledge. Although I guess it does technically fit the guidelines [1].
Perhaps; however, I find decent value in some of these kinds of submissions -- it's in the comments. The book recommendations and benefits of queueing theory mentioned here have already been worth it for me. I love going down technical rabbit holes.
I would assume most programmers have heard of queueing theory, and I thought most HN readers were programmers. The comments look interesting and it's not against the guidelines, I was just surprised to see it on the front page.
M/M/s queues have nice properties, but require that inter-arrival and service times are exponentially distributed. Once you deviate from this assumption, closed-form solutions become difficult to obtain and numerical approximations are quite messy.
I've therefore always wondered whether there is software that uses theoretical results from queueing theory to improve control or to optimize parameters of complex systems. Any references?
...except! If the server is processor-sharing/timeslicing. The results derived for timeslicing servers are surprisingly general in service distribution, and can be stated in product form like the networks of simple M/M/c servers you know.
(Arrivals still need to be Poisson, but that's a fairly nice requirement in that many real-life arrivals actually look Poisson.)
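To make the insensitivity concrete (the standard M/G/1-PS result, stated from memory): with processor sharing, the mean number in system and the mean response time depend on the service distribution only through its mean,

    E[N] = rho / (1 - rho)        E[T] = E[S] / (1 - rho)        (rho = lambda * E[S])

which are exactly the M/M/1 formulas, no matter how heavy-tailed the job sizes are.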
A hotel where a mathematical congress on queuing theory was happening that weekend tried to provide the best service by opening extra check-in counters at registration time, staffed with their most professional receptionists... Guess what happened when the mathematicians arrived?
Surprisingly applicable to modern problems, such as queuing within computer systems. I studied this for my industrial engineering degree and thought it was extremely elegant.