An experiment in elastically scaling a thread pool using a PID controller (github.com/stevana)
109 points by todsacerdoti on March 14, 2023 | 53 comments


Always fun to play with PIDs.

Some notes adding to the author's questions at end:

* In addition to putting a maximum number of threads, you'll probably want to put a maximum on your integral accumulator (oh, and probably a minimum number of threads too).

* If there's no cost in switching threads between different jobs, and all jobs are truly part of a single tree of steps, then you could probably get away with a global controller. However, if you have certain steps with especially unpredictable and/or time-varying requirements, you might consider giving them separate controllers.

* Because queue length can't go below 0, you can get pretty asymmetric behavior in your controller (it can only "unload" excess threads at a given rate). This might be useful behavior, but it might also be worth tuning. You can implement gain scheduling (i.e. picking different PID gains for different states), or change your error measure to something more symmetric.

* Consider adding a dead zone/hysteresis around your zero point.

* The D term is probably not that useful here. As you saw, the addition of the I was the big game changer. Won't go into full control theory, but one way you can think about this is that 'effort' that you're putting in is scaling the number of threads. The time integral of number of threads gives you number of items processed, which is the same units as queue length. If you find a 'measure' and an 'effort' that have the same units, then you might even find that you can skip the I term completely.

* Always worth studying the cost you see (either total resources spent on threads, or the cost of volatility from scaling up and down) versus what the end user sees (wait time). It's possible you can get away with a much simpler setup: an appropriate hard minimum plus a simpler controller (e.g. just P with a maximum rate of change).
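A minimal sketch tying a few of the bullets above together (clamped integral accumulator, min/max thread bounds, and a dead zone around the setpoint). All gains, limits, and names here are illustrative, not taken from the post:

```python
def dead_zone(error, half_width=2.0):
    # Treat small errors around the setpoint as zero so tiny queue
    # fluctuations don't cause the pool to thrash.
    if abs(error) <= half_width:
        return 0.0
    # Shift so the output is continuous at the dead-zone boundary.
    return error - half_width if error > 0 else error + half_width

def pid_step(error, state, kp=1.0, ki=0.1, kd=0.0, dt=1.0,
             i_min=-50.0, i_max=50.0, min_threads=1, max_threads=64):
    error = dead_zone(error)
    # Clamp the integral accumulator so a long stretch of large error
    # can't wind it up to an absurd value (anti-windup).
    state["integral"] = min(i_max, max(i_min, state["integral"] + error * dt))
    derivative = (error - state["prev_error"]) / dt
    state["prev_error"] = error
    output = kp * error + ki * state["integral"] + kd * derivative
    # Clamp the actuator itself: never fewer than min_threads,
    # never more than max_threads.
    return int(min(max_threads, max(min_threads, round(output))))
```

The dead zone keeps tiny queue fluctuations from resizing the pool at all; the two clamps bound both the accumulated history and the actuator output.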


A post about PID controllers without a Bode plot? Honestly the post had a lot of promise and then didn't really get into the interesting bits. Building things up in terms of the proportional term, then adding integral and derivative feedback, is a fun exercise that helps you understand what those terms do (as you pointed out in your comment). OP would benefit from taking a control systems class and not just reading the Wikipedia article.


I did take a control theory class in 2009, but I forgot most of it because I never used it.

Nevertheless I did my best explaining what I can, but if you think you can improve upon it then I'm happy to accept pull requests.


Welcome computer scientists to the bleeding edge control scheme of the 1920’s.

Teasing aside, it always amazes me how slow the diffusion of innovation is across different fields.


> Teasing aside, it always amazes me how slow the diffusion of innovation is across different fields.

I had been dabbling with electronics for a while, so I had heard of PID controllers (as black boxes). Never really thought of them outside of that.

The first time it clicked for me that this could have applications in other fields was when playing a game. Stationeers, to be precise. Sure, the PID controller that a user created was controlling a device in the game, but the actual code was implemented in 'assembly' and very simple to understand. That led me to think that I could use that in place of our simpler autoscalers at work. I have not implemented a solution, but that's how the gears started to turn.

Reading this thread, apparently the field has advanced far past PID controllers. Now it's yet another rabbit hole to go through...


What's a more modern way to control an internal combustion engine + forced induction (turbocharger) than PID?


There is a lot of literature if you search the terms dynamic matrix control, optimal control theory, model predictive control, reinforcement learning.

The main idea is that control problems can be posed as optimization problems where we try to achieve some goal (financial / stability etc) while satisfying some hard restrictions.

For example: try to change lanes gracefully and quickly on the highway without crashing into another car.

For your particular case, if you want something dirt cheap (computationally) and reliable then the PID is a great first step.


My understanding is that PID does not work as well for Multiple Input Multiple Output systems compared with "modern control theory." See the "Analysis techniques - frequency domain and time domain" and "Classical SISO System Design" sections of the Wikipedia article below.

https://en.m.wikipedia.org/wiki/Control_theory


This is an area our team has been working on for the last couple of years — applying control systems principles to prevent metastable failures, overloads and so on.

Blog by DoorDash team - https://doordash.engineering/2023/03/14/failure-mitigation-f...

Project - https://github.com/fluxninja/aperture


What's the point? Saving power?

The processor has fixed resource capabilities; if you run it at the maximum, the only tradeoff is power. If you don't account for power, it's like driving a car with unlimited fuel: why use a PID controller to control how many pistons fire if your fuel is unlimited? Just have the engine run idle always.

I mean there's wear and tear on an engine but this aspect of the analogy doesn't translate to threads.

The only other place where I see this being useful is competing tasks. One task needs more resources from a thread pool shared by other tasks. A PID controller can allocate existing threads based on pressure. Allocating more threads to a pool in this case, though, still doesn't make sense.


> The only other place where I see this being useful is competing tasks. One task needs more resources from a thread pool shared by other tasks. A PID controller can allocate existing threads based on pressure. Allocating more threads to a pool in this case, though, still doesn't make sense.

Imagine you've got 16 CPUs/cores and a 4 stage pipeline, and let's say we want to run one thread per CPU/core for max performance (avoiding context switching). Without knowing anything about the workload: what is the best way to distribute the threads over the stages? You can't tell. Even if you tested all the possible ways to spread the threads over the different stages on some particular workload, then as the workload changes you wouldn't have an optimal allocation anymore. A controller can adjust the threads per stage until it finds the optimal allocation for the current workload, and keep adjusting as the workload changes.


Right, the scenario you describe is exactly WHAT I mentioned: Competing tasks for threads.

This article is NOT exactly about that. It's about allocating more threads to a thread pool. I'm addressing the pointlessness of using PID controllers for allocating new threads.

So your point is mistaken and redundant to mine.


Not exactly about that? It's literally the example from the motivation. The first thing I do in the plan section is to say "Let's focus on a single stage of the pipeline to make things easier for ourselves." and in the contributing section "We've only looked at one stage in a pipeline, what happens if we have multiple stages? is it enough to control each individual stage separately or do we need more global control?".


The example is about allocating threads to threadpools.

It is not about utilizing existing threads in threadpools.

That's where YOU are mistaken. Your example has threads preallocated with core affinity. Which is what I'm saying when I say run all pistons on the engine even when idle.


> What's the point? Saving power?

Saving costs, if you're in a cloud environment.


AWS charges per wattage or per thread? I don't think so. It makes sense if you are the cloud provider yourself, but I don't see this as relevant for most businesses.


It depends on the specific setup you have, but if you have fewer workers, then either you need fewer things that you pay for, or else you can get by on a smaller box.


Makes sense if the workers are instances, but if the workers are threads as implied by the article then there is no benefit.


That's a use case for a PID loop that I never expected to read about. Though, I know more about motion control than CPU thread scheduling.

Maybe I should learn more about kernel internals.


Control system techniques find applications within software all the time. In particular, estimation techniques like Kalman filtering are used to smooth spatial inputs (mouse/touch), file copy progress bars, etc.

One interesting read: https://www.cs.unc.edu/~tracker/media/pdf/SIGGRAPH2001_Cours...
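For instance, a scalar Kalman filter for smoothing a noisy 1D signal fits in a few lines (the process/measurement noise constants q and r here are illustrative and would need tuning for a real input stream):

```python
def kalman_1d(measurements, q=1e-3, r=0.1):
    # Minimal scalar Kalman filter: q is process noise variance,
    # r is measurement noise variance.
    x, p = measurements[0], 1.0   # initial state estimate and its variance
    out = []
    for z in measurements:
        p += q                    # predict: uncertainty grows over time
        k = p / (p + r)           # Kalman gain: how much to trust z
        x += k * (z - x)          # update the estimate toward the measurement
        p *= (1 - k)              # update: uncertainty shrinks
        out.append(x)
    return out
```

The same pattern (predict, gain, update) is what gets applied to mouse/touch input or progress-bar estimation, just with a richer state.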


I expect one of the future frontiers in systems research (OS, compilers, etc) to be in this area: scaling things like file caches, prefetching, memory arenas, gc invocation (and gc types), thread pools, ...

They all need to be regulated, somehow, for best performance. Unfortunately, nobody really knows how to do that except in small isolated cases. What is even more unfortunate: very few of the people working in this area seem to know anything about control theory.


I went down the rabbit hole on control theory when getting into high-power rocketry and electronics as a hobby. I suspect many of the concepts and algorithms have been independently discovered/implemented in high performance computing. It's all about trying to see through the noise and maintain a target in a dynamic environment. It maps pretty well to something like CPU utilization or IO performance or requests/second, etc.


At the end of the day, it's all an optimization problem. Even machine learning can be boiled down to being nonlinear optimization.


Do you have any good references to start learning about control theory? I'm after something for someone who already knows all the relevant mathematics but is unfamiliar with the applications.


There are plenty of university-level textbooks on control theory. For how to apply control theory to software problems there seems to be much less material, though. Glyn Normington recommended the following book in another comment thread:

    Feedback Control for Computer Systems
    Philipp K. Janert
    330 pages
    O’Reilly (2013)
    ISBN: 978-1449361693


You'll find a little here and there within comp. sci. but the authors are usually not aware that they are doing control theory.

Typical things: scheduling algorithms for 1) processes, 2) disk reads/writes/seeks, 3) TCP/IP stuff (slow start, throttling, window sizing).

For control theory viewed as part of engineering: "Control System Design" by Goodwin, Graebe, and Salgado.

There are many others but that was the one that worked best for me.


Start with this guy and then go as deep as you want with various textbooks.

https://youtube.com/@ControlLectures


You might enjoy the alternative approach taken by .NET's threadpool, which uses hill climbing to dynamically adjust the number of threads. See this article [1] and its references to Microsoft's papers on their design. Semi-relatedly, I used hill climbing to adjust the eviction policy in Caffeine (Java cache) to favor recency/frequency. While it seemed simple and naive, it worked surprisingly well and robustly.

[1] https://mattwarren.org/2017/04/13/The-CLR-Thread-Pool-Thread...
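A toy version of hill climbing on thread count might look like the following (purely illustrative; the actual CLR algorithm is considerably more sophisticated, using a signal-processing approach to separate its own probes from noise):

```python
import random

def hill_climb(measure, threads=4, step=1, iters=50,
               min_threads=1, max_threads=64, seed=0):
    # measure(n) returns the observed throughput when running n threads.
    rng = random.Random(seed)
    best = measure(threads)
    for _ in range(iters):
        # Probe one step up or down from the current setting.
        candidate = max(min_threads,
                        min(max_threads, threads + rng.choice((-step, step))))
        observed = measure(candidate)
        if observed > best:            # keep the move only if it helped
            threads, best = candidate, observed
    return threads
```

Because it only ever keeps improving moves, it settles near whatever thread count maximizes the measured throughput for the current workload, with no model of why.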


This is really fascinating, and I love PID Controllers. That said, I really don't recommend doing this in production.

It is very rare that you can successfully use this in practice. You either a) Have just one machine, and that machine has a fixed number of cores (or cannot readily swap number of cores on the fly). If your machine is dedicated to this task, it doesn't make much sense to resize your thread pool - Leave it at max. The only way a smaller threadpool would help is if you have noisy neighbors sharing a machine and cores. However, if you have noisy neighbors, your simple queue-length PID controller will start using more cores not just when you have more than one core worth of traffic, but also when you have a lot of noise and are at max threads. It's fine and dandy in the happy case, but it fails pretty spectacularly in the worst case - Each additional thread creates more exposure to your noisy neighbors and possibly even worse global throughput.

Or b) you have multiple machines, in which case you don't want to do this at all. The number of replicas (processes, or "pods" as I will call them, because anchoring this in real-world K8s experience is helpful) goes up and down, but you need feedback on the number of threads or cores involved at the top level. If the number of threads goes up per process but the number of processes goes down, or vice versa, you can get really wonky feedback loops.


Would it be possible for a PID controller to scale some sort of bottleneck mechanism?

E.g. you have an application that could get spammed with 10,000 requests. The application server is fine, but the database can suffer performance issues.

So you have some sort of feedback loop monitoring the response time of the database and deciding how many requests you let through.

I thought about this because it's useful for something I was working on. But I have never used PID controllers and it looked like it's the sort of thing they could be used for. Never went further with it though.


In general, controllers shine when you have complex relationships between the value that you directly control and the measure that you care about. In particular, "complex" meaning "well modeled with differential equations".

In your scenario, since the rate of database requests is probably a very simple function of the number of application requests (i.e. well modelled by a simple constant multiple), you probably don't need a "full controller" to implement the throttling.


This is exactly what the Aperture project is solving: https://github.com/fluxninja/aperture


Wouldn't simple backpressure solve this?


I imagine there's already a solution but I couldn't really find any simple algorithms. Have you got a link to any?


Something I've never understood is what to do when the units of work are known up front (like, I have to process these 1000 items with 4 stages), but the stages include network requests. So instead of having a fixed setpoint like 0 in the piece, I want smallest processing time. Can this be solved with PID and control theory?

For example, say I want to list all the keys in an s3 bucket that start with any of 1000 given prefixes. I can do a lot of calls per second, but of course have limited bandwidth and limited cpu to process the incoming responses, and sometimes s3 can say "too many requests" if other users are querying the same prefixes. How do I do this as fast as possible?


Maybe a bit off-topic, but oh boy I wish every project had a README like this one. It actually explains how things work instead of the usual "RTFC" (Read The F Code) approach.


I do too!

I’ve started writing extensive documentation for internal projects at work, with the intention that someone could pick them up, understand the context, use the code and improve it without needing to talk to me.


While a neat experiment, I don't know if thread pools are a thing that makes sense to scale up and down. Maybe these days VMs or containers would make more sense.


My experience from Apache with forking is: if your number of workers scales up and down with load, you get two big problems: 1) memory usage grows significantly with load (but our Apache modules were bloaty; maybe your thread pool isn't), and 2) you end up paying the cost of initializing workers at high load. In that situation, it's best to just figure out your maximum worker number and run that all day. Although, if your OS is smart, it allocates new work to the most recently active process, so some workers stay idle most of the time.

If your threads don't block and you have one main application running on the node, you usually just want to run one thread per node and you're done. If you are running a very network heavy load where you can eliminate or highly reduce cross thread communication, you may want one core per NIC tx/rx queue and one thread per core to eliminate cross-core communication; any cores above the number of queues will just be idle, because cross-core communication is more expensive than the work they can do (but that's not a super common scenario).

A control system to add and remove nodes makes sense if you're cloudy, though, since there's a cost for running nodes.


> In that situation, it's best to just figure out your maximum worker number and run that all day. Although, if your OS is smart, it allocates new work to the most recently active process, so some workers stay idle most of the time.

That would be my intuition too, in particular that usually the cost of idle workers is pretty low, so it's better just to preallocate some fixed max number of workers than try to scale them.


> That would be my intuition too, in particular that usually the cost of idle workers is pretty low, so it's better just to preallocate some fixed max number of workers than try to scale them.

I wouldn't necessarily intuit that the cost of idle workers is low, more that the non-CPU cost of workers is roughly fixed, and if it's too expensive to run more than you need at idle, it's still going to be too expensive to run that many at full load. Sometimes it's hard to know what the max load capacity is, but Apache configs where the worker count scales in and out make it really easy to get into a spiral: load is high -> spawn more workers -> use too much memory -> pick your poison: evict too much disk cache / swap to death / OOM killer.


I got the idea of scaling thread pools from a paper[0] coauthored by Eric Brewer (of CAP theorem fame and also vice-president of infrastructure at Google according to Wikipedia). The paper was written in 2001, so things might have changed a bit, but I doubt the example is completely irrelevant today.

[0]: https://people.eecs.berkeley.edu/~brewer/papers/SEDA-sosp.pd...


Intuitively, there is a threshold above which increasing the number of threads decreases throughput (for example, reaching your total number of cores if you are CPU-bound; it might be a different number for different workloads). It would be nice to have the controller try to find that number; this PID approach just tries to scale more if there's more input.

It might be more suited to infinitely-scalable situations like cloud VMs, where you are trying to optimize money spent.


> It would be nice to have the controller try to find that number, this PID approach just tries to scale more if there's more input.

Right, it should base the feedback on throughput measures instead. If throughput starts dropping, then the threadpool is overcommitted, and should scale back.

I believe the SEDA papers mentioned in the README discusses this as backpressure.


Could someone ELI5 for me... it seems like the end result is that it fails to scale down once the spiky load has gone away

Is that intended? Or I'm misinterpreting the graphs?


The load never goes away in these graphs - the load is a constant sinusoid that's not actually shown on the graph. The blue line is queue depth.

You are right that it's not unloading. In general, you'd expect the response (green line) of a PID controller to a sinusoid load to be another sinusoid. In this case, because the maximum error is bounded on one side (the setpoint is 0, and queue length can never get below 0), it can only unload at a certain rate. It's typically the job of the Ki term to bring the long-term error towards zero. In this case, I think the Ki is tuned small enough that, combined with the bounded asymmetric maximum error, the pool never unloads.


Ah thanks, makes perfect sense now!


That axis isn't instantaneous workload; it is queue length (work items not being handled by threads).

The PID controller is finding a reasonable number of workers that handle the peak bursts without allowing the queue to get too long, and without spawning and destroying threads unnecessarily.

The cost is that under peak load the queue does back up a little, but if the queue stayed long for too long, the PID should create new threads.


The spikes are queue length (which we want to control), not load. The load is not shown. Any queue length under 20, including spikes, is fine based on the author’s requirements.

The other line, which flattens out at 26 (see right y axis) is the number of workers.


Good question! At first I thought that maybe I wasn't waiting long enough after the load generator finished, but I just ran an experiment with a longer pause and I still don't see a scaling down after the traffic stops!

Perhaps my naive implementation of the PID controller isn't good enough, maybe:

> you'll probably want to put a maximum on your integral accumulator

is needed as icegreentea2 pointed out.


Poisson distributed random traffic for fixed average lambda would be more realistic. If you really want, you can keep a rolling sum of your poisson samples thus far (which translates into current simulated time), then use a sin function to adjust lambda before taking the next sample. This simulates day/night traffic or any other predictable "seasonal" effect.
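A sketch of such a load generator, sampling exponential inter-arrival times with a sinusoidally varying rate (all parameters here are illustrative):

```python
import math
import random

def generate_arrivals(base_lambda=10.0, amplitude=5.0, period=86400.0,
                      horizon=3600.0, seed=42):
    # Sample inter-arrival times from an exponential distribution whose
    # rate follows a day/night sinusoid over simulated time. The rolling
    # sum of samples is the current simulated time, which feeds back into
    # the rate for the next sample.
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while t < horizon:
        lam = base_lambda + amplitude * math.sin(2 * math.pi * t / period)
        t += rng.expovariate(max(lam, 1e-9))  # guard against lam <= 0
        arrivals.append(t)
    return arrivals
```

Feeding these timestamps into the simulation, instead of a deterministic sinusoid, exercises the controller against bursty arrivals around the seasonal trend.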


If you’re more into Ruby than Haskell and find PID control interesting, you may enjoy https://github.com/rickhull/device_control


My work is mainly massive data pipelines (I use Project Reactor, MongoDB, etc.). Massive is subjective: in my case it means fitting, on a single small machine, tens of millions of complex objects per second, hundreds of processing steps, and gigabytes of data from MongoDB every second, sustained over days, weeks and even months, using a wide range of techniques and exotic data structures to speed up stream processing. Then, if the problem requires, scaling it up.

Some stuff I learned over the years somebody might find useful:

1) Tune your app for max load. This is the most important parameter; nobody really cares what happens when you have 10% of the load -- what is important is how it behaves when you are close to, at, or above capacity. If your app can behave nicely as it reaches its limit, it should also be able to behave nicely below it. Why scale down your thread pools? Threads do not use extra resources when they are not running -- if you had the resources at max capacity, you now have resources lying around for an idle thread. Just find the number you need at your max load and leave it at that. Your framework should be able to allocate concurrency until it hits the limit, at which point it will be in the state as tuned for max load.

2) A trick I found is to make sure the application becomes more efficient per-transaction as the load increases. This means things like increasing batch sizes, etc. Increasing concurrency can have opposite effect. When per-transaction efficiency increases the application tends to behave nicely as it approaches its max throughput and it is much easier to keep stable at or close to the limit.

3) One way of increasing per-transaction efficiency is to increase batch sizes. Imagine a system where every user login requires an object to be fetched from the database. Rather than generate a database request for each login, pipe them together and batch them every 100ms. Then take that 100ms batch, generate a single request to the database engine, fetch the results, and distribute them to the clients. Now as more and more users log in, your application does not create more requests to the database -- it creates 10 requests every second regardless of the load. Larger requests should translate to higher efficiency per item. When the number of users logging in reaches the limit, you have the option of buffering requests (protecting the database from getting overloaded), but this is more likely going to dramatically drop the efficiency. Instead, you can drop requests and require the client to implement exponential backoff when retrying.
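The batching window in point 3 can be sketched as follows (illustrative; a real implementation would run this in its own thread and fan the fetched results back out to the waiting clients):

```python
import queue

def drain_batch(requests, max_wait=0.1):
    # Collect everything that arrived within one batching window, so the
    # database sees one request per window instead of one per login.
    batch = []
    try:
        # Block up to max_wait seconds for the first request of the window.
        batch.append(requests.get(timeout=max_wait))
    except queue.Empty:
        return batch                      # nothing arrived this window
    while True:
        try:
            batch.append(requests.get_nowait())
        except queue.Empty:
            return batch                  # window drained: issue one DB call
```

The caller loops on `drain_batch`, turning each non-empty batch into a single database round trip.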

4) I actually use control engineering techniques to change parameters of the system (just not the concurrency). For example, you can try to regulate processing latency or server load by modifying the limit of transactions in flight. You can even try to regulate database load (I have MongoDB report its state to the app layer, and the app layer adjusts its parameters to keep the database in safe operating conditions). The reason to do this is that the system is incredibly complex, and with all the asynchronous stuff happening, nobody really understands everything that is going on. Feedback control helps stabilise the system over a much wider range of circumstances without people having to change any configuration, which is a big reliability improvement. Feedback control will regulate faster and kick in before anybody notices anything, and has much less chance of messing stuff up. One warning, though: it is important to put some boundaries on the safe operating space for the control. Whatever control you implement, there are going to be limits to what it can and can't do. There are ways to implement stuff so that you can ensure your app stays within limits -- for example, start dropping requests when the app reaches its throughput limit.
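One simple, bounded shape for the kind of in-flight regulation in point 4 is AIMD (additive increase, multiplicative decrease -- the same scheme TCP congestion control uses). This is a generic sketch, not the commenter's actual system:

```python
def aimd_update(limit, overloaded, add=1, mult=0.5, min_limit=1):
    # Additive-increase/multiplicative-decrease on the in-flight limit:
    # back off fast when the database reports distress, probe upward
    # slowly otherwise. The min_limit bounds the safe operating space.
    if overloaded:
        return max(min_limit, int(limit * mult))
    return limit + add
```

Called once per monitoring interval with a boolean health signal, it converges toward the largest in-flight limit the database can sustain without anyone hand-tuning a number.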



