
The Unscalable, Deadlock-Prone, Thread Pool - dmit
https://pvk.ca/Blog/2019/02/25/the-unscalable-thread-pool/
======
magnetic
I wrote a parallel image processing tool (in Swift) that essentially uses
Grand Central Dispatch on macOS to process all files in a directory.

I hit the problem that GCD would happily let me enqueue tasks for processing,
but each task needed a substantial amount of memory and GPU resources (because
it would use CoreFilters that support GPU acceleration "when available"),
which made the computer hit swap and slow to a crawl once the available RAM
was exhausted.

I had to create an Admission Control component that would let me enqueue tasks
only if there was enough memory left to process the image without risking
swap. That part isn't as trivial as it seems: getting the amount of available
memory for a process is a bit vague on UNIX/macOS/Linux, since many OSes of
this kind assume you can allocate an infinite amount of memory (see Linux's
overcommit system) and that the virtual memory system will do the right thing
for you. Also, nobody can guess whether physical RAM is going to be allocated
by some other process somewhere, so I used a fairly conservative heuristic
that probably left some small amount of resources unused most of the time,
but would almost never slow to a crawl. Not perfect, but "good enough".
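A minimal sketch of this kind of admission control in Python (the byte budget, the per-image estimate, and `process_image` are hypothetical placeholders; a real version would query the OS for available memory):

```python
import threading

class MemoryAdmission:
    """Admit a task only when its estimated memory fits within the budget."""

    def __init__(self, budget_bytes):
        self.available = budget_bytes
        self.cond = threading.Condition()

    def acquire(self, estimate):
        with self.cond:
            # Block enqueueing until enough estimated memory is free.
            while estimate > self.available:
                self.cond.wait()
            self.available -= estimate

    def release(self, estimate):
        with self.cond:
            self.available += estimate
            self.cond.notify_all()

def process_image(admission, path, estimated_bytes):
    admission.acquire(estimated_bytes)
    try:
        pass  # decode, apply filters, encode...
    finally:
        admission.release(estimated_bytes)
```

Being conservative here just means overestimating `estimated_bytes`, which matches the trade-off described above: some resources sit idle, but the machine never hits swap.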

~~~
chrisseaton
Right - we need holistic scheduling systems that balance all resources and
tasks: for example, scheduling an IO-heavy task alongside a compute-heavy
task and letting them both run well, rather than picking two compute-heavy
tasks at the same time, and only scheduling as many tasks as there is RAM,
IO, and other resource capacity to use productively. Lots of people are
trying, because at data-centre scale, and with serverless that lets providers
run code wherever they want, this could be a huge advantage... not sure how
much success yet.

~~~
beached_whale
I wonder if something as simple as tagging would give a huge benefit towards
that. I should know if my job uses lots of memory (often I know exactly how
much), or IO, or CPU.

------
jorangreef
Node suffers similar problems, although I would describe them differently to
the author.

Essentially:

1\. All Node's async IO is lumped together into the same threadpool.

2\. There is no distinction between the nature of each async IO task.

3\. Async CPU tasks (fs.stat hitting the fs cache, async multi-core crypto,
async native addons) complete orders of magnitude faster than async disk tasks
(SSD or HDD), and these can be orders of magnitude faster than async network
tasks (dns requests to a broken dns server).

4\. There are three basic async performance profiles, fast (CPU), slow (disk),
very slow (dns), but Node has no concept of this.

5\. This leads to the Convoy effect. Imagine what happens when you race
trucks, cars, and F1... all on the same race track.

6\. The threadpool has a default size of only 4 threads, on the assumption
that this reflects the typical number of CPU cores (and reduces context
switches).

7\. 4 threads is a bad default because it leads to surprising behavior (4 slow
dns requests to untrusted servers are enough to DoS the process).

8\. 4 threads is a bad default because libuv's memory cost of 128 threads is
cheap.

9\. 4 threads is a bad default because it prevents the CPU scheduler from
running async CPU tasks while slow disk and slower DNS tasks are running.
Concurrent CPU tasks should rather be limited to the number of cores
available, while concurrent disk and DNS tasks should be given more than the
number of cores available (context switches are better amortized for these).

10\. Because everything is conflated, hard concurrency limits can't be
enforced on fast, slow or slower tasks. It's all or nothing.

There are efforts underway to support multiple threadpools in Node (a
threadpool for fast tasks sized to the number of cores, a threadpool for slow
tasks sized larger, and a threadpool for slower tasks also sized larger). The
goal is to get to the point where we can have separate race tracks in Node,
with F1, cars and trucks on separate race tracks, controlled and raced to
their full potential:

[https://github.com/libuv/libuv/pull/1726](https://github.com/libuv/libuv/pull/1726)
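The separation described above can be sketched as three independently sized pools, one per performance profile (the sizes and the `classify` routing function are illustrative, not libuv's actual design):

```python
import os
from concurrent.futures import ThreadPoolExecutor

cores = os.cpu_count() or 4

# One pool per performance profile; a stalled DNS lookup can no longer
# starve fast CPU tasks, and each profile gets its own hard limit.
cpu_pool = ThreadPoolExecutor(max_workers=cores)       # fast: CPU-bound work
disk_pool = ThreadPoolExecutor(max_workers=4 * cores)  # slow: disk I/O
dns_pool = ThreadPoolExecutor(max_workers=8 * cores)   # slower: network I/O

def classify(task_kind):
    """Route a task to the pool matching its performance profile."""
    return {"cpu": cpu_pool, "disk": disk_pool, "dns": dns_pool}[task_kind]
```

With this split, four broken DNS requests occupy four threads in `dns_pool` and nothing else; the CPU and disk pools keep draining at full speed.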

~~~
felixfbecker
This is a common misconception. Node (libuv) only uses the thread pool as a
last resort. Many operations nowadays have a native async OS API, so they
don't need to use (and therefore don't block/fill) the thread pool.

[https://medium.com/the-node-js-collection/what-you-should-know-to-really-understand-the-node-js-event-loop-and-its-metrics-c4907b19da4c#07f5](https://medium.com/the-node-js-collection/what-you-should-know-to-really-understand-the-node-js-event-loop-and-its-metrics-c4907b19da4c#07f5)

[https://stackoverflow.com/a/22644735/4208018](https://stackoverflow.com/a/22644735/4208018)

~~~
jorangreef
"Node (libuv) only uses the thread pool as a last resort."

When did you last read deps/uv/src/win/fs.c or deps/uv/src/unix/fs.c?

If by "last resort" you mean 99.9% of all async fs tasks, then I would agree
with you.

~~~
coder543
Isn't that because OSes have broken async filesystem APIs that can't be
trusted?

But network requests (like DNS) have well-defined APIs for async, and should
never be required to block a thread, and therefore do not need a threadpool.

------
mcqueenjordan
Title is misleading. The author seems to mask what they're saying in a way
that makes it really easy to conclude that their point is "Thread pools are
bad, so I'll show you a better way." In fact, the proposed better way is
ThreadPools(tm).

In fact, "The Unscalable, Deadlock-Prone, Thread Pool" is only referring to a
badly used ThreadPool. Like, really badly: the strawman that forms the crux of
the argument is that you are only using the same ThreadPool for all of your
disparate forms of work.

> My favourite approach assigns one global thread pool (queue) to each
> function or processing step.

Yes, creating different thread pools for different kinds of work is a much
better way to use them...

~~~
fiter
The .NET libraries make it easy to use the default thread pool (a single
pool). If other libraries are similar, then I wouldn't consider this a
complete straw man.

~~~
eurg
Yes, but the .NET thread pool is auto-tuning, although very badly: it has at
least a second of latency before it reacts to an increased workload. For
bursty requests that could be done within a few hundred milliseconds it is a
bad fit. But you can't DoS your batch processing server so easily.

------
spricket
This feels language-specific. Java has excellent thread support, for example;
it's quite hard to build an application that deadlocks.

Thread-per-request is how everything used to be done. Most languages switched
to pools because of huge fork/threading overhead.

Other languages have embraced async programming to deal with these issues, and
use thread-per-core. But that introduces some fairly awkward coding
constructions itself.

I believe the best approach is transparent fiber support. Java's Project Loom
is a very ambitious attempt at this. And if you're okay with a "hacky"
solution, you can already use fibers with Quasar's code generation.

In Java at least, the biggest barrier to fiber support is existing libraries
not supporting it (looking at you, JDBC/JPA). Hopefully Project Loom will
make existing code work with little or no modification.

~~~
xenadu02
The issue the OP is addressing covers your comment quite well. Async
continuations, coroutines, and fibers can all race ahead and exhaust memory/DB
connections/network sockets, or just saturate the IO subsystem. In a multi-
step process this can easily lead to starvation in downstream steps.

The point of structuring an application this way is to prevent any one stage
of your pipeline from consuming too many resources or stalling other stages
(or just hammering some other system with requests that will inevitably
timeout).

I don’t see how your comment about fiber support in Java applies.
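The "racing ahead" failure mode above can be sketched with bounded queues between pipeline stages (the stage functions are placeholders): a small queue capacity is what stops an upstream stage from outrunning and starving the ones downstream.

```python
import queue
import threading

def stage(inbox, outbox, work):
    """Run one pipeline stage; a full outbox blocks it (backpressure)."""
    while True:
        item = inbox.get()
        if item is None:          # sentinel: propagate shutdown and exit
            if outbox is not None:
                outbox.put(None)
            return
        result = work(item)
        if outbox is not None:
            outbox.put(result)    # blocks when the next stage is behind

# Small queues cap how far any stage can race ahead of the next.
q1, q2 = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
results = []

t1 = threading.Thread(target=stage, args=(q1, q2, lambda x: x * 2))
t2 = threading.Thread(target=stage, args=(q2, None, results.append))
t1.start(); t2.start()
for i in range(10):
    q1.put(i)
q1.put(None)
t1.join(); t2.join()
```

An unbounded queue here would let the first stage buffer all ten items (or ten million) before the second stage touched one of them; `maxsize=4` is the resource cap.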

~~~
agumonkey
I'd love to learn how to design this way, any reading ?

~~~
ioquatix
I wrote a full stack implementation for Ruby:
[https://github.com/socketry/async](https://github.com/socketry/async)

Come and chat here:
[https://gitter.im/socketry/async](https://gitter.im/socketry/async)

TruffleRuby folks are trying to support us using native fibers in the JVM, but
I'm not sure when that will be working.

------
wtracy
The author spends a lot of time describing the problem, and not a lot on the
solution.

The solution proposed seems to be some variant on dataflow programming:
[https://en.wikipedia.org/wiki/Dataflow_programming](https://en.wikipedia.org/wiki/Dataflow_programming)
Am I misunderstanding?

~~~
dkarl
> _Once the problem has been identified, the solution becomes obvious: make
> sure the work we push to thread pools describes the resources to acquire
> before running the code in a dedicated thread._
>
> _My favourite approach assigns one global thread pool (queue) to each
> function or processing step. The arguments to the functions will change, but
> the code is always the same, so the resource requirements are also well
> understood._

I think I get what he's saying. To avoid one stage in the multi-step
computation acquiring too many resources and starving other stages:

\- Cap the resources used by each stage.

\- Create one threadpool per stage.

\- Include sufficient information in pending jobs that the threadpool can
calculate the resources needed to process each job.

\- Only start a job running when it can run without exceeding the resource
cap for the stage.

Conceptually, you can think of it as pre-allocating a bundle of resources to
each processing stage: threads, memory, etc. Each processing stage then runs
as many jobs concurrently as it can with the allocated bundle of resources.
That's the idea, but in reality each job allocates its own resources from a
common pool that is shared among all the processing stages, so the processing
stage has to calculate how much memory, etc. each job _will_ allocate before
giving it a thread to run on.
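That reading can be sketched as a per-stage pool that computes each job's resource needs before handing it a thread (the `cost_of` function is a hypothetical stand-in for the stage's known resource profile):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Stage:
    """A processing stage with its own thread pool and its own memory cap."""

    def __init__(self, func, cost_of, workers, mem_cap):
        self.func, self.cost_of = func, cost_of
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.mem_free = mem_cap
        self.cond = threading.Condition()

    def submit(self, job):
        cost = self.cost_of(job)  # known before any thread is assigned
        with self.cond:
            while cost > self.mem_free:
                self.cond.wait()  # wait rather than overrun the stage's cap
            self.mem_free -= cost
        return self.pool.submit(self._run, job, cost)

    def _run(self, job, cost):
        try:
            return self.func(job)
        finally:
            with self.cond:
                self.mem_free += cost
                self.cond.notify_all()
```

In this sketch `submit` blocks the caller when the cap is reached; a non-blocking variant would instead park the job on a pending queue, but the invariant is the same: a job only gets a thread once its resources are accounted for.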

~~~
jayd16
So does it boil down to "grab all your locks upfront, if you can't then yield
them all back"?

~~~
dkarl
Pretty much, plus if your job is running at all, you have high confidence
that lock acquisition will succeed, because the entity that gave it a thread
to run on has checked that all the resources it needs are available.

------
jondubois
For some use cases, I like the approach of having a fixed number of main, long
lived threads/processes at runtime (which matches the number of CPU cores on
the machine to minimize context-switching). Where possible, the
threads/processes are themselves responsible for picking up the work that must
be done. Each thread/process can use hashing/sharding to figure out which
portion of the state they are responsible for and only operate on those parts.
These kinds of threads should be handling the vast majority of the work. But
of course, for this to be possible, the work needs to be parallelizable
(servers/sockets make a good use case).

If you have occasional I/O tasks which don't require much CPU (e.g. disk I/O),
you can use a separate threadpool for those since the context switching
penalty will be minimal. Context switching is mostly a problem when all the
threads/processes are using a lot (and equal amount) of CPU. If a very heavy
thread/process has to share CPU with a very lightweight thread/process, the
context switching penalty is not significant.
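The hashing/sharding scheme described above can be sketched as a fixed set of long-lived workers, each exclusively owning the shard of state its keys hash to (the worker count and message format are illustrative):

```python
import queue
import threading

NUM_WORKERS = 4  # in practice: os.cpu_count()
queues = [queue.Queue() for _ in range(NUM_WORKERS)]
state = [{} for _ in range(NUM_WORKERS)]  # each shard owned by one worker

def worker(i):
    """Only worker i ever touches state[i], so the shard needs no locks."""
    while True:
        msg = queues[i].get()
        if msg is None:  # sentinel: shut down
            return
        key, value = msg
        state[i][key] = value

def route(key, value):
    # Hashing picks the single worker responsible for this key.
    queues[hash(key) % NUM_WORKERS].put((key, value))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for k in ["a", "b", "c", "d"]:
    route(k, k.upper())
for q in queues:
    q.put(None)
for t in threads:
    t.join()
```

Because every key deterministically maps to one worker, the shards never need cross-thread synchronization; only the message queues do.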

~~~
willvarfar
My experience (on Linux) is that:

\- if you have lots of short-lived cpu-intensive tasks then a pool with one or
two threads per cpu core (depending on how SMT works for you on your hardware)
works well

\- if you have lots of long-lived cpu-intensive tasks then just give each task
a thread and let the OS schedule them

\- if you have lots of io tasks, don't use threads; async io is a massive win

\- if you have lots of io tasks on AWS then you have to have high core counts
even if they all sit idle, because of the way credits are divvied up; even
with massive bought IOPS you don't get good IO on AWS compared to, say, the
cheap laptop you do dev on ;)

I am so so so tempted to go into a particular mysql storage engine that my day
job often relies upon and move it from one-thread-per-core to async io.
Obviously that's a pipe-dream but the wins would be massive (on linux, on aws
blah blah)

Having made this list I can see there are so many caveats that I'm not sure
generalizations get anyone very far, sadly. It's like the same server
software has really different performance characteristics on different cloud
providers vs dedicated hardware, etc.

~~~
jondubois
The exact terminology probably depends on your programming environment as
well. I'm using Node.js now, and async IO (e.g. for disk access) uses a
threadpool behind the scenes, but these threads use very little CPU: they're
mostly idle, in fact. Their only two responsibilities are to start the IO
operation and then send the data back to the parent process when the IO
completes, so they use almost no CPU and aren't a problem, though I guess it
depends on how heavily they are used. Node.js is not really designed to be a
DB engine, though, so this design works well.

------
zmmmmm
> Once the problem has been identified, the solution becomes obvious: make
> sure the work we push to thread pools describes the resources to acquire
> before running the code in a dedicated thread. ... My favourite approach
> assigns one global thread pool (queue) to each function or processing step.

The author seems to do a remarkable job of basically describing actors without
mentioning the word "actor". I use GPars [1] with Groovy which explicitly
supports exactly what they talk about in a very elegant manner. You create a
variety of actors and allocate them to different parallel groups so they
cannot starve each other of resources, and most of these issues become
controllable.

Perhaps it is the Lisp context that resists lumping state into the equation,
since Actors usually carry state?

[1] [http://www.gpars.org/](http://www.gpars.org/)

~~~
kitd
Well yeah, Go channels/goroutines too, et al.

The problem isn't just partition of opaque work on threads though, but how to
schedule work so that resources other than threads are used efficiently.

GPars etc don't cover that.

------
Const-me
> The moment we perform I/O… both the futures and the generic thread pool
> models fall short.

On Windows they usually work fine. OS kernel support is the key. OS-provided
thread pool scales really well. I’m talking about the modern one from
threadpoolapiset.h, StartThreadpoolIo / SubmitThreadpoolWork API functions.

> we might want to limit network requests going to individual hosts,
> independently from disk reads or writes, or from database transactions.

There’re APIs like ReleaseSemaphoreWhenCallbackReturns and
SetEventWhenCallbackReturns.

------
hyperpape
This sounds an awful lot like SEDA, but my memory of it is too fuzzy to see
what the differences might be.

~~~
TeMPOraL
The author does reference SEDA in one of the footnotes.

------
fraffo
Does he even know Rx observables?

