
Taking a Look at CLR Thread Pool 'Thread Injection' Algorithm - matthewwarren
http://mattwarren.org/2017/04/13/The-CLR-Thread-Pool-Thread-Injection-Algorithm/
======
_benedict
This is somewhat similar to something I published some time ago, and it even
uses a very similar name for the process

[http://belliottsmith.com/injector/](http://belliottsmith.com/injector/)

The main differences are the potential delay for adding threads, and the use
of ML. The "injector" immediately adds threads if there are none spare (and
there's permitted headroom), at the time of task submission. They are then
_pruned_ as they are seen to be wasting cycles.
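For comparison, the stock JDK executor can be configured to behave roughly this way. This is not _benedict's actual injector, just a sketch of the same "add a thread immediately if none is spare, prune idle ones later" policy using standard `java.util.concurrent` pieces:

```java
import java.util.concurrent.*;

public class InjectorSketch {
    public static void main(String[] args) throws Exception {
        // A SynchronousQueue has no capacity, so each submission either hands
        // off to an idle thread or starts a new one (up to the headroom cap).
        // Threads idle longer than keepAliveTime are pruned.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0,                            // no permanent threads
                8,                            // permitted headroom
                500, TimeUnit.MILLISECONDS,   // prune idle threads after 500 ms
                new SynchronousQueue<>());

        for (int i = 0; i < 4; i++) {
            pool.execute(() -> {
                try { Thread.sleep(100); } catch (InterruptedException e) { }
            });
        }
        // All four tasks are still running, so four threads were injected.
        System.out.println("threads injected: " + pool.getPoolSize());
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```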

Admittedly this approach was designed to minimise the overheads of thread
state management only, whereas in principle the CLR approach can respond to
harmful competition for resources between tasks as well, assuming relatively
consistent behaviour.

The downside of course is potentially very slow ramp-up time when the workload
calls for a sudden increase in threads.

~~~
matthewwarren
Nice, thanks for the link, looks interesting

In Java I assume that the executor and/or injection algorithm can be swapped
out with your own implementation, is that right?

~~~
_benedict
Yes, that's correct
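For anyone curious, here's a minimal example of plugging your own pieces into the JDK executor machinery; the sizing parameters and the thread-factory name are just illustrative:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Because java.util.concurrent exposes Executor/ExecutorService as
// interfaces, a custom pool (or a stock one with custom sizing and
// thread creation) can be dropped in wherever an Executor is expected.
public class CustomExecutorSketch {
    static class NamedThreadFactory implements ThreadFactory {
        private final AtomicInteger n = new AtomicInteger();
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r, "injector-" + n.incrementAndGet());
            t.setDaemon(true);
            return t;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(),
                new NamedThreadFactory());
        Future<String> f = pool.submit(() -> Thread.currentThread().getName());
        System.out.println(f.get());   // the worker created by our factory
        pool.shutdown();
    }
}
```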

------
obstinate
We had a system like this at my workplace, but it is gradually being replaced
with cooperative threads. I don't remember what specific issues we were having
with it, but I believe slow adaptation to periodic workloads was one issue.
Fundamentally, why hill climb when you can know with certainty exactly how
many blocked threads there are in your system?
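For reference, the JDK's ForkJoinPool takes roughly this approach: a worker announces that it is about to block via `ManagedBlocker`, so the pool knows the blocked-thread count directly and can compensate with a spare thread if needed, rather than inferring it. A minimal sketch of the API:

```java
import java.util.concurrent.*;

public class ManagedBlockSketch {
    public static void main(String[] args) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(2);
        pool.submit(() -> {
            try {
                // Wrapping the blocking call tells the pool this worker is
                // blocked; the pool may activate a spare thread to keep
                // queued tasks from starving in the meantime.
                ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
                    public boolean block() throws InterruptedException {
                        Thread.sleep(200);   // stand-in for a blocking call
                        return true;         // blocking is finished
                    }
                    public boolean isReleasable() { return false; }
                });
            } catch (InterruptedException e) { }
        }).join();
        System.out.println("done");
        pool.shutdown();
    }
}
```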

~~~
ablankst
From [1] "in the case of blocking workloads, it’s extremely difficult to
determine the number of threads that optimizes overall throughput because it’s
hard to determine when a request will be completed." So the argument they make
is that responding to the number of blocked threads directly could lead to an
over-correction, where you add a lot of threads to the thread pool when
threads are about to unblock. This reduces throughput because you now suffer
context switches and poor cache locality.

[1] [https://msdn.microsoft.com/en-us/magazine/ff960958.aspx](https://msdn.microsoft.com/en-us/magazine/ff960958.aspx)

~~~
obstinate
This is only a problem when you're not scheduling cooperatively. If you are
scheduling cooperatively, then you don't ever have to experience unnecessary
context switches.

------
kevingadd
The CLR thread pool is a pretty reliable workhorse. For throughput concerns
you can benefit from a more fine-tuned approach to threading, though - each
job added to the CLR thread pool involves at least two allocations in most
cases, so you end up paying additional GC costs on top of the scheduling and
context switch costs.

In cases where you know you are going to be running very many jobs on your
thread pool you can get improved throughput by assigning each thread a
specific job type so that it can blast through many jobs of the same type with
better locality and less overhead. You can also leverage the underlying
platform thread pool to spin up the job type workers and run them for a bit to
avoid managing thread counts yourself.
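A rough sketch of that per-job-type idea; the queues and job types here are made up for illustration, but the shape is a dedicated worker draining a homogeneous queue rather than interleaving job types:

```java
import java.util.*;
import java.util.concurrent.*;

public class TypedBatchSketch {
    public static void main(String[] args) throws Exception {
        // One queue per job type, prefilled with work.
        BlockingQueue<Integer> squares = new LinkedBlockingQueue<>(List.of(1, 2, 3));
        BlockingQueue<Integer> doubles = new LinkedBlockingQueue<>(List.of(4, 5, 6));

        // The underlying pool just hosts the workers; each worker blasts
        // through many jobs of a single type for better locality.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Long> a = pool.submit(() -> {        // dedicated "square" worker
            long sum = 0;
            for (Integer x; (x = squares.poll()) != null; ) sum += (long) x * x;
            return sum;
        });
        Future<Long> b = pool.submit(() -> {        // dedicated "double" worker
            long sum = 0;
            for (Integer x; (x = doubles.poll()) != null; ) sum += 2L * x;
            return sum;
        });
        System.out.println(a.get() + " " + b.get()); // 14 30
        pool.shutdown();
    }
}
```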

To a degree if you're using an existing parallel framework it may do this for
you, but at least on the CLR the options there like Parallel.For aren't so
great - lots of abstraction overhead that will show up in profiles.

~~~
matthewwarren
Yeah, there's only so much the Thread Pool can do; it doesn't have the same
knowledge of your workload as you do.

They have added things like work-stealing to it, which might help with task
locality in some scenarios
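The Java analogue, for comparison, is ForkJoinPool: each worker keeps a local deque of subtasks, and idle workers steal from busy workers' deques, which tends to keep related subtasks on the thread that spawned them. A small fork/join sketch:

```java
import java.util.concurrent.*;

public class StealSketch extends RecursiveTask<Long> {
    final long lo, hi;
    StealSketch(long lo, long hi) { this.lo = lo; this.hi = hi; }

    protected Long compute() {
        if (hi - lo <= 1_000) {         // small enough: sum directly
            long s = 0;
            for (long i = lo; i < hi; i++) s += i;
            return s;
        }
        long mid = (lo + hi) / 2;
        StealSketch left = new StealSketch(lo, mid);
        left.fork();                    // pushed to this worker's deque, stealable
        long right = new StealSketch(mid, hi).compute();
        return left.join() + right;
    }

    public static void main(String[] args) {
        long sum = ForkJoinPool.commonPool().invoke(new StealSketch(0, 1_000_000));
        System.out.println(sum);        // sum of 0..999999 = 499999500000
    }
}
```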

------
lhenry
The same algorithm has been implemented in Mono at
[https://github.com/mono/mono/blob/master/mono/metadata/threa...](https://github.com/mono/mono/blob/master/mono/metadata/threadpool-worker-default.c)

~~~
matthewwarren
Cool, thanks for the link, I didn't think about looking in the Mono source

It'll be interesting to see how the Mono implementation compares.

------
dewiz
Using hill climbing plus ML and anomaly detection for autoscaling bigger
resources, e.g. stream processors and cloud workers

~~~
matthewwarren
Interestingly, in the Microsoft research paper they indicate that the
algorithm could be used in other scenarios; seems they were right

------
evdev
Does anyone know of analogous work re: the size/# of IO buffers? It seems like
a similar problem.

~~~
NovaX
I have a prototype to optimize a caching policy using a hill climber. It
favors recency or frequency by adapting the partition between the two regions.
It works really well in early experiments.

[https://docs.google.com/presentation/d/1NlDxyXsUG1qlVHMl4vsU...](https://docs.google.com/presentation/d/1NlDxyXsUG1qlVHMl4vsUUBQfAJ2c2NsFPNPr2qymIBs/edit#slide=id.g1c43e4cd00_0_646)
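A toy version of that kind of one-dimensional hill climber, for anyone who wants the flavor of it; the objective function here is a synthetic stand-in, not a real cache hit rate:

```java
public class HillClimbSketch {
    // Hypothetical stand-in for "hit rate given this recency/frequency
    // split"; the real signal would come from sampled cache hits.
    static double hitRate(double split) {
        return 1.0 - (split - 0.7) * (split - 0.7);   // peaks at split = 0.7
    }

    public static void main(String[] args) {
        double split = 0.2, step = 0.05;
        double best = hitRate(split);
        for (int i = 0; i < 100; i++) {
            double candidate = Math.max(0, Math.min(1, split + step));
            double rate = hitRate(candidate);
            if (rate > best) {            // improved: keep climbing this way
                split = candidate;
                best = rate;
            } else {                      // overshot: reverse and shrink the step
                step = -step / 2;
            }
        }
        System.out.println(Math.round(split * 100) / 100.0);  // converges near 0.7
    }
}
```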

------
fake-name
Is there anything like this for python? I have an application that would very
much benefit from it.

~~~
eloff
You don't get real threads in Python thanks to the GIL - only one thread at a
time can be executing Python code.

You can use processes to get around this, but it's often not as performant or
convenient. Computer hardware shifted to multi-core a decade ago and Python
was left behind because of that little bit of technical debt.

It used to be my favorite language, but I've since moved on to Go, both for
the better performance and multi-core story, and for the static typing.

