Taking a Look at CLR Thread Pool 'Thread Injection' Algorithm (mattwarren.org)
132 points by matthewwarren on Apr 13, 2017 | 16 comments

This is somewhat similar to something I published some time ago, and it even gives a very similar name to the process.


The main differences are the potential delay before adding threads, and the use of ML. My "injector" immediately adds threads at task-submission time if there are none spare (and there's permitted headroom). They are then _pruned_ as they are seen to be wasting cycles.
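The inject-then-prune behaviour can be sketched in a few lines of Python (the class and field names here are invented for illustration; they are not from the linked project):

```python
import queue
import threading

class InjectingPool:
    """Toy pool: inject a worker thread at submit time when none are idle,
    and prune workers that sit idle past a timeout."""

    def __init__(self, max_threads=8, idle_timeout=0.5):
        self.tasks = queue.Queue()
        self.lock = threading.Lock()
        self.max_threads = max_threads
        self.idle_timeout = idle_timeout
        self.threads = 0   # live workers
        self.idle = 0      # workers currently blocked waiting for a task

    def submit(self, fn, *args):
        self.tasks.put((fn, args))
        with self.lock:
            # Inject immediately if no spare worker and headroom remains.
            # (This check is racy, but the worst case is a short-lived
            # extra worker, still capped at max_threads.)
            if self.idle == 0 and self.threads < self.max_threads:
                self.threads += 1
                threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            with self.lock:
                self.idle += 1
            try:
                fn, args = self.tasks.get(timeout=self.idle_timeout)
            except queue.Empty:
                # Pruned: this worker was wasting cycles.
                with self.lock:
                    self.idle -= 1
                    self.threads -= 1
                return
            with self.lock:
                self.idle -= 1
            fn(*args)
```

The key property is that ramp-up is immediate (a thread appears at submission if none is spare), while ramp-down is lazy (a thread dies only after sitting idle for the timeout).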

Admittedly, this approach was designed only to minimise the overheads of thread state management, whereas in principle the CLR approach can also respond to harmful competition for resources between tasks, assuming relatively consistent behaviour.

The downside of course is potentially very slow ramp-up time when the workload calls for a sudden increase in threads.

Nice, thanks for the link; it looks interesting.

In Java I assume the executor and/or injection algorithm can be swapped out for your own implementation. Is that right?

Yes, that's correct

We had a system like this at my workplace, but it is gradually being replaced with cooperative threads. I don't remember what specific issues we were having with it, but I believe slow adaptation to periodic workloads was one issue. Fundamentally, why hill-climb when you can know with certainty exactly how many blocked threads there are in your system?

From [1] "in the case of blocking workloads, it’s extremely difficult to determine the number of threads that optimizes overall throughput because it’s hard to determine when a request will be completed." So the argument they make is that responding to the number of blocked threads directly could lead to an over-correction, where you add a lot of threads to the thread pool when threads are about to unblock. This reduces throughput because you now suffer context switches and poor cache locality.

[1] https://msdn.microsoft.com/en-us/magazine/ff960958.aspx

This is only a problem when you're not scheduling cooperatively. If you are scheduling cooperatively, then you don't ever have to experience unnecessary context switches.

The CLR thread pool is a pretty reliable workhorse. For throughput-sensitive work, though, you can benefit from a more fine-tuned approach to threading: each job added to the CLR thread pool involves at least two allocations in most cases, so you end up paying additional GC costs on top of the scheduling and context switch costs.

In cases where you know you are going to be running very many jobs on your thread pool you can get improved throughput by assigning each thread a specific job type so that it can blast through many jobs of the same type with better locality and less overhead. You can also leverage the underlying platform thread pool to spin up the job type workers and run them for a bit to avoid managing thread counts yourself.
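A minimal sketch of the per-job-type worker idea, assuming one dedicated thread and queue per type (the names here are hypothetical, and a real version would use the platform pool to host the workers as described above):

```python
import queue
import threading

class TypedWorkers:
    """One dedicated worker thread per job type, so each thread blasts
    through many jobs of the same kind back to back, with better
    locality and no per-job dispatch decision."""

    def __init__(self, job_types):
        self.queues = {t: queue.Queue() for t in job_types}
        self.workers = [
            threading.Thread(target=self._run, args=(t,), daemon=True)
            for t in job_types
        ]
        for w in self.workers:
            w.start()

    def submit(self, job_type, fn, *args):
        self.queues[job_type].put((fn, args))

    def _run(self, job_type):
        q = self.queues[job_type]
        while True:
            item = q.get()
            if item is None:        # sentinel: shut down this worker
                return
            fn, args = item
            fn(*args)

    def shutdown(self):
        for q in self.queues.values():
            q.put(None)
        for w in self.workers:
            w.join()
```

Each queue is FIFO and single-consumer, so jobs of one type also retain their submission order, which a shared pool does not guarantee.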

To a degree, if you're using an existing parallel framework it may do this for you, but at least on the CLR the options there, like Parallel.For, aren't so great: lots of abstraction overhead that shows up in profiles.

Yeah, there's only so much the thread pool can do; it doesn't have the same knowledge of your workload as you do.

They have added things like work-stealing to it, which might help with task locality in some scenarios.
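Work-stealing itself can be illustrated with a toy single-threaded simulation (real implementations, such as the TPL's, use lock-free concurrent deques; this only shows the head-pop/tail-steal discipline):

```python
from collections import deque

def run_with_stealing(per_worker_tasks):
    """Each worker pops from the head of its own deque; when its deque
    is empty it steals from the tail of the fullest other deque, which
    preserves the victim's locality at its head."""
    deques = [deque(tasks) for tasks in per_worker_tasks]
    executed = [[] for _ in deques]
    remaining = sum(len(d) for d in deques)
    while remaining:
        for i, d in enumerate(deques):
            if d:
                task = d.popleft()          # own work: take from the head
            else:
                victim = max(deques, key=len)
                if not victim:
                    continue                # nothing left to steal this round
                task = victim.pop()         # steal from the tail
            executed[i].append(task)
            remaining -= 1
    return executed
```

Stealing from the tail means a thief takes the work the owner would have reached last, which is why stealing helps load balance without wrecking the owner's cache locality.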

The same algorithm has been implemented in Mono at https://github.com/mono/mono/blob/master/mono/metadata/threa...

Cool, thanks for the link; I didn't think to look in the Mono source.

It'll be interesting to see how the Mono implementation compares.

We're using hill climbing plus ML and anomaly detection for autoscaling bigger resources, e.g. stream processors and cloud workers.

Interestingly, in the Microsoft research paper they indicate that the algorithm could be used for other scenarios; it seems they were right.

Does anyone know of analogous work re: the size/# of IO buffers? It seems like a similar problem.

I have a prototype to optimize a caching policy using a hill climber. It favors recency or frequency by adapting the partition between the two regions. It works really well in early experiments.
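The adaptive-partition idea might look something like this (a hypothetical sketch of the description above, not the commenter's actual prototype):

```python
class PartitionClimber:
    """Hill-climbs the split between a recency (LRU) region and a
    frequency (LFU) region of a fixed-size cache, judged only by the
    observed hit rate over each sample window."""

    def __init__(self, capacity, step=1):
        self.capacity = capacity
        self.lru_size = capacity // 2   # current partition point
        self.step = step                # entries moved per adjustment
        self.prev_hit_rate = 0.0
        self.direction = 1              # +1 grows the LRU region

    def record_sample(self, hit_rate):
        # Keep moving the boundary the same way while the hit rate
        # improves; reverse direction when it degrades.
        if hit_rate < self.prev_hit_rate:
            self.direction = -self.direction
        self.lru_size = min(self.capacity - 1,
                            max(1, self.lru_size + self.direction * self.step))
        self.prev_hit_rate = hit_rate
```

On a recency-biased trace the boundary drifts toward a larger LRU region, and on a frequency-biased trace it drifts the other way, oscillating by one step around the optimum once it gets there.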


Is there anything like this for Python? I have an application that would very much benefit from it.

You don't get true parallelism from threads in Python, thanks to the GIL: only one thread at a time can be executing Python bytecode.

You can use processes to get around this, but it's often not as performant or convenient. Computer hardware shifted to multi-core a decade ago and Python was left behind because of that little bit of technical debt.
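For CPU-bound work, the standard workaround is the standard library's `concurrent.futures.ProcessPoolExecutor`, which sidesteps the GIL by giving each worker its own interpreter process:

```python
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit):
    # CPU-bound work: threads would serialize on the GIL here,
    # but separate processes each run their own interpreter.
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        totals = list(pool.map(count_primes, [10_000] * 4))
    print(totals)
```

The trade-off mentioned above applies: arguments and results must be pickled across the process boundary, so this pays off for chunky CPU-bound jobs, not fine-grained tasks.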

It used to be my favorite language, but I've since moved on to Go, both for the better performance and multi-core story, and for the static typing.

