
Thread pools: How do I use them? - ingve
http://jvns.ca/blog/2016/03/27/thread-pools-how-do-i-use-them/
======
throweway
I like the style of writing in this post. It's refreshing, in a programming
world where everyone seems to know it all, for someone to admit something is
hard or confusing.

~~~
Smircio-
Heck yea. I'm kind of new here and it can be so intimidating. I feel like
everyone is an expert at everything.

~~~
radicalbyte
It's not that, it's just that experts tend to contribute to threads where they
can share their expertise.

So while it appears that everyone here is amazing, that isn't the case: we
just tend to have something that we're very good at and comment on it.

The good thing is that you can learn a lot here as long as you don't get
sidelined by some of the inane arguments and fighting :)

~~~
charlesism
Also, an awful lot of people who sound like infallible "experts" in tech are
just people who are dealing poorly with imposter syndrome.

~~~
grkvlt
Hang on; Imposter Syndrome would mean that you would come across as a
fallible, non-expert, since you believe yourself to really be an imposter. Do
you really mean something like the Dunning-Kruger effect, which is sort of the
opposite of Imposter Syndrome, people with little knowledge believing
themselves to be infallible experts?

~~~
dpark
Impostor syndrome means that the person thinks of themselves as a fraud, not
that other people do. So someone dealing poorly with impostor syndrome might
well only chime in when they are exceedingly confident, or only when they do
research and can be certain that the information/position they are providing
is correct. If you keep your mouth shut except when you're 100% certain,
you'll likely look like an "infallible expert" because you're always correct
from an external perspective, even if you are quietly filled with self-doubt.

~~~
charlesism
Pretty much. Some people who lack self-confidence seem afraid to ever admit "I
don't know" lest it somehow reveal their "fraud" to the world.

------
Pirate-of-SV
> Or maybe there is a totally simple way and this could take me 5 minutes!

At that paragraph I immediately thought of
[http://aadrake.com/command-line-tools-can-be-235x-faster-tha...](http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)

~~~
ambicapter
Holy shit.

> I have been in technology roles for over 17 years

I guess this is what experience gets you, huh.

~~~
bronxbomber92
Or if you know some basic performance characteristics of your machine (e.g.
memory bandwidth, processor speed, number of cores) and roughly how much
work you need to do (e.g. number of cycles to do X), some back-of-the-envelope
calculations can often give you the same insight (i.e. "is X feasible with
compute resource Y?").
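
A rough sketch of that kind of estimate, in Python. Every number here is a made-up assumption for illustration (10 GB of input, 200 MB/s of sustained read bandwidth, 50 cycles of work per byte, 4 cores at 3 GHz), and the model optimistically assumes reading and computing overlap perfectly, so treat the result as a lower bound:

```python
def estimated_seconds(data_bytes, bandwidth_bytes_per_s,
                      cycles_per_byte, clock_hz, cores):
    """Lower-bound runtime: the slower of moving the data and crunching it."""
    transfer_time = data_bytes / bandwidth_bytes_per_s
    compute_time = data_bytes * cycles_per_byte / (clock_hz * cores)
    return max(transfer_time, compute_time)


# Hypothetical machine and workload; substitute your own measurements.
estimate = estimated_seconds(
    data_bytes=10e9,               # 10 GB of input
    bandwidth_bytes_per_s=200e6,   # assumed sustained read rate
    cycles_per_byte=50,            # assumed cost of the per-byte work
    clock_hz=3e9,
    cores=4,
)
```

With these assumed inputs the transfer time (50 s) dominates the compute time (about 42 s), which already suggests the job is bandwidth-bound and that adding cores alone would not help much.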

------
metachris
I recently ran into the issue that the multiprocessing.pool.ThreadPool class
is not cancellable in Python 2. It seems that only in Python 3 can you exit a
ThreadPool with e.g. a KeyboardInterrupt before all tasks are finished.

Since then I've been using this code, which works great across all platforms
and Python versions:
[https://github.com/metachris/pdfx/blob/master/pdfx/threadpoo...](https://github.com/metachris/pdfx/blob/master/pdfx/threadpool.py)

~~~
megamouse
I've experienced this bug in Python 2 using "pool.map" and "pool.apply", but
KeyboardInterrupt does work if you're waiting for a finite amount of time. So
as a workaround you can use "result = pool.map_async(...);
result.get(sys.maxint)". A little hacky but functional.
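
For reference, a minimal sketch of the same idea written for Python 3, where `sys.maxint` no longer exists. The `square` function and the one-day timeout are placeholders chosen for illustration; the only point is that `get()` is called with a finite timeout:

```python
from multiprocessing.pool import ThreadPool

ONE_DAY = 24 * 60 * 60  # arbitrary "long enough" timeout, in seconds


def square(x):
    # Stand-in for the real per-item work.
    return x * x


def map_interruptibly(func, items, workers=4, timeout=ONE_DAY):
    # Leaving the `with` block terminates the pool, including when
    # a KeyboardInterrupt propagates out of get().
    with ThreadPool(workers) as pool:
        async_result = pool.map_async(func, items)
        # Waiting with a finite timeout instead of blocking indefinitely.
        return async_result.get(timeout)


squares = map_interruptibly(square, range(5))
```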

------
BogusIKnow
As is sometimes the case, people start with the wrong abstraction (thread
pool) and then optimize it.

Java/Scala streams look like they would solve most of the problems in the post
on their own.

e.g. Java streams

[http://radar.oreilly.com/2015/02/java-8-streams-api-and-para...](http://radar.oreilly.com/2015/02/java-8-streams-api-and-parallelism.html)

    
    
        private void runParallel() {
            trades.stream()
                  .parallel()
                  .forEach(t -> doSomething(t));
        }
    

Java/Scala has a hierarchy of concurrency abstractions, from high level to low
level:

Streams, Parallel collections, Futures, Actors, Fork/Join,
ExecutionContext/ExecutorService, Threadpools, Threads, synchronized,
Barriers, Atomic values and many more.

~~~
BogusIKnow
Something not so obvious for beginners: by default the JVM does not use all
the available memory. If you get out-of-memory problems with the 10 GB data
set and you have much more RAM available, you need to tell Java/Scala to
increase the maximum heap size (e.g. with the -Xmx option).

------
whalesalad
My favorite blogger of 2016 thus far. Every post is great.

Also, the bit that hits home the hardest is just how difficult it is to do
trivial things. Even a senior level programmer with ten years of experience
has to sit and say, why is this such a pain in the ass?

I guess that's one reason I really like Clojure. core.async has a really
hideous API at first glance but using channels and buffers makes doing this
sort of thing a lot more enjoyable. Anyway the intent of the article was to
dive into abstractions not to suggest more of them so I'll shush.

------
smegel
Maybe "Scala concurrency surprises" would be a better title.

~~~
masklinn
It uses the JVM's threadpools directly so it applies to more or less any
JVM-based language, and the problems it talks about (expensive sequential
operation before threadpool submission and unbounded threadpool queue) are
pretty common issues.

Witness the unbounded and unconfigurable (though replaceable — at your own
risk since you're setting a "private" variable of the pool) queue of Python
3's threadpool:
[https://hg.python.org/cpython/file/default/Lib/concurrent/fu...](https://hg.python.org/cpython/file/default/Lib/concurrent/futures/thread.py#l99)
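
One common way around the unbounded queue without touching the pool's internals is to gate submissions with a semaphore. This is only a sketch: `BoundedExecutor` and `max_pending` are names invented here, not part of `concurrent.futures`:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedExecutor:
    """Wraps ThreadPoolExecutor so submit() blocks once too much is pending."""

    def __init__(self, max_workers, max_pending):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        # One slot per task that is queued or running.
        self._slots = threading.BoundedSemaphore(max_pending)

    def submit(self, fn, *args, **kwargs):
        self._slots.acquire()  # the producer waits here when the pool is full
        try:
            future = self._executor.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot as soon as the task finishes (or is cancelled).
        future.add_done_callback(lambda _: self._slots.release())
        return future

    def shutdown(self, wait=True):
        self._executor.shutdown(wait=wait)


pool = BoundedExecutor(max_workers=2, max_pending=4)
futures = [pool.submit(pow, n, 2) for n in range(10)]
results = [f.result() for f in futures]
pool.shutdown()
```

The producer is slowed down to the speed of the workers, so at most `max_pending` tasks (and their arguments) are held in memory at once.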

------
emmelaich
Great writing.

I wonder if a ForkJoinPool [1] would be helpful here as it does work-stealing
giving better utilisation.

Although I've read admonitions that they should not be used when doing I/O.
[2] That article seems a little ranty and overblown though, so I'd like
opinions on that.

1\.
[https://docs.oracle.com/javase/8/docs/api/java/util/concurre...](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ForkJoinPool.html)
2\.
[http://coopsoft.com/ar/CalamityArticle.html](http://coopsoft.com/ar/CalamityArticle.html)

~~~
cmsd2
A coworker showed me Akka Streams recently. It allows you to overcome the
impedance mismatch between threadpools processing jobs at different speeds
where one task feeds into another. Could be a neat way to join up the
I/O-bound tasks to the CPU-bound tasks.

[https://opencredo.com/introduction-to-akka-streams-getting-s...](https://opencredo.com/introduction-to-akka-streams-getting-started/)

~~~
kevinavery
Yep, it's a really cool library. I'm pretty new to it myself, but I made a
little gist of how one might approach this problem:
[https://gist.github.com/kevinavery/941e7d67c4f8b104f610](https://gist.github.com/kevinavery/941e7d67c4f8b104f610)

I was kinda surprised by how tricky it was to make a Flow that turned
ByteStrings into separate String lines, but it seems like the custom
GraphStage approach generalizes really well to more complicated stages.

------
Terr_
> In my case, I was reading a bunch of data off disk. maybe 10GB of data. And
> I was submitting all of that data into the ExecutorService work queue.
> Unsurprisingly, the queue exploded and crashed my program.

In the spirit of brainstorming:

1) Abstract out the I/O step so that you can pass around a bunch of
lightweight and lazy "I'll get the lines for you when you ask" objects. Then
you can safely queue up a large number of them.

2) Choose to break the "job" apart into two parts, and submit the "get lines
from filename" jobs into a different, smaller pool, with the results flowing
into the larger (CPU-core-limited) pool.
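
A minimal sketch of idea 1, written in Python rather than the article's Scala. The assumption is that the queued task carries only a filename and the read happens inside the worker; `count_chars` and the throwaway demo files are hypothetical stand-ins for the real work and data:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_chars(path):
    """Read one file and return its character count, excluding newlines."""
    # The file is opened only when a worker picks this task up, so at most
    # `workers` files are being read at any moment.
    with open(path) as f:
        return sum(len(line.rstrip("\n")) for line in f)


def process_all(paths, workers=4):
    # Only the small path strings sit in the pool's queue, not file contents.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_chars, paths))


# Tiny demo with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(["ab\ncd\n", "hello\n", ""]):
        path = os.path.join(tmp, "part-%d.txt" % i)
        with open(path, "w") as f:
            f.write(text)
        paths.append(path)
    totals = process_all(paths)
```

The queue can still grow with the number of files, but each entry is a short string instead of a chunk of the 10GB data set.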

------
wsargent
The easiest way to use thread pools in Scala is to use an ExecutionContext.
Akka will let you define an ExecutionContext from a configuration file using a
wrapper called a Dispatcher -- you can ramp up the parallelism-max setting for
more cores. Then you do actorSystem.dispatchers.lookup("my-dispatcher") and
pass it to all the actor based code that you are using.

You can use a work pulling set of actors to handle computation after that, or
use Akka Streams to provide backpressure, to avoid the OOM problem.

------
bogomipz
Always enjoy reading her posts; I particularly like her enthusiastic style
too.

------
bpicolo
> Here's what that looks like in Python.

Python multithreading isn't going to parallelize, though, except for I/O
tasks or calls into C code that releases the GIL, and the latter is the only
way you're going to burn up cores. The API does make it seem simple.

~~~
delroth
The example is using multiprocessing.pool.ThreadPool which does not have the
GIL problem you mention (at the cost of making it harder to share data).

~~~
aftbit
You're thinking of multiprocessing.Pool. I think ThreadPool actually uses threads
and does hit the GIL.
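
A quick way to check this for yourself is to ask each task which process it is running in. A small sketch (the `worker_pid` helper is just for illustration):

```python
import os
from multiprocessing.pool import ThreadPool


def worker_pid(_):
    # Report the process ID of whatever is executing this task.
    return os.getpid()


with ThreadPool(4) as pool:
    pids = set(pool.map(worker_pid, range(8)))

# True when every task ran inside the current process, i.e. on threads.
same_process = pids == {os.getpid()}
```

Every task reports the parent's PID, so the workers are threads in one interpreter and share its GIL. Swapping in `multiprocessing.Pool` should report separate worker PIDs instead.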

------
known
Good write up.

