
A simple C++11 thread pool implementation - hellofunk
https://github.com/progschj/ThreadPool
======
brandmeyer
Threadpools are rabbit-hole problems. They seem structurally simple at first,
but they also have some unexpected failure modes. There are two that I'm aware
of, and there are almost certainly more.

1) Nonuniform work sizes. Optimally scheduling the work when the sizes are
nonuniform is Hard. An almost-optimal algorithm is to dispatch larger items
before smaller items. This implementation dispatches work in FIFO order. While
this is Not Wrong (TM), it's far from optimal when the work sizes are
nonuniform.

2) Small work sizes. As the size of each work item gets smaller and smaller,
the proportion of the runtime cost which is spent to dispatch the work becomes
larger and larger. I'm working on a system which has dozens of work items to
execute, but each one is only 50-100 microseconds long. It turns out that this
is an awfully inconvenient work size, and contention on a common FIFO was
limiting available parallelism. I found I got nearly 50% speedup by giving
each worker thread its own workqueue and round-robin dispatching among them.

3) (edited to add) They don't model fork/join parallelism well on a common
pool. If work items also wait for work to complete, then the OP's
implementation may deadlock them. To avoid deadlock, the join operator must be
capable of executing a work item. The C++11 future API isn't rich enough to do
this in user code. See also WG21 paper N3557 for more on this topic.

From the perspective of a standards body, threadpools are similar to
associative data structures. There is a community of users who would like to
have a one-size-fits-all implementation. But it turns out that actual
workloads have wildly differing requirements. In practice, a standard library
must provide several different models of a threadpool to satisfy them, or
users will be forever complaining and/or rolling their own.

~~~
w0utert
Just curious, but is there a reason why you didn’t just use a single
lock-less queue? Per-thread work queues sound like a perfectly serviceable
solution, but I can imagine it being an advantage to hide that added
complexity behind a lock-less queue (which could in fact be implemented
using multiple queues internally, though I think that’s just one of several
ways to implement such a thing)

~~~
dragontamer
> use a single lock-less queue

I dare say that multiple-producer + multiple-consumer lockless queues are not
merely "difficult", but straight-up "impossible" to write.

You MUST make a decision early on:

1. Settle on single-consumer / single-producer ("True" parallelism of 2x
threads best case) on the queue, retaining the "first-in / first-out"
property of the queue.

2. Decide that "first-in / first-out" is unnecessary in your use case, and
generalize to multiple-producer / multiple-consumer code.

After all, it is impossible to define first-in / first-out if you cannot
define an insertion order, or consuming order. If you decide on a defined
insertion order, it is equivalent to deciding upon a "single" producer
protected by a spinlock or mutex. (And lo-and-behold, all your code is just a
very, very complex mutex). Ditto for the consumer side.

-------

You CAN write 1x consumer + 1x producer lockless queues with defined total
ordering. You CAN have "many" consumers sequentially "acting" as the 1x
consumer, or "many" producers "acting" as the 1x producer (by sequentially
choosing one through CAS-loops or mutexes).

But I dare say, 30x consumers each being "consumer 1 at a time" isn't really
a multi-consumer queue; it's a 1x consumer queue with a mutex on the
front-end :-)

-------

Deciding upon a 1x producer / multi-consumer (or the inverse: multi-producer /
1x consumer) queue just gives you a grossly inefficient queue.

If you cannot define the insertion order, then you cannot define the removal
order. Similarly, if you cannot define the removal order, you cannot define
the insertion order. You might as well "undefine the order completely" to have
better performance.

-------

In any case, your assumption here has a gross amount of complexity you're
ignoring. Lockless queues are NOT an item to be trifled with.

~~~
w0utert
Not sure if I'm following. In many (most?) multiple-producer/multiple-consumer
cases it is completely irrelevant if the work-queue guarantees a strict
insertion/removal order, as long as the queue is 'fair' (time in queue is
distributed evenly over all work items). If that's all you need, a lock-free
implementation using atomic operations does not have to be complicated, or
what am I missing here?

~~~
dragontamer
A "work-queue" doesn't have strict ordering in many multithreaded programs.
But a "queue" or "fifo" is expected to have a strict ordering.

The confusion comes in because many programmers expect a strict-ordering on
"work-queues", especially because many work-queues are made out of simple
queues or fifos.

The big decision point for many is finally deciding, and/or realizing, that
strict-ordering is not necessary in many pragmatic use cases. But this also
comes with a double-edged sword: many algorithms require strict ordering of
tasks.

\---------

Take a Breadth First Search, as an example. One core assumption of BFS is that
the first solution you come across will be the solution closest to the root.

Ex: If searching for "Checkmate" in a chess puzzle, the BFS implementation
will ALWAYS find the shortest checkmate. Let's say you're in a situation with
"many checkmates", checkmate in 5-moves, checkmate in 7-moves... etc. etc. But
you want to absolutely find the shortest one. BFS is clearly the answer. But
ONLY if your queue is strictly FIFO.

Now if you parallelize the BFS search by replacing the true FIFO queue with a
pseudo-FIFO parallel queue, it's no longer guaranteed to find the shortest
checkmate.

------
Koshkin
On a somewhat tangential note, for C++ threading, Intel TBB is the best thing
since sliced bread. Can't recommend it enough. It's just plain amazing. Very
advanced. Complete with a flow-graph implementation. It should be part of the
toolkit of every serious programmer (on par with the C++ standard library).

------
knorker
This is a "hello world" of C++11 threading.

I'm not sure what this is doing on HN.

~~~
stabbles
I wonder how many people would know about

    
    
        using return_type = typename std::result_of<F(Args...)>::type;
    

though.

~~~
knorker
Maybe in this context they shouldn't, either.

    
    
        using return_type = decltype(f(args...));
    

seems easier to read and "more C++11".

Also the "> >" is a pre-C++11 pattern.

~~~
JakobProgsch
Earlier versions did use decltype but that causes issues with pointers to
member functions. See:
[https://github.com/progschj/ThreadPool/commit/ac2a77d5cc0436...](https://github.com/progschj/ThreadPool/commit/ac2a77d5cc04361d341f63b1d19706a4d1b4943e)

result_of has now been deprecated in newer C++ versions, but since this
advertises itself as a C++11 thread pool I intend to leave it as is.

------
0xff00ffee
It's been so long since I've programmed in C++ that I can't even read this
code. Last time I wrote an anonymous function I had to use boost::lambda, now
there's an override for [] on line 39? And wtf is the arrow operator doing out
there in space on line 19?

Starting the day with heavy dose of FOMO...

~~~
thestoicattack
L19 is a trailing return type.

L39 et seq. is a lambda expression.

~~~
0xff00ffee
Thanks.

Native lambdas. Huh. It's like waking up to see people flying their cars to
work. :)

~~~
steveklabnik
Here are the docs, if you're curious:
[https://en.cppreference.com/w/cpp/language/lambda](https://en.cppreference.com/w/cpp/language/lambda)

------
jupp0r
You better hope the potentially blocking waiters don’t run inside a thread
pool or trouble awaits (pun intended).

------
abacate
The queue is a single contention point, which will have considerable overhead
unless each work item takes a long time to execute (i.e., enqueues are
infrequent).

This kind of approach is usually not what you want: either you want to spawn
threads as new work arrives (when each work unit is independent of each
other), or you want to have a static number of threads each doing some
predefined work (ideally one per core).

Scheduling pieces of work like this is going to be terrible for cache locality
and will probably result in a number of threads waiting for each other and
result in underutilization of the available cores.

~~~
JakobProgsch
> This kind of approach is usually not what you want

One thing I learned from all the issues and emails I have received about this
repository over the years is that what people consider the "usual" use case of
threads varies wildly. Some only care for the concurrency but not the
performance, some have huge amounts of tiny work items, some have small
amounts of huge work items, some have complex dependencies, some have none at
all etc.

------
jhasse
Unfortunately no longer maintained. Multiple forks exist though:
[https://github.com/progschj/ThreadPool/issues/40#issuecommen...](https://github.com/progschj/ThreadPool/issues/40#issuecomment-468250869)

~~~
hellofunk
It's not maintained but it sure does work well. I've been throwing all kinds
of workloads at it, always works like a charm.

The forks don't seem to actually fix problems, they just add various features
to the library, features I don't personally need. And many of the forks are
also not updated in a long time.

I guess short and simple is sometimes good enough to last?

------
NCG_Mike
Not sure, but there may be two bugs in the code: the use of notify on the
condition variable outside of the mutex lock.

I know this would cause problems with boost, not sure about std::.

~~~
thestoicattack
Per [1] the lock does not need to be held when calling notify. In the C++17
standard this is §33.5, but I can't quite parse out the real requirements
there.

[1]
[https://en.cppreference.com/w/cpp/thread/condition_variable](https://en.cppreference.com/w/cpp/thread/condition_variable)

~~~
graetzer
It's usually a good idea to hold the lock because the other thread(s) might
not have called wait() yet.

As a rule of thumb: it's usually a bug to not hold the lock before notify(),
unless you synchronize it with another condition

~~~
usefulcat
In this case I think it should be ok because !this->tasks.empty() is part of
the predicate, and will be true after a new item is enqueued (and before
notify is called). So if the producer is interrupted between releasing the
mutex and calling notify, the consumer would not call wait.

But yeah, in general I agree with the advice.. I'm pretty sure I've made that
mistake before.

~~~
ComputerGuru
It depends on whether the predicate can (also) be changed without the lock
being held, as otherwise the release/acquire ordering semantics are not
guaranteed. So long as items are only ever enqueued with the lock held, it
should be ok.

~~~
gpderetta
according to cppreference:

"Even if the shared variable is atomic, it must be modified under the mutex in
order to correctly publish the modification to the waiting thread."

That's not normative, but I can't be bothered to check the C++ standard for
the exact wording. Also, SUSv4 doesn't have an explicit prohibition on
modifying the predicate outside a critical section. But if you do, you are
truly on your own.

~~~
ComputerGuru
It is because condition variables have no state, i.e. calling signal without
waiters is equivalent to doing nothing and does not affect a future wait. The
concern is that a thread is switched out after it checks the condition but
before it calls wait, and in that moment the condition variable is signaled.
The thread then calls wait and is not awoken (potentially never to be awoken).
If the predicate is not altered without the mutex and the waiting thread has
the mutex at the time it checks the predicate (this latter part is a must in
all cases), then it is guaranteed the predicate will remain unmet at least
until the thread has switched to a waiting state.

So simply using an atomic (even with a full release-acquire memory barrier) is
not necessarily sufficient as that would only enforce the coherence but not
prevent the scheduler from interrupting at the wrong moment.

------
trzeci
The 'simple' adjective is vastly abused: in the majority of cases the code
behind a 'simple' project isn't simple. In this case, a person without
intermediate-to-advanced knowledge of threading / binding / synchronisation /
generic programming will struggle to grasp what's going on in the
implementation.

Side note is that I love how rust simplified that
([https://docs.rs/threadpool/1.7.1/threadpool/](https://docs.rs/threadpool/1.7.1/threadpool/)).

~~~
mschuetz
It's simple as in simple to use. Something that C++ needs more of.

~~~
jupp0r
No it's not. It looks like it on the surface, but real-world usage will run
into the blocking `get()` method blocking threads on the thread pool, leading
to deadlocks or low throughput. What C++ needs is a well-thought-out design
for transferring values in a purely asynchronous manner (like Rust futures,
JavaScript promises, etc).

~~~
MaxBarraclough
Is that kind of thing a concern in other thread-pool implementations too?

~~~
jupp0r
The thread pool normally just allows you to execute code (think void()
lambdas). Returning values to the caller isn't something thread pools
normally do; that is normally managed by higher-level concurrency libraries
like Facebook's Folly or libq (among others).

~~~
MaxBarraclough
Ah yes of course, _auto result = pool.enqueue...._ is blocking. Oh dear.

~~~
jupp0r
The assignment isn’t, but result.get() is (potentially, if the computation
hasn’t finished yet).

~~~
MaxBarraclough
Ah yes, I skimmed the code too quickly. Again :-P

The _result_ local holds a future/task, _not_ the result itself, and get()
may force a synchronous wait, like _Task.Result_ in .NET.

Somewhat related ramblings: I couldn't have misread the code in this way were
it not for C++'s new type-inference with _auto_. In C#, I generally only use
type-inference when first writing a statement, and then have the IDE
substitute the _var_ keyword to give a traditional 'explicit' declaration.
Explicit types greatly help a codebase to be 'self-documenting'. Readability
outweighs writing speed.

