
Writing Multithreaded Applications in C++
http://deathbytape.com/post/110008612055/cpp-threading
======
exDM69
Both of these examples exhibit the "busy waiting" anti-pattern, where a thread
waits for something in a loop with a sleep or yield in it. This consumes CPU
time, power, and other resources, and may sleep longer than necessary. The
proper fix is to wait on a condition variable, which makes sure the thread
remains inactive while idle.

This may be fine for a trivial example of std::thread and std::mutex, but you
should almost never do it in practice (there are some exceptions in kernel
space when waiting for an I/O device).

I have a few useful rules of thumb for multithreaded programming which I want
to share:

1) Almost every mutex should be coupled with one or (usually) more condition
variables

2) A mutex variable named "mutex" is a code smell. Ditto for condition
variables named "cond". They usually have an easy-to-describe function such as
"queue_lock", "not_full" or "not_empty" (see the sketch below).

~~~
functional_test
I'll start by saying that I agree with almost everything you've said. I can
think of one time where a loop with a sleep might be appropriate, though -- if
you have something, perhaps some I/O channel like a log, and you want it
flushed regularly, but not after every write. With a long enough sleep
(>100ms) the performance overhead ought to be low. Is there a better way to
design that, though?

~~~
std_throwaway
In that case you don't want to sleep. You want to wait until a condition (new
data) or a timeout (100ms) occurs.

The reasoning is that you don't know how much data will arrive in the next
100ms, and if you sleep, either a buffer could overrun or something else could
block. Therefore you need to flush the buffer either when enough data has
accumulated or when a timeout has occurred.
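
A minimal sketch of that pattern with std::condition_variable::wait_for; the
buffer, threshold, and flush_to_disk() are hypothetical stand-ins:

    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <string>
    #include <vector>
    
    std::mutex buffer_lock;
    std::condition_variable has_data;
    std::vector<std::string> buffer;
    const size_t kFlushThreshold = 64;            // assumed threshold
    
    void flush_to_disk(const std::vector<std::string> &batch);  // hypothetical slow I/O
    
    void write_line(std::string line) {
        std::lock_guard<std::mutex> lock(buffer_lock);
        buffer.push_back(std::move(line));
        if (buffer.size() >= kFlushThreshold)
            has_data.notify_one();                // wake the flusher early
    }
    
    void flusher() {
        for (;;) {
            std::unique_lock<std::mutex> lock(buffer_lock);
            // Wakes when the predicate holds or after 100ms, whichever is first.
            has_data.wait_for(lock, std::chrono::milliseconds(100),
                              [] { return buffer.size() >= kFlushThreshold; });
            if (!buffer.empty()) {
                std::vector<std::string> batch;
                batch.swap(buffer);               // grab the data under the lock
                lock.unlock();
                flush_to_disk(batch);             // do the slow I/O unlocked
            }
        }
    }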

~~~
functional_test
I can see where you're coming from. For my particular application that's not a
concern (and I have sequence numbers to know when I've dropped data).

------
mhogomchungu
Qt/C++ users can now use futures with .then() and .await() constructs through
this tasks[1] library.

An example of the .then() construct is here[2].

An example of the .await() construct is here[3]. What happens here is the
following:

1. The function is suspended at that line. The suspension does not block the
thread, and hence the GUI remains responsive.

2. A thread is created and the lambda is run in the new thread.

3. When the lambda completes, the function is unblocked and finishes up its
work.

The above gives easy-to-read, non-blocking code that reads like synchronous
code.

[1]
[https://github.com/mhogomchungu/tasks](https://github.com/mhogomchungu/tasks)

[2]
[https://github.com/mhogomchungu/zuluCrypt/blob/d0439a4e36521...](https://github.com/mhogomchungu/zuluCrypt/blob/d0439a4e36521e42fa9392b82dcefd3224d53334/zuluMount-gui/mainwindow.cpp#L812)

[3]
[https://github.com/mhogomchungu/zuluCrypt/blob/eadd2643291f3...](https://github.com/mhogomchungu/zuluCrypt/blob/eadd2643291f37d8dc96302d854f1ae8e29758d9/zuluCrypt-gui/createkeyfile.cpp#L169)

------
Animats
It's nice that C++11 and later have standard wrappers around pthreads. But
that doesn't make multithreaded programming any easier. Threads still work the
same way. All the usual problems, from race conditions to termination, remain.
There's nothing comparable to Go's race detector or Rust's compile-time
concurrency checking.

I've done a fair amount of concurrent programming in C++; I wrote most of the
code for a DARPA Grand Challenge off-road autonomous vehicle a decade ago.
That was fun. Some of the code is hard real time, and some isn't. Thread
priorities matter. (This was on QNX, which is a hard real time OS.) There are
some unusual techniques; for example, I had a "logprintf" function which was
non-blocking. "logprintf" wrote to a queue being written to disk by another
thread, and if the queue was full, "..." went into the log. "logprintf" thus
could not delay a real-time task and could be used safely in hard real time
sections.
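
A rough sketch of the idea (illustrative, not the original QNX code; a truly
hard-real-time version would want a lock-free queue or a priority-inheriting
mutex rather than the plain one here):

    #include <condition_variable>
    #include <cstdarg>
    #include <cstdio>
    #include <deque>
    #include <mutex>
    #include <string>
    
    static std::mutex queue_lock;
    static std::condition_variable not_empty;
    static std::deque<std::string> log_queue;
    static const size_t kMaxQueued = 1024;   // assumed bound
    
    // The real-time caller only formats and enqueues; it never waits for disk.
    void logprintf(const char *fmt, ...) {
        char buf[256];
        va_list ap;
        va_start(ap, fmt);
        vsnprintf(buf, sizeof buf, fmt, ap);
        va_end(ap);
    
        std::lock_guard<std::mutex> lock(queue_lock);
        if (log_queue.size() >= kMaxQueued) {
            if (log_queue.back() != "...")
                log_queue.push_back("...");   // mark the gap in the log
        } else {
            log_queue.push_back(buf);
        }
        not_empty.notify_one();
    }
    
    // Writer thread: the only code that touches the (slow) disk.
    void log_writer(FILE *out) {
        for (;;) {
            std::unique_lock<std::mutex> lock(queue_lock);
            not_empty.wait(lock, [] { return !log_queue.empty(); });
            std::string line = std::move(log_queue.front());
            log_queue.pop_front();
            lock.unlock();
            std::fprintf(out, "%s\n", line.c_str());
        }
    }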

~~~
pcwalton
The Go race detector was pretty much just a port of Google's Thread Sanitizer
for C++.

------
Too
I'm curious what guarantees C++ gives on reading variables like the 'data' in
his first example when it can be modified from elsewhere.

Yes, the lock will prevent concurrent access, but will it prevent the compiler
from optimizing away the read completely, since main() was the last function
that wrote to the variable? I've seen nasty optimization bugs like that happen
in single-threaded C with strict pointer aliasing, and they are not easy to
track down.

One could interpret his example program like this (unrolled and tweaked a
bit).

        mutex theLock;  // the lock shared with the producer thread
    
        int data = 0; // OK: data is now guaranteed to equal 0
        thread t(produce, &data); // a thread got a pointer to data, so to read data again I must invalidate my registers
        {
            lock_guard<mutex> lock(theLock);
            data = 5;  // I know I set data to 5 here
        }
        this_thread::sleep_for(chrono::milliseconds(500));  // call contains no references to data; sleep is a "pure" call
        {
            lock_guard<mutex> lock(theLock);    // call contains no references to data
            if (data == 5)  // I set data to 5 myself and haven't done anything that could touch it since, so we optimize away this branch
               cout << "data is 5";
        }
        

Will the call to lock_guard simply disable all such optimizations? Or are you
required to write memory barriers yourself in some way? I guess a simple way
to look at it would be that the compiler assumes the thread will keep the
pointer to data and that lock_guard has access to that pointer behind the
scenes somehow, so that acquiring the mutex "invalidates" your last write. But
that would mean acquiring _any_ mutex would invalidate _all_ variables, as the
mutex doesn't specify which data it is locking. Am I thinking correctly? Where
can I find more information regarding this?

~~~
jhdevos
> Where can I find more information regarding this?

The C++ standard, of course :) But if you prefer some lighter reading, either
[http://www.cplusplus.com/reference/mutex/mutex/unlock/](http://www.cplusplus.com/reference/mutex/mutex/unlock/)
or
[http://en.cppreference.com/w/cpp/thread/mutex/unlock](http://en.cppreference.com/w/cpp/thread/mutex/unlock)
explain what happens, in slightly different ways. I usually prefer the
cppreference version, because it uses terminology from the standard more
consistently, but there is no harm in reading multiple explanations about
difficult concepts like these :)

I'd repeat the explanation here, but I think you'd best be served by reading
one of the two links above.

------
jarjoura
I would recommend using std::async instead of std::thread directly as it
leaves it up to the STL implementation to manage a thread pool.

[http://en.cppreference.com/w/cpp/thread/async](http://en.cppreference.com/w/cpp/thread/async)
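
A minimal usage sketch; note that the standard leaves the scheduling policy to
the implementation, so a thread pool is possible but not guaranteed:

    #include <future>
    #include <iostream>
    
    int main() {
        std::future<int> result = std::async(std::launch::async, [] {
            return 6 * 7;                    // runs on another thread
        });
        std::cout << result.get() << "\n";   // blocks until the lambda finishes
    }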

------
polskibus
Let me recommend the Threading Building Blocks library from Intel - it's great
for writing parallel solutions to data-parallelism problems.

------
vkjv
ZMQ is most often associated with network communication, but its in-process
sockets are fantastic for multithreaded programming.
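
A minimal sketch of the classic inproc pattern (endpoint name assumed): two
threads sharing one ZMQ context and talking over a PAIR socket. With inproc,
the bind() must happen before the connect(). Link with -lzmq:

    #include <zmq.h>
    #include <cstdio>
    #include <thread>
    
    int main() {
        void *ctx = zmq_ctx_new();
    
        void *rx = zmq_socket(ctx, ZMQ_PAIR);
        zmq_bind(rx, "inproc://pipeline");   // bind first, in this thread
    
        std::thread worker([ctx] {
            void *tx = zmq_socket(ctx, ZMQ_PAIR);
            zmq_connect(tx, "inproc://pipeline");
            const char msg[] = "work done";
            zmq_send(tx, msg, sizeof msg, 0);
            zmq_close(tx);
        });
    
        char buf[64];
        zmq_recv(rx, buf, sizeof buf, 0);    // blocks until the worker sends
        std::printf("%s\n", buf);
    
        worker.join();
        zmq_close(rx);
        zmq_ctx_destroy(ctx);
    }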

~~~
easytiger
I can't describe how important using ZMQ has been in architecting some pretty
heavy-lifting solutions. You end up with lots of very well-defined boundaries
in your application pipeline, and they are language-agnostic.

------
fsloth
Kudos! So elegant. Both C++11 and the article.

------
robmccoll
OpenMP is still the fastest and easiest route to multithreading in C and C++.
Got a for loop with no loop-carried dependencies?

    #pragma omp parallel for
    for (...)

The Wikipedia page isn't a bad starting point:
[http://en.m.wikipedia.org/wiki/OpenMP](http://en.m.wikipedia.org/wiki/OpenMP)
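
As a minimal, self-contained sketch of that pattern (compile with -fopenmp on
GCC/Clang or /openmp on MSVC):

    #include <vector>
    
    int main() {
        std::vector<double> a(1000000);
        // Iterations are split across threads; i is private to each thread.
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++) {
            a[i] = i * 0.5;   // no loop-carried dependency, so this is safe
        }
    }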

~~~
SoapSeller
OpenMP is terrible.

While you can argue whether introducing pragmas as control structures is a
good idea or not, the performance issues under high load (>80% total CPU) and
the caveats (two OpenMP versions in the same app, multiple threads running
OpenMP-parallelized code, etc.) are just not worth it.

TBB[0] is a much better option (especially with C++11, where you can use
lambdas to wrap your to-be-parallelized code instead of putting it elsewhere),
and you get all sorts of goodies with it (e.g. thread-safe containers).

[0]
[https://www.threadingbuildingblocks.org/](https://www.threadingbuildingblocks.org/)
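
For comparison, a minimal sketch of the lambda style described above, using
tbb::parallel_for (link with -ltbb):

    #include <tbb/parallel_for.h>
    #include <vector>
    
    int main() {
        std::vector<float> v(1000000, 1.0f);
        tbb::parallel_for(size_t(0), v.size(), [&](size_t i) {
            v[i] *= 2.0f;   // the to-be-parallelized code, defined in place
        });
    }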

~~~
TillE
I've had no trouble using OpenMP (with Visual C++ and GCC), though I admit I
haven't rigorously benchmarked it compared to other solutions.

As far as I know, it's just syntactic sugar over a pretty simple thread pool
model. What are the performance issues at high load?

~~~
goalieca
Many years ago I benchmarked OpenMP against other frameworks. It was very easy
to use but performed poorly.

------
flipcoder
Using boost::coroutine, resumable functions are already possible. I've
implemented an AWAIT() macro in my scheduler, and its usage looks like this:

[https://github.com/flipcoder/kit/blob/master/toys/src/echo.c...](https://github.com/flipcoder/kit/blob/master/toys/src/echo.cpp#L25)

Implementation upon boost::coroutine here:
[https://github.com/flipcoder/kit/blob/master/include/kit/asy...](https://github.com/flipcoder/kit/blob/master/include/kit/async/mx.h)

This is only a proof of concept, but it works.

------
jokoon
What's the real reason for using threads today? Why not use multiple processes
instead? Or even OpenCL/CUDA if one needs to crunch numbers?

The only real use case for threads I can see right now is a large game engine
that needs to process sound, graphics, networking, and input all at the same
time. What other types of applications need threads?

~~~
chrisseaton
In some application domains, the program conceptually operates on a big shared
data structure. The usual example applications are mesh triangulation and mesh
refinement [1].

For these kinds of algorithms, it is not possible to divide the work up into
isolated jobs that can be run entirely independently of the others, as until
we start solving something like a refinement problem, we don't know what parts
of the graph we will need.

That rules out OpenCL and CUDA, as they want to take a job and run it to
completion without having to worry about what anyone else is doing.

It also makes multiple processes less attractive. If you have isolated
processes and just pass messages between them, then you have the same problem.
Do you have one process that handles the shared data structure? Well then
you've effectively serialised your program. Do you divide the data structure
up between processes? I wouldn't know how to do that, as operations may span
the partitions.

Perhaps you'll use shared memory between the processes? Well then you're
effectively writing multi-threaded code and you might as well use threads.

In the end the easiest way to do things that we know of is to use multiple
threads, shared memory, synchronisation primitives and optimistic algorithms.

[1] K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R.
Kaleem, T.-H. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos,
and X. Sui, "The Tao of Parallelism in Algorithms," in Proceedings of the
32nd Conference on Programming Language Design and Implementation (PLDI),
2011.

~~~
aardvark179
Thank you for being a voice of sanity on this Chris.

There seems to be a terrible amount of cargo-culting around safe concurrent
programming, and reinvention of the wheel just because somebody heard the
original wheel was bad (cf. processes and shared memory vs. threads), and hope
that CSP and similar schemes will act as a magic bullet.

In the end you need to think carefully about the design of your program, work
out what races might occur, and try to encapsulate any shared-state mutation
in as small an area as possible, where it can be sensibly reasoned about.
Remember that using concurrent data structures will not automatically make
your code safe; put in comments saying why any shortcut you take is safe, and
do really thorough code review.

