
Hell is a multi-threaded C++ program (2006) - mpweiher
http://codemines.blogspot.com/2006/09/hell-is-multi-threaded-c-program.html
======
oppositelock
I've been writing massively threaded C++ code since the 1990s, so I can
comment with some serious experience here. My code used to run on 256-CPU SGIs
and had high CPU utilization across all of them.

When designing C++ code for threading, you have to keep a few things in mind,
but it's not particularly harder than on other languages with thread support.
You generally want to create re-entrant functions which pass all of their
state as input, and return all results as output, without modifying any kind
of global state. You can run any number of these in parallel without thread
safety issues, or locks. When using multiple threads with shared data, you
have to protect it with locks, but I generally use a single thread to access
data of this sort which produces work for other threads to do, which can be
queued or distributed any other way. Producer/consumer queues show up a lot in
good designs, and your single-threaded choke points are usually a producer or
a consumer, which allows thread fan-in or fan-out. (As a side note, Go really
got this right with channels).

What leads to problems in threaded C++ programs is unprotected, shared state.
Some C++ design patterns make life particularly hard, such as singletons,
which are just a funny name for global variables. C++ gives you lots of ways
to make something global accidentally - static member variables, arguments by
reference, for instance.

True C++ hell is taking a single threaded C++ program and making it multi-
threaded after the fact, but that's hell in any language which allows you to
have globals of any sort and doesn't perform function calls within some kind
of closure.

The article in this post is being a little reductionist. You don't need to
pick a threading model - some of your threads can pass messages, others can
use shared data, all in the same program. The lack of thread local storage is
not an issue, because you partition the data so that no two threads are
working on the same thing at the same time.

~~~
rdtsc
Obviously it is possible to write threaded C++ programs. And there are
experienced people like you who can do it well, with enough discipline, good
design, and so on. But I think the point of the article is that for most
programmers it is very easy to make mistakes and shoot themselves in the foot
with threads and shared memory. It could even be something like using a 3rd-
party library whose initialization context can't be shared, but a mistake
was made and it did end up being shared by accident, say passed to some
worker threads.

> What leads to problems in threaded C++ programs is unprotected, shared
> state.

That's one of the main problems, but I don't think people start by saying
"we'll just have this unprotected shared state and hope for the best"; that
shared state ends up being shared by accident, or as a bug. I've seen enough
of those and they're not fun to debug (the hardware environment is slightly
different, say cache sizes are a bit off, making it more likely to happen at
a customer's site, on a Wednesday evening at 9pm, but never during QA
testing).

Another area I've seen bugs in is mixing non-blocking (select / epoll / etc.)
based callbacks with threads. It's pretty easy to get tangled there. Throw
signal handlers in and now it is very easy to end up with a spaghetti mess.

Even worse, there is a difference between "we'll start with a clean, sane
threaded design from the start" vs "we'll add a bit of threading here in this
corner for extra performance". That second case is much worse and can result
in subtle and tricky bugs. Sometimes it is not easy to determine if the code
is re-entrant in a large code-base.

Interestingly, and somewhat tongue in cheek, someone (I think Joe Armstrong,
but I may be wrong) said to try to think about your programming environment
as an operating system. It is 2017; most sane operating systems have
processes with isolated heaps, startup supervision (so services can be
started / stopped as groups), preemption-based concurrency (processes don't
have to explicitly yield) and so on. Everyone agrees that's sane and normal,
and would say that putting their latest production release on Windows 3.1
would not be a good idea. Why do we then do it with our programming
environment? A bunch of C++ threads sharing memory is a bit like that Windows
3.1 environment, where the word processor crashes because the calculator or a
game overwrote its memory.

~~~
snarfy
I think the OP's attitude is still correct. Threaded programs are hard. It
takes experience, discipline, and good design. There is no band-aid for that.
The technology is helping with things like local storage, built in actor
models, borrow checkers, etc. but those are only helping the design aspect. It
still takes discipline, like knowing not to mix up async callbacks across
threads. You are basically writing your own little kernel, and writing a
kernel is hard, but once you know the rules it can be done.

~~~
pjc50
I think rdtsc's point was that writing your own kernel in the present day is a
terrible mistake unless you genuinely need something not available otherwise.

------
fsloth
I've found message passing and thinking in transactions are pretty good
architectural patterns for maintaining one's sanity. Message queues with
spinlocks for performance-critical code (mutexes are slow).

The biggest sin generally is to disregard the overhead caused by thread
management, thinking that more threads make the software run faster.

I've seen people try to parallelize a sequential program by just spawning
mutexes everywhere and then thinking any number of threads can do whatever
they please. Of course, when tested, the system was quite a bit slower than
when it ran on a single thread (the system was quite large, so quite a lot
of work was needed before reaching this state).

~~~
exDM69
An uncontended mutex is just as fast as a spinlock (on modern operating
systems using a futex). It takes about 25 nanoseconds to lock it.

The difference is when there's contention. A spinlock will burn CPU cycles but
a mutex will yield to another thread or process (with some context switch
overhead).

A spinlock should only be used when you know you're going to get it in the
next microsecond or so. Or in kernel space when you don't have other options
(e.g. interrupt handler). Anything else is just burning CPU cycles for
nothing.

Mutex and condition variables (emphasis on the latter) are much more useful
than spinlocks and atomics for general multithreaded programming.

~~~
oppositelock
You have to be careful with those things - for instance, you have to special-
case whether you're on a single-CPU or multi-CPU system, because a spin lock
will block forever in a non-preemptive context, such as the kernel.

Outside of hard realtime code, there's zero reason to use spin locks.

~~~
gpderetta
> Outside of hard realtime code, there's zero reason to use spin locks.

On some common architectures, releasing a spin lock is cheaper than releasing
a mutex.

~~~
comex
On all architectures, releasing a mutex requires at least a branch (to see if
you need to wake up sleeping threads) that you don’t need with a pure
spinlock.

But if you don’t have a guarantee the lock owner won’t be preempted, well,
spinning for a whole timeslot is quite a bit more expensive…

~~~
Skunkleton
Spin locks are a tough sell in a preemptable context. Say you have two
processes that share some memory location. They both briefly access it for a
very short time, so you protect it with a spin lock. Well, what happens in the
case when one of the threads is preempted while holding the lock? The other
thread would try to acquire it, and just spin for its entire timeslot. No
bueno. When you call spin_lock() in the kernel it actually disables preemption
until you call spin_unlock() to avoid this. You can't disable preemption from
userspace. There might be a use for a spin lock if you have a process that is
SCHED_RR, but I haven't seen it.

~~~
gpderetta
You are not wrong in general, but note that there are ways to disable
preemption in userspace in practice, be it SCHED_FIFO[1] or isolcpus with
CPU pinning.

[1] you better be careful with spinlocks and priorities here as you can
livelock forever.

------
andrewguenther
When I think about hard problems, I like to look at projects I feel handle
these problems well. Having worked in the Chromium code base for some time
now, I really appreciate their approach to threading. So much so that I
really wish they would distribute their base/threading module as its own
project.

[https://chromium.googlesource.com/chromium/src/+/lkcr/docs/threading_and_tasks.md](https://chromium.googlesource.com/chromium/src/+/lkcr/docs/threading_and_tasks.md)

------
mbil
> While standardization and support for [threaded] APIs has come a long way,
> their use is still predominantly restricted to system programmers as opposed
> to application programmers. One of the reasons for this is that APIs such as
> Pthreads are considered to be low-level primitives. Conventional wisdom
> indicates that a large class of applications can be efficiently supported by
> higher level constructs (or directives) which rid the programmer of the
> mechanics of manipulating threads. Such directive-based languages have
> existed for a long time, but only recently have standardization efforts
> succeeded in the form of OpenMP. OpenMP is an API that can be used with
> FORTRAN, C, and C++ for programming shared address space machines. OpenMP
> directives provide support for concurrency, synchronization, and data
> handling while obviating the need for explicitly setting up mutexes,
> condition variables, data scope, and initialization.

From Introduction to Parallel Computing

OpenMP: [http://www.openmp.org/](http://www.openmp.org/)

~~~
xaa
Yes, every time I hear people talking about how evil or difficult threads are,
I just think back to my frequent use of OpenMP. It is really quite easy to
take a single-threaded program and make it multithreaded with OMP -- so long
as the jobs only read, and do not write, shared state.

Usually I will write a program that does the following:

1\. Initialize global read-only data structures

2\. Parallelize jobs across STDIN lines (or whatever)

3\. Output results within a #pragma omp critical block

It works wonders, and is literally 5 additional lines of code to make a
single-threaded program multithreaded with OMP. But only for some types of
program. The concept is very similar to what GNU parallel does.
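The three-step recipe above can be sketched as follows. This is a hedged
illustration (the function name is made up), applied to summing squares: the
input is read-only shared data, each iteration is an independent job, and
OpenMP's `reduction` clause combines per-thread partial results (the
`critical` block for output is omitted here for brevity):

```cpp
#include <vector>

// Compiled with -fopenmp the loop runs in parallel; without it the pragma
// is ignored and the loop runs single-threaded, with the same result.
long sum_of_squares(const std::vector<int>& input) {
    long total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < (int)input.size(); ++i)
        total += (long)input[i] * input[i];  // independent, read-only job
    return total;
}
```

The one pragma really is the entire cost of parallelizing this loop, which
is the point being made above.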

In short, threads are not the problem per se, they are just the wrong tool for
the job if you have a multiple-readers multiple-writers situation. Message
passing, databases, or something else are more appropriate in those cases. But
you will pry my _read only_ shared memory space out of my cold, dead hands.

~~~
imtringued
Using OMP is incredibly easy, but in my experience it sometimes leaves up to
20% of performance on the table vs. doing and tuning everything manually.

------
iainmerrick
Pthreads seems to be getting a bad name here, but compared to other _low-level
threading APIs_ it's excellent. It has a proper join operation, which is
weirdly lacking in some other libraries I've used, and it has a good choice of
key synchronisation primitive: condition variables. Other choices are
possible, like semaphores, but those are fiddly to use for some applications.
Building semaphores on top of CVs is pretty easy, building CVs on top of
semaphores is very hard. Likewise Windows events.

Higher-level systems have primitives like message queues, thread pools, and
channels. Those are all great but they can't always be used to build the exact
system you need without a lot of overhead.

C and C++ are all about emphasising speed and flexibility over safety. Whether
you think that's a good or bad approach, it is what it is, and pthreads is the
right design for that approach. Once you come up with the high-level
multithreading abstraction of your dreams, what else would you rather
implement it in than pthreads?

The one thing that was missing from pthreads was atomic operations for modern
lockless data structures, so it's great that those have finally been added to
the standard library in both C and C++.

------
beached_whale
Generally, do not use raw threads. They are a building block for higher level
abstractions like task queues, pipelines, futures... which lend themselves to
higher level algorithms like your sorts, filters, find, map, reduce,
map_reduce...

C++17 gets a lot of this, with much of the algorithms header getting
parallel/vectorized versions. But it doesn't have a work-stealing task queue,
or something like a future's then() to pass the value along when it
completes. Boost has a version now that does, though, and this lets you have
the next function called when one is done instead of waiting and blocking.

But if you spend your time in the thread zone you are probably putting too
much thought into threading and not the problem you are solving. Plus, using a
task queue can yield really good results for lower level algorithms.

------
andreasgonewild
There is no need to use raw pthreads in modern C++, it comes equipped with a
much more convenient and capable standard library module.

I prefer working with channels, which are still missing from the library; but
they are trivial to add:

[https://github.com/andreas-gone-wild/blog/blob/master/diy_cpp_chan.md](https://github.com/andreas-gone-wild/blog/blob/master/diy_cpp_chan.md)

~~~
ridiculous_fish
Unfortunately std::thread is not as capable as pthreads. For example,
std::thread has no facilities for controlling the stack size or signal mask.

Whenever I use channels, I find I need them to synchronize with each other.
For example, I have a Go data structure with a write channel and a separate
read channel; how do I avoid a read-after-write hazard?

The simplest solution is to make them the same channel that ships closures, to
ensure requests are processed in-order. But then why not just make all
channels ship closures, like libdispatch...

~~~
shaklee3
You can get the underlying pthread pointer if you want.

~~~
to3m
pthread_sigmask always operates on the current thread.

~~~
ridiculous_fish
The usual technique is to set the pthread sigmask and then spawn the thread,
to avoid the obvious race. But it's risky to assume that `std::thread` won't
manipulate the sigmask by itself.

------
hellofunk
Unless you use this excellent library:

[https://actor-framework.org/](https://actor-framework.org/)

~~~
lookACamel
Note that this blog post was written in 2006.

 _The other major model for multi-threading is known as message-passing
multiprocessing. Unless you're familiar with the Occam or Erlang programming
languages, you might not have encountered this model for concurrency before._

 _Two popular variants of the message-passing model are "Communicating
Sequential Processes" and the "Actor model"._

 _Why would you want to learn about this alternative model, when Pthreads
have clearly won the battle for the hearts and minds of the programming
public? Well, besides the sheer joy of learning something new, you might
develop a different way of looking at problems that'll help you to make
better use of the tools that you do use regularly. In addition, as I'll
explain in Part II of this rant, there's good reason to believe that
message-passing concurrency is going to be coming back in a big way in the
near future._

------
kuwze
The FoundationDB people invented Flow to easily create multithreaded
distributed systems[0].

[0][https://gist.github.com/Preetam/98e80cd17ecb8748c72b](https://gist.github.com/Preetam/98e80cd17ecb8748c72b)

------
moderation
Good article by the creator [0] on how Lyft's relatively new L4-7 proxy
called Envoy [1] handles multi-threading:
[https://medium.com/@mattklein123/envoy-threading-model-a8d44b922310](https://medium.com/@mattklein123/envoy-threading-model-a8d44b922310).
There are now more people at Google contributing to Envoy than Lyft.

0\. [https://twitter.com/mattklein123/](https://twitter.com/mattklein123/)

1\. [https://lyft.github.io/envoy/](https://lyft.github.io/envoy/)

------
noobermin
So to check, I googled the C++11 threads memory model, and it[0] looks like
his main complaint isn't helped by C++11 threads, as they essentially allow
shared state too. They are also a thin veneer over the OS threads - for Unix
types, pthreads I imagine.

[0]
[http://en.cppreference.com/w/cpp/language/memory_model](http://en.cppreference.com/w/cpp/language/memory_model)

~~~
fnj
Honest question: if you aren't going to share data between/among threads, why
use threads at all? Why not just fork-and-exec?

~~~
thristian
Sharing (immutable) data between threads is fine, you just don't want to share
(mutable) state or else you run the risk of threads stomping on each other.
The ideal scenario is map/reduce: you take some large chunk of data, divide it
up into smaller chunks, spin up a thread to process each smaller chunk and
produce a result. Once the threads have all completed and the results are no
longer changing, the main thread can pick them up and combine them into a
single result.

It's possible to do that with fork-and-exec, but without a shared address
space, marshalling costs make it expensive.
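The map/reduce shape described above can be sketched as a simple fork/join
(the function name is made up, and `nthreads` is assumed to be positive):

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Each thread reads its own slice of the immutable input and writes only
// its own slot of the partial-results vector, so no locking is needed.
long parallel_sum(const std::vector<long>& data, int nthreads) {
    std::vector<long> partial(nthreads, 0);
    std::vector<std::thread> workers;
    const size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end = std::min(begin + chunk, data.size());
            for (size_t i = begin; i < end; ++i)
                partial[t] += data[i];  // this thread owns partial[t]
        });
    }
    for (auto& w : workers) w.join();
    // After join() the results no longer change; combine them serially.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Doing the same with fork-and-exec would mean serializing each chunk and the
results across process boundaries, which is the marshalling cost mentioned
above.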

------
paulddraper
But...like so many other abstractions created by people with puny hardware,
shared memory parallelism is very fast.

------
mbessey
I really need to do an update of that article, with C++ std::thread,
libDispatch, and some of the other stuff that's come along in the intervening
11 years. I still spend a lot of time tracking down other people's threading
bugs (in an entirely-different codebase, these days) though.

------
chris_wot
I think one thing to learn is how monitors work.

