When designing C++ code for threading, you have to keep a few things in mind, but it's not particularly harder than in other languages with thread support. You generally want to create re-entrant functions which take all of their state as input and return all results as output, without modifying any kind of global state. You can run any number of these in parallel without thread-safety issues or locks. When multiple threads share data, you have to protect it with locks, but I generally use a single thread to access data of this sort, which produces work for other threads to do, queued or distributed some other way. Producer/consumer queues show up a lot in good designs, and your single-threaded choke points are usually a producer or a consumer, which allows thread fan-in or fan-out. (As a side note, Go really got this right with channels.)
What leads to problems in threaded C++ programs is unprotected, shared state. Some C++ design patterns make life particularly hard, such as singletons, which are just a funny name for global variables. C++ gives you lots of ways to make something global accidentally - static member variables, arguments by reference, for instance.
True C++ hell is taking a single threaded C++ program and making it multi-threaded after the fact, but that's hell in any language which allows you to have globals of any sort and doesn't perform function calls within some kind of closure.
The article in this post is being a little reductionist. You don't need to pick a threading model - some of your threads can pass messages, others can use shared data, all in the same program. The lack of thread local storage is not an issue, because you partition the data so that no two threads are working on the same thing at the same time.
I've written a lot of multithreaded C++, starting around 1994 and continuing for 10+ years, much of it middle-tier application code on Windows NT in particular (back in the day when it was all 3-tier architectures; now we'd call them services). It's totally fine if you know what you are doing.
Work is usually organised in mutex-protected queues: worker threads consume from these queues, results are placed in a protected data structure, and receivers poll and sleep, waiting for the result.
Other tricks to remember: establish a hierarchy of mutexes. If you have to take several locks, they must be taken in order and unlocked in reverse order; this should guarantee an absence of deadlocks. A second trick, a way to guarantee locks are released (avoiding both leaked mutexes and out-of-order releases), is to strictly follow the RAII pattern, where the destructors of stack-based lock objects unlock your mutexes as you exit stack frames.
Of course, in later periods you started to see formal recognition of these design patterns, in the Java and C# libraries which had worker pools, and in Java's lock() primitive, but these patterns were already prevalent in my code at the time, because they were the only obvious way to use multi-threading in a conceptually simple manner. KISS...
Nothing particularly hellish about any of it, but I remember it was not a development task for all developers, and without common libraries in the period (this is pre-STL), you had to work a lot of it out for yourself.
I do remember in the period you would get grandiose commentary from some public developers who would proclaim such things as, "it is impossible to have confidence in/write a bug-free threaded program."
I always felt that said more about the developer than multithreading though.
After using it for a while, it's amazing how much global variables tend to creep unannounced into code :-)
Are foreign calls assumed to be non-pure? Can they be marked pure in the case that the foreign (likely C) function doesn't reference global mutable state?
> Can they be marked pure in the case that the foreign (likely C) function doesn't reference global mutable state?
Yes. Here's an example:
It's true that the C Standard does not actually guarantee that they don't access mutable global state, but in practice they don't. We're not aware of one that does, and don't know why anyone would write one.
That's not true. In Rust you can have globals, but you have to declare the synchronization semantics for every global when you do. (That is, they have to be read only, or thread local, or atomic, or protected by a lock.)
This property makes it very easy to take a single-threaded Rust program and make it multithreaded later, and in fact this is how most Rust programs end up being parallelized in my experience.
That doesn't sound like it helps you determine which one of those is appropriate. So it seems rather like a system that will lead to unexpected bottlenecks when you parallelize. (Somebody deep inside the stack arbitrarily decided locking was the most appropriate - now you've got a contended lock or a potential lock ordering bug later on.)
Granted it does seem like it allows for more reasonable defaults than a default-unsafety policy.
We have experience with this. For a long time, WebRender rasterized glyphs and other resources sequentially. Switching it to parallelize was painless: Glenn just added a parallel for loop from Rayon and we got speedups.
This is a funny comment. You are implying that performance is of higher value than correctness. Speed without correctness is dangerous, and leads to significant bugs, especially when you're talking about concurrent modification of state across threads.
I'll take correct and need to improve performance over incorrect and fast where the cost of tracking down incorrect concurrent code is so extremely high, let alone dangerous for actual data being stored.
Of course it is. Tony Hoare noticed it as far back as 1993: given a safe program and a fast program, people would always choose the fast one. Correctness in a mathematical sense does not always map to correctness in the business sense; it's sometimes much more cost-effective to reboot a computer every day and not free any memory than try to be memory-correct which will cost at least a few thousand dollars more in employee time.
What really bothers me though, is that you might actually store incorrect data somewhere. That could have hugely negative implications for the business.
Funny would be an understatement.
> What leads to problems in threaded C++ programs is unprotected, shared state.
That's one of the main problems, but I don't think people start by saying "we'll just have this unprotected shared state and hope for the best"; that shared state ends up being shared by accident or as a bug. I've seen enough of those, and they're not fun to debug (the hardware environment is slightly different, say the cache sizes are a bit off, making it more likely to happen at the customer's site, on a Wednesday evening at 9pm, but never during QA testing).
Another place I've seen bugs is mixing non-blocking (select / epoll / etc.) callbacks with threads. It's pretty easy to get tangled there. Throw signal handlers in and now it is very easy to end up with a spaghetti mess.
Even worse, there is a difference between "we'll start with a clean, sane threaded design from the start" vs "we'll add a bit of threading here in this corner for extra performance". That second case is much worse and can result in subtle and tricky bugs. Sometimes it is not easy to determine if the code is re-entrant in a large code-base.
Interestingly, and somewhat tongue in cheek, someone (I think Joe Armstrong, but I may be wrong) said to try to think about your programming environment as an operating system. It is 2017; most sane operating systems have processes with isolated heaps, startup supervision (so services can be started / stopped as groups), preemption-based concurrency (processes don't have to explicitly yield) and so on. Everyone agrees that's sane and normal, and would say that putting their latest production release on Windows 3.1 would not be a good idea. Why do we then do it with our programming environment? A bunch of C++ threads sharing memory is a bit like that Windows 3.1 environment, where the word processor crashes because the calculator or a game overwrote its memory.
Because people: convincing devs to adopt new ways is an uphill battle unless they have tried it themselves.
Joe Duffy's keynote at RustConf just went live, and one of the reasons for Midori's adoption failure was the difficulty of convincing Windows kernel devs of better ways of programming, in spite of their having Midori running in front of them.
And yet unikernels are here and used.
We've had it since C++11. http://en.cppreference.com/w/cpp/language/storage_duration
Note clang is not the only variable here. You need support from the dynamic linker.
Edit: some googling suggests they may have added this support last year with some caveats.
You can already use clang 5.0 on the NDK, so even early C++17 support is possible. I don't know about TLS support.
The NDK is only there for high performance graphics (Vulkan), realtime audio, SIMD and bringing native libraries from other platforms.
So apparently their motivation is that devs should touch the NDK as little as possible.
There are plenty of apps that could really use a well-supported NDK. Games and audio apps, for a start.
"ARCore works with Java/OpenGL, Unity and Unreal and focuses on three things:"
Which makes quite clear that even in AR, the role of the NDK is to support Java, Unity and Unreal applications, not to be used alone.
I am fine with it; given the security issues with native code, we should actually minimize its use. I just wish they would provide better tooling to call framework APIs instead of forcing everyone to write JNI wrappers themselves.
So they are still using C as a lingua franca for some low-level parts of the system. But without putting much effort into improving the C toolchain.
I'm not very familiar with Unreal but I understand it's a C++ library, so it seems like that would benefit from a solid C++ toolchain too.
I worked on a C++ trading app in the past and our approach was "services" (actors) talking via message passing. Basically a bunch of long-lived threads, each with a queue (nowadays that could be something like boost::lockfree::queue), running an event loop. Message passing worked by getting a pointer to the target service's queue via a service registry (singleton initialized at startup, though you could also keep a copy per service), then pushing a message to it. If you needed to do something with the result, then that would come back via another message sent via the same system.
It does mean if there's back-and-forth between services the code would be harder to follow as you'd have to read through a few event handlers to get the whole flow, but with well-designed service boundaries it worked quite well. I don't think futures would have been useful in our context as blocking any service for an extended time would have been very bad.
If you don't want Boost, you can wrap std::queue yourself; here is a simple example where you can replace the Boost mutex etc. with std ones: https://www.justsoftwaresolutions.co.uk/threading/implementi...
Futures are somewhat useful if you spawn off a thread to do a single task and wait for it to finish, however, on many platforms, the thread creation overhead is high enough that you don't want to do that, you want to have existing worker threads being given work to do.
Once again, this is something Go does really well. Goroutines are almost free, so you can use them with a futures model very easily.
The advantages of this data structure over a linked list queue are as follows:
* Faster; only requires a single Compare-And-Swap instruction of sizeof(void *) instead of several double-Compare-And-Swap instructions.
* Simpler than a full lockless queue.
* Adapted to bulk enqueue/dequeue operations. As pointers are stored in a table, a dequeue of several objects will not produce as many cache misses as in a linked queue. Also, a bulk dequeue of many objects does not cost more than a dequeue of a simple object.
The disadvantages:
* Size is fixed.
* Having many rings costs more in terms of memory than a linked list queue. An empty ring contains at least N pointers.
- skimming through doug lea concurrency articles: FP
- thread safe C: pure functions, shared nothing: FP
- comment above: FP
at what point will people start to call it what it is?
Then you can get into what you mean by FP - if you share nothing then your communication is done by copying. This is not a magic bullet and isn't an option for many scenarios. If you do shared state for join parallelism you can cover other scenarios, but now you are sharing data.
Atomics are very fast and work very well when they line up with the problem at hand. Then again, you are creating some sort of data structure that is made to have its state shared.
If the problem was so easy to solve, it wouldn't be nearly as much a problem. Handwaving with 'just use FP' is naive and is more of a way for people to feel that they have the answer should anyone ask the question, but reality will quickly catch up.
Where is the synchronization in this scenario? You either have to decide how to split up the read only memory to different threads (fork join) or you have one thread make copies of pieces and 'send' them to other threads somehow. Arguably these are the same thing. This is one technique, but again, it doesn't cover every scenario.
I don't know if calling it 'immutability' changes anything.
> avoid locking (unless you have to synchronize on a change)
Synchronizing on changes is the whole problem, you can't just hand wave it away as if it is a niche scenario. Anyone can create a program that has threads read memory and do computations. If you can modify the memory in place with no overlap between threads, even better. These however are the real niche scenarios, because the threads eventually need to do something with their results whether it's sending to video memory, writing to disk, or preparing data for another iteration or state in the pipeline. Then you have synchronization and that's the whole issue.
I must confess, I have no experience there, it's just years of reading about and writing functional code and seeing a potential trail here.
That said if you have new thoughts on the subject, please write them :)
You get the simplicity and familiarity of OOP and most of the benefits of multi-threading.
That could be from different threads or from an interrupt handler. It can even come up in single-threaded code: you call function A() which internally calls B() which results in a nested call to A(). If A is re-entrant that's safe.
The biggest sin generally is to disregard the overhead caused by thread management, thinking that more threads makes the software run faster.
I've seen people try to parallelize a sequential program by just spawning mutexes everywhere and then thinking any number of threads can do whatever they please. Of course, when tested, the system was quite a bit slower than when it ran on a single thread (the system was quite large, so quite a lot of work was needed before reaching this state).
Gah, no. Userspace spinlocks are the deepest of voodoo and something to be used only by people who know exactly what they are doing and would have no difficulty writing "traditional" threaded code in a C/C++ environment. Among other problems: what happens when the thread holding the spinlock gets preempted and something else runs on the core? How can you prevent that from happening, and how does that collision probability scale with thread count and lock behavior?
Traditional locking (e.g. pthread mutexes, windows CriticalSections) in a shared memory environment can be done with atomic operations only for the uncontended case, and will fall back to the kernel to provide blocking/wakeup in a clean and scalable way. Use that. Don't go further unless you're trying to do full-system optimization on known hardware and have a team full of benchmark analysis experts to support the effort.
If you've ever used pthread_mutex with glibc then you've used spinlocks without knowing it. The implementation spins for some time before falling back to a full kernel mutex.
"Mutex" on the other hand might have a fast-path that spins a few times before inserting the thread onto a wait list.
The difference is when there's contention. A spinlock will burn CPU cycles but a mutex will yield to another thread or process (with some context switch overhead).
A spinlock should only be used when you know you're going to get it in the next microsecond or so. Or in kernel space when you don't have other options (e.g. interrupt handler). Anything else is just burning CPU cycles for nothing.
Mutex and condition variables (emphasis on the latter) are much more useful than spinlocks and atomics for general multithreaded programming.
Outside of hard realtime code, there's zero reason to use spin locks.
Interesting read: http://www2.rdrop.com/~paulmck/realtime/SMPembedded.2006.10....
That is just not true. You _must_ use them in the case where the kernel is non-preemptable. Additionally, if the locked resource is held for a very short time, a spin lock is likely a more efficient choice than a traditional mutex.
On some common architectures, releasing a spin lock is cheaper than releasing a mutex.
But if you don’t have a guarantee the lock owner won’t be preempted, well, spinning for a whole timeslice is quite a bit more expensive…
You'd better be careful with spinlocks and priorities here, as you can livelock forever.
That said I'm used to situations where we're pinning thread affinity to specific cores and really trying to squeeze out what you can from fixed resources.
Setting CPU affinity will ensure that you always get the same core, but it might not increase performance and could adversely affect other parts of the system.
CPU affinity is a good fit for continuously running things like audio processing or game physics or similar. It's not good when threads are blocked or react to external events.
In most cases it's just unnecessary because the kernel is pretty good in keeping threads on cores anyway.
Be careful with that. First off, what people refer to as "mutex" is usually a spinlock that falls back to a kernel wait queue when the spin count is exceeded. There are even adaptive mutexes that figure out at runtime how long the lock is typically held and base their spin count limit on that.
Secondly, busy-waiting is often worse than a single slow program, because you actively slow down all of the other running programs.
Qt does it with signals-slots. What I generally do is that I have a queue of std::function and just pass lambdas with the capture being copied.
From Introduction to Parallel Computing
Usually I will write a program that does the following:
1. Initialize global read-only data structures
2. Parallelize jobs across STDIN lines (or whatever)
3. Output results within a #pragma omp critical block
It works wonders, and is literally 5 additional lines of code to make a single-threaded program multithreaded with OMP. But only for some types of program. The concept is very similar to what GNU parallel does.
In short, threads are not the problem per se, they are just the wrong tool for the job if you have a multiple-readers multiple-writers situation. Message passing, databases, or something else are more appropriate in those cases. But you will pry my read only shared memory space out of my cold, dead hands.
Higher-level systems have primitives like message queues, thread pools, and channels. Those are all great but they can't always be used to build the exact system you need without a lot of overhead.
C and C++ are all about emphasising speed and flexibility over safety. Whether you think that's a good or bad approach, it is what it is, and pthreads is the right design for that approach. Once you come up with the high-level multithreading abstraction of your dreams, what else would you rather implement it in than pthreads?
The one thing that was missing from pthreads was atomic operations for modern lockless data structures, so it's great that those have finally been added to the standard library in both C and C++.
C++17 gets a lot of this, with much of the algorithms header getting parallel/vectorized versions. But it doesn't have a task-stealing work queue, or futures with a then() continuation to pass the value along when one completes. Boost's futures now support then(), which lets you have the next function called when the value is ready instead of waiting and blocking.
But if you spend your time in the thread zone you are probably putting too much thought into threading and not the problem you are solving. Plus, using a task queue can yield really good results for lower level algorithms.
I prefer working with channels, which are still missing from the library; but they are trivial to add:
Whenever I use channels, I find I need them to synchronize with each other. For example, I have a Go data structure with a write channel and a separate read channel; how do I avoid a read-after-write hazard?
The simplest solution is to make them the same channel that ships closures, to ensure requests are processed in-order. But then why not just make all channels ship closures, like libdispatch...
As for the signal mask, I'm pretty sure you can just set that using pthreads, and it will apply to the current thread. This is certainly true for the pthreads thread name, which is as deeply as I've investigated myself... the C++ threads library isn't magic. It's just a wrapper round what you'd expect.
Building channels with select functionality is a lot harder.
An interesting thing is that the equivalent of pthreads on Windows (WinAPI synchronization primitives) actually provides something with some similarity to select out of the box: WaitForMultipleObjects, which is also quite powerful. However, that is still a lot harder to work with than Go's channels, since there are additional sources of errors like HANDLE invalidation, which you can't have in Go, where channels are garbage-collected references.
You can't implement this with non-blocking reads alone. If you did, it would amount to busy-spinning: iterating through all the select cases and attempting non-blocking reads until one succeeds. But that's just not efficient.
You instead need a mechanism which registers a pending select at each event source and unregisters it when done. And the event source (e.g. a channel) must wake up one or more selects when its state changes.
Besides, how do you block in user space when there is nothing to read?
Looping with non-blocking reads is plenty fast enough though.
As for waiting on any event without select, adding an optional condition-variable pointer to each channel that is pinged when data arrives is the easiest approach I can think of.
In fact, pinging condition variables after a channel is successfully read is a much more expensive emulation of that same behavior. I'm curious if the Linux kernel actually exports something like that.
I'm not actually sure that the C++ library does anything that pthreads doesn't - maybe std::lock, perhaps? - but it's more convenient to use (and it's not like pthreads is bad to start with...), and the pieces work well and fit together quite nicely. Some good use of move semantics. unique_lock is neat and you can do stuff like vector<thread>.
The C++ thread library works on Windows, too...
There’s std::atomic, which didn’t have a standardized equivalent in pthreads. In lieu of that you’d see some ad-hoc combination of GCC builtins, OS-specific functions, and `volatile`, with the result often not guaranteed to work on architectures other than x86 (if at all). Compared to that, std::atomic is both more flexible and easier to get right. The C ‘backport’, stdatomic.h, is also available on most platforms these days...
The other major model for multi-threading is known as message-passing multiprocessing. Unless you're familiar with the Occam or Erlang programming languages, you might not have encountered this model for concurrency before.
Two popular variants of the message-passing model are "Communicating Sequential Processes" and the "Actor model".
Why would you want to learn about this alternative model, when Pthreads have clearly won the battle for the hearts and minds of the programming public? Well, besides the sheer joy of learning something new, you might develop a different way of looking at problems that'll help you make better use of the tools you do use regularly. In addition, as I'll explain in Part II of this rant, there's good reason to believe that message-passing concurrency is going to be coming back in a big way in the near future.
 shameless plug: https://github.com/duneroadrunner/SaferCPlusPlus#asynchronou...
Not sharing state at all means that you cannot even do message passing, other than through sockets or something like that. At some point you have to share at least a message queue to get data from thread A to thread B.
It's possible to do that with fork-and-exec, but without a shared address space, marshalling costs make it expensive.