Hacker News new | past | comments | ask | show | jobs | submit login
Hell is a multi-threaded C++ program (2006) (codemines.blogspot.com)
223 points by mpweiher on Sept 2, 2017 | hide | past | web | favorite | 122 comments

I've been writing massively threaded C++ code since the 1990's, so I can comment with some serious experience here. My code used to run on 256CPU SGI's and had high cpu utilization across all of them.

When designing C++ code for threading, you have to keep a few things in mind, but it's not particularly harder than on other languages with thread support. You generally want to create re-entrant functions which pass all of their state as input, and return all results as output, without modifying ant kind of global state. You can run any number of these in parallel without thread safety issues, or locks. When using multiple threads with shared data, you have to protect it with locks, but I generally use a single thread to access data of this sort which produces work for other threads to do, which can be queued or distributed any other way. Producer/consumer queues show up a lot in good designs, and your single-threaded choke points are usually a producer or a consumer, which allows thread fan-in or fan-out. (As a side note, Go really got this right with channels).

What leads to problems in threaded C++ programs is unprotected, shared state. Some C++ design patterns make life particularly hard, such as singletons, which are just a funny name for global variables. C++ gives you lots of ways to make something global accidentally - static member variables, arguments by reference, for instance.

True C++ hell is taking a single threaded C++ program and making it multi-threaded after the fact, but that's hell in any language which allows you to have globals of any sort and doesn't perform function calls within some kind of closure.

The article in this post is being a little reductionist. You don't need to pick a threading model - some of your threads can pass messages, others can use shared data, all in the same program. The lack of thread local storage is not an issue, because you partition the data so that no two threads are working on the same thing at the same time.

I agree with all your points.

I've written a lot of multithreaded C++, I did quite a lot back starting about 1994 for around 10+ years, a lot of middle-tier application code on Windows NT in particular (back in the day when it was all 3-tier architectures - now we'd call them services) - it's totally fine if you know what you are doing.

Work is usually organised in mutex-protected queues, worker threads consume from these queues, results placed a protected data structure, and receivers poll and sleep, waiting for the result.

Other tricks to remember are to establish a hierarchy of mutexes - if you have to take several locks, they must be done in order, and unlocked in reverse order, this should guarantee an absence of deadlocks. A second trick - a way to guarantee locks are released and bugs of non-release of mutexes do not occur, as well as the correct order of releases, is to strictly follow an RAII pattern, where destructors of stack-based lock objects, unlock your mutexes as you exit stack frames.

Of course, in later periods, you started to see formal recognition of these design patterns, in Java and C# libraries which had WorkerPools, Java and it's lock() primitive, but these design patterns were prevalent in my code at the time, because it was the only obvious and simple way to use multi-threading in a conceptually simple manner. KISS...

Nothing particularly hellish about any of it - but I remember it was not a development task for all developers, and without common libraries in the period (this is pre-STL), you had to work a lot of it out for oneself.

I do remember in the period you would get grandiose commentary from some public developers who would proclaim such things as, "it is impossible to have confidence in/write a bug-free threaded program."

I always felt that said more about the developer than multithreading though.

The D programming language has a 'pure' annotation which can be applied to functions. This enables the compiler checking that the function does not read or write any global mutable state, including checking the functions called.

After using it for a while, it's amazing how much global variables tend to creep unannounced into code :-)

> The D programming language has a 'pure' annotation which can be applied to functions. This enables the compiler checking that the function does not read or write any global mutable state, including checking the functions called.

Are foreign calls assumed to be non-pure? Can they be marked pure in the case that the foreign (likely C) function doesn't reference global mutable state?

> Are foreign calls assumed to be non-pure?


> Can they be marked pure in the case that the foreign (likely C) function doesn't reference global mutable state?

Yes. Here's an example:


It's true that the C Standard does not actually guarantee that they don't access mutable global state, but in practice they don't. We're not aware of one that does, and don't know why anyone would write one that does.

great feature, sir. thank you!

You're welcome! It's a 'little' feature with a surprisingly large impact.

> True C++ hell is taking a single threaded C++ program and making it multi-threaded after the fact, but that's hell in any language which allows you to have globals of any sort and doesn't perform function calls within some kind of closure.

That's not true. In Rust you can have globals, but you have to declare the synchronization semantics for every global when you do. (That is, they have to be read only, or thread local, or atomic, or protected by a lock.)

This property makes it very easy to take a single-threaded Rust program and make it multithreaded later, and in fact this is how most Rust programs end up being parallelized in my experience.

> you have to declare the synchronization semantics for every global when you do. (That is, they have to be read only, or thread local, or atomic, or protected by a lock.)

That doesn't sound like it helps you determine which one of those is appropriate. So it seems rather like a system that will lead to unexpected bottlenecks when you parallelize. (Somebody deep inside the stack arbitrarily decided locking was the most appropriate - now you've got a contended lock or a potential lock ordering bug later on.)

Granted it does seem like it allows for more reasonable defaults than a default-unsafety policy.

This is usually not a problem because using globals and taking locks unnecessarily is unidiomatic Rust. You have to go out of your way to write more code when you use mutexes, so most people don't unless they truly need to.

We have experience with this. For a long time, WebRender rasterized glyphs and other resources sequentially. Switching it to parallelize was painless: Glenn just added a parallel for loop from Rayon and we got speedups.

> a system that will lead to unexpected bottlenecks when you parallelize

This is a funny comment. You are implying that performance is of higher value than correctness. Speed without correctness is dangerous, and leads to significant bugs, especially when you're talking about concurrent modification of state across threads.

I'll take correct and need to improve performance over incorrect and fast where the cost of tracking down incorrect concurrent code is so extremely high, let alone dangerous for actual data being stored.

> This is a funny comment. You are implying that performance is of higher value than correctness.

Of course it is. Tony Hoare noticed it as far back as 1993: given a safe program and a fast program, people would always choose the fast one. Correctness in a mathematical sense does not always map to correctness in the business sense; it's sometimes much more cost-effective to reboot a computer every day and not free any memory than try to be memory-correct which will cost at least a few thousand dollars more in employee time.

In a case where it just crashes, that's probably a reasonable tradeoff.

What really bothers me though, is that you might actually store incorrect data somewhere. That could have hugely negative implications for the business.

Let's not equivocate here. Threads don't exist merely as a fun exercise to introduce more interesting bugs. They are for performance. If you don't care about that I may advise to stay away from threads.

> This is a funny comment.

Funny would be an understatement.

So much nicer to change than the type and chase compiler errors though. Reading the code and just being real smart is tough to get right.

Obviously it is possible to write threaded C++ programs. And there are experienced people like you who can do it well, with enough discipline, good design, and so on. But I think the point of the article is that for most programmers it is very easy to make mistakes and shoot themselves in the foot with threads and shared memory. It could be even something like using a 3rd party library where it's initialization context can't be shared but a mistake was made and it did end up being shared by accident, say passed to some workers threads.

> What leads to problems in threaded C++ programs is unprotected, shared state.

That's one of the main problem, but I don't think people start with saying "we'll just have this unprotected shared state and hope for the best", that shared state ends up being shared by accident or as a bug. I've seen enough of those and they not fun to debug (hardware environment is slightly different, say cache sizes are bit off, to make it more likely to happen at customer's site for example, on Wednesday evening at 9pm but never during QA testing).

Other things I've seen bugs in is mixing non-blocking (select / epoll / etc) based callbacks with threads. Pretty easy to get tangled there. Throw signal handlers in and now it is very easy to end up with a spaghetti mess.

Even worse, there is a difference between "we'll start with a clean, sane threaded design from the start" vs "we'll add a bit of threading here in this corner for extra performance". That second case is much worse and can result in subtle and tricky bugs. Sometimes it is not easy to determine if the code is re-entrant in a large code-base.

Interestingly and kind of tongue in cheek someone (I think Joe Armstrong, but I maybe wrong) said to try and think about your programming environment as an operating system. It is 2017, most sane operating systems have processes with isolated heaps, startup supervision (so services can be started / stopped as groups), preemption based concurrency (processes don't have to explicitly yield) and so on. So everyone agrees that's sane and normal and say putting their latest production release on a Windows 3.1 would not be a good idea. Why do we then do it with our programming environment? A bunch of C++ threads sharing memory are bit like that Windows 3.1 environment where the word processor crashes because the calculator or a game overwrote its memory.

> So everyone agrees that's sane and normal and say putting their latest production release on a Windows 3.1 would not be a good idea. Why do we then do it with our programming environment?

Because people, convincing devs to adopt new ways is an uphill battle unless they have tried it themselves.

Joe Duffy's keynote at Rust Conf just went live, and one of the reasons of Midori's adoption failure was convincing Windows kernel devs of better ways of programming, in spite of them having Midori running in front of them.

I think the OP's attitude is still correct. Threaded programs are hard. It takes experience, discipline, and good design. There is no band-aid for that. The technology is helping with things like local storage, built in actor models, borrow checkers, etc. but those are only helping the design aspect. It still takes discipline, like knowing to not mix up async callbacks with/across threads. You are basically writing your own little kernel, and writing a kernel is hard, but once you know the rules it can be done.

I think rdtsc's point was that writing your own kernel in the present day is a terrible mistake unless you genuinely need something not available otherwise.

> So everyone agrees that's sane and normal and say putting their latest production release on a Windows 3.1 would not be a good idea.

And yet unikernels are here and used.

thank you

I did it for many years. It is not for the weak. The hellgrind plug-in for valgrind saved my ass big time here and there. Without threads a program needs to have zero bugs, with threads you need a negative bug count. You will get that joke when you try to add threads to a normally bug free program and suddenly you discover the rest of the real bugs you didn't know yet.

> lack of thread local storage is not an issue

We've had it since C++11. http://en.cppreference.com/w/cpp/language/storage_duration

I think android ndk still barfs on c++11 tls and the earlier language extensions that came before it like __thread. It's not uncommon to be working with language implementations that are missing things like that. (I remember MS's version of it produced binaries that would not run on XP - maybe less important today but I had a run-in with that circa 2009.)

I don't know about TLS, but NDK is already using clang 5.0 and my code is C++14.

Yes I know they have recent clang. But does tls work? Last I knew, no.

Note clang is not the only variable here. You need support from the dynamic linker.

Edit: some googling suggests they may have added this support last year with some caveats.

Really? That's disappointing. Good old bad old NDK strikes again. I really don't understand why Google doesn't seem interested in fixing it.

They are fixing it, but NDK team seems to be a very small team, from the commits and gitHub issues.


You can already use clang 5.0 on the NDK, so even early C++17 support is possible. I don't know about TLS support.

The NDK is only there for high performance graphics (vulkan), realtime áudio, SIMD and bringing native libraries from other platforms.

So apparently their motivation is that devs should touch the NDK as little as possible.

Yeah, it just seems like they have the resources to support a bigger NDK team if they cared to do so.

There plenty of apps that could really use a well-supported NDK. Games and audio apps for a start.

Google released ARCore last week.

"ARCore works with Java/OpenGL, Unity and Unreal and focuses on three things:"


Which makes quite clear that even in AR, the role of the NDK is to support Java, Unity and Unreal applications, not to be used alone.

I am fine with it, given the security issues with the native code so we should actually minimize its use, I just would like they would provide better tooling to call framework APIs instead of forcing everyone to write JNI wrappers by themselves.

As a counter-example, they also just announced AAudio, and that's a C API: https://developer.android.com/ndk/guides/audio/aaudio/aaudio...

So they are still using C as a lingua franca for some low-level parts of the system. But without putting much effort into improving the C toolchain.

I'm not very familiar with Unreal but I understand it's a C++ library, so it seems like that would benefit from a solid C++ toolchain too.

The thing I see glanced over a lot in multithreading is the actual implimentation of messaging people are talking about. What is being used to actually just 'send' messages to the main thread while the sender thread continues on? I know qt has slots and signals, but I haven't found a good native method that does something like that if I wanted to for example use an std::future and also receive messages back that can update a gui.

If you're using a GUI toolkit like Qt or WxWidgets, then you have to use whatever event system the toolkit provides, since it controls the main thread's event loop. For WxWidgets for instance the doc is here: http://docs.wxwidgets.org/trunk/overview_thread.html

I worked on a C++ trading app in the past and our approach was "services" (actors) talking via message passing. Basically a bunch of long-lived threads, each with a queue (nowadays that could be something like boost::lockfree::queue [1]), running an event loop. Message passing worked by getting a pointer to the target service's queue via a service registry (singleton initialized at startup, you could also keep a copy per service though), then pushing a message to it. If you needed to do something with the result, then that would come back via another message sent via the same system.

It does mean if there's back-and-forth between services the code would be harder to follow as you'd have to read through a few event handlers to get the whole flow, but with well-designed service boundaries it worked quite well. I don't think futures would have been useful in our context as blocking any service for an extended time would have been very bad.

[1] If you don't want boost, you can wrap std::queue yourself, this is a simple example where you can replace the boost mutex etc. with std ones: https://www.justsoftwaresolutions.co.uk/threading/implementi...

There isn't a good, native method. I always begin by implementing a thread safe producer/consumer queue with non-blocking peek, which checks to see if there's something at the head of the queue. You can use something like this for other threads to block on, or to poll, etc. It's very useful.

Futures are somewhat useful if you spawn off a thread to do a single task and wait for it to finish, however, on many platforms, the thread creation overhead is high enough that you don't want to do that, you want to have existing worker threads being given work to do.

Once again, this is something Go does really well. Go routines are almost free, so you can use them with a futures model very easily.

I'm not 100% sure what's meant by "native" method in this context, but if you're programming for macOS/iOS, try Grand Central Dispatch. It's a very nice task parallelism library with some extremely smart design choices.

If you use dpdk it has some of the fastest (if not the fastest) ring implementation available.

In case anyone else is wondering a "ring" is a queue data structure which provides a different set of trade-offs. The documentation specifically compares it to a linked list:

    The advantages of this data structure over a linked list queue are as follows:

    * Faster; only requires a single Compare-And-Swap instruction of sizeof(void *) instead of several double-Compare-And-Swap instructions.
    * Simpler than a full lockless queue.
    * Adapted to bulk enqueue/dequeue operations. As pointers  are stored in a table, a dequeue of several objects will not produce as many cache misses as in a linked queue. Also, a bulk dequeue of many objects does not cost more than a dequeue of a simple object.

    The disadvantages:

    * Size is fixed
    * Having many rings costs more in terms of memory than a linked list queue. An empty ring contains at least N pointers.


technique wise, it's not too hard, but it only takes one little screw up to bring the whole thing crashing down. Back in an older version of MS C++ compiler/library their std::string had this bug where strings weren't thread safe as they relied on global state. That took a while to find.

So, functional programming then?

I'm tired of thinking this in my head but

- skimming through doug lea concurrency articles: FP

- thread safe C: pure functions, shared nothing: FP

- comment above: FP

at what point will people start to call it what it is ?

Because thinking that functional programming is going to solve all your problems is simple, easy, and wrong. Anyone who has written high performance software knows this. First, garbage collection without locks is extremely exotic, so likely all your memory allocations and deallocations will lock and with typical functional programming you will be doing a lot of them.

Then you can get into what you mean by FP - if you share nothing then your communication is done by copying. This is not a magic bullet and isn't an option for many scenarios. If you do shared state for join parallelism you can cover other scenarios, but now you are sharing data.

Atomics are very fast and work very well when they line up with the problem at hand. Then again, you are creating some sort of data structure that is made to have its state shared.

If the problem was so easy to solve, it wouldn't be nearly as much a problem. Handwaving with 'just use FP' is naive and is more of a way for people to feel that they have the answer should anyone ask the question, but reality will quickly catch up.

What about immutability and shared structure. You can stream your computation in a way to avoid copying, avoid locking (unless you have to synchronize on a change). Persistent DS are rarely mentioned, I only know Demaine's course.

> What about immutability and shared structure > You can stream your computation in a way to avoid copying

Where is the synchronization in this scenario? You either have to decide how to split up the read only memory to different threads (fork join) or you have one thread make copies of pieces and 'send' them to other threads somehow. Arguably these are the same thing. This is one technique, but again, it doesn't cover every scenario.

I don't know if calling it 'immutability' changes anything.

> avoid locking (unless you have to synchronize on a change)

Synchronizing on changes is the whole problem, you can't just hand wave it away as if it is a niche scenario. Anyone can create a program that has threads read memory and do computations. If you can modify the memory in place with no overlap between threads, even better. These however are the real niche scenarios, because the threads eventually need to do something with their results whether it's sending to video memory, writing to disk, or preparing data for another iteration or state in the pipeline. Then you have synchronization and that's the whole issue.

What I meant is that a FP language could go its own way and let the side effects happen.. well on the side. Sure you will have to communicate computations to other systems but you can have these part clearly segregated and synchronized.

I must confess, I have no experience there, it's just years of reading about and writing functional code and seeing a potential trail here.

How exactly does functional programming solve this problem though? How does it differ from imperative programming?

Yeah, immutability was something that got kind of glossed over in that article (I'm the author). And Functional Programming was a lot less mainstream then. Maybe I should do an update.

Well you're excused, 2006 was a really different world. FP was still alien and multithreading probably a lot less mainstream.

That said if you have new thoughts on the subject, please write them :)

Yup, basically.

It's not exactly functional programming. It can be a traditional OOP, just structured in independent blocks, and you execute each block on different threads. But the block internally can have as much shared state as it wants.

You get the simplicity and familiarity of OOP and most of the benefits of multi-threading.

I think the key words are "reentrant function" which in the context means atomic, pure functions. He didn't say something about OO, like DI or multiple inheritance in a some clean way, or something about classes or structs. This is more FP than not. By far.

I am not as experienced as you are and I don't even know the term re-entrant functions but accidentally this is how I am now designing programs, mostly writing self contained functions that don't alter any global state. This approach has made my programming life hell because I am finding it incredibly hard to get rid of the old shared variables habit. What I have experienced is that shared data access is more or less inevitable and I design to restrict the shared data access at the database.

Re-entrant is basically what you said, self-contained functions that don't depend on external state. The "re-entrant" part just means that, basically, the function can be "re-entered." If it gets interrupted in the middle of execution, it will still produce the same output no matter what, no matter how long it sleeps for.

No, that's not what re-entrant means. A re-entrant function can be called multiple times at the same time.

That could be from different threads or from an interrupt handler. It can even come up in single-threaded code: you call function A() which internally calls B() which results in a nested call to A(). If A is re-entrant that's safe.

a lot of the time you can restrict shared data access to at least being read-only, which helps a lot.

In my limited experience of the single- to multi-threading circle of hell, the function calls performed outside of an effective form of closure are a bigger problem than globals / statics / singletons. Whenever an object is accessed concurrently by multiple threads, its state becomes a potential problem. Some programs may contain so many pathological dependencies that they cannot be multithreaded without introducing so much mutual exclusion that they cannot perform better than the single-threaded original, even if the task they implement is a candidate for multithreading. Putting those cases aside, the hardest part of conversion is probably that of understanding the existing code well enough to be able to find and analyze all the operations that cut across the thread boundaries created by the conversion, which are aspects of the dynamic behavior of the program.

I agree with pretty much everything you say!

I’ve been taking single threaded code in JavaScriptCore and making it thread-safe for a while now and it’s challenging but I wouldn’t call it hell. I used to think this was harder than it turns out to be.

Sounds a lot like Erlang BEAM

this comment reeks of experience, expertise and pragmatism. thank you! -- programming for 30 years

Got a repo with some example code?

I've found message passing and thinking in transactions are pretty good architectural patterns for maintaining ones sanity. Message queues with spinlocks for performance critical code (mutexes are slow).

The biggest sin generally is to disregard the overhead caused by thread management and thinking more threads makes the sofrware run faster.

I've seen people try to parallellize a sequential program by just spawning mutexes everywhere and then thinking now any number of threads can do whatever they please. Of course when tested the system was quite a bit slower as when it ran on a single thread (the system was quite large so quite a lot of work was needed before reaching this state).

> spinlocks for performance critical code (mutexes are slow).

Gah, no. Userspace spinlocks are the deepest of voodoo and something to be used only by people who know exactly what they are doing and would have no difficulty writing "traditional" threaded code in a C/C++ environment. Among other problems: what happens when the thread holding the spinlock gets preempted and something else runs on the core? How can you prevent that from happening, and how does that collision probability scale with thread count and lock behavior?

Traditional locking (e.g. pthread mutexes, windows CriticalSections) in a shared memory environment can be done with atomic operations only for the uncontended case, and will fall back to the kernel to provide blocking/wakeup in a clean and scalable way. Use that. Don't go further unless you're trying to do full-system optimization on known hardware and have a team full of benchmark analysis experts to support the effort.

> Gah, no. Userspace spinlocks are the deepest of voodoo and something to be used only by people who know exactly what they are doing and would have no difficulty writing "traditional" threaded code in a C/C++ environment.

If you ever used pthread_mutex with glibc then you use spinlocks without knowing it. The implementation spins for some time before going for a full kernel mutex.

There's a bit of a terminology confusion here. When they say "spinlock", most people mean a pure spinlock that spins forever until it succeeds.

"Mutex" on the other hand might have a fast-path that spins a few times before inserting the thread onto a wait list.

I think the warning is directed against rolling your own spinlocks without careful consideration, including a realistic evaluation of whether you know every issue involved in doing so.

Sure, and C library authors qualify. App developers don't.

An uncontended mutex is just as fast as a spinlock (on modern operating systems using a futex). It takes about 25 nanoseconds to lock it.

The difference is when there's contention. A spinlock will burn CPU cycles but a mutex will yield to another thread or process (with some context switch overhead).

A spinlock should only be used when you know you're going to get it in the next microsecond or so. Or in kernel space when you don't have other options (e.g. interrupt handler). Anything else is just burning CPU cycles for nothing.

Mutex and condition variables (emphasis on the latter) are much more useful than spinlocks and atomics for general multithreaded programming.

You have to be careful with those things - for instance, you have to special case whether you'r on a single CPU or multiple CPU system, because a spin lock will block forever in a non-preemptive context, such as the kernel.

Outside of hard realtime code, there's zero reason to use spin locks.

In hard realtime code, a FIFO mutex gives you bounded wait time because you have a fixed number of threads.

Interesting read: http://www2.rdrop.com/~paulmck/realtime/SMPembedded.2006.10....

> Outside of hard realtime code, there's zero reason to use spin locks.

That is just not true. You _must_ use them in the case where the kernel is non-preemptable. Additionally, if the locked resource is held for a very short time, a spin lock is likely a more efficient choice than a traditional mutex.

> Outside of hard realtime code, there's zero reason to use spin locks.

On some common architectures, releasing a spin lock is cheaper than releasing a mutex.

On all architectures, releasing a mutex requires at least a branch (to see if you need to wake up sleeping threads) that you don’t need with a pure spinlock.

But if you don’t have a guarantee the lock owner won’t be preempted, well, spinning for a whole timeslot is quite a bit more expensive…

Spin locks are a tough sell in a preemptable context. Say you have two processes that share some memory location. They both briefly access it for a very short time, so you protect it with a spin lock. Well, what happens in the case when one of the threads is preempted while holding the lock? The other thread would try to aquire it, and just spin for its entire timeslot. No bueno. When you call spin_lock() in the kernel it actually disables preemption until you call spin_unlock() to avoid this. You can't disable preemption from userspace. There might be a use for a spin lock if you have a process that is SCHED_RR, but I haven't seen it.

You are not wrong in general but note that there are ways to disable preemption in userspace in practice, be it SCHED_FIFO[1] or isolcpu with cpu pinning.

[1] you better be careful with spinlocks and priorities here as you can livelock forever.

Can't upvote this enough. If you are using spinlocks because "mutexes are slow", please reconsider since the contended case makes much more sense with mutexes than spinlock, and the uncontended case is exactly the same.

And a mutex that you never take is cheaper than one that you do. Taking the (f)mutex is fast but any context switch will be brutal if you're really using threading to increase your performance.

That said I'm used to situations where we're pinning thread affinity to specific cores and really trying to squeeze out what you can from fixed resources.

Pinning threads to cores with affinity is another technique that should be used with caution. Under normal circumstances the kernel will run the same thread on the same core as much as possible because CPU migration is expensive.

Setting CPU affinity will ensure that you always get the same core, but it might not increase performance and could adversely affect other parts of the system.

CPU affinity is a good fit for continuously running things like audio processing or game physics or similar. It's not good when threads are blocked or react to external events.

In most cases it's just unnecessary because the kernel is pretty good in keeping threads on cores anyway.

> mutexes are slow

Be careful with that. First off, what people refer to as "mutex" is usually a spinlock that falls back to a kernel wait queue when the spin count is exceeded. There are even adaptive mutexes that figure out at runtime how long the lock is typically held and base their spin count limit on that.

Secondly, busy-waiting is often worse than a single slow program, because you actively slow down all of the other running programs.

Is there a good message passing library for C++? Sounds like something for the stl to include.

I know coworkers that speak well of ZeroMQ:


> Is there a good message passing library for C++? Sounds like something for the stl to include.

Qt does it with signals-slots. What I generally do is that I have a queue of std::function and just pass lambdas with the capture being copied.

Boost has one, but not the stl

Using spinlocks can cause livelocks under certain conditions due to priority inversion.

When I think about hard problems, I like to look at projects I feel handle these problems well. Having worked in the Chromium code base for some time now, I really appreciate their approach to threading. So much so, I really wish they would distribute their base/threading module as its own project.


> While standardization and support for [threaded] APIs has come a long way, their use is still predominantly restricted to system programmers as opposed to application programmers. One of the reasons for this is that APIs such as Pthreads are considered to be low-level primitives. Conventional wisdom indicates that a large class of applications can be efficiently supported by higher level constructs (or directives) which rid the programmer of the mechanics of manipulating threads. Such directive-based languages have existed for a long time, but only recently have standardization efforts succeeded in the form of OpenMP. OpenMP is an API that can be used with FORTRAN, C, and C++ for programming shared address space machines. OpenMP directives provide support for concurrency, synchronization, and data handling while obviating the need for explicitly setting up mutexes, condition variables, data scope, and initialization.

From Introduction to Parallel Computing

OpenMP: http://www.openmp.org/

Yes, every time I hear people talking about how evil or difficult threads are, I just think back to my frequent use of OpenMP. It is really quite easy to take a single-threaded program and make it multithreaded with OMP -- so long as the jobs only read, and do not write shared state.

Usually I will write a program that does the following:

1. Initialize global read-only data structures

2. Parallelize jobs across STDIN lines (or whatever)

3. Output results within a #pragma omp critical block

It works wonders, and is literally 5 additional lines of code to make a single-threaded program multithreaded with OMP. But only for some types of program. The concept is very similar to what GNU parallel does.

In short, threads are not the problem per se, they are just the wrong tool for the job if you have a multiple-readers multiple-writers situation. Message passing, databases, or something else are more appropriate in those cases. But you will pry my read only shared memory space out of my cold, dead hands.

Using OMP is incredibly easy but in my experience it sometimes leaves up to 20% performance on the table vs doing and tuning everything manually.

Since OMP Tasks it is kinda useful for non-numerical non-trivial applications as well, though still kinda... odd. Odd but neat.

Pthreads seems to be getting a bad name here, but compared to other low-level threading APIs it's excellent. It has a proper join operation, which is weirdly lacking in some other libraries I've used, and it has a good choice of key synchronisation primitive: condition variables. Other choices are possible, like semaphores, but those are fiddly to use for some applications. Building semaphores on top of CVs is pretty easy, building CVs on top of semaphores is very hard. Likewise Windows events.

Higher-level systems have primitives like message queues, thread pools, and channels. Those are all great but they can't always be used to build the exact system you need without a lot of overhead.

C and C++ are all about emphasising speed and flexibility over safety. Whether you think that's a good or bad approach, it is what it is, and pthreads is the right design for that approach. Once you come up with the high-level multithreading abstraction of your dreams, what else would you rather implement it in than pthreads?

The one thing that was missing from pthreads was atomic operations for modern lockless data structures, so it's great that those have finally been added to the standard library in both C and C++.

Generally, do not use raw threads. They are a building block for higher level abstractions like task queues, pipelines, futures... which lend themselves to higher level algorithms like your sorts, filters, find, map, reduce, map_reduce...

C++17 gets a lot of this with much of the algorithms header getting parallel/vectorized versions. But it doesn't have a task stealing work queue or something like then to pass the value when it completes. Boost has a version now that does though and this lets you have the next function called when done instead of waiting and blocking.

But if you spend your time in the thread zone you are probably putting too much thought into threading and not the problem you are solving. Plus, using a task queue can yield really good results for lower level algorithms.

There is no need to use raw pthreads in modern C++, it comes equipped with a much more convenient and capable standard library module.

I prefer working with channels, which are still missing from the library; but they are trivial to add:


Unfortunately std::thread is not as capable as pthreads. For example, std::thread has no facilities for controlling the stack size or signal mask.

Whenever I use channels, I find I need them to synchronize with each other. For example, I have a Go data structure with a write channel and a separate read channel; how do I avoid a read-after-write hazard?

The simplest solution is to make them the same channel that ships closures, to ensure requests are processed in-order. But then why not just make all channels ship closures, like libdispatch...

Interesting point about the stack size!

As for the signal mask, I'm pretty sure you can just set that using pthreads, and it will apply to the current thread. This is certainly true for the pthreads thread name, which is as deeply as I've investigated myself... the C++ threads library isn't magic. It's just a wrapper round what you'd expect.

You can get the underlying pthread pointer if you want.

Unfortunately pthread attributes need to be set at creation time, so by the time you have a pthread pointer it's too late.

pthread_sigmask always operates on the current thread.

The usual technique is to set the pthread sigmask and then spawn the thread, to avoid the obvious race. But it's risky to assume that `std::thread` won't manipulate the sigmask by itself.

Channels like the one in the link are pretty easy to create. However that's more like Javas BlockingQueue than like Go channels. The power of Gos channels is in "select", which enables a lot more sophisticated synchronization mechansism then just getting data from thread A to B. E.g. waiting for events from multiple sources or cancellation support.

Building channels with select functionality is a lot harder.

An interesting thing is the equivalent of pthreads on windows (WinAPI synchronization primitives) actually provides something which has some similarity with select out of the box: WaitForMultipleEvents, which is also quite powerful. However that is also a still a lot harder to work with than Go's channels, since there are additional sources for errors like HANDLE invalidation, which you can't have in Go where channels are garbage collected references.

What is even more fundamental and useful than select is non-blocking reads (which the implementation I posted supports), once you have that you can write your own select. It's not magic, select has to traverse the list of channels just like any other code. The only reason it feels like magic in Go is that it's the only way to read a channel without blocking.

The non-blocking read parts is also easy, but that's also only one use-case for Go's select. The harder part is the generic "wait until one of the select cases can be taken".

You can't implement this with non-blocking reads alone. If you would, it would be something like busy spinning and iterating through all select cases and trying none-blocking reads until one succeeds. But that's just not efficient.

You need a mechanism instead which also registers a pending select at each event source and unregisters it when done. And the event source (e.g. channel) must wake up one or more select when it's state changes.

Adding a condition-variable per select and keeping a list of pointers in each channel that are pinged when data arrives is the easiest approach I can think of. Haven't tried that one though, I have honestly never run into a use case where I needed select for channels outside of Go.

Select is a kernel level interface. I don't know how it's implemented at the kernel, but there is no technical reason why it can't follow from an interruption to an array index and a file descriptor without ever passing through a list.

Besides, how do you block in user space when there is nothing to read?

Are you saying they are allocating a separate file-descriptor for each channel just to be able to select? I guess you could do that if you wanted to but it's far from the first thing I would try. I remember reading an article on golang about select and having to traverse the list, but I don't really have the motivation to dig deeper right now.

Looping with unblocking reads is plenty fast enough though.

As for waiting on any event without select, adding an optional condition-variable pointer to each channel that is pinged when data arrives is the easiest approach I can think of.

Well, select is well known for not scaling well so AFAIK it may very well use a list. What I'm saying is that kernel code can be much more performant than anything you may create on userland because you have access to the interruptions. And that code would be "magic" in the sense that you really can't replace it with a library.

In fact pinging condition-variables after a channel successfully is read is a much more expensive emulation of that same behavior. I'm curious if the Linux kernel actually exports something like that.


This is good advice and should never have attracted a downvote!

I'm not actually sure that the C++ library does anything that pthreads doesn't - maybe std::lock, perhaps? - but it's more convenient to use (and it's not like pthreads is bad to start with...), and the pieces work well and fit together quite nicely. Some good use of move semantics. unique_lock is neat and you can do stuff like vector<thread>.

The C++ thread library works on Windows, too...

> I'm not actually sure that the C++ library does anything that pthreads doesn't

There’s std::atomic, which didn’t have a standardized equivalent in pthreads. In lieu of that you’d see some ad-hoc combination of GCC builtins, OS-specific functions, and `volatile`, with the result often not guaranteed to work on architectures other than x86 (if at all). Compared to that, std::atomic is both more flexible and easier to get right. The C ‘backport’, stdatomic.h, is also available on most platforms these days...

I was thinking about mostly about atomics, which aren't covered by pthreads; and I can't remember shared mutexes being supported either.

Unless you use this excellent library:


Note that this blog post was written in 2006.

The other major model for multi-threading is known as message-passing multiprocessing. Unless you're familiar with the Occam or Erlang programming languages, you might not have encountered this model for concurrency before.

Two popular variants of the message-passing model are "Communicating Sequential Processes" and the "Actor model".

Why would you want to learn about this alternative model, when Pthreads have clearly won the battle for the hearts and minds of the programming public? Well, besides the sheer joy of learning something new, you might develop a different way of looking at problems, that'll help you top make better use of the tools that you do use regularly. In addition, as I'll explain in Part II of this rant, there's good reason to believe that message-passing concurrency is going to be coming back in a big way in the near future.

The FoundationDB people invented Flow to easily create multithreaded distributed systems[0].


Good article by the creator [0] on how Lyft's relatively new L4-7 proxy called Envoy [1] handles multi-threading at https://medium.com/@mattklein123/envoy-threading-model-a8d44.... There are now more people at Google contributing to Envoy than Lyft.

0. https://twitter.com/mattklein123/

1. https://lyft.github.io/envoy/

So to check, I googled C++11 threads memory model, and it[0] looks like his main complaint isn't helped by C++ 11 threads, as they essentially allow shared state too. They are also a thin veneer over the OS threads, for unix types, pthreads I imagine.

[0] http://en.cppreference.com/w/cpp/language/memory_model

Yeah, I don't know if there's ever going to be a portable way to enforceably "disallow" the sharing of state between threads in C++. For straightforward cases, the SaferCPlusPlus library provides an easy way to "wrap" any object so that access from different threads is safely controlled[1].

[1] shameless plug: https://github.com/duneroadrunner/SaferCPlusPlus#asynchronou...

> Yeah, I don't know if there's ever going to be a portable way to enforceably "disallow" the sharing of state between threads in C++.

Not sharing state at all means that you cannot even do message passing other than through sockets or something like this. At some point you have to share at least a message queue to be able to go from thread A to thread B.

Honest question: If you aren't going to share data between/among threads, why use threads at all? Why not just fork-abd-exec?

Sharing (immutable) data between threads is fine, you just don't want to share (mutable) state or else you run the risk of threads stomping on each other. The ideal scenario is map/reduce: you take some large chunk of data, divide it up into smaller chunks, spin up a thread to process each smaller chunk and produce a result. Once the threads have all completed and the results are no longer changing, the main thread can pick them up and combine them to a single result.

It's possible to do that with fork-and-exec, but without a shared address space, marshalling costs make it expensive.

If it's the same, it's the same, so there's no reason to do one rather than the other. Why not just create a thread?

Because on windows it is much slower.

But...like so many other abstractions created by people with puny hardware, shared memory parallelism is very fast.

I really need to do an update of that article, with C++ std::thread, libDispatch, and some of the other stuff that's come along in the intervening 11 years. I still spend a lot of time tracking down other people's threading bugs (in an entirely-different codebase, these days) though.

I think one thing to learn is how monitors work.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact