The C11 and C++11 Concurrency Model (2014) [pdf] (kent.ac.uk)
63 points by clemlais on Nov 28, 2015 | 12 comments

The absolute minimum that average users of the C++ memory-ordering system (std::atomic loads and stores, etc. [1]) should know is that the default is 'sequentially consistent' ordering, whereas to get the ultimate performance one should use the 'relaxed' ordering, or alternatively 'release/acquire', as described both in the article and in [1].

[1] http://en.cppreference.com/w/cpp/atomic/memory_order

To be honest, the average C++ programmer should not be bothered with any of this concurrency stuff. The chance that you get it wrong and introduce subtle concurrency bugs that only show up in production is simply too great.

What is needed is a set of concurrency-safe STL containers (e.g. ones that guard against iterator invalidation when another thread inserts into a map), akin to the java.util.concurrent library. That way the tricky concurrency work is separated out from the people who probably do not understand it.

I was with you until you said "concurrently usable containers." While they aren't standard, they do exist in C++ (Intel Threading Building Blocks and whatever Microsoft called its clone of Intel TBB), and they aren't as useful as people believe. C#'s concurrent containers have the same limitations, and to the extent I've used them, Java's concurrent containers have the same limitations.

The limitation I'm thinking of is that it's very common to need to do more than one operation atomically, and common concurrent containers are unable to do that. You can atomically add or remove a single item from a concurrent queue, stack, or hash map, and sometimes that's enough. But if you want to iterate a whole container, say searching for all items that meet some criterion, you need a lock.

To iterate the whole container, I don't need a lock. I just use a container that offers cheap immutable snapshots, e.g. Scala's TrieMap, and I store only immutable items in it. Building similar structures in C++ is much harder, if not impossible, without efficient GC.

I have to say, I have been pleasantly surprised time and time again by how easy thread-safe atomic std::shared_ptr, the rest of <memory> and <atomic>, judicious use of move semantics, and OpenMP make it to implement data-structure-heavy algorithms.

It's also a huge relief to have deterministic deallocation back in my life. Praise Jesus for that.

It's not everyone's cup of tea, but if you need the speed, it's there for you, with a solid, clean, and expressive implementation. Your classic symbolic-regression genetic algorithm with halls of fame and whatnot is an absolute cinch, for example, and runs gob-smackingly fast with -O3 -march=native and a good ten minutes of profiling data for PGO.

Using 'relaxed' (or acquire/release) for performance without at least a vague understanding of the semantic implications (of relaxed especially, and of the memory model more generally) is a recipe for disaster and some very confusing bugs.

As a corollary, if you see someone using anything other than the default, check their code carefully.

What are some applications that use "relaxed memory concurrency"? Which ones really benefit from the performance increase?

How does it compare with using lock-free / immutable data structures? IMO this strategy is less error prone and easier to test. I've written multi-threaded C++, but mostly before C++11.

In general, relaxed memory orderings are useful because you do not need to fetch a cache line from another core (slow!) every time you want to perform a read, or possibly even a write (although that could introduce a data race). In the case of a lock, where you want to guarantee that some state is the same for all cores, you need strong memory-ordering guarantees.

For other applications, where it is only important that changes eventually propagate through, a relaxed memory ordering is sufficient. Think of adding elements to a vector (which could be serialized in multiple ways anyway), or of performing computation on data that is implicitly shared between cores but not actually read by the other core during the operation. Relaxed ordering is also faster and more power efficient, as high-speed core interconnects are a major drain compared to, let's say, the ALU in a modern CPU.

Lock-free structures are useful, but as your data structure gets more complicated and updates involve more operations, it becomes a very non-trivial task to write a lock-free version of your operation. It's easier to simply lock and perform an 'atomic' update. Immutable structures often have performance overheads that are not wanted in the C++ STL.

Minor nitpick: there is no such thing as fetching a cache line from another core, at least on the Intel x64 architecture. All synchronization between cores happens at the L3 cache and QPI, because the caches are inclusive and L3 is shared by all cores. When a core writes to L3, it invalidates the same cache line in the L1 and L2 caches of the other cores. Therefore, if your cores write to the same cache line, you have a big problem regardless of the chosen memory-order model.

Relaxed is useful when you need atomicity but not ordering between cores. x86 guarantees that all reads/writes are atomic, i.e. no torn reads/writes, but I don't think all architectures have this guarantee.

Also, there may be times in your program when you know that your thread has exclusive access to a variable during a certain segment. During this segment it may be beneficial to do relaxed reads/writes, then complete the critical section with a release operation.

This is used heavily in single producer single consumer queues.

Minor nitpick, but x86 only guarantees atomic reads/writes for properly aligned values. I believe ARM is the same, but I'm not very sure about that.
