
The C11 and C++11 Concurrency Model (2014) [pdf] - clemlais
https://www.cs.kent.ac.uk/people/staff/mjb211/toc.pdf
======
mrpopo
The absolute minimum that average users of the C++ memory-ordering system
(std::atomic loads and stores, etc. [1]) should know is that the default
ordering is 'sequentially consistent', whereas to get the ultimate
performance one should use the 'relaxed' ordering, or alternatively
'release/acquire', as described both in the article and in [1].

[1]
[http://en.cppreference.com/w/cpp/atomic/memory_order](http://en.cppreference.com/w/cpp/atomic/memory_order)

~~~
Gladdyu
To be honest, the average C++ programmer should not be bothered with any of
the concurrency stuff. The chance that you get it wrong and introduce subtle
concurrency bugs that will obviously only show up in production environments
is simply too great.

What is needed is a bunch of concurrently usable STL containers (e.g. ones
that check for invalidated iterators when items are inserted into a map from
another thread), akin to the java.util.concurrent library. That way you
separate the tricky concurrency stuff out from the people who probably do not
understand it.

~~~
maxlybbert
I was with you until you said "concurrently usable containers." While they
aren't standard, they do exist in C++ (Intel Threading Building Blocks and
whatever Microsoft called its clone of Intel TBB), and they aren't as useful
as people believe. C#'s concurrent containers have the same limitations, and
to the extent I've used them, Java concurrent containers have the same
limitations.

The limitation I'm thinking of is that it's very common to need to do more
than one operation atomically, and common concurrent containers are unable to
do that. You can atomically add or remove a single item from a concurrent
queue, stack or hash map, and sometimes that's enough. But if you want to
iterate a whole container, say searching for all items that meet some
criteria, you need a lock.

~~~
pkolaczk
To iterate the whole container, I don't need a lock. I just use a container
that offers cheap immutable snapshots, e.g. Scala's TrieMap, and I store only
immutable items in it. Building similar structures in C++ is much harder, if
not impossible, without an efficient GC.

~~~
ehvatum
I have to say, I have been pleasantly surprised time and time again by the
powerful ease of data-structure-heavy algorithm implementation afforded by
thread-safe, atomic std::shared_ptr, the rest of <memory> and <atomic>,
judicious use of move semantics, and OpenMP.

It's also a huge relief to have deterministic deallocation back in my life.
Praise Jesus for that.

It's not everyone's cup of tea, but if you need the speed, it's there for you
- with a solid, clean, and expressive implementation. Your classic symbolic
regression genetic algorithm with halls of fame and whatnot is an absolute
cinch, for example, and runs gob-smackingly fast with -O3 -march=native and a
good ten minutes of profiling data for PGO.

------
chubot
What are some applications that use "relaxed memory concurrency"? Which ones
really benefit from the performance increase?

How does it compare with using lock-free / immutable data structures? IMO this
strategy is less error prone and easier to test. I've written multi-threaded
C++, but mostly before C++11.

~~~
Gladdyu
In general, relaxed memory models are useful because you do not need to fetch
a cache line from another core (slow!) every time you want to perform a read,
or possibly even a write (although that could introduce a data race). In the
case of a lock, where you want to guarantee that all cores agree on some
state, you need strong memory-ordering guarantees.

For other applications, where it is only important that changes eventually
propagate, a relaxed memory model is sufficient. Think of adding elements to
a vector (whose updates could be serialized in multiple ways anyway), or of a
computation on data that is implicitly shared between cores but not actually
read by the other core during the operation. It's also faster and more power
efficient, since high-speed core interconnects are a major power drain
compared to, let's say, the ALU in a modern CPU.

Lock-free structures are useful, but as your data structure gets more
complicated and the updates involve more operations, it becomes a very non-
trivial task to write a lock-free version of your operation. It's easier to
simply lock and perform an 'atomic' update. Immutable structures often have
performance overheads which are not desired in the C++ STL.

~~~
pkolaczk
Minor nitpick: there is no such thing as fetching a cache line from another
core, at least on the Intel x64 architecture. All the synchronization between
cores happens at the L3 cache layer and QPI, because the caches are inclusive
and L3 is shared by all cores. When a core writes to L3, it invalidates the
same cache line in the L1 and L2 caches of the other cores. Therefore, if
your cores write to the same cache line, you have a big problem regardless of
the chosen memory-order model.

