
What every systems programmer should know about lockless concurrency [pdf] - sidcool
https://assets.bitbashing.io/papers/lockless.pdf
======
johnbender
It seems like weak memory models get short shrift, but if you're going to
program without locks it's semi-important to understand what information one
gets when examining a read-write pair.

It's true that (as implied by the article) you can probably get by with just
studying/programming with the C/C++ [1][2] "atomic" memory access types and
letting the compiler enforce those semantics, though the reasoning behind a
lot of these memory orderings is lost without understanding the
motivation/arch. models behind them.
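
To make those orderings concrete, here's a minimal C++ sketch (mine, not from
the article) of the message-passing idiom that release/acquire is designed
for; with memory_order_relaxed on the flag, the assert below could
legitimately fire on a weakly ordered machine:

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;                        // plain, non-atomic payload
    std::atomic<bool> ready{false};      // synchronization flag

    void producer() {
        data = 42;                                    // (1) write payload
        ready.store(true, std::memory_order_release); // (2) publish it
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {} // (3) wait for flag
        assert(data == 42); // (4) guaranteed: (1) happens-before (4)
                            //     via the release/acquire pair
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }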

If you're interested in the C/C++ memory models, there's active research into
specifying them without bad behaviors (thin-air reads [3]). Recent results
include a semantics that makes value promises and requires a justifying
execution [4], which is not totally dissimilar from those required by the
official Java memory model [5].

1. [http://en.cppreference.com/w/cpp/atomic/atomic](http://en.cppreference.com/w/cpp/atomic/atomic)

2. [http://en.cppreference.com/w/c/atomic](http://en.cppreference.com/w/c/atomic)

3. [http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html](http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html)

4. [https://people.mpi-sws.org/~dreyer/papers/promising/paper.pdf](https://people.mpi-sws.org/~dreyer/papers/promising/paper.pdf)

5. [http://rsim.cs.uiuc.edu/Pubs/popl05.pdf](http://rsim.cs.uiuc.edu/Pubs/popl05.pdf)

~~~
bshanks
I wonder how the memory model in
[https://people.mpi-sws.org/~dreyer/papers/promising/paper.pdf](https://people.mpi-sws.org/~dreyer/papers/promising/paper.pdf)
compares to the WMM and WMM-S memory models given in
[https://arxiv.org/pdf/1707.05923.pdf](https://arxiv.org/pdf/1707.05923.pdf),
and whether there are other research memory models out there that should also
be considered.

~~~
johnbender
These papers take forever to read (take it from me, I am doing research in
this area, in particular proofs of correctness for lock free programs). I
recommend focusing on the C/C++ ones if only for the practical value.

As for comparison, the C/C++ memory model is more general and is operational.
It is also formalized in Coq and has some good theorems (data-race-free
sequential consistency being the most obvious).

The RISC memory model is axiomatic and follows the standard axiomatic approach
adopted for many other memory models like Java and the current C/C++ standard.
That's not a dig, they just don't care as much about rigor.

Axiomatic models consider every possible execution and then weed out bad
ones, whereas an operational semantics defines the set of all possible
executions using a transition relation. If you're into math you might see
these, vaguely, as extensional and intensional respectively.

~~~
bshanks
thanks! I think the RISC/WMM folks provide an operational model too, using
their own "I2E" formalism. The end of the paper seems to have a proof of
equivalence between their axiomatic and operational models.

------
dragontamer
Lockless concurrency was always an "interesting" subject to me, but I never
thought it was practical for me to use, mostly because multithreaded
programming generally has easier-to-use, higher-level constructs (e.g.
already-written reader-writer queues).

However, I've recently grown to respect lockless programming as I have begun
to experiment with GPU programming. OpenCL doesn't give you mutexes, and
indeed, you really shouldn't ever "lock" an instruction pointer on a GPU. (On
AMD systems, a single instruction pointer runs 64 different "threads", so if
one of those "threads" locks up, ALL 64 of them lock. This "Single
Instruction, Multiple Data" design is the key to why GPUs have so many
"cores".)

As such, the only practical way to write decently high-performance concurrent
code on a GPU is through atomics (e.g. compare-and-swap), which of course
requires an understanding of memory barriers as well.
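
For a feel of what that looks like, here is a rough host-side C++ sketch of
the compare-and-swap retry loop (the same shape you'd write in an OpenCL
kernel with its compare-and-swap primitive); the example itself, a lock-free
fetch-max, is just an illustration I picked:

    #include <atomic>

    // Lock-free "fetch max": keep retrying a CAS until we either install
    // our value or observe one that is already larger.
    void atomic_fetch_max(std::atomic<int>& target, int value) {
        int observed = target.load(std::memory_order_relaxed);
        while (observed < value &&
               !target.compare_exchange_weak(observed, value,
                                             std::memory_order_relaxed)) {
            // On failure, compare_exchange_weak reloads `observed` with the
            // current value, so the loop condition re-checks automatically.
        }
    }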

~~~
blattimwind
> Single Instruction, Multiple Data

*multiple threads (preferably both, in the case of GPUs, but what you described is SIMT)

~~~
dragontamer
I know it's what NVidia calls SIMT, but it's what AMD calls SIMD. ¯\\_(ツ)_/¯

NVidia is somewhat justified in calling their "SIMT" concept something new,
because Intel's original SIMD implementations (MMX and SSE) couldn't handle
thread divergence very easily (AVX-512 adds a few more features to make
thread divergence easier to handle at the CPU level). But as far as I can
tell, AMD GPUs can do thread divergence, "constant broadcasts", and all that
jazz too, and AMD still calls it SIMD.

------
jeff571
Keep in mind lockless algorithms are not necessarily more scalable than lock-
based algorithms, usually have higher constant overheads, and are
significantly easier to get wrong.

However, this post is, in part, about how to implement locks.

~~~
taneq
> However, this post is, in part, about how to implement locks.

I haven't read the article in-depth but yeah, it seems like it maybe would
have been better titled "What every C++ systems programmer should know about
std::atomic". Which is still cool but from the title I was expecting some
fancy voodoo algorithms for threadsafe non-locking FIFO queues or something.

~~~
banachtarski
I think it's meant to be pretty fundamental. It's not supposed to include
"voodoo algorithms" because it literally says "what every systems programmer
should know...". The voodoo algorithm is a voodoo algorithm _because_ not
everyone needs to know it. In contrast, a mutex and all of that stuff is a
consequence of how memory sequencing on a CPU works. If you understand those
concepts, the rest is a hop, skip, and a jump away.
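
For example, once you have an atomic exchange plus acquire/release ordering,
a (very naive) lock falls out almost immediately. A rough sketch of my own,
not a real mutex (no backoff, fairness, or sleeping):

    #include <atomic>

    // A naive spinlock built from one atomic flag. Acquire on lock and
    // release on unlock is exactly the memory-sequencing guarantee a
    // critical section needs.
    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            // exchange() returns the previous value; loop until we flip it
            // from false to true. Acquire ordering makes the previous
            // holder's writes visible to us.
            while (locked.exchange(true, std::memory_order_acquire)) {
                // spin (a real lock would pause, back off, or sleep here)
            }
        }
        void unlock() {
            // Release ordering publishes our critical-section writes to the
            // next thread that acquires the lock.
            locked.store(false, std::memory_order_release);
        }
    };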

~~~
taneq
I guess my beef was mostly in response to the word "lockless", which I'd hoped
meant "not using any wait-for-other-thread type semantics" but in this context
seems to mean "implementing locking without using OS-provided locking
primitives, although the standard library is fine."

I was hoping that there was some cool new trick I'd missed, or a
fundamentally different way to do concurrency than locking. I get
disappointed easily in cases like this. :/

------
xyzzy_plugh
I have always found that the biggest hurdle in understanding concurrency
amongst my peers is learning how memory barriers work.

~~~
ohazi
One reason for this is that for the longest time they weren't really available
in a cross platform way. I'd learn about e.g. mutexes in a computer
architecture class, and then I could immediately go use them with boost or
pthread or whatever.

I'd also learn about atomic operations, atomic cas, and memory barriers, but
short of some very specific game libraries with weird macros for this, I
couldn't find them anywhere.

It seemed like we were expected to sprinkle volatile everywhere, remember
rules about unaligned access, and write our own platform specific macros for
memory barriers. If you're primarily testing your code on an x86 laptop, it's
very difficult to be confident that you're doing it correctly.

~~~
exDM69
> I'd also learn about atomic operations, atomic cas, and memory barriers, but
> short of some very specific game libraries with weird macros for this, I
> couldn't find them anywhere.

How come? These days atomic ops are included in the C and C++ standard
libraries. There are atomic operations for every important compiler out
there, and if you need portability across compilers (without relying on new
C/C++ standards), there are portability libraries (e.g. Boost).

These have existed for a long time.

E.g. here are GCC's atomic builtins:
[https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html)
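
For instance, here's a sketch of the same relaxed atomic increment written
both with the portable C++11 facility and with the GCC/Clang builtin
documented at that link (the variable and function names are just my own
illustration):

    #include <atomic>

    std::atomic<int> counter{0};  // portable C++11-style atomic
    int raw_counter = 0;          // plain int driven through the builtin

    void bump_portable() {
        counter.fetch_add(1, std::memory_order_relaxed);       // std::atomic
    }

    void bump_gcc_builtin() {
        __atomic_fetch_add(&raw_counter, 1, __ATOMIC_RELAXED); // GCC/Clang
    }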

> It seemed like we were expected to sprinkle volatile everywhere, remember
> rules about unaligned access, and write our own platform specific macros for
> memory barriers.

Unless you're using Java, "volatile" is not useful for concurrent programming.
When you see the volatile keyword in C code, it's almost always wrong, or
there for a legacy reason.
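
A small sketch of my own to illustrate the point: volatile only stops the
compiler from eliding the accesses; it provides neither atomicity nor
ordering, so the volatile counter below can silently lose increments while
the std::atomic one cannot:

    #include <atomic>
    #include <thread>

    volatile int broken_counter = 0;      // data race: ++ is read-modify-write
    std::atomic<int> correct_counter{0};  // atomic read-modify-write

    void worker() {
        for (int i = 0; i < 100000; ++i) {
            broken_counter = broken_counter + 1;  // may lose increments
            correct_counter.fetch_add(1);         // never loses increments
        }
    }

    int main() {
        std::thread a(worker), b(worker);
        a.join(); b.join();
        // correct_counter is exactly 200000; broken_counter likely is not.
    }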

Sounds like your computer architecture classes were either not up to date or
not very well taught.

~~~
ohazi
> Sounds like your computer architecture classes were not up to date or very
> well taught.

Or maybe I went to school before C++11 was around.

¯\\_(ツ)_/¯

~~~
exDM69
There are many atomics libraries, and compilers supported atomics well before
C++11.

If you went to school in the multi-core era, or slightly before it, this
shouldn't have been a problem.

There isn't much need for atomics on single-core, single-CPU machines, so if
you went before that, it's understandable.

------
codepie
The thread example given in the article is limited to a single writer and
multiple readers. Things get interesting when you want lock-free concurrency
with multiple readers and writers. One of the famous techniques is software
transactional memory
([https://en.m.wikipedia.org/wiki/Software_transactional_memory](https://en.m.wikipedia.org/wiki/Software_transactional_memory)),
but it hasn't really caught on in industry and still lives mostly within
academia.

The existing lock-free concurrency solutions are not mature enough to replace
locks. I like the analogy that STM is to concurrency what garbage collection
is to memory management
([https://homes.cs.washington.edu/~djg/papers/analogy_oopsla07.pdf](https://homes.cs.washington.edu/~djg/papers/analogy_oopsla07.pdf)).
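
To see why multiple writers are where it gets interesting, here's a sketch of
mine (not from the article) of the easy half of a Treiber-style lock-free
stack: any number of threads can push concurrently with a CAS loop, but pop
immediately runs into the ABA problem and memory reclamation, which is part
of what makes STM attractive as an alternative:

    #include <atomic>

    struct Node {
        int value;
        Node* next;
    };

    std::atomic<Node*> head{nullptr};

    // Multi-writer lock-free push: retry until we swing `head` from the
    // value we observed to our new node.
    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // On failure, n->next is updated to the current head; retry.
        }
    }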

~~~
saas_co_de
Intel has TSX:
[https://software.intel.com/en-us/node/524022](https://software.intel.com/en-us/node/524022)

I have never used it though and I am not sure who does.

~~~
convolvatron
I wish this were more useful. The real sticking point for me is that you can
run out of transaction contexts - which means you have to have a full software
implementation as backup.
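
For reference, that fallback usually takes roughly the shape of the sketch
below with the RTM intrinsics (my illustration, not production code; it needs
-mrtm and TSX-capable hardware, and a real implementation would retry the
transaction a few times before giving up):

    #include <immintrin.h>   // _xbegin/_xend/_xabort
    #include <atomic>

    std::atomic<bool> fallback_locked{false};
    long shared_value = 0;

    void transactional_increment() {
        if (_xbegin() == _XBEGIN_STARTED) {
            // Read the fallback flag inside the transaction so that a
            // lock-holder forces transactional threads to abort instead of
            // racing with the software path.
            if (fallback_locked.load(std::memory_order_relaxed))
                _xabort(0xff);
            shared_value += 1;           // transactional update
            _xend();
            return;
        }
        // Aborted (capacity, conflict, interrupt) or TSX unavailable:
        // fall back to a plain spinlock protecting the same data.
        while (fallback_locked.exchange(true, std::memory_order_acquire)) {}
        shared_value += 1;
        fallback_locked.store(false, std::memory_order_release);
    }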

The only case I can think of where I would actually use the thing is if I
already had an STM runtime and I wanted to cut down some of its constant
overheads.

The other problematic aspect is that transaction scheduling policy is pretty
important for contended performance and as part of the broader system design;
in this case you have to live with whatever Intel gives you.

------
srcmap
I implemented a lockless IPC/DIPC (interprocess communication and distributed
IPC) library in the 2000-2002 time frame.

The library was used on VxWorks, QNX, and Linux (both kernel and user space),
and was tested heavily on 1-2 GHz Xeon SMP CPUs at that time.

For memory management, it was implemented in C++ and the message (memory)
alloc/free calls were all lockless. One could allocate memory in the kernel,
send it as a message to user space, and use it there without a copy. The
limitation is that the maximum number of messages is statically pre-allocated
for the system. The library manages the pre-allocated pools of various sizes
with lockless APIs; when a pool runs out, the API returns an error.

The only primitive the library depended on was Atomic_Inc/Dec - absolutely no
mutexes or spin locks in the non-blocking API code path.

The library was designed to support an HA (high availability) TCP/IP stack
and routing protocols. The APIs were not that complex; the most complicated
parts were the regression testing code and the diagnostic code. The API was
used by a team of 100+ developers, and I needed to help them debug complex
IPC/DIPC issues - and prove to them that when an error was returned, it was
not a library/API issue.

There was a trace mechanism that tracked all the messages going through the
system(s). The trace system was also completely non-blocking and lockless,
and worked in both kernel and user space. All context switches in the Linux
kernel were tracked - very similar to ftrace in the kernel today. I hacked
the kernel/BIOS to preserve the DRAM contents of the trace system during
warm boot. This way, I could recover all the messages and the last
context-switch info even for kernel hangs and crashes, for postmortem
analysis. The trace system used rdtsc for timestamping and had 0.5 nanosecond
resolution on a 2 GHz Xeon CPU.

The API/library worked fine; the overall system complexity was the big issue.
We could demo TCP/IP, BGP, and various routing protocols doing HA switchover
and in-service OS upgrades, but only as a demo - after a switchover, state
syncing took a long time, and the reliability of the HA TCP/IP stack and
routing protocols was a big issue.

Based on the lessons learned from that project, I went on and coded a
different, application-level HA system. That system based everything on top
of standard Linux APIs/utilities. It worked much better, and the HA port of
the software took only 1 dev (me) 3 weeks to code. It supported in-service
software (including Linux kernel) upgrades, HW/SW fault detection, and
automatic switchover. It was deployed at Comcast. Fault detection took
< 50 ms and switchover was almost instant. Full state sync was done with
wget over the management Ethernet interface and always took < 1 second. All
incremental state updates were done over UDP and took 1 packet each.

The funny Biz/$ outcomes from these two projects:

The first one: the startup sold only one HA router for $200k and burned
$100M+ of VC money, but was sold to Nokia for $400M+. I was able to pay off
my mortgage with the options from that company - not FU money, but an OK
outcome.

The 2nd one: the startup sold $40M of products for $60K from that project to
Comcast and various cable companies at 60% margin. That product (with 19+
FPGAs) was created/coded/developed by ~8 HW/FW/SW engineers (including the 2
founders). The VCs brought in "a seasoned CEO" who managed to raise a few
more rounds, kicked out the founders (MIT PhD), hired 160+ people, and ran
the company into the ground.

------
signa11
IMHO, memory-barriers.txt in the Linux kernel documentation should be
required grokking before getting into lockless stuff.

~~~
albertzeyer
Link:
[https://www.kernel.org/doc/Documentation/memory-barriers.txt](https://www.kernel.org/doc/Documentation/memory-barriers.txt)

~~~
maxxxxx
I wish I could write that well :(. This is better than most write ups with
fancy graphics.

------
SEJeff
Another one (posted by me) on "Fear and Loathing in Lock Free Programming":

[https://news.ycombinator.com/item?id=15530032](https://news.ycombinator.com/item?id=15530032)

------
ozfive
This is incredibly well thought out and written. I felt like a friend was
telling me about something that only we would understand. Matt Kline, bravo.

------
amelius
How well does Rust support these concepts?

~~~
pjmlp
As far as I know Rust doesn't have a fully defined memory model yet, but its
type system already helps a lot.

[https://github.com/nikomatsakis/rust-memory-model](https://github.com/nikomatsakis/rust-memory-model)

A bit old:
[https://blog.rust-lang.org/2015/04/10/Fearless-Concurrency.html](https://blog.rust-lang.org/2015/04/10/Fearless-Concurrency.html)

~~~
steveklabnik
We don't, but there's been a ton of work;
[http://plv.mpi-sws.org/rustbelt/](http://plv.mpi-sws.org/rustbelt/)

For this kind of thing we mostly say "we defer to the C++ memory model" at the
moment.

