
Benign data races considered harmful - signa11
https://bartoszmilewski.com/2020/08/11/benign-data-races-considered-harmful/
======
Animats
"Benign data races" have been considered harmful from a theoretical
perspective for decades. But most deployed CPUs had memory semantics that made
them work. Mostly.

On x86 multiprocessors, if, without taking any special measures, you set x=1
from processor 0, and processor 1 reads x shortly, but not too shortly,
thereafter, it will see 1. If you try that on, say, an 80-core ARM CPU [1],
CPU #53 might not see a 1 for a long time, if ever.[2] You're no longer
guaranteed that cache changes propagate unasked. As we get more and more CPUs,
the cost of creating the illusion that there's a single memory goes up. It's
now important that compilers know what's shared so they can generate the
proper fence and barrier instructions. Especially since the latest iteration
of ARM has new features such as "non-temporal load and store."

What used to be a theoretical problem, or at most a problem for OS architects,
has thus acquired teeth and claws that can bite ordinary applications
programmers.

This is where traditional C++ mutexes start to break down. From a theoretical
perspective, it's always been annoying that mutexes at the POSIX level have no
tie to what they're supposed to be protecting. Hitting a mutex has to mean
"flush everything". That's inefficient. Rust has the advantage that its locks
own data, so the compiler knows what has to be flushed. That becomes more
important as the number of CPU cores becomes very large.

I've said for years that the three issues in C/C++ are "how big is it", "who
owns it", and "who locks it". C++ now at least has abstractions for all of
these, although they all leak. You can still get raw pointers out to misuse,
or, more likely, pass to some API for misuse.

[1] [https://venturebeat.com/2020/03/03/ampere-altra-is-the-first-80-core-arm-based-server-processor/](https://venturebeat.com/2020/03/03/ampere-altra-is-the-first-80-core-arm-based-server-processor/)

[2] [https://developer.arm.com/documentation/den0024/a/the-a64-instruction-set/memory-access-instructions/memory-barrier-and-fence-instructions](https://developer.arm.com/documentation/den0024/a/the-a64-instruction-set/memory-access-instructions/memory-barrier-and-fence-instructions)

~~~
adwn
> _You're no longer guaranteed that cache changes propagate unasked._

Huh? Yes, you can: Caches in multi-core ARM systems are still coherent, so
once a write operation reaches any level of the cache hierarchy (typically
level 1) and updates that cache line, _all_ other cores will see the updated
value when they read from that address.

That's not what you need memory consistency operations, like load-acquire and
store-release, for. You need those to make sure that the value from a write
operation has actually reached the cache, because it might still be lingering
in the store buffer, which sits between the execution unit and the cache.

> _Rust has the advantage that its locks own data, so the compiler knows what
> has to be flushed._

That's absolutely not how Rust works. The Rust compiler has no notion of a
mutex, only of atomic operations. Mutexes are purely library constructs – as
shown by the parking_lot crate [1], which implements its own mutex
independently of the standard library mutex.

[1]
[https://github.com/Amanieu/parking_lot](https://github.com/Amanieu/parking_lot)

~~~
Animats
_" A lock knows what data it protects, and Rust guarantees that the data can
only be accessed when the lock is held. State is never accidentally shared.
"Lock data, not code" is enforced in Rust."_

[1] [https://blog.rust-lang.org/2015/04/10/Fearless-Concurrency.html](https://blog.rust-lang.org/2015/04/10/Fearless-Concurrency.html)

~~~
adwn
The issue isn't whether the compiler knows which data is affected and which
isn't, but rather that the compiler doesn't know about the semantics of the
mutex.

However, the _library_ knows about the semantics of the mutex, and it could
explicitly instruct the processor to only flush certain memory regions from
the store buffer when the mutex is unlocked. But then again, that's not Rust-
specific – a C++ library could do the same.

------
Hokusai
> Weak atomics give you the same kind of performance as benign data races but
> they have well defined semantics.

That is a great argument. Self-explanatory code makes it easier to understand
the design goals and intended usage of that code.

> Traditional “benign” races are likely to be broken by optimizing compilers
> or on tricky architectures.

This is even better. One of the reasons it is so hard to move code from one
architecture to another is not only differing word sizes but also small
undefined behaviors.

I had the pleasure of making C++ code compile for WebAssembly. The unexpected
ways in which it breaks are challenging and interesting, as is how often the
mental model we have as developers differs in small ways from the reality of
the C++ specification.

My most absurd example was a test application that had to crash on the push
of a button, to verify that all logging and recovery code was working. To
crash the application, the code accessed a NULL pointer. Alas, WebAssembly
allowed read access to address zero without a problem. So the button did
nothing when compiled to WebAssembly.

~~~
msla
> To crash the application the code was accessing a NULL pointer. Alas,
> WebAssembly allowed read access to address Zero without problem.

Good thing nothing in the C or C++ standards mandates that NULL be address
all-bits-zero, regardless of how it's sometimes spelled.

Any compiler which assumes that memory access to address all-bits-zero _is_
access to NULL is not compliant.

~~~
PaulDavisThe1st
Since C++11, the definition of NULL is:

    an integer literal with value zero, or a prvalue of type std::nullptr_t

See
[https://wg21.cmeerw.net/cwg/issue903](https://wg21.cmeerw.net/cwg/issue903)
for a case where NULL did not evaluate to zero.

~~~
wolfgang42
I thought “integer literal zero (in pointer contexts)” was the definition of
NULL in _all_ versions of C. As I understand it the resulting pointer doesn’t
necessarily have to have a value of 0 (even though that’s what it says in the
source code) so long as it behaves appropriately nullish—i.e. the compiler can
translate “int *foo = 0;” as “MOV foo 0xDEADCODE” so long as that value won’t
be used for any legitimate pointer.

~~~
msla
The above-linked site about POSIX has this paragraph:

[https://www.austingroupbugs.net/view.php?id=940](https://www.austingroupbugs.net/view.php?id=940)

> The C standard goes to great length to ensure that NULL (and for that
> matter, any representation of the null pointer, which is written in source
> code by converting a constant value 0 to a pointer) need not be represented
> in hardware by all 0 bits (that is, 'int *p = 0;' or 'static int *p;' is
> allowed to compile to assembly instructions that assign a non-zero bit
> pattern to the location named p). And in fact, there are historical machines
> where the most efficient null pointer in hardware was indeed a non-zero bit
> pattern:
> [http://c-faq.com/null/machexamp.html](http://c-faq.com/null/machexamp.html)

... and the site it links to has this:

> Q: Seriously, have any actual machines really used nonzero null pointers, or
> different representations for pointers to different types?

> A: The Prime 50 series used segment 07777, offset 0 for the null pointer, at
> least for PL/I. Later models used segment 0, offset 0 for null pointers in
> C, necessitating new instructions such as TCNP (Test C Null Pointer),
> evidently as a sop to [footnote] all the extant poorly-written C code which
> made incorrect assumptions. Older, word-addressed Prime machines were also
> notorious for requiring larger byte pointers (char *'s) than word pointers
> (int *'s).

[snip]

> The Symbolics Lisp Machine, a tagged architecture, does not even have
> conventional numeric pointers; it uses the pair <NIL, 0> (basically a
> nonexistent <object, offset> handle) as a C null pointer.

------
pornel
The difficulty with "benign" data races and atomic operations with weak
ordering, is that they're not guaranteed to behave like the target
architecture behaves! It's insufficient to say "this is fine on x86" even in
an x86-only program!

That's because the optimizer symbolically executes the program according to
the C memory model, not the x86 memory model (as if it were some kind of
Low-Level Virtual Machine…). The optimizer may reorder or remove memory
accesses in ways that wouldn't happen on the target hardware.

~~~
userbinator
_That's because the optimizer symbolically executes the program according to
the C memory model, not the x86 memory model_

...and I think that is a huge problem, because it strays from the spirit of
the language in that it's supposed to be "closer to the hardware" than other
HLLs.

Edit: care to give a counterargument?

~~~
ganafagol
I'm not the one who downvoted you, but the counterargument is that such a
target-specific approach prevents any reasonable effort to consistently define
the semantics of any programming language. "On x86_amd_this_and_that processor
the program means this, on arm_64_some_version it means that and on
x86_intel_something again something else". This would not just be hell, it
would be unworkable.

In other words, any approach other than an abstract machine model is doomed.
Ideally this abstract machine is rather close to real world machines, but it
can't be identical to all of them, nor is it reasonable to start special
casing.

~~~
userbinator
_but the counterargument is that such a target-specific approach prevents any
reasonable effort to consistently define the semantics of any programming
language._

Why does there even need to be one? There's already undefined behaviour to
account for the variance. Otherwise it's just making things worse for
programmers who want to easily solve real problems on real hardware, not in
the absurd theoretical world of spherical cows.

~~~
jsgf
Undefined isn't the same as implementation defined. "Undefined" means you're
not even talking about C any more. "Implementation defined" means which of the
allowable semantics the implementors happened to choose.

------
microcow
> It is now widely recognized that this attempt to define the semantics of
> data races has failed, and the Java memory model is broken (I’m citing Hans
> Boehm here).

This is a bit misleading. C++ also ended up defining the semantics of data
races with memory_order_relaxed. In the standard they are not called "data
races", but they correspond closely to what Boehm calls Java "giving some
semantics to data races". C++ relaxed atomics also have the same issues with
causality.

See Russ Cox's talk:
[http://nil.csail.mit.edu/6.824/2016/notes/gomem.pdf](http://nil.csail.mit.edu/6.824/2016/notes/gomem.pdf)
And Viktor Vafeiadis' slides:
[https://people.mpi-sws.org/~viktor/slides/2016-01-debug.pdf](https://people.mpi-sws.org/~viktor/slides/2016-01-debug.pdf)

> Weak atomics give you the same kind of performance as benign data races but
> they have well defined semantics.

This is mostly true, but not at the limits. GCC, clang, MSVC, and icc tend to
be pretty conservative in how they treat relaxed atomics. For example, on x86
they tend not to generate the reg+mem form and instead generate an explicit
`mov` instruction
([https://gcc.godbolt.org/z/K135Mo](https://gcc.godbolt.org/z/K135Mo)). This
requires extra resources for instruction fetch & decoding and uses an extra
named register (potentially causing spills to the stack).

> Traditional “benign” races are likely to be broken by optimizing compilers
> or on tricky architectures.

This is true, but the sad thing is that the implementation of the C++ memory
model is sometimes broken too. GCC had lots of real bugs up through about GCC
4.9 (I think...). And then there's
[http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0668r1.html](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0668r1.html)
\-- "Existing implementation schemes on Power and ARM are not correct" (2017)

~~~
GeertB
Having been there, and having been bitten often enough, I can confirm that
most compilers just try to generate the most conservative code possible for
volatile accesses. Their semantics are crazy/inconsistent, so you can't reason
about them, and high-performance code won't use them. Nobody will ever be
interested in making them work/perform well.

C++ atomics in their various memory models are different in that they have
well-defined semantics, used in high performance code and people do care.
You're right that there have been bugs (and probably still are), but at least
you can _argue_ that something is a bug if there is a coherent definition.
Optimization of atomics, such as merging barriers, is still in its infancy.

------
secondcoming
The comment:

> if you have mutexes in your code you’re probably programming at too low a
> level

is quite disappointing to read

~~~
mjburgess
> something based on CSP

Gives the game away. Just another proselytising FP nut trying to justify all
the time they've spent learning "universal" FP-abstractions (i.e., highly
specific hacks around laziness).

------
valuearb
It seems I deal with race conditions all the time in Swift (Fixed one bug just
today), but can understand little of this. Every time I turn around, it turns
out things are harder and more complex than I thought.

~~~
brundolf
Fwiw "data race" != "race condition", confusing as that may be. A race
condition is a blanket term for _any_ nondeterministic bug resulting from
differences in timing between independent, concurrent processes. A data race
is a specific kind of race condition where reads/writes to the same value in
machine memory step on each other's toes and lead to inconsistent behavior.
The classic example is two threads that try to increment the same number but
only one takes effect, because they both read the current value before either
writes their new value (instead of one read/writing and then the other
read/writing). I'm not really a C/++ programmer, though, so I only have a
rough understanding of this topic. I don't know what a "benign data race" is,
for example.

------
carabiner
memes considered memeful

------
vermilingua
"Considered harmful" considered harmful:
[https://meyerweb.com/eric/comment/chech.html](https://meyerweb.com/eric/comment/chech.html)

