Hacker News new | past | comments | ask | show | jobs | submit login
Benign data races considered harmful (bartoszmilewski.com)
82 points by signa11 on Aug 14, 2020 | hide | past | favorite | 58 comments



"Benign data races" have been considered harmful from a theoretical perspective for decades. But most deployed CPUs had memory semantics that made them work. Mostly.

On x86 multiprocessors, if, without taking any special measures, you set x=1 from processor 0, and processor 1 reads x shortly, but not too shortly, thereafter, it will see 1. If you try that on, say, an 80-core ARM CPU [1], CPU #53 might not see a 1 for a long time, if ever.[2] You're no longer guaranteed that cache changes propagate unasked. As we get more and more CPUs, the cost of creating the illusion that there's a single memory goes up. It's now important that compilers know what's shared so they can generate the proper fence and barrier instructions. Especially since the latest iteration of ARM has new features such as "non-temporal load and store."

What used to be a theoretical problem, or at most a problem for OS architects, has thus acquired teeth and claws that can bite ordinary applications programmers.

This is where traditional C++ mutexes start to break down. From a theoretical perspective, it's always been annoying that mutexes at the POSIX level have no tie to what they're supposed to be protecting. Hitting a mutex has to mean "flush everything". That's inefficient. Rust has the advantage that its locks own data, so the compiler knows what has to be flushed. That becomes more important as the number of CPU cores becomes very large.

I've said for years that the three issues in C/C++ are "how big is it", "who owns it", and "who locks it". C++ now at least has abstractions for all of these, although they all leak. You can still get raw pointers out to misuse, or, more likely, pass to some API for misuse.

[1] https://venturebeat.com/2020/03/03/ampere-altra-is-the-first...

[2] https://developer.arm.com/documentation/den0024/a/the-a64-in...


> You're no longer guaranteed that cache changes propagate unasked.

Huh? Yes, you can: Caches in multi-core ARM systems are still coherent, so once a write operation reaches any level of the cache hierarchy (typically level 1) and updates that cache line, all other cores will see the updated value when they read from that address.

That's not what you need memory consistency operations, like load acquire and store release, for. You need those to make sure that a write from a write operation has been written to the cache, because it might still linger in the store buffer, which sits between the execution unit and the cache.

> Rust has the advantage that its locks own data, so the compiler knows what has to be flushed.

That's absolutely not how Rust works. The Rust compiler has no notion of a mutex, only of atomic operations. Mutexes are purely library constructs – as shown by the parking_lot crate [1], which implements its own mutex independently of the standard library mutex.

[1] https://github.com/Amanieu/parking_lot


"A lock knows what data it protects, and Rust guarantees that the data can only be accessed when the lock is held. State is never accidentally shared. "Lock data, not code" is enforced in Rust."

[1] https://blog.rust-lang.org/2015/04/10/Fearless-Concurrency.h...


The issue isn't whether the compiler knows which data is affected and which isn't, but rather that the compiler doesn't know about the semantics of the mutex.

However, the library knows about the semantics of the mutex, and it could explicitly instruct the processor to only flush certain memory regions from the store buffer when the mutex is unlocked. But then again, that's not Rust-specific – a C++ library could do the same.


That does not actually boil down to anything at the CPU level. That's just a (better) abstraction for the Rust programmer.


> it might still linger in the store buffer, which sits between the execution unit and the cache.

You're confusing ARM-specific component labels with general-purpose technical terminology; the store buffer is a cache, and > you're no longer guaranteed that [its] changes propagate unasked.


'adwn meant only that the store may transiently linger in the store buffer, not that it will remain in the store buffer indefinitely. Barriers (including load-acquire and store-release) don't make the store buffer start draining or drain faster or something - the store buffer is always draining.


> the store buffer is always draining.

That sounds quite resonable, but it doesn't mean the store buffer isn't part of the cache hierarchy (in the non-ARM-specific-terminology sense), and it's also not what they said.

> You need [memory consistency operations] [...] because [a write operation] might still linger in the store buffer


> If you try that on, say, an 80-core ARM CPU [1], CPU #53 might not see a 1 for a long time, if ever.[2] You're no longer guaranteed that cache changes propagate unasked.

What do you mean by this? ARM requires cache coherency for data caches, which requires that every store is eventually made visible to all cores. Fences are necessary to impose additional requirements on the order which writes to different addresses are made visible, but AFAIK not necessary to ensure that a single write is eventually visible.


Not sure how Load-Acquire (LDAR) and Store-Release (STLR) fit into this. They seem to be intended to be atomic only against other LDAR/STLR pairs, rather than doing a general flush. There's also the whole "shareability domain" thing - full system, outer sharable, inner sharable, and non-sharable. Barrier instructions get to specify how much consistency breadth to request.

So there are situations when every store may not be visible to every CPU core, but the ones that don't get the word shouldn't have access to that memory anyway. You can sort of see the architects of the CPU trying hard to prevent the cache system from choking on inter-CPU traffic, and trying to get more help from the software.


Inner vs outer shareable is not particularly relevant for applications - from the ARM ARM:

> This architecture assumes that all PEs that use the same operating system or hypervisor are in the same Inner Shareable shareability domain.

[PE = processing element - i.e. core]

i.e. if you a running a userspace program, all the cores are in the same shareability domain.

(I think your comments about LDAR/STLR are a response to 'adwn's sibling comment).

Eventual write propagation is a very basic liveness guarantee for systems with coherent caches. I'm pretty sure that the ARM architecture requires it for Observers with coherent caches, but event if not, it would be very surprising if a mainstream implementation didn't provide this guarantee.

Barrier instructions provide control of ordering relationships between accesses and when accesses complete, but I believe there is a guarantee that accesses eventually complete even without barriers.

(Also from the ARM ARM (B2.3.7), "complete" here means "visible to other cores": "Any read to the same Location by an Observer within the shareability domain will either Reads-from W1 or Reads-from a write that is Coherence-after W1.")

I'd be very interested to learn that ARM (or Ampere Altra) doesn't guarantee eventual write propragation, but that needs a better source than [2].


"Atomic" is probably the wrong term at this level. For a CPU, all loads and stores are atomic (assuming aligned and <= word size).

On an out-of-order CPU, loads and stores go into a buffer which may process its entries in a different order. Fences like LDA and STL prevent certain reorderings.


All loads and stores are atomic. I'm also pretty sure ARM has coherent caches. What ARM does allow is for CPUs to optimize the order it loads and stores from cache as opposed to x86 TSO guarantee. DMB barrier instruction on ARM allows you to control how L/S may be reordered. Both x86 and ARM have special types of memory regions such as uncachable memory, but that's not something you are going to see outside of kernel space.


The idea of memory being a shared scratch space for multiple cpus makes a lot of sense for a first order design perspective, but it really no longer holds. CPUs are now distributed systems and should acknowledge that.

What do you think about each CPU owning its own memory region and that to send values to other CPUs, one would have to do a send, either symbolically or physically, that could be implementation dependent.

Abstractions that aren't total and sound are insufficient and should be flagged by the compiler.


Yes. We're there already, with "sharability regions". So far, only the OS has to deal with that; each process (always?) is in one sharability region. (Shared memory between processes, though - how does that work now on machines with sharability regions?)

Total sharability slows down systems with large numbers of CPU cores. We're seeing the hardware features appear to allow for less sharability. It's not being used much yet. These have programming implications, and OSs and languages have to keep up. Probably more the OS for now, with groups of CPUs and areas of memory dedicated to each other. Can Linux cgroups be used that way?

This has to be invisible to the application programmer, somehow. Or we'll have another hard to program mess like the Cell.


> to send values to other CPUs, one would have to do a send, either symbolically or physically

The funny thing is, that pretty much is what store-release is (with load-acquire being the corresponding recv), anyway, just without any memory safety or sanity-checking.


> without any memory safety

Actually, I think that would be memory protection, not safety.


> Weak atomics give you the same kind of performance as benign data races but they have well defined semantics.

That is a great argument. Self-explanatory code make it easier to understand what are the design goals and usage of that code.

> Traditional “benign” races are likely to be broken by optimizing compilers or on tricky architectures.

This is even better. One of the reasons is so hard to move code from one architecture to another one is because not only word sizes but small non-defined behaviors.

I had the pleasure to make C++ code compile for WebAssembly. It is challenging and interesting the unexpected ways that it breaks. And how many times the mental model that we have as developer and the reality of the C++ specification differ in small ways.

My most absurd example was a test application that had to crashi on the push of a button, to verify that all logs and recovery code was working. To crash the application the code was accessing a NULL pointer. Alas, WebAssembly allowed read access to address Zero without problem. So, the button did nothing when compiled in WebAssembly.


> I had the pleasure to make C++ code compile for WebAssembly. It is challenging and interesting the unexpected ways that it breaks. And how many times the mental model that we have as developer and the reality of the C++ specification differ in small ways.

And that's because the way most programmers write C/C++ programs is with the mental model of the actual machine they are writing the code for. That's not so strange, it's being taught that way. It's even praised as a skill to look at the generated assembly if you want to understand how something is compiled.

But that approach is fundamentally broken. One needs to program against the abstract machine which defines the semantics of your programming language. Everything else leads to unportable and subtly broken code with hidden assumptions of how undefined behavior is resolved.

Sometimes you need to do this for performance reasons? Well, then you pay the price. It's expensive.


> To crash the application the code was accessing a NULL pointer. Alas, WebAssembly allowed read access to address Zero without problem.

Good thing nothing in the C or C++ standards mandates NULL be address all-bits-zero regardless of how it's sometimes spelled.

Any compiler which assumes that memory access to address all-bits-zero is access to NULL is not compliant.


AFAIK, the POSIX standard mandates that all-bits-zero be treated as NULL (https://www.austingroupbugs.net/view.php?id=940).


Since C++11, the definition of NULL is:

   an integer literal with value zero, or a prvalue of type std::nullptr_t 
See https://wg21.cmeerw.net/cwg/issue903 for a case where NULL was did not evaluate to zero.


I thought “integer literal zero (in pointer contexts)” was the definition of NULL in all versions of C. As I understand it the resulting pointer doesn’t necessarily have to have a value of 0 (even though that’s what it says in the source code) so long as it behaves appropriately nullish—i.e. the compiler can translate “int *foo = 0;” as “MOV foo 0xDEADCODE” so long as that value won’t be used for any legitimate pointer.


The above-linked site about POSIX has this paragraph:

https://www.austingroupbugs.net/view.php?id=940

> The C standard goes to great length to ensure that NULL (and for that matter, any representation of the null pointer, which is written in source code by converting a constant value 0 to a pointer) need not be represented in hardware by all 0 bits (that is, 'int p = 0;' or 'static int p;' is allowed to compile to assembly instructions that assign a non-zero bit pattern to the location named p). And in fact, there are historical machines where the most efficient null pointer in hardware was indeed a non-zero bit pattern: http://c-faq.com/null/machexamp.html

... and the site it links to has this:

> Q: Seriously, have any actual machines really used nonzero null pointers, or different representations for pointers to different types?

> A: The Prime 50 series used segment 07777, offset 0 for the null pointer, at least for PL/I. Later models used segment 0, offset 0 for null pointers in C, necessitating new instructions such as TCNP (Test C Null Pointer), evidently as a sop to [footnote] all the extant poorly-written C code which made incorrect assumptions. Older, word-addressed Prime machines were also notorious for requiring larger byte pointers (char 's) than word pointers (int 's).

[snip]

> The Symbolics Lisp Machine, a tagged architecture, does not even have conventional numeric pointers; it uses the pair <NIL, 0> (basically a nonexistent <object, offset> handle) as a C null pointer.


The best part of C++ is how even the caveats get caveats. The fun never ends for rules lawyers.


Huh, I didn't know this was the case for WebAssembly. I found a discussion about it amongst the WebAssembly designers. https://github.com/WebAssembly/design/issues/204 -- Looks like efficient implementation of mprotect() was a sticking point, and its been relegated to a Future Feature https://github.com/WebAssembly/design/blob/master/FutureFeat...


Despite the quote, it's just not true that relaxed atomics will give the same kind of performance. They may, but they may not, depending on the situation. The issue with them is that they're still atomic throughout the entire program, even in the situations where you can guarantee there are no other threads running and hence don't need atomicity. (Perhaps in particular scenarios, like with small inputs, you never even spawn additional threads to begin with.) For native-sized integers, ensuring atomicity might or might have overhead depending on the implementation. But it will at least have overhead for larger objects.


> But it will at least have overhead for larger objects.

Given that there's essentially no way to do atomic operations on larger objects (than machine-level types), all std::atomic<> for such types must involve a mutex, and are therefore guaranteed to have performance implications.

std::atomic<T> provides syntax that treats std::atomic<int> and std::atomic<MyHugeObject> in precisely the same way. This is useful, but entirely misleading if you care about performance.


You're arguing for my point, not against it. (Though your basis for it isn't even true; HTM could avoid mutexes.)


HTM?


Hardware transactional memory. TSX-NI in Intel's case.


They don't have to involve a mutex but some kind of synchronisation. A sequential lock is another way of efficiently implementing that, without blocking the happy read path. Instead of locking the large object, it detects if a read was atomic and retries if it was not.


Technically speaking, this is still not atomic, merely lock-free. Still very useful, but not semantically equivalent to the store or load of a 32 or 64 bit integer value (for example).


Semantically it is atomic, because if the seq lock is implemented as a separate abstraction, the caller never observes a partially written value (no tearing). Retrying a broken read is just an implementation detail.


Relaxed atomics translate to native loads and stores. In theory there is no cost to them apart from the compiler barrier which prevents other loads/stores from moving past them, depending on the case.

For larger objects, std::atomic uses a mutex, so it's kind of silly anyway. If you wanted a mutex, better to make it explicit.


> Relaxed atomics translate to native loads and stores

Native loads and stores with lock... and that is on x86. https://gcc.godbolt.org/z/asKna4 Definitely not the same as a native operation.

> In theory there is no cost to them

This isn't theory though.

> apart from the compiler barrier which prevents other loads/stores from moving past them, depending on the case

Why apart from? Why not include this obvious deficiency? It's a legitimate objection and a pretty important component to my point you can't just brush aside out of inconvenience.

> For larger objects, std::atomic uses a mutex, so it's kind of silly anyway. If you wanted a mutex, better to make it explicit.

Better in your eyes. Clearly not in the standard writers', or they wouldn't have introduced the functionality. My only point is if people do decide to use it (the reasons not being the subject of this debate), they come across this shortcoming.


If you can really guarantee that there are no other threads running, surely you can explain that guarantee to the compiler so that it can make use of it.


Doing so may likely require creating a separate code path for threads = 1 case, which is probably not worth the effort involved.


The difficulty with "benign" data races and atomic operations with weak ordering, is that they're not guaranteed to behave like the target architecture behaves! It's insufficient to say "this is fine on x86" even in an x86-only program!

That's because the optimizer symbolically executes the program according to the C memory model, not x86 memory model (as if it was some kind of Low-Level Virtual Machine…). The optimizer may reorder or remove memory accesses in ways that wouldn't happen on the target hardware.


> The optimizer may reorder or remove memory accesses in ways that wouldn't happen on the target hardware.

yes, but so far (C/C++2003) I always thought that something like mutex.lock() in the example, before accessing x, would be recognized as a memory barrier by the optimizer so it should be safe to access x as long as your thread acquired the lock.

Now for C++11 I'm not so sure anymore because of this last paragraph of the article:

> What’s more, since C++11 has well defined memory semantics, compiler writers are no longer forced to be conservative with their optimizations. If the programmer doesn’t specifically mark shared variables as atomic, the compiler is free to optimize code as if it were single-threaded.

Does this really mean that the above x must always be declared volatile even though it's surrounded by a mutex-lock? If so I hope there's a compiler flag do get the old behavior...


No, you shouldn't declare x as volatile. Also note that the article never even talked about volatile in C++; volatile in C++ is entirely different from volatile in Java.

Instead when you use a mutex, your mutex internally uses atomics. In fact the simplest kind of mutex can be built with atomics alone (spinlock), though a production-quality mutex will have to use other techniques such as a Linux futex.


A mutex lock will still be modeled as an aliasing read/write of all global memory. So you’re fine.


In this instance the shared variable isn't x, but the internal workings of the mutex.


That's because the optimizer symbolically executes the program according to the C memory model, not x86 memory model

...and I think that is a huge problem, because it strays from the spirit of the language in that it's supposed to be "closer to the hardware" than other HLLs.

Edit: care to give a counterargument?


Optimizers follow what the C specification says. When C says "this can be reordered", then they can reorder.

There are practical reasons why it's not "closer to the hardware":

• it would be harder to have a single compiler front-end for multiple CPU back-ends.

• what programmers have in mind for "closer to the hardware" is actually quite fuzzy, based roughly on how they imagine a naive C compiler would generate machine code. That interpretation doesn't have a spec. The things that seem obvious aren't that obvious in detail, or lead to unexpected performance cliffs (e.g. did you know that non-byte array indexing on 64-bit machines would be slow if signed int was defined to "simply" overflow?)


I'm not the one who downvoted you, but the counterargument is that such a target-specific approach prevents any reasonable effort to consistently define the semantics of any programming language. "On x86_amd_this_and_that processor the program means this, on arm_64_some_version it means that and on x86_intel_something again something else". This would not just be hell, it would be unworkable.

In other words, any approach other than an abstract machine model is doomed. Ideally this abstract machine is rather close to real world machines, but it can't be identical to all of them, nor is it reasonable to start special casing.


but the counterargument is that such a target-specific approach prevents any reasonable effort to consistently define the semantics of any programming language.

Why does there even need to be? There's already undefined behaviour to take into account the variance. Otherwise it's just making things worse for programmers who want to easily solve real problems on real hardware, not the absurd theoretical world of spherical cows.


Undefined isn't the same as implementation defined. "Undefined" means you're not even talking about C any more. "Implementation defined" means which of the allowable semantics the implementors happened to choose.


Aka the "C is a high level assembler" view.

C is not a low level language - https://queue.acm.org/detail.cfm?id=3212479 - argues that this hasn't been true in a long time, since C was ported from the PDP-11.

Attempts to model C as being "close to the hardware" are at best misguided and usually hopelessly wrong. It's only possible by delving into the "implementation defined" parts of the spec but even then implementation s are often very bad at precisely defining what their semantics actually are.


Exactly this. Undefined behaviour exists for the purpose of portability, not optimization; actively going out of it's way to abuse it is a compiler bug.


> It is now widely recognized that this attempt to define the semantics of data races has failed, and the Java memory model is broken (I’m citing Hans Boehm here).

This is a bit misleading. C++ also ended up defining the semantics of data races with memory_order_relaxed. In the standard they are not called "data races", but they correspond closely to what Boehm calls Java "giving some semantics to data races". C++ relaxed atomics also have the same issues with causality.

See Russ Cox's talk http://nil.csail.mit.edu/6.824/2016/notes/gomem.pdf And Viktor Vafeiadis' slides: https://people.mpi-sws.org/~viktor/slides/2016-01-debug.pdf

> Weak atomics give you the same kind of performance as benign data races but they have well defined semantics.

This is mostly true, but not at the limits. GCC, clang, MSCVC, and icc tend to be pretty conservative in how they treat relaxed atomics. For example, on x86 they tend not to generate the reg+mem form and instead generate an explicit `mov` instruction (https://gcc.godbolt.org/z/K135Mo). This requires extra resources for instruction fetch & decoding and uses an extra named register (potentially causing spills to the stack).

> Traditional “benign” races are likely to be broken by optimizing compilers or on tricky architectures.

This is true, but the sad thing is that the implementation of the C++ memory model is sometimes broken too. GCC had lots of real bugs up through about GCC 4.9 (I think...). And then there's http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p066... -- "Existing implementation schemes on Power and ARM are not correct" (2017)


Having been there, and having bitten often enough, I can confirm that most compilers just try to generate the most conservative code possible for volatile accesses. Their semantics are crazy/inconsistent, so you can't reason about them, and high performance code won't use them. Nobody will ever be interested in making them work/perform well.

C++ atomics in their various memory models are different in that they have well-defined semantics, used in high performance code and people do care. You're right that there have been bugs (and probably still are), but at least you can _argue_ that something is a bug if there is a coherent definition. Optimization of atomics, such as merging barriers, is still in its infancy.


The comment:

> if you have mutexes in your code you’re probably programming at too low a level

is quite disappointing to read


> something based on CSP

Gives the game away. Just another proselytising FP nut trying to justify all the time they've spent learning "universal" FP-abstractions (ie., highly specific hacks around lazyness).


It seems I deal with race conditions all the time in Swift (Fixed one bug just today), but can understand little of this. Every time I turn around, it turns out things are harder and more complex than I thought.


Fwiw "data race" != "race condition", confusing as that may be. A race condition is a blanket term for any nondeterministic bug resulting from differences in timing between independent, concurrent processes. A data race is a specific kind of race condition where reads/writes to the same value in machine memory step on each other's toes and lead to inconsistent behavior. The classic example is two threads that try to increment the same number but only one takes effect, because they both read the current value before either writes their new value (instead of one read/writing and then the other read/writing). I'm not really a C/++ programmer, though, so I only have a rough understanding of this topic. I don't know what a "benign data race" is, for example.


memes considered memeful


"Considered harmful" considered harmful: https://meyerweb.com/eric/comment/chech.html




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: