On x86 multiprocessors, if, without taking any special measures, you set x=1 from processor 0, and processor 1 reads x shortly, but not too shortly, thereafter, it will see 1. If you try that on, say, an 80-core ARM CPU, CPU #53 might not see a 1 for a long time, if ever. You're no longer guaranteed that cache changes propagate unasked. As we get more and more CPUs, the cost of creating the illusion that there's a single memory goes up. It's now important that compilers know what's shared so they can generate the proper fence and barrier instructions. Especially since the latest iteration of ARM has new features such as "non-temporal load and store."
What used to be a theoretical problem, or at most a problem for OS architects, has thus acquired teeth and claws that can bite ordinary applications programmers.
This is where traditional C++ mutexes start to break down. From a theoretical perspective, it's always been annoying that mutexes at the POSIX level have no tie to what they're supposed to be protecting. Hitting a mutex has to mean "flush everything". That's inefficient. Rust has the advantage that its locks own data, so the compiler knows what has to be flushed. That becomes more important as the number of CPU cores becomes very large.
I've said for years that the three issues in C/C++ are "how big is it", "who owns it", and "who locks it". C++ now at least has abstractions for all of these, although they all leak. You can still get raw pointers out to misuse, or, more likely, pass to some API for misuse.
Huh? Yes, you can: Caches in multi-core ARM systems are still coherent, so once a write operation reaches any level of the cache hierarchy (typically level 1) and updates that cache line, all other cores will see the updated value when they read from that address.
That's not what you need memory consistency operations, like load acquire and store release, for. You need those to make sure that the data from a write operation has actually reached the cache, because it might still linger in the store buffer, which sits between the execution unit and the cache.
> Rust has the advantage that its locks own data, so the compiler knows what has to be flushed.
That's absolutely not how Rust works. The Rust compiler has no notion of a mutex, only of atomic operations. Mutexes are purely library constructs – as shown by the parking_lot crate, which implements its own mutex independently of the standard library mutex.
However, the library knows about the semantics of the mutex, and it could explicitly instruct the processor to only flush certain memory regions from the store buffer when the mutex is unlocked. But then again, that's not Rust-specific – a C++ library could do the same.
You're confusing ARM-specific component labels with general-purpose technical terminology; the store buffer is a cache, and > you're no longer guaranteed that [its] changes propagate unasked.
That sounds quite reasonable, but it doesn't mean the store buffer isn't part of the cache hierarchy (in the non-ARM-specific-terminology sense), and it's also not what they said.
> You need [memory consistency operations] [...] because [a write operation] might still linger in the store buffer
What do you mean by this? ARM requires cache coherency for data caches, which requires that every store is eventually made visible to all cores. Fences are necessary to impose additional requirements on the order which writes to different addresses are made visible, but AFAIK not necessary to ensure that a single write is eventually visible.
So there are situations when every store may not be visible to every CPU core, but the ones that don't get the word shouldn't have access to that memory anyway. You can sort of see the architects of the CPU trying hard to prevent the cache system from choking on inter-CPU traffic, and trying to get more help from the software.
> This architecture assumes that all PEs that use the same operating system or hypervisor are in the same Inner Shareable shareability domain.
[PE = processing element - i.e. core]
i.e. if you are running a userspace program, all the cores are in the same shareability domain.
(I think your comments about LDAR/STLR are a response to 'adwn's sibling comment).
Eventual write propagation is a very basic liveness guarantee for systems with coherent caches. I'm pretty sure that the ARM architecture requires it for Observers with coherent caches, but even if not, it would be very surprising if a mainstream implementation didn't provide this guarantee.
Barrier instructions provide control of ordering relationships between accesses and when accesses complete, but I believe there is a guarantee that accesses eventually complete even without barriers.
(Also from the ARM ARM (B2.3.7), "complete" here means "visible to other cores": "Any read to the same Location by an Observer within the shareability domain will either Reads-from W1 or Reads-from a write that is Coherence-after W1.")
I'd be very interested to learn that ARM (or Ampere Altra) doesn't guarantee eventual write propagation, but that needs a better source than the one given.
On an out-of-order CPU, loads and stores go into a buffer which may process its entries in a different order. Acquire/release instructions like LDA and STL prevent certain reorderings.
What do you think about each CPU owning its own memory region and that to send values to other CPUs, one would have to do a send, either symbolically or physically, that could be implementation dependent.
Abstractions that aren't total and sound are insufficient and should be flagged by the compiler.
Total shareability slows down systems with large numbers of CPU cores. We're seeing the hardware features appear to allow for less shareability. It's not being used much yet. These have programming implications, and OSs and languages have to keep up. Probably more the OS for now, with groups of CPUs and areas of memory dedicated to each other. Can Linux cgroups be used that way?
This has to be invisible to the application programmer, somehow. Or we'll have another hard to program mess like the Cell.
The funny thing is, that pretty much is what store-release is (with load-acquire being the corresponding recv), anyway, just without any memory safety or sanity-checking.
Actually, I think that would be memory protection, not safety.
That is a great argument. Self-explanatory code makes it easier to understand the design goals and usage of that code.
> Traditional “benign” races are likely to be broken by optimizing compilers or on tricky architectures.
This is even better. One of the reasons it's so hard to move code from one architecture to another is that not only word sizes differ, but also small non-defined behaviors.
I had the pleasure of making C++ code compile for WebAssembly. The unexpected ways in which it breaks are challenging and interesting, as is how often the mental model we have as developers and the reality of the C++ specification differ in small ways.
My most absurd example was a test application that had to crash on the push of a button, to verify that all logs and recovery code were working. To crash the application, the code accessed a NULL pointer. Alas, WebAssembly allowed read access to address zero without problems. So the button did nothing when compiled to WebAssembly.
And that's because most programmers write C/C++ programs with the mental model of the actual machine they are writing the code for. That's not so strange; it's taught that way. It's even praised as a skill to look at the generated assembly if you want to understand how something is compiled.
But that approach is fundamentally broken. One needs to program against the abstract machine which defines the semantics of your programming language. Everything else leads to unportable and subtly broken code with hidden assumptions of how undefined behavior is resolved.
Sometimes you need to do this for performance reasons? Well, then you pay the price. It's expensive.
Good thing nothing in the C or C++ standards mandates NULL be address all-bits-zero regardless of how it's sometimes spelled.
Any compiler which assumes that memory access to address all-bits-zero is access to NULL is not compliant.
an integer literal with value zero, or a prvalue of type std::nullptr_t
> The C standard goes to great length to ensure that NULL (and for that matter, any representation of the null pointer, which is written in source code by converting a constant value 0 to a pointer) need not be represented in hardware by all 0 bits (that is, 'int p = 0;' or 'static int p;' is allowed to compile to assembly instructions that assign a non-zero bit pattern to the location named p). And in fact, there are historical machines where the most efficient null pointer in hardware was indeed a non-zero bit pattern: http://c-faq.com/null/machexamp.html
... and the site it links to has this:
> Q: Seriously, have any actual machines really used nonzero null pointers, or different representations for pointers to different types?
> A: The Prime 50 series used segment 07777, offset 0 for the null pointer, at least for PL/I. Later models used segment 0, offset 0 for null pointers in C, necessitating new instructions such as TCNP (Test C Null Pointer), evidently as a sop to [footnote] all the extant poorly-written C code which made incorrect assumptions. Older, word-addressed Prime machines were also notorious for requiring larger byte pointers (char 's) than word pointers (int 's).
> The Symbolics Lisp Machine, a tagged architecture, does not even have conventional numeric pointers; it uses the pair <NIL, 0> (basically a nonexistent <object, offset> handle) as a C null pointer.
Given that there's essentially no way to do atomic operations on larger objects (than machine-level types), all std::atomic<> for such types must involve a mutex, and are therefore guaranteed to have performance implications.
std::atomic<T> provides syntax that treats std::atomic<int> and std::atomic<MyHugeObject> in precisely the same way. This is useful, but entirely misleading if you care about performance.
For larger objects, std::atomic uses a mutex, so it's kind of silly anyway. If you wanted a mutex, better to make it explicit.
Native loads and stores with lock... and that is on x86. https://gcc.godbolt.org/z/asKna4 Definitely not the same as a native operation.
> In theory there is no cost to them
This isn't theory though.
> apart from the compiler barrier which prevents other loads/stores from moving past them, depending on the case
Why apart from? Why not include this obvious deficiency? It's a legitimate objection and a pretty important component to my point you can't just brush aside out of inconvenience.
> For larger objects, std::atomic uses a mutex, so it's kind of silly anyway. If you wanted a mutex, better to make it explicit.
Better in your eyes. Clearly not in the standard writers', or they wouldn't have introduced the functionality. My only point is if people do decide to use it (the reasons not being the subject of this debate), they come across this shortcoming.
That's because the optimizer symbolically executes the program according to the C memory model, not x86 memory model (as if it was some kind of Low-Level Virtual Machine…). The optimizer may reorder or remove memory accesses in ways that wouldn't happen on the target hardware.
Yes, but so far (C/C++03) I always thought that something like mutex.lock() in the example, before accessing x, would be recognized as a memory barrier by the optimizer, so it should be safe to access x as long as your thread acquired the lock.
Now for C++11 I'm not so sure anymore, because of this last paragraph:
> What's more, since C++11 has well defined memory semantics, compiler writers are no longer forced to be conservative with their optimizations. If the programmer doesn't specifically mark shared variables as atomic, the compiler is free to optimize code as if it [...]
Does this really mean that the above x must always be declared volatile even though it's surrounded by a mutex lock? If so, I hope there's a compiler flag to get the old behavior...
Instead when you use a mutex, your mutex internally uses atomics. In fact the simplest kind of mutex can be built with atomics alone (spinlock), though a production-quality mutex will have to use other techniques such as a Linux futex.
...and I think that is a huge problem, because it strays from the spirit of the language in that it's supposed to be "closer to the hardware" than other HLLs.
Edit: care to give a counterargument?
There are practical reasons why it's not "closer to the hardware":
• it would be harder to have a single compiler front-end for multiple CPU back-ends.
• what programmers have in mind for "closer to the hardware" is actually quite fuzzy, based roughly on how they imagine a naive C compiler would generate machine code. That interpretation doesn't have a spec. The things that seem obvious aren't that obvious in detail, or lead to unexpected performance cliffs (e.g. did you know that non-byte array indexing on 64-bit machines would be slow if signed int was defined to "simply" overflow?)
In other words, any approach other than an abstract machine model is doomed. Ideally this abstract machine is rather close to real world machines, but it can't be identical to all of them, nor is it reasonable to start special casing.
Why does there even need to be? There's already undefined behaviour to take into account the variance. Otherwise it's just making things worse for programmers who want to easily solve real problems on real hardware, not the absurd theoretical world of spherical cows.
C is not a low level language - https://queue.acm.org/detail.cfm?id=3212479 - argues that this hasn't been true in a long time, since C was ported from the PDP-11.
Attempts to model C as being "close to the hardware" are at best misguided and usually hopelessly wrong. It's only possible by delving into the "implementation defined" parts of the spec, but even then implementations are often very bad at precisely defining what their semantics actually are.
This is a bit misleading. C++ also ended up defining the semantics of data races with memory_order_relaxed. In the standard they are not called "data races", but they correspond closely to what Boehm calls Java "giving some semantics to data races". C++ relaxed atomics also have the same issues with causality.
See Russ Cox's talk http://nil.csail.mit.edu/6.824/2016/notes/gomem.pdf
And Viktor Vafeiadis' slides: https://people.mpi-sws.org/~viktor/slides/2016-01-debug.pdf
> Weak atomics give you the same kind of performance as benign data races but they have well defined semantics.
This is mostly true, but not at the limits. GCC, clang, MSVC, and icc tend to be pretty conservative in how they treat relaxed atomics. For example, on x86 they tend not to generate the reg+mem form and instead generate an explicit `mov` instruction (https://gcc.godbolt.org/z/K135Mo). This requires extra resources for instruction fetch & decoding and uses an extra named register (potentially causing spills to the stack).
This is true, but the sad thing is that the implementation of the C++ memory model is sometimes broken too. GCC had lots of real bugs up through about GCC 4.9 (I think...). And then there's http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p066... -- "Existing implementation schemes on Power and ARM are not correct" (2017)
C++ atomics in their various memory models are different in that they have well-defined semantics, used in high performance code and people do care. You're right that there have been bugs (and probably still are), but at least you can _argue_ that something is a bug if there is a coherent definition. Optimization of atomics, such as merging barriers, is still in its infancy.
> if you have mutexes in your code you’re probably programming at too low a level
is quite disappointing to read
Gives the game away. Just another proselytising FP nut trying to justify all the time they've spent learning "universal" FP-abstractions (i.e., highly specific hacks around laziness).