Making Sense of Acquire-Release Semantics (davekilian.com)
159 points by sph 14 days ago | 69 comments



Well written article, nice and to the point. Do recommend.

Decades ago I declared myself too stupid to use shared memory with threading; I have learned to avoid this whenever possible, or abstract away the memory access under a safe layer as soon as possible. One of the greatest decisions of my career.

Memory model semantics is one of the parts of systems programming that is generally poorly understood; I have had long discussions with senior programmers who have spent their careers carelessly threading their code without realizing what is happening under the hood. Not only did some of them not properly understand the acquire/release model, but they were not even aware of its existence.

For a more in-depth explanation, I recommend Sutter's excellent talk "atomic<> Weapons": https://www.youtube.com/watch?v=A8eCGOqgvH4. It is a two hour lecture, but it will be worth your time.


That atomics talk by Sutter is a superb, high-quality explanation, albeit long. He addresses the topic with exceptional rigor and plenty of examples that show what can and can't happen. It took me from 0 to 100 on thread safety.


Not being aware of the existence of acquire/release is strictly better than knowing about it but not understanding the details. One can have an excellent career and never use anything other than condvars and mutexes. But one can also really shit the bed if they misunderstand the semantics of atomics. I use atomics with explicit memory ordering only as a measure of last resort.

I agree.

Btw, shared memory by itself is fine, even shared memory with threading. The really dangerous ingredient is mutability. And mutability is already toxic waste on its own, and concurrency amplifies the danger.


> Decades ago I declared myself too stupid to use shared memory with threading; I have learned to avoid this whenever possible, or abstract away the memory access under a safe layer as soon as possible. One of the greatest decisions of my career.

This probably qualifies you to do shared memory threading, when it is needed. Such as debugging those abstraction layers. Knowing you don't understand it puts you in the right frame of mind to carefully do the work.


I can highly recommend Herb's talk as well!


> See how we added a new fence() call to put()? That fixes the reordering problem we’ve described at length. Now if the CPU gets bored waiting for entries[i] to be read into the cache, and tries to pull up the tail++ line so it happens sooner, bonk!, the tail++ line hits that fence and stops moving. We’ve forced the entry to be written before the tail is bumped. Problem solved!

I may be completely wrong, it's a complicated subject, but I think the wording here is potentially misleading.

As I understand it (and I may be wrong!), a full fence does the following: "all reads and all writes prior to the fence must occur before any reads or writes after the fence".

What I'm concerned about is people thinking this is a blocking behaviour, i.e. when we hit the fence, then at that point all prior reads and writes occur.

This is not the case.

The reads and writes prior to the fence can occur at any time - they could occur LONG AFTER we pass the fence - but what we do get from the fence is that the reads and writes prior to the fence WILL occur BEFORE any reads or writes after the fence.

So, to put it another way, the code is executing, we come to the fence - and absolutely nothing happens. The processor just keeps going. No reads occur, no writes occur.

Then at some point later on, both in time and in code, the processor comes to another (say) read. NOW, finally, the processor MUST complete all reads and writes which were issued prior to the fence.
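
To put it in rough, illustrative code (variable names are mine, and this is only a sketch of the idea, not the article's code):

    #include <atomic>

    std::atomic<int> data{0};     // hypothetical names
    std::atomic<int> ready{0};

    void writer() {
        data.store(42, std::memory_order_relaxed);            // access before the fence
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence: no access may cross it
        ready.store(1, std::memory_order_relaxed);            // access after the fence
        // The fence itself performs no memory access and does not block.
        // The data store may still complete long after writer() returns; the only
        // guarantee is that it cannot become visible later than the ready store.
    }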


A thread which performs a write, then a fence, then another write is guaranteed, thanks to the fence, that the first write has completed by the time of the second write.

However, "completed" is misleading.

The write will still not be seen by readers.

"Complete" really means "the writer has done all the work he can do, which is necessary but insufficient".

For a reader to see the write, the readers must issue a read memory barrier before reading the value.

All of this is due to store buffers and cache invalidation requests.

In short, in principle, a write is known about only by the processor it occurred upon; if there are any other processors in the system, then unless special magic is performed (fences and so on), any writes they see from other processors are seen purely by chance and can happen in any order (including going backwards in time), or not happen at all.

I once read an article which framed all this in terms of source control.

Memory is like SVN, or Git.

You make local changes (on your processor, which has its local copy of memory). You then commit (write fence/atomic operation). No one else can see your changes until they update from source control and get the latest version (read fence).


> You make local changes (on your processor, which has its local copy of memory). You then commit (write fence/atomic operation). No one else can see your changes until they update from source control and get the latest version (read fence).

That's actually not really true; what you are describing is a non-coherent system.

On a cc system, once a store is no longer speculative, it is guaranteed to be flushed out of the store buffer into the cache, and a load that reaches the cache layer is guaranteed to see the last version of the data that was stored in that memory location by any coherent agent in the system.

As pointed out elsethread, you need load barriers specifically to order your loads, so they are needed for ordering, not visibility.

The way that barriers contribute to visibility (and the reason that they need to be paired) is by giving conditional guarantees: T1: S.1; #StoreStore; S.2 and T2: L.2; #LoadLoad; L.1. If T2 observes from its first load at memory location .2 the value stored by S.2, then it is guaranteed that L.1 will observe the value stored by S.1. So cache coherency plus purely local load and store barriers give you the global Acquire/Release model.
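
In C++ terms, that conditional guarantee looks roughly like this (a sketch using release/acquire fences as stand-ins for #StoreStore/#LoadLoad; the names are invented):

    #include <atomic>

    std::atomic<int> loc1{0}, loc2{0};   // the ".1" and ".2" locations above

    void t1() {                                    // T1: S.1; #StoreStore; S.2
        loc1.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);  // at least a store-store barrier
        loc2.store(1, std::memory_order_relaxed);
    }

    void t2() {                                    // T2: L.2; #LoadLoad; L.1
        int b = loc2.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);  // at least a load-load barrier
        int a = loc1.load(std::memory_order_relaxed);
        // If b == 1 (t2 observed S.2), then a == 1 is guaranteed (L.1 observes S.1).
        (void)a; (void)b;
    }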


> On a cc system, once a store is no longer speculative, it is guaranteed to be flushed out of the store buffer into the cache,

That's the thing - I may be wrong, but I'm under the impression store buffers do not guarantee anything. I'm pretty certain they do not on Intel, for example. All you read in the docs is "flushed in a reasonable time", which can mean anything.

> and a load that reaches the cache layer is guaranteed to see the last version of the data that was stored in that memory location by any coherent agent in the system.

Yes.

> As pointed out elsethread, you need load barriers specifically to order your loads, so they are needed for ordering, not visibility.

Mmm - again, I may be wrong; but I think there are also no guarantees about how cache invalidation requests are handled. Given no guarantee, then in theory an invalidation request is never handled (unless you force it to be with a read barrier).

When I need a set of data to be written (say a struct - something bigger than you can manage in an atomic op), I'll write to the struct, fence, then perform a throw-away atomic op. The atomic op forces a write (a normal write could just be deferred and not force completion of pre-barrier writes) and then I know my struct has gone out past the store buffer and has reached cache control.
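
A rough C++ sketch of that pattern (the struct and names are invented, and this only shows the producer side; readers still need their own synchronisation as discussed above):

    #include <atomic>

    struct Entry { int a, b, c; };        // bigger than a single atomic op can cover

    Entry shared;                         // plain, non-atomic payload
    std::atomic<unsigned> dummy{0};       // target of the throw-away atomic op

    void publish(const Entry& e) {
        shared = e;                                           // plain writes to the struct
        std::atomic_thread_fence(std::memory_order_seq_cst);  // fence
        dummy.fetch_add(1, std::memory_order_seq_cst);        // throw-away atomic RMW;
        // on x86 this is a LOCK-prefixed op, which (per the reasoning above) cannot
        // complete until the earlier stores have drained past the store buffer.
    }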


The CPU still needs to guarantee forward progress. The store buffer is always flushed as fast as possible (i.e. as soon as the CPU can acquire the cacheline in exclusive mode, which is guaranteed to happen in a finite time). I can't point you to the exact wording in the Intel docs as they are quite messy, but you can implement a perfectly C++11-compliant SPSC queue purely with loads and stores on x86 without any fences or #LOCK operations.

What fences (and atomic RMWs) do on Intel is to act as synchronization, preventing subsequent reads from completing before any store preceding the fence. This was originally implemented simply by stalling the pipeline, but these days I suspect the loads have an implicit dependency on a dummy store on the store buffer representing the fence (possibly reusing the alias detection machinery?).


> I can't point you to the exact wording in the Intel docs as they are quite messy, but you can implement a perfectly C++11-compliant SPSC queue purely with loads and stores on x86 without any fences or #LOCK operations.

I would disagree. I think the Intel docs do not specify a guarantee of flushing, and so if the SP and SC are on different cores, I think then in principle (but not in practice) the SC could in fact never see what the SP emits.


You do not need a load barrier to observe stores from other cores. You need the load barrier so your own core does not reorder loads; otherwise you could see wacky loads even if the stores are totally ordered.

As an example, suppose we have totally ordered stores, but loads can be reordered freely. Suppose X is 0 and Y is 0, then we store 1 to X and then store 1 to Y. Therefore, Y can only be 1 after X is 1. Now say a reader loads Y and then loads X, in that program order. If the loads can be reordered, the core can load X before either store and load Y after both stores, observing Y is 1 and X is 0 even though the store ordering guarantees Y is 1 only after X is 1.
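
A sketch of that scenario (relaxed atomics, so nothing constrains the loads; pretend, as under TSO, that the two stores stay in order - the point is purely about the loads):

    #include <atomic>

    std::atomic<int> X{0}, Y{0};

    void writer() {
        X.store(1, std::memory_order_relaxed);   // assume these two stores
        Y.store(1, std::memory_order_relaxed);   // become visible in this order
    }

    void reader() {
        int y = Y.load(std::memory_order_relaxed);   // program order: Y first...
        int x = X.load(std::memory_order_relaxed);   // ...then X
        // With in-order loads, y == 1 implies x == 1.
        // If the loads may be reordered, the core can read X before either store
        // and Y after both, so y == 1 && x == 0 becomes observable.
        (void)x; (void)y;
    }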


> You do not need a load barrier to observe stores from other cores. You need the load barrier so your own core does not reorder loads otherwise you could see wacky loads even if the stores are totally ordered.

Yes. I mean, you need the load barrier to see stores safely. Also, in principle, since AFAIK there are no guarantees about cache invalidation request behaviour, in theory an invalidation request could go unhandled forever - in which case you would in fact literally never see the writes from other cores (until you issued a read barrier).


That's more correct, but there are two points of order.

The first is that a cpu core is a massively distributed system, and there is no single point in time when an operation is executed.

The second is that cpus will absolutely physically do reads out of order even when those reads logically have to happen in order. Suppose you read X, then fence, then read Y; the cpu may read Y immediately, immediately begin speculatively executing dataflows depending on Y, then later on read X. But then if, between the time when it read Y and it read X, Y changed, then it will roll back any intermediate computation dependent on Y and try again. But if Y didn't change between the time when it was initially read and the time when X was read, then it's semantically the same as if Y was read after X, so there is no problem.


> The first is that a cpu core is a massively distributed system, and there is no single point in time when an operation is executed.

this is critical: there might be hundreds of cycles between an instruction entering the pipeline and exiting it, so you have to be precise about what "ordering" means. Typically the only thing that matters is visible side effects, i.e. access to the coherence layer.


The article says a lot about CPUs reordering memory instructions - is this actually the main cause of the issues with the shown code?

Speaking of x86 specifically, instructions are aggressively reordered but as far as I understood it, the results are generally committed in order. The ISA has rules about what memory reorderings can occur, and it looks like most reorderings are forbidden. This Stack Overflow answer explains it well: https://stackoverflow.com/a/50310563/7064452

In this queue example I'd say the elephant in the room is the compiler, not the CPU. The code has no language-level synchronisation, therefore the compiler is free to do whatever - it's not surprising memory ops get reordered. If you want to be pedantic, the code is UB with no synchronisation present. Perhaps beside the point to discuss CPU behaviour in this light.


All CPUs commit in order and take exceptions precisely, because most other options are insane, or would drive you to it. However: single-thread commit order =/= observability order.

Observability order of memory operations --- which are the only operations that matter --- is governed by the memory consistency model of the architecture. x86 has what's generally referred to as strong ordering on memory operations.

On x86, part of it means that stores from the same core cannot be observed out of order from each other, nor can loads.

So assuming the compiler does not move the `tail++` up, or move the assignment out of the if-statement (both of which are achieved by marking them `volatile`), the code should actually work on x86. The `tail++` change cannot be observed before the write to the queue and the reading from the queue cannot be observed before the reading of the `tail` and `head` variables.

On RISC-V and Arm, you need more as they have substantially weaker memory consistency. The RISC-V specs have some examples of interesting outcomes you can have. Some of it involves time-travel.

But in the end: yes the reordering done by the CPU is the issue. The compiler can and does reorder stuff when it thinks that it'll unlock more instruction-level parallelism, but no amount of volatile is going to make that queue universally usable on RISC-V. No matter what the compiler does. Even perfectly preserving the single-thread semantics of the code, not reordering a single instruction, the CPU can move stuff around in terms of observability. The alternative is that the compiler inserts a barrier/fence after every instruction.

There are trade-offs. Poorly written code for x86 can absolutely tank performance because of ordering violations requiring code to be replayed, though that is sometimes a problem in even weaker consistency models as well.
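
For the record, a minimal sketch of what the article's queue looks like once head/tail are proper acquire/release atomics (the shape and names here are approximate, not the article's exact code):

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>
    struct SpscQueue {
        T entries[N];
        std::atomic<std::size_t> head{0}, tail{0};   // head: consumer, tail: producer

        bool put(const T& v) {                        // called only by the producer
            std::size_t t = tail.load(std::memory_order_relaxed);
            std::size_t next = (t + 1) % N;
            if (next == head.load(std::memory_order_acquire))
                return false;                         // full
            entries[t] = v;                           // write the entry first...
            tail.store(next, std::memory_order_release);  // ...then publish it
            return true;
        }

        bool get(T& out) {                            // called only by the consumer
            std::size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;                         // empty
            out = entries[h];                         // entry write happens-before this read
            head.store((h + 1) % N, std::memory_order_release);
            return true;
        }
    };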


Valid points, although I have another perspective on this bit:

> But in the end: yes the reordering done by the CPU is the issue

I think from a programmer perspective, the CPU side of things is mostly beside the point (unless you're writing assembly), and this contributes to the misunderstanding and air of mystery surrounding thread safety.

At the end of the day the CPU can do anything, really. I'd argue this doesn't matter because the compiler is generating machine code, not us. What does matter is the contract between us and the compiler / language spec. Without language-level synchronisation the code is not valid C/C++ and we will likely observe unexpected behaviour - either due to CPU reordering or compiler optimisations, doesn't matter.

I think the article somewhat misses the point by presenting the case as if the compiler were not part of the equation. It often seems like people think they know how to do thread safety because they know, e.g., what reorderings the CPU may do. "Just need to add volatile here and we're good!" (probably wrong). In reality they need to understand how the language models concurrency.

We could translate that queue code into another language with a different concurrency model - e.g. Python - and now the behaviour is different despite the CPU doing the same fundamental reorderings.


This is true but in practice it's pretty common to find this sort of code seems to work fine on x64 because the compiler doesn't actually reorder things and then sometimes blows up on ARM (or PowerPC, though that's less commonly encountered in the wild these days).

The searchable keyword to look for here (re x86) is "Total Store Ordering" (TSO).

(I'm not gonna try to summarize it here because I'd probably get it ever so subtly wrong…)

x86 has a very strong memory model, meaning it is a very poor test platform. Last time I touched atomics I used PowerPC (e500mc & e6500) to test, which was a good thing as it did point me at a problem. Not sure where current ARM falls on this. The uncontested, notorious king of weak memory ordering is DEC Alpha, but these are a bit hard to run these days. If you want to go truly insane, look at DEC Alpha consume dependency (non-)ordering :)


Yes, having worked on one of the out-of-order Intel CPU's, I can tell you that you are correct. Instructions may be "complete", as in their results can be forwarded to later operations, but the instruction isn't "retired" until it is known that it can not raise an exception, or be cancelled because of branch mis-predict, etc. Programmer-visible architectural state as defined in the ISA is not written until instruction retirement. CPU re-ordering instructions is not going to change semantics (in X86 and similar architectures... there are some archs that relax that guarantee).

Compilers are notorious for doing dumb things around locks.... the gnu C for AVR architecture, for instance, looks at the SEI instruction (SEt Interrupt mask bit) and notices that it doesn't modify memory or registers, so hoists it to the top of functions. Eh.. No, SEI; CLI; <code> <critical section> <code> is not what I intended...

Also... CPU's with data caches can do smart things with architecturally-defined locking instructions such as 'test-and-set' or 'compare-and-exchange' such that the instructions are always cache-coherent across CPU's. If you try to roll-your-own locking code, you had best understand how the cache invalidation mechanism works in your chosen CPU or you are going to have a bad day.


> CPU's with data caches can do smart things with architecturally-defined locking instructions such as 'test-and-set' or 'compare-and-exchange' such that the instructions are always cache-coherent across CPU's. If you try to roll-your-own locking code, you had best understand how the cache invalidation mechanism works in your chosen CPU or you are going to have a bad day.

What do you mean? Are you implying that read-modify-writes are treated differently from plain writes by the cache coherency protocol?


I'm saying that an atomic RMW is going to get the cache line in "exclusive" state, (in typical MESI protocol) but that if you are trying to gin up the equivalent with spin locks you need to think through how that plays out, as the reads might be in "shared" state.

I still don't see what you're getting at. What is the implication of this for software? The implementation of the cache coherency protocol is largely opaque to software.

The article does mention in the end that a trivial translation of the queue code to x86 will work.

It is broken on other architectures though (aside from the obvious UB because of races).


For a more in-depth understanding, I recommend reading "Rust Atomics and Locks": https://marabos.nl/atomics/ It uses Rust for examples, but it is more or less applicable to C/C++ as well.


This bit isn't quite correct:

> Acquire and release semantics don’t have any meaning on Intel- and AMD-brand processors. On x86 CPUs, which is to say, on just about any Intel- or AMD-brand CPU you can buy right now, memory operations on a single CPU core happen in program order.

In fact x86 CPUs do allow themselves to reorder reads around other reads[1]. The rule is that no memory access is allowed to cross a write operation[2]. The distinction isn't important to traditional critical section analysis like the article is doing, but there are lockless algorithms out there that depend on fully-ordered reads. Dekker's famous (but mostly useless in practice) algorithm for mutual exclusion is one.

[1] Note here I'm using the more traditional and frankly much clearer terminology about actual hardware behavior and not the frustrating abstractions embraced by the language design community.

[2] Or one of a handful of "serializing instructions", the most commonly relied on being LOCK CMPXCHG


> x86 CPUs do allow themselves to reorder reads around other reads. The rule is that no memory access is allowed to cross a write operation

That's not correct. Intel manual, vol 3, sec. 9.2.2:

> Reads are not reordered with other reads

> Writes are not reordered with older reads.

> Writes to memory are not reordered with other writes

A read may be reordered w.r.t. an older write (and hence tfa is incorrect that po=ppo on x86), but reads are not ordered with other reads. You can see in the c/c++ processor mappings here https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html that load acquire/store release can be mapped to plain loads and stores on x86.


Oops, indeed. Must be a bit flip error in my memory.


> more traditional and frankly much clearer terminology about actual hardware behavior and not the frustrating abstractions embraced by the language design community

Axiomatic memory models were pushed as much from the hardware world as from the software world if not more so, and they exist for a reason. Overfitting obligatorily abstract models to the behaviour of a particular microarchitecture benefits neither hardware nor software writers


> Overfitting obligatorily abstract models to the behaviour of a particular microarchitecture benefits neither hardware nor software writers

I'm too dumb to know what the first clause means. But the second is silly: understanding how to control the ordering of the actual memory operations (i.e. the mapping to actual behavior as seen on a bus/interconnect somewhere) can only benefit hardware writers (because it tells them how to implement and explain what's happening) and software writers (because it's simple).

You don't see articles like TFA about read and write barriers, because they're obvious. Acquire/release is abstract egghead stuff, which means it's almost certainly a bad API.


What moonchild meant is that higher-level semantics are necessary because directly describing what the hardware is doing won't work because almost every model of CPU does something slightly different.

Your argument that the higher-level terminology the industry settled on is confusing is valid (albeit subjective).

> Acquire/release is abstract egghead stuff

The article explains why this functionality is named that way, and why it's necessary. It's even kind of convenient, because in one line of standardized code you tell the compiler what you want, and it takes care of all the ugly platform-specific gunk, guaranteed!


> What moonchild meant

To spell it out: I know very well what moonchild meant, and am no stranger to lockless algorithm design or memory ordering semantics. It was a turn of phrase riffing on the point that "memory ordering semantics are APIs" and thus should have been designed with an eye toward clarity and comprehension for the working programmers who need to use them.

Acquire/release was intended to make the job of a compiler author easier, by trying to home in on the differences between hardware that might be reasonably abstracted as a single API. And I'm saying that's a bad trade, because the compiler nerds are doing just fine and what the dumb jocks in the trenches want is read and write barriers.


I think it is just a generational difference. I learned lock-free programming during the standardization of the C++0x memory model and I do find acq/rel a simpler model to understand and analyse algorithms, while thinking in term of reorderings never clicked for me.


Indeed, program ordering is too strong. x86 is TSO, which, as mentioned elsethread, allows for Store/Load reordering (but not Load/Load)

Dekker per-se is not terribly useful, but once you know that pattern (write to one variable and check a separate variable to make a decision) you start to recognise it in many lock-free algorithms that try to be clever with plain stores and loads.
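
A sketch of that pattern (flag names invented): each thread writes its own flag and then reads the other's.

    #include <atomic>

    std::atomic<int> flag0{0}, flag1{0};

    bool thread0_try_enter() {
        flag0.store(1, std::memory_order_seq_cst);            // write my flag...
        return flag1.load(std::memory_order_seq_cst) == 0;    // ...then check the other's
    }

    bool thread1_try_enter() {                                 // symmetric
        flag1.store(1, std::memory_order_seq_cst);
        return flag0.load(std::memory_order_seq_cst) == 0;
    }

    // With seq_cst, at most one of the two calls can return true. With only
    // release/acquire (or plain x86 loads and stores and no fence), the load can
    // be satisfied before the store is visible - Store/Load reordering - and both
    // threads may see 0 and "enter" at the same time.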


The first explanation that really clarified memory barriers for me was from Paul E. McKenney:

http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...


That one plus the kernel notes at https://www.kernel.org/doc/Documentation/memory-barriers.txt did the trick for me.


No matter how many times I read about these, I'm always left just slightly confused. Acquire/release is about the ordering of instructions within a thread of execution. So if I have a writer writing X, then writing Y, then I need write-release to make sure that the compiler actually puts the instructions for Y after the instructions for X in the machine code. If I want to guarantee that the results of those writes are visible to another thread, then I need a memory fence to force flushing of the caches out to main memory basically?

The author mentions the queue as originally written is technically correct for x86/x86_64. This is true only in the sense that neither the producer or consumer can experience partial reads / partial writes, right? It is still possible that if, say, the consumer were busy waiting for an item to be available then it will spin for as long as it takes for the CPU to decide to flush the producer's writes out to main memory?

Whenever I see these concepts discussed, it is in the context of the C++ stdatomic library. If I were writing a program in C99, I would assume it would still be possible to communicate the same intent / restrictions to the compiler, but I'm not sure because I haven't been able to find any resources that discuss doing so. How might one communicate that to the compiler, assuming they are on x86/x86_64 where in theory the CPU should just do the right thing with the right machine code?

Finally, does target architecture influence the compiler's behavior in this regard at all? For example, if we take x86/x86_64 as having acquire/release semantics without any further work, does telling the compiler that my target architecture is x86/x86_64 imply that those semantics should be used throughout the program?


I just wanted to clarify something about flushing caches: fences do not flush the caches in any way. Inside the CPU there is a data structure called the load store queue. It keeps track of pending loads and stores, of which there could be many. This is done so that the processor can run ahead and request things from the caches, or have them populated into the caches, without having to stop dead the moment it has to wait for any one access. The memory fencing influences how entries in the load store queue are allowed to provide values to the rest of the CPU execution units. On weakly ordered processors like ARM, the load store queue is allowed to forward values to the execution pipelines as soon as they are available from the caches, except if a store and load are to the same address. x86 only allows values to go from loads to the pipeline in program order. It can start operations early, but if it detects that a store comes in for a load that's not the oldest, it has to throw away the work done based on the speculated load.

Stores are a little special in that the CPU can declare a store as complete without actually writing data to the cache system. So the stores go into a store buffer while the target cache line is still being acquired. Loads have to check the store buffer. On x86 the store buffer releases values to the cache in order, and on ARM the store buffer drains in any order. However both CPU architectures allow loads to read values from the store buffer without them being in the cache and without the normal load queue ordering. They also allow loads to occur to different addresses before stores. So on x86 a store followed by a load can execute as the load first then the store.

Fences logically force the store buffer to flush and the load queue to resolve values from the cache. So everything before the fence is in the caching subsystem, where standard coherency ensures they're visible when requested. Then new operations start filling the load store queue, but they are known to be later than operations before the fence.


That clarifies fences for me a little bit more. Thanks for the insight.


If you program without fences, instructions are reordered by the compiler and/or the processor as they see fit provided the single threaded semantics are unchanged. This is why alias analysis is a big deal. A store can be moved before a load if they are independent, i.e. do not alias. A store followed by a load can be simplified to use the value already in a register if there is no other store in the meantime.

This doesn't work if there are multiple threads and shared mutable state. Whatever semantics the programmer had in mind, and encoded in load/store patterns, are usually insufficient for correctness under arbitrary interleaving of threads.

This is fixed by introducing additional constraints on which instructions can be reordered. Fences affect loads, stores, both. Usually with respect to all memory but potentially only a subset of it. The point is to say that moving some operation past some other one will cause unacceptable behaviour for this program, so neither compiler nor CPU shall do so.

On top of this there's a C++ memory order model, where you can tag an integer add with acq_rel semantics, or specify fences, all reasoned in terms of synchronises with and a global oracle determining acceptable execution sequences. I think this is grossly over complicated and heavily obfuscates the programming model to no gain. Fortunately one can mechanically desugar it into the fences and reason with the result.
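
For example, a sketch of that desugaring for a release store and an acquire load (the fence forms are at least as strong as the operation forms, so this is safe even if not exactly equivalent):

    #include <atomic>

    std::atomic<int> x{0};

    void store_release(int v) {
        // roughly: x.store(v, std::memory_order_release)
        std::atomic_thread_fence(std::memory_order_release);
        x.store(v, std::memory_order_relaxed);
    }

    int load_acquire() {
        // roughly: return x.load(std::memory_order_acquire)
        int v = x.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        return v;
    }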


Would it be correct to say that acquire-release is in some sense "higher level" than memory fences, with acq-rel implemented in terms of fences (and restrictions on code-gen?) and fences being all that the CPU actually knows about at least in the case of x86_64?


Acquire/release is higher level because it is more of a theoretical description than a hardware description. But acq/rel is not implemented using fences on x86. On this arch all loads and stores have implicit acquire and release semantics respectively, while all RMWs are sequentially consistent acq+rel. The compiler will need to emit explicit fences very rarely on x86.


Okay, that makes sense to me. Thank you.


> If I were writing a program in C99, I would assume it would still be possible to communicate the same intent / restrictions to the compiler, but I'm not sure because I haven't been able to find any resources that discuss doing so

You cannot. See boehm, 'threads cannot be implemented as a library' (https://web.archive.org/web/20240118063106if_/https://www.hp...). You can do that in c11, however, which includes functionally the same facilities as c++11 (howbeit c-flavoured, obviously). https://en.cppreference.com/w/c/atomic

> does target architecture influence the compiler's behavior in this regard at all? For example, if we take x86/x86_64 as having acquire/release semantics without any further work, does telling the compiler that my target architecture is x86/x86_64 imply that those semantics should be used throughout the program?

It does not imply that. You should completely ignore the target architecture. C11 provides an abstract interface and a set of abstract constraints for concurrent programs; you should program against that interface, ensuring that your code is correct under those constraints. The compiler is responsible for making sure that the constraints are satisfied on whatever target you happen to run (so your code will be portable!).

> if I have a writer writing X, then writing Y, then I need write-release to make sure that the compiler actually puts the instructions for Y after the instructions for X in the machine code

You need Y to be a write-release if you would like it to be the case that, if another thread acquire-reads and observes Y, then it will also observe X. (The classic example is 'message passing', where X is a message and Y is a flag saying that there is a message. Obviously, it would be bad if you could see the flag but not actually the message.) But maybe you don't need that property.
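
A minimal sketch of that message-passing pattern (variable names are mine):

    #include <atomic>

    int message = 0;                        // X: the payload, plain data
    std::atomic<bool> has_message{false};   // Y: the flag

    void sender() {
        message = 42;                                         // write X
        has_message.store(true, std::memory_order_release);   // write-release Y
    }

    void receiver() {
        if (has_message.load(std::memory_order_acquire)) {    // acquire-read Y
            // The acquire load observed the release store, so the write to
            // message is guaranteed to be visible here.
            int m = message;
            (void)m;
        }
    }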

> If I want to guarantee that the results of those writes are visible to another thread, then I need a memory fence to force flushing of the caches out to main memory basically?

No. That's not what a fence does and that's not how caches work.


Got it. Thank you for the threads-as-libraries link; very interesting.


And yet people did threaded programming in C before C11. Granted, you cannot do it in plain C99 -- in practice extensions were used. The hard part wasn't getting the fence instructions (asm will do it) but getting the compiler to not re-order things around fences, and `asm volatile ("" ::: "memory")` (or similar) would do that.
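
Roughly what that looked like, as a sketch of the old GCC/x86 idiom (GCC-specific, shown only to illustrate the point, not a recommendation today):

    // Compiler-only barrier: stops the compiler from moving memory ops across it.
    #define COMPILER_BARRIER() asm volatile("" ::: "memory")

    // Full hardware fence on x86 (the "memory" clobber also acts as a compiler barrier).
    #define FULL_FENCE() asm volatile("mfence" ::: "memory")

    int payload;
    volatile int ready;    // volatile so the compiler doesn't cache or move accesses

    void old_school_publish(int v) {
        payload = v;
        FULL_FENCE();      // heavier than strictly needed on x86, but orders the
        ready = 1;         // payload store before the flag store
    }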


This is essentially what I was getting at. Thank you.


> Acquire/release is about the ordering of instructions within a thread of execution

acquire/release is about visibility. Acquire and release always go in pairs. You can't really reason purely about a release and an acquire in isolation, and that's why simply thinking about instruction reordering is not enough.

> So if I have a writer writing X, then writing Y, then I need write-release to make sure that the compiler actually puts the instructions for Y after the instructions for X in the machine code. If I want to guarantee that the results of those writes are visible to another thread, then I need a memory fence to force flushing of the caches out to main memory basically

Whether an explicit memory fence is needed or not depends on the architecture (for example you do not need them on x86). But you do not need to care: if you use the atomic operation with the correct semantics, the compiler will insert any required fence for you.

As an aside, typically fences have nothing to do with caches. Once a store or a load operation hits the cache, the coherence system takes care that everything works correctly. If fences had to flush the cache, they would be orders of magnitude slower.

Instead, fences (explicit or otherwise) make sure either that memory operations commit (i.e. are visible at the cache layer) in the expected order, or that an application can't tell otherwise; i.e. reordering is still permitted across fences as long as conflicts can be detected and repaired. Typically this can only happen for loads that can be retried without side effects.

> Whenever I see these concepts discussed, it is in the context of the C++ stdatomic library. If I were writing a program in C99, I would assume it would still be possible to communicate the same intent / restrictions to the compiler

Formally, multithreaded programs in C99 are UB. Of course other standards (POSIX, OpenMP) and implementations (the old GCC __sync builtins) could give additional guarantees; but only C11 gave a model defined well enough to reason in depth about the overall CPU+compiler system; before that people just had to make a lot of assumptions.

> Finally, does target architecture influence the compiler's behavior in this regard at all? For example, if we take x86/x86_64 as having acquire/release semantics without any further work, does telling the compiler that my target architecture is x86/x86_64 imply that those semantics should be used throughout the program?

It does, but note that the compiler will only respect acquire/release semantics for operations on atomic objects with the required ordering, not for normal loads and stores.


Minor quibble: you can linearize small amounts of memory using atomic access. You just need to ensure that your memory fits within the size of a single atomic access. For example, storing two uint32 as a uint64 when there is atomic access to uint64 available.


I am curious to understand how the following is achieved? Is there a material on this?

"storing two uint32 as a uint64"


Put them next to each other, 8-byte align the first one, use a compiler mechanism to disable alias analysis, do the uint64 store. __attribute__((may_alias)) is the local override, -fno-strict-aliasing the global one.

I think C++ can now do "these bytes are now that type", called something like start_lifetime_as. C probably can't, though using a union might be legitimate. The language rules in this area are a mess.


There's no need to flirt with undefined behaviour and non-standard compiler flags. Just convert both uint32_t values to uint64_t type, then combine them into a single uint64_t value using bitwise shift then bitwise inclusive OR.

Rob Pike has blogged about this kind of thing. [0]

Perhaps also of interest: both C and C++ provide a (portable and standard) means of determining whether atomic operations on uint64_t are assured to be lock-free. [1][2] (Assuming of course that the uint64_t type exists - it's in the standard but it's optional.)

[0] https://commandcenter.blogspot.com/2012/04/byte-order-fallac... ( discussion: https://news.ycombinator.com/item?id=3796378 )

[1] https://en.cppreference.com/w/c/atomic/atomic_is_lock_free

[2] https://en.cppreference.com/w/cpp/atomic/atomic_is_lock_free


If you do the loads as uint32, you lose the single atomic operation on two different values, which was the whole point of this exercise.

Using a single uint64 as the memory type works, but you no longer have two differently named fields and have to pack/unpack them by hand.

There's no ub if you use the compiler extension, just totally clear code that does the right thing


> If you do the loads as uint32, you lose the single atomic operations on two different values which was the whole point of this exercise.

There's no need for any flirting with undefined behaviour through type-punning.

When doing the atomic write, you prepare the uint64_t value to write by using bitwise operations, and then perform the atomic write of the resultant uint64_t value.

When doing the atomic read, you atomically read the uint64_t value, then use bitwise operations to unpack the original pair of uint32_t values.

Put differently, writing is done by pack-then-atomically-write, and reading is done by atomically-read-then-unpack.
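
A rough sketch of the pack-then-write / read-then-unpack approach (names are mine):

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> pair{0};   // one 64-bit location holding two 32-bit values

    void store_both(std::uint32_t lo, std::uint32_t hi) {
        std::uint64_t packed = (static_cast<std::uint64_t>(hi) << 32) | lo;  // pack...
        pair.store(packed, std::memory_order_release);        // ...then one atomic write
    }

    void load_both(std::uint32_t& lo, std::uint32_t& hi) {
        std::uint64_t packed = pair.load(std::memory_order_acquire);  // one atomic read...
        lo = static_cast<std::uint32_t>(packed);                      // ...then unpack
        hi = static_cast<std::uint32_t>(packed >> 32);
    }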

Turns out we're both overthinking it, though; there's a more direct way: use a struct containing an array of 2 uint32_t elements, or declare a struct with 2 uint32_t members. Both C and C++ support atomic reads and writes of user-defined types. For a C++ example showing this see [0]. This will be atomic and, presumably, should be lock-free where possible (hard to imagine the compiler would introduce padding in the struct type that would sabotage this).

> Using a single uint64 as the memory type works, but you no longer have two different names fields and have to pack/unpack them by hand.

Yes, the stored variable would hold 2 different meaningful values, which is a little ugly.

> There's no ub if you use the compiler extension, just totally clear code that does the right thing

Anyone with a deep knowledge of the language will quickly recognise it as incorrect per the language standard. I wouldn't call that totally clear code that does the right thing.

Your proposed solution is only assured to behave as expected if the correct compiler-specific flags are used, otherwise it will introduce undefined behaviour. There's no guarantee that a compiler will even offer such a flag. It's also likely to trigger compiler warnings.

[0] https://en.cppreference.com/w/cpp/atomic/atomic


Note that writing 64 bits and reading 32 (or vice versa) is not a way to get around fences on x86. It is explicitly documented as being undefined. In most cases it will fail to store-forward, which will stall and act as an implicit fence, but in some cases the CPU can do partial store forwarding, breaking it.

AFAIK this trick does work on SPARC though.


It's not documented as being undefined; it's simply not documented at all.

Intel's latest uarch does partial store forwarding.

There was a paper from a few years ago trying to define semantics for mixed-size accesses (not for x86 though) https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf

I don't think the parent was talking about this, though; they were just talking about using a single large physical location, which logically contains multiple smaller values. Accesses to a single location happen in order, so there is indeed no need for fencing between accesses to it. Usually you get a full 128 bits (at least amd64/aarch64/ppc64; not riscv yet but I expect they will get there).

That said—mixed-size can be useful despite the lack of semantics (I think linux uses them in a few places?). sooo


Ah, right, it was about guaranteed total order on all stores in a single memory location.

Re colocation and x86, IIRC the Intel memory model has wording regarding reads and writes to a memory location having to be of the same size to take advantage of the memory model guarantees.


total order on all accesses to a given location—loads from a single location can't be reordered w.r.t. each other either

i don't remember seeing any wording relating to mixed-size accesses in the intel manual (notwithstanding that the official models are ... ambiguous, to say the least, compared with what 3rd-party researchers have done)


> i don't remember seeing any wording relating to mixed-size accesses in the intel manual (notwithstanding that the official models are ... ambiguous, to say the least, compared with what 3rd-party researchers have done)

I was probably misremembering the details. The manual has this to say regarding #LOCK prefixed operations:

"Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access"

which is already vague enough, but regarding general atomic load and stores I couldn't find anything.


For a total store order to be meaningful of course it implies that loads are also non visibly reordered. If a store falls in the forest but nobody is around to load it, was it really ordered :)

Tbf you could say stores happen in order, and loads can happen out of order unless you fence. Personally I don't understand why we need such strong ordering constraints for weakly ordered reads—istm you can go much weaker and maintain sanity.

Thank you!!


I’ve read so much through the years on this, and I feel like it’s our Emperor’s Clothes - people pretend to understand, but does anyone actually understand this magic? Like not theoretically-superficially but in practice, or am I just too dumb to see the King’s new attire?


I do understand it to the extent that I can write correct lock-free data structures. For me that's enough, I don't need to know what happens exactly in the CPU for each architecture (although I have some understanding).


Mozilla had this famous sign 8ft up on the wall that said “you must be this tall to write multi-threaded code”[1].

A lot of people will claim to be able to master it. The track record shows that even the superhuman domain experts make mistakes frequently, as it’s almost impossible to reason about, verify, debug and importantly maintain the invariants as code evolves. And good luck predicting performance implications on different CPUs and workloads. In either case, for software development it doesn’t really matter if there is such a hypothetical person: because nobody else is.

In practice it’s similar to protecting against crypto vulns: don’t roll your own, use trusted libs with very few hard-to-misuse data structures and operations, sometimes rely on lang/compiler features, and runtime analysis tooling.

Personally I think we eventually need an overhaul in language design, perhaps new concurrent control flow constructs and operations. Just look at the C code shown.. it’s not even “code” in the sense that one line does some logical operation scoped to the current function. No, it’s “markers” that tells compiler and CPU to flush internal caches, change their optimizations as a thread-global “operation”. And the post is introductory, SPSC which is much simpler, doesn’t cover thread parking, MESI, compiler reordering etc. It’s a red flashing sign that the abstractions are all whack, frankly. It’s the anti-thesis of a neat, modularized Russian dolls we typically enjoy with single threaded code, including C. Now, these things are still critical, they work and it’s the best we got. It’s a genuinely hard-hard problem, which is quite humbling given how prevalent, studied and important concurrency is to virtually every programming domain.

[1]: https://bholley.net/blog/2015/must-be-this-tall-to-write-mul...


LOL nice! Ok, glad it’s not just me then. It’s actually one of the reasons why I flipped to Rust

Solaris/Illumos has a better name for this: `membar_producer()` and `membar_consumer()`. The first is release semantics, and it's what a producer must do before it finishes publishing some memory to a consumer. The second is what a consumer must do before it starts consuming what the producer gave it.

I find that naming convention much more intuitive than release/acquire, though release/acquire is more general.
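
If I remember the Illumos API correctly (both functions take no arguments; everything else below is an invented example), the usage pattern is roughly:

    extern "C" void membar_producer(void);   // normally from <atomic.h> on Solaris/Illumos
    extern "C" void membar_consumer(void);

    int payload;
    volatile int ready;

    void producer_publish(int v) {
        payload = v;
        membar_producer();    // make my earlier stores visible before the flag store
        ready = 1;
    }

    int consumer_poll(void) {
        if (ready) {
            membar_consumer();    // don't read payload before having seen the flag
            return payload;
        }
        return -1;
    }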




