C++ Memory Model: Migrating from x86 to ARM (2021) (arangodb.com)
151 points by signa11 on Nov 28, 2023 | 47 comments



I've been using the "sfence" memory barrier to enforce visibility in my programs on amd64/x86_64.

It's good to know that my programs might need to be adjusted.

Chris Wellons [1] has an interesting post about using an explicit __ATOMIC_RELEASE and then __ATOMIC_ACQUIRE to tell thread sanitizer that there is definitely a happens-before relationship between threads.

[1]: https://nullprogram.com/blog/2022/10/03/
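
That post's pattern, roughly, is a release store paired with an acquire load. A minimal sketch in C++ using the GCC/Clang __atomic builtins it mentions (variable names are made up):

    #include <cstdio>
    #include <thread>

    static int payload;  // plain data, published via the flag below
    static int ready;    // touched only through __atomic builtins

    int main() {
        std::thread producer([] {
            payload = 42;                                   // ordinary write
            __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);  // publish: everything above becomes visible
        });
        std::thread consumer([] {
            while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE)) {}  // wait for the release store
            std::printf("%d\n", payload);  // happens-after the store, so it prints 42
        });
        producer.join();
        consumer.join();
    }

The acquire load that observes the release store is exactly the happens-before edge that the compiler, the CPU, and TSan all reason about.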

I have an idea: I would really like to layer an abstraction on top of the memory model to formulate a protocol for interprocess/interthread communication. In other words, the cache coherency protocol IS the protocol for message passing. No locks!

My understanding is that memory_order_seq_cst, or __ATOMIC_SEQ_CST in GCC C, means all such atomic operations happen in a single total order that every thread observes consistently.
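
For instance, the classic store-buffering litmus test shows what seq_cst buys over plain acquire/release (a sketch, not from the article): with memory_order_seq_cst on all four operations, the outcome r1 == 0 && r2 == 0 is impossible, because all seq_cst operations fall into one total order that both threads agree on.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        std::thread a([] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread b([] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        a.join();
        b.join();
        assert(r1 == 1 || r2 == 1);  // both-zero is forbidden under seq_cst
    }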

This whitepaper ("How to miscompile programs with “benign” data races") discusses how compiler optimisations can break programs containing supposedly benign data races.

[2]: https://www.usenix.org/legacy/event/hotpar11/tech/final_file...


> I have an idea: I would really like to layer an abstraction on top of the memory model to formulate a protocol for interprocess/interthread communication. In other words, the cache coherency protocol IS the protocol for message passing. No locks!

That's generally what you get with RCU and lock-free data structures.
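
As a concrete illustration of "the coherence protocol is the transport", here is a minimal single-producer/single-consumer ring buffer sketch using nothing but acquire/release atomics (the names and fixed capacity are made up, not taken from any particular library):

    #include <atomic>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t N>   // holds at most N-1 elements
    class SpscQueue {
        T buf[N];
        std::atomic<std::size_t> head{0};  // advanced only by the consumer
        std::atomic<std::size_t> tail{0};  // advanced only by the producer
    public:
        bool push(const T& v) {            // called by the single producer
            std::size_t t = tail.load(std::memory_order_relaxed);
            std::size_t next = (t + 1) % N;
            if (next == head.load(std::memory_order_acquire)) return false;  // full
            buf[t] = v;
            tail.store(next, std::memory_order_release);  // publish the filled slot
            return true;
        }
        std::optional<T> pop() {           // called by the single consumer
            std::size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire)) return std::nullopt;  // empty
            T v = buf[h];
            head.store((h + 1) % N, std::memory_order_release);  // hand the slot back
            return v;
        }
    };

The release store on tail paired with the consumer's acquire load is what hands the slot's contents across; the hardware's cache coherency does the actual delivery.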


> I have an idea: I would really like to layer an abstraction on top of the memory model to formulate a protocol for interprocess/interthread communication. In other words, the cache coherency protocol IS the protocol for message passing. No locks!

If I'm understanding you correctly, this is exactly what I've wanted for a while. Let me use the cached values of things when I want, and let me read and write from/to RAM when I want. Those calls imply the load/store order semantics (I think?).


This sounds like volatile in Java, and while sometimes useful, it is still a minor part of writing concurrent code.

In particular, once you read it, the value is potentially stale. If you write to something based upon it, you now have a race condition.
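
In C++ terms the same pitfall looks like this (a sketch with a hypothetical counter): a lone atomic read followed by a write based on it leaves a window where another thread's update is lost; you need a single read-modify-write to close it.

    #include <atomic>

    std::atomic<int> counter{0};

    void lossy_increment() {
        int seen = counter.load();  // value can be stale one instruction later
        counter.store(seen + 1);    // a concurrent increment may be overwritten here
    }

    void correct_increment() {
        counter.fetch_add(1);       // one atomic read-modify-write, nothing is lost
    }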


> I would really like to layer an abstraction on top of the memory model to formulate a protocol for interprocess/interthread communication. In other words, the cache coherency protocol IS the protocol for message passing. No locks!

that would be just a queue, right?


A queue is part of an event loop solution, but I'm thinking from the perspective of Erlang, Smalltalk, D-Bus, and distributed object communication.

Rust's async semantics are pretty hard to define and understand: there is pinning, runtimes, async/await state machines, awaiting and polling, and type system interactions.

It's all to create a protocol that allows for concurrency and parallelism and is also safe. It seems the cache coherency protocol IS what we're looking for: safe happens-before relationships and no data races.


I'm not sure how cache coherency would help here. Cache coherency is specifically about keeping caches coherent (i.e. not stale) in shared memory systems. It is not applicable to pure message passing systems. Also, cache coherency per se has little to say about sequencing and happens-before; that's the job of the memory model built on top (again, still for shared memory).

At a higher level still, the general idea of consistency is of course applicable to both shared memory concurrency and message passing concurrency. Maybe consistency models are what you are thinking about?

Alternatively, maybe you want to implement shared memory on top of message passing. In this case cache coherency is definitely applicable, but I'm not sure how amenable common cache coherency protocols are to being implemented in software (though it is certainly possible, and probably what some RDMA systems do).


I'm not an expert but here's my intuition.

The L1 cache is very fast, and each core has its own 32 kilobytes or so of it.

If we could send between L1 caches of cores then couldn't we have latency below 100 nanoseconds?


You can already have sub-100ns latency on core-to-core communication on some common machines. And yes, thanks to the magic of bus snooping, the transfer can in practice happen cache to cache, although counterintuitively that is not necessarily what you want for message queues.



> A queue is part of an event loop solution but I'm thinking from the perspective of Erlang, Smalltalk and Dbus and distributed objects communication.

Each Erlang (BEAM) process has a mailbox which collects messages for it to receive, but that's pretty much a queue. Under the hood, a mailbox is implemented in C, with locks; message content is copied, and the destination process owns the copy. Cross-node distributed messaging has different specifics: message content is serialized into a socket buffer, although there's still queuing or locking to send to that socket.

If you write Rust with all your cross-thread communication happening via mpsc channels, you can get something that looks like a fuzzy approximation of Erlang, but at least personally, I would prefer to just run Erlang and call into NIFs (C or Rust via rustler or whatever) to manage the computation-intensive bits.


Rust async semantics are complex because they need to support resumable computations on a single C-style stack. It has nothing to do with concurrency, which was already supported, but at the cost of multiple stacks (either green threads at one point, or OS threads). Futures just happen to be a very prominent example of resumable computations; generator functions are another.


This is a very approachable, yet detailed, introduction to a difficult topic. I wish I had it when I first got into this.

Another good resource for me was Mara Bos' book "Rust Atomics and Locks". It is primarily a Rust book, but I found its concise, reduced-to-the-essentials explanations of memory ordering really helpful.


I had this one ARM platform that was absolutely unforgiving of any sort of memory shenanigans that x86 lets you get away with. Read too much data... fault. Too little... fault. Unaligned memory... fault. Really had to go through the code and make sure anything like that was removed or properly done (curse you sprintf).


I remember Netscape Navigator getting SIGBUS on Solaris SPARC machines once or twice a week. Presumably this was due to most development being done on systems that allowed unaligned memory operations.

Architectures that allow unaligned memory operations should at least have some sampling mechanism to record the instruction addresses causing misaligned operations, for both portability and performance improvement of codebases.


While probably true, NN was wildly crashy anyway.


Is there a quick explanation why data races are UB in the C++ MM but not in the Java MM? Famously, Java’s String::hashCode implementation makes use of the Racy Single-Check Idiom, so this is a core functionality on all supported platforms. What is the cost of Java guaranteeing defined behavior here? Or is it just that it would be costly on some non-mainstream platforms not supported by Java but supported by C++?


There is general agreement that the Java 5 memory model's definition of data races is unsatisfactory: it's overly restrictive of optimizations, it doesn't quite match reality, and it's utterly confusing (it's like 12 pages, where the rest of the memory model is basically just 1 page). The only reason it's not UB in Java is that Java doesn't want any UB; everyone agreed that in C/C++, UB was the best way to handle data races.

C++ does have a form of defined data races, though: relaxed atomics. And if you look at the history of relaxed atomics, from the moment it was first voted into the standard, there's been constant tweaking and kvetching over its unsuitability. I wouldn't recommend trying to understand relaxed atomics any more thoroughly than "you will read some value you wrote into it."


C++ relaxed still serializes accesses to the variable itself, so it is far more useful than mere "defined data races".

In particular it's useful for things like counters that always increment, or counters that decrement but branch and do another atomic operation when they reach zero.
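
A sketch of that pattern, essentially what shared_ptr-style reference counting does (the names here are made up):

    #include <atomic>

    struct RefCounted {
        std::atomic<int> refs{1};

        void retain() {
            // The increment doesn't need to order anything else, so relaxed suffices.
            refs.fetch_add(1, std::memory_order_relaxed);
        }
        void release() {
            // The decrement that hits zero must synchronize with all earlier releases
            // before the object is destroyed, hence the stronger ordering.
            if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                delete this;  // assumes the object was allocated with new
            }
        }
    };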


I was going to mention that relaxed atomics additionally have a requirement that a read always reads the "last" write to that value, but the problem is that the temporal sense of "last" in that sentence only bears a loose resemblance to a typical programmer's sense of time in a program, so I think it may confuse more than it helps.

In LLVM IR terms, the largest practical difference to users between unordered (Java) and monotonic (C relaxed) is that unordered is too weak to let an atomicrmw instruction be definable, whereas monotonic is strong enough to let it be defined. In both cases, you still have to be prepared for the possibility that a load is going to read a value that is in some sense stale.

(There are a few other differences, e.g., you can reorder load unordered p/load unordered q if p and q are may-alias, but you can't if they're relaxed loads. But that's generally something that only comes up when you're writing optimization passes).


My possibly-incorrect understanding is that in C++ terms, every field in Java is a relaxed atomic. In both languages this gives you very confusingly defined but not undefined behavior.

A non-performance argument against Java's decision is that relaxed atomics have turned out to have much more complicated semantics than was expected in the 90s and 00s, to the extent that most code using them is probably incorrect. Java telling developers that data races are okay leads to people writing buggy code, and because it's legal you can't have something like TSan in Java to trap on data races.

On the performance side, this decision inhibits some optimizations, and while relaxed 32-bit atomics are free on all non-exotic hardware, that isn't the case for 64-bit atomics.


It's essentially the same as a relaxed atomic at the machine level, but it's weaker - it corresponds to "unordered" in the LLVM sense. In particular, it allows for a lot more optimization - operations can be reordered with respect to each other as long as the compiler is aware that they refer to disjoint memory locations.


Data races are generally UB because making them defined requires implementing some form of locking on every use, which is CPU-expensive. With a good architecture you can make those data races impossible in the first place in most cases (not all, but most), and this is considered best practice in general: even when there is no data race, it is difficult to reason about all the possible states of your system when data is shared across threads. As such, C++ has decided that data races as UB is the correct thing: in most cases the data race doesn't actually exist - the compiler just doesn't have enough information to prove it doesn't exist. When there is an exception you do have to manually add something to your code to force the synchronization, but hopefully that is rare, and when it happens you want to audit that code more anyway.

I don't know how Java does data races, but often the language doesn't actually prevent data races the way you want it to. That is, A=something; B=SomethingElse; can ensure A is updated before B, but will it ensure that either both A and B are updated or neither is? That may be what you really need, and of course if you don't need that guarantee, having it will slow down code that needs A before B is updated even though it doesn't care.


If you look up the Racy Single-Check Idiom, the point is to have defined behavior without any additional cost (aside from possible duplicate initialization). No locking, and not even volatile. At least on x86 it effectively comes for free.
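
For reference, the closest legal C++ rendering of the idiom uses a relaxed atomic for the cache field (a sketch; in Java the field is just a plain int). On x86 the relaxed load and store compile to ordinary moves, so it really is free; the worst case is that several threads each compute and store the same hash.

    #include <atomic>
    #include <string>

    struct CachedHash {
        std::string data;
        mutable std::atomic<unsigned> hash{0};  // 0 means "not computed yet"

        unsigned get() const {
            unsigned h = hash.load(std::memory_order_relaxed);
            if (h == 0) {
                h = compute();  // may run in several threads; all produce the same value
                hash.store(h, std::memory_order_relaxed);
            }
            return h;
        }
        unsigned compute() const {
            unsigned h = 17;
            for (char c : data) h = h * 31 + static_cast<unsigned char>(c);
            return h == 0 ? 1 : h;  // never collide with the "not computed" sentinel
        }
    };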


On x86, a lot of the cases where C++ has UB could be considered fully defined, since there is no optimization the compiler can do based on it being undefined and x86 guarantees memory order. On ARM you get no such guarantees and it isn't free.


> could be considered fully defined since there is no optimization the compiler can do based on it being undefined

UB means the compiler is free to perform whatever optimizations it wants. Are you really saying that no optimizations are allowed once the compiler notices UB?

Michael Chastain (GDB developer, one of the few people with commit rights to the root of Google's monorepo) told me about the worst bug he ever helped a Googler debug, eventually boiling it down to:

    static const double y = 1.0;
    static const double x = 2 * y;
    printf("Before\n");
    if (x > 0.0) {
      printf("x\n");
    } else {
      printf("Not x\n"):
    }
    printf("After\n");
It was printing "Before\nAfter\n", with nothing in-between.

It turns out that the value of x depended upon the initialization order of statically-scoped doubles. There was no malicious optimization in GCC that got rid of both branches, but both branches got optimized away through a series of intermediate steps where the compiler basically split the control flow graph into something equivalent to:

    static const double y = 1.0;
    static const double x = 2 * y;
    printf("Before\n");
    if (x > 0.0) {
      printf("x\n");
    }
    if (!(x > 0.0)) {
      printf("Not x\n"):
    }
    printf("After\n");

GCC determined that x was statically determined and UB. In both places an optimization pass determined that the fastest choice for if(UB) is to branch over the code in question. Since it's UB, the compiler wasn't required to pick some self-consistent value for x. It's free to optimize away all if(UB) blocks to if(false) blocks.

(Note that this particular case of UB is specific to floating point types and would not occur with integer types. I believe the intention of the spec is to avoid requiring floating point hardware, including the state bits that affect rounding etc., to behave identically on the machine where the compiler is run and on the machine where the program is run. My naive hot take would be that this should be implementation-defined rather than UB, but I'm sure the standards committee put a lot more thought into this than I have.)

It wasn't some malicious or too-clever optimization. It was the emergent behavior of a series of optimizations, each one correct.

Anyway, long story short, once the compiler notices UB, all optimizations are on the table. You seem to be implying that compilers need to be conservative when they detect UB. Compilers are generally recklessly aggressive in optimizing when they detect UB.


I don't see the UB. Are you implying or assuming that x and y are defined in different translation units? That would certainly be UB, but what you wrote doesn't look like that.

ETA:

from https://en.cppreference.com/w/cpp/language/siof

"The static initialization order fiasco refers to the ambiguity in the order that objects with static storage duration in different translation units are initialized in."

"Within a single translation unit, the fiasco does not apply because the objects are initialized from top to bottom."


Sorry, from the context, you could assume the above was C++.

At least in C89, floating point literals aren't constant expressions, and static initialization order is only defined for constant expressions.

I distinctly remember getting burned by this in 2003. GCC was acting sanely, but Sun Studio was initializing two static const floats in successive lines in the same file in reverse order they were defined, resulting in 0.0 for the product of two nonzero constants. Tracking that down was rather painful. Switching the file to C++ solved the issue, but this was radio network simulation code that took several days to run on my client's $250,000 Sunfire V1280 with 96 GB of RAM (quite a large amount of RAM for 2003), and Sun's C++ compiler generated slower code for this same file. I presume some amount of monkeying with compiler flags would have brought the C++ performance back in line with the C performance, but I had more important things to do.

So, I changed the static const doubles to #defines and that also fixed the bug.


That UB is so hard to see, even when you know about static initialization order indeterminism.

Static initialization is such a bug magnet in general. It can make builds ”work” randomly, when static object constructors are run in more or less random order, decided by the compiler. Change something unrelated and kaboom.

(Needless to say, statically initializing anything with a complex constructor is a bad idea, especially if it refers to another statically initialized object.)


I want to say "Static Initialization Order Fiasco!" but that only applies in cross-translation unit scenarios. Were x and y defined in different TUs? It's either that or a compiler bug.

https://en.cppreference.com/w/cpp/language/siof


Sorry, from the context, you could assume the above is C++.

In C89, floating point literals aren't constant expressions, and static initialization order within a compilation unit is only defined for constant expressions. I believe this is also the case for C99, and things get much more fuzzy for me for C standards after C99.

As stated in another comment, I got burned by this in 2003 in porting some very large network simulation from Linux on x86 to run on Solaris on my client's $250,000 SunFire V1280. GCC was initializing two const doubles in the file in top-to-bottom order, but Sun Studio was doing the opposite. Switching the file in question to C++ solved the bug, but Sun's C++ compiler generated slower code in this case. So, I changed the static const doubles to #defines to fix the bug.


> Are you really saying that no optimizations are allowed once the compiler notices UB?

My statement is specific to how data is shared across multiple threads and does not generalize to any other situation, such as the one you presented. When threads are in play a compiler typically cannot know if there is UB or not (since the access is often done across different translations units).


Most of the UB in C++ could be well defined within the language if the language just did whatever the machine did (like overflows). The exceptions I can think of are (a) executing invalid code, or (b) accessing deallocated memory. Java can avoid both of them by taking control of both: you can't execute arbitrary pointers or read/write to arbitrary addresses within the language, and the runtime (I assume) prevents tearing of writes. (This is already guaranteed by the machine on x86, so it doesn't need to do anything there. I can only assume weaker architectures might require atomic instructions for potentially interfering writes, which seems like the cost paid here, but someone can enlighten me.) Furthermore, the GC can ensure that its operations are synchronized with your threads, so you can't interfere with it, and it doesn't pull the rug out from under you. So the end result is that there's theoretically no way to break memory safety.


From the Java spec notes:

> Some implementations may find it convenient to divide a single write action on a 64-bit long or double value into two write actions on adjacent 32-bit values. For efficiency's sake, this behavior is implementation-specific; an implementation of the Java Virtual Machine is free to perform writes to long and double values atomically or in two parts.

> Implementations of the Java Virtual Machine are encouraged to avoid splitting 64-bit values where possible. Programmers are encouraged to declare shared 64-bit values as volatile or synchronize their programs correctly to avoid possible complications.

so in practice Java behaves exactly as C++ does with respect to tearing (`volatile` in Java having the same meaning as `atomic` in C++), and if there is a powerful enough optimizer running in the back-end without access to the original semantic information, it may well create the kind of UB-led optimizations that people complain about in C++.
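
For comparison, the C++ spelling of "declare shared 64-bit values as volatile" is std::atomic (a minimal sketch with hypothetical variables):

    #include <atomic>
    #include <cstdint>

    int64_t plain_counter = 0;               // concurrent access is a data race and may tear on 32-bit targets
    std::atomic<int64_t> atomic_counter{0};  // never tears, even with relaxed ordering

    void writer() {
        plain_counter = 0x100000001;  // a racing reader could observe half of this value
        atomic_counter.store(0x100000001, std::memory_order_relaxed);  // all or nothing
    }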


Is this only for numbers or also for pointers? Tearing for numbers isn't a memory safety issue, but for pointers it would mean that on a 64-bit machine with tearing, merely reassigning a variable and then calling a method on that object can execute arbitrary machine code.


I haven't carefully read the relevant part of the spec, but I imagine that pointers have to be modeled as not subject to tearing, otherwise safety guarantees would completely fall apart.

The flip side of this is that adding "fat pointers" safely to Java would be very difficult. Fat pointers are widely used in languages such as Go, Rust, and Swift to represent both slices (pointer + length) and various forms of dynamic dispatch (pointer + vtable). In Rust, this is safe because data races are forbidden, enforced by the type system. I consider Go and Swift to be "safe-ish" in that data races can happen, and can violate safety. Apparently this hasn't led to huge problems in Go. Thus, the roadmap for Carbon[1] targets a comparable level of safety, not trying to prevent data races, but aiming for memory safety in single-threaded contexts.

[1]: https://www.youtube.com/watch?v=1ZTJ9omXOQ0


Java has effectively a non-tearing guarantee for pointers. Regarding performance and alignment, Java pointers are commonly 32-bit even on 64-bit Java, due to compressed pointer representation. (Java objects are always aligned to 8-byte or 16-byte boundaries, so the bottom 3 or 4 bits of a pointer can be shifted out, resulting in 8 or 16 times the normal 32-bit address space. IIRC there are even more variations regarding the Java pointer model.)
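
The decoding step is roughly a shift plus an optional heap base (a sketch of the arithmetic, not actual HotSpot code):

    #include <cstdint>

    // With 8-byte object alignment, a 32-bit "compressed oop" can address
    // 4 GiB * 8 = 32 GiB of heap; 16-byte alignment extends that to 64 GiB.
    void* decode_oop(std::uint32_t compressed, std::uintptr_t heap_base) {
        return reinterpret_cast<void*>(heap_base + (static_cast<std::uintptr_t>(compressed) << 3));
    }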


> resulting in 8 or 16 times the normal 32-bit address space

That would still impose a 64 GB memory limit. How does this work?


Compressed pointers are disabled if the max heap is larger than 32GB.


I’m not aware of any platform that tears aligned pointer stores, but I’ve seen a lot of defensive atomic stores when changing a pointer value. So shrug.


Platforms where pointers are bigger than the bus width/register size would presumably have tearing in pointer stores... I imagine a JVM implementation for a 16-bit system would want to use 32-bit "far pointers" if the platform supports that.


Far from the full story, but a note is that Java has the nice property of all possible single-operation writes being ≤8 bytes, all aligned if the VM so wants (byte, char/short, int, long, and Object are all the possible writes), whereas in C you can have larger types (structs, or larger ints (_BitInt in C2x); and with compiler extensions those can get arbitrarily misaligned). IIRC there were/are questions about write atomicity for value classes/Valhalla.

And there's not much sane to define data races to anyway - String::hashCode might be "safe" with writing an 'int', but, were you to want a 64-bit hash, i.e. use a 'long', it'd be unsafe, as 64-bit types are specified to possibly tear!


In principle this is no different from the situation with long and double in Java, for which atomicity is not guaranteed, and word tearing can occur. It is still considered defined behavior, in the sense that you will see some combination of actually written values, and not get nasal demons.


And so Java effectively mandates at least max(32, size of Object)-bit atomic writes, while C can trivially run on 16-bit systems. I don't know if anyone's running multi-threaded software on such, but having a completely arbitrary boundary (from the point-of-view of a programmer wanting to store an integer) for what does and doesn't behave sanely is far from nice.


Check out the difference between LLVM's NotAtomic (C++) and Unordered (Java) here: https://llvm.org/docs/Atomics.html#notatomic


Ok, so it seems that the Java semantics can be expensive in cases like 64-bit-wide pointers on ARM. However, pointers (object references) in Java are usually only 32 bits wide by default, due to using a compressed pointer format.


For a more in depth introduction to the topic with a lot of practical code examples, I highly recommend Herb Sutter’s excellent talk:

Part 1 https://www.youtube.com/watch?v=A8eCGOqgvH4

Part 2 https://www.youtube.com/watch?v=KeLBd2EJLOU


Gotta love those full phone-screen popups asking if I want a chat with their sales team.

No. And if I did, it wouldn't be over a chat window on a website.



