
It's coherent behind the scenes, but it often presents an incoherent view to the software. It's not acting coherently when the reordering of memory operations makes different cores see different orderings.

When a CPU has a loose memory model and is aggressively making use of the reordering capabilities, the cache being coherent internally is basically just an implementation detail. It's not part of the visible ABI.




It never presents an incoherent view to software.

I'm using coherency as a term of art, not in a colloquial sense. Every agent observes stores to a given location in the same order[*]. Cache coherency says nothing about the observed ordering of stores to different locations.

[*] Although store forwarding throws a bit of a spanner in that definition, there can still be reordering occurring absent that local reordering.


Hum, I don't see how store forwarding breaks the illusion of a total order of stores to a single memory location; at least in 5 minutes of thinking I can't come up with a litmus test that would demonstrate it. In fact even C++ relaxed stores and loads preserve this ordering.
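
For concreteness, here is that per-location guarantee as a litmus test; a minimal sketch using C++ relaxed atomics (the names are mine):

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> x{0};

    int main() {
        std::thread writer([] {
            x.store(1, std::memory_order_relaxed);  // store A
            x.store(2, std::memory_order_relaxed);  // store B
        });
        std::thread reader([] {
            int r1 = x.load(std::memory_order_relaxed);
            int r2 = x.load(std::memory_order_relaxed);
            // Coherence forbids r1 == 2 && r2 == 1: once a load of x has
            // observed B, a later load of x cannot go back to A, even with
            // relaxed ordering.
            assert(!(r1 == 2 && r2 == 1));
        });
        writer.join();
        reader.join();
    }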

I think your definition is correct without the asterisk.

edit: tweaked wording


Oh yes that must be right, I wasn't thinking (or thinking about consistency ordering). Good catch.


Fine, with that specific term-of-art meaning, ignore my second post. I stand by my original statement that you can't trust it to "do much for you"; just replace the last word with "act consistent" or "act ordered". Per-address ordering is nearly useless by itself, and if you had a CPU that didn't guarantee it, you'd observe almost no difference.


Well no, if you don't have a coherent system then your memory operations aren't reliable. You can lose updates or read stale data. Look at what software has to do in incoherent systems: specific flush and invalidate points, which are not the same as ordering barriers.

Your CPU guarantees a lot, cache coherency to start with. But also a very well defined ordering model and ordering instructions. It's not necessarily trivial to program for, but that doesn't mean you can't trust it.


> You can lose updates

A system without cache coherency can still promise that updates won't be lost. There are lots of ways to write rules around update propagation, and cache coherency is just one of them.

> or read stale data

Cache coherency doesn't protect you from stale data unless you only read one memory address ever.

> Look at what software has to do in incoherent systems: specific flush and invalidate points, which are not the same as ordering barriers.

That depends on the memory model. You could have a system that doesn't guarantee cache coherency in general but works fine if you put in ordinary memory barriers.


> A system without cache coherency can still promise that updates won't be lost. There are lots of ways to write rules around update propagation, and cache coherency is just one of them.

Not without coherency performed in software though, which is the point.

> Cache coherency doesn't protect you from stale data unless you only read one memory address ever.

It does. Observing updates to different locations in other than sequential order does not mean the data is stale. I guess that's also a colloquial-language issue. The data is up to date according to the constraints of the memory model.

> That depends on the memory model. You could have a system that doesn't guarantee cache coherency in general but works fine if you put in ordinary memory barriers.

What property of cache coherency could you lose and still have it working?


> Not without coherency performed in software though, which is the point.

How would you do cache coherency in software? I don't follow.

> It does. Observing updates to different locations in other than sequential order does not mean the data is stale. I guess that's also a colloquial-language issue. The data is up to date according to the constraints of the memory model.

When you read address X that refers to address Y, it's impossible to read Y and know it's the version that X refers to (or a more recent version). I would call that Y being "stale, as per the memory model". I'm pretty sure that's a normal use of the word "stale" in the context of memory models?

> What property of cache coherency could you lose and still have it working?

Imagine a CPU with a weak memory model, where multithreaded code without memory barriers loses the property of always seeing the same order of accesses to a specific address. That would break some things, but that code was broken anyway.

Normal memory barrier semantics could force certain accesses to be viewed in the same order, including important accesses that would normally be enforced by cache coherency. You don't need it to be enforced twice. The code will run fine.


> How would you do cache coherency in software? I don't follow.

Hardware caches which are not coherent require e.g., writeback and flushes to be coherent with other agents.

> When you read address X that refers to address Y, it's impossible to read Y and know it's the version that X refers to (or a more recent version). I would call that Y being "stale, as per the memory model". I'm pretty sure that's a normal use of the word "stale" in the context of memory models?

It isn't, because it's relative. "Most recent" is according to the observer, and if you couldn't previously observe something that is "newer" (within the rules of the memory consistency model), then it is not stale.

> Imagine a CPU with a weak memory model, where multithreaded code without memory barriers loses the property of always seeing the same order of accesses to a specific address. That would break some things, but that code was broken anyway.

Not the same order of accesses, the same order of stores. If memory location x receives two stores, A and B, and CPU1 sees A, B while CPU2 sees B, A, then one thinks the location contains B and the other thinks it contains A. In the end, all CPUs can see different values at all memory locations. A normal barrier doesn't solve this; the stores are already done.


> It isn't, because it's relative. "Most recent" is according to the observer, and if you couldn't previously observe something that is "newer" (within the rules of the memory consistency model), then it is not stale.

Why is "most recent" according to the observer, and not according to the core(s) that actually did the writes?

The value in Y caused the value in X. Surely that's worth something in terms of ordering?


What if the write hasn't happened at all?

Writer:

    0:    mov $0 $random_cell // zero initialize $random_cell
    1:    mov $0 $random_cell_ready // zero initialize $random_cell_ready
    3:    rng r1 // generates a non-zero random number in r1, takes ~300 cycles
    4:    mov r1 $random_cell // write generated value to memory location $random_cell
    5:    mov 1 $random_cell_ready // sets a flag to notify that the value has been written
Reader

   0:    test $random_cell_ready
   1:    jz 0  // spin-wait for cell ready
   2:    mov $random_cell r1
Consider a 1-wide OoO[0] machine with a maximally relaxed memory model. There are no caches, just magic memory with 1-clock-cycle latency, so no coherence issues: at time t the writer starts computing a random number (writer:3); rng is going to take a few hundred cycles before writing the result to r1. At the next clock cycle, t+1, it can't execute writer:4 as r1 is not ready; instead it executes writer:5, which has no dependencies. writer:4 is only executed at t+300.

At time t, the reader will read a zero from the flag, so at t+1 it loops back. At t+2 it sees the flag set to 1. So at t+3 the jump falls through, and at t+4 we read 0 from $random_cell. That's obviously wrong, as the data is not there. It is certainly not reading stale data, as that's literally the last value that was written to this memory location: a newer value hasn't even been computed yet.

To fix this you need a fence #StoreStore between writer:4 and writer:5 and a corresponding fence #LoadLoad between reader:1 and reader:2 to implement release and acquire semantics [1]. In particular the fence on the writer side will stall the pipeline [2] until previous stores in program order have committed.

As you can see there are no caches, only a single shared memory which is necessarily always coherent. There is no stale data. Yet we need fences for correctness.
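
The same fix, sketched in C++11 terms, with atomic_thread_fence standing in for the #StoreStore/#LoadLoad pair (a sketch only; here the rng result is simply passed in as an argument):

    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> random_cell{0};
    std::atomic<int> random_cell_ready{0};

    void writer(uint64_t rng_result) {
        random_cell.store(rng_result, std::memory_order_relaxed);  // writer:4
        std::atomic_thread_fence(std::memory_order_release);       // keeps the data store
                                                                    // ahead of the flag store
        random_cell_ready.store(1, std::memory_order_relaxed);     // writer:5
    }

    uint64_t reader() {
        while (random_cell_ready.load(std::memory_order_relaxed) == 0)
            ;                                                       // reader:0-1 spin-wait
        std::atomic_thread_fence(std::memory_order_acquire);       // keeps the data load
                                                                    // after the flag load
        return random_cell.load(std::memory_order_relaxed);        // reader:2
    }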

You can now reintroduce caches, but MESI gives the illusion to the rest of the CPU that it is actually talking to uncached, fully coherent memory, except that latency is now variable.

Ergo, generally, fences and caches are completely orthogonal concepts.

[0] BTW the example would be broken even on a machine with scoreboarding but no renaming (so not fully OoO).

[1] proper release and acquire require slightly stronger fences, but these are sufficient for this example

[2] a better option is to just prevent further stores from being dispatched.

edit: some light editing


The writing thread I had in mind was not being reordered.

Consider instead a loop that walks through an array looking for a value of 8, then writes the address of that value to memory location X.

Another thread can read X, then read the location, and see a number that is not 8 but used to be there before the 8.

That's allowed by the memory model, but is it wrong to say "stale"?

Is it wrong to say that the 8 was set before X was set?


Your example has a single store, so there is no reordering. There is no value before the 8, and it doesn't require any barrier. If you make an additional store to the array, that would be equivalent to my example.


I said there was a value before 8, I just didn't describe it well enough.

First off, your example looks a little under-constrained (what if reader:0 happens before writer:0?), so let's assume all the writes of 0 at the start of each program happen before any other instructions.

Let there be a third thread, "builder". It writes 0 to every address in the array, then fills it with new values including an 8.

The "writer" thread loops repeatedly over the array until it finds an 8, then stores the address in X.

The "reader" thread waits for X to be nonzero then loads the value at [X].

In the toy coherent computer, reader will always load an 8.

But in a very weak memory model, one that doesn't enforce dependent load ordering, reader is allowed to get a 0.

The write to X and the write of 8 don't have to show up in any particular order.

But in that situation I would say that X is "more recent" than 8, because of causality. And if you asked me yesterday I would have called the 0 "stale".


Your example was hard to follow. What do you mean exactly? Write it in terms of the memory operations performed and the values observed by each CPU at each address, with assertions that are or are not violated according to your example.


> You could have a system that doesn't guarantee cache coherency in general but works fine if you put in ordinary memory barriers.

How would that work? In such a system, either you have no caches or memory barriers would need to pessimistically flush all dirty lines to memory and send invalidation and synchronization messages to all other cores. In practice such a system, far from being fine, would be so slow as to be unusable if barriers had such semantics. Even implementing C++ relaxed semantics would be very expensive.


> In such a system, either you have no caches or memory barriers would need to pessimistically flush all dirty lines to memory and send invalidation and synchronization messages to all other cores.

Why would you need to avoid caches or flush to memory?

And invalidation and synchronization messages are already part of a normal CPU's overhead, so I don't see why restricting some of them to memory barriers would increase overhead.

In other words, assume you still have a cache coherency protocol, but you're lazy about certain states in the absence of memory barriers, so the default behavior is not always coherent.


How would you implement the C++11 memory model in a non-cc system, exactly? Let's say you build a linked list spanning a few cache lines and then publish it by storing the address of the first node into an atomic with release semantics. Another thread load-consumes that address and starts traversing it.
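
That is, something like this sketch (names are illustrative; compilers today promote consume to acquire anyway):

    #include <atomic>

    struct Node { int value; Node* next; };

    std::atomic<Node*> head{nullptr};

    // Producer: the list is already built, spanning several cache lines.
    void publish(Node* first) {
        head.store(first, std::memory_order_release);
    }

    // Consumer: wait for publication, then walk the list through the
    // data-dependent pointer chain.
    int sum_list() {
        Node* p;
        while ((p = head.load(std::memory_order_consume)) == nullptr)
            ;  // spin until published
        int sum = 0;
        for (; p != nullptr; p = p->next)
            sum += p->value;  // must observe the node contents the producer wrote
        return sum;
    }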

On a cc system the only thing the release barrier needs to ensure is that all previous stores commit to L1 (in any order) before the final store. There is typically nothing to do on the consume side, as the CPU pipeline preserves causality. Everything else is taken care of by plain MESI, which will transfer, exactly and on demand, those cache lines that contain the list nodes and no more.

What would the acquire/consume barriers do on a non-cc system? Consider that, generally, neither the CPU nor the compiler actually has an understanding of the list data structure itself.


In that example I don't think anything really changes. With cache coherency a release barrier makes sure all previous stores are committed to L1, and by implication any old versions of the cache line have been invalidated. Without cache coherency a release barrier makes sure all previous stores are committed to L1, and explicitly says that any old versions of the cache line have been invalidated.

If you had a design a lot like MESI, you wouldn't really need release semantics, you'd just have an extra "Shared but maybe stale" state instead of just Invalid, and consume would coerce all those lines into Invalid and they'd have to be re-acquired. But re-acquiring those lines is no worse than if you had vanilla MESI. If you had an Owned state that needs to broadcast changes, you could avoid most broadcasts until seeing a release or similar barrier.

In both of these situations you'd probably make the CPU try to sync lines right away, but it could squeeze out some more performance when it's not immediately mandatory.


So to clarify, I understand from this comment and others in this thread that you are talking about a mostly CC system, with coherency messages and the full set of MESI transitions; the only relaxation is that you allow an additional Stale state. I don't see what you gain here: you still need to send RFOs for every write (including plain stores) to not-yet-exclusive lines, as you can't delay sending invalidates to other caches until a fence, because otherwise two cores might be writing to the same cache line at the same time.

I guess this improves the false-sharing scenario, as you can read data from a stale line if the data you care about is not what actually caused the line to go stale and a barrier is needed to fully invalidate it, but the cost is that your release barriers, instead of being a cheap, purely core-local effect, now have to cross the coherence fabric. This can move the cost from a dozen cycles to hundreds.


It's not supposed to be a particularly valuable model, it's just to show how it could exist.

> the cost is that your release barriers, instead of being a cheap, purely core-local effect, now have to cross the coherence fabric. This can move the cost from a dozen cycles to hundreds.

That's not the intent.

The barrier pushes the lines in this new state to I, but under MESI they already would have been I.

There is no need to have any performance loss compared to MESI. If no lines are in this state then there is no extra work. And if it would be helpful, you can preemptively work on converting those lines to S in the background.

Edit: Oh wait you said release barrier. No, if you're doing the simplest plan based on MESI then there is no need for anything extra when releasing. Those considerations were only if you wanted to look at MOSI where it's already doing write broadcasts.


Sorry, I'm losing track of what you are proposing. I guess you mean MOESI plus an additional Stale state? Can you describe all the transitions and when they are performed?


It seems to be a state that can send stale data to loads, but will get invalidated if the CPU performs a barrier.

It doesn't work, of course, because there is no forward progress. CPU1 can order all previous stores, then later store some flag or lock variable, and that will never propagate to CPU2, which is spinning on it, waiting for the value.

But also because CPU2 and CPU3 can see different values depending on the state of their caches: if one had no such line and the other had a valid-stale line, then they will end up seeing different values. And the writeback out of CPU1's cache needs to write back and invalidate all possible older such written-to lines. By the time you make it work, it's not a cache, it's a store queue, but worse, because it has to do all these coherency operations, and barriers have to walk it and flash or flush it, etc.


Can't I say that spinning without a barrier is doing things wrong?

I guess if I need an emergency fix I can make those lines invalidate themselves every thousand cycles.

> But also because CPU2 and CPU3 can see different values depending on the state of their caches. If one had no such line and the other had a valid-stale line, then they will end up seeing different values.

But only if there are no memory barriers, so it shouldn't be a big deal. And the valid-stale lines can't be used as the basis for new writes.

> And the writeback out of CPU1's cache needs to write back and invalidate all possible older such written-to lines.

The only such lines are in this new state that's halfway between S and I. They don't need to be marked any more invalid. Zero bus traffic there.


Yes, I saw that flaw, but I think in their model barriers are always associated with stores (so atomic_thread_fence is not implementable but store_release is), which fixes your first example. I agree that in any case you end up doing more work than in the typical model to make other scenarios work.


> In that example I don't think anything really changes. With cache coherency a release barrier makes sure all previous stores are committed to L1, and by implication any old versions of the cache line have been invalidated.

That is not what a release barrier does.

> Without cache coherency a release barrier makes sure all previous stores are committed to L1, and explicitly says that any old versions of the cache line have been invalidated.

That is not a "normal barrier" though, writeback and invalidate operations are software coherency.

> If you had a design a lot like MESI, you wouldn't really need release semantics,

This is not the case. A release barrier can be required even if your cache coherency operations complete in FIFO order, because reordering could happen before cache coherency.


>That is not what a release barrier does.

> That is not a "normal barrier" though, writeback and invalidate operations are software coherency.

I think I was unclear when I said "the cache line". I meant the one containing a releasing store.

Let me try wording it a different way. A store_release requires all previous writes to be ordered before it. This obviously includes other memory addresses, but its own memory address isn't an exception. So even without cache consistency as a general rule, the nature of a release gives you all the ordering you need in this situation.

I'm sorry for mentioning the word "invalidate", because that's the implementation and not the semantics.

> This is not the case. Release barrier can be required even if your cache coherency operations completed in FIFO order, because reordering could be done before cache coherency.

So I meant acquire but I think there's also a clear solution based on exactly what I said.

The lines that could possibly be affected by the FIFO are all in the "Shared but maybe stale" state. Consume turns those into Invalid. So any read that's from after the Consume, reordered before it, should see those lines as Invalid.


> I think I was unclear when I said "the cache line". I meant the one containing a releasing store.

Not sure what you mean by that.

> Let me try wording it a different way. A store_release requires all previous writes to be ordered before it. This obviously includes other memory addresses, but its own memory address isn't an exception. So even without cache consistency as a general rule, the nature of a release gives you all the ordering you need in this situation.

We're talking about cache coherency, not memory consistency. Coherency is not about ordering, it's about ensuring agents don't see stale data.

> So I meant acquire but I think there's also a clear solution based on exactly what I said.

The same goes for acquire though.

> The lines that could possibly be affected by the FIFO are all in the "Shared but maybe stale" state. Consume turns those into Invalid. So any read that's from after the Consume, reordered before it, should see those lines as Invalid.

Implementation of CPU memory pipelines and cache coherency aren't really something you can just get a bit of a feel for and then handwave about.


When I talk about how the cache protocol has to do XYZ to implement a barrier, you complain that that isn't what a barrier is.

When I talk about what memory barriers do in pure terms, you complain that I'm not mentioning the cache protocol.

When I give an example of a cache protocol in isolation, you start talking about which memory barriers I'm missing.

I don't know what you want.

> Coherency is not about ordering, it's about ensuring agents don't see stale data.

Well, if I go by "if it's part of the memory model then it's not stale", then you can allow a relaxed ordering on single addresses without having stale data.

When a core takes Exclusive control of a cache line, put all other Shared copies into the state "might be an old version, but that's allowed by the protocol and the memory model".

Some instructions can read "might be an old version, but that's allowed by the protocol and the memory model" values and some can't. The exact details of "some instructions" are flexible/irrelevant. See the memory model (not provided) for details.

There. Done. Minimal proof established of a design that doesn't always guarantee cache coherency, but can enforce as much cache coherency as you need. You don't need to add any explicit writebacks or flushes to manage it from software, and enforcing it doesn't take any significant effort beyond a normal CPU's coherency protocol.

Agents will never be confused. They know that Shared means everyone has the same value, and "might be an old version, but that's allowed by the protocol and the memory model" does not mean everyone has the same value. They know that transitioning from "might be an old version, but that's allowed by the protocol and the memory model" to Shared or Exclusive requires reading the data anew, just like transitioning from Invalid to Shared to Exclusive.

If agents want to always be as up to date as possible, they can simply not use this state. If an agent wants to be up to date some of the time, then it can allow this state but purge it at will.

This state only allows for "old" values going back to the most recent purge, so it's not a useless act to read from it. And this state can give you data faster than acquiring a Shared state, so there's a reason to use it.
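
As a purely illustrative sketch of that idea (state names and transitions are hypothetical, not a verified protocol):

    // Hypothetical MESI variant with one extra state; illustrative only.
    enum class LineState {
        Modified,
        Exclusive,
        Shared,
        SharedMaybeStale,  // readable by loads that tolerate old data, per the memory model
        Invalid,
    };

    // Another core takes exclusive ownership of the line: Shared copies are
    // demoted to SharedMaybeStale instead of Invalid; M/E copies still supply
    // the data and drop to Invalid, as in plain MESI.
    LineState on_remote_ownership(LineState s) {
        switch (s) {
            case LineState::Shared:    return LineState::SharedMaybeStale;
            case LineState::Modified:
            case LineState::Exclusive: return LineState::Invalid;
            default:                   return s;
        }
    }

    // An acquire/consume barrier purges the maybe-stale copies; re-reading the
    // line then requires fetching fresh data, exactly as from Invalid.
    LineState on_acquire_barrier(LineState s) {
        return s == LineState::SharedMaybeStale ? LineState::Invalid : s;
    }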

> The same goes for acquire though.

I'm pretty sure the entire point of acquire is that you can't reorder reads from after it to before it.


> I don't know what you want.

I don't want anything, I was correcting your misconceptions.

> Well, if I go by "if it's part of the memory model then it's not stale", then you can allow a relaxed ordering on single addresses without having stale data.

I don't know what you're talking about. Memory ordering is not about ordering of a single address. That's cache coherency.

[snip]

> I'm pretty sure the entire point of acquire is that you can't reorder reads from after it to before it.

And you're still wrong. Acquire barrier can be required even if you receive coherency updates in a sequential order. Barriers in modern processors do not flush, invalidate, or perform coherency operations.

This is what I mean by you can't just handwave with a basic idea about the programming semantics (which aren't particularly complicated). It's easy to think up some vaguely plausible sounding implementation of those things, but the reality is infinitely more complicated. Real cache coherency protocols are verified with formal proofs, and not because they are easy. I guarantee if you handwave a new coherency state or give up some property of coherency, you will have bugs.


> I don't know what you're talking about. Memory ordering is not about ordering of a single address. That's cache coherency.

The ordering of a single address is relevant to both the cache protocol and the memory model.

That section is describing a cache protocol.

> And you're still wrong. Acquire barrier can be required even if you receive coherency updates in a sequential order.

I agree. How does that make my statement wrong in any way?

> Real cache coherency protocols are verified with formal proofs, and not because they are easy. I guarantee if you handwave a new coherency state or give up some property of coherency, you will have bugs.

Do you think my description is impossible to fix, or are you just trying to impress on me that it's hard?

I don't feel like spending hours finding and editing a concurrency simulator today.



