How would you implement the C++11 memory model on a non-cc (non-cache-coherent) system, exactly? Let's say you build a linked list spanning a few cache lines and then publish it by storing the address of the first node into an atomic with release semantics. Another thread load-consumes its address and starts traversing it.
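
For concreteness, a minimal sketch of the pattern described (all names made up):

    #include <atomic>

    struct Node { int value; Node* next; };

    std::atomic<Node*> head{nullptr};

    // Writer: build the list with plain stores, then publish it.
    void publish(Node* first) {
        // ... plain stores initialize the nodes here ...
        head.store(first, std::memory_order_release);
    }

    // Reader: consume the head pointer, then traverse with plain loads.
    // The loads of n->next and n->value carry a data dependency on the
    // consume load, so they must observe the writer's initializing
    // stores. (Compilers today typically strengthen consume to acquire.)
    void traverse() {
        for (Node* n = head.load(std::memory_order_consume); n; n = n->next) {
            // use n->value
        }
    }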

On a cc system, the only thing the release barrier needs to ensure is that all previous stores commit to L1 (in any order) before the final store. There is typically nothing to do on the consume side, as the CPU pipeline preserves causality. Everything else is taken care of by plain MESI, which will transfer, on demand, exactly those cache lines that contain the list nodes and no more.

What would the acquire/consume barriers do on a non-cc system? Consider that, generally, neither the CPU nor the compiler actually has any understanding of the list data structure itself.




In that example I don't think anything really changes. With cache coherency a release barrier makes sure all previous stores are committed to L1, and by implication any old versions of the cache line have been invalidated. Without cache coherency a release barrier makes sure all previous stores are committed to L1, and explicitly says that any old versions of the cache line have been invalidated.

If you had a design a lot like MESI, you wouldn't really need release semantics; you'd just have an extra "Shared but maybe stale" state instead of just Invalid, and consume would coerce all those lines into Invalid and they'd have to be re-acquired. But re-acquiring those lines is no worse than under vanilla MESI. If you had an Owned state that needs to broadcast changes, you could avoid most broadcasts until seeing a release or similar barrier.
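
A sketch of what that might look like (hypothetical states, not any real protocol):

    // MESI plus an extra state for lines that are readable but
    // possibly out of date.
    enum class LineState { Modified, Exclusive, Shared, SharedStale, Invalid };

    // When another core takes exclusive ownership, our Shared copy
    // drops to SharedStale instead of Invalid. A consume (or acquire)
    // then demotes SharedStale to Invalid, so later loads must
    // re-acquire the line, just as under vanilla MESI.
    void on_consume(LineState& s) {
        if (s == LineState::SharedStale) s = LineState::Invalid;
    }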

In both of these situations you'd probably make the CPU try to sync lines right away, but it could squeeze out some more performance when it's not immediately mandatory.


So to clarify: I understand from this comment and others in this thread that you are talking about a mostly-cc system, with coherency messages and the full set of MESI transitions; the only relaxation is that you allow an additional Stale state. I don't see what you gain here. You still need to send RFOs for every write (including plain writes) to not-yet-exclusive lines, as you can't delay sending invalidates to other caches until a fence; otherwise two cores might be writing to the same cache line at the same time.

I guess this improves the false-sharing scenario, as you can read data from a stale line if the data you care about is not what actually caused the line to go stale, and a barrier is needed to fully invalidate it. But the cost is that your release barriers, instead of being a cheap, purely core-local effect, now have to cross the coherence fabric. This can move the cost from a dozen cycles to hundreds.


It's not supposed to be a particularly valuable model, it's just to show how it could exist.

> the cost is that your release barriers, instead of being a cheap, purely core-local effect, now have to cross the coherence fabric. This can move the cost from a dozen cycles to hundreds.

That's not the intent.

The barrier pushes the lines in this new state to I, but under MESI they already would have been I.

There is no need to have any performance loss compared to MESI. If no lines are in this state then there is no extra work. And if it would be helpful, you can preemptively work on converting those lines to S in the background.

Edit: Oh wait, you said release barrier. No, if you're doing the simplest plan based on MESI, then there is no need for anything extra when releasing. Those considerations were only if you wanted to look at MOSI, where it's already doing write broadcasts.


Sorry, I'm losing track of what you are proposing. I guess you mean MOESI plus an additional Stale state? Can you describe all the transitions and when they are performed?


It seems to be a state that can send stale data to loads, but will get invalidated if the CPU performs a barrier.

It doesn't work, of course, most obviously because there is no forward progress. CPU1 can order all previous stores, then later store some flag or lock variable, and that will never propagate to CPU2, which is spinning on it waiting for the value.
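
Concretely, something like this (a sketch, assuming the stale-line model above):

    #include <atomic>

    std::atomic<int> flag{0};

    // CPU1: orders all previous stores, then stores the flag.
    void cpu1() { flag.store(1, std::memory_order_release); }

    // CPU2: spins with relaxed loads, i.e. no barrier. If its cache
    // holds the flag's line in a valid-stale state and nothing ever
    // demotes that line to Invalid, the new value never arrives and
    // the loop spins forever.
    void cpu2() {
        while (flag.load(std::memory_order_relaxed) != 1) { /* spin */ }
    }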

But also because CPU2 and CPU3 can see different values depending on the state of their caches. If one had no such line and the other had a valid-stale line, then they will end up seeing different values. And the writeback out of CPU1's cache needs to write back and invalidate all possible older such written-to lines. By the time you make it work, it's not a cache, it's a store queue, but worse, because it has to do all these coherency operations, and barriers have to walk it and flush it, etc.


Can't I say that spinning without a barrier is doing things wrong?

I guess if I need an emergency fix I can make those lines invalidate themselves every thousand cycles.

> But also because CPU2 and CPU3 can see different values depending on the state of their caches. If one had no such line and the other had a valid-stale line, then they will end up seeing different values.

But only if there are no memory barriers, so it shouldn't be a big deal. And the valid-stale lines can't be used as the basis for new writes.

> And the writeback out of CPU1's cache needs to write back and invalidate all possible older such written-to lines.

The only such lines are in this new state that's halfway between S and I. They don't need to be marked any more invalid. Zero bus traffic there.


Yes, I saw that flaw, but I think in their model barriers are always associated with stores (so atomic_thread_fence is not implementable but store_release is), which fixes your first example. I agree that in any case you end up doing more work than in the typical model to make other scenarios work.


> In that example I don't think anything really changes. With cache coherency a release barrier makes sure all previous stores are committed to L1, and by implication any old versions of the cache line have been invalidated.

That is not what a release barrier does.

> Without cache coherency a release barrier makes sure all previous stores are committed to L1, and explicitly says that any old versions of the cache line have been invalidated.

That is not a "normal barrier" though; writeback and invalidate operations are software coherency.

> If you had a design a lot like MESI, you wouldn't really need release semantics,

This is not the case. A release barrier can be required even if your cache coherency operations complete in FIFO order, because reordering can happen before cache coherency.
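
E.g. the classic message-passing case: even if coherency traffic is perfectly ordered, the core can commit the stores to cache out of order. A sketch:

    #include <atomic>

    int data = 0;
    std::atomic<int> ready{0};

    void producer() {
        data = 42;
        // Without release, the 'ready' store may reach the (perfectly
        // coherent) cache before the 'data' store, so a reader can
        // observe ready == 1 while data is still 0.
        ready.store(1, std::memory_order_release);
    }

    void consumer() {
        if (ready.load(std::memory_order_acquire) == 1) {
            int x = data;  // with release/acquire, this must read 42
            (void)x;
        }
    }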


> That is not what a release barrier does.

> That is not a "normal barrier" though; writeback and invalidate operations are software coherency.

I think I was unclear when I said "the cache line". I meant the one containing a releasing store.

Let me try wording it a different way. A store_release requires all previous writes to be ordered before it. This obviously includes other memory addresses, but its own memory address isn't an exception. So even without cache coherency as a general rule, the nature of a release gives you all the ordering you need in this situation.

I'm sorry for mentioning the word "invalidate", because that's the implementation and not the semantics.

> This is not the case. A release barrier can be required even if your cache coherency operations complete in FIFO order, because reordering can happen before cache coherency.

So I meant acquire but I think there's also a clear solution based on exactly what I said.

The lines that could possibly be affected by the FIFO are all in the "Shared but maybe stale" state. Consume turns those into Invalid. So any read from after the Consume that gets reordered before it should see those lines as Invalid.


> I think I was unclear when I said "the cache line". I meant the one containing a releasing store.

Not sure what you mean by that.

> Let me try wording it a different way. A store_release requires all previous writes to be ordered before it. This obviously includes other memory addresses, but its own memory address isn't an exception. So even without cache coherency as a general rule, the nature of a release gives you all the ordering you need in this situation.

We're talking about cache coherency, not memory consistency. Coherency is not about ordering, it's about ensuring agents don't see stale data.

> So I meant acquire but I think there's also a clear solution based on exactly what I said.

The same goes for acquire though.

> The lines that could possibly be affected by the FIFO are all in the "Shared but maybe stale" state. Consume turns those into Invalid. So any read from after the Consume that gets reordered before it should see those lines as Invalid.

Implementation of CPU memory pipelines and cache coherency aren't really something you can just get a bit of a feel for and then handwave about.


When I talk about how the cache protocol has to do XYZ to implement a barrier, you complain that that isn't what a barrier is.

When I talk about what memory barriers do in pure terms, you complain that I'm not mentioning the cache protocol.

When I give an example of a cache protocol in isolation, you start talking about which memory barriers I'm missing.

I don't know what you want.

> Coherency is not about ordering, it's about ensuring agents don't see stale data.

Well, if I go by "if it's part of the memory model then it's not stale", then you can allow a relaxed ordering on single addresses without having stale data.

When a core takes Exclusive control of a cache line, put all other Shared copies into the state "might be an old version, but that's allowed by the protocol and the memory model".

Some instructions can read "might be an old version, but that's allowed by the protocol and the memory model" values and some can't. The exact details of "some instructions" are flexible/irrelevant. See the memory model (not provided) for details.

There. Done. That's a minimal proof of a design that doesn't always guarantee cache coherency, but can enforce as much cache coherency as you need. You don't need to add any explicit writebacks or flushes to manage it from software, and enforcing it doesn't take any significant effort beyond a normal CPU's coherency protocol.

Agents will never be confused. They know that Shared means everyone has the same value, and "might be an old version, but that's allowed by the protocol and the memory model" does not mean everyone has the same value. They know that transitioning from "might be an old version, but that's allowed by the protocol and the memory model" to Shared or Exclusive requires reading the data anew, just like transitioning from Invalid to Shared to Exclusive.

If agents want to always be as up to date as possible, they can simply not use this state. If an agent wants to be up to date some of the time, then it can allow this state but purge it at will.

This state only allows for "old" values going back to the most recent purge, so it's not a useless act to read from it. And this state can give you data faster than acquiring a Shared state, so there's a reason to use it.
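
As a sketch of the transitions I have in mind (entirely hypothetical):

    // M, E, S, I, plus the "might be an old version" state.
    enum class St { M, E, S, I, Stale };

    // Another core takes Exclusive: our Shared copy becomes Stale, not I.
    St on_remote_exclusive(St s) { return s == St::S ? St::Stale : s; }

    // Staleness-tolerant loads may be served from a Stale line.
    bool serves_stale_tolerant_load(St s) { return s != St::I; }

    // Loads that need fresh data treat Stale like Invalid: the line is
    // re-read over the fabric, moving it back to S/E, same as an I miss.
    bool needs_refetch(St s) { return s == St::I || s == St::Stale; }

    // An agent can purge at will, bounding how old a visible value can be.
    St purge(St s) { return s == St::Stale ? St::I : s; }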

> The same goes for acquire though.

I'm pretty sure the entire point of acquire is that you can't reorder reads from after it to before it.


> I don't know what you want.

I don't want anything, I was correcting your misconceptions.

> Well, if I go by "if it's part of the memory model then it's not stale", then you can allow a relaxed ordering on single addresses without having stale data.

I don't know what you're talking about. Memory ordering is not about ordering of a single address. That's cache coherency.

[snip]

> I'm pretty sure the entire point of acquire is that you can't reorder reads from after it to before it.

And you're still wrong. An acquire barrier can be required even if you receive coherency updates in sequential order. Barriers in modern processors do not flush, invalidate, or perform coherency operations.

This is what I mean when I say you can't just handwave with a basic idea about the programming semantics (which aren't particularly complicated). It's easy to think up some vaguely plausible-sounding implementation of those things, but the reality is infinitely more complicated. Real cache coherency protocols are verified with formal proofs, and not because they are easy. I guarantee if you handwave a new coherency state or give up some property of coherency, you will have bugs.


> I don't know what you're talking about. Memory ordering is not about ordering of a single address. That's cache coherency.

The ordering of a single address is relevant to both the cache protocol and the memory model.

That section is describing a cache protocol.

> And you're still wrong. An acquire barrier can be required even if you receive coherency updates in sequential order.

I agree. How does that make my statement wrong in any way?

> Real cache coherency protocols are verified with formal proofs, and not because they are easy. I guarantee if you handwave a new coherency state or give up some property of coherency, you will have bugs.

Do you think my description is impossible to fix, or are you just trying to impress on me that it's hard?

I don't feel like spending hours finding and editing a concurrency simulator today.



