
There is no "cache flushing" when barriers or other instructions are used to ensure seqcst on any system I am aware of. Such a system would perform so slowly as to be practically unusable.

At most there is usually a local dispatch stall while store buffers (not caches) are drained and pending loads complete. In some scenarios, like a cache-line-crossing atomic operation on x86, you might need additional non-local work such as asserting a lock signal on an external bus. There might be some other work, such as ensuring that all invalidation requests that have already arrived have been processed.

Still, you are talking in the range of 10s or maybe low 100s of cycles. Nothing like a cache flush, which would probably cost 10,000s of cycles or more (depending on how large the caches are and where the data to reload them comes from).
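
To make that concrete, here is a minimal C++ sketch (my own toy example, not anything from the article): a seq_cst store is typically compiled on x86-64 to an xchg (or a mov followed by mfence). The CPU drains the store buffer and briefly stalls dispatch, but no cache lines are written back or invalidated.

    // Toy example: a seq_cst "publish" store.
    // On x86-64 this typically compiles to `xchg` (or `mov` + `mfence`),
    // which drains the store buffer; it does not flush any caches.
    #include <atomic>

    std::atomic<int> ready{0};

    void publish() {
        ready.store(1, std::memory_order_seq_cst);
    }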



Exactly.

I just wanted to add that #StoreLoad fences (e.g. mfence on x86), as far as I know, do not usually actually flush the store buffer per se. They just stall the pipeline (technically they only need to stall any load) until all stores prior to the fence have been flushed out by normal store buffer operation, i.e. the store buffer is always continuously draining as fast as possible anyway.

You didn't imply otherwise, but I wanted to clarify that because I have seen comments elsewhere and in code claiming that a fence makes a prior store visible faster (i.e. the fence was added to improve latency instead of being required for correctness), which I do not think is the case, at least at a microarchitectural level (things are more complex when a compiler is involved, of course).
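
The canonical case where the #StoreLoad fence actually matters is the store-buffering litmus test. A minimal C++ sketch (variable names are my own):

    // Store-buffering litmus test (toy example). Without the seq_cst
    // fences, r1 == 0 && r2 == 0 is allowed: each thread's store can
    // still be sitting in its store buffer when the other thread loads.
    // With the fences, that outcome is forbidden - the fence is there
    // for correctness, not to make the stores visible "sooner".
    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // mfence on x86
        r1 = y.load(std::memory_order_relaxed);
    }

    void t2() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join();
        b.join();
        std::printf("r1=%d r2=%d\n", r1, r2);
    }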


Yes, that's right. As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but let other ops through. I think allocation blocks when the first load/store arrives while the pipeline is in that state - so you can hide a lot of the cost of atomic operations by following them with as long a run of non-load/store operations as possible (of course, this is often not possible).

mfence used to work like that, but due to various bugs/errata it was "upgraded" so that it now blocks execution of all subsequent instructions until it retires (like mfence), in addition to its store draining effects. So mfence is actually a slightly stronger barrier than atomic operations (but the difference is only apparent with non-WB memory).

If you want to be totally pedantic, it may be the case that mfence or another fencing atomic operation results in the stores being visible faster: because they block further memory access instructions, there can be less competition for resources like fill buffers, so it is possible that the stores drain faster.

For example, Intel chips have a feature where cache lines targeted by stores other than the one at the head of the store buffer can be fetched early, so-called "RFO prefetch" - this gives memory-level parallelism (MLP) in the store pipeline. However, this will be limited by the available fill buffers, and perhaps also by heuristics that ramp this feature back when fill buffers are heavily used even if some are available (since load latency is generally way more important than store latency).

So something like an mfence/atomic op blocks later competing requests and gives stores the quietest possible environment to drain. I don't think the effect is very big though, and you could achieve the same effect by, e.g., just putting a bunch of nops after the "key" store (although you wouldn't know how many to put).
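
To illustrate the scheduling idea (a toy sketch of my own, not something measured here): follow the atomic RMW with a stretch of register-only ALU work, so nothing queues up behind the blocked load/store ports while the store buffer drains.

    // Toy example: hide part of a locked RMW's cost behind independent
    // ALU-only work (no loads or stores) that can issue while the
    // atomic completes and the store buffer drains.
    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> counter{0};

    uint64_t bump_and_mix(uint64_t seed) {
        counter.fetch_add(1, std::memory_order_seq_cst); // locked RMW

        uint64_t h = seed;
        for (int i = 0; i < 32; ++i) {   // register-only work
            h ^= h >> 33;
            h *= 0xff51afd7ed558ccdULL;
        }
        return h;
    }

How much this buys you depends on the surrounding code and on the compiler keeping the ALU work after the atomic rather than hoisting it.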


> As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but let other ops through. I think allocation blocks when the first load/store arrives while the pipeline is in that state.

That's great to know. I suspected that was the case and that they had moved away from the stall-the-pipeline approach, but I had never tested it.


Sorry that should say "(like lfence)" not "(like mfence)".


> There is no "cache flushing" when barriers or other instructions are used to ensure seqcst on any system I am aware of.

Good to know. I've seen enough of your other posts to trust you at your word.

BTW: I'll have you know that AMD GPUs implement "__threadfence()" as "buffer_wbinvl1_vol", which is documented as:

> buffer_wbinvl1_vol -- Write back and invalidate the shader L1 only for lines that are marked volatile. Returns ACK to shader.

So I'm not completely making things up here. But yes, I'm currently looking at some GPU assembly which formed the basis of my assumption. So you can mark at least ONE obscure architecture that pushes data out of the L1 cache on a fence / memory barrier!


GPU L1 caches are typically not coherent, so flushing them is necessary on GPU architectures.


Right, I could have been clearer that I was restricting my comments to general purpose CPUs.



