
There is no "cache flushing" when barriers or other instructions are used to ensure seqcst on any system I am aware of. Such a system would perform so slowly as to be practically unusable.

At most there is usually a local dispatch stall while store buffers (not caches) are drained and pending loads complete. In some scenarios, like a cache-line-crossing atomic operation on x86, you might need additional non-local work such as asserting a lock signal on an external bus. There might be some other work, such as ensuring that all invalidation requests that have already arrived have been processed.

Still, you are talking in the range of 10s or maybe low 100s of cycles. Nothing like a cache flush, which would probably cost 10,000s of cycles or more (depending on how large the caches are and where the data to reload them comes from).
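
To make that concrete, here is a minimal C++ sketch (my own toy example, not anything from the article): a seq_cst store is typically compiled on x86-64 to an xchg (or a mov followed by mfence). The CPU drains the store buffer and briefly stalls dispatch, but no cache lines are written back or invalidated.

    // Toy example: a seq_cst "publish" store.
    // On x86-64 this typically compiles to `xchg` (or `mov` + `mfence`),
    // which drains the store buffer; it does not flush any caches.
    #include <atomic>

    std::atomic<int> ready{0};

    void publish() {
        ready.store(1, std::memory_order_seq_cst);
    }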



Exactly.

I just wanted to add that #StoreLoad fences (e.g. mfence on x86), as far as I know, do not usually actually flush the store buffer per se. They just stall the pipeline (technically they only need to stall any load) until all stores prior to the fence have been flushed out by normal store buffer operation, i.e. the store buffer is always continuously draining as fast as possible anyway.

You didn't imply otherwise, but I wanted to clarify that because I have seen comments elsewhere and in code claiming that a fence makes a prior store visible faster (i.e. the fence was added to improve latency instead of being required for correctness), which I do not think is the case, at least at a microarchitectural level (things are more complex when a compiler is involved, of course).
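
The canonical case where the #StoreLoad fence actually matters is the store-buffering litmus test. A minimal C++ sketch (variable names are my own):

    // Store-buffering litmus test (toy example). Without the seq_cst
    // fences, r1 == 0 && r2 == 0 is allowed: each thread's store can
    // still be sitting in its store buffer when the other thread loads.
    // With the fences, that outcome is forbidden - the fence is there
    // for correctness, not to make the stores visible "sooner".
    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // mfence on x86
        r1 = y.load(std::memory_order_relaxed);
    }

    void t2() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join();
        b.join();
        std::printf("r1=%d r2=%d\n", r1, r2);
    }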


Yes, that's right. As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but let other ops through. I think allocation blocks when the first load/store arrives while the pipeline is in that state - so you can hide a lot of the cost of atomic operations by following them with as long a run of non-load/store operations as possible (of course, this is often not possible).

mfence used to work like that, but due to various bugs/errata it was "upgraded" so that it now blocks execution of all subsequent instructions until it retires (like mfence), in addition to its store draining effects. So mfence is actually a slightly stronger barrier than atomic operations (but the difference is only apparent with non-WB memory).

If you want to be totally pedantic, it may be the case that mfence or another fencing atomic operation results in the stores being visible faster: because they block further memory access instructions, there can be less competition for resources like fill buffers, so it is possible that the stores drain faster.

For example, Intel chips have a feature where cache lines targeted by stores other than the one at the head of the store buffer can be fetched early, so-called "RFO prefetch" - this gives memory-level parallelism (MLP) in the store pipeline. However, this will be limited by the available fill buffers, and perhaps also by heuristics that ramp this feature back when fill buffers are heavily used even if some are available (since load latency is generally way more important than store latency).

So something like an mfence/atomic op blocks later competing requests and gives stores the quietest possible environment to drain. I don't think the effect is very big though, and you could achieve the same effect by, e.g., just putting a bunch of nops after the "key" store (although you wouldn't know how many to put).
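
To illustrate the scheduling idea (a toy sketch of my own, not something measured here): follow the atomic RMW with a stretch of register-only ALU work, so nothing queues up behind the blocked load/store ports while the store buffer drains.

    // Toy example: hide part of a locked RMW's cost behind independent
    // ALU-only work (no loads or stores) that can issue while the
    // atomic completes and the store buffer drains.
    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> counter{0};

    uint64_t bump_and_mix(uint64_t seed) {
        counter.fetch_add(1, std::memory_order_seq_cst); // locked RMW

        uint64_t h = seed;
        for (int i = 0; i < 32; ++i) {   // register-only work
            h ^= h >> 33;
            h *= 0xff51afd7ed558ccdULL;
        }
        return h;
    }

How much this buys you depends on the surrounding code and on the compiler keeping the ALU work after the atomic rather than hoisting it.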


> As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but let other ops through. I think allocation blocks when the first load/store arrives while the pipeline is in that state.

That's great to know. I suspected that was the case and that they had moved away from the stall-the-pipeline approach, but I had never tested it.


Sorry that should say "(like lfence)" not "(like mfence)".


> There is no "cache flushing" when barriers or other instructions are used to ensure seqcst on any system I am aware of.

Good to know. I've seen enough of your other posts to trust you at your word.

BTW: I'll have you know that AMD GPUs implement "__threadfence()" as "buffer_wbinvl1_vol", which is documented as:

> buffer_wbinvl1_vol -- Write back and invalidate the shader L1 only for lines that are marked volatile. Returns ACK to shader.

So I'm not completely making things up here. But yes, I'm currently looking at some GPU assembly which formed the basis of my assumption. So you can mark at least ONE obscure architecture that pushes data out of the L1 cache on a fence / memory barrier!


GPU L1 caches are typically not coherent, so flushing them is necessary on GPU architectures.


Right, I could have been clearer that I was restricting my comments to general purpose CPUs.



