I have mixed feelings about all of this. On one hand, requiring every CPU to implement strong memory-order guarantees is one of the things holding back frequencies and concurrency levels. On the other hand, weaker models make bugs so likely and those bugs are so difficult that people add excessive memory barriers and performance ends up being even worse. I don't know what the answer is, but it's a complex set of tradeoffs that I wish more people knew about.
Recent ARMs apparently have cheap explicit load-acquire and store-release instructions, which are great and map very well onto the C11 memory model. If this means they need fewer transistors or save some power, that's even better.
The issue is with some RISCs where even implementing acquire/release semantics requires expensive barriers (and seq_cst requires even more expensive ones), the only alternative being address- or control-flow-dependent loads, which are hard to reason about in higher-level languages.
And then there was Alpha, where even a load-consume required a barrier.
The exact equivalent problem occurs with C optimisations, but at least there you can either:
* change -O level to see if a bug disappears,
* or change -O level because reliability is more important to your application than performance (or vice versa).
Think of it like you would any more common garden-variety race condition. The solution is not to add sleep() calls, even if that appears to work. The solution is to put in the necessary synchronization. Failing to use the right barriers on a data structure that pings between processors is exactly the same as failing to take the right locks when accessing that data structure. It's just easier to get away with it when all the code has ever run on is x86 processors with strong ordering.
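The "take the right locks" fix looks like this (a minimal sketch; the struct and names are hypothetical): every access to the shared structure goes through the mutex, which also supplies the acquire/release ordering the hardware needs on weakly ordered machines -- no sleep() calls, no hand-rolled flags.

```c
#include <pthread.h>

/* A hypothetical shared structure that "pings between processors". */
struct counter {
    pthread_mutex_t lock;
    long value;
};

static struct counter c = { PTHREAD_MUTEX_INITIALIZER, 0 };

/* The mutex provides both mutual exclusion and the necessary barriers:
   pthread_mutex_lock/unlock act as acquire/release operations. */
static void counter_add(struct counter *cnt, long n)
{
    pthread_mutex_lock(&cnt->lock);
    cnt->value += n;
    pthread_mutex_unlock(&cnt->lock);
}

static long counter_get(struct counter *cnt)
{
    pthread_mutex_lock(&cnt->lock);
    long v = cnt->value;
    pthread_mutex_unlock(&cnt->lock);
    return v;
}
```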
P.S. This same processor did at one point have a bug that prevented locks from working sometimes, but that's a whole different horror story.
How does it hold back the frequency? I would have thought it would mean you can do less in each clock cycle, but not limit the frequency.
* More work and communication to be done per cycle means slower cycles.
* More circuitry means more heat to dissipate.
There are probably others, but I'm not a chip designer myself. I just sat next to a few.
Are you sure about that? Modern x86 CPUs are fast, can have a lot of cores, and actually even have quite fast atomics compared to their current competition (Arm), IIRC. Which is very useful for exploiting your multiple cores with fancy parallel algorithms.
I don't think this kind of thing is holding back frequencies today. Most of the time thermals are. And that's probably the same for pretty much all the things that fall under transistor budget constraints. The reduced approach has "failed", because good HW is "always" faster, and now the transistor budget is so high that clever HW will be what matters for a while. That even began a "few" years ago, I would say.
Back to the more precise subject: you have that huge OOO engine anyway (if you want perf); would you gain that much speed (even just theoretically) by not loading, retiring and queuing in order? You have to be cache coherent anyway, and if you want real gains then you have to avoid cache-line bounces. On the store side I don't believe it. On the load side, hm, maybe some small gains? Maybe that is even one of the things that lets the A12 be competitive? It does not seem to be a hard stop though, because pretty much all uarch structures continue to grow, with the usual associated IPC gains.

So no, first, I don't believe at all that this is holding back "frequencies and concurrency levels"; maybe IPC, and maybe even efficiency, and because of those two maybe even frequency, but only indirectly (boosting for longer while the thermals are OK). For sure we have not hit any wall even with quite strong ordering.

The finer programming that would be required is a joke, too; it is well known that programming with atomics should be done using sequential consistency unless you are an extraordinarily good expert and are ready to spend a huge amount of time on verification. So it is kind of back to the beginning: in pretty much all cases your CPU had better be fast enough for sequential consistency. At that point it will likely be possible to make a quite strong default ordering fast as well, so why not keep it convenient and reduce the likelihood of bugs?
I think for general-purpose CPUs, strong ordering is here to stay (and the weaker ISAs still on the market will eventually become stronger), kind of the same way the page-table-walk, prefetching, speculation, etc. stories went (they all had their moment of attempting simpler HW but more complex SW; it never worked in the end, because the HW can act as a kind of JIT optimizer and self-adapt to workloads, thanks to e.g. predictors, way more efficiently than SW could). Of course you will "always" have the possibility of load-store reordering, because this is a logical perf issue from first principles, not just a quality-of-implementation problem (well, you could avoid it by being extremely slow with the current uarch approach, but that is so impractical that we may as well consider it not "fixable" -- then maybe a radically different arch could get it fast, but I really don't believe it).
While the x86 bus architecture is very forgiving, its guarantees only extend up to the level of assembly language. Compilers are happy to re-order operations that the machine has so carefully sequenced.
At the source-code level, therefore, you need to use "atomic" data types and operations (and use them carefully) just to retain the deterministic semantics the underlying machine implements.
That's the bad news. The good news is twofold: using atomics on x86 doesn't cost any extra (i.e., you pay for fully ordered operations whether you rely on them or not), and doing it right makes your code portable to machines with "relaxed" ordering, where you only pay for ordering if you ask for it.
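Concretely, the portability win looks like this (a sketch with made-up names): C11 atomics let you state exactly how much ordering each operation needs. On x86 the stronger orders cost roughly nothing extra, while a weakly ordered machine can make the relaxed operations cheaper.

```c
#include <stdatomic.h>

static atomic_long hits = 0;

/* A pure statistics counter: no other data is published through it, so
   relaxed ordering suffices. On x86 this compiles to the same locked add
   as a seq_cst increment would; on a weakly ordered machine it can skip
   the barriers. */
static void count_hit(void)
{
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

/* Default (seq_cst) operations are fully ordered everywhere. On x86 a
   seq_cst load is just a plain MOV -- you were paying for the ordering
   whether you asked for it or not. */
static long read_hits(void)
{
    return atomic_load(&hits);
}
```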
Intel and AMD manage to make a strongly ordered bus system almost as fast as a relaxed one by throwing enormous numbers of transistors at the problem. It costs more power, and is the very devil to get right, but it makes wrong programs more likely to get the right answer anyway.
I mean, yeah, compilers can reorder your instructions at build time just like CPUs can at runtime. But neither will reorder across a call to pthread_mutex_lock(). Use that. Doing otherwise is just a footcannon, unless you happen to be actually implementing a locking primitive.
Edit: it prevents store forwarding though, which is probably the attack rurban has in mind.
Edit2: or more precisely it prevents any attack on the store load aliasing predictor