Everyone has; even the hardware vendors are treating these rules as the gold standard to optimize for and conform to. So it's too late. But it's still a shame.
The acquire/release semantics make perfect sense for locks, because that's what they were designed for. Designing anything else with them is extremely difficult, and in practice such work tends to lean heavily on the full-barrier "sequentially consistent" primitive instead. The older read/write barrier instructions were much simpler. (And x86 is simpler still, with just one reordering rule to worry about and a single notion of serializing instructions to regulate it.)
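For concreteness, here's the lock use case acquire/release was designed around, as a minimal C++11 sketch (a toy spinlock of my own naming, not production code):

    #include <atomic>

    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            // Acquire: nothing after the lock may be reordered before it,
            // so the critical section sees everything the last holder wrote.
            while (locked.exchange(true, std::memory_order_acquire)) {
                // spin
            }
        }
        void unlock() {
            // Release: nothing before the unlock may be reordered after it,
            // so everything written in the critical section is published.
            locked.store(false, std::memory_order_release);
        }
    };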
Hardware vendors adhering to the C++ memory model to design their chips? I think you got it backwards, or I misunderstood your point. Memory barriers exist because of how the cache memory hierarchy and coherency protocols are implemented in multi-core chips, not to "optimize and conform to the C++ memory model".
The C++ memory model is there to make the programmer's life easier across different micro-architectures, because the memory model of the micro-architecture is what differs.
There is an argument, particularly from C++ people, that the CPU architectures designed after C++ 11 shipped with the Acquire/Release model have all chosen to provide features that can do Acquire/Release smoothly.
After all, if the software most people run rewards CPUs that do X and punishes CPUs that do Y, it's a tough sell to design your CPU to do Y: ordinary users will get the idea it's worse than it really is, and your sales people won't like that. It seems reasonable to argue, for example, that the reason modern x86-64 CPUs ignore the weird x87 FPU features is that nobody's programs want those weird features.
I remember in the mid-1990s there was a lot of hype for high clock frequency 80486s and clones because they had very good integer performance. However, Intel's Pentium had excellent FP performance at similar or lower clocks. The video game Quake needed floating point, and so Quake on my 100 MHz Pentium was very quick, while on a friend's 100 MHz 486 it was pretty nasty. Increased sales of the Pentium were, I believe, widely attributed to Quake showing off the FP performance (it's nothing to look at in today's terms; we did not have GPUs back then).
I have never worked for a CPU design firm or CPU manufacturer, so I can't speak to whether this is in practice a meaningful influence.
While influential, I don't think C++11 was the primary reason for architectures standardising around acquire/release. I think that, with the advent of commodity multi-core designs, the time was ripe to move beyond haphazard, ad hoc, and underspecified descriptions of hardware ordering rules to more formal and rigorous models.
Acquire/release happened to be a good fit: easy enough to program for, and relaxed enough to map relatively closely onto existing architectures. So C++11 embracing it was just moving with the zeitgeist.
For example, Itanium got acquire/release operations well before the work on the C++11 memory model even started.
> There is an argument, particularly from C++ people, that the CPU architectures designed after C++ 11 shipped with the Acquire/Release model have all chosen to provide features that can do Acquire/Release smoothly.
I've never heard that argument (a source would be good?), but this is very different from what the parent comment said. Code does not run in a vacuum - it runs on the CPU. And the CPU does not exist in a vacuum - it's there to run that very code, so the two are deeply interdependent and intertwined. Just as CPU designs change to accommodate new algorithmic requirements, code changes to make use of new CPU designs.
So, of course, CPU vendors will do everything in their control to make a new chip design appealing to their customers. This has been done forever. If that means spending extra transistors to run some C++-heavy data-center code 10% faster, of course they will do it - there's a very large incentive to.
But that doesn't mean that CPU vendors are designing their chips to accommodate abstract programming language models. In this case, memory models.
Probably one of the easiest examples of this practice to understand, and the one that comes to mind right now, is Jazelle - ARM CPU circuitry designed to execute Java bytecode directly in the CPU itself.
I'm pretty sure there's a Herb Sutter C++ talk which explicitly connects newer CPUs gaining instructions well suited to Acquire/Release with the C++11 memory model. I have a lot of Herb's talks in my YouTube history, so figuring out which one I meant will be tricky. Maybe one of the versions of "Atomic Weapons"? This idea is out there more generally, though.
I don't think I agree that this means the memory model doesn't infect the CPU design. Actually, I don't think I agree more generally either. For example, I would say that Rust strings are fast despite the fact that modern CPUs have gone out of their way to privilege the C-style zero-terminated string. There are often several opcodes that exist explicitly for doing stuff you'd never want except that C-style strings exist. They would love to be faster, but it's just not a very fast type, so there's not much to be done.
Contrast this with, say, bit count, which is a good idea despite being tricky to express in C or a C-like language. In a language like C++ or Rust you provide it as an intrinsic, but long before that happened CPU vendors included the instruction because it's fundamentally a good idea - it's cheap for the CPU and useful for the programmer; the C language is just in the way here. "Idiom recognition" was used in compilers to detect that, OK, this function named "count_pop" is actually the bit-count operation, so just emit that instruction if the target architecture has it. More fragile than intrinsics (because it's a compiler optimisation) but effective.
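To make that concrete, here's the rough shape of the idiom next to the C++20 intrinsic route (function names are just illustrative):

    #include <cstdint>
    #include <bit>  // C++20: std::popcount

    // The kind of loop compilers try to pattern-match into a single popcnt
    // instruction (Kernighan's trick: each iteration clears one set bit).
    int count_pop(uint32_t x) {
        int n = 0;
        while (x) {
            x &= x - 1;
            ++n;
        }
        return n;
    }

    // The intrinsic route: no pattern matching needed; the compiler
    // emits popcnt (or a fallback sequence) directly.
    int count_pop_intrinsic(uint32_t x) {
        return std::popcount(x);
    }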
At an even higher level, from the point of view of a CPU designer, it would be great to do away with cache coherence. You can go real fast with fewer transistors if only the stupid end users can accept that there's no good reason why cache A over here, near CPU core #0, should be consistent with cache D, way over on another physical CPU, near CPU core #127. Alas, it turns out that writing software for a non-coherent system hurts people's brains too much, so we've resolutely not been doing that. But that's exactly a model choice - we reject the model where the cache might not be coherent. Products which lack cache coherence struggle to sell.
The RISC-V designers explicitly state in their base specification: "The AMOs were designed to implement the C11 and C++11 memory models efficiently." So there's at least one concrete example.
I think it is just a generational/educational difference. I learned lock-free programming around the time C++11 was being standardized, and the acquire/release model seems very natural to me; it's easier to understand and to model algorithms in than the barrier-based model.
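For instance, the canonical message-passing pattern reads quite naturally in acquire/release terms, with no explicit barrier placement to reason about (a minimal sketch; the variable names are mine):

    #include <atomic>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publish it
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        // The acquire load synchronizes-with the release store, so this
        // read is guaranteed to see 42; there is no data race.
        int v = payload;
        (void)v;
    }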