
Who ordered memory fences on an x86? (2008) - luu
https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
======
notacoward
I worked on a CPU with very relaxed ordering once. It was a MIPS variant,
designed by the people who had previously worked on DEC Alpha. Unfortunately,
there hadn't been an architecture _since_ Alpha that had such a relaxed model,
so there was code in various parts of the Linux kernel that had gotten sloppy
about memory barriers. I remember at least one in NBD, at least one in NFS,
multiple in Lustre. Each one was a major PITA to debug, because (a) there was
no evidence in memory of what went wrong and (b) the code would literally only
fail once in a billion times. So there you are, looking at register values in
a crash dump and trying to guess exactly which memory locations had to be read
in which incorrect order to get to the state you're seeing. Ugh.

I have mixed feelings about all of this. On one hand, requiring every CPU to
implement strong memory-order guarantees is one of the things holding back
frequencies and concurrency levels. On the other hand, weaker models make bugs
so likely and those bugs are so difficult that people add _excessive_ memory
barriers and performance ends up being even worse. I don't know what the
answer is, but it's a complex set of tradeoffs that I wish more people knew
about.

~~~
chrisseaton
> requiring every CPU to implement strong memory-order guarantees is one of
> the things holding back frequencies

How does it hold back the _frequency_? I would have thought it would mean you
can do less in each clock cycle, but not limit the frequency.

~~~
garmaine
Because electrical signals, required for coordination, can only travel so far
in one cycle.

~~~
chrisseaton
Right, so you need more cycles to complete each operation, don't you? Why
can't you run the same number of cycles though?

~~~
notacoward
Memory ordering isn't the kind of thing that can be neatly separated from your
instruction scheduler, register file, etc. Some of the communication and
checking that's necessary to maintain strong ordering has to be handled within
each unit each cycle. I'm probably doing a poor job explaining it, though.
Maybe someone even closer to these issues can do better.

------
ncmncm
The CPU core gives, and the compiler takes away.

While the x86 bus architecture is very forgiving, its guarantees only extend
up to the level of assembly language. Compilers are happy to re-order
operations that the machine has so carefully sequenced.

At the source-code level, therefore, you need to use "atomic" data types and
operations (and carefully) just to retain the deterministic semantics the
underlying machine implements.

That's the bad news. The good news is twofold: using atomics on x86 doesn't
cost any extra (i.e., you pay for fully ordered operations whether you rely on
them or not), and doing it right makes your code portable to machines with
"relaxed" ordering, where you only pay for ordering if you ask for it.

Intel and AMD manage to make a strongly ordered bus system almost as fast as a
relaxed one by throwing enormous numbers of transistors at the problem. It
costs more power, and is the very devil to get right, but it makes wrong
programs more likely to get the right answer anyway.
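
To make the point about atomics concrete, here is a minimal C++11 sketch of the usual publish/consume pattern. The function and variable names are hypothetical and not taken from the article or the thread. The release store and acquire load stop both the compiler and the CPU from reordering the payload access across the flag. On x86, as far as I know, both compile to plain `mov` instructions, which is the "doesn't cost any extra" case; the default `memory_order_seq_cst` store is the exception and typically costs an `xchg` or an `mfence`.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper for illustration only.
// One thread publishes a plain int through an atomic flag,
// the other waits for the flag and then reads the int.
int publish_and_consume() {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // flag that publishes the payload
    int seen = -1;

    std::thread producer([&] {
        payload = 42;  // 1. write the data
        // 2. publish: the write above may not be moved below this store
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&] {
        // 3. wait: the read below may not be moved above this load
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // 4. happens-after step 1, so this reads 42
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Calling `publish_and_consume()` returns 42. If `ready` were a plain `bool`, the program would have a data race and the compiler would be free to hoist the load out of the loop or reorder the two writes, even on x86.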

~~~
ncmncm
Something that has changed since 2008 is that standardized languages, such as
C11 and C++11, actually have atomic types.
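
As an illustration of what those standard atomic types buy you, here is a sketch of the classic store-buffering litmus test, written with C++11 `std::atomic`. It exercises the one reordering x86 itself permits: a store followed by a load from a different address. The function name is made up for this example. With the default `memory_order_seq_cst`, the outcome where both threads read 0 is forbidden, and the compiler emits whatever fence or locked instruction the target needs to guarantee that.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper for illustration only.
// Returns true if both threads read 0, i.e. each load was effectively
// performed before the other thread's store became visible.
bool both_read_zero() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);   // store to x ...
        r1 = y.load(std::memory_order_seq_cst);  // ... then load from y
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);   // store to y ...
        r2 = x.load(std::memory_order_seq_cst);  // ... then load from x
    });

    t1.join();
    t2.join();
    // Sequential consistency puts all four operations in one total order,
    // so at least one of the loads must observe the other thread's store.
    return r1 == 0 && r2 == 0;
}
```

With `seq_cst`, `both_read_zero()` always returns false. Weakening all four operations to `memory_order_relaxed` makes the 0/0 outcome legal, and it can then show up even on x86 because of the store buffer.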

------
dang
2014:
[https://news.ycombinator.com/item?id=8572160](https://news.ycombinator.com/item?id=8572160)

------
rurban
Nowadays you also need an mfence for memset_s, so as not to leave secrets in
the cache, since those caches can be read via side-channel attacks (Spectre).

~~~
pkaye
mfence clears the cache?

~~~
gpderetta
It does not.

Edit: it prevents store forwarding though, which is probably the attack rurban
has in mind.

Edit2: or, more precisely, it prevents any attack on the store-load aliasing
predictor.

~~~
rurban
Thanks for the clarification. It only adds two fences, one into the load
ordering buffer and one into the store ordering buffer (not I/O). But this has
as dramatic a performance impact as a "clear cache", which is why I usually
describe it that way. It just sets something like two dirty flags.

