
Who ordered memory fences on an x86? (2008) - luu
http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
======
davidtgoldblatt
By the way, in case anyone was curious about the comment in the article:

> Since fences are so expensive, couldn’t you add a dummy assembly instruction
> that prevents the X86 from re-ordering? So the pseudo assembly for thread 0
> might become:

    store(zeroWants, 1)
    store(victim, 0)
    r0 = load(victim) ; NEW DUMMY INSTRUCTION
    r0 = load(oneWants)
    r1 = load(victim)

> Since the dummy instruction is a load, the X86 can’t reorder the other loads
> before it. Also, since it loads from “victim”, the X86 can’t move it before
> the store to “victim”.

> If you do this to both threads, does that solve the problem?

This doesn't work. Intel specifically calls out these sorts of attempts to get
a fake memory barrier: "The memory-ordering model allows concurrent stores by
two processors to be seen in different orders by those two processors;
specifically, each processor may perceive its own store occurring before that
of the other." This is true in practice as well as in theory.

A related trick that _does_ work in practice (though it is also banned by
Intel) is to write to the low-order bytes of a word, read the entire word, and
get the high-order bytes. It's sort of a single-word StoreLoad barrier.
There's a paper from Sun that documents it further:
[http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf](http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf)

------
userbinator
Reminds me of this interesting paper (2 years later) which found at least one
of the x86 memory ordering guarantees was not true:
[http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf](http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf)

~~~
wolfgke
For an overview of papers for x86 memory ordering of the same group, see
[http://www.cl.cam.ac.uk/~pes20/weakmemory/index3.html](http://www.cl.cam.ac.uk/~pes20/weakmemory/index3.html)

------
mattnewport
This article is pretty old, and I suspect that if you asked Bartosz now he'd
explain it slightly differently. You certainly need instructions that impose
additional ordering guarantees on x86 in these kinds of situations, but you
don't need an mfence, and in general mfence will be slower than an appropriate
locked instruction. The appropriate uses of mfence are actually much more
limited: AFAIK it's only really needed with special memory types (e.g. write-
combining) or when you need ordering on certain non-temporal loads and stores
(mostly vector instructions).

In regular code you should never require the hammer of mfence for correct
synchronization. You can implement C++11 atomics without it.

~~~
johnbender
It doesn't look like the LOCK prefix applies to MOV (from a quick google)? So
how does it address write buffering or OOE for stores in a TSO memory model?

[edit] "never require the hammer of mfence for correct synchronization", maybe
you're confining this to correct synch. and not recovering sequential
consistency (or some other semantic property).

~~~
mattnewport
It's been a little while since I looked closely at this, and it's easy to get
this stuff wrong, but here's what I concluded when digging into a codebase I
was maintaining that made heavy use of lock-free techniques:

\- You don't _need_ mfence (or sfence / lfence) on x64 (which is actually what
I care about rather than x86, though they're essentially the same in this
respect) to correctly implement C++11 atomics with acquire / release
semantics. You can get all the guarantees you need with locked instructions.

\- C++11 atomics correctly implemented with locked instructions will be as
fast as or faster than implementations using explicit fence instructions.

\- The obvious way to implement standalone fences on x64 is with fence
instructions but you rarely if ever need standalone fences. Generally you are
better off using atomic operations with explicit acquire / release semantics.

In the codebase I was maintaining, the pre-C++11 atomic library was based
around explicit fences rather than C++11 style atomic operations with acquire
/ release semantics attached to the operations themselves. This code was
primarily written for Gen 3 consoles (PS3 / Xbox 360) and so was optimized for
PowerPC. On x64 (Gen 4 consoles!) there was measurable performance overhead
due to unnecessary/redundant standalone fences.

In the end we decided it was too risky to try to rewrite everything in terms
of atomic operations with acquire/release semantics and remove the standalone
fences, but it seems to me that if you want to write efficient cross-
architecture lock-free code, you should avoid standalone fences and use a
C++11-style atomics library where acquire/release semantics are tied to the
atomic operations themselves.

------
redraga
Remember that x86 (and SPARC) offer the strongest memory ordering guarantees
among modern processors. The POWER and ARM memory models are weaker than x86.
This actually leads to correctness issues when virtualizing a multi-core x86
guest on a weaker host (cross-ISA virtualization). Of course, this problem
only shows up in truly parallel emulators using multiple threads on the host
to emulate a multi-core guest, such as COREMU
([http://sourceforge.net/projects/coremu/](http://sourceforge.net/projects/coremu/)).

------
pkhuong
The fun thing about membars on x86 is that, unless you're playing with
nontemporal stores or non-standard memory types, LOCKed ops are more efficient
fences than mfence.

~~~
mattnewport
Missed this somehow before making my comment saying essentially the same
thing. Unfortunately the code base I've been maintaining heavily overuses
mfences at a measurable performance penalty on x64.

------
jhallenworld
I'm pretty sure the strong ordering of x86 is all in support of backward
compatibility (probably going back to the first multiprocessor x86 systems).
One related example of
this is the cache coherent I/O system. If a PCIe card writes to memory, there
is really not much that the driver code needs to worry about compared with
other processors. Why is this? So the ancient device drivers in MS-DOS will
still work with the ancient floppy disk DMA controller.

------
amelius
> Loads are not reordered with other loads.

Wonder why that guarantee is necessary. Loads have no side effects on memory,
after all.

~~~
rayiner
To add to the example of 'DSingularity: remember that the memory ordering
guarantees specify the behavior of _individual processors_ in a multi-
processor system. You have to extrapolate the system behavior from there. The
guarantee on reordering of loads with respect to each other is only
unnecessary if you assume that no other processor modifies those locations
between loads.

~~~
amelius
Ah yes, I understand it now. Took me some time to get myself thinking on that
abstraction level again :) Thanks!

------
m00dy
Peterson's lock can be generalized to more than 2 threads.

