
This Is Why They Call It a Weakly-Ordered CPU - octopus
http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu
======
ComputerGuru
Nice blog post, though I personally prefer the ridiculousfish post [0] he
links to in the end, that one's an instant classic.

He mentions Windows/x86 a couple of times. I only _wish_ it were as simple as
"this platform does not reorder." Having done low-level, heavily-multithreaded
work on Windows for years: it'll behave like a strongly-ordered architecture
999 times out of a 1000 (or more). Then it'll bite you in the ass and do
something unexpected. Basically, if you're doing your own synchronization
primitives on x86, you have to pretty much rely on visual/theoretical
verification because tests won't error out w/ enough consistency. I've run a
test (trying to get away w/ not using certain acquire/release semantics) for
an entire week, only to have it error out at the last second (x86_64). Other
times, I've shipped code that's been tested and vetted inside out for months,
only to have the weirdest bug reports 3 or 4 months down the line in the most
sporadic cases.

0: <http://ridiculousfish.com/blog/posts/barrier.html>

~~~
Dylan16807
Well, to be specific about x86:

Reads always happen in the original order.

Writes always happen in the original order.

Reads can be moved in front of writes.

~~~
callan
I work for Intel. This is not correct. A lot depends on your cache type. The
two basic ones are uncacheable and write-back.

What you wrote is true for UC. For WB, reads can happen in any order
(especially due to cache pre-fetchers). Writes always happen in program order,
unless you are in other cache types such as write-combining. WC is mainly used
for graphics memory-mapped pixmaps, where the order doesn't matter.

But don't let this scare you too much. From the viewpoint of a single CPU,
everything is in order*. It's only when you look at it from the memory bus
point-of-view that things get confusing.

* Unless we are dealing with a memory-mapped IO device that has read side-
effects, in which case you need to carefully choose a cache type.

~~~
dspeyer
How do I control my cache type?

~~~
yuhong
I don't think you can from a user mode program. If you are dealing with normal
memory, it will typically be WB, and that is the only thing most user-mode
programs will encounter.

------
nkurz
I'm oddly uncomfortable with this article. It reinforces the idea of Memory
Ordering as voodoo, rather than as something that can (and needs to be!)
understood to properly write low level multicore code. Neither it nor the
linked articles go into any details of how memory and cores actually interact,
and without these details it would be very hard to get from "this seems to
work" to "this is bug free".

 _You can try running the sample application on any Windows, MacOS or Linux
machine with a multicore x86/64 CPU, but unless the compiler performs
reordering on specific instructions, you'll never witness memory reordering at
runtime._

It may just be poor wording, but I don't think this sentence makes sense -- it
conflates compiler optimizations with memory reordering, and implies that this
is dependent on the choice of operating system. While the author probably didn't
mean this, it's clear from some of the comments in this thread that this is
causing confusion to readers. Worse, it's just not true --- while this
particular example might not cause problems, memory reordering is still an
issue that needs to be dealt with on x86.

Analogies can be helpful for intuition, but I think this is a case where one
really needs to understand what's happening under the hood. Treating the CPU
as a black box is not a good idea here, and test-driven development is
probably not a good approach to writing mutexes. Calling attention to the
issue is great, but this is an area where you really want to know what exactly
guarantees your processor provides, rather than trying things until you find
something that seems to work.

~~~
klodolph
> It reinforces the idea of Memory Ordering as voodoo

I thought there were some very solid bits in there, like the reminder that
memory barriers always come in pairs.

> it conflates compiler optimizations with memory reordering

Well, the code as written is subject to reordering both by the compiler and
the CPU. The author is just being honest that this isn't a perfectly crafted
example. You could make a version not subject to reordering by the compiler,
either by marking variables as volatile or by inserting "compiler barriers"
into the code (e.g., empty volatile asm statements). I don't think this is
conflation.

> Worse, it's just not true --- while this particular example might not cause
> problems,

I think what happened is that you interpreted the author's statement as one
about memory reordering in general on the x86 architecture, but I think the
author is referring to this specific application, which makes the statement
true. (The sentence you quoted is bracketed by two sentences which are more
explicitly about this demo application, but it's not explicit in the sentence
you quoted.)

> memory reordering is still an issue that needs to be dealt with on x86.

The article links to another post by the same author with a similar experiment
demonstrating reordering on x86. The example is a lot harder to follow,
though, since x86 only allows reordering of stores and loads relative to each
other.

<http://preshing.com/20120515/memory-reordering-caught-in-the-act>

> test-driven development is probably not a good approach to writing mutexes

The experiment is to show people how reordering can happen; it has an
educational purpose. Thermite is a bad way to heat your home, but it's a good
experiment to demonstrate exothermic reactions.

Yes, the article could be written better. But there are so few people who
write any articles at all on this subject, I'm glad to be able to read it.

~~~
nkurz
I agree --- definitely better to have a possibly flawed article to discuss,
rather than no article at all. And as a whole, the series has lots of great
information.

------
qdog
This is why I don't like to use shared memory. It's hard to get right for a
variety of reasons.

At a low level, to try and make this work, you need to do more than worry
about a mutex. You need the CPU's cache to be out of the way, the memory area
protected, AND the memory bus transactions to be completed!

So... if C++11 works, this is what it must really do (some of this is handled by
the hardware, but these all have to happen...and if there's a hardware bug,
you need a software workaround):

1) Lock the memory area to the writing cpu (this could be a mutex with a
memory range, but safest, and slowest, is to disable interrupts while you dick
with memory. That's unlikely to be available at high level).

2) Write the memory through the cache to the actual memory address OR track
the dirty bit to make sure CPU2 fetches memory for CPU1's cache. AND go over
to CPU2 and flip the dirty bit if it has this bit of memory in cache...

3) Wait for all the memory to be written by the bus. Depending on the
implementer of the bus, it's entirely possible to have CPU1's memory writes
heading into memory, but not yet committed, when CPU2's request arrives,
giving CPU2 a copy of old data! One way to try and fix this is...have CPU1
read-through-cache to the actual memory location, which the bus will flush
correctly as the request is coming from the same device that did a previous
write. (I used to do embedded programming and had to use this trick at times,
it's possible this is the only bus that worked like this, YMMV).

4) Release the locking mechanism and hope it's all correct.

Realizing that a '1 in a million' chance of failure probably equates to months
between failures at most, you see why bugs with this stuff appear all the
time. If you MUST use shared memory as your interface for some reason, you
better be really careful. And maybe look to move to a different method ASAP.

Edit: changed memory controller to bus, oops

------
callan
For those seeking more detail, Linux has a great reference on using memory
barriers: <http://www.kernel.org/doc/Documentation/memory-barriers.txt>

~~~
hresult
Great article. But the ASCII art in it!

------
mjb
This is a really interesting article. Multi-core ARM seems to be the first
really mainstream processor architecture that behaves this way. There have
been others, like Alpha, but none have achieved the ubiquity that multi-core
ARM has achieved. I suspect a side-effect of this is that many of the "threads
are hard" effects that are hidden by x86 will come back to bite a lot of
programmers. I think we are going to be seeing a lot more "threads are hard"
and "threads are weird" posts in the near future, and hopefully better
learning material about threading issues in the longer term. Even more
hopefully, this might drive more research and development into abstractions
for providing parallelism and concurrency in ways that hide the complexity of
threads.

~~~
dorianj
PowerPC has a weakly ordered memory model, so older Mac programmers are no
strangers to this sort of stuff.

There have been many multi-core ARM devices out in both the Android and iOS
world for a few years, and I haven't heard too many complaints. I think most
good high-level programmers know that they need to use high-level locks
provided by the SDK, while anyone writing code that actually needs to deal
with memory ordering probably already knows to use barriers and is aware of
some of the pitfalls of different ISAs.

~~~
Tuna-Fish
But old PowerPC macs weren't typically multicores, and even when they had
multiple CPUs, people rarely programmed threaded software for them. Memory
ordering is completely transparent to a single thread -- it's only when you
add more of them that you start to have problems.

~~~
AnthonyMouse
<http://en.wikipedia.org/wiki/List_of_Macintosh_models_grouped_by_CPU_type#PowerPC_G4>

All of the PowerMac G4s and PowerMac G5s from 1999-2006 spanning 350MHz to
2.7GHz were sold in multiprocessor configurations.

Maybe most programmers never worried about writing threaded software for them,
but they certainly weren't uncommon.

~~~
brigade
Only the most expensive PowerMacs had multiprocessors; iMacs, PowerBooks, and
iBooks vastly outnumbered those with top-end PowerMacs. Which also meant there
wasn't much point in writing heavily threaded programs unless specifically
targeting high-end users.

------
tveita
Valgrind has tools that supposedly can find certain classes of load/store race
conditions. I've never used them in anger, so I can't vouch for them, but it
would be interesting to do a test on the example in the article.

Memcheck is certainly a must-have tool for finding heisenbugs in low-level
code - it would be wonderful to have an equally effective solution for race
conditions.

<http://valgrind.org/docs/manual/hg-manual.html>

<http://valgrind.org/docs/manual/drd-manual.html>

~~~
osivertsson
I have very successfully used ThreadSanitizer [1] on Linux to find data races
in a large C++ codebase. The Linux and OSX versions are based on
Valgrind/Helgrind.

[1] <http://code.google.com/p/data-race-test/wiki/ThreadSanitizer>

------
hobbyist
Do the memory barriers in the ARM architecture also flush the caches? In Intel
x86 architectures the hardware handles the coherency between all the caches,
so a CPU core can directly read from the cache line of another core if it
finds its own cache line to be dirty. Does this happen in ARM also?

------
lincolnq
Yay memory semantics!

A classic case where this sort of problem bit Java in the ass: the "double-
checked locking pattern" for initializing Singletons.
<http://www.ibm.com/developerworks/java/library/j-dcl/index.html>

I'm not sure if this was ever fixed / improved enough to allow the programmer
to make this work.

~~~
trhtrsh
The "fix" is to define an enum with one member named INSTANCE, if you need a
Singleton. -- Effective Java, 2/ed, by Josh Bloch

~~~
oconnor0
Which only works if your singleton doesn't need to extend an existing class --
enums can only implement interfaces.

------
makira
and this is why you don't implement your own mutexes and instead use the ones
provided by the OS.

------
usea
A question: Why is it the CPU architecture that is weakly ordered, if it's the
compiler that is reordering the statements? Couldn't you have a compiler on a
weakly ordered arch that preserved order, and a compiler on x86 for example
that could reorder your statements?

Isn't it the language spec / compiler that is in charge of this, rather than
the CPU? I'd like to know more about this.

~~~
klodolph
Both the compiler _and_ the CPU can reorder code, as long as the results don't
change from the perspective of the code that is executing.

For example, a compiler could take the following code:

    global_x = 1;
    global_y = 2;
    global_x = 3;

And change it into:

    global_x = 3;
    global_y = 2;

Compilers on all platforms reorder statements like this. You can prevent this
by making your variables `volatile`, but 99% of the time when you write the
word "volatile" in your code you're making a mistake.

------
hresult
Great article. CPU reordering is an effect which makes it notoriously
difficult to implement lock-free code correctly.

