
Memory Reordering Caught in the Act - jenhsun
http://preshing.com/20120515/memory-reordering-caught-in-the-act
======
reitzensteinm
Probably the most important broad point of the article:

"And there you have it! During this run, a memory reordering was detected
approximately once every 6600 iterations. When I tested in Ubuntu on a Core 2
Duo E6300, the occurrences were even more rare. One begins to appreciate how
subtle timing bugs can creep undetected into lock-free code."

If you were shipping software to millions of people, you could reasonably
_never see_ a specific race condition error that ends up occurring on many
tens of thousands of machines.

And then when you start to consider interactions between infrequent errors...

~~~
bediger4000
But now every programmer is supposed to take advantage of the multi-core
processors needed to run modern operating systems, aren't they? This is going
to lead to a number of prominent disasters.

~~~
pmjordan
I don't know about "disasters". For better or worse, lots of software ships
with severe bugs that are evidently rare enough never to be fixed but do
happen.

For example, every few months, OSX kernel panics on my MacBook Air, with no
obvious trigger. The backtrace _always_ implicates the AHCI driver for the
SATA SSD. Still, Apple evidently don't receive enough crash reports for this
particular bug to actually fix it, as it's happened since I first bought the
Air in November 2010.

I have no idea whether it's a threading bug like the one in the article (I'm
not about to run my system in single-core mode for months just to find out) or
maybe a race condition with DMA or just a simple logic error that only rarely
applies. It might even be specific to the exact SSD model and revision I ended
up with in my device. But that's pretty much irrelevant - there certainly
hasn't been a widespread outcry about it.

Personally, as a developer, I would _definitely_ want to fix that kind of bug
in anything I'd built. But tracking it down might take weeks of costly
developer time. So Apple's bug triage probably marked this one "WONTFIX" after
deciding it only happened on hardware they no longer sell, so fixing it wasn't
going to have a positive ROI.

(FWIW, if it sounds like I have a personal vendetta against the AHCI driver,
that's because I do. ;-) It behaves erratically in ways unrelated to this
crash, but that weirdness doesn't cause kernel panics, just extra work for
developers of other drivers.)

~~~
aptwebapps
Going on a little (actually really long) tangent here about Apple and the
general inscrutableness of their bug fixing ...

I have a late 2008 MBP from the first batch of unibodies and after having it
for a few months I started getting this thing where the screen would blink. A
lot other people did as well because by mid-2009 the relevant thread on their
forum ran to over 75 pages so they had to be completely aware of it. People
even put up youtube videos of it in action, which must have been a little
tricky to catch because it was completely random.

In that thread on the forum some people told how they had managed to get their
machines replaced. Sometimes the replaced machines had the defect as well. I
never had any luck with my local support as they basically said they couldn't
reproduce it so they couldn't replace it.

However, at one point a rookie tech left a support manual in pdf form on my
machine after I had taken it in for servicing. There was a table in it which
listed about a dozen different types of problems and the recommended approach
for each. Which type of problem was the only one to have a blank cell where
the solution should have been? Flickering or blinking displays.

So every time I had some other issue with my MBP, and I had several, I tried
to get them to do something about the blinking problem. I finally got a
second-tier support person (on the phone) to accept the issue. He said it had
happened to his as well and replacing the Airport card (which happened to be
the thing that had failed that particular time) had fixed it. This was about
two months before the end of my three-year Apple Care period and he said that
if that didn't fix it, he would try to get it replaced.

Well it didn't, to my utter lack of surprise, but I couldn't manage the hassle
of giving up my machine yet again and, anyway, I was used to it so the
warranty period expired.

About a month later Apple released a firmware update that fixed the blinking.

I know there are companies that might support their products after the
warranty period and I know there are companies that stonewall but I thought
this combination was pure Apple.

Completely ignore the problem for three years and then release an update after
everyone is out of warranty anyway.

~~~
pmjordan
Wow, that is indeed classic Apple. Does the firmware update really fix it,
though? My father-in-law has had all sorts of trouble with standby and his
2011 Mac Mini + Thunderbolt Display. Supposedly "fixed" by various firmware
updates, but not much has changed.

~~~
aptwebapps
It seems to have worked.

------
scott_s
_On a related note, I compiled and ran this sample on Playstation 3, and no
memory reordering was detected. This suggests that the two hardware threads
inside the PPU effectively act as a single processor, with very fine-grained
hardware scheduling._

This is because all of the cores on the Cell processor are _in-order_
([http://en.wikipedia.org/wiki/Out-of-order_execution#In-
order...](http://en.wikipedia.org/wiki/Out-of-order_execution#In-
order_processors)). That fact is well documented:
<http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf>

And for the record, the "correct" thing to do is to always use the appropriate
memory and instruction fences. Reasoning about where these should go can be
subtle.

~~~
preshing
Actually, it's not because the PPU executes in-order, which has only to do
with the way the CPU orders instructions internally and says nothing about
memory ordering. For example, the Xbox 360 has in-order processors too, yet
you can observe memory reordering all over the place. Both consoles use the
PowerPC architecture, which is well-known to provide weak memory ordering.

~~~
pron
Correct. Memory operation re-ordering can occur due to the way the cache-
coherence mechanism is implemented. If messages on the bus between cores are
not ordered, memory operations can effectively be reordered even if each core
never performs any reordering (e.g. due to cache misses).

This is a great explanation:
[http://irl.cs.ucla.edu/~yingdi/paperreading/whymb.2010.06.07...](http://irl.cs.ucla.edu/~yingdi/paperreading/whymb.2010.06.07c.pdf)

------
adobriyan
Somewhat more systematic document:

[http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6...](http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/memory-
barriers.txt;h=2759f7c188f0cd9232ce10122d3dd6595b4f0d30;hb=HEAD)

------
loboman
So what's the surprise here? this is exactly why you use locks in the code,
because you don't know what the compiler is going to do with your code, and
you don't want to know all the details of that. The solution is not to know
all those asm details, but to use tools like locks correctly.

edit: after writing lock-free code years ago, and having very subtle bugs like
the ones described here, I don't think it is a good idea anymore, unless you
are in a very specific scenario.

~~~
AlexandrB
> ...unless you are in a very specific scenario.

Locks can be very slow (relatively). At my last job using lock-free code was
necessary for performance in many cases.

------
simcop2387
Apparently this happens alot more for me on an AMD bulldozer.

I'm getting it happening every 1600 iterations with their makefile and 2000
with -march=native. Very interesting.

------
javert
This whole article was problematic to me because it starts out with:

 _Two processors, running in parallel, execute the following machine code_

Processors don't "run in parallel." Among other things, the OS could be
scheduling the threads on each CPU in any way it wants. I just can't think
about the false hypothetical that the processors "run in parallel."

I'm in the process of going through the article more carefully to see if this
erroneous way of thinking propogated. It seems like the author knows his
stuff, though, so I'm guessing not.

~~~
javert
I didn't find any problems.

However, one of the reasons I was uncomfortable with the setup (besides thread
scheuling), is that I personally don't make any assumptions about cache
coherency. To me, it could take thread 1 an arbitrary amount of time to see an
update made to memory by thread 2, unless some primitive (like a semaphore...
with "acquire and release semantics") is used.

Right? I mean, couldn't that account for what is happening just as much as
instruction reordering? In theory, I'm fairly certain this is true. In
practice, it depends on the specifics of cache coherency in x86-64; if anyone
can comment on that, I'd appreciate it.

------
alpb
This is a very nice post. I didn't use to know <code>-S</code> existed before.

~~~
malkia
On the MSVC side the option is -FA[scu]

/FA[scu] configure assembly listing

(this is from CL 15.00.30729.207 from the WDK 7.1)

------
tocomment
I hope this isn't too dumb, but I always imagined each processor having its
own set of registers.

Am I incorrect? I can't see how multiple processors can share registers
without chaos?

~~~
HeXetic
The registers aren't shared here, but the main memory is. The example has each
thread write '1' into a [shared, main] memory location and then read the other
thread's memory location into a register. If both CPUs' registers are 0, that
means that both reads occurred before both writes.

------
mrushton14
I do embedded Linux development and I see bugs related to this stuff
frequently. What can make things more tricky is different architectures have
different ordering guarantees. Anyone know of any cpu architectures that
intentionally sacrifice performance for a simpler less bug prone model? I'm
thinking of markets like aviation where bugs like this seem particularly
scary.

~~~
duskwuff
There's no need for a separate architecture to avoid concurrency issues; a
simpler solution is to simply use a uniprocessor system. (And be careful with
interrupts.)

