
Examining ARM vs. x86 Memory Models with Rust - redbluemonkey
https://www.nickwilcox.com/blog/arm_vs_x86_memory_model/
======
ajross
> Where the memory model of ARM differs from X86 is that ARM CPU’s will re-
> order writes relative to other writes, whereas X86 will not.

May. Not will. The difference here is important, because the actual memory
ordering presented is an issue of hardware implementation choice (and of
course the local vagaries like cache line alignment, interrupt order and the
behavior of other CPUs on the bus). You can't just write some sample code to
demonstrate it and expect it's going to work the same on "ARM".

In fact I'd be really curious how Apple handles this during the Mac
transition. I wouldn't be at all surprised if, purely for the sake of
compatibility, they implement a strongly ordered x86-style cache hierarchy.
Bugs in this world can be extremely difficult to diagnose, and honestly cache
coherence transistors aren't that expensive relative to the rest of the
system.

~~~
BeeOnRope
Reordering rarely or never comes from "cache coherence" on CPUs, but rather
core local effects like store buffering, out of order execution, coalescing
and out of order commit in the store buffer, etc.

~~~
ajross
I don't follow your point. Ordering control is _necessarily_ a feature of the
cache layer. You're right that there are local pipeline effects too, but if
you have a memory ordering requirement in your architecture it has to be built
into the cache design all the way up and down the hierarchy.

In practice that's the stuff that's expensive.

~~~
BeeOnRope
I'm saying that everyone making CPUs uses a strongly ordered cache coherency
subsystem and re-ordering _only_ comes from "pipeline effects" [1].

Slightly longer version of the same claim here:

[https://news.ycombinator.com/item?id=23661588](https://news.ycombinator.com/item?id=23661588)

---

[1] I prefer the term _core-local_ because I think it's somewhat more accurate
as it can include things like the delayed processing of invalidations and
sibling core interactions in SMT which might not fall under "pipeline" effects
but are still local to the (physical) core.

~~~
ajross
That may well be true in ARM application processors, at least between CPUs.
Though nothing in the architectural memory model guarantees that so I'd be
surprised if there weren't some surprises (DMA on most of those CPUs is
upstream of the cache, for example, and not susceptible to that kind of
optimization).

I can guarantee it's not for "everyone making CPUs". I'm literally writing
code as we speak on a multi CPU cache-incoherent system. It's a big world.

~~~
jeffreygoesto
TriCore by any chance? ;-)

~~~
ajross
Xtensa

------
barskern
Isn't the following statement always true, as casting using `as` will silently
~~overflow~~ truncate the `u32` if `usize` is 64-bits?

    
    
        assert!((samples as u32) <= u32::MAX);
    

EDIT: I know it's a contrived example, but I was just curious if my
understanding is correct. I also found this page in the nomicon about casting:
[https://doc.rust-lang.org/nomicon/casts.html](https://doc.rust-lang.org/nomicon/casts.html)

EDIT2: As I thought, casting a 64-bit `usize` to a `u32` truncates it, and
hence the assertion is always true. Further, with a number bigger than
`u32::MAX`, this example exhibits undefined behavior. This is due to the use
of `slice::from_raw_parts`, where `self.samples` is left as a `usize` and
hence takes a much bigger slice than what was allocated (which is only the
leftover of the truncate operation). I made a small playground which
demonstrates the segfault:
[https://play.rust-lang.org/?version=stable&mode=debug&editio...](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=2d6a189182daa3884eebc7524d1b8223).
The assertion should rather be:

    
    
        assert!(samples <= u32::MAX as usize);
    

Don't get me wrong, I think the blogpost is a great explanatory article about
memory ordering and the example is rather contrived. I just wanted to reassure
myself that my understanding was correct and further perhaps help someone not
seeing this issue (as this is a very easy trap to fall into).
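
For anyone who wants to see the trap in isolation, here is a minimal sketch.
The function names are made up for illustration, and it uses `u64` instead of
`usize` (an assumption, so that it behaves the same on 32-bit targets). It
contrasts the truncating `as` cast with the checked `u32::try_from`:

```rust
use std::convert::TryFrom;

/// Narrowing with `as` keeps only the low 32 bits, so the result always
/// fits in a `u32` and a later `<= u32::MAX` check can never fail.
fn truncating_cast(samples: u64) -> u32 {
    samples as u32
}

/// Checked narrowing: returns `None` when `samples` does not fit in a `u32`.
fn checked_cast(samples: u64) -> Option<u32> {
    u32::try_from(samples).ok()
}
```

Calling `truncating_cast(u32::MAX as u64 + 2)` silently yields `1`, while
`checked_cast` on the same input yields `None`, which is the case the
assertion was meant to catch.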

~~~
redbluemonkey
You are correct. Thanks for the pickup. Fixed the post and repo.

------
RcouF1uZ4gsC
> The x86 processor was able to run the test successfully all 10,000 times,
> but the ARM processor failed on the 35th attempt.

I think this issue might prove a problem in the long tail of desktop and
server software running on ARM.

A lot of desktop and server applications try to take advantage of all the
cores. Many times, they use libraries that were either implemented before C
and C++ had defined memory models, or written without much care for the
memory model as long as the code ran without issues on the developer's
computer (x86) and the server (x86). Going to ARM is going to expose a lot of
these bugs as developers recompile for ARM without making sure that their
code actually adheres to the C/C++ memory models.

~~~
nindalf
There are now two incentives to support ARM better: Apple’s move to ARM on
the desktop and cheaper cloud bills if you’re willing to use ARM instances.
Neither one alone would be enough of an incentive, but together they will
cause a shift in the next 3-5 years.

Developers will become more aware of the differences between the
architectures, tool chains will accommodate both better, people and software
will stop assuming they are running on x86 as default. ARM won’t “win” the
desktop or the server market, but it will become a viable alternative,
squeezing the profits of companies who depend on x86.

~~~
jfkebwjsbx
> cheaper cloud bills

That remains to be seen.

~~~
nindalf
Graviton2 instances are available on AWS. They claim up to 40% better
price-performance than their x86 peers:
[https://aws.amazon.com/ec2/instance-types/m6/](https://aws.amazon.com/ec2/instance-types/m6/)

------
userbinator
_If we initialize the contents of both pointers to 0_

Is this a "Rust-ism"? I had a double-take while reading that, because in C
that would mean a null pointer, and in the terminology I'm used to, the intent
is to set the _pointee_ to 0.

Note that x86 does allow some memory reordering:

[https://preshing.com/20120515/memory-reordering-caught-in-th...](https://preshing.com/20120515/memory-reordering-caught-in-the-act/)

(I have experience debugging and fixing an extremely rare bug caused by the
above subtle reordering, which occurred approximately once every _3-4
months_.)
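
The reordering x86 permits is the store-then-load case: a store can still be
sitting in the store buffer while a later load from a different address
completes. A rough Rust sketch of that litmus test follows (the function name
is hypothetical). With `Ordering::SeqCst` on every access, the outcome where
both threads read 0 is ruled out; with weaker orderings it is allowed, even on
x86:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs one round of the store-buffering litmus test and returns what each
/// thread observed in the *other* thread's variable.
fn store_buffer_round() -> (u32, u32) {
    let x = Arc::new(AtomicU32::new(0));
    let y = Arc::new(AtomicU32::new(0));

    let (x1, y1) = (Arc::clone(&x), Arc::clone(&y));
    let a = thread::spawn(move || {
        x1.store(1, Ordering::SeqCst);
        y1.load(Ordering::SeqCst)
    });

    let (x2, y2) = (Arc::clone(&x), Arc::clone(&y));
    let b = thread::spawn(move || {
        y2.store(1, Ordering::SeqCst);
        x2.load(Ordering::SeqCst)
    });

    // Under SeqCst at least one thread must see the other's store,
    // so (0, 0) cannot be returned.
    (a.join().unwrap(), b.join().unwrap())
}
```

Swapping the orderings for `Release` stores and `Acquire` loads makes
`(0, 0)` a legal result, although it may take many rounds to observe in
practice.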

~~~
kibwen
C programmers use "contents" to refer to the address of the pointer? As a
non-C programmer, asking what is "contained" in the pointer would certainly
refer to the pointee.

~~~
wahern
In C a pointer is an object in its own right, so referring to the contents of
a pointer, p, means the same thing as referring to the contents of an integer,
i; namely, the _value_ held by the object. A pointer value has semantic
meaning and usefulness the same as an integer value, completely divorced from
the machine representation, which in a pure C program should always be
irrelevant for pointer and arithmetic types alike.

I don't think I can articulate well why it's important to keep in mind and
appreciate value semantics. Certainly many C programmers, especially the newer
ones, are far too concerned with value _representation_ rather than the
abstract value itself, often conflating the two. I guess one good example of
why it's important to understand pointers as proper objects with abstract
values is when working with pointers-to-pointers, pointers-to-pointers-to-
pointers, etc. Pointer arithmetic is another case, which doesn't overlap with
the former as much as you'd think. In these cases understanding pointers as
values is important to understanding the semantics of a program, and more
generally how to leverage the language efficiently and safely. Note that these
semantics have no analog in references, the construct in, e.g., C++. It
doesn't make sense to conceptualize references as independent objects; rather,
a reference is a syntactic construct, no more an object than a list
initializer.

Yeah, as a C programmer I would find it kind of odd if someone seemed to
conceptually elide the character of pointers as proper first-class objects.

~~~
kibwen
I wonder if pointer arithmetic specifically is the sticking point here; when
using pointers in other languages I don't ever do pointer arithmetic on them,
so the integral value of the pointer is easily ignored.

For those wondering, given the context of Rust here, both references and
C-style pointers are first-class in Rust, but references (which are
overwhelmingly more common) don't directly permit pointer arithmetic at all,
and would require one to first cast the pointer to an integer.
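
A small sketch of the distinction, with hypothetical function names: writing
through a reference changes the pointee, while arithmetic on the address
itself goes through a raw pointer. One route that avoids integers entirely is
`as_ptr()` followed by `add`:

```rust
/// Overwrites the pointee; the reference itself (the address) is unchanged.
fn zero_pointee(target: &mut i32) {
    *target = 0;
}

/// Reads the element `offset` places after the first one using raw pointer
/// arithmetic. Panics instead of reading out of bounds.
fn nth_via_raw_pointer(values: &[i32], offset: usize) -> i32 {
    assert!(offset < values.len(), "offset out of bounds");
    let first: *const i32 = values.as_ptr();
    // SAFETY: `offset` was checked against the slice length above, so the
    // resulting pointer stays inside the slice.
    unsafe { *first.add(offset) }
}
```

For example, `nth_via_raw_pointer(&[10, 20, 30], 2)` reads the third element.
In ordinary code plain indexing would be the idiomatic choice; the raw
pointer version is only here to show where the arithmetic lives.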

------
Diggsey
One thing omitted from this article is that it's not only the physical
hardware that may (effectively) re-order operations. The compiler may also
perform these re-orderings.

The compiler's re-orderings will always be valid according to the abstract
memory model rather than the hardware's, so even on x86 you must use the
correct memory orderings, or risk subtle bugs due to compiler optimisations.
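
As a rough illustration of "the correct memory orderings", here is a minimal
message-passing sketch (not the article's code; `publish_and_consume` is a
made-up helper). The `Release` store and `Acquire` load constrain both the
compiler and the CPU, so it is valid on x86 and ARM alike:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Publishes `value` from one thread and returns what a second thread reads
/// once it has seen the ready flag.
fn publish_and_consume(value: u32) -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let (data_w, ready_w) = (Arc::clone(&data), Arc::clone(&ready));
    let producer = thread::spawn(move || {
        data_w.store(value, Ordering::Relaxed);
        // Release: the write to `data` cannot be moved after this store,
        // neither by the compiler nor by the hardware.
        ready_w.store(true, Ordering::Release);
    });

    let consumer = thread::spawn(move || {
        // Acquire: once `true` is observed, the producer's earlier write
        // to `data` is guaranteed to be visible.
        while !ready.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        data.load(Ordering::Relaxed)
    });

    producer.join().unwrap();
    consumer.join().unwrap()
}
```

If both flag accesses were `Relaxed`, the consumer would be allowed to see the
flag set and still read the old value of `data`, whatever the target CPU.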

~~~
comex
In particular, the “multi-threaded using volatile” example is technically
incorrect because the compiler is allowed to move non-volatile accesses across
volatile ones. [1] To make it correct (but still only on x86), you’d have to
use std::sync::atomic::compiler_fence.

[1]
[https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html](https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html)
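
A compiler-only barrier of this kind is available in Rust's standard library
as `std::sync::atomic::compiler_fence`. The sketch below uses hypothetical
`publish`/`try_read` helpers and relaxed atomics rather than the article's
volatile accesses (an assumption, to keep it free of `unsafe`). The fence
restricts compiler reordering only and emits no CPU fence instruction, so it
is the pattern documented for signal handlers on the same thread, not a
portable cross-thread synchronization:

```rust
use std::sync::atomic::{compiler_fence, AtomicBool, AtomicU32, Ordering};

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

/// Writes the payload, then raises the flag.
fn publish(value: u32) {
    DATA.store(value, Ordering::Relaxed);
    // Stops the compiler from sinking the DATA store below the READY store.
    // The hardware is free to reorder unless the architecture forbids it.
    compiler_fence(Ordering::Release);
    READY.store(true, Ordering::Relaxed);
}

/// Returns the payload if the flag has been raised.
fn try_read() -> Option<u32> {
    if READY.load(Ordering::Relaxed) {
        // Stops the compiler from hoisting the DATA load above the READY load.
        compiler_fence(Ordering::Acquire);
        Some(DATA.load(Ordering::Relaxed))
    } else {
        None
    }
}
```

Across threads this would lean on the hardware keeping stores in order, which
x86 does and ARM does not promise; `fence` or `Release`/`Acquire` accesses
remain the portable choice.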

------
jhoechtl
Did this page just load for me in the blink of an eye? Way to do it!

------
ramshanker
And the good news here is that many libraries have accelerated their ARM port
/ investigations.

------
truth_seeker
Isn't this something the Rust std lib API, or perhaps the LLVM backend,
should take care of?

~~~
secondcoming
The problem is that there's likely to be a fair amount of logically broken
third-party code out there because x86's memory model semantics don't reveal
the issues. The compiler can't help here; it'll just do what the developer
told it to do.

It's a whole new 'it works on my machine' issue (for some people).

