Examining ARM vs. x86 Memory Models with Rust (nickwilcox.com)
239 points by redbluemonkey 6 days ago | 65 comments





> Where the memory model of ARM differs from X86 is that ARM CPU’s will re-order writes relative to other writes, whereas X86 will not.

May. Not will. The difference here is important, because the actual memory ordering presented is an issue of hardware implementation choice (and of course the local vagaries like cache line alignment, interrupt order and the behavior of other CPUs on the bus). You can't just write some sample code to demonstrate it and expect it's going to work the same on "ARM".
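
A rough sketch of the kind of litmus test involved, to make the "may" concrete (this is my own minimal version using Relaxed orderings, not the article's code): the reader is permitted to observe the two stores out of order on a weakly ordered machine, but whether a given ARM implementation actually produces the reordering is exactly the hardware-choice question above.

    use std::sync::atomic::{AtomicU32, Ordering};
    use std::thread;

    static DATA: AtomicU32 = AtomicU32::new(0);
    static READY: AtomicU32 = AtomicU32::new(0);

    fn main() {
        let writer = thread::spawn(|| {
            DATA.store(42, Ordering::Relaxed);
            // With Relaxed, these two stores may become visible in either
            // order on a weakly ordered CPU; x86 keeps stores in order.
            READY.store(1, Ordering::Relaxed);
        });
        let reader = thread::spawn(|| {
            while READY.load(Ordering::Relaxed) == 0 {}
            // On a weak memory model this load may still observe 0.
            println!("data = {}", DATA.load(Ordering::Relaxed));
        });
        writer.join().unwrap();
        reader.join().unwrap();
    }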

In fact I'd be really curious how Apple handles this during the Mac transition. I wouldn't be at all surprised if, purely for the sake of compatibility, they implement a strongly ordered x86-style cache hierarchy. Bugs in this world can be extremely difficult to diagnose, and honestly cache coherence transistors aren't that expensive relative to the rest of the system.


Reordering rarely or never comes from "cache coherence" on CPUs, but rather from core-local effects like store buffering, out-of-order execution, coalescing and out-of-order commit in the store buffer, etc.

I don't follow your point. Ordering control is necessarily a feature of the cache layer. You're right that there are local pipeline effects too, but if you have a memory ordering requirement in your architecture it has to be built into the cache design all the way up and down the hierarchy.

In practice that's the stuff that's expensive.


I'm saying that everyone making CPUs uses a strongly ordered cache coherency subsystem and re-ordering only comes from "pipeline effects" [1].

Slightly longer version of the same claim here:

https://news.ycombinator.com/item?id=23661588

---

[1] I prefer the term core-local because I think it's somewhat more accurate as it can include things like the delayed processing of invalidations and sibling core interactions in SMT which might not fall under "pipeline" effects but are still local to the (physical) core.


That may well be true in ARM application processors, at least between CPUs. Though nothing in the architectural memory model guarantees that, so I'd be surprised if there weren't some surprises (DMA on most of those CPUs is upstream of the cache, for example, and not susceptible to that kind of optimization).

I can guarantee it's not for "everyone making CPUs". I'm literally writing code as we speak on a multi-CPU cache-incoherent system. It's a big world.


TriCore by any chance? ;-)

Xtensa

The cost in terms of transistors may be small, but what about performance? (This is not a rhetorical question, I'm curious to know)

The fastest general purpose CPUs in the world implement a strongly ordered architecture, so clearly it can be done.

My question was exactly @Retr0spectrum's, the cost in performance, but I don't accept your answer (yet). I presume you mean Intel, and Intel has a large transistor budget to burn, and quite likely did to get that fastest perf. If you don't have to enforce strong ordering, how much extra a) actual speed b) transistor budget could you free up?

(not a hardware guy)


Isn't the following statement always true, as casting using `as` will silently ~~overflow~~ truncate the value to `u32` if `usize` is 64 bits?

    assert!((samples as u32) <= u32::MAX);
EDIT: I know it's a contrived example, but I was just curious if my understanding is correct. I also found this page in the nomicon about casting: https://doc.rust-lang.org/nomicon/casts.html

EDIT2: As I thought, casting a `usize` which is 64 bits to a `u32` causes it to be truncated, and hence the assertion is always true. Further, by using a number that's bigger than a `u32`, this example contains undefined behavior. This is due to the use of `slice::from_raw_parts` where `self.samples` is left as a `usize` and hence takes a much bigger slice than what was allocated (the leftover of the truncate operation). I made a small playground which demonstrates the segfault. https://play.rust-lang.org/?version=stable&mode=debug&editio.... The assertion should rather be:

    assert!(samples <= u32::MAX as usize);
Don't get me wrong, I think the blogpost is a great explanatory article about memory ordering and the example is rather contrived. I just wanted to reassure myself that my understanding was correct and further perhaps help someone not seeing this issue (as this is a very easy trap to fall into).
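
For anyone who wants the truncation trap in isolation, a minimal sketch of my own (separate from the post's unsafe slice code):

    fn main() {
        // On a 64-bit target this value does not fit in a u32.
        let samples: usize = u32::MAX as usize + 1;

        // `as` silently truncates, so the original check always passes:
        assert!((samples as u32) <= u32::MAX); // samples as u32 == 0 here

        // The corrected check compares in the wider type and panics here,
        // correctly flagging that `samples` does not fit in a u32:
        assert!(samples <= u32::MAX as usize);
    }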

You are correct. Thanks for the pickup. Fixed the post and repo.

To be honest, this check deserves to be in clippy.

> The x86 processor was able to run the test successfully all 10,000 times, but the ARM processor failed on the 35th attempt.

I think this issue might prove a problem in the long tail of desktop and server software running on ARM.

A lot of desktop and server applications try to take advantage of all the cores. Many times, they are using libraries that were either implemented prior to C and C++ having defined memory models, or else without much care for the memory model as long as it ran without issues on the developer's computer (x86) and server (x86). Going to ARM is going to expose a lot of these bugs as developers recompile their code for ARM without making sure that their code actually adheres to the C/C++ memory models.


There are now two incentives to support ARM better - Apple’s move to ARM on the desktop and cheaper cloud bills if you’re willing to use ARM instances. Either one alone wouldn’t be enough of an incentive, but together they will cause a shift in the next 3-5 years.

Developers will become more aware of the differences between the architectures, tool chains will accommodate both better, and people and software will stop assuming they are running on x86 by default. ARM won’t “win” the desktop or the server market, but it will become a viable alternative, squeezing the profits of companies who depend on x86.


> cheaper cloud bills

That remains to be seen.


Graviton2 instances are available on AWS. They claim 40% better performance than their x86 peers - https://aws.amazon.com/ec2/instance-types/m6/

Smartphones used ARM for a while. I think that many of the major libraries were used in one project or another. So I think that this problem won't be as severe, because those bugs hopefully were fixed.

Also, the Raspberry Pi has been a popular choice for many tinkerers for years, which also helps with ARM penetration.


What do you mean "for a while"? AFAIK most smartphones are still based on ARM SoCs.

I would read "for a while" as "for a number of years, since the iPhone".

Maybe they meant to say “have been using” instead of “used”.

Yeah, feels similar to Make concurrency bugs, where the makefile was developed with --jobs=1, and then years down the track someone says “let’s make it faster!” and tries --jobs=8 or similar, but that leads to compilation failures (or worse, succeeding in compiling the wrong thing), and because the makefile was written years ago, it’s hard to track down exactly where the prerequisites are lacking; whereas if --jobs=8 had been used from the start, it would have been caught early.

So far I have not encountered any problems with e.g. desktop Firefox (no surprise, gecko has been running on Android for ages) and various server things (mostly the BEAM.. and even some Haskell, even though before 8.8 GHC did not put proper memory barriers everywhere, and I used 8.6 back when I compiled the app).

Most libraries and applications use stuff like mutexes, btw :) it's not like people who don't care about memory models try to make lockfree things often.


Old software will have been tested on CPUs with weaker memory models.

Also, multi-threaded programming is hard, so the really long tail probably already is buggy on x86. Surfacing their bugs more often may be a blessing in disguise.


In practice, CPUs with weaker memory models tend to be newer - a single-threaded CPU being the strongest model of them all. There are some interesting exceptions, such as the Alpha architecture - which is so 'weak' it needs a memory barrier even on pointer/array indirections, if multiple threads might be involved! (I actually did some digging and explained the issue in this old thread: https://news.ycombinator.com/item?id=22282049)
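
To make the Alpha point concrete, here is a rough Rust sketch (mine, not from that thread) of the publish-a-pointer pattern where even the dependent read needs Acquire; Rust's std has no weaker "consume"-style ordering, partly because hardware like Alpha can't honour the data dependency for free:

    use std::ptr;
    use std::sync::atomic::{AtomicPtr, Ordering};
    use std::thread;

    static SHARED: AtomicPtr<u64> = AtomicPtr::new(ptr::null_mut());

    fn main() {
        let producer = thread::spawn(|| {
            // Release publishes the pointed-to value along with the pointer.
            SHARED.store(Box::into_raw(Box::new(123u64)), Ordering::Release);
        });
        let consumer = thread::spawn(|| loop {
            // Even though the dereference below depends on this load, an
            // Alpha-class machine could still hand back stale data for *p
            // without a barrier, hence Acquire rather than Relaxed.
            let p = SHARED.load(Ordering::Acquire);
            if !p.is_null() {
                unsafe {
                    assert_eq!(*p, 123);
                    drop(Box::from_raw(p));
                }
                break;
            }
        });
        producer.join().unwrap();
        consumer.join().unwrap();
    }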

ARM, POWER, early SPARCs all have a weaker memory model than Intel.

It is true though that with near-total x86 domination in the server market for the last 20 years, newer server software might have a lot of x86-isms.


Some of them are latent bugs on x86 too, just rarer due to the stronger guarantees x86 provides.

It's possible to provide stronger guarantees than the spec requires.

It wouldn't surprise me if server focused ARM chips ended up providing x86 style ordering to ensure compatibility with ported software.


It would very much surprise me, since the weaker ordering requirements are a significant part of how ARM achieves lower power consumption.

I agree that it would be a large power penalty, but the alternative may end up being the long tail of software suffering subtle concurrency correctness issues that just aren't there on x86. It may sacrifice some of their advantage but avoid being known as the CPU equivalent of a Ford Pinto.

There's still plenty of room to undercut Intel's historical 60% gross margins even if you're shipping largely interchangeable products.

The other way it might play out is a race to very high core counts as the main differentiator, providing single socket performance worth the hassle of not being able to run everything on it in a rock solid way. Postgres will work great. That redis fork that adds threads maybe not.

Or Intel may start to allow developers to relax memory correctness guarantees on a process by process granularity in their own progress to high core counts. It's hard to imagine their current methods scaling to 1k cores. But if you asked me ten years ago I'd have said they wouldn't have made it to 56 cores, either.


> Or Intel may start to allow developers to relax memory correctness guarantees on a process by process granularity

If Intel can allow developers/OS's to relax guarantees, why can't ARM allow OS's to strengthen them as necessary, while still keeping a relaxed memory model most of the time?


Sure, although you'd probably be better off with strong by default with opt in weakness.

The amount of testing and verification involved in making sure something like Postgres runs well under a different memory model is substantial. Adding the flag at the end is trivial.

This way around, software is correct by default unless it explicitly asks to be unsafe.


I would be surprised if that's the case. Why would memory ordering have any effect on power consumption?

Because ensuring release/acquire for every operation (at least in the x86 way) requires supporting total store order, which requires much stronger cache coherence in the normal (no barrier) case and hence results in a lot more bus traffic. ARM didn't just make the memory model weaker for no reason.

I am pretty sure this is not correct. Every reordering effect I'm aware of is a core-local effect. That is, it happens in the core before (or at the moment) the data hits the L1D. It does not occur due to a weaker cache coherence system.

A cache coherence system which was itself weaker, allowing reorderings consistent with the memory model, would make barriers and "implicit barriers" like address dependencies very expensive, and there is little evidence this is the case.

Even in a hypothetical core which had a cache coherency system coupled to the memory model, you aren't really avoiding any coherence traffic, just allowing certain reorderings such as satisfying requests out of order.


Seems plausible to me. I guess it would be pretty hard to maintain a TDP of about 200W for 80 3GHz ARM cores [1]. The bus traffic to synchronize 80 CPU caches can probably be significant.

[1] https://www.anandtech.com/show/15871/amperes-product-list-80...


Do you have a concrete example where a weaker memory model would allow you to avoid memory coherence traffic?

> If we initialize the contents of both pointers to 0

Is this a "Rust-ism"? I did a double-take while reading that, because in C that would mean a null pointer, and in the terminology I'm used to, the intent is to set the pointee to 0.

Note that x86 does allow some memory reordering:

https://preshing.com/20120515/memory-reordering-caught-in-th...

(I have experience debugging and fixing an extremely rare bug caused by the above subtle reordering, which occurred approximately once every 3-4 months.)


> Is this a "Rust-ism"?

It doesn’t ring a bell to me as someone who’s spent a lot of time in the Rust community, so I’d say it’s probably just a difference in personal jargon rather than a Rust versus C thing.

(Rust usually uses “reference” instead of “pointer” anyway.)


References and (raw) pointers are distinct things with very different semantics. (References are borrow checked.)

Also, references are not guaranteed to be indirections in machine code (which is the definition of a pointer).

Yes, they are – well, as much as raw pointers are. In reality, the optimizer can always remove indirections when it can prove that doing so doesn’t affect visible semantics, but that applies equally to raw pointers and references.

Critically, Rust references have pointer identity: you can convert between references and raw pointers, and if you convert a *const u32 to &u32 and back to *const u32, the pointers are guaranteed to compare equal. In my opinion this is unfortunate. It would be nice if the compiler could represent &u32 in memory as if it were just u32, i.e. just pass the value rather than a pointer. After all, the pointed-to value is guaranteed to remain unchanged as long as the &u32 exists. And passing by value is almost always faster for small values (where the size of a value <= the size of a pointer); the programmer could just pass by value directly, but only if they know it’s a small value, which isn’t the case in generic code. Unfortunately, the ability to compare pointer identities means that you really do need a pointer, even though most code never does so. (LLVM can transform pointers to values for local variables and even function arguments if it can prove that the actual code never checks pointer identity, but again, that applies to both references and raw pointers; most of the distinction between the two has been lost by the LLVM IR stage anyway.)
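
A minimal sketch of the round-trip guarantee being described (my own example, not from the article):

    fn main() {
        let x: u32 = 7;
        let r: &u32 = &x;

        // Round-trip through a raw pointer: the address is preserved,
        // which is exactly the pointer-identity guarantee above.
        let p: *const u32 = r;
        let r2: &u32 = unsafe { &*p };
        assert!(std::ptr::eq(r, r2));
        assert_eq!(p, r2 as *const u32);
    }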


C programmers use "contents" to refer to the address of the pointer? As a non-C programmer, asking what is "contained" in the pointer would certainly refer to the pointee.

In C a pointer is an object in its own right, so referring to the contents of a pointer, p, means the same thing as referring to the contents of an integer, i; namely, the value held by the object. A pointer value has semantic meaning and usefulness the same as an integer value, completely divorced from the machine representation, which in a pure C program should always be irrelevant for pointer and arithmetic types alike.

I don't think I can articulate well why it's important to keep in mind and appreciate value semantics. Certainly many C programmers, especially the newer ones, are far too concerned with value representation rather than the abstract value itself, often conflating the two. I guess one good example of why it's important to understand pointers as proper objects with abstract values is when working with pointers-to-pointers, pointers-to-pointers-to-pointers, etc. Pointer arithmetic is another case, which doesn't overlap with the former as much as you'd think. In these cases understanding pointers as values is important to understanding the semantics of a program, and more generally how to leverage the language efficiently and safely. Note that these semantics have no analog in references, the construct in, e.g., C++. It doesn't make sense to conceptualize references as independent objects; rather, a reference is a syntactic construct, no more an object than a list initializer.

Yeah, as a C programmer I would find it kind of odd if someone seemed to conceptually elide the character of pointers as proper first-class objects.


I wonder if pointer arithmetic specifically is the sticking point here; when using pointers in other languages I don't ever do pointer arithmetic on them, so the integral value of the pointer is easily ignored.

For those wondering, given the context of Rust here, both references and C-style pointers are first-class in Rust, but references (which are overwhelmingly more common) don't directly permit pointer arithmetic at all, and would require one to first cast the pointer to an integer.
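
For illustration, a small sketch (mine, not from the article) of that difference: references offer no arithmetic, so you drop down to a raw pointer first.

    fn main() {
        let data = [10u32, 20, 30, 40];
        let first: &u32 = &data[0];

        // No arithmetic on references; coerce to a raw pointer first.
        let p: *const u32 = first;
        let third = unsafe { *p.add(2) }; // roughly C's *(p + 2)
        assert_eq!(third, 30);
    }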


C and Asm at least... and "address of the pointer" is something entirely different. For example (assuming a 16-bit system and radix 16 for brevity),

    Name:    Address:    Value:
    p        0100        02 01
    v        0102        12 59
p is the pointer. The address of p is 0100. The contents of p are the two bytes which form the address of v, 0102. The contents of v are the two bytes which form the value 5912.

I wonder if this confusion of the pointee with the pointer is what makes this concept seem so difficult to those who didn't start with Asm (especially multiple indirections) --- I certainly saw a lot of that when I used to teach CS courses.


> I wonder if this confusion of the pointee with the pointer is what makes this concept seem so difficult to those who didn't start with Asm

Possibly (I started with asm, so I can't speak with certainty) but it could also be because many modern programmers first learned Java or JavaScript or Python, in which most names are references to objects (thus effectively all variables are addresses). Their first languages do not have value semantics, so grasping the concept of value semantics applied to a reference itself (i.e. pointers) is a major stretch.


In Python, everything is a pointer to a PyObject and always passed by value, so thinking of them as values makes sense. You can even easily get the address of the pointer by calling id on it (but not as easily convert it back).

doesn't id return the address of the pointee?

Ah, yes - the value of the pointer, which is an address, the address of the pointee. (Even after reading that other thread, I messed up my terminology.)

Actually, id returns a pointer to a PyObject representing an integer, the value of which is the value of the pointer that was passed as an argument, which is the address of the pointee. Right?


JavaScript does have value semantics for booleans and numbers. But it doesn't have any reference types which allow you to inspect the reference as a value.

A pointer does not contain what it points to, it just points to it.

The content (the value) of a pointer is an address.


> which occurred approximately once every 3-4 months

ouch that hurts. You should be proud of that fix....I guess you kinda are :D


I've never thought about that aspect of my writing, but I use the terms "value of the pointer" to mean the address, and "contents of the pointer" to mean what's stored at the address.

Yes, "targets" would be clearer than "contents" here.

> in C that would mean a null pointer

It's a Rust-ism. In Rust there is no null.


Raw pointers in Rust can be null.

*const T and *mut T both have an is_null method, and there are std::ptr::null and std::ptr::null_mut available to create null pointers.
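
A tiny sketch, for anyone unfamiliar (my own example):

    use std::ptr;

    fn main() {
        // References can never be null, but raw pointers can be.
        let p: *const u32 = ptr::null();
        assert!(p.is_null());

        let x = 5u32;
        let q: *const u32 = &x;
        assert!(!q.is_null());
    }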


One thing omitted from this article is that it's not only the physical hardware that may (effectively) re-order operations. The compiler may also perform these re-orderings.

The compiler's re-orderings will always be valid according to the abstract memory model rather than the hardware's, so even on x86 you must use the correct memory orderings, or risk subtle bugs due to compiler optimisations.


In particular, the “multi-threaded using volatile” example is technically incorrect because the compiler is allowed to move non-volatile accesses across volatile ones. [1] To make it correct (but still only on x86), you’d have to use std::sync::atomic::compiler_fence.

[1] https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html
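
The usual pattern looks roughly like the sketch below (adapted from compiler_fence's documented signal-handler example, not the blog post's code). The fence constrains only the compiler and emits no machine instruction, which is why it is only sufficient on hardware, like x86, that already keeps the stores in order:

    use std::sync::atomic::{compiler_fence, AtomicBool, AtomicUsize, Ordering};

    static IMPORTANT_VARIABLE: AtomicUsize = AtomicUsize::new(0);
    static IS_READY: AtomicBool = AtomicBool::new(false);

    fn main_loop() {
        IMPORTANT_VARIABLE.store(42, Ordering::Relaxed);
        // Stop the compiler from moving the store above past the flag
        // update; on x86 the hardware already keeps stores in order.
        compiler_fence(Ordering::Release);
        IS_READY.store(true, Ordering::Relaxed);
    }

    fn signal_handler() {
        if IS_READY.load(Ordering::Relaxed) {
            compiler_fence(Ordering::Acquire);
            assert_eq!(IMPORTANT_VARIABLE.load(Ordering::Relaxed), 42);
        }
    }

    fn main() {
        main_loop();
        signal_handler();
    }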


Did this page just load for me in the blink of an eye? Way to go!

And the good news here is that many libraries have accelerated their ARM port / investigations.

Isn't this something the Rust std lib API or perhaps the LLVM backend should take care of?

The problem is that there's likely to be a fair amount of logically broken 3rd party code out there because x86's memory model semantics don't reveal the issues. The compiler can't help here; it'll just do what the developer told it to do.

It's a whole new 'it works on my machine' issue (for some people).


It does as long as the programmer uses it properly (i.e. you don't use atomics weaker than SeqCst or unsafe code with shared + mutable accesses unless you have proven that it's correct).

See how the functions are all unsafe? The primitives given by the Rust standard library actually do handle this; the author is simply going off the beaten path to illustrate one of the aspects those primitives need to paper over for you.
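
As a rough illustration (my own sketch, not from the post): code that sticks to the safe primitives carries its ordering requirements with it and behaves the same on x86 and ARM.

    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        // The Mutex (like channels, Once, or atomics used via the safe API)
        // encodes the necessary acquire/release barriers internally.
        let counter = Arc::new(Mutex::new(0u32));
        let handles: Vec<_> = (0..4)
            .map(|_| {
                let counter = Arc::clone(&counter);
                thread::spawn(move || {
                    *counter.lock().unwrap() += 1;
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
        assert_eq!(*counter.lock().unwrap(), 4);
    }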


