Not sure where MikusR gets this info from, but it could be right given that Snapdragon 820 is Qualcomm's custom core.
I hope the answer is not "by restricting all emulated threads to one CPU core".
Visual Studio used to (and probably still does) do this around volatile on x86, which leads to fun bugs when you port to ARM/etc.
If you want more details of memory models of some CPU architectures, read
Note that "x86 OOStore" is not a model that you should be concerned about (according to https://groups.google.com/forum/#!topic/linux.kernel/2dBrSeI... it was only used on IDT WinChip)
For a perspective on memory barriers for different memory models with a focus on the Linux kernel look at
If you are specifically interested in some subtle details of the x86 memory model, have a look at
Best begin with
I hope this gives you enough information to start reading up on the subject.
The 2nd part of his talk goes into some of the differences between x86 and other architectures (about 31 min in):
This is a deep rabbit hole because modern processors are speculatively executing THOUSANDS of instructions ahead. So what you have/haven't written to memory is slightly existential.
Here is a good blog post about it http://preshing.com/20120930/weak-vs-strong-memory-models/
The TLDR: x64 tries REALLY HARD to ensure your pointers always have the newest data. ARMv6 very much does not; ARMv7 kind of does; ARMv8 does fairly well.
With up to 96 in the scheduler: 72 in-flight loads and 56 in-flight stores.
Link (a summary; the hard Intel references are cited inside): https://en.wikichip.org/wiki/intel/microarchitectures/skylak...
224 per core. And we're talking concurrency, so a 10-core, 20-thread (HT) server-class chip can have 2240 instructions in flight.
It's neither "thousands" of instructions in flight for a single processor, nor is it thousands "ahead" for multiple processors, let alone both. It's like taking 4000 basic single-stage CPUs and claiming they execute thousands of instructions "ahead", which makes no sense.
Although the idea of "lots of instructions in flight" was right. From the programmer's point of view there's no difference between out-of-order windows of tens, hundreds, or thousands. One just needs to be prepared that the CPU can and will reorder things within the limits of the defined memory model, which does not bound the out-of-order window size.
Because we're talking about cross-core guarantees of concurrency. So you actually do care about what another, or all cores are doing.
What core is loading what, and what core is storing what.. and what is pending/holding up those loads is actually extremely important from the perspective of atomic guarantees.
Weaker CPUs (like, say, POWER7) that do batched writes (you accumulate 256 bits of data, then write it all at once) don't communicate what is in their write-out buffer. So you may write through a pointer, but until that CPU performs a batched write the other cores aren't aware of it. You have to issue a fence to flush this buffer (making the writes visible to the other cores).
There are some scenarios where the same situation can arise on x64, but it's rarer. The Intel cache architecture attempts to detect whether you are sharing data between cores. For _most_ writes it behaves like POWER7, but if it predicts you're sharing data it uses a different bus and alerts the other CPU directly.
This is why x64 uses a MESIF-esque cache protocol: it can tell when data is Owned/Forwarded/Shared between cores.
Intel hasn't updated their white papers in 5+ years, so they're likely using a more advanced protocol by now.
This is mostly about the load/store order of an individual core: how an individual core decides to order its reads and writes to memory.
> Weaker CPU's (Like say POWER7) that do batching writes (you accumulate 256bits of data, then write it all at once) don't communicate what is in their write-out buffer. So you may write to a pointer, but until that CPU does a batched write the other cores aren't aware. You have to do a fence to flush this buffer (in the other cores).
You can actually do the same on x86 by using non-temporal stores. Although you're not really talking about store ordering there, but about visibility to other cores: a store won't ever be visible to other cores until it at least hits the L1 cache controller.
> There are some scenarios where the same situation can arise on x64 but its rarer.
Yup, that's right. That's why x86 (and x64) got mfence and sfence instructions.
> This is why x64 uses a MESIF-esque cache protocol . It can tell when data is Owned/Forwarded/Shared between cores.
Reordering happens before the cache controller is involved; by the time the cache controller sees the store, it is already in progress.
So an emulator like this isn't going to be perfect. Applications which have been getting lucky due to x86's ordering model while avoiding proper sync primitives will likely break, and there isn't much that can be done in those cases except fixing the application. Although it wouldn't surprise me if there were a compatibility flag that falls back to a slow-path emulation mode which orders individual loads/stores and runs at 1/10 normal speed.
This is about memory ordering: how the CPU is allowed to reorder normal unsynchronized loads and stores. x86 has strong ordering, but ARM CPUs can reorder loads and stores significantly more. Thus you very often need memory barriers on ARM where x86 needs none.
If memory order is not handled according to the specification, programs that execute simultaneously on more than one CPU core and rely on the CPU memory model/ordering will not work correctly.
You don't want to make every ordinary load and store atomic, because that would incur a rather high performance cost.
If someone has a program which depends on load/store order in userspace, then they likely have bugs on x86 as well, since threads can be migrated between cores and compilers are fully allowed to reorder loads/stores too, as long as visible side effects are maintained. I could go into the finer points of how compilers have to create internal barriers (this has nothing to do with DMB/SFENCE/etc.) across external function calls as well (which plays into why you should be using library locking calls rather than rolling your own), but that is another whole subject.
The latter case is something I don't think most people understand, particularly as GCC and friends get more aggressive about determining side effects and tossing code. Also, volatile doesn't do what most people think it does: trying to create sync primitives simply by forcing loads/stores does nothing when the compiler is still free to reorder the surrounding operations.
An emulator is also going to maintain this contract as well. That is why things like qemu work just fine to run x86 binaries on random ARMs today without having to modify the hardware memory model.
Incorrect. On x86, if you write to memory n times, other cores are guaranteed to see the writes in the same order: the second write is never visible to other cores before the first. It's correct to rely on the x86 memory model in x86 software.
On ARM, those stores can become visible to other cores in any order.
> since threads can be migrated between cores
Irrelevant. This is about code executing concurrently on multiple cores. Operating system and threads are irrelevant. This is about hardware behavior, CPU core load/store system and instruction reordering, not software.
> ... and the compilers are fully allowed to reorder load/stores...
This has nothing to do with compilers; it has everything to do with how CPU cores reorder reads and writes.
> Particularly as GCC and friends get more aggressive about determining side effects and tossing code
If GCC has bugs, please report them. Undefined behavior can give that impression, but again, this topic has nothing to do with compilers.
> An emulator is also going to maintain this contract as well. That is why things like qemu work just fine to run x86 binaries on random ARMs today without having to modify the hardware memory model.
This is not true. See: http://wiki.qemu.org/Features/tcg-multithread#Memory_consist...
Remaining Case: strong on weak, ex. emulating x86 memory model on ARM systems
I recommend you read this: https://en.wikipedia.org/wiki/Memory_ordering.
If you have complete control over the whole stack, sure. But we are talking about Windows user-space applications. You continue to ignore my original point: the edge cases you are describing may cause problems, but they are just that, edge cases, which can be solved with slow-path code (put a DSB after every store if you like), and they likely depend on behaviors higher in the stack which aren't guaranteed and are therefore "broken". If your code is in assembly and never makes library calls, then you might consider it "correctly written"; otherwise you're probably fooling yourself for the couple percent you gain over simply calling EnterCriticalSection().
Well, hardware feature is a hardware feature. Load/store ordering is a hardware feature.
Operating system, user space or kernel space, compiler, etc. are not relevant when discussing CPU core hardware operation.
You don't need complete control of the stack. You just need code executing concurrently on multiple CPU cores.
And I will repeat this again: for "correctly" written code this doesn't matter, because the data areas being stored to should be protected by _LOCKS_, which enforce visibility. If you think you're being clever and writing "lock free" code by depending on the memory model, you're likely fooling yourself.
Please don't conflate visibility with load/store order. It's a different matter.
See how C++ memory model operations map to different processors, especially how many cases are simple loads or stores on x86.
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html (link from another comment in this discussion)
This is not always true; sequentially consistent loads are a counterexample. Because of the semantics of the x86 memory model, a sequentially consistent load does not generate a lock'd instruction. When such loads are translated to ARM64, you need to introduce barriers or use an ldar (load-acquire) instruction.