Not sure where MikusR gets this info from, but it could be right given that Snapdragon 820 is Qualcomm's custom core.
I hope the answer is not "by restricting all emulated threads to one CPU core".
Visual Studio used to (and probably still does) do this around volatile on x86, which leads to fun bugs when you port to ARM/etc.
If you want more details of memory models of some CPU architectures, read
Note that "x86 OOStore" is not a model that you should be concerned about (according to https://groups.google.com/forum/#!topic/linux.kernel/2dBrSeI... it was only used on IDT WinChip)
For a perspective on memory barriers for different memory models with a focus on the Linux kernel look at
If you are specifically interested in some subtle details of the x86 memory model, have a look at
Best begin with
I hope this gives you enough information to start reading up on the subject.
The 2nd part of his talk goes into some of the differences between x86 and other architectures (about 31 min in):
This is a deep rabbit hole because modern processors are speculatively executing THOUSANDS of instructions ahead. So what you have/haven't written to memory is slightly existential.
Here is a good blog post about it http://preshing.com/20120930/weak-vs-strong-memory-models/
The TLDR: x64 tries REALLY HARD to ensure your pointers always have the newest data. ARMv6 very much does not; ARMv7 kind of does; ARMv8 does fairly well.
With up to 96 in the scheduler: 72 in-flight loads and 56 in-flight stores.
Link (a summary; the hard Intel references are cited inside): https://en.wikichip.org/wiki/intel/microarchitectures/skylak...
224 per core. And we're talking concurrency, so a 10-core, 20-thread (HT) server-class chip can have 2240 instructions in flight.
It's neither "thousands" of instructions in flight for a single processor, nor is it thousands "ahead" for multiple processors, let alone both. It's like taking 4000 basic single-stage CPUs and claiming they execute thousands of instructions "ahead", which makes no sense.
Although the idea of "lots of instructions in flight" was right. From the programmer's point of view there's no difference between out-of-order windows of tens, hundreds, or thousands. One just needs to be prepared that the CPU can and will reorder things within the limits of the defined memory model, which does not bound the out-of-order window size.
Because we're talking about cross-core guarantees of concurrency. So you actually do care about what another, or all cores are doing.
What core is loading what, and what core is storing what.. and what is pending/holding up those loads is actually extremely important from the perspective of atomic guarantees.
Weaker CPUs (like, say, POWER7) that do batched writes (you accumulate 256 bits of data, then write it all at once) don't communicate what is in their write-out buffer. So you may write through a pointer, but until that CPU performs a batched write the other cores aren't aware of it. You have to issue a fence to flush this buffer (making the writes visible to the other cores).
There are some scenarios where the same situation can arise on x64, but it's rarer. The Intel cache architecture attempts to detect whether you are sharing data between cores. For _most_ writes it behaves like POWER7, but if it predicts you're sharing data it uses a different bus and alerts the other CPU directly.
This is why x64 uses a MESIF-esque cache protocol: it can tell when data is Owned/Forwarded/Shared between cores.
Intel hasn't updated their white papers in 5+ years, so they're likely using a more advanced protocol by now.
This is mostly about the load/store order of an individual core: how an individual core decides to order its reads and writes to memory.
> Weaker CPU's (Like say POWER7) that do batching writes (you accumulate 256bits of data, then write it all at once) don't communicate what is in their write-out buffer. So you may write to a pointer, but until that CPU does a batched write the other cores aren't aware. You have to do a fence to flush this buffer (in the other cores).
You can actually do the same on x86 by using non-temporal stores. Although you're not really talking about store ordering there, but about visibility to other cores: a store won't ever be visible to other cores until it at least hits the L1 cache controller.
> There are some scenarios where the same situation can arise on x64 but its rarer.
Yup, that's right. That's why x86 (and x64) got mfence and sfence instructions.
> This is why x64 uses a MESIF-esque cache protocol . It can tell when data is Owned/Forwarded/Shared between cores.
Reordering happens before the cache controller is involved; by the time the cache controller sees the store, it is already in progress.
So an emulator like this isn't going to be perfect. Applications which have been getting lucky due to x86's ordering model while avoiding proper sync primitives will likely break, and there isn't much that can be done in those cases except fixing the application. Although it wouldn't surprise me if there were a compatibility flag that falls back to a slow-path emulation mode which orders individual loads/stores and runs at 1/10 normal speed.
This is about memory ordering: how the CPU is allowed to reorder normal unsynchronized loads and stores. x86 has strong ordering, but ARM CPUs can reorder loads and stores significantly more. Thus you very often need memory barriers on ARM where x86 needs none.
If memory order is not handled according to the specification, programs that execute simultaneously on more than one CPU core and rely on the CPU memory model/ordering will not work correctly.
You don't want to make every ordinary load and store atomic, because that would incur a rather high performance cost.
If someone has a program which depends on load/store order in userspace, then they likely have bugs on x86 as well, since threads can be migrated between cores and compilers are fully allowed to reorder loads/stores too, as long as visible side effects are maintained. I could go into the finer points of how compilers have to create internal barriers (this has nothing to do with DMB/SFENCE/etc.) across external function calls as well (which plays into why you should be using library locking calls rather than rolling your own), but that is another whole subject.
The latter case is something I don't think most people understand, particularly as GCC and friends get more aggressive about determining side effects and tossing code. Also, volatile doesn't do what most people think it does: trying to create sync primitives simply by forcing loads/stores does nothing when the compiler is still free to reorder the surrounding operations.
An emulator is also going to maintain this contract as well. That is why things like qemu work just fine to run x86 binaries on random ARMs today without having to modify the hardware memory model.
Incorrect. On x86, if you write to memory n times, other cores are guaranteed to see the writes in the same order: the second write is never visible to other cores before the first. It's correct to rely on the x86 memory model in x86 software.
On ARM, those stores can become visible to other cores in any order.
> since threads can be migrated between cores
Irrelevant. This is about code executing concurrently on multiple cores. Operating system and threads are irrelevant. This is about hardware behavior, CPU core load/store system and instruction reordering, not software.
> ... and the compilers are fully allowed to reorder load/stores...
This has nothing to do with compilers; it has everything to do with how CPU cores reorder reads and writes.
> Particularly as GCC and friends get more aggressive about determining side effects and tossing code
If GCC has bugs, please report them. Undefined behavior can give that impression, but again, this topic has nothing to do with compilers.
> An emulator is also going to maintain this contract as well. That is why things like qemu work just fine to run x86 binaries on random ARMs today without having to modify the hardware memory model.
This is not true. See: http://wiki.qemu.org/Features/tcg-multithread#Memory_consist...
Remaining Case: strong on weak, ex. emulating x86 memory model on ARM systems
I recommend you read this: https://en.wikipedia.org/wiki/Memory_ordering.
If you have complete control over the whole stack, sure. But we are talking about Windows user-space applications. You continue to ignore my original point: the edge cases you are describing may cause problems, but they are just that, edge cases, which can be solved with slow-path code (put a DSB after every store if you like), and they likely depend on behaviors higher in the stack which aren't guaranteed and are therefore "broken". If your code is in assembly and never makes library calls, then you might consider it "correctly written"; otherwise you're probably fooling yourself for the couple percent you gain over simply calling EnterCriticalSection().
Well, hardware feature is a hardware feature. Load/store ordering is a hardware feature.
Operating system, user space or kernel space, compiler, etc. are not relevant when discussing CPU core hardware operation.
You don't need complete control of the stack. You just need code executing concurrently on multiple CPU cores.
And I will repeat this again: for "correctly" written code this doesn't matter, because the data areas being stored to should be protected by _LOCKS_, which enforce visibility. If you think you're being clever and writing "lock free" code by depending on the memory model, you're likely fooling yourself.
Please don't conflate visibility with load/store order. It's a different matter.
See how C++ memory model operations map to different processors, especially how many cases are simple loads or stores on x86.
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html (link from another comment in this discussion)
This is not always true; sequentially consistent loads are a counterexample. Because of the semantics of the x86 memory model, a sequentially consistent load does not generate a lock'd instruction. When such loads are translated to ARM64, you need to introduce barriers or use an ldar (load-acquire) instruction.