
Memory-Level Parallelism: Intel Skylake versus Apple A12/A12X - gbrown_
https://lemire.me/blog/2018/11/13/memory-level-parallelism-intel-skylake-versus-apple-a12-a12x/
======
nkurz
Adding a little bit of background here might help. Daniel defines what he
means by "lane" in an earlier post:
[https://lemire.me/blog/2018/11/05/measuring-the-memory-
level...](https://lemire.me/blog/2018/11/05/measuring-the-memory-level-
parallelism-of-a-system-using-a-small-c-program/).

Restating, what he's doing is "pointer chasing", which is one of the slowest
ways of accessing memory because each load is delayed by the full latency of
the lookup. Set up an array of random numbers, look up the value at a
location, use the value you find as the next array location, then repeat.
Because each load is data dependent on the previous one, no prefetching is
possible. So far, all data reported are for a single core.
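
For concreteness, here's a minimal C sketch of one lane (my own illustration,
not Daniel's exact code; it assumes the array has been filled with a random
permutation cycle over its own indices):

    
    
        #include <stddef.h>
        #include <stdint.h>
        
        /* One "lane": each index depends on the previous load, so the
         * core can keep only one cache miss in flight at a time. */
        uint32_t chase(const uint32_t *array, size_t steps) {
            uint32_t pos = 0;
            for (size_t i = 0; i < steps; i++) {
                pos = array[pos]; /* data dependent: no overlap, no prefetch */
            }
            return pos; /* return the result so the loop isn't optimized away */
        }
    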

Each self-contained "pointer chase" is a "lane". To increase the number of
lanes, do multiple pointer chases in parallel: look up N array values in
quick succession on the same core, use the values you find as the next array
locations for the respective lanes, then repeat. You find that, up to a point,
the total number of "chases" per second increases as you increase the number
of lanes.

Different processors "plateau" at different levels of parallelism. Intel
processors max out at about 10 lanes, which implies that they can keep about
10 requests to RAM "in flight" at a time. This matches the published number of
"fill buffers" for Intel chips. Apple's processors max out at over 20
lanes, implying that they can keep more than 20 requests in flight. Apple
hasn't published how many "fill buffers" its chips have, so this is one of the
first direct measurements.

One of the big questions being asked is: "If increasing the number of
requests-in-flight is important to performance, why doesn't Intel just
increase the number of fill buffers?" If you have an answer, please let us
know, either here, or as a comment on the blog, or in response to Travis's
question at RWT:
[https://www.realworldtech.com/forum/?threadid=181499](https://www.realworldtech.com/forum/?threadid=181499).

~~~
davrosthedalek
It might be interesting to know if these lanes are per logical core, physical
core or CPU. If it's per core, it might be hard to find real-life code using
more than 10 lanes outside of micro-benchmarks.

~~~
nicoburns
I'm not an expert (so I could be completely wrong), but I could imagine in-
memory data stores being bound by this kind of limit...

~~~
dragontamer
AMD processors have 44 stores in flight, but only 22 loads in flight.

I presume Intel processors have a similar difference between their "store
buffers" and their "load buffers".

~~~
nkurz
I'm not sure if we are using the same terminology, but I think Intel has the
opposite arrangement, with more "load buffers" than "store buffers". Table
2-7 (L1 Data Cache Components) in
[https://www.intel.com/content/dam/doc/manual/64-ia-32-archit...](https://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-
optimization-manual.pdf) says:

    
    
                              Sandy Bridge
      Load buffers            64 entries 
      Store buffers           36 entries
      Line fill buffers (LFB) 10 entries
    

I think Haswell (and later?) is 72/42/10 per physical(?) core, again with more
load buffers than store buffers. How does this match up with what you know
about AMD? And do you have a clear number of LFBs (or whatever they call the
buffers which are allocated only for L1D misses) for AMD?

Edit: I now see the link you gave in your other comment. Hmm. I think the
numbers you gave do correspond to the inverted ratio of Intel's numbers, and I
couldn't find any claimed number corresponding to LFBs. We do have a couple of
AMD machines we test on, and I think we did measure this, but I don't recall
the exact numbers, other than a general recollection that they were about the
same as Intel's. In his RWT post, Travis claims Ryzen has 8-10.

~~~
dragontamer
> In his RWT post, Travis claims Ryzen as having 8-10.

I wonder if they're statically partitioned: so that you only get ~11 per
thread on Ryzen when SMT is activated.

The AMD optimization manual is clear that the LS unit handles cache misses.
Intel's manual is clear that the LFB handles cache misses.

Load/store buffers seem to be a more general mechanism for micro-op
reordering.

--------

It appears I've been mistaken btw. Let me post the full text of the section:

> The LS unit includes a 44-entry load queue (LDQ). The LDQ receives load
> operations at dispatch. Loads leave the LDQ when the load has completed and
> delivered data to the integer unit or the floating-point unit.

> The LS unit utilizes a 44-entry store queue which holds stores from dispatch
> until the store data can be written to the data cache

...

> The LS unit can track up to 22 outstanding in-flight cache misses.

So that's a 44-entry load queue, a 44-entry store queue, and 22 in-flight
cache misses for AMD Ryzen. Presumably, this is split somehow between
hyperthreads (which would lead to the ~10 number you suggested).

~~~
BeeOnRope
Load and store queues are different than the "miss buffers" we are talking
about here. Every load and store will need one of these entries from
allocation until retirement, so there are usually a lot of them, something
like 20% to 40% the size of the ROB (re-order buffer) each, since many code
sequences have many loads and stores.

The miss buffers are different in that they only get involved when there is an
L1 miss. They are only used from the moment the miss is detected until the
miss completes. Also, load and store buffers are used 1:1 with load and store
instructions (including instructions that have a load, store or both combined
with an ALU operation, such as load-op or RMW on x86). Miss buffers are 1:1
with outstanding misses _on the granularity of cache lines_. So if you load 8
consecutive 4-byte values, which all fall in the same cache line, which misses
in L1, you'll have 8 load buffer entries used, but only 1 fill buffer used
since the buffer coalesces all misses to the same line.
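
To make the line-granularity point concrete, here's a small sketch of my own
(not BeeOnRope's code), assuming 64-byte lines and 4-byte ints:

    
    
        #include <stdint.h>
        
        /* Both loops issue 8 loads (8 load-buffer entries). If the data
         * misses in L1, the first occupies 1 fill buffer (one line), the
         * second up to 8 (eight distinct lines). */
        uint32_t same_line(const uint32_t *p) {
            uint32_t sum = 0;
            for (int i = 0; i < 8; i++)
                sum += p[i];      /* all within one 64-byte line */
            return sum;
        }
        
        uint32_t eight_lines(const uint32_t *p) {
            uint32_t sum = 0;
            for (int i = 0; i < 8; i++)
                sum += p[i * 16]; /* 16 ints = 64 bytes apart */
            return sum;
        }
    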

------
saagarjha
> On Apple’s devices, whether it be an iPhone or an iPad Pro, I believe that
> the pages always span 16kB and that this cannot be easily reconfigured.

This is correct, and has been true since the A9:
[https://developer.apple.com/library/archive/documentation/Pe...](https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/AboutMemory.html#//apple_ref/doc/uid/20001880-99100-TPXREF113)

------
dragontamer
AMD's optimization guide for Zen
([https://developer.amd.com/wordpress/media/2013/12/55723_SOG_...](https://developer.amd.com/wordpress/media/2013/12/55723_SOG_Fam_17h_Processors_3.00.pdf))
suggests:

> The LS unit can track up to 22 outstanding in-flight cache misses.

So not as many as Apple, but more than Intel in this regard. I could probably
test this, but I'm tired today, lol.

----------

10 outstanding loads per core seems like a small number, especially compared
to contemporary processors (Apple's or AMD's). Maybe Intel knows something we
don't.

~~~
awalton
> Maybe Intel knows something we don't.

Or maybe Intel's chips miss less often due to more sophisticated branch
prediction and prefetching? The Pentium 4s were victims of excruciatingly
painful pipeline flushes and stalls. It'd make sense that Intel would have
invested heavily in understanding and mitigating these kinds of caching
behaviors, and a lot of that surely was lifted into the Nehalem/Core
microarchitectures that most modern Intel chips are designed around.

~~~
Taniwha
I think the point of this benchmark is that you can't predict load addresses,
because they are effectively random.

------
dragontamer
Okay, I've given it some thought. Here's my theory:

ARM's load/load, load/store, store/load, and store/store reordering is more
flexible than Intel's store/load (only) reordering. As such, ARM is more
likely to benefit from deeper load-buffering than Intel.

Remember that all of these systems are multi-core, which means we have to
consider what these load buffers do in a multi-core situation. Here's some
example code:

Thread 1:

    
    
        register int RAX = 0; // register only: no load/store associated
        int a, b, c;          // "true memory": accessing these has a load/store
        while (1) {
            memory_barrier();
            a = RAX; RAX++;
            memory_barrier();
            b = RAX; RAX++;
            memory_barrier();
            c = RAX; RAX++;
        }
    

In effect, this thread would create the memory commands:

    
    
        memory barrier
        store [A], 0
        memory barrier
        store [B], 1
        memory barrier
        store [C], 2
        memory barrier
        store [A], 3
        memory barrier
        store [B], 4
        memory barrier
        store [C], 5
        memory barrier
        store [A], 6
        (etc. etc.)
    

For simplicity, I've added memory barriers to ensure that these writes happen
in EXACTLY this order. No need to worry about ARM's store/store reordering in
this example. (x86 does NOT need the memory barriers, because store/store will
always happen in order.)

Now let's imagine a second core reading from these variables. Let's say
Core #2 has the following load buffer:

    
    
        load rax, [A]
        load rbx, [B]
        load rcx, [C]
        # Note: no memory barriers!
    

With this ordering, the Intel processor guarantees that the loads are observed
in program order: once the load of [A] returns a given store's value, the
later loads of [B] and [C] must see at least everything that preceded that
store.

This is NOT true on ARM. On an ARM processor, the following assignments are
possible:

    
    
        RAX = 6 # the load of [A] MAY execute last on ARM
        RBX = 1 # stale: B,1 was overwritten (by B,4) before A,6 was stored
        RCX = 2 # this combination will NEVER happen on x86
    

Why? Because ARM allows loads to cross other loads: the "load rbx, [B]"
instruction MAY actually execute before the "load rax, [A]" that precedes it
in program order.

This means that load buffers are far, far cheaper on ARM. But they also come
at a cost to the programmer's sanity: if you actually rely on an important
"ordering", you need to be sure to protect your code with memory barriers.

x86 has an implicit memory barrier on all stores and loads (only store/load
reordering is permitted). This is slower and more costly. In effect, the
load buffer on x86 MUST be a proper queue: the first loads must execute first,
while the last loads must execute last.

On ARM, this isn't the case. The loads in a load buffer of size 40 can all
execute simultaneously and out of order. Much to the confusion of
multi-threaded programmers, I'm sure. This example also demonstrates why, on
weakly ordered systems, you need a memory barrier on BOTH threads. (Fixing the
store/store reordering with memory barriers only solves half the problem.)
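
For example, here's a minimal C11 sketch of my own (using hypothetical shared
variables matching the example above) of the reader side fixed with acquire
loads; on ARM these compile to real barriers, while on x86 they are
effectively plain loads:

    
    
        #include <stdatomic.h>
        
        extern _Atomic int A, B, C; /* the shared variables from above */
        
        /* Acquire ordering forbids a later load from being hoisted above
         * an earlier one, ruling out the stale (RBX = 1) outcome. */
        void reader(int *rax, int *rbx, int *rcx) {
            *rax = atomic_load_explicit(&A, memory_order_acquire);
            *rbx = atomic_load_explicit(&B, memory_order_acquire);
            *rcx = atomic_load_explicit(&C, memory_order_acquire);
        }
    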

With that being said: modern programmers know that the only variables that
need a defined ordering are synchronization structures, like spinlocks. Every
spinlock release needs a "release" barrier (the explicit barrier states: do
NOT reorder anything past this store). Every spinlock acquisition needs an
"acquire" barrier (do NOT reorder anything to before this load).

Semaphores, atomics, and other synchronization structures also need a defined
ordering. Modern programmers know that accesses to "normal" variables are
"unordered". So ARM has the advantage with regard to modern languages; x86's
architecture is stricter than necessary.
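
As a sketch of what that looks like in portable code (C11 atomics, my
illustration):

    
    
        #include <stdatomic.h>
        
        static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
        
        void lock(void) {
            /* acquire: nothing after this may be reordered before it */
            while (atomic_flag_test_and_set_explicit(&lock_flag,
                                                     memory_order_acquire))
                ; /* spin */
        }
        
        void unlock(void) {
            /* release: nothing before this may be reordered past it */
            atomic_flag_clear_explicit(&lock_flag, memory_order_release);
        }
    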

~~~
BeeOnRope
Yes, this is a reasonable explanation of the ARM and Intel memory models, but
I'm not following how this relates to miss buffers specifically.

The load buffers which maintain the source/program ordering on a per-
instruction basis as you describe above are called ... load buffers. These are
different from the "miss buffers" which track outstanding misses. Intel chips
have 70+ load buffers, so there is no shortage there. Yes, x86 chips have to
jump through a bunch of hoops to keep ordering, which means their load buffers
probably need to be more sophisticated (Intel sometimes refers to a MOB, a
"memory order buffer", which tracks this stuff, but as far as I can tell this
is not a separate buffer, just another name for the load and store buffers
and associated machinery, with a focus on the memory ordering aspect).

The way it works on Intel is that everything will speculatively execute out of
order, loads passing loads and all the things the memory model doesn't
officially allow, but the load buffer will track the original order. As long
as there is no indication of an actual problem with the ordering (i.e., a
store on another core which causes a snoop that hits the load buffer on this
core), the out-of-order nature of the execution isn't visible, so everything
can retire fine. If something happens that violates the order, a "MOB machine
clear" occurs (with a cost similar to a branch misprediction) and the chip
starts over.

None of this directly implicates the miss buffers though. For one thing, they
are not 1:1 with instructions and they are not instruction ordered, since
multiple misses will coalesce to the same buffer.

That said, I can certainly see the stronger x86 memory model playing a role
here, but it isn't clear to me _how_.

