
Test results for Broadwell and Skylake - ivank
http://www.agner.org/optimize/blog/read.php?i=415
======
charleskh
A lot of this goes over my head but does this mean it's not worth waiting for
the 2016 Macbook Pro to come out with the Skylake processor?

~~~
virtuallynathan
It has a lot more going for it besides AVX-512, so I'd consider waiting. These
include: lower power, HEVC in hardware, JPEG in hardware, some VP9 in hardware,
a better GPU, faster AES-NI, and Thunderbolt 3.

If Apple supports it in software, it can also jump between power-saving states
much faster.

~~~
desdiv
No consumer Skylake chips will have AVX-512. It's going to be on server Xeon
Skylake chips only ("server Xeon" as opposed to "laptop Xeon"[0]).

[0] [http://www.imore.com/intels-xeon-coming-soon-laptops-and-what-means-mac](http://www.imore.com/intels-xeon-coming-soon-laptops-and-what-means-mac)

~~~
ksec
Wow, this is news to me. What is Skylake without AVX-512?

This is more like a tick from Intel rather than a tock.

~~~
Narishma
I don't think they are following the tick tock rythm anymore. Fab process
improvements have slowed down recently.

------
nkurz
I think I figured out what was happening in the question I posted to Agner. I
submitted a reply, but it's still in moderation. For me, it finally explains a
lot of the lower-than-expected performance I've seen with AVX/AVX2. In case
others are interested, here it is.

-----

Agner wrote: You are always limited by cache ways, read/write buffers, faulty
prefetching, suboptimal reordering, etc.

Yes, although in my example I'm considering the much simpler case where there
are two reads but no writes, and all data is already in L1. So although these
are problematic in the real world, they shouldn't be a factor here. In fact, I
see the same maximum speed if I read the same 4 vectors over and over rather
than striding over all the data. I've refined my example, though, and think I
now understand what's happening. The problem isn't a bank conflict; rather,
it's a slowdown due to unaligned access. I don't think I've seen this
discussed before.

Contrary to my previous understanding, alignment makes a big difference in the
speed at which vectors are read from L1 into registers. If your data is 16B
aligned rather than 32B aligned, a sequential read from L1 is no faster with
256-bit YMM reads than it is with 128-bit XMM reads. VMOVAPS and VMOVUPS have
the same speed, but you cannot achieve two 32B loads per cycle if the
underlying data is not 32B aligned. If the data is 32B aligned, you still
can't quite sustain 64 B/cycle of load with either instruction, but you can
get to about 54 B/cycle with both.

I put up new test code here:
[https://gist.github.com/nkurz/439ca1044e11181c1089](https://gist.github.com/nkurz/439ca1044e11181c1089)
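
For anyone who doesn't want to dig through the gist, here is a minimal sketch
of the kind of measurement involved (not the gist code itself: the timing uses
__rdtsc reference cycles rather than performance counters, the names are mine,
and compilers may reshape the loop, so treat the absolute numbers loosely):

    /* Sketch: read a 4096-float buffer with 32B YMM loads while shifting
     * the starting alignment.  Compile with e.g.: gcc -O2 -mavx align.c */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>

    #define FLOATS 4096
    #define REPEAT 100000

    /* One pass over the buffer.  VMOVUPS is used regardless of alignment;
     * XORing into four accumulators keeps the loads live without creating
     * a long dependency chain that would hide the load throughput. */
    static __m256 read_ymm(const float *p) {
        __m256 a0 = _mm256_setzero_ps(), a1 = a0, a2 = a0, a3 = a0;
        for (int i = 0; i < FLOATS; i += 32) {
            a0 = _mm256_xor_ps(a0, _mm256_loadu_ps(p + i));
            a1 = _mm256_xor_ps(a1, _mm256_loadu_ps(p + i + 8));
            a2 = _mm256_xor_ps(a2, _mm256_loadu_ps(p + i + 16));
            a3 = _mm256_xor_ps(a3, _mm256_loadu_ps(p + i + 24));
        }
        return _mm256_xor_ps(_mm256_xor_ps(a0, a1), _mm256_xor_ps(a2, a3));
    }

    int main(void) {
        /* 64B-aligned base with slack so the starting offset can be shifted. */
        float *base = aligned_alloc(64, (FLOATS + 16) * sizeof(float));
        for (int i = 0; i < FLOATS + 16; i++) base[i] = (float)i;

        for (int offset = 8; offset <= 32; offset += 8) {
            const float *p = (const float *)((const char *)base + offset);
            __m256 acc = _mm256_setzero_ps();
            uint64_t start = __rdtsc();
            for (int r = 0; r < REPEAT; r++) {
                acc = _mm256_xor_ps(acc, read_ymm(p));  /* keep each pass live */
                asm volatile("" ::: "memory");          /* don't let the pass be hoisted */
            }
            uint64_t cycles = __rdtsc() - start;
            float out[8];
            _mm256_storeu_ps(out, acc);
            printf("alignment %2d: %.1f bytes/cycle (checksum %g)\n", offset,
                   (double)FLOATS * 4 * REPEAT / (double)cycles, out[0]);
        }
        free(base);
        return 0;
    }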

Results at L1 sizes are essentially the same on Haswell and Skylake.

    
    
      Loading 4096 floats with 64 byte raw alignment
      Vector alignment 8:
      load_xmm : 19.79 bytes/cycle
      load_xmm_nonsequential : 23.41 bytes/cycle
      load_ymm : 28.64 bytes/cycle
      load_ymm_nonsequential : 36.57 bytes/cycle
    
      Vector alignment 16:
      load_xmm : 29.26 bytes/cycle
      load_xmm_nonsequential : 29.05 bytes/cycle
      load_ymm : 28.44 bytes/cycle
      load_ymm_nonsequential : 36.90 bytes/cycle
    
      Vector alignment 24:
      load_xmm : 19.79 bytes/cycle
      load_xmm_nonsequential : 23.54 bytes/cycle
      load_ymm : 28.64 bytes/cycle
      load_ymm_nonsequential : 36.57 bytes/cycle
    
      Vector alignment 32:
      load_xmm : 29.05 bytes/cycle
      load_xmm_nonsequential : 28.85 bytes/cycle
      load_ymm : 53.19 bytes/cycle
      load_ymm_nonsequential : 52.51 bytes/cycle
    

What this says is that unless your loads are 32B aligned, you are limited to
about 40 B loaded per cycle regardless of method. If you are sequentially
loading non-32B-aligned data from L1, the speeds for 16B loads and 32B loads
are identical, and limited to less than 32 B per cycle. All alignments not
shown were the same as 8B alignment.

Loading in a non-sequential order is about 20% faster for unaligned XMM and
unaligned YMM loads. It's possible there is a faster order than I have found
so far. Aligned loads are the same speed regardless of order. Maximum speed
for aligned XMM loads is about 30 B/cycle, and maximum speed for aligned YMM
loads is about 54 B/cycle.

At L2 sizes, the effect still exists, but is less extreme. XMM loads are
limited to 13-15 B/cycle on both Haswell and Skylake. On Haswell, YMM non-
aligned loads are 18-20 B/cycle, and YMM aligned loads are 24-26 B/cycle. On
Skylake, YMM aligned loads are slightly faster at 27 B/cycle. Interestingly,
sequential unaligned L2 loads on Skylake are almost the same as aligned loads
(26 B/cycle), while non-sequential loads are much slower (17 B/cycle).

At L3 sizes, alignment is barely a factor. On Haswell, all loads are limited
to 11-13 B/cycle. On Skylake, XMM loads are the same 11-13 B/cycle, while YMM
loads are slightly faster at 14-17 B/cycle.

Coming from memory, XMM and YMM loads on Haswell are the same regardless of
alignment, at about 5 B/cycle. On Skylake, XMM loads are about 6.25 B/cycle,
and YMM loads are about 6.75 B/cycle, with little dependence on alignment.
It's possible that prefetch can improve these speeds slightly.
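
(For anyone who wants to try the prefetch experiment: the usual pattern is
just to issue a software prefetch a fixed distance ahead of the load stream. A
sketch of the idea; PF_DIST is a guess rather than a tuned value, and whether
this beats the hardware prefetchers at all is exactly the open question.)

    #include <immintrin.h>
    #include <stddef.h>

    #define PF_DIST 512  /* bytes ahead of the current load, i.e. 8 cache lines (untuned) */

    /* Read loop with an explicit prefetch ahead of each 32B load.  Prefetches
     * past the end of the buffer are harmless: they never fault. */
    static __m256 read_with_prefetch(const float *p, size_t nfloats) {
        __m256 acc = _mm256_setzero_ps();
        for (size_t i = 0; i < nfloats; i += 8) {
            _mm_prefetch((const char *)(p + i) + PF_DIST, _MM_HINT_T0);
            acc = _mm256_xor_ps(acc, _mm256_loadu_ps(p + i));
        }
        return acc;
    }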

Agner writes: The write operations may sometimes use port 2 or 3 for address
calculation, where the maximum throughput requires that they use port 7.

I don't recall if you mention it in your manuals, but I presume you are aware
that Port 7 on Haswell and Skylake is only capable of "simple" address
calculations? Thus sustaining 2 loads and a store per cycle is only possible
if the store address is of the [const + base] form rather than
[const + index*scale + base]. And as you point out, even if you do this, it
can still be difficult to force the processor to use only Port 7 for the store
address.
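
To make the addressing-form point concrete, here are the two shapes of the
same add-and-store loop. This is only a sketch of intent: the names are mine,
and the compiler is free to rewrite either version into the other, so the
generated assembly has to be checked to see which store-address form you
actually got.

    #include <immintrin.h>
    #include <stddef.h>

    /* c[i] = a[i] + b[i]: two loads and one store per vector, n a multiple of 8. */

    /* Indexed form: the store address is typically emitted as something like
     * [rdx + rax*4], which Port 7's simple AGU cannot handle, so the
     * store-address uop competes with the two loads for Ports 2 and 3. */
    void add_indexed(const float *a, const float *b, float *c, size_t n) {
        for (size_t i = 0; i < n; i += 8) {
            __m256 v = _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i));
            _mm256_storeu_ps(c + i, v);
        }
    }

    /* Pointer-bumping form: the intent is a store address of the plain [rdx]
     * (const + base) form, which Port 7 can handle, leaving Ports 2 and 3
     * free for the two loads. */
    void add_bumped(const float *a, const float *b, float *c, size_t n) {
        const float *a_end = a + n;
        while (a < a_end) {
            __m256 v = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
            _mm256_storeu_ps(c, v);
            a += 8; b += 8; c += 8;
        }
    }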

~~~
stephencanon
When dealing with unaligned loads, keep in mind that the penalty for a page-
crossing load is _huge_ pre-Skylake (~100 cycles); does your test set include
page boundary crossings?

I believe there's still a throughput penalty for cacheline-crossing loads as
well, though it's quite modest compared to pre-Nehalem (where each cacheline
crossing cost you 20 cycles!).
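
(A boundary-crossing load is easy to construct by starting it a few bytes
before the boundary. A rough sketch of how one might check, with names and
offsets of my choosing; the timings are TSC-based, so only the relative
differences are meaningful.)

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <x86intrin.h>

    #define REPEAT 1000000

    /* Time a repeated 32B load that starts 'offset' bytes into a pair of pages. */
    static double time_load(const char *mem, int offset) {
        const float *p = (const float *)(mem + offset);
        __m256 acc = _mm256_setzero_ps();
        uint64_t start = __rdtsc();
        for (int r = 0; r < REPEAT; r++) {
            acc = _mm256_xor_ps(acc, _mm256_loadu_ps(p));
            asm volatile("" ::: "memory");  /* force the load to be repeated */
        }
        uint64_t cycles = __rdtsc() - start;
        float out[8];
        _mm256_storeu_ps(out, acc);         /* keep the loads live */
        volatile float sink = out[0];
        (void)sink;
        return (double)cycles / REPEAT;
    }

    int main(void) {
        /* Two adjacent anonymous pages, so the straddling load has somewhere to land. */
        char *mem = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) return 1;
        for (int i = 0; i < 8192; i++) mem[i] = (char)i;

        printf("within line : %.2f ref cycles/load\n", time_load(mem, 0));
        printf("line cross  : %.2f ref cycles/load\n", time_load(mem, 4064 - 16));
        printf("page cross  : %.2f ref cycles/load\n", time_load(mem, 4096 - 16));
        munmap(mem, 8192);
        return 0;
    }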

~~~
matt_d
I realize that this is another topic, although in a somewhat similar context,
so I thought I might just ask: Have there been any advances in reducing the
page walk latency?

I'm thinking of virtual address translation costs having an impact on the run
times of common algorithms, e.g., as demonstrated in the following work by
Jurkiewicz & Mehlhorn:
[http://arxiv.org/abs/1212.0703](http://arxiv.org/abs/1212.0703)

The recent research I'm aware of is, e.g., the Generalized Large-page
Utilization Enhancements (GLUE) mechanism, proposed in "Large Pages and
Lightweight Memory Management in Virtualized Environments" (from this year's
MICRO): slides:
[https://dl.dropboxusercontent.com/u/36554102/BPC-1.pdf](https://dl.dropboxusercontent.com/u/36554102/BPC-1.pdf)
; paper:
[http://paul.rutgers.edu/~binhpham/phamMICRO15.pdf](http://paul.rutgers.edu/~binhpham/phamMICRO15.pdf)

Admittedly, it focuses specifically on one aspect (the Double Address
Translation on Virtual Machines issue in the Jurkiewicz & Mehlhorn context).

What I'm wondering about is: Has there been any progress on that on the
"practical implementation" side, in the recent/coming Intel (or other, for
that matter) CPUs?

~~~
nkurz
I haven't read these papers (thanks for the links!) but there has been one
major recent improvement. Starting with Broadwell (and I presume continuing
with Skylake), the CPU can now handle two page misses in parallel:
[http://www.anandtech.com/show/8355/intel-broadwell-architecture-preview/2](http://www.anandtech.com/show/8355/intel-broadwell-architecture-preview/2)

The other interesting thing is that page walks themselves are actually not
very expensive: something on the order of 10 cycles. They only become
painfully expensive when the page table is too large to fit in cache and
spills into memory, so that the walk itself requires loads from memory. So
improvements in memory (and cache) latency will have a strong positive
effect.
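
For context on why a walk that hits in cache is cheap: x86-64 4-level
translation is just four dependent table lookups, one per 9-bit slice of the
virtual address, so if the relevant page-table lines are cached each level
costs only a few cycles, while a walk whose entries sit in DRAM pays a memory
latency per level. A toy illustration of how the levels are indexed (the real
walk is done in hardware, of course):

    #include <stdint.h>
    #include <stdio.h>

    /* 48-bit virtual address = 9-bit PML4 | 9-bit PDPT | 9-bit PD | 9-bit PT | 12-bit offset.
     * Each level is one load whose address depends on the previous level's result. */
    int main(void) {
        uint64_t va = 0x00007f1234567abcULL;  /* arbitrary example address */
        unsigned pml4 = (va >> 39) & 0x1ff;
        unsigned pdpt = (va >> 30) & 0x1ff;
        unsigned pd   = (va >> 21) & 0x1ff;
        unsigned pt   = (va >> 12) & 0x1ff;
        unsigned off  = (unsigned)(va & 0xfff);
        printf("PML4 %u, PDPT %u, PD %u, PT %u, offset 0x%x\n", pml4, pdpt, pd, pt, off);
        return 0;
    }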

~~~
matt_d
Thank you for the reply!

Interesting about the parallel handling of page misses, thanks!

One worry is that this tends to compound other effects -- say, non-prefetch-
friendly access combined with TLB misses resulting in increasingly expensive
slowdowns (as in the continuous-vs.-random array access example in the paper).
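
(The continuous-vs.-random effect is easy to reproduce; a rough sketch, with
sizes and names of my choosing, that sums the same large array once
sequentially and once through a shuffled index order so that nearly every
access misses both the caches and the TLB.)

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 24)  /* 16M uint64_t = 128 MB: well past L3 size and TLB reach */

    static uint64_t rng = 0x9e3779b97f4a7c15ULL;
    static uint64_t next_rand(void) {  /* xorshift64 */
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        uint64_t *data = malloc(N * sizeof *data);
        uint32_t *perm = malloc(N * sizeof *perm);
        for (uint32_t i = 0; i < N; i++) { data[i] = i; perm[i] = i; }
        /* Fisher-Yates shuffle to build a random visiting order. */
        for (uint32_t i = N - 1; i > 0; i--) {
            uint32_t j = (uint32_t)(next_rand() % (i + 1));
            uint32_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }

        uint64_t sum = 0;
        double t0 = seconds();
        for (uint32_t i = 0; i < N; i++) sum += data[i];        /* sequential */
        double t1 = seconds();
        for (uint32_t i = 0; i < N; i++) sum += data[perm[i]];  /* random */
        double t2 = seconds();

        printf("sequential: %.3f s, random: %.3f s (sum %llu)\n",
               t1 - t0, t2 - t1, (unsigned long long)sum);
        free(data); free(perm);
        return 0;
    }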

