
Ice Lake Store Elimination - matt_d
https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt.html
======
BeeOnRope
Author here, happy for any feedback or questions.

You can find the previous article at [1], which goes over the basics and the
original finding on Skylake. This new article focuses on Ice Lake but probably
mostly makes sense after reading the original. HN discussion of the first part
is at [2].

---

[1] https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html

[2] https://news.ycombinator.com/item?id=23169605

~~~
ktta
What do you recommend that someone who's a beginner in this kind of
performance-profiling work read up on?

The book that got closest for me was CS:APP, but I haven't found anything
beyond that, and it was also very generic. I'm looking specifically for
high-performance optimization techniques for x86-64.

Any specific books, websites (other than Agner's) or even other blogs?

~~~
bobbiechen
Here's an (oldish, 2008) paper which is (was?) the basis for CMU's 18-645 How
to Write Fast Code at the time:
http://users.ece.cmu.edu/~pueschel/teaching/18-645-CMU-spring08/course.html
(I didn't take the course myself, so I don't know about more recent versions
and couldn't find any publicly).

How To Write Fast Numerical Code: A Small Introduction - Srinivas Chellappa,
Franz Franchetti and Markus Püschel

http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=100

~~~
ktta
Thank you!

------
StillBored
Here we are, it's 2020, and Intel's single-core memory bandwidth numbers
haven't really moved at all since Nehalem, despite the additional memory
channels and the increase in DDR bandwidth moving from DDR2 to DDR4, which
have led to machines with 100+ GB/sec memory bandwidth (the large NUMA
machines have significantly more).

So, it's pretty obvious they are choking off the single-core bandwidth to keep
a single core from starving the others in the machine. This 100% makes sense
for server applications, but for workstation and desktop usage it's completely
crazy that a single core has 50-60 GB/sec of L3 bandwidth, but can only
utilize 10-20% of the available RAM bandwidth to flush the L3, even when the
other cores in the machine are idle.

The ARM machine literally has somewhere between 2-4x the single-core memory
bandwidth, despite being in a configuration with more actual cores than Intel
offers.

~~~
BeeOnRope
Yes, the number of L1 fill buffers was static at ~10 for almost a decade, and
this, in combination with a memory latency that didn't move much, set the
limit on single-core bandwidth.
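
To make that concrete, the limit falls out of Little's law: sustainable
bandwidth is roughly the number of outstanding misses times the line size,
divided by the latency. A back-of-the-envelope sketch (the latency figure is
an assumed, illustrative value, not a measurement):

```c
#include <stdio.h>

int main(void) {
    // Little's law: bandwidth ~= concurrency / latency.
    // 10 fill buffers and 64-byte lines match the discussion above;
    // the ~70 ns memory latency is an illustrative assumption.
    double fill_buffers = 10.0;
    double line_bytes   = 64.0;
    double latency_ns   = 70.0;
    double gbps = fill_buffers * line_bytes / latency_ns; // bytes/ns == GB/s
    printf("max single-core demand BW ~= %.1f GB/s\n", gbps);
    return 0;
}
```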

That said, in ICL there has been a nice jump.

The Graviton results to RAM certainly are nice. I've heard that in this type
of store workload they can automatically use something like NT stores,
avoiding the RFO (which would normally cut bandwidth in half).

The Intel chips do get much better numbers if you use NT stores... but I
didn't go into that since it wasn't the point of this article.
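
For reference, here's a minimal sketch of what an NT-store fill loop looks
like with AVX2 intrinsics (my illustration, not code from the article): the
stores write-combine and go straight to memory without reading the line
first, so no RFO.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Zero-fill with non-temporal stores: no RFO, the line is not read into
// the cache first. buf must be 32-byte aligned, len a multiple of 32.
void fill_zero_nt(uint8_t *buf, size_t len) {
    const __m256i zero = _mm256_setzero_si256();
    for (size_t i = 0; i < len; i += 32)
        _mm256_stream_si256((__m256i *)(buf + i), zero);
    _mm_sfence(); // order the NT stores before any subsequent stores
}
```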

~~~
StillBored
The NT vs vector ops/etc discussion shouldn't even happen on x86.

A decade+ ago, Andy Glew expounded on the mistakes they made in the P6 when it
came to the rep prefix. Apparently, they have finally fixed its startup times
in Ice Lake. But it continues to be a squandered opportunity, because in
theory they could have hidden a lot of the generational/microarchitectural
messiness behind the rep mov/sto sequences, rather than requiring software
updates every time they updated the microarchitecture, like a RISC machine.
In other words, rep should be a lot harder to beat on any given machine.

The non-temporal case is another one of these instances: the rep sequence
could detect early on that the operation is going to be more efficient with
NT stores, reserving explicit software-issued NT stores for the cases where
software is absolutely sure it won't fetch the data in the near future.

There are a few other cases where high-level microcoded instructions like
this are an advantage, because they allow a common piece of code to be
implemented in differing ways depending on the actual product. Reference:
Alpha PAL code.
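
For the curious, handing the choice of strategy to the hardware is as simple
as this GCC/Clang x86-64 inline-asm sketch (my illustration, not anything
from the thread):

```c
#include <stddef.h>

// `rep stosb` fill: the CPU picks its own strategy (ERMSB/FSRM on recent
// Intel parts), which is exactly the "hide it behind rep" idea above.
static void fill_rep_stosb(void *dst, unsigned char val, size_t len) {
    asm volatile("rep stosb"
                 : "+D"(dst), "+c"(len) // rdi = destination, rcx = count
                 : "a"(val)             // al = fill byte
                 : "memory");
}
```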

------
Taniwha
I'm reminded of when I was architecting Mac graphics accelerators back in the
early '90s. We put a lot of work into solid fills (profiling showed us that
was where QuickDraw spent most of its time)... in the end we were pushing
1.5Gb/sec into the frame buffer; the RAM of the day just couldn't go any
faster.

What really surprised us was how much faster Excel got, since it wasn't
really our target. It turns out that every time it updated the screen it
redrew the background 5-6 times, drawing white over white; the actual content
(the black pixels) was just noise in the graphics numbers.

------
potiuper
Can programs be written such that they switch between AVX-512 for L1 and
256-bit AVX for L2/L3? The decision by the compiler to fall back to 256-bit
AVX because of AVX-512 downclocking/throttling, based on a processor flag,
seems biased towards bursting/low-voltage mode and fragile. But the
alternative of JIT-emitting instructions based on the current voltage profile
seems of questionable benefit. Does the 2nd store port only benefit the L2,
or also the L3?

~~~
BeeOnRope
You can't write a program that does that automatically, but if you had a
"cache aware" program that knows which writes are likely to hit in which
level of the cache, you could do it at the software level. I have my doubts
it would be worth it though, and it sounds difficult and fragile.
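
If you did want to try it, the software-level version would look something
like this sketch (the threshold and structure are my assumptions for
illustration; len is taken as a multiple of 64):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define L1_BYTES (48 * 1024) // Ice Lake client L1D size; illustrative cutoff

// "Cache aware" zero fill: 64-byte AVX-512 stores when the buffer should
// fit in L1, 32-byte AVX2 stores otherwise. len must be a multiple of 64.
void fill_zero(uint8_t *buf, size_t len) {
    if (len <= L1_BYTES) {
        const __m512i z = _mm512_setzero_si512();
        for (size_t i = 0; i < len; i += 64)
            _mm512_storeu_si512(buf + i, z);
    } else {
        const __m256i z = _mm256_setzero_si256();
        for (size_t i = 0; i < len; i += 32)
            _mm256_storeu_si256((__m256i *)(buf + i), z);
    }
}
```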

I'm not sure what you mean by the 512 downclock: nothing I describe in the
article has much to do with a downclock, except for the L1 effect at the
beginning, and that was honestly fairly specific to the original test
structure, which alternated periods of spinning on a timer with running a
short test interval. It wasn't actually a downclock either (this CPU does not
downclock for AVX-512 at 3.5 GHz: it has very little downclocking at all):
rather, it's dispatch throttling: the CPU still runs at full speed, but
instructions are prevented from dispatching every cycle, effectively lowering
the throughput, until the voltage can adjust to the heavier instructions.

"Store ports" are only a concept that apply to the core itself, not the
caches. Ports are entry points for execution of a certain type of instruction.
Older CPUs had one store port (p4) and Ice Lake has two (p4 and p9), so two
stores can execute in a single cycle. So these ports only benefit the core and
don't interact with the L2 or L3 directly, which operate on a cache line
basis.
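
A toy example of where the second port shows up (my sketch, not from the
article): each iteration below does two independent stores, which Ice Lake
can execute in one cycle across p4 and p9, while a one-store-port core needs
two cycles, all else being equal.

```c
#include <stddef.h>
#include <stdint.h>

// Two independent stores per iteration. With two store ports (p4 + p9)
// both can execute in the same cycle; with one port they serialize.
void store_pairs(uint64_t *a, uint64_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = 1; // store on one port
        b[i] = 2; // independent store, can issue on the other port
    }
}
```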

Caches _also_ have something called ports, e.g., a cache might have 2 read
ports and 1 write port, meaning 2 reads and 1 write per cycle, but there
isn't any indication these have changed in Ice Lake.

------
jcranmer
I need to add this blog to my list of blogs I regularly follow...

~~~
BeeOnRope
There's an RSS feed if that's your thing:

[https://travisdowns.github.io/feed.xml](https://travisdowns.github.io/feed.xml)

