
How much bandwidth does the L2 have to give, anyway? - nkurz
https://github.com/travisdowns/uarch-bench/wiki/How-much-bandwidth-does-the-L2-have-to-give,-anyway%3F
======
BeeOnRope
The code to reproduce my results can be found at [1] (Linux only for now, but
porting is welcome). Basically `run-all.sh` generates the results and `plot-
all.sh` plots them.

It would be especially interesting to see Zen2 results since that's the first
AMD chip with 256-bit wide loads executed in a single op, so the first one
that can do better than 16 bytes per cycle from any level of the cache.

[1] [https://github.com/travisdowns/uarch-bench/tree/master/scripts/l2-bandwidth](https://github.com/travisdowns/uarch-bench/tree/master/scripts/l2-bandwidth)

~~~
pedrocr
_> Basically `run-all.sh` generates the results and `plot-all.sh` plots them._

I've done things like this in the past and it's awesome. It would be great if
more and more research was done like this so reproducing and extending results
was much easier. Congratulations.

~~~
BeeOnRope
Thanks!

I started doing that for myself, to save time and make everything
reproducible. Before that, whenever I wanted to make a small change to
anything I'd have to go back and dig up the old command lines, and sometimes
I couldn't reproduce my old results at all.

Recording the result generation in the script, including things like turning
the prefetchers on and off, made everything reproducible for myself, and it
also encouraged experimentation, since regenerating all the results after any
change was easy.

Then, once you make that script for yourself, a nice side effect is that
everyone else can use it too, and you can skip a lot of the description of
how you got your results: the script is a self-documenting record of the
process.

~~~
pedrocr
That was my experience as well. Experimentation becomes much easier, and for
analyses you want to rerun every year/month/etc. as more data comes out, it
saves a lot of work. When you have computationally expensive steps, it's also
great to be able to issue a single build-the-world command at the end of the
day and let the computer regenerate everything overnight.

This was my version of that:

[https://github.com/pedrocr/codecomp](https://github.com/pedrocr/codecomp)

Ruby made for a good way to combine scripting with more declarative build
tools (e.g., rake). I also automated the plotting like you did, by embedding
R snippets. There's probably space for a good framework for this: the Rails
of scientific workflows, as it were. Integrate nicely with R/LaTeX/etc. for
bonus points. Maybe a procrastinating PhD student somewhere will make a name
for themselves doing this :)

------
wiz21c
FTA:

>>> unless somehow your workload really, really wants optimized L2 access.

It's been 20 years since I last optimized assembly code at that level (think
VTune, pipelining, etc.). What kind of workload needs that kind of
optimisation nowadays?

~~~
cbzoiav
Drivers. High frequency / low latency trading. Networking/telecoms equipment.
Core routines (i.e. if 10% of your workload is a single hot block and it runs
at scale, the savings can far outweigh the cost). Cheap electronics produced
at scale (if you sell enough of them, saving 1c on hardware outweighs the
engineering effort).

------
zbjornson
Neat trick! I'm interested to know whether this holds up on server
configurations or is unique to the client parts. In addition to the published
architecture differences between the two (size and exclusive layout), the
prefetcher seems to behave differently in the forward and reverse directions
on SKL-SP.

~~~
BeeOnRope
Yes, it also applies on server architectures that derive from Skylake (aka
SKX), and I've tested it there. However, it is probably of less utility there
since currently all Skylake server architectures support AVX-512, which lets
you load an entire 64-byte cache line in a single load, and SKX can do these
at a rate of about 1 per cycle - so you can already do better than the
technique described here simply by using AVX-512.

As I mentioned near the end of the wiki page, this might still be useful in
scenarios where you don't want to use AVX-512 for some reason.

~~~
servrite
One reason to avoid heavy use of AVX-512 instructions is that doing so is
guaranteed to cause the processor to reduce its clock rate. See the
Optimization manual for the crossover points at which workloads making heavy
(or even moderate) use of AVX-512 (or, in some cases, AVX2) trigger the
slowdown.

