
AVX-512, what’s useful for us - rbultje
http://obe.tv/about-us/obe-blog/item/50-avx-512-what-s-useful-for-us
======
raphlinus
If I'm understanding it correctly, they're not actually using the 512 bit
(ZMM) registers, because using them can cause overall system slowdown. It
seems to me they're only really useful if you're doing an AVX-512 intensive
workload. And do those really exist? For something like bulk matrix
multiplications, GPGPU is going to be much better, both in throughput and in
operations per joule. I remain to be convinced that the ecological niche
occupied by SIMD is significant, let alone expanding.

~~~
steve_musk
I work in an HPC lab (computational chemistry) and we have a hand-coded
AVX-512 codepath. I can't give any specifics (because I don't know them) but I
know there is a non-trivial speedup versus the standard codepath (or AVX2).

However, GPUs blow it out of the water for a much lower price, since we only
need FP32. I think the main reason we invested time adding support was the
Xeon Phi cards. I guess it could be worthwhile for some FP64 pipelines from a
cost/performance perspective.

~~~
m_mueller
If you look at the Skylake-SP architecture vs. recent GPUs, the chip designs
at first glance don't seem so different anymore; CPUs are just much less
focused, which costs roughly 2x in theoretical performance for the same die
space, even using Intel's superior process technology. Now that
being said, I think the GPU/SIMT model of vector computing is just much
smarter. Why let me jump through all these hoops of masking and compiler
optimizations if all I want is a branch and an early exit for a specific set
of values? GPU schedulers and drivers make this easy to use and with somewhat
predictable performance results. Furthermore (and probably more importantly),
why is Intel putting this amount of compute power on a CPU without
significantly upgrading memory bandwidth? A 28-core Skylake-SP using full
vectorization now has 3x (!) the FLOP/byte system balance of an NVIDIA
P100. Seriously? System balance was once an argument against GPUs, but
apparently not anymore...

~~~
jabl
> Now that being said, I think the GPU/SIMT model of vector computing is just
> much smarter.

I'm not sure. For an argument in favor of vectors, see
[https://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf](https://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf)

> Why let me jump through all these hoops of masking and compiler
> optimizations if all I want is a branch and an early exit for a specific set
> of values? GPU schedulers and drivers make this easy to use and with
> somewhat predictable performance results.

If the underlying hw is SIMD (vectors) and not SIMT anyway, as Nvidia hw
apparently is, why should I have to go through the effort of rewriting my code
in CUDA, and hope that some opaque driver will manage to turn that into
efficient vector code?

I mean, ideally I'd just like to write C/C++/Fortran/Julia/Haskell/whatever
code, and the compiler would autovectorize it.
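
As a sketch of what that already looks like in practice (the function here is
my illustration, not from the thread): given non-aliasing hints, current
GCC/Clang at -O3 will autovectorize a loop like this, and -march=skylake-avx512
gets you 512-bit code without a single intrinsic.

    #include <stddef.h>

    /* saxpy: y = a*x + y. The `restrict` qualifiers promise no aliasing,
       which is the main thing the autovectorizer can't prove on its own.
       Compile with e.g. `gcc -O3 -march=skylake-avx512` for ZMM code. */
    void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
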

> Furthermore (and probably more importantly), why is Intel putting this
> amount of compute power on a CPU without significantly upgrading memory
> bandwidth?

Flops are cheap, bandwidth is expensive. But yeah, there are certainly many
applications that would benefit from a much better bw/flops ratio.

Then again, with the latest Teslas you have 16 GB with awesome bw, after that
you're trying to feed the firehose through the PCIe straw.

------
dragontamer
It will be a while before AVX-512 becomes practical, however. AMD doesn't
support it (so Ryzen or Threadripper fans will miss out), and even Intel's
8th-gen Coffee Lake doesn't support it.

Only the Intel Core i9 Extreme and Xeon Silver / Gold / Platinum lines seem to
support it, so the market for this instruction set is quite limited.

~~~
jsheard
FWIW, we're only one generation away from AVX512 on consumer CPUs; Intel's
upcoming Cannon Lake architecture will support it.

[https://www.anandtech.com/show/11928/intels-document-points-to-avx512-support-by-consumer-cannon-lake-cpus](https://www.anandtech.com/show/11928/intels-document-points-to-avx512-support-by-consumer-cannon-lake-cpus)

~~~
dragontamer
Well, one generation away from consumers being able to buy the chip. And maybe
5 years away from a sizable number of consumers upgrading to that chip (or
newer)... since the typical desktop is at LEAST 5 years old in my
experience...

The users who really need the feature are likely upgrading to AVX-512
computers already (i.e. the Mac Pro), so adoption is not as bad as my
hyperbole above suggests. But it's still going to be a while before we can
assume AVX512 support on machines.

Hell, with so many people still running Sandy Bridge (the i7-2xxx series), you
can't even assume AVX2 support today.
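
The usual workaround today is runtime dispatch: check CPUID once and pick a
codepath. GCC and Clang expose this as __builtin_cpu_supports (x86 targets
only; the feature strings are real, the function around them is just my
illustration).

    /* Pick a codepath at runtime instead of assuming an ISA baseline.
       __builtin_cpu_supports queries CPUID (GCC/Clang, x86 only). */
    const char *pick_simd_path(void)
    {
        if (__builtin_cpu_supports("avx512f"))
            return "avx512";
        if (__builtin_cpu_supports("avx2"))
            return "avx2";
        return "scalar";
    }
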

~~~
tambre
Couldn't you use something like GCC function multi-versioning?
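
For what it's worth, a minimal sketch of what that looks like with GCC's
target_clones attribute (requires GCC or recent Clang on x86 Linux with
glibc's ifunc support; the dot-product function is just an example):

    /* GCC compiles one clone of this function per listed target and
       picks the best one at load time via an ifunc resolver, so
       callers need no dispatch code of their own. */
    __attribute__((target_clones("avx512f", "avx2", "default")))
    float dot(const float *a, const float *b, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

Whichever clone the loader picks, the result is the same; only the
instructions used differ.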

~~~
dragontamer
AVX512 has so many more features above and beyond Intel's typical SIMD
implementations. Feature-wise, it's beginning to be competitive with NVidia's
PTX/CUDA architecture. Like, AVX512 is a really, really good instruction set
(or I guess: a really good set of instruction sets).

Assuming AVX512 F, CD, VL, DQ, and BW (the expected AVX512 instructions in
CannonLake):

* AVX512F -- "Standard" 512-bit arithmetic already has major improvements, above and beyond the 256-bit -> 512-bit upgrade. AVX512 has 32 vector registers per core (AVX2 and earlier have only 16). The new set of opmask instructions also allows way more code to be turned into "branch-free" code, which is friendly to pipelines. This alone is a major step forward, with huge implications for multimedia code.

* AVX512-CD: Conflict Detection. These instructions allow auto-vectorizers to "resolve loop conflicts" and auto-vectorize more code.

* BW -- Extend AVX512 to operate on bytes and words (8- and 16-bit elements).

* DQ -- Add further doubleword and quadword (32- and 64-bit) instructions.

* VL -- Allow AVX512 instructions to operate on only 256-bit and 128-bit registers at a time.
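
To illustrate the opmask point in plain C (a sketch of the pattern, not
actual AVX-512 code): an opmask register holds one predicate bit per lane,
turning a data-dependent branch into a per-lane select.

    /* Branchy form: one unpredictable branch per element. */
    void clamp_neg_branchy(float *x, int n)
    {
        for (int i = 0; i < n; i++)
            if (x[i] < 0.0f)
                x[i] = 0.0f;
    }

    /* Predicated form: compute a mask, then select. This is what one
       masked AVX-512 instruction does for 16 floats at a time, with
       the per-lane bits held in an opmask register (k1-k7). */
    void clamp_neg_masked(float *x, int n)
    {
        for (int i = 0; i < n; i++) {
            int keep = (x[i] >= 0.0f);   /* the "mask" bit for this lane */
            x[i] = keep ? x[i] : 0.0f;   /* select, not branch */
        }
    }
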

\--------------------

I'm certain that some code which could not be vectorized with AVX2 (or lower)
will be vectorizable with AVX512. Maybe even automatically, as compiler
writers implement these features in their auto-vectorizers.
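
The classic loop conflict, for the curious (scalar C; this example is mine,
not the article's, and the vectorization itself is the compiler's job): a
histogram update, which AVX2 can't safely vectorize because two lanes of a
gathered vector may hit the same bucket. AVX512-CD's conflict-detection
instruction (vpconflictd) finds equal indices within a vector so only the
clashing lanes get serialized.

    /* hist[idx[i]]++ has a potential read-modify-write conflict
       between any two iterations with equal indices, which blocks
       naive SIMD. AVX512-CD lets a compiler vectorize this anyway
       by detecting the duplicates within each vector. */
    void histogram(unsigned *hist, const unsigned *idx, int n)
    {
        for (int i = 0; i < n; i++)
            hist[idx[i]]++;
    }
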

~~~
jabl
I wonder why they went with fixed 128/256-bit operation instead of just
defining a vector length register like other vector ISAs (which would have
allowed getting rid of the remainder loop, leading to less code bloat and more
efficient execution for short loops where the iteration count is not an
integer multiple of the ISA vector length).
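
For readers unfamiliar with the remainder-loop complaint, a sketch in plain C
(my illustration; the inner loop stands in for one full-width vector
operation, and W=16 assumes 32-bit elements in a 512-bit register): a
fixed-width ISA forces a leftover scalar loop, where a vector length register
would just run the last iteration shorter.

    #define W 16  /* lanes of 32-bit elements in a 512-bit register */

    void add_one(int *a, int n)
    {
        int i = 0;
        /* Main loop: whole vectors only. Each inner loop stands in
           for a single full-width SIMD operation. */
        for (; i + W <= n; i += W)
            for (int j = 0; j < W; j++)
                a[i + j] += 1;
        /* Remainder loop: the last n % W elements, done scalar. This
           is the duplicated code a vector length register avoids. */
        for (; i < n; i++)
            a[i] += 1;
    }
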

------
minxomat

        document.querySelector('#k2Container').style.color = 'black';
    

and the blog post becomes almost readable.

Other than that, nice intro.

~~~
kierank
Yeah, sorry about that, that site's getting replaced soon.

~~~
tambre
Might want to also think about SSL. Certificates are free from Let's Encrypt.

That said, props for the IPv6 support!

------
pkaye
I converted some Go code to AVX in my last project. In isolation that code ran
2-4x faster, but as part of the full program, the overall runtime was 5%
slower. I could never make sense of it. Any thoughts on how to determine the
cause?

~~~
mscrivo
AVX code is known to make the CPU run way hotter than usual. Perhaps that
caused throttling that made general code running at the same time, or within a
short span thereafter, perform worse?

~~~
pkaye
That is one theory I had, but I'm not sure how to determine whether the CPU is
throttling (on Ubuntu Linux).

~~~
thecompilr
You can either use lscpu, which is less accurate, or (the best way) check:

    cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq

where instead of cpu0 you can write any core number; it will give you the
current frequency of that core in kHz.

~~~
pkaye
I'll try that out. Thanks.

~~~
lathiat
Note that with the intel_pstate driver, /proc/cpuinfo is not reflective of the
actual speed (not what was suggested here, but in the past the MHz field used
to change to reflect the scaling speed). You could also look at 'powertop'.

------
gok
Doesn't mention what I find the coolest part of AVX-512: the conflict
detection instructions. Finally a way to vectorize loops with indirect loads!

------
ninegunpi
for slowing down awkward code?

