
AVX-512: when and how to use these new instructions - ingve
https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/
======
imh
I read a post yesterday on a generalized notion of compositionality [0]. It
was neat and extolled the virtues of modularity and compositionality and being
able to reason about a system by reasoning about its parts.

If I'm understanding OP, this means that to use the AVX-512 instructions well,
a compiler has to think about instruction speed as a function of what
other instructions are around it. It might be faster to write operation X with
these instructions than without, but only if you don't also write operation Y
with them, because then the CPU would get too hot.

That sounds so much harder! Hot damn! I know CPUs are complicated and 1
instruction = 1 cycle is wrong in many ways, but this just sounds especially
difficult.

[0]
[https://news.ycombinator.com/item?id=17923075](https://news.ycombinator.com/item?id=17923075)

~~~
Twirrim
This situation gets absolutely awful when you consider that the Bronze and
Silver Xeons do even more aggressive down-throttling. Bronze speed plummets if
even one core is doing AVX instructions.

Compilers can't realistically hope to handle this. JITs at least have a
chance, but adding in handling for this behaviour surely requires a lot more
complexity than most runtime developers would want to add to their code.

~~~
BeeOnRope
Since Bronze and Silver largely have only one AVX-512 FP unit, running AVX-512
in the L2 license is almost totally pointless: you'd often be better off
running twice as many AVX/AVX2 instructions on the two 256-bit units, since you
run at a higher frequency and the FLOP/cycle is the same.

The exception would be if your kernel can make some good use of other wide
instructions such as memory access or shuffles.

------
yalok
That explains what I observed while optimizing convolution functions in
OpenBLAS, adding support for AVX-512 while AVX2 was already there. On a 36-core
Xeon Platinum there was no measurable gain when moving to AVX-512, maybe even a
bit slower, when running intensive convolutions full of fused multiply-adds
(FMA). I was puzzled and thought it was due to hitting the memory bandwidth
limit (but prefetching more didn’t help), or due to the pipeline - but again,
unrolling the loop and reshuffling AVX-512 instructions in between “lighter”
instructions didn’t seem to help either. I should have monitored the CPU clock
changes. But I had no idea there was this kind of dependency. I now wonder
what the average ratio of heavy 512-bit AVX-512 instructions vs lighter
256-bit instructions should be to avoid getting into permanent L2 mode and
stay at L1. Maybe interleaving one 512-bit loop unroll with several 256-bit
unrolls may still yield some gains in heavy AVX usage... Thank you for the
article!

~~~
nkurz
_I now wonder what the average ratio of heavy 512-bit AVX-512 instructions vs
lighter 256-bit instructions should be to avoid getting into permanent L2 mode
and stay at L1._

One of the co-authors talks more about the exact limits here:
[https://www.realworldtech.com/forum/?threadid=179654&curpost...](https://www.realworldtech.com/forum/?threadid=179654&curpostid=179654)

The quick answer is that on the W-2104 system he tested, you can sustain one
FMA every 2 cycles while still remaining in the medium speed L1 state.

His avx-turbo tool
([https://github.com/travisdowns/avx-turbo](https://github.com/travisdowns/avx-turbo))
can be used to check the situation for your particular processor.

------
jblow
These instructions will be around for a long time, but their performance
attributes will change in 5 minutes when Intel releases the next wave of
processors.

I think given the current state of things it would be irresponsible for
compilers to generate heavy instructions unless asked. Forget trying to be
smart about it ... we already fail to be smart about things that are much
simpler and more visible.

More interestingly, this may be what all CPU behavior looks like in 10 years,
because if Intel has to resort to this kind of design now, why would that
change any time soon? Instead of worrying primarily about keeping the
execution units full, people trying to write fast code may be primarily
concerned with keeping them NOT full so that the chip doesn’t slow down. Which
sounds crazy and hard to deal with.

~~~
VHRanger
Fwiw, autovectorizers tend to be pretty conservative about it.

So to use cutting-edge instructions you generally have to hand-code them
(either in intrinsics, like Lemire did there, or in asm)

~~~
BeeOnRope
Especially for FP instructions, where compilers are heavily restricted by the
standard and IEEE 754 semantics. Vectorization often changes the order of
operations, and since FP math is generally not associative, the compiler can't
prove the results will be identical.

For integer operations, auto-vectorization is more prevalent since everything
can be reordered more freely. clang especially auto-vectorizes a ton of stuff
even at -O2.

~~~
int_19h
But that's precisely why most language standards don't require strict IEEE754
semantics. And why people mostly compile with -ffast-math or equivalent when
they do.

------
bcaa7f3a8bbc
A modern CPU running an instruction set specifically designed for high-
performance computing, working under proper operating conditions, and with
adequate cooling, can still overheat and trigger its internal throttling. And
since Turbo Boost, throttling is explicitly used not only as a power-saving or
protective measure, but as part of normal operation.

So we are now at the closest point to the CPU power wall in history ...

------
crankylinuxuser
I'm no CPU engineer, but this smells like some sort of chip-level macro
function that cuts edges even more than the spectre/meltdown issue.

I remember joke opcodes back in the day. One was "Halt and catch fire". Is
that seriously what this is doing?

~~~
kazinator
> _cuts edges even more than the spectre/meltdown issue._

Indeed. If you have some piece of code in a different security context that
conditionally executes a heavy instruction based on a decision made over some
sensitive data, doesn't this provide a way to obtain information about that
data?

~~~
scottlamb
> If you have some piece of code in a different security context that
> conditionally executes a heavy instruction based on a decision made over
> some sensitive data, doesn't this provide a way to obtain information about
> that data?

Yes.
[http://www.numberworld.org/blogs/2018_6_16_avx_spectre/](http://www.numberworld.org/blogs/2018_6_16_avx_spectre/)

------
twtw
I'm still upset about the permanent-until-vzeroupper "transition" penalty on
non-VEX instructions on Skylake after any VEX instruction.

~~~
chrisseaton
Do you know anything about that? I know some ABIs have a vzeroupper at the end
of every method - what is this for, and do you know how expensive it is?
~~~
haberman
Looks to be a complicated issue, with very different performance profiles
across different microarchitectures:

[https://software.intel.com/en-us/forums/intel-isa-extensions/topic/704023](https://software.intel.com/en-us/forums/intel-isa-extensions/topic/704023)

------
kazinator
> _Intel cores can run in one of three modes: license 0 (L0) is the fastest
> (and is associated with the turbo frequencies “written on the box”), license
> 1 (L1) is slower and license 2 (L2) is the slowest._

Wow, let's give the word "license" semantics not related to software
licensing, _and_ cause confusion between L1 meaning "L1 cache" and "license
1".

~~~
the8472
AIUI this is because the power scheduler gives the core the license to
actually execute those instructions. Until it is granted, the core must
emulate them as multiple vector instructions of smaller stride size.

------
valarauca1
This post cites sources that actively contradict the points it attempts to
make. It provides a wealth of useful aggregated information against its own
points.

Let us begin.

    
    
        However, there are also deterministic frequency 
        reductions based specifically on which instructions you
        use and on how many cores are active (downclocking).
    

As you are likely using AVX-512 in a cloud deployment, you don't have access
to any of that information, and you are likely sharing the hardware with
tenants who may not apply the same engineering rigor.

Also, nobody is setting their CPU affinity to ensure the down-clocking stays
on a single core. You have to pray the scheduler doesn't shuffle your workload
around. This requires a lot of platform-specific C, and most people are
writing Java/Go/Ruby/Python, where doing bit-twiddling NUMA management is
impossible. Furthermore, the information you have access to in a cloud
environment (which is where you'll be using advanced AVX-512 unless you work
for Amazon, Google, Intel, or Cloudflare) may just lie about core count and
NUMA architecture.

Also, this is just false [1]. Running AVX-512 adjusts the base clock for the
package. There are throttling mechanisms that ensure every other core on the
package throttles back. This is a package-wide effect, not a per-core effect.

    
    
        Light instructions include integer operations other than
        multiplication, logical operations, data shuffling
        (such as vpermw and vpermd) and so forth.
    

This is false according to Cloudflare [2], which you've linked. They tested
your "light" additions, shifts, and XORs (the only operations in ChaCha20
[4]). It cost too much.

    
    
        We have chosen to only include two columns.
    

I'll include the whole thing [3]. Wow, yeah, the entire package's power curve
is changing. The base clock and the clocks of cores that aren't even running
AVX-512 are all changing. It's almost like having 1 out of 24 cores active
still affects all 24 cores....

    
    
        For example, the openssl project used heavy AVX-512
        instructions to bring down the cost of a particular
        hashing algorithm (poly1305) from 0.51 cycles per byte
        (when using 256-bit AVX instructions) to 0.35 cycles
        per byte, a 30% gain on a per-cycle basis. They have
        since disabled this optimization.
    

The literal example meant to show AVX-512 is good ends with a statement that
the people using AVX-512 are now actively avoiding it.

This is less than content; do you have an agenda, or are you just an idiot?

[1]
[https://en.wikichip.org/wiki/intel/xeon_silver/4116](https://en.wikichip.org/wiki/intel/xeon_silver/4116)

[2] [https://blog.cloudflare.com/on-the-dangers-of-intels-
frequen...](https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-
scaling/)

[3]
[https://en.wikichip.org/wiki/intel/xeon_gold/5120](https://en.wikichip.org/wiki/intel/xeon_gold/5120)

[4]
[https://en.wikipedia.org/wiki/Salsa20](https://en.wikipedia.org/wiki/Salsa20)

~~~
dang
Please don't break the HN guidelines by becoming uncivil.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

~~~
gjs278
please stop blinding me by making downvoted comments one shade away from the
background color

~~~
Rychard
This is a reasonable criticism.

I'm not a fan of the delivery, but I absolutely agree with you.

~~~
CamperBob2
Agreed as well. It's annoying because they think they're de-emphasizing
downvoted comments by fading them out, when they're actually calling _more_
attention to them by requiring the reader to invest more effort to read them.

It's reminiscent of the infamous "disemvoweling" strategy used on a few other
forums, where the reader is forced to decide whether they want to
painstakingly reconstruct offensive and abusive comments or blindly trust
someone else to restrict what they see.

Life would be so much easier if they just displayed the comment score like
most other moderated forums and let the reader decide the merits of the
comments based on visible information.

