
Gathering Intel on Intel AVX-512 Transitions - matt_d
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
======
BeeOnRope
Author here, happy for any feedback or to answer any questions.

~~~
ksec
Just want to say Thank You. Do you know anything about AMD's side of
things?

>Note: For the really short version, you can skip to the summary, but _then
what will you do for the rest of the day?_

Spending the rest of the day on HN. /s

~~~
BeeOnRope
I don't know specifically, e.g. if there are any such pauses on Zen. Also, Zen
doesn't yet support AVX-512 so a big possible source of variation is moot.

I don't know if _any_ AMD chip has ever had different turbo speeds for any
ISA. It should be noted that even without that, any chip can still run slower
with heavier instructions because they hit some other limit: thermal, TDP,
current, etc.

AMD has used an interesting "adaptive clocking" scheme since Steamroller, and
apparently this is still in effect in Zen:

[https://www.realworldtech.com/steamroller-clocking/](https://www.realworldtech.com/steamroller-clocking/)

This handles the same type of voltage droop worst case that Intel apparently
handles by dispatch throttling. It would be interesting to test it, since the
clock elongation should be visible when you measure instruction timing
relative to a clock not affected by the adaptation.

~~~
celrod
On the desktop chips (x299) it's easy to adjust all the clock speeds in the
bios.

If the workloads I'm most interested in are all avx512-heavy (why I bought
x299 instead of threadripper), do you think there'd be a reason to set the
clock speeds to be equal, regardless of ISA? That is, if I currently have
4.6/4.3/4.1 GHz no-avx/avx(2)/avx512, when might it be worth setting all three
of these to 4.1 GHz?

I suspect "never" is the answer?

I have the impression that Zen's clocking algorithm is much smarter than
Intel's heuristic approach.

~~~
BeeOnRope
I don't think it makes sense, because the maximum penalty due to the
transitions is fairly low (~30 us out of 650 us, and only under pretty much a
malicious load that transitions at exactly the right points), and mostly you
want the higher frequencies when you can get them: they quickly overwhelm the
small transition periods.

Also, someone indicated to me in private correspondence that even when the
frequencies are manually set so no transition takes place, the throttling
periods may still take place (which makes sense since the required voltage may
still be higher).

------
oddity
It’s a real shame avx-512 has so many eccentricities when it’s a much nicer
ISA than anything before it (in x86 land). I would almost prefer a more
predictable, high-latency decomposition into 4x128 wide uops over what we have
now.

~~~
BeeOnRope
If I could choose, I would like everything to run at the max turbo frequency
all the time, yeah.

Still, and despite writing this post which will make a lot of people express
something similar to what you wrote, I consider myself an AVX-512 fan, not the
other way around. It's the most important ISA extension since, well, I'm not
sure: a long time (probably AVX and AVX2 combined would have a similar
impact).

It introduces a whole ton of stuff that is very powerful: full-width shuffles
down to byte granularity with awesome performance, masking of every operation
(often free), compress and expand operations, and a longer list at [1]. That's
only from an integer angle, too (what I care about).

Yeah, it's taken AVX-512 a while to get traction (the fact that generation
after generation of new chips has just been Skylake client derivatives with
no AVX-512 hasn't helped), but I hope we are reaching a turning point.

These transitions are something you have to deal with if you want max
performance, and I think we'll come up with better models for how to make the
"global" decision of whether you should be using AVX-512.

---

[1] [https://branchfree.org/2019/05/29/why-ice-lake-is-important-...](https://branchfree.org/2019/05/29/why-ice-lake-is-important-a-bit-bashers-perspective/)

~~~
ComputerGuru
The never-ending Skylake is/was a real problem. Intel had been adding
features steadily enough that it made sense to target the last n generations,
but then all of that ground to a halt, and suddenly we have this new extension
that you can only really use on the very latest and most expensive parts, with
virtually no backwards compatibility.

The instructions are sufficiently different from AVX2 that any appropriate use
is not as simple as sticking it behind a gate and using a smaller block size,
it basically requires a completely separate (re)write to properly take
advantage of.

~~~
BeeOnRope
> The instructions are sufficiently different from AVX2 that any appropriate
> use is not as simple as sticking it behind a gate and using a smaller block
> size, it basically requires a completely separate (re)write to properly take
> advantage of.

I'd say yeah, you often need a rewrite of the core loop to take full
advantage, but you can still more or less write AVX-style code in AVX-512 if
you want, and take advantage of the width increase.

The main difference I think for most code is the way the comparison operators
compare into a mask register. It would have been nice if they had just
extended the existing compare into SIMD reg (0/-1 result) instructions too, to
ease porting.

------
haecceity
What is this intended to be used for? This [1] article mentions compression,
ML, scientific computing. Wouldn't people rather use GPU for those workloads
though?

[1] [https://devblogs.microsoft.com/cppblog/microsoft-visual-stud...](https://devblogs.microsoft.com/cppblog/microsoft-visual-studio-2017-supports-intel-avx-512/)

~~~
yvdriess
Offloading to the GPU has a significant cost that needs to be amortized. DNNs
work well because very little data moves over the PCIe bottleneck, and inputs
can be buffered. Take modern compression, say H.265: it combines complex
control flow with highly vectorizable work. I'm unsure where the threshold
lies today, but you need a significant amount of work before offloading to and
reading back from the GPU becomes interesting on a beefy Xeon.

------
Flicki
Nice work as usual BeeOnRope

Now if only I could actually use avx512 in a desktop, been waiting what feels
like 5+ years..

------
zone411
The latest Microsoft Visual C++ has an option to generate AVX-512 code
[https://docs.microsoft.com/en-us/cpp/build/reference/arch-x6...](https://docs.microsoft.com/en-us/cpp/build/reference/arch-x64?view=vs-2019)

------
shaklee3
This site seems to lock up my phone, but desktop is fine. Anyone else have
that issue?

~~~
BeeOnRope
There are several large SVGs, I wonder if that is the issue.

Can I ask what type of phone you have? Are you willing to help me diagnose the
issue?

~~~
shaklee3
I have a really old Nexus 6p. I'm definitely willing to help, but I bet it's
just part of having a really old phone.

~~~
BeeOnRope
Well, that happens to be a phone I have too, although the battery is pretty
dead so it's hard to use. It's still a fairly powerful phone though, so it's
weird that it would have issues rendering the page.

I was able to reproduce freezing and hanging even on my Pixel 3, so I can
probably look into it myself. Again, my guess is that the large SVGs are to
blame.

~~~
shaklee3
Awesome. Thanks!

------
stolk
Would disabling “Intel Turbo Boost Technology” be advised to avoid this?

If it never uses turbo, it should not suffer the transitions?

~~~
BeeOnRope
No, the transitions occur even without turbo. In fact, the chip I tested on
has no turbo at all, just 3.2 GHz nominal speed.

Also, disabling turbo would probably be a massive over-reaction unless you
really care about 99.9th-percentile latency or something: the impact of these
transitions is small (at worst a few %), while the benefit of turbo is large:
tens of %.

