
Ice Lake AVX-512 Downclocking - ingve
https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html
======
tarlinian
Hopefully this behavior change will help improve AVX-512 uptake and end the
somewhat ridiculous misconception that the instructions are entirely useless.
Intel's Hot Chips presentation on Icelake-SP also indicated that the behavior
will be significantly better on server chips, though it remains more
instruction dependent there: there are 3 license levels, and only 512-bit
instructions that utilize the FMA unit are subject to downclocking (by
~15-17%, as opposed to ~27-29% on SKX-derived chips).

(Intel Hot Chips slide:
[https://images.anandtech.com/doci/15984/202008171757161.jpg](https://images.anandtech.com/doci/15984/202008171757161.jpg))

~~~
CoolGuySteve
I don't think the book is closed. Thermal and TDP downclocking are still
present.

It would have been nice to see the vcore and thermal values graphed as part of
the benchmark. Do they increase faster for AVX-512 than for the other
instruction sets?

I've had problems in the past with Sandy Bridge, where AVX hit thermal
throttling before SSE did. I ended up having to disable AVX in my build
because of it. Presumably the same behaviour would be seen here, now that the
vector unit is wider and there are more densely packed transistors flipping.

~~~
wmf
Doing work faster is almost always going to consume more power, and if you're
already at the power or temperature limit (which is how most CPUs/GPUs operate
now), then the frequency will have to drop. This isn't automatically bad;
ultimately what matters is performance. Did you really see lower performance
with AVX than with SSE?

~~~
CoolGuySteve
The problem is that the downclocking affects other cores. So a performance
improvement in this one task can hurt performance on other threads, which is
what happened to me.

~~~
BeeOnRope
At least on most new cores, the frequency is per-core. This isn't true on, for
example, some Skylake client cores - but those don't have much SIMD-related
downclocking either.

~~~
makomk
The license-based downclocking is per-core on most new chips, but according to
the linked blog post that doesn't really exist on Ice Lake anyway. Pretty sure
thermal and TDP-based downclocking are across the entire package (all cores
and the GPU), which in some sense might make them more objectionable than
license-based downclocking.

------
jojobas
Does the word "license" actually mean "even though the hardware is capable,
you didn't pay enough to not downclock" or something else here?

~~~
SloopJon
The three links in the third paragraph of the article describe the meaning of
license in this context:

[https://stackoverflow.com/a/56861355](https://stackoverflow.com/a/56861355)

[https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html](https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html)

[https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/)

Excerpted from the Stack Overflow answer:

There are three frequency levels, so-called _licenses_, from fastest to
slowest: L0, L1 and L2. L0 is the "nominal" speed you'll see written on the
box: when the chip says "3.5 GHz turbo", they are referring to the single-core
L0 turbo. L1 is a lower speed sometimes called _AVX turbo_ or _AVX2 turbo_,
originally associated with AVX and AVX2 instructions. L2 is a lower speed than
L1, sometimes called "AVX-512 turbo".
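
To make the classes concrete, here is a minimal C-intrinsics sketch of the
kind of code that has historically landed in each license on SKX-era chips
(illustrative only, not a spec; the exact mapping is chip-specific and depends
on how densely the "heavy" instructions are issued, and the function names are
just made up for the example):

    #include <immintrin.h>

    /* L0 (nominal turbo): scalar and 128-bit work. */
    float scale_scalar(float x) { return x * 1.5f; }

    /* L1 ("AVX/AVX2 turbo"): sustained "heavy" 256-bit ops such as FMA. */
    __m256 fma256(__m256 a, __m256 b, __m256 c) {
        return _mm256_fmadd_ps(a, b, c);
    }

    /* L2 ("AVX-512 turbo"): sustained "heavy" 512-bit ops such as FMA. */
    __m512 fma512(__m512 a, __m512 b, __m512 c) {
        return _mm512_fmadd_ps(a, b, c);
    }

(Compile with something like -mfma -mavx512f.)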

~~~
jojobas
Clear as mud. If it means "we can't make these instructions work at nominal
clock" rather than "we won't let you", why can server chips do it?

~~~
jrockway
The server chips tend to be sold with a lower base clock than the consumer
equivalents. If you underclock your consumer CPU to whatever the server
equivalent currently is, you aren't going to see much throttling under any
workloads. But you're also going to be the guy that has a 2.2GHz processor
when all your friends claim to have a 5GHz processor.

In my experience, the limiting factor tends to be power delivery. I had a
6950X that I overclocked, and I could get it to consistently hard crash just by
starting an AVX task. My filesystems did not appreciate that! (I eventually
spent a ton of time debugging why that happened, and it was just that the
power supply couldn't keep the 12V rail at 12V. Turns out that Intel knows
what sort of equipment is out on the market, and designed their chips
accordingly. I did upgrade the power supply (1200W Corsair -> 850W Seasonic
Prime) and got stable AVX without downclocking. But the whole experience
killed overclocking for me. It just isn't worth the _certainty_ that your
computer will crash for no good reason at random times.)

------
BooneJS
At Hot Chips 32 this week, Intel mentioned that Tiger Lake Xeon with Sunny
Cove cores would only downclock if AVX-512 usage hit TDP limits.

~~~
williadc
TigerLake client uses Willow Cove cores. I would expect IceLake Xeon to use
the Sunny Cove cores (same as IceLake client) and TigerLake Xeon to use Willow
Cove, but I haven't seen evidence that this is the case.

~~~
ezekiel68
Well, other than precedent. We have never seen Intel suddenly surprise the
market by using a different core type in a chip family's Xeon SKUs than was
used in the mobile and desktop SKUs. And part of that (I conjecture, having
been in the rodeo with them since the 486 days) is that they like to use the
consumer versions of the chips to suss out errata to be fixed before the Xeons
are taped out and released. It is true that some features do differ, such as
support for ECC RAM and perhaps activation of TSX or an extra FMA unit, but
these are really just variations on the same core family.

------
jiggawatts
One thing I really want to know is whether SQL Server's new vector-accelerated
"Batch Mode" uses only AVX2, or whether it also has AVX-512 code paths.

I'd like to be able to recommend the right CPU to customers, but there just
isn't any information out in public about this...

~~~
foota
This is a terrible suggestion, but you could always attach a debugger to find
out.

~~~
fluffything
Or just disassemble the binary and grep for AVX-512 or AVX2 instructions?

Or, if SQL Server is open source, just go to their github/gitlab/whatever
repo, and enter the name of a commonly-used AVX-512 intrinsic in the search
field?
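
For the disassemble-and-grep route, a minimal C sketch (it assumes you already
have a text disassembly, e.g. from objdump or dumpbin /disasm; counting
zmm-register references is only a rough proxy for AVX-512, since AVX-512VL
instructions can also use xmm/ymm registers):

    /* Pipe a disassembly through this, e.g. (binary name is just a guess):
     *   objdump -d sqlservr.exe | ./scan_simd
     */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char line[1024];
        long zmm = 0, ymm = 0;
        while (fgets(line, sizeof line, stdin)) {
            if (strstr(line, "zmm")) zmm++;       /* 512-bit regs -> AVX-512  */
            else if (strstr(line, "ymm")) ymm++;  /* 256-bit regs -> AVX/AVX2 */
        }
        printf("lines touching zmm: %ld, ymm: %ld\n", zmm, ymm);
        return 0;
    }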

~~~
foota
I figured it would be easier to find the specific area by debugging than by
disassembling, since the binary is likely easily tens to hundreds of megs.

Also, SQL Server is definitely not open source lmao.

------
paulmd
Given the process improvements in Tiger Lake, I wonder if this improves
further, or at least whether all levels become somewhat faster.

------
throwaway_pdp09
I don't know this area, but to clarify (or mess things up, depending on my
understanding), the downclocking is for FP-heavy work. AFAIU it doesn't occur
for any integer work (maybe an exception for div?), and 512-wide SIMD integer
instructions could be _very, very_ useful.

Pls correct me if wrong.
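
For illustration, the kind of 512-bit integer kernel I mean (a hedged sketch,
not something I've benchmarked; whether it runs at full clock depends on the
chip):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sum a uint32 array 16 lanes at a time. Integer adds like this count as
     * "light" instructions: per the article they don't trigger license-based
     * downclocking on Ice Lake client at all, and on SKX they need at most
     * the middle (L1) license. */
    uint32_t sum_u32(const uint32_t *p, size_t n) {
        __m512i acc = _mm512_setzero_si512();
        size_t i = 0;
        for (; i + 16 <= n; i += 16)
            acc = _mm512_add_epi32(acc, _mm512_loadu_si512(p + i));
        uint32_t total = (uint32_t)_mm512_reduce_add_epi32(acc);
        for (; i < n; i++)
            total += p[i];
        return total;
    }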

------
rbanffy
Did the Xeon Phi also downclock when using AVX-512?

~~~
th3typh00n
I haven't seen any numbers on that but there's literally zero reason to run a
Xeon Phi without using AVX-512, so I'd assume no design considerations were
taken to optimize the clock frequency for a non-AVX-512 use case.

~~~
shaklee3
There's almost zero reason to run a Phi in general.

~~~
dragontamer
The Phi was an interesting computer. AVX-512 on 60 cores back in 2015 was
pretty nuts. CUDA wasn't quite as good as it is today (there have been HUGE
advancements in CUDA recently).

These days, we have a full-fat EPYC or Threadripper to use, and even those
only have 256-bit vector units. CUDA is also way better, and NVidia has
advanced dramatically, proving that CUDA is easier to code than people once
thought. (Back in 2015, it was still "common knowledge" that CUDA was too hard
for normal programmers.)

Intel's Xeon Phi was a normal CPU. It could run normal Linux, and it scaled
just like a GPU (each PCIe x16 card added another 60 Xeon Phi cores to your
box).

It was a commercial failure, but I wouldn't say it was worthless. NVidia just
ended up making a superior product, by making CUDA easier and easier to use.

~~~
shaklee3
I was using CUDA heavily in 2015, and I also looked at the first/second gen of
the Xeon Phi at the time. I thought it was much harder to program for than
CUDA was back then (and certainly that gap has widened). I recall things like
a weird ring topology between the cores that you may have had to pay attention
to, the memory hierarchies (you deal with this in CUDA too, but I remember the
Phi's being NUMA-like), and transfers to and from the host CPU that were
harder/synchronous compared to CUDA.

It was definitely a really cool hardware architecture, but the software
ecosystem just wasn't there.

~~~
dragontamer
Xeon Phi was supposed to be easy to program for, because it ran Linux (albeit
an embedded version, but it was straight-up Linux).

Turns out, performance-critical code is hard to write whether or not you have
Linux. And I'm not entirely sure how Linux made things easier at all. I guess
it's cool that you could run GDB, had filesystems, and all that stuff, but was
that really needed?

---------

CUDA shows that you can just run bare-metal code and have the host manage a
huge amount of the issues (even cudaMalloc is globally synchronized and "dumb"
as a doornail: probably host-managed, if I were to guess).

~~~
shaklee3
That's right -- I always wished they made a Phi with PCIe connections out to
other peripherals. Imagine a Phi host that could connect to a GPU and offload
the things the GPU was better at.

~~~
dragontamer
Well... they did. That's basically called a Xeon 8180. :-)

Or alternatively, an AMD EPYC (64 cores / 128 PCIe lanes).

~~~
shaklee3
Now I'm remembering... They had the Phi as a coprocessor in a PCIe slot,
effectively giving it the same issues as a GPU. But the second gen (Knights
Landing) made the Phi the host processor while removing almost all ability to
attach external devices. It had potential, I think, but it was a weird
transition from v1 to v2.

~~~
dragontamer
I actually found the Coprocessor more interesting.

Yeah, NVidia CUDA makes a better coprocessor for deep learning and matrix
multiplication. But a CPU-based coprocessor for adding extra cores to a system
seems like it'd be better for some class of problems.

SIMD compute is great and all, but I kind of prefer to see different solutions
in the computer world. I guess the classic 8-socket setup with Xeon 8180s is
more straightforward (though expensive).

--------

A Xeon Phi on its own motherboard is just competing with regular ol' Xeons.
Granted, at a cheaper price... but it's too similar to normal CPUs.

Xeon Phi was probably trying to do too many unique things. It used HMC memory
instead of GDDR5X or HBM (or DDR4). It was a CPU in a GPU form factor. It was
a GPU (ish) running its own OS. It was just really weird. I keep looking at
the thing in theory and wondering what problem it'd be best at solving. All
sorts of weird decisions; nothing else was ever really built like it.

~~~
shaklee3
Agreed! That's why I was bummed when the second gen was a host system. It
didn't fit my use case well.

