
Linus Torvalds on AVX512 - ykm
http://www.phoronix.com/scan.php?page=news_item&px=Linus-Torvalds-On-AVX-512
======
robocat
The AVX512 instructions can cause strange global performance downgrades.

“One challenge with AVX-512 is that it can actually _slow down_ your code.
It's so power hungry that if you're using it on more than one core it almost
immediately incurs significant throttling. Now, if everything you're doing is
512 bits at a time, you're still winning. But if you're interleaving scalar
and vector arithmetic, the drop in clock speeds could slow down the scalar
code quite substantially.” - 3JPLW and
[https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/](https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/)

The processor does not immediately downclock when encountering heavy AVX512
instructions: it will first execute these instructions with reduced
performance (say 4x slower) and only when there are many of them will the
processor change its frequency. Light 512-bit instructions will move the core
to a slightly lower clock.

* Downclocking is per core and for a short time after you have used particular instructions (e.g., ~2ms).

* The downclocking of a core is based on: the current license level of that core, and also the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).

As per [https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/)

~~~
MaxBarraclough
> The AVX512 instructions can cause strange global performance downgrades.

Can other SIMD instructions (AVX2, say) do the same?

~~~
th3typh00n
> Can other SIMD instructions (AVX2, say) do the same?

On Intel CPUs, yes. There's even a BIOS/UEFI setting, the "AVX offset", that
specifies how far the clock frequency should drop when running AVX code. AMD
CPUs don't do that though, as far as I know.

The thermal hit of using wider vectors decreases with every node shrink
though, so expect the issue to become muted over time (which also explains why
it doesn't apply to AMD - their only µarch with 256-bit execution units, Zen
2, is on a better node than Intel's).

~~~
dogma1138
AVX offset is only available on motherboards that support overclocking,
because of how much higher Intel CPUs can be pushed relative to their
advertised base and boost clocks.

Both Zen and Intel lower their clocks under load, especially AVX load. Keep in
mind that Zen 2 doesn't even reach its advertised boost clocks under any load;
some CPUs come close, to within 100 MHz or so, but overall they all clock down
rather fast once TMax or PMax is reached.

------
abainbridge
What are the forces in chip design that are at play here? Over the last 10-15
years, fabs have continued to fit more and more logic gates per unit area, but
haven't reduced the power consumption per gate as much. As a result, if you
fill your modern chip with compute gates, you cannot use them all at once
because the chip will melt. Or at least you can't have them all running at max
clock rates. One solution is to increase the proportion of the chip used for
SRAM, which uses less power per unit area than compute gates; this is what
Graphcore has done. Another is to put down multiple different compute blocks,
each designed for a different purpose, and only use a few of them at a time.
The big.LITTLE Arm designs in smartphones are an example of that. But I feel like
AVX512 might be an example too. When they add ML accelerator blocks next, they
also will not be able to be used flat out at the same time as the rest of the
cores' resources.

I'm sure Intel should fix the problems Linus is complaining about, but I feel
like chip vendors are being forced into this "add special purpose blocks"
approach, as the only way to make their new chips better than their old ones.

~~~
tails4e
Jim Keller had an interesting talk recently [1] about ways of doing parallel
processing to better use the billions of transistors we have - assuming the
task is parallelizable. There's the scalar core (i.e. the basic CPU), which is
relatively easy to program. Then a scalar core with vector instructions -
difficult to program efficiently. Then there are arrays of scalar cores, i.e.
GPUs, which are relatively easy to program again, and now a lot of startups
with arrays of scalar cores each with vector engines, expected to be the most
difficult to program. He didn't go into why vector instructions are hard to
use efficiently, and hard for compiler writers, but I'd be interested if
anyone here could explain that.

1\. [https://youtu.be/8eT1jaHmlx8](https://youtu.be/8eT1jaHmlx8)

~~~
zozbot234
Are GPUs really _easier_ to program than scalar w/ SIMD (or vector insns)?
The programming models you have to work with for GPGPU seem quite obscure,
whereas with CPU and SIMD, flipping a compiler switch gets you most of the way
there, and self-contained intrinsics do the rest.

~~~
TinkersW
GPU programming is easy enough; the complexity comes from the separate memory
system and the tedious (and not portable) API you need to use to access the
GPU.

I prefer intrinsics as they give more control than shader languages and they
can be written in C++ instead of fiddling with some garbage GPU API that runs
async.

~~~
pjmlp
MSL, CUDA and SYCL are C++ with extra toppings.

Also one of the reasons CUDA won developer love is that it fully embraced
polyglot programming on the GPU.

~~~
TinkersW
None of those are both portable and widely available on end user machines,
which is needed for games

CUDA seems nice, but being Nvidia only makes it a total dead end.

~~~
slavik81
Disclaimer: I work on AMD ROCm, but my opinions are my own.

There's also HIP[1], which can be used as a thin wrapper around CUDA, or with
the ROCm backend on AMD platforms. It doesn't yet match CUDA in either breadth
of features or maturity, but it's getting closer every day.

[1]: [https://github.com/ROCm-Developer-Tools/HIP](https://github.com/ROCm-Developer-Tools/HIP)

~~~
gnufx
As I understand it, that has to work for the CORAL 2 US "exascale" systems, so
people who've been proved fairly right so far obviously have some confidence
in it. (de Supinski of Livermore said he'd be out of a job if conventional
wisdom was right, though it was pretty obvious at the time that it wasn't.)
Free software too, praise be.

------
floatboth
I agree that there's too much focus on FP, but SIMD is not all about FP. Every
new SIMD ISA extension has something interesting for integer.

Here's an article about JITing x86 to AVX-512 to fuzz 16 VMs per thread:

[https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_e...](https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_emulation.html)

------
raverbashing
FP matters (especially with SIMD)

It matters to image/video/audio processing

It matters to simulations

It matters to 3D models/rendering

It matters to games

So it's not "just benchmarks", people actually want to do stuff with it

Sure, AVX512 might not be the greatest way of doing it; it might be better to
just make the existing instructions go faster. That might work.

~~~
qayxc
It's a matter of perspective.

Back in the day, CPUs didn't come with FPUs and the latter were optional co-
processors.

The idea in the x86-world always was to "outsource" special requirements to
dedicated hardware (FP co-processors, GPUs, sound cards, network cards,
hardware codec cards, etc.), instead of putting them on the CPU package (like
ARM-based SoCs).

So it's different philosophies entirely - tightly integrated SoCs vs versatile
and flexible component-based hardware.

It's The One Ring ([ARM-based] SoCs) vs freedom of choice and modularity (PC).
If I don't do simulations or 3d-modelling/rendering, I am free to choose a
cheap display adapter without powerful 3D-acceleration and choose a better
audio interface instead (e.g. for music production).

The SoC approach forces me to buy that fancy AI/ML-accelerator, various video
codecs, and powerful graphics hardware with my CPU regardless of my needs,
because the benevolent system provider (e.g. Apple) deems it fit for all...

Torvalds is just old-school in that he prefers freedom of choice and the
"traditional" PC over highly integrated SoCs.

~~~
raverbashing
FP coprocessors "only" existed because the processes weren't advanced enough
to have them inside the chip, but they were a natural extension (they were
married to the instruction set of the chip - it wasn't a product, it was a
feature)

In the old days there were minor competitors to the x87 family that died
quickly. (For reference:
[https://en.wikipedia.org/wiki/X87#Manufacturers](https://en.wikipedia.org/wiki/X87#Manufacturers)
)

For the rest yeah, it kinda makes sense to have them customizable.

------
rbanffy
I for one would be delighted by having more caches or wider backends instead
of AVX512, but I don't want SIMD to be pushed into GPUs. It'd be better to do
the reverse - to push forward the asymmetric core idea and move more GPU
functionality into lots of simpler cores tuned for SIMD at the cost of single
thread performance.

~~~
molticrystal
Here are some die shots of the mask registers:
[https://travisdowns.github.io/blog/2020/05/26/kreg2.html#the-die-shot](https://travisdowns.github.io/blog/2020/05/26/kreg2.html#the-die-shot)

It seems like they just keep that area mostly empty in processors without that
feature, at least for the processors related to the one pictured. I'm not
really sure how much effective cache could fit there without a major overhaul,
but a chip designer or enthusiast likely would be. This could be why Linus
focused on computational enhancement when he discussed transistor budget.

~~~
rbanffy
From a quick glance at the proportions, and considering that not only the
register files but also the vector EUs would be halved, I'd expect a 25%
increase in L3 or a 50% increase in L2. That, and some lessened thermal
constraints.

------
jasonzemos
The richness AVX-512 brings to x86 is like what C++ brings to C. Linus makes a
summary assessment of whether he can leverage these technologies to his
advantage, and if the cost of learning a technology and all its intricacies
outweighs the perceived advantage, that technology is garbage. This reaction
from Linus appears to fit his conservative pattern. I think where Linus gets
things wrong stems from his facts rather than his philosophy.

AVX-512's fantastic breadth is born out of an actual need to free compilers
from constraints imposed by programs in virtually every mainstream language,
all of which describe programs for an academic machine rooted in a scalar
instruction model. Without any further performance from increasing clock
speeds over time, the target has to become instructions-per-cycle and even
operations-per-instruction. The limitations on ILP, and the expense of
powering the circuitry to achieve it, have been well studied for the past two
decades; the failure to realize it is evident in the failure of NetBurst.
Linus believes that the frontends of CPUs have a lot more to give, perhaps
best exhibited by his refutation of CMOV
([https://yarchive.net/comp/linux/cmov.html](https://yarchive.net/comp/linux/cmov.html)).

Today's programming languages haven't evolved to make things easier on
programmers to describe non-scalar code. On the other hand, power constraints,
and now security constraints haven't made things easier for hardware to
efficiently execute scalar code. Perhaps AVX-512 is as naive a bet as Itanium,
if not it might be just the missing piece compilers need that they didn't have
twenty years ago.

------
RantyDave
Are Intel just delaying the inevitable? Is it safe to say (even today) that a
slow GPU will crunch big matrices faster than a fast CPU? And that's before we
get to price/performance. So all that's left is the bottleneck around PCIe
which, in theory, leaves the CPU with an advantage only for small datasets -
which we don't really care about anyway (because they happen quickly).

Maybe the tradeoff is somewhere interesting from a latency perspective - SDR
or similar. I dunno, am I barking up the wrong tree?

~~~
teruakohatu
In the article they quote Linus speculating that the increased core count of
CPUs will achieve the same thing as AVX512 without the problems. I have read
comments on HN that if core counts keep increasing, CPUs might be able to
replace GPUs for some tasks, as GPUs (or CUDA in particular) have quirks that
CPUs don't have.

AVX512 in particular has issues: using it slows down the CPU, so the actual
wall-clock benefit depends heavily on how it is used.

~~~
panpanna
For general-purpose computing, maybe. For gaming, GPUs contain special
operations for texture lookup and whatnot that would be very expensive on a
CPU.

------
fancyfredbot
AVX512 is both integer and floating point, not just FP, so this rant about FP
comes across as ill informed.

Despite that I'd agree most people probably see no benefit from these units
today. But that could change. For workloads with parallelism, wide SIMD is
very efficient - more so than multiple threads anyway. The only way to get
people to write vector code is to have vector processing available. Once it's
ubiquitously available people might code for it and the benefits may become
more apparent.

------
throwaway_pdp09
The very wide AVX stuff with integer ops, like these from wiki:

\- AVX-512 Byte and Word Instructions (BW) – extends AVX-512 to cover 8-bit
and 16-bit integer operations

\- AVX-512 Integer Fused Multiply Add (IFMA) - fused multiply add of integers
using 52-bit precision.

could be very useful. I could have done with those recently. They also don't
(AFAIK) cause CPU frequency scaling (a polite term for downclocking). He may
well be right about FP though.

~~~
superjan
52 bit precision? typo?

~~~
sdflhasjd
52 bits is the size of the mantissa in an IEEE 754 double-precision floating
point number.

------
gridlockd
_"I've said this before, and I'll say it again: in the heyday of x86, when
Intel was laughing all the way to the bank and killing all their competition,
absolutely everybody else did better than Intel on FP loads. Intel's FP
performance sucked (relatively speaking), and it matter not one iota.

Because absolutely nobody cares outside of benchmarks."_

That was back in the stone age when a lot of applications for FP math weren't
mainstream. Most of AVX-512 doesn't even concern FP, there's lots of integer
and bit twiddling stuff there.

Furthermore, people really _do_ care about these benchmarks. It influences
their purchasing, which is really the thing that matters most to Intel. A lot
of people don't actually care about hypothetical security issues or the fact
that the CPU is 14nm when it still outperforms 7nm in single-threaded code.

Also, it's not like you can just trade off IPC or extra cores for wider SIMD.
It's not like "just add more cores" is just as good for throughput, otherwise
GPUs wouldn't exist. Wider SIMD is cheap in terms of die area, for the
throughput it gives you.

Lastly, these are just _instructions_; nothing says that an AVX-512
instruction needs to go through a physical 512-bit wide unit. It just says
that you can take advantage of those semantics, if possible.

------
dang
A related discussion is here:
[https://news.ycombinator.com/item?id=23822203](https://news.ycombinator.com/item?id=23822203),
also with interesting comments.

Since this thread is of the second freshness, we won't merge.

------
bartwe
Down with simd, up with spmd/compute

------
nullc
There are 1001 AVX512 variations, but few operations equivalent to the RISC-V
bit manipulation instructions.

------
CamperBob2
_Intel's FP performance sucked (relatively speaking), and it matter not one
iota. Because absolutely nobody cares outside of benchmarks._

Today I learned that even Linus Torvalds has a bozo bit. [1] When's the last
time he actually _did_ anything with a computer?

1:
[https://en.wikipedia.org/wiki/Bozo_bit](https://en.wikipedia.org/wiki/Bozo_bit)

~~~
topspin
He has the history correct. Most of the CPUs that x86 beat in the market had
superior FP performance: SPARC, Alpha, PA-RISC, Itanium, etc.

> When's the last time he actually did anything with a computer?

According to Linus he completes about 30 pull requests a day, and some
multiple of that in kernel builds. His $1900 32-core Threadripper speeds up
that process a great deal, and FP contributes little to nothing.

Today people stream video+audio, encrypt+decrypt and render graphics. All of
these have specialized silicon. If their AVX-512 vanished in the night almost
no one would notice the next day.

Maybe we should all be astronomers and thermodynamicists writing bespoke
finite element simulations and have a deep appreciation for the wonders of
floating point ISAs, but that's just not the real world.

~~~
azalemeth
Speaking as someone who does scientific computing all day long, in part with
FEM simulations, even for me AVX512 isn't usually worth it in terms of wall-
clock time.

~~~
sgillen
Speaking as someone else who does scientific computing all day, taking away
vectorized operations would kill my performance completely.

