
Linux founder tells Intel to stop inventing magic instructions and fix problems - rsecora
https://www.pcgamer.com/linux-founder-tells-intel-to-stop-inventing-magic-instructions-and-start-fixing-real-problems/
======
detaro
duplicate:
[https://news.ycombinator.com/item?id=23809335](https://news.ycombinator.com/item?id=23809335)

------
zozbot234
Previously:
[https://news.ycombinator.com/item?id=23809335](https://news.ycombinator.com/item?id=23809335)

------
ajayjain
Two years ago, I wrote an LLVM compiler pass that automatically upgrades
SSE or AVX2 code to AVX-512 when those instructions are available. It's
helpful for getting performance gains from hand-engineered code that you
don't want to touch. We saw some good gains on integer workloads
(unpacking, useful for databases - something I'd call "regular code"):
1.16x speedup going from SSE to AVX2, and 1.43x speedup going from SSE to
AVX-512. Even better speedups are available for FP workloads, though
hand-writing the AVX-512 version can work a lot better since programmers
can exploit the full range of new instructions available in AVX-512.

[https://www.nextgenvec.org/](https://www.nextgenvec.org/)
[https://arxiv.org/abs/1902.02816](https://arxiv.org/abs/1902.02816)
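
To make the transformation concrete, here is roughly the kind of rewrite
the pass performs, shown as hand-written intrinsics (a minimal sketch, not
actual pass output; assumes n is a multiple of 16 and AVX-512F for the
second version):

    #include <immintrin.h>

    /* Before: SSE2, 4 int lanes per iteration. */
    void add_sse2(int *dst, const int *a, const int *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
        }
    }

    /* After revectorization: AVX-512, 16 int lanes per iteration. */
    void add_avx512(int *dst, const int *a, const int *b, int n) {
        for (int i = 0; i < n; i += 16) {
            __m512i va = _mm512_loadu_si512(a + i);
            __m512i vb = _mm512_loadu_si512(b + i);
            _mm512_storeu_si512(dst + i, _mm512_add_epi32(va, vb));
        }
    }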

~~~
PaulHoule
The catch is "hand engineered code that you don't want to touch."

I helped put an early deep network application into production that used
one particular generation of vector instructions, and it was clear
everyone was traumatized by the process of implementing it with those
vector instructions and would never do it again.

We later bought servers that had instructions that we weren't using, but we
were so focused on getting the product out the door (which we did) that we
left those performance gains on the table.

The consequence of the "new instructions every two years" route is that end
users don't really experience the performance benefits engineered into the
chips. The vast majority of software development organizations are more
concerned about support costs than they are about getting the last bit of
performance out. (Also, end users are already tired of 200-megabyte binaries,
long start times, and other performance decrements that come from high-
complexity libraries that contain a huge number of code paths for different
CPUs)

Many working assembly programmers who are good at SIMD love the Intel
approach, maybe because it keeps them in business. People are doing very
cool things in the parsing space with SIMD instructions (e.g. it is
shocking how slow it is to parse ASCII-formatted numbers, JSON, XML,
RDF, ...).
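
As an illustration of the number-parsing trick, here is the well-known
eight-ASCII-digit parser (a sketch following the version popularized by
Daniel Lemire; assumes the input is exactly 8 digit characters and that
SSE4.1 is available):

    #include <stdint.h>
    #include <immintrin.h>

    /* Parse 8 ASCII digits ("12345678" -> 12345678) without a loop:
       subtract '0', then three multiply-add steps combine pairs of
       digits into 2-digit, 4-digit, and finally 8-digit numbers. */
    uint32_t parse_eight_digits(const char *chars) {
        const __m128i ascii0    = _mm_set1_epi8('0');
        const __m128i mul_1_10  = _mm_setr_epi8(10, 1, 10, 1, 10, 1, 10, 1,
                                                10, 1, 10, 1, 10, 1, 10, 1);
        const __m128i mul_1_100 = _mm_setr_epi16(100, 1, 100, 1,
                                                 100, 1, 100, 1);
        const __m128i mul_1_10k = _mm_setr_epi16(10000, 1, 10000, 1,
                                                 10000, 1, 10000, 1);
        __m128i in = _mm_sub_epi8(_mm_loadl_epi64((const __m128i *)chars),
                                  ascii0);
        __m128i t1 = _mm_maddubs_epi16(in, mul_1_10); /* d0*10+d1, ... */
        __m128i t2 = _mm_madd_epi16(t1, mul_1_100);   /* 4-digit chunks */
        __m128i t3 = _mm_packus_epi32(t2, t2);
        __m128i t4 = _mm_madd_epi16(t3, mul_1_10k);   /* 8-digit result */
        return (uint32_t)_mm_cvtsi128_si32(t4);
    }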

However, the old tradition in vector computing is the variable-sized
vector unit, like what the ILLIAC or the old Crays had, or the vector
accelerator for the IBM 3090. ARM's Scalable Vector Extension (SVE) is
organized that way.

ARM's Helium had a nice middle ground of having "small, medium and large"
implementations available which all run the same code. You might get
higher peak performance on Intel's route, but higher realized performance
with ARM's.

~~~
ajayjain
Yep! You got the motivation exactly right. The goal is automatic program
"rejuvenation" w/o all the engineering and maintenance costs. Revectorization
in the compiler allows programmers to capture much of the performance gain
possible with wider instruction sets without having to write and maintain
multiple code paths. It can be a big pain to understand and rewrite this
vector code -- especially if it's in some external library or another team's
code. But agreed that some really nice results are possible with the right
engineering such as in parsing.

Do the binaries really get into the 200 MB range from different code paths, or
is that due to data?

------
corysama
There's not much use for AVX-512 in the kernel. But, from the people who are
using it I've heard good things. It's an enormous collection of new
instructions. But, a lot of them came from the Larrabee team --which was
probably the greatest case of high-perf software engineers having direct
influence over the direction of a CPU. Otherwise, the history of SSE and AVX has
largely been software engineers requesting features and hardware engineers
replying "Do not understand. How does that feature improve our SPECfp score?
Rejected."

~~~
gameswithgo
On the one hand AVX-512 is incredible; the performance gains from AVX2 to
AVX-512 are often more than the 2x you would hope for. On the other hand,
when you can use AVX-512 you can often use a GPU instead, to greater
benefit. So Linus may be right, overall.

~~~
corysama
In my observations, GPUs are the way they are largely because Physics demands
sacrifices if you want that kind of performance. Those same demands apply to
CPUs, but Intel has been fighting against them because software engineers hate
to change anything about how they code. Well... the fight was lost years ago
and Intel is conceding to Physics by integrating architecture that was
originally designed for a CPU-GPU hybrid.

With AVX-512 and CUDA, Intel and Nvidia are trying to find the best developer
experience they can to deal with the inevitable future of computing that is
somewhere between those two experiences currently.

~~~
Const-me
> inevitable future of computing that is somewhere between those two
> experiences currently.

Not sure that ever happens. The two are very different in their
performance characteristics. CPUs have to stay focused on single-threaded
performance.

Single-threaded performance means minimizing latency; that’s why CPUs
spend so many transistors and so much power on the cache hierarchy and
coherency protocol, instruction reordering, ILP, micro-op fusion,
prefetchers, branch prediction (including indirect branches), and so on.

GPUs don’t care about single-threaded performance; all they care about is
throughput on massively parallel tasks. They hide memory latency by
switching threads until the data arrives, which makes caches much
simpler. They don’t need high clock frequencies (this alone improves
their flops/watt by a large factor), don’t do branch prediction, have a
less strict memory model, and so on. That’s how they can spend so many
transistors on FPUs and ALUs.

P.S. Intel tried many-slow-cores with Larrabee / Knights Landing / etc.;
all they got was a very expensive lesson, because people didn’t want that
stuff. I’ve heard a rumor that Sony initially wanted to use another Cell
CPU as the GPU in the PS3; again, it didn’t work well enough and they
negotiated with Nvidia instead.

~~~
corysama
All true. Problem is that single-threaded latency minimization has been tapped
out for a long time now. Physics says No. Memory is high-latency because of
the speed of light. Larger caches have diminishing returns. Clocks can't go
faster without melting the chips. There is no more prediction and pipelining
to be dug out of a serial stream of instructions.

We can still add more transistors. But, there is nowhere for them to
contribute but in going wider. High latency, high throughput. High concurrency
and careful management of the memory hierarchy are the only ways to pull that
off. When you accept that fate, you start out with the Cell processor and you
pass through Nvidia's Ampere chip --which has somewhat more thread
independence and better automatic caching than past GPUs, but also finally
gained manual async DMAs to local SRAM like the Cell had. Where we'll end up,
I don't know.

~~~
Const-me
When I look at general-purpose single-threaded benchmarks, I see good
progress. I’m pretty sure we’ll keep seeing very good progress there,
because of Intel/AMD competition, and also because inexpensive SSDs
removed the storage bottleneck.

I agree on the physical limitations, but the research still goes on;
smart people are still able to use these extra transistors in ways that
are useful for performance. Even for single-threaded performance.

Another thing: software is progressively getting better at using many
cores. It’s much easier to move large pieces of code to another CPU
thread (like some browsers do with the JS compiler) while keeping the
code sequential than it is to rewrite the code as a large set of small
tasks / threads. Many useful algorithms can’t use fine-grained
parallelism much, or at all, e.g. streaming stuff like gzip, or parsers.

~~~
corysama
I really miss the 286->Pentium days when you had to buy a new machine every
year because it would be literally 50-100% faster at everything. Every year!
These days people call a 9% single-thread improvement a good year.
[https://www.cpubenchmark.net/year-on-year.html](https://www.cpubenchmark.net/year-on-year.html)

[https://www.karlrupp.net/wp-content/uploads/2018/02/42-years-processor-trend.png](https://www.karlrupp.net/wp-content/uploads/2018/02/42-years-processor-trend.png)

BTW: Thanks for
[http://const.me/articles/simd/simd.pdf](http://const.me/articles/simd/simd.pdf)
I came across it recently and posted it to
[https://www.reddit.com/r/simd/](https://www.reddit.com/r/simd/)

------
shadowgovt
While I love Linus and I think he's excellent at what he does, he's utterly
missing the fact that things like AVX-512 sell chips, and that's what Intel's
in the business of doing. And best practice in how one writes software will
increasingly look like "write it so it executes well in a massively-parallel
context" if the driver for chip sales is massively-parallel problems.

We've smacked pretty hard into the wall of how fast we can make CPUs by
miniaturizing components, and the low-hanging fruit now is parallelization,
speculative execution, and the whole host of ways to do more per clock
cycle, not speed up clock cycles. But that flies in the face of the
traditional embarrassingly-serial pattern of the x86 instruction set and
computing
environment. Much of what Intel's doing these days is trying to open up
opportunities for people to change tools to speed up code without having to
boil the ocean by throwing the whole serial-instruction model completely out
the window (even though, increasingly, code written to that instruction set is
really operating like a language that is _emulated_ by the underlying
parallel-and-predictive CPU hardware).

I think what Linus calls "regular code" isn't going to move chips and we've
wrung all the cheap optimizations out of that critical path (and, depending on
the details of what is meant by "regular," such code is increasingly going to
run up against an efficiency ceiling unless it can be moved to a model of
"Prepare work to do in parallel, execute in parallel, merge results and
present to UI").

~~~
colinmhayes
Pretty sure this is exactly what Linus is complaining about. It's not
that he doesn't get it; he's upset that Intel is spending resources on
what he sees as advertising instead of making the product better.

~~~
shadowgovt
But more direct access to parallel execution hardware is making the product
better. It's not making it better for _his use case_, which is not surprising
because Intel is in the business of selling chips, not optimizing for the
Linux kernel use case.

------
goalieca
> This is not the first time Torvalds has directed his ire at Intel. In 2018,
> Torvalds referred to Intel's Meltdown and Spectre patches as "COMPLETE AND
> UTTER GARBAGE," in all caps to emphasize his level of anger.

I'm 1000% behind him on this point. Fix that first, and fix it well. Then
get onto the fancier stuff that I'll never ever code against.

~~~
shadowgovt
The thing is, "that I'll never ever code against" is context specific. The set
of applications that want to be embarrassingly-serial and not utilize any
massively-parallel features is shrinking (even problems traditionally solved
via serial code can benefit from a parallel re-architecting, assuming the path
from language to CPU architecture supports true parallelism).

~~~
PaulDavisThe1st
The set of applications that benefit from SIMD (i.e. SSE, AVX, AVX-foo) is
very different from the set that benefit from parallelism.

~~~
solidasparagus
That doesn't make much sense to me. SIMD is a parallelization technique.

~~~
jlokier
I think they mean that SIMD only provides a very particular, limited kind of
parallelism. It's only useful for some kinds of parallel problems.

Anything where the execution traces diverge significantly requires more
execution cores, to sequence them separately. For example serving multiple web
pages in parallel.

Anything where the execution traces are identical or near-identical across
many repetitions in lockstep, is fine with one execution core and SIMD ALUs.
For example dense or block matrix arithmetic.

Modern GPUs are a hybrid of both ideas. Shared execution cores and SIMD ALUs
because of high repetition of identical logic, but still many execution cores
(but fewer than ALUs) so there is some amount of decoupled parallelism as
well, running different tasks at the same time.

~~~
cogman10
GPUs are somewhat more like "super-powerful SIMD". That is, you tell a
GPU "I want n cores to run this block of code" and that's just what it
will do.

SIMD is much more primitive. It is more along the scale of "multiply
these 4 numbers by 5" and that's it. A GPU kernel can have branching,
conditionals, etc. SIMD is limited to one operation on a small set of
data (512 bits, i.e. 64 bytes, for AVX-512).

~~~
jlokier
CPU SIMD can do conditionals and branches in the same way as GPUs, using per-
lane conditionals.

A modern GPU isn't limited to running one block of code at a time (on n
cores). It will schedule a few different blocks of code independently - up to
the number of execution units on the GPU, multiplied by the number of latency-
hiding time slots per execution unit.
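
To make the per-lane conditional concrete, here is a minimal sketch in C
(AVX; assumes n is a multiple of 8). The scalar branch becomes a compare
that produces a per-lane mask plus a blend, which is essentially how GPU
hardware handles divergent lanes within a warp too:

    #include <immintrin.h>

    /* Scalar: a data-dependent branch per element. */
    void clamp_scalar(float *x, int n) {
        for (int i = 0; i < n; i++)
            if (x[i] < 0.0f) x[i] = 0.0f;
    }

    /* AVX: the same logic with no branch on the data. The compare
       yields an all-ones/all-zeros mask per lane, and the blend
       selects per lane. */
    void clamp_avx(float *x, int n) {
        const __m256 zero = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            __m256 v   = _mm256_loadu_ps(x + i);
            __m256 neg = _mm256_cmp_ps(v, zero, _CMP_LT_OS); /* v < 0 ? */
            _mm256_storeu_ps(x + i, _mm256_blendv_ps(v, zero, neg));
        }
    }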

------
dottrap
Seems like many people here haven't actually read Linus's specific
thoughts on AVX-512, and are accusing him of rejecting SIMD or the
usefulness of FP performance entirely. This is not an accurate
representation of his position; he singles out AVX-512 and treats it
differently from SSE/AVX/AVX2.

Here is a link to the thread where Linus directly explains his position
on MMX/SSE, AVX/AVX2, and AVX-512. His complaints about AVX-512 are both
about fragmentation and about regressing from the lessons learned in the
prior generations. He also looks at NEON and SVE2 and suggests ARM's
approach seems much saner to him:

"So just as a bystander, I'm looking at AVX512, and I'm looking at SVE2, and
I'm going "AVX512 really is nasty, isn't it"?""

[https://www.realworldtech.com/forum/?threadid=193189&curpost...](https://www.realworldtech.com/forum/?threadid=193189&curpostid=193248)

------
staycoolboy
Intel? Dude, have you seen Arm's JavaScript instructions? Or RISC-V's user-
defined instructions? (and Arm announced the same thing at Arm Tech Con last
year)

Someone is missing the boat: custom ISA is the future.

~~~
thrwyoilarticle
>Arm's JavaScript instructions

Name two

~~~
matthewaveryusa
Is your point that there's only one (FJCVTZS) ?

~~~
thrwyoilarticle
Exactly. It's not comparable to the impact AVX has on x86 architectures.
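
For context, FJCVTZS is "Floating-point JavaScript Convert to Signed
fixed-point, rounding toward Zero". It exists because JavaScript's ToInt32
conversion takes several instructions to emulate otherwise. Roughly what
it computes, as a C sketch (the final cast assumes the usual
two's-complement wrap-around):

    #include <stdint.h>
    #include <math.h>

    /* JavaScript's ToInt32, which FJCVTZS does in one instruction. */
    int32_t js_to_int32(double d) {
        if (!isfinite(d)) return 0;              /* NaN, +/-Inf -> 0 */
        double m = fmod(trunc(d), 4294967296.0); /* truncate, wrap mod 2^32 */
        return (int32_t)(uint32_t)(int64_t)m;    /* reinterpret as signed */
    }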

------
phkahler
I'm still waiting for languages to support 2D, 3D, and 4D vectors as
first-class data types. These are so common that it's silly to make
people define them. We end up with different implementations sometimes,
too.

Please, Rust, please!
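
In the meantime, compiler extensions come close. A sketch using
GCC/Clang's (non-standard) vector extensions, which already behave almost
like first-class types:

    /* Componentwise operators come for free, and the compiler lowers
       them to SSE/NEON registers. */
    typedef float vec4 __attribute__((vector_size(16))); /* 4 x float */

    vec4 lerp4(vec4 a, vec4 b, float t) {
        vec4 tv = {t, t, t, t};  /* broadcast the scalar */
        return a + (b - a) * tv; /* componentwise arithmetic */
    }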

~~~
dhosek
AppleSoft BASIC had multi-dimensional arrays back in 1977. It always bugged me
that modern languages didn't support this neatly.

~~~
shele
FORTRAN had arrays in 1956...

------
paulmd
His idea that AVX-512 is exclusively for floating-point is completely
off-base; AVX has included integer operations since AVX2, and they are
widely used in JITs, databases, etc.

Furthermore, AVX-512 is about much more than doubling the vector width, it is
a significant overhaul to the instruction set and adds many new operations and
"fills in gaps" that were missing from previous instruction sets. It in fact
would be perfectly valid and good to implement AVX-512 with a 256-bit unit
that takes twice as long to run 512-bit width instructions. This completely
negates all his points about die space utilization right from the start -
AVX-512 support does not imply a significantly larger use of space than
previous AVX instructions. This would also fix some of the power-related
problems on Skylake-SP - after all if you go from 2 512-bit wide units to
2x256 gangable units or 1x256 running at half-rate, that reduces power
correspondingly and you no longer need to drop clocks so strongly to offset
this, but you keep the functionality added in AVX-512.
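
One concrete example of the "fills in gaps" point: per-lane masking and
the compress primitive. A minimal sketch (assumes AVX-512F and exactly 16
inputs; keep_nonnegative16 is a hypothetical helper with no tail
handling):

    #include <immintrin.h>

    /* Keep only the non-negative values of 16 ints and pack them
       contiguously into dst -- a filter/partition step with no direct
       SSE/AVX2 equivalent. Returns the number of values written. */
    int keep_nonnegative16(const int *src, int *dst) {
        __m512i v = _mm512_loadu_si512(src);
        __mmask16 keep = _mm512_cmpge_epi32_mask(v, _mm512_setzero_si512());
        _mm512_mask_compressstoreu_epi32(dst, keep, v);
        return _mm_popcnt_u32(keep); /* how many lanes were kept */
    }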

Furthermore, it's not like there are massive gains in general IPC that
haven't been tapped. AVX-512 has taken 25% of the die area in some
instances; if you dropped that to AVX2 (assume 12.5% of die area), it's
not like the processor would be 12.5% faster in general -- that would
translate to maybe 2-3% faster in general and a 30%+ loss in specialty
applications. Once you've
mostly explored general-purpose gains and are into diminishing returns
territory (which modern processors certainly are), it makes sense to start
looking at "specialty units", like AVX, or on GPUs you've got tensor cores and
BVH traversal units, and so on. These can provide big speedups in key tasks at
the cost of very little "general" performance (since that's already in
diminishing returns territory).

The biggest thing slowing down AVX-512 adoption has been, yet again, 10nm.
Right now it is only available on Skylake-X and Skylake-SP products, and more
recently Ice Lake (which came out September of last year, in only the
ultrabook segment, supplemented by 14nm in the mobile workstation segment as
well as the ultrabook segment). So right now it is available in less than 1%
of the desktop "fleet" and probably less than 1/8th of the laptop "fleet".
There is very little reason to implement code paths for an instruction set
that nobody can execute. Over time, as Ice Lake and Tiger Lake build share of
the laptop "fleet", Rocket Lake implements it on desktop, and AMD implements
it whenever, it will see more usage, just like prior AVX sets.

It really is a wider market than people realize. I have seen many people scoff
and say "well you'll never see it used in games or whatever", but for many
years now there have been games that simply will not run if you don't have AVX
(notably many Ubisoft titles), there is no fallback SSE/scalar codepath. In
another 10 years you will probably have AVX-512 mandatory games as well.

With all due respect to his long career in software engineering, that doesn't
necessarily translate to processor design. This is just one person's opinion
and you are under no obligation to accept it as gospel just because it's
Torvalds'. See also: his weird ZFS rant.

(This seems to be a common thing with software engineers in particular,
including many on this site - can't count how many "one weird fix from a
software engineer to fix [complex domain problem] in
[chemical/materials/aerospace engineering]" I've seen. I of course have no
particular expertise in processor design either, but the engineers at Intel
presumably do, and they thought it was a good idea.)

~~~
phkahler
>> It in fact would be perfectly valid and good to implement AVX-512 with a
256-bit unit that takes twice as long to run 512-bit width instructions. This
completely negates all his points about die space utilization right from the
start - AVX-512 support does not imply a significantly larger use of space
than previous AVX instructions.

The extra registers have to be there even if the execution unit is half size.
The context switching overhead grows, which is something kernel developers
care about. And then with all these variants there's a push to have a single
binary support them all, which complicates the code too. Meanwhile, the world is
moving toward parallelism which is going to mean more thread creation and
context switching.

Linus does have valid concerns here.
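
To put a rough number on the context-switch point, you can ask CPUID how
big the XSAVE state area for the enabled features is -- that is the state
the kernel has to shuffle on every context switch. A minimal sketch
(GCC/Clang's <cpuid.h>; on AVX-512 hardware the ZMM/opmask state adds
roughly 2 KB versus a few hundred bytes for SSE-only state):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned eax, ebx, ecx, edx;
        /* CPUID leaf 0xD, sub-leaf 0: EBX = bytes required by XSAVE
           for the features currently enabled in XCR0. */
        if (!__get_cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx))
            return 1;
        printf("XSAVE area: %u bytes\n", ebx);
        return 0;
    }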

~~~
MrBuddyCasino
Are those AVX-512 registers saved on context switches? At least on
syscalls, couldn't the kernel just have a policy of not using those
registers? I'm not exactly sure how that works; this is part of the ABI,
right?

------
tbenst
I see a huge bump in numpy performance with AVX-512, to the point where I
wouldn’t buy a CPU without it. I don’t know that I understand the
critique -- is it because these instructions are Intel-specific and not
on AMD? It seems obviously useful for scientific computing.

~~~
quotemstr
Numpy is grossly inefficient anyway due to its eager evaluation model and lack
of operator fusion.

------
etaioinshrdlu
Does anyone know if the frequency throttling caused by AVX512 can be
circumvented with high-performance CPU cooling?

It would at first glance appear that the throttling is not caused by
temperature spikes but rather just by executing the instructions at all.

I wonder how much more FP performance one could extract out of an AVX512 CPU
with extreme cooling.

~~~
pstrateman
On Intel CPUs AVX512 instructions always block boost clocks, regardless of
actual power budget or cooling capacity.

Indeed the boost clocks are on a hard timer in general and cannot run
continuously.

------
Thaxll
Also, AVX-512 heats up the CPU quite a bit.

~~~
corysama
For >= 256-bit operations you need to decide whether to commit to them or
not. If you go all-in on them and use them well, you will get the same
work done in less time and fewer joules than without them. If they won't
help, don't use them. But popping in and using them for tiny bits would
be the slow and hot worst of both worlds.

~~~
chrchang523
Nit: most 256-bit AVX2 integer instructions (exception: multiply) do not
trigger downclocking.

~~~
quotemstr
Is there a central list of these safe instructions?

~~~
chrchang523
I currently follow the guidance in this Stack Overflow answer by Travis Downs:
[https://stackoverflow.com/a/56861355/3245034](https://stackoverflow.com/a/56861355/3245034)
. I'm not aware of a more central source of truth.

------
shmerl
It's an old argument of having too many special purpose circuits vs leaving
more room for the rest. Linus has a point.

------
peter_d_sherman
I find it amusing, er, ironic, er, amusing -- that after this article, on the
same web page, there's a link/blurb which reads:

"Nvidia is worth more than Intel for the first time in history. Nvidia is now
worth more than Intel, according to the NASDAQ. The GPU company has finally
topped the CPU company's market cap (the total value of its outstanding
shares) by $251bn..."

Well no surprise there.

On the one hand, Intel (especially early Intel employees) should be thanked,
profusely, for giving us PC history as we know it today.

On the other hand, we should seek to honestly recognize what Intel has become
today.

A mega-corporation, driven by corporate mentality, which always seeks to
maximize profits for its shareholders at the expense of all other virtues.

Keep in mind I am not criticizing Intel's employees -- only the marching
orders that come from the top down.

But, as far as I can tell, Intel, as we know it, will not be around in another
30 years.

The future of semiconductors is in the following areas:

1) Simple, non-proprietary instruction sets (RISC-V and others)

2) Transparent (or as transparent as possible) and publicly auditable
engineering and manufacturing processes

3) Conscientious companies that place virtue first, and are not driven by
maximizing profits for shareholders

So on the one hand, Intel is to be thanked, lauded, and praised for its
role, especially its early role in the beginning of the PC revolution;
but on the other hand, I don't see Intel existing as a company more than
30 years from now...

Although, at that point in time, Intel's past role will always remain
important and relevant -- to future students of early computer history...

~~~
dahinds
Your "future of semiconductors" could have been written 30 years ago, and
seems no more likely to be true today than it was then. Do you think Nvidia is
less proprietary, more transparent, or more virtuous than Intel?

~~~
Iwan-Zotow
Well, the combination of a lead in x86 architecture (often close to 100%
of market supply) and a lead in manufacturing (first in the process
shrink down to 14nm) was unique and is highly unlikely to be repeated.

Commoditization of architecture (ARM seems to be in the lead, but RISC-V
might go up) plus the separation between chip design houses and
manufacturing fabs seems to be the more efficient market solution. Where
does that leave Intel?

------
anarazel
x86 instructions I'd like to see are atomic instructions with lower
global memory-ordering guarantees. TSO is nice, but there's enough
performance-sensitive concurrent code where opting out would be useful.
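
A minimal C11 illustration of why opting out matters: you can ask for
relaxed ordering today, but on x86 an atomic read-modify-write still
compiles to a LOCK-prefixed instruction with full-fence semantics, so
there is nothing cheaper for the compiler to pick:

    #include <stdatomic.h>

    void bump(atomic_long *counter) {
        /* Still a `lock add` on x86-64, same cost as seq_cst; on a
           weaker ISA this could be a plain atomic add without fences. */
        atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
    }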

------
fierarul
Kinda odd to see the title 'founder'. Linux is no startup or company. Linus is
more like the original author, maintainer or creator of Linux.

~~~
zenhack
The word "founder" just means "person who started a thing." The use as a
synonym for entrepreneur is something I basically haven't seen anywhere
outside of very business-focused contexts.

------
posedge
"Linux founder"

Folks, anyone who knows what Linux is also knows who Linus Torvalds is.

------
drummer
I just love it when Torvalds flips out.

------
Koshkin
Is this an eye-opener for Intel? I think not. So, then, what's the point
(other than to vent one's frustration)?

~~~
dathinab
No point. He just made a commend about something he things is annoying in a
mailing list and some media makes a big deal out of it because it's Linus
Torvalds. That's all there is to it.

------
okareaman
LT: "I hope AVX512 dies a painful death"

I see his month break to work on "unprofessional" behavior didn't include a
course on non-violent communication (NVC -
[https://www.cnvc.org/](https://www.cnvc.org/))

~~~
okareaman
Before I get downvoted to oblivion - a lot of good people worked on that
feature because it was their job, presumably some women, and I'm sure they
brought passion and creativity to the endeavor, even if it was a dumb
management decision. I guess HN approves of telling those people that their
work should die a painful death.

~~~
Dylan16807
Kill your darlings. It sucks when a lot of people work hard on the wrong
thing, but hard work is not a reason to keep a feature.

~~~
okareaman
I didn't say keep it. I suggested using words less harsh than "I hope it
dies a painful death". If someone said that about something I worked hard
on, it would cause me pain.

------
0xFFC
Torvalds clearly does not understand the brutality of business. Intel
does not care at all about the customers it has at the moment, because it
has them by the balls. After many years of struggle, only Apple (with its
unlimited resources) was able to switch to ARM. So they don’t care.

They only care about the long-term market, which is HPC and ML workloads,
because Nvidia is destroying everybody in that market. Look at their
stock.

I’ve got news for Torvalds: it’s going to get bad for Intel.

~~~
cameldrv
If there was ever a time in the past twenty years when Intel had the
least hold over its customers, it's now. AMD is extremely competitive in
the server and desktop space, ARM/Apple is becoming competitive in the
laptop space, Intel has no product in the phone/tablet space, and they
are noncompetitive in the HPC/DL accelerator space.

If there's anything that has typified Intel in the past decade it's a
generalized failure to execute. There's been very little innovation in the
core product since Haswell in 2013.

~~~
trekrich
Then it's only a matter of time before they go bankrupt.

~~~
rorykoehler
They have the cash to get out of this if they make the right moves.

~~~
mindcrime
Exactly. A single well-timed and well-executed acquisition could change the
whole game. Of course we all know that executing acquisitions well is easier
said than done, and I don't have a specific target in mind when I say this.
It's more of an "in principle" speculation than a concrete suggestion at the
moment.

~~~
rorykoehler
Softbank are looking to sell some or all of their ARM holdings.

