
What's wrong with Intel, and how to fix it: Former principal engineer unloads - em3rgent0rdr
https://www.pcworld.com/article/3569182/whats-wrong-with-intel-and-how-to-fix-it-former-principal-engineer-unloads.html
======
supernova87a
This is purely a total outsider's opinion, so take it with that caveat,
and please don't get all offended if I'm totally wrong.

Sometimes it's really difficult to tell the difference between a crackpot
theorist and a good equities analyst. Watching this guy's video, it's hard to
tell which he comes off as more.

He's just coherent and convincing enough that as an outsider I could
believe his collection of observations. But the presentation is also just
disorganized enough, fixated enough on certain details, and scattered
enough that I can't tell whether these factors actually matter.

Of course I'm sure an expert in the field could tell instantly.

I've seen enough interviews of former engineers (usually retired guys) of
whatever <xyz> company who know about a very specific fault of the
product/tool/etc and can claim convincingly up and down that it will cause
catastrophic failure, a fundamental strategy misstep, etc. But will it
really be a major factor in the overall outcome?

He proceeds from analysis of the smallest die-area detail to the opposite
end of the spectrum, the problem with MBAs + Marketing, within 1 minute.
With typos. Also, the guy doesn't have a long history of posting analysis
of this type, and he's honest about his info being 8 years old. Not sure
what his motivation for making the video is. So it's hard to put too much
weight on it, as if it were a Ming-Chi Kuo predicting Apple's next iPhone
form factor or something like that.

I would love to know what industry people here think, and then I might study
the video / findings with more attention.

~~~
glangdale
Francois is a smart guy, but there's a range of opinions on the matters he
raises. We argue a good deal on Twitter ("constructive confrontation" between
former Intel Principal Engineers :-) ). He put together the video pretty
quickly so don't dismiss it just for lack of polish.

Personally, I argue with him a lot on AVX-512. I think AVX-512 is a Good Thing
(or will be shortly - the first instantiation in Skylake Server - SKX - isn't
great).

The biggest meta-problem with AVX-512 is that due to process issues, the pause
button got hit _just_ as it appeared in a server-only, 'first cut' form. AVX2
had downclocking issues when it first appeared too - but these were rapidly
mitigated and forgotten.

I personally feel that SIMD is underutilized in general-purpose workloads
(partly due to Intel's fecklessness in promoting SIMD - including
gratuitous fragmentation - e.g. a new Comet Lake machine with Pentium
branding doesn't even support AVX2 and has it fused off). Daniel Lemire
and I built simdjson
([https://github.com/lemire/simdjson](https://github.com/lemire/simdjson)) to
show the power of SIMD in a non-traditional domain, and a lot of the benefits
of Hyperscan
([https://github.com/intel/hyperscan](https://github.com/intel/hyperscan))
were also due to SIMD. Notably, both use SIMD in a range of ways that
aren't necessarily all that amenable to the whole idea of "just do it all
on an accelerator/GPU" - choppy mixes of SIMD and GPR-based computing. Not
everything looks like matrix multiplication, or AI (but I repeat myself
:-) ).

AVX-512 is a combination of fascinating and frustrating. I enthuse about
recent developments on my blog ([https://branchfree.org/2019/05/29/why-ice-
lake-is-important-...](https://branchfree.org/2019/05/29/why-ice-lake-is-
important-a-bit-bashers-perspective/)) so I won't repeat why it's so
interesting, but the learning curve is steep and there's plenty of
frustrations and hidden traps. Intel collectively tends to be more interested
in Magic Fixes (autovec) and Special Libraries for Important Benchmarks rather
than building first-rate material to improve the general level of expertise
with SIMD programming (one of my perpetual complaints).

You more or less have to teach yourself and feel your way through the process
of getting good at SIMD. That being said, the benefits are often huge - not
only because SIMD is fast, but because you can do fundamentally different
things that you can't do quickly in GPR ("General Purpose Register") land
(look, for example, at everyone's favorite instruction PSHUFB - you can't do
the equivalent on a GPR without going to memory).
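
For concreteness, a minimal sketch of that PSHUFB point (my illustration,
not from the post; compile with -mssse3): one instruction performs 16
independent byte-table lookups, here mapping nibble values to hex digits.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        // 16-entry lookup table: nibble value -> ASCII hex digit.
        const __m128i lut = _mm_setr_epi8('0','1','2','3','4','5','6','7',
                                          '8','9','a','b','c','d','e','f');
        // 16 indices to look up, all at once.
        const __m128i idx = _mm_setr_epi8(15,14,13,12,11,10,9,8,
                                          7,6,5,4,3,2,1,0);
        // One PSHUFB: 16 independent table lookups, no trip through memory.
        __m128i out = _mm_shuffle_epi8(lut, idx);

        char buf[17] = {0};
        _mm_storeu_si128((__m128i *)buf, out);
        printf("%s\n", buf);  // prints "fedcba9876543210"
        return 0;
    }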

~~~
KMag
I've worked on a project to compile a domain-specific description language
down to both SIMD (by using Intel SPMD Program Compiler) and GPGPU (by using
CUDA and OpenCL) Monte Carlo simulation code. We also have a Scala/Java
interpreter for debugging.

Are there workloads that benefit from very wide SIMD vectors that aren't good
fits for GPGPUs, as long as the GPUs support 64-bit floats close enough to
IEEE-754 for your needs? I understand the overhead in shuffling data between
main memory and GPU memory, and synchronization overhead, but most code I'm
familiar with that does heavy number crunching suitable for very wide SIMD
tends to do that number crunching off on threads that don't have much
synchronization with threads doing more general-purpose computation.

On a side note, ISPC's input language is deceptively close to C, but little
traps lie in wait. I remember helping an intern debug his port of some of
the Java code I wrote over to ISPC; it turned out that multiplying a long
long by a double in ISPC results in a long long. Our attempt to scale
down a random 64-bit integer to the range [0.0, 1.0) was always resulting
in 0.0. I get that integer calculations are faster, and I could
understand disallowing implicit casts, but making a language this close
to C with different implicit casting rules is just asking for trouble.
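
For comparison, a minimal C sketch of the same scaling (rand64() is a
hypothetical stand-in for the simulation's RNG): C's usual arithmetic
conversions promote the integer operand, which is exactly the behavior
the ISPC port silently lacked.

    #include <stdint.h>
    #include <stdio.h>

    // Hypothetical stand-in for the simulation's 64-bit RNG.
    static uint64_t rand64(void) { return 0x123456789abcdef0ULL; }

    int main(void) {
        uint64_t r = rand64();
        // C promotes r to double here, so u lands in [0.0, 1.0). The ISPC
        // port reportedly kept the product integral, truncating everything
        // below 1.0 to 0 - hence the constant 0.0. The explicit cast makes
        // the intent unmissable either way.
        double u = (double)r * (1.0 / 18446744073709551616.0); // 1 / 2^64
        printf("%f\n", u);
        return 0;
    }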

~~~
glangdale
You're in a bit of a different domain - I've never really done all that much
stuff with heavy number crunching (I've done a bit of work on random forest
traversal, but that's more about logic than about FP). And I have never worked
with ISPC.

I think number crunching workloads are typically quite suitable for GPGPU -
I'm certainly not trying to "debunk GPGPU", just saying that there are a lot
of integer/logic intensive workloads that involve rapid switching back and
forth between control/GPR-based-logic sides and "SIMD tasks" (e.g. Hyperscan
switching between NFA/DFA simulation and "acceleration", which was SIMD-based
character skipping).
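
As a rough illustration of that kind of "acceleration" (my sketch, not
Hyperscan's actual code): use SIMD to skip to the next byte that could
start a match, then hand control back to the scalar automaton.

    #include <immintrin.h>
    #include <stddef.h>

    // Scan 16 bytes at a time for the next candidate byte, then return to
    // the scalar (GPR-side) automaton. SSE2 only.
    const char *skip_to_byte(const char *buf, size_t len, char c) {
        const __m128i needle = _mm_set1_epi8(c);
        size_t i = 0;
        for (; i + 16 <= len; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
            if (mask)  // a candidate byte: back to scalar matching
                return buf + i + __builtin_ctz(mask);
        }
        for (; i < len; i++)  // scalar tail for the last < 16 bytes
            if (buf[i] == c) return buf + i;
        return NULL;
    }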

------
LordHeini
It is a bit odd that there is no mention of the awful security problems
like Spectre and Meltdown, which mostly hit Intel and were very poorly
handled by them.

As a consumer I care way more about my CPU actually working as advertised
and not some Spectre mitigation killing the performance.

The price/performance ratio of AMD is just better; nobody cares about
some weird extension if I can have a Threadripper with a bazillion cores
for less money doing the same stuff. Even if only for marketing reasons.

And GPUs stole the vector cake years ago anyway.

~~~
Veedrac
> It is a bit odd that there is no mention of the awful security problems
> like Spectre and Meltdown, which mostly hit Intel and were very poorly
> handled by them.

I recall him mentioning exactly this.

------
bosswipe
None of that matters as long as Intel can't ship their 10nm and 7nm
nodes. This guy doesn't seem to have an opinion on what's wrong at the
fabs, but that is the big life-or-death question for Intel right now.

~~~
PaulHoule
I am glad to hear Intel is dabbling at TSMC (they have a plan B if not an
actual business plan), but it is painful to see Intel talking as if
nothing were wrong while congressmen are waking up to the reality that we
might not be making chips in the US anymore.

And that's what is scary. Intel is failing left and right, but the rap
track from Intel sounds like an ad for Disneyland.

~~~
justinclift
Wonder if the fabs being developed in China are, or will be, competitive
with Intel in the ~near future?

------
dathinab
I don't know how much of these things are true, but isn't it ironic that
AMD benefits more from SMT because they have fewer pipeline optimizations
(and similar), but exactly these pipeline optimizations (and similar)
seem to be at least partially at fault for some of the Spectre-style
attacks, and make it harder to put more cores into the system (as these
additional optimizations likely need more silicon space...).

Sometimes I wonder if the best (but maybe impractical on the software
side) design for (non-Threadripper-level) consumer PCs would be something
along the lines of:

- 1 single non-SMT core with the most advanced pipeline optimizations and
the highest clock speed, which always runs just one thread pinned to it
(many consumers tend to run only one "costly" application with a need for
highest performance at a time, e.g. a game's main event loop) - see the
pinning sketch below

- around this, 4+ SMT-enabled cores which are (mostly/reasonably)
Spectre-safe and have a lower max clock speed (other open applications,
other threads from the current game; should be good enough for most
applications)

- 1+ very low power cores, which might have another arch (for the
background maintenance tasks of the OS and programs, e.g. update jobs,
slow long polling, always-on/connected features etc.)

- 1+ crypto cores with an integrated hardware security module, used for
things like signing AES keys and similar (e.g. setting up, but not
running, TLS)

Though it's probably unrealistic, as e.g. thermal regulation for
high-speed threads is optimized by "bouncing" the logical thread between
cores, and software tends not to want to do things like write the
background maintenance code for a different arch than the rest of the
software; well, a phone OS probably could force this. Microsoft + Intel
kinda can't.
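
A minimal Linux sketch of the "one thread pinned to the fast core" part,
using plain pthread affinity - core 0 here is just a stand-in for the
hypothetical high-clock core:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    // Pin the calling thread to a single core - roughly what an OS or
    // runtime would do to keep a latency-critical thread on the one
    // high-clock core.
    static int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void) {
        if (pin_to_core(0) != 0) {  // core 0 stands in for the "fast" core
            fprintf(stderr, "failed to set affinity\n");
            return 1;
        }
        printf("main thread pinned to core 0\n");
        // ... run the game main loop / hot thread here ...
        return 0;
    }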

~~~
Scaless
You are more or less describing the architecture of the PlayStation 3:
[https://en.wikipedia.org/wiki/PlayStation_3_technical_specif...](https://en.wikipedia.org/wiki/PlayStation_3_technical_specifications#Central_processing_unit)

The problem with this type of system is that both the software and the OS
have to be aware of what is or isn't a "main thread" job. You have to
architect your whole program around job systems to spread your data
between the main thread and sub-processes, which was a major complaint of
gamedevs on the PS3. There's also the fact that many consumer software
developers just don't care and will flag all of their processes as
high-priority jobs.

It worked out decently well on PS3 because it was such a locked down system,
but for a general purpose desktop it would be utter chaos.

~~~
inawarminister
Wasn't Linux installable on early PS3s, and didn't some labs (US
military?) use stacks of PS3s as distributed super-ish computers? I
wonder if anyone has analysed general-purpose computing performance on
Cell...

------
Jonnax
So Linus, and this guy is saying AVX512 was a bad idea.

"“The state of software out there is really not favoring going larger
vectors,” Piednoel said in the video. “In fact, you can see clearly in
Cinebench for example—that is not one of my favorite benchmarks, especially
for a laptop where it doesn’t make any sense—but you can see that AMD is
winning the battle of throughput. It’s because they have more cores and they
can afford to have more cores.”"

But in other places I've seen a lot of hype for it, without much
discussion of actual use cases.

And now people are saying that it isn't useful except to give Intel
leadership in benchmarks.

Is anyone here actually using it and seeing a benefit?

~~~
BeeOnRope
AVX-512 is Intel's best vector ISA to date and it is definitely useful beyond
benchmarks.

In fact, _too low_ penetration of AVX-512 has been a problem, rather than
too much: for a long time it has only been available on server chips, not
laptop (a small fraction of laptops have gotten it recently) or desktop
(outside some low-volume "extreme" parts, which were really just rebadged
server parts).

It would also be quite unusual for instructions to be very useful in
benchmarks that are based on real-life, heavily used applications, but
not in real life. Outside of small, easily gamed benchmarks that doesn't
seem plausible: if the CPU is good at a video encoding benchmark, it will
be good at video encoding, with high probability.

The main problem with AVX-512 is lack of software exploitation. Unlike
frequency boosts, increased cache sizes, better branch predictors, etc.,
this speedup doesn't come for free. Either compilers have to use the new
instructions, or people have to use them by hand. The former has been
very limited because these instructions cause a frequency drop (so-called
"license-based downclocking"), so compilers have mostly disabled their
use by default: otherwise, a single AVX-512 instruction could have a
large impact on surrounding code which doesn't use AVX-512.

So by-hand exploitation remains, and penetration has just been too low and the
people with the skills to do this are limited.
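
As one concrete illustration of the compiler-side option (my sketch, not
BeeOnRope's): GCC and Clang on Linux support function multi-versioning,
which emits an AVX-512 clone alongside baseline ones and picks at load
time, so one hot function can use AVX-512 without raising the whole
binary's ISA level.

    #include <stddef.h>

    // One source-level function, three compiled clones; the loader's ifunc
    // resolver picks the best one the CPU supports at startup.
    __attribute__((target_clones("avx512f", "avx2", "default")))
    void scale(float *a, float factor, size_t n) {
        for (size_t i = 0; i < n; i++)  // plain loop; each clone is
            a[i] *= factor;             // autovectorized with its own ISA
    }

    int main(void) {
        float v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        scale(v, 2.0f, 8);
        return 0;
    }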

~~~
FullyFunctional
I'm sure AVX-512 is wonderful and you can easily produce examples where it
makes loops go N times faster. That unfortunately is missing the point.

Microprocessor implementation is an extremely careful balancing act, and
effectively everything is a trade-off. The area, power, and complexity of
a feature come at a non-zero cost to code that doesn't use it.

The fact that AVX-512 isn't on every processor _AND_ the performance benefits
are FAR from obvious (as the core clocks down when using it) really hinders
its adoption, which in turn lowers the value; a downward spiral.

Intel is in a fine mess of their own making. I'm no fan of either company and
AMD gets perhaps a little too much credit as things would have looked a lot
different had 10 nm succeeded on the original schedule, but I bought my first
AMD processor since Athlon 64 because it's better.

~~~
BeeOnRope
I'm not missing the point, as I was responding to whether AVX-512 is
useful in practice. It is. It isn't some badly designed extension that
falls short of the earlier ones: it's arguably better than they were.

Now adoption has been poor and there is a chicken-and-egg effect as you point
out, but that's separate from the question of whether AVX-512 is useful in
practice. It is.

The point about tradeoffs is well taken. It's not like Intel (or any other
vendor) is rolling back progress in other areas to add larger vector units:
rather they try to progress along all viable axes at once.

~~~
PaulHoule
AVX-512 isn't "bad in practice", but in the context of AVX-128, AVX-256,
and AVX-1024 it is very bad practice, because the inconstant availability
of the vector unit means that it doesn't get used, or you use a
least-common denominator of what worked on Atom 10 years ago.

Look on the bright side: if you don't use AVX-512 you have 10% extra heat sink
and heat soak on your CPU so you can turbo higher. AVX-512 makes the chip
easier to cool.

~~~
BeeOnRope
Yes, although this gets better with time.

People who care enough to want to use these units have figured out CPU
dispatch by now, too.

------
PhantomGremlin
From the article: _Piednoel didn’t spare words for Intel’s culture, which he
said has changed drastically and promotes MBAs over those with technical
prowess._

The current CEO is a finance / MBA type, so it's no surprise that the current
culture favors that.

In Intel's defense, the previous CEO _was_ a process engineer. But,
unfortunately, he couldn't keep his wiener in his pants. He was also
probably the wrong choice as CEO since, under his watch, Intel process
development apparently went to shit.

Oh, well. Intel was a great company for many years. Perhaps it will reinvent
itself.

~~~
nine_k
As somebody noted, when you have cornered a large part of the market, you
don't feel as much technical pressure, and success in selling becomes
key. So sales and MBAs start to run the show, because it makes business
sense, and while doing so they risk losing sight of the technological
advances that made the dominant position possible.

Intel's longtime CEO Andy Grove used to say: "Only the paranoid survive".
(He personally survived under a false identity in Nazi-occupied Hungary.)
I'm afraid Intel had stopped being paranoid enough before it started to
stumble more and more seriously.

~~~
karmakaze
This seems to be a reasonable explanation. Even with sales and MBAs running
the show, they know that they will always need something competitive to sell.
For the longest time Intel relied on shrinking dies and a single-threaded
performance advantage. They didn't continue to race ahead with shrinking,
and as noted AVX-512 is not the hit to save them. They need more, but may
have misjudged that, or believed too much of their own sales/marketing,
lacking sufficient paranoia.

The future will be more consumer appliance-like devices rather than
PC-like products. I would much rather use a laptop/OS that
suspended/resumed like an iPad, with low power consumption (cool/quiet)
and extended battery life. Apple has a path to this future. AMD can run
on servers and gaming rigs. Intel needs something like a ChromeOS or
Fuchsia that does everything (a cloud desktop rather than a browser) to
be mainstream. I would use Clear Linux if it were more than an academic
experiment--if it had a good desktop and repos with a maintained, useful
fraction of Ubuntu packages (i.e. packages that are up to date, used,
have good defaults, etc).

~~~
nine_k
I find Clear Linux quite nice, but I don't know if it could replace the lost
CPU revenue :-\

------
ENOTTY
I disagree that Intel has "lost focus" to the detriment of its core
competencies; after all, Intel is a large company that spends billions on
R&D every quarter. Rather, I'd call it Intel expanding its focus to grow
its addressable market to include 5G, AI, autonomous driving, advanced
memory, software-defined networking, etc. I think Francois is focused on
Intel being a CPU manufacturing company, when Intel sees itself as a
data-processing company.

It's important to note that these aren't totally unrelated pushes (a la Google
X), but rather all of the new areas leverage Intel's core competency in
silicon manufacturing. (Though, many of these products are due to be fabbed at
TSMC, according to rumors.)

"Xeon should dump unused core space" \- Intel already does this. Intel has two
CPU microarchitectures - Sunny Cove[1] (used for desktops, laptops,
workstations, and beefy servers) and Tremont[2] (used for low-power clients
and specialized servers). Tremont ditches the AVX instructions. Notably,
Tremont is used in specialized server SoCs that target the storage,
networking, and IoT/embedded markets.

Could Intel differentiate its product lineup even more? At the core level, one
should ask, Are there IPs that should be added to Tremont or subtracted from
Sunny Cove? Can some IPs be added at the SoC level?

[1]:
[https://en.wikichip.org/wiki/intel/microarchitectures/sunny_...](https://en.wikichip.org/wiki/intel/microarchitectures/sunny_cove)

[2]:
[https://en.wikichip.org/wiki/intel/microarchitectures/tremon...](https://en.wikichip.org/wiki/intel/microarchitectures/tremont)

------
gok
Whining about "MBAs" is almost always technical person speak for general
discontent about company direction, often from the non-business parts. You
don't see complaints about CEOs of Apple, Google, or Microsoft much, though
they all have MBAs.

~~~
me_me_me
MBAs improve numbers; engineers create new products. The mentality of the
MBA crowd is to get the numbers right, eat the cake, and when things go
sideways move on to a new job.

Intel is a hardware tech business; it's a complicated space that needs
deep knowledge to understand before you shape the future of it.

Technical people might not be the only ones that should run the business,
but they should be the core of it.

I have the same view as the video: engineers need more power and
decision-making, not less, to win. Intel will not fight back by cutting
production costs by 5%, but by creating the next-gen chip.

~~~
mobilio
Same happens to Boeing.

When they had engineers, we got the 737, the 747, and many more. When
they had MBAs, we got the 737 MAX and the 787.

~~~
me_me_me
That's a great, direct example of the thought process: we will save x% by
doing workarounds and call it MAX, instead of biting the bullet, spending
the money, and creating the next iconic plane design.

~~~
mobilio
The 787 is another great example, because EVERYTHING was outsourced to
other companies.

As a result, when the pieces (elements) have to work together, they fail
spectacularly.

------
SomeoneFromCA
Speaking of AVX-512: I find it very odd that they still produce desktop
CPUs without AVX (not even AVX2) at all (10th-gen Celerons), yet they are
pushing AVX-512 into laptops at the same time.

~~~
jdsully
They produced a few low volume 10nm laptop parts so they could say 10nm has
finally shipped. They can’t produce enough volume to move their whole product
line over.

------
ENOTTY
Link to the video that the article is summarizing
[https://www.youtube.com/watch?v=fiKjzeLco6c](https://www.youtube.com/watch?v=fiKjzeLco6c)

The creator's Twitter can be found here
[https://twitter.com/FPiednoel](https://twitter.com/FPiednoel)

~~~
Jonnax
[https://twitter.com/FPiednoel/status/1290874265705697280?s=1...](https://twitter.com/FPiednoel/status/1290874265705697280?s=19)

Looks like his video is him promoting himself to investors to try to get
himself on the Intel board of directors.

~~~
markus_zhang
That's fair. I mean, if he wants to change that he has to be on the
board. But I think investors like less risky ways to make revenue (i.e.
by cutting costs).

------
tails4e
I've always wondered: if I compile an application and don't specify the
architecture as explicitly having AVX, just x86 64-bit, will alternate
code paths be in the binary for processors with AVX? Or alternatively, if
I do specify AVX support on the command line, what happens if the binary
runs on a CPU without AVX? The reason I ask is that I use commercial
tools for circuit simulation, and they run on a wide range of CPU
architectures, but it would be a shame if that CPU compatibility came at
a performance hit when run on a more capable CPU architecture.

~~~
lozenge
Depends on the compiler. Gcc compiles to the lowest common denominator while
icc offers "Processor dispatch technology performs a check at execution time
to determine which processor the application is running on and use the most
suitable code path for that processor. Compatible, non-Intel processors will
take the default optimized code path."

For HPC they will always compile to use the full set of instructions and
see how it performs. For some software there will be explicit checks and
loading of the right compiled code (compression, video effects etc). On a
desktop most things are compiled for the minimum processor, though. The
CPU might still be able to use its new features, like wide registers, if
the instructions are in just the right order.

If you run something with instructions your processor doesn't have, it
will crash.
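
A minimal sketch of the "explicit checks" approach with the GCC/Clang
built-in (sum_avx2/sum_scalar are hypothetical routines you'd compile in
separate files with different flags):

    #include <stddef.h>
    #include <stdio.h>

    // Hypothetical routines; the AVX2 one would live in a translation unit
    // built with -mavx2, the other with baseline x86-64 flags.
    void sum_avx2(const float *a, size_t n);
    void sum_scalar(const float *a, size_t n);

    int main(void) {
        // __builtin_cpu_supports queries CPUID at runtime, so a binary
        // built for the lowest common denominator can still opt into
        // faster paths where the CPU allows it.
        if (__builtin_cpu_supports("avx2"))
            printf("would call sum_avx2\n");
        else
            printf("would call sum_scalar\n");
        return 0;
    }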

~~~
majewsky
> If you run something with instructions your processor doesn't have it will
> crash.

I would be very surprised if the processor itself crashed. I haven't
encountered it in practice so far, but as far as I'm aware, illegal
instructions raise an exception that the OS can catch. On Unix, this
should result in the corresponding user-space process receiving SIGILL.

It appears that SIGILL is actually trappable. Could this be used as a
poor man's CPU feature detection mechanism?

~~~
lozenge
I meant the process crashes, not the CPU - you are correct. Also, my gcc
knowledge is outdated.

People have mapped out the millions of "undocumented instructions" by
trapping SIGILL.
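
A minimal sketch of that trap-and-recover trick, probing one SSE4.2
instruction (my illustration - real code should prefer CPUID, and this
isn't what those instruction-mapping projects actually ran):

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>

    static sigjmp_buf env;

    static void on_sigill(int sig) {
        (void)sig;
        siglongjmp(env, 1);  // unwind out of the faulting instruction
    }

    int main(void) {
        signal(SIGILL, on_sigill);
        if (sigsetjmp(env, 1) == 0) {
            unsigned crc = 0;
            // SSE4.2 CRC32: raises SIGILL on CPUs without the extension.
            __asm__ volatile("crc32l %1, %0" : "+r"(crc) : "r"(42u));
            printf("sse4.2 crc32 supported (crc=%u)\n", crc);
        } else {
            printf("sse4.2 crc32 not supported\n");
        }
        return 0;
    }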

------
kingcobraninja
Just a heads up, this site is impossible to read on mobile.

------
crb002
Except AWS and Apple can and do recompile the ocean, even over to ARM.
AVX512-free chips and gutting legacy x86 are worth it; perhaps an
x86_64_2020 target?

------
ilaksh
I am honestly not enthusiastic about x86 anymore. It seems like basically an
entrenched legacy architecture at this point.

What are Intel's plans for HSA or RISC-V? How good are their AI products?
RealSense seems pretty exciting to me.

Is there going to be a successor to the PC design? My crazy hope is that we
will get some type of module system where you don't have to open the case and
can just plug in something like M.2 modules but with a superfast nextgen bus.

Is there an upstart chipmaker that can help push Intel out (I wish)? I have
heard good things about Nuvia.

When I hear about a company that is not run by engineers making bad decisions
and pissing off engineers, I am actually eager to see them die. Just like I
wish Boeing would die for similar reasons. But I know that both are very
unlikely to go away. One can always dream.

------
aPoCoMiLogin
> Ryzen’s “Hyper-Threading” looked good because of poor single-threaded
> performance

So if my CPU is one of the worst in single-threaded performance, then
multi-threaded performance will be the best? That makes no sense, and
this guy doesn't seem like he knows what he is talking about...

~~~
FullyFunctional
He's completely right. Hyperthreading is all about putting idle
functional units to work when you can't extract enough instruction-level
parallelism from single-threaded code. His claim is that Intel processors
did better at this, and thus only got a small relative boost from HT
(IIRC it used to be 10%, maybe it's 20% now). If your processor is worse
at ILP extraction, you get relatively more out of HT.
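
(Toy arithmetic, mine rather than his: if one thread already keeps ~90%
of a core's issue slots busy, a sibling thread can add at most ~10% more
throughput; if it only manages ~70%, SMT has up to ~30% of idle capacity
to claim.)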

This all agrees with the benchmarks, in which Intel still generally is
ahead on single-threaded performance. For a more extreme case, look to
older POWER processors, which were 4-way threaded, presumably because
they were even worse on single-threaded code.

~~~
aPoCoMiLogin
> This all agree with the benchmarks in which Intel still generally is ahead
> on single thread performance.

Citation needed. In the benchmarks I've seen [0][1], there is not that
big of a difference between the two.

[0] [https://www.anandtech.com/show/15578/cloud-clash-amazon-
grav...](https://www.anandtech.com/show/15578/cloud-clash-amazon-
graviton2-arm-against-intel-and-amd/5) ZEN1

[1] [https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-
gen/9](https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/9) ZEN2

~~~
FullyFunctional
[https://browser.geekbench.com/processor-
benchmarks](https://browser.geekbench.com/processor-benchmarks)

Even normalizing for the frequency isn't enough to explain the
difference.

~~~
kasabali
You need to look at turbo frequencies. Geekbench runs at turbo speed almost
all the time.

~~~
FullyFunctional
Like I said, I did account for that.

