
Serious Intel CPU bugs (2016) - derek
https://danluu.com/cpu-bugs/
======
slivym
As a former Intel employee this aligns closely with my experience. I didn't
work in validation (actually joined as part of Altera) but velocity is an
absolute buzzword and the senior management's approach to complex challenges
is sheer panic. Slips in schedules are not tolerated at all - so problems in
validation are an existential threat, your project can easily just be canned.
Also, because of the size of the company, the ways in which quality and
completeness are 'achieved' are hugely bureaucratic and rarely reflect true
engineering fundamentals. Intel's biggest challenge is simply that it's not
'winning big' at the moment and rather than strong leadership and focus the
company just jumps from fad to fad failing at each (VR is dead, long live
automotive).

~~~
baq
not quite true regarding the focus - you aren't building a $50B+ company when
you don't have focus; it's just that their core competency isn't fashionable
right now. they can't grow in the markets they're in because they either own
them completely or were driven out, so they're trying different ones
(pivoting, if that's applicable to mature companies).

~~~
marcosdumay
They had a presence in the currently fastest-growing market (mobile), but they
decided to sell it a while ago because it wasn't trendy and the margins were
smaller.

Now they are pivoting everywhere, but theirs is the only market with
sufficient margins. And the prospect is that their margins will shrink
because of competition and software emulation (which they are keeping under
control by patent trolling).

~~~
rasz
Intel also liquidated all of its RAM fabs in the eighties, a very costly
process. Less than a year after the last fab closed, RAM prices skyrocketed,
letting the Japanese manufacturers grow strong.

------
bwang29
Quote from article "We need to move faster. Validation at Intel is taking much
longer than it does for our competition. We need to do whatever we can to
reduce those times… we can’t live forever in the shadow of the early 90’s FDIV
bug, we need to move on. Our competition is moving much faster than we are".

Competition pressure could make a company's new product worse than (in this
case, less stable than) their previous products, e.g. the Samsung phone
explosions. I still remember the story being that Samsung wanted to release
their phone ahead of the iPhone, and I would imagine the testing went through
a similarly stressful time as at Intel.

Of course, not all cases of taking such risks lead to disasters - imagine
Intel rushing new chips out ahead of the competition and, 99 times out of 100,
everything performing well. But a unique characteristic of Intel's case is
that these bugs, unlike a faulty battery design, accumulate and carry forward
into future product development, which means a few small wins in catching up
with your competitor could also lead to massive failures in some next major
battle.

Now imagine Intel's competitors are going through the exact same scenario. One
possible outcome is both Intel and its competitors' products become less
stable and more buggy over time, and until everyone's stuff seems to be broken
they probably never have time to fix them.

~~~
dspillett
_> "we can’t live forever in the shadow of the early 90’s FDIV bug"_

There is a valid point there though - if you are testing for testing's sake
and not finding anything extra through the extra effort then you are wasting
time and potentially worse: lulling yourself into a false sense of security.
Testing should be done for utility, not just in response to fear - you need to
test intelligently, not just test a lot. Like TDD in software, good testing
processes make life much easier and quality much higher, bad testing processes
can be worse than useless.

Processor bugs are always a thing and always have been - look at the list of
bugs the Linux kernel scans for and works around, many of which pre-date the
FDIV debacle.

What made FDIV special isn't that it was a bad bug, it was the recent change
in _marketing_. Before then, processors were sold to manufacturers who might
tell the customer what was used; unless you were a hobbyist you didn't much
care about the specifics. But the Pentium line was the first time a processor
had been particularly marketed directly at the end user. It had started with
the 486 lines a couple of years earlier when "Intel Inside" was first a thing,
but there was a huge push in that direction with the release of the first
Pentium lines. Suddenly Joe Public was more aware of that detail, but was
blissfully unaware that CPUs are complex beasts and generally not 100%
perfect.

It didn't help that the bug was very easy to demonstrate in common
applications like Excel, so Joseph & Josephine Public could see and understand
the problem where they wouldn't have with, for example, the F00F bug, and it
was easy to joke about (We are Pentium of Borg. Division is futile. You will be
approximated) which fanned the rapid spread of the news. The fact that the bug
only significantly affected fairly rare combinations was lost in the mass
discussion about how such a bug could happen at all.
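Part of why it spread so fast is that the check fit in one line; a minimal sketch in Python (the operands are the well-known trigger pair; the function name is my own):

```python
def has_fdiv_flaw():
    """Classic FDIV check: a correct FPU leaves essentially no
    remainder, while the flawed Pentium famously returned 256."""
    x, y = 4195835.0, 3145727.0
    return abs(x - (x / y) * y) > 1.0

print(has_fdiv_flaw())  # False on any correct CPU
```

Typing the same division into a spreadsheet was all it took for an end user to "see" the bug, which is exactly why it was so damaging to a newly consumer-facing brand.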

~~~
static_noise
Testing is not for finding bugs, testing is for preventing bugs. But otherwise
you are right: it's hard to calibrate and further develop the testing
procedure if you rarely find bugs. While you might be wide awake with one eye,
you might be blind in the other seven. It's usually the things you don't
expect that kill you. So, some need to be paranoid - be very paranoid.

~~~
dspillett
The current chip bug does look like a doozy...

See
[https://www.theregister.co.uk/2018/01/02/intel_cpu_design_fl...](https://www.theregister.co.uk/2018/01/02/intel_cpu_design_flaw/)
if you've not already picked up on the news.

------
paulmd
Denverton is much more complex than a "simple" Atom (performance of a C3958 is
up to about half of an i5-7500 in single-thread, twice the total multi-thread
performance). Avoton is really no slouch either. It's really not surprising
that the incidence of bugs is increasing on those uarchs as the complexity
grows.

The Skylake/Kaby hyperthread bug has been fixed in microcode and is no longer
applicable. It's perfectly safe to run HT on these processors now.

The AMD Ryzen segfault remains unmitigated at this point in time. Phoronix
rushed to declare the bug fixed because they got a binned RMA replacement but
there are plenty of reports of it occurring in current-production processors
to at least a moderate degree, roughly proportionate with ASIC/litho quality.
It's unclear what the scope is w/r/t Epyc since Epyc is on a different
stepping but also hasn't really ramped yet either. The early Epyc processors
were essentially engineering samples (on the order of hundreds to single-digit
thousands of samples) with no real (public) visibility into any binning that
might be taking place.

The Ryzen high-address bug is no big deal, that's the kind of thing that gets
patched all the time (like the Skylake HT bug). That's one thing Dan is
glossing over here - there are tons of these bugs all the time and as long as
there is an effective mitigation available it's no big deal.

The PTI patch can be viewed as making syscalls take somewhat longer (about
double, IIRC). Gaming and compute-oriented workloads will hardly be hurt at
all. The average mixed-workload case sees a 5% performance loss, which is not
ideal but not critical either. Losing 30% is real bad though, and that's what
you will get on IO-heavy workloads that context-switch into the kernel a lot.
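The syscall-cost claim is easy to sanity-check yourself; a rough Python microbenchmark (assuming a Linux box with a modern glibc, where each os.getpid() call actually enters the kernel; absolute numbers vary wildly by machine and kernel config):

```python
import os
import time

def avg_syscall_ns(n=100_000):
    """Average the cost of n cheap syscalls; comparing the result
    with PTI enabled vs. disabled exposes the added kernel-entry
    overhead."""
    start = time.perf_counter_ns()
    for _ in range(n):
        os.getpid()
    return (time.perf_counter_ns() - start) / n

print(f"~{avg_syscall_ns():.0f} ns per syscall")
```

An IO-heavy server doing hundreds of thousands of such kernel entries per second feels the doubling directly, which is where the 30% figure comes from.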

The only real mitigation there appears to be right now is to give up
hyperconvergence and harden up those DB/NAS servers that are going to
be pushing a lot of IO so that you know there won't be hostile code running on
them. That will allow you to safely disable PTI and sidestep the performance
hit.

Of course, Epyc was not that good at running databases in the first place, so
you still might be better off sucking it up and running Intel even with the
PTI patch. It will probably depend on your actual workload and the relative
amount of IO vs processing.

~~~
redcalx
> The Skylake/Kaby hyperthread bug has been fixed in microcode and is no
> longer applicable. It's perfectly safe to run HT on these processors now.

Only if you can actually get the fix. My main home PC has this bug and the
motherboard manufacturer (ASUS) has yet to ship a BIOS update with the fix.

~~~
paulmd
Your OS can deliver a microcode fixup that's installed at OS startup; Windows
should do it automatically, and on Linux you just need to install the
intel-microcode package.
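To verify the fix actually landed, you can check the microcode revision the kernel reports; a small Python sketch (parsing a sample /proc/cpuinfo-style string so it runs anywhere; the revision value shown is illustrative):

```python
def microcode_revisions(cpuinfo_text):
    """Pull the reported microcode revision for each logical CPU
    out of /proc/cpuinfo-style text."""
    return [line.split(":", 1)[1].strip()
            for line in cpuinfo_text.splitlines()
            if line.startswith("microcode")]

sample = "processor\t: 0\nmicrocode\t: 0xba\nprocessor\t: 1\nmicrocode\t: 0xba\n"
print(microcode_revisions(sample))  # ['0xba', '0xba']

# On a real Linux system:
# with open("/proc/cpuinfo") as f:
#     print(microcode_revisions(f.read()))
```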

------
mehrdadn
Out of curiosity, if you notice a CPU bug in a computer under warranty, is
there anything the vendor is usually obligated to do, or are they under no
obligation to do anything about a CPU bug? Is that considered a defect they
have to handle?

(Edit: I'm assuming the USA, and I'm assuming bugs that were not known to the
vendor at the time of the sale.)

~~~
kuschku
Under EU law (this is not legal advice), any bug that wasn't known to you at
the time of sale, but was present at the time of sale, has to be fixed by the
vendor. Usually for 2 years after sale; some countries have longer (AFAIK
Norway, for example, has it set at 5 years, and as an EEA member they also
follow the EU laws on warranty).

~~~
TrickyRick
Good luck explaining to a random vendor what the problem is though. They also
need to verify that the problem exists and most of them won't have a clue what
you're talking about, they'll just turn the computer on and notice that it's
working.

If you're a company buying from more qualified vendors then it might be a
different story, however at that point consumer law does not apply to you.

~~~
soziawa
Well, depending on how this turns out, there will be benchmarks that show the
chips performing slower than before. I've read that many chips will get 17%
slower in the best case and up to 30% slower in the worst case. That could be
enough to warrant a return of the product.

~~~
TrickyRick
Yes, however most computers are marketed using GHz numbers, not experienced
performance (and in the cases where they are, it's relative, like x times
faster than before). It's hard to show a 17% slowdown to a store clerk.

Don't get me wrong, I'm all for being able to return a product that's 17%
slower today than yesterday, I'm just saying it might be difficult since it's
an issue that requires a degree of technical knowledge to understand which
most store clerks don't have.

~~~
maksimum
I agree with most of what you say, but why do you take for granted that it's
up to the customer to demonstrate that the product is defective? Furthermore,
where does the idea that the customer must be able to communicate this to an
_uninformed_ clerk come from?

------
chrisper
So I recently bought an 8700k. I was wondering if I should return it and get
AMD instead? I'm not sure how much the recent bugs will impact me
performance-wise.

~~~
kuschku
It looks like you'll see a significant slowdown: PostgreSQL with pgbench sees
a 7% to 16% performance loss.

[https://www.postgresql.org/message-
id/20180102222354.qikjmf7...](https://www.postgresql.org/message-
id/20180102222354.qikjmf7dvnjgbkxe@alap3.anarazel.de)

Of course, this depends on workload — gaming will see different results than
computationally heavy tasks.

It is likely that games using Vulkan, DX12 or OpenGL's AZDO functions will see
a much lower performance impact (because they usually only do a handful of
syscalls per frame) than games using older APIs, or even OpenGL's immediate
mode (which, in the worst case, does one syscall per emitted vertex).

~~~
chrisper
I am also quite annoyed because I paid premium (Intel) prices with the
expectation of getting premium speed. Now, if I can get the same performance
for cheaper with AMD, maybe I should just return the whole thing.

~~~
philjohn
Which chip to get depends on your usecase.

If you absolutely need the best single core performance you can, Intel is the
way forward.

If multicore performance is important (lots of multitasking, lots of heavy
processes running) then one of the 8 core Ryzen 7's will be better, for
cheaper.

~~~
chrisper
Thanks. I guess I'll stick to my 8700k then.

~~~
otakucode
Just curious, what do you do that requires a fast single core? I always find
it strange when people value that, as most computers nowadays generally run
more than 1 process at a time.

~~~
chrisper
A lot of games only run single core. Though finally, more modern games do take
advantage of more cores!

------
pier25
It annoys me when online content (or an edit) doesn't have an explicitly
stated date.

------
lukax
Is this article from 2015 or 2017?

It would be great if the page displayed the date that the article was
posted/updated. It is not in the URL nor the sources. The only way to see the
dates is in the RSS feed and even that is only for new articles.

~~~
adtac
Jan 2016, according to [https://danluu.com/](https://danluu.com/) and ctrl+f

~~~
kasabali
Last update is aug 2017 or later

~~~
collinmanderson
Last update is _today_. It links to a comment on this thread. I think (2016)
should be removed.

------
mtgx
_As someone who worked in an Intel Validation group for SOCs until mid-2014 or
so I can tell you, yes, you will see more CPU bugs from Intel than you have in
the past from the post-FDIV-bug era until recently.

Why?

Let me set the scene: It’s late in 2013. Intel is frantic about losing the
mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho
in charge of Validation says something to the effect of: “We need to move
faster. Validation at Intel is taking much longer than it does for our
competition. We need to do whatever we can to reduce those times… we can’t
live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our
competition is moving much faster than we are” - I’m paraphrasing.

Many of the engineers in the room could remember the FDIV bug and the ensuing
problems caused for Intel 20 years prior. Many of us were aghast that someone
highly placed would suggest we needed to cut corners in validation - that
wasn’t explicitly said, of course, but that was the implicit message. That
meeting there in late 2013 signaled a sea change at Intel to many of us who
were there. And it didn’t seem like it was going to be a good kind of sea
change. Some of us chose to get out while the getting was good. As someone who
worked in an Intel Validation group for SOCs until mid-2014 or so I can tell
you, yes, you will see more CPU bugs from Intel than you have in the past from
the post-FDIV-bug era until recently._

So this is why Krzanich sold his stock. He knows the bug is his fault. Whoops.
I think someone may "quit for personal reasons" soon.

[https://www.fool.com/investing/2017/12/19/intels-ceo-just-
so...](https://www.fool.com/investing/2017/12/19/intels-ceo-just-sold-a-lot-
of-stock.aspx)

------
fpoling
I can only expect that the future will be worse. It may be that VM providers
will find it unprofitable to offer a VM capable of running generic native
code. Another thing is that security products for desktops, like Qubes OS,
that rely on hardware isolation to run untrusted code may need to reconsider
their business model.

~~~
walterbell
What’s the connection between the performance hit of this bug and Qubes’
business model?

~~~
fpoling
This bug can be worked around. But the next one may not, making hardware-based
virtualization as a secure way to run unprivileged code with max native
performance unworkable. I.e. longer term if one wants to run untrusted code,
it cannot be native one so any bug can be fixed without replacing hardware.

~~~
walterbell
_> if one wants to run untrusted code, it cannot be native one so any bug can
be fixed without replacing hardware_

It's a bit hard to parse that sentence, could you rephrase?

Are you saying that untrusted code should only be run on systems which do not
use hardware virtualization, because there's a risk of hardware bugs that
require hardware replacement? The problem is that there is no single-system
equivalent, users would have to use multiple laptops/desktops and air gaps to
achieve separation (e.g. between network drivers and userspace apps). May not
be practical.

Yes there's a risk of a catastrophic hardware bug with no workaround, but that
risk applies to every feature in the CPU, not only virtualization or page
tables or speculative execution. Statistically it's only happened once with
the single Intel CPU recall, which are better odds than other risks.

~~~
fpoling
My point is that to run untrusted code it should be delivered in some form of
bytecode, not the native code for CPU. This way one can always workaround CPU
issues by changing the compiler or the interpreter even for catastrophic bugs
in any part of CPU. Moreover, as hardware VM can execute much more
instructions than unprivileged user processes, the probability that something
unfixable will happen to them is higher then for ordinary processes.
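A toy sketch of that patch point (hypothetical opcodes; the "patched" branch just stands in for whatever software routine would avoid a buggy native instruction):

```python
def run(program, patched=False):
    """Minimal stack-machine interpreter: because untrusted code is
    bytecode, a hardware divide bug can be routed around by changing
    this single dispatch point instead of replacing the CPU."""
    stack = []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
        elif op == "DIV":
            b, a = stack.pop(), stack.pop()
            if patched:
                # Stand-in workaround: reciprocal-multiply instead of
                # a direct divide (real FDIV fixes rescaled operands
                # in software before dividing).
                stack.append(a * (1.0 / b))
            else:
                stack.append(a / b)
    return stack.pop()

prog = [("PUSH", 10.0), ("PUSH", 4.0), ("DIV", None)]
print(run(prog), run(prog, patched=True))  # 2.5 2.5
```

Native code offers no such seam: once a buggy instruction is baked into shipped binaries, only a microcode update or hardware replacement can help.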

As for statistics, there are strong indications that modern efforts at CPU
verification are not keeping up with increasing CPU complexity. So the number
and severity of bugs will grow.

~~~
walterbell
Which bytecode do you recommend? Are you assuming the bytecode interpreter to
be bug-free? There have been JVM escapes.

~~~
fpoling
I do not assume that the bytecode interpreter or compiler is bug-free. But I
assume that the interpreter can be trivially updated, while a bad CPU bug may
require hardware replacement or taking a terrible performance hit.

As for a particular bytecode format, I have no idea. WebAssembly is a
possibility, but it is still slower by a factor of 2 compared with native
code. Perhaps a CPU-specific symbolic assembly would be a better choice, as
long as one can reliably alter it to work around CPU bugs.

------
johnflan
I have seen a lot of talk of AMD benefitting from this but what about ARM -
how are their server offerings shaping up?

~~~
sitharus
ARM is interesting because of the customisability of ARM chips. Some
companies, such as Apple, license the ISA and spin their own silicon. Others
license the whole CPU design from ARM holdings, and ARM produce several lines
of designs with different CPU capabilities.

Edit: Looks like ARM64 was affected, but it has an architectural feature that
makes the mitigation much easier: [http://lists.infradead.org/pipermail/linux-
arm-kernel/2017-N...](http://lists.infradead.org/pipermail/linux-arm-
kernel/2017-November/542751.html)

------
drej
Can we put [2016] in the title? Thanks!

~~~
opencl
At least part of the article (the 'More updates' section) had to have been
written in 2017 because it references Ryzen chips which were released in March
2017. It's not entirely clear when the original article was posted but it
seems like it could have been either late 2015 or early 2016.

~~~
collinmanderson
Last update is _today_. It links to a comment on this thread.

------
hungerstrike
All the CPUs in my house are 4th and 5th generation Intel CPUs except for one
PC laptop that has a Skylake processor.

I guess I'm glad now that Apple put a 2 year old CPU in the early 2015 Macbook
Pro! Besides my 2012 Mac Pro, that is the most expensive machine in the house!

~~~
pixl97
The latest big bug evidently affects all Intel x86-64 processors back to 2007
or so.

~~~
earenndil
I thought it was 1995?

------
scribu
I wonder if this is related to the recently discovered design flaw in Intel
CPUs:

[https://news.ycombinator.com/item?id=16055395](https://news.ycombinator.com/item?id=16055395)

------
juanmirocks
Considering the recent Intel problems, Apple is going to be even more tempted
to design its own CPUs/GPUs for the Mac.

What do you think, is this realistic?

~~~
maksimum
I'm not an insider, but it seems like Apple's motto is "long-term profit above
almost anything else." As a result they certainly are considering designing
their own x86-64 or ARM CPUs in hopes of reduced costs down the line from
vertically integrating their business. What may stop them is the fact that
their PC sales don't enjoy nearly as much volume as their mobile devices.

------
prewett
Before we all jump on Intel for being buggy, what's the list of serious AMD
bugs like for the past five years? If AMD has a similar number of bugs, we
should jump on them, too. If not, then we will actually know that Intel
deserves being jumped on to the exclusion of AMD.

