
I found a bug in Intel Skylake processors - testcross
http://gallium.inria.fr/blog/intel-skylake-bug/
======
piemonkey
One amusing thing about this epic tale is that Serious Industrial OCaml User
disregarded direct, relatively easy to implement, very sound advice from
_Xavier Leroy_ about how to debug their system! I would like to think that,
were I in a similar situation being advised by an expert of that calibre, I
would at least humor his suggestions.

Why seek the expert if not for his advice? It brings to mind people
disregarding doctors who give them inconvenient medical advice.

~~~
justin66
I know nothing of OCaml culture or why the author is deemed worthy of having
his name italicized, but the doctor comparison is upsetting. If the guy who
wrote this:

_I was tired of this problem, didn't know how to report those things (Intel
doesn't have a public issue tracker like the rest of us), and suspected it was
a problem with the specific machines at SIOU (e.g. a batch of flaky chips that
got put in the wrong speed bin by accident)._

were a doctor, he'd be guilty of malpractice. This bug went unreported _eight
months_ longer than it needed to. Am I misreading all this somehow?

~~~
piemonkey
His name is italicized because he is the primary author of OCaml (and a
plethora of other great tools, like CompCert, the first fully verified C
compiler). Overall, an exceedingly competent and productive programmer and
scientist.

The doctor metaphor isn't perfect; what I was going for is: if you seek out an
expert's advice and then ignore it, why go to the expert in the first place?

~~~
justin66
Thanks, that makes sense. I don't disagree about taking expert advice! I was
just really disappointed to read the part of the article where the author
doesn't bother to figure out how to navigate Intel support and report the bug.

~~~
jacquesm
Bug reports are gifts from inconvenienced users to vendors: the user has just
done free work on the vendor's behalf. There is absolutely no obligation on the part
of a user to report bugs.

~~~
justin66
I am shocked by this attitude. Maybe I shouldn't be, and I'll certainly give
this all some thought.

There's at least one obvious error in your statement: if an inconvenienced
user's bug report results in less downtime for other users, it is a "gift" to
other users, as well as a "gift" to the vendor.

But it says something about our profession if we regard putting flags down to
mark the landmines we find as a mere courtesy (a gift!) instead of an obligation.
I guess that's a debate for a different time and place.

~~~
jacquesm
Well, guess what: you're as entitled to your opinion as I am to mine.

But just as a user has no obligation to 'mark the landmines' for vendors, they
also have no such obligation towards other users. They _do_ have a right to
receive bug-free software in the first place; alas, our industry is utterly
incapable of delivering that, which has lowered our expectations to the point
where you feel that we have an actual obligation as users to become part of
the debugging process.

That is not going to make our lives better.

What _will_ make our lives better is if software producers accept liability
for the crap they put out and are unable to opt out of such liability through
their software licenses and other legal trickery.

You're just a small step away from making it an obligation rather than an
optional thing for users to report bugs, the only difference is that for you
the obligation is a moral one rather than a legal one. I really do not
subscribe to that: when I pay for something I expect it to work, and I expect
the vendor (and definitely not the other users) to work as hard as they can to
find and fix bugs before the users do.

But we're 'moving fast and breaking shit' in the name of progress, and part of
that appears to extend to being in perpetual beta-test mode. That's not how
software should be built, and I refuse to subscribe to this new world order
where the end user is also the guinea pig.

Keep in mind that users have their own work to do, are not on the payroll of
the vendors, have usually forked over cold hard cash in order to be able to use
the code (ok, not in the case of open source), and tend to be less
knowledgeable about this stuff than the vendors. They really should not have a
role in this other than that they may - at their option - upgrade their
software from time to time when told very explicitly what the changes are (and
hopefully without pulling in a boatload of things that are good for the vendor
but not for them).

~~~
justin66
> You're just a small step away from making it an obligation rather than an
> optional thing for users to report bugs, the only difference is that for you
> the obligation is a moral one rather than a legal one.

I'd argue that a person does have that obligation in some circumstances, yes.
And yes, I am thinking in moral rather than legal terms. The legal picture is
pretty far outside my expertise, and the professional ethics of software
engineering (which would in turn inform the legal picture) seems to be
woefully opt-in. As you say, 'moving fast and breaking shit,' perpetual beta
test mode, etc. So I'd put the legal stuff aside for now.

For me, the key is that "user" is a deceptive term here. A mere user cannot
point to a small piece inside a much larger machine and say "that will blow up
occasionally, and I know exactly when." We are talking about engineers. Or at
least, I was thinking of the professional obligations of engineers - on the
user side of the fence and the vendor side of the fence - and that was
informing my comments.

> Keep in mind that users have their own work to do, are not on the payroll of
> the vendors, have usually forked over cold hard cash in order to be able to
> use the code (ok, not in the case of open source), and tend to be less
> knowledgeable about this stuff than the vendors.

Yeah, and I don't think I disagree with you in the "user" case. I really think
a software engineer finding a CPU bug is a different case. It seems to me that
if we're in possession of knowledge of something as serious and wide-reaching
as a CPU bug, we have a reproducible test case, and we don't do anything with
it (I mean, at least a tweet or something, for the love of God), we are part of
the problem with our profession.

~~~
jacquesm
I'm on board with a reporting duty _if_ such a thing will always result in:

(1) a payment from the vendor to the reporter compensating them for the time
and effort spent getting the bug reproduced

and, crucially,

(2) a requirement for _all_ vendors of software and hardware to respond to bug
reports in a timely manner and to have a standardized reporting process.

In that case I can see how such a shared responsibility would work, but as it
is the companies get the benefits and the users get the hardship, with a good
portion of reported bugs (sometimes including a solution) going unfixed;
that's not a fair situation.

Case in point: I've reported quite a few bugs to vendors over the years, but
I've stopped doing it because in general vendors simply don't care; most of
the time bug reports seem to result in a 'won't fix' or 'here is a paid upgrade
for you with your fix in it'.

~~~
justin66
The security guys seem to be converging on a way of managing these - the
compensation of the person reporting the bug and the factors motivating
vendors to respond to the bug report in a timely fashion or suffer
consequences - with bug bounties. Intel should make it easy to report this
stuff, but if everyone understood that finding something genuinely interesting
resulted in a serious payday, nobody would skip making the call.

The difference between something like Google's bug bounty (capped at over
$30k, I think) and a hypothetical bounty for Intel is, well, Intel has a lot
more at stake. It's honestly strange that they don't have something in place
already. Something like Skylake costs on the order of _billions_ to get out
there. It's cool that this Skylake bug was fixable via microcode, but the
Pentium FPU bug back in the day cost them _half a billion dollars._ If such a
bug exists, that is the kind of thing Intel should want to have reported as
soon as humanly possible. Even the reputational damage they take from
something milder like the Skylake bug would justify a bounty system with very
serious payouts.

------
willvarfar
A comp.arch poster said:

> The errata refers to the problem showing up on short loops of less than 64
> instructions that use AH, BH, CH or DH.

> Looking at the Skylake microarch, the instruction decode queue is 128
> uOps/thread, 2*64 uOps when threaded. The Loop Stream Detector "can stream the
> same sequence of µOPs directly from the IDQ continuously without any
> additional fetching, decoding, or utilizing additional caches or resources."
> ... "capable of detecting loops up to 64 µOPs per thread".
> [https://en.wikichip.org/wiki/intel/microarchitectures/skylak...](https://en.wikichip.org/wiki/intel/microarchitectures/skylake#.C2.B5OP-Fusion_.26_LSD)

> So maybe the microcode update just shuts off the loopback detector.

[https://groups.google.com/d/msg/comp.arch/UkO4Z2FT18c/7YlC0a...](https://groups.google.com/d/msg/comp.arch/UkO4Z2FT18c/7YlC0aH7AQAJ)

So if the bug is in the loop-detector, and the patch possibly disables it
rather than fixes it, then does anyone have any before-and-after performance
stats?

~~~
Tuna-Fish
IIRC the Skylake loop buffer is not any faster than the uop cache; instead, the
reason for its existence is to save power by not touching the cache. So you'd
have to test power consumption instead?

------
ihnorton
> I worked from the executable provided by SIOU, first interactively under GDB
> (but it nearly drove me crazy, as I had to wait sometimes one hour to
> trigger the crash again), then using a little OCaml script that ran the
> program 1000 times and saved the core dumps produced at every crash.
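
The brute-force loop described above, as a rough sketch (in C here; the
author's actual script was OCaml, and the binary name and core file location
below are made up):

    /* Run the crashing program many times and keep the core dump from each
       failing run. Assumes the kernel's core_pattern leaves a file named
       "core" in the current directory. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        for (int i = 1; i <= 1000; i++) {
            int status = system("./target_program");  /* hypothetical binary */
            if (status != 0) {                         /* crashed or failed */
                char name[64];
                snprintf(name, sizeof name, "core.%d", i);
                if (rename("core", name) == 0)
                    printf("run %d crashed, saved %s\n", i, name);
            }
        }
        return 0;
    }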

rr can often be a time-saver in situations like this by providing deterministic
replay up to the point of a crash, whereas core dump analysis gives a single
retrospective snapshot.

[http://rr-project.org/](http://rr-project.org/)

~~~
tdullien
As much as I love RR, I am not sure it would've helped here, as the bug
requires multiple threads to concurrently run? Also, RR is based on achieving
deterministic replay IIRC, so I am not sure it'd be the first choice for a
nondeterministic hardware bug?

~~~
KenoFischer
You can do multiple concurrent rr recordings. A non-deterministic CPU bug
would generally cause a divergence between recording and replay, so rr would be
a decent way to go about this. The way I'd probably have used rr when faced
with this is to bisect the recording to find which code is responsible.

------
timeu
Related to this:
[https://tech.ahrefs.com/skylake-bug-a-detective-story-ab1ad2...](https://tech.ahrefs.com/skylake-bug-a-detective-story-ab1ad2beddcd)

Also a pretty good read and recently discussed here:
[https://news.ycombinator.com/item?id=14661473](https://news.ycombinator.com/item?id=14661473)

------
hacktothefuture
I've worked so far away from the metal for such a long time, but I still find
these types of articles interesting even though I only understand a small
fraction of the info. It's amazing to think about the levels of abstraction
layered on top of code at this level that make my work possible.

------
pcarolan
How does this bug affect current hardware in production? Is it worth waiting
for the fix before buying the new MBPs, for example?

~~~
striking
EDIT: my comment was wrong. Thanks. There is a microcode update. Install it
and you'll be fine. On macOS it should be installed automatically, according
to [https://support.apple.com/en-in/HT201518](https://support.apple.com/en-in/HT201518)
(although that page doesn't say whether it's available or not)

~~~
j_jochem
The article states that Kaby Lake is affected, too. Except maybe not the Kaby
Lake versions used in MBPs?

------
raverbashing
GCC generates code that's smaller but it _isn't_ optimal, because of the
potential for partial register stalls (and just overall register renaming
issues)

It's of course not wrong, but using AH when you're dealing with RAX is a weird
anachronism

Clang does the obvious, correct thing.
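
For concreteness, a made-up example of the kind of operation where the choice
comes up: clearing a couple of bits that happen to live in bits 8..15 of a
64-bit value while preserving everything else. A compiler may encode this as a
byte AND on %ah (a partial-register write), or as an AND on the full register
with a sign-extended 32-bit immediate; which one you get depends on the
compiler and its version.

    #include <stdint.h>

    /* Clear bits 10 and 11 of a 64-bit word, keeping all other bits. Possible
       encodings include `andb $0xf3, %ah` (touches only AH) or
       `andq $0xfffffffffffff3ff, %rax` via a sign-extended 32-bit immediate
       (operates on the whole register). */
    uint64_t clear_flag_bits(uint64_t hdr) {
        return hdr & ~(uint64_t)0x0C00;
    }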

~~~
userbinator
"smaller but not optimal" is not really true. It depends on what you're
optimising for. In my experience, optimising for size overall, and then speed
in the really performance-critical parts (with some expected expansion), gives
the best results. Even the non-performance-critical code will have a
noticeable effect if its larger size causes more cache misses.

Making effective use of the "partial registers" (I see them more as separate
smaller registers that can be grouped together) can avoid a lot of extra
instructions.
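
A made-up example of that grouping: treating one 16-bit value as two byte-sized
fields. In favourable cases a compiler can address the low field as AL and the
high field as AH directly instead of shifting and masking, though what you
actually get depends on the compiler and the surrounding code.

    #include <stdint.h>

    /* Two byte-sized "fields" packed into one 16-bit value; the high field is
       what AH would hold if the value lives in AX. */
    static inline uint8_t  get_low(uint16_t v)  { return (uint8_t)(v & 0xFF); }
    static inline uint8_t  get_high(uint16_t v) { return (uint8_t)(v >> 8); }
    static inline uint16_t pack(uint8_t lo, uint8_t hi) {
        return (uint16_t)((uint16_t)lo | ((uint16_t)hi << 8));
    }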

~~~
raverbashing
I don't disagree with you, but then use EAX, not AH (which would also produce
smaller code as you don't need to use a 64-bit constant)

Optimizing for size is good, but what GCC did there made sense in the 32-bit
days, not so much today

Some code snippets use AH/AL as 2 separate registers, hence the processor might
rename them to different internal registers. But then, when EAX is read, the
processor needs to merge them and update EAX accordingly.

~~~
qb45
TFA says clang actually uses an encoding of AND which operates on RAX but
takes only 32 bits of constant, so it is pretty much the solution you propose. And
BTW, operating directly on EAX itself fucks up the upper half of RAX.

~~~
yuhong
AFAIK it is standard for all 64-bit instructions that use immediates.

~~~
qb45
Interesting, it seems you are right and no 64b immediates are possible at all
except in MOV.
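
A small illustration of that limit (the masks here are arbitrary): the first
one fits the sign-extended 32-bit immediate form of AND, the second does not,
so a compiler typically has to load it into a scratch register with a 64-bit
MOV (movabs) first and then AND register-to-register.

    #include <stdint.h>

    /* 0xFFFFFFFFFFFFF3FF is the sign extension of 0xFFFFF3FF, so it can be an
       AND immediate; 0x00FF00FF00FF00FF cannot be expressed that way. */
    uint64_t and_fits_imm32(uint64_t x)   { return x & 0xFFFFFFFFFFFFF3FFull; }
    uint64_t and_needs_movabs(uint64_t x) { return x & 0x00FF00FF00FF00FFull; }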

------
libeclipse
Reminds me of this:
[https://news.ycombinator.com/item?id=14279124](https://news.ycombinator.com/item?id=14279124)

~~~
Sean1708
I'm very upset that we didn't get the story behind #10.

------
hsnewman
Can this be exploited for malicious code?

~~~
userbinator
It's "unpredictable" what happens, so I think the best you're going to do is a
DoS. I.e. if you could get the JS JIT in a browser to generate code like this
and execute it repeatedly, you could crash a machine just by visiting a site.

~~~
qb45
It also is "unpredictable" what happens when you overrun a stack frame.

There is zero guarantee that the infamous Sufficiently Sophisticated Attacker
couldn't predict it. Hardware is largely deterministic, even when it doesn't
behave in the documented way. I wouldn't interpret this as _literally_
unpredictable; it's just a generic slogan they _always_ use in their errata.
And they aren't going to say anything more for obvious reasons.

Patch this damn microcode.

------
dingo_bat
Why is it so exciting when Intel has a bug? I had fun reading :)

------
SiempreZeus
I love reading about hardware bugs, and people's perseverance!

Reminded me of a developer for Crash Bandicoot who had seemingly random
crashes:
[http://www.gamasutra.com/blogs/DaveBaggett/20131031/203788/M...](http://www.gamasutra.com/blogs/DaveBaggett/20131031/203788/My_Hardest_Bug_Ever.php)

------
balls187
> That would not be the first time that GCC treats undefined behaviors in the
> least possibly helpful way,

Oh compilers. Like VC++6.0 initializing uninitialized memory to 0xCDCDCDCD in
DEBUG.

~~~
tvgggghh
Err, that's on purpose? It's so you can tell your writes apart, which is
helpful while debugging.

They used to use, uh, _more obvious_ patterns but the PC brigade called them
on it so they settled on 0xcd.
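
One way that fill value gets used in practice (a minimal sketch, assuming
MSVC's debug heap, which fills fresh allocations with 0xCD): if a region still
reads back as all 0xCD bytes, it very likely was never written after being
allocated.

    #include <stddef.h>

    /* Returns nonzero if every byte in the region still holds the debug-heap
       fill value 0xCD, i.e. the region looks like it was never initialized. */
    static int looks_uninitialized(const void *p, size_t len) {
        const unsigned char *b = (const unsigned char *)p;
        for (size_t i = 0; i < len; i++)
            if (b[i] != 0xCD)
                return 0;
        return len > 0;
    }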

~~~
qb45
Um, what was it? You can paste in decimal to avoid triggering the PC brigade
;)

~~~
detaro
Wikipedia has a list that includes some used values:
[https://en.wikipedia.org/wiki/Hexspeak](https://en.wikipedia.org/wiki/Hexspeak)

------
justin66
> SIOU's application was single-threaded and made no network I/O, only file
> I/O, so its execution should have been perfectly deterministic

Really?

------
gfiorav
Amazing report

------
newusertoday
I routinely see these bugs when new hardware is still being developed.

~~~
civilitty
They're completely unavoidable short of having all of the world's computing
power with which to do formal verification (and even then, there's no
guarantee).

I've even seen, when developing standard cell libraries for a new fabrication
process, bugs caused by unforeseen interactions between different
semiconductor doping concentrations when (due to pure statistics in
fabrication) they overlap in the wrong way.

------
fnsa
Mr pthread!

------
agumonkey
Physicists say that if you sum Fabrice Bellard and Xavier Leroy, the universe
enters an undefined-behavior void.

------
etatoby
I will surely be downvoted for this, but I would like to remind everyone that
this bug is just one of the many consequences of Microsoft's evil policy of
encouraging the sale and distribution of proprietary software in executable
form.

There is no other reason why a 64-bit multi-core CPU developed in 2015, which
makes heavy use of pipelining and other advanced and complicated code
execution strategies, would need to support instructions that address the
second-lowest byte of a register (e.g. %ah) while keeping the rest of the
register 'unchanged', which of course means making a complete mess of the code
execution path.

The only reason this crap still exists is to keep Windows users' ability to
run random EXE and DLL files from the 90s, if not random COM files from the
80s, at the expense of CPU cost, stability, and correctness for everyone else
(such as the OCaml developers and users who ran into this bug).

~~~
acdha
Did you miss the part where the bug was found using a current version of GCC
to build a current version of OCaml?

It's lazy to the point of dishonesty to act as if Microsoft is the only one
with decades of accumulated code.

~~~
jstimpfle
There is still a point here: proprietary binaries are probably the biggest
force keeping decades of cruft in processors. (Not judging.)

~~~
striking
Those "decades of cruft" are usually emulated in microcode and not actual
silicon. Intel knows very few people use the BCD instructions, but it costs
them almost nothing to keep them in while running them slightly slower than
most operations.

Why mess up a stable API when you don't have to?

