
Skylake bug: a detective story - AndreyKarpov
https://medium.com/ahrefs/skylake-bug-a-detective-story-ab1ad2beddcd
======
sohkamyung
Hardware bugs can be notoriously hard to detect.

One time, during my company's development of a product, we kept encountering a
problem while writing a lot of data to an eMMC chip. Was it a problem with the
circuit design? Our eMMC controller? Software driver?

It wasn't until we found a unit that did not exhibit the problem that we
realised the problem was isolated to a particular batch of eMMC chips
from the vendor (like the OCaml team only realising it was a Skylake problem
when they discovered their code worked fine on non-Skylake systems).

~~~
laydn
I've been through so many of those. Here's one:

On a board, we were using an ADC chip. We bought the ICs from an official
distributor. They sent us the chips, and we populated the boards (~200 of
them).

None of the boards worked properly. We debugged for so long. After so much
wasted time, we found the root cause with the help of the manufacturer and
the distributor: it turns out the manufacturer had actually produced another
IC but marked the packages with our part number. It just happened that the
power/gnd pins matched our IC, so there were no electrical problems.

Another case:

We designed an ASIC and taped-out. The silicon arrived and we then sent it to
a packaging house to get the silicon in a QFP package. The packaging house
misread our specifications and placed the silicon with 90 degrees rotation in
the package. That was fun to debug!

~~~
dboreham
The 90 degree rotation thing is surprisingly common. I've seen it done at
least twice. Also several instances of edge connectors that were entirely
backwards due to being CAD placed on the wrong side of the board, or
incorrectly pin numbered.

------
Outrageous
When I was a newbie, I would often say things like "maybe the compiler is
broken" or "maybe the cpu is doing it wrong."

Of course, it was never the tools. It was always me.

Years later, of course, I work on unreleased hardware. And yeah, sometimes
it's broken. It's really fucking odd. Write C code, and holy shit, it's
actually a compiler bug or CPU bug.

Life was simpler when I was just ignorant and egotistical. Now I'm... well,
still ignorant and egotistical, I guess, but often deal with actual hardware
bugs.

~~~
tomsmeding
Finding an assembler bug is also interesting. I ran up against a nasm bug
once that kept me debugging for quite a while. The good thing with assembler
bugs, however, is that you can disassemble the generated machine code to
something that is really close to your original source code... That helped a
lot.

------
jorisgio
[http://gallium.inria.fr/blog/intel-skylake-bug/](http://gallium.inria.fr/blog/intel-skylake-bug/)
Another great blog post on the topic.

------
wfunction
Fantastic post. Just one comment:

> This is a fair amount of optimisation passes and it would have been too time
> consuming to try them one by one

I loved the fact that they used bisection, but then I was shocked to read
this. This isn't the case at all unless you're extremely unlucky. If you just
do bisection again, you can test with half the flags on and off, keeping the
other half off. You can frequently discard half the flags this way, to narrow
down the set, and on average it should be much faster than testing each one by
one.
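
A minimal sketch of that second bisection, under two assumptions that the real bug did not obviously satisfy: exactly one flag is responsible, and the failure reproduces deterministically. The function name and the `fails` callback are hypothetical; `fails` stands for "build with only these flags enabled, run the test, report whether it crashed".

```python
from typing import Callable, List, Sequence


def find_culprit(flags: Sequence[str],
                 fails: Callable[[List[str]], bool]) -> str:
    """Return the one flag whose presence makes the test fail.

    Assumes a single culprit and a deterministic `fails` callback.
    """
    if not flags:
        raise ValueError("need at least one flag")
    candidates = list(flags)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Enable only the first half of the remaining candidates;
        # every other flag stays off.
        if fails(half):
            candidates = half                      # culprit is in this half
        else:
            candidates = candidates[len(half):]    # culprit is in the other half
    return candidates[0]
```

With N flags this needs about log2(N) builds instead of N. If the failure is intermittent, each `fails` call would have to repeat the test enough times to be trusted, and if two flags must be combined to trigger the bug, this simple halving can discard the wrong half.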

~~~
pklausler
A useful technique in the design of an optimizing compiler is to include an
"odometer test" on all endomorphic discretionary transformations. E.g., at a
spot where one has a big go/no-go predicate that determines whether a change
to the program is valid and beneficial, add a final term that's a call to a
routine that returns true if odometer++ < limit. You can then use bisection on
that limit and zero in quickly on the failing change.

This function is also a great place to put logging. I typically make it look
like a printf() that returns a Boolean.
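
A toy sketch of the odometer idea, not taken from any real compiler: the `Odometer` class, the `optimize` function and the planted bug are all made up for illustration. Each discretionary rewrite asks the odometer for permission as the last term of its go/no-go test, and bisecting on the limit finds the first rewrite that breaks the output.

```python
class Odometer:
    """Counts discretionary rewrites and vetoes every one at or past `limit`."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self.log = []          # doubles as the logging hook

    def allow(self, description):
        ok = self.count < self.limit
        self.log.append("#%d %s: %s" % (self.count,
                                        "applied" if ok else "skipped",
                                        description))
        self.count += 1
        return ok


def optimize(values, odo):
    """Toy optimizer: rewrites even elements, guarded by the odometer."""
    out = list(values)
    for i, v in enumerate(out):
        profitable = v % 2 == 0                  # stand-in for the real predicate
        if profitable and odo.allow("rewrite element %d" % i):
            # Planted bug: the rewrite is wrong when v == 10.
            out[i] = v + 1 if v == 10 else (v // 2) * 2
    return out


def first_bad_change(values):
    """Bisect on the odometer limit; return the index of the guilty rewrite.

    Assumes limit=0 gives correct output and limit=len(values) does not.
    """
    good, bad = 0, len(values)
    while bad - good > 1:
        mid = (good + bad) // 2
        if optimize(values, Odometer(mid)) == list(values):
            good = mid
        else:
            bad = mid
    return bad - 1
```

For the input `[2, 3, 4, 10, 6, 8]` the rewrites are numbered 0 to 4 in the order they are attempted, and the bisection lands on rewrite 2, the one touching the element `10`.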

------
benmmurphy
He said page tables were corrupted, which should not be possible from user
mode, so this is presumably a security issue for cloud providers.

------
zelos
Rendered unreadable on mobile by a huge sticky header and a floating "open in
app" button

~~~
cube00
I swear some sites just want mobile responsive design to die so they can load
their malware laden apps onto my phone.

------
yuhong
The fun thing is that only one instruction used AH, and it can be done
perfectly fine with EAX instead.

~~~
acqq
Wouldn't the instruction be bigger if EAX was used? I guess it would have to
have 4 byte constant then, with this encoding it's one byte. How would you
clear the 2 lowest bits in the second byte of the EAX?

For reference, we are talking about:

    andb $252, %ah

I believe it's not an effective speed optimization, given the way modern CPUs
work, but it does produce a smaller instruction?

Additionally, what I miss in this story is which source code actually produces
"I need to clear exactly these two bits while keeping others" in a loop. For
that CPU bug to manifest, it has to be inside of "short loops of less than 64
instructions."

~~~
wolfgke
> Wouldn't the instruction be bigger if EAX was used? I guess it would have to
> have 4 byte constant then, with this encoding it's one byte. How would you
> clear the 2 lowest bits in the second byte of the EAX?

Indeed. Here is a list of encodings of the original instruction and natural
alternatives:

> and ah,0xfc: 80E4FC [3 bytes]

> and ax,0xfcff: 6625FFFC [4 bytes]

EDIT: Note that

> and eax,0xfffffcff: 25FFFCFFFF [5 bytes]

does something different: It also zeros the upper 32 bits of the rax register
(Exercise: Why?), which is important since the next line

> movq %rax, (%rbx)

depends on the fact that this has not been changed.
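
A small emulation of how the three variants affect the full 64-bit register, as a sketch rather than a model of the CPU. It relies on the x86-64 rule that 8-bit and 16-bit destination writes leave the remaining bits of the register alone, while 32-bit destination writes zero-extend. The function names are made up for illustration.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def and_ah(rax, imm8):
    """and ah, imm8: only bits 8..15 change; the rest of rax is preserved."""
    ah = (rax >> 8) & 0xFF
    return (rax & ~0xFF00 & MASK64) | ((ah & imm8) << 8)


def and_ax(rax, imm16):
    """and ax, imm16: only bits 0..15 change; the upper 48 bits are preserved."""
    return (rax & ~0xFFFF & MASK64) | (rax & imm16 & 0xFFFF)


def and_eax(rax, imm32):
    """and eax, imm32: the 32-bit result is zero-extended into rax."""
    return rax & imm32 & 0xFFFF_FFFF
```

Starting from `rax = 0x1122334455667788`, the first two give `0x1122334455667488`, while the `eax` form gives `0x0000000055667488`, which would truncate a 64-bit value before it is stored with `movq`.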

~~~
acqq
At least, as the CPU bug manifests only when "AH, BH, CH or DH" are used, the
compilers could be patched to use the 4 byte variant. But if Intel properly
updates all the affected processors, it's not going to be needed.

(Response to wolfgke's edit: don't forget the "sign extend" variants).
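
One reading of "sign extend" here is the 64-bit form `and rax, imm32`, where the 32-bit immediate is sign-extended to 64 bits before the operation. A sketch of why that preserves the upper half (helper names are made up for illustration):

```python
def sign_extend_32_to_64(imm32):
    """Widen a 32-bit immediate to 64 bits, replicating its top bit."""
    imm32 &= 0xFFFF_FFFF
    if imm32 & 0x8000_0000:
        return imm32 | 0xFFFF_FFFF_0000_0000
    return imm32


def and_rax_imm32(rax, imm32):
    """and rax, imm32: the immediate is sign-extended, so the mask can keep bits 32..63."""
    return rax & sign_extend_32_to_64(imm32)
```

Because `0xfffffcff` has its top bit set, it widens to `0xfffffffffffffcff` and the upper 32 bits of `rax` survive. That form should come to 6 bytes (REX.W prefix, opcode 25, four immediate bytes), so it is still larger than the 3-byte `ah` version.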

~~~
jorisgio
We pondered reporting this directly to GCC too so that they can implement a
workaround. But despite the buzz generated by this issue in the press I
believe Intel is right in saying that this bug is very unlikely to trigger.

They say it can only happen in tight loops, and although the C code is quite
large, the hot path has only 17 instructions in the loop, with only
sequential memory loads (since the GC is scanning the heap linearly and most
OCaml values are small).

GCC actually did quite a good job here: it managed to lay out the code so
that the "memory reachable" branch is the shortest one. Indeed, when scanning
the major heap in a generational GC, assuming that most values scanned are
reachable is reasonable.

In this case, the loop condition will likely be predicted true, and the other
jumps predicted false, and there is a very tight loop with no random memory
load/store. I suspect those conditions are very unlikely.

------
mihaifm
> In late May, devops team noticed a debian package update for intel-microcode

Does this mean that other OSes are still susceptible to this? I'm not familiar
with how hardware bugs can be fixed at the OS level.

~~~
acqq
CPUs get updates too. Intel publishes microcode updates for its processors,
but every modern OS has to ship the update and load it on every power-on.

~~~
wolfgke
> but every modern OS has to add the update and install it on every power on.

This can also be done by the UEFI (though many hardware vendors shy away from
releasing UEFI updates, while for OSes this is common routine).
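
On Linux, one way to see which microcode revision is actually loaded is the `microcode` field that x86 kernels print in `/proc/cpuinfo`. A minimal sketch of parsing it follows; the function name is made up, and which revision contains the fix for a given CPU model is not something this checks.

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions reported across all logical CPUs."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            # The kernel prints the revision in hex, e.g. "0xba".
            revisions.add(int(value.strip(), 16))
    return revisions
```

Feeding it the contents of `/proc/cpuinfo` before and after installing a package such as intel-microcode (and rebooting) would show whether the revision changed. More than one value in the set would mean the cores disagree.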

------
Upvoter33
Surprising part for me: team building the thing is so unfamiliar with
assembly.

