AMD Ryzen Machine Crashes on a Sequence of FMA3 Instructions (techpowerup.com)
54 points by elorant 1 hour ago | 26 comments





Well, it's not like Intel CPUs don't have similar bugs.

See Skylake, for example. The list of known errata starts on page 27 and continues through page 63: http://www.intel.com/content/dam/www/public/us/en/documents/...

or the recent C2xxx bricking 'fiasco'

http://www.guru3d.com/news-story/intel-atom-c2000-chips-are-...

Interesting. I guess, unlike software, these sorts of bugs can't be fixed without producing a new chip, right?

Do CPU and GPU manufacturers do any kind of fuzzing?

Microcode is software running on your CPU. And many problems can be fixed with a microcode update, but not all - see the workarounds listed in the PDF I linked - note especially how many say "None identified".

Okay, thanks for the clarification. I was thinking about the floating-point bug for which they had to recall the defective processors: https://en.wikipedia.org/wiki/Pentium_FDIV_bug

They do tremendous amounts of validation. I believe random generation of input data is part of that.

Here's an old heavily cited paper from Intel on the topic; I'm sure their state of the art has advanced considerably in the intervening 17 years since its publication:

http://dl.acm.org/citation.cfm?id=623013

CPU ones definitely fuzz, and formally verify parts of their chips.

Unfortunately fuzzing ultimately has a random component, which doesn't really prove that you got all of these bugs.
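To make the point about randomness concrete, here is a toy sketch, not how any vendor actually validates silicon. It models a 64-bit adder that is wrong for exactly one operand pair (the `TRIGGER` values and the `buggy_add` function are made up for illustration) and fuzzes it against a reference with random inputs. With one bad input out of 2^128, a random run essentially never hits it.

```python
import random

MASK64 = (1 << 64) - 1

# Hypothetical trigger operands, invented for this sketch.
TRIGGER = (0xDEADBEEFCAFEF00D, 0x0123456789ABCDEF)


def reference_add(a, b):
    """Known-good model of a wrapping 64-bit add."""
    return (a + b) & MASK64


def buggy_add(a, b):
    """Model of a faulty adder: wrong for exactly one operand pair."""
    if (a, b) == TRIGGER:
        return 0
    return (a + b) & MASK64


def fuzz(trials, seed=0):
    """Compare the buggy adder against the reference on random operands.

    Returns the number of mismatches observed.
    """
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        a = rng.getrandbits(64)
        b = rng.getrandbits(64)
        if buggy_add(a, b) != reference_add(a, b):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches found by fuzzing:", fuzz(10_000))
    print("bug is real:", buggy_add(*TRIGGER) != reference_add(*TRIGGER))
```

A clean fuzz run here only says that none of the sampled inputs failed, which is why directed tests and formal methods get used alongside it.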

Oh, I understand fuzzing won't catch all bugs. But I see, glad they do run random inputs.

(To add on to the parent poster)

     Unfortunately fuzzing ultimately has a random component
An instruction which accepts 2 registers and returns 1 register has a 192bit problem space to validate. This complexity is present in an instruction as simple as `add`.

An AVX2 instruction which accepts 3 registers and outputs 1 has a 1024-bit problem space to validate.

This occurred in FMA3 with a ~512bit problem space.

Repeat for _every_ instruction (HUNDREDS). You can see how a few bugs slip through the cracks. The problem space is as large as that of some cryptographic functions!! I'm honestly surprised we don't see more of them.
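For a rough sense of scale, a back-of-the-envelope sketch using the bit counts quoted above (192, ~512 and 1024) as given. The rate of one billion tests per second is an assumption picked for illustration, and the work is done in log space because 2**1024 does not fit in a float.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def log10_years_to_enumerate(bits, tests_per_second=1e9):
    """Return log10 of the years needed to try every value of a `bits`-wide space.

    tests_per_second is an assumed rate, not a measured one.
    """
    log10_tests = bits * math.log10(2)
    return log10_tests - math.log10(tests_per_second) - math.log10(SECONDS_PER_YEAR)


if __name__ == "__main__":
    for bits in (192, 512, 1024):
        print(f"{bits}-bit space: about 10^{log10_years_to_enumerate(bits):.0f} years")
```

Even the smallest of the three comes out dozens of orders of magnitude beyond anything that could be enumerated, so exhaustive testing of the data path is off the table.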

The specific result produced by the data path is probably not very relevant in the case of a lockup. The control path involved with instruction decoding, register renaming, out-of-order execution, SMT, etc. is generally the cause of issues like this. With interactions between different blocks of the CPU and the size of some of the data structures involved, the full verification space is much, much larger.


Intel's AVX2 instructions don't crash the machine, but lead to an extreme increase in voltage and, thus, temperature. Sustained use of these instructions (say, for example, in signal processing and iterative parameter estimation algorithms implemented using routines in Intel's MKL) requires cooling way beyond what the usual boxed cooler that comes with an i7 CPU can deliver in order to prevent throttling.

The security implications of this, as noted in the comments on the article, seem pretty bad. I wonder if a microcode fix can work or if it would need a silicon-level fix. The second, I presume, would be very bad?

-- Keeping my earlier comment but it seems there's a microcode fix in testing currently. http://forum.hwbot.org/showpost.php?p=480922&postcount=30

AMD already has a fix; it requires a BIOS update for the motherboard. They have not given details, but some experts think the fix is merely an increase in the power delivered to that part of the chip. Their conclusion stems from the fact that the FMA3 bug disappeared when the chip was overclocked, because the overclocking required increasing the voltage to the chip. So the BIOS update probably adjusts the voltages up a bit.

http://www.fudzilla.com/news/processors/43166-amd-confirms-r...

If it's not fixable in microcode, then yeah, I'd say it's pretty bad. Makes it useless as a cloud machine, if anyone can cause a DoS and take out every other VM on a host.

For a personal machine, it's probably not terrible if you are using Linux or something else where you can compile everything yourself. But running binaries built by someone else would be a crapshoot. I wonder how many games (of the Windows, AAA variety) use these multiply instructions?

There's also not a lot of detail in the article—like if the data has to be specific or just the combination of instructions is enough to crash it.

Cloud doesn't have to mean virtualised and shared. There are plenty of operations that run on dedicated hardware.

> Flops is only affected when the SMT is enabled, so disabling the SMT can be used as a temporary work-around (until the actual fix arrives).

Does this mean disabling SMT will also fix the bug, or is that specific to this app?

I've heard that some workloads perform better on Ryzen with SMT off, supposedly because OS-level task schedulers still need to be optimized for it. But I still wonder if Ryzen's SMT implementation is on par with Intel's first implementations.

It's happened to AMD before, and it was pretty bad for their brand.

http://www.anandtech.com/show/2477/2

Intel had serious problems as well, let's not forget: the Pentium fp bug and the F00F bug (though those are more than a decade old at this point)

Haswell and early Broadwell had a bug in their transactional memory extensions so serious that the microcode "fix" simply permanently disabled the instructions.

CPU errata like this aren't uncommon.

While those are the most remembered bugs, almost every CPU has some kind of errata that can lock up the system in really weird and specific ways. For example, my Intel Ivy Bridge CPU running Linux:

     $ dmesg | grep -i microcode
     [ 0.000000] microcode: microcode updated early to revision 0x1c, date = 2015-02-26
     [ 1.415548] microcode: sig=0x306a9, pf=0x10, revision=0x1c
     [ 1.415665] microcode: Microcode Update Driver: v2.2.

Most users do not hit those bugs because modern OSes already apply microcode updates early in boot (as in the case above).

Of course, if AMD can't fix those bugs without performance regressions (remember the infamous TLB bug from the earlier Phenoms?), it can be pretty bad. However, I don't think the majority of users need to be too cautious about it.
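If you want to read the `sig=` field in that dmesg output, here is a small sketch that pulls the numbers out of the line and splits the signature into family, model and stepping using the standard CPUID leaf 1 EAX layout. The function names are made up for this example, and the regex only assumes the exact line format shown above.

```python
import re

# Matches the "sig=..., pf=..., revision=..." line format quoted above.
_MICROCODE_RE = re.compile(
    r"sig=(0x[0-9a-fA-F]+), pf=(0x[0-9a-fA-F]+), revision=(0x[0-9a-fA-F]+)"
)


def parse_microcode_line(line):
    """Extract sig, pf and revision as integers, or None if the line does not match."""
    match = _MICROCODE_RE.search(line)
    if match is None:
        return None
    sig, pf, revision = (int(group, 16) for group in match.groups())
    return {"sig": sig, "pf": pf, "revision": revision}


def decode_signature(sig):
    """Split a CPUID signature into (family, model, stepping)."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # The extended model bits only count for family 6 and family 15.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    # The extended family bits only count for family 15.
    if family == 0xF:
        family += ext_family
    return family, model, stepping


if __name__ == "__main__":
    info = parse_microcode_line(
        "[ 1.415548] microcode: sig=0x306a9, pf=0x10, revision=0x1c"
    )
    print(decode_signature(info["sig"]), hex(info["revision"]))
```

For the line above this gives family 6, model 58 (0x3a), stepping 9, running microcode revision 0x1c.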

Some of which are quite serious: http://yuhongbao.blogspot.ca/2015/06/why-your-core-2-process...


Don't forget the (other) f00f bug in the Intel Quark![1]

[1] - https://en.wikipedia.org/wiki/Intel_Quark#Segfault_bug

I don't know if this bug affects Phenom II chips, but I've been running a Phenom II for 8 straight years in my primary desktop with nary a hiccup. Rock solid reliable. Guess I haven't been running the "correct conditions" for a crash?



You could also have a BIOS update with the workaround already baked in, or a chip that has the later silicon without the bug...

(I also had a Phenom II for many years without any mysterious crashing problems, even though the chip was the correct revision and there was no BIOS erratum workaround enabled.)

It is not affected.

FMA3 (fused multiply add) instructions are very common for any kind of audio processing (e.g. filters), not only for their speed but also their precision properties. Depending on how a game was compiled, these instructions would be all over the place.

But whether this specific sequence is ever run is another question, of course.
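On the precision point: a fused multiply-add rounds once, after the add, while a separate multiply and add round twice. A minimal sketch of the difference follows. It does not execute an FMA3 instruction; `fused_mul_add` is a made-up helper that emulates the single rounding with exact rational arithmetic, and the operands are chosen so that the intermediate product loses its low bits.

```python
from fractions import Fraction


def unfused_mul_add(a, b, c):
    """Two roundings: the product is rounded to a double, then the sum is."""
    return a * b + c


def fused_mul_add(a, b, c):
    """Emulated single rounding: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


if __name__ == "__main__":
    a = 1.0 + 2.0**-30
    b = 1.0 - 2.0**-30
    c = -1.0
    # The exact product is 1 - 2**-60, which rounds to 1.0 as a double.
    print(unfused_mul_add(a, b, c))  # 0.0
    print(fused_mul_add(a, b, c) == -(2.0**-60))  # True
```

The unfused version throws away the whole answer here, which is the kind of cancellation that shows up in filter and polynomial evaluation.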

