
AMD Ryzen Machine Crashes on a Sequence of FMA3 Instructions - elorant
http://forum.hwbot.org/showthread.php?t=167605
======
CalChris
I think the original _hwbot_ posting is the better article.

[http://forum.hwbot.org/showthread.php?t=167605](http://forum.hwbot.org/showthread.php?t=167605)

The original post has been updated as more facts came along. It verified that
the bug was reproducible on other machines. And then it said:

    
    
      Update 3/16/2017:
    
      As much as I had least expected this to be the case, this appears to have been confirmed as an errata in the AMD Zen processor.
    

And then it goes on to say:

    
    
      Fortunately, it's one that is fixable with a microcode update and will not result in something catastrophic like a recall or the disabling of features.
    

Basically, this is an _awesome_ bug report. HN should link to it rather than
the _techpowerup_ article.

~~~
sriramkarnati
The issue seems to be fixed:
[http://forum.hwbot.org/showpost.php?p=480922&postcount=30](http://forum.hwbot.org/showpost.php?p=480922&postcount=30)

[http://forum.hwbot.org/showpost.php?p=480524&postcount=25](http://forum.hwbot.org/showpost.php?p=480524&postcount=25)

[https://www.bit-tech.net/news/hardware/2017/03/21/amd-ryzen-...](https://www.bit-tech.net/news/hardware/2017/03/21/amd-ryzen-fma3-fix-promise/1)

------
jjuhl
Well, it's not like Intel CPUs don't have similar bugs.

See Skylake, for example: the list of known errata starts on page 27 and
continues through page 63:
[http://www.intel.com/content/dam/www/public/us/en/documents/...](http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf)

~~~
yeukhon
Interesting. I guess, unlike software, these sorts of bugs can't be fixed
without producing a new chip, right?

Do CPU and GPU manufacturers do any kind of fuzzing?

~~~
monocasa
CPU ones definitely fuzz, and formally verify parts of their chips.

Unfortunately fuzzing ultimately has a random component, which doesn't really
prove that you got all of these bugs.

~~~
valarauca1
(To add on to the parent poster)

    
    
         Unfortunately fuzzing ultimately has a random component
    

An instruction which accepts 2 registers and returns 1 register has a 192-bit
problem space to validate. This complexity is present in an instruction as
simple as `add`.

An AVX2 instruction which accepts 3 registers and outputs 1 has a 1024-bit
problem space to validate.

This occurred in FMA3 with a ~512-bit problem space.

Repeat for _every_ instruction (HUNDREDS). You can see how a few bugs slip
through the cracks. The problem space is as large as that of some cryptographic
functions!! I'm honestly surprised we don't see more of them.
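
For anyone wanting to check the figures above, here is a minimal sketch of one reading that reproduces them. It assumes the counts include the output register and that the register widths are 64, 256, and 128 bits respectively; those widths and the helper name are inferred for illustration, not stated in the comment.

```python
def problem_space_bits(n_inputs, n_outputs, register_bits):
    """Bits of state across all input and output registers (assumed reading)."""
    return (n_inputs + n_outputs) * register_bits

# Assumed widths: 64-bit general-purpose, 256-bit AVX2, 128-bit for the FMA3 case.
add_bits = problem_space_bits(2, 1, 64)
avx2_bits = problem_space_bits(3, 1, 256)
fma3_bits = problem_space_bits(3, 1, 128)

print(add_bits, avx2_bits, fma3_bits)

# Counting only the inputs of a 64-bit add, there are 2**128 operand pairs.
add_input_pairs = 2 ** (2 * 64)
print(f"{add_input_pairs:.3e} distinct input pairs for a 64-bit add")
```

Even the inputs-only count for `add` is far beyond anything that can be enumerated, which is the point being made.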

~~~
cwzwarich
The specific result produced by the data path is probably not very relevant in
the case of a lockup. The control path involved with instruction decoding,
register renaming, out-of-order execution, SMT, etc. is generally the cause of
issues like this. With interactions between different blocks of the CPU and
the size of some of the data structures involved, the full verification space
is much, much larger.

~~~
valarauca1
This isn't a lockup.

If you read the source (TechPowerUp is terrible; why it isn't banned is beyond
me), it is actually an _illegal instruction_ error, which just crashes the
current application.

Also, if SMT is not disabled the error doesn't occur. Also, if the chip isn't
overclocked the error doesn't occur.

So it is very clearly power related.

~~~
wongarsu
Quoting from
[http://forum.hwbot.org/showthread.php?t=167605](http://forum.hwbot.org/showthread.php?t=167605):

> this always hard freezes the computer:
>
> - At all clock speeds.
> - When running single-threaded, it happens to any core that I pin it to.

It _should be_ at worst an illegal instruction, but instead the whole core
freezes, even on underclocked computers.

------
woodson
Intel's AVX2 instructions don't crash the machine, but lead to an extreme
increase in voltage and, thus, temperature. Sustained use of these
instructions (say, for example, in signal processing and iterative parameter
estimation algorithms implemented using routines in Intel's MKL) requires
cooling way beyond what the usual boxed cooler that comes with an i7 CPU can
deliver in order to prevent throttling.

~~~
stagger87
That is disappointing to hear. Is this first hand knowledge, or do you have a
discussion/link you could share on the topic?

~~~
DannyBee
It's documented and known that you can't execute AVX2 continuously without it
clocking the processor down :(

~~~
semi-extrinsic
> clocking the processor down

Nitpick, and certainly a matter of definition, but it's not clocking the
processor down, it's just unable to turbo like it can with other loads.

~~~
DannyBee
Err, no, actually. It really is clocked down, and that's what Intel says.

There is the normal base frequency (which is, again, the one people see on the
box; Intel calls it "marked TDP frequency").

There is the AVX base frequency, which is lower. If you execute all-AVX
workloads, it will clock down to the AVX frequency, which is below the marked
TDP frequency.

They are very explicit about this: "Workloads using Intel AVX instructions may
reduce processor frequency as far down as the AVX base frequency to stay
within TDP limits."

See
[http://www.intel.com/content/dam/www/public/us/en/documents/...](http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf)

I.e., if you buy a 2.3 GHz processor and run all-AVX workloads on it, it may
operate at 1.9 GHz.
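
To put that example in relative terms, a quick sketch using only the two frequencies quoted above (the function name is made up for illustration):

```python
def frequency_drop(marked_ghz, avx_base_ghz):
    """Fractional clock reduction from the marked frequency to the AVX base."""
    return (marked_ghz - avx_base_ghz) / marked_ghz

# Figures from the example above: 2.3 GHz on the box, 1.9 GHz under all-AVX load.
drop = frequency_drop(2.3, 1.9)
print(f"{drop:.1%} lower clock under sustained AVX load")
```

That is roughly a sixth of the advertised clock, before counting any lost turbo headroom.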

------
tmd83
The security implications of this, as commented on the article, seem pretty
bad. I wonder if a microcode fix can work or if it would need a silicon-level
fix. The second, I presume, would be very bad?

-- Keeping my earlier comment, but it seems there's a microcode fix in testing
currently.
[http://forum.hwbot.org/showpost.php?p=480922&postcount=30](http://forum.hwbot.org/showpost.php?p=480922&postcount=30)

~~~
pierrebai
AMD already has a fix; it requires a BIOS update for the motherboard. They
have not given details, but some experts think the fix is merely an increase
in the power delivered to that part of the chip. Their conclusion stems from
the fact that the FMA3 bug disappeared when the chip was overclocked, because
the overclocking required increasing the voltage to the chip. So the BIOS
update probably adjusts the voltages up a bit.

[http://www.fudzilla.com/news/processors/43166-amd-confirms-r...](http://www.fudzilla.com/news/processors/43166-amd-confirms-ryzen-screen-freeze-fix)

~~~
mickronome
Since they are pushing how efficiently the chip regulates power to its
various parts to keep power draw to a minimum, that sounds like a likely
cause.

Assuming most of the testing budget is spent on the most common and likely
instruction sequence patterns is probably not unrealistic. Still a bit odd
that such a short sequence of repeats of the same instruction failed.

Which got me thinking.

At more than 2000 opcodes for x86-64 and legacy opcodes, testing all
3-sequences with one input is more than 8G sequences. Covering that and a
(guesstimated average) 64 bits of input per opcode is a staggering 10^29
instructions, or 2T CPU-years at 4 Gops/s. Not easy to test all that, even if
the estimate were off by a couple of orders of magnitude in the right
direction!

It might be that valid 2-sequences are the most one can hope to test anywhere
close to exhaustively while still covering some significant fraction of the
operand space?

The fact that modern CPUs work at all, given their complexity in so many
different areas, looks pretty darn close to magic, even from a pretty close
distance.
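
The back-of-envelope estimate above can be redone as a short sketch. It assumes one 64-bit input value per 3-instruction sequence (the reading that reproduces the 10^29 figure) and counts every instruction in each sequence as one operation; both are assumptions, and the constant names are made up for illustration.

```python
OPCODES = 2000            # "more than 2000 opcodes"
SEQ_LEN = 3               # 3-instruction sequences
INPUT_BITS = 64           # assumed: one 64-bit input value per sequence
OPS_PER_SECOND = 4e9      # "4 Gops/s"
SECONDS_PER_YEAR = 365.25 * 24 * 3600

sequences = OPCODES ** SEQ_LEN          # distinct 3-opcode sequences
cases = sequences * 2 ** INPUT_BITS     # each sequence with every input value
instructions = cases * SEQ_LEN          # instructions executed in total
cpu_years = instructions / OPS_PER_SECOND / SECONDS_PER_YEAR

print(f"{sequences:.1e} sequences")
print(f"{cases:.1e} test cases")
print(f"{cpu_years:.1e} CPU-years")
```

Under these assumptions the result is on the order of 10^29 test cases and a few trillion CPU-years, the same ballpark as the figures quoted; giving each of the three instructions its own 64-bit input would push it far higher.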

~~~
flamedoge
At least we try, with SMT solvers and such. Surprise surprise, math works!

------
mhroth
FMA3 (fused multiply add) instructions are very common for any kind of audio
processing (e.g. filters), not only for their speed but also their precision
properties. Depending on how a game was compiled, these instructions would be
all over the place.

But whether this specific sequence is ever run is another question, of
course.

~~~
gcp
FMA is used in literally everything that deals with floats _and_ has AVX2/FMA3
support, not just audio. AVX usage in game code isn't actually all _that_
common, though, due to the need to support lowest common denominators.

And no, this specific sequence isn't ever run. If it were, they'd have found
it while designing the chip. I checked the original source and it's
hand-written FMA3 intrinsics that don't correspond to a real computation.

~~~
Animats
Can you induce this through WebAssembly? Probably not, but someone should
check.

~~~
sunfish
WebAssembly does not currently have an fma instruction, and implementations
are not permitted to fold plain multiply+add sequences into fma because that
produces different rounding.
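
The rounding difference can be seen with a small sketch. Python floats are IEEE 754 doubles; the fused result is emulated here with exact rational arithmetic followed by a single rounding, which is a stand-in for a hardware fma instruction and only valid for finite inputs. The helper names are hypothetical.

```python
from fractions import Fraction

def mul_add_unfused(a, b, c):
    """Plain multiply then add: the product is rounded before the addition."""
    return a * b + c

def mul_add_fused(a, b, c):
    """Emulated fused multiply-add: exact a*b+c, rounded once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = mul_add_unfused(a, b, c)  # a*b rounds to 1.0, so the sum is 0.0
fused = mul_add_fused(a, b, c)      # the exact result -2**-60 survives

print(unfused == 0.0)
print(fused == -(2.0 ** -60))
```

The exact product is 1 - 2^-60, which is not representable as a double and rounds to 1.0 in the unfused path, so the two paths give observably different answers for the same inputs.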

------
filereaper
As others have mentioned, this happens to Intel, AMD, and others.

Intel had similar failures, such as its TSX instruction set, which was only
fixed in the next stepping.

It's not clear if AMD can push out a microcode patch to fix this.

------
ysleepy
Some context relating to Intel:

"We saw some really bad Intel CPU bugs in 2015, and we should expect to see
more in the future"

[https://danluu.com/cpu-bugs/](https://danluu.com/cpu-bugs/)
[https://news.ycombinator.com/item?id=10877270](https://news.ycombinator.com/item?id=10877270)

------
rurban
Can someone please fix the title? Crashes -> Freezes.

Big difference.

And to summarize: Intel has about 80 known similar problems, as outlined in
the errata, for which no known fixes exist. For this AMD bug a BIOS update fix
exists (more voltage).

------
late2part
Patient: Dr. It hurts when I do "this." Dr.: Don't do that.

------
WhitneyLand
Is it known if the microcode patch affects performance?

