AMD Ryzen Machine Crashes on a Sequence of FMA3 Instructions (hwbot.org)
174 points by elorant 11 months ago | 95 comments



I think the original hwbot posting is the better article.

http://forum.hwbot.org/showthread.php?t=167605

The original post has been updated as more facts came along. It verified that the bug was reproducible on other machines. And then it said:

  Update 3/16/2017:

  As much as I had least expected this to be the case, this appears to have been confirmed as an errata in the AMD Zen processor.
And then it goes on to say:

  Fortunately, it's one that is fixable with a microcode update and will not result in something catastrophic like a recall or the disabling of features.
Basically, this is an awesome bug report. HN should link to it rather than the techpowerup article.



ah! the author of that hwbot.org post is Alex Yee; I remember his name as he wrote that insanely upvoted (and awesome) stackoverflow answer on branch misprediction: http://stackoverflow.com/a/11227902


Can concur - multiple comments in the linked article show confusion as to why this is even a problem "because the benchmark program in question has CPU-series-specific binaries, and it didn't release one for Ryzen yet" (paraphrasing).

Ouch. Whoosh...


Ok, we changed to that from https://www.techpowerup.com/231536/amd-ryzen-machine-crashes.... Thanks!

Btw everybody, if you notice something like this or see a comment by someone who has, and you have a minute to drop us a note at hn@ycombinator.com, it's a real community service if you do so. Then we can change the URL (or title or what have you) sooner than 19 hours in—or rather, sooner than never, since we only found out about this one by a user telling us. We can't come close to reading all the comments but we do see all the emails and (usually) act on the time-sensitive ones quickly even if we can't reply till a bit later.


Unfortunately, it's just a full-page ad telling me to install some app.


Well, it's not like Intel CPUs don't have similar bugs.

See Skylake for example - the list of known errata starts on page 27 and continues through page 63: http://www.intel.com/content/dam/www/public/us/en/documents/...


It's crazy how today we can just release microcode updates to patch the CPU, compared to the original Pentium bug, which required the operating system to be patched.


To be fair, there are definitely occasional bugs that require OS patches.


Can you please explain how it works? How does the new microcode get into the CPU?


Microcode is updated by the OS or BIOS/EFI.

https://wiki.debian.org/Microcode#CPU_Microcode


As you can see in the erratum workarounds, most of those bugs were not fixed by microcode updates, but by BIOS updates (mostly changing voltages or timings), and some require OS and compiler updates.


I'm presuming it's just a BIOS update to disable certain optimizations within the CPU... meaning they were already in place to be A/B tested... just flipping bits on/off.


I don't think this is true. I'm not an expert on application-class CPUs, but I think they actually have ROMs that control the state machines that execute the instructions. This is similar to how regular instructions are stored in memory.


Correct, look at this page for an example: https://www.bedroomlan.org/projects/mcasm

  // Minimum Microcode file.
  
  cond OP:4;
  cond uaddr:3;
  
  signal /MEMIO    = ....1;
  signal MREAD     = ...1.;
  signal MWRITE    = ..1..;
  
  field  REGMOVE   = XX___;
  signal MAR <- PC = 01...;
  signal IR <- MEM = 10...;
  
  start OP=XXXX; // A tiny fetch instruction
    /MEMIO, MREAD, MAR <- PC;   // 
    hold;                       // Same as MEMIO, MREAD, MAR <- PC
    hold, -MREAD, IR <- MEM;    // Same as MEMIO, MAR <- PC, IR <- MEM
  
  // End of file.


My understanding is that there are several forms of microcode updates.

Sometimes the microcode can do simple patches like inserting a nop between problematic instruction combinations. This usually doesn't have much (if any) negative performance implications. The instruction decoder (again AFAIK) is fairly programmable.

However if there is a more serious flaw in the actual silicon the microcode must rewrite the problem instructions to emulate the correct behavior and that can make it extremely slow - think emulating floating point but not quite as bad.



Intel's chipsets aren't immune either: the first revision of P67 and H67 had a flaw that caused the SATA ports to slowly overvolt themselves to death.

They ended up having to recall a huge number of motherboards.


Yep, I had to RMA a P8P67Pro because of this https://en.wikipedia.org/wiki/Sandy_Bridge#Cougar_Point_chip...


Interesting. I guess unlike software, these sorts of bugs can't be fixed without producing a new revision, right?

Do CPU and GPU manufacturers do any type of fuzzing?


In this case the fix is adding a resistor:

"If your system does not use SERIRQ and BIOS puts SERIRQ in Quiet-Mode, then the weak external pull up resistor is not required. All other cases must implement an external pull-up resistor, 8.2k to 10k, tied to 3.3V"

https://www-ssl.intel.com/content/dam/www/public/us/en/docum...

Depending on the motherboard, those pins may be exposed already. Here's what the fix looks like on a Synology unit: https://www.reddit.com/r/synology/comments/609u1l/c2538_cloc...


So I assume that the problem is that SERIRQ is left floating causing too much of a load on the LPC clock it depends on, right?


Microcode is software running on your CPU. And many problems can be fixed with a microcode update, but not all - see the workarounds listed in the PDF I linked - note especially how many say "None identified".


Okay, thanks for the clarification. I was thinking of the floating-point bug for which they had to recall the defective processors: https://en.wikipedia.org/wiki/Pentium_FDIV_bug


Hard to say if something like this would be patchable via a microcode update. But regardless, back then your CPU didn’t run software to run your software, so a new hardware revision was in order.


They do tremendous amounts of validation. I believe random generation of input data is part of that.

Here's an old heavily cited paper from Intel on the topic; I'm sure their state of the art has advanced considerably in the intervening 17 years since its publication:

http://dl.acm.org/citation.cfm?id=623013


"crashme" is one venerable program that does this kind of fuzzing -- I managed to find a bug in a cpu with it once, which does not speak well of their QA department.


A former coworker was a QA manager at Intel. He said it was an explicit decision to cut back on validation and QA, which is why he wasn't at Intel anymore. The general feeling was that they had "overreacted" to the Pentium FDIV bug and needed to move faster.

YMMV, I have no inside knowledge.


I'm very curious what CPU it was, and what the bug was.


Broadcom Sibyte 1250. I had pre-release silicon, and there was a known bug that prefetch would occasionally hang the chip. I wanted to have a little fun goofing around, so I modified crashme to replace prefetches in the randomly generated code with a noop. A few minutes later, I hung the chip.

If I identified the right erratum later, it was: if there is a branch in the delay slot of a branch in a delay slot which is mispredicted, the cpu hangs. It was fixed by making this sequence throw an illegal instruction exception. (It was undefined behavior already, I think.)


Interesting. And... the nesting you describe is messing with my head! :)

How did the erratum get fixed? Hardware re-re-release?

Definitely filed crashme away, sounds like a useful tool.

The Sibyte 1250 sounds cute. There's a "650000000 Hz" in https://lists.freebsd.org/pipermail/freebsd-mips/2009-Septem... although I'm not sure if MIPS from circa 2002 (?) was that fast (completely ignorant; no field knowledge - definitely want to learn more though).

I also noted the existence of the BCM91250E via http://www.prnewswire.com/news-releases/broadcom-announces-n... sibytetm-bcm1250-mipsr-processor-to-accelerate-customers-software-and-hardware-development-75841437.html, which was kind of cool. I like how the chip is a supported target for QNX :)

Now I'm wondering what you were using it for. Some kind of network device? (I think I saw networking as one of its target markets.)

--

As an aside, I'm also very curious what HN notifier you use, if you use one. (I use http://hnreplies.com/ myself, but it's sometimes slow. I saw your message after 4 minutes in this case (fast for HN Replies); typing/research + IRL stuff = delay :) )


Our company was PathScale, and we were hoping that the Sibyte 1250 would make a great supercomputing chip. Opteron wasn't out yet, Intel was pricing 64-bit Itanium very high, and this dual-core Sibyte thing was projected to have a reasonable clock and it could do 2 64-bit fp mul-adds/cycle. We had Fred Chow and the GPLed SGI Compiler on our side. And the next revision of the chip had these great 10 gig networking features that I thought I could make work well with MPI, the scientific computing message-passing standard.

You can guess how it worked out: Sibyte was late, slow, and buggy. Even simple code sequences like hand-paired multiply-adds would start running at 1/2 speed after an interrupt. Our experienced compiler team was unable to get good perf on several SPECcpu benchmarks despite the code looking good. (Fred didn't have much hair left to pull out!)

Soon after we raised our A round we pivoted to using Opteron for compute, building an ASIC for a specialized MPI network, and Fred's team did an Opteron backend for the compiler.

The descendant of the network is now called Intel Omni-Path, and is on-package for Xeon Phi cpus.


I see. Always poignant to hear these kinds of stories.

I'm curious as to whether interrupt handling was done underneath the multiprocessing layer or whether interrupts were just hammering the pipeline design. (I assume by "an interrupt" you're referring to slowdown within the fraction of a second after a given interrupt occurred, within the context of floods of interrupts interspersed between instructions?)

Very cool to hear that Intel snapped up what you eventually managed to ship, FWIW - and that you were able to pivot in the way you did. Also interesting to hear about Opteron use in the field, my experience is only with tracking the consumer sector.

Also, the "GPLed SGI compiler" part you mentioned caught my eye and led me to the EKOpath compiler, and particularly its OSS release in 2011: http://web.archive.org/web/20110616135434/http://pathscale.c...

After some floating back and forth I found https://www.phoronix.com/forums/forum/software/general-linux... which a) mirrored my questioning exactly and b) contained a very nice and straightforward answer, so I guess that put closure on that. Really sad that it never really took off though; faster compilers (https://lwn.net/Articles/447541/ in/from https://lwn.net/Articles/447529/) are always something I'm looking for :)


CPU manufacturers definitely fuzz, and formally verify parts of their chips.

Unfortunately, fuzzing ultimately has a random component, which doesn't really prove that you got all of the bugs.


(To addon to parent poster)

     Unfortunately fuzzing ultimately has a random component
An instruction which accepts 2 registers and returns 1 register has a 192-bit problem space to validate. This complexity is present in an instruction as simple as `add`.

An AVX2 instruction which accepts 3 registers and outputs 1 has a 1024-bit problem space to validate.

This occurred in FMA3, with a ~512-bit problem space.

Repeat for _every_ instruction (HUNDREDS). You can see how a few bugs slip through the cracks. The problem space is as large as some cryptographic functions! I'm honestly surprised we don't see more of them.
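
To put rough numbers on those problem-space sizes, here's a sketch using the register counts and widths from this comment; it's a back-of-the-envelope model that ignores flags, rounding modes, and other machine state:

```python
def input_space_bits(num_regs: int, reg_width: int) -> int:
    """Bits of input state a single instruction can depend on."""
    return num_regs * reg_width

# A two-input, one-output GPR instruction like `add`: three 64-bit
# registers participate.
print(input_space_bits(3, 64))    # 192

# An AVX2 instruction with three 256-bit sources and one destination:
print(input_space_bits(4, 256))   # 1024

# FMA3 on 128-bit XMM registers (the destination doubles as a source):
print(input_space_bits(4, 128))   # 512

# Even the smallest of these, 192 bits, is 2**192 distinct inputs --
# on the order of a large symmetric-cipher keyspace, far beyond
# exhaustive testing.
print(2**192 > 10**57)            # True
```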


The specific result produced by the data path is probably not very relevant in the case of a lockup. The control path involved with instruction decoding, register renaming, out-of-order execution, SMT, etc. is generally the cause of issues like this. With interactions between different blocks of the CPU and the size of some of the data structures involved, the full verification space is much, much larger.


I don't know about that. As I understand it, interesting things can happen if a large number of data lines toggle at the same time, and this obviously depends on both the data and the control path. Huge space of possible states.


This isn't a lockup.

If you read the source (TechPowerUp is terrible; why it isn't banned is beyond me), it is actually an _illegal instruction_ error, which just crashes the current application.

Also, if SMT is disabled, the error doesn't occur. Also, if the chip is overclocked, the error doesn't occur.

So it is very clearly power related.


Quoting from http://forum.hwbot.org/showthread.php?t=167605:

> this always hard freezes the computer:

> - At all clock speeds.

> - When running single-threaded, it happens to any core that I pin it to.

It should be at worst an illegal instruction, but instead the whole core freezes, even on underclocked computers.


Oh, I understand fuzzing won't catch all bugs. But I see; glad they do run random inputs over everything.


Intel's AVX2 instructions don't crash the machine, but lead to an extreme increase in voltage and, thus, temperature. Sustained use of these instructions (say, for example, in signal processing and iterative parameter estimation algorithms implemented using routines in Intel's MKL) requires cooling way beyond what the usual boxed cooler that comes with an i7 CPU can deliver in order to prevent throttling.


Sort of? If you can actually increase your performance architecturally by doubling your vector width then you double the power dissipated by the relevant bits of your execution and memory systems, true. But doubling performance by doubling power is really awesome. Things like your out of order window size will tend to increase power as the square of their size. And in your typical desktop cpu regime your power scales with cube of frequency.

So if you're lucky enough to have code that can benefit from 256-bit vectors, then AVX will double your performance and power, then the thermal management system will throttle you back down to regular power and 80% of your regular clock speed, for a net of merely a 60% increase in performance. Which is really nice if you happen to be doing all 256-bit-wide vector math. If you're only infrequently using vectors, your vector speedup will be smaller, but so will the increase in power that needs to be overcome, so it's still a net win.
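
The arithmetic here can be sketched in a few lines (a rough model, assuming the cube-law power scaling and the 2x AVX power figure from the comment above):

```python
# Assume dynamic power scales with the cube of frequency, and that
# lighting up 256-bit AVX doubles both throughput and power at a
# fixed clock.

power_ratio = 2.0                           # AVX doubles power at the same clock
clock_scale = (1 / power_ratio) ** (1 / 3)  # throttle back to the old power budget
print(round(clock_scale, 2))                # 0.79 -- about 80% of the clock

net_speedup = 2.0 * clock_scale             # 2x the work per cycle, at fewer cycles
print(round(net_speedup, 2))                # 1.59 -- the "merely 60%" net gain
```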


This is the main reason why, for example, the i7-7700K comes with a "base clock" of 4.2 GHz even though for most workloads the CPU will operate at 4.4 or 4.5 GHz. That 4.2 GHz represents the maximum speed the CPU can maintain when running the worst case AVX2 workload without exceeding the specified 91W TDP.


Interesting... do you mind me asking: how does overclocking work these days, then? We have P-states, i.e. the CPU will go from 1200MHz... 2400MHz to 4GHz, then there's the "turbo" frequency range. So does modern overclocking apply some kind of multiplier to all these steps? Thanks.


There's a base clock that's usually something like 100MHz and can usually be tweaked by a few percent. The base clock affects all power states, and usually the memory and I/O (which is usually the limiting factor).

On top of that is a CPU frequency multiplier that is variable and the upper limit is unlocked on certain processor and motherboard combinations. If you have an unlocked multiplier you can set the maximum multiplier for each of the Turbo states (1-core, 2-core, through all-cores). You could configure them to all be the same multiplier, or just scale up the default behavior of running at a higher clock speed when fewer cores are active.
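
A minimal sketch of the scheme described above: effective core frequency is base clock times a per-turbo-state multiplier. The numbers are illustrative, not from any particular part:

```python
def core_freq_mhz(bclk_mhz: float, multiplier: int) -> float:
    """Core frequency = base clock (BCLK) x the active multiplier."""
    return bclk_mhz * multiplier

# Stock: 100 MHz BCLK, x42 all-core turbo multiplier
print(core_freq_mhz(100.0, 42))   # 4200.0 MHz

# Overclocked: nudge BCLK up a couple percent and raise the unlocked
# multiplier for the all-core turbo state
print(core_freq_mhz(102.0, 45))   # 4590.0 MHz
```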


You're not wrong - the Vcore boost when both AVX pipes are powered up causes increased dissipation, but a large part of that is also caused by ... well, AVX processing huge amounts of data.

For similar reasons server processors (Xeon E5, E7 series) throttle the core clock in AVX "mode".


Probably part of why (the other obvious reason being die space) Ryzen lacks 256-bit SIMD hardware, and instead runs such instructions with a 128-bit unit over two cycles.


Well, doing that tends to halve throughput for ALU-dense code, but it obviously saves a bit of die space and, more importantly, reduces peak power draw, which has to be supported by the metallization on the chip.

AVX2 might have also been just a bit too late to be integrated outside the microcode.


I wonder how much worse it is in practice, because Intel's chips supposedly have to downclock a bit when doing AVX2 because of the heat.


Desktop parts have no dedicated AVX clock (however, their base vs. all-core turbo pretty much does the same thing), while server SKUs (Socket R3 and later) basically have two distinct clocking ladders, one for non-AVX code and one for AVX code (the "mode switch" quantum is rather coarse, on the order of one millisecond). So both base clock and maximum turbo clocks are lower in AVX code compared to non-AVX code.

As usual, actual achieved clock speeds depend on thermal performance and are not fixed. If cooling is completely insufficient, the clock frequency will drop below base clocks, i.e. thermal throttling.

To clarify: This isn't a secret, but part of the specs and even in some Intel slides. (However, platform and processor specs, even at a basic (product D/S) level, tend to be a rather lengthy read, so most skip reading all that.)


That is disappointing to hear. Is this first hand knowledge, or do you have a discussion/link you could share on the topic?


It's documented and known that you can't execute AVX2 continuously without it clocking the processor down :(


> clocking the processor down

Nitpick, and certainly a matter of definition, but it's not clocking the processor down, it's just unable to turbo like it can with other loads.


Err, no, actually. It really is clocked down, and that's what Intel says.

There is the normal base frequency (which is, again, the one people see on the box. Intel calls it "marked TDP frequency")

There is the AVX base frequency, which is less. If you execute all-AVX workloads, it will clock down to the AVX frequency, which is below the marked TDP frequency.

They are very explicit about this: "Workloads using Intel AVX instructions may reduce processor frequency as far down as the AVX base frequency to stay within TDP limits."

See http://www.intel.com/content/dam/www/public/us/en/documents/...

I.e., if you buy a 2.3 GHz processor and run all-AVX workloads on it, it may operate at 1.9 GHz.



Very interesting. Any links for further info?


The security implications of this, as commented on in the article, seem pretty bad. Wonder if a microcode fix can work or whether it would need a silicon-level fix. The second, I presume, would be very bad?

-- Keeping my earlier comment, but it seems there's a microcode fix in testing currently. http://forum.hwbot.org/showpost.php?p=480922&postcount=30


AMD already has a fix; it requires a BIOS update for the motherboard. They have not given details, but some experts think the fix is merely an increase in the power delivered to that part of the chip. Their conclusion stems from the fact that the FMA3 bug disappeared when the chip was overclocked, because the overclocking required increasing the voltage to the chip. So the BIOS update probably adjusts the voltages up a bit.

http://www.fudzilla.com/news/processors/43166-amd-confirms-r...


Since they are pushing how efficiently the chip regulates power to its various parts to keep power draw to a minimum, that sounds like a likely cause.

Assuming most of the testing budget is spent on the most common and likely instruction-sequence patterns is probably not unrealistic. Still, it's a bit odd that such a short sequence of repeats of the same instruction failed.

Which got me thinking.

At more than 2000 opcodes across x86-64 and legacy instruction sets, testing all 3-sequences with one input each is more than 8G sequences; covering that with a (guesstimated average of) 64 bits of input per sequence is a staggering ~10^29 instructions, or roughly 2T CPU-years at 4 Gops/s. Not easy to test all that, even if the estimate is off by a couple of orders of magnitude in the right direction!

Might be that valid 2-sequences are the most that can be hoped to be even close to exhaustively tested while still covering some significant fraction of the operand space?

The fact that modern CPU's work at all given their complexity in so many different areas looks pretty darn close to magic, even from a pretty close distance.
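
The estimate above can be reproduced directly. This is order-of-magnitude only, using the comment's own guesstimates (2000 opcodes, 64 bits of operand state per sequence, 4 Gops/s per core):

```python
opcodes = 2000
sequences = opcodes ** 3               # all 3-instruction sequences
inputs_per_sequence = 2 ** 64          # ~64 bits of operand state

total_instructions = sequences * inputs_per_sequence
print(f"{total_instructions:.1e}")     # 1.5e+29 -- the "staggering 10^29"

rate = 4e9                             # 4 Gops/s per core
seconds_per_year = 365 * 24 * 3600
cpu_years = total_instructions / rate / seconds_per_year
print(f"{cpu_years:.1e}")              # 1.2e+12 -- on the order of a trillion CPU-years
```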


At least we try, with SMT solvers and such. Surprise surprise, math works!


A power issue would also plausibly explain why disabling SMT avoids the problem, since disabling SMT powers off a bunch of stuff (a whole thread context does need a bit of power), and, in a multi-threaded scenario, tends to reduce core usage thereby reducing average (! so likely not relevant) power draw as well.


And (probably more importantly) reduces overall IPC by around 30% generally. The whole point of SMT these days is to have continuous work for the ALUs despite recent cache misses.


If it's not fixable in microcode, then yeah, I'd say it's pretty bad. Makes it useless as a cloud machine, if anyone can cause a DOS and take out every other VM on a host.

For a personal machine, it's probably not terrible if you are using linux or something else where you can compile everything yourself. But running binaries built by someone else would be a crapshoot. I wonder how many games (of the Windows, AAA variety) use these multiply instructions?

There's also not a lot of detail in the article—like if the data has to be specific or just the combination of instructions is enough to crash it.


Cloud doesn't have to mean virtualised and shared. There are plenty of operations that run on dedicated hardware.


For many cloud vendors, including EC2, "dedicated hardware" just means yours will be the only virtual machine on that computer. But they're still running your image under a hypervisor.


Yes but if you're worried about attacks from co-tenants, that's OK (the cloud provider can always access your hardware anyway)


It would still be a pain for cloud providers if the tenant can unwittingly DoS the hypervisor. Sure, they have watchdogs in place, but it gets more complicated if the DoS can be non-malicious.


I suspect Amazon (for instance) has far more shared machines than dedicated machines. If something isn't viable for shared machines then a huge section of the market is ripped away. Still seems catastrophic to me.


> Flops is only affected when the SMT is enabled, so disabling the SMT can be used as a temporary work-around (until the actual fix arrives).

Does this mean disabling SMT will also fix the bug, or is that specific to this app?

I've heard that some things perform better on Ryzen with SMT off, supposedly because OS-level task schedulers need to be better optimized for it. But I still wonder if Ryzen's SMT implementation is on par with Intel's first implementations.


It's happened to AMD before, and it was pretty bad for their brand.

http://www.anandtech.com/show/2477/2


Intel had serious problems as well, let's not forget: the Pentium fp bug and the F00F bug (though those are more than a decade old at this point)


Haswell and early Broadwell had a bug in their transactional memory extensions so serious that the microcode "fix" simply permanently disabled the instructions.

CPU errata like this aren't uncommon.


While those are the most-remembered bugs, almost every CPU has some kind of errata that can lock up the system in really weird and specific ways. For example, my Intel Ivy Bridge CPU running Linux:

  $ dmesg | grep -i microcode
  [    0.000000] microcode: microcode updated early to revision 0x1c, date = 2015-02-26
  [    1.415548] microcode: sig=0x306a9, pf=0x10, revision=0x1c
  [    1.415665] microcode: Microcode Update Driver: v2.2.

The fact that most users do not hit those bugs is because modern OSes already patch the microcode before execution (like in the case above).

Of course, if AMD can't fix those bugs without performance regressions (remember the infamous TLB bug from the earlier Phenoms?) it could be pretty bad. However, I don't think the majority of users need to be too cautious about it.



Don't forget the (other) f00f bug in the Intel Quark![1]

[1] - https://en.wikipedia.org/wiki/Intel_Quark#Segfault_bug


Yeah, that was kind of unfortunate as it made the Intel Quark incompatible with all normal Linux distros. I don't think it even had any kind of microcode update facility, being effectively a slightly updated 486.


Even with a runtime microcode update facility, no BIOS update equivalent would still make it pretty fatal for packaging except as an entirely distinct build target, because otherwise you'd need to do something kinky like always first booting a custom kernel+initrd target built to involve no locks, load the microcode update, then kexec into the real kernel.


I don't know if this bug affects Phenom II chips, but I've been running a Phenom II for 8 straight years in my primary desktop with nary a hiccup. Rock solid reliable. Guess I haven't been running the "correct conditions" for a crash?


You could also have a BIOS update with the workaround already baked in, or a chip that has the later silicon without the bug...

(I also had a Phenom II for many years without any mysterious crashing problems, even though the chip was the correct revision and there was no BIOS erratum workaround enabled.)


Windows 7 and later distribute and activate microcode updates through Windows Update.


It seems like it's not a microcode update fix, but an MSR killbit flip from the BIOS? [1]

[1] - http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf page 66


Yes, which reduces performance if I remember correctly. It was quite famous. They eventually released a fixed stepping, and Phenom II is not affected.


True, my mistake.

(While it doesn't apply for this specific example, it's still the reason why one generally will see microcode-fixed errata go away rather quickly, independently of BIOS updates)


Phenom II is not affected.


FMA3 (fused multiply-add) instructions are very common in any kind of audio processing (e.g. filters), not only for their speed but also for their precision properties. Depending on how a game was compiled, these instructions could be all over the place.

But whether this specific sequence is ever run is another question, of course.


FMA is used in literally everything that deals with floats and has AVX2/FMA3 support, not just audio. AVX usage in game code isn't actually all that common, though, due to the need to support lowest common denominators.

And no, this specific sequence isn't ever run. If it were, they'd have found it while designing the chip. I checked the original source and it's hand-written FMA3 intrinsics that don't correspond to a real computation.


Can you induce this through WebAssembly? Probably not, but someone should check.


WebAssembly does not currently have an fma instruction, and implementations are not permitted to fold plain multiply+add sequences into fma because that produces different rounding.
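
The rounding difference can be demonstrated from the standard library: Python floats don't expose a portable hardware FMA, so this sketch emulates the single rounding of a fused multiply-add with exact rational arithmetic:

```python
from fractions import Fraction

def fma_emulated(a: float, b: float, c: float) -> float:
    """Round a*b + c once, as a hardware FMA would."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

x = 1.0 + 2.0 ** -52          # 1 + ulp(1), exactly representable

# Two roundings: x*x rounds away the tiny 2**-104 term of the exact
# square, so the subtraction sees nothing left.
print(x * x - x * x)                  # 0.0

# One rounding: the fused form recovers the exact residual of x*x.
print(fma_emulated(x, x, -(x * x)))   # 2**-104, about 4.9e-32
```

This is why a compiler (or a wasm engine) that silently folded the multiply and add into one fma would produce bit-different results.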


Some game developers that had optimized with AVX were asked to take it out since it would crash overclocked CPUs that were riding the limits. Hit a patch of AVX and bam, they'd fall over.

Intel has mostly fixed this in more recent chips so that now they look ahead for upcoming AVX instructions and just slow down.


FMA (fused multiply-add) is a fundamental operation that serves as a building block in tasks such as evaluating polynomials, vector-vector and matrix-vector multiplication, convolutions, and algorithms for solving nonlinear equations.

You'd be hard pressed to find any computing application that doesn't take advantage of FMA instructions.
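
As one concrete illustration of the multiply-add pattern, Horner's rule evaluates a polynomial as a chain of `acc*x + c` steps, and each step is exactly the shape a single FMA instruction executes. This toy sketch uses ordinary Python floats, which round twice per step where a real FMA would round once:

```python
def horner(coeffs, x):
    """Evaluate coeffs[0]*x**n + ... + coeffs[n] via repeated multiply-add."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c     # the multiply-add pattern an FMA fuses
    return acc

# 2x**2 + 3x + 5 at x = 10
print(horner([2.0, 3.0, 5.0], 10.0))   # 235.0
```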


Like others have mentioned, this happens to Intel, AMD, and everyone else.

Intel has had similar failures, like its TSX instruction set, which was only fixed in the next stepping.

It's not clear if AMD can push out a microcode patch to fix this.


Some context relating to Intel:

"We saw some really bad Intel CPU bugs in 2015, and we should expect to see more in the future"

https://danluu.com/cpu-bugs/ https://news.ycombinator.com/item?id=10877270


Can someone please fix the title? Crashes -> Freezes.

Big difference.

And summarized: Intel has about 80 known similar problems, as outlined in its errata, for which no known fixes exist. For this AMD bug a BIOS update fix exists (more voltage).


Patient: Dr. It hurts when I do "this." Dr.: Don't do that.


Is it known if the microcode patch affects performance?



