The original post has been updated as more facts came to light. It confirmed that the bug was reproducible on other machines, and then said:
As much as I had least expected this to be the case, this appears to have been confirmed as an errata in the AMD Zen processor.
Fortunately, it's one that is fixable with a microcode update and will not result in something catastrophic like a recall or the disabling of features.
Btw everybody, if you notice something like this or see a comment by someone who has, and you have a minute to drop us a note at email@example.com, it's a real community service if you do so. Then we can change the URL (or title or what have you) sooner than 19 hours in—or rather, sooner than never, since we only found out about this one by a user telling us. We can't come close to reading all the comments but we do see all the emails and (usually) act on the time-sensitive ones quickly even if we can't reply till a bit later.
See Skylake for example; the list of known errata starts on page 27 and continues through page 63: http://www.intel.com/content/dam/www/public/us/en/documents/...
// Minimum Microcode file.
signal /MEMIO = ....1;
signal MREAD = ...1.;
signal MWRITE = ..1..;
field REGMOVE = XX___;
signal MAR <- PC = 01...;
signal IR <- MEM = 10...;
start OP=XXXX; // A tiny fetch instruction
/MEMIO, MREAD, MAR <- PC; //
hold; // Same as /MEMIO, MREAD, MAR <- PC
hold, -MREAD, IR <- MEM; // Same as /MEMIO, MAR <- PC, IR <- MEM
// End of file.
Sometimes the microcode can apply simple patches, like inserting a nop between problematic instruction combinations. This usually doesn't have much (if any) negative performance impact. The instruction decoder (again, AFAIK) is fairly programmable.
However, if there is a more serious flaw in the actual silicon, the microcode must rewrite the problem instructions to emulate the correct behavior, and that can be extremely slow - think emulating floating point, but not quite as bad.
They ended up having to recall a huge number of motherboards.
Do CPU and GPU manufacturers do any type of fuzzing?
"If your system does not use SERIRQ and BIOS puts SERIRQ in Quiet-Mode, then the weak external pull up resistor is not required. All other cases must implement an external pull-up resistor, 8.2k to 10k, tied to 3.3V"
Depending on the motherboard, those pins may be exposed already. Here's what the fix looks like on a Synology unit: https://www.reddit.com/r/synology/comments/609u1l/c2538_cloc...
Here's an old, heavily cited paper from Intel on the topic; I'm sure their state of the art has advanced considerably in the 17 years since its publication:
YMMV, I have no inside knowledge.
If I identified the right erratum later, it was: if there is a branch in the delay slot of a branch in a delay slot which is mispredicted, the cpu hangs. It was fixed by making this sequence throw an illegal instruction exception. (It was undefined behavior already, I think.)
How did the erratum get fixed? Hardware re-re-release?
Definitely filed crashme away, sounds like a useful tool.
The Sibyte 1250 sounds cute. There's a "650000000 Hz" in https://lists.freebsd.org/pipermail/freebsd-mips/2009-Septem... although I'm not sure if MIPS from circa 2002 (?) was that fast (completely ignorant; no field knowledge - definitely want to learn more though).
I also noted the existence of the BCM91250E via http://www.prnewswire.com/news-releases/broadcom-announces-n...
sibytetm-bcm1250-mipsr-processor-to-accelerate-customers-software-and-hardware-development-75841437.html, which was kind of cool. I like how the chip is a supported target for QNX :)
Now I'm wondering what you were using it for. Some kind of network device? (I think I saw networking as one of its target markets.)
As an aside, I'm also very curious what HN notifier you use, if you use one. (I use http://hnreplies.com/ myself, but it's sometimes slow. I saw your message after 4 minutes in this case (fast for HN Replies); typing/research + IRL stuff = delay :) )
You can guess how it worked out: Sibyte was late, slow, and buggy. Even simple code sequences like hand-paired multiply-adds would start running at 1/2 speed after an interrupt. Our experienced compiler team was unable to get good perf on several SPECcpu benchmarks despite the code looking good. (Fred didn't have much hair left to pull out!)
Soon after we raised our A round we pivoted to using Opteron for compute, building an ASIC for a specialized MPI network, and Fred's team did an Opteron backend for the compiler.
The descendant of the network is now called Intel Omni-Path, and is on-package for Xeon Phi cpus.
I'm curious whether interrupt handling was done underneath the multiprocessing layer, or whether interrupts were simply hammering the pipeline design. (I assume by "an interrupt" you mean the slowdown within the fraction of a second after a given interrupt occurred, in the context of floods of interrupts interspersed between instructions?)
Very cool to hear that Intel snapped up what you eventually managed to ship, FWIW - and that you were able to pivot in the way you did. Also interesting to hear about Opteron use in the field, my experience is only with tracking the consumer sector.
Also, the "GPLed SGI compiler" part you mentioned caught my eye and led me to the EKOpath compiler, and particularly its OSS release in 2011: http://web.archive.org/web/20110616135434/http://pathscale.c...
After some floating back and forth I found https://www.phoronix.com/forums/forum/software/general-linux... which a) mirrored my questioning exactly and b) contained a very nice and straightforward answer, so I guess that put closure on that. Really sad that it never really took off though; faster compilers (https://lwn.net/Articles/447541/ in/from https://lwn.net/Articles/447529/) are always something I'm looking for :)
Unfortunately fuzzing ultimately has a random component, which doesn't really prove that you got all of these bugs.
> Unfortunately fuzzing ultimately has a random component
An AVX2 instruction that accepts 3 registers and outputs 1 has a 1024-bit problem space to validate.
This occurred in FMA3, with a ~512-bit problem space.
Repeat for _every_ instruction (HUNDREDS of them) and you can see how a few bugs slip through the cracks. The problem space is as large as some cryptographic functions! I'm honestly surprised we don't see more of them.
If you read the source (TechPowerUp is terrible; why it isn't banned is beyond me), it is actually an _illegal instruction_ error, which just crashes the current application.
Also, with SMT disabled the error doesn't occur, and with the chip not overclocked the error doesn't occur.
So it is very clearly power related.
> this always hard freezes the computer:
>- At all clock speeds.
>- When running single-threaded, it happens to any core that I pin it to.
It should be at worst an illegal instruction, but instead the whole core freezes, even on underclocked computers.
So if you're lucky enough to have code that can benefit from 256-bit vectors, then AVX will double your performance and power draw. The thermal management system will then throttle you back down to regular power and about 80% of your regular clock speed, for a net gain of merely 60%. Which is really nice if you happen to be doing all 256-bit-wide vector math. If you're only infrequently using vectors, your vector speedup will be smaller, but so will the increase in power that needs to be overcome, so it's still a net win.
On top of that is a CPU frequency multiplier that is variable and the upper limit is unlocked on certain processor and motherboard combinations. If you have an unlocked multiplier you can set the maximum multiplier for each of the Turbo states (1-core, 2-core, through all-cores). You could configure them to all be the same multiplier, or just scale up the default behavior of running at a higher clock speed when fewer cores are active.
For similar reasons server processors (Xeon E5, E7 series) throttle the core clock in AVX "mode".
AVX2 might have also been just a bit too late to be integrated outside the microcode.
As usual, actual achieved clock speeds depend on thermal performance and are not fixed. If cooling is completely insufficient, the clock frequency will drop below base clocks, i.e. thermal throttling.
To clarify: This isn't a secret, but part of the specs and even in some Intel slides. (However, platform and processor specs, even at a basic (product D/S) level, tend to be a rather lengthy read, so most do away with reading all that).
Nitpick, and certainly a matter of definition, but it's not clocking the processor down; it's just unable to turbo like it can with other loads.
There is the normal base frequency (which is, again, the one people see on the box; Intel calls it the "marked TDP frequency").
There is the AVX base frequency, which is lower.
If you execute all-AVX workloads, it will clock down to the AVX base frequency, which is below the marked TDP frequency.
They are very explicit about this: "Workloads using Intel AVX instructions may reduce processor frequency as far down as the AVX base frequency to stay within TDP limits."
I.e., if you buy a 2.3GHz processor and run all-AVX workloads on it, it may operate at 1.9GHz.
-- Keeping my earlier comment but it seems there's a microcode fix in testing currently.
Assuming most of the testing budget is spent on the most common and likely instruction-sequence patterns is probably not unrealistic. Still, it's a bit odd that such a short sequence of repeats of the same instruction failed.
Which got me thinking.
With more than 2000 opcodes across x86-64 and legacy instruction sets, testing all 3-sequences with one input each is more than 8G sequences; covering that with a (guesstimated) average of 64 bits of input per sequence is a staggering ~10^29 cases, or roughly 2T CPU-years at 4 Gops/s.
Not easy to test all that, even if the estimate is off by a couple of orders of magnitude in the right direction!
It might be that valid 2-sequences are the most that can be even close to exhaustively tested while also covering some significant fraction of the operand space?
The fact that modern CPU's work at all given their complexity in so many different areas looks pretty darn close to magic, even from a pretty close distance.
For a personal machine it's probably not terrible if you are using Linux or something else where you can compile everything yourself. But running binaries built by someone else would be a crapshoot. I wonder how many games (of the Windows, AAA variety) use these multiply instructions?
There's also not a lot of detail in the article—like if the data has to be specific or just the combination of instructions is enough to crash it.
Does this mean disabling SMT will also fix the bug, or is that specific to this app?
I've heard that some things improve performance on Ryzen with SMT off, supposedly because OS-level task schedulers need to be better optimized for it. But I still wonder whether Ryzen's SMT implementation is on par with Intel's first implementations.
CPU errata like this aren't uncommon.
$ dmesg | grep -i microcode
[ 0.000000] microcode: microcode updated early to revision 0x1c, date = 2015-02-26
[ 1.415548] microcode: sig=0x306a9, pf=0x10, revision=0x1c
[ 1.415665] microcode: Microcode Update Driver: v2.2.
The fact that most users never hit these bugs is because modern OSes already apply microcode patches early in boot (as in the case above).
Of course, if AMD can't fix these bugs without performance regressions (remember the infamous TLB bug in the early Phenoms?) it can be pretty bad. However, I don't think the majority of users need to be too cautious about it.
 - https://en.wikipedia.org/wiki/Intel_Quark#Segfault_bug
(I also had a Phenom II for many years without any mysterious crashing problems, even though the chip was the correct revision and there was no BIOS erratum workaround enabled.)
 - http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf page 66
(While it doesn't apply for this specific example, it's still the reason why one generally will see microcode-fixed errata go away rather quickly, independently of BIOS updates)
But whether this specific sequence is ever run, that's another question, of course.
And no, this specific sequence isn't ever run. If it were, they'd have found it while designing the chip. I checked the original source and it's hand written FMA3 intrinsics that don't correspond to a real computation.
Intel has mostly fixed this in more recent chips so that now they look ahead for upcoming AVX instructions and just slow down.
You'd be hard pressed to find any computing application that doesn't take advantage of FMA instructions.
Intel has had similar failures, like its TSX instruction set, which was only fixed in a later stepping.
Not clear if AMD can push out a patch to the microcode to fix this.
"We saw some really bad Intel CPU bugs in 2015, and we should expect to see more in the future"
And summarized: Intel has about 80 known similar problems outlined in its errata for which no known fixes exist. For this AMD bug, a BIOS update fix exists (more voltage).