
Ryzen CPU HyperThreads break if 100% busy and interrupted to top of Memory - sounds
https://svnweb.freebsd.org/base?view=revision&revision=321899
======
StillBored
It's like we have all collectively forgotten that the first release of a
hardware or software project is just an expensive beta. I don't buy 1st gen
microarches because I'm not interested in paying top dollar to be a tester.
What I find odd is that apparently both Intel and AMD have also forgotten
this, as Intel seems to be moving toward making their enterprise customers the
beta testers, while AMD seems so desperate for market share as to have released
Zen as a volume product before releasing it as a high end one. Meaning that if
they have to do a recall, they both lose their entire margin and have to
replace a large number of devices.

~~~
paulmd
AMD's not doing a recall, it works decently enough for most applications.
Their response is going to be "if it crashes your application, turn SMT off".

Consider they didn't even do a recall when Phenom had a showstopping TLB bug,
they shipped a BIOS patch that disabled TLB entirely.

And remember, Epyc is on a new stepping of the silicon, it's possible this is
already fixed on it. (Threadripper is not, however)

~~~
gruez
> Their response is going to be "if it crashes your application, turn SMT
> off".

it happens when SMT is disabled

> Epyc is on a new stepping of the silicon, it's possible this is already fixed
> on it. (Threadripper is not, however)

that's assuming they caught this bug, which I doubt is the case because it's
only being discovered now rather than being documented in the errata.

~~~
dom0
I dunno, there were several issues in the SMT implementation publicized
earlier, it is entirely possible that the root cause is the same or related.

------
rbisewski
I had a similar scenario occur on my Ryzen 1700 w/ Gigabyte B350 Motherboard.
Nothing in the BIOS seemed to help, and updating to the latest version of the
firmware didn't seem to help much.

Eventually I just looked into some kernel docs and decided that setting the
IOMMU mode to pt during bootup might work. Specifically, I added the following
to my grub config.

GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"

Not sure if this will help any of you, but it did completely eliminate the
problems I had.

Shameless plug, wrote a blog about my investigations into it:

[https://ibiscybernetics.com/blog/2017-05-24.html](https://ibiscybernetics.com/blog/2017-05-24.html)
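
For anyone trying the same workaround, here is a minimal Python sketch for confirming that the parameter actually reached the running kernel. It assumes a Linux system where `/proc/cmdline` is readable; the helper names `kernel_params` and `iommu_passthrough_enabled` are made up for illustration. On most distributions the GRUB configuration also has to be regenerated and the machine rebooted before the change shows up.

```python
from pathlib import Path


def kernel_params(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None.

    Simplification: values containing quoted spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def iommu_passthrough_enabled(cmdline):
    """True if the command line contains iommu=pt."""
    return kernel_params(cmdline).get("iommu") == "pt"


if __name__ == "__main__":
    proc = Path("/proc/cmdline")  # Linux only
    if proc.exists():
        print("iommu=pt active:", iommu_passthrough_enabled(proc.read_text()))
```

This only reports whether the flag is present; it says nothing about whether the flag fixes the freezes on a given board.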

------
krylon
I bought a custom-built Ryzen-based PC around Easter, and I have experienced
some issues with it; I am not sure, though, where to put the blame (CPU,
motherboard/firmware, operating system (which is openSUSE Tumbleweed)).

Under heavy load, the machine has performed most gracefully. However, the
machine does freeze (almost) completely when left idle for a while (usually >
1 hour). When it happens, it _sometimes_ still responds to pings, but nothing
more; if I try to ssh into it from my laptop, I do not even receive a TCP ACK.

Unfortunately, I guess, there is no Kernel Panic, so no memory dump I could
inspect or send to somebody who actually knows how to make sense of it.

On the upside, I have gotten into the habit of putting the machine into
standby when I leave it alone for more than a couple of minutes, and I was
pleasantly surprised that Suspend-to-Disk is a very acceptable option with an
M.2 SSD. ;-P

Asus (who built my mainboard) releases firmware updates (which include
microcode updates) on a fairly regular basis, so I hope this problem will be
fixed eventually. I knew there was a risk of something like this happening
when I got this machine, and overall, I am not disappointed. Otherwise, I am
_very_ happy with the machine.

~~~
stefantalpalaru
> However, the machine does freeze (almost) completely when left idle for a
> while (usually > 1 hour).

Try disabling the C6 power state from BIOS:
[https://www.reddit.com/r/hardware/comments/6rklcf/ryzen_segf...](https://www.reddit.com/r/hardware/comments/6rklcf/ryzen_segfaulting_under_heavy_load/dl66o4b/)

~~~
theandrewbailey
I'll definitely try this. I downgraded my latest BIOS because I couldn't stand
the random crashes that had no rhyme or reason (crashed during gaming,
youtube, encrypting my drives, emailing, browsing, among other activities).

------
powercf
I hoped this just affected Ryzen CPUs, but this Reddit post indicates that it
affects Epyc also:
[https://www.reddit.com/r/Amd/comments/6rmq6q/epyc_7551_minin...](https://www.reddit.com/r/Amd/comments/6rmq6q/epyc_7551_mining_performance/)

The first post on AMD's community forum
([https://community.amd.com/thread/215773?start=0&tstart=0](https://community.amd.com/thread/215773?start=0&tstart=0))
is almost three months old, so AMD have known about this for a long time. If
it's not something that can be fixed in a UEFI update, then it's bad news for
everyone: a weakened AMD means more stagnation in amd64.

~~~
old-gregg
So he's got a segfault every couple of minutes, wow... I've been running the
same test for over 4 hours now on my Ryzen 1700 (and I've had several
uneventful 30-40 minute runs before). To date, I only got one "internal
compiler error: Illegal instruction" but no segfaults.

Whatever it is, it doesn't affect every chip the same way.

~~~
userbinator
_Whatever it is, it doesn't affect every chip the same way_

If it is marginal timing in some part of the chip, that combined with
statistical process and environment variations, and the increasingly _tiny_
geometries (which serve to amplify the variation) mean the problem could
really occur quite randomly. Modern CPUs are pushing the limits in more ways
than one, and IMHO this is what happens when they go too far.

------
0x0
Not the first bug reported for Ryzen. Weren't there a couple of others too, one
with Linux locking up and another triggered by the OCaml compiler emitting
opcodes referring to AH/BH/CH/DH registers in a tight loop?

Edit: Sorry, the OCaml bug was Intel Skylake. It's interesting how so many new
CPUs have breaking bugs. Feels like it's been quiet since the original Pentium
F00F bug and then all of a sudden everyone's new CPUs break.

~~~
bloaf
The ocaml compiler found a bug in Intel's Skylake/Kaby Lake. If they found one
for Ryzen too, I haven't heard about it.

[https://lists.debian.org/debian-devel/2017/06/msg00308.html](https://lists.debian.org/debian-devel/2017/06/msg00308.html)

~~~
0x0
You are right, of course.

------
cma
> When a cpu-thread stalls in this manner it appears to stall INSIDE the
> microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread
> cannot take any IPIs or other hardware interrupts while in this state.

So maybe fixable with a microcode update?

~~~
db48x
Most cpu errata are. (Most of the rest are ignorable).

~~~
sebazzz
Aren't microcode updates usually "disable some feature" updates?

~~~
catrabbit
No. That's exceptionally rare and the worst case scenario. See Intel and TSX
(transactional memory).

------
pella
Phoronix: "Ryzen-Test & Stress-Run Make It Easy To Cause Segmentation Faults
On Zen CPUs"

[https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Te...](https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Test-Stress-Run)

~~~
pella
Newer: "50+ Segmentation Faults Per Hour: Continuing To Stress Ryzen"
(5 August 2017)

[http://www.phoronix.com/scan.php?page=article&item=ryzen-seg...](http://www.phoronix.com/scan.php?page=article&item=ryzen-segv-continues)

------
sqeaky
Does this affect only BSDs then?

I have a Ryzen machine I bought and put into a Jenkins cluster. It does builds
on a bunch of VMs occasionally and pegs the CPUs. I have had no issue so far.

~~~
mjevans
It probably affects Windows and every other OS as well. It's just less likely
to be seen by their typical user workloads, or more likely to be written off
as some other issue when it does happen.

------
shmerl
See this thread:
[https://community.amd.com/message/2815893](https://community.amd.com/message/2815893)

------
Glyptodon
For some reason my boss was super gung-ho about Ryzen and so now I have a work
PC that randomly freezes at unknown intervals. Sometimes twice in a day, but
usually more like once a week or week and a half. They're hard freezes, and
typically nothing gets logged - you see a normal entry in syslog for a normal
system event at a random time and then the next entry will be from when you
got in to work and had to hard reset. Pretty sure at this point it's either
CPU or Mobo related (though we don't have the tools to verify the power
supply under load), but no real means of diagnosing the problem.

~~~
londons_explore
I wish motherboards had built in PSU testers. It would be super simple to have
a cheap ADC measuring the voltage on every power line all the time, and then
have some way that software could access the current voltage and minimum and
maximum seen in the last few seconds.

That could then be paired with a BIOS which displays a resettable warning at
boot time if the PSU has been misbehaving.
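
As a rough illustration of the software side of that idea, here is a Python sketch of a per-rail monitor that keeps the minimum and maximum over the last few seconds and latches a warning until it is reset. Everything in it is hypothetical: the `RailMonitor` class, the 5% tolerance, and the 5 second window are assumptions chosen for the example, and the samples are passed in by hand rather than read from a real ADC.

```python
from collections import deque


class RailMonitor:
    """Track min/max voltage of one power rail over a sliding time window."""

    def __init__(self, nominal_volts, tolerance=0.05, window_s=5.0):
        self.nominal = nominal_volts
        self.tolerance = tolerance   # allowed fractional deviation (assumed)
        self.window_s = window_s     # how much history to keep, in seconds
        self.samples = deque()       # (timestamp_s, volts), oldest first
        self.tripped = False         # latched until reset() is called

    def add(self, t, volts):
        """Record one sample and drop samples older than the window."""
        self.samples.append((t, volts))
        while self.samples and self.samples[0][0] < t - self.window_s:
            self.samples.popleft()
        if abs(volts - self.nominal) > self.tolerance * self.nominal:
            self.tripped = True

    def min_max(self):
        """Return (min, max) over the window; needs at least one sample."""
        volts = [v for _, v in self.samples]
        return min(volts), max(volts)

    def reset(self):
        """Clear the latched warning, e.g. after the user acknowledges it."""
        self.tripped = False
```

Feeding a 12 V rail the samples 12.05, 11.98 and 11.2 would latch `tripped`, because 11.2 V is more than 5% below nominal, and the flag would stay set after the dip has aged out of the window. That latching is what lets a warning survive until the next boot screen.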

------
copx
I wonder if the Ryzen bugs can really be fixed with a microcode update..

~~~
nly
They will probably just issue an update to turn off hyperthreading.

~~~
kartD
Even with SMT disabled it causes the faults

~~~
strmpnk
I don't think that is the case for this particular issue. There may be other
reported SMT bugs (still some instability) but here it has to be a pair of
HTs:

    
    
        if one hyperthread is in a cpu-bound loop of any kind
        (can be in user mode), and the other hyperthread is 
        returning from an interrupt via IRETQ ...
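
To see which logical CPUs form such a hyperthread pair on a given machine, something like the Python sketch below could be used. It is Linux-specific (an assumption, since the commit under discussion is for FreeBSD) and relies on the sysfs `thread_siblings_list` files; the function names are made up for this example. It does not reproduce the bug, it only lists the sibling groups.

```python
from pathlib import Path


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0,8" or "0-1,4-5" into a set."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, sep, hi = part.partition("-")
        # A bare number is a range of one; "a-b" is inclusive on both ends.
        cpus.update(range(int(lo), int(hi if sep else lo) + 1))
    return cpus


def smt_sibling_groups(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return the distinct groups of logical CPUs that share a physical core."""
    groups = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_cpu_root).glob(pattern):
        groups.add(frozenset(parse_cpulist(path.read_text())))
    return sorted(sorted(group) for group in groups)


if __name__ == "__main__":
    for group in smt_sibling_groups():
        print(group)
```

With SMT disabled each group should contain a single CPU, which is one way to check whether a crash report really came from a machine without sibling threads.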

------
sondr3
I'm thinking of building a new PC this autumn or spring but news like this
about the new AMD processors is making me a bit uneasy about committing to
them. I don't think bugs like this and the one with GCC crashing (might be the
same) will affect me but it's a risk I'm not sure I want to take.

------
loeg
Note that we still don't really know what's going on; just that this change
seems to alleviate the symptoms.

------
hspak
Potentially related to this issue from over 5 years ago?
[https://it.slashdot.org/story/12/03/06/0136243/amd-confirms-...](https://it.slashdot.org/story/12/03/06/0136243/amd-confirms-cpu-bug-found-by-dragonfly-bsds-matt-dillon)

~~~
loeg
Seems awfully unlikely.

------
ChuckMcM
Dang, I guess I'm going to wait for the B stepping of the Ryzen :-)

~~~
bitL
Original Ryzen is B1 stepping. EPYC should already be on a newer one; some say
ThreadRipper is still the old B1. So you might be looking for B3/C stepping
instead ;-)

------
rubatuga
Uh oh, this might be it for AMD this year

