
Users report parallel compiling is causing segfault on Ryzen Linux - rnhmjoj
http://phoronix.com/scan.php?page=news_item&px=Ryzen-Compiler-Issues
======
examancer
Ryzen Linux user here. I haven't experienced these issues yet, but I have
hit a few growing pains: early BIOS revisions weren't 100% stable for me,
and I had RAM speed and timing challenges. Mostly resolved, though RAM
speed is still slightly shy of XMP settings.

System is overclocked (1700@3.8) and has been up and 100% solid for weeks now.
3.85 actually worked as well, and I tried to stress it by compiling a bunch of
stuff. Didn't have any segfaults or other issues. Worked great.

Only after using an artificial stress tool (stress-ng) did I finally decide
3.85 was not 100% stable at stock volts. Backed off to 3.8 to avoid a voltage
increase for now. Haven't rebooted since.

The issues being reported do seem legitimate, however. Not sure if it's the
memory controller having trouble with certain DDR4, the motherboards, or
errata within the Ryzen CPU itself. All seem plausible. Hopefully AMD finds a
resolution. In the meantime I'm glad I'm not affected.

------
abbeyj
This reminds me of the old bug in AMD K6 processors that also tended to show
up only during long compiles on Linux, and only on machines with more than
32MB of RAM.
[https://web.archive.org/web/20120515215109/http://membres.mu...](https://web.archive.org/web/20120515215109/http://membres.multimania.fr/poulot/k6bug.html)

------
octoploid
It appears to be an issue with Ryzen's new micro-op cache and "CMP/TEST
conditional jump" instruction fusion.

See comment from inuwashidesu in this thread:
[https://www.reddit.com/r/programming/comments/6f08mb/compili...](https://www.reddit.com/r/programming/comments/6f08mb/compiling_with_ryzen_cpus_on_linux_causing_random/)
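
For anyone unfamiliar with the term: this kind of fusion (often called
branch fusion or macro-op fusion) has the decoder combine a flag-setting
CMP or TEST with the conditional jump that consumes its flags into a single
internal operation. As a rough illustration, ordinary C like the following
compiles to exactly such compare+branch pairs - whether any given pair
actually gets fused depends on the compiler output and the CPU:

```c
/* Illustration only: both the loop test and the guard compile to a
 * CMP (or TEST) immediately followed by a conditional jump - the
 * pattern that fused decode paths and the micro-op cache must handle. */
long count_below(const long *a, long n, long limit)
{
    long hits = 0;
    for (long i = 0; i < n; i++) {  /* cmp i,n      + conditional jump */
        if (a[i] < limit)           /* cmp a[i],lim + conditional jump */
            hits++;
    }
    return hits;
}
```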

~~~
i336_
Direct permalink:
[https://www.reddit.com/r/programming/comments/6f08mb/compili...](https://www.reddit.com/r/programming/comments/6f08mb/compiling_with_ryzen_cpus_on_linux_causing_random/dieuoad/)

Couldn't find the comment with that link for some reason - thanks for
mentioning the username, had to get it from their account page :)

------
i336_
Computer science-y question.

Initially this question was going to be "can we log executed instructions",
but I rapidly realized that not even DDR5 could keep up with such a logging
system - it would slow things down too much and likely mask the bug (not to
mention the TBs of space that would be needed).

Rethinking a bit, my 2nd take is to see if it's possible to somehow repeatedly
synthesize workloads from (presumably smaller, more manageable amounts of)
seed data.

One of the users in the AMD forum thread (I don't seem to be able to get a
permalink) mentions that they're experiencing gcc crashes on Ubuntu inside
VMware on Win10! This means the bug fits inside _two_ kernels'
preemption/task scheduling _and_ a hypervisor! Interesting.

What stumps me is that some users are experiencing gcc segfaults, while others
are getting faults in `sh`.

...yeah this has me stumped. CPUs are so fast, and we have no idea where the
problem is.

EDIT: This comment is interesting:
[https://www.reddit.com/r/programming/comments/6f08mb/compili...](https://www.reddit.com/r/programming/comments/6f08mb/compiling_with_ryzen_cpus_on_linux_causing_random/dieuoad/)

~~~
todd8
In my experience, tracing execution paths is possible, but it isn't really
logging every instruction.

To isolate the fault, one can use a kind of binary search on the program
containing the fault. By inserting a single very lightweight tracing
instruction that records its own execution, one can determine whether the
fault happens before or after the tracing instruction is reached. The tracing
instruction can then be moved into the half of the code where the fault
happened. By repeatedly dividing the code into smaller regions, one can
eventually narrow the location down to a small sequence of instructions that
might contain the problem.

Of course, doing this for a problem like this involves overcoming a number of
large obstacles. First, the fault the OP is talking about appears to be
somewhat unpredictable. This means we have to keep records of each tracing
instruction's execution, and we need multiple tracing instructions in the
code to see where the processor was really executing when the fault happens.
Two things help narrow down the region: a good understanding of the code's
organization as basic blocks (roughly, sequences of machine instructions
without branches), and a programmatic way of working out where the fault must
have occurred from the execution counts of all the different tracing
instructions. Compilers, like GCC, can be used to systematically instrument
the code.
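
To make the idea concrete, here's a minimal sketch in C of the software
side. The names (TRACE, trace_counts) are made up for illustration, not
from any real tool; in practice the counters would live in a file-backed or
shared mapping so they survive the crash of the process under test:

```c
#include <stdint.h>

/* One counter per trace point. For a real run this array would sit in
 * a shared-memory or file-backed mapping so the counts outlive a
 * crashing process. */
#define MAX_TRACE_POINTS 8192
static volatile uint64_t trace_counts[MAX_TRACE_POINTS];

/* The "very lightweight tracing instruction": one increment. If the
 * fault occurs with TRACE(n) executed but TRACE(n+1) not, the problem
 * lies between them; move the trace points into that region and
 * repeat - a binary search over the code. */
#define TRACE(id) (trace_counts[(id)]++)

int example(int x)
{
    TRACE(0);
    x = x * 2 + 1;    /* region A */
    TRACE(1);
    x ^= x >> 3;      /* region B */
    TRACE(2);
    return x;
}
```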

How can tracing instructions be of such low impact that they don't interfere
with the fault being searched for? There is no guarantee that the attempt to
measure or detect faults won't hide them, but lightweight tracing can be done
with pretty simple tracing hardware (simple, that is, for companies that make
computers, like IBM or HP). Basically, a device/card is plugged into one of
the addressable buses (the memory bus, or an I/O bus like PCI). The tracing
hardware simply looks for bus addresses in a range reserved for tracing -
say 8k of addresses, allowing up to 8k tracing instructions to be scattered
around the code. The tracing hardware can then record, in its own separate
memory, the last few million of these addresses that appear on the bus in
this unused range. The tracing instructions inserted into the program under
test (in this case, say, bash) will depend on the architecture, in this case
amd64. I'm not familiar enough with all of the new instructions available on
the latest processors, but an instruction like set-memory-to-zero would work.
The specific instruction doesn't really matter to the tracing hardware; it
ignores the instruction and just looks for an address in the special range
on the bus.
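
As a hedged sketch of what the software side of such a bus-visible trace
point could look like in C - the window, its size, and its setup are
assumptions for illustration, not a real interface. On a real system the
reserved range would be mapped uncacheable (e.g. through /dev/mem or a small
driver) so that every store actually appears on the bus:

```c
#include <stdint.h>

/* HYPOTHETICAL: an 8 KiB reserved, uncached address window watched by
 * the external tracing hardware. It would be set up at program start
 * by mapping an unused physical range (mmap on /dev/mem, or a small
 * driver) marked uncacheable, so stores are never absorbed by the
 * cache. */
static volatile uint8_t *trace_window;

/* The tracing "instruction": a single store whose *address* (window
 * base + id) is the datum. The hardware ignores the value written and
 * logs only the address, so up to 8192 distinct trace points fit in
 * the window. */
static inline void trace_point(unsigned id)
{
    trace_window[id & 0x1FFF] = 0;  /* any store works; value unused */
}
```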

Even this fancy tracing hardware is too slow to use in the middle of loops
running in registers or the cache, but by tracing the entrance to and exit
from such sequences, the hardware/software causing the problems can be
isolated.

The same techniques are used for debugging and performance tuning of
operating systems: special hardware traces the operation of, say, the disk
scheduler, and a careful study of the relationship between the code
responsible for scheduling operations on the disk drive and what shows up in
the trace is used to reduce the inefficiencies or problems in the low-level
drivers.

~~~
i336_
Thanks so much for this answer, I just learned a _lot_.

> _There is no guarantee that the attempt to measure or detect faults won't
> hide them..._

I got completely stuck on this in my original ponderings. I totally didn't
think of sprinkling instrumentation instructions into the code and seeing if
the bug still fired. If it did, the approach you described would certainly
work very well (and, indeed, you describe it being widely used).

Major TIL with the address-based tracing hardware idea. That's an awesome
approach, to do it that way... wow. :)

Building something like this would actually be a really cool challenge in
designing a _really_ fast piece of hardware. Considering the kind of access
speed needed, though (particularly with the memory bus approach)... would a
custom ASIC be required? :/ Or could I get away with using a (perhaps
decent/pricey) FPGA?

I say this because it would be awesome to make something like this
inexpensively available for people to put in their workstations. I can
totally see a device like this also having some fast, nonvolatile*
memory-mapped storage for things like infinite logging. For example, the way
Linux handles crashes is to kexec into a new kernel that hopefully fishes the
log out of RAM and saves it. Very clunky. That approach also doesn't handle
early kernel bringup - or even BIOS/EFI bringup; the libreboot folks would
probably love something like this.

(*By "fast, nonvolatile" I mean something that writes straight to a large DDR-
backed buffer and is then quickly yanked onto something like an NVMe disk.)

------
jacquesm
This could be a CPU problem but it could also easily be a memory subsystem or
cooling issue. I really hope someone will get to the bottom of this soon and
that it won't be a CPU issue, that could get expensive for AMD in a hurry.

Edit: and reading the comments in that thread it would be great if people
would remark if they're running stock clocks and if they have upgraded their
BIOS.

------
c2h5oh
It seems to happen on heavily overclocked CPUs. A Phoronix user managed to
replicate the issue he otherwise wasn't experiencing by simply pushing his
overclock a bit further.

~~~
dryatu
Not exclusively on overclocked CPUs.

Segfaults happen even at factory defaults.

~~~
c2h5oh
Interesting. With what AGESA version? I'm asking because anything earlier
than 1.0.0.6 has a 1T command rate, which not all memory modules can handle.

~~~
aidenn0
Is there an easy way to check my AGESA version? I'm on the latest BIOS
revision for my motherboard and I still saw these symptoms.

------
tscs37
Hmm, I've compiled kernels several times on Arch Linux since I got my Ryzen
build together and I haven't experienced this issue at all so far.

Might be affecting only a subset of users based on silicon?

------
wyldfire
> The issue is happening on multiple versions of GCC but I haven't seen any
> reports when using LLVM/Clang or alternative compilers.

So, still to be ruled out is a bug in GCC itself?

~~~
qb45
Considering that it happens only on Ryzen and with multiple GCC versions,
and that the first post in this Gentoo forum thread shows segfaults in bash
(not gcc), it looks rather like a CPU bug manifesting itself under load.

Nobody talked about even trying clang; knowing Phoronix, they just couldn't
resist mentioning it for drama or SEO.

~~~
aidenn0
Ryzen 1700, stock clock, latest BIOS here.

Recompiling bash with gcc-6.3 made it much less likely to happen. After
recompiling my entire system with 6.3, I haven't seen it happen yet, despite
doing 8 hours straight of building.

Also, after recompiling my Linux kernel with gcc-6.3, my ~72-hour kernel
crashes seem to have gone away. I had to reboot for an unrelated reason
after doing this, so I only have two ~144-hour uptimes for my data.

~~~
watt
Sounds like a CPU bug then. Recompiling made it so that a certain
instruction (or sequence of instructions) is no longer hit, avoiding some
kind of hardware problem. But the problem is still there.

~~~
qb45
Not necessarily - it could equally well be a pre-6.3 GCC bug that resulted
in bad code being output.

What makes me think it may be the CPU rather than GCC is that the segfault
seems to happen nondeterministically in a single-threaded bash binary
running the same shell script with the same arguments each time.

Somebody also reported this oddity:

 _Strangely enough, the machine can run mprime for days on end without any
trouble. However, an average run of the Glasgow Haskell Compiler's testsuite
exhibits a handful of failures (typically segmentation faults). Even
stranger, if I run a few mprime threads alongside a run of GHC's testsuite,
mprime will itself sometimes crash with a segmentation fault._

So you have the same mprime binary which either crashes or not depending on
whether GHC tests are running at the same time.

------
raverbashing
Whoever hits the issue should enable the creation of core files and analyse
the corresponding backtraces.

That would help identify the offending instructions.
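
In the shell that's `ulimit -c unlimited` before kicking off the build; a
process can also raise the limit on itself with the standard setrlimit(2)
call. A minimal C sketch:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Raise the core-file size limit so a segfaulting child (gcc, sh, ...)
 * leaves a core dump behind. The backtrace can then be pulled out with
 * `gdb /path/to/binary core` and the `bt` command. */
int main(void)
{
    struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    /* exec the workload here; children inherit the limit */
    return 0;
}
```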

------
Qantourisc
I'm not sure whether I've had this. I had to turn down `-j` on some jobs,
but not all, so it could still be a software problem.

