Ryzen Linux user here. I haven't experienced these issues yet, but I have hit a few growing pains: early BIOS revisions weren't 100% stable for me, and RAM speed and timings took some work. Mostly resolved, though RAM speed is still slightly shy of the XMP settings.
The system is overclocked (1700 @ 3.8) and has been up and 100% solid for weeks now. 3.85 actually worked too, and I tried to stress it by compiling a bunch of stuff. Didn't have any segfaults or other issues; it worked great.
Only after using an artificial stress tool (stress-ng) did I finally decide 3.85 was not 100% stable at stock volts. Backed off to 3.8 to avoid a voltage increase for now. Haven't rebooted since.
The issues being reported do seem legitimate, however. Not sure if it's the memory controller having trouble with certain DDR4, the motherboards, or errata within the Ryzen CPU itself. All seem plausible. Hopefully AMD finds a resolution. In the meantime I'm glad I'm not affected.
Initially this question was going to be "can we log executed instructions" but I rapidly realized that not even DDR5 could keep up with such a logging system - it would slow things down too much and likely mask the bug (not to mention the TBs of space that would be needed).
Rethinking a bit, my 2nd take is to see if it's possible to somehow repeatedly synthesize workloads from (presumably smaller, more manageable amounts of) seed data.
One of the users in the AMD forum thread (I don't seem to be able to get a permalink) mentions that they're experiencing gcc crashes on Ubuntu inside VMware on Win10! This means that the bug fits inside two kernels' preemption/task scheduling and a hypervisor! Interesting.
What stumps me is that some users are experiencing gcc segfaults, while others are getting faults in `sh`.
...yeah this has me stumped. CPUs are so fast, and we have no idea where the problem is.
Since programs are deterministic, knowing the initial parameters should, in theory, be enough, but I'm not sure whether the internal translation to microcode is still deterministic. Considering that the initial conditions for the kernel and every other part of the system would have to be reset in hardware and rebooted for every run until the bug is triggered, I'm not sure how feasible this approach would be.
Edit: My comment doesn't even get to your question. What I mean is that reproducing the bug at all is quite hard, even before you get to recording anything.
> Initially this question was going to be "can we log executed instructions" but I rapidly realized that not even DDR5 could keep up with such a logging system - it would slow things down too much and likely mask the bug (not to mention the TBs of space that would be needed).
Yes, the instruction fetching is quite a big part of all memory/cache reads :)
From my experience, tracing the execution paths is possible, but it isn't really logging every instruction.
To isolate the fault, one can use a kind of binary search on the program containing it. By inserting a single, very lightweight tracing instruction that records that it was executed, one can tell whether the fault happens before or after that point. The tracing instruction can then be moved into the half of the code where the fault happened. By repeatedly dividing the code into smaller regions, one can eventually narrow the location down to a small sequence of instructions that might contain the problem.
Of course, doing this for a problem like this involves overcoming a number of large obstacles. First, the fault the OP is talking about appears to be somewhat unpredictable. This means we have to keep records of how often each tracing instruction executes, and we need multiple tracing instructions in the code to see where the processor was really executing when the fault happens. A good understanding of the code's organization into basic blocks (roughly, sequences of machine instructions without branches), plus some programmatic way of analyzing the execution counts of all the different tracing instructions, helps narrow down the region where the fault happens. Compilers like GCC can be used to instrument the code systematically.
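To make the shape of that concrete, here's a minimal sketch in C of what counter-based instrumentation could look like. Everything in it (the TRACE macro, the IDs, suspect_function) is made up for illustration; real instrumentation would be injected by the compiler or a script at basic-block boundaries rather than written by hand.

```c
/* A minimal sketch of the counter-based tracing described above.
 * TRACE(), the IDs, and suspect_function() are all made up for this
 * example; real instrumentation would be injected by the compiler or
 * a script at basic-block boundaries. */
#include <stdint.h>
#include <stdio.h>

#define N_TRACE_POINTS 8192

/* One counter per trace point: cheap to bump, easy to inspect later. */
static volatile uint64_t trace_count[N_TRACE_POINTS];

#define TRACE(id) (trace_count[(id)]++)

static int suspect_function(int x)
{
    TRACE(100);              /* reached the function    */
    int y = x * 3;
    TRACE(101);              /* survived the first half */
    y ^= x >> 1;
    TRACE(102);              /* survived the second half */
    return y;
}

int main(void)
{
    int acc = 0;
    for (int i = 0; i < 1000000; i++)
        acc += suspect_function(i);

    /* In a real investigation these would be pulled out of a core dump
     * or a shared memory segment after the fault; here we just print them. */
    for (int id = 100; id <= 102; id++)
        printf("trace %d hit %llu times\n", id,
               (unsigned long long)trace_count[id]);
    printf("acc = %d\n", acc);   /* keep the loop from being optimized away */
    return 0;
}
```

The key property is that each marker is a single cheap memory increment, and a pattern like "marker 101 fired N times but marker 102 only N-1 times" tells you which region to subdivide next.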
How can tracing instructions be of such low impact that they don't interfere with the fault being searched for? There is no guarantee that the attempt to measure or detect faults won't hide them, but lightweight tracing can be done with pretty simple tracing hardware (simple, that is, for companies that build computers, like IBM or HP). Basically, a device/card is plugged into one of the addressable buses (the memory bus, or an I/O bus like PCI). The tracing hardware simply looks for bus addresses in a range reserved for tracing, say 8k of addresses, allowing up to 8k tracing instructions to be scattered around the code. It can then record, in its own separate memory, the last few million of these addresses that appear on the bus in that unused range. The tracing instructions inserted into the program under test (in this case, say, bash) will depend on the hardware, here amd64. I'm not familiar enough with all of the new instructions available on the latest processors, but an instruction like set-memory-to-zero would work. The instruction itself doesn't really matter to the tracing hardware; it ignores the instruction and just looks for an address in the special range on the bus.
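For what it's worth, here's a rough sketch of what the software side of that could look like on Linux, assuming such a capture card existed. The physical base address, the use of /dev/mem, and the capture device itself are all hypothetical; the point is just that each trace point is a single store whose address (not value) identifies it to hardware watching the bus.

```c
/* Sketch of the bus-visible variant: each trace point stores to one
 * address inside a reserved window, and an external analyzer watching
 * the bus records which addresses in that window go by.  The physical
 * base address, the use of /dev/mem, and the capture card itself are
 * assumptions for illustration; on real hardware you'd need a genuinely
 * reserved, uncached range so the stores actually reach the bus. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define TRACE_PHYS_BASE 0xF0000000UL   /* hypothetical reserved window   */
#define TRACE_WINDOW    8192           /* up to 8k distinct trace points */

static volatile uint8_t *trace_base;

/* The value written is irrelevant; only the address matters to the
 * capture hardware, which is why a simple store of zero is enough. */
#define TRACE(id) (trace_base[(id) % TRACE_WINDOW] = 0)

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *map = mmap(NULL, TRACE_WINDOW, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, (off_t)TRACE_PHYS_BASE);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    trace_base = (volatile uint8_t *)map;

    TRACE(0);   /* "entered main"        */
    /* ... the code under test ... */
    TRACE(1);   /* "reached end of main" */

    munmap(map, TRACE_WINDOW);
    close(fd);
    return 0;
}
```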
Even this fancy tracing hardware is too slow to use in the middle of loops running out of registers or the cache, but by tracing the entry to and exit from such sequences, the hardware/software causing the problems can be isolated.
The same techniques are used for debugging and performance tuning of operating systems: special hardware traces the operation of, say, the disk scheduler, and a careful study of the relationship between the code responsible for scheduling disk operations and what shows up in the trace is used to reduce inefficiencies or problems in the low-level drivers.
Thanks so much for this answer, I just learned a lot.
> There is no guarantee that the attempt to measure or detect faults won't hide them...
I got completely stuck on this in my original ponderings. I totally didn't think of sprinkling instrumentation instructions into the code and seeing if the bug still fired. If it did, the approach you described would certainly work very well (and, indeed, you describe it being widely used).
Major TIL with the address-based tracing hardware idea. That's an awesome approach, to do it that way... wow. :)
Building something like this would actually be a really cool challenge in designing a really fast piece of hardware. Considering the kind of access speed needed, though (particularly with the memory bus approach)... would a custom ASIC be required? :/ Or could I get away with using a (perhaps decent/pricey) FPGA?
I say this because it would be awesome to make something like this inexpensively available for people to put in their workstations. I can totally see a device like this also having some fast, nonvolatile* memory-mapped storage for things like infinite logging. For example, the way Linux handles crashes is to kexec into a new kernel that hopefully fishes the log out of RAM and saves it. Very clunky. This approach also does not handle early kernel bringup - or even BIOS/EFI bringup; the libreboot folks would probably love something like this.
(*By "fast, nonvolatile" I mean something that writes straight to a large DDR-backed buffer and is then quickly yanked onto something like an NVMe disk.)
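To make that slightly more concrete, here's a rough sketch of what appending to such a DDR-backed, memory-mapped log buffer could look like from userspace. The device node name, the buffer layout, and the drain-to-NVMe step are all hypothetical assumptions for illustration.

```c
/* Rough sketch of logging straight into a memory-mapped, DDR-backed
 * buffer.  The device node, the buffer size, and the idea that
 * something later drains it to an NVMe disk are all hypothetical;
 * nothing like /dev/tracelog0 is assumed to actually exist. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (16u * 1024 * 1024)   /* assumed size of the card's buffer */

struct log_buf {
    volatile uint64_t head;            /* next free offset into data[] */
    char data[];
};

static struct log_buf *buf;

static void log_line(const char *msg)
{
    size_t len = strlen(msg);
    uint64_t off = buf->head;
    if (off + len + 1 > BUF_SIZE - sizeof(*buf))
        return;                        /* buffer full: drop (or wrap)    */
    memcpy(buf->data + off, msg, len);
    buf->data[off + len] = '\n';
    buf->head = off + len + 1;         /* publish only after the payload */
}

int main(void)
{
    int fd = open("/dev/tracelog0", O_RDWR);   /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    log_line("early bringup checkpoint A");
    log_line("early bringup checkpoint B");
    return 0;
}
```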
This could be a CPU problem but it could also easily be a memory subsystem or cooling issue. I really hope someone will get to the bottom of this soon and that it won't be a CPU issue, that could get expensive for AMD in a hurry.
Edit: and reading the comments in that thread, it would be great if people would note whether they're running stock clocks and whether they have upgraded their BIOS.
It seems to happen on heavily overclocked CPUs. A Phoronix user managed to replicate the issue he wasn't otherwise experiencing simply by pushing his overclock a bit further.
I run a 100% stock clock Ryzen 1700 with the most recent BIOS. It happened for me very reliably after ~45 minutes of compiling. It was nearly always a segfault in bash (and most of the time bash was running libtool).
CPU temperatures were in the upper 50s; downright cold compared to the rather hot (and old) Xeon CPU it replaced.
Interestingly enough, and I'm not the only one to report this, rebuilding the entire system with GCC 6.3 caused the problems to go away (I'm running Gentoo, so this was quite feasible). This is really odd because I was not using any AMD-specific CFLAGS, just the default x86-64 -march.
I'm guessing the problem didn't actually go away, but rather the instruction scheduling of GCC 6.3 is less likely to cause whatever the underlying problem was.
There are multiple examples of people who are not overclocking at all and have gone to great lengths to ensure everything in their BIOS was properly configured. There does seem to be a real issue here. My money is on memory controller and issues with certain DDR4 modules. Hopefully something AMD can sort out with BIOS updates.
Yeah, overclocking tends to fuck up certain corner cases. My friends once told me that when you're debugging with a debugger you should turn off overclocking, because it can really fuck up how the debugger works.
Considering that it happens only on Ryzen and with multiple GCC versions and that the first post in this Gentoo forum thread shows segfaults in bash (not gcc) it looks rather like a CPU bug manifesting itself under load.
Nobody talked about even trying clang; knowing Phoronix, they just couldn't resist mentioning it for drama or SEO.
Recompiling bash with gcc-6.3 made it much less likely to happen. After recompiling my entire system with 6.3, I haven't seen it happen yet, despite doing 8 hours straight of building.
Also, after recompiling my Linux kernel with gcc-6.3, my ~72-hour kernel crashes seem to have gone away. I had to reboot for an unrelated reason after doing this, so I only have two ~144-hour uptimes for my data.
Sounds like a CPU bug then. Recompiling made it so that a certain instruction (or sequence of instructions) is no longer hit, avoiding whatever the underlying hardware problem is. But the problem is still there.
Not really, it could equally well be a pre-6.3 GCC bug which resulted in outputting bad code.
What makes me think it may be the CPU rather than GCC is that the segfault seems to happen nondeterministically in a single-threaded bash binary running the same shell script with the same arguments each time.
Somebody also reported this oddity:
> Strangely enough, the machine can run mprime for days on end without any trouble. However, an average run of the Glasgow Haskell Compiler's testsuite exhibits a handful of failures (typically segmentation faults). Even stranger, if I run a few mprime threads alongside a run of GHC's testsuite, mprime will itself sometimes crash with a segmentation fault.
So you have the same mprime binary which either crashes or not depending on whether GHC tests are running at the same time.
> Considering that it happens only on Ryzen and with multiple GCC versions and that the first post in this Gentoo forum thread shows segfaults in bash (not gcc) it looks rather like a CPU bug manifesting itself under load.
Well, maybe.
x86 is obviously a large target, and there are definitely weird corner cases and stuff that, while often not legal, just happen to work on Intel and not AMD (or vice versa).
I'd still bet on that before I'd bet on a hardware bug.
Certainly, hardware bugs occur, but software tends to be buggier :)
At least the part where a different process crashes sounds really more like a hardware problem, although it could of course also be a Kernel<->Hardware issue.