Ryzen Linux user here. I haven't experienced these issues yet, but I have hit a few growing pains: early BIOS revisions weren't 100% stable for me, and RAM speed and timings took some work. Mostly resolved, though RAM speed is still slightly shy of the XMP settings.
The system is overclocked (1700 @ 3.8) and has been up and 100% solid for weeks now. 3.85 actually worked too, and I tried to stress it by compiling a bunch of stuff. Didn't have any segfaults or other issues; it worked great.
Only after using an artificial stress tool (stress-ng) did I finally decide 3.85 was not 100% stable at stock volts. Backed off to 3.8 to avoid a voltage increase for now. Haven't rebooted since.
The issues being reported do seem legitimate, however. Not sure if it's the memory controller having trouble with certain DDR4, the motherboards, or errata within the Ryzen CPU itself. All seem plausible. Hopefully AMD finds a resolution. In the meantime I'm glad I'm not affected.
Initially this question was going to be "can we log executed instructions" but I rapidly realized that not even DDR5 could keep up with such a logging system - it would slow things down too much and likely mask the bug (not to mention the TBs of space that would be needed).
Rethinking a bit, my 2nd take is to see if it's possible to somehow repeatedly synthesize workloads from (presumably smaller, more manageable amounts of) seed data.
One of the users in the AMD forum thread (I don't seem to be able to get a permalink) mentions that they're experiencing gcc crashes on Ubuntu inside VMware on Win10! This means that the bug fits inside two kernels' preemption/task scheduling and a hypervisor! Interesting.
What stumps me is that some users are experiencing gcc segfaults, while others are getting faults in `sh`.
...yeah this has me stumped. CPUs are so fast, and we have no idea where the problem is.
Since programs are deterministic, knowing the initial parameters should, in theory, be enough, but I'm not sure whether the internal translation to microcode is still deterministic. Considering that the initial conditions for the kernel and every other part of the system would have to be reset in hardware and rebooted for every run until the bug is triggered, I'm not sure how feasible this approach would be.
Edit: My comment doesn't even get to your question. What I mean is that reproducing the bug at all is quite hard, even before you get to recording anything.
> Initially this question was going to be "can we log executed instructions" but I rapidly realized that not even DDR5 could keep up with such a logging system - it would slow things down too much and likely mask the bug (not to mention the TBs of space that would be needed).
Yes, the instruction fetching is quite a big part of all memory/cache reads :)
From my experience, tracing the execution paths is possible, but it isn't really logging every instruction.
To isolate the fault, one can use a kind of binary search on the program containing it. By inserting a single, very lightweight tracing instruction that records that it was executed, one can tell whether the fault happens before or after that point. The tracing instruction can then be moved into the half of the code where the fault happened. By repeatedly dividing the code into smaller regions, one can eventually narrow the location down to a small sequence of instructions that might contain the problem.
Of course, doing this for a problem like this involves overcoming a number of large obstacles. First, the fault the OP is talking about appears to be somewhat unpredictable. This means we have to keep records of how often each tracing instruction executes, and we need multiple tracing instructions in the code to see where the processor was really executing when the fault happens. A good understanding of the code's organization into basic blocks (roughly, sequences of machine instructions without branches), plus some programmatic way of analyzing the execution counts of all the different tracing instructions, helps narrow down the region where the fault happens. Compilers like GCC can be used to instrument the code systematically.
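To make the shape of that concrete, here's a minimal sketch in C of what counter-based instrumentation could look like. Everything in it (the TRACE macro, the IDs, suspect_function) is made up for illustration; real instrumentation would be injected by the compiler or a script at basic-block boundaries rather than written by hand.

```c
/* A minimal sketch of the counter-based tracing described above.
 * TRACE(), the IDs, and suspect_function() are all made up for this
 * example; real instrumentation would be injected by the compiler or
 * a script at basic-block boundaries. */
#include <stdint.h>
#include <stdio.h>

#define N_TRACE_POINTS 8192

/* One counter per trace point: cheap to bump, easy to inspect later. */
static volatile uint64_t trace_count[N_TRACE_POINTS];

#define TRACE(id) (trace_count[(id)]++)

static int suspect_function(int x)
{
    TRACE(100);              /* reached the function    */
    int y = x * 3;
    TRACE(101);              /* survived the first half */
    y ^= x >> 1;
    TRACE(102);              /* survived the second half */
    return y;
}

int main(void)
{
    int acc = 0;
    for (int i = 0; i < 1000000; i++)
        acc += suspect_function(i);

    /* In a real investigation these would be pulled out of a core dump
     * or a shared memory segment after the fault; here we just print them. */
    for (int id = 100; id <= 102; id++)
        printf("trace %d hit %llu times\n", id,
               (unsigned long long)trace_count[id]);
    printf("acc = %d\n", acc);   /* keep the loop from being optimized away */
    return 0;
}
```

The key property is that each marker is a single cheap memory increment, and a pattern like "marker 101 fired N times but marker 102 only N-1 times" tells you which region to subdivide next.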
How can tracing instructions be of such low impact that they don't interfere with the fault being searched for? There is no guarantee that the attempt to measure or detect faults won't hide them, but lightweight tracing can be done with pretty simple tracing hardware (simple, that is, for companies that build computers, like IBM or HP). Basically, a device/card is plugged into one of the addressable buses (the memory bus, or an I/O bus like PCI). The tracing hardware simply looks for bus addresses in a range reserved for tracing, say 8k of addresses, allowing up to 8k tracing instructions to be scattered around the code. It can then record, in its own separate memory, the last few million of these addresses that appear on the bus in that unused range. The tracing instructions inserted into the program under test (in this case, say, bash) will depend on the hardware, here amd64. I'm not familiar enough with all of the new instructions available on the latest processors, but an instruction like set-memory-to-zero would work. The instruction itself doesn't really matter to the tracing hardware; it ignores the instruction and just looks for an address in the special range on the bus.
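For what it's worth, here's a rough sketch of what the software side of that could look like on Linux, assuming such a capture card existed. The physical base address, the use of /dev/mem, and the capture device itself are all hypothetical; the point is just that each trace point is a single store whose address (not value) identifies it to hardware watching the bus.

```c
/* Sketch of the bus-visible variant: each trace point stores to one
 * address inside a reserved window, and an external analyzer watching
 * the bus records which addresses in that window go by.  The physical
 * base address, the use of /dev/mem, and the capture card itself are
 * assumptions for illustration; on real hardware you'd need a genuinely
 * reserved, uncached range so the stores actually reach the bus. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define TRACE_PHYS_BASE 0xF0000000UL   /* hypothetical reserved window   */
#define TRACE_WINDOW    8192           /* up to 8k distinct trace points */

static volatile uint8_t *trace_base;

/* The value written is irrelevant; only the address matters to the
 * capture hardware, which is why a simple store of zero is enough. */
#define TRACE(id) (trace_base[(id) % TRACE_WINDOW] = 0)

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *map = mmap(NULL, TRACE_WINDOW, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, (off_t)TRACE_PHYS_BASE);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    trace_base = (volatile uint8_t *)map;

    TRACE(0);   /* "entered main"        */
    /* ... the code under test ... */
    TRACE(1);   /* "reached end of main" */

    munmap(map, TRACE_WINDOW);
    close(fd);
    return 0;
}
```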
Even this fancy tracing hardware is too slow to use in the middle of loops running out of registers or the cache, but by tracing the entry to and exit from such sequences, the hardware/software causing the problems can be isolated.
The same techniques are used for debugging and performance tuning of operating systems: special hardware traces the operation of, say, the disk scheduler, and a careful study of the relationship between the code responsible for scheduling disk operations and what shows up in the trace is used to reduce inefficiencies or problems in the low-level drivers.
Thanks so much for this answer, I just learned a lot.
> There is no guarantee that the attempt to measure or detect faults won't hide them...
I got completely stuck on this in my original ponderings. I totally didn't think of sprinkling instrumentation instructions into the code and seeing if the bug still fired. If it did, the approach you described would certainly work very well (and, indeed, you describe it being widely used).
Major TIL with the address-based tracing hardware idea. That's an awesome approach, to do it that way... wow. :)
Building something like this would actually be a really cool challenge in designing a really fast piece of hardware. Considering the kind of access speed needed, though (particularly with the memory bus approach)... would a custom ASIC be required? :/ Or could I get away with using a (perhaps decent/pricey) FPGA?
I say this because it would be awesome to make something like this inexpensively available for people to put in their workstations. I can totally see a device like this also having some fast, nonvolatile* memory-mapped storage for things like infinite logging. For example, the way Linux handles crashes is to kexec into a new kernel that hopefully fishes the log out of RAM and saves it. Very clunky. This approach also does not handle early kernel bringup - or even BIOS/EFI bringup; the libreboot folks would probably love something like this.
(*By "fast, nonvolatile" I mean something that writes straight to a large DDR-backed buffer and is then quickly yanked onto something like an NVMe disk.)
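To make that slightly more concrete, here's a rough sketch of what appending to such a DDR-backed, memory-mapped log buffer could look like from userspace. The device node name, the buffer layout, and the drain-to-NVMe step are all hypothetical assumptions for illustration.

```c
/* Rough sketch of logging straight into a memory-mapped, DDR-backed
 * buffer.  The device node, the buffer size, and the idea that
 * something later drains it to an NVMe disk are all hypothetical;
 * nothing like /dev/tracelog0 is assumed to actually exist. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (16u * 1024 * 1024)   /* assumed size of the card's buffer */

struct log_buf {
    volatile uint64_t head;            /* next free offset into data[] */
    char data[];
};

static struct log_buf *buf;

static void log_line(const char *msg)
{
    size_t len = strlen(msg);
    uint64_t off = buf->head;
    if (off + len + 1 > BUF_SIZE - sizeof(*buf))
        return;                        /* buffer full: drop (or wrap)    */
    memcpy(buf->data + off, msg, len);
    buf->data[off + len] = '\n';
    buf->head = off + len + 1;         /* publish only after the payload */
}

int main(void)
{
    int fd = open("/dev/tracelog0", O_RDWR);   /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    log_line("early bringup checkpoint A");
    log_line("early bringup checkpoint B");
    return 0;
}
```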
This could be a CPU problem but it could also easily be a memory subsystem or cooling issue. I really hope someone will get to the bottom of this soon and that it won't be a CPU issue, that could get expensive for AMD in a hurry.
Edit: and reading the comments in that thread, it would be great if people would note whether they're running stock clocks and whether they have upgraded their BIOS.
It seems to happen on heavily overclocked CPUs. A Phoronix user managed to replicate the issue he wasn't otherwise experiencing simply by pushing his overclock a bit further.
I run a 100% stock clock Ryzen 1700 with the most recent BIOS. It happened for me very reliably after ~45 minutes of compiling. It was nearly always a segfault in bash (and most of the time bash was running libtool).
CPU temperatures were in the upper 50s; downright cold compared to the rather hot (and old) Xeon CPU it replaced.
Interestingly enough, and I'm not the only one to report this, rebuilding the entire system with GCC 6.3 caused the problems to go away (I'm running Gentoo, so this was quite feasible). This is really odd because I was not using any AMD-specific CFLAGS, just the default x86-64 -march.
I'm guessing the problem didn't actually go away, but rather the instruction scheduling of GCC 6.3 is less likely to cause whatever the underlying problem was.
There are multiple examples of people who are not overclocking at all and have gone to great lengths to ensure everything in their BIOS was properly configured. There does seem to be a real issue here. My money is on memory controller and issues with certain DDR4 modules. Hopefully something AMD can sort out with BIOS updates.
Yeah, overclocking tends to fuck up certain corner cases. My friends once told me that when you're debugging with a debugger you should turn off overclocking, because it can really fuck up how the debugger works.
Considering that it happens only on Ryzen and with multiple GCC versions and that the first post in this Gentoo forum thread shows segfaults in bash (not gcc) it looks rather like a CPU bug manifesting itself under load.
Nobody talked about even trying clang; knowing Phoronix, they just couldn't resist mentioning it for drama or SEO.
Recompiling bash with gcc-6.3 made it much less likely to happen. After recompiling my entire system with 6.3, I haven't seen it happen yet, despite doing 8 hours straight of building.
Also, after recompiling my Linux kernel with gcc-6.3, my ~72-hour kernel crashes seem to have gone away. I had to reboot for an unrelated reason after doing this, so I only have two ~144-hour uptimes for my data.
Sounds like a CPU bug then. Recompiling made it so that a certain instruction (or sequence of instructions) is no longer hit, avoiding whatever the underlying hardware problem is. But the problem is still there.
Not really, it could equally well be a pre-6.3 GCC bug which resulted in outputting bad code.
What makes me think it may be the CPU rather than GCC is that the segfault seems to happen nondeterministically in a single-threaded bash binary running the same shell script with the same arguments each time.
Somebody also reported this oddity:
> Strangely enough, the machine can run mprime for days on end without any trouble. However, an average run of the Glasgow Haskell Compiler's testsuite exhibits a handful of failures (typically segmentation faults). Even stranger, if I run a few mprime threads alongside a run of GHC's testsuite, mprime will itself sometimes crash with a segmentation fault.
So you have the same mprime binary which either crashes or not depending on whether GHC tests are running at the same time.
> Considering that it happens only on Ryzen and with multiple GCC versions and that the first post in this Gentoo forum thread shows segfaults in bash (not gcc) it looks rather like a CPU bug manifesting itself under load.
Well, maybe.
x86 is obviously a large target, and there are definitely weird corner cases and stuff that, while often not legal, just happen to work on Intel and not AMD (or vice versa).
I'd still bet on that before I'd bet on a hardware bug.
Certainly, hardware bugs occur, but software tends to be buggier :)
At least the part where a different process crashes sounds really more like a hardware problem, although it could of course also be a Kernel<->Hardware issue.