
50+ Segmentation Faults per Hour: Continuing to Stress Ryzen - pella
http://www.phoronix.com/scan.php?page=article&item=ryzen-segv-continues
======
vancan1ty
Note that it appears that a large number of the segfaults which Michael
(phoronix) is reporting may be coming from a software issue. In particular,
people on the phoronix forums are reporting that conftest segfaults are a
known software issue and also one has reported that he was able to reproduce
the conftest segfaults on an intel CPU.

So not saying that there's not a problem with Ryzen, but it is possible a
large number of the errors are false positives arising from a known software
issue.

I would wait until Michael runs the same test on an Intel CPU with no problems
before passing judgement.

~~~
laydn
Agreed. It is a very basic comparison, which should be done _before_ posting
the article. I don't understand why he keeps posting these before running the
exact same test on different processors (an Intel or even a different AMD
should be ok).

~~~
catdog
Most likely to generate clicks.

------
sofaofthedamned
I have the same with a Kaby Lake laptop btw (XPS 13) where random segfaults
happen. This is with the very latest BIOS update from a day or two ago. Modern
CPUs suck.

~~~
emh68
Can we all switch to Xeons and <whatever server CPUs AMD has out now>? Or do
those have similar issues?

~~~
sofaofthedamned
No idea, but honestly I can't be arsed anymore with new stuff when it's so
unreliable.

This is from a few minutes ago (the timestamps are hours old, but that's
because dmesg timestamps don't take sleep time into account):

[Sat Aug 5 13:46:21 2017] ------------[ cut here ]------------
[Sat Aug 5 13:46:21 2017] WARNING: CPU: 0 PID: 16026 at drivers/base/firmware_class.c:1225 _request_firmware+0x51f/0x8a0
[Sat Aug 5 13:46:21 2017] Modules linked in: ccm rfcomm fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun xt_addrtype nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 xt_conntrack ip6table_mangle ip6table_raw br_netfilter bridge stp llc overlay ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc arc4 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media btusb btrtl snd_soc_skl snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core dell_wmi wmi_bmof snd_soc_sst_match
[Sat Aug 5 13:46:21 2017] tpm_crb snd_hda_codec_hdmi snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic snd_compress snd_pcm_dmaengine ac97_bus iTCO_wdt iTCO_vendor_support hid_multitouch mei_wdt dell_laptop dell_smbios dcdbas intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ath10k_pci ath10k_core irqbypass crct10dif_pclmul crc32_pclmul mac80211 ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf ath cfg80211 joydev snd_hda_intel snd_hda_codec rtsx_pci_ms snd_hda_core memstick snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd i2c_i801 soundcore hci_uart intel_pch_thermal mei_me btbcm idma64 btqca mei shpchp processor_thermal_device btintel intel_soc_dts_iosf intel_lpss_pci bluetooth wmi soc_button_array intel_vbtn tpm_tis tpm_tis_core ecdh_generic pinctrl_sunrisepoint
[Sat Aug 5 13:46:21 2017] rfkill int3403_thermal intel_lpss_acpi pinctrl_intel intel_lpss intel_hid int340x_thermal_zone int3400_thermal tpm acpi_als acpi_thermal_rel sparse_keymap kfifo_buf industrialio acpi_pad rtsx_pci_sdmmc mmc_core i915 crc32c_intel i2c_algo_bit drm_kms_helper serio_raw drm nvme nvme_core rtsx_pci i2c_hid video
[Sat Aug 5 13:46:21 2017] CPU: 0 PID: 16026 Comm: kworker/u9:0 Not tainted 4.13.0-0.rc3.git1.2.fc27.x86_64 #1
[Sat Aug 5 13:46:21 2017] Hardware name: Dell Inc. XPS 13 9360/0839Y6, BIOS 1.3.7 07/04/2017
[Sat Aug 5 13:46:21 2017] Workqueue: hci0 hci_power_on [bluetooth]
[Sat Aug 5 13:46:21 2017] task: ffff9d2bf69e8000 task.stack: ffffb77c07014000
[Sat Aug 5 13:46:21 2017] RIP: 0010:_request_firmware+0x51f/0x8a0
[Sat Aug 5 13:46:21 2017] RSP: 0000:ffffb77c07017c50 EFLAGS: 00010282
[Sat Aug 5 13:46:21 2017] RAX: 000000000000002c RBX: ffffb77c07017d18 RCX: 0000000000000000
[Sat Aug 5 13:46:21 2017] RDX: 0000000000000000 RSI: ffff9d2cff40e118 RDI: ffff9d2cff40e118
[Sat Aug 5 13:46:21 2017] RBP: ffffb77c07017cc0 R08: 0000000000000487 R09: 0000000000000007
[Sat Aug 5 13:46:21 2017] R10: fffff404518d9200 R11: ffffffff94313aed R12: ffff9d2c81d13ae0
[Sat Aug 5 13:46:21 2017] R13: ffffb77c07017d10 R14: ffff9d2c94f424e8 R15: ffff9d2c695de068
[Sat Aug 5 13:46:21 2017] FS: 0000000000000000(0000) GS:ffff9d2cff400000(0000) knlGS:0000000000000000
[Sat Aug 5 13:46:21 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sat Aug 5 13:46:21 2017] CR2: 000000fd4a1b4b48 CR3: 00000004536d4000 CR4: 00000000003406f0
[Sat Aug 5 13:46:21 2017] Call Trace:
[Sat Aug 5 13:46:21 2017] ? snprintf+0x45/0x70
[Sat Aug 5 13:46:21 2017] request_firmware+0x37/0x50
[Sat Aug 5 13:46:21 2017] btusb_setup_qca+0x22d/0x410 [btusb]
[Sat Aug 5 13:46:21 2017] ? __pm_runtime_resume+0x5b/0x80
[Sat Aug 5 13:46:21 2017] btusb_open+0x45/0x250 [btusb]
[Sat Aug 5 13:46:21 2017] hci_dev_do_open+0x6c/0x590 [bluetooth]
[Sat Aug 5 13:46:21 2017] ? try_to_wake_up+0x59/0x470
[Sat Aug 5 13:46:21 2017] hci_power_on+0x4e/0x200 [bluetooth]
[Sat Aug 5 13:46:21 2017] ? lock_timer_base+0x81/0xa0
[Sat Aug 5 13:46:21 2017] process_one_work+0x193/0x3c0
[Sat Aug 5 13:46:21 2017] worker_thread+0x4a/0x3a0
[Sat Aug 5 13:46:21 2017] kthread+0x125/0x140
[Sat Aug 5 13:46:21 2017] ? process_one_work+0x3c0/0x3c0
[Sat Aug 5 13:46:21 2017] ? kthread_park+0x60/0x60
[Sat Aug 5 13:46:21 2017] ret_from_fork+0x25/0x30
[Sat Aug 5 13:46:21 2017] Code: 08 48 83 fb 28 74 45 48 8b 8b a0 ee aa 93 e9 e6 fe ff ff 48 c7 c7 a0 ad fa 93 e8 4d 71 2f 00 48 c7 c7 88 15 cf 93 e8 42 86 b6 ff <0f> ff 41 b9 90 ff ff ff e9 06 fd ff ff 48 c7 45 a8 ff ff ff 7f
[Sat Aug 5 13:46:21 2017] ---[ end trace ff5be3b6989ab935 ]---
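
For anyone puzzled by the stale timestamps: the kernel log records seconds since boot, and human-readable output is roughly boot time plus that offset. The log clock does not advance while the machine is suspended, so the printed time falls behind by however long the laptop slept. A rough sketch of the arithmetic, where the boot time, offset, and sleep duration are all made-up numbers for illustration and not taken from the machine above:

```python
from datetime import datetime, timedelta


def dmesg_wall_time(boot_time, log_offset_s):
    """Approximate what a human-readable dmesg timestamp shows:
    boot time plus the seconds-since-boot stamp on the log line."""
    return boot_time + timedelta(seconds=log_offset_s)


def corrected_wall_time(boot_time, log_offset_s, suspended_s):
    """Same conversion, but adding back the time spent suspended,
    which the log clock did not count."""
    return boot_time + timedelta(seconds=log_offset_s + suspended_s)


# All three values below are hypothetical.
boot = datetime(2017, 8, 5, 8, 0, 0)
offset_s = 5 * 3600 + 46 * 60 + 21   # log stamp: 5h46m21s after boot
suspended_s = 3 * 3600               # laptop slept for 3 hours in total

print(dmesg_wall_time(boot, offset_s))                   # what dmesg would print
print(corrected_wall_time(boot, offset_s, suspended_s))  # when it actually happened
```

With these numbers the log line would be three hours behind the real clock, which matches the "hours old" effect described above.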

~~~
diegoprzl
Modern hardware is unreliable, especially under GNU/Linux. Ryzen didn't give
me any problems under Windows, but it did under GNU/Linux. The same can be said
of the i915 GPU driver on Skylake/Broadwell. Neither my ThinkPad T450s nor my
X250 can get past a few weeks of uptime while running Debian. Under Windows
they can go on for months.

~~~
BenjiWiebe
Funny, I've got several Linux systems and the limiting factor to uptime is
power outages.

~~~
diegoprzl
My anecdotal experience is that I can get better uptime using Windows rather
than Linux in consumer computers. The i915 bug made me end up using a Windows
host with Debian in a virtual machine, just so I could trust that my computer
would not freeze every few weeks.

With older computers (T420, X201) and Xeon servers I got months and even years
of uptime. Maybe I just got unlucky with my recent acquisitions, or maybe the
complexity of the new hardware, together with the lack of support for Linux,
means that these kinds of bugs are, and will become, more prevalent in consumer
hardware.

~~~
lelandbatey
I find your experience intriguing, since I've never been able to get my
Windows installations to reliably provide solid uptime. I've had many issues
with inconsistent performance, stuttering, and "slow degradation" of Windows,
while even my 3-4 year old Linux installs feel quite snappy.

At this point I suspect getting reliable computers is a lottery.

------
barrkel
This is a non-article, an exercise in confirmation bias. Without a control
with no segfaults, or a root cause, it's not useful.

~~~
qeternity
In the article (or perhaps the original article that is linked in this one)
they explain that they ran the same tests on Intel processors without any
issues whatsoever...

~~~
VThornheart
Except that people are indeed running the same test on Intel systems and
getting the same segfaults:

[https://www.phoronix.com/forums/forum/phoronix/latest-phoron...](https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/967080-ryzen-test-stress-run-make-it-easy-to-cause-segmentation-faults-on-zen-cpus/page10)

------
theincredulousk
This seems like a shill article because black box testing makes almost no
sense in this situation. These are segfaults, not hardware lockups. Maybe get
a core dump and gdb? You know, debugging.

------
throw2016
At the very least the exact same tests should be run on Intel CPUs.

This will isolate the issue. If Intel CPUs report the same errors, as some
users are reporting, then this is in extremely bad faith.

Phoronix can write 2 articles and spend days on this. Why has he not run the
test on Intel yet?
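
For what it's worth, a like-for-like comparison mostly needs the same command run the same number of times on each machine, with crashes counted the same way. Here is a minimal sketch, not the harness Phoronix uses: `count_segfaults` and the two stand-in commands are hypothetical names, and the "crasher" is just a child process that sends itself SIGSEGV so the counting logic can be checked.

```python
import signal
import subprocess
import sys


def count_segfaults(cmd, runs):
    """Run cmd `runs` times and return how many runs were killed by SIGSEGV."""
    segfaults = 0
    for _ in range(runs):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX, a return code of -N means the child was killed by signal N.
        if proc.returncode == -signal.SIGSEGV:
            segfaults += 1
    return segfaults


# Stand-in workloads (assumptions for illustration only).
CRASHER = [sys.executable, "-c",
           "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
HEALTHY = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    print("crasher:", count_segfaults(CRASHER, 3))
    print("healthy:", count_segfaults(HEALTHY, 3))
```

One caveat: this only sees the signal that killed the top-level process. If the segfault happens in a grandchild (a conftest binary spawned by configure, say), the wrapper usually exits with an ordinary non-zero status instead, so the kernel log would still be needed to count those.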

------
pella
Similar problem on EPYC:

[https://www.reddit.com/r/Amd/comments/6rmq6q/epyc_7551_minin...](https://www.reddit.com/r/Amd/comments/6rmq6q/epyc_7551_mining_performance/)

[https://www.reddit.com/r/Amd/comments/6rrbsp/epyc_confirmed_...](https://www.reddit.com/r/Amd/comments/6rrbsp/epyc_confirmed_to_suffer_from_the_segfault_issue/)

~~~
VThornheart
The segfaults repro on Intel systems as well with his setup:

[https://www.phoronix.com/forums/forum/phoronix/latest-phoron...](https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/967080-ryzen-test-stress-run-make-it-easy-to-cause-segmentation-faults-on-zen-cpus/page10)

The segfaults being seen on that Epyc test are not the fault of the hardware -
either that, or they're a problem that is common between Intel and AMD
processors somehow.

------
verytrivial
In some respects this is vastly better than a once-a-month fault. AMD have a
better chance of reproducing it even if, clock speeds being what they are
these days, there are a tonne of instructions churning between events.

------
pmoriarty
Should people planning to buy a Ryzen system hold off until this is fixed?

~~~
qeternity
The tester in the article says he hasn't had this happen on his daily use
machine, and that it seems to only happen under very strenuous workloads. Most
people should be fine but there's still a risk.

------
geococcyxc
Everybody should be ignoring this "test" by someone who obviously has no idea
what conftest is or does :)

------
yarg
From what I've read so far, he just keeps replicating the issue without really
attempting to mitigate it. Does anyone know if he's tried disabling ASLR yet?
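
In case it helps anyone trying this: on Linux the system-wide ASLR setting is exposed in `/proc/sys/kernel/randomize_va_space` (0 is off, 1 randomizes stack, mmap base, and VDSO, 2 also randomizes the heap), and `setarch` with `-R` runs a single command with randomization turned off. A small sketch of both; the helper names are made up, and it only builds the command rather than running it:

```python
import platform
from pathlib import Path

# Meanings of the kernel's randomize_va_space values.
ASLR_MODES = {
    0: "off",
    1: "on (stack, mmap base, VDSO)",
    2: "full (also heap)",
}


def describe_aslr(raw):
    """Translate the raw contents of randomize_va_space into a label."""
    return ASLR_MODES.get(int(raw.strip()), "unknown")


def current_aslr(path="/proc/sys/kernel/randomize_va_space"):
    """Return the current setting, or None if the file is not available."""
    p = Path(path)
    return describe_aslr(p.read_text()) if p.exists() else None


def without_aslr(cmd):
    """Wrap a command so it runs with address randomization disabled
    for that process only, via setarch -R."""
    return ["setarch", platform.machine(), "-R", *cmd]


if __name__ == "__main__":
    print("ASLR:", current_aslr())
    print("would run:", without_aslr(["make", "-j16"]))
```

If the crashes keep happening at the same rate with randomization off, that at least rules out one variable.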

------
suenyoj
How much money does Intel pay for such articles?

~~~
dna_polymerase
It's funny, to say the least, how these articles pop up merely days before the
1920X launch. The press just received their Threadrippers a few days ago...

Edit: No, I'm not saying this is Intel, but it's sad that their release will be
overshadowed by this. I really hoped for AMD to finally catch up with Intel
because we all could profit from a second big player in the game.

~~~
icelancer
What? Plenty of people are running simple test suites and stress tests on new
CPUs to write up. This is nothing out of the ordinary at all.

