ECC RAM on AMD Ryzen 7000 Desktop CPUs (sunshowers.io)
344 points by mmastrac on Oct 9, 2023 | 190 comments



I am in need of a processor upgrade and am really interested in an ECC RAM setup. I read this thread of 2 people arguing on /r/AMD about whether the AMD processors and/or their motherboards actually support ECC, and had no clue if they were right. [1] So this is topical for me!

[1] https://www.reddit.com/r/Amd/comments/lzxqod/list_of_am4_mot...

So does this definitively settle that the AMD+ASRock combo truly supports ECC RAM?


When buying the motherboard, you must check if it declares ECC support.

This is normally specified in the "Memory" section of the specifications, in something like "ECC & Non-ECC, Unbuffered Memory".

Beware of mentions of "On-die ECC", which is present in Non-ECC memories and which is irrelevant.

You also have to buy ECC DDR5 UDIMMs, and you must be careful not to accidentally buy ECC DDR5 RDIMMs, which are incompatible with AM5 motherboards. ECC DDR5 UDIMMs have a width of either 80 bits or 72 bits. The width does not matter, as long as it is not the 64-bit width of Non-ECC DDR5 UDIMMs. (For some vendors it is cheaper to use identical x8 chips in ECC and Non-ECC modules, despite wasting some capacity, which results in an 80-bit width. There is a myth that DDR5 ECC DIMMs must be 80 bits wide. The myth is wrong, because the standard includes 36-bit channels, which result in a 72-bit width for a DIMM with two channels; for instance, Micron makes 72-bit DDR5 ECC UDIMMs.)
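
The width arithmetic above can be sketched in a few lines. This assumes each DDR5 UDIMM carries two channels with 32 data bits each, plus 0, 4, or 8 check bits per channel; the function name is made up for illustration.

```python
# Sketch of the DDR5 UDIMM width arithmetic, assuming two channels per
# module with 32 data bits each (dimm_width is a hypothetical helper).
DATA_BITS_PER_CHANNEL = 32
CHANNELS_PER_DIMM = 2

def dimm_width(check_bits_per_channel: int) -> int:
    """Total module width in bits for a given number of check bits per channel."""
    return CHANNELS_PER_DIMM * (DATA_BITS_PER_CHANNEL + check_bits_per_channel)

for label, check_bits in [("non-ECC", 0), ("ECC, 36-bit channels", 4), ("ECC, 40-bit channels", 8)]:
    print(f"{label}: {dimm_width(check_bits)} bits")
```

Either ECC variant is wider than the 64 bits of a non-ECC module, which is the only thing that matters when shopping.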

The last time I checked, ASUS had the most AM5 motherboards with ECC support. I like the PRIME X670E-PRO WIFI most because it has the best PCIe expandability beyond the slot occupied by the GPU.

However, there are many cheaper MBs for when less connectivity is enough.


> This is normally specified in the "Memory" section of the specifications, in something like "ECC & Non-ECC, Unbuffered Memory".

I've seen allegations that for some vendors, text like this on their non-server/workstation boards turns out to mean that ECC modules will work in the board, but without the actual ECC function.


While there might have existed such vendors, nobody provides more information than that written above in their MB specifications, so it is difficult to know for sure which is the case, before buying a MB.

The only way to gain more confidence is if the downloadable MB manual has an exhaustive description of the BIOS options.

If the BIOS options include one for enabling ECC and perhaps additional related options, e.g. for configuring scrubbing, only then is there complete certainty about ECC support in the MB.

However, most recent MB manuals no longer have a complete BIOS description, so they may not be helpful.

At least with the ASRock or ASUS AM4 or AM5 MBs that I have used, whenever "ECC & Non-ECC, Unbuffered Memory" was specified, the MB really had ECC support.


Sometimes the OS is unaware of ECC support on the hardware as well, e.g. Linux doesn't/didn't enable the Intel edac driver on i3's running on server chipsets using ECC memory, despite the CPU actually utilizing ECC in that scenario (they simply forgot to add them to the whitelist). So edac-util went "No ECC MCs found!", even though the platform worked.


About on-die ECC, as you said avoid it like the plague. Real ECC memory has an extra chip on the stick to support ECC.


There are several plausible levels of "support" where ECC is concerned:

0. Not supported at all (i.e., if you plug ECC RAM, your system won't boot).

1. ECC RAM can be plugged in but the ECC functionality is not used (i.e., there are no relevant traces/circuitry/etc).

2. ECC functionality is present (i.e., the circuitry is there) but it was not validated by the motherboard manufacturer to be functioning correctly (i.e., detecting/correcting errors).

3. ECC functionality is present and was validated by the motherboard manufacturer. This is the level one would expect from the server-grade boards from a reputable manufacturer like Supermicro.

In the case of AMD processors, when you see "ECC supported", it's anyone's guess which level it is. This is in contrast with Intel, where if it says the CPU/chipset supports ECC, then you know it really does. I bet Intel won't allow a motherboard manufacturer to sell a board with a chipset like W680 without validated ECC support.


Ryzen consumer CPUs use the same memory controller as their EPYC counterparts, and have full ECC support if the rest of the platform (i.e., motherboard, DIMMs) also supports ECC.

What separates Ryzen from its professional-grade counterparts is that ECC support is an optional part of the consumer Ryzen platform spec, which means that it's up to the motherboard vendor to enable support for it. Some motherboards don't have any support at all, some have ECC support as an explicitly-advertised feature, and many have ECC support but it's not explicitly advertised (simply listed as a footnote in the manual).

That's different from the way Intel does it, where Intel has explicit control over what the platform's features are, and uses that control to aggressively segment their markets. Intel's approach makes it easier to reason about ECC support as a buyer, but you pay for it with the lack of flexibility compared to AMD's platform (and you literally pay more for ECC).


Yes, this is the correct explanation. Intel basically forbids ECC on consumer HW.


They use safety and correctness for market segmentation. Kinda evil


Not kinda, def evil.


AMD Ryzen APUs don't support ECC unless they're the OEM-only PRO variant.

(Edit: These are way more attractive for home server use because they consume a lot less power than the Ryzen CPUs when idle)


All mobile Ryzen 6000 and Ryzen 7000 APUs (Zen 3 or Zen 4) support ECC, unlike in the earlier generations, where ECC was restricted to PRO variants. Unfortunately, none of the small computers that use them supports ECC on the MB.

It is expected that AMD will launch a desktop Zen 4 APU in the near future. If that happens, it remains to be seen whether ECC will remain enabled in it, like in the current laptop packages, but there seems to be no reason to disable it.


I can no longer edit the previous message, but I must mention that, as pointed out by another poster, some time during the last couple of months AMD changed the specifications of all their mobile Ryzen CPUs.

For at least a year and a half, all the laptop CPU specifications for Ryzen 6000 and Ryzen 7000 included a clear statement that ECC is supported. In the very recent past this statement has been removed from all AMD mobile CPU (non-PRO) specifications, for unknown reasons.


I was looking for this before I posted.

If anyone is looking for one of these standalone, outside of a full build, I bought a Ryzen 7 PRO 4750G about a year or 2 ago from AliExpress. It's been running a homeserver 24x7 since then and I never had any issues with it. YMMV of course.


IIRC AMD doesn't validate the ECC functionality on their Ryzen chips? So you could theoretically end up with defective ECC circuitry if unlucky. I think that was the case for the first generations; it may also have changed, or I may be misremembering.


> IIRC AMD doesn't validate the ECC functionality on their ryzen chips? So you could theoretically end up with a defective ECC circuitry if unlucky.

This is not the case.

Ryzen, Threadripper, and EPYC use a unified memory controller that has fully validated/qualified/supported ECC capability. The only difference between the memory controllers in these CPUs is the number of them (Threadripper and EPYC will have multiple memory controllers to support the extra memory channels).

When AMD claims that ECC on Ryzen isn't validated, they're talking about the platform as a whole, not the CPU specifically. ECC support on consumer CPUs depends on the motherboard supporting it. Unlike on Threadripper and EPYC (and also unlike Intel's approach), ECC is not a guaranteed feature of the Ryzen platform, so people who want that functionality need to explicitly look for motherboards that have it (in the same way that you'd need to explicitly verify PCIe bifurcation support).

However, if the motherboard supports it, ECC on Ryzen is a fully supported, validated, you-can-RMA-if-it-doesn't-work feature.


That was in the past (true for Zen 1 and Zen 2 Ryzen). It is no longer true (after Intel enabled ECC in some desktop SKUs, starting with Alder Lake, AMD responded by validating ECC in all desktop and mobile Ryzens).

Now all Zen 3 and Zen 4 CPUs, both desktop and laptop, have explicit ECC support, which means that ECC must be validated by AMD in all of them.

If any current AMD CPU happened to have defective ECC, that would be a completely defective CPU, which must be replaced by the vendor.

Despite the fact that all laptop Ryzen 6000 and Ryzen 7000 CPUs support ECC, I have not yet seen any AMD laptop or SFF computer with ECC support. On the other hand, it is much easier to find AM5 MBs with ECC support than Intel W680 MBs.


The AMD website says the non-PRO 7x40 laptop Ryzens have no ECC support[1], and the Framework folks have said that AMD told them there’s no ECC on non-PRO 7x40 laptop Ryzens[2]. If plugged in, ECC modules will work, but without the error-correction functions.

(Edited to reflect the current state of the website, as it used to say ECC support was present[3].)

[1] e.g. https://www.amd.com/en/product/13186

[2] https://community.frame.work/t/responded-amd-batch-1-guild/2...

[3] https://web.archive.org/web/20230513075641/https://www.amd.c... (here “FP7r2” is the version for use with interchangeable DDR5 modules rather than with soldered LPDDR5)


This is very weird.

I have saved the page from your link on the 17th of July. In the saved page it is written:

"ECC Support: Yes (FP7r2 only; Requires platform support)"

The same was written for all AMD Rembrandt and Phoenix models, and it has been written in all such mobile CPU specifications at least since the beginning of 2022, so for at least a year and a half, if not more.

The ECC support was available only with DDR5 SODIMM memory, not with LPDDR5 memory, and only in the FP7r2 package, and neither in the FP7 nor in the FP8 packages.

Perhaps AMD has discontinued the FP7r2 package, but this is not mentioned in the specifications. Either that, or they have decided right now that they may charge more for PRO models, or save on testing on non-PRO models.

Either way, it seems a stupid move for AMD to degrade their mobile CPU specifications right now, when Intel is about to launch Meteor Lake, which is likely to be better than AMD Phoenix. AMD mobile Zen 5 will be better than Meteor Lake, but its launch is still about half a year away, supposing that it is not delayed.


> lack of flexibility compared to AMD's platform

You mean flexibility to claim ECC support but not doing any validation to make sure it actually works? Does any AMD motherboard manufacturer actually state that "ECC is supported and has been validated"? I think the muddy waters that AMD has created would warrant such an explicit statement.


> You mean flexibility to claim ECC support but not doing any validation to make sure it actually works?

The actual ECC functionality is part of the memory controller, which is entirely AMD's domain. The ECC functionality of the memory controller is fully validated.

Whether or not ECC support is present on the rest of the platform is the responsibility of the system builder. However, if the platform isn't fully ECC-capable (e.g., you're using non-ECC DIMMs, the motherboard doesn't have the necessary electrical traces, ECC support is disabled in the UEFI, etc.), this will result in the memory controller disabling ECC support, which is visible to the operating system and is something that you -- the end user -- can verify.

> Does any AMD motherboard manufacturer actually state that "ECC is supported and has been validated"?

Yes. ECC support will be listed in the motherboard's spec sheet and manual. There are also motherboards marketed for professional use where ECC support is explicitly advertised in the vendor's marketing.

> I think the muddy waters that AMD has created would warrant such an explicit statement.

AMD's official statement regarding ECC support on Ryzen is that it's supported if the motherboard also supports it. I'm not sure how they can be any clearer without making ECC a mandatory feature of the platform.


Given this, why would anyone buy an AMD CPU if they need ECC?


Can't speak to asrock but asus x570 boards definitely really do ECC, I have a known bad ECC DIMM and it's possible to use it to create correctable and uncorrectable conditions in short order.

Hard to imagine asrock wouldn't run the lines given all the other components are there, but I guess it's conceivably possible. If the kernel says it has ECC, it's right. If not, return the board as defective and use a different vendor.


I use ECC on my X570 and B550 ASRock boards, they have allowed unbuffered ECC for a while.

Kind of a shame they don't have a mini-itx/matx X670E board (only ASUS does).


Supermicro has an interesting AM5 motherboard which I'm planning to start building a workstation around

https://www.supermicro.com/en/products/motherboard/h13sae-mf


I guess that 8/8 split PCIe could also be used for 100 Gbps networking? I suppose a 100 Gbit card requires up to 200 Gbit of bandwidth.
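
A back-of-the-envelope check of that question, as a sketch: it assumes the standard per-lane signalling rates for PCIe 3.0/4.0/5.0 and their 128b/130b line encoding, and it ignores packet and protocol overhead, so real throughput is lower. Which generation the x8 slots actually run at is an assumption to verify against the board manual. Note that PCIe is full duplex, so the figure applies to each direction separately.

```python
# Rough PCIe link bandwidth per direction, ignoring protocol overhead.
GT_PER_S_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}  # PCIe generation -> GT/s per lane
ENCODING_EFFICIENCY = 128 / 130                 # 128b/130b encoding (gen 3 and later)

def link_gbps(generation: int, lanes: int) -> float:
    """Usable line rate in Gbit/s per direction for one PCIe link."""
    return GT_PER_S_PER_LANE[generation] * lanes * ENCODING_EFFICIENCY

for gen in (3, 4, 5):
    rate = link_gbps(gen, 8)
    verdict = "enough" if rate >= 100 else "not enough"
    print(f"PCIe {gen}.0 x8: {rate:.1f} Gbit/s per direction, {verdict} for one 100 Gbit/s port")
```

By this estimate an x8 link at PCIe 4.0 or 5.0 speeds covers a single 100 Gbit/s port in both directions at once, while PCIe 3.0 x8 does not.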


Beautiful board! Shame it has to cost $550. I wonder how large of a heatsink I'd need to make it fanless. :)


Oooh, nice find. Now to find some 5200MHz (or more) Unbuffered ECC dimms...


Just be sure to think in advance about the maximal total amount of memory you'd need, since 5200 MHz is officially supported only for 2 modules (3600 MHz for 4). Discussed a week ago (also ECC issues were mentioned): https://news.ycombinator.com/item?id=37717567.


Hopefully won't be the case for Threadripper which allegedly is due on the 19th



The v-color RAM I linked to in the post is available at up to 5600MHz (that's what I got): https://v-color.net/products/ddr5-ecc-udimm-servermemory?var...


Using ASRock X570 PG 4S + Ryzen 5 2600 + Kingston 32GB 2666 ECC (the CPU/memory support list for this mobo says ECC works in this config). Dmidecode still reports a 128-bit data width instead of 72, but it also reports multi-bit correction instead of single-bit, so ... ;). I'm used to seeing 72 bits on Intel boards with UDIMMs, for example the Supermicro+Xeon that I have. I think that info reflects a combination of memory controller and mobo reporting rather than actual hardware support. Still, EDAC works and registers the proper driver, and once in a while I get warnings from EDAC/RAS that a correctable error has indeed been corrected. If I were to question whether it actually corrects anything, then I should also question the Supermicro+Xeon config, and I don't have any means to check that. However, if it weren't working, I should see that on the ZFS dataset every month or so during scrubbing, and I don't see anything there. So for me it is settled.


Mine reports 128 bits too. Why 128 bits and not 72 bits?


If it helps, the ASRock B550M Pro4 supports ECC. At least when paired with a Ryzen 5600X and 4x 16GB (Kingston KSM26ED8/16HD @3200MHz) memory modules.

It's what I've been using for 15+ months as my main development desktop.

This is how they show up in the system (Linux) from dmidecode:

    Handle 0x000F, DMI type 16, 23 bytes
    Physical Memory Array
            Location: System Board Or Motherboard
            Use: System Memory
            Error Correction Type: Multi-bit ECC
            Maximum Capacity: 128 GB
            Error Information Handle: 0x000E
            Number Of Devices: 4

    Handle 0x001A, DMI type 17, 92 bytes   (this section is repeated 4 times though, once per DIMM stick)
    Memory Device
            Array Handle: 0x000F
            Error Information Handle: 0x0019
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 16384 MB
            Form Factor: DIMM
            Set: None
            Locator: DIMM 1
            Bank Locator: P0 CHANNEL A
            Type: DDR4
            Type Detail: Synchronous Unbuffered (Unregistered)
            Speed: 3200 MT/s
            Manufacturer: Kingston
            Serial Number: [snipped]
            Asset Tag: Not Specified
            Part Number: 9965745-028.A00G    
            Rank: 2
            Configured Memory Speed: 3200 MT/s
            Minimum Voltage: 1.2 V
            Maximum Voltage: 1.2 V
            Configured Voltage: 1.2 V
            Memory Technology: DRAM
            Memory Operating Mode Capability: Volatile memory
            Firmware Version: Unknown
            Module Manufacturer ID: Bank 2, Hex 0x98
            Module Product ID: Unknown
            Memory Subsystem Controller Manufacturer ID: Unknown
            Memory Subsystem Controller Product ID: Unknown
            Non-Volatile Size: None
            Volatile Size: 16 GB
            Cache Size: None
            Logical Size: None
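
For anyone who wants to check these fields without reading the whole dump, here is a minimal sketch that pulls the ECC-related values out of saved `dmidecode --type memory` output. The function name and the sample text are made up for illustration. Keep in mind that these SMBIOS tables are filled in by the firmware, so they are a hint rather than proof that correction is active.

```python
import re

def ecc_summary(dmidecode_text: str) -> dict:
    """Summarize ECC-related SMBIOS fields from `dmidecode --type memory` output."""
    correction = re.search(r"Error Correction Type:\s*(.+)", dmidecode_text)
    totals = re.findall(r"^\s*Total Width:\s*(\d+) bits", dmidecode_text, re.M)
    datas = re.findall(r"^\s*Data Width:\s*(\d+) bits", dmidecode_text, re.M)
    return {
        "correction_type": correction.group(1).strip() if correction else None,
        # Extra bits beyond the data width, one entry per populated DIMM.
        "check_bits": [int(t) - int(d) for t, d in zip(totals, datas)],
    }

# Abridged sample in the same shape as the dump above.
SAMPLE = """\
Physical Memory Array
        Error Correction Type: Multi-bit ECC
Memory Device
        Total Width: 72 bits
        Data Width: 64 bits
"""

print(ecc_summary(SAMPLE))
```

A non-zero entry in `check_bits` means the firmware reports a module wider than its data width, which is what the 72/64 pair above shows.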


The B550 chipset is for AM4/DDR4 boards. For newer AM5/DDR5 motherboards, the ECC support is a lot less clear. As in: several motherboard vendors have claimed ECC support in the past, but no longer do, without publicly stating a reason. Also, some boards do list ECC memory modules in their QVL, but do not claim ECC support; whether this means that the board supports ECC or only that the memory module functions without ECC, nobody knows.


> The B550 chipset is for AM4/DDR4 boards.

Yeah. The comment I'm replying to sounded like they're wanting to still use the AM4 platform. Maybe I misread it, or they've adjusted the post for clarity in the meantime. :)


A motherboard can accept ECC RAM and function but not correct single-bit errors or halt on multi-bit errors as you would expect.
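
To make "correct single-bit errors, halt on multi-bit errors" concrete, here is a toy sketch of a SECDED code: an extended Hamming code over 4 data bits. This is only an illustration of the principle. Real memory controllers use much wider and often more elaborate codes, and none of the names below come from any real ECC implementation.

```python
# Toy SECDED (single error correct, double error detect) code: Hamming(7,4)
# plus one overall parity bit. word[0] is the overall parity, word[1..7] are
# the classic Hamming positions (parity bits at 1, 2, 4; data at 3, 5, 6, 7).

def encode(nibble: int) -> list:
    word = [0] * 8
    word[3], word[5], word[6], word[7] = [(nibble >> i) & 1 for i in range(4)]
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2  # makes the total number of set bits even
    return word

def decode(word: list) -> tuple:
    word = list(word)
    syndrome = 0
    for position in range(1, 8):
        if word[position]:
            syndrome ^= position  # XOR of the positions of all set bits
    overall_parity = sum(word) % 2
    if syndrome == 0 and overall_parity == 0:
        status = "ok"
    elif overall_parity == 1:
        word[syndrome] ^= 1       # single flip; syndrome 0 means word[0] itself
        status = "corrected"
    else:
        status = "uncorrectable"  # two flips: detected, but cannot be repaired
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status

stored = encode(0b1011)
one_flip = list(stored); one_flip[6] ^= 1
two_flips = list(one_flip); two_flips[2] ^= 1
print(decode(stored), decode(one_flip), decode(two_flips)[1])
```

A board that merely tolerates ECC modules never runs the equivalent of `decode`, so both kinds of error pass through silently.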

Some motherboards list ECC RAM support as a feature and have ECC RAM on their QVL for memory. Be warned ECC UDIMMS are expensive.

To check whether ECC RAM is working in Linux, use the dmidecode command.

For monitoring ECC errors use rasdaemon.

Asrock isn't the only brand that reliably supports ECC memory. See below.

https://www.reddit.com/r/homelab/comments/15zuj70/ecc_udimm_...


yes asrock is cheap, good vrms and somewhat good bios. but most of all, they exposed all the amd cbs etc options and do not lock them


Maybe slightly OT, since this concerns AMD's older AM4 platform with a Zen 3 APU core, but working ECC support looks like this and is definitely present on my system:

    $ sudo ras-mc-ctl --errors | tail -n5
    14 2023-08-20 20:16:41 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x64e31c78, cpuid=0x00a50f00, bank=0x00000011
    15 2023-08-23 17:17:49 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x64ea5188, cpuid=0x00a50f00, bank=0x00000011
    16 2023-09-03 16:52:15 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x64f4d227, cpuid=0x00a50f00, bank=0x00000011
    17 2023-09-15 21:37:59 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x65071ed7, cpuid=0x00a50f00, bank=0x00000011

This is with an ASRock B550M-ITX/ac and an AMD Ryzen 5 PRO 5650G. It used to work the same with a Ryzen 5 3600 (using a dedicated GPU for video output) before I upgraded the CPU.

To detect and log ECC activity on modern GNU/Linux, you will want to have the "rasdaemon" service active. It will decode MCE (and other hardware-related) errors and persist them to the database that is shown being queried above.


The frequency of these errors should be enough to cause you to distrust any kind of information output by a computer without ECC.

Edit: thinking about this a bit longer: that frequency is actually so high that you may well have a broken module in there. Note how it is the same module and the same address every time.
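
A quick way to see the "same address every time" pattern is to tally corrected errors by their `addr=` field. This is a minimal sketch that assumes log lines shaped like the ones quoted in the parent comment; the sample lines are abridged copies of those, and the function name is made up.

```python
import re
from collections import Counter

def count_by_address(log_text: str) -> Counter:
    """Count corrected-error log lines per reported address."""
    counts = Counter()
    for line in log_text.splitlines():
        if "Corrected error" not in line:
            continue
        match = re.search(r"\baddr=(0x[0-9a-fA-F]+)", line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts

# Abridged from the output quoted in the parent comment.
SAMPLE_LOG = """\
14 2023-08-20 20:16:41 +0200 error: Corrected error, no action required., CPU 2, mci CECC, memory_channel=0,csrow=0, addr=0x36e701dc0
15 2023-08-23 17:17:49 +0200 error: Corrected error, no action required., CPU 2, mci CECC, memory_channel=0,csrow=0, addr=0x36e701dc0
16 2023-09-03 16:52:15 +0200 error: Corrected error, no action required., CPU 2, mci CECC, memory_channel=0,csrow=0, addr=0x36e701dc0
17 2023-09-15 21:37:59 +0200 error: Corrected error, no action required., CPU 2, mci CECC, memory_channel=0,csrow=0, addr=0x36e701dc0
"""

for address, hits in count_by_address(SAMPLE_LOG).most_common():
    print(f"{address:#x}: {hits} corrected errors")
```

One address collecting every hit points at a weak cell rather than random upsets spread across the module.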


Seems like a semi-stuck bit; it'd definitely cause issues w/o ECC, but it seems to chug along with the circuits doing their job. Best would probably be to add a memory-range exclusion to the kernel at boot to avoid that single area, since the sticks seem good otherwise.
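
On Linux, one way to express such an exclusion is the `memmap=nn[KMG]$ss[KMG]` kernel parameter, which marks a physical range as reserved. Below is a sketch that builds the parameter for the 4 KiB page containing a faulty address. Two assumptions to flag: the helper name is made up, and it treats the logged `addr` as a system physical address, which I'm not certain holds for what rasdaemon reports on AMD memory controller banks.

```python
PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages

def memmap_exclusion(fault_addr: int, pages_each_side: int = 0) -> str:
    """Build a `memmap=` kernel parameter reserving the page(s) around an address."""
    start = (fault_addr // PAGE_SIZE - pages_each_side) * PAGE_SIZE
    size_kib = (2 * pages_each_side + 1) * PAGE_SIZE // 1024
    return f"memmap={size_kib}K${start:#x}"

# Address taken from the log lines quoted earlier in this thread.
print(memmap_exclusion(0x36e701dc0))
```

When the parameter goes into a GRUB config file, the `$` usually needs escaping, so check the resulting kernel command line in `/proc/cmdline` after rebooting.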


I completely forgot you could do that.


Ime ram is either bad or good. I’ve had ecc errors like this and I always ask the DC to replace the ram. After that, 0 errors forever. Same reason why I’m confident a 24 hour memcheck is sufficient for non ECC ram.


Generally the same here, but I have had sticks fail after some time in use. I had to RMA the RAM in my frame.work laptop after it failed. No reason or clue why, but it happened after 6 months or so. No issues with the RMA though, and it has run fine since. ECC, if it were supported there, might have given me a heads-up and saved me from having to restore from backups when the FS got corrupted.


I am fully aware the module is not 100% working, i.e., it is faulty at a specific physical address. That's OK for my personal desktop though, unless the condition worsens, and UCEs (which will panic my kernel) follow.


Any chance it's something that's being affected by temperature?

Along the lines of "computer gets toasty doing work, ecc errors start happening"?

Stuff like that could just mean the memory sticks need pushing in a bit more.


I tried to test and control for that to the best of my ability, but ambient/operating temperature does not seem to be part of the equation.


It'd be cute to log what processes are using that memory at time of error. Fun to speculate about whether a kernel bit flip is better or worse than ones in a web browser, photo editor, spreadsheet, network storage client...


You could test that empirically by setting up a box with the express intent to crash it and then using a chaos monkey like mechanism where you start injecting single bit faults into memory at random addresses. Wonder how long the box would be up before you start noticing something is broken. It would be funny if you accidentally killed the chaos monkey first! Best not use that box for banking...


APUs are specifically excluded from supporting ECC, except on the PRO SKU.

https://www.asus.com/global/support/FAQ/1045186/


I have ECC RAM installed on my Gigabyte B550I system. dmidecode shows the 72 bit width (Total Width: 72 bits) and dmesg | grep -i EDAC does show a bunch of info suggesting ECC is enabled. But this command's output is empty:

    No Memory errors.

    No PCIe AER errors.

    No Extlog errors.

    No MCE errors.

Do I need to enable something so these errors get logged or have I been misled by dmidecode and dmesg?


You are just lucky, and your hardware appears to be working without any problems ;)

Some/most(?) AM4 boards can enable "PCIe AER" (Advanced Error Reporting) in their firmware, which will tell you about stuff going awry while components are communicating over said bus (but every instance of PCIe error I have ever seen, even on rather faulty hardware, was recoverable/correctable), and rasdaemon will also persist those.

I do not know what "Memory errors" are supposed to be, since ECC-related problems will be dropped into the "MCE errors" bucket. Neither do I know what "Extlog errors" are.
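
Independently of rasdaemon, the kernel's EDAC subsystem exposes per-controller counters in sysfs, which gives a second opinion on whether ECC reporting is live. A minimal sketch, assuming the usual `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` files are present (the function name is made up):

```python
from pathlib import Path

def edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Read corrected/uncorrected error counters for each EDAC memory controller."""
    root_path = Path(root)
    if not root_path.is_dir():
        return {}
    counts = {}
    for controller in sorted(root_path.glob("mc[0-9]*")):
        try:
            corrected = int((controller / "ce_count").read_text())
            uncorrected = int((controller / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[controller.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts

print(edac_counts())
```

An empty result suggests no memory controller is registered with EDAC, while controllers that show up with zero counts are consistent with working ECC that simply has had nothing to correct.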


Which memory sticks do you use?


Two sticks of Kingston 9965745-042.A01G


Thanks!


An excellent read. I have ECC RAM on my Threadripper board. One of the things that ops at Blekko found was that you needed to tell Intel boards (NOTE vendor change!) to actually REPORT correctable errors. The default was machine-check if there was an unrecoverable error, otherwise just go with the flow. With ~ 1600 192GB systems we would see correctable errors about once a week as I recall. I don't recall a single unrecoverable error in 6 years so that was pretty good. Greg will surely correct me if I missed one :-).


Your luck was better than mine, but at least our systems reported correctable errors without us having to ask. We had a roughly similar number of machines, probably with similar RAM on average (many with less, and some with much more). We definitely had unrecoverable errors from time to time, maybe one or two a year, enough to have a policy: give it a shot to see if it was a one-off; if it doesn't fail again shortly after, it's probably fine; if it does fail again shortly after, swap the RAM. Nicer server boards even have an LED to tell you which RAM module to replace.

For correctable errors, we wouldn't swap unless the error counts got pretty big; some systems went for a long time with one or two errors a day, which is fine. Others went zero errors for a long time, then a couple days at a small number, and then big numbers. There was one system that managed to get thousands per hour and the system was unusable because handling machine check exceptions was too expensive; unfortunately the reporting interval was one hour, so we didn't realize why the system was unusable until the next report.


Well, to be fair, if we started regularly getting UCEs on a stick of memory we did preemptively replace it (where "regularly" was more frequently than once a week).


Yeah, I'd guess a good number of the systems where we got one UCE, we got another one within a day after reboot. If so, off to repair. If not, it was pretty rare for that machine to cause trouble again; seemed worth the risk to check if repair was needed.

The correctables were trickier to judge, because system halt on UCE is easy to recover from and easy to diagnose (system is down, check console, see UCE message); system slow as heck because of constant machine check exceptions is hard to diagnose and a slow system can disrupt a distributed system a lot more than a dead system.


Greg Lindahl had created a really fascinating system that surfaced anomalies like that to ops so that the systems could be fixed. He had a cute name for them like 'fractures' or something (where it was a non-fatal failure but it was impacting overall performance). Things the system would catch were switch ports going bad, DIMMs going bad, and file system corruption outside of the set of released tools. Disk failure was fairly common and got caught, the drive reformatted, and work resumed quickly. By tracking the rate of disk FS errors for a particular drive we could pick up the "you need to replace this drive" signal as well.


We mostly relied on a couple of easy metrics to signal trouble and then debug from there. Total size of all the Erlang message queues above a threshold was usually an indication of trouble; replication delays of more than a couple seconds too. CPU %, swap % were also signs of trouble.

Tracking down almost working networking without access to the switch metrics was kind of fun, ish. :P


I'm currently on Ryzen 3700X and asus tuf gaming x570 MB. I'd love more single core performance and more nvme/disk speed. Also I have a dual GPU setup, 2 M2 nvme, and 6 sata occupied.

I'm considering upgrading at the end of the year. I briefly considered Threadripper (for its PCIe lanes) until I found there is still no Zen 4 Threadripper and the price is likely to be eye-watering.

So the choice is: upgrade to a Ryzen 5900X and keep the rest of the system, or spend a lot more and upgrade to an AM5 Ryzen plus a new mobo (I considered Intel too, until I found it tops out at 20 PCIe lanes).

I've always been buying mostly gigabyte and asus mobos, but I got burned more than once by them so I might go with another manufacturer this time.

What do you think? Is it worth upgrading to AM5 for just single-core and disk performance? What about Intel? Considering I'd love more PCIe lanes, perhaps for a 10gb adapter (so I can move some of these spinning disks out), Intel is probably not a good choice.

I'm happy enough with the multi-core performance of ryzen 3700x. If I got a new MB I'd definitely want at least two nvmes in a mirror(for speed) - perhaps more and sata ports for my 6 spinning disks.


For your mirrored drives idea: I've had two nvme drives in mirror for a while but there are a lot of caveats to getting maximum performance. Not sure how it is on the AMD side but most Intel boards for example would have one m2 slot connected to the CPU and the other three via the chipset. Which makes them share a common bottleneck (also with your 10gb ethernet card)

For me it was in the end much faster to get a new board with pcie 5.0 support and a large enough single SSD. Total IOPS and throughput both turned out higher than my previous RAID 0 array.


I've got a 5950x and two NVME drives. The setup is the same as you describe. One has dedicated bandwidth and the other shares via the chipset. I just use them separately rather than trying to RAID them since the speed is fast enough for just about everything I need.


Indeed, that's what I ended up doing with the other drives. Use them separately for different data types.


With my personal desktop (2x 1TB NVMe drives), they're mirrored with ZFS for data integrity. Depending on what Roark66 has them mirrored for (speed vs data integrity, etc), the alternative single-drive approach might not be workable.


Well he wrote that it's for speed, so that would be raid0

> If I got a new MB I'd definitely want at least two nvmes in a mirror(for speed)


raid0 isn't really a mirror though.

raid1 is a mirror, and can have speed benefits over just a single drive (able to read from multiple drives in parallel).

But yeah, you might be right. When talking casually, things like "mirror", "raid" etc can be a bit fuzzily applied. :)


You're right. The primary use of mirroring on raid 1 is for redundancy. Raid 0 is striping and more associated with performance.


What kinds of workloads do you run that you are capped by nvme speed?


Loading ML models into GPU memory.


I have a 5900X and I wish I'd waited for the 5800X3D instead.


What’s the rationale for that? Gaming?


Very likely. I jumped from 3700x and the difference is staggering. My 6800xt came alive and stutters are a thing of the past.

I kind of want an OLED but they don't come cheap, vertical resolution is the same, and they are not amazing for coding. Maybe in time for a Zen 5 based 3d cache chip.


>My 6800xt came alive and stutters are a thing of the past.

Potentially because of resizable BAR. 3** CPUs didn't support it, 5** did.


if you need more, go 5950x for double the cores at only 4/3 of the price, or 3d cache cpus for that factorio UPS. 5900x is not that much of an upgrade


As a data point, Hetzner has offered Ryzen 7000 CPU's with ECC ram for months:

https://www.hetzner.com/dedicated-rootserver/matrix-ax

The AX52 has optional ECC ram as an upgrade, while the AX102 comes with ECC by default.

It'd be surprising if they'd been offering ECC that didn't really work.


I think they make their own motherboards (have them made). So they can fully make sure that ECC is supported from start to finish


If only legislators could get their shit together and mandate ECC.

It is unnerving most computing is done in fragile non-ecc systems.

A very messed up form of artificial market segmentation.


>legislators

Haven't you seen those congressional hearings with Zuck and others where congresspeople ask about how to use their phones? How do you expect them to legislate about ECC?


Fragile non-ecc systems?

Something that works absolutely fine 99.9999% of the time is not fragile.


Probability of a bit error increases as memory increases. Opens a large can of worms if control flow breaks.


Yep. 99.9999% correct would mean 8 bit flips per megabyte of data stored in ram. The error rates are (thankfully) much lower than that (otherwise your computer wouldn’t boot). But random bit flips can cause utter havoc if they happen at the wrong time or place. If you download software from the internet, bit flips can introduce weird bugs to your software, only on your computer. (Including in the OS - including your filesystem or drivers). They can corrupt writes to your hard drive, and as a result corrupt your drive or files. Bit flips can quietly change the DNS request your browser sends to cause terrifying security problems. Or edit forms before you send them. There was even a case of a voting machine in Germany accidentally inventing 4096 votes due to a bit flip.

ECC is a really good idea. It’s only expensive right now because it’s a “premium feature”. If it were a standard part of all ram sticks, it’d be cheap and we’d all benefit.
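
A quick check of the "8 bit flips per megabyte" figure, as a minimal sketch. It assumes a decimal megabyte (10^6 bytes) and reads "99.9999% correct" as an independent per-bit error probability of one in a million. Both are illustrative assumptions, not measured error rates, and the function name is made up for this example.

```python
# Expected number of wrong bits if every stored bit were incorrect with
# probability (1 - correct_fraction).
# Assumptions (for illustration only): errors are independent per bit,
# and "megabyte" means 10**6 bytes.

BITS_PER_BYTE = 8


def expected_flips(num_bytes: int, correct_fraction: float) -> float:
    """Expected count of wrong bits among num_bytes of memory."""
    error_rate = 1.0 - correct_fraction
    return num_bytes * BITS_PER_BYTE * error_rate


if __name__ == "__main__":
    per_megabyte = expected_flips(10**6, 0.999999)
    per_64_gb = expected_flips(64 * 10**9, 0.999999)
    print(f"per MB:    {per_megabyte:.1f} bits")
    print(f"per 64 GB: {per_64_gb:.0f} bits")
```

The point of the arithmetic is only that "six nines" is nowhere near good enough for memory; real error rates have to be many orders of magnitude lower for a machine to boot at all.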


Not arguing against ECC, but some of your scenarios seem to be outdated due to crypto. I.e., software you download from the internet is often signed and hashes are validated (Linux package managers, macOS developer certs). Same for DNS requests (DNSSEC) etc. Yes, there is still wiggle room for bitflips to cause problems, but less so than in the past.


DNSSEC is around the 4~5% mark in .com and .net.


True - my bad for referring to DNSSEC; there are other ways you can use encryption for DNS resolving (by using an external DNS server that encrypts using TLS, or simply by using DNS-over-HTTPS). This way you get 100% encryption of your DNS traffic (and thus tamper checks that would detect bitflips). Again, not arguing against ECC, there are valid reasons to want it - I just see fewer and fewer reasons in the consumer market.


Encryption and signing don't protect against memory corruption.

For example, I download software from the internet then hash it. The hash matches. Before the bytes are written to disk locally, a bit flips in RAM. The corrupted data is written to disk and used.

Likewise, DNSSEC doesn't protect you against DNS bitsquatting attacks[1] because the domain name can be changed before the DNS request is made. So the DNS response your computer receives for a-azon.com might be totally valid and signed. It can come through DoH or whatever. The problem is that your browser thought it was the response for amazon.com, and Chrome sends a bitsquatter your Amazon cookies. (Oops).

[1] https://www.youtube.com/watch?v=9WcHsT97suU
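
To make the a-azon.com example concrete, here is a small sketch that enumerates every name reachable from a label by flipping exactly one bit in one character. The function name and the character filter are my own choices for illustration; it ignores finer hostname rules (such as no leading or trailing hyphen) and skips uppercase results, since DNS names are case-insensitive.

```python
# Enumerate labels that differ from a target by exactly one flipped bit.
# Hypothetical helper for illustration; not taken from any bitsquatting tool.
import string

# Characters allowed in the sketch: lowercase letters, digits, hyphen.
VALID = set(string.ascii_lowercase + string.digits + "-")


def single_bit_variants(label: str) -> set:
    """Return all strings reachable from `label` by flipping one bit in
    one character, keeping only results made of VALID characters."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in VALID and flipped != ch:
                variants.add(label[:i] + flipped + label[i + 1:])
    return variants


if __name__ == "__main__":
    variants = single_bit_variants("amazon")
    # 'm' is 0x6D and '-' is 0x2D: they differ only in bit 6.
    print("a-azon" in variants)
    print(sorted(variants))
```

Every name in that list is a legitimate, registrable domain label, which is why signing the response doesn't help: the question itself was corrupted before it was asked.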


Well aware of all of that, but it decreases the chances of corruption (i.e. no corruption during download).


news.ycombinator.com not included in this 4% either


That's just made up statistics - average computer gets multiple errors a day.

This was fine when consumer computers were for games and porn. Now they store birth certificates, submit information to the tax man, documents to court, sometimes deal with matters of life and death


I hear this all the time, that non-ECC machines are just completely fragile, and the math seems to indicate that one should be getting bit flips all the time. And yet, my Intel system has 64GB of non-ECC RAM and it runs half the day every day and is hibernated at night. I run 3D CAD software on it, Photoshop, VS Code with tons of extensions, and software running in WSL2 Docker containers, and I just never really get any errors. Certainly no crashing or blue screens. What exactly should one expect from bitflip errors? I would think that even single bitflips, as often as they should happen with 64GB of RAM being used almost fully, would show up somehow.


> what exactly should one expect from bitflip errors?

Probably silently wrong values happening in memory somewhere. If the flipped value isn't part of actively running code, you probably won't have a crash.

If it's in (say) an Excel calculation formula, it'll probably just screw up the calculation. Which may or may not be obvious. Similar thing for 3D CAD, it could be completely non-obvious.


> I don’t quite have the courage to physically short pins, nor the patience to slowly overclock my RAM, waiting multiple minutes for DDR5 link training each time. So instead, I’m content with knowing that the memory controller is reporting that ECC is enabled.

How about hitting the RAM with a warm stream of air from a hair dryer? I've seen that technique used in the past to generate errors.


You should be able to use a propane barbecue lighter at short distance; those things generate stupid amounts of EMI.

https://hackaday.com/2022/01/29/blast-chips-with-this-bbq-li...

https://hackaday.com/tag/emfi/


I'd be careful with EMI/EMP based testing, if it's latching up one line or flipping one bit, it's probably also doing it to a ton of lines/bits. I'm guessing here but based off my experience with such things, I'd expect it to be more likely to latch up whole busses and generally just crash the system than trigger anything ECC could report unless you had a very repeatable setup you could ramp up in a very controlled fashion.

The mode of action here is that you're producing a very wide band pulse, but it's most strongly coupled to the PCB traces long before anything within the chip itself. Guessing again, but when people are using this to hack chips, they're probably just causing voltage swings on the power traces that are effectively voltage glitching the chip. The problem is you're overvolting the part rather than undervolting it, which may cause permanent damage.


More sensibly, you can do some memory overclocking live on a running system, with no link training. It's not advisable for actual use, but you can use it to find the limits of your memory, or to generate errors.


Would rowhammer testing reveal whether or not a system has ECC RAM? It reveals errors on my I7-4770K (not overclocked) system whereas an ancient Supermicro board (X10 era) seems to run the rowhammer test indefinitely w/out detecting errors. I suppose this won't work if newer systems have been designed not to be susceptible to rowhammer attacks.

Incidentally the RAM on a Raspberry Pi 4B (and CM4) is reported to use ECC, but this is not the same as discussed here. It's on-die ECC and the purpose is to improve yield of the chips. ECC errors are corrected and not reported via H/W. ECC errors that cannot be corrected (I suppose) are just read out as incorrect data. I wonder if modern RAM sticks use chips with on-die ECC.


> I wonder if modern RAM sticks use chips with on-die ECC.

DDR5 does, AFAIK, but it is not a replacement for running it 72 bits wide like previous off-die ECC systems.


> ... not a replacement for running it 72 bits wide

Agreed, but is it better than no ECC at all?


How about just bringing your cell phone close to the DIMMs to trigger errors? That would be easy to do.


I feel like potentially reflowing BGAs on a high speed I/O PCBA you want to run continuously for a decade is riskier than simply shorting a pair of diode-protected pins.

Link training can be disabled in the BIOS, which would make searching for borderline bandwidth settings a quick operation. The results would not be very repeatable, but that's not important.


> I feel like potentially reflowing BGAs on a high speed I/O PCBA y

If you're trying to get chips up to something less than 100C, but instead get stuff underneath it with significant thermal mass up to above 200C, you're really messing up.

Hair dryers tend to emit like 65C-70C air on their "high heat" setting (as opposed to hot air guns used for heat-shrink and reflow soldering).


SAC305 solder won't reflow until >230C vapor phase. Will a hair dryer get a chip to these temps, ever?


Boot the OS off ram and then do something that uses a lot of ram. If the machine crashes randomly or you have a corrupted "disk", chances are it is an ECC error.


Or buy a radioactive material on eBay?


Thoriated tungsten welding electrodes might be an option.


Some (maybe all?) ASUS AM5 motherboards have official ECC support. It appears in both the board manual and the BIOS manual for the model I just checked. Note that one of the relevant BIOS settings is Auto by default, which (counterintuitively) leaves it disabled, so you'll want to change that.

ECC reporting for these processors appeared in Linux 6.5, so Debian Stable users will have to either wait for it to appear in Backports or stray from the beaten path.

https://www.phoronix.com/news/AMD-EDAC-Ryzen-7000-Series
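
Once the kernel has EDAC support for the platform and the driver is loaded, the corrected/uncorrected error counters show up in sysfs under /sys/devices/system/edac/mc/. A minimal sketch for reading them follows; the function name and the `root` parameter (there so it can be pointed at a test directory) are my own, and whether any `mc*` entries exist at all depends on your kernel, firmware and BIOS settings.

```python
# Read corrected (ce_count) and uncorrected (ue_count) error counters that
# the Linux EDAC subsystem exposes per memory controller in sysfs.
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict:
    """Map each memory controller (mc0, mc1, ...) to its error counters.

    An empty result means no EDAC memory controller is registered
    (driver not loaded, or ECC not enabled), not that memory is error-free.
    """
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found")
    for mc, entry in result.items():
        print(mc, entry)
```

A non-zero ce_count is ECC doing its job; a non-zero ue_count is the one to worry about.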


> Note that one of the relevant BIOS settings is Auto by default, which (counterintuitively) leaves it disabled, so you'll want to change that.

I really dislike Auto settings in the BIOS. It wouldn't be so bad if there was a way to see the effective value, but 9 times out of 10 it's not obvious.


I don't mind auto settings themselves, but I agree about the opaque ones. In this case, the effective value is stated in the field's help message, but it's still easy to miss.


There's a rumor that older AGESA[1] versions had a bug that prevented the chipset from recognizing and utilizing ECC ram properly even though the chipsets should support it. Check any motherboard in question for a firmware update which includes at least AGESA 1.0.0.5 patch C.

https://www.reddit.com/r/truenas/comments/10lqofy/

[1] Part of the firmware on AMD systems, that brings up core system components. https://en.wikipedia.org/wiki/AGESA


Based on the ASRock forum thread, this sounds right. Before installing the ECC RAM I updated to AGESA 1.0.0.7b.


> ”Unfortunately, when the AMD Ryzen 7000 ‘Raphael’ CPUs were launched along with the brand new Socket AM5, all mention of ECC support was gone.”

Had no idea that the 7000 series doesn’t officially state support for ECC.


After publishing this post, I was alerted to the fact that there are some AM5 motherboards that do officially support ECC.

ASRock's Rack line supports it, for example: https://www.asrockrack.com/general/productdetail.asp?Model=1...

This ASUS motherboard also claims to support ECC: https://www.asus.com/us/motherboards-components/motherboards...

Neither of those were out at the time I originally got my AM5 motherboard (which was right as they came out -- the performance numbers were so good I couldn't help but spring on one early.)


The 10G ports on the ASRock Rack 1U4LW-B650/2L2T appear to be supported by VMware ESXi 8: https://www.vmware.com/resources/compatibility/detail.php?de...

I would be interested in hearing someone's experience with getting this board running as a type 1 ESXi hypervisor.


> Had no idea that the 7000 series doesn’t officially state support for ECC.

You've misinterpreted the author.

All currently-available Ryzen 7000 series CPUs have official ECC support, but it requires motherboard support as well.[1]

This conditional ECC support has been the case for all past AMD consumer CPUs going back to the original Athlon 64, but the Ryzen 7000 series is the first time I can recall that this support has been explicitly listed on AMD's marketing materials.

What the author is saying is that mention of ECC support had disappeared from ASRock's motherboard documentation. This was a notable change for ASRock, as they had explicitly mentioned ECC support on their previous Ryzen motherboards.

[1] Example from the Ryzen 5 7600 spec page: https://www.amd.com/en/product/12756#:~:text=ECC%20Support,R...)


It was really iffy and not clearly mentioned until this year. Also, it's confused with on-chip ECC which all DDR5 uses. DDR5 needs on chip ECC to correct errors that'll occur in normal operation but that won't provide ECC for transfers across the memory bus to the CPU.


I have a genuine question - in practice on a workstation/developer computer, what sort of protection does ECC ram give me?

I've got two daily driver machines - a 3970x threadripper with 96GB ram, and an M1 Macbook Pro - neither of which have ECC ram.

I've been using them both for over 2 years, and not once in that period (that I'm aware of) have I found myself with a problem due to faulty RAM, but I do regularly find myself wishing both were faster.

What practical benefit would I get in exchange for the performance hit of ECC memory?


> what sort of protection does ECC ram give me?

The ability to detect (and possibly correct) physical memory errors.

> I've been using them both for over 2 years, and not once in that period (that I'm aware of) have I found myself with a problem due to faulty RAM

The "that I'm aware of" part is the key issue. ECC provides visibility into the health of the physical memory that you otherwise wouldn't have.

> What practical benefit would I get in exchange for the performance hit of ECC memory?

Side-band ECC (which is what you'd use on desktops/workstations/servers) doesn't have a performance hit, as the overhead of ECC is fully neutralized by the additional bandwidth and capacity of an ECC DIMM.


> The ability to detect (and possibly correct) physical memory errors.

This is a great example of theoretical benefits (note I'm not saying that those theoretical benefits are real, I'm asking how they practically benefit me).

> The "that I'm aware of" part is the key issue.

Ok so tell me what it looks like. _That's_ what I want to know.

> into the health of the physical memory that you otherwise wouldn't have.

And does what, on my windows workstation or my linux workstation?

> Side-band ECC (which is what you'd use on desktops/workstations/servers) doesn't have a performance hit, as the overhead of ECC is fully neutralized by the additional bandwidth and capacity of an ECC DIMM.

That statement doesn't hold water. If it takes extra bandwidth and capacity to provide ECC, I can use the extra bandwidth as memory instead of error correction, no? (Or I could if the non-ecc DIMM utilised that range). Either way, it's an extra x bits for ECC that _could_ be used for storage.


> Ok so tell me what it looks like. _That's_ what I want to know.

Suppose an application running on your machine suffers some type of malfunction -- a segmentation fault or a seemingly-random kernel panic that you're unable to reproduce.[1] And suppose it happens often enough that you want to fix it, and understand the root cause.

You research the issue, and are pointed to potentially faulty hardware. That then raises a question: how do you know your physical memory is working properly?

You could run a diagnostic application like Memtest86. However, that has the following issues:

- It only tests a particular point in time. If the test "passes", that doesn't say whether your memory encountered a fault in the past, nor does it guarantee that it won't fault in the future.

- What does it even mean for a test to "pass?" Just a single run through a test suite? Running it for some period of time, like a few hours?

- A diagnostic like Memtest86 is invasive. While it's running, you can't use your machine for other things.

Troubleshooting these types of hardware errors without ECC is tricky, because there's not really a way to conclusively link a software fault to a hardware error.[2][3]

However, on a machine equipped with ECC, this type of troubleshooting is a lot more straightforward. If bits are flipped in memory, the memory controller can probably detect that (assuming only one or two bits are flipped), and can raise a machine exception that the OS can catch and do something with (e.g., logging the error, terminating the impacted process if the error is uncorrectable, etc). That saves a lot of time and headache that you'd otherwise be spending on guesswork.
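
To illustrate the "one or two bits" part, here is a toy single-error-correct, double-error-detect (SECDED) code: Hamming(7,4) plus one overall parity bit, protecting 4 data bits with 4 check bits. This is only a sketch of the principle. A real memory controller uses a much wider code (8 check bits over 64 data bits on a 72-bit DIMM), the exact code is vendor-specific, and the function names here are invented for the example.

```python
# Toy SECDED code: Hamming(7,4) plus an overall parity bit.
# Index 0 holds overall parity; indices 1..7 are the classic Hamming
# positions, with check bits at 1, 2 and 4 and data bits at 3, 5, 6, 7.


def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2
    return word


def decode(word: list):
    """Return (status, nibble); status is 'ok', 'corrected' or
    'uncorrectable' (nibble is None when uncorrectable)."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks even but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


if __name__ == "__main__":
    codeword = encode(0b1011)
    one_flip = list(codeword)
    one_flip[6] ^= 1
    two_flips = list(one_flip)
    two_flips[2] ^= 1
    print(decode(codeword))
    print(decode(one_flip))
    print(decode(two_flips))
```

The "corrected" case is what gets logged as a correctable error; the "uncorrectable" case is where the OS has to decide what to kill. Three or more flips can slip through or be mis-corrected, which is why the guarantee only covers one or two.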

You may still be asking, "how big of an issue is this, really?" I suppose whether or not you care about having hardware that can detect memory faults is up to you.[4] However, during the portion of my career where I was managing large machine fleets, memory failures were the second most common type of hardware failure, behind only mechanical disk drives.

> That statement doesn't hold water. If it takes extra bandwidth and capacity to provide ECC, I can use the extra bandwidth as memory instead of error correction, no?

No. The memory controller is unable to use the additional capacity for anything other than error checking (hence why it's called "side-band").

What you're describing is "in-band ECC", which is something that you sometimes see on GPUs or small form factor systems aimed at professional markets.

---

[1] An application crash is actually one of the better outcomes of a memory error. The worst-case scenario is silent corruption of data stored in memory that your application assumes to be valid.

[2] Errors as a result of faulty memory have been particularly frustrating to OS developers, as it can result in nonsense error reports. Linus Torvalds has bemoaned the lack of ECC options on consumer hardware, claiming that it was the industry cheaping out. During the lead-up to the launch of Windows Vista, Microsoft officially encouraged the use of ECC. I don't follow development of the BSDs, but it wouldn't surprise me if they similarly wished their users would use ECC across the board.

[3] When Google first started building out the infrastructure for their server farms, they opted not to use ECC memory for their servers as a cost-savings measure. They subsequently had a memory fault on one of their machines that resulted in corruption of their search index. While they worked around the problem by adding logic to verify the contents of the index, all subsequent generations of machines that Google has deployed have ECC support. For specifics, see the second footnote in https://danluu.com/why-ecc/

[4] Note that many of your other components that store data in some way (e.g., caches, persistent storage, etc.) are likely to have ECC capability. The only notable exceptions I can think of are GPU memory, the main system memory on consumer hardware, and the processor's registers.


> > what sort of protection does ECC ram give me?

> Side-band ECC (which is what you'd use on desktops/workstations/servers) doesn't have a performance hit, as the overhead of ECC is fully neutralized by the additional bandwidth and capacity of an ECC DIMM.

This is true as far as I know but only for comparing ECC ram vs non ECC ram that is otherwise equivalent. But if you want the fastest ram available, it doesn't support ECC, so you're still taking a perf hit by buying slower ram with ECC support than you could get without ECC


Yeah, Ryzen usually performs best with faster RAM, but finding good speed on ECC sticks is generally tricky. It makes sense to get the X3D version of the CPU if you are planning to build a system and squeeze more out of it.


Anecdotally, on a desktop machine with something like 64GB RAM, when I switched from non-ECC to ECC (and 128GB RAM), the number of unexplained crashes seemed to drop from one every couple of weeks to one every 6 months or so. Could have been from a software upgrade though, probably Ubuntu 18 to Ubuntu 20.

Now I'm on an M1 MBP with 32GB RAM, and I get unexplained crashes roughly every 3-4 months, but I've been chalking it up to MacOS. Behavior also seems to degrade after a month or two of uptime. Hmm, maybe it's time for Apple to go ECC ...


> Anecdotally, on a desktop machine with something like 64GB RAM, when I switched from non-ECC to ECC (and 128GB RAM), the number of unexplained crashes seemed to drop from one every couple of weeks to one every 6 months or so. Could have been from a software upgrade though, probably Ubuntu 18 to Ubuntu 20.

Thanks. This is a great anecdote/example that is much more what I'm interested in. I don't see the same level of system instability on my machine, but it's a good example. Appreciate it.

> Behavior also seems to degrade after a month or two of uptime.

Ah, here's another thing. I do tend to reboot daily/weekly which may help


It depends on your workload.

Some industries require being able to reproduce data exactly, bit for bit.

Anything that requires absolute bit certainty, for example digital signatures, encryption, or financial information, will require ECC.

If you don't care if your files eventually don't match a sha256 checksum, you'll be fine. For example, with source code it doesn't matter, because in the worst case, your code won't compile because variable names got flipped.

And even then, if you're using Git for source code control, you're already using hash signatures to detect data corruption, as well as a very large redundancy and replication. Git is, in a philosophical way, ECC by software.
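
A rough sketch of what that detection looks like: Git (in its default SHA-1 object format) names a file's contents by hashing a short header plus the bytes, so a single flipped bit gives a completely different object ID. The helper names below are mine, and the sample content is made up.

```python
# Compute the ID Git assigns to file contents (SHA-1 object format) and
# show that flipping one bit changes it.
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID of a blob: SHA-1 over 'blob <size>\\0' followed by the bytes."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


if __name__ == "__main__":
    original = b"total = price * quantity\n"
    damaged = flip_bit(original, 3)
    print(git_blob_id(original))
    print(git_blob_id(damaged))
    print(damaged)
```

Note the limit, though: this catches corruption of data that was already hashed. If the bit flips in RAM before Git ever sees the file, Git will faithfully hash and store the corrupted version.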


> It depends on your workload.

> Some industries require to be able to reproduce exact bit by bit data.

> Anything that requires absolute bit certainty, for example digital signatures, encryption, or financial information, will require ECC.

Totally, and I get that. We've got people in this thread who are saying that non-ECC ram shouldn't be trusted [0], or that it should be standard [1]. I get the use cases, but why do I want that on my workstation?

> your code won't compile because variable names got flipped.

I have _never_ seen or even heard of this or anything vaguely resembling this happening. Are there any writeups of this anywhere at all?

[0] https://news.ycombinator.com/item?id=37829796

[1] https://news.ycombinator.com/item?id=37828865


In my opinion yes, ECC should be standard.

ECC is expensive just because Intel had a monopoly over the server segment, and unilaterally forced desktop not to use ECC.

Otherwise people would use desktop computers as servers, and that would reduce profits.

At work, my desktop workstation has ECC. The colleague sitting next to me sometimes has kernel warnings about ECC errors on their screen.

And cases of bit flips happen every day, that's why some problems are solved by just restarting your computer or the software. I'd argue that many times we blame software bugs on pure hardware bit flips.

Here's a story of a variable name bit flip: https://alexbakker.me/post/did-cosmic-rays-break-my-linux-bu...

Bit flips usually affect operating system stuff, here's another story: https://blogs.oracle.com/linux/post/attack-of-the-cosmic-ray...


When your RAM starts to go bad it can severely screw up writes to your filesystem without you noticing for a long time.


Bit flips in memory can affect instruction code, not just data. This will manifest as random bluescreen crashes or visual artifacts in the case of video memory. Similar to errors/crashes from overclocking a little too high, but occurring occasionally on a normally stable system.


I've been actually researching this a lot recently as I'm in the process of buying parts for a desktop build and want ECC support.

> Unfortunately, when the AMD Ryzen 7000 “Raphael” CPUs were launched along with the brand new Socket AM5, all mention of ECC support was gone.

1. This was effectively a big point of concern for me. Previous ASRock Taichi motherboards officially advertised full ECC support, but not the latest X670E model for AM5. An older version of the X670E Taichi page mentioned ECC support ("Supports DDR5 ECC/non-ECC, un-buffered memory up to 6600+(OC)") [0], and this Level1Techs review [1] had a segment on ECC support (they fully tested it). The segment was around the 10min mark, but it looks like the video was swapped to remove it during the last 30 days (the original video was 17min41sec long, the current one is only 16min32sec). So I assumed that ECC was supported, but it's a bit concerning that there's no official mention. The article reassures me.

2. Finding ECC memory for desktop computers is hard. For the X670E Taichi, you need DDR5 unregistered DIMMs. The best way I have found to find ECC memory is to check QVL lists for similar server motherboards, for example this ASRock Rack [2] or this Gigabyte motherboard [3]. I decided to go with two 32GiB Kingston sticks [4] for my own build.

[0]: https://web.archive.org/web/20221002103925/https://www.asroc...

[1]: https://www.youtube.com/watch?v=PhrqEV-VAjE

[2]: https://www.asrockrack.com/general/productdetail.asp?Model=1...

[3]: https://download.gigabyte.com/FileList/QVL/server_mb_qvl_MC1...

[4]: https://www.kingston.com/en/memory/server-premier/ddr5-4800m...


I am just immensely impressed by this story. A+ hardware and software investigation work.


Yes, a great article! It was no big surprise to see that the article was written by an Oxide employee - it has been a joyful signature that they've managed to attract curious people with an engineering approach to investigating problems and looking up how things work. I'm probably biased, but it seems like a dream company to me.


Thank you for the kind words!


Though I have no intention of switching to ECC on my Ryzen system, I found this writeup both very clear and a great compelling educational read. Thank you for creating it.


Recently decided not to buy the AMD laptop from framework because they didn't bother to implement ECC. Started me thinking that I don't really need a new laptop, have an old one... so why continue to suffer from compromised parts at my desk?

Unfortunately AMD has been going the wrong way with this. Oddly enough, Intel has quietly added a few cheaper models with ECC support.

Unlike the author I don't have time for building a new PC and a bunch of maybes. I found this machine that's pretty cheap:

https://www.dell.com/en-us/shop/desktop-computers/precision-...

This one has a CPU option with a TDP of 65 W. What do y'all think?


I think you're being a bit harsh to AMD on this, you still can run ECC on their consumer CPUs, you just need to research the motherboard because any motherboard/chipset combo could have ECC support.

That Dell has the Intel W680 chipset that is responsible for giving any Intel CPU ECC, just more pointless market segmentation by Intel. I don't know if I would give Intel a glowing review for that, you can also buy an OEM system with an Ryzen Pro processor if you want that kind of ECC guarantee.

You're not going to find Intel offerings with ECC solutions for light laptops either right now.


> Just need to do a bunch of research and trial and error

Or just buy one?

> buy an OEM system

Who sells certified ECC systems with AMD at similar prices? I did look but didn’t find much already assembled.


>Who sells certified ECC systems with AMD at similar prices?

I don't think you can blame AMD for the OEMs' choices. Lenovo does compete in a way. You have to buy a GPU on their Threadripper workstations because there is no iGPU for the computer to fall back on, unlike the Dell with its Intel iGPU. Also, your Dell comes with one non-ECC 8GB DIMM as standard; you have to pay $247 to upgrade to an ECC DIMM. Lenovo's Threadrippers come with ECC as standard and can even run registered DIMMs.

But yeah, it is a $2500 Threadripper workstation with no GPU compared to a $1300 i5 workstation with no GPU.

If you want to build an ECC AMD system some manufacturers want your money: https://www.supermicro.com/en/products/motherboard/h13sae-mf

Frankly "shopping" OEM sites makes me crazy and it's why I build my own.


I haven't looked but I suspect not many laptops support ECC, if any. That's not a good reason to skip AMD.

The Dell may support ECC, but like any workstation it's pricier than a consumer-grade desktop with equivalent performance. If you need it, you need it, but usually individuals don't pay for that, their employers do.


There are some Xeon ThinkPads that have full fat ECC capability, however you need to be exceptionally rigorous in making sure that's what you are getting. Only some of the laptop Xeon models support ECC, and it's not as simple as "higher cpu model number better", it's a fucked up matrix that Intel loves to do.

I believe that there might be issues with it not being enabled properly if you buy the machine with a minimum of non-ecc memory intending to switch it out, because lenovo does something fucky if you don't get ECC from the get go. I could be misremembering that last bit, it's been a couple of years since I looked at ECC capable workstation laptops.

But yeah, when framework comes out with an ECC workstation laptop is when I'll finally consider getting a new laptop.


It’s a thousand and change, cheaper than a similarly specc’d laptop.

And never been given the option.


It’s not a question of whether we want it or not. You’ll have to take it up with AMD on enablement of ECC support on non-PRO U-series mobile processors.


Do we not deserve pro parts? Just a bunch of amateurs over here at HN? ;-)

I suspect the group that wants to physically maintain their laptop but doesn’t care about data integrity isn’t too large.


Thanks for clarifying. That said, if you did want it, couldn’t you just offer a PRO series option e.g. AMD Ryzen 7 PRO 7840U?


> The most foolproof way to test whether ECC is working is to introduce an error somehow.

> ApplesOfEpicness did so by shorting a data and ground pin on their motherboard.

Epic! Some people, really...

> I don’t quite have the courage to physically short pins, nor the patience to slowly overclock my RAM, waiting multiple minutes for DDR5 link training each time

It's not clear to me: when is the DDR5 re-trained? I assembled my Ryzen 7000 desktop PC myself and I don't remember ever having to wait minutes for it to boot, not even the first time. It's a bit slower at boot I'd say than non-DDR5 PCs I had in the past but it's mere seconds, not minutes.

And last but not least: DDR5 specs for ECC vs non-ECC, which price difference and speed difference are to be expected?


> which price difference and speed difference are to be expected

ECC RAM uses 1/8th more memory chips, so as a rule of thumb it costs 1/8th more.

They can have very different dynamics on the used market though, because servers tend to be upgraded to larger modules over time, but there aren't many buyers for smaller modules. Not a big factor for DDR5 because it's too new, but for DDR4 you can get some great deals for 8GB and 16GB modules.
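
The 1/8th rule of thumb falls straight out of the module widths: a classic ECC DIMM is 72 bits wide against 64 bits of data. As a small sketch (the helper name is made up), here is the same arithmetic for the 72-bit and 80-bit DDR5 ECC UDIMM widths mentioned elsewhere in this thread. It only counts width, and says nothing about actual street prices.

```python
# Extra width of an ECC module relative to the 64 data bits it carries.
from fractions import Fraction


def ecc_overhead(total_width_bits: int, data_width_bits: int = 64) -> Fraction:
    """Extra bus width (and, roughly, extra DRAM capacity) as a fraction
    of the data width."""
    return Fraction(total_width_bits - data_width_bits, data_width_bits)


if __name__ == "__main__":
    for width in (72, 80):
        extra = ecc_overhead(width)
        print(f"{width}-bit module: {extra} more ({float(extra):.1%})")
```

So an 80-bit DDR5 module carries more like a quarter extra, which is one reason DDR5 ECC pricing doesn't always match the old 1/8th intuition.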


Oh, I just happened to grab a B650E-ITX!

...But are there any ECC sticks with "known" good Hynix ICs? I got lucky with a pair of sticks that can do DDR5 6000 CL30 at 1.3V (and maybe better), and I wouldn't want to sacrifice a ton of RAM performance for ECC.


The v-color RAM I linked to in the post is Hynix: https://v-color.net/products/ddr5-ecc-udimm-servermemory?var...

I haven't tried overclocking it.


That is A-Die! It should overclock like crazy.

You should do it and post it. Ryzen 7000 likes OC'd RAM, and overclocked ECC RAM is kind of a desktop hardware myth/legend because it's so rare and oxymoronic. Not sure if you've done it before, but DDR5 overclocking is quite a rabbit hole, and there are some "easy" tutorials out there like this: https://www.youtube.com/watch?v=dlYxmRcdLVw

Test the stock primary timings with the modded subtimings at 6000 first, then jump straight to cl34 for the primary timings and reduce the voltage/primary timings from there.

I would grab these myself, but the kit is too rich for my blood at the moment.


Fine, you've convinced me! I'll try it next week.

To be honest the only RAM overclocking I've done is by going into the XMP settings and turning on the shipped overclock configuration. For CPUs I tend to undervolt them a bit to reduce power consumption.


I'm also interested in what speeds and CL that DIMM would run. By the way, I'm the one who started that ASRock forum topic; what a small world to randomly stumble here while scrolling HN. In the end I put my PC upgrade on hold, partly because of that AMD BIOS ECC support mess and also because I cannot find fast DIMMs. Just looking today, it seems ASRock has still not re-added ECC support info on the AM5 board spec pages, so the more people test those board and memory combinations, the better for the community.


> ApplesOfEpicness did so by shorting a data and ground pin on their motherboard.

This doesn't seem to be the proof that ECC is working. Error detection on things like this is different than bit flip correction.


It is the proof, on unregistered sticks that's the job of your software, so it does the bit flip correction.


I used ECC with a Threadripper 2920x on an ASRock motherboard, but for a new Ryzen 7950x build I went back to regular RAM. The problem with ECC with the TR was that only Unbuffered DIMMs work, not Registered DIMMs, and the 7950x is the same. Unbuffered ECC is slower or costlier (or both) than the equivalent non-ECC or Registered, enough that it just wasn't worth it in a new build.

Ryzen has the additional disadvantage vs TR of only being able to run RAM fast if you limit yourself to two sticks. My TR build used 4x16GiB but for my 7950x build I had to get 2x32GiB to be able to run them at 6000MHz.


I currently run a 3700X with 64 GB ECC. I've been interested in a jump to an AM5 series, but have found the matter of ECC compatibility frustrating to research, particularly because DDR5 has limited ECC built in and that confuses the discussion.

I'm curious what your use case is. I mostly use mine for software development (including VMs) and some family media processing. I like the peace of mind that ECC brings, but I have no way to really know if RAM speed is a performance factor.


Personal Linux machine mostly used for watching video, code compiling, and some light gaming. The 7950x definitely compiles my stuff much faster than the TR 2920x did, but I don't know how much of that is because of the RAM speed (6000MHz vs 4800MHz, faster timings) and how much because the CPU clocks higher (~5.4GHz vs ~4.2GHz).


(Made a typo. The RAM speed in my 2920x build was 2666MHz, not 4800MHz, of course.)
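
As a rough way to put numbers on the RAM side of that comparison: theoretical peak bandwidth is transfers per second times 8 bytes per 64-bit channel times the channel count. The sketch below assumes the quoted "MHz" figures are really MT/s, and that the 2920x build ran quad-channel and the 7950x build dual-channel. It ignores timings and real-world efficiency, so it says nothing definitive about compile times.

```python
def peak_bandwidth_gbs(mt_per_s: int, channels: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# Assumed configurations, taken from the figures quoted in this thread.
builds = {
    "2920x, DDR4-2666, 4 channels": (2666, 4),
    "7950x, DDR5-6000, 2 channels": (6000, 2),
}

for name, (rate, channels) in builds.items():
    print(f"{name}: {peak_bandwidth_gbs(rate, channels):.1f} GB/s")
```

On paper the two builds land in the same ballpark, which hints that the CPU itself probably accounts for a good part of the speedup.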


> Ryzen has the additional disadvantage vs TR of only being able to run RAM fast if you limit yourself to two sticks.

TR has it similar, just doubled; you are able to run 4 sticks faster than 8. On my TR2920X+X399 (probably the same config as yours, Asrock Taichi) I'm able to run any four of them at 3066 MHz and all of them only at 2933 MHz.


Yes, I wasn't saying TR doesn't have a limit. I was saying Ryzen's limit was smaller, so I would've had to get bigger UDIMMs to have the same total memory as before.

(My 2920x was on an X399 Phantom Gaming 6 and my 7950x is on a B650 PG Lightning. I find the Taichis of all generations to be overkill.)


This is both a little off-topic and a little out of my expertise, but shouldn't the file descriptors in the query function be closed?


This is a bit off topic, but I'm running Home Assistant on a system that doesn't have any ECC.

I'll reboot it occasionally. Every once in a while it will exhibit odd behavior that seems to resolve on reboot.

Is it worth getting something that has ECC? I think I'm running like a Sandy Bridge i7 for reference.


it's not really possible to say if ecc would prevent the issue you're describing.. but that's why imo it is worth having ecc--so you don't have to worry as much about if the weird behavior you see is due to random memory errors.

On the more immediately practical side.. if this is happening frequently enough, then try running memtest86 or GSAT (Google's stressful application test) and see if it can pick up any errors.

You also might be able to improve things with a bios upgrade, or by lowering the clock speed of the ram.


I feel the need for ECC is a bit overstated, probably confused by listening too much to people in the server space. In servers, ECC is critical because servers simply have more RAM than PCs do. You can easily find systems with 32 sticks of RAM. If you roll 32 dice, the odds of a rare freak error are a lot bigger than if you roll 2. It's the same reason why you probably don't need RAID on a computer with one or two disks. Stick 48 disks in a machine however, and you're a fool to go without it (or a plan at least). It's why you don't need an automatic fire suppression system in your homelab, because the odds of a fire in one or a few PSUs are fairly small, but stick a few thousand in a room and suddenly it's a very real problem.

That, and most memory problems in PCs tend to be from dodgy overclocking or just bad sticks rather than cosmic rays. ECC won't really save you from that.
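
The dice argument can be sketched numerically. This assumes each stick fails independently with the same probability over some period; the per-stick value below is a made-up placeholder, not a measured error rate.

```python
def p_any_error(p_per_stick: float, sticks: int) -> float:
    """Probability that at least one of `sticks` independent modules sees an error."""
    return 1.0 - (1.0 - p_per_stick) ** sticks


# Hypothetical per-stick error probability for the period of interest.
P_STICK = 0.01

for n in (2, 32):
    print(f"{n:>2} sticks: P(at least one error) = {p_any_error(P_STICK, n):.3f}")
```

With these placeholder numbers the 32-stick machine is roughly 14 times more likely to see at least one error than the 2-stick one, though the per-stick risk is identical in both.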


Counterpoint: Without ECC, you simply don't notice how many flips happened and that they were the cause of some strange behaviour.


Most of them simply don't matter though, and are inconsequential alongside other system instabilities.

It's just neurotic to worry about issues when you couldn't tell they existed without being informed of them.


Most issues with a car don't matter for driving, until it catches fire or crashes due to catastrophic failure. Why ever go to a mechanic?


Most car crashes just aren't due to mechanical failure.


It would be if you had anything to do with designing or servicing them...


>dodgy overclocking

Error detection and correction is actually ideal when overclocking. It'll not only save you, it'll let you know you need to back off a bit, because the handful of errors that may happen once a month despite passing days of tests will show up in the system logs instead of staying silent and causing problems. I speak from experience as I overclocked ECC memory in a first gen threadripper system, playing around with overclocking as a hobby for a bit. ECC memory was fantastic to work with.

However, the dodgy or outright missing CPU/motherboard support means that marketing the feature to gamers and overclockers runs into too much friction. Much like what went on with Intel's confusing proprietary Optane/flash combo M.2 drives that needed certain motherboards in order to fully access both parts of the drive.

And with bad sticks, instead of hard to explain crashes, you'll end up with system logs filled with corrected/detected errors and maybe crashes if it's really bad. Then you just RMA the sticks like normal. So honestly that's a win too, because that's the whole point of detecting and correcting errors. It's in every other part of the system already, continuing to leave this capability out of memory is an oversight that needs to be.. corrected.

I just don't really get the strange resistance people have to ECC, it's not some sacred cow. It's just good sense.


> It's the same reason why you probably don't need RAID on a computer with one or two disks

No, it is not "the same reason"

Disks have ECC, even consumer hard drives have checksummed blocks, that's how you get media errors detected. RAM does not. Well, technically DDR5 has an internal one, but it does not give any feedback to the machine, so you might not know your RAM has any problems.

> It's why you don't need an automatic fire suppression system in your homelab

But you want smoke sensors. ECC is the smoke sensor too

> That, and most memory problems in PCs tend to be from dodgy overclocking or just bad sticks rather than cosmic rays. ECC won't really save you from that.

But it will tell you the problem

Very ignorant take


> Disks have ECC, even consumer hard drives have checksummed blocks, that's how you get media errors detected. RAM does not. Well, technically with DDR5 it has internal one but it does not give any feedback to the machine so you might not know your RAM has any problems.

Sure, and disks have random errors even with error correction. Adding Hamming codes just makes errors less likely and easier to detect, but they aren't foolproof, in RAM or on disk or anywhere else. The protection offered is a bit questionable because they rely on the very sketchy assumption that errors are statistically independent events, which is rarely how storage errors work.

> But it will tell you the problem

> Very ignorant take

Hardware isn't perfect. No matter how much error correction you pile onto it, you will still have errors and some cases will still be undetectable. It's a matter of pushing the rate of errors into an acceptable range, as a pure cost-benefit tradeoff, statistics through and through.
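
To make the detect/correct limits concrete, here is a toy extended Hamming(8,4) code: it corrects one flipped bit, detects two, and can silently miscorrect three. It is the same idea scaled down to 4 data bits, not the actual code used by any real memory controller; the function names are made up for illustration.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    c = [0] * 8
    # Data bits live at the non-power-of-two positions 3, 5, 6, 7.
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]  # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]  # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]  # parity over positions with bit 2 set
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c


def decode(codeword: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = 0
    for bit in c:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one and undo it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected, not fixable.
        return "uncorrectable", None
    return status, c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)


def flip(codeword: List[int], *positions: int) -> List[int]:
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(codeword)
    for p in positions:
        out[p] ^= 1
    return out


word = encode(0b1011)
print(decode(word))                 # clean word
print(decode(flip(word, 5)))        # one flip: corrected
print(decode(flip(word, 3, 6)))     # two flips: detected only
print(decode(flip(word, 1, 2, 4)))  # three flips: reported as corrected, data wrong
```

The last case is the undetectable one: the decoder reports a successful correction while returning the wrong data.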


Looking to get an ASRock B650 Pro RS to go with a 7700X, hopefully ECC will be supported.


The ASRock Rack B650D4U advertises ECC support on AM5.

It's a bit pricy, but you also get IPMI.


I think we have AsrockRack motherboards with Ryzen 7950x and ECC RAM. Works well.


> But the lack of ECC was a huge bummer at the time of purchasing my system.

Why?..


Ever had to troubleshoot bit flips on a non-ECC system? One friend felt like he was going crazy as over the course of two months his system degraded from occasional random errors to random crashes, blue screens and finally to no POST. Another time, a coworker had to stare at raw bytestreams in Wireshark for hours to find a consistently flipped bit.


Don't overclock your memory.


All of these were with stock, non XMP clocks.


Well then… test your memory :)


How often do you test your memory? The nice thing about ECC is it's always testing your memory, and (if it's set up properly!) you'll get notified when it begins to fail. Without ECC, your memory may begin to fail, and you'll have to deal with the consequences between when it starts to fail and when you detect it.

(Of course, I don't run ECC on my personal systems, but at least I'm wandering knowingly into the abyss)


Testing your memory detects if you have bad RAM, which ECC isn't going to help with anyway. Perfectly fine memory will experience random bit flips from environmental factors. Your PC components and UPS also degrade over time and can cause random bit flips. ECC is there to catch problems as they happen and ideally take corrective action before bad data propagates


> Ever had to troubleshoot bit flips on a non-ECC system?

No.

> One friend felt like he was going crazy

Tell him about memtest86.


Wow I came back to post this exact reply. I set my system to a slightly high frequency, ran memtest overnight with errors.

Set it back down to a supported frequency, ran a full memtest suite again with no errors.

Never had any issues since.


> Wow I came back to post this exact reply. I set my system to a slightly high frequency, ran memtest overnight with errors.

> Set it back down to a supported frequency, ran a full memtest suite again with no errors.

Cool. You tested your memory at some point in the past.

How do you know it's still working properly and hasn't flipped any bits?

You don't. Because you have no practical way of testing the integrity of the data without running an intrusive tool like memtest86 that basically monopolizes the use of the computer.

Being able to detect these types of memory errors at a hardware level while the processor is doing other things is the fundamental capability that ECC gives you that you otherwise wouldn't have, no matter how thoroughly you run memtest86.
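
On Linux, that background reporting usually surfaces through the kernel's EDAC subsystem, which exposes per-memory-controller corrected (`ce_count`) and uncorrected (`ue_count`) counters in sysfs. A minimal sketch for reading them follows; it assumes an EDAC driver for the platform is loaded, which is not guaranteed on every board, and the function name is made up for illustration.

```python
from pathlib import Path

# Standard sysfs location for EDAC memory controllers on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict:
    """Return {controller: {'ce_count': int|None, 'ue_count': int|None}}."""
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found; ECC reporting may be off or unsupported.")
    for mc, c in counts.items():
        print(f"{mc}: corrected={c['ce_count']} uncorrected={c['ue_count']}")
```

Running something like this from a cron job or timer and alerting on a nonzero count is one way to get the "you'll get notified" behavior without a dedicated daemon.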


You likely wouldn't know if you had random bit flips. It'd manifest as silent data corruption. You might be okay with that. Others aren't.

It's not a matter of overclocking. Bit flips are a fact of life running with 32+ GB RAM. Leaving your machine on 24/7 (even if in sleep) stacks the odds against you.


Obviously this is just an anecdote, but I have a work laptop with 128GB of non-ECC RAM, use all of it every day and never noticed any issues. I'm not saying there aren't any, but it just....works.


You have silent bit flips, they silently corrupted data instead of causing a visible error.


See also:

https://forum.level1techs.com/t/ecc-capable-verified-motherb...

>"I have ECC working on ASUS ProArt X670E-Creator with AMD Ryzen 9 7950X. But, you have to explicitly turn ECC on in the bios. If left on ‘Auto’, it will be off. I use four sticks of Supermicro (Hynix) 32GB 288-Pin DDR5 4800 (PC5-38400) Server Memory (MEM-DR532MD-EU48)."

https://www.reddit.com/r/truenas/comments/10lqofy/ecc_suppor...

<Read yourself>



