I am in need of a processor upgrade, but am really interested in an ECC RAM setup. I read a thread of two people arguing on /r/AMD about whether AMD processors and/or their motherboards actually support ECC, and had no clue which of them was right. [1] So this is topical for me!
When buying the motherboard, you must check if it declares ECC support.
This is normally specified in the "Memory" section of the specifications, in something like "ECC & Non-ECC, Unbuffered Memory".
Beware of mentions of "On-die ECC", which is present in Non-ECC memories and which is irrelevant.
You also have to buy ECC DDR5 UDIMMs, and you must be careful not to accidentally buy ECC DDR5 RDIMMs, which are incompatible with AM5 motherboards. ECC DDR5 UDIMMs have a width of either 80 bits or 72 bits. The width does not matter, as long as it is not the 64-bit width of non-ECC DDR5 UDIMMs. (For some vendors it is cheaper to use identical x8 chips in ECC and non-ECC modules, despite wasting some capacity, which results in an 80-bit width. There is a myth that DDR5 ECC DIMMs must be 80 bits wide; it is wrong, because the standard includes 36-bit channels, which result in a 72-bit width for a dual-channel DIMM. Micron, for instance, makes 72-bit DDR5 ECC UDIMMs.)
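To make the width arithmetic concrete, here is a minimal sketch. It assumes the usual DDR5 layout of two subchannels per DIMM with 32 data bits each; the function names are made up for illustration.

```python
# DDR5 splits each DIMM into two independent subchannels of 32 data bits.
DATA_BITS_PER_SUBCHANNEL = 32
SUBCHANNELS_PER_DIMM = 2


def dimm_total_width(ecc_bits_per_subchannel: int) -> int:
    """Total module width in bits for a given number of ECC bits per subchannel."""
    return SUBCHANNELS_PER_DIMM * (DATA_BITS_PER_SUBCHANNEL + ecc_bits_per_subchannel)


def classify_width(total_width_bits: int) -> str:
    """Classify a DDR5 UDIMM by its total width (hypothetical helper)."""
    if total_width_bits == dimm_total_width(0):
        return "non-ECC"  # 2 x 32 = 64 bits
    if total_width_bits in (dimm_total_width(4), dimm_total_width(8)):
        return "ECC"  # 2 x 36 = 72 bits or 2 x 40 = 80 bits
    return "unknown"


if __name__ == "__main__":
    for width in (64, 72, 80):
        print(width, classify_width(width))
```

Both the 72-bit and the 80-bit modules come out as ECC; only the 64-bit width marks a non-ECC module.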
The last time I checked, ASUS had the most AM5 motherboards with ECC support. I like the PRIME X670E-PRO WIFI most, because it has the best PCIe expandability beyond the slot occupied by the GPU.
However, there are many other, cheaper MBs for when less connectivity is enough.
> This is normally specified in the "Memory" section of the specifications, in something like "ECC & Non-ECC, Unbuffered Memory".
I've seen allegations that for some vendors, text like this on their non-server/workstation boards turns out to mean that ECC modules will work in the board, but without the actual ECC function.
While there might have existed such vendors, nobody provides more information than that written above in their MB specifications, so it is difficult to know for sure which is the case, before buying a MB.
The only way to gain more confidence is if the downloadable MB manual has an exhaustive description of the BIOS options.
If the BIOS options include one for enabling ECC and perhaps additional related options, e.g. for configuring scrubbing, only then is there complete certainty about ECC support in the MB.
However, most recent MB manuals no longer have a complete BIOS description, so they may not be helpful.
At least with the ASRock or ASUS AM4 or AM5 MBs that I have used, whenever "ECC & Non-ECC, Unbuffered Memory" was specified, the MB really had ECC support.
Sometimes the OS is unaware of ECC support on the hardware as well, e.g. Linux doesn't/didn't enable the Intel edac driver on i3's running on server chipsets using ECC memory, despite the CPU actually utilizing ECC in that scenario (they simply forgot to add them to the whitelist). So edac-util went "No ECC MCs found!", even though the platform worked.
There are several plausible levels of "support" where ECC is concerned:
0. Not supported at all (i.e., if you plug ECC RAM, your system won't boot).
1. ECC RAM can be plugged in but the ECC functionality is not used (i.e., there are no relevant traces/circuitry/etc).
2. ECC functionality is present (i.e., the circuitry is there) but it was not validated by the motherboard manufacturer to be functioning correctly (i.e., detecting/correcting errors).
3. ECC functionality is present and was validated by the motherboard manufacturer. This is the level one would expect from the server-grade boards from a reputable manufacturer like Supermicro.
In the case of AMD processors, when you see "ECC supported", it's anyone's guess which level it is. This is in contrast with Intel, where if it says the CPU/chipset supports ECC, then you know it really does. I bet Intel won't allow a motherboard manufacturer to sell a board with a chipset like W680 without validated ECC support.
Ryzen consumer CPUs use the same memory controller as their EPYC counterparts, and have full ECC support if the rest of the platform (i.e., motherboard, DIMMs) also supports ECC.
What separates Ryzen from its professional-grade counterparts is that ECC support is an optional part of the consumer Ryzen platform spec, which means that it's up to the motherboard vendor to enable support for it. Some motherboards don't have any support at all, some have ECC support as an explicitly-advertised feature, and many have ECC support but it's not explicitly advertised (simply listed as a footnote in the manual).
That's different from the way Intel does it, where Intel has explicit control over what the platform's features are, and uses that control to aggressively segment their markets. Intel's approach makes it easier to reason about ECC support as a buyer, but you pay for it with the lack of flexibility compared to AMD's platform (and you literally pay more for ECC).
All mobile Ryzen 6000 and Ryzen 7000 APUs (Zen 3 or Zen 4) support ECC, unlike in the earlier generations, where ECC was restricted to PRO variants. Unfortunately, none of the small computers that use them supports ECC on the MB.
It is expected that AMD will launch a desktop Zen 4 APU in the near future. If that happens, it remains to be seen whether ECC will remain enabled in it, like in the current laptop packages, but there seems to be no reason to disable it.
I can no longer edit the previous message, but I must mention that, as pointed out by another poster, some time during the last couple of months AMD has changed the specifications of all their mobile Ryzen CPUs.
While for at least a year and a half all the laptop CPU specifications for Ryzen 6000 and Ryzen 7000 included a clear statement that ECC is supported, in the very recent past this statement has been removed from all AMD mobile CPU (non-PRO) specifications, for unknown reasons.
If anyone is looking for one of these standalone, outside of a full build, I bought a Ryzen 7 PRO 4750G about a year or 2 ago from AliExpress. It's been running a homeserver 24x7 since then and I never had any issues with it. YMMV of course.
IIRC AMD doesn't validate the ECC functionality on their ryzen chips? So you could theoretically end up with a defective ECC circuitry if unlucky. I think that was the case for the first generations, it may also have changed or I may be misremembering.
> IIRC AMD doesn't validate the ECC functionality on their ryzen chips? So you could theoretically end up with a defective ECC circuitry if unlucky.
This is not the case.
Ryzen, Threadripper, and EPYC use a unified memory controller that has fully validated/qualified/supported ECC capability. The only difference between the memory controllers in these CPUs is the number of them (Threadripper and EPYC will have multiple memory controllers to support the extra memory channels).
When AMD claims that ECC on Ryzen isn't validated, they're talking about the platform as a whole, not the CPU specifically. ECC support on consumer CPUs depends on the motherboard supporting it. Unlike on Threadripper and EPYC (and also unlike Intel's approach), ECC is not a guaranteed feature of the Ryzen platform, so people who want that functionality need to explicitly look for motherboards that have it (in the same way that you'd need to explicitly verify PCIe bifurcation support).
However, if the motherboard supports it, ECC on Ryzen is a fully supported, validated, you-can-RMA-if-it-doesn't-work feature.
That was in the past (true for Zen 1 and Zen 2 Ryzen). It is no longer true: after Intel enabled ECC in some desktop SKUs, starting with Alder Lake, AMD responded by validating ECC in all desktop and mobile Ryzens.
Now all Zen 3 and Zen 4 CPUs, both desktop and laptop, have explicit ECC support, which means that ECC must be validated by AMD in all of them.
If any current AMD CPU happened to have defective ECC, that would be a completely defective CPU, which must be replaced by the vendor.
Despite the fact that all laptop Ryzen 6000 and Ryzen 7000 CPUs support ECC, I have not yet seen any AMD laptop or SFF computer with ECC support. On the other hand, it is much easier to find AM5 MBs with ECC support than Intel W680 MBs.
The AMD website says the non-PRO 7x40 laptop Ryzens have no ECC support[1], and the Framework folks have said that AMD told them there’s no ECC on non-PRO 7x40 laptop Ryzens[2]. If plugged in, ECC modules will work, but without the error-correction functions.
(Edited to reflect the current state of the website, as it used to say ECC support was present[3].)
The same was written for all AMD Rembrandt and Phoenix models, and it has been written in all such mobile CPU specifications at least since the beginning of 2022, so for at least a year and a half, if not more.
The ECC support was available only with DDR5 SODIMM memory, not with LPDDR5 memory, and only in the FP7r2 package, not in the FP7 or FP8 packages.
Perhaps AMD has discontinued the FP7r2 package, but this is not mentioned in the specifications. Either that, or they have decided right now that they may charge more for PRO models, or save on testing on non-PRO models.
Either way, it seems a stupid move for AMD to degrade their mobile CPU specifications right now, just as Intel is about to launch Meteor Lake, which is likely to be better than AMD Phoenix. AMD mobile Zen 5 will be better than Meteor Lake, but its launch is still about half a year away, assuming it is not delayed.
You mean flexibility to claim ECC support but not doing any validation to make sure it actually works? Does any AMD motherboard manufacturer actually state that "ECC is supported and has been validated"? I think the muddy waters that AMD has created would warrant such an explicit statement.
> You mean flexibility to claim ECC support but not doing any validation to make sure it actually works?
The actual ECC functionality is part of the memory controller, which is entirely AMD's domain. The ECC functionality of the memory controller is fully validated.
Whether or not ECC support is present on the rest of the platform is the responsibility of the system builder. However, if the platform isn't fully ECC-capable (e.g., you're using non-ECC DIMMs, the motherboard doesn't have the necessary electrical traces, ECC support is disabled in the UEFI, etc.), this will result in the memory controller disabling ECC support, which is visible to the operating system and is something that you -- the end user -- can verify.
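On Linux, one way to do that check is to look at the EDAC sysfs tree. A minimal sketch, assuming the usual `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` files are exposed by the loaded EDAC driver; the function name is made up, and an empty result only means no memory controller registered with EDAC, which can also happen when the driver simply doesn't know the CPU:

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} for each EDAC memory controller.

    An empty dict means no controller is registered, i.e. the kernel
    is not reporting ECC for this machine.
    """
    root = Path(edac_root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc_dir / "ce_count").read_text().strip())
            uncorrected = int((mc_dir / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc_dir.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers registered")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

The root path is a parameter only so the function can be pointed at a test directory.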
> Does any AMD motherboard manufacturer actually state that "ECC is supported and has been validated"?
Yes. ECC support will be listed in the motherboard's spec sheet and manual. There are also motherboards marketed for professional use where ECC support is explicitly advertised in the vendor's marketing.
> I think the muddy waters that AMD has created would warrant such an explicit statement.
AMD's official statement regarding ECC support on Ryzen is that it's supported if the motherboard also supports it. I'm not sure how they can be any clearer without making ECC a mandatory feature of the platform.
Can't speak to asrock but asus x570 boards definitely really do ECC, I have a known bad ECC DIMM and it's possible to use it to create correctable and uncorrectable conditions in short order.
Hard to imagine asrock wouldn't run the lines given all the other components are there, but I guess it's conceivably possible. If the kernel says it has ECC, it's right. If not, return the board as defective and use a different vendor.
Just be sure to think in advance about the maximal total amount of memory you'd need, since 5200 MHz is officially supported only for 2 modules (3600 MHz for 4). Discussed a week ago (also ECC issues were mentioned): https://news.ycombinator.com/item?id=37717567.
Using ASRock X570 PG 4S + Ryzen 5 2600 + Kingston 32GB 2666 ECC (the CPU/memory support list for this mobo says ECC works in this config). Dmidecode still reports a 128-bit data width instead of 72; however, it also reports multi-bit correction instead of single-bit, so ... ;). I'm kind of used to 72-bit on Intel boards with UDIMMs, for example the Supermicro+Xeon that I have. I think that a combination of memory controller plus mobo reporting affects that info, rather than actual hardware support. Still, EDAC works and registers the proper driver, and once in a while I get warnings from EDAC/RAS that a correctable error has indeed been corrected. Now, if I should question whether it actually corrects something, then I should also question the Supermicro+Xeon config, and I don't have any means to check that. However, if it were not working, I should see that on the ZFS dataset every month or so during scrubbing, and I don't see anything there. So for me it is settled.
The B550 chipset is for AM4/DDR4 boards. For newer AM5/DDR5 motherboards, the ECC support is a lot less clear. As in: several motherboard vendors have claimed ECC support in the past, but no longer do, without publicly stating a reason. Also, some boards do list ECC memory modules in their QVL, but do not claim ECC support; whether this means that the board supports ECC or only that the memory module functions without ECC, nobody knows.
Yeah. The comment I'm replying to sounded like they're wanting to still use the AM4 platform. Maybe I misread it, or they've adjusted the post for clarity in the meantime. :)
Maybe slightly OT, since this concerns AMD's older AM4 platform with a Zen 3 APU core, but working ECC support looks like this and is definitely present on my system:
$ sudo ras-mc-ctl --errors | tail -n5
14 2023-08-20 20:16:41 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x64e31c78, cpuid=0x00a50f00, bank=0x00000011
15 2023-08-23 17:17:49 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x64ea5188, cpuid=0x00a50f00, bank=0x00000011
16 2023-09-03 16:52:15 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x64f4d227, cpuid=0x00a50f00, bank=0x00000011
17 2023-09-15 21:37:59 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x65071ed7, cpuid=0x00a50f00, bank=0x00000011
This is with an ASRock B550M-ITX/ac and an AMD Ryzen 5 PRO 5650G. It used to work the same with a Ryzen 5 3600 (using a dedicated GPU for video output) before I upgraded the CPU.
To detect and log ECC activity on modern GNU/Linux, you will want to have the "rasdaemon" service active. It will decode MCE (and other hardware-related) errors and persist them to the database that is shown being queried above.
The frequency of these errors should be enough to cause you to distrust any kind of information output by a computer without ECC.
Edit: thinking about this a bit longer: that frequency is actually so high that you may well have a broken module in there. Note how it is the same module and the same address every time.
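To make the "same address every time" observation easy to check on a longer log, here is a rough sketch that groups corrected-error lines by channel, csrow and address. It assumes the `key=value` line layout shown in the output above; the sample lines are abbreviated copies of that output and the function name is made up.

```python
import re
from collections import Counter

# Fields of interest, in the key=value form printed in the log above.
FIELD_RE = re.compile(r"\b(memory_channel|csrow|addr)=(\w+)")

# Abbreviated sample lines (most fields dropped for brevity).
SAMPLE_LINES = [
    "14 2023-08-20 20:16:41 +0200 error: Corrected error, no action required., "
    "CPU 2, memory_channel=0,csrow=0, addr=0x36e701dc0, bank=0x00000011",
    "15 2023-08-23 17:17:49 +0200 error: Corrected error, no action required., "
    "CPU 2, memory_channel=0,csrow=0, addr=0x36e701dc0, bank=0x00000011",
]


def summarize_corrected_errors(lines):
    """Count corrected errors per (memory_channel, csrow, addr) triple."""
    counter = Counter()
    for line in lines:
        if "Corrected error" not in line:
            continue
        fields = dict(FIELD_RE.findall(line))
        key = (fields.get("memory_channel"), fields.get("csrow"), fields.get("addr"))
        counter[key] += 1
    return counter


if __name__ == "__main__":
    for (channel, csrow, addr), count in summarize_corrected_errors(SAMPLE_LINES).most_common():
        print(f"channel={channel} csrow={csrow} addr={addr}: {count} corrected errors")
```

A single triple collecting all the hits points at one weak cell rather than random flips spread across the module.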
Seems like a semi-stuck bit; it'd definitely cause issues w/o ECC, but it seems to chug along with the circuits doing their job. Best would probably be to add a memory-range exclusion to the kernel at boot to avoid that single area, since the stick seems good otherwise.
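For what such an exclusion could look like: the Linux `memmap=nn[KMG]$ss[KMG]` boot parameter marks a region as reserved. Below is a small sketch that builds the parameter for the page containing a faulty address. Big assumption: it treats the logged `addr` as a system physical address, which may well not hold on AMD, where the memory controller can report a normalized address that needs translating first. The helper name and the 4 KiB page size are also assumptions.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def memmap_exclusion(addr: int, pages_each_side: int = 0) -> str:
    """Build a memmap= parameter reserving the page around addr.

    pages_each_side widens the reserved region by that many pages in
    both directions (hypothetical safety margin).
    """
    page_base = addr & ~(PAGE_SIZE - 1)
    start = max(page_base - pages_each_side * PAGE_SIZE, 0)
    end = page_base + (pages_each_side + 1) * PAGE_SIZE
    size_kib = (end - start) // 1024
    return f"memmap={size_kib}K${start:#x}"


if __name__ == "__main__":
    # Address taken from the log above, interpreted as a physical address.
    print(memmap_exclusion(0x36E701DC0))
```

The `$` usually needs escaping in bootloader configuration files, so check the result in `/proc/cmdline` after rebooting.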
Ime ram is either bad or good. I’ve had ecc errors like this and I always ask the DC to replace the ram. After that, 0 errors forever. Same reason why I’m confident a 24 hour memcheck is sufficient for non ECC ram.
Generally the same here, but I have had sticks fail after some time in use. I had to RMA the RAM in my frame.work laptop after it failed. No reason or clue why, but it happened after 6 months or so. No issues with the RMA though, and it has been fine since. ECC, if it were supported there, might have given me a heads-up about it and saved me from having to restore from backups when the fs got corrupted.
I am fully aware the module is not 100% working, i.e., it is faulty at a specific physical address. That's OK for my personal desktop though, unless the condition worsens, and UCEs (which will panic my kernel) follow.
It'd be cute to log what processes are using that memory at time of error. Fun to speculate about whether a kernel bit flip is better or worse than ones in a web browser, photo editor, spreadsheet, network storage client...
You could test that empirically by setting up a box with the express intent to crash it and then using a chaos monkey like mechanism where you start injecting single bit faults into memory at random addresses. Wonder how long the box would be up before you start noticing something is broken. It would be funny if you accidentally killed the chaos monkey first! Best not use that box for banking...
I have ECC RAM installed on my Gigabyte B550I system. dmidecode shows the 72 bit width (Total Width: 72 bits) and dmesg | grep -i EDAC does show a bunch of info suggesting ECC is enabled. But this command's output is empty:
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.
Do I need to enable something so these errors get logged or have I been misled by dmidecode and dmesg?
You are just lucky, and your hardware appears to be working without any problems ;)
Some/most(?) AM4 boards can enable "PCIe AER" (Advanced Error Reporting) in their firmware, which will tell you about stuff going awry while components are communicating over said bus (but every instance of PCIe error I have ever seen, even on rather faulty hardware, was recoverable/correctable), and rasdaemon will also persist those.
I do not know what "Memory errors" are supposed to be, since ECC-related problems will be dropped into the "MCE errors" bucket. Neither do I know what "Extlog errors" are.
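For the dmidecode side of that question, here is a quick sketch that compares "Total Width" with "Data Width" per memory device. It assumes the text layout printed by `dmidecode -t memory`; the sample text is invented, and since DMI data is whatever the firmware reports, treat the result as a hint rather than proof.

```python
import re

# Invented sample in the layout of `dmidecode -t memory`.
SAMPLE_DMIDECODE = """\
Memory Device
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 16 GB
Memory Device
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 16 GB
Memory Device
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
"""


def parse_memory_widths(dmidecode_text):
    """Return (total_bits, data_bits, has_ecc_bits) for each populated device."""
    devices = []
    for block in dmidecode_text.split("Memory Device")[1:]:
        total = re.search(r"Total Width:\s*(\d+) bits", block)
        data = re.search(r"Data Width:\s*(\d+) bits", block)
        if not (total and data):
            continue  # empty slot, widths reported as "Unknown"
        total_bits, data_bits = int(total.group(1)), int(data.group(1))
        # Extra bits beyond the data width are the ECC check bits.
        devices.append((total_bits, data_bits, total_bits > data_bits))
    return devices


if __name__ == "__main__":
    for total_bits, data_bits, has_ecc in parse_memory_widths(SAMPLE_DMIDECODE):
        print(f"total={total_bits} data={data_bits} ecc_bits_present={has_ecc}")
```

On a real machine the text would come from running `dmidecode -t memory` as root.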
An excellent read. I have ECC RAM on my Threadripper board. One of the things that ops at Blekko found was that you needed to tell Intel boards (NOTE vendor change!) to actually REPORT correctable errors. The default was machine-check if there was an unrecoverable error, otherwise just go with the flow. With ~ 1600 192GB systems we would see correctable errors about once a week as I recall. I don't recall a single unrecoverable error in 6 years so that was pretty good. Greg will surely correct me if I missed one :-).
Your luck was better than mine, but at least our systems reported correctable errors without us having to ask. We had a roughly similar number of machines, probably with about the same RAM on average (many with less, and some with much more). We definitely had unrecoverable errors from time to time; maybe one or two a year, enough to have a policy: give it a shot to see if it was a one-off; if it doesn't fail again shortly after, it's probably fine; if it does fail again shortly after, swap the RAM. Nicer server boards even have an LED to tell you which RAM module to replace.
For correctable errors, we wouldn't swap unless the error counts got pretty big; some systems went for a long time with one or two errors a day, which is fine. Others went zero errors for a long time, then a couple days at a small number, and then big numbers. There was one system that managed to get thousands per hour and the system was unusable because handling machine check exceptions was too expensive; unfortunately the reporting interval was one hour, so we didn't realize why the system was unusable until the next report.
Well to be fair if we started regularly getting UCE's on a stick of memory we did preemptively replace it. (where regularly was more frequently than one a week)
Yeah, I'd guess a good number of the systems where we got one UCE, we got another one within a day after reboot. If so, off to repair. If not, it was pretty rare for that machine to cause trouble again; seemed worth the risk to check if repair was needed.
The correctables were trickier to judge, because system halt on UCE is easy to recover from and easy to diagnose (system is down, check console, see UCE message); system slow as heck because of constant machine check exceptions is hard to diagnose and a slow system can disrupt a distributed system a lot more than a dead system.
Greg Lindahl had created a really fascinating system that surfaced anomalies like that to ops so that the systems could be fixed. He had a cute name for them like 'fractures' or something (where it was a non-fatal failure but it was impacting overall performance). Things the system would catch were switch ports going bad, DIMMs going bad, and file system corruption outside of the set of released tools. Disk failure was fairly common and got caught, the drive reformatted, and resumed quickly. By tracking the rate of disk FS errors for a particular drive we could pick up the "you need to replace this drive" signal as well.
We mostly relied on a couple of easy metrics to signal trouble and then debug from there. Total size of all the Erlang message queues above a threshold was usually an indication of trouble; replication delays of more than a couple seconds too. CPU %, swap % were also signs of trouble.
Tracking down almost working networking without access to the switch metrics was kind of fun, ish. :P
I'm currently on Ryzen 3700X and asus tuf gaming x570 MB. I'd love more single core performance and more nvme/disk speed. Also I have a dual GPU setup, 2 M2 nvme, and 6 sata occupied.
I'm considering upgrading at the end of the year. I briefly considered Threadripper (for its PCIe lanes) until I found there is still no Zen 4 Threadripper and the price is likely to be eye-watering.
So the choice is: upgrade to a Ryzen 5900X and keep the rest of the system, or spend a lot more and upgrade to an AM5 Ryzen plus a new mobo (I considered Intel too, until I found it tops out at 20 PCIe lanes).
I've always been buying mostly gigabyte and asus mobos, but I got burned more than once by them so I might go with another manufacturer this time.
What do you think? Is it worth upgrading to AM5 for just single-core and disk performance? What about Intel? Considering I'd love more PCIe lanes, perhaps for a 10Gb adapter (so I can move some of these spinning disks out), Intel is probably not a good choice.
I'm happy enough with the multi-core performance of the Ryzen 3700X. If I got a new MB I'd definitely want at least two NVMe drives in a mirror (for speed), perhaps more, and SATA ports for my 6 spinning disks.
For your mirrored drives idea: I've had two nvme drives in mirror for a while but there are a lot of caveats to getting maximum performance. Not sure how it is on the AMD side but most Intel boards for example would have one m2 slot connected to the CPU and the other three via the chipset. Which makes them share a common bottleneck (also with your 10gb ethernet card)
For me it was in the end much faster to get a new board with pcie 5.0 support and a large enough single SSD. Total IOPS and throughput both turned out higher than my previous RAID 0 array.
I've got a 5950x and two NVME drives. The setup is the same as you describe. One has dedicated bandwidth and the other shares via the chipset. I just use them separately rather than trying to RAID them since the speed is fast enough for just about everything I need.
With my personal desktop (2x 1TB NVMe drives), they're mirrored with ZFS for data integrity. Depending on what Roark66 has them mirrored for (speed vs data integrity, etc), the alternative single-drive approach might not be workable.
Very likely. I jumped from 3700x and the difference is staggering. My 6800xt came alive and stutters are a thing of the past.
I kind of want an OLED but they don't come cheap, vertical resolution is the same, and they are not amazing for coding. Maybe in time for a Zen 5 based 3d cache chip.
Haven't you seen those congressional hearings with Zuck and others where congresspeople ask about how to use their phone ? How do you expect them to legislate about ECC ?
Yep. 99.9999% correct would mean 8 bit flips per megabyte of data stored in ram. The error rates are (thankfully) much lower than that (otherwise your computer wouldn’t boot). But random bit flips can cause utter havoc if they happen at the wrong time or place. If you download software from the internet, bit flips can introduce weird bugs to your software, only on your computer. (Including in the OS - including your filesystem or drivers). They can corrupt writes to your hard drive, and as a result corrupt your drive or files. Bit flips can quietly change the DNS request your browser sends to cause terrifying security problems. Or edit forms before you send them. There was even a case of a voting machine in Germany accidentally inventing 4096 votes due to a bit flip.
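The "8 bit flips per megabyte" figure is just arithmetic. A quick sketch to reproduce it, assuming a decimal megabyte of 1,000,000 bytes; the function name is made up:

```python
def expected_flipped_bits(num_bytes: int, correct_fraction: float) -> float:
    """Expected number of wrong bits if each bit is correct with the given probability."""
    total_bits = num_bytes * 8
    return total_bits * (1.0 - correct_fraction)


if __name__ == "__main__":
    megabyte = 1_000_000  # decimal megabyte, an assumption
    print(round(expected_flipped_bits(megabyte, 0.999999), 3))
```

With a binary mebibyte (1,048,576 bytes) the same calculation gives roughly 8.4 bits, so the conclusion does not depend on which definition you pick.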
ECC is a really good idea. It’s only expensive right now because it’s a “premium feature”. If it were a standard part of all ram sticks, it’d be cheap and we’d all benefit.
Not arguing against ECC, but some of your scenarios seem to be outdated due to crypto. I.e., software you download from the internet is often signed and hashes are validated (Linux package managers, macOS developer certs). Same for DNS requests (DNSSEC) etc. Yes, there is still wiggle room for bitflips to cause problems, but less so than in the past.
True - my bad for referring to DNSSEC; there are other ways you can use encryption for DNS resolving (by using an external DNS server that encrypts using TLS or simply by using DNS-over-HTTPS). This way you get 100% encryption of your DNS traffic (and thus tamper checks that would detect bitflips). Again, not arguing against ECC, there are valid reasons to want it - I just see fewer and fewer reasons in the consumer market.
Encryption and signing don't protect against memory corruption.
For example, I download software from the internet then hash it. The hash matches. Before the bytes are written to disk locally, a bit flips in RAM. The corrupted data is written to disk and used.
Likewise, DNSSEC doesn't protect you against DNS bitsquatting attacks[1], because the domain name can be changed before the DNS request is made. So the DNS response your computer receives for a-azon.com might be totally valid and signed. It can come through DoH or whatever. The problem is that your browser thought it was the response for amazon.com, and Chrome sends a bitsquatter your Amazon cookies. (Oops).
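The a-azon.com example is a real single-bit flip: 'm' is 0x6D and '-' is 0x2D, which differ only in one bit. Here is a small sketch that enumerates every hostname reachable from a domain by flipping one bit in one character. The function name and the simplified hostname rules (lowercase letters, digits, hyphens, no label starting or ending with a hyphen) are my own assumptions.

```python
import string

# Simplified set of characters allowed in a hostname.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(domain: str) -> set:
    """Return hostnames that differ from domain by one flipped bit in one character."""
    variants = set()
    raw = domain.lower().encode("ascii")
    original = raw.decode("ascii")
    for index, byte in enumerate(raw):
        if byte == ord("."):
            continue  # keep the label structure intact
        for bit in range(8):
            char = chr(byte ^ (1 << bit)).lower()  # DNS names are case-insensitive
            if char not in ALLOWED:
                continue
            candidate = original[:index] + char + original[index + 1:]
            if candidate == original:
                continue  # the flip only changed letter case
            labels = candidate.split(".")
            if any(label.startswith("-") or label.endswith("-") for label in labels):
                continue  # not a valid hostname label
            variants.add(candidate)
    return variants


if __name__ == "__main__":
    for name in sorted(bitsquat_variants("amazon.com")):
        print(name)
```

Every name it prints is a domain someone could register and wait for corrupted requests to arrive at.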
That's just made up statistics - average computer gets multiple errors a day.
This was fine when consumer computers were for games and porn. Now they store birth certificates, submit information to the tax man, documents to court, sometimes deal with matters of life and death
I hear this all the time, that non-ecc machines are just completely fragile, and the math seems to indicate that one should be getting bit flips all the time. and yet, my Intel system has 64gb of non-ecc ram and it runs half the day everyday and is hibernated at night. I run 3d cad software on it, Photoshop, vs code with tons of extensions and software running in wsl2 docker containers and I just never really get any errors. certainly no crashing or blue screen. what exactly should one expect from bitflip errors? I would think that even single bitflips as often as they should happen with 64gb of ram being used almost fully would show up somehow.
> what exactly should one expect from bitflip errors?
Probably silently wrong values happening in memory somewhere. If the flipped value is not part of actively running code, you probably won't have a crash.
If it's in (say) Excel calculation formula it'll probably just screw up the calculation. Which may or may not be obvious. Similar thing for 3D cad, it could be completely non-obvious.
I don’t quite have the courage to physically short pins, nor the patience to slowly overclock my RAM, waiting multiple minutes for DDR5 link training each time. So instead, I’m content with knowing that the memory controller is reporting that ECC is enabled.
How about hitting the RAM with a warm stream of air from a hair dryer? I've seen that technique used in the past to generate errors.
I'd be careful with EMI/EMP based testing, if it's latching up one line or flipping one bit, it's probably also doing it to a ton of lines/bits. I'm guessing here but based off my experience with such things, I'd expect it to be more likely to latch up whole busses and generally just crash the system than trigger anything ECC could report unless you had a very repeatable setup you could ramp up in a very controlled fashion.
The mode of action here is that you're producing a very wide-band pulse, but it's most strongly coupled to the PCB traces long before anything within the chip itself. Guessing again, but when people are using this to hack chips, they're probably just causing voltage swings on the power traces that are effectively voltage glitching the chip. The problem is that you're overvolting the part rather than undervolting it, which may cause permanent damage.
More sensibly, you can do some memory overclocking live on a running system, with no link training. It's not advisable for actual use, but you can use it to find the limits of your memory, or to generate errors.
Would rowhammer testing reveal whether or not a system has ECC RAM? It reveals errors on my I7-4770K (not overclocked) system whereas an ancient Supermicro board (X10 era) seems to run the rowhammer test indefinitely w/out detecting errors. I suppose this won't work if newer systems have been designed not to be susceptible to rowhammer attacks.
Incidentally, the RAM on a Raspberry Pi 4B (and CM4) is reported to be ECC RAM, but this is not the same as what is discussed here. It's on-die ECC and the purpose is to improve yield of the chips. ECC errors are corrected and not reported via H/W. ECC errors that cannot be corrected (I suppose) are just read out as incorrect data. I wonder if modern RAM sticks use chips with on-die ECC.
I feel like potentially reflowing BGAs on a high speed I/O PCBA you want to run continuously for a decade is riskier than simply shorting a pair of diode-protected pins.
Link training can be disabled in the BIOS, which would make searching for borderline bandwidth settings a quick operation. The results would not be very repeatable, but that's not important.
> I feel like potentially reflowing BGAs on a high speed I/O PCBA y
If you're trying to get chips up to something less than 100C, but instead get stuff underneath it with significant thermal mass up to above 200C, you're really messing up.
Hair dryers tend to emit like 65C-70C air on their "high heat" setting (as opposed to hot air guns used for heat-shrink and reflow soldering).
Boot the OS off ram and then do something that uses a lot of ram. If the machine crashes randomly or you have a corrupted "disk", chances are it is an ECC error.
Some (maybe all?) ASUS AM5 motherboards have official ECC support. It appears in both the board manual and the BIOS manual for the model I just checked. Note that one of the relevant BIOS settings is Auto by default, which (counterintuitively) leaves it disabled, so you'll want to change that.
ECC reporting for these processors appeared in Linux 6.5, so Debian Stable users will have to either wait for it to appear in Backports or stray from the beaten path.
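If you want a quick way to tell whether a box meets that version, here is a minimal sketch that compares the running kernel release against 6.5. It only parses the version string; distributions backport drivers, so treat the answer as a hint. The function name is made up.

```python
import platform
import re


def kernel_at_least(release: str, minimum=(6, 5)) -> bool:
    """Return True if a kernel release string like '6.5.0-1-amd64' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # Compare numerically so that 6.10 sorts after 6.5.
    return (int(match.group(1)), int(match.group(2))) >= tuple(minimum)


if __name__ == "__main__":
    release = platform.release()
    try:
        print(release, "->", kernel_at_least(release))
    except ValueError as error:
        print(error)  # non-Linux or unusual release string
```

Comparing the numbers as integers matters here, because a plain string comparison would sort 6.10 before 6.5.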
> Note that one of the relevant BIOS settings is Auto by default, which (counterintuitively) leaves it disabled, so you'll want to change that.
I really dislike Auto settings in the BIOS. It wouldn't be so bad if there was a way to see the effective value, but 9 times out of 10 it's not obvious.
I don't mind auto settings themselves, but I agree about the opaque ones. In this case, the effective value is stated in the field's help message, but it's still easy to miss.
There's a rumor that older AGESA[1] versions had a bug that prevented the chipset from recognizing and utilizing ECC ram properly even though the chipsets should support it. Check any motherboard in question for a firmware update which includes at least AGESA 1.0.0.5 patch C.
Neither of those were out at the time I originally got my AM5 motherboard (which was right as they came out -- the performance numbers were so good I couldn't help but spring on one early.)
> Had no idea that 7000s series doesn’t official state support of ECC.
You've misinterpreted the author.
All currently-available Ryzen 7000 series CPUs have official ECC support, but it requires motherboard support as well.[1]
This conditional ECC support has been the case for all past AMD consumer CPUs going back to the original Athlon 64, but the Ryzen 7000 series is the first time I can recall that this support has been explicitly listed on AMD's marketing materials.
What the author is saying is that mention of ECC support had disappeared from ASRock's motherboard documentation. This was a notable change for ASRock, as they had explicitly mentioned ECC support on their previous Ryzen motherboards.
It was really iffy and not clearly mentioned until this year. Also, it's confused with on-chip ECC which all DDR5 uses. DDR5 needs on chip ECC to correct errors that'll occur in normal operation but that won't provide ECC for transfers across the memory bus to the CPU.
I have a genuine question - in practice on a workstation/developer computer, what sort of protection does ECC ram give me?
I've got two daily driver machines - a 3970x threadripper with 96GB ram, and an M1 Macbook Pro - neither of which have ECC ram.
I've been using them both for over 2 years, and not once in that period (that I'm aware of) have I found myself with a problem due to faulty RAM, but I do regularly find myself wishing both were faster.
What practical benefit would I get in exchange for the performance hit of ECC memory?
The ability to detect (and possibly correct) physical memory errors.
> I've been using them both for over 2 years, and not once in that period (that I'm aware of) have I found myself with a problem due to faulty RAM
The "that I'm aware of" part is the key issue. ECC provides visibility into the health of the physical memory that you otherwise wouldn't have.
> What practical benefit would I get in exchange for the performance hit of ECC memory?
Side-band ECC (which is what you'd use on desktops/workstations/servers) doesn't have a performance hit, as the overhead of ECC is fully neutralized by the additional bandwidth and capacity of an ECC DIMM.
> The ability to detect (and possibly correct) physical memory errors.
This is a great example of theoretical benefits (note I'm not saying that those theoretical benefits are real, I'm asking how they practically benefit me).
> The "that I'm aware of" part is the key issue.
Ok so tell me what it looks like. _That's_ what I want to know.
> into the health of the physical memory that you otherwise wouldn't have.
And does what, on my windows workstation or my linux workstation?
> Side-band ECC (which is what you'd use on desktops/workstations/servers) doesn't have a performance hit, as the overhead of ECC is fully neutralized by the additional bandwidth and capacity of an ECC DIMM.
That statement doesn't hold water. If it takes extra bandwidth and capacity to provide ECC, I can use the extra bandwidth as memory instead of error correction, no? (Or I could if the non-ecc DIMM utilised that range). Either way, it's an extra x bits for ECC that _could_ be used for storage.
> Ok so tell me what it looks like. _That's_ what I want to know.
Suppose an application running on your machine suffers some type of malfunction -- a segmentation fault or a seemingly-random kernel panic that you're unable to reproduce.[1] And suppose it happens often enough that you want to fix it, and understand the root cause.
You research the issue, and are pointed to potentially faulty hardware. That then raises a question: how do you know your physical memory is working properly?
You could run a diagnostic application like Memtest86. However, that has the following issues:
- It only tests a particular point in time. If the test "passes", that doesn't say whether your memory encountered a fault in the past, nor does it guarantee that it won't fault in the future.
- What does it even mean for a test to "pass?" Just a single run through a test suite? Running it for some period of time, like a few hours?
- A diagnostic like Memtest86 is invasive. While it's running, you can't use your machine for other things.
Troubleshooting these types of hardware errors without ECC is tricky, because there's not really a way to conclusively link a software fault to a hardware error.[2][3]
However, on a machine equipped with ECC, this type of troubleshooting is a lot more straightforward. If bits are flipped in memory, the memory controller can probably detect that (assuming only one or two bits are flipped), and can raise a machine exception that the OS can catch and do something with (e.g., logging the error, terminating the impacted process if the error is uncorrectable, etc). That saves a lot of time and headache that you'd otherwise be spending on guesswork.
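To make the "one or two bits" part concrete, here is a toy single-error-correct, double-error-detect (SECDED) code in Python. It is an extended Hamming code over 4 data bits, chosen only because it is small enough to read; real memory controllers use much wider codes, and the `encode`/`decode` names are illustrative rather than any actual hardware interface.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1 bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, indexes 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    parity_error = sum(code) % 2 == 1
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        # Odd number of flips: treat as a single flip at index `syndrome`
        # (index 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected, not fixable.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                 # one flipped bit: repaired transparently
print(decode(word))
word[2] ^= 1                 # a second flipped bit: reported, not repaired
print(decode(word))
```

The second case is the one the OS has to handle, because the data is known to be bad but cannot be reconstructed.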
You may still be asking, "how big of an issue is this, really?" I suppose whether or not you care about having hardware that can detect memory faults is up to you.[4] However, during the portion of my career where I was managing large machine fleets, memory failures were the second most common type of hardware failure, behind only mechanical disk drives.
> That statement doesn't hold water. If it takes extra bandwidth and capacity to provide ECC, I can use the extra bandwidth as memory instead of error correction, no?
No. The memory controller is unable to use the additional capacity for anything other than error checking (hence why it's called "side-band").
What you're describing is "in-band ECC", which is something that you sometimes see on GPUs or small form factor systems aimed at professional markets.
---
[1] An application crash is actually one of the better outcomes of a memory error. The worst-case scenario is silent corruption of data stored in memory that your application assumes to be valid.
[2] Errors as a result of faulty memory have been particularly frustrating to OS developers, as it can result in nonsense error reports. Linus Torvalds has bemoaned the lack of ECC options on consumer hardware, claiming that it was the industry cheaping out. During the lead-up to the launch of Windows Vista, Microsoft officially encouraged the use of ECC. I don't follow development of the BSDs, but it wouldn't surprise me if they similarly wished their users would use ECC across the board.
[3] When Google first started building out the infrastructure for their server farms, they opted not to use ECC memory for their servers as a cost-savings measure. They subsequently had a memory fault on one of their machines that resulted in corruption of their search index. While they worked around the problem by adding logic to verify the contents of the index, all subsequent generations of machines that Google has deployed have ECC support. For specifics, see the second footnote in https://danluu.com/why-ecc/
[4] Note that many of your other components that store data in some way (e.g., caches, persistent storage, etc.) are likely to have ECC capability. The only notable exceptions I can think of are GPU memory, the main system memory on consumer hardware, and the processor's registers.
> Side-band ECC (which is what you'd use on desktops/workstations/servers) doesn't have a performance hit, as the overhead of ECC is fully neutralized by the additional bandwidth and capacity of an ECC DIMM.
This is true as far as I know, but only when comparing ECC RAM against non-ECC RAM that is otherwise equivalent. If you want the fastest RAM available, it doesn't support ECC, so you're still taking a perf hit by buying slower RAM with ECC support than you could get without ECC.
Yeah, Ryzen usually performs best with faster RAM, but finding good speed on ECC sticks is generally tricky. It makes sense to get x3d version of CPU if you are planning to build a system and squeeze more out of it.
Anecdotally, on a desktop machine with something like 64GB RAM, when I switched from non-ECC to ECC (and 128GB RAM), the number of unexplained crashes seemed to drop from one every couple of weeks to one every 6 months or so. Could have been from a software upgrade though, probably Ubuntu 18 to Ubuntu 20.
Now I'm on an M1 MBP with 32GB RAM, and I get unexplained crashes roughly every 3-4 months, but I've been chalking it up to MacOS. Behavior also seems to degrade after a month or two of uptime. Hmm, maybe it's time for Apple to go ECC ...
> Anecdotally, on a desktop machine with something like 64GB RAM, when I switched from non-ECC to ECC (and 128GB RAM), the number of unexplained crashes seemed to drop from one every couple of weeks to one every 6 months or so. Could have been from a software upgrade though, probably Ubuntu 18 to Ubuntu 20.
Thanks. This is a great anecdote/example that is much more what I'm interested in. I don't see the same level of system instability on my machine, but it's a good example. Appreciate it.
> Behavior also seems to degrade after a month or two of uptime.
Ah, here's another thing. I do tend to reboot daily/weekly which may help
Some industries require to be able to reproduce exact bit by bit data.
Anything that requires absolute bit certainty, for example digital signatures, encryption, or financial information, will require ECC.
If you don't care if your files eventually don't match a sha256 checksum, you'll be fine. For example, with source code it doesn't matter, because in the worst case your code won't compile because variable names got flipped.
And even then, if you're using Git for source code control, you're already using hash signatures to detect data corruption, as well as a very large redundancy and replication. Git is, in a philosophical way, ECC by software.
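A small sketch of the checksum point, using Python's standard `hashlib`: flip one bit in a buffer and the SHA-256 digest no longer matches. The `flip_bit` helper and the sample source line are invented for the demo.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


original = b"int total_count = 0;\n"
digest = hashlib.sha256(original).hexdigest()

corrupted = flip_bit(original, 37)  # simulate a single flipped bit in RAM
print(corrupted)
print("digest still matches:", hashlib.sha256(corrupted).hexdigest() == digest)
```

Note the limitation: a hash only catches corruption that happens after the hash was computed. If the bit flips in memory before the data is hashed and committed, the checksum will faithfully protect the corrupted version.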
> It depends on your workload.
> Some industries require to be able to reproduce exact bit by bit data.
> Anything that requires absolute bit certainty, for example digital signatures, encryption, or financial information, will require ECC.
Totally, and I get that. We've got people in this thread who are saying that non-ECC ram shouldn't be trusted [0], or that it should be standard [1]. I get the use cases, but why do I want that on my workstation?
> your code won't compile because variable names got flipped.
I have _never_ seen or even heard of this or anything vaguely resembling this happening. Are there any writeups of this anywhere at all?
ECC is expensive just because Intel had a monopoly over the server segment, and unilaterally forced desktop not to use ECC.
Otherwise people would use desktop computers as servers, and that would reduce profits.
At work, my workstation desktop computer has ECC. The colleague sitting next to me sometimes has kernel warnings of ECC errors on their screen.
And cases of bit flips happen every day, that's why some problems are solved by just restarting your computer or the software. I'd argue that many times we blame software bugs on pure hardware bit flips.
Bit flips in memory can affect instruction code, not just data. This will manifest as random bluescreen crashes or visual artifacts in the case of video memory. Similar to errors/crashes from overclocking a little too high, but occurring occasionally on a normally stable system.
I've been actually researching this a lot recently as I'm in the process of buying parts for a desktop build and want ECC support.
> Unfortunately, when the AMD Ryzen 7000 “Raphael” CPUs were launched along with the brand new Socket AM5, all mention of ECC support was gone.
1. This was effectively a big point of concern for me. Previous ASRock Taichi motherboards officially advertised full ECC support, but not the latest X670E model for AM5. An older version of the X670E Taichi page mentioned ECC support ("Supports DDR5 ECC/non-ECC, un-buffered memory up to 6600+(OC)") [0], and this Level1Tech review [1] had a segment on ECC support (they fully tested it). The segment was around the 10min mark, but it looks like the video was swapped to remove it during the last 30 days (the original video was 17min41sec long, the current one is only 16min32sec). So I assumed that ECC was supported, but it's a bit concerning that there's no official mention. The article reassures me.
2. Finding ECC memory for desktop computers is hard. For the X670E Taichi, you need DDR5 unregistered DIMM. The best solution I have to find ECC memory is to check QVL lists for similar server motherboards, for example this ASRock Rack [2] or this Gigabyte Motherboard [3]. I decided to go with two 32GiB Kingston sticks [4] for my own build.
Yes, a great article! It was no big surprise to see that the article was written by an Oxide employee - it has been a joyful signature that they've managed to attract curious people with an engineering approach to investigate problems and look up how things work. I'm probably biased, but it seems like a dream company to me.
Though I have no intention of switching to ECC on my Ryzen system, I found this writeup both very clear and a great compelling educational read. Thank you for creating it.
Recently decided not to buy the AMD laptop from framework because they didn't bother to implement ECC. Started me thinking that I don't really need a new laptop, have an old one... so why continue to suffer from compromised parts at my desk?
Unfortunately AMD has been going the wrong way with this. Oddly enough, Intel has quietly added a few cheaper models with ECC support.
Unlike the author I don't have time for building a new PC and a bunch of maybes. I found this machine that's pretty cheap:
I think you're being a bit harsh to AMD on this, you still can run ECC on their consumer CPUs, you just need to research the motherboard because any motherboard/chipset combo could have ECC support.
That Dell has the Intel W680 chipset that is responsible for giving any Intel CPU ECC, just more pointless market segmentation by Intel. I don't know if I would give Intel a glowing review for that, you can also buy an OEM system with a Ryzen Pro processor if you want that kind of ECC guarantee.
You're not going to find Intel offerings with ECC solutions for light laptops either right now.
>Who sells certified ECC systems with AMD at similar prices?
I don't think you can blame AMD for the OEMs' choices. Lenovo does compete in a way. You have to buy a GPU with their Threadripper workstations because there is no iGPU for the computer to fall back on, unlike the Dell with its Intel iGPU. Also your Dell comes with 1 non-ECC 8GB DIMM standard, you have to pay $247 to upgrade to an ECC DIMM. Lenovo's Threadrippers come with ECC standard and can even run registered DIMMs.
But yeah it is a $2500 Threadripper workstation with no GPU compared to a $1300 i5 workstation with no GPU.
I haven't looked but I suspect not many laptops support ECC, if any. That's not a good reason to skip AMD.
The Dell may support ECC, but as any workstation it's pricier than consumer grade desktop with equivalent performance. If you need it, you need it but usually individuals don't pay for that, their employers do.
There are some xeon thinkpads that have full fat ECC capability, however you need to be exceptionally rigorous in making sure that's what you are getting. Only some of the laptop xeon models support ECC, and it's not as simple as "higher cpu model number better", it's a fucked up matrix that intel loves to do.
I believe that there might be issues with it not being enabled properly if you buy the machine with a minimum of non-ecc memory intending to switch it out, because lenovo does something fucky if you don't get ECC from the get go. I could be misremembering that last bit, it's been a couple of years since I looked at ECC capable workstation laptops.
But yeah, when framework comes out with an ECC workstation laptop is when I'll finally consider getting a new laptop.
It’s not a question of whether we want it or not. You’ll have to take it up with AMD on enablement of ECC support on non-PRO U-series mobile processors.
> The most foolproof way to test whether ECC is working is to introduce an error somehow.
>
> ApplesOfEpicness did so by shorting a data and ground pin on their motherboard.
Epic! Some people, really...
> I don’t quite have the courage to physically short pins, nor the patience to slowly overclock my RAM, waiting multiple minutes for DDR5 link training each time
It's not clear to me: when is the DDR5 re-trained? I assembled my Ryzen 7000 desktop PC myself and I don't remember ever having to wait minutes for it to boot, not even the first time. It's a bit slower at boot I'd say than non-DDR5 PCs I had in the past but it's mere seconds, not minutes.
And last but not least: DDR5 specs for ECC vs non-ECC, which price difference and speed difference are to be expected?
> which price difference and speed difference are to be expected
ECC RAM uses 1/8th more memory chips, so as a rule of thumb it costs 1/8th more.
They can have very different dynamics on the used market though, because servers tend to be upgraded to larger modules over time, but there aren't many buyers for smaller modules. Not a big factor for DDR5 because it's too new, but for DDR4 you can get some great deals for 8GB and 16GB modules.
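The 1/8th rule of thumb falls straight out of the bus widths. A quick Python check, using the widths mentioned earlier in the thread (64-bit non-ECC, 72-bit or 80-bit ECC UDIMMs); `ecc_overhead` is just a name picked for this sketch, and actual pricing obviously depends on more than chip count.

```python
def ecc_overhead(total_bits, data_bits=64):
    """Fraction of extra width (and so, roughly, extra DRAM) spent on ECC."""
    return (total_bits - data_bits) / data_bits


for width in (72, 80):
    print(f"{width}-bit module: {ecc_overhead(width):.1%} extra")
# 72-bit module: 12.5% extra
# 80-bit module: 25.0% extra
```

So the 1/8th figure matches a 72-bit module; the 80-bit DDR5 modules carry closer to 1/4 extra.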
...But are there any ECC sticks with "known" good Hynix ICs? I got lucky with a pair of sticks that can do DDR5 6000 CL30 at 1.3V (and maybe better), and I wouldn't want to sacrifice a ton of RAM performance for ECC.
You should do it and post it. Ryzen 7000 likes OC'd RAM, and overclocked ECC ram is kind of a desktop hardware myth/legend because it's so rare and oxymoronic. Not sure if you've done it before, but DDR5 overclocking is quite a rabbit hole, and there are some "easy" tutorials out there like this: https://www.youtube.com/watch?v=dlYxmRcdLVw
Test the stock primary timings with the modded subtimings at 6000 first, then jump straight to cl34 for the primary timings and reduce the voltage/primary timings from there.
I would grab these myself, but the kit is too rich for my blood at the moment.
To be honest the only RAM overclocking I've done is by going into the XMP settings and turning on the shipped overclock configuration. For CPUs I tend to undervolt them a bit to reduce power consumption.
I'm also interested in what speeds and CL that DIMM would run. By the way I'm the one who started that ASRock forum topic, what a small world to randomly stumble here while scrolling HN.
In the end I put the PC upgrade on hold, partly because of that AMD BIOS ECC support mess and also because I cannot find fast DIMMs. Just looking today, it seems ASRock has still not re-added ECC support info on the AM5 board spec pages, so the more people test those board and memory combinations the better for the community.
I used ECC with a ThreadRipper 2920x on an Asrock motherboard, but for a new Ryzen 7950x build I went back to regular RAM. The problem with ECC with the TR was that only Unbuffered DIMMs work, not Registered DIMMS, and the 7950x is the same. Unbuffered ECC is slower or costlier (or both) than the equivalent non-ECC or Registered, enough that it just wasn't worth it in a new build.
Ryzen has the additional disadvantage vs TR of only being able to run RAM fast if you limit yourself to two sticks. My TR build used 4x16GiB but for my 7950x build I had to get 2x32GiB to be able to run them at 6000MHz.
I currently run a 3700X with 64 GB ECC. I've been interested in a jump to an AM5 series, but have found the matter of ECC compatibility frustrating to research, particularly because DDR5 has limited ECC built in and that confuses the discussion.
I'm curious what your use case is. I mostly use mine for software development (including VMs) and some family media processing. I like the peace of mind that ECC brings, but I have no way to really know if RAM speed is a performance factor.
Personal Linux machine mostly used for watching video, code compiling, and some light gaming. The 7950x definitely compiles my stuff much faster than the TR 2920x did, but I don't know how much of that is because of the RAM speed (6000MHz vs 4800MHz, faster timings) and how much because the CPU clocks higher (~5.4GHz vs ~4.2GHz).
> Ryzen has the additional disadvantage vs TR of only being able to run RAM fast if you limit yourself to two sticks.
TR has it similar, just doubled; you are able to run 4 sticks faster than 8. On my TR2920X+X399 (probably the same config as yours, Asrock Taichi) I'm able to run any four of them at 3066 MHz and all of them only at 2933 MHz.
Yes, I wasn't saying TR doesn't have a limit. I was saying Ryzen's limit was smaller, so I would've had to get bigger UDIMMs to have the same total memory as before.
(My 2920x was on an X399 Phantom Gaming 6 and my 7950x is on a B650 PG Lightning. I find the Taichis of all generations to be overkill.)
it's not really possible to say if ecc would prevent the issue you're describing.. but that's why imo it is worth having ecc--so you don't have to worry as much about if the weird behavior you see is due to random memory errors.
On the more immediately practical side.. if this is happening frequently enough, run memtest86 or GSAT (Google's Stressful Application Test) and see if either can pick up any errors.
You also might be able to improve things with a bios upgrade, or by lowering the clock speed of the ram.
I feel the need for ECC is a bit overstated, probably confused by listening too much to people in the server space. In servers, ECC is critical because servers simply have more RAM than PCs do. You can easily find systems with 32 sticks of RAM. If you roll 32 dice, the odds of a rare freak error are a lot bigger than if you roll 2. It's the same reason why you probably don't need RAID on a computer with one or two disks. Stick 48 disks in a machine however, and you're a fool to go without it (or a plan at least). It's why you don't need an automatic fire suppression system in your homelab, because the odds of a fire in one or a few PSUs are fairly small, but stick a few thousand in a room and suddenly it's a very real problem.
That, and most memory problems in PCs tend to be from dodgy overclocking or just bad sticks rather than cosmic rays. ECC won't really save you from that.
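The dice argument can be put in numbers with a few lines of Python. The per-stick probability below is a made-up placeholder, not a measured failure rate, and the formula assumes sticks fail independently of each other.

```python
def p_any_failure(p_single, n):
    """Chance that at least one of n independent parts has an error."""
    return 1 - (1 - p_single) ** n


p = 0.01  # hypothetical per-stick error probability over some period
for sticks in (2, 32):
    print(f"{sticks:2d} sticks: {p_any_failure(p, sticks):.1%}")
```

With that placeholder, 2 sticks come out at roughly 2% and 32 sticks at roughly 27%, which is the scaling effect being described; whether the desktop number is small enough to ignore is the actual disagreement in this thread.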
Error detection and correction is actually ideal when overclocking. It'll not only save you, it'll let you know you need to back off a bit, because the handful of errors that may happen once a month despite passing days of tests will show up in the system logs instead of staying silent and causing problems. I speak from experience as I overclocked ECC memory in a first gen threadripper system, playing around with overclocking as a hobby for a bit. ECC memory was fantastic to work with.
However the dodgy or outright lack of CPU/Motherboard support results in trying to market that feature to gamers/overclockers having too much friction. Much like what went on with intel's confusing proprietary optane/flash combo M.2 drives that needed certain motherboards in order to fully access both parts of the drive.
And with bad sticks, instead of hard to explain crashes, you'll end up with system logs filled with corrected/detected errors and maybe crashes if it's really bad. Then you just RMA the sticks like normal. So honestly that's a win too, because that's the whole point of detecting and correcting errors. It's in every other part of the system already, continuing to leave this capability out of memory is an oversight that needs to be.. corrected.
I just don't really get the strange resistance people have to ECC, it's not some sacred cow. It's just good sense.
> It's the same reason why you probably don't need RAID on a computer with one or two disks
No, it is not "the same reason"
Disks have ECC, even consumer hard drives have checksummed blocks, that's how you get media errors detected. RAM does not. Well, technically with DDR5 it has internal one but it does not give any feedback to the machine so you might not know your RAM has any problems.
> It's why you don't need an automatic fire suppression system in your homelab
But you want smoke sensors. ECC is the smoke sensor too
> That, and most memory problems in PCs tend to be from dodgy overclocking or just bad sticks rather than cosmic rays. ECC won't really save you from that.
> Disks have ECC, even consumer hard drives have checksummed blocks, that's how you get media errors detected. RAM does not. Well, technically with DDR5 it has internal one but it does not give any feedback to the machine so you might not know your RAM has any problems.
Sure, and disks have random errors even with error correction. Adding hamming codes just makes errors less likely and easier to detect, but they aren't fool proof, in ram or on disk or anywhere else. The protection offered is a bit questionable because they rely on the very sketchy assumption that errors are statistically independent events, which is rarely how storage errors work.
> But it will tell you the problem
> Very ignorant take
Hardware isn't perfect. No matter how much error correction you pile onto it, you will still have errors and some cases will still be undetectable. It's a matter of pushing the rate of errors into an acceptable range, as a pure cost-benefit tradeoff, statistics through and through.
Ever had to troubleshoot bit flips on a non-ECC system? One friend felt like he was going crazy as over the course of two months his system degraded from occasional random errors to random crashes, blue screens and finally to no POST. Another time, a coworker had to stare at raw bytestreams in Wireshark for hours to find a consistently flipped bit.
How often do you test your memory? The nice thing about ECC is it's always testing your memory, and (if it's set up properly!) you'll get notified when it begins to fail. Without ECC, your memory may begin to fail, and you'll have to deal with the consequences between when it starts to fail and when you detect it.
(Of course, I don't run ECC on my personal systems, but at least I'm wandering knowingly into the abyss)
Testing your memory detects if you have bad RAM, which ECC isn't going to help with anyway. Perfectly fine memory will experience random bit flips from environmental factors. Your PC components and UPS also degrade over time and can cause random bit flips. ECC is there to catch problems as they happen and ideally take corrective action before bad data propagates
> Wow I came back to post this exact reply. I set my system to a slightly high frequency, ran memtest overnight with errors.
>
> Set it back down to a supported frequency, ran a full memtest suite again with no errors.
Cool. You tested your memory at some point in the past.
How do you know it's still working properly and hasn't flipped any bits?
You don't. Because you have no practical way of testing the integrity of the data without running an intrusive tool like memtest86 that basically monopolizes the use of the computer.
Being able to detect these types of memory errors at a hardware level while the processor is doing other things is the fundamental capability that ECC gives you that you otherwise wouldn't have, no matter how thoroughly you run memtest86.
You likely wouldn't know if you had random bit flips. It'd manifest as silent data corruption. You might be okay with that. Others aren't.
It's not a matter of overclocking. Bit flips are a fact of life running with 32+ GB RAM. Leaving your machine on 24/7 (even if in sleep) stacks the odds against you.
Obviously this is just anecdote but I have a work laptop with 128GB of non-ECC ram , use all of it every day and never noticed any issues. I'm not saying there aren't any, but it just....works.
>"I have ECC working on ASUS ProArt X670E-Creator with AMD Ryzen 9 7950X. But, you have to explicitly turn ECC on in the bios. If left on ‘Auto’, it will be off. I use four sticks of Supermicro (Hynix) 32GB 288-Pin DDR5 4800 (PC5-38400) Server Memory (MEM-DR532MD-EU48)."
[1] https://www.reddit.com/r/Amd/comments/lzxqod/list_of_am4_mot...
So this definitively settles it, that the AMD+ASRock combo is truly ECC RAM?