[dupe] Why don’t PCs use error correcting RAM? “Because Intel,” says Linus (arstechnica.com)
82 points by sytelus on Jan 7, 2021 | 14 comments



500+ comment thread 3 days ago: https://news.ycombinator.com/item?id=25622322.

One thing I didn't glean from either article, though, is why ECC RAM needs to be implemented in both the CPU and the RAM. Is the concern that corruption could occur either inside the CPU's memory (registers, caches) or inside the RAM module, so hardware needs to exist in both? If so, wouldn't ECC RAM modules alone still prevent the latter? My understanding was that RAM is meant to function largely agnostic of the CPU using it.


Most of the logic resides in the memory controller, and the extra bits for the checksum reside in DRAM.

You need a little logic in the OS to report corrected and uncorrectable events.
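To give a rough idea of the OS side: on Linux, the EDAC subsystem exposes per-memory-controller counters under /sys/devices/system/edac/mc/, with ce_count for corrected errors and ue_count for uncorrectable ones. Below is a minimal sketch that reads them. The function name and the `root` parameter are my own, and the directory only exists if the kernel has an EDAC driver for the platform.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrectable)} from EDAC sysfs counters."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrectable={ue}")
```

An empty result only means no counters were found. It does not tell you whether ECC is actually active on the machine.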


Why can't you just reserve portions of the usable RAM for the checksum bits?

You could even potentially have different parity levels for different applications depending on need.


Think about how CPUs access RAM, Random Access Memory.

Now think about how you would store and process the ECC (checksum / parity data) without the optimally dispersed bits.

This is very much like RAID-5, RAID-6, or RAID-Z, or like the addition of integrity codes in other media. CDs and DVDs have the ECC-protected data as the default layer and also expose the raw blocks to the OS.

Random seeks in a CD / DVD were possible because the device could start reading from anywhere and get back a full stream of data.

With RAM used the same way, the first question is: where is the ECC boundary drawn? Is it per bus word (e.g. 128 or 256 bits), per OS page (often 4 KB on common OSes), or some other unit? The next question is where the ECC data is stored.

If you're keeping this efficient, smaller reads are better, and the 4 KB OS page is a common multiple for lots of things, so at the moment that's about where I'd draw the line. Dedicating silicon to calculate the parity on a page that large is probably way more expensive than gates for a couple of extra bits per native word (a design that might also match the parity used for internal cache, and thus could be reused), but it would be a logical maximum for doing something the hard way.

Virtual page alignment would also be a performance issue, and the cache would have an effectively fixed blocksize and read-ahead granularity.

Or everyone could just use the one already developed industry standard that was optimized by engineers not thinking on the back of a napkin.


It doesn't. I recently read somewhere that Intel not long ago put out some embedded CPU products that can do it with non-ECC RAM. Maybe someone with a better memory can find a link.


Intel calls this in-band ECC and it's in their Elkhart Lake Atom SoC. https://www.anandtech.com/show/16102/intel-launches-10nm-ato... This is pretty inefficient compared to regular ECC but you can't really do regular ECC with LPDDR4.
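To illustrate why storing the check bytes in ordinary RAM costs bandwidth, here is a toy model. It is not Intel's actual layout. The 64-byte line, the 8 check bytes per line, the ECC region placed right after the data, and the one-line ECC cache are all assumptions of mine, and every name is hypothetical.

```python
LINE_BYTES = 64  # assumed access granularity (one cache line)
ECC_BYTES = 8    # assumed check bytes per line, the same 1/8 ratio as a 72-bit DIMM


def usable_lines(total_bytes):
    """Number of data lines that fit once each line also needs ECC_BYTES of storage."""
    return total_bytes // (LINE_BYTES + ECC_BYTES)


def ecc_address(line_index, total_bytes):
    """Byte address of the check bytes for a data line (ECC region follows the data)."""
    n = usable_lines(total_bytes)
    if not 0 <= line_index < n:
        raise IndexError("line outside the usable data region")
    ecc_base = n * LINE_BYTES
    return ecc_base + line_index * ECC_BYTES


def dram_reads(line_indices, total_bytes):
    """Count line-sized DRAM reads, caching only the most recently fetched ECC line."""
    reads = 0
    cached_ecc_line = None
    for i in line_indices:
        reads += 1  # the data line itself
        ecc_line = ecc_address(i, total_bytes) // LINE_BYTES
        if ecc_line != cached_ecc_line:
            reads += 1  # extra fetch for the line that holds the check bytes
            cached_ecc_line = ecc_line
    return reads
```

In this model, eight consecutive lines share one ECC line, so sequential reads pay about one extra fetch per eight. Scattered reads pay one extra fetch each, which roughly doubles the traffic. With side-band ECC the check bits arrive on extra data lines in the same transfer, so there is no second fetch.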

Nvidia also implemented something similar years ago.


Interesting. Nvidia indeed mentions this at https://developer.nvidia.com/blog/inside-pascal/


Interesting, so validation of the parity bits occurs in the CPU? Presumably when data is read off the bus.
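For a feel of what a check on read looks like, here is a minimal sketch of an extended Hamming (8,4) code, the smallest SECDED code. Real controllers protect much wider words in hardware. The bit layout and the function names are my own choices for illustration.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # positions 1, 2, 4 hold Hamming parity; 0 holds overall parity


def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits by position)."""
    if not 0 <= nibble < 16:
        raise ValueError("nibble must be in range 0..15")
    word = [0] * 8
    for k, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> k) & 1
    for p in (1, 2, 4):
        # parity bit p covers every position whose index has bit p set
        word[p] = sum(word[i] for i in range(1, 8) if i & p and i != p) % 2
    word[0] = sum(word[1:]) % 2  # overall parity over the Hamming(7,4) part
    return word


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid codeword
    overall = sum(word) % 2  # 0 if the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # odd number of flips: assume one, located at `syndrome` (0 = the overall bit)
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # nonzero syndrome with even parity: two bits flipped, cannot be located
        return "uncorrectable", None
    nibble = sum(word[pos] << k for k, pos in enumerate(DATA_POSITIONS))
    return status, nibble
```

The DIMM only stores the extra bits. Recomputing the syndrome and deciding between "corrected" and "uncorrectable" is the part that lives in the memory controller.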


Can you expand the pronoun "it" so your comment is less ambiguous? Thanks.


ECC ("it") doesn't need to be implemented in both CPU and RAM.


A related thing that's been on my mind lately is the movement from DDR4 to DDR5, and what that means for ECC. DDR4 and previous generations use 64-bit channels, which need 8 bits for SECDED protection, so ECC-supporting DIMMs are 72 bits wide. DDR5 goes to (twice as many) 32-bit channels, though, which need 7 bits for SECDED protection -- but DRAM dice are traditionally made in multiples of 8-bit widths, so DDR5 ECC-supporting DIMMs are 80 bits wide (two 40-bit channels). What's interesting about this is that there's an extra bit of (ECC-protected) physical memory available to the hardware for each 32-bit word, at zero cost for a machine that already requires ECC. This raises some interesting opportunities, like tagged memory, that I haven't seen explored anywhere yet.
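The 8-bit and 7-bit figures follow from the usual Hamming bound plus one overall parity bit. A quick sketch of the arithmetic (the function name is mine):

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming SECDED code protecting `data_bits` data bits."""
    r = 0
    # single-error correction needs 2**r >= data_bits + r + 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one more overall-parity bit adds double-error detection


for width in (32, 64):
    print(width, secded_check_bits(width))
# prints "32 7" and "64 8"
```

A 40-bit DDR5 subchannel therefore has 40 - 32 - 7 = 1 bit left over per 32-bit word, which is the spare bit described above.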


Probably best to use the extra bit as a parity bit over the whole SECDED-encoded word (updated after a visit to Wikipedia to learn more about the acronym in use), or to use it in some other way to make the algorithms more robust.


I'm a little surprised by Linus's comments. The Xeon E3s are just slightly tweaked Core i3/i5/i7s. Similar pricing, similarly priced motherboards (maybe another $30), ECC support, and ever so slightly lower clocks to hit their reliability goals for servers.

So I bought a Xeon E3-1230 (forget which gen). It was cheaper than the (then) top-of-the-line i7, was 100-200MHz slower, and supported ECC.

AMD does support ECC with their desktop CPUs, but doesn't guarantee it will work with a given motherboard. Some motherboards "work" in that the ECC DIMMs are compatible and the memory is reported as available, but they don't actually implement ECC (actually correcting errors). There are forums to dig through for info, and some motherboard manufacturers test for ECC compatibility and publish the results; others don't.

Personally I prefer the Intel approach: those willing to pay a bit more per CPU and motherboard get cheap ECC that is guaranteed to work. AMD's approach leaves more up in the air; you can't just buy a random Ryzen, a random motherboard, and random ECC DIMMs and be guaranteed they will work. This is, however, guaranteed with the AMD Epyc, but they are MUCH more expensive than the Intel Xeon E3 line.


> [ECC support is] guaranteed with the AMD Epyc, but they are MUCH more expensive than the Intel Xeon E3 line.

Ryzen 7 5800X: eight cores, sixteen threads, $450.

Epyc Rome 7232P: eight cores, sixteen threads, $500.

Xeon Silver 4110: eight cores, sixteen threads, $450.

Note that while the Xeon Silver is $50 cheaper than the Rome 7232P, it gets absolutely clobbered by the AMD part, which also offers compelling features like secure encrypted virtualization that the Intel part doesn't.

https://www.cpubenchmark.net/compare/Intel-Xeon-Silver-4110-...



