One thing I didn't glean from either article, though, is why ECC RAM needs to be implemented in both the CPU and the RAM. Is the concern that corruption could occur either inside the CPU's memory (registers, caches) or inside the RAM module, so hardware needs to exist in both? If so, wouldn't having ECC RAM modules still prevent the latter? I had the understanding that RAM is meant to function largely agnostic of the CPU using it.
Think about how CPUs access RAM, Random Access Memory.
Now think about how you would store and process the ECC (checksum / parity data) without the optimally dispersed bits.
This is very much like RAID-5,6,Z or like the addition of integrity codes in other media. CDs and DVDs have the ECC protected data as the default layer and also expose the raw blocks to the OS.
Random seeks in a CD / DVD were possible because the device could start reading from anywhere and get back a full stream of data.
With RAM used the same way, the first question is: where is the ECC line drawn? Is it per bus word (e.g. 128 or 256 bits), per OS page (often 4 KByte for common OSes), or some other unit? The next question is where the ECC data is stored.
If you're keeping this efficient, smaller reads are better, and the 4 KByte OS page is a common multiple for lots of things, so at the moment that's about where I'd draw the line. Dedicating silicon to calculate parity over a page that large is probably far more expensive than the gates for a couple of extra bits per native word (a design that might also match the parity used for internal caches and could thus be reused), but a page would be a logical maximum for doing things the hard way.
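To put rough numbers on that trade-off, here is a back-of-the-napkin sketch in Python. It assumes a plain extended Hamming (SECDED) code at every granularity, which is my simplification and not necessarily what any real memory controller does; the names `secded_check_bits`, `overhead_table` and `UNITS` are made up for illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for an extended Hamming (SECDED) code over data_bits."""
    r = 0
    # Hamming bound for single-error correction: 2^r >= data + r + 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall-parity bit for double-error detection


def overhead_table(units):
    """Return (name, data_bits, check_bits, overhead_percent) per unit."""
    rows = []
    for name, bits in units:
        check = secded_check_bits(bits)
        rows.append((name, bits, check, 100.0 * check / bits))
    return rows


# Hypothetical protection units taken from the question above.
UNITS = [
    ("64-bit word", 64),
    ("128-bit word", 128),
    ("256-bit word", 256),
    ("4 KiB page", 4096 * 8),
]

for name, bits, check, pct in overhead_table(UNITS):
    print(f"{name:>13}: {check:2d} check bits, {pct:5.2f}% overhead, "
          f"{bits} data bits read per check")
```

The storage overhead shrinks as the unit grows, but every check then has to read and process the whole unit, and one SECDED code per page could still correct only a single flipped bit in the entire 4 KiB. That is the "hard way" cost, as opposed to a few extra bits per native word.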
Virtual page alignment would also be a performance issue, and the cache would have an effectively fixed blocksize and read-ahead granularity.
Or everyone could just use the one already developed industry standard that was optimized by engineers not thinking on the back of a napkin.
It doesn't. I recently read somewhere that Intel had not long ago put out some embedded CPU products that can do it with non-ECC RAM, maybe someone with better memory can find a link.
A related thing that's been on my mind lately is the movement from DDR4 to DDR5, and what that means for ECC. DDR4 and previous generations use 64-bit channels, which need 8 bits for SECDED protection, so ECC-supporting DIMMs are 72 bits wide. DDR5 goes to (twice as many) 32-bit channels, though, which need 7 bits for SECDED protection -- but DRAM dice are traditionally made in multiples of 8-bit widths, so DDR5 ECC-supporting DIMMs are 80 bits wide (two 40-bit channels). What's interesting about this is that there's an extra bit of (ECC-protected) physical memory available to the hardware for each 32-bit word, at zero cost for a machine that already requires ECC. This raises some interesting opportunities, like tagged memory, that I haven't seen explored anywhere yet.
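The arithmetic behind those widths can be checked with a short Python sketch. It assumes a textbook extended Hamming (SECDED) code; the function names and the 8-bit rounding helper are mine, for illustration only.

```python
def secded_check_bits(data_bits: int) -> int:
    """Minimum check bits for SECDED over data_bits (extended Hamming)."""
    r = 0
    # Single-error correction needs 2^r >= data + r + 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # plus one overall parity bit for double-error detection


def channel_layout(data_bits: int, device_width: int = 8):
    """Return (check_bits, physical_width, spare_bits) for one channel.

    Assumes the channel is rounded up to a multiple of device_width bits,
    as described in the comment above.
    """
    check = secded_check_bits(data_bits)
    needed = data_bits + check
    physical = -(-needed // device_width) * device_width  # ceiling to multiple
    return check, physical, physical - needed


ddr4 = channel_layout(64)  # one 64-bit channel
ddr5 = channel_layout(32)  # one 32-bit subchannel
print("DDR4 channel:   ", ddr4)
print("DDR5 subchannel:", ddr5)
```

Under those assumptions a 64-bit channel needs 8 check bits and fills 72 bits exactly, while a 32-bit subchannel needs 7 check bits, rounds up to 40 bits, and leaves one bit spare.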
Probably best to use the extra bit as a parity bit over the whole SECDED-encoded word, or in some other way to make the algorithms more robust. (Updated after a visit to Wikipedia to learn more about the acronym in use.)
I'm a little surprised by Linus's comments. The Xeon E3s are just slightly tweaked Core i3/i5/i7s. Similar pricing, similarly priced motherboards (maybe another $30), ECC support, and ever so slightly lower clocks to hit their reliability goals for servers.
So I bought a Xeon E3-1230 (forget which gen). It was cheaper than the (then) top-of-the-line i7, was 100-200 MHz slower, and supported ECC.
AMD does support ECC with their desktop CPUs, but doesn't guarantee it will work with a given motherboard. Some motherboards "work" in that the ECC DIMMs are compatible and report the memory as available, but don't actually implement ECC (actually correcting errors). There are forums to dig around in for info, and some motherboard manufacturers test for ECC compatibility and publish the results; others don't.
Personally I prefer the Intel approach: those willing to pay a bit more per CPU and motherboard get cheap ECC that is guaranteed to work. AMD's approach leaves more up in the air; you can't just buy a random Ryzen, a random motherboard, and random ECC DIMMs and be guaranteed it will work.
This is, however, guaranteed with AMD EPYC, but those are MUCH more expensive than the Intel Xeon E3 line.
Note that while the Xeon Silver is $50 cheaper than the Rome 7232P, it gets absolutely clobbered by the AMD part, which also offers compelling features like secure encrypted virtualization that the Intel part doesn't.