I still think it's important, but it makes sense that people don't want to pay more for something to detect errors that are quite unlikely.
I also disagree that the benefits are meager. I have suffered from several computers with flaky memory (fine when you build the computer, flaky several years later; then you do an overnight memory test and find that indeed the memory has gone bad), and a strong software signal that "hey your memory is bad" is very actionable. You also have to think about it from a programming standpoint -- what happens if this variable isn't what I set it to? What if "for 1..10" is actually "for 1..2147483658"? Do you have time to debug that? How much data do you lose when you persist that to disk? To me it is insane to not get this nearly-free consistency check if you ever plan to persist any bytes in RAM to long-term storage. Even consumer GPUs have ECC memory these days. It's a no-brainer.
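To make the "for 1..10" scenario concrete, here is a minimal Python illustration (names invented for illustration) of how a single flipped high bit in a 32-bit loop bound turns 10 into exactly that runaway number:

```python
bound = 10
corrupted = bound ^ (1 << 31)   # one flipped bit in the high/sign position
print(corrupted)                # 2147483658 -- the runaway loop bound
```

A loop written against `bound` now iterates two billion extra times, or scribbles that value to disk before anyone notices.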
What surprises me is that the industry segments RAM sticks by ECC/not-ECC, when it's just a software function performed by the memory controller. I think everyone with a HEDT setup would be happy to enable ECC and get better reliability at the cost of ~11% less RAM (1/9th, to store the check bits). I know I would be. (I built a Threadripper workstation recently and just couldn't get any reasonable prices on QVL'd ECC memory. So I skipped it and paid like $600 for 128GB of 3600MT/16-CAS memory... would be happy to flip a switch and have that be only ~114GB.)
The reason ECC isn't widespread is that it's used to segment the market. Home users are cost sensitive, so hardware vendors have to think of ways to get server users to not buy home-user equipment. ECC is one of these levers. That's all there is to it. Everyone would have ECC if it were a mere 12.5% more expensive.
Standard memory modules become (a multiple of) 9 bits wide, with the additional bit stored near the other 8 bits.
The problem is _INTEL_ deciding it's a premium feature, and the memory manufacturers charging 50%+ more for 12.5% more hardware.
So instead of being a couple percent of the system cost, it ends up being a noticeable double digit percentage.
Of course none of this explains why apple/etc haven't done it in their phones.
ECC memory physically requires additional wires, necessitating different parts that don't sell at the same volumes.
And as far as "wires" go, traces are basically free as long as they don't force additional PCB layers, which shouldn't be the case given the careful pin/DDR chip/processor designs focused on exactly that. The extra pins are there regardless, and in the past so were the "wires", given that CBx/DMx pins were muxed. Further, packetized RAM interfaces have been known to bury the error correction in the protocol, same as PCIe/etc. Meaning it's a slight efficiency loss.
Look at it this way: every part of the system _EXCEPT_ the RAM has some kind of error correction on it at this point. It's a cheap way not only to increase robustness, but it's also a security mechanism.
If I were guessing, I would say that DDR5 is the last version that isn't ECC end to end; the idea that the manufacturers can shrink the dies at the cost of some BER and make it up with a bit of embedded ECC will just be too tempting. Particularly with AMD on the rise, Intel will have a much harder time playing product segmentation games if AMD doesn't do the same.
So I save $50 on the CPU and that just about covers the ram/motherboard premium.
For a memory controller to add extra bits, they would have to come from somewhere. For every 64 bits (8 bytes read), you now need to read 72 bits (9 bytes).
However “DIMMs are printed circuit boards that carry multiple packaged DRAMs and support 64bit or 72bit databus widths, the latter to enable eight error-checking and correction (ECC) bits to protect against single-bit errors.”.
So every read from 64-bit (non-ECC) DRAM now needs two reads: one for the 64 bits you want, and another to fetch the 8 ECC bits.
If your access pattern is random, this doubles the number of memory accesses, a slowdown of up to 100%. For long sequential reads/writes the slowdown approaches 12.5%, assuming you can batch eight data accesses with the single access that covers their ECC bits (avoiding invalid intermediate states is essential!).
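The extra lookup can be sketched in a few lines of Python. This is a toy model of a hypothetical in-band scheme that carves the check bytes out of the top of ordinary DRAM; all constants and names are invented for illustration:

```python
TOTAL = 1 << 30                  # 1 GiB of physical DRAM (hypothetical)
DATA_BYTES = TOTAL * 8 // 9      # only 8/9ths is usable as data
ECC_BASE = DATA_BYTES            # check bytes packed above the data region

def ecc_addr(data_addr):
    """Address of the check byte covering the 8-byte word at data_addr.
    Every 64-bit access must also touch this second, distant location,
    which is why random access patterns roughly double memory traffic."""
    return ECC_BASE + data_addr // 8
```

Note that the check byte for word 0 lives ~900MB away from the word itself, so the two accesses almost never share a DRAM row or cache line.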
The “cost” of ECC implemented by the memory controller is:
* you lose 1/9th of your memory (as you point out)
* the speed of your computer drops by up to 100% (speed of many operations is limited by random memory access speed, not CPU)
* CPU atomic instructions for memory access are complicated https://en.wikipedia.org/wiki/Compare-and-swap and would also incur a further speed penalty in critical sections, which can be very significant.
The idea seems useful (clearly it would be a fantastic feature to be able to switch on if you suspect faulty memory) but there are likely technical reasons for why your idea is not implemented (not just price discrimination).
That appears condescending to me. Generally you should assume people are smart: ECC RAM was designed by smart people to be the way it is. Start from the assumption that if an idea isn't implemented, perhaps there are good reasons why not.
(Edited to make reply flow better. Disclaimer: I only have a very shallow knowledge of the design constraints, and I expect there are other more serious problems with the idea).
(I think there are probably issues with this in terms of how DDR itself actually works, with its interleaving, and with adding a third channel to access from, but I don't know enough about how it works to know for sure if that's the case)
At the end of the day, electrical/computer engineers do actually generally know what they're doing.
By the way, for detection you don't need the full error correcting code, but you can use an error detecting code which can use fewer bits.
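For example, a single parity bit per word detects any single-bit flip, where locating and correcting it would take a full SECDED code (8 check bits per 64 data bits). A minimal Python sketch:

```python
def parity(word: int) -> int:
    """One check bit: XOR of all bits in the word. Detects any odd
    number of flipped bits, but cannot say WHICH bit flipped."""
    return bin(word).count("1") & 1

stored = 0xDEADBEEF
check = parity(stored)                  # remember 1 bit alongside the word
corrupted = stored ^ (1 << 13)          # a single bit flip
assert parity(corrupted) != check       # detected, though not locatable
```

With detection alone you can halt or retry instead of silently computing on garbage, which is most of the value the parent comments are asking for.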
Are you talking about ECC RAM being more expensive? That's probably true, but why not give people the option of whether to use ECC? Making ECC optional is a good idea anyway, because ECC memory is generally slower.
I don't think the actual ECC implementation in the processor is very expensive to implement.
Consumer CPUs not having ECC is pure product segmentation, without technical justification, IMO.
But supporting ECC ram also adds more expense. To properly support it, BIOS engineers have to support and test it; motherboard makers have to support and test it.
I appreciate that AMD offers it for their consumer-oriented processors, but because it's a best-effort feature, and it's hard for an end user (or reviewer) to test, you never really know if you're going to get full support, or if the support is really just that you can use ECC RAM (if you want to spend a little more on it) while getting none of the benefits of ECC.
It certainly adds something to the cost (die space) of the memory controller, but I agree it's probably not much.
Realistically the mainstream CPU manufacturers have ECC solutions, so including them in the mainstream processors shouldn't be a huge issue, it's just a market segmentation ploy to exclude the feature from the consumer processor designs.
I think the cost pressure is too high. Early IBM PCs required parity RAM (a 9th chip to store whether the sum of the bits was even or odd), and would fault if the value was incorrect on reads. RAM module manufacturers made innovative fake parity modules that calculated the parity value on access, replacing the 9th RAM chip with a very simple circuit and saving money.
It would be hard to convince the whole industry not to make fake ECC ram, if ECC was mandatory.
We will have to see with DDR5 (because it supports "internal" ECC) whether it's worth it to the memory industry to build RAM that is internally denser but more error prone (as is the case with modern flash), or to continue attempting to build 100% reliable RAM (and failing).
I'm betting some clever person figures that out. Which leaves only the memory bus itself unprotected. Which IMHO, is foolish and serves only to create product segmentation. So, for a DDR5 dimm with internal ECC, generating bus ECC should be a trivial addition.
> It would be hard to convince the whole industry not to make fake ECC ram, if ECC was mandatory.
Assuming you don't use memory-mapped IO, that's easy to fix. On startup, generate 4 random bits a,b,c,d. Parity bit is data line 4a+2b+c, with d?even:odd parity, data bit 4a+2b+c is on data line D8. ECC on 64/72 uses more random bits, but is otherwise similar, although for modern chipsets it would probably have to be scrambled in the northbridge or southbridge (or equivalent) rather than the CPU, to allow for DMA and such. Note that there's no gate delay involved here; the multiplexing can be done with pass transistors.
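A small Python simulation of why this scrambling defeats a "fake parity" module (a toy model with invented names, not any real chipset's logic): the CPU swaps the parity signal onto a randomly chosen data line and randomly picks odd or even polarity, while the fake stick stores only 8 lines and regenerates the 9th as even parity on read.

```python
import random

def cpu_write(value, k, odd):
    """Map an 8-bit value plus its parity onto 9 bus lines, with parity
    swapped onto data line k and polarity chosen by the odd flag."""
    bits = [(value >> i) & 1 for i in range(8)]
    p = (sum(bits) & 1) ^ odd
    lines = bits + [p]                        # lines[0..7]=D0..D7, lines[8]=P
    lines[k], lines[8] = lines[8], lines[k]   # the scramble
    return lines

def cpu_read(lines, k, odd):
    """Undo the swap and verify parity; returns (value, parity_ok)."""
    lines = lines[:]
    lines[k], lines[8] = lines[8], lines[k]
    bits, p = lines[:8], lines[8]
    value = sum(b << i for i, b in enumerate(bits))
    return value, p == ((sum(bits) & 1) ^ odd)

def fake_parity_module(lines):
    """A 'fake parity' stick: stores only 8 lines and regenerates the
    9th as even parity of those 8 on every read."""
    stored = lines[:8]
    return stored + [sum(stored) & 1]

k, odd = random.getrandbits(3), 1             # scramble chosen at boot
written = cpu_write(0xA5, k, odd)
value, ok = cpu_read(fake_parity_module(written), k, odd)
```

In this model a fake even-parity module happens to round-trip correctly when even polarity is chosen (the XOR identity makes the swap invisible to it), but it fails the parity check on every single read once the random d bit selects odd polarity, so a random boot-time choice exposes it immediately half the time, and the line swap similarly catches circuits that pass line 8 through.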
Does it, though? I don't remember all the prices from when I was researching memory close to a year ago for a new PC, but one big notable difference between ECC and non-ECC I've seen is that almost all non-ECC sticks are overclocked. You buy something rated at 2667MHz, and what you actually get is, e.g., 1833MHz chips overclocked to that speed. Which makes them cheaper. But I'd expect non-overclocked non-ECC sticks not to be that far in pricing from ECC sticks (well, apart from the obvious difference that the latter have more chips). And because the market is focused on cheap non-ECC sticks, they don't make cheaper overclocked ECC sticks, but there is no reason they couldn't. There aren't that many unbuffered ECC sticks as it is; it's easier to find registered ones. Essentially, the market is skewed by Intel not supporting ECC on non-Xeons.
Been there. If your servers don’t have ECC memory, you’re eventually going to get bit.
I built a ML workstation for my work team and was disappointed that my options were: expensive, low clock speed server CPU + ECC, or inexpensive, very fast, desktop CPU without ECC. Even if I were willing to pay more money, it was really hard to get the same performance if I needed ECC.
On the other hand, AMD doesn't do that.
The unique factor here is that rr provides _reproducible_ crash recordings, so when it fails to reproduce a crash, you've found some nondeterminism-- either a bug in rr where it didn't replay syscalls correctly or nail down thread behavior accurately, or a hardware issue like this.
What is particularly impressive is that this lowest of the low-level debugging work is done completely in Julia - all the way.
Can't even put it on warranty because it can't be reproduced easily. Can't afford to just start replacing components. A 5000€ lemon.
EDIT: only ever used the free memtest86+ tool, I have no experience with PassMark's MemTest86.
They are also responsible for some mysterious memory compatibility problems in PCs. All PC builders have this experience: on a certain motherboard model, some DIMMs work and some don't, even though they are all off-the-shelf parts that follow JEDEC standards.
Often there are minor variations in electrical characteristics from DIMM to DIMM. If the signal integrity on the motherboard is marginal, there will be mysterious problems.
(Many overclockers hate it and think it's "too extreme" because it causes their otherwise "stable" overclocks to instantly fail. I love it because it shows how an unstable system will sometimes effectively calculate 1+1=3. I don't consider a system stable unless it can pass a full day of Linpack without a single error.)
I guess it's okay to game on a system that's right on the verge of failure; what's the worst that happens, your game crashes, maybe you corrupt a drive? Hopefully there's nothing important on it. But I like your regimen for more serious computing.
Why? 1 bad bit, even an intermittent one, is enough for me to condemn RAM. Memory that doesn't remember what was last written is simply not fit for purpose. Even without faulty hardware, most software is already buggy enough as-is.
IMHO, this means that every memory that's vulnerable to rowhammer and related techniques is defective, and/or has been specified to run at irresponsible refresh timings.
No access pattern should ever be able to change the value of bits not being accessed; that is the definition of faulty memory. And I'm astonished that there isn't a class action or something.
"Normalization of deviance" comes up a lot lately. It's like the frog being boiled, we just came to accept that 100% of RAM is defective by design.
Instead of mass recalls and class-action, the authors of memory testing utilities were persuaded to make rowhammer tests optional and off-by-default, and there is one with this massive bunch of BS that basically says "it works most of the time so you may choose to ignore it":
I recall seeing a discussion where someone basically said "100% of RAM would fail this test, so we shouldn't enable it by default" --- conveniently neglecting to mention that older DDR3 and before wouldn't.
Relatedly, I've noticed that prices of used RAM, particularly DDR/DDR2, appear to have gone up recently. Other used computer parts are also selling at surprisingly high prices --- 10+-year-old motherboards and CPUs, pre-DDR3 era stuff. I wonder if that's due to decreasing production, increasing demand from retrocomputing enthusiasts, or increasing demand from those who know about this and don't want new RAM anymore.
Computing itself is broken. Every product from physical to application state is broken in some subtle way which is then countered by a higher layer having some extra code (or component) to adjust it.
Somewhere in the quest for higher speeds, RAM runs with timings and refresh intervals that allow rowhammer. And since nobody wants to take the speed hit of having memory that's actually correct, it just.... it's okay now?
Well, I disagree. I'd like to find out how to configure my memory controller to be rowhammer-proof, even if that means a refresh cycle after every single access cycle. And then we can build performance starting from that assumption that correctness is required.
As an old mentor once said, if you start out medium-speed but wrong, you'll get faster at doing it wrong. Start slow but right, and then you get faster at doing it right.
Yeah, same thing with Spectre... part of the ToS actually states you can't benchmark Intel products with Spectre mitigations applied now (apparently that's a thing)
Agreed, if the memory would be deteriorating (who knows by which process), the device ought to be replaced. It hasn't however in the last three years, so that seems to have been a singular event (not sure if cosmic rays can permanently damage a RAM cell, but those things are tiny now).
It's much more cost effective to use software to work around hardware failures than to rely on (never perfectly) dependable hardware -- compare Google FS vs IBM mainframes.
There are some things in software development that are obviously wrong to a few people.
And there are some things people have a hunch we are doing wrong but nobody can crystallize it.
Removing redundancy in code is great most of the time, but it's not a panacea. NASA had to contend with physical failures of memory, and catastrophic costs of failures in 'production'. They solved this problem by consensus pools of three, on physically separate hardware and in some cases using multiple manufacturers. Inability to reach consensus would invoke failsafes.
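The consensus-of-three scheme can be sketched in a few lines of Python (a toy model of triple modular redundancy, not NASA's actual implementation):

```python
def vote(a, b, c):
    """Triple modular redundancy: return the result at least two of the
    three independent units agree on; with no majority, fail safe."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no consensus: engage failsafe")

assert vote(42, 42, 42) == 42
assert vote(42, 42, 7) == 42    # a single faulty unit is out-voted
```

The design choice is that correctness comes from agreement between physically separate computations, so no single flipped bit (or single bad part from one manufacturer) can silently change the outcome.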
I have a vague suspicion about how we condense the very most critical bits of our software down to the fewest bits of data and instructions. This may ultimately be a policy we reject. One bad bit and you can end up taking the opposite action of the one you should have performed.
"One reason why the redundancy management software was able to be kept to a minimum is that NASA decided to move voting to the actuators, rather than to do it before commands are sent on buses. Each actuator is quadruple redundant. If a single computer fails, it continues to send commands to an actuator until the crew takes it out of the redundant set. Since the Shuttle's other three computers are sending apparently correct commands to their actuators, the failed computer's commands are physically out-voted79. Theoretically, the only serious possibility is that three computers would fail simultaneously, thus negating the effects of the voting. If that occurs, and if the proper warnings are given, the crew can then engage the backup system simply by pressing a button located on each of the forward rotational hand controllers."
And bit flips do happen