
Not on the chain of hardware I run at home (the machines with ECC are ones I spec and configure very conservatively); on other, larger collections of machines, sure...

I've seen Google's study, and out of the few thousand or so machines I've collected statistics from, the few machines with soft errors were fixable and stopped reporting soft errors after having something swapped.

The Google study itself goes on and on about the variability of errors, with such wonderful sections as "These numbers vary greatly by platform. Around 20% of DIMMs in Platform A and B are affected by correctable errors per year, compared to less than 4% of DIMMs in Platform C and D."

The paper really leaves a lot of holes. I don't remember (nor do I see after skimming it) any note of how aggressively they are running the RAM. Did they, say, try relaxing the RAM timings or bumping the voltage on the platforms they were having issues with? Did they compare how mature the technology was when they commissioned it? Did they try to diagnose the machines reporting high error rates by seeing whether they could bring a high-error-rate machine down to something lower? They do spend a lot of time talking about temperature, though. The only valid conclusion I think can be drawn from the paper is "ECC is important; use it, because you will have RAM failures, and it's better to know about them than not".

To me the paper speaks to Google's diagnostic/repair system more than anything. I took a proactive approach and replaced DIMMs/motherboards/power supplies/etc. that reported correctable errors. When we were self-supporting we would swap the questionable parts into other machines to see if the failures followed them, in an attempt to prove that a suspect part really was marginal, then return/exchange it if it failed in more than one machine.
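For context on what "reported correctable errors" looks like in practice: on Linux the EDAC driver exposes per-memory-controller error counters in sysfs. Below is a minimal sketch (my own illustration, not anything from the paper or from Google's tooling) of polling those counters to flag hardware for a swap; the exact paths and per-DIMM granularity depend on the kernel version and the platform's EDAC driver.

```python
#!/usr/bin/env python3
# Minimal sketch: poll Linux EDAC counters for correctable/uncorrectable
# memory errors so a machine that starts throwing soft errors can be
# flagged for a part swap and retest. Assumes the EDAC driver is loaded;
# paths vary by kernel and platform.
import glob
import os

def read_count(path):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0

def edac_counts():
    """Return {memory_controller_name: (correctable, uncorrectable)}."""
    counts = {}
    for mc in glob.glob("/sys/devices/system/edac/mc/mc[0-9]*"):
        ce = read_count(os.path.join(mc, "ce_count"))
        ue = read_count(os.path.join(mc, "ue_count"))
        counts[os.path.basename(mc)] = (ce, ue)
    return counts

if __name__ == "__main__":
    for mc, (ce, ue) in sorted(edac_counts().items()):
        status = "OK"
        if ue:
            status = "uncorrectable errors - replace now"
        elif ce:
            status = "correctable errors - schedule a swap/retest"
        print(f"{mc}: ce={ce} ue={ue} [{status}]")
```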

I've seen a lot of different failures over time, and when I was partially in charge of designing/picking platforms I even managed to find actual design bugs a couple of times that caused low-rate errors (not in the RAM subsystem, thankfully). I tended to use an "any kind of failure when run normally is instant disqualification" metric when initially evaluating new platforms before buying them to put in production. I would never have qualified a platform with a 20% DIMM failure rate (well, at least not purposefully; we got some stinkers, but we tried to correct our mistakes).

Given what I've heard of Google, I'm not sure I would really extend these reliability metrics to your own hardware unless you're buying the latest bleeding-edge parts and running them well into their design margins. These days it's pretty common to design systems that have error correction and push the physical topology to the point where a steady error rate is expected (think SSD flash chips). So for a company like Google, pushing the RAM timings/etc. right out to the margin where they see a low but nonzero error rate would seem to be the right thing to do. It's different if you're a bank/etc. running financial data; in that case you buy for reliability first.


