

Ask HN: Soft failures on CPUs? - jacquesm

There have been many studies that look at error rates on hard disks and in DRAM. Are there similar studies about how often CPUs hiccup?

Given the amazing clock frequencies of modern CPUs and their incredible complexity, it stands to reason that CPUs will miss a bit every now and then. Even if their error rate were only 1 in 1 trillion cycles, that would mean a soft error roughly every 15 minutes at a clock rate of about 1 GHz. (So either errors are happening, or the rate is much, much lower than that.)

I think this matters because we are looking at data corruption issues on disk and in RAM, and we have solutions in place for those (checksumming, ECC RAM), but if something goes wrong in the CPU then all of that is moot.
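
A back-of-the-envelope check of the "every 15 minutes" figure, as a minimal sketch. It assumes the rate means one error per trillion clock cycles and that cycles fail independently; the function name and the clock speeds are illustrative assumptions, not something stated in the question. At 1 GHz this works out to roughly 17 minutes, and proportionally less at higher clock rates.

```python
def seconds_between_errors(clock_hz: float, errors_per_cycle: float) -> float:
    """Mean time between errors if every clock cycle fails independently
    with probability `errors_per_cycle` (an assumed, simplified model)."""
    return 1.0 / (clock_hz * errors_per_cycle)


RATE = 1e-12  # one error per trillion cycles, the figure from the question

# Clock speeds below are assumptions chosen for illustration.
for ghz in (1.0, 2.0, 3.0):
    minutes = seconds_between_errors(ghz * 1e9, RATE) / 60
    print(f"{ghz:.0f} GHz: about one error every {minutes:.1f} minutes")
```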
======
cperciva
_There have been many studies that look at error rates on hard disks and in
DRAM, are there similar studies about how often CPUs hiccup?_

I don't know of any published results, but anecdotal reports I've heard from
the supercomputing community are in the range of one error per 10-100 CPU
years on well-maintained (adequate cooling, clean power, and not overclocked)
COTS systems. This rate would be much higher were it not for the fact that
most die area is used by caches with internal ECC; I believe that some recent
CPUs apply ECC to some internal busses as well.

That said, internal CPU bit errors are a problem which must be considered in
high-reliability environments -- this is one of the major factors behind
interest in Byzantine fault-tolerant algorithms.
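
As a rough sanity check, the anecdotal "one error per 10-100 CPU years" can be converted into a per-cycle error rate. This is only a sketch: the 3 GHz clock is an assumption (the thread names no clock speed), and the helper name is hypothetical. Under that assumption the result is on the order of 1e-18 to 1e-19 errors per cycle, many orders of magnitude below 1 in 1 trillion.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def errors_per_cycle(cpu_years_per_error: float, clock_hz: float) -> float:
    """Convert 'one error per N CPU-years' into an average error rate per
    clock cycle, assuming the CPU runs continuously at `clock_hz`."""
    cycles_per_error = cpu_years_per_error * SECONDS_PER_YEAR * clock_hz
    return 1.0 / cycles_per_error


CLOCK_HZ = 3e9  # assumed clock speed, not taken from the thread

for years in (10, 100):
    rate = errors_per_cycle(years, CLOCK_HZ)
    print(f"1 error per {years} CPU-years ~ {rate:.1e} errors per cycle")
```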

------
graphene
One factor that may be relevant here is the difference between the DRAM used
for main memory and the SRAM on the CPU die. In DRAM, each bit is stored as a
charge on a capacitor, which is (apparently) susceptible to cosmic-ray bit
flips. Perhaps SRAM is less delicate in that regard?

~~~
soundsop
Until about 8 to 10 years ago, DRAM was more sensitive to soft errors than
SRAM. Today, soft errors in SRAM are more of a problem than in DRAM. The
capacitance per unit cell of DRAM has remained relatively constant, while the
capacitance per node of SRAM has shrunk dramatically with transistor scaling.
So even though SRAM actively holds data, in contrast to DRAM, which holds data
passively, SRAM's low capacitance makes it more susceptible to soft errors.

------
frig
No hard data on actual rates but you may find this interesting:

<http://lambda-the-ultimate.org/node/2108>

You might find upper bounds by talking to people who build space-hardened
systems.

