
You need to use error-correcting memory - skorks
http://lambda-diode.com/opinion/ecc-memory
======
barrkel
> _So the probability of having at least one bit error in 4 gigabytes of
> memory at sea level on planet Earth in 72 hours is over 95%._

This is misleading. A flaky machine will indeed see bit errors, and it will
probably be visible as random crashes, but that's not even necessarily the
case for the average machine. If you look at the quantitative study from
Google, which the author links to:

<http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf>

... you can see that in terms of errors per DIMM:

 _"Across the entire fleet, 8.2% of all DIMMs are affected by correctable
errors and an average DIMM experiences nearly 4000 correctable errors per
year. These numbers vary greatly by platform. Around 20% of DIMMs in Platform
A and B are affected by correctable errors per year, compared to less than 4%
of DIMMs in Platform C and D. Only 0.05-0.08% of the DIMMs in Platform A and
Platform E see an uncorrectable error per year compared to nearly 0.3% of the
DIMMs in Platform C and Platform D. The mean number of correctable errors per
DIMM are more comparable, ranging from 3351-4530 correctable errors per
year."_

So the mean rate of correctable errors is high, but the variance is also very
high: depending on the manufacturer, 80 to 96% of DIMMs see out a whole year
_without a single correctable error_. If the original statistics of 95% chance
of error in 3 days were correct, a single-DIMM machine ought to have
approximately a 0% chance (astronomically close to 0%) of living out a whole
year without errors - but we can see here that between 80 and 96% of
single-DIMM machines seem to do just that.
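
As a rough check on the "astronomically close to 0%" claim, here is a minimal
Python sketch. It assumes that each 72-hour window is independent and that the
article's 95% figure applies to the whole single-DIMM machine; both are
simplifying assumptions made for illustration, not something the article or
the Google paper states.

```python
# Sketch: if P(at least one error in 72 h) = 0.95, how likely is an
# error-free year? Assumes independent 72-hour windows (illustrative only).
P_ERROR_72H = 0.95          # figure quoted from the article
HOURS_PER_YEAR = 365 * 24   # 8760 hours

windows_per_year = HOURS_PER_YEAR / 72
p_clean_year = (1 - P_ERROR_72H) ** windows_per_year

print(f"72-hour windows per year: {windows_per_year:.1f}")
print(f"P(no error in a year):    {p_clean_year:.3e}")
```

Under those assumptions the result comes out far below 1e-150, which is hard
to square with 80 to 96% of DIMMs reporting no correctable errors in a year.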

The moral here is to test your memory for a while - preferably a few days -
before trusting the DIMMs. But once you know you have good DIMMs, it doesn't
look like you need to be quite so paranoid about bit errors.

~~~
tbrownaw
> But once you know you have good DIMMs, it doesn't look like you need to be
> quite so paranoid about bit errors.

Assuming that only the one-error-per-year cases were due to random bit flips,
and all the multiple-errors-per-year cases were due to bad DIMMs, I came up
with about a 1/5 chance of getting a _single_ random bit-flip over a 6 year
lifespan. But there also seems to be about a 1/3 chance of having a DIMM
randomly go bad after a couple years, which of course without ECC would
manifest as random crashes and lost (or _maybe_ corrupted) work.

~~~
Estragon
Seems like running memtest every six months or so would be a good policy.

It would be nice if there were a way to test the memory while a machine is
running.

~~~
psranga
The errors we're talking about here are _transient_. The memory location
itself is still usable; the contents get changed _when_ a cosmic ray hits.
After the hit, the corrupted value is held without a problem.

Memtest checks if the memory location has a gross fault which prevents it from
storing values correctly.

Doesn't seem like memtest will help.

~~~
Estragon
Good point. Thanks.

------
InclinedPlane
This article contains a fundamental flaw. It estimates the upset rate in
memory due to cosmic ray flux as upsets/bits/hour, but this is an incorrect
unit. Upsets depend on the total _physical_ size of the memory (and thus the
total neutron flux) and the sensitivity of each memory cell (bit) to cosmic
rays. Sensitivity may increase as you decrease the size of memory cells, but
not in lock-step with the change in size. A room full of 4 Mbit memory chips
will almost certainly have a higher rate of upsets per bit than will a single
2GB DIMM. The figures quoted in the article are from studies of computer
systems in the 1980s, so upsets/bits rates are much higher than would be
expected with modern RAM (which has 3 orders of magnitude more bits in the
same volume).

This error is probably why the article's theorized SEU event rate for modern
systems is about 3 orders of magnitude higher than experimental evidence
suggests (such as from this Google study):
<http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf>

~~~
jessriedel
I don't understand your first paragraph. Shouldn't the cosmic ray flux through
each bit be the same for each bit, and unchanged as you increase the total
amount of memory? If I'm a given bit, should having a second stick of RAM 10
inches away affect whether I flip during a given time period?

~~~
InclinedPlane
_Shouldn't the cosmic ray flux through each bit be the same for each bit, and
unchanged as you increase the total amount of memory?_

For identically manufactured RAM, generally yes. The total number of upsets
you'll see from a collection of 10 sticks of RAM will be roughly 10x higher
than from 1 stick of RAM. However, there are huge variations in RAM,
especially when you're comparing modern RAM to RAM manufactured in, say, 1988.
The main figure used in the article (1.3e-12 upsets/bit/hour) comes from a
study of a Cray Y-MP 8 system that had a main memory system containing
approximately 32,000 SRAM chips. This amount of memory is measured in cubic
meters, yet today the same number of bits of RAM fits on half or a quarter of
a single DIMM.

Suffice it to say, the cosmic ray flux through the Cray Y-MP 8's main memory
system and through half of a 2GB DIMM is significantly different, by orders of
magnitude. At the same time, the memory cell in the Y-MP 8 and the memory cell
in a 2GB DDR2 DIMM will have a different rate of sensitivity to cosmic ray
flux, translating to a different rate of upsets for the same rate of neutron
flux per memory cell. However, these two factors don't balance each other out:
modern memory cells aren't thousands of times more sensitive to cosmic rays
even though they take up thousands of times less space. The result is that a
figure of upsets/bits/year can only be taken to be constant so long as the
memory technology remains constant. That is most decidedly not the case here.
If one were using 4GB of Cray Y-MP RAM (which would likely fill an entire
server rack, and more), perhaps one would see the SEU rates the author
calculates. However, most folks these days are using 4GB of RAM in 2 tiny
DIMMs whose actual memory chips have a combined cross-sectional area of maybe
16 cm^2 at most. This has non-trivial effects on the SEU rate.

~~~
jessriedel
Oh, OK. I guess then I wouldn't say that upsets/bit/hour is an incorrect unit.
(It's clearly what you want to know to calculate the chance of error for a
given piece of RAM.) It's just that this parameter varies across time and
manufacturers. Using the value from a particular model of RAM manufactured in
1988 is sure to lead to wrong conclusions.

Thanks.

------
djcapelis
I love ECC as much as the next guy, but the reality is that the entry
overstates the importance of bit-flips by assuming the bits always flip in
something that matters.

The simple fact is that most bit flips occur in portions of memory no one
cares about. If the error manifests at all, a lot of the time it'll just show
up as one pixel in some image somewhere changing color by one bit.

With the author's estimate of 1,000 bit flips over the lifetime of a computer,
maybe 10 of them cause crashes. Most of those crashes are likely to be a web
browser on most people's desktops anyways, so if you just imagine you have the
previous version of the flash player, you can simulate an increased SEU rate
pretty nicely.

ECC is standard on servers because we assume the data they carry matters the
large majority of the time. We assume that servers typically have a larger
portion of their memory devoted to "important" things. (I.e., not images, video
or stuff a javascript interpreter forgot to free().) On desktops, it is still
probably reasonable to purchase non-ECC hardware for the time being.

I agree with the author though, that this is only getting worse. The trends
are all in directions where this is going to start affecting consumer-level
stuff at some point, but I'm not sure we're there yet.

As always, it's a matter of your workload combined with good risk analysis.

------
ableal
Last year I shopped around for a quad-core 8GB box _with_ ECC RAM. The RAM
itself is not much more expensive, the problem is CPU/chipset support. I went
with an AMD Phenom - I think that with Intel CPUs, you only get ECC in the
Xeon server line.

(Note that besides bunging in the ECC RAM DIMMs, you may have to turn on ECC
support in the BIOS.)

Just using increasing amounts of RAM, storage and bandwidth, without adding
data-integrity checks, is really asking for trouble ...

~~~
lutorm
The increasing-storage problem applies to hard drives, too, hence the
increasing need for something like RAID6 over RAID5.

~~~
derobert
Neither RAID5 nor RAID6 gives you integrity checks. Each block of data is only
read from /one/ disk, unless that disk has failed (in which case parity & data
are read from the remaining disks to calculate that block).

If the disk recognizes the sector as bad (through its own, internal redundancy
checks), then (depending on RAID implementation) either that one block will be
read from parity or the entire disk will be dropped from the array.

But if the disk silently corrupts data, RAID5/6 will not protect you. In
fact, it makes the problem worse: silent corruption is more likely the more
disks you have.
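
To put a shape on "more likely the more disks you have": a minimal sketch,
assuming each disk independently has some small probability q of silently
corrupting data over a given period. The value of q below is purely
hypothetical, not a measured rate, and the function name is made up for
illustration.

```python
def p_any_silent_corruption(q: float, n_disks: int) -> float:
    """Probability that at least one of n_disks silently corrupts data,
    assuming independent disks with the same per-disk probability q."""
    return 1 - (1 - q) ** n_disks

Q_HYPOTHETICAL = 0.001  # made-up per-disk probability, for illustration only

for n in (1, 4, 8, 16):
    print(n, p_any_silent_corruption(Q_HYPOTHETICAL, n))
```

For small q the result grows almost linearly with the number of disks, and
parity alone does nothing to detect which disk returned the bad block.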

------
Confusion
_the probability of having at least one bit error in 4 gigabytes of memory at
sea level on planet Earth in 72 hours is over 95%._

This metric is only relevant if you read all 4 GB of your memory every second
and use the data for something that can't stand a flipped bit. Then you'll
have one problem for every 72 hours of constant use of all of your memory.

How much of your memory do you use on average? How many flipped bits will be
read, before being overwritten? How many bit flips cause a real problem? If
one of the gray background dots of HN turns blue, I don't really care. The
likelihood of an actual problem for an average user is vastly lower because of
these factors.

The average comment on the blog of this guy and on Reddit is just sad: it's
all fine and well that anecdotal evidence and the Google paper tell you he's
wrong, but his math makes sense. Doesn't anyone feel the need to get to the
root of the error in his assertion?

~~~
mfukar
The "metric" is always relevant. Your confusion comes from the fact that
you're thinking about whether those errors are reflected in a user's
experience. If you think about the concept of RAM coupled with the fact that
desktops are most likely running every little piece of software available (OS,
a browser with several windows/tabs open, IM programs, a game, etc.) you'll
see why this is a big deal. And let's not even mention servers and machines
where actually significant work is being carried out.

~~~
Confusion
I'm responsible for a handful of servers where 'actually significant work' is
being carried out. My reasoning applies to those machines just as much:

- How much memory is in use?

- How large is the chance that a flipped bit is read (as opposed to being
overwritten before being read)?

- What are the consequences of the flipped bit?

Apart from that: there are quite a few desktops in the world where 'actually
significant work' is being carried out.

~~~
regularfry
The author touches on the "how much memory is in use" question: all major OSes
use unallocated RAM as a file cache (or equivalent), so no matter _where_ the
error happens, it is almost certain to hit something. Whether that "something"
is actually relevant is another matter.

------
patrickgzill
Sidenote: Solaris 10 and OpenSolaris can not only monitor ECC memory errors
but also, once the errors go over a certain threshold, automatically mark the
affected pages "bad" and force the operating system to stop using that range
of memory.

I have an 8GB dual-Opteron system with a DIMM that is in production and should
not be taken down - about 4MB on one DIMM has been marked bad and removed from
use by the OS.

------
prewett
In four years of reading Cassini updates, they have never once mentioned
worrying about software errors due to cosmic rays. And this is without any
atmospheric protection at all; we have 100 miles of atmosphere.

They have mentioned that their solid state relays trip 2 or 3 times a year due
to cosmic rays. I'm not sure how comparable those are to DIMMs, but it does
suggest that the author's claim of one error per day is a bit off...

~~~
AngryParsley
Pretty much anything launched into space uses radiation-hardened electronics.
I don't know what Cassini has on board, but this is popular:
<http://en.wikipedia.org/wiki/IBM_RAD6000>

------
rlpb
> First, let's assume you have a system with no error-correction nor parity.
> The probability that you'll experience a bit error during the time T will be
> 1-(1-p)^m .

OK, let's assume that.

> For T=1 hour, p = 1.3e-12 and m = 4 * 2^30 * 8, that gives 0.044 or 4.4%.

WTF? Where did those figures come from?
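
Reading the quoted formula, p looks like the per-bit upset rate per hour
(1.3e-12 upsets/bit/hour, which elsewhere in this thread is traced to a Cray
Y-MP 8 study) and m = 4 * 2^30 * 8 is the number of bits in 4 GiB. A minimal
sketch that reproduces the arithmetic, assuming independent bits and
independent hours; the function name is hypothetical:

```python
import math

def p_at_least_one_error(p_per_bit_hour: float, bits: int, hours: float) -> float:
    """1 - (1 - p)^(bits * hours), computed stably for very small p."""
    return -math.expm1(bits * hours * math.log1p(-p_per_bit_hour))

P = 1.3e-12         # upsets per bit per hour, as quoted from the article
M = 4 * 2**30 * 8   # number of bits in 4 GiB

print(f"1 hour:   {p_at_least_one_error(P, M, 1):.3f}")
print(f"72 hours: {p_at_least_one_error(P, M, 72):.3f}")
```

This gives about 4.4% for one hour and just under 96% for 72 hours, which
matches the article's headline number. Whether 1.3e-12 is the right p for
modern DIMMs is a separate question.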

------
sailormoon
Perhaps a good thing for "mission critical" software to do might be to
implement software "hash checks" where the software keeps a running series of
hashes of what it thinks it has written to disk, and then compares that
against the actual stored bytes at the end of the operation. Or, if that is
impractical, have a "safe mode" intended for kernel builds, software releases,
etc, where it will compile and build twice, then compare the two for
differences. That would solve a lot of the problems the author is postulating,
however unlikely they might be.
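
A minimal sketch of the first idea, in Python: hash the bytes as they are
handed to the file, then re-read the file and compare. The function name and
block size are assumptions made for illustration, not part of any existing
tool.

```python
import hashlib
import os

def write_and_verify(path: str, chunks) -> bool:
    """Write chunks to path while hashing them, then re-read the file
    and report whether the stored bytes hash to the same value."""
    expected = hashlib.sha256()
    with open(path, "wb") as f:
        for chunk in chunks:
            expected.update(chunk)   # hash what we believe we are writing
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())         # ask the OS to push the data to disk

    actual = hashlib.sha256()
    with open(path, "rb") as f:
        # Read back in 64 KiB blocks until EOF.
        for block in iter(lambda: f.read(65536), b""):
            actual.update(block)

    return expected.hexdigest() == actual.hexdigest()
```

Two caveats: the read-back may be served from the OS page cache rather than
the disk, and a bit that flips in memory before it is hashed will pass the
check. So this narrows the window for undetected corruption but does not
close it.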

