This is misleading. A flaky machine will indeed see bit errors, probably visible as random crashes, but that's not necessarily the case for the average machine. If you look at the quantitative study from Google, which the author links to:
... you can see that in terms of errors per DIMM:
"Across the entire fleet, 8.2% of all DIMMs are affected by correctable errors and an average DIMM experiences nearly 4000 correctable errors per year. These numbers vary greatly by platform. Around 20% of DIMMs in Platform A and B are affected by correctable errors per year, compared to less than 4% of DIMMs in Platform C and D. Only 0.05-0.08% of the DIMMs in Platform A and Platform E see an uncorrectable error per year compared to nearly 0.3% of the DIMMs in Platform C and Platform D. The mean number of correctable errors per DIMM are more comparable, ranging from 3351-4530 correctable errors per year."
So the mean rate of correctable errors is high, but the variance is also very high: depending on the manufacturer, 80 to 96% of DIMMs see out a whole year without a single correctable error. If the original statistic of a 95% chance of error within 3 days were correct, a single-DIMM machine ought to have an approximately 0% chance (astronomically close to 0%) of living out a whole year without errors - yet here we see between 80 and 96% of single-DIMM machines doing just that.
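To see just how astronomically close to 0% that would be, a quick sanity check (assuming independent 3-day windows):

    # If P(at least one error in any 3-day window) really were 0.95, a
    # machine's chance of an error-free year would be:
    p_clean_3_days = 1 - 0.95
    print(p_clean_3_days ** (365 / 3))   # ~5e-159, i.e. effectively zero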
The moral here is to test your memory for a while - preferably a few days - before trusting the DIMMs. But once you know you have good DIMMs, it doesn't look like you need to be quite so paranoid about bit errors.
During the talk ( http://www.stanford.edu/class/ee380/fall-schedule-20092010.h... ), IIRC, she said that good DIMMs went bad over a fairly short period of time, on the order of 2-3 years.
Assuming that only the one-error-per-year cases were due to random bit flips, and all the multiple-errors-per-year cases were due to bad DIMMs, I came up with about a 1/5 chance of getting a single random bit-flip over a 6 year lifespan. But there also seems to be about a 1/3 chance of having a DIMM randomly go bad after a couple years, which of course without ECC would manifest as random crashes and lost (or maybe corrupted) work.
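A Poisson sanity check on that 1/5-over-6-years figure (my own framing, not the study's - treating random flips as a constant-rate process):

    import math

    # If P(>=1 random flip in 6 years) = 0.2, the implied per-DIMM rate is:
    rate_per_year = -math.log(1 - 0.2) / 6
    print(rate_per_year)                      # ~0.037 flips/year
    print(1 - math.exp(-rate_per_year * 6))   # recovers the 0.2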
It would be nice if there were a way to test the memory while a machine is running.
Memtest checks whether a memory location has a gross fault that prevents it from storing values correctly; it doesn't catch a one-off, transient upset.
Doesn't seem like memtest will help.
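For what it's worth, userspace tools like memtester do run while the machine is up (they mlock() a chunk of RAM and cycle test patterns through it), but as noted above they only catch reproducible faults, not one-off upsets. A minimal sketch of the idea:

    import time

    SIZE = 16 * 1024 * 1024                  # test 16 MB; real tools do far more
    for pattern in (0x00, 0xFF, 0x55, 0xAA):
        buf = bytearray([pattern]) * SIZE    # write the pattern everywhere
        time.sleep(1)                        # give a marginal cell time to decay
        if buf != bytearray([pattern]) * SIZE:
            print(f"reproducible fault found with pattern {pattern:#04x}")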
This error is probably why the article's theorized SEU event rate for modern systems is about 3 orders of magnitude higher than experimental evidence suggests (such as from this Google study): http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
For identically manufactured RAM, generally yes. The total number of upsets you'll see from a collection of 10 sticks of RAM will be roughly 10x higher than from 1 stick of RAM. However, there are huge variations in RAM, especially when you're comparing modern RAM to RAM manufactured in, say, 1988. The main figure used in the article (1.3e-12 upsets/bit/hour) comes from a study of a Cray Y-MP 8 system whose main memory contained approximately 32,000 SRAM chips. That amount of memory is measured in cubic meters, yet today the same number of bits of RAM fits on half or a quarter of a single DIMM.
Suffice it to say, the cosmic ray flux through the Cray Y-MP 8's main memory system and through half of a 2GB DIMM differs by orders of magnitude. At the same time, a memory cell in the Y-MP 8 and a memory cell in a 2GB DDR2 DIMM have different sensitivities to cosmic rays, translating to a different rate of upsets for the same neutron flux per memory cell. However, these two factors don't balance each other out; modern memory cells aren't thousands of times more sensitive to cosmic rays even though they take up thousands of times less space. The result is that a figure of upsets/bit/year can only be taken to be constant so long as the memory technology remains constant, and that is most decidedly not the case here. If one were using 4GB of Cray Y-MP RAM (which would likely fill an entire server rack, and more), perhaps you'd see the SEU rates the author calculates. However, most folks these days are using 4GB of RAM in 2 tiny DIMMs whose actual memory chips have a combined cross-sectional area of at most maybe 16 cm^2. This has non-trivial effects on the SEU rate.
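To make the scale of the extrapolation concrete, here's the arithmetic behind applying that Y-MP per-bit figure to a modern machine (my own back-of-envelope, using the 4 GB discussed above):

    p_per_bit_hour = 1.3e-12          # the article's figure, from the SRAM study
    bits = 4 * 2**30 * 8              # 4 GB of RAM, in bits
    per_hour = p_per_bit_hour * bits  # ~0.045 upsets/hour
    print(per_hour * 24)              # ~1.07 upsets/day
    print(per_hour * 24 * 365)        # ~390 upsets/year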
The simple fact is that most bit flips occur in portions of memory no one cares about. Even when an error does manifest, a lot of the time it'll just show up as one pixel in some image somewhere changing color by one bit.
With the author's estimate of 1,000 bit flips over the lifetime of a computer, maybe 10 of them cause crashes. On most people's desktops, most of those crashes will be in a web browser anyway, so if you just imagine you're running the previous version of the Flash player, you can simulate an increased SEU rate pretty nicely.
I agree with the author though, that this is only getting worse. The trends are all in directions where this is going to start affecting consumer-level stuff at some point, but I'm not sure we're there yet.
As always, it's a matter of your workload combined with good risk analysis.
(Note that besides bunging in the ECC RAM DIMMs, you may have to turn on ECC support in the BIOS.)
Just using increasing amounts of RAM, storage and bandwidth, without adding data-integrity checks, is really asking for trouble ...
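For a concrete (if minimal) example of what an application-level integrity check can look like - a sketch of my own, not from the article; CRC32 detects corruption but cannot correct it:

    import zlib

    def store(payload: bytes) -> bytes:
        # Prepend a CRC32 of the payload so corruption is detectable later.
        return zlib.crc32(payload).to_bytes(4, "big") + payload

    def load(blob: bytes) -> bytes:
        crc, payload = int.from_bytes(blob[:4], "big"), blob[4:]
        if zlib.crc32(payload) != crc:
            raise IOError("checksum mismatch: data was silently corrupted")
        return payload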
If the disk recognizes the sector as bad (through its own, internal redundancy checks), then (depending on RAID implementation) either that one block will be read from parity or the entire disk will be dropped from the array.
But if the disk silently corrupts data, RAID5/6 will not protect you. In fact, it makes the problem worse: silent corruption is more likely the more disks you have.
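A toy illustration of why (my own sketch, for the single-parity RAID5 case): XOR parity can reveal that a stripe is inconsistent, but not which block went bad.

    from functools import reduce

    data = [0b10110010, 0b01101100, 0b11100001, 0b00011011]  # 4 data blocks
    parity = reduce(lambda a, b: a ^ b, data)                # RAID5 parity

    data[2] ^= 0b00000100        # a disk silently flips a bit; no I/O error

    # A scrub can see the stripe no longer matches its parity...
    print(reduce(lambda a, b: a ^ b, data) != parity)        # True

    # ...but rebuilding ANY single block from the others yields a stripe
    # that is self-consistent, so parity alone can't say which block lied.
    for i in range(len(data)):
        rebuilt = reduce(lambda a, b: a ^ b,
                         [d for j, d in enumerate(data) if j != i], parity)
        print(f"if block {i} were the bad one, it'd rebuild as {rebuilt:#010b}")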
This metric is only relevant if you read all 4 GB of your memory every second and use the data for something that can't stand a flipped bit. Then you'll have one problem for every 72 hours of constant use of all of your memory.
How much of your memory do you use on average? How many flipped bits will be read, before being overwritten? How many bit flips cause a real problem? If one of the gray background dots of HN turns blue, I don't really care. The likelihood of an actual problem for an average user is vastly lower because of these factors.
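Back-of-envelope, with deliberately made-up fractions just to show how these factors multiply (every number below is an illustrative assumption, not a measurement):

    raw_flips_per_hour = 0.044    # the article's figure for 4 GB
    frac_in_use = 0.5             # RAM actually holding live data
    frac_read_first = 0.3         # flipped bits read before being overwritten
    frac_consequential = 0.1      # vs. a gray HN dot turning blue

    eff = raw_flips_per_hour * frac_in_use * frac_read_first * frac_consequential
    print(f"one consequential flip every {1 / eff / 24:.0f} days")  # ~63 days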
The average comment on this guy's blog and on Reddit is just sad: it's all well and good that anecdotal evidence and the Google paper tell you he's wrong, but his math makes sense. Doesn't anyone feel the need to get to the root of the error in his assertion?
- How much memory is in use?
- How large is the chance that a flipped bit is read (as opposed to being overwritten before being read)?
- What are the consequences of the flipped bit?
Apart from that: there are quite a few desktops in the world where 'actually significant work' is being carried out.
On an unrelated note, I did not mean to demean desktops, but the reality is that there's orders of magnitude more devices that carry out tasks more critical than image processing or development. Embedded devices are one example.
I have an 8GB dual-Opteron system in production that should not be taken down - about 4MB on one DIMM has been marked bad and removed from use by the OS.
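Assuming a Linux box, the usual mechanism for this is the memmap= kernel parameter (or the older BadRAM patch); the address below is hypothetical:

    # Kernel command line: reserve 4M at a (hypothetical) physical address
    # so the allocator never hands that region out.
    memmap=4M$0x7dd00000
    # Note: under GRUB2 the $ needs escaping, e.g. memmap=4M\$0x7dd00000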
They have mentioned that their solid state relays trip 2 or 3 times a year due to cosmic rays. I'm not sure how comparable those are to DIMMs, but it does suggest that the author's claim of one error per day is a bit off...
OK, let's assume that.
> For T = 1 hour, p = 1.3e-12 and m = 4*2^30*8, that gives 0.044 or 4.4%.
WTF? Where did those figures come from?
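For what it's worth, with m read as 4 GB expressed in bits and p as the article's per-bit hourly rate, the 4.4% does check out numerically (assuming the formula in question is 1 - (1-p)^m):

    p = 1.3e-12           # upsets per bit per hour, per the article
    m = 4 * 2**30 * 8     # 4 GB of RAM, in bits
    print(1 - (1 - p) ** m)   # ~0.044, i.e. a 4.4% chance per hour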