

DRAM errors vastly more frequent than previously thought - wglb
http://blogs.zdnet.com/storage/?p=638

======
proee
I've worked for a major DRAM company and it's a wonder that the parts are
reliable at all. The storage cells are being pushed to their limit as they are
often the bottleneck in scaling the overall size of the bit structure. Not to
mention the engineers are pressed to the max to get them out the door, often
before they are fully qualified. Throw in the millions of process steps that
are required to make the part, and the fact that your computer boots at all is
remarkable.

Then throw in a high speed bus, processor and disk array....

~~~
uiohnuipb
And try to convince a programmer that it's possible that their program's
memory can be wrong.

They understand in theory but refuse to code for the possibility. Especially
when you get into HPC, where clusters of 50-60 machines with 4GB each mean
the chance of having no corrupt memory anywhere is almost 0.

~~~
Psyonic
I'm honestly curious... what kind of defensive programming techniques could
you use to try and deal with this?

~~~
dmm
Do everything twice and make sure the results match.
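
The run-twice idea above can be sketched in a few lines; the helper name and
retry policy here are illustrative, not something from the thread:

```python
def run_redundant(fn, *args, retries=3):
    """Run fn twice on the same inputs and accept the result only when
    both runs agree; retry on disagreement. A sketch of the
    'do everything twice' approach -- the retry count is a made-up detail."""
    for _ in range(retries):
        a = fn(*args)
        b = fn(*args)
        if a == b:
            return a
    raise RuntimeError("results never agreed; suspect memory corruption")

# e.g. a big reduction, run redundantly
total = run_redundant(sum, range(1_000_000))
```

This only catches corruption that hits one run's data and not both runs
identically.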

~~~
pyre
What happens when the code doing the comparison becomes corrupted? Do the
comparison twice? What happens when the code controlling the evaluation of
both comparisons becomes corrupted?

Your data and your instruction set are in the same memory. Even if they are
separated into different areas of memory to prevent buffer overflow exploits,
it's all still in memory. Once the memory starts going, you're kind of
screwed. It's the same as how -- with respect to computer security -- once
someone has physical access to the machine, you're screwed.

With respect to memory errors in distributed environments, usually such
environments are distributed to increase the processing power available for
number crunching. If you run every calculation twice and have code comparing
the results for acceptance, you more than double your processing requirements.

But at the end of the day, it's all a matter of what level of risk is
acceptable (or tolerable). There is no magic bullet to fix these issues.

~~~
wmf
You're ignoring that the voting code would be a very small fraction of your
RAM and thus far less likely to be corrupted. But it's academic, since no one
runs everything twice as a way to avoid the cost of ECC.
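
The voting trick is usually done with three runs rather than two, so a single
bad result gets outvoted and the voter itself stays tiny; a minimal sketch
(the function is assumed for illustration, not from the thread):

```python
def vote(a, b, c):
    """Majority vote over three independent runs (triple modular
    redundancy): any single corrupted result is outvoted, so there is
    no single comparison whose corruption silently wins."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no two results agree")
```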

~~~
pyre
I realize that. My point is that there is no 100% solution.

------
basugasubaku
I remember djb exhorting people to buy ECC memory and supported motherboards
back in 2001 (<http://cr.yp.to/hardware/ecc.html>) for their standard
workstations and me thinking he must be somewhat crazy since no one else
seemed to be making a fuss about it.

Perhaps he was right after all.

~~~
jrockway
He is right.

Have you ever had fsck detect errors on a filesystem that you haven't abused?
Guess what, that's memory corruption -- saved to disk forever.

I remember having a machine with especially flaky memory (memtest86 failed in
about 30 seconds)... I detected it because dpkg's database was corrupted
enough for it to cause errors in the application. I never even _tried_ to save
that filesystem...

------
dkarl
As a user of hardware, not a hardware engineer, I wonder if I've seen these
errors. Crashes -- I don't see any of those except the ones related to web
browsers. I do a lot of long compiles and don't see any crashes from gcc.
Corrupted data -- well, I've had several large downloads this year that didn't
match the advertised md5 sums. I redownloaded and got matching checksums. Does
any of this have to do with DRAM errors? There are lots of other potential
sources of error in my computers. I could name half a dozen off the top of my
head, but I'm sure I would only prove that I'm ignorant of another half dozen
that are an order of magnitude more important than the ones I named. I await
an answer to one question: who should care about these DRAM errors? Does that
group include me?

~~~
alexkon
Did you check whether the downloads had been truncated? A download finishing
prematurely is a frequent cause of a checksum mismatch.
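
Checking the file's length before its checksum separates the two failure
modes; a sketch, where the expected size and md5 are whatever the site
advertises (the parameter names are made up):

```python
import hashlib
import os

def verify_download(path, expected_size, expected_md5):
    """Distinguish a truncated download from a genuinely corrupt one:
    a size mismatch means the transfer stopped early, while a full-size
    file with a bad digest points at corruption somewhere."""
    if os.path.getsize(path) != expected_size:
        return "truncated"
    h = hashlib.md5()
    with open(path, "rb") as f:
        # hash in 1 MiB chunks so large downloads don't need to fit in RAM
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return "ok" if h.hexdigest() == expected_md5 else "corrupt"
```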

------
pmorici
What's the difference between a "hard" and "soft" error that the article
mentions?

~~~
Kadin
I've heard the term used in two different ways, and I'm not sure which way is
the 'correct' one:

In one usage, "soft" errors are ones that are 'caught' and transparently fixed
by ECC, and thus have no effect (on a system that has ECC memory). "Hard"
errors, by contrast, are ones that affect multiple bits and aren't corrected
by ECC.

In the other usage, which I think is the more technically correct one, a
"soft" error is a transient condition (bit flipped by cosmic ray, etc.) and
the memory cell continues to operate normally on the next cycle. A "hard"
error is where the cell is basically stuck in one state or another, and
indicates that it's probably time to replace the module. I think you detect a
"hard" error by looking for a series of "soft" errors, although maybe some
architectures/chipsets detect the difference and report them in different
ways...?

If anyone can substantiate either set of definitions, I'd be interested as
well.
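
The "series of soft errors" heuristic from the second definition amounts to
bookkeeping over reported error addresses; a sketch, where the threshold is
an illustrative guess rather than any standard:

```python
from collections import Counter

def classify_errors(error_addresses, hard_threshold=3):
    """Bucket ECC error reports by physical address: repeated errors at
    the same address suggest a stuck ('hard') cell, while one-off
    addresses look like transient ('soft') flips."""
    counts = Counter(error_addresses)
    hard = {a for a, n in counts.items() if n >= hard_threshold}
    soft = {a for a, n in counts.items() if n < hard_threshold}
    return hard, soft
```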

