
Attack of the cosmic rays: Undetected memory errors can happen to you - nelhage
http://blog.ksplice.com/2010/06/attack-of-the-cosmic-rays/
======
mattmanser
_Since that incident, I’ve had several other, similar problems. Something
would start failing mysteriously, but flushing my cache restored it to
normal._

This seems like a bit of a red flag that in reality something else is actually
going wrong with his computer.

~~~
tomjen3
Yeah, while cosmic-rays are cool and all it sounds like his RAM is failing.

~~~
amanfredi
It doesn't really matter if the errors were caused by cosmic rays or the RAM
failing -- it's still dangerous and relatively undetectable.

~~~
thingie
It does. How dangerous are they actually? If it was that common for them to
cause serious disruptions in the operations of our desktops and laptops, where
are these faults? I mean, in the last 10 years, I simply can't remember any
experience of a sudden and irreproducible computer failure that couldn't be
quite convincingly attributed to something else.

I'm sure that this can happen to me, that in any particular moment, my memory
can get corrupted by these rays, or something else, and then, the computer can
misbehave or crash. And if I really wanted to be sure that it will not, I'd
need to have some kind of protection against it. But compared to many other
possible faults, is it really anything more than a very very rare and minor
cause of computer failure that I can simply disregard on my laptop,
which is nowhere near a "critical and vital" system?

------
yellowbkpk
To give you an idea of density/frequency of this occurring: my wife's CCD for
her PhD experiments routinely (roughly 1 in 5) picks up huge spikes from cosmic
rays during her 30-second exposures. The CCD is less than an inch square and
she's 2 floors down from ground level.
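
A back-of-the-envelope sketch of what those numbers imply, in Python. It assumes the spikes arrive as independent (Poisson) events and takes a full square inch as an upper bound on the sensor area, so the per-area figure is only a rough lower bound. The function name and constants are just for illustration.

```python
import math

# Figures from the comment above; the Poisson model is an assumption.
HIT_FRACTION = 0.2           # roughly 1 in 5 exposures shows a spike
EXPOSURE_SECONDS = 30.0      # length of one exposure
SENSOR_AREA_CM2 = 2.54 ** 2  # upper bound: "less than an inch square"


def hits_per_second(hit_fraction, exposure_seconds):
    """Poisson rate such that P(at least one hit per exposure) == hit_fraction."""
    return -math.log(1.0 - hit_fraction) / exposure_seconds


rate = hits_per_second(HIT_FRACTION, EXPOSURE_SECONDS)
# The real sensor is smaller than the assumed area, so the true flux is higher.
flux_lower_bound = rate / SENSOR_AREA_CM2  # hits per cm^2 per second

print(f"about {rate * 3600:.0f} hits per hour on the sensor")
print(f"at least {flux_lower_bound * 60:.3f} hits per cm^2 per minute")
```

This counts hits that are visible as spikes on a CCD, which is not the same thing as the rate of bit flips in a RAM chip of similar area.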

~~~
Tichy
Is she certain that it is not another PhD student one floor above her
experimenting with X-Rays?

~~~
yellowbkpk
Yes, she uses the specific room in the building because it's farthest away
from noisy experiments.

------
tetha
Reading this, I remember how hard NASA works to get their satellites and
probes secure against cosmic rays, because out there in space, cosmic rays
cause your memory to become pretty unpredictable. Error correcting codes and
redundancy suddenly become really important, even though you are crammed into
this little embedded system which has less processing power than some input
devices these days.
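
For a feel of what "error correcting codes" means at the smallest scale, here is a minimal Hamming(7,4) sketch in Python: 4 data bits are stored as 7 bits, and any single flipped bit can be located and repaired. This is only an illustration with made-up function names. ECC memory typically uses wider SECDED codes implemented in hardware, and I am not claiming this is what any particular probe runs.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (int 0..15) as a 7-bit codeword (list, positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Layout by position: 1=p1, 2=p2, 3=d0, 4=p4, 5=d1, 6=d2, 7=d3
    p1 = d[0] ^ d[1] ^ d[3]  # parity over positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # parity over positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # parity over positions 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming74_decode(code):
    """Return (nibble, syndrome); syndrome is the 1-based flipped position, 0 if none."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4, 5, 6, 7
    syndrome = s1 | (s2 << 1) | (s4 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1  # repair the single flipped bit
    nibble = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return nibble, syndrome


# Simulate one "cosmic ray" bit flip and recover the original data.
word = hamming74_encode(0b1011)
word[4] ^= 1  # flip the bit at position 5
recovered, position = hamming74_decode(word)
print(recovered == 0b1011, position)
```

The cost is the redundancy: 3 extra bits for every 4 data bits here. Two flips in the same codeword would be miscorrected by this code, which is why wider codes add an extra parity bit to at least detect double errors.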

~~~
chroma
The main reason for redundancy is that NASA can't send someone out to fix
the probe/satellite, so the hardware has to be designed to work flawlessly for
decades. Astronauts use consumer laptops just fine on the shuttle and ISS, and
I doubt most of those even have ECC RAM.

Airplane avionics aren't radiation-hardened, but they work fine at altitude
where the incidence of cosmic rays is about 1/4th that in space. Planes aren't
dropping out of the sky from bit flips. People don't get tons of parity or ECC
errors on their laptops while flying. Based on all this, I think the threat of
cosmic rays is vastly overstated.

~~~
InclinedPlane
Worse yet, if your probe's OS crashes at the wrong time it could lead to loss
of the probe or missing out on a lot of data that would otherwise have been
gathered. If a probe's electronic brain crashes due to cosmic rays and goes
into a "safe" state it may fail to maintain attitude control, potentially
pointing its solar arrays away from the Sun and/or its high gain antenna away
from Earth. This can lead to "bad things" such as the probe draining its
batteries and dying before controllers have a chance to fix it (this has
happened quite frequently in the past, though not always due to cosmic rays).
Worse, the probe could reset during a critical course correction maneuver, end
up failing to go into orbit around a target planet, burning up in a planet's
atmosphere, or merely ending up in the wrong location on the wrong trajectory.

Avionics systems use ECC memory and other techniques to avoid being impacted
by cosmic rays causing single event upsets. As far as parity and ECC errors on
laptops, I don't believe ECC RAM is common on laptops.

That being said, this particular article uses a gross overestimate of SEUs for
memory, which is really only applicable if your 4GB of RAM fills an entire
room in multiple full sized racks (the studies he bases these figures on come
from the 80s and the author fails to adjust for physical size when
extrapolating to modern memory sizes).

------
rubyrescue
Inspiring for the seeming ease with which he moves between package managers
and debuggers...

------
thingie
I'm not saying that cosmic rays don't happen (they absolutely do; I mean
whether they can cause memory corruption that actually makes a difference in a
running system), but this is quite strange. No such faults were happening
before this single incident, and now there are many similar faults happening
regularly? Why should I suspect cosmic rays (was there any reason for such a
sudden change in their activity and its visibility?) and not a hardware
fault?

------
ajb
These kinds of memory errors are more often caused by alpha particles emitted
by radioactive elements in the chip package:
<http://en.wikipedia.org/wiki/Soft_error>

------
gacba
For those who want to know more about cosmic rays, Wikipedia is filled with
goodness on the subject. (<http://en.wikipedia.org/wiki/Cosmic_ray>) I was
looking for stats on average density per m2 to determine just how prevalent
this effect might be in ground-based electronics. It's been a major problem
with high-altitude and satellite equipment for a long, long time.

------
seanlinmt
From my experience, I think it is unlikely to be due to cosmic rays. The most
likely culprit could be the power supply or data buffers. Non-tantalum
capacitors tend to reach end of life faster if you're operating in
high-humidity conditions.

This reminded me of a number of random crashes that a client of my previous
company had. Stack dumps just showed random errors. We had about a year's worth
of crash logs from a couple thousand network switches (they were an ISP).
We initially suggested that this might be a problem with cosmic rays. We even
checked the frequency of the random crashes with sunspot cycles. No
relationship found. Turns out it was due to another component failing due to a
design error.

------
fictorial
Great work digging into this issue. A memory test is probably in order.

I learned about ECC RAM when I was trying to figure out why server lease deals
were so inexpensive relative to others. For instance, the last I checked,
hetzner.de's hardware does not support ECC RAM. I am of course not calling out
hetzner, and there are other factors in such deals.

------
Luyt
Idea: Use pieces of lead sheeting to shield the RAM chips from cosmic
radiation.

~~~
harshpotatoes
While lead is a good choice for photon radiation, it is a very bad choice to
shield against charged particles. Because lead is very dense (which is why it
is good for X-rays or gamma rays), it slows down the protons/electrons very
quickly, producing brehmsstraulangw. Low-density materials such as acrylic,
wood, or concrete might be better.

Although, because the protons have such high energy, I can't imagine how much
material it would take. This stuff is very high energy, and very nasty to
stop. To get an idea of what kind of shielding you'd need, I'd look at the
shielding used at the LHC, Tevatron, and SLAC. They all produce particles of
comparable energy.

I think ECC RAM might be cheaper up to a point; maybe shielding would be
better for data centers.

~~~
eru
One small correction: Bremsstrahlung.

~~~
harshpotatoes
D'oh

------
gcb
A common saying in the medical industry:

sometimes a zebra is just a horse.

------
kunley
Segfaults from Outer Space !! Duck and cover

