
How to Kill a Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder - tambourine_man
http://spectrum.ieee.org/computing/hardware/how-to-kill-a-supercomputer-dirty-power-cosmic-rays-and-bad-solder
======
klagermkii
Do the results discussed in this article scale down to small systems?

"Jaguar had 360 terabytes of main memory, all protected by ECC. I and others
at the lab set it up to log every time a bit was flipped incorrectly in main
memory. When I asked my computing colleagues elsewhere to guess how often
Jaguar saw such a bit spontaneously change state, the typical estimate was
about a hundred times a day. In fact, Jaguar was logging ECC errors at a rate
of 350 per minute."

If I take the 360 TB and scale it down to a laptop with 16 GB of RAM am I
seeing errors at a proportional rate?

350 errors / 368 640 GB * 16 GB * 60 minutes * 24 hours = 21.875 memory bits
changed per day on a 16 GB laptop
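Sanity-checking that arithmetic (the 350/min and 360 TB figures are from the article; the linear-in-capacity scaling is my assumption):

```python
# Scale Jaguar's observed ECC error rate down to a 16 GB laptop,
# assuming bit flips are proportional to memory capacity.
JAGUAR_GB = 360 * 1024        # 360 TB of ECC-protected main memory
ERRORS_PER_MIN = 350          # ECC errors Jaguar logged per minute
LAPTOP_GB = 16

errors_per_day = ERRORS_PER_MIN / JAGUAR_GB * LAPTOP_GB * 60 * 24
print(f"{errors_per_day:.3f} flips/day")  # → 21.875 flips/day
```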

That just seems way too high for average users not to have noticed the lack of
ECC, especially the Jeff Atwood-type non-ECC server crowd
([http://blog.codinghorror.com/to-ecc-or-not-to-ecc/](http://blog.codinghorror.com/to-ecc-or-not-to-ecc/)). What am I missing?

~~~
ryao
That applied to all system memory, not just what was in use. Memory errors in
regions that aren't in use are harmless, and there are plenty of places where
an error in memory that is in use would go unnoticed. Take the buffer storing
this webpage in your web browser, for example: a bit flip there would cause a
misrendering that might not even be noticeable. Additionally, when a program
crashes or the kernel panics, users are more likely to blame a software bug
than a bit flip. In the rare cases where something goes catastrophically
wrong, they tend to blame the software too.

~~~
klagermkii
That's the theory in the article, that this silent corruption is happening all
the time and it's just happening in unimportant places or ones that don't
trigger a noticeable crash.

But why am I not seeing this when explicitly looking for it? If I run
Memtest86 for 24 hours, I certainly don't expect to see an average of 22
errors and treat that as within spec; I expect to see no errors. Is the
problem the memory tests we're using? Do we need ones that write data once and
then wait a day before reading it back and validating it, rather than
hammering RAM with reads and writes?

~~~
lololomg
Memtest86 writes patterns to memory and reads them back almost immediately. To
do the test properly you need to write a known pattern to memory, wait 24
hours, and THEN read it all back and check for flipped bits.

If you do not have ECC in your computer, yes you should see some flips.
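Something like this retention-style test, sketched in Python (user-space only, so unlike a real memtest it can't pin memory or cover every physical page, and the OS paging the buffer out would defeat it):

```python
import time

def retention_test(size_bytes, wait_seconds, pattern=0xAA):
    """Write a known pattern, wait, then count flipped bits on readback."""
    buf = bytearray([pattern]) * size_bytes  # fill with an alternating-bit pattern
    time.sleep(wait_seconds)                 # the long idle is the whole point
    flipped = 0
    for byte in buf:
        if byte != pattern:
            flipped += bin(byte ^ pattern).count("1")  # popcount of the damage
    return flipped

# A real run would use as much RAM as you can keep resident, for 24 hours:
#   retention_test(8 << 30, 24 * 3600)
print(retention_test(1 << 20, 1), "flipped bits")  # tiny demo: 1 MiB, 1 second
```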

------
mfisher87
This was the worst article reading experience I think I've ever had on a
desktop. With "related stories" taking up more screen space than the article
itself, I'm seeing something like 5-8 words on each line on my 1920x1200
display. The article text is wrapping around an animated GIF (the kind that
sits still 99% of the time, then suddenly moves and goes back to rest) in the
first paragraph.

I hate to contribute nothing but this complaint, but I'm actually having
trouble reading this.

~~~
kijiki
Firefox's "reader view" does a decent job of eliminating the crap.

~~~
mfisher87
Coincidentally, I had heard about that just today as well. I understand the
site you're reading has to be "compatible" in some way. I guess that just
means that the content has to be "easy" to identify and extract
programmatically? Do you often encounter sites where Reader doesn't work?

------
Animats
It's quite possible to catch random errors with self-checking CPUs. Some IBM
mainframes have dual CPUs running in lockstep, comparing results. Intel Xeon
CPUs have error checking on the on-chip memories [1] (caches, TLBs, etc.) but
not the ALU.

Duplicated ALUs, instruction decoders, and retirement units checking each
other are rarer than they should be. It wouldn't add that much cost, since
most of the chip real estate today is memory cells for caches. If you fault
the CPU on a compare error, rather than trying to recover, it doesn't slow
down the CPU at all, or add much complexity. Backup and retry is tougher, but
may be unnecessary. The important thing is to detect failure, fail fast, and
move the work to another CPU.

[1]
[http://www.intel.com/content/dam/www/public/us/en/documents/...](http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-e7-family-ras-server-paper.pdf)

------
privong
> The surface area of all the silicon in a supercomputer functions somewhat
> like a large cosmic-ray detector.

I was thinking about this as I read the article. Dedicated cosmic-ray
detectors are almost certainly more sensitive and offer better energy
resolution and localization than watching RAM bits flip. But given the
relatively large number of supercomputers (compared to large cosmic-ray
detectors), is there some way to combine this information from multiple
supercomputers to provide wider-field coverage of the "cosmic ray sky" and
supplement dedicated cosmic-ray detection experiments?

Edit: There's a project/app to use smartphone cameras to make a distributed
network of cosmic ray detectors:
[http://wipac.wisc.edu/deco](http://wipac.wisc.edu/deco)

------
mietek
_> Unfortunately, today’s programming models and languages don’t offer any
mechanism for such dynamic recovery from faults._

Sounds like somebody needs to learn about Erlang.

 _> In June 2012, members of an international forum composed of vendors,
academics, and researchers from the United States, Europe, and Asia met and
discussed adding resilience to message-passing interface, or MPI, the
programming model used in nearly all supercomputing code. Those present at
that meeting voted that the next version of MPI would have no resilience
capabilities added to it. So for the foreseeable future, programming models
will continue to offer no methods for notification or recovery from faults._

Well, no.

[http://web.archive.org/web/20040918131755/http://www.sics.se...](http://web.archive.org/web/20040918131755/http://www.sics.se/~joe/thesis/armstrong_thesis_2003.pdf)

~~~
Animats
On supercomputers, you're often doing some big, tightly coupled numerical job.
That's what you buy supercomputers for. The unit of rerun is perhaps hours of
computation on thousands of CPUs. If you're doing a large number of
transactions, you have big clusters of relatively ordinary CPUs. The unit of
rerun is one transaction.
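One standard way to shrink that unit of rerun is checkpoint/restart: the job periodically saves its state, and after a fault you rerun only the last interval, not hours of work. A minimal sketch (file name and state layout are made up; write-then-rename keeps a crash mid-write from corrupting the last good checkpoint):

```python
import os, pickle

CKPT = "state.ckpt"  # hypothetical checkpoint file

def load_or_init():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "acc": 0}

def checkpoint(state):
    # Write-then-rename, so a crash mid-write can't eat the last checkpoint.
    with open(CKPT + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(CKPT + ".tmp", CKPT)

state = load_or_init()
while state["step"] < 1000:
    state["acc"] += state["step"]   # stand-in for hours of real computation
    state["step"] += 1
    if state["step"] % 100 == 0:    # the checkpoint interval is the rerun unit
        checkpoint(state)

os.remove(CKPT)                     # job finished; drop the checkpoint
print(state["acc"])                 # sum of 0..999 = 499500
```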

~~~
mietek
If you’re using MPI, you’re doing message-passing. If you’re doing message-
passing and you have no methods for notification or recovery from faults,
you’re doing it wrong.
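Erlang's model in miniature: the worker's failure is delivered to the supervisor as data, and the supervisor decides whether to restart. A rough Python analogue (names made up; the point is the fault notification, not the mechanism):

```python
from concurrent.futures import ThreadPoolExecutor

def supervised(fn, max_restarts=2):
    """Erlang-style supervision in miniature: the worker's death arrives
    as data (an exception) and the supervisor decides to restart it."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        for attempt in range(max_restarts + 1):
            future = pool.submit(fn)
            exc = future.exception()      # the fault notification
            if exc is None:
                return future.result()
            print(f"worker faulted ({exc!r}); restart {attempt + 1}")
    raise RuntimeError("max restarts exceeded")

# Demo task (made up): crashes once, then succeeds on the restart.
seen = {"crashed": False}
def task():
    if not seen["crashed"]:
        seen["crashed"] = True
        raise OSError("simulated node failure")
    return "result"

print(supervised(task))
```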

------
batbomb
Cosmic rays aren't neutrons; they're mostly electrons or muons (and maybe
gamma rays). For stopping gamma rays and electrons, 10 feet of water or dirt
is fine. So is about 7 feet of concrete, 2 feet of steel, or one foot of lead.
So is staying closer to sea level. Muons will pass through all of those fairly
easily, but the flux isn't that high; you might only see a few such events per
year.

------
sandworm101
Is it time to start shielding data centers under lead roofs? Or in abandoned
mines?

I have read that circuit orientation can have an effect. Mounting the silicon
vertically, rather than horizontally, can reduce the surface area exposed to
rays.

------
ryao
How did the IBM Blue Gene/L system mentioned in the article have bit flips
from abnormally radioactive solder go undiagnosed for weeks, without being
detected immediately through faults reported by the hardware?

