
Attack of the Cosmic Rays - timpattinson
https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1
======
aidos
This happened to me many years ago. All of our sites (asp classic) on a single
server went down at the same time. We had a library we shared across them all
that handled some of the db interactions. We discovered that all of a sudden a
SQL query had the letter 'a' replaced with a 'q' (i.e. 01100001 vs 01110001).

I like to think my afternoon was sidetracked by cosmic radiation from millions
of years ago :)
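
For anyone who wants to check the arithmetic: XOR-ing the two character codes shows that 'a' and 'q' differ in exactly one bit. This is just a quick sketch; the helper name `flipped_bits` is made up for illustration.

```python
def flipped_bits(a: str, b: str) -> list[int]:
    """Return the bit positions (0 = least significant) where two characters differ."""
    diff = ord(a) ^ ord(b)
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


print(format(ord("a"), "08b"))  # 01100001
print(format(ord("q"), "08b"))  # 01110001
print(flipped_bits("a", "q"))   # [4]
```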

~~~
martinced
_"All of our sites (asp classic) on a single server went down at the same
time"_

Honest curiosity here (I'm not judging): you were running several sites on a
single server that didn't have ECC RAM but regular RAM?

Did you then switch to ECC?

I like the thought about radiation from far away in space and time : )

~~~
aidos
I'm fairly sure we didn't, no. It was a small agency and we were all pretty
junior.

------
wmf
For more on this topic, see DRAM errors in the wild: A Large-Scale Field Study
<http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf> and Cosmic rays
don't strike twice: Understanding the characteristics of DRAM errors and the
implications for system design
<http://www.cs.toronto.edu/~bianca/papers/ASPLOS2012.pdf>

I think there's another one from MS based on Windows error reporting, but I
can't find it.

~~~
nonane
This might be the Microsoft paper you're referring to:
[http://research.microsoft.com/pubs/144888/eurosys84-nighting...](http://research.microsoft.com/pubs/144888/eurosys84-nightingale.pdf)

I worked on one of the Windows Error Reporting (WER) server components that
detected memory corruption from bluescreen memory dumps submitted to the WER
service. It's a debugger extension called !chkimg
([http://msdn.microsoft.com/en-us/library/windows/hardware/ff5...](http://msdn.microsoft.com/en-us/library/windows/hardware/ff562217\(v=vs.85\).aspx)). It compares the
executable code in the memory dumps with the actual binaries that Microsoft
shipped and flags differences between the two. This way you can tell what code
was actually running on the machine vs what it should have been running. It
was quite effective at detecting corruption patterns this way. Usually one-bit
corruptions (just a single bit flip) and stride patterns (i.e. one corruption
after every 4, 8, 16 or 32 bytes etc) were a good indicator of hardware
problems.
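
To make the idea concrete, here is a rough sketch of that kind of comparison. It is not how !chkimg is implemented; the function names, the "at least three hits" rule and the returned labels are all assumptions for illustration. It diffs a reference image against a dump and reports either a lone single-bit flip or a regular stride.

```python
def diff_offsets(expected: bytes, actual: bytes) -> list[int]:
    """Offsets at which the dumped bytes differ from the reference bytes."""
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


def classify(expected: bytes, actual: bytes) -> str:
    """Very rough heuristic label for a corruption pattern (illustrative only)."""
    offsets = diff_offsets(expected, actual)
    if not offsets:
        return "clean"
    if len(offsets) == 1:
        changed = expected[offsets[0]] ^ actual[offsets[0]]
        # A power of two means exactly one bit differs in that byte.
        if changed & (changed - 1) == 0:
            return "single-bit flip"
        return "single-byte corruption"
    gaps = {b - a for a, b in zip(offsets, offsets[1:])}
    # Assumption: require at least three hits before calling it a stride.
    if len(offsets) >= 3 and len(gaps) == 1 and gaps <= {4, 8, 16, 32}:
        return f"stride pattern every {gaps.pop()} bytes"
    return "other"


reference = bytes(64)

one_bit = bytearray(reference)
one_bit[10] ^= 0x08
print(classify(reference, bytes(one_bit)))

strided = bytearray(reference)
for i in range(0, 64, 16):
    strided[i] = 0xFF
print(classify(reference, bytes(strided)))
```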

------
przemoc
A common way to run into memory errors in practice (i.e. how they can easily
bite you) is by compiling stuff, the bigger the better. When you see the famous

    Internal compiler error

it may be caused by some memory problem.

A few years ago I was doing a lot of Linux kernel builds. My desktop computer,
an Athlon 64 X2 4600+ with 2x GeIL PC2-6400 DDR2-800 C4 ULTRA DUAL CHANNEL
(2x1GB each; 4GB total), was doing make -j6 (or -j8, I don't remember now) of
one kernel after another. After many, many hours some builds started failing
with the error mentioned above, let's say 1 in 4. I had to get the builds
done, so I kept repeating make (automatically) until it succeeded, but later I
fiddled with memtest86+. It ran day and night on all 10 tests (I believe that
was the test count back then) and something finally popped up (I wasn't
checking it constantly) [1]. But you know what? When I later switched to
showing BadRAM patterns and the problem reappeared, it was a red line without
any pattern! I was used to seeing some addresses and such whenever memtest
detected an error (or to the computer halting, etc.), but that time I saw
nothing; it was quite enigmatic. I checked the memtest sources, and test 5 was
filtered out from showing BadRAM patterns, apparently as not being reliable. I
changed that (I've just found the changed sources on disk and diffed them
against the original release [2]), recompiled memtest and ran it again. I got
BadRAM patterns this time [3].

I needed BadRAM patterns to protect myself from being constantly hit by this
memory problem, but I don't remember whether I used them in the end.

I still use this computer. No big compiles, no big problems, but sometimes I
get the iffy feeling. :) I've had a few BSODs since then and I believe they
were caused by this memory problem. I haven't seen it manifest for a long,
long time. Maybe my computer was simply overheated back then? Since then I'm
much more into checksums than I used to be...

    
    
      [1] http://i.imgur.com/RMhCOLM.jpg
      [2] https://gist.github.com/przemoc/5250861
      [3] http://i.imgur.com/jFHZkVA.jpg

~~~
przemoc
Another thing I don't really remember is that BadRAM thing. When I fiddled
with it [1], memmap [2] was already there, but for some reason I wanted to use
BadRAM, or maybe I was unaware of memmap. BadRAM wasn't available for the
kernel I was using, so I attempted porting it to my 2.6.32 one. In the end I
never really tested it, I guess; I had to move on to other things.

Now I remember something! Some time later I tried contacting Rick van Rein,
thinking that maybe he would be interested in a new version of the patch, or
rather that he could review it and I would learn something, but I never got a
reply from him.

I hope my attempt isn't too embarrassing, yet can be interesting for someone,
so I just published my mail to Rick with the attached patch [3]. Feel free to
show me how wrong I was in my poor "refinements" of the original patch! I
would be more than happy (even though I'm not into kernel now at all) to get
constructively bashed about it.

    
    
      [1] http://rick.vanrein.org/linux/badram/
      [2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/bad_memory.txt
      [3] https://gist.github.com/przemoc/5251104

------
Zenst
Lovely read, and educational. Also an area most overlook, and a reason why the
reboot mentality, whilst solving the issue, would not make you any the wiser.
It's why I always enjoy bluescreens of death, though they become non-existent
once you spot the guilty driver, or the hardware firmware or BIOS update, or
the card that needs moving to another slot so it does not share DMA, etc. But
it's always worth it in the end for stability.

I had a system once that would have memory errors after being run for 5 days.
I could thrash it for anything up to 5 days without any problem, and after
that it would get memory errors. If restarted, the memory errors would return.
Basically it turned out this memory would overheat after 5 days of slowly
building up heat, though an hour turned off would resolve it. That was a fun
RMA given most would soak test for 3 days :(. Cheaper memory won out in the
end just fine and I never had a problem with it.

As for single-bit errors, well, I had fun on many an ISDN line which would get
the odd error. Not on networking, that's fine, there's error correction. On
realtime video, you see it. Then if a line gets so many errors it shuts off,
you call up the telco, they run diagnostics, all is fine and it's suddenly
working again. It turned out that, as part of running the diagnostics, the
diagnostic software would reset the error count, run the diagnostics and be
within tolerance. Over time the errors would increase the counter until it hit
a threshold and the line would go down.

Moral being, however hard you look into a problem you will still come across
those magic moments, and remember: bit happens.

------
everettForth
djb has been preaching this for a while. It's amazing to me how much computer
hardware trusts itself:

<http://cr.yp.to/hardware/ecc.html>

------
ChuckMcM
Interesting, I've got 16GB on my desktop machine (ECC) and it has reported a
correctable ECC error about three times in probably 18 months of run time.

------
Schwolop
I would have left a comment on the main article, but it asked me to solve 3+53
as a captcha, and expr kept segfaulting for some reason.

------
ringm
I remember a guy sitting at the next desk to me spending a whole week trying
to figure out why his Direct3D code crashes. He was blaming buggy ATI drivers.
He figured it out when eventually a bit flip ended up in his _source code_.

------
jesseendahl
[http://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squat...](http://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squatting-dns-hijacking-without-exploitation/)

Great article on "how bit-flipping in memory chips or CPU caches can also
cause you to visit a wrong domain that may be one character off from the real
one." I think I remember seeing it on HN awhile back.
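
As a rough illustration of the idea (not code from the article), the sketch below enumerates every name that is one flipped bit away from a label and still made of plain hostname characters. The function name `bitsquats` and the crude character filter are my own assumptions; uppercase results are dropped since DNS treats them as the same name.

```python
import string

# Crude filter: lowercase letters, digits and hyphen only (an assumption,
# real hostname rules are a bit stricter about where hyphens may appear).
VALID = set(string.ascii_lowercase + string.digits + "-")


def bitsquats(label: str) -> set[str]:
    """Labels that differ from `label` by exactly one flipped bit."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in VALID:
                variants.add(label[:i] + flipped + label[i + 1:])
    return variants


for name in sorted(bitsquats("example")):
    print(name + ".com")
```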

------
rdtsc
Just as crazy -- random network packet corruptions that get past the simple
checksumming.

~~~
Sami_Lehtinen
It's quite common for a TCP stream to let corrupted data through, because TCP
only has a 16-bit checksum (a simple sum, not a CRC). Start moving large
compressed files (gigabytes) and you'll find that out the bad way.
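
A small sketch of why that checksum is weak, assuming the ones' complement sum described in RFC 1071 and ignoring the pseudo-header that real TCP also covers. The function name is my own. A single flipped bit is caught, but two flips that cancel out in the sum, or two swapped 16-bit words, are not.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x00\x01\x00\x02"
one_flip = b"\x00\x00\x00\x02"   # one bit flipped: detected
two_flips = b"\x00\x00\x00\x03"  # two compensating flips: missed
swapped = b"\x00\x02\x00\x01"    # 16-bit words reordered: missed

print(internet_checksum(original) == internet_checksum(one_flip))
print(internet_checksum(original) == internet_checksum(two_flips))
print(internet_checksum(original) == internet_checksum(swapped))
```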

~~~
martinced
_"Start moving large compressed files (gigabytes) and you'll find that out the
bad way."_

Not if you do that over SCP / SSL / TLS, though... (and the bottleneck is the
network, not the symmetric encryption used once the public-key key exchange
has been made).

So, yeah, every time I copy big files I'm using _scp_ (small files too,
actually, but that's just out of convenience).

~~~
rdtsc
How does random bit flipping, and checksumming to fix it, work with
encryption?

This is actually pretty interesting! How well does encryption handle bit
corruption in encrypted data? I would guess it depends on whether a signature
is used and on the chaining mode used.
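
On the signature part, here is a minimal sketch using an HMAC from the Python standard library. It is not how SSH or TLS frame their records, and the `seal` / `open_sealed` names are made up. The point is only that a keyed tag detects a flipped bit rather than fixing it; in SSH and TLS a failed integrity check aborts the connection, and the transfer has to be retried.

```python
import hashlib
import hmac

TAG_LEN = 32  # length of a SHA-256 digest in bytes


def seal(key: bytes, message: bytes) -> bytes:
    """Append an HMAC-SHA256 tag to the message."""
    tag = hmac.new(key, message, hashlib.sha256).digest()
    return message + tag


def open_sealed(key: bytes, blob: bytes) -> bytes:
    """Return the message if the tag verifies, otherwise raise ValueError."""
    message, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(key, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return message


key = b"demo key, not a real secret"
blob = seal(key, b"large compressed file contents")

corrupted = bytearray(blob)
corrupted[3] ^= 0x10  # simulate a single bit flip in transit

try:
    open_sealed(key, bytes(corrupted))
except ValueError as err:
    print("rejected:", err)
```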

------
cek
The sucky thing is ECC memory does not need to be so much more expensive than
non-ECC. It is only more expensive because those who make it and the server
vendors have conspired to use it as a margin driver.

------
nnq
Not to spoil anyone's fun but the radiation doesn't have to come from far away
in space and time... just earthly background radiation is enough to flip bits
from time to time. Place your server closer to a "flashy" neighbor (let's say
radiotherapy equipment in a hospital, a floor away from your server room, or
an equivalent industrial setting) and your potential bad luck increases :)

------
mark-r
How hard is it to install ECC memory into a random desktop computer? Are they
all compatible with it? Do you have to change BIOS settings?

~~~
timpattinson
Pretty hard. ECC RAM will work as normal RAM on non-ECC CPUs. To use ECC you
need a Xeon (or AMD equivalent) processor and a supporting motherboard. You
can get the Xeon equivalent of the latest i7 cheaper than the i7; you (I
think) still need to get a server mobo, though.

~~~
martinced
It's a bit of a pain to find ECC mobos for desktops.

I looked for some when I built my last workstation, seeing that I was putting
16 GB in, but ended up using stupid normal RAM : (

Now between SSH, SCP, SSL / TLS, Git, diffs before commits, unit tests, etc.
I'd be really unlikely to have a bit flip really "destroying my work" : )

So, yeah, 16 GB of non-ECC memory on a Linux workstation which regularly
reaches 6 months of uptime: no need to get paranoid about it for a
desktop/workstation either.

For servers, ECC is great...

