Hacker News
Attack of the Cosmic Rays (oracle.com)
85 points by timpattinson on Mar 27, 2013 | 26 comments

This happened to me many years ago. All of our sites (ASP classic) on a single server went down at the same time. We had a library we shared across them all that handled some of the DB interactions. We discovered that all of a sudden a SQL query had the letter 'a' replaced with a 'q' (i.e. 01100001 vs 01110001).
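Those two characters really are one flipped bit apart, which is easy to verify (a quick illustration, not from the original comment):

```python
# 'a' is 0x61 (01100001) and 'q' is 0x71 (01110001): a single flipped bit.
diff = ord('a') ^ ord('q')
assert diff == 0b00010000          # only bit 4 differs
assert bin(diff).count("1") == 1   # i.e. exactly one bit flipped
```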

I like to think my afternoon was sidetracked by cosmic radiation from millions of years ago :)

"All of our sites (asp classic) on a single server went down at the same time"

Honest curiosity here (I'm not judging): you were running several sites on a single server that didn't have ECC RAM but regular RAM?

Did you then switch to ECC?

I like the thought about radiation from far away in space and time : )

I'm fairly sure we didn't, no. It was a small agency and we were all pretty junior.

For more on this topic, see DRAM errors in the wild: A Large-Scale Field Study http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf and Cosmic rays don't strike twice: Understanding the characteristics of DRAM errors and the implications for system design http://www.cs.toronto.edu/~bianca/papers/ASPLOS2012.pdf

I think there's another one from MS based on Windows error reporting, but I can't find it.

This might be the Microsoft paper you're referring to: http://research.microsoft.com/pubs/144888/eurosys84-nighting...

I worked on one of the Windows Error Reporting (WER) server components that detected memory corruption from bluescreen memory dumps submitted to the WER service. It's a debugger extension called !chkimg (http://msdn.microsoft.com/en-us/library/windows/hardware/ff5...). It compares the executable code in the memory dumps with the actual binaries that Microsoft shipped and flags differences between the two. This way you can tell what code was actually running on the machine vs what it should have been running. It was quite effective at detecting corruption patterns this way. Usually one-bit corruptions (just a single bit flip) and stride patterns (i.e. one corruption after every 4, 8, 16 or 32 bytes etc) were a good indicator of hardware problems.
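The detection idea described above can be sketched roughly as follows (a hypothetical illustration of the approach, not the actual !chkimg implementation; `corruption_report` is an invented name):

```python
# Hypothetical sketch: diff the code bytes in a crash dump against the
# bytes that were shipped, then look for hardware-smelling patterns:
# single-bit flips and a fixed stride between corrupted offsets.
def corruption_report(dumped, shipped):
    diffs = [i for i, (a, b) in enumerate(zip(dumped, shipped)) if a != b]
    single_bit = [i for i in diffs
                  if bin(dumped[i] ^ shipped[i]).count("1") == 1]
    gaps = {diffs[k + 1] - diffs[k] for k in range(len(diffs) - 1)}
    stride = gaps.pop() if len(gaps) == 1 else None  # one fixed gap = stride
    return diffs, single_bit, stride

# Simulate a dump with one bit flipped every 32 bytes.
shipped = bytes(range(256))
dumped = bytearray(shipped)
for off in range(3, len(dumped), 32):
    dumped[off] ^= 0x04
diffs, single_bit, stride = corruption_report(bytes(dumped), shipped)
assert stride == 32 and single_bit == diffs   # looks like bad hardware
```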

While it is possible that cosmic rays affect DRAM, the majority of the single-bit errors are due to a failure in the memory cell itself, as the paper shows a high correlation between error rates and utilization, while radiation-induced errors should be randomly distributed. Even if you get a rad-induced glitch, I believe it's more likely to be caused by alpha radiation from the chip itself. I don't know of any study that specifically looks for cosmic rays, say, by placing the control system in a lead box. Also, dupe.

Lovely read and educational. Also an area most overlook, and why the reboot mentality, whilst solving the immediate issue, would not make you any the wiser. It's why I always enjoy bluescreens of death, though they become non-existent once you spot the guilty driver, or hardware firmware or BIOS update, or card moved to another slot so it does not share DMA, etc... But always worth it in the end for stability.

I had a system once that would have memory errors after being run for 5 days. You could thrash it for anything up to 5 days without any problem, and after that it would get memory errors; if restarted, the memory errors would soon return. It basically turned out this memory would overheat after 5 days of slowly building up heat, though an hour turned off would resolve it. That was a fun RMA given most would soak-test for 3 days :(. Cheaper memory won out in the end just fine and I never had a problem with it.

As for single-bit errors, well, I had fun on many an ISDN line which would get the odd error. Not on networking, that's fine, error correction; on realtime video, you see it. Then if a line gets so many errors it shuts off, you call up the telco, they run diagnostics, all is fine and it's suddenly working again. It turned out that, as part of running the diagnostics, the diagnostic software would reset the error count, run its tests and be within tolerance. Over time the errors would increase the counter until it hit a threshold and the line would go down.

Moral being: however hard you look into a problem, you will still come across those magic moments. Remember, bit happens.

djb has been preaching this for a while. It's amazing to me how much computer hardware trusts itself.


A common way to detect memory errors in practice (i.e. how they can easily bite you) is by compiling stuff, the bigger the better. When you see the famous

    Internal compiler error
it may be caused by a memory problem.

A few years ago I was doing a lot of Linux kernel builds. My desktop computer, an Athlon 64 X2 4600+ with 2x GeIL PC2-6400 DDR2-800 C4 ULTRA DUAL CHANNEL (2x1GB each; 4GB total), was doing make -j6 (or -j8, I don't remember now) of the kernel, one build after another. After many, many hours some builds started failing with the mentioned error; let's say 1 in 4. I had to get it done, so I kept repeating make (automatically) till it succeeded, but later I fiddled with memtest86+.

It ran day and night on all 10 tests (I believe that was the test count back then) and something finally popped up (I wasn't checking it constantly) [1]. But you know what? When I later switched to showing BadRAM patterns and the problem reappeared, it was a red line without any pattern! I used to see some addresses and such whenever an error was detected by memtest (or experience a computer halt, etc.), but that time I saw nothing; it was quite enigmatic. I checked the memtest sources and test 5 was filtered out from showing BadRAM patterns, apparently as not being reliable. I changed that (I've just found the changed sources on the disk and diffed them against the original release [2]), recompiled memtest and ran it again. I got BadRAM patterns this time [3].

I needed BadRAM patterns to secure myself from being constantly hit by this memory problem, but I don't remember whether I used them in the end.

I still use this computer. No big compiles, no big problems, but sometimes I get an iffy feeling. :) I've had a few BSODs since then and I believe they were caused by this memory problem. I haven't seen it manifest for a long, long time. Maybe my computer was simply overheated back then? Since then I'm much more into checksums than I used to be...

  [1] http://i.imgur.com/RMhCOLM.jpg
  [2] https://gist.github.com/przemoc/5250861
  [3] http://i.imgur.com/jFHZkVA.jpg

Another thing I don't really remember is that BadRAM thing. When I fiddled with it [1], memmap [2] was already there, but for some reason I wanted to use BadRAM, or maybe I was unaware of memmap. BadRAM wasn't available for the kernel I was using, so I attempted porting it to my 2.6.32 one. In the end I never really tested it, I guess; I had to move on to other things.

Now I've remembered something! Some time later I tried contacting Rick van Rein, thinking that maybe he would be interested in a new version of the patch, or rather that he could review it and I would learn something, but I never got a reply from him.

I hope my attempt isn't too embarrassing, and maybe it will be interesting for someone, so I've just published my mail to Rick with the attached patch [3]. Feel free to show me how wrong I was in my poor "refinements" of the original patch! I would be more than happy (even though I'm not into the kernel at all now) to get constructively bashed about it.

  [1] http://rick.vanrein.org/linux/badram/
  [2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/bad_memory.txt
  [3] https://gist.github.com/przemoc/5251104

I would have left a comment on the main article, but it asked me to solve 3+53 as a captcha, and expr kept segfaulting for some reason.

I remember a guy sitting at the next desk to me spending a whole week trying to figure out why his Direct3D code crashed. He was blaming buggy ATI drivers. He figured it out when a bit flip eventually ended up in his _source code_.

Interesting, I've got 16GB on my desktop machine (ECC) and it has reported a correctable ECC error about three times in probably 18 months of run time.


Great article on "how bit-flipping in memory chips or CPU caches can also cause you to visit a wrong domain that may be one character off from the real one." I think I remember seeing it on HN a while back.
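The "one character off" domains are exactly the ones a single bit flip can reach, which is easy to enumerate (a hypothetical helper for illustration, not from the article):

```python
# Every hostname-legal domain exactly one bit flip away from a given one.
def bitflip_neighbors(domain):
    out = set()
    for i, ch in enumerate(domain):
        for bit in range(8):
            c = chr(ord(ch) ^ (1 << bit))
            if c.isascii() and (c.isalnum() or c in "-."):
                out.add(domain[:i] + c.lower() + domain[i + 1:])
    out.discard(domain)   # flipping the case bit just gives the same domain
    return sorted(out)

neighbors = bitflip_neighbors("example.com")
assert "dxample.com" in neighbors   # 'e' (0x65) with bit 0 flipped is 'd'
```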

Just as crazy -- random network packet corruptions that get past the simple checksumming.

If you are seeing bad packets with any frequency, one thing to check is whether you're using any TCP offloading in your driver, on both the sending and the receiving end. (Such as "Large Send Offload", TCP checksum offloading, IP checksum offloading, TSO/LRO, etc. These are usually vendor-specific settings, so terminology may differ.)

I've seen multiple vendors with buggy implementations. In what way are they buggy? In the cases I've seen, packets are corrupted by the driver and then have a hardware checksum applied to the corrupt data...so they go through the network just fine. (In one case, for example, two 64 byte blocks of data were swapped within a large packet.)

When you turn on TCP offloading, you're swapping the well-proven OS network stack for a who-the-hell-knows-how-much-testing-really-went-into-it network device stack.

There may or may not be a noticeable performance penalty when you turn off TCP offloading features, so you should profile performance before and after.

It's quite common for a TCP stream to let corrupted data through, because of its weak 16-bit checksum. Start moving large compressed files (gigabytes) and you'll find that out the bad way.

"Start moving large compressed files (gigabytes) and you'll find that out the bad way."

Not if you do that over SCP / SSL / TLS, that said (and if the bottleneck is the network, not the symmetric encryption used once the public/private key exchange has been made).

So, yeah, every time I copy big files I'm using scp (small files too, actually, but that's just out of convenience).

How do random bit flipping, and the checksumming meant to catch it, work with encryption?

This is actually pretty interesting! How well does encryption handle bit corruption in encrypted data? I would guess it depends on whether a signature is used and on the chaining mode.
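One half of the answer can be sketched directly: when an authentication tag is used, any bit flip in transit makes verification fail (illustrative, using a plain HMAC rather than any particular TLS cipher suite):

```python
import hashlib
import hmac
import os

key = os.urandom(32)
msg = bytearray(os.urandom(1024))
tag = hmac.new(key, bytes(msg), hashlib.sha256).digest()

msg[100] ^= 0x10   # a single bit flipped in transit
flipped_tag = hmac.new(key, bytes(msg), hashlib.sha256).digest()
assert not hmac.compare_digest(flipped_tag, tag)   # corruption detected
```

Without a tag, the damage depends on the mode: in CBC, for instance, a flipped ciphertext bit garbles one whole block and flips one bit in the next.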

The sucky thing is that ECC memory doesn't need to be so much more expensive than non-ECC. It is only more expensive because those who make it and the server vendors have conspired to use it as a margin driver.

Not to spoil anyone's fun, but the radiation doesn't have to come from far away in space and time: just earthly background radiation is enough to flip bits from time to time. Place your server closer to a "flashy" neighbor (say, radiotherapy equipment in a hospital a floor away from your server room, or an equivalent industrial setting) and your potential bad luck increases :)

How hard is it to install ECC memory into a random desktop computer? Are they all compatible with it? Do you have to change BIOS settings?

I always thought the chips on the RAM stick would do one-bit error correction transparently and just fail to correct two-bit errors. But, it appears that the motherboard has to support it and you might need to enable it. There's an overview at http://en.wikipedia.org/wiki/ECC_memory#Pros_and_cons_of_ECC
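The correct-one-bit, detect-two-bits scheme (SEC-DED) can be sketched in software (an illustrative extended Hamming(8,4) toy, not how any DIMM actually implements it; real ECC modules use 8 check bits per 64-bit word, in hardware):

```python
# Toy SEC-DED: extended Hamming code over a 4-bit nibble.
def ecc_encode(nibble):
    """Encode 4 data bits into 8 code bits (7 Hamming + 1 overall parity)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    overall = 0
    for b in bits:
        overall ^= b
    return bits + [overall]

def ecc_decode(bits):
    """Correct any single-bit error; raise on a double-bit error."""
    c1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    c2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    c3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = c1 | (c2 << 1) | (c3 << 2)   # 1-based position of the error
    overall = 0
    for b in bits:
        overall ^= b
    if syndrome and overall:                 # single-bit error: fix it
        bits = bits[:]
        bits[syndrome - 1] ^= 1
    elif syndrome and not overall:           # two flips: detect, can't fix
        raise ValueError("uncorrectable double-bit error")
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
```

Flipping any single bit of a codeword still decodes to the original nibble; flipping two raises instead of silently returning bad data, which is exactly the transparent behavior described above.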

Thank you for the question, I think I learned something today. I hope someone who knows better comes along, I'd be interested in knowing more.

Pretty hard. ECC RAM will work as normal RAM with non-ECC CPUs. To use ECC you need a Xeon (or AMD equivalent) processor and a supporting motherboard. You can get the Xeon equivalent of the latest i7 cheaper than the i7, though you (I think) still need to get a server mobo.

It's a bit of a pain to find ECC mobos for desktops.

I looked for some when I built my last workstation, seeing as I was putting 16 GB in, but ended up using stupid normal RAM : (

Now between SSH, SCP, SSL / TLS, Git, diffs before commits, unit tests, etc. I'd be really unlikely to have a bit flip really "destroying my work" : )

So, yeah, 16 GB of non-ECC memory on a Linux workstation which regularly reaches 6 months of uptime: no need for paranoia on a desktop/workstation either.

For servers, ECC is great...

AMD AM2/AM3 CPUs support ECC. It will work provided that the motherboard has the extra wiring required and BIOS support is enabled. Some vendors (notably ASUS) advertise ECC support on many cheap AMD motherboards.

AFAIK to get ECC with Intel you have to pay for the Xeon.
