I like to think my afternoon was sidetracked by cosmic radiation from millions of years ago :)
Honest curiosity here (I'm not judging): you were running several sites on a single server that had regular RAM rather than ECC?
Did you then switch to ECC?
I like the thought about radiation from far away in space and time : )
I think there's another one from MS based on Windows error reporting, but I can't find it.
I worked on one of the Windows Error Reporting (WER) server components that detected memory corruption from bluescreen memory dumps submitted to the WER service. It's a debugger extension called !chkimg (http://msdn.microsoft.com/en-us/library/windows/hardware/ff5...). It compares the executable code in the memory dumps with the actual binaries that Microsoft shipped and flags differences between the two. This way you can tell what code was actually running on the machine vs what it should have been running. It was quite effective at detecting corruption patterns this way. Usually one-bit corruptions (just a single bit flip) and stride patterns (i.e. one corruption after every 4, 8, 16 or 32 bytes etc) were a good indicator of hardware problems.
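Just to illustrate the idea, here's a toy sketch in Python (definitely not the actual !chkimg logic; all the names are mine) of diffing dumped code against the shipped binary and classifying the result as one-bit or stride corruption:

    def diff_offsets(dumped: bytes, shipped: bytes):
        # Offsets where the code captured in the dump differs from the binary.
        return [i for i, (a, b) in enumerate(zip(dumped, shipped)) if a != b]

    def classify(dumped: bytes, shipped: bytes) -> str:
        offs = diff_offsets(dumped, shipped)
        if not offs:
            return "clean"
        if len(offs) == 1:
            xor = dumped[offs[0]] ^ shipped[offs[0]]
            if xor & (xor - 1) == 0:  # exactly one bit set
                return "one-bit corruption (hardware suspect)"
        # Stride pattern: corrupted bytes spaced 4, 8, 16 or 32 bytes apart.
        gaps = {b - a for a, b in zip(offs, offs[1:])}
        if len(gaps) == 1 and gaps.pop() in (4, 8, 16, 32):
            return "stride corruption (hardware suspect)"
        return "irregular corruption"

    print(classify(b"\x90\x91\x92\x93", b"\x90\x99\x92\x93"))  # 0x91^0x99 = one bit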
I had a system once that would develop memory errors after running for 5 days: you could thrash it for anything up to 5 days without any problem, and after that the memory errors would appear. If restarted, the memory errors would return. It basically turned out this memory would overheat after 5 days of slowly building up heat, though an hour turned off would resolve it. That was a fun RMA, given most places only soak test for 3 days :(. Cheaper memory won out in the end just fine and I never had a problem with it.
As for single-bit errors, well, I had fun on many an ISDN line which would get the odd error. Not a problem for networking data, that's fine, error correction handles it. On realtime video, you see it. Then if a line gets too many errors it shuts off, you call up the telco, they run diagnostics, all is fine and it's suddenly working again. It turned out that as part of running the diagnostics, the diagnostic software would reset the error count, run its tests and come back within tolerance. Over time the errors would push the counter up until it hit the threshold and the line would go down again.
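In hindsight the bug boils down to something like this (a made-up toy sketch in Python, not the telco's actual software):

    THRESHOLD = 1000  # errors before the line shuts itself down

    class IsdnLine:
        def __init__(self):
            self.error_count = 0

        def record_error(self):
            self.error_count += 1
            if self.error_count >= THRESHOLD:
                print("line down")

        def run_diagnostics(self):
            self.error_count = 0          # the bug: counter reset before testing
            return "within tolerance"     # so the telco always sees a healthy line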
Moral being: however hard you look into a problem, you will still come across those magic moments. Remember, bit happens.
Internal compiler error
A few years ago I was doing a lot of Linux kernel builds. My desktop computer, an Athlon 64 X2 4600+ with 2x GeIL PC2-6400 DDR2-800 C4 ULTRA DUAL CHANNEL (2x1GB each; 4GB total), was doing make -j6 (or -j8, I don't remember now) of the kernel one after another. After many, many hours some builds started failing with the mentioned error, let's say 1 in 4. I had to get it done, so I kept repeating make (automatically) until it succeeded, but later I fiddled with memtest86+.

It ran day and night on all 10 tests (I believe that was the test count back then) and something finally popped up (I wasn't checking it constantly). But you know what? When I later switched to showing BadRAM patterns and the problem reappeared, it was a red line without any pattern! I used to see some addresses and such whenever an error was detected by memtest (or the computer halted, etc.), but that time I saw nothing; it was quite enigmatic. I checked the memtest sources and test 5 was filtered out from showing BadRAM patterns, apparently for not being reliable. I changed that (I've just found the changed sources on the disk and diffed them against the original release), recompiled memtest and ran it again. I got BadRAM patterns this time.
I needed the BadRAM patterns to protect myself from being constantly hit by this memory problem, but I don't remember whether I used them in the end.
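For anyone curious, a BadRAM pattern is essentially an address/mask pair, and folding a list of faulty addresses into one pattern works roughly like this (my rough sketch of the idea in Python, not memtest's actual code; the helper name is mine):

    def fold_badram(addresses, bits=32):
        # mask keeps only the address bits that all faulty addresses share.
        addr = addresses[0]
        mask = (1 << bits) - 1
        for a in addresses[1:]:
            mask &= ~(addr ^ a)   # drop bits that differ between faults
            addr &= mask          # keep only the significant bits
        return addr, mask

    addr, mask = fold_badram([0x01FF0000, 0x01FF0040, 0x01FF0080])
    print(f"badram=0x{addr:08X},0x{mask:08X}")  # kernel boot option format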
I still use this computer. No big compiles, no big problems, but sometimes I get the iffy feeling. :) I've had a few BSODs since then and I believe they were caused by this memory problem, though I haven't seen it manifest in a long, long time. Maybe my computer was simply overheating back then? Since then I'm much more into checksums than I used to be...
Now I've remembered something! Some time later I tried contacting Rick van Rein, thinking that maybe he would be interested in a new version of the patch, or rather that he could review it and I would learn something, but I never got a reply from him.
I hope my attempt isn't too embarrassing and might be interesting for someone, so I've just published my mail to Rick with the attached patch. Feel free to show me how wrong I was in my poor "refinements" of the original patch! I would be more than happy (even though I'm not into kernel work at all now) to get constructively bashed about it.
Great article on "how bit-flipping in memory chips or CPU caches can also cause you to visit a wrong domain that may be one character off from the real one." I think I remember seeing it on HN a while back.
I've seen multiple vendors with buggy implementations. In what way are they buggy? In the cases I've seen, packets are corrupted by the driver and then have a hardware checksum applied to the corrupt data...so they go through the network just fine. (In one case, for example, two 64 byte blocks of data were swapped within a large packet.)
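To make the failure mode concrete, here's a toy Python sketch (a simplified RFC 1071-style Internet checksum; payload and names are made up): once the checksum is computed over already-corrupted bytes, the receiver's verification happily passes.

    def internet_checksum(data: bytes) -> int:
        # Simplified RFC 1071 checksum: 16-bit one's-complement sum.
        if len(data) % 2:
            data += b"\x00"
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
            total = (total & 0xFFFF) + (total >> 16)  # end-around carry
        return ~total & 0xFFFF

    payload = bytearray(b"some packet payload!")
    payload[3] ^= 0x10                            # driver corrupts the data first...
    csum = internet_checksum(bytes(payload))      # ...then hardware checksums it

    # The receiver verifies the corrupted payload against csum -- and it passes,
    # so the corruption sails through the network unnoticed.
    assert internet_checksum(bytes(payload) + csum.to_bytes(2, "big")) == 0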
When you turn on TCP offloading, you're swapping the well-proven OS network stack for a who-knows-how-much-testing-really-went-into-it network device stack.
There may or may not be a noticeable performance penalty when you turn off TCP offloading features, so profile performance before and after.
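If you want to experiment, on Linux you can usually inspect and toggle the offloads with ethtool (assuming an interface named eth0; exact feature names vary by driver and kernel):

    ethtool -k eth0
    ethtool -K eth0 rx off tx off tso off gso off gro off

The first command lists the current offload settings; the second disables the common checksum and segmentation offloads so the OS stack does the work instead.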
Not if you do that over SCP / SSL / TLS, that said... (and the bottleneck is the network, not the symmetric encryption used once the initial public/private key exchange has established the session key).
So, yeah, every time I copy big files I'm using scp (small files too actually, but that's just out of convenience).
This is actually pretty interesting! How does encryption handle bit corruption in the ciphertext? I would guess it depends on whether a MAC/signature is used and on the chaining mode.
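For what it's worth, a quick experiment suggests an authenticated mode catches it outright: flip a single ciphertext bit and AES-GCM refuses to decrypt rather than returning garbage (this assumes the third-party Python `cryptography` package; the variable names are mine):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
    from cryptography.exceptions import InvalidTag

    key = AESGCM.generate_key(bit_length=128)
    nonce = os.urandom(12)
    ct = AESGCM(key).encrypt(nonce, b"important data", None)

    corrupted = bytearray(ct)
    corrupted[0] ^= 0x01        # single bit flip in the ciphertext

    try:
        AESGCM(key).decrypt(nonce, bytes(corrupted), None)
    except InvalidTag:
        print("bit flip detected: decryption rejected")

With an unauthenticated mode like plain CBC, by contrast, my understanding is that the flip just garbles the affected block and flips the corresponding bit in the next one, and nothing complains.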
Thank you for the question, I think I learned something today. I hope someone who knows better comes along, I'd be interested in knowing more.
I looked for some when I built my last workstation, seeing that I was putting 16 GB in, but ended up using stupid normal RAM : (
Now, between SSH, SCP, SSL / TLS, Git, diffs before commits, unit tests, etc., a bit flip would be really unlikely to actually "destroy my work" : )
So, yeah, 16 GB of non-ECC memory on a Linux workstation which regularly reaches 6 months of uptime: no need to get paranoid about it on a desktop/workstation either.
For servers, ECC is great...
AFAIK to get ECC with Intel you have to pay for the Xeon.