Compared to the effective volume of a purpose-designed one (ATLAS, CMS, Super-Kamiokande, etc.) it's rather small, but a particle detector nevertheless.
A couple of months/years ago there was an article (also linked here on HN, IIRC) that did a few back-of-the-envelope calculations regarding expected event rates. IIRC it was something on the order of 1 event per day per 10^12 transistors. (EDIT: not the one I thought of, but it blows the same horn: http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf )
Radiation-hardened software has also been researched (and still is). Essentially the idea is to have not only redundant, error-correcting memory, but also redundant, error-correcting computation. NASA has some publications on that, e.g. https://ti.arc.nasa.gov/m/pub-archive/1075h/1075%20(Mehlitz)...
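For the curious, the "redundant computation" idea can be sketched in a few lines of C (my own toy illustration of triple modular redundancy, not NASA's actual framework):

    /* Toy software TMR: run the same computation three times and
       majority-vote the results, so one corrupted copy gets outvoted.
       Real rad-hard software also has to protect the voter itself
       and the control flow. */
    #include <stdio.h>

    static int vote3(int a, int b, int c) {
        if (a == b || a == c) return a;
        return b;              /* a is the odd one out, so b == c */
    }

    static int square(int x) { return x * x; }

    int main(void) {
        int r1 = square(7), r2 = square(7), r3 = square(7);
        r2 ^= 1 << 4;          /* simulate an SEU in one copy */
        printf("voted result: %d\n", vote3(r1, r2, r3));   /* 49 */
        return 0;
    }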
"As transistor sizes have shrunk, they have required less and less electrical charge to represent a logical bit. So the likelihood that one bit will "flip" from 0 to 1 (or 1 to 0) when struck by an energetic particle has been increasing. This has been partially offset by the fact that as the transistors have gotten smaller they have become smaller targets so the rate at which they are struck has decreased.
More significantly, the current generation of 16-nanometer circuits have a 3D architecture that replaced the previous 2D architecture and has proven to be significantly less susceptible to SEUs. Although this improvement has been offset by the increase in the number of transistors in each chip, the failure rate at the chip level has also dropped slightly. However, the increase in the total number of transistors being used in new electronic systems has meant that the SEU failure rate at the device level has continued to rise."
That goes out the window when you sleep, though, because you don't know how long you've been asleep, and P(bitflip) is a function of time. If you sleep too long you are non-spec-compliant for silent data corruption, and since there isn't a way to know how long you slept, the only "safe" option is to invalidate the cache and reload it.
Sad, but an understandable approach. The downside is that your wake-from-sleep is slower by the amount of time it takes to warm up the cache.
See section 3.7.2.f
Like an app on a phone or desktop that sends reports back with location and time information.
I don't know if any novel results have come out of this kind of thing.
I (thought I) signed up to be informed of beta releases, but never heard anything. I just checked their website and it mentions a beta app, but that seems to just go to a signup page.
Pity the photo with friendly smiles from the press package was left out, but the curious will find it with a brief search.
Why is this in such an inconvenient form? If I have X GB of RAM, how many unavoidable memtest errors should I expect per hour of testing? That could be used to tell us the minimum amount of time to run the test.
So if you have 128 GB, you would expect 128/10^4 ≈ 0.013 bitflips per hour, i.e. roughly 2 per week, about one every three days?
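If the figure really is on the order of one flip per 10^4 GB-hours (my reading of the number upthread, so treat the constant as an assumption), the arithmetic is just:

    /* Back-of-the-envelope expected bitflips; the rate constant is
       an assumption, not a measured figure. */
    #include <stdio.h>

    int main(void) {
        double gb       = 128.0;
        double rate     = 1.0 / 1e4;          /* flips per GB per hour (assumed) */
        double per_week = gb * rate * 24 * 7;
        printf("%.1f expected flips/week\n", per_week);    /* ~2.2 */
        return 0;
    }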
I did that for a bit on cloudfront.net and got dozens of them in a short amount of time.
Back in the USSR/Russia, gamma rays and aliens were a non-issue compared to the extremely low reliability of USSR/Russian hardware. Military hardware back then (and some telco hardware built in Russia in the 199x years, some even for export to Western countries!) was built as triplicated systems, i.e. primitive quorum/consensus computing.
Similarly, as far as I heard back then, due to the low reliability of Itanium, especially in the beginning, the Itanium-based Tandems were also available as "Tridems".
Cosmic ray showers, potentially. My prediction is that there should be "clusters" of bit errors every few minutes or so (given a sufficient number of servers...).
I was talking to a sysadmin at a local company running a few thousand servers about getting their ECC error logs in order to look for these, but scraping them apparently wasn't trivially manageable for them.
 - https://en.wikipedia.org/wiki/Air_shower_(physics)
Those with poor or long distance DSL do see a big increase in errors.
And while hopefully getting rarer now, home routers could easily overheat and introduce errors as soon as they saw more than the usual traffic.
We recently had issues with the power supply for our router, and it also caused issues.
Yes, but because it's so prevalent that people expect it, they've added checksums on multiple levels, so the network actually performs better. In this particular instance (DNS queries), it's very unlikely that the data was corrupted in transit: "We believe that UDP checksums are effective at preventing 'bitsquat' attacks and other types of errors that occur after a DNS query leaves a DNS resolver and enters the network."
On a bad connection it will also be quite common for a corrupted packet's checksum to still come out valid. TCP/UDP checksums are quite weak.
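For context, the TCP/UDP checksum is just the 16-bit ones'-complement sum from RFC 1071, roughly:

    /* The Internet checksum (RFC 1071) used by TCP and UDP. It's a
       16-bit ones'-complement sum, so it can't detect reordered
       16-bit words or pairs of errors that cancel out; hence
       corrupted-but-"valid" packets on bad links. */
    #include <stdint.h>
    #include <stddef.h>

    uint16_t inet_checksum(const uint8_t *data, size_t len) {
        uint32_t sum = 0;
        while (len > 1) {
            sum += (uint32_t)data[0] << 8 | data[1];
            data += 2;
            len  -= 2;
        }
        if (len == 1)                /* pad an odd trailing byte */
            sum += (uint32_t)data[0] << 8;
        while (sum >> 16)            /* fold the carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }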
There was a bit that flipped and apparently brought the whole thing to its knees.
This is the stuff nightmares are made of.
> Bitsquat traffic represents a slice of normal traffic
Wow. That's pretty amazing. With enough analysis, you could (possibly) recover the traffic pattern for major websites. That alone seems to have a lot of interesting implications.
Was fun; never published my results, though.
Get a TLS cert and you might even be able to serve the original file with some additional JS of your own :)
1. Not all single-char typos are bitflips, and when you test it statistically you get more traffic from bitflips than from typos. It's a bit more complicated than that, though: some typos are really common because the keys are adjacent. (See the sketch after this list.)
2. You can look at the raw traffic that comes in and tell what's a simple GET request with a fat-fingered URL (GET https://en.7ikipedia.org/ vs GET https://en.wikipedia.org/w/index.php?search=america&title=Sp...).
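For the curious, enumerating candidate bitsquats is only a few lines (a sketch; a real survey would also filter for registrable names):

    /* Print all single-bit-flip variants of a domain that are still
       made of valid hostname characters. */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static int valid_host_char(char c) {
        return islower((unsigned char)c) || isdigit((unsigned char)c)
            || c == '-' || c == '.';
    }

    int main(void) {
        const char *domain = "github.com";
        char buf[64];
        for (size_t i = 0; i < strlen(domain); i++) {
            for (int bit = 0; bit < 8; bit++) {
                strcpy(buf, domain);
                buf[i] ^= (char)(1 << bit);
                if (valid_host_char(buf[i]))
                    printf("%s\n", buf);   /* e.g. "withub.com", "eithub.com" */
            }
        }
        return 0;
    }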
Source: I have some domains that are bitsquats of some high traffic domains. I get access token headers when I do a raw dump of the traffic. (I turned off the servers after satisfying my curiosity.)
Want to stop randos from getting some tiny portion of your traffic's access tokens? Use client-side keys and send signed/encrypted access tokens, or even full requests.
Yeah, I know. It's so far down the list it doesn't matter. But every combination of github.com, microsoft.com, etc. is bitsquatted, so this attack can be considered ongoing. This is doubly true if you're on a domain like .co.uk, which is a stupid subdomain of .uk; now that .uk addresses are available, someone can buy ko.uk and bitsquat the entire fucking country.
Edit: reading the linked article, they describe one of their validation methods:
> The Host header contains the domain the HTTP client resolved to connect to the bitsquat server. If the Host header matches the original domain, the corruption occurred on the red path (DNS path). If the Host header matches a bitsquat domain, the corruption occurred on the blue path
 "These requests were not typos or other manually entered URLs"
There are a lot more details in the article.
I don't think that sounds right.
Complex embedded systems like your cellphone's baseband processor usually just give up at some point and suicide a task or even the whole OS if they detect a problem. For a while I had a Qualcomm debugger attached to the internal cell modem I had in a netbook I was working on, and the baseband crashed all the time due to hardware faults. I thought I had a bad chip for a while until I realized it never happened when I left it in an underground parking lot.
I also worked at a switch manufacturer. We had some ASICs from one of the big companies, and crashes that we could not explain at all. We knew it was not us. We proved that bits were flipping in the switch ASIC. Turns out they had forgotten to spec low-alpha solder. Alpha particles will not go through your skin, but when the solder is layered right onto the chip... oops.
I know big customers (such as Cisco) can ask for such test reports, and they do get them.
It would be a very interesting experiment to position some modern/old electronics (7, 14, 28 nm chips: cell phones, Raspberry Pis + solar panels) at various distances near a nuclear accident site, monitor them remotely, and classify and log the failure rates over the months/years.
It was pretty neat.
So maybe (again, me being a layman) what happens is that usually a gamma ray hits a DRAM cell but doesn't impart enough energy to cause a flip. A millisecond later the cell gets refreshed, erasing what little influence the gamma ray had. No harm done. A flip would only occur if enough particles hit the cell within the refresh window. That's of course possible, but rarer.
Contrast this with processor cache. On-die cache is most likely SRAM (static RAM), which doesn't get refreshed. So the slight voltage errors caused by gamma rays can slowly build up over time.
Perhaps this normally isn't an issue because, even though the cache is SRAM and doesn't get refreshed automatically, it gets "refreshed" by virtue of being a cache: as long as the processor is busy, the cache is constantly being rewritten with new cache lines.
But that won't hold true when the processor is asleep. The cache will be sitting idle, leaving it susceptible to accumulated charge. Thus the likelihood of a gamma flip is greatly increased.
All of that crude logic aside there's one caveat:
> The workaround was removed once the problem was fixed in microcode or in a later processor stepping.
So... either everything I said is a load of bollocks and this was actually a processor bug that some CPU engineer mistook for gamma flips, or maybe my theory is correct and they changed the CPU to occasionally wake up and "refresh" its cache automatically.
The mystery remains...
Static RAM is basically a flip-flop. It's a bistable circuit that's actively held in a stable state. Single-event upsets work by, essentially, putting the energy into the circuit required to make it transition into the opposite state, i.e. basically the same way the SRAM cell is written to.
I recommend you just take a look at Wikipedia or something for an explanation of SRAM. Each bit is typically implemented using six transistors, four of which form a loop of two inverters.
They are continuously powered, which causes the continuous refresh the parent was talking about.
Imagine it like two valleys with a hill between. Rolling a ball from the OFF valley to the ON valley requires some energy. If it's not enough the ball rolls back into the valley it's currently in.
The process is entirely analog, i.e. there is no refresh circuit that looks at the voltage and says "that's almost an ON, better freshen up the voltage." The output of the circuit is digital. (You can play with the R/S latches of most SRAM on an oscilloscope, and it's quite fun: the output of non-integrated SRAM will react in an analog fashion. If it's integrated, i.e. has a controller, this sadly isn't possible.)
Until you cross the threshold, the circuit will simply slide back to the original position, once you cross it, it'll slide into the new position without any additional effort.
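If it helps, you can even play with the valley picture numerically. A toy discrete-time model of the two-inverter loop (purely illustrative, not a real transistor model):

    /* Two cross-coupled inverters modeled as a steep inverting
       sigmoid. A small "particle strike" decays back into the same
       valley; a big one pushes the ball over the hill and the cell
       flips. */
    #include <math.h>
    #include <stdio.h>

    static double inv(double x) {           /* inverting sigmoid, gain 20 */
        return 1.0 / (1.0 + exp(20.0 * (x - 0.5)));
    }

    static double settle(double v1, double hit) {
        v1 -= hit;                          /* charge knocked off node 1 */
        double v2 = inv(v1);
        for (int i = 0; i < 50; i++) {      /* let the loop settle */
            v1 = inv(v2);
            v2 = inv(v1);
        }
        return v1;
    }

    int main(void) {
        printf("small hit: %.3f\n", settle(1.0, 0.3));  /* ~1.0, rolls back */
        printf("big hit:   %.3f\n", settle(1.0, 0.7));  /* ~0.0, flipped */
        return 0;
    }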
Anyway, what you're writing isn't wrong but I'd say misses the context of the conversation a bit :)
(Also, it's a great example of mansplaining...)
Tbh I find that accusation incredibly rude.
Your definition of "refresh" is unhelpfully specific and not particularly correct. The circuit that "looks at a voltage and freshens it up", also known as an amplifier, is just a transistor or pair of transistors.
The analogy presented is fairly accurate and easy to understand. GGGGGP talks about loops of inverters and continuous refreshing, the latter of which is invented terminology.
> The circuit is bistable which means that the system has two states in which, once reached, it will remain until some energy is expended to change that.
> Imagine it like two valleys with a hill between. Rolling a ball from the OFF valley to the ON valley requires some energy. If it's not enough the ball rolls back into the valley it's currently in.
If my attempt to make everything more digestible failed in your opinion, then I guess that's that; it's your opinion on the matter.
Sure it does, as long as you set the proper bias and use a definition of "fresh" that makes sense for digital signals.
> unlike a DRAM cell requires no clocking
Yes, that's why it's 'continuous'.
> You need two inverting amplifier
You don't need to use that design. It just happens to have reasonable size and leakage characteristics.
> more complex and isn't simply "looking at voltage to freshen in up"
That circuit is a hundred times simpler than the DRAM refreshing circuit you used those words to describe!
> I did disagree with the wording that made it sound like there is a refresh process that happens repeatedly.
Repeatedly continuous...? Nevermind though.
No; capacitors slowly leak and lose their charge. This is why DRAM remains powered when your computer sleeps, eternally refreshing.
Not as true anymore now that they support ARM, but... still kinda true.
We also had Transmeta. I bet all of them have some kinks just as different models from the same vendor have different bugs etc. that require special handling.
Meltdown/Spectre is just a recent and very visible artifact demonstrating this.
I did work there too, and to be clear, did not witness anything like this. Just a personal opinion/hunch.
You don't need ECC memory for that.
Not sure if it remains in active development though.
I've heard many stories about bit-flips causing serious problems at higher-elevation sites. Apparently a major installation at either NCAR or UCAR was delayed by a month fighting such problems. While I haven't actually confirmed any of these stories first hand, I've heard enough to believe that a little paranoia is justified.
Maybe I misremember. It was 20 years ago.
I agree; these days I'm unaware of any "serious" CPU where the data isn't protected by at least a parity bit on chip.
RISC-V: the spec mentions ECC, but in phrasing that makes it clearly optional. https://github.com/riscv/riscv-v-spec/blob/master/v-spec.ado...
"Many processors use error correction codes in the on-chip cache, including the Intel Itanium and Xeon processors, the AMD Athlon, Opteron, all Zen- and Zen+-based processors (EPYC, EPYC Embedded, Ryzen and Ryzen Threadripper), and the DEC Alpha 21264."
I prefer to always use ECC, plus layers of defenses where persistence of data is in question. Jeff Atwood's post does point to other options, assuming that validation is baked into the storage process and distributed in such a way as to identify and correct errors across a distributed infrastructure rather than a single system; the distributed nature means it can be more resilient, and validating data at rest / comparing results is arguably a higher level of integrity than ECC alone can provide. https://blog.codinghorror.com/to-ecc-or-not-to-ecc/
Do you interface two different clock domains (which is basically most things)? Guess what: all of your computing is built on the "chance" that bits won't flip.
Granted, the statistics make this pretty solid, but it kinda blew my mind when I first stumbled across it.
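There's even a standard formula for this: a synchronizer's mean time between metastability failures grows exponentially with the settling time you allow, MTBF = e^(t/tau) / (T0 * f_clk * f_data). A toy calculation (tau and T0 here are made-up placeholders; the real values are process-dependent):

    /* Why a two-flop synchronizer is "pretty solid": one extra clock
       period of settling time makes failures astronomically rare.
       tau and T0 are assumed values, not from a real datasheet. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double tau      = 50e-12;          /* resolution time constant (assumed) */
        double T0       = 1e-9;            /* metastability window (assumed)     */
        double f_clk    = 500e6, f_data = 100e6;
        double t_settle = 1.0 / f_clk;     /* one extra cycle to resolve */
        double mtbf = exp(t_settle / tau) / (T0 * f_clk * f_data);
        printf("MTBF: %.3g seconds\n", mtbf);   /* ~5e9 s, i.e. centuries */
        return 0;
    }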
Of course cosmic rays don't exactly leave a trace, so we will never know.
> At the Earth's surface approximately 95% of the particles capable of causing soft errors are energetic neutrons with the remainder composed of protons and pions.
I mean after all, did any superhero get their powers from muons?
May is most noted for having identified the cause of the "alpha particle problem", which was affecting the reliability of integrated circuits as device features reached a critical size where a single alpha particle could change the state of a stored value.