
For a brief period, the Windows kernel tried to deal with gamma rays - ingve
https://blogs.msdn.microsoft.com/oldnewthing/20181120-00/?p=100275
======
datenwolf
Invalidating the caches is kind of a cringe-inducing approach to this (actual)
problem. Especially in HPC, radiation-related single-event upsets have become a
real problem. If you do the math, all the silicon area devoted to memory
(DRAM, caches, registers) adds up, and what you've got is essentially a particle
detector.

Compared to the effective volume of a purpose-designed one (ATLAS, CMS,
Super-Kamiokande, etc.) it's rather small, but a particle detector nevertheless.

A couple of months / years ago, there was an article (also linked here on HN,
IIRC) that did a few back of the envelope calculations regarding expected
event rates. IIRC it was something on the order of 1 event per day per 10^12
transistors. (EDIT: not the one I thought of but blows the same horn:
[http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf](http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf) )
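
To make that back-of-the-envelope figure concrete, here is a minimal sketch. It assumes the recalled rate of roughly 1 event per day per 10^12 transistors and treats events as independent and uniform; the cluster size and transistor count are hypothetical numbers chosen only for illustration.

```python
# Back-of-the-envelope soft-error estimate.
# ASSUMPTION: the rate is the rough figure recalled in the comment above
# (about 1 event per day per 10^12 transistors), not a measured value.
EVENTS_PER_TRANSISTOR_DAY = 1 / 1e12


def expected_upsets(transistors_per_node: float, nodes: int, days: float) -> float:
    """Expected number of upsets, treating events as independent and uniform."""
    return EVENTS_PER_TRANSISTOR_DAY * transistors_per_node * nodes * days


# Hypothetical cluster: 10,000 nodes with ~10^11 transistors each (mostly DRAM).
per_day = expected_upsets(1e11, 10_000, 1)
print(f"about {per_day:.0f} expected upsets per day across the cluster")
```

Under those assumptions a single machine sees an event only every ten days or so, while the cluster as a whole sees on the order of a thousand per day, which is why the problem shows up first in HPC.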

Also radiation hardened software has been researched (and still is).
Essentially the idea is to not only have redundant, error correcting memory,
but also redundant, error correcting computation. NASA has some publications
on that, e.g.
[https://ti.arc.nasa.gov/m/pub-archive/1075h/1075%20(Mehlitz)...](https://ti.arc.nasa.gov/m/pub-archive/1075h/1075%20\(Mehlitz\).pdf)

~~~
Nokinside
Could physicists and astronomers use all this distributed particle detection
for research?

Like an app on a phone or desktop that sends reports back with location and
time information.

~~~
evanb
You may be interested to learn of arXiv:1510.07665, "Detecting particles with
cell phones: the Distributed Electronic Cosmic-ray Observatory" by
Vandenbroucke et al.

I don't know if any novel results have come out of this kind of thing.

[https://arxiv.org/abs/1510.07665](https://arxiv.org/abs/1510.07665)

~~~
privong
There was also this one, a bit earlier, "Observing Ultra-High Energy Cosmic
Rays with Smartphones".

[https://arxiv.org/abs/1410.2895](https://arxiv.org/abs/1410.2895)

I (thought I) signed up to be informed of beta releases, but never heard
anything. I just checked their website[0] and it mentions a beta app, but that
seems to just go to a signup page.

[0] [https://crayfis.io](https://crayfis.io)

------
spullara
If you don't believe in bit flips, try this!

[http://dinaburg.org/bitsquatting.html](http://dinaburg.org/bitsquatting.html)

I did that for a bit on cloudfront.net and got dozens of them in a short
amount of time.
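
For anyone who wants to see which names are one bit away from a given domain, here is a minimal sketch. The function name `bitsquat_variants` and the filtering rules are illustrative assumptions, not taken from the linked article: it flips each of the 8 bits of every character and keeps results that still look like hostname characters.

```python
import string

# Characters allowed in a DNS label (letters, digits, hyphen).
VALID = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(domain: str) -> list[str]:
    """Return every name that differs from `domain` by one flipped bit
    and is still made of plausible hostname characters."""
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue  # leave the label separators alone
        for bit in range(8):
            # DNS names are case-insensitive, so fold case after the flip.
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped in VALID and flipped != ch:
                variants.add(domain[:i] + flipped + domain[i + 1:])
    return sorted(variants)


for name in bitsquat_variants("fbcdn.net"):
    print(name)
```

Some of the variants (such as a `c` turning into a `b`) are also plausible keyboard typos, so a single-bit difference alone does not tell you which of the two happened.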

~~~
amichal
bitflips are also single char typos in ascii. I'm guessing that is what you
saw

~~~
detaro
I think the article gives good reasons to believe they are not typos, why do
you think otherwise?

~~~
amichal
While I like the sibling comment's network-errors answer better than mine, I
didn't see any reasoning other than an unfounded statement[1] in the article or
on a quick skim for "typo" in the referenced whitepaper:

[1] "These requests were not typos or other manually entered URLs"

~~~
eli
_"All of these requests used only four domains in the HTTP Host header, as
shown in Table 4. Three out of four domains contain more than one bit error,
ruling out a simple mistype of fbbdn.net for fbcdn.net."_

~~~
seandougall
Not to mention, the number of people typing "fbcdn.net" into a browser has to
be vanishingly small to begin with.

------
trelliscoded
Embedded people deal with this all the time. One class of solutions involves a
checker task running continuously, which verifies the integrity of the data
structures, kind of like a poor man's ECC. Really important code generally
does everything three times, so there's a tie breaker in case there's a
temporary fault in code or memory. I've seen this done with macros in ways
that result in pretty wild code, like running a computation three times,
storing each of those results three times, and then comparing the resulting
nine outputs three times. That was in a diving related application, so it's
not crazy to do all that work over and over since it had to be right.
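
The "do everything three times" idea is usually called triple modular redundancy. A minimal sketch of the voting step follows; the helper names (`vote`, `run_tmr`, `bitwise_majority`) and the fault-injecting `flaky_square` are hypothetical, and real embedded code would do this in C with macros rather than in Python.

```python
from collections import Counter


def vote(a, b, c):
    """Majority vote over three (hashable) results; raises if all three differ."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three results differ")
    return value


def run_tmr(compute, *args):
    """Run `compute` three times and return the majority result."""
    return vote(compute(*args), compute(*args), compute(*args))


def bitwise_majority(a: int, b: int, c: int) -> int:
    """Per-bit vote over three stored copies of the same word."""
    return (a & b) | (a & c) | (b & c)


# Demo: a computation whose second run suffers a simulated single-bit upset.
calls = {"n": 0}


def flaky_square(x: int) -> int:
    calls["n"] += 1
    result = x * x
    if calls["n"] == 2:
        result ^= 1 << 4  # simulated bit flip in the second result
    return result


result = run_tmr(flaky_square, 7)
print(result)
```

The vote only masks a fault that hits one of the three copies; two matching wrong answers would still win, which is presumably why the code described above stores and compares each result several times.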

Complex embedded systems like your cellphone's baseband processor usually just
give up at some point and suicide a task or even the whole OS if they detect a
problem. For a while I had a Qualcomm debugger attached to the internal cell
modem I had in a netbook I was working on, and the baseband crashed all the
time due to hardware faults. I thought I had a bad chip for a while until I
realized it never happened when I left it in an underground parking lot.

~~~
ZakNichol
This. Cache is the least of their worries in aerospace. It's common to see
satellite IC's dosed up so high on the ol' Gamma that the silicon MOSFET
junctions themselves start disintegrating.

------
myrandomcomment
A bunch of years ago Cisco had an issue with some RAM in a new switch model, I
think it was in the 65xx. They were crashing randomly but only in certain
places in the world. Cisco spent tons of money on this. No idea. They brought
in a physics professor. The devices with the most issues were located in
countries up near the Arctic Circle. Cosmic rays caused a bit flip in this
particular set of RAM due to something in its design. Sorry for the light
details, it's been years.

I also worked at a switch manufacturer. We had some ASICs from one of the big
companies. Had crashes that we could not explain at all. We knew it was not
us. Proved that bits were flipping in the switch ASIC. Turned out they had
forgotten to spec low-alpha solder. Alpha particles will not go through your skin,
but when it is layered right on to the chip....oops.

~~~
srcmap
I used to work for two switch chip companies also. I know both of them have
SW/logic in the switch chips to guard against random errors in SRAM - "Alpha
particles". I have seen detailed test reports that ran the system in a nuclear
lab and graphed the radiation level vs impact on the system
error/recoverable error rate.

I know big customers (such as Cisco) can ask for such test reports and they do
get them.

It would be a very interesting experiment to position some modern/old (7, 14, 28
nm chips) electronics (cell phones, Raspberry Pi + solar panels) at various
distances from a nuclear accident site, monitor them remotely, and classify and
log the failure rate over the months/years.

~~~
myrandomcomment
We would take the switches to the labs and have them "shoot" various types of
particles and other radiation sources at it to check resistance in both HW and SW.
Depending on what you did, sometimes it takes months to get the device back as
it has to "cool down".

It was pretty neat.

------
blattimwind
Way back in the late 90s IBM had a problem with alpha-source-contaminated
plastics in their SRAM chips. Those chips were used as caches in Sun SPARC
processor modules. IBM told some customers, but not Sun. This caused random
bitflips in the processor cache, leading to assorted failures and crashes in
what was supposed to be reliable UNIX servers.

------
fpgaminer
So ... here's what I'm thinking, as a complete layman with respect to how
radiation affects memory devices. RAM is DRAM, i.e. dynamic RAM. It has to get
automatically refreshed relatively frequently.

So, maybe (again, me being a layman) what happens is that usually gamma rays
hit a DRAM cell, but haven't imparted enough energy to cause a flip. A
millisecond later the cell gets refreshed erasing what little influence the
gamma ray had. No harm done. A flip would only occur if enough particles hit
the cell within the refresh time frame. That's of course possible, but more
rare.

Contrast this with processor cache. On-die cache is most likely SRAM, Static
RAM. It doesn't get refreshed. So the slight voltage errors caused by gamma
rays can slowly build up over time.

Perhaps this normally isn't an issue, because even though the cache is SRAM
and doesn't get refreshed automatically, it'll get "refreshed" by virtue of
being cache. i.e. as long as the processor is busy the cache is constantly
getting re-written with new cache lines.

But that won't hold true when the processor is asleep. The cache will be
sitting idle, making itself susceptible to accumulated charges. Thus the
likelihood of a gamma flip is greatly increased.

All of that crude logic aside there's one caveat:

> The workaround was removed once the problem was fixed in microcode or in a
> later processor stepping.

So ... either everything I said is a load of bollocks and actually this was a
processor bug that some CPU engineer mistook as gamma flips, or maybe my
theory is correct and they changed the CPU to occasionally wake up and
"refresh" its cache automatically.

The mystery remains...

~~~
blattimwind
> Contrast this with processor cache. On-die cache is most likely SRAM, Static
> RAM. It doesn't get refreshed. So the slight voltage errors caused by gamma
> rays can slowly build up over time.

Static RAM is basically a flip-flop. It's a bistable circuit that's actively
held in a stable state. Single-event upsets work by, essentially, putting the
energy into the circuit required to make it transition into the opposite
state, i.e. basically the same way the SRAM cell is written to.

~~~
russdill
In layman's terms, it's being continuously refreshed.

~~~
throwaway2048
No it's not, that is what makes it fundamentally different than DRAM.

~~~
atq2119
Yes, it is. That's what makes it fundamentally different from DRAM, which
isn't being continuously refreshed, which is why the memory controller has to
manually refresh DRAM at frequent intervals. DRAM has much simpler cells, at
the cost of more complex control logic.

I recommend you just take a look at Wikipedia or something for an explanation
of SRAM. Each bit is typically implemented using six transistors, four of
which form a loop of two inverters.

They are continuously powered, which causes the continuous refresh the parent
was talking about.

~~~
zaarn
There is no continuous "refresh". The circuit is _bistable_ which means that
the system has two states in which, once reached, it will remain until some
energy is expended to change that.

Imagine it like two valleys with a hill between. Rolling a ball from the OFF
valley to the ON valley requires some energy. If it's not enough the ball
rolls back into the valley it's currently in.

The process is entirely analog, ie, there is no refresh circuit that looks at
the voltage and says "that's almost an ON, better freshen up the voltage". The
output of the circuit is digital. (You can play with the R/S latches of most
SRAM on an oscilloscope and it's quite fun, the output of non-integrated SRAM
will react in an analog fashion. If it's integrated, ie has a controller, this
is not possible sadly)

Until you cross the threshold, the circuit will simply slide back to the
original position, once you cross it, it'll slide into the new position
without any additional effort.
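
The two-valleys picture can be played with numerically. The following is a toy model of the analogy only, not a circuit simulation: it assumes a made-up double-well potential V(x) = (x^2 - 1)^2 with the OFF valley at x = -1, the ON valley at x = +1, and the hilltop at x = 0, and lets an overdamped "ball" roll downhill after a kick.

```python
def settle(x: float, steps: int = 5000, dt: float = 0.01) -> float:
    """Let an overdamped ball roll downhill in V(x) = (x^2 - 1)^2.

    Valleys sit at x = -1 (OFF) and x = +1 (ON); the hilltop is at x = 0.
    """
    for _ in range(steps):
        x -= dt * 4 * x * (x * x - 1)  # Euler step of dx/dt = -dV/dx
    return x


OFF = -1.0
small_kick = settle(OFF + 0.8)  # kick stops short of the hilltop
large_kick = settle(OFF + 1.2)  # kick carries the ball past the hilltop

print(f"small kick settles at {small_kick:+.3f}")
print(f"large kick settles at {large_kick:+.3f}")
```

The small kick slides back to the valley it started in and the large one ends up in the other valley, with nothing in the loop "deciding" to refresh anything.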

~~~
atq2119
That seems awfully nitpicky in this context. Look at russdill's comment, then
look at the throwaway, which is reasonable to interpret as stating that DRAM
is being continuously refreshed - which it isn't, it happens at discrete
intervals.

Anyway, what you're writing isn't wrong but I'd say misses the context of the
conversation a bit :)

(Also, it's a great example of mansplaining...)

~~~
zaarn
I'm not mansplaining, I don't even know who you are. You don't know who I am.
I'm attempting to make the comment digestible for a broader audience and not
only for you. You're not alone on this website. I'm sorry for doing that then.

Tbh I find that accusation incredibly rude.

~~~
Dylan16807
The explanation you were replying to was more digestible than yours.

Your definition of "refresh" is unhelpfully specific and not particularly
correct. The circuit that "looks at a voltage and freshens it up", also known
as an amplifier, is just a transistor or pair of transistors.

~~~
blattimwind
> The explanation you were replying to was more digestible than yours.

The analogy presented is fairly accurate and easy to understand. GGGGGP talks
about loops of inverters and continuous refreshing, the latter of which is
invented terminology.

> The circuit is bistable which means that the system has two states in which,
> once reached, it will remain until some energy is expended to change that.

> Imagine it like two valleys with a hill between. Rolling a ball from the OFF
> valley to the ON valley requires some energy. If it's not enough the ball
> rolls back into the valley it's currently in.

~~~
Dylan16807
The analogy is an okay description of part of what happens, but not why. It
also gives a pretty misleading idea of what happens to the voltages. It would
be much better if it was combined with a simple explanation of how the
inverters actually behave, which would only take a few words. This is why the
previous post mentioned them and said to check the wikipedia page.

------
xpaulbettsx
I found this code as an intern at Microsoft, while the manufacturer is hidden
in the post, I'll give you a clue - the company starts with "I" and ends with
"ntel"

~~~
worldlinx
Unless you know what you are doing, you might want to delete that. Probably
goes against the NDA you've signed, no?

~~~
asveikau
It's pretty obvious that when people at Microsoft say "processor vendor" it's
a euphemism for "Intel".

Not as true anymore now that they support arm but... Still kinda true.

~~~
tjoff
You do know that the 64 bit version of Windows XP was made for AMD processors
before intel had any.

We also had Transmeta. I bet all of them have some kinks just as different
models from the same vendor have different bugs etc. that require special
handling.

Meltdown/Spectre is just a recent and very visible artifact that would
demonstrate this.

~~~
asveikau
Yeah I know. But I think at MS they work closely with Intel and think of other
vendors as not as important.

I did work there too, and to be clear, did not witness anything like this.
Just a personal opinion/hunch.

------
dekhn
If you have a large enough fleet, and log your ECC errors, you have actually
built a not-very-sensitive and very expensive scientific instrument- a cosmic
ray detector. Physics is awesome.

~~~
pavel_lishin
An app was created to use smartphones for this very purpose!

[https://news.wisc.edu/physicist-turns-smartphones-into-pocke...](https://news.wisc.edu/physicist-turns-smartphones-into-pocket-cosmic-ray-detectors/)

~~~
ezoe
Duct tape the smartphone camera lens and keep observing the camera feed,
hoping the cosmic ray hit the camera.

Doesn't need ECC memory for that.

------
notacoward
To answer the question in the OP: yes, the processor cache might be more
susceptible than RAM, if the RAM is ECC.

I've heard many stories about bit-flips causing serious problems at higher-
elevation sites. Apparently a major installation at either NCAR or UCAR was
delayed by a month fighting such problems. While I haven't actually confirmed
any of these stories first hand, I've heard enough to believe that a little
paranoia is justified.

~~~
blattimwind
Internal data buses and caches are typically ECC'd. In fact, DRAM and the
memory bus are most likely the only non-ECC parts in desktop computers.

~~~
bboreham
The way I remember it, when this happened on our Sun E10Ks, the RAM was ECC
but the cache was not.

Maybe I misremember. It was 20 years ago.

~~~
mjevans
That was my thought process as well, that an early silicon spin / chip wasn't
ECC because it would change state frequently and the circuit budget wasn't
worth a soft error; but the external RAM might be ECC because a longer running
state might not have such a refresh/detection of error.

I agree that these days I'm unaware of any "serious" CPU where the data isn't
protected by at least a parity bit on chip.
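
For readers who have not seen the difference between a parity bit (detects a single flip) and ECC (can also correct it), here is a toy sketch using the textbook Hamming(7,4) code. This is an illustration only: real caches and DIMMs use wider codes, and nothing here is taken from any particular CPU.

```python
def parity(bits):
    """Single even-parity bit: detects one flipped bit but cannot locate it."""
    p = 0
    for b in bits:
        p ^= b
    return p


def encode(data):
    """Hamming(7,4): 4 data bits -> 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(code):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, or 0
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1  # simulate a single-event upset in the stored codeword
print(decode(stored) == word)
```

Two flips in the same codeword would defeat this small code, which is one reason practical schemes add an extra overall parity bit.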

RISC-V : spec mentions ECC, but in a phrasing that makes it clearly optional.
[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.ado...](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc)

"Many processors use error correction codes in the on-chip cache, including
the Intel Itanium and Xeon[28] processors, the AMD Athlon, Opteron, all
Zen-[29] and Zen+-based[30] processors (EPYC, EPYC Embedded, Ryzen and Ryzen
Threadripper), and the DEC Alpha 21264.[23][31]"
[https://en.wikipedia.org/wiki/ECC_memory#Cache](https://en.wikipedia.org/wiki/ECC_memory#Cache)

I prefer to always use ECC and layers of defenses where persistence of data is
in question. Jeff Atwood's reference does point to other options, assuming
that validation is baked in to the storage process and distributed in such a
way as to identify and correct errors across a distributed infrastructure
rather than a single system; the distributed nature means it could be more
resilient and the validation of data at rest / comparing results is arguably a
higher level of integrity than just ECC can provide.
[https://blog.codinghorror.com/to-ecc-or-not-to-ecc/](https://blog.codinghorror.com/to-ecc-or-not-to-ecc/)

------
rubenbe
Bit flips are real. I used to see them on my (admittedly low end) webserver.
Eg. There were occasional errors like "myOfunction not found". A quick glance
on an ASCII table shows that the original function name "my_function" is indeed
one bitflip away (0x4F vs 0x5F)
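
The one-bit claim is easy to check. A minimal sketch (the helper name `bit_distance` is made up for this example) that counts differing bits between two equal-length ASCII strings:

```python
def bit_distance(a: str, b: str) -> int:
    """Number of differing bits between two equal-length ASCII strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(
        bin(x ^ y).count("1")
        for x, y in zip(a.encode("ascii"), b.encode("ascii"))
    )


# '_' is 0x5F and 'O' is 0x4F, so they differ only in bit 4 (0x10).
print(hex(ord("_") ^ ord("O")))
print(bit_distance("my_function", "myOfunction"))  # prints 1
```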

~~~
jandrese
If you were seeing errors like that with any regularity there's almost
certainly a hardware issue on your server. Cosmic ray bitflips are really
rare, and tend to be basically invisible until you're monitoring an entire
datacenter.

------
weberc2
When I was fresh out of college, I worked as a contractor for a prominent
agricultural equipment manufacturer. I was responsible for building out the
touch-screen interface for the radio (a Qt app). I was told by an engineer who
worked for the equipment manufacturer that my application wasn't good enough
because it needed to be able to operate correctly in the face of arbitrary bit
flips "from lightning strikes" -- I kindly asked her to show me the
requirements, which was sufficient to get her to relent, but that was still the
wackiest request I've ever received.

~~~
russdill
The requirement request is a great way to push back on feature creep. There's
a lot of cargo culting that goes on in the "protection against bit-flips". You
sometimes have to go a step further and ask what error rate are you required
to be below. Once you have that number, you can start asking what your current
error rate is without mitigations, and how much a given mitigation will reduce
your error rate.

~~~
vvanders
My favorite entry in that problem space is metastability[1].

Do you interface two different clock domains (which is basically most things)?
Guess what: all of your computing is built on the "chance" that bits won't
flip.

Granted, statistics make this pretty solid but kinda blew my mind when I first
stumbled across it.

[1]
[https://en.wikipedia.org/wiki/Metastability_(electronics)](https://en.wikipedia.org/wiki/Metastability_\(electronics\))

~~~
russdill
Yup, a large portion of hardware design is based on getting below a required
maximum failure rate. For metastability, you just keep adding more flip-flops.
BTW, the cache invalidation request may be due to this. They figured they
could more easily reach their time between failure interval if they could
discount time during S1.
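
To see why "just keep adding more flip-flops" works, here is a sketch using the usual textbook synchronizer model, MTBF = exp(t_res / tau) / (T_w * f_clk * f_data). All parameter values below are hypothetical, and the model assumes each extra stage buys roughly one more clock period of resolution time.

```python
import math


def synchronizer_mtbf(stages: int, f_clk: float, f_data: float,
                      tau: float, t_w: float) -> float:
    """Mean time between metastability failures, in seconds.

    stages : flip-flops in the synchronizer chain (2 or more)
    f_clk  : receiving clock frequency in Hz
    f_data : rate of asynchronous data transitions in Hz
    tau    : regeneration time constant of the flip-flop in seconds
    t_w    : metastability window of the flip-flop in seconds
    """
    period = 1.0 / f_clk
    t_res = (stages - 1) * period  # time available to resolve (setup ignored)
    return math.exp(t_res / tau) / (t_w * f_clk * f_data)


# Hypothetical numbers, chosen only to make the trend visible.
params = dict(f_clk=200e6, f_data=10e6, tau=200e-12, t_w=100e-12)
mtbf2 = synchronizer_mtbf(2, **params)
mtbf3 = synchronizer_mtbf(3, **params)
print(f"2 stages: {mtbf2:.3g} s, 3 stages: {mtbf3:.3g} s")
```

Because the resolution time sits in the exponent, each added stage multiplies the MTBF by a large factor, so a required failure rate can be met by counting stages.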

------
yongjik
I once saw a postmortem where a server process mysteriously tried to delete
the whole data set (fortunately no actual data was lost). After much confusion,
the conclusion was that a cosmic ray flipped a single bit in a register, making
it point to 8 bytes past the correct address in a C++ virtual function table. As
a result, instead of calling UpdateRow(), the process executed DeleteTable().
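
A toy model of that failure, assuming 8-byte function pointers and a hypothetical table layout in which the two methods happen to occupy adjacent slots (the real class layout is not known):

```python
SLOT_SIZE = 8  # bytes per function pointer on a 64-bit machine


def update_row() -> str:
    return "UpdateRow"


def delete_table() -> str:
    return "DeleteTable"


# Hypothetical virtual function table: adjacent slots, 8 bytes apart.
VTABLE = [update_row, delete_table]


def call_at(byte_offset: int) -> str:
    """Dispatch through the table the way an indirect call would."""
    return VTABLE[byte_offset // SLOT_SIZE]()


intended = 0                  # byte offset of UpdateRow's slot
corrupted = intended ^ (1 << 3)  # one flipped bit: bit 3 has value 8

print(call_at(intended), "->", call_at(corrupted))
```

Flipping bit 3 adds exactly 8 only when that bit was clear beforehand, so the flip lands on the next slot here because the intended offset was 8-byte aligned with bit 3 unset.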

Of course cosmic rays don't exactly leave a trace, so we will never know.

------
epaulson
I've heard stories from the supercomputing folks about trying to put their
machine rooms underneath parking structures, to get the added protection of
layers and layers of concrete overhead.

------
herogreen
Could it be that the manufacturer was aware of a bug and chose to circumvent
it by using gamma rays as a pretext?

~~~
JdeBP
That's the implication of the mentions of processor steppings, that almost
everyone is ignoring in favour of discussing a more exciting subject. (-:

------
kurtisc
Microsoft allow commented-out code in their kernel?

------
nategri
A pedantic point (good thing I'm on HN), but I wonder if they didn't actually
mean muons and not gamma rays?

~~~
21
Apparently 95% of "bit flipping cosmic rays" are neutrons:

> _At the Earth's surface approximately 95% of the particles capable of
> causing soft errors are energetic neutrons with the remainder composed of
> protons and pions._

[https://en.wikipedia.org/wiki/Soft_error#Cosmic_rays_creatin...](https://en.wikipedia.org/wiki/Soft_error#Cosmic_rays_creating_energetic_neutrons_and_protons)

~~~
planteen
Sounds about right. I've been a part of a radiation test giving a SoC gamma
exposure for total irradiating dose measurement. I didn't see a single upset
the whole test.

------
nobrains
Relevant:
[https://en.wikipedia.org/wiki/Timothy_C._May](https://en.wikipedia.org/wiki/Timothy_C._May)

May is most noted for having identified the cause of the "alpha particle
problem", which was affecting the reliability of integrated circuits as device
features reached a critical size where a single alpha particle could change
the state of a stored value

------
baybal2
For people concerned, discover the wonderful thing called 8T SRAM

------
discoball
Maybe the cache should be made using Silicon on Sapphire for radiation
hardening? But I am not sure if multi-process silicon fabrication is viable.

------
effnorwood
It did not

