Compiling an application for use in highly radioactive environments (stackoverflow.com)
507 points by ingve on April 27, 2016 | 104 comments


Another place this use case comes up is supercomputing. Not because of the unusual levels of radioactivity, but because of unusual numbers of processors. When you have more components, a single bit flip somewhere is increasingly likely. Resilience has been a research area in HPC for a while now, and people have looked at fault tolerant algorithms, redundancy schemes, faster checkpointing, and other ways to make sure your HPC application running on a million cores won't die because one core gets a bit flip.

So far this has luckily remained a research area in practice, because the vendors tend to do a good job of hardening their machines against errors as they get larger (most of the gloom and doom predictions take the error rates of current hardware and extrapolate). It will be interesting to see if it remains that way.

Relevant article: http://superfri.org/superfri/article/view/14


Same reason RAID-5 doesn't work for large disks. If you have a dozen 3TB drives in an array and one dies, the probability that you'll get a read error on another drive during the rebuild is quite high.

Given enough bits, random errors that individually are almost impossible become almost a certainty.


What level of RAID would you use for an array involving a dozen 3TB drives?

If the drives can't be trusted, then you'd need some kind of error correction in software, I'd presume. I thought the drives should auto-correct any read errors.

I'm not talking about in a radioactive environment, just normal usage.


IEEE spectrum had a nice article very recently http://spectrum.ieee.org/computing/hardware/how-to-kill-a-su...


I remember learning in my embedded programming class about state machines and how certain states cannot be reached.

In the lab, I accounted for every possible valid state with a transition out to another state.

I left my microcontroller running when I went to get lunch, came back and it was frozen. I did a dump, and discovered it was idling in a state that has no entry into it, and because I had no exit, it would just sit there.

I 'fixed' the problem by adding a transition to reset the controller from every state that shouldn't be reached.


This is considered a minor security feature. When a program unexpectedly finds itself in a routine, or a routine ends abruptly, it's possible it was a code jump, or the after-effects of a payload... Neither of which are good things.


If you use a switch(), have a default:, even if it's seemingly impossible.


and print "call Brian" in it (anecdote from an HN thread a month ago): https://news.ycombinator.com/item?id=11397251
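A minimal sketch of that pattern in C; the state names and the `reset_controller` hook are made up for illustration:

```c
#include <stdlib.h>

typedef enum { STATE_IDLE, STATE_RUN, STATE_STOP } state_t;

/* Hypothetical recovery hook -- on a real MCU this might starve the
 * watchdog or jump to the reset vector. Here it just counts. */
static int reset_count = 0;
static void reset_controller(void) { reset_count++; }

state_t next_state(state_t s) {
    switch (s) {
    case STATE_IDLE: return STATE_RUN;
    case STATE_RUN:  return STATE_STOP;
    case STATE_STOP: return STATE_IDLE;
    default:
        /* "Impossible" state, e.g. after a bit flip: recover
         * instead of idling there forever. */
        reset_controller();
        return STATE_IDLE;
    }
}
```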


Yes - in hardware where faults are expected to be rare it's often a good idea to just detect that something went wrong. If you know something bad happened you can choose to re-run. (E.g. restart using a watchdog timer)

Silently jumping back to a default state is sometimes useful, but could result in unpredictability or perhaps silent data corruption for some critical applications.
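One common shape for the watchdog approach is to invert the logic: the main loop only services the watchdog when its self-checks pass, so corrupted state leads to a reset rather than silent continuation. In this sketch the kick "register" and magic value are stand-ins, not a real part's interface:

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-in for a memory-mapped watchdog kick register; the magic
 * value is hypothetical -- real parts differ. */
static volatile uint32_t WDT_KICK;
#define WDT_MAGIC 0xA5A5A5A5u

static bool self_check_ok(void) {
    /* e.g. verify a known RAM pattern, checksum critical state... */
    return true;
}

void main_loop_iteration(void) {
    if (self_check_ok()) {
        WDT_KICK = WDT_MAGIC;  /* only kick if state looks sane */
    }
    /* ... do real work; if a fault corrupts control flow or state,
     * the kick is skipped and the watchdog resets the system */
}
```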


So the use case here is particularly interesting.

Most of the answers are talking about spaceflight, because it is one of the only radioactive environments that has a particle distribution that can be effectively fought.

Most earthbound environments have particle distributions that are practically unsolvable in software; they should instead be addressed with mechanical, noise-tolerant analogue methods of sensing and manipulation, plus massive shielding of your computation (normally by controlling from a different building).

*Note: by particle distribution I am referring to both rate and charge.


>> Most of the answers are talking about spaceflight...

Another place to look is safety-critical systems: IEC 61508 in general, and the automotive variant ISO 26262. There are several dual-core processors available now that run two copies of the code and check for any errors. That doesn't tell you which one failed, but it does catch the error. There are methods defined for creating fault-tolerant systems too.

We've been doing this shit for years - your electric steering system, brakes, even throttle control (with certain exceptions of course ;-)

The funny thing in cars is that most of those systems have mechanical backups or consider "shut down" an undesirable but reasonable failure (vs steering left or dumping brake fluid). Your car can coast to the side of the road under driver control without any power. All this self driving stuff will require fully redundant components for some of these systems (including LIDAR or cameras) to really be safe.


I once read that NASA control engineers have three independent teams code up three versions of their guidance systems. If the systems disagree, they go with the majority vote.


The jargon for this practice is multiversion programming. It's based on the idea that different people will make different errors when implementing a design. However, in practice we find that errors are actually moderately or even strongly correlated between different programmers. So this practice is rather uncommon.
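For a single output word, the 2-out-of-3 vote itself can be done bitwise; a sketch:

```c
#include <stdint.h>

/* Two-out-of-three bitwise majority voter: each output bit takes
 * the value that at least two of the three inputs agree on. */
uint32_t vote3(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}
```

If one input is arbitrarily wrong, the other two still determine every bit of the result.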


It's one of the techniques I used to recommend, although for anti-subversion rather than safety. I was worried that what you said might turn out to be true. My solution and hypothesis is that using three very different languages should counter that effect. It's hard to imagine the same error happening in PreScheme, SPARK, and C.


IIRC, the Space Shuttle had four identical computers: three ran the same (full-featured) software and detected errors with a voting algorithm; the fourth ran different (minimally-featured) software developed by a separate team as a final backup.

The fourth computer's software was only sufficient to abort the mission and return the shuttle to the Earth. IIRC, they never had to use it.


"Every ship has three AIs. Due to the radiation and interference, all three suffer lapses in sanity. They're encased in lead, but the sensors are mostly wide open to radiation. Often, they 'hallucinate' from observing the outside world with faulty sensors, so they often vote as to whether or not an input is actually real.

"A fourth AI is purposefully kept dormant. He's a bit single minded and his only purpose is to find a planet and touch down. If the other three ever can't agree or have a moment of clarity in which they realize they've become unstable, they deactivate and activate him. He regularly reloads from scratch, forgetting his previous incarnations and spends a majority of time validating the coordinates his previous incarnation left for him, making course adjustments, and ensuring the humans don't prevent him from saving them..."

I would read that book...


Maybe you should write it instead. Then we can read it. :)


Interesting to compare Frank Herbert's Destination: Void:

The crew are just caretakers: the ship is controlled by a disembodied human brain, called "Organic Mental Core" or "OMC", that runs the complex operations of the vessel and keeps it moving in space. But the first two OMC's (Myrtle and Little Joe) become catatonic, while the third OMC goes insane and kills two of the umbilicus crew members. The crew are left with only one choice: to build an artificial consciousness that will enable the ship to continue. The crew knows that if they attempt to turn back they will be ordered to abort (self destruct).

https://en.wikipedia.org/wiki/Destination:_Void


There's an old saying that goes something like, "When going to sea, take one clock or three—never two." The idea being if you bring two and they disagree, you won't know which one is right.


Every Airbus since the A320 uses the same system in its fly-by-wire design as well, plus several fallback mechanisms where other computers can take over a failed computer's work or augment it. For example, the ELAC controls the ailerons and the SEC controls the spoilers; if the ELAC fails, the SEC can take over and provide roll control via the spoilers, although in a limited way (and those changes also come with a change to the plane's flight envelope).


I can't remember if this is true of Airbii or if it's Boeings I'm thinking of, but I remember reading a while back that the three microcontrollers they run on are also from different manufacturers.


No idea about Boeing, but yes for Airbus. Every "computer" is actually two computers: one is the active computer (COM, command) and one is the inactive computer (MON, monitoring). Both still perform the same calculations based on the same input; however, they use different software and hardware. There is a watchdog in between the two that verifies the results against each other, just in case there is a bug in the hardware or software. Then, you also have multiples of these computers, e.g. there are two ELACs and three SECs. The ELACs and SECs are fed data from different air data inertial reference units (ADIRUs) and use different hydraulic lines to actuate the flight controls. And lastly, the results the ELACs and SECs come up with also have to agree with each other, or the result is thrown out.

All of that redundancy makes it possible to build some really robust flight envelope systems that keep the airplane within safe margins.

* I should note that all of this applies to the A320 family, the systems have been developed even further in recent years. For example with the A350 Airbus made some steps towards allowing the flight computers to be used in Simulators so that the same software/hardware as on the real plane can be used.


I'm a tad late, but I'm curious about where you said

> they use different software and hardware

Does this mean different architectures? If it does my respect for the redundant-hardware approach just went through the window.

Also, how is the watchdog redundant? I can't imagine there's only one; how does this work? Are both watchdogs somehow wired in parallel, are they cross-connected to each other, or...?


It literally means two completely different architectures, on two physically disconnected computers. Here is a diagram: https://i.imgur.com/Tj0GKbQ.png

Also visible in the diagram is that each side has its own watchdog, both connected to each other. The way this whole thing works is fail-safe: if one computer fails, the backup can jump in, and if that fails too, the flight controls will either retract or stop in their current position, depending on what makes the most sense. It's also mirrored, so for example if spoiler 2 on the right wing fails and is retracted, spoiler 2 on the left wing will also retract.

Here is a description about the flight controls and how pilot input gets passed through to the control surfaces: http://www.smartcockpit.com/docs/A320-Flight_Controls.pdf

And here is a general overview about the architecture: http://www.skybrary.aero/bookshelf/books/2313.pdf


Oh, wow, that's amazing. Now I understand why avionics are so expensive - verifying the correctness of such a system sounds like a lot of "fun," or at least a lot of time.

(I wonder if there are any systems built on multiple architectures where each unit is itself a redundant system with CPUs in lockstep.....)

Am I to intuit from this diagram that the watchdog watches all the components - power, I/O, memory, and CPU? That's very impressive. Or does it watch a central bus/backplane everything is connected to?

Also, how does either side decide/figure out the other side has failed? Simply deciding that the other half is wrong if it doesn't match this half's output could fail catastrophically if one of the sides reaches this conclusion after entering an invalid state (ie, it's the other side that is correct, and this side is wrong).

I'm also mildly curious as to why the I/O on one side has two connections to the actuators, while the other has only one.


What do they do if all three disagree?


That will never happen. It would take the simultaneous failure of two independent systems, each of which is highly reliable. If the MTBF (mean time between failures) of one component were, let's say, ten years of continuous use, then each has a 1/87600 chance of failing in any given hour.

The odds of a simultaneous (within one hour) double failure, counting either ordering of the two failures, are roughly 2 × (1/87600)², or about 1/3836880000 per hour. This corresponds to an MTBF of roughly 3836880000 hours, or 438000 years.
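Counting either ordering of the two failures, the arithmetic can be sanity-checked in a few lines (the 10-year single-unit MTBF is the figure assumed above):

```c
static const double HOURS_PER_YEAR = 24.0 * 365.0;  /* 8760 */

/* Per-hour failure probability for a unit with the given MTBF in years. */
double per_hour_failure(double mtbf_years) {
    return 1.0 / (mtbf_years * HOURS_PER_YEAR);
}

/* MTBF, in years, of both units failing within the same hour;
 * the factor of 2 counts either unit failing first. */
double double_failure_mtbf_years(double mtbf_years) {
    double p = per_hour_failure(mtbf_years);
    return 1.0 / (2.0 * p * p) / HOURS_PER_YEAR;  /* ~438,000 for 10 */
}
```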


1000 of the same model flying 12 hours a day (utilisation is nearer 11[1], and some airliners were made in far greater numbers than 1000; over 8000 737s were built[2]) for 20 years is 10000 flight-years (very crudely).

438000/10000 is 43.8, so 2.3% chance over 20 years.

It's a longshot but it's not never.

[1] http://web.mit.edu/airlinedata/www/2014%2012%20Month%20Docum...

[2] https://en.wikipedia.org/wiki/List_of_most-produced_aircraft

So yes the odds of any one plane experiencing that problem are absolutely tiny but across the fleet not so much.


But compared to the other failure modes, still extremely unlikely. It's not really worth trying to prevent an error that happens to the whole fleet once every 400 years when you could work on fixing other problems that cause plane crashes much more frequently.


Agreed but I never said anything about engineering priorities, my observation was that unlikely events happen at scale.


I understand 'wear and tear' failure is extremely unlikely to strike simultaneously within the same flight, but what about intentional disruption? Is that possible or has it been explored?


If the bad guys can disrupt one clock, they can probably disrupt all the clocks you have, whether that's three or twenty.


If there is no mode, take the median for any numeric or otherwise orderable values. For non-orderable values, let the AIs pass around a single "I'm correct this time" token. Multiple simultaneous faults aren't going to be common enough to think up fancy recovery modes for them.
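A median-of-three sketch for the numeric case; one arbitrarily faulty input cannot move the result outside the range of the other two:

```c
/* Median of three readings: the middle value survives one bad input. */
double median3(double a, double b, double c) {
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}
```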


Posts like these are what bring me to Hacker News.


Another interesting case apart from spaceflight where radiation hardness is important: Particle accelerators.

CERN for example has a lot of material, a quick search leads to e.g. https://lhcb-elec.web.cern.ch/lhcb-elec/html/rad_hard_links.... or the Radiation to Electronics workgroup https://r2e.web.cern.ch/R2E/


I'm reminded of an old game called Core War, where the object is to write code that corrupts your opponent's code while simultaneously protecting your own code. I wonder if people who are good at that game would be the best candidates for this type of fault tolerant programming?


Israel has had its own version, called "codeguru extreme", that's been going on for the last 11 years. The rules are significantly different, but the idea and inspiration come from Core War. (Sorry, Hebrew only: http://www.codeguru.co.il/xtreme/about.htm)

It's sponsored by IBM, a large Israeli college, a technological high-school chain, and several Israeli military technology units (air force, communications, cyber). So yeah, that's probably a really good indicator you might be on to something.



Doing any modification to the source code to prevent this is not a good design. C/C++ is not designed for this use case, so trying to work around it will create a mess, leading to problems similar to what premature optimization does to code. Instead, the solution should be at the hardware level, as in airplanes: put three computers in the radiation environment, then add a fourth computer that analyzes the results from the three and acts only when 3 out of 3 or 2 out of 3 computers agree (or apply an average/median, depending on the task). Ideally the fourth computer should be placed in a non-radioactive environment; even if that's not possible, this still simplifies the problem somewhat, because only the final fourth computer, which collects the data and judges its trustworthiness, has to deal with the faults, not the rest of the code.


Not so fast: it's possible to have library-level fault tolerance, or OS-level fault tolerance, that is implemented in software, without having to change the source code much or at all.

My perspective on this is informed by work on ABFT (for more on that, see https://www.computer.org/csdl/trans/tc/1984/06/01676475.pdf, and the literally 1000 later papers citing it) -- you can design a version of the basic linear algebra subroutines that have fault-tolerance built in, and use them without changing program source code.

For codes that spend a lot of time doing numerical computations (e.g., preliminary data reduction in a spacecraft on-board computer), ABFT is an interesting option.


To clarify, each piece of the puzzle must do what it is expected to do; otherwise it will be hacky. A C++ program is not expected to detect problems in its underlying hardware; there are no tools for that. So any such solution will look and feel like a hack, a temporary workaround. The computer as a whole, however, is expected to occasionally produce buggy results; that's normal, and there are many existing design solutions that take it into account and work around it. That's why you take three computers and judge their output: then each piece of the puzzle does what it is intended and expected to do, and doesn't take on responsibilities outside its own.


Sorry if this looks like flooding, but to add to that: having the program self-check also unnecessarily ties it to its use case in the radiation environment. It creates an unnecessarily strong dependency on its execution environment, and that goes against all the good principles of design. It's like having the clock software in a microwave oven be aware that it is placed in a microwave oven.


Good principles of software design are simply different in different environments.

The microwave clock knows it's a microwave because it controls the microwave. It would be wasteful to put extra layers of abstraction in such a small system.

Likewise, a system designed for radiation tolerance will expect a close coupling between software and hardware; the whole point of the software in that case is the hardware.


If you want to address the problem at the software level, you could probably write a compiler for this. But that would be a massive undertaking...


Maybe not as massive as it seems. You could compile your code into llvm IR, work on that and compile to native code with llvm.


As far as I can see, no one has mentioned fluidic[0] computation so far. I am not aware of a fluidic processor that can run C++-compiled code, but... if you want hardware that can withstand 1000 degrees Celsius, there's pretty much no alternative that I'm aware of.

[0] https://en.wikipedia.org/wiki/Fluidics


I write code that is executed in rad environments.

The first thing I learned is that any plan I come up with to help mitigate risks always ends up being proven less effective than I initially thought. The idea that the hardware can flip bits or latch up behind your back is rarely a consideration for most software design work, so it really requires an entirely new way of thinking. In the end, it's all about reducing probabilities, but sometimes your mitigation techniques can be borderline useless. This could be because of my relative lack of experience, though. I'm sure there are people who have been doing this a while and have converged on a design process that works well.

Couple this with the fact that code in the failure path very rarely gets actually tested with _real_ failures during design, and you end up with situations where your entire test/precaution/recovery technique is rendered useless by a single oversight.

It kind of reminds me of how some bootloaders perform a DRAM test during power-up. The only problem is that these tests usually execute after the bootloader is already running from RAM itself. Sure, it's better than nothing and may help to find some obscure errors like a bad address-line pin, but the issue here should be obvious: if there were a real problem with the DRAM, it's likely you wouldn't even get to that test. So now what?

Another example: suppose you have a PI bus connecting to an FPGA, and you are relying on the PI_done line to pace transfers. Now, a common test to verify the bus is working would be to read some magic word from the FPGA, or maybe use an XOR test register. But what happens when there's an actual failure, like a bad connection to the bus or bad power to the FPGA? Well, the CPU throws an exception and resets because the timeout for PI_done expires. This all happens before your test has a chance to even print anything to the console.

I run into these kinds of issues all the time when designing software routines to self-test the hardware on our devices.

Everything is all good when the hardware works, but as soon as hardware starts actually failing, all of those diagnostic tests are suddenly useless.

I've gotten kind of off topic here, but the thought process is what I'm getting at.

It usually goes a little something like this:

Possible solution: Always have a backup OS image. Caveat: There's only one copy of the bootloader, so if that area of the flash gets screwed, we're done.

Possible solution: Scrub the contents of RAM to check for errors. Caveat: The scrub routine is running from RAM.

Possible solution: Use a watchdog timer to force a reset if the CPU runs away. Caveat: What if the watchdog timer latches up and the CPU hard locks? (Unlikely, I know).
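For the backup-image idea, a sketch of the selection logic; the checksum scheme and layout here are hypothetical (a real loader would use a CRC, and as noted above, the loader itself remains a single point of failure):

```c
#include <stdint.h>
#include <stddef.h>

/* Simple additive checksum -- a stand-in; a real loader would use a CRC. */
uint32_t image_sum(const uint8_t *img, size_t len) {
    uint32_t s = 0;
    for (size_t i = 0; i < len; i++) s += img[i];
    return s;
}

/* Boot the primary image only if its stored checksum matches;
 * otherwise fall back to the backup. Returns 0 = primary, 1 = backup. */
int select_image(const uint8_t *primary, size_t plen, uint32_t pcksum,
                 const uint8_t *backup) {
    (void)backup;  /* the backup would get its own check in practice */
    return image_sum(primary, plen) == pcksum ? 0 : 1;
}
```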

In the end, you can do a lot in SW to make a system more robust, but the ultimate solution is probably on the mechanical side. You're better off spending your efforts modifying the enclosure and submerging the thing under water than spending hours designing software techniques to try to recover from HW errors.

TLDR: Most people agree that software techniques for mitigating risks in a rad environment are all about reducing probabilities, but I feel that such techniques have a good chance of being less effective in practice than people may think. I don't know what the point of this comment really was though, other than being negative. :)


----------------

One should be able to design a watchdog system to handle all kinds of HW failures.

----------------

The KISS design principle is: "If I am working SW/HW, I always assume the master has failed and am ready to take over overall control, unless I get absolutely positive confirmation that the master system is working perfectly within the N-millisecond window."

10 years ago, I designed the system HA software for such a system. It was built with just two completely identical redundant systems. Each system had 19 Xilinx FPGAs.

It worked very well. One could trigger any type of failure on either of the systems; within 250 milliseconds, the other system automatically took over control for whatever failure was detected.

The system software design was very simple and worked very well in Comcast and other cable network deployments. The input was 10 Gbit MPEG2 streams; the output was 48 channels of analog NTSC TV pictures. If you blinked, you wouldn't notice that the complete system HW/data flow had been switched over.

It also made a very cool demo. Unplug power, network, or boards on either system, and the other automatically took over complete control within 250 ms. If you watched very carefully, you only saw a slight flicker on the output TV signals for all 48 channels.

One other benefit: once the system was in place, it was trivial to add in-service SW upgrade. One can upgrade any SW (uboot, kernel, FPGA code, HW, etc.) with only 250 ms of impact on the end customer's service.

Once you get the idea of how it works, it is actually not hard to expand to a system with N redundant HW/SW units and let the "working master" HW own and generate the correct solution within the set N-millisecond time limit.


Good point: instead of thinking "what should I do if the other computer fails", you go straight to "I'll do this and that if you don't prove to me that you're doing fine within 50 ms".


It gets more complicated with shared state, though. Think of a DB split brain. If the slave incorrectly takes over then stuff gets corrupted. So you need proper fencing, and that alone seems like it's non-trivial. Common solutions seem to add a single point of failure (iSCSI fencing) or rely on the network being robust enough (quorum).


Syncing the complete config DB can be very simple too. For me it was just one line of wget to fetch the data over and untar it; the process took less than a second.

After which, all the configuration was mirrored to the standby in C code.

A LOT more code was written to test the config + HA switchover process. The test script continually configured the system on the master side via SNMP, forced a SW switchover every 30 seconds, and ran overnight, over the weekend, and beyond, to check for any unexpected configuration errors and failures.


On DEC hardware, the power-on self test (POST), which included a RAM test, was itself tested prior to starting. And as computers used to be so much less reliable in their hardware, a lot of work was dedicated to error-checking the hardware.

In today's hardware you can run a CRC check on your bootloader prior to starting it using only registers. When you start it, you can check a portion of memory from the bootloader, then using that memory, check the rest of memory. If you use ECC memory and initialize it properly, it will do a machine check if multiple bits get flipped. Multiple systems doing the same calculations should get the same answer, using three systems and flagging any system that disagrees with the consensus will identify failing hardware. I/O channels can use ECC as well, and software that is running in ECC protected I/O and memory can use data protocols to detect data corruption in flight (channel errors) or in situ (storage errors).
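A table-free CRC-32 along those lines; because it uses no lookup table, it can in principle run before any RAM is trusted, keeping its state in registers (sketch, standard reflected polynomial):

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise (table-free) CRC-32, reflected polynomial 0xEDB88320.
 * No lookup table means no dependence on untested RAM contents. */
uint32_t crc32_nosub(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++)
            /* branchless: XOR the polynomial only when the LSB is set */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

The standard check value for this CRC over the ASCII string "123456789" is 0xCBF43926.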

As with most things you can never be perfect (aka the halting problem) but you can make failure statistically rare enough that you are willing to risk it. It doesn't mean that bad things will never happen. We live with risk every day and still drive to work.

And the trick is that it's easier to just say "Whoops, something isn't right here; we're resetting to a known state," but that can make functioning really difficult (see Kepler's safe mode, or the challenges Armstrong had landing the LEM while it was constantly resetting the flight computer). Sometimes you just break. And in those situations the best you can do is break in such a way that you don't cause any additional harm.


Stupid question: for any use other than a satellite, where weight is critical, isn't it easier to protect the computer with a thick enough layer of metal and not worry about radiation, or are we talking about radiation so intense that it wouldn't help?


Good question. I'm not sure about the specifics of the stackoverflow poster's application, but for my case, they're used in space.

It's my understanding that you would need a very thick layer of metal to actually protect a device. Generally, lead works, but the thickness required would mean that your device would be extremely heavy. However, like you said, this isn't a problem for ground applications.

Also, apparently water works well too.

This is all honestly beyond my knowledge though. Our customers do their own qualification since these are COTS devices, so any failures may be handled on their end by making separate enclosures or using another vendor's system as a backup. They haven't actually asked us to implement any software based risk mitigation, but I investigated it on my own.


Ironically the lead itself is sometimes the source of the radiation causing problems (Pb-210). One small benefit of lead free electronics!

Some experiments have used 2000 year old Roman lead to avoid this http://www.nature.com/news/2010/100415/full/news.2010.186.ht...

See also https://en.wikipedia.org/wiki/Low-background_steel


Another stupid question, can you suggest a reading about reliable software development ? (I prefer book)

BTW thank you for excellent information in your comment.


Sadly, I don't know of any good resources. It's kind of a specific topic, and it wasn't really part of my task to consider these issues, so I only spent a small amount of time researching it myself.

I believe most of the studies were conducted by NASA, so incorporating that into a search should help.


One rad hard environment is detectors at particle accelerators/colliders. There are at least two problems:

1) You'd need a lot of lead to totally shield the equipment, like meters thick. Of course partial shielding reduces the radiation load but unless you can lower it to a negligible amount, you still need to have radiation tolerant code.

2) We generally want to minimize the amount of material in a particle detector, especially at layers nearest the collisions (where radiation is most intense). These inner layers (called trackers) are used to track light charged particles, while the outer layers (called calorimeters) are used to absorb and measure energy. So we need less material throughout to ensure we get good energy measurements by the time particles get to the calorimeter.

In addition to using radiation hardened components, a lot of this is mitigated by doing as little computation as possible on the front-end readout electronics. Most of it is switched capacitor arrays, asics, and the occasional FPGA, which can be made fairly reliable since they're basically just responding to clock signals and measuring voltage (or something), and buffering these measurements. The raw data is then piped away from the detector at absurd bitrates to another part of the cavern (called the "counting room") that is much less radioactive and has lots of standard computers/electronics to do further processing.


Mostly, but there are always cases where this is not practical. For example, the detectors at CERN (or probably any other particle accelerator) contain a lot of electronics (to capture and pre-filter the data) which cannot really be shielded (otherwise you don't have a detector anymore) or moved elsewhere (which would incur a lot of inconvenient delays and additional noise; also, you don't want to lay a few meters of cable for every pixel ;)). Most of the computing power sits in an adjacent cave, however, where it is quite safe.


In most hard rad environments where you want to use a CPU you also want to move the CPU around (eg robots in a reactor). At least in the scenarios that I can think up right now. Weight and size are also important there.


Thanks for the insight. I'm used to missing packets (network connections) or erroneous bits (sat connections) but erroneous ram is a whole different game, I would think.

Stupid question: would it be possible to burn the OS + deployment in a ROM, kinda like the old 680x0 macs? Or are those just as vulnerable to radiation as RAM?

At any rate, it seems like designing in this environment would require a level of redundancy not unlike what's in the human brain, to store & correlate information. Very fascinating stuff!


Thank you for the insightful comment. How do you perform your own testing? What I mean is, how do you replicate space (or extreme-condition) failure scenarios on Earth? I imagine that either you need a lab in which you blast your hardware with radiation hoping to get some failures, or do you maybe have a reproducible way of simulating these types of failures?

If you're working out of a lab, once the testing is complete, what happens to the hardware? Is it irradiated and does it need to be disposed of safely?


One way to summarize your good post is in engineering there is no simple generic testing. There is testing to reduce mfgr mistakes. There is testing against outside opposition ranging from UI data entry errors to active attackers. There is testing to make a program as defensible as possible. There are other classes of testing too. And some testing strategies are not compatible with each other or compatible with the shipped operating environment.

So for mfgr-mistake code testing, when you find a problem you give up, or set the board on fire, or do anything to make sure it never gets shipped to a customer. That's not good for a spaceship or a nuclear reactor.

For outside opposition you can assume a non-zero BER on the telecom link, either due to signal issues or radiation (acute or long-term degradation) in the analog circuitry. However, anti-attacker code is just a place to get flipped bits and crash. On the other hand, a really naive string copier looking for a null with a flipped bit will never terminate; but you're watching memory address bounds anyway, so ...
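That bounded copy can be sketched in a few lines (Python for illustration; `copy_cstring` is a made-up name, and real firmware for this would of course be C):

```python
def copy_cstring(mem, start, limit):
    """Copy bytes from `mem` starting at `start` until a NUL terminator,
    but never scan more than `limit` bytes: if a bit flip erases the
    terminator, the address bound trips instead of looping forever."""
    out = bytearray()
    for i in range(start, start + limit):
        b = mem[i]
        if b == 0:
            return bytes(out)
        out.append(b)
    raise RuntimeError("terminator lost; address bound tripped")
```

The failure mode becomes a detectable fault (an exception here, a jump to a recovery routine on a real system) rather than a hang.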

The right thing for a desktop computer to do when the stack pointer or PC gets lost is to run off into outer space and get pwned, or panic the system, because who cares; that should never happen on a shipped system on Earth in a non-critical application. The right thing for a space vehicle is to populate "unused" memory with a jump to a recovery routine.

going off on a tangent related to your post:

Simplicate and add lightness. Interrupts exist because slow 1970s-era processor tech couldn't keep up with 1970s IO tech or UI. In 2016 you poll. Your interrupt service routine can't get corrupted if you don't use one. Your "interrupts enabled" bit can't get cleared if you're not using it. Your stack pointer can't get lost if you don't use one.

If you need to repoint the antenna every couple minutes and your program crashes every couple hours on random average, you power it up every couple minutes for a few seconds, and then it can't crash or lock up when there's no power.

Loops exist because in 1977, when we had 128 bytes of memory and wanted to load 8 bytes of data, it was cheaper in terms of memory to write a loop that reads one byte 8 times. But in 2016 you unroll all loops, so if there's a single bit error in one read then "most of" the process works, and a single bit error in the loop code can't crash it because there's no loop code. You're paying the mfgr for great gobs of memory in 2016; use it.

From an engineering standpoint it's like replacing crane mounting bolts the size of bratwursts with 6-32 sized hardware from Walmart, because that's how we rolled in 1977 and ain't nothin gonna change that, and at least sometimes in desktop simulation it works.

The stackoverflow article displays perfectly why I don't use stackoverflow. 90% of the comments are people talking about how the problem is no fun, weird assumptions about the asker, blind cargo-cult-style devotion to paperwork and tradition, passing the buck to other agencies, or complaining (incorrectly) that if it's not perfectly solvable then no effort should be made. If you want to know the option flag to sort the output of GNU ls by modification time, then SO works, but SO is useless for all serious questions. SO is great at answering the questions I don't need to ask and sucks at answering the questions I need answered.


Interestingly - A lot of the advice here (such as having an "Active" module and a "Standby" module with a heartbeat between them, Watchdogs, reducing single points of failure, error detection codes) is applicable to a lot of other High Availability scenarios. Think load balancers, databases, routers and high scalability networking devices for example.


I appreciate it may not be the right tool in an extremely resource-constrained embedded environment, but the "my hardware can't be relied on" problem is almost exactly what's described in several Erlang guides.

http://learnyousomeerlang.com/the-hitchhikers-guide-to-concu...


I wonder if there's a specialized debugger that injects random faults into your code while it's running (e.g. flipping booleans or IO to peripherals). Not just fuzzing the inputs to the program, but fuzzing the contents of memory and registers as routines run. You could set the fuzz rate at different levels to detect different kinds of errors (e.g. maybe one error happens frequently but another requires a timer to tick up to a certain level before errors can affect it).

EDIT: Thinking about this a bit more, for messing with registers, you probably need specialized hardware or you need to intercept the data on the way to the register.
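Purely in software, a toy version of the memory-fuzzing half of that idea might look like the following sketch (Python; `fuzz_run` and `checked_increment` are illustrative names, and the one-flip-between-steps model is deliberately simplistic):

```python
import random

def fuzz_run(step, state, steps, flip_rate, seed=0):
    """Run `step(state)` up to `steps` times, flipping one random bit in
    `state` between steps with probability `flip_rate`.  Returns how many
    steps completed before `step` detected (raised on) the corruption."""
    rng = random.Random(seed)
    for n in range(steps):
        if flip_rate and rng.random() < flip_rate:
            i = rng.randrange(len(state))
            state[i] ^= 1 << rng.randrange(8)
        try:
            step(state)
        except Exception:
            return n
    return steps

def checked_increment(state):
    # Toy "program": byte 0 is a counter, byte 1 its bitwise complement,
    # so any single flipped bit in either byte breaks the invariant.
    if state[0] ^ state[1] != 0xFF:
        raise ValueError("redundant copy disagrees; fault detected")
    state[0] = (state[0] + 1) & 0xFF
    state[1] = state[0] ^ 0xFF
```

With `flip_rate=0` the toy program runs to completion; with a high flip rate the complement check catches the first injected fault on the very next step.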


You could do this with an emulator I'm sure. At my company we have fairly high-fidelity models of the hardware we use/develop which could probably include on-chip peripherals in the things fuzzed. Also I guess in irradiated environments you use very simple hardware (no caches, MMUs, complex memory bus/controller) so that would make it more feasible. Would probably be really fun to develop!

Edit: if you had the resources maybe you could even fuzz simulations of the RTL. Our systems take about a week to boot Linux on simulations but I guess that would drop dramatically for simpler MCU systems.


That does sound like an incredible start. I fear that for the kinds of applications we're talking about, speed could be an issue. For instance, if the system has to respond to events it can't control, queues of events can build up, or things that are likely to be synchronized might not be, etc. I feel like in many situations you'd want to run at full speed for official tests, though early testing could use an emulator. Of course, if you're emulating much slower hardware as you mentioned, the test is more realistic. ;)


Ah the timing thing is not actually an issue because you simulate or emulate the entire system, including timers and even the PLLs that drive them. There's a globally consistent timeline. Of course this places constraints on the things you can simulate (e.g. networking is out the window).


There was an interesting 2012 interview with John Muratore (SpaceX director of vehicle certification) about radiation tolerant design on SpaceX Dragon: http://aviationweek.com/blog/dragons-radiation-tolerant-desi...


On a more trivial note, this reminds me of the radiation-hardened quine: https://github.com/mame/radiation-hardened-quine

Would be amazing if the method used for implementing it could be generalised somehow.


Off the top of my naive head.

Have 3 or 5 or a dozen copies of the thing running in parallel. As many as it takes.

Every action calls for a vote among the copies. Majority rules.

Periodically validate the data. Again, majority determines what's valid.
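The vote itself can be as small as this sketch (Python for illustration; it assumes the replica outputs are hashable values):

```python
from collections import Counter

def vote(replica_results):
    """Return the value a strict majority of replicas agree on;
    raise instead of guessing when there is no majority."""
    winner, count = Counter(replica_results).most_common(1)[0]
    if count * 2 <= len(replica_results):
        raise RuntimeError("no strict majority; too many replicas disagree")
    return winner
```

For example, `vote([42, 42, 99])` returns 42, while a three-way disagreement raises rather than picking arbitrarily.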


Make sure that the code that does the "ballot" is not a single point of failure itself.


oh good point. That makes it more interesting.


For ground-based applications, why can't this be solved by building a lead-shielded chassis? Like everything else in the software engineering world, abstract out the radioactivity problem and solve it separately. The other functional units shouldn't have to know or care about whether they are running in a radioactive or non-radioactive environment.


There are actually processors designed specifically to prevent issues like this [1]. They use ECC flash, ECC RAM, and dual CPUs that run in lock-step.

1. http://www.ti.com/product/rm48l550


There might be weight considerations if this is a mobile device. Then this "abstraction" is certainly not costless (like many abstractions in the software engineering world).


Coolest SO question I've seen in a while.


While not about radiation hardening or redundant hardware, this article shows how StarCraft's developers discovered and handled random hardware failure: http://www.codeofhonor.com/blog/whose-bug-is-this-anyway

The bug webpage mentioned in the article is dead, and clicking the link subtly redirects you. Here's the archived version: http://web.archive.org/web/20140824005145/http://www.guildwa...


Isn't this like trying to do numerical analysis while having Alzheimer's? If the hardware is bad, your software will be bad.


You're pessimistic. The optimist would ask "how much numerical analysis can I do at varying levels of Alzheimer's?", and the responses to the question are very well worth reading.


All good, except most of your mitigating code/tricks cannot be tested reliably enough to know they actually work.


Liberal arts types say engineers are not creative, but they're wrong. Find some creative engineers, it won't be much of a challenge.

Write 5 lines of perl that run logical AND and logical OR on a randomly generated bitmap emulating flash memory error rates at 1e-3, 1e-5, 1e-7, then run a billion emulation cycles (depending on code complexity this might take a long time, or maybe not).
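In Python rather than Perl, one version of that experiment could be: keep three copies of each word, flip bits at a given error rate, and recover with a bitwise 2-of-3 majority (the function names and the 32-bit word size are arbitrary choices here):

```python
import random

def corrupt(word, ber, rng, bits=32):
    """Flip each bit of `word` independently with probability `ber`."""
    for b in range(bits):
        if rng.random() < ber:
            word ^= 1 << b
    return word

def majority3(a, b, c):
    # Bitwise 2-of-3 vote: a bit survives if any two copies agree on it.
    return (a & b) | (a & c) | (b & c)

def residual_word_errors(ber, trials=5000, seed=0):
    """Compare the word-error rate of a single copy vs. a voted triple."""
    rng = random.Random(seed)
    raw = voted = 0
    for _ in range(trials):
        w = rng.getrandbits(32)
        copies = [corrupt(w, ber, rng) for _ in range(3)]
        raw += copies[0] != w
        voted += majority3(*copies) != w
    return raw / trials, voted / trials
```

Roughly, a raw per-bit error rate of p becomes about 3p^2 after the vote, so at 1e-3 the surviving error rate should drop by a couple of orders of magnitude.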

If you're doing it right, you have the code for the emulator (never use closed source for this kind of work unless you trust it with your life, LOL), so implement statistical error injection in your emulator. Guess what: your CPU adder now has a 1e-5 bit error rate.

This is before the dreaded "programmer with a soldering iron" is unleashed. A white noise source is attached to pin D0 of the CPU and levels are increased while behavior of the system is analyzed in the test jig.

Simplistically, radiation is linear, so running at 100x intensity for a month is the same as running at design intensity for 100 months. Talk to your local nuclear physicist or radiologist. Especially about neutron activated decay products, LOL. In more detail, if it takes xenon 12 hours to build up to a problem level, you can't just test at 100x intensity for an hour and call it good because 11 hours after the test ends you'll finally have 100x the xenon decay product as a contaminant, but that's a weird situation.


> Liberal arts types say engineers are not creative, but they're wrong.

I take your basic point--and it's a good one--that engineering is actually a very creative pursuit, when done well. But like any other high-expertise field, it's difficult for people on the outside to understand how engineers solve problems and, thus, to appreciate the sorts of skills required to do it well.

But isn't one of the lessons here that it's probably not a good idea to make gross generalizations about the attitudes and dispositions of people in other disciplines? :) I think "Liberal arts types say engineers are not creative" is probably about as true as "engineers are not creative." There may be some kernel of truth to each, but neither is especially accurate. (The kernel of truth may be that each statement could often be true when relativised to dumb liberal arts types and dumb engineers.)


One implementation of a single-event-upset testbed that I worked with used the valgrind infrastructure to inject faults in running code, purely in software. (Example: http://www.sts.tu-harburg.de/research/fitin.html, earlier work: http://dus.jpl.nasa.gov/projects/solar/Papers/granat-abft-sv...)

Your larger point is right: some commenters are giving up on problem analysis or solutions "because the problem is hardware". That implication does not follow. Perhaps people are too used to highly abstracted software systems?


I think you guys just misunderstand. The question you should be asking is "is this the right solution?", rather than "can this be done?".

If you want to build robust systems, picking the right solution is paramount. In this case the consensus seems to be to fix it in hardware rather than messing around with software (a verifiable software solution takes longer to implement, is harder to test, etc.).


I wrote to push back against some sweeping claims that are made nearby, such as "if the hardware is bad, your software will be bad." This is simply not true for many values of "bad". Another comment nearby regarding testing also surrendered prematurely.

I worked and published in the area for a few years, so I have a definite POV. I find it appealing that software can, in some cases, make up for the failings of hardware.

The topic area is fascinating. It has some of the intellectual aspects of security research: you can get somewhere by breaking through abstraction barriers in surprising ways.


That's still unnecessarily pessimistic. You can examine the various types of unreliability you're getting from your test techniques to try to drive an understanding about the operating environment.

If you have something in front of you, there's really no reason you can't at least try to come to some kind of understanding of it. Then do your best to communicate your findings.


Yes, that's why it's called rocket science

The answers there are very insightful


Yes. But what if you have to make a reliable system for, say, cleaning nuclear reactors? Or flying between planets?

Sometimes software needs to be able to work around bad hardware.


For the former you use mechanical manipulators and sensors and keep as far away as possible.


Even with more robust hardware (and the question says that their hardware is designed for this environment), software redundancy and error checking can still be worthwhile, even when it only means that you catch the hardware malfunctioning rather than being able to fix the error.

Still, for things that are really mission-critical, I'm not sure there's any substitute for full-blown hardware redundancy, e.g. having 3 computers and requiring 2 of them to agree on an action (via an interlock mechanism that is itself robust and redundant, which is a whole other can of worms...).


There are many cases where we accept that some component of a system will be faulty and then design a system around it that neutralizes or mitigates those faults.

Think of COTS servers, space-craft, or various people-management processes.


Yeah, naively the answer seems like "more shielding, Mr. Scott."


Is it concerning that I see a question like that on Stack Overflow, or am I too pessimistic?


There are lots of high-radiation environments people have to deal with: satellites, medical devices, nuclear power plants, particle accelerators, and even the type of robots sent into nuclear disaster areas: http://www.sciencealert.com/the-robots-sent-into-fukushima-h...


I'd personally rather it get asked in a transparent way (like this) vs some programmer making some stupid assumptions in some cube somewhere.


I think the implication was "why is it possible for a programmer to make a stupid assumption". In such a critical environment, one would hope there's oversight and understanding of all of this, eliminating the need to ask on SO.


Why? Where & how would be the right way to ask a question like this?


I gotta say, this is the most fascinating thing I have seen all week.


genetic code has a lot of the same protective mechanisms (multiple copies of genes, for example), and is highly focused on radiation protection.


An entirely different approach could be to make it more likely to crash to minimize the harmful effects of executing corrupt code.

I don't know how these corruptions manifest themselves, but if, for example, a single bit gets flipped every now and then somewhere (probably a naive assumption), then on a CPU that had a 0b11111111 nop instruction and 8 reset instructions at 0b01111111, 0b10111111, 0b11011111, etc., you could interleave your code with tons of nop instructions to minimize the possibility of executing corrupted code. You'd then be more likely to reset than to execute something that will lead to further corruption.

As for data, I like the idea of storing/reading it redundantly in 3 or 5 places and assuming that the majority of a read across all of them (bit by bit) is correct, then rewriting the supposedly corrupt data. Again, not sure how well a solution like this actually applies to radiation corruption. You can of course use any existing techniques for data error correction, like Hamming coding (used for example in teletext to correct errors).
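For the Hamming route, here's a hedged sketch of Hamming(7,4), which packs 4 data bits and 3 parity bits into a 7-bit codeword and corrects any single flipped bit (Python for illustration):

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]  # covers positions 4,5,6,7
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]  # positions 1..7
    code = 0
    for i, b in enumerate(bits):
        code |= b << i
    return code

def hamming74_decode(code):
    """Correct up to one flipped bit, then return the 4 data bits."""
    bits = [(code >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    pos = s1 | (s2 << 1) | (s3 << 2)  # 1-based error position, 0 = none
    if pos:
        bits[pos - 1] ^= 1
    d = [bits[2], bits[4], bits[5], bits[6]]
    return d[0] | (d[1] << 1) | (d[2] << 2) | (d[3] << 3)
```

The syndrome computed in decode is the 1-based position of the flipped bit (0 means no error), which is what makes the correction a single XOR.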


> I don't know how these corruptions manifest themselves, but if, for example, a single bit gets flipped every now and then somewhere (probably a naive assumption), then on a CPU that had a 0b11111111 nop instruction and 8 reset instructions at 0b01111111, 0b10111111, 0b11011111, etc., you could interleave your code with tons of nop instructions to minimize the possibility of executing corrupted code. You'd then be more likely to reset than to execute something that will lead to further corruption.

I like this. It's functional, yet wildly impractical.


The idea of a single bit flip in an instruction fetch is very real. Imagine an instruction encoding where your intended load op becomes a store due to a single bit flip, and your system hangs waiting for a response that never comes. Obviously you should have used ECC on your instruction RAM, but that can have a big impact on your design, so maybe you skipped it...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: