How Is Critical Life or Death Software Tested?

acomjean · on June 1, 2015

I wrote software for radars. Kind of important (though not like plane software). We used Ada a lot, which in my estimation helped. Software was reviewed. Tests were reviewed. Reliability was favored over other things (for example recursion was discouraged). We used Ada's constrained types (this value is between 1 and 99; if it goes out of range, throw an exception).
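For readers who haven't used Ada, the constrained-type idea can be roughly approximated in other languages. The following is a minimal Python sketch, not Ada and not the radar code: `Constrained`, `RangeError`, and `Track.gain` are hypothetical names invented for illustration, and the check here is plain runtime validation on assignment.

```python
class RangeError(ValueError):
    """Raised when a value leaves its declared range (hypothetical name)."""


class Constrained:
    """Descriptor that rejects assignments outside [low, high]."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def __set_name__(self, owner, name):
        # Store the real value under a private attribute on the instance.
        self.name = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.name)

    def __set__(self, obj, value):
        if not (self.low <= value <= self.high):
            raise RangeError(f"{value} is outside {self.low}..{self.high}")
        setattr(obj, self.name, value)


class Track:
    # Hypothetical field: a value that must stay between 1 and 99.
    gain = Constrained(1, 99)

    def __init__(self, gain):
        self.gain = gain  # goes through Constrained.__set__


track = Track(50)
track.gain = 99        # accepted
# track.gain = 100     # would raise RangeError
```

The point is that the range is declared once, next to the field, rather than re-checked by hand at every call site.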

For external hardware inputs, we had software simulating the messages when we couldn't get the real equipment in the lab. We were writing software for a radar that didn't yet exist.

And it was tested and tested and tested and tested.....

Then when the pieces came together in integration, they tested and tested. And when they built the physical radar, the software worked. A few configuration things here and there needed adjusting.

Actually testing was built into the software. When it came up it would talk to the physical parts to make sure everything was communicating ok before it could start running.

Was it bug free? Probably not. You can't test every possible scenario. But we took the work seriously. It took longer to get code done than in any other environment I've been in, but the code ended up being very solid.

outworlder · on June 1, 2015

> Actually testing was built into the software. When it came up it would talk to the physical parts to make sure everything was communicating ok before it could start running.

I've been wondering about this for a while. We tend to run unit tests, integration tests, whatever tests, while the software is in development. However, once it is in "production" (for whatever definition of production), usually no tests are performed. At most, there's some sort of metrics and server monitoring, but nothing in the actual software.

It's a work of fiction, but Star Trek has these "diagnostics" that are run, with several levels of correctness checks, for pretty much everything. In your typical app, it could be useful to ask it to "run diagnostics", to see if everything was still performing as expected.

dugmartin · on June 1, 2015

Early in my career I worked on military flight data recorders, including the development of the software for the F-22's "black box". Those systems have SBIT, IBIT, PBIT and MBIT sub-systems, where BIT is "built in test" and S = startup, I = initiated, P = periodic and M = maintenance. I remember making the Star Trek diagnostic joke myself when I was assigned the SBIT work.

Each BIT does varying levels of testing based on its runtime budget, but there are a lot of very basic tests that don't make much sense until you see your first field report of some register bit "sticking". It's much better to ground a plane that can't add 1+1 than to find that out in mid-flight.

neurotech1 · on June 2, 2015

An F-22 had an FCS failure that went unnoticed until takeoff. The pilot didn't do an IBIT on the FCS as required, and therefore wasn't aware that all 3 rate sensors had latched up. The jet was uncontrollable once it left the ground, and the pilot ejected safely. [0]

As I recall, they modified the sensors to avoid latch-up (extra pull-up resistor) and updated the FCS software to provide a warning if all 3 sensors return zero output. Even though this wasn't really an error in the FCS software, it could be argued that it failed to detect an erroneous input from 3 redundant sensors that latched up for the same reason.
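The kind of check described above (warn when every redundant sensor reports zero) can be sketched roughly as follows. This is a hypothetical Python illustration, not the actual FCS logic: the function name, the status strings, and the `tolerance` value are all assumptions, as is the premise that three healthy sensors would not all read exactly zero at once.

```python
def rate_sensor_status(readings, tolerance=0.5):
    """Classify a set of redundant rate-sensor readings.

    Returns "common_mode_suspect" if every sensor reports exactly zero,
    "disagree" if the spread between sensors exceeds `tolerance`,
    and "ok" otherwise.
    """
    # Plain voting cannot catch this case: three latched sensors agree perfectly.
    if all(r == 0.0 for r in readings):
        return "common_mode_suspect"
    if max(readings) - min(readings) > tolerance:
        return "disagree"
    return "ok"


status = rate_sensor_status([0.0, 0.0, 0.0])  # "common_mode_suspect"
```

Redundancy protects against independent failures; a shared failure cause needs its own explicit check like the first branch here.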

This is a classic example of a pilot who misunderstood the pre-takeoff checklist procedure and cost the USAF a $325m aircraft.

[0] http://usaf.aib.law.af.mil/ExecSum2005/F-22A_20Dec04.pdf

infinotize · on June 2, 2015

To be fair this seems to be more of a classic example of a documentation or training problem. Right from your link:

> During the mishap sequence, the MP started engines, performed an IBIT, and had a fully functioning Flight Control System. Subsequently, the MP shut down engines to allow maintenance personnel to service the Stored Energy System. During engine shut down, the MA's Auxiliary Power System (APU) was running. The MP believed the APU provided continuous power to the Flight Control System, and therefore another IBIT after engine restart was unnecessary. This belief was based on academic training, technical data system description, and was shared by most F/A-22 personnel interviewed during the investigation.

michaelt · on June 2, 2015

> To be fair this seems to be more of a classic example of a documentation or training problem.

Maybe - but if you were designing a consumer product, you wouldn't rely on the user following a checklist; the IBIT would run automatically when they turned on the ignition and sound an alarm (or even prevent takeoff) if the vehicle would be uncontrollable.

So you could also classify this as a user interface / design problem.

dugmartin · on June 2, 2015

SBIT & PBITs are the ones that run automatically, with SBIT running automatically at startup and PBIT running on a watchdog. The SBIT time budget and scope are usually much smaller than IBIT's, so time-intensive tests like ones that talk to sensors on the bus aren't present in SBIT. You can think of the stages as SBIT: can I run? IBIT: should I run? PBIT: am I running right?
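As a rough illustration of the staging described above, here is a minimal Python sketch of a registry that groups checks by stage. It is not taken from any avionics code: `Stage`, `BuiltInTest`, and the two example checks are hypothetical, and a real system would also enforce a time budget per stage.

```python
from enum import Enum


class Stage(Enum):
    SBIT = "startup"    # can I run?
    IBIT = "initiated"  # should I run?
    PBIT = "periodic"   # am I running right?


class BuiltInTest:
    def __init__(self):
        self._checks = {stage: [] for stage in Stage}

    def register(self, stage, name, check):
        """Attach a zero-argument check returning True on success."""
        self._checks[stage].append((name, check))

    def run(self, stage):
        """Run every check for one stage; return the names that failed."""
        failed = []
        for name, check in self._checks[stage]:
            try:
                ok = check()
            except Exception:
                ok = False  # a check that blows up counts as a failure
            if not ok:
                failed.append(name)
        return failed


bit = BuiltInTest()
# Cheap sanity test suited to a small startup budget.
bit.register(Stage.SBIT, "alu_adds", lambda: 1 + 1 == 2)
# Simulated fault standing in for a slow test that talks to a sensor on the bus.
bit.register(Stage.IBIT, "bus_sensor_responds", lambda: False)

startup_failures = bit.run(Stage.SBIT)
initiated_failures = bit.run(Stage.IBIT)
```

Keeping the stages separate is what lets the cheap checks run on every startup while the expensive ones wait for an explicit request.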

neurotech1 · on June 2, 2015

True. The pilot wasn't the only one who misunderstood the details.

derekp7 · on June 2, 2015

What types of errors would cause one of the tests to fail? Is it mostly testing for hardware errors, or are there any software logic errors that could make it to production, but be caught by one of the tests several months down the road? The only software related items I can think of are edge cases where a built in test is based on real time input. Kind of like running the calculations through multiple independent implementations of the software.

npatrick04 · on June 2, 2015

There's actually a decent probability of memory corruption in space applications due to radiation. So in addition to checking communication across busses, application checksums are typically run continuously.
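A continuously run application checksum could look something like the sketch below. This is only a Python illustration using the standard `zlib.crc32`; the helper names and the simulated bit flip are assumptions, and flight software would typically checksum the real program image in memory rather than a byte string.

```python
import zlib


def make_reference(image: bytes) -> int:
    """Compute the checksum once, while the image is known to be good."""
    return zlib.crc32(image)


def image_intact(image: bytes, reference: int) -> bool:
    """Re-run the checksum and compare it against the stored reference."""
    return zlib.crc32(image) == reference


def flip_bit(image: bytes, bit_index: int) -> bytes:
    """Simulate radiation flipping a single bit (for demonstration only)."""
    corrupted = bytearray(image)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


image = bytes(range(256))           # stand-in for an application image
reference = make_reference(image)
healthy = image_intact(image, reference)
damaged = image_intact(flip_bit(image, 1000), reference)
```

A periodic task would call `image_intact` in a loop and hand off to a recovery path (reload, reboot, safe mode) when it returns `False`.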

TheLoneWolfling · on June 2, 2015

Another thing that's software-related is if you've got a (rare) race condition. For example a data structure that gets corrupted if you have a particular series of nested interrupts.

Now, hopefully your system is set up such that race conditions cannot happen, but good luck with that.

evntdrvn · on June 1, 2015

Yup, as far as I remember all the avionics hardware I saw had these types of BITs built in.

rzzzt · on June 1, 2015

Dropwizard (a web application framework/library) incorporates the idea of such health checks, and you can also implement additional ones specific to the application. It is encouraged to run them periodically in production to ensure that the database connection is still up, threads are not stepping on each other's toes, etc.

https://dropwizard.github.io/dropwizard/manual/core.html#hea...

joshrotenberg · on June 1, 2015

That uses the metrics library by the same author. Very handy stuff in there for development and production.

https://dropwizard.github.io/metrics/3.1.0/

acomjean · on June 1, 2015

Even while running, the radar would keep track of communications between the parts and make sure things were still ok. The system needed messages periodically from the external components and vice versa to make sure things were ok. There were status messages sent around too. And a display of how things were doing. It's been a while, but I remember some of the things.

You could command the external things to run diagnostics and report back. Part of this was it makes fixing things easier (like your car computer's diagnostics), part of it was required so we can figure out where things weren't working optimally.

For example, if a motor was running and its controller computer didn't hear from the main system in X seconds, it would just stop, send a message about what it was doing and then wait for an instruction. Presumably this was to prevent all heck from breaking loose if the main system went down or was not responsive.
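The stop-on-silence behaviour described above can be sketched as a small heartbeat watchdog. This is a hypothetical Python illustration, not the radar's controller code: the class, the message text, and the injected `clock` are assumptions (the clock is injected so the timeout can be exercised without real waiting).

```python
class MotorController:
    def __init__(self, timeout_s, clock):
        self.timeout_s = timeout_s    # the "X seconds" from the description
        self.clock = clock            # zero-argument function returning seconds
        self.running = False
        self.last_heard = clock()
        self.outbox = []              # status messages for the main system

    def on_heartbeat(self):
        """Call whenever any message arrives from the main system."""
        self.last_heard = self.clock()

    def start(self):
        self.running = True

    def poll(self):
        """Call periodically; stops the motor if the main system went quiet."""
        silent_for = self.clock() - self.last_heard
        if self.running and silent_for > self.timeout_s:
            self.running = False
            self.outbox.append("stopped: no heartbeat from main system")


now = [0.0]                           # fake clock for the demonstration
motor = MotorController(timeout_s=2.0, clock=lambda: now[0])
motor.start()
now[0] = 1.0
motor.on_heartbeat()
now[0] = 3.5                          # 2.5 s of silence exceeds the 2 s timeout
motor.poll()
```

After the last `poll()`, the motor is stopped and one status message is queued; it then simply waits for an instruction, as in the description.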

This was all spelled out in long requirements documents.

I wonder sometimes if our cars' controlling computers are doing this...

mlonkibjuyhv · on June 1, 2015

Cars often disable any system capable of interfering with the wheels at the slightest sign of an issue. Insufficient seal on the fuel cap? Disable ABS, TCS, etc.

acranox · on June 1, 2015

That's not correct. Cars will generally have something that's often called a 'limp home' mode. I saw a car whose PCM (powertrain control module) failed some internal test, so it fell back to a basic mode where the engine wouldn't rev over about 2000 RPM, and I'm sure all the emissions systems went into a basic fail-safe mode, where things like fuel delivery go into a hard-coded mode instead of using feedback from the oxygen sensors to tune the fuel delivery. But a loose fuel cap doesn't disable the ABS or TCS. Even a single faulty wheel speed sensor doesn't have to disable the entire ABS; it can still independently assess each wheel, determine if a lock-up is imminent and modulate the brakes for that wheel.

mturmon · on June 2, 2015

That's neat. The counterpart for spacecraft is "safe mode" (http://en.wikipedia.org/wiki/Safe_mode_(spacecraft) ). The priorities are maintaining attitude control, conserving power, and listening to the radio. The science instruments and other bells and whistles are turned off as much as possible.

_nedR · on June 2, 2015

What really scares me are the ABS and traction control systems that are now becoming mainstream on motorcycles. These systems have accelerometers and gyros, which take into account information like the lean angle of the bike when calculating the braking force. It certainly is a feat of engineering.

Little about being a biker, programmer, and aspiring tree-shade mechanic reassures me about the safety of these systems. The internals of a bike are much more exposed to abuse, bikers are known to take a spanner to their machines, and motorcycle repair workshops are a much more informal industry (at least here in India). What happens to a bike that is not subject to regular maintenance? On the other hand, I have always marvelled at how mechanical systems like motorcycles are usually built with some sort of graceful failure in mind: in a lot of cases a motorcycle will warn you about a faulty component before it fails catastrophically. I assume the people who designed these systems kept that in mind (although stuff like Toyota's unintended acceleration does not inspire confidence).

And what happens in the event of a catastrophic failure? A car locking up at speed is still dangerous, but there is room for error. If the front tire of your motorcycle locks up at speed, the odds of you walking away from the incident are not high.

Don't get me wrong, these systems DO SAVE more LIVES than they could possibly take away in the long run, but I am still disconcerted by the whole thing.

EDIT: typos

takeda · on June 3, 2015

> although stuff like Toyota's unintended acceleration does not inspire confidence

Another thing that worries me is that these same companies are also working on self-driving cars.

Worst of it is that everyone pretty much jumped into the race after Google. It also doesn't look like existing solutions run on a real-time system. I'm a bit worried about being hit by a car because it was running a garbage collection process and did not react quickly enough.

nradov · on June 1, 2015

That's not correct. You can leave the fuel cap off and it won't disable ABS or TCS. The only thing that will happen is the "check engine" warning light will come on.

priv_acy · on June 2, 2015

Funny. Just the other day, my gas cap was loose. The traction control light came on along with the check engine light. As far as I could tell, it (traction control) was still operational, however.

Seemed like a funny combination of lights to blink on.

jerf · on June 1, 2015

It doesn't necessarily take much work to run integration tests as diagnostics on production, especially if you plan on it in advance. I've had good success with it.

Of course there's a certain level of destructive testing you can't do live, but that you really ought to do on your development system, load testing being a simple example. It behooves the wise developer to keep these quite separated in the code. :)

Tloewald · on June 1, 2015

The project I'm working on (for the Federal government no less, and very much not life and death) involves having test data in the production system and tests that run in the production system. (I was impressed by this when I found out.)

clearf · on June 2, 2015

Make sure they're well separated! c.f. http://www.wsj.com/articles/SB100014240527487033765045754919...

Tloewald · on June 2, 2015

The stuff we are doing is so not life and death that nothing like this could happen. E.g. We have fairly insane levels of security for information that is inherently public.

_csoo · on June 1, 2015

That would be called reliability testing, and it's why Netflix has Chaos Monkey to throw a wrench into things.

lostcolony · on June 1, 2015

Also Conformity Monkey.

grayarea · on June 2, 2015

I used to write fire control and fire monitoring software. The whole idea of live testing is built into the ethos of such systems. In fire alarms and control the only tests that matter are the ones performed in a production environment.

z3t4 · on June 1, 2015

A lot of people get this backwards, thinking that a stable program should never "crash". It's actually the opposite: it should throw errors at every opportunity to do so.

The errors should then be logged and the program should be restarted by a watcher process.

Here's an example of how you can both log errors and e-mail them if a process crashes, using a startup script (Linux, Ubuntu):

  exec sudo -u user /bin_location /program_path 2>&1 >>/log_path | tee -a /error_log_path | mail -s email_subject mail@domain.com

yellowapple · on June 1, 2015

This is how Erlang (for example) gets its reputation of being "nine-nines" capable (i.e. capable of 99.9999999% uptime, or downtime on the order of milliseconds per year). Erlang (and Elixir and LFE) software following the OTP framework is usually ordered into "supervision trees" - layer upon layer of Erlang processes managing other Erlang processes in turn managing other Erlang processes, all potentially distributed across multiple Erlang VM (nowadays BEAM) instances.

tormeh · on June 1, 2015

A watchdog pattern just splits the program into several processes; the program as a whole still never crashes.

peteretep · on June 2, 2015

If an error is caught and handled, calling it a crash seems disingenuous.

z3t4 · on June 2, 2015

One mistake people make is wrapping their code in a try ... catch where it would be better to throw an error and exit. If there's an error in one place, chances are there are also errors elsewhere, so it's better to restart the program instead of continuing with a bad state.

When the error gets thrown in your face, there's a higher chance that it gets fixed.

But this also has its drawbacks. Losing the whole state can be really bad.

peteretep · on June 2, 2015

I'm really not sure why you think catching an error in a separate process is somehow superior to catching it in a higher scope.

sacado2 · on June 2, 2015

I think it depends on the kind of error. If it is a "bug-detected" error (null-pointer dereference, out-of-bounds, divide-by-zero, out-of-memory, etc.), you'd better restart the program since you're in an unstable state. If it is a "domain-specific" error (connection lost, robot could not reach its destination, battery low, etc.), you'd better deal with it as soon as possible.
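The distinction drawn above can be sketched in a few lines of Python. This is an in-process stand-in for a real watcher process, written only to show the two error paths: `DomainError`, `run_step`, `supervise`, and the dict-based state are hypothetical names, not part of any framework mentioned in this thread.

```python
class DomainError(Exception):
    """Expected operational failure, e.g. connection lost or battery low."""


def run_step(step, state):
    """Handle domain errors right here; let everything else propagate."""
    try:
        step(state)
    except DomainError as err:
        state["warnings"].append(str(err))


def supervise(make_state, steps, max_restarts=3):
    """Restart from a fresh state whenever a bug-type error escapes."""
    restarts = 0
    while True:
        state = make_state()          # throw away the possibly corrupt state
        try:
            for step in steps:
                run_step(step, state)
            return state, restarts
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise                 # give up loudly instead of looping forever


calls = {"n": 0}


def flaky(state):
    calls["n"] += 1
    if calls["n"] == 1:
        1 / 0                         # "bug-detected" error on the first run


def lost_link(state):
    raise DomainError("connection lost")


final_state, restart_count = supervise(lambda: {"warnings": []}, [flaky, lost_link])
```

Here the divide-by-zero triggers one restart with fresh state, while the lost connection is recorded as a warning and the run carries on.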

_cx2w · on June 1, 2015

> Reliability was favored over other things (for example recursion was discouraged).

This sounds really strange to me. So may I ask why? I find that recursion - most of the time - helps shorten and clarify the code. Also, doesn't recursion make induction proofs trivial?

phleet · on June 1, 2015

It's a lot harder to reason about memory constraints on recursive programs.

Focusing on the clarity and correctness of the code often ignores the possibility of stack overflow. Most naive implementations of DFS will hit the stack limit given trees that are all one long path from a single root to a single leaf.
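To make the degenerate-tree case concrete, here is a small Python sketch. The dict-based tree and the function names are made up for illustration, and the exact depth at which recursion fails depends on the interpreter's recursion limit.

```python
def make_chain(n):
    """Build a degenerate tree: one long path of n nodes from root to leaf."""
    root = {"children": []}
    node = root
    for _ in range(n - 1):
        child = {"children": []}
        node["children"].append(child)
        node = child
    return root


def height_recursive(node):
    """Naive DFS: call-stack depth grows with the height of the tree."""
    return 1 + max((height_recursive(c) for c in node["children"]), default=0)


def height_iterative(root):
    """Same traversal with an explicit stack that lives on the heap."""
    best = 0
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        for child in node["children"]:
            stack.append((child, depth + 1))
    return best


chain = make_chain(50_000)
iterative_height = height_iterative(chain)

try:
    height_recursive(chain)
    recursion_failed = False
except RecursionError:
    recursion_failed = True           # the recursive version ran out of stack
```

Python at least raises a catchable `RecursionError`; on a small embedded target the same mistake can silently overwrite memory instead.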

mturmon · on June 1, 2015

Further to your point, here's the guideline, and rationale, from Gerard Holzmann's document on recommended coding practices for C at NASA/JPL:

1. Rule: Restrict all code to very simple control flow constructs – do not use goto statements, setjmp or longjmp constructs, and direct or indirect recursion.

Rationale: Simpler control flow translates into stronger capabilities for verification and often results in improved code clarity. The banishment of recursion is perhaps the biggest surprise here. Without recursion, though, we are guaranteed to have an acyclic function call graph, which can be exploited by code analyzers, and can directly help to prove that all executions that should be bounded are in fact bounded. (Note that this rule does not require that all functions have a single point of return – although this often also simplifies control flow. There are enough cases, though, where an early error return is the simpler solution.)

This is rule 1 of 10, so he apparently feels strongly about "banishing recursion." Gerard was formerly at Bell Labs and is also a fellow of the ACM and a member of the NAE.

flogic · on June 2, 2015

It's also important to note that these rules are made for critical control systems that tend to be low level. The cost benefit trade offs aren't going to be the same as in typical business software.

eropple · on June 1, 2015

It's not a bad question, but approaching it from a CS perspective will cause you to blow your foot off--because it's not about code length or code clarity, it's about safety (which is orthogonal). Your stack's of a finite length, and eventually will grow into the heap unless your system has protections against it.

In most systems lots of really important stuff is allocated at the bottom of the heap. It's very easy for a clobbered global flag (yes, hissss, globals, these are very constrained computers we're talking about here) to cause a system to have its shit get real at an alarming rate.

noir_lord · on June 1, 2015

The issue is largely that unbounded recursion is quite easy to do accidentally (in many of the languages that were used in the past), with the resulting stack smashing causing issues.

Also, many of these systems were hard real-time, as in "if we don't respond in under 30ms something expensive goes bang", and again recursion can cause problems with that. Lots of these systems are interrupt driven and have no garbage collection or threading, so you can't just pre-empt them in that event, since by the time you spot the problem you've blown through your deadline and something has gone bang.

acomjean · on June 2, 2015

Yup. We had some processes running with their own cpu with interrupts turned off, so if the process went weird it meant reboot (as we discovered the hard way one day). So we tried to keep code simple.

On the plus side you had a pretty good idea about how long the max processing would take (and avoid the timeouts and aforementioned "bang"), as the OS couldn't interrupt us. Certain system calls couldn't be made while in what we called "soft real time". Memory allocation was done upfront.

The process control of that system was interesting. You could assign processes to processors or groups of processors and then give those groups a scheduling method. I haven't seen anything like it in the years since I left.

VLM · on June 2, 2015

Non-recursive code makes it waaay "easier" to troubleshoot finite-precision floating-point issues and failures of the first part of "be liberal in what you accept and conservative in what you send".

If you assume infinite precision arithmetic and a very friendly environment for inputs, recursion always looks simpler, but by the time you clean it up to handle real world issues, non recursion instead looks simpler.

It's too easy to write recursive end conditions along the lines of "if x == 42" when your helpful floating point routine somehow mysteriously rounded x to 42.00000001 so it'll never be equal, or "no (supposedly) UTF-16 encoded string would ever have an odd number of bytes, even though I have no control of the source and the source is known to occasionally be insane", or at least that's how I remember it. I've run into both. It's not funny at the time but in retrospect it's usually fairly hilarious.

Personally I think it's harder for people to understand concurrency issues WRT recursion, but I'll probably just get flamed for that one. I feel more people have "leveled up" with concurrency and non-recursive code and functional style programming than have leveled up to include recursion in that mix. Imagine two (three?) concurrent recursive algos fighting each other over one data structure.
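The floating-point half of that complaint is easy to reproduce. The sketch below is a hypothetical Python example (the function names and the 0.1 step are mine): an exact-equality end condition is never met, while a tolerant comparison plus a hard iteration bound terminates.

```python
import math


def reaches_exactly(target, step, max_steps=1000):
    """End condition written as x == target; may never be met."""
    x = 0.0
    for _ in range(max_steps):
        if x == target:
            return True
        x += step
    return False


def steps_to_reach(target, step, max_steps=1000):
    """Tolerant end condition, with a hard bound as a second line of defence."""
    x = 0.0
    for n in range(max_steps + 1):
        if math.isclose(x, target, abs_tol=1e-9) or x > target:
            return n
        x += step
    raise RuntimeError("did not reach target within max_steps")


exact_hit = reaches_exactly(1.0, 0.1)   # False: ten additions give 0.9999999999999999
tolerant_steps = steps_to_reach(1.0, 0.1)  # 10
```

With unbounded recursion in place of the bounded loop, the first version would keep going until the stack ran out.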

ArkyBeagle · on June 1, 2015

Every recursive algorithm has an equivalent iterative dual. Iterative solutions ARE induction :)

If you are careful about test vectors, you can pseudo-exhaustively prove that an iterative dual to a recursive algorithm is equivalent.

albinofrenchy · on June 2, 2015

In addition to what others have said here, recursive calls can often be trivially optimized into loops by the compiler.

This is very convenient in most cases, but could hide the fact that a direct recursive call will overflow your stack.

This isn't apparent and would test out ok until someone makes a small, seemingly insignificant change which the compiler can't do tail call optimization on, and all of a sudden things fail.

robotresearcher · on June 1, 2015

tldr version: Because you want (i) an acyclic function call tree of (ii) predictable depth.

eropple · on June 2, 2015

That reminds me--a friend was telling me earlier today about a piece of software he was working on that only allowed forward jumps. (It ensures that the program halts.)

TheLoneWolfling · on June 2, 2015

It does slightly more than that - it provides a cheap computation of the upper bound of how long it will take.

eropple · on June 2, 2015

Yeah, that too. Come to think of it, that was probably more his concern than just halting.

TheLoneWolfling · on June 2, 2015

Going out on a limb here, but I don't suppose it was for a superoptimizer?

golergka · on June 2, 2015

> And when they built the physical radar, the software worked.

I can only imagine the feeling of satisfaction you and your colleagues felt at that moment.

shanemhansen · on June 1, 2015

I've seen some software that does lots of tests on startup. One piece of software I used (ATG) validated all the ORM mappings against the currently configured database on startup by default.

airza · on June 1, 2015

My experience with testing avionics controllers was: Everyone seemed to have the correct idea (that bugs were basically Not Allowed.) The company set up enough testing so that those bugs were eventually eliminated. However, the difference between good projects and bad projects was mostly the amount of time and money that this took.

My favorite was one where we had an entire test harness written in Python that could completely control the operation of the device being tested in a way that resembled human input. Code was first written in Ada against a monolithic requirements document, while testers wrote their standard test cases against the same document. After the exhaustive amount of testing that took place by developers, testers themselves had the freedom to create contrived test cases that might have escaped the attention of devs (What if we just turn the machine on and off 10 times, because why not?).

This had the advantages of a formal software process as well as the ability to exploit human creativity. It also led to me losing a bet that a doppler radar can't be fooled with an empty potato chip bag.

nchelluri · on June 2, 2015

Care to elaborate on the potato chip bag story? Sounds intriguing.

mbrameld · on June 2, 2015

My wild guess, loosely based on my time spent writing software to analyze observed radar signals: I believe the parent probably opened the bag along the seams to get one flat piece of reflective (on the inside of the bag) material. I know weather balloons are often reflective so they can be tracked by radar. I'm guessing a full-size potato chip bag opened flat would be big enough for the radar to see. I couldn't tell you if that's what the parent meant by fooling the radar or if something else was done to the bag to make the radar interpret it as an aircraft.

airza · on June 2, 2015

This is pretty much it. My boss was definitely a level 99 duct tape programmer, and this was one of the many examples of his ingenuity.

Aleman360 · on June 1, 2015

Worked on anesthesia machines for a few years. Since both hardware and software are involved, the testing was quite extensive.

* Lots of manual testing. While we did unit testing and some automated integration testing, most defects were found using exhaustive manual testing by trained engineers.

* Randomized UI testing. Used UI automation to exercise the UI with various physical configurations of the system. Would often run this overnight on many systems and analyze failures every day.

* Extensive hazard analysis. Basically, we wrote down everything that could possibly go wrong with the system (including things like gamma radiation), estimated the likelihood and harm, and then listed mitigations. The entire system could run safely even if there was a full power failure ("fail safe").

* Detailed software specifications, each of which was linked to manual test cases. Test cases were signed off when executed.

* Animal testing for validation. We went to a vet school and put a bunch of dogs under and brushed their teeth.

* Limited release for production. We would launch the system at one or two hospitals and monitor it for a few weeks before broader release.

click170 · on June 1, 2015

How does an anesthesia machine fail safe?

What I mean is, does it continue applying anesthetic in the event of a power failure or does that stop entirely?

Aleman360 · on June 1, 2015

It can operate completely mechanically. There is an integrated UPS in case mains fails. If the battery also fails, a pneumatic whistle goes off to alert the user. If for some reason something goes really bad, they switch to using an ambu bag, typically hung on the back of the machine.

david-given · on June 1, 2015

I like the idea of the pneumatic whistle. A pressure vessel with a spring-loaded valve held closed by a solenoid, maybe? However it works, it's a really neat piece of lateral thinking.

ux-app · on June 1, 2015

A kind of "dead man's whistle", I guess?

manarth · on June 1, 2015

You just have to look at Therac-25 for the risks of relying on software interlocks alone. One of the main pieces of feedback was the lack of hardware interlocks. Whether that's possible on anesthesia machines I don't know, but the prevailing wisdom is to include hardware interlocks wherever that's feasible, for any critical life-supporting equipment.

ploxiln · on June 1, 2015

I don't disagree that "hardware interlocks" are a good idea for such equipment. But now that I think about it, I'm annoyed that we can't know if software was competently written and tested - is it really that different from hardware in that respect?

How about, we require that the source be open, and if it's too convoluted for the hospital's respected experts to check, then it fails inspection?

Hardware uses many standard parts and materials, and similarly, the software could use a few plain-simple-standard libraries, like libc (but not the floating point functions), zlib, libpng.

The Therac-25 case was just incompetence. The vendor was told about the problem, but was in denial, then later supplied a hardware fix which didn't fix the problem. The problem had to be thoroughly investigated and proven by a doctor and operator over many months. Why couldn't the vendor have investigated more thoroughly themselves, in a week or so? Why weren't they more careful about race conditions? (The problem was triggered by a human able to type into the interface too fast. An actual human.)

ISL · on June 2, 2015

Hardware kills people too, sometimes in subtle ways. http://en.wikipedia.org/wiki/United_Airlines_Flight_232

The combined utility of hardware and software interlocks is that they're complementary:

Sometimes it's easy to specify an interlock in the language of hardware: Never, under any circumstances, should it be possible to slew an avalanche-control howitzer to point at permanent structures; let's use a steel pipe to block the barrel from traversing beyond safe limits.

Sometimes it's easy to specify an interlock in software: Never, under any circumstances, should a rocket launch unless every desk at mission control has authenticated their assent with the main control system.

When interacting with the real world, real-world interlocks are handy, but they're hardly sufficient to guarantee safety. Nothing is.

Aleman360 · on June 1, 2015

Not sure what you mean by "interlocks", but the hardware was quite distributed. Each critical component had its own board and industrial microcontroller. And we had various levels of watchdogs keeping track of system health at all times.

david-given · on June 1, 2015

Interlocks are usually fairly crude safety measures, normally in hardware, to make sure that particular combinations of events cannot happen.

The Therac-25 is a famous comp.risks cautionary tale. Among the many, many design misfeatures (if you haven't come across it, it's worth a read) was the one that killed people:

It was capable of providing two kinds of radiation therapy; electron beam radiation and X-ray radiation. It worked by having an electron beam generator which could be operated at either high power or lower power. Low power was used directly. High power was only used to irradiate a tungsten target which produced X-rays. (I'm simplifying here.)

You can probably guess what went wrong; people were directly exposed to the high power electron beam. Several of them died.

The obvious interlock here (which apparently previous versions had) was to have a mechanical switch which would only enable the high-power beam when the tungsten target was rotated into place. No target, no high power. Simple and relatively foolproof (although it's possible for interlocks to go wrong too).

pdkl95 · on June 1, 2015

Really, every engineer should read the report[1] from the Therac-25 investigation. I would hope anybody that is working on anything that could be potentially dangerous has already read it.

The problems in the Therac-25 went a lot further than just the bad design of the target selection, which had a (badly designed) interlock. It checked that the rotating beam target was in the correct position to match the high/low power setting (and NOT the 3rd "light window" position without any beam attenuator).

While many design choices contributed to the machine's problems, you could probably say that two big design failures led to the deaths associated with Therac-25. One was this interlock, which failed if you didn't put it in place (there was no locking mechanism, either, just a friction stopper). If the target was turned slightly, the 3 micro-switches would sense the wrong pattern (bit shift)... which was the pattern for one of the OTHER positions.

There was also a race condition in the software that would turn on the beam at a power MUCH higher than it is ever used. This race was only triggered when you typed in the treatment settings very quickly, which is why the manufacturer denied there was a problem: when they tried to recreate the bug by carefully - that is, very slowly - following the reported conditions, it never failed.

Therac-25 is an incredibly powerful lesson in what we mean by "Fail Safe", and why it is absolutely necessary to have defense in depth. Fixing the target wouldn't have fixed the race condition power-level bug. Fixing any of the software wouldn't have fixed the bad target design that could be turned out of alignment. Oh, and they had a radiation sensor on the target (which could shut off the machine as another independent layer of defense)... but they mounted it on the turnable target, so the micro-switch problem allowed the sensor to be moved away from the beam path.

The really telling thing, though, is how the previous model acted. It was not software controlled, and was an old-style electromechanical device. It turns out the micro-switch problem existed there as well (among other problems)... and it would blow fuses regularly. Which was yet another layer of safety. It turns out that when they upgraded it to a software-based control system, they got cheap and took out all those "unnecessary" hardware interlocks and "redundant" features. There is a lot of blame to go around, but this is where I put most of the responsibility. You never assume one (or even a few) safety feature will work - the good engineer assumes it will all break at any moment, and makes sure that it will still Fail Safe.

> (although it's possible for interlocks to go wrong too)

If there is one lesson to learn from the Therac-25, this was it. Things break, mistakes happen, and when you're building a device that shoots high-energy x-rays at people, you need to assume that everything did go wrong, and make sure the rest of the device can safely handle that situation.

[1] http://sunnyday.mit.edu/papers/therac.pdf

manarth · on June 1, 2015

"The really telling thing, though, is how the previous model acted…it would blow fuses regularly"

Good. When a fuse blows, it shows something is wrong, and needs fixing. Replacing the fuse with a nail or something else that doesn't blow is a sure-fire way to set the thing on fire. Bad enough for a desk-lamp, a little worse for a radiotherapy machine.

Sounds like people were irritated by fuses blowing, and decided to simply short-circuit the fuses instead.

pdkl95 · on June 2, 2015

The people using the machine at the hospital would replace the (expensive) fuses when they blew. It was the manufacturer that made the later model (the Therac-25) that didn't have the fuses (and other "old" hardware features).

Obviously, something was still very wrong. User error (or other bugs? I'm not sure) in the older hardware and the infamous race condition in the software-controlled Therac-25 were causing the beam to turn on at some shockingly high amount of power. The better design of the older models saved people's lives by simply blowing fuses when the power went too high.

You could, perhaps, blame the poor communication between the hospitals and the manufacturer, because the fuse problem should have caused a bit of a panic among the engineers who designed the machine.

lambdaelite · on June 1, 2015

A familiar interlock would be the mechanism used to disable a microwave magnetron when the door is in the open position.

Sophistifunk · on June 1, 2015

Which apparently doesn't work so well, according to radio astronomers :-/

Kinda makes me worry about how many times I've microwaved the meat-n-two-veg just a bit.

sarchertech · on June 2, 2015

Damage to the testicles is caused by the thermal effects of the microwave, so it's likely you would have noticed.

TheLoneWolfling · on June 2, 2015

The other thing is to not ditch the software interlocks just because there are hardware interlocks. And never ignore interlock triggers.

watermelon59 · on June 1, 2015

> As for the code itself, its perfection came as the result of basically the opposite of every trope normally assigned to "coder." Creativity in the shuttle group was discouraged; shifts were nine-to-five; code hotshots and superstars were not tolerated; over half of the team consisted of women; debugging barely existed because mistakes were the rarest of occurrences. Programming was the product not of coders and engineers, but of the Process.

Serious question: where do I get a job like this? It's my dream way of programming professionally.

digikata · on June 1, 2015

Any field where the bug could be catastrophic. The only reason the shuttle group was sustainable the way it operated was because a bug in the software they worked upon was in that category. Off hand, commercial space, deep space science, aviation, medical, nuclear, & military systems (not all) are that way. Be warned that it's very slow moving, and your skills in relatively new hardware/software stacks will basically go out of date. Your skills will be in being a specialist in whatever the environment is of the system you're working upon - much more so than being a general software engineer. It's not bad, but it's very different than the typical HN environments.

aaronharnly · on June 2, 2015

This type of coding is I think best understood by the following tradeoff: they know exactly what the code must do; and the loss if it fails to do that thing is enormous; so it is worth expending enormous effort to ensure it does that thing exactly.

In most user-facing software in the Internet age, the reverse is true: we do not really know what the software should do; but the penalty for a bug is not great; so, it is not worth expending enormous effort to be bug free; instead it is better to expend great effort to be nimble and find out just what it is the software should be doing to begin with.

Very different environments yielding very different methodologies. I can't say one is better, just de gustibus non est disputandum.

acomjean · on June 1, 2015

Large corporations with government contracts have this kind of work. Probably financial institutions too.

The coding process is slower than most people are used to and can become frustrating.

DannoHung · on June 1, 2015

Financial institutions do not necessarily do things this way. Some parts might, but I don't have any experience with them. The parts I do have experience with it is utterly a miracle that anything works.

noir_lord · on June 1, 2015

> The parts I do have experience with it is utterly a miracle that anything works.

That is my feeling about every industry I've worked in.

The stuff that runs telecoms (mostly billing side) particularly is the stuff of nightmares.

aswanson · on June 2, 2015

Seriously? Billing code? I would have imagined that code would be so subject to customer complaint that it would be forced into quality.

gtirloni · on June 2, 2015

I've worked in a support team for a telecom billing system and I was tasked with interacting with our development team to investigate bugs in production (and eventually moved to the development team). These systems are created just like any other commercial system, without any formal proofs and minimum requirement docs. To make things worse, they have to be flexible enough to support all and any billing plans that the business might come up with, so there are a lot of moving parts.

As other people have said here, nobody wants to touch it. Developers would often limit themselves to fixing just a small portion of the code even though they thought the overall system could be improved in many ways, for fear of breaking something, causing a few million dollars of damage and getting fired. There was no assurance that any part of the systems should work like this or that.. only some vague expectations.

You're right, that would be a system that should be built from scratch with that kind of concern but unfortunately it's not.

annnnd · on June 2, 2015

It is a nice challenge how to rewrite such a system. One way would be to build a parallel system, identify all inputs and duplicate them to the parallel solution and then compare the outputs; in case of discrepancies fix the erroneous system. Once the systems produce same results (or once the new system produces better results than the old one) you just switch the systems.

The rewrite doesn't have to be complete; it can (should) be done in pieces of course.
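A rough sketch of that parallel-run idea in Python. Everything here is hypothetical: `shadow_compare`, `legacy_bill`, and `rewritten_bill` are invented for illustration, and the billing rules are made up (amounts in cents). The same inputs go to both implementations and every disagreement is collected for someone to investigate.

```python
from typing import Callable, Iterable, List, Tuple


def shadow_compare(
    old_system: Callable[[dict], int],
    new_system: Callable[[dict], int],
    inputs: Iterable[dict],
) -> List[Tuple[dict, int, int]]:
    """Feed each input to both systems and collect the cases where they disagree."""
    discrepancies = []
    for record in inputs:
        old_result = old_system(record)
        new_result = new_system(record)
        if old_result != new_result:
            discrepancies.append((record, old_result, new_result))
    return discrepancies


# Hypothetical billing rules, invented for this example (amounts in cents).
def legacy_bill(record: dict) -> int:
    return record["minutes"] * 10


def rewritten_bill(record: dict) -> int:
    # Deliberate difference: the rewrite gives the first 100 minutes free.
    return max(record["minutes"] - 100, 0) * 10
```

An empty result only says the two systems agree on the inputs seen so far. Deciding which side is wrong when they disagree still takes a human.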

noir_lord · on June 2, 2015

"Replace in place" is fairly common.

The big issue is duplicating the system while it is still morphing in production for all the edge cases; it often feels like trying to paint a moving bus.

jacobsenscott · on June 2, 2015

I would imagine this is a part of the code everyone is terrified to touch. When bugs do turn up they would be fixed by tactical 'if' statements. It would be a shit pile of hacks built up over many years.

qrybam · on June 2, 2015

Can add anecdotal evidence that people are afraid of changing a company's billing system, as it may result in bills that are different enough to raise either client or internal concerns, so best to leave it alone, right?

andrewchambers · on June 2, 2015

support staff are probably manually correcting problems when someone calls up.

yitchelle · on June 1, 2015

Think companies that build products that deal with human safety, i.e. automotive, military, medical, aerospace etc. Each of those industries will have its own definition of what a good process is, and some are much stricter than others.

I work in automotive, so it is governed by processes such as ASPICE and ISO 26262.

lambdaelite · on June 1, 2015

In my experience, engineering firms that work on software-intensive projects.

__z · on June 1, 2015

Government contracting.

bargl · on June 1, 2015

I'm currently writing software that is non-critical for satellites. It's non-critical in the manner that if we get things wrong our company will lose millions of dollars but the satellite won't burn any resources that it can't get back.

We are currently porting code over from C++ to a C# system with parallel computation. The current system has been flying for a long time but has no testing and is tied to a bad UI. So we are re-writing.

That said, accuracy is number one. We have a pretty solid method for testing so far. We now have some robust input scenarios and we know that we want to get a specific output. So we are able to do fairly robust automated "regression" testing. If the numbers don't match then we have an issue, and we have to fix that before moving on.

After every validation that the new code gives us acceptable margins of error we wrap it up with unit tests so that we can then modify the code to try to optimize. Our testing is integrated from the highest level of the code down (I know backwards) but that's how we know we can validate the input.
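The kind of check described here, with recorded scenarios compared against known outputs within a margin of error, might look roughly like the Python sketch below. It is an assumption-laden illustration, not the actual test setup: `GOLDEN_CASES`, `new_mean`, `regression_failures`, and the tolerance value are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical golden cases: scenario name -> (inputs, output recorded from the trusted system).
GOLDEN_CASES: Dict[str, Tuple[Sequence[float], float]] = {
    "flat": ([2.0, 2.0, 2.0], 2.0),
    "ramp": ([1.0, 2.0, 3.0, 4.0], 2.5),
}


def new_mean(values: Sequence[float]) -> float:
    """Stand-in for the ported calculation under test."""
    return math.fsum(values) / len(values)


def regression_failures(
    compute: Callable[[Sequence[float]], float],
    golden_cases: Dict[str, Tuple[Sequence[float], float]],
    rel_tol: float = 1e-9,
) -> List[str]:
    """Return the names of scenarios whose output falls outside the tolerance."""
    failures = []
    for name, (inputs, expected) in golden_cases.items():
        actual = compute(inputs)
        if not math.isclose(actual, expected, rel_tol=rel_tol):
            failures.append(name)
    return failures
```

`regression_failures(new_mean, GOLDEN_CASES)` returns an empty list as long as the port agrees with the recorded outputs. Any name in the list is a scenario to fix before optimizing further.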

We have a lot of testing and a long schedule. If this weren't critical software we'd have a much shorter turn around on what we are writing. We also work very closely with subject matter experts on every change we make. We have a guy who's been working with this software (and the subsequent theory) for 20 years. He's open to change, but he also validates everything so we don't accidentally change the output when we're optimizing.

david-given · on June 1, 2015

C# is an odd choice for avionics software, surely? It's got a really heavyweight runtime and nondeterministic behaviour due to the garbage collector. I assume this is the satellite's application layer and isn't real time?

...and what are you running it on? I wasn't aware of any embedded operating systems which supported C#!

snops · on June 1, 2015

I'm guessing it's actually running on a ground station, providing commands or analysing data from satellites, hence why it's non-critical and the need has arisen for parallel computation, presumably to speed it up.

Many ARM embedded systems can support C# with the open source .net micro framework (http://www.netmf.com), which doesn't require any OS, and was originally developed for Microsoft's SPOT watch. I haven't used it myself, and I agree with you that it doesn't sound ideal at first for realtime apps, but non-realtime and soft-realtime embedded applications are common too.

bargl · on June 2, 2015

Yeah it's a ground station application. So it isn't running on any embedded hardware. We'd have done something different for that.

Performance isn't the only thing we are considering. We want to get improved performance, but our old analysis code was extremely hard to maintain, so that went into the decision as well. Honestly, I'd probably pick another language, but I wasn't on the project when it started.

I personally would have liked to do this with F# because of how functional it is at its core, but that's cuz we have a lot of Microsoft expertise in house.

Also one thing with the engineering apps is that anything where engineers (not software engineers) don't have to learn a new language is going to be an easier sell.

henrik_w · on June 2, 2015

The story of the Boeing engineers flying on the test flights is a perfect example of "skin in the game" (from Antifragile by Nassim Nicholas Taleb).

Here's what I wrote about that in a blog post on Antifragility and SW development:

At the end of the book, there is a chapter on ethics that Taleb calls “skin in the game”. To have skin in the game, you should share both in the upside and downside. Taleb quotes the 3,800 year old Hammurabi’s code: “If a builder builds a house and the house collapses and causes the death of the owner – the builder shall be put to death”. It is interesting to view this from a software development perspective. I have never worked on software where people’s lives were in danger if the software failed, but I would not be willing to submit to Hammurabi’s code if I did. But I think a little less extreme form of skin in the game is actually very good. Being on call for example. If the software you wrote fails, you may get called in the middle of the night to help fix it. I have been on call at most of the places I have worked in the past, and I think it has a lot of benefits. It gives you an incentive to be very thorough, both in development and testing. It also forces you to make the software debuggable – otherwise you yourself will suffer. Another way of introducing skin in the game is dog-fooding – using the software you are developing in your daily work. I have never worked on software that we have been able to dog-food, but I think that is another great practice.

http://henrikwarne.com/2014/06/08/antifragility-and-software...

dbdr · on June 3, 2015

The examples you are giving are all about sharing the downside. What are the options for the upside?

markbnj · on June 1, 2015

It's essentially an analog of the concept of tolerance in the physical world of manufacturing and assembly. The less your tolerance for error the more formal and carefully controlled the process, and the more money spent in testing, verification, feedback, and improvement.

And yet you can still measure one value in the metric system and another in English units and drill a smoking hole in Mars. It was sort of striking to read Charles Fishman's statement about the software being bug free, followed immediately by the supporting fact that the last three versions had one bug each. If they had one bug, how are you 100% sure they didn't have two?

eck · on June 1, 2015

> statement about the software being bug free

I bet they required the guy who delivered the rocket fuel to sign something saying it contained no impurities, the guy who delivered the external tank to sign something saying it did not leak, etc... why should the software guy be special?

Yeah I know. We're special. But the world doesn't always see it that way.

superuser2 · on June 1, 2015

No. That it contained under n ppm of impurities, that it did not leak more than 0.001%/year, maybe. Only a Sith (and consumer product marketing that is lying to you) deals in absolutes.

In the durable goods world, you don't pretend things are perfect. Failure modes are designed and disclosed, replacement of parts is expected and made reasonable, tolerances are marked, failure rate metrics like MTBF are known, and as a customer you choose the price-quality tradeoff that makes sense for you.

I just wish consumer products were also sold this way. Instead we pretend every product is awesome and act surprised when things break.

ufmace · on June 2, 2015

That reminded me of phone plans, oddly enough. For a while, everybody was advertising their data plans as unlimited, and then the tech press would get all upset when they found the limit.

I always thought the whole thing was dumb - of course there's a limit. Or are we supposed to believe we can push megabytes/s nonstop all month? I'd rather have them just tell me what the limit is and what happens when you go over it than pretend it's unlimited. And stop having the tech press act like the sky is falling when they discover that the unlimited plan actually has a limit.

_ofdw · on June 2, 2015

You're okay with scummy advertising that probably breaks false advertising laws?

tedunangst · on June 1, 2015

At least some consumer goods have similar limits, e.g. maximum amount of arsenic permitted in apple juice. Of course, whenever people find out about these limits, they lose their shit... "Why is any arsenic ok? OMG! The evil government is trying to poison us!"

lambdaelite · on June 1, 2015

The person in charge of a mass spectrometer (Thermo Element 2) at a local college likes to toy with his students by running tests on their commercial bottled water during a lab demonstration. This particular instrument can pick up minute traces of uranium, lead, and the like (on the order of fg/L IIRC).

At the end of the demonstration, the trash can is full of discarded water bottles.

pwnna · on June 1, 2015

> It was sort of striking to read Charles Fishman's statement about the software being bug free, followed immediately by the supporting fact that the last three versions had one bug each. If they had one bug, how are you 100% sure they didn't have two?

I think you're being overcritical here. It's really really really difficult to reach 100% in any real form of measurement. So I think when they mean "bug free", they probably mean the chance of a bug is below some threshold of probability. The famous six sigma rule comes in mind.

I do grant you that this is not specified explicitly in the article, but they do say: "each 420,000 lines long - had just one error each. The last 11 versions of this software had a total of 17 errors. Commercial programs of equivalent complexity would have 5,000 errors."

jacquesm · on June 1, 2015

Because it is so difficult you should not make statements that contain fragments such as 'bug free' unless they are preceded by a negation.

markbnj · on June 1, 2015

>> So I think when they mean "bug free", they probably mean the chance of a bug is below some threshold of probability. The famous six sigma rule comes in mind.

No doubt, and that was in fact my point. I think most of us would be very reluctant to use the phrase "bug free," and in this case his statement was obviously meant to be in stark contrast to that reluctance.

pwnna · on June 1, 2015

I think this is more of a technicality that's very subjective. I feel that "no" depends on the probability of something happening as well as how catastrophic it is.

For example, after buying a lottery ticket, I can pretty comfortably say that I'm not going to win. Sure, there is a chance, but is it really going to happen? No.

As another example, if I put down a waterbottle on my bike seat, there's a 1% chance of the water spilling onto the sidewalk. At this point, I'm also comfortable saying that it's not going to happen. However, if the waterbottle has a 1% chance of exploding when I put it down, I don't think I'll be comfortable saying that it won't happen. The chance of that must be much much lower before I can accept it.

I would like to give the person who said that the benefit of the doubt. I'm sure they understand the implication of bug free, but it's likely just easier to say that rather than explaining stats/tradeoffs to a journalist.

x5n1 · on June 1, 2015

Bugs? You mean features!

zyxley · on June 1, 2015

For people interested in the idea of formal verification, you may want to look at TLA+ (https://en.wikipedia.org/wiki/TLA%2B) and PlusCal (https://en.wikipedia.org/wiki/PlusCal), which have been mentioned on HN before.

They're a specialty system for writing code (and mathematical proofs) where every possible system behavior for a given range of inputs can be examined for safety (outputs within allowed ranges with no unexpected behavior) and for liveness (the expected progression from one output to another).
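To give a feel for what "examining every possible behavior" means, here is a toy Python sketch of exhaustive state exploration. It is not TLA+ or PlusCal and does not use their tooling. The two-process lock model and all names (`find_violation`, `make_next_states`, `mutual_exclusion`) are hypothetical. It checks only a safety property (both processes are never in the critical section together) and does not attempt liveness.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple

State = Tuple[str, str, bool]  # (process 0 step, process 1 step, lock held)


def find_violation(
    initial: State,
    next_states: Callable[[State], Iterable[State]],
    invariant: Callable[[State], bool],
) -> Optional[List[State]]:
    """Breadth-first search of every reachable state.

    Returns the shortest trace to a state that breaks the invariant,
    or None if the invariant holds in all reachable states.
    """
    parents = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parents[state]
            return list(reversed(trace))
        for successor in next_states(state):
            if successor not in parents:
                parents[successor] = state
                queue.append(successor)
    return None


def make_next_states(atomic: bool) -> Callable[[State], Iterable[State]]:
    """Toy lock model. With atomic=False, checking and taking the lock are separate steps."""

    def step(state: State, i: int, new_pc: str, new_lock: bool) -> State:
        pcs = [state[0], state[1]]
        pcs[i] = new_pc
        return (pcs[0], pcs[1], new_lock)

    def next_states(state: State) -> Iterable[State]:
        lock = state[2]
        for i in (0, 1):
            pc = state[i]
            if pc == "idle" and not lock:
                if atomic:
                    yield step(state, i, "critical", True)
                else:
                    yield step(state, i, "checked", lock)  # saw the lock free, has not taken it yet
            elif pc == "checked":
                yield step(state, i, "critical", True)
            elif pc == "critical":
                yield step(state, i, "idle", False)

    return next_states


def mutual_exclusion(state: State) -> bool:
    return not (state[0] == "critical" and state[1] == "critical")


INITIAL: State = ("idle", "idle", False)
```

Calling `find_violation(INITIAL, make_next_states(atomic=False), mutual_exclusion)` returns a step-by-step trace in which both processes see the lock free before either takes it. With `atomic=True` it returns `None`. Real model checkers do the same kind of search over far larger state spaces.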

ef4 · on June 1, 2015

It's all just a question of cost.

We know how to write software that comes arbitrarily close to perfection. But as defects asymptotically approach zero, cost skyrockets.

The interesting question is what technologies can bend that cost/quality curve.

jerf · on June 1, 2015

This is why this discussion sometimes frustrates me. A lot of the defects we have are because you aren't willing to pay for the sort of software that wouldn't have defects. It's natural to read that as a sort of cynical accusation, but instead, I mean it straight... you really aren't willing to pay what it would take, and you shouldn't be. A $1000 Facebook-access app for your phone (that still somehow has some sort of economies of scale going for it, but that's another discussion) might not crash on you and might take a lot fewer phone resources, but there's no way it's going to be $1000 better than what we get today for free for the vast bulk of users.

On the flip side, the cavalier attitude towards those cheap practices among developers who are on the very, very low side of the curve, where huge quality improvements can be obtained for very cheap, also frustrates me. What do you mean you can't be bothered to run the automated test suite that I have handed you on a silver platter on your build server? Are you serious? How can you pass up something so cheap, yet so powerful? And I don't just mean, how can you not be bothered to set it up, etc... I mean, as a professional engineer, how can you justify that?

manarth · on June 1, 2015

"What do you mean you can't be bothered to run the automated test suite that I have handed you on a silver platter on your build server?"

As devil's advocate, why not just run this for me (e.g. on every commit/every push)? Much like the web usability ethos "Why make me think" - why make me work? The lower the barrier to testing - ideally zero, it just happens without the dev having to do anything - the more testing will happen.

I don't often get the chance to set things up this way, but when I do, each dev works in their own git branch, and sends a pull-request with their changes. The test server(s) then run the complete test-suite on the branch, and either note the PR with "Tests passed" or email the dev with "Tests failed" and the reasons. Devs don't need to think about running tests, reviewers/release managers don't need to even consider PRs until the "Tests passed" message shows up…saves time and effort for everyone, and improves code quality. The cost is simply the initial setup time.

jerf · on June 2, 2015

"As devil's advocate, why not just run this for me (e.g. on every commit/every push)?"

In my real-life experience with a multiple-team environment, which is where the question came from, my running the tests doesn't do any good if you're going to consider it "my" server and simply ignore the results.

The key point here isn't a technical one. The key point here was, as a professional engineer, how can you justify not taking such a great bang/buck quality option that will far, far more than pay for itself? Explaining how I can be a professional engineer on your behalf more than misses the point.

Kluny · on June 1, 2015

Keep in mind that this level of testing sometimes isn't available at any cost - my company, for example, even if we were awarded a million-dollar-plus contract for the product we already build, would not be able to come up with the stringent testing that those NASA engineers use.

HeyLaughingBoy · on June 2, 2015

You learn. Or farm it out to the people who know.

lifeisstillgood · on June 1, 2015

Not technologies, process.

I am 90% sure that actually taking any organisation and committing to good known process will raise the game by orders of magnitude - so technologies supporting and enforcing said process will be of benefit

And I think the process looks like this

1. Written requirements up front
2. Total isolation / integration points defined and contractually enforced
3. Test harnesses built first
4. Performance and event metrics built in

Any ideas?

crpatino · on June 1, 2015

#1 is never going to happen. It is a fact of life that if you wait until having anywhere near 99% of understanding of the problem space, you will be driven out of market by the guy that did a quick prototype with mere ~70% and iterated from there. A second fact of life is that people know this and will strong-arm their pet feature into any requirements spec whenever they feel they can get away with it.

#2 and #3 might be feasible, but they will require a massive shift of perception across most stakeholders. It's a politics game, and you will need the perennial support of a very influential sponsor to push through the feet dragging phase.

#4 might be easier to sell, but still requires time and effort to implement. In a sense, this is ultimately also a political matter.

brudgers · on June 2, 2015

No discussion of mission critical software is complete without mentioning Margaret Hamilton:

https://en.wikipedia.org/wiki/Margaret_Hamilton_%28scientist...

istvan__ · on June 1, 2015

The article is a bit misleading: it almost suggests that software correctness is achieved by testing, which is definitely not true. Just for the record, none of the other technical fields use testing as the main way to ensure the safety of safety-critical engineering products. This is not too different for software components in mission critical use cases. There are several ways to build a reliable system out of unreliable parts (like combining 3 different units by the 2 out of 3 principle, etc.) but testing is just not the way. It is a nice to have thing.
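The 2 out of 3 principle mentioned above can be sketched in a few lines of Python. This is a simplified illustration with hypothetical names (`vote_2oo3`, `NoMajorityError`). It assumes the three units report exactly comparable values, which real sensor readings usually do not without some tolerance.

```python
from collections import Counter
from typing import Hashable, Sequence


class NoMajorityError(Exception):
    """Raised when no value is reported by at least two of the three units."""


def vote_2oo3(readings: Sequence[Hashable]) -> Hashable:
    """Return the value that at least two of the three units agree on."""
    if len(readings) != 3:
        raise ValueError("expected exactly three readings")
    value, count = Counter(readings).most_common(1)[0]
    if count < 2:
        raise NoMajorityError(f"all units disagree: {readings!r}")
    return value
```

With readings `[10, 10, 99]` the voter returns `10`, masking the single faulty unit. When all three disagree it raises instead of guessing, and the caller has to decide what the safe state is.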

coderjames · on June 1, 2015

I think the article might be mistaken about one point: "For one thing, the Boeing approach is going out of style or has mostly gone out of style, according to SE poster Uri Dekel (handle: Uri), a Google software engineer."

It absolutely has not gone out of style in avionics software engineering. As a person who writes software for avionics, I can say that extensive design reviews at every step combined with rigorous testing is exactly how we build software. That's how it's done at every avionics software company I've ever worked at (3 so far). Formal methods are generally still too cutting-edge and complicated for many people in this industry.

So maybe Google doesn't bother with design reviews, but those of us writing life-or-death software definitely do.

SnacksOnAPlane · on June 1, 2015

He meant the practice of sending the software engineers up on the first flight.

jarrettch · on June 2, 2015

I've been wondering about this lately. My stepdad has heart failure and they put a heart pump in his chest. It regulates blood flow and settings can be changed, etc. Leave it to a dev to think "How much testing has gone into this thing?" Even one minor slip-up in his blood flow, either too high or too low, could mean a stroke and possibly death. Or, God forbid, the thing crashes somehow and stops working.

That machine is much easier to code than a space shuttle obviously, but I still wondered about it. The tech has to be rock solid. Even one malfunction could cause so much despair in a family, and could also cost your company millions.

mrsteveman1 · on June 2, 2015

I don't know much about heart pumps, but I do have an implanted pacer/defibrillator, and I'm currently not too happy with the people who made it.

I've been told by my cardiologist (and engineers working for the manufacturer) that it doesn't "fail safe" if the battery level drops too low to keep the device running (which is inevitable if it isn't replaced after 7-8 years, but it can and sometimes does happen prematurely and without warning).

In that situation, not only can it suddenly become unable to correct an arrhythmia (as expected), it could actually cause one all by itself, or pace above 200BPM for no reason.

No one I've talked to in the healthcare industry seems at all surprised about this for some reason. They just started monitoring it more often the closer it got to the "replace me now" indicator level.

alkonaut · on June 1, 2015

I think in the "normal" software industry we have a skewed picture of quality, simply because it's not the primary focus. For everyday software, it's fine if it works to 95%. Even bugs you have identified must be weighed against new features, and features often win. For the customer it's better to have a piece of software that has all the features they need, but calculates a bad result once every 1,000 times, or crashes a couple of times per day, than a program that doesn't have all the features. It's also the "release early release often" thing where the cheapest testers are your end users. That is far from the "release once patch never" of rockets. So feature bloat and suffering quality isn't really due to bad practices, it's an active choice.

VLM · on June 1, 2015

Could have at least mentioned DO-178C

http://en.wikipedia.org/wiki/DO-178C

x0x0 · on June 1, 2015

I worked on an emr.

Each developer got between 1/2 and 3/2 QA people, in addition to dedicated QA engineers. You had to submit detailed test plans for features. Surprisingly, the company didn't expend much effort on unit tests -- they were there, but not heavily emphasized.

amelius · on June 1, 2015

If Amazon can do it, then surely an airplane manufacturer can do it [1]

[1] http://cacm.acm.org/magazines/2015/4/184701-how-amazon-web-s...

kbenson · on June 1, 2015

As much as that helps, it doesn't yield software, or systems, without faults. The AWS wikipedia page lists a few major outages of AWS[1]. My recollection is that they are generally caused by cascading failures and complex service interrelationships, which are likely much harder to formally prove, if possible.

1: http://en.wikipedia.org/wiki/Amazon_Web_Services#History

packetslave · on June 1, 2015

that Amazon paper is talking about design/algorithm verification using formal methods. This does NOT verify that the design or algorithm in question is actually implemented correctly and matches the design perfectly.

lambdaelite · on June 1, 2015

And as I recall, TLA+ doesn't tie into code generation which doesn't help with the verification process.

mariopt · on June 1, 2015

It depends on the software budget and managers. When I worked at a company, one of the CEOs had an idea about adding a camera to traffic lights so that cars wouldn't have to stop if the road is empty. I asked: What happens if the sunlight hits the camera too much? The guy laughed in my face and told me I was being ridiculous. I left the company some months later for other reasons, but it was pretty scary to me to hear such words at that time. If you're writing software that might cost someone else's life, neither budget nor time should be a limitation. I'm not aware of any software security regulation. If I'm going to release critical software, does some entity exist to validate my code? This kind of thing should exist to block companies that only see the profit of getting a contract.

manarth · on June 1, 2015

Genuinely, in road safety, the company typically considers and weighs the cost of fixing an issue, vs the cost of lawsuits in the event of death/injury.

One paper on the topic talks about the Ford Pinto fuel system design: http://users.wfu.edu/palmitar/Law&Valuation/Papers/1999/Legg...

The GM ignition-switch recall also sparked a similar debate: http://en.wikipedia.org/wiki/2014_General_Motors_recall

So it's not uncommon that economics outweighs risk-to-life in a lot of businesses.

rodgerd · on June 2, 2015

Which is one reason people should be a lot more suspicious of "tort reform" than they are.

rexignis · on June 1, 2015

Additional reading material: some of NASA's own rules for safe code (with explanations of each! love the stack one, free memory management :P).

http://spinroot.com/gerard/pdf/P10.pdf

rzzzt · on June 1, 2015

JPL also published a C coding standard, which details language constructs that one should and shouldn't use in a mission critical embedded system. Some of the rules make a reappearance there (the "Power of Ten" article is mentioned in the introduction).

http://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf

zobzu · on June 2, 2015

"There's a serious move towards formal verification rather than random functional testing," he writes. "Government agencies like NASA and some defense organizations are spending more and more on these technologies. They're still a PITA [pain in the ass] for the average programmer, but they're often more effective at testing critical systems."

This part is kinda cool because there's also always been a movement toward this for open-source/commercial security software, at least for the kernel. Of course, we have no such thing today (as in, I doubt anyone here runs such a system in production), but the interest is there.

guelo · on June 1, 2015

The kind of process needed to achieve high quality is just not that fun. Most of the programmers I know would run screaming, complaining about the bureaucracy and red tape.

on June 1, 2015

[deleted]

manarth · on June 1, 2015

"ship both a and b, whose output always has to match exactly or it reports a fault and the component requires replacement."

And when that software disagrees…but the plane happens to be at 40,000 feet? The plane just stops running until the component is replaced?

I don't know too much about plane hardware, but I scuba-dive a rebreather which has critical life-support electronics.

It has two independent computers, and three O2 oxygen-pressure cells. The 3 cells report their reading of O2 pressure to both computers. Both computers simultaneously display the pressure on independent displays. One computer is primary (active, controlling the O2 pressure), whilst the secondary is display-only.

Both computers use a majority-rule…the two oxygen cells with the closest value win, whilst the third is ignored. This could potentially be fatal - two failing cells can report incorrect pressures and win the vote - so rebreather divers are also taught manual techniques to validate the computer readings (such as a diluent flush which is expected to produce a known predictable reading).
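To make the voting rule and its failure mode concrete, here is a minimal sketch in Python. It is not the rebreather's actual firmware: the function name is made up, and averaging the two winning cells is an assumption (the unit might just as well use one of them directly).

```python
from itertools import combinations
from statistics import mean


def vote_closest_pair(readings):
    """Return (voted_value, ignored_index) for exactly three cell readings.

    Assumption: the voted value is the mean of the two closest readings.
    """
    if len(readings) != 3:
        raise ValueError("expected exactly three readings")

    def spread(pair):
        i, j = pair
        return abs(readings[i] - readings[j])

    # The pair of cells whose readings differ the least wins the vote.
    i, j = min(combinations(range(3), 2), key=spread)
    ignored = ({0, 1, 2} - {i, j}).pop()
    return mean((readings[i], readings[j])), ignored


# One cell has failed low: it is outvoted and ignored.
healthy_vote = vote_closest_pair([1.30, 1.28, 0.70])

# Two cells have failed low together: they win, and the one
# correct cell (index 2) is the one that gets ignored.
fooled_vote = vote_closest_pair([0.70, 0.72, 1.30])
```

The second call is the dangerous case: the vote is internally consistent and still wrong, which is why the manual cross-checks matter.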

So there are a few techniques beyond simple a/b testing: best of 3 (or more, if available); independent circuits (ideally designed + built by independent manufacturers); manual techniques to give human verification of the data.

Whilst I certainly don't have extensive knowledge of real-time/safety-critical systems, it's clear that there are a lot of techniques, processes and procedures that we wouldn't necessarily be aware of in unrelated tech (e.g. web-dev) that do directly relate to that subject, and might well solve many of the scenarios we come up with.

mitchty · on June 2, 2015

I have a friend that used to do embedded programming for nuclear power plants.

From what he said, they have 5 systems: two running software from vendor a, two from vendor b, and another running software from vendor c for failsafe. A quorum has to agree; if any system doesn't, it gets pulled out of service and analyzed to find out why its output differed. If you get down to one there is still a mechanical failover, but at that point you're already bringing things down. 3 at any time was considered a critical failure. Two at once would be full stop time.
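A rough sketch of that kind of quorum check, in Python. The unit names, the quorum size of 3, and the exact-equality comparison are all assumptions for illustration; I have no idea how the real plant systems compare outputs.

```python
from collections import Counter


def quorum_vote(outputs, quorum):
    """Return (agreed_value, dissenting_units) for a dict of unit -> output.

    Raises RuntimeError if fewer than `quorum` units agree on any value.
    """
    counts = Counter(outputs.values())
    value, votes = counts.most_common(1)[0]
    if votes < quorum:
        raise RuntimeError("no quorum: hand over to the failsafe")
    # Units that disagree with the majority are candidates to pull from service.
    dissenters = sorted(name for name, out in outputs.items() if out != value)
    return value, dissenters


# Hypothetical snapshot: two units each from vendors a and b, one from vendor c.
snapshot = {"a1": 42, "a2": 42, "b1": 42, "b2": 41, "c1": 42}
agreed, to_pull = quorum_vote(snapshot, quorum=3)
```

Here `agreed` is 42 and `to_pull` is `["b2"]`, the unit that would be taken out and analyzed.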

I'd presume planes are similar. I know the space shuttle did the same at least so I assume the technique is common.

digikata · on June 2, 2015

Who writes the software for the quorum? Or is it hardware?

mitchty · on June 8, 2015

Sorry for the late reply. Basically, all the devices communicate with each other and the odd one out voluntarily powers off or gets powered off by the rest.

TheLoneWolfling · on June 2, 2015

Are both computers running the same software? Because if so, that's a single point of failure right there.

(Just like multiply redundant hardware won't help you if there's a design flaw and they all fail at once.)

manarth · on June 2, 2015

Yes, they are…and that's an acknowledged failure risk. A number of divers mitigate that by having a separate independent fourth O2 cell, monitored by a separate computer (from another manufacturer). Divers are trained to watch for any sign that the computer isn't doing its job (and there are certain typical indicators, although they don't cover every possible scenario), and if there's any doubt, switch to an alternative (either bailout to open-circuit, or switch to a fully-manual mode of operation) - either way, the computer is then out of the picture. Of course, fully-manual mode just isn't possible for, say, a passenger jet.

mavdi · on June 1, 2015

Not sure if true, but I heard a couple of researchers in my uni wrote some code for Boeing. They wrote the code in Prolog. I assume because it's easier to formally test it. And they didn't trust the compiler, so they had to check the generated machine code line by line. Apparently.

tjr · on June 1, 2015

If it was DO-178B/C Level A, then yeah, they'd have to inspect the machine code. Regardless of trusting the compiler.

And the compiler would have to be inspected too.

jokoon · on June 1, 2015

It's true that writing software requires only computers, and that it would be too expensive to test it in real situations, but when the stakes are high, maybe it's also important to do live testing?

tormeh · on June 1, 2015

Hardware in the loop (HIL) is a sort of middle ground, where all the expensive stuff that breaks is replaced by ordinary computers, but the actual control hardware is not.

Corporate video by people who do this, explaining it: https://www.youtube.com/watch?t=116&v=YpxPAuHNpdM
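The shape of such a test loop can be sketched in a few lines of Python. This is only an illustration: in a real HIL rig the controller runs on the actual control hardware and talks to the simulator over real I/O, whereas here both sides are plain Python, and the tank model, names and constants are all made up.

```python
class SimulatedTank:
    """Hypothetical stand-in for the expensive plant: a heated water tank."""

    def __init__(self, temp_c=20.0, ambient_c=20.0):
        self.temp_c = temp_c
        self.ambient_c = ambient_c

    def step(self, heater_on, dt_s=1.0):
        # Crude first-order model: the heater adds heat, the tank leaks to ambient.
        heating = 0.5 if heater_on else 0.0
        cooling = 0.01 * (self.temp_c - self.ambient_c)
        self.temp_c += (heating - cooling) * dt_s
        return self.temp_c


def thermostat(temp_c, heater_on, setpoint_c=60.0, band_c=2.0):
    """Controller under test: on/off control with a hysteresis band."""
    if temp_c < setpoint_c - band_c:
        return True
    if temp_c > setpoint_c + band_c:
        return False
    return heater_on  # inside the band: keep the previous state


def run_loop(steps=600):
    """Close the loop between controller and simulated plant; return temperatures."""
    tank, heater_on, history = SimulatedTank(), False, []
    for _ in range(steps):
        heater_on = thermostat(tank.temp_c, heater_on)
        history.append(tank.step(heater_on))
    return history


history = run_loop()
# A safety property to check on every run: the tank never overheats.
assert max(history) < 63.0
```

The point is that the safety property is checked against a model that is cheap to break, as many times as you like, before anything expensive is connected.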

neurotech1 · on June 1, 2015

The mission systems avionics on a F/A-18F costs around $1-2m per aircraft. This includes displays, PowerPC-based AMC boards, power supplies, etc. Hardware in the Loop testing is quite practical and was run as an engineering flight simulator. Missions were flown in the simulator, and problems located without costing $20k/hr per aircraft for an actual flight test.

SpaceX do HIL tests with the Falcon 9 & Dragon spacecraft. http://www.spaceflightnow.com/falcon9/003/120424date/

imrehg · on June 2, 2015

Okay, the LightSail spacecraft is not a "life or death" thing, but, as it was in the news yesterday[1]: seriously, logging to a CSV file onboard can crash the system, and they have to wait until it reboots itself (it doesn't react to a soft reboot either)?

Feels sad, that a lot of lessons learned are getting lost along the way...

[1]: http://www.planetary.org/blogs/jason-davis/2015/20150526-sof...

crucini · on June 4, 2015

I think the right answer here is to develop languages or libraries that enforce the constraints you want.
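As a small illustration of a library enforcing a constraint, here is a Python sketch of a range-checked field, loosely in the spirit of Ada's constrained types mentioned at the top of the thread. The `Bounded` and `Antenna` names are hypothetical, and unlike Ada this only checks at runtime, never at compile time.

```python
class Bounded:
    """Descriptor that rejects out-of-range assignments at runtime."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def __set_name__(self, owner, name):
        # Store the real value under a private attribute on the instance.
        self.private_name = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.private_name)

    def __set__(self, obj, value):
        if not (self.low <= value <= self.high):
            raise ValueError(f"{value} outside [{self.low}, {self.high}]")
        setattr(obj, self.private_name, value)


class Antenna:
    # Hypothetical field: must stay between 1 and 99 inclusive.
    elevation_deg = Bounded(1, 99)

    def __init__(self, elevation_deg):
        self.elevation_deg = elevation_deg


antenna = Antenna(45)
try:
    antenna.elevation_deg = 120  # rejected: the old value is kept
except ValueError as err:
    rejected = str(err)
```

Application code cannot put the field into an invalid state, so the checking lives in one small, well-tested place.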

I think we should beware of adding layers of "silver bullet" technologies that promise to fix the last crisis, but increase the friction of development.

To illustrate what I mean, PLCs are normally programmed in ladder logic. From that level, I don't think you can crash the machine or corrupt memory. So the risks are limited to the "operating system" if you will, which can be more mature and tested than the "application".

lambdaelite · on June 1, 2015

Testing is certainly very important, but I think it should be emphasized that testing is but one part of the software development life cycle that produces an acceptably safe product.

anoopelias · on June 2, 2015

I always thought that the way to ensure zero failures at a level is to ensure accuracy at the next level.

For example, if all trains run with minute-level accuracy at every point, then there won't be any collisions by mistake.

Davesjoshin · on June 2, 2015

As a professional tester, I find this pretty interesting. Always wondered how these ultra critical programs were tested. (I work in the creative website space)

danieltillett · on June 1, 2015

Have languages been written that are designed to be maximally orthogonal to each other with regard to introducing bugs? If so, what combinations are used?