> Fortunately, he points out, really important software has a reliability of 99.9999999 percent. At least, until it doesn't.
Part of Feynman's investigation into the Challenger disaster was finding out how such high numbers at the top level could coexist with the much lower and more realistic reliability estimates from the engineers working on the project.
Anyway, if you like this sort of thing, there's a lifetime supply of reading about horrifyingly expensive software disasters in the famous "RISKS Digest". https://catless.ncl.ac.uk/Risks/
More recently there was Fukushima. The engineers interviewed all recited the same failure probabilities, yet every couple of days another "highly unlikely" event happened--explosions, power outages, cooling outages, partial meltdowns, full meltdowns, containment failures. Their confidence in the risk calculations was disturbing, especially because they failed to revise their priors after each new event!
I don't mean to exaggerate the events. It was a shame nuclear energy suffered such a severe blow. But not as shameful as that parade of engineers who remained stubbornly committed to the consensus risk analysis as events unfolded (and seemingly remain so to this day).
It does not take a rocket scientist to say that diesel generators should not be located at ground level, on the ocean side, in a tsunami-prone area. Fukushima would have been OK if they'd been able to get diesel power up to run the cooling pumps.
> Steering was controlled by the on-board computer, which mistakenly thought the rocket needed a course change because of numbers coming from the inertial guidance system. That device uses gyroscopes and accelerometers to track motion. The numbers looked like flight data -- bizarre and impossible flight data -- but were actually a diagnostic error message. The guidance system had in fact shut down.
The real question is, why was the diagnostic error message interpreted as flight data?
If I recall correctly, the code that failed originated on the Ariane 4, where it was needed to align the vehicle. It was a bit of a perfect storm: the code was left in for diagnostic purposes, the different early trajectory of the Ariane 5 (with much higher horizontal velocity) triggered the bug, and the system had supposedly been battle-tested on the Ariane 4.
Yes. My naive guess is that, if this software had been built more robustly, a supervisor would have read the diagnostic message, restarted the guidance system, and tried to recover from the situation as well as possible. Instead, the system was designed in such a way that it couldn't output anything other than flight data.
Sort of, but not really. From the original report:
> It was the decision to cease the processor operation which finally proved fatal. Restart is not feasible since attitude is too difficult to re-calculate after a processor shutdown; therefore the Inertial Reference System becomes useless. The reason behind this drastic action lies in the culture within the Ariane programme of only addressing random hardware failures. From this point of view exception - or error - handling mechanisms are designed for a random hardware failure which can quite rationally be handled by a backup system.
> Although the failure was due to a systematic software design error, mechanisms can be introduced to mitigate this type of problem. For example the computers within the SRIs could have continued to provide their best estimates of the required attitude information. There is reason for concern that a software exception should be allowed, or even required, to cause a processor to halt while handling mission-critical equipment. Indeed, the loss of a proper software function is hazardous because the same software runs in both SRI units. In the case of Ariane 501, this resulted in the switch-off of two still healthy critical units of equipment.
> The original requirement accounting for the continued operation of the alignment software after lift-off was brought forward more than 10 years ago for the earlier models of Ariane, in order to cope with the rather unlikely event of a hold in the count-down e.g. between -9 seconds, when flight mode starts in the SRI of Ariane 4, and -5 seconds when certain events are initiated in the launcher which take several hours to reset. The period selected for this continued alignment operation, 50 seconds after the start of flight mode, was based on the time needed for the ground equipment to resume full control of the launcher in the event of a hold.
> This special feature made it possible with the earlier versions of Ariane, to restart the count-down without waiting for normal alignment, which takes 45 minutes or more, so that a short launch window could still be used. In fact, this feature was used once, in 1989 on Flight 33.
> The same requirement does not apply to Ariane 5, which has a different preparation sequence and it was maintained for commonality reasons, presumably based on the view that, unless proven necessary, it was not wise to make changes in software which worked well on Ariane 4.
> Even in those cases where the requirement is found to be still valid, it is questionable for the alignment function to be operating after the launcher has lifted off. Alignment of mechanical and laser strap-down platforms involves complex mathematical filter functions to properly align the x-axis to the gravity axis and to find north direction from Earth rotation sensing. The assumption of preflight alignment is that the launcher is positioned at a known and fixed position. Therefore, the alignment function is totally disrupted when performed during flight, because the measured movements of the launcher are interpreted as sensor offsets and other coefficients characterising sensor behaviour.
In this case, the crash occurred because the overflow error threw an exception, which the software was not prepared to handle. If the overflow check had been omitted the rocket would have launched successfully. (Not making a statement about what should have been done, here.)
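For illustration, here's a rough sketch in C (not the actual Ada flight code) of the two behaviours being contrasted: a range-checked conversion that treats overflow as fatal, versus a saturating one that clamps and carries on. The variable name and limits are placeholders, not anything from the real SRI software.

    /* Sketch only: the real code was Ada running on the SRI, not this. */
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Checked conversion: out-of-range input is treated as a fatal error
     * (roughly the behaviour that shut the SRI down). */
    short to_int16_checked(double horizontal_bias) {
        if (horizontal_bias > SHRT_MAX || horizontal_bias < SHRT_MIN) {
            fprintf(stderr, "operand error: value out of range\n");
            exit(EXIT_FAILURE);   /* processor gives up; no more flight data */
        }
        return (short)horizontal_bias;
    }

    /* Saturating conversion: clamp to the representable range and continue. */
    short to_int16_saturating(double horizontal_bias) {
        if (horizontal_bias > SHRT_MAX) return SHRT_MAX;
        if (horizontal_bias < SHRT_MIN) return SHRT_MIN;
        return (short)horizontal_bias;
    }

    int main(void) {
        double bias = 40000.0;   /* larger than a 16-bit signed int can hold */
        printf("saturating: %d\n", to_int16_saturating(bias));
        printf("checked: %d\n", to_int16_checked(bias));   /* aborts here */
        return 0;
    }

Whether clamping would actually have been safer is exactly the judgement call being debated here; the sketch just shows the two options.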
About 30 years ago I read an article in a dead-tree magazine by a firmware engineer. He mentioned clipping and flagging overflows and underflows. His example was sqrt(small negative number): better to return zero and set a flag than to blow up, because a small negative number is most likely 'nominal zero'.
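Something along these lines, as I remember it -- this is my reconstruction in C, not his actual code, and the flag mechanism is just a placeholder:

    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Status flag that a supervisor or telemetry task could inspect later. */
    static bool negative_input_flag = false;

    /* Clip-and-flag square root: a small negative input is treated as
     * "nominal zero" instead of blowing up or returning NaN. */
    double safe_sqrt(double x) {
        if (x < 0.0) {
            negative_input_flag = true;   /* record that clipping happened */
            return 0.0;                   /* clamp to zero and carry on */
        }
        return sqrt(x);
    }

    int main(void) {
        printf("result: %f\n", safe_sqrt(-1e-12));      /* prints 0.000000 */
        printf("flagged: %d\n", negative_input_flag);   /* prints 1 */
        return 0;
    }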
> But the second unit had failed in the identical manner a few milliseconds before. And why not? It was running the same software.
That got me thinking: could three different teams have separately built and programmed three different units (separate software and hardware), and then had a voting-type system so that, if say two units are saying "keep going straight" and the third is saying "turn left", it knows to keep going straight? Is something like this ever done in systems that require extremely high reliability?
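To make the question concrete, here's a rough sketch in C of the voting step I'm imagining -- the command type and the tie-breaking rule are just placeholders:

    #include <stdio.h>

    typedef enum { STRAIGHT, LEFT, RIGHT } command_t;

    /* Majority voter over three independently developed units:
     * return whatever at least two of them agree on. */
    command_t vote(command_t a, command_t b, command_t c) {
        if (a == b || a == c) return a;   /* a agrees with at least one other */
        if (b == c) return b;             /* a is the odd one out */
        return a;                         /* no majority: fall back (policy choice) */
    }

    int main(void) {
        /* Two units say "keep going straight", one says "turn left". */
        command_t result = vote(STRAIGHT, LEFT, STRAIGHT);
        printf("voted command: %d\n", result);   /* prints 0, i.e. STRAIGHT */
        return 0;
    }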
How did they find out what had happened? Was it reconstructed after the crash? If so, then why was it not simulated beforehand? Or did they get live data during it?
"The SRI internal events that led to the failure have been reproduced by simulation calculations. Furthermore, both SRIs were recovered during the Board's investigation and the failure context was precisely determined from memory readouts. In addition, the Board has examined the software code which was shown to be consistent with the failure scenario. The results of these examinations are documented in the Technical Report."
You may also be interested in section 2.3, which describes the factors that led to the simulations failing to discover this failure mode.