
An overflow error costing 500M dollars (1996) - pieterr
https://around.com/ariane.html
======
pjc50
> Fortunately, he points out, really important software has a reliability of
> 99.9999999 percent. At least, until it doesn't.

Part of Feynman's research into the Challenger disaster was finding out how
such high numbers at the top level could co-exist with much lower and more
realistic reliability estimates by the engineers working on the project.

Anyway, if you like this sort of thing, there's a lifetime supply of reading
about horrifyingly expensive software disasters in the famous "RISKS Digest".
[https://catless.ncl.ac.uk/Risks/](https://catless.ncl.ac.uk/Risks/)

~~~
wahern
A more recent example was Fukushima. Interviewed engineers all recited the same
failure probabilities, and every couple of days more "highly unlikely" events
kept happening--explosions, power outages, cooling outages, partial meltdowns,
full meltdowns, containment failures. Their confidence in the risk
calculations was disturbing, especially because they failed to revise their
priors after each new event!

I don't mean to exaggerate the events. It was a shame nuclear energy suffered
such a severe blow. But not as shameful as that parade of engineers who
remained stubbornly committed to the consensus risk analysis as events
unfolded (and seemingly remain so to this day).

~~~
walrus01
It does not take a rocket scientist to say that diesel generators should not
be located at ground level, on the ocean side, in a tsunami-prone area.
Fukushima would have been ok if they'd been able to get diesel power up to run
cooling pumps.

------
munin
> Steering was controlled by the on-board computer, which mistakenly thought
> the rocket needed a course change because of numbers coming from the
> inertial guidance system. That device uses gyroscopes and accelerometers to
> track motion. The numbers looked like flight data -- bizarre and impossible
> flight data -- but were actually a diagnostic error message. The guidance
> system had in fact shut down.

The real question is, why was the diagnostic error message interpreted as
flight data?

I recommend reading the original report, which has much more meat:
[http://sunnyday.mit.edu/nasa-class/Ariane5-report.html](http://sunnyday.mit.edu/nasa-class/Ariane5-report.html)

~~~
gnulinux
Yes. My naive guess is that if this software had been built more robustly, a
supervisor would have read the diagnostic message, restarted the guidance
system, and tried to recover from the situation as well as possible. Instead,
the system was designed in such a way that it couldn't emit anything other
than flight data.

~~~
munin
Sort of, but not really. From the original report:

> It was the decision to cease the processor operation which finally proved
> fatal. Restart is not feasible since attitude is too difficult to
> re-calculate after a processor shutdown; therefore the Inertial Reference
> System becomes useless. The reason behind this drastic action lies in the
> culture within the Ariane programme of only addressing random hardware
> failures. From this point of view exception - or error - handling mechanisms
> are designed for a random hardware failure which can quite rationally be
> handled by a backup system.

> Although the failure was due to a systematic software design error,
> mechanisms can be introduced to mitigate this type of problem. For example
> the computers within the SRIs could have continued to provide their best
> estimates of the required attitude information. There is reason for concern
> that a software exception should be allowed, or even required, to cause a
> processor to halt while handling mission-critical equipment. Indeed, the
> loss of a proper software function is hazardous because the same software
> runs in both SRI units. In the case of Ariane 501, this resulted in the
> switch-off of two still healthy critical units of equipment.

> The original requirement accounting for the continued operation of the
> alignment software after lift-off was brought forward more than 10 years ago
> for the earlier models of Ariane, in order to cope with the rather unlikely
> event of a hold in the count-down, e.g. between -9 seconds, when flight mode
> starts in the SRI of Ariane 4, and -5 seconds when certain events are
> initiated in the launcher which take several hours to reset. The period
> selected for this continued alignment operation, 50 seconds after the start
> of flight mode, was based on the time needed for the ground equipment to
> resume full control of the launcher in the event of a hold.

> This special feature made it possible with the earlier versions of Ariane,
> to restart the count-down without waiting for normal alignment, which takes
> 45 minutes or more, so that a short launch window could still be used. In
> fact, this feature was used once, in 1989 on Flight 33.

> The same requirement does not apply to Ariane 5, which has a different
> preparation sequence and it was maintained for commonality reasons,
> presumably based on the view that, unless proven necessary, it was not wise
> to make changes in software which worked well on Ariane 4.

> Even in those cases where the requirement is found to be still valid, it is
> questionable for the alignment function to be operating after the launcher
> has lifted off. Alignment of mechanical and laser strap-down platforms
> involves complex mathematical filter functions to properly align the x-axis
> to the gravity axis and to find north direction from Earth rotation sensing.
> The assumption of preflight alignment is that the launcher is positioned at
> a known and fixed position. Therefore, the alignment function is totally
> disrupted when performed during flight, because the measured movements of
> the launcher are interpreted as sensor offsets and other coefficients
> characterising sensor behaviour.

------
ChrisSD
But I was told my C++ code could be hundreds of milliseconds faster if I
ignore such checks...

~~~
klodolph
In this case, the crash occurred because the overflow error threw an
exception, which the software was not prepared to handle. If the overflow
check had been omitted, the rocket would have launched successfully. (Not
making a statement about what should have been done, here.)
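
To make the trade-off concrete: the fatal operation was narrowing a 64-bit
floating-point value into a 16-bit signed integer. Here's a minimal C++ sketch
of the checked vs. unchecked versions of that conversion (hypothetical names;
the actual SRI code was Ada, where the runtime check and its exception are
built in):

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Unchecked narrowing: for out-of-range values, a float-to-int cast
// is undefined behavior in C++ -- no exception, just garbage (at best).
int16_t unchecked_narrow(double v) {
    return static_cast<int16_t>(v);
}

// Checked narrowing: reject values that don't fit in 16 bits. This is
// roughly what the Ada runtime did on Ariane 5 -- except the resulting
// exception had no handler, so it shut the processor down.
int16_t checked_narrow(double v) {
    if (v > std::numeric_limits<int16_t>::max() ||
        v < std::numeric_limits<int16_t>::min()) {
        throw std::overflow_error("value does not fit in int16_t");
    }
    return static_cast<int16_t>(v);
}
```

So `checked_narrow(1000.0)` returns 1000, while `checked_narrow(40000.0)`
throws. The Ariane code effectively ran with the check enabled but with
nothing prepared to catch the exception, which is the worst of both worlds.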

~~~
Retric
_rocket would have launched successfully_

That does not actually follow. It would not have crashed due to this specific
issue, but it might have crashed from some other issue.

~~~
klodolph
I was assuming that people were smart enough to figure that part out on their
own, but thanks for connecting the dots for us, I guess.

------
packet_nerd
> But the second unit had failed in the identical manner a few milliseconds
> before. And why not? It was running the same software.

That got me thinking: could 3 different teams have separately built and
programmed 3 different units (separate software and hardware), with a voting
system on top, so that if, say, 2 units are saying "keep going straight" and
the third is saying "turn left", the system knows to keep going straight? Is
something like this ever done in systems that require extremely high
reliability?

~~~
Rychard
This seems to be precisely how some aircraft computer systems are implemented,
if this aviation.stackexchange answer is to be believed.

[https://aviation.stackexchange.com/a/21746](https://aviation.stackexchange.com/a/21746)
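
The scheme being described is usually called N-version programming with a
majority voter. The voting half is simple; a minimal C++ sketch (the command
type and the three-channel setup are invented for illustration):

```cpp
#include <array>
#include <optional>

// Hypothetical discrete steering command produced by each of three
// independently developed guidance channels.
enum class Command { Straight, Left, Right };

// 2-out-of-3 majority vote: return the command at least two channels
// agree on, or nothing if all three disagree (forcing a safe fallback).
std::optional<Command> majority_vote(const std::array<Command, 3>& channels) {
    for (const Command& candidate : channels) {
        int votes = 0;
        for (const Command& c : channels) {
            if (c == candidate) ++votes;
        }
        if (votes >= 2) {
            return candidate;
        }
    }
    return std::nullopt;
}
```

One caveat worth noting: voting only helps against faults that are independent
across versions. The two Ariane SRIs failed identically because they shared a
single design error, and the Knight-Leveson experiments found that even
independently written versions tend to fail on the same hard inputs, so the
redundancy buys less than the naive probability math suggests.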

------
dgritsko
Here's a video of the incident, from 2002:
[https://www.youtube.com/watch?v=A1gGGDG580E](https://www.youtube.com/watch?v=A1gGGDG580E)

~~~
rb808
lol the first comment

> why the hell is the video called "2002" is this is clearly the footage from
> 1996 ? you even wrote it in the description ! moron

------
llao
How did they get to know what had happened? Was it reconstructed after the
crash? If so, then why was it not simulated beforehand? Or did they get live
data during it?

~~~
jffry
"The SRI internal events that led to the failure have been reproduced by
simulation calculations. Furthermore, both SRIs were recovered during the
Board's investigation and the failure context was precisely determined from
memory readouts. In addition, the Board has examined the software code which
was shown to be consistent with the failure scenario. The results of these
examinations are documented in the Technical Report."

You may also be interested in section 2.3, which describes the factors that
led to the simulations failing to discover this failure mode.

[1]
[https://web.archive.org/web/19970125194224/http://www.esrin....](https://web.archive.org/web/19970125194224/http://www.esrin.esa.it:80/htdocs/tidc/Press/Press96/ariane5rep.html)

