
The Explosion of the Ariane 5 - pstadler
http://www.ima.umn.edu/~arnold/disasters/ariane.html
======
troymc
The Ariane 5 rocket was carrying four ESA spacecraft known as "Cluster"
(because they were to work together, in a tetrahedral formation). The bug and
subsequent failure give another meaning to the word "clusterf%#k".

<https://en.wikipedia.org/wiki/Cluster_%28spacecraft%29>

Edit: The above Wikipedia article has the Ada source code that caused the
problem.

~~~
tobych
I worked on the Cluster project as a software developer for many years, at the
University of Sussex in Brighton. My first job. And it blew up. Great first
job. I was watching live with hundreds of engineers and scientists at
Rutherford Appleton Laboratory. There was silence. All we could hear were
birds singing, coming over the satellite link. Quite an experience. Amazingly,
the Cluster project got up and running with replacement hardware a few years
later, and I was back on the job.

~~~
smackfu
Don't pretty much all launches have insurance? The rate of failure is high
enough that it seems necessary unless you are a gambler.

~~~
tobinfricke
Monetary compensation isn't everything, if you've already put years of effort
into a project and the failure means additional years of effort are required
to prepare for a new launch. The Wikipedia article notes that replacement
spacecraft were not launched until four years later.

------
Gravityloss
I've heard from people in the space sector that it was the _exception_ , not
the overflow per se that caused the problem. Had it not been caught the flight
could have made it to orbit (if there weren't other problems). Wikipedia says
it was a hardware exception but
<http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html> says it was a
_software_ one, and it occurred only in code that was needed pre-flight, so
the overflow would likely have caused no problems without that crippling
exception.
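
To illustrate the distinction, here is a minimal Python sketch (not the
original Ada). It assumes the failing operation was a conversion of a
floating-point value to a 16-bit signed integer, as the linked inquiry report
describes; the function names are hypothetical. A checked conversion raises
on overflow, while a saturating one clamps the value and lets execution
continue:

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value is out of range."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping to the range limits."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    print(to_int16_saturating(40000.0))  # clamped, execution continues
    try:
        to_int16_checked(40000.0)
    except OverflowError as exc:
        # If nothing handles this, the whole computation stops here.
        print("unhandled in flight this would be fatal:", exc)
```

Whether clamping would have been acceptable for the real system is a separate
question; the sketch only shows why the exception, rather than the
out-of-range value, is what stops the program.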

These systems have become so big and expensive. This has been the case since
ICBMs, and it only got worse with Apollo.

Yet they are so vulnerable, since there is no way to abort intact once you
have flown for something like 0.1 seconds. (At least Saturn V had some
redundancy.) You do not get a second try.

Both issues create a perfect recipe for stagnation - everything has to be
checked and rechecked for years before and after a software or hardware
change. If someone tries something new, and there is a launch or spacecraft
failure, it is a political issue and heads will roll. People's technical and
political careers are destroyed.

In short, this way is not likely to reach real spacefaring.

A more organic approach, with lots of smaller actors working in parallel and
trying and failing a lot more, but with better processes built in to handle
said failures (technical, political and cultural), could be much more
conducive to real progress, such as increased operational flexibility,
shortened schedules, better reliability and lower prices.

Reasonably sized reusable rockets with good intact abort capability in a
testing and development program could up the launch rate hugely, and all kinds
of different solutions could be quickly tested. I find it likely that this
will eventually happen, but it is frustrating how long it is taking.

In this "horizontal velocity overflow" case, you could do an intact abort if
you had a fallback to some alternate control law or even manual control. Those
are not incorporated into current expendable space launchers, but they exist
in aircraft. (Saturn V and also the Lunar Module _did_ have manual backup. You
could fly the Saturn to orbit. The LM was hard to get to the right orbit
where the CM was waiting...)

------
callahad
The Therac-25 is also a fascinating case of software failures causing tangible
loss: <http://courses.cs.vt.edu/cs3604/lib/Therac_25/Therac_1>

------
nraynaud
You know, although I'm slightly tired of that story (I mean it stings, my
family works in the field), sometimes I feel like it was a good thing. Here in
France, people, and the elite political clique in power even more so, are
afraid of risk. Our constitution was even amended towards risk-averseness. I
think blowing up the GNP of an African country had various positive side
effects: 1) risk is there, whether you have correctly signed the process
paperwork or not; 2) innovation feeds on blowups; 3) be humble, stop being
cocky on TV before a test launch. Sending back the champagne and buffet was
very painful to watch, and there is no need for that.

~~~
avar
They should have celebrated that they learned something that day with that
champagne and buffet.

------
XorNot
I'd really love more details about this. What did the surrounding code look
like, why wasn't a compiler warning produced by this code, etc.?

There's certainly a much larger - and probably quite informative - story here.

~~~
david_p
A friend of mine did a presentation about this. What he told me is that
apparently, the developers who wrote and tested the code that overflowed were
designing for a value in miles.

When stored in miles, the value would never get outside the range of a 16-bit
unsigned int, but the actual value used was in kilometers, and when converted
to kilometers, the value would overflow.

~~~
Tloewald
This seems unlikely for a European/French system. Are you perhaps confusing
this story with the NASA Mars Climate Orbiter mission, mentioned elsewhere in
the thread, which failed because of a metric vs imperial error?

The value was horizontal velocity. I imagine in metric it would be expressed
in m/s while in imperial it's going to be feet per second (mph seems highly
improbable).

------
eps
Reminds me of the (alleged) reason why the first Soviet Mars missions missed
the planet - there was an erroneous period instead of a comma in some part of
their nav program written in Fortran.

~~~
FoeNyx
Or was it NASA's Mariner 1?
<http://en.wikipedia.org/wiki/Mariner_1#Overbar_transcription_error>

Anyway, as always, Wikipedia has an interesting list of software bugs:
<http://en.wikipedia.org/wiki/List_of_software_bugs>

I found the computer crash of F-22 Raptors after crossing the International
Date Line interesting.

~~~
huhtenberg
They don't have a Mercedes Smart bug where they mixed left and right causing
the car to throw itself on its side when going through a turn.

(edit) ... and I'm downvoted. Lovely. Totally makes sense.

~~~
Someone
I think you mean a Mercedes-Benz A class, in the moose test
(<http://en.wikipedia.org/wiki/Moose_test>)

I have never heard that was software related, but it also is a decent bet
software was involved, e.g. for stiffening the suspension. What top of the
line car did not have software, in 1997?

~~~
huhtenberg
That's the one, bingo. It was a new line of smaller MBs, so I misremembered it
being Smart. The issue though was most certainly with the software
overcompensating the roll in the wrong direction. I wouldn't have remembered
it otherwise :)

~~~
lttlrck
The A class didn't have active suspension. I believe it was a mechanical issue
and solved with a stiffer front anti-roll bar and other suspension geometry
tweaks. Though there is a possibility the ESP was tweaked to apply the brake
under such circumstances, I think that could do more harm than good. I could
be wrong, it was 1997... IIRC the non-ESP MB Sprinter also had stability
issues.

~~~
huhtenberg
@sebbi - you are shadow-banned.

------
jhonkola
Another (probable) software failure due to an unexpected scenario was the Mars
Polar Lander:
<http://en.wikipedia.org/wiki/Mars_Polar_Lander#Loss_of_communications>

The failure review concluded that the probable cause of loss was that the
landing system software apparently interpreted the deployment of the lander's
legs as touchdown and shut down the descent engines. The vibrations caused by
the deployment of the legs were not taken into account when designing the
software.
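
As a rough illustration of how a transient like that can turn into an engine
shutdown, here is a hypothetical Python sketch. It assumes a touchdown flag
that is latched from a sensor and only acted on once touchdown sensing is
armed; the function name, the arming step and the sample readings are all
made up for the example and are not taken from the lander's actual software:

```python
def cutoff_index(sensor_readings, arm_index, clear_on_arm):
    """Return the step at which engine cutoff is commanded, or None.

    sensor_readings: booleans, True when the touchdown sensor fires.
    arm_index: step from which a latched touchdown flag is acted upon.
    clear_on_arm: if True, discard anything latched before arming.
    """
    latched = False
    for i, reading in enumerate(sensor_readings):
        if i == arm_index and clear_on_arm:
            latched = False  # forget transients seen before arming
        latched = latched or reading
        if i >= arm_index and latched:
            return i
    return None


# Step 1: spurious pulse from leg deployment. Step 6: real touchdown.
readings = [False, True, False, False, False, False, True]

print(cutoff_index(readings, arm_index=4, clear_on_arm=False))  # 4: premature
print(cutoff_index(readings, arm_index=4, clear_on_arm=True))   # 6: at touchdown
```

In the first call the stale flag from the leg-deployment pulse triggers cutoff
as soon as sensing is armed, well before the real touchdown.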

------
neurotech1
There have been several references to SpaceX "flying" their Falcon 9 computer
systems to test for bugs like this. The idea is that, as far as the computer
is concerned, it is a real flight and it should act accordingly. Most of the
problems to date have been related to mechanical issues. During the first
docking, there was a minor issue with the sensor "field of vision", but this
was fixed.

The point is that SpaceX procedures seem to be able to prevent software bugs
similar to the one in the Ariane 5 from causing a catastrophic abort or
failure.

~~~
kiba
How come the flight system at Ariane 5 wasn't tested like this?

~~~
InclinedPlane
Resources? Afterward they simulated flights using the Ariane 5 systems and
duplicated the errors.

~~~
kiba
I don't think resources are an issue, given that they spent a lot of money on
its development.

~~~
InclinedPlane
Most big bureaucratic organizations tend to be "penny wise and pound foolish".
They might spend billions on developing a new launch vehicle but balk at
spending a few million on a HIL simulation of a real launch.

------
3327
Happens to the best of us... If it's any consolation, my first game app
crashed after 10k points for a similar reason. Check it out, it should still
be on the Android store - Alliegator

------
nernst
There is a good article examining the various possible causes of the Ariane-5
disaster by Bashar Nuseibeh: "Ariane-5: Who-Dunnit?". See PDF here:
<http://www.inf.ed.ac.uk/teaching/courses/seoc/2007_2008/resources/ariane5.pdf>

------
webreac
"R1...More generally, no software function should run during flight unless it
is needed."

This means that, even when using the most reliable language and testing as
much as possible, there is always a risk of an overlooked bug.

------
jaxb
For more stories like this, get the book by David M. Harland, "Space Systems
Failures: Disasters and Rescues of Satellites, Rockets and Space Probes"

------
youngerdryas
From the linked James Gleick article:

"the programmers had decided that this particular velocity figure would never
be large enough to cause trouble. After all, it never had been before.
Unluckily, Ariane 5 was a faster rocket than Ariane 4. One extra absurdity:
the calculation containing the bug, which shut down the guidance system, which
confused the on-board computer, which forced the rocket off course, actually
served no purpose once the rocket was in the air. Its only function was to
align the system before launch. So it should have been turned off. But
engineers chose long ago, in an earlier version of the Ariane, to leave this
function running for the first 40 seconds of flight -- a "special feature"
meant to make it easy to restart the system in the event of a brief hold in
the countdown."

~~~
michielvoo
That seems like a contradiction: it was caused by a calculation regarding
velocity, but that calculation served no purpose once in the air.

Based on the quote I* have to agree with the developer then: this particular
'horizontal velocity' figure would never be large enough to cause overflow, it
would always be zero since the rocket should still be on the platform.

So maybe the existence of the routine was the root cause, and not so much the
potential for overflow inside the routine?

* I am not a rocket scientist

~~~
InclinedPlane
> _"this particular 'horizontal velocity' figure would never be large enough
> to cause overflow, it would always be zero since the rocket should still be
> on the platform."_

Except the Earth is not stationary, nor is the surface of the Earth moving at
a constant velocity.

~~~
CognitiveLens
Regardless, the Earth didn't substantially speed up between Ariane 4 and 5, so
although the horizontal velocity figure might not be zero, it would at least
be approximately constant on the platform.

~~~
InclinedPlane
The point I was making is that it is necessary to calibrate an inertial
guidance system to the ground, you can't just pretend that the launchpad is
stationary. Moreover, because a launch vehicle's guidance will become entirely
internal (dependent only on the on-board systems and commands sent from
mission control) well before launch, it's not so easy to have some software
routines running on the rocket which don't run during an actual launch. I
imagine it was far easier to simply use a timer to allow the routine in
question to run up through T+40s than to attempt to programmatically trigger
deactivating the routine in the event the vehicle actually left the pad. Even
more so given that running the routine through part of an actual launch, on an
Ariane 4 at least, had never been problematic before.
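
The two gating strategies can be contrasted in a small Python sketch. This is
only an illustration under assumptions: the 40-second window comes from the
quote above, while the function names and the idea of a single "lifted off"
flag are hypothetical simplifications:

```python
ALIGNMENT_WINDOW_S = 40.0  # window quoted in the thread (T+40s)


def runs_by_timer(t_since_liftoff_s: float) -> bool:
    """Timer gating: the routine keeps running until the window expires.

    Negative times mean the vehicle is still on the pad.
    """
    return t_since_liftoff_s <= ALIGNMENT_WINDOW_S


def runs_by_liftoff_flag(lifted_off: bool) -> bool:
    """Event gating: the routine stops as soon as liftoff is detected."""
    return not lifted_off


# On the pad both strategies agree; 30 s into flight only the timer-gated
# routine is still executing pre-flight code.
print(runs_by_timer(-5.0), runs_by_liftoff_flag(False))
print(runs_by_timer(30.0), runs_by_liftoff_flag(True))
```

The timer version is simpler because it needs no liftoff signal, which is the
trade-off described above; the cost is that pre-flight code runs in flight.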

The problem was that nobody had formally detailed all of the assumptions and
risks for every part of the code, so when the conditions changed and those
assumptions became faulty nobody was the wiser because nobody was aware they
were actually making such an assumption about the speed of the rocket.
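
One lightweight way to write such an assumption down so it can be re-checked
when conditions change is sketched below in Python. Everything here is
hypothetical: the class, its fields and the envelope numbers are invented for
illustration and are not real Ariane figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeAssumption:
    """A documented assumption that a value stays within [low, high]."""

    name: str
    low: float
    high: float
    rationale: str

    def holds_for(self, expected_min: float, expected_max: float) -> bool:
        """Check the assumption against a vehicle's expected value envelope."""
        return self.low <= expected_min and expected_max <= self.high


# Hypothetical assumption: the value fits in a 16-bit signed integer.
fits_int16 = RangeAssumption(
    name="horizontal velocity term",
    low=-32768,
    high=32767,
    rationale="value is converted to a 16-bit signed integer",
)

# Made-up envelopes for an older and a newer, faster vehicle.
print(fits_int16.holds_for(0, 20000))  # True: assumption still valid
print(fits_int16.holds_for(0, 40000))  # False: must be revisited
```

Re-running such checks against the new vehicle's envelope is what would
surface an assumption that silently stopped being true.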

------
flagnog
Compare this to the failures experienced recently by SpaceX: Elon's launch,
while not perfect, recovered. I think this shows the power of Elon's vision,
and how he's going to change the launch market.

