

Some disasters caused by numerical errors - davidbarker
http://ta.twi.tudelft.nl/users/vuik/wi211/disasters.html

======
jkot
The Doolittle Raid is probably the biggest disaster caused by a wrong date
calculation, but it is practically unknown in the hacker community.

It was the first bombing of Japan, in 1942. The bombers were supposed to be
refueled in China, but they crossed the International Date Line and arrived a
day earlier. As a result the airfields were not ready and the bombers crashed.

Fifteen bombers crashed; 3 crewmen died and 8 were captured. The Soviets got
American airplane technology.

> _Planners in Washington, DC, also had made a ridiculous blunder, forgetting
> that the ships would lose a day (April 14) crossing the International Date
> Line, putting the planes in China a day earlier than anyone expected.
> Because of this mix-up, when some of the bombers flew over Chuchow Field,
> which was supposed to have been their main refueling base, an air raid alarm
> sounded and the lights turned out._

[https://books.google.com/books?id=FkEeVAf-U7gC&lpg=PA43&ots=...](https://books.google.com/books?id=FkEeVAf-U7gC&lpg=PA43&ots=EJc_GC0u3S&dq=date%20line&pg=PA43#v=onepage&q=date%20line&f=false)

[http://www.americainwwii.com/articles/the-impossible-raid/](http://www.americainwwii.com/articles/the-impossible-raid/)

[http://en.wikipedia.org/wiki/Doolittle_Raid](http://en.wikipedia.org/wiki/Doolittle_Raid)

~~~
moioci
The Wikipedia article doesn't mention a date miscalculation. Probably the main
cause of their fuel difficulties was having to launch 10 hours and 170
nautical miles earlier than originally planned, due to being spotted by a
Japanese patrol boat. All but one of the planes crashed, but only three men
were KIA, eight were captured, and one crew was interned in Russia.

------
alephnil
During the first Iraq war, the Patriot missiles were presented as a great
success that prevented Israel and Saudi Arabia from being hit by Scud
missiles. The facts were very different.

Not a single Patriot missile managed to hit a Scud during that war, partly
because of the bug mentioned in the article and partly because anti-missile
defence is really hard.

The reason the Scud missiles did not do more damage was that Iraq had
modified them to extend their range. The modification made them unstable, and
thus their accuracy so poor that they would almost certainly miss the target.
On the other hand, as mentioned on the web page, one of the Patriot missiles
that missed a Scud hit something else, while the Scud itself missed.

All in all, the missile defence caused more damage than if no defence had been
installed at all. This was only revealed several years later, and it never
reached the headlines.
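The bug in question is well documented: the Patriot's clock counted in tenths
of a second, but 0.1 has no exact binary representation, so each tick silently
lost a tiny amount of time. A minimal sketch in Python (assuming 23 fractional
bits in the 24-bit register, an assumption that reproduces the widely cited
per-tick error of about 0.000000095 s):

```python
# 0.1 s stored as a truncated binary fraction in a 24-bit register
# (sketched here as 23 fractional bits plus a sign bit).
FRACTION_BITS = 23

stored = int(0.1 * 2**FRACTION_BITS)   # truncated, not rounded
tick = stored / 2**FRACTION_BITS       # value actually accumulated each tick
err_per_tick = 0.1 - tick              # ~9.5e-8 s lost on every tick

ticks = 100 * 3600 * 10                # 100 hours of uptime, one tick per 0.1 s
drift = err_per_tick * ticks           # accumulated clock drift, ~0.34 s

scud_speed = 1676                      # m/s, approximate Scud velocity
print(f"drift: {drift:.2f} s, range error: {drift * scud_speed:.0f} m")
```

After the famous 100 hours of uninterrupted uptime, the roughly 0.34 s of
drift shifts the predicted Scud position by over half a kilometer, well
outside the tracking range gate.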

------
akavel
For anyone interested in such matters (as every engineer should be, to some
degree), a seemingly never-ending (yet tenderly curated) stream of similar
interesting observations and stories is posted to:

"The RISKS Digest, [or] Forum On Risks To The Public In Computers And Related
Systems":

[http://catless.ncl.ac.uk/Risks/](http://catless.ncl.ac.uk/Risks/)

A summary from Wikipedia
[https://en.wikipedia.org/wiki/RISKS_Digest](https://en.wikipedia.org/wiki/RISKS_Digest):

_"It is a moderated forum concerned with the security and safety of
computers, software, and technological systems. Security, and risk, here are
taken broadly; RISKS is concerned not merely with so-called security holes in
software, but with unintended consequences and hazards stemming from the
design (or lack thereof) of automated systems. Other recurring subjects
include cryptography and the effects of technically ill-considered public
policies. RISKS also publishes announcements and Calls for Papers from various
technical conferences, and technical book reviews."_

------
ahelwer
To pre-empt someone bringing up THERAC-25: although it is a very famous
software disaster, none of its root causes were numerical errors.

------
pippy
I can sympathise with the Patriot Missile floating point issue. A bug like
that would be next to impossible to track down and address, especially when
dealing with specialised embedded hardware.

The Ariane 5 issue, however, is inexcusable. Converting a 64-bit
floating-point number to a 16-bit int without a range check is something a
first-year computer science student would be embarrassed by.

~~~
alephnil
It is more subtle than that. The original code ran on the Ariane 4 rocket, and
there the engineers had proved that the error could never happen within the
first 80 seconds of that rocket's trajectory, which was the period this code
ran. The management of the Ariane program decided that the unit would be used
in Ariane 5 as well, without recertifying it for the new rocket. The Ariane 5
rocket is much more powerful and gets much further along its trajectory in
the same time, so a horizontal velocity large enough to cause an overflow
would occur. This was never discovered, because the code was never tested
against the Ariane 5 trajectory. Thus it can also be considered a management
failure.

The code was also only for aligning the rocket's inertial platform on the
launchpad, and made no sense after liftoff, but it was not shut down until 80
seconds in.

~~~
Gravityloss
Yes. Since it was an overflow check that caused the problem, it's a
philosophical issue too.

Let's take a hypothetical: You're flying a rocket on a one-time mission. The
rocket is not reusable and there are no redundant engines or any way to abort
the mission in an intact way. You then detect an overflow in your control
algorithm.

In practice, it almost never makes sense to do anything about these errors. If
the error was spurious, the best course is to do nothing. If it was real, the
mission will be lost anyway, so it makes no sense to spend effort reacting to
the error.

Your only abort criterion might be the rocket heading onto a path that would
take it out of its designated safety zones.

However, if you have redundancy, then doing things like shutting down engines
starts making sense (as on the Saturn V or the Space Shuttle).

------
parados
There are other examples provided by the delightfully named European
Spreadsheet Risks Interest Group on their "Spreadsheet Horror Stories" page:
[http://www.eusprig.org/horror-stories.htm](http://www.eusprig.org/horror-stories.htm)

------
userbinator
_The number was larger than 32,768, the largest integer storeable in a 16 bit
signed integer, and thus the conversion failed._

The largest 16-bit signed integer is actually 32767... although it probably
would not have mattered in that case, it's a little ironic to find an
off-by-one error on a page about numerical errors.

One event that comes to mind, not really a disaster, but a rather costly
mistake caused by units confusion, is this:
[http://en.wikipedia.org/wiki/Gimli_Glider](http://en.wikipedia.org/wiki/Gimli_Glider)

~~~
rplst8
They may have used Excess-K representation...
[https://en.wikipedia.org/wiki/Signed_number_representations#...](https://en.wikipedia.org/wiki/Signed_number_representations#Excess-K)

------
fubarred
Nuclear weapons and forgetting parens (in Perl)

[http://www.foo.be/docs/tpj/issues/vol2_1/tpj0201-0004.html](http://www.foo.be/docs/tpj/issues/vol2_1/tpj0201-0004.html)

Also, Castle Bravo was ~3x bigger than predicted because of a failure to
correctly model the tritium production from lithium-7.

[https://en.wikipedia.org/wiki/Castle_Bravo#Cause_of_high_yie...](https://en.wikipedia.org/wiki/Castle_Bravo#Cause_of_high_yield)

~~~
Padding
This can't be real... 'PERL itself stood for "Precision Entry and Reentry
Launchings"'

~~~
vetler
No, it's not real:
[https://news.ycombinator.com/item?id=1822593](https://news.ycombinator.com/item?id=1822593)

------
VLM
The system design failure in the A5 is interesting to read about; the full
report is at

[http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html](http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html)

Basically they tried to re-use software from the A4 in the A5. The laser gyros
have firmware that can get funky, and the onboard computer would kind of
"trust but verify" the laser gyros on the A4. Obviously the programmers of the
laser gyros and the programmers of the nav system were not the same people,
and they had conflicting views on what makes a decent variable type for
horizontal velocity. The A5 doesn't technically need to verify the lasers'
alignment, and it's of no use after liftoff anyway (so what do you intend to
do if it gets misaligned, land, fix it, and take off again?). The A4 guys saw
the pointlessness of it and stopped monitoring after 40 or so seconds (by then
either it's working great or you've already crashed...).

So the A4, being somewhat anemic, never hit the limits of that int size for
horizontal velocity. But the A5, which was kind of like the A4 after extensive
steroid use, was able to overflow the conversion routine.

The nav computer wasn't very well designed at yet another level: instead of
going all RTOS and recovering in a few milliseconds, it helpfully kernel
panic'd. And the kernel panic's diagnostic output (probably something like "Oh
oh" in French) was helpfully interpreted by the engine computer as if it were
commands, which slammed the engine nozzles hard over in one direction, which
the airframe couldn't survive (really big rockets can't pinwheel in flight, at
least not in one piece).

So the list of software design failures is epic and huge:

1) Not every minor math exception or NaN should kernel panic and fail the
whole system (you shouldn't have to crash because one temp sensor got
unplugged, etc.)

2) Never do stuff you don't have to. If you don't need to align your lasers
due to technological advances, then stop doing it, because you can't crash the
whole system doing something you never try to do. And if, 50 seconds into
flight, you're commanded to slam your engine nozzles 90 degrees perpendicular
to flight, and you know the hardware can't bend that far anyway, I think you
can safely assume the nav computer has crashed and ignore it for a while
(maybe the watchdog timer would have eventually saved them?). Rephrased: if
you're commanded to do something that'll certainly blow the ship up, then
unless you're a destruct charge, maybe you should ignore it and just keep
chugging along until you hear something saner.

3) Most epic system failures happen at the borders of subsystems, like where
one machine talks 16-bit ints and the other 64-bit floats. So minimize your
borders.

4) Simulation could have saved half a billion bucks. I'm actually kind of
surprised nobody ever tried it. I guess test-driven development wasn't so cool
back then.
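Points 1 and 2 boil down to degrading locally instead of failing globally. A
toy sketch of what that looks like in code (all names and limits here are
invented for illustration):

```python
import math

MAX_DEFLECTION_DEG = 10.0   # invented physical limit for the sketch

def filter_sensor(raw: float, last_good: float) -> float:
    # Point 1: a locally bad value (a NaN from an unplugged temp sensor,
    # say) is replaced with the last good reading instead of crashing
    # the whole system.
    if math.isnan(raw) or math.isinf(raw):
        return last_good
    return raw

def filter_command(angle_deg: float, last_good: float) -> float:
    # Point 2: a command the hardware can't physically execute is
    # evidence the commander has gone insane; hold the last sane value
    # instead of obeying it.
    if abs(angle_deg) > MAX_DEFLECTION_DEG:
        return last_good
    return angle_deg
```

Both filters trade a possibly stale value for survival of the system, which
is exactly the trade the Ariane 5 software never got the chance to make.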

One way to look at the design failure is that they failed numerous times when
handling protocol failures. I've seen this in big bureaucracies time and time
again: who cares if the overall mission fails, as long as our little group did
right by our own code and someone else can be blamed. The nav guys never
should have asked us laser gyro guys our horizontal speed if it was over 15
bits signed, so it's all their fault for asking. The nav guys never should
have asked us engine computer guys to point my engine nozzle sideways, thus
snapping off the back of the rocket, so it's all their fault. The OS guys
never should have asked us math coprocessor guys to convert -40000.123 m/s
into a 16-bit int, so it's all their fault. The math guys should never have
thrown a kernel panic on a mere numerical conversion, crashing the OS, so it's
all their fault, not ours in OS land. The OS guys should never have run our
nav routines on a non-RTOS system that couldn't survive and recover, so it's
all their fault. Management never let us software devs run a ground test to
save money, so it's all their fault. The blamestorming could go on for pages.

