
Patriot missile software failure, 28 soldiers died. Fix: reboot the system - bl4k
http://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dhahran
======
patio11
This was covered in our C classes in college, and it is probably more
interesting for programmers here if you understand what the bug actually was.

The "software error" Wiki alludes to is that the Patriot missile kept track of
its internal clock with floating point numbers. When the machine had been
booted in the recent past, such as _every time in testing_ , the floating
point number spent most of its precision to the right of the decimal point.
This let it do what it was designed to do, which was to calculate very small
delta(time) values for velocity/position calculations and get fairly
close to fast moving objects then go _boom_.

The problem is that floating point numbers have a limited amount of precision
available to them, and if you are using a few billion milliseconds (2 weeks),
almost all of your precision is lost to the left of the decimal point (and,
given that this is precision-intensive work, you didn't need to wait that long
to see anomalies).

Lower precision meant that taking delta(time) got increasingly less precise as
time went on. Which meant that velocity/position calculations got
progressively more screwed up. Which meant the missile did not go _boom_ in
the general vicinity of incoming missiles. Which killed Americans and allies.
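
To make the precision loss concrete, here is a minimal sketch. It assumes the clock is a 32-bit IEEE float holding milliseconds since boot, which is purely an illustration and not necessarily the Patriot's actual format. The helper name `float32_spacing` is made up for this example.

```python
import struct

def float32_spacing(x):
    """Gap between x (rounded to float32) and the next larger float32."""
    # Reinterpret the float32 bit pattern as an unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    here = struct.unpack("<f", struct.pack("<I", bits))[0]
    # For positive finite values, bit pattern + 1 is the next float up.
    above = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return above - here

ONE_SECOND_MS = 1_000.0                  # shortly after boot
TWO_WEEKS_MS = 14 * 24 * 3600 * 1000.0   # about 1.2 billion ms

print(float32_spacing(ONE_SECOND_MS))    # smallest time step the clock can resolve
print(float32_spacing(TWO_WEEKS_MS))
```

Under that assumption, one second after boot the clock resolves steps of about 0.00006 ms, while after two weeks neighboring representable values are 128 ms apart, so any delta(time) smaller than that is lost.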

Thus the moral of the lecture: a) your computer is a powerful, tricksy beast
which has many ways to trap you in even straightforward code and b) you should
treat software quality like some 19 year old's life depends on it, because it
might.

~~~
bl4k
I thought that the experience of developing the GPS network (where all the
satellites and ground stations must be time co-ordinated to within 10+ decimal
places in seconds) would have helped them with this missile system.

A GPS satellite travels much faster than a missile, and is accurate to within
meters, so the tolerances are much tighter (and the system was designed prior
to the Patriot).

There is a definite 'not invented here' syndrome amongst defense contractors -
I doubt they share any information, research or solutions amongst one another,
which means the US tax payer foots the bill each time one of these contractors
must independently develop and implement a system that has likely already been
built in another part of defense.

~~~
some1else
Oh, interesting. I always figured GPS satellites were geo-stationary, but in
fact (as you pointed out), they travel at 7000 mph :-0

~~~
pmjordan
Geostationary orbits would tie the satellites to being directly above the
equator, which (a) would prevent the system from working beyond a certain
latitude and (b) would cause an extremely bad distribution of "visible"
satellites and angles between signals, so precision would suffer.

~~~
kiujygtyujik
Not to mention the transmitter power you would need if they were 25,000 miles
away in GSO rather than 90 miles away in LEO

~~~
chrisbolt
GPS satellites are actually in MEO, at 12,500 miles away.

<http://en.wikipedia.org/wiki/Medium_Earth_orbit>

------
Natsu
I find it interesting how "just reboot" has become something of a user
expectation: often, users expect it to fix most anything. To be fair, it does
seem to work fairly often.

It's a strong enough expectation that when I had some industrial machines
running DOS (albeit on more modern hardware), I added my update, backup &
diagnostic scripts to autoexec.bat so that rebooting _would_ fix most of their
problems.

It made my life a lot easier, though, because I could update the files and
configurations via a master copy on the network, then tell them to reboot
everything whenever it was convenient for them (usually between shifts) and
the machines would all grab their updates and upload some log files for me to
monitor.

------
viraptor
I wonder how they actually found out the reason for the failure? They had a
system which worked perfectly (almost) and probably could be tested in every
standard way without showing the problem. They must've had a seriously good
logging system that showed something suspicious, or someone had a really
interesting "a-ha" moment...

I'd like to hear the story of debugging this one. Also how they managed to
identify that this incident was caused by that specific bug.

------
dotBen
Imagine being the developer who wrote the line of code (who didn't
understand floating point variables). Or the QA tester who didn't spot it,
or didn't decide it was worth reporting.

Aside from being a pacifist, this is why a number of engineer friends have
stepped out of building defense systems (including missile guidance systems)
and into more civilian engineering because the stress and moral burden is just
too great.

~~~
extension
Better than being the executive who decided not to release the patch until 28
people were dead.

~~~
viraptor
As long as the workaround was known, well... I don't know exactly what the
military procedures are for situations like that, but updating an active
rocket defence system in an area where you don't necessarily have trained
engineers -vs- rebooting it every day or so to make sure it works. It looks
like a simple choice to me.

Also looking at who actually makes the mistake - if someone gives you an
update and the system fails, they're at fault. If you give clear instructions
for operation and users don't follow them...

~~~
extension
According to Wikipedia, the workaround came from the Israeli army, who found
the bug, not the manufacturer. The instructions for this workaround did not
propagate clearly and thus weren't followed.

Being defenseless for a few minutes every day during a reboot hardly seems
like a reasonable fix anyway, especially if it becomes standard procedure that
your enemy may learn about.

The fact that the manufacturer _did_ release a patch, rather than a
workaround, the day after the accident suggests that this was indeed the
safest course of action. It was just taken too late.

------
nitfol
More information about the errors in the (fixed point) math on the patriot:
<http://www.ima.umn.edu/~arnold/disasters/patriot.html>
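
A rough sketch of the arithmetic described on that page, as I understand it: the clock counted tenths of a second and multiplied by a chopped binary approximation of 0.1. The 23 fractional bits below are an assumption chosen to reproduce the roughly 0.000000095 per-tick error quoted there; the function names are hypothetical.

```python
from fractions import Fraction

FRACTION_BITS = 23          # assumption: reproduces the ~0.000000095 per-tick error
TICKS_PER_SECOND = 10       # the clock counted tenths of a second
SCUD_SPEED_M_PER_S = 1676   # figure quoted on the linked page

def chopped_tenth(bits=FRACTION_BITS):
    """0.1 truncated (not rounded) to `bits` binary fractional digits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)

def clock_drift_seconds(hours_up, bits=FRACTION_BITS):
    """Accumulated clock error after `hours_up` hours without a restart."""
    ticks = hours_up * 3600 * TICKS_PER_SECOND
    per_tick_error = Fraction(1, 10) - chopped_tenth(bits)
    return float(ticks * per_tick_error)

drift = clock_drift_seconds(100)
print(round(drift, 4))                       # drift in seconds after 100 hours
print(round(drift * SCUD_SPEED_M_PER_S))     # metres a Scud covers in that time
```

With these assumptions the drift after 100 hours comes out to about 0.34 seconds, during which a Scud travels more than half a kilometre.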

~~~
joshzayin
That's interesting.

Given that it was an issue with a non-terminating binary representation, what
would be the way to handle this, without somehow resetting the clock (restart
or otherwise)?

Obviously, you could, in a modern system, have more memory and be able to
store more bits of the number, but there would still be a limit that you would
run into after some amount of time that would cause similar problems.

~~~
tetha
I'm just thinking about this. I think I would try to split the clock into a
precise part, which (for example) tracks how many milliseconds we are into an
hour, and a coarse part, which stores the hours, or the date up to the hours, or
whatever. Given this, I can reset the part with the degrading accuracy often
enough that the clock as a whole stays accurate enough.
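
A minimal sketch of that split-clock idea, using integer counters for both parts. The class name `SplitClock` and the 100 ms default tick are made up for illustration; this is not how the Patriot (or its patch) actually worked.

```python
class SplitClock:
    """Clock split into whole hours plus milliseconds into the current hour."""

    MS_PER_HOUR = 3_600_000

    def __init__(self):
        self.hours = 0          # coarse part, grows slowly
        self.ms_into_hour = 0   # fine part, wraps every hour so it stays small

    def tick(self, ms=100):
        """Advance the clock by `ms` milliseconds."""
        carried_hours, self.ms_into_hour = divmod(
            self.ms_into_hour + ms, self.MS_PER_HOUR
        )
        self.hours += carried_hours

    def snapshot(self):
        return (self.hours, self.ms_into_hour)

    @classmethod
    def elapsed_ms(cls, earlier, later):
        """Exact difference between two snapshots, in milliseconds."""
        return (later[0] - earlier[0]) * cls.MS_PER_HOUR + (later[1] - earlier[1])


clock = SplitClock()
clock.tick(100 * SplitClock.MS_PER_HOUR)   # pretend 100 hours of uptime
before = clock.snapshot()
clock.tick()                               # one more 100 ms tick
after = clock.snapshot()
print(SplitClock.elapsed_ms(before, after))
```

Because the fine part never grows past one hour and everything is integer arithmetic, a small delta(time) taken after long uptime is just as exact as one taken right after boot.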

------
stretchwithme
Interesting how complex it is to determine the accuracy of the missiles.
Multiple Patriot missiles fired at each Scud, several possible outcomes, the
Scud can break apart making multiple targets for the Patriots.

~~~
patio11
I agree, though "We got the missile but missed the warhead. That has to count
for something." strikes me as a little off. Pretend Saddam Hussein has ordered
his team of crack experts to use crappy engineering as an active
countermeasure. They just beat you. Do better.

~~~
Anechoic
One of my college professors was an outspoken opponent of missile defense
systems (and the Patriot system specifically) that worked during the missile's
reentry/descent phase.

His objection is that it's too easy for an opponent to defeat the system
either by overwhelming it (MIRVs for instance) or by designing the reentry
vehicle to make random movements which would make it really difficult for a
interceptor to track. Even if the interceptor can hit it, it's more likely to
knock the warhead off course instead of destroying it. If a nuclear missile is
aimed at NY and an interceptor hits it so that it falls to Philly instead,
that's still a net loss.

Saddam inadvertently hit upon both methods - his engineers tried to
improve on the Soviet scud design to give them more range (which they were
successful at) but their improvements made the missile more likely to break up
on reentry (which presented more targets than the tracking radar anticipated)
and the lack of aerodynamics of the resulting pieces (including the warhead)
made the missiles fall in unpredictable ways, which caused tracking problems. An
opponent that actually _tried_ to game the system could make his missiles more
difficult to hit.

The professor is advocating for boost-phase missile defense since the missile
movements are much more predictable.

~~~
icegreentea
But it's not like boost-phase intercept is a magic bullet (ha!). Well, really,
it generally takes a magic bullet. Launch sites are typically far away. Unless
you have an interceptor several times faster (or interceptor site much closer
to the launcher than the target), then it's very hard to actually reach the
missile while it's in boost phase.

That airborne laser that's been in development for seemingly forever was
pretty much determined to be the best way to intercept. You'll notice that it
combines BOTH elements. That 747 is flying within LOS of the missile path
while in boost phase, and it's also using the fastest projectile possible.

You'll also note that the Patriot's role is medium tactical air defense as
well as theater anti-ballistic missile defense. The second role was basically
tacked on, and then later massively expanded (once it became obvious that
there aren't many air forces in the world that can actually fight the US).

In the end, I'm sure every general and admiral actually out to improve their
warfighting abilities would want both systems. It's all about defense in
depth. It's the reason why warships have three different sets of anti-air
missiles, while we still have stingers when we have patriot missiles, and why
US fighter aircraft still carry short range missiles and guns.

------
duncanj
The really interesting lesson that engineers have learned from the Patriot is
that "never reboot" might not be the best target for critical systems. Rather,
controlled rebooting can help clean up problems before they affect the
function of the system.

<http://portal.acm.org/citation.cfm?id=1251254.1251257>
<http://www.computer.org/portal/web/csdl/doi/10.1109/HOTOS.2001.990072>

------
daemin
I seem to recall reading somewhere that the system was originally designed to
be a mobile platform against Soviet missiles in somewhere like West Berlin,
where they needed something that would be moved around every day or two so
that the enemy would not know its location. That meant that the system would
be reset whenever it was moved, and therefore using a floating point clock was
a reasonable design trade off.

------
dRother
Sure, rebooting is often the most straightforward way to fix runtime issues
because it resets everything. In this case, it sounds like resetting the clock
would have been just as effective. I'm sure these days, you'd have something
like the equivalent of an ntpd update every hour to take care of that.

