
Failed intercept at Dhahran caused by a software error in handling of timestamps - sine
https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dhahran
======
avar
Even better, the timeline:

    
    
        - February 11th: Vendor informed of the issue
        - February 25th: 28 people die because of the issue
        - February 26th: The vendor ships a fix
    

I'd have loved to be a fly on the wall for that phone call on the 25th (or
early on the 26th).

~~~
phonon
You missed this date--

Feb 21--notice goes out to users to avoid "very long run times". Users do not
know what that means, and ignore warning.

[https://www.gao.gov/assets/220/215614.pdf](https://www.gao.gov/assets/220/215614.pdf)
(page 9)

"On February 21, 1991, the Patriot Project Office sent a message to Patriot
users stating that very long run times could cause a shift in the range gate,
resulting in the target being offset. The message also said a software change
was being sent that would improve the system’s targeting. However, the message
did not specify what constitutes very long run times. According to Army
officials, they presumed that the users would not continuously run the
batteries for such extended periods of time that the Patriot would fail to
track targets. Therefore, they did not think that more detailed guidance was
required."

~~~
macintux
That's terrible. Competent technical writing is criminally undervalued.

~~~
zeeZ
But there's also "presumed" and "did not think" in there. When there's a
problem with your killing device, you probably shouldn't use it until you've
clarified what the problem is, rather than just assuming your end users will
use it correctly.

That's like saying "It's fine, the critical vulnerability patch will be
applied on reboot", while in reality all your users just suspend to disk and
move that annoying reboot nag window behind the task bar where it's out of
sight.

~~~
mikeash
“You probably shouldn’t use it until you’ve clarified...” doesn’t work so well
for a defensive system.

~~~
lostlogin
How long was the off/on cycle? If it was short it would be reasonable to do it
periodically. I don't think one can pin the blame on the vendor alone.

~~~
mikeash
Clearly the vendor thought it would be reasonable. They failed at
communicating it, though. Putting a number on it would have made things clear:
“The system must be rebooted after at most 12 hours [or whatever the
appropriate value would be] of operation.”

------
tofof
This particular bug is often taught in university compsci classes because "bug
that killed people" is a good attention grabber -- the CS/EE analysis is sound; its
truthfulness is only suspect because of the DoD's claimed successes.

A more truthful "computer bugs that killed people" example would be the
Therac-25 - a machine intended to treat cancer with tightly-focused radiation
therapy. Six patients died as a result of massive overdoses of radiation, on
the order of 20,000 rads. It was possible for the machine to end up in a state
where it delivered full-power radiation without a hardware shield in place to
protect the rest of the patient's body. No hardware interlocks were used to
ensure that the full power mode was only usable with the shield in place - all
safety features relied on software. In addition, the bug was only possible
when an operator made a mistake in mode selection and then _rapidly_
(proficiently) corrected it - the rapidity required prevented the bug from
being discovered during slow, methodical, careful testing.

See Hackaday's article Killed by a Machine (and associated HN discussion) or
for the especially curious, a 49-page post-mortem for more detail:

[https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...](https://hackaday.com/2015/10/26/killed-by-a-machine-the-therac-25/)

[https://news.ycombinator.com/item?id=12201147](https://news.ycombinator.com/item?id=12201147)

[http://sunnyday.mit.edu/papers/therac.pdf](http://sunnyday.mit.edu/papers/therac.pdf)

------
otoburb
This was a tragic and preventable loss. It's incredible that a software bug
might have been the root cause.

At the time, this incident really stuck out because it broke the illusion of
our fabled Patriot missile shield protecting us. Civilian expats really
_believed_ the inflated Patriot interception rates parroted to us by
mainstream media and our American military expat buddies.

A large number of remaining expats who had stuck out the Gulf War to that
point decided to pack it in and leave when word got out that the Dhahran
barracks were hit. Although history shows that Iraq surrendered days after
this incident, at the time there was heightened fear and confusion amongst the
remaining expats, especially the non-Americans.

We left on the last Lufthansa flight (crewed by military personnel) after
hearing about this.

Nostalgic edit:

During the Gulf War embassies issued equipment and rations to expat citizens
who chose to stay behind. Americans were issued full body suits (for adults
and youths) due to the biological and chemical weapon payloads that Saddam
boasted his SCUDs were carrying, along with MREs that tasted fabulous! In
stark contrast, Commonwealth citizens were issued a bare gas mask (adult size
only) and mono-flavour MREs that tasted like cardboard.

The British embassy sticks out in my mind: with stern stone-faced expressions
they admonished us all for not evacuating and thus endangering children in a
war zone. In addition to the terrible rations and gas masks, they wordlessly
gave us a stack of translucent stickers. When asked what they were for,
embassy staff explained that in the event of the air siren going off, we
should get under our sturdiest tables and don our gas masks (standard
procedure), and _then_ slap the stickers on. If the stickers changed colour,
it meant we were in the presence of a biochemical agent and would have
approximately 10 seconds before we died a horrific death.

You kind of had to be there to appreciate the grim humour.

~~~
celticninja
I mean I kind of understand the attitude of the British Embassy, it wasn't like
trouble flared up overnight, the option to leave was there for a long time
prior to the war beginning. Obviously it isn't the fault of the children who
were kept there by their parents, but some responsibility needs to be borne by
the expats that decided they were getting paid well enough to stay.

~~~
otoburb
We all understood their stance. The notable point is the stark contrast
between the Americans (embassy and expats alike) and everybody else.

While most of us were cowering under our desks and tables during SCUD attacks,
some of our American civilian friends were out with their families in the
desert trying to film the Patriots "intercepting" the SCUDs and driving out to
try and pick up pieces of debris.

I look back upon those days with fondness and gratitude, especially for the
American forces that served.

------
sharemywin
I remember hearing about this in my numerical analysis class.

1\. I remember hearing the system was only designed for XX operational hours
but was being run over the operational spec.

2\. The time was stored in base 10 so the calculation errors added up over
time or something like that, so if they had used some base 2 timing scheme it
wouldn't have had issues with rounding errors.

My class was in the mid nineties so the details of my 25 year old memory are
pretty hazy...at best.
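
For what it's worth, the base-10 vs base-2 point can be sketched in a few lines of Python. This is only an illustration. It assumes (as commonly reported, not stated in this thread) that the clock counted tenths of a second and that 1/10 was held in a binary fixed-point register with 23 fractional bits. `truncation_error` is a made-up helper name.

```python
from fractions import Fraction


def truncation_error(tick: Fraction, fraction_bits: int) -> Fraction:
    """Error from chopping `tick` (in seconds) to `fraction_bits` binary places."""
    scale = 2 ** fraction_bits
    # Floor division on the exact numerator/denominator gives the chopped value.
    stored = Fraction((tick.numerator * scale) // tick.denominator, scale)
    return tick - stored


# Assumption: 23 fractional bits, per the commonly cited analysis of the bug.
BITS = 23

tenth_error = truncation_error(Fraction(1, 10), BITS)
sixteenth_error = truncation_error(Fraction(1, 16), BITS)

print(float(tenth_error))      # small but non-zero: 1/10 repeats forever in binary
print(float(sixteenth_error))  # 0.0: a power-of-two tick is stored exactly
```

So a tick length that is a power of two would have converted exactly, which is probably what the "base 2 timing scheme" remark was getting at.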

~~~
clw8
My recollection matches with yours, except I learned about it in the first
week of Embedded Systems 101. If it isn't a standard part of the curriculum at
every college embedded systems class, it should be! It really drove home the
point that bad code can kill.

~~~
pilom
I learned about it in a Decision Analysis course and had a completely
different point driven home. This wasn't bad code. It was code that was
correctly written to a very well defined requirement ("System shall be
operational for at most X hours before a reboot"). The code was written to a
spec that was approved by the customer (the military). Unfortunately though,
that requirement wasn't communicated to the end users.

~~~
Jtsummers
I'm failing to find anything that says the requirement was "System shall be
operational for at most X hours before a reboot". It's more likely that there
was a key performance parameter (KPP) saying that it should be functional for
_at least_ some period of time. And that was what was tested.

Generally KPPs (which aren't requirements themselves, but influence the
requirements for systems) are set at lower bounds, not upper bounds, for
something like this. You wouldn't set a KPP: Should only work for 4 hours.
You'd use: Should work for at least 3 hours, 4 hours desirable (or some
similar language). If it works for longer, that's great. But longer won't be
tested since it's not a requirement or goal for the system, which also means
failure modes for longer runtimes won't be encountered because they're outside
the bounds of the system requirements and specs.

~~~
dragonwriter
As I gather, the Patriot was a mobile anti-aircraft / anti-cruise missile
platform that was meant to move, be activated when needed, and then be turned
off and move again because the original location was expected to become a
target. It was pressed, on short notice (with some software upgrades, but not
the normal cycle of specs, development, and validation that would go into that
kind of repurposing) into stationary, continuous coverage, anti-ballistic-
missile (critically, dealing with much faster targets than originally
envisioned, which means short warning times where deactivations have a lot
more risk) use.

So, while it's horrible in results, it can be very easy to understand why
basic functions would have specs not at all adapted to the use to which it was
being put.

~~~
Jtsummers
There's a distinction to be made, though. There was no _requirement_ that it
be rebooted after some period of time, though there was an expectation that
this would happen by the original developers. Consequently it was not
evaluated for 20 hour or 100 hour performance. That's a critical distinction
in developing, testing, and fielding systems. And the way we term it in our
requirements documents reflects this. We rarely say: System SHALL fail after
some period. Rather we say: System SHALL perform for some period. We leave the
result of longer durations undefined. The system may work, or it may not, we
aren't required to test it and so we don't. If the customer wants it to run
longer, we can evaluate it but they have to communicate that back to us (or to
the testing facilities, which may not be the developers).

Similarly, with regards to the speed of the missiles, the requirement would
not be: System SHALL fail to detect missiles above some threshold speed. But
rather: System SHALL detect missiles below some threshold speed. This leaves
open the possibility that it may be more or less accurate outside that range.
It should be documented for the operators as a potential for failure: System
may be ineffective against missiles operating above X m/s. But the
requirements wouldn't include that detail.

This pushes the problem into the documentation and training. Since it was
originally designed as a mobile platform with short run-times, there was no
explicit operating procedure requiring reboots. It was just assumed. At the
same time, the failure itself (after 20 hours) was unknown because testing
hadn't been done to see what would happen.

------
OedipusRex
That was a temporary fix, then a software patch was released. I also wouldn't
call that a "software" fix.

------
dredmorbius
The inimitable comp.risks discussed this in 1992:

[http://catless.ncl.ac.uk/Risks/13/35#subj1.1](http://catless.ncl.ac.uk/Risks/13/35#subj1.1)

[http://catless.ncl.ac.uk/Risks/13/76#subj8.1](http://catless.ncl.ac.uk/Risks/13/76#subj8.1)

And in 1997:

[http://catless.ncl.ac.uk/Risks/18/79#subj9.1](http://catless.ncl.ac.uk/Risks/18/79#subj9.1)

------
tntn
Despite other comments below, I think that the equivalence drawn between
"failed to save" and "killed" reflects an interesting philosophical choice. I
don't think that this equivalence is universally accepted, even by those who
call thinking otherwise fallacious.

If an EMT fails to save a victim of a car crash, did he/she kill the victim?
If the dispatcher misspoke and gave the wrong cross street, delaying aid, did
the dispatcher kill them?

~~~
rxhernandez
In the medical device industry the company who made the device can be found at
fault if a clinician makes a poor decision that leads to death based on a
fault in the device. If the soldiers would have sought better cover or been
otherwise saved had there been no missile defense system there, then yes,
some, if not most, of the blame lies with the software error.

------
logfromblammo
For doing a ballistic propagation, you apply a gravitational map in Earth-
centered, Earth-fixed (ECEF) geodetic coordinates, then convert to Earth-
centered rotating (ECR) geodetic coordinates, because that way you don't have
to correct for the Coriolis effect. That ECEF-ECR conversion requires a time-
of-day parameter.

You can use a gravitational map that only accounts for latitude, but it isn't
as precise.

So using an accurate clock is _really_ important if your intent is to hit a
missile with a missile.
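
A rough sketch of why the time parameter matters, assuming the conversion is a plain rotation about the Earth's spin axis between an Earth-fixed frame and a non-rotating one. The constants are standard approximate values and the function names are made up for illustration. This shows the general point about clocks, not the actual Patriot failure mechanism.

```python
import math

EARTH_ROTATION_RATE = 7.2921159e-5  # rad/s, standard approximate value
EQUATORIAL_RADIUS_M = 6378137.0     # WGS 84 equatorial radius


def rotate_about_z(position, angle_rad):
    """Rotate an (x, y, z) position about the Earth's spin axis."""
    x, y, z = position
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x + s * y, -s * x + c * y, z)


def frame_error_m(position, clock_error_s):
    """Displacement caused by using the wrong time in the frame rotation."""
    wrong = rotate_about_z(position, EARTH_ROTATION_RATE * clock_error_s)
    return math.dist(position, wrong)


point_on_equator = (EQUATORIAL_RADIUS_M, 0.0, 0.0)
print(frame_error_m(point_on_equator, 1 / 3))  # on the order of 150 m
```

The ground moves several hundred metres per second at the equator, so even a fraction of a second of clock error shifts the frame by far more than a warhead's width.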

------
sjburt
This is a completely misleading headline. The Patriot missile was not
effective at destroying the Scud [0]. The DoD initially claimed successful
intercepts when the missile detonated near the Scud, but it rarely, if ever,
actually destroyed the warhead. The only reason there was an illusion of
success was that the Scud was also spectacularly unreliable and often broke up
on re-entry or failed to detonate. It is a complete falsehood to claim that
the Patriot would have prevented this loss of life.

[0]
[http://www.slate.com/articles/news_and_politics/war_stories/...](http://www.slate.com/articles/news_and_politics/war_stories/2003/03/patriot_games.html)

------
seorphates
Reboot. Around the same time-frame we gathered the flag for a deployment
(fleet admiral) and I was responsible for UNIX systems on the ship. Not long
after coming aboard the command came down to reboot all of the systems at
midnight, nightly (yes, only the UNIX systems). Being that "But Mister.."
never really gets you too far in the military I just rode it iterating through
any possible reason for the madness, nightly. I could never come up with a
good one. Until now. (ok, perhaps not a "good" reason but crazy enough to
count.)

It now makes much more sense to me that a (terrible) mishap had occurred and
possible prevention was only a reboot away. I can see how being exposed to
that context at upper levels could easily cause one to latch onto any
perceived preventative measures.

I also once saw a short ntp time step across multiple clusters (yeh,
simultaneously) shut down half of a wafer factory.

Time is important.. but rebooting all your systems at midnight probably will
not help you to control it. This especially if there are large, hot, fast
objects flying around in the night sky and definitely, really, don't do ALL of
them at the same time every day .. especially during, you know, battle. /pro-
tip

~~~
lostlogin
That's still not great logic. Think of all the crazy shit you have seen fix
machines. If all that was implemented you would have users doing some truly
bizarre things.

~~~
seorphates
Mm. That's on point. It is as illogical as having the means and knowledge for
prevention and not applying it. The crazy shit (booting theater active
operational assets) was implemented by authority. Not patching theater active
assets leads to death.

------
bertjk
I've often wondered, considering the supposed low accuracy of Scud missiles
(wiki gives it a CEP of 450m), how many of the casualties from that incident
were due more to the bad luck of the missile actually hitting its target.
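
As a rough way to put numbers on that: CEP is the radius within which half the rounds are expected to land. A sketch, assuming a circular normal error distribution with no aiming bias (a textbook simplification, not something from the article). `probability_within` is an illustrative name.

```python
def probability_within(radius_m: float, cep_m: float) -> float:
    """P(miss distance <= radius) for circular normal errors with the given CEP."""
    return 1.0 - 0.5 ** ((radius_m / cep_m) ** 2)


CEP_M = 450.0  # figure quoted above from Wikipedia

for radius in (50.0, 100.0, 450.0):
    print(radius, round(probability_within(radius, CEP_M), 3))
```

Under those assumptions, the chance of landing within 100 m of the aim point is only a few percent.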

~~~
nerpderp83
If the Scud had been brought down earlier in its trajectory it would have not
been near people regardless of any randomness in its landing.

------
criley2
This is a bad, editorialized title that is not the title of the article.

Mods should change this. The "software fix" was a software patch which
corrected the clocking bug.

The "software workaround" to use pre-fix was a reboot.

I hate editorialized, lying titles :(

~~~
codazoda
Came here to mention that. The title needs a re-write but the story is
interesting still.

------
leggomylibro
I could be reading this wrong, but 1/3 of a second within 100 hours seems
really good, like something you'd get from a temperature-controlled crystal
oven.

I don't mean to second-guess them in an area I know so little about, but if
that was enough to cause a serious issue in the span of only a few days,
shouldn't the devices be designed with a separate synchronization system, at
least as a backup? Maybe GPS?

Which brings up a sort of interesting question...would a Patriot missile
system even have receivers for a weak public signal like GPS, or is it all
self-contained?

~~~
GCU-Empiricist
As a former submariner who has used clocks for inertial navigation or for
similar weapons systems, 1/3 of a second over 100 hours is terrible.

~~~
leggomylibro
I mean, it's not an atomic clock, but I'm comparing it to the 32.768KHz RTC
crystals I use with consumer microchips. If super-precise isolated accuracy
were actually important, I assume they would use a rubidium or cesium
oscillator.

~~~
grkvlt
1/3 of a second in 100 hours is basically 1ppm, or TCXO levels of accuracy, so
pretty good i'd have thought, even for a submarine INS?
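
The arithmetic behind that, as a small sketch. It treats the drift as if it were a plain frequency error. The 20 ppm figure for a watch crystal is a typical datasheet tolerance and is an assumption here, not something from the article.

```python
def drift_ppm(drift_s: float, elapsed_hours: float) -> float:
    """Fractional clock error, in parts per million."""
    return drift_s / (elapsed_hours * 3600.0) * 1e6


def drift_seconds(ppm: float, elapsed_hours: float) -> float:
    """Seconds gained or lost by a clock that is off by `ppm`."""
    return ppm * 1e-6 * elapsed_hours * 3600.0


print(drift_ppm(1 / 3, 100))   # a little under 1 ppm
print(drift_seconds(20, 100))  # assumed 20 ppm crystal over the same 100 hours
```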

~~~
cocoablazing
USN ballistic missile submarines deploy the most accurate INS in the world
(ESGN), and that system is used in conjunction with another advanced gyro.

~~~
grkvlt
Not sure about that. ESGN is old technology, since submarines don't need
_that_ much accuracy. For example, space probes, ballistic missiles, smart
artillery shells/rockets/missiles and so on would all appear to have multiple
orders of magnitude better accuracy than submarines, in the fractional ppb
ranges, rather than tens of ppm. [0][1]

0\.
[https://www.sto.nato.int/publications/STO%20Meeting%20Procee...](https://www.sto.nato.int/publications/STO%20Meeting%20Proceedings/RTO-MP-SET-104/$MP-SET-104-KN2.pdf)

1\.
[http://users.cecs.anu.edu.au/~Jonghyuk.Kim/teaching/Inertial...](http://users.cecs.anu.edu.au/~Jonghyuk.Kim/teaching/Inertial_Where%20to%20now_DraperLab.pdf)

------
brohoolio
This is depressing. One of my middle school classmates had a brother killed in
a SCUD strike.

------
jimjimjim
regarding the comments about bug killed people versus weapon killed people.

There is no 1 answer, this argument is a result of black-white/yes-no/us-them
single point of blame thinking. and it's terrible.

the bug _contributed_ to the loss of life.

------
macawfish
Little things do add up.

------
mlazos
The title of this post is misleading, they eventually supplied a software
patch that fixed the clock drift. The Israelis proposed rebooting as a stopgap
until the bug could be fixed.

~~~
sctb
We've updated the submitted title from “Clock error lead to death of 28
Soldiers. Software fix: Reboot system regularly” to a representative phrase
(edited for length) from the article. Submitters: please follow the guidelines
by not editorializing titles.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

------
nathan_long
> The Patriot missile battery at Dhahran had been in operation for 100 hours,
> by which time the system's internal clock had drifted by one-third of a
> second. Due to the missile's speed this was equivalent to a miss distance of
> 600 meters.
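
The figures in that quote can be roughly reproduced. A back-of-the-envelope sketch, assuming details not given in the quote itself: the clock counts tenths of a second, 1/10 is chopped to 23 binary places, and the Scud travels at about 1,676 m/s (the speed usually cited in analyses of the incident). Function names are illustrative.

```python
from fractions import Fraction

FRACTION_BITS = 23            # assumption: fractional bits kept for 1/10
TICKS_PER_SECOND = 10         # assumption: clock counts tenths of a second
SCUD_SPEED_M_PER_S = 1676.0   # assumption: commonly cited Scud speed


def clock_drift_s(hours_up: int) -> float:
    """Accumulated timing error after `hours_up` hours of continuous operation."""
    scale = 2 ** FRACTION_BITS
    stored_tenth = Fraction(scale // 10, scale)      # 1/10 chopped, not rounded
    error_per_tick = Fraction(1, 10) - stored_tenth
    ticks = hours_up * 3600 * TICKS_PER_SECOND
    return float(ticks * error_per_tick)


def range_gate_shift_m(hours_up: int, speed: float = SCUD_SPEED_M_PER_S) -> float:
    """Distance the target travels during the accumulated timing error."""
    return clock_drift_s(hours_up) * speed


for hours in (8, 20, 100):
    print(hours, round(clock_drift_s(hours), 4), round(range_gate_shift_m(hours)))
```

At 100 hours this comes out at roughly a third of a second and several hundred metres, in line with the quoted figures. The error grows linearly with uptime, which is why a reboot resets it.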

------
jasonmaydie
The scud missile led to their deaths, not the software. There's no absolute
guarantee it would have intercepted it, plus rebooting a deployed machine
regularly is an acceptable fix when it's live in the field

~~~
rosser
That's a _reductio_ fallacy. If you want to play that game, it was being
deployed to that specific place that caused their deaths. Or was it enlisting
in the first place? Maybe merely having been born?

This is a strictly technical examination of the proximate cause of their
deaths; it makes no claims about their ultimate cause. Whether or not a
missile system with an accurate clock might have hit the target, it is
unambiguous that this one missed _specifically_ because of clock drift.

~~~
jasonmaydie
How so? The implication you and the article are asserting is that the clock
error caused their deaths.. rather than the more accurate description "could
have prevented death".

~~~
Vivtek
Well, it wasn't the missile that caused their deaths. Strictly speaking, it
was the _explosion_ of the missile.

Well, wait. It wasn't the explosion - technically, it was the impact of the
pressure wave on their bodies that _caused_ ... well, no. Really, it was the
fact that their organs stopped working after impact of the ... well. If you
really want to be _accurate_ , it was the fact that metabolism ceased to be
practicable after their organs stopped working.

Well, no, actually, the fact that their mental processes _depended_ on their
metabolism - that was really the cause of their... Well, no...

