
Killed by a Machine: The Therac-25 - paraknight
http://hackaday.com/2015/10/26/killed-by-a-machine-the-therac-25/
======
lpage
My uncle was a radiation oncologist who worked with machines like this one. When
I was a young child in the early 90s, he took me to the hospital where he worked
to watch him and their staff physicist calibrate a linear accelerator - an
experience I remember vividly. A very long checklist culminated in them
irradiating a plate of acrylic, which I got to keep. The plate was about 20cm
x 30cm x 4cm and aligned such that the beam would strike the 4cm face and
travel down the 20cm side, fanning out in the process. The result looked like
a thinner version of this:
[https://en.wikipedia.org/wiki/Lichtenberg_figure#/media/File...](https://en.wikipedia.org/wiki/Lichtenberg_figure#/media/File:PlanePair2.jpg)

I asked him how they could possibly use such machines on humans, given what I
had just watched it do to a 1kg plate of acrylic. He told me that they hit the
plate with way more energy than they ever would a human. That prompted my
followup question: "uncle, what happens if you accidentally hit the wrong
button?" He told me that accidents like that used to happen, but that the
machines they used had special computers to keep the patient safe even if he
made a mistake. "But uncle, what happens if the computer makes a mistake!?" I
had no idea what a _bug_ was, or for that matter what code was. He didn't have a good
answer beyond "the computer can't make mistakes like we do." Having played
with computers enough by then to know that his statement wasn't entirely true,
I ended that outing wanting to know more. I started obsessing over what would
happen to people if computers controlling things like that linear accelerator,
or even the elevator in my dad's office building, made a mistake.

Incidentally, my uncle was the one who got me interested in science, and that
trip to the hospital got me using computers for something other than games.
Fast forward 24 years and...well...part of what I do is work on provably
correct systems.

------
kqr
Are there any statistics on the safety/reliability of software controlled vs.
hardware controlled devices?

I work in software engineering, so I'm exposed daily to broken software
controls, and I'm gradually becoming more of a "grumpy old man" longing for
the good old days when machines (be it cameras, cars, watches, medical
equipment or anything else, really) could be "debugged" by following levers,
wires, physical stops and hoses. I feel much more comfortable with that than
with complex computer systems.

I would like to know if my fears are grounded in actual fact, or if I'm just
riling myself up over nothing.

~~~
notalaser
Purely hardware-controlled vs. purely software-controlled devices are somewhat
difficult to evaluate (partly because purely X-controlled is somewhat of a
rarity).

The "trick", or rather, the true test of a design when it comes to this is
judiciously balancing the advantages that each of these categories bring you.
You generally want to use "hardware" means in order to nullify the important
risks of programmable logic (i.e. the major consequences of a blunder in the
logic that was programmed on the device), and use programmable logic for those
sections that require flexibility, ease of programming, potential fixes or
enhancements and so on.

Real-life example from a device I worked on: the linear actuator (that was
essentially sticking needles into people's brains) had a physical barrier past
the safe distance limit. Literally a big hunk of metal that the actuator could
not be moved past.

Of course, there was a limit in software as well (the software would refuse to
move the motor past the safe limit). But with the physical barrier in place,
the software limit mostly protected the motor, not the patient.

It's worth noting that design decisions in these fields are done not so much
based on the amount of brokenness in existing implementations (to put it
bluntly, there's about as much broken hardware as there is broken software),
but based on the amount of risk that a solution introduces. Many regulating
bodies (e.g. FDA) will classify your device in a higher risk class if there's
programmable logic in it, simply because software tends to be harder to write
and test reliably (not only because of bad programmers, but also because of a
lack of standards, or at least consensus and metrics on testing).

Edit: I can't quote numbers right now; there is data that shows the risks
involved in purely software-based approaches, and it's easy to see why even in
the example above. A software-enforced limit on the distance of movement can
fail for a bunch of reasons, not all of them bugs. Bugs are one thing, but
the system could also fail due to a broken connection between the motor driver
and the CPU, due to a bug in the hardware implementation of the motor driver
itself, due to glitches on the bus and so on.

There are software mitigations for all of these cases, too (e.g. you
continuously monitor the position of the motor; if the motor still moves after
you tried to stop it, you reset the system, and the hardware is wired so that
all power to the motor is cut when you come out of reset), but nothing is as
efficient as making sure that the motor just won't be able to move the load
past a certain distance by placing a physical barrier in its way. By an
uncanny chain of events, maybe all the software mitigations can fail; nothing,
however, can make a big hunk of metal disappear into thin air.
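The monitoring mitigation described above can be sketched roughly like this (a toy model, not real firmware; `read_position`, `stop_motor` and `trigger_reset` are hypothetical stand-ins for the actual driver interface):

```python
class MotorWatchdog:
    """Toy model of the software mitigation described above: watch the
    motor position, command a stop past the software limit, and escalate
    to a hardware reset (which cuts motor power) if movement continues."""

    def __init__(self, read_position, stop_motor, trigger_reset,
                 safe_limit_mm=80.0, tolerance_mm=0.1):
        # All three callables are hypothetical hooks into the real driver.
        self.read_position = read_position
        self.stop_motor = stop_motor
        self.trigger_reset = trigger_reset
        self.safe_limit_mm = safe_limit_mm
        self.tolerance_mm = tolerance_mm

    def check(self):
        """One monitoring tick; returns the action taken."""
        pos = self.read_position()
        if pos <= self.safe_limit_mm:
            return "ok"
        # Past the software limit: command a stop, then verify it held.
        self.stop_motor()
        if self.read_position() - pos > self.tolerance_mm:
            # Motor still moving after the stop command: last resort.
            # The hardware is wired to cut all motor power on reset.
            self.trigger_reset()
            return "reset"
        return "stopped"
```

A real implementation would run this from a timer interrupt or a dedicated monitoring task; the point is only that the reset path is the last software line of defense, sitting behind the physical barrier.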

~~~
shabble
Do you have a 'big hunk of barrier presence' monitor though? Otherwise, when
someone removes the physical stop during maintenance and forgets to reinstall
it, and nobody notices for a long time because the software safeties are all
working as intended, until...!

Obviously, you can't go down the rabbit hole forever, and this is a bit
tongue-in-cheek, but IIRC similar circumstances have occurred.

The San Salvador medical irradiation facility accident[1] is largely a tale of
gradual failure of safety features and interlocks, combined with the
misunderstanding that they were still adequate or safe, until they weren't.

[1] [http://www-pub.iaea.org/MTCD/publications/PDF/Pub847_web.pdf](http://www-pub.iaea.org/MTCD/publications/PDF/Pub847_web.pdf)

~~~
pessimizer
> when someone removes the physical stop during maintenance and forgets to
> reinstall it

That wouldn't be seen as the fault of the device. Software bugs will still
show up when the product is used as directed. Ideally, though, the device
wouldn't be able to be used _without_ the physical stop.

~~~
notalaser
Indeed, there is no way to remove the physical stop.

Furthermore, the _design_ process is shaped so that it leads to such designs.
If the first iteration of the design had included a removable physical stop,
it would have been caught in the risk analysis. Most mission-critical fields
have standards that specify this sort of thing (IEC 60601 for medical devices,
DO-254 and DO-178C for avionics, etc.).

When designing this kind of device, whether a failure is the fault of the
device or not is not too relevant in general. You have to mitigate _any_ kind
of unacceptable risk (i.e. anything that can lead to injury or death).

There are certain common-sense exceptions here. For instance, the device isn't
expected to operate properly outside its specified operating conditions, but
you have to clearly state what those conditions are (altitude, temperature,
humidity, presence or absence of liquids, etc.) and put warning labels
regarding their breach in the manual. Similarly, the _level_ of mitigation is
often specified by standards. E.g. for medical devices, IEC 60601 specifies
insulation requirements that will protect against the kind of shocks you could
get from a faulty mains network, but not if the device is struck by lightning
or strapped to an electric chair. IEC 61010 (Safety requirements for electrical
equipment for measurement, control and laboratory use) similarly includes
provisions for the kind of protection you would need in equipment that falls
under a specific type of use (e.g. here on Earth, not up there in space).

------
dsfyu404ed
"It’s important to note that while the software was the linchpin in the
Therac-25, it wasn’t the root cause. The entire system design was the real
problem. Safety-critical loads were placed upon a computer system that was not
designed to control them."

They wrote code that depended on hardware controls, didn't document their
reliance on the hardware controls, and killed a bunch of people. DOCUMENT ALL
YOUR DEPENDENCIES!!!

~~~
wyldfire
> They wrote code that depended on hardware controls

...similar to Toyota's electronic throttle control system design.

Also, if you rely on hardware features for safety, it's still ok/good to
design the software as if it didn't depend on those features wherever
possible.

~~~
dsfyu404ed
It all depends on the specifics. How much complexity will software checks add?
In a lot of cases the lifetime cost of spec'ing a hard failsafe may be cheaper
than designing, implementing and supporting a soft one. Then there's the whole
issue of software false positives/negatives vs. hardware failures.

At some level you can no longer abstract away the situation, and the system
has to perform as a system. You can write soft controls all you want, but if
hard controls are present in the spec and aren't likely to change, there comes
a point where chasing down every little problem and checking for every
possible error is no longer cost effective. Nobody cares if you write
Mars-rover-tier code for a bulldozer, and spending the resources to do so is
wasteful if your competitors aren't also spending them. Obviously you can also
write total crap that falls outside the acceptable range at the other end.

If you're designing software to control a widget that moves and does stuff,
and that widget has a hard switch to prevent its equivalent of an
out-of-battery detonation, you have to strike a balance between relying on
that switch and introducing complexity from soft controls. Every aspect of the
system is involved in making that determination.

If you've got a hard switch, in most cases you may as well write a five-line
timeout-controlled while loop that tries to perform the action, waits, and
tries again if there's no feedback indicating the action was performed. As
long as nobody removes that switch and the code's dependency on it is very
obviously documented, that simple unsafe code is probably better than more
complex code that performs a redundant check (redundant because you have a
hard switch), because the more complex code has more going on.
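That "five line timeout controlled while loop" could look roughly like this (a sketch; `perform_action` and `feedback_received` are hypothetical, and the simplicity is only acceptable because the hard switch bounds the worst case):

```python
import time

def actuate_with_retry(perform_action, feedback_received,
                       timeout_s=5.0, poll_s=0.1,
                       clock=time.monotonic, sleep=time.sleep):
    """Try the action, wait for feedback, retry until the deadline.
    This is only acceptably simple because a hard switch bounds the
    worst case; remove the switch and this loop is unsafe."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        perform_action()
        sleep(poll_s)
        if feedback_received():
            return True
    return False  # caller must treat this as a fault, not keep retrying
```

The injectable `clock`/`sleep` parameters are just there so the loop can be tested deterministically; the five lines that matter are the deadline, the attempt, the wait, the feedback check, and the fault return.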

~~~
throwanem
I mean hey, if I'm writing code for a machine that might kill somebody if a
hardware interlock is removed, I'm going to perform a software check for a
valid state before I take the potentially fatal action, and if that's a
problem for the organization that's paying me to write that code, I'm going to
resign and find work elsewhere, because that outcome strikes me as strictly
preferable to having someone's death on my conscience. I suppose it's arguable
either way, though.

------
csours
> It’s important to note that all the testing to this date had been performed
> slowly and carefully, as one would expect.

Ah, so if I click these Angular nav-pills very quickly in series without
waiting for the page to load, something unexpected might happen?

What is annoying in one context may be deadly in another.

----

If the "Sentinel Event" policy had been in place at the time, perhaps these
deaths would have been prevented.

A sentinel event is any event that either leads to death or serious permanent
injury -OR- could have led to death or serious permanent injury.

1.
[https://www.jointcommission.org/sentinel_event_policy_and_pr...](https://www.jointcommission.org/sentinel_event_policy_and_procedures/)

------
chillydawg
When you're building what is literally a laser death cancer beam, it's
probably best to assume that the hardware the software is controlling is
actively trying to resist you, and that the software controlling the hardware
is actively trying to kill the patient.

------
vkou
Strangely enough, nobody is bringing up the argument of 'But how many people
were saved by the existence of Therac-25?'

Yet, when a poorly designed self-driving car kills people, many rush to point
out (correctly or not) all the lives that were saved by the technology.

The valley seems to have an entirely different level of regard for the
consequences of their work than engineers and doctors do.

~~~
galdosdi
I think that might merely be because self driving cars are in the future, and
Therac-25 is in the past.

Also, it's easy to imagine self driving cars that are pretty good, but still
occasionally kill people, yet at a rate much lower than human drivers.

For the Therac-25 on the other hand, it's not so complicated. Why it failed
and how to fix it is already well known.

~~~
vkou
It is indeed in the past - unlike the claims that self-driving technology is
currently safer than human drivers (which is not at all clear yet), we can
definitely say that the Therac-25 saved more lives than it ended. Yet, it is also
a case study in how not to build systems.

It's easy to imagine a medical device that's pretty good, but still
occasionally kills people because of flaws in its design. The image is fairly
terrifying, actually.

We could also say the same about the fatal Tesla accident - we know why it
failed and how to fix it... And observe how quick large parts of the tech
community (I would expect no less of the vendor) were to blame the human.

------
Wile_E_Quixote
If anyone is interested in how modern versions of these machines work, the
annual conference for the American Association of Physicists in Medicine
(AAPM) is currently proceeding at the Walter E. Washington Convention Center
in Washington, DC, and will continue until Thursday, August 4th. It's the
largest international gathering of medical physicists in the world. If anyone
is in the area and interested, you shouldn't have much trouble sneaking in and
taking a stroll through the vendor area where you can see examples of the
newest technologies in imaging and therapy physics. Just dress business casual
and you'll blend right in with the other couple thousand physicists.

------
triplesec
Example story "On March 21, 1986, a patient in Tyler, Texas was scheduled to
receive his 9th Therac-25 treatment. He was prescribed 180 rads to a small
tumor on his back. When the machine turned on, he felt heat and pain, which
was unexpected as radiation therapy is usually a painless process. The
Therac-25 itself also started buzzing in an unusual way. The patient began to
get up off the treatment table when he was hit by a second pulse of radiation.
This time he did get up and began banging on the door for help. He received a
massive overdose. He was hospitalized for radiation sickness, and died 5
months later."

------
blobbers
This story was covered in the book 'Set Phasers On Stun'
([https://www.amazon.com/Set-Phasers-Stun-Design-Technology/dp...](https://www.amazon.com/Set-Phasers-Stun-Design-Technology/dp/0963617885)).
It was required reading in our Introduction to Engineering class in first
year.

As an engineering student it was meant to imprint one thing upon us, but talk
about a dark introduction to the word 'responsibility'...

~~~
chiph
Before I switched to CS, I was in an EE program. Our intro to engineering
class was filled with stuff like the outdoor decorative fountain that killed
six people one at a time, and the transformer box that was used in a game of
hide-and-seek (the lock was missing).

In the case of the fountain, several couples had had a night out and decided
to splash in the fountain. The first two knocked loose a power conduit for one
of the pumps, electrocuting themselves. The others died as they entered the
fountain one by one to rescue the first. The system lacked a GFCI protector.

------
83457
"The VT-100 console used to enter Therac-25 prescriptions allowed cursor
movement via cursor up and down keys. If the user selected X-ray mode, the
machine would begin setting up the machine for high-powered X-rays. This
process took about 8 seconds. If the user switched to Electron mode within
those 8 seconds, the turntable would not switch over to the correct position,
leaving the turntable in an unknown state. ... It’s important to note that all
the testing to this date had been performed slowly and carefully, as one would
expect. Due to the nature of this bug, that sort of testing would never have
identified the culprit. It took someone who was familiar with the machine –
who worked with the data entry system every day, before the error was found."
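The race the quote describes can be illustrated with a toy model (purely illustrative; the real Therac-25 software was PDP-11 assembly and the actual failure mode was more involved): an edit arriving during the slow hardware setup is accepted by the console but never reaches the turntable.

```python
class ToyTreatmentConsole:
    """Toy model of the race condition quoted above (NOT the actual
    Therac-25 logic): a mode edit made while the ~8-second hardware
    setup is still in flight never reaches the turntable."""

    SETUP_TICKS = 8  # hardware takes ~8 ticks to reposition

    def __init__(self):
        self.prescribed_mode = None   # what the operator last entered
        self.turntable_mode = None    # where the hardware actually is
        self._target = None
        self._ticks_left = 0

    def select_mode(self, mode):
        self.prescribed_mode = mode
        if self._ticks_left == 0:
            # BUG: setup only starts when the hardware is idle; an edit
            # made during the setup window is silently dropped.
            self._target = mode
            self._ticks_left = self.SETUP_TICKS

    def tick(self):
        if self._ticks_left > 0:
            self._ticks_left -= 1
            if self._ticks_left == 0:
                self.turntable_mode = self._target

    def consistent(self):
        return (self._ticks_left == 0
                and self.turntable_mode == self.prescribed_mode)
```

A slow, careful operator never hits the window, which is exactly the quote's point about testing speed: the inconsistency only appears when the second edit lands inside the setup ticks.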

------
platz
These stories are horrific.

~~~
jacquesm
Yes. They should also be required study material for anybody who writes
software interacting with the real world. Humans are fragile. I wrote a
cad/cam system for a lathe/mill combo; in all the years that system was out
there, we found _one_ bug that made it out past my desk, and that only
happened because some idiot decided to demo a new feature to a prospect to try
to close a sale.

The fact that a simple error could cause someone to lose a limb or die does
wonders for your focus.

~~~
0xdeadbeefbabe
Doesn't it also prevent focus?

~~~
jacquesm
For me it definitely doesn't. It became part of the release cycle: analyze
each and every change to make sure it was safe, and have a whole bunch of code
(tested to the max but never ever changed) that would shut down the machine if
it ever got outside of its expected envelope.

This is actually quite a tricky thing to do right, because to be able to jog
the machine _out_ of a shut-down like that you have to re-enable it in a
potentially unsafe situation. For each and every little challenge like that we
found a good solution, but some of those were real head-scratchers.

A really nasty one that I recall: when you power up a bunch of latches, they
can come up in an undefined state, so the decision was made to include a
detector for that undefined state, which first had to be cleared before the
output of the latches was allowed to influence the motors.

This worked well in practice, but given the restrictions of the machine this
was all done on, it took a bit of thinking. The solution we settled on was a
magic sequence output on the parallel port indicating the system had
successfully reset, after which the relay powering the motor drivers would
engage. Until that relay triggered, everything else was ignored.
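That handshake might be sketched like this (a toy model; the MAGIC bytes and the port interface are invented for illustration, not the values actually used):

```python
MAGIC = (0xA5, 0x5A, 0xC3)  # hypothetical "reset completed" sequence

class PowerUpInterlock:
    """Toy model of the handshake described above: the relay powering
    the motor drivers stays open until the exact magic sequence appears
    on the (simulated) parallel port, so undefined latch garbage at
    power-up can never reach the motors."""

    def __init__(self):
        self._recent = []
        self.relay_engaged = False

    def write_port(self, byte):
        # Garbage from undefined latches is harmless: only the exact
        # magic sequence, in order, engages the relay.
        self._recent.append(byte)
        if tuple(self._recent[-len(MAGIC):]) == MAGIC:
            self.relay_engaged = True

    def drive_motor(self, volts):
        """Motor commands are discarded while the relay is open."""
        if not self.relay_engaged:
            return None  # ignored: drivers unpowered
        return volts
```

The design choice is the one described in the comment: the unsafe default (drivers unpowered) is enforced in hardware by the relay, and software can only make things safer by completing the handshake, never less safe by emitting garbage.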

This was already important with stepper drivers, but once we switched to
servos for some of the more demanding applications it became crucial to safe
operation that the drivers never be energized with faulty inputs. A servo
driving a ball-screw will happily wreck itself, the machine it is bolted onto,
and anything standing in between (including the operator) if it suddenly gets
driven to -10 or +10 V. And naturally, you'd always get one of those two,
never the safe '0'.

~~~
0xdeadbeefbabe
Is there also a way to make sure one guy isn't responsible? You know the way
they credited Petrov with saving us from nuclear war?

~~~
jacquesm
Well, that can cut both ways. In Petrov's case it worked out well, but
'ownership' of a problem (and the associated responsibility) in smaller
settings can also work to bring out the best in people.

Separation of duties is a good principle, and wherever possible you should use
it.

[https://en.wikipedia.org/wiki/Separation_of_duties](https://en.wikipedia.org/wiki/Separation_of_duties)

But in the case of a single tech guy in a company there isn't a whole lot you
can do in that direction, so it is best to clearly assign ownership and make
sure that the people involved realize full well the consequences of a fuck-up.

------
gene-h
As Therac-25 will perpetually tell, when you code your software, best do it
well.

------
ricardobeat
"Killed by software" would've been a much more faithful title!

