
Medical Devices: The Therac-25 (1995) - pdkl95
http://sunnyday.mit.edu/papers/therac.pdf
======
joezydeco
Just another plug for the comp.risks digest. This digest from the ACM
Committee on Computers and Public Policy has been moderated continuously by
Peter G. Neumann since its inception in 1985. If you don't frequent Usenet
like you used to in the 80s, the web archive is here:

[http://catless.ncl.ac.uk/Risks](http://catless.ncl.ac.uk/Risks)

The Therac-25 was discussed here many times, starting with Vol. 3 Issue 9:

[http://catless.ncl.ac.uk/Risks/3.09.html#subj2](http://catless.ncl.ac.uk/Risks/3.09.html#subj2)

------
phkahler
I've always taken that machine as an argument against event-driven programming.
Why? Well, John Carmack articulated the problems very nicely when he wrote
about inlined code, covered on HN here:

[https://news.ycombinator.com/item?id=8374345](https://news.ycombinator.com/item?id=8374345)

The Therac problem was a result of states getting out of sync and into an
undesirable configuration. I think reading about the machine and then the
Carmack piece above will make the connection clear.
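
To make "states getting out of sync" concrete, here's a toy sketch (all names hypothetical, nothing to do with the real Therac code): each event handler updates its own slice of state, and a "fire" event that lands between a mode-change event and the corresponding hardware-move event sees an inconsistent configuration.

```python
# Toy sketch only -- hypothetical names, not the actual Therac-25 code.
# Each handler updates its own slice of state; nothing enforces that the
# slices stay consistent between events.

class BeamController:
    def __init__(self):
        self.mode = "electron"     # low-power mode
        self.flattener_in = False  # hardware position, moved asynchronously

    def on_mode_change(self, mode):
        # The mode flag flips immediately...
        self.mode = mode
        # ...but the flattener motor move arrives as a separate, later event.

    def on_flattener_moved(self, in_place):
        self.flattener_in = in_place

    def on_fire(self):
        # Checks state that two independent handlers are responsible for.
        if self.mode == "xray" and not self.flattener_in:
            return "UNSAFE: high-power beam with flattener out of position"
        return "ok"

c = BeamController()
c.on_mode_change("xray")  # operator switches modes...
result = c.on_fire()      # ...and fires before on_flattener_moved arrives
print(result)
```

In straight-line inlined code, the mode switch, the hardware move, and the safety check would execute in one fixed order; the event-driven version lets them interleave.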

~~~
userbinator
I see it more generally as the perils of unwarranted complexity. One of the
bugs was a _race condition_ that - I'm almost willing to bet - would not have
existed if they didn't try to be "overly clever" and incorporate a crude
approximation of a multitasking OS in their software.

It is mentioned almost in passing in the report - "Designs should be kept
simple" is a phrase in there - but I think this excessive complexity was
one of the biggest factors.

This Hoare quote is relevant: "There are two ways of constructing a software
design: One way is to make it so simple that there are obviously no
deficiencies, and the other way is to make it so complicated that there are no
obvious deficiencies."
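
For readers who haven't seen the report: the bug in question let fast operator edits slip past the software's consistency checks. A generic illustration of that *class* of bug (hypothetical code, using Python threads rather than whatever home-grown tasking they had):

```python
import threading
import time

# Illustrative only -- one "task" validates shared input while another
# mutates it, with nothing (no lock, no message passing) keeping them in step.

params = {"dose": 1}   # shared, unprotected state
validated = False

def validator():
    global validated
    if params["dose"] <= 100:   # check the value...
        time.sleep(0.01)        # ...the scheduler runs the editor task here...
        validated = True        # ...then mark it safe, on stale knowledge

def editor():
    time.sleep(0.005)
    params["dose"] = 10000      # edit lands between the check and the flag

t1 = threading.Thread(target=validator)
t2 = threading.Thread(target=editor)
t1.start(); t2.start()
t1.join(); t2.join()

print(validated, params["dose"])   # flag says "safe" for an unsafe value
```

Without the second task, the check-then-flag sequence is trivially correct; the multitasking is what creates the window.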

~~~
TickleSteve
> "incorporate a crude approximation of a multitasking OS in their software"

Many embedded systems have a 'crude' OS library... and in many cases this
makes them _far_ simpler than including an RTOS. Not having seen the code
here, I can't comment on this one, but just including a simple scheduler is
not necessarily a bad design decision.

With the other aspects of your "Keep It Simple" answer, I fully agree.

------
angersock
While this is pretty much _the_ Ur-example of faulty software design causing
human injury, the fact is that the entire system failed. Had the Therac-25 not
removed the hardware interlocks of the Therac-20, the accidents would've been
much less likely to occur.

I also think we should be careful not to draw too much caution from this
accident--the majority of software (EHR systems, apps, etc.) being developed
in the medical field today would not be served by the sort of scrutiny that
would've prevented this accident.

In fact, one could (and I will) make the argument that simply having faster
release cycles and better customer interfacing (instead of, say, custom
consulting work _cough_ Epic _cough_ ) would do more for quality than some
insanely rigorous pile of paperwork.

~~~
jacquesm
A thorough review of a software product is _not_ an insanely rigorous pile
of paperwork. I think I'm going to have to disagree with you about the kind of
caution we can draw from this incident. In fact, I think cases like these
should be mandatory study material for anybody who makes, or moves into
making, software for critical applications.

I've built some stuff controlling machinery that would amputate your arm in a
split second and 'faster release cycles' would have caused accidents, not
better quality.

Exhaustive testing, thorough review and extensive documentation of not only
the code but also the reasoning behind the code saved my ass more than once
from releasing something in production that would have likely caused at a
minimum a serious accident.

One of my rules for writing machinery controlling software is that _I_
determine when a new piece of software can be taken out of my hands to be
passed up the chain. The only time someone violated that rule, this happened:

It was around 6 pm when we finished working on the control software of a large
lathe, a Reiden machine with a 16' bed and a 2' chuck. I put the disks with
the new version on the edge of my desk for 'air' (machine otherwise not
powered up), 'wood' and 'aluminum' testing the next day. In simulation it all
looked good but it's easy to make mistakes.

When I walked back onto the shop floor the next morning it was deadly quiet.
My boss was sitting in his office upstairs and I asked him what was up. He'd
taken those disks to do a 'quick demonstration' for a prospect before I
arrived to show them a new feature (thread cutting, iirc). A subtle bug caused
the machine to start cutting with a feed of 10mm instead of 1mm; the stainless
steel he used for the demo got cut up into serrated carving knives spinning
out of the machine at very high speed. Amazingly, nobody got wounded or
killed, mostly due to the power of the Reiden (it never even stalled) and the
holding force of the chuck (which had to keep hold of the workpiece through
all that violence). The machine actually completed its cycle, and the customer
left 'most impressed' (and probably a few shades paler than they arrived...).
They bought the machine on the strength of the demo and some showmanship from
my boss, cheeky bastard; for all the same money there would have been a couple
of ambulances in front of the building that day.

After that, nobody ever tried to use any of the binaries until I had signed
off on them as 'safe for production'.

That mistake would have definitely been caught in the 'wood' testing phase and
a 'faster release cycle' would have missed it entirely since it _looked_ very
good right up to the moment where the cutting bit hit the metal.

Test protocols exist for a reason; skip them and you're playing with fire.
Faster release cycles are great for non-critical software.

~~~
angersock
That's an excellent story, and something to remember when working on automated
systems, especially industrial ones.

For something like, say, an automated surgery robot such as the da Vinci
Surgical System, or the Therac here, or an implantable insulin pump, it
absolutely makes sense to be super rigorous in testing.

For something that's basically just a big document database, though? Or a
glorified calculator? Or a graphing and charting app? Or a messaging app?

Hardly necessary.

In fact, the sort of testing and software rigor that makes sense for embedded
systems (like your lathe or the Therac machine here) is pretty much the worst
way possible to release one of the aforementioned systems on time and under
budget and useful enough to actually make people productive.

Adding more "rigor" to these applications would only serve as a barrier to
entry for folks trying to improve the industry. It wouldn't save lives and it
would only increase the power of the monopolies of existing players.

------
bjoveski
This might be a good guideline for the folks who don't want to read all of
the paper.

[http://web.mit.edu/6.033/www/assignments/rec-therac25.html](http://web.mit.edu/6.033/www/assignments/rec-therac25.html)

The Leveson paper is quite long, and not all parts are equally important:

  * Skim Sections 1 and 2. You should understand the basics of the Therac-25's design and how it was used. (You may also find this figure a helpful accompaniment to Figure 1 on page 4.)
  * Skim Sections 3.1-3.3, which detail a few of the Therac-25 incidents.
  * Read Sections 3.4 and 3.5. These detail a particular incident, the software bug that led to it, and the response to the bug. Pay close attention to 3.5.3, which describes the bug.
  * Skip Section 3.6. (It describes an additional incident and a different bug; feel free to read it if you are interested, though.)
  * Read Section 4 closely.

------
InclinedPlane
Similar problems still exist, turns out software is hard:
[https://medium.com/backchannel/how-technology-led-a-hospital-to-give-a-patient-38-times-his-dosage-ded7b3688558](https://medium.com/backchannel/how-technology-led-a-hospital-to-give-a-patient-38-times-his-dosage-ded7b3688558)

~~~
smarks
Nice article, thanks for posting it.

It's ironic that this article mentions the Toyota Production System as an
example of a safe and defect-free system. Another article about Toyota was
posted on HN today:

«Toyota's Unintended Acceleration and the Big Bowl of "Spaghetti" Code (2013)»

[http://www.safetyresearch.net/blog/articles/toyota-unintended-acceleration-and-big-bowl-%E2%80%9Cspaghetti%E2%80%9D-code](http://www.safetyresearch.net/blog/articles/toyota-unintended-acceleration-and-big-bowl-%E2%80%9Cspaghetti%E2%80%9D-code)

Apparently Toyota's software development doesn't follow the Toyota Production
System.

OK, that was a flip comment; it's pretty clear that TPS isn't suited to
software development. However, it does seem clear that Toyota's software
development practices are deficient.

------
mcroydon
Not to be self-serving, but I've always been fascinated by Therac-25. I ended
up doing a deep dive a few months back and put together a short 5ish minute
podcast episode about it:

[http://tinycast.in/2015/01/27/therac-25/](http://tinycast.in/2015/01/27/therac-25/)

I used this PDF as one of the primary resources and it was a fascinating read.

------
jonshariat
Hey all,

I actually cover this story in my book Tragic Design

([http://shop.oreilly.com/product/0636920038887.do](http://shop.oreilly.com/product/0636920038887.do))

In my opinion, it wasn't the software bug that was to blame but the bad user
interface design. When the error occurred that caused patients to get a direct
blast of 10x more radiation than they were supposed to get, the error was
caught and displayed. But because there were so many spurious errors, users
were used to bypassing them. I go into much more detail in the book, but I
thought I'd chime in here. What do you all think?
What do you all think?

------
theGimp
Interesting read, but please label PDFs. I hate clicking on them while on my
phone.

~~~
emmab
would be great if HN could just do this automatically

------
cheeze
Please add a PDF warning.

