
Worst Computer Bugs in History: Therac-25 (2017) - dangom
https://blog.bugsnag.com/bug-day-race-condition-therac-25/
======
topkai22
As terrible as it was, the fact that the Therac-25 remains one of the most
frequently cited examples of software engineering flaws hurting people is
somewhat encouraging for the profession. Three deaths is a tragedy, but the
Hyatt Regency walkway collapse a few years earlier was more than an order of
magnitude worse (114 people,
[https://en.m.wikipedia.org/wiki/Hyatt_Regency_walkway_collap...](https://en.m.wikipedia.org/wiki/Hyatt_Regency_walkway_collapse)),
and it too stemmed from a fairly subtle engineering failure.

IMO, harm from software bugs has (so far) been vastly surpassed by harm from
explicit choices in system design. The various emissions-cheating scandals
have almost certainly taken a real toll on human life, likely running into
the hundreds of lives. More subtly, the choices to retain data inappropriately
at Ashley Madison (probably) led directly to suicides and serious emotional
harm. Those are just the two recent examples that spring to mind for a
practicing developer, not an ethicist.

To oversimplify somewhat: when discussing engineering ethics, the harm from
software developers building things wrong is swamped by the harm from
building the wrong things.

~~~
simias
I think that's because for most applications where bodily harm is a
possibility, you generally (in my experience) have hardware protections that
will prevent the software from doing anything stupid. Take an elevator, for
instance: even if the software controller is buggy (or hacked) and decides it
should drop the cabin from the top floor to the ground level at full speed,
hardware protections (safety brakes, limitations on the motor itself, etc.)
will take over and make sure nobody gets hurt. Therefore, for something to go
completely wrong, you need both a software _and_ a hardware failure. The main
flaw in the Therac-25 was arguably that no such protection was present; the
hardware should have been designed to make the dangerous configuration
impossible to reach through software alone.

I think this is unfortunately going to change with the advent of "AI" and
related technologies, such as autonomous driving (we've already had a few
cases related to self-driving cars, after all). When the total set of
possible configurations becomes too great to exhaustively "whitelist", we
won't be able to have foolproof hardware designs anymore. In these
situations, software bugs can be absolutely devastating.

~~~
oxymoron
Your point rings true even in this case. There was another Therac machine
(the 20? It's been a while since I read about it) which had the same bug, but
where no one got hurt thanks to hardware safeguards.

~~~
triska
In my opinion, one of the most tragic aspects of these horrific incidents is
that the _predecessors_ of the Therac-25 actually had independent protective
circuits and other measures to ensure safe operations, which the Therac-25
lacked.

Here is a quote from
[http://sunnyday.mit.edu/papers/therac.pdf](http://sunnyday.mit.edu/papers/therac.pdf):

"In addition, the Therac-25 software has more responsibility for maintaining
safety than the software in the previous machines. The Therac-20 has
independent protective circuits for monitoring the electron-beam scanning plus
mechanical interlocks for policing the machine and ensuring safe operation.
The Therac-25 relies more on software for these functions. AECL took advantage
of the computer's abilities to control and monitor the hardware and decided
not to duplicate all the existing hardware safety mechanisms and interlocks."

So, regarding these important safety aspects, even the Therac-20 was better
than the Therac-25!

The linked post also mentions this:

"Preceding models used separate circuits to monitor radiation intensity, and
hardware interlocks to ensure that spreading magnets were correctly
positioned."

And indeed, the Therac-20 also had the same software error as the Therac-25!
However, quoting again from the paper:

"The software error is just a nuisance on the Therac-20 because this machine
has independent hardware protective circuits for monitoring the electron beam
scanning. The protective circuits do not allow the beam to turn on, so there
is no danger of radiation exposure to a patient."

~~~
ecpottinger
I have a friend with 40 years of programming experience who is building a
computer-controlled milling machine in his basement.

When I asked him about the limit switches, it turned out they are read by
software only, and the software will turn off power to the motor controllers
if a limit switch is activated.

I asked why he doesn't wire the switches to cut power directly, to be on the
safe side.

His answer: "It's too much bother to add the extra circuits."

We are talking less than $20 in parts and a day of his time. If the software
sends the controller a command to start moving the head at a certain speed
and then crashes, there is nothing to stop the machine from wrecking itself.

E.C.P.

~~~
therein
In a situation like that, I wouldn't blame him. Consider how many of these
situations he will come across while building his milling machine. If he had
to add a hardware failsafe for every one of them, it simply wouldn't scale.

3D printers are like this too. They have mechanical limit switches [0] that
are read only by software. So if there is a bug in the software, nothing
stops it from pushing past the hardware limits and breaking something. The
same goes the other way around: if the switch itself is broken, the same
thing might happen.

[0]
[https://i.ebayimg.com/images/g/EYAAAOSwbopZguz4/s-l300.jpg](https://i.ebayimg.com/images/g/EYAAAOSwbopZguz4/s-l300.jpg)

~~~
Matumio
Most 3D printers don't have massive printing heads. If they drive into the
end-stops, the motors will likely just skip steps and be stuck. They are not
designed to apply much force.

I'm much more worried about the heating element. Its temperature is usually
controlled by the same cpu that also does motion control and g-code parsing.
If anything locks up the CPU the heat might not be turned off in time, and
(because you also want fast startup) there is enough power available to melt
something. At the very least you would get nasty fumes from over-heated
plastics, and maybe even teflon tape, which often is part of the print head.
At worst it could start a fire.
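
To make the failure mode concrete, here is a minimal sketch (hypothetical
firmware, not any real printer's code) of why a lockup is dangerous when one
CPU does everything: the heater is commanded on, then the loop hangs, and
nothing independent ever turns it off.

    /* Illustrative sketch only: a hypothetical single-CPU printer
       firmware loop where thermal control shares the CPU with
       g-code parsing. */
    #include <stdbool.h>

    static double temp      = 25.0;   /* stand-in for a thermistor read */
    static bool   heater_on = false;

    static void set_heater(bool on) { heater_on = on; }

    static void parse_and_execute_gcode(void) {
        /* Imagine a parser bug that spins forever on malformed input.
           While it spins, the heater stays in its last commanded state. */
        for (;;) { /* locked up */ }
    }

    int main(void) {
        for (;;) {
            set_heater(temp < 210.0);   /* heater commanded ON...        */
            parse_and_execute_gcode();  /* ...and never turned off again */
        }
    }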

~~~
ioquatix
As a 3D printing enthusiast, I can confirm your fears. It's all in software,
and while there are good control systems, nothing's perfect. I had the heated
bed fail, and it was smoking when I found it.

------
userbinator
I read this article, and many years ago the full report, and one of the
omissions from the list of causes that stood out to me was overcomplexity: if
you read about the machine's possible functions, they really don't require
multiple threads, much less a full multitasking OS. None of these race
conditions would have occurred had it been a simple single-threaded embedded
controller.

To paraphrase an old Hoare quote, software can either be so simple it
obviously contains no bugs, or so complex that it contains no obvious bugs.

------
Malic
One of my favorite software horror stories is the $32 _billion_ overdraft at
the Bank of New York.

From "Computer-Related Risks" by Peter G. Neumann, published 1994 (REALLY
recommended reading):

"One of the most dramatic examples was the $32 billion overdraft experienced
by the Bank of New York (BoNY) as the result of the overflow of a 16-bit
counter that went unchecked. (Most of the other counters were 32-bits wide.)
BoNY was unable to process the incoming credits from security transfers, while
the New York Federal Reserve automatically debited BoNY's cash account. BoNY
had to borrow $24 billion to cover itself for 1 day (until the software was
fixed), the interest on which was about $5 million. Many customers were also
affected by the delayed transaction completions."

Additional reference:
[https://www.washingtonpost.com/archive/business/1985/12/13/c...](https://www.washingtonpost.com/archive/business/1985/12/13/computer-snarled-ny-bank/a707acbe-35bc-4a2e-bd80-180d131618c7/?noredirect=on&utm_term=.96220129dffd)

Granted, no one died because of this, but ... wow ... that was a bad day for
some developers somewhere.
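
The mechanics are easy to reproduce. A minimal sketch of the failure class
(the actual BoNY code was never published, so the names and numbers here are
illustrative): one counter left at 16 bits silently wraps while its 32-bit
siblings keep counting.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint16_t narrow = 0;   /* the one counter left at 16 bits */
        uint32_t wide   = 0;   /* like the other, 32-bit counters */

        for (uint32_t i = 0; i < 70000; i++) {
            narrow++;          /* wraps silently at 65536 */
            wide++;
        }

        printf("16-bit counter: %u\n", (unsigned)narrow); /* 4464  */
        printf("32-bit counter: %u\n", wide);             /* 70000 */
        return 0;
    }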

~~~
jobigoud
Imagine being called in: "OK guys, we don't know what the problem is, but
it's costing the company $3,500 per minute in interest alone for as long as
the bug stays unfixed. No pressure."

~~~
joering2
Also, a follow-up question should be asked: was the engineer who stopped the
bleeding and fixed the bug rewarded to some reasonable degree?

~~~
dasil003
No, because it was also caused by an engineer, so this could create perverse
incentives.

~~~
sidlls
How is this different from rewarding a salesman who rescues a sale that a
different salesman had botched?

~~~
dasil003
Because you can't purposefully botch a sale in order to later recover it. Also
because you can't avoid botching sales by being more conservative or adding
more process. In short, sales and engineering have basically nothing in
common.

~~~
sidlls
I don't think GP was talking about a scenario where the same engineer who
created the bug fixed it and gets rewarded, rather one where a different
engineer fixes it. Of course it wouldn't make sense as you describe it.

~~~
dasil003
How do you decide objectively who is responsible for every single bug? The
whole thing is ripe for abuse from all sides. You need a blameless culture to
have good engineering, not a bounty-based one.

------
classichasclass
Therac is one of the reasons I get nervous about "health hacking." Yes,
people can verifiably benefit from some of the advancements made in this
movement, like the DIY diabetic insulin pump, and yes, I prefer to see such
advancements be open source rather than locked up in proprietary designs and
trade secrets. And there is probably room in health regulation for trimming
red tape anyway, even for innovations originating from the commercial sector.

On the other hand, when corners are cut (no hardware interlocks, for example)
and edge cases aren't considered, even innocently, you get things like this.
Doing the extra engineering makes products more expensive to design and more
costly to buy and maintain. It is certainly a barrier to entry, too. But do
we want another case like this because people said "this is good enough"?

~~~
jacquesm
I posted a link about an insulin pump that could be hacked remotely. Even if
it isn't 'health hacking', buying a product from what you thought was a
reputable vendor gives you no guarantee that it will be secure and bug-free.

~~~
matheusmoreira
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3262727/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3262727/)

> An unauthorized third party can interfere with pump communication and
> undermine patient safety

> we confirmed this through laboratory experiments by sending commands to an
> insulin pump using an unauthorized remote programmer at a distance of 100 ft

> Thus, the specifically identified issues are a security breach that could
> result in:

> (1) changing already-issued wireless pump commands;

> (2) generating unauthorized wireless pump commands;

> (3) remotely changing the software or settings on the device;

> (4) denying communication with the pump device.

People can also attack the blood glucose monitors and the data they report to
the pump system.

Scary.

------
hedora
The crazy thing about this classic story is that the industry has learned
nothing from it: The lethal bugs were all in the frontend UI code.

Today, companies build equally important UI logic in JS frameworks that target
rapid prototyping and consumer-focused startups.

~~~
pjmlp
Lawful punishment of bad quality software needs to be a thing, just like in
other industries.

Only then will most companies actually start to care about software quality in
their development processes.

~~~
zzzcpan
I don't think it works, at least not within the current legal system, where
it would become mostly about the legal bureaucracy of avoiding responsibility
rather than a true focus on reliability.

~~~
pjmlp
Sure it does; it is no different from when a company delivers spoiled goods,
or when one returns a product to a shop because it does not work as described
on the box.

The root problem is that society got used to turning things off and on and
hoping for the best, instead of going back to the shop and asking for their
money back.

Also, every time a bunch of black-hat hackers exposes company internal data,
if the security breach can be mapped to a CVE database entry, a good law firm
could probably make something of it.

Not all jurisdictions are alike, but one needs to start somewhere.

~~~
anonuser123456
Probably not. As a company, you just disclaim liability in your terms of
service.

Jurisdictions that try to override this simply get excluded from the customer
base.

The market is still the ultimate decider for quality; if you build a crappy
product, expect to get innovated out.

~~~
pjmlp
>Probably not. As a company you just disclaim liability in your terms of
service.

Thankfully EULAs are void in Europe.

It is all a matter of how big the customer base gets; I am hoping we
eventually get something like that EU-wide.

> The market is still the ultimate decider for quality; if you build a crappy
> product, expect to get innovated out.

If that were true, 1 € shops wouldn't exist, but even those products get more
testing than most software out there.

~~~
anonuser123456
> Thankfully EULAs are void in Europe.

It's not so clear-cut that you should be thankful. The ability of companies
to dictate the terms under which users can use their software affects their
risk calculation to produce the product in the first place. It is very likely
that useful but imperfect software will not be written because the
risk/reward balance is tilted.

Remember, you always have the ability to reject a EULA: simply don't use the
product.

> If that was true 1 € shops wouldn't exist, but even those products have more
> testing than most software out there.

Consumers can make value choices on quality vs cost. This is a basic market
function.

~~~
paulie_a
On the flip side, as a customer I've found just ignoring all EULAs to be
effective. They are meaningless in my opinion, and I don't give a crap what
they say. I'll use the software as I want.

------
dangom
According to the Wikipedia entry on the Therac-25, it was "in response to
incidents like those" that the IEC 62304 standard was created, introducing
"development life cycle standards for medical device software and specific
guidance on using software of unknown pedigree".

For those working in safety and quality control of medical systems: how much
does compliance with those specifications actually diminish the chances of
another Therac-25 incident?

Considering that automation continues to increase, from automatic patient
table positioning to AI-assisted diagnosis, are there new challenges in
designing medical systems so that they stay safe and maintainable? How likely
are the FDA or equivalent agencies around the globe to authorize the use of
open source systems?

~~~
pjmlp
> How likely is it for the FDA or the equivalent agencies around the globe to
> authorize the use of open source systems?

Actually, they already authorize stuff like Qt.

Computer systems where human lives are put at risk belong to what is called
High Integrity Computing.

There are very strict coding standards, under which even C looks more like
Ada than typical C.

[https://ldra.com/medical/](https://ldra.com/medical/)

[https://www.qt.io/qt-in-medical/](https://www.qt.io/qt-in-medical/)

[https://www.vectorcast.com/testing-solutions/software-testin...](https://www.vectorcast.com/testing-solutions/software-testing-embedded-medical-devices-fda-iec-62304)

Source code availability is not an issue, because providing it is part of the
certification process.

The problem is having the money to pay for a certification, which becomes
invalid the moment anything changes: the compiler being used, the source
code, or any of the third-party dependencies.
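
For a flavor of what C-that-looks-like-Ada means in practice, here is a small
sketch in the spirit of high-integrity coding rules (my illustration, not
taken from any particular standard document): fixed-width types, no dynamic
allocation, every input range-checked, and a single exit point.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define DOSE_TABLE_SIZE 8U

    static const uint16_t dose_limits[DOSE_TABLE_SIZE] = {
        100U, 200U, 300U, 400U, 500U, 600U, 700U, 800U
    };

    /* Failure is reported explicitly rather than trusting the caller;
       there is exactly one return statement. */
    static bool lookup_dose_limit(uint8_t index, uint16_t *out)
    {
        bool ok = false;

        if ((out != NULL) && (index < DOSE_TABLE_SIZE)) {
            *out = dose_limits[index];
            ok = true;
        }

        return ok;
    }

    int main(void)
    {
        uint16_t limit = 0U;
        return lookup_dose_limit(3U, &limit) ? 0 : 1;
    }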

------
jibal
In 1973 I programmed for the oncology department at L.A. County/USC Medical
Center. We had a Varian Clinac linear accelerator with computer-readable and
-drivable motors. The clinicians would manually position the Clinac (with the
patient on it) for the first treatment, and the position would be saved in the
patient's computerized file and restored on subsequent treatments.

For some treatments, a metal wedge would be placed within the beam to
attenuate it more at the thick end of the wedge. Because of the non-linear
attenuation along the length of the physical metal wedge, dosages were
difficult to calculate.

Someone got the bright idea of creating a software wedge by slowly moving the
treatment couch at the same time as closing the beam aperture, so that there
would be 100% exposure at one end of the "wedge" and 0% at the other, with a
linearly decreasing distribution across the whole wedge.
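
The geometry is easy to check numerically. A toy simulation (my
reconstruction of the idea, not the original 1973 code): sweep the closing
aperture edge across the couch at constant speed and accumulate beam-on time
at each position; the accumulated exposure ramps linearly from ~0% at one end
to 100% at the other.

    #include <stdio.h>

    #define POSITIONS 11
    #define STEPS     1000

    int main(void) {
        double exposure[POSITIONS] = {0};

        for (int t = 0; t < STEPS; t++) {
            /* The aperture edge sweeps from one end of the couch to the
               other; points the edge has not yet passed are still in the
               beam. */
            double edge = (double)t / (STEPS - 1) * (POSITIONS - 1);
            for (int x = 0; x < POSITIONS; x++) {
                if ((double)x >= edge) {
                    exposure[x] += 1.0;
                }
            }
        }

        for (int x = 0; x < POSITIONS; x++) {
            printf("pos %2d: %5.1f%%\n", x, 100.0 * exposure[x] / STEPS);
        }
        return 0;
    }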

I was the programmer for this project, and we had just started testing it with
a sheet of X-ray film on the couch when I received an offer I couldn't refuse
to go work elsewhere.

I'm glad that I departed before they started using this on live patients.

------
LeoPanthera
How horrible it must have been for the operator, to realize they had killed
two patients, through no fault of their own.

~~~
jsjohnst
Honestly, I disagree slightly. Reading the article, as well as the original
report years ago, I wasn't left with the feeling that the operator bore "no
fault of their own". Are they entirely to blame? No, but the operator
certainly made mistakes. For example, assuming an error is innocuous when you
are intentionally delivering radiation to a person is careless at best.
Again, the machine is solely at fault, but that doesn't mean the operator
didn't play a role in the deaths.

~~~
nv-vn
Yeah, I took away the same thing. As an example, in the aviation industry
something like this would simply not be tolerated. When you are operating a
potentially dangerous device, you have to do so with the utmost care. This
isn't to say the technician should be punished, but one of the results of this
investigation should have been a focus on making technicians aware of how
disastrous the consequences could be if they don't respond appropriately to an
error.

~~~
hexane360
Really?

I don't think it's reasonable to expect nurses to wait an undocumented 8
seconds after changing modes to avoid a race condition. That goes far past
"utmost care". Are pilots expected to never overlap command inputs? Are they
allowed to engage the flaps and then activate the spoilers before the flaps
are fully deployed?

I'm basing my account on this report as well as the OP:
[https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...](https://hackaday.com/2015/10/26/killed-by-a-machine-the-therac-25/)

~~~
AnssiH
I'm pretty sure the parent only meant that the "Malfunction 54" error should
not have been ignored, not that the operator should have somehow avoided the
race condition in the first place.

~~~
Jtsummers
The operators had become conditioned to ignore those error/warning statements
due to their pervasiveness and apparent lack of consequence. This is why, as a
designer, you should use such warnings sparingly so that the operator/user
doesn’t become “blind” to them.

~~~
hexane360
To add to this, the more specific and informative an error message is, the
more authoritative it appears. "Malfunction 54" is nowhere near as good as
"Unable to set therapy mode".

~~~
Jtsummers
Yep. A good book on this sort of thing is _Tragic Design_ [0]. I was in a
pilot "Software Safety" course last year; this book was published _just_
after it ended, but it was an excellent companion text for the course. I've
been meaning to follow up with the organizers to see what became of that work
(my employer at the time was considering making the pilot course mandatory or
highly recommended for most of its software engineers and designers).

[0] [https://www.tragicdesign.com](https://www.tragicdesign.com)

------
matheusmoreira
> users will ignore cryptic error messages, particularly if they occur often

It's not just cryptic error messages. Pretty much anything that requires the
attention of people will end up being ignored eventually. For example:

[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4894506/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4894506/)

I've also read about an anesthesiologist who turned off the alarms because
they annoyed him. One day he failed to secure the endotracheal tube during
surgery; it came off and nobody noticed. The result was cardiac arrest, brain
damage, multiple organ failure, sepsis, and death.

Monitoring hardware is very sensitive, so it will fire off alarms if
_anything_ changes, no matter how small. The more sensitive a test is, the
more false positives you get. This is extremely demanding of a health care
professional's attention, which in practice is multiplexed across countless
patients.

Up to 99% of these alarms and messages do nothing but get in people's way.
They represent false positives, disconnected cables, and other minor failures
that pose no real danger and can be easily fixed. People get used to the
alarms and learn to ignore them.
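
The base-rate arithmetic behind that is worth seeing once. With made-up but
plausible numbers (an alarm that is 99% sensitive and 95% specific, watching
for an event present in 0.1% of readings), almost every alarm is false:

    #include <stdio.h>

    int main(void) {
        double prevalence  = 0.001;  /* true emergencies per reading */
        double sensitivity = 0.99;   /* alarm fires given emergency  */
        double specificity = 0.95;   /* stays silent given no event  */

        double true_alarms  = prevalence * sensitivity;
        double false_alarms = (1.0 - prevalence) * (1.0 - specificity);
        double ppv = true_alarms / (true_alarms + false_alarms);

        /* ~1.9%: more than 98% of alarms are false positives */
        printf("share of alarms that are real: %.1f%%\n", 100.0 * ppv);
        return 0;
    }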

~~~
jibal
> Pretty much anything that requires the attention of people will end up being
> ignored eventually.

Perhaps, but your example is not supporting evidence. The PACU alarms were
muted, precisely because they were so hard to ignore.

> Monitoring hardware is very sensitive so it will fire off alarms if anything
> changes, no matter how small. The more sensitive a test is, the more false
> positives you get.

This is fixable. The problem is the same as in the Therac-25 case: severe and
inconsequential alerts are indistinguishable.

Here's a particularly enjoyable piece of literature crafted around an
instance of alarm fatigue:
[https://gutenberg.ca/ebooks/smithcordwainer-deadladyofclownt...](https://gutenberg.ca/ebooks/smithcordwainer-deadladyofclowntown/smithcordwainer-deadladyofclowntown-00-h.html)

------
binbag
A lot of comments here seem to assume that a hardware failsafe to back up the
software failsafes is the key thing. In fact, it doesn't have to be hardware;
hardware can fail too. The keys are 1) redundancy and 2) having multiple,
different failure modes. Adding a hardware failsafe gives you both of these,
but you get the same level of safety by introducing any second failsafe that
uses a different method, including a separate software-based technique, as
long as its failure mode is dissimilar to and uncorrelated with the first.
The best approach is to combine multiple methods based on completely
different technologies.
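
A sketch of what dissimilar software redundancy can look like (illustrative
only, with made-up limits): two limit checks computed from independent inputs
by different methods, where motion is permitted only if both agree, so one
bug or sensor fault is unlikely to defeat both.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Check 1: position from the encoder. */
    static bool within_limits_encoder(int32_t encoder_counts) {
        return (encoder_counts > -40000) && (encoder_counts < 40000);
    }

    /* Check 2: position dead-reckoned from commanded velocity and time,
       sharing no input or code path with the encoder check. */
    static bool within_limits_deadreckoned(double velocity_mm_s,
                                           double elapsed_s) {
        double estimated_mm = velocity_mm_s * elapsed_s;
        return (estimated_mm > -200.0) && (estimated_mm < 200.0);
    }

    /* Fail safe: any disagreement halts motion. */
    static bool motion_permitted(int32_t enc, double v, double t) {
        return within_limits_encoder(enc) &&
               within_limits_deadreckoned(v, t);
    }

    int main(void) {
        /* e.g. 10 mm/s commanded for 10 s, encoder near mid-travel */
        printf("permitted: %d\n", motion_permitted(1000, 10.0, 10.0));
        return 0;
    }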

------
IronBacon
I would guess the other infamous one is the Patriot missile timing bug.

------
tfolbrecht
A floating-point bug [0] in the Patriot missile defense system killed 28
Americans and injured nearly 100 more when an Iraqi Scud wasn't successfully
intercepted during the Gulf War.

[0]
[https://www.cs.drexel.edu/~introcs/Fa10/notes/07.1_FloatingP...](https://www.cs.drexel.edu/~introcs/Fa10/notes/07.1_FloatingPoint/Patriot.html?CurrentSlide=10)
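
The arithmetic is small enough to reproduce. A sketch using the commonly
cited figures (my reconstruction, not the system's actual code): the clock
counted tenths of a second as a truncated 24-bit binary fraction, and the
truncation error compounds with uptime.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* 0.1 has no exact binary representation; the Patriot stored it
           truncated to a 24-bit fixed-point fraction. */
        const double scale       = 16777216.0;  /* 2^24 */
        const double tenth_24bit = floor(0.1 * scale) / scale;

        const double ticks = 100.0 * 3600.0 * 10.0; /* 100 h of 0.1 s ticks */
        const double drift = ticks * (0.1 - tenth_24bit);

        printf("clock drift after 100 h: %.4f s\n", drift);   /* ~0.34 s */
        printf("range error at 1676 m/s: %.0f m\n",
               drift * 1676.0);                               /* ~575 m  */
        return 0;
    }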

------
dev_dull
> _The software consisted of several routines running concurrently. Both the
> Data Entry and Keyboard Handler routines shared a single variable, which
> recorded whether the technician had completed entering commands._

Raise your hand if you've made this type of mistake many times in the past.
Most of us have the luxury of not having our software bugs affect human lives.

It's a shame they removed the hardware safety controls. I don't think I'd
even feel comfortable programming a tool this powerful without those circuit
breakers.
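
For anyone who hasn't been bitten yet, a minimal sketch of the pattern the
quote describes (illustrative C with POSIX threads, not the Therac-25's
actual PDP-11 assembly): two tasks share a completion flag with no
synchronization, so the consumer can act on stale state.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared state with no lock or atomics: a data race. */
    static int entry_complete = 0;
    static int mode_setting   = 0;

    static void *keyboard_handler(void *arg) {
        (void)arg;
        mode_setting   = 2;  /* the operator's last-second edit...   */
        entry_complete = 1;  /* ...which the other task may observe
                                out of order */
        return NULL;
    }

    static void *treatment_task(void *arg) {
        (void)arg;
        while (!entry_complete) { }  /* unsynchronized spin */
        printf("firing with mode %d\n", mode_setting); /* may see stale 0 */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, treatment_task, NULL);
        pthread_create(&b, NULL, keyboard_handler, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }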

------
yc-kraln
The Therac-25 is part of the core curriculum in computer engineering, but I
wonder if it was actually (in the grand scheme of things) that bad of an
incident. Compared with Facebook fomenting ethnic cleansing in Asia, the
number of people who were hurt or died was very limited. Are there any newer
examples that show the dangers of a widely distributed, connected horror?

~~~
fpgaminer
I'm not familiar with the Facebook incident you are referring to, but it
doesn't sound like something caused by a bug? The article is about bugs,
i.e., unintentional tragedies. If what you're referring to is not the result
of a bug, then it's off-topic (though perhaps not unimportant).

------
dangom
A descriptive video can be found here:
[https://www.youtube.com/watch?v=uEvu2PlDhO0](https://www.youtube.com/watch?v=uEvu2PlDhO0)

~~~
ysleepy
That video is terrible; read the article instead if possible.

~~~
dangom
Here's a link to (an updated version of) the original accident report:
[http://sunnyday.mit.edu/papers/therac.pdf](http://sunnyday.mit.edu/papers/therac.pdf)

