
Therac-25: When software reliability really does matter - almost
http://en.wikipedia.org/wiki/Therac-25
======
jrockway
The industry learned its lesson. Instead of continuing to use unsafe languages
like C, they switched to safe, statically-typed languages. The tiny speed hit
of checking array bounds before each access turned out not to be a big deal,
and the improved programmer productivity more than made up for
the need for slightly-more-expensive hardware. And, of course, the number of
lives saved was more valuable than a few extra microseconds of execution time
and the larger EEPROM needed to store the smarter runtime.

Oh sorry, that was me daydreaming. Of course nothing has changed, people just
"try to be more careful". Since people never make mistakes, even when pressed
with tight deadlines, this has worked out fine. Asking the computer to double-
check your code is just not necessary.

(Hmm, I think I got that backwards. It's just not my day today, is it...)

~~~
gamble
The Therac-25 was much more of a management and engineering failure than a
technical problem, though. While the immediate cause of the deaths was a race
condition in the software, it was only capable of causing harm because the
hardware safety mechanism had been removed as a cost-saving measure, without
proper verification that the software was capable of doing the same job.
Worse, it was widely known among the Therac's users that the hardware safety
would regularly trip during normal operation without apparent cause, yet the
problem was never investigated until after the deaths.

Ultimately, the Therac was a failure of software engineering more than a
software failure. I don't blame the technology any more than I would blame a
girder for buckling because a careless structural engineer didn't realize it
was overloaded.

~~~
jrockway
If the software had worked, it wouldn't have mattered that the hardware safety
was removed. I'm not saying to design without hardware safety mechanisms, of
course, but it's nice to know that you'll never need them.

~~~
gamble
The problem is that no one bothered to check whether the software worked when
they removed the hardware. It's not a good thing that there was a bug in the
first place, but proper procedures could easily have caught it before people
were killed. That's why you have multiple systems to prevent catastrophic
failures.

What makes the Therac-25 notable is the novelty of a software problem killing
people. If it had been a second hardware safety mechanism that had been
removed, it wouldn't be as notorious - but the engineering failure would be
equivalent.

------
jonjacky
News of these accidents motivated some of our approach on a more recent
radiation therapy machine project. Here's a short paper I wrote about it.

<http://staff.washington.edu/jon/cnts/iccr.html>

The software we wrote has been used without incident for more than ten years.

------
grinich
Here's the infamous paper by Nancy Leveson:
<http://courses.cs.vt.edu/cs3604/lib/Therac_25/Therac_1.html>

(original PDF if you have access to the IEEE:
<http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=274940>)

And an updated PDF: <http://sunnyday.mit.edu/papers/therac.pdf>

It's one of the first papers read in 6.033 (Computer Systems Engineering) at
MIT, which is actually taught by rtm.

------
URSpider94
I agree that good software engineering practices are hugely important.
However, there are also supposed to be additional failsafes in the system,
namely trained operators and supervisory personnel, many of whom are falling
down on the job as well. Even worse, in many cases, the government does not
compel public disclosure of accidents, which certainly hinders investigation
of recurring failure modes.

For more, see <http://www.nytimes.com/2010/01/27/us/27radiation.html>

------
nirmal
The Therac-25 is one of the first things Georgia Tech CS students study in our
CS Ethics course. If you're interested in more such events see
<http://measure.cc.gt.atl.ga.us/classes/cs4001_fall_2009/>

------
RK
These kinds of problems are still occurring:

<http://news.ycombinator.com/item?id=887406> (Oct. 2009)

------
radiationboy
Ho boy, I am due to start radiation treatment next week.

------
kkowalcz
That's exactly why you shouldn't employ "good enough" programmers - always try
to recruit the best you can find. It takes time, but tools like
<http://codility.com> might help screen out weak candidates.

~~~
Silhouette
Are you connected with Codility in any way?

~~~
Silhouette
I just took their sample test to see what it was like.

I was disappointed that they disallow the use of external libraries, but then
criticise C++ code that is vulnerable to the obvious integer overflow. In a
real interview, I would expect a good candidate to indicate that they would
pull in a bignum library to avoid this, but there is no way to say that in
this sort of test. Likewise, depending on the exact specifications, you might
consider using a 64-bit integer type to hold the sum, but that's not portable
yet. Presumably, if you're limited to standard C++ (which has no bignum
library) and required to pass the overflow test to get a top score, the only
way to do this is to reimplement the relevant parts of a bignum type. Of
course, that is exactly the _wrong_ thing to do if you have to solve this sort
of problem in reality.

It is particularly unfortunate that they would criticise a candidate's code
for this, while at the same time making a basic mistake themselves in the
specification for the equi function. (If, as the problem states, there may be
many elements in the input, then the correct return type for an index into a
std::vector<int> would be a std::vector<int>::size_type, not an int. A
size_type is always unsigned, so returning -1 in the event of finding no match
is a bad idea.)

