One of the things the author fails to mention in the tweet summary is that the same software was in use on a previous version of the machine and had absolutely no problems.

I believe (from memory) the previous version had hardware interlocks that masked the issue, and the Therac-25 did not have those hardware interlocks installed. This led to a situation where the software was viewed as heavily tested and therefore trusted, even though it shouldn't have been.

I've always seen this as an example of why physical/hardware interlocks are really important when you're mixing software with hardware that can easily hurt people.

I'm also always amazed by how few people seem to know about the Therac-25 incident, especially people that work in therapeutic radiation roles (in the UK anyway).




> I've always seen this as an example of why physical/hardware interlocks are really important when you're mixing software with hardware that can easily hurt people.

Not only that, but running into the interlock should be considered a notable event. The machine should not just continue operating as normal; it should be clear to the operators that something potentially dangerous has occurred and should be investigated.

It seems like they had just assumed that because no one had managed to zap someone with the previous models, the software must have been perfect, even though the previous models had hardware interlocks preventing the dangerous scenario. Those interlocks had presumably been tripped many times; it's just that no one ever brought it to the attention of the vendor.

If a system trips a safety interlock it should fail to a safe configuration and remain there until reset by someone capable of investigating why it was tripped in the first place.

Modern traffic lights are a good example of doing it right. In those cabinets you see at every intersection, right next to the traffic light controller will be a device called a conflict monitor. This device will be wired to the circuits feeding the light heads themselves. If two conflicting movements are indicated for whatever reason, be it a failure of the controller, a short in the wiring, etc., the conflict monitor will trip and set the intersection to a fail-safe mode (usually either all-red blink, or yellow blink for the main road with red elsewhere) until manually reset by a human.
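To make that "latch until a human investigates" behaviour concrete, here's a minimal Python sketch (purely illustrative; the movement names, helpers, and states are made up, not any real controller's logic): once the monitor trips it keeps reporting the fail-safe state regardless of its inputs, until someone deliberately resets it.

    # Illustrative only: a latching fail-safe monitor in miniature.
    ALLOWED = {frozenset(), frozenset({"NS"}), frozenset({"EW"})}  # movement sets that may be green together

    def log_event(msg, detail):
        print(f"EVENT: {msg}: {detail}")  # stand-in for a persistent, auditable event log

    class ConflictMonitor:
        def __init__(self):
            self.tripped = False

        def check(self, greens):
            """greens: the set of movements currently showing green."""
            if not self.tripped and frozenset(greens) not in ALLOWED:
                self.tripped = True                    # latch the fault
                log_event("conflict detected", greens)
            return "FAIL_SAFE" if self.tripped else "NORMAL"  # FAIL_SAFE = e.g. all-red flash

        def manual_reset(self, technician_id):
            # Only a deliberate, attributable action clears the latch.
            log_event("reset after investigation", technician_id)
            self.tripped = False

    monitor = ConflictMonitor()
    print(monitor.check({"NS", "EW"}))  # FAIL_SAFE: conflicting greens detected
    print(monitor.check({"NS"}))        # still FAIL_SAFE: latched until manual_reset()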

---

> I'm also always amazed by how few people seem to know about the Therac-25 incident, especially people that work in therapeutic radiation roles (in the UK anyway).

That's interesting; at least amongst my "techie" friends it's common knowledge. Many of us who went to college for computer science related things had it used in one class or another to get across the point that bad software can kill people in really unexpected ways.

I guess maybe the medical side of things doesn't find it worth as much attention because they don't have as much to learn from it.


We have physical interlocks all the time and they make intuitive sense to us.

If you park your car and crank the steering wheel all the way to one side, it won't start unless you put it back. If you have an automatic transmission, it won't start unless you push the brake when you turn the key. If you have a manual, it won't start unless you push the clutch in. (This didn't use to be true. Citing Ferris Bueller.)

Lots of stuff goes wrong when the hardware people depend on the software people for correctness and the software people depend on the hardware people for correctness. Insert <Group A> and <Group B> for hardware and software. Could easily be writers and editors in journalism.


> I've always seen this as an example of why physical/hardware interlocks are really important when you're mixing software with hardware that can easily hurt people.

Agreed. I was a developer on a medical instrument where the decision to implement a hardware interlock was made only after we were years into development and months away from product release. Initially, the belief was that we could get away with software-monitored interlocks, but then the lead HW engineer realized it would be an uphill battle to get UL to certify the machine as safe. UL & OSHA want to see "hard interlocks": systems where opening an interlock immediately cuts power and eliminates the hazard without anything else in the control chain.

The problem was that the easiest way for the hardware engineers to implement this (cutting power to all the controllers) was also the hardest way to manage in software.

At any given moment, with dozens of motors and actuators processing commands, all I/O controllers could have their power cut when an operator opened the cover. This meant retrofitting graceful failure tolerance (previously our code expected that these events would mean a hard fault and a system shutdown) onto every single I/O command, resulting in changes to thousands of lines of code and months of work to implement and review.

In the end it was the right thing to do: the machine was rendered safe as soon as an operator opened the cover, while the system software kept the unaffected parts of the machine running as best it could. As wolrah suggests in the sibling comment, this was an exceptional event: the only way to recover was to acknowledge the fault and put the instrument through a restart procedure.
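To give a flavour of what that retrofit meant, here's a hypothetical Python sketch (not the instrument's actual code or APIs; the controller object and names are invented): every I/O command has to treat "controller lost power because the cover opened" as an expected, recoverable outcome rather than a fatal fault.

    # Hypothetical sketch: an I/O command that treats a power cut from the
    # hardware interlock as an expected outcome instead of a hard fault.

    class ControllerPowerLost(Exception):
        """Raised when a controller drops off the bus because the interlock cut its power."""

    def send_command(controller, command):
        try:
            controller.execute(command)
            return "OK"
        except ControllerPowerLost:
            # The cover is open; the hardware has already made the machine safe.
            # Mark this axis as needing re-homing, keep the rest of the system
            # alive, and require an operator-acknowledged restart to recover.
            controller.needs_rehoming = True
            return "INTERLOCK_OPEN"

    class FakeAxisController:
        def __init__(self):
            self.needs_rehoming = False
        def execute(self, command):
            raise ControllerPowerLost()  # simulate the cover opening mid-command

    print(send_command(FakeAxisController(), "move_to(120)"))  # INTERLOCK_OPEN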



The Therac-25 is a classic example taught in software engineering and usability engineering courses. I use it as an example in one of my undergraduate lectures.

I think you are right that I’ve never heard about it in any other contexts though.


What? No, the Therac-20 had exactly the same bug, and it was triggered just as frequently; however, it had a hardware guard that prevented the bug from killing people.



