All revenge does is halt progress on new technology, and cause all involved to cover up problems rather than expose and fix them.
> He was released in November 2007, because his mental condition was not sufficiently considered in the initial sentence. In January 2008, he was appointed deputy construction minister of North Ossetia. In 2016, Kaloyev was awarded the highest state medal by the government, the medal "To the Glory of Ossetia". The medal is awarded for the highest achievements, improving the living conditions of the inhabitants of the region, for educating the younger generation and maintaining law and order.
But that’s not the same thing as not knowing exactly who was present or involved when things happened, nor is it the same as not obtaining their assistance and testimony in reconstructing what happened.
This person should have been tracked down, 100%. Then having tracked them down, the report should avoid blaming them. It might even be useful to anonymize them after obtaining their testimony, to avoid harassment.
But if we’re declining to interview them just so we can avoid blaming them, we have a very, very big meta-problem that needs to be solved.
However, it seems notable that they couldn't find the one person who actually knew the details of this situation. I suppose that in that era people could just move to a new place and be extremely hard to track down, but that ability to 'move away' and be hard to track down could also have been exploited to steer toward a more favorable settlement. If this anonymous programmer were to testify that 'management was aware' of potential problems in the software but insisted on shipping it anyway, it is the difference between negligence and gross negligence. The potential punitive damages would be dramatically increased in the latter case, and it would be 'cheap' to pay someone to hide to avoid this.
- Wishful cost and schedule "estimates" as a precondition for project approval.
- Feigning of current technical expertise as a job requirement for management.
- Hiding of real issues by lower management, for fear of a "shoot the messenger" reaction by upper management.
- Regulatory capture.
- The allure of the "plausible deniability" defense, ever-rising with ascending level of management.
The highest levels of management involved are of course boards of directors, and national legislators. And when was the last time any of us heard a "mea culpa" from either of those?
And everyone, not just the best people, knows this. Which is why, in a blame-oriented culture, everyone will just sit back and watch a colleague walk off a cliff. If you're not already involved, then getting involved risks painting a target on your back.
I believe (from memory) the previous version had hardware interlocks that masked the issue, and the Therac-25 did not have the hardware interlocks installed. This led to a situation where the software was viewed as heavily tested and therefore trusted, even though it shouldn't have been.
I've always seen this as an example of why physical/hardware interlocks are really important when you're mixing software with hardware that can easily hurt people.
I'm also always amazed by how few people seem to know about the Therac-25 incident, especially people that work in therapeutic radiation roles (in the UK anyway).
Not only that, but running into the interlock should be considered a notable event. The machine should not just continue operating as normal; it should be clear to the operators that something potentially dangerous has occurred and should be investigated.
It seems like they had just assumed that because no one had managed to zap someone with the previous models, the software must have been perfect, even though the previous models had hardware interlocks preventing the dangerous scenario. Those interlocks had presumably been tripped many times; just no one ever brought it to the attention of the vendor.
If a system trips a safety interlock it should fail to a safe configuration and remain there until reset by someone capable of investigating why it was tripped in the first place.
Modern traffic lights are a good example of doing it right. In those cabinets you see at every intersection, right next to the traffic light controller will be a device called a conflict monitor. This device will be wired to the circuits feeding the light heads themselves. If two conflicting movements are indicated for whatever reason, be it a failure of the controller, a short in the wiring, etc, the conflict monitor will trip and set the intersection to a fail-safe mode (usually either all-red blink or yellow blink for a main road with reds elsewhere) until manually reset by a human.
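To make that "latch until manually reset" behavior concrete, here's a minimal sketch in C. It's purely illustrative: the names (fault_latched, enter_failsafe, and so on) are invented, not taken from any real conflict monitor firmware.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical fail-safe stub: in a real cabinet this would command
     * the all-red or yellow blink pattern. */
    void enter_failsafe(void) { puts("entering fail-safe blink mode"); }

    static bool fault_latched = false;

    /* Called whenever the monitor samples the light-head circuits. */
    void check_interlock(bool conflict_detected) {
        if (conflict_detected) {
            fault_latched = true;  /* latch: a transient glitch must not self-clear */
            enter_failsafe();
        }
    }

    /* Normal operation is gated on the latch, not the momentary fault
     * signal: the fault may have passed, but the latch has not. */
    bool allow_normal_operation(void) {
        return !fault_latched;
    }

    /* Only a technician, after investigating the trip, clears the latch. */
    void manual_reset(void) {
        fault_latched = false;
    }

    int main(void) {
        check_interlock(true);  /* simulate conflicting greens */
        printf("normal operation allowed: %d\n", allow_normal_operation()); /* 0 */
        manual_reset();
        printf("after manual reset:       %d\n", allow_normal_operation()); /* 1 */
        return 0;
    }

The point is the one-way door: anything can set the latch, but only a deliberate human action clears it.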
> I'm also always amazed by how few people seem to know about the Therac-25 incident, especially people that work in therapeutic radiation roles (in the UK anyway).
That's interesting, at least amongst my "techie" friends it's common knowledge. Many of us who went to college for computer science related things had it used in one class or another to get across the point that bad software can kill people in really unexpected ways.
I guess maybe the medical side of things doesn't find it worth as much attention because they don't have as much to learn from it.
If you park your car and crank the steering wheel all the way to one side, it won't start unless you put it back. If you have an automatic transmission vehicle and you turn the key without pushing the brake, it won't start. If you have a manual, it won't start unless you push the clutch in. (This didn't use to be true. Citing Ferris Bueller.)
Lots of stuff goes wrong when the hardware people depend on the software people for correctness and the software people depend on the hardware people for correctness. Insert <Group A> and <Group B> for hardware and software. Could easily be writers and editors in journalism.
Agreed. I was a developer on a medical instrument where the decision to implement a hardware interlock was made only after we were years into development and months away from product release. Initially, the belief was that we could get away with software-monitored interlocks, but then the lead HW engineer realized it would be an uphill battle to get UL to certify the machine as safe. UL & OSHA want to see "hard interlocks": systems where opening an interlock immediately cuts power and eliminates the hazard without anything else in the control chain.
The problem was that the easiest way for the hardware engineers to implement this (cutting power to all the controllers) was also the hardest way to manage in software.
At any given moment, with dozens of motors and actuators processing commands, all I/O controllers could have their power cut when an operator opened the cover. This meant retrofitting graceful failure tolerance (previously our code expected that these events would mean a hard fault and a system shutdown) to every single I/O command, resulting in changes to thousands of lines of code and months of work to implement and review the changes.
In the end it was the right thing to do: the machine was rendered safe as soon as an operator opened the cover while system software kept the non-affected parts of the machine running as best it could. As wolrah suggests in the sibling comment, this was an exceptional event: the only way to recover was to acknowledge the fault and put the instrument through a restart procedure.
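A rough sketch in C of what that retrofit looked like in spirit. Everything here (io_status, controller_powered, and so on) is invented for illustration; it's not our actual codebase:

    #include <stdbool.h>

    /* Hypothetical hardware queries, stubbed so the sketch compiles. */
    bool controller_powered(int controller) { (void)controller; return false; /* cover open */ }
    bool do_command(int controller, int command) { (void)controller; (void)command; return true; }

    typedef enum {
        IO_OK,
        IO_HARD_FAULT,      /* genuine failure: stop the machine */
        IO_INTERLOCK_OPEN   /* expected when the cover opens: park this
                               subsystem, keep the rest of the machine alive */
    } io_status;

    io_status send_io_command(int controller, int command) {
        if (!controller_powered(controller)) {
            /* Power was cut by the hard interlock. Not a controller fault,
             * so don't escalate to a system-wide shutdown. */
            return IO_INTERLOCK_OPEN;
        }
        return do_command(controller, command) ? IO_OK : IO_HARD_FAULT;
    }

    int main(void) {
        /* With the cover open, commands report the interlock, not a crash. */
        return send_io_command(0, 42) == IO_INTERLOCK_OPEN ? 0 : 1;
    }

Every call site then had to treat IO_INTERLOCK_OPEN as "pause and await the restart procedure" rather than a crash, which is where the thousands of changed lines came from.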
I think you are right that I’ve never heard about it in any other contexts though.
Lone wolf programmer, no docs, vague testing. What kind of manager(s) would let this out the door knowing this? Never mind the lone wolf programmer. Find the managers and beat them with sticks.
"Everything we know in aviation, every rule in the rule book, every procedure we have, we know because someone somewhere died... We have purchased at great cost, lessons literally bought with blood that we have to preserve as institutional knowledge and pass on to succeeding generations."
We forget that often process and procedures which we now take as "common sense" or "bare minimum" were introduced as a best practice exactly because someone somewhere made a mistake that would have been avoided with those procedures in place.
Another thing I think is important to bring up is that in software teams we often think that procedures are in place for 'other' complacent people or inexperienced juniors.
The reality is that procedures enforce consistency. And that consistency is needed for you as well not just for 'others'.
Today you write something elegant; tomorrow you could have trouble at home, be sleep deprived, pushed for a deadline, pressured by management, and suddenly you, in that instance, become the 'others'.
Procedures in the end eliminate any wiggle room for negotiable 'business compromises' or relaxing quality 'just this once'.
We agree that management was faulty. So why do we give them the presumption of good faith by taking their word for how the programmer operated?
We haven’t heard from the programmer, just from people who are covering up their identity. I’m not saying the following is likely, but it’s possible:
What if the programmer argued with them that they needed to do more testing and allow more time for development? Or that there should be a budget for a redesign, rather than cobbling the -25 together from the bits and bobs of the -6 and -20, but management rushed it into production over their objections?
Then people die, and the programmer quits.
Management goes on to settle while being careful to make it impossible to talk to the programmer, who may very well have a lot to say about management.
We agree that management is at fault. Why take their word for it that the programmer operated without documentation? Why take their word for it that the lack of testing was the programmer’s choice?
Maybe the programmer produced a huge document explaining why the product should not be shipped, and management buried it to save their own skins?
I worked at that time in the industry. Half the people I worked with were musicians getting some extra dough by programming.
At the time, programmers could be divided into "corporate" types and "computer nerds", people like the ones who founded Apple or the various software firms of today. Software wasn't an industry, it was a function within companies that did other things. You needed software to do useful things with computers and even to boot them up, so you had someone write it.
Not knowing the identity of the person who wrote this code is NBD. He likely never even knew there was a bug or problem, and if he did he couldn't be held responsible in any way, nor could the managers of the time.
If anyone should be held responsible, it would be the FDA for not getting ahead of the technology curve at the time and regulating computerized devices better than they did, sooner than they did.
The investigators did the right thing, they focused on the systematic problems. And the coworkers also did the right thing by not providing someone as a scapegoat.
If they had interviewed him, he might have revealed systematic failures in many other parts of the company. But if he's a nameless individual who can't say anything, he can be a silent scapegoat.
But it would be nice if the investigation had found all the facts before the FDA concluded it. Maybe they could have collected information from him in a non-blaming way that would help avoid similar disasters in the future.
Edit: It seems there was, as in he was developing alone and the company didn't disclose his identity. Puzzling.
Bugs happen, but reasonable steps should be taken to prevent them. Developers share that responsibility with management, whether they like it or not.
The story of that series of tragedies: https://en.wikipedia.org/wiki/2002_Überlingen_mid-air_collis...
Or did the failure happen because code that was written for an 8-bit controller was later run on a 32-bit controller and no one realized it? (A sketch of that failure mode follows after this comment.)
Perhaps you'd want to bring in the Test Engineer who verified that the particular feature passed? Why didn't they do their job? How about the Senior QA Engineer who wrote the test cases?
Do you also want to know who wrote the Requirement that the code met? Maybe the code did exactly what the Requirements said, but the Requirement was poorly written.
Point is, failures have to be analyzed on a Systems basis. Simply looking at a line of code can be completely meaningless and miss the big picture. And yes, each of the above failures is something I've come across in my career.
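On the 8-bit/32-bit hypothetical above, here's a toy example in C of how that kind of port goes wrong; the variable names are made up. (As I recall, the documented Therac-25 setup bug was a close cousin: a one-byte flag was incremented rather than set to a fixed value, so every 256th pass it rolled over to zero and a safety check got skipped.)

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t  counter8  = 255;  /* original 8-bit controller code */
        unsigned counter32 = 255;  /* the "port": same logic, wider type */

        counter8++;   /* wraps to 0: the reset path the author relied on */
        counter32++;  /* becomes 256: the == 0 reset path silently vanishes */

        if (counter8 == 0)  puts("8-bit:  periodic reset taken");
        if (counter32 == 0) puts("32-bit: periodic reset taken"); /* never prints */
        return 0;
    }

Nothing in the ported code looks wrong in isolation, which is exactly the point about needing systems-level analysis.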
I think it's also important to name him, to interview him. To understand how he came to kill. So that anybody writing life-critical code today can say, "I'd better not end up infamous like that guy. I can't make the same mistakes."
Maybe if he had been held accountable, the programmers at Uber wouldn't have been lax enough to code up the negligent homicide of Elaine Herzberg.
If I remember correctly, the bug in this case was a race condition between "normally" running code and an interrupt handler. The race was only triggered if an interrupt happened in just the right window between two instructions. I'd be willing to bet that 99% of programmers, if simply given the code and asked, "Is there a bug here?" would have answered "No".
Should people writing operating-system-level medical equipment software be required to have a basic training about race conditions and how to prevent them? Yes. Is it fair to expect a random engineer who was ordered by his company to rewrite the code to work without a hardware interlock to know what he doesn't know? No.
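For anyone who hasn't seen this class of bug, here's a minimal check-then-act race in C between main-line code and an interrupt handler. It's illustrative only: the real Therac-25 code was PDP-11 assembly, and all the names here are invented.

    #include <stdbool.h>

    void revalidate(int energy) { (void)energy; /* hypothetical safety check */ }
    void fire_beam(int energy)  { (void)energy; /* hypothetical actuator */ }

    volatile bool params_changed = false;
    volatile int  beam_energy    = 0;   /* shared treatment parameter */

    /* Keyboard interrupt: the operator edits the prescription. */
    void keyboard_isr(int new_energy) {
        beam_energy = new_energy;
        params_changed = true;
    }

    void treatment_task(void) {
        if (params_changed) {
            params_changed = false;
            revalidate(beam_energy);
        }
        /* RACE WINDOW: if the interrupt fires right here, after the check
         * but before the beam fires, the beam fires with freshly edited,
         * unvalidated parameters. */
        fire_beam(beam_energy);
    }

    int main(void) {
        keyboard_isr(25);   /* pretend the ISR ran */
        treatment_task();   /* happy path: edit seen, revalidated, fired */
        return 0;
    }

The window is only a handful of instructions wide, so it survives any amount of casual testing and only bites when an operator types fast enough to hit it.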
Looking back at when I did Safety Critical Systems at Uni way back in another era, the most important points that we got out of the Therac-25 case study were not to do with bugs at all, but to do with the deficiencies of the system architecture and methodology, especially the decisions that led up to it.
The guy wasn't ordered to do this at the point of a gun. His bosses asked him to do something, he said yes because he liked the money, and people died. The bosses are also responsible, but that doesn't mean that his negligence didn't kill people.
The very fact that we don't know who wrote the code is exactly why it doesn't matter. The complete lack of traceability and accountability is what caused these deaths.
There's so many things that went wrong in order for Therac 25 to kill people that it's irrelevant who wrote the code. It could've just as easily been you or me.
It could be worse than that. He might not have even known the actual purpose and complete function of the system for which he was writing code.
That's common when big projects get outsourced. One might call it "decomposing the problem into small tasks" but it ends up with people writing code for machinery that they have no knowledge of. I've seen this myself. These kinds of projects depend upon layers and layers of oversight and process.
I think that the folks who are looking for "a perpetrator" are utterly missing the key value of studying the Therac-25 case study.
Please note that you are working very hard to make up a fantasy situation that entirely exonerates the person who is like you. Your fantasy does not match the known facts. Think hard about why you think that's the most important thing for you to do here.
I'd also like you to note that you're pointing out the lack of accountability as a problem, and then dismissing one of the fundamental mechanisms of accountability: naming people who harm others.
Yes, other people on the project also should have pushed for traceability and accountability. But responsibility is not zero sum. "I was just following orders" is never an excuse for killing people.
However, a) none of that was true here, and b) in the wider societal frame, I think blamelessness must not always trump accountability. When we're talking about egregious negligence leading to death, I think naming the culprits is the very least we should do.
Self-driving software might find its way around a controlled test track okay, but it is nowhere near fit to be in control on public roads. The safe way to test it is to have a human drive and have the software simulate the same drive, then compare the two for discrepancies. Rinse and repeat until the software consistently equals or betters human decision-making. Then you can consider putting the software in charge.
Programmers are supposed to be the technical experts in the room. So if they aren't pushing back against ignorant, uninformed, hubristic Management's bad decision-making and making it right, how the hell do they think their software will do any better? Fucking worse than useless, the whole cowardly bloody bunch.
They are there to take the blame.
You cannot realistically expect a human spectator to switch modes at zero notice and save the machine from a deadly fuckup that the machine has already put them into. That's not how human minds work. We just don't bootstrap that fast. Jebus, we're bad enough at salvaging bad situations we've knowingly manoeuvred ourselves into while already fully engaged in command-and-control mode.
At least when you, the human pilot, fuck up, you already know the decision chain that got you there because it is your own. With the machine you have first to determine it has gone catastrophically wrong, then determine how it has gone wrong, and finally calculate and execute the recovery strategy before… oh, whoops, too late: you just wiped out all the executive bonuses for this quarter. Also, there's a blood streak on the street.
As for how large software development projects work, I think the trail of corpses already testifies more than sufficiently as to how they don't. Industrial institutionalization of incompetence is no defense, and any "professional" who hides behind it can go get fucked.
As much as I love the web, I think the move-fast-break-things ethos, which is arguably useful for startups doing who-cares-if-it-breaks things like social web front end tweaks, has been absolutely terrible for the industry more broadly. I have friends who make excellent money just sweeping up after the elephant parade of hotshots, solving infrastructure and code issues written by people who want to get paid like professionals without acting like ones. I'm glad for them, but the waste is maddening. And that's before we get to the body counts of places like Facebook and Uber.
Responsibility is not zero sum. The safety driver is responsible. But so are the people who sent out robot cars with a single safety driver, as they knew or should have known that paying attention in low-interaction situations for many hours in a row is not something humans reliably do.
The Uber managers and execs are also responsible, in that they set up the system that led to needless death.
But none of that absolves the programmers, some of whom did things that they knew or should have known were dangerous, and who did not make sure that the system they were committing code for was set up for proper safety.
Your notion that the only job of programmers is to meet the spec is one I deeply disagree with. We're not Amazon warehouse workers, desperate for a job and blindly following whatever orders come our way. We're highly paid professionals whose job is to understand what we're building and what effects it has. I think that's true for any sort of coding, but I believe it's very obviously true for life-critical systems. If we can't handle the responsibility, we shouldn't cash the (quite large) checks.
The failure here was to expect someone to stay alert for a rare event for long periods of time. They probably should have had a more active task that would have kept them more alert.
As others have pointed out, this was a systemic failure. Pointing out the individual highlights those upstream.
Exactly. Look at Equifax. They attempted to blame an individual engineer for not applying a patch; then the next thing they knew, the entire world was asking: why is your CISO a music major?
The implication of that question is nonsensical. Equifax had/has many problems, but the fact that their chief security officer got a degree in music decades earlier wasn't one of them.
Some of the very best programmers I know were music majors - it's actually not uncommon at all. I mean, Steve Jobs only went to college for 1 semester, and afterwards he audited creative classes like calligraphy. Is the implication that he wasn't qualified to be CEO of Apple because he didn't have a degree in management?
You’ve managed to completely miss my point about attempting to blame an individual engineer backfiring and hitting senior management. No one would have cared what her degree was in otherwise. But it became a stick to beat the organisation with.
Boeing is having these same questions asked about them, and I wouldn't be surprised if their "solutions" end up essentially being hardware solutions.
The software was badly broken, and had been for years. The difference between the 25 and previous models is that previous models had interlocks that did not allow the software errors to manifest.
This article gives an overview of the bug:
A much longer paper about the system here (linked from the OP twitter thread)
It was known.
It's a shame we live in such a vengeance culture; what we really want is for it to be public.
People die weekly in most hospitals because doctors and nurses don't wash their hands.
Get over THERAC; it saved lives, and the programmer helped with that.
From a Reddit AMA:
"My teacher does know the name, but is bounded by the courts to not release it. He knows the programmer is living in guilt and did say that he has left programming as his career. Although, it was not entirely his fault, as my teacher explained, the necessary software development process for a machine like this was not there, and no checks were in place.
tl;dr Cannot be revealed, but wasn't entirely his fault."
Nobody says it was; such disasters are often multifactorial. But given his position, that person holds key knowledge and insights into what went wrong that no one else has. Without access to that information, investigators can only hypothesize.
This is why things like whistleblower laws and indemnity insurance exist, to enable the full and unvarnished truth to come out. How are errors meant to be fixed correctly when you don’t have all the information as to the cause?
Compare how air crash investigations work, or research into procedural improvements to hospital hygiene. There are things far more important than just finding people to blame.
And hospitals and doctors alike do get sued for medical errors and negligence - especially lethal ones. "But I saved 99 other lives" is not really a good argument, it's not a game of points.
In my country, if something like this is considered a human rights violation, the general prosecutor can re-open the case, no matter how many years later (this was intended for investigations into the disappearances during the last dictatorship, but I think our constitution supports my reasoning if something like this happens).
In any case, the only person who has all the details about the development towards this incident is the guy that wrote the program. He is the only one that can shed any light into this. I think it's worth finding him even nowadays.
"<b>One</b> programmer, over several years, revised the Therac-6 software into the Therac-25 software (AECL has not released any information about the programmer or his credentials)."
 https://web.archive.org/web/19980201101244/http://cobra.csc.... - "Death and Denial: The Failure of the THERAC-25, A Medical Linear Accelerator"
Previous Therac models had hardware interlocks to prevent some modes; they were removed in favor of software for the 25. No doubt there were some engineers who knew more about this.
At my first job we even had a separate safety officer for our low-powered sources used for tracing water flow.
It's always the result of many mistakes piled one atop the other, and you'll always find a bean-counter on the top adjusting an Excel spreadsheet somewhere to make the numbers come out in a way that pleases some executives.
Did they ever find the name of the accountant behind the fiasco?
It had a crazy impact on corporate culture. It was rarely talked about except in hushed tones over beer; management was extremely averse to any publicity, with no press contact for any reason (compare to my next employer, which put out empty press releases at least weekly if not more often); and sales staff had an extensive playbook for not answering questions about it.
I can understand an individual developer wanting to disappear in this scenario. If they had internalized blame for this, I can certainly imagine them choosing never to work in the industry again (or making more extreme choices).
We get away with that kind of thing in software shops because a) we're relatively new, b) rarely deal with life-and-death designs, and c) haven't racked-up a large enough body count (empirically speaking, rather than morally) to warrant regulation.
Give Uber & Tesla a few more years of running people over, and engineer-style licensing for certain types of software development will probably be in the mail.
It was. There was a PE for software engineering until relatively recently in the US but no one took the exam because it basically wasn't required for anything.
Be careful what you wish for though. The requirements typically include a formal degree and some number of years working under a PE.
There's nothing magical about such a certification though. Other than the education and experience requirements, it's pretty much a GRE-type exam. I took the engineer-in-training exam way back when in a different engineering field but I stopped practicing before I sat for a PE.
No, no one took the exam because it was effectively impossible.
To become a PE, first the candidate has to pass one of the Fundamentals of Engineering exams to become an engineer-in-training. Except, whoops, there wasn't ever a software-specific FE exam; the most relevant one is the EE/Comp. E. exam. Take a look at the list of topics: https://ncees.org/wp-content/uploads/FE-Ele-CBT-specs.pdf Most developers aren't going to pass that even with a CS degree.
Secondly, you need 4-8 years of supervision by a licensed engineer. Again, whoops, there are barely any software developers with a PE license, so who would they get to supervise them?
Only then do you get to take the PE exam for software engineering. Frankly, the situation was so absurd that one has to suspect that NSPE didn't want to certify software developers as PEs.
Edit: It would also explain the lack of provable credentials.
I agree higher-ups bear the brunt of responsibility, but
the title of this post currently says "the programmer behind the THERAC-25 Fiasco was never found"
Engineers at VW were found guilty:
There actually probably is a law where you live, but in almost all jurisdictions I don't think it's enforced. Parts of Canada were trying to protect the title Software Engineer, but I think even they've given up now. It's unfortunate, because now the title doesn't really mean anything.
I'm from Quebec; there are no laws that specifically cover "software engineer". There are some for engineers in general, but nothing surrounding software development. The title truly doesn't mean anything for us as long as no laws are passed, which is why I don't pay for it either.
I think Microsoft won on appeal, but I can't find anything on that right now.
I hate repeating myself, but here again: there are no laws that specifically cover "software engineer".
I'm not arguing that they don't have good laws around the engineer title; I even mention that I can't use the title of engineer because I don't pay for it. What I'm arguing is that there are no laws that specifically cover "software engineer".
That title is meaningless and doesn't give us any power over our own work. It doesn't protect our work, and thus it doesn't allow us to take responsibility for it. In the case of the Therac-25, the software was done for the previous hardware and reused on the newer one. Even if the software engineer was against doing that, there was nothing in his power to stop it. I don't know if the electrical engineer has more power and could have done it in the Therac-25 situation, but as far as I know, only civil engineers are required to sign off on their work for it to proceed.
In this case the author of the software was almost certainly not a P.Eng. (and thus shouldn't be called a software engineer IMHO) and couldn't be found guilty of malpractice.
Hire someone to design a plastic mold, pay for the work, never hear from that person again.
It is insane to expect anything more than a tax code, which is retained for only a few years in some dusty finance department file cabinet.
There is no source control, design history files, CI/CD, etc. in a factory, i.e. 99% of small to medium businesses. Silicon Valley and fintech are the exceptions, even today, let alone when that happened.
Also, it is a bunch of old-timers recommending one another for work. The people on the floor and the owners definitely know the person, but they will not tell unless they have to.