The problem with the Therac-25 was NOT the race condition. Complex devices will always have bugs, because humans haven't figured out a way to do perfect, bug-free engineering. The problem with the Therac-25 was that it wasn't designed to fail safe. The previous model had the same race condition bug, but it also had hardware monitors and interlocks which provided defense in depth.
The lesson of the Therac-25 is not to write perfect software; the lesson is recognizing that humans make mistakes, so anything remotely safety-critical needs to be designed to fail safely when - not if - mistakes and bugs happen.
Short of bit flipping due to environmental conditions, it's more than possible to get an algorithm and state machine perfect. Look at the old landline networks. Look at vending machines. It's just HARD. No one wants to pay for HARD.
Yes. Therac-25's problem did come from lack of defense-in-depth. That doesn't take away from the fact that Quality was sacrificed in the name of a cheaper production cost, and less time to market.
The easiest way to get people to be okay with shoddy things? Tell them it's impossible to do better. Microsoft took that route, so everybody drinks the kool-aid. Even people who should know better.
You are correct, bug-free software is possible - but often designing stuff to fail safe is cheaper than catching every bug, and a better use of time. There is a reason we still include hardware watchdogs on stuff: the watchdog is cheaper than building bug-free software, and still provides the same quality of service.
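The watchdog idea is simple enough to sketch in software. Below is a minimal, hedged Python version (real watchdogs are usually a hardware timer that resets the MCU; the names and timeouts here are illustrative only): the main loop must "pet" the watchdog regularly, and if it ever stops - because of a hang, a deadlock, or any other bug - the watchdog forces the system into its safe state.

```python
import time
import threading

class Watchdog:
    """Calls fail_safe() if pet() is not called within `timeout` seconds."""
    def __init__(self, timeout, fail_safe):
        self.timeout = timeout
        self.fail_safe = fail_safe
        self._last_pet = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)
        self._thread.start()

    def pet(self):
        """The supervised loop calls this on every healthy iteration."""
        with self._lock:
            self._last_pet = time.monotonic()

    def _watch(self):
        # Poll well below the timeout so the deadline is caught promptly.
        while not self._stop.wait(self.timeout / 10):
            with self._lock:
                expired = time.monotonic() - self._last_pet > self.timeout
            if expired:
                self.fail_safe()   # force the safe state, whatever the bug was
                return

    def stop(self):
        self._stop.set()
```

The point of the sketch: the watchdog never needs to know *why* the main loop stalled, which is exactly why it catches bugs nobody predicted.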
Cosmic rays do flip bits. If you're making anything that might be safety-critical, your design needs to fail safely when hit by a cosmic ray or any other hardware failure. If your device isn't controlling a 25 MeV linac, "failing safely" may be as simple as halting or otherwise indicating an error. If it is controlling something dangerous, multiple layers of defense are necessary. One layer might be error-correcting RAM.
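The error-correcting-RAM layer mentioned above works by storing redundant parity bits so a single flipped bit can be located and repaired. Real ECC DIMMs use wider SECDED codes in hardware; as an illustrative sketch of the principle only, here is a tiny Hamming(7,4) code in Python - 4 data bits protected by 3 parity bits, able to correct any single bit flip:

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Locate and fix up to one flipped bit, then return the 4 data bits."""
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

Note this is one layer, not a cure: it corrects a single flip per word, which is why safety-critical designs still stack other defenses on top of it.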
> Look at the old landline networks.
As Aloha already described, the POTS network is one of the best examples of defense in depth. Every piece was designed to work around failures. They even guaranteed service for weeks after a major disaster that took out the entire electrical grid (they had backup generators and phones were self-powered by the phone line). The phone network is exactly the type of engineering we need for safety and reliability.
> Look at vending machines.
Vending machines have a simple enough problem that they don't actually need a full stored-program computer. They can be (and have been) electromechanical devices implementing a finite state automaton. Even if you use software, the necessary state machine is small enough that you can actually prove that the software is correct. Most problems are too complicated.
Also, vending machines fail. They even fail safely; a locked up vending machine doesn't become a hazard to life and limb.
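To make the "small enough to prove correct" point concrete, here is a toy vending-machine state machine in Python (the price, coin set, and credit cap are invented for the sketch). Because the transition function is pure and the state space is finite, every reachable state can be enumerated and checked exhaustively - a brute-force stand-in for a proof that is simply impossible for most real software:

```python
PRICE, MAX_CREDIT = 150, 500   # cents; illustrative values
COINS = (25, 50, 100)
EVENTS = COINS + ("select", "refund")

def step(credit, event):
    """Pure transition function: (state, input) -> (new_state, output)."""
    if event in COINS:
        if credit + event > MAX_CREDIT:
            return credit, ("reject", event)     # coin returned, state unchanged
        return credit + event, None
    if event == "select":
        if credit >= PRICE:
            return 0, ("vend", credit - PRICE)   # item plus correct change
        return credit, None                      # not enough money: ignore
    if event == "refund":
        return 0, ("refund", credit)
    raise ValueError(f"unknown event: {event}")

def reachable_states():
    """Enumerate every state the machine can ever be in.
    The set is tiny, which is what makes exhaustive checking tractable."""
    seen, frontier = {0}, [0]
    while frontier:
        s = frontier.pop()
        for e in EVENTS:
            t, _ = step(s, e)
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen
```

Twenty-one states, five inputs: 105 transitions, all checkable by machine in milliseconds. That is the scale at which "perfect" is affordable.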
Why? Well, not only because bit flipping does occur naturally due to radiation but also because there's no such thing as a formally proven hardware stack in the world. At the end of the day your software has to run on hardware that exists. Not only has that hardware not been formally proven, you can't even know exactly how the hardware will behave from one month to the next due to firmware patches (spectre/meltdown mitigations, for example). Those patches can affect behavior and performance by huge amounts.
Additionally, formal proofs don't cover one of the biggest classes of bugs: insufficient specifications. If you formally prove your software to the wrong specification then you still end up with buggy software. So even with provably correct software methodology you still need classical QA processes.
Moreover, it is not necessary to have formally correct algorithms and state machines in order to achieve the maximum quality of software.
Citation needed; where and when did they state this?
Look at vending machines.
Like the one at work which tells me to use exact change only, then accepts more money and still gives me the right change?
(Re: Marketing disclaimer, e.g. the slogan in the UK "You can't get better than a Kwik-Fit fitter", Kwik-Fit being a high-street chain of budget vehicle repair, I think analogous to something like Jiffy-Lube in the USA, it would be unreasonable for anyone to take this as a factual claim that "no better mechanics exist on the planet", excellent mechanics are more likely to work on racing car engines than routine oil and brake changes for near minimum wage, say)
The ones which appear (to a layperson) to match the MIT license "NO WARRANTY" section, or the Mozilla Public License "indemnify all Contributors" and "disclaimer of warranty", and sections 7, 8 of the Apache License 2, and sections 15, 16 of the GPL v3?
If so, there's a lot of people stating "it's impossible to build better software than ours", and while I disagree that's what it communicates, if that is what it means, it's bad-faith to try and single Microsoft out on that count, isn't it?
I will guess that the author would agree with you. The goal is a safe solution, and that implies better software, better hardware, better procedures... anything you can learn from mistakes to make this not happen again. If you look at plane crash reports they will highlight software and hardware malfunctions but also pilot mistakes (one plane crashed while the pilots were distracted during landing, talking about another plane crash that had happened because those pilots were distracted talking during landing).
> The problem with the Therac-25 was NOT the race condition. Complex devices will always have bugs,
That is the point of the article. " ... The reason? A race condition. An unplanned-for coincidence, made deadly by the conscious design decision to remove an old-fashioned safety interlock."
The problem was the race condition. But if you solve that one, another one will pop-up. So what was deadly was the "decision to remove a safety interlock".
We need to identify which one can be solved, and do it. If we cannot write perfect software, we should add hardware safety. If hardware safety is not enough, we need to improve the procedures around it. If that is not good enough, we can add more software to help where humans are failing... it is a refining process that you learn by studying past failures.
It is clear from the AECL documentation on the modifications
that the software allows concurrent access to shared memory,
that there is no real synchronization aside from data that
are stored in shared variables, and that the "test" and "set"
for such variables are not indivisible operations. Race
conditions resulting from this implementation of multitasking
played an important part in the accidents.
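The quoted finding - a "test" and "set" on shared variables that are not indivisible operations - is a pattern that can be sketched in a few lines. This is not Therac-25 code, just an illustrative Python analogue of a check-then-act race and its lock-based fix:

```python
import threading

class SharedFlag:
    """A shared variable with a separate 'test' and 'set', plus a fixed version."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        # 'Test' and 'set' are two divisible steps: another task can be
        # scheduled between the read and the write, and its update is lost.
        v = self.value
        self.value = v + 1

    def safe_increment(self):
        # The lock makes the read-modify-write indivisible.
        with self._lock:
            self.value += 1

def hammer(increment, n_threads=8, n_iters=10_000):
    """Call `increment` from many threads at once, as concurrent tasks would."""
    def worker():
        for _ in range(n_iters):
            increment()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The unsafe version may pass a test run thousands of times and then lose updates under just the wrong interleaving - which mirrors why the Therac-25 bug only surfaced when an experienced operator typed fast enough to hit the window.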
The reason for this comes from two sides: on the one hand, there is the client (either internal or external) who wants as many features for as little money as possible, and who is often not very good at judging whether their demands are realistic.
On the other hand, there is the IT organization, which has an incentive to comply with unrealistic targets.
If it is an external client, this unrealistic promise is the advantage over the competition that might get them the project.
But I also see this happen regularly with internal projects where there is no competition between implementors. In that case the motivation is usually that IT management wants the consent of higher management to get the project started; they know that once a project has been going for a year or so, it won't be terminated even if it's over budget or past deadlines. At the same time, if they had given realistic estimates to higher management, the project might never have been approved.
The difference is that missed deadlines in construction projects come with bigger monetary penalties (citation needed, based on the low penal damages I've personally seen in IT project contracts).
Everything is planned & tracked as a portion of an arbitrary 12-month time frame, & the internal politics mean each department is competing for funds on a "use-it-or-lose-it" basis. If you've never sat in on department/geography/company-wide governance forums where they agree next year's "change management" budget, the basic format is: you present a business case & cost forecast based on FTEs over the expected duration of the project; all the mandatory regulatory compliance projects get approved automatically; and the remainder of the budget is set based on how well each middle manager can defend the interests of their department. They have a final number for total approved spend & approve as many projects as will fit into it, Tetris-style, with a flat 20% contingency spread across everything to account for the fact that, historically, trying to plan this way is so inaccurate it's basically a series of pseudo-random guesses. It's worth noting that even stuff badged as "Agile" is funded & budgeted for in this way.
Every programme I've been involved with has had some variation on "we can't start work on X in that month, because it won't finish until February and will cannibalise part of my budget for next year".
Unfortunately, given the economic & political status quo this problem is impossible to solve (but for the libertarians reading this, it's an interesting example of the 'higher order' unintended consequences of sincerely designed industry regulation).
This is particularly toxic, because often the senior executives and investors also have an interest in hiding the failure (avoiding personal accountability and avoiding write-downs on investment).
It is very hard to see how a culture of true reflection and learning can emerge in this environment.
Many managers seem to think that they can easily change the requirements in the middle of the project.
These are all things I've seen in the last month. (Yeah, I'm not joining any more greenfield projects within organizations - if you think you're going to rebuild it from scratch and fix all the problems, you have no idea what went wrong in the first place.)
"But of course I can change anything I want, whenever I want, and you all are supposed to handle it. That's why we use Agile!"
Another big factor is testing/QA. If your customer service department isn't going to thoroughly test your new program, and nobody holds them accountable, who gets blamed on launch day when it doesn't work right? IT!
Divide roles, or increase complexity of the project, or its relationships with the surrounding infrastructure, and planning or standardisation requirements escalate quickly.
You'll find plenty of examples of people on DIY or lone-inventor projects saying "we tried X, it didn't work, so we did Y instead." That's the unitary-actor case of your hypothetical.
Dividing executive-agent roles is a complexity dimension.
Dividing executive (decision), design, expert (engineering), sourcing, external approval, interface, and labour/craftsman roles adds multiple additional complexity dimensions. And this doesn't even include the technical dimensions of the project and its infrastructure interfaces/interconnects.
Complexity is interaction between components and entities.
Behind any popular, profitable, successful software project there is sometimes as much "PMing" practice as there is coding.
Risk is a nonapparent long-form cost, and unless specifically manifested (insurance, regulations, fines), will be ignored.
The results of a scholarly search for "a gresham's law of" are illuminating, spanning divorce, neighbourhoods, academic admissions, environmental regulations, incorporation, shipping, management, court citations, television programming, politics, and more.
How about treating it as capital with a risk-return element, say?
Despite the lack of public post-mortems, poor project management seems to be widely recognized as a problem. But there isn’t a clear cut solution. Agile promised to save us all but seems to be implemented poorly more often than not.
Project management is happily alive. Thinking Agile in some way solves project management is insane.
Plan out your system, estimate the time to build it (you don’t even need good estimates), execute ruthless change control. It’s not hard, just takes discipline. Ruthless change control is the hard part. That doesn’t mean saying no, it means saying “if you change things, it costs you schedule days”.
If you want a clear cut system, iDesign has some good classes. Imo at least.
When I worked for Philips as a test hardware design engineer in the old Mullard Radio Valve plant in the late 70s and early 80s (a major semiconductor plant despite its name) we were even stricter than that. Every project started with a rough estimate for the total cost, we then spent 10 percent of that on a combination of prototypes and a better estimate of the time and cost and refinement of the requirements. The customer department would then decide whether to proceed. At that point the requirements were frozen. Our department was then committed to the project schedule and price and the customer was committed to the requirements. Any changes would require exactly the same procedure as before: rough estimate, 10 percent preliminary study, better requirements and plan, commitment, development, delivery.
Part of the purpose of this was that at the end of the ten percent phase the customer could take his requirements to an external supplier (British Aerospace for instance) and get a quote from them instead and possibly actually give them the job.
> Ruthless change control is the hard part. That doesn’t mean saying no, it means saying “if you change things, it costs you schedule days”.
Just because software is dominated by soft costs doesn't mean it's cheap to change requirements. That doesn't mean you can't deviate from your initial spec; it just means you have to charge the customer for changes they want after you've already started development. Throwing away work just because it's "easy" to change software doesn't magically recover the time and money you've already spent up to that point.
That said, from my POV a lot of software-related IT project failures seem to correlate with two factors:
- Doing too much at once. Like replacing 6 different existing specialized systems with a single new one.
- Unwillingness to change the business procedures/workflow to cater to software.
The lure of the single do-it-all system seems strong with certain people. But at least in my experience, one could draw from software engineering and how good software is written as separated modules with well-defined interfaces at the boundaries. If you have multiple systems with good interfaces for data exchange, it's much easier to specialize where needed, and replace outdated or broken pieces.
The unwillingness to adjust the business procedures/workflow to software needs is a huge one. Complex software is fragile. By having complex rules in the business procedures you force the software to be more complex, thus invariably making the software more fragile. If business procedures were changed to be software friendly before the software is written/adapted, the software can be simpler and thus hopefully less fragile.
The root of the problem is the uncontrolled complexity of modern software products.
Because of this complexity responsibilities are diluted, most of your code is in your dependencies nowadays.
If you write a casual library, are you responsible if it is flawed and used in a critical operation?
Can dependencies always be carefully audited?
That doesn't resonate much with me. Yes maybe in terms of LOC most of our code is in dependencies. But most of our _important_ code is our own, in the business logic.
The reason our product is unreliable is mostly down to complexity as you note, but that complexity is driven by our users who want better integration and automation. Add one more optional behavior ("when X I need to do this tedious task Y, could your program do this for me?") and you've increased the testing surface exponentially.
On the other hand, those additions are what set us apart from our competition, and what make eyes pop when we show off our product to potential new clients.
So it's a balancing act. The complexity makes our users extremely productive, but it also makes our software more fragile.
: For example, one client went from almost an entire day of manual data entry for certain orders to less than half an hour due to a "smart" Excel importer I wrote.
: Users now want the Excel importer to behave in conflicting ways, and trying to cater to that without breaking one of the use-cases is very difficult.
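The "testing surface grows exponentially" point above can be made concrete: n independent optional behaviors give 2^n distinct configurations to test, and each new optional flag doubles the count. A tiny sketch (the feature names here are invented for illustration):

```python
from itertools import product

# hypothetical optional behaviors of the kind described above
features = ["excel_import", "auto_task_y", "legacy_format", "strict_checks"]

# every on/off combination is a distinct configuration the program can run in
configs = list(product([False, True], repeat=len(features)))
assert len(configs) == 2 ** len(features)       # 16 configs for 4 flags

# adding one more optional behavior doubles the surface again
assert 2 ** (len(features) + 1) == 2 * len(configs)
```

At 4 flags this is still testable; at 20 it is over a million configurations, which is why "could your program also do Y?" is never as cheap as it sounds.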
However, I don't think shaming is what this is about. To me it seems the objective is to learn from mistakes, and for that we need to be honest about what happened; it is going to be pretty hard to be honest if we tip-toe around who did what and why.
I also agree that complexity is a problem. But I don't think acknowledging this gives us any path forwards. I don't think going back to the 'good old days' is going to be a solution. I therefore see this learning process as helping us figure out how to move forwards, and as providing motivation to the industry as a whole. It is this industry-wide motivation that will be needed to address some of the systemic complexity issues.
I don't think this would be enough on its own (and implementation is a whole other question), but I think it could be a step in the right direction.
The first citation of the words "Software Crisis" meaning the inherent difficulty of writing high-quality software in a predictable way was from a NATO conference fifty years ago: https://en.wikipedia.org/wiki/Software_crisis
It is taking a long time for good practices to be discovered and win-out, and even when obvious improvements have been made, they're not necessarily used effectively.
I suspect a large part of the reason why the software industry isn't maturing at the same speed that other industries have had to, is that in software, failure is much easier to hide.
> The root of the problem is the uncontrolled complexity of modern software products.
I think there's a feedback loop between those two things, especially when it comes to government or giant corporation projects. Lack of accountability causes accidental complexity, which in turn causes a lack of accountability.
It starts with an organisation that lacks tech leadership hiring a consultancy, which then treats the project as a "flagship engagement" - which means trying to make everything perfect, where "perfect" is measured in the number of future sales pitches that will cite this one project.
As a result, there's a gap between what the organisation needs and what it gets, which adds to the amount of work required and complexity to navigate, whenever changes are required, and the overall complexity snowballs from there.
Most of the above is business-as-usual for most very expensive projects. The real danger-zone is when you get to the third iteration, six or seven years down the line, and you're forced to re-hire the first consultancy again because they're the only one with the resources to take it on; but the tech-world has moved on, so they see you as a "modernisation engagement". They simultaneously can't criticise their own bad decisions from several years prior, but at the same time they want the wider-industry to see their "transformative" power, so can't merely iterate on what's already there either.
That's how you end up with iOS apps, talking to Ruby-on-Rails APIs (which used to be the primary web-app, before that was replaced with a React frontend), reading and writing from an Oracle database which is also updated with a series of batch jobs dating back to early 2000s Java EE.
The "coal face" developers in all these situations have done the best work to their ability, and quite often achieved minor miracles in stability given the underlying complexity. The problem is always a management (or lack-of management) problem.
Has it been getting more unreliable? Software is being developed, bugs are getting fixed, and new use-cases are emerging faster than ever before.
If it seems like software is getting more unreliable, maybe it's just that we're relying on it more and more.
Now there are serious meltdowns and failures every year.
Meanwhile, when was the last time you talked to a customer services rep of any large org without being told "Sorry - our systems are really slow today"?
I live in one of the most digitized countries in the world. So we've naturally digitized payment for public transportation. When we did it, nobody questioned the taxation system, even though it was made in the '70s and built around a public structure called "amter" that hadn't actually existed for many years when the system was built. We had also gone from 271 municipalities to 98, and their borders were part of the taxation too. So the taxation rules frankly didn't make any sense and were needlessly complicated, yet they were digitized as-is. Naturally it was a disaster; it was even predicted by the technical team and the project leads, but nobody wanted to touch the taxation politically. It got fixed eventually, but it could have been several hundred million Danish kroner cheaper if they had simply redone the taxation models for ticket prices before the digitization.
So that's one mistake, and a common one, in both the public and private sectors. The other common disaster is building systems for specific processes without looking at the bigger picture. Like a case-working system that handles the welfare process for people who are sick - except you forget that those citizens sometimes don't go through official communication channels, and maybe send a letter or an email to the wrong department, so you need to be able to add those documents to their digital case file. But that's not possible, and neither is sending a notice to other systems in other departments which also deal with the same citizen. I'm guessing this last issue is bigger in the public sector than in private, because we often buy our software from companies that have very little actual domain knowledge outside of what their direct customers tell them, and the case workers they use for knowledge very often lack insight into the greater architecture of running 350+ IT systems together, because they work with maybe 5 of them.
I mean, these things aren't as deadly as the X-ray machine, but they've been happening for the better part of 25 years and nobody seems to have really learnt anything.
I'm honestly not sure why that is. I'm hesitant to ascribe it only to incompetence, because not everyone can be incompetent; maybe we only hear about the failed projects with bad decisions.
A clarification: The Therac-25 had an unfortunate race condition, what made this deadly was the conscious decision by the designers to REMOVE the physical safety interlock. They didn't consider modes of failure. The post says exactly this. Always consider modes of failure, you never know when some "other guy" is going to naively count on your work being 100% reliable. It's a system not a goal, as I like to remind people.
Some of you might enjoy some of my other stuff, particularly on security:
The Tay Bridge disaster was important because:
1) Before it, we had several bridge failures in the UK.
2) After it we had almost none at all. Ever.
3) The report into the disaster was responsible for this improvement. It uncovered problems with: The design, the metal used, the way it was assembled, the maintenance regime, the project management and personal relationships and personalities of the people involved.
I'd lay money on the cause of the recent tragic bridge collapse in Italy being one of those already cited in that 140 year old report. It's all there.
Back to our own world...
When major IT projects fail, there is almost never a public enquiry, even when those failures are government projects, and even when they cost hundreds of millions of dollars/pounds. These failures are repeated regularly in government, and daily in the private sector.
Many of us who have been around a while have a (probably pretty good) understanding for why they fail, yet the lessons are not learned and there is little sign we are getting any better at all at not-failing. I suspect a bit of exposure to downside risk, or "skin in the game" as Taleb would call it, might improve things.
Sometimes the medicine is hard to take.
The book was published 28 years ago, in 1990.
We use words like science and engineering in conjunction with others like computer, programming, and software. And yet there's nothing scientific about how we fail to learn from mistakes made decades ago, and how we keep reinventing "engineering" best practices and calling them by new names.
You may not know its provenance: The Economist, Volume 186, January, 1958, or 60 years ago:
I've been trying unsuccessfully to secure a copy of this article for some years. PDF preferred, dredmorbius<at>protonmail<dot>com if anyone should have access.
My research trove exceeds 10k items. I don't have a $100/item, or even source, budget.
> Think about the fact...
You can think all you want, but it's unlikely to do anybody any good. Sometimes the fault for a failed project lies squarely with the engineers, but this is not at all the usual case. The people most responsible for failed software projects are management - and not just engineering management, but the people engineering management reports to.
And the biggest problem management has is not simply lack of understanding of the nature of software development projects, but, often, a profound lack of interest in learning.
I don't know what to do about that.
It's hard to grasp the sheer scale of government. This article does a good job of juxtaposition in the case of the magnitude of engineering failures, but I want to add on that $20 billion is chump change when it comes to waste. The military sector alone plowed through $700 billion last year to accomplish the task of robo-killing brown people. The entire federal budget was $2 trillion. Stop and think about those numbers for a bit.
There are 2.7 million civil servants in the US, and 2 million military personnel. $2T divided by 4.7 million means every single government official is responsible for roughly $425,000 of your tax dollars. This includes postmen and every boot-camp trainee.
Obviously only a fraction of these people are making decisions. So you can add zeroes to that number when you want to consider how much power the actual decision makers have. These decision makers are human, and humans are wont to see themselves as kings of their domain, and what is a king's job but to squander money squabbling over fiefdoms.
The sheer, mind-boggling scale of systems of government, all of them, from your homeowner's association to your neighborhood council to your city government to your state government to the national government to international governmental organizations like NATO and the UN, isn't even the most interesting aspect to consider here.
A more amazing thing to think about is how they manage to get anything done at all. But that's not even the biggest thing.
The biggest thing is that there is nothing new about this state of affairs. Civilization was built like this, thousands of years ago.
It's an admirable goal to want to get rid of waste in government. But that's an untamable firehose. You won't even get laughed at for a proposal to save $20b of tax money. They will look at you, decide whether you're going to look good on TV, maybe put you up in front of a camera if you're really really really really lucky, and everything you spent your whole life learning to finally try to do will get swept into a political capital generating exercise for a local politician. Thanks, try again next life.
Governmental cruelty knows no bounds.
It holds all company ownership data and a lot more. Right now there is no way to register a company in my country, or to make any changes to existing companies (e.g. changing managers, shareholders, etc.). It is one of the most important sets of data for an EU country.
The original problem (leaked by the government) is that 4 of the RAID5 disks broke down, but it is still a mystery why recovering the data is taking more than 2 weeks.
0 : http://brra.bg
1 : https://bivol.bg/en/classified-information-and-human-error-c...
Though I bet it was just nobody ever checked if the disks were still working.
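For context on why 4 dead disks is unrecoverable: RAID5 keeps a single XOR parity block per stripe, so it can reconstruct exactly one failed disk; lose two or more and the parity math has nothing left to work with, and you are down to backups. A toy sketch of the XOR parity idea in Python (block contents are invented for illustration):

```python
from functools import reduce

def parity(blocks):
    """RAID5-style parity: byte-wise XOR across equal-sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# toy stripe: three data blocks plus one parity block, one per "disk"
data = [b"company ", b"registry", b"records."]
p = parity(data)

# any SINGLE lost block is recoverable: XOR everything that survives
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost] + [p]
reconstructed = parity(survivors)
```

With two lost blocks, the surviving XOR gives only the combination of the missing pair, never either one individually - which is also why unmonitored arrays quietly die one disk at a time until the last failure is fatal.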
There is a reason my backups are held on a separate machine in a separate building (and also on large external encrypted drives that leave site everyday).
Sometimes you learn these the hard way.
It's also the message of the Phoenix Project book, which I did like.
The problem I have noticed is that although management understands their businesses, it's easy to bullshit them when it comes to IT. And they let it happen because they are not into IT and don't grok it. They would never treat other kinds of projects this way, because those they totally understand.
I especially notice that as I read what higher management layers write about projects or effort, it's high on fluffy 'visionary' words but low on actual actionable vision that would help me to make every day decisions on what to prioritise.
I believe that the simple reason why IT projects fail is because of very mundane basic things.
But those are not sexy to write about. To me it's all about:
- Why are we doing this?
- How would you define success and failure for this project?
- Who is responsible for what? Who is the contact person?
- How do we work with each other? (detail this)
- What are the guiding principles?
- How do we assure quality?
- How do we assure timely delivery?
- What stuff do we need: gear, licenses, etc.?
- P R E P A R A T I O N - do your homework; investigate things before you make choices.
I can go on and on. And it may bore you. But I think there is actually no true complexity involved in all those failing projects.
There is not something really special to IT projects. I wonder if we do pretend there is something special to them because we ourselves want to feel important in some sense.
It's not like we collectively don't know this stuff. Let's stop drinking our own kool-aid. And for those of you who do work for cool companies or startup disrupters, I'm really happy for you. For the rest of us, technology is not the centre of the universe. Appreciating that difference is important.
And yes, I know I've missed valid arguments and swung the pendulum a bit far the other way. It's deliberate. We are in danger of disappearing into our own navels.
My electricity supplier is close to losing my business in part because their web site won't accept my meter readings.
I'll be basing my decision on what car to buy my wife at least partly on whether it works with my iPhone. If it doesn't then you won't be on the list.
And on and on. We increasingly interact with businesses through technology, if they can't get that right then they are going to suffer. They can't get it right unless they take it more seriously right up to board level.
But... there are three key problems.
1) The time scales are long - in my experience big project failures are on a >5 year time scale (because - big). I think proper studies will need to run >10 years, and that's a big ask for any academic or team.
2) The costs are borne by one set of stakeholders (IT); the benefits accrue to another (the next IT). Why invest to help your successor? No one is going to thank you; also, you will likely be sacked faster! There is no board-level education or knowledge about this. The only source of information that could convince boards that this is the right thing to do would be McKinsey/Bain/BCG, and those &&^^"! will never, ever say this because it's the right thing to do and they are evil. (Prove me wrong!)
3) What do you measure? The field is immature, it's not clear what the right inputs to check are - or what the right way to estimate the outputs are. So we need to do a lot of work now to set up the definitive studies.
I have an anecdote: there is a thing called the FEAST hypothesis http://users.ece.utexas.edu/~perry/work/papers/feast1.pdf I was a user of one of the studied systems, and I was curious about the study. It hypothesised that development of big systems slowed as they got more complex, and the data from the system I used was one of the points that confirmed this. I examined change control documents and discovered that development of said system had* slowed before the end of the study, but then it had reaccelerated: a whole load of "robots" had been implemented by business units consuming the system, and these had not been reported in the FEAST study (IT was largely unaware). The robots started causing problems, policy changed, they were insourced, and on-platform development took off.
- 5 year major international project to develop the art to support this
- legislation that mandates system development information is stored up front and in a shared place.
- legislation that mandates regular reviews that determine certain information that is signed off by an engineer.
- 20 year massive project to use above information
I am not optimistic... We can't even prove that XYZ is better than agile...
I am not sure I have too many answers. But having a genuine profession that is required by law to sign off on any life-critical software seems a sensible starting point.
Yes, but how many bridge projects failed in the last 140 years because of cost overruns or missed deadlines, which is a more direct analogy for most of the arguments in TFA?
And I'm guessing we're just talking about the UK since earthquakes have taken down a bridge or two in my lifetime...
I was reading just recently of a failed megaproject, the Nicaraguan Canal. Forecast costs range from $40 - 100 billion, though I cannot find a report of actual expenditures.
Contrast this with actual engineering failures, such as Fukushima or Banqiao. This is an apple-juicer to oranges comparison.
I have seen the best managers giving into ridiculous deadlines at the time of project onset just because they know that there is no other option.
"When Will We Learn?
Every major software incident requires a thorough and public analysis."
It was written last year, but it reads like a weird sentence this year.
It's not even that unusual. In https://en.wikipedia.org/wiki/List_of_bridge_failures#2000–p... I counted something like 150 bridge collapses since 2000.
Metallurgy was primitive, and there were no x-rays available to find hidden cracks formed during the manufacturing process.
I think it mostly contains stuff everybody knows today, which shows how much impact it had on our profession: not on the coding part, but very deeply on the management part.
If the software had acted as expected she would have been alive. If self driving cars become popular, coding mistakes will kill more people.
(Yes, they had wilfully disengaged the car's built-in automatic braking feature in order to allow their software to control it, and the human safety engineer riding in the car was not paying attention to the road (partly because the safety engineer blindly trusted the software running the car); these were factors as well.)
My understanding is the opposite. Uber's software was generating a lot of false braking events, so they set it up so it wasn't controlling the brakes. Drivers were trying to gather evidence about the triggers for these false events. That created a perverse situation where the software correctly identified that it should have braked, but the only action it could take was to raise an alert, distracting the driver at a critical moment.
That's another key lesson from the Therac-25: it failed mostly silently. It displayed a strange message that didn't make any sense, and there was no obvious indication of a failure.
Software need not be bug free. For example, there is a reason we still include hardware watchdogs on embedded devices: the watchdog is cheaper than bug-free software, and will provide the same quality of service.
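The watchdog pattern above can be sketched in a few lines. This is a minimal software simulation, not real firmware: the class name, timeout, and callback are all hypothetical, and on a real MCU the "expiry" would be a hardware reset rather than a Python callback.

```python
import threading
import time

class SoftwareWatchdog:
    """Software analogue of a hardware watchdog timer.

    If pet() is not called within `timeout` seconds, `on_expiry`
    fires -- the stand-in here for a hardware reset pin.
    """

    def __init__(self, timeout, on_expiry):
        self.timeout = timeout
        self.on_expiry = on_expiry
        self._timer = None

    def start(self):
        self._arm()

    def pet(self):
        # Cancel the pending "reset" and re-arm the countdown.
        self._timer.cancel()
        self._arm()

    def stop(self):
        self._timer.cancel()

    def _arm(self):
        self._timer = threading.Timer(self.timeout, self.on_expiry)
        self._timer.daemon = True
        self._timer.start()

fired = []
wd = SoftwareWatchdog(0.2, lambda: fired.append("reset"))
wd.start()

# A healthy main loop pets the watchdog well inside the timeout...
for _ in range(3):
    time.sleep(0.05)
    wd.pet()

# ...then "hangs", and the watchdog fires instead of the loop.
time.sleep(0.5)
print(fired)
wd.stop()
```

The point is exactly the one made above: the buggy loop is never fixed, but the system still recovers, because the recovery mechanism is independent of the software being correct.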
Also, it's bad if you go COTS and don't do the business change to fit the product. Trying to make COTS work your way is always so, so, so bad.
Starting off with a complete custom set-up for your core is another opportunity for premature optimization to creep in.
So what then dawns on the business is they realize that the missing 10-20% was part of the business that was really important and they have dropped serious money on a bunch of products. And really all they needed to do was better understand themselves and build their own business infrastructure.
But what you are saying about speed definitely rings true. It's important to note that IT failures at new businesses are more or less written off as total business failures, usually resulting in the business going to the wall.
From Programming Pearls, Section 7.3 [Safety Factors], by Dr. Jon Bentley, which reproduces Vic Vyssotsky's advice from a talk he has given on several occasions.
"Most of you'', says Vyssotsky, "probably recall pictures of `Galloping Gertie', the Tacoma Narrows Bridge which tore itself apart in a windstorm in 1940. Well, suspension bridges had been ripping themselves apart that way for eighty years or so before Galloping Gertie. It's an aerodynamic lift phenomenon, and to do a proper engineering calculation of the forces, which involve drastic nonlinearities, you have to use the mathematics and concepts of
Kolmogorov to model the eddy spectrum. Nobody really knew how to do this correctly in detail until the 1950's or thereabouts. So, why hasn't the Brooklyn Bridge torn itself apart, like Galloping Gertie?
"It's because John Roebling had sense enough to know what he didn't know. His notes and letters on the design of the Brooklyn Bridge still exist, and they are a fascinating example of a good engineer recognizing the limits of his knowledge. He knew about aerodynamic lift on suspension bridges; he had watched it. And he knew he didn't
know enough to model it. So he designed the stiffness of the truss on the Brooklyn Bridge roadway to be six times what a normal calculation based on known static and dynamic loads would have called for. And, he specified a network of diagonal stays running down to the roadway, to stiffen the entire bridge structure. Go look at those sometime; they're almost unique.
"When Roebling was asked whether his proposed bridge wouldn't collapse like so many others, he said, `No, because I designed it six times as strong as it needs to be, to prevent that from happening.'
"Roebling was a good engineer, and he built a good bridge, by employing a huge safety factor to compensate for his ignorance. Do we do that? I submit to you that in calculating performance of our real-time software systems we ought to derate them by a factor of two, or four, or six, to compensate for our ignorance. In making reliability/availability commitments, we ought to stay back from the objectives we think we can meet by a factor of ten, to compensate for our ignorance. In estimating size and cost and schedule, we should be conservative by a factor of two or four to compensate for our ignorance. We should design the way John Roebling did, and not the way his contemporaries did -- so far as I know, none of the suspension bridges built by Roebling's contemporaries in the United States still stands, and a quarter of all the bridges of any type built in the U.S. in the 1870's collapsed within ten years of their construction.
"Are we engineers, like John Roebling? I wonder.''
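Vyssotsky's safety factors reduce to simple arithmetic. A sketch under stated assumptions: the factors (2x-6x for performance, 10x for reliability) are his suggestions from the quote above, while the benchmark and availability numbers are hypothetical.

```python
def derate(measured_capacity, safety_factor):
    """Capacity to actually commit to, per Vyssotsky's advice:
    divide what you measured by a factor that compensates for ignorance."""
    return measured_capacity / safety_factor

# Hypothetical lab benchmark: 1200 requests/sec.
benchmark_rps = 1200

# Derate real-time performance by a factor of two, four, or six.
committed_rps = derate(benchmark_rps, 4)  # commit to 300 rps, not 1200

# For reliability commitments, stay back by a factor of ten:
# if we believe we can hit 99.99% availability, promise only
# what a 10x larger error budget allows (99.9%).
believed_unavailability = 1e-4
promised_unavailability = believed_unavailability * 10
promised_availability = 1 - promised_unavailability
```

The Roebling move, in other words, is not better modelling; it is choosing the divisor before you need it.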
“Siri didn’t immediately play the right song from my Infinite jukebox at my voice command” is not a bridge collapse. “My online banking was down for an hour” is not near the inconvenience of not having banking available every evening and night before online.
One saving grace is that truly incompetent software projects of any size never make it off the ground (or don't stay up long enough to be relied on).