Age of Invisible Disasters (2017) (eutopian.io)
148 points by rrampage 7 months ago | 108 comments

> The Therac-25 X-ray machine killed several people ... The reason? A race condition.

The problem with the Therac-25 was NOT the race condition. Complex devices will always have bugs, because humans haven't figured out a way to do perfect, bug-free engineering. The problem with the Therac-25 was that it wasn't designed to fail safe. The previous model had the same race condition bug, but it also had hardware monitors and interlocks which provided defense in depth.

The lesson of the Therac-25 is not writing perfect software; the lesson is recognizing that humans make mistakes so anything remotely safety-critical needs to be designed to fail safely when - not if - mistakes/bugs happen.

Not being able to write perfect bug-free software is honestly a myth.

Short of bit flipping due to environmental conditions, it's more than possible to get an algorithm and state machine perfect. Look at the old landline networks. Look at vending machines. It's just HARD. No one wants to pay for HARD.

Yes. Therac-25's problem did come from lack of defense-in-depth. That doesn't take away from the fact that Quality was sacrificed in the name of a cheaper production cost, and less time to market.

The easiest way to get people to be okay with shoddy things? Tell them it's impossible to do better. Microsoft took that route, so everybody drinks the kool-aid. Even people who should know better.

The old landline network did have notable bugs; sometimes they caused service outages. We never noticed because the network as a whole was designed with enough redundancy, and to not fail silently: you could lose many components of the network and still have it remain functional. Every switch deployed ran duplex (there were two computers running in lockstep), and you could lose chunks of switching fabric within a switch and still have it be more or less functional, though in a degraded state. Almost every end office had more than one route out, to account for equipment or trunk failures.

You are correct, bug-free software is possible - but often designing stuff to fail safe is cheaper than catching every bug, and a better use of time. There is a reason we still include hardware watchdogs on stuff: the watchdog is cheaper than building bug-free software, and still provides the same quality of service.
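The watchdog idea is simple enough to sketch in software. Below is a toy, hypothetical Python version (real watchdogs are hardware timers that reset the CPU when the firmware stops petting them; the class and timeout here are made up for illustration):

```python
import time

class Watchdog:
    """Toy software watchdog: if the main loop stops petting it
    within `timeout` seconds, it trips and the system should enter
    a fail-safe state instead of carrying on."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_pet = time.monotonic()

    def pet(self):
        # A healthy main loop calls this on every iteration.
        self.last_pet = time.monotonic()

    def expired(self):
        # True means the main loop hung: time to fail safe.
        return time.monotonic() - self.last_pet > self.timeout

wd = Watchdog(timeout=0.05)
wd.pet()
assert not wd.expired()   # loop is healthy
time.sleep(0.1)           # simulate a hung control loop
assert wd.expired()       # watchdog trips -> go to safe state
```

The point is the same as the hardware version: the watchdog doesn't prevent the bug, it bounds the damage when the bug happens.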

> Short of bit flipping due to environmental conditions

Cosmic rays do flip bits. If you're making anything that might be safety-critical, your design needs to fail safely when hit by a cosmic ray or any other hardware failure. If your device isn't controlling a 25 MeV linac, "failing safely" may be as simple as halting or otherwise indicating an error. If it is controlling something dangerous, multiple layers of defense are necessary. One layer might be error-correcting RAM.
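One of those layers can be as simple as parity: detect the flip and halt rather than act on corrupted state. A toy sketch (single-bit even parity detects, but cannot correct, an odd number of flipped bits; real ECC RAM uses stronger codes like SECDED):

```python
def parity(word: int) -> int:
    # Even parity bit over an 8-bit word.
    return bin(word & 0xFF).count("1") % 2

def store(word: int) -> tuple[int, int]:
    # Store the word together with its parity bit.
    return word, parity(word)

def load(word: int, p: int) -> int:
    # Fail safe: refuse to return data whose parity no longer matches.
    if parity(word) != p:
        raise RuntimeError("bit flip detected: halting, not guessing")
    return word

w, p = store(0b1011_0010)
assert load(w, p) == 0b1011_0010
corrupted = w ^ 0b0000_0100        # one cosmic-ray bit flip
try:
    load(corrupted, p)
except RuntimeError:
    pass  # detected: the device halts instead of acting on bad data
```

Detect-and-halt is the cheap end of the spectrum; correcting codes let you keep running, but both are "fail safe" compared to silently using flipped bits.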

> Look at the old landline networks.

As Aloha already described, the POTS network is one of the best examples of defense in depth. Every piece was designed to work around failures. They even guaranteed service for weeks after a major disaster that took out the entire electrical grid (they had backup generators and phones were self-powered by the phone line). The phone network is exactly the type of engineering we need for safety and reliability.

> Look at vending machines.

Vending machines have a simple enough problem that they don't actually need a full stored-program computer. They can be (and have been) electromechanical devices implementing a finite-state automaton. Even if you use software, the necessary state machine is small enough that you can actually prove that the software is correct. Most problems are too complicated.

Also, vending machines fail. They even fail safely; a locked up vending machine doesn't become a hazard to life and limb.

[1] https://www.youtube.com/watch?v=vhiiia1_hC4

[2] https://en.wikipedia.org/wiki/Finite-state_machine
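A vending-machine controller really is small enough to write as an explicit state machine whose entire state space you can enumerate and check. A toy sketch (the price, coin set, and lack of change-making are made-up simplifications):

```python
# Toy vending FSM: the state is cents inserted; the item costs 75 cents.
PRICE = 75
COINS = {5, 10, 25}

def step(state: int, coin: int) -> tuple[int, bool]:
    """Return (new_state, dispensed). Unknown coins are rejected:
    the machine fails safe by leaving its state unchanged."""
    if coin not in COINS:
        return state, False
    state += coin
    if state >= PRICE:
        return 0, True          # dispense and return to idle
    return state, False

s = 0
for c in (25, 25, 25):
    s, dispensed = step(s, c)
assert dispensed and s == 0      # three quarters buys the item
assert step(0, 3) == (0, False)  # bogus coin: rejected, state unchanged
```

Because the state space is just {0, 5, 10, ..., 70}, you can exhaustively test every (state, coin) pair, which is exactly the kind of proof that is out of reach for most software.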

It's impractical by a huge margin.

Why? Well, not only because bit flipping does occur naturally due to radiation but also because there's no such thing as a formally proven hardware stack in the world. At the end of the day your software has to run on hardware that exists. Not only has that hardware not been formally proven, you can't even know exactly how the hardware will behave from one month to the next due to firmware patches (spectre/meltdown mitigations, for example). Those patches can affect behavior and performance by huge amounts.

Additionally, formal proofs don't cover one of the biggest classes of bugs: insufficient specifications. If you formally prove your software to the wrong specification then you still end up with buggy software. So even with provably correct software methodology you still need classical QA processes.

Moreover, it is not necessary to have formally correct algorithms and state machines in order to achieve the maximum quality of software.

> Tell them it's impossible to do better. Microsoft took that route


Citation needed; where and when did they state this?

> Look at vending machines.

Like the one at work which tells me to use exact change only, then accepts more money and still gives me the right change?

You know what, give me a few days to do some research if you're genuinely interested. I heard about it from a co-worker who did work with the early telcos way back when. He's passed away, and I never actually looked it up myself; given his role as my mentor, and that his deep insight into computing history had checked out before, I took him at his word.

I am mildly interested, to the extent that it's any kind of specific, public Microsoft technical claim, rather than a piece of generic marketing hubris, or a rumour of something said in a closed-doors meeting of C-level people once upon a time.

(Re: the marketing disclaimer: e.g. the slogan in the UK "You can't get better than a Kwik-Fit fitter", Kwik-Fit being a high-street chain of budget vehicle repair, I think analogous to something like Jiffy-Lube in the USA. It would be unreasonable for anyone to take this as a factual claim that "no better mechanics exist on the planet"; excellent mechanics are more likely to work on racing car engines than on routine oil and brake changes for near minimum wage, say.)

That all-caps section in any Windows EULA since W95 saying essentially "if it breaks in any way it's all your fault"?


The ones which appear (to a layperson) to match the MIT license "NO WARRANTY" section, or the Mozilla Public License "indemnify all Contributors" and "disclaimer of warranty", and sections 7, 8 of the Apache License 2, and sections 15, 16 of the GPL v3?

If so, there's a lot of people stating "it's impossible to build better software than ours", and while I disagree that's what it communicates, if that is what it means, it's bad-faith to try and single Microsoft out on that count, isn't it?

I thought it was the canonical example - not that they're the only one.

Bit flipping is low-probability, but coupled to massive amount of events it comes out, Fermi-style, to "quite common, actually": http://dinaburg.org/bitsquatting.html
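The bitsquatting idea from that link is easy to reproduce: flip one bit in each byte of a domain name and keep the variants that are still syntactically valid hostnames; register those and wait for bit-flipped requests to arrive. A quick sketch (the domain is just an example):

```python
import string

# Characters allowed in a hostname (plus the label separator).
VALID = set(string.ascii_lowercase + string.digits + "-.")

def bitsquats(domain: str) -> set[str]:
    """All domains exactly one single-bit flip away from `domain`
    that are still made of valid hostname characters."""
    out = set()
    for i, ch in enumerate(domain):
        for bit in range(8):
            c = chr(ord(ch) ^ (1 << bit))
            if c in VALID and c != ch:
                out.add(domain[:i] + c + domain[i + 1:])
    return out

squats = bitsquats("cnn.com")
assert "ann.com" in squats    # 'c' with bit 1 flipped is 'a'
assert "cnn.com" not in squats
```

Each candidate differs from the original by exactly one bit, which is why random memory corruption in a client can silently turn a request for one domain into a request for another.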

> The lesson of the Therac-25 is not writing perfect software

I will guess that the author will agree with you. The goal is a safe solution and that implies better software, better hardware, better procedures, ... anything that you can learn from mistakes to make this not happen again. If you look at plane crashes reports they will highlight software and hardware malfunction but also pilot mistakes (one plane crash happened when the pilots were distracted talking during landing about another plane crash that happened because those pilots were distracted talking during landing).

> The problem with the Therac-25 was NOT the race condition. Complex devices will always have bugs,

That is the point of the article. " ... The reason? A race condition. An unplanned-for coincidence, made deadly by the conscious design decision to remove an old-fashioned safety interlock."

The problem was the race condition. But if you solve that one, another one will pop-up. So what was deadly was the "decision to remove a safety interlock".

We need to identify which one can be solved, and do it. If we can not write perfect software we should add hardware safety. If hardware safety is not enough we need to improve the procedures to handle it. If that is not good enough, we can add more software to help where humans are failing... it is a refining process that you learn by studying past failures.

Since John Stuart Mill, we've known that "cause" rarely refers to a sufficient condition, and that multiple necessary conditions always apply to real-world events (positive or negative.) We generally reserve "cause" for the necessary condition that is easiest to change. I think it's also legitimate to emphasise general solutions over particular ones, however; as you do.

This is the Achilles heel of many pieces of software: lack of provisions to fail gracefully.

IIRC it wasn't a race condition. It was a poor design. It was not a software bug but a hardware bug: it lacked a sensor for the position of the window. I don't think software could make up for that.


Relevant section:

  It is clear from the AECL documentation on the modifications
  that the software allows concurrent access to shared memory,
  that there is no real synchronization aside from data that
  are stored in shared variables, and that the "test" and "set"
  for such variables are not indivisible operations. Race
  conditions resulting from this implementation of multitasking
  played an important part in the accidents.
There's more in Leveson and Turner's paper on the accidents.
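The failure mode the excerpt describes, "test" and "set" on a shared variable that are not indivisible, is the classic check-then-act race. A minimal sketch of the same bug shape and its fix (a counter stands in for the shared flag; names are made up):

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_inc(n):
    # "Test" and "set" as two separate operations: another thread can
    # run between the read and the write, losing updates. This is the
    # class of bug the AECL code had on its shared variables.
    global counter
    for _ in range(n):
        tmp = counter        # test (read)
        counter = tmp + 1    # set (write), not atomic with the read

def safe_inc(n):
    global counter
    for _ in range(n):
        with lock:           # read-modify-write is now indivisible
            counter += 1

threads = [threading.Thread(target=safe_inc, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 40_000  # with unsafe_inc this would often come up short
```

The lock is the software analogue of the hardware interlock: it doesn't make the concurrent accesses go away, it makes the dangerous interleavings impossible.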

Many IT projects are failing because their budgets and deadlines are deliberately set too optimistically.

The reason for this comes from two sides: on the one hand, there is the client (either internal or external) who wants as many features for as little money as possible and who is often not very good at judging whether their demands are realistic. On the other hand, there is the IT organization, which has an incentive to comply with unrealistic targets.

If it is an external client, this unrealistic promise is the advantage over the competition that might get them the project. But I also see this happen regularly with internal projects where there is no competition between implementors. In that case the motivation is usually that IT management wants the consent of higher management to get the project started; they know that once a project has been going for a year or so, it won't be terminated if it's over budget or past deadlines. At the same time, if they had tried to give realistic estimates to higher management, the project might never have been approved.

Construction projects are the same. It seems like every week I read about another construction project that’s far beyond its initial estimate. The difference is that a half-finished construction project is a gaping wound in the landscape, but a half-finished software project is just some code hidden in a corporate server. The incentive to finish construction projects is greater.

> The difference is that a half-finished construction project is a gaping wound in the landscape, but a half-finished software project is just some code hidden in a corporate server.

The difference is that missing deadlines in construction projects come with bigger monetary penalties (citation needed, based on the low penal damages I've personally seen in IT project contracts).

In an IT project, the blame falls on the developer, who is then encouraged to work extra hours for free to make up for their own failure.

IT projects are often something that the organization has never done before. Given that fact, it seems ridiculous that anyone may presuppose how long a thing will take and how much it will cost without ever actually having done the thing.

From my own professional experience, I'd say it's got more to do with the artificial delivery constraints that come with the mandatory annual and quarterly financial reporting of publicly traded companies. They announce their yearly results to the market, and next year's IT budget is broadly defined as a portion of the previous year's profits.

Everything is planned & tracked as a portion of an arbitrary 12-month time frame, & the internal politics mean each department is competing for funds on a "use-it-or-lose-it" basis. If you've never sat in on department/geography/company-wide governance forums where they agree next year's "change management" budget, the basic format is: you present a business case & cost forecast based on FTEs over the expected duration of the project, all the mandatory regulatory compliance projects get approved automatically, and the remainder of the budget is set based on how well each middle-manager can defend the interests of their department. They have a final number for total approved spend & approve as many projects as will fit into it, Tetris-style, with a flat 20% contingency spread across everything to account for the fact that, historically, trying to plan this way is so inaccurate that it's basically a series of pseudo-random guesses. It's worth noting that even stuff badged as "Agile" is funded & budgeted for in this way.

Every programme I've been involved with has had some variation on "we can't start work on X in that month, because it won't finish until February and will cannibalise part of my budget for next year".

Unfortunately, given the economic & political status quo, this problem is impossible to solve (though for the libertarians reading this, it's an interesting example of the 'higher order' unintended consequences of sincerely designed industry regulation).

The reason that software projects deliver so poorly in the corporate world is that senior executives and investors understand them so poorly that a large amount of recognition and attention goes to those who can "bullshit" rather than fix the problem.

This is particularly toxic, because often the senior executives and investors also have an interest in hiding the failure (avoiding personal accountability and avoiding write-downs on investment).

It is very hard to see how a culture of true reflection and learning can emerge in this environment.

I think the biggest problem is getting the requirements right (as much as possible) prior to the actual coding.

Many managers seem to think that they can easily change the requirements in the middle of the project.

Requirements, infrastructure components, platform systems and promises from external vendors that "it'll totally be ready by the time you launch".

This is all things I've seen in the last month (yeah I'm not joining any more greenfield projects within organizations - if you think you're going to rebuild it from scratch and fix all the problems, you have no idea what went wrong in the first place).

> Many managers seem to think that they can easily change the requirements in the middle of the project.

"But of course I can change anything I want, whenever I want, and you all are supposed to handle it. That's why we use Agile!"

It might even be a myth that (without an effort that matches or exceeds the implementation itself) you can "get the requirements right", i.e. that there is a point in time when the requirements are complete, fully understood, and thoroughly specified.

Given the enormous scope of some of these projects I wonder if a prototype phase would help. Humans seem to have a hard time with conceptualizing and a slow, barely/non working and unscalable prototype might be worth the added cost/time. If it’s part of the requirement gathering process with explicit expectations set that it’s a prototype would it facilitate better requirements input? Naturally you’d have some people making assumptions about its fitness as a final product or MVP but (if a UI is involved) giant bright warning banners plastered all over might help with that.

The default state of a software prototype is to be shipped into production.

You're right, it's hard to correctly spec out a software project. You can't build a house without the plans.

Another big factor is testing/QA. If your customer service department isn't going to thoroughly test your new program, and nobody holds them accountable, who gets blamed on launch day when it doesn't work right? IT!

You can build a simple house without formal plans, most especially if architect, builder, and owner/occupant are the same.

Divide roles, or increase complexity of the project, or its relationships with the surrounding infrastructure, and planning or standardisation requirements escalate quickly.

Yes, you can build a simple house like that. But what if halfway through the construction, the client says they want the bathroom on the other side of the house? You'd have to redo (e.g.) all the plumbing from the ground up, location of the heating system, walls, doors, perhaps even staircases ... etc. etc.

Note that amongst my qualifiers was unity of client and builder-architect.

You'll find plenty of examples of people on DIY or lone-inventor projects saying "we tried X, it didn't work, so we did Y instead". That's the unity-actor case of your hypothetical.

Dividing executive-agent roles is a complexity dimension.

Dividing executive (decision), design, expert (engineering), sourcing, external approval, interface, and labour/craftsman roles adds multiple additional complexity dimensions. And this doesn't even include the technical dimensions of the project and its infrastructure interfaces / interconnects.

Complexity is interaction between components and entities.

Yeah, but behind your "as much as possible" statement hides the whole software engineering profession. Getting the requirements right can sometimes paralyze a team, making it lose capabilities or momentum, or even driving attrition. There's no single PM-methodology solution to avoid having to deal with the tradeoffs of the agile vs. waterfall dichotomy.

Behind any popular or profitable, successful software project there is sometimes as much "PMing" practice as there is coding.

A form of Gresham's Law, whose generalised form is that equivalent current-cycle benefits outweigh non-apparent long term net advantage (benefit less cost). Or even more generally: a complexity constraint.

Risk is a non-apparent long-term cost, and unless specifically manifested (insurance, regulations, fines), it will be ignored.

The results of a scholarly search for "a gresham's law of" are illuminating, spanning divorce, neighbourhoods, academic admissions, environmental regulations, incorporation, shipping, management, court citations, television programming, politics, and more.


That's not the problem; looking at it as a cost center is. It all boils down to not enough growth.

It's seen as a cost centre because no one can manage it in any other way. The consultants have found one way to provide some results sometimes, but aren't motivated or able to do better.

What might alternatives be?

How about treating it as capital with a risk-return element, say?

omg in my dreams: then it would be an asset on the balance sheet and it would be a shareholder value increasing move to improve it. CIO becomes = CFO.

The big software projects I’ve seen go bad don’t feel analogous to the bridge failure mentioned in the article. It’s more often poor project management, unclear requirements, and sub-par communication rather than a specific engineering failure.

Despite the lack of public post-mortems, poor project management seems to be widely recognized as a problem. But there isn’t a clear cut solution. Agile promised to save us all but seems to be implemented poorly more often than not.

Agile (of the scrummy, we just meet sprint goals that we set type) is not a project management solution, it is saying “well this seems hard, so we’re not going to do it”.

Project management is happily alive. Thinking Agile in some way solves project management is insane.

Plan out your system, estimate the time to build it (you don’t even need good estimates), execute ruthless change control. It’s not hard, just takes discipline. Ruthless change control is the hard part. That doesn’t mean saying no, it means saying “if you change things, it costs you schedule days”.

If you want a clear cut system, iDesign has some good classes. Imo at least.

> “if you change things, it costs you schedule days”.

When I worked for Philips as a test hardware design engineer in the old Mullard Radio Valve plant in the late 70s and early 80s (a major semiconductor plant despite its name) we were even stricter than that. Every project started with a rough estimate for the total cost, we then spent 10 percent of that on a combination of prototypes and a better estimate of the time and cost and refinement of the requirements. The customer department would then decide whether to proceed. At that point the requirements were frozen. Our department was then committed to the project schedule and price and the customer was committed to the requirements. Any changes would require exactly the same procedure as before: rough estimate, 10 percent preliminary study, better requirements and plan, commitment, development, delivery.

Part of the purpose of this was that at the end of the ten percent phase the customer could take his requirements to an external supplier (British Aerospace for instance) and get a quote from them instead and possibly actually give them the job.

What is “ruthless change control”? If it means placing realistic constraints on how sales or customer-relations teams can alter deadlines, priorities, or scope of work, then it simply does not exist in corporate IT and is intrinsically disallowed by the inborn organizational structure of modern companies.

Once you finished planning out your system the requirements have likely changed. If they didn't you'll find out that you built something different from what the customer wanted once you're finished programming.

I don't think you're necessarily disagreeing with the parent post.

> Ruthless change control is the hard part. That doesn’t mean saying no, it means saying “if you change things, it costs you schedule days”.

Just because software is dominated by soft costs doesn't mean that it's cheap to change requirements. That doesn't mean you can't deviate from your initial spec; it just means you have to charge the customer for changes that they want after you've already started development. Throwing away work just because it's "easy" to change software doesn't magically recover the time and money that you've already spent up to that point.

What industry are you in? I have never seen project management done well in enterprise companies.

I'm not a project manager, and I most certainly haven't been involved in huge IT projects.

That said, from my POV it seems a lot of software related IT project failures is often correlated to two factors:

- Doing too much at once. Like replacing 6 different existing specialized systems with a single new one.

- Unwillingness to change the business procedures/workflow to cater to software.

The lure of the single do-it-all system seems strong with certain people. But at least in my experience, one could draw from software engineering, where good software is written as separate modules with well-defined interfaces at the boundaries. If you have multiple systems with good interfaces for data exchange, it's much easier to specialize where needed, and to replace outdated or broken pieces.

The unwillingness to adjust the business procedures/workflow to software needs is a huge one. Complex software is fragile. By having complex rules in the business procedures you force the software to be more complex, thus invariably making the software more fragile. If business procedures were changed to be software friendly before the software is written/adapted, the software can be simpler and thus hopefully less fragile.

It daunts me how much software is getting unreliable, but trying to shame people to hold them accountable is naive.

The root of the problem is the uncontrolled complexity of modern software products.

Because of this complexity responsibilities are diluted, most of your code is in your dependencies nowadays.

If you write a casual library, are you responsible if it is flawed and used in a critical operation? Can dependencies always be carefully audited?

> The root of the problem is the uncontrolled complexity of modern software products.

> Because of this complexity responsibilities are diluted, most of your code is in your dependencies nowadays.

That doesn't resonate much with me. Yes maybe in terms of LOC most of our code is in dependencies. But most of our _important_ code is our own, in the business logic.

The reason our product is unreliable is mostly down to complexity as you note, but that complexity is driven by our users who want better integration and automation. Add one more optional behavior ("when X I need to do this tedious task Y, could your program do this for me?") and you've increased the testing surface exponentially.

On the other hand, those additions are what set us apart from our competition, and what make the eyes pop when we show off our product to potential new clients.

So it's a balancing act. The complexity makes our users extremely productive[1], but it also makes our software more fragile[2].

[1]: For example, one client went from almost an entire day of manual data entry for certain orders to less than half an hour due to a "smart" Excel importer I wrote.

[2]: Users now want the Excel importer to behave in conflicting ways, and trying to cater to that without breaking one of the use-cases is very difficult.

It daunts me too, and reminds me of the early-days of commercial flight. I hope we'll have a similar "we cannot continue like this moment".

However, I don't think shaming is what this is about though. To me it seems the objective is to learn from mistakes, and for that we need to be honest about what happened, and it is going to be pretty hard to be honest if we tip-toe around who did what and why.

I also agree that complexity is a problem. But I don't think acknowledging this gives us any path forwards. I don't think going back to the 'good old days' is going to be a solution. I therefore see this learning process as helping us figure out how to move forwards, and as providing a motivation to the industry as a whole. It is this industry-wide motivation that will be needed to address some of the systemic complexity issues.

I don't think this would be enough on its own (and implementation is a whole other question), but I think it could be a step in the right direction.

> It daunts me too, and reminds me of the early-days of commercial flight. I hope we'll have a similar "we cannot continue like this moment".

The first citation of the words "Software Crisis" meaning the inherent difficulty of writing high-quality software in a predictable way was from a NATO conference fifty years ago: https://en.wikipedia.org/wiki/Software_crisis

It is taking a long time for good practices to be discovered and win-out, and even when obvious improvements have been made, they're not necessarily used effectively.

I suspect a large part of the reason why the software industry isn't maturing at the same speed that other industries have had to, is that in software, failure is much easier to hide.

> It daunts me how much software is getting unreliable, but trying to shame people to hold them accountable is naive.

> The root of the problem is the uncontrolled complexity of modern software products.

I think there's a feedback loop between those two things, especially when it comes to government or giant corporation projects. Lack of accountability causes accidental complexity, which in turn causes a lack of accountability.

It starts with an organisation that lacks tech leadership hiring a consultancy, which then treats the project as a "flagship engagement", which means trying to make everything perfect, where perfect is measured by the number of future sales pitches that will cite this one project.

As a result, there's a gap between what the organisation needs and what it gets, which adds to the amount of work required and complexity to navigate, whenever changes are required, and the overall complexity snowballs from there.

Most of the above is business-as-usual for most very expensive projects. The real danger-zone is when you get to the third iteration, six or seven years down the line, and you're forced to re-hire the first consultancy again because they're the only one with the resources to take it on; but the tech-world has moved on, so they see you as a "modernisation engagement". They simultaneously can't criticise their own bad decisions from several years prior, but at the same time they want the wider-industry to see their "transformative" power, so can't merely iterate on what's already there either.

That's how you end up with iOS apps, talking to Ruby-on-Rails APIs (which used to be the primary web-app, before that was replaced with a React frontend), reading and writing from an Oracle database which is also updated with a series of batch jobs dating back to early 2000s Java EE.

The "coal face" developers in all these situations have done the best work to their ability, and quite often achieved minor miracles in stability given the underlying complexity. The problem is always a management (or lack-of management) problem.

What makes you think dependencies are relevant to a discussion of IT project failures? Many organizations can't even manage to write the first-party code to any approximation of the requirements without going years and hundreds of millions over-budget. Getting tripped up by subtle bugs in dependencies would be a fantastic state of affairs compared to today.

We have the cycles to burn, which causes waste to approximately eat up all excess cycles. This has been observed since the 80's, so there is not much new here. Computer programs tend to fill up all available memory, use all available cycles and eat up all available storage no matter how far Moore's law has helped us come.

OK - I think we can all agree that it is hard - but still we need to do something about it. The question is who is in the best position to improve it. Customers will not put any meaningful pressure because they are too ignorant about what is responsible - it is only the programmers who have any knowledge about the source of the problems and they need to be incentivized.

>It daunts me how much software is getting unreliable, but trying to shame people to hold them accountable is naive.

Has it been getting more unreliable? Software is being developed, bugs are getting fixed, new use-cases are emerging faster than anytime before.

If it seems like software is getting less reliable, maybe it's just that we're relying on it more and more.

I do think software has got less reliable in the sense that unclear random errors, freezes, and data loss are more common. I think this is mostly the result of a (correct) choice to add more value overall by prioritising features over reliability.

I'm not sure that is the case; twenty-five years ago I'd be lucky if my desktop went a whole day without crashing and needing a reboot. That is quite rare now.

Twenty years ago a banking IT failure would have been considered catastrophic and unacceptable.

Now there are serious meltdowns and failures every year.

Meanwhile, when was the last time you talked to a customer services rep of any large org without being told "Sorry - our systems are really slow today"?

This is really a sign that big non-tech-centered orgs cannot get the best talent nowadays. Lots of banks pay under 2/3 of the real asking price for a great programmer.

"really slow" isnt catastrophic.

A lot of the failures I experience are born from trying to solve business process problems with digitization, or digitization without ever asking if it’s the right thing/way to do stuff. Another common problem is focusing too much on a particular set of business processes and forgetting that every IT system is part of a package of numerous IT systems that work together.

I live in one of the most digitized countries in the world. So we’ve naturally digitized payment for public transportation. When we did it, nobody questioned the taxation system, even though it was made in the '70s and built around a public structure called “amter” that hadn’t actually existed for many years when the system was built. We had also gone from 271 municipalities to 98, and their borders were part of the taxation too. So the taxation rules frankly didn’t make any sense and were needlessly complicated, yet they were digitized as-is. Naturally it was a disaster - it was even predicted by the technical team and the project leads - but nobody wanted to touch the taxation politically. It got fixed eventually, but it could have been several hundreds of millions of Danish kroner cheaper if they had simply redone the taxation models for ticket prices before the digitization.

So that’s one mistake, and a common one, in both the public and private sectors. The other common disaster is building systems for specific processes without looking at the bigger picture. Like a case-working system that handles the welfare process for people who are sick. Except you forget that those citizens sometimes don’t go through official communication channels, and maybe send a letter or an email to the wrong department, so you need to be able to add those documents to their digital case file. But that’s not possible, and neither is sending a notice to other systems in other departments which also deal with the same citizen. I’m guessing this last issue is bigger in the public sector than in the private, because we often buy our software from companies that have very little actual domain knowledge outside of what their direct customers tell them, and the case workers they use for knowledge very often lack insight into the greater architecture of running 350+ IT systems together, because they work with maybe 5 of them.

I mean, these things aren’t as deadly as the x-ray machine, but they’ve been happening for the better part of 25 years and nobody seems to have really learnt anything.

Oh, we have learned all those things as developers. It just seems that none of the decision makers have gotten any hint.

I’m honestly not sure why that is. I’m hesitant to ascribe it only to incompetence, because not everyone can be incompetent - but maybe we only hear about the failed projects with bad decisions.

Why can't everyone working in a given field be incompetent? When Dr. Semmelweis discovered that surgeons washing their hands drastically reduced patient mortality, he was dismissed and ridiculed. I would say that 100% of those surgeons were objectively bad at their jobs and therefore "incompetent" in a narrow sense. Groupthink can cause everyone in a given field to converge on the same orthodox belief, and if that belief is wrong or dangerous, shouldn't they all be considered incompetent? Even today there are many pseudo-scientific fields where literally none of the practitioners are objectively able to accomplish what they claim... and it's not at all obvious that project management is not among them.

I wondered why an obscure post of mine was suddenly popular.

A clarification: the Therac-25 had an unfortunate race condition; what made it deadly was the conscious decision by the designers to REMOVE the physical safety interlock. They didn't consider modes of failure. The post says exactly this. Always consider modes of failure - you never know when some "other guy" is going to naively count on your work being 100% reliable. It's a system, not a goal, as I like to remind people.
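To make the fail-safe point concrete, here is a minimal sketch of an independent interlock check (every name here is hypothetical - this is not from the actual Therac-25 code, just an illustration of the pattern): the beam is permitted only when independently sensed hardware state agrees with the commanded mode, and any mismatch or unknown value fails safe, i.e. beam off.

```python
from dataclasses import dataclass

@dataclass
class MachineState:
    """Hypothetical snapshot of commanded mode plus independently sensed state."""
    mode: str          # commanded mode: "xray" or "electron"
    turntable: str     # position reported by an independent sensor
    dose_rate: float   # requested dose rate (arbitrary units)

def beam_permitted(state: MachineState) -> bool:
    """Independent precondition check: return True only when every sensed
    value agrees with the commanded mode. Anything unexpected fails SAFE."""
    expected_turntable = {"xray": "target_in", "electron": "target_out"}
    if state.mode not in expected_turntable:
        return False                               # unknown mode: beam off
    if state.turntable != expected_turntable[state.mode]:
        return False                               # race left hardware mid-move
    if state.mode == "xray" and state.dose_rate > 1.0:
        return False                               # hypothetical dose ceiling
    return True
```

The important property is the default: the function cannot return True by accident; every branch that sees something unexplained denies the beam, which is what a hardware interlock gives you for free.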

Some of you might enjoy some of my other stuff, particularly on security: https://blog.eutopian.io/winning-systems--security-practitio...

The Tay Bridge disaster was important because:

1) Before it, we had several bridge failures in the UK.

2) After it we had almost none at all. Ever.

3) The report into the disaster was responsible for this improvement. It uncovered problems with: the design, the metal used, the way it was assembled, the maintenance regime, the project management, and the personal relationships and personalities of the people involved.

I'd lay money on the cause of the recent tragic bridge collapse in Italy being one of those already cited in that 140 year old report. It's all there.

Back to our own world...

When major IT projects fail, there is almost never a public enquiry, even when those failures are government projects, and even when they cost hundreds of millions of dollars/pounds. These failures are repeated regularly in government, and daily in the private sector.

Many of us who have been around a while have a (probably pretty good) understanding for why they fail, yet the lessons are not learned and there is little sign we are getting any better at all at not-failing. I suspect a bit of exposure to downside risk, or "skin in the game" as Taleb would call it, might improve things. Sometimes the medicine is hard to take.

There's this book by Peter DeGrace and Leslie Hulet Stahl called Wicked Problems, Righteous Solutions. It describes all of these problems and others. It presents a number of very practical and proven solutions.

The book was published 28 years ago, in 1990.

We use words like science and engineering in conjunction with others like computer, programming, and software. And yet there's nothing scientific about how we don't learn from mistakes already made decades ago. And how we keep reinventing "engineering" best practice and call them new names.

You've probably heard the line "complexity is the enemy", or perhaps even its full form: "complexity is the enemy of reliability".

You may not know its provenance: The Economist, Volume 186, January, 1958, or 60 years ago:


I've been trying unsuccessfully to secure a copy of this article for some years. PDF preferred, dredmorbius<at>protonmail<dot>com if anyone should have access.

If you're willing to pay $100 to secure a copy, you can probably read it in the Economist Historical Archive.



My research trove exceeds 10k items. I don't have a $100/item, or even source, budget.

Can I ask where you heard of that book? I was only made aware of the idea of wicked problems by my favorite business professor in university, and I have never heard anyone else mention it.

Can't remember exactly, but suspect it was through a friend at Microsoft. I got my first copy in '91 or '92.

> If you find yourself on a failing project, squandering tens of millions of pounds and hundreds of man-years of talent, pause for a moment. Think about the fact that almost 140 years ago, civil engineers stopped building bridges that fell down. They stopped building them because the failure of one bridge was laid bare so publicly.

> Think about the fact...

You can think all you want, but it's unlikely to do anybody any good. Sometimes the fault for a failed project lies squarely with the engineers, but this is not at all the usual case. The people who are most responsible for failed software projects is management, and not just engineering management, but the people who engineering management reports to.

And the biggest problem management has is not simply lack of understanding of the nature of software development projects, but, often, a profound lack of interest in learning.

I don't know what to do about that.

Kind of weird that they used a novelty fake newspaper front page to illustrate this. The Scottish Scribe is a book of mocked-up newspaper front pages attempting to show how a modern tabloid might have dealt with historic events.

> I count £20b in failed IT projects over the last decade alone.

It's hard to grasp the sheer scale of government. This article does a good job of juxtaposition in the case of the magnitude of engineering failures, but I want to add on that $20 billion is chump change when it comes to waste. The military sector alone plowed through $700 billion last year to accomplish the task of robo-killing brown people. The entire federal budget was $2 trillion. Stop and think about those numbers for a bit.

There are 2.8 million civil servants in the US, and 2 million military personnel. $2T divided by 4.8 million means every single government official is responsible for roughly $417,000 of your tax dollars. This includes postmen and every boot-camp trainee.

Obviously only a fraction of these people are making decisions. So you can add zeroes to that number when you want to consider how much power the actual decision makers have. These decision makers are human, and humans are wont to see themselves as kings of their domain, and what is a king's job but to squander money squabbling over fiefdoms.

The sheer, mind-boggling scale of systems of government, all of them, from your homeowner's association to your neighborhood council to your city government to your state government to the national government to international governmental organizations like NATO and the UN, isn't even the most interesting aspect to consider here.

A more amazing thing to think about is how they manage to get anything done at all. But that's not even the biggest thing.

The biggest thing is that there is nothing new about this state of affairs. Civilization was built like this, thousands of years ago.

It's an admirable goal to want to get rid of waste in government. But that's an untamable firehose. You won't even get laughed at for a proposal to save $20b of tax money. They will look at you, decide whether you're going to look good on TV, maybe put you up in front of a camera if you're really really really really lucky, and everything you spent your whole life learning to finally try to do will get swept into a political capital generating exercise for a local politician. Thanks, try again next life.

Governmental cruelty knows no bounds.

There is currently a huge problem with the Bulgarian Electronic Trade Register [0]. The register stopped working two weeks ago and is still not online [1].

It holds all company ownership data and a lot more. Right now there is no way to register a company in my country, or to make any changes to existing companies (e.g. changing managers, shareholders, etc.). It is one of the most important sets of data for an EU country.

The original explanation (leaked by the government) is that 4 of the RAID 5 disks broke down, but it is still a mystery why recovering the data is taking more than two weeks.

0 : http://brra.bg

1 : https://bivol.bg/en/classified-information-and-human-error-c...

It’s amazing that 4 RAID 5 disks could break down in the first place. Apparently they had triple redundancy and STILL managed to make it fail.

Though I bet it was just nobody ever checked if the disks were still working.

What didn't help was that the backups were stored on the same LUN as the production database.
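A trivial guard against exactly this class of mistake - sketched here for a POSIX system, as an illustration only - is to refuse to trust a backup path that shares a block device with the data it protects:

```python
import os

def same_device(path_a: str, path_b: str) -> bool:
    """True if the two paths reside on the same block device (st_dev),
    i.e. a single disk/LUN failure could take out both at once."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

def backup_is_isolated(data_path: str, backup_path: str) -> bool:
    """A backup on the same device as the data it protects is not a backup."""
    return not same_device(data_path, backup_path)
```

Note the limitation: `st_dev` distinguishes filesystems, not underlying LUNs, so two filesystems carved from the same LUN would still pass - this only catches the grossest case, and real isolation means a separate machine (or building).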


There is a reason my backups are held on a separate machine in a separate building (and also on large external encrypted drives that leave site every day).

Sometimes you learn these the hard way.

I once read on Hacker News that in the future, C-level people will all need to be very strong in IT, as any company or organisation nowadays is so reliant on IT.

It's also the message of the Phoenix Project book, which I did like.

The problem I have noticed is that although management does understand their businesses, it's easy to bullshit them when it comes to IT. And they let it happen because they are not into IT and they don't grok it. They would never let other projects - ones they totally understand - be treated the same way.

I especially notice that as I read what higher management layers write about projects or effort, it's high on fluffy 'visionary' words but low on actual actionable vision that would help me to make every day decisions on what to prioritise.

I believe that the simple reason why IT projects fail is because of very mundane basic things.

But those are not sexy to write about. To me it's all about:

- Why are we doing this?

- How would you define success and failure for this project?

- Who is responsible for what / who is the contact person?

- How do we work with each other (spelled out in detail)?

- What are the guiding principles?

- How do we assure quality?

- How do we assure timely delivery?

- What stuff do we need: gear, licenses, etc.?

- P R E P A R A T I O N - do your homework; investigate things before you make choices.

I can go on and on. And it may bore you. But I think there is actually no true complexity involved in all those failing projects.

There is not something really special to IT projects. I wonder if we do pretend there is something special to them because we ourselves want to feel important in some sense.

I don’t disagree with you at all, however we (in IT) also bear responsibility, primarily in the following areas:

1. IT doesn't really understand business. Often we think we do - more so than the business does - and that hubris often blinds us to the real issues that the business needs to address. This can be resolved through use of boring old enterprise architecture (real E, not just tech E) and business analysis. Actually making the effort to understand. Sadly most EAs and BAs aren’t. These are specific skills and they relate primarily to people and process rather than technology.

2. Technology is largely overrated. I know that is likely to be an unpopular opinion. Sorry, not sorry. I am an Enterprise Architect by training and by trade. I’ve worked on several multi-hundred-million-dollar programs. I can promise you that in my experience, had every single tech decision been flipped or changed, the difference in outcome would have been +/- 10% at most. Projects are not won and lost on technology. Completely missing or misunderstanding requirements, miscommunication, poor program financial management, overestimating the business’ capacity for change - these (and more) are the things that are more often than not likely to make the difference between success and failure.

3. You are not Google. Or Apple. Or Spotify. Or Amazon. Unless you are one of those companies. But if you’re an energy company or a financial services company - then no. Just no. Your business is largely conservative, managed (ideally) by risk-averse managers, invested in by people who want a certain return. Your industry is highly regulated and there are things you have to do that you don’t control, and you do them whether the timing is good or not, at the expense of things that you do want to do. So stop fucking kidding yourselves, realise what you actually ARE, and cherry-pick & adapt those things that are likely to work for you.

It’s not like we collectively don’t know this stuff. Let’s stop drinking our own kool -aid. And for those of you that do work for cool companies or startup disrupters I’m really happy for you. For the rest of us, technology is not the centre of the universe. Appreciating that difference is important.

And yes, I know I’ve missed valid arguments and swung the pendulum a bit far the other way. It’s deliberate. We are in danger of disappearing into our own navels.

My last bank was a staid, conservative, heavily regulated business that lost my custom because they were so poor at technology.

My electricity supplier is close to losing my business in part because their web site won't accept my meter readings.

I'll be basing my decision on which car to buy my wife at least partly on whether it works with my iPhone. If it doesn't, then you won't be on the list.

And on and on. We increasingly interact with businesses through technology, if they can't get that right then they are going to suffer. They can't get it right unless they take it more seriously right up to board level.

“Conservative” in this context doesn’t mean how they interact with customers but how they run their business (of which customer interaction is a part). The tech you are talking about is the visible bit of the iceberg. Work in the back office for any length of time and you’ll know what I’m talking about.

There is some work on this: https://spectrum.ieee.org/static/lessons-from-a-decade-of-it...

But... there are three key problems.

1) The time scales are long - in my experience big project failures are on a >5-year time scale (because - big). I think proper studies will need to run >10 years, and that's a big ask for any academic or team.

2) The costs are borne by one set of stakeholders (the current IT team) while the benefits accrue to another (the next one). Why invest to help your successor? No one is going to thank you; you will likely be sacked faster! There is no board-level education or knowledge about this. The only source of information that could convince boards that this is the right thing to do would be McKinsey/Bain/BCG, and those &&^^"! will never, ever say it, because it's the right thing to do and they are evil. (Prove me wrong!)

3) What do you measure? The field is immature: it's not clear what the right inputs to check are, or what the right way to estimate the outputs is. So we need to do a lot of work now to set up the definitive studies.

I have an anecdote: there is a thing called the FEAST hypothesis http://users.ece.utexas.edu/~perry/work/papers/feast1.pdf I was a user of one of the studied systems, and I was curious about the study. I discovered that it hypothesised that development of big systems slowed as they got more complex, and the data from the system I used was one of the points that confirmed this. I examined change-control documents and discovered that the development of said system *had* slowed before the end of the study, but then it had reaccelerated: a whole load of "robots" had been implemented by business units consuming the system, and these had not been reported in the FEAST study (IT was largely unaware). The robots started causing problems, policy changed, they were insourced, and on-platform development took off.

We need

- a 5-year major international project to develop the art to support this

- legislation that mandates that system-development information is stored up front and in a shared place

- legislation that mandates regular reviews that determine certain information, signed off by an engineer

- a 20-year massive project to use the above information

I am not optimistic. We can't even prove that XYZ is better than agile.

There are many problems with software projects, but a fundamental one often not raised is that it is hard to say "this bridge will be built using this quality of steel and this much effort and time" when there are ten other companies, all looking just as convincing from the outside, saying we will do it in half the time for half the cost.

I am not sure I have many answers. But having a genuine profession that is required by law to sign off on any life-critical software seems a sensible starting point.

> Think about the fact that almost 140 years ago, civil engineers stopped building bridges that fell down. They stopped building them because the failure of one bridge was laid bare so publicly.

Yes, but how many bridge projects failed in the last 140 years because of cost overruns/missed deadlines, which is a more direct analogy for most of the arguments in TFA?

And I'm guessing we're just talking about the UK since earthquakes have taken down a bridge or two in my lifetime...

Exceptionally good point. Confounding the failure of a project to deliver with the catastrophic public failure of that project's deliverables is a truly extraordinary category error.

I was reading just recently of a failed megaproject, the Nicaraguan Canal. Forecast costs range from $40 - 100 billion, though I cannot find a report of actual expenditures.


Contrast this with actual engineering failures, such as Fukushima or Banqiao. This is an apple-juicer to oranges comparison.



Poor management doesn't always stem from ignorance or a lack of understanding of how software works. Sometimes it emanates from pressure to deliver within very tight timelines for the sake of business survival or standing up to competition in the market.

I have seen the best managers giving in to ridiculous deadlines at project onset just because they know that there is no other option.

In this post, Bertrand Meyer made a similar claim, taking the airplane industry as a reference.

"When Will We Learn? Every major software incident requires a thorough and public analysis."


> Think about the fact that almost 140 years ago, civil engineers stopped building bridges that fell down.

It was written last year, but it looks like a weird sentence this year:



It's not even that unusual. In https://en.wikipedia.org/wiki/List_of_bridge_failures#2000–p... I counted something like 150 bridge collapses since 2000.

Back then (to quote another comment here): "a quarter of all the bridges of any type built in the U.S. in the 1870's collapsed within ten years of their construction."

Metallurgy was primitive, and there were no x-rays available to find hidden cracks formed during the manufacturing process.

The book he wants has been written. It's called The Mythical Man-Month, by Fred Brooks. It does a post-mortem on what went wrong (and what went well) in the development of an IBM OS.

I think today it mostly contains stuff everybody knows - precisely because it had a lot of impact on our profession. Not on the coding part, but very deeply on the management part.

A (justifiably dead) comment mentions the, erm, case of the FBI's Virtual Case File, a $170m project killed in 2007.




This makes me think about the poor woman who was killed by a "self driving car".

If the software had acted as expected she would still be alive. If self-driving cars become popular, coding mistakes will kill more people.

(Yes, they had wilfully disengaged the built-in automatic braking feature of the car in order to allow their software to control it, and the fact that the human safety engineer riding in the car was not paying attention to the road (also because the safety engineer blindly trusted the software running the car) were factors as well.)

> also because the safety engineer blindly trusted the software running the car

My understanding is the opposite. Uber's software was generating a lot of false braking events, so they set it up so it wasn't controlling the brakes. Drivers were trying to gather evidence about the triggers for these false events. That created this perverse situation where the software correctly identified that it should have braked, but the only action it could take was to raise an alert, distracting the driver at a critical moment.

Just as important as failing safe is not failing silently.

That's another key lesson from the Therac-25 - it failed mostly silently. It displayed a strange message that didn't make any sense, and there was no obvious detection of a failure.

Software need not be bug-free - for example, there is a reason we still include hardware watchdogs on embedded devices, and it's largely because the watchdog is cheaper than bug-free software and will provide the same quality of service.
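The watchdog pattern mentioned above can be sketched in a few lines (a software simulation for illustration - a real watchdog is a hardware countdown that runs independently of the CPU): the main loop must "pet" the watchdog within a deadline, and missing the deadline forces a reset instead of a silent hang.

```python
import time

class Watchdog:
    """Software simulation of a hardware watchdog timer."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_pet = time.monotonic()
        self.resets = 0

    def pet(self) -> None:
        """Called by the main loop each iteration to prove it is alive."""
        self.last_pet = time.monotonic()

    def check(self) -> bool:
        """In real hardware this countdown runs outside the CPU's control.
        Returns True (and counts a 'reset') if the deadline was missed."""
        if time.monotonic() - self.last_pet > self.timeout_s:
            self.resets += 1              # stand-in for pulling the reset line
            self.last_pet = time.monotonic()
            return True
        return False
```

The point of the pattern is exactly the comment's: the watchdog doesn't prevent the bug, it converts a silent hang into a loud, recoverable event.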

The biggest errors I've seen are where people buy COTS for their core business - which is usually a mistake if IT is a main driver for your business in some special circumstance.

The same goes if you do go COTS but don't change the business to fit the product. Trying to make COTS work your way is always so, so, so bad.

COTS = Commercial Off The Shelf software.

It depends on the phase your business is in. For the beginning using standardized stuff that you buy ready made can be a real time saver, and time is usually in short supply. Once you achieve a certain scale and you can afford to do it you can usually save substantially and scale up further by doing something more customized.

Starting off with a complete custom set-up for your core is another opportunity for premature optimization to creep in.

I've seen this happen where a mature business starts to functionally decompose their business down to what they do and how they deliver. They then go out and buy products that do those things and link them in a chain with a DB. But each one of those products is short about 10-20% of all the used-to-haves and nice-to-haves.

So what then dawns on the business is they realize that the missing 10-20% was part of the business that was really important and they have dropped serious money on a bunch of products. And really all they needed to do was better understand themselves and build their own business infrastructure.

But what you are saying about speed definitely rings true. It's important to note, though, that IT failures that happen to new businesses are more or less written off as total business failures, usually resulting in the business going to the wall.

A bridge falling under a train and a non-delivered project don't really have much in common. Major engineering projects keep being delivered very late or scrapped altogether, not that software is altogether different in this regard.

Using the collapse of an Italian bridge as an example of the kind of disasters that happened in the past is somewhat unfortunate, although the author couldn't have known.

We are seeing bridge collapses again. Genoa last week. Florida last month.

When I see Therac it reminds me of Theranos - another medical device that went into production with serious issues.

It is unfair to the makers of the Therac to compare them with Theranos.

This is not the first time bridges have been used in comparison with software development.

From Programming Pearls, Section 7.3 [Safety Factors], by Dr. Jon Bentley, which reproduces Vic Vyssotsky's advice from a talk he has given on several occasions.

"Most of you'', says Vyssotsky, "probably recall pictures of `Galloping Gertie', the Tacoma Narrows Bridge which tore itself apart in a windstorm in 1940. Well, suspension bridges had been ripping themselves apart that way for eighty years or so before Galloping Gertie. It's an aerodynamic lift phenomenon, and to do a proper engineering calculation of the forces, which involve drastic nonlinearities, you have to use the mathematics and concepts of Kolmogorov to model the eddy spectrum. Nobody really knew how to do this correctly in detail until the 1950's or thereabouts. So, why hasn't the Brooklyn Bridge torn itself apart, like Galloping Gertie?

"It's because John Roebling had sense enough to know what he didn't know. His notes and letters on the design of the Brooklyn Bridge still exist, and they are a fascinating example of a good engineer recognizing the limits of his knowledge. He knew about aerodynamic lift on suspension bridges; he had watched it. And he knew he didn't know enough to model it. So he designed the stiffness of the truss on the Brooklyn Bridge roadway to be six times what a normal calculation based on known static and dynamic loads would have called for. And, he specified a network of diagonal stays running down to the roadway, to stiffen the entire bridge structure. Go look at those sometime; they're almost unique.

"When Roebling was asked whether his proposed bridge wouldn't collapse like so many others, he said, `No, because I designed it six times as strong as it needs to be, to prevent that from happening.'

"Roebling was a good engineer, and he built a good bridge, by employing a huge safety factor to compensate for his ignorance. Do we do that? I submit to you that in calculating performance of our real-time software systems we ought to derate them by a factor of two, or four, or six, to compensate for our ignorance. In making reliability/availability commitments, we ought to stay back from the objectives we think we can meet by a factor of ten, to compensate for our ignorance. In estimating size and cost and schedule, we should be conservative by a factor of two or four to compensate for our ignorance. We should design the way John Roebling did, and not the way his contemporaries did -- so far as I know, none of the suspension bridges built by Roebling's contemporaries in the United States still stands, and a quarter of all the bridges of any type built in the U.S. in the 1870's collapsed within ten years of their construction.

"Are we engineers, like John Roebling? I wonder.''

All that verbiage was the cover story, after the fact. The problem was flutter - and humans have known that can happen since there were flags. The bridge was severely under-engineered, just omitting what had been standard components (including trussing under the bridge) for such bridges for a long time, to save money. There was nothing unpredictable about the result.

This entire article and discussion is based on a fake headline and a false premise.... there are far fewer disasters of these kinds than there were because we learned from them. There are far fewer system failures of most technical kinds as well.

“Siri didn’t immediately play the right song from my Infinite jukebox at my voice command” is not a bridge collapse. “My online banking was down for an hour” is not near the inconvenience of not having banking available every evening and night before online.

One saving grace is that truly incompetent software projects of any size never make it off the ground (or stay up long enough to be relied on).
