
Age of Invisible Disasters (2017) - rrampage
https://blog.eutopian.io/the-age-of-invisible-disasters/
======
pdkl95
> The Therac-25 X-ray machine killed several people ... The reason? A race
> condition.

The problem with the Therac-25 was _NOT_ the race condition. Complex devices
will always have bugs, because humans haven't figured out a way to do perfect,
bug-free engineering. The problem with the Therac-25 was that it wasn't
designed to _fail safe_. The previous model had the same race condition bug,
but it also had hardware monitors and interlocks which provided _defense in
depth_.

The lesson of the Therac-25 _is not_ writing perfect software; the lesson is
recognizing that humans make mistakes so anything remotely safety-critical
needs to be designed to fail safely when - not if - mistakes/bugs happen.
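To make the fail-safe idea concrete, here's a minimal sketch in Python (all
names invented for illustration; the real Therac-25 ran hand-written assembly
on a PDP-11): the beam is enabled only when two independent sources of truth
agree, so a race condition that corrupts one of them produces an abort rather
than an overdose.

```python
from enum import Enum

class Mode(Enum):
    ELECTRON = "low-current electron beam"
    XRAY = "high-current X-ray mode (spreader target must be in place)"

def beam_permitted(software_state: Mode, hardware_sensor: Mode) -> bool:
    """Fail-safe interlock: fire only when independent sources agree.

    The software's belief about the machine's configuration is
    cross-checked against an independent hardware position sensor.
    Any mismatch (e.g. a race condition leaving the software state
    stale) yields the safe outcome: no beam.
    """
    return software_state == hardware_sensor
```

The point is not that the check is hard to write; it's that the check lives
outside the software whose bugs it guards against, which is exactly the
defense-in-depth the earlier Therac models had and the Therac-25 dropped.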

~~~
salawat
The claim that we can't write perfect, bug-free software is honestly a myth.

Short of bit flipping due to environmental conditions, it's more than
possible to get an algorithm and state machine perfect. Look at the old
landline networks. Look at vending machines. It's just HARD. No one wants to
pay for HARD.

Yes, the Therac-25's problem did come from a lack of defense-in-depth. That
doesn't take away from the fact that Quality was sacrificed in the name of
cheaper production costs and a shorter time to market.

The easiest way to get people to be okay with shoddy things? Tell them it's
impossible to do better. Microsoft took that route, so everybody drinks the
kool-aid. Even people who should know better.

~~~
jodrellblank
_Tell them it's impossible to do better. Microsoft took that route_

Citation needed; where and when did they state this?

_Look at vending machines._

Like the one at work which tells me to use exact change only, then accepts
more money and still gives me the right change?

~~~
salawat
You know what, give me a few days to do some research if you're genuinely
interested. I heard about it from a co-worker who did work with the early
telcos way back when. He's passed away, and I never actually looked it up
myself, given that co-worker's role as my mentor and his deep insight into
computing history, which had checked out before.

~~~
jodrellblank
I am mildly interested, to the extent that it's any kind of specific, public
Microsoft technical claim, rather than a piece of generic marketing hubris, or
a rumour of something said in a closed-doors meeting of C-level people once
upon a time.

(Re: the marketing disclaimer - e.g. the UK slogan "_You can't get better
than a Kwik-Fit fitter_", Kwik-Fit being a high-street chain of budget vehicle
repair, I think analogous to something like Jiffy-Lube in the USA - it would be
unreasonable for anyone to take this as a factual claim that "no better
mechanics exist on the planet"; excellent mechanics are more likely to work on
racing car engines than routine oil and brake changes for near minimum wage,
say.)

------
misja111
Many IT projects are failing because their budgets and deadlines are
deliberately set too optimistically.

The reason for this comes from two sides: on the one hand, there is the client
(either internal or external) who wants as many features for as little
money as possible and who is often not very good at judging whether their
demands are realistic. On the other hand there is the IT organization, which
has an incentive to comply with unrealistic targets.

If it is an external client, this unrealistic promise is the advantage over
the competition that might get them the project. But I also see this happen
regularly with internal projects where there is no competition between
implementors. In that case the motivation is usually that IT management wants
to have the consent of higher management to get this project started; they
know that once a project has been going for a year or so, it won't be
terminated if it's over budget or past deadlines. At the same time, if they
had tried to give realistic estimates to higher management, the project
might never have been approved.

~~~
Joeri
Construction projects are the same. It seems like every week I read about
another construction project that’s far beyond its initial estimate. The
difference is that a half-finished construction project is a gaping wound in
the landscape, but a half-finished software project is just some code hidden
in a corporate server. The incentive to finish construction projects is
greater.

~~~
hobofan
> The difference is that a half-finished construction project is a gaping
> wound in the landscape, but a half-finished software project is just some
> code hidden in a corporate server.

The difference is that missing deadlines in construction projects come with
bigger monetary penalties (citation needed, based on the low penal damages
I've personally seen in IT project contracts).

~~~
forgotmypw
In an IT project, the blame falls on the developer, who is then encouraged to
work extra hours for free to make up for their own failure.

------
zakum1
The reason that software projects deliver so poorly in the corporate world is
that senior executives and investors understand them so poorly that a large
amount of recognition and attention goes to those who can "bullshit" rather
than fix the problem.

This is particularly toxic, because often the senior executives and investors
also have an interest in hiding the failure (avoiding personal accountability
and avoiding write-downs on investment).

It is very hard to see how a culture of true reflection and learning can
emerge in this environment.

~~~
amelius
I think the biggest problem is getting the requirements right (as much as
possible) prior to the actual coding.

Many managers seem to think that they can easily change the requirements in
the middle of the project.

~~~
groestl
The problem might even be the very idea that (without an effort that matches
or exceeds the implementation itself) you can "get the requirements right",
i.e. that there is a point in time where the requirements are complete, fully
understood and thoroughly specified.

~~~
brazzledazzle
Given the enormous scope of some of these projects, I wonder if a prototype
phase would help. Humans seem to have a hard time with conceptualizing, and a
slow, barely working (or non-working), unscalable prototype might be worth the
added cost and time. If it's part of the requirements-gathering process, with
explicit expectations set that it's a prototype, would it facilitate better
requirements input? Naturally you'd have some people making assumptions about
its fitness as a final product or MVP, but (if a UI is involved) giant bright
warning banners plastered all over might help with that.

~~~
curuinor
The default state of a software prototype is to be shipped into production.

------
underwater
The big software projects I’ve seen go bad don’t feel analogous to the bridge
failure mentioned in the article. It’s more often poor project management,
unclear requirements, and sub-par communication rather than a specific
engineering failure.

Despite the lack of public post-mortems, poor project management seems to be
widely recognized as a problem. But there isn’t a clear cut solution. Agile
promised to save us all but seems to be implemented poorly more often than
not.

~~~
jib
Agile (of the scrummy, we just meet sprint goals that we set type) is not a
project management solution, it is saying “well this seems hard, so we’re not
going to do it”.

Project management is happily alive. Thinking Agile in some way solves project
management is insane.

Plan out your system, estimate the time to build it (you don’t even need good
estimates), execute ruthless change control. It’s not hard, just takes
discipline. Ruthless change control is the hard part. That doesn’t mean saying
no, it means saying “if you change things, it costs you schedule days”.

If you want a clear cut system, iDesign has some good classes. Imo at least.

~~~
adrianN
Once you've finished planning out your system, the requirements have likely
changed. If they haven't, you'll find out that you built something different
from what the customer wanted once you're finished programming.

~~~
lazulicurio
I don't think you're necessarily disagreeing with the parent post.

> Ruthless change control is the hard part. That doesn’t mean saying no, it
> means saying “if you change things, it costs you schedule days”.

Just because software is dominated by soft costs doesn't mean that it's cheap
to change requirements. That doesn't mean you can't deviate from your initial
spec, it just means you have to charge the customer for changes that they want
after you've already started development. Throwing away work just because it's
"easy" to change software doesn't magically recover the time and money that
you've already spent up to that point.

------
magicalhippo
I'm not a project manager, and I most certainly haven't been involved in huge
IT projects.

That said, from my POV a lot of software-related IT project failures seem
to be correlated with two factors:

- Doing too much at once. Like replacing 6 different existing specialized
systems with a single new one.

- Unwillingness to change the business procedures/workflow to cater to the
software.

The lure of the single do-it-all system seems strong with certain people. But,
at least in my experience, one could draw from software engineering and how
good software is written: as separate modules with well-defined interfaces at
the boundaries. If you have multiple systems with good interfaces for data
exchange, it's much easier to specialize where needed, and to replace outdated
or broken pieces.
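A sketch of that modularity point (hypothetical names; Python's structural
typing standing in for a real data-exchange interface): consumers that depend
only on the boundary don't care which system sits behind it, so an outdated or
broken piece can be swapped out without touching them.

```python
from typing import Protocol

class InvoiceSource(Protocol):
    """The agreed data-exchange boundary between systems."""
    def fetch_invoices(self, since: str) -> list[dict]: ...

class LegacyBilling:
    """Outdated system, due for replacement."""
    def fetch_invoices(self, since: str) -> list[dict]:
        return [{"id": 1, "period": since}]

class NewBilling:
    """Replacement system: same interface, different internals."""
    def fetch_invoices(self, since: str) -> list[dict]:
        return [{"id": 1, "period": since}]

def monthly_report(source: InvoiceSource) -> int:
    # The report depends only on the interface, so either
    # implementation can sit behind it unchanged.
    return len(source.fetch_invoices("2018-01"))
```

Replacing `LegacyBilling` with `NewBilling` leaves `monthly_report` untouched;
a single do-it-all system offers no such seam.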

The unwillingness to adjust the business procedures/workflow to software needs
is a huge one. Complex software is fragile. By having complex rules in the
business procedures you force the software to be more complex, thus invariably
making the software more fragile. If business procedures were changed to be
software friendly before the software is written/adapted, the software can be
simpler and thus hopefully less fragile.

------
iTokio
It daunts me how unreliable software is getting, but trying to shame people
to hold them accountable is naive.

The root of the problem is the uncontrolled complexity of modern software
products.

Because of this complexity, responsibilities are diluted; most of your code is
in your dependencies nowadays.

If you write a casual library, are you responsible if it is flawed and used in
a critical operation? Can dependencies always be carefully audited?

~~~
adamcharnock
It daunts me too, and reminds me of the early days of commercial flight. I
hope we'll have a similar "we cannot continue like this" moment.

However, I don't think shaming is what this is about. To me it seems
the objective is to learn from mistakes, and for that we need to be honest
about what happened; it is going to be pretty hard to be honest if we tip-toe
around who did what and why.

I also agree that complexity is a problem. But I don't think acknowledging
this gives us any path forwards. I don't think going back to the 'good old
days' is going to be a solution. I therefore see this learning process as
helping us figure out how to move forwards, and to provide a motivation to the
industry as a whole. It is this industry-wide motivation that will be needed
to address some of the systemic complexity issues.

I don't think this would be enough on its own (and implementation is a whole
other question), but I think it could be a step in the right direction.

~~~
benashford
> It daunts me too, and reminds me of the early days of commercial flight. I
> hope we'll have a similar "we cannot continue like this" moment.

The first citation of the words "Software Crisis" meaning the inherent
difficulty of writing high-quality software in a predictable way was from a
NATO conference fifty years ago:
[https://en.wikipedia.org/wiki/Software_crisis](https://en.wikipedia.org/wiki/Software_crisis)

It is taking a long time for good practices to be discovered and win out, and
even when obvious improvements have been made, they're not necessarily used
effectively.

I suspect a large part of the reason why the software industry isn't maturing
at the same speed that other industries have had to, is that in software,
failure is much easier to hide.

------
sidstling
A lot of the failures I experience are born from trying to solve business
process problems with digitization, or from digitizing without ever asking if
it's the right thing/way to do stuff. Another common problem is focusing too
much on a particular set of business processes and forgetting that every IT
system is part of a package of numerous IT systems that work together.

I live in one of the most digitized countries in the world. So we've naturally
digitized payment for public transportation. When we did it, nobody questioned
the taxation system, even though it was made in the '70s and built around a
public structure called "amter" that hadn't actually existed for many years
when the system was built. We had also gone from 271 municipalities to 98, and
their borders were part of the taxation too. So the taxation rules frankly
didn't make any sense and they were needlessly complicated, yet they were
digitized as-is. Naturally it was a disaster - it was even predicted by the
technical team and the project leads - but nobody wanted to touch the taxation
politically. It got fixed eventually, but it could have been several hundred
million Danish kroner cheaper if they had simply redone the taxation models
for ticket prices before the digitization.

So that’s one mistake, and a common one, both in the public and private
sectors. The other common disaster is building systems for specific processes
without looking at the bigger picture. Like a case working system that handles
the welfare process for people who are sick. Except you forget that those
citizens sometimes don’t go through official communication channels, and maybe
send a letter or an email to the wrong department, so you need to be able to
add those documents to their digital case file. But that’s not possible and
neither is sending a notice to other systems in other departments which also
deal with the same citizen. I'm guessing this last issue is bigger in the
public sector than in the private, because we often buy our software from
companies that have very little actual domain knowledge beyond what their
direct customers tell them, and the case workers they use for knowledge very
often lack insight into the greater architecture of running 350+ IT systems
together, because they work with maybe 5 of them.

I mean, these things aren't as deadly as the X-ray machine, but they've been
happening for the better part of 25 years and nobody seems to have really
learnt anything.

~~~
Aeolun
Oh, we have learned all those things as developers. It just seems that none of
the decision makers have gotten any hint.

I'm honestly not sure why that is. I'm hesitant to ascribe it solely to
incompetence, because not everyone can be incompetent, but maybe we only hear
about the failed projects with bad decisions.

~~~
olooney
Why can't everyone working in a given field be incompetent? When Dr.
Semmelweis discovered that surgeons washing their hands drastically reduced
patient mortality, he was dismissed and ridiculed. I would say that 100% of
those surgeons were objectively bad at their jobs and therefore "incompetent"
in a narrow sense. Groupthink can cause everyone in a given field to converge
on the same orthodox belief, and if that belief is wrong or dangerous,
shouldn't they all be considered incompetent? Even today there are many
pseudo-scientific fields where literally none of the practitioners are
objectively able to accomplish what they claim... and it's not at all obvious
that project management is not among them.

------
nickdothutton
I wondered why an obscure post of mine was suddenly popular.

A clarification: the Therac-25 had an unfortunate race condition; what made
this deadly was the conscious decision by the designers to REMOVE the physical
safety interlock. They didn't consider modes of failure. The post says exactly
this. Always consider modes of failure; you never know when some "other guy"
is going to naively count on your work being 100% reliable. It's a system, not
a goal, as I like to remind people.

Some of you might enjoy some of my other stuff, particularly on security:
[https://blog.eutopian.io/winning-systems--security-practitioners-5.-resilience/](https://blog.eutopian.io/winning-systems--security-practitioners-5.-resilience/)

The Tay Bridge disaster was important because: 1) Before it, we had several
bridge failures in the UK. 2) After it we had almost none at all. Ever. 3) The
report into the disaster was responsible for this improvement. It uncovered
problems with: The design, the metal used, the way it was assembled, the
maintenance regime, the project management and personal relationships and
personalities of the people involved.

I'd lay money on the cause of the recent tragic bridge collapse in Italy being
one of those already cited in that 140 year old report. It's all there.

Back to our own world...

When major IT projects fail, there is almost never a public enquiry, even when
those failures are government projects, and even when they cost hundreds of
millions of dollars/pounds. These failures are repeated regularly in
government, and daily in the private sector.

Many of us who have been around a while have a (probably pretty good)
understanding of why they fail, yet the lessons are not learned and there is
little sign we are getting any better at all at not-failing. I suspect a bit
of exposure to downside risk, or "skin in the game" as Taleb would call it,
might improve things. Sometimes the medicine is hard to take.

------
Spearchucker
There's this book by Peter DeGrace and Leslie Hulet Stahl called Wicked
Problems, Righteous Solutions. It describes all of these problems and others.
It presents a number of very practical and proven solutions.

The book was published _28 years ago_, in 1990.

We use words like science and engineering in conjunction with others like
computer, programming, and software. And yet there's nothing scientific about
how we don't learn from mistakes already made decades ago, and how we keep
reinventing "engineering" best practices and calling them by new names.

~~~
dredmorbius
You've probably heard the line "complexity is the enemy", or perhaps even its
full form: "complexity is the enemy _of reliability_".

You may not know its provenance: _The Economist_, Volume 186, January 1958 -
60 years ago:

[https://books.google.com/books?id=aDsiAQAAMAAJ&q="complexity...](https://books.google.com/books?id=aDsiAQAAMAAJ&q="complexity+is+the+enemy+of+reliability"&dq="complexity+is+the+enemy+of+reliability"&hl=en&sa=X&ved=0ahUKEwjVw5S5o4vdAhUCF6wKHV8bDckQ6AEIKDAB)

I've been trying unsuccessfully to secure a copy of this article for some
years. PDF preferred, dredmorbius<at>protonmail<dot>com if anyone should have
access.

~~~
wilsonnb3
If you're willing to pay $100 to secure a copy, you can probably read it in
the Economist Historical Archive.

[https://shop.economist.com/products/the-economist-historical-archive#](https://shop.economist.com/products/the-economist-historical-archive#)

~~~
dredmorbius
Not.

My research trove exceeds 10k items. I don't have a $100-per-item, or even
per-source, budget.

------
curtis
> _If you find yourself on a failing project, squandering tens of millions of
> pounds and hundreds of man-years of talent, pause for a moment. Think about
> the fact that almost 140 years ago, civil engineers stopped building bridges
> that fell down. They stopped building them because the failure of one bridge
> was laid bare so publicly._

> _Think about the fact..._

You can think all you want, but it's unlikely to do anybody any good.
Sometimes the fault for a failed project lies squarely with the engineers, but
this is not at all the usual case. The people who are most responsible for
failed software projects is _management_ , and not just engineering
management, but the people who engineering management reports to.

And the biggest problem management has is not simply lack of understanding of
the nature of software development projects, but, often, a profound lack of
interest in learning.

I don't know what to do about that.

------
mprev
Kind of weird that they used a novelty fake newspaper front page to illustrate
this. The Scottish Scribe is a book of mocked-up newspaper front pages
attempting to show how a modern tabloid might have dealt with historic events.

------
vinceguidry
> I count £20b in failed IT projects over the last decade alone.

It's hard to grasp the sheer _scale_ of government. This article does a good
job of juxtaposing the magnitudes of engineering failures, but I want to add
that $20 billion is chump change when it comes to waste. The
military sector alone plowed through $700 billion last year to accomplish the
task of robo-killing brown people. The entire federal budget was $2 trillion.
Stop and think about those numbers for a bit.

There are 2.8 million civil servants in the US, and 2 million military
personnel. $2T divided by 4.7 million means _every single government official_
is responsible for roughly $425,000 of your tax dollars. This includes postmen
and every boot camp trainee.

Obviously only a fraction of these people are making decisions. So you can add
zeroes to that number when you want to consider how much power the actual
decision makers have. These decision makers are human, and humans are wont to
see themselves as kings of their domain - and what is a king's job but to
squander money squabbling over fiefdoms?

The sheer, mind-boggling scale of systems of government, all of them, from
your homeowner's association to your neighborhood council to your city
government to your state government to the national government to
international governmental organizations like NATO and the UN, isn't even the
most interesting aspect to consider here.

A more amazing thing to think about is how they manage to get anything done at
all. But that's not even the biggest thing.

The biggest thing is that there is nothing new about this state of affairs.
Civilization was built like this, thousands of years ago.

It's an admirable goal to want to get rid of waste in government. But that's
an untamable firehose. You won't even get laughed at for a proposal to save
$20b of tax money. They will look at you, decide whether you're going to look
good on TV, maybe put you up in front of a camera if you're really really
really really lucky, and everything you spent your whole life learning to
finally try to do will get swept into a political capital generating exercise
for a _local_ politician. Thanks, try again next life.

Governmental cruelty knows no bounds.

------
drinchev
There is currently a huge problem with the Bulgarian Electronic Trade Register
[0]. The register stopped working two weeks ago and is still not online [1].

It holds all company ownership data and a lot more. Right now there is no way
to register a company in my country, or to make any changes to existing
companies (e.g. changing manager, shareholders, etc.). It is one of the most
important sets of data for an EU country.

The original problem (leaked by the government) is that 4 of the RAID5
disks broke down, but it is still a mystery why recovering the data takes more
than 2 weeks.

0 : [http://brra.bg](http://brra.bg)

1 : [https://bivol.bg/en/classified-information-and-human-error-caused-trade-registers-collapse.html](https://bivol.bg/en/classified-information-and-human-error-caused-trade-registers-collapse.html)

~~~
Aeolun
It's amazing that 4 RAID5 disks could break down in the first place.
Apparently they had triple redundancy and STILL managed to make it fail.

Though I bet it was just that nobody ever checked if the disks were still
working.

~~~
raarts
What didn't help was that the backups were stored on the same LUN as the
production database.

~~~
noir_lord
Ouch.

There is a reason my backups are held on a separate machine in a separate
building (and also on large external encrypted drives that leave site
everyday).

Sometimes you learn these things the hard way.

------
louwrentius
I once read on Hacker News that in the future, C-level people will all need to
be very strong in IT, as any company or organisation nowadays is so
reliant on IT.

It's also the message of the Phoenix Project book, which I did like.

The problem I have noticed is that although management does understand their
businesses, it's easy to bullshit them when it comes to IT. And they let it
happen because they are not into IT and they don't grok it. They would never
treat other projects, which they totally understand, the same way.

I especially notice that when I read what higher management layers write about
projects or efforts, it's high on fluffy 'visionary' words but low on actual,
actionable vision that would help me make everyday decisions about what to
prioritise.

I believe that the simple reason why IT projects fail is because of very
mundane basic things.

But those are not sexy to write about. To me it's all about:

- Why are we doing this?
- How would you define success and failure for this project?
- Who is responsible for what / who is the contact person?
- How do we work with each other? (and detail this)
- What are the guiding principles?
- How do we assure quality?
- How do we assure timely delivery?
- What stuff do we need - gear, licenses, etc.?
- P R E P A R A T I O N - do your homework, investigate things before you
make choices.

I can go on and on. And it may bore you. But I think there is actually no true
complexity involved in all those failing projects.

There is not something really special to IT projects. I wonder if we do
pretend there is something special to them because we ourselves want to feel
important in some sense.

~~~
phs318u
I don't disagree with you at all; however, we (in IT) also bear responsibility,
primarily in the following areas:

1. IT don't really understand business. Often we think we do - more so than
the business does - and that hubris often blinds us to the real issues that
the business needs to address. This can be resolved through use of boring old
enterprise architecture (real E, not just tech E), and business analysis.
Actually making the effort to understand. Sadly most EAs and BAs aren't. These
are specific skills and they relate primarily to people and process rather
than technology.

2. Technology is largely overrated. I know that is likely to be an unpopular
opinion. Sorry, not sorry. I am an Enterprise Architect by training and by
trade. I've worked on several multi-hundred-million-dollar programs. I can
promise you that in my experience, had every single tech decision been flipped
or changed, the difference in outcome would have been +/- 10% at most.
Projects are not won and lost on technology. Completely missing or
misunderstanding requirements, miscommunication, poor program financial
management, overestimating the business' capacity for change - these (and
more) are the things that are more often than not likely to make the
difference between success and failure.

3. You are not Google. Or Apple. Or Spotify. Or Amazon. Unless you are one of
those companies. But if you're an energy company or a financial services
company - then no. Just no. Your business is largely conservative, managed
(ideally) by risk-averse managers, invested in by people who want a certain
return. Your industry is highly regulated and there are things you have to do
that you don't control, and you do them whether the timing is good or not, at
the expense of things that you do want to do. So stop fucking kidding
yourselves and realise what you actually ARE, and cherry-pick & adapt those
things that are likely to work for you.

It's not like we collectively don't know this stuff. Let's stop drinking our
own kool-aid. And for those of you who do work for cool companies or startup
disrupters, I'm really happy for you. For the rest of us, technology is not
the centre of the universe. Appreciating that difference is important.

And yes, I know I've missed valid arguments and swung the pendulum a bit far
the other way. It's deliberate. We are in danger of disappearing into our own
navels.

~~~
tonyedgecombe
My last bank was a staid conservative heavily regulated business that lost my
custom because they were so poor at technology.

My electricity supplier is close to losing my business, in part because their
web site won't accept my meter readings.

I'll be basing my decision on what car to buy my wife at least partly on
whether it works with my iPhone. If it doesn't then you won't be on the list.

And on and on. We increasingly interact with businesses through technology, if
they can't get that right then they are going to suffer. They can't get it
right unless they take it more seriously right up to board level.

~~~
phs318u
“Conservative” in this context doesn't mean how they interact with customers,
but how they run their business (of which customer interaction is a part). The
tech you are talking about is the visible bit of the iceberg. Work in the back
office for any length of time and you'll know what I'm talking about.

------
sgt101
There is some work on this:
[https://spectrum.ieee.org/static/lessons-from-a-decade-of-it-failures](https://spectrum.ieee.org/static/lessons-from-a-decade-of-it-failures)

But... there are three key problems.

1) The time scales are long - in my experience big project failures are on a
>5 year time scale (because - big). I think proper studies will need to run
more than 10 years, and that's a big ask for any academic or team.

2) The costs are borne by one set of stakeholders (IT); the benefits are
accrued by another (the next IT). Why invest to help your successor? No one is
going to thank you; you'll also likely be sacked faster! There is no board-
level education or knowledge about this. The only source of information that
could convince boards that this is the right thing to do would be
McKinsey/Bain/BCG, and those &&^^"! will never, ever say this, because it's
the right thing to do and they are evil. (Prove me wrong!)

3) What do you measure? The field is immature, it's not clear what the right
inputs to check are - or what the right way to estimate the outputs are. So we
need to do a lot of work now to set up the definitive studies.

I have an anecdote : there is a thing called The FEAST hypothesis
[http://users.ece.utexas.edu/~perry/work/papers/feast1.pdf](http://users.ece.utexas.edu/~perry/work/papers/feast1.pdf)
I was a user of one of the studied systems, and I was curious about the study.
I discovered that it hypothesised that development of big systems slowed as
they got more complex, and the data from the system I used was one of the
points that confirmed this. I examined change control documents and discovered
that the development of said system _had_ slowed before the end of the study,
but then it had reaccelerated: a whole load of "robots" had been implemented
by business units consuming the system, and these had not been reported in the
FEAST study (IT was largely unaware). The robots started causing problems,
policy changed, they were insourced, and on-platform development took off.

We need:

- a 5-year major international project to develop the art to support this
- legislation that mandates that system development information is stored up
front and in a shared place
- legislation that mandates regular reviews that determine certain
information, signed off by an engineer
- a 20-year massive project to use the above information

I am not optimistic. We can't even prove that XYZ is better than agile.

------
lifeisstillgood
There are many problems with software projects, but a fundamental one often
not raised is that it is hard to say "this bridge will be built using this
quality of steel and this much effort and time" when there are ten other
companies, all looking just as convincing from the outside, saying we will do
it in half the time for half the cost.

I am not sure I have many answers. But having a genuine profession that is
required by law to sign off on any life-critical software seems a sensible
starting point.

------
UncleEntity
> Think about the fact that almost 140 years ago, civil engineers stopped
> building bridges that fell down. They stopped building them because the
> failure of one bridge was laid bare so publicly.

Yes, but how many bridge projects failed in the last 140 years because of cost
overruns or missed deadlines, which is a more direct analogy for most of the
arguments in TFA?

And I'm guessing we're just talking about the UK since earthquakes have taken
down a bridge or two in my lifetime...

~~~
dredmorbius
Exceptionally good point. Confounding _failure of a project to deliver_ with
_the catastrophic public failure of that project's deliverables_ is a truly
_extraordinary_ category error.

I was reading just recently of a failed megaproject, the Nicaraguan Canal.
Forecast costs range from $40 - 100 billion, though I cannot find a report of
actual expenditures.

[https://en.wikipedia.org/wiki/Nicaragua_Canal](https://en.wikipedia.org/wiki/Nicaragua_Canal)

Contrast this with actual _engineering_ failures, such as Fukushima or
Banqiao. This is an apple-juicer to oranges comparison.

[https://en.wikipedia.org/wiki/Fukushima_Daiichi_nuclear_disa...](https://en.wikipedia.org/wiki/Fukushima_Daiichi_nuclear_disaster)

[https://en.wikipedia.org/wiki/Banqiao_Dam](https://en.wikipedia.org/wiki/Banqiao_Dam)

------
vjsc
Poor management doesn't always stem from ignorance or a lack of understanding
of how SW works. Sometimes it emanates from pressure to deliver within very
tight timelines for the sake of the business's survival or standing up to the
competition in the market.

I have seen the best managers giving in to ridiculous deadlines at the time of
project onset just because they know that there is no other option.

------
dwenzek
In this post, Bertrand Meyer made a similar claim, taking the airplane
industry as a reference.

"When Will We Learn? Every major software incident requires a thorough and
public analysis."

[https://cacm.acm.org/blogs/blog-cacm/227943-when-will-we-lea...](https://cacm.acm.org/blogs/blog-cacm/227943-when-will-we-learn/fulltext)

------
gus_massa
> _Think about the fact that almost 140 years ago, civil engineers stopped
> building bridges that fell down._

It was written last year, but it looks like a weird sentence this year:

[https://en.wikipedia.org/wiki/Florida_International_Universi...](https://en.wikipedia.org/wiki/Florida_International_University_pedestrian_bridge_collapse)

[https://en.wikipedia.org/wiki/Ponte_Morandi#Partial_collapse](https://en.wikipedia.org/wiki/Ponte_Morandi#Partial_collapse)

It's not even that unusual. In
[https://en.wikipedia.org/wiki/List_of_bridge_failures#2000–p...](https://en.wikipedia.org/wiki/List_of_bridge_failures#2000–present)
I counted around 150 bridge collapses since 2000.

~~~
Nomentatus
Back then (to quote another comment here): "a quarter of all the bridges of
any type built in the U.S. in the 1870's collapsed within ten years of their
construction."

Metallurgy was primitive, and there were no x-rays available to find hidden
cracks formed during the manufacturing process.

------
hyperman1
The book he wants has been written. It's called The Mythical Man-Month, by
Fred Brooks. It does a post-mortem on what went wrong (and what went well) in
the development of an IBM OS.

I think today it mostly contains stuff everybody knows, which shows it had a
lot of impact on our profession. Not on the coding part, but very deeply on
the management part.

------
dredmorbius
A (justifiably dead) comment mentions the, erm, case of the FBI's Virtual
Case File, a $170m project killed in 2005.

[https://web.archive.org/web/20130729205010id_/http://itc.con...](https://web.archive.org/web/20130729205010id_/http://itc.conversationsnetwork.org/shows/detail1688.html)

[https://en.wikipedia.org/wiki/Virtual_Case_File](https://en.wikipedia.org/wiki/Virtual_Case_File)

[https://web.archive.org/web/20051201114736/https://www.spect...](https://web.archive.org/web/20051201114736/https://www.spectrum.ieee.org/sep05/1455)

------
ThinkBeat
This makes me think about the poor woman who was killed by a "self-driving
car".

If the software had acted as expected she would still be alive. If self-
driving cars become popular, coding mistakes will kill more people.

(Yes, they had wilfully disengaged the car's built-in automatic braking
feature in order to allow their software to control it, and the human safety
engineer riding in the car was not paying attention to the road (also because
the safety engineer blindly trusted the software running the car); those were
factors as well.)

~~~
Jwarder
> also because the safety engineer blindly trusted the software running the
> car

My understanding is the opposite. Uber's software was generating a lot of
false braking events, so they set it up so it wasn't controlling the brakes.
Drivers were trying to gather evidence about the triggers for these false
events. That created this perverse situation where the software correctly
identified that it should have braked, but the only action it could take was
to raise an alert, distracting the driver at a critical moment.

------
Aloha
Just as important as failing safe is not failing silently.

That's another key lesson from the Therac-25: it failed mostly silently. It
displayed a strange message that didn't make any sense, and there was no
obvious detection of a failure.

Software need not be bug-free. For example, there is a reason we still
include hardware watchdogs on embedded devices, and it's largely because the
watchdog is cheaper than bug-free software and will provide the same quality
of service.

------
gonzo41
The biggest errors I've seen are where people buy COTS for their core
business, which is usually a mistake if IT is a main driver for your business
in some special circumstance.

It's also a mistake to go COTS and not make the business change to fit the
product. Trying to make COTS work your way is always so so so bad.

~~~
jacquesm
It depends on the phase your business is in. In the beginning, using
standardized stuff that you buy ready-made can be a real time saver, and time
is usually in short supply. Once you achieve a certain scale and can afford to
do it, you can usually save substantially and scale up further by doing
something more customized.

Starting off with a completely custom set-up for your core is another
opportunity for premature optimization to creep in.

~~~
gonzo41
I've seen this happen where a mature business starts to functionally
decompose its business down to what it does and how it delivers. It then goes
out and buys products that do those things and links them in a chain with a
DB. But each one of those is short about 10-20% on all the used-to-haves and
nice-to-haves.

What then dawns on the business is that the missing 10-20% was the part of
the business that was really important, and they have dropped serious money
on a bunch of products when really all they needed to do was better
understand themselves and build their own business infrastructure.

What you are saying about speed definitely rings true, though. But it's
important to note that IT failures that happen to new businesses are more or
less written off as total business failures, usually resulting in the
business going to the wall.

------
macleginn
A bridge falling under a train and a non-delivered project don't really have
much in common. Major engineering projects keep being delivered very late or
scrapped altogether, not that software is altogether different in this regard.

------
contravariant
Using the collapse of an Italian bridge as an example of the kind of disasters
that happened in the past is somewhat unfortunate, although the author
couldn't have known.

------
Animats
We _are_ seeing bridge collapses again. Genoa last week. Florida last month.

------
damian2000
When I see Therac it reminds me of Theranos - another medical device that
went into production with serious issues.

~~~
jacquesm
It is unfair to the makers of the Therac to compare them with Theranos.

------
defined
This is not the first time bridges have been used in comparison with software
development.

From Programming Pearls, Section 7.3 [Safety Factors], by Dr. Jon Bentley,
which reproduces Vic Vyssotsky's advice from a talk he has given on several
occasions.

"Most of you'', says Vyssotsky, "probably recall pictures of `Galloping
Gertie', the Tacoma Narrows Bridge which tore itself apart in a windstorm in
1940. Well, suspension bridges had been ripping themselves apart that way for
eighty years or so before Galloping Gertie. It's an aerodynamic lift
phenomenon, and to do a proper engineering calculation of the forces, which
involve drastic nonlinearities, you have to use the mathematics and concepts
of Kolmogorov to model the eddy spectrum. Nobody really knew how to do this
correctly in detail until the 1950's or thereabouts. So, why hasn't the
Brooklyn Bridge torn itself apart, like Galloping Gertie?

"It's because John Roebling had sense enough to know what he didn't know. His
notes and letters on the design of the Brooklyn Bridge still exist, and they
are a fascinating example of a good engineer recognizing the limits of his
knowledge. He knew about aerodynamic lift on suspension bridges; he had
watched it. And he knew he didn't know enough to model it. So he designed the
stiffness of the truss on the Brooklyn Bridge roadway to be six times what a
normal calculation based on known static and dynamic loads would have called
for. And, he specified a network of diagonal stays running down to the
roadway, to stiffen the entire bridge structure. Go look at those sometime;
they're almost unique.

"When Roebling was asked whether his proposed bridge wouldn't collapse like so
many others, he said, `No, because I designed it six times as strong as it
needs to be, to prevent that from happening.'

"Roebling was a good engineer, and he built a good bridge, by employing a huge
safety factor to compensate for his ignorance. Do we do that? I submit to you
that in calculating performance of our real-time software systems we ought to
derate them by a factor of two, or four, or six, to compensate for our
ignorance. In making reliability/availability commitments, we ought to stay
back from the objectives we think we can meet by a factor of ten, to
compensate for our ignorance. In estimating size and cost and schedule, we
should be conservative by a factor of two or four to compensate for our
ignorance. We should design the way John Roebling did, and not the way his
contemporaries did -- so far as I know, none of the suspension bridges built
by Roebling's contemporaries in the United States still stands, and a quarter
of all the bridges of any type built in the U.S. in the 1870's collapsed
within ten years of their construction.

"Are we engineers, like John Roebling? I wonder.''

~~~
Nomentatus
All that verbiage was the cover story, after the fact. The problem was flutter
- and humans have known that can happen since there were flags. The bridge was
severely under-engineered, just omitting what had been standard components
(including trussing under the bridge) for such bridges for a long time, to
save money. There was nothing unpredictable about the result.

------
kweinber
This entire article and discussion is based on a fake headline and a false
premise... There are far fewer disasters of these kinds than there were,
because we learned from them. There are far fewer system failures of most
technical kinds as well.

“Siri didn’t immediately play the right song from my Infinite jukebox at my
voice command” is not a bridge collapse. “My online banking was down for an
hour” is not near the inconvenience of not having banking available every
evening and night before online.

One saving grace is that truly incompetent software projects of any size never
make it off the ground (or stay up long enough to be relied on).

