How complex systems fail (2002) [pdf] (mit.edu)
154 points by mpweiher 10 months ago | 19 comments



TL;DR: Spend 80% of the budget on expensive consultants qualified in a "project management method" such as PRINCE2 or the V-Model, and make a f*g nightmare for the people who actually try to make the system work.

A picture is worth a thousand words:

https://upload.wikimedia.org/wikipedia/commons/8/8f/Systems_...


It seems you've been hurt recently by a consultant :).

The institutional response to managing the risk of complex systems is often to introduce layers of approval and process that appear to be dealing with the problem ("this system can't fail, therefore please check with everyone involved in the system before making a change") but have little real-world value beyond driving everyone crazy and incurring enormous cost. They also shield individuals from real responsibility: how carefully do you review something you are the only approver on? What if there are 5 approvers? What about 15?

A better answer is to find ways to conduct real-world tests with subsets of your system where you can roll back bad consequences. As far as I can tell, this is the approach of Google, Facebook, and other newer tech companies: pushing small changes to subsets of customers for testing.
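
For illustration only, here's a minimal sketch of that pattern in Python: a feature flag gating a deterministic canary cohort, with an automatic rollback when the cohort's error rate regresses. All names here (handle_request, check_canary, the thresholds) are invented for the example, not any particular company's API.

    import hashlib

    # Sketch of a staged rollout with automatic rollback (hypothetical names).
    # Idea: expose a change to a small, deterministic subset of users, watch an
    # error metric, and kill the flag if the canary cohort regresses.

    ROLLOUT_PERCENT = 5        # start with 5% of users
    ERROR_THRESHOLD = 0.02     # roll back if canary error rate exceeds 2%
    flag_enabled = True        # the "kill switch" for the new code path

    def in_canary(user_id: str, percent: int = ROLLOUT_PERCENT) -> bool:
        """Deterministically bucket users so the same ones stay in the cohort."""
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return bucket < percent

    def handle_request(user_id: str) -> str:
        if flag_enabled and in_canary(user_id):
            return "new code path"     # the change under test
        return "old code path"         # the known-good path

    def check_canary(canary_error_rate: float) -> None:
        """Called periodically by monitoring; rolls back by flipping the flag."""
        global flag_enabled
        if canary_error_rate > ERROR_THRESHOLD:
            flag_enabled = False       # instant rollback for everyone

    if __name__ == "__main__":
        check_canary(canary_error_rate=0.05)   # simulated bad metric
        print(handle_request("user-123"))      # everyone back on the old path

The point is less the mechanics than the properties: the blast radius of a failure is bounded to the cohort, and rollback is a cheap, reversible action rather than a post-mortem.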

Legacy enterprise companies are woefully ill-equipped to do this for the most part, both technically and culturally.

Some industries have regulations that make this approach difficult (financial services) or human consequences of failure that make the cost of experimentation too high (medicine, airlines, etc.).


Funny, my thought was the opposite.

Having worked in heavily bureaucratic organisations, I read the PDF as an allegory of the blame game and the safety-first, heavily bureaucratic, process-driven culture big orgs have, which does nothing to actually reduce risk; rather, it stifles people's ability, and their willingness, to change anything.

Basically I came away with the opposite of what you said.


Did you even read it? There's no mention of PRINCE2 or any recommendations in it at all.


> The system practitioners operate the system in order to produce its desired product and also work to forestall accidents. This dynamic quality of system operation, the balancing of demands for production against the possibility of incipient failure is unavoidable. Outsiders rarely acknowledge the duality of this role. In non-accident filled times, the production role is emphasized. After accidents, the defense against failure role is emphasized. At either time, the outsider’s view misapprehends the operator’s constant, simultaneous engagement with both roles.

So much this. I coach and train my clients (Fortune 500) in extreme programming practices (unit testing, TDD, CI, CD, etc.). The duality these developers live with, and management's obliviousness to it (to their own detriment), create a toxic, anxiety-inducing work environment. It's very upsetting to me how these developers live with the manager/stakeholder constantly breathing down their necks to cut corners and "get it done", meanwhile holding them accountable for any mistakes.


jdjebc82747's dead comment:

> >"get it done" meanwhile holding them accountable for any mistakes
>
> Sorry, is there an area of business that is not like this?

It's not that other areas of business aren't like this, it's that management creates an environment where mistakes are more likely (and in some cases almost certain) to happen, and then penalizes people when they happen.

An example: bosses refuse to provide funding for materials and time to automate the test framework (embedded systems). So testing is done mostly manually, which consumes a great deal of time, or tests don't get conducted at all due to lack of time or capability (I can't flip a switch 10 times in a second, or at a particular and precise moment). So either we don't have enough time to take the test feedback and correct the system, or we never get the test feedback (because some tests aren't run) and can't correct the system. Errors are virtually guaranteed to slip into production if you're operating on short schedules or complex systems under these circumstances.
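
For concreteness, a toy sketch (in Python, with invented Relay and DeviceUnderTest stand-ins for whatever the real bench hardware would be) of the kind of test that is impossible by hand but trivial once automated: actuate an input 10 times in one second at precise intervals and check that the firmware registers every edge.

    import time

    # Hypothetical test harness. Relay and DeviceUnderTest are stand-ins for the
    # real rig (USB relay board, GPIO, firmware under test); the timing loop is
    # the part a human physically cannot do.

    class Relay:
        """Stand-in for a switch the test rig can actuate."""
        def __init__(self, device):
            self.device = device
        def set(self, closed: bool):
            self.device.on_input_edge(closed)

    class DeviceUnderTest:
        """Stand-in for the firmware's observable behaviour."""
        def __init__(self):
            self.edge_count = 0
        def on_input_edge(self, closed: bool):
            if closed:
                self.edge_count += 1

    def test_rapid_toggling(toggles: int = 10, period_s: float = 0.1) -> bool:
        dut = DeviceUnderTest()
        relay = Relay(dut)
        start = time.monotonic()
        for i in range(toggles):
            # actuate at precise, repeatable instants
            while time.monotonic() < start + i * period_s:
                pass
            relay.set(True)
            relay.set(False)
        return dut.edge_count == toggles   # did the firmware see every press?

    if __name__ == "__main__":
        print("PASS" if test_rapid_toggling() else "FAIL")

Once a rig like this exists it can run on every build, so the feedback described above as missing actually arrives before release.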

Management expects perfect results, but ties the engineers' hands so much that they aren't able to execute effectively, and then blames (and often dismisses) the engineers as a result.


> constantly breathing down their necks to cut corners and "get it done" meanwhile holding them accountable for any mistakes

It's not just cutting corners in the actual implementation, but also the metrics and management infrastructure for post-release operations.


This all looks spot on from my experience, except I have this nagging voice in my head that says the methods they are criticizing seem to be exactly what the airline industry has done for decades in the wake of plane accidents. The result has been an ever-improving safety record without preventing normal operation; although it is a pain to get through security at the airport, people do in fact continue to get from here to there. So how do the methods criticized here differ from the airline industry's, or, if they don't, what's the flaw in this analysis?


I know nothing of the industry, so everything that follows is guesswork.

I'd say airlines are a sweet spot because we are probably near a technological local optimum. We are not constantly evolving new paradigms for the industry, just incrementally improving the old one.

If post-mortem culture in the industry results in "fighting the last war", then that probably does little harm; at worst it adds incremental cost. In the meantime there is still incremental technological improvement giving us more headroom in the cost vs. safety trade-off.


Well, this makes some sense, since the speed of sound pretty much stopped technological progress toward ever-higher speeds a few decades ago.


Really?

The ancient ones, the humans who lived before my birth - those who launched rockets to the moon and detonated city-destroying bombs with atomic power, feats no country has matched in these declining years of civilisation - they also had faster-than-sound passenger aircraft.

Although that might just be the tendency to exaggerate stories with time, rose-tinted glasses, and the myth of the noble savage.


There's a common theme amongst many of those heroic feats: They tended to occasionally catch fire, fall out of the sky and/or explode at inopportune times.

It's the difference between a prototype and a production model. The Saturn V rocket and the Concorde were both prototype-grade vehicles. The marvel of them is that they worked at all. Now that we've proved the point, we won't put either "into production" unless it's reliable enough to be boring.


'Production quality' things spontaneously catch fire these days, too.

http://www.theclever.com/20-dangerous-cars-with-the-highest-...


Except not.

The air-safety Human Factors culture has moved away from what was being criticised 30 years ago. They are now fully on Safety-II.

I recommend Steven Shorrock's recent work at https://humanisticsystems.com


This episode is the most relevant, but see the whole series:

https://www.youtube.com/watch?v=3HaqpSPVhW8


If this is interesting to you, I highly recommend "The Field Guide to Understanding Human Error" by Sidney Dekker - it covers these points with examples. [0]

Another note: I wondered for a number of years what the root cause of the financial meltdown was, but looking at it from this point of view, it's obvious that a number of things have to go wrong simultaneously; it's just not obvious beforehand which failed elements, broken processes, and bypassed limits will lead to catastrophe.

For your own business/life, think about things that you live with that you know are not in a good place. Add one more problem and who knows what gives.

This is not intended to scare or depress, but maybe have some compassion when you hear about someone else's failure.

[0] https://www.amazon.com/Field-Guide-Understanding-Human-Error...



Here it is directly from MIT instead of the cancerous ResearchGate silo: http://web.mit.edu/afs.new/athena/course/2/2.75/resources/ra... (this is the 1998 version, to be fair, but the sole addition since seems to be an infographic of the contents)




