A Matter of Millimeters: The story of Qantas flight 32 (admiralcloudberg.medium.com)
680 points by xenophonf on Dec 9, 2023 | 297 comments



I don't know about others, but I can't help but smile when I read the detailed series of events in aviation postmortems. To be able to zero in on what turned out to be a single faulty part and then trace the entire provenance and environment that led to that defective part entering service speaks to the robustness of the industry. I say that sincerely, since mistakes are going to happen, and in my view robustness has less to do with the number of mistakes than with how one responds to them.

Being an SRE at a FAANG and generally spending a lot of my life dealing with reliability, I am consistently in awe of the aviation industry. I can only hope (and do my small contribution) that the software/tech industry can one day be an equal in this regard.

And finally, the biggest of kudos to Kyra Dempsey, the writer. What an approachable article despite being (necessarily) heavy on the engineering content.


As a former Boeing engineer, other industries can learn a great deal from how airplanes are designed. The Fukushima and Deepwater Horizon disasters were both "zipper" failures that showed little thought was given to "when X fails, then what?"

Note I wrote when X fails, not if X fails. It's a different way of thinking.


When I worked in an industrial context, some coding tasks would seem trivial to today's Joe Random software dev, but we had to be constantly thinking about failure modes: from degraded modes that would keep a plant 100% operative 100% of the time in spite of some component being down, to driving a 10 m high oven that could break airborne water molecules from mere ambient humidity into hydrogen, whose buildup could be dangerously explosive if some parameters were not kept in check, which meant the code/system had to have a number of contingency plans. "Sane default" suddenly has a very tangible meaning.


> we had to be constantly thinking about failure modes

This to me is the biggest difference between writing code for the software industry vs. an industrial industry.

Software is all about the happy path ("move fast and break things") because the consequences typically range from a minor inconvenience to a major financial loss.

Industrial control is all about sad paths ("what happens if someone drives a forklift into your favorite junction box during the most critical, exothermic phase of some reaction") because the consequences usually start at a major financial loss and top out in "Modern Marvels - Engineering Disasters" territory.


You do /not/ want to make it on the USCSB YouTube channel.


Yeah, I work as a Functional Safety Engineer in the process and machinery sector and 90%+ of effort is in planning, considering all the possibilities outside of intended operation and traceability.

I have worked on projects where in retrospect the LOC generated per day, if spread out across the whole project, were between 1 and 3.

But typically, writing of the code does not even commence in the first year, sometimes two.

Then there are the test cases and test coverage, etc.

This is the difference between engineering code and just producing it - all the effort that goes into understanding all the unwanted code behaviour that may occur and how to detect, manage and/or avoid it.

Implicit state is the enemy, therefore the best code has all states explicitly defined.
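A minimal sketch of what I mean by explicit states, in Python purely for illustration (the oven states and names are invented, not from any real controller):

    from enum import Enum, auto

    class OvenState(Enum):
        IDLE = auto()
        HEATING = auto()
        AT_TEMPERATURE = auto()
        VENTING = auto()              # contingency: purge any hydrogen buildup
        FAULT_SAFE_SHUTDOWN = auto()  # the "sane default" everything falls back to

    # Explicit transition table: anything not listed here is rejected, so there
    # is no implicit state the system can drift into.
    ALLOWED = {
        OvenState.IDLE: {OvenState.HEATING},
        OvenState.HEATING: {OvenState.AT_TEMPERATURE, OvenState.VENTING,
                            OvenState.FAULT_SAFE_SHUTDOWN},
        OvenState.AT_TEMPERATURE: {OvenState.VENTING, OvenState.FAULT_SAFE_SHUTDOWN},
        OvenState.VENTING: {OvenState.IDLE, OvenState.FAULT_SAFE_SHUTDOWN},
        OvenState.FAULT_SAFE_SHUTDOWN: set(),  # terminal until a human intervenes
    }

    def transition(current: OvenState, requested: OvenState) -> OvenState:
        # Any unknown or disallowed request lands in the safe default.
        if requested not in ALLOWED[current]:
            return OvenState.FAULT_SAFE_SHUTDOWN
        return requested

The point isn't the Python; it's that every reachable state and every allowed transition is written down, and anything else lands in the safe default.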


As an engineer I think a lot about tradeoffs of cost vs. other criteria. There is little I can learn from the nuclear or aviation industries, as the cost structure is so completely different. I'm very happy that the costs of safety in aviation are so well accepted, but I understand that few people are willing to pay similar costs for other things like, say, cars.


The costs of the Fukushima and Deepwater Horizon disasters were very, very high. Both could have been averted at trivial expense with simple changes to the design.

Fukushima:

badthink - the seawall is high enough that it will stop tidal waves

goodthink - what happens when the seawall is overtopped? Answer: the backup generators drown. Solution: put the backup generators on a platform.

Deepwater Horizon:

badthink - the pipe is strong enough to never break

goodthink - what happens when there's enough force to bust the pipe off? Answer: the pipe flow cannot be shut off. Solution: put a fuse (a weak spot) above the valve, so when the pipe busts off, it breaks above the valve, and the valve can be turned to shut off the flow. (The valve was located on the sea floor.)


This is so easy in retrospect when you know what the failure mode will be.

badthink: the Fukushima backup generators must be placed on a platform to keep them out of the range of a once-in-a-millennium tsunami

goodthink: what happens when a typhoon comes and damages the generator on an exposed platform; an event which happens predictably and far more often than tsunamis. Answer: put the backup generators in the basement of a reactor building behind a large seawall. What catastrophe could put the reactor building completely underwater, and still have the reactor survive?

Yeah, trivial changes to the design can prevent all sorts of disasters, but you have to know what you are trying to prevent in a world of infinite complexity.


A large seawall to be sure, but not a particularly tall one. If I recall correctly the seawall was remarkably short relative to maximum expected wave heights on a 100 year time frame.


We're making a niche B2B application, and this is very much it for us as well.

Our customers are in a cutthroat market with low margins. We can't spend a ton on pre-analysis, redundancies and so on.

Instead we've focused on reducing the impact of failures.

We've made it trivial to switch to an older build in case the new one has an issue. Thus if they hit a bug they can almost always work around it by going to an older build.

This of course requires us to be careful about database changes, but that's relatively easy.
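For anyone wondering what "careful about database changes" looks like in practice, here's a rough sketch (Python, everything hypothetical) of the kind of expand/contract rule that keeps an older build runnable after a rollback:

    # Hypothetical CI check over a list of planned migration steps: it refuses
    # anything that build N-1 could not run against after rolling back.
    BACKWARD_COMPATIBLE = {"add_table", "add_nullable_column", "add_index"}
    BREAKS_OLD_BUILDS = {"drop_table", "drop_column", "rename_column",
                         "add_non_null_column"}

    def check_migration(steps):
        """Return a list of violations; an empty list means rollback stays safe."""
        violations = []
        for step in steps:
            op = step["op"]
            if op in BREAKS_OLD_BUILDS:
                violations.append(f"{op} on {step.get('target', '?')} would break rollback")
            elif op not in BACKWARD_COMPATIBLE:
                violations.append(f"unknown operation {op}; review manually")
        return violations

    # Renames are done as add-new-column, dual-write, drop-later, so the check
    # flags a direct rename:
    print(check_migration([{"op": "add_nullable_column", "target": "orders.notes"},
                           {"op": "rename_column", "target": "orders.status"}]))

Just a sketch of the principle: only additive, backward-compatible changes ship with a release, and destructive cleanup waits until no supported build depends on the old shape.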


You can not. AI, though, can be cheap enough to produce that. I wonder what happens if you take a B2B application and let AI rewrite it to nuclear industry/aviation standards in a separate repo. Then on fixes/rewrites the engineers take the "safety-aware repository" as inspiration.


What you're describing is almost exactly the opposite of what LLMs are good for. Quickly getting a draft of something roughly like what you want without having to look a bunch of stuff up? Great, go wild. Writing something to a very high standard, with careful attention to specs and possible failure cases, and meticulous following of rules? Antithetical to the way cutting-edge AI works.


Have you tried using an LLM to write code to any kind of standard? I recently spent two hours trying to get GPT 4 to build a fiddly regex and ultimately found a better solution on Stack Overflow. In my experiments it also produced lackluster concurrent code.


You’ve missed the point. Those standards don’t relate at all to writing code, they relate to process, procedure and due diligence - i.e. governance. Those all cost a lot in terms of man hours.


Exactly. Even without learning from those groups, there's a ton of stuff we know we could do to improve the reliability of our product. It's just that it would take way too much development time and our customers wouldn't want to pay for it.

It's like buying a thermometer from Home Depot vs a highly accurate, calibrated lab thermometer. Sometimes you just don't need that quality and it's a waste paying for it.


Yeah, it costs. That, and the fact that people will accept shite software, makes high quality a fight software companies can avoid. Rationally, therefore, they do.


I don't think that's the right way to reason about it.

I find that I can learn a ton from those industries, and as a software engineer I have the added advantage of being able to come up with zero-cost (or low cost), self-documenting abstractions, testing patterns, and ergonomic interfaces that improve the safety of my software.

In software, a lot of safety is embodied in how you structure your interfaces and tests. The biggest cost is your time, but there are economies of scale everywhere. It really pays to think through your interfaces and test plan and systems behavior, and that's where lessons from these other industries can be applied.

So yeah, if you think of these lessons as "do tons of manual QA", you'll run into trouble resourcing it. But you can also think of them as "build systems that continuously self-test, produce telemetry, fail gracefully in legible ways and have multiple redundancies".
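To make that last sentence concrete, here's a toy sketch (Python, all names invented) of the "telemetry + legible failure + redundancy" style applied to something as mundane as a rate lookup:

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("billing")  # hypothetical subsystem

    def primary_rate(plan):
        # stand-in for the "real" lookup that a bad deploy or bad data can break
        return {"basic": 10.0, "pro": 25.0}[plan]

    def fallback_rate(plan):
        # independent, deliberately conservative redundancy
        return {"basic": 10.0, "pro": 25.0}.get(plan, 25.0)

    def rate(plan):
        try:
            value = primary_rate(plan)
        except Exception:
            log.exception("primary rate lookup failed for plan=%s", plan)  # legible failure
            value = fallback_rate(plan)  # degrade gracefully instead of crashing
        log.info("rate plan=%s value=%s", plan, value)  # telemetry on every decision
        return value

    # continuous self-test: cheap invariants asserted on real code paths
    assert rate("basic") == 10.0
    assert rate("enterprise") == 25.0  # unknown plan falls back conservatively

Obviously a toy, but the structure (instrument every decision, fail loudly into a known-safe path) is where the aviation lessons translate cheaply into software.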


Cars might not be the best example, since human lives are at stake, as in aviation. Unless you work on Tesla's autopilot, it seems. But yes, backups and restores are often good enough.


As it turns out (and as much as we wouldn’t want them to) human lives are still subject to cost/benefit analysis.

An airliner is a lot of lives, a lot of money, a lot of fuel, and a lot of energy. Which is why a lot has been invested in training, procedure, and safety systems.

Cars operate in an environment which is in most ways a lot more forgiving; they're controlled by (on average) low-training, low-skill, non-redundant crews, they're much more at risk of "enemy action", the material stresses are in a different realm, and they're much, much more sensitive to price pressure.

Hell, the difference is already visible in aviation alone: crop dusters and other small planes are a lot less regulated along every axis than airliners are.


I wouldn't say it's simply cost-benefit analysis. It's also scale of accidents.

A whole lot more people die from car accidents, yet there are few reports about them on national news. So fewer people care. Meanwhile each time there is an aviation disaster, hundreds of people die and it's all over the news for weeks. Similarly with train accidents and nuclear accidents: there were only two very large ones, but they still haunt the field to this day, while (for example) the deaths from solar installations by people falling from roofs are mostly ignored.

Large accidents have to be avoided, a lot of small ones are more acceptable.


> I wouldn't say it's simply cost-benefit analysis. It's also scale of accidents.

But that is cost/benefit analysis. When any accident can kill hundreds and do millions to billions in damage besides (to say nothing of the image damage to both the sector and the specific brand), the benefit of trying to prevent every accident is significant, so acceptable costs are commensurate.


I think it goes beyond what you'd expect just from the increased scale putting more lives at risk. Compare our regulatory system for buses and cars, two transportation options that are probably as close as possible to differing only in scale. Buses are ~65x less deadly than cars, and yet we still respond to the occasional shocking bus accident by trying to make them safer.

Which is actually counterproductive! This makes it harder to compete as a bus service, bus lines shut down, and more people drive. I wrote more about this at https://www.jefftk.com/p/make-buses-dangerous and https://www.jefftk.com/p/in-light-of-crashes-we-should-not-m...


There are a fair amount of backups in your car. For example, the braking system is dual. There's also engine braking and the parking brake that can be used. All the "energy absorbing" features are a backup for when you crash.


Any substantiation for "Unless you work on Teslas autopilot, it seems"?

I mean you're implying that there are more accidents with autopilot than without it, right? Seems like quite the claim...


No, I'm implying that the autopilot code has not been as thoroughly tested as it should have been.

Example: https://www.theguardian.com/technology/2023/nov/22/tesla-aut...


Tesla people always try to reduce any critique to some metric on deaths per x.

The fact is, there’s a lot of history and best practice around building safety critical systems that Tesla doesn’t follow.

Additionally, even with the practices they follow, they call a consumer facing product that isn’t really an autopilot “autopilot”, while focusing outbound comms on a beta product that is more like an autopilot, but not available to them.


I agree with most of this but the naming of "autopilot" seems fine. Nobody expects commercial aircraft to fly on autopilot without a pilot's supervision, the same _should_ be true of Tesla vehicles (especially considering their tendency to jump into the wrong lane and phantom brake on the highway etc.)


What matters is what the user of the system thinks because that’s where confusion can be dangerous.

A plane pilot knows very well what the limits of the autopilot are and what the passenger believes is irrelevant.

Conversely if too many/most car “autopilot” users believe it does more than what it really does then it’s dangerous.

In electrical engineering 600V is still “low voltage”. Any engineer in the field knows that so that’s fine right? But if someone sells “low voltage” electric toothbrush or hand warmer no normal person will think “it’s 600V, it will probably kill me”. When you sell something, what your target audience takes away from your advertisement matters. If they’re clearly confused and you aren’t clearing it up after so many years then “confusion” and misleading advertising are part of your sales strategy.


> Nobody expects commercial aircraft to fly on autopilot without a pilot's supervision

Nobody here on HN, because we're really into tech. Outside the tech world, I would guess that 50% of the population thinks that "autopilot" (on any device) means that no human is needed.


Considering Tesla was willing to do unsafe things in visible ways (e.g., the running-stop-signs feature), I have no trust that they are maintaining safety in the less visible ways.


In the context of disasters that happened due to software failures (e.g. Ariane 5 [1]), one of my professors used to tell us that software doesn't break at some point in time; it is broken from the beginning.

I like the idea of thinking 'when' instead of 'if', but the verdict should be even harder when it comes to software engineering because it has this rare material at its disposal, which doesn't degrade over time.

[1] https://en.wikipedia.org/wiki/Ariane_5#Notable_launches


An example of zipper failure in the Airbus incident: when a wire bundle gets cut, all the functions of all the wires in that bundle are lost. Having two or more smaller bundles physically separated would greatly reduce that risk. Certainly, having the primary and the backup system in the same bundle is a bad idea.

On the 757, one set of control cables runs under the floor. The backup set runs in the ceiling.


It’s the same on Airbus aircraft, I can tell you from experience.


I thought Airbus was fly-by-wire, not cables?


It is. I'm talking about redundant electrical wires being physically separated so they don't get damaged by the same event.


What's fascinating about airplane design for me is not the huge technical complexity, but rather, the way it is designed such that a lot of its subsystems are serviceable by technicians so quickly and reliably, not just in a fully controlled environment like a maintenance hangar, but right on the tarmac, waiting for takeoff.


Designing the airplane to minimize required maintenance and to make maintenance and inspections easier and faster is a huge issue for the engineering department. Also make it very difficult for the mechanics to do things wrongly.

As it was pointed out to me, airplanes sitting on the ground are a black hole sucking up money. Airplanes in the air carrying payload (note the "pay" in payload) are making money. Boeing understands this very well, and is very focused on getting that airplane in the air making money as much as possible.


> When my AoA sensor fails, then what?

crickets, let's just randomise which sensor we use during boot, that ought to do it!


> Airlines really want to be able to use pilots' existing type-rating on this hulking zombie of a 60s-era airframe with modern engines but it behaves differently under certain conditions, what do we do?

let's just build a system that pushes the nose down under those conditions, have it accept potentially unreliable AoA data, and not tell pilots about it!


"AoA sensor" - Angle of Attack sensor.

And the reference is presumably to 737 MAX accident. https://www.afacwa.org/the_inside_story_of_mcas_seattle_time...


Epic fail indeed, costing many lives.


I agree in principle, but I don't think industries should be looking at current-day Boeing's engineering practices except for an example of how a proud company's culture can rot from the inside out with fatal consequences.


I think Boeing has had some difficulties. They have also had some undeniable successes. The 777 and 787 programs have no in-service passenger fatalities attributable to engineering errors to date. That's a monumental achievement.


The 787 has no hull losses at all right? And it’s been flying for 10 years now.


An extra safety margin is conferred by the stepladders found in the tailcones :-)


Reminder that this article was about an aircraft built by Airbus.

(Airbus is not Boeing.)


How are aeroplanes designed differently at Boeing vs Airbus? What's the secret sauce?


A pilot once explained to me..

Boeing planes (before MCAS): we have detected a problem with your engines, would you like to shut down?

Airbus planes: we have detected a problem with your engines, we have shut them down for you.


Same way Samsung phones are not Huawei phones? Or BMWs aren't Lexus?


At this point the secret sauce is that EASA isn't tolerating the same degree of certification fucking and laxity from Airbus, and that they generally seem to have their act together.

Like what’s the secret sauce of nvidia vs radeon or AMD vs intel? Reliable execution, seemingly - and this is an environment where failures are supposed to be contained to very specific rates at given levels of severity.

The FAA has gotten into a mode where they let Boeing sign off on their own deviations from the rules. The engine changes forced the introduction of the nose-pusher-down system, which really should have required training, but Boeing didn't want to do that, because the whole point of doing the weird engine thing was having ostensible "airframe compatibility" despite the changes in flight characteristics. And they have become so large (like Intel) that they don't have to care anymore, because they know there's no chance of actual regulatory consequences, nor can EASA kick them out without causing a diplomatic incident and massively disrupting air travel, so they are no longer rigorous, and we simply have to deal with Boeing's "meltdown".

And yes they should be doing better but in the abstract, certification processes always need to be dealing with “uncooperative” participants who may want to conceal derogatory information or pencil-whip certification. You need to build processes that don’t let that happen and nowadays there’s so much of a revolving door that they can just get away with it. Like none of this would have happened with the classified personnel certification process etc - it is fundamentally a problem of a corrupted and ineffective certification process.

This decline in certification led to an inevitable decline in quality. When companies figure out it’s a paper tiger then there’s no reason to spend the money to do good engineering.

The FAA’s processes are both too strict and too lax - we have moved into the regulatory capture phase where they purely serve the interests of the industry giants who are already established and consolidated, and they now serve primarily to exclude any competitors rather than ensure consistent quality of engineering.

The specifics are less interesting than that high-level problem - there obviously eventually would be some form of engineering malfeasance that resulted from regulatory capture; the specific form is less important than the forces that produced it. And that regulatory capture problem exists across basically the whole American system. Why do we have forced arbitration on everything, why are our trains dumping poison into our towns? Because from 1980-2020 we basically handed control of legislative policy over to corporate interests and then allowed a massive degree of consolidation. Not that Airbus is small, but EASA isn't captured by its industry to the extent of most American bureaus.


It's actually safer for new airplane types to have flying characteristics like the previous types. There have been many accidents where a situation happened and the pilot did the right thing for the previous airplane he flew, but it was the wrong thing for the one he was currently flying.

Most of what was written about the MAX crashes in the mass media is utter garbage and misinformation. No surprise there, as journalists have zero expertise in how airplanes work.

Both crashes could have been easily averted if the crews had followed well-known procedures. There was also nothing wrong with the aerodynamics of the MAX, nor the concept of the MCAS system. The flaw was in the way the MCAS system was implemented, and the way the pilots responded to it.

For example, rarely mentioned is the third MAX incident, where the airplane continued normally to their destination. The crew simply turned off the stab trim system.

BTW, I had a nice conversation with a 737 pilot a few months ago. He told me what I had already concluded - the crashed crews did not follow the procedures. I've also had unsolicited emails from pilots who told me what I'd written about it was true.


Everything I wrote is true. The LA crew restored normal trim 25 times, but never thought to turn off the stab trim system. The trim cutoff switch is right there on the center console within easy reach for just that purpose.

The EA crew oversped the airplane (you can hear the overspeed warning horn on the CVR) and did nothing to correct it. This made things worse. They were also given an Emergency Airworthiness Directive which said to restore normal trim with the trim switches, then turn off the trim system. They did not.

That's it.

I'd say half the fault was Boeing's, the other half the flight crews'.

The MCAS is not a bad concept, note that MCAS is still there in the MAX.

Pilots are a brotherhood, and they don't care to criticize other pilots in public. But they will in private.


Everything you said might well be true, and indeed as far as I know it is, but aircraft should not have fail-deadly systems which require lightning reflexes and up-to-the-second training to diagnose and disable fast enough before they crash the freaking plane in the first place. Yes, the pilots of the affected flights might have been able to save the aircraft if their training had been just that little bit better. We'll never know. But the real blame falls squarely on the shoulders of Boeing for shipping such a ticking time bomb in the first place.

Which is why the entire worldwide MAX fleet was grounded for more than a year, and the regulators didn't just mandate a bit of extra training.

Coming up with this narrative about how it's the crew's fault because they failed to disable Boeing's quietly introduced little self-destruct system fast enough to save their own lives was a particularly despicable move from their PR department and I lost a lot of respect for them over that.


It did not require lightning reflexes or up-to-the-second training. The first LA crash came after the crew dealt with it for 11 minutes, and restored trim 25 times. The EA crew restored normal trim a couple times, and crashed after 3 minutes if I recall correctly.

As for training, turning off the stab trim system to stop runaway trim is a "memory item", which means the pilots must know it without needing to consult a checklist. Additionally, after the first crash, all MAX crews received an EMERGENCY AIRWORTHINESS DIRECTIVE with a two-step procedure:

1. restore normal trim with the electric trim switches

2. turn off the trim system

I expect a MAX pilot to read, understand, and remember an EMERGENCY AIRWORTHINESS DIRECTIVE, especially as it contains instructions on how not to crash like the previous crew. Don't you?

> might have been able to save the aircraft

It's a certainty. Remember the first LA MAX incident: the airplane did not crash because, after restoring normal trim a couple of times, the crew turned off the trim system and continued the flight normally. They apparently didn't even think it was a big deal, as the aircraft was handed over to the next crew, who crashed.

> a bit of extra training

They are already required to know all "memory items".

> Coming up with this narrative about how it's the crew's fault because they failed to disable Boeing's quietly introduced little self-destruct system fast enough to save their own lives was a particularly despicable move from their PR department

AFAIK Boeing never did say it was the crew's fault. The "have to respond within 5 seconds" is a fantasy invented by the media. It is not factual.

Both Boeing and the crews share responsibility for the crashes.


> Both Boeing and the crews share responsibility for the crashes.

And I never said they didn't. I just choose to assign Boeing the lion's share of the blame, as they should never have let that rush-job, cost-cutting death trap of a machine take to the skies in the first place.

Anyway, I see you have your mind made up, so there's not much point in arguing further. If you feel like continuing, why don't you take it up with - let's see - every single global aviation regulator, who also somehow came to the conclusion that there was maybe something a little bit wrong with the type.


>Both crashes could have been easily averted if the crews had followed well-known procedures.

I thought that the majority of the problems was that Boeing wanted the same type-rating, so that airlines could avoid paying for training. This resulted in crews not getting proper training and so not knowing the proper procedures ... which was by decision.

Both the airlines and Boeing should take the blame; I don't really see how it would be the pilots' fault, if you lie and say "it's the same plane, it flies the same, you don't need conversion training".

I am not in aviation, most of this is from YouTube sources, so y'know ...


Are you serious in saying that other industries could learn from Boeing?


Glancing at Walter Bright's brief Wikipedia page - I'd say he worked for Boeing well before they succumbed to the McDonnell Douglas Brain Fungus.


He didn't actually say that.


I think many of us are so used to working with software, with its constant need for adaptation and modification in order to meet an ever growing list of integration requirements, that we forget the benefits of working with a finalized spec with known constants like melting points, air pressure, and gravity.


Completely agree - I think it can go one of two ways. Software is more malleable than airplanes are and that also comes with downsides (like how much time and effort it takes to bring a new plane to the market)


I was just thinking of this metaphor today.

Try drawing the software monstrosity you work on / with as an airplane. 100 wings sticking out all different directions, covered with instruments and fins, totally asymmetrical and 5 miles long. Propellers, jets, balloons, helicopter blades.

Yep, it flies.

When it crashes, just take off again.


So software is my son's Bad Piggies flying monstrosity! You only left out the crates of TNT.


The article talks about a piece of software that partially failed, when they needed to calculate the braking distance for the overweight aircraft.


Airliners face constantly changing specifications. No two airliners are built the same.


Do you mean no two individual planes? Like two 767s made a month apart, do you mean they literally would have different requirements?


Yes. There are constant changes to the design to improve reliability, performance, and fix problems, and the airlines change their requirements constantly.


Neat little detail of the world Wikipedia once told me: the 00 suffix of classic Boeing planes, dropped in 2016, was substituted with a Boeing-assigned customer code on registration documents, e.g. a PAN AM 777-300 would have been a 777-321, an Air Berlin Jetfoil would have been a 929-16J, and so on.

1: https://en.wikipedia.org/wiki/List_of_Boeing_customer_codes


I think they mean that airplanes are made in different versions, catered to a particular airline. Also, planes are constantly updated.

Two 767s made a few months apart will have initial differences, like two different versions of the Java 8 SDK.


I think they meant a 737-400 is different from a 737-500 is different from a 787 and an Airbus A320 and an MD-80 and…

Every single model is somewhat bespoke. There’s common components but each ends up having its own special problems in a way I assume different car models in a common platform (or two small SUVs from competing manufacturers) just don’t.


It took hundreds of subject experts from ten organizations in seven countries almost three years to reach that conclusion.

Here at HN we want a post mortem for a cloud failure in a matter of hours.


> Here at HN we want a post mortem for a cloud failure in a matter of hours.

I'll go one further - I've yet to finish writing a postmortem on one incident before the next one happens. I also have my doubts that folks wanting a PM in O(hours) actually care about its contents/findings/remediations - it's just a tick box in the process of day-to-day ops.


Something similar that struck me was that, in early February, Russia invaded Ukraine.

And then, I saw an endless stream of aggrieved comments from people who were personally outraged that the outcome, whatever it might be, hadn't been finalized yet at the late, late date of... late February.


I work at a mid-tier FAANG; our SLA for post mortems is in the 7-14 day range. Nobody seriously wants a full PM in hours.

They may want a mitigation or RCA in hours, but even AWS gives us NDA restricted PMs in > 24 hours.


Apples to oranges


> To be able to zero in on what turned out to be a single faulty part and then trace the entire provenance and environment that led to that defective part entering service speaks to the robustness of the industry.

And to be able to reconstruct the chain of events after the components in question have exploded and been scattered throughout south-east Asia is incredible.


My impression was that the defective part was still inside the engine when it landed.


Makes it even more impressive: the parts that were actually implicated in the explosion itself (and scattered from the aircraft) were not defective, so the investigation had to go through parts which did not seem to have exploded in order to track down the defect.

Or at least, I assume the turbine parts weren’t defective, although given what seems to be quite a happy-go-lucky approach to manufacturing defects in Hucknall, maybe my assumption is not made on solid grounds…


Probably a reference to other incidents. Shout out to the NTSB for fighting off alligators while investigating this crash... https://en.wikipedia.org/wiki/ValuJet_Flight_592


Aviation is great because the industry learns so much after incidents and accidents. There is a culture of trying to improve, rather than merely seeking culprits.

However, I have been told by an insider that supply chain integrity is an underappreciated issue. Someone has been caught selling fake plane parts through an elaborate scheme, and there are other suspicious suppliers, which is a bit unsettling:

"Safran confirmed the fraudulent documentation, launching an investigation that found thousands of parts across at least 126 CFM56 engines were sold without a legitimate airworthiness certificate."

https://www.businessinsider.com/scammer-fooled-us-airlines-b...


Admiral Cloudberg has covered a case where counterfeit or EOL-but-with-new-paperworks components were involved in a crash.

https://admiralcloudberg.medium.com/riven-by-deceit-the-cras...


I suspect this is precisely what is happening in Russian civil aviation now. No legit parts supplied, so there will be a lot of fake/problematic parts imported through black channels.


The Checklist Manifesto (2009) is a great short book that shows how using simple checklists would help immensely in many different industries, esp. in medical (the author is a surgeon).

Checklists of course are not the same as detailed post-mortems but they belong to the same way of thinking. And they would cost pretty much nothing to implement.

Also CRM: it's very important to have a culture where underlings feel they can speak up when something doesn't look right -- or when a checklist item is overlooked, for that matter.


Yes, but they do have one critical failure mode: that the checklist failed to account for something (or that an expected reaction to a step being performed didn’t occur).

I was a submarine nuclear reactor operator, and one of my Commanding Officers once ordered that we stop using checklists during routine operations for precisely this reason. Instead, we had to fully read and parse the source documentation for every step. Before, while we of course had them open, they served as more of a backstop.

His argument – which I to some extent agree with – was that by reading the source documentation every time, we would better engage our critical thinking and assess plant conditions, rather than skimming a simplified version. To be clear, the checklists had been generated and approved by our Engineering Officer, but they were still simplifications.


If the alternative to the check list is reading the full documentation, that's one thing. But in my experience -- as a Software Engineer, and random dude on the Internet -- the alternative is usually no check list or documentation.


For sure – short of large and well-supported projects like Django et al., docs are notoriously incomplete if present at all.

Even then, you have to get people to read them, which is somehow a monumental task. Docs? Nah, lemme read this Medium blog instead.


Checklists are great if you use them properly: to make sure you remember. Checklists are dangerous when they are used improperly: to replace or shut-down critical thinking.


A colleague of mine came from a major aviation design company before joining tech and said they were in a state of culture shock at how critical systems were designed and monitored. Even if there are no hard real time requirements for a billing system, this guy was surprised at just how lax tech design patterns tended to be.


If 200 people died after a db instance crashed, software would be equal in that regard.


As evidence of this, software that deals with medical stuff is somewhat more like aviation.


Also, aviation and software aren't orthogonal. E.g., the article mentioned that part of the reason the pilot was able to sustain a very narrow velocity window between stall and overrunning the runway was because of the A380's fly by wire system.


Yep. Insulin pumps can kill their owner and the software updates need to be FDA approved:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4773959/


Likewise, in "aviation" when the entertainment system completely fails in a 4 hour flight, there is most like no post mortem at all. They turn it off/on again just like most of us.


This is true in a lot of industries. Unless there’s 7+ figure costs or significant human losses, there’s usually not an exhaustive investigation to conclusively point to the exact cause and chain of events.


Some people who think this is ideal for any sort of software tech sound like they would also want a 3-hour post mortem with whoever designed the room after slightly stubbing a toe.


This kind of makes sense, but it is only possible because of public pressure/interest. Many people are irrationally emotional about flying (fear, excitement etc.), that's why articles and documentaries like this post are so popular.

On a side note, that's also why there's all the nonsense security theater at airports.


> robustness has less to do with the number of mistakes than with how one responds to them

It must have something to do with the number of mistakes, otherwise it's all a waste of time!

It's all well and good responding to mistakes as thoroughly as possible, but if it's not reducing the number of mistakes, what's it all for?


> It must have something to do with the number of mistakes, otherwise it's all a waste of time!

Not really. Imagine two systems with the same number of mistakes. (Here the mistakes can be either bugs or operator mistakes.)

One is designed such that every mistake brings the whole system down for a day with millions of dollars of lost revenue each time.

The other is designed such that when a mistake happens it is caught early, and when it is not caught it only impacts some limited parts of the system and recovering from the mistake is fast and reliable.

They both have the same number of mistakes, yet one of these two systems is vastly more reliable.

> if it's not reducing the number of mistakes, what's it all for

For reducing their impact.


Aerospace things have to be like this or they just wouldn’t work at all. There are just too many points of failure and redundancy is capped by physics. When there’s a million things which if they went wrong could cause catastrophic failure, you have to be really good at learning how to not make mistakes.


> you have to be really good at learning how to not make mistakes.

Not exactly. The idea is not not making mistakes, it's whatcha gonna do about X when (not if) it fails.


> Being an SRE at a FAANG and generally spending a lot of my life dealing with reliability, I am consistently in awe of the aviation industry. I can only hope (and do my small contribution) that the software/tech industry can one day be an equal in this regard.

There's a slight difference in terms of what kind of damage an airplane malfunctioning causes compared to a button on an e-commerce shop rendering improperly for one of the browsers. My point is that the level of investment in reliability and process should be proportional to the potential damage of any incidents.


I agree, and I also enjoy the attitude. While in my profession the postmortem's goal is finding whom to blame, here the attitude is towards preventing it from happening again, no matter what. Or at least that's how I feel.


Your profession? Or you mean your company? Unless it's a very specific profession I would not know, it would usually imply that the company is dysfunctional.


Richard Hipp talks a lot about how SQLite adopted testing procedures directly from aviation.


> I can only hope that the software/tech industry can one day be an equal in this regard

I’d love to be an engineer with unlimited time budget to worry about “when, not if, X happens” (to quote a sibling comment).

But people don’t tend to die when we mess up, so we don’t get that budget.


Hard agree. Civil & mechanical engineering have a culture and history of blameless analysis of failure. Software engineering could learn from them.

See the excellent To Engineer is Human on just this topic of analyzed failures in civil engineering.


To a half-competent machinist or manufacturing metrologist, half a millimetre of concentricity error on a part of that size might as well be half a mile. It's a huge, grievous error that can be seen with the naked eye. You don't get an error of that scale through normal variation, it's a clear sign of a serious problem with your setup.

This part of the article really leapt out at me:

The tolerance for this bore was supposed to be Ø 0.05 mm according to the design drawings, but was changed to Ø 0.5 mm in the manufacturing drawings without explanation. Even so, the non-conformance on the accident hub was between Ø 0.90 and Ø 0.98 (an offset of 0.45–0.49 mm), which should have been flagged by the machine. The CMM records from the accident hub were not retained, so it was not possible for investigators to confirm that the error was actually registered.

The meaning might not be obvious if you've never worked in a machine shop, but it's crystal clear if you have. Many people at that plant knew that they were delivering out-of-spec parts. Everyone who handled that part could have told you at a glance that the counterbore was badly off-centre. Rather than going back to remake the parts, rather than figuring out why the parts were bad, they just went through the motions of QC, shipped them anyway, falsified documentation and discarded evidence. For all the complexity of the analysis, the root cause is blindingly simple - flagrant negligence, concealed by flagrant deceit.


The article said it wasn’t visible because that stub was machined after it was placed in the hub. Which begs the question “why would you weld a tube in place and then finish machining it after?” Maybe it was easier/faster to machine it while it was on a hub. Also, wasn’t there an oil filter that had to go in there? Wouldn’t the oil filter experience interference if the counterbore was offset?

Closing comment: damn I thought people paid more attention when building turbines.


Yes, but what the poster meant is that it would be visible, and that is confirmed in the images.


30 years ago I was in an emergency landing due to engine failure situation (flight attendants take away your shoes, practice crash position, rearrange the passengers etc) and the thing that stuck out the most for me was that everybody did as they were told. No self righteous people; it was clear to everyone why there are flight attendants aboard and that they were key to your survival. The evacuation was orderly, though the follow up was lengthy (e.g. everybody’s passport was still on board).

More recently I’ve seen pictures of people evacuating down the slides with their luggage! Seems incredibly dangerous, not just for the slide experience but in slowing down evacuation. We had no fire in the cabin but what if we had?

Oh yeah, you know the stereotype of the press sticking their camera in your face to see how freaked out you are? It does happen in real life.


You’re not supposed to take anything on the slides. No luggage. No shoes. Just you.

But it is ignored. Which is sad, people could really get hurt.

You're right though, the fact that as many people comply as they do is kind of incredible given how people act in other situations.


Yeah, according to the linked article 5 - 10% of people are injured using the escape slides, which is why they waited for the stairs in this case.


They took our shoes away, so that was that. According to a parallel reply, they no longer do that.


Why in the world do you have to take your shoes off before going down the slides? I could understand jackets or jewelry, but shoes?


As silly as it might seem, you do something enough times and oddball rare things happen .. this is an instruction intended to reduce:

* shoes | boots with sharp objects embedded in soles (glass, bent nails)

* extra spikey high heels,

* work boots with hard edged metal hooks for laces,

(etc) causing damage to both inflatable slipways and to other passengers.

How often has a passenger going down an emergency slide caused a rip that deflated that slide?

Not very often .. and aircrew are taught to issue instructions that make that as unlikely an occurrence as possible.


Also, try to swim/stay afloat with shoes ... Apart from young athletes, most people will drown within a minute


I dunno — when I go camping by canoe, I keep my hiking boots on all the time (paddling, portaging, and yes, when having a swim during a break for lunch or after making camp). A disabling injury could be fatal.


Are you "young" and "athletic", by any chance ?


And if something gets caught on the slide as you go down you could fall a dozen or more feet onto hard asphalt. Friend fell on a slide and got a compound leg fracture.


Many shoes have hard, sharp parts that could damage the slide, even to the point of complete deflation. There is no time to assess whose shoes would be safe and whose not, so the blanket rule is "no shoes".


High heels are not OK, for obvious reasons. Regular shoes are fine.


They confiscated all our shoes. Crashing into someone at the bottom with shoes could be a problem too.


Huh. That sounded wrong so I googled it. I thought it was all shoes.

You’re right. What I said above used to be true. That seems to have been questioned in the 90s and in 2000 the FAA finalized a rule changing it.

The current recommendation (https://www.faa.gov/travelers/fly_safe/information) say you can keep your shoes on but to remove high heels, as you said.

A bit of googling says it was changed because of passengers injuring their feet on the terrain/debris after crashes. Additionally modern slides are much tougher than they used to be and won’t tear from shoes and probably even high heels.

But I bet high heels are probably not a smart thing to be wearing on possibly uneven debris covered terrain in an emergency when you need to move fast and safely.

Learn something new every day.


Not just high heels, but also many boots have sharp protrusions (e.g. lace hooks on some hiking boots and work boots, metal decorations on goth and cowboy boots)


Ahh that makes way more sense. Thanks.


I was in a hotel fire evacuation once and the stairwells were all blocked because everyone brought every piece of their luggage with them.


Disgraceful.


People evacuate with their luggage because in times of high stress, we fall back on habit. What do we do when it's time to leave an aircraft? We make sure we have all our belongings with us!

That's just one reason why it's important to listen to the safety briefing, even if you've heard it before. The repeated drill helps us to remember what to do, even when there's added stress.


I don't really think the phenomenon is anything other than people selfishly wanting their belongings to be saved over another passenger's life.


This seems a bit too misanthropic. The risk-reward is more nuanced. If I take my stuff, odds are everybody will still be okay, and I'll have my stuff.


I’ve always wondered what happens after an emergency landing. Do you just kinda sit there and wait for bags and personal belongings to be offloaded? And then wait for another flight out?


Honestly, each and every one of those people should either be charged with reckless endangerment, put on a no-fly list, or both. It really pisses me off when I see that. F**ing entitled idiots.


I remember there was this video of a plane in Russia that was on fire; multiple people died. And you see people walking away with their luggage. I can't help but think people would still be alive if it wasn't for those who so urgently needed their suitcases.


People don’t follow the rules when they don’t trust their government for providing sane rules.

Case in point: you provide an example from Russia. Another example is Covid.


People don’t follow the rules because they are self-centered assholes who believe they are the main character of a movie, and they value their own personal convenience and comfort over the lives of other people they see as NPCs.

People aren't taking their luggage with them during an airplane fire because of their distrust of The Deep State.


Would it be possible to make an even more negative interpretation of this comment? I don’t think so.

Your irritation (to the extreme, I mean you just linked the deep state with assholes) is probably due to powerlessness in fixing the issue, let alone interpreting it properly.

However, it is still a widely-acknowledged idea that an unhinged state like the former-USSR-newly-Putin-land makes people afraid of not being able to feed themselves the next day.

I don't suppose you oppose the idea that when the state organizes something, civil servants' interests are so badly aligned with the proper execution of the plan that it's a shitshow every single time. And with entirely incompetent people in power, one's better off doing the opposite of what regularly-proven liars tell them to do.

If you’re upset, is it because you envision lying as a normal form of citizen management? How’s that going for your camp? Do you feel that people trust you?


I'm not saying that there aren't people out there motivated by government distrust--I'm sure there are, maybe more in Russia--I don't know, I'm not Russian.

I'm saying that's not the common case. If you pick 100 random people out there who are disobeying some safety rule, where the outcome is greater convenience for themselves, I'd be willing to bet money that at least 90 of them are doing it out of selfishness and not some principled distrust of the government.


Oh great.

Fire creates smoke. Smoke quickly makes people unconscious, then kills them. It doesn't matter whether you trust the government or not, whether you follow the rules or not, whether you hold your luggage or not; if you can't get out of the smoke, your fate is pretty much set in stone. Putting the blame on people who never had the time in the first place, and couldn't supernaturally turn to liquid and go through the door all at once, is like telling those who were robbed that they could've trained themselves to run faster.

Although you can argue that Titanic could survive, if only had it blasted bass-boosted <anthem of a proper country>.


My first job was working at an MRO (maintenance, repair and overhaul shop) that overhauled engines a bit smaller than the Trent 900s, but the same principles apply.

I built QA software to digitize the forms and signature process, like what's mentioned in the article as having not been correctly signed off on.

I ate lunch with repair engineers that had dark wells of knowledge about the engines they worked on. They could talk so deep on a subject that lunch break was over and we’d resume conversation over weeks.

There's a paragraph in this post that hits a few points that are very subtle. The missing sign-offs and engineers not knowing the process and and and. I think the criticism of RR is valid here. The QA manager at the MRO I worked at was a force of nature. He was feared and uncompromising. He was also the signature that could cause an engine shutdown in flight. I admired this person and still do.

There’s small issues like this that go on every day on every engine model all over the world. There’s thousands of engines flying right now that have little defects that could cause a shutdown. There’s issues that have been identified, signed off as low risk and will be checked next time the engine comes in for overhaul.

There’s engineers out there that see the same fault, a premature cracked pipe, carbon buildup, abnormal corrosion, after a while of seeing this problem, they’ll raise the paperwork which will go up the chain and sit. It may be ignored, taken for information for future designs, identified as something that should be fixed or monitored or the frequency of monitoring increased. Maybe the part life will be reduced or you will be forced to NDT the part at each overhaul.

The Swiss cheese model is great, as these systems are so complex there are always going to be some issues.

As for Qantas, near the end it mentions the plane was repaired at great cost. It’s a source of company pride that they’ve never lost an airframe. They repair planes which are BER (beyond economic repair) just to keep this record.


> As for Qantas, near the end it mentions the plane was repaired at great cost

Indeed. Qantas has been ranked the safest airline in the world almost every year since forever [1].

I clearly remember when QF32 happened and everyone was utterly shocked. That simply DOES NOT happen to Qantas.

[1] https://www.forbes.com/sites/laurabegleybloom/2023/01/03/ran...


QANTAS has, for the last 10+ years, had a CEO who was not part of this culture and did everything he could to drive costs down. He laid off huge swaths of engineers, outsourced key maintenance contracts to the lowest bidder, and left the airline with an aging fleet that needs billions spent to replenish. He was recently fired by the board for essentially destroying the reputation of the airline within Australia, with their practice of cancelling flights at short notice, illegally sacking thousands of staff during COVID, and taking hundreds of millions of dollars from the Australian government to keep staff employed during the airline's grounding during COVID and handing it all to shareholders.

It is a situation very similar to the downfall of Boeing.


The destruction of Qantas as a quality airline is entirely driven by exactly the same MBA/shareholder-value bullshit that destroyed Boeing and others.

Financial engineers should be banned from operating businesses. They are not focused on the quality of the business, from which profits are derived. They work backwards from their financially engineered results to drive down "costs", even if those "costs" are entirely essential to the operation of the business.

Qantas (and its subsidiary Jetstar) are having to recover their engineering, customer service, and other "costs" to actually achieve the operating business that their expensive tickets require. Currently they are being priced out of operating in Asia, not because they have too expensive operations, but because their board and CxOs were entirely driven by shareholders, not the ongoing operation of the business.


Agreed. I've worked in a company that was AS9100 certified, and pretty much the first things a quality auditor would have wanted to look at would be non-conformances and concessions. With that number of missing signatures we'd have been skinned alive, and it would likely have prompted the auditor to then turn the place upside down looking for more problems.

That would then have produced major failings in the audit, if not the outright revocation of the quality accreditation, which I would then expect to be followed up on by an audit from the customer (which in the case of TFA would be Rolls Royce), asking some rather uncomfortable questions of the management, examining whether the inter-company concession process was being adhered to, and perhaps reflecting internally (i.e. within RR) - "Do we think these folks are the right people to be making these parts for us?"

From what I've read here it seems to me that Rolls Royce were astonishingly lax in not riding their subcontractors nearly hard enough, quality wise.


I had a small experience with RR as a company through a contract. Including some time spent in Derby.

The things I saw left me questioning how any innovation could happen at all in there, why we did not have a much higher rate of fuck-you-shima per year, or how the hell plane engines are not exploding daily.

IIRC the B777 engine controllers are still m68k. Discontinued in 1995.


> IIRC the B777 engine controllers are still m68k. Discontinued in 1995.

That seems sensible? You'd need a really compelling reason to rewrite the entire control software and recertify the engine to match. Especially for an engine which has seen no orders in 15 years.


The planes are still in service and need new engines and even existing engines require spare parts.

What I heard was that there was quite a scramble to buy up all existing supply and also talk some alternate manufacturers into continuing production at a low rate.

The B777 was introduced in 1995. Having an engine controller that is obsolete and no longer available at the moment it is launched seems a bit shortsighted to me.

Then again it works, the planes are flying in the end it's fine.


> The B777 was introduced in 1995. Having an engine controller that is obsolete and no longer available at the moment it is launched seems a bit shortsighted to me.

First, in 1995 Motorola stopped development of the ISA; that says nothing about chip manufacturing, which is what RR or airlines would care about. TI launched the 68k-powered TI-89 three years later, and only switched away with the TI-Nspire CAS in 2007. The 68k-powered PalmPilot launched in 1997.

Second, the early 90s were a time of flux for ISAs and you could not necessarily know the plans of your provider; the 68k probably looked quite reasonable when RR started developing the Trents in the mid 80s. RR launched the 777's Trent 800 in 1991. And even after that, 68ks powered much of the hardware of the early 80s and early 90s.


All sorts of ancient architecture chips are still being made. 6502, z80, 8081, 386


I was on the flight and took the picture referenced as "A passenger took this photo in flight, showing turbine fragment exit holes in the upper surface of the wing. (ATSB)" I forced myself onto another A380 flight shortly after so I wouldn't lose faith in its engineering safety.


Wow. I was (long ago!) in an engine fire emergency landing situation and though I did take a connecting flight to get home I didn’t fly for a while afterwards. Psychologically, your choice was probably the smarter one.


I've been in a couple situations.

- The main one was that I had a flight from Vancouver to Victoria and the weather was too bad for the helicopter to fly. So we took a prop. On takeoff, some cross-wind hit the plane and we tipped over. My colleague and I who were sitting across from each other thought that was it.

- The other one was that my plane was reported crashed when I was visiting my parents for some holiday or other. I got a panicked call on the drive back from the airport.


> On takeoff, some cross-wind hit the plane and we tipped over.

I had a near tip-over coming out of DIA years ago. DIA gets very windy. We were nearing lift-off speed when a gust of crosswind hit the plane. Looking out the window I thought for sure the wing was going to hit the ground, but in that moment the pilot seemed to shift from a standard takeoff to something that felt much more vertical. Once we were airborne the flight attendant, who looked a little shaken, came by and offered me a free drink.


Hopefully without incident that time?


Thankfully yes! I lived in Singapore at the time and thought... my goodness. It's a small island. If you end up afraid of flying, what do you do!?

Kudos to the Qantas crew on board as well as Captain de Crespigny and his co-pilots and two check captains. We happened to have a lot of experienced pilot power on board.

A video from that time: https://youtu.be/U8Un2boLZD8


Good on you!


The article is complex and well written, but I am a bit perplexed by the victorious tone and never-ending praise of safety. It resembles a sales pitch a bit too much, even though no one is selling anything. Maybe it's unintentional, and being around salesmen just does that to people.

If you are like me, you've probably said “hmm…” to yourself multiple times when certain things were mentioned, because those were things that actually didn't work (that they were left intact really boosts the credibility of the author). From calculation software that had never been tested with out-of-the-ordinary data, to the computer keeping the broken engine running. From pure luck with the fuel tanks being almost full and unable to explode, to the absence of any physical kill switch to stop the engine. An hour being generously available to go through ALL the checklists to clear the notifications. An hour of passengers and crew sitting on top of a puddle of fuel hoping that nothing would ignite it. Finally, pure randomness in the debris flying the way it did. It's not a story of “layers of safety” overlapping, it's a story of “layers of randomness” overlapping.

What would be really interesting is a distribution of outcomes for all possible trajectories of debris, i. e., how (un)lucky they actually were. I guess corporations don't release models like those to the public.

Also, that special chamber for oil filter requiring precise drilling of a perfectly fine pipe seems “ewww” to me. It is not serviceable anyway without reinstalling everything from scratch, as far as I understand, why not make it a single piece?


The author is positive because of all the safety layers that existed and stayed intact, despite how flawed humans and companies are. The culture of looking at previous accidents like UA232, where they lost an engine and ALL controls with it, meant the A380 control system was engineered to take even more damage, and it worked.

I do agree though it did not spend enough effort focusing on the areas to improve:

- A computer-controlled engine that runs for 60 seconds while on fire, and lets a dangerous part spin too fast. It seems like something that should have been covered ahead of time.

- An engine manufacturing process that is so complex it’s almost impossible to validate.

- A fault management system that only shows you 1 or 2 at a time when you have 40.


> - A fault management system that only shows you 1 or 2 at a time when you have 40.

As long as the system prioritizes the warnings/cautions with the most pressing ones shown first, this is a very good thing. In a high-stress situation, you don't want the pilots to have to deal with figuring out which of the 40 warnings need to be taken care of first.
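
If it helps to picture it, here's a minimal sketch of that idea (made-up fault names and priority numbers, not the actual ECAM logic): severity-ranked faults, only the top few shown, the rest queued.

    # Toy sketch of priority-ordered fault display, not the real ECAM logic.
    # Each fault carries a severity rank; only the top few are shown at once
    # and the rest stay queued until the crew clears the visible ones.
    from dataclasses import dataclass, field
    import heapq

    @dataclass(order=True)
    class Fault:
        priority: int                       # lower = more urgent (made-up scale)
        message: str = field(compare=False)

    def prioritized_display(faults, visible=2):
        heap = list(faults)
        heapq.heapify(heap)
        ordered = [heapq.heappop(heap) for _ in range(len(heap))]
        return ordered[:visible], ordered[visible:]

    shown, queued = prioritized_display([
        Fault(3, "FUEL IMBALANCE"),
        Fault(1, "ENG 2 FIRE"),
        Fault(2, "HYD G SYS LO PR"),
        Fault(4, "AUTO THRUST OFF"),
    ])
    print([f.message for f in shown])   # ['ENG 2 FIRE', 'HYD G SYS LO PR']
    print(len(queued), "faults still queued")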


…none of which happened. Checklists are not made for “prioritization”. Checklists are not made for “high-stress situations”. They simply had to do that because that was the intended way to diagnose a complex black box. If you don't have an hour to hang in the air, bad luck. It's an obviously unusable model of operation, and you praise it for being good… because someone said it's good?


We're not talking about checklists. We're talking about the ECAM warnings/cautions/advisories display. It's a well-known fact that overwhelming human operators with large amounts of information all at once is a bad thing -- even just in aviation, there are numerous examples. That's why there's a 'sterile cockpit' rule that the FAA enforces: distracting pilots with useless, extraneous, or not immediately actionable information causes, on average, worse outcomes. Checklists largely come into play once you start acting on the ECAM warnings.

Also, as the article says, the pilots did their job following the aviate, navigate, communicate mantra. They first made sure they had the appropriate time to follow the checklists, and only then did they proceed to follow them.

There's over 100 years of aviation experience backing many of these procedures and approaches to dealing with problems. Many are hard-won with literal blood and lives.


The situation can be described simply as «no one had expected such a grand connectivity failure to happen, so The Computer and The Manuals were not as helpful as they could be in finding out what worked and what didn't». That's it. Why are you coming at me like you are a manager with 50 volumes of printed bureaucratic runaround under his belt?


> The situation can be described simply as «no one had expected such a grand connectivity failure to happen, so The Computer and The Manuals were not as helpful as they could be in finding out what worked and what didn't». That's it.

Did we read the same article ogurchik? This situation was not simple, and the computer and manuals were as helpful as they could be given the unknown situation.

All I was trying to point out was that your assumption about the ECAM system may be ignoring some of the reasons why it works that way. No need to be a salty pickle about it.


I suspect the ECAM only showing a couple of failures at a time is a design feature, not a flaw, to prevent overwhelming the crew as they work through them


> the computer keeping the broken engine running

That’s on purpose: you don’t want automation deciding on a move as drastic as shutting down an engine. That’s the pilot’s decision.

> absence of any physical kill switch to stop the engine

There is, you shut down the fuel flow with a valve. But that “kill switch” was damaged.

> An hour being generously available to go through ALL the checklists to clear the notifications

Again, pilot decision to do it if time is available. Isn’t it safer that way?

> pure randomness in debris flying the way it did

Well that’s the nature of the failure. It’s like complaining that which HDD fails in a datacenter is random.

> outcomes for all possible trajectories of debris,

Yes, it’s not public data, but all possible trajectories are analyzed at the design stage, and structural and systems components are kept segregated accordingly.


I'm not an idiot (citation needed). I can see that a storm unplugging some imaginary tiny heartbeat cable, which in turn shuts down all the engines instantly, is not how planes should operate. What I don't understand is the approach to defend status quo, and pretend that “randomness is now conquered”.

It seems to me that fixing one complex problem creates 10 other complex problems. They can be rare, but it's ignorant to shift focus from them.


I've read dozens of Admiral Cloudberg articles, and when you do so you notice a pattern: in old aviation crashes, a single error or a single part failure usually took down a plane with tens of dead bodies. There's also the story of how and why the sterile flight deck rule started, in response to crashes where the pilots were distracted talking. In modern aviation accidents, that seems very unlikely. Even with an engine exploding and the pieces ripping through half the cables, a wing, the fuel tanks, and the hydraulics, the airplane is still almost perfectly flyable and landable. Do the same to any car, where nothing is redundant, and let's see how well it performs.

The beauty of it is that everyone in aviation seems eager to learn and build on errors. This event prompted new actions that makes future flying even safer, despite having no victims.


That's the problem. Even if there were victims, one could've written the exact same article about “flying even safer”.


The victorious tone comes in my opinion (though I'm projecting a bit) from this graph[0].

There has been very systematic and deliberate effort to better aviation safety DESPITE commercial pressures.

The Swiss cheese model means that many more layers of randomness have to line up. Many of those layers came from previous accidents; those layers are not random at all. Also, none of those layers are hole-free.

If that disk had disintegrated differently a potentially different set of layers would have applied. Would it have meant fatalities? Possibly. Would it have instantly blown up the plane? We don't know.

But it is pretty obvious that had many of those layers not existed then the chances of a much more disastrous outcome would have been much higher.

[0] https://upload.wikimedia.org/wikipedia/commons/e/ef/Fataliti...


And on other aviation systems we do examine multiple failure modes. For example, a round going through the fuselage of an Apache, tumbling and smashing and causing spalling, across thousands of simulated trials. Then coupled physics models that look at dozens of unintended interactions: avgas squirting out onto electronics, hot manifolds, etc.

There's a whole field of Fault Tree Analysis that looks at how adjacent faults can propagate into unrelated components, then Event Tree Analysis to determine what will happen next. Models that assess robustness against failures even when we have no idea how the failure will occur.

Reliability of cyber physical systems is a constantly evolving field, lots of recent work on concepts like probabilistic model checking, ML for anomaly detection, resistance to cyber attacks, and so on.
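
For anyone curious what that looks like in miniature, here's a toy fault-tree calculation, assuming independent basic events and entirely made-up probabilities (nothing to do with any real engine); real FTA tools also handle common-cause failures and dependencies.

    # Toy fault tree: AND gates multiply probabilities, OR gates combine them,
    # assuming independent basic events. Skeleton only.
    def AND(*p):
        out = 1.0
        for x in p:
            out *= x
        return out

    def OR(*p):
        out = 1.0
        for x in p:
            out *= (1.0 - x)
        return 1.0 - out

    # Entirely made-up per-flight-hour probabilities, for illustration only
    oil_fire               = 1e-7
    overspeed_protect_fail = 1e-3
    hits_wiring_run        = 1e-2
    hits_fuel_line         = 5e-3

    disk_burst = AND(oil_fire, overspeed_protect_fail)
    top_event  = AND(disk_burst, OR(hits_wiring_run, hits_fuel_line))
    print(f"top event probability ~ {top_event:.1e} per flight hour")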


There is more than one way to interpret this history of “triumph of technology and human mind”, yada yada.

This flight can be seen as an expensive (thrilling, entertaining, newsworthy, etc.) experiment on live subjects whose outcome was not controlled by existing tools and procedures.

The same for everything before to which it is compared so lightheartedly.

Please don't forget that your image shows a giant graveyard.


Looking at your other comments it seems that you are just arguing out of habit or stubbornness so there is not much point in trying to point out aspects that might bring nuance to it.

Have a nice day.


That this plane was maneuverable despite a massive engine explosion that took out 65% of its roll control surfaces is absolutely a victory of the engineers of that aircraft. I was shocked when I read that.

Sheer dumb luck was certainly involved. Those discs could have cleaved the plane in half, to say nothing of the humans in their way, but somehow missed most of the plane entirely. We definitely need to count every single one of those blessings. It's hard not to be positive when such an episode ended with zero fatalities, zero injuries even.


To me it’s impressive because presumably shards of debris cutting through so many distinct parts of the plane at the same time like this is a rare thing compared to more localized failures which the plane would be designed for. Yet all the different failsafes still worked enough to get the plane safely to the ground.


It is very common and encouraged to add a "What went well" in post mortems. This is not a pat yourself on the back moment. It is to reflect on what failed and what didn't.


I guess it's a glass half full type situation. There's a lot of universes where that plane did not make it back and a lot of decisions aligned to ensure that it did.


They do have multiple kill switches to stop the engines, up to dumping a bunch of flame retardant into it which makes it impossible to restart. The problem was that all these systems for the #1 engine were rendered inoperable by the damage caused by the failure of the #2 engine.

Certainly there was a fair bit of luck involved as well.


It may be a cliché to call someone a "national treasure", but I would take it a step further for Admiral Cloudberg: she is a world treasure.

Kyra has written so many great articles under her nom de cloud. Trust me, just pick any of them and you will learn something.

https://news.ycombinator.com/from?site=admiralcloudberg.medi...


there's a video podcast, too, which they should put on TV instead of whatever overdramatized claptrap is on there now


there are some crazy talented pilots out there who are able to perform under massive amounts of pressure. United Flight 232 is a more extreme version of this article's story

https://en.wikipedia.org/wiki/United_Airlines_Flight_232

>Despite the fatalities, the accident is considered a good example of successful crew resource management. A majority of those aboard survived; experienced test pilots in simulators were unable to reproduce a survivable landing. It has been termed "The Impossible Landing" as it is considered one of the most impressive landings ever performed in the history of aviation

plane lost all hydraulics and had to be steered and crash landed using only the engines


Errol Morris made an exceptional documentary about UA232. One of the pilots just looks into the camera and tells the story. https://www.youtube.com/watch?v=nf33RDu_D6M


Not just any camera - an Interrotron!


That is an amazing story, thanks for sharing it. This part leapt out at me:

> Rescuers did not identify the debris that was the remains of the cockpit, with the four crew members alive inside, until 35 minutes after the crash.

I can't imagine spending a half hour waiting to be rescued, not knowing whether any of your passengers had survived.


Article by the same author as the submitted one on this: https://admiralcloudberg.medium.com/fields-of-fortune-the-cr...


I’m only aware of one other incident of an aircraft landing after loss of hydraulics.

https://en.m.wikipedia.org/wiki/2003_Baghdad_DHL_attempted_s...


I am addicted to a fault to Mentour Pilot's studies of flight incidents. Again, here, he goes into greater depth:

https://www.youtube.com/watch?v=JSMe1wAdMdg


I like Mentour Pilot but the outcome of the incident is only revealed at the end.

Admiral Cloudberg's articles are more like Columbo: they start with what happened, and then go back in time to find out and explain all the little details that caused it. In a way it's much more logical that way.

Mentour Pilot constantly has to say "remember this, it will prove important later". But we don't know why it's important, and so we don't remember, and as a result the narrative is much less clear.


I was going to mention him! I found his channel in the last year and have loved watching his coverage, especially from the point of view of a pilot.

If you like this article you’ll also likely like the show Air Disasters too (also known as Air Crash Investigations and Mayday, depending on where you are). It goes into a lot of detail based on crash reports without sensationalizing things too, though not quite as far as this article.


Another great channel is "Green Dot Aviation". I think he's the best in class, personally. Him and Admiral Cloudberg are the best aviation content out there.


One thing that jumped out at me was the narrow range of safe airspeeds on the landing approach--only three or four knots between stalling and the maximum speed at which they could avoid overrunning the runway. Quite a good piece of flying to get the plane down safely, not to mention all the other things the crew had to do.


Yep, they were very heavy and needed to land essentially at stall speed - which they basically did seeing as the stall warning chimed in moments before touch down - in order to allow for as much space as possible to stop the plane. I took from the article that their calculations were kind of hacked together with a number of overrides, so I guess they erred on the side of caution in case any of the assumptions needed a margin of error.

Amazing article. So well written. Kudos to the Qantas flight team, especially the pilot - they know their stuff for sure. And also kudos to the Airbus engineering team, that was such an epic win for redundant systems.

(It was interesting to see how stopping calculations were improved as part of the post mortem, for one.)


> especially the pilot

Worth noting that there was an unusual flight crew: 3 captains (one to check the captain's proficiency, and another to check the checker's proficiency) plus the first and second officers.


Plus the off-duty one upstairs watching the tail camera on the entertainment console. Article says 140 years of combined experience between them which is more than impressive. Airbus really couldn't have hoped for a better crew for this to eventually happen to.

One of my favourite things about the A380 is that in-flight live feed from the tail, surprised more planes don't do it. Offers visual detail of the entire topside and a lot of information that might not otherwise be available.


This is definitely arm-chair quarter-backing, but wouldn't ground effect allow for a lower stall-speed?


> wouldn't ground effect allow for a lower stall-speed?

Slightly lower, yes, but since, as the GP pointed out, the stall warning sounded just before touchdown, it looks like their calculations already took that into account.


Coincidentally I just finished reading the self-authored book ("QF32"), the pilot's own account of the day. The book leads in with many interesting life experiences that led him to make so many good life-and-death choices that day.


My internal alarm bells started going off as soon as I read about datum AA and datum M. Shouldn't it be possible, if not standard practice, for the design software to issue a giant warning if you have a part that is defined by two datums that are almost but not quite the same? If they aren't the exact same datum then something like this will inevitably happen.
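
A rough sketch of what such a check might look like, assuming (hypothetically) that the CAD system exposes datum positions and the tolerances tied to them; the numbers below are illustrative only, not the real datum AA / datum M geometry:

    # Hypothetical lint pass over a part definition: flag pairs of datums that
    # sit nearly on top of each other but carry different tolerances, since a
    # drawing referencing the looser one silently defeats the design intent.
    from math import dist

    def near_duplicate_datums(datums, position_tol_mm=1.0, ratio=2.0):
        """datums: name -> ((x, y, z) in mm, tolerance in mm). Returns warnings."""
        warnings = []
        names = list(datums)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                (pa, ta), (pb, tb) = datums[a], datums[b]
                if dist(pa, pb) < position_tol_mm and max(ta, tb) / min(ta, tb) >= ratio:
                    warnings.append(f"{a}/{b}: {dist(pa, pb):.2f} mm apart but "
                                    f"tolerances differ ({ta} mm vs {tb} mm)")
        return warnings

    part = {"AA": ((0.0, 0.0, 0.0), 0.05),
            "M":  ((0.3, 0.0, 0.0), 0.50)}
    for w in near_duplicate_datums(part):
        print("WARNING:", w)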


There is nothing wrong with 2 datums; the issue is that during machining this is not "a part" but an assembly that moves. There are, from my POV, so many failings in the manufacturing process and verification of the part, which are summarized nicely by the following quote:

> Furthermore, initial inspections at the start of the production run were supposed to verify that the manufacturing process was creating products that satisfied the “design intent,” but the initial products were checked against the manufacturing drawings, not the design drawings.


It’s definitely possible, but the checking system was probably built under the implicit assumption that the sources of truth for the checks (datums, dimensions) were correct.

2 datums might need to be very similar but not quite the same, so checking for it might present a lot of hard to handle false positives and make the system very complex.


> Meanwhile on the ground, events were taking an unexpected turn. On Batam Island in Indonesia, debris from the №2 engine plunged into a populated area shortly after the failure, resulting in surprise and alarm. Among the debris was a large portion of the failed IP turbine disk, which fell with such force that it cleaved straight through a building, razing a brick wall. Thankfully, no one on Batam was hurt by the debris. However, photographs of locals holding airplane wreckage in what appeared to be Qantas livery were soon posted to Twitter, where they were taken as indications that a Qantas airplane had actually crashed somewhere over Batam. Qantas engineers already knew that the plane was still flying, but they were unable to contact the crew to find out more information. And outside that bubble, the news that a Qantas A380 had possibly gone down spread so quickly that even investors reacted while the plane was still in the air. In fact, the first time Qantas’s CEO learned of the situation was when he received a call asking why the company’s stock price was dropping.

Information flies so fast in the modern world. There is a classic XKCD about learning about an earthquake via Twitter moments before the ground starts shaking.


The crypto markets responded to Russia's invasion of Ukraine even faster than Twitter did. That was an interesting day



> By specifying an landing weight in excess of the maximum, the system logic changed to apply the operational coefficient only once — for unrelated and obscure reasons — and lo and behold, when he ran the numbers this time, the computer said they could just barely land on any of the 4,000-meter runways at Singapore Changi Airport, with only 100 meters to spare. It wasn’t much, but with no better runways anywhere nearby, it would have to do.

Hacking overflows in an emergency, topnotch.
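
Purely as an illustration of the kind of branch that can produce that behaviour (the real landing-performance software isn't public, and the article only calls the reasons "unrelated and obscure"), with every number below made up:

    # Hypothetical sketch: an over-limit input flips a code path so a margin
    # coefficient is applied once instead of twice. NOT the actual Airbus logic.
    MAX_LANDING_WEIGHT_T = 391.0    # roughly the A380 limit, for illustration
    OPS_FACTOR = 1.15               # made-up operational margin coefficient

    def required_runway_m(weight_t, base_distance_m):
        if weight_t <= MAX_LANDING_WEIGHT_T:
            # normal path: the margin happens to be applied at two stages
            return base_distance_m * OPS_FACTOR * OPS_FACTOR
        # overweight path: the margin is applied only once
        return base_distance_m * OPS_FACTOR

    print(required_runway_m(390.0, 3000.0))   # ~3967 m
    print(required_runway_m(440.0, 3400.0))   # ~3910 m, squeaks onto a 4000 m runway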


And this is why fully autonomous flight control systems won't be certified for airliners in our lifetimes. While autonomous systems are capable of taking off, navigating to a destination, and landing, they are largely incapable of handling major emergencies. It's impossible for engineers to foresee every possible failure mode and program for it.


> It's impossible for engineers to foresee every possible failure mode and program for it.

Playing devil's advocate: you don't have to. It just has to be better than a pair of experienced airplane pilots working together. Which is still very hard, and there's still a good chance we won't see it in our lifetimes, but at least it's not impossible.


Also, let's not completely discount remote pilots


Let's completely discount remote pilots. There is no technology on the horizon which would solve the network latency or sensor fidelity problems that prevent remote piloting from being adequate for handling in-flight emergencies.


I don't claim to be knowledgeable. It's just a hypothetical question.

Surely, it depends on the nature of the emergency. As I understand it, in this Qantas example, the pilots did not need to fly the plane with real-time responses, just to make good decisions.

Let's not completely discount remote pilots, while recognising they are not a universal panacea.


They needed to make a lot of real-time responses when coming in to land, as they had a very narrow window of viable speeds and limited control.


That seems to be correct.

A partial mitigation of these issues could be high bandwidth / low latency networks just in take-off / landing corridors?


There are plenty of times when the thing that happens, and the flying that saves the plane, occur in remote areas or up high.

They may still need help at landing, or by then it could be relatively normal.

But if you can’t provide that level of help everywhere (including over oceans) the design of the system is choosing to lose planes in a trade off for needing fewer human pilots.


There are probably hundreds of ways a plane could fail that would require constant low latency supervision by a pilot. For example, in this specific circumstance, the pilots had to manually maintain speed within a narrow range of 3-4 knots with a bunch of blown control surfaces.

Let's do completely discount remote pilots, please.


It’s worth pointing out that there are also plenty of airliner crashes that are attributed to pilot error.


Plenty compared to the set of airliner crashes, which is very small. There are also a lot of near misses that don’t turn into crashes precisely because of pilots being good at their jobs.

For AI to replace pilots, you don’t need to prove that sometimes humans fuck up. You need to demonstrate that AI would fuck up less often and in a more acceptable way. This requires looking at the big picture, not only bad cases.


Fair points.

But I reckon single pilot operation with emergency autoland will happen. The tech already exists for general aviation.


Maybe for cargo. But there's zero chance that single pilot operation will be allowed for airliners. The workload for managing emergencies is too high for a single pilot, even with extensive automation.


You don't think extensive automation could cut the workload in half?


Not for major emergencies. It's impossible to build automation if you can't anticipate all of the possible failure modes.


So did he pass his check flight? ;)


I know airline regs and reality would never allow it, but I like to think the check pilot tore up the assessment form, and just walked into the CEO's office and plunked down the cockpit audio. "Yeah, he passed."


> What happened in there?

> Emergency landing.

> You look like you've been through it.

> The engine... exploded.

> So is he a pass or a fail?

> He's a pass.


I wonder if there is any correlation between having been in the Air Force and handling these high-stress civilian airline near-disasters.

Both this captain and Sullenberger of the Miracle on the Hudson were Air Force (RAAF and USAF respectively). Since you may be going up against an enemy who can damage your aircraft, there is likely more training on how to assess and recover from damage, as well as how to handle these types of situations.


From watching Air Disasters, military training has helped pilots out a number of times.

However, such pilots being very authoritarian or having bad crew resource management and not listening to / refusing to let the copilot help has caused or contributed to numerous accidents too.


Very lucky they had that mastermind crew. The fact that they had to keep within 3 knots of an ideal landing speed indicates how hard this was to get out of. They landed 150 m from the end of the runway (perfect for the scenario). Amazing.


An amazing recovery, there's even an Air Crash Investigations episode about it:

https://imdb.com/title/tt3234896/


Love that show. Makes me wish we dealt with software even a tiny bit like that. Checklists alone for troubleshooting common customer problems would save so much hassle.

But so many companies (including mine) still work on more of the “heroic” model where it’s up to individuals to just learn the hard way through helping lots of customers and noticing patterns.


I'm trying to introduce SRE as a practice in to my organization. We don't have anywhere near the safety requirements of aviation or medical or power generation software, but our operations do affect thousands of people around the world.

Getting people to understand that SRE is a code of practice and an overall approach has been very difficult, even with the so-called "QA" team, who think their job ends when the latest upgrade is deployed.

We do work in public transport, and the best solution I've found so far, is when they say they're "done", I ask them whether they are willing to stand at the railway station at peak hour and explain to passengers why they can't get home on time (or to work).

The usual result is that they go away and think about it and there is more testing done. But getting that to be a standard approach and way of thinking is very difficult, especially when product owners and project managers are only focussed on the next milestone/payment.


The thing is that you end up having to be very process heavy. From an efficiency, rather than safety perspective, I had an offer from (and interviewed with--in that order) Boeing many moons ago. The thing I remember from a long-ago dinner after that interview was a guy who had spent a couple of years on some design tweak that saved some fraction of a percent on fuel consumption. That's the sort of thing that most engineers do in aviation (where it's perfectly appropriate).


Incredible. That’s a fault tolerant system, operated by a highly knowledgeable crew. Congrats to all those involved, from system designers to pilots and crew.


Reading articles and seeing videos about airline disasters tends to increase my faith in flying rather than making me more afraid of it. Terrorism or sabotage aside, so many failures have to compound to put a modern airliner in a truly irrecoverable state that it's effectively impossible to happen and not worth my time to even worry about. What times we live in that we can hurtle ourselves across oceans at hundreds of miles per hour and be in substantially no more danger than we would be walking down a sidewalk in our home town (in before HN commenters reply with information about all the dangers associated with sidewalks).


That assumes the environment the aircraft flies in behaves predictably. Sometimes it does not.

Turbulence is an obvious one. Downdrafts another. You can have a perfectly functional aircraft, but if the whole air column it's in goes down faster than the aircraft can climb, the aircraft will go down with the air column no matter what.

Reminds me of an Air Crash Investigation episode: some volcano had erupted, ash was high up in the air, air traffic control wasn't aware of this, and iirc it didn't show up on weather radar or similar systems (or on the planes' systems).

So it looked all clear. Meanwhile the whole plane was getting ash-blasted. To the point that paint was stripped, cockpit windows went from clear to matte, and ash attached itself to engine fan blades. Obviously trouble followed...

Bottom line: the environment a vehicle moves through, is always a factor. Sometimes an unpredictable, uncontrollable and/or hazardous one.


I'm not familiar with the volcano incident you referred to, but a bit of searching seems to indicate it was British Airways Flight 009 in 1982, where a 747-200 had all four engines fail due to volcanic ash… then glided safely out of the ash cloud and was able to restart three of the engines and land safely at a major airport. From a complete loss of power to all engines to on the ground with zero deaths, zero injuries. That's exactly the kind of story I'm talking about that gives me such faith in flying!


Sounds like the one! Engine after engine going out. Without (at first) any obvious cause.

> From a complete loss of power to all engines to on the ground with zero deaths, zero injuries. That's exactly the kind of story I'm talking about that gives me such faith in flying!

Understood (and agreed). But you missed my point: fate of that flight didn't result from safety engineering. It depended entirely upon the ash-laden air it flew into, and its effect on the aircraft & its engines. No amount of systems redundancy could have made it a safe flight.

So yes: flying is very safe these days. But there are limits to what safety engineering can provide.


> No amount of systems redundancy could have made it a safe flight.

But it did! One obvious example being the redundancy that allowed the plane to fly safely despite one of the engines not restarting.

The plane encountered an entirely unpredicted situation that caused damage, but thanks to its design was still able to land safely.


They got lucky because when they descended after the engines died, the engine cooling caused small physical size changes and the caked/burned ash just fell off the rotor bits allowing the engine to work again.

To RetroTechie’s point, they got lucky. No design decision saved them. Without that they’d have been a glider until they hit 0ft and it likely would have been far worse.

We’ve clearly gotten very good at flying, managing most weather conditions we’re likely to fly through, the mechanics/maintenance of the planes, and pilot training.

I’ve gained a ton of appreciation for how detailed our preparations are from watching Air Disasters. But we just can’t control everything, some danger is inherent.


Today we have volcanic ash monitoring satellites, and aircraft aren't routed through ash clouds.


That flight is why, IIRC.


I was a nervous flyer until I worked on the Boeing 757 design and found out how all the redundancy, etc., worked.


I was a nervous flyer until I piloted a 737 completely from power-off to takeoff. I did the exact moves that I HATED as a passenger. Turns out you can’t control the wind, and ATC transmissions eat into your mental capacity, sometimes leaving you “behind the ball”. The result was a takeoff flying above target speed, causing the autothrottle to cut engine power sharply (the feeling as though the pilot turned off the engine in mid-climb), turning to match the ATC-requested heading while banking a little more than passenger comfort would expect, and finally retracting flaps without banking or otherwise reducing AoA, giving that weightless rollercoaster feeling during the climb. All of this in a span of 5 minutes.

Once I got into our regular flight profile and following our flight plan, I just sat back in my seat and let out a hysterical laughter. I am the calmest person when I fly now :)


And that's how software should be written too..


What an extraordinarily detailed writeup.


[flagged]


Are you seriously accusing Admiral Cloudberg of writing articles with ChatGPT?


It's an especially ironic accusation given she's currently dealing with YouTube channels stealing her write-ups and reading them into a video using AI. She even had a "if you read this you're an AI bot" section in an article a few weeks ago.


Which adjectives specifically? And which sentence structure? I did not see anything out of the ordinary for such a technical discussion.


Not sure about GPT, but it's hella overwrought - it's like someone took the investigation PDF and tried to make the cheesiest Lifetime movie out of it. Far too many superfluous adjectives and embellishments; barely readable IMO.

eg "The red-hot, wildly spinning disk instantly fractured into several sections, which rocketed outward in multiple directions at incomprehensible speed"


Well the whole point of articles like this is that they are more "literary" than the investigation reports and therefore more entertaining and engaging to read.

There's a time and place for reading dry technical investigation reports and this is not one of them.

Also, none of the adjectives you highlighted are beyond human comprehension or usage or even rare so it's certainly not an example of what parent was trying to convey.


Taking a sentence and doubling the word count with folksy-sounding adjectives does not actually make the prose more "literary". It's just bad copy...


It's a matter of taste. Clearly many of us here were entertained by the style of the blog vis a vis the wiki version or other sources of the same information.


I'd love to see you demonstrate "adjectives that no humans would use". Do tell.


This article highlights the dangers from fake/illegitimate/non-oem aircraft replacement parts that are being used to repair aircraft.

https://www.reuters.com/business/aerospace-defense/engine-ma...

Doesn't make me feel comfortable about flying.


It doesn’t really specify the risk though, the parts may not be critical. I would hope regulations require independent certification for critical parts, but I’m scared to look.


Sometimes non-critical parts can cause a disaster though, as in Swissair 111, where arcing in the in-flight entertainment system led to a fire that quickly doomed the plane.


The story of what happened in the cockpit during the failure is just as interesting! The captain made a number of right decisions in a very challenging situation that allowed the plane to safely land.

Mentour Pilot did a video on that: https://youtu.be/JSMe1wAdMdg?si=YSgbqFpR_EBe-FvX


It's so interesting to read these aviation postmortems, even for people who don't fully understand all the technical details as explained in the article (great, by the way, kudos to the author). I've always wondered: is there an authoritative database of all significant events in aviation to read and learn from? Do pilots study past events as part of their training?


>> The story of Qantas flight 32, as told herein, is therefore not only the tale of a dramatic emergency, but a testament to the safety of aviation today — a story that should make every reader feel a little less fearful of flight.

I'm not afraid of flight. I'm afraid of fall.

(Sounds much better in Greek: δεν φοβάμαι την πτήση, φοβάμαι την πτώση).


> It goes without saying that if any of the turbine fragments had entered the passenger cabin, there would have been injuries, if not fatalities

Understatement of the century. Somehow those spinning disc fragments completely missed the passengers!! They flew in such a way that when the disk disintegrated they went up through the wing or down towards the ground, and one nearly missed the plane itself, striking its ventral section instead of the passenger cabin. That near miss took out a huge number of aircraft systems, so how catastrophic would the damage have been if they had gone through the cabin??

I read the entire article and my conclusion is their luck was immeasurable from the very beginning. They were blessed with such tremendous luck even before the flight crew got the chance to demonstrate their badassery and heroism. One of those discs destroyed part of a building.


The one thing which sticks out to me is the ECAM system including a baked-in corrective message of "open fuel transfer valves" due to the imbalance.

That seems like an odd message to include in an emergency action system, which by definition is only active in unexpected situations. Is there really no system to confirm if a fuel leak is happening?


A320/330/340/350 driver here (can't get away from Airbus apparently).

Nope, there is no system to confirm a leak apart from a camera around the tail if you're lucky enough to have one, my previous airline had a flight where an engine leak was detected this way. Think about it, how would you design such a system? So this falls on the crew.

The procedure to determine if you have a leak is pretty much the same across types: add the fuel on board (FOB) to the fuel used (FU) and make sure that the number you get is the same as what you started the flight with. If it's less by some margin then you probably have a leak. You can confirm further by looking at tank quantities (but they take time to reduce depending on the size of the hole). If you get an engine or pylon leak then you might also see increased fuel flow on that engine. If the leak is elsewhere in the system then you might notice a smell. If you can't work it out then the procedure (at least on Airbus types) usually involves turning an engine off to see if the leak stops (yep, really).

As for the ECAM "open fuel transfer valves" message, I don't know for sure on the 380 but all the other Airbus types I've flown have something like:

.IF NO FUEL LEAK

FUEL IMBALANCE....MONITOR

So it doesn't really instruct you to open the transfer valves but leads you into the fuel imbalance procedure if you think you need it. The very first line of the fuel imbalance procedure says something like "Don't apply this procedure if fuel leak is suspected".
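
The arithmetic of that gross check is simple enough to sketch (all numbers and the gauge-error margin below are made up):

    # Fuel-leak check described above: fuel on board (FOB) plus fuel used (FU)
    # should still equal the block fuel you departed with, within a margin
    # for gauge error.
    def leak_suspected(block_fuel_kg, fob_kg, fuel_used_kg, margin_kg=300):
        missing = block_fuel_kg - (fob_kg + fuel_used_kg)
        return missing > margin_kg, missing

    suspected, missing = leak_suspected(block_fuel_kg=85_000,
                                        fob_kg=52_000,
                                        fuel_used_kg=31_500)
    print(suspected, f"({missing} kg unaccounted for)")   # True (1500 kg ...)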


Thank you for bringing your expertise here. I was wondering if you could give some insight on something that occurred to me while reading this: at first sight, transferring fuel to the leaking tanks might seem to be a substitute for the failure of the fuel jettison system, while also doing something about the increasing lateral imbalance.


That’s good lateral thinking :)

Given that the aircraft can be landed over max landing weight (needs a maintenance inspection) and is still controllable with total imbalance I’d say that balancing just wasn’t as pressing of a concern.

Also, with that much damage you never really know where else it could be leaking. Leaking fuel into critical spaces of the aircraft could be bad so turning on the fuel crossfeed might add extra issues.


You could absolutely design a system that could detect a leak. I’m guessing that it’s just not common enough, or at least not catastrophically common enough, to warrant it.

At its simplest you measure estimated volume delivered to the engines against estimated volume remaining in the tank. Both are things that should be digitally measurable.

The problem seems to be that the only case it really matters is in a catastrophic accident where such measurements are going to be broken anyways.


It’s a good idea, some aircraft have quite complex fuel systems though so it would have to account for fuel moving between tanks.

E.g. the A330 has an inner tank in each wing (which itself can be split into two compartments if damaged), an outer tank in each wing and fuel in the horizontal stabiliser which is used for CG control in the cruise. All of that plumbing can leak too. You’d be adding significant weight and complexity implementing leak detection across all that.

Regardless of all of this, the aircraft is still fully controllable even with a total asymmetry (one side empty the other full) so balancing the tanks isn’t a massive priority.


All of that only adds complexity in the calculation, not the measurement.

The engines have predictable fuel consumption patterns. Even if fuel move across a bunch of tanks, you can still calculate total onboard fuels and detect a leak.


That’s what it already does though. We get a total fuel figure in the flight deck (FOB) and a figure for how much the engines have used (FU - measures flow in the pylons). Add the two together and if the resulting number isn’t what the flight started with then there’s a leak.

The challenge is knowing where the leak is.


For Boeing aircraft you compare the totaliser fuel quantity with the calculated quantity based on engine fuel burn to determine a leak.


Given the sad state of the world in general, I am in awe of aviation industry because it actually works as designed, where all millions of potential points of failure are handled gracefully (and airlines are still profitable somehow).

A true miracle.


Had to chuckle at this: "If you have made it this far, I first of all commend your patience, and/or your nerdiness."

A thoroughly enjoyable write-up this was.


Many things failed, at least: (1) failure to identify safety-critical parts for review (2) failure in design drawings made without regard for fabrication (3) failure to make reference to original design goals when altering part designs (4) bad machining plan altering work-holding during the machining (5) failure to inspect finished parts


(5) was particularly bad: an off center hole should have been visible to the naked eye of any trained machinist, and at least one person should have inspected all safety critical parts. This would have been the final fail-safe to catch the combination of all prior errors. It was not made.

There's actually another failure, (6): the failure of the assembling mechanic to perform a visual inspection, which should have generated a query or note at the very least. While at this point the mechanic is probably given specific points to check and an aggressive timeline for assembly, and has little motivation to extend critical thinking beyond the assembly task since they cannot possibly intuit all of the design intent for every part and subassembly, a better-run assembly process with a culture of observation would have had this flagged for verification with the designers.


I'm wondering why the tolerances for the oil pipe were so small in the first place. Why not make the pipe one or two mm thicker?


It would add up weight-wise, and it’s one of the simpler parts. Jet engines are high-performance, precise machines with many quickly spinning parts. If you can’t bore a tube correctly, how are you going to machine a high-efficiency, balanced turbofan system?

That said, it seems like they did have a poor process where a part could be out of spec and they had no good way to check it. As they mentioned about Swiss cheese, you want as many layers as possible, and checks like that are needed.


Because there are a zillion important parts on the airplane, if you make each one heavier than it needs to be, the airplane will be nailed to the tarmac.


That makes sense. Here’s the question I left the article with:

Why not counterbore the pipe before installation, so it’s a trivial process?

Would it then not survive welding perhaps?


Because it's dead simple to machine a center, and the designer did not factor in machinist/engineer/QA/facility incompetence.


You don’t need any such incompetence in this case, as explained in the article, though it does help, and that specific facility had several issues. The tube was built to spec; it’s the specs that were not what they should have been.

The failures were more with the whole process (like the reference points with different tolerances and the inadequate paperwork) rather than machinist incompetence. They are just the guys at the bottom.


The engineering documents did not match the design documents: incompetence number one. The machinist could very easily have seen with the naked eye that the hole was not close to center; an old salt would have raised it. Incompetence number two. The machinist was not aware that moving the jig was ruining the setpoint: incompetence number three. There were clearly incompetent individuals working at the facility. I get what you are saying... don't blame the individual, but the best thing you can do from a process perspective is to hire good people.


> The engineer documents did not match the design documents. Incompetence number one.

There was no correcting mechanism and nothing to catch the issue. This is a system failure. The person who set the second datum for a different operation did nothing wrong: a new reference was needed for production reasons and that new reference did not need the same tight tolerance. They were not responsible for what happened later. Then, as the author mentions in the article, the software should have flagged the tolerance mismatch when that datum was used for something else.

> The machinist would have seen with the naked eye very easily that the hole was not close to center, an old salt would have raised it up.

No machinist is ever going to eyeball tolerance violations of a fraction of a millimetre on every measurement of every piece they build. That’s science fiction. Checking should have been in the manufacturing checklist. Again, a system failure.

> The machinist not being aware that moving the jig was ruining the setpoint. Incompetence number 3.

Moving it was more or less required for another operation, which is the reason why there were different reference points for seemingly the same thing. Again, that is not the problem. The fundamental problem was that the second datum was used instead of the first.

Besides, if keeping aircraft flying requires a machinist to just know that a bit of metal must not move at all, the real failure is not specifying that. I don’t know where you live, but in most of the world humans do what they can but are not perfect. That’s why there are checks and procedures to correct mistakes. No single decision or action should result in a crashed aircraft. Otherwise the whole system is just creating death traps.

> There were clearly incompetent individuals working at the facility.

You call them incompetent while seemingly not understanding the actual problems, even though they were explained in detail in the article. There will always be out-of-spec pieces and random issues everywhere. If your system depends on humans being perfect, then your system is the problem.

Even great people are bound to make a mistake sometimes. You need to reduce it, sure, and I hope that this specific failure never happens again, but we need to take a broader view.

The article mentioned some of these tubes being rejected, and yet this one made it through.


I know only very basic machining, but that part looks almost simple enough that even I could manufacture it.

It’s very interesting that there were no wall thickness measurements. That would have caught this whole issue.


Couldn't the out-of-tolerance part have been caught by comparing it to a master part that is within tolerance?

Laser-scan it, or measure it with a capacitive probe, and you'd run into the missing material.

In car parts companies there is usually a QM lab drawing samples from production. Does airplane turbine production not have this step?


A turbine disc fragment ripped through the entire plane cross-section and exited on the other side. Stunning.


I like the pragmatic engineering point of view described in the article:

> For engineering purposes, disk fragments are assumed to have infinite energy at the moment of release; they will cut through any reasonable material and cannot be contained.


Also demonstrated by the picture of the brick wall in the OP. Note that it wasn't smashed or knocked down, but looks as if it was cut.


That must have been the section that broke downward, so was traveling faster than terminal velocity.


I applaud this article for being so thorough and informative on this subject. I hope Airbus changes their mind about building new ones. We like them even if airlines don't like the operating costs.


Just throwing this out there, if anyone reading this knows of other writers/blogs/books/etc. similarly looking into aviation engineering/failures I would love some recommendations.




Fantastic write up, and amazing testament to the engineering in the A380. It’s extremely impressive that the pilots were able to safely land the plane with such extensive damage to so many separate systems.


Engineering at its finest. Lots of problems, but multiple layers and layers of redundancy prevented a major issue from becoming a bigger issue involving souls.


With the engine ruptured, still operating, and fuel leaking, how did they know the whole thing wasn't going to explode at any moment?


They didn't, which is why they shut it down.


I've read the ruptured engine was still operating for 3 hours after the plane landed.


The ruptured engine (#2) was shut down within a minute of the incident, in-flight. Two of the remaining engines (#3, #4) were shut down after landing. The last engine (#1) could not be shut down and had to be drowned in fire-fighting foam. This is in the article.


Engine 1 was continuing to operate after landing because the fuel shutoff valves were inoperative. Engine 2 was the engine that had the uncontained failure.


Absolutely astonishing and riveting read.


Really interesting read


Raymond Babbitt would be pleased to hear the passengers landed safely.



