An unauthorized departure turn can put the aircraft into terrain. At night, the pilots may not even notice it.
Source: commercially-rated pilot.
It wasn't clear to me what the exact problem was from reading the article, just that it occurred under a very specific and uncommon set of circumstances.
> "The FMS may change the planned database turn direction to an incorrect turn direction when the altitude climb field is edited."
What I meant, though, was that I didn't understand what the bug was - not how it manifested.
Each list has a Missed Approach Point, at which the list branches into two: landing or abort.
If you are not ready to land at that point (too fast, can't see the runway, previous aircraft still on the runway, etc.), then you fly the abort part. Usually it tells you to climb to a certain safe altitude and turn towards a holding area for another landing attempt.
These procedures can be flown manually, or activated for the autopilot to fly. A bug made the autopilot turn in the opposite direction to what's in the abort section.
Here's a landing procedure for Helena regional airport: https://flightaware.com/resources/airport/HLN/IAP/ILS+OR+LOC...
The narrowing beam is the instrument landing system that guides you towards the runway. If you reach the 4580-foot minimum altitude during the approach but can't see the runway, you must fly the abort procedure, which is drawn with dashed lines: climb immediately to 4700 feet on the current heading, keep climbing to 9000 while turning to heading 021, then proceed north at 9000 on the 336-degree radial from the Helena radio beacon, and upon reaching waypoint WOKEN circle until further instructions.
This bug could cause the aircraft to turn southwest (towards mountains) instead of northeast (valley).
At Helena, the bug would not reveal itself because the right turn is only 114 degrees. If the procedure required a right turn of more than 180 degrees - say 200 or so degrees right towards SWEDD - the aircraft would make a left (shortest) turn instead. Green is what should be flown, red is what would be flown: https://i.imgur.com/ojShQa2.png
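To make the failure mode concrete, here is a toy sketch of how this class of bug can arise (my own illustration in Python, not the actual FMS logic, and the function name is made up): if the lateral guidance ever falls back to a "shortest turn" calculation instead of honoring the turn direction stored in the procedure, any charted turn of more than 180 degrees silently flips to the other side.

    def signed_turn(current_hdg, target_hdg, charted_direction=None):
        # Degrees to turn: positive = right, negative = left.
        diff = (target_hdg - current_hdg) % 360   # 0..359, measured turning right
        if charted_direction == "R":
            return diff                           # honor the charted right turn
        if charted_direction == "L":
            return diff - 360                     # honor the charted left turn
        # Fallback: shortest turn. If the charted direction is dropped
        # (say, when the climb altitude field is edited), a turn of more
        # than 180 degrees flips to the other side.
        return diff if diff <= 180 else diff - 360

    print(signed_turn(267, 21, "R"))    # 114  - Helena case, correct
    print(signed_turn(267, 21))         # 114  - same result, bug invisible here
    print(signed_turn(267, 107, "R"))   # 200  - the long way right, as charted
    print(signed_turn(267, 107))        # -160 - shortest turn, the wrong way

The Helena case comes out the same with or without the charted direction, which is why the bug wouldn't reveal itself there.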
Many airports have construction cranes up to 200' high in the airport area.
The US uses aerostats along the southern border (tethered balloons, with multiple guy wires) up to at least 1,000'.
Airliners are moderately tall on the ground, so that's something else you can hit if near enough to the departure end on an adjacent taxiway or runway.
There are illusions when flying at night that confuse your sense of bank, so without looking out the window, only instruments would indicate a wrong turn. Look away or get distracted, and you just hit something.
In addition, airline pilots aren't test pilots. There is an assumption that systems are unsurprising from one second to the next, and that a checklist can be used if not. An unexpected turn at ground-level would often turn out badly.
A dive at altitude would give you time to react, usually minutes.
Hills are invisible in the dark.
Also, charted departure instructions are something you bet your life on - and your passengers' lives as well. So if you don't trust the plates or the FMS, you can't fly in IMC or at night.
- dive with power causing overspeed
- dive causing controls to lock due to transonic shock waves
- improper recovery from a spiral dive will over-G
- MCAS-style confusion
- bottoming out on a phugoid oscillation
- unrecoverable spins, usually in jets
But in general, if you unload the wings (reduce G), then nothing breaks.
But if the missed approach or departure procedure is wrong, the crew has a high probability of not noticing it (this all happens in a very high-workload situation). If you don't notice, you can't fix it. What makes it worse is that this bug shows up in situations where the procedure requires a turn "the long way around", and a procedure wouldn't be designed that way unless it's really necessary. So there is a big chance there is terrain or an obstacle on the other side.
Source: I'm a commercial pilot
the 21st century, planet Earth.
It killed at least 157 people. The culprit in this case, iirc, was a flaw in the hydraulic cylinder design combined with large temperature swings. The story of the guy who finally figured it out is a fun one.
The Boeing MAX and Starliner come to mind, but the failed Moon missions by Israel and India are also examples of this trend.
Cost cutting in software development is costing companies dearly. Boeing may even go bankrupt because of this.
This hiring problem is compounded by the oversight problem. The program managers are similarly inexperienced. Or they came strictly from the testing side, with no concept of what software development itself entails (I've seen this a lot). So it's not that they're bad at managing requirements - they may actually be really good at that - but they absolutely fail to understand that software is a hard problem (especially when dozens of subcontractors are involved) that extends beyond just the technical problem to the communication and coordination problem. And that's assuming they're experienced; USAF program managers for software (IME) are history majors straight out of college. DoD programs are scary.
Most avionics systems, in my experience, boil down to rather straightforward state machines. Understood this way, they become much simpler to write and test. The hard part is hitting your timing constraints, but that's easier to achieve with correct-but-slow-and-maintainable code than with incorrect-but-fast-and-unmaintainable code. Inexperienced developers won't see this possibility, either by failing to spend time studying the requirements or by failing to understand how to implement state machines at all.
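As a toy illustration of what "straightforward state machine" means here (my own sketch in Python, with invented states and events, not any real avionics code): an explicit transition table can be reviewed line by line against the requirements and tested exhaustively, even if it isn't the cleverest or fastest implementation.

    # Every behavior lives in one table; anything not listed keeps the current state.
    TRANSITIONS = {
        ("ARMED",   "localizer_captured"): "CAPTURE",
        ("CAPTURE", "established"):        "TRACK",
        ("CAPTURE", "go_around_pressed"):  "MISSED_APPROACH",
        ("TRACK",   "go_around_pressed"):  "MISSED_APPROACH",
    }

    def next_state(state, event):
        return TRANSITIONS.get((state, event), state)

    assert next_state("TRACK", "go_around_pressed") == "MISSED_APPROACH"
    assert next_state("ARMED", "go_around_pressed") == "ARMED"  # unhandled pair: no change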
There are two reasons why you can expect software to be more and more the cause of airliner safety issues:
- software is eating the world
- software is getting more complicated
The first is a long-term trend now. Look under the hood of any automobile from before the 80s: no computer to be found. Look under the hood of any automobile from the past 30 years: computers abound. The reason for this is that many problems are easier to address in software than in hardware. Of course, you go from N hardware problems to some possibly smaller set of possibly simpler hardware problems at the cost of gaining a set of software problems -- but this trade-off usually pays off. In some cases this trade-off enables functionality that would be infeasible to create otherwise.
The second problem is also a long-term trend: CPUs, systems, operating systems, and applications have all tended to get more complex. In embedded systems the trend has been less strongly towards ever-increasing complexity, but even in embedded systems things have gotten more complex.
Whether the problem is less competence among today's programmers is hard to establish here. First, we need much more software, which means we need many more programmers, which means the quality of programmers you get probably does decrease, though then again, we do have more programmers overall as more people (competent and otherwise) are attracted to the industry. But more importantly, the increase in complexity of today's systems could very well be enough to make yesteryear's competent programmers incompetent today -- you can't really compare software development 40 years ago to software development today.
(I object to this idea that lower-wage programmers necessarily can't be competent, though that isn't quite what you wrote. It's true that a lax process for outsourcing can mean you get less competent programmers, and it's probably true that higher GDP/capita correlates with availability of competent programmers. But it doesn't follow that there are no competent lower-wage programmers in India, say.)
Boeing has been cutting too many corners since the MBAs took over and started reorganizing things to maximize profitability. There's a good chance the company will fail because of this.
Many things that used to be done by analog computers or manually done by the pilot are now done in software.
In addition, the Boeing MAX was largely a system design issue; the software was operating as designed, and had it been implemented in hardware, it would likely have failed in the same manner.
There's no surprise that's where most of the failures occur.
To me, when a software bug shows up in a critical system, that means you actually have a logistics bug. Airplane control software should not be allowed to have bugs. CPUs should not be allowed to have bugs. And OSes should not be allowed to crash (looking at you, Microsoft).
When one of these things happens, in my opinion the correct response is _not_ to just release fixes and workarounds and then say "we'll try really hard to not let it happen again." You do that, sure. But the first time you see airplane software malfunction, that means you need to change the way the software is written and released so that the whole class of issues will not ever happen again. You don't stop at a public apology, you don't fire the person that unintentionally wrote the bug. If you have to hire mathematicians to formally prove the critical paths of the software, you do that. If it costs 10x more to release bug-free software, oh well, you do that.
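To make the "prove the critical paths" idea concrete, here is a toy sketch (my own example, not anyone's actual process): for a small pure function over a bounded input space you don't even need a theorem prover - you can check the requirement against every possible input, which is the spirit of the exercise. The helper below is hypothetical.

    # Hypothetical helper: degrees of RIGHT turn to get from one heading to another.
    def right_turn(current_hdg, target_hdg):
        return (target_hdg - current_hdg) % 360

    # Exhaustive check over all 129,600 integer heading pairs.
    for cur in range(360):
        for tgt in range(360):
            t = right_turn(cur, tgt)
            assert 0 <= t < 360              # a commanded right turn never flips left
            assert (cur + t) % 360 == tgt    # and it actually arrives at the target
    print("all cases pass")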
All of these corporate people thinking they can save money by spending less on quality are extremely naive. You can do a financial analysis of this, but they're doing it wrong. Did you ever consider what the cost of a whole generation just not trusting air travel at all would be?
This is pretty good intuition but often a systemic change is not economically feasible. For avionics software at least, a rewrite of the software would likely have to be recertified from scratch before it would be allowed to fly.
We do, however, have several different quality assurance programs in Aerospace that are supposed to address this sort of thing.
Once you identify the root cause, the process found to be deficient is supposed to have a Process Owner who is required to create a preventive and corrective action plan to prevent a recurrence, with more severe problems requiring more robust action plans. Done right, the process owner is supposed to be empowered to make the changes that need to be made.
These systems tend to be evolutions of ISO 9000 as pioneered by Toyota (IIRC). They are highly bureaucratic and soul-sucking, but they are also the least-shitty solution that's been tried.
You should also keep in mind that real systems have fault modes aside from software bugs and hardware glitches, such as unanticipated edge cases and user error, which may dominate your actual failure statistics.
The difference in reliability between normal software and airplane software is so vast that "best practices" from normal software cannot be applied to airplane software, since that would be gross criminal negligence. To explain: in the 10 years prior to the 737 MAX problems there were 50,000,000 flights, and software was not implicated in a single passenger air fatality. The average flight is ~5,000 km, which is ~4-5 hours. So, in ~250,000,000 flight-hours, there were two crashes due to software. A plane takes ~3 minutes to fall from cruising altitude, so we can model this as a downtime of 6 minutes per 250,000,000 hours, which gives us a downtime of 1 in 2,500,000,000, or a 99.99999996% uptime (yes, that is 9 9s). In contrast, I think most software people would agree that AWS is high quality. The AWS SLA specifies 99.99% uptime (1 in 10,000 downtime). So, by this metric, airplane software is 250,000x more reliable than normal high-quality software.
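For what it's worth, here is that back-of-the-envelope arithmetic written out (same deliberately crude assumptions as above):

    flight_hours   = 50_000_000 * 5       # ~250,000,000 flight-hours over ~10 years
    downtime_hours = 2 * (3 / 60)         # two crashes, ~3 minutes of "downtime" each
    unavailability = downtime_hours / flight_hours   # 4e-10, i.e. 1 in 2,500,000,000
    aws_unavail    = 1 / 10_000           # 99.99% uptime SLA
    print(aws_unavail / unavailability)   # ~250,000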
The point of this is that the standard for airplanes is almost inconceivably high compared to normal software. To think that they are incompetent or suggest that all they need to do is adopt X or Y common-sense/best-practice is a gross misunderstanding of what is being done and what needs to be done to improve. It would be like someone trying to tell a civil engineer making a 50-story skyscraper that they really need to adopt high quality wood construction techniques from makers of doghouses. To actually improve it, you need to consider practices 250,000x better than "best practices" and go from there.
To put it another way, the solutions are actually really really good, unfortunately the problems are really really really really hard.
The issues with the MAX were also clearly preventable and there were multiple failures of the systems (regulators, internal reviews, etc.) that were in place to catch these kinds of issues.
But as you point out, the aeronautical industry has an excellent track record for software reliability, if you evaluate reliability by hull losses. By other metrics it's a bit more debatable (e.g. the integer overflow for Dreamliners such that they need to be restarted at least every 248 days), but it still keeps people moving safely.
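For reference, the 248-day figure is consistent with a signed 32-bit counter ticking in hundredths of a second overflowing - that's the commonly cited explanation; the directive itself just requires a power cycle before 248 days of continuous operation:

    print(2**31 / 100 / 3600 / 24)   # ~248.55 days until a signed 32-bit centisecond counter overflows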
My primary point is that many people look at these failures and incorrectly conclude that the processes in place are objectively terrible and below average. This leads to them discounting the processes in these systems in favor of policies from vastly less reliable systems that they think are quality-focused or "best practices" because they, fairly, think "bad" in a safety-critical context means the same as regular "bad", so regular "amazing" is clearly better. In truth, "unconscionable deathtrap" and "gross criminal negligence" in the airplane world is more of a synonym for "amazing beyond belief" in the rest of the software industry. The correct takeaway is understanding that regular "amazing" is actually orders of magnitude worse than "unconscionable deathtrap" and is thus completely inadequate for the job. As a corollary, if you do not think you are doing "way better than amazing" you are probably not doing an adequate job in these contexts.
To reiterate, the solutions are really really good, unfortunately the problems are really really really really hard.
Uptime is not a comparable metric in any way. Aircraft computers often reboot every flight or every day. AWS downtimes don’t typically result in fatalities. The fall time of the 737 MAX before it impacts isn’t ‘downtime’, and simply cannot be used to summarize the reliability of aviation software as a whole. Arriving at 250,000x this way makes it a meaningless number, and you didn’t account for the bug in the linked article in your reliability estimate at all.
I somewhat agree that the metric I chose is somewhat sloppy, but you can afford to be sloppy when you are comparing things with such disparate outcomes. Sure, maybe we are not comparing a 1 story house to a 50 story skyscraper, it is only a 30 story skyscraper, but that has little impact on the fact that they are fundamentally different and to declare that they are even remotely comparable is a massive category error.
I, however, disagree that "uptime" is a nonsense metric, though there are absolutely better ones. "Uptime" in this context means duration/probability of critical operational failure which is an extremely relevant metric. That AWS does not result in fatalities during critical operational failure has no bearing on whether critical operational failure occurred or not, it just means that it matters less. A valid quibble is that I am using crashes as a proxy for failure which discounts critical software failures that did not cause critical operational failure due to non-software redundancy, but again, the outcomes are so disparate it beggars belief that this would bridge the gap.
As for aircraft computers being rebooted frequently, true. So? I am comparing full system reliability during operation, not individual components. It is not like individual AWS servers run indefinitely; they are rebooted frequently, but the system as a whole stays operational due to redundancy and migration.
The reliability estimate does account for the bug. The bug did not cause a critical operational failure. It could cause a critical operational failure in an extremely unlikely case if it remained undetected and no measures were taken to avoid or correct for it. However, it was detected and countermeasures have been put into place, so the processes in place continue to achieve their intended goal of preventing critical operational failure. So, the outcome-based estimate continues to be accurate.
Just to be clear, an outcome-based estimate is not perfect. By its nature, it only looks at the past, so it has no true predictive power. You cannot use an outcome-based estimate to predict the effects of process changes. However, it is a relatively unbiased way of evaluating whether prior processes were effective, which in turn tells us which past processes actually worked and what the effects of process changes were.
Yes it did cause operational failure! An airplane turning itself the wrong direction is an outcome, and an extremely serious one.
There was a bug that put people at risk, and you are saying that just because a human caught it and it didn't crash the plane, it doesn't count as unreliable?! You've just rationalized ignoring all bugs that don't cause fatal crashes when estimating software reliability. This is making your point weaker, not stronger. You're arguing that software reliability should only be measured by fatalities. If you really want to go that way, one might conclude that "normal software" like AWS is infinitely more reliable than aviation software, because it never killed anyone. By discounting any bugs that don't lead to plane crashes, you are undermining your own claim that aviation software is "250,000x" more reliable than other kinds of software.
This kind of analysis - the insistence that reliability is high because death has not occurred often - has played a major role in several high-profile accidents. In the shuttle disaster, for one, it was specifically called out that reliability estimates were exaggerated. The Therac-25 incident is another case where engineers failed to understand what happened for a long time, due in part to vastly exaggerated estimates of the system's reliability.
No, uptime still makes zero sense to compare, it is a nonsense metric in this context. Uptime is a measure of continuous operation, and planes aren’t in continuous operation. Simple as that. It’s a metric that does not apply to aircraft, no matter how you spin it.
There are multiple cases of major software failure in military and aviation from systems being in continuous operation for too long. There was a thread just the other day about an airline’s safety procedures specifically requiring in writing a reboot every 30 days due to known bugs.
And you’re ignoring that the 737 MAX did not suffer system operational failure. The system didn’t go down, it kept working. If the system had gone down, those people might have survived. The crash happened precisely because the buggy system kept working. If you want to count the downtime of the system, you maybe ought to count all the flight hours the plane would have flown since the crash, rather than using a bogus concept of only the ratio of fall time to all flight hours to estimate industry reliability. Again, that ratio is completely and utterly meaningless as a proxy for software reliability.
“Downtime” in normal software is not always caused by catastrophic failure, sometimes it’s due to maintenance and upgrades, sometimes it’s due to low performance, sometimes it’s caused by people actively attempting to fix bugs during uptime. None of those things happen during an airplane’s uptime.
> One of the only ways to compare processes and not be tricked by fancy words, especially as a non-expert, is to look and compare actual outcomes.
I'm not arguing against comparing outcomes. I'd agree that looking at outcomes is a good thing, if, and only if, you are actually fair about seeing all outcomes. I'm suggesting that pointing at the more easily verifiable volume of testing effort and safety concern around aviation software, compared to how much testing and verification happens on 'normal software', might adequately persuade someone who didn't already know it that aviation software testing and bugs are taken far more seriously than the testing and bugs of web apps are.
The Spectre/Meltdown issues are deep and architectural, not simple to fix. It's not just a batch of CPUs that's the problem, but all of them.
Besides, if a CPU ships with a bug that can be fixed via a microcode patch, then it would be a tremendous economic waste for all humanity to throw those CPUs out.
Even when new CPUs come out that can be shown not to have Spectre/Meltdown issues, it will take a long time to replace the installed base of those that do because it's not a matter of a little bit of money, but a matter of a great deal of money and opportunity costs.
So microcode patches and software mitigations are all there is. Absolutist attitudes don't help.
This sounds pretty good in theory but in practice you will just trade the current set of issues against new issues.
In reality, systems and their interactions are so complex that there is no amount of software design that can avoid bugs and the need to fix them. We can surely improve, but it would be naive to think you can design 100% reliability into something like an airplane.
In a way, every real software improvement (not fancy language flavor 'x' of the year, but entirely new ways of developing software) has always had the main goal of writing software with fewer bugs, faster.
That's the whole reason we have abstractions, compilers, syntax checkers, static analyzers and so on. In spite of all those, software still has bugs, and budgets are still not sufficient to write bug-free software.
On another note: this problem is getting worse over time. As tools improved, codebases got larger and the number of users multiplied at an astounding rate, resulting in many more live instances of bugs popping up. After all, software that contains bugs but is never run is harmless; only when you run buggy software many times does the price of those bugs really add up.
Somewhere we took a wrong turn and we decided that more of the same is a better way to compete than to have one of each that is perfected and honed until the bugs have been (mostly...) ironed out.
If you’re trying to keep planes from crashing at all costs, sure. If you’re trying to reduce deaths from travel, that’s a terrible plan. Every family that you price out of commercial air travel and convert over to private auto travel instead has been placed at significantly higher risk as a result of the excessive pursuit of safety.
It’s the reason the FAA allows lap infants under 2 years old. Not because that’s “safe” in absolute terms, but because it’s safer than the likely alternative.
The issue was the specification itself, which assumed pilots would reliably catch the uncommanded trim down, diagnose it and disable the whole electric trim subsystem within seconds of the problem behavior arising.
That assumption turned out to be massively flawed.
It's not that hard, by the way. And they did that, but handwaved the critique - the typical approach of "my gut is probably more correct than maths".
Even if my comment implies that there might be pilot error, pilot error doesn't mean pilot blame.
In this case, I'm very much of the opinion that the blame belongs either with the official Boeing training program, which didn't train 737 pilots to correctly handle this scenario,
or with the design specification that relied on the assumption that pilots would be able to correctly handle this scenario without even testing that assumption. Or potentially both.
Even if, say, 10% of pilots could fluke into handling this scenario without the correct training, that doesn't mean the other 90% are to blame for not fluking into a correct solution.
This is a perfectly reasonable request by the airlines. Some airlines rely on the operational efficiency of a single aircraft type. It lets them interchange parts and people and not have to worry that the wrong airplane is in the wrong spot.
What is NOT reasonable was Boeing providing an aircraft that actually had MAJOR differences yet claiming it was the same.
And what makes it particularly stupid is no airline that relies on a single airplane type is going to switch from Boeing to Airbus because they would have to migrate their entire fleet en masse. So Boeing had plenty of time to certify the 737 MAX airframe properly.
> "The draft conclusions, these people said, also identify a string of pilot errors and maintenance mistakes as causal factors in the fatal plunge of the Boeing Co. plane into the Java Sea, echoing a preliminary report from Indonesia last year."
Private planes and industrial planes still have an awful safety record.
Most stats also exclude 'unrelated' deaths which happen during a flight (even though there is a good chance the changes in air pressure, stress, lack of medical care, and cramped conditions at least contributed to the death).
Stats also often exclude terrorist or war shootdowns of commercial planes, which are starting to become significant.
I don't know about industrial, but I assume "private" is a combination of 1) private pilots suck and 2) too much catering to client.
Kobe Bryant would be my unfortunate shining example of 2). The pilot either wanted to cater to Kobe or would get fired if he didn't, and so went up in weather that it was stupid to go up in.
As for 1), I've seen far too many sleep-deprived, hungover, drunk, or stoned private airplane pilots. And this is on top of the fact that they probably aren't the most experienced pilots to begin with. What is it about piloting that seems to attract frat boys who never grew up?
Thankfully there are no nearby hills for this bug to kill anyone there.
Unrelated, but how many carbon offsets do I buy?
Flightaware puts the actual distance at 760 km, so it'd be ~844 kg of CO₂.
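For anyone checking the working, that figure implies roughly the following per-km factor (just the number implied by those two values, not an official emission factor):

    print(844 / 760)   # ~1.11 kg CO2 per km for this flight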
I found this great guide by the David Suzuki Foundation which assesses different carbon offset vendors with a few different measures:
Pages 42-49 go into the different criteria; page 50 is the table scoring them all.
I decided to go with https://carbonzero.ca, it was $19.89CAD for 0.88t. Thank you for your help :)