I was speaking with a 787 pilot last Sunday. I told him that the week before, at an airport, two pilots sitting next to me were talking about how "This is the third bloody 787 rescue we've had this month... I can't believe we had full engine and <I think he said auxiliary?> failure at the same time". I asked him if this is common and he said "I hear of it, but I haven't had that many major failures, just lots of little things - last time I flew in from <city>, a few moments after we touched down we lost auxiliary power from the rear engine, and all the cabin lighting went black along with a number of other things. Thankfully we were straight and had already lost most of the speed we were carrying, so we were fine and taxied to the disembark location. They had it up and flying again within the day - but it certainly was disconcerting to say the least".
I'm paraphrasing slightly from memory there, but I was certainly quite surprised at how calm he was about the whole thing. There's no way I'd board one of those things.
Modern two-engine planes like the 787 have an auxiliary power unit (APU) in the tail. This is a small turbine that runs a generator and a pump for the hydraulics. It’s typically only turned on when the plane is on the ground, or if there’s an emergency in mid-air. It is also needed to start the main engines so if the APU is faulty the plane will probably be stuck where it is. In theory a 787 can take off with just one engine but this is not very safe and would only be done in the most exceptional circumstances.
There are variations on this depending on the plane model, of course. Some older planes can use an external starter for their engines, but I think that’s very rare now.
Aircraft with INOP APUs can generally be "air started" with a ground-based high-pressure air system. It's relatively common and I've been on a plane that had to do the procedure. It was entirely undramatic other than engines being started before the pushback, but I doubt most passengers even noticed.
Now, interestingly, the 787 is a "bleedless" aircraft, so it doesn't use high-pressure air from the APU to spool up the engines. I believe it can use its hefty bank of lithium-ion batteries to start its engines if the APU (and associated electrical generator) is INOP.
Not a pilot/engineer - just an enthusiast. Someone more au fait with the 787 might be able to correct me on the above.
My understanding is that there was a push to modify the U shaped tow trucks they use to position planes to have a battery powered system to start the engines.
The idea being that the APU isn't particularly clean burning, not compared to power plant emissions. It's been a long while since I've heard anything about that plan, for or against.
Interesting! Although it'd (presumably) only be useful for the 787, short of heavy modification to existing aircraft. Even the Airbus A350, an aircraft from the same era, uses a traditional bleed system. If planes continue down the bleedless route I can see it happening.
>Modern two-engine planes like the 787 have an auxiliary power unit (APU)
Where "modern" here includes jet airliners made in the 70s yes.
>It is also needed to start the main engines
The engines need an air source, and the APU can be an air source, but at one point at least, big airlines preferred using ground hookup provided air sources for starting, in order to save gas. Next time you fly, look at the jetway. There will be a large yellow duct system underneath it that can be hooked up to the plane to provide pneumatic pressure and air conditioning air without starting the APU. There are similar hookups for electrical power so that a plane won't drain its battery during routine turnover operations.
The bottom price flights I've taken recently don't seem to hook either up though, preferring instead to start the APU during taxi to the gate while shutting down one engine, shut down the other engine once they are at the gate, and reverse the process to taxi back out to the runway. The turnaround time is so short, and the required work to clean and restock the cabin so little, I bet they just don't pay for ground hookups.
There is also a RAT (ram air turbine) at the back that can be deployed to generate some power (~5-10 minutes max) in case of a severe emergency in the air. It is what you hear sometimes, when the aircraft is making a very shrill noise flying over your head.
However, if it is not a test flight, a RAT deployment should make you very uncomfortable and worried…
> RAT … It is what you hear sometimes, when the aircraft is making a very shrill noise flying over your head.
I’ve been around a lot of airplanes and I can’t say I’ve seen or heard a ram air turbine deployed in flight. There was a recent incident involving a Frontier Airlines flight in which the RAT was deployed when the aircraft was put in emergency electrical configuration. The deployment of the RAT would be quite rare.
I find it hard to believe that anyone reading this was within earshot of a plane in a severe emergency and heard this particular sound and since turbine engines are already quite shrill I am basically just sorta confused who your audience is for this suggestion.
Usually, when the RAT is really deployed because of an emergency, the jet engines will be a lot more silent (because they're not producing any power). Although I'm not really sure how loud a windmilling jet engine really is, and I somehow doubt there is a YouTube video of a plane landing with both engines disabled - but you never know...
Would you hear it from inside the plane? Even if it’s not as loud as the main engine, if it’s audible at all a lot of people would notice a change in pitch/tone. At least, I notice when the sounds the plane is making change even though I don’t know anything about the reason.
> After starting the descent, the flight crew made an announcement to the passengers; however, unbeknownst to the flight crew, the noise generated by the RAT (because of its high rotation speed) prevented the passengers and the cabin crew from hearing the announcement.
It always surprised me that there aren’t small, local lithium batteries to provide backup power for critical components like the smoke detectors. Is the risk of those catching fire considered too high?
>It always surprised me that there aren’t small, local lithium batteries to provide backup power for critical components like the smoke detectors
There is, well, only lithium on the 787. If all power generation is dead, then the most critical flight instruments and gauges get about 20-30 minutes of power from the plane's batteries, things like your backup old fashioned gauges, the engine computers, and maybe some basic flight computer on newer planes. The RAT is intended to keep flight surfaces operational when everything else is utterly fucked, so it usually produces the same kind of energy as whatever the primary flight control system uses, which until recently was hydraulic power. On civilian airliners they generate tens of kilowatts. Airliners do not want to carry around an EV sized battery for the extremely rare occasions when you lose all systems, because that's a waste of gas. The RAT provides the same functionality for lower weight.
When the RAT is deployed, you do not care much whether a smoke detector is powered, you are already vectoring towards an attempted landing.
I feel like it's not the RAT you'll notice from inside the plane, it will be the silence from the engines. That combined with at least a momentary flicker of the lighting (I'm not sure if a RAT on a 787 will run cabin lighting but I doubt it), and you'll know.
Haha I didn’t parse it that way but I can see how you thought that upon rereading. I just want to understand why we would hear the RAT when there wasn’t an emergency overhead. I suppose planes regularly test them?
I'm not going to bother slogging through everything to be able to speak in specifics for every airplane ever built, but:
A RAT provides backup electrical and/or hydraulic power for control surfaces (and other goodies). A RAT would certainly be inspected during a heavy check and likely even during line checks (e.g. an "A" check or equivalent). How often is gonna depend on the airplane. But to suggest that a critical piece of equipment isn't checked regularly is just silly.
Additionally, it's pretty much guaranteed that if an airplane comes with a RAT the RAT is required to be functional for ETOPS flights. That alone means you're gonna be inspecting it pretty frequently. ETOPS certification has three parts: airplane, airline, and humans. You'd want to look at the ETOPS Maintenance Document at whatever airline to be sure.
Outside of Asia (where domestic widebody flights are still common) I'd guess many if not most 787 flights are ETOPS flights.
> Additionally, it's pretty much guaranteed that if an airplane comes with a RAT the RAT is required to be functional for ETOPS flights. That alone means you're gonna be inspecting it pretty frequently.
I remember a decade or more ago I was on a US domestic flight - I forget exactly what, I think it was American from SFO to LAX - so I doubt it needed ETOPS. But the captain announced - while we were still at the gate - that he was getting an error in the cockpit saying the RAT was faulty. And he called maintenance, and they told him to try resetting something (a computer or circuit breaker or whatever) to see if that cleared the error - and when it didn’t, he announced we could not take off and would all have to go back into the terminal. Thankfully they had a spare plane a few gates over and they put us all on that (same crew, same passengers) so we only lost an hour or two.
Right. In the context of this discussion ETOPS buys you significantly increased inspection and maintenance requirements. That's why I don't like playing this game of telephone. Someone told someone else that something else did something else. Were everything to have unfolded as transcribed here there almost certainly would've been a high profile investigation.
Back to your flight, both the FAA and EASA require airliners to have a minimum equipment list (MEL). It's entirely unrelated to ETOPS (overwater flights). This list describes what equipment is required to be functional, what you can fly without and when. What's on the list is all going to come down to what kind of plane we're talking about. Could be you're not allowed to fly without a functional RAT ever. Could be that you can fly without a RAT as long as something else (e.g. APU) is functional. Could be you can only make a certain number of flights with a non-op RAT.
A real world example is that ATR 72 crash in Brazil recently. One of the PACKs (air conditioning / cabin pressurization) was not functioning on the accident plane. Per the MEL you can dispatch an ATR in that condition, but you're limited to a service ceiling of 17,000 ft. Unfortunately that put the flight in direct conflict with the weather.
You’re right; my statement was in the context of the above discussion about people claiming to somewhat-regularly hear RATs in the air above them. That definitely isn’t happening.
But the turbine generates power to keep the plane flying. Why would it only work for 10 minutes? Certainly the flight time is a product of fuel level and altitude. Even if both engines fail the flight time would be a function of altitude. I don’t see how deployment of the RAT informs flight time.
It does not generate power to provide thrust; it generates power - using the airstream as the aircraft moves through the air - for the avionics and/or hydraulics.
/s? A generator or alternator powered directly by the engines is more efficient than towing a wind vane (still indirectly powered by the engines and/or the potential energy of the airplane) every single time.
This discussion has nothing to do with engine out failure modes.
I thought the guy I was speaking with mentioned something about instrumentation, but I wasn't 100% sure and that sounded more serious so I didn't mention it - but if the aux engine failing would do that - I guess that lines up!
I have a way simpler explanation. IEEE 754 double can only represent integers up to 2^53 without precision loss, so if you naively average two numbers greater than 2^52, you get an erroneous result.
It just so happens that 2^52 nanoseconds is a little bit over 52 days.
I've seen the same thing with AMD CPUs where they hang after ~1042 days which is 2^53 10-nanosecond intervals.
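For anyone who wants to check the arithmetic, here is a minimal Python sketch (Python floats are IEEE 754 doubles). It is not a claim about what the 787 or AMD firmware actually does; the helper name `days` is purely illustrative. It just shows where increments and sums start losing the low bit, and what those tick counts work out to in days.

```python
NS_PER_DAY = 24 * 60 * 60 * 10**9

def days(ticks: int, ns_per_tick: int = 1) -> float:
    """Convert a tick count to days, given the tick length in nanoseconds."""
    return ticks * ns_per_tick / NS_PER_DAY

# Integers are exact in a double only up to 2**53.
limit = float(2**53)
stalled = (limit + 1.0 == limit)      # the +1 is silently lost

# Two values just above 2**52 sum to an odd integer above 2**53,
# which a double cannot hold, so the intermediate sum is off by one.
a, b = 2**52 + 1, 2**52 + 2
exact_sum = a + b                     # Python ints are arbitrary precision
float_sum = float(a) + float(b)
sum_error = int(float_sum) - exact_sum

print(f"increment lost at 2**53: {stalled}")
print(f"sum error above 2**53:   {sum_error}")
print(f"2**52 ticks of 1 ns  = {days(2**52):.1f} days")
print(f"2**53 ticks of 10 ns = {days(2**53, 10):.1f} days")
```

The two day counts are the ones quoted above: a bit over 52 days and roughly 1042 days.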
The comment said IEEE 754 doubles can represent integers up to 2^53. But I missed the “double” or assumed float. Floats cannot do that and it would be disastrous to assume so. For that matter, doubles also have some pretty big issues when you do operations on them (loss of precision), but as long as you are purely doing integer operations, it “should” be fine. A practical example with non-integers: 35 + -34.99, which does not come out to exactly 0.01.
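A quick way to see the float-versus-double difference, as a rough Python sketch. The helper `to_float32` is my own illustrative name; it round-trips a value through a 32-bit float using the standard `struct` module.

```python
import math
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

# A 32-bit float has a 24-bit significand: 2**24 survives, 2**24 + 1 does not.
f32_ok = to_float32(float(2**24)) == 2**24
f32_lost = to_float32(float(2**24 + 1)) != 2**24 + 1

# A double holds the same integer without trouble.
f64_ok = float(2**24 + 1) == 2**24 + 1

# The non-integer example: close to 0.01, but not equal to it.
diff = 35 + -34.99
print(f32_ok, f32_lost, f64_ok)
print(diff == 0.01, math.isclose(diff, 0.01))
```

The usual takeaway is to compare non-integer results with a tolerance (`math.isclose`) rather than `==`.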
Having done exactly this math for GStreamer bindings in JavaScript (where the built in numeric types are double or nothing), this would also be my prime suspect.
Had a similar problem to this many years ago. Happened every 24 days approximately and lost one user setting. Had a logic analyser connected to it for days trying to reproduce the issue in some way. Went to go for a piss and get a coffee one afternoon and came back and there it was triggered!
What happened? Well it turns out there was a timer that no one used that overflowed and caused an interrupt which wasn’t handled any more, the interrupt handler fell through, caused a halt and the WDT fired, rebooting it, and some idiot hadn’t stored that one setting in the NVRAM.
So then we had more problems. 5000 things with EPROMs in that were rebooting every 24 days which were spread all over the planet. Many questions to ask over how the hell it ended up like that.
I hope people are asking these sorts of questions at Boeing.
Edit: also the source code we had did not match what was on the devices. Turned out the engineer who provided the hex file hadn’t copied that code to the file server and had left a year beforehand. We didn’t find that out until the WDT fired and piqued our interest and we couldn’t reproduce it on the dev board because the software was different (should have checked that past the label on the ROM, which was wrong!)
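For what it's worth, a period of roughly 24 days is what you get from a signed 32-bit counter ticking once per millisecond. Whether that was the counter width in the device described above is purely an assumption on my part. A small Python sketch of the arithmetic, plus the usual wrap-safe way to take a time difference (function names are illustrative):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
MASK32 = 0xFFFFFFFF

def days_until_wrap(bits: int, ms_per_tick: int = 1) -> float:
    """Days before a counter with `bits` usable bits wraps around."""
    return (2**bits) * ms_per_tick / MS_PER_DAY

def elapsed_ticks(now: int, then: int) -> int:
    """Wrap-safe difference of two unsigned 32-bit tick readings."""
    return (now - then) & MASK32

print(f"signed 32-bit ms counter:   {days_until_wrap(31):.2f} days")
print(f"unsigned 32-bit ms counter: {days_until_wrap(32):.2f} days")

# One reading taken just before the wrap, one just after it.
before, after = 0xFFFFFFF0, 0x00000010
naive = after - before                 # hugely negative, looks like time ran backwards
safe = elapsed_ticks(after, before)    # masks the difference back into 32 bits
print(naive, safe)
```

The masked subtraction only works while the real interval is shorter than one full wrap, which is normally the case for timeouts and scheduling.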
I’d note that commercial airplanes generally operate with 6-7 9’s of availability. For anyone that’s ever built a system with 5 9’s, this is impressive. In fact it’s impressive enough you probably don’t think twice about sleeping on a flight.
Six 9s would be half a minute of downtime per year.
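The conversion from "nines" to downtime is easy to sketch in Python. This treats availability as a simple fraction of a 365.25-day year, which is an assumption; the aviation figure being discussed may well be measured per flight or per event instead.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines    # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability

for n in range(3, 8):
    print(f"{n} nines: {downtime_seconds_per_year(n):,.1f} s/year")
```

Five nines comes out at a little over five minutes a year, six nines at about half a minute.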
I don't see how that is possible given the maintenance required for these planes. Even the simple A checks ground a plane for hours every couple hundred flights while D checks take months to complete every 6-10 years.
It counts any unplanned event, including unscheduled maintenance, unplanned flight deviations, and of course catastrophic failures.
I know a commercial pilot who used that as a joke once and got in trouble. The plane in question had several pilots on it but the rest of the passengers didn’t find it funny for obvious reasons.
“We don’t wish to cause any alarm, but is there any one on board who is familiar with regular expressions, cron expressions and parameter expansion rules in bash?”
You joke but… There was an emergency nose high recovery out of San Diego airport where at one point the pilot had every passenger crowd into the first class cabin…
I’ll wager if you got into a situation you can’t escape where you had a 30% chance of a horrific death over the next six hours you wouldn’t snuggle into your sound suppressing headphones and doze off between snacks no matter how inevitable things are.
Where in the world did you get that 30% number? Even on Boeing's worst planes, the chances of any incident are still much more like 0.003% or something like that. "30%" is just fearmongering.
Meh, had worse odds and slept fine. Notably, however, odds are not nearly that bad, even in Aeroflot flights out of Africa. Or for that matter, combat flights in war zones.
Besides, a plane crash is far from the worst way to go. Dramatic, sure.
But dementia? Cancer? That’s often a pretty miserable death.
Plenty of things out there to get worried about if you want.
We’re just going to see more and more issues like this as more and more software is used in applications like this. I would be willing to bet that a Tesla would also spontaneously crash if left on for hundreds of hours, but they just rarely if ever are left on that long.
Ford F150 Lightning had a similar issue on a cross country road race some YT'ers put on. It died at 13% battery, Ford said it was due to not letting the truck rest.
This is remarkably business-as-usual for airplane electronics.
As a more mundane example: the wifi on planes does temporary [edit: DHCP, not NAT] leases. But the system on many has expiration windows on the order of hours, possibly more than a day... Couple that with the number of passengers planes serve and busy routes can easily exhaust the lease pool.
The solution: there's a button the flight attendants can push to reboot the router, dumping the lease table.
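A toy Python model of how that exhaustion plays out. Every number here (pool size, passengers per flight, turnaround, lease times) is made up for illustration, and `LeasePool` is not any real DHCP server's API.

```python
from dataclasses import dataclass, field

@dataclass
class LeasePool:
    size: int                      # number of addresses available
    lease_seconds: int             # how long a lease stays reserved
    leases: dict = field(default_factory=dict)   # client id -> expiry time

    def request(self, client: str, now: int) -> bool:
        """Grant or renew a lease if an address is free at time `now`."""
        # Reclaim anything that has expired.
        self.leases = {c: exp for c, exp in self.leases.items() if exp > now}
        if client in self.leases or len(self.leases) < self.size:
            self.leases[client] = now + self.lease_seconds
            return True
        return False

    def reboot(self) -> None:
        """The flight attendant's button: forget every lease."""
        self.leases.clear()

def simulate(pool: LeasePool, flights: int, passengers: int, turnaround_s: int) -> list:
    """Return how many passengers were refused an address on each flight."""
    refused = []
    for flight in range(flights):
        now = flight * turnaround_s
        denied = sum(
            not pool.request(f"flight{flight}-pax{p}", now)
            for p in range(passengers)
        )
        refused.append(denied)
    return refused

# Hypothetical: 254 addresses, 150 devices per flight, a flight every 3 hours.
long_leases = simulate(LeasePool(254, 24 * 3600), 4, 150, 3 * 3600)
short_leases = simulate(LeasePool(254, 1 * 3600), 4, 150, 3 * 3600)
print("24h leases:", long_leases)
print("1h leases: ", short_leases)
```

With day-long leases the pool is full partway through the second flight and stays full; with leases shorter than the turnaround nobody is refused. The reboot button is the blunt version of the second case.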
To steelman the choice, the reserved IP /8 subnet is 10.x.x.x and is often used for corporate networks, and other large private subnets see similar usage. People on the plane using WiFi are likely to access their corporate networks via VPN, potentially causing routing issues.
Users VPNing into the reused address space for their own home VPN are probably knowledgeable enough to figure out what is going on and a small enough user base to not care about.
I'm no network guy, so someone please explain why using 10.x.x.x on a plane might "potentially cause routing issues"? It doesn't jibe with what I understand about unrouteable address spaces. Is the 10.x.x.x space somehow different than the 192.168.x.x space that millions of people use VPNs out of every day (basically every WFH person on their cheap NAT'd home Wifi)?
Because IPv4 sucks! If you don't have enough publicly routable addresses then you are forced to use reserved ranges like 10/8. That means you'll get collisions, i.e. multiple networks using the same addresses. With IPv6 you'd just get a real public IP address and all would be fine.
Edit: I feel bad for saying IPv4 sucks. It's one of my favourite pieces of tech and an astonishingly good one at that. It just doesn't have a big enough address space.
Hopefully I'm not too late to the party. When you set up a VPN, you are telling your network stack that all connections for a set of IP addresses will be handled by it, in this example case, all 10.x.x.x requests will be routed through the VPN's application. The VPN will then wrap up all requests through that connection and send them out to the Internet towards the public IP address on the other end of the VPN. To send things out to the Internet, you use your default gateway, basically an IP address everything is sent to when it doesn't match any other configured route (see `ip route`). If your local network is using the 10.x.x.x subnet for local connections, it will likely be 10.0.0.1 or something. But who handles that route? Your VPN which would then just recursively keep handling its own request.
Now, I think VPN applications are smarter than that and will still get the outgoing packet to the default gateway (citation and research needed), but what happens when it doesn't know to handle a route automagically? For instance, with DHCP, a router can tell your computer what DNS server to reach out to. If that's on the local network, now you see all DNS requests actually routing into the network on the other side where you almost certainly aren't going to be talking to a DNS server. And now, you can't go to any websites.
Hopefully this helps. I'm not the most knowledgeable about VPNs and routing, but I'm pretty sure this is all fairly accurate.
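Here is a rough Python sketch of the route selection described above, using the standard `ipaddress` module. The addresses, the subnet sizes and the route labels are all invented for illustration, and a real OS routing table also weighs metrics and interface state, which this ignores.

```python
import ipaddress

def pick_route(routes, destination: str) -> str:
    """Longest-prefix match: the most specific route containing the address wins."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(net), via)
        for net, via in routes
        if dest in ipaddress.ip_network(net)
    ]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Hypothetical plane Wi-Fi: clients sit in 10.20.0.0/16, everything else
# (including a DNS server at 10.30.0.53 on another onboard subnet)
# goes via the default gateway.
before_vpn = [
    ("10.20.0.0/16", "wifi-link"),
    ("0.0.0.0/0", "wifi-gateway"),
]
# Hypothetical corporate VPN that claims all of 10.0.0.0/8.
after_vpn = before_vpn + [("10.0.0.0/8", "vpn")]

dns_server = "10.30.0.53"
print("DNS before VPN:", pick_route(before_vpn, dns_server))
print("DNS after VPN: ", pick_route(after_vpn, dns_server))
print("gateway after VPN:", pick_route(after_vpn, "10.20.0.1"))
print("public host after VPN:", pick_route(after_vpn, "203.0.113.10"))
```

In this made-up setup the gateway itself stays reachable because the /16 is more specific than the VPN's /8, but the DNS server handed out by DHCP falls inside the VPN's range and gets sent down the tunnel. The reverse collision also exists: a corporate host that happens to live in 10.20.x.x would be captured by the local link route.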
I would vote for a once a year internet holiday. It would bring minor mental wellbeing improvements, coerce important industries and systems to exercise redundancy pathways, provide opportunities to have such a cutover like switching to IPv6, and remind a million petty tyrant product managers that no, our goddamned fart app does not need 6 9s of reliability.
100 bucks says IPv6 would still not get implemented. We need legislation at this point. There's enough stubborn assholes in the networking infrastructure industry just refusing to do their job for it to happen by itself. They will insist they need to save a few thousand bucks and hold the whole damn world back.
Interestingly, there was some controversy in middle schools whereby mischievous students would deliberately disable their device's Internet access in order to play a built-in browser game, and this was seen as undesirable, so I believe that the agreed mitigation was to disable the game entirely. :-(
Scary as it is, is there any reason for a passenger jet to have uptime of more than, say, 24hrs? Wouldn't you just switch it off and on again between every flight, regardless?
If this issue was in a car, we would never know as no one keeps their car running for 50 days straight.
Overnight, planes tend to be plugged in to ground power, to ventilate, keep the batteries charged, for the cleaning crews, etc. Most get rebooted once in a while, but it's always possible one won't be, hence the directive to be certain.
This particular problem has been known for years (the article is from 2020).
Unfortunately, an aircraft has no “reboot”. It is just a violent power cut. A lot of headache is introduced in non-critical aircraft software because there is no “graceful shutdown” or long power duration. In fact, certain hardware has an upper limit (much lower than a week) before which it needs one power cut (sometimes called a power cycle) or it suffers from various buffer overflows and counter overflows and starts acting mysteriously.
It's amazing that's legal. Like, why do we accept software that does this? It can be done in such a way that these things don't happen. Put another way, why aren't the companies involved being fined and sued out of business? Why aren't their managers facing criminal negligence charges? It's outrageous.
Because there has never been a single commercial jetliner fatality caused by software in its intended operational domain failing to operate according to specification. That makes the commercial jetliner software development and deployment process by far the safest and highest reliability ever conceived by multiple orders of magnitude. We are talking in the 10-12 9s range.
And just to get ahead of: “Well what about the 737 MAX”, that was a system specification error, not due to “buggy” software failing to conform to its specification. The software did what it was supposed to do, but it should not have been designed to do that given the characteristics of the plane and the safety process around its usage.
>“Well what about the 737 MAX”, that was a system specification error, not due to “buggy” software failing to conform to its specification. The software did what it was supposed to do
Exactly: the system was designed to fly the plane into the ground if a single sensor was iced up, and that's exactly what the software did. Boeing really thought this system specification was a good idea.
That is a massive over-simplification and that invites patently false characterizations like it was a "stupid mistake" that would have been fixed if they were not stupid (i.e. adopted average development process). That is absolutely not the case. They were really capable, but aerospace problems are really, really hard, and their safety capability regressed from being really, really capable.
They modified the flight characteristics of the system. They tuned the control scheme to provide the "same" outputs as the old system. However, the tuning relied on a sensor that was not previously safety-critical. As the sensor was not previously safety-critical, it was not subject to safety-critical requirements like having at least two redundant copies as would normally be required. They failed to identify that the sensor became safety critical and should thus be subject to such requirements. They sold configurations with redundant copies, which were purchased by most high-end airlines, but they failed to make it mandatory due to their oversight and purchasers decided to cheap out on sensors since they were characterized as non-safety-critical even if they were useful and valuable. The manual, which pilots actually read, has instructions on how to disable the automatic tuning and enable redundant control systems and such procedures were correctly deployed at least once if not multiple times to avert crashes in premier airlines. Only a combination of all of those failures simultaneously caused fatalities to occur at a rate nearly comparable to driving the same distance, how horrifying!
An error in UX tuning dependent on a sensor that was not made properly redundant was the "cause". That is not a "stupid mistake". That is a really hard mistake and downplaying it like it was a stupid mistake underestimates the challenges involved in designing these systems. That does not excuse their mistake as they used to do better, much better, like 1,000x better, and we know how to do better and the better way is empirically economical. But, it does the entire debacle a disservice to claim it was just "being stupid". It was not, it was only qualifying for the Olympics when they needed to get the gold medal.
I really don't think it takes a mastermind of software design to go "okay I've built a system that takes control of the plane's maneuverability, let's make sure we have redundant sensors on this". Furthermore, descriptions of MCAS and its role were dangerously underplayed so that they didn't have to tell their customers to retrain their pilots. An egregious breach of public trust in a company we put a whole lot of faith into.
>They failed to identify that the sensor became safety critical and should thus be subject to such requirements.
Whistleblower testimony indicated it wasn't a failure to identify it as safety critical, but a conscious decision not to mention it as such to the regulator, and not implement it as a dual sensor system as doing so would have caused the design to require Class D simulator training; which Boeing was relying on the absence of as a selling point to prevent existing airlines from defecting to Airbus.
>They sold configurations with redundant copies, which were purchased by most high-end airlines, but they failed to make it mandatory due to their oversight and purchasers decided to cheap out on sensors since they were characterized as non-safety-critical even if they were useful and valuable.
Incorrect. All MAXes have two AoA vanes, each paired to a single Flight Computer. The plane has two Flight Computers, one on each side of the cockpit, and the computer in command is typically alternated between each flight. One computer per flight will be considered in-command (henceforth referred to as Main), the other will be henceforth referred to as operating as "auxiliary". The configuration you're thinking of is an AoA disagree light, implemented by enabling a codepath in software running on the Main FC whereby a cross-check of the value from the AoA vane networked to the auxiliary FC would light up a warning light to inform pilots that system automation would be impacted, because the AoA values between the MFC and AFC differed. A pilot would be expected to recognize this and adapt behavior accordingly/take measures to troubleshoot their instruments. Importantly, however, this feature had zero influence on MCAS. MCAS only took into account inputs from the vane directly wired to the Main FC. While a cross-check happened elsewhere for the sole purpose of illuminating a diagnostic lamp, there was no cross-check functionality implemented within the scope of the MCAS subsystem. The MCAS system was not thoroughly documented in any documentation delivered to pilots. The program test pilot got specific dispensation to leave that out of the flight manual. See the Congressional investigation, final NTSB, and FAA report.
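To make the structural point concrete, here is a deliberately simplified Python sketch. It is NOT Boeing's logic: the thresholds, the function names and the "cross-checked" variant are all hypothetical, invented only to contrast a consumer that trusts one vane with one that consults both.

```python
DISAGREE_THRESHOLD_DEG = 5.0   # hypothetical value, for illustration only
TRIGGER_AOA_DEG = 14.0         # hypothetical value, for illustration only

def disagree_light(aoa_main: float, aoa_aux: float) -> bool:
    """Diagnostic only: lights a lamp when the two vanes differ too much."""
    return abs(aoa_main - aoa_aux) > DISAGREE_THRESHOLD_DEG

def single_sensor_trigger(aoa_main: float) -> bool:
    """Acts on the main-side vane alone, as the comment describes for MCAS."""
    return aoa_main > TRIGGER_AOA_DEG

def cross_checked_trigger(aoa_main: float, aoa_aux: float) -> bool:
    """Hypothetical alternative: refuse to act when the vanes disagree."""
    if disagree_light(aoa_main, aoa_aux):
        return False
    return aoa_main > TRIGGER_AOA_DEG

# A failed main-side vane reading nonsense while the other side is healthy.
faulty_main, healthy_aux = 40.0, 3.0
print("lamp lit:            ", disagree_light(faulty_main, healthy_aux))
print("single-sensor acts:  ", single_sensor_trigger(faulty_main))
print("cross-checked acts:  ", cross_checked_trigger(faulty_main, healthy_aux))
```

The point of the sketch is only that the lamp and the actuation logic are separate consumers of the data: the disagreement can be detected in one place while the other still acts on the bad reading.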
>The manual, which pilots actually read, has instructions on how to disable the automatic tuning and enable redundant control systems and such procedures were correctly deployed at least once if not multiple times to avert crashes in premier airlines.
The documentation, which included an Airworthiness Directive and NOTAM, informed pilots any malfunction should be treated in the same manner as a stabilizer trim runaway. Said problem is characterized in aviation parlance as a continual uncommanded actuation of trim motors. MCAS, notably, is not that. It is periodic, and in point of fact, it ramps up in intensity over time until over 2° of travel are commanded by the computer per actuation event, with the timer between actuations being reset to 5 seconds by use of the on yoke Stab trim switches. This was not communicated to pilots. Furthermore, there were design changes to the Stab-Trim Cutout switches between 737NG (MAX's predecessor), and MAX. In the NG, the Stab Trim cutout could isolate the FC alone, or both FC and yoke switches from the Stab Trim motor. In MAX, however, the switches were changed to never isolate the FC from the Stab trim motors, because MCAS being operational was required for being able to checkmark FAR compliance for occupant carrying aircraft. So when that cutout was used, all electrically assisted actuation of the horizontal stabilizer became unavailable. The manual trim wheel would be the only trim input, and in out-of-trim attitudes, would result in such excessive loading on the control surface that physical actuation without electronic assistance was not feasible on the timescales required to recover the plane. There was a maneuver known to assist with these conditions (when they occurred at high altitude) called "roller coastering" in which you dive further into the undesired direction to unload the control surface to render it actuable. This technique has not been in official documentation since Dino 737 (Pre-NG).
The events you're referring to when uncommanded actuations were recovered on other flights, happened at high altitudes, and were recovered with countered electrical stab switch actuation followed by Stab trim cutout within the reset 5 second watchdog timer prior to MCAS activation subsequent to a Stab-trim yoke control switch actuation. This procedure, and the implementation details needed to fully understand its significance, were undocumented prior to the two crashes. Furthermore, this procedure to cut out MCAS/the MFC from the stab trim motor and finishing the flight in a completely manually trim controlled configuration meant that technically you were flying an aircraft in a configuration that could not be certified to carry passengers when taking the FAR's prescriptively, and uncompromisingly rules-as-written with zero slack offered for convenience, because MCAS was necessary for grandfathering the MAX under the old type cert, and without MCAS functional, it's technically a new beast, which is non-compliant with control stick force feedback curves when approaching stalls, which by the way, just to make it clear, a compliant curve has been a characteristic of every civil transport in all jurisdictions worldwide for well over 50 years.
This was not documented and only became apparent after investigation. Again, see the House findings, FAA report, and NTSB.
>Only a combination of all of those failures simultaneously caused fatalities to occur at a rate nearly comparable to driving the same distance, how horrifying!
Oh, the multi-billion dollar aircraft maker built a machine that crashes itself, gaslit its regulators, pilots, airlines, and the flying public to juice the stock price so executives could meet their quarterly incentives, and diverted funds away from its QA and R&D functions to do stock buybacks, move HQ away from the factory floor, and try to union bust. With over 300 direct measurable deaths within a couple of months and multiple years worth of grounding and mandated redesigns to fix all the other cut corners we've been unearthing, and veritable billions of dollars of loss incurred in delays. Heavens, it could happen to anybody. How could you possibly see this as something to get upset about? /s
Thank you for providing a more thorough and complete technical explanation.
As you can see from my final statement, I made no argument that it was not a travesty. It was ABSOLUTELY UNACCEPTABLE. This is not a defense of their inadequacy.
I was pointing out how it is absolutely incorrect to claim that it was a "stupid mistake". That argument is used by people implicitly arguing that "If only Boeing used modern software development practices like Microsoft/Google/Crowdstrike/[insert big software company here] then they would have never introduced such problems". That is asinine. As can be seen from your explanation, the problem is multi-faceted, requiring numerous design failures across implementation, integration, and incentives. In fact, the problems are even more subtle and pernicious than in my original explanation, which was derived from high-level summaries rather than the investigation reports themselves.
I do not know if this has changed in the last few years, but at Microsoft you were required to have 1 whole randomly-selected person, with no required domain expertise, say they gave your code, in isolation, a spot check before it could be added. This is the same process applied regardless of code criticality, as they do not even have a process to classify code by criticality. This is viewed as an extraordinary level of process and quality control that most could only dream of achieving. Truly if only Boeing threw out whatever they were doing and adopted such heavyweight process by "best-in-class" software development houses they would have discovered and fixed the 737 MAX problems.
Boeing does not need to adopt modern software development "best practices" and whatever crap they use at Microsoft/[insert big software company here] that introduces bugs faster than ant queens. The processes in play that created the 737 MAX already make Microsoft and its peers look like children eating glue, but they are inadequate for the job of making safe aerospace software and systems. What Boeing needs to do is re-adopt their old practices that make the 737 MAX development processes look like a child eating glue. The 737 MAX was not stupid, it was inadequate. BOTH ARE UNACCEPTABLE, but the fix is different.
This is a totally bizarre strawman argument. Safety-critical software has almost nothing in common with Microsoft crapware, or indeed, most typical desktop software. Even within the desktop software industry, MS has never been held up as "best-in-class", but rather the butt of jokes.
As the other poster said, it doesn't take a genius to figure out that a new safety-critical system needs its sensors to be redundant. It wasn't stupid, though, it was malicious: Boeing wanted to hide the existence of MCAS so that pilot retraining wouldn't be required.
So what should we make of these issues described in the article? When, not if, this kind of thing kills people will it be a specification error? Will we blame it on maintenance? Surely it can't be the software's fault!
First of all, who got blamed for the 737 MAX? Boeing did. This is one of the few industries where the responsibility does not get easily sloughed off.
Second, 787s have been flying for ~13 years and ~4.5 million flights [1]. Assuming they were unaware of the problem for the majority of that time, their unknowing maintenance and usage processes avoided critical failures due to the stated problems for a tremendous number of flights. Given they now know about it and are issuing a directive to enhance their processes to explicitly handle the problem, we can assume it is even less likely to occur than previously, which was already experimentally determined to be ludicrously unlikely. Suing someone into oblivion for an error that has never manifested as a serious failure and that is exceedingly unlikely to manifest is a little excessive.
Third, they should be remediating problems as they arise, balanced against the risks introduced by specification changes and against the alternative of other process modifications. Given Boeing’s other recent failings, they should be given strict scrutiny that they are faithfully following the traditional, highly effective remediation processes. It should only be worrisome if they are seeing disproportionately more problems than would be expected in an aircraft design of its age and are not remediating problems robustly and promptly.
> Suing someone into oblivion for an error that has never manifested as a serious failure and that is exceedingly unlikely to manifest is a little excessive.
I appreciate your point of view. The air travel industry is undeniably safe, moreso than any transportation system ever. By a large margin. On the other hand, it is possible to make software systems that do not have the defects described in the article. So how do we get to the place where we choose to build systems that behave correctly? I don't think we get there without severe penalties for failure.
>The air travel industry is undeniably safe, moreso than any transportation system ever.
I disagree: the Japanese shinkansen bullet train system has never had a fatal accident, except for a single incident 30 years ago when someone was caught in a door and dragged 100 meters. No fatalities from collisions, derailments, etc., ever, since the 1960s. That's far safer than air travel could ever claim to be.
Even other train systems have better records than commercial aviation, in general. Plane crashes are rare these days, but they still happen once in a while, and the results are usually catastrophic.
Are planes safer than cars? Well of course, but that's a really, really low bar: cars are driven by all kinds of morons who frequently (esp. in the US) have little to no training or testing, are frequently distracted, don't have a copilot who can take over at any time, and are frequently operating in a very, very chaotic environment (like city streets). It's truly a wonder there aren't more fatal crashes. But safer than trains in general? I seriously doubt it.
Actually, the Shinkansen seems to average ~100 billion passenger-km per year [1] or ~60 billion passenger-miles per year. Using that as an overestimate for the last 60 years, that is a grand total of 3.6 trillion passenger-miles.
US commercial aviation averages ~1 trillion passenger-miles per year [2]. So if we compare the last 4 years of US aviation that is a comparable number of passenger-miles.
Over the last 4 years recorded on this dataset (2019-2022)[3] it looks like there were 5 fatalities total. Over the last 4 years recorded on this dataset (2018-2021)[4] it looks like there were 2 fatalities total.
So, while it does not appear to be safer, it is within a few factors on a passenger-mile basis. Furthermore, there are multiple periods of 4 trillion consecutive passenger-miles where there were 0 recorded accidents. It is nowhere near obvious that it is “far safer than air travel could ever claim to be”, and it is certainly a much closer race than you believed given your other assertions.
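To make the passenger-mile arithmetic above easy to check, here is a minimal sketch. It only plugs in the rough figures quoted in this comment (they are estimates, not authoritative data), and the function and variable names are mine.

```python
# Rough figures quoted in the comment above; treat them as estimates.
TRILLION = 1e12


def fatalities_per_trillion_passenger_miles(fatalities, passenger_miles):
    """Fatality rate normalized to one trillion passenger-miles."""
    return fatalities / (passenger_miles / TRILLION)


# Shinkansen: ~60 billion passenger-miles/year, overestimated across ~60 years.
shinkansen_miles = 60e9 * 60
# US commercial aviation: ~1 trillion passenger-miles/year over 4 years.
aviation_miles = 1e12 * 4

shinkansen_rate = fatalities_per_trillion_passenger_miles(1, shinkansen_miles)
aviation_rate_high = fatalities_per_trillion_passenger_miles(5, aviation_miles)
aviation_rate_low = fatalities_per_trillion_passenger_miles(2, aviation_miles)

print(f"Shinkansen: {shinkansen_rate:.2f} per trillion passenger-miles")
print(f"US aviation: {aviation_rate_low:.2f} to {aviation_rate_high:.2f} per trillion passenger-miles")
```

With these inputs the aviation figures come out at roughly 2x to 4.5x the Shinkansen figure, which is what "within a few factors" refers to.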
That's not exactly a fair comparison, because you're comparing distances traveled, rather than trips taken. Of course planes are going to look good, since they travel much longer distances than cars or trains, and because planes are more likely to have trouble when taking off or landing than any time in-between. It's not like you can just take a commercial airliner flight to go to your local grocery store, even though statistically you're more likely to get killed on that trip than on a cross-continent flight.
First of all, passenger-distance per event (or its inverse) is the standard metric used when comparing transportation safety. You would be hard-pressed to find any broad, rigorous comparison that does not compare on that metric. It encodes the risk of a trip to a location of a certain distance. It is absolutely a fair comparison.
Second of all, even if we do use your metric which only cares about passenger-trips per event it still does not matter. The Shinkansen has transported ~6.4 billion people since inception. As seen in the second link I provided above, US commercial aviation serves ~900 million passengers per year. So, that is 7 years of US commercial aviation to transport the same number of people the Shinkansen has ever transported. As seen on the third link the last 7 years (2016-2022) had ~6 fatalities and as seen on the fourth link the last 7 years (2015-2021) had 2 fatalities compared to the 1 fatality on the Shinkansen.
Third of all, given that the Shinkansen has transported ~6.4 billion people, but averages 150 million people per year and ~60 billion passenger-miles per year, we can reasonably conclude that I overestimated at ~3.6 trillion passenger-miles and it would likely actually be ~2.4 trillion passenger miles or just 2.5 years of US aviation. From the third link that would be a mere 1 fatality and from the fourth link 0-1 fatalities.
If we extend our analysis to the last decade the third link indicates 15 fatalities over ~10 trillion passenger miles, ~2x the Shinkansen rate, and the fourth link indicates 2 fatalities over ~10 trillion passenger miles, ~50% the Shinkansen rate. Again, broadly comparable, but it is hard to truly tell which one is "safer" than the other. And again, they are clearly in the same ballpark and not dramatically different as you implied.
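For anyone who wants to redo the per-trip version of the comparison, a small sketch follows. The inputs are the approximate passenger and fatality counts quoted in this comment, and the names are my own.

```python
# Approximate figures quoted in the comment above; not authoritative data.
BILLION = 1e9


def fatalities_per_billion_trips(fatalities, passenger_trips):
    """Fatality rate normalized to one billion passenger trips."""
    return fatalities / (passenger_trips / BILLION)


# Shinkansen: ~6.4 billion passengers since inception, 1 fatality.
shinkansen_trips = 6.4e9
# US commercial aviation: ~900 million passengers/year over 7 years.
aviation_trips = 900e6 * 7

shinkansen_per_trip = fatalities_per_billion_trips(1, shinkansen_trips)
aviation_per_trip_high = fatalities_per_billion_trips(6, aviation_trips)
aviation_per_trip_low = fatalities_per_billion_trips(2, aviation_trips)

print(f"Shinkansen: {shinkansen_per_trip:.3f} per billion trips")
print(f"US aviation: {aviation_per_trip_low:.3f} to {aviation_per_trip_high:.3f} per billion trips")
```

The two datasets disagree on the aviation fatality count, so the result is a range rather than a single number.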
> So how do we get to the place where we choose to build systems that behave correctly? I don't think we get there without severe penalties for failure.
What failure? The planes work. This is puritanism.
> First of all, who got blamed for the 737 MAX? Boeing did. This is one of the few industries where the responsibility does not get easily sloughed off.
The whistleblowers dying is coincidental and convenient.
1. For at least one of the whistleblowers, it was certainly not "convenient" because he already managed to go public with the accusation, the lawsuit was filed, and his deposition was already made.
2. I'm not sure how a few whistleblowers dying disproves "responsibility does not get easily sloughed off". If anything, they're getting more responsibility than is warranted. Every time there's something wrong with a Boeing product, people almost reflexively start posting about how it must be caused by corner cutting by Boeing, or how it's yet more evidence that Boeing is circling the drain. This happens even for incidents involving planes that are decades old, have a solid service history, and by all accounts were probably caused by pilot error or improper maintenance.
It works fine until it doesn't and people die. At which point the blame falls on the maintenance crew? That's wrong. And where there's smoke there's fire. If the software has this horrible bug, likely the broken culture that created it has written worse, more subtle bugs.
I agree completely with the first part.
But SWA-1380 was a commercial operating fatality in 2018. Not a crash into terrain, but the engine definitely crashed into the fuselage.
Because changes to that software go through an enormous amount of testing, validation, and documentation before a new baseline becomes a flashable item. Meanwhile, an always-working workaround is needed now.
Have you even found the documentation around things like ACPI? It's kinda coupled with UEFI these days I think, and hell, I'm not even sure of the hardware boards/revisions aircraft makers are using these days... Are they still on BIOS? Or old-as-sin linux/RTOS kernels/microcontrollers?
Point being, when you start talking about high QA systems, where the Quality is non-negotiable (you will have everything documented and tested); barring exec/managerial malfeasance in preventing that work from being done, you reach for the same simple things over and over again since it takes a hell of a lot of work to actually characterize and certify a thing to the requisite level of reliability/operating conditions.
There's also no reason why a "reboot" can't be a "violent power cut", especially if the equipment in question doesn't hold any state. For instance, there's no reason why you'd need to go through a shutdown sequence for a printer.
The testing for aerospace is extremely rigorous ... For DO-178C level A (Catastrophic failure that can cause a crash or many fatal injuries) we're estimating 2 years to do MC/DC test coverage metric of a fairly basic software system that has two mechanical backups. And that's above and beyond the extensive unit tests.
The main thing that gets checked is the worst-case timing analysis for every branch condition. And there are stack monitors to monitor if the stack is growing in size.
Look at Rapita System's website for more info ... we don't use them, but they explain it well.
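For readers who haven't met MC/DC (modified condition/decision coverage): the core requirement is that each condition in a decision is shown to independently affect the outcome. Below is a minimal sketch that brute-forces the "independence pairs" for a toy decision. The decision `a and (b or c)` and all the names are hypothetical, and real DO-178C work uses qualified tooling rather than a script like this.

```python
from itertools import product


def decision(a, b, c):
    # Hypothetical decision under test, not from any real avionics code.
    return a and (b or c)


def independence_pairs(fn, n_conditions):
    """For each condition, find input pairs that differ only in that condition
    and flip the decision outcome (the unique-cause form of MC/DC)."""
    pairs = {i: [] for i in range(n_conditions)}
    for vec in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if vec[i]:
                continue  # visit each pair once, starting from its False side
            flipped = vec[:i] + (True,) + vec[i + 1:]
            if fn(*vec) != fn(*flipped):
                pairs[i].append((vec, flipped))
    return pairs


pairs = independence_pairs(decision, 3)
for index, found in pairs.items():
    print(f"condition {index}: {len(found)} independence pair(s)")
```

A test suite satisfies MC/DC for this decision if it includes at least one pair per condition. Here four test vectors are enough, versus eight for exhaustive testing, and that gap grows quickly as decisions get more conditions.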
Many cars' control units continue to run while the car is off. If you want to reboot your vehicle, you need to unplug the 12v battery for at least a minute.
On some cars (recent VWs in particular) when you plug the battery back in you need to twiddle some settings in the computer otherwise the charging circuit will fry the battery prematurely. We've gotten ahead of our skis with this nonsense, time to rein it in.
It's hard to imagine an interpretation of this behavior that doesn't involve manufacturers trying to punish independent mechanics and end users who service their own cars. Like, there's no way it's an "honest mistake", right?
BTW I have an AGM ("advanced glass mat") battery in my 1995 Toyota which has a completely analog charging system, and it doesn't get cooked, so it's not because there's something special about the battery.
My point is there was absolutely no need for the System Engineers to touch the charging system. The normal analog diode rectifier variety that has been standard since the 1960s is Good Enough. No "Innovation" Needed. Take your spacecamp nerds elsewhere.
Having performed repairs on a BMW motorcycle, I am quite aware. It is a good point, but I highly doubt that it would play a role in this case. There must be something there that we are missing.
That's because BMW ECUs adapt to the lower voltage as the battery ages and instruct the alternator/charger to provide more current. Replace the battery and the ECU would cause it to be overcharged unless you notify it of the replacement. Yes it's an over-engineered system, but ... German car.
Some of these planes are constantly flying as long as they're not in maintenance. A plane not in the air is a plane the company bought that's not currently generating profit.
unsure what you mean here. most of the systems go to a sleep state in modern vehicles ev or not. the 12v battery keeps only certain ECU's up - think ECUs that control alarm, lock and unlock state and any communication with the mobile app via LTE... but the rest of the systems are OFF, you don't want an EV battery to hit 0% and 12V to also hit 0% - that would basically make it a brick from what I understand- because EV's have contactors which need to shut for the battery to be 'engaged' the 12V battery controls these contactors.
A car with an enormous rack of high capacity batteries able to accelerate an 8000 pound object to 60mph and sustain that for hundreds of miles generally doesn’t depend on the backup battery for literally anything. It has so much excess energy storage in the form of electricity in the primary batteries it generally doesn’t power down the onboard computers at all.
Indeed when you get close to exhausting the main battery rack it starts selectively shutting down everything. I’ve never personally let mine get to 0% ever - but for instance a Tesla is continuously on, and if you use sentry mode it’s not just on but the GPU is constantly doing classification of the environment to determine if someone is prowling your vehicle.
Every EV depends on the 12v battery for starting up / has the HV battery off when your car is off, that's why if the 12v battery is dead your car won't start.
Low voltage battery death in any EV essentially causes a brick. The only exception is some cars (I think Tesla does this?) keep their contactors closed all the time when the 12v is determined to be failing. It makes the drain at idle much higher, but then at least it can continue moving… as long as you don’t let the HV pack drain…
Very strange, because for me, an aircraft (medium) is never alive for more than 24h. A big one like the 787 may be alive for up to 72h (assuming longer routes). 50 days for me would be a dream and a lot less headache, but it is very expensive to keep an aircraft powered that long with ground power.
I know someone on the north slope of Alaska. He does not turn his personal truck off all winter. This is even more typical for semi trucks and whatnot around there.
I think it's about the worst case scenario. You wouldn't want this to happen even rarely, especially when it can be solved by putting more time (and god forbid, money) into R&D.
Airlines will run the aircraft as long as possible. As another commenter mentioned, if an aircraft isn't in flight, it's in maintenance. All of these times it's on.
In the software world I call this an end user discovered issue. But when the issue involves a plane that is carrying actual souls, that can feel very scary.
I am sure this has been resolved by now since it's from 2020.
Reminds me of the F-22 Raptor crossing the International Dateline error in 2007. They were flying a squadron of them from Hawaii to Japan. They crossed the IDL and all nav/fuel systems went down, as well as some communications gear.
They only made it back because they were flying with tankers at the time, who led them back to base.
That depends on how much code was having trouble, and what you mean by "resolved".
The safe option might be to avoid the situation, and I could imagine that even if there is a code update it might just make the plane balk at getting ready to take off after a certain amount of uptime.
There was a similar problem with a specific generation of 688-class submarines, where a calculated temperature would slowly drift. The metric wasn’t used for any protective actions, so it wasn’t a “shut down immediately and return home surfaced on the diesel” situation, but still disconcerting.
I assume that after this the software was soak-tested for weeks / months to eliminate that class of bug. Naval Reactors is many things, but repeating the same mistake twice isn’t one of them.
Really? Mine has an uptime of a year or so; it resets only if a big storm cuts the main power for a few seconds.
Maybe it is the new hardware? I have the original one (armv6, 512 Meg of RAM)
I'm honestly impressed that the Register included a prominent blurb explaining to the reader that while this sounds like a catastrophic issue, the most likely outcome if this is experienced in flight is a safe and controlled landing.
> Sidenote
>
> Pitch and power is a simple concept. If you have the throttles, say, three-quarters open and the nose of the aeroplane is pointing a few degrees above the horizon, chances are you're probably flying straight and level at a safe speed. Training manuals normally contain a number of precise pitch and power settings (they vary between aeroplane types) so if display systems start failing, pilots can fall back to these with confidence.
That's not what's alarming to me. What's alarming is that the plane could possibly be in a position to be continuously powered on for 51 days in the first place.
maintenance would happen with the aircraft in 'wheels on ground' mode but that may not mean all systems are turned off. I expect it's like a bug in the SMC on a computer. To really turn it off you have to do some magic.
I've flown with airlines before where there was a cascading delay due to a "plane deficit" at the terminal (not the technical term, that's my own). Not to say it's always uptime, but I imagine there are instances of constant uptime.
They can't just change things up on a dime like that. Even if it's 3 AM and most planes are sitting on the ground they can't just be used for your flight like that because they are all scheduled to take off in the morning rush a few hours later.
well now your system doesn't do anything because it's stuck in a forever loop checking the time. it's most likely programmed in C so you can remove the OOP as well.
Not getting it.. yeah the famous 32 bit ms overflow after 49 something days. But why then 51 here? Shouldn't they be required to reboot after 49 days please please? :D
It's possible to run tasks so that, instead of starting every second, each iteration starts one second after the previous one finishes.
So if you have something that checks the system health every millisecond, and keeps a count instead of a duration, then if it takes a couple microseconds to complete you might get something less than 86 million ticks per day instead of 86.4 million.
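A quick sketch of the arithmetic behind that idea. It assumes an unsigned 32-bit counter incremented once per iteration; the 5 microsecond overhead is a made-up number for illustration, and none of this is a claim about what the 787 software actually does.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def days_until_wrap(counter_bits, tick_period_ms):
    """Days until an unsigned counter incremented once per tick wraps to zero."""
    return (2 ** counter_bits) * tick_period_ms / MS_PER_DAY


def tick_period_for_wrap(counter_bits, days):
    """Tick period in ms that would make the counter wrap after `days` days."""
    return days * MS_PER_DAY / (2 ** counter_bits)


# Exact 1 ms ticks: the well-known ~49.7 day wraparound.
exact_days = days_until_wrap(32, 1.0)

# Hypothetical: each iteration starts 1 ms after the previous one finishes and
# the check takes 5 microseconds, so the effective period is 1.005 ms.
drifting_days = days_until_wrap(32, 1.005)

# Effective period that would stretch the wraparound to 51 days.
period_for_51_days = tick_period_for_wrap(32, 51)

print(f"exact 1 ms ticks wrap after {exact_days:.2f} days")
print(f"1.005 ms ticks wrap after {drifting_days:.2f} days")
print(f"a 51 day wrap needs a period of {period_for_51_days:.4f} ms")
```

Under these assumptions, a 51 day wrap would need roughly 26 microseconds of extra delay per 1 ms tick, so the theory is at least numerically plausible, though that says nothing about whether it is the real cause.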
The OS used on the 787 has a hard real-time scheduler. Tasks are started up at a specific frequency (set per task), run to completion or to the end of their time slot (set per task) and terminated. We had, IIRC, a strict 100ms slot for our bit of LRU software to do everything and it would be launched every 1s (from memory, that was 15 years ago). Information could be stored between executions so partial completion is something you could handle if needed by storing state information and using it at the start of the next iteration (we didn't need that, our tasks finished in the slot).
You don't base the start of a future task on the end of the prior one, you base it on a fixed clock for these kinds of systems.
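To illustrate the difference between the two approaches, here is a toy simulation. The 1 s period comes from the comment above, while the 80 ms run time and the function names are hypothetical. It is not how any real RTOS scheduler is implemented, only a way to see where the drift comes from.

```python
def fixed_delay_starts(iterations, delay_ms, run_ms):
    """Start times when each run begins delay_ms after the previous run ends."""
    starts = []
    t = 0
    for _ in range(iterations):
        starts.append(t)
        t += run_ms + delay_ms
    return starts


def fixed_rate_starts(iterations, period_ms):
    """Start times when each run is released by a fixed clock every period_ms."""
    return [i * period_ms for i in range(iterations)]


# Hypothetical task that takes 80 ms to run and should execute once per second.
delayed = fixed_delay_starts(5, delay_ms=1000, run_ms=80)
clocked = fixed_rate_starts(5, period_ms=1000)

# Drift of the fixed-delay schedule relative to the fixed clock, per iteration.
drift = [d - c for d, c in zip(delayed, clocked)]
print("fixed delay:", delayed)
print("fixed rate: ", clocked)
print("drift (ms): ", drift)
```

The fixed-delay schedule slips by the task's run time on every iteration, while the fixed-rate schedule stays aligned to the clock no matter how long each run takes, as long as it fits inside its slot.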
This company just can’t stay out of the news. Their planes are trash. Software is straight garbage. Many people have died because of this company and suffered undue stress/anxiety because of the massive dip in quality.
Boeing engineers/builders caught on audio stating they wouldn’t be caught dead in their own planes unless feeling suicidal.
The company definitely can't stay out of the news and it's gone downhill over the recent years but you've picked an interesting post to lament about those on. The news they can't stay out of is over 4 years old in this case. The model of plane it's about (787) has never had a single fatality despite >15 years of operations and >1,000 units operating today. In all, deaths are probably the worst possible metric to berate Boeing on - including every death (e.g. hijackings, not just engineering failures) their popular 747 line has had comes to <6,000 fatalities despite carrying billions of passengers over a period of >50 years.
Despite their ever increasing incompetence on delivery speed, test compliance, and innovation... commercial air travel with Boeing (and other major air manufacturers) has always been one of, if not the, safest mechanisms of travel we've ever executed on. Particularly the last 5 years have been the safest period in terms of air travel deaths or injuries.
None of that means we shouldn't criticize Boeing by any means, just that doing it over perceived death and accident counts because of what news headlines imply is complete nonsense in terms of actual numbers no matter how you slice it. It's important those kinds of things are reported but it's equally important to not get swept up in paranoia over it.
Agreed, my 737 fears were relieved by researching how many of them are in the air at any moment, how many millions of trips they fly each year, how old airframes can get before they get retired, etc. Even the "worse" models are feats of engineering.