
Boeing changing Max software to use two computers - thomasjudge
https://news.yahoo.com/ap-sources-boeing-changing-max-184231846.html
======
redis_mlc
>“Only one computer was used in the past, because Boeing was able to prove
statistically that its system was reliable, the person said.”

Yeah, I have a real problem with that.

Before a computer fails, you may have statistical reliability. But once
it actually fails, the failure rate is 100% and a backup computer is needed.

Also, external AoA sensors fail all the time (like when a bird hits them), so
the failure probability is very high. So we're not just talking about solid-
state computer reliability here.

What also concerns me is that airliners fly at transonic speeds. How would
the AoA calculations be affected when the airplane is in a slip for some
reason, and the AoA sensors are reading different values? What happens when
MCAS goes berserk at high speeds (i.e. near Mach 1)?

If MCAS wasn't even tested on takeoff, it sure as hell wasn't tested at Mach
0.8.

If the stakes are not clear ... besides the loss of life, an airliner crash
often results in the bankruptcy of the entire carrier.

Source: commercially-rated airplane pilot who's studied high-speed
aerodynamics.

~~~
varjag
As an embedded systems developer, I have developed a mistrust of statistical
reliability figures. These days you quite often come across components and
subassemblies with a stated MTBF of (sometimes tens of) millions of hours. When
you try to clarify with the vendor how they arrive at these figures, there is
nothing but general handwaving about using industry tools and statistical
models.

But sorry, I just can't seriously take it on faith that your product is going
to last 11 centuries at 20C.
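
For a sense of scale, the arithmetic can be sketched as below. This assumes the
constant-failure-rate (exponential) model that a single MTBF number usually
implies; the 10-million-hour figure and the fleet size are illustrative values,
not taken from any datasheet.

```python
import math

HOURS_PER_YEAR = 24 * 365.25  # 8766 hours, averaging in leap years


def mtbf_to_years(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def survival_probability(mtbf_hours: float, service_hours: float) -> float:
    """P(one unit has not failed after service_hours), exponential model."""
    return math.exp(-service_hours / mtbf_hours)


def expected_failures_per_year(mtbf_hours: float, units: int) -> float:
    """Expected failures per year across a fleet running continuously."""
    return units * HOURS_PER_YEAR / mtbf_hours


# Illustrative (assumed) numbers, not vendor data
mtbf = 10_000_000
print(round(mtbf_to_years(mtbf)))                    # years per MTBF
print(expected_failures_per_year(mtbf, units=1000))  # fleet failures per year
```

Read this way, a 10-million-hour MTBF works out to roughly 1141 years, but
under the same model it describes the failure rate across many units during
their useful life (a bit under one failure a year in a fleet of 1000), not how
long any single unit will last.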

~~~
londons_explore
I really wouldn't be surprised if some components did last that long...

Take a regular ceramic resistor. I'd bet that if you kept it in a temperature
controlled sealed box, it would last 2000 years.

I would possibly even reckon it might last 200,000 years.

Ceramics are so, so inert, and so are metals as long as they're away from
oxygen.

~~~
varjag
The 11 century MTBF example was a complete SoM.

------
nemild
From the article:

>“Only one computer was used in the past, because Boeing was able to prove
statistically that its system was reliable, the person said.”

Years back, my father co-wrote a paper on the Space Shuttle Challenger showing
issues with the statistical thinking at NASA:

[http://www.math.montana.edu/shancock/courses/stat401/Dalal_e...](http://www.math.montana.edu/shancock/courses/stat401/Dalal_etal_1989-Challenger.pdf)

I wonder what statistical techniques Boeing used, and how defensible those
techniques were.

~~~
voldacar
"Prove statistically" is such an odd phrase. I wonder if they used fuzzing or
something like that? Even so, that is quite far from a formal software
proof.

~~~
jwilliams
It's relatively common in embedded programming (in my experience). A lot of
real-time programming is around scheduling. You certainly can use formal
proofs there too, but statistical methods would be common at significant
scale.

------
Randor
Many years ago I wrote navigation software for ocean going vessels/ships. We
used double and triple redundancy on many of our sensor types. We generally
used three control computers that would 'vote' before deciding to make vessel
navigation changes. We also always included a "Dead man's switch" that allowed
the bridge crew to take control at any time.

I can't even imagine designing an aerial vehicle autopilot without redundancy.
The stakes are too high...

I would be interested in having a look at the statistical model they used to
prove 'the system was reliable with zero-redundancy'. While designing these
systems for ships the only way we were able to get an error probability near
zero was when we used triple redundancy.

~~~
rkagerer
How did you deal with redundancy "choke points"? eg. What if the component
that tallies computer votes and actuates things based on the results fails
(especially in a way that's hard to detect)?

Were you able in some cases to maintain isolation all the way through from
sensors to actuators and design such that a single failed one (in a worst case
failure mode) could be overcome by the rest?

~~~
fit2rule
(Disclaimer: SIL-4 programmer for safety critical rail transportation
applications)

The way this is done is that 2 of 3 computers need to 'agree' on the final
decision in order for it to be considered the correct one - there isn't a
single point of assessment, but rather a consensus that must be formed from
the results of all 3 computers. Ideal case, all 3 produce the same results.

This has worked successfully for decades. What's changing now, is that those 3
computers are now no longer the same architecture - you'll have a PPC and an
x86 and an ARM-based CPU all attempting to agree on the same data, in order to
prevent systemic failure throughout.
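
The 2-of-3 agreement described above can be sketched roughly as follows. This
is a toy illustration only: the function name and the string commands are
invented, and in a real system each channel would run this comparison itself
on exchanged results rather than rely on one central voter.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote_2oo3(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value at least two of three channels agree on.

    Returns None when all three disagree, which the caller should treat
    as "no valid decision" and fall back to a safe state.
    """
    if len(outputs) != 3:
        raise ValueError("expected exactly three channel outputs")
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


# One faulty channel is outvoted by the other two
print(vote_2oo3(["brake", "brake", "coast"]))
# No majority: no decision is accepted
print(vote_2oo3(["brake", "coast", "hold"]))
```

Exact equality works here because the outputs are discrete commands; analog
values would need a tolerance-based comparison instead.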

~~~
crocal
(Same disclaimer)

> What's changing now, is that those 3 computers are now no longer the same
> architecture

I think / hope this idea will die a well deserved death. Rail systems must
face a 25-year lifetime, and with such a design obsolescence headaches are
multiplied by 3. In addition this creates bugs and integration nightmares.

And on top of all that, we have known for decades that CPUs can be protected
against systemic failures using the vital coded processor technique [1].

Note to the curious: this is one of the most fascinating pieces of software I
have ever encountered. Make the software resilient to any hardware fault
through the power of arithmetic...

[1] [https://www.semanticscholar.org/paper/Vital-software%3A-Form...](https://www.semanticscholar.org/paper/Vital-software%3A-Formal-method-and-coded-processor-Dollé/3a4a1645e672353c49d2c41718fe010c5fa2405b)
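
To give a flavour of "resilience through arithmetic", here is a minimal sketch
of a plain AN code, one of the simplest arithmetic error-detecting codes. It is
not the vital coded processor scheme from [1], which goes well beyond this; the
constant `A` and every function name below are assumptions made for the
example.

```python
A = 251  # arbitrary odd constant > 1, chosen for illustration only


def encode(x: int) -> int:
    """Encode x as a multiple of A."""
    return A * x


def is_valid(code: int) -> bool:
    """A code word is valid only if it is still a multiple of A."""
    return code % A == 0


def decode(code: int) -> int:
    """Decode a code word, refusing corrupted ones."""
    if not is_valid(code):
        raise ValueError("corrupted code word detected")
    return code // A


def add(c1: int, c2: int) -> int:
    """Addition works directly on code words: A*x + A*y == A*(x + y)."""
    return c1 + c2


def flip_bit(code: int, bit: int) -> int:
    """Simulate a hardware fault by flipping one bit of a code word."""
    return code ^ (1 << bit)


total = add(encode(20), encode(22))
print(decode(total))  # 42

# A single bit flip changes the word by +/- 2**k, which an odd A > 1
# never divides, so the corrupted word fails the validity check.
print(any(is_valid(flip_bit(total, k)) for k in range(32)))  # False
```

The computation runs on encoded values throughout, so a fault in the hardware
doing the arithmetic shows up as an invalid result rather than a silently
wrong one.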

~~~
fit2rule
SIL-4 doesn't generally follow the same rules as consumer computing, as I'm
sure you are aware.

There are track-side systems still running on 80386's, and a maintenance
infrastructure in place to keep these systems running for at least a few more
years, before they are replaced with the newer, Pentium-based systems.

Design obsolescence is not built-in for these systems. Long-term ability to
support is more important. Also, burn-in period. Many bullets (SPECTRE, etc.)
have been dodged by relying on older, more proven, more tested technologies ..

~~~
crocal
Design obsolescence is absolutely built-in for railway systems (safety related
or not). It’s part of the fun of the profession to deal with bizarre voltages
that are inherited from the 1930s. For sure it’s not something consumer
electronics care for but that’s precisely the point I am making.

Actually, it’s one of the required extensions to ISO 9002 by IRIS, the quality
framework for rail engineers.

Components are chosen not only for their function but also for their supply
availability in the long run. One of the things we look at is « multisourcing
»: how many different folks can provide this stuff. The more the better.

Going back to this great idea about hardware diversity (that is absolutely not
a general rule in railway): with an architecture requiring 3 different
hardware platforms, you essentially shoot yourself in the foot. Instead of
having 3 suppliers for one component, now you need to find at least 6 for 3
components. And the probability of facing an obsolescence issue has basically
tripled...

And for what? It’s proven useless by science... :b

And as far as dodging bullets, railway is no better than SCADA. Older tech
means older vulnerabilities stay in place, and there are countless people on
HN that will tell you it’s /not/ a good thing. I don’t think we dodged, I
think we were just lucky. There is a reason why now cybersecurity is making
its way inside the last revision of standards.

~~~
fit2rule
> hardware diversity (that is absolutely not a general rule in railway)

I don't know where you're working in the industry, but in my company (THALES)
it's _definitely_ a thing, and we are absolutely working on diversifying the
3-of-3 and 3-of-5 configurations away from Intel.

And maybe we're talking about design obsolescence in different terms, but yeah
.. 30-year-old CPUs are still being shipped to customers, yo. They _WANT_ it,
so.

~~~
crocal
You can design your platform in a number of ways to be relatively independent
of the underlying CPU model, thereby mitigating the risk of supplier lock-in.
All suppliers will try to find a way to achieve such a target.

It’s a different thing than saying you need hardware diversity in a majority
vote system to achieve better safety. That is demonstrably false. For example,
Siemens VCP is /proven/ to be safe and it does not even use a majority vote
(see my previous comment for a reference to the VCP)

I prefer not to involve my employers on HN. What I write here is my opinion
only.

As for your last remark, let’s be careful in assuming what customers want. The
fact that they have to live with obsolete stuff does not necessarily mean they
are super happy about it.

~~~
fit2rule
It's not about the software being independent of the architecture. It's about
using diverse hardware platforms to avoid the situation where an un-detected,
hardware-level bug affects the voting ability of all participants. We've seen,
time and time again, so-called dependable platforms weaken over the years as
more and more issues are uncovered.

Diverse voting node architecture requirements are designed to prevent hardware
bugs from crashing trains, not software bugs.

>Siemens VCP /proven/

.. and yet, it still crashed trains.

>What customers want

It's not obsolete if a customer wants it.

Customers want older CPU platforms because the tooling and industry required
to support them in the field is long-entrenched, and the costs of upgrading to
"newer, sexier CPUs" are not really worth it if the lesser platforms are
capable of doing the job...

~~~
crocal
> .. and yet, it still crashed trains.

Do you have evidence for that claim?

------
qaq
This sounds pretty scary: a major rewrite on a tight schedule under huge
pressure.

------
semerda
“Only one computer was used in the past, because Boeing was able to prove
statistically that its system was reliable, the person said.” — here boss no
need for redundancy coz we write amazing code that’s statistically never going
to fail! What were they thinking?!

No system is 100% reliable. Even with a small error rate, a failure can't be
ruled out over the standard ~20yr service life. Considering the MAX is a
refurbished 737, wouldn’t more precaution be taken?

~~~
alkonaut
The question is whether it fails safely. The design likely considered failure
to not include near-inevitable crashes, but rather some manual intervention.
In that case N failures per X thousand flight hours is nothing unusual. The
change isn’t to get zero failures but to get a system that fails safely.

Compare to a Tesla autopilot: it can work 99% of the time, if at 1% of the
time it slows down and gives up. It can’t crash into oncoming traffic 1% of
the time. The failure mode matters.

------
josemanuel
I don't understand... so the crashes were due to a faulty single sensor. Adding
computer redundancy won't fix a single sensor screw-up. Will they retrofit all
planes with dual sensors?! I really don't want to fly on this airplane. This
whole thing just sounds like PR trying to sell the idea that this plane will
be fit to fly when it will clearly come back into active service with a heap
of flaws...

~~~
charlesism
I know little about Boeing, but I'm sure the amount of money at stake, if they
have a serious mechanical issue, would be staggering. I don't want to fly on a
737 Max. I worry they have too much incentive to conclude every root cause is
software.

~~~
ethbro
I look at it like x86 / x64 processors.

There are _always_ hardware errata. But they're patched and papered over with
firmware.

Essentially, I'm operating a system, and if you can guarantee the system
functions as documented -- I don't really care how the sausage gets made.

What I would care about is if a pilot (in this metaphor, my compiler, I
guess?) doesn't feel comfortable operating the system.

And I think the pilots' unions have been interesting in this. Because they
don't really want to slag their employers (the airlines), but they're willing
to push back hard on Boeing for essentially the same issue.

A lack of transparency about and sufficient training in differences.

~~~
charlesism

        > Essentially, I'm operating a system, and if you can 
        > guarantee the system functions as documented -- I 
        > don't really care how the sausage gets made.
    

If a software fix for a hardware issue works well enough, I'm with you 100%.

The thing about the incentive, though, is that a software fix ticks the "we
did something" box, as long as it improves the issue _to any degree_. That it
does _any good_ at all, combined with the lower cost, can give management (or
ass-covering employees) enough substance to rationalize the decision away.

I doubt anyone at Boeing would willingly risk customer lives. But if they are
working under pressure, and can say "well, we did do _something_ to address
the issue," that can lead to bad decisions.

------
AnssiH
A Seattle Times article from a few days ago seems to contain more detail:
[https://www.seattletimes.com/business/boeing-aerospace/newly...](https://www.seattletimes.com/business/boeing-aerospace/newly-stringent-faa-tests-spur-a-fundamental-software-redesign-of-737-max-flight-controls/)

It includes e.g. a description of FAA bit-flip testing to induce incorrect
system behavior that Boeing intends to solve by using two computers.

~~~
danjayh
"The fault occurs when bits inside the microprocessor are randomly flipped
from 0 to 1 or vice versa. This is a known phenomenon that can happen due to
cosmic rays striking the circuitry."

This is called Single Event Upset. For all those of you that aren't in the
industry, essentially the problem is that any bit inside the flight computer
(RAM, cache, NVM, registers - ANY bit) can change state randomly at ANY time.
It's rare, but when you get into millions of flight hours, it WILL happen. The
software and hardware have to be designed to mitigate problems caused by this
behavior.
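
The "rare, but it WILL happen" point can be sketched with a simple Poisson
model. The upset rate below is a made-up placeholder, not a measured figure
for any flight computer; only the shape of the result matters.

```python
import math


def p_at_least_one_upset(rate_per_hour: float, hours: float) -> float:
    """P(at least one upset in `hours`), assuming independent upsets
    arriving at a constant average rate (Poisson process)."""
    # 1 - exp(-rate * hours), computed via expm1 for accuracy near zero
    return -math.expm1(-rate_per_hour * hours)


ASSUMED_RATE = 1e-7  # hypothetical upsets per flight hour, per computer

for hours in (10, 1e6, 1e8):
    print(f"{hours:>12,.0f} h: {p_at_least_one_upset(ASSUMED_RATE, hours):.6f}")
```

With this assumed rate, a single flight is almost certainly unaffected, while
a fleet accumulating hundreds of millions of hours is almost certain to see at
least one upset.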

~~~
GistNoesis
If this is due to cosmic rays, doesn't flying over the poles make it more
likely that such an event happens?

How are we sure there aren't local bursts of cosmic rays that would suddenly
cause a few of those Single Event Upsets, for example when there is a
high likelihood of seeing northern lights?

How do you test your hardware and software to show that you are indeed
cosmic-ray proof?

~~~
fit2rule
You test the living crap out of it, and not just in the lab on the workbench
but also in operation while online - while the thing is running in operation,
it is also consistently testing itself to ensure that the hardware is
performing as expected.

Online software tests check for cosmic ray bit flips about 1000 times a
second, in addition to whatever hardware mechanisms are in place to detect
this (ECC, etc.). This is a standard module in most SIL-4 applications, where
a 2-of-3 consensus model is being used.

What I don't understand is why Boeing aren't using 2-of-3 computer
architecture in this application - or maybe they are, and the '3 voting units'
are considered to be 'one computer' and they've just added another one to be
sure.

In rail transportation systems, this is taken even further by using 2-of-3
configurations where each computer is a different architecture completely ..

------
ypcx
> Boeing changing Max software to use two computers

Does that mean we will find the MCAS in the 737 MAX 8 MMEL (Master Minimum
Equipment List)? (Used to be here but right now it's not loading:
[http://fsims.faa.gov/wdocs/mmel/b-737-8_rev%200.pdf](http://fsims.faa.gov/wdocs/mmel/b-737-8_rev%200.pdf))

Isn't this the single biggest failure point of the whole affair? Adding a
critical component which can bring the plane down (as it has happened twice)
but not making it redundant?

One can perhaps argue (pardon the roughness of my analogy) that if pilots
aren't trained to not open the door during the flight, and then do open the
door during the flight, then it's a training issue.

But this analogy doesn't apply, because while the door is not motorized and
cannot open by itself, the MCAS is a computer system which can fail (not just
because of a faulty AoA sensor, but e.g. because of a chip failure or a
software bug) and then actuate the trim into a dive where the pilots aren't
able to fix the issue manually anymore due to high wind pressure.

~~~
redis_mlc
How the MMEL (FAA) and MEL (operator model-specific version) works is that
anything NOT required for safe flight is listed in the MEL. Otherwise
everything would have to be 100% working.

So I would not expect MCAS to be listed in the MEL since it's required for
safe flight.

We'll see.

~~~
ypcx
Oh, I've gotten the MMEL definition wrong then (opposite). Thanks for
clarifying!

------
teh_infallible
Keep talking about the software. Make it about the software, so no one knows
the airframe is fundamentally flawed..

~~~
rrss
How do you know the airframe is fundamentally flawed? ("the engines are too
big" is not an answer).

~~~
tus88
Because MCAS is required, unlike every other aerodynamically stable airplane
that came before it.

~~~
danjayh
Airbus does it too. They have several control laws that are used to provide
flight envelope protection. It was the pilot's failure to understand his
plane's software that caused AF447 to crash. Essentially he didn't realize
that the flight control software had changed modes, that it wasn't providing
the normal handling augmentations, and that his inputs were putting the
aircraft into a stall, which caused the crash.

See:
[https://en.wikipedia.org/wiki/Air_France_Flight_447](https://en.wikipedia.org/wiki/Air_France_Flight_447)

[https://aviation.stackexchange.com/questions/62338/why-did-a...](https://aviation.stackexchange.com/questions/62338/why-did-af447-never-return-to-normal-law)

~~~
tus88
Build unstable airframes? Not outside of forward canard jet fighters.

And that pilot crashed a perfectly fine airliner after he lost situational
awareness and then did the one thing you should never do - pull the steering
column back in a mad panic until the plane stalls.

------
inamberclad
People here seem to be missing a few things.

1: The _737 family, including the 737 MAX, still uses manual/hydraulic
controls_, not fly-by-wire. A computer failure isn't supposed to be as
catastrophic as it would be in a fully fly-by-wire system. 737 pilots are
already trained on other automatic system failures, such as a pitch trim motor
runaway.

2: The original design of MCAS had a limited scope - stop pitch instabilities
from causing an uncontrolled pitch-up into a stall in high AoA and high power
regimes, such as just after takeoff. It read from both a G-sensor and the AoA
vane and both sensors had to be reading excessive values for the system to
trigger. This was probably the justification for not requiring higher levels
of reliability.

3: The plane was found to have handling issues in slow flight regimes. In
order to improve the handling, Boeing engineers modified MCAS to be active in
more of the flight regime, removing one of the sensor readings. Now it would
trigger if it detected the plane was near a stall, even in level flight
without a high G loading. It is unclear whether MCAS was recertified within
Boeing for continuous use. Wikipedia states "The FAA did not conduct a safety
analysis on the changes. It had already approved the previous version of MCAS,
and the agency's rules did not require it to take a second look because the
changes did not affect how the plane operated in extreme situations."

4: Pilots could not disable MCAS without disabling electric trim control. In
at least the LionAir case, the pilots did disable the electric trim, but were
unable to re-trim the plane manually against the aerodynamic forces of their
dive. They re-enabled the electric trim to re-trim the aircraft, and MCAS re-
triggered and put them back in the same situation. Remember, _these aircraft
use manual controls_. The pilots need to put a serious amount of force into
the control column when the plane is out of trim and at a high airspeed. This
greatly diminishes their ability to do anything else.

5: Both accident aircraft were missing an _optional safety feature_ - the AoA
disagree warning. Both aircraft experienced a failure of one of their AoA
vanes. I don't recall whether these failures happened in flight or before the
flight, but such a warning would have likely stopped the pilots from taking
off.

6: The single computer issue was not uncovered until simulator testing after
the crashes. It is not directly related to the MCAS problems.

Wikipedia is your friend, as usual.
[https://en.wikipedia.org/wiki/Maneuvering_Characteristics_Au...](https://en.wikipedia.org/wiki/Maneuvering_Characteristics_Augmentation_System)

Here's even more detailed information:
[http://www.b737.org.uk/mcas.htm](http://www.b737.org.uk/mcas.htm)
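
Points 2 and 3 above can be sketched as two trigger conditions. This only
illustrates the logic as described in this comment: the function names and
both thresholds are invented for the example and are not the real MCAS
activation values.

```python
AOA_LIMIT_DEG = 12.0  # hypothetical angle-of-attack threshold
G_LIMIT = 1.3         # hypothetical load-factor threshold


def original_trigger(aoa_deg: float, g_load: float) -> bool:
    """Point 2: both the AoA vane and the G-sensor must read excessive."""
    return aoa_deg > AOA_LIMIT_DEG and g_load > G_LIMIT


def revised_trigger(aoa_deg: float) -> bool:
    """Point 3: the G-sensor condition is removed, AoA alone decides."""
    return aoa_deg > AOA_LIMIT_DEG


# A failed vane reporting 40 degrees in ordinary 1 g flight
faulty_aoa, normal_g = 40.0, 1.0
print(original_trigger(faulty_aoa, normal_g))  # False
print(revised_trigger(faulty_aoa))             # True
```

With two independent conditions, one bad sensor cannot trigger the system by
itself; once a condition is dropped, the remaining sensor becomes a single
point of failure.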

~~~
_ph_
Thank you, this is a very good summary of the issue. Only a small comment to
4: it was the Ethiopian Air flight, where they disabled MCAS but did not
manage to regain control, mostly due to their air speed being too high. Lion
Air had 2 MCAS incidents, the first on the day before the crash, where the
pilots disabled electric trim control and continued the flight without an
incident and the second, where they did not disable MCAS at all, leading to
the crash.

Interestingly, Airbus as a consequence of the events reviewed their flight
software and also found an issue they are quickly correcting, but it seems
that no planes are grounded as a consequence.

~~~
wikibob
First I’ve heard of the Airbus issue. Can you provide a reference? I couldn’t
find anything

~~~
_ph_
Here is something: [https://samchui.com/2019/07/18/easa-warns-of-airbus-a321neo-...](https://samchui.com/2019/07/18/easa-warns-of-airbus-a321neo-control-anomaly/)

------
tracer4201
I’ve lost all faith in Boeing. Boeing’s greed and cutting corners to avoid
pilot recertification is a textbook example of ethical failure.

I have second thoughts about many of the companies I’ve invested in over the
years, but Boeing right now takes the cake. I’ve dumped my shares. Hopefully
there are some criminal investigations at some point if our leaders in
Washington could stop dancing to the tune of whichever company throws them a
bone like the dogs they really are.

~~~
skunkpocalypse
> I’ve lost all faith in Boeing.

Guess what? They don't care because it doesn't matter.

You really can't avoid their airplanes without avoiding air travel.

~~~
tempguy9999
Is Boeing really the only producer of aeroplanes?

My guess is there are others. My guess is that orders will be cancelled and
there'll be a shift towards Airbus (oh! there's one!)

I'm tempted to apply my first ever downvote to this apparently witless
comment, but perhaps you meant something more subtle?

~~~
_ph_
Orders can't just be shifted to Airbus, as their production lines are fully
loaded for years to come. Also, there is little desire to make Airbus a
monopolist. The problem basically started when Boeing was merged with
McDonnell Douglas, as this created the duopoly which now has the market
trapped. And of course it creates a huge amount of problems for all the
airlines which ordered and planned for new planes which have the same
certification as their existing 737.

~~~
JohnJamesRambo
It feels like we see this over and over when you trace corporate problems
back: loss of competition and development of monopolistic conditions. There’s
a reason we developed antitrust laws, and to watch them be eroded in the modern
era is quite worrisome.

------
wayanon
Will anyone appear in court for this episode in a year or two?

------
alkonaut
>“Only one computer was used in the past, because Boeing was able to prove
statistically that its system was reliable, the person said.”

I read that as ”statically” (thought it meant formally) and was really
impressed by how advanced that sounded! Then I read it again and saw
_statistically_. Less impressive.

~~~
Tomte
The question of statistical reliability is also an American-European (or at
least -German) divide: if you look at machinery safety, the old German
standard 13849 is all about architecture. Two channels, two channels with
diagnostics, etc., yes, also a bit of MTTF numbers, but architectural concerns
are central.

The American way is using statistics to argue that something is not dangerous
in a meaningful way (for example several orders of magnitude less likely than
being hit by lightning). There is even a white paper by ABB arguing that safety
controllers should be allowed to be purely one-channel architectures, because
reliability is so good. Most people in the field would take that as humorous.

And so the "modern" standard 61508 is a bit of a hybrid. There are still
architectural elements (hardware fault tolerance numbers etc), but front and
center is also statistics, namely the SIL levels.

------
penglish1
And what happens when the 2 computers disagree? I thought 3 was standard
practice for this sort of thing.

~~~
tzs
You need 3 for things that if they stop working the plane crashes.

This is more of a "if it stops working the pilots have to fly more
conservatively until it is fixed" thing. It apparently is generally considered
acceptable for such things to not have triple redundancy, as long as a failure
will be detected so the pilots can deal with it. Two computers is good enough
for that.
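
The "detect, then hand over to the pilots" idea can be sketched as a
two-channel comparator. Everything here is an assumption made for
illustration (the names, the tolerance, and the choice to average agreeing
outputs); it is not a description of Boeing's design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompareResult:
    engaged: bool             # False means the function has disengaged
    command: Optional[float]  # agreed command, or None when disengaged


def compare_channels(a: float, b: float, tolerance: float) -> CompareResult:
    """Accept a command only when both channels agree within tolerance.

    Two channels cannot tell which one is wrong, so on disagreement the
    function drops out and the crew is expected to be alerted.
    """
    if abs(a - b) <= tolerance:
        return CompareResult(engaged=True, command=(a + b) / 2)
    return CompareResult(engaged=False, command=None)


print(compare_channels(2.0, 2.1, tolerance=0.5))
print(compare_channels(2.0, 9.0, tolerance=0.5))
```

This gives fault detection but not fault tolerance: the function is lost on
the first disagreement, which is acceptable only if flying without it is safe.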

------
ilaksh
The language in the article seems to conflate hardware and software changes.
If they are introducing a new computer, that is a hardware change in addition
to software.

~~~
Scoundreller
I think they're running the software in a second (existing) computer.

Which still doesn't make sense from a redundancy perspective: What do you do
when 2 systems disagree? Which one is right?

I guess you could shut it down, but it would be better to have an odd number
of systems and take a majority result while displaying an error light.

~~~
jatgoodwin
From what I understand the flight computers can tell the pilots something is
wrong and/or the pilot and co-pilot can have their instruments from different
sensors like airspeed so they can compare and figure out which one is right.

~~~
Scoundreller
In isolation, yes, but hopefully it's not the 4th issue on top of 3 issues the
pilots are already trying to figure out.

------
meerita
I'm surprised they outsource some parts of their software to third parties.
Software is the most important part of a plane; how crazy do you have to be to
do that?

~~~
maest
> Software is it the most important part of a plane

I thought it was the wings.

~~~
meerita
Software controls everything.

~~~
dTal
As depicted in this classic Far Side cartoon:

[http://i.imgur.com/hAjOWmV.jpg](http://i.imgur.com/hAjOWmV.jpg)

------
nurblieh
Anyone who has watched "Minority Report" or "Neon Genesis Evangelion" knows
you need 3 computers.

------
doggydogs94
Hopefully, 99% of the new code was already written, not something they just
whipped up and are slamming into production at the 11th hour.

------
bfrog
This seems like a pretty significant change, something that won't take a few
months but possibly years of work to do?

------
ggm
Has Boeing now lost enough money, that had they not tried 737 certification
they might be financially ahead?

~~~
Aloha
Without certification as a 737, the airplane wouldn't have much of a reason to
exist.

------
pankajdoharey
Whoa, that's standard even in Tesla cars. Every day it becomes more apparent
that Boeing is a disaster engineering project.

