
Boeing 737 MAX crash and the rejection of ridiculous data - VBprogrammer
https://philip.greenspun.com/blog/2019/04/08/boeing-737-max-crash-and-the-rejection-of-ridiculous-data/
======
Ataraxic
It should be noted that on the doomed Air France 447 flight, the plane
activated the stall warning because of a high angle of attack that was leading
to a stall. (thanks pdx for the corrected info)

At some point the system rejected the data and stopped the stall warning
because the angle of attack was so severe that it considered the data
erroneous. This is speculated to have kept the co-pilot pulling back on the
stick and maintaining the stall, because every time he let the nose of the
plane come down, the stall warning activated again as the AoA decreased back
into the range the airplane considered a real signal.
[https://www.vanityfair.com/news/business/2014/10/air-france-flight-447-crash](https://www.vanityfair.com/news/business/2014/10/air-france-flight-447-crash)
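A toy sketch (in Python, with made-up thresholds; the real Airbus logic is far
more involved) of how such a plausibility filter produces exactly this
inverted behavior, with the warning going silent in a deep stall and returning
as the pilot recovers:

```python
STALL_AOA = 10.0      # warn above this AoA (made-up threshold)
MAX_VALID_AOA = 40.0  # readings above this are treated as sensor error

def stall_warning(aoa_deg):
    """Return True if the stall warning should sound.

    Readings beyond MAX_VALID_AOA are rejected as 'ridiculous',
    which silences the warning precisely when the stall is deepest.
    """
    if aoa_deg > MAX_VALID_AOA:
        return False  # data rejected -> no warning
    return aoa_deg > STALL_AOA

# Deep stall: warning is (counterintuitively) off.
assert stall_warning(45.0) is False
# Pilot lowers the nose, AoA drops back into the 'valid' band:
# the warning comes back on, punishing the correct recovery action.
assert stall_warning(35.0) is True
```

The point is not the thresholds but the shape of the logic: any hard validity
cutoff creates a band where correct pilot action makes the alarm louder.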

A plane's instruments, the actions software takes, and its interactions with
the humans that fly the plane aren't as simple as an if statement.

If anything, fewer and simpler controls or automated systems are easier to
debug and work around than a plane that has an internal calculus of what is
valid data.

Air France 447 crashed because of a stall caused by a severe AoA, which MCAS
might have prevented (if MCAS simply pushes the nose down at high AoA, it
might; I am unsure of the exact implementation). MCAS obviously had a
different impact on the Lion Air and Ethiopian Airlines flights.

Basically, it's complicated.

~~~
ChuckMcM
Your basic point is spot on: it isn't simple.

The challenge, I think, is to keep two things separated: one is the flight
control laws that the system is implementing to keep the plane in the air (to
the best of its ability), and the other is the situational awareness
indicators for the pilots, so that they can tell what the plane is "thinking"
about how it is flying (or not).

The closest analogy I can come up with is the SQL explain command. That
command will generate the complete decision tree for how records are included
in or excluded from a SQL query. The air equivalent might be a display that
shows the flight status and the state of the instruments being used to
determine that status. And then it is up to the pilots (or DBA :-) to figure
out if what is happening is what they think should be happening.

To use this particular example, it is, in my opinion, negligent on Boeing's
part not to include an indicator for every change in the flight control laws.
If MCAS activates to avoid a stall it should always indicate that it is, and
why it is activated. It has been reported that this was an "extra price"
option for the jet, and it is _that_ choice that makes it feel negligent to
me.

Generally, there seem to be enough indications with backups in the cockpit for
the pilot to reliably ascertain what is going on with the aircraft, but what
was missing here, again in my opinion, was the rationale the plane was using
for the flight laws it was implementing.

~~~
chupa-chups
This is not the problem. MCAS in itself sadly is the problem.

They had to solve the issue that the pitch-up moment shouldn't accelerate on
its own as AoA increases, which was (aerodynamically) inevitable without
automated adjustments due to the placement of the engines.

They had to resort to the worst possible kludge: since they weren't even
allowed to add new electrical systems (which would have triggered a
recertification and/or retraining), they resorted to an already existing
system (assisted trimming by autopilot).

The effect of MCAS is only "fixed" by manually adjusting the trim, which
involves moving a jackscrew that holds the horizontal stabilizer at a fixed
angle. Unfortunately, the force required to move this jackscrew increases with
airspeed. There is no easy way out. A bit more background can be found here:
[https://www.satcom.guru/2018/11/stabilizer-trim.html](https://www.satcom.guru/2018/11/stabilizer-trim.html)

Btw, the _extra price_ item was an AoA disagree warning. This is related to
MCAS roughly the way an odometer failure is related to traction control. It
wouldn't have helped if pilots stuck to their memory items and checklists
(which they have to: they recognize runaway trim and have to react
accordingly).

To get a feeling for what it means to have runaway trim (without the
assistance they had to disable in concert with MCAS), please have a look at
this video: [https://vimeo.com/329558134](https://vimeo.com/329558134)

~~~
johnp_
The video is gone (censored?). Does someone have a backup copy?

------
yongjik
With all due respect, this sounds too much like the kind of armchair
quarterbacking that routinely appears on HN when avionics/politics/astronomy
is mentioned, where a lone programmer feels competent enough to criticize an
industry for missing "something obvious".

I mean, this particular change might have saved the particular 737, but I'd
rather hear it from someone who actually knows how 737s fly.

~~~
js2
Philip Greenspun is a highly experienced pilot:

[https://philip.greenspun.com/flying/milestones](https://philip.greenspun.com/flying/milestones)

~~~
fixermark
He doesn't seem to be an experienced owner of a home outdoor thermometer
though.

Or if he is, I need to know where he found one that goes to 452 degrees and
why he even needs that much range. ;)

~~~
squarefoot
The thermometer sensor and circuitry might be perfectly ok, still the 452
number might be a single digit gone bonkers in the panel, so that data being
transmitted (maybe also recorded?) is ok while the readout is plain wrong. A
human acting according data being shown there might react very differently
from someone reading the temperatures log. Sometimes software problems can be
awfully subtle.

------
linuxftw
Do we actually have the raw data from the sensor in the flight recorder, or do
we have the flight computer's account of that sensor's data?

> IF AOA > 15 AND AOA < 25 THEN RUNAWAY_TRIM();

And if the AOA is frozen at 16 due to some fault? What is a loss of signal
interpreted as? Does it use last-known value? 0? 100?

How often does the AOA get sampled? Is there any attempt to smooth the data?
Was the data corrupted (bit errors) during transmission/reception? Perhaps a
flaw in the hardware (or microcode) protocol implementation of the data bus
(or whatever it would be in this case) means it doesn't disregard packets with
parity errors? Or perhaps it only checks for 1 bit of parity and it needs to
check for 3? Or perhaps it's sending int64 and the flight computer is
expecting int32.

It's all so simple to make completely uneducated guesses.
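For what it's worth, a few of those guesses can at least be phrased as cheap
mechanical checks. A sketch in Python (window size and return labels are
invented; a real flight-data system would look nothing like this):

```python
def validate_samples(samples, stuck_window=5):
    """Classify a stream of AoA samples.

    Returns 'no_signal' if any sample is None (loss of signal),
    'stuck' if the last `stuck_window` samples are identical
    (a frozen sensor), else 'ok'. Thresholds are invented.
    """
    if any(s is None for s in samples):
        return "no_signal"
    if len(samples) >= stuck_window and len(set(samples[-stuck_window:])) == 1:
        return "stuck"
    return "ok"

assert validate_samples([15.9, 16.0, None]) == "no_signal"
assert validate_samples([16.0] * 6) == "stuck"         # frozen at 16
assert validate_samples([14.8, 15.1, 15.4, 15.2, 15.0]) == "ok"
```

None of this answers what the system should *do* with a "stuck" or
"no_signal" verdict, which is exactly the hard part the comment is pointing
at.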

When Boeing made such obvious mistakes as "don't re-command full nose down
after reset", which should have been caught by any competent user
acceptance / QA testing, I have to call the entire engineering effort into
question. I cannot assume they did even one thing correctly in this system,
and it really needs an independent review of the entire system hardware and
software implementation.

~~~
dramm
You seem to be arguing multiple sides here.

Philip Greenspun's speculation is a plausible concern; that is all it needs to
be a valuable point. That type of data-limit handling ought to be considered
as the system is looked at. I read his post as talking about a specific
_simple_ example; all the other types of things you mention could also be
looked at. How the AoA sensors failed, any potential issues with signal
handling, and especially what happened with the Lion Air AoA sensor repaired
in Florida need to be investigated; the sensor repair, at least, sure seems to
be. (BTW, data handling in aviation is pretty interesting.)

Greenspun is a fairly unique combination of EE/CS/past MIT lecturer/geek and
experienced pilot, including holding an ATP certificate and flying for
regional airlines.

The system as implemented could not have a "re-command full-nose down after
reset". To start with, there is not really a "reset" for MCAS. The pilot's
only way of fully disabling MCAS is via the STAB TRIM cutout switches, which
are required to handle lots of other problems, e.g. runaway full nose-down
trim. The stabilizer trim system itself is dumb as a rock and has no idea
where the trim should be set if power is restored to it; I suspect the only
sensible behavior of the trim system itself is to make no automatic change. If
MCAS were driving the trim wrong and it were possible to remove the MCAS input
to the trim system, then the pilot should be able to reset the trim to what
they want (with hopefully functioning electric trim switches), or allow the
autopilot to do it (outside of MCAS, the A/P is the other automatic system
that manages the stabilizer trim). The problem is that was not anywhere in
Boeing's plans here: there was no way to separate MCAS going nuts and
commanding extreme trim changes from the actual stabilizer trim system.

However I do agree with your sentiment about (all the other) bad mistakes. In
my view when a seemingly largely self-regulated group goes off and designs
something with so many glaring issues (single AoA sensor source, lack of
documentation and training, not even having a standard AoA disagree alert,
etc.) and other possible issues (rationale of extension of MCAS trim
authority, trim wheel forces needed to crank mechanical trim at stabilizer
trim limits, Boeing slowness in responding to issues etc.) then _everything_
needs to be looked at. My hope is there are very thorough investigations, of
the actual systems, of all the proposed remedies (which separately, I am not
convinced are enough), of Boeing, and of the failure of FAA oversight here.
And I hope that is done as fast as possible, and as slow as really needed.

~~~
toast0
> not even having a standard AoA disagree alert

Sorry to nitpick on one thing here, but I don't know that having an AoA
disagree alert would be that useful. There was already an airspeed disagree
alert, and AoA disagree strongly implies airspeed disagree, because AoA is
involved in the airspeed calculation. Furthermore, knowing an AoA sensor is
broken wouldn't be super useful without a workable procedure to do something
about it; in the Ethiopia crash, the procedure followed was insufficient to
regain control.

~~~
dramm
AoA disagree does not imply airspeed disagree. Indicated airspeed is purely a
pitot- and static-pressure-driven measurement. AoA is there largely as part of
the anti-stall warning system, e.g. driving the stick shaker/pusher. An
important point of having AoA sensors is to provide a stall warning system
that is independent of IAS; you don't want these systems crossed with each
other.

B737 airspeed disagree I believe just requires 5 knots disagreement for more
than 5 seconds between left/right IAS. No AoA involved.

There may be some confusion happening because of the events on the Lion Air
B737 MAX. The maintenance crew tried to address what they believed were
ADIRU/ADR issues in previous flights, including as a possible source of
erratic airspeed indications and airspeed disagree warnings. So there are some
discussions of airspeed disagree in relation to that aircraft. There may be a
lot going on there that we don't know about yet, with, say, a faulty ADIRU/ADR
causing multiple problems. The ADIRU has inputs from the AoA vane but does not
use them in calculating airspeed; it uses them to drive the stick
shakers/pushers and the optional AoA display, and unfortunately in these cases
to also drive MCAS.

Having an AoA disagree indication (and/or the full AoA display option) _might_
have allowed more prompt diagnosis of problems, but only if pilots were aware
of and trained on MCAS. And even then, yes, it's a minor point. I was not
intending to claim that is itself a deep solution here, just that not having a
disagree warning as standard was a bad decision. The entire system, and the
overall approach taken by Boeing, seems tragically, deeply flawed.

~~~
cjbprime
I think the Ethiopian preliminary report says they were in IAS disagree for
the whole flight too. I don't know what caused the IAS disagree. It played a
role in the crash, since its checklist disallowed lowering flaps (which would
have disabled MCAS) or throttling back (which would have avoided surpassing
Vmo, which likely made manual retrim impossible).

~~~
dramm
Ah, now my addled brain finally wakes up, well, maybe. What may be happening
here is that the ADIRU is doing AoA-based static port compensation. It will
normally be a small correction, and I was trying to keep that out of the
discussion because I thought there had been reports of erratic airspeed in
previous Lion Air flights, and that was not squaring with the small changes of
airspeed I was expecting (and now I'm going to reread the Lion Air preliminary
report to check what was reported). But to start with, you only need 5 knots
to trip the IAS disagree, so sure, I'd expect it could trigger that (and also
possibly an altitude disagree). But potentially more interesting: what does
that usually small static compensation do when the ADIRU thinks the AoA is
really out of whack, does it keep applying the "correction" or does it ignore
it? That question circles back to being similar to what Greenspun is asking in
the original article.

------
quanticle
On the other hand, one of the contributing factors to the Three Mile Island
nuclear disaster was that the temperature sensors were showing their maximum
programmed value of 280 degrees Centigrade. The actual core temperature was
far, far higher, but because the engineers designing the reactor never thought
of the meltdown scenario, they programmed the temperature gauge to cut off at
280C, rejecting higher readings as obviously erroneous. This (along with a
water gauge malfunction) led the reactor operators to misdiagnose the fault:
they thought the reactor was overfull with coolant and at risk of
overpressure, when in reality the coolant was draining away.

I wouldn't be so quick to dismiss "obviously incorrect" sensor data.

~~~
fixermark
While this is true, the point still holds that it's worth thinking through the
nature and consequences of "obviously erroneous" sensor data.

If the temperature is past 280, generally speaking the same steps to diminish
temperature can be taken (I'm speaking broadly; this may not actually be true
of nuclear reactors, and if it's not, additional sensors with larger ranges
were definitely warranted, given a discontinuity in the safety and
disaster-mitigation strategies at temps higher than 280).

There is no amount of automatic airplane wing trim that can arrest a 70-degree
AOA. When the sensor's getting 70 degrees, the failsafe operation would have
been to back out of the control calculation and defer to pilots.

~~~
quanticle
I agree. In the case of the temperature sensor, my opinion is that the correct
behavior would have been to show no reading, or an error value. At least that
way the operators would have known that they didn't know the true temperature
inside the reactor. As it stood, they saw a high but constant temperature,
combined with rising pressure. That indicated the water level was rising
inside the reactor, so they opened drain valves to let water out. That was
exactly the wrong thing to do, and it contributed directly to the severity of
the meltdown.

------
oconnor663
> Beyond 25 degrees, therefore, it is either sensor error or the plane is
> stalling/spinning and something more than a slow trim is going to be
> required.

I am not a pilot, and I'm going to take this pilot at his word. But my first
question would be, if we add this additional rule into the system (that
runaway trim turns off above 25° AOA), will any pilot ever need to know about
this rule?

If the answer is "absolutely not, never" then that's all well and good. But if
there's some way-out-there scenario where the plane is wavering between 24°
and 26° AOA, and in that scenario the pilot needs to be aware that the
computer's trim behavior is switching back and forth between two different
laws, then I'd want to ask whether that's presenting pilots with too much
complexity.

There's a rule of thumb in software design, when we're thinking about
designing a complicated system to solve some messy problem. Will users have to
deal with the system getting things wrong? If the answer is pretty much never,
then the system can be very complicated if that means doing a good job. But if
the answer is that the system won't always be right, and that users will have
to step in some of the time, then making the system complicated inevitably
means that users will have to learn all that complexity. I wonder if that
applies here.
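For what it's worth, control systems usually handle the
wavering-around-the-threshold scenario with hysteresis: two thresholds instead
of one, so the mode can't flap. A toy sketch with invented numbers (this
reduces flapping, but it doesn't answer the question of whether pilots now
have one more rule to learn):

```python
class TrimAuthority:
    """Toy hysteresis: disable auto-trim above 26 deg AoA, and don't
    re-enable until AoA falls back below 24 deg. Thresholds invented."""

    DISABLE_ABOVE = 26.0
    ENABLE_BELOW = 24.0

    def __init__(self):
        self.enabled = True

    def update(self, aoa_deg):
        if self.enabled and aoa_deg > self.DISABLE_ABOVE:
            self.enabled = False
        elif not self.enabled and aoa_deg < self.ENABLE_BELOW:
            self.enabled = True
        return self.enabled

t = TrimAuthority()
assert t.update(25.0) is True    # between the bands: state unchanged
assert t.update(27.0) is False   # crossed the upper band: disabled
assert t.update(25.0) is False   # wavering back to 25 does NOT re-enable
assert t.update(23.0) is True    # must drop below 24 to re-enable
```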

EDIT: Ah yeah, top comment right now is about Air France 447. That's a very
good example.

~~~
andbberger
Thanks for sharing your perspective.

------
tbabb
There are several inexcusably egregious errors in the design of the MCAS
system, and this "solution" addresses none of them.

\- Single point of failure: The system makes command decisions based on the
readings from a single sensor. The fact that nobody asked (or was bothered by
the answer to) the question "what happens when that sensor fails?" is
negligence.

\- No re-training of pilots: Pilots were not aware of new ways in which the
plane might take command away from them, and were left in the dark with only
seconds to react to a deadly situation. The decision not to train was a cost-
and marketing-motivated decision that sacrificed safety, to the tune of
hundreds of lives lost.

Slapping heuristics on to condition unreliable data is not a good solution for
life-critical systems. As another commenter pointed out, this is armchair
quarterbacking, and it is not good armchair quarterbacking. This article
should not be here.

~~~
treis
>No re-training of pilots: Pilots were not aware of new ways in which the
plane might take command away from them, and were left in the dark with only
seconds to react to a deadly situation

Both the Lion Air and Ethiopian Airlines flights recovered from the initial
MCAS activation. The mistake was assuming that pilots could reliably recover
from a runaway trim situation. That might have been true when that was a more
common failure mode and pilots had more experience flying with manual trim.

~~~
cjbprime
Agree with you, though as a small correction it now appears (according to
Ethiopian preliminary report) they made it to 7k altitude AGL, not 1k AGL.
Still not so much to work with when you're above the plane's max airspeed and
trimmed all the way down.

------
chupa-chups
This video from "Mentour Pilot" has been deleted (supposedly upon request from
Boeing):

[https://www.youtube.com/watch?v=EzgBft-79U8](https://www.youtube.com/watch?v=EzgBft-79U8)

You can see it here (european version of youtube):
[https://vimeo.com/329558134](https://vimeo.com/329558134)

Or more info here:
[https://news.ycombinator.com/item?id=19627525](https://news.ycombinator.com/item?id=19627525)

~~~
Diederich
Strongly suggest everyone check out the video. It's...kind of shocking.

~~~
cm2187
One thing I don't understand is that if the trim is stuck in a position that
pushes the plane down, shouldn't the high speed (and therefore strong air
flow) help to put the trim back in a neutral position rather than making it
more difficult?

~~~
05
Because of the nose-up elevator.

Here's the explanation in one picture: [1]

Here's the article the picture came from: [2]

[1]
[https://3.bp.blogspot.com/-6c8hfXS8WO4/XKd93voscrI/AAAAAAAAF...](https://3.bp.blogspot.com/-6c8hfXS8WO4/XKd93voscrI/AAAAAAAAFw0/AXPqRNGD3kYefHUUWM4RwxvVsTKuTZNIgCEwYBhgL/s1600/Screen%2BShot%2B2019-04-05%2Bat%2B9.05.25%2BAM.png)

[2] [https://www.satcom.guru/2019/04/stabilizer-trim-loads-and-range.html](https://www.satcom.guru/2019/04/stabilizer-trim-loads-and-range.html)

------
starpilot
Now, imagine you have 5,000 such checks in millions of lines of flight control
system code, many of them interdependent, and you have to fly the aircraft to
test each one. You need to schedule time with the test pilots (who have lives
of their own) and get the data dump from IT post-flight. It's aerospace, so
all of this undergoes review, documentation, and signoff, and it all takes
time. How do you prevent a single check from slipping through? It's not easy.
These software engineers and others in the process screwed up, but it's a
failure of processes, not just forgetting a conditional statement.

~~~
cpplinuxdude
I'm sorry, I'll have to mention that the software can be thoroughly tested in
simulated flights. Funnily enough, I was involved with some virtualisation
software used to test booking systems for airports. If you can virtualise a
booking system, trust me, you can virtualise the on-board flight systems.

~~~
yaur
Is this really true though?

From what I have read it sounds like part of the problem is that manually
adjusting the trim wheel requires more strength than at least some pilots
possess due to the mechanical forces on the plane. I don't think it's
reasonable to expect simulators to replicate those types of forces.

~~~
cjbprime
But they do! See the Vimeo link elsewhere in these comments.

------
chx
I accidentally copy-pasted my one-time login code into the amount field of a
bill payment, and my e-bank didn't refuse the sixty-seven million dollar
payment (I caught it before sending). Needless to say, I don't have and never
had even one percent of that money.

My brother one time actually succeeded in wiring ten million euros instead of
ten million forints. The exchange ratio is 1:320. Obviously the account didn't
have 10M EUR on it.

I just entered 300 USD as a courier tip into a food ordering app I was trying
for the first time, because I expected it would just do 3.00 for me. I caught
it before sending.

All of these are lacking this kind of defensive validation. I recognize the
irony of not even knowing the terminus technicus for it, despite being a
senior sw developer with decades of experience.

~~~
bambax
I sell stuff on Amazon European marketplaces. I sometimes change the price;
you have to do that on each marketplace (there are 5: UK FR DE ES IT) in the
interface for sellers.

The decimal separator for UK/English is the dot; in all other marketplaces
it's the comma. If you put a dot where a comma is expected it's silently
ignored (so for example EUR 9.99 becomes EUR 999)... but "slowly": the dot you
type appears normally, and then after some periodic ajax validation, it's
removed.

I now know this and am careful about it, but the first few times I had my
items priced at 100x their normal price, until customers emailed me to ask if
this was normal.

It would maybe make sense to issue a warning if the new price differed from
the old price by two orders of magnitude. But there's no warning.
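Such a check is a few lines of code. A sketch of the order-of-magnitude rule
described above (the 100x threshold and function shape are my invention, not
anything Amazon actually does):

```python
def price_change_warning(old_price, new_price, max_ratio=100.0):
    """Return True if the new price differs from the old by roughly two
    orders of magnitude or more -- the kind of silent 9.99 -> 999 jump
    the decimal-separator bug produces. Threshold is a guess."""
    if old_price <= 0 or new_price <= 0:
        return True  # nonsensical prices always warrant a warning
    ratio = max(new_price / old_price, old_price / new_price)
    return ratio >= max_ratio

assert price_change_warning(9.99, 999.0) is True    # dropped separator
assert price_change_warning(9.99, 10.99) is False   # ordinary change
```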

~~~
noir_lord
I inherited a system that let you put AAA in the product price field and would
persist it.

Validation is hard to get right so they simply didn’t try, not even common
sense.

I’ve spent years thinking about the optimal way to structure validation and
I’ve never come up with an approach I really like on an architectural level so
I do the usual approach of simple sanity checks on the client and proper
validation against business logic on the server but even that isn’t as
isolated as I’d like.

------
btrettel
When working on my PhD in mechanical engineering, I created fairly large data
compilations and I would include some assertions like this. I'd make them
assertions because I considered data that had these faults to be not worth
adding, so if the data wasn't a typo then I'd just cut it.

It can be hard to think of good tests for data like this. Sometimes my
assertions were wrong, e.g., I thought R^2 couldn't go below 0. That's
actually false.

One check I've found useful for spotting typos is verifying that a column
which should be incrementing is in fact incrementing. E.g., if a table is
ordered by the pressure tested at, then obviously there's a typo if the
pressure is not increasing.
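That monotonicity check can be sketched in a few lines (the function name and
return convention are mine):

```python
def check_monotonic(column, strict=True):
    """Return the indices where a supposedly increasing column stops
    increasing -- a cheap way to catch transcription typos in a table
    sorted by, e.g., test pressure."""
    bad = []
    for i in range(1, len(column)):
        ok = column[i] > column[i - 1] if strict else column[i] >= column[i - 1]
        if not ok:
            bad.append(i)
    return bad

# A clean column passes; a typo (12.0 entered as 0.12) is flagged:
assert check_monotonic([0.5, 1.0, 1.2, 15.0]) == []
assert check_monotonic([0.5, 1.0, 1.2, 0.12, 15.0]) == [3]
```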

------
devy
According to Captain Chesley Sullenberger (the former Air Force pilot who
saved US Airways Flight 1549, dubbed the Miracle on the Hudson), the existing
avionics are not designed to take advantage of what humans do best[1]:

    Humans are much better suited for the doer's role, with
    technologies monitoring us, instead of what we are
    currently doing: assigning the human component exactly
    the opposite role.

[1] [https://www.fastcompany.com/video/the-future-of-aviation-according-to-captain-sully-sullenberger/Zigmp07v](https://www.fastcompany.com/video/the-future-of-aviation-according-to-captain-sully-sullenberger/Zigmp07v)

------
aidenn0
If the leaked information so far is at all reliable, it seems that the MCAS
system was obviously inappropriate for FAA certification; a single AoA sensor
cannot be used as a critical component of a DAL B system, full stop.

The suggested change keeps the system from actively crashing the airplane in
ordinary situations, but does not account for the increased load on a pilot
who must manually trim down under the very conditions in which a properly
operating MCAS would be needed. If MCAS is necessary to prevent crashes due to
pilot overload, then this 10-character change is insufficient.

Sure the suggested change makes the cure no longer be worse than the disease,
but doesn't address the fact that the cure was needed in the first place.

Another way of putting it is that the crash could have also been avoided by
using this pseudo code instead of what is suggested:

    
    
        IF FALSE THEN RUNAWAY_TRIM();
    

but the MCAS system wasn't added by Boeing because they wanted to spend more
money on the aircraft, so it (like the fix suggested in the article) seems
likely to be insufficient from the point of view of passenger safety.

------
Xixi
If you read about Air France Flight 447, at some point instruments data went
so ridiculously far from expected values that the stall warning stopped,
because the computer did exactly what is suggested here: rejecting ridiculous
data. So trying to fight the stall would... trigger the stall warning, adding
confusion and cognitive load to an already dire situation (IIRC it is very
likely, though we will never know, that the pilots didn't quite understand
until the very end that the plane had stalled).

The "solution" described in the article seems to add more complexity and
surprise to a system (MCAS) that already behaves in surprising and unexpected
ways. Engineering is hard; adding a band-aid on top of an ill-devised kludge,
without carefully thinking through, writing down, and peer reviewing all the
possible consequences, is unlikely to help.

~~~
prmph
There is a difference between the AF 447 situation and what is recommended in
the article.

What is suggested is that the aircraft refrain from performing risky
manoeuvres based on such wildly out-of-range sensor outputs.

This does not mean that the aircraft should suppress warnings about such
sensor readings from the pilots.

------
FartyMcFarter
Maybe someone can illuminate this.

This comment seems to say that Boeing has recently implemented a bunch of
fixes to prevent MCAS from activating in various scenarios, which allegedly
make MCAS safety a non-issue:

[https://philip.greenspun.com/blog/2019/04/08/boeing-737-max-crash-and-the-rejection-of-ridiculous-data/#comment-325105](https://philip.greenspun.com/blog/2019/04/08/boeing-737-max-crash-and-the-rejection-of-ridiculous-data/#comment-325105)

I'm not knowledgeable on planes, but some parts of this make me wonder if
these fixes will cause the opposite problem.

For example if the two sensors disagree then MCAS now won't activate. How long
until a crash happens because of that new behaviour? Either MCAS is necessary
safety equipment, in which case this sounds dangerous, or it isn't, in which
case why bother?

~~~
salawat
You are correct. The rather humorous aspect is that they really can't stick to
the "no retraining" claim, or really to the claim that the plane is airworthy,
because the plane is only certifiably airworthy when MCAS is operating.

So in short, yes: their fix is introducing problems elsewhere. This
characteristic is what has many thinking there was a series of grievous
process, business, and regulatory errors layered on top of the known poor
engineering decisions that came together to set the stage for these
catastrophes.

------
theoh
The problem with this is the idea that this kind of software would ideally be
written or specified with "if" statements.

I'm not sure about the argument that AOA values of more than 25° constitute an
error, but a more plausible design would be a module that monitors the signal
from the AOA sensor and classifies it into various advisory categories. That
module could get complex internally but still have a clean interface,
providing data of the form (AOA, Advice), where "Advice" would be an
indication of the module's conclusion or recommended action.

Rejecting ridiculous data is potentially a very subtle ML problem. To suggest
that it should be taken care of in such a simplistic, ad hoc way really
doesn't do justice to the problem.

------
CodeSheikh
I refuse to take any US domestic flights on a 737 MAX until it is
decommissioned and the engines are replaced with the ones that were intended
for its body.

Boeing is not going to fly its execs on domestic flights for a certain number
of months to regain consumer confidence.

------
area51
This needs more than an IF statement. It needs calculation of the rate of
change of the variable, in addition to observing the value of the variable.
Rate-of-change calculations are at least an order of magnitude harder than
just reading the variable, and doing it for a large number of variables in
embedded, memory-constrained, real-time systems is hard. So this wasn't going
to be fixed via a trivial 10 characters of code. But yes, a more complex
system would have avoided it.
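A minimal sketch of such a rate-of-change plausibility check (the 20 deg/s
limit and the sample period are invented; a real system would derive the
limit from the aircraft's dynamics):

```python
def rate_of_change_plausible(prev_aoa, curr_aoa, dt_s, max_rate_deg_s=20.0):
    """Reject a new AoA sample whose implied rate of change is
    physically implausible. All numbers here are invented."""
    rate = abs(curr_aoa - prev_aoa) / dt_s
    return rate <= max_rate_deg_s

# A jump from 5 deg to 70 deg in one 0.1 s sample period implies
# 650 deg/s -- clearly a sensor fault, not a maneuver.
assert rate_of_change_plausible(5.0, 70.0, 0.1) is False
assert rate_of_change_plausible(5.0, 5.5, 0.1) is True
```

Even this toy version shows the added burden the comment describes: you now
need state (the previous sample), timing, and a justified limit for every
monitored variable.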

------
vitalus
The sensor reading could just as easily have been 24 degrees rather than 70
and caused the same crash with the author's proposal in place.

Not dismissing or excusing the failures that led to this crash, but it seems
like an oversimplification of the needed solution to suggest that ridiculous
data be thrown out, with the benefit of hindsight being used to determine
where the threshold of "ridiculous" is.

------
kgilpin
Assuming that there are many software subsystems that need angle of attack
input, it shouldn't be the responsibility of every one of those systems to try
and determine if they are receiving bad input from the AOA indicator. Rather,
there should be one angle of attack (AOA) sensing software system which feeds
AOA data into all the other dependent systems. If the AOA sensing software
system cannot determine a reliable value, then it should feed a value of "I
don't know" to the downstream subsystems.

Then, all possible expertise about how to determine if the AOA input is valid
(and I'm sure there are many, many, such factors, redundant physical sensors
being just one) can be directed to that one AOA sensing system.

If MCAS gets an input of "I don't know" from the AOA sensing system, then
clearly it is going to disable itself. So, a complex decision has been turned
into a very simple one.

If you agree with me, then this is a really important example of why
separation of concerns is so important.
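A sketch of that separation, assuming (hypothetically) two redundant vanes and
a simple disagreement limit; the sensor names, the 5-degree limit, and the
averaging are all invented for illustration:

```python
from typing import Optional

def fused_aoa(left_vane: Optional[float],
              right_vane: Optional[float],
              max_disagree_deg: float = 5.0) -> Optional[float]:
    """One place that owns AoA validity. Downstream systems (MCAS
    included) receive either a trusted value or None ("I don't know")."""
    if left_vane is None or right_vane is None:
        return None
    if abs(left_vane - right_vane) > max_disagree_deg:
        return None
    return (left_vane + right_vane) / 2.0

def mcas_should_run(aoa: Optional[float]) -> bool:
    """With validity decided upstream, MCAS's gate collapses to a
    trivial check."""
    return aoa is not None

assert fused_aoa(14.0, 15.0) == 14.5
assert fused_aoa(14.0, 70.0) is None       # disagreement: don't guess
assert mcas_should_run(fused_aoa(14.0, 70.0)) is False
```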

------
webreac
I have nothing to comment about this article, but this other one contains very
detailed information:
[https://www.skybrary.aero/index.php/B38M,_en-route_south_east_of_Addis_Ababa_Ethiopia,_2019](https://www.skybrary.aero/index.php/B38M,_en-route_south_east_of_Addis_Ababa_Ethiopia,_2019)

------
laydn
Despite all we've read about this subject, are we really still implying that
this is a software problem?!

Very basic rules of commercial aircraft design were violated during the
development of 737MAX.

No point in listing all the issues here again. Let's just point out one: There
should have been more than 2 AoA sensors feeding MCAS. No excuses.

~~~
linuxftw
> Despite all we've read about this subject, are we really still implying that
> this is a software problem?!

We don't know it's _not_ a software problem (in addition to the lack of
hardware inputs). We don't know that anything in the system is fit for purpose
whatsoever. That's what the obvious omission of basic failure testing tells
us: the system had NO critical oversight.

------
meshko
All the talk about how crappy Boeing engineering here was is bullshit and
speculation and I am surprised PG participates in it. What we can discuss
objectively here is incident response in which Boeing allowed the situation to
continue after the first crash. How did they not run hundreds of hours of
simulations, code reviews etc, etc on the system assumed to be at fault? How
did they not immediately change the safety features associated with MCAS to be
free and mandatory for everyone? Engineering mistakes happen and are hard to
prevent. Business mistakes like this are a sign of terrible culture and a lack
of priorities, and are an existential threat to the company.

~~~
salawat
It's not really speculation. The proof, as they say, is in the impact crater.

The only mystery left is, what is the nature of the paper trail that led to
this catastrophe?

Was there malicious malfeasance? Overt and irresistible pressure to certify at
all costs?

Was it all just a tragic mistake? We don't know. We only know the physical
systems that contributed to the crashes, and some of the motivations that
would have contributed. The technical implementation can be roughly inferred
by any programmer, and it doesn't take a rocket scientist to figure out a ball
was dropped somewhere for a plane development program to fall afoul of such a
foreseeable failure state.

~~~
meshko
"The technical implementation can be roughly inferred by any programmer",
"such a foreseeable failure state"... how many years of experience do you
have?

~~~
salawat
How rude. Here I was thinking we were having a civil discourse over the
Internet. More than born yesterday, less than since the Moon landing.

Regardless, my assessment is based on most juniors I've worked with. By their
third year most seem to have already grasped the need to test for boundary
conditions, and to ensure proper error handling for GIGO failures. Anyone with
a year-plus of experience and at least FizzBuzz levels of understanding can be
handheld toward it with the right nudge. In fact, the less experience they
have, the more eager they tend to be to pick up error handling, since they
haven't yet developed the skill to talk themselves into the "test you don't
need to write" because the result can be inferred from a test at another level
of the system (a coping strategy that starts creeping its way in with
increased familiarity with a complex system).

Any problem grokking the above points is usually solved with an impromptu
exercise and lecture where I have the junior play the part of a computer until
they realize just how much the computer "doesn't know", and has no capacity to
derive from reasoning, unless it's actually coded/implemented to. I've not yet
had a junior who failed to grasp this to some degree (though a recent one is
giving me a run for my money), or who failed to become capable, within a
couple of months, of inferring error states two or three functions away to
test for. Within the year, I can typically point them at an arbitrary code
block and get back a reasonable testing surface.

Which brings me to my next observation, where I think you may be attempting to
make a point:

 _If I run into juniors of 1-3 years experience who need coaching to fully
understand what I explained above, then perhaps the average programmer is not
capable of inferring what I claim._

To which all I can say is, my observation may be skewed, because I'm a bloody
paranoid polyglot of a tester when it comes to safety critical systems. Even
when I was pre-collegiate programming calculators, the more someone else
actually depended on something, the greater the lengths I'd go through to test
things before cutting them loose with anything I was producing for them. The
THERAC-25 postmortem is bedtime reading for me, and I've pushed myself to
understand computer science and software engineering as more than mere
'coding'.

If the argument then, is that I'm an atypical representative of my software
composing brethren, then I'd like to know why in the $deity's name we're not
triple-checking safety critical code at system integration time, seeing as we
can assume this level of inattention to detail by the average programmer,
especially given that the languages these types of systems are implemented in
are typically _not_ the most 'friendly' languages.

This suggests cultural issues, undue pressure to fast-track approval,
disincentive to raise red flags that could impede delivery, or an "over-the-
wall" hyper siloing of expertise/responsibility that lead to the least
experienced in complex system implementation being blindly trusted by those
who had the experience to realize something was horribly wrong.

If the above doesn't assuage any concerns relating to my experience, I'm
afraid not much else will.

~~~
meshko
You think I was rude asking you about your experience. Now think how rude this
unsubstantiated allegation of obvious simplicity of the code in question is to
the person who wrote it -- with the weight of hundreds of lost lives on their
shoulders. These control systems can get arbitrarily complex. We don't know
anything about the hardware this runs on and what it has to interface with. We
don't know the constraints and age of the codebase. Nothing. To assume that
this boils down to a simple if statement is something I would expect from a
recent college graduate, or someone who has only worked at a web startup, not
a person with 5+ years of real world experience building complex systems. I
agree about all the points about testing and business processes. We have
enough evidence to conclude that unforgivable mistakes were made there (and I
point to that in my original comment).

~~~
salawat
>You think I was rude asking you about your experience.

You asked for a number. Who I am doesn't factor in; the experiences and
insights I can bring to the conversation do. And of those I provided more than
enough to get you in the right ballpark, experience-wise.

>We don't know anything about the hardware this runs on and what it has to
interface with.

These are essentially networked embedded microcomputers, likely utilizing
various protocol stacks such as CAN, CAN FD, AFDX® / ARINC 664, ARINC 429,
ARINC 825, Ethernet, and CANopen for networking.

They are likely highly constrained, and must be compliant with DO-178B/C,
which includes a need to verify the software down to the op codes spit out by
the compiler.

The most popular languages for this purpose are known to be C, C++, FORTRAN,
and Ada.

There's this wonderful place called the Internet where Engineers and other
really dedicated people share information about what they use to do things.

Arbitrary complexity is a possibility, but it tends to be bounded by the fact
that humans still need to be able to implement and verify the systems they
make in a reasonable amount of time. That verification, coincidentally, seems
to have skipped a few layers or so, given that we're here talking about this.

The world has very little that can't be found with a little digging, and in
the interests of saving time, we tend to reuse technologies when appropriate
from things like, cars, in other things, like airplanes.

If you can gain a mastery of how to network and program computers in general,
you gain insights into how other physical systems, even though they aren't
Turing machines, interconnect and propagate information and forces.

If you can then understand engineering principles well enough to decompose
complex things into a network of simpler basic parts, and understand how to
employ mathematics to analyze and predict the behavior of those systems, you
can quickly formulate broad guesses about contributory factors to a failure
state, given even small amounts of information.

And if you say all that's impossible to appear in one person, I don't know
what to tell you. I'm not asking you to have faith, I'm asking you to think,
question, imagine, and connect the dots between what information is available
out there.

But hey, what do I know? I'm just a guy who objects to having his credibility
pigeonholed based on some number instead of the content of what is being
communicated.

I apologize if I sound aggravated or hostile, but I do not appreciate it when
something as tightly regulated as aircraft out of the blue starts killing
people, and the reason looks to be a lack of scrutiny/verification, rushed
implementation, intentionally sparse communication, and unethical sales
practices for whatever reason.

There are ways to do things, and there are ways _not_ to do things. I expect a
leader of an industry to at least show a level of effort such that I can
entertain the benefit of the doubt that gross incompetence or greed was not a
factor. I have no such illusions left to me based upon what I've been able to
work out. The cause is somewhere in their culture or business practices, and I
want it ripped out into the light as an example to everyone, everywhere.

I don't care half as much about what happens to the people involved as long as
it is enough to dissuade anyone thinking of doing the same thing from going
down that path.

------
ckastner
I have no practical experience with sensors beyond what I gained from a few
university courses.

In one course, we were re-assembling systems on a weekly basis, and the touch-
screens we used would all return somewhat quirky offsets, so I had to
calibrate the input after each re-assembly because I could never be sure which
module I was working with. I learned to never really trust a sensor.

And this was an undergraduate course. I would have assumed that in safety-
critical applications, in a mature industry, performing every possible check
on a sensor reading would be the obvious thing to do.

Was this particular case just out of the ordinary, or is it really that
uncommon to do that?

------
Luc
There must be all kinds of neat things one could do with access to the raw
sensor data. Perhaps manufacturers should be forced to make it available, so
owners can hook up a laptop and run independent analysis software in real-
time.

I imagine a frozen Angle Of Attack sensor would stand out like a sore thumb to
a neural net that has access to all the sensors. In fact, I imagine it would
return a suspiciously constant value even considered on its own.

This doesn't seem all that difficult, really. Teams at engineering
universities do more challenging stuff for their master's project.
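A stuck-sensor check along those lines doesn't even need a neural net; a toy version is just low-variance detection over a sliding window. The window size, readings, and threshold below are illustrative assumptions, not anything from a real avionics system.

```python
from statistics import pvariance

def looks_stuck(readings, min_variance=1e-4):
    """Flag a sensor whose recent readings barely change: a healthy AoA
    vane jitters in flight, so near-zero variance is suspicious."""
    return pvariance(readings) < min_variance

healthy = [4.9, 5.2, 5.0, 5.3, 4.8, 5.1]   # normal in-flight jitter
frozen  = [5.1, 5.1, 5.1, 5.1, 5.1, 5.1]   # frozen vane, constant output

assert looks_stuck(healthy) is False
assert looks_stuck(frozen) is True
```

Cross-checking that flag against the other sensors (airspeed, pitch, inertial data) is where it would start resembling the kind of project described above.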

------
TYPE_FASTER
I am surprised that an aircraft control system would not have built-in fault
tolerance that recognizes a physically impossible condition.

------
olliej
I mean of course Arizona does get hot enough to melt street signs:
[https://www.accuweather.com/en/weather-news/its-so-hot-in-
ar...](https://www.accuweather.com/en/weather-news/its-so-hot-in-arizona-that-
street-signs-and-mailboxes-are-melting/70002032)

Maybe those were metal ones, and 400+ degrees does happen? I mean, survivor
bias means we won't have met anyone from Arizona who went out in that heat :D

~~~
salawat
Completely off topic, but I hate AccuWeather. The guy who runs it has lobbied
left and right to keep the NWS and NOAA from being able to just deliver
weather products, because they'd cut into the revenue stream he's developed by
charging people for access to public data.

Anyway... End rant.

~~~
olliej
I agree entirely, but we all use adblockers right? :D

------
hhanesand
Why don’t these planes use internal gyroscopes and compare the plane’s
orientation with respect to gravity, rather than relying on external sensors?

------
geggam
How does it feel to enter a world where software is killing people ?

Driven a new car lately ?

~~~
the_mitsuhiko
> How does it feel to enter a world where software is killing people ?

How is this new? Software has been killing people for a long time.

~~~
geggam
Smaller groups, yes it has. Software is in everything now so you have to
actively avoid it.

------
sdinsn
This 'article' is very low quality and is written by a layman. If his main
source of information is one line of Wikipedia, he shouldn't be telling
experienced engineers how to write flight software.

Furthermore, it is incredibly unsafe to reject so-called "ridiculous" data
completely.

