
Software Engineering Lessons from Aviation - riceo100
https://riceo.me/posts/software-engineering-lessons-from-aviation/
======
jefffoster
I think there's a lot to learn from the aviation industry. I did a talk at my
company's internal conference on this (turned into words at
[https://medium.com/ingeniouslysimple/why-dont-planes-
crash-1...](https://medium.com/ingeniouslysimple/why-dont-planes-
crash-14a0579a5e2d)).

For me it's the mindset that differs. Too often as software engineers we find
a bug and just fix it. Aviation goes a step deeper and finds the environment
that created the bug and stops that.

Unfortunately, the recent 737 MAX incidents seem to have changed this. From
what I understand, the reaction to the problems sounds more like what I'd
expect from a software business than from the airline industry!

~~~
jasode
_> as software engineers we find a bug and just fix it. [...] Unfortunately,
the recent 737 MAX incidents seem to have changed this._

I think there's some nuance about MCAS that's lost in all the media reports.
As far as I understand, the MCAS software didn't have a "bug" in the sense we
programmers typically think of. (E.g. Mars Climate Orbiter's software
programmed with incorrect units-of-measure.[0])

Instead, the MCAS _system_ was poorly _designed_ because of financial pressure
to _maintain the fiction of a single 737 type rating_.

In other words, the MCAS _software_ actually did what Boeing managers
_specified_ it to do:

1) Did the software read only _1_ AOA sensor, a single point of failure,
instead of reconciling _2_ sensors? Yes, because that was what Boeing managers
wanted the software to do. It was purposefully _designed_ that way. If the
software had been changed to reconcile 2 sensors, it would then lead to a new
_"AOA DISAGREE"_ indicator[1], which would have raised doubts at the FAA about
whether Boeing could just give pilots a simple iPad training orientation
instead of expensive flight-sim training. Essentially, Boeing managers were
trying to "hack" the FAA criteria for a "single type rating".

2) Did the software make adjustments of an aggressive and unsafe 2.5 degrees
instead of a more gentle and recoverable 0.6 degrees? Yes, because Boeing
_designed_ it that way.

Somebody at Boeing _specified_ the software design to be _"1 sensor and 2.5
degrees"_ and apparently, that's what the programmers wrote.
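The missing cross-check is easy to picture in code. Here's a minimal sketch, with made-up names and a made-up disagreement threshold (nothing here is Boeing's actual logic):

```python
# Illustrative only: a toy cross-check of two redundant AOA sensors.
# The function name and threshold are invented for this sketch.

AOA_DISAGREE_THRESHOLD_DEG = 5.5  # hypothetical disagreement limit

def reconcile_aoa(left_deg, right_deg):
    """Return (aoa, disagree_flag). Trust a reading only when both
    sensors roughly agree; otherwise flag it and withhold automation."""
    if abs(left_deg - right_deg) > AOA_DISAGREE_THRESHOLD_DEG:
        return None, True   # crew sees "AOA DISAGREE"; auto-trim stands down
    return (left_deg + right_deg) / 2.0, False
```

The point is that the single-sensor design has no disagree path at all, so there is nothing for an indicator, or for the automation to stand down on.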

I know we can play with the semantics of "bug" vs. "design" because they
overlap, but to me this seems a clear case of faulty "design". The distinction
between design and bug is important because it tells us which root cause to
fix.

The 737 MAX MCAS software issue isn't like the Mars Climate Orbiter or
Therac-25 software bugs. The lessons from MCO and Therac-25 can't be applied
to Boeing's MCAS because that unwanted behavior happens _in a layer above_ the
programming:

- MCO & Therac-25: design specifications were _correct_; software programming
was _incorrect_

- Boeing 737 MAX MCAS: design specifications were _incorrect_; software
programming was "correct" -- insofar as it matched the (flawed) design
specifications

[0]
[https://en.wikipedia.org/wiki/Mars_Climate_Orbiter#Cause_of_...](https://en.wikipedia.org/wiki/Mars_Climate_Orbiter#Cause_of_failure)

[1] yellow "AOA Disagree" text at the bottom of display:
[https://www.ainonline.com/sites/default/files/styles/ain30_f...](https://www.ainonline.com/sites/default/files/styles/ain30_fullwidth_large_2x/public/uploads/2019/03/aoa_vaneindicator_aoa_disagree_lg2.jpg?itok=vNHrwopw)

~~~
Animats
That's a different issue. Aircraft systems are classified as to degree of
risk. This is from MIL-STD-882C.

- I Catastrophic - Death, and/or system loss, and/or severe environmental
damage.

- II Critical - Severe injury, severe occupational illness, major system
and/or environmental damage.

- III Marginal - Minor injury, and/or minor system damage, and/or
environmental damage.

- IV Negligible - Less than minor injury, or less than minor system or
environmental damage.

Now, face it, most webcrap and phone apps are at level IV. Few people in
computing outside aerospace regularly work on Level I systems. (Except the
self-driving car people, who are working at Level I and need to act like it.)

MCAS started as just an automatic trim system. Those have been around for
decades, and they're usually level III systems. They usually have limited
control authority, and they usually act rather slowly, on purpose. So auto
trim systems don't have the heavy redundancy required of level I and II
systems. Then the trim system got additional functionality, control authority,
and speed to provide the MCAS capability. Now it could cause real trouble.

At that point, the auto trim system had become a level I system. A level I
system requires redundancy in sensors, actuators, electronics, power, and data
paths. Plus much more failure analysis. A full fly-by-wire system or a full
authority engine control system will have all that.

So either MCAS needed to have more limited authority over trim, so it couldn't
cause trim runaway, or it needed the safety features of a Level I system.
Boeing did neither. Parts of the company seem to have thought the system
didn't have as much authority as it did. ("Authority", in this context, means
"how much can you change the setting".)

Management failure.
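To make "limited authority" concrete: a toy sketch, with invented limits (not the real system's numbers), of clamping both how fast and how far an automatic trim function can move things:

```python
# Illustrative authority limiting for an automatic trim function.
# Both limits below are made up for the example.

MAX_TRIM_OFFSET_DEG = 0.6   # total authority: how far auto-trim may deviate
MAX_TRIM_RATE_DEG_S = 0.1   # how fast it may move per control cycle

def limited_trim_command(current_offset_deg, requested_delta_deg):
    """Clamp the requested change to the rate limit, then clamp the
    resulting offset to the total-authority limit."""
    delta = max(-MAX_TRIM_RATE_DEG_S,
                min(MAX_TRIM_RATE_DEG_S, requested_delta_deg))
    new_offset = current_offset_deg + delta
    return max(-MAX_TRIM_OFFSET_DEG, min(MAX_TRIM_OFFSET_DEG, new_offset))
```

With a cap like that, no sensor fault can produce a runaway: the worst a bad input can do is walk the trim to the (recoverable) authority limit, slowly.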

~~~
Gibbon1
> So either MCAS needed to have more limited authority over trim, so it
> couldn't cause trim runaway, or it needed the safety features of a Level I
> system.

There are two other dodgy things going on. One: you can't disable MCAS without
totally disabling the electric trim. Two: the mechanical advantage of the
manual trim isn't sufficient to re-adjust trim once it's too far out, and it
hasn't been _forever_.

------
sn
Checklists and written procedures are very important. One of the earlier
things I did when coming into my company was to create a written procedure for
software upgrades until we had time to automate it with Ansible.

One thing I have not had very good discipline about is using checklists both
for code submitted for review and when I'm doing reviews. Lint checkers etc.
can only go so far.

If anyone has published checklists for code reviews I'd be curious to see
them. This one seems reasonable:
[https://www.liberty.edu/media/1414/%5B6401%5Dcode_review_che...](https://www.liberty.edu/media/1414/%5B6401%5Dcode_review_checklist.pdf)
though I'd add concurrency to the list.
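One way I could keep such a checklist in actual use is to encode it as a tiny script a pre-review hook can run; a minimal sketch (the items are my own examples, not from any published checklist):

```python
# Toy pre-review checklist runner. The items are examples, not a standard.

REVIEW_CHECKLIST = [
    "Tests added or updated for the change",
    "Error paths and edge cases considered",
    "Concurrency: shared state is guarded or absent",
    "No secrets or credentials in the diff",
]

def run_checklist(confirmed_items):
    """Return the items not yet confirmed; an empty list means the
    change is ready to go out for review."""
    return [item for item in REVIEW_CHECKLIST if item not in confirmed_items]
```

Like a pre-takeoff checklist, the value isn't in the cleverness of the items; it's that they get asked every single time.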

------
cjbprime
This was great!

> 1. Don’t kill yourself

> 2. Don’t kill anyone else

Could we reorder these, though? Every once in a while a plane will hit a house
and kill its occupants (and the pilot, usually) and it's so awful. I think not
killing others as a pilot is so much more important than not killing yourself.

~~~
X6S1x6Okd1st
That ordering reminds me of the first rule of search and rescue: don't create
another victim.

If your job is to save a life and that life depends on you, you don't do
anyone any favors by dying.

------
myl
"...plenty of episodes of Mayday/Air Crash Investigation available on Youtube
too. (Be warned though, all doomed flights take off from one of the busiest
airports in the world.)" Great show. The comment is spot on, and don't forget
"investigators were under extreme pressure".

------
shamino
Nathan Marz wrote about this previously, with some unique insights:
[http://nathanmarz.com/blog/how-becoming-a-pilot-made-me-a-
be...](http://nathanmarz.com/blog/how-becoming-a-pilot-made-me-a-better-
programmer.html)

------
billfruit
Though the article isn't about software development in the aviation industry,
a few thoughts on that:

The industry is really slow to change its practices and tools. Take the use of
C for most software: I do feel a safer language ought to be preferred.

Or the use of the 1553 bus for inter-device communication: the bus and
protocol aren't general-purpose, and they're very opinionated/rigid about the
manner in which communication should happen. The hardware parts for it are
also horrendously expensive compared to most Ethernet/IP equipment. There is
an aviation Ethernet standard, but adoption of it has been slow.

~~~
HeyLaughingBoy
_it is very opinionated/rigid about the manner in which communication should
happen_

This could be a strong factor in its popularity. If things _must_ happen in a
certain order, then the behavior of the system becomes easier to verify, and
the value of easy verification in safety-critical systems can't be overstated.

~~~
billfruit
Yet the industry uses largely the C language, which isn't a model for safety
or ease of verification.

~~~
magduf
It is, compared to other languages, because it's simple and deterministic. The
#1 most important thing with avionics systems and software is determinism.
That's why they even disable CPU caches on avionics systems.

------
starpilot
Not killing yourself and using a checklist (like we learned in driver's ed but
apply informally at best) also apply to driving a car.

~~~
bdamm
Uh, no. In a car, if things go badly you pull off the road and work on a
solution. If things go really badly, you have seatbelts, airbags, crumple
zones, and a thick frame to help you out.

In an airplane, if things go badly, you keep flying until you land. If things
go really badly, remember that everything is built to be lightweight, and
unless the crash is well controlled, everything will be destroyed and everyone
will die. If your engine quits, your cabin ruptures, or your instrumentation
fails, you keep flying. And you need instruments: in poor visibility, your own
sensory inputs are in fact faulty, and won't help you figure out which way is
down.

Unlike in a car, where it's pretty obvious where the ground is, for example.

------
marcosdumay
I still hold my opinion that checklists are for hardware issues. One should
not be filling them out for software tasks. Instead, software is automated,
automatically tested, and automatically verified - routine manual checks are
an anti-feature and inversely correlated with quality.
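Concretely, that means turning each would-be checklist item into an executable check that the pipeline runs; a minimal sketch (the check and version strings are invented for illustration):

```python
# Illustrative: a routine release check written as code rather than
# as a checklist item a human ticks off.

def check_version_bumped(old_version, new_version):
    """Automated gate: the pipeline fails by itself instead of relying
    on a human remembering to verify the version was bumped."""
    def parse(version):
        return tuple(int(part) for part in version.split("."))
    return parse(new_version) > parse(old_version)
```

The check runs on every release with zero discipline required, which is exactly the property a paper checklist can't guarantee.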

------
skookumchuck
The article talks about pilot procedures, not engineering procedures specific
to aviation.

------
horacio_colbert
Thinking of aviation reminds me of the impact of doing things right.

