
How to prevent software bugs from costing lives, as with Boeing's or Tesla's? - leeu1911
As an experienced software engineer, I'm sure people make mistakes, as to err
is human. Working at dahmakan.com, a Southeast Asian food delivery startup,
the worst that can happen when our software has a problem is that someone
doesn't get a meal on time or a meal is wasted.

But for software at Boeing or Tesla, the consequences when errors happen are
far more critical, as we have seen.

I would love to learn about your suggestions and experience with preventing
these costly mistakes from happening.
======
cbanek
Overall, I think designing a system that prevents people from being harmed is
very hard, and it has to be a design-level concern that is in everyone's mind
from the start and every day through development.

Even your example "the worst can happen when our software has problem is
someone won't get a meal on time or a meal is wasted." isn't really true. What
if you ordered fish, or oysters, and they were left out too long and caused
some kind of food poisoning (just as an example).

There are many levels of thinking about this problem. Maybe you can have a
sticker on the package that reacts to temperature to let someone know the meal
isn't safe, etc. You still have to train the user to know what it is, and when
it is safe.

So in this simple example, you have software, hardware, redundancy, and user
training that all have to happen. Same for things like cars or planes. You're
really trying to build a safety critical system, and many times (such as the
Boeing example), it isn't just software or hardware that causes the problems,
but issues arise at the intersection of both.

For Boeing, it would be lack of user training, lack of good UX, possibly
airframe design issues that made the aircraft more prone to stall, hardware
issues with the angle of attack sensors, lack of enough redundancy in the
angle of attack sensors to operate properly, etc.
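
To make the redundancy point concrete, here's a toy sketch (Python, purely
illustrative; not how any real flight control system is written) of why three
sensors plus a voter behave very differently from a single sensor:

    # Hypothetical 2-out-of-3 voting sketch, not any vendor's actual logic.
    def vote(readings, tolerance=2.5):
        """Return the median of three sensor readings, or None if no
        majority of readings agrees within the tolerance."""
        a, b, c = sorted(readings)
        # The median is trustworthy only if it agrees with at least one
        # neighbour; otherwise declare the channel failed.
        if (b - a) <= tolerance or (c - b) <= tolerance:
            return b
        return None  # caller must degrade to a safe mode, not guess

    print(vote([4.9, 5.1, 5.0]))   # -> 5.0 (all sensors healthy)
    print(vote([5.0, 5.1, 21.3]))  # -> 5.1 (one stuck sensor voted out)
    print(vote([1.0, 9.0, 21.3]))  # -> None (no majority, fail safely)

With a single sensor there is nothing to vote against, so a stuck value is
simply believed.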

You can never get to a 0% chance of failure. Most of the time you are just
attacking the highest chances of failure, since when you get down to the level
of faulty parts or mechanical fatigue, things always break.

Of course, each subsystem and integration should have good testing to find all
these things, but it's sadly less of a science and more of an art IMHO. And I
used to work on rocket software.

Many times, the answers are simpler than you think. Simplicity usually
means better operation than trying to overcomplicate error handling. Sometimes
you just need to change the whole way you are thinking about the problem.

~~~
leeu1911
Thank you for the very interesting thoughts. I learned something new.

You were right with the counter-example; we do have redundancy in place that
is 'hardware' - meal-box stickers and production QA.

------
bjourne
Follow engineering best practices. One thing you should never do is write a
new system and require it to _perfectly_ emulate an old one. It can never
work; there will always be unexpected deviations in system behavior. That is,
in Boeing's case it isn't so much about any particular flaw (a faulty AoA
sensor, etc.); it is about the whole idea of having a completely new aircraft
design appear to the pilot as if it were the old one. It is similar to
replacing an Active Directory deployment with OpenLDAP and betting on users
not noticing.

Another engineering best practice is to keep proper logs. The Toyota recalls
from 2009 to 2011 were likely caused by software bugs, but the root cause was
never found, ostensibly because not enough data was being logged.
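
To make that concrete, here is a minimal sketch of what I mean by proper logs
(using Python's standard logging module; the controller and its limit are
made up): record the inputs and the decision, not just that something
happened.

    import logging

    # Structured, timestamped events with enough context to reconstruct
    # the decision the system made after an incident.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    log = logging.getLogger("throttle_controller")

    def apply_command(sensor_value, command):
        log.info("input sensor_value=%.2f command=%s", sensor_value, command)
        if sensor_value > 95.0:
            # Log the reason for the override, not just the override itself.
            log.warning("override: sensor_value=%.2f exceeds limit, ignoring %s",
                        sensor_value, command)
            return "hold"
        log.info("applied command=%s", command)
        return command

    apply_command(42.0, "accelerate")
    apply_command(99.7, "accelerate")

With logs like these, a post-incident analysis can replay exactly which
readings led to which commands.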

~~~
hhs
Interesting point on keeping good logs. Do you know of any best practices? I
wonder if there are any reference texts that articulate rules on writing logs
for coders/software developers in training?

------
Nokinside
Tesla and Boeing cases are somewhat different.

Driving using computer vision is heuristic by nature. That's a completely new
can of worms. Boeing's case is a more traditional design error.

For cars, SAE Level 3 automation (the expectation that the human driver will
respond appropriately to a request to intervene) is a dangerous fool's
errand. Either humans drive the car with light assistance, or the automation
must take care of everything without a human fallback. Unless the human is
constantly driving, their response time drops and they can't react and take
the wheel when a fallback situation occurs. The middle ground of SAE Levels 2
and 3 is inherently dangerous because of human cognition and psychology.

Human-automation interaction is the critical issue that connects the two
cases.

------
beatgammit
Here's a different way of looking at it. How many die because of faulty
software vs faulty humans?

Yes, all software is going to have bugs and bugs in critical software can cost
real lives, but I think we focus too much on the negatives and ignore all of
the lives that we've saved because of modern technology. People seem to prefer
explainable patterns over random ones, even if the random ones are less
common. For some reason, "the pilot must have been overworked" is more
acceptable than "an unlikely condition wasn't tested for and the software got
into an invalid path", which can look random from the outside.

My point here is that, while software failing is terrible and we should do
everything we can to prevent it, we need to recognize that it's often a net
benefit. As for practical ways to prevent it, here are a few thoughts:

- formal proofs of correctness
- extensive tests, both automated and manual (see the sketch after this list)
- frozen compilers
- limited scope; the less code there is, the easier it is to make it reliable
- high-quality hardware (unexpected bit flips are just as deadly as a
  software bug)
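
As a concrete sketch of the "extensive tests" bullet (the clamp helper and
its limits are made up for illustration), even a small table-driven test that
hits the boundary conditions goes a long way:

    import unittest

    def clamp_command(value, low=0.0, high=100.0):
        """Hypothetical helper: clamp an actuator command to its safe range."""
        return max(low, min(high, value))

    class ClampCommandTests(unittest.TestCase):
        def test_boundaries_and_typical_values(self):
            # Table-driven cases: cover the boundaries, not just happy paths.
            cases = [
                (-1.0, 0.0),
                (0.0, 0.0),
                (50.0, 50.0),
                (100.0, 100.0),
                (100.1, 100.0),
                (float("inf"), 100.0),
            ]
            for value, expected in cases:
                self.assertEqual(clamp_command(value), expected)

    if __name__ == "__main__":
        unittest.main()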

I don't write critical software like this, but I do read about it, such as
NASA's design guidelines. However, we have to accept that there will be errors
when going into a critical project, and do everything we can to prevent them.

------
imhoguy
Check this as an introduction: [https://en.m.wikipedia.org/wiki/Safety-critical_system](https://en.m.wikipedia.org/wiki/Safety-critical_system)

In general I think the direction is to cover any software development with
formal proofs to detect any possibility of an unexpected system state.
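
Full proofs need tools like TLA+ or SPIN, but even a brute-force reachability
check over a small state machine catches surprising states. A toy sketch of
the idea (the door-controller model is made up):

    from collections import deque

    # Exhaustively explore a tiny door-controller state machine and check
    # the safety property "the door is never open while moving". Real
    # systems use model checkers for this; the idea is the same.
    EVENTS = ["start", "stop", "open", "close"]

    def step(state, event):
        moving, door_open = state
        if event == "start" and not door_open:
            return (True, door_open)
        if event == "stop":
            return (False, door_open)
        if event == "open" and not moving:
            return (moving, True)
        if event == "close":
            return (moving, False)
        return state  # event has no effect in this state

    def check():
        initial = (False, False)
        seen, queue = {initial}, deque([initial])
        while queue:
            state = queue.popleft()
            assert not (state[0] and state[1]), f"unsafe state: {state}"
            for event in EVENTS:
                nxt = step(state, event)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        print(f"explored {len(seen)} states, safety property holds")

    check()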

------
matt_s
A simple approach would be to not automate things that don't need automating.
I think as developers we get a little too far ahead of ourselves.

The failure on the MAX was related to computers moving control surfaces
(flaps, rudder, etc.) on the plane automatically based on sensor readings.
Questions should be asked about what problem that was solving and whether it
was a requirement (i.e. a complaint) from pilots. Why was that feature
necessary?

~~~
leeu1911
I read that it was due to the new engine design, which is more likely to lead
to a 'stall' situation; hence they put in the 'anti-stall' system.

------
hackermailman
For minimizing disaster, the standard grad text for this type of thing is
_Requirements Engineering: From System Goals to UML Models to Software
Specifications_ by Axel van Lamsweerde. There you learn about creating models
and risk analysis, fault tolerance modeling[1], privacy requirements etc., all
established methods based on engineering foundations but applied to software
modeling and development.

If you mean software development for mission-critical things that control
movement, like aircraft, drones, factory robots, etc., I would assume those
engineers use verified compilers/toolchains like the CompCert project to
implement the models they have already formally analyzed
([http://symbolaris.com/course/fcps17.html](http://symbolaris.com/course/fcps17.html)),
but I've never done mission-critical work, just dabbled in it to apply its
methods to non-critical software.

[1]
[https://arxiv.org/pdf/1611.02273.pdf](https://arxiv.org/pdf/1611.02273.pdf)
- Application-layer Fault-Tolerance Protocols

------
vkaku
Not to mention Uber!

Anyway, there are a few things anyone can do against human stupidity.

The most important thing is to have a decent backbone as an engineer. Do not
take shortcuts in safety even if your stupid or arrogant boss tells you to do
so.

------
finnthehuman
Process process process.

Start with researching formal "quality management systems." At the very least
that'll introduce you to using FMEAs, external standards, rigorous design
reviews and testing, quality gates between development and production (if
they're not a PITA, they're not working) and traceability for everything in
production.

There are off-the-shelf learning materials for all of it.

If you do go down that road, hire someone with QMS experience to design your
process and hand-hold your team through the transition. Otherwise you're
likely to over-complicate it for a less effective result.

------
Glawen
Safety critical SW is done by following a strict and thorough process. It
always begins with safety experts and systems engineers who will identify and
classify the risks and determine what will be done to minimize the risk. The
software engineer only comes in later to implement the SW, which is only a
part of the system's implementation.

Here Boeing misclassified MCAS, giving it a lower risk level than was later
identified, which led to a more relaxed development process. I hope that
light will be shed on what happened at Boeing, because it looks like an
intern implemented MCAS.

------
clnhlzmn
I don't work with safety critical software, but I think the advice in general
is to avoid relying on software alone for safety critical functions.

~~~
gus_massa
Meatware makes errors too. For example,
[https://en.wikipedia.org/wiki/Air_France_Flight_447](https://en.wikipedia.org/wiki/Air_France_Flight_447)
was 50% a software/hardware/design error and 50% a meatware error.

------
hacknat
Having a QA team of full-fledged software engineers who can provide formal
correctness tests is the ideal situation.

They develop a whole other project in isolation from yours, and the only
thing that needs to be agreed upon is the interface(s).
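
A rough sketch of what that can look like (all the names here are made up):
the QA team writes its tests purely against the agreed interface, and the
same suite runs against whatever implementation the product team ships:

    import abc
    import unittest

    # Hypothetical interface agreed between the product and QA teams.
    class PaymentGateway(abc.ABC):
        @abc.abstractmethod
        def charge(self, amount_cents: int) -> bool:
            """Charge the amount; return True on success."""

    # Stand-in implementation; the product team's real one lives elsewhere
    # and only has to satisfy the same interface to run the same tests.
    class FakeGateway(PaymentGateway):
        def charge(self, amount_cents: int) -> bool:
            if amount_cents <= 0:
                raise ValueError("amount must be positive")
            return True

    class GatewayContractTests(unittest.TestCase):
        gateway: PaymentGateway = FakeGateway()

        def test_rejects_non_positive_amounts(self):
            with self.assertRaises(ValueError):
                self.gateway.charge(0)

        def test_charges_positive_amounts(self):
            self.assertTrue(self.gateway.charge(499))

    if __name__ == "__main__":
        unittest.main()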

------
Ultramanoid
Redundancy.

~~~
segmondy
Redundancy is not equivalent to fault tolerance.

