
E-Stop and Fuel, software that keeps you awake at night - janvdberg
https://jacquesmattheij.com/e-stop-and-fuel
======
otakucode
Rather than the Therac-25 debacle, I would recommend looking into the
Toyota 'unintended acceleration' case. And the legal fallout from that.
Because it is terrifying. Toyota was essentially as grossly negligent as it is
possible to be. And the result? The court said there existed no standards they
could be held legally liable for violating. So your self driving car? It will
be developed by junior developers hired as cheaply as possible, driven like
slaves by business-oriented managers who only care about meeting schedules,
not given the tools or information needed to do an effective job, with testing
time cut short and any claimed 'industry standards' for safe coding ignored.
The automotive industry had 90+ coding practices they list as either
'required' or 'recommended'. Toyota followed 4 of those in their code. And the
court said this was OK. Do you think Toyota spent tens or hundreds of millions
of dollars rebuilding their entire development infrastructure, hiring more
competent software engineers, firing the business managers who got people
killed by rushing an unsafe product to market, and putting the engineers in
charge of all future decisions regarding scheduling and release? No, of course
not. If anything, they probably saw it as carte blanche to make things worse.

~~~
elcritch
Wasn’t the cause of the “unintended acceleration issue” hardware and driver
error related rather than software related? That’s what I’ve read in one book
about Toyota management (the Toyota Way) and also a Wikipedia entry on the
topic says:

> Toyota also claimed that no defects existed and that the electronic control
> systems within the vehicles were unable to fail in a way that would result
> in an acceleration surge. More investigations were made but were
> unsuccessful in finding any defect until April 2008, when it was discovered
> that the driver side trim on a 2004 Toyota Sienna could come loose and
> prevent the accelerator pedal from returning to its fully closed
> position.[4]

Based on those two sources it seems the issue was hardware related, and Toyota
may have tried papering over the mat issue. The faulty mat design issue
doesn’t support your claim of shoddy software practices and hiring underpaid
junior developers. That may still be the case but it appears not to have
caused the SA issue.

~~~
otakucode
There was a hardware issue: Toyota told their engineers one of the boards
would have ECC RAM, but went cheap and used non-error-correcting RAM. But that
wasn't the primary issue. After the court case ended, the code was subjected
to static analysis by various researchers and a litany of problems was
immediately found: race conditions, uninitialized values, lack of fail-safe
structure, spaghetti code, etc. I read a much better summary article a while
back that talked about automotive industry standards for embedded design and
the working environment at Toyota (where developers had no static analysis
tools, and didn't even have access to a bug tracker), but this article covers
some of the points. The software was definitely at fault in many of the cases
that killed 38 people:

[http://www.sddt.com/Commentary/article.cfm?SourceCode=201311...](http://www.sddt.com/Commentary/article.cfm?SourceCode=20131104tbc&Commentary_ID=140)

~~~
elcritch
That article makes a clearer picture of the issues you were hitting upon.
Especially the later software audits. Perhaps you could add it to the
Wikipedia page?

------
snerbles
For modern industrial applications, the safety circuit is often (edit- see
child comment's note on safety PLCs) managed by discrete safety relay hardware
such as the AB GuardMaster or Pilz PNOZ. There's a good chance these weren't
even available at the time of OP's application!

A common configuration involves emergency stops, guard doors, light curtains,
etc. being wired in a pair of loops with the relay. The relay continuously
monitors both loops (usually with a phased pulse train), and any interruption
or crossover will trip the unit. Only when the loop states return to nominal
will the relay permit a reset to re-enable the outputs.

The safety relay's outputs are generally connected to dumb hardware interlocks
on the various dangerous bits of the machine.
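
To make the trip and reset behaviour described above concrete, here is a minimal Python sketch of the logic. It is a toy model under stated assumptions, not how a GuardMaster or PNOZ actually works internally: the class name, the method names, and the idea of reporting a crossover as a plain boolean (real units infer it from the phased pulse train) are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DualChannelSafetyRelay:
    """Toy model of a dual-channel safety relay (hypothetical, not vendor logic)."""

    outputs_enabled: bool = False
    tripped: bool = True  # start in the safe state; a reset is required

    def scan(self, loop_a_ok: bool, loop_b_ok: bool, crossover: bool = False) -> None:
        # Any open loop, or a short between the two channels, trips the relay
        # and drops the outputs. The trip latches until an explicit reset.
        if not (loop_a_ok and loop_b_ok) or crossover:
            self.tripped = True
            self.outputs_enabled = False

    def reset(self, loop_a_ok: bool, loop_b_ok: bool, crossover: bool = False) -> bool:
        # A reset is only honoured once both loops are back to nominal.
        if loop_a_ok and loop_b_ok and not crossover:
            self.tripped = False
            self.outputs_enabled = True
        return self.outputs_enabled


relay = DualChannelSafetyRelay()
relay.reset(True, True)    # both loops closed: outputs come on
relay.scan(True, False)    # a guard door opens one channel: relay trips
relay.reset(True, False)   # reset is refused while loop B is still open
print(relay.outputs_enabled)
```

The point of the sketch is the latch: closing the door again does not re-enable anything by itself, someone has to request a reset with both loops healthy.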

~~~
MarkSweep
As a programmer, I like this approach to human safety for robots. By putting
electrical interlocks on the doors that expose humans to the robot you can
make it impossible for a software error to hurt a human.

For some applications where you need to have humans working in the same area
as the robot, things get a lot harder. You probably need some software involved
in enforcing speed limits for robots. The compliance engineers I've talked to
say getting safety certification for software is quite arduous. In this case
the off-the-shelf solutions the parent and child comments mention become valuable.

~~~
_trampeltier
Just slowing down machines is not enough. I'm still stunned by how much the
operators trust things like E-Stops and safety doors. They open the doors,
don't even look, and go into the machine with their hands. Most people have no
idea what's behind an E-Stop or a safety door these days. Also, how much
paperwork each time... Sadly the paper part grows very fast and does not really
help at all. Just worthless paper.

~~~
bsder
> I'm still stunned by how much the operators trust things like E-Stops and
> safety doors. They open the doors, don't even look, and go into the machine
> with their hands.

Wow. Where I am we sometimes do this, but never without real thought. And,
generally, we revisit that part of the process again and again over time.

One example: initial setup of a CNC machine part. If you don't have
$50K-$100K, you are setting this up by hand. You are moving the cutting head
while your hand is in the same space. If you screw up, it probably won't rip
your hand off, but you will likely wind up with a solid, painful gash and it
might break a small bone if you get really unlucky.

People don't respect servos enough. They're remarkably powerful and probably
more so than you expect.

~~~
froindt
> initial setup of a CNC machine part. If you don't have $50K-$100K, you are
> setting this up by hand.

Curious what specifically you need your hand inside for? Do you simply mean
the machine is on (though entirely inactive) while putting the part in,
touching off the part, slipping the business card in and out, or something else
entirely?

~~~
_trampeltier
It's not just about CNC machines. All automated machines have to have E-Stops
and so on. Mostly operators open the door because something is mixed up inside
(too many bottles, broken bottles, broken paper, a broken sensor, whatever).

Of course, the more often you build the same machine, the better you can work
out the details. But sometimes there is just "one" machine. Then you usually
have to work out the flaws first.

A friend from school cleaned a CNC mill while the machine was running. The
safety door had been tampered with and the 2D table drove over his hand.

------
gonzo
When I was the CTO & VP of Engineering for Wayport (public, mostly hotel
Internet) we designed an Ethernet switch that could use Home PNA or Ethernet
PHYs. (Later adapted to also offer VDSL to an in-room modem.)

We also designed our own 802.11 access points.

All of our competitors had at least one fire. In a hotel with hundreds of
people asleep. It didn’t matter if they used commercial gear or not. Every one
of them had a fire.

We never had one, but I was obsessed with not hurting anyone because we had
missed something.

And yes, it kept me up at night.

~~~
sleepychu
I'm not following, what caused the fire?

~~~
DiabloD3
I'm not him, but probably either shoddy PSUs and/or there were lithium
batteries involved. It's the curse of "modern" equipment.

------
cube00
> needed to be re-written because the source code could not be produced

Is that a diplomatic way of saying someone lost the source code?

~~~
asterius
They might not have paid for a source code licence. Or they did, but they
never made sure they had a copy and just left it with the developer. It is
surprisingly common for companies to get a big binder of paperwork, an
installer disk, and consider it done.

------
rodrigocoelho
Can relate. I 'stayed awake at night' when I had to write a program to
calculate salaries.

~~~
alex_hitchins
Ha! I once wrote a program to calculate commission for the sales people at our
company. I remember the director telling me he would love my numbers to be
true, but thought it best I had another look. Fat fingered decimals could have
resulted in some expensive commissions!

------
hawktheslayer
This was an enjoyable read. While I don't often lose sleep over my code (it's
my kids that cause that), I do often find that my mind is working on solving
coding puzzles in my sleep as I will frequently wake up with a spark of
insight.

------
kemonocode
My own personal "staying awake at night" case is an application that connected
to an ancient version of Banner (A big kludge of an ERP system for
universities) to handle new applicants' enrollment process and billing. I'm
rather skittish when directly handling money, doubly so when the code was all
written in Perl (Which I had to learn in order to re-implement in far more
readable PHP, the lesser of two evils, natch) and extremely poorly documented.
In retrospect, I should not have accepted a job like that.

------
fnord77
> a fuel estimation program for a small cargo airline
> ...floating point computations

Hello, rounding errors. Oh hell no.

~~~
TylerE
Please. Don't kneejerk. Any floating point errors will be multiple orders of
magnitude smaller than the accuracy of either the fuel gauges or the
pumps.
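
A rough way to check that claim is to add up a few fuel figures in floating point and compare against exact rational arithmetic. This is only a sketch: the per-leg figures and the 1% gauge accuracy are assumed numbers picked for illustration, not anything from the article.

```python
from fractions import Fraction

# Assumed per-leg fuel burns in kg, kept as decimal strings so the exact
# values are known. Some of them are not exactly representable in binary.
legs_kg = ["1234.5", "876.25", "2410.1", "95.7"]

float_total = sum(float(s) for s in legs_kg)     # ordinary double arithmetic
exact_total = sum(Fraction(s) for s in legs_kg)  # exact rational arithmetic

# Total error of the float result, covering both representation and summation.
rounding_error = abs(Fraction(float_total) - exact_total)

# Assumed gauge accuracy of 1% of the total (hypothetical figure).
gauge_tolerance_kg = 0.01 * float_total

print(f"float total:     {float_total} kg")
print(f"rounding error:  {float(rounding_error):.3e} kg")
print(f"gauge tolerance: {gauge_tolerance_kg:.3f} kg")
```

This only covers plain accumulation. A formula that subtracts nearly equal large numbers can lose far more precision, so it says nothing about whether a particular fuel program was numerically sound.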

~~~
DiabloD3
Did you know in the US it's essentially against the law to use floating point
math on currency? The situation is more complex than my simplification here;
the complexity is such that you'd need a lawyer to explain it to you to
really appreciate it.

You can call it a knee-jerk reaction, but the law itself clearly has that
bias, and for good reason.

All regulations were paved in blood.

~~~
_0ffh
Not just the US, IIRC financial institutions around the world are required to
do decimal rounding. OTOH I don't see why you would even use floating point
for money. Just scale your integers to the required precision (and make sure
you've got enough bits for the kind of amounts you need to be able to handle)
and you're good.
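
A minimal sketch of the scaled-integer idea, in Python, with amounts held as whole cents. The helper names `to_cents` and `format_cents` are made up for this example, and two decimal places is an assumption; pick the scale your currency and rules require. Python integers are arbitrary precision, so the "enough bits" caveat matters mainly in fixed-width languages, where a 64-bit integer is the usual choice.

```python
def to_cents(amount: str) -> int:
    """Parse a decimal string such as '12.34' into integer cents, without floats."""
    sign = -1 if amount.startswith("-") else 1
    whole, _, frac = amount.lstrip("+-").partition(".")
    if len(frac) > 2:
        raise ValueError("more than two decimal places")
    return sign * (int(whole or "0") * 100 + int(frac.ljust(2, "0")))


def format_cents(cents: int) -> str:
    """Render integer cents back as a decimal string."""
    sign = "-" if cents < 0 else ""
    whole, frac = divmod(abs(cents), 100)
    return f"{sign}{whole}.{frac:02d}"


# Binary floating point cannot represent 0.1 or 0.2 exactly...
print(0.1 + 0.2 == 0.3)
# ...while the same sum in integer cents is exact.
print(to_cents("0.10") + to_cents("0.20") == to_cents("0.30"))
print(format_cents(to_cents("19.99") * 3))
```

Rounding only comes into play where you divide (tax, interest, splitting a bill), and there you choose the rounding rule explicitly instead of inheriting whatever binary floating point happens to do.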

------
joezydeco
Just a plug for the RISKS Digest, now in its 32nd year of operation:

[http://catless.ncl.ac.uk/Risks/](http://catless.ncl.ac.uk/Risks/)

------
knodi123
Geez, and I thought I was stressed after I wrote software that would automate
computing someone's pay bonus based on efficiency metrics.

------
w_t_payne
So how can we approach safety in a systematic manner?

Clearly 'blame' isn't an appropriate response. It has to involve tooling.

~~~
HeyLaughingBoy
Well before tooling is considered, it has to involve people and process. At
the highest level, you must have a culture of "blame the process, not the
people" or people will do what is natural when things go wrong: try to cover
it up and avoid being blamed.

There are procedures in various safety-conscious industries for handling this
kind of development. I like that you used the word "systemic" because it is
literally a _systems_ issue, not a software, or electronics, or mechanical
issue. The entire system has to be considered and analyzed for potential
faults.

I spent over a decade writing code for medical devices and while the software
aspect of these systems was the most advanced in terms of development process
(unlike what many on HN seem to think :-), everything we did had to be
considered from a system perspective because even if the individual parts were
designed properly, it was possible for the interactions between them to cause
problems.

~~~
w_t_payne
I strongly agree with your emphasis on systems.

------
fireismyflag
Thank you for sharing this. I enjoyed imagining scenarios so different from
the ones that rob MY sleep at night.

