
It’s the spec bugs that kill you - yoav_hollander
http://blog.foretellix.com/2015/07/28/its-the-spec-bugs-that-kill-you/
======
danso
What's disturbing is the mindset in the friendly-fire disaster that the OP
links to:

> Nonetheless, the official said the incident shows that the Air Force and the
> Army have a serious training problem that needs to be corrected. "We need
> to know how our equipment works; when the battery is changed, it defaults to
> his own location," the official said. "We've got to make sure our people
> understand this."

I hope that's not the mentality of the military today, that special forces
operators in the field need to remember UI/UX details of the software they
use...rather than the software developer adding in relatively simple
safeguards, such as confirmation boxes.

It's also a lesson in how software development and engineering can go awry
without close feedback from the audience. There are on-the-field realities --
such as the battery going low -- and even if the developers are aware of them,
they can't easily predict how the intended user will react in those
scenarios...to tragic results, in this case.

~~~
jerf
It's a bit of both... yes, these bugs ought to be fixed in software whenever
possible. However, the military can't wait until everything's perfect to
deploy, and the users of the technology need to be trained on all the sharp
pointy bits in the code. Some of these will be bugs, and some of these will
just be that this is military code and it's built to have sharp pointy bits,
some of which inevitably can be aimed at oneself. As someone else points out,
firing on one's own position is a valid use case in the military.

If you can't wrap your brain around that statement... go thank a soldier.

Again let me emphasize I'm not saying the software is OK as is. It should be
fixed, somehow. But at the same time, the military can't afford to just, say,
stop using the unit entirely until then, and wait until all equivalently
serious bugs are fixed. Not having the software 'cause it's in the shop can be
fatal too.

~~~
yoav_hollander
Right. Cutting-edge military systems are, indeed, risky business, and one has
to balance (as you say) more testing against not having those systems when you
need them.

This puts a high premium on the efficiency of bug-finding, especially spec
bugs of new systems. My intuition is that this could improve by a lot.

------
paulmd
One thing the article touched on that I think is dead on: unexpected
interaction between complex interdependent components is a really common root
cause of failure. Something veils the flaw, and then one day the circumstances
change and it's revealed. I think monolithic systems which undergo incremental
improvement are particularly susceptible to this. Loosely-coupled systems make
it much easier to precisely specify and test the desired functionality.

As an example, consider the Therac-25 incident. It was a radiotherapy machine
designed to operate in two modes - direct exposure to a low-power electron
beam, or firing a high-power beam at a set of targets to produce X-rays. The
predecessors used a loosely-coupled system - when the targets were not in
place the high-power mode was electrically disconnected, or in other words the
exposure system was separate from the control system. This was switched to a
tightly coupled system, where the computer also served as the safety
interlock. A particular sequence of inputs combined with a race condition
could result in the targets (and sensors) being unlocked and removed but the
beam firing at high power. Since the predecessors had hardware interlocks,
this didn't result in any exposure. But they re-used the software going
forward, and this bug surfaced.
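
The difference between the two designs can be sketched in a few lines. This is a toy model only, loosely following the description above and not based on any actual Therac-25 code; `Controller`, `Machine`, and `fire_high_power` are hypothetical names. The point is that the interlock reads the physical state itself instead of trusting the controller's bookkeeping, which may have gone stale after a race.

```python
from dataclasses import dataclass


class InterlockTripped(Exception):
    """Raised when the independent interlock vetoes a high-power shot."""


@dataclass
class Machine:
    # Hypothetical physical state, as an independent sensor would report it.
    target_in_place: bool


@dataclass
class Controller:
    # Software bookkeeping; assumed here to be able to go stale (e.g. after a race).
    believes_target_in_place: bool


def fire_high_power(controller: Controller, machine: Machine) -> bool:
    # Software check: uses only what the controller believes.
    if not controller.believes_target_in_place:
        return False
    # Independent interlock: looks only at the physical state, never at the controller.
    if not machine.target_in_place:
        raise InterlockTripped("high power requested without target in place")
    return True
```

With a stale controller (`believes_target_in_place=True`) and the target physically removed, the software check passes but the interlock still refuses to fire. Remove the second check and the same inputs produce a high-power shot.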

[https://en.wikipedia.org/wiki/Therac-25](https://en.wikipedia.org/wiki/Therac-25)

No direct relation to this incident, of course. The controller should have a
specific warning message and confirmation dialog before firing within a
minimum safe distance. Probably wasn't in the specs though.

~~~
digi_owl
Whenever i hear about confirmation dialogs and warning messages i find myself
reminded of how many times i have seen or heard about people going "next next
next ok" while operating Windows. Throw enough dialogs and such at anyone, and
they will develop the instinct to hit ok without reading.

------
moron4hire
I see at least a couple of posters in this thread are suggesting "there should
be a confirmation dialog".

Stop.

Confirmation dialogs are dead UX. Users have been trained by internet browsers
with pop ups and other poorly designed pieces of software with poorly worded
dialog boxes to completely ignore dialog boxes. I've tested this several
times. Users don't even _consciously register_ that the dialog box ever
appeared. You can watch them, right over their shoulder, and as soon as they
close a dialog, they will turn to you and ask "what do I do now?" You will
ask, "what did the pop-up box say to do?", and they will respond "what pop-up
box?". Walk them through the process again, don't pre-warn them about when the
dialog appears, and they will do it again.

This is also why crapware is so easy to install on any system that has
wizard-based installers.

Do not use dialog boxes. It's better to hire two testing teams and set them
against each other to break the software. Test and test and test again, then
disallow bad behavior. The "plugger" example should not have allowed issuing a
fire command on its own location unless some explicit, completely
out-of-band, one-time-use option enabled it. Actually, better yet, it should
not re-initialize the target location on start-up with the device's location.
Just clear out the target. When the user tries to fire, they will see the lack
of target and probably know "oh, must be because I changed the batteries".
There is a reason they are called "fail-safe"s. You create failure modes that
err on the side of safety. It's better if the device fails to fire than to
fire at the wrong target.
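
A minimal sketch of that fail-safe, assuming a made-up `Designator` model (the class, its methods, and the coordinates are hypothetical, not the real device's interface): a power cycle clears the target, and a fire request with no target fails loudly instead of falling back to the device's own position.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude) in degrees


class NoTargetError(Exception):
    """Raised when a fire request is made with no target entered."""


@dataclass
class Designator:
    """Hypothetical model of a GPS target designator (not the real device)."""
    own_position: Coordinate
    target: Optional[Coordinate] = None

    def set_target(self, target: Coordinate) -> None:
        self.target = target

    def power_cycle(self) -> None:
        # Fail-safe: after a battery change the target is cleared,
        # never defaulted to the device's own position.
        self.target = None

    def fire_request(self) -> Coordinate:
        # Failing to fire is the safe failure mode here.
        if self.target is None:
            raise NoTargetError("no target entered; re-enter coordinates")
        return self.target
```

After `power_cycle()`, `fire_request()` raises `NoTargetError` until the operator enters coordinates again.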

I don't care if you can't afford it. If you cannot afford a large testing
infrastructure, then you can't afford to make safety-critical software at all.
You cannot solve this problem with software.

These situations come about when you have a management infrastructure that
cares more about feature lists than correctness. I can hear them now, "we need
to move faster", and "if it works, don't fix it."

~~~
ajuc
If the dialog never appeared before people will notice it.

If the dialog has OWN POSITION in big red blinking letters they will
understand.

Never firing on your position can cost lives too.

~~~
moron4hire
The point is that the vast majority of people reading this thread aren't
working on weapons systems. They're working on web app software to create ad
space they can sell to leverage their userbase as a source of income. Dialog
boxes are terrible UI.

------
oaktowner
This is certainly an extreme example, but I can't tell you how many bugs I've
seen closed as "Works As Intended" where it may have been what the _spec_
intended but certainly wasn't what the _user_ expected...well, it's way
larger than it should be.

I've seen both engineers and product managers use this as a crutch. And,
frankly, as a PM I can say the blame generally lies with the product manager.
This is our job. It is to understand how this technology will be put in
practice by users and to make sure we are properly distilling and prioritizing
the needs of those users into requirements for the engineers.

~~~
digi_owl
> This is certainly an extreme example, but I can't tell you how many bugs
> I've seen closed as "Works As Intended" where it may have been what the spec
> intended but certainly wasn't what the user expected...well, it's way
> larger than it should be.

Sounds like just about every law or regulation written since the dawn of
civilization.

Not sure if it is a solvable problem. Heck, even nature has crap like this
popping up (a human embryo grows a functional tail at one point, and sometimes
it stays around).

------
MatthewWilkes
> Somebody simply did not consider one specific implication of some top-level
> requirement (“Don’t harm the ground operator”) on some lower-level
> specs.

While I agree with the general point about specifications, we are talking
about a weapon of war here. Interrupting a firing procedure for maintenance
and then not checking the co-ordinates is exactly the kind of careless mistake
I'd expect to get someone killed. At least it defaulted to the location of the
receiver, preventing further errors, rather than a semi-random target.

------
ers35
Can anyone find more information about the friendly fire incident cited in the
article? I found a scan of a newspaper printing:
[https://news.google.com/newspapers?nid=1876&dat=20020324&id=...](https://news.google.com/newspapers?nid=1876&dat=20020324&id=80AfAAAAIBAJ&sjid=atAEAAAAIBAJ&pg=5261,3177516&hl=en)

I did not find anything in the official Washington Post archive:
[http://www.washingtonpost.com/wp-
adv/archives/front.htm](http://www.washingtonpost.com/wp-
adv/archives/front.htm)

This search shows other articles by that author on the Washington Post site
during that time, but not this specific one:
[https://www.google.com/search?q=%22vernon+loeb%22+%22kandaha...](https://www.google.com/search?q=%22vernon+loeb%22+%22kandahar%22+%22gps%22)

~~~
halefx
This is especially confusing because "GPS Navigator Magazine" has the wrong
date on the article.

It looks like this happened in December 2001 and the speculation of it being
caused by the battery change was released in March 2002. Christian Science
Monitor [1] reported on it in December when the cause was unknown. This PDF
[2] has some more information on the Washington Post article.

1:
[http://www.csmonitor.com/2001/1207/p2s1-usmi.html](http://www.csmonitor.com/2001/1207/p2s1-usmi.html)

2: [http://wwwhomes.uni-
bielefeld.de/cgoeker/SysSafe/WiSe%2011-1...](http://wwwhomes.uni-
bielefeld.de/cgoeker/SysSafe/WiSe%2011-12/Cases/GpsFriendlyFireAfghanistan.pdf)

~~~
halefx
Dec 6, 2001. Bomb Kills Three U.S. Soldiers; 20 Are Injured 'Friendly Fire'

[http://pqasb.pqarchiver.com/washingtonpost/doc/409215373.htm...](http://pqasb.pqarchiver.com/washingtonpost/doc/409215373.html)

Feb 2, 2002. U.S. Soldiers Recount Smart Bomb's Blunder

[http://pqasb.pqarchiver.com/washingtonpost/doc/409315969.htm...](http://pqasb.pqarchiver.com/washingtonpost/doc/409315969.html)

Mar 24, 2002. 'Friendly Fire' Deaths Traced to Dead Battery; Taliban Targeted,
but U.S. Forces Killed

[http://pqasb.pqarchiver.com/washingtonpost/doc/409245838.htm...](http://pqasb.pqarchiver.com/washingtonpost/doc/409245838.html)

------
yoav_hollander
To Matthew Wilkes: It is true that in military situations the tradeoffs of
safety vs. utility are quite different.

However, I think this is one of those (many) cases where it is an "operator
error" which a better design could have prevented.

In other words, I think it was a plain bug. Somewhere in the fire() function
there should have been some check for:

    
    
    distance(current_gps_coordinates, target_coordinates) < min_safety_distance
    

but there was not.
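
A minimal runnable sketch of such a check, assuming coordinates are (latitude, longitude) pairs in degrees and using a great-circle (haversine) distance. The helper names `distance_m` and `is_danger_close`, and any threshold passed in, are illustrative assumptions and not values from the actual system.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def distance_m(a, b):
    """Great-circle (haversine) distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def is_danger_close(current_gps_coordinates, target_coordinates, min_safety_distance):
    """True when the target lies within min_safety_distance (metres) of the operator."""
    return distance_m(current_gps_coordinates, target_coordinates) < min_safety_distance
```

The function only reports the condition. What the device does with a `True` result (block, warn, ask) is a separate policy decision.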

~~~
hyperion2010
Thing is, you actually don't want to put that line of code in there, because
sometimes you DO need to fire on yourself. If you build a tool that fails that
case because some idiot hard-coded a rule with no override, not understanding
every single case where their tool might need to be used, no amount
of training can fix the problem. Sometimes you have to assume the user is
smart enough to use your tool. I think this is the right idea, but you
absolutely want to let the user shoot themselves in the foot if they want to,
but you should make sure they know they are going to shoot themselves in the
foot.
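
One way to sketch "allow it, but make sure they know": the close-range case is never silently executed and never hard-blocked; it requires an explicit acknowledgement. The names `Decision`, `decide`, and `danger_close_ack` are hypothetical, and how the acknowledgement is collected from the operator is deliberately left out.

```python
from enum import Enum


class Decision(Enum):
    FIRE = "fire"
    NEEDS_ACK = "needs_acknowledgement"


def decide(distance_to_target_m: float,
           min_safety_distance_m: float,
           danger_close_ack: bool = False) -> Decision:
    """Decide whether a fire request may proceed.

    Assumption: danger_close_ack is set only by a deliberate operator action
    that is specific to this one request.
    """
    if distance_to_target_m >= min_safety_distance_m:
        return Decision.FIRE
    if danger_close_ack:
        # The operator has explicitly accepted firing near their own position.
        return Decision.FIRE
    return Decision.NEEDS_ACK
```

For example, `decide(0, 500)` returns `Decision.NEEDS_ACK`, while `decide(0, 500, danger_close_ack=True)` returns `Decision.FIRE`.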

~~~
yoav_hollander
Absolutely. What I meant was that there should have been some "are you sure"
confirmation in this case.

~~~
boinky
It seems reasonable that this device may not only be used to drop bombs; it could
be used for replenishing munitions or dropping food on target. Point being, a simple
"are you too close" does the complexity a disservice.

------
RachelF
All software has spec bugs. Normally these can be ironed out in extensive
testing, if your software has many users. This sort of military system has few
and probably is not used that much. A testing jig should be designed and used
for this sort of system. However, it will not iron out all the bugs.

------
yoav_hollander
To me, the really-interesting question here is "How can we find such issues in
the design before they hit".

In the article, I explain why this is going to be a bigger problem in the near
future, and speculate that some simulation-based approach could, perhaps, do the trick.
But I am not sure.

Any thoughts about that?
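
To make the simulation idea concrete, here is a toy sketch and not a description of any real tool: model the device as a few events, enumerate every short event sequence, and check each run against the top-level requirement. The event names, the grid positions, `MIN_SAFE`, and the `default_to_own_position` switch are all assumptions made for illustration. For simplicity the model starts with no target.

```python
from itertools import product

EVENTS = ("enter_target", "battery_change", "fire")
OWN = (0, 0)     # hypothetical operator position on a toy grid
ENEMY = (10, 0)  # hypothetical intended target
MIN_SAFE = 2     # hypothetical minimum safe distance in grid units


def run(sequence, default_to_own_position=True):
    """Simulate one event sequence; return the list of positions fired upon."""
    target = None
    shots = []
    for event in sequence:
        if event == "enter_target":
            target = ENEMY
        elif event == "battery_change":
            # The behaviour under test: what the target becomes after a restart.
            target = OWN if default_to_own_position else None
        elif event == "fire" and target is not None:
            shots.append(target)
    return shots


def violates(shots):
    """Top-level requirement: never fire within MIN_SAFE of the operator."""
    return any(abs(x - OWN[0]) + abs(y - OWN[1]) < MIN_SAFE for x, y in shots)


def find_violations(max_len=3, **kwargs):
    """Exhaustively try every event sequence up to max_len events."""
    return [seq
            for n in range(1, max_len + 1)
            for seq in product(EVENTS, repeat=n)
            if violates(run(seq, **kwargs))]
```

With the default-to-own-position behaviour, `find_violations()` includes `("battery_change", "fire")`; with `default_to_own_position=False` it returns an empty list. Real systems have far larger state spaces, so exhaustive enumeration like this would only ever cover a high-level model.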

~~~
seane999
I find it interesting that the OP has had essentially zero feedback on
anything other than confirmation dialog boxes - especially since I don't think
that was part of the OP's speculation.

Having some form of 'interactive spec' where stakeholders can 'play with' the
system to verify intended behaviour is a really interesting idea. Of course it
is used in software development all the time to varying levels of fidelity
(wireframes, mockups, interactive prototypes and so on). But I think here is
the rub... the better the quality of the simulation, the greater the amount of
effort, until it becomes approximately equal to the cost of just doing it.

Maybe with enough automation and tooling...

~~~
yoav_hollander
I think you are absolutely correct. It is the price/performance of creating
these "interactive specs" that will determine whether this idea is any good.

From what I have seen (I looked a bit at what people are doing in UAVs,
autonomous vehicles and so on - see e.g.
[http://blog.foretellix.com/2015/07/03/my-impressions-from-
th...](http://blog.foretellix.com/2015/07/03/my-impressions-from-the-
stuttgart-autonomous-vehicles-test-development-symposium/)), I think a lot can
be done to improve both price and performance.

In other words, I think it is possible to invent new tools / methodologies to
make simulation (especially high-level simulation) easier, and especially to
get a lot more out of it, at all stages of design / verification /
maintenance.

