
Integer Overflow Bug in Boeing 787 Dreamliner - h43k3r
http://www.engadget.com/2015/05/01/boeing-787-dreamliner-software-bug/
======
anderspitman
Might be time to remove working on the 787 from my resume. I feel like the
poor thing has been one disaster after another in the news.

I can't speak to the quality of the A and B level (most critical) code, but
the development process for the C level software I was working on definitely
could have used a lot of improvement. Messy code, tests and documentation were
an afterthought/checkbox item, etc. The incentives were just wrong.

I think there's a ton of room for process innovation in avionics software
development. One thing I wanted to build for a long time was a tool for
tracing. In theory, every industry requirement (DO-178B, etc) was supposed to
trace to a high-level Software Design Document (SDD) requirement, which was
supposed to trace to a Software Requirements Document (SRD) requirement, which
would trace to a code function. We maintained all of that BY HAND. It was a
huge mess. Perfect example of something that could have been an extremely
valuable development tool, but ended up just being a hassle to try and
maintain.

Then of course there's language choice. C is king, which isn't necessarily a
bad thing, but it's certainly not the safest, even in the restricted forms
used in avionics. Sadly, my very first ever project as a programmer was
porting an Ada codebase to C for the 787 (off by one errors for days...). It's
almost cliche to say nowadays, but I would be really excited to see Rust gain
some traction in avionics over the next 20 years or so. Because that's how far
behind avionics is. We were using Visual Studio 6 in 2011!

~~~
icegreentea
I work in the medical device field, and we have a similar process requirement
(traceability from Design Input Requirements -> Software Specification ->
Software Verification Procedure (and implicitly, the actual code function) ->
Software Verification Report). We're currently wrangling with a giant-ass
spreadsheet to keep track, and it totally sucks.

~~~
anderspitman
Ah yes I completely forgot SRD -> TESTS -> Then Code. Maybe that's because we
always did it the other way around...

I'm telling you, there's money to be made building tools for this stuff. I
think a big part of the reason things aren't being improved is that the people
in a position to recognize bad process and tooling maybe aren't the type of
people to see an opportunity to make money solving the problem rather than
putting up with it. I wouldn't describe most of the engineers I knew at
Honeywell as the type to stay up until 2AM every night for 3 months working on
a side project to pitch to their boss.

I think it's really exciting what's happening in healthcare right now though.
The innovative culture is exploding. Ultimately I care much more about what
happens in medicine than avionics, as long as planes aren't falling out of the
sky every 248 days...

~~~
HeyLaughingBoy
There are already tools for this stuff. Problem is that they are all various
forms of crappy and the market is so small that there is little incentive to
improve them. I work in Medical Devices and one of the tools I'm supposed to
use (we find every excuse to avoid it) has a UI like a 2001 Swing app and
while it works, it is insanely painful to use due to its absolutely
counterintuitive interface.

We're actually integrating more and more of our work into Visual Studio since
its tools are excellent. The problem is that the organization needs to
validate any tool before we can use it as a part of Quality Management and
that process can take forever.

~~~
anderspitman
Visual Studio is awesome. I'm really excited to see how Code turns out on
Linux, especially for things like building GUI apps and 3D stuff.

------
AshleysBrain
Windows 98 had a similar bug where the system would hang after 49.7 days:
[https://support.microsoft.com/en-us/kb/216641](https://support.microsoft.com/en-us/kb/216641)

Although IIRC, the impact was limited, because it was quite a feat for a
Windows 98 system to stay up for 49 days :)

~~~
alfiedotwtf
I found 98 respectable. It was 95 that didn't last a whole week!

~~~
Rexxar
"Windows Millennium Edition" was the worst.

~~~
windsurfer
"Malfunction Edition" as it was sometimes known

------
pslam
I recall a long time back, when Linux was configured as standard for 100Hz
ticks (aka "jiffies"), the counter was initialized close to wraparound instead
of 0.

The result was you typically encountered "jiffy wraparound" after a few
minutes of uptime. You learned whether your system was stable in this
situation fairly quickly, rather than 248 (or 497) days later. Kernel
developers typically don't have uptimes measured in days. Starting the counter
close to wraparound increased the likelihood it was going to get code
coverage.
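
For anyone who wants to try this trick in their own firmware, here is a minimal sketch in C. It assumes a 32-bit counter at 100 Hz and a five-minute head start before the wrap; `TICK_HZ`, `WRAP_HEADROOM_SECONDS`, `INITIAL_TICKS` and the function names are all made up for illustration and are not taken from the kernel or from any avionics code.

```c
#include <stdint.h>

#define TICK_HZ 100u                 /* assumed tick rate: 100 ticks per second */
#define WRAP_HEADROOM_SECONDS 300u   /* assumed: wrap about 5 minutes after boot */

/* Unsigned arithmetic is modulo 2^32, so this is 2^32 - 30000. */
#define INITIAL_TICKS ((uint32_t)(0u - WRAP_HEADROOM_SECONDS * TICK_HZ))

/* Hypothetical global tick counter, started just short of wraparound. */
static volatile uint32_t tick_count = INITIAL_TICKS;

/* Would be called from the periodic timer interrupt. */
void tick_interrupt(void)
{
    tick_count++;   /* wraps to 0 after 30000 ticks, i.e. 5 minutes at 100 Hz */
}

uint32_t ticks_now(void)
{
    return tick_count;
}

/* Ticks left before the counter wraps to 0 (returns 0 right at the wrap). */
uint32_t ticks_until_wrap(uint32_t now)
{
    return (uint32_t)(0u - now);
}
```

With these assumed numbers every boot crosses the wrap point after five minutes, so any code that mishandles it fails on the bench instead of months into service.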

~~~
ekimekim
I really love this methodology. If an exceptional case exists, and it's cheap
to cause the exceptional case to occur during standard usage, then do it so
that the code is well-exercised.

------
dwightgunning
I found it curious that the journalist refers to the bug as a "vulnerability".
This could be misinterpreted, given that the term is more commonly used in a
security context.

~~~
GigabyteCoin
>journalist

There's your answer right there. They write stories for a living, not
programs.

~~~
smsm42
Some journalists still care to acquire correct terminology. If somebody
reporting from court calls somebody accused of car theft a "murderer" because
they don't know the difference between the two, they'd probably get laughed
at.

------
Mojah
They're actually not alone, Dell's EqualLogic (a big & expensive storage
array) had the same problem, after 248 days.

They would initiate a controller failover and reboot:
[https://ma.ttias.be/248-days/](https://ma.ttias.be/248-days/)

~~~
baruch
There was a similar issue in the Linux kernel in version 2.6.32 where the
kernel would crash after 208 days:
[http://www.novell.com/support/kb/doc.php?id=7009834](http://www.novell.com/support/kb/doc.php?id=7009834)

This was a serious problem in some storage systems too:
[https://www.ibm.com/developerworks/community/blogs/anthonyv/...](https://www.ibm.com/developerworks/community/blogs/anthonyv/entry/208_day_reboot_bug3?lang=en)

------
ghshephard
I find it, well, interesting to read that the "Fail Safe" mode is to
deactivate all power systems on the plane.

~~~
sitkack
I find it troubling that the generators can't reboot w/o continuing to supply
power, that reboots aren't staged to ensure that the plane continues to have
power and that power from the generators is necessary to control the aircraft.
Isn't there battery backup to enable the plane to continue operating normally
while the generator reboots?

From the article it looks like their whole failsafe/redundant system
architecture is flawed.

Just like buffer bloat, reboot times and corruptible system state are a
chronic systemic flaw in modern technology.

The only stuff that works is the stuff that is used all the time. Look towards
crash only software [1] and microreboots [2,3]

[1]
[https://www.usenix.org/legacy/events/hotos03/tech/full_paper...](https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdf)

[2]
[https://www.usenix.org/legacy/event/osdi04/tech/full_papers/...](https://www.usenix.org/legacy/event/osdi04/tech/full_papers/candea/candea.pdf)

[3]
[http://dslab.epfl.ch/pubs/perfeval.pdf](http://dslab.epfl.ch/pubs/perfeval.pdf)

~~~
VonGuard
The actual reboot takes something around a minute, and the plane is designed
to keep flying while rebooting. It's during take off or landing that this
would be a real problem.

Think Starship Enterprise. You can't turn off the whole ship, or everyone
suffocates. You run diagnostics system by system and keep the warp core online
as much as possible so you don't run out of power.

------
lotsofmangos
Reminds me a little of the floating point precision bug in the Patriot missile
targeting systems, where the longer it was left on, the less accurate it got.

[http://fas.org/spp/starwars/gao/im92026.htm](http://fas.org/spp/starwars/gao/im92026.htm)

------
limaoscarjuliet
I file it under "funny" but certainly it is nothing unusual or surprising.
Software, like anything else, has faults and breaks. Even on an airplane.

It mostly is funny to us, developers, because we have all been trying to
convince our bosses that "it will never happen". That object_id being an int4
sequence? You would need one object a second for 70 years to overflow. And
yet, somehow, it does, e.g. because someone loaded data with object_id set to
1.9B and the sequence followed from there.
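
The arithmetic is easy to check. A rough sketch in C, assuming a signed 32-bit id and a constant insert rate (the function name and the 365.25-day year are my own choices for illustration):

```c
#include <stdint.h>

#define SECONDS_PER_YEAR (365.25 * 86400.0)

/* Years until a signed 32-bit sequence runs out, given the last id already
 * used and an assumed constant rate of new ids per second. */
double years_until_exhausted(int32_t last_used_id, double ids_per_second)
{
    double remaining = (double)INT32_MAX - (double)last_used_id;
    return remaining / ids_per_second / SECONDS_PER_YEAR;
}
```

Starting from zero at one id per second this gives about 68 years. Starting from 1.9B it gives under 8 years, and a bulk load or two shortens that further.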

P.S. My favorite pastime? Watching Aircraft Disasters series in an airport.
Not brave enough to watch it during the flight yet. Karma might be a bitch and
I do not want to test it 10 km up there ;-)

~~~
istvan__
Like everything else? When was the last time the Golden Gate bridge collapsed?
:) Everything else does not include most of the engineering output. In
software faults are more common because of the tools we are using and because
there is no life in danger if Twitter is down. On the other hand, we cannot
allow a bridge to collapse or an airplane to fall down from the sky because it
has a fault. There are several techniques to build reliable systems out of
non-reliable parts.

~~~
ams6110
Bridge collapses are actually not that uncommon. While often times overloading
or damage is the cause, sometimes it's due to design flaws.

[https://en.wikipedia.org/wiki/List_of_bridge_failures](https://en.wikipedia.org/wiki/List_of_bridge_failures)

~~~
istvan__
I think the point is the frequency.

------
userbinator
Presumably all of Boeing's _other_ planes have such counters in their systems,
and don't have this bug (or if they did, it was corrected already), so why
only the 787? That's what I find most surprising.

Edit: one theory that seems plausible is that they were "overly paranoid" and
put in overflow checks, on a time counter whose overflowing would not have had
negative effects otherwise since the other code was designed to handle a
wraparound correctly.

~~~
Jtsummers
Software tends not to be reused between planes unless you go back to the same
vendor and there are no major hardware changes with the component as well.
Aircraft software is kind of a broken world.

~~~
agumonkey
Broken Aircraft software makes me wanna rethink the notion of correctness, or
broaden the scope of failure and function.

~~~
jMyles
Right? If aircraft software is broken, but my linux desktop is supposed to be
the picture of success, I'm not sure the definitions are meaningful. :-)

~~~
agumonkey
And I was serious. We should study this and see why what could be described as
a fault, a bug, etc ... is actually not that meaningful.

~~~
Retra
Those are features!

------
excel2flow
Could abstract interpretation
([http://www.astree.ens.fr/](http://www.astree.ens.fr/)) or some other formal
method have prevented it?

~~~
tlb
Someone wrote something like:

    
    
      int32_t ticks; // 100ths of a second
    

which overflows in 248 days, a particularly unfortunate amount of time because
it doesn't show up during testing.

Although it would be a good engineering choice, a formal verifier would say
that:

    
    
      int64_t ticks; // 100ths of a second
    

is also incorrect, since it also overflows (after roughly 3 × 10^9 years).

In a hard real time system,

    
    
      mpz_t ticks; // 100ths of a second, infinite precision libgmp type
    

is still formally incorrect, since as the number grows in size it will
eventually exceed a time limit or memory (after 10^10^9 years).

The overall lesson from formal methods is that it's impossible to write
formally correct useful programs. So programmers just muddle through.
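
For reference, the overflow times quoted above can be reproduced with a few lines of C. This is only a back-of-the-envelope sketch: the function name is invented, and it assumes the counter starts at zero and counts up at a constant rate.

```c
#define SECONDS_PER_DAY 86400.0

/* Days until a counter with `value_bits` usable bits overflows when it is
 * incremented `ticks_per_second` times per second. Use 31 for int32_t
 * (one bit is the sign), 32 for uint32_t, 63 for int64_t. */
double days_until_overflow(unsigned value_bits, double ticks_per_second)
{
    double max_ticks = 1.0;
    for (unsigned i = 0; i < value_bits; i++) {
        max_ticks *= 2.0;   /* builds 2^value_bits as a double */
    }
    return max_ticks / ticks_per_second / SECONDS_PER_DAY;
}
```

At 100 ticks per second, `days_until_overflow(31, 100.0)` comes to about 248.6 days and the unsigned 32-bit case to about 497 days. A 32-bit millisecond counter, `days_until_overflow(32, 1000.0)`, lasts about 49.7 days, and the 63-bit case works out to a few billion years.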

~~~
mhogomchungu
> int64_t ticks; // 100ths of a second

I would go with uint64_t

as it documents "ticks" as a variable that can not hold negative values and
also doubles its range of positive values.

~~~
speakeron
I would go with

uint64_t ticksOfDuration10ms; // No comment necessary

~~~
cnvogel
The concept of timer "ticks" is well established as a unit of time in embedded
programming. It's almost universally included in your embedded (realtime-)OS
and might increase at any conceivable rate, limited either by hardware
constraints (e.g. a fixed, simple, 16-bit ripple counter that is clocked by
the main CPU clock of 8 MHz will clock at 122.07 Hz) or by your application
requirements (you let a slightly more configurable timer only count to 40000
at half the CPU clock to get exactly 100 Hz). Hence you shouldn't explicitly
inscribe the tick rate in your symbol name, as it can change when requirements
change.

You'll almost always have a global variable, preprocessor define... or
something similar to get the frequency (or time increase per tick), which you
should use whenever you have to convert "ticks" to actual physical units. If
the actual effective tick rate is visible at many places in your code, whether
as a symbol name or as a comment, you are most certainly doing something wrong.
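
As a sketch of what that looks like in C: one definition of the rate, and conversion helpers used everywhere else. `TICK_RATE_HZ`, `ms_to_ticks` and `ticks_to_ms` are hypothetical names, not from any particular RTOS, and the 100 Hz value is just an assumption for the example.

```c
#include <stdint.h>

/* The only place the tick rate appears (assumed value for illustration). */
#define TICK_RATE_HZ 100u

/* Convert milliseconds to ticks, rounding up so a timeout is never shorter
 * than requested. */
static inline uint32_t ms_to_ticks(uint32_t ms)
{
    return (uint32_t)(((uint64_t)ms * TICK_RATE_HZ + 999u) / 1000u);
}

/* Convert ticks to milliseconds; 64-bit result so large counts don't overflow. */
static inline uint64_t ticks_to_ms(uint32_t ticks)
{
    return ((uint64_t)ticks * 1000u) / TICK_RATE_HZ;
}
```

Callers write `ms_to_ticks(250)` instead of a literal 25, so changing `TICK_RATE_HZ` does not silently change every timeout in the code base.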

~~~
speakeron
I think you kind of missed the point of my post (which was a bit tongue-in-
cheek). The original code fragment had the tick duration embedded in a
comment, so changing a global variable which defines it to something other than
10ms is going to cause all sorts of problems in maintaining that code.
(Leading possibly to the very problem Boeing had).

~~~
cnvogel
...well, then my irony-detector is broken ;-).

------
zaroth
This failure mode in particular was deemed _exceedingly unlikely_ by Boeing,
which got them an exception to some initial airworthiness issues with the RAT,
which in turn would have made a total loss of power catastrophic.

~~~
firethief
They can deem things unlikely? That seems broken in general. I would deem it
unlikely they'd ship with any errors they _didn't_ deem unlikely; those are
precisely the failure modes we should most look for...

~~~
ams6110
The entire aircraft is an electro/mechanical system with many thousands of
things that could go wrong, but are deemed unlikely. All engines could fail at
the same time, but it's deemed unlikely. Redundant hydraulic systems could
fail together, but it's deemed unlikely. There is no certainty in systems this
complicated.

------
stcredzero
IIRC, the VisualWorks VM had such a bug that would mysteriously crash an
automated airport people-mover after some interval, like 45 or 90 days.
(Software crash, not train-hardware crash! Train would simply stop.) Also, as
I recall, the train software project did not use automated tests at all! (By
that time, VisualWorks VM was implementing them.)

(Learn from history. Don't cling so hard to the notion that your language will
make you into super-programmers. Certainly, some tools are better in certain
contexts than others. However, group culture and the quality of working
relationships often have an effect even greater than choice of language.
Besides, people often dislike someone who projects an air of superiority.)

------
mrmondo
FYI - Engadget has very intrusive advertising that you can't close on a mobile
device: [http://i.imgur.com/nqgc2p7.png](http://i.imgur.com/nqgc2p7.png)

~~~
digi_owl
"tech for ladies", aka a power bank with a led flash and a "designer" case...

------
kazinator
Is it really an overflow bug? Or a counter wraparound bug?

For example, incorrectly using a X > Y comparison on values that are
congruential (and do not overflow) isn't an "overflow bug". You can only
locally compare values that are close together on the wheel, using
subtraction.

The simple thing to do with tick counts is to start them at some high value
that is only minutes away from rolling around. Then the situation reproduces
soon after startup, rather than days or months later, and its effects are more
likely to get caught in testing.
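
To make the subtraction-based comparison concrete, here is a small C sketch. The function names are made up. It assumes a free-running 32-bit tick counter and that any two values being compared are less than 2^31 ticks apart.

```c
#include <stdbool.h>
#include <stdint.h>

/* Ticks elapsed from `start` to `now`. Unsigned subtraction is modulo 2^32,
 * so the result stays correct across a single wraparound. */
uint32_t ticks_elapsed(uint32_t start, uint32_t now)
{
    return now - start;
}

/* True if `a` is earlier than `b` on the wheel. Valid only while the two
 * values are less than 2^31 ticks apart. The cast to int32_t relies on the
 * usual two's-complement conversion (implementation-defined in older C). */
bool tick_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

With `a = 0xFFFFFFF0` and `b = 0x10`, a plain `a < b` says false even though `a` was sampled first, while `tick_before(a, b)` says true and `ticks_elapsed(a, b)` gives 32.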

~~~
kazinator
Another thing you can do is reduce the range of tick counts. Say you have a 32
bit tick count which increments a hundred times a second, but the longest
period (biggest delta between any two live time values) you care about in your
module (driver or whatever) is well within 30 seconds. That's only 3000 ticks.
Then, whenever you sample the counter, you can mask it down into, say, the 13
bit range [0,8192): effectively a tick counter that rolls over every 81.9
seconds (which you treat correctly as a 13 bit value in your calculations like
is_before(t0, t1) or add_time(t0, delta)).
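
A rough sketch of what those helpers could look like in C. `is_before` and `add_time` are the names used above. `sample_time`, the macros, and the rule that compared values must be less than half the range (4096 ticks) apart are assumptions added for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_BITS 13u
#define TICK_MASK ((1u << TICK_BITS) - 1u)   /* 0x1FFF: values in [0, 8192) */
#define TICK_HALF (1u << (TICK_BITS - 1u))   /* 4096: half the wheel */

/* Reduce a raw 32-bit tick count to the 13-bit range. */
uint32_t sample_time(uint32_t raw_ticks)
{
    return raw_ticks & TICK_MASK;
}

/* Add a delta to a 13-bit time value, wrapping within the 13-bit range. */
uint32_t add_time(uint32_t t0, uint32_t delta)
{
    return (t0 + delta) & TICK_MASK;
}

/* True if t0 is strictly before t1, assuming they are < 4096 ticks apart. */
bool is_before(uint32_t t0, uint32_t t1)
{
    uint32_t diff = (t1 - t0) & TICK_MASK;
    return diff != 0u && diff < TICK_HALF;
}
```

Because the wheel rolls over every 81.92 seconds at 100 Hz, any mistake in these helpers shows up within a couple of minutes of testing.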

~~~
ambrop7
There's no need to reduce the range, you can just treat the full
counter range (see my other comment).

~~~
kazinator
Well, it's a tautology that if you _treat correctly_ anything, you don't need
any defensive tricks.

Treat all the ones and zeros correctly and everything else takes care of
itself.

~~~
ambrop7
You mentioned correct treatment first :) I'm just saying that masking the
clock is unnecessary and doesn't make correct treatment any easier.

------
xrstf
Current workaround: Restart the estimated 28 U.S. planes at least every 120
days[1].

Wonder how long it could take for the update to be actually available (after
testing, approving, ...). Are we talking weeks, months, years?

[1] [https://s3.amazonaws.com/public-inspection.federalregister.g...](https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf)

~~~
ferrix
Actually I am a bit surprised that somebody would want a plane to be powered
on for such a long time. There is no way one could fly for that long anyway
and they are regularly taken to service between long hauls.

~~~
ghshephard
I completely agree with you - on at least three of the dozen or so flights
I've taken this year, when there was a problem with the passenger area (Audio
in one case, WiFi in another, and finally my _POWER_ connector in the third) -
the flight attendants power cycled the entire system, which took about 15
minutes, and let me watch the Linux boot process on the back-of-seat console.

My suspicion is the "Reboot" approach is pretty common to aviation systems. It
wouldn't surprise me that many of the components are rebooted daily, and
almost certainly on a weekly basis.

120+ days without a reboot sounds unlikely to me.

~~~
gkop
Jtsummers, userbinator: This 787 bug shuts down the generators which I
understand provide only the AC power aboard the aircraft? How critical is this
AC power?

~~~
twistedpair
B787 is a mostly electric airliner. There are far fewer hydraulic/cable
operated systems than in previous planes. This is much safer since an
explosion/leak/breach/clog in a hydraulic line won't take out an entire
hydraulic system (most planes have 3 systems and fuse valves to mitigate this).

However, since there are so many electrically operated systems, you really need
power. The B787 is also No Bleed [1], so electrical power is used to
pressurize it as well [2]. Need Electric Power.

[1]
[http://www.boeing.com/commercial/aeromagazine/articles/qtr_4...](http://www.boeing.com/commercial/aeromagazine/articles/qtr_4_07/article_02_1.html)
[2] [http://www.airliners.net/aviation-forums/tech_ops/read.main/...](http://www.airliners.net/aviation-forums/tech_ops/read.main/218933/)

~~~
ghshephard
Isn't most of that DC power though? How much of it is AC power (like the power
that each passenger seat gets?)

That's the power that I was talking about requiring a reboot. Not sure if it's
related to the AC power associated with the bug in question - possible there
are two AC power systems on the plane?

------
shellmayr
Wow, how can something like this happen? I thought airplanes had triple
redundant software systems using 3-version programming [1] in order to avoid
such bugs/problems. Can anyone familiar with flight technology shed some light
on this?

[1][http://en.wikipedia.org/wiki/N-version_programming](http://en.wikipedia.org/wiki/N-version_programming)

~~~
CHY872
Not an airplane programmer, but I seem to remember that the literature says
that it's not generally a cost-effective way of finding bugs. In particular,
you multiply the cost of development by (say) 3x (which is fine, on its own)
but also the places where bugs are inserted are typically the hard parts; so
you don't reduce the number of bugs as much as you'd like; it can easily be
more cost effective to invest the few million in static analysis etc.

As much as we'd like plane manufacturers to test things to death, it'd become
too expensive too quickly. For all we know, this software could be written by
a contractor, or the firmware for a third party part.

As far as I know, N-version programming was effective when software systems
were small (shuttle ran on 50k lines of code) and where poring over every
single line was possible, because the hard part was coming up with the spec.

Nowadays a big plane like the A380 might be expected to have 100M lines of
code in its subsystems, and it's simply too expensive.

~~~
hello_there
> Nowadays a big plane like the A380 might be expected to have 100M lines of
> code in its subsystems,

Why does an airplane require 100M lines of code?

~~~
mschuster91
Linux kernel alone comes in at 15m SLOC. Now add a userland subsystem and
you're at 20-30M just for one device.

Multiply by all the little and big subsystems, the embedded chips, in-flight
entertainment, network gear... 100m SLOC is too low, I think.

~~~
georgerobinson
Sorry if this is incredibly ignorant, but I can't believe flight control
systems are running Linux?

Do these systems not have hard real-time requirements about the execution time
and periodicity of tasks which can't be guaranteed by the time-sharing
scheduling algorithms in Linux?

~~~
jacquesm
Real time systems will be running an RTOS: VxWorks or QnX or something
equivalent to that.

They'll definitely build a prototype using Linux but they won't get that
certified so it literally 'won't fly', it's just a means to speed up initial
development.

~~~
Kliment
Is the order of magnitude of lines of code in QNX different from that of
linux? At a first approximation, I don't see why it would be.

~~~
jacquesm
The QnX kernel is very small compared to the Linux kernel.

Small enough that I could reimplement it in approximately 3500 lines of code +
another 850 for the virtual memory management.

~~~
Kliment
Wow, I had no idea. Since their source is closed and untouchable I had no way
to check either. Is there any reason there aren't several certified open
RTOSes around?

~~~
jacquesm
I don't know if there aren't _any_ open certified RTOS's around, but I can
explain the 'why' part easily: if you pay for the certification of an open
RTOS then everybody that can use one will say 'thank you' for the effort and
that's that, since the certification would apply to any and all copies of that
particular version. So you're essentially paying for the privilege of cutting
your competitors a break.

This could only work if the entity paying for the certification had a way of
making that money back somehow and I don't see how that could be done.

~~~
walterbell
Not an RTOS, but seL4 is a correctness-proven open-source ARM microkernel:
[https://sel4.systems](https://sel4.systems). Looks like a mixture of public
and private funding. It's part of the L4 family,
[http://en.m.wikipedia.org/wiki/L4_microkernel_family#Univers...](http://en.m.wikipedia.org/wiki/L4_microkernel_family#University_of_New_South_Wales_and_NICTA)
which includes OKL4 (deployed on 1B+ ARM-based mobile phones) and
[http://genode.org](http://genode.org) (x86/ARM) from Dresden.

~~~
jeff_marshall
Not to detract from the fine work done by the sel4 folks, but there is a large
gap between what they have and what DO-178C requires for level A software.
Like many other bureaucratic organisations, the FAA (and other regional
equivalents) have a process with its own set of rules (MCDC testing,
requirements/design traceability artifacts, etc).

It would cost a significant amount of money to develop the necessary artifacts
and engage the FAA to obtain a certification.

~~~
jacquesm
That's absolutely true but something like this could be a good starting point.

What I think the whole thread above misses is that the economics simply aren't
there, cost isn't the limiting factor for the OS licenses for avionics but an
extra certification track (especially for a fast moving target) would be,
besides, it is not just the OS that gets certified but you will also have to
(separately) certify (usually) the hardware that it runs on (unless you're
going to use a design that has already been certified).

That means that modifications are expensive and that 'known to be good' trumps
'could be better' or 'could be cheaper in the longer term'.

Someone would have to come up with a very good reason to see open source trump
the existing closed source solutions.

------
__Joker
I remember a legacy web service system we used to support which would crash
after a couple of days, after hogging all the resources. The first thing we did
was to set up a monitor with a daily restart script. That was a much cheaper
and quicker fix than fixing the memory leaks, which took 3 months to reach
production.

~~~
dvirsky
I had a memory leak in a Python program (that's a really rare thing), that
would trigger OOM kills in about 3-4 days. After a few days of investigation
that yielded nothing, I put a restart job every day or so, and returned to it
only after a few months when I had some time. Eventually it came down to
someone replacing dict.get() with dict.setdefault() in a lookup dictionary in
some utility library, causing each miss to leak a small, non GC-ed empty entry
in an otherwise small lookup table.

------
tzs
This and other stories are claiming it is an integer overflow, but I've seen
no source for that. It seems to be just speculation based on the observation
that a 100 Hz 32-bit counter would behave similarly.

------
stox
Coming soon: OTA updates for Boeing aircraft. What could possibly go wrong?

------
serf
well, at least it's not a pacemaker.

It's a terrible oversight, and makes me wonder about the rest of the code, but
are there very many airliners that are online for that long at once? I don't
know much about how commercial air travel works behind the scenes.

------
varjag
The Lord could not count grains of sand with a 32-bit word

------
xnull2guest
This reporting comes on the heels of a GAO study on hijacking airliners. It
is not clear why the Congress ordered a study on hacking airliners, though
there's a long list of things (MH flights, Carter's claims of a 'cyber Pearl
Harbor') that some people might speculate over.

Does anyone know the impetus behind the study?

~~~
VonGuard
Probably part of a larger security initiative with money to pay for studies.
I'd wager there's a Congressional committee tasked with writing policy to
secure critical infrastructure like power plants and such. After 9/11, you can
bet planes and FAA systems would be a part of this.

~~~
xnull2guest
There are definitely security initiatives like this, there have been since
before 9/11 (and an uptick afterwards), but it is unusual for the Congress to
be involved or to demand a study.

------
dschiptsov
How come it is not Java?

~~~
falis
In its current incarnations, it is considered uncertifiable for high
criticality levels under DO-178C. May be used in entertainment systems and
such though.

~~~
dschiptsov
You mean it is considered to be crap?)

