
Reboot Your Dreamliner Every 248 Days to Avoid Integer Overflow (2015) - pjmlp
https://www.i-programmer.info/news/149-security/8548-reboot-your-dreamliner-every-248-days-to-avoid-integer-overflow.html
======
nimbius
I read the article from tip to tail and, as a professional engine mechanic, it
reminds me of Ford's "MyFord Touch" platform.

The system was rolled out across newer fleets without much testing, and in
some models controls practically _every_ single feature in the vehicle from
climate control to the onstar SOS. There was a recall for platinum model F150
trucks because the system could glitch out after so many hours of continuous
operation and trigger a fault in the 4 wheel brake force distribution system.
This in turn either completely arrested the brakes, or caused them to quietly
apply themselves at around 15%...you couldn't undo this unless you pulled the
fuse. Even worse, collision detection would be disabled because the system
thought you were aware of a potential crash and were braking.

If a certain Bluetooth phone were paired, it could cause the trailer load and
position sensor to erroneously predict grade downshifts. The result was that
incoming calls on the highway would either wreck the rear differential or put
the truck on the side of the road.

~~~
DoofusOfDeath
I've noticed the occasional posting in HN from people who aren't [aspiring]
professional software developers, which surprises me.

If you don't mind me asking, what brings someone in your line of work to a
site like this?

[edit: I just realized that ^ sounds like a tragically bad pickup line.]

~~~
glup
There's also a ton of academic lurkers -- I've always thought that this is due to
the facts that 1) the border between academia and industry is pretty porous in
technical fields (super obvious in ML research, but I think holds more
generally) 2) academics borrow---and contribute to---the same set of methods
and technologies 3) most individuals have an interest in complex systems (and
their hilarious, tragic faults) such that they find articles like this one
interesting.

I think it's also useful to look at preceding communities like Slashdot — in
that case it started out pretty tightly scoped for Linux enthusiasts, gaming,
programming, and the then-nascent social internet but then the cohort of
regular readers and contributors got older and they found themselves in
policy, academia, and managerial business roles; as long as people remained
involved in the community, the content of the site expanded to reflect the
(extremely rapidly increasing) role of tech in these domains. (Obviously,
Slashdot suffered from several ownership changes, a lame commenting system,
and most everyone moved on to other venues).

~~~
phs318u
and 4) the signal-to-noise ratio on HN is very good compared to other online
discussion forums.

------
marijn
> Your options are to increase the number of bits used, which puts off the
> overflow, or you could work with infinite precision arithmetic, which would
> slowly use up the available memory and finally bring the system down.

Yeah, no. Doubling the amount of bits to 64 while keeping the same precision
gets you about 3 billion years worth of time, which is _probably_ enough. And
I'm going to leave calculating how much time it'd take to fill up any
reasonable amount of memory with a single arbitrary-precision integer as an
exercise to the reader.
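
As a rough check on those figures, here is a minimal sketch of the arithmetic. It assumes a signed counter ticking 100 times per second (the article's guess, not a confirmed detail of the GCU software), and `days_until_overflow` is a name made up for illustration.

```python
SECONDS_PER_DAY = 24 * 3600
DAYS_PER_YEAR = 365.25  # Julian year, close enough for an estimate


def days_until_overflow(bits, ticks_per_second=100, signed=True):
    """Days until a tick counter of the given width runs out of range."""
    usable_bits = bits - 1 if signed else bits  # the sign bit holds no count
    return 2 ** usable_bits / ticks_per_second / SECONDS_PER_DAY


# Compare the original width, "one more byte", and a doubled width.
for bits in (32, 40, 64):
    days = days_until_overflow(bits)
    print(f"{bits}-bit: {days:,.1f} days = {days / DAYS_PER_YEAR:,.1f} years")
```

Under those assumptions the 32-bit counter lasts a little over 248 days, one extra byte stretches that to roughly 174 years, and 64 bits gives a bit under 3 billion years.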

~~~
ainar-g
Even if you do use arbitrary precision arithmetic and count nanoseconds, the
heat death of the universe is more likely to occur before your number takes
1KiB of RAM.
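
To put a number on that, here is a small sketch using Python's built-in arbitrary-precision integers. The age of the universe (about 13.8 billion years) is a rough estimate used only for scale, and `bytes_needed` is an illustrative helper, not anything from the article.

```python
NS_PER_YEAR = 36525 * 24 * 3600 * 10**7  # 365.25 days, in nanoseconds
AGE_OF_UNIVERSE_YEARS = 13_800_000_000   # rough estimate, for scale only


def bytes_needed(value):
    """Smallest whole number of bytes that can hold a non-negative integer."""
    return (value.bit_length() + 7) // 8


# A nanosecond counter running since the Big Bang still fits in 12 bytes.
ns_since_big_bang = AGE_OF_UNIVERSE_YEARS * NS_PER_YEAR
print(bytes_needed(ns_since_big_bang))  # 12

# Years a nanosecond counter could run before it needs more than 1 KiB.
years_for_one_kib = 2 ** (8 * 1024) // NS_PER_YEAR
print(len(str(years_for_one_kib)), "decimal digits of years")
```

The second figure is a year count with more than 2,000 decimal digits, so memory exhaustion from a single counter is not a practical concern.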

~~~
bunderbunder
Probably the deeper problem with using arbitrary precision arithmetic is that
you end up with a variable-sized datatype, which I _believe_ means at least a
modicum of extra hassle & complexity for any language that the control
software is likely to be written in. And less predictable timing, which might
be a big no-no if this is something that needs to be used in timing-sensitive
places.

I'd much rather take the 64-bit int, myself.

~~~
AnimalMuppet
If I recall correctly, at least some guidelines for avionics software (JSF,
maybe?) forbid dynamic memory allocations, period.

~~~
bunderbunder
JPL's does. See Rule 5, on page ten:
[https://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf](https://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf)

------
chrisacky
This is a little similar to the missile failure at Dhahran[1] in 1991
resulting in 28 deaths. The Patriot missile battery at Dhahran had been in
operation for 100 hours, by which time the system's internal clock had drifted
by one-third of a second. Due to the missile's speed this was equivalent to a
miss distance of 600 meters.
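
The one-third-second figure can be reproduced with a short sketch. It assumes the widely cited analysis of the failure: the clock counted tenths of a second and multiplied by a truncated fixed-point value of 0.1 with 23 bits after the binary point. The Scud speed of 1,676 m/s also comes from that analysis; neither parameter is stated in this thread.

```python
from fractions import Fraction

FRACTION_BITS = 23         # assumed width of the fractional part
TICKS_PER_SECOND = 10      # the clock counted tenths of a second
SCUD_SPEED_M_PER_S = 1676  # assumed closing speed, from the cited analysis

# 0.1 has no finite binary expansion, so truncation loses a little each tick.
exact = Fraction(1, 10)
stored = Fraction(int(exact * 2**FRACTION_BITS), 2**FRACTION_BITS)
error_per_tick = exact - stored

hours_up = 100
ticks = hours_up * 3600 * TICKS_PER_SECOND
drift_seconds = float(error_per_tick * ticks)

print(f"drift after {hours_up} h: {drift_seconds:.3f} s")  # 0.343 s
print(f"distance covered in that time: {drift_seconds * SCUD_SPEED_M_PER_S:.0f} m")
```

The per-tick error is under a ten-millionth of a second; it only becomes dangerous because it accumulates linearly with uptime, which is why regular reboots worked as a stopgap.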

Two weeks earlier, on February 11, 1991, the Israelis had identified the
problem and informed the U.S. Army and the PATRIOT Project Office, the
software manufacturer. As a stopgap measure, the Israelis had recommended
rebooting the system's computers regularly. The manufacturer supplied updated
software to the Army on February 26.

[1]:
[https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dha...](https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dhahran)

~~~
mabbo
While it is a good lesson from a software perspective- your bits are going to
overflow, make your software handle that gracefully- I've always been a bit
uncomfortable with the blame for deaths being placed on that bug.

The bug didn't kill anyone. The Scud missile fired with intent to kill did.
And of course, it was fired because it was the Gulf War and countries were
attacking each other. Blame who you like for that situation. But all anyone
talks about is how it was the Patriot Missile bug that led to the deaths.

The software bug failed to prevent the deaths that were going to happen
anyway. Lessons learned, yes, but I'd hate to imagine a programmer somewhere
living in guilt over it.

~~~
brianwawok
> The software bug failed to prevent the deaths that were going to happen
> anyway.

It MAY have failed to prevent the deaths. I don't think Patriot has a 100%
success rate. If the normal success rate of that intercept was 1%, 50%, or 99%
- it changes the wording a bit I think?

~~~
SiempreViernes
>was 1%, 50%, or 99%

Incidentally, that's about the numbers the Pentagon reported, but in reverse
chronological order.

[http://www.turnerhome.org/jct/patriot.html](http://www.turnerhome.org/jct/patriot.html)

~~~
dmix
The relevant quote:

> Official assessments of the number of Scuds destroyed by the Patriot missile
> system in the war have fallen from 100 percent during the war, to 96 percent
> in testimony to Congress after the war, to 80 percent, 70 percent and,
> currently, the Army believes that as many as 52 percent of the Scuds were
> destroyed overall but it only has high confidence that the Patriot destroyed
> 25 percent of the Scud warheads it targeted.

~~~
SiempreViernes
Don't forget the bit after:

> Independent review of the evidence in support of the Army claims reveals
> that, using the Army's own methodology and evidence, a strong case can be
> made that Patriots hit only 9 percent of the Scud warheads engaged, and
> there are serious questions about these few hits. It is possible that the
> Patriots hit more than 9 percent, however, the evidence supporting these
> claims is even weaker.

------
tim333
This report actually sounds a little worse

> the FAA's new rules require operators to reboot the plane's electrical system
> every now and then because "all three flight control modules on the 787 might
> simultaneously reset if continuously powered on for 22 days." The effect of
> this simultaneous reset "could result in flight control surfaces not moving in
> response to flight crew inputs for a short time and consequent temporary loss
> of controllability."

Hope they've fixed that one now - also from 2016
[https://www.popularmechanics.com/flight/airlines/a24151/boin...](https://www.popularmechanics.com/flight/airlines/a24151/boing-787-dreamliner-reboot-bug/)

------
ainar-g
IIRC, 64 bits give you 200 years when you're counting _nanoseconds._ Why on
Earth would they use a 32-bit integer? I doubt this was some kind of
microoptimisation. My bet is on some sort of legacy component that is 64-bit-
o-phobic.

~~~
ajross
> Why on Earth would they use a 32-bit integer?

Why stop there? Why on earth would anyone use an integer instead of a double,
given the inherent risk of truncation error? Or for different arguments: Why
on earth would anyone use a 16 bit wchar_t? Why on earth would anyone make
char unsigned (or signed)? Why on earth would anyone put the little end first?

Machines are machines. They have fixed representations for different types,
with tradeoffs. And you have to pick one. And the thing about timeout handling
specifically is that _everyone_ along the path from the timer driver up
through the app needs to agree on the precision needed, or you'll get an
overflow condition.

Arguments of the form "Bugs are bad and we shouldn't write them" have not
historically helped with improving software quality.

~~~
titzer
"Well, I figured....that bridge is gonna fall down in 200 years no matter
what, so I went ahead and used rubber bands to hold the suspension lines in
place. Only cost me $2!"

No, just no. Don't be a hilljack. Build something to spec. 32 bits is clearly
not to spec.

~~~
0xffff2
How do you know? What's the spec? At first glance, it seems totally reasonable
that the vehicle would be completely rebooted at least once every n << 248
days.

~~~
titzer
> What's the spec?

I dunno but clearly running more than 248 days was not in it, otherwise this
would hopefully have been caught and tested. If it was specifically specced to
last N days and N < 248, personally I would have asked for my money back. As
this appears to come as a surprise to everyone (caught late), I'm calling this
one as I see it--a facepalm.

------
andyjohnson0
The space shuttle had a related problem, although not caused by overflow. Some
parts of the STS's avionics used a clock that would reset to zero at 00:00:00
on 1st January, while other components had clocks that would continue to count
up. If a shuttle mission spanned the new year boundary then systems would
panic if they could no longer agree on the time.

The only reference I could find to this is
[https://abcnews.go.com/Technology/story?id=2699091&page=1](https://abcnews.go.com/Technology/story?id=2699091&page=1)

------
sbradford26
"One interesting fact is that the FAA claim that it will take about one hour
to reboot the GCUs - so there clearly isn't a reset button."

I am incredibly surprised by this; most higher level flight control systems
have power up requirements in the seconds. Then lower level actuator controls
or engine controls have power up requirements in the milliseconds.

~~~
cm2187
What do they do if there is a temporary power loss in mid air?

~~~
sbradford26
So that depends on if you mean a power source temporarily goes down or if a
box experiences a power blip. So in the first case each box usually has 3
different sources of power, something like the generators on the engines, then
backup on the 24 volt bus, then finally a backup battery.

In the second scenario it depends on how long the blip is. Usually there are
holdup requirements that a box will not reset if power is lost for x amount of
time. If power is lost for longer it will save state and leave itself in a
state that will come back up quickly.

What I am thinking is happening in this case is that some value is getting
messed up in NVM and it must be reset by the maintenance crew, so the "reset"
they are talking about isn't just rebooting after the error occurs. But if you
reboot before the error happens the NVM doesn't get messed up and the value is
just updated with the correct number.

~~~
cm2187
So you think the one hour quoted by the article isn't the time it takes for
the system to boot but rather the time it takes for the maintenance to access
the device, reset it, and close everything?

~~~
sbradford26
That seems much more reasonable to me. Mostly because planes are rebooted for
maintenance every so often between flights and customers (airlines) would
not be okay with a 1 hour reboot time.

There are also usually maintenance tests that can be run to reset a box on the
plane. So the technician would have to put the plane in maintenance mode and
run through the test to get the box reset or something.

~~~
paulie_a
Yes they might not be okay with that but airlines apparently can hold the
passengers for 8 hours because of delays. To the point of toilets overflowing.
I think if you run an airline, you are legally required to go out of your way
to provide terrible service.

------
jaclaz
Provided that the guess is correct:

> A simple guess suggests the problem is a signed 32-bit overflow as 2^31
> is the number of seconds in 248 days multiplied by 100, i.e. a counter in
> hundredths of a second.
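
To make the guessed failure mode concrete, here is a sketch of what happens to such a counter at the boundary. It assumes hundredths of a second and two's-complement wraparound; `ctypes.c_int32` is used only to imitate a 32-bit register (in real C, signed overflow is undefined behaviour, so the wrap shown is the commonly observed result rather than a guarantee).

```python
import ctypes

TICKS_PER_SECOND = 100   # assumed: hundredths of a second, per the guess above
INT32_MAX = 2**31 - 1


def tick(counter):
    """Increment a counter with 32-bit signed wraparound."""
    return ctypes.c_int32(counter + 1).value


# The last representable value is reached after a bit more than 248 days.
uptime_days = INT32_MAX / TICKS_PER_SECOND / (24 * 3600)
print(f"{uptime_days:.2f} days")

# One tick later the counter is hugely negative instead of one larger.
wrapped = tick(INT32_MAX)
print(wrapped)  # -2147483648
```

Any code that compares this value against an earlier timestamp, or treats it as elapsed time, suddenly sees time running backwards by about 497 days.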

The question that comes to my (perverted) mind is what is the counter for, or
more strictly why it _needs_ an accuracy of 1/100th second?

If it is related to a "periodical" action (a time interval) it makes little
sense to have that degree of precision, and on the other hand, if the
precision is needed, why not calculate it from a base point?

I mean, if it is related to "boot" time, I presume that no one would ever use a
counter, rather a log of some kind with a timestamp and calculate (properly)
the time elapsed ...

~~~
gargravarr
The systems are controlling generators, which produce AC current. It's not
unreasonable to think the controllers are monitoring the AC waveform, which
has a frequency of 50-60Hz. Ergo, if the controller is checking the generator
is producing the expected waveform (because a lot of sensitive things depend
on the waveform being accurate), it makes sense that it could be using a 100Hz
polling interval, which is where the counter comes into play.

It's possible that it's using a system-wide timer for convenience since the
embedded hardware is very limited compared to a full computer, where spinning
off a separate timer is trivial. When the requirement for different timers hit
the program designer's desk, they probably took the most precise use case and
designed one timer around it, and neglected to take into account the overflow.

~~~
jaclaz
Yep, but purely anecdotal data, I once was involved in the construction of a
(small) hydroelectric plant (I was responsible for the construction/building,
not for the electrical parts, our construction company had a couple electric
partners for that).

For some reasons the tender was for the building and plant, but the client
made a separate tender for the actual turbines.

Though I tried (vainly) to convince the manufacturer of the turbines to use a
controller from the same firm we had as partner, they decided to use "their"
way.

In theory not a problem as the turbines and their controllers would "talk" to
the external control system (SCADA, etc.) through a "standard" RS422
connection.

After a couple days of testing, it was clear that the SCADA was going
periodically "berserk".

Though not at all my field/responsibility, I was willing to have the problem
solved and after 3 (three) days of the engineers from the turbine manufacturer
and from the control system manufacturer finding nothing (and BTW largely
failing to communicate between them) I started looking with them, one by one,
at the signals that were exchanged on the connection.

It was evident that there was some form of overflow, the on-turbine controller
sent way too much data and "clogged" the receiving part.

There was a sensor for rotation speed and another one for oil pressure that
were polled (and sent data) at a rate of 1000 per second.

The turbines were of the Pelton type, with an external (large) flywheel, and
it took (literally) tens of seconds since you stopped the waterflow to have
the actual wheel slow down and after several more tens of seconds stop, and
vice versa once it got to the intended speed, it had in practice no variation.

So you had these two sensors polled 1000 times a second to measure something
that would change - maybe - only after several seconds intervals and that
could be as well corrected with a delay of tens of seconds.

Reducing the polling rate from 1000 to 10 times a second (still way overkill
for the proper functioning of the system) all errors went away.

It came out that these sensors were a new type/make/model, the first to be
capable of kHz polling, and the turbine engineers set them to the max only
because they _could_.

------
ocfnash
The article finishes with the rather vague and somewhat terrifying assertion
that:

"It is estimated that the Airbus A380, comparable in complexity to the
Dreamliner, has more than 100 million lines of code."

~~~
moreira
The thing is, do they just count all the lines of code in all the libraries
they use, as well as the OS (all of it compiled from source, I would guess)? I
imagine they do, since that's the only reasonable way that 100 million lines
of code could be reached.

~~~
sbradford26
So on a project like this, you do not use many standard libraries. It is
usually explicitly stated that you can't use anything from the std c library
and such.

Also source lines of code are usually calculated as lines of code that need to
be verified. So if a line of code is in the software, it needs to have a test
that covers that line of code.

~~~
moreira
Then that really does raise some questions as to what 100 million lines are
for. What operations exactly does an airplane need to execute that require
that much code?

~~~
NikolaeVarius
Everything?

Modern avionics is insanely complex. This isn't something you throw 1-year
self taught JS devs on.

When I worked on next-gen turbofans, we had multiple dozen engineers working
on managing the requirements of the software, much less the software itself.

You have the main avionics software managing the flight itself, the engine
software managing fuel consumption, and the 3x safety factor required to be
certified.

~~~
TeMPOraL
"Everything" isn't telling much. I'd love to, out of curiosity, browse through
a codebase of an avionics system. Is anything like this publicly available?

~~~
sbradford26
Most likely there is not much that you can look at but this overview of the
space shuttle can give you a lot of insight into how software and hardware are
designed for flight critical applications. I believe this was posted a while
ago here.

[https://spaceflight.nasa.gov/shuttle/reference/shutref/orbit...](https://spaceflight.nasa.gov/shuttle/reference/shutref/orbiter/avionics/)

One take away is that software and hardware are nearly impossible to separate
in a flight critical application.

Also one note, the space shuttle was incredibly complex for its day. So I
would say the complexity of the space shuttle would give you a decent idea of
what commercial aviation does now.

~~~
Dylan16807
> Also one note, the space shuttle was incredibly complex for its day. So I
> would say the complexity of the space shuttle would give you a decent idea
> of what commercial aviation does now.

And yet it has less than a million lines of code.

------
jjwiseman
When this became known, I asked on the aviation stackexchange about whether
there were any circumstances under which it might be expected for a 787 to
remain powered on for 248 days:
[https://aviation.stackexchange.com/questions/14494/in-what-c...](https://aviation.stackexchange.com/questions/14494/in-what-circumstances-could-a-787-stay-powered-on-continuously-for-248-days)

The answer seems to be that it is very unlikely, but not impossible.

~~~
rosege
Too bad it would seem very strange if you asked the pilots before boarding just
how long it had been since they rebooted it.

------
saint_abroad
(Signed version of) the 497 day bug [1] strikes yet again.

[1]
[https://news.ycombinator.com/item?id=3231781](https://news.ycombinator.com/item?id=3231781)

------
foofoo55
Related article from 1986, predicting such bugs, and listing a few:

[http://articles.chicagotribune.com/1986-12-14/news/860403047...](http://articles.chicagotribune.com/1986-12-14/news/8604030475_1_software-bug-computer-scientist)

------
kbutler
On a recent family trip a few miles from our destination, all my vehicle's
dash controls went out, then reappeared with a charging system error
indicator.

Everything seemed fine - I watched battery gauge and hoped I'd make it. When I
got to the destination, I stopped and restarted the engine, and everything
looked fine, and the charging system indicator went back to normal.

I noticed afterward that the "Engine Hours", which had been getting close to
10,000, was now in single digits. No other internal counters were reset.

I wondered if it was an overflow condition, but it appears more mundane - many
vehicle owners report seemingly random resets. The surprising thing seems to
be that it hadn't reset before getting close to 10,000 hours!

~~~
kw71
Wow, that car was driven a lot. All my 20+ year old cars have less than 4000
h.

~~~
kbutler
Or did their counters reset? ;-)

30 odometer miles per engine hour seems about average (to one significant
digit - varies with proportion of highway vs city vs idle hours). That would
suggest all your 20+ year old cars are under about 120K miles? Or they have a
very high mix of freeway miles. Either way, they're not likely getting the
average 10-15k miles per year.

Conversely, that rough evaluation is making me question whether my
recollection of nearing 10K hours was correct - the vehicle is under 200K
miles, which would suggest <20mph average.

------
nashashmi
My office building has a computer system for elevators that directs the
traveler upon selecting the level to an elevator preselected with the
destination floor. Every month or so, the system begins to slow down or lag,
e.g. user enters floor, system pauses (during morning rush hour this causes
delays), flashes to the selected elevator, and returns back to the user
screen. The longer the interval since reboot, the longer the pause. Obviously
there is some sort of memory corruption in the system causing a buffer
overflow and the routine programmed in the software is not clearing it. So a
manual reboot is required to make it work as normal.

~~~
nradov
Those symptoms would be more likely caused by a memory leak and/or
fragmentation, not by memory corruption or a buffer overflow.

------
dsfyu404ed
Do civilian aircraft not have requirements for maximum reboot times?

Naval vessels usually have sub-minute times from power restored to shooting
back and they're legacy code and legacy hardware nightmares. A 787 doesn't
have layers upon layers of legacy code that most naval applications do. While
the stakes are definitely a little lower you'd think they would be able to do
a full reboot in less than the time it takes to fall out of the sky by a
comfortable margin.

------
okket
Don't routine checks include complete reboots? They should happen more often
than every 248 days.

~~~
tonysdg
You would certainly hope; and yet, long uptimes unforeseen by designers can
have devastating consequences:

> A government investigation revealed that the failed intercept at Dhahran had
> been caused by a software error in the system's handling of timestamps. The
> Patriot missile battery at Dhahran had been in operation for 100 hours, by
> which time the system's internal clock had drifted by one-third of a second.
> Due to the missile's speed this was equivalent to a miss distance of 600
> meters...the Scud impacted on a makeshift barracks in an Al Khobar
> warehouse, killing 28 soldiers.

> __As a stopgap measure, the Israelis had recommended rebooting the system's
> computers regularly.__

[https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dha...](https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dhahran)

~~~
oftenwrong
more detail on the patriot missile failure:

[https://web.archive.org/web/20100702180720/http://mate.uprh....](https://web.archive.org/web/20100702180720/http://mate.uprh.edu/~pnm/notas4061/patriot.htm)

------
mvpu
Forgive me if it's a stupid idea - but why not shut down these plane computers
every night when they aren't in service and boot them again in the morning
before the first flight? Let all the counters reset, all memory creep and
leaks go away, start fresh....

~~~
jhpankow
This will increase the failure rate due to added thermal/electrical stresses
caused by the constant power cycling.

------
kokey
I remember an issue with the Oracle DB client libraries for Linux (RHEL 2 or 3
I think) where if the system had something like a 160 day uptime any software
using the library to connect to the DB would just hang. I saw the effects with
strace where it would just get stuck in a loop doing some time related
syscall. We managed to get an unofficial patch from Oracle that I could wedge
into our servers, since reboots were under strict change control. I remember
this issue came up when I was being interviewed for a position at the Guardian
a couple of months later and they seemed amused by what I had to do to fix it
since they also encountered the same issue but fixed it by making sure they
reboot more than twice a year.

------
baybal2
Remember the Ariane rocket - also a piece of verified code running in verified
OS, running on verified MCU, made in verified silicon, but the obvious flaw
was still missed.

The problem with formal verification is that for a complex system, the number
of constraints goes through the roof, and it is no longer humanly possible to
understand whether any one of them makes sense in the context of the whole.

It might very well be possible that something like "register A should never
overflow when input B is below value C" was put in the validation rules, but
nobody gave importance to understanding what it was. Or worse, somebody was
simply too lazy to change it, fearing it would upset 20-some even more obscure
validation constraints.

------
08-15
I wonder why the GCU needs a time counter in the first place and how it is
used so that the whole controller shuts down on overflow. I bet, 16 bits would
be enough if handled properly.

~~~
sbradford26
Software with airplanes is usually periodic and not event driven. So every 10
ms or some interval of time it will compute all of the control laws and
monitors and such. So that 32 bit counter is most likely the period counter or
a maintenance interval counter.

~~~
08-15
Yeah, makes no difference.

I imagine, the software needs to remember some state variables with time
stamps, to ramp power up at a certain rate or to implement PID control of the
power level. Crucially, it never needs an absolute time stamp, it only needs
to know how old the data is.

The most simple(!) approach is to use an unsigned integer as timer/counter and
let it overflow all the time. Age is computed by unsigned subtraction,
ignoring wraparound. With a 16 bit timer and 10ms resolution, ages up to
almost 11 minutes can be represented. Why would a turbo-generator even need to
remember what it did 11 minutes ago?

In other words, the mistake is probably not the narrow counter, it's the
signed arithmetic and the subsequent failure when it computes a negative age.
Someone else said "they probably stuck a crappy old 8-bit micro in there."
That would surprise me---people programming 8-bit micros tend to know how to
use unsigned arithmetic where it makes sense.
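
The wraparound-tolerant age computation described above can be sketched as follows. This is only an illustration of the technique, assuming a free-running 16-bit timer with 10 ms ticks as in the comment; the masking stands in for what unsigned 16-bit subtraction does natively in C, and `age_ticks` is a made-up name.

```python
MASK16 = 0xFFFF  # a 16-bit timer counts 0..65535 and then wraps to 0
TICK_MS = 10     # assumed tick length, as in the comment above


def age_ticks(now, then):
    """Elapsed ticks between two timer readings, modulo 2**16."""
    return (now - then) & MASK16


then = 0xFFF0                  # sampled shortly before the timer wraps
now = (then + 0x20) & MASK16   # 32 ticks later the raw reading is 0x0010

print(age_ticks(now, then))    # 32: correct despite the wrap
print(now - then)              # -65504: the "negative age" of naive arithmetic

# Largest age that can be represented before the result becomes ambiguous.
max_age_minutes = MASK16 * TICK_MS / 1000 / 60
print(f"{max_age_minutes:.1f} minutes")
```

The catch is that the result is only valid if the true age is shorter than one full timer period, so the scheme suits "how stale is this sample" checks rather than long-term uptime tracking.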

------
nmg
Naive question: Would a reasonable way to avoid this scenario be, to increment
a secondary counter when the primary counter reaches max-1, then reset the
primary counter to zero?

~~~
sbradford26
A couple different ways to avoid the issue:

1: Size the type used to store the value so that it cannot overflow even in
corner cases such as a plane being on for several months. Ex. going from a 32
bit int to a 64 bit.

2: Have a flag that you set when the counter overflows, so you can calculate
the actual time. Given the length of time that planes are on this would give
you enough time.

3: Lessen the precision of the counter (Probably not an option since the
precision is usually a requirement)
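
Option 2, and the secondary counter suggested in the question, can be sketched like this. `ExtendedCounter` is a hypothetical class that counts wraps instead of keeping a single flag, and it ignores a real concern in embedded code: a reader must not observe the two fields mid-update.

```python
class ExtendedCounter:
    """Pairs a narrow wrapping counter with a count of how often it wrapped."""

    def __init__(self, bits=32):
        self.modulus = 2 ** bits
        self.low = 0      # the narrow counter, always in [0, modulus)
        self.wraps = 0    # secondary counter, bumped on each wraparound

    def tick(self):
        self.low += 1
        if self.low == self.modulus:
            self.low = 0
            self.wraps += 1

    def total(self):
        """Reconstruct the true tick count from both parts."""
        return self.wraps * self.modulus + self.low


# A 4-bit counter is used here only so the wrap happens quickly.
counter = ExtendedCounter(bits=4)
for _ in range(40):
    counter.tick()

print(counter.wraps, counter.low, counter.total())  # 2 8 40
```

In effect this is a wider integer built by hand, which is why option 1 is usually simpler whenever the platform supports a 64-bit type.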

------
ayy_lmao
To be honest this isn't that big a deal. Why would anyone leave an airplane
running for days continuously?

Just a quick calculation. If it takes 248 days = 248 * 24 * 3600 seconds to go
from 1 to 2^32, then the sensors have a sampling time of 5ms so it takes a
measurement at a 200Hz frequency. Not related to anything but it's nice to
know I guess.
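
That calculation can be run in both directions. This sketch derives the tick period from the 248-day figure for an unsigned range (2^32, as assumed in this comment) and for a signed one (2^31, as the article guesses); which of the two applies to the real GCU is not known.

```python
OVERFLOW_SECONDS = 248 * 24 * 3600  # 21,427,200 seconds


def tick_period_ms(counter_states):
    """Tick length implied by exhausting `counter_states` values in 248 days."""
    return OVERFLOW_SECONDS / counter_states * 1000


for label, states in (("2^32 (unsigned)", 2**32), ("2^31 (signed)", 2**31)):
    period = tick_period_ms(states)
    print(f"{label}: {period:.2f} ms per tick, about {1000 / period:.0f} Hz")
```

The unsigned reading gives roughly 5 ms (200 Hz), while the signed reading gives roughly 10 ms (100 Hz), which matches the article's hundredths-of-a-second guess.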

This is quite old news though. I remember it was mentioned in last year's CS50
lecture about integers.

------
sitkack
> Your options are to increase the number of bits used, which puts off the
> overflow, or you could work with infinite precision arithmetic, which would
> slowly use up the available memory and finally bring the system down.

Pedantically true for an infinite universe, but _merely_ moving the counter to
64 bits would give one 2^32 * 248 days. Let's say we only use 30 bits: we're still
on the order of a billion years for 4 extra bytes.

------
lucb1e
Site does not load for me. Found this one:
[http://archive.is/dLsMd](http://archive.is/dLsMd)

------
kwhitefoot
> Your options are to increase the number of bits used, which puts off the
> overflow,

Just adding one more byte puts the overflow off for almost 200 years,
should be long enough.

The question that comes to my mind is: now that they know that Boeing is
sloppy are they going to thoroughly audit the code to see if any other
overflows are lurking in it? And will they do the same to Airbus? And if not
then why not?

------
M_Bakhtiari
> We are issuing this AD to prevent loss of all AC electrical power, which
> could result in loss of control of the airplane

So the 787 doesn't have a mechanical backup for the flight controls? So much
for the Boeing fanboys talking about their mechanical yokes. Though
realistically that says very little about actual reliability and safety, just
look at the 737 rudder issues.

~~~
supernova87a
When you say "mechanical", in a 787 or any large airliner it doesn't mean
straightforwardly what it might mean in a small plane.

In any large airliner the pilot's controls are indirectly linked to the flight
surfaces by either hydraulic or electrical systems. There's no way a person
could produce the forces needed to handle a plane that large adequately.

So I suppose it depends on the specific system, but if electrical power were
to be totally lost, perhaps they would be dead in the water. (air)

As is my rudimentary understanding.

~~~
M_Bakhtiari
> In any large airliner the pilot's controls are indirectly linked to the
> flight surfaces by either hydraulic or electrical systems.

I understand the 777 and 787 are special cases, but my impression of the rest
of their line was that the flight controls are directly linked by control
cables and are assisted by mechanical-hydraulic servos, a bit like hydraulic
power steering on cars. So with a total loss of electrical systems they would
still have normal primary flight controls.

Of course hydraulic servos can fail (for example the aforementioned 737) and even
triple hydraulic systems can be taken out by something like the uncontained
engine failure of United Airlines Flight 232.

I tend to think that electrical and electronic systems can ultimately be more
reliable and maintainable, but I worry that the safety culture is much worse
among the designers of such systems than their mechanical equivalents (no
wonder, we've been tinkering with the latter for thousands of years and the
former only a few decades).

------
raverbashing
Funny how they didn't catch that in the apparently deep code reviews and
complicated processes they have in the aerospace industry.

------
stcredzero
A company I worked for licensed a managed language VM, which was being used to
operate an airport people mover, which was crashing after just shy of 50 days.
(Crashing software-wise, not physically. It would just awkwardly stop.) It
turned out to be an integer overflow. If this was a 32 bit register for
milliseconds, that would be about right.
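
A quick sanity check of that arithmetic, as a rough sketch: the helper name `time_to_overflow` and the counter formats are assumptions for illustration, not details of the actual VM. An unsigned 32-bit millisecond counter wraps after roughly 49.7 days, and a signed 32-bit counter ticking in hundredths of a second (one format that would match the headline figure) wraps after roughly 248.55 days.

```python
from datetime import timedelta


def time_to_overflow(bits: int, ticks_per_second: int, signed: bool = False) -> timedelta:
    """Return how long a tick counter of the given width runs before it wraps.

    Hypothetical helper: a signed counter loses one bit to the sign.
    """
    usable_bits = bits - 1 if signed else bits
    return timedelta(seconds=2 ** usable_bits / ticks_per_second)


# Unsigned 32-bit counter of milliseconds (the "just shy of 50 days" case).
ms_unsigned = time_to_overflow(32, 1000)

# Signed 32-bit counter of hundredths of a second (assumed format for 248 days).
cs_signed = time_to_overflow(32, 100, signed=True)

print("32-bit unsigned, 1 ms ticks: ", ms_unsigned)
print("32-bit signed, 10 ms ticks:  ", cs_signed)
```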

------
leephillips
"You may be used to rebooting a server every so often to ensure that it
doesn't crash because of some resource problem"

Is this something that people have to do? I maintain a few Linux servers, and
I never need to reboot them. They only go down when the hosting company needs
to do some kind of hardware maintenance.

~~~
linker3000
That was once the reason put on a support ticket for a Windows 2000 Server
which I reported as running slowly - something like: "Server had been running
for over 30 days - needed reboot". I suspect this was down to the bespoke app,
and not Win2K.

------
sp527
While this is pretty sketchy, I DO like the design principle of ephemeral
subsystems. It just makes sense to assume a service can go down at any time
for any reason, from its conception, and bake that into how it's built. You
could argue periodically restarting things is just a natural extension of
that.

------
King-Aaron
Reminds me of "Memory leaks on missiles don't matter"

[https://twitter.com/pomeranian99/status/858856994438094848](https://twitter.com/pomeranian99/status/858856994438094848)

------
yellowapple
In this case, does it really matter? I'd be very surprised if it's running
nonstop for more than a couple days, let alone almost a year. Or is this some
system that normally doesn't power down when the rest of the plane does?

~~~
racingmars
Planes like the 787 are probably very rarely completely powered down. At the
gate, they're on ground power and there are probably some power buses left
energized even if the plane is going to sit unused until the next day. These
generator control units (since the article says they take an hour to boot
up...running through a ton of tests, I guess?) are probably energized even
when most other aircraft systems are turned off. If the aircraft is being
towed to a hangar or away from a gate for parking, they probably want its
lights on while in motion on the field so will stay powered from battery or
fire up the APU (auxiliary power unit) before disconnecting from ground power
to tow it around. When it makes it to the hangar, perhaps it gets plugged back
in to ground power. So it's conceivable that this particular component could
remain powered up continuously for a looong time.

That said: as others have stated, some forms of maintenance would power down
the plane completely, and these likely happen more often than this overflow
bug occurs. If I remember from the articles when this bug was first
discovered, it was discovered through simulation or specific testing and not
because an operator encountered it, which would imply that yes, operators do
"reboot" the whole airplane often enough that this doesn't naturally happen.

------
Tempest1981
This one was 216 days -- which seems stranger than 248.

Crucial m4 -- 5184 Hour Bug
[https://www.storagereview.com/node/2676](https://www.storagereview.com/node/2676)
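
To see why 216 days looks odder than 248, here is a small sketch that converts 5184 hours into a few tick units and checks how close each count is to a power of two. The tick units are guesses chosen for illustration; nothing here says how the drive firmware actually counts time.

```python
import math

HOURS = 5184
seconds = HOURS * 3600
days = HOURS // 24  # 5184 hours is exactly 216 days

# Hypothetical tick rates; the real counter format is not known from the thread.
ticks_per_second = {"seconds": 1, "tenths": 10, "hundredths": 100, "milliseconds": 1000}

# log2 of the tick count: a whole number would suggest a plain counter overflow.
bits_needed = {unit: math.log2(seconds * rate) for unit, rate in ticks_per_second.items()}

print(f"{HOURS} hours = {days} days = {seconds} seconds")
for unit, bits in bits_needed.items():
    print(f"{unit:>12}: 2^{bits:.2f} ticks")
```

None of these lands on a whole power of two, so a simple overflow of a counter in one of these units does not explain the number the way it does for 248 days.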

------
plg
Has this issue been fixed or do they still have to reboot every 248 days?

------
kleff
Wait, so they don't turn off the planes when not in use?

~~~
mnw21cam
This kind of plane isn't "not in use" very often. In fact, any time it's not
flying people around is costing the airline loads, and they try to minimise
that.

~~~
ceejayoz
Plus, when it's _not_ flying passengers, it's being fueled, cleaned,
maintained, and inspected. All of those are easier if there's power and
systems online.

------
na85
Looks like we killed their server. Anyone have a cached copy?

~~~
bcaa7f3a8bbc
Here we go.

[https://archive.fo/dLsMd](https://archive.fo/dLsMd)

------
mack1001
Doesn’t this call for an OSS vehicle platform that can be reviewed by
thousands of contributors to avoid these types of issues?

~~~
mannykannot
I see that the "many eyes" hypothesis lives on, despite the accumulation of
contrary evidence.

~~~
imtringued
Well usually the difference between open source and proprietary is that the
proprietary vendors are slow to fix security issues because it doesn't
increase their profit. In some cases they even try to hide the fact that the
software is insecure or even sue the person who reported the vulnerability.
Meanwhile most OSS software gets fixed as soon as the vulnerability is found.

~~~
sbradford26
In the case of aircraft though a company could find a bug and do their best to
fix it quickly. But if the plane is already flying, any changes to the software
require a re-certification of the software which can take months.

------
stephengillie
When a city bus needs to reboot, the driver will park the bus first. How do
they reboot a plane in the air?

~~~
abstractbeliefs
They don't, they reboot it on the ground. The implication is that airlines
need to find at least one hour in each 248 days to service this bug.

This shouldn't be a problem, but in the highly optimised logistics of air
travel, finding a solid hour to just reboot the GCU is expensive.

~~~
ubernostrum
Even the "minimal" A-check process happens often enough to avoid a 248-day
limit, and typically grounds an aircraft overnight.

People really do not realize how much maintenance commercial aviation
involves.

~~~
TillE
It should be fairly easy for anyone to figure out that planes generally aren't
used around-the-clock. For example, I routinely take a flight which arrives at
11pm, and I know that nothing is flying out of that airport again until the
morning.

------
tantalor
_will go into failsafe mode and the plane will lose all electrical power_

Sounds real "safe".

------
paultopia
OMG, that's terrifying. United Airlines flies a lot of those. United doesn't,
as far as I can tell, do maintenance until something breaks.

~~~
ubernostrum
_United doesn't, as far as I can tell, do maintenance until something
breaks._

In order to be allowed to fly in the US, and other countries, you better
believe United does maintenance.

This is one of my favorite reddit comments ever, and goes over aircraft
maintenance from a car analogy, explaining what you'd have to do to a car in
order to meet the _mandatory_ maintenance standards imposed on airlines as a
matter of law:

[https://www.reddit.com/r/aviation/comments/3v1hzj/how_do_old...](https://www.reddit.com/r/aviation/comments/3v1hzj/how_do_old_planes_keep_flying_safely/cxk2sf4/)

~~~
jaclaz
JFYI, I often use as a reference the actual VW Owner's manual for the Beetle,
you can find a copy of the 1952 version here, relevant pages are 7 and 8
"Operating Instructions":

[https://www.thesamba.com/vw/archives/manuals/1_52bug.php](https://www.thesamba.com/vw/archives/manuals/1_52bug.php)

[https://www.thesamba.com/vw/archives/manuals/1_52bug/3.jpg](https://www.thesamba.com/vw/archives/manuals/1_52bug/3.jpg)

[https://www.thesamba.com/vw/archives/manuals/1_52bug/4.jpg](https://www.thesamba.com/vw/archives/manuals/1_52bug/4.jpg)

but at least up to the '70s the recommendations were more or less the same.

