
Boeing 787s must be turned off and on every 51 days to prevent 'misleading data' - edward
https://www.theregister.co.uk/2020/04/02/boeing_787_power_cycle_51_days_stale_data/
======
chrischattin
Commercial pilot here. Can confirm that turn it off and back on again works
for troubleshooting avionics issues.

But seriously, this is clickbait and nothing to see here. Many things on the
aircraft are checked, cycled, etc before every flight, let alone on a 51 day
mx schedule.

~~~
Xcelerate
> Can confirm that turn it off and back on again works for troubleshooting
> avionics issues.

Have you ever had to do this mid-flight?

~~~
hailwren
I worked in aviation for a while. This is super common. There isn't a pilot on
the planet who hasn't turned avionics off in flight (there are always
redundant and backup systems). There probably isn't a working pilot in the
world that hasn't had to cycle a circuit breaker in flight this month.

Edit: Well, if this was a normal month.

------
hnarn
I am going out on a limb here but I seem to remember reading somewhere that
airliners do have maintenance schedules that are very strictly kept, for
obvious reasons. If the maintenance schedule is N days, then any news article
pointing out how amusing it is that an airliner needs to be rebooted every >N
days is at best sensationalism, at worst pure fearmongering.

I don't know for a fact this is the case here for the 787, but I think there
are far better things to worry about when it comes to technical security in
airliners than how often they need to be rebooted. For example, whether the
on-board WiFi is sufficiently separated from the in-flight systems, and (as
discussed recently here on HN) whether the advent of touchscreens for critical
flight systems is sufficiently durable, tested and redundant.

~~~
munk-a
> is at best sensationalism, at worst pure fearmongering

I don't know about the case here, but any time I've hit an issue in my work
where "Thing X needs to be done every Y or bugs start happening" it's a pretty
clear sign of some deeper issues and likely a lot of underlying bad dev
processes.

This issue might be as "simple" as a memory leak that will suddenly require
reboots every N minutes when a seemingly unrelated patch exacerbates an issue.

~~~
im_down_w_otp
Devices in systems like this are full of monotonically increasing sequence
numbers used for all manner of coordination and diagnostic functions. In this
case it appears to be a way to ensure some recency constraint on critical
data. This is an extremely common method of attempting to assess/identify
staleness of critical data (i.e. "Is the sequence number I'm looking at before
or after the last one I saw, and by how much?") in critical real-time systems.

Probably this is a counter that rolls over if it's not reset. The
predictability of needing to reset it before time T is an indicator that it's
a sequence number driven by a hard real-time trigger with an extremely
predictable cadence.
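
A minimal sketch of the staleness check described above, assuming a 32-bit counter and a made-up `max_gap` threshold. The names `seq_delta`, `is_fresh` and `naive_is_fresh` are hypothetical and say nothing about how the 787 actually does it. The point is only that a plain comparison breaks at rollover, while a modular signed difference does not.

```python
MOD = 1 << 32    # assumed counter width: 32 bits
HALF = MOD >> 1

def seq_delta(newer: int, older: int) -> int:
    """Signed distance from `older` to `newer`, computed modulo 2**32."""
    d = (newer - older) % MOD
    return d - MOD if d >= HALF else d

def is_fresh(seq: int, last_seen: int, max_gap: int) -> bool:
    """Accept a sample only if it is ahead of last_seen by at most max_gap."""
    return 0 < seq_delta(seq, last_seen) <= max_gap

def naive_is_fresh(seq: int, last_seen: int, max_gap: int) -> bool:
    """Plain integer comparison that ignores rollover."""
    return last_seen < seq <= last_seen + max_gap

last_seen = MOD - 3   # counter is about to roll over
seq = 2               # five ticks later, after the rollover

wrap_aware = is_fresh(seq, last_seen, max_gap=10)      # True
naive = naive_is_fresh(seq, last_seen, max_gap=10)     # False: fresh data looks stale
print(wrap_aware, naive)
```

The modular version only works while the real gap stays below half the counter range, which is one reason a system like this might still need a periodic reset.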

~~~
panda88888
I think you are right. 50 days is 4.32e9 ms, which is just a bit under max
value of unsigned 32-bit int.

~~~
munk-a
It's actually a bit _over_ the max value[1] - I agree though that I'd strongly
suspect this issue is related to overflowing a millisecond counter stored in a
32-bit int. The numbers are way too close.

Hey, maybe <51 was just an off-by-one error... or maybe the actual advisory is
to be <50 and some PM decided that number was too round or violated an SLA.

1\. 4,294,967,296 or 4.29e9
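
For anyone who wants to check the arithmetic, here is a quick sketch. It assumes the counter is an unsigned 32-bit millisecond count, which is a guess from this thread and not something the advisory states.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000   # 86,400,000
UINT32_RANGE = 2 ** 32             # 4,294,967,296 distinct values

elapsed_ms = 50 * MS_PER_DAY                # 4,320,000,000
wrapped = elapsed_ms % UINT32_RANGE         # what a uint32 counter would read
days_to_wrap = UINT32_RANGE / MS_PER_DAY    # about 49.71

print(f"50 days = {elapsed_ms:,} ms")
print(f"a 32-bit counter would read {wrapped:,} ms ({wrapped / 3_600_000:.2f} h)")
print(f"rollover happens after {days_to_wrap:.2f} days")
```

So at the 50-day mark such a counter would already have wrapped and would read as if only about seven hours had passed.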

~~~
3pt14159
But with engineering we almost always have safety factors. I'd say it's
probably a 64-bit int, but that would be way too much of a safety factor.

~~~
pdpi
Safety factors are a thing, sure. Safety factors of 4.29e9x (which is what you
get when you go from 32- to 64-bit ints) are possibly a bit excessive, and not
at all worthy of an FAA airworthiness directive.

~~~
goblin89
My biggest surprise today is from learning that critical aircraft software is
left running for days without a full restart. Somehow I assumed everything
gets completely shut down every time they refuel or so.

------
renewiltord
Classic Related Stories:

* Patriot Missile launchers must be rebooted periodically to account for accumulated errors from truncation: [http://www-users.math.umn.edu/~arnold//disasters/patriot.htm...](http://www-users.math.umn.edu/~arnold//disasters/patriot.html)

* Cruise missiles with memory leaks that do not matter because they'll reach the target before they allocate all memory: [https://groups.google.com/forum/message/raw?msg=comp.lang.ad...](https://groups.google.com/forum/message/raw?msg=comp.lang.ada/E9bNCvDQ12k/1tezW24ZxdAJ)

------
choeger
I have heard this anecdote before. In case you haven't: 51 days is awfully
close to 2^32 ms ...

~~~
dan_quixote
Just for good measure: (2^32)/(1000 * 3600 * 24) = 49.71 days

~~~
sedatk
So the correlation is meaningless here, as the reboot time is after the
overflow.

~~~
CyanBird
It could very easily be that some manager decided to add an extra day to the
schedule "just in case" as in _close enough_

------
mortehu
Safe use of Microsoft Windows also requires rebooting on a slightly shorter
schedule, because GetTickCount will overflow. In particular if you're running
a real-time simulation which is likely to use delta time as a critical
parameter, and you can't audit the code or know for a fact that it uses
GetTickCount.
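
A rough illustration of why delta time is the risky part. This is a Python simulation of a 32-bit millisecond tick value, not a call to the real Win32 API, and `elapsed_ms` is a hypothetical helper. In C the safe form falls out of subtracting two unsigned 32-bit values.

```python
MASK = 0xFFFFFFFF   # a GetTickCount-style value is a 32-bit millisecond count

def elapsed_ms(now: int, start: int) -> int:
    """Wrap-safe elapsed time, mimicking unsigned 32-bit subtraction."""
    return (now - start) & MASK

start = MASK - 99             # 100 ms before the counter wraps to zero
now = (start + 250) & MASK    # 250 ms later, after the wrap (reads 150)

safe_delta = elapsed_ms(now, start)   # 250
naive_delta = now - start             # large negative number

print(safe_delta, naive_delta)
```

Code that compares raw tick values (for example `now >= deadline`) or subtracts them in a wider signed type hits the second case. On Windows, `GetTickCount64` sidesteps the problem by returning a 64-bit count.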

~~~
nomel
I've often wondered how people came to the conclusion that "Windows is
unstable, you can't leave it on for more than a couple months!" due to this.

At one company I worked for, all of our National Instruments test equipment
would start to fail with communication problems after about two months on our
Windows XP computers. Being familiar with GetTickCount, I rebooted the
computers, recorded the date, verified the next failure was 49 days later,
then emailed National Instruments with a link to the GetTickCount
documentation. They pushed out an update with a fix 3 days later. Oops.

------
S_A_P
One thing I'm curious about is what is classified as a "reboot" of the plane.
If it is parked overnight, are all the systems shut down and restarted the
next day when it is put back in service? Does it sit "running" in some sleep
mode? Last time I checked, a plane cannot (practically) stay airborne for 51
days. Is the reboot a pain-in-the-ass 5 day procedure? There are too many
unknowns to sound alarm bells.

------
yingw787
Do safety-critical systems have memory burned as ROM, instead of having
dynamic memory allocation? From my point of view, the plane doesn't really
change, so the avionics suite shouldn't require changing either. You build a
physics model of the plane, translate it into memory, then bake it in, and
dynamic allocation is only needed for when you need inputs. Or is this
dangerous because the physics model does change significantly for different
loadouts?

I'm not an aerospace or embedded engineer.

------
cgb223
I wish some upstart would try to create a plane-making startup to crush Boeing
at their inefficient, bloated game.

We need an Elon Musk of flight. Boeing has gotten away with too much
mediocrity.

~~~
VBprogrammer
Are we talking about the same Elon Musk whose car company created an autopilot
which has been known to crash into the side of trucks? Have I missed the
obvious sarcasm?

------
raister
Getting late to the discussion, but people have been tackling this for a long
time in software engineering. It's called Software Rejuvenation, with models
of repairing systems, Markovian assumptions, applications in the JVM, etc.
Interesting topic. It was used to analyse Patriot missiles, which needed the
same approach to replenish their internal variables each time.

------
niffydroid
I think the Airbus A350 needs a reboot as well

~~~
briandear
Every 149 hours..[1]

However, since Airbus != Boeing, nobody around here cares. Only stories that
are pointing out problems with Boeing are allowed (or upvoted) apparently.

[1]
[https://www.theregister.co.uk/2019/07/25/a350_power_cycle_so...](https://www.theregister.co.uk/2019/07/25/a350_power_cycle_software_bug_149_hours/)

~~~
tolien
Apart from when that exact article was posted and got 65 points and 70
comments? [0] You even commented in it.

0:
[https://news.ycombinator.com/item?id=20524391](https://news.ycombinator.com/item?id=20524391)

------
shakit
I am guessing the manufacturer doesn't have a budget to fix this. They are too
busy sorting out the 737 pitch controls. I am guessing they need a bigger
buffer that would clear itself out, a good GPS and timestamp database, and a
clear button on the console to wipe the historic altitude and speed data. The
historic data can go to the black box and the new data can be stored in the
buffer. A sensor should only look at data from the past week and not calculate
stuff from 49 days in the past. What use would pilots have for it other than
service and maintenance? What was the OS written in, Objective-C?

------
jdblair
Overflow is not the only kind of bug triggered by uptime. In February 1991 a
Patriot missile failed to intercept an incoming Scud due to an accumulation
of time-based errors. The missile system had been online for 100 hours and
this resulted in enough error that the intercept calculation was incorrect.
People died.

[http://www-users.math.umn.edu/~arnold//disasters/patriot.htm...](http://www-users.math.umn.edu/~arnold//disasters/patriot.html)
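
A back-of-the-envelope sketch of how that error builds up. It assumes the commonly cited account of the bug: a clock counting tenths of a second, converted to seconds by multiplying with 1/10 stored in binary fixed point and chopped to 23 fractional bits. Those details come from that account, not from this comment, and the variable names are made up.

```python
from fractions import Fraction

FRACTION_BITS = 23   # assumption: fractional bits kept when storing 1/10
SCALE = 2 ** FRACTION_BITS

# 1/10 has no finite binary expansion, so the stored value is chopped short.
stored_tenth = Fraction(int(Fraction(1, 10) * SCALE), SCALE)
error_per_tick = Fraction(1, 10) - stored_tenth   # seconds lost per 0.1 s tick

ticks = 100 * 3600 * 10            # 100 hours of uptime, ten ticks per second
drift_seconds = float(error_per_tick * ticks)

print(f"error per tick: {float(error_per_tick):.3e} s")
print(f"clock drift after 100 h: {drift_seconds:.3f} s")
```

About a third of a second does not sound like much, but it is a long time when tracking a target as fast as a ballistic missile.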

~~~
benibela
I wonder if there were ever uptime issues caused by heap fragmentation.

~~~
jdblair
This happens on set top boxes, especially when the graphics memory heap is
allocated separately from the system memory heap. The graphics memory heap can
be fragmented and surfaces stop being rendered because there are no contiguous
memory blocks large enough. Having two heaps on a low memory device leads to
unfortunate compromises.
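
A toy model of that failure mode, in case it helps. The "heap" is just a Python list of 16 cells and the allocation pattern is invented; real allocators are far more involved. It only shows that total free memory can be plentiful while the largest contiguous block is tiny.

```python
def largest_free_run(heap):
    """Length of the longest run of free (None) cells."""
    best = run = 0
    for cell in heap:
        run = run + 1 if cell is None else 0
        best = max(best, run)
    return best

heap = [None] * 16

# Fill the heap with eight 2-cell surfaces, tagged 0..7.
for block in range(8):
    heap[block * 2:(block + 1) * 2] = [block] * 2

# Free every other surface, leaving 2-cell holes between live ones.
for block in range(0, 8, 2):
    heap = [None if cell == block else cell for cell in heap]

total_free = heap.count(None)        # 8 cells free in total
largest = largest_free_run(heap)     # but only 2 of them are contiguous

can_fit_4_cell_surface = largest >= 4
print(total_free, largest, can_fit_4_cell_surface)
```

Half the heap is free, yet a 4-cell surface cannot be placed anywhere. With enough uptime and churn, a device that never compacts its heap can drift into this state.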

------
vincnetas
I'm a bit ashamed of this, but I guess I'm not the only one. At work we have a
system which started crashing, and we could not figure out why. It runs
normally, but restarts after some time and then continues to function
properly. So what did we do? We ran multiple instances behind a proxy and let
the instances crash. The cluster as a whole functions perfectly even when
parts of it are restarting because of an unknown error that we have no
capacity to identify and fix.

------
shakit
I am guessing they need a bigger buffer that would clear itself out, and some
good GPS and timestamp database integrations to clear the altitude and speed
data. A sensor should only look at data from the past week and not calculate
from 49 days in the past. What use would a pilot have for it other than
service and maintenance?

------
cm2187
Is it because it runs on Windows 95?

~~~
tpmx
It was 49.7 days for Windows 95:

[https://sites.google.com/site/edmarkovich2/whywindows95andwi...](https://sites.google.com/site/edmarkovich2/whywindows95andwindows98wouldcrashafter49.7daysofuptime)

Still, it's remarkable that two separate Seattle-based companies have produced
a similarly short time bomb on very expensive and highly visible product
development projects.

~~~
JJMcJ
This wasn't noticed for a few years after Win95 was released.

The joke was that nobody had ever had a Win95 system stay up for 49 days.
Mwah-hah-hah.

------
superjan
I honestly think the safest solution is that an aircraft should refuse to take
off after two weeks until you reboot it. Instead, Boeing and Airbus leave it
to customers to test if the plane still flies after six months.

------
lovecg
Is it common to leave the electronics running for that long anyway? My naive
understanding is that it would be rebooted after every flight anyway.

~~~
detaro
Ideally a plane is spending as little time as possible not doing anything.
It's on the ground for as short as possible and there's ideally always
something happening that needs monitoring or communication. Restarting a bunch
of low-level systems just because doesn't fit into that, so apparently a 51
day span without powering it off wouldn't be unheard of.

------
sanguy
And to think of the billions given to Boeing to bail it out while the
management team who got it into this state got golden parachutes?

If the government deems that Boeing must be saved, it should also deem that
the prior management was negligent and the cause of this situation, seize
their personal assets, and hold them criminally accountable.

------
snickerbockers
The article doesn't mention this, but 51 days is approximately 2^32
milliseconds...

~~~
wolfgke
2^32 ms is about 49.71 days ( (2^32)/(1000 * 3600 * 24) ), so _less_ than the
reboot cycle of 51 days.

~~~
SAI_Peregrinus
I mentioned this in another thread, but 2^32 * 1024us is 50.9 days. So it's
probably a systick at 1.024ms overflowing a uint32_t. If you've got a 1us
timer it's a lot cleaner for the CPU to make the tick happen at 1024us than at
1000.
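
The two candidate tick periods side by side, as a quick sketch. The 1024 us period is a guess from this comment, not something documented for the 787, and `days_until_wrap` is just a helper name.

```python
SECONDS_PER_DAY = 86_400
TICKS = 2 ** 32   # a uint32_t tick counter wraps after this many ticks

def days_until_wrap(tick_us: int) -> float:
    """Days until a 32-bit tick counter wraps, for a tick period in microseconds."""
    return TICKS * tick_us / 1_000_000 / SECONDS_PER_DAY

days_1000 = days_until_wrap(1000)   # about 49.71 days
days_1024 = days_until_wrap(1024)   # about 50.90 days

print(f"1000 us tick: {days_1000:.2f} days")
print(f"1024 us tick: {days_1024:.2f} days")
```

Since 1024 is 2^10, turning a microsecond count into ticks is a 10-bit right shift and needs no division, which fits the "cleaner for the CPU" argument.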

------
_trampeltier
At least better than 2016. Then it was just 22 days before the plane stopped
working.

------
101404
> 2 Apr 2020

Not published April 1.

~~~
RegBarclay
I scrolled back up to the top of the page to check before I finished reading
it.

