
Some models of Airbus A350 airliners need to be hard-rebooted after 149 hours - known
https://www.theregister.co.uk/2019/07/25/a350_power_cycle_software_bug_149_hours/
======
BrentOzar
I think the comments here say more about us as an audience than they do about
the Airbus. We're totally accustomed to hardware & software problems whose
solution is, "just restart it," and we don't even find it all that disturbing
- mostly just humorous.

Our standards as developers & testers have gotten pretty low.

I think the only way you'd see outrage here at HN is if the restart involved a
real physical crash (as opposed to a software one) and the loss of human
lives. Otherwise, we're all, "meh."

~~~
AuthorizedCust
That's not necessarily a "low" standard. It's that the value of preventing a
reboot every few days is much lower than other things we could spend our time
on. Mandatory software perfection would be a huge drag on valuable new
features.

~~~
pcurve
I agree with you to a certain degree, but the appropriateness of the trade-off
should really depend on where the software is used. Airplanes, power plants, and
train signaling systems? I would want longer uptime.

------
danso
> _The remedy for the A350-941 problem is straightforward according to the AD:
> install Airbus software updates for a permanent cure, or switch the
> aeroplane off and on again._

The subhed and the above graf say it's as straightforward to fix as installing
Airbus software patches. But I assume this process is not as streamlined or
convenient as iOS or Tesla OTA auto-updates. Anyone have insight into what
installing patches on an airliner entails? I'm assuming it involves more
downtime than powering down and restarting, if that's the current status quo.

~~~
dijit
The problem isn’t the update itself, which is actually straightforward. It’s
the fact that as soon as you modify the software you must do a C check on the
craft.

Which is a horribly long process, around 6,000 man-hours and puts the aircraft
out of commission for a few weeks.

~~~
PunchTornado
Good. I don't want someone to update the software and the next day send it on a
commercial flight.

~~~
londons_explore
I do.

As long as the hardware has been tested, and the software update tested on
different hardware, then as long as the test hardware and my hardware are
nominally the same, and as long as the software has basic "self test the
basics of every component on startup", then I don't see a reason to do more
tests.

------
mangatmodi
It is a very well-known issue with every plane. Sometimes there are no
solutions to a problem. You need these hacky solutions. The title is clearly
catchy with everything that is going on with Boeing.

But the point is, this reboot process is very well managed and known. So I
won't call it scary.

~~~
viraptor
I agree it's not scary and it's a good, known workaround. But it's software -
we shouldn't say "there are no solutions". The solution is: fix this problem
and add long runtime testing to the qa process. Especially if this is a known
issue in other planes.

~~~
mangatmodi
When dealing with physical systems it is impossible to have no bugs. To give
an extreme example, there is a _not_ small chance that cosmic rays can
change a bit in a system's memory -
[https://stackoverflow.com/questions/2580933/cosmic-rays-what...](https://stackoverflow.com/questions/2580933/cosmic-rays-what-is-the-probability-they-will-affect-a-program)

The issue here is overflow due to time. The time is saved in a variable (I don't
know how many bits), which overflows after the given period. Now there are two
options:

1\. Upgrade circuits of every plane. These planes were designed/built a long
time back. Bigger registers were not practical due to costs.

2\. Document it and have a process for it.
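
For anyone who wants to see the failure mode concretely, here is a minimal sketch of a fixed-width millisecond counter wrapping around. The 29-bit width and the names `tick`, `naive_elapsed` and `wrap_safe_elapsed` are assumptions for illustration only; nothing here is taken from the actual avionics code, and the real counter width is not known.

```python
# Illustrative only: the counter width is an assumption, not a known fact
# about the A350 software.
COUNTER_BITS = 29
MASK = (1 << COUNTER_BITS) - 1


def tick(counter_ms: int, delta_ms: int) -> int:
    """Advance a fixed-width millisecond counter, wrapping on overflow."""
    return (counter_ms + delta_ms) & MASK


def naive_elapsed(start_ms: int, now_ms: int) -> int:
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now_ms - start_ms


def wrap_safe_elapsed(start_ms: int, now_ms: int) -> int:
    """Modular subtraction: correct while the true interval is shorter
    than one full wrap period."""
    return (now_ms - start_ms) & MASK


start = MASK - 500          # 500 ms before the counter wraps
now = tick(start, 2000)     # 2 seconds later, the counter has wrapped

print(naive_elapsed(start, now))      # negative: -536868912
print(wrap_safe_elapsed(start, now))  # 2000
```

Code that compares timestamps with plain subtraction breaks at the wrap, while modular subtraction keeps working for intervals shorter than the wrap period. Whether the real fault looks anything like this is a guess.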

~~~
pbhjpbhj
Environmentally induced errors aren't software bugs, just because there's a
problem elsewhere doesn't mean we shouldn't seek to mitigate other problems.

In plane investigations I've looked at (not many) the issue has always been a
compounding of several errors or shortcomings .. that strongly suggests you
shouldn't let small errors build up in different systems, to me. [1]

If it's a register which takes down the whole system then surely they'd know
that (and could fix it with a watchdog that returned the affected systems to
the boot state without reboot) -- other comments seem to be saying "meh, it's
complex, doesn't matter what the error is as long as reboot fixes it"; that
seems really dangerous in safety critical systems.

[1] but I acknowledge the "better the devil you know" issue and that
pragmatism and cost take over at some point.

------
__michaelg
Normal :) Same for the Boeing 787 back in 2015:
[https://news.ycombinator.com/item?id=17907654](https://news.ycombinator.com/item?id=17907654)

~~~
ceejayoz
Patriot missile batteries, too.
[https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dha...](https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dhahran)

------
fluffything
Software has a bug. Patch with a fix is released. Some users don't want to
update for reasons and demand a workaround. The workaround sucks.

How is this news?

~~~
sokoloff
The workaround doesn’t even suck, IMO.

~~~
fluffything
It doesn't suck unless you forget to turn it off and back on.

------
haltingproblem
Time- ~Late 1990s.

Place- Small financial institution in the NE.

Having just finished two weeks on the job, one Friday evening before heading
out I decided to reboot my Sun Sparcstation. I of course did not have root but
there was L1-A which put you into the BIOS. Then: sync;sync;reboot

Workstation starts rebooting.

30 seconds later the sysadmin is standing over my shoulder.

    
    
      SA - "What did you do?"
      me - "I rebooted it"
      SA - Incredulous. Like I just set my hair on fire. "Why?"
      me - "It's been two weeks, you know, defrag memory, free up the page tables" 
      (some vague pseudo CS BS)
      SA - "This is a Solaris X.Y/SunOS Q.Y machine.
      It had an uptime of 180+days. 
      I have machines here with an uptime of 2+ years."
      me - "Really?"
      SA - "These machines do not need a reboot. Ever. Please do not do this again."
    

I arrived Monday to find that L1-A had been disabled on my machine.

How far we have (pro|re)gressed in 20 years! ;) (edits - typo)

------
IronBacon
Without reading the article, it reminds me of the famous software glitch in the
Patriot defense system: to work around a rounding-error bug, the system had to
be rebooted before it had been running for very many hours...
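
A rough sketch of how that kind of rounding error accumulates, assuming the commonly described setup: time counted in tenths of a second and 0.1 stored as a truncated binary fraction. The 23 fractional bits are an assumption chosen because they reproduce the widely quoted per-tick error, and the function names are made up for this example.

```python
from fractions import Fraction

FRACTION_BITS = 23       # assumption: matches the commonly cited per-tick error
TICKS_PER_SECOND = 10    # clock counts tenths of a second


def truncated_tenth(bits: int) -> Fraction:
    """0.1 chopped (not rounded) to the given number of binary fraction bits."""
    scale = 1 << bits
    return Fraction(scale // 10, scale)


def drift_seconds(uptime_hours: int, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after running for uptime_hours."""
    per_tick_error = Fraction(1, 10) - truncated_tenth(bits)
    ticks = uptime_hours * 3600 * TICKS_PER_SECOND
    return float(per_tick_error * ticks)


for hours in (8, 24, 100):
    print(hours, round(drift_seconds(hours), 4))
```

Under these assumptions the drift after 100 hours comes out at roughly a third of a second, and a reboot resets it to zero, which is why a periodic restart works as a stopgap.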

------
lawlessone
I kind of feel like these things should be reset regularly to be safe anyway.

This was a bug that was known, what if there are others?

~~~
NullPrefix
Crash early, crash hard.

~~~
syntheticnature
Not what you want in aviation, per se.

------
hoseja
That is 2^29 milliseconds. I dread to imagine the reasons leading to 29-bit
millisecond time in an airplane.
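
The arithmetic is easy to check. This is only a sketch of the numerical coincidence and does not confirm the root cause; `wrap_period_hours` is a name made up for the example.

```python
MS_PER_HOUR = 60 * 60 * 1000


def wrap_period_hours(bits: int) -> float:
    """Hours until an unsigned millisecond counter of this width wraps."""
    return (1 << bits) / MS_PER_HOUR


for bits in (28, 29, 30, 32):
    print(bits, round(wrap_period_hours(bits), 2))
```

An unsigned counter that wraps at 2^29 ms runs for about 149.13 hours, which lines up with the 149-hour limit in the directive. For comparison, a full 32-bit millisecond counter lasts about 49.7 days.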

~~~
pcl
Presumably someone is using the high three bits for some other purpose. It
wouldn't surprise me if parts of an A350's avionics software are old enough
that those sorts of space optimizations made sense.

~~~
datenwolf
It could also be a tagged value. These are quite common in code created by
"high reliability" languages (Ada, OCaml). The high bits are used to carry
some metadata.

E.g. you might want to set a certain bit if the value went outside the defined
input range of a function (think for example far from the accepted numerical
window of a Taylor series expansion). Instead of dealing with error conditions
at each and every step (thereby making the timing properties of the code
unpredictable) you just collect all error conditions and at the end decide if
you discard the value, or just use it partially (e.g. it might still be "good
enough" to be used as a parametrizing input of an adaptive filter, where it
averages out with the rest).
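
A minimal sketch of the idea, with a made-up layout: a 32-bit word whose low 29 bits hold the value and whose high bits carry status tags. The layout, the flag name and the helper functions are all hypothetical, and Python gives no timing guarantees, so this only illustrates the data flow.

```python
VALUE_BITS = 29                       # hypothetical layout, not from real avionics code
VALUE_MASK = (1 << VALUE_BITS) - 1
FLAG_OUT_OF_RANGE = 1 << VALUE_BITS   # first tag bit above the payload


def pack(value: int, flags: int = 0) -> int:
    """Combine a payload and its tag bits into one word."""
    return (value & VALUE_MASK) | flags


def value_of(word: int) -> int:
    return word & VALUE_MASK


def flags_of(word: int) -> int:
    return word & ~VALUE_MASK


def scale_checked(word: int, factor: int, limit: int) -> int:
    """Multiply the payload; record a range violation in the tag bits
    instead of raising, and carry along tags from earlier steps."""
    result = value_of(word) * factor
    flags = flags_of(word)
    flags |= FLAG_OUT_OF_RANGE if result > limit else 0
    return pack(result, flags)


w = pack(1000)
w = scale_checked(w, 3, limit=2500)    # 3000 > 2500, so the tag is set
w = scale_checked(w, 2, limit=10000)   # 6000 is in range, tag is still carried
usable = flags_of(w) == 0              # decide once, at the end
print(value_of(w), usable)
```

The error check happens once at the end of the pipeline rather than at every step, at the cost of shrinking the usable range of the value itself.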

------
syntheticnature
Interestingly, it looks like the A350 software probably was developed under
DO-178B[1] rather than DO-178C[2] given the timing of release. This may seem
minor, but Wikipedia's comparison indicates 178B was looser. Of course, these
specs just consider requirements, not so much advice on implementation.

[1][https://en.wikipedia.org/wiki/DO-178B](https://en.wikipedia.org/wiki/DO-178B)
[2][https://en.wikipedia.org/wiki/DO-178C](https://en.wikipedia.org/wiki/DO-178C)

------
mpweiher
So this is already fixed:

"The remedy for the A350-941 problem is straightforward according to the AD:
install Airbus software updates for a permanent cure ..."

And it has the known workaround.

So this has almost nothing to do with Airbus at this point, the directive and
the "sighs" uttered by the EU aviation agency are directed at the airlines
that won't install the update.

But good distraction from Boeing's woes as long as you only read the
headline...

~~~
briandear
> So this has almost nothing to do with Airbus at this point, the directive
> and the "sighs" uttered by the EU aviation agency are directed at the
> airlines that won't install the update.

Why won’t EASA ground the un-updated airplanes?

~~~
syntheticnature
As mentioned else-thread, the plane requires a certain level of inspection
after the software update, which merely takes a few weeks and lots of human
effort.

~~~
AnimalMuppet
I find this interesting to compare to the 737 Max. Here the reaction is "just
reboot because the inspection after installing the fix takes too long". But
with the 737 Max, the reaction is "Create MCAS to avoid the cost of having to
retrain pilots? How could you be so stupid?"

I know, the A350 bug hasn't killed anyone (yet). But I see the parallels in
the issue, and yet the reactions here are completely opposite.

~~~
syntheticnature
Part of it, I'm sure, is that the A350 bug is comprehensively root-caused and
that root cause is understood to be completely bounded by rebooting the
system, whereas MCAS reduces the number of points of failure to, in certain
builds of the 737 MAX, a single sensor.

------
dahartigan
It's a scary thought that such issues exist, regardless of how common or not.

~~~
raxxorrax
I think planes are already scrutinized to a very high degree. What I am more
concerned about is airlines doing the reboot in flight to save time. Planes
are often on a very tight schedule (maybe not cargo planes).

~~~
JorgeGT
Don't worry, the AD (airworthiness directive) specifically calls for complete,
ground power cycles. These cannot be done in flight.
[https://ad.easa.europa.eu/blob/EASA_AD_2017_0129_R1.pdf/AD_2...](https://ad.easa.europa.eu/blob/EASA_AD_2017_0129_R1.pdf/AD_2017-0129R1_1)

------
JustSomeNobody
Occasionally rebooting a system is actually a good thing.

“Hello, IT, have you tried turning it off and on!?”

~~~
NKosmatos
The opening line from the IT Crowd [0], one of the cult British sitcoms and a
favorite of mine. Highly recommended :-)

[0]
[https://en.wikipedia.org/wiki/The_IT_Crowd](https://en.wikipedia.org/wiki/The_IT_Crowd)

------
_pmf_
And? Does it come as a surprise that planes require regular maintenance?

------
StreamBright
Software engineering needs some significant improvements if we would like to
keep using critical systems like aircraft. It is getting to the point where
serious software problems are impacting everyday life, not in a good sense.

~~~
greatpatton
Airbus is already using formal methods for creation and validation of the code,
which are the ultimate version of testing.

~~~
StreamBright
Whatever method prevents buffer overflows.

AFAIK that is what happens here.

"The CPIOM is effectively a mini computer; in the A350 CPIOMs run discrete
avionics "applications", in the sense of apps. CRDCs themselves do not host or
run applications, suggesting that the failure condition detailed in the EASA
AD may mean loss of a particular app on a CPIOM after a buffer overflow."

~~~
AnimalMuppet
s/AFAIK/I'm guessing/

Or do you have any objective basis for suspecting that _this_ is a buffer
overflow?

