Some models of Airbus A350 airliners need to be hard-rebooted after 149 hours (theregister.co.uk)
65 points by known 25 days ago | 70 comments

I think the comments here say more about us as an audience than they do about the Airbus. We're totally accustomed to hardware & software problems whose solution is, "just restart it," and we don't even find it all that disturbing - mostly just humorous.

Our standards as developers & testers have gotten pretty low.

I think the only way you'd see outrage here at HN is if the restart involved a real physical crash (as opposed to a software one) and the loss of human lives. Otherwise, we're all, "meh."

The thing here is that the issue is well understood and the root cause is known. We know that rebooting prevents the error.

The problem is when "have you tried rebooting" becomes the first troubleshooting step and you stop investigating.

But if you realise after the investigation that a reboot reliably prevents the issue, it's a worthwhile approach, especially if fixing the error would require many complicated changes that may lead to further errors.

That's not necessarily a "low" standard. It's that the value of preventing a reboot every few days is much lower than other things we could spend our time on. Mandatory software perfection would be a huge drag on valuable new features.

I agree with you to a certain degree but appropriateness of the trade off should really depend on where the software is used. Airplane, powerplants, and train signaling systems? I would want longer uptime.

Especially when you compare that to all the other maintenance that has to happen routinely on a plane for it to remain safe. Not a reason to not fix it though.

> Our standards as developers & testers have gotten pretty low.

It's just an unsolved problem. No one ever knew how to write complex robust software. Even with memory-safe languages. Even with advanced type systems, generative tests, immutability and advanced concurrency primitives.

And it's not a discipline problem.

Or maybe it is a testament to how high our standards are that the expectation is that we should essentially never have to restart things.

I can't say that my car has ever gone 149 hours without a "reboot" of some kind or another.

Mine sure has, and just about every car these days does. The ECU doesn't lose power when you turn off the car; the only way to "reboot" your car is to disconnect the battery.

It is an issue if they have to reboot it twice.

> I think the only way you'd see outrage here at HN is if the restart involved a real physical crash

Or if it were an American company. There wasn’t any outrage over Germanwings’s lack of attention to their suicidal pilot’s mental health, even though there were significant warning signs before he crashed a plane into the Alps. Little outrage over Malaysian Air despite their pilot practicing his suicidal crash in the simulator months before. But for an American company, out come the pitchforks.

The fact that Airbus hasn’t yet had a 350 crash is just happy luck, but the issues are still just as outrage worthy. Where is the rage against the EASA that allowed such a glitch to pass certification? Where are the complaints against EASA outsourcing certification tasks[1] as the FAA was criticized about?

If this were an American company, you’d see plenty of outrage here and it would be related to some flavor of the United States not being leftist enough (i.e. having lower regulatory strictness, lower taxes and thus an “undersized” bureaucracy). There is an undeniable and distinct anti-American sentiment on HN when it comes to issues like this. On everything from climate, economic policies, consumer issues, and defense to even the food and housing choices of Americans, the US always seems to be held in a persistent, low-level contempt, while almost anything European or Japanese gets a positive initial reaction. This might be a controversial opinion around here, but it’s my perception after being around the HN community for almost 7 years and watching it change from a place that celebrates entrepreneurship, business and capitalism to little more than a better moderated subreddit for those that view American startups and technology with derision and contempt. Actual Marxist ideas get upvoted and free market ideas get downvoted. Which is ironic because Marxism didn’t create anything that keeps many of us highly compensated while working on cool technology. This forum itself wouldn’t even exist were it not for a venture capital firm. Most of the tech stack we all use comes from entrepreneurship and innovation from the free market.

Since Airbus is essentially a state-owned enterprise in the practical sense, of course it isn’t going to result in any outrage — or at least not more than one of those dirty, evil, profit-oriented American corporations.

[1] https://etendering.ted.europa.eu//cft/cft-display.html?cftId...

Both of the examples you gave were of mental health issues, which we're shitty at handling as a society. I've seen multiple posts on reddit about people "glamorizing" mental health problems, when really they're just more comfortable openly talking about it. People still just don't want to hear it. We don't see these problems as real issues to be dealt with as a society. There's still widespread belief that these are personal failings. People need to shut up about it and get on with their life. Otherwise, we need to identify these people so we can avoid them.

Your claim that American companies are getting the shaft is complete nonsense. It's classic US conservative insanity. It is not a shock, nor is it anti-American bias, that conscious engineering decisions which result in loss of life get more scrutiny and outrage. There was plenty of outrage for the VW emissions scandal. Here, not only have there not been any deaths, but an update's been provided. If you seriously think there will be no outrage for a European engineering disaster here, you're crazy.

I even dispute the notion that there is anything wrong with being anti-American. It's not a personal insult unless one chooses to take it as such. I am American by birth, have lived abroad, and find the way things are done in the US to be abhorrent on many fronts. It seems to me anything done well elsewhere is rejected out of some weird sense of independence. America's profit-driven healthcare business ruined my life and is still making it miserable. The only decent care I have ever received was in a European country. There is a default idea on the internet that America is the world. There is a default idea in much of America that conservatism is "true America". There is a default idea that success as a human is "being rich" and "winning". My family has this belief system and anytime those beliefs or the country are criticized, even if it's one amongst 100 comments, it's focused on as proof that everyone is biased against them and they are "victimized" by this. I find the incredibly common idea in the USA, that one should be less critical of the nation because "this is the best place for us to get rich", to be the opposite of what I consider important values, and in fact antithetical to the hacker ethos. The hackers I admire don't do so to enrich themselves...they do it to fix problems and improve the world, even if only for one other person.

The only thing I'll say is that I get annoyed by certain knee jerk anti-American, pro foreign sentiments. It can be a bit much. Especially when other countries are wrestling with the same issues, and may even be worse in some regards.

Otherwise, I agree. Specifically, the criticism of our health care system. How people can see no issue with it is beyond me. There's conservatives that literally think the problem is us treating people in the ER, instead of letting them die. Disgusting.

OP is completely off base on this issue. The examples he gave are not an apples to apples comparison. The idea that we wouldn't criticize an EU company for making boneheaded business/engineering decisions that cost lives is ludicrous.

I just don't think nationalism is anything but damaging. I am a human being. My self worth isn't related to where I was randomly born. I have no allegiance or pride owed to a passport. Such tribalism is used to justify wars and greed and lots of bad stuff. Even in posts like these the conversation at its root is "offense" at criticism of "team". Poor behavior is poor behavior regardless of who is doing it, but some teams think theirs is more justified. The team that I think matters is humanity. Nationality is an administrative bother to me.


Fact check: Harsh but mostly true. This used to be the forum for the rationalist and capitalist community in tech, or at least people that tolerated them or were willing to engage without downvotes. Now it attracts those with such distaste for rationalist viewpoints to the point pg himself is altering his essays[1]. His original point was much clearer and more memorable than his revision. Are there any alternatives to this site that look more like the news.ycombinator of 5+ years ago? Maybe a discord or subreddit or telegram group?

[1] http://www.paulgraham.com/ineq.html

EDIT: the replies to your post only reinforce your point, you are being called a kook (“completely insane”) for your post, which is antithetical to civil discussion.

> The remedy for the A350-941 problem is straightforward according to the AD: install Airbus software updates for a permanent cure, or switch the aeroplane off and on again.

The subhed and the above graf say it's as straightforward to fix as installing Airbus software patches. But I assume this process is not as streamlined or convenient as iOS or Tesla OTA auto-updates. Anyone have insight to what installing patches on an airliner entails? I'm assuming it involves more downtime than powering down and restarting, if that's the current status quo.

The problem isn’t the update itself, which is actually straightforward. It’s the fact that as soon as you modify the software you must do a c check on the craft.

Which is a horribly long process, around 6,000 man-hours and puts the aircraft out of commission for a few weeks.

That amount of friction is...strangely reassuring. Are there criteria that define what kind of software update triggers this process? Or is it required with any update, no matter how small? Though if it's the latter, I'm guessing that incentivizes manufacturers to never send out small or trivial updates.

good. I don't want someone to update the software and next day send it in a commercial flight.

I do.

As long as the hardware has been tested, and the software update tested on different hardware, then as long as the test hardware and my hardware are nominally the same, and as long as the software has basic "self test the basics of every component on startup", then I don't see a reason to do more tests.

Sounds like a reboot is a better option then.

Sounds better than silicon valley's "move fast and break things", which would be more like "...and kill people" here.

To my knowledge all upgradeable software is treated exactly like a new part. Upgrade must be certified, and is visible in part databases with part number. There must be a chain of trust. If it's a field upgradeable part, it's delivered in flash drives. If it's something more fundamental, they take the aviation computer out and update the EEPROM.

Certifying an upgrade is the hard part.

Once an upgrade is certified though, does it need to be recertified on every plane?

It is a very well-known issue with every plane. Sometimes there are no solutions to a problem. You need these hacky solutions. The title is clearly catchy with everything that is going on with Boeing.

But the point is, this reboot process is very well managed and known. So I won't call it scary.

I agree it's not scary and it's a good, known workaround. But it's software - we shouldn't say "there are no solutions". The solution is: fix this problem and add long runtime testing to the qa process. Especially if this is a known issue in other planes.

But according to the article, the bug has been fixed:

> The remedy for the A350-941 problem is straightforward according to the AD: install Airbus software updates for a permanent cure, or switch the aeroplane off and on again.

Maybe the update process isn't streamlined enough, or maybe there are other reasons not to upgrade, but in general, the airlines need to install the updates.

When dealing with physical systems it is impossible to have no bugs. To give an extreme example, there is a not insignificant chance that cosmic rays can flip a bit in a system's memory - https://stackoverflow.com/questions/2580933/cosmic-rays-what...

The issue here is overflow due to time. The time is saved in a variable (I don't know how many bits), which overflows after the given period. Now there are two options: 1. Upgrade the circuits of every plane. These planes were designed/built a long time back. Bigger registers were not practical due to costs. 2. Document it and have a process for it.
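For what it's worth, there is also a software-only way to live with a wrapping counter, at least for measuring intervals: subtract modulo the counter size. A minimal sketch, assuming a 29-bit tick counter (the width and every name here are made up for illustration, nothing in it comes from the AD or from Airbus code):

```python
COUNTER_BITS = 29              # assumed width, for illustration only
MODULUS = 1 << COUNTER_BITS    # the counter wraps back to 0 at this value


def naive_elapsed(start: int, now: int) -> int:
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now - start


def elapsed_ticks(start: int, now: int) -> int:
    """Modular subtraction: correct as long as at most one wrap occurred."""
    return (now - start) % MODULUS


# Start 100 ticks before the wrap, read again 50 ticks after it.
start = MODULUS - 100
now = 50
print(naive_elapsed(start, now))   # negative, which is the bug
print(elapsed_ticks(start, now))   # 150
```

This only helps code that compares two readings taken less than one wrap apart. Anything that treats the counter as absolute uptime still breaks.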

Environmentally induced errors aren't software bugs, just because there's a problem elsewhere doesn't mean we shouldn't seek to mitigate other problems.

In plane investigations I've looked at (not many) the issue has always been a compounding of several errors or shortcomings .. that strongly suggests you shouldn't let small errors build up in different systems, to me. [1]

If it's a register which takes down the whole system then surely they'd know that (and could fix it with a watchdog that returned the affected systems to the boot state without reboot) -- other comments seem to be saying "meh, it's complex, doesn't matter what the error is as long as reboot fixes it"; that seems really dangerous in safety critical systems.

[1] but I acknowledge the "better the devil you know" issue and that pragmatism and cost take over at some point.

> it is impossible to have no bugs.

I agree with that. But once you know about a specific bug, there is always a solution.

Normal :) Same for the Boeing 787 back in 2015: https://news.ycombinator.com/item?id=17907654

Software has a bug. Patch with a fix is released. Some users don't want to update for reasons and demand a workaround. The workaround sucks.

How is this news?

The workaround doesn’t even suck, IMO.

It doesn't suck unless you forget to turn it off and back on.

Yeah it's literally a joke. "Have you tried turning it off & back on?"

Time- ~Late 1990s.

Place- Small financial institution in the NE.

Having just finished two weeks on the job, one Friday evening before heading out I decided to reboot my Sun Sparcstation. I of course did not have root but there was L1-A which put you into the BIOS. Then: sync;sync;reboot

Workstation starts rebooting.

30 seconds later the sysadmin is standing over my shoulder.

  SA - "What did you do?"
  me - "I rebooted it"
  SA - Incredulous. Like I just set my hair on fire. "Why?"
  me - "It's been two weeks, you know defrag memory, free up the page tables"
  (some vague pseudo cs bs)
  SA - "This is a Solaris X.Y/SunOS Q.Y machine.
  It had an uptime of 180+days. 
  I have machines here with an uptime of 2+ years."
  me - "Really?"
  SA - "These machines do not need a reboot. Ever. Please do not do this again."
I arrived Monday to find that L1-A had been disabled on my machine.

How far we have (pro|re)gressed in 20 years! ;) (edits - typo)

Without reading the article, it reminds me of the famous software glitch in the Patriot defense system: to work around a rounding-error bug, the system had to be rebooted after running for not so many hours...

I kind of feel like these things should be reset regularly to be safe anyway.

This was a bug that was known, what if there are others?

Crash early, crash hard.

Not what you want in aviation, per se.

That is 2^29 milliseconds. I dread to imagine the reasons leading to 29-bit millisecond time in an airplane.

I first encountered such a bug way back in the 1996 time frame, when I tested an embedded system nonstop for ~25 days and the system had an issue. That is the wraparound time for a 2^31 counter if it counts once every millisecond.

From that point on, I made sure tests covered counter wraparound at 2^31, 2^32 etc. Until one day, someone told me the system I worked on, deployed in a Comcast data center, had an issue after ~250 days......

I got used to hearing news about Newton, Win98, 777 etc. having issues after 25 days, 49 days, etc. It is very easy to guess what the potential issues were.
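Those day counts are easy to reproduce. A quick sketch of the arithmetic: the 1 ms tick is from the comment above, while the 10 ms tick for the ~250 day case is only a guess at what would produce that number, not something the comment states.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def wrap_days(bits: int, tick_ms: int) -> float:
    """Days until a counter with `bits` bits, bumped every tick_ms, wraps."""
    return (2 ** bits) * tick_ms / MS_PER_DAY


cases = {
    "2^31 ticks at 1 ms": wrap_days(31, 1),
    "2^32 ticks at 1 ms": wrap_days(32, 1),
    "2^31 ticks at 10 ms (assumed tick)": wrap_days(31, 10),
}
for label, days in cases.items():
    print(f"{label}: {days:.1f} days")
```

That comes out to roughly 25, 50 and 249 days, which lines up with the 25-day, 49-day and ~250-day failures mentioned.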

Presumably someone is using the high four bits for some other purpose. It wouldn't surprise me if parts of an A350's avionics software is old enough that those sorts of space optimizations made sense.

It could also be a tagged value. These are quite common in code created by "high reliability" languages (Ada, OCaml). The high bits are used to carry some metadata.

E.g. you might want to set a certain bit if the value went outside the defined input range of a function (think for example far from the accepted numerical window of a Taylor series expansion). Instead of dealing with error conditions at each and every step (thereby making the timing properties of the code unpredictable) you just collect all error conditions and at the end decide if you discard the value, or just use it partially (e.g. it might still be "good enough" to be used as a parameterizing input of an adaptive filter, where it averages out with the rest).
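To make the tagged-value idea concrete, here is a rough sketch of a 32-bit word whose top bits carry status flags, leaving 29 bits for the payload. The layout, flag names and helpers are all hypothetical. Nobody in this thread knows how the A350 code actually stores its time value.

```python
VALUE_BITS = 29                      # assumed payload width
VALUE_MASK = (1 << VALUE_BITS) - 1   # low 29 bits hold the value
FLAG_OUT_OF_RANGE = 1 << 29          # hypothetical flag: result left valid range
FLAG_STALE = 1 << 30                 # hypothetical flag: input was stale


def pack(value: int, flags: int = 0) -> int:
    """Combine a payload (truncated to 29 bits) with flag bits."""
    return (value & VALUE_MASK) | flags


def value_of(word: int) -> int:
    return word & VALUE_MASK


def flags_of(word: int) -> int:
    return word & ~VALUE_MASK


def add(a: int, b: int) -> int:
    """Add two payloads, carrying forward flags instead of branching on errors."""
    total = value_of(a) + value_of(b)
    flags = flags_of(a) | flags_of(b)
    if total > VALUE_MASK:
        flags |= FLAG_OUT_OF_RANGE   # payload no longer fits in 29 bits
    return pack(total, flags)


result = add(pack(5), pack(7, FLAG_STALE))
print(value_of(result), bool(flags_of(result) & FLAG_STALE))   # 12 True
```

The caller inspects the flags once at the end of the computation. The price is that the usable range shrinks to 2^29, so a millisecond counter stored this way runs out after about 149 hours.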

Maybe it’s counting in microseconds?

Then it would be ~153 hours to be near a bit overflow.
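A quick way to check both guesses is to ask, for each tick unit, which power of two lands closest to 149 hours. This is only a sanity check on the numbers in this subthread. The function name is made up and it says nothing about the real implementation.

```python
import math

TARGET_HOURS = 149
UNITS_PER_SECOND = {"milliseconds": 1_000, "microseconds": 1_000_000}


def nearest_power_of_two(target_hours: float, units_per_second: int):
    """Return (bits, hours) for the power-of-two tick count nearest the target."""
    ticks = target_hours * 3600 * units_per_second
    bits = round(math.log2(ticks))
    hours = 2 ** bits / units_per_second / 3600
    return bits, hours


for unit, per_second in UNITS_PER_SECOND.items():
    bits, hours = nearest_power_of_two(TARGET_HOURS, per_second)
    print(f"{unit}: 2^{bits} ticks = {hours:.1f} hours")
```

Milliseconds give 2^29 at about 149.1 hours, while microseconds give 2^39 at about 152.7 hours, so the millisecond reading fits the 149 hour figure much better.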

Interestingly, it looks like the A350 software probably was developed under DO-178B[1] rather than DO-178C[2] given the timing of release. This may seem minor, but Wikipedia's comparison indicates 178B was looser. Of course, these specs just consider requirements, not so much advice on implementation.

[1]https://en.wikipedia.org/wiki/DO-178B [2]https://en.wikipedia.org/wiki/DO-178C

So this is already fixed:

"The remedy for the A350-941 problem is straightforward according to the AD: install Airbus software updates for a permanent cure ..."

And it has the known workaround.

So this has almost nothing to do with Airbus at this point, the directive and the "sighs" uttered by the EU aviation agency are directed at the airlines that won't install the update.

But good distraction from Boeing's woes as long as you only read the headline...

> So this has almost nothing to do with Airbus at this point, the directive and the "sighs" uttered by the EU aviation agency are directed at the airlines that won't install the update.

Why won’t EASA ground the un-updated airplanes?

As mentioned else-thread, the plane requires a certain level of inspection after the software update, which merely takes a few weeks and lots of human effort.

I find this interesting to compare to the 737 Max. Here the reaction is "just reboot because the inspection after installing the fix takes too long". But with the 737 Max, the reaction is "Create MCAS to avoid the cost of having to retrain pilots? How could you be so stupid?"

I know, the A350 bug hasn't killed anyone (yet). But I see the parallels in the issue, and yet the reactions here are completely opposite.

Part of it, I'm sure, is that the A350 bug is comprehensively root-caused and that root cause is understood to be completely bounded by rebooting the system, whereas MCAS reduces the number of points of failure to, in certain builds of the 737 MAX, a single sensor.

Because the root cause of the problem is very well known and because of that the workaround is considered safe.

It's a scary thought that such issues exist, regardless of how common or not.

I think planes are already scrutinized to a very high degree. What I am more concerned about is airlines doing the reboot in flight to save time. Planes are often on a very tight schedule (maybe not cargo planes).

Don't worry, the AD (airworthiness directive) specifically calls for complete, ground power cycles. These cannot be done in flight. https://ad.easa.europa.eu/blob/EASA_AD_2017_0129_R1.pdf/AD_2...

There's plenty of time to do the reboot between turning the planes (disembarking and boarding) - a good amount of A350 planes spend the day in the sun at their destinations since the operator can't fly anywhere else with them between flights.

you're expecting software without bugs? there's no such thing.

Occasionally rebooting a system is actually a good thing.

“Hello, IT, have you tried turning it off and on!?”

The opening line from the IT Crowd [0], one of the cult British sitcoms and a favorite of mine. Highly recommended :-)

[0] https://en.wikipedia.org/wiki/The_IT_Crowd

And? Does it come as a surprise that planes require regular maintenance?

Software engineering needs some significant improvements if we would like to keep using critical systems like aircraft. It is getting to the point where serious software problems are impacting everyday life, and not in a good sense.

Airbus is already using formal methods for creation and validation of the code, which are the ultimate version of testing.

Whatever method prevents buffer overflows.

AFAIK that is what happens here.

"The CPIOM is effectively a mini computer; in the A350 CPIOMs run discrete avionics "applications", in the sense of apps. CRDCs themselves do not host or run applications, suggesting that the failure condition detailed in the EASA AD may mean loss of a particular app on a CPIOM after a buffer overflow."

s/AFAIK/I'm guessing/

Or do you have any objective basis for suspecting that this is a buffer overflow?

That's a very broad statement. Focusing on specifics would be more helpful. For example, 149 hours isn't that long, especially for aircraft which have high duty cycle percentages. Why didn't the original test plan estimate how long it would take to find a certain percentage of [insert-bug-type-here], and find this issue? Would re-testing the code-base pro-actively find related latent issues?

But none of these kinds of observations are fundamental. There's no general conclusion regarding software engineering to be drawn.

Is it too broad to expect software to work properly? A mean time between failures (MTBF) of 149 hours is extremely low. Comparing that to hard drives (1,000,000 hours) or anything that has complex electronics and mechanics is laughable.

What kind of improvements do you have in mind? Any specific examples?

Having systems that are not prone to buffer overflows? Having workarounds? Not sure. As far as specific examples:

