Commercial pilot here. Can confirm that "turn it off and back on again" works for troubleshooting avionics issues.
But seriously, this is clickbait and nothing to see here. Many things on the aircraft are checked, cycled, etc before every flight, let alone on a 51 day mx schedule.
I worked in aviation for a while. This is super common. There isn't a pilot on the planet who hasn't turned avionics off in flight (there are always redundant and backup systems). There probably isn't a working pilot in the world that hasn't had to cycle a circuit breaker in flight this month.
Oh yeah. Many times. But, it's not as scary as it sounds. There are multiple redundancies and backup systems. So, you can cycle something on/off without touching the other systems. It's often a step in abnormal procedures checklists.
This article is turning a routine checklist/maintenance item into scary sounding clickbait.
If I'd known it were the usual "(2^32 - 1) ms" -> "49 days 17 hours 2 minutes 47.3 seconds" rollover (millis stored in an unsigned long), then I'd be at ease, but 51 days doesn't say anything to me.
That's fine, and I'm sure it won't be a huge maintenance problem, but it indicates the underlying software is such a mess that they can't even adequately fix a simple issue.
In software, it's what we call an "ugly hack." An "ugly hack" meant that 737s didn't rely on both sensors, and people died. An ugly hack meant that the Ariane 5 rocket exploded in mid-air.
Ugly hacks should not be a part of any project where lives are at stake.
No, it indicates that the problem domain is sufficiently dangerous that the risk of fixing must be balanced against the risk of the fix causing a different unknown error. There were ~500,000 787 flights in 2017 with an average of 200 people per flight. The 737-MAX resulted in 385 fatalities, so if a fix had a 1 in 250,000 chance of causing a different error that could result in a fatal crash then it would be worse than the 737-MAX problems. Do you have confidence that systems you have worked on have processes in place to guarantee that there is less than a 1 in 250,000 chance that a fix would cause another error? If not, are you aware of any organization whose development practices you have first-hand knowledge of and that you are confident could give such a guarantee? That is the risk analysis that must be done when doing a fix.
To be fair, this is somewhat of an over-exaggeration of the requirements since not all systems are critical and not all errors cause critical problems. In addition, the risk must be balanced against the alternative, in this case the risk caused by making sure a reboot is done every 51 days, so you would need to do an analysis of the failure probability and possible consequences of the status quo and compare that against the possible error modes of a software fix.
As an addendum to the risk analysis, the above analysis was only for one year of flights and on a per-flight basis. If you expect the 787 to fly for ~30 years then the fix must not cause two crashes over 30 years, so a 1 in 7,500,000 chance. The average flight is ~5,000 km which is ~4-5 hours per flight for a total flight time of ~60,000,000 hours. A plane takes ~3 minutes to fall from cruising altitude, so we need fleet downtime of 6 minutes per 60,000,000 hours, which is 1 in 600,000,000 downtime. That is 99.9999998% uptime, 8 9s, 6,000x the holy grail of 5 9's availability in the cloud industry, 60,000x the availability guaranteed by the AWS SLA (again, somewhat of an over-exaggeration since you need correlated failures to amount to 3 minutes of continuous-ish failure, but that depends on an analysis of mean-failure time and mean-time-to-resolution which I do not have access to).
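For anyone who wants to sanity-check those numbers, here is a quick sketch reproducing the back-of-envelope arithmetic (same rough inputs as above: ~500,000 flights/year, ~4 hours per flight, 30-year life, ~3 minutes to fall from cruise):

    #include <stdio.h>

    int main(void) {
        double flights_per_year = 500000.0; /* ~787 flights per year (figure from the comment above) */
        double years            = 30.0;     /* assumed service life */
        double hours_per_flight = 4.0;      /* ~5,000 km average flight */
        double crash_budget     = 2.0;      /* "must not cause two crashes over 30 years" */
        double minutes_to_fall  = 3.0;      /* assumed time to fall from cruise */

        double total_flights   = flights_per_year * years;         /* 15,000,000 */
        double total_hours     = total_flights * hours_per_flight; /* 60,000,000 */
        double per_flight_risk = crash_budget / total_flights;     /* 1 in 7,500,000 */
        double downtime_ratio  = (crash_budget * minutes_to_fall / 60.0) / total_hours; /* 1 in 600,000,000 */

        printf("per-flight risk budget: 1 in %.0f\n", 1.0 / per_flight_risk);
        printf("allowed downtime: 1 in %.0f (%.7f%% uptime)\n",
               1.0 / downtime_ratio, 100.0 * (1.0 - downtime_ratio));
        return 0;
    }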
> .. if a fix had a 1 in 250,000 chance of causing a different error that could result in a fatal crash then it would be worse than the 737-MAX problems.
That's an improper calculation. Slight nuance but orders of magnitude difference. A better approximation is:
If a fix had a 1 in 250,000 chance of causing a different error that would result in a fatal crash then it would be worse than the 737-MAX problems.
And converse:
If a fix had a 1 in 250,000 chance of causing a different error that could result in a fatal crash then it could be worse than the 737-MAX problems.
MAX's MCAS generates way more errors than 1 in 250k. A few of them resulted in a crash.
Well, what I mean is that perhaps such critical software should be carefully rewritten with rigorous architectural and QA standards so they don't have to use an ugly hack in the first place.
Rewriting software is not only costly and subject to breakage, but for some of these systems requires an absolutely monumental FAA recertification process.
The cost of recertifying software shouldn't be the sole reason not to rewrite something, but I imagine you've been part of a rewrite where not quite everything worked as intended, even after having a lot of tests.
"The devil you know" very much applies to software that controls such life-critical functions and flying airplanes. If a pilot knows how to work around something, introducing something they may not know how to work around could be the difference between life and death.
Why did two 737-MAXes crash and the fleet get grounded? New systems were introduced that pilots didn't know how to address - and seemingly could not work around, even though the engineers who designed them did not want that outcome.
A rewrite, even with the most rigorous architectural and QA standards, is not a panacea.
I agree, critical software should be written to a high quality standard. In fact, I will take it one step further and say that critical software must be written to an OBJECTIVE quality level that is sufficient for the problem at hand. If that level can not be reached, where we are confident that the risk is mitigated to the desired level, then the software should not be accepted no matter how hard they tried or how much they adhered to "best practices". We do not let people build bridges just because they tried hard. They have to demonstrate that it is safe, and, if nobody knows how to build a safe bridge in a given situation, then the bridge is NOT BUILT.
To then circle back to airplane software, the standards of original development are even higher than the standard I stated above. There are ~10,000,000 flights in the US per year according to the FAA, and for at least the 10 years before the 737-MAX problems (I believe closer to 20), software had not been implicated in a single passenger air fatality. That means that in over 100,000,000 flights there were only two fatal crashes due to software (not even in the US, so we would actually need to include global flight data, but I will not bother with that since I am unaware of the count of software-related fatalities in other countries) for a total fleet-wide reliability of 1 in 50,000,000, 7 9s on a per-flight basis. If we use the per-time basis I used above, there are 25,000,000 flight-hours per year, so over 10 years, 6 minutes in 250,000,000 hours is 1 in 2,500,000,000 or 99.99999996% uptime, 9 9s, 25,000x gold standard server availability, 250,000x the availability guaranteed by the AWS SLA. Also note that with servers, we can use independent replicated servers to gain redundancy allowing uptime to multiply (1 in 100 failure for each server means the chance of both failing at the same time, assuming independence, is 1 in 10,000), but the same does not apply to airplanes since every airplane must succeed.
The thing to understand is that the software problems we are seeing are not necessarily an indication that their standards are lower than the prevailing software industry and that they should adopt their practices. It could be, and likely is, that the OBJECTIVE quality level we require is extremely high, and they have not been able to achieve it as of late. This obviously does not excuse their problems since, as I stated above, they must reach the OBJECTIVE quality level we require; it is just an observation that maybe it is not because they are incompetent, maybe they are really, really good and the problem is just really, really, really hard.
I'm sorry to hear that people have become so accustomed to fixing failures of the mind (which these defects are) with reboots.
It takes a certain type of person to fly a plane and resilience in face of unknowns and following checklists are some of the qualities they have.
In the same vein, these kinds of failures have been known to programmers for decades, just like metallurgists are aware of metal fatigue _and plan for it_.
Failure of software professionals to plan for and mitigate this kind of foreseeable problem is inexcusable, and I liken them to incompetent metallurgists in an alternate universe who brushed off the de Havilland Comet failures as if there were nothing to learn from them.
Yes software bugs happen, but they are fixing this with documentation rather than root causing the problem.
That should worry people, because until it's been root caused the actual implications are also unknown. For all we know it's a symptom of a bug that will cause a more severe problem somewhere else.
This is a bug with "several potentially catastrophic failure scenarios". Yet it's not been fixed in the ~10 years since the aircraft first flew. Nor is it the first: there have been a number of fairly critical bugs on this airplane that took a long time to diagnose before changes were submitted for certification.
So, in ~10 years and many major revisions of the aircraft, multiple re-certifications, etc., none of them have bothered to fix it. One might argue they are afraid of changing the software because it might cause other catastrophic failures, but that leads down a thought process just as severe.
I am going out on a limb here but I seem to remember reading somewhere that airliners do have maintenance schedules that are very strictly kept, for obvious reasons. If the maintenance schedule is N days, then any news article pointing out how amusing it is that an airliner needs to be rebooted every >N days is at best sensationalism, at worst pure fearmongering.
I don't know for a fact this is the case here for the 787, but I think there are far better things to worry about when it comes to technical security in airliners than how often they need to be rebooted. For example, whether the on-board WiFi is sufficiently separated from the in-flight systems, and (as discussed recently here on HN) whether touchscreens for critical flight systems are sufficiently durable, tested and redundant.
> is at best sensationalism, at worst pure fearmongering
I don't know about the case here, but any time I've hit an issue in my work where "Thing X needs to be done every Y or bugs start happening" it's a pretty clear sign of some deeper issues and likely a lot of underlying bad dev processes.
This issue might be as "simple" as a memory leak that will suddenly require reboots every N minutes when a seemingly unrelated patch exacerbates an issue.
Devices in systems like this are full of monotonically increasing sequence numbers used for all manner of coordination and diagnostic functions. In this case it appears to be a way to ensure some recency constraint on critical data. This is an extremely common method of attempting to assess/identify staleness of critical data (i.e. "Is the sequence number I'm looking at before or after the last one I saw, and by how much?") in critical real-time systems.
Probably this is a counter that rolls over if it's not reset; the predictability of needing to reset it before time T is an indicator that it's a sequence number driven by a hard real-time trigger with extremely predictable cadence.
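For readers who haven't seen the idiom: a recency check on a wrapping counter is normally done with modular ("serial number") arithmetic rather than a plain comparison. A minimal sketch of the common pattern (not the 787's actual code, just the general technique):

    #include <stdbool.h>
    #include <stdint.h>

    /* "Is sequence number a newer than b?" - correct across 32-bit rollover
     * as long as the two values are less than half the counter range apart.
     * A naive (a > b) comparison breaks the first time the counter wraps. */
    static bool seq_is_newer(uint32_t a, uint32_t b) {
        return (int32_t)(a - b) > 0;
    }

    /* Age of a sample in ticks; unsigned subtraction is rollover-safe,
     * so it can back a "reject data older than N ticks" staleness check. */
    static uint32_t seq_age(uint32_t latest, uint32_t sample) {
        return latest - sample;
    }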
Exactly. I even worked on a medical instrument that had a similar problem.
Early in development we ran everything (real-time embedded system) on a system interval whose finest level of granularity was a 1ms tick. The system scheduler used a 32-bit accumulator that we knew would roll over after 50 days. However, we were given assurances that the system would have to be powered down for maintenance weekly so it didn't matter. Since proper maintenance is a hard requirement (or the instrument will start reporting failures) that was OK.
Eventually, some time after release we started getting feedback that the system was shutting down for no apparent reason. We investigated and found that those failures were all due to not having been powered down in months.
Apparently, since it could take up to 30 minutes after powering on the instrument before it was ready to run, some labs were performing maintenance with the power still on, so the machines were hitting much higher than expected uptimes. In many cases it wasn't an issue, but if time rolled over in the middle of a test, the instrument would flag the "impossible" time change as a fatal error and immediately shut everything down.
Next release moved to a 64-bit timer. I think we're good :-)
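To make that failure mode concrete, here's a hypothetical sketch (not the instrument's actual code) of the difference between a naive check and a wrap-tolerant one:

    #include <stdbool.h>
    #include <stdint.h>

    /* With a 1 ms tick in a 32-bit accumulator, the counter wraps
     * every 2^32 ms, i.e. roughly every 49.7 days. */

    /* Naive sanity check: right after the wrap, now < test_start, so a
     * perfectly normal reading gets flagged as an "impossible" time jump. */
    static bool time_went_backwards(uint32_t test_start, uint32_t now) {
        return now < test_start;
    }

    /* Wrap-tolerant elapsed time: unsigned subtraction stays correct across
     * a single rollover. A 64-bit tick (the eventual fix) pushes the wrap
     * out to roughly 584 million years, which also settles the matter. */
    static uint32_t elapsed_ms(uint32_t test_start, uint32_t now) {
        return now - test_start;
    }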
That's an interesting issue and makes me happy I'm working with a non-critical device. I like to follow practices (again, in non-critical settings) where cases like that can be accepted - but if such a case is detected we bail fatally. With an airplane or even a medical instrument, the cost of suddenly aborting in the middle of an action could be the plane falling out of the sky or some surgical tool becoming unresponsive at a critical time... So I think trying to keep working is the best course of action, but I thank the stars I work with non-critical applications where I can always declare a bad state and refuse to continue.
> the cost of suddenly aborting in the middle of an action
This is where FMEA (Failure Modes Effects Analysis) is useful: the likelihood of critical failures is assessed and the ones that are both dangerous and unacceptably likely to occur are removed by design. The rest are assigned specific ways of being handled.
In this particular case, the severity of the failure (not completing a test) is relatively high, but not unacceptably so, and the fault is very unlikely to occur since it requires that (a) a lab violate the maintenance protocol we specified and (b) it happens during the time period between a test starting and ending. In all other cases it's a non-issue.
If we were to continue running in this scenario, the outcome could be far worse than shutting down since we would now have the possibility of providing incorrect diagnostic data to a physician. Again, the FMEA would say that although shutting down is bad, continuing to run is far worse.
I think FMEAs are a very good procedural tool, because they put you into a mindset of considering your system and its functionality that's more "failure first" oriented.
However, they're also very difficult and time consuming to perform and keep updated throughout the development lifecycle. They're also necessarily sparse in terms of real coverage of a system's operational/behavioral domain for complex systems.
That said, I think way more software engineering organizations should be doing them as a matter of course even outside safety-critical systems. They're a very useful procedural tool to highlight blindspots at the very least.
It's actually a bit over the max value[1] - I agree though that I'd strongly suspect this issue is related to overflowing a millisecond counter stored in a 32-bit int. The numbers are way too close.
Hey, maybe <51 was just an off-by-one error... or maybe the actual advisory is to be <50 and some PM decided that number was too round or violated an SLA.
Yes. Jiffies (the main Linux tick counter) is set to roll over either 10 or 20 minutes after boot. Yes, this is deliberate: it is there to ensure kernel code and modules can handle the rollover.
Safety factors are a thing, sure. Safety factors of 4.29e9x (which is what you get when you go from 32- to 64-bit ints) are possibly a bit excessive, and not at all worthy of an FAA airworthiness directive.
My biggest surprise today is from learning that critical aircraft software is left running for days without a full restart. Somehow I assumed everything gets completely shut down every time they refuel or so.
Given this appears to be used in the communications layer of the system I expect that the width is a defined segment of some part of the wire protocol.
Bug-free software of any complexity is at the very least exceedingly improbable, so there is always a tradeoff to be made and a lesser evil to be chosen.
Aircraft firmware requiring mandatory reboots in alignment with maintenance schedule, but working reliably otherwise, inspires more confidence than firmware advertised to run bug-free forever.
Aren't Ada and similar languages designed for safety critical cases like this?
When lives are on the line software should be tested for reliability beyond 51 days. Having to restart is a symptom of reckless disregard for safety IMO.
> When lives are on the line software should be tested for reliability beyond 51 days.
Avionics software is written in a world of verifiable requirements.
For how many days should the software be required to operate?
Is it acceptable to add that many [more] days to the software verification schedule in order to verifiably demonstrate that it works according to requirements?
Taking a plane from design to commercial delivery takes years. I'm sure they can spare 2-3 months to do some long running tests. Especially if those can run in parallel with other fit-and-finish work unrelated to software.
So, your idea is a plane comes with firmware when you buy it (say in 1985) and then that's the only version forever. Every problem between 1985 and now, too bad, this passed QA back in 1985 and we're not changing anything? No.
Airliners are very long-lived equipment. So in fact they ship new releases. New releases have features that may be really valuable to safety, as well as features that are nice quality of life improvements. They're not shipping once per hour like a web startup, or even once per day like the NT internal team, but they do need to ship more than "once per new model of aircraft".
I've written before about an accident I spent a bunch of time looking at. No fatalities, just a smashed runway light but still reportable because of the "But for..." rationale. Two of the easiest things that would have prevented that from occurring were firmware tweaks. One was a recommended (but not mandatory) change in a newer build and the other exists only in Airbus planes so far.
Specifically, the newer build has an OAT disagree check, meaning that if you tell the plane "It is -20°C outside" (so automatic takeoff thrust is much lower), it considers the temperature sensor at the engine inlet. If that reads +15°C, which is 35 K different - the difference between flying and crashing into the fence at the end of the runway - the plane effectively says: I disagree with your guess about the temperature, so I refuse to try to figure out what to do next. You can realise you entered it wrong and type in a more realistic value, or you can set the thrust yourself manually if the sensors are broken.
The fancier Airbus approach was not to focus on the result of air temperature calculations. If the plane isn't accelerating enough, it can't fly, we don't care why it isn't accelerating, maybe the wheels are square - we need to abort takeoff so we don't crash. So teach the plane how long runways are, it can use GPS to figure out which runway it's using, and then it can tell pilots if they aren't getting enough acceleration and they'll abort because they don't care why it's not enough acceleration either, they don't want to die in a fireball.
Long running tests don't mean firmware cannot be updated. Updates will just themselves need time. And with better upfront testing updates should not need to be as frequent.
I had an A380 flight that was slightly delayed due to a “software update” taking longer than expected.
It was at the SIN layover for QF1 LHR to SIN, so it was kind of worrying/amusing to have your plane need a software update halfway through your journey
This just gets back to the issue though - since all this software is locked up as proprietary, we have no visibility into how thoroughly this bug was identified. We don't know if there is a test out there confirming that this data corruption occurs after the expected amount of time after each patch.
I don't have as much a problem with that. There's always been tension between regulator access and proprietary corporate information. In order to get the former, you have to design a scheme to protect the latter.
1. You say that like aircraft are being manufactured by laymen. Despite all the recent problems with Boeing, that's not the case.
2. Running a battery of formal proof tests is expensive and way more complicated than running a unit test suite for software.
3. Probably more complexity is required to solve this issue, and where there is more complexity, there might be more risk.
I'm not saying that this is even acceptable or a great trade-off, but the way you worded your comment is presumptuous.
We can't see what's in the box (since closed source), but I personally would be okay with this being a clearly laid out limitation, i.e. having a nice blinking red function comment saying "This integer will overflow if the system is up for more than 50 days, but due to hardware limitations we're unable to properly do X Y & Z with a 64 bit width integer on these subsystems."
If this issue is clearly identified and tested around that's alright, it isn't a huge deal to have to reboot periodically... I'm more concerned this issue is one of those "Oh well, it just... gets a bit off after fifty days - try rebooting it, that seems to fix it."
I’d be more alarmed about the fact that FAA had to issue a directive to deal with this situation. Either Boeing did not include the reboot in operation or maintenance procedures, or operators did not follow those procedures.
The requirement of a reboot on its own, though, would not strike me as a blatant disregard for safety, as long as the period between reboots is long enough to exceed the maximum possible length of flight (taking any emergencies into account) with leeway to spare.
I wonder how many days before you hit 51 is the cutoff where planes need to be grounded - obviously a continuous 51 day flight can't happen without a lot of fancy stuff going on that doesn't apply to the 787, but let's say you're at 49.2 days and considering a .6 day flight, is that allowed or is .2 days within the expected probable variance of your flight time? What if it's closer?
Bugs may be inevitable but reasons and outcomes matter. If the entertainment system goes down then no big deal. If the pilot is misinformed and the plane crashes then how is it possible for the company to get such slop certified?
I have audio devices which, when installed, will run flawlessly until power or hardware failure. The firmware isn't bug free, but the operation never encounters bugs.
Continuous Deployment environments are susceptible to versions of this. You have a slow leak, and nobody notices until Thanksgiving, when the processes that used to run for 3 days now run for 10. And just about the time you think you got that sorted out, Christmas comes along and busts the rest of your teeth out.
I worked at Google way way back when. We had an emergency code-red situation where dozens of engineers from all over the company had to sit in a room and figure out what was making our network overload. After a bit of debugging it became clear that Gmail services were talking to Calendar services with far more traffic than anybody would have expected. A little debugging later and it became clear that restarting the Gmail server fixed the issue. One global rolling restart later and all was well.
But then the debugging started. Turns out the service discovery component would health check backend destinations once a second. This was fine as it made sure we would never try to call against a server that was long gone. The bug was that it never stopped health checking a backend. Even if the service discovery had removed a host from the pool long ago. Gmail had stopped deploying while it got ready for Christmas, and Calendar was doing a ton of small stability improvement deploys. We created the perfect storm for this specific bug.
The most alarming part? This bug existed in the shared code that did RPC calls/health checking for all services across Google and had existed for quite a long time. In the end though, Gmail almost took Google offline by not deploying. =)
Statistics being what they are, eventually you will have, in the same build, an unfixed bug that requires a restart, and an unfixed bug that only works until you restart. That is never a fun day.
Have you heard the anecdote about buffer overflows in missiles? The short story is: they don't matter if they happen after you know the missile has exploded. It doesn't by definition make it an "underlying bad dev process".
What about running repairs in something like Cassandra? In some cases it is by design. Here I'm a little surprised an airliner would even go that long without a reboot
I have no idea, hopefully? Or maybe this bug just manifests as a minor fuzziness on calculations that'd fall within acceptable error for seemingly unrelated tests. I also have no idea what Boeing's QA is like and I feel like assuming the best is clearly incorrect.
Agreed. How is it okay to brush off serious issues like this.
Power cycling as a maintenance requirement is absurd. You don't sell someone a garage door opener for example and say "oh yeah, by the way, part of the ownership maintenance is that you need to unplug it at least every month, otherwise it might just start opening and closing randomly. No biggie".
This is an indication of a serious problem with the engineering, and after everything that has been failing with Boeings products lately, makes me want to avoid getting on a Boeing aircraft.
This has to be a joke. I can't fathom that this sort of issue is in production. It means that there is some sort of unaccounted saturation of buffers, or worse, memory leaks. In a deterministic system, these are the obvious things you need to get right.
Did you really just compare a passenger aircraft to a garage door opener?
It would also be absurd to say that part of the maintenance for your garage door opener is tearing the whole thing apart after x amount of use, but that's absolutely standard for aircraft engines.
> If the maintenance schedule is N days, then any news article pointing out how amusing it is that an airliner needs to be rebooted every <N days is at best sensationalism, at worst pure fearmongering.
Why not treat it like security? Yes, there are other layers of defense, but any given layer needs to be measured on its own. If I find that some large government website allows for javascript to be inserted for an XSS, but prevents it from running because it only allows javascript executed from a specific javascript origin, it is still a security flaw because some user might use a browser that does not implement content security policy. Yes, the user shouldn't be using such an insecure browser, but the website itself should not allow for scripts to be injected and not properly encoded.
Don't you mean >N days? If a maintenance schedule requires maintenance every 60 days but the plane needs to be rebooted every 50 days, that would be cause for concern.
Maybe regular maintenance does not always include rebooting the computer, or removing the batteries and powering off/on, etc. If it is a safety issue, it is safer to make it explicit and not leave it optional.
If regular maintenance does not include rebooting the computer, this is news-worthy.
If the maintenance schedule does document system reboots, this is boring and business as expected... no different than periodically reinflating tires or replacing oil. I'd have no concerns flying on such a plane.
I understand, but as developers it is interesting that even in such critical software there are such bugs (in case it was not designed with the reboots in mind from the start)
Indeed, working with a Boeing subcontractor, I've seen a few cases where something is "fixed by process control", where rules for people are designed to circumvent the deficiency in the software.
Basically, adding more code to make software "smarter" for those edge-cases was judged (rightly or wrongly) as having an even higher risk of introducing new bugs and creating new test procedures or invalidating previous testing.
> It does make you wonder what other bugs the software has though? Presumably it's not intentionally designed so that it needs rebooting periodically...
Not necessarily. If the computer will never actually need to run for 51 days continuously, it may be a reasonable trade-off to require the reboot instead of writing (potentially buggy) code to handle a scenario that can be easily prevented from happening.
> I was once working with a customer who was producing on-board software for a missile. In my analysis of the code, I pointed out that they had a number of problems with storage leaks. Imagine my surprise when the customer's chief software engineer said "Of course it leaks". He went on to point out that they had calculated the amount of memory the application would leak in the total possible flight time for the missile and then doubled that number. They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention.
If it were working as designed (and properly documented), it does not seem likely that the FAA would find it necessary to issue an Airworthiness Directive.
To me the implication was not that you should write more potentially buggy code to prevent the need to reboot every 50 days... but rather than you should fix the bug that caused it to need rebooting every 50 days
Not necessarily - memory management depends on the application.
There are situations where it is better to just grow memory usage than garbage collect. Since airplanes require very routine maintenance anyway, this may be safer.
Lots of things are designed with periodic restarts in mind. From memory, one of the JVM garbage collectors is designed with daily restarts in mind. This is done to avoid having to deal with the expense of memory fragmentation.
This reports that this bug was found and a mitigation was put in place (maintenance schedules are amended to include a reboot). Where is the "fearmongering"? Because it mentions that not implementing that mitigation would be bad?
I am going out on a limb here but Google makes this easy enough to verify.
Waxing anecdotal from a rando's unverified, admittedly fuzzy memory is a pretty banal application of my agency. "People aren't sharing info correctly! Allow me to fix this while admitting to a fuzzy understanding myself!"
Imagine what could be done if we curbed that emotional sense of thinking we know and turned it into agency towards learning, instead of bloviating online, bending readers' moods to our position, and disregarding literal effort at verification.
Add in behind the back character assassination of the author, and yeah really buying into your expertise. At least I’m being dismissive to your face.
I'm not sure precisely what you are ranting against. I mean, it doesn't seem unlikely that an aircraft has a strict maintenance schedule, right? You haven't even suggested that you don't think this is the case. You certainly haven't provided any sources suggesting that it isn't the case.
This isn't Wikipedia. We don't demand citations for every comment. From memory this isn't the first time an aircraft has suffered from similar issues where it needs to be restarted. It generally doesn't matter because they have to have their internal databases updated with changes to navaids, radio frequencies, airport closures etc.
Here is an example from a completely different aircraft from a different manufacturer:
You know, there's a trend going on which makes things kind of confusing, and that is accusing everything of being fearmongering even when it may not be.
Things like the coronavirus not being that bad, or surgical masks being useless. I genuinely question whether what you're saying is actually true or just following the trend.
What I find funny is how developers still "forget" to account for variables that increment and will overflow after X days (and the process doesn't catch it).
It was funny in Windows 95 days (and Unixes knew how to handle those) now it's just sad.
Of course the problem might be a bit more complex and it might be a combination of issues. Still not good though
Now that the turn of the century is 20 years ago and the next century is 80 years away, I feel it's safe to go back to two digit years for most things.
The 51 days in the headline popped out to me because I have a Sonicwall brand firewall I help manage which requires a reboot every 51 days or it starts dropping packets. Yesterday was day 51 and like clockwork, it started failing.
This must be a coincidence because surely there is a large safety factor in this 51 day schedule. If the reboot is required to be every 51 days, surely the issue this is trying to avoid only appears after at least 3-4x as long.
> This AD requires repetitive cycling of the airplane electrical power. This AD was prompted by a report that the stale-data monitoring function of the common core system (CCS) may be lost when continuously powered on for 51 days
The issue appears after 51 days. I can't see in that AD a specific recommendation to reboot at predetermined intervals, just that it requires reboots before 51 days of continuous operation.
Safe use of Microsoft Windows also requires rebooting on a slightly shorter schedule, because GetTickCount will overflow. In particular if you're running a real-time simulation which is likely to use delta time as a critical parameter, and you can't audit the code or know for a fact that it uses GetTickCount.
I've often wondered how people came to the conclusion that "Windows is unstable, you can't leave it on for more than a couple months!" due to this.
At one company I worked for, all of our National Instruments test equipment would start to fail with communication problems after about two months on our Windows XP computers. Being familiar with GetTickCount, I rebooted the computers, recorded the date, verified the next failure was 49 days later, then emailed National Instruments with a link to the GetTickCount documentation. They pushed out an update with a fix 3 days later. Oops.
One thing I'm curious about is what is classified as a "reboot" of the plane. If it is parked overnight, are all the systems shut down and restarted the next day when it is put back in service? Does it sit "running" in some sleep mode? Last time I checked, a plane cannot (practically) stay airborne for 51 days. Is the reboot a pain-in-the-ass 5-day procedure? There are too many unknowns to sound alarm bells.
Do safety-critical systems have memory burned as ROM, instead of having dynamic memory allocation? From my point of view, the plane doesn't really change, so the avionics suite shouldn't require changing either. You build a physics model of the plane, translate it into memory, then bake it in, and dynamic allocation is only needed for when you need inputs. Or is this dangerous because the physics model does change significantly for different loadouts?
Are we talking about the same Elon Musk whose car company created an autopilot which has been known to crash into the side of trucks? Have I missed the obvious sarcasm?
Getting late to the discussion, but people have tackled this for a long time in software engineering; it's called Software Rejuvenation, with models of repairing systems, Markovian assumptions, applications in the JVM, etc. Interesting topic. It was used to analyse the Patriot missiles that needed the same approach to replenish their internal variables periodically.
However, since Airbus != Boeing, nobody around here cares. Only stories that are pointing out problems with Boeing are allowed (or upvoted) apparently.
I am guessing the manufacturer doesn't have the budget to fix this. They are too busy sorting out the 737 pitch controls. I am guessing they need a bigger buffer that would clear itself out, a good GPS and timestamp database, and a "clr" button on the console to clear the historic altitude and speed data. The historic data can go to the black box and the new data can be stored in the buffer. A sensor should only look at data from the past week and not calculate from 49 days in the past. What use would pilots have for older data other than service and maintenance? Was the OS written in Objective-C?
Overflow is not the only kind of bug triggered by uptime. In February 1991 a Patriot missile failed to intercept an incoming Scud due to an accumulation of time-based errors. The missile system had been online for 100 hours and this resulted in enough error that the intercept calculation was incorrect. People died.
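The arithmetic behind that failure is small enough to sketch. The figures below are the commonly cited ones, quoted from memory, so treat them as approximate: the system counted time in tenths of a second and converted using a 24-bit approximation of 0.1, which is off by roughly 9.5e-8 per tick.

    #include <stdio.h>

    int main(void) {
        /* Commonly cited figures for the 1991 Dhahran Patriot failure
         * (from memory; treat as approximate). */
        double err_per_tick = 0.000000095; /* error of the 24-bit approximation of 0.1 s */
        double hours_up     = 100.0;       /* continuous uptime of the battery */
        double ticks        = hours_up * 3600.0 * 10.0; /* system counted tenths of a second */
        double drift_s      = err_per_tick * ticks;     /* ~0.34 s of accumulated clock error */
        double scud_speed   = 1676.0;      /* m/s, roughly Mach 5 */

        printf("clock drift after %.0f h: %.2f s\n", hours_up, drift_s);
        printf("tracking error at Scud speed: ~%.0f m\n", drift_s * scud_speed);
        return 0;
    }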
This happens on set top boxes, especially when the graphics memory heap is allocated separately from the system memory heap. The graphics memory heap can be fragmented and surfaces stop being rendered because there are no contiguous memory blocks large enough. Having two heaps on a low memory device leads to unfortunate compromises.
I'm a bit ashamed of this, but I guess I'm not the only one. At work we have a system which started crashing, and we could not figure out why. It runs normally, but restarts after some time and then continues to function properly again. So what did we do? Ran multiple instances behind a proxy and let instances crash. The cluster as a whole functions perfectly even when parts of it are restarting because of an unknown error that we have no capacity to identify and fix.
Still, it's remarkable that two separate Seattle-based companies have produced a similarly short time bomb on very expensive and highly visible product development projects.
GetTickCount is a monotonic timestamp API still supported in Windows 10, available for any application to use. These days, you should use GetTickCount64, but any application that doesn't handle the rollover of GetTickCount is buggy.
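Concretely, the usual way to handle the rollover looks something like this sketch (GetTickCount64 requires Vista or later):

    #include <windows.h>

    /* DWORD is unsigned, so subtracting two GetTickCount() readings yields the
     * correct elapsed milliseconds even across the ~49.7-day wrap, as long as
     * the measured interval itself is shorter than the wrap period. */
    DWORD elapsed_ms(DWORD start) {
        return GetTickCount() - start;
    }

    /* On Vista and later, GetTickCount64() sidesteps the problem entirely. */
    ULONGLONG elapsed_ms64(ULONGLONG start) {
        return GetTickCount64() - start;
    }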
Good thing we improved with Windows 10 and have to reboot only every 52 days. Progress!
Seriously, Windows 10 does tend to get dodgy if you don't reboot for a few weeks. I'm not the only one who's noticed. Granted, it's less likely to outright crash, but it acts increasingly drunk.
I honestly think the safest solution is that an aircraft should refuse to take off after two weeks until you reboot it. Instead, Boeing and Airbus leave it to customers to test whether the plane still flies after six months.
Ideally a plane is spending as little time as possible not doing anything. It's on the ground for as short as possible and there's ideally always something happening that needs monitoring or communication. Restarting a bunch of low-level systems just because doesn't fit into that, so apparently a 51 day span without powering it off wouldn't be unheard of.
And to think of the billions given to Boeing to bail it out while the management team who got it into this state got golden parachutes?
If the government deems that Boeing must be saved, it should also deem that prior management was negligent and the cause of this situation, seize their personal assets, and hold them criminally accountable.
I mentioned this in another thread, but 2^32 * 1024us is 50.9 days. So it's probably a systick at 1.024ms overflowing a uint32_t. If you've got a 1us timer it's a lot cleaner for the CPU to make the tick happen at 1024us than at 1000.
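A tiny sketch of that arithmetic, for anyone who wants to check it:

    #include <stdio.h>

    int main(void) {
        double ticks = 4294967296.0; /* 2^32 counts in a uint32_t */

        /* days until rollover for a 1.000 ms tick vs a 1.024 ms tick */
        printf("1.000 ms tick wraps after %.1f days\n", ticks * 1.000 / 1000.0 / 86400.0); /* ~49.7 */
        printf("1.024 ms tick wraps after %.1f days\n", ticks * 1.024 / 1000.0 / 86400.0); /* ~50.9 */
        return 0;
    }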