Integer Overflow Bug in Boeing 787 Dreamliner (engadget.com)
275 points by h43k3r on May 2, 2015 | hide | past | favorite | 203 comments

Might be time to remove working on the 787 from my resume. I feel like the poor thing has been one disaster after another in the news.

I can't speak to the quality of the A and B level (most critical) code, but the development process for the C level software I was working on definitely could have used a lot of improvement. Messy code, tests and documentation were an afterthought/checkbox item, etc. The incentives were just wrong.

I think there's a ton of room for process innovation in avionics software development. One thing I wanted to build for a long time was a tool for tracing. In theory, every industry requirement (DO-178B, etc) was supposed to trace to a high-level Software Design Document (SDD) requirement, which was supposed to trace to a Software Requirements Document (SRD) requirement, which would trace to a code function. We maintained all of that BY HAND. It was a huge mess. Perfect example of something that could have been an extremely valuable development tool, but ended up just being a hassle to try and maintain.

Then of course there's language choice. C is king, which isn't necessarily a bad thing, but it's certainly not the safest, even in the restricted forms used in avionics. Sadly, my very first ever project as a programmer was porting an Ada codebase to C for the 787 (off by one errors for days...). It's almost cliche to say nowadays, but I would be really excited to see Rust gain some traction in avionics over the next 20 years or so. Because that's how far behind avionics is. We were using Visual Studio 6 in 2011!

I work in the medical device field, and we have a similar process requirement (traceability from Design Input Requirements -> Software Specification -> Software Verification Procedure (and implicitly, the actual code function) -> Software Verification Report). We're currently wrangling with a giant-ass spreadsheet to keep track, and it totally sucks.

Ah yes I completely forgot SRD -> TESTS -> Then Code. Maybe that's because we always did it the other way around...

I'm telling you, there's money to be made building tools for this stuff. I think a big part of the reason things aren't being improved is that the people in a position to recognize bad process and tooling maybe aren't the type of people to see an opportunity to make money solving the problem rather than putting up with it. I wouldn't associate most of the engineers I knew at Honeywell as the type to stay up until 2AM every night for 3 months working on a side project to pitch to their boss.

I think it's really exciting what's happening in healthcare right now though. The innovative culture is exploding. Ultimately I care much more about what happens in medicine than avionics, as long as planes aren't falling out of the sky every 248 days...

There are already tools for this stuff. Problem is that they are all various forms of crappy and the market is so small that there is little incentive to improve them. I work in Medical Devices and one of the tools I'm supposed to use (we find every excuse to avoid it) has a UI like a 2001 Swing app and while it works, it is insanely painful to use due to its absolutely counterintuitive interface.

We're actually integrating more and more of our work into Visual Studio since its tools are excellent. The problem is that the organization needs to validate any tool before we can use it as a part of Quality Management and that process can take forever.

Visual Studio is awesome. I'm really excited to see how Code turns out on Linux, especially for things like building GUI apps and 3D stuff.

I worked at Honeywell on these kinds of projects. I agree that these people definitely weren't that type. :)

Bizarrely now I am also helping (in a small way) with an innovative healthcare tech project. Maybe many people go through similar trajectories in programming without realizing it.

I'm curious about this process. Do you have contact info or pointers to other resources about it?

It's essentially the V-model [1] for software development. It's commonly used in systems engineering and the industries that implement it, so military, avionics, medical, safety engineering etc. I work on industrial safety systems, and would echo the other posts here - it's a requirement of the industry, but it's rarely done well due to the cost to implement, and the difficulty in implementing tools like Doors.

NASA and the U.S. Military have good guides and white papers on it, which you should be able to find online with a bit of digging.

[1] http://en.m.wikipedia.org/wiki/V-Model_(software_development...

Unfortunately for you, I don't. I stay far away from the regulatory stuff - I really only know enough to get my job done, mostly in the form of directives from my manager and our quality section.

That said, our software development process is really just a mild adaptation of all our other development processes (like the ones we use for hardware and reagent development). As I've described it above, it's a pretty standard engineering approach to design control, and there are all sorts of variations on it.

In fact, the FDA makes note of how broad the field of medical devices is, and does not actually enforce any specific design control process, but rather provides a framework for you to develop a process that meets your developmental and regulatory needs. As part of getting FDA approval, your design control process will be audited, as well as how well you followed it. In addition, medical devices are classified into 3 classes, corresponding to patient risk. Our product is currently classified in the lowest risk level, and so we can avoid a lot of additional requirements. As I recall from skimming through the standard (sorry, we only had a hard copy at work.. costs money apparently), higher risk devices with software components must have certain types of tests performed. In our risk class, we have total freedom to define our testing requirements - that said, the rigour of our testing is still under FDA scrutiny, it just means that there are no specific checkboxes to hit (like integration testing for example).

I guess if you want a place to start off, this might be a good place to look: http://en.wikipedia.org/wiki/Design_controls

> Then of course there's language choice. C is king, which isn't necessarily a bad thing, but it's certainly not the safest, even in the restricted forms used in avionics. Sadly, my very first ever project as a programmer was porting an Ada codebase to C for the 787 (off by one errors for days...). It's almost cliche to say nowadays, but I would be really excited to see Rust gain some traction in avionics over the next 20 years or so. Because that's how far behind avionics is. We were using Visual Studio 6 in 2011!

C has the big advantage that having been used for many years, it is very well understood, the compilers that exist are mature, and the formal analysis tools you want to use in a high assurance setting all exist also.

Higher level languages usually have none of these things.

> C has the big advantage that having been used for many years, it is very well understood,

C is probably the least understood language I know of. Vast parts of it are undefined, everyone makes their own decisions and compiler writers define the language.

C is a practical choice because it is performant, legacy code exists, and because a lot of people know the language.

Personally, I would never take a job writing safety-critical code in C. To give a single example of a far superior language for such applications, I would consider SPARK Ada, amongst others. I don't know how people that work on safety critical systems using C can sleep at night.

C is for systems programming where performance is king, not for building software that can kill people.

Ada is a good decision in this field.

Nevertheless, the parent was comparing C to Rust, where the language is immature, has no static analysis tools (to pick up the most important bugs), no well defined semantics and is entirely unproven in the field. This is the barrier for entry.

Rust's compiler is itself a static analysis tool that is more powerful than any static analysis that can be written for C. There is also a blossoming collection of "lint" libraries that extend Rust's type system to provide user-defined static analysis passes. Servo uses lints like these extensively to help manage interaction with SpiderMonkey, whose (C++-based) garbage collector manages the nodes in Servo's (Rust-based) DOM.

It's absolutely true that Rust is yet to be proven. We need to be ever careful not to oversell its benefits. But here's something that's been proven time and time again: use of C is fraught with horrors and pitfalls, and we should all be frantically seeking to improve the situation; if not via Rust, then via something else.

You could do with some learning about what's possible with static analysis.

For example, will the Rust compiler prove that your program must follow some state machine? (SLAM2) Will it compute a useful upper bound on the stack usage for your program? (a3/stack) Will it compute a useful upper bound on the runtime of your program? (a3/WCET) Can one prove (in certain cases) zero potential execution errors? (Astrée) Can one automatically prove generic first order logical specifications of a program? (Caveat)

A lot of these are decades-old theory that has been put into practice, and they simply don't work in the context of a compiler. I'm sure they could be made to work with Rust, but they haven't.

You can do a lot more with static analysis with time and money than you would want to do in a compiler for a language that's a couple of years old.

Furthermore trusting Rust's statically derived guarantees means trusting the Rust compiler to have no bugs. One does not need such a leap of faith for C; they can use a formally verified compiler: http://projects.laas.fr/IFSE/FMF/J3/slides/P05_Jean_Souyiris...

Writing a formally verified compiler for Rust would be (I'm sure of this) a right pain in the arse.

Sorry, that came off overly harsh, I think. Anyway, static analysis is really cool :)

NBD, I was being a little overzealous there myself. :) IME when most people say "static analysis" they're referring to tools that attempt to detect the usual class of memory vulnerabilities that Rust prevents outright (or maybe they do even worse and just attempt to detect typical coding patterns which happen to be correlated with the same).

Part of my zeal is also that Rust's semantics are strictly, er, stricter than C's (disregarding the quagmire of C's undefined behavior), which means that static analysis tools should be even easier to write for it.

> the poor thing has been one disaster after another in the news

everything this complicated has always had these problems, difference is news jumps on it like a hawk now

> In theory, every industry requirement (DO-178B, etc) was supposed to trace to a high level Software Design Document (SDD) requirement, which was supposed to trace to a Software Requirements Document (SRD) requirement, which would trace to a code function. We maintained all of that BY HAND. It was a huge mess. Perfect example of something that could have been an extremely valuable development tool, but ended up just being a hassle to try and maintain.

This is a symptom that group culture there is interfering with rational decision making around tool use.

> Then of course there's language choice. C is king, which isn't necessarily a bad thing, but it's certainly not the safest, even in the restricted forms used in avionics.

This is another symptom. Perhaps it's excusable if there were preexisting libraries/codebase.

> It's almost cliche to say nowadays, but I would be really excited to see Rust gain some traction in avionics over the next 20 years or so. Because that's how far behind avionics is. We were using Visual Studio 6 in 2011!

Another indication that group culture is blocking rational decision making about tool/language choice.

> This is another symptom. Perhaps it's excusable if there were preexisting libraries/codebase.

Uh, what would you use? As far as I know, the choices are: Assembler, Ada, and C. We're talking about avionics code, here.

C++ is not going to help you in this case. Java will be too slow/unpredictable. Real-Time Java is kind of sketch in my opinion (I could be wrong there). Most everything else is totally out.

Actually I'd probably choose Ada... but there are reasons the industry switched away from Ada.

> Java will be too slow/unpredictable. Real-Time Java is kind of sketch in my opinion (I could be wrong there).

* http://en.wikipedia.org/wiki/Real_time_Java

* http://www.rtsj.org/

I'm already familiar with those (which is why I mentioned it). Maybe you are just linking it for other people, in which case, thanks.

Why did the industry switch away from Ada? I thought they preferred Ada because it was such a 'safe' language?

Because Ada programmers were too expensive (i.e., there weren't that many of them).

At least, that's what I remember hearing---can't really vouch for it and I'd be happy to see someone else weigh in.

Also, my understanding was that Ada was basically forced on the defense industry by the Pentagon, and then there was eventually a backlash, and they dropped it. It may have had more success if it had been adopted voluntarily all along. In fact, it may yet have a resurgence. I know there is at least one serious Ada vendor out there still.

> Also, my understanding was that Ada was basically forced on the defense industry by the Pentagon, and then there was eventually a backlash, and they dropped it.

So they went to C? There's some irrational component in there somewhere.

What about Swift? Reference counting has good properties for real time applications. The toolchain is excellent for implementing verification tools. It has fairly high performance, combined with a very large and growing developer base. If someone put together a toolchain for Swift that covered the most useful 20% of features provided by Ada in that context, then such a thing could take over avionics programming.

Ahh, makes sense. Thank you.

> As far as I know, the choices are: Assembler, Ada, and C. We're talking about avionics code, here.

In that case, "group" may refer to the avionics software industry as a whole.

When you say Java, I'm assuming you're referring to a JVM? Now which JVM are you referring to?

I'm not sure how C is specifically to blame for something like int overflow - while in many languages this specific problem won't happen, in many popular ones it would happen the same way as in C. E.g. doesn't Rust also have limited integer types?

In Rust, integer overflow is defined behavior, though: https://github.com/rust-lang/rfcs/blob/master/text/0560-inte...

True, but as far as I see from this RFC, you have to explicitly check for overflow or run code in debug mode (which of course doesn't help unless your CI includes the one-year-without-reboot test, which would probably do wonders to the production schedules ;). If you thought about it, when writing that code, then you could add the same check in C too! The whole issue is that most people do not make such checks consistently.
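To make the "you could add the same check in C too" point concrete, here is a minimal sketch of a checked tick increment in portable C. The name `tick_add_checked` is made up for illustration; nothing here is taken from the 787 code, which none of us have seen.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: add `delta` to a signed 32-bit tick counter.
 * Returns false and leaves *ticks untouched if the sum would not fit,
 * so the signed addition (undefined on overflow in C) is never attempted
 * in that case. */
static bool tick_add_checked(int32_t *ticks, int32_t delta)
{
    if ((delta > 0 && *ticks > INT32_MAX - delta) ||
        (delta < 0 && *ticks < INT32_MIN - delta)) {
        return false;   /* would overflow: the caller has to decide what to do */
    }
    *ticks += delta;
    return true;
}
```

The check itself is the easy part. The hard part, as noted above, is remembering to write it everywhere and deciding what the caller should do when it returns false.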

> I'm not sure how C is specifically to blame for something like int overflow

C is not a good choice when safety is paramount.

"The low-level nature of C and C++ means that bit- and byte-level manipulation of objects is commonplace; the line between mathematical and bit-level operations can often be quite blurry. Wraparound behavior using unsigned integers is legal and well-defined, and there are code idioms that deliberately use it. On the other hand, C and C++ have undefined semantics for signed overflow and shift past bitwidth: operations that are perfectly well-defined in other languages such as Java."


We use Doors for that. I think we tried TeamCenter for one project but it was a bag of hurt.

The UI for Doors is a little clunky (looks like a win95 app), but it fulfills its purpose.

Doors equivalent functionality could be built with a web interface and a sql db backend. Maybe any kind of database would work, relational just makes the most sense to me.

That said, assuming DO178B is similar to DO254, it's more about making sure you have a design process and that you are following it, and less about how good that process is.

Ah yes DOORS! I remember using that on one of our projects. Pretty sure it was 787. I had the feeling that if used properly it could be a powerful tool. But I also remember the whole IBM Rational suite being incredibly slow...

Weird, I had to do the same thing (porting Ada to C) for a similar project (the JSF). It's an awful idea, because so much is missing in the translation.

> Then of course there's language choice. C is king...

Is that true for all of Boeing? I recall reading about the Lustre synchronous dataflow language being used by Airbus. It apparently allows all sorts of interesting safety guarantees.

Not sure about all of Boeing. I worked for a subcontractor who got a small piece of the project from Honeywell who contracted with Boeing. But we worked on aircraft from Boeing, Gulfstream, Airbus, etc and other than the 787 Ada stuff (which we ported to C), I don't recall seeing anything that wasn't written in C, C w/ classes (heavily stripped C++), or Javascript. Of course it's possible that my boss just asked for those types of projects since we had the most expertise there.

Curious: what was Javascript used for? (In terms of general usage contexts.)

That was a joke ;)

> Because that's how far behind avionics is. We were using Visual Studio 6 in 2011!

Wait, what about autoPilot.js ?

> Rust

so you want to fly in planes using software written in Rust

A subset of it, yes. Any dynamic memory allocation is out, which would pare it down heavily, but you can still reap the safety advantages with the static stuff.

Rust happens to have already defined a subset of itself that forbids dynamic memory allocation, which makes use of a pared-down standard library called libcore.

Awesome! I figured such a thing would exist eventually but didn't realize it was already in the wild.

I definitely don't see a problem telling people the plane was built with rust

Did you consider using dtrace for tracing?

Edit: Oops, misread the comment.

Different type of tracing. I'm referring to textual tags which are used to associate a design requirement with a piece of code, the idea being that you can show a requirement is fulfilled by showing that it traces to tests and code, and that the existence of any piece of code needs to be justified by tracing back to the requirements.


Borland's Caliber was once popular for requirements traceability, http://www.borland.com/Products/Requirements-Management/Cali...

Caliber looks great. I can only hope they're using something like that nowadays. Of course it has the word "Agile" in it so it's probably about 10 years too young.

Different type of tracing. This is basically tracing line items from a document to another set of line items on lower level documents, potentially all the way down to specific lines of source code.


> Sadly, my very first ever project as a programmer was porting an Ada codebase to C for the 787 (off by one errors for days...).

that's a terrifying statement. to those of us who know what a person's first programming projects are like, quality-wise

just terrifying. mindlessly stupid on your management's part

Keep in mind we're talking about C-level communications software, not an autopilot system. And the code was eventually tested heavily, just in a very inefficient and expensive manner. But I don't completely disagree with your assessment.

Do you think Java or a JVM language would have been better suited for this task than C?

fair considerations

Windows 98 had a similar bug where the system would hang after 49.7 days: https://support.microsoft.com/en-us/kb/216641

Although IIRC, the impact was limited, because it was quite a feat for a Windows 98 system to stay up for 49 days :)

Since we're talking about aviation, that same overflow bug caused a meltdown of Southern California air traffic control in 2010 after someone forgot to preemptively reboot their Windows servers every 30 days: http://www.techworld.com/news/operating-systems/microsoft-se...

Stupid question to those who work in the aviation industry:

This seems like a candidate for an automated script that reboots the server every n days or hours. Would such script be considered a potential risk? If so, why?

All of a sudden you have an emergency going on and a plane coming in in the middle of the night even though the airport is usually only open until 2100.

And there goes the servers because of a stupid script. :-]

Of course they went down anyway and by the far more likely cause of forgetting to reboot.

Today it is even better. 2GB of updates will get applied after rebooting, before system starts.

There was a story recently about a German 2nd division basketball team that ran their scoreboard on Windows and applied updates about an hour before the game. The scoreboard wasn't back up in time, and the team lost since the match couldn't start within the allowed timeframe. Of course this match was important, kind of an endgame, and the loss led to the relegation of the team.

And that is why WSUS exists...

Which is why you have redundant systems. The scheduled updates allow you to validate that both are working so that in the event of an unplanned downtime, you aren't stuck scrambling to figure out why the redundant system doesn't function at all.

Well, there is the problem of how you use the server while it's rebooting...

Yeah, and as I recall, it took something like 5 years to find this bug because no one had kept a machine up that long.

I found 98 respectable. It was 95 that didn't last a whole week!

"Windows Millennium Edition" was the worst.

"Malfunction Edition" as it was sometimes known

Win2k was golden.

My last surviving win2k machine finally died. Power supply would not fire up after a power cycle. I could repair it, but maybe it is just time to leave it be.

I had to abandon 2k after I tried swapping the motherboard on it, and ended up with the blue screen of "don't have driver X for critical hardware Y".

Yes, I had tried that earlier. Also tried upgrading to a later version, but not enough bios.

And it is running 32 bit windows on a 64 bit board.

I recall a long time back, when Linux was configured as standard for 100Hz ticks (aka "jiffies"), the counter was initialized close to wraparound instead of 0.

The result was you typically encountered "jiffy wraparound" after a few minutes of uptime. You learned whether your system was stable in this situation fairly quickly, rather than 248 (or 497) days later. Kernel developers typically don't have uptimes measured in days. Starting the counter close to wraparound increased the likelihood it was going to get code coverage.

I really love this methodology. If an exceptional case exists, and it's cheap to cause the exceptional case to occur during standard usage, then do it so that the code is well-exercised.
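A minimal sketch of the trick, for anyone who wants to borrow it. The names (`TICK_HZ`, `INITIAL_TICKS`, `timer_tick`) and the five-minute figure are assumptions chosen for illustration, not the kernel's actual definitions.

```c
#include <stdint.h>

#define TICK_HZ            100u   /* assumed tick rate */
#define WRAP_AFTER_SECONDS 300u   /* assumed: force a wrap about 5 minutes after boot */

/* Start value chosen so the 32-bit counter rolls over after WRAP_AFTER_SECONDS. */
#define INITIAL_TICKS ((uint32_t)(0u - WRAP_AFTER_SECONDS * TICK_HZ))

static uint32_t ticks = INITIAL_TICKS;

/* Called once per timer interrupt. Unsigned wraparound is well-defined in C. */
static void timer_tick(void)
{
    ticks++;
}

/* Ticks elapsed since boot; modular subtraction keeps this correct across the wrap. */
static uint32_t ticks_since_boot(void)
{
    return (uint32_t)(ticks - INITIAL_TICKS);
}
```

Any code that mishandles the rollover now breaks within minutes on every boot, instead of once after months of uptime.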

I found it curious that the journalist refers to the bug as a "vulnerability". This could be misinterpreted, given that the term is more commonly used in a security context.


There's your answer right there. They write stories for a living, not programs.

Some journalists still care to acquire correct terminology. If somebody reporting from court calls someone accused of car theft a "murderer" because they don't know the difference between the two, they'd probably get laughed at.

I guess somehow they got the idea that someone could breach the firewall between the entertainment system and the flight controls, and insert a bogus entry in the right spot to trigger this.

Frankly, with all the scaremongering going on with computer security these days, I kinda agree with Torvalds' policy of not highlighting security fixes in Linux patch notes.

They're actually not alone, Dell's EqualLogic (a big & expensive storage array) had the same problem, after 248 days.

They would initiate a controller failover and reboot: https://ma.ttias.be/248-days/

There was a similar issue in the Linux kernel in version 2.6.32 where the kernel would crash after 208 days: http://www.novell.com/support/kb/doc.php?id=7009834

This was a serious problem in some storage systems too: https://www.ibm.com/developerworks/community/blogs/anthonyv/...

I find it, well, interesting to read that the "Fail Safe" mode is to deactivate all power systems on the plane.

They probably decided that starting from a known state after a major problem was better than starting from an inconsistent state.

I find it troubling that the generators can't reboot w/o continuing to supply power, that reboots aren't staged to ensure that the plane continues to have power, and that power from the generators is necessary to control the aircraft. Isn't there battery backup to enable the plane to continue operating normally while the generator reboots?

From the article it looks like their whole failsafe/redundant system architecture is flawed.

Just like buffer bloat, reboot times and corruptible system state are a chronic systemic flaw in modern technology.

The only stuff that works is the stuff that is used all the time. Look towards crash only software [1] and microreboots [2,3]

[1] https://www.usenix.org/legacy/events/hotos03/tech/full_paper...

[2] https://www.usenix.org/legacy/event/osdi04/tech/full_papers/...

[3] http://dslab.epfl.ch/pubs/perfeval.pdf

The actual reboot takes something around a minute, and the plane is designed to keep flying while rebooting. It's during take off or landing that this would be a real problem.

Think Starship Enterprise. You can't turn off the whole ship, or everyone suffocates. You run diagnostics system by system and keep the warp core online as much as possible so you don't run out of power.

Reminds me a little of the floating point precision bug in the patriot missile targeting systems, where the longer it was left on, the less accurate it got.
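For reference, the usual account of that bug is a fixed-point truncation rather than floating point proper: the clock counted tenths of a second, and 0.1 was stored with only about 23 fractional bits. A rough sketch of the resulting drift, with those constants taken from the commonly cited description and to be treated as assumptions:

```c
#include <stdint.h>

#define FRACTION_BITS_SCALE 8388608.0   /* 2^23, assumed fractional resolution */

/* 0.1 truncated (not rounded) to the assumed fixed-point resolution. */
static double truncated_tenth(void)
{
    uint32_t scaled = (uint32_t)(0.1 * FRACTION_BITS_SCALE);   /* floor(838860.8) */
    return (double)scaled / FRACTION_BITS_SCALE;
}

/* Accumulated clock error in seconds after `hours_up` hours of uptime. */
static double clock_drift_seconds(double hours_up)
{
    double tenths = hours_up * 3600.0 * 10.0;   /* clock counts tenths of a second */
    return tenths * (0.1 - truncated_tenth());
}
```

Under these assumptions, `clock_drift_seconds(100.0)` comes out at roughly a third of a second, which is a long time when the target is moving at missile speeds.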


I file it under "funny" but certainly it is nothing unusual or surprising. Software, like anything else, has faults and breaks. Even on an airplane.

It mostly is funny to us, developers, because we have all been trying to convince our bosses that "it will never happen". That object_id being an int4 sequence? You would need one object a second for 70 years to overflow. And yet, somehow, it does, e.g. because someone loaded data with object_id set to 1.9B and the sequence followed from there.

P.S. My favorite pastime? Watching Aircraft Disasters series in an airport. Not brave enough to watch it during the flight yet. Karma might be a bitch and I do not want to test it 10 km up there ;-)

Like everything else? When was the last time the Golden Gate Bridge collapsed? :) "Everything else" does not include most engineering output. In software, faults are more common because of the tools we are using and because no life is in danger if Twitter is down. On the other hand, we cannot allow a bridge to collapse or an airplane to fall out of the sky because it has a fault. There are several techniques to build reliable systems out of non-reliable parts.

Bridge collapses are actually not that uncommon. While often times overloading or damage is the cause, sometimes it's due to design flaws.


I think the point is the frequency.

I don't think this is a one to one comparison, bridges etc get maintenance work done all the time to fix small issues, and they do fail.

See above, I am talking about the frequency of catastrophic failures. I agree it is not a good comparison.

Presumably all of Boeing's other planes have such counters in their systems, and don't have this bug (or if they did, it was corrected already), so why only the 787? That's what I find most surprising.

Edit: one theory that seems plausible is that they were "overly paranoid" and put in overflow checks, on a time counter whose overflowing would not have had negative effects otherwise since the other code was designed to handle a wraparound correctly.

Software tends not to be reused between planes unless you go back to the same vendor and there are no major hardware changes with the component as well. Aircraft software is kind of a broken world.

Broken Aircraft software makes me wanna rethink the notion of correctness, or broaden the scope of failure and function.

Right? If aircraft software is broken, but my linux desktop is supposed to be the picture of success, I'm not sure the definitions are meaningful. :-)

And I was serious. We should study this and see why what could be described as a fault, a bug, etc ... is actually not that meaningful.

Those are features!

They are meaningful within their own contexts.

Exactly, it's a holistic approach, everything matters differently depending on the context. Sadly it kills my dreams of unification on the way.

Maybe it's for the best. Ariane 5 comes to mind.

Most expensive software reuse ever.

I think RTW (reinventing the wheel) is a favorite pastime in avionics, right up there with DRY (do repeat yourself) and TED (test eventually design).

Or it could be that they use fixed circuits rather than software.

Could abstract interpretation (http://www.astree.ens.fr/) or some other formal method have prevented it?

Someone wrote something like:

  int32_t ticks; // 100ths of a second
which overflows in 248 days, a particularly unfortunate amount of time because it doesn't show up during testing.

Although it would be a good engineering choice, a formal verifier would say that:

  int64_t ticks; // 100ths of a second
is also incorrect, since it also overflows (after 10^9 years).

In a hard real time system,

  mpz_t ticks; // 100ths of a second, infinite precision libgmp type
is still formally incorrect, since as the number grows in size it will eventually exceed a time limit or memory (after 10^10^9 years).

The overall lesson from formal methods is that it's impossible to write formally correct useful programs. So programmers just muddle through.
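The overflow horizons quoted above are easy to sanity-check. A small sketch, where `days_until_wrap` is a made-up helper and not from any real codebase:

```c
#include <stdint.h>

#define SECONDS_PER_DAY 86400.0

/* Days until a counter that can hold 2^value_bits ticks wraps at the given
 * tick rate. value_bits must be below 64. */
static double days_until_wrap(unsigned value_bits, double ticks_per_second)
{
    double capacity = (double)(UINT64_C(1) << value_bits);
    return capacity / ticks_per_second / SECONDS_PER_DAY;
}
```

`days_until_wrap(31, 100.0)` is about 248.55 days for the signed 32-bit counter at 100 Hz, and `days_until_wrap(63, 100.0) / 365.25` is roughly 2.9 billion years for the signed 64-bit one. The same arithmetic gives the Windows 98 figure quoted elsewhere in this thread: `days_until_wrap(32, 1000.0)` is about 49.7 days for an unsigned 32-bit millisecond counter.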

As you have written, a formal test/analysis will always detect that a monotonically increasing tick counter is not bounded by an upper limit. And the obvious solution is that you don't rely on such a thing, but define your API such that the (preferably unsigned, but it doesn't matter) tick counter will roll over in a well-defined way.

If the algorithms really depend on an always monotonically increasing tick counter (which I doubt), the solution is quite easy: After 2^30 ticks set a flag which raises the "service needed" light in the cockpit, taking the plane out of service until it's power cycled. By this you explicitly state that your device cannot be used longer than 120 days continuously.
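A rough sketch of that workaround, assuming a 100 Hz tick. The struct and function names are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2^30 ticks is a bit over 124 days at 100 Hz, well short of the 2^31 overflow. */
#define SERVICE_THRESHOLD_TICKS (UINT32_C(1) << 30)

struct uptime_monitor {
    uint32_t ticks;           /* ticks since power-up */
    bool     service_needed;  /* latched; cleared only by a power cycle */
};

/* Called once per tick. The counter keeps running so existing code that reads
 * it is unaffected; the flag just announces that the tested range has ended. */
static void uptime_tick(struct uptime_monitor *m)
{
    m->ticks++;
    if (m->ticks >= SERVICE_THRESHOLD_TICKS) {
        m->service_needed = true;
    }
}
```

The appeal is that this is a small, local change that can be tested on the bench by starting `ticks` just below the threshold.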

Agree with the first paragraph, but in the second I don't see how requiring a periodic reboot is a solution. Your "service needed" light is a "Case closed, WON'T FIX" message made real.

Airplanes already have an elaborate schedule for mandatory periodic service. Pressing a "reset" button once every 30 days is pretty trivial compared to dismantling the entire engine every couple of years.

What made this bug dangerous is that nobody knew about it, that's the main problem that needs to be solved.

On the assumption that there's a complicated control algorithm which, unfortunately, does arithmetic on (now-then) tick-values everywhere... but this algorithm has been extensively validated to be correct on the non-overflowing case, and it will take a while to find out how it handles discontinuities in its time scale.

Then the simple "raise service-needed signal" would be a valid workaround and an easily testable local change for the next two years, until the properly fixed algorithm has gone through testing and certification.

A general solution to the overflowing-clock problem is to deal with the clock in modular arithmetic. When wanting to know if t2 comes before t1, check the MSB of the modular difference.

  uint32_t t1 = ...;
  uint32_t t2 = ...;
  if ((uint32_t)(t2 - t1) >= UINT32_C(0x80000000)) {
    // t2 is before t1
  } else {
    // t2 is after or equal to t1
  }
What this gives us is that if the difference of the actual times (not these uint32 representations which are ambiguous modulo 2^32) is less than 2^31 units (plus minus one maybe..), this check will give the expected result. This does allow a correct system that never fails if the timing/duration of the events is suitably limited.
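The same check can be wrapped in a function and exercised across the roll-over point. This is only a sketch; the name `tick_before` is made up here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if t2 is strictly before t1, assuming the real times are less than
   2^31 ticks apart. Unsigned subtraction wraps modulo 2^32 by definition,
   so the top bit of the difference tells us which side we are on. */
bool tick_before(uint32_t t2, uint32_t t1)
{
    return (uint32_t)(t2 - t1) >= UINT32_C(0x80000000);
}
```

For instance, `tick_before(0xFFFFFFF0, 0x00000010)` is true: the first value was sampled just before the counter wrapped, the second just after.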

For example you time events at a fixed time interval, and it will keep going forever in spite of clock roll-over.

  uint32_t next_time = now();
  while (1) {
    // busy-wait while now() is still before next_time
    while ((uint32_t)(now() - next_time) >= UINT32_C(0x80000000));
    next_time += interval;
    // handle the timed event here
  }
The timing events also need to be processed quickly enough of course (that printf shouldn't block for longer than about 2^31 ticks).

Using this technique to compare the current time to engine start time would cause the exact problem described.

It would. Which would fail the condition that I mentioned "if the timing/duration of the events is suitably limited". So you should just not do what you suggest :)

Isn't the actual problem using raw integers to represent time, instead of a proper date/time data type and supporting (tested) library functions?

I disagree. You can reduce space usage with a logarithmic complexity. A couple tens of bytes is enough to store milliseconds until the heat death of the universe.

Just drag the complexity of bignum arithmetic into a hard-realtime embedded system... What could possibly go wrong?

I was not implying that.

Nothing. Built in arithmetic will do the job nicely.

> int64_t ticks; // 100ths of a second

I would go with uint64_t

as it documents "ticks" as a variable that can not hold negative values and also doubles its range of positive values.

I prefer signed time types, because you frequently subtract them. If you use unsigned, then you have to cast the result to signed every time:

  int64_t elapsed = (int64_t)(t1 - t0);
And it's very easy to cause disaster with:

  if ((t1 - t0) > 50) ...
which also succeeds if t1<t0.
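A small sketch of that pitfall, with hypothetical function names. The two functions have the same body and differ only in the signedness of their arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Unsigned difference: wraps to a huge positive value when t1 < t0. */
bool elapsed_over_50_unsigned(uint64_t t1, uint64_t t0)
{
    return (t1 - t0) > 50;
}

/* Signed difference: a negative elapsed time stays negative. */
bool elapsed_over_50_signed(int64_t t1, int64_t t0)
{
    return (t1 - t0) > 50;
}
```

With t1 = 10 and t0 = 100, the unsigned version reports that more than 50 ticks have elapsed, while the signed version does not.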

While it's theoretically possible, using all 64 bits is tricky and very hard to test.

The "doubles its range of positive values" is a weak argument because you should never be reaching values more than a few decades, never mind 10-100x the age of the universe. Such a state is a bug.

The "can not hold negative values" argument is also weak because a uint does not prevent generating negative values - it only prevents you from knowing that you've generated negative values. Such a state is a bug.

Using a uint only serves to make it harder to test when your system is in an invalid state.

I would go with

uint64_t ticksOfDuration10ms; // No comment necessary

The concept of timer "ticks" is well established as a unit of time in embedded programming. It's almost universally included in your embedded (realtime-)OS and might increase at any conceivable rate, limited either by hardware constraints (e.g. a fixed, simple, 16-bit ripple counter that is clocked by the main CPU clock of 8 MHz will clock at 122.07 Hz) or by your application requirements (you let a slightly more configurable timer only count to 40000 at half the CPU clock to get exactly 100 Hz). Hence you shouldn't explicitly inscribe the tick rate in your symbol name, as it can change when requirements change.

You'll almost always have a global variable, preprocessor define... or something similar to get the frequency (or time increase per tick), which you should use whenever you have to convert "ticks" to actual physical units. If the actual effective tick rate is visible at many places in your code, whether as a symbol name or as a comment, you are most certainly doing something wrong.
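To make that concrete, here is a rough sketch with a single define as the only place the rate appears. `TICK_HZ`, `ticks_to_ms` and `ms_to_ticks` are illustrative names, not from any particular RTOS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single point of truth for the tick rate. */
#define TICK_HZ 100u

/* Convert a duration in ticks to milliseconds.
   Assumes ticks is small enough that ticks * 1000 does not overflow. */
uint64_t ticks_to_ms(uint64_t ticks)
{
    return ticks * 1000u / TICK_HZ;
}

/* Convert milliseconds to ticks, rounding up so a timeout is never cut short.
   Assumes ms is small enough that ms * TICK_HZ does not overflow. */
uint64_t ms_to_ticks(uint64_t ms)
{
    return (ms * TICK_HZ + 999u) / 1000u;
}
```

If the rate ever changes, only the define changes; callers keep working in physical units.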

I think you kind of missed the point of my post (which was a bit tongue-in-cheek). The original code fragment had the tick duration embedded in a comment, so changing a global variable which defines it as something other than 10ms is going to cause all sorts of problems in maintaining that code. (Leading possibly to the very problem Boeing had).

...well, then my irony-detector is broken ;-).

Why not uint64_t thisVariableIncrementsByOneEvery10ms?

uint64_t thisVariableIncrementsByOneEvery10msSoItWontOverflowForAReallyLongTime

(then you'd know it was safe)

Another good practice is to initialize the time counters to something close to the overflow point, rather than zero. This encourages overflow bugs to show up at timescales where they will be noticed during testing, rather than after 248 days of service.

This is a scary-ass bug in a codebase that was supposed to be authored to strict professional standards.

The Linux kernel does that. The 'jiffies' counter's initial value rolls over 5 minutes after boot:
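As far as I recall, the kernel does this with an INITIAL_JIFFIES define of -300*HZ cast to an unsigned type. A userland sketch in the same spirit, assuming a 100 Hz, 32-bit counter (`INITIAL_TICKS`, `tick_isr` and `uptime_ticks` are made-up names):

```c
#include <assert.h>
#include <stdint.h>

#define TICK_HZ 100u   /* assumed tick rate */

/* Start the counter 300 seconds before the 32-bit wrap point. */
#define INITIAL_TICKS ((uint32_t)(0u - 300u * TICK_HZ))

static uint32_t ticks = INITIAL_TICKS;

/* Hypothetical timer interrupt body: one call per tick. */
void tick_isr(void)
{
    ticks++;
}

/* Ticks elapsed since boot; the modular subtraction stays correct across the wrap. */
uint32_t uptime_ticks(void)
{
    return (uint32_t)(ticks - INITIAL_TICKS);
}
```

Any code that compares raw tick values with a plain < or > will then misbehave five minutes into every test run instead of months into service.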



but I would suspect Boeing used something better than C.

I honestly can't decide whether this is serious, satire, or conspiracy theory, but it's awesome nonetheless. My first project working on the 787 was converting an Ada codebase to C.

Isn't Ada precisely suited for this sort of application? What was the motivation for switching to C?

> What was the motivation for switching to C?

Invariably, cost. SPARK Ada is demonstrably superior to C for safety-critical development (I can't cite the sources for this, but a major company developing safety-critical software has shown this to be the case).

But, SPARK Ada requires a lot of highly skilled manpower and it's slow to develop. C gets the job done, albeit with lots of bugs.

If the industry is unwilling to invest in the training or tooling for a safe language like SPARK Ada, is there research into "easier" safe languages, something between C and Ada? Or do companies like Boeing still expect to be writing new avionics safety in C in 2030 or 2040?

Yes, Real-Time Java being an example.

Realistically, it seems to me that avionics etc. will be written in C for a very long time to come. It all comes down to the cost and availability of programmers.

C is incredibly portable and tons of programmers know it. Ada is great in a lot of ways but just never got the traction it needed to be #1.

Sure, but shouldn't safety be the number one concern here? Programmers can always be trained to learn it, as long as they demonstrate competence. It seems like an unfortunate case of trying to cut costs at the cost of safety.

This failure mode in particular was deemed exceedingly unlikely by Boeing, which got them an exception to some initial airworthiness issues with the RAT, which in turn would have made a total loss of power catastrophic.

They can deem things unlikely? That seems broken in general. I would deem it unlikely they'd ship with any errors they didn't deem unlikely; those are precisely the failure modes we should most look for...

The entire aircraft is an electro/mechanical system with many thousands of things that could go wrong, but are deemed unlikely. All engines could fail at the same time, but it's deemed unlikely. Redundant hydraulic systems could fail together, but it's deemed unlikely. There is no certainty in systems this complicated.

IIRC, the VisualWorks VM had such a bug that would mysteriously crash an automated airport people-mover after some interval, like 45 or 90 days. (Software crash, not train-hardware crash! Train would simply stop.) Also, as I recall, the train software project did not use automated tests at all! (By that time, VisualWorks VM was implementing them.)

(Learn from history. Don't cling so hard to the notion that your language will make you into super-programmers. Certainly, some tools are better in certain contexts than others. However, group culture and the quality of working relationships often have an effect even greater than choice of language. Besides, people often dislike someone who projects an air of superiority.)

FYI - Engadget has very intrusive advertising that you can't close on a mobile device: http://i.imgur.com/nqgc2p7.png

"tech for ladies", aka a power bank with a led flash and a "designer" case...

Is it really an overflow bug? Or a counter wraparound bug?

For example, incorrectly using a X > Y comparison on values that are congruential (and do not overflow) isn't an "overflow bug". You can only locally compare values that are close together on the wheel, using subtraction.

The simple thing to do with tick counts is to start them at some high value that is only minutes away from rolling around. Then the situation reproduces soon after startup, rather than days or months later, and its effects are more likely to get caught in testing.

Another thing you can do is reduce the range of tick counts. Say you have a 32 bit tick count which increments a hundred times a second, but the longest period (biggest delta between any two live time values) you care about in your module (driver or whatever) is well within 30 seconds. That's only 3000 ticks. Then, whenever you sample the counter, you can mask it down into, say, the 13 bit range [0,8192): effectively a tick counter that rolls over every 81.9 seconds (which you treat correctly as a 13 bit value in your calculations like is_before(t0, t1) or add_time(t0, delta)).
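Roughly, the masked-counter idea could look like the sketch below. The 13-bit width comes from the example above; `sample_ticks` and the bodies of `is_before` and `add_time` are my guess at what such helpers would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_BITS 13u
#define TICK_MASK ((1u << TICK_BITS) - 1u)   /* 0x1FFF, range [0, 8192) */
#define TICK_HALF (1u << (TICK_BITS - 1u))   /* 4096: half the wheel */

/* Reduce a raw hardware counter sample to the 13-bit wheel. */
uint16_t sample_ticks(uint32_t raw)
{
    return (uint16_t)(raw & TICK_MASK);
}

/* t0 + delta on the wheel. */
uint16_t add_time(uint16_t t0, uint16_t delta)
{
    return (uint16_t)((t0 + delta) & TICK_MASK);
}

/* True if t0 is strictly before t1. Valid while the two real times are
   less than 4096 ticks (about 41 s at 100 Hz) apart. */
bool is_before(uint16_t t0, uint16_t t1)
{
    return ((uint16_t)(t0 - t1) & TICK_MASK) >= TICK_HALF;
}
```

Since the wheel rolls over every 81.9 seconds, any comparison that forgets the modular arithmetic fails within the first couple of minutes of a test.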

There's no need to reduce the range, you can just treat correctly the full counter range (see my other comment).

Well, it's a tautology that if you treat correctly anything, you don't need any defensive tricks.

Treat all the ones and zeros correctly and everything else takes care of itself.

You mentioned correct treatment first :) I'm just saying that masking the clock is unnecessary and doesn't make correct treatment any easier.

Or you just use a 64 bit variable and don't ever bother with this again in mankind's existence in this galaxy.

Assuming the frequency stays in the same ballpark. :)

Current workaround: Restart the estimated 28 U.S. planes at least every 120 days[1].

Wonder how long it could take for the update to be actually available (after testing, approving, ...). Are we talking weeks, months, years?

[1] https://s3.amazonaws.com/public-inspection.federalregister.g...

Actually I am a bit surprised that somebody would want a plane to be powered on for such a long time. There is no way one could fly for that long anyway and they are regularly taken to service between long hauls.

I'm guessing these GCUs are like the ECUs in a car - even when "off" they're still powered and running in a standby mode. The fact that the AD contains messages such as "batteries do not need to be disconnected" and that this "electrical power deactivation" takes approximately an hour of work suggests something needs to be gotten at, unplugged, and replugged.

Boeing 787 aircraft have been in service for longer than 248 days (entry into commercial airline service was in 2011), with zero confirmed encounters with the bug in the wild. This suggests that your "always on even when off" theory does not hold up.

You know this how? Could it be that these reboots are happening on the ground? I also have in my hand zero confirmed encounters, that doesn't prove anything.

Well, A) sensationalistic news would have sensationalized the hell out of any live incident, in the air or on the ground, and B) aircraft maintenance and utilization schedules being what they are, the likelihood of any 787 actually going 248 days continuously powered on is... technically not zero, but effectively indistinguishable from zero.

I completely agree with you - on at least three of the dozen or so flights I've taken this year, when there was a problem with the passenger area (Audio in one case, WiFi in another, and finally my POWER connector in the third) - the flight attendants power cycled the entire system, which took about 15 minutes, and let me watch the Linux boot process on the back-of-seat console.

My suspicion is the "Reboot" approach is pretty common to aviation systems. It wouldn't surprise me that many of the components are rebooted daily, and almost certainly on a weekly basis.

120+ days without a reboot sounds unlikely to me.

Dittoing userbinator here. Avionics systems aren't generally treated so casually. DO-178B [0] has several levels of designations for criticality of systems. A is the highest. Essentially, if these fail, people will die. E is the lowest. If these fail, people are inconvenienced. Passenger AV systems, wifi, those are conveniences and are not held to a terribly high standard.

[0] http://en.wikipedia.org/wiki/DO-178B

IFE and other "passenger amenities" are not considered critical-to-flight, and it definitely shows in their reliability (or lack thereof)... the avionics are designed to a much higher level of reliability.

Jtsummers, userbinator: This 787 bug shuts down the generators which I understand provide only the AC power aboard the aircraft? How critical is this AC power?

B787 is a mostly electric airliner. There are far fewer hydraulic/cable operated systems than in previous planes. This is much safer since an explosion/leak/breach/clog in a hydraulic line won't take out an entire hydraulic system (most planes have 3 systems and fuse valves to mitigate this).

However, since there are so many electrically operated systems, you really need power. The B787 is also No Bleed [1], so electrical power is also used to pressurize it [2]. Need Electric Power.

[1] http://www.boeing.com/commercial/aeromagazine/articles/qtr_4... [2] http://www.airliners.net/aviation-forums/tech_ops/read.main/...

Isn't most of that DC power though? How much of it is AC power (like the power that each passenger seat gets?)

That's the power that I was talking about requiring a reboot. Not sure if it's related to the AC power associated with the bug in question - possible there are two AC power systems on the plane?

Starting the plane's flight computer takes a long time, as in hours [1], because there are tons of automatic checks to be done. We're not talking restarting the DVD player. Also, when planes like these are at the gate, they run off the Ground Power Unit (GPU), basically a wall power cord. Because of this, the engines and other systems can be off, but the core flight computer is still on. If they could, airlines would never restart these. Remember, it needs to do things like reset all the ring laser gyros, inertial reference units, etc. because these are voting systems, so they need to figure out what truth is. As such, the boot process is understandably complex.

[1] http://www.reuters.com/article/2014/02/04/us-poland-dreamlin...

Agreed... while this is a software design issue I too would be surprised to see a plane powered up for that long continuously. Needs to be fixed, yes... causing a lot of issues at the moment? Probably not.

Wow, how can something like this happen? I thought airplanes had triple redundant software systems using 3-version programming [1] in order to avoid such bugs/problems. Can anyone familiar with flight technology shed some light on this?


Not an airplane programmer, but I seem to remember that the literature says that it's not generally a cost-effective way of finding bugs. In particular, you multiply the cost of development by (say) 3x (which is fine, on its own) but also the places where bugs are inserted are typically the hard parts; so you don't reduce the number of bugs as much as you'd like; it can easily be more cost effective to invest the few million in static analysis etc.

As much as we'd like plane manufacturers to test things to death, it'd become too expensive too quickly. For all we know, this software could be written by a contractor, or the firmware for a third party part.

As far as I know, N-version programming was effective when software systems were small (shuttle ran on 50k lines of code) and where poring over every single line was possible, because the hard part was coming up with the spec.

Nowadays a big plane like the A380 might be expected to have 100M lines of code in its subsystems, and it's simply too expensive.

> Nowadays a big plane like the A380 might be expected to have 100M lines of code in its subsystems,

Why does an airplane require 100M lines of code?

Linux kernel alone comes in at 15M SLOC. Now add a userland subsystem and you're at 20-30M just for one device.

Multiply by all the little and big subsystems, the embedded chips, in-flight entertainment, network gear... 100m SLOC is too low, I think.

Sorry if this is incredibly ignorant, but I can't believe flight control systems are running Linux?

Do these systems not have hard real-time requirements about the execution time and periodicity of tasks which can't be guaranteed by the time-sharing scheduling algorithms in Linux?

Real time systems will be running an RTOS: VxWorks or QNX or something equivalent to that.

They'll definitely build a prototype using Linux but they won't get that certified so it literally 'won't fly', it's just a means to speed up initial development.

Is the order of magnitude of lines of code in QNX different from that of linux? At a first approximation, I don't see why it would be.

The QNX kernel is very small compared to the Linux kernel.

Small enough that I could reimplement it in approximately 3500 lines of code + another 850 for the virtual memory management.

Wow, I had no idea. Since their source is closed and untouchable I had no way to check either. Is there any reason there aren't several certified open RTOSes around?

I don't know if there aren't any open certified RTOS's around, but I can explain the 'why' part easily: if you pay for the certification of an open RTOS then everybody that can use one will say 'thank you' for the effort and that's that, since the certification would apply to any and all copies of that particular version. So you're essentially paying for the privilege of cutting your competitors a break.

This could only work if the entity paying for the certification had a way of making that money back somehow and I don't see how that could be done.

Not an RTOS, but seL4 is a correctness-proven open-source ARM microkernel: https://sel4.systems. Looks like a mixture of public and private funding. It's part of the L4 family, http://en.m.wikipedia.org/wiki/L4_microkernel_family#Univers... which includes OKL4 (deployed on 1B+ ARM-based mobile phones) and http://genode.org (x86/ARM) from Dresden.

Not to detract from the fine work done by the seL4 folks, but there is a large gap between what they have and what DO-178C requires for level A software. Like many other bureaucratic organisations, the FAA (and other regional equivalents) has a process with its own set of rules (MC/DC testing, requirements/design traceability artifacts, etc).

It would cost a significant amount of money to develop the necessary artifacts and engage the FAA to obtain a certification.

That's absolutely true but something like this could be a good starting point.

What I think the whole thread above misses is that the economics simply aren't there. Cost isn't the limiting factor for the OS licenses for avionics, but an extra certification track (especially for a fast moving target) would be. Besides, it is not just the OS that gets certified: you will also have to (separately) certify (usually) the hardware that it runs on (unless you're going to use a design that has already been certified).

That means that modifications are expensive and that 'known to be good' trumps 'could be better' or 'could be cheaper in the longer term'.

Someone would have to come up with a very good reason to see open source trump the existing closed source solutions.

In theory, while certification would be done on a binary derived from seL4, any improvements resulting from the certification process could benefit the open-source core and derivative binaries. Compared to a proprietary OS, improvements would have ecosystem-wide benefits.

In addition, a modular microkernel architecture could use reproducible builds to generate identical binaries from identical source. This would enable binary components to be certified both separately (akin to unit testing) and as an integrated system (mix and match components). This could reduce overall duplication and certification costs, even among competing commercial products derived from seL4 components.

That's super cool, thank you for that link.

There are several real-time Linux variants.

There are - to my knowledge, feel free to correct me - no real time linux versions (or even any version of linux) that are currently certified for avionics (DO-178B certification is required for that, there are multiple levels and I don't know of any linux distro (or just the kernel) with that certification).

Avionics certifications are well past the extent of my knowledge and I will gladly accept your point. I was generally addressing the idea that the need for preemptive scheduling precludes the use of a Linux-like kernel.

I've never heard of anyone getting Linux running under hard real time constraints. You can get it pretty good (excellent for real time audio, for example), but can never be 100% sure you're going to meet the deadline. The people I've talked to who tried said by the time you strip out enough of the kernel to approach hard real time, you've lost enough of the advantage of using Linux that you may as well switch to an RTOS.

Real time audio only if you use a large enough buffer and a hardware component to clock the data out.

It's all about the latency guarantees and that means that you're going to have to inspect each and every path through the code for length. With the complexity of the linux kernel that's a pretty tough job and I suspect that anything lower than a few hundred milliseconds (guaranteed!) is out of the question.

I built a little controller using linux that required hard real time during a long (multiple minutes) phase of operation, and the way I hacked it was to simply disable all interrupts and recover the various drivers as well as possible once that phase was over. It worked well but was mostly deaf to input during that time except for polling one 'stop' switch which would cause the machinery to coast down to a halt, after which interrupts would be enabled.

Good enough for tinkering but I certainly would not bet anything in production on that strategy.

Real time is hard, soft real time is hard enough (without guarantees but with a best effort and a very large fraction of the deadlines satisfied), hard real time (no misses at all, guaranteed) is hard for a kernel of any complexity.

Erm, the article I got my numbers from is: http://www.aerospacelab-journal.org/sites/www.aerospacelab-j...

where it states that the airbus A380 has more than 100 million lines of code in its avionics systems.

3/4 of that 15M is drivers and filesystems you won't be using, and this is not including the entertainment systems.

They should be able to keep it under 10M lines pretty easily if they actually cared about bloat.

A complicating factor is that you really should include your compiler as part of your codebase (if you're doing lots of formal method work, you will usually do it on the source code, and the compiler can invalidate all of the guarantees you carefully program in), and this will be millions of lines of code (where you can't simply factor a tonne out).

Think GCC 5 is about 15 million or so now.

it's always easy when you're not the one making the changes on the system you don't understand. is there a name for that? I feel like it's a common enough thing people do that it should have a catchy name.

Eh. To say that something that used to be mechanical should be possible with a mere few million lines of code seems pretty obvious to me.

Or compare to the Apollo missions and the space shuttle being well under a million.

I'm not saying that it would be easy to redo the entire system from scratch now, I'm just saying that if it was a design goal from the start it wouldn't have been very onerous.

Because each discrete component that you can eliminate (and replace with code) is a weight saving. Because once you tip over a point in complexity, you just keep adding more code to guard against more edge cases - edge cases you can't avoid because they involve crashing into mountains. Because you want to offload as much effort as possible from the cockpit crew, while still allowing them full control over the automated features.

All of these things just add more and more code.

"independently generated from the same initial specifications"

For all we know, if the spec required a counter of 10ms intervals, all N implementations could contain the same bug.

N-version avoids random, but not systematic bugs.

The complexity of the software that manages the N versions seems terrifying, and decision-by-committee in the implementation seems problematic. 3 implementations could select different sequences of responses to a situation, each of which individually is correct, but which are disastrous when combined by vote. E.g. say they're voting on 3 different numbers that have to add to 1; proposals are (1 0 0) (0 1 0) (0 0 1). Consensus is (0 0 0), oops there goes a billion kilowatt dam.
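The three-proposal example can be reproduced with a per-component voter. The sketch below assumes a median-of-three vote on each component, which is one plausible scheme and not necessarily what any real system does; `median3` and `vote_components` are hypothetical names.

```c
#include <assert.h>

/* Median of three values, used here as the per-component voter. */
static int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }
    if (b > c) { int t = b; b = c; c = t; }
    if (a > b) { int t = a; a = b; b = t; }
    return b;
}

/* Vote each of the n components independently, write the result to out,
   and return the sum of the voted components. */
int vote_components(const int *p0, const int *p1, const int *p2, int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        out[i] = median3(p0[i], p1[i], p2[i]);
        sum += out[i];
    }
    return sum;
}
```

Each input (1 0 0), (0 1 0), (0 0 1) sums to 1, but the voted output is (0 0 0) with sum 0, so the invariant that every lane respected is lost in the vote.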

I remember a legacy web service system we used to support which would crash after a couple of days, after hogging all the resources. The first thing we did was to set up a monitor with a daily restart script. That was a much cheaper and quicker fix than fixing the memory leaks, which took 3 months to reach production.

I had a memory leak in a Python program (that's a really rare thing), that would trigger OOM kills in about 3-4 days. After a few days of investigation that yielded nothing, I put a restart job every day or so, and returned to it only after a few months when I had some time. Eventually it came down to someone replacing dict.get() with dict.setdefault() in a lookup dictionary in some utility library, causing each miss to leak a small, non GC-ed empty entry in an otherwise small lookup table.

We've also had a web service that grew slower and slower, until the web server was restarted. But I tracked it down to this gem: http://bugs.otrs.org/show_bug.cgi?id=9686

(P.S. the link to the pull request in the bug tracker is wrong, because the repo was once deleted and re-created, losing all the old pull requests).

This and other stories are claiming it is an integer overflow, but I've seen no source for that. It seems to be just speculation based on the observation that a 100 Hz 32-bit counter would behave similarly.
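The arithmetic behind that observation is easy to check. A small sketch, assuming a signed 32-bit counter (31 usable positive bits) incremented at 100 Hz; `days_until_overflow` is a made-up helper.

```c
#include <assert.h>
#include <stdint.h>

/* Days until a counter with `bits` usable positive bits overflows
   when incremented tick_hz times per second. */
double days_until_overflow(unsigned bits, double tick_hz)
{
    assert(bits < 64);
    double ticks = (double)(UINT64_C(1) << bits);
    return ticks / tick_hz / 86400.0;
}
```

`days_until_overflow(31, 100.0)` comes out at about 248.55 days, which matches the 248-day figure; an unsigned 32-bit counter at the same rate would last about 497 days.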

Coming soon: OTA updates for Boeing aircraft. What could possibly go wrong?

well, at least it's not a pacemaker.

It's a terrible oversight, and makes me wonder about the rest of the code, but are there very many airliners that are online for that long at once? I don't know much about how commercial air travel works behind the scenes.

The Lord could not count grains of sand with a 32-bit word

This reporting comes on the heels of a GAO study on hacking airliners. It is not clear why the Congress ordered a study on hacking airliners, though there's a long list of things (MH flights, Carter's claims of a 'cyber Pearl Harbor') that some people might speculate over.

Does anyone know the impetus behind the study?

Probably part of a larger security initiative with money to pay for studies. I'd wager there's a Congressional committee tasked with writing policy to secure critical infrastructure like power plants and such. After 9/11, you can bet planes and FAA systems would be a part of this.

There are definitely security initiatives like this, there have been since before 9/11 (and an uptick afterwards), but it is unusual for the Congress to be involved or to demand a study.

How come it is not Java?

In its current incarnations, it is considered uncertifiable for high criticality levels under DO-178C. May be used in entertainment systems and such though.

You mean it is considered to be crap?)
