
The software in the case of the 737 max performed exactly according to the spec. The problem is that the spec was buggy. The language can't fix a buggy spec.

Note that the code which caused the Ariane V disaster was written in Ada, and that failure was caused by the language. If the Ariane V code was written in C and the value simply overflowed, nothing negative would have happened. (The value would be hilariously wrong, but that wouldn't have mattered because the code wasn't necessary during flight. It was an oversight that it was running at all.) However, this is Ada, so it caused an integer overflow exception. Which was uncaught. Which caused the entire system to nope out. Which caused the rocket to blow up.
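
To make the contrast concrete, here's a minimal C sketch (not the actual Ariane code; the name and numbers are made up) of the kind of float-to-integer narrowing involved:

    /* Minimal sketch, not the actual Ariane code: a 64-bit float (horizontal
       bias) gets narrowed to a 16-bit signed integer. */
    #include <stdint.h>
    #include <stdio.h>

    static int16_t narrow_horizontal_bias(double bias)
    {
        /* In C this out-of-range conversion is undefined behaviour; in practice
           you typically just get a meaningless value and execution carries on.
           In Ada the equivalent conversion fails its range check and raises an
           exception (Constraint_Error), and if nothing catches it, the task dies. */
        return (int16_t)bias;
    }

    int main(void)
    {
        double bias = 65000.0;   /* well past INT16_MAX (32767) */
        printf("narrowed value: %d\n", narrow_horizontal_bias(bias));
        return 0;
    }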

So there's that to consider.




> "If the Ariane V code was written in C and the value simply overflowed, nothing negative would have happened"

So... what you're saying is that the language not recognising that an exceptional condition has occurred is a better outcome in mission-critical applications? "wasn't necessary", "It was an oversight", "Which was uncaught"... With these phrases you've already pointed out that the developers had improperly implemented the application.


There is something to be said for not just halting on an error in production systems. It's the same reason assertions are usually turned off in production builds.

Yes, ideally the programmers should have handled the exception, but once you're in production and the system is running, should you really just halt the whole program due to an exception the programmers didn't foresee?


For mission-critical software like avionics or a nuclear power plant? Absolutely! At design time you better figure out what all the error cases are, and blindly steamrolling over them when they are triggered makes a bad problem worse (e.g. look at Cloudbleed, where unchecked out-of-bounds reads started disclosing other people's banking information).

Assertions are also usually turned off for performance reasons, not to allow the program to plow through invariants being violated.


Uncaught exception and die.

For a banking application, yes, die, hard, as loudly as possible. Downtime is worth it. Do not continue, do nothing until it is understood and fixed.

When you're flying? Hmmm. Not so much. Dying is a really bad idea. Ok, you should have tested it thoroughly so it isn't going to happen, but if it does and "anything could happen", well, "anything" is better than killing all the passengers, crew and anyone you might crash on, right?

Literally anything is better than that.

Ideally, you log it, you refuse to take off again, the whole fleet is grounded, etc. etc., but deciding to fall out of the sky because something unexpected happened is probably not the right response.

Is graceful shutdown possible? No? Ok, so taking your chances and logging noisily is the less bad option.


The default out-of-box runtime can only handle one general scenario. The runtime designers chose to print the error message and terminate, and this seems to be the only sound option for default behavior. It's up to the developer to replace this default handler when the scenario differs that much.


> this seems to be the only sound option for default behavior

I think that is the crux of this debate. In my background, yeah, you fail hard when anything unexpected happens. It's the most straightforward way to fail safe.

But my background maps better to the financial case; I've never worked on avionics or anything like that. I can see the point, though, that, in that kind of situation, failing hard doesn't fail safe at all.

It's conceivable to me that different problem domains require different default behavior. (And perhaps, by extension, different programming languages.)


That particular set of requirements is not unique to avionics. It's pretty close in game development. A glitch is way better than a game which crashes in the middle of gameplay. Especially because many glitches only last a single frame, i.e. they're barely visible.

> It's conceivable to me that different problem domains require different default behavior. (And perhaps, by extension, different programming languages)

Not sure it's about problem domains, e.g. 3D graphics is the main problem domain both in video games people play for fun and in CT scanners which save lives.

Different projects indeed have different requirements.

Every time you see a programming-related holy war on the internets (exceptions/status codes, static/dynamic typing, unit testing or not, FP/OOP, etc.), you can bet the participants have different types of projects in their backgrounds, and these projects needed different tradeoffs. More often than not, what's good for a web startup ain't good for NASA or MS, and vice versa.


Whether die-hard or steamroll-thru is the better option depends more on what exactly goes wrong rather than what the application is. For a flight control computer, for example, if the ordered pitch suddenly becomes 3.402823466e+38 degrees it might be better to die hard and restart rather than try to execute the order.


The answer for this is sanity checking and some kind of layered "reflexes".

I design for mission critical things on a regular basis, and one of the error modes I must accommodate is random bit flipping by cosmic rays, emf, or other failures.

Sometimes, it is possible to push things to a "safe" failure state and reboot (which often takes only 100 ms or so).

Sometimes, though, the error must be caught and corrected to a last known good value or something like that. Everything critical is sanity checked: overall boundaries, rate of change, state-related boundaries, etc. Layers of "reflexes" are more robust than a single programmed behavior, because an error in one will be resisted by the others. So much the better when there are segregated systems to check each other.

Often, I'll have a "brilliant" system that performs to a very high standard and is complex and brittle. If the brilliant system fails, it just shuts down or reboots. Underneath the brilliant system is another layer or two: a "competent" system that is simpler and more robust, but less efficient and with soft failure modes, and a "brainstem" system that takes nothing for granted, even checking basic functions by bit flipping to negate stuck RAM bits or broken ALU logic, but only tries to do a passable job, reducing algorithmic complexity to its bare acceptable minimum.

Typically the system will generate a basic range of acceptable parameters at the lowest level (and take corrective action if needed), then refine the range at each subsequent level... rinse and repeat. That way, each lower level checks the ones above.

Or, you just fail downwards if errors are suspected. Either way.

Designing failure tolerant systems is not impossible, but it requires a different mindset.
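
A minimal sketch of what one of those layers might look like (purely illustrative; the names and limits are invented):

    /* One "reflex" layer, purely illustrative: absolute boundary check,
       rate-of-change check, and fallback to the last known good value. */
    #include <math.h>

    #define PITCH_ABS_LIMIT_DEG   25.0   /* overall physical boundary */
    #define PITCH_RATE_LIMIT_DEG   3.0   /* max change per control cycle */

    static double last_good_pitch_deg = 0.0;

    double sanitize_pitch(double commanded_deg)
    {
        /* Boundary check: reject anything non-finite or physically absurd. */
        if (!isfinite(commanded_deg) || fabs(commanded_deg) > PITCH_ABS_LIMIT_DEG)
            return last_good_pitch_deg;      /* fall back to last known good */

        /* Rate-of-change check: clamp sudden jumps between control cycles. */
        double delta = commanded_deg - last_good_pitch_deg;
        if (fabs(delta) > PITCH_RATE_LIMIT_DEG)
            commanded_deg = last_good_pitch_deg + copysign(PITCH_RATE_LIMIT_DEG, delta);

        last_good_pitch_deg = commanded_deg;
        return commanded_deg;
    }

The layer below would use wider, simpler limits, so an error in the clever layer gets caught by the dumber one underneath.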


I imagine (just thinking out loud randomly, I'm sure there are other issues to consider!) one way to mitigate this would be to have a second set of much simpler software (written in a different language, perhaps running on a different platform) whose only runtime job is to sit in the communications path from the primary software to the avionics hardware and monitor the outputs of the primary software and ensure the values and/or rate-of-change in certain values are within some physics-based sanity limits before they hit the hardware. It could be responsible for rebooting the primary software and holding the last-known-good output values (for the consuming avionics) while the reboot happens (hopefully quickly!). Of course then who watches the watcher, and you've added more things in the critical path which can have their own failure modes...
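
Roughly, the watcher could be as dumb as something like this (a hedged sketch with hypothetical names, just to show the shape of it):

    /* Hypothetical sketch of the simple watcher: pass the primary's command
       through only if it looks plausible, otherwise hold the last accepted
       value and, after repeated rejections, request a reboot of the primary. */
    #include <math.h>
    #include <stdbool.h>

    typedef struct {
        double last_accepted;     /* last value that passed the checks */
        int    rejects_in_a_row;
    } output_monitor_t;

    double monitor_filter(output_monitor_t *m, double primary_cmd,
                          double abs_limit, double rate_limit,
                          bool *reboot_primary)
    {
        bool in_range = isfinite(primary_cmd) && fabs(primary_cmd) <= abs_limit;
        bool rate_ok  = fabs(primary_cmd - m->last_accepted) <= rate_limit;

        if (in_range && rate_ok) {
            m->last_accepted = primary_cmd;
            m->rejects_in_a_row = 0;
            *reboot_primary = false;
            return primary_cmd;
        }

        /* Hold the last known good output; if the primary keeps misbehaving,
           ask for it to be rebooted. */
        m->rejects_in_a_row++;
        *reboot_primary = (m->rejects_in_a_row >= 3);
        return m->last_accepted;
    }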


"whose only runtime job is to sit in the communications path from the primary software to the avionics hardware and monitor the outputs of the primary software and ensure the values and/or rate-of-change in certain values are within some physics-based sanity limits before they hit the hardware"

Brings to my mind: Erlang


Was waiting for someone to mention Erlang.


Well, for an unmanned rocket, having it automatically self-destruct is likely the saner choice, as you don't want it to fall back somewhere inhabited, and there is usually little use for something in the wrong orbit.


This is a slightly more complicated topic (i.e. it's complicated by whether or not manual recovery is possible and by whether or not carrying on makes it worse), but in general, the worst thing you can do if you encounter an error in the software that supervises a process is to bail out and leave the process unsupervised. A chemical reaction, for example, is happy to carry on whether or not computers are watching.

There are plenty of situations when automated or manual recovery isn't possible and/or where carrying on with a potentially damaged system can make things worse. In practice, you solve this sort of problem via other design solutions, like redundancy. If you think that a critical system needs to keep running, but you also think that some errors will have to be handled by shutting it down, then you make it redundant.

Airplane FBW systems are a good example (caveat: I don't do airplanes, I do medical stuff -- I might be wrong about the practical details of this but I think it gets the point across). If the ELAC (elevator and aileron computer) runs into a condition it doesn't know how to handle, there's a good chance it will make things worse if you keep going. But you also don't want your mitigation to be "just halt the damn ELAC", you still want to have control over the elevator and the ailerons. That's why there are several ELACs.

More to the point: if something has to keep happening, no matter what, then you design the computing system and the firmware around it so that it keeps happening.
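
As a toy illustration of the redundancy idea (nothing like real FBW code; the names and thresholds are invented): three channels each compute a command, a median vote masks one wild channel, and losing quorum hands control to the next system down.

    /* Toy 2-out-of-3 voting sketch; purely illustrative, not taken from any
       real flight control system. */
    #include <math.h>
    #include <stdbool.h>

    typedef struct {
        double command;   /* e.g. commanded elevator deflection, in degrees */
        bool   healthy;   /* the channel's own self-test result */
    } channel_t;

    /* Median of three values: masks a single wild channel. */
    static double median3(double a, double b, double c)
    {
        if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
        if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
        return c;
    }

    /* Returns true if a usable command was produced; false means "hand over
       to the backup system". */
    bool vote(const channel_t ch[3], double *out)
    {
        int healthy = 0;
        for (int i = 0; i < 3; i++) healthy += ch[i].healthy ? 1 : 0;

        if (healthy == 3) {
            *out = median3(ch[0].command, ch[1].command, ch[2].command);
            return true;
        }
        if (healthy == 2) {
            double a = NAN, b = NAN;
            for (int i = 0; i < 3; i++) {
                if (!ch[i].healthy) continue;
                if (isnan(a)) a = ch[i].command; else b = ch[i].command;
            }
            if (fabs(a - b) > 5.0) return false;   /* channels disagree: degrade */
            *out = (a + b) / 2.0;
            return true;
        }
        return false;   /* quorum lost */
    }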

> At design time you better figure out what all the error cases are

Not being able to figure out what all the error cases are has been an unfortunate component of systems engineering for decades now. The Wright brothers would have probably been able to account for all the failure modes in their flight control systems, but today, you are likely to miss some of the failure modes of a CPU that executes nothing but a NOP and a jump back to that NOP.

With the exception of simple and special-purpose systems, built without any programmable logic whatsoever, it's unlikely that you'll be able to figure out what all the error cases are. (There's something to be said here about OISC and whatnot...)

That's not to say it's OK to build systems that blindly steamroll over errors -- just that you have to build them so that they can deal with errors that you have not foreseen at design time. You will run into that sort of error sooner or later; we are all fallible.

Edit: as for assertions, performance may be a factor, but that's not why you want to turn them off in production builds for embedded systems. (Although, IMHO, this isn't the right approach for embedded systems at all, I've seen it used.)

First of all, you turn them off because, presumably, they make your system deviate from the spec (i.e. the system ends up handling some cases differently in the production build vs. the debug build, and hopefully the one in the production build is the one you want).

Second, you turn them off because they can introduce undefined behaviour in your system. For example, if a peripheral gives you an incomplete or malformed packet in response to a command, or fails to answer altogether, you may want to abort with a stack trace in a development build. But what you really want to do IRL is probably to check and reset that peripheral, because for all you know it may be stuck and giving haphazard commands to an actuator.
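
Something like this (the peripheral API here is hypothetical and stubbed out, just to show the difference between the two builds):

    /* Illustrative contrast between a debug-build assertion and what the
       production build should actually do.  The peripheral functions are
       hypothetical stand-ins so the sketch compiles on its own. */
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define EXPECTED_PACKET_LEN 16u

    static void peripheral_reset(void) { /* a real driver would reset hardware */ }
    static size_t peripheral_read(unsigned char *buf, size_t len)
    {
        (void)buf;
        return len - 1;   /* simulate a short (malformed) response */
    }

    bool read_sensor_packet(unsigned char *buf)
    {
        size_t got = peripheral_read(buf, EXPECTED_PACKET_LEN);

        /* Development build: stop right here and collect a trace. */
        assert(got == EXPECTED_PACKET_LEN);

        /* Production build (NDEBUG): the assert compiles away, so the bad
           response still has to be handled -- reset the peripheral instead of
           acting on a bogus packet. */
        if (got != EXPECTED_PACKET_LEN) {
            peripheral_reset();
            return false;
        }
        return true;
    }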

IMHO, assertions are only a partial answer to the problem you really want to solve -- obtaining useful data (e.g. stack traces) in response to error conditions. You can generally log the useful data in addition to actually handling the error correctly. Development and production builds should differ as little as possible -- ideally not at all. Handling potentially critical errors in different ways certainly doesn't count as differing as little as possible.


In the Boeing case, why wasn't there a button the pilot could hit that says "The computer is acting up, it's going to crash, turn it off and let me fly the plane"?

Surely, the pilot should always be able to have the last say in where or what the plane is flying toward? Or are these planes now so complicated the pilots can't fly them without the computers?


I'm a bit out of my waters when it comes to aerospace, hopefully someone more familiar with the field can correct me if I'm wrong on any of these accounts. I knew I should have given examples from the medical field, but the parent post mentioned airplanes and nuclear plants so...

Airplanes that cannot be flown without computers definitely do exist. The F-117 is perhaps the most famous. Its shape makes it aerodynamically unstable on all three axes and it needs constant correction from the FBW system. Which has quadruple redundancy :). You can turn off the autopilot in these systems, obviously, so you get to say where it goes -- but without the FBW system to issue corrections, the plane crashes.

As for Boeing (or Airbus, who have this ELAC thing)... the main thing to understand here is that there is not a computer. There are several computers, each of them covering a particular set of modules (e.g. ELAC controls the ailerons and elevator, SEC controls spoilers and elevator). There's a more in-depth overview here: https://aviation.stackexchange.com/questions/47711/what-are-... . The autopilot is only one of them. The way they take over each other's functions is actually remarkably advanced, and leads to very interesting procedures for handling failures, see e.g. https://hursts.org.uk/airbus-nonnormal/html/ch05.html .

Now, on some airplanes, some actuators can only be controlled through these computers. They get the commands from the pilot and they issue the right signals that control the actuators. There's no way to bypass them. You can turn off the autopilot and the plane goes where you want but the actuators that control the flight control surfaces are still acted upon by computers.

I don't know if this is the case on Airbus specifically (like I said, I'm in a different field), but if it were, then simply turning those systems off in case of something unexpected is definitely not the right design solution.


There are multiple levels of auto-pilot to turn off, but the pilot can have pretty much full control.


No, the process manager should log the stack trace, restart the subsystem and try running a few more times, then try an auxiliary system or just fly without the subsystem. It should not halt the whole software.
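
Something in that spirit (a hedged, POSIX-flavoured sketch; the subsystem binaries are hypothetical):

    /* Sketch of that restart policy: retry the subsystem a few times, then
       fall back to an auxiliary one; the supervisor itself never halts. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int run_once(const char *path)
    {
        pid_t pid = fork();
        if (pid < 0)
            return 0;                        /* treat fork failure as a failed run */
        if (pid == 0) {
            execl(path, path, (char *)NULL);
            _exit(127);                      /* exec failed */
        }
        int status = 0;
        waitpid(pid, &status, 0);
        return WIFEXITED(status) && WEXITSTATUS(status) == 0;
    }

    int main(void)
    {
        for (int attempt = 1; attempt <= 3; attempt++) {
            if (run_once("./nav_subsystem"))          /* hypothetical binary */
                return 0;
            fprintf(stderr, "nav_subsystem failed (attempt %d), restarting\n", attempt);
        }
        fprintf(stderr, "giving up on nav_subsystem, trying the auxiliary\n");
        return run_once("./nav_subsystem_aux") ? 0 : 1;  /* hypothetical fallback */
    }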


On a rocket launch?


On projects which require redundancy and resilience.


As a counterpoint, the Toyota ECU in those Camrys that occasionally accelerated uncontrollably had asserts and whatnot. When they fired, the system rebooted. From memory it took about a second to boot, so usually the driver didn't notice. Apparently it did that on a regular basis as it was overloaded, and sometimes it failed to meet deadlines.

The acceleration problem was caused by noise flipping bits in the RAM the operating system used to store state data. (The Toyota code was mostly immune to this because they duplicated their state by storing it in two different places, always compared on read, and rebooted if the copies differed. But the OS was provided by the chip manufacturer, NEC, and was a black box to Toyota - and it wasn't as conservative.) On rare occasions a bit flip would take out the watchdog process and the cruise control just after it decided to accelerate, and occasionally when that happened everyone in the vehicle died.
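
The scheme being described is roughly this (an illustrative sketch, obviously not Toyota's actual code; the inverted second copy is a common variant of the technique):

    /* Duplicate-and-compare-on-read storage: each critical variable is kept
       twice, the second copy bit-inverted, and every read checks that the two
       copies still agree before the value is trusted. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint16_t value;
        uint16_t inverted;       /* should always equal ~value */
    } redundant_u16;

    static void redundant_write(redundant_u16 *r, uint16_t v)
    {
        r->value    = v;
        r->inverted = (uint16_t)~v;
    }

    /* Returns false if the copies disagree (e.g. after a bit flip); the caller
       then falls back to a safe default or triggers a reboot. */
    static bool redundant_read(const redundant_u16 *r, uint16_t *out)
    {
        if (r->value != (uint16_t)~r->inverted)
            return false;
        *out = r->value;
        return true;
    }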

Toyota blamed the deaths on all sorts of things for a while - the driver, the carpet, third-party accelerator pedals. Which to my mind was fair enough. They, like everybody else, had no idea what was causing it, but they knew their monitor / reboot-if-out-of-spec kludge worked very well, so it was unlikely to be the ECU.


Ideally your type system should be aware of all the exceptions a particular function is able to throw so the compiler forces you to handle all of them before it compiles.


This is hardly possible. The whole idea of exceptions is that each function only deals with a subset. If no function handles a given exception, there's always a top-level handler that handles everything. The default handler normally terminates the program, but it's totally possible to write a custom top-level handler that does something else. E.g. normally "out of memory" is an exception that causes a program to terminate, but in old Adobe Photoshop it was a routine situation that simply prompted the user to free some memory (by dropping the Undo, for example).

I agree that the concept of the top-level handler and customization should be more visibly documented.


> This is hardly possible. The whole idea of exceptions is that each function only deals with a subset.

Why? It's totally possible to infer the most general type for all the functions in the program, hence to infer the type of the needed handler.

Any language with subtyping and powerful type inference can do this; here is a toy OCaml example with typed algebraic effects (you can think of them as exceptions):

    val read_file : path -[ Not_found ]-> string

    val process_content : string -[ OOM | Intern_err of int ]-> float

    let computation path =
      let content = read_file path in
      let result = process_content content in
      if result < 0.
      then raise Bad_result
      else result
the type of computation would be inferred as

    val computation : path -[ Not_found | OOM | Intern_err of int | Bad_result ]-> float
The handler should catch the corresponding exceptions. This could also be used with the result monad [1].

Even if you want to keep some exceptions unhandled, you can easily choose which ones, and track them carefully.

[1] https://keleshev.com/composable-error-handling-in-ocaml


It's possible to compute which exceptions can be raised, but it's impractical to handle each exception in each function (this is how I understood your comment; I guess this is not what you meant). E.g. nearly any call can technically raise an OutOfMemory error or an IntegerOverflow error, but most functions are not competent enough to handle that; all they can do is clean up and re-raise it up the call stack.

But it is indeed possible to have a single top-level handler (or a carefully constructed hierarchy of handlers) that explicitly handles all exceptions that may arise. (And ignoring an exception is also a form of handling.) The handler already exists, but it's pretty simple: for any exception it prints the error and terminates the program. It's up to the developer to override this to make the program more robust.


What they are saying is that every language runtime, even stricter ones, has its own safety implications and you have to consider them when writing the specs and the code.

Simply switching to a language with a stricter runtime may automagically take care of some simple cases, but no system is devoid of failure modes.

Overflows are classic sources of disastrous bugs (e.g. the infamous THERAC-25). Switching to a system that detects them and raises an exception is a step in the right direction, but you still have to handle the exception cases correctly. And in some cases (like the Ariane V), incorrectly handling an exceptional condition turns out to be worse than allowing it to happen.


C would recognize it, but not over-react: log it and continue rather than going full abort/self-destruct, which is what Ada did.


>The software in the case of the 737 max performed exactly according to the spec. The problem is that the spec was buggy. The language can't fix a buggy spec.

That is true, but I've found that with stricter languages you have a bit of a slower ramp-up; the time you save later, before production, on bugs you don't have could be used to take more looks at the spec and run better simulations.

>If the Ariane V code was written in C and the value simply overflowed, nothing negative would have happened.

In this particular case yes, however if the system was actually needed for flight (which I would guess most software is), it might be better to reset and retry.

On average, if your language fails hard like Ada, you're also more likely to find these bugs in simulations and tests.

Rust might have solved the problem the way you like it: integer overflow panics only in debug builds, not in release builds (where it wraps by default).


> That is true, but I've found that with stricter languages you have a bit of a slower ramp-up; the time you save later, before production, on bugs you don't have could be used to take more looks at the spec and run better simulations.

In many of these large companies in avionics (and not just avionics), the people writing the specs are not the same as the people writing the code. There is no trade-off between how much time it takes to write the code and how much time it takes to draft and review the specs.

Edit:

> On average, if your language fails hard like Ada, you're also more likely to find these bugs in simulations and tests.

That is definitely true, but it's important to keep two things in mind:

1. The possibility of a hard failure from your runtime is something that you need to be aware of at design time. As seen in the case that the parent comment mentioned, there are cases when a hard failure on a non-critical error is actually worse than allowing the error to occur.

2. More important, if it's hard failures that expose bugs during tests, the first thing you have to fix, even before you fix the code, is the test cases themselves. A hard failure during testing is an indication that the test cases you have don't cover a given situation and, more importantly, that your system can't even handle that situation.

There are always going to be error conditions you can't recover from, and if they're in critical systems, you work around that (e.g. through redundancy). But a runtime that gives you a hard failure is rarely useful by itself.


The issue with Ariane V was the lack of integrated testing. If they had coupled the testing of the software with increased acceleration levels generated by a simulated Ariane V instead of an Ariane IV, they would have caught the issue.

But let's say this was something unexpected - then probably the only way to mitigate it would be a backup system. But this is an unmanned system, and Apollo/Saturn/Shuttle/Soyuz levels of redundancy are not required.


Yeah, the requirements were woefully incomplete, the testing insufficient, and, may I add, the design had one unreliable sensor wired to one processor that was given great authority over a system which winds in one direction very fast, and where manually reversing out of it was very slow or impossible once speeds got high (improbable when the pitch directs you down).

How could anyone believe this system was safe? No testing or requirements needed.


> value simply overflowed, nothing negative would have happened.

Except overflow would most likely have caused a significantly more severe outcome than the uncaught exception.

And there is a restriction in GNAT (No_Exception_Propagation) which forces you to deal with any exception immediately, and should be used in any critical software imho.

https://docs.adacore.com/gnathie_ug-docs/html/gnathie_ug/gna...


Significantly more severe than the rocket blowing up?


> this is Ada, so it caused an integer overflow exception

This is not a language issue, but a general engineering one. In any critical system that can cause an unexpected exception and failure, there should be sufficient redundancy, so that when the main system goes down, the failover system can take control.


You can and should check your specs like they're code. For instance, you can write the spec in TLA+, which lets you specify temporal properties (e.g. "can the stall recovery procedure take an unbounded amount of time?") and liveness properties (e.g. "can any non-majority of disagreeing sensors cause the wrong trim actuations?")

To test your invariants, you should sabotage the spec and check that invariants break.

Once your spec passes model checking (or perhaps theorem proving with TLAPS), you can codify it in e.g. Ada/SPARK contracts.

Once you have that, you've validated your spec, your contracts and your code. Bugs can only occur in your invariants and the seams of your subsystem. This level of rigor should be standard by now for safety-critical systems.



