Note that the bug which caused the Ariane V disaster was written in Ada. And that was caused by the language. If the Ariane V code was written in C and the value simply overflowed, nothing negative would have happened. (The value would be hilariously wrong, but that wouldn't have mattered because the code wasn't necessary during flight. It was an oversight that it was running at all.) However, this is Ada, so it caused an integer overflow exception. Which was uncaught. Which caused the entire system to nope out. Which caused the rocket to blow up.
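The two behaviors being contrasted can be sketched in Rust (with made-up numbers; the real Ariane V value was a 64-bit float converted to a 16-bit integer): a C-style conversion that silently truncates and carries on with a wrong value, versus an Ada-style range check that refuses to produce one.

```rust
// Hypothetical sketch of the two failure behaviors, with made-up values.
fn c_style(raw: i64) -> i16 {
    // C-style: truncate to the low 16 bits and carry on, hilariously wrong
    raw as i16
}

fn ada_style(raw: i64) -> Result<i16, std::num::TryFromIntError> {
    // Ada-style: range check; the caller must decide what to do on failure
    i16::try_from(raw)
}

fn main() {
    let raw: i64 = 40_000; // exceeds i16::MAX (32_767)
    println!("C-style: {}", c_style(raw));       // wrong value, no crash
    println!("Ada-style: {:?}", ada_style(raw)); // Err(..): must be handled
}
```

In the Ada case, an unhandled `Err` (or exception) escalates; in the C case, the garbage value flows on silently. Which is worse depends entirely on whether the result matters downstream.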
So there's that to consider.
So... what you're saying is that it's a better outcome, in mission-critical applications, for the language not to recognise that an exceptional condition has occurred?
"wasn't necessary", "It was an oversight", "Which was uncaught"... With these phrases you've already pointed out that the developers had improperly implemented the application.
Yes, ideally the programmers should have handled the exception, but once you're in production and the system is running, should you really just halt the whole program due to an exception the programmers didn't foresee?
Assertions are also usually turned off for performance reasons, not to allow the program to plow through invariants being violated.
For a banking application, yes, die, hard, as loudly as possible. Downtime is worth it. Do not continue, do nothing until it is understood and fixed.
When you're flying? Hmmm. Not so much. Dying is a really bad idea. OK, you should have it tested thoroughly so it isn't going to happen, but if it does and "anything could happen", well, "anything" is better than killing all the passengers, crew and anyone you might crash on, right?
Literally anything is better than that.
Ideally, you log it, you refuse to take off again, the whole fleet is grounded, etc., but deciding to fall out of the sky because something unexpected happened is probably not the right response.
Is graceful shutdown possible? No? Ok, so take your chances and log noisily is a less bad option.
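The "log noisily and take your chances" option can be sketched in Rust with `catch_unwind`: instead of letting an unforeseen panic take the whole system down, the top level records it and falls back to a degraded mode (the function names and fallback value are invented for illustration).

```rust
use std::panic;

// Hypothetical: the subsystem hits the error nobody foresaw.
fn flight_guidance() -> f64 {
    panic!("unexpected condition");
}

fn guidance_with_fallback() -> f64 {
    match panic::catch_unwind(|| flight_guidance()) {
        Ok(v) => v,
        Err(_) => {
            // Log noisily, then limp along rather than fall out of the sky.
            eprintln!("guidance failed; holding last known good value");
            0.0 // assumed safe fallback for this sketch
        }
    }
}

fn main() {
    println!("correction: {}", guidance_with_fallback());
}
```

Whether 0.0 (or any canned fallback) is actually safe is exactly the domain-specific question this thread is arguing about.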
I think that is the crux of this debate. In my background, yeah, you fail hard when anything unexpected happens. It's the most straightforward way to fail safe.
But my background maps better to the financial case; I've never worked on avionics or anything like that. I can see the point, though, that, in that kind of situation, failing hard doesn't fail safe at all.
It's conceivable to me that different problem domains require different default behavior. (And perhaps, by extension, different programming languages.)
> It's conceivable to me that different problem domains require different default behavior. (And perhaps, by extension, different programming languages)
Not sure about problem domains, e.g. 3D graphics is the main problem domain both in videogames people play for fun and in CT scanners which save lives.
Different projects indeed have different requirements.
Every time you see a programming-related holy war on the internets (exceptions/status codes, static/dynamic typing, unit testing or not, FP/OOP, etc.), you can bet the participants have different types of projects in their background, and these projects needed different tradeoffs. More often than not, what's good for a web startup ain't good for NASA or MS, and vice versa.
I design for mission critical things on a regular basis, and one of the error modes I must accommodate is random bit flipping by cosmic rays, emf, or other failures.
Sometimes, it is possible to push things to a "safe" failure state and reboot (which often takes only 100 ms or so).
Sometimes, though, the error must be caught and corrected to a last known good value or something like that. Everything critical is sanity checked: overall boundaries, rate of change, state-related boundaries, etc. Layers of "reflexes" are more robust than a single programmed behavior, because an error in one will be resisted by the others. So much the better when there are segregated systems to check each other.
Often, I'll have a "brilliant" system that performs to a very high standard and is complex and brittle. If the brilliant system fails, it just shuts down or reboots. Underneath the brilliant system is another layer or two: a "competent" system that is simpler and more robust, but less efficient and with soft failure modes, and a "brainstem" system that takes nothing for granted, even checking basic functions by bit flipping to negate stuck RAM bits or broken ALU logic, but only tries to do a passable job, reducing algorithmic complexity to its bare acceptable minimum.
Typically the system will generate a basic range of acceptable parameters at the lowest level (and take corrective action if needed), then refine the range at each subsequent level... rinse and repeat. That way, each lower level checks the ones above.
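A minimal sketch of those layered "reflexes" in Rust, with all limits made up: an absolute-bounds check, then a rate-of-change check against the last known good value. Each layer catches errors the one above lets through.

```rust
// Layered sanity check on a sensor reading; every constant here is assumed.
fn sanity_check(last_good: f64, raw: f64) -> f64 {
    const MIN: f64 = -50.0;    // overall physical boundary (assumed)
    const MAX: f64 = 50.0;
    const MAX_STEP: f64 = 5.0; // plausible change per sample (assumed)

    // Layer 1: overall boundaries
    let bounded = raw.clamp(MIN, MAX);
    // Layer 2: rate of change relative to the last known good value
    let step = (bounded - last_good).clamp(-MAX_STEP, MAX_STEP);
    last_good + step
}

fn main() {
    // A cosmic-ray bit flip turns a reading of ~2.0 into 1.0e30:
    println!("{}", sanity_check(2.0, 1.0e30)); // limited to 7.0, not 1e30
}
```

Neither layer alone is sufficient: the bounds check still admits a wild in-range jump, and the rate check alone would slowly track a runaway value. Together they bound both the value and its trajectory.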
Or, you just fail downwards if errors are suspected. Either way.
Designing failure tolerant systems is not impossible, but it requires a different mindset.
Brings to my mind: Erlang
There are plenty of situations when automated or manual recovery isn't possible and/or where carrying on with a potentially damaged system can make things worse. In practice, you solve this sort of problem via other design solutions, like redundancy. If you think that a critical system needs to keep running, but you also think that some errors will have to be handled by shutting it down, then you make it redundant.
Airplane FBW systems are a good example (caveat: I don't do airplanes, I do medical stuff -- I might be wrong about the practical details of this but I think it gets the point across). If the ELAC (elevator and aileron computer) runs into a condition it doesn't know how to handle, there's a good chance it will make things worse if you keep going. But you also don't want your mitigation to be "just halt the damn ELAC", you still want to have control over the elevator and the ailerons. That's why there are several ELACs.
More to the point: if something has to keep happening, no matter what, then you design the computing system and the firmware around it so that it keeps happening.
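The redundancy idea reduces to something like a majority vote: several independent computers each compute a command, and a single faulty unit is masked. This is only a toy sketch, not how any particular FBW system is actually implemented.

```rust
// Majority vote over three redundant channels (toy illustration).
fn vote(a: i32, b: i32, c: i32) -> Option<i32> {
    if a == b || a == c {
        Some(a)
    } else if b == c {
        Some(b)
    } else {
        None // all three disagree: fail over to the next layer of control
    }
}

fn main() {
    println!("{:?}", vote(10, 10, 937)); // one unit glitched; output is still 10
    println!("{:?}", vote(1, 2, 3));     // total disagreement: escalate
}
```

Note that the `None` branch is the interesting one: the voter itself has to have a "what now?" answer, which is why real systems layer dissimilar backups underneath.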
> At design time you better figure out what all the error cases are
Not being able to figure out what all the error cases are has been an unfortunate component of systems engineering for decades now. The Wright brothers would have probably been able to account for all the failure modes in their flight control systems, but today, you are likely to miss some of the failure modes of a CPU that executes nothing but a NOP and a jump back to that NOP.
With the exception of simple and special-purpose systems, built without any programmable logic whatsoever, it's unlikely that you'll be able to figure out what all the error cases are. (There's something to be said here about OISC and whatnot...)
That's not to say it's OK to build systems that blindly steamroll over errors -- just that you have to build them so that they can deal with errors that you have not foreseen at design. You will run into that sort of error sooner or later, we are all fallible.
Edit: as for assertions, performance may be a factor, but that's not why you want to turn them off in production builds for embedded systems. (Although, IMHO, this isn't the right approach for embedded systems at all, I've seen it used.)
First of all, you turn them off because, presumably, they make your system deviate from the spec (i.e. the system ends up handling some cases differently in the production build vs. the debug build, and hopefully the one in the production build is the one you want).
Second, you turn them off because they can introduce undefined behaviour in your system. For example, if a peripheral gives you an incomplete or malformed packet in response to a command, or fails to answer altogether, you may want to abort with a stack trace in a development build. But what you really want to do IRL is probably to check and reset that peripheral, because for all you know it may be stuck and giving haphazard commands to an actuator.
IMHO, assertions are only a partial answer to the problem you really want to solve -- obtaining useful data (e.g. stack traces) in response to error conditions. You can generally log the useful data in addition to actually handling the error correctly. Development and production builds should differ as little as possible -- ideally not at all. Handling potentially critical errors in different ways certainly doesn't count as differing as little as possible.
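The "log the useful data in addition to actually handling the error" pattern might look like this in Rust; all the names here are hypothetical, and the point is that the same code path runs in development and production, with only the reporting differing.

```rust
// Hypothetical peripheral-bus error handling: log context, then recover.
#[derive(Debug, PartialEq)]
enum BusError {
    Timeout,
    Malformed,
}

fn reset_peripheral() {
    // Stand-in for the real recovery action.
    eprintln!("resetting peripheral");
}

fn poll(result: Result<Vec<u8>, BusError>) -> Vec<u8> {
    match result {
        Ok(packet) => packet,
        Err(e) => {
            // An assert would abort here in a dev build; instead we capture
            // the context we'd want from a stack trace, then recover.
            eprintln!("peripheral error: {:?}", e);
            reset_peripheral();
            Vec::new() // safe empty result until the peripheral comes back
        }
    }
}

fn main() {
    println!("{:?}", poll(Err(BusError::Timeout)));
}
```

The error path is exercised in every build, so testing the debug binary actually tells you something about the production one.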
Surely, the pilot should always be able to have the last say in where or what the plane is flying toward? Or are these planes now so complicated the pilots can't fly them without the computers?
Airplanes that cannot be flown without computers definitely do exist. The F-117 is perhaps the most famous. Its shape makes it aerodynamically unstable on all three axes and it needs constant correction from the FBW system. Which has quadruple redundancy :). You can turn off the autopilot in these systems, obviously, so you get to say where it goes -- but without the FBW system to issue corrections, the plane crashes.
As for Boeing (or Airbus, who have this ELAC thing)... the main thing to understand here is that there is not a computer. There are several computers, each of them covering a particular set of modules (e.g. ELAC controls the ailerons and elevator, SEC controls spoilers and elevator). There's a more in-depth overview here: https://aviation.stackexchange.com/questions/47711/what-are-... . The autopilot is only one of them. The way they take over each other's functions is actually remarkably advanced, and leads to very interesting procedures for handling failures, see e.g. https://hursts.org.uk/airbus-nonnormal/html/ch05.html .
Now, on some airplanes, some actuators can only be controlled through these computers. They get the commands from the pilot and they issue the right signals that control the actuators. There's no way to bypass them. You can turn off the autopilot and the plane goes where you want but the actuators that control the flight control surfaces are still acted upon by computers.
I don't know if this is the case on Airbus specifically (like I said, I'm in a different field), but if it were, then simply turning those systems off in case of something unexpected is definitely not the right design solution.
The acceleration problem was caused by noise flipping bits in the RAM the operating systems used to store state data. (The Toyota code was mostly immune to this because they duplicated their state by storing it in two different places, always compared on read, and rebooted if the copies differed. But the OS was provided by the chip manufacturer, NEC, and was a black box to Toyota - and it wasn't as conservative.) On rare occasions a bit flip would take out the watchdog process and the cruise control just after it decided to accelerate, and occasionally when that happened everyone in the vehicle died.
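The duplicate-and-compare scheme described above can be sketched in a few lines; storing the bitwise complement in the second copy (an assumed detail, not necessarily what Toyota did) also catches stuck-at bits, not just flips.

```rust
// Sketch of state duplicated in two places, compared on every read.
struct Guarded {
    value: u32,
    shadow: u32, // bitwise complement of `value`
}

impl Guarded {
    fn new(v: u32) -> Self {
        Guarded { value: v, shadow: !v }
    }

    fn read(&self) -> Result<u32, ()> {
        if self.value == !self.shadow {
            Ok(self.value)
        } else {
            Err(()) // copies disagree: a bit flipped somewhere; reboot
        }
    }
}

fn main() {
    let mut state = Guarded::new(42);
    println!("{:?}", state.read()); // Ok(42)
    state.value ^= 1 << 7;          // simulate a cosmic-ray bit flip
    println!("{:?}", state.read()); // Err(()): time to reboot
}
```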
Toyota blamed the deaths on all sorts of things for a while - the driver, the carpet, third-party accelerator pedals. Which to my mind was fair enough. They, like everybody else, had no idea what was causing it, but they knew their monitor-and-reboot-if-out-of-spec kludge worked very well, so it was unlikely to be the ECU.
I agree that the concept of the top-level handler and customization should be more visibly documented.
Why? It's totally possible to infer the most general type for all the functions in the program, hence to infer the type of the needed handler.
Any language with subtyping and powerful type inference can do this; here is a toy OCaml example with typed algebraic effects (you can think of them as exceptions):

```ocaml
(* Toy notation, not valid OCaml as-is: the -[ ... ]-> arrows carry effect rows *)
val read_file : path -[ Not_found ]-> string
val process_content : string -[ OOM | Intern_err of int ]-> float

let computation path =
  let content = read_file path in
  let result = process_content content in
  if result < 0.0
  then raise Bad_result
  else result

(* inferred: *)
val computation : path -[ Not_found | OOM | Intern_err of int | Bad_result ]-> float
```
Even if you want to keep some exceptions unhandled, you can easily choose which ones, and track them carefully.
But it is indeed possible to have a single top-level handler (or a carefully constructed hierarchy of handlers) that explicitly handles all exceptions that may arise. (And ignoring an exception is also a form of handling.) The handler already exists, but it's pretty simple: for any exception it prints the error and terminates the program. It's up to the developer to override this to make the program more robust.
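In Rust terms, the panic hook plays a similar role to that default top-level handler: out of the box it prints the error and the process dies, and the developer can override it. A minimal sketch:

```rust
use std::panic::{self, UnwindSafe};

// Run a computation under a handler instead of letting it kill the process.
fn run_guarded<F: FnOnce() -> i32 + UnwindSafe>(f: F) -> Option<i32> {
    panic::catch_unwind(f).ok()
}

fn main() {
    // Replace the default "print and terminate" with "log and keep going":
    panic::set_hook(Box::new(|info| {
        eprintln!("top-level handler caught: {}", info);
    }));

    println!("{:?}", run_guarded(|| 42));                   // Some(42)
    println!("{:?}", run_guarded(|| panic!("worker bug"))); // None
    println!("still running");
}
```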
Simply switching to a language with a stricter runtime may automagically take care of some simple cases, but no system is devoid of failure modes.
Overflows are classic sources of disastrous bugs (e.g. the infamous THERAC-25). Switching to a system that detects them and raises an exception is a step in the right direction, but you still have to handle the exception cases correctly. And in some cases (like the Ariane V), incorrectly handling an exceptional condition turns out to be worse than allowing it to happen.
That is true, but I've found that with stricter languages you have a bit of a slower ramp-up; the time you save later, before production, on bugs you don't have can be used to take more looks at the spec and run better simulations.
>If the Ariane V code was written in C and the value simply overflowed, nothing negative would have happened.
In this particular case yes, however if the system was actually needed for flight (which I would guess most software is), it might be better to reset and retry.
On average if your language fails hard like Ada it's also more likely to find these bugs in simulations and tests.
Rust might have solved the problem the way you like it. Integer overflow causes exceptions only in Debug mode, not in Release mode.
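Strictly speaking, Rust's default is that plain `+` panics on overflow in debug builds and wraps in release builds (the `overflow-checks` profile setting can change that). The explicit methods pick one behavior regardless of build profile:

```rust
fn main() {
    let x: u8 = 250;
    println!("{}", x.wrapping_add(10));   // 4: always wraps, C-style
    println!("{:?}", x.checked_add(10));  // None: always detects, Ada-style
    println!("{}", x.saturating_add(10)); // 255: clamps at the boundary
}
```

So the debug/release split is really just a default; safety-critical code would presumably choose one of the explicit forms and keep the behavior identical in all builds.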
In many of these large companies in avionics (and not just avionics), the people writing the specs are not the same as the people writing the code. There is no trade-off between how much time it takes to write the code and how much time it takes to draft and review the specs.
> On average if your language fails hard like Ada it's also more likely to find these bugs in simulations and tests.
That is definitely true, but it's important to keep two things in mind:
1. The possibility of a hard failure from your runtime is something that you need to be aware of at design time. As seen in the case that the parent comment mentioned, there are cases when a hard failure on a non-critical error is actually worse than allowing the error to occur.
2. More important, if it's hard failures that expose bugs during tests, the first thing you have to fix, even before you fix the code, is the test cases themselves. A hard failure during testing is an indication that the test cases you have don't cover a given situation and, more importantly, that your system can't even handle that situation.
There are always going to be error conditions you can't recover from, and if they're in critical systems, you work around that (e.g. through redundancy). But a runtime that gives you a hard failure is rarely useful by itself.
But let's say this was something unexpected - then probably the only way to mitigate it would be a backup system - but this is an unmanned system, and Apollo/Saturn/Shuttle/Soyuz levels of redundancy are not required.
How could anyone believe this system was safe? No testing or requirements needed.
Except overflow would most likely have caused a significantly more severe outcome than the uncaught exception.
And there is a restriction in GNAT (No_Exception_Propagation) which forces you to deal with any exception immediately, and should be used in any critical software imho.
This is not a language issue, but a general engineering one. In any critical system that can cause an unexpected exception and failure, there should be sufficient redundancy, so that when the main system goes down, the failover system can take control.
To test your invariants, you should sabotage the spec and check that invariants break.
Once your spec passes model checking (or perhaps theorem proving with TLAPS), you can codify it in e.g. Ada/SPARK contracts.
Once you have that, you've validated your spec, your contracts and your code. Bugs can only occur in your invariants and at the seams of your subsystems. This level of rigor should be standard by now for safety-critical systems.