> Although the source of the Operand Error has been identified, this in itself did not cause the mission to fail. The specification of the exception-handling mechanism also contributed to the failure. In the event of any kind of exception, the system specification stated that: the failure should be indicated on the databus, the failure context should be stored in an EEPROM memory (which was recovered and read out for Ariane 501), and finally, the SRI processor should be shut down.
> It was the decision to cease the processor operation which finally proved fatal. Restart is not feasible since attitude is too difficult to re-calculate after a processor shutdown; therefore the Inertial Reference System becomes useless. The reason behind this drastic action lies in the culture within the Ariane programme of only addressing random hardware failures. From this point of view exception - or error - handling mechanisms are designed for a random hardware failure which can quite rationally be handled by a backup system.
> ... An underlying theme in the development of Ariane 5 is the bias towards the mitigation of random failure. The supplier of the SRI was only following the specification given to it, which stipulated that in the event of any detected exception the processor was to be stopped. The exception which occurred was not due to random failure but a design error. The exception was detected, but inappropriately handled because the view had been taken that software should be considered correct until it is shown to be at fault. The Board has reason to believe that this view is also accepted in other areas of Ariane 5 software design. The Board is in favour of the opposite view, that software should be assumed to be faulty until applying the currently accepted best practice methods can demonstrate that it is correct.
There were also questions about why the code was even still in operation. The functionality was useful under Ariane 4 but not Ariane 5, the code "serves no purpose" after launch, analysis was done to check some variables for overflow but not all variables, and the component's functional requirements weren't updated to reflect the new launch trajectory data.
Saying the failure happened only because of an exception during type conversion ignores this whole, very important context, which makes clear that the exception wasn't the real problem.
"It was the decision to cease the processor operation which finally proved fatal."
Which is exactly what the author argues against: don't terminate the process on integer overflow.
It's misleading to attribute the problem solely to an overflow exception. It's equally correct to say that it was a design fault (because that functionality simply wasn't needed for Ariane 5), or an analysis fault (because the overflow could have been detected that way), or a testing fault (because testing with simulated Ariane 5 trajectory inputs would have revealed the fault).
If there had been a random hardware failure in SRI 2, shutting it down and falling back to SRI 1 would have been fine. But SRI 1 was subject to the same failure condition, because it ran the same code. So it's equally valid to say that the failure was in not using two independent implementations, as a mitigation for systematic coding errors.
Hence the report says the problem was "due to specification and design errors in the software of the inertial reference system", and does not place the blame only on the triggering event. It's also why "confining exceptions to tasks" is only half of one of the 14 recommendations.
You don't seem to realize that many of the SW practices that are adequate for "normal" software are NOT applicable to mission-critical, fault-tolerant scenarios.
What distinguishes avionics software from most "normal" software is the extensive (and expensive) analysis to determine the best fail-safe behavior for a wide variety of failure modes. Sometimes "fail-safe" means "self-destruct."
Please tell me why killing my spreadsheet program is OK because the cat walked on the keyboard.
If, instead, detecting and responding to the overflow is what you want, then someone has already filed a rdar on that and Chris Lattner has replied saying: "It would be straight-forward for us to provide arithmetic functions that return the result as an optional (i.e., nil on overflow). We'll take a look, thanks for filing a bug!"
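In C terms, the optional-returning arithmetic Lattner describes would look roughly like this sketch. It uses the GCC/Clang builtin `__builtin_add_overflow`; the name `checked_add` is hypothetical, not an existing API:

```c
#include <stdbool.h>

/* Hypothetical sketch of optional-returning arithmetic: the addition
   reports overflow to the caller instead of trapping. The builtin
   __builtin_add_overflow (GCC/Clang) returns true when the true sum
   does not fit in *out, so we invert it to mean "result is valid". */
bool checked_add(int a, int b, int *out) {
    return !__builtin_add_overflow(a, b, out);
}
```

Callers then branch on the result, e.g. `if (checked_add(a, b, &r)) use(r); else handle_overflow();`, instead of having the process terminated by a trap.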
In this case, the sensible solution would be to have a proper Numeric tower like, for example, Smalltalk, where such overflows simply don't matter. The default "Int" type should be part of that system.
You can then add Int32/Int64 or maybe IntNative to get to the built-in processor types, which then have built-in processor behavior, and possibly arithmetic overflow traps as an extension.
The current defaults are just nuts, IMHO.
> It has been stated to the Board that not all the conversions were protected because a maximum workload target of 80% had been set for the SRI computer. To determine the vulnerability of unprotected code, an analysis was performed on every operation which could give rise to an exception, including an Operand Error. In particular, the conversion of floating point values to integers was analysed and operations involving seven variables were at risk of leading to an Operand Error. This led to protection being added to four of the variables, evidence of which appears in the Ada code.
I find it hard to believe that a language which checks every operation for possible overflow would be faster than one which does not (excluding program-flow analysis that wasn't available then). Thus I don't think the engineering tradeoffs for the Ariane apply to a spreadsheet, where gobs of excess cycles are available, so I find it difficult to use the errors in the former to guide the latter.
That said, I recognize that you're on a different topic. I wanted only to point out that "the current defaults are just nuts" has nothing to do with exception handling on the Ariane, and I don't think the Ariane works as a basis for your argument.
And, if not, what is your goal?
Edit: I see that you've answered my question on your blog by answering someone else who asked the same thing. Thank you!
> Optimization flags in general should not change visible program behavior, except for performance.
The only behavior you could observe before disabling overflow checking, that you wouldn't observe after, is a crash, and you would have fixed that anyway as soon as you observed it.
Of course, if there is $370 Million on the line, maybe just disable it and hope for the best :-)
Integer overflow has been a source of numerous exploits, and I'd prefer the software on my desktop machine / phone / tablet to crash instead of giving unlimited access to the intruder.
The story changes radically if we start talking about mission critical systems, like car control.
To get exploitable access, you need a buffer overflow, which can and should be protected against separately.
Once you protect against buffer overflows, crashing on integer overflows gets you exactly nothing. (It might be OK as an optional, assert-like debugging tool.)
Or you can use an integer overflow to make the program allocate too small a buffer, thus obtaining the buffer overflow you need to proceed further:
    /* header.data_len and header.code_len come from untrusted input */
    read(fd, &header, sizeof(header));
    int total_len = header.data_len + header.code_len;  /* can wrap small or negative */
    char *data = malloc(total_len);                     /* undersized buffer */
    read(fd, data, total_len);

If the two lengths sum past INT_MAX, total_len wraps, malloc returns a buffer far smaller than data_len + code_len, and any later copy that trusts either length individually writes past the end of it.
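The usual defense is to validate the sum before allocating. A minimal sketch, again using GCC/Clang's `__builtin_add_overflow`; the struct, field names, and the 1 MiB sanity cap are illustrative, mirroring the snippet above rather than any real protocol:

```c
#include <stdlib.h>
#include <stdint.h>

struct hdr { uint32_t data_len; uint32_t code_len; };  /* illustrative */

/* Returns a buffer big enough for data_len + code_len bytes,
   or NULL if the sum overflows size_t or exceeds a sanity cap. */
char *alloc_payload(const struct hdr *h) {
    size_t total;
    if (__builtin_add_overflow((size_t)h->data_len,
                               (size_t)h->code_len, &total))
        return NULL;                 /* sum wrapped: reject the input */
    if (total > (size_t)1 << 20)     /* hypothetical 1 MiB sanity limit */
        return NULL;
    return malloc(total);
}
```

With this shape, the attacker-controlled lengths can no longer produce an undersized allocation, whatever the platform's integer widths.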