How to Swiftly Destroy a $370 Million Dollar Rocket with Overflow "Protection" (metaobject.com)
9 points by mpweiher on June 26, 2014 | 19 comments

I don't think the author has read the Ariane crash report, which says:

> Although the source of the Operand Error has been identified, this in itself did not cause the mission to fail. The specification of the exception-handling mechanism also contributed to the failure. In the event of any kind of exception, the system specification stated that: the failure should be indicated on the databus, the failure context should be stored in an EEPROM memory (which was recovered and read out for Ariane 501), and finally, the SRI processor should be shut down.

> It was the decision to cease the processor operation which finally proved fatal. Restart is not feasible since attitude is too difficult to re-calculate after a processor shutdown; therefore the Inertial Reference System becomes useless. The reason behind this drastic action lies in the culture within the Ariane programme of only addressing random hardware failures. From this point of view exception - or error - handling mechanisms are designed for a random hardware failure which can quite rationally be handled by a backup system.

> ... An underlying theme in the development of Ariane 5 is the bias towards the mitigation of random failure. The supplier of the SRI was only following the specification given to it, which stipulated that in the event of any detected exception the processor was to be stopped. The exception which occurred was not due to random failure but a design error. The exception was detected, but inappropriately handled because the view had been taken that software should be considered correct until it is shown to be at fault. The Board has reason to believe that this view is also accepted in other areas of Ariane 5 software design. The Board is in favour of the opposite view, that software should be assumed to be faulty until applying the currently accepted best practice methods can demonstrate that it is correct.

There were also questions about why the code was even still in operation. The functionality was useful under Ariane 4 but not Ariane 5, the code "serves no purpose" after launch, analysis was done to check some variables for overflow but not all variables, and the component's functional requirements weren't updated to reflect the new launch trajectory data.

To say it's only because of an exception during type conversion ignores the whole, very important context, which says that the exception wasn't the real problem.

Yes, the author did read the report. Did the esteemed critic?

"It was the decision to cease the processor operation which finally proved fatal."

Which is exactly what the author argues against: don't terminate the process on integer overflow.

"The decision to cease the processor" was an explicit design decision by humans, who assumed that the analysis of the software was sufficient to be able to assume that an exception was caused by a hardware error instead of a software one. This was valid ... for Ariane IV. "Don't terminate the process on integer overflow" is not allowed by the specification.

It's incorrect to attribute the problem to an overflow exception. Rather, it's equally correct to say that it was a design fault (because that functionality simply wasn't needed for Ariane V) or an analysis fault (because the overflow could have been detected that way) or a testing fault (because testing with simulated Ariane V inputs would have revealed the fault).

If there had been a random hardware failure in SR-2, then shutting it down and using only SR-1 would have been fine. But SR-1 was subject to the same failure condition, because it used the same code. So it's equally valid to say that the failure was in not using two independent implementations as a mitigation for simple coding errors.

Hence the report says the problem was "due to specification and design errors in the software of the inertial reference system", and does not place the blame only on the triggering event. It's also why "confining exceptions to tasks" is only half of one of the 14 recommendations.

Don't terminate a mission-critical process. Garden variety software: yes, PLEASE, DO kill it.

You don't seem to realize that many of the software practices that are adequate for "normal" software are NOT applicable to mission-critical, fault-tolerant scenarios.

I believe you have left out an important point. The correct failure mode is very dependent on the problem. For example, an off-course rocket just after launch may need to be blown up, because the primary concern is not "the mission" but range safety, while a rocket which lost an engine but is near LEO can likely recover by going to a lower orbit.

What distinguishes avionics software from most "normal" software is the extensive (and expensive) analysis to determine the best fail-safe behavior for a wide variety of failure modes. Sometimes "fail-safe" means "self-destruct."

What's mission critical depends on the mission; it's not the language's job to decide what is or is not mission critical.

Please tell me why killing my spreadsheet program is OK because the cat walked on the keyboard.

If it's the case that your spreadsheet program can tolerate overflow on addition, shouldn't that program be using the &+ operator instead of the + operator?

If, instead, detecting and responding to the overflow is what you want, then someone has already filed a rdar on that and Chris Lattner has replied saying: "It would be straight-forward for us to provide arithmetic functions that return the result as an optional (i.e., nil on overflow). We'll take a look, thanks for filing a bug!"
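
To make the two behaviours concrete, here is a minimal sketch in C (not Swift, and not Apple's implementation). The function names `wrapping_add` and `checked_add` are hypothetical, chosen only for illustration: the first mimics what `&+` is described as doing, the second reports overflow to the caller instead of trapping, roughly in the spirit of the "nil on overflow" idea quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrapping add, loosely analogous to Swift's &+ (hypothetical helper).
   The sum is computed in unsigned arithmetic because signed overflow is
   undefined behaviour in C. Converting the out-of-range result back to
   int32_t is implementation-defined before C23; GCC and Clang wrap. */
int32_t wrapping_add(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Checked add (hypothetical helper): returns false on overflow and leaves
   *out untouched, so the caller decides what to do instead of the process
   being terminated. */
bool checked_add(int32_t a, int32_t b, int32_t *out) {
    if ((b > 0 && a > INT32_MAX - b) || (b < 0 && a < INT32_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}
```

A spreadsheet-style caller could show an error in the cell when `checked_add` returns false, rather than wrapping silently or crashing.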


Sorry, I'd rather have a language with sensible defaults.

In this case, the sensible solution would be to have a proper Numeric tower like, for example, Smalltalk, where such overflows simply don't matter. The default "Int" type should be part of that system.

You can then add Int32/Int64 or maybe IntNative to get to the built-in processor types, which then have built-in processor behavior, and possibly arithmetic overflow traps as an extension.

The current defaults are just nuts, IMHO.

In the context of the Ariane rocket, it seems like a language with a proper numeric tower would have been unlikely to run on the given hardware. That is, the decision to not add explicit range checks to a few of the variables (including the critical one which overflowed) was based in part on keeping the processor load down.

> It has been stated to the Board that not all the conversions were protected because a maximum workload target of 80% had been set for the SRI computer. To determine the vulnerability of unprotected code, an analysis was performed on every operation which could give rise to an exception, including an Operand Error. In particular, the conversion of floating point values to integers was analysed and operations involving seven variables were at risk of leading to an Operand Error. This led to protection being added to four of the variables, evidence of which appears in the Ada code.

I find it hard to believe that a language which checks every case for possible overflow would be faster than a language which does not (excluding possible program flow analysis that wasn't available then). Thus, I don't think the engineering tradeoffs for the Ariane really apply for a spreadsheet where gobs of excess cycles are available, so I find it difficult to use the errors with the former to guide the latter.

That said, I recognize that you're on a different topic. I wanted only to point out that "the current defaults are just nuts" has nothing to do with exception handling in the Ariane, and I don't think the Ariane works as a basis for your argument.

You've made that pretty clear in your writings (which I have enjoyed, by the way.) I'm curious, do you want Swift to be that language - and, if so, have you been participating in the dev forums[1] and filing bug reports into rdar? Would you mind sharing the rdar number that you filed about this issue so that others whose preferences lean in the same direction can add our voices to the chorus?

And, if not, what is your goal?

Edit: I see that you've answered my question on your blog[2] by answering someone else who asked the same thing. Thank you!

[1] https://devforums.apple.com/community/tools/languages/swift

[2] http://openradar.appspot.com/17472835

Marcel has made his feelings on Swift clear: http://openradar.appspot.com/17180612

You, Marcel Weiher, are the author.

The author argues against crashing on integer overflow by drawing an example from a totally different domain for which Swift is not at all intended.

Integer overflow has been a source of numerous exploits, and I'd prefer the software on my desktop machine / phone / tablet to crash instead of giving unlimited access to the intruder.

The story changes radically if we start talking about mission critical systems, like car control.

Looking at https://www.owasp.org/index.php/Integer_overflow, the consequences of integer overflow by itself are either crashes or infinite loops. Crashing on overflow is not an improvement in these cases.

To get into exploitable access, you need buffer overflows, which can and should be protected against separately.

Once you protect against buffer overflows, crashing on integer overflows gets you exactly nothing. (Might be OK as an optional, assert-like debugging tool).

> To get into exploitable access, you need buffer overflows, which can and should be protected against separately.

Or you can use overflow to make the program allocate too small a buffer, thus obtaining the buffer overflow you need to proceed further.

You're not making any sense: I already wrote "protect against buffer overflow instead of integer overflow" and your response is "buffer overflow".

Integer overflow can be a direct cause of buffer overflow. For example:

  read(fd, &header, sizeof(header));
  int total_len = header.data_len + header.code_len;
  char *data = malloc(total_len);
  read(fd, data, total_len);

Now, the attacker crafts the file and... there's your buffer overflow, which is a direct consequence of undetected integer overflow [i.e., the program being allowed to proceed].
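
A sketch of how the length computation could be guarded, under stated assumptions: the thread never shows the header struct, so the `struct header` layout, the `MAX_PAYLOAD` cap, and the `total_payload_len` helper below are all hypothetical. The idea is simply to reject a header whose two lengths cannot be summed safely before anything is allocated, so that later code using the individual lengths cannot run past an undersized buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header layout; the real struct is not shown in the thread. */
struct header {
    uint32_t data_len;
    uint32_t code_len;
};

/* Hypothetical upper bound on the payload, picked only for illustration. */
#define MAX_PAYLOAD ((size_t)(16u * 1024u * 1024u))

/* Stores data_len + code_len in *total. Returns false, leaving *total
   untouched, if the sum would exceed MAX_PAYLOAD. Because each term is
   checked against the cap before the addition, the sum cannot wrap. */
bool total_payload_len(const struct header *h, size_t *total) {
    if (h->data_len > MAX_PAYLOAD) {
        return false;
    }
    if (h->code_len > MAX_PAYLOAD - h->data_len) {
        return false;
    }
    *total = (size_t)h->data_len + (size_t)h->code_len;
    return true;
}
```

With this in place, the `malloc` and the second `read` would only run when `total_payload_len` returns true; a crafted header is rejected as a parse error rather than crashing the process or proceeding with a wrapped length.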

Once again: if you protect against buffer overflow, an integer overflow that might lead to buffer overflow doesn't matter. The C example doesn't matter because C has neither.

Overflow protection is fail-fast behaviour, similar to array bounds checking. It allows you to more quickly find the source of errors. I believe everybody agrees it's a good thing, at least in the array bounds checking case.

> Optimization flags in general should not change visible program behavior, except for performance.

The only behavior you would be able to observe before, that you wouldn't after disabling overflow protection, is a crash, and you would have fixed that anyway as soon as you observed it.

Of course, if there is $370 Million on the line, maybe just disable it and hope for the best :-)
