Schiaparelli landing investigation makes progress (esa.int)
221 points by YeGoblynQueenne on Nov 24, 2016 | 139 comments



What is fantastic here is that the telemetry was received for nearly the entire descent, up to a few seconds before the crash.

They could reproduce the exact same scenario in simulation by replaying the telemetry as inputs to their firmware and seeing what happened in the software. This is a tremendous advantage for debugging; it's like a Wireshark replay of network packets for debugging an error in the TCP stack.

This orbiter/descender architecture is really paying off for future missions. If the orbiter hadn't been there, Earth wouldn't have received any telemetry at all. Let's hope that this "real world" telemetry is saved for simulating successive versions of the guidance software of future probes.


I've worked on sensor fusion systems and I fully agree. Replaying the sensor inputs and evaluating the new estimated state is a really good way of debugging failures (because you can't just stop the system mid-air and evaluate internal state). It also helps with building a regression test suite and trying out new algorithms quickly.


We'd do the same for debugging networked games. Incredibly useful.

If you build it right you also get a replay feature for almost free.


If you use ROS to do robotics, there are ROS bags [0] that also allow this. Very, very useful.

[0] http://wiki.ros.org/rosbag
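
For anyone unfamiliar, reading a bag back offline is a few lines with the ROS 1 Python API (a rough sketch; the bag file name and topic here are made up):

    import rosbag

    # Replay recorded sensor messages offline, e.g. to feed a new estimator
    # implementation without touching the live robot.
    # 'flight.bag' and '/imu/data' are hypothetical names.
    with rosbag.Bag('flight.bag') as bag:
        for topic, msg, t in bag.read_messages(topics=['/imu/data']):
            print(t.to_sec(), msg.angular_velocity.z)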


There's a lot of interest in supplementary tooling around this format, too. See MARV robotics (formerly bagbunker), about which there was a ROSCon talk this fall:

https://github.com/ternaris/marv-robotics

https://vimeo.com/187705380


I know, I was there :-). Bagbunker sounds much more awesome than "MARV Robotics" though


> They could reproduce the exact same scenario in simulation by replaying the telemetry as inputs to their firmware and seeing what happened in the software.

They already have:

> This behaviour has been clearly reproduced in computer simulations of the control system’s response to the erroneous information.


> the erroneous information generated an estimated altitude that was negative – that is, below ground level.

Kind of sad to see that computers nowadays still lack what we call common sense. I mean the machine received negative altimeter data but if it had been able to just look around, it could have seen that this data was erroneous. Hell, considering the descent profile was carefully prepared, the computer should also have noticed that it was way too early to reach the ground and should have concluded that something was wrong with the altitude measure.


It's a question of proper sensor fusion. The way it's typically done is with some form of Bayesian prior + a robust estimator that can handle the case where your sensors give you disagreeing evidence of the true state.

So you have a prediction function that takes as input your previous estimated state and some form of "common sense" that's hard-coded or learned, and which outputs a prediction of the current state. Then you look at your current measurements, and if one of them disagrees too much it gets downweighted or outright discarded. The updated estimate is then computed from the prediction and the current measurements.
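
A toy 1-D version of that predict/compare/downweight loop, just to make the idea concrete (all constants are invented, nothing flight-like):

    # Predict altitude from the previous state, then gate measurements whose
    # disagreement (innovation) with the prediction is implausibly large.
    def update_altitude(alt, vel, meas_alt, dt, gate=500.0, gain=0.3):
        pred_alt = alt + vel * dt           # prediction from previous state ("common sense")
        innovation = meas_alt - pred_alt    # how much the sensor disagrees
        if abs(innovation) > gate:
            return pred_alt                 # downweight to zero: ignore the implausible reading
        return pred_alt + gain * innovation # otherwise blend it in (fixed gain for simplicity)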


Indeed. I.e. have some probabilistic model of the 'world' and the probe's status within it (position, speed, rotation, etc). The sensor information then 'merely' informs that model, so e.g. if some sensor or piece of code says the probe has instantaneously jumped to a new location some distance away then the model can just disregard that as not conforming to anything like what was expected (the set of priors on how the probe moves, or essentially the laws of physics).

Or... run the flight controller through a 'fuzzer' that throws lots of unexpected scenarios at it. To be honest I'm rather disappointed that a project at this level has succumbed to a bug like this. It's a few thousand dollars of additional dev effort versus the whole project failing; classic small certain initial cost versus large possible future cost (see Taleb's Antifragile, systemic failure of financial markets, etc.)

Ho hum.


It's a few thousand dollars of additional dev effort versus the whole project failing; classic small certain initial cost versus large possible future cost...

In hindsight, we now know that a little extra development could have avoided the problem.

The trick is trying to figure out where to spend your limited resources as the project is going.

There are lots of areas that might have caused problems. There may well be other flaws in the system that would also have caused a failure on descent.


> The trick is trying to figure out where to spend your limited resources as the project is going.

In a situation such as this where you don't have the funds to test everything to the level you would like and the cost of a failure is large, I wonder if a reasonable compromise is to hire a smallish team of 'adversaries'. I.e. their full time job is to try and find issues with other people's work; and to ask what could go wrong, or what is likely to go wrong - let's go see if we can break that.

I also find the idea of a risk register useful. I.e. during development you'll often have thoughts about whether particular edge cases are truly being handled properly, but often the day to day cost versus benefit decisions mean you can't explore all of these issues. If however you log them then this can serve as a source of ideas for an adversarial team, or indeed your own investigations if you happen to have bits of free time for testing.


Engineering simulators are used for this purpose, I have worked on them, and seen them being used precisely for trying to find unexpected system behaviors under sensor or other subsystem failures (valves, computers, motors, power,...). This is quite costly. In commercial aviation there are teams for developing the simulations and separate teams just trying all kind of strange scenarios. In less critical projects, say for example drones, I have seen it done too although with smaller teams.

As I see it, with the limited information we have right now, it seems that either the error was made when writing the requirements for the IMU or the IMU wasn't the correct one. Having worked writing navigation software, I cannot help siding with the software developers. The IMU is usually the central piece of the navigation system; at best they could have flagged it as failed, but that is usually way more improbable than the altimeter failing. And even if they discarded the IMU measures, I don't know what other sensors could be used to get at least the vehicle attitude. I think that after the saturation they were doomed and a software solution is not realistic.


> software solution is not realistic.

I would expect such a role to be holistic; to consider the entire system and ways it could fail. The boundaries between sub-systems would be high on the list of targets; software-software, software-hardware, hardware-hardware.

It's not a matter of siding with anyone, it's about accepting that an error has been made at some level, and asking if there are general purpose strategies for increasing the probability of identifying errors without having to build a simulation. Simulations of course have all the same problems - they're complex and therefore expensive and will contain defects.

Regarding the other point about probabilistic models compensating for sensor failures; it depends on which sensors fail and when. Parts of a descent are highly predictable, so e.g. if you know height and you're in free-fall then you can model pretty accurately where you're likely to be in +t seconds, and can use that as a prior for sensory input. OTOH if you're in the final landing stage using retro-thrusters then feedback sensors are paramount to doing landing correctly.

I accept that a probabilistic model would be more expensive to develop; initially anyway, and therefore may be unrealistic with the resources available.


This reminds me of something I read a couple of years ago about an Earth-orbit satellite collecting O-zone data. When it encountered the 'hole' it ended up throwing away most of the data since it failed its 'common sense' test. Wish I could find the link.

In the case of an altimeter it makes sense to be aggressive with the 'common-sense' component to throw out bad data. But with other things like temperature, you wouldn't want to throw out data that could look wrong, but could actually be caused by some unknown phenomenon (like say, superheated gas pockets on mars) as that's exactly the type of weird stuff you'd be interested in.

I guess the problem is distinguishing [data that differs greatly from what was expected] from [data from a sensor malfunction]


It's just ozone, not a kind of zone.


Neat. There actually is a Wikipedia entry about sensor fusion:

https://en.wikipedia.org/wiki/Sensor_fusion

TIL, thanks.


Yes, but maybe they should have implemented a more fault-tolerant system. With three different methods of determining altitude during descent, this problem shouldn't have occurred.


> Kind of sad to see that computers nowadays still lack what we call common sense. I mean the machine received negative altimeter data but if it had been able to just look around, it could have seen that this data was erroneous. Hell, considering the descent profile was carefully prepared, the computer should also have noticed that it was way too early to reach the ground and should have concluded that something was wrong with the altitude measure.

Well, computer vision is a different issue. Imagine you're a human sealed inside this thing, with no vision from the outside. You reach the point when the parachute's supposed to deploy, you feel something goes wrong - there's a big force, or it's in the wrong direction. Did the parachute rip off? Did it fail? You look at your two altimeters; one says that you're 3.5km up, one says that you're at almost ground level (that it's negative is a red herring I think - the probe knows the readings might be noisy and expects to handle that). What do you do?


I would look at the trajectory of data points so far. Can't expect the altitude to jump by several kilometers in a few seconds. Unless the two altimeters were always disagreeing but that doesn't seem to have been the case here.


Well, common sense in a way would be enabled with assertions. But even if you know that you cannot be below ground, how would you recover? Your sensors clearly tell you that you're below ground, so what is the course of action? You have no way of knowing where you are, you just know that you don't know. It doesn't really change anything in the outcome. Sure, maybe you could go from altitude measurements to time measurements and most likely still leave a crater since it's not exact enough (there's a reason for the internal measurement instead of simpler ways, after all).

It would also require you to anticipate the malfunction in which case you already know that things like that could happen, in which case you'd likely determine how it could happen and how to avoid it and continue having useful data. Sounds to me like a better solution than trying to recover without data.


From the article:

> This in turn successively triggered a premature release of the parachute and the backshell, a brief firing of the braking thrusters and finally activation of the on-ground systems as if Schiaparelli had already landed.

It would have been better to do nothing, in the hope that the altimeter data would return to reasonable values, especially since the article mentions that the error is not supposed to last long:

> Its output was generally as predicted except for this event, which persisted for about one second – longer than would be expected.


There are several ways. First off, ignore the erroneous data. Then, estimate the altitude by other methods. As you mentioned, one way would be to integrate time and velocity.

All inputs and outputs should be run through sanity checks, with a plan for when they go wrong.

Essentially, when you've got a system that can't be fixed, plan for failure of each and every component (in isolation).
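
A sketch of that kind of sanity check plus fallback, assuming you still trust the last good altitude and a vertical-speed estimate (names and thresholds are invented):

    # Hypothetical guard: if the fused altitude fails basic plausibility checks,
    # dead-reckon from the last trusted value instead of acting on it.
    def safe_altitude(fused_alt, last_good_alt, vertical_speed, dt):
        in_bounds = 0.0 <= fused_alt <= 130e3   # crude bounds: ground to entry interface
        jump_ok = abs(fused_alt - last_good_alt) <= abs(vertical_speed) * dt * 2 + 100.0
        if in_bounds and jump_ok:
            return fused_alt
        # fallback: integrate the last known descent rate for this time step
        return last_good_alt + vertical_speed * dt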


Cars do this, you can kill many sensors and an engine will still limp even with misfires. I guess consulting Bosch isn't an option tho


Not necessarily very well; my car's mass airflow sensor went out a few months back... the performance was all over the place and sudden engine stalls were frequent. You could kill the engine just by accelerating quickly when it was cold.

... and that was a single sensor out.


It helps that car engines don't really _need_ sensors to function. Cars existed before electronic sensors. A mars lander actually needs to know when to deploy parachutes.


If the lander cannot function without an altimeter, and there is only one altimeter with no backup plan, then you can plan to lose the lander.

This is not a new idea. Airplanes cannot fly without instruments, and all the instruments have backups and alternate means. Everything deemed "flight critical" has a backup alternative.


You'll find that fuel injectors don't work well without input from the MAF and O2 sensor(s).


True of modern EFI, but older mechanical fuel injectors did not have O2 sensors.


Some cars do this.

My friend's Honda literally stopped combustion when his battery died while driving (no, I have no idea how, but that was the diagnosis).

In contrast, I've limped my Miata along on alternator-only when the battery wouldn't hold a charge. Note: feathering the gas at a stop with brakes and clutch depressed to keep your spark from dying is an "interesting" experience.


At a stop pull the tranny out of gear and activate the parking brake.

More fun is driving a vehicle with no clutch: stalling at every stop and starting the motor in first gear, and shifting without clutching. I had to do that in my '83 280ZX in its last days.


> your sensors clearly tell you that you're below ground

They weren't telling that. As far as I understood, the sensors were fully right; the software just hadn't expected the rotation to last "so long" -- a whole second -- so it calculated the result wrong ("negative height") by incorrectly processing a sequence of inputs its model didn't expect.


For context, the lander did have a radar altimeter. A radar altimeter has a maximum distance from the ground above which it will not establish a lock.

If the radar altimeter has never had a lock, either it has faulted in a way you can't detect, or you are above its lock range.

If the radar altimeter has locked, you have a very accurate measurement of height above ground.


That's the surprising thing, though: "its radar Doppler altimeter functioned correctly and the measurements were included in the guidance, navigation and control system. "


And presumably mission time was in use too? Presumably they know to the (milli?)second when the lander should touch down, at least from the point where the lander starts its descent. Time of flight seems like a very strong, reliable signal?


Better to discard obviously bad data altogether. The system certainly should not act on that data, as seems to have been the case here.


I would argue that at 1700 km/h, even not acting is an action. The decision to do nothing is still a decision.


It's also not because the probe was constrained in terms of its computing hardware. If you'd made the hardware a thousand times more powerful, it would have made zero difference to the 'smartness' of the software. Nor is it clear that applying the wonders of the modern AI age - Deep Blue, Watson, AlphaGo - would help. They're massively optimized for the one task they're good at and completely useless at anything else.

This is why I'm deeply pessimistic about achieving human level AI any time soon, even within the next 100 years. I'm 50 years old today. AI researchers back when I was born expected to have substantially solved AI as a problem by now, but we're still barely scratching the surface of a very big domain. This accident is a really good demonstration of the limitations of modern computer systems in terms of autonomous independent situational awareness.


Happy Birthday


But what do you do when there is a bug with the hardware clock, or with the image analysis software that recognizes how far the ground is? Or, supposing you installed a laser measurement tool to gauge distance-to-ground level, with the software interpreting that? You could of course have quorums - with the majority of the systems deciding which solution (still a while to go / imminent crash!) is the correct one, but what if you had a problem with that part? And that's just the problem of figuring out that there is a problem.

>The first thing to do was to try to seal up the hole. This turned out to be impossible, because the ship’s sensors couldn’t see that there was a hole, and the supervisors, which should have said that the sensors weren’t working properly, weren’t working properly and kept saying that the sensors were fine. The ship could only deduce the existence of the hole from the fact that the robots had clearly fallen out of it, taking its spare brain-which would have enabled it to see the hole-with them.

You could stack layer upon layer of fault detecting modules and supervising software and error and redundancy modules, but then you'd probably wind up with something like Dark Star's Bomb 20:

>Unfortunately, Doolittle has mistakenly taught the bomb Cartesian doubt and, as a result, Bomb #20 determines that it can only trust itself and not external input. Convinced that only it exists, and that its sole purpose in life is to explode, Bomb #20 states "Let there be light," and promptly detonates.


There's some room between "sensor fusion is fragile enough that if a physical condition persists for an unexpected amount of time the craft is lost" and "all sensor faults can be handled."

Notably, this obviously wasn't a failure case they programmed for. But it does seem a pretty fragile system to throw across the solar system in retrospect.


I don't know how altitude is calculated on Mars, but, for instance, there are some landing strips at Schiphol Airport (Amsterdam) that are below sea level, and I expect an altimeter to give a negative value there. But that value is correct!

Not everything is as easy as it seems.


There are two known ways of calculating altitude on Mars: firstly by atmospheric pressure, with an arbitrary value of 6.105mb chosen (based on readings from Voyager). In 2001, a new definition was chosen, based around an ellipsoid (the Areoid) formed by an imaginary sphere of the mean radius of the planet.

The effect is that the intended landing site is at an altitude of -250m, and the lowest point on Mars is at an altitude of -8200m.


... or whoever wrote the code should have put boundary checks in to make sure a negative altitude was caught and handled properly


It is an interesting problem and looks like a big extension to developing fault-tolerant software.

That is, how do you develop fault-tolerant software and at the same time have logic that would reject some sensor readings if they deviate too much from up-to-now readings?

How many readings would you reject in the hope that fresher readings would be better?


It's indeed a difficult problem and I'm not really surprised it's not solved by ESA even if they are probably extremely competent.

I doubt anyone cares about my opinion, but since you ask, I do notice that simulation seems to be a big part of their development. So I'm sure there has to be a way to use some kind of evolutionary algorithm for creating descent procedures.


I think this testing WAS that testing scenario to create better descent procedures


looks like a machine learning approach would be interesting to consider. not that i believe anybody will allow such an approach on an extraterrestrial probe, but training a model to classify sensor data as trustworthy or not could lead to some advancement here and perhaps in other areas.


Actually, the solution to this sort of problem is well understood: it's a combination of Bayesian probability theory and decision theory.

First, you define a Bayesian model for the system. This includes discrete propositions such as "sensor X is not working" as well as things like a description of how the craft's velocity, angular velocity and displacement at t0 can evolve by t1, expressed as a continuous probability distribution. This takes into account the prior distribution for altitude.

All sensor inputs have distributions -- generally Gaussian -- describing what values they could read given the "actual" value they're supposed to be measuring, if they're not broken and giving completely erroneous results.

Feeding sensor inputs into this model as they're received produces a posterior describing the craft's physical state probabilistically. Dimensions which are relevant only insofar as they affect knowledge of the craft's physical state (such as "sensor X is not working") are marginalised out.

Secondly, you define a utility function. This function takes some definite physical state of the craft and a possible control output, and describes how "good" the outcome of that is. Then you choose the control outputs that maximise the expected utility, integrated over the craft's posterior state distribution.

If sensor X started giving results unlikely under the time evolution model, P(Sensor X is not working|inputs,control signals,prior information) would start to increase very rapidly.

If, somehow, this probability was only 0.5, but the sensor was showing that the craft was below the ground, the decision process would "decide" that "if sensor X is telling the truth, firing the landing thrusters will not lead to successful mission completion; if sensor X is lying it is too early to fire the landing thrusters" and not fire the thrusters.

In reality, such a situation is absurd because sensor X showing the craft below the ground would lead to an extremely high probability for "sensor X is not working". However, the point is that even under extremely pathological conditions decision theory will continue to give reasonable results.

One important consequence to note is that the sensor readings and state estimate do not matter in themselves, only as inputs to the decision theory process.

To answer your specific question: if sensor X is determined to be, with high probability, faulty, its inputs aren't "discarded" per se, but will have a very small contribution to the state estimate. If sensor X is only believed to be faulty with moderate probability, both possibilities are considered and the optimal decision will be made given the information available and utility function.

For more information, I recommend reading chapters 13 and 14 of "Probability Theory: The Logic of Science" by E.T. Jaynes [0].

[0]: http://www.med.mcgill.ca/epidemiology/hanley/bios601/Gaussia...
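
A heavily simplified numerical version of the 0.5 example above: even with a 50/50 belief about sensor X, the expected utilities already say "don't fire" (the utility numbers are invented for the example):

    # Two hypotheses: the craft really is near the ground, or sensor X is lying
    # and the craft is still high up. Utilities are made-up numbers.
    p_near_ground = 0.5
    utility = {
        ('fire', 'near_ground'):  1.0,    # correct landing burn
        ('fire', 'still_high'): -10.0,    # burn far too early, crash later
        ('wait', 'near_ground'): -2.0,    # late burn, rough but recoverable
        ('wait', 'still_high'):   1.0,    # correctly keep descending
    }

    def expected_utility(action):
        return (p_near_ground * utility[(action, 'near_ground')] +
                (1 - p_near_ground) * utility[(action, 'still_high')])

    best_action = max(['fire', 'wait'], key=expected_utility)   # -> 'wait'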


While it may sound like the failure was caused by a rookie mistake, I'm sure the on-board systems were designed and developed by people who know what they are doing.

It must be hard for all the people involved in Schiaparelli development.

Sometimes I wonder whether the majority of software errors are caused only by multiple layers of people having incorrect / unchallenged / unjustified (and mostly implicit) assumptions about something.


> I'm sure the on-board systems were designed and developed by people who know what they are doing.

I have been working in safety-related domains, where each failure could involve the death of several hundred people. It wasn't pretty: everything was sub-contracted to death, and there was a minority of people who really knew what they were doing, a majority who were so-so and didn't give a fuck, and still a fair number who were notoriously incompetent. So, the minority of competent and concerned people cannot always make up for the others, and when they themselves fail, there is almost no one to make up for them. Oh, and I was working in a company that was supposed to produce better quality than others. And it did :-(

So, for something like ESA where life is not even concerned, I can't imagine it is any better if stuff is outsourced, and since nowadays almost everything everywhere is outsourced, I assume ESA does it too.

> Sometimes I wonder whether majority of software errors are caused only by multiple layers of people having incorrect / unchallenged / unjustified (and mostly implicit) assumptions about something.

Yes and no. This causes a category of mistakes but on the other hand, it prevents another category. When someone knows the full system, he makes a whole lot of assumptions: it doesn't matter if this function cannot handle this or that case because it will never be called this way, it cannot happen, so I won't handle those cases and I won't even check them. And then, later, something changes in an upper function or system, the assumption is not valid any more, and boom.

When you have no idea what the system is, you just stick to the function definition and have it handle all the weird cases. You don't make assumptions. You can still make mistakes, though, some of them caused by the lack of global understanding of the system because the specification writer had those assumptions in mind and did not write them down in the specification of this function because they were obvious to him or they were already written in another part of the specification.


> When you have no idea what the system is, you just stick to the function definition and have it handle all the weird cases. You don't make assumptions. You can still make mistakes, though, some of them caused by the lack of global understanding of the system because the specification writer had those assumptions in mind and did not write them down in the specification of this function because they were obvious to him or they were already written in another part of the specification.

This, alone, deserves an article. It's fundamental knowledge in any software development done by different teams, or even a discussion about the pros vs cons of having a monolithic team versus distributed teams.

If you expect that small teams, doing parts of a project, take care of all possible interactions of their piece of the program with all other components, you'll fail. They just have a subset of the knowledge, and how all that information was passed so that they could act is relevant. And in general nobody writes down every single detail that could make a difference. In those scenarios, analysts who are able to write requirements down to the point that some can be considered nonsense or obvious should be paid in gold.


Which is why actual C and C++ code written at these companies, in these types of projects, is so "fun" to debug.

Yes, everyone can be a super coder in his little bubble and use all the latest techniques to write high quality code; the problems arise when his/her bubble needs to interact with an ocean of unseen bubbles, many of which are tainted.


The longer I've coded in practice, the more suspicious I've gotten about "you don't need any domain knowledge, here are perfect specifications" approaches. The fingers on the keyboard should always have some grasp of the bigger picture, or bad things are likely to happen.


Fully agree, I never understood the stereotype of a coder closed in a cubicle breaking LOC records, coding without any knowledge other than the programming language and libraries.

I always have spent a big part of my time interacting with those that would eventually use the programs/libraries. Without domain knowledge and communication skills, there is a high probability of misunderstandings.


> And then, later, something changes in an upper function or system, the assumption is not valid any more, and boom.

That's exactly how Ariane 5 crashed in 1996:

https://en.wikipedia.org/wiki/Cluster_(spacecraft)#Launch_fa...

"Specifically, the Ariane 5's greater horizontal acceleration caused the computers in both the back-up and primary platforms to crash and emit diagnostic data misinterpreted by the autopilot as spurious position and velocity data. Pre-flight tests had never been performed on the inertial platform under simulated Ariane 5 flight conditions so the error was not discovered before launch. During the investigation, a simulated Ariane 5 flight was conducted on another inertial platform. It failed in exactly the same way as the actual flight units."


I think it's also important to note that the inertial platform was developed for the Ariane 4 where it worked correctly.

The software was actually developed correctly, and functioned as intended. At least for its intended use. Then it was tossed at a new use-case without accounting for any differences in the new situation.


> The software was actually developed correctly

Not quite. If you read the details about the case you can find that it didn't have a handler for the overflow in the calculations(!) It's similar to this case now in that both were developed under the assumption "can't happen" -- in the sense of being developed too brittle for inputs that were certainly possible as soon as the trajectory (in the case of Ariane 5) or the duration of the spinning movement (this case now) didn't match their initial test cases.

Still, development, especially in this kind of project, is always a balancing act to organize covering most of the cases that can go wrong. Murphy's law works against the whole organization. Given the number of real problems, I'm still amazed that Apollo 11 succeeded.

Or even that there weren't any really destructive "accidents" involving rockets with nuclear warheads. Think about it: these are prone to the same problems any other computer-related projects are: the amount of damage is effectively infinitely larger than the effort needed to start it.

https://www.theguardian.com/world/2016/jan/07/nuclear-weapon...

“These weapons are literally waiting for a short stream of computer signals to fire. They don’t care where these signals come from.”

“Their rocket engines are going to ignite and their silo lids are going to blow off and they are going to lift off as soon as they have the equivalent of you or I putting in a couple of numbers and hitting enter three times.”

http://thebulletin.org/

"It is 3 minutes to midnight"

Also: "How Risky is Nuclear Optimism?"

http://www-ee.stanford.edu/%7Ehellman/publications/75.pdf

And if you still think "but it works, the proof is that it hasn't exploded up to now", just consider this graph from Nassim Taleb:

http://static3.businessinsider.com/image/5655f69c8430765e008...


> Not quite. If you read the details about the case you can find that it didn't have a handler for the overflow in the calculations(!) It's similar to this case now in that both were developed under the assumption "can't happen" -- in the sense of being developed too brittle for inputs that were certainly possible as soon as the trajectory (in the case of Ariane 5)

I'm not sure that's entirely fair. The software was intended for the Ariane 4 which wasn't intended to have as much horizontal acceleration as the 5. If the 4 had experienced such an acceleration it wasn't intended to be capable of recovering from it. That area of the code also explicitly had some protections provided by the language removed for the sake of efficiency. So it wasn't a total oversight that just happened to work out - there was a decision made based on the fact the rocket had already irrecoverably failed if the situation ever occurred.

While I agree it's somewhat distasteful not to cover all the bases in the most technically correct way all the time, I'm not sure how important it is to have an overflow handler fire in the inertial reference system just as the rocket self-destructs.


> That area of the code also explicitly had some protections provided by the language removed for the sake of efficiency

As far as I know the efficiency wasn't the issue, just that the "model" was, as I've said, brittle. The overflow was to be handled with what we'd today call "an exception handler" and the selected solution was, instead of (reasonably) writing a "keep the maximum value as the result" handler, to leave the processor effectively executing random code in case the overflow occurs. And the "exception" occurred. It's not that the overflow detection was turned off to save cycles, or that some default handling was provided. It was that it was handled with "whatever" (execute random instructions)! by intentionally omitting the handlers.


I don't really see that as the main point. Perhaps I shouldn't have mentioned it at all.

I don't see the practical issue with a model being brittle in the face of imminent mission failure. The model breaking down shortly before you self-destruct the whole thing seems like a rather minor concern. It's entirely irrelevant at that point what the model is.

It turns into an issue when somebody throws the software into a new environment without looking at it or its requirements and then doesn't do any testing with it. But that's not on the original developers. Their solution was entirely valid for their problem.

Even if they had done something like report the maximum value instead, the rest of the software for the Ariane 5 could well have been expecting it to do something else entirely which would still result in a serious problem.

It's an issue of inappropriately using software in a new situation. Without knowing and accounting for how it behaves, you can't just use it and expect everything to work perfectly the first time around. It doesn't matter how well the software accounts for various issues - at some point something won't have only a single correct answer and the software you are using will have to pick how to behave. If you aren't paying attention to that, it can/will come back to bite you.


> It doesn't matter how well the software accounts for various issues - at some point something won't have only a single correct answer

It does, immensely. That's why we have floating point processing units instead of fixed point. Think about it: even single precision FP allows you to have "expected" responses between 10E-38 and 10E38. There are fewer stars in the observable universe. Double precision FP allows the ranges of inputs and outputs to be between 10E−308 and 10E308: there are only 10E80 atoms in the whole observable universe. Can the response which says how much the rocket is "aligned" be meaningful -- sure it can.

This piece of program catastrophically failed because some input was just somewhat bigger than before.

Properly programmed components that are supposed to handle "continuous" inputs and provide "continuous" outputs (and that is the specific part we talk about) should not have "discontinuities" at arbitrary points which are the accidents of some unimportant implementation decisions (leaving the "operand error" exception unhandled for some input variables but protecting against it for others!).

I can understand that you don't understand this if you never worked in the area of numerical computing or signal processing or something similarly tied to "real life" responses, but I hope there are still enough professionals who know what I'm talking about.

Again from the report:

"The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer. This resulted in an Operand Error. The data conversion instructions (in Ada code) were not protected from causing an Operand Error, although other conversions of comparable variables in the same place in the code were protected.

The error occurred in a part of the software that only performs alignment of the strap-down inertial platform. This software module computes meaningful results only before lift-off. As soon as the launcher lifts off, this function serves no purpose."
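
For reference, the "keep the maximum value" style of handling discussed in this sub-thread would amount to a saturating narrowing conversion, sketched here in Python rather than the actual Ada:

    # Clamp a wide value into the 16-bit signed range instead of letting the
    # conversion raise an unhandled Operand Error. Illustrative only.
    INT16_MIN, INT16_MAX = -32768, 32767

    def to_int16_saturating(x):
        return max(INT16_MIN, min(INT16_MAX, int(x)))

    to_int16_saturating(1e9)   # -> 32767, rather than an exception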


> That's why we have floating point processing units instead of fixed point.

I'm not sure what that is supposed to mean. I was talking generally. Not every situation has a single appropriate value to represent it. I don't particularly care if this one example could have used a floating point or not.

> This piece of program catastrophically failed because some input was a just somewhat bigger than before.

As far as the software was concerned the rocket had already catastrophically failed. It actually hadn't, because it was a different rocket than the software was designed for. It was "somewhat bigger" in the sense that it was large enough that the rocket the software was designed for would have been in an irrecoverable situation.

> Properly programmed components that are supposed to handle "continuous" inputs and provide "continuous" outputs (and that is the specific part we talk about) should not have "discontinuities" at the arbitrary points which are the accidents of some unimportant implementation decisions (leaving "operand error" exception for some input variables but protecting from it for others!).

That's theoretically impossible. If you want to account for every possible value you're going to need an infinite amount of memory. There will be a cutoff somewhere, no matter what. Even if that cutoff is the maximum value of a double precision float - that's an arbitrary implementation limitation. You can't just say you can more than count the stars in the sky and that's clearly and obviously good enough for everything. It's not.

There will be a limit, somewhere. It will be an implementation-defined one. As long as the limit suits the requirements, it effectively doesn't matter. In this case, the limit was set such that if it was reached the mission had already catastrophically failed. That's all that can practically be asked for.


I've checked the report: the exception resulted in the transmission of effectively random data to the main computer:

http://www.math.umn.edu/~arnold/disasters/ariane5rep.html

"g) As a result of its failure, the active inertial reference system transmitted essentially diagnostic information to the launcher's main computer, where it was interpreted as flight data and used for flight control calculations."

So the handler in the processes existed, but it effectively confused the main computer. The units shut off, but before that sent "the diagnostic," for which there was no handler at all in the main computer. And even more interesting, these processes weren't even needed for the flight. The main computer would have been able to just ignore such input and the flight would have continued (R1).

Brittle.


> It was that it was handled with "whatever" (execute random instructions)! by intentionally omitting the handlers.

Which is a perfectly valid course of action.

In fact, it is usually the only correct course of action, because there is no other correct course of action to take.

A "keep the maximum value as the result" is always plain wrong (and that extends to all cases of <return whatever fixed value sounds cool>), it wouldn't pass a code review.

Source: That's covered in the "safety & testing" courses of my previous university, that happen to be given by one guy who worked on the Arianes. :p


:) I could have expected that those involved would say "it was according to the specs." I don't claim it wasn't. But the commission didn't find that "it had to be all done as it was":

http://www.math.umn.edu/~arnold/disasters/ariane5rep.html

"4. RECOMMENDATIONS"

"R3 Do not allow any sensor, such as the inertial reference system, to stop sending best effort data."

See my other post: they effectively sent something random ("diagnostics" instead of the data). And this piece of software wasn't even needed to run:

"R1 Switch off the alignment function of the inertial reference system immediately after lift-off. More generally, no software function should run during flight unless it is needed."

And of course, everything wasn't even tested together:

"R2 Prepare a test facility including as much real equipment as technically feasible, inject realistic input data, and perform complete, closed-loop, system testing. Complete simulations must take place before any mission. A high test coverage has to be obtained."


The piece of software was fine. It was done for Ariane 4 and worked as expected.

They re-used it for Ariane 5 without checking/adapting it for work in the different environment (more acceleration & thrust). I don't even know what the name for that kind of mistake is. ^^

> See my other post, they effectively have sent something random ("diagnostics" instead of the data).

The software failed. It doesn't matter what it returned at this point. There is nothing to do but to fix the bug in the software.

If it returned "last number" instead of what it did, it would be considered a bug in the exact same way.

For R2, I suppose that they reused the tests from Ariane 4 as well :D


What do we do about this?


Act! Share the info, raise awareness. It seems non-technical people can't imagine how easily computers and technology can be catastrophically wrong. The accident will happen and we must rationally minimize the impacts:

http://nuclearrisk.org

The political action is essential.


>Pre-flight tests had never been performed on the inertial platform under simulated Ariane 5 flight conditions

I had to read that several times, just to make sure it said what I thought it said.


Just so people are aware that ESA weren't necessarily directly responsible for writing the software, the RTPU (the computer handling the landing sequencing) seems to have been outsourced to a company called Terma (via Thales).

I can't figure out if Terma wrote the landing software, but it seems likely from their press release [1] and they have the relevant specialties.

1: https://www.terma.com/space/exomars/


Terma has said to Danish media that they only developed the hardware, and had nothing to do with the software.

The software might have been outsourced to another company.


When will people learn not to outsource anything important?!


Seriously.

Look at NASA, they definitely haven't outsourced anything at all. They use all in house hardware and software to launch satellites or send astronauts to the ISS.


Eh, is this sarcasm?

NASA has outsourced things since the original Apollo program, massively.

NASA has never brought a single astronaut into space on a fully NASA-made rocket. Especially with parts like these, they're commonly outsourced.


Of course it was sarcasm, I thought it was obvious. NASA outsources everything these days. I was responding to the idea that you should never outsource important things when EVERYTHING is outsourced, including astronaut launches.


Not to be snarky, but NASA also had its fair share of failures. I mean ... it's rocket science after all.


ESA didn't write the software, but surely they're responsible.


The IMU saturated, yet its outputs continued to be used - i.e. a failure to handle an exception. In some sense, it was a rookie mistake. Formal systems verification has reached the point where this sort of thing should be caught, if you are prepared to take the fairly considerable effort that adopting these methods requires.


Seems strange that a function that looked at the data and thought the machine was underground didn't throw that data out.

Wouldn't they have a model of expected data, with anything that is wildly out removed? Fall back to an estimate?


They probably have. A large problem in sensor fusion is: you don't always know which sensor you can trust. For example, in this case it probably would have helped to reject the IMU and rely on the Doppler radar altimeter. But what if it's the Doppler radar that for some reason returns an incorrect value indicating that the vehicle is a lot higher than the IMU says?


It's reasonable to assume that if you're still moving and apparently underground there's a problem with the sensors, and therefore it's best not to do anything rash - like starting the landing sequence.

This does sound like a rookie mistake, and I'll bet there's a dev somewhere saying "Oh crap" a lot because of it.

It's also a hardware design problem. NASA's human missions have a weight budget that includes multiple redundant systems. I have no idea if this probe had three of everything, but missions that do are more likely to be successful.

More subtly, I'm curious if the sensor failed or maxed out. The kinds of forces that would max out the sensor might also max out the probe. So it's possible it was already doomed after they were encountered.


> This does sound like a rookie mistake, and I'll bet there's a dev somewhere saying "Oh crap" a lot because of it.

Howard Wolowitz.


Byzantine Generals


So their model of the spacecraft behavior allowed for the possibility of the altitude instantly changing from 3.7km to a negative value. Seems like poor design.


Their system design accepted data from a number of sensors, and allowed for the possibility of sensor error.

In a hypothetical scenario, if one of those altitude sensors used barometric pressure, then a piece of grit or ice could have blocked the air intake, allowing the sensor to record a lower pressure than expected. If that grit or ice were suddenly blown away, then the sensor could record an instant change from an incorrect value (3.7km altitude) to a correct value. Even a negative altitude isn't necessarily incorrect, if the "altitude" is based on a mean ground level. For example, the shore of the Dead Sea is at an altitude of -423m [1].

The challenge with designing something like this is to imagine all the possible ways it could fail, all the possible misleading sensors inputs, and an appropriate action for each of them.

[1] https://en.wikipedia.org/wiki/List_of_places_on_land_with_el...


  Even a negative altitude isn't necessarily incorrect,
  if the "altitude" is based on a mean ground level.
This has been niggling at me, so I had to check :-)

As Mars lacks water, the idea of a "zero datum" isn't really meaningful, so an arbitrary point has been used.

Between 1971 and 2001, the "zero elevation" point was an atmospheric pressure of 6.105mb [1] (for comparison, Earth's zero datum is 1013.25mb).

Since 2001, the zero datum definition has been based on the mean radius of the planet (an ellipsoid called the Areoid, similar to Earth's Geoid).

The lowest point on Mars is therefore at an altitude of -8200m (the Hellas impact crater). [2]

(edit) Interestingly, the intended landing site - Meridiani Planum - is "below ground" at an altitude of -250m. [3]

[1] https://en.wikipedia.org/wiki/Geography_of_Mars#Zero_elevati...

[2] http://geology.com/articles/highest-point-on-mars.shtml

[3] http://io9.gizmodo.com/this-elevation-map-of-mars-makes-the-...


Thanks for the info.

On a tangentially related note, do you have any idea what happens to the concept of sea level on earth as the sea level changes?


Thank you, and that's a really interesting question.

There are really two concepts in there: the idea of "sea level", and the idea of "altitude", which we generally understand to be height above or below sea-level, but in some circumstances is only tangentially related to sea-level.

Most current geospatial representations use the WGS84 datum, which is the reference used by the original USA GPS system called Navstar.

WGS does get revised every now and again. The last revision was in 2004. However, the concept of WGS is built around an imaginary idealised ellipsoid that represents the shape of the Earth's surface and the Earth's gravitational field, so a change in sea-level wouldn't necessarily change the definition of WGS, and if it were to be updated, it would be slow to update.

That said, GPS is notoriously bad at reporting altitude, so tools tend not to rely on GPS for altitude measurements anyway.

A good example is in aviation. Altitude isn't measured using GPS, it's measured using barometric pressure against one of three references: either the recorded barometric pressure at a particular airfield (QFE), the corresponding barometric pressure at sea-level (QNH) or a global idealised standard barometric pressure (1013mb). During take-off or landing, the altimeter is calibrated to the airfield's barometric pressure, to accurately report the altitude above the airfield. During transition to cruising altitude, many aircraft use QNH. During general flight, altimeters are calibrated to the global standard of 1013mb, so that each aircraft maintains consistent vertical separation. Commercial aircraft flying at "Flight Level" use that standard barometric pressure to calculate altitude. For that use-case, it doesn't matter whether the reported altitude is the height above the ground, the height above mean sea-level, or the height above some arbitrary reference point: what's important is that everyone is communicating using the same reference, which is why the mean atmospheric pressure of 1013mb was adopted.

If (when?) sea-level rises, then small changes will be seen. Nautical charts will be updated with new high-water/low-water marks, maps will eventually be updated, etc. Widely-used references such as WGS84 will take a long time to update, and may not even be updated at all (because an altitude difference of 1 or 2 meters isn't hugely significant). Other reference points, such as the barometric pressure of 1013mb are unlikely to change, partly because they have a long legacy, and partly because the accuracy of the value is to a large part irrelevant.
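
For the curious, the usual conversion behind those altimeter settings is the standard-atmosphere relation; a rough sketch (not aviation-grade code):

    # Pressure altitude from static pressure, ICAO standard-atmosphere
    # approximation, referenced to the 1013.25 hPa setting mentioned above.
    def pressure_altitude_m(pressure_hpa, reference_hpa=1013.25):
        return 44330.0 * (1.0 - (pressure_hpa / reference_hpa) ** 0.1903)

    pressure_altitude_m(1013.25)   # 0.0 m at the standard reference
    pressure_altitude_m(850.0)     # roughly 1,460 m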


"A large volume of data recovered from the Mars lander shows that the atmospheric entry and associated braking occurred exactly as expected"

In every communication and media event they keep putting a positive spin on what now turns out to be in essence a preventable design error...


They have to, unfortunately, because the general public (including journalists) doesn't understand that mistakes happen, and ESA, being seriously underfunded as it is, cannot afford bad publicity.

Also, this was a test flight designed to gather data about the performance of the orbiter/lander platform, so it did perform its mission and according to the collected data, the platform behaves mostly as expected.


they're fighting tooth and nail for this mission's successor (which will be the real deal vs what was effectively an expensive acceptance test run) to not be canceled, i.e. politics.


What I'm wondering is: when the Inertial Measurement Unit's erroneous information was fed into the navigation system and resulted in a negative estimated altitude, why didn't the Mars lander also have an accelerometer that could tell "hey, we're still accelerating, we're not stationary" and thus attempt some form of recovery of the navigation system -- perhaps reset the IMU or something? Reread data from the IMU after a certain delay?

I'm also quite confident that even before commencing entry, they had some idea from simulations as to how long the descent should take. Surely there's something wrong if you read a negative altitude after just half the time required for landing.


But you don't know which IMU is misreporting. Now, given the nature of the situation you could probably say it's best to err on the side of acceleration. If you've stopped accelerating and haven't initiated landing procedures you're either wrong or somehow already landed and can afford to wait.

It's also not clear to me that this was an issue of erroneous data. All they said so far was that a rotational sensor maxed out for about a second. I don't think a simple delay-and-retry would have sufficed here. When attempting to land a second seems like a pretty long wait.


What's the difference between an IMU and an accelerometer?


IMU means gyro, that is, it measures rotational rates. Accelerometers measure translational rates.


> IMU means gyro

In aerospace the phrase IMU typically refers to the combination of accelerometers and gyroscopes.


Sorry, what I meant to say was a redundant IMU that could be read in specific cases like this one.


    When merged into the navigation system, the erroneous information generated an estimated altitude that was negative – that is, below ground level.
I wonder if this was a sensor fusion problem or a pedestrian integer overflow.


I would hope that it was a sensor fusion error.

IMU saturation is an expected condition, especially in this type of extreme real-world environment. It's absolutely normal to temporarily saturate IMU sensors on drones during hard landings for example and saturation values are generally clear in the documentation. A large part of the difficulty in position and attitude estimators is in rejecting glitching or erroneous data.

From the description (specifically that the duration of the saturation was an issue), it would sound like the position estimator (most likely a Kalman filter) did not reject or distrust saturation values properly and converged on a bad solution. It appears that any strategy in place to (for example) reset the estimator state if it diverged from more reliable sensors (such as the radar altimeter) failed or was not included. It would also appear that the case where the IMU saturated during descent for over 1 second was not properly tested, as they were able to reproduce the issue in simulation.

This must be disappointing for the ESA but at least it was found now 'in beta' rather than in the relatively more important next lander.
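
One common way to implement the "distrust saturated values" part is to compare the raw reading against the sensor's documented limits and either skip the update or hand the filter a huge variance when the reading is railed. A rough sketch with made-up numbers:

    # Hypothetical pre-filter gate for a rate gyro with a documented range
    # of +/- RATE_LIMIT. A railed reading is treated as "unknown", not truth.
    RATE_LIMIT = 3.0   # rad/s, illustrative only

    def gyro_measurement(raw_rate, nominal_var=1e-4):
        saturated = abs(raw_rate) >= 0.99 * RATE_LIMIT
        if saturated:
            # Inflate the measurement variance so the estimator barely
            # weights it (or skip the measurement update entirely).
            return raw_rate, 1e6
        return raw_rate, nominal_var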


Whilst incredibly disappointing for ESA, this is unfortunately not their first IMU-failure rodeo.

An Ariane 5 launch back in 1996 [0] suffered a catastrophic failure after the inertial reference units gave bad data, and the flight control computer accepted it as gospel.

It is sad that they might have lost another platform due to a lack of appropriate range / saturation checking, especially as there was a radar altimeter onboard telling them they weren't underground.

[0] https://en.wikipedia.org/wiki/Cluster_(spacecraft)#Launch_fa...


Ariane is built by Airbus, Schiaparelli was built by Thales Alenia, two completely independent companies. Not sure why there's supposed to be a correlation, other than both were contracted by ESA.


> both were contracted by ESA.

Yes, I'm aware of how ESA works and who built what. But I would have hoped the common connection would have better instilled the need for checking for these sorts of errors, if the Schiaparelli incident is as blutack theorised.


It says on Thales' webpage that their IMU is already on board Ariane 5. I don't know what IMU was on Schiaparelli.

https://www.thalesgroup.com/en/worldwide/aerospace/topaxyz-i...


Nobody claimed the same company did it, just that there are similarities in the way the software errors caused the destruction.


> Kalman filter

I hesitate to speculate but the book I read that talked of Kalman filters and two people I know who messed with them said they are opaque and tricky to get right[1]. Hard to understand what the filter constants mean, and buggy software implementations 'sort of work' then break.

[1] Convinced me to use something else.


Nonsense, Kalman filters are well understood and widely used in spacecraft attitude dynamics and control systems. In fact, I'd argue that a Kalman filter is industry standard for that work. It's typical on teams doing GNC (guidance, navigation, and control) to have a PhD or two in control theory to make sure you get things right.

Most aerospace engineers with a GNC background know a lot about Kalman filters.

Examples: http://www.techbriefs.com/component/content/article/ntb/webc...

http://digitalcommons.usu.edu/smallsat/2016/TS10AdvTech2/5/


The simpler statement would be: the lander started to spin after the parachute opened, and the sensor properly detected that it rotated a lot, giving the maximum value as an output, but the rotation lasted for at least one second and we made the rest of the software without having any idea that this could happen, so this one sensor output confused the whole lander. Or even shorter: "nobody here expected it could spin for a whole second."

Once the model is so simple, there are many ways to program it wrongly.

To compare, NASA's Mars Climate Orbiter

https://en.wikipedia.org/wiki/Mars_Climate_Orbiter

burned up close to Mars in 1999 because the "ground based computer" software, programmed per contract by Lockheed, used lbf·s values instead of the agreed N·s values.

https://en.wikipedia.org/wiki/Pound_(force)

The values generated on the ground were off by a factor of 4.45.

Space missions are risky.


IMO it's more likely that the accelerometer, rather than the gyroscope, would saturate, and the sentence that explains the IMU is shortened too much.


Maybe. The only thing I can conclude from the description is just that the lander was simply spinning too long (too long as in "the whole second") for their software to handle the correct information from the sensors. Now it sounds too simple.

But it's good that they made such an experiment before sending the whole rover, planned for 2020.

https://en.wikipedia.org/wiki/ExoMars_(rover)


Might be a sensor range issue


Good work diagnosing the error. A reliable diagnostic is a great outcome of a bad situation. At least now, measures can be taken to eliminate this error from future scenarios.


I'm still trying to understand whether this was a software or a hardware issue. I mean, it's true that perhaps the software should have been better at rejecting the saturated data, but if an IMU persists in a saturated state for 1 second, that's a whole lot of time during which the software just cannot know what is happening.

Granted, if it predicts an impossible configuration due to this, perhaps some different action should have been taken, like waiting for higher confidence and a more close-to-expected attitude estimate, but by then perhaps it would be too late. I mean, the thing is falling from the sky.. perhaps the "safe" thing was to deploy anyway. (Even if it didn't work.) I just don't know.


So, spend more money on the sensor simulator's chaos monkey. Having worked on ESA commissioned ground systems emulation, I know this is easier said than done. Indeed getting anything done is hard.


> Indeed getting anything done is hard.

Why? Too much Red tape?


Basically. Imagine how NASA splits up contracts for a large project, then put each of the companies in a different country.


I don't understand why/how a feed of IMU data (gyro + probably accelerometers) could override a Doppler altimeter that "functioned properly"... Any aircraft that would use inertial data to determine altitude would be in very deep trouble on Earth as well. I don't get it.


To prevent an issue if the IMU is saturated, couldn't they just add an AND clause to the conditional to prevent parachute release? altitude > 0 and time since entry > x, then release the parachute? The entry counter would be triggered once temperature reaches a certain threshold or another sensor detects entry.
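
Something like this, purely as an illustration of the proposed guard (the threshold is invented):

    # Hypothetical interlock: only allow the parachute sequence if the
    # estimated altitude is positive AND enough time has passed since entry.
    MIN_TIME_SINCE_ENTRY_S = 150.0   # invented value

    def may_release_parachute(est_altitude_m, time_since_entry_s):
        return est_altitude_m > 0.0 and time_since_entry_s > MIN_TIME_SINCE_ENTRY_S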


Just by inspection, that looks like it will have a ton of unwanted side effects... How do you define "entry"? What if the triggering of the "entry" timer fails?


You can have multiple sensors start the timer.


This feels like the kind of bug I catch early in implementation using quickcheck x_x Is property-based testing used in the development of these kinds of systems?
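
In Python the analogous tool would be Hypothesis; a sketch of the kind of property one might write against a (hypothetical) fusion function:

    from hypothesis import given, strategies as st

    # Property: whatever the (possibly saturated) gyro reading, the fused
    # altitude estimate should never come out negative during descent.
    # fuse_altitude() is a made-up stand-in for the real estimator.
    @given(alt=st.floats(min_value=0, max_value=130e3),
           rate=st.floats(min_value=-10, max_value=10))
    def test_altitude_never_negative(alt, rate):
        assert fuse_altitude(alt, rate) >= 0.0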


Which would totally work if you have a super high resolution full system simulation. Hardware, software, mechanical and electrical interconnections all need to be simulated for issues like this to be found automatically.


Of course it is easy to be wise after the fact, but surely this is the sort of thing that should have been exhaustively tested (what happens if one sensor malfunctions and sends incorrect data?).

Massive kudos for being public about the issue. It is not easy to talk about mistakes, especially those that appear to be silly in hindsight.

EDIT Interesting to see that I am getting downvoted multiple times on this. I don't really care about a few downvotes but would be interested to find out why.


It isn't too clear from the description, but isn't non-saturation of the inertial measurement responsible, rather than saturation? I'd expect that some computation did not saturate its output, and the result overflowed (and hence became negative), which in turn was fed into the altitude estimate.


Why is rotation data being used to calculate altitude?

And why hasn't anybody asked that question yet? It's the first question I thought to myself.

It's a shame such an expensive piece of equipment and years of work and waiting around were obliterated by a bug that should have been caught.


How could there not be a test case against negative altitude during descent?


negative altitude ... so integer overflow it was ...


if (altitude < 0) { return ERR_INVALID_ALTITUDE; }


Aircraft departing my local airfield commence at -4 metres according to GPS as it is actually below sea level.


And the intended landing site is at an altitude of -250m referenced to the Martian datum.


First integer overflow to (literally) hit Mars!


Maybe next time have a better system design so one instrument/sensor error/failure/loss doesn't lead to total loss of unit.


Maybe rather than a design issue this can be seen as a development process and engineering management issue.

I expect that a top level system model should reduce the risk of this sort of error.

Of course, it is hard to justify the cost of developing these models without being able to amortize that cost over multiple products / missions, so product line oriented engineering management and strong leadership from the sponsor has a big role to play as well.


Errors in INS are common. Saturation is common. What isn't common is a bad design that isn't fail-safe or fail-operational for a single failure.


Remember, spacecraft have weight constraints that limit redundancy options. Besides, this was a 'tracer bullet' designed to test an unproven procedure before committing to a more expensive mission, so there's that.


Designing a reliable system that is failure resistant is what aerospace design is all about. With INS, fail-op design is the norm. Yes, accidents still happen for common mode failures, but those are supposed to be worked out before committing hundreds of millions of euros.

Costly mistakes happen when people take unjustified shortcuts. Like not testing Hubble's mirror on the ground, and ending up with a myopic telescope in orbit that then needs another billion dollar mission to correct.


Humanity is truly lucky to have your hindsight at the ready; please share more.


Navigating correctly when you have failed sensors is easy. Working out when to discard information from a sensor that seems to be working, but is giving bad readings, is much harder.


These are solved problems. Look at the INS in the B777, for example, designed in the '80s with quadruple laser gyros.



