
Schiaparelli landing investigation makes progress - YeGoblynQueenne
http://www.esa.int/Our_Activities/Space_Science/ExoMars/Schiaparelli_landing_investigation_makes_progress
======
xgbi
What is fantastic here is that the telemetry was received for nearly the
entire descent, up to a few seconds before the crash.

They could reproduce the exact same scenario in simulation by replaying the
telemetry as inputs to their firmware and seeing what happened in the
software. This is a tremendous advantage for debugging; it's like a Wireshark
replay of network packets for debugging an error in the TCP stack.
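
In spirit, the loop is something like this (a toy sketch; the frame format
and `guidance_step` are invented here, not the actual flight software):

    # Replay recorded telemetry frames through the guidance code and
    # inspect the evolving state. All names and values are invented.
    recorded_frames = [
        {"t": 0.0, "radar_altitude_m": 3700.0},
        {"t": 0.1, "radar_altitude_m": 3640.0},
        {"t": 0.2, "radar_altitude_m": -1.0},  # the suspicious sample
    ]

    def guidance_step(state, frame):
        # Stand-in for the real estimator update.
        state["altitude"] = frame["radar_altitude_m"]
        return state

    state = {"altitude": None}
    for frame in recorded_frames:
        state = guidance_step(state, frame)
        print(frame["t"], state["altitude"])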

This orbiter/descender architecture is really paying off for future missions.
If the orbiter hadn't been there, Earth wouldn't have received any telemetry
at all. Let's hope this "real world" telemetry is saved for simulating
successive versions of the guidance software for future probes.

~~~
fest
I've worked on sensor fusion systems and I fully agree. Replaying the sensor
inputs and evaluating the new estimated state is a really good way of
debugging failures (because you can't just stop the system mid-air and
inspect its internal state). It also helps with the regression test suite and
with trying out new algorithms quickly.

~~~
vvanders
We'd do the same for debugging networked games. Incredibly useful.

If you build it right you also get a replay feature for almost free.

------
grondilu
> the erroneous information generated an estimated altitude that was negative
> – that is, below ground level.

Kind of sad to see that computers nowadays still lack what we call common
sense. I mean, the machine received negative altimeter data, but if it had
been able to just look around, it could have seen that this data was
erroneous. Hell, considering the descent profile was carefully prepared, the
computer should also have noticed that it was way too early to reach the
ground and concluded that something was wrong with the altitude measurement.

~~~
ygra
Well, common sense in a way would be enabled with assertions. But even if
you know that you cannot be below ground, how would you recover? Your sensors
clearly tell you that you're below ground, so what is the course of action?
You have no way of knowing where you are; you just know that you don't know.
It doesn't really change anything in the outcome. Sure, maybe you could fall
back from altitude measurements to time measurements, and most likely still
leave a crater since they're not exact enough (there's a reason for the
inertial measurement instead of simpler methods, after all).

It would also require you to anticipate the malfunction, in which case you
already know that things like that can happen, and then you'd likely
determine how it could happen, how to avoid it, and how to continue having
useful data. That sounds to me like a better solution than trying to recover
without data.

~~~
WalterBright
There are several ways. First off, ignore the erroneous data. Then estimate
the altitude by other methods. As you mentioned, one way would be to
integrate velocity over time.

All inputs and outputs should be run through sanity checks, with a plan for
when they go wrong.

Essentially, when you've got a system that can't be fixed, plan for failure of
each and every component (in isolation).
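
A crude sketch of that fallback (all numbers and names here are illustrative,
not flight code):

    # Sanity-check the measured altitude; if it is implausible, fall back
    # to dead reckoning by integrating velocity over time.
    def plausible(altitude_m):
        # Below ground or above the entry altitude is physically impossible.
        return 0.0 <= altitude_m <= 125_000.0

    def estimate_altitude(measured_m, last_good_m, vertical_speed_mps, dt_s):
        if plausible(measured_m):
            return measured_m
        return last_good_m + vertical_speed_mps * dt_s

    print(estimate_altitude(-2000.0, 3700.0, -80.0, 0.1))  # -> 3692.0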

~~~
homero
Cars do this: you can kill many sensors and the engine will still limp
along, even with misfires. I guess consulting Bosch isn't an option, though.

~~~
TorKlingberg
It helps that car engines don't really _need_ sensors to function; cars
existed before electronic sensors. A Mars lander actually needs to know when
to deploy its parachute.

~~~
gonzo
You'll find that fuel injectors don't work well without input from the MAF and
O2 sensor(s).

~~~
WalterBright
True of modern EFI, but older mechanical fuel injectors did not have O2
sensors.

------
fest
While it may sound like the failure was caused by a rookie mistake, I'm sure
the on-board systems were designed and developed by people who know what they
are doing.

It must be hard for all the people involved in Schiaparelli's development.

Sometimes I wonder whether the majority of software errors are caused simply
by multiple layers of people holding incorrect / unchallenged / unjustified
(and mostly implicit) assumptions about something.

~~~
wott
> I'm sure the on-board systems were designed and developed by people who know
> what they are doing.

I have been working in safety-related domains, where each failure could
involve the death of several hundred people. It wasn't pretty: everything was
sub-contracted to death, and there was a minority of people who really knew
what they were doing, a majority who were so-so and didn't give a fuck, and
still a fair share who were notoriously incompetent. So the minority of
competent and concerned people cannot always make up for the others, and when
they themselves fail, there is almost no one to make up for them. Oh, and I
was working in a company that was supposed to produce better quality than
others. And it did :-(

So, for something like ESA, where lives are not even at stake, I can't
imagine it is any better when stuff is outsourced, and since nowadays almost
everything everywhere is outsourced, I assume ESA does it too.

> Sometimes I wonder whether the majority of software errors are caused
> simply by multiple layers of people holding incorrect / unchallenged /
> unjustified (and mostly implicit) assumptions about something.

Yes and no. This causes one category of mistakes, but on the other hand it
prevents another category. When someone knows the full system, he makes a
whole lot of assumptions: it doesn't matter if this function cannot handle
this or that case, because it will never be called that way, it cannot
happen, so I won't handle those cases and I won't even check for them. And
then, later, something changes in an upper function or system, the assumption
is not valid any more, and boom.

When you have no idea what the system is, you just stick to the function
definition and have it handle all the weird cases. You don't make
assumptions. You can still make mistakes, though, some of them caused by the
lack of global understanding of the system: the specification writer had
those assumptions in mind and did not write them down in this function's
specification, because they were obvious to him or were already written in
another part of the specification.

~~~
acqq
> And then, later, something changes in an upper function or system, the
> assumption is not valid any more, and boom.

That's exactly how Ariane 5 crashed in 1996:

https://en.wikipedia.org/wiki/Cluster_(spacecraft)#Launch_failure

"Specifically, the Ariane 5's greater horizontal acceleration caused the
computers in both the back-up and primary platforms to crash and emit
diagnostic data misinterpreted by the autopilot as spurious position and
velocity data. Pre-flight tests had never been performed on the inertial
platform under simulated Ariane 5 flight conditions so the error was not
discovered before launch. During the investigation, a simulated Ariane 5
flight was conducted on another inertial platform. It failed in exactly the
same way as the actual flight units."

~~~
fghgfdfg
I think it's also important to note that the inertial platform was developed
for the Ariane 4, where it worked correctly.

The software was actually developed correctly and functioned as intended, at
least for its intended use. Then it was tossed at a new use case without any
accounting for the differences in the new situation.

~~~
acqq
> The software was actually developed correctly

Not quite. If you read the details about the case you can find that it
didn't have a handler for the overflow in the calculations(!) It's similar to
this case now in that both were developed under the assumption "can't
happen," in the sense of being developed to be _too brittle_ for inputs that
were certainly possible as soon as the trajectory (in the case of Ariane 5)
or the duration of the spinning movement (this case now) didn't match their
initial test cases.

Still, development, especially in this kind of project, is always a balancing
act of covering most of the cases that can go wrong. Murphy's law works
against the whole organization. Given the number of real problems, I'm still
amazed that Apollo 11 succeeded.

Or even that there haven't been any really destructive "accidents" involving
rockets with nuclear warheads. Think about it: these are prone to the same
problems as any other computer-related project, and the amount of damage is
effectively infinitely larger than the effort needed to start it.

https://www.theguardian.com/world/2016/jan/07/nuclear-weapons-risk-greater-than-in-cold-war-says-ex-pentagon-chief

“These weapons are literally waiting for a short stream of computer signals to
fire. They don’t care where these signals come from.”

“Their rocket engines are going to ignite and their silo lids are going to
blow off and they are going to lift off as soon as they have the equivalent
of you or I putting in a couple of numbers and hitting enter three times.”

http://thebulletin.org/

"It is 3 minutes to midnight"

Also: "How Risky is Nuclear Optimism?"

http://www-ee.stanford.edu/%7Ehellman/publications/75.pdf

And if you still think "but it works, the proof is that it hasn't exploded up
to now", just consider this graph from Nassim Taleb:

http://static3.businessinsider.com/image/5655f69c8430765e008b57c8-1200-900/taleb-turkey.png

~~~
fghgfdfg
> Not quite. If you read the details about the case you can find that it
> didn't have a handler for the overflow in the calculations(!) It's similar
> to this case now in that both were developed under the assumption "can't
> happen," in the sense of being developed to be too brittle for inputs that
> were certainly possible as soon as the trajectory (in the case of Ariane 5)

I'm not sure that's entirely fair. The software was intended for the Ariane
4, which wasn't expected to experience as much horizontal acceleration as
the 5. If the 4 had experienced such an acceleration, it wasn't going to be
capable of recovering anyway. That area of the code also explicitly had some
protections provided by the language removed for the sake of efficiency. So
it wasn't a total oversight that just happened to work out - there was a
decision made based on the fact that the rocket had already irrecoverably
failed if the situation ever occurred.

While I agree it's somewhat distasteful not to cover all the bases in the most
technically correct way all the time, I'm not sure how important it is to have
an overflow handler fire in the inertial reference system just as the rocket
self-destructs.

~~~
acqq
> That area of the code also explicitly had some protections provided by the
> language removed for the sake of efficiency

As far as I know, efficiency wasn't the issue; it's just that the "model"
was, as I've said, brittle. The overflow was to be handled with what we'd
today call "an exception handler", and the selected solution, instead of
(reasonably) writing a "keep the maximum value as the result" handler, was to
leave the processor effectively executing random code in case the overflow
occurred. And the "exception" occurred. It's not that the overflow detection
was turned off to save cycles, or that some default handling was provided; it
was handled with "whatever" (execute random instructions!) by intentionally
omitting the handlers.

~~~
fghgfdfg
I don't really see that as the main point. Perhaps I shouldn't have mentioned
it at all.

I don't see the practical issue with a model being brittle in the face of
imminent mission failure. The model breaking down shortly before you
self-destruct the whole thing seems like a rather minor concern. It's
entirely irrelevant at that point what the model is.

It turns into an issue when somebody throws the software into a new
environment without looking at it or its requirements, and then doesn't do
any testing with it. But that's not on the original developers. Their
solution was entirely valid for their problem.

Even if they had done something like report the maximum value instead, the
rest of the software for the Ariane 5 could well have been expecting it to do
something else entirely, which would still have resulted in a serious
problem.

It's an issue of inappropriately using software in a new situation. Without
knowing and accounting for how it behaves, you can't just use it and expect
everything to work perfectly the first time around. It doesn't matter how
well the software accounts for various issues - at some point something won't
have only a single correct answer, and the software you are using will have
to pick how to behave. If you aren't paying attention to that, it can and
will come back to bite you.

~~~
acqq
> It doesn't matter how well the software accounts for various issues - at
> some point something won't have only a single correct answer

It does, immensely. That's why we have floating-point processing units
instead of fixed-point ones. Think about it: even single-precision FP allows
you to have "expected" responses between 10^-38 and 10^38, and there are
fewer stars than that in the observable universe. Double-precision FP allows
the ranges of inputs and outputs to be between 10^-308 and 10^308, while
there are only about 10^80 atoms in the whole observable universe. Can the
response which says how well the rocket is "aligned" be meaningful? Sure it
can.

This piece of the program catastrophically failed because some input was
just somewhat bigger than before.

Properly programmed components that are supposed to handle "continuous"
inputs and provide "continuous" outputs (and that is the specific part we are
talking about) should not have "discontinuities" at arbitrary points which
are the accidents of some unimportant implementation decisions (leaving the
"operand error" exception for some input variables but protecting from it for
others!).

I can understand that you don't understand this if you have never worked in
numerical computing or signal processing or some equivalent area of "real
life" responses, but I hope there are still enough professionals who know
what I am talking about.

Again from the report:

"The internal SRI software exception was caused during execution of a data
conversion from 64-bit floating point to 16-bit signed integer value. The
floating point number which was converted had a value greater than what could
be represented by a 16-bit signed integer. This resulted in an Operand Error.
_The data conversion instructions (in Ada code) were not protected from
causing an Operand Error, although other conversions of comparable variables
in the same place in the code were protected_.

The error occurred in a part of the software that only performs alignment of
the strap-down inertial platform. This software module computes meaningful
results only before lift-off. As soon as the launcher lifts off, this function
serves no purpose."
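
To make the "keep the maximum value as the result" alternative concrete, here
is a sketch in Python (the flight code was Ada; the bias value is invented):

    # A saturating 64-bit float -> 16-bit signed integer conversion, versus
    # the unprotected conversion that raised an unhandled Operand Error.
    INT16_MIN, INT16_MAX = -32768, 32767

    def convert_saturating(x: float) -> int:
        return max(INT16_MIN, min(INT16_MAX, int(x)))

    horizontal_bias = 65535.0  # larger than anything Ariane 4 produced
    print(convert_saturating(horizontal_bias))  # 32767: clamped, degraded
    # Ariane 5's unprotected conversion instead raised an Operand Error,
    # which shut down the inertial reference system.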

~~~
fghgfdfg
> That's why we have floating-point processing units instead of fixed-point
> ones.

I'm not sure what that is supposed to mean. I was talking generally: not
every situation has a single appropriate value to represent it. I don't
particularly care whether this one example could have used floating point or
not.

> This piece of the program catastrophically failed because some input was
> just somewhat bigger than before.

As far as the software was concerned, the rocket had already catastrophically
failed. It actually hadn't, because it was a different rocket than the
software was designed for. The input was "somewhat bigger" in the sense that
it was large enough that the rocket the software was designed for would have
been in an irrecoverable situation.

> Properly programmed components that are supposed to handle "continuous"
> inputs and provide "continuous" outputs (and that is the specific part we
> are talking about) should not have "discontinuities" at arbitrary points
> which are the accidents of some unimportant implementation decisions
> (leaving the "operand error" exception for some input variables but
> protecting from it for others!)

That's theoretically impossible. If you want to account for every possible
value you're going to need an infinite amount of memory. There will be a
cutoff somewhere, no matter what. Even if that cutoff is the maximum value of
a double precision float - that's an arbitrary implementation limitation. You
can't just say you can more than count the stars in the sky and that's clearly
and obviously good enough for everything. It's not.

There will be a limit, somewhere. It will be an implementation-defined one. As
long as the limit suits the requirements, it effectively doesn't matter. In
this case, the limit was set such that if it was reached the mission had
already catastrophically failed. That's all that can practically be asked for.

------
acdx
So their model of the spacecraft's behavior allowed for the possibility of
the altitude instantly changing from 3.7 km to a negative value. Seems like
poor design.

~~~
isoprophlex
"A large volume of data recovered from the Mars lander shows that the
atmospheric entry and associated braking occurred exactly as expected"

In every communication and media event they keep putting a positive spin on
what now turns out to be, in essence, a preventable design error...

~~~
TeMPOraL
They have to, unfortunately, because the general public (including
journalists) doesn't understand that mistakes _happen_, and ESA, being
seriously underfunded as it is, cannot afford bad publicity.

Also, this was a test flight designed to gather data about the performance of
the orbiter/lander platform, so it did perform its mission, and according to
the collected data the platform behaves mostly as expected.

------
dudeonthenet
What I'm wondering is: when the Inertial Measurement Unit's erroneous
information was fed into the navigation system and resulted in a negative
estimated altitude, why didn't the Mars lander also have an accelerometer
that could tell "hey, we're still accelerating, we're not stationary" and
thus attempt some form of recovery of the navigation system? Perhaps reset
the IMU, or re-read data from the IMU after a certain delay?

I'm also quite confident that even before commencing entry, they had some
idea from simulations of how long the descent should take. Surely something
is wrong if you read a negative altitude after just half the time required
for landing.

~~~
programmer_dude
What's the difference between an IMU and an accelerometer?

~~~
planteen
IMU means gyro, that is, it measures rotational rates. Accelerometers measure
translational rates.

~~~
votingprawn
> IMU means gyro

In aerospace the phrase IMU typically refers to the _combination_ of
accelerometers and gyroscopes.

------
MrBuddyCasino

        When merged into the navigation system, the erroneous information generated an estimated altitude that was negative – that is, below ground level.
    

I wonder if this was a sensor fusion problem or a pedestrian integer overflow.

~~~
blutack
I would hope that it was a sensor fusion error.

IMU saturation is an expected condition, especially in this type of extreme
real-world environment. It's absolutely normal to temporarily saturate IMU
sensors on drones during hard landings, for example, and saturation values
are generally clearly stated in the documentation. A large part of the
difficulty in position and attitude estimators is in rejecting glitchy or
erroneous data.

From the description (specifically, that the duration of the saturation was
an issue), it sounds like the position estimator (most likely a Kalman
filter) did not properly reject or distrust saturated values and converged on
a bad solution. It appears that any strategy in place to (for example) reset
the estimator state if it diverged from more reliable sensors (such as the
radar altimeter) either failed or was not included. It would also appear that
the case where the IMU saturates during descent for over 1 second was not
properly tested, since they were able to reproduce the issue in simulation.
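
A typical guard in such an estimator is innovation gating, roughly like this
(a toy sketch with invented thresholds, not the actual flight filter):

    # Distrust a measurement whose residual ("innovation") against the
    # filter's prediction is implausibly large; persistent rejections
    # should trigger a re-initialisation from a more reliable sensor.
    def gate(predicted, measured, sigma, n_sigmas=3.0):
        return abs(measured - predicted) <= n_sigmas * sigma

    predicted_pitch_rate = 0.4  # rad/s, from the filter's prediction
    measured_pitch_rate = 3.14  # rad/s, a saturated IMU reading
    if not gate(predicted_pitch_rate, measured_pitch_rate, sigma=0.1):
        print("measurement rejected")  # e.g. fall back to radar altimeter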

This must be disappointing for ESA, but at least it was found now, 'in beta',
rather than in the relatively more important next lander.

~~~
votingprawn
Whilst incredibly disappointing for ESA, this is unfortunately not their
first IMU-failure rodeo.

An Ariane 5 launch back in 1996 [0] suffered a catastrophic failure after the
inertial reference units gave bad data and the flight control computer
accepted it as gospel.

It is sad that they may have lost another platform due to a lack of
appropriate range / saturation checking, especially as there was a radar
altimeter onboard telling them they weren't underground.

[0] https://en.wikipedia.org/wiki/Cluster_(spacecraft)#Launch_failure

~~~
creshal
Ariane is built by Airbus; Schiaparelli was built by Thales Alenia: two
completely independent companies. Not sure why there's supposed to be a
correlation, other than both being contracted by ESA.

~~~
votingprawn
> both were contracted by ESA.

Yes, I'm aware of how ESA works and who built what. But I would have hoped
the common connection would have better instilled the need to check for these
sorts of errors, if the Schiaparelli incident is as blutack theorised.

------
yummybear
Good work diagnosing the error. A reliable diagnosis is a great outcome of a
bad situation. At least now measures can be taken to eliminate this error
from future scenarios.

------
radarsat1
I'm still trying to understand whether this was a software or a hardware
issue. I mean, it's true that perhaps the software should have been better at
rejecting the saturated data, but if an IMU persists in a saturated state for
1 second, that's a whole lot of time during which the software just cannot
know what is happening.

Granted, if it predicts an impossible configuration as a result, perhaps some
different action should have been taken, like waiting for higher confidence
and a more close-to-expected attitude estimate, but by then perhaps it would
be too late. I mean, the thing is falling from the sky; perhaps the "safe"
thing was to deploy anyway (even if it didn't work). I just don't know.

------
verytrivial
So, spend more money on the sensor simulator's chaos monkey. Having worked
on ESA-commissioned ground systems emulation, I know this is easier said than
done. Indeed, getting _anything_ done is hard.

~~~
junke
> Indeed getting anything done is hard.

Why? Too much red tape?

~~~
verytrivial
Basically. Imagine how NASA splits up contracts for a large project, then put
each of the companies in a different country.

------
UhUhUhUh
I don't understand why/how a feed of IMU data (gyro + probably
accelerometers) could override a Doppler altimeter that "functioned
properly"... Any aircraft that used inertial data to determine altitude would
be in very deep trouble on Earth as well. I don't get it.

------
blondie9x
To prevent an issue when the IMU is saturated, couldn't they just add an AND
clause to the parachute-release conditional? Something like: if altitude > 0
AND time since entry > x, then release the parachute. The entry timer would
be triggered once the temperature reaches a certain threshold, or another
sensor detects entry.
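
Roughly (a sketch; every name and threshold here is invented):

    # Guarded parachute release: require both a positive altitude and a
    # minimum elapsed descent time before releasing.
    def should_release_parachute(altitude_m, seconds_since_entry,
                                 min_descent_time_s=180.0):
        return altitude_m > 0 and seconds_since_entry > min_descent_time_s

    print(should_release_parachute(-2000.0, 90.0))  # False: both checks fail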

~~~
hatsunearu
Just by inspection, that looks like it will have a ton of unwanted side
effects... How do you define "entry"? What if the triggering of the "entry"
timer fails?

~~~
blondie9x
You can have multiple sensors start the timer.

------
krenoten
This feels like the kind of bug I'd catch early in implementation using
quickcheck x_x Is property-based testing used in the development of these
kinds of systems?
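
For illustration, a toy property in Hypothesis (Python's quickcheck
analogue; the estimator under test is a stand-in, not real flight code):

    # Property: whatever the IMU reports, the estimated altitude during
    # descent must never go negative.
    from hypothesis import given, strategies as st

    def estimate_altitude(imu_pitch_rate, radar_altitude):
        return max(0.0, radar_altitude)  # stand-in for the real estimator

    @given(st.floats(allow_nan=False, allow_infinity=False),
           st.floats(min_value=0.0, max_value=130_000.0))
    def test_altitude_never_negative(imu_pitch_rate, radar_altitude):
        assert estimate_altitude(imu_pitch_rate, radar_altitude) >= 0.0

    test_altitude_never_negative()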

~~~
sitkack
That would totally work if you had a super-high-resolution full-system
simulation: hardware, software, and the mechanical and electrical
interconnections would all need to be simulated for issues like this to be
found automatically.

------
SeanDav
Of course it is easy to be wise after the fact, but surely this is the sort
of thing that should have been exhaustively tested (what happens if one
sensor malfunctions and sends incorrect data)?

Massive kudos for being public about the issue. It is not easy to talk about
mistakes, especially those that appear silly in hindsight.

_EDIT_: Interesting to see that I am getting downvoted multiple times on
this. I don't really care about a few downvotes but would be interested to
find out why.

------
msravi
It isn't too clear from the description, but isn't _non-saturation_ of the
inertial measurement responsible, rather than saturation? I'd expect that
some computation did _not_ saturate its output, and the result overflowed
(and hence became negative), which in turn was fed into the altitude
estimate.
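
E.g., in two's complement, a large positive value squeezed into a 16-bit
signed integer comes out negative:

    # Wraparound illustration (not the actual flight computation).
    import ctypes
    print(ctypes.c_int16(40000).value)  # -> -25536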

------
SNSE
Why is rotation data being used to calculate altitude?

And why hasn't anybody asked that question yet? It was the first question
that came to my mind.

It's a shame such an expensive piece of equipment and years of work and
waiting around were obliterated by a bug that should have been caught.

------
xer
How could there not be a test case against negative altitude during descent?

------
cleeus
negative altitude ... so integer overflow it was ...

------
mirekrusin
if (altitude < 0) { return ERR_INVALID_ALTITUDE; }

~~~
dingaling
Aircraft departing my local airfield commence at -4 metres according to GPS as
it is actually below sea level.

~~~
manarth
And the intended landing site is at an altitude of -250m referenced to the
Martian datum.

------
tie_
First integer overflow to (literally) hit Mars!

------
kashkhan
Maybe next time have a better system design, so that one instrument/sensor
error/failure/loss doesn't lead to total loss of the unit.

~~~
w_t_payne
Maybe rather than a design issue, this can be seen as a development-process
and engineering-management issue.

I expect that a top-level system model should reduce the risk of this sort of
error.

Of course, it is hard to justify the cost of developing such models without
being able to amortize that cost over multiple products / missions, so
product-line-oriented engineering management and strong leadership from the
sponsor have a big role to play as well.

~~~
kashkhan
Errors in INS are common. Saturation is common. What isn't common is a bad
design that isn't fail-safe or fail-operational for a single failure.

