

Software bug fingered as cause of Aussie A330 plunge - mrb
http://www.theregister.co.uk/2011/12/20/bug_cause_aussie_a330_plunge/

======
xxbondsxx
"suggested it may be down to a high-energy atmospheric particle striking one
of the integrated circuits within the unit."

Radiation shielding aside, how in God's name would you design a system to
survive those kinds of errors? If some circuit deep down in the core logic of
the black box just suddenly spiked a random voltage, it would be really hard
to predict (and correct) the damage. Hell, that voltage spike might even kill
the actual error-handling code!

Redundancy seems to be the best way to go (with three being the optimal
number), but I guess there's still always a chance of things going wrong. I'm
glad everyone survived.

~~~
virtuabhi
Instead of plain redundancy, error-correcting codes (e.g. Hamming codes)
should be able to detect and correct errors.

<http://en.wikipedia.org/wiki/Error-correcting_code#List_of_error-correcting_codes>
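For intuition, here's a minimal Hamming(7,4) sketch in Python: 4 data bits,
3 parity bits, and any single flipped bit can be located and corrected. (A toy,
obviously nothing like avionics-grade ECC.)

```python
# Hamming(7,4): parity bits sit at positions 1, 2, 4 of the 7-bit codeword.

def encode(d):
    """d: list of 4 data bits [d1, d2, d3, d4] -> 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    """c: 7-bit codeword with at most one flipped bit -> the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1  # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
code = encode(word)
code[5] ^= 1               # simulate a particle strike flipping one bit
assert decode(code) == word
```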

~~~
ars
What happens when your error correcting circuit gets hit with such a particle?

Or worse - the CRC generating circuit does, generating a value which causes
you to "correct" things to an incorrect value.

~~~
bdonlan
If your CRC generator gets damaged, then it will produce values that are
inconsistent with the CRC checker, which is in another piece of equipment. At
which point your component gets kicked out of the system entirely, and all is
well. Note also that CRC is an error _detection_ code, and cannot _correct_
detected errors.
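To illustrate the detection-only point, a quick sketch using Python's stdlib
CRC-32: the checker can tell the frame is bad, but nothing in the mismatch
says which bit to flip back.

```python
# CRC detects corruption but carries no information about where it happened,
# so the checker can only reject the frame, not repair it.
import zlib

frame = b"altitude=37000"
crc = zlib.crc32(frame)  # sender attaches this to the frame

corrupted = b"altitude=37900"             # one byte flipped in transit
assert zlib.crc32(corrupted) != crc       # detected: checker discards the frame
# There is no way to recover b"altitude=37000" from the mismatch alone.
```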

------
taitems
"The problem was fixed by turning the unit off and then on again."

Good to see IT support was immediately on deck with a fix.

~~~
yason
I can't help but be reminded of this:

 _A novice was trying to fix a broken Lisp machine by turning the power off
and on. Knight, seeing what the student was doing, spoke sternly: "You cannot
fix a machine by just power-cycling it with no understanding of what is going
wrong."

Knight turned the machine off and on. The machine worked._

------
brisance
So the flight control systems acted on bad data from ONE of 3 ADIRUs? So
what's the point of having the other 2 then?

~~~
bdonlan
The article summarizes the report a bit badly. With most ADIRU inputs, each
flight computer independently samples all three ADIRUs, and takes the median
value. It also looks for outliers - if an ADIRU input falls far out of line
with the others, the flight computer signals a fault, and disables that input.
This can also be done manually.
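Roughly, the median-plus-outlier scheme looks like this (the threshold and
names are invented for illustration, not Airbus's actual values):

```python
# Median voting across three ADIRUs with outlier exclusion (illustrative only).
from statistics import median

OUTLIER_THRESHOLD = 50.0  # invented value

def vote(samples):
    """samples: dict adiru_id -> reading. Returns (voted_value, faulted_ids)."""
    m = median(samples.values())
    # Flag any unit that falls far out of line with the median...
    faulted = {a for a, v in samples.items() if abs(v - m) > OUTLIER_THRESHOLD}
    # ...and vote only among the remaining inputs.
    good = [v for a, v in samples.items() if a not in faulted]
    return median(good), faulted

value, faulted = vote({"ADIRU1": 37012.0, "ADIRU2": 37010.0, "ADIRU3": 5000.0})
assert faulted == {"ADIRU3"}   # the outlier is flagged and excluded
assert value == 37011.0
```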

What actually happened is one (of three) ADIRUs started mixing up its outputs
- it intermittently labelled altitude data as angle-of-attack data. This
happened after internal checks for data consistency, and before CRC generation
- exactly how this occurred is a mystery, but it happened without the ADIRU
detecting an internal fault.

Now, AOA data is measured by multiple independent sensors, and each ADIRU has
its own sensor, on different sides of the plane. Because of this, under normal
circumstances, AOA data can be somewhat different between the ADIRUs - much
more so than the other inputs. Moreover, two sensors are on one side of the
plane, and one on the other - making the median method weight one side far
more than the other. As such, the flight computer software normally averages
the AOA data from all three ADIRUs, but has an error checking system to detect
and exclude erroneous data. This system detects when an outlier occurs, after
which point it enters a 1.2-second cooldown, in which it uses last-known-good
data. 1 second into the cooldown, it rechecks its inputs to see if the outlier
remains (and if so kicks it out of the system), and at the 1.2 second mark it
resumes normal operation. This error-checking algorithm is supposedly meant to
deal with the situation where two ADIRUs are both temporarily bad due to some
turbulence on one side of the plane - a fault scenario identified during early
testing of the A330.

The key thing is, at that 1.2 second mark, it assumes any input that passed
the 1-second-mark test is valid. In this event, ADIRU 1's AOA input spiked,
triggering the 1.2-second cooldown. It then returned to a normal value for the
1-second-mark test, and then spiked again - at the 1.2 second mark, this
invalid data was sampled, the average of all three ADIRUs taken, and a bogus
value resulted. Once the spike resolved again, rate-of-change limiting
prevented the value from returning to its true value immediately.
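A toy timeline of that window logic makes the hole concrete. The 1.2-second
and 1-second marks are from the description above; the values and threshold
are made up, and the 1-second persistent-outlier recheck is simplified away:

```python
# Toy model of the AOA cooldown window: a spike triggers a 1.2 s hold on
# last-known-good data, but any input present at the 1.2 s mark is trusted.

SPIKE, NORMAL = 50.0, 2.0
LIMIT = 30.0  # invented outlier threshold

def run(timeline):
    """timeline: list of (t_seconds, aoa_reading). Returns value used at each step."""
    used = []
    last_good = NORMAL
    cooldown_start = None
    for t, aoa in timeline:
        if cooldown_start is None:
            if abs(aoa - last_good) > LIMIT:
                cooldown_start = t      # outlier: enter cooldown, hold last good
                used.append(last_good)
            else:
                last_good = aoa
                used.append(aoa)
        elif t - cooldown_start < 1.2:
            used.append(last_good)      # inside the 1.2 s memorization period
            # (the 1 s recheck that kicks out a *persistent* outlier is omitted;
            # in this scenario the input looks normal at the 1 s mark anyway)
        else:
            used.append(aoa)            # 1.2 s mark: input assumed valid again
            last_good = aoa
            cooldown_start = None
    return used

# Spike at t=0, back to normal for the 1 s recheck, spiking again at 1.2 s:
timeline = [(0.0, SPIKE), (0.5, SPIKE), (1.0, NORMAL), (1.2, SPIKE)]
assert run(timeline)[-1] == SPIKE  # the bogus value is accepted at the 1.2 s mark
```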

Note that there's also an independent flight computer monitoring system for
each flight computer that only listens to _one_ of the three ADIRUs, cross-
checks the control _outputs_, and has the authority to disable the specific
flight computer it's attached to if there is a discrepancy for too long
(450ms). This flight control computer monitoring system _did_ work properly,
and disabled the flight computers in the end, switching over to manual pitch
control via a more primitive, secondary flight control system. However, it
took two faults (ie, pitch down events) to do so - one to take out each of the
primary flight computers with pitch control authority.
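The monitor's role can be sketched like so. The 450 ms figure is from the
description above; the tolerance, names, and structure are invented:

```python
# Monitor channel: cross-check the computer's command outputs against an
# independently computed value, and cut the computer out if they disagree
# for longer than 450 ms.

DISAGREE_LIMIT = 0.450  # seconds
TOLERANCE = 0.5         # invented

def monitor(samples):
    """samples: list of (t_seconds, command, expected). True if computer disabled."""
    disagree_since = None
    for t, cmd, expected in samples:
        if abs(cmd - expected) > TOLERANCE:
            if disagree_since is None:
                disagree_since = t          # start the discrepancy timer
            elif t - disagree_since >= DISAGREE_LIMIT:
                return True                 # persisted too long: disable it
        else:
            disagree_since = None           # agreement resets the timer
    return False

# A sustained discrepancy trips the monitor; a transient one does not.
assert monitor([(0.0, 5.0, 0.1), (0.2, 5.0, 0.1), (0.5, 5.0, 0.1)]) is True
assert monitor([(0.0, 5.0, 0.1), (0.2, 0.1, 0.1)]) is False
```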

In any case, a new error detection algorithm that avoids this issue is
currently being distributed as a software update, and in the meantime there
are safety bulletins instructing pilots to disable the affected ADIRU manually
at the first sign of trouble.

~~~
brisance
Thank you, that was a good summary.

From the incident report (PDF) it seems the system chooses either the median
or arithmetic mean AOA values. Could the incident have been averted with more
robust filtering algorithms?

~~~
bdonlan
Median suffers from the very real (ie, _seen in a test flight_) problem where
one side of the plane can give bad data to two ADIRUs, so that's out.
Supposedly there's a new system being deployed via a software update (see
7.1.1) that does more detailed monitoring of each channel individually -
looking for oscillations, etc. They also got rid of this 1.2-second monitoring
period. The report doesn't go into much detail about this though.

Keep in mind, we're not talking about analog noise here. We're talking about
one type of data being labelled as another - but only part of the time. Analog
filtering techniques might not be helpful, depending on how often it switches.

------
jaydub
What about <http://en.wikipedia.org/wiki/Byzantine_fault_tolerance>

