

The software errors that made NATS close UK airspace 2014-12-12 [pdf] - parados
http://www.caa.co.uk/docs/2942/v3%200%20Interim%20Report%20-%20NATS%20System%20Failure%2012%20December%202014.pdf

======
gerjomarty
Interesting reading. The key points I pulled out on how the incident actually
happened were:

\- They had two links from their System Flight Server, one for redundancy if
one goes down. Both went down at the same time, apparently unprecedented for
them.

\- There was a system limit for a shared system resource (Atomic Functions)
that was defined twice in different systems, but with different magic numbers.
The problem wasn't spotted earlier because neither system had come close to
the limit until a recent change pushed one of them there for the first time.
(More military controller functions were amalgamated into NATS the previous
month.)

\- There was a UX problem where the "Select sectors" button, which is used
often, is placed directly beside the "soft Sign off" button, which is rarely
used but was well known to be pressed by accident relatively often. It was
pressed at the time of the incident, putting the system into an illegal state
and hence triggering an automatic shutdown.

On their own, you could argue none of these is a showstopper, but triggered
together they cause things like this.
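
To illustrate the second point, here is a minimal sketch of how one limit defined in two places can drift apart. The constant names, function names, and numbers are all hypothetical and are not taken from the report; the only thing carried over from it is the shape of the problem: two components that each hard-code "the" limit with different values.

```python
# Hypothetical values: the real limits are not quoted in this thread.
CONFIG_MAX_ATOMIC_FUNCTIONS = 200   # limit enforced where functions are configured
RUNTIME_MAX_ATOMIC_FUNCTIONS = 150  # limit assumed by a second component


class LimitExceeded(Exception):
    """Raised when the second component sees more functions than it allows."""


def configure(count):
    """Accept a configuration if it is within the first component's limit."""
    if count > CONFIG_MAX_ATOMIC_FUNCTIONS:
        raise ValueError("too many atomic functions configured")
    return count


def process(count):
    """Process an already accepted configuration in the second component."""
    if count > RUNTIME_MAX_ATOMIC_FUNCTIONS:
        raise LimitExceeded(
            f"{count} exceeds runtime limit {RUNTIME_MAX_ATOMIC_FUNCTIONS}"
        )
    return "ok"
```

With these made-up numbers, any count up to 150 passes both checks, so the mismatch stays invisible. A count of 160 is accepted by `configure` and then rejected by `process`, which is the kind of latent fault that only shows up once usage grows into the gap between the two constants. Defining the limit once and importing it in both places would remove the gap.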

~~~
mseebach
> They had two links from their System Flight Server, one for redundancy if
> one goes down. Both went down at the same time, apparently unprecedented for
> them.

What I understood here, is that they were redundant systems, but running the
same software, with the same bug present, so both went down. Redundant
hardware can only protect you against hardware failures.
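
A rough sketch of that point, with everything in it (the `Replica` class, the `call_with_failover` helper, the threshold of 150) invented for illustration rather than taken from the report: failover helps when one copy is down, but not when every copy runs the same code and hits the same bug on the same input.

```python
class Replica:
    """Hypothetical server copy; every replica runs identical logic."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def handle(self, request):
        if not self.up:
            # Simulated hardware or link failure on this copy only.
            raise ConnectionError(f"{self.name} is down")
        if request > 150:
            # The same latent software bug exists in every copy.
            raise RuntimeError(f"{self.name}: internal limit exceeded")
        return f"{self.name} handled {request}"


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful result."""
    failures = []
    for replica in replicas:
        try:
            return replica.handle(request)
        except (ConnectionError, RuntimeError) as exc:
            failures.append(str(exc))
    raise RuntimeError("all replicas failed: " + "; ".join(failures))
```

If the primary is constructed with `up=False`, a request of 100 is served by the standby, so the redundancy does its job. If both are up and the request is 151, the primary fails, the standby fails in exactly the same way, and the caller gets the "all replicas failed" error.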

~~~
ibmthrowaway271
> What I understood here, is that they were redundant systems, but running the
> same software, with the same bug present, so both went down.

a.k.a. "flailover"

