
Why the Airline Industry Could Keep Suffering System Failures Like Delta's - Ocerge
http://www.opb.org/news/article/npr-why-the-airline-industry-could-keep-suffering-system-failures-like-deltas/
======
Johnny555
_“Because they have to worry so much about safety and security, they are
constrained in ways that other businesses aren’t,” he says. “Delta can’t just
host its systems on Joe Blow’s cloud server somewhere else in the way that
another business might be able to do.”_

First, AWS, Azure and GCE are not just "Joe Blow's cloud server", they are
multi-billion dollar companies, and they all can provide hosting environments
compliant with a multitude of security programs including SOC 1, 2, 3, PCI
DSS, HIPAA, etc.

If a hospital can store patient records on AWS, why can't Delta store my
flight records there? If the government is worried that a public cloud leaves
them open to terrorist attacks, then they can sponsor them to run on Gov Cloud
for better isolation.

But more importantly, moving a complex, high-volume legacy system to "the
cloud" is no panacea: whatever dependencies or lack of redundancy caused
this failure could cause the exact same failure mode in "the cloud" (plus it
can open them up to all-new failure modes).

~~~
vacri
The time-critical parts of a hospital's workflow are less IT-dependent than
the time-critical parts of an airline's workflow. You need a computer to
handle admissions in hospitals, but you can get by with the actual medicine
with the computers offline (there's a LOT of paper replication).

Besides, if $CLOUD_PROVIDER has a couple of hours outage and the patient
records aren't available, not that much backs up. There's not that much in the
count of patients that's affected by, say, a 2-hour outage. But that same
length of outage can affect thousands of passengers across a dozen airports,
and cause knock-on effects for days.

~~~
Johnny555
I wasn't comparing uptime needs of a hospital versus an airline, but their
security needs.

Every major cloud provider offers multiple independent regions, and I haven't
heard of any suffering from a multi-region outage.

If an airline's application can't tolerate an outage, then they better not
host it in a single region, whether they host it themselves or host it in the
cloud.

Delta is on day 2 of their outage, Southwest's was 12+ hours.

Google had a major outage in April -- it lasted 18 minutes. In June, AWS lost
a single Availability Zone (out of 3) in Sydney for about 12 hours.

~~~
taspeotis
> I haven't heard of any suffering from a multi-region outage.

It has happened in the past [1].

[1] [http://www.crn.com/news/cloud/300074866/microsoft-explains-w...](http://www.crn.com/news/cloud/300074866/microsoft-explains-what-went-wrong-in-latest-global-azure-outage.htm)

------
gbin
I am confused: they talk about terrorists & criticality etc., and they have only
1 site holding their entire system with no backup? A couple of fibers to cut
and those baddies cripple your airline?

I am pretty sure they would be in a better situation by being hosted,
redundant and secure at a cloud provider...

Disclosure: I work on Google Cloud.

~~~
Animats
Google Cloud has had at least two major outages just this year. April 11, 2016
[1], and Feb. 18-19, 2016 [2].

[1] [http://www.serverpronto.com/spu/2016/05/google-cloud-outage-...](http://www.serverpronto.com/spu/2016/05/google-cloud-outage-spells-trouble-for-enterprise-customers/)

[2] [http://www.informationweek.com/cloud/infrastructure-as-a-ser...](http://www.informationweek.com/cloud/infrastructure-as-a-service/google-cloud-outage-virtual-networking-breakdown/d/d-id/1319178)

~~~
douche
So build redundancy across all three major cloud providers. If AWS, Azure, and
Google Cloud all go down at the same time, odds are there are bigger problems
in the world.
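
To put rough numbers on that intuition, here is a minimal sketch of the arithmetic. The 0.1% unavailability per provider is a made-up illustrative figure, not a measured one, and the function name is hypothetical. The calculation also assumes the three providers fail independently, which is a strong assumption: anything shared between them (DNS, the failover logic, the application's own bugs) would make the real odds worse.

```python
def simultaneous_outage_probability(unavailabilities):
    """Probability that every provider is down at the same moment.

    Assumes failures are statistically independent, which real-world
    outages often are not.
    """
    probability = 1.0
    for unavailability in unavailabilities:
        probability *= unavailability
    return probability


# Hypothetical figure: each provider is unavailable 0.1% of the time.
per_provider_unavailability = [0.001, 0.001, 0.001]

p_all_down = simultaneous_outage_probability(per_provider_unavailability)

# Convert the probability into expected minutes per year with all three down.
minutes_per_year = 365 * 24 * 60
expected_minutes_all_down = p_all_down * minutes_per_year

print(f"P(all three down at once): {p_all_down:.1e}")
print(f"Expected minutes per year: {expected_minutes_all_down:.7f}")
```

Under these assumptions the overlap works out to a small fraction of a second per year, which is the point being made. The catch is that the number only holds if nothing ties the three deployments together.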

~~~
Senji
Of the mushroom cloud variety.

------
siculars
This keeps happening because:

1\. Lots of companies don't yet realize that if you're not a technology
company you may as well close up shop cause eventually you will fade to
competition or have a reputation destroying event.

2\. They don't pay their employees enough.

3\. They don't respect their employees enough.

~~~
newman314
This keeps happening because many companies treat DR (like security) as a risk
management exercise.

Until this stops, events like this will keep happening.

Source: I deal with this pretty much every day. I can't go into much more
detail, but having seen the innards of multiple Fortune 500 companies, it's
frankly shocking that things work sometimes. There's stuff that's the
equivalent of things being duct-taped together etc. etc.

------
TheSpiceIsLife
_But the automated transfer switch seems to have failed_

Switch? N+1 baby. Even more on critical systems. If that switch breaking can
cause you to cancel 1500 flights there's a clear case to replicate the
switchboard. Surely? Also, manual transfer switches are a thing too, just make
sure you break-before-make.

~~~
jzwinck
It's not as simple as N+1. Who watches the watchers? With a lot of these
"redundant" systems you eventually have some SPOF which "manages" the
redundant components.

For example, many years ago I used a network service provided by a quite
diligent company. They had installed a large Cisco switch with redundant
"supervisors" which are...something important. So back then the way this
worked was that the "active" supervisor would handle everything, and the
"standby" one would ping the active one and if it decided the active one had
died, it would make itself active.

This process takes about a minute. You can imagine what might happen if each
supervisor starts to believe that the other is down.
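
A toy model of that failure mode, for anyone who wants to see it play out. This is not Cisco's actual protocol; the `Supervisor` class, the three-miss threshold, and the names `sup-a` and `sup-b` are all invented for illustration. The only thing it borrows from the description above is the rule that a standby promotes itself once it decides the active one has died.

```python
from dataclasses import dataclass


@dataclass
class Supervisor:
    name: str
    active: bool
    missed_heartbeats: int = 0

    def tick(self, peer_reachable: bool, threshold: int = 3) -> None:
        """Advance one heartbeat interval.

        A standby promotes itself after `threshold` consecutive misses.
        It cannot tell a dead peer from a dead heartbeat link.
        """
        if peer_reachable:
            self.missed_heartbeats = 0
            return
        self.missed_heartbeats += 1
        if not self.active and self.missed_heartbeats >= threshold:
            self.active = True


def simulate(link_up_per_tick):
    """Run both supervisors through a sequence of heartbeat-link states."""
    a = Supervisor("sup-a", active=True)
    b = Supervisor("sup-b", active=False)
    for link_up in link_up_per_tick:
        a.tick(link_up)
        b.tick(link_up)
    return a, b


# Both supervisors stay healthy; only the link between them fails.
a, b = simulate([True, True, False, False, False])
split_brain = a.active and b.active
print(f"{a.name} active={a.active}, {b.name} active={b.active}")
print(f"split brain: {split_brain}")
```

Neither box ever failed, yet the run ends with two active supervisors. The component that "manages" the redundancy (here, the heartbeat link) is itself the single point of failure.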

About ten years ago, Cisco released some new fancier (supposedly better) way
to handle this. It is not enabled by default [1, don't bother reading].

[1]
[http://www.cisco.com/c/en/us/products/collateral/switches/ca...](http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-6500-series-switches/prod_white_paper09186a0080088874.html)

~~~
macintux
I've always been (probably overly) paranoid about redundant switches. Seemed
like a good way to introduce bizarre failure modes to prevent a fairly rare
but easy to diagnose switch death.

~~~
Sanddancer
It's not just complete switch death you have to worry about. Ports can die,
cables can go bad, etc. Rare, but at the scale Delta's at, it'll happen
frequently enough to plan for it.

------
Nomentatus
Seems nuts. Spreading locations would allow more and better monitoring to be
sure locations weren't compromised, and the ability to drop a location if it
was.

Extra layers of protection aren't merely "adding to the attack surface"
(although that's possible, antivirus products becoming the vector, for
example.)

~~~
tw04
It would also, at minimum, double their costs. They're in a cutthroat industry
- doubling costs would likely mean going out of business.

~~~
koluft
Why assume computer costs are substantial relative to the overall costs of
maintaining and flying airplanes?

~~~
ubernostrum
Why do people assume that this stuff works like a tech job?

The kind of national and global operations coordination Delta does from their
center in Atlanta is not an "oh, network's down, work from the coffee shop"
kind of job.

Delta + regional affiliates operate nearly 8,000 flights per day on six
continents through thirteen hubs, involving around 80,000 employees and ~820
mainline aircraft plus I don't even know how many smaller regional jets, plus
codesharing on SkyTeam alliance partners and 13 other non-alliance-partner
carriers.

Spec out an ops center capable of ingesting and organizing all the data on
that, presenting it for human use, and then communicating instructions back
out to all those people, planes, airports, maintenance bases, etc. And then
ask yourself if that's something you just build a couple extra copies of
(along with on-call staff to come in and pick up the whole thing at a moment's
notice, since if the main one goes down it's not like you'll be flying the
people to the backup locations).

The answer, of course, is that it's not something you build extras of. You
build it once, and build it as reliable as you can, because building spares
just is not feasible; the only people who maintain extras of this kind of
infrastructure are governments who worry about getting into nuclear wars.

~~~
maccam94
Uber has way more QPS and route planning complexity, and they still have
multiple datacenters for redundancy.

~~~
niftich
Uber didn't exist before 2009. They got a bunch of people in a room and wrote
some code, and now they're successful. Greenfield development at its best.

Delta started off flying planes, and has been around longer than computers
existed. Like many established businesses, they have a mix of new and legacy
technologies, and can't just copy everything up to AWS. They'd have to
dedicate years of time and effort over many employees to rewrite some of their
systems, while still maintaining their existing ones in parallel until it's
safe to migrate off. All the while they have an actual business to run, with
real revenues, real expenditures, and IT is just (an important, but costly)
piece in the big picture.

These companies are neither incompetent, nor malicious. They just have to find
the money and time to get done the enhancements they'd like to their systems,
and change doesn't happen overnight. Likely, with this awkward generator fire,
they'll try to hasten their efforts.

~~~
praneshp
>> Delta started off flying planes,

Only statement in your comment that I disagree with:
[https://en.wikipedia.org/wiki/Delta_Air_Lines#History](https://en.wikipedia.org/wiki/Delta_Air_Lines#History)

------
lstamour
They were once an IBM outsourcing customer, so Bob has a theory:
[http://www.cringely.com/2016/08/08/outsourced-probably-hurt-...](http://www.cringely.com/2016/08/08/outsourced-probably-hurt-delta-airlines-power-went/)
(spoiler: downtime was likely contributed to by outsourcing and cost cutting.
Shocker, I know!)

~~~
nickpsecurity
Some of the comments on that page were better than the original article. Much
detail.

------
cageface
My hunch is that the trend over the next decade or two is going to be an
increasingly strong emphasis on software _correctness_. As software becomes
more and more intertwined in every aspect of our lives the stakes just keep
getting higher and higher. And as "cyberspace" emerges more and more as a
primary theater for international political struggles vulnerabilities are
going to be that much more expensive and dangerous.

I don't think there are any easy answers here but I expect to see a move over
time to languages and tools that leave less room for error than the tools we
commonly use now. C/C++ and dynamic languages like Javascript, Python and Ruby
will be sidelined in favor of languages that provide more compile time
guarantees of their behavior.

~~~
igor47
I see the opposite happening. Remember when it was a big deal that a company
was hacked and leaked user data? Or when a bunch of credit card numbers were
stolen from a retailer? It feels like that's happened so often by now that
it's not even surprising anymore. I expect that any retailer I use has been,
is, or will be 0wn3d at some point.

Moreover, the tech industry is solidifying. Larger companies are becoming more
entrenched, and it seems harder for new companies to start clean with a fresh
stack. This is the most tenuous part of my argument -- maybe it's not true?

If the above _is_ true, then I expect software quality to worsen over time.
Large companies will not be rewriting huge parts of their stacks for any
reason. The legacy systems will continue to balloon as they are inflated with
new features, and morph as they are merged with the systems of acquired
companies. Also, the original engineers deeply familiar with the systems will
retire, and new maintenance programmers will have even less context into the
big-picture of operations.

Also, as the surface area of software increases, there will be more and more
places for bugs and security problems to occur, even while those bugs affect
more people and in more critical ways.

I predict that over time, we will begin regarding software which routinely
works as much as a pipe dream as software which is secure.

This is a fundamentally pessimistic perspective. I'd love to hear an opposing
argument.

------
Animats
There's more information available now.[1] Apparently part of their system
switched to backups, but not all of it. A Delta rep says "We are actually
fully operational, it's just that we're not able to use that newer interface."
Unclear what that means.

In an airline system, there's quite a bit of equipment at each location, and
much of it is specialized. There are interfaces to baggage systems and bar
code readers. There are interfaces to airport systems and incoming information
from air traffic control. The aircraft themselves transmit information and
need flight plan uploads. There's probably more machine to machine
communication than user interfaces. They may be having troubles
resynchronizing everything with the backup systems in the data center.

[1] [http://www.dallasnews.com/business/airline-industry/20160809...](http://www.dallasnews.com/business/airline-industry/20160809-delta-power-outage-wasn-t-the-cause-of-its-global-computer-disruption.ece)

~~~
fennecfoxen
I've spent just enough time around systems driven by old IBM mainframes to
realize that they're pretty darned good at data integrity and at uptime, but
they make up for it by being an IT nightmare in most other respects.

In this case, it sounds very much like the backup servers were running the
wrong version of some user interface programmed in an arcane IBM programming
language from the 1970s that is built entirely around the record-oriented
database typical of a machine like the AS/400.

Hilarity!

------
ranedk
Fundamentally these are not technology companies but companies driven by
financial markets, cost saving _innovations_ and run more like the hospitality
sector.

Most of them outsource their tech to the tech-outsourcing behemoths who are
also run by man-month billing _innovations_.

I see no hope of things improving drastically. Small incremental updates over
decades till we invent teleportation... then they will die.

------
desdiv
>Because they have to worry so much about safety and security, they are
constrained in ways that other businesses aren’t...

>In other words, given a choice between more backup systems and more security,
airlines are picking security.

Information security is about ensuring confidentiality, integrity and
_availability_ all at the same time.

