Why the Airline Industry Could Keep Suffering System Failures Like Delta's (opb.org)
32 points by Ocerge on Aug 10, 2016 | 45 comments

“Because they have to worry so much about safety and security, they are constrained in ways that other businesses aren’t,” he says. “Delta can’t just host its systems on Joe Blow’s cloud server somewhere else in the way that another business might be able to do.”

First, AWS, Azure and GCE are not just "Joe Blow's cloud server", they are multi-billion dollar companies, and they all can provide hosting environments compliant with a multitude of security programs including SOC 1, 2, 3, PCI DSS, HIPAA, etc.

If a hospital can store patient records on AWS, why can't Delta store my flight records there? If the government is worried that a public cloud leaves them open to terrorist attacks, then they can sponsor them to run on Gov Cloud for better isolation.

But more importantly, moving a complex, high-volume legacy system to "the cloud" is no panacea. Whatever dependencies or lack of redundancy caused this failure could cause the exact same failure mode in "the cloud" (plus it can open them up to all-new failure modes).

The time-critical parts of a hospital's workflow are less IT-dependent than the time-critical parts of an airline's workflow. You need a computer to handle admissions in hospitals, but you can get by with the actual medicine with the computers offline (there's a LOT of paper replication).

Besides, if $CLOUD_PROVIDER has a couple of hours of outage and the patient records aren't available, not that much backs up. Not that many patients are affected by, say, a 2-hour outage. But that same length of outage can affect thousands of passengers across a dozen airports, and cause knock-on effects for days.

I wasn't comparing uptime needs of a hospital versus an airline, but their security needs.

Every major cloud provider offers multiple independent regions, and I haven't heard of any suffering from a multi-region outage.

If an airline's application can't tolerate an outage, then they better not host it in a single region, whether they host it themselves or host it in the cloud.

Delta is on day 2 of their outage, Southwest's was 12+ hours.

Google had a major outage in April -- it lasted 18 minutes. In June, AWS lost a single Availability Zone (out of 3) in Sydney for about 12 hours.

> I haven't heard of any suffering from a multi-region outage.

It has happened in the past [1].

[1] http://www.crn.com/news/cloud/300074866/microsoft-explains-w...

>>I wasn't comparing uptime needs of a hospital versus an airline, but their security needs.

Uptime needs are security needs too, for various reasons. E.g., a lack of uptime is just another way to end up with a denial of service.

edit: a typo

Hmm. Doesn't that just factor out to saying "it fails a lot and looks like it's going to keep doing it and maybe even more in the future but it's too important to move so meh?"

Can't stop to sharpen the axe. Too much wood to cut.

I am confused: they talk about terrorists & criticality etc. And they have only 1 site holding their entire system with no backup? A couple of fibers to cut and those baddies cripple your airline?

I am pretty sure they would be in a better situation by being hosted, redundant and secure at a cloud provider...

Disclosure: I work on Google Cloud.

Google Cloud has had at least two major outages just this year. April 11, 2016 [1], and Feb. 18-19, 2016 [2].

[1] http://www.serverpronto.com/spu/2016/05/google-cloud-outage-... [2] http://www.informationweek.com/cloud/infrastructure-as-a-ser...

So build redundancy across all three major cloud providers. If AWS, Azure, and Google Cloud all go down at the same time, odds are there are bigger problems in the world.

Of the mushroom cloud variety.
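The multi-provider argument above is basically arithmetic, so here is a rough sketch of it. It assumes the providers fail independently and that the application can actually fail over between them, and the 99.9% figure is a made-up illustrative number, not anyone's published SLA. The function name is hypothetical too.

```python
from math import prod


def combined_availability(availabilities):
    """Availability of a system that is up whenever at least one provider is up.

    Assumes provider failures are statistically independent, which is the
    big "if" in the whole argument.
    """
    # The system is down only when every provider is down at the same time,
    # so multiply the individual unavailabilities together.
    all_down = prod(1.0 - a for a in availabilities)
    return 1.0 - all_down


# Illustrative assumption: three providers, each available 99.9% of the time.
three_providers = combined_availability([0.999, 0.999, 0.999])
print(f"combined availability: {three_providers:.12f}")
```

Under those assumptions three "three nines" providers multiply out to roughly nine nines. Correlated failures (shared DNS, a bad deploy of your own code, a failover path nobody tested) would break the independence assumption, so treat the number as an upper bound at best.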

>>I am confused: they talk about terrorists & criticality etc. And they have only 1 site holding their entire system with no backup?

They did have backup systems. Except those were located at the same place as the production systems and the fire destroyed them too.

This keeps happening because:

1. Lots of companies don't yet realize that if you're not a technology company you may as well close up shop, 'cause eventually you will lose out to the competition or have a reputation-destroying event.

2. They don't pay their employees enough.

3. They don't respect their employees enough.

This keeps happening because many companies treat DR (like security) as a risk management exercise.

Until this stops, events like this will keep happening.

Source: I deal with this pretty much every day. I can't go into much more detail, but having seen the innards of multiple Fortune 500 companies, it's frankly shocking that things work sometimes. There's stuff that's the equivalent of things being duct-taped together, etc.

> But the automated transfer switch seems to have failed

Switch? N+1 baby. Even more on critical systems. If that switch breaking can cause you to cancel 1500 flights there's a clear case to replicate the switchboard. Surely? Also, manual transfer switches are a thing too, just make sure you break-before-make.

The manual transfer switches I have seen are mechanically interlocked or designed so that the sources cannot be paralleled. In fact, automatic transfer switches that are "open transition" have a mechanical interlock as well.

Any situation in which it is up to a human to make sure they break-before-make is a grenade waiting to go off. The only thing I've seen close to that is old power plants where the generators are manually synchronized, but usually there is still a sync check so the sources can't be connected more than X degrees out of phase, where X is 3 for large hydro units, 10 for standard diesels, and can be as wide as 20 or 30 for diesels at hospitals where they need to get the thing online ASAP.
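For anyone who hasn't seen one, the sync check described above boils down to a phase-window comparison. Here is a minimal sketch using the windows quoted in the comment. The function names and the dictionary keys are hypothetical labels, and a real sync-check relay typically also compares voltage magnitude and frequency difference, which this leaves out.

```python
# Permitted phase windows in degrees, taken from the comment above.
# The keys are made-up labels for illustration only.
SYNC_WINDOW_DEG = {
    "large_hydro": 3.0,
    "standard_diesel": 10.0,
    "hospital_diesel": 30.0,  # the comment says "as wide as 20 or 30"
}


def phase_difference_deg(angle_a_deg: float, angle_b_deg: float) -> float:
    """Smallest absolute difference between two phase angles, in 0..180 degrees."""
    diff = (angle_a_deg - angle_b_deg) % 360.0
    # 358 degrees apart is really only 2 degrees apart going the other way.
    return min(diff, 360.0 - diff)


def sync_check_permits_close(unit_type: str, gen_angle_deg: float, bus_angle_deg: float) -> bool:
    """Return True only if the generator is within its permitted phase window of the bus."""
    window = SYNC_WINDOW_DEG[unit_type]
    return phase_difference_deg(gen_angle_deg, bus_angle_deg) <= window


# A hydro unit 5 degrees out is blocked; a standard diesel 5 degrees out is allowed.
print(sync_check_permits_close("large_hydro", 5.0, 0.0))
print(sync_check_permits_close("standard_diesel", 5.0, 0.0))
```

The point is that the human only requests the close. The check decides whether it is allowed to happen.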

It's not as simple as N+1. Who watches the watchers? With a lot of these "redundant" systems you eventually have some SPOF which "manages" the redundant components.

For example, many years ago I used a network service provided by a quite diligent company. They had installed a large Cisco switch with redundant "supervisors" which are...something important. So back then the way this worked was that the "active" supervisor would handle everything, and the "standby" one would ping the active one and if it decided the active one had died, it would make itself active.

This process takes about a minute. You can imagine what might happen if each supervisor starts to believe that the other is down.

About ten years ago, Cisco released some new fancier (supposedly better) way to handle this. It is not enabled by default [1, don't bother reading].

[1] http://www.cisco.com/c/en/us/products/collateral/switches/ca...
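The failure mode described above (each supervisor deciding the other is dead) is the classic split-brain problem. Here is a toy model of it, not Cisco's actual protocol: the class, the names, and the three-missed-pings threshold are all assumptions made up for illustration. The only thing it models is a heartbeat link that fails while both supervisors are still alive.

```python
from dataclasses import dataclass


@dataclass
class Supervisor:
    name: str
    active: bool
    missed_pings: int = 0

    def tick(self, peer_reachable: bool, threshold: int = 3) -> None:
        """One heartbeat interval: count missed pings, self-promote past the threshold."""
        if peer_reachable:
            self.missed_pings = 0
            return
        self.missed_pings += 1
        if self.missed_pings >= threshold:
            # Naive rule: silence from the peer is treated as proof it is dead.
            self.active = True


def simulate(link_up_schedule):
    """Run both supervisors through a sequence of heartbeat intervals."""
    a = Supervisor("sup-A", active=True)
    b = Supervisor("sup-B", active=False)
    for link_up in link_up_schedule:
        a.tick(link_up)
        b.tick(link_up)
    return a, b


# Healthy link: A stays active, B stays on standby.
healthy_a, healthy_b = simulate([True] * 5)

# Heartbeat link drops but both boxes are still alive: B promotes itself
# while A is still active, so now there are two "active" supervisors.
split_a, split_b = simulate([True, False, False, False])
print(split_a.active and split_b.active)
```

Avoiding this generally needs more than a ping between the two peers, for example a third-party witness or quorum, which is the "who watches the watchers" problem again.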

There's ways to deal with stuff like that in the commercial sector and academia going back a long time. It is more complex than a simple ping. It's also common to have the protocols analyzed or even formally verified to avoid issues like the one you mentioned. Many systems will also have things like separate power supplies, a certain distance apart, optocouplers, whatever, to reduce hardware glitches. There's also both noise-immune (like 1802) and clockless (like Amulet) architectures to knock out those problems. Systems with a subset of these have been known to go years to over a decade without downtime.

The usual culprit in stuff like you described is management. They either force knowledgeable staff to use less to save money in the short term at long-term expense, or don't even hire staff good enough for high-assurance systems. Those do cost money given they're uncommon to rare. Although, one company (Ericsson) did go all out one time making their network nine 9's just to show it could be done. Till management and customers looked at the cost. :)
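To put "nine 9's" in perspective, here is the downtime budget worked out. This is plain arithmetic on the availability figure, assuming a 365.25-day year. The function name is just for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def allowed_downtime_seconds_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of `nines` nines (e.g. 3 -> 99.9%)."""
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


for n in (3, 5, 9):
    seconds = allowed_downtime_seconds_per_year(n)
    print(f"{n} nines: {seconds:.4f} seconds of downtime per year")
```

Three nines allows a bit under nine hours a year and five nines a little over five minutes. Nine nines works out to roughly 32 milliseconds a year, which gives a sense of why the cost scares people off.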

EDIT to add textbook case study on management angle:


This is why core ISP POPs deploy routers in twin pairs. Each chassis has hotswap fans, n+1 power, dual redundant supervisors and hotswap line cards, but it is entirely possible for an entire chassis to go belly up.

A single router chassis is not redundancy at all.

I've always been (probably overly) paranoid about redundant switches. Seemed like a good way to introduce bizarre failure modes to prevent a fairly rare but easy to diagnose switch death.

It's not just complete switch death you have to worry about. Ports can die, cables can go bad, etc. Rare, but at the scale Delta's at, it'll happen frequently enough to plan for it.

They're not talking about Cisco/networking switches. Transfer switches are part of the electrical systems in the data center. https://en.wikipedia.org/wiki/Transfer_switch

Seems nuts. Spreading locations would allow more and better monitoring to be sure locations weren't compromised, and the ability to drop a location if it was.

Extra layers of protection aren't merely "adding to the attack surface" (although that's possible, antivirus products becoming the vector, for example.)

It would also, at minimum, double their costs. They're in a cutthroat industry - doubling costs would likely mean going out of business.

Why assume computer costs are substantial relative to the overall costs of maintaining and flying airplanes?

Why do people assume that this stuff works like a tech job?

The kind of national and global operations coordination Delta does from their center in Atlanta is not a "oh, network's down, work from the coffee shop" kind of job.

Delta + regional affiliates operate nearly 8,000 flights per day on six continents through thirteen hubs, involving around 80,000 employees and ~820 mainline aircraft plus I don't even know how many smaller regional jets, plus codesharing on SkyTeam alliance partners and 13 other non-alliance-partner carriers.

Spec out an ops center capable of ingesting and organizing all the data on that, presenting it for human use, and then communicating instructions back out to all those people, planes, airports, maintenance bases, etc. And then ask yourself if that's something you just build a couple extra copies of (along with on-call staff to come in and pick up the whole thing at a moment's notice, since if the main one goes down it's not like you'll be flying the people to the backup locations).

The answer, of course, is that it's not something you build extras of. You build it once, and build it as reliable as you can, because building spares just is not feasible; the only people who maintain extras of this kind of infrastructure are governments who worry about getting into nuclear wars.

I get that there's a lot of complexity involved in running an airline, but none of those numbers are very big. In fact, they're pretty small. What specific tech challenges do you see that require building your own datacenter?

My guess is that the reason it's set up this way is that it was built a long time ago in its current form and it's not worth the cost to rebuild / not worth the regulatory hurdles of moving to the cloud.

> What specific tech challenges do you see that require building your own datacenter?

The problem is you're thinking of it as a "build more datacenters" problem.

It is not a problem of needing datacenters. It is not a problem of needing a bunch more computers. It is a problem of needing a combination of data, access to the data, communications and people. That is the thing that is complex to set up and complex and expensive to replicate and not worth the cost.

It's not clear from your comment: is the ops center people doing manual ingestion or are you talking about hardware?

Delta certainly should not be running their own datacenters, and it's likely that they don't really do so to begin with. They (like every other airline) are running the same custom software and hardware that is creaking along, which is why every interaction at an airport runs sooooo slowly.

Uber has way more QPS and route planning complexity, and they still have multiple datacenters for redundancy.

* Uber's drivers can operate largely independently of each other. A local issue does not cascade to affect larger areas. An airline's sub-components largely cannot operate independently; issues on a local level can easily cascade upward to become regional issues, and then regional issues can cascade upward to become national or even global ones, as passengers and deadheading crew members misconnect, late flights become later en route to following destinations, etc.

* Uber does not submit route plans to any authority, or deal with issues of transit across controlled borders. Airlines must submit valid flight plans for approval to aviation authorities in the appropriate countries, and routinely must handle transit of aircraft, people and cargo across controlled international borders, planning all aspects in advance.

* Uber simply makes use of any available roads and pickup/drop-off zones. Airlines must pay for and sometimes obtain regulatory approval for takeoff and landing slots, use of gates at airports, use of facilities at airports, etc., keeping track of what resources are available and which will be in use at any given time.

* Uber famously tells regulators to, effectively, go fuck themselves. Airlines which attempt this are shut down. Immense effort must be dedicated to compliance with applicable safety and commerce regulations at all times.

* Uber does not own or maintain the cars its passengers use. Airlines own or lease their aircraft and are responsible for their maintenance, on legally-mandated schedules. Airliner maintenance is a wonder in and of itself.

And on and on and on it goes.

Again: this ain't a Silicon Valley tech startup. Thinking in terms of even the largest SV startups is going to lead you badly astray.

Uber didn't exist before 2009. They got a bunch of people in a room and wrote some code, and now they're successful. Greenfield development at its best.

Delta started off flying planes, and has been around longer than computers have existed. Like many established businesses, they have a mix of new and legacy technologies, and can't just copy everything up to AWS. They'd have to dedicate years of time and effort over many employees to rewrite some of their systems, while still maintaining their existing ones in parallel until it's safe to migrate off. All the while they have an actual business to run, with real revenues, real expenditures, and IT is just an (important, but costly) piece in the big picture.

These companies are neither incompetent nor malicious. They just have to find the money and time to make the enhancements they'd like to their systems, and change doesn't happen overnight. Likely, with this awkward generator fire, they'll try to hasten their efforts.

>> Delta started off flying planes,

Only statement in your comment that I disagree with: https://en.wikipedia.org/wiki/Delta_Air_Lines#History

Being unable to fly their airplanes will likely mean going out of business as well.

It certainly didn't mean that this time.

> Spreading locations would allow more and better monitoring to be sure locations weren't compromised, and the ability to drop a location if it was.

If Delta is the mainframe-oriented business that I suspect it is, "spreading locations" means turning their software stack from the monolith of monoliths into a distributed system.

It's a damn fine idea, and likely to take them decades of effort across dozens of teams, with hilariously expensive bugs along the way.

They were once an IBM outsourcing customer, so Bob has a theory: http://www.cringely.com/2016/08/08/outsourced-probably-hurt-... (spoiler: downtime was likely contributed to by outsourcing and cost cutting. Shocker, I know!)

Some of the comments on that page were better than the original article. Much detail.

My hunch is that the trend over the next decade or two is going to be an increasingly strong emphasis on software correctness. As software becomes more and more intertwined in every aspect of our lives, the stakes just keep getting higher and higher. And as "cyberspace" emerges more and more as a primary theater for international political struggles, vulnerabilities are going to be that much more expensive and dangerous.

I don't think there are any easy answers here, but I expect to see a move over time to languages and tools that leave less room for error than the tools we commonly use now. C/C++ and dynamic languages like JavaScript, Python and Ruby will be sidelined in favor of languages that provide more compile-time guarantees of their behavior.

I see the opposite happening. Remember when it was a big deal that a company was hacked and leaked user data? Or when a bunch of credit card numbers were stolen from a retailer? It feels like that's happened so often by now that it's not even surprising anymore. I expect that any retailer I use has been, is, or will be 0wn3d at some point.

Moreover, the tech industry is solidifying. Larger companies are becoming more entrenched, and it seems harder for new companies to start clean with a fresh stack. This is the most tenuous part of my argument -- maybe it's not true?

If the above is true, then I expect software quality to worsen over time. Large companies will not be rewriting huge parts of their stacks for any reason. The legacy systems will continue to balloon as they are inflated with new features, and morph as they are merged with the systems of acquired companies. Also, the original engineers deeply familiar with the systems will retire, and new maintenance programmers will have even less context on the big picture of operations.

Also, as the surface area of software increases, there will be more and more places for bugs and security problems to occur, even while those bugs affect more people and in more critical ways.

I predict that over time, we will come to regard software that routinely works as just as much of a pipe dream as software that is secure.

This is a fundamentally pessimistic perspective. I'd love to hear an opposing argument.

Except this outage might not have had anything to do with their software. Maybe it was a technical failure and their lack of redundancy made it impossible to switch to a backup system.

There's more information available now.[1] Apparently part of their system switched to backups, but not all of it. A Delta rep says "We are actually fully operational, it's just that we're not able to use that newer interface." Unclear what that means.

In an airline system, there's quite a bit of equipment at each location, and much of it is specialized. There are interfaces to baggage systems and bar code readers. There are interfaces to airport systems and incoming information from air traffic control. The aircraft themselves transmit information and need flight plan uploads. There's probably more machine to machine communication than user interfaces. They may be having troubles resynchronizing everything with the backup systems in the data center.

[1] http://www.dallasnews.com/business/airline-industry/20160809...

I've spent just enough time around systems driven by old IBM mainframes to realize that they're pretty darned good at data integrity and at uptime, but they make up for it by being an IT nightmare in most other respects.

In this case, it sounds very much like the backup servers were running the wrong version of some user interface programmed in an arcane IBM programming language from the 1970s that is built entirely around the record-oriented database typical of a machine like the AS/400.


"not able to use newer interface" likely means that the majority of the data is still in the mainframe but wrapped by a more "modern" interface. Or in IBM parlance, system of record vs. system of engagement.

More specifically "lipstick on a pig" =D

Fundamentally these are not technology companies but companies driven by financial markets and cost-saving _innovations_, and run more like the hospitality sector.

Most of them outsource their tech to the tech-outsourcing behemoths who are also run by man-month billing _innovations_

I see no hope of things improving drastically. Small incremental updates over decades till we invent teleportation... then they will die.

>Because they have to worry so much about safety and security, they are constrained in ways that other businesses aren’t...

>In other words, given a choice between more backup systems and more security, airlines are picking security.

Information security is about ensuring confidentiality, integrity and availability all at the same time.
