United Airlines Domestic Flights Grounded for 2 Hours by Computer Outage (nbcnews.com)
163 points by perseusprime11 on Jan 23, 2017 | 83 comments

The US airline industry has a serious problem with these outages. By my count, 4 major airlines have been hit with system-wide computer-related outages in the last year (Southwest, JetBlue, Delta, United).

This article goes into some detail of why it happens: http://money.cnn.com/2016/08/08/technology/delta-airline-com.... It seems like human error and fragile computer systems are the biggest issues.

It seems quite odd to list "Human Error" as a cause. This is only a secondary cause, because if a human is able to accidentally bring down an entire system, it's the system itself that is really the problem.

You mean computers should be able to work around humans?

Are you trying to tell me I shouldn't have to treat every application like the developer is the laziest piece of shit possible and 90% of valid inputs will cause the application to crash or spit meaningless errors?

If the datacenters are really too complex for the people running them to understand, I would expect failures like this to drive the airlines towards AWS, Azure, or some similar service.

Having worked at SABRE, it's not as easy as you think, as there are a crapton of systems and people that depend on all that old stuff to keep running as is. SABRE has spent 30 years trying to modernize their mainframe-based system piece by piece. But the core reservation system is actually highly available still; it's the modern bits that wind up less stable. The problem is that travel is a highly interconnected system of many different companies in which complexity is almost impossible to avoid. There are also many systems involved that are not visible to the public, such as weight balancing, crew scheduling, etc., where even one of those failing can screw up airline travel worldwide. It's not always reservations or check-in that's broken; even something as simple as an airport system failing can have a domino effect all over the country.

It reminds me a bit of Vernor Vinge's "zones of thought" books (sci fi) in which many of our current technological dreams have failed to materialize and cascading failures of brittle automated systems cause logistic collapses that wipe out advanced civilizations.

> I would expect failures like this to drive the airlines towards AWS, Azure, or some similar service

I don't think that would help much. It's not really the core hardware or operating systems that tend to cause these types of outages.

More typically, it's the dependency chain between locations, applications, and services. And, there's more than one system that can cause a ground halt. The check-in service, the no-fly list functionality (which the govt runs), weight/balance, crew scheduling, dispatch functions, and so on.

Check-in is a good example. You can lose that either through a failure in the complex WAN, failures in the check-in backend service, failures with the no-fly service (run by the govt) or connectivity to it, failures in the CRS/GDS, failures in various services around check-in kiosks, failures in the online checkin, and so forth.

Once they go down, you also face an unusually high spike in request volume when you're trying to get them back up. It creates a wave that can overwhelm different parts of the system.
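To make that recovery wave concrete: a common general-purpose mitigation (nothing here says any airline uses it) is to have clients retry with exponential backoff plus random jitter, so they don't all reconnect in the same instant. A minimal sketch; the function name, `base`, and `cap` are made up for illustration:

```python
import random

def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    Hypothetical helper: the ceiling doubles each attempt up to `cap`,
    and the actual wait is drawn uniformly from [0, ceiling] so that
    many clients retrying at once spread out instead of arriving together.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays

# Example: waits for 5 retries from one kiosk or client.
print(backoff_delays(5))
```

The randomness is the important part: plain exponential backoff without jitter still has every client retrying on the same schedule.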

For the more recent failures (across different airlines) listed above, I know one was a routing storm on the IP network, one was the checkin service, and one was the central reservations system...I think a botched version upgrade. Similar effects, different root causes.

Not to say it's okay, or shouldn't be addressed, but just noting that there's not really one smoking gun.

I'm not advocating that it is ok, but I am sure these airline systems are super old and wouldn't be surprised if they use IBM DB2 or similar ancient database technology. Moving to the cloud is not a trivial task for these sorts of mission critical antiquated systems.

It's a legacy story like none other, in fact. The predecessor/origin of SABRE is IBM's Airline Control Program (ACP). When I worked for IBM years ago I heard many stories of how difficult it was to try and modernize to a newer system because of absurd complexity, but just as much because the whole airline industry became so inextricably bound to the legacy:


I'm currently doing consulting work for That Major Australian Airline and, while I'm rather high up the stack, you do get a sense of the amount of legacy that's built into everything and the monumental effort it would be to migrate all this old stuff to newer infra. I mean, AFAICT there's no database/service to _query_ flights - you have to register a web hook to receive flight data and store it yourself.

New projects are cloud-first, and more and more stuff gets migrated or replaced with equivalents which run in whatever cloud provider. But I can't even imagine how the replacements for all these old legacy services would go down.

They used to run on Tandem NonStops with a dual DC setup in Sydney and Newcastle. Those were the days.

You would be surprised how much has already been migrated away from the old IBM systems (TPF). The big players these days are Lufthansa, Jeppesen, Navitaire, Apollo, Sabre, etc.

Apollo is now Galileo. Galileo, Sabre, HP Shares, and Amadeus (the actual "Big 4" in this space) all still use the TPF operating system on IBM mainframes. They all are offloading functionality piecemeal, to more modern systems. But TPF is still at the core.

You mentioned Navitaire. They are not using TPF...they use COBOL on Windows (not kidding). They do have a large list of airline customers, but none with a big fleet. Reportedly, it doesn't scale up well enough to serve a large airline.

TPF also lives on in the financial world as well, like at Visa, for example.

IBM DB2 is downright futuristic compared to what some of these systems are. SABRE, for example, is probably the granddaddy and horror textbook example of what's commonly referred to as a "legacy codebase" (although the IRS Master Files written in S/370 assembly could give it a run).

There's no way the IRS still uses an S/370 system. Please give me a link so I can have a laugh.

Both the individual and business master files are still written in IBM mainframe assembly language, and are circa 56 years old. See the table on page 4 of this PDF for a list of the oldest systems in operation:


Number three on the list is the DoD's Strategic Automated Command and Control System, which runs on "an IBM Series/1 Computer—a 1970s computing system—and uses 8-inch floppy disks". No biggie; it just "coordinates the operational functions of the United States’ nuclear forces, such as intercontinental ballistic missiles, nuclear bombers, and tanker support aircrafts."

I am wrong then! 56 years? That's the sixties. Yes, https://www.cnet.com/news/irs-trudges-on-with-aging-computer... it's from 1962 according to this 2008 article and the S/370 was introduced in 1970. It's actually an IBM 7074. God above. http://webcache.googleusercontent.com/search?q=cache:J3hXqKq...

This is a PDF from 2016 https://www.irs.gov/pub/irs-utl/scap-pia.pdf

> Standard CFOL Access Protocol (SCAP) is written in COBOL/Customer Information Control System (CICS). SCAP downloads Corporate Files On-Line (CFOL) data from the IBM mainframe at the Enterprise Computing Center, Martinsburg. The CFOL data resides in a variety of formats (packed decimal, 7074, DB2, etc.)

7074 format. weeps

Yeah, but it's better for the ICBM control systems to run on 8-inch floppies; it's much safer (also because of the added friction).

Distorts the attack surface these legacy systems present.

Code that works, and has for 40 years.

You'll never write something that lasts that long.

I feel really old when I realize some of my code is almost halfway there, and the odds are it WILL reach 40 years :(

You seriously underestimate inertia at financial institutions.

New IBM z-series mainframes will still run most System 370 assembly language programs. So it's possible.

Never, ever underestimate the government and military's ability to keep old systems operational well past what others consider reasonable.

Anecdotal: back in the late 80s I was in the USAF. Our secure communication center was running the first model Burroughs machine to not use tubes. It was that old. It could boot from cards, paper tape, or switches. The machine was older than many people who would be assigned to it. This was closely repeated in the main data center (personnel records, inventory, and such), which had a decade-old system that migrated off physical cards by '89 but still took them as images off 5.25-inch floppies uploaded by PC.

You're right, DB2 is way too hipster and modern for airlines.

DB2 isn't "ancient" in any meaningful sense. The first release shipped long ago but IBM has kept it fully up to date since then and it's still competitive with any other relational database for high-volume OLTP.

> but I am sure these airline systems are super old and wouldn't be surprised if they use IBM DB2 or similar ancient database technology.

Not even. They're still on mainframes.

I don't get what you're trying to get at here. DB2 started its life on IBM mainframes (MVS), and is still often run on z/OS or i.

They are still on mainframes running TPF, where DB2 is not an option. The database is TPFDF, a KV datastore.

The financial industry still has software running on mainframes. And some are still writing LOB software in VB6.

Can confirm both - I've been writing LOB software for the financial industry in VB6 up until very recently, and stopped because I switched jobs, not because they've stopped writing them :P

To be honest, VB6 is much better than some of the other stuff they have around.

I now switched to a travel agency and am interacting with the Sabre blue screen systems that are similarly old.

> Moving to the cloud is not a trivial task for these sorts of mission critical antiquated systems

The question is whether moving to _anything_ more modern is less costly than keeping the current systems, which appear to have many single points of failure.

Often I've seen businesses reason that the failures are cheaper than the upgrades.

That metric is often miscalculated as the initial cost is more of a deal than future revenue.

So they stay with antiquated mainframes and DOS interfaces.

Providing jobs for people like myself who can code in PL1, though I'd much rather take a brick to the face than do it again.

> Often I've seen businesses reason that the failures are cheaper than the upgrades.

I would love to know what the cost of today's outage is in terms of overtime, gate fees, fuel, additional crew, &c.

Delta's cost ~$150MM [1]. That's something on the order of a thousand mid- to senior- level programmers for a year in my area. Even if you allocate a quarter of that cost to computer costs (which I'm betting is a fairly large overestimate), that still leaves a sizable team.
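Spelling out that back-of-envelope math as a rough sketch: the per-programmer figure below is only what the two numbers above imply, and I'm reading "a quarter" as "only a quarter of the $150MM goes to the software effort".

```python
# Figures taken from the comment above; everything derived is back-of-envelope.
OUTAGE_COST = 150_000_000  # Delta outage cost cited in [1], in dollars
HEADCOUNT = 1_000          # "a thousand mid- to senior- level programmers for a year"

# Implied cost of one programmer-year (an inference, not a quoted salary).
implied_cost_per_programmer = OUTAGE_COST / HEADCOUNT

# Assumed reading: only a quarter of the outage cost is available as budget.
quarter_budget = OUTAGE_COST / 4
team_size_on_quarter = quarter_budget / implied_cost_per_programmer

print(implied_cost_per_programmer)  # 150000.0
print(quarter_budget)               # 37500000.0
print(team_size_on_quarter)         # 250.0
```

So even on the pessimistic reading, that is roughly 250 programmer-years from a single outage.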

> DOS interfaces.

TUI interfaces can be really, really efficient in terms of navigation and getting around. I rarely see GUI apps that function anywhere near as smoothly or keyboard only.

[1] http://www.datacenterknowledge.com/archives/2016/09/08/delta...

> TUI interfaces can be really, really efficient in terms of navigation and getting around. I rarely see GUI apps that function anywhere near as smoothly or keyboard only.

TUIs have a steeper learning curve, but I agree that once someone masters the hotkeys for that particular interface, they're much quicker. In addition, depending on how old the computer running the TUI is, the staff may not be able to use the computer to browse anything else. ;)

I absolutely agree that TUIs can be great, especially at user congestion sites, like POS and similar points.

I meant DOS when I said DOS. Where you run into program/system memory split problems when you keep piling on features.

> That's something on the order of a thousand mid- to senior- level programmers for a year in my area

I shudder to think of a scheduling program as complex as an airline's ticketing system being run on a bit of software written by a thousand programmers in only one year.

I didn't mean to imply it'd take a single year, but just trying to think about the cost of manpower.

>Delta's cost ~$150MM [1]

Okay, I admit that I have absolutely no idea what software like that costs to build, but surely they could have rebuilt their entire software stack for that, couldn't they?

You are excluding the cost of the bugs such a rewrite would inevitably produce. Normally that is drastically more expensive than dev time.

No. Software at that scale is humungously expensive.

I worked on quoting an insurance policy core (just the core mind you, not the extras), and for a medium-sized insurance company it would reach that amount of money.

I suspect a complete rewrite would go into the billions.

Just curious, where did all that money go? How many people for how long?

The company I worked for ended up not doing it (and regretting it).

It ends up not being that many people, but extremely highly paid consultants (200 to 400 dollars an hour), and extremely high licensing costs. Some projects go on for years when they should take 1 to 2.

It's extremely profitable and very well paid; one such company, Guidewire, is one of the top 10 best paying employers in the U.S.


What? You can trivially run DB2 "in teh cloudz" http://www.ibm.com/analytics/us/en/technology/cloud-data-ser...

Yeah a lot of this stuff is on old school mainframes written in assembly. I work in the airline/travel industry and I've seen 28 year old code still running in production.

DB2 is actually a fairly decent database. I would be really surprised if DB2 was the weakest link in their technology stack.

I wonder if they are still interconnected with 56Kb leased lines?

DB2 isn't that old, relatively. It's in the same league as Oracle or many other mature relational databases. You might be thinking of IMS.

The funny thing is if you take about 5% of IMS's features and reliability, you get super-trendy MongoDB.

A lot of the systems the airlines use are typically hosted by large telcos. Some use SaaS, so those are hosted for them. Very few airlines I know of actually have their own data centers anymore, and typically it's a misconfiguration or an upgrade that causes these outages. Although it can also be a comm problem.

Air Canada recently had an outage that shut down computerized check-ins.


Porter was down on the 14th as well, but only for a few hours. http://www.cbc.ca/1.3936314

A couple weekends ago, RDU was unable to handle flights out of its major terminal due to a hard drive (or three) failure. Took 75 (!) people to diagnose and fix:


Anyone know why they don't have a fail-over mode that will just keep the same routes as the previous week? Then at least you'd just have underbooked/overbooked flights, but the network would keep running.

We flew from Miami to London with American Airlines for Christmas and we had a computer outage as well.

Currently sitting on a plane at O'Hare, can't get to gate because they won't let any planes leave to clear a spot for my plane. What a nightmare.

I have no reason to believe that there's any malice involved here.

I also think that incidents like this might become more common as the state of cyber warfare progresses. As engineers, we should take care to build secure software, especially when that software underlies important systems. We should impress upon non-technical management the importance of doing so, even if it may take a little more time or money in the short term.

Waiting in SLC terminal. Crew has to be off the clock in an hour. Not likely getting home tonight.

I love flying and getting to experience some of the weird issues that crop up. I was on a flight in December when there was only one ground crew working at my mid-sized airport. I was really hoping to beat the winter storm coming in, as I live pretty far from the airport. If there was a delay, I'd be stuck driving home in a blizzard.

They were prioritizing getting planes to gates based on if the plane was picking up passengers and heading out again. Mine was the last flight of the day for my plane, so we were at the back of the line.

After 30 minutes, we finally had an opening and were moving... until another flight crew reminded the tower that they were reaching their FAA allowed hours and needed to be parked immediately. They took our spot, and then more planes came in that had to leave. Of course I know all of this because our pilot was pissed and would relay all of the info to us with a very sarcastic voice. He was supposed to be home by now too.

We ended up getting into our gate almost two hours after we landed. I normally live 30 minutes from the airport, and it took me another hour and a half to get home with the weather. The only reason we got in when we did is because the weather forced a shutdown of the airport, so we were literally the last flight to get parked at a gate, and only because there were no new flights coming in.

It's fun (after the fact) to think about these edge cases, how to keep the entire airspace running smoothly when one airport only has a single ground crew working, how to maximize efficiency to make sure the fewest number of people are delayed. Sucks to live through it though.

The solution here is to have more crews & gates. It will create some jobs along the way, and surely will cost less than the delays you described.

How will hiring more crews and gates be cheaper than one flight getting abnormally delayed? I mean maybe it will be cheaper, but there is no surety about it.

If the airlines have to pay passengers for delays...

You said there was always a line of planes. All those planes were delayed, weren't they?

This is even before we factor in all the time lost by passengers of these planes, after which it would be a no-brainer.

You misunderstand. It's not normal to only have one ground crew, this was a very unusual situation. My guess is a bunch of people called in due to the impending weather, or the airport expected to be shut down sooner than they were.

Like I said, this was an edge case. I've flown once a month for almost three years now and this was the first time I've experienced this.

>Crew has to be off the clock in an hour

That's a good observation. Often, the original issue, if it's not fixed in an hour or two...starts creating cascading problems.

Say just the boarding pass / checkin function is down. If you don't fix it quickly...

- Crews (pilots and/or fa's) become illegal to fly due to various rules around crew work, hours, rest periods, etc.

- They can't be replaced by other crews that need to be flown in, because the gates are occupied by the planes that can't leave

- Downstream flight connections no longer match up, so a massive process to change the current tail numbers <--> flight numbers plan has to happen, followed by matching crews and passengers to the new plan. Especially fun if you have a mixed fleet where only certain crews can fly certain models, and the models have different passenger capacities.

Basically, once you go past an hour or so, a giant shitshow starts...and gets worse as time goes on.
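A toy sketch of the first bullet, i.e. how a delay turns a legal crew into an illegal one. The `CrewDuty` fields and the 12-hour limit are placeholders for illustration, not the actual duty-time regulations:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CrewDuty:
    crew_id: str
    duty_start: datetime     # when this crew's duty period began
    max_duty: timedelta      # placeholder limit, NOT a real regulatory value
    flight_time: timedelta   # length of the flight they still have to operate

def crews_timed_out(crews, new_departure):
    """Return ids of crews whose duty period would exceed max_duty
    if their flight left at new_departure instead of on schedule."""
    timed_out = []
    for crew in crews:
        duty_end = new_departure + crew.flight_time
        if duty_end - crew.duty_start > crew.max_duty:
            timed_out.append(crew.crew_id)
    return timed_out

crews = [
    CrewDuty("A", datetime(2017, 1, 22, 6, 0), timedelta(hours=12), timedelta(hours=2)),
    CrewDuty("B", datetime(2017, 1, 22, 4, 0), timedelta(hours=12), timedelta(hours=2)),
]
# Outage pushes departure to 15:00: crew A ends at 11h of duty, crew B at 13h.
print(crews_timed_out(crews, datetime(2017, 1, 22, 15, 0)))  # ['B']
```

Every extra hour of outage moves more crews into that list, and each one needs a replacement who may be stuck somewhere else.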

I got stuck in SLC two Sundays ago. Hotel tonight should have decent options. If you end up in town, the place will be dead. Bourbon House is decent and serves food late.

Isn't Sundance going on now? That might overflow into SLC?

Stuck on ground in plane at ORD, just got an update that the "mainframes" are starting to come back online.

EDIT: We're now about to take off.

While in this case they probably are actual mainframes, at my firm we use the term to refer to the VMs running in tiny boxes that replaced what once were also mainframes.

Sometimes the language outlives the hardware.

Uh, they aren't using mainframes.

10:1, Linux fail.

They most certainly are. The primary systems running all the major US airlines are all mainframe at least in part.

This doesn't surprise me. US airlines seem like breeding grounds for management silos and the technology is going to reflect that, no matter how much you spend.

When's the last time you heard a great software/IT engineer go "You know what I really want to do? Work for an Airlines company. It will be sweet!"

Yeah, that's part of the problem.

No, the problem is the airlines don't compensate or treat their engineers well enough to make anyone think "I really want to work for an airline".

May this never happen to the Tesla fleet. As hesitant as I am about self driving cars, this would be an almost irreparable setback.

And it's happened twice in my recent memory to our air fleets...

Why would it happen? Teslas are not centrally managed / scheduled / booked.

The planes themselves could fly - it's the other things that failed.

They are centrally managed, though. They are authorized to run over the Tesla network, and they get patches from a centralized source.

Theoretically the cars would be mechanically sound to drive, but couldn't due to problems with the software.

Could be an issue for their self driving taxi fleet.

Are we playing word association now or something? Airline mainframe management systems are not in any way related to Tesla.

Teslas are remotely enabled and disabled. They regularly report telemetry back to HQ. They ask for and receive updates over the air.

What happens if that infrastructure crashes (or a natural disaster disables part of it)? What happens if the clocks drift? What happens if a bug (or an intentional virus) suddenly deauthorizes all Teslas on the network? It's the same problem - management of the remote systems is centrally located, and centrally located management systems can fail.

The planes were perfectly capable of flying. I haven't seen what the issue was but it's equally likely that it was a problem with the reservation system, the pricing system, the weight/balance calculators (admittedly I wouldn't want to fly with this being wonky), even the shift scheduling for backup flight crews.

There is no remote management of planes. Nobody at ATC or on the ground can prevent a plane from doing anything the pilot wants to make the plane do.

Why would it be irreparable?

Can you picture the news, and loss of user confidence, if one day everyone with a Tesla were suddenly unable to drive their car?

Would you trust a car which breaks down when AWS (for example) has an outage?
