This article goes into some detail of why it happens: http://money.cnn.com/2016/08/08/technology/delta-airline-com.... It seems like human error and fragile computer systems are the biggest issues.
I don't think that would help much. It's not really the core hardware or operating systems that tend to cause these types of outages.
More typically, it's the dependency chain between locations, applications, and services. And, there's more than one system that can cause a ground halt. The check-in service, the no-fly list functionality (which the govt runs), weight/balance, crew scheduling, dispatch functions, and so on.
Check-in is a good example. You can lose that either through a failure in the complex WAN, failures in the check-in backend service, failures with the no-fly service (run by the govt) or connectivity to it, failures in the CRS/GDS, failures in various services around check-in kiosks, failures in the online checkin, and so forth.
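A toy sketch of why that chain is so fragile: check-in availability is the conjunction of every dependency, so any single failure grounds the whole function. The dependency names here are taken loosely from the list above; this is an illustration, not any airline's actual topology.

```python
# Check-in is only up if every dependency in its chain is up.
CHECKIN_DEPS = ["WAN", "checkin_backend", "no_fly_service",
                "CRS_GDS", "kiosk_services", "online_checkin"]

def checkin_available(status: dict) -> bool:
    """A single failed (or missing) dependency takes check-in down with it."""
    return all(status.get(dep, False) for dep in CHECKIN_DEPS)

status = {dep: True for dep in CHECKIN_DEPS}
assert checkin_available(status)

status["no_fly_service"] = False   # the govt-run service goes down...
assert not checkin_available(status)  # ...and check-in goes with it
```

With six independent dependencies, even 99.9% availability on each multiplies out to noticeably less for the composite service.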
Once they go down, you also face an unusually high spike in request volume when you're trying to bring them back up. It creates a wave that can overwhelm different parts of the system.
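That recovery spike is the classic thundering-herd problem. A minimal sketch of the standard client-side mitigation, exponential backoff with "full jitter" (function names and parameters are mine, for illustration):

```python
import random

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Yield exponentially growing, jittered retry delays.

    Randomizing each delay over [0, base * 2**attempt] spreads the
    retries out, so stranded clients don't all hammer the service
    the instant it comes back up.
    """
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

# Each client ends up sleeping a different amount before retrying.
for delay in backoff_delays():
    print(f"retrying in {delay:.2f}s")
    # time.sleep(delay)  # then re-issue the check-in request
```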
For the more recent failures (across different airlines) listed above, I know one was a routing storm on the IP network, one was the checkin service, and one was the central reservations system...I think a botched version upgrade. Similar effects, different root causes.
Not to say it's okay, or shouldn't be addressed, but just noting that there's not really one smoking gun.
New projects are cloud-first, and more and more stuff gets migrated or replaced with equivalents that run in whatever cloud provider. But I can't even imagine how replacing all of these old legacy services would go.
You mentioned Navitaire. They are not using TPF...they use COBOL on Windows (not kidding). They do have a large list of airline customers, but none with a big fleet. Reportedly, it doesn't scale up well enough to serve a large airline.
TPF also lives on in the financial world, like at Visa, for example.
Number three on the list is the DoD's Strategic Automated Command and Control System, which runs on "an IBM Series/1 Computer—a 1970s computing system—and uses 8-inch floppy disks". No biggie; it just "coordinates the operational functions of the United States’ nuclear forces, such as intercontinental ballistic missiles, nuclear bombers, and tanker support aircrafts."
This is a PDF from 2016 https://www.irs.gov/pub/irs-utl/scap-pia.pdf
> Standard CFOL Access Protocol (SCAP) is written in COBOL/Customer Information Control System (CICS). SCAP downloads Corporate Files On-Line (CFOL) data from the IBM mainframe at the Enterprise Computing Center, Martinsburg. The CFOL data resides in a variety of formats (packed decimal, 7074, DB2, etc.)
7074 format. *weeps*
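Packed decimal is the mainframe-era encoding mentioned in that quote: two BCD digits per byte, with the sign in the final low nibble. A minimal sketch of decoding a COMP-3-style field in Python (a simplified illustration, not the IRS's actual tooling):

```python
def unpack_comp3(data: bytes) -> int:
    """Decode an IBM packed-decimal (COMP-3) field.

    Each byte holds two decimal digits, one per nibble; the last
    low nibble is the sign (0xD = negative, 0xC or 0xF = positive).
    """
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()                       # final nibble is the sign
    value = int("".join(str(n) for n in nibbles))
    return -value if sign == 0xD else value

# 0x12 0x34 0x5C encodes +12345; 0x01 0x2D encodes -12.
print(unpack_comp3(b"\x12\x34\x5C"))  # 12345
print(unpack_comp3(b"\x01\x2D"))      # -12
```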
You'll never write something that lasts that long.
You seriously underestimate inertia at financial institutions.
Anecdotal: back in the late '80s I was in the USAF. Our secure communication center was running the first Burroughs model that didn't use tubes. It was that old. It could boot from cards, paper tape, or front-panel switches. The machine was older than many of the people assigned to it. The situation was much the same in the main data center (personnel records, inventory, and such), which had a decade-old system that had migrated off physical cards by '89 but still accepted them as images off 5.25" floppies, uploaded by PC.
Not even. They're still on mainframes.
To be honest, VB6 is much better than some of the other stuff they have around.
I've since switched to a travel agency and now interact with the Sabre blue-screen systems, which are similarly old.
The question is whether moving to _anything_ more modern would cost less than keeping the current systems, which appear to have many single points of failure.
That calculation is often skewed, because the up-front cost looms larger than the future savings.
So they stay with antiquated mainframes and DOS interfaces.
Providing jobs for people like myself who can code in PL/I, though I'd much rather take a brick to the face than do it again.
I would love to know what the cost of today's outage was in terms of overtime, gate fees, fuel, additional crew, &c.
Delta's cost was ~$150MM. That's something on the order of a thousand mid- to senior-level programmers for a year in my area. Even if you allocate a quarter of that cost to computer costs (which I'm betting is a fairly large overestimate), that still leaves a sizable team.
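The back-of-envelope math, assuming a fully loaded cost of about $150k per engineer-year (my figure, for illustration):

```python
outage_cost = 150_000_000          # Delta's ~$150MM estimated outage cost
cost_per_engineer_year = 150_000   # assumed fully loaded cost per engineer

engineer_years = outage_cost // cost_per_engineer_year
print(engineer_years)              # 1000 engineer-years

# Even after allocating a quarter of the budget to non-payroll costs,
# the remaining three quarters still funds a large team for a year.
team_after_overhead = (outage_cost * 3 // 4) // cost_per_engineer_year
print(team_after_overhead)         # 750 engineers
```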
> DOS interfaces.
TUI interfaces can be really, really efficient in terms of navigation and getting around. I rarely see GUI apps that function anywhere near as smoothly, or that can be driven keyboard-only.
TUIs have a steeper learning curve, but I agree that once someone masters the hotkeys for a particular interface, they're much quicker. In addition, depending on how old the computer running the TUI is, the staff may not be able to use it to browse anything else. ;)
I meant DOS when I said DOS: the environment where you run into program/system memory-split problems when you keep piling on features.
I shudder to think of a scheduling program as complex as an airline's ticketing system being run on a bit of software written by a thousand programmers in only one year.
Okay, I admit that I have absolutely no idea what software like that costs to build, but surely they could have rebuilt their entire software stack for that, couldn't they?
I worked on quoting an insurance policy core (just the core mind you, not the extras), and for a medium-sized insurance company it would reach that amount of money.
I suspect a complete rewrite would go into the billions.
It ends up not being that many people, but extremely highly paid consultants (200 to 400 dollars an hour) and extremely high licensing costs. Projects that should take 1 to 2 years can drag on for far longer.
It's extremely profitable and very well paid; one such company, Guidewire, is one of the top 10 best-paying employers in the U.S.
I also think that incidents like this might become more common as the state of cyber warfare progresses. As engineers, we should take care to build secure software, especially when that software underlies important systems. We should impress upon non-technical management the importance of doing so, even if it may take a little more time or money in the short term.
They were prioritizing getting planes to gates based on if the plane was picking up passengers and heading out again. Mine was the last flight of the day for my plane, so we were at the back of the line.
After 30 minutes, we finally had an opening and were moving... until another flight crew reminded the tower that they were reaching their FAA allowed hours and needed to be parked immediately. They took our spot, and then more planes came in that had to leave. Of course I know all of this because our pilot was pissed and would relay all of the info to us with a very sarcastic voice. He was supposed to be home by now too.
We ended up getting into our gate almost two hours after we landed. I normally live 30 minutes from the airport, and it took me another hour and a half to get home with the weather. The only reason we got in when we did is because the weather forced a shutdown of the airport, so we were literally the last flight to get parked at a gate, and only because there were no new flights coming in.
It's fun (after the fact) to think about these edge cases, how to keep the entire airspace running smoothly when one airport only has a single ground crew working, how to maximize efficiency to make sure the fewest number of people are delayed. Sucks to live through it though.
This is even before we factor in all the time lost by passengers of these planes, after which it would be a no-brainer.
Like I said, this was an edge case. I've flown once a month for almost three years now and this was the first time I've experienced this.
That's a good observation. Often, the original issue, if it's not fixed in an hour or two...starts creating cascading problems.
Say just the boarding pass / checkin function is down. If you don't fix it quickly...
- Crews (pilots and/or FAs) become illegal to fly due to various rules around duty time, flight hours, rest periods, etc.
- They can't be replaced by other crews that need to be flown in, because the gates are occupied by the planes that can't leave
- Downstream flight connections no longer match up, so a massive process to change the current tail numbers <--> flight numbers plan has to happen, followed by matching crews and passengers to the new plan. Especially fun if you have a mixed fleet where only certain crews can fly certain models, and the models have different passenger capacities.
Basically, once you go past an hour or so, a giant shitshow starts...and gets worse as time goes on.
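The crew-legality part of that cascade can be sketched as a simple duty-time check. The 14-hour limit here is an assumed, simplified figure for illustration, not the actual FAA rule set:

```python
from datetime import datetime, timedelta

MAX_DUTY = timedelta(hours=14)  # simplified; real limits vary by rule set

def crew_is_legal(duty_start: datetime, departure: datetime,
                  flight_time: timedelta) -> bool:
    """A crew is legal only if the entire flight fits inside the duty window."""
    return departure + flight_time <= duty_start + MAX_DUTY

duty_start = datetime(2016, 8, 8, 6, 0)   # crew signed on at 06:00
flight = timedelta(hours=3)

# On-time 15:00 departure: 18:00 arrival is inside the 20:00 duty limit.
print(crew_is_legal(duty_start, datetime(2016, 8, 8, 15, 0), flight))  # True

# A 3-hour ground stop pushes departure to 18:00: the same crew is now illegal.
print(crew_is_legal(duty_start, datetime(2016, 8, 8, 18, 0), flight))  # False
```

This is why an hour-long outage doesn't cost an hour: each delayed departure silently burns through the fixed duty windows of every crew downstream.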
EDIT: We're now about to take off.
Sometimes the language outlives the hardware.
10:1, Linux fail.
Yeah, that's part of the problem.
And it's happened twice in my recent memory to our air fleets...
The planes themselves could fly - it's the other things that failed.
Theoretically the cars would be mechanically sound to drive, but couldn't due to problems with the software.
What happens if that infrastructure crashes (or a natural disaster disables part of it)? What happens if the clocks drift? What happens if a bug (or an intentional virus) suddenly deauthorizes all Teslas on the network? It's the same problem - management of the remote systems is centrally located, and centrally located management systems can fail.
There is no remote management of planes. Nobody at ATC or on the ground can prevent a plane from doing anything the pilot wants to make the plane do.
Would you trust a car which breaks down when AWS (for example) has an outage?