First, AWS, Azure, and GCE are not just "Joe Blow's cloud server"; they are multi-billion-dollar companies, and they can all provide hosting environments compliant with a multitude of security programs, including SOC 1, 2, 3, PCI DSS, HIPAA, etc.
If a hospital can store patient records on AWS, why can't Delta store my flight records there? If the government is worried that a public cloud leaves them open to terrorist attacks, then it can sponsor them to run on GovCloud for better isolation.
But more importantly, moving a complex, high-volume legacy system to "the cloud" is no panacea: whatever dependencies or lack of redundancy caused this failure could cause the exact same failure mode in "the cloud" (plus it can open them up to all-new failure modes).
Besides, if $CLOUD_PROVIDER has a couple of hours of outage and the patient records aren't available, not that much backs up; the number of patients affected by, say, a 2-hour outage is small. But that same length of outage can affect thousands of passengers across a dozen airports and cause knock-on effects for days.
Every major cloud provider offers multiple independent regions, and I haven't heard of any suffering a multi-region outage.
If an airline's application can't tolerate an outage, then they better not host it in a single region, whether they host it themselves or host it in the cloud.
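For illustration, here's a minimal sketch of what multi-region hosting buys you, assuming two hypothetical regions behind a health-checked failover (the region names, URL, and policy below are made up; real setups use DNS failover or a global load balancer):

    # Hypothetical active/passive failover across two regions.
    # Region names and the health endpoint are invented for illustration.
    import urllib.request

    REGIONS = [
        ("us-east-1", "https://ops.example.com/us-east-1/health"),
        ("us-west-2", "https://ops.example.com/us-west-2/health"),
    ]

    def healthy(url, timeout=2):
        """Treat any HTTP 200 within the timeout as healthy."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def pick_active_region():
        """Route traffic to the first healthy region in priority order."""
        for name, url in REGIONS:
            if healthy(url):
                return name
        raise RuntimeError("no healthy region -- time to page a human")

The point isn't the ten lines of failover code; it's that the application behind those endpoints has to be built so either region can serve on its own, which is exactly the hard part.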
Delta is on day 2 of their outage, Southwest's was 12+ hours.
Google had a major outage in April -- it lasted 18 minutes. In June, AWS lost a single Availability Zone (out of 3) in Sydney for about 12 hours.
It has happened in the past.
Uptime needs are security needs too, for various reasons; e.g., a lack of uptime is just another way to end up with a denial of service.
Can't stop to sharpen the axe. Too much wood to cut.
I am pretty sure they would be in a better situation if they were hosted, redundantly and securely, at a cloud provider...
Disclosure: I work on Google Cloud.
They did have backup systems. Except those were located at the same place as the production systems and the fire destroyed them too.
1. Lots of companies don't yet realize that if you're not a technology company, you may as well close up shop, because eventually you will either fade against the competition or have a reputation-destroying event.
2. They don't pay their employees enough.
3. They don't respect their employees enough.
Until this stops, events like this will keep happening.
Source: I deal with this pretty much every day. I can't go into much more detail, but having seen the innards of multiple Fortune 500 companies, it's frankly shocking that things work sometimes. There's stuff that's the equivalent of being duct-taped together, etc.
Switch? N+1, baby. Even more on critical systems. If that switch breaking can cause you to cancel 1,500 flights, there's a clear case for replicating the switchboard, surely? Also, manual transfer switches are a thing, just make sure you break-before-make.
Any situation in which it is up to a human to make sure they break-before-make is a grenade waiting to go off. The only thing I've seen come close is old power plants where the generators are manually synchronized, but usually there is still a sync check so the sources can't be connected more than X degrees out of phase, where X is 3 for large hydro units, 10 for standard diesels, and can be as wide as 20 or 30 for diesels at hospitals, where they need to get the thing online ASAP.
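For what it's worth, the sync-check logic itself is tiny. Here's a rough sketch using the phase-angle figures above (the thresholds come from that comment; everything else, including the names, is invented):

    # Illustrative sync-check permissive: block the breaker-close command
    # unless the incoming source is within the allowed phase angle of the bus.
    MAX_ANGLE_DEG = {
        "large_hydro": 3,
        "standard_diesel": 10,
        "hospital_diesel": 25,  # "20 or 30" above; pick something in between
    }

    def phase_difference(bus_deg, source_deg):
        """Smallest angular difference, wrapped into the 0..180 degree range."""
        diff = abs(bus_deg - source_deg) % 360
        return min(diff, 360 - diff)

    def permit_close(unit_type, bus_deg, source_deg):
        """True only when the sources are close enough in phase to connect."""
        return phase_difference(bus_deg, source_deg) <= MAX_ANGLE_DEG[unit_type]

The check is cheap; the grenade is any procedure where a human is allowed to bypass it.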
For example, many years ago I used a network service provided by a quite diligent company. They had installed a large Cisco switch with redundant "supervisors" which are...something important. So back then the way this worked was that the "active" supervisor would handle everything, and the "standby" one would ping the active one and if it decided the active one had died, it would make itself active.
This process takes about a minute. You can imagine what might happen if each supervisor starts to believe that the other is down.
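The race is easy to see in a naive version of that standby logic (a sketch with invented names and timings, not Cisco's actual implementation):

    # Naive active/standby failover: the standby promotes itself when it
    # stops hearing heartbeats. Nothing below distinguishes "peer died"
    # from "link to peer died" -- if the link fails, both ends go active:
    # the classic split-brain.
    import time

    HEARTBEAT_TIMEOUT = 60.0  # seconds; the "about a minute" above

    class Supervisor:
        def __init__(self, name, active=False):
            self.name = name
            self.active = active
            self.last_heartbeat = time.monotonic()

        def on_heartbeat(self):
            self.last_heartbeat = time.monotonic()

        def tick(self):
            # Peer silent too long? Assume it died and take over.
            if not self.active and time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT:
                self.active = True  # both supervisors can reach this line at once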
About ten years ago, Cisco released some new fancier (supposedly better) way to handle this. It is not enabled by default [1, don't bother reading].
The usual culprit in stuff like you described is management. They either force knowledgeable staff to make do with less to save money in the short term at long-term expense, or they don't even hire staff good enough for high-assurance systems. Those people do cost money, given they're uncommon to rare. Although one company (Ericsson) did go all out one time, making their network nine 9's just to show it could be done. Till management and customers looked at the cost. :)
A single router chassis is not redundancy at all.
Extra layers of protection aren't merely "adding to the attack surface" (although that's possible; antivirus products becoming the attack vector, for example).
The kind of national and global operations coordination Delta does from their center in Atlanta is not a "oh, network's down, work from the coffee shop" kind of job.
Delta + regional affiliates operate nearly 8,000 flights per day on six continents through thirteen hubs, involving around 80,000 employees and ~820 mainline aircraft plus I don't even know how many smaller regional jets, plus codesharing on SkyTeam alliance partners and 13 other non-alliance-partner carriers.
Spec out an ops center capable of ingesting and organizing all the data on that, presenting it for human use, and then communicating instructions back out to all those people, planes, airports, maintenance bases, etc. And then ask yourself if that's something you just build a couple extra copies of (along with on-call staff to come in and pick up the whole thing at a moment's notice, since if the main one goes down it's not like you'll be flying the people to the backup locations).
The answer, of course, is that it's not something you build extras of. You build it once, and build it as reliable as you can, because building spares just is not feasible; the only people who maintain extras of this kind of infrastructure are governments who worry about getting into nuclear wars.
My guess is that the reason it's set up this way is that it was built a long time ago in its current form and it's not worth the cost to rebuild / not worth the regulatory hurdles of moving to the cloud.
The problem is you're thinking of it as a "build more datacenters" problem.
It is not a problem of needing datacenters. It is not a problem of needing a bunch more computers. It is a problem of needing a combination of data, access to that data, communications, and people. That is the thing that is complex to set up, expensive to replicate, and not worth the cost.
Delta certainly should not be running their own datacenters, and it's likely that they don't really do so to begin with. They (like every other airline) are running the same custom software and hardware that is creaking along, which is why every interaction at an airport runs sooooo slowly.
* Uber does not submit route plans to any authority, or deal with issues of transit across controlled borders. Airlines must submit valid flight plans for approval to aviation authorities in the appropriate countries, and routinely must handle transit of aircraft, people and cargo across controlled international borders, planning all aspects in advance.
* Uber simply makes use of any available roads and pickup/drop-off zones. Airlines must pay for and sometimes obtain regulatory approval for takeoff and landing slots, use of gates at airports, use of facilities at airports, etc., keeping track of what resources are available and which will be in use at any given time.
* Uber famously tells regulators to, effectively, go fuck themselves. Airlines which attempt this are shut down. Immense effort must be dedicated to compliance with applicable safety and commerce regulations at all times.
* Uber does not own or maintain the cars its passengers use. Airlines own or lease their aircraft and are responsible for their maintenance, on legally-mandated schedules. Airliner maintenance is a wonder in and of itself.
And on and on and on it goes.
Again: this ain't a Silicon Valley tech startup. Thinking in terms of even the largest SV startups is going to lead you badly astray.
Delta started off flying planes, and has been around longer than computers have existed. Like many established businesses, they have a mix of new and legacy technologies, and can't just copy everything up to AWS. They'd have to dedicate years of time and effort across many employees to rewrite some of their systems, while still maintaining the existing ones in parallel until it's safe to migrate off. All the while, they have an actual business to run, with real revenues and real expenditures, and IT is just one (important, but costly) piece of the big picture.
These companies are neither incompetent nor malicious. They just have to find the money and time to make the enhancements they'd like to their systems, and change doesn't happen overnight. Likely, with this awkward generator fire, they'll try to hasten their efforts.
Only statement in your comment that I disagree with: https://en.wikipedia.org/wiki/Delta_Air_Lines#History
If Delta is the mainframe-oriented business that I suspect it is, "spreading locations" means turning their software stack from the monolith of monoliths into a distributed system.
It's a damn fine idea, and likely to take them decades of effort across dozens of teams, with hilariously expensive bugs along the way.
Moreover, the tech industry is solidifying. Larger companies are becoming more entrenched, and it seems harder for new companies to start clean with a fresh stack. This is the most tenuous part of my argument -- maybe it's not true?
If the above is true, then I expect software quality to worsen over time. Large companies will not be rewriting huge parts of their stacks for any reason. The legacy systems will continue to balloon as they are inflated with new features, and morph as they are merged with the systems of acquired companies. Meanwhile, the original engineers deeply familiar with the systems will retire, and new maintenance programmers will have even less context on the big picture of operations.
Also, as the surface area of software increases, there will be more and more places for bugs and security problems to occur, even as those bugs affect more people in more critical ways.
I predict that, over time, we will come to regard software that routinely works as just as much of a pipe dream as software that is secure.
This is a fundamentally pessimistic perspective. I'd love to hear an opposing argument.
In an airline system, there's quite a bit of equipment at each location, and much of it is specialized. There are interfaces to baggage systems and bar code readers. There are interfaces to airport systems and incoming information from air traffic control. The aircraft themselves transmit information and need flight plan uploads. There's probably more machine to machine communication than user interfaces. They may be having troubles resynchronizing everything with the backup systems in the data center.
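To see why resynchronization is painful, here's the problem in miniature: a generic sequence-numbered catch-up replay (the journal layout and feed names are invented; this is not Delta's actual mechanism). Every update the backup missed has to be replayed, per feed and in order, before its picture of the world can be trusted:

    # Generic catch-up replay for a backup that fell behind the primary.
    # backup_state maps feed -> last applied sequence number;
    # journals maps feed -> sorted list of (seq, update) kept by the primary.

    def apply_update(feed, update):
        """Stand-in for feed-specific logic (baggage, ATC, flight plans, ...)."""
        pass

    def resync(backup_state, journals):
        for feed, journal in journals.items():
            last_seen = backup_state.get(feed, 0)
            missed = [(seq, upd) for seq, upd in journal if seq > last_seen]
            if missed and missed[0][0] != last_seen + 1:
                # Journal trimmed past our position: replay is impossible,
                # and only a full snapshot restore will do.
                raise RuntimeError(f"{feed}: gap in journal, need full snapshot")
            for seq, update in missed:
                apply_update(feed, update)
                backup_state[feed] = seq  # checkpoint durably in real life

Multiply that by dozens of specialized interfaces, some of which can't replay at all, and a short outage turns into days of cleanup.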
In this case, it sounds very much like the backup servers were running the wrong version of some user interface programmed in an arcane IBM programming language from the 1970s that is built entirely around the record-oriented database typical of a machine like the AS/400.
More specifically "lipstick on a pig" =D
Most of them outsource their tech to the tech-outsourcing behemoths, which are in turn run on man-month billing _innovations_.
I see no hope of things improving drastically. Small incremental updates over decades till we invent teleportation... then they will die.
>In other words, given a choice between more backup systems and more security, airlines are picking security.
Information security is about ensuring confidentiality, integrity and availability all at the same time.