
Thousands of Delta passengers delayed by computer outage - 0xbadf00d
http://www.bbc.co.uk/news/technology-37009311
======
mrweasel
The most annoying part of bugs like this is that we rarely get a good post-
mortem. Too frequently the media just reports it as "a computer glitch". You
would think that with computers being so important to modern day life, we
should get better answers.

I'm not suggesting that the bug should be explained in detail on the evening
news, but at least give some indication of the nature of the problem. They
could just refer to a website, if people want the technical details.

When a car has a problem, it's not reported as a "glitch". Imagine the
stupidity of a media report claiming "A glitch in the latest Ford models
caused several drivers to crash". You don't do that, you explain that it's a
fault in the steering, brakes, electric systems, or wherever the issue lies.
I think we should reasonably expect the same when errors in computer systems
hit the news.

~~~
planetjones
One of the main reasons that you don't get a good post mortem is that it's
often directly related to human error or a severe defect brought about by a
decision to cut costs. There have been several huge outages in UK banking over
the past few years and management have a convenient excuse in blaming "legacy
systems". While I respect private enterprise, for systems of systemic
importance I would like to see some legislation to implement a mandatory and
detailed post mortem like you propose.

~~~
mrweasel
>it's often directly related to human error or a severe defect brought about
by a decision to cut costs

I wonder if companies do cost-benefit analysis on those kinds of situations.
"We saved X amount of £ on cutting costs. Outages as a result of cost cutting
Y amount of £". You would think that at least stock holders would insist.
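A back-of-the-envelope version of that comparison might look like the sketch below. The function name `net_benefit` and every figure in it are made up for illustration; none of it reflects real numbers from any bank or airline.

```python
def net_benefit(annual_savings, outage_probability, outage_cost):
    """Expected yearly gain from a cost cut, minus expected outage losses.

    All three inputs are hypothetical estimates, in pounds.
    """
    expected_outage_loss = outage_probability * outage_cost
    return annual_savings - expected_outage_loss


# Hypothetical: save £2m a year, with a 5% yearly chance of a £100m outage.
result = net_benefit(2_000_000, 0.05, 100_000_000)
print(f"Net expected benefit: £{result:,.0f}")
```

With these invented inputs the expected outage loss outweighs the savings, which is the kind of result stock holders would presumably want to see.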

~~~
di4na
No because doing that would mean knowing what happened and spending time doing
a post mortem.

~~~
pc86
This may come as a shock but _most_ people want to do a good job, even if they
work for Citi or Delta or Comcast or wherever.

Okay, maybe not Comcast.

~~~
di4na
I never said the contrary. But tons of these systems are old, and the people
who know how they work are long gone.

Plus doing a proper post mortem and understanding how it works would take a
lot of time and money. Things they most of the time do not have budget for.

------
cstross
It's not software _or_ hardware this time: there's a crippling power outage at
ATL, which is Delta's main hub (and presumably where their mainframe herd is
based):

[http://fox61.com/2016/08/08/delta-airline-reports-outage-eve...](http://fox61.com/2016/08/08/delta-airline-reports-outage-everywhere-systems-down-and-flights-grounded/)

The outage began at 0230 and is presumably ongoing and beyond the capacity of
their backup generators to cope with.

~~~
bradleyjg
I find that pretty odd. A decade and a half ago now I rented part of a rack in
a not particularly high end co-lo facility. They had battery backup for an
hour and then generators that could be fired up in twenty minutes with 3 days
worth of diesel.

Delta airlines, with $40B of revenue last year, has less than 8 hours of
backup power for their mission critical computers?
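As a rough sketch of the arithmetic, using the co-lo figures above (one hour of battery, twenty minutes to start the generators, three days of diesel): the function name and the assumption that the generators can carry the full load are illustrative, not a description of any real facility.

```python
def uninterrupted_hours(battery_minutes, generator_start_minutes, fuel_days):
    """Hours of continuous power after the mains fail.

    Assumes (hypothetically) that the generators carry the full load once
    started and that the batteries are only needed to bridge the start-up.
    """
    if battery_minutes < generator_start_minutes:
        # Batteries run flat before the generators are online.
        return battery_minutes / 60
    # Batteries cover the start-up window, then the fuel sets the limit.
    return generator_start_minutes / 60 + fuel_days * 24


coverage = uninterrupted_hours(battery_minutes=60,
                               generator_start_minutes=20,
                               fuel_days=3)
print(f"{coverage:.1f} hours of backup power")
```

On those numbers the co-lo has a little over 72 hours of cover, far more than the 8 hours in question, provided every link in the chain actually works.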

~~~
mseebach
> They had battery backup for an hour and then generators that could be fired
> up in twenty minutes with 3 days worth of diesel.

How often did you test that?

> has less than 8 hours of backup power for their mission critical computers?

It's plausible that they thought they had that, but some part of it failed.

~~~
bradleyjg
I don't remember, but when we were taking the tour the guide did mention that
they tested the batteries and generators on some schedule. I bet there's some
industry standard that has a minimum testing frequency.

~~~
dbcurtis
In my experience (radio repeater and cell sites on mountain tops) an auto-
start timer runs the generator once per week for 30 minutes or so. Diesels
don't like to sit too long without being run. And these are 30-50 kW
generators, relatively small in comparison to a data center. I expect a weekly
schedule is common practice.

------
lordnacho
I'm sure a lot of people are asking how this is possible.

\- Legacy systems: the thing is probably old and has had a lot of patchwork
done to it by various people, many of whom are retired by now.

\- The thing is also large and distributed, which makes noticing potential
failures quite hard. There are loads of postmortems on the web about how some
minor issue like a line going down caused some cascade of unforeseen problems.
There's a lot of lessons being learned about how to solve this problem though,
so I suppose there's hope.

\- Always on: it's hard to change something once the business is relying on
it. Loads of staff need to be shown how it works, which makes it hard to
incrementally improve the thing because "how it works" would be something that
people would have to be updated on.

\- Never finished: large organisations will always have loads of feature
requests and bug fixes. If you start responding to some of them, you might
find yourself swamped. Either you work on the next item, or you spend valuable
time making sure fewer items are fixed. It's a balancing act.

~~~
ams6110
Also hardware. Likely older and expensive mainframes, which while generally
reliable do have a finite lifespan.

I'm guessing a critical hardware problem affecting some central part of their
flight dispatching system.

~~~
clueless123
Back in the 2000s Sabre purchased spare parts from the Soviets to keep their
ancient mainframes running. I can't imagine how hard it must be now.

~~~
chinathrow
Not that hard if you're willing to part with lots of cash for IBM.

------
coldcode
Power failure; clearly their backup power systems are inadequate. I believe some
of Delta's systems are also backed by SABRE which while ancient is generally
rock solid (it lives in a nuclear hardened bunker in Tulsa). This problem is
in Delta's own systems not SABRE.

~~~
useful
The main host of Delta is not Sabre; it is Deltamatic, a homebrew system that
is run by Delta. Most other airline systems like Sabre communicate via
teletype with one another.

~~~
coldcode
I think the reservation system is SABRE, and likely they contract for some of
SABRE's other services like crew scheduling, but other parts are pure Delta.

~~~
_acme
Delta uses Worldspan as its GDS, not Sabre.

------
tedmiston
A post in the ongoing outage thread on FlyerTalk mentions a fire in the
datacenter.

> According to the flight captain of JFK-SLC this morning, a routine scheduled
> switch to the backup generator this morning at 2:30am caused a fire that
> destroyed both the backup and the primary. Firefighters took a while to
> extinguish the fire. Power is now back up and 400 out of the 500 servers
> rebooted, still waiting for the last 100 to have the whole system fully
> functional.

[http://www.flyertalk.com/forum/27032000-post135.html](http://www.flyertalk.com/forum/27032000-post135.html)

------
martin1b
Why is it reported as a bug rather than a power outage?? There is a huge
difference.

~~~
tedmiston
It seems a bug in the infrastructure when they switched generators is what
caused the power outage.

------
tedmiston
Official post from Delta Operations - [http://news.delta.com/more-flights-resume-delays-cancels-con...](http://news.delta.com/more-flights-resume-delays-cancels-continue-after-power-outage)

------
timthelion
Part of me thinks that their systems were attacked, but being into security,
I've thought a lot about how to attack a large computer system like this. Since
server failures are so common, the system should be able to handle N
failures before anything bad happens. If this was an attack, I guess the
attacker must have found a flaw that allowed them to instantly compromise any
server. But it actually seems more likely that this was caused, merely, by
incompetence.

If the attack was possible, then it was just as likely, if not more, that
Delta managed to merely shoot itself in the foot.
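To make the "handle N failures" idea concrete, here is a minimal sketch that assumes a majority-quorum cluster. That design is purely an assumption for illustration; nothing in this thread says how Delta's systems are actually built, and both function names are hypothetical.

```python
def tolerated_failures(total_servers):
    """Number of servers that can fail while a strict majority survives."""
    return (total_servers - 1) // 2


def is_available(total_servers, failed_servers):
    """True if more than half of the servers are still alive."""
    alive = total_servers - failed_servers
    return alive > total_servers // 2


# A hypothetical 5-node cluster survives 2 failures but not 3.
for failed in range(4):
    print(failed, "failed ->", "up" if is_available(5, failed) else "down")
```

Under this model an attacker (or an accident) has to take out more than N servers at roughly the same time, which is why a shared dependency such as power is a more plausible culprit than independent server failures.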

~~~
spyrosg
I'm one of the lucky people who have to wait. The airport personnel tells us
it's the system that prints out weather information that's out. Not sure why
one would want to attack that.

~~~
phil21
Considering check-in and reservation status in general appears to be down I'm
guessing this outage is relatively widespread.

Assuming what you've heard is correct this seems to span multiple Delta
systems.

~~~
ethbro
If it is power issues and some bit of infrastructure deep in the back end took
a nosedive, I wouldn't be surprised if it affects multiple front end systems.

~~~
snuxoll
Considering it's an airline, it's likely either an IBM i or an IBM Z system
where they track _everything_ relating to their flights (schedules,
reservations, checked luggage, etc). A lot of companies (mine included) seem
to think these devices are high-availability giants and never test what
happens should they go down. I know if ours ever took a dump we would be in
some big trouble (though we could function for at least 24 hours).

------
nodesocket
Strangely, their stock price was not affected by the outage and this news at
all. Honestly, I don't get it.

[https://www.google.com/finance?q=NYSE%3ADAL&ei=ncSpV_m8MsG9i...](https://www.google.com/finance?q=NYSE%3ADAL&ei=ncSpV_m8MsG9iwKpnrSQCQ)

------
DickingAround
Tough day for them. Moving to 'cloud' hosting... it doesn't just make it
someone else's problem. It encourages good architecture.

------
yardie
I just don't understand how a large carrier such as Delta can get so much
wrong. What's the DR plan? Are they using ASNs? A company is product, people,
and computers. This is not the 20th century, flight plans aren't filed by
telephone calls for these guys.

I've tested our DR plans and we were operational after 4.5 hours, optimally.
And we were 3 people.

This outage reeks of PHBs doing lots of kicking the can and not a whole lot of
implementing.

~~~
xjlin0
Sounds like your DR plan or your network don't require power. Care to share?

~~~
yardie
Yes, 2 datacentres, Microsoft's DFS, cloud backups, and staff trained to know
what to do.

For the datacentre, I worked out a mutually beneficial arrangement with a
friend: I keep a few spare servers in his racks and he has a few in ours.
Just enough to run the essentials.

Testing was literally walking up to the rack and switching off the power. Then
documenting what happens next. If you've configured the ASN just right the
domain should roll right over to the backup servers. Offsite DR servers needed
a little kick to bring up to speed (I can't remember why this wasn't fully
automated).

Most of the time was waiting for backup files to restore and verify. Since we
were using someone else's network we were shaped down to a small percentage of
their bandwidth.

After that we had a debriefing to see what worked, what didn't work. The usual
IT stuff.

------
hiimnate
I was expecting a "computer bug" story on the front page of HN to actually
have some interesting content.

------
rbc
It sounds like some kind of single point of failure in the Delta architecture
was uncovered.

------
fflluuxx
Southwest had an outage two weeks ago. It's starting to feel fishy.

~~~
_acme
Would you mind elaborating on what is starting to feel fishy to you and how
whatever that is is starting to feel fishy?

------
intrasight
Time for Delta to move to AWS

~~~
rbc
AWS has outages too. There might be some perceived attraction to transferring
the operational risk to AWS, but Amazon's agreements limit their liability.

------
smn1234
should be on AWS... all the cool kids are doing it

~~~
peter303
Merged airlines with multiple legacy systems from the 1980s. The legacy systems
are well battle tested over the decades, but brittle to add new capabilities
like mobile.

Only about a third of from-scratch large corporate and government software
systems succeed. Having to hire the cheapest contractor doesn't always work.

~~~
smn1234
finance industry as well... They're sunsetting legacy systems from many years
of M&A, refactoring apps, adding API tiers, using novel tech stacks, building
in true-HA and cross-region active-active.

------
baus
Interestingly, 10 flights were canceled on Allegiant today. Including one I'm
booked on.

But apparently it is a normal mode of operations for Allegiant. Absolutely
horrid airline. [https://www.allegiantair.com/travel-alerts](https://www.allegiantair.com/travel-alerts)

~~~
tedmiston
Allegiant is a budget airline that's okay when it works and a pile of junk when
it doesn't. They fly routes a few days per week out of my city. In a recent
cancellation (by choice), they told customers that they could rebook them on
the same day next week. Somehow they managed to do this without offering
refunds or alternative flights back that day, the next, etc.

I've only flown with them once, but it was a popular route to Vegas. It was
fine for the price, especially since they don't surge as much as normal
domestic airlines when you book with less than 2 weeks notice.

