

Southwest Airlines grounds flights due to computer outage - gridscomputing
http://www.ktvu.com/news/news/national/southwest-airlines-grounds-flights-due-computer-ou/nYR3n/

======
tomschlick
I would really hate to be the developer/sysadmin who potentially struck the
fatal blow to their system(s). It's one thing in a business if people can't get
email/work with some CRM type tool. It's another thing entirely when thousands
of people are stranded at airports.

Good luck to those poor souls working triage right now. Even if it wasn't
your fault, they are gonna have someone's head.

~~~
D4M14N
If it's any consolation, often in these sorts of large scale outages at least
part of the cause is management pushing back on recommendations by the IT
staff.

------
hkmurakami
Man I always underestimate the reach and extent of software, especially in
enterprise. It really _is_ everywhere.

As an aside, does anyone know if things like ticketing systems are
developed in house or are outsourced to places like, say, IBM?

~~~
penland
I worked for Southwest as recently as last year.

In the industry, ALL reservation and ticketing systems are written by
specialists. The major players are Sabre ( think IBM ) / Navitaire / ITA (
which was bought by Google not too long ago ). Southwest is a great company,
and they pay well, but their engineering staff is made up of the neckbeard
brigade . . . lots of guys in their 50s that have either retired before
retiring, or finally decided to learn this newfangled "java" language and
move on from C and COBOL. There is simply not the engineering skill, nor the
organizational will, to develop anything that complex in house. They tried
once about 5 years ago and after 4 months it exploded in a ball of flames
because they couldn't solve the throughput issue.

This outage was almost certainly a Sabre issue. The reservation system itself
runs on old school IBM big iron written in C. The system is so damn old that
when you go to southwest.com, all the information retrieved is done via screen
scraping. Not XML, not JSON, not even CORBA. The key point is when the
spokesman describes the "weight" of the airplane, something that is calculated
by the system according to how many people checked in, how much luggage, etc.
Without the system up and running, the flight cannot be "closed" and therefore
tracked properly.
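
For anyone who hasn't run into screen scraping against a terminal app: the rough idea is that the front end reads a fixed-layout text screen and cuts fields out by row and column. A minimal sketch, assuming a made-up screen and made-up field positions (`SCREEN`, `FIELDS` and `scrape` are hypothetical names, not anything from Sabre or Southwest):

```python
# Hypothetical terminal screen; the text and layout are invented for illustration.
SCREEN = [
    "FLT 1234  DAL-HOU  STATUS OPEN".ljust(80),
    "PAX CHECKED IN: 0137   BAGS: 0092".ljust(80),
]

# (row, start column, length) for each field. All positions are assumptions.
FIELDS = {
    "flight": (0, 4, 4),
    "route": (0, 10, 7),
    "status": (0, 26, 4),
    "passengers": (1, 16, 4),
    "bags": (1, 29, 4),
}


def scrape(screen, fields):
    """Cut each named field out of the screen text by its fixed position."""
    result = {}
    for name, (row, col, length) in fields.items():
        result[name] = screen[row][col:col + length].strip()
    return result


record = scrape(SCREEN, FIELDS)
print(record)
```

The fragile part is that any change to the screen layout silently shifts every field, which is why this approach tends to be a last resort.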

Southwest hosts all of their own in house applications ( except for a few
things on GCE ), but the Sabre system is run out of some nuclear proof bomb
shelter in Tulsa. It goes down 1 ~ 2 times a year, but typically it's only out for
10-15 minutes. For planes to be called back to the gate, the outage would have
had to have been over 60 minutes, as that's how long they buffer in a separate
system in case it does go down.

Finally, Southwest pays Sabre by . . . wait for it, requests executed per
second.

~~~
coolhandluke
> The reservation system itself runs on old school IBM big iron written in C.
> The system is so damn old that when you go to southwest.com, all the
> information retrieved is done via screen scraping.

Man, say what you will about old mainframes and critical systems and C and
COBOL versus racks of commodity x86 boxes and Java and Ruby and "scaling out"
and load balancers and reverse proxies and ..., but those old business apps
running on the AS/400s and mainframes?

They just work(TM).

I agree that these old systems need to come into the 21st century but some of
the most stable, reliable systems I've seen in my career are 20- or 30-year-old
applications running on that old iron.

(mmm, 5150, 3270, CICS, JCL, RJE, IBM printers as big as refrigerators...
_sigh_... sometimes I miss those days.)

~~~
x0x0
Can we all agree that your point (while well taken)... is really really funny
in light of the subject of this article? I'd say this was a pretty spectacular
instance of just _not_ working.

~~~
ghshephard
Well - a lot depends on what part failed - was it the backend big iron, or the
front end scraping system?

------
furyg3
"Some flights were on the taxiway and diverted back to the terminal after the
problem was detected"

 _Why?_ You've checked these guys in, the baggage is loaded, the pilots
(presumably) know where they are flying to... why would you pull them back in?

~~~
a3n
The business is probably so computer-dependent that the business is
essentially the software systems executing. Planes flying, at this point, are
mere physical side effects of software systems executing. If your systems
aren't executing, your business isn't running.

Add in probable regulatory requirements that are also implemented by software,
and that probably makes it illegal to fly.

~~~
neurotech1
It may have been related to the Pilot Operating Handbook procedures that the
airline's Air Operator's Certificate requires.

They still have the means to manually calculate the weight/balance etc. Some
(most?) airlines require the pilots to do a manual cross-check worksheet of
the fuel, weight/balance and a few other safety of flight items, before they
take-off. This helps avoid mistakes like over-rotation on takeoff etc.
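
The arithmetic behind a weight/balance cross-check is simple in principle: total weight is the sum of the item weights, and the center of gravity is total moment (weight times arm) divided by total weight. A rough sketch with entirely made-up numbers and limits (`center_of_gravity`, `MAX_TAKEOFF_WEIGHT` and `CG_LIMITS` are hypothetical, not from any real aircraft or airline worksheet):

```python
def center_of_gravity(items):
    """Return (total_weight, cg_arm) for a list of (weight, arm) pairs."""
    total_weight = sum(weight for weight, _ in items)
    total_moment = sum(weight * arm for weight, arm in items)
    return total_weight, total_moment / total_weight


# (weight in lb, arm in inches from the datum). All values are invented.
load = [
    (90000.0, 600.0),  # empty aircraft
    (20000.0, 580.0),  # passengers
    (5000.0, 700.0),   # baggage
    (15000.0, 620.0),  # fuel
]

# Assumed limits for illustration only.
MAX_TAKEOFF_WEIGHT = 150000.0
CG_LIMITS = (590.0, 625.0)

total, cg = center_of_gravity(load)
within_limits = (
    total <= MAX_TAKEOFF_WEIGHT and CG_LIMITS[0] <= cg <= CG_LIMITS[1]
)
print(total, round(cg, 1), within_limits)
```

Doing it by hand and comparing against the computed figure is the point of the cross-check: a disagreement between the two flags an error in one of them.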

One particularly amusing case was in a 767, where the Captain handed a copy of
the worksheet to a young girl who loved math problems, and was travelling in
first class as a VIP. The young girl found a mathematical error in the
worksheet. It turned out the flight computer had the fuel CG incorrect by 2-3%
in certain cases, and both were incorrect.

A quick call to Boeing, and 20 minutes later, fuel transferred and the flight
proceeded.

I suspect the Southwest aircraft returning to the gate was a procedural
abort rather than a safety-of-flight issue.

