
British Airways: All flights cancelled amid IT crash - CmdSft
http://www.bbc.com/news/uk-40069865
======
2manyredirects
One of the unions has been quick to attribute the issue to the outsourcing to
India of some of the IT responsibility, which the right wing press here has
been all too eager to publish, but BA have rebuffed this and said at this
stage they believe the root cause was a power supply issue - sure, that could
be attributed somewhere along the lines to an 'alpha male business asshole'
(as I read in one of the comments here), but it's probably best to wait and
see what the post-mortem really is rather than seek to blame someone,
somewhere, be it a businessman, an Indian dev team or anything else.

I am reminded of a post a while back regarding AWS' issues affecting multiple
data centres (I forget the specifics), and how their post-mortem didn't
apportion blame to anyone (which it really easily could have), but rather to
their own checks and balances, which allowed the issue to arise in the first
place. I do hope that when the dust settles we see a measured response rather
than a witch hunt.

~~~
StavrosK
I've found that not only is it good to not assign blame in postmortems, but
it's also accurate: The culprit usually _is_ the checks and balances, as
mistakes _will_ happen, and the goal should be to have failsafes and
detection.

I'm reminded of airplane accidents: Whenever you hear of an airplane accident,
it's always some amazingly crazy series of things going exactly wrong to get
the plane to crash. We have a tendency to think "wow, what bad luck", but a
better way to think about it is that airplanes are _so safe_ that an accident
_can't occur_ unless a whole series of things go very specifically wrong.

A company's goal should be to increase the number of necessary things that
need to all go wrong before there is downtime.
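
As a rough illustration of why adding failsafes helps, here is a minimal sketch that multiplies the failure probabilities of several safeguards. It assumes the safeguards fail independently, which is the big caveat: a shared power supply is exactly the kind of common cause that breaks that assumption. The function name and the probabilities are hypothetical, chosen only for illustration.

```python
from math import prod


def outage_probability(layer_failure_probs):
    """Chance that every safeguard fails at once, assuming independence."""
    return prod(layer_failure_probs)


# Hypothetical per-incident failure probabilities, one per safeguard.
single_safeguard = outage_probability([0.01])
three_safeguards = outage_probability([0.01, 0.01, 0.01])

print(f"one safeguard:    {single_safeguard:.6f}")
print(f"three safeguards: {three_safeguards:.6f}")
```

Each extra independent layer multiplies the odds down, which is why the number of things that must all go wrong matters more than making any single one perfect.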

~~~
devonkim
One other important point is that the very term "root cause" is extremely
harmful in that it presumes a primary failure and already seeds the idea of
one bad actor and, by proxy, blame. Systems today are too complex to blame
upon one or two things - we operate in a very complicated, "complected" world
both in our software and in many organizations.

While there are always technical causes for larger technical failures, I've
seen far too many times RCA post-mortems performed that result in witch hunts
instead of a solemn contemplation of how things could be better done by
everyone. Such an RCA may ignore that a normally careful engineer was
overworked by managers, never is lack of relevant monitoring and testing due
to budget cuts cited, and you'll certainly never see "teams X and Y
collaborated too much" as a reason for failure in these places. Because in a
typical workplace, the company's values and culture are never related to a
failure. You can't objectively measure how bad or how good a culture is
either. Why make it part of post mortems when you don't think it's a failure?

~~~
hinkley
I don't recall ever having a manager use the term root cause analysis in the
way you are implying. Usually we are looking for the cheapest or most
effective process change that will prevent that class of problem happening
again.

------
mancerayder
In the tech-related (but now fully mainstream, unlike a few years ago) news
these days, there are a number of recurring themes.

1\. Massive security breaches affecting major corporations, governments, and
so on.

2\. Massive IT outages, affecting corporations, governments and so forth.

It doesn't take a lot of incidents to mark them as 'massive' since things are
strongly centralized in this world.

Meanwhile, in the last 15 years of my IT career, never have I seen such a
strong push towards offshoring. One would think that with such a vulnerable IT
landscape mixed with an unprecedented dependence on IT infrastructure,
CTOs would want to spend more and not LESS on operational costs.

I'm slightly baffled.

~~~
fauigerzigerk
Is there any evidence that IT outsourcing has made matters worse? Or is it
only worse if the provider is located offshore?

Arguably, delegating IT operations to a proper IT company could improve
security and reliability.

We use a low cost offshore IT provider and we haven't had any security or
reliability issues so far
([https://gsuite.google.com/](https://gsuite.google.com/))

~~~
raesene9
The key is incentives. An IT outsourcer's goal is generally to fulfill their
contractual obligations (to the letter not the spirit) for as little money as
possible.

So if something like security or reliability isn't clearly stated in contracts
(and it's extremely difficult to do that well) it is disregarded.

An outsourcing company is not going to spend money it doesn't have to, simple
as that.

SaaS like G Suite is an entirely different prospect. You're buying a service
from Google, not outsourcing your IT to them. If Google's service was insecure
or unreliable, no one would buy it.

~~~
fauigerzigerk
I don't see why an IT outsourcing company would not have the greatest
incentive to take security and reliability extremely seriously.

Is there anything worse for an IT service provider than being blamed for a
massive IT outage at a global corporation? This is headline news.

And I don't see any difficult contractual issues at all. On the contrary. A
massive outage, by definition, means that the contractual obligation is not
being met.

~~~
raesene9
Security costs money, reliability costs money. If it's not in your contract,
you don't pay for it.

If there's any outage that is because the customer didn't ask you to do
something, that's their fault, not yours.

For example would you as an IT outsourcer pay for a redundant datacentre if
your contract didn't call for it?

Would you patch all your systems immediately even if it caused availability
issues if it wasn't explicitly outlined in the contract?

Would you explain to your shareholders that your profits were lower this year
because you undertook activities not specified in your contracts because they
were good for the security and reliability of the services you managed...?

When outsourcing contracts are bid for, the common experience is that the
lowest cost wins. That inevitably leads to items that aren't strictly required
being excluded.

~~~
fauigerzigerk
_> For example would you as an IT outsourcer pay for a redundant datacentre if
your contract didn't call for it?_

I would assume that the contract calls for particular service levels and that
downing the entire fleet of a large carrier for days is in breach of that
contract.

If the contract says "provide service X with 99.999% availability" the service
provider cannot come back and say, oh but you forgot to specify that we should
run a redundant data center to guarantee that availability.
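
To put a number on what such a clause means, here is a small sketch that converts an availability percentage into the downtime it permits per year. The function name is made up for illustration, and the year is taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% availability allows {minutes:.1f} minutes of downtime per year")
```

Five nines leaves a little over five minutes a year, so an outage lasting days would breach that target many times over.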

~~~
Clubber
>If the contract says "provide service X with 99.999% availability"

If you read those contracts, you have to read what the consequences are for
breaking that uptime guarantee. Usually it's something silly like 10% off your
next bill.

------
davidf18
By using IT outsourcing in India (Tata?) and Poland, BA and its parent are
revealing some important information: They are cutting corners to the point of
risking airline operations. The question is, where else are they unnecessarily
cutting corners? Are they cutting unnecessary corners on airplane maintenance?

Other quality airlines have outsourced IT or part of their IT operations, but
they are careful about choosing their IT vendors.

For example, Israel's national carrier, El Al, which also has extremely good
security against hijacking, outsources at least its ticketing as a cost
savings move, but this it does to Lufthansa the German national carrier, which
uses the Amadeus Ticketing platform. El Al is saving money over running their
own operations, but still using a reliable vendor.

Qantas, the airline of Australia, outsources its IT operations to IBM. They
did not choose the cheapest alternative, but a reliable one.

Contrasting with Israeli and Australian flag carriers El Al and Qantas, BA
the British flag carrier made some extremely unwise choices trying to choose a
"cheapest" solution, instead of a money saving cheaper solution.

Compared with El Al and Qantas, BA management has shown us that quality of
operations is not their top priority. The question is, where else in their
operations is this a problem?

BA management is revealing to customer and shareholder alike that customer
service and shareholder value are not their highest priorities. They are
signaling to shareholders that it is time for a change before they further
harm the BA brand and before some serious accident happens.

~~~
dchichkov
Yes, it's a cyclic trend in IT:

    
    
      Step 0. Well functioning and balanced company.
      Step 1. Why these engineers are so expensive? We can hire 'ten for one' in 'country A'.
      Step 2. Cut local IT budget twice, put goals to have *zero* local IT budget in 2 years. Outsource everything to 'country A'.
      Step 3. People are fired, everything is outsourced, operational knowledge accumulated over the last few years is lost. 
      Step 4. Why everything is broken and we are losing millions? The service is of a quality like we live in 'country A'!
      Step 5. We need to hire high quality local team that can take *ownership* over product. We are willing to pay a fortune.
      Step 6. Well functioning and balanced company.
    

Not sure, if airplane maintenance is being treated the same way.

~~~
Zigurd
You can bet it is: [http://www.vanityfair.com/news/2015/11/airplane-
maintenance-...](http://www.vanityfair.com/news/2015/11/airplane-maintenance-
disturbing-truth)

FTA:

>The airplanes that U.S. carriers send to Aeroman undergo what’s known in the
industry as “heavy maintenance,” which often involves a complete teardown of
the aircraft. Every plate and panel on the wings, tail, flaps, and rudder are
unscrewed, and all the parts within—cables, brackets, bearings, and bolts—are
removed for inspection. The landing gear is disassembled and checked for
cracks, hydraulic leaks, and corrosion. The engines are removed and inspected
for wear. Inside, the passenger seats, tray tables, overhead bins, carpeting,
and side panels are removed until the cabin has been stripped down to bare
metal. Then everything is put back exactly where it was, at least in theory.

>The work is labor-intensive and complicated, and the technical manuals are
written in English, the language of international aviation. According to
regulations, in order to receive F.A.A. certification as a mechanic, a worker
needs to be able to “read, speak, write, and comprehend spoken English.” Most
of the mechanics in El Salvador and some other developing countries who take
apart the big jets and then put them back together are unable to meet this
standard. At Aeroman’s El Salvador facility, only one mechanic out of eight is
F.A.A.-certified. At a major overhaul base used by United Airlines in China,
the ratio is one F.A.A.-certified mechanic for every 31 non-certified
mechanics. In contrast, back when U.S. airlines performed heavy maintenance at
their own, domestic facilities, F.A.A.-certified mechanics far outnumbered
everyone else. At American Airlines’ mammoth heavy-maintenance facility in
Tulsa, certified mechanics outnumber the uncertified four to one.

~~~
raverbashing
To be honest this article is not great

Maintenance facilities are usually certified by the local aviation authority

Aviation is also "regulated" by two extra entities: plane lessors and
insurance companies. Both won't be happy if the maintenance done at these
facilities is not correct

------
markonen
The main KPI for BA's IT function is the percentage of jobs they have moved
"nearshore" (to Krakow).

Not that I think this is the Krakow workers' fault, mind you; rather this (and
the string of similar IT incidents recently at BA) seems to fit the pattern of
upper management focusing on driving down costs to the exclusion of everything
else.

~~~
idlewords
Krakow is a big tech center, is in the EU, and it's not obvious why moving
jobs there would have anything to do with a drop in quality.

~~~
markonen
It's not about Krakow per se. But when a company works towards specific
targets like "90/10 supplier offshore ratio", rather than metrics based on
quality or efficiency, I don't think quality comes first.

~~~
tajen
On the one hand I've never seen a successful project with non-colocated teams,
whether 300km away or 10.000km away. On the other hand, I'm surprised an
airline is so localized in the UK.

~~~
methyl
Basecamp

------
a-dub
Looks like their new (as of 2 yrs) CEO comes from the low-cost airline world:
[https://en.wikipedia.org/wiki/%C3%81lex_Cruz_(businessman)](https://en.wikipedia.org/wiki/%C3%81lex_Cruz_\(businessman\))
so that may be a hint...

Although, I've seen this sort of thing before... Usually it starts with a
middle management hiring of some alpha-male business asshole who is driven to
advance his career and thinks that because he can use a spreadsheet that he's
qualified to run IT. He'll then go on to sell upper management on some kind of
ridiculous story straight out of some bullshit CIO magazine about how
consolidating all the existing best-in-class systems into one system will cut
costs and open opportunities for building and mining customer data to increase
revenue. He'll get the greenlight and a shit-ton of capital, and then he has
to make the decision about whether to build or buy. That question is
irrelevant, as our intrepid hero has no idea what the fuck he is doing and
will fuck things up regardless of which path he takes. Once the "new system"
is fully half baked, he then shoves it out all over the company in some
ridiculous balls to the wall no going back roll out plan. Subsequently there
will be huge problems, massive lines, pissed off customers, pissed off
employees... but this is where our intrepid hero really shines. His mastery of
the art of bullshit successfully deflects all blame from himself and his
incompetence onto the users/operators (eg, people who are responsible for
revenue generating businesses) of his monstrosity. Somehow it is forgotten
that outages never happened before and all the revenue loss and customer ill-
will is blamed on the operators for not having well-tuned disaster recovery
plans in place for manual operation.

Of course the only disasters they ever dealt with were a result of his
incompetence, and in a stunning feat of failing upwards, he's destined to helm
the company (and then likely others) within a few short years.

------
tim333
Quite a lot of mixed gossip in the Daily Mail article

>Yesterday's issues are the fourth BA failure in the past month, with problems
on June 19, July 7 and July 13

>Union leaders say hundreds of BA staff complained about 'FLY' system and most
workers say it's not fit for purpose

>A survey by GMB of 700 staff in June found that 89 per cent said training was
poor, 94 cent suffered delays or system failures and 76 per cent said their
health had suffered because of stress or anger aimed at them by frustrated
passengers.

etc [http://www.dailymail.co.uk/news/article-3695151/Philip-
Schof...](http://www.dailymail.co.uk/news/article-3695151/Philip-Schofield-
melts-Heathrow-check-computer-failure-hits-BA-passengers-starting-summer-
holidays.html)

~~~
kbob
That article is from July, 2016, which is why the dates "in the past month"
don't compute.

------
GordonS
One of the big take-aways from this is actually about how _not_ to handle
situations like this.

Problems happen - even huge ones - and I think most customers understand that.
What they don't understand is being given little or no information about what
they should do, and also being given vague, contradicting and even false
information.

The BA Twitter feed seems to be the main source of dubious information. They
are telling people to check the website for flight status info - but it only
works intermittently, and different parts of the site say different things;
for example, the flight status tool says my flight is cancelled, but the
booking management tool says it's all fine.

On the Twitter feed they are telling people that they are contacting people
and rebooking them _automatically_ - but it's apparent this is happening
for very few people.

They are telling people to rebook on the _website_ - which only works
intermittently, and will not allow rebooking even when it is working.

They are telling people to _call_ them to rebook - but their call centres
seemed to be working on normal working hours (rather than getting all hands on
deck), so were not in operation between 20:00 and 09:00 (or whatever; it
varies by country). During working hours, when calling any call centre
anywhere in the world, you just get a recorded message and are then
disconnected. Some people say they did get into the call queue, and have been
waiting on hold for 7+ hours!

They are sometimes telling people they _shouldn't_ go to the airport to
rebook, and sometimes telling them they _should_ go to the airport to rebook.

Yesterday they were telling people they could book alternative travel with a
different airline and then claim it back from them... and today they are
saying they won't pay if you book alternative travel - this could cost
passengers dearly.

The CEO, Alex Cruz, also made a laughing stock of himself yesterday when he
randomly donned a yellow high-visibility vest to do a recorded message in an
office.

Honestly, the whole thing has been a lesson on how _not_ to treat your
customers when things go wrong.

------
coldcode
Airlines these days are run by a complex set of systems most of which have to
work in order to have the airline function. I remember a few years ago SABRE
(one of the 3 major GDS companies in the world) had a 4 hour outage (I worked
for a division). Half the world's airlines stopped functioning. Generally
these things seldom go down but when they do, hell breaks loose. Often these
systems not only do reservations, but crew scheduling, weight balancing,
check-in, manifest creation and a whole host of other small but important
things. Airlines also integrate some of their own pieces into the contracted
ones, and some just contract almost everything. It usually comes down to
money. Even Southwest Airlines which doesn't share its res data with anyone
uses a GDS backend for a lot of things (in this case SABRE).

~~~
tyingq
Southwest was never on Sabre, per se. They had their own, separate reservation
system, operated by Sabre the company, but not the main "Sabre Res System" /
PSS. It was called SAAS, and sometimes "Cowboy".

They moved off of that recently though, and are now on Amadeus/Altea.

~~~
coldcode
Was true when I worked there, of course changed since then.

------
kidjoedango
I'm curious if anyone can give insight on how the passenger backlog is
resolved in these situations. How was it done before smart digital systems
(probably circa 1980's?) and how is it handled today given all the intertwined
applications. I can imagine that it must be fascinating and equally exhausting
on a grand scale.

~~~
tyingq
>How was it done before smart digital systems

Manually. The TPF reservation systems have a concept of "Queues". They would
place travel records that needed to be reaccommodated onto a queue. Then,
reservations agents would "work the queues" from their green screens, make
phone calls, etc.

>how is it handled today given all the intertwined applications

Depends on the airline, and the system, but the general answer is "partially
automated". The processes for less widespread issues are more automated. Think
of a major storm in the northeast. Global outages are less automated because
you're dealing with multiple issues, not just passengers.

------
tyingq
Shopping, flight status, etc, on their website is working, so their central
reservations system isn't down.

[http://www.bbc.com/news/live/world-40069977](http://www.bbc.com/news/live/world-40069977)
says _" A BA captain has said the failure affects the passenger and baggage
manifests"_

So it's a legality/operational thing. Passengers can get boarding passes,
etc, but the plane isn't allowed to take off without proper manifests.

~~~
CiaranMcNulty
According to reports on flying forums check-in is being processed manually
with huge queues as a result, flight status screens aren't working, etc.

Until an hour or so ago, login to the website was down for me.

So it seems it's a very widespread outage that they're in the process of
recovering from

~~~
tyingq
Likely they turned both of those off on purpose. If you can't depart due to
lack of a manifest, you turn off flight status and check-in.

I was able to do a flight status a little while ago though.

------
iaw
I wish there was a listing of company IT quality somewhere similar to the list
of 2Fac financial institutions.

I'm leaving my citi card for a chase one due to poor infrastructure but I
could have avoided a lot of wasted time if I hadn't opened it in the first
place.

~~~
spydum
Problem is it's highly variable from engagement to engagement.

A lot of the poor quality and performance could be bad process/management on
the client's side (though an outsourcing company will profit hugely from it).

------
crivabene
I feel close to their IT staff having to deal with this on a Saturday
afternoon.

Partially OT: anyone wanting to share any on-call horror stories? :)

~~~
tyingq
Airlines are especially stressful, as the problem compounds with time. Once
you hit 45 minutes or so of downtime, you start invalidating downstream flight
connections. Two hours in, and you've issues with crews not being legal to
fly. Four hours in, and there's not enough capacity the following day to fix
the missed flights from today, etc.
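
The escalation described above can be sketched as a simple lookup. The thresholds are only the rough figures from this comment, not industry constants, and the names are hypothetical.

```python
# (threshold in minutes, knock-on effect), using the rough figures quoted above.
CASCADE_STAGES = [
    (45, "downstream flight connections start being invalidated"),
    (120, "crews are no longer legal to fly"),
    (240, "not enough capacity the following day to fix today's missed flights"),
]


def impacts_after(downtime_minutes):
    """Return every knock-on effect already triggered by this much downtime."""
    return [
        effect
        for threshold, effect in CASCADE_STAGES
        if downtime_minutes >= threshold
    ]


for minutes in (30, 60, 150, 300):
    print(minutes, "min:", impacts_after(minutes) or "no cascade yet")
```

The point of the sketch is that the effects accumulate rather than replace each other, which is why recovery time grows faster than the outage itself.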

~~~
mtkd
are there commonly algos or decision support tools to help unravel that in an
optimised way?

~~~
tyingq
Sort of. The software is called "automated re-accommodation". But you're
solving for several intertwined problems. Which aircraft models / tail numbers
fly which flights...they have different seating capacities, nautical range,
etc. Which crews are assigned to which aircraft. They aren't all qualified to
fly every model. And, they aren't all in the right city, so you have to
"deadhead" them there. And, finally, which passengers go on which flights.

They do have solver/optimizer algorithms, but you can imagine it's not a
button press thing. There's a lot of human process, trial and error, etc. Oh,
and federal laws about crew hours / legality plus union work rules. You can't
just assign crews wherever you want for example...you have to consider
seniority, their "home base" where they live, etc.
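
To make the intertwined constraints concrete, here is a toy greedy sketch, not how any real re-accommodation product works. It only models two of the constraints mentioned above (crew qualification per aircraft model, with deadheading when the crew is in the wrong city, and seat capacity) and ignores duty-hour laws, union rules and seniority. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Crew:
    name: str
    city: str
    qualified_models: frozenset


@dataclass
class Flight:
    number: str
    origin: str
    model: str
    seats: int
    crew: Optional[str] = None
    deadhead_needed: bool = False
    passengers: List[str] = field(default_factory=list)


def assign_crews(flights, crews):
    """Give each flight a qualified crew, preferring one already at the origin."""
    available = list(crews)
    for flight in flights:
        qualified = [c for c in available if flight.model in c.qualified_models]
        if not qualified:
            continue  # flight stays uncrewed; a human has to step in
        # False sorts before True, so crews already in the right city come first.
        qualified.sort(key=lambda c: c.city != flight.origin)
        chosen = qualified[0]
        flight.crew = chosen.name
        flight.deadhead_needed = chosen.city != flight.origin
        available.remove(chosen)


def assign_passengers(flights, stranded):
    """Fill crewed flights in order; return passengers who could not be placed."""
    unplaced = list(stranded)
    for flight in flights:
        if flight.crew is None:
            continue  # no crew means the flight cannot take anyone
        free = max(0, flight.seats - len(flight.passengers))
        flight.passengers.extend(unplaced[:free])
        unplaced = unplaced[free:]
    return unplaced
```

Even this toy version shows the coupling: a flight with free seats is useless to stranded passengers until the crew step succeeds, and a greedy crew choice for one flight can starve a later one. That is presumably part of why the real thing is not a button press.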

~~~
mtkd
I was in the middle of some major rail cancellations a couple of months back
over 2 days - on the 3rd day many of the assets were out of place around the
country - they were clearly trying to solve it manually and the impact
escalated over the day even though most of the original problems had cleared

it's got to be a great problem to work on - and must be pretty rewarding to
watch when you get it right

~~~
gonzo
Remember on 9/11/2001 when the US FAA declared a nation-wide ground stop, and
rerouted all inbound international flights to other countries (e.g. Canada
[https://en.wikipedia.org/wiki/Operation_Yellow_Ribbon](https://en.wikipedia.org/wiki/Operation_Yellow_Ribbon))?

~~~
CamTin
Do you have some airline/IT ops stories about that day and the aftermath? If
so, I (and I'm sure others) would be delighted to hear about them.

------
pbhjpbhj
>BA chief executive Alex Cruz said: "We believe the root cause was a power
supply issue." //

That shut down BA operations World wide? Does that seem likely? Possible? They
don't have power fail-over and operational centres in different countries?

I've been re-shaping my aluminium foil hat and wondering if there wasn't a
specific terrorist threat that's been covered up; but then where I am there
have been 2 cities suffer bomb threats (with attendance of bomb disposal and
armed police) that seem to have been buried in the news completely. Also I've
heard elsewhere that staff on the ground reported the incident as due to
"hackers" almost immediately - the speculation being 'before they could
possibly have known that' - which suggests some sort of disinformation
process.

/wild-speculation

~~~
mkempe
Same thought here. Ten years ago the UK foiled a terrorist plot to bomb 6-7
airplanes in flight from the UK to the US. That's the origin of the ban on
bringing large amounts of liquid on board. [1] The recent bomb in Manchester
was apparently made using peroxide, too.

Assuming they identified a real, immediate, and massive threat, I can see why
they would prefer to ground all planes until they've sorted things out.

[1]
[https://en.wikipedia.org/wiki/2006_transatlantic_aircraft_pl...](https://en.wikipedia.org/wiki/2006_transatlantic_aircraft_plot)

~~~
isostatic
Half the planes taking off at Heathrow are not BA. If they wanted to ground
the planes, shutting down ATC would be the better way.

~~~
mkempe
You're assuming the hypothetical threat is not specific to BA.

------
tomschlick
Airlines should band together and form working group to redevelop the old
system in modern open source tech.

Yes it would probably take 5+ years to develop and roll out, but they can't
keep maintaining these 50 year old mainframes that cost them tens of millions
a year in downtime.

~~~
benmarks
"Redevelop the old system in modern open source tech" is a tough sell when so
many of these airlines are differentiating on technology. Myself and many
other frequent fliers will not engage with companies with poor functionality
and bad UX. We move or stay loyal when airline tech allows us to do & see most
of the things we need.

~~~
mbaha
Their tech is not open enough to even allow for proper innovation and
differentiation. That's the main issue with legacy IT.

When you live in a world w/ Docker, REST (and every other open tech there is),
you can build systems which are way more innovative.

~~~
gaius
_When you live in a world w/ Docker, REST (and every other open tech there
is), you can build systems which are way more innovative_

That's quite funny because Docker and REST are just half-arsed reinventions of
mainframe features from the 1970s.

~~~
mbaha
I completely get where you're coming from, but you also have to acknowledge
for example that Docker is open source.

That's not a trivial add-on to IBM mainframes, it provides every developer (w/
minimal resources e.g a laptop or free tier cloud service) the ability to run
production-like environments.

Sometimes, we take that for granted, but having worked for an airline IT, I
noticed the bottleneck didn't come from hard algorithmic issues (most advanced
route features were very basic to implement), but from the huge leap between
production and dev environments: this was not Ubuntu on a server VS a MacOS on
a laptop (manageable...), this was Ubuntu VMs (so should be close to prod?) vs
cryptic data center clusters that had impossible-to-replicate features.

As for REST, airline IT use an outdated messaging mechanism. It implements
versioning and grammar, so should be clean and nice to work with? Not
really... The messages were impossible to read for an inexperienced dev
(opposed to XML or JSON).

I heard plans to put JSON blobs in one of the fields of the messaging
mechanism (completely destroying the value of versioning and grammar btw).
That was not necessarily a bad idea, just a reaction to the lack of supported
tooling, and readability for an obscure messaging mechanism.

Again, I get where you're coming from, but I'm just allergic to nostalgia for
the sake of nostalgia where the old features clearly lacked essential features
for 2017.

------
haveopinion
The facts are clear. No effective mirror system or failover system. I've
architected major 365x24 systems repeatedly in my career: Removing the
possibility of power supply failure across the primary and secondary sites
(and even within single site) is absolutely fundamental. That clearly has not
been done here, at least to the extent of making sure that the solution is
effective - and the blame, I would guess, lies within a crazy management and
decision structure that seems to increasingly permeate IT within large
companies. Tata do have some good staff (from my first-hand experience over
many years). However, from my more recent experience, there is an increasingly
MASSIVE CLIFF EDGE in talent/competence within Indian companies generally
(and to some lesser extent UK companies as well), so the possibility of having
"very assertive, but stupid or incompetent" middle and even senior managers is
becoming an ever greater real-world problem. The criteria and method for
selecting candidates for IT related work does not help - in particular role,
competency, and "modus operandi" of recruiting agencies, is pretty atrocious,
as many a seasoned contractor will attest! But ultimately, I think Alex Cruz
and his CIO must take responsibility for a lack of diligence and competence.
They clearly do not understand the most basic truth of high reliability,
mission-critical IT: i.e. They NEVER needed cheaper IT staff (with all the
attendant risk within their industry) , but rather much fewer technical staff,
of the HIGHEST QUALITY, for all BA's key systems. And that I’m afraid means
probably not putting Indian companies at the top of the list.

------
CiaranMcNulty
One customer was told that the root cause was a "lightning strike on a
datacentre", although it sounds pretty unlikely (there are storms today in the
UK but surely there'd be a DR plan?)

~~~
tyingq
They do have DR plans, but the reality is that whole thing is a huge
distributed system with parts from different vendors, in different locations,
etc, with decades of legacy.

A power outage that reboots a whole datacenter would screw any major airline
for at least half a day.

Thus far, nobody has found the economics of a modern, truly HA setup worth the
cost. Outsiders greatly underestimate the complexity too. Think something like
40 disparate applications from different vendors, or some homegrown systems,
in different geographical locations. Then, all the client applications are in
buildings you don't own (airports) where you aren't allowed to control the
infrastructure.

If it were a high margin business, perhaps things would be different. It's
not.

~~~
jacques_chester
> _If it were a high margin business, perhaps things would be different. It's
> not._

Most of the loss here is not from lost bookings; a lot of people who prefer BA
will probably just wait and book once the systems are restored.

It's going to be from secondary losses -- lawsuits, ding to the reputation and
so on.

These numbers are estimatable, even for low-margin businesses.

~~~
tyingq
I agree, but I've been around this sort of thing and seen the post incident
analysis, been in the discussions, etc. Despite the huge costs of these
outages, they pale in comparison to the costs of a real HA solution for all
"needed to fly" applications.

To give you some idea, ITA Software was bought by Google for $700 million.
They were some of the best and brightest minds in this space, on par with any
Silicon Valley darling. They successfully wrote a modern replacement for one
popular airline function...shopping. They failed, however, at delivering a
modern reservation system, despite tons of money and talent invested.

~~~
jacques_chester
> _Despite the huge costs of these outages, they pale in comparison to the
> costs of a real HA solution for all "needed to fly" applications._

Then, quite honestly, it's the rational decision for airlines to take.

Even Google accepts that perfect reliability is impossible, and they're
sailing in a pillow-strewn gold-plated yacht down a Mississippi of money.

------
CommanderData
I've managed the shift of enterprise IT application support to India. I have
to say that when it's done right, it can work well sometimes.

But my experience has taught me this. The vast majority of the time it's a
fragile eggshell waiting to crack. When it fails, it fails spectacularly. An
IT support team on this scale is one of those things you should keep close to
your product or customer.

On occasions like this, I guarantee a bunch of execs somewhere in BA would pay
ANYTHING to have some of their loyal IT staff back to take control of the
situation.

~~~
sgt101
But within 4 months no one in the BA C-suite will remember this happened.

------
cauterize
There might be a better word than "crashing" when describing airline computer
systems malfunctioning. Nevertheless, I was happy to hear it wasn't a plane.

------
bitmapbrother
I can't help but be amused by the fact that all of the money British Airways
thought they saved, by outsourcing their IT, was just squandered by the
additional expenses it's going to cost them to fix this mess. And I'm not even
factoring in all of the money this bad PR is going to cost them in the near
and long term.

------
forgottenacc57
All those UML diagrams didn't make the software good.

------
chmaynard
Henceforward, BA passengers will be required to bring portable power supplies
with them and make them available to airport personnel when asked to do so.
Failure to do so will result in forced removal from the aircraft. Over and
out.

~~~
StavrosK
Nah, BA isn't a US airline.

------
known
Companies ruined or almost ruined by Indians
[http://sammyboy.com/showthread.php?98021-Companies-ruined-
or...](http://sammyboy.com/showthread.php?98021-Companies-ruined-or-almost-
ruined-by-imported-Indian-labor-%28US%29)

------
faragon
TL;DR: yet another massive mess because of some "smart" PHB doing "cost
reduction" in IT.

------
frik
Apparently the whole of Heathrow airport, .. was shut down yesterday. United
and Lufthansa had to cancel their flights there too. So it meant more
passengers making other plans to get to neighbouring countries.

------
gmisra
For reference, British Airways' parent company made a profit of €2.5 billion
last year, and expects higher profits this year [0].

Without meaningful consequences at the top of the executive chain for sub-par
IT/infrastructure quality, these kinds of incidents seem inevitable. But how
do you hold people responsible for "bad" software? We could adopt something
akin to how PE licenses are required for civil engineering in the US. I
suspect it is in the industry's best interest to address this need before a
government entity decides to.

[0] [http://uk.reuters.com/article/uk-iag-results-
idUKKBN1630MA](http://uk.reuters.com/article/uk-iag-results-idUKKBN1630MA)

------
oliv__
Crash might not be the best word to put in a sentence containing British
Airways. My brain was frozen for a couple seconds until I understood what had
crashed.

------
BoiledCabbage
I have zero evidence, and am simply working through a thought exercise, but
could it be targeted?

Maybe a group on behalf of a nation state is testing out its muscle. Or
sending a warning. Maybe the UK did something recently a nation didn't like,
and this is the new form of "diplomatic protest". Or sending a warning shot.

I remember reading last year that the large Delta airlines outage was cyber
terror related.

It really feels like we're not too far off from a war between two nations
without a single bullet fired. What happens when a country is hit with nation
level ransomware? I.e. not "give us $300 and you'll get your PC back",
but "sign XYZ treaty and you get your country's water system back"? Or "we'll
restore your internet and turn your powerplants back online"? How much will a
wealthy country (like us in the US) be willing to stomach of seeing people
going thirsty and hungry before they want their govt to capitulate?

Of course the world will condemn and complain, but as has always been the case
the country with the largest Army makes the rules. And we might be using the
wrong measuring stick. Microsoft missed the boat by thinking (along with most
of the industry) that measuring computing meant measuring PCs. It wasn't until
it was too late that it realized that computing was about to be dominated by
Mobile and smaller. They were using the wrong measuring stick.

Are we on the verge of an era where an army should be measured in its digital
strength and not its physical strength? It's a very scary thought.

As an American I think first of the risks to my own country. How long could we
sit with a nationwide blackout, and the internet down (for anyone using backup
generators)? Food rotting in shipping containers because no cranes can offload
it. No easy communication, and no flights as a quick way to move people or
things around.

As a country (and as a world) I can't help but feel we're really not doing
enough to take a risk like this seriously. It's just a feeling, but based on
small incidents here and there it really feels to me like something in the
next 5-10 years is gonna happen that'll make us all drop our jaws and say "I
didn't think this was possible."

Regardless of your political beliefs, a lot of people around the world didn't
think last year's US election was possible. I was shocked by it as well, but
it also opened my eyes to how much more really is in the realm of possible.

As engineers we have a much deeper knowledge of the risks involved and, as a
result, a much greater responsibility for raising awareness and getting the
problem fixed.

EDIT: I'm not sure of the original location I saw the Delta stuff, but a quick
look now turns up this link.

[http://observer.com/2016/09/did-a-cyber-attack-ground-
delta-...](http://observer.com/2016/09/did-a-cyber-attack-ground-delta-
airlines/)

And yes, I'll repeat: the above comment is purely speculation on my part.

~~~
tyingq
There's a little insider knowledge on the Flyertalk forum.

See this post, for example...a BA employee:
[https://www.flyertalk.com/forum/28366141-post168.html](https://www.flyertalk.com/forum/28366141-post168.html)

Later, there's some talk about a power outage and/or lightning strike, again
from BA employees.

~~~
BoiledCabbage
I could be mistaken, but from what I'm seeing in that thread they really have
no idea what caused it. Which, btw, seems fairly reasonable for how early into
this incident it is. I wouldn't expect a root cause for a while.

Delta Airlines also called their outage a "power outage", but it looks like
there was reason to believe it wasn't.

[http://observer.com/2016/09/did-a-cyber-attack-ground-
delta-...](http://observer.com/2016/09/did-a-cyber-attack-ground-delta-
airlines/)

~~~
tyingq
I don't put much stock in that article. Their conspiracy theory is that no
other businesses saw a power outage. But it didn't have to be a supplier/mains
problem. They could have had an issue with their own, internal power grid.
Their data center likely has a UPS, generators, huge switches to move back and
forth from battery/mains/generator etc. A fault there would cause an outage.

Edit: Yep. [http://bgr.com/2016/08/14/delta-finally-explained-how-one-
po...](http://bgr.com/2016/08/14/delta-finally-explained-how-one-power-outage-
grounded-an-entire-airline/)

------
purpleostrich
Who is John Galt?

------
riffic
Hopefully, this will be a wake-up call to the industry - government regulation
in IT is needed.

~~~
al452
The UK government (perhaps the relevant one in this case) does not have a
stellar track record of figuring out how to make big IT systems and projects
successful. This knee-jerk appeal to regulation is unconvincing to say the
least.

~~~
riffic
A modified form of ITIL would do a lot of good for important sectors. Let me
put it another way, if the industry won't regulate itself, then they will be
regulated by force.

~~~
gaius
_A modified form of ITIL would do a lot of good for important sectors_

You've never used ITIL for real, have you? Because if you had you would know
that no amount of process can replace good engineers, and good engineers don't
want to work in ITIL shops...

------
id122015
If this is another case where they employed a horror coder who doesn't know
how to program, they'd better employ me!

------
technofiend
Maybe they hired the managers who handled Deepwater Horizon. Although this
time around, I don't think Donald Trump should send a nuclear sub to drop a
nuke down the shaft on day two of the disaster. I'd wait at least a week.

------
GnarfGnarf
This is a distant early warning of the impending demise of commercial
aviation. The inexorable rise of the price of fuel will eventually make flying
unaffordable for all but the military and civil servants. Airlines are already
struggling to remain solvent, and are cutting costs everywhere they can:
salaries, squashing passengers to increase seats/plane, nickel-and-diming us
with baggage surcharges, etc.

Offshoring IT is an example of cutting costs to the bone.

Your grandchildren will only know of flying as an ancient legend.

~~~
65827
This is nonsense; fuel prices have been plummeting amid new extraction tech
and long-term demand destruction. These are fundamental changes, and if you're
still droning on about peak oil you're just operating on old, bad info.

