British Airways: All flights cancelled amid IT crash (bbc.com)
276 points by CmdSft on May 27, 2017 | 246 comments



One of the unions has been quick to attribute the issue to the outsourcing to India of some of the IT responsibility, which the right wing press here has been all too eager to publish, but BA have rebuffed this and said at this stage they believe the root cause was a power supply issue - sure, that could be attributed somewhere along the lines to an 'alpha male business asshole' (as I read in one of the comments here), but it's probably best to wait and see what the post-mortem really is rather than seek to blame someone, somewhere, be it a businessman, an Indian dev team or anything else.

I am reminded of a post a while back regarding AWS' issues affecting multiple data centres (I forget the specifics), and how their post mortem didn't apportion blame to anyone (which it really easily could have), but rather to their own checks and balances, which allowed the issue to arise in the first place. I do hope that when the dust settles we see a measured response rather than a witch hunt.


I've found that not only is it good to not assign blame in postmortems, but it's also accurate: The culprit usually is the checks and balances, as mistakes will happen, and the goal should be to have failsafes and detection.

I'm reminded of airplane accidents: Whenever you hear of an airplane accident, it's always some amazingly crazy series of things going exactly wrong to get the plane to crash. We have a tendency to think "wow, what bad luck", but a better way to think about it is that airplanes are so safe that an accident can't occur unless a whole series of things go very specifically wrong.

A company's goal should be to increase the number of necessary things that need to all go wrong before there is downtime.
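A rough way to see why this helps, as a sketch only: if you assume (unrealistically) that safeguards fail independently, the chance of downtime is the product of the individual failure probabilities, so every extra safeguard multiplies the risk down. The probabilities below are hypothetical, picked purely for illustration.

```python
from math import prod

def outage_probability(layer_failure_probs):
    """Probability that every safeguard fails at once, assuming independence."""
    return prod(layer_failure_probs)

# Hypothetical per-incident failure probabilities for each safeguard.
one_layer = outage_probability([0.01])
three_layers = outage_probability([0.01, 0.05, 0.1])

print(f"one safeguard:    {one_layer:.6f}")
print(f"three safeguards: {three_layers:.6f}")
```

In practice failures are often correlated (a shared power supply is the obvious example), so the simple product is an optimistic bound rather than a prediction.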


One other important point is that the very term "root cause" is extremely harmful in that it presumes a primary failure and already seeds the idea of one bad actor and, by proxy, blame. Systems today are too complex to blame upon one or two things - we operate in a very complicated, "complected" world both in our software and in many organizations.

While there are always technical causes for larger technical failures, I've seen far too many times RCA post-mortems performed that result in witch hunts instead of a solemn contemplation of how things could be better done by everyone. Such an RCA may ignore that a normally careful engineer was overworked by managers, never is lack of relevant monitoring and testing due to budget cuts cited, and you'll certainly never see "teams X and Y collaborated too much" as a reason for failure in these places. Because in a typical workplace, the company's values and culture are never related to a failure. You can't objectively measure how bad or how good a culture is either. Why make it part of post mortems when you don't think it's a failure?


I don't recall ever having a manager use the term root cause analysis in the way you are implying. Usually we are looking for the cheapest or most effective process change that will prevent that class of problem happening again.


> but a better way to think about it is that airplanes are so safe that an accident can't occur unless a whole series of things go very specifically wrong.

As an aside, I met someone who was working on a graph theory problem as their research project, and the application was that you could model the entire process of aircraft control through a state machine using that graph. Effectively they are working on making it mathematically impossible for a crash to occur assuming that a certain process is followed (with safety measures ofc).


> assuming that a certain process is followed

The challenge is to avoid pushing all the risk into that assumption. It's easy enough to build a system that never breaks if you're willing to assume perfect behaviour on the part of its dependencies, environment, users and operators.


If you've seen how aircraft controllers and pilots work, I think that "following the rules" is a very fair assumption to make. But ignoring that, obviously if implemented there would be fail-safes.

> It's easy enough to build a system that never breaks if you're willing to assume perfect behaviour

It isn't though. Seriously, think about how you could safely route several thousand flying hunks of metal through fairly small air corridors (which all have inertia) and you need to maintain strict flight schedules. Then think about how you need to factor in all of the edge cases caused by emergencies on planes (these are all included in the process for flight controllers). Then think how you could mathematically prove that safety.

Yes, it's easier if you assume that people will follow a certain process (and actual flight systems have so many layers of fail-safes that it's ridiculous) but it's definitely not "easy enough".


If an error makes it into production it is always process, never an individual, even if the individual involved was malicious. The only thing you can do with errors is fix them, learn from them, and then fix the process too.

Assigning blame does not move the needle at all.


Their system should be built in a way which makes it resistant to a power supply issue. That is the fault of whoever built it, whether onshore or off.


Yeah. I'm sure the "power supply issue" is an overly simplistic explanation -- it's highly unlikely a single computer PS could cause such major issues in such a huge organization.

That said, having almost entirely dodged any outsourcing-related issues in the 90s, and worked with generally great offshore teams, seeing my current role impacted by an utterly shortsighted and ignorant attempt to offshore critical operations tasks is quite disheartening. It's rarely the fault of the teams, it's the fault of higher-ups who completely fail to grasp the complexity and consequences of the tasks they're offloading. Everything looks great for a few weeks or months or years until one of the dozens of things that have gone neglected rears its ugly head. If they're lucky, they take the money and run before the crash happens and escape most blame.


I once worked on a site where a UPS failure took out a large AS400; the business stopped for three days while they waited for IBM to replace it.


Hey there! "alpha male business asshole" theorist checking back in on this. While not a slam-dunk yet, it's looking like the theory of alpha-male business asshole looking to advance career at expense of company and deflection of blame onto operators of business is gaining some credence!

https://www.thesun.co.uk/news/3671697/ba-travel-chaos-dodgy-...

“We started using the new system in October. Training aside, the whole thing has been a disaster."

“It breaks my heart to see this as I love this company but it is really going down the pan."

“It’s got so bad that some staff members have written to the transport secretary Chris Grayling. All of our concerns have fallen on deaf ears."

“The Chief Executive Alex Cruz, when he was warned about the system told us that it was the staff’s fault not the system."


In the tech-related (but now fully mainstream, unlike a few years ago) news these days, there are a number of recurring themes.

1. Massive security breaches affecting major corporations, governments, and so on.

2. Massive IT outages, affecting corporations, governments and so forth.

It doesn't take a lot of incidents to mark them as 'massive' since things are strongly centralized in this world.

Meanwhile, in the last 15 years of my IT career, never have I seen such a strong push towards offshoring. One would think that with such a vulnerable IT landscape mixed with an unprecedented dependence on IT infrastructure, CTOs would want to spend more and not LESS on operational costs.

I'm slightly baffled.


> never have I seen such a strong push towards offshoring

As I've been working in the airline IT industry, I've observed that the reality is more nuanced.

Core booking and ticketing systems are 'outsourced' to solutions from Amadeus and Sabre. This makes sense not only for many strategic reasons (easier to have airline alliances when the airlines are all on the one system), but also because airlines are in the business of flying planes, not building ticketing systems.

Airlines are often full of legacy systems which you just can't hire developers for (in the numbers, and at the costs, desired) - this is where Indian outsourcing firms come into play and resolve the shortcomings. They're taking all the jobs and work that no one really wants to do anyway.

However airlines, and many other industries (banks come to mind), are realising that their digital offering is just as much, if not more, a core part of their offering and those are skills they need to bring in house and own if they want to be competent. I'm seeing more and more web and app developers be brought in house so airlines can establish these competencies. Digital/Product agencies are 'stepping in' in the meantime to help out (like Virgin America and Work & Co).


Airlines are logistics companies that fly planes. They have to have a functioning plane at the right location - and only the right location - at the right time, or they bleed money.

It baffles me that they ever thought outsourcing part of that logistics was a good idea. But then I've worked at tech companies that tried to outsource their IT departments too.


Retailers are in the business of selling things, yet 'outsource' their Point of Sales systems. This is just the same thing.


No, it isn't. The point of sale system is about taking payments. If your core business isn't 'the processing of financial transactions', then buying that is the sane thing to do. IBM being able to field a POS system doesn't make them a competent retailer. But they handle regulatory rules for you and let you be a competent retailer.

Airlines aren't just responsible for flying planes. They are responsible for the planes. It's not enough for the plane to show up in Chicago by 5pm, it has to be there and all of the necessary maintenance and safety work has to be done by 5pm. Anyone who could figure that out for them would simply go into business for themselves and compete.


It seems that companies would rather buy "insurance" rather than spend money on actual security. They have accepted that they will eventually be hacked or have a really expensive outage, so now they want a cushion of money to pay for any penalties and lawsuits.

I get the impression that for the most part, no one really believes in security, and doesn't understand that it's a process and not a product. They already tried throwing money at the problem by purchasing security products, which ultimately failed to deliver security. Now they're trying the opposite by cutting costs. Meanwhile there are very few consequences to losing control of people's data.


Why aren't there Security Process Engineers (SPE)? I was recently waiting on a massively delayed flight and observed three instances where employees left the computer behind the check in desks unlocked. I'm sure these computers checked off all the requirements for having a firewall turned on, anti-malware software, auto-updates turned on, etc. But the front door was left wide open because the employees were forced to move between their counters and the actual jet bridge. Since they get penalized for late flights, but not for security breaches it seemed as if they didn't care. This would be where a SPE would come in and propose something to remediate the process failure.


There is certainly this type of thing out there.

SSI or software security initiatives, when given greenlights based on buy in from governance can introduce secure processes.

But before this happens to any meaningful level, risk-based analysis should prove it is cost effective. Guess what those analyses say?


"They already tried throwing money at the problem by purchasing security products, which ultimately failed to deliver security."

So true. I used to wonder why corporates can't build a real security team. One would spend $1M on an annual contract when perhaps you would do okay with $700,000 for a few security engineers. Then I realized that most corps are indeed looking for insurance.

You see, managed security vendors have a team. You don't manage the hiring, the tooling building etc. Obviously you can't just sit back and watch them put out the fire. You train them to understand a bit about your environments, you work with them on triages and resolutions, and you work with them to integrate your systems with them (e.g install an agent etc)

But it isn't in most enterprises' interest to build a strong security team. Many in-house security teams' role is to manage incident response. Many of them don't really have a say; they are often consulted and that's it.

Many of the product demos sound exciting, and if you are not careful during your POC you will end up with a mediocre product. Even if you did your best during the POC to evaluate the product, you may still realize the product is really mediocre. It catches the low-hanging fruit and is full of false positives. You ask for better analysis, but because these products come as black boxes, there is a lot of back and forth before you can act on the issue. So at work I would get a ticket from the vendor and I would end up doing a lot of the analysis. That sucks because while I enjoy doing security triage, that's not my role; on the other hand, at least they caught something. I would have to engineer or deploy some solutions and monitor the whole thing, since no one man or team can do everyone's job.

Since no one wants to be responsible for others' jobs, that's the whole point of having a security team. It just happens that the team, the people who are building tools and monitoring incidents, is outsourced.

Most importantly, there is very little control over what your vendor can do. Want a new feature? Want better reporting? Want to change some configuration? A lot of the time you are out of luck, or you need to be patient.

But, seriously, security practice is not magical. There are best practices such as SSH to servers using key or cert authentication, not passwords, and least privilege for running processes, so system admins/devops should create a checklist of what's in place and what isn't. It does take a serious commitment from management to move forward though.
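As a minimal sketch of what one line of such a checklist could look like, here is a small check over the text of an `sshd_config`. The function names and the sample excerpt are hypothetical. It relies on two documented behaviours of OpenSSH: the first value found for a keyword wins, and both `PasswordAuthentication` and `PubkeyAuthentication` default to `yes`. As a simplification it only reads the global section (it stops at the first `Match` block) and ignores `Include` directives.

```python
def effective_sshd_options(config_text):
    """Return keyword -> value for the global section; first occurrence wins."""
    options = {}
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # simplification: drop comments
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        if key == "match":
            break  # simplification: ignore conditional Match blocks
        options.setdefault(key, value)
    return options

def checklist(config_text):
    """Tick off the two SSH items mentioned above."""
    opts = effective_sshd_options(config_text)
    return {
        "password login disabled": opts.get("passwordauthentication", "yes") == "no",
        "public key login enabled": opts.get("pubkeyauthentication", "yes") == "yes",
    }

# Hypothetical sshd_config excerpt, for illustration only.
sample = """
# hypothetical excerpt
PasswordAuthentication no
PubkeyAuthentication yes
"""

for item, ok in checklist(sample).items():
    print(f"[{'x' if ok else ' '}] {item}")
```

A real audit would more likely ask the server for its effective configuration than parse the file by hand; this only shows that "what's in place and what isn't" can be checked mechanically.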

And security is an iterative process. Evaluate the most secure option possible for the time being, and put a realistic plan together for the remaining unresolved concerns.


> that CTOs would want to spend more and not LESS on operational costs

The incentives are the same as they always were. A well run IT organization can coast on momentum for a long time - everything is neatly automated. So they sack their good engineers and hire an outsourcing company and save an amazing amount of money just long enough to get that next bonus or promotion, and they're long gone when the momentum runs out and the wheels come off. Repeating the same scam at another company.


I liken it to shorting options. Let me explain.

In the derivatives world (that I started my career in) there's a common theme: you can sell options, and generally you will make money. The only issue is, now and again you lose a lot of money. Often enough to wipe out everything you made.

But incentives matter. If you come in as a new guy, you want to impress by making money. And come up with some story about how you'll know when not to sell options. You also don't want to be the guy losing money when the other desks are making it. If there's a crisis and everyone loses, that's fine. Unforeseen, right? It happened to everyone.

The CTO is facing a very similar choice. Save money by not upgrading, work the security guys a bit harder and let them update the OS a bit slower. Don't think too hard about redundancy. Or code review, that takes away dev time. You're saving money.

But now and again, something happens. Maybe you have to leave, but you'll still have "saved money" for several years before. You'll land somewhere.


Is there any evidence that IT outsourcing has made matters worse? Or is it only worse if the provider is located offshore?

Arguably, delegating IT operations to a proper IT company could improve security and reliability.

We use a low cost offshore IT provider and we haven't had any security or reliability issues so far (https://gsuite.google.com/)


There's some evidence in software engineering research that offshoring isn't the problem but that the distance between team members on an organizational chart, however, is much more important. It's ok if an external party comes in as long as the resources are treated somewhat like equals to existing engineers. But when they become "that team" you will lose collaboration and things will fall apart partly because that team's manager may have different priorities than the local team.

When it comes specifically to IT, however, we're looking at almost a race to the bottom to save money immediately, sometimes to make more budget for software engineers that are revenue centers. Furthermore, a lot of IT folks want to stay current on technologies just like many developers do, but IT constraints are oftentimes even more than what their developer peers in the same company would have. This has led to a concentration of lower performers staying for a long time (not indicating skill necessarily - their environment is hostile to professional growth) and more ambitious technical folks move elsewhere.

It's been puzzling for me why IT orgs don't focus upon automation obsessively like most industries trying hard to cut costs have. The US coal industry automated a lot of labor and it's not like coal miners were making handsome salaries like most sysadmins did in the late 90s. The low performers all seem to have a strange obsession with creating as many manual processes as possible instead of, I dunno, writing some orchestration in Rundeck jobs or Ansible playbooks maybe. One decent automation engineer can replace dozens of lower skilled junior and mid level sysadmins even in bare metal environments.


The key is incentives. An IT outsourcer's goal is generally to fulfill their contractual obligations (to the letter not the spirit) for as little money as possible.

So if something like security or reliability isn't clearly stated in contracts (and it's extremely difficult to do that well) it is disregarded.

An outsourcing company is not going to spend money it doesn't have to, simple as that.

SAAS like gsuite is an entirely different prospect. you're buying a service from Google not outsourcing your IT to them. If google's service was insecure or unreliable no one would buy it.


I've worked in outsourcing for 9 years in Romanian companies as a software developer. I had the impression that the goal was to have the customer satisfied, not to fulfill a contract (as long as the customer was also reasonable, of course) and for pragmatic reasons: a happy customer pays you longer. My longest project was for 4 years, with the same customer. And my former employer keeps working on that project, 2 years after I left the company. In these 5 years the customer's company got acquired and for some reasons they had a pretty high turnover with their own employees, but they kept working with my employer.

When I joined the company there were 60 employees; when I left it 5 years later, there were about 500. I think they could only achieve this growth based on their high work ethic and the quality of their hiring system.


To make a customer happy you don't have to build reliable software. Corporations are not machines; there are a lot of people with their own interests who make important decisions. You have to make these people happy, first of all. They can be bribed, convinced or fooled to achieve the desired outcome for your business. And if something bad happens, you have to dodge the bullet together, because you are in the same boat. So you blame other contractors, hackers or the guy who left the customer company some time ago, and do a lot of other things to convince the customer's management that it's not your fault, that you actually saved them from bigger losses somehow and that you can help to build better software. And they won't be happier, but they will think that you are a good guy trying to solve their problems. And that's the only thing you need to sell another contract.


Cynical much?


I wouldn't say that an outsourcing company wants to do a bad job, of course they don't, but things that are often invisible in the product (like security or reliability) will likely get less focus in a cost sensitive environment.

For example, completing a full security development lifecycle can add 10%+ to the costs of the final product. that's not a cost that a company will incur unless they have to.

In a bid for work, everyone says "we take security seriously" and the client probably can't evaluate the difference between someone who really takes all the necessary steps, and someone who pays lip service to that concept.

So the cheap company (who doesn't spend that 10% extra) looks just as good as the one that does, and they're likely cheaper.

Guess who's more likely to get the work (all other things being equal)...


Back in 2012 I asked one of the department managers: are the customers willing to pay for automated QA tests? He said "they usually do, because I give them the price per project and tell them 'this price X includes automated testing for the functionality. We can lower the price if you don't want the automated test suite'".

I liked the approach and I think the same could go with security. Include an external security audit in the initial project price.


I don't see why an IT outsourcing company would not have the greatest incentive to take security and reliability extremely seriously.

Is there anything worse for an IT service provider than being blamed for a massive IT outage at a global corporation? This is headline news.

And I don't see any difficult contractual issues at all. On the contrary. A massive outage, by definition, means that the contractual obligation is not being met.


Security costs money, reliability costs money. If it's not in your contract, you don't pay for it.

If there's an outage because the customer didn't ask you to do something, that's their fault, not yours.

For example would you as an IT outsourcer pay for a redundant datacentre if your contract didn't call for it?

Would you patch all your systems immediately even if it caused availability issues if it wasn't explicitly outlined in the contract?

would you explain to your shareholders that your profits were lower this year because you undertook activities not specified in your contracts because they were good for the security and reliability of the services you managed...?

When outsourcing contracts are bid for, the common experience is that lower costs win. That inevitably leads to items that aren't strictly required being excluded.


>For example would you as an IT outsourcer pay for a redundant datacentre if your contract didn't call for it?

I would assume that the contract calls for particular service levels and that downing the entire fleet of a large carrier for days is in breach of that contract.

If the contract says "provide service X with 99.999% availability" the service provider cannot come back and say, oh but you forgot to specify that we should run a redundant data center to guarantee that availability.


>If the contract says "provide service X with 99.999% availability"

If you read those contracts, you have to read what the consequences are for breaking that uptime guarantee. Usually it's something silly like 10% off your next bill.


this is all correct and will not change until we reach a critical mass of devastating failures/breaches.

until then, don't hold your breath. people learn their lessons the hard way.


Here is the list of things that are worse for an IT outsourcer.

1. Being more expensive than the other guy.


I doubt that anyone wants to compete solely on price. It's a margins game.


have you been involved in many IT outsourcing agreements? Cost is always a major factor in procurement, and in IT where there can be a strong market for lemons, it's often the key factor.

A company that's more expensive and can't clearly demonstrate in a bid scenario the positive impact of that increase in costs, will lose bids, a lot.


Of course cost is a major factor. I never disputed that. I'm saying that competing on price alone is undesirable for IT outsourcing companies.


But the problem is that factors like security and reliability are often invisible in outsourcing contracts, as it's very hard to specify things like security exactly in a bid contract, and very hard for customers to tell the difference between an organisation with higher security and one with poorer security practices.

Everyone will say they take security seriously, but the cost of actually doing a good job on security is much higher than paying lipservice to it, so it doesn't really make commercial sense to do it well.

In general, in fact, I'd say that a lot of IT outsourcing contracts tend to lead to a market for lemons. It's very hard for a customer to assess the quality of a company's staff, for example, so the company with the cheaper staff can afford a cheaper bid, which looks just as good as a company with more expensive staff...


Granted, security is difficult to specify. But availability isn't. Not on the scale we're seeing right now with BA.


You are right, but they all do it because outsourcing is a signal that corporate management has lost faith in its ability to manage IT and now has but one method of judging value - price.


If the possible negative consequences to the company's image are the only incentive, they will rather try to find an excuse, since they've fulfilled the requirements of the contract. It can be anything, from missing security requirements (under the assumption that their customer will take care of security himself) to incorrect use of the provided software etc. And they'll probably be right, especially if they actually tried to sell the security. And they'll probably be able to defend their position in court and sue for defamation.


You're not wrong that delegating operations COULD improve security and reliability, but it could also suffer from the same issues that branch offices suffer from -- out of sight, out of mind, distance from the corporate HQ culture and scrutiny, etc.

Furthermore, often these issues rear their ugly heads when they sack the existing staff who are keeping 1000 balls in the air and rarely dropping one. In the migration to an outsourced team, or an offshore team, or even just another equally competent team in the same office, details will get missed. It's simply not possible to do a complete knowledge transfer. Things will get missed, and sooner or later one of the things you miss will turn out to be a landmine.

Of course, the original team wasn't perfect either, and could have also suffered a large outage, but the new team is more likely to do so, at least in the short term.

In a best-case outsourcing scenario, you either hire better people, or more people, or a vendor with better processes or understanding. You do the migration very carefully and miss as few details as possible, AND the quality of the new team is such that, within 3 or 6 or 12 months, they're actually outperforming the original team. AND you manage to avoid hitting any landmines during the transition period where they're underperforming. This requires a great bit of planning, execution, and luck.


The reason is simple: short term thinking driven by short term investors


Or maybe it's what is repeated here so often: don't let the perfect be the enemy of the good.

Perhaps it really is cheaper to have a few outages than it is to build the sort of systems that never go down. Every "nine" you add gets more and more expensive.
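To put numbers on the "nines": the arithmetic below converts an availability target into the downtime it permits per year. It is only a sketch (a 365-day year, with `allowed_downtime_minutes` as an illustrative name) and it does not model cost at all. It just shows that each extra nine cuts the downtime budget tenfold, which is the part that has to be paid for.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years

def allowed_downtime_minutes(availability_percent):
    """Yearly downtime budget implied by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes/year")
```

Five nines works out to a little over five minutes a year, while two nines allows more than three and a half days.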

Or perhaps the systems are so complex that it's just impossible for them to be fully reliable, and this is just one of the costs of having the ability to fly anywhere in the world, cheaply and at the drop of a hat, and to have it work 99% of the time.


> Or maybe it's what is repeated here so often: don't let the perfect be the enemy of the good.

Is that what you tell hundreds of thousands of passengers sitting in a terminal with canceled flights and confused airline employees during one of the biggest holiday weekends?

If your credit card number gets stolen or worse, your identity, and it's up to you to get that straightened out, are you just going to shrug your shoulders and write that off as the inevitable impossibility of perfection?


I'm saying yes, maybe that is the case. Maybe those scenarios are either a) cheaper to handle on occasion than building systems which will never fail, or b) unavoidable (with some low probability).

Air travel itself is not 100% safe. Planes crash. People die. Yet we feel that the probabilities are enough in our favor that we accept it.


What? "We" don't accept anything: strict regulations and processes were put in place to reduce the probability of accidents to the absolute minimum. Did this stop manufacturers and airlines from spearheading this or that new plane? Maybe - very few companies can carry that sort of compliance effort; but it's why the public trusts air travel as fundamentally safe (statistically speaking, it's probably safer than cars).


Ignoring the offshoring part of this for a moment I think you're on point about the CTOs, it baffled me too at the beginning. When I started doing sales for a startup service and was targeting CTOs I discovered based on my interactions that outside of Silicon Valley companies or obviously tech centered companies, CTOs in other countries and industries are in many cases just the usual political shark type person interested in showing how they were able to cut costs to help improve profit and often there isn't the real interest in tech to understand why doing things better is so important.

As an example, a person in a tech leader role at a major company in the UK, who I thought might be interested in hearing about a product, was quite rude, telling me never to contact them again and refusing to take any calls or emails before even hearing about something that other CTOs had at least looked at, and that some had even become customers of. She happened to have a personal blog that I checked out, and all it mentioned was her passion for swimming; nothing at all, not a single thing about tech. Why is this person in this role then? Beats me, but it is sadly commonplace. Until you get people at the top who truly understand tech and were actually software engineers themselves at the start of their careers, things won't change.


> As an example a person in a tech leader role at a major company in the UK who I thought may be interested in hearing about a product was quite rude about me never contacting them again ...

You didn't spam them did you?

Your description of their behaviour sounds a lot like the standard response to spammers. eg FOAD


No I did not, and that's the issue, a lot of these people are not actually interested in making things any better, they'd rather not be disturbed instead. And then something like what happened today happens. And then people are misdirecting blame to others. The leadership is in most cases to blame.


Interesting. What approach did you use to tell them about the product they hadn't heard of before?


In that particular case I contacted them by email, introduced myself and explained why I had contacted them specifically, what I knew about what they were doing and the problem I was solving and asked if they wanted to discuss it. They responded and said they were not interested. I replied and said that's fine, if they change their mind to get back in touch. They wrote back again with the 'foad' response as you suggested, really quite a shock considering I had expressed already that it was all good and I had considered it the end of the conversation and didn't actually expect a reply. Unless I actually share the emails with you I'm sure it just sounds like he said she said but I do think I'm describing this quite accurately, most people did not reply like this, especially people who I had researched and were generally interested in software and tech, they were either interested in discussing the opportunity or calmly explained why they were already good. I am just providing my experience of some bad leadership at the top when it comes to software/tech.


Well, it certainly sounds like you spammed them. :(


There is a distinction between spam email and a personalized targeted sales email. Mine was most definitely not spam.


I receive about 10-20 unsolicited sales emails a day that are "personalized", which I promptly delete.

If I am looking for a product, I will look for it. As far as I'm concerned, your emails are spam.


> I am looking for a product, I will look for it.

Note that this attitude is what pushes some sales teams away from reaching out to line management, and towards C-level-targeted sales. If you hate "golf-cart sales" outcomes where a C-level forces a solution down your throat, then learning how to communicate with sales people will richly pay off. It is part of managing upwards, by short-circuiting sales warm approaches to the management you report to, and turning them into your allies to help you pitch your priorities that happen to align with their sales goals.

I am up front with the sales people who approach me, and tell them when I anticipate the problem space they solve will rotate onto my front burners, my anticipated budget to switch (usually in the form of "if I switch to what you propose, I only have $X to do it with, all in, software and services"), the benefits from my current solution I want to ensure stay in place, and the pain points I want to solve. That usually ends the conversation right there and then, I'm tagged "no-contact" in their CRM, and the spam disappears.


In an ideal world, spammers wouldn't exist. Or at least C-level people who respond non-negatively, thus encouraging them to keep on doing it. :(

If people wonder why spam is a problem, it's pretty much because enough idiots respond to keep it worthwhile.


My personal anecdata from observing C-level sales efforts from the inside is that they follow this general pattern. I encourage you to look at these situations from an angle other than "all/most sales people contacting C-levels are spammers, and C-levels who respond neutrally/positively are idiots for not recognizing spammers".

Generally speaking, C-level relationships are cultivated over a long time with the kind of sales and marketing budgets you've read of; dinners, sporting events, wine tastings, etc., with some proven sales account manager. However, it takes a lot for some sales opportunity to rise to the level of bringing it to the attention of this relationship. If you want to stop what you call spamming, then your job is to manage upwards by ensuring pain points do not rise to that level. This keeps the sales account management activity focused on the numbers when the support contracts come up for renewal.

However, if you have not been successfully managing upwards, then lines of communication have broken down between C-levels, or between the CIO and your level. The first you can't do much about. The second is partly within your control. If you ensure your manager has nothing from the business to complain about that can be addressed within the solutions you are responsible for, then you've done what you can about it; you can't fix what you aren't told, after all. Where most technology-oriented staff misstep is they think they need to hear it from their management; the staff who successfully short-circuit pain points coming up to the C-level's attention realize that they can also ask, and more importantly, demonstrate they can communicate and coordinate between departments and people to help address those business-oriented pain points before they percolate upwards.

Where these sales account managers pounce is when the pain points become so grave that the C-levels hear about it and feel they have to "do something" about it, and "it" is a competitor's solution. And if you really like that competitor's solution, and hate the idea of switching away, be on the lookout for a call from one of the sales reps assigned to help that sales account manager. If you've even heard about the pain point, then you will be able to help drive the discussion, and most of the time if your favored vendor is on the ball, close the window of opportunity for the sale to be floated up.


You're conflating real sales emails and spam again and really seem to hate sales people. How are the people who respond idiots for actually being happy to hear about a product or service that solves their problem and signing up?


I'm a C-level executive. I am more familiar with my infrastructure and pain points than any cold-call salesperson will ever be, so if I need a third party solution to a problem, I use the internet to find it.

I'm not going to waste my working time communicating with a salesperson back and forth for a product I'm not interested in. One, it only validates that I read their email, and two, the end result is the same - I don't buy the product.


Great observation.


Ahhh. That's a misconception on your part. There is literally no difference between any kind of sales email, and spam. All sales email to people that haven't opted-in previously, is spam. 100% of it. Thankfully, some countries have even introduced laws against it.

That's the reason for the FOAD response.


This may be your opinion but this is not a fact.


Well, you were wondering why the FOAD reaction, and we've clearly described why based on the info you provided.

Your willingness to listen or otherwise is on you. ;)


I was not wondering why at all, if it appears that way I did not intend it but I can't see anywhere I said that above.


Good point. You've convinced yourself the people you're spamming are in the wrong, and they should trust someone who's already shown to have a distinct lack of ethics.

Literally, by spamming them.

It's absolutely no wonder they're telling you to FOAD, as that's the correct response.

Please, change to a different profession. Preferably one that contributes to society in a positive way instead of your present one. :)


The problem is, that for CTOs, there are so many people who 'know' how to make things better for them. You can't listen to them all.


In one way the CTO gets the 'purchase' of insurance here. Not against the company having issues but for their job because they can say, "Well I hired the best company to do this. If they couldn't do this, then we definitely couldn't!"


Don't manufacturers face the same sorts of issues with supply chains? I mean, if there's a train wreck or an ice storm, the people who needed those deliveries are pretty much screwed.

I think we like to pretend that this is a problem unique to informatics.


By outsourcing IT to India (Tata?) and Poland, BA and its parent are revealing some important information: they are cutting corners to the point of risking airline operations. The question is, where else are they unnecessarily cutting corners? Are they cutting unnecessary corners on airplane maintenance?

Other quality airlines have outsourced IT or part of their IT operations, but they are careful about choosing their IT vendors.

For example, Israel's national carrier, El Al, which also has extremely good security against hijacking, outsources at least its ticketing as a cost-saving move, but it does so to Lufthansa, the German national carrier, which uses the Amadeus ticketing platform. El Al is saving money over running its own operations, but still using a reliable vendor.

Quantas, the airline of Australia, outsources its IT operations to IBM. They did not choose the cheapest alternative, but a reliable one.

In contrast with the Israeli and Australian flag carriers El Al and Quantas, BA, the British flag carrier, made some extremely unwise choices in trying to choose the "cheapest" solution instead of a money-saving cheaper solution.

Compared with El Al and Quantas, BA management has shown us that quality of operations is not their top priority. The question is, where else in their operations is this a problem?

BA management is revealing to customers and shareholders alike that customer service and shareholder value are not their highest priorities. They are signaling to shareholders that it is time for a change before they further harm the BA brand and before some serious accident happens.


Yes, it's a cyclic trend in IT:

  Step 0. Well functioning and balanced company.
  Step 1. Why are these engineers so expensive? We can hire 'ten for one' in 'country A'.
  Step 2. Cut local IT budget twice, put goals to have *zero* local IT budget in 2 years. Outsource everything to 'country A'.
  Step 3. People are fired, everything is outsourced, operational knowledge accumulated over the last few years is lost. 
  Step 4. Why is everything broken and why are we losing millions? The service is of a quality like we live in 'country A'!
  Step 5. We need to hire high quality local team that can take *ownership* over product. We are willing to pay a fortune.
  Step 6. Well functioning and balanced company.
Not sure if airplane maintenance is being treated the same way.


You can bet it is: http://www.vanityfair.com/news/2015/11/airplane-maintenance-...

FTA:

>The airplanes that U.S. carriers send to Aeroman undergo what’s known in the industry as “heavy maintenance,” which often involves a complete teardown of the aircraft. Every plate and panel on the wings, tail, flaps, and rudder are unscrewed, and all the parts within—cables, brackets, bearings, and bolts—are removed for inspection. The landing gear is disassembled and checked for cracks, hydraulic leaks, and corrosion. The engines are removed and inspected for wear. Inside, the passenger seats, tray tables, overhead bins, carpeting, and side panels are removed until the cabin has been stripped down to bare metal. Then everything is put back exactly where it was, at least in theory.

>The work is labor-intensive and complicated, and the technical manuals are written in English, the language of international aviation. According to regulations, in order to receive F.A.A. certification as a mechanic, a worker needs to be able to “read, speak, write, and comprehend spoken English.” Most of the mechanics in El Salvador and some other developing countries who take apart the big jets and then put them back together are unable to meet this standard. At Aeroman’s El Salvador facility, only one mechanic out of eight is F.A.A.-certified. At a major overhaul base used by United Airlines in China, the ratio is one F.A.A.-certified mechanic for every 31 non-certified mechanics. In contrast, back when U.S. airlines performed heavy maintenance at their own, domestic facilities, F.A.A.-certified mechanics far outnumbered everyone else. At American Airlines’ mammoth heavy-maintenance facility in Tulsa, certified mechanics outnumber the uncertified four to one.


To be honest this article is not great

Maintenance facilities are usually certified by the local aviation authority

Aviation is also "regulated" by two extra entities: plane lessors and insurance companies. Both won't be happy if the maintenance done at these facilities is not correct


Or, sometimes:

Step 5. We need to hire high quality local team that can take ownership over product. We are willing to pay a fortune. But we can't hire anyone local, because we did step 3 for too long or too often, and that knowledge is locally gone.


I thought Step 5 would be: let's hire an overpriced western big name consulting firm to fix our 'temporary' problem.


> We can hire 'ten for one' in a country A.

Rather: 'two for one'.


Why just two? The average salary in a western country is many times that in a third world country, and ten times is not that unusual.


What country are you talking about? I don't know about India, but in China living costs are skyrocketing and are in many cases higher than in major western cities. 2 for 1 (if the team lives in a big city) or at most 4 for 1 is what I believe would be the upper bound.

Well if you truly talk about "third world country" then sure, 10 for 1 but what kind of country and IT operation would that be?


Unless you're an idiot and outsourcing critical operations to India or China, it's more like ~4 for 1.

Third world countries are developing, you should keep up with that.


>>Unless you're an idiot and outsourcing critical operations to India or China [...]

I've heard rumours that the Chinese will just copy your IP as soon as possible, regardless of what you make them sign.

But what's supposed to be wrong in India compared to other outsourcing sites?


I think the narrative here is that yes, the pointy heads are idiots.


The problem is (and I've seen this happening) that for many managers or C-levels, it's really hard to see the distinction between high and low quality. They just can't tell.

So if all offers look the same quality to you, you obviously buy the cheapest.


Or you delegate to people who can? Or at the very least don't change anything you don't understand!? The root cause of course is the MBA type without the technical background being put in charge of engineering teams in the first place.


>"Other quality airlines have outsourced IT or part of their IT operations, but they are careful about choosing their IT vendors"

I'm not so sure about that. Here are 4 of the 5 big carriers in the U.S and they have all had national outages within the last year due to "IT."

Delta:

https://www.usatoday.com/story/news/2017/01/30/delta-outage-...

SouthWest:

http://www.cbsnews.com/news/southwest-airlines-computer-outa...

United:

http://www.chicagotribune.com/business/ct-united-flight-dela...

Jet Blue:

https://www.usatoday.com/story/travel/flights/todayinthesky/...


> Quantas, the airline of Australia, outsources its IT operations to IBM. They did not choose the cheapest alternative, but a reliable one.

I believe IBM was responsible for the Australian census debacle in 2016. Hardly a ringing endorsement of their reliability, and not the only high-profile instance of them messing things up.

"No one ever got fired for choosing IBM" though.


While I can't speak to the exact circumstances, IBM when it takes over operations often hires the IT staff of the firm it is taking over operations from and runs systems on multiple-year contracts.

At any rate, I believe few would contest that IBM in general is a more reliable vendor than Tata, other Indian suppliers or say, offices in Poland.

EDIT: Details on IBM and Australian census. http://www.abc.net.au/news/2016-10-25/turning-router-off-and...


It's pretty funny how in this thread IBM is conflated with decent quality IT outsourcing and Poland with poor quality, while IBM's outsourcing is done out of Poland (among other countries).

https://www-03.ibm.com/press/us/en/pressrelease/32469.wss


IBM has the resources and infrastructure other firms don't have. It can afford to pay higher rates for the top IT people and it has the software and management and the backup in the US to ensure that projects are performed properly.

But as a rule, Poland, India, and other cheaper locales don't offer the quality of the US, Israel, and so on. Look to the places where the world's top software, computer chips, and hardware are being designed and built. There is a correlation.

For example, Israel has a population of only 8 million yet creates more original software and startups (purchased by developed nations) than all of India. India buys hi-tech military technology from Israel (and the US).


If you're going to down vote, please state the reason. The truth is that tiny Israel has more high tech software/hardware than any other nation except the US. There is definitely a difference in software quality.


> IBM when it takes over operations often hires the IT staff of the firm it is taking over operations from and runs systems on multiple-year contracts.

And typically has them laid off in 3 - 5 years and replaced by IBM offshore resources or subcontractors.


> hires the IT staff of the firm it is taking over operations from

How can they do that and save the client money at the same time?


> How can they do that and save the client money at the same time?

Because the staff they transferred over are used to train cheaper replacements then laid off after 2 years. That is the plan all along, tho' the staff will never be told it upfront.


They hire the IT staff. Not the management, HR, etc.


And presumably short term too, so even if they did hire the whole hierarchy, if they replace it with one having a quarter of the cost in a couple of years then they profit long term?


> IBM when it takes over operations often hires the IT staff of the firm it is taking over operations from and runs systems on multiple-year contracts.

I've seen this happen many times and it's always been successful. They can replicate the business processes they've honed over time whilst keeping that important business knowledge. It's also often better for the staff as they can hand off that knowledge and move internally within IBM to new and more challenging roles without switching employers.


I've seen this happen too. But those people wanted to work for company X for whatever reason, not for IBM (or whoever).

Now they work for IBM, and all the things they did above-and-beyond their job description are now billable. Now their incentive isn't to help company X that they wanted to work for, but to screw company X for every nickel-and-dime because that's where their new employer's revenue comes from. And they know all the skeletons in the closet and all the pain points.

Stab good people in the back and you make powerful enemies; the managers of all the company X's out there never learn this lesson.


Interesting. In IT circles IBM is generally regarded as "worst of class", due to continual waves of firing competent employees, outsourcing, and many other dodgy practises. A recent example (there are many, many more):

https://www.theregister.co.uk/2017/05/26/ibm_asks_contractor...



No word about Accenture


[flagged]


> India I can understand,...

Care to elaborate a bit?


I think it refers to the telemarketers and tech support stereotype of India.

In this view, India is portrayed valuing quantity over quality.


Interesting, I hear of outsourcing software dev to Ukraine, Russia, Bulgaria, but not Poland. Might just be connections.


And the IBM Phoenix payroll system disaster in Canada.

But hey at least Canadian government managers/executives got performance bonuses.


IBM makes mistakes but IBM is massive. You expect some things to happen just by the sheer surface area they have.

I'd trust IBM over random outsourced coding houses when it comes to IT Security.


Pretty convenient that IBM's size lets them remain trusted for things that would disqualify random companies for a given outcome. Why pay so much more for the same problem?

What is the conversation that makes this happen? "Yeah, let's go with IBM. The same problems might crop up, but they're a larger company."


I wouldn't. IBM have made it a strategic goal to move their staff to "low cost locations" and are even now getting rid of staff in higher cost locations.

for example https://www.theregister.co.uk/2017/05/26/ibm_asks_contractor... doesn't sound like a company I'd want to trust.


IBM can mismanage projects and deliver subpar results and people will keep giving them millions of dollars for it "because"

The smaller shops have more reasons to try harder and deliver a better result. IBM will just hire the cheapest people they can find and put them under 5 levels of management and good luck with that!


People in this thread talk about IBM like they're the only well known vendor that does this - far from it. All of IBM's enterprise competitors doing professional and managed services such as HP do this as an industry practice. This has everything to do with their customers treating IT as cost centers. I very much believe in hate the game, not the player.


I live in Atlanta - and as someone mentions below there have been pretty big IT outages at Delta etc too.

But without any evidence whatsoever, wholesale blaming of Indian and Polish companies seems somewhat unfair. Until we have a postmortem and know this in more detail, it appears that you just have an axe to grind.


We don't know yet what failed, let alone why. What sense does it make to jump to conclusions so quickly?

And what difference does it make whether the error -- if indeed the root cause was an error, which hasn't yet been established -- was made by someone in India vs. someone in Poland vs. someone in the UK?


I think you should not put IT specialists from India and Poland in one bag. Polish engineers are sometimes a bit late when it comes to the new technologies, but are very good compared to ones from, for example, Ireland.


> Polish engineers are sometimes a bit late when it comes to the new technologies, but are very good compared to ones from, for example, Ireland.

It doesn't matter. They are no smarter than BA's original British engineers, but with decades less experience on those systems. Even if they were smarter - and in general the talent prefers to work for real companies, not bodyshops, so that's unlikely - they can't compensate for the lack of real experience.

Only a person who believes experience doesn't matter would sign an outsourcing deal.


> They are no smarter than BA's original British engineers, but with decades less experience on those systems

Wow you really believe Poland is some third world country where we ride horses to our farms? They have experience; in some ways Poland is way ahead of the UK in the digitalization of many services, in great part because of a thriving economy, great schools and excellent developers.

Poland has some of the best programmers according to a study conducted by hackerrank https://blog.hackerrank.com/which-country-would-win-in-the-p...


> Wow you really believe Poland is some third world country where we ride horses to our farms?

No, I was in Poland just last week in fact.

But my point stands: BA's own staff have been operating its systems for decades. No matter how good you are at solving made-up puzzles on Hackerrank, you can't match that experience of those systems overnight. There would have been people at BA who'd been working on, say, the reservation system for 30+ years, and you're claiming your "hacker rank" can beat that. And THAT is why outsourcing always fails, that kind of belief.

Apologies if that seems a bit harsh, but the assertion that "we do well on Hackerrank therefore we are as qualified to maintain legacy systems as their original authors" just doesn't make sense.


Decades? We are talking here about 2-3 years lag due to conservatist approach. You seem to know nothing about what you write.


> Decades? We are talking here about 2-3 years lag due to conservatist approach. You seem to know nothing about what you write.

You think the core systems of major airlines or any other large org turnover every 2-3 years? You think knowing the syntax of a particular language is comparable to decades of domain knowledge? You think that ANY 20-something, anywhere in the world, has decades of experience of anything??

No, YOU know nothing about what you write.


[flagged]


Only 52% of us.


Only 26% voted for Brexit.

There's a bunch of people not eligible to vote; eligible but not-voting; as well as those voting against brexit.


> Quantas, the airline of Australia, outsources its IT operations to IBM. They did not choose the cheapest alternative, but a reliable one.

Actually, Qantas has a huge 'workforce' of Tata onsite as well. Of course all the newer and 'more exciting' stuff is run either by in-house resources or 'digital agencies' - the more expensive kind of outsourcing.

(FWIW, it's Qantas - Queensland and Northern Territory Aerial Services)


>Quantas, the airline of Australia, outsources its IT operations to IBM. They did not choose the cheapest alternative, but a reliable one

And where do you think that work awarded to IBM is being done? Chances are it's India. As for IBM being reliable - are we talking about the same company known for its brazen incompetence in Australia?


Just send them feedback!

https://gfycat.com/AnyLegalBlackmamba


The main KPI for BA's IT function is the percentage of jobs they have moved "nearshore" (to Krakow).

Not that I think this is the Krakow workers' fault, mind you; rather this (and the string of similar IT incidents recently at BA) seems to fit the pattern of upper management focusing on driving down costs to the exclusion of everything else.


The system that's down was built, and is currently supported by, Tata Consultancy Services with support from BA IT staff. Once the system was built, BA made the majority of the IT department redundant.


The biggest Norwegian bank also uses Tata. A friend of mine who worked on a project there talked a few times with their programmers. The company treated them really badly, including things like trying not to pay for accommodation while they were in Norway for a few months. They also said that they were actively searching for new jobs.


I picture executives thinking of IT like rolling a big stone: once it's moving, get rid of the pushers.


Not Krakow, Poland, but India, via Tata; about a year ago they made a big move.


So the Indians screwed it up, but the Polish are taking the blame? How wonderful of you brits !!!


Krakow is a big tech center, is in the EU, and it's not obvious why moving jobs there would have anything to do with a drop in quality.


It's not about Krakow per se. But when a company works towards specific targets like "90/10 supplier offshore ratio", rather than metrics based on quality or efficiency, I don't think quality comes first.


On the one hand I've never seen a successful project with non-colocated teams, whether 300km away or 10.000km away. On the other hand, I'm surprised an airline is so localized in the UK.


Basecamp


Any distributed startup. There are thousands.


I agree with you on that. But I had to stick up for Krakow!


Relax, it's better if everyone thinks Krakow (and Poland) is a "third world shithole". Nobody's worried about it until it's too late :D


if the offshored personnel work via a bodyshop then the incentives are completely different compared to direct employees, and the resulting work is considerably lower quality (at least in my experience)

if offshored personnel are direct employees it's a completely different story

given the main cost saving comes from being able to "ramp the size of the pool up and down to meet demand" most large companies opt for the bodyshop solution...


replace a long established (expensive) team on a complex app who know all the bear-traps with an offshore team of rent-a-coders with a 6 month average turnover rate but at half the price - wcgw


What makes Krakow inherently better than say, Bangalore?

Not snark, just asking.


Much more friendly timezone (1 h from London, 6 h from East Coast). I'd also say the accent is easier on the ears, but I might be biased since I'm Polish ;) Also, it only takes 2-3 years on average in Polish civil courts to resolve a dispute (as opposed to supposedly decades in India) - in case you want to sue your services provider.


[flagged]


We've banned this account for repeatedly violating the HN guidelines.

National and racial slurs are not allowed here.


Until about 2 years ago I was gold guest list with BA - spending £30k+ a year with them, and sending lots of others on BA.

Their high costs were fine, but then they started cost cutting more and more, and enough was enough. Everyone I know has twigged that in Europe easyJet is far better, and for longer haul Virgin, the Gulf carriers, and even USian carriers are better.


Do you happen to know if they outsourced or if they opened a subsidiary there and just moved the positions?


It's a group function of the International Airlines Group (BA's corporate parent). http://www.iaggbs.com


Land costs are too high in the UK. Everything except banking must leave.


Land costs in the South East of England may be too high, but that really doesn't apply to the whole of the UK...


They are very high in most urban areas where you are likely to have a dev centre.


I'd say there are several locations in the UK with quite low costs (definitely massively lower than SE England).

For example Glasgow. Average residential prices are 23% of London, that's a lot cheaper and there's a good level of skilled workforce there.


South Wales, where one of the BA service centres used to be, for example, is low cost and 2 hours by train from London (there's Cardiff airport; and St.Athan's [ex-RAF?]).


Land costs are really not high in Birmingham and the Black Country, the second biggest urban area in Britain. You could build within walking distance of (say) Sandwell & Dudley station, with mainline trains to London, for comparative peanuts.


Waterside is absolutely able to accommodate a dev team.


Looks like their new (as of 2 yrs) CEO comes from the low-cost airline world: https://en.wikipedia.org/wiki/%C3%81lex_Cruz_(businessman) so that may be a hint...

Although, I've seen this sort of thing before... Usually it starts with a middle management hiring of some alpha-male business asshole who is driven to advance his career and thinks that because he can use a spreadsheet that he's qualified to run IT. He'll then go on to sell upper management on some kind of ridiculous story straight out of some bullshit CIO magazine about how consolidating all the existing best-in-class systems into one system will cut costs and open opportunities for building and mining customer data to increase revenue. He'll get the greenlight and a shit-ton of capital, and then he has to make the decision about whether to build or buy. That question is irrelevant, as our intrepid hero has no idea what the fuck he is doing and will fuck things up regardless of which path he takes. Once the "new system" is fully half baked, he then shoves it out all over the company in some ridiculous balls to the wall no going back roll out plan. Subsequently there will be huge problems, massive lines, pissed off customers, pissed off employees... but this is where our intrepid hero really shines. His mastery of the art of bullshit successfully deflects all blame from himself and his incompetence onto the users/operators (eg, people who are responsible for revenue generating businesses) of his monstrosity. Somehow it is forgotten that outages never happened before and all the revenue loss and customer ill-will is blamed on the operators for not having well-tuned disaster recovery plans in place for manual operation.

Of course the only disasters they ever dealt with were a result of his incompetence, and in a stunning feat of failing upwards, he's destined to helm the company (and then likely others) within a few short years.


Quite a lot of mixed gossip in the Daily Mail article

>Yesterday's issues are the fourth BA failure in the past month, with problems on June 19, July 7 and July 13

>Union leaders say hundreds of BA staff complained about 'FLY' system and most workers say it's not fit for purpose

>A survey by GMB of 700 staff in June found that 89 per cent said training was poor, 94 per cent suffered delays or system failures and 76 per cent said their health had suffered because of stress or anger aimed at them by frustrated passengers.

etc http://www.dailymail.co.uk/news/article-3695151/Philip-Schof...


That article is from July, 2016, which is why the dates "in the past month" don't compute.


One of the big take-aways from this is actually about how not to handle situations like this.

Problems happen - even huge ones - and I think most customers understand that. What they don't understand is being given little or no information about what they should do, and also being given vague, contradictory and even false information.

The BA Twitter feed seems to be the main source of dubious information. They are telling people to check the website for flight status info - but it only works intermittently, and different parts of the site say different things; for example, the flight status tool says my flight is cancelled, but the booking management tool says it's all fine.

On the Twitter feed they are telling people that they are contacting them and rebooking them automatically - but it's apparent this is happening for very few people.

They are telling people to rebook on the website - which only works intermittently, and will not allow rebooking even when it is working.

They are telling people to call them to rebook - but their call centres seemed to be working normal working hours (rather than getting all hands on deck), so were not in operation between 20:00 and 09:00 (or whatever; it varies by country). During working hours, when calling any call centre anywhere in the world, you just get a recorded message and are then disconnected. Some people say they did get into the call queue, and have been waiting on hold for 7+ hours!

They are sometimes telling people they shouldn't go to the airport to rebook, and sometimes telling them they should go to the airport to rebook.

Yesterday they were telling people they could book alternative travel with a different airline and then claim it back from them... and today they are saying they won't pay if you book alternative travel - this could cost passengers dearly.

The CEO, Alex Cruz, also made a laughing stock of himself yesterday when he randomly donned a yellow high-visibility vest to do a recorded message in an office.

Honestly, the whole thing has been a lesson on how not to treat your customers when things go wrong.


Airlines these days are run by a complex set of systems most of which have to work in order to have the airline function. I remember a few years ago SABRE (one of the 3 major GDS companies in the world) had a 4 hour outage (I worked for a division). Half the world's airlines stopped functioning. Generally these things seldom go down but when they do, hell breaks loose. Often these systems not only do reservations, but crew scheduling, weight balancing, check-in, manifest creation and a whole host of other small but important things. Airlines also integrate some of their own pieces into the contracted ones, and some just contract almost everything. It usually comes down to money. Even Southwest Airlines which doesn't share its res data with anyone uses a GDS backend for a lot of things (in this case SABRE).


Southwest was never on Sabre, per se. They had their own, separate reservation system, operated by Sabre the company, but not the main "Sabre Res System" / PSS. It was called SAAS, and sometimes "Cowboy".

They moved off of that recently though, and are now on Amadeus/Altea.


Was true when I worked there, of course changed since then.


Your comment reminds me of how crazy duct-tape and bubblegum is in our IT industry.


Perhaps related to our calling people "engineers" who aren't anything of the sort...


I'm curious if anyone can give insight on how the passenger backlog is resolved in these situations. How was it done before smart digital systems (probably circa 1980's?) and how is it handled today given all the intertwined applications. I can imagine that it must be fascinating and equally exhausting on a grand scale.


>How was it done before smart digital systems

Manually. The TPF reservation systems have a concept of "Queues". They would place travel records that needed to be reaccommodated onto a queue. Then, reservations agents would "work the queues" from their green screens, make phone calls, etc.

>how is it handled today given all the intertwined applications

Depends on the airline, and the system, but the general answer is "partially automated". The processes for less widespread issues are more automated. Think of something like a major storm in the northeast. Global outages are less automated because you're dealing with multiple issues, not just passengers.
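
To make the "work the queues" idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the record fields, the locators and the flight numbers are made up, and real TPF queues and travel records look nothing like a Python dict. It only shows the shape of the process: records are placed on a queue, an agent takes them oldest-first, and anything that can't be resolved goes back for a later pass.

```python
from collections import deque


def make_record(locator, passenger, cancelled_flight):
    """Hypothetical, heavily simplified travel record."""
    return {"locator": locator, "passenger": passenger, "flight": cancelled_flight}


def work_queue(queue, handle):
    """Take records oldest-first and pass each to an agent callback.

    `handle` returns True when the passenger was reaccommodated.
    Records that could not be resolved are returned for a later pass.
    """
    unresolved = deque()
    while queue:
        record = queue.popleft()
        if not handle(record):
            unresolved.append(record)
    return unresolved


# Made-up records that need reaccommodating after two cancelled flights.
reaccom_queue = deque([
    make_record("ABC123", "Smith", "XX100"),
    make_record("DEF456", "Jones", "XX100"),
    make_record("GHI789", "Patel", "XX200"),
])

# Toy "agent": in this pass, only passengers from XX100 can be rebooked.
leftover = work_queue(reaccom_queue, lambda r: r["flight"] == "XX100")
print([r["locator"] for r in leftover])
```

The point of the queue is simply that nothing gets lost: whatever the agent can't fix now is still sitting there for the next shift.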


> before digital systems (probably circa 1980's?)

IBM offered ACP (Airline Control Program) with their new mainframes running OS/360 in 1965. Later they changed its name to ALCS and TPF, still used today by reservation systems.


Shopping, flight status, etc, on their website is working, so their central reservations system isn't down.

http://www.bbc.com/news/live/world-40069977 says "A BA captain has said the failure affects the passenger and baggage manifests"

So it's a legality/operational thing. Passengers can get boarding passes, etc, but the plane isn't allowed to take off without proper manifests.


According to reports on flying forums, check-in is being processed manually with huge queues as a result, flight status screens aren't working, etc.

Until an hour or so ago, login to the website was down for me.

So it seems it's a very widespread outage that they're in the process of recovering from


Likely they turned both of those off on purpose. If you can't depart due to lack of a manifest, you turn off flight status and check-in.

I was able to do a flight status a little while ago though.


If that's true then it's probably their operations system, i.e. flight management, digital flight bag, etc. Without those systems you can't file flight plans easily, fuel the planes with the optimal amount for the weather conditions, etc.


I would guess some component of what they call "Departure Control".


I wish there was a listing of company IT quality somewhere similar to the list of 2Fac financial institutions.

I'm leaving my citi card for a chase one due to poor infrastructure but I could have avoided a lot of wasted time if I hadn't opened it in the first place.


Problem is it's highly variable from engagement to engagement.

A lot of the poor quality and performance could be down to bad process/management on the client's side (though an outsourcing company will profit hugely from it).


I feel close to their IT staff having to deal with this on a Saturday afternoon.

Partially OT: anyone wanting to share any on-call horror stories? :)


Airlines are especially stressful, as the problem compounds with time. Once you hit 45 minutes or so of downtime, you start invalidating downstream flight connections. Two hours in, and you've issues with crews not being legal to fly. Four hours in, and there's not enough capacity the following day to fix the missed flights from today, etc.
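
As a rough illustration of how that compounds, here is a tiny Python sketch. The thresholds are just the rules of thumb from the comment above, not official figures, and the function name is made up.

```python
# Thresholds (in minutes of downtime) taken from the comment above.
# They are rough rules of thumb, not regulatory or airline-specific figures.
STAGES = [
    (45, "downstream flight connections start being invalidated"),
    (120, "crews start not being legal to fly"),
    (240, "not enough capacity tomorrow to fix today's missed flights"),
]


def knock_on_effects(minutes_down):
    """Return every knock-on effect that has kicked in after `minutes_down`."""
    return [effect for threshold, effect in STAGES if minutes_down >= threshold]


for minutes in (30, 60, 130, 300):
    print(minutes, knock_on_effects(minutes))
```

Note that the effects accumulate rather than replace each other, which is why a four-hour outage is so much worse than four separate one-hour outages.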


Note to self...never work in airline ops. That sounds like every day is on the edge of being a disaster.


are there commonly algos or decision support tools to help unravel that in an optimised way?


Sort of. The software is called "automated re-accommodation". But you're solving for several intertwined problems. Which aircraft models / tail numbers fly which flights...they have different seating capacities, nautical range, etc. Which crews are assigned to which aircraft. They aren't all qualified to fly every model. And, they aren't all in the right city, so you have to "deadhead" them there. And, finally, which passengers go on which flights.

They do have solver/optimizer algorithms, but you can imagine it's not a button press thing. There's a lot of human process, trial and error, etc. Oh, and federal laws about crew hours / legality plus union work rules. You can't just assign crews wherever you want for example...you have to consider seniority, their "home base" where they live, etc.
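
For a feel of why this isn't a button press, here is a deliberately naive Python sketch of the intertwined pieces: crews must be qualified on the aircraft type, may need to deadhead to the departure city, and passengers can only be seated on flights that actually have a crew. All names, types and data are hypothetical, and a greedy pass like this is nowhere near what real solver/optimizer products do (it ignores crew hours, legality, seniority, home base and union rules entirely).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Flight:
    number: str
    origin: str
    dest: str
    aircraft_type: str
    seats: int
    crew: Optional[str] = None
    passengers: list = field(default_factory=list)


@dataclass
class Crew:
    name: str
    city: str
    qualified_on: set


def assign_crews(flights, crews):
    """Greedy: pick a qualified free crew, preferring one already at the origin.

    Returns the deadhead moves (crew name, from city, to city) this requires.
    """
    deadheads = []
    free = list(crews)
    for f in flights:
        candidates = [c for c in free if f.aircraft_type in c.qualified_on]
        if not candidates:
            continue  # nobody qualified: the flight stays uncrewed
        # False sorts before True, so crews already in place come first.
        candidates.sort(key=lambda c: c.city != f.origin)
        chosen = candidates[0]
        if chosen.city != f.origin:
            deadheads.append((chosen.name, chosen.city, f.origin))
        f.crew = chosen.name
        free.remove(chosen)
    return deadheads


def seat_passengers(flights, stranded):
    """Seat (name, origin, dest) tuples on crewed flights with free seats.

    Returns the passengers who could not be seated.
    """
    left = []
    for name, origin, dest in stranded:
        for f in flights:
            fits = f.origin == origin and f.dest == dest and len(f.passengers) < f.seats
            if f.crew and fits:
                f.passengers.append(name)
                break
        else:
            left.append((name, origin, dest))
    return left


flights = [
    Flight("XX1", "LHR", "JFK", "widebody", seats=2),
    Flight("XX2", "LHR", "EDI", "narrowbody", seats=1),
]
crews = [
    Crew("crew_a", "EDI", {"widebody"}),
    Crew("crew_b", "LHR", {"narrowbody"}),
]
stranded = [("p1", "LHR", "JFK"), ("p2", "LHR", "JFK"),
            ("p3", "LHR", "JFK"), ("p4", "LHR", "EDI")]

deadheads = assign_crews(flights, crews)
unseated = seat_passengers(flights, stranded)
print(deadheads)
print(unseated)
```

Even in this toy, the only crew qualified on the widebody is in the wrong city and one passenger is left over. Scale that up to a whole network and add the legal and union constraints, and you can see where the trial and error comes in.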


I was in the middle of some major rail cancellations a couple of months back over 2 days - on the 3rd day many of the assets were out of place around the country - they were clearly trying to solve it manually and the impact escalated over the day even though most of the original problems had cleared

it's got to be a great problem to work on - and must be pretty rewarding to watch when you get it right


Remember on 9/11/2001 when the US FAA declared a nation-wide ground stop, and rerouted all inbound international flights to other countries (e.g. Canada https://en.wikipedia.org/wiki/Operation_Yellow_Ribbon)?


Do you have some airline/IT ops stories about that day and the aftermath? If so, I (and I'm sure others) would be delighted to hear about them.


One vendor of those optimization tools is Jeppesen, which about ten years ago bought up two previously independent companies in the sector, SBS (New York) and Carmen Systems (in Gothenburg, Sweden).


I suspect that there's a lot of eating crow and bargaining with rivals for space on their flights to help clear the backlog, but each incident is going to have unique elements to resolve in it...


Sadly that doesn't work for any serious outage.

If it's only one airline down, you can get away by buying tickets on competitors.

If it's multiple airlines, or the upstream reservation system, or a local meteorological incident, or an airport-wide issue, etc... there are no planes flying.


It's standard practice for airlines to fly each others' crews around even during normal operations (often at no cost).

For example, the overbooking that led to the beating on the United flight was the result of an aircrew from another airline being booked on at the last second.

I would imagine that when the shit hits the fan like this, the other airlines are very sympathetic and will do whatever it takes to get BA personnel where they need to go. After all, next month it could be their own systems that are down.

(assuming it's not one of the shared services causing a global outage - not much you can do when everyone is down)


> For example, the overbooking that led to the beating on the United flight was the result of an aircrew from another airline being booked on at the last second.

That's not quite what happened. The flight was operated by Republic Airlines, on behalf of United. Republic bumped the passenger to make way for more of their own crew - it's just that crew was going to fly a Republic flight operating under another, different, carrier. But both crews were employed by Republic.


>BA chief executive Alex Cruz said: "We believe the root cause was a power supply issue." //

That shut down BA operations World wide? Does that seem likely? Possible? They don't have power fail-over and operational centres in different countries?

I've been re-shaping my aluminium foil hat and wondering if there wasn't a specific terrorist threat that's been covered up; but then where I am there have been 2 cities suffer bomb threats (with attendance of bomb disposal and armed police) that seem to have been buried in the news completely. Also I've heard elsewhere that staff on the ground reported the incident as due to "hackers" almost immediately - the speculation being 'before they could possibly have known that' - which suggests some sort of disinformation process.

/wild-speculation


Same thought here. Ten years ago the UK foiled a terrorist plot to bomb 6-7 airplanes in flight from the UK to the US. That's the origin of the ban on bringing large amounts of liquid on board. [1] The recent bomb in Manchester was apparently made using peroxide, too.

Assuming they identified a real, immediate, and massive threat, I can see why they would prefer to ground all planes until they've sorted things out.

[1] https://en.wikipedia.org/wiki/2006_transatlantic_aircraft_pl...


Half the planes taking off at Heathrow are not BA. If they wanted to ground the planes, shutting down ATC would be the better way.


You're assuming the hypothetical threat is not specific to BA.


But even if Dr Evil blew up their server room (which is ridiculous, by the way), they should still have failed over to their B site with a few hours interruption.


Could have been a power feed to a datacenter. We can't always take newspaper descriptions as literally as we might like.


Airlines should band together and form a working group to redevelop the old system in modern open source tech.

Yes it would probably take 5+ years to develop and roll out, but they can't keep maintaining these 50 year old mainframes that cost them tens of millions a year in downtime.


For what it's worth, it's not generally the mainframes that cause the outages. It's the distributed ecosystem where you need a bunch of disparate systems from different vendors around the mainframe, all coordinating, to be able to fly planes, sell tickets, process boarding passes, etc.

There were fewer global outages when all of the functionality was on the TPF mainframes with dumb green screen clients.


> TPF mainframes with dumb green screen clients

Implemented and operated by in-house staff.


> Airlines should band together and form working group to redevelop the old system in modern open source tech.

It's surprising how quickly one can suggest a solution based on... what exactly?

Imagine a doctor prescribing "this new fantastic medicine" based on "patient is sick" description. While personally I'm all for open source this is not a silver bullet and should be carefully considered.

Joel had an interesting blog post [0] back then about Netscape rewriting their software.

[0]: https://www.joelonsoftware.com/2000/04/06/things-you-should-...


I'm willing to commit to a serious effort in this area if anyone wants to collaborate - email in profile. I've been digging in to real time logistics for my existing business recently anyway, previously founded a 3300+ hotel reservation network and contributed a lot of Wikipedia content around PNR records and their political background/intelligence use, so have some familiarity with industry structure and protocols, and built and defended HA 24x7 transaction infrastructure for a major Bitcoin exchange. I would seek people with more experience close to or inside the major booking networks or airlines as collaborators.


Redeveloping large, complex systems in anything, let alone "modern open source tech", is much more difficult and prone to failure than one may think.

There are many factors that contribute to these difficulties, but tech is not near the top of the list.

There are many cautionary tales, including the $8-billion FAA AAS system rewrite failure [1].

Contributing factors include

- Regulatory processes

- Subcontracting

- Project management

- Politics and turf wars

- The extreme difficulty of reengineering a complex system when few, if any of the original designers and developers are driving it [2].

[1]: http://sebokwiki.org/wiki/FAA_Advanced_Automation_System_(AA...

[2]: https://www.dropbox.com/s/n0nud2xp2a9bxjo/Programming%20as%2...


"Redevelop the old system in modern open source tech" is a tough sell when so many of these airlines are differentiating on technology. Myself and many other frequent fliers will not engage with companies with poor functionality and bad UX. We move or stay loyal when airline tech allows us to do & see most of the things we need.


Their tech is not open enough to even allow for proper innovation and differentiation. That's the main issue with legacy IT.

When you live in a world w/ Docker, REST (and every other open tech there is), you can build systems which are way more innovative.


> When you live in a world w/ Docker, REST (and every other open tech there is), you can build systems which are way more innovative

That's quite funny because Docker and REST are just half-arsed reinventions of mainframe features from the 1970s.


I completely get where you're coming from, but you also have to acknowledge for example that Docker is open source.

That's not a trivial add-on to IBM mainframes: it gives every developer (w/ minimal resources, e.g. a laptop or free tier cloud service) the ability to run production-like environments.

Sometimes, we take that for granted, but having worked for an airline IT, I noticed the bottleneck didn't come from hard algorithmic issues (most advanced route features were very basic to implement), but from the huge leap between production and dev environments: this was not Ubuntu on a server VS a MacOS on a laptop (manageable...), this was Ubuntu VMs (so should be close to prod?) vs cryptic data center clusters that had impossible-to-replicate features.

As for REST, airline IT use an outdated messaging mechanism. It implements versioning and grammar, so should be clean and nice to work with? Not really... The messages were impossible to read for an inexperienced dev (as opposed to XML or JSON).

I heard plans to put JSON blobs in one of the fields of the messaging mechanism (completely destroying the value of versioning and grammar btw). That was not necessarily a bad idea, just a reaction to the lack of supported tooling and readability for an obscure messaging mechanism.

Again, I get where you're coming from, but I'm just allergic to nostalgia for the sake of nostalgia where the old systems clearly lacked essential features for 2017.


They can build an open-source back-end and differentiate in the front end.

Or even build common components/ technologies together and then build their back-ends on top - that is basically what the rest of us are doing...


Is that not how Amadeus happened?


Yes, they should build an even bigger system with more requirements from each of the airlines, and have the same consulting company implement it. What could possibly go wrong?


You think these organizations can agree upon anything deeply technical?


redeveloping legacy systems (innovation) is a concept that is diametrically opposed to reliability.

I'd rather fly on an airline that hasn't 'upgraded' core systems.


The facts are clear. No effective mirror system or failover system. I've architected major 365x24 systems repeatedly in my career: removing the possibility of power supply failure across the primary and secondary sites (and even within a single site) is absolutely fundamental. That clearly has not been done here, at least to the extent of making sure that the solution is effective - and the blame, I would guess, lies within a crazy management and decision structure that seems to increasingly permeate IT within large companies.

Tata do have some good staff (from my first-hand experience over many years). However, from my more recent experience, there is an increasingly MASSIVE CLIFF EDGE in talent/competence within Indian companies generally (and to some lesser extent UK companies as well), so the possibility of having "very assertive, but stupid or incompetent" middle and even senior managers is becoming an ever greater real-world problem. The criteria and method for selecting candidates for IT-related work do not help - in particular the role, competency, and "modus operandi" of recruiting agencies is pretty atrocious, as many a seasoned contractor will attest!

But ultimately, I think Alex Cruz and his CIO must take responsibility for a lack of diligence and competence. They clearly do not understand the most basic truth of high reliability, mission-critical IT: they NEVER needed cheaper IT staff (with all the attendant risk within their industry), but rather far fewer technical staff, of the HIGHEST QUALITY, for all BA's key systems. And that I’m afraid means probably not putting Indian companies at the top of the list.


One customer was told that the root cause was a "lightning strike on a datacentre", although it sounds pretty unlikely (there are storms today in the UK but surely there'd be a DR plan?)


They do have DR plans, but the reality is that whole thing is a huge distributed system with parts from different vendors, in different locations, etc, with decades of legacy.

A power outage that reboots a whole datacenter would screw any major airline for at least half a day.

Thus far, nobody has found the economics of a modern, truly HA setup worth the cost. Outsiders greatly underestimate the complexity too. Think something like 40 disparate applications from different vendors, or some homegrown systems, in different geographical locations. Then, all the client applications are in buildings you don't own (airports) where you aren't allowed to control the infrastructure.

If it were a high margin business, perhaps things would be different. It's not.
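
A back-of-the-envelope Python sketch of why the number of applications matters so much. The 40 comes from the comment above; the 99.9% per-application availability is purely an assumed figure for illustration, as is the simplification that failures are independent and that every application is needed to fly.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(per_system, n_systems):
    """Availability of a chain where all n independent systems must be up."""
    return per_system ** n_systems


def expected_downtime_hours_per_year(availability):
    """Expected hours per year during which the chain is not fully up."""
    return (1 - availability) * HOURS_PER_YEAR


single = 0.999  # assumed "three nines" per application, not a real BA figure
chain = combined_availability(single, 40)

print(f"one application:  {expected_downtime_hours_per_year(single):.1f} h/year")
print(f"40 in a chain:    {expected_downtime_hours_per_year(chain):.1f} h/year")
```

Under those assumptions the chain as a whole sits at roughly 96% rather than 99.9%, so pushing it back up means making every one of the 40 pieces substantially more reliable, including the ones in airports you don't control. That is where the cost goes.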


> If it were a high margin business, perhaps things would be different. It's not.

Most of the loss here is not from lost bookings; a lot of people who prefer BA will probably just wait and book once the systems are restored.

It's going to be from secondary losses -- lawsuits, ding to the reputation and so on.

These numbers are estimatable, even for low-margin businesses.


I agree, but I've been around this sort of thing and seen the post incident analysis, been in the discussions, etc. Despite the huge costs of these outages, they pale in comparison to the costs of a real HA solution for all "needed to fly" applications.

To give you some idea, ITA Software was bought by Google for $700 million. They were some of the best and brightest minds in this space, on par with any Silicon Valley darling. They successfully wrote a modern replacement for one popular airline function...shopping. They failed, however, at delivering a modern reservation system, despite tons of money and talent invested.


> Despite the huge costs of these outages, they pale in comparison to the costs of a real HA solution for all "needed to fly" applications.

Then, quite honestly, it's the rational decision for airlines to take.

Even Google accepts that perfect reliability is impossible, and they're sailing in a pillow-strewn gold-plated yacht down a Mississippi of money.


I thought that they successfully delivered a passenger scheduling system, but it failed to gain market traction. I don't know the details, but I'd imagine it would be a huge migration cost even if the new system was much better.


that would make a lot of sense - there was a heavy storm that moved north from south coast right at that time

you can have all the DR in the world but if you never fully test it - you never know if it's going to be 100% on the day - the only way to fully test is to do a full failover from the primary and see what happens - but it could be a real career low if you do that on something so commercially sensitive and it all goes very wrong

interestingly it's a bank holiday weekend - so every possibility some people who knew some of the special incantations to get things back up on days like this are out camping somewhere remote - and will have 500+ messages when they get back in to signal tomorrow and wonder why they ever try to go away - and if they ever do go away again - wonder how they will ever be able to relax


Surely they have multiple datacentres as redundancy.


The two data centres are geographically very close to each other, straddling the runway.


It's also not generally that easy. There's more than one system required to operate flights, usually more than 10. Some would be in BA data centers (UK), some in Amadeus/Altea data centers (DE), etc. Some applications are vendor provided, some in-house. So, you can have two of everything. But when you have something big, like a power outage, getting everything talking to the "surviving/DR" instances of all the different systems isn't easy. And, the outage itself causes a huge spike in client-driven transactions. Every passenger and employee is hitting whatever the equivalent of a reload button is.

The fix is more expensive than any major airline has been willing to spend thus far.


I've managed shifts of enterprise IT application support to India. I have to say when it's done right, it can work well sometimes.

But my experience has taught me this. The vast majority of the time it's a fragile egg shell waiting to crack. When it fails, it fails miraculously. An IT support team on this scale is one of those things you should keep close to your product or customer.

Occasions like this I guarantee a bunch of execs somewhere in BA will pay ANYTHING to have some of their loyal IT staff back to take control of the situation.


But within 4 months no one in the BA C-suite will remember this happened.


There might be a better word than "crashing" when describing airline computer systems malfunctioning. Nevertheless, was happy to hear it wasn't a plane.


I can't help but be amused by the fact that all of the money British Airways thought they saved by outsourcing their IT was just squandered on the additional expense of fixing this mess. And I'm not even factoring in all of the money this bad PR is going to cost them in the near and long term.


All those UML diagrams didn't make the software good.


Henceforward, BA passengers will be required to bring portable power supplies with them and make them available to airport personnel when asked to do so. Failure to do so will result in forced removal from the aircraft. Over and out.


Nah, BA isn't a US airline.


Companies ruined or almost ruined by Indians http://sammyboy.com/showthread.php?98021-Companies-ruined-or...


TL;DR: yet another massive chaos because of some "smart" PHB doing "cost reduction" in IT.


Apparently the whole of Heathrow airport was shut down yesterday. United and Lufthansa had to cancel their flights there too. So it meant more passengers trying to get to neighbouring countries on other planes.


For reference, British Airways' parent company made a profit of €2.5 billion last year, and expects higher profits this year [0].

Without meaningful consequences at the top of the executive chain for sub-par IT/infrastructure quality, these kinds of incidents seem inevitable. But how do you hold people responsible for "bad" software? We could adopt something akin to how PE licenses are required for civil engineering in the US. I suspect it is in the industry's best interest to address this need before a government entity decides to.

[0] http://uk.reuters.com/article/uk-iag-results-idUKKBN1630MA


Crash might not be the best word to put in a sentence containing British Airways. My brain was frozen for a couple seconds until I understood what had crashed.


I have zero evidence, and am simply working through a thought exercise, but could it be targeted?

Maybe a group on behalf of a nation state is testing out its muscle. Or sending a warning. Maybe the UK did something recently a nation didn't like, and this is the new form of "diplomatic protest". Or sending a warning shot.

I remember reading last year that the large Delta airlines outage was cyber terror related.

It really feels like we're not too far off from a war between two nations without a single bullet fired. What happens when a country is hit with nation level ransomware? Ie not, "give us $300 bucks and you'll get your PC back", but "sign XYZ treaty and you get your country's water system back"? Or "we'll restore your internet and turn your powerplants back online"? How much will a wealthy country (like us in the US) be willing to stomach of seeing people going thirsty and hungry before they want their govt to capitulate?

Of course the world will condemn and complain, but as has always been the case the country with the largest Army makes the rules. And we might be using the wrong measuring stick. Microsoft missed the boat by thinking (along with most of the industry) that measuring computing meant measuring PCs. It wasn't until it was too late that it realized that computing was about to be dominated by Mobile and smaller. They were using the wrong measuring stick.

Are we on the verge of an era where an army should be measured in its digital strength and not its physical strength? It's a very scary thought.

As an American I think first of the risks to my own country. How long could we sit with a nationwide blackout, and the internet down (for anyone using backup generators)? Food rotting in shipping containers because no cranes can offload it. No easy communication or flights for quick way to move people or things around.

As a country (and as a world) I can't help but feel we're really not doing enough to take a risk like this seriously. It's just a feeling, but based on small incidents here and there it really feels to me like something in the next 5-10 years is gonna happen that'll make us all drop our jaws and say "I didn't think this was possible."

Regardless of your political beliefs, a lot of people around the world didn't think last year's US election was possible. I was shocked by it as well, but it also opened my eyes to how much more really is in the realm of possible.

As engineers we have a much deeper knowledge of the risks involved and as a result a much greater responsibility for raising awareness and getting the problem fixed.

EDIT: I'm not sure of the original location I saw the Delta stuff, but a quick look now turns up this link.

http://observer.com/2016/09/did-a-cyber-attack-ground-delta-...

And yes, I'll repeat this above comment is purely speculative by me.


There's a little insider knowledge on the Flyertalk forum.

See this post, for example...a BA employee: https://www.flyertalk.com/forum/28366141-post168.html

Later, there's some talk about a power outage and/or lightning strike, again from BA employees.


As a data point, there was an intense lightning storm in south west UK (where I live atm) around 3am this morning (~13 hours ago).

There were lightning flashes a few times per second, near continuously... for about 1 1/2 hours. I've never seen anything like it before.


I could be mistaken, but from what I'm seeing in that thread they really have no idea what caused it. Which, btw, seems fairly reasonable for how early into this incident it is. I wouldn't expect a root cause for a while.

Delta Airlines also called their outage a "power outage", but it looks like there was reason to believe it wasn't.

http://observer.com/2016/09/did-a-cyber-attack-ground-delta-...


I don't put much stock in that article. Their conspiracy theory is that no other businesses saw a power outage. But it didn't have to be a supplier/mains problem. They could have had an issue with their own, internal power grid. Their data center likely has a UPS, generators, huge switches to move back and forth from battery/mains/generator etc. A fault there would cause an outage.

Edit: Yep. http://bgr.com/2016/08/14/delta-finally-explained-how-one-po...


> I remember reading last year that the large Delta airlines outage was cyber terror related.

Where'd you hear that? AFAIK, it was a botched failover to backup power.

https://www.theregister.co.uk/2016/08/08/computer_fault_take...


> nation level ransomware? Ie not, "give us $300 bucks and you'll get your PC back", but "sign XYZ treaty and you get your country's water system back"?

The lesson from Ukraine to Trump is that, far better than something obvious like turning off the power, a deniable attack on the political system is extremely effective.

There has been a "cyber war", the US lost, and it's not yet clear how the US political system will recover.


Maybe, but I suspect the effects of stupidity would be so strong that you'd never be able to detect the effects of a hostile nation.


Who is John Galt?


Hopefully, this will be a wake-up call to the industry - government regulation in IT is needed.


The UK government (perhaps the relevant one in this case) does not have a stellar track record of figuring out how to make big IT systems and projects successful. This knee-jerk appeal to regulation is unconvincing to say the least.


A modified form of ITIL would do a lot of good for important sectors. Let me put it another way, if the industry won't regulate itself, then they will be regulated by force.


> A modified form of ITIL would do a lot of good for important sectors

You've never used ITIL for real, have you? Because if you had you would know that no amount of process can replace good engineers, and good engineers don't want to work in ITIL shops...


What's always worked in the past is hire good people, pay them well, treat them well and train them well. Pretty simple. Authoritarian type companies tend to do poorly at that though.


if this is another case where they employed a horror coder who doesn't know how to program, they'd better employ me!


Maybe they hired the managers who handled Deepwater Horizon. Although this time around, I don't think Donald Trump should send a nuclear sub to drop a nuke down the shaft on day two of the disaster. I'd wait at least a week.


This is a distant early warning of the impending demise of commercial aviation. The inexorable rise of the price of fuel will eventually make flying unaffordable for all but the military and civil servants. Airlines are already struggling to remain solvent, and are cutting costs everywhere they can: salaries, squashing passengers to increase seats/plane, nickel-and-diming us with baggage surcharges, etc.

Offshoring IT is an example of cutting costs to the bone.

Your grandchildren will only know of flying as an ancient legend.


This is nonsense, fuel prices have been plummeting amid new extraction tech and long term demand destruction. These are fundamental changes and if you're still droning on about peak oil you're just operating on old bad info.



