I am reminded of a post a while back regarding AWS's issues affecting multiple data centres (I forget the specifics), and how their post mortem didn't apportion blame to anyone (which it easily could have), but rather to their own checks and balances, which had allowed the issue to arise in the first place. I do hope that when the dust settles we see a measured response rather than a witch hunt.
I'm reminded of airplane accidents: Whenever you hear of an airplane accident, it's always some amazingly crazy series of things going exactly wrong to get the plane to crash. We have a tendency to think "wow, what bad luck", but a better way to think about it is that airplanes are so safe that an accident can't occur unless a whole series of things go very specifically wrong.
A company's goal should be to increase the number of necessary things that need to all go wrong before there is downtime.
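That goal has a simple probabilistic reading: if the safeguards fail (roughly) independently, each extra layer multiplies down the chance that everything fails at once. A toy Python sketch, with made-up failure probabilities:

```python
def simultaneous_failure_probability(layer_failure_probs):
    """Chance that every safeguard fails at once, assuming independence.

    Real failures are rarely fully independent, so treat this as a
    best-case illustration rather than a model of any actual system.
    """
    p = 1.0
    for fp in layer_failure_probs:
        p *= fp
    return p

# One safeguard that fails 1% of the time: an incident in 1 in 100 cases.
print(simultaneous_failure_probability([0.01]))      # ~1e-2
# Three independent safeguards: an incident in roughly 1 in a million cases.
print(simultaneous_failure_probability([0.01] * 3))  # ~1e-6
```

The aviation analogy above is the same idea: accidents require a long chain of specific failures, so adding genuinely independent layers buys outsized reliability.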
While there are always technical causes for larger technical failures, I've seen far too many RCA post-mortems that turn into witch hunts instead of a solemn contemplation of how things could be done better by everyone. Such an RCA may ignore that a normally careful engineer was overworked by managers; a lack of relevant monitoring and testing due to budget cuts is never cited; and you'll certainly never see "teams X and Y collaborated too much" as a reason for failure in these places. Because in a typical workplace, the company's values and culture are never related to a failure. You can't objectively measure how bad or how good a culture is, either. Why make it part of post mortems when you don't think it's a failure?
As an aside, I met someone who was working on a graph theory problem as their research project, and the application was that you could model the entire process of aircraft control through a state machine using that graph. Effectively, they are working on making it mathematically impossible for a crash to occur, assuming that a certain process is followed (with safety measures, of course).
The challenge is to avoid pushing all the risk into that assumption. It's easy enough to build a system that never breaks if you're willing to assume perfect behaviour on the part of its dependencies, environment, users and operators.
> It's easy enough to build a system that never breaks if you're willing to assume perfect behaviour
It isn't, though. Seriously, think about how you could safely route several thousand flying hunks of metal (all of which have inertia) through fairly small air corridors while maintaining strict flight schedules. Then think about how you need to factor in all of the edge cases caused by emergencies on planes (these are all included in the process for flight controllers). Then think about how you could mathematically prove that safety.
Yes, it's easier if you assume that people will follow a certain process (and actual flight systems have so many layers of fail-safes that it's ridiculous) but it's definitely not "easy enough".
Assigning blame does not move the needle at all.
That said, having almost entirely dodged any outsourcing-related issues in the 90s, and worked with generally great offshore teams, seeing my current role impacted by an utterly shortsighted and ignorant attempt to offshore critical operations tasks is quite disheartening. It's rarely the fault of the teams; it's the fault of higher-ups who completely fail to grasp the complexity and consequences of the tasks they're offloading. Everything looks great for a few weeks or months or years until one of the dozens of things that have gone neglected rears its ugly head. If they're lucky, they take the money and run before the crash happens and escape most blame.
“We started using the new system in October. Training aside, the whole thing has been a disaster."
“It breaks my heart to see this as I love this company but it is really going down the pan."
“It’s got so bad that some staff members have written to the transport secretary Chris Grayling. All of our concerns have fallen on deaf ears."
“The Chief Executive, Alex Cruz, when he was warned about the system, told us that it was the staff’s fault, not the system’s."
1. Massive security breaches affecting major corporations, governments, and so on.
2. Massive IT outages, affecting corporations, governments and so forth.
It doesn't take a lot of incidents to mark them as 'massive' since things are strongly centralized in this world.
Meanwhile, in the last 15 years of my IT career, never have I seen such a strong push towards offshoring. One would think that with such a vulnerable IT landscape, mixed with an unprecedented dependence on IT infrastructure, CTOs would want to spend more and not LESS on operational costs.
I'm slightly baffled.
As I've been working in the airline IT industry, I've observed that the reality is more nuanced.
Core booking and ticketing systems are 'outsourced' to solutions from Amadeus and Sabre. This makes sense not only for many strategic reasons (easier to have airline alliances when the airlines are all on the one system), but also because airlines are in the business of flying planes, not building ticketing systems.
Airlines are often full of legacy and old systems which you just can't hire developers for (in the numbers, and at the costs, desired) - this is where Indian outsourcing firms come into play and fill the gap. They're taking on all the jobs and work that no one really wants to do anyway.
However airlines, and many other industries (banks come to mind), are realising that their digital offering is just as much a core part of their business, if not more, and that those are skills they need to bring in house and own if they want to be competitive. I'm seeing more and more web and app developers brought in house so airlines can establish these competencies. Digital/product agencies are 'stepping in' in the meantime to help out (like Virgin America and Work & Co).
It baffles me that they ever thought outsourcing part of that logistics was a good idea. But then I've worked at tech companies that tried to outsource their IT departments too.
Airlines aren't just responsible for flying planes. They are responsible for the planes. It's not enough for the plane to show up in Chicago by 5pm, it has to be there and all of the necessary maintenance and safety work has to be done by 5pm. Anyone who could figure that out for them would simply go into business for themselves and compete.
I get the impression that, for the most part, no one really believes in security, or understands that it's a process and not a product. They already tried throwing money at the problem by purchasing security products, which ultimately failed to deliver security. Now they're trying the opposite by cutting costs. Meanwhile there are very few consequences to losing control of people's data.
SSIs, or software security initiatives, when green-lit based on buy-in from governance, can introduce secure processes.
But before this happens to any meaningful level, a risk-based analysis should prove it is cost effective. Guess what those analyses say?
So true. I used to wonder why corporates can't build a real security team. They'll spend $1M on an annual contract when they could perhaps do okay spending $700,000 on a few security engineers. Then I realized that most corps are really looking for insurance.
You see, managed security vendors have a team. You don't manage the hiring, the tool building, etc. Obviously you can't just sit back and watch them put out the fire: you train them to understand a bit about your environments, you work with them on triage and resolutions, and you work with them to integrate your systems with theirs (e.g. installing an agent).
But it isn't in most enterprises' interest to build a strong security team. Much of the in-house security team's role is to manage incident response. Many of them don't really have a say; they are consulted and that's it.
Many of the product demos sound exciting, and if you are not careful during your POC you will end up with a mediocre product. Even if you did your best during the POC to evaluate the product, you will realize the product is really mediocre: it catches the low-hanging fruit and is full of false positives. You ask for better analysis, but because these products come in as black boxes, there is a lot of back and forth before you can act on the issue. So at work I would get a ticket from the vendor and end up doing a lot of the analysis myself. That sucks because while I enjoy doing security triage, that's not my role; the conflicting side is, at least they caught something. I would still have to engineer or deploy some solutions and monitor the whole thing, since no one person or team can do everyone's job.
Since no one wants to be responsible for others' jobs, that's the whole point of having a security team. It just happens that the team - the people who are building tools and monitoring incidents - is outsourced.
Most importantly, there is very little control over what your vendor can do. Want a new feature? Want better reporting? Want to change some configuration? A lot of the time you are out of luck, or you need to be patient.
But, seriously, security practice is not magical. There are best practices such as SSHing to servers using key or certificate authentication rather than passwords, and running processes with least privilege, so sysadmins/devops should create a checklist of what's in place and what isn't. It does take a serious commitment from management to move forward, though.
And security is an iterative process. Evaluate the most secure option possible for the time being, and put a realistic plan together for the remaining unresolved concerns.
The incentives are the same as they always were. A well run IT organization can coast on momentum for a long time - everything is neatly automated. So they sack their good engineers and hire an outsourcing company and save an amazing amount of money just long enough to get that next bonus or promotion, and they're long gone when the momentum runs out and the wheels come off. Repeating the same scam at another company.
In the derivatives world (that I started my career in) there's a common theme: you can sell options, and generally you will make money. The only issue is, now and again you lose a lot of money. Often enough to wipe out everything you made.
But incentives matter. If you come in as a new guy, you want to impress by making money. And come up with some story about how you'll know when not to sell options. You also don't want to be the guy losing money when the other desks are making it. If there's a crisis and everyone loses, that's fine. Unforeseen, right? It happened to everyone.
The CTO is facing a very similar choice. Save money by not upgrading, work the security guys a bit harder and let them update the OS a bit slower. Don't think too hard about redundancy. Or code review, that takes away dev time. You're saving money.
But now and again, something happens. Maybe you have to leave, but you'll still have "saved money" for several years before. You'll land somewhere.
Arguably, delegating IT operations to a proper IT company could improve security and reliability.
We use a low cost offshore IT provider and we haven't had any security or reliability issues so far (https://gsuite.google.com/)
When it comes specifically to IT, however, we're looking at almost a race to the bottom to save money immediately, sometimes to free up budget for software engineers, who are seen as revenue centers. Furthermore, a lot of IT folks want to stay current on technologies just like many developers do, but IT staff often face even more constraints than their developer peers in the same company. This has led to a concentration of lower performers staying for a long time (not necessarily a matter of skill - their environment is hostile to professional growth) while more ambitious technical folks move elsewhere.
It's been puzzling to me why IT orgs don't focus obsessively upon automation, like most industries trying hard to cut costs have. The US coal industry automated a lot of labor, and it's not like coal miners were making the handsome salaries most sysadmins did in the late 90s. The low performers all seem to have a strange obsession with creating as many manual processes as possible instead of, I dunno, writing some orchestration in Rundeck jobs or Ansible playbooks. One decent automation engineer can replace dozens of lower skilled junior and mid level sysadmins, even in bare metal environments.
So if something like security or reliability isn't clearly stated in contracts (and it's extremely difficult to do that well) it is disregarded.
An outsourcing company is not going to spend money it doesn't have to, simple as that.
SaaS like G Suite is an entirely different prospect: you're buying a service from Google, not outsourcing your IT to them. If Google's service was insecure or unreliable, no one would buy it.
When I joined the company there were 60 employees; when I left it 5 years later, there were about 500. I think they could only achieve this growth because of their strong work ethic and the quality of their hiring system.
For example, completing a full security development lifecycle can add 10%+ to the cost of the final product. That's not a cost that a company will incur unless it has to.
In a bid for work, everyone says "we take security seriously" and the client probably can't evaluate the difference between someone who really takes all the necessary steps, and someone who pays lip service to that concept.
So the cheap company (who doesn't spend that 10% extra) looks just as good as the one that does, and they're likely cheaper.
Guess who's more likely to get the work (all other things being equal)...
I liked the approach and I think the same could go with security. Include an external security audit in the initial project price.
Is there anything worse for an IT service provider than being blamed for a massive IT outage at a global corporation? This is headline news.
And I don't see any difficult contractual issues at all. On the contrary. A massive outage, by definition, means that the contractual obligation is not being met.
If there's an outage because the customer didn't ask you to do something, that's their fault, not yours.
For example would you as an IT outsourcer pay for a redundant datacentre if your contract didn't call for it?
Would you patch all your systems immediately even if it caused availability issues if it wasn't explicitly outlined in the contract?
Would you explain to your shareholders that your profits were lower this year because you undertook activities not specified in your contracts, because they were good for the security and reliability of the services you managed...?
When outsourcing contracts are bid for, the common experience is that the lowest cost wins. That inevitably leads to items that aren't strictly required being excluded.
I would assume that the contract calls for particular service levels and that downing the entire fleet of a large carrier for days is in breach of that contract.
If the contract says "provide service X with 99.999% availability" the service provider cannot come back and say, oh but you forgot to specify that we should run a redundant data center to guarantee that availability.
If you read those contracts, you have to read what the consequences are for breaking that uptime guarantee. Usually it's something silly like 10% off your next bill.
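The asymmetry is stark if you put even rough numbers on it. A hypothetical Python sketch (all figures invented for illustration): the provider's penalty is a capped service credit, while the customer's loss scales with the length of the outage.

```python
# All figures are hypothetical, purely to illustrate the asymmetry.
monthly_bill = 500_000             # what the customer pays the provider monthly
sla_credit_rate = 0.10             # "10% off your next bill" on a breach
revenue_loss_per_hour = 1_000_000  # customer's cost of being down
outage_hours = 48

sla_credit = sla_credit_rate * monthly_bill
customer_loss = revenue_loss_per_hour * outage_hours

print(f"Provider's penalty: ${sla_credit:,.0f}")     # $50,000
print(f"Customer's loss:    ${customer_loss:,.0f}")  # $48,000,000
```

Under numbers anything like these, the SLA credit is noise to the customer and the breach is a rounding error to the provider, which is exactly why the contract alone doesn't buy reliability.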
Until then, don't hold your breath. People learn their lessons the hard way.
1. Being more expensive than the other guy.
A company that's more expensive and can't clearly demonstrate in a bid scenario the positive impact of that increase in costs, will lose bids, a lot.
Everyone will say they take security seriously, but the cost of actually doing a good job on security is much higher than paying lip service to it, so it doesn't really make commercial sense to do it well.
In general, in fact, I'd say that a lot of IT outsourcing contracts tend to lead to a market for lemons. It's very hard for a customer to assess the quality of a company's staff, for example, so the company with the cheaper staff can afford a cheaper bid, which looks just as good as a bid from a company with more expensive staff...
Furthermore, often these issues rear their ugly heads when they sack the existing staff who are keeping 1000 balls in the air and rarely dropping one. In the migration to an outsourced team, or an offshore team, or even just another equally competent team in the same office, details will get missed. It's simply not possible to do a complete knowledge transfer. Things will get missed, and sooner or later one of the things you miss will turn out to be a landmine.
Of course, the original team wasn't perfect either, and could have also suffered a large outage, but the new team is more likely to do so, at least in the short term.
In a best-case outsourcing scenario, you either hire better people, or more people, or a vendor with better processes or understanding. You do the migration very carefully and miss as few details as possible, AND the quality of the new team is such that, within 3 or 6 or 12 months, they're actually outperforming the original team. AND you manage to avoid hitting any landmines during the transition period where they're underperforming. This requires a great deal of planning, execution, and luck.
Perhaps it really is cheaper to have a few outages than it is to build the sort of systems that never go down. Every "nine" you add gets more and more expensive.
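"Nines" map directly onto allowed downtime, which makes the escalating cost concrete. A quick Python illustration:

```python
# Allowed downtime per year at each availability level ("nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines in range(2, 6):
    unavailability = 10 ** -nines      # e.g. 0.01 for two nines
    availability = 1 - unavailability  # 99%, 99.9%, 99.99%, 99.999%
    downtime = MINUTES_PER_YEAR * unavailability
    print(f"{availability:.3%} availability -> {downtime:8.1f} min/year down")
```

Going from 99% to 99.999% means going from about 3.7 days of tolerated downtime a year to about 5 minutes, with each step typically demanding redundancy, failover testing, and on-call discipline that cost far more than the last.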
Or perhaps the systems are so complex that it's just impossible for them to be fully reliable, and this is just one of the costs of having the ability to fly anywhere in the world, cheaply and at the drop of a hat, and to have it work 99% of the time.
Is that what you tell hundreds of thousands of passengers sitting in a terminal with canceled flights and confused airline employees during one of the biggest holiday weekends?
If your credit card number gets stolen or worse, your identity, and it's up to you to get that straightened out, are you just going to shrug your shoulders and write that off as the inevitable impossibility of perfection?
Air travel itself is not 100% safe. Planes crash. People die. Yet we feel that the probabilities are enough in our favor that we accept it.
As an example, a person in a tech leader role at a major UK company, who I thought might be interested in hearing about a product, was quite rude, telling me never to contact them again and refusing to take any calls or emails before even hearing about something that other CTOs had at least looked at, and that some had even become customers of. She happened to have a personal blog that I checked out, and all it mentioned was her passion for swimming - nothing at all, not a single thing about tech. Why is this person in this role then? Beats me, but it is sadly commonplace. Until you get people at the top who truly understand tech and were actually software engineers themselves at the start of their careers, things won't change.
You didn't spam them did you?
Your description of their behaviour sounds a lot like the standard response to spammers, e.g. FOAD.
If I am looking for a product, I will look for it. As far as I'm concerned, your emails are spam.
Note that this attitude is what pushes some sales teams away from reaching out to line management, and towards C-level-targeted sales. If you hate "golf-cart sales" outcomes where a C-level forces a solution down your throat, then learning how to communicate with sales people will richly pay off. It is part of managing upwards, by short-circuiting sales warm approaches to the management you report to, and turning them into your allies to help you pitch your priorities that happen to align with their sales goals.
I am up front with the sales people who approach me, and tell them when I anticipate the problem space they solve will rotate onto my front burners, my anticipated budget to switch (usually in the form of "if I switch to what you propose, I only have $X to do it with, all in, software and services"), the benefits from my current solution I want to ensure stay in place, and the pain points I want to solve. That usually ends the conversation right there and then, I'm tagged "no-contact" in their CRM, and the spam disappears.
If people wonder why spam is a problem, it's pretty much because enough idiots respond to keep it worthwhile.
Generally speaking, C-level relationships are cultivated over a long time with the kind of sales and marketing budgets you've read of; dinners, sporting events, wine tastings, etc., with some proven sales account manager. However, it takes a lot for some sales opportunity to rise to the level of bringing it to the attention of this relationship. If you want to stop what you call spamming, then your job is to manage upwards by ensuring pain points do not rise to that level. This keeps the sales account management activity focused on the numbers when the support contracts come up for renewal.
However, if you have not been successfully managing upwards, then lines of communications have broken down between C-levels, or between CIO to your level. The first you can't do much about. The second is partly within your control. If you ensure your manager has nothing from the business to complain about that can be addressed within the solutions you are responsible for, then you've done what you can about it; you can't fix what you aren't told, after all. Where most technology-oriented staff misstep is they think they need to hear it from their management; the staff who successfully short-circuit pain points coming up to the C-level's attention realize that they can also ask, and more importantly, demonstrate they can communicate and coordinate between departments and people to help address those business-oriented pain points before they percolate upwards.
Where these sales account managers pounce is when the pain points become so grave that the C-levels hear about it and feel they have to "do something" about it, and "it" is a competitor's solution. And if you really like that competitor's solution, and hate the idea of switching away, be on the lookout for a call from one of the sales reps assigned to help that sales account manager. If you've even heard about the pain point, then you will be able to help drive the discussion, and most of the time if your favored vendor is on the ball, close the window of opportunity for the sale to be floated up.
I'm not going to waste my working time communicating with a salesperson back and forth for a product I'm not interested in. One, it only validates that I read their email, and two, the end result is the same - I don't buy the product.
That's the reason for the FOAD response.
Your willingness to listen or otherwise is on you. ;)
Literally, by spamming them.
It's absolutely no wonder they're telling you to FOAD, as that's the correct response.
Please, change to a different profession. Preferably one that contributes to society in a positive way instead of your present one. :)
I think we like to pretend that this is a problem unique to informatics.
Other quality airlines have outsourced IT or part of their IT operations, but they are careful about choosing their IT vendors.
For example, Israel's national carrier, El Al, which also has extremely good security against hijacking, outsources at least its ticketing as a cost savings move, but this it does to Lufthansa the German national carrier, which uses the Amadeus Ticketing platform. El Al is saving money over running their own operations, but still using a reliable vendor.
Quantas, the airline of Australia, outsources its IT operations to IBM. They did not choose the cheapest alternative, but a reliable one.
Contrasting with the Israeli and Australian flag carriers El Al and Quantas, BA, the British flag carrier, made some extremely unwise choices, trying to choose the "cheapest" solution instead of a money-saving cheaper solution.
Compared with El Al and Quantas, BA management has shown us that quality of operations is not their top priority. The question is, where else in their operations is this a problem?
BA management is revealing to customers and shareholders alike that customer service and shareholder value are not their highest priorities. They are signaling to shareholders that it is time for a change, before they further harm the BA brand and before some serious accident happens.
Step 0. Well functioning and balanced company.
Step 1. Why are these engineers so expensive? We can hire 'ten for one' in 'country A'.
Step 2. Cut the local IT budget in half, and set a goal of *zero* local IT budget in 2 years. Outsource everything to 'country A'.
Step 3. People are fired, everything is outsourced, operational knowledge accumulated over the last few years is lost.
Step 4. Why is everything broken, and why are we losing millions? The service quality is as if we lived in 'country A'!
Step 5. We need to hire high quality local team that can take *ownership* over product. We are willing to pay a fortune.
Step 6. Well functioning and balanced company.
>The airplanes that U.S. carriers send to Aeroman undergo what’s known in the industry as “heavy maintenance,” which often involves a complete teardown of the aircraft. Every plate and panel on the wings, tail, flaps, and rudder are unscrewed, and all the parts within—cables, brackets, bearings, and bolts—are removed for inspection. The landing gear is disassembled and checked for cracks, hydraulic leaks, and corrosion. The engines are removed and inspected for wear. Inside, the passenger seats, tray tables, overhead bins, carpeting, and side panels are removed until the cabin has been stripped down to bare metal. Then everything is put back exactly where it was, at least in theory.
>The work is labor-intensive and complicated, and the technical manuals are written in English, the language of international aviation. According to regulations, in order to receive F.A.A. certification as a mechanic, a worker needs to be able to “read, speak, write, and comprehend spoken English.” Most of the mechanics in El Salvador and some other developing countries who take apart the big jets and then put them back together are unable to meet this standard. At Aeroman’s El Salvador facility, only one mechanic out of eight is F.A.A.-certified. At a major overhaul base used by United Airlines in China, the ratio is one F.A.A.-certified mechanic for every 31 non-certified mechanics. In contrast, back when U.S. airlines performed heavy maintenance at their own, domestic facilities, F.A.A.-certified mechanics far outnumbered everyone else. At American Airlines’ mammoth heavy-maintenance facility in Tulsa, certified mechanics outnumber the uncertified four to one.
Maintenance facilities are usually certified by the local aviation authority
Aviation is also "regulated" by two extra entities: plane lessors and insurance companies. Both won't be happy if the maintenance done at these facilities is not correct
Step 5. We need to hire high quality local team that can take ownership over product. We are willing to pay a fortune. But we can't hire anyone local, because we did step 3 for too long or too often, and that knowledge is locally gone.
Rather: 'two for one'.
Well if you truly talk about "third world country" then sure, 10 for 1 but what kind of country and IT operation would that be?
Third world countries are developing, you should keep up with that.
I've heard rumours that the Chinese will just copy your IP as soon as possible, regardless of what you make them sign.
But what's supposed to be wrong in India compared to other outsourcing sites?
So if all offers look the same quality to you, you obviously buy the cheapest.
I'm not so sure about that. Four of the five big carriers in the U.S. have had national outages within the last year due to "IT."
I believe IBM was responsible for the Australian census debacle in 2016. Hardly a ringing endorsement of their reliability, and not the only high-profile instance of them messing things up.
"No one ever got fired for choosing IBM" though.
At any rate, I believe few would contest that IBM in general is a more reliable vendor than Tata, other Indian suppliers or say, offices in Poland.
EDIT: Details on IBM and Australian census.
But as a rule, Poland, India, and other cheaper locales don't offer the quality of the US, Israel, etc. Look at the places where the world's top software, computer chip design, and hardware are being designed and built. There is a correlation.
For example, Israel has a population of only 8 million yet creates more original software and startups (purchased by developed nations) than all of India. India buys hi-tech military technology from Israel (and the US).
And typically has them laid off in 3 - 5 years and replaced by IBM offshore resources or subcontractors.
How can they do that and save the client money at the same time?
Because the staff they transferred over are used to train cheaper replacements, then laid off after 2 years. That is the plan all along, though the staff will never be told it upfront.
I've seen this happen many times and it's always been successful. They can replicate the business processes they've honed over time whilst keeping that important business knowledge. It's also often better for the staff as they can hand off that knowledge and move internally within IBM to new and more challenging roles without switching employers.
Now they work for IBM, and all the things they did above-and-beyond their job description are now billable. Now their incentive isn't to help company X that they wanted to work for, but to screw company X for every nickel-and-dime because that's where their new employer's revenue comes from. And they know all the skeletons in the closet and all the pain points.
Stab good people in the back and you make powerful enemies; the managers of all the company Xs out there never learn this lesson.
Also May 27, 2017
Care to elaborate a bit?
In this view, India is portrayed valuing quantity over quality.
But hey at least Canadian government managers/executives got performance bonuses.
I'd trust IBM over random outsourced coding houses when it comes to IT Security.
What is the conversation that makes this happen? "Yeah, let's go with IBM. The same problems might crop up, but they're a larger company."
For example, https://www.theregister.co.uk/2017/05/26/ibm_asks_contractor... doesn't sound like a company I'd want to trust.
The smaller shop has more reason to try harder and deliver a better result. IBM will just hire the cheapest people they can find and put them under 5 levels of management, and good luck with that!
But - without any evidence whatsoever - the wholesale blaming of Indian and Polish companies seems somewhat unfair. Until we have a postmortem and know this in more detail, it appears that you just have an axe to grind.
And what difference does it make whether the error -- if indeed the root cause was an error, which hasn't yet been established -- was made by someone in India vs. someone in Poland vs. someone in the UK?
It doesn't matter. They are no smarter than BA's original British engineers, but with decades less experience on those systems. Even if they were smarter - and in general the talent prefers to work for real companies, not bodyshops, so that's unlikely - they can't compensate for the lack of real experience.
Only a person who believes experience doesn't matter would sign an outsourcing deal.
Wow, you really believe Poland is some third world country where we ride horses to our farms? We have experience; in some ways Poland is way ahead of the UK in the digitalization of many services, in great part because of a thriving economy, great schools, and excellent developers.
Poland has some of the best programmers according to a study conducted by hackerrank https://blog.hackerrank.com/which-country-would-win-in-the-p...
No, I was in Poland just last week in fact.
But my point stands: BA's own staff have been operating its systems for decades. No matter how good you are at solving made-up puzzles on HackerRank, you can't match that experience of those systems overnight. There would have been people at BA who'd been working on, say, the reservation system for 30+ years, and you're claiming your "hacker rank" can beat that. And THAT is why outsourcing always fails: that kind of belief.
Apologies if that seems a bit harsh, but the assertion that "we do well on Hackerrank therefore we are as qualified to maintain legacy systems as their original authors" just doesn't make sense.
You think the core systems of major airlines or any other large org turn over every 2-3 years? You think knowing the syntax of a particular language is comparable to decades of domain knowledge? You think that ANY 20-something, anywhere in the world, has decades of experience of anything??
No, YOU know nothing about what you write.
There's a bunch of people not eligible to vote, people eligible but not voting, as well as those who voted against Brexit.
Actually, Qantas has a huge 'workforce' of Tata onsite as well. Of course all the newer and 'more exciting' stuff is run either by in-house resources or 'digital agencies', the more expensive kind of outsourcing.
(FWIW, it's Qantas - Queensland and Northern Territory Aerial Services)
And where do you think that work awarded to IBM is being done? Chances are it's India. As for IBM being reliable - are we talking about the same company known for its brazen incompetence in Australia?
Not that I think this is the Krakow workers' fault, mind you; rather this (and the string of similar IT incidents recently at BA) seems to fit the pattern of upper management focusing on driving down costs to the exclusion of everything else.
If offshored personnel are direct employees, it's a completely different story.
Given that the main cost saving comes from being able to "ramp the size of the pool up and down to meet demand", most large companies opt for the bodyshop solution...
Not snark, just asking.
National and racial slurs are not allowed here.
Their high costs were fine, but then they started cost-cutting more and more, and enough was enough. Everyone I know has twigged that within Europe easyJet is far better, and for longer haul Virgin, the Gulf carriers, and even US carriers are better.
For example Glasgow: average residential prices are 23% of London's, that's a lot cheaper, and there's a good supply of skilled workers there.
Although, I've seen this sort of thing before... Usually it starts with a middle-management hiring of some alpha-male business asshole who is driven to advance his career and thinks that because he can use a spreadsheet he's qualified to run IT. He'll then go on to sell upper management on some kind of ridiculous story straight out of some bullshit CIO magazine about how consolidating all the existing best-in-class systems into one system will cut costs and open opportunities for building and mining customer data to increase revenue. He'll get the green light and a shit-ton of capital, and then he has to make the decision about whether to build or buy. That question is irrelevant, as our intrepid hero has no idea what the fuck he is doing and will fuck things up regardless of which path he takes. Once the "new system" is fully half baked, he then shoves it out all over the company in some ridiculous balls-to-the-wall, no-going-back rollout plan. Subsequently there will be huge problems, massive lines, pissed-off customers, pissed-off employees... but this is where our intrepid hero really shines. His mastery of the art of bullshit successfully deflects all blame from himself and his incompetence onto the users/operators (e.g., the people responsible for revenue-generating businesses) of his monstrosity. Somehow it is forgotten that outages never happened before, and all the revenue loss and customer ill-will is blamed on the operators for not having well-tuned disaster recovery plans in place for manual operation.
Of course the only disasters they ever dealt with were a result of his incompetence, and in a stunning feat of failing upwards, he's destined to helm the company (and then likely others) within a few short years.
>Yesterday's issues are the fourth BA failure in the past month, with problems on June 19, July 7 and July 13
>Union leaders say hundreds of BA staff complained about 'FLY' system and most workers say it's not fit for purpose
>A survey by GMB of 700 staff in June found that 89 per cent said training was poor, 94 per cent suffered delays or system failures and 76 per cent said their health had suffered because of stress or anger aimed at them by frustrated passengers.
Problems happen, even huge ones, and I think most customers understand that. What they don't understand is being given little or no information about what they should do, and being given vague, contradictory and even false information.
The BA Twitter feed seems to be the main source of dubious information. They are telling people to check the website for flight status info - but it only works intermittently, and different parts of the site say different things; for example, the flight status tool says my flight is cancelled, but the booking management tool says it's all fine.
On the Twitter feed they are telling people that they are contacting passengers and rebooking them automatically, but it's apparent this is happening for very few people.
They are telling people to rebook on the website - which only works intermittently, and will not allow rebooking even when it is working.
They are telling people to call them to rebook, but their call centres seem to be working normal office hours (rather than going all hands on deck), so they were not in operation between 20:00 and 09:00 (or whatever; it varies by country). During working hours, calling any call centre anywhere in the world, you just get a recorded message and are then disconnected. Some people say they did get into the call queue, and have been waiting on hold for 7+ hours!
They are sometimes telling people they shouldn't go to the airport to rebook, and sometimes telling them they *should* go to the airport to rebook.
Yesterday they were telling people they could book alternative travel with a different airline and then claim it back from them... and today they are saying they won't pay if you book alternative travel - this could cost passengers dearly.
The CEO, Alex Cruz, also made a laughing stock of himself yesterday when he randomly donned a yellow high-visibility vest to do a recorded message in an office.
Honestly, the whole thing has been a lesson on how not to treat your customers when things go wrong.
They moved off of that recently though, and are now on Amadeus/Altea.
Manually. The TPF reservation systems have a concept of "Queues". They would place travel records that needed to be reaccommodated onto a queue. Then, reservations agents would "work the queues" from their green screens, make phone calls, etc.
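The "queue" concept described above can be sketched roughly as follows. This is a hypothetical illustration of the workflow, not real TPF code; all names and record fields are invented for the example.

```python
from collections import deque

# Disrupted travel records (PNRs) are placed on a named queue; agents then
# "work the queue", pulling records off one at a time until it is empty.
reaccom_queue = deque()

def place_on_queue(pnr):
    """Flag a travel record as needing reaccommodation."""
    reaccom_queue.append(pnr)

def work_queue():
    """An agent works the queue: pop records in FIFO order until empty."""
    handled = []
    while reaccom_queue:
        pnr = reaccom_queue.popleft()
        # In practice: rebook the passenger, phone them, annotate the record.
        handled.append(pnr["locator"])
    return handled

place_on_queue({"locator": "ABC123", "flight": "BA117"})
place_on_queue({"locator": "XYZ789", "flight": "BA175"})
print(work_queue())  # records come off in the order they were queued
```

The point of the queue is simply that reaccommodation was a human-paced, record-by-record process driven from green screens, not an automated batch job.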
>how is it handled today given all the intertwined applications
Depends on the airline, and the system, but the general answer is "partially automated". The processes for less widespread issues is more automated. Think like a major storm in the northeast. Global outages are less automated because you're dealing with multiple issues, not just passengers.
IBM offered ACP (Airlines Control Program) with its new mainframes running OS/360 in 1965. It later evolved into TPF (and the related ALCS), which are still used today by reservation systems.
http://www.bbc.com/news/live/world-40069977 says "A BA captain has said the failure affects the passenger and baggage manifests"
So it's a legality/operational thing. Passengers can get boarding passes, etc., but the plane isn't allowed to take off without proper manifests.
Until an hour or so ago, login to the website was down for me.
So it seems it's a very widespread outage that they're in the process of recovering from.
I was able to do a flight status check a little while ago, though.
I'm leaving my Citi card for a Chase one due to poor infrastructure, but I could have avoided a lot of wasted time if I hadn't opened it in the first place.
A lot of the quality and performance problems could be bad process/management on the client's side (though an outsourcing company will profit hugely from it).
Partially OT: anyone wanting to share any on-call horror stories? :)
They do have solver/optimizer algorithms, but you can imagine it's not a button-press thing. There's a lot of human process, trial and error, etc. Oh, and federal laws about crew hours/legality, plus union work rules. You can't just assign crews wherever you want, for example; you have to consider seniority, their "home base" where they live, etc.
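One way to see why this isn't a button press: every candidate assignment has to pass a stack of hard constraints before the optimizer can even consider it. The sketch below is a toy version of such a legality check; the 8-hour duty limit, the field names, and the rules themselves are illustrative assumptions, not actual FAA regulations or any airline's data model.

```python
# Hypothetical duty-time ceiling for the example only.
MAX_DUTY_HOURS = 8

def can_assign(crew, flight):
    """Return True only if every (toy) legality constraint passes."""
    within_hours = crew["duty_hours"] + flight["block_hours"] <= MAX_DUTY_HOURS
    at_home_base = crew["base"] == flight["origin"]       # home-base rule
    senior_enough = crew["seniority"] >= flight["min_seniority"]  # union rule
    return within_hours and at_home_base and senior_enough

crew = {"duty_hours": 6, "base": "LHR", "seniority": 12}
flight = {"block_hours": 3, "origin": "LHR", "min_seniority": 5}
print(can_assign(crew, flight))  # False: 6 + 3 exceeds the duty limit
```

Real systems layer dozens of such constraints (rest requirements, aircraft type qualifications, pairing rules) on top of the optimization itself, which is why recovery after a global outage still involves so much human judgment.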
It's got to be a great problem to work on, and must be pretty rewarding to watch when you get it right.
If it's only one airline down, you can get away by buying tickets on competitors.
If it's multiple airlines, or the upstream reservation system, or a local meteorological incident, or an airport-wide issue, etc., there are no planes flying.
For example, the overbooking that led to the beating on the United flight was the result of an aircrew from another airline being booked on at the last second.
I would imagine that when the shit hits the fan like this, the other airlines are very sympathetic and will do whatever it takes to get BA personnel where they need to go. After all, next month it could be their own systems that are down.
(assuming it's not one of the shared services causing a global outage - not much you can do when everyone is down)
That's not quite what happened. The flight was operated by Republic Airlines, on behalf of United. Republic bumped the passenger to make way for more of their own crew; it's just that the crew was going to fly a Republic flight operating under another, different carrier. But both crews were employed by Republic.
That shut down BA operations worldwide? Does that seem likely? Possible? They don't have power failover and operational centres in different countries?
I've been re-shaping my aluminium foil hat and wondering if there wasn't a specific terrorist threat that's been covered up; but then where I am there have been 2 cities suffer bomb threats (with attendance of bomb disposal and armed police) that seem to have been buried in the news completely. Also I've heard elsewhere that staff on the ground reported the incident as due to "hackers" almost immediately - the speculation being 'before they could possibly have known that' - which suggests some sort of disinformation process.
Assuming they identified a real, immediate, and massive threat, I can see why they would prefer to ground all planes until they've sorted things out.
Yes, it would probably take 5+ years to develop and roll out, but they can't keep maintaining these 50-year-old mainframes that cost them tens of millions a year in downtime.
There were fewer global outages when all of the functionality was on the TPF mainframes with dumb green screen clients.
Implemented and operated by in-house staff.
It's surprising how quickly one can suggest a solution based on... what exactly?
Imagine a doctor prescribing "this new fantastic medicine" based on "patient is sick" description. While personally I'm all for open source this is not a silver bullet and should be carefully considered.
Joel Spolsky had an interesting blog post back then about Netscape rewriting their software.
There are many factors that contribute to these difficulties, but tech is not near the top of the list.
There are many cautionary tales, including the $8 billion FAA AAS system rewrite failure.
Contributing factors include
- Regulatory processes
- Project management
- Politics and turf wars
- The extreme difficulty of reengineering a complex system when few, if any, of the original designers and developers are driving it
When you live in a world w/ Docker, REST (and every other open tech there is), you can build systems which are way more innovative.
That's quite funny because Docker and REST are just half-arsed reinventions of mainframe features from the 1970s.
That's not a trivial add-on to IBM mainframes: it gives every developer (with minimal resources, e.g. a laptop or a free-tier cloud service) the ability to run production-like environments.
Sometimes we take that for granted, but having worked in airline IT, I noticed the bottleneck didn't come from hard algorithmic issues (most advanced route features were very basic to implement), but from the huge leap between production and dev environments: this was not Ubuntu on a server vs. macOS on a laptop (manageable...), this was Ubuntu VMs (so, should be close to prod?) vs. cryptic data-centre clusters with impossible-to-replicate features.
As for REST, airline IT uses an outdated messaging mechanism. It implements versioning and grammar, so it should be clean and nice to work with? Not really... The messages were impossible to read for an inexperienced dev (as opposed to XML or JSON).
I heard plans to put JSON blobs in one of the fields of the messaging mechanism (completely destroying the value of versioning and grammar, btw). That was not necessarily a bad idea, just a reaction to the lack of supported tooling and readability for an obscure messaging mechanism.
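The readability gap described above can be illustrated with a contrived example. The "cryptic" line below is an invented fixed-field format, not any real airline message standard; decoding it requires positional knowledge an inexperienced dev simply won't have, whereas the JSON equivalent of the same data is self-describing.

```python
import json

# Invented fixed-field message for illustration only.
cryptic = "HDQRM1A .LHRBA 271445 SMITH/J 1BA0117Y28MAY"

# Extracting fields depends entirely on knowing positions and conventions:
fields = cryptic.split()
record = {
    "passenger": fields[3],        # 4th token happens to be the name
    "flight": fields[4][1:7],      # flight number buried mid-token
}

# The same data as JSON needs no tribal knowledge to read:
as_json = json.dumps(record)
print(as_json)
```

This is presumably why people reached for JSON blobs inside the legacy fields: the data model wasn't the problem, the opacity of the encoding was.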
Again, I get where you're coming from, but I'm just allergic to nostalgia for the sake of nostalgia when the old tech clearly lacked essential features for 2017.
Or even build common components/ technologies together and then build their back-ends on top - that is basically what the rest of us are doing...
I'd rather fly on an airline that hasn't 'upgraded' core systems.
A power outage that reboots a whole datacenter would screw any major airline for at least half a day.
Thus far, nobody has found the economics of a modern, truly HA setup worth the cost. Outsiders greatly underestimate the complexity too. Think something like 40 disparate applications from different vendors, or some homegrown systems, in different geographical locations. Then, all the client applications are in buildings you don't own (airports) where you aren't allowed to control the infrastructure.
If it were a high margin business, perhaps things would be different. It's not.
Most of the loss here is not from lost bookings; a lot of people who prefer BA will probably just wait and book once the systems are restored.
It's going to be from secondary losses -- lawsuits, ding to the reputation and so on.
These numbers can be estimated, even for low-margin businesses.
To give you some idea, ITA Software was bought by Google for $700 million. They were some of the best and brightest minds in this space, on par with any Silicon Valley darling. They successfully wrote a modern replacement for one popular airline function: shopping. They failed, however, at delivering a modern reservation system, despite tons of money and talent invested.
Then, quite honestly, it's the rational decision for airlines to take.
Even Google accepts that perfect reliability is impossible, and they're sailing in a pillow-strewn gold-plated yacht down a Mississippi of money.
You can have all the DR in the world, but if you never fully test it, you never know if it's going to work 100% on the day. The only way to fully test is to do a full failover from the primary and see what happens, but it could be a real career low if you do that on something so commercially sensitive and it all goes very wrong.
Interestingly, it's a bank holiday weekend, so there's every possibility that some of the people who knew the special incantations to get things back up on days like this are out camping somewhere remote. They'll have 500+ messages when they get back into signal range tomorrow, wonder why they ever try to go away, and if they ever do go away again, wonder how they will ever be able to relax.
The fix is more expensive than any major airline has been willing to spend thus far.
But my experience has taught me this: the vast majority of the time it's a fragile eggshell waiting to crack. When it fails, it fails spectacularly. An IT support team on this scale is one of those things you should keep close to your product or customer.
On occasions like this, I guarantee a bunch of execs somewhere in BA would pay ANYTHING to have some of their loyal IT staff back to take control of the situation.
Without meaningful consequences at the top of the executive chain for sub-par IT/infrastructure quality, these kinds of incidents seem inevitable. But how do you hold people responsible for "bad" software? We could adopt something akin to how PE licenses are required for civil engineering in the US. I suspect it is in the industry's best interest to address this need before a government entity decides to.
Maybe a group on behalf of a nation state is testing out its muscle. Or sending a warning. Maybe the UK did something recently a nation didn't like, and this is the new form of "diplomatic protest". Or sending a warning shot.
I remember reading last year that the large Delta airlines outage was cyber terror related.
It really feels like we're not too far off from a war between two nations without a single bullet fired. What happens when a country is hit with nation-level ransomware? I.e. not "give us $300 and you'll get your PC back", but "sign XYZ treaty and you get your country's water system back"? Or "we'll restore your internet and turn your power plants back online"? How much will a wealthy country (like us in the US) be willing to stomach of seeing people going thirsty and hungry before they want their government to capitulate?
Of course the world will condemn and complain, but as has always been the case the country with the largest Army makes the rules. And we might be using the wrong measuring stick. Microsoft missed the boat by thinking (along with most of the industry) that measuring computing meant measuring PCs. It wasn't until it was too late that it realized that computing was about to be dominated by Mobile and smaller. They were using the wrong measuring stick.
Are we on the verge of an era where an army should be measured in its digital strength and not its physical strength? It's a very scary thought.
As an American I think first of the risks to my own country. How long could we sit with a nationwide blackout, and the internet down (for anyone using backup generators)? Food rotting in shipping containers because no cranes can offload it. No easy communication or flights for quick way to move people or things around.
As a country (and as a world) I can't help but feel we're really not doing enough to take a risk like this seriously. It's just a feeling, but based on small incidents here and there it really feels to me like something in the next 5-10 years is gonna happen that'll make us all drop our jaws and say "I didn't think this was possible."
Regardless of your political beliefs, a lot of people around the world didn't think last year's US election was possible. I was shocked by it as well, but it also opened my eyes to how much more really is in the realm of possible.
As engineers we have a much deeper knowledge of the risks involved, and as a result a much greater responsibility for raising awareness and getting the problem fixed.
EDIT: I'm not sure of the original location I saw the Delta stuff, but a quick look now turns up this link.
And yes, I'll repeat this above comment is purely speculative by me.
See this post, for example...a BA employee: https://www.flyertalk.com/forum/28366141-post168.html
Later, there's some talk about a power outage and/or lighting strike, again from BA employees.
There were lightning flashes a few times per second, near continuously... for about 1 1/2 hours. I've never seen anything like it before.
Delta Airlines also called their outage a "power outage", but it looks like there was reason to believe it wasn't.
Edit: Yep. http://bgr.com/2016/08/14/delta-finally-explained-how-one-po...
Where'd you hear that? AFAIK, it was a botched failover to backup power.
The lesson from Ukraine to Trump is that, far better than something obvious like turning off the power, a deniable attack on the political system is extremely effective.
There has been a "cyber war", the US lost, and it's not yet clear how the US political system will recover.
You've never used ITIL for real, have you? Because if you had you would know that no amount of process can replace good engineers, and good engineers don't want to work in ITIL shops...
Offshoring IT is an example of cutting costs to the bone.
Your grandchildren will only know of flying as an ancient legend.