Southwest Airlines halts flight departures amid technology issue (bloomberg.com)
97 points by mfiguiere on April 18, 2023 | 92 comments



Is it really fair to blame it on the technology? After a couple of oopsies it seems like it's more of a management issue than a technology issue.

I imagine it's a lot easier to deflect blame by blaming it on computers though.


I can’t find the exact quote or its source, but one thing I’ve heard that has stuck with me is (probably unintentionally paraphrased):

“Every problem blamed on ‘computer error’ involves at least two human errors, one of which is blaming the computer.”


"At the source of every error which is blamed on the computer, you will find at least two human errors, one of which is the error of blaming it on the computer."

- Tom Gilb


And Sidney Dekker will tell you that blaming things on "human error" won't actually help you find the root cause of the problem.


There is human error, which is an employee's error, and there is process flaw, which is management's problem. While managers are human too, the flaw is not an error; it is incompetence, especially when it happens more than twice.


The technology is not working because of a lack of investment, that much was obvious at Christmas.

Factually describing the outage doesn’t remove responsibility; whoever didn’t take the first crisis seriously enough will face it soon enough. Right now, they need engineers, the FAA, flight controllers, etc. to get them unstuck. Then there will be a blameless diagnostic. Then heads will roll, and I’d be surprised if the highest-ranking demotion isn’t a general manager.


I would be surprised if they can solve all their issues in 5 months no matter how much focus is put into it.


Oh, certainly. But they can reset the server that is running out of memory faster than that. Most issues that leaked to the press require more than five months to fix, but some patches can be set up sooner.

One of the key processes that failed over Christmas was assigning personnel to crews and crews to flights. You can store a working option regularly (every hour) and revert to that, with local modifications, when things stop working. It won’t make things good, but it will avoid having to strand all flights for days.
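
A minimal sketch of that "store a working option and revert to it" idea, assuming the assignments fit in a simple flight-to-crew mapping. The class name, the validation rule, and the flight and crew labels are all made up for illustration; nothing here reflects how Southwest's systems actually work.

```python
import copy
import time
from collections import deque


class AssignmentSnapshots:
    """Keeps a bounded history of known-good crew assignments (hypothetical)."""

    def __init__(self, max_snapshots=24):
        # Bounded history: the oldest snapshot is dropped automatically.
        self._history = deque(maxlen=max_snapshots)

    def save(self, assignments, is_valid, now=None):
        """Store a deep copy, but only if the schedule passes validation."""
        if not is_valid(assignments):
            return False
        timestamp = time.time() if now is None else now
        self._history.append((timestamp, copy.deepcopy(assignments)))
        return True

    def revert(self, local_changes=None):
        """Return the newest good snapshot with local modifications on top."""
        if not self._history:
            raise LookupError("no known-good snapshot available")
        _, snapshot = self._history[-1]
        restored = copy.deepcopy(snapshot)
        restored.update(local_changes or {})
        return restored


def every_flight_has_crew(assignments):
    # Assumed validity rule: no flight is left without a crew.
    return all(crew is not None for crew in assignments.values())


snaps = AssignmentSnapshots()
accepted_good = snaps.save({"F1": "crew-A", "F2": "crew-B"}, every_flight_has_crew, now=0)
# An hour later the schedule is broken, so it is not stored.
accepted_bad = snaps.save({"F1": None, "F2": "crew-B"}, every_flight_has_crew, now=3600)
# Roll back to the last good state and patch in what is known locally.
restored = snaps.revert(local_changes={"F2": "crew-C"})
print(restored)
```

The point of validating before saving is that a revert never lands on a state that was already broken when it was captured.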


A few weeks after the first big incident, I applied for a couple of open IT roles at Southwest. Though I was well qualified for all of them, I never heard back from SWA, not even a "Thank you for applying" type of email. Makes me wonder if they thought the first incident was a fluke and they could go about business as normal. Also makes me wonder if they put out a bunch of ghost jobs to appear like they were staffing up. Most companies will at least acknowledge receipt of job applications no matter how unqualified the applicant and Southwest didn't even do that much.


Maybe they also have a 'technology issue' in their recruiting process and they think nobody is applying.


From swa@myworkday.com

  Thank you for completing your job application for the [not available] position!


You're gonna need a fax machine to apply.


>How many software engineers does it take to screw in a lightbulb?

That's a hardware problem.

>How many software engineers does it take to fix broken flight software?

That's a management problem.

>How do you fix this software problem?

This question has been closed for being off-topic.


"A loss of X dollars is always the responsibility of an executive whose financial responsibility exceeds X dollars."

- Weinberg’s 'First Principle of Financial Management' and 'Second Rule of Failure Prevention’

- 'First-Order Measurement', Quality Software Management, Volume 2, Gerald Weinberg, Dorset House Publishing, 1993


Apparently the last CEO cut a lot of costs in favor of profitability.


This always makes me think about how difficult it is to score what a CEO does. Many times they pumped up the stock and quit before the problems they created had surfaced. They were hailed as heroes and the next guy ate shit.


Case in point: Jack Welch

Throughout the 80s Jack Welch, then CEO of General Electric, was seen as a genius. My god! Look at the stock!

He gutted the company in favor of financial services, and now it is just a shell of what it once was.

C'est la eighties.


I feel this is the same way with US Presidents, although it's not really their choice when it comes to leaving.


There's a large range. Some things, like executive orders, can be instant. Some things, like leaning on congress to do something, might take longer to have measurable results. Some things, like the economy, regularly only affect the next person.

Oddly, the economy one seems to be the one that people vote for, with predictably stupid results. Also for some reason the economy only includes the gasoline refining industry, so make sure to switch careers so you can be a productive member of society!


Every country without regular turnover at the top seems to devolve into an evil dictatorship over time. Sometimes kings can be forced to give up power, until they become non-evil just for lack of power, but this takes centuries.


Everything is working, I don't think we need that IT team then....


OTOH, there are a lot of IT teams that are truly bloated.


That depends on how effective they are when truly needed. Firefighters in my town do busy-work 70% of the time. When they are needed, I'm happy we didn't send 70% of them home for the day.


It's basically inevitable with how Southwest operates: they trade resiliency for efficiency. It's the same reason modern supply chains are so fragile, with minimal redundancy anywhere. Tech seems to be one of the last places where people understand the value of resiliency; the way these companies run would be like not keeping database backups to save on storage costs, or not having any excess capacity provisioned to handle traffic surges or hardware failures.


It really does come down to management. There are ways to make computer systems reliable but private companies seldom have the budget, will, organization, or patience to develop their systems to those specs.

In lieu of that management should require business resiliency processes allowing the business to continue manually when the computer systems inevitably fail.


Well yeah, going down the '5-Whys' it ends up being management to blame


Not always. I read Accident Investigation reports for the transport industries in my countries. The vast majority of the time, management are a factor, for example the management of a bulk freighter haven't allowed enough crew for it to be operated safely, or the management of an airport haven't spent money on the improved hardware that would have prevented a dangerous incident. These reports inevitably have recommendations, many of which are directed to management.

But once in a while it really isn't. My "favourite" was a fishing vessel which capsized killing everybody aboard. Why did it get into trouble? Because the entire crew were using heroin. That report has zero recommendations because there's nothing to recommend. It's already both illegal and obviously a bad idea to operate vessels at sea under the influence of drugs or alcohol, so there's nothing to recommend, those guys had only themselves to blame for their deaths.


You could say it's management's fault they didn't know entire crews were on heroin. This is one of the challenges of being management: every failure is conceivably a management problem.


What management? Small boat like that, everybody senior is likely a co-owner.


>What management? Small boat like that, everybody senior is likely a co-owner.

And probably on the boat when it capsized and high on heroin too.


The flaw with 5 whys (not to say I don’t encourage it!) is they reflect the biases of the author while appearing unbiased and authoritative. So, it depends on who writes the post mortem who gets the blame.


Yes, because there’s insufficient room for reality. Causality is this big DAG with tons of incoming edges, going back to the beginning of time. 5 Whys is creating a path 5 edges long. This is often useful, but it’s not “the truth”. If each node has 10 incoming edges, there are 100,000 paths you could take.
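
The arithmetic behind that 100,000 figure, as a quick sketch. It assumes a uniform fan-in of 10 causes per node, which a real causal graph won't have, and `count_paths` is a made-up helper name.

```python
def count_paths(fan_in: int, depth: int) -> int:
    """Count distinct backward paths of `depth` edges when every node
    has exactly `fan_in` incoming edges (an idealized assumption)."""
    paths = 1
    for _ in range(depth):
        # Each additional "why" multiplies the choices by the fan-in.
        paths *= fan_in
    return paths


five_whys_paths = count_paths(fan_in=10, depth=5)
print(five_whys_paths)  # 100000
```

A single 5 Whys write-up picks exactly one of those paths.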

At least it’s better than root cause analysis, which is walking back until you want to stop and declaring that the root.

The real root is the creation of the universe. I mean that sincerely. When you follow definitions of causality, time and the universe, you end up with the Big Bang as the root of your graph. Everything else is editorializing.


"In the beginning the Universe was created. This has made a lot of people very angry and has been widely regarded as a bad move."


To make an apple pie from scratch, you must first create the universe, as they say


Management decision frameworks are often like that, but "5 Whys" is relatively innocuous - too simple to make tendentious. Mainly it is a reminder to dig until you hit bottom when doing root cause analysis.


A real 5 whys analysis probably should give a tree. Why did X happen? Because A, B, and C. Why did A happen? Because A1 and A2. And so on.
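
A rough sketch of what that tree could look like as data, using the X / A, B, C / A1, A2 labels from above. The `Cause` class and both helpers are hypothetical names, not part of any established 5 Whys tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    label: str
    # Answers to "why did this happen?"; empty means nobody asked further.
    because: list["Cause"] = field(default_factory=list)


def render(cause, depth=0):
    """Return indented lines, children listed under their parent."""
    lines = ["  " * depth + cause.label]
    for child in cause.because:
        lines.extend(render(child, depth + 1))
    return lines


def leaves(cause):
    """Causes with no recorded 'why' beneath them: the candidate roots."""
    if not cause.because:
        return [cause.label]
    found = []
    for child in cause.because:
        found.extend(leaves(child))
    return found


tree = Cause("X", [
    Cause("A", [Cause("A1"), Cause("A2")]),
    Cause("B"),
    Cause("C"),
])
outline = render(tree)
candidate_roots = leaves(tree)
print("\n".join(outline))
print(candidate_roots)
```

Every leaf is a place where someone stopped asking, so the tree at least makes that stopping point visible instead of hiding it behind a single chain.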


Fishbone diagram is one way to do that

https://asq.org/quality-resources/fishbone


You don't need to do that because there will be more 5-whys. Just pick whichever of A1, A2 seems like a bigger deal and find a B that seems like the biggest contribution - until you get to some deep root cause you can solve and mitigate. Then you fix that one issue and move on. You WILL see more issues; if they are always in other places you must have got the right one, if not then you will see this again and this time you can go down A2.

The important part of the above is that by refusing to create the whole tree you actually get something solved. Sometimes you will get the right one, sometimes you won't. You can spend a lot of time building a tree of problems, and find tens of thousands of things you can fix - and now you have a larger todo list than you have time to do and likely you don't get anything done. Or analysis paralysis, in other words.


This is specifically where bias comes in. People selectively prune the tree in ways that are either interesting (in a positive take) or deceptively based on their desire to avoid certain topics. Because the tree can be so large it isn’t always obvious even to the author whether they picked a branch because it was useful or because it served some other purpose - divert focus to a team they want to see behave differently, avoid taking blame for something, avoid being asked to fix something they don’t want to, etc.

So, in this case the real question is “will they pick the branch that blames the people that pay and promote them, or take the branches that are of the least resistance.” Something this contentious would be vetted by committee in some way or another and at that point any upward finger pointing would be almost certainly curtailed in favor of a less contentious branch.


Once again, there are only so many times you can keep re-running the same tree before someone will look for a different root cause. If you only run the process once you can place blame wherever. However, when it is multiple different people running it for slightly different problems over several years, eventually you will get to the right branch.


Only if it’s independent in each run. Even with different people on different problems they’re still subject to the same pressures by the people that pay and promote them.


It's almost never a technology issue. It's an investment and prioritization issue.


The way the SWA business model is set up makes me wonder if this sort of outcome is inherently baked-in, regardless of the actual technology utilized.

At the end of the day, if crew A and plane B are in wildly-unexpected places, no amount of computational prowess can correct that reality. Contingencies can be installed at every level, but then you don't have a viable business anymore. E.g. keeping a backup crew and spare plane at the ready 24/7/365 at every airport just in case is not going to scale. It also still won't deal with the worst scenarios.


At the Christmas outage, they had crews and aircraft in the same location and still weren't able to operate them because their system didn't believe they were in that location, and they had too few staff centrally to update the system with that information via phone.

This really is a massive management failure. They couldn't re-route crews and aircraft, they couldn't consolidate ones that were already in the same location, and they couldn't fix it because they under-staffed to an absurd degree.

Better technology would have helped a ton. But that isn't, to me, the story here. This is an epic mismanagement story that may ultimately kill Southwest. They cost cut technology for years, then cost cut away their backup staff too.


And IIRC they couldn't do all this in a reasonable time because their system still required crew to call them on the phone to report location and availability and to receive new assignments, and that system didn't scale during severe weather.


Southwest needed to update its tech infrastructure. They know this. Apparently staff have to call in to schedule, and when things aren't going well it goes south quickly.

This business model is kinda what made Southwest what it is today. I think, like "just in time" manufacturing, it can save you money when it's working but it might not be as fault tolerant.

From a podcast transcript: https://www.nytimes.com/2023/01/10/podcasts/the-daily/the-so...

Niraj Chokshi: Right, exactly. There was this union official who told this story where there was a plane ready to go that was missing one flight attendant. And they had several on board already as passengers. But none of them could reach the headquarters quickly enough to tell them, look, we’ll work this flight. And so the flight ended up getting canceled.

Michael Barbaro: Even though the flight attendants were on the plane.

Niraj Chokshi: Right, exactly. And so this kind of thing was happening across the board. And the airline just could not keep up with the nature of the problem.

Michael Barbaro: Because basically, they had an antiquated scheduling system.

Niraj Chokshi: Right, right. And Southwest has acknowledged that, too, even before this crisis. After Thanksgiving, they invited a group of journalists to Dallas. And I was there. And the CEO was telling us about his goal of modernizing the operation. There are all these sorts of systems that they want to upgrade.

They want to make better. And he mentioned this system. He said, we’ve got flight attendants who have to call in. This is a process that could be automated. And he said, it’s not OK. And so they knew. They had started to work on it. That’s what they say. But unfortunately, they were still working on it when Christmas comes.


>>they had several on board already as passengers. But none of them could reach the headquarters quickly enough to tell them, look, we’ll work this flight. And so the flight ended up getting canceled.

>>Michael Barbaro: Even though the flight attendants were on the plane.

>>Niraj Chokshi: Right, exactly. And so this kind of thing was happening across the board. And the airline just could not keep up with the nature of the problem.

Perhaps more importantly, they had provided no means to push decision-making to the edges of the system.

This is apparently an absolutely critical difference between the armed forces of Ukraine and Russia, where UKR has updated to modern command & control pushing decision-making out as far into the field as possible, while RUS has very centralized C&C, so UKR can outfight them even though outnumbered 5:1, and both sides are running short of artillery.

Here, SW cannot even fix parts of their system when the fix is literally on the tarmac, fully configured and loaded, just not centrally approved. They can't just say: we're good, every item on the checklist is verified, we're going, here's the data, we can update via communications en route. I'd bet that such solutions exist within an hour of scheduled takeoff for more than 50% of the situations. Instead, it looks like they are 100% shut down.


Airlines generally treat all flights as part of a matched pair: home-out and back again. Even with a non-hub-and-spoke flight system like Southwest, this is still the case. Crews (and, to a lesser extent, planes) don’t wander the earth like Cain; they go out and come back (perhaps with a stop or two along the way).

Southwest should be more resilient in some ways in that all of their flights are done with the same type of plane, so they can always swap in a different plane for the task, and within the limits of employment contracts/government regulations, they can also easily swap crews around, as long as they keep that out-and-back scheduling.


> Southwest should be more resilient in some ways in that all of their flights are done with the same type of plane, so they can always swap in a different plane for the task

That's not that unusual for an airline their size. But some of their planes have more seats than others, so on fully booked routes, swapping in a smaller capacity plane is still a problem. Certainly, a smaller problem than if they swapped in a plane the current crew couldn't fly.

I don't know where other airlines keep contingency planes, but I'd imagine they might keep them at their larger hubs, which would give you a decent chance of having a contingency plane at the right location; with Southwest's model, I think any contingency planes are going to have to first deadhead to the route, adding more confusion and delay. :/

At least they fixed this one quickly; once the backlog got too big last time, it was very difficult to fix.


Using a lower-capacity plane is still less of an issue for SWA than other airlines since they also don’t do assigned seating, so they can bump a portion of the booked tickets rather than all of them. Although are you sure about the fewer seats on some planes thing? I remember having identical planes for all flights being a central part of their business plan last time I had read about this. I can’t imagine them getting any benefit out of having 737s with different seating capacities.


> Using a lower-capacity plane is still less of an issue for SWA than other airlines since they also don’t do assigned seating so they can bump a portion of the booked tickets rather than all of them.

Other airlines can do similar. Assigned seats don't prevent an airline from bumping you; it's been a while, but I've had my seat changed without my involvement as well as had to figure out what to do when someone else had a boarding pass with the same seat as mine. I've had pleasant surprises too; booked towards the front of the cabin for a plane without an extra leg room section, but the plane that was used had extra leg room for that row, so bonus.

> I can’t imagine them getting any benefit out of having 737s with different seating capacities.

It's probably a mix of things; one being they bought many 737-700s before larger models were available, but it looks like they've got 192 737-MAX7s on order, so it seems that they'll have three sizes for a while and probably have two sizes once the 737-700s age out. Some routes probably don't provide enough customers to fill the larger planes, and using a smaller plane reduces operating costs. Two sizes isn't that many. Alaska operates more sizes and has fewer aircraft.


SeatGuru shows their 737-700 is 143 seats and their 737-800 and 737 MAX 8 are both 175 seats.


An aside but some years ago I was taking a flight in Europe and I was fascinated to see the differences. It seems many budget airlines operate on a similar model but the EU has steep fines for late flights so there's a huge incentive to find alternatives.

In my case the alternative appeared to be an entire backup airline that seems to exist only to serve other airlines when they experience issues. A plane was quickly scrambled from (IIRC) Lisbon and flown to Heathrow and we were all shepherded onto it as quick as they could possibly manage. The plane was garbage, service was garbage and food was garbage but they did get us in the air quickly. Ultimately they missed their 4hr delay window by about twenty minutes so I was entitled to something like 600EUR of compensation anyway.

I imagine there are a number of reasons why this wouldn't work domestically in the US but without the threat of fines there simply isn't any incentive to provide it anyway.


The one concern I'd have with such pressure is that there would be other essential corners cut in order to not end up being fined. Bad service on a dirty aircraft is one thing, but if they end up skimping on something safety related, disaster is waiting to happen. While I'm sure there are lots of other penalties involved in cutting corners in those areas, including the prospect of losing an aircraft and many lives, people at the lowest levels often are receiving pressure for immediate concerns, and the consequences of the larger event aren't something that's immediately real to them.


> At the end of the day, if crew A and plane B are in wildly-unexpected places, no amount of computational prowess can correct that reality.

I think that’s not wrong, but one of Southwest’s efficiencies is that they only operate one “type” (term of art) of airplane. So theoretically any crew can operate any airplane.

So, in theory, it doesn’t matter which crew is at the airport as long as they meet the rest requirements. You could just use the greedy algorithm and assign the first available crew in the queue to the next flight at each airport. And, since you need a crew to get the airplane to the airport, there are usually crews in the same places as airplanes. (But you have to know that an available crew exists at that airport to do this)

Where that breaks down a bit (and where you need a smart algorithm or more compute) is that you can end up with crews at an airport, but the crews are all timed out. So while the airplanes are mostly fungible, the crews are mostly not.

The complexity is really filling all the planes for the next-n rounds of flights with crews such that none of the crews times out up to your time horizon. But that only works if you know where all the crews and planes are (and how much time each crew has left)
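
Here is roughly what the greedy version with the time-out check could look like. This is only a sketch under simplifying assumptions: crews are queued per airport in arrival order, "timing out" is reduced to a single minutes-remaining number, and the airport codes, flight numbers, and field names are invented. It does not move crews to their destination or look ahead over the next n rounds, which is the hard part described above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Crew:
    name: str
    duty_minutes_left: int  # time remaining before the crew times out


@dataclass
class Flight:
    number: str
    origin: str
    block_minutes: int  # how long the crew must remain legal to fly it


def assign_greedy(flights, crews_at):
    """Give each flight the first legal crew waiting at its origin.

    `crews_at` maps airport code -> deque of Crew in arrival order.
    Returns (assignments, uncovered flight numbers).
    """
    assignments = {}
    uncovered = []
    for flight in flights:
        queue = crews_at.get(flight.origin, deque())
        chosen = None
        # Scan in queue order, skipping crews that would time out mid-flight.
        for crew in queue:
            if crew.duty_minutes_left >= flight.block_minutes:
                chosen = crew
                break
        if chosen is None:
            # Crews may be present but all timed out: the flight stays uncovered.
            uncovered.append(flight.number)
            continue
        queue.remove(chosen)
        chosen.duty_minutes_left -= flight.block_minutes
        assignments[flight.number] = chosen.name
    return assignments, uncovered


crews_at = {
    "DAL": deque([Crew("crew-1", 60), Crew("crew-2", 300)]),
    "PHX": deque([Crew("crew-3", 45)]),
}
flights = [
    Flight("F100", "DAL", 150),
    Flight("F200", "PHX", 90),
]
assignments, uncovered = assign_greedy(flights, crews_at)
print(assignments)
print(uncovered)
```

In this toy input the PHX flight goes uncovered even though a crew is standing at the airport, which is exactly the "crews are there but timed out" case. And none of it works unless `crews_at` is accurate, which is the part that reportedly broke.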

Also, keeping a backup crew at each airport isn’t super expensive because crews are only paid for the time they are operating the airplane while the doors are shut.


Also people want to go home. The aircraft don't care if they're in one city or another, but the crews do, for regional in particular it's normal to go home, and if you're out of position at the end of the day that's not possible.


And you have staffing contracts negotiated by relatively powerful unions that require some of these things.


> So theoretically any crew can operate any airplane.

This doesn’t actually hold up to reality because 737 variants are not interchangeable. Gauges may be in the same place but have a different appearance. There have been accidents directly caused by differences between variants. (Shutting down the wrong engine because the air conditioning now used bleed air from both engines, not just the right one.)

My understanding is difference training is required to certify a pilot for a specific variant.


The 737 is a single type rating. They are interchangeable. That was the whole idea behind Boeing's MCAS system - the 737-MAX didn't fly the same due to the engines being pulled further forward on the wing, so they made a piece of software that made the airplane feel the same as the 737NG.


You can’t fly a 737-200 for 3 years and then jump into a 737 NG without difference training. You won’t know how to work the flight management systems.

They are the same type, so you don’t need to re-qualify for the type. But you need to do training to move between variants of a type. It’s just a lot less training.


SWA has optimized their operations to the point where there is little margin for error. They have the quickest turnaround times. They land at the highest speeds their planes are designed for. They taxi at or maybe above the maximum allowed speeds once on the ground. This all works great most of the time but when it does fail, it tends to do so spectacularly since there is no slack in the system.


Except this outage has nothing to do with crew scheduling


Their “whoopsies, sorry we ruined your Christmas” mea culpa email was full of corporate horseshit and essentially no real changes, so this is pretty unsurprising.


Am I the only one annoyed with the use of the word "technology" to mean anything computer related? That would be like saying turbulence is "an issue with physics". A hammer is a type of technology too.

Edit: phrasing/typos


The more vague and hand-wavy it sounds, the better it is for corporate PR statements.


It appears to be over except for Dallas Love Field Airport.

"FAA lifts nationwide ground stop for Southwest Airlines flights after equipment issues"

https://www.cnn.com/travel/article/southwest-airlines-flight...


Why does FAA need to impose a ground stop? Couldn’t the airline decide for itself that it’s not going to have its own planes take off?


This implies the outage was affecting their pilots' ability to get accurate weight/balance data.


It's probably easier to allow the FAA to propagate the message than for SWA's systems (which seemed to be failing) to try to do so.


I’m in the middle of dealing with this as I’m not a proud nor rich man and always keep crawling back to Southwest. Flying Nashville > Phoenix > Oakland right now. As far as I can tell they just delayed all flights by 30 mins which set things back in place somehow


My wife and I had a recent bad experience on Southwest, where we boarded the plane at Love Field in Dallas and then were made to wait 2.5 hours on the tarmac, because their IT system had a different passenger count than what the crew counted. After several attempts at counting again, they ended up having to go to each passenger row by row with a manifest in hand, and verify each passenger's boarding pass and driver's license. I think aside from their IT system problems, this situation was made worse because as a passenger you don't have an assigned seat, so it's not obvious when someone is missing or a seat has two or more contending passengers.


Ironically, a Southwest Airlines recruiter messaged me today. Maybe that's actually a good sign they've realized they need to take their tech stack seriously?


What tech stack? They make their pilots call in to find out where the planes/pilots are.


Grandparent should pitch he can fix it in 2 weeks for $1 million.

Use vocode.dev to take the pilot calls. Use bot to ask the relevant questions, and get transcriptions. Send transcriptions to ChatGPT to formalize into well-defined json updates to the database. It's a 2 week project, max.


If the transcription messes up that flight might not fly .. which means the airline is out a ton of money trying to make alternative arrangements for a flight full of people.

Your invention could bankrupt the airline in a matter of months.


That means you can upsell them after they deploy. Have the bot read back the formalized updates to confirm if you want more accuracy.

And don't forget premium voices. I'd argue using General Adama's voice from Battlestar Galactica leads to fewer mistakes.

This is Battlestar actual, what's your status?

Who won't give accurate information when they hear that?


Version 2.0 has mechanical Turk. Version 3.0 introduces airline success manager associates that will chat with the pilots live and write it down in a notebook.


Unless you're an attorney.


This makes me think: how can a CEO manage a tech team (headed by a CTO) effectively? He usually doesn't know much about the tech involved, and even if he did come up from the trenches, he doesn't know every branch.

The only thing he can do is trust his CTO and let him pitch changes, innovations and such. But how does a CTO do that? Again, he has to rely on his generals to make the judgement.

This all sounds like a job impossible to do right. Any idea?


In these 3 steps:

1) Gather requirements: what the business expects the systems to do, and which capabilities they must provide for the business to operate and grow in the future.

2) Present this to the generals and ask them "how can we get here, and with which changes or new systems?".

3) Analyze and document the proposed changes, outlining pros and cons. Make a recommendation to the CEO.

Repeat the loop with an action plan and timeline if the CEO agrees; I'm sure the CEO will have input on where the action plan is focused and on the timelines. Ask for input from all Cxx and SVPs, then repeat the loop from that point.

Repeat until actionable work items are created and assigned to people to execute on. Monitor execution and keep doing it again and again until all the changes are in.


Don't forget the part where you actually analyze the result to see whether the needs reported to the team creating the solution were accurate, and then triage the mistakes and misunderstandings that inevitably happened.


This might sound dumb, but one interesting thing about Star Trek TNG was that Picard and Riker (captain / first mate) were the two foremost computer experts on the ship. It’s explicitly stated that they have the most knowledge of how the ship’s computer works.

Perhaps this was prescient. As tech begins to play an increasingly dominant role, the humans who directly manage that tech will play an increasingly dominant role as well.


That is interesting, but it requires suspension of disbelief to think that the foremost computer expert will also be the foremost management expert. The aptitudes and personalities required for deep understanding of tech rarely present with those required for effective management of large organizations.


You make a fair point. Unless, of course, the computers become the best managers of all.


Seems implausible that Riker would know more about the Enterprise's computer than Data.


They'll take taxpayer bailouts, buy back stock, cut staff, and pay record bonuses to executives.

But fuck investing in the business.


Perhaps we shouldn't do those bailouts...


I concur


Hey! Do us all a favor and don't post links to stories hosted on sites that won't let you read them until you subscribe. Here is a link to the AP News story which you can read freely as long as you don't mind the ads. https://apnews.com/article/southwest-flights-grounded-faa-fa...



From the HN FAQ https://news.ycombinator.com/newsfaq.html

> Are paywalls ok?

It's ok to post stories from sites with paywalls that have workarounds.

In comments, it's ok to ask how to read an article and to help other users do so. But please don't post complaints about paywalls. Those are off topic.



