What an irony that the Deutsche Bahn caused an outage of Deutsche Lufthansa. And Deutsche Lufthansa is now trying to shift domestic flights to Deutsche Bahn[0] (see also for general advice/help). And who recently became a Star Alliance member? Deutsche Bahn. And who has experience with severed fibers[1]? Deutsche Bahn!
*Please*
All critical systems should work autonomously. Cache flight/passenger data locally at the airport for the next few weeks. Servers are for synchronizing, not for keeping your data out of reach. With a local cache you can keep going for some weeks; adding/changing bookings will be harder, but you can stay in the air. That is pretty complicated and requires more work, but it is important. Git and IMAP are examples of how it should be done.
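A minimal sketch of this offline-first idea (all names are hypothetical, Python purely for illustration): reads are always served from the local copy, and writes made during an outage are queued for later synchronization, the way Git and IMAP clients work.

```python
import json
import time


class LocalFlightCache:
    """Offline-first cache: reads come from the local copy; writes
    made while the uplink is down are queued for later sync."""

    def __init__(self):
        self.passengers = {}   # flight -> list of names (synced copy)
        self.pending = []      # local writes not yet pushed upstream
        self.last_sync = None

    def sync(self, server_data):
        # The server is only for synchronizing: refresh the local copy.
        self.passengers = json.loads(json.dumps(server_data))
        self.last_sync = time.time()

    def manifest(self, flight):
        # Works even with the uplink down -- enough to print a list
        # and keep boarding with pen and paper.
        return sorted(self.passengers.get(flight, []))

    def add_passenger(self, flight, name):
        # Local change first, reconciled with headquarters later.
        self.passengers.setdefault(flight, []).append(name)
        self.pending.append((flight, name))
```

This is only a sketch of the architecture, not anything an airline actually runs; the point is that the cache, not the server, is the source of truth during an outage.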
Outages will happen and will keep happening, and as they get rarer, they will become more serious because they will hit inexperienced staff. British Airways in particular is known for such issues[2][3].
> What an irony that the Deutsche Bahn caused an outage of Deutsche Lufthansa.
Technically it wasn't Deutsche Bahn, but a hired construction company. And they just happened to cut some cables belonging to Deutsche Telekom, which also broke the connection for Lufthansa and some others. These things happen with construction work. But generally, they are all somewhat at fault.
> All critical systems should work autonomously.
It's not always obvious what is a critical system. I mean it was a semi-important sub-system of a bigger company, not something where you would assume it could cause a national impact.
>"It's not always obvious what is a critical system. I mean it was a semi-important sub-system of a bigger company, not something where you would assume it could cause a national impact."
Why would you not assume it was something that could cause national impact when cutting through fiber cables was the very thing that impacted rail travel for Deutsche Bahn 4 months ago in Northern Germany?[1]
DB uses GSM-R (Global System for Mobile Communications – Railway), which has fiber backhaul at regular intervals along the tracks.[2]
Given that the incident in October is still fresh in memory, surely there should have been heightened awareness of this.
Do you really think four months is enough time to check all your systems for unlikely problems outside your control, redesign them, implement and roll out the changes, and even convince your boss to spend all that money on something whose worst case will most likely cost less than the solution for preventing it? All assuming it's even possible to have a solution for every potential problem.
You don't need to check all your systems, just the area where you are deploying bulldozers and backhoes. Furthermore, enumerating all base stations and their geo-coordinates in a GSM network is trivial.
Caching at the edge might help in certain situations, but a global reservation and check-in system seems like the kind of thing that needs to be centralized.
I think a more workable solution would have been to treat transit for this data center how most other data centers are organized, with multiple independent uplinks that take different routes into the building. A last-resort point-to-point wireless link would probably be wise as well.
Most IT plans focus on mitigating the outage itself (e.g. multiple uplinks, backup servers, fallback datacenters).
I argue that critical systems should handle the outage gracefully. If a client caches the flight/passenger data (let's say two weeks' worth), you can print lists. Grab a pen and keep going.
That is enough for some hours and even days. If you want to go further, you can allow adding one or more passengers right at the gate/counter, and so on. You can also go all-in and add code that allows merging of external data, i.e. letting headquarters and airport personnel work at the same time.
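The "merging of external data" step can be sketched as a union merge over per-flight passenger sets (a grow-only-set approach; all names are hypothetical). Additions made independently by headquarters and gate staff commute, so they reconcile without conflicts; removals would need extra bookkeeping such as tombstones.

```python
def merge_manifests(local, remote):
    """Union-merge two independently edited passenger manifests.

    Each manifest maps flight -> collection of names. Treating the
    names as a grow-only set means two sites can both add passengers
    while disconnected and merge cleanly afterwards.
    """
    merged = {}
    for source in (local, remote):
        for flight, names in source.items():
            merged.setdefault(flight, set()).update(names)
    return merged
```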
The passengers were also not let onto the waiting aircraft via a tally sheet, because according to staff, important departure information was missing.
If it works that is an appropriate solution.
PS: We cannot foresee the circumstances. Maybe a hack, a burning datacenter, war, a coworker going crazy, international sanctions, a storm, flooding, whatever. For example, Deutsche Bahn itself recently suffered sabotage: two entirely independent glass fibers were cut at the same time (one in western Germany, one in eastern Germany).
Not to mention planes are intentionally overbooked (at least here in the US; I presume the same tactic exists to a lesser extent in places like Europe, with presumably better consumer protections) to ensure they're flying as close to full as possible, so you're constantly needing to access the larger booking system to rebook flights for people who were bumped, or to see who's been moved onto your flight. Two weeks is way too long; even a day is probably too long for a fresh passenger manifest.
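This freshness argument can be made mechanical. A hedged sketch (the cutoff is an invented number, not anything any airline uses) that refuses to board from a cached manifest older than a few hours:

```python
import time

# Assumed cutoff: hours, not weeks, given the rebooking churn
# caused by overbooking and bumped passengers.
MAX_MANIFEST_AGE = 6 * 3600  # seconds


def manifest_usable(last_sync_ts, now=None):
    """Return True if the cached manifest is fresh enough to board from."""
    now = time.time() if now is None else now
    return (now - last_sync_ts) <= MAX_MANIFEST_AGE
```

The caching strategy and the freshness requirement pull in opposite directions; where the cutoff sits is exactly the trade-off the two comments above are arguing about.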
The article says that Lufthansa actually has a backup line and it initially worked. It finally failed later [sic!] on Wednesday, maybe because the load was too much for it. So they tried, as usual, to lower the chance of an outage. But what is crucial is actually being able to keep working during an outage.
PS: The construction workers seem to have drilled through four cables (each with ~900 fibres) 4-5 meters below the surface.
This tweet [0] points out that the cable break happened on Tuesday afternoon, but the Lufthansa problems only occurred Wednesday morning. So it's more complicated than just one breakage; something else failed on top of that.
In English one would use 'by' for a deadline and 'until' when the state or action continues up to a point. e.g. "I must finish the task by the end of the day" or "I'll be working on the task until the end of the day". In German 'bis' is used for both cases.
To continue the tangent, I'm currently learning Spanish and appreciate the really long tail involved in language learning. I often work with people who, while they do not speak English as their first language, speak it pretty much perfectly. Except for maybe a few minor "glitches", which would just never occur with a native speaker. I often wonder how pointing these out would be received, and though I presume many would be appreciative, maybe not all...
Before and until are sometimes interchangable in English.
Don't go there before Monday. Don't go there until Monday.
You can get this deal until Monday. You can get this deal before Monday. (Some lack of clarity about whether the offer is valid on Monday with until though)
It should not take too long to resplice all those ribbon fibers, unless the site is difficult to get to or the armored cable is in worse shape than just a clean cut.
Unfortunately, it's very difficult to determine the path of redundant fiber without a backhoe.
It is incredibly common for multiple fiber bundles to share a path, because it's often easier to locate another bundle in the same place. Fiber is often laid along rail lines because rail lines have the perfect shape for communication routes, and there are specialty train cars that make laying fiber alongside the rails really efficient.
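One practical consequence: "redundant" paths need to be checked for shared fate. A tiny sketch (hop names are hypothetical) that flags common hops between two supposedly independent uplinks, given traceroute-style hop lists:

```python
def shared_hops(path_a, path_b):
    """Report the hops two 'redundant' uplinks have in common.

    A non-empty result means a single cut or router failure can take
    out both paths at once; an empty result only means the paths are
    plausibly independent at the hops you can observe.
    """
    return set(path_a) & set(path_b)
```

Of course this only catches overlap at the IP layer; two distinct routers sitting in the same conduit will not show up, which is exactly the problem discussed below.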
Designed, sure. Deployed? All I can say is the number of incidents where everyone was assured that the fiber strands were independent, but in actuality they weren't is a lot more than zero.
It's really easy for a vendor to claim that there's no point at which a single backhoe can take out two of the A/B/C/D/E paths in your diagram. It's also pretty easy to buy dark fiber from two different vendors that are reselling in the same bundle or have their bundles placed in the same conduit.
Wrong. When you build a customer a metro fiber circuit, the ring is already built, often years before. You simply pull wavelengths out of the fiber and drop them off at the meet-me room in the POP. Source: I used to do this.
>"It's also pretty easy to buy dark fiber from two different vendors that are reselling in the same bundle or have their bundles placed in the same conduit."
Wrong again. If you are purchasing an IRU on dark fiber you know exactly who owns the physical assets. If you are purchasing wavelengths from a reseller they will happily disclose whose network they are reselling. Additionally you can request the CLR/DLR for your circuit and see exactly how it's built. You can also easily avoid resellers and not worry at all about this.
Lastly you seem to not understand the difference between long haul and metro fiber.
Interesting: most of the north-west area of Frankfurt was affected; residential cable internet failed, and the LTE/5G network of Deutsche Telekom didn't work either.
Some people suggested adding {2,3,4,5}G as another fallback. This often fails because the cell tower is connected to the same cables. Or the backbone burns down. Or the connection between datacenters fails (see the recent AWS failures). Or a coworker goes rogue...
It is good to prevent failures. But autonomous local systems should remain (basically) usable for some time. And if the local system is paper, that's at least something. We also have ABS, ESP and ASR in cars; we still fasten our seatbelts.
Everything fails all the time. The fiber cut was not responsible for this. Lufthansa having woefully inadequate resiliency plans in place with their infrastructure is to blame.
This is 100% on poor management practices by Lufthansa, but of course they’re going to point the press to that shiny object over there in the form of a fiber cut. The press, as usual, took the bait.
Push out the PR and remember that journalists are not technical people.
Had there been multiple cuts in multiple different locations I'd be more sympathetic. Shetland being a recent example [0], where one fibre was cut, then the other one in a completely different direction was also cut.
Assuming they had two cables and both were broken: whether something as critical as the entire LH booking system deserves more resilience than just two geographically diverse cables, I'm not sure -- ultimately it's a question of cost, damage, and likelihood.
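That cost/damage/likelihood trade-off is just expected-value arithmetic. With numbers invented purely for illustration (none of these figures come from Lufthansa):

```python
# All numbers below are assumptions made up for this sketch.
p_double_cut = 0.02          # assumed yearly chance both existing cables fail
outage_cost = 5_000_000      # assumed cost of a day-long grounding (EUR)
third_path_cost = 80_000     # assumed yearly lease for a third diverse path

# Expected yearly loss without the extra path vs. the cost of adding it.
expected_loss = p_double_cut * outage_cost
worth_it = expected_loss > third_path_cost
```

With these made-up inputs the third path pays for itself, but flip the probability an order of magnitude lower and it no longer does, which is exactly why the answer is "I'm not sure" without real numbers.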
There was a fiber cut last year near Chicago. It was near a railroad so a bunch of authorizations were required and it took about three days to complete the work. My parents live in a rural area that has fiber but didn't have internet because the local ISP didn't have any redundancy.
We switched from microwave antenna which has its own issues to fiber and my dad is thinking, "Well, you said this would be better." You can blame the local ISP, but am I wrong to think that it's hard for a rural ISP providing fiber to afford redundancy?
A rural ISP might not even have the possibility of redundancy. There might be just one backbone fiber within reasonable distance. For example, there's a local rural fiber ISP near me that had to invest quite a lot into digging a fiber trench several miles to the nearest backbone connection before they even got off the ground.
This could have happened with the microwave antenna anyway: the other side of your antenna is most probably connected with a fiber cable, which can also be cut.
Crazy that an airline that is very used to the need for double and triple redundancy in the aircraft it flies fails to have any sort of connectivity backup for its ground systems.
They have the legal obligation to have those double/triple redundant systems in their aircraft. Otherwise it is probable they would not have as many safety systems.
Does anyone know if these systems need such high bandwidth that they couldn't run over 4G or 5G temporarily? Perhaps some flights, say 25%, could actually still be processed?
Right. I'm not even sure the interconnect is IP-based. It could very well be a legacy X.25 network. A quick check of Wikipedia says that X.25 is still used in the aviation industry. Routing X.25 over 4G probably involves adding a layer to encapsulate in IP.
This is Germany. I bet they have at least one company nobody heard of that makes money hand over fist specifically by doing X.25 over IP for industrial plants.
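For what it's worth, there is a standard way to do this: XOT (X.25 over TCP, RFC 1613) frames each X.25 packet with a small four-byte header and ships it over a TCP connection on port 1998. A minimal sketch of just the framing:

```python
import struct

XOT_TCP_PORT = 1998  # registered port for X.25 over TCP (RFC 1613)


def xot_frame(x25_packet: bytes) -> bytes:
    """Prepend the four-byte XOT header (version 0, then packet length)
    so an X.25 packet can be carried over an ordinary TCP stream."""
    return struct.pack("!HH", 0, len(x25_packet)) + x25_packet


def xot_unframe(frame: bytes) -> bytes:
    """Strip the XOT header and return the original X.25 packet."""
    version, length = struct.unpack("!HH", frame[:4])
    if version != 0:
        raise ValueError("unsupported XOT version")
    return frame[4:4 + length]
```

A real gateway would also have to handle X.25 call setup and flow control, which is presumably where that hypothetical German niche company earns its money.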
How is this possible? Presumably whichever datacenter it was hosted in, would have multiple fiber lines connecting to it? Or am I just spoiled by the major cloud providers?
It was a major outage for the whole area, and there is not much competition in terms of cables, for historical reasons.
Though it's also possible that the other connections just failed, or could not take the sudden traffic, or the system crashed for some nonsensical reason, because nobody really tested this scenario.
Internet Speeds in Germany are lacking in some areas, but fast internet is generally available. I have a symmetrical 1 gig FTTH connection in the countryside
Except when you rent and the landlord says "Internet is internet, why do you need a different one?"; then the next desperate tenant will take whatever is there.
[0] https://www.lufthansa.com/xx/en/flight-information.html
[1] https://www.heise.de/news/Sabotage-bei-der-Bahn-Viele-vertra...
[2] https://thepointsguy.co.uk/2017/09/amadeus-network-issue-cau...
[3] https://www.datacenterdynamics.com/en/news/british-airways-e...
PS: Deutsche Lufthansa is known for reliable transport. Deutsche Bahn is known for unreliable transport.