
And why could the system not put the failed flight plan in a queue for human review and just keep on working for the rest of the flights? I think the lack of that “feature” is what I find so boggling.


Because the code classified it as a "this should never happen!" error, and then it happened. The code didn't classify it as a "flight plan has bad data" error or a "flight plan data is OK but we don't support it yet" error.

If a "this should never happen!" error occurs, then you don't know what's wrong with the system or how bad or far-reaching the effects are. Maybe it's like what happened here and you could have continued. Or maybe you're getting the error because the software has a catastrophic new bug that will silently corrupt all the other flight plans and get people killed. You don't know whether it is or isn't safe to continue, so you stop.


That reasoning is fine, but it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough considering all scenarios. As TA expounds, it seems that neither formal methods nor fuzzing were used, which would have gone a long way toward flushing out such errors.


> it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough considering all scenarios

Yes. But also, it's an ATC system. Its primary purpose "is to prevent collisions..." [1].

If the system encounters a "this should never happen!" error, the correct move is to shut it down and ground air traffic. (The error shouldn't have happened in the first place. But the shutdown should have been more graceful.)

[1] https://en.wikipedia.org/wiki/Air_traffic_control


Neither formal methods nor fuzzing would've helped if the programmer didn't know that input can repeat. Maybe they just didn't read the paragraph in whatever document describes how this should work and didn't know about it.

I didn't have to implement flight control software, but I had to write some stuff specified by MiFID. It's a job from hell if you take it seriously. MiFID is a series of normative documents explaining how banks have to interact with each other, published faster than they could be implemented (which is why the date they were to take effect was rescheduled several times).

These documents aren't structured to answer every question a programmer might have. Sometimes the "interesting" information happens to sit close together. Sometimes you have to guess which keyword to search for to discover all the "interesting" parts... and the whole thing can be thousands of pages long.


The point of fuzzing is precisely to discover cases that the programmers couldn't think about, and formal methods are useful to discover invariants and assumptions that programmers didn't know they rely on.

Furthermore, identifiers from external systems always deserve scepticism. Even UUIDs can be suspect. Magic strings from hell even more so.


Sorry, you missed the point.

If the programmer didn't know that repetitions are allowed, they wouldn't appear in the input to the fuzzer either.

The mistake is too trivial to attribute to programmer incompetence or lack of attention. I'd bet my lunch it happened because the spec is written in incomprehensible language, is scattered all over a thousand-page PDF, and the particular aspect of repetition isn't covered in what looks like the main description of how paths are defined.

I've dealt with specs like that. The error was most likely created by a lack of understanding of the details of the requirements rather than by anything else. No automatic testing technique would help here. A more rigorous and systematic approach to requirements specification probably would, but we have no tools and no processes to address that.


> If the programmer didn't know that repetitions are allowed, they wouldn't appear in the input to the fuzzer either.

It totally would. The point of a fuzzer is to exercise the system with inputs drawn from the whole space of technically possible inputs, precisely to avoid bias and blind spots in the programmer's thinking.

Furthermore, assuming that no duplicates exist is a rather strong assumption that should always be questioned. Unless you know all about the business rules of an external system, you can't trust its data and can't assume much about its behavior.
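
As a rough illustration, a property-based test along these lines (a minimal sketch using Hypothesis; extract_uk_portion and the waypoint alphabet are made up) will end up generating plans with repeated waypoint names whether or not anyone thought of them, simply because the name space is small:

    from hypothesis import given, strategies as st

    # Stand-in stub for the real routine that extracts the UK portion of a plan.
    def extract_uk_portion(waypoints):
        ...

    # Waypoint names drawn from a deliberately tiny alphabet, so duplicated
    # names show up in generated plans by accident rather than by design.
    waypoint_names = st.text(alphabet="ABDEVL", min_size=3, max_size=5)

    @given(st.lists(waypoint_names, min_size=2, max_size=30))
    def test_never_crashes_on_any_plan(waypoints):
        # The only property asserted: a plan may be rejected, but must never raise.
        extract_uk_portion(waypoints)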

Anyway, we are discussing the wrong issue. Bugs happen, and even halting the whole system can be justified, but the operators should have had an easier time figuring out what was actually going on, without the vendor having to pore through low-level logs.


No... that's not the point of fuzzing... You cannot write individual functions in such a way that they keep revalidating the input handed to them, because then, invariably, the validations will differ from function to function, and once you have an error in your validation logic, you will have to track down every function that does this validation. So functions have to make assumptions about input, as long as it doesn't come from an external source.

I.e. this function wasn't the one doing all the work -- it could assume the input was valid because the function that provided the input had already ensured validation happened.

It's pointless to deliberately send invalid input to a function that expects (for a good reason) that the input is valid -- you will create a ton of worthless noise instead of looking for actual problems.

> Furthermore, assuming that no duplicates exist is a rather strong assumption that should always be questioned.

How do you even come up with this? Do you write your code in such a way that any time it pulls a value from a dictionary, you iterate over the dictionary keys to make sure that they are unique?... There are plenty of things that are meant to be unique by design. The function in question wasn't meant to check if the points were unique. For all we know, the function might have been designed to take a map and the data was lost even before this function started processing it...

You really need to try doing what you suggest before suggesting it.


I am not going to comment on the first paragraph, since you turned my words around.

> How do you even come up with this? Do you write your code in such a way that any time it pulls a value from a dictionary, you iterate over the dictionary keys to make sure that they are unique?

A dictionary in my program is under my control and I can be sure that the key is unique since... well, I know it's a dictionary. I have no such knowledge about data coming from external systems.

> There are plenty of things that are meant to be unique by design. The function in question wasn't meant to check if the points were unique. For all we know, the function might have been designed to take a map and the data was lost even before this function started processing it...

"Meant to be" and "actually are" can be very different things, and it's the responsibility of a programmer to establish the difference, or to at least ask pointed questions. Actually, the programmers did the correct thing by not sweeping this unexpected problem under the rug. The reaction was just a big drastic, and the system did not make it easy for the operators to find out what went wrong.

Edit: as we have seen, input can be valid but still not be processable by our code. That's not fine, but it's a fact of life, since specs are often unclear or incomplete. Also, the rules can change without us noticing. In these cases, we should make it as easy as possible to figure out what went wrong.


I've only heard from people engineering systems for the aerospace industry, and we're talking hundreds of pages of API documentation. It is very complex, so the chances of human error are correspondingly higher.


I agree with the general sentiment "if you see an unexpected error, STOP", but I don't really think that applies here.

That is, when processing a sequential queue, which is what this job does, it seems to me from reading the article that each job in the queue is essentially totally independent. In that case, the code most definitely should isolate "unexpected error in a job" from a larger "something unknown happened processing the higher level queue".

I've actually seen this bug in different contexts before, and the lesson is always the same: one bad job shouldn't crash the whole system. Error handling boundaries should be drawn so that a bad job is taken out of the queue and handled separately. If you don't do this (which really just entails being thoughtful, when processing jobs, about the types of errors that are specific to an individual job), I guarantee you'll have a bad time, just like these maintainers did.
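
A minimal sketch of that error boundary (handle_plan and quarantine are hypothetical names; a real ATC system would need far more care about which exceptions are safe to contain):

    import logging

    def process_plans(queue, handle_plan, quarantine):
        """Process each flight plan independently: one bad plan must not stop the queue."""
        for plan in queue:
            try:
                handle_plan(plan)
            except Exception:
                # The error boundary wraps the individual job, not the whole queue:
                # park the plan for human review and keep processing the rest.
                logging.exception("could not process flight plan %r, quarantining it", plan)
                quarantine(plan)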


If the code takes a valid series of ICAO waypoints and routes, generates the corresponding ADEXP waypoint list, but then when it uses that to identify the ICAO segment that leaves UK airspace it's capable of producing a segment from before when the route enters UK airspace, then that code is wrong, and who knows what other failure modes it has?

Maybe it can also produce the wrong segment within British airspace, meaning another flight plan might be processed successfully, but with the system believing it terminates somewhere it doesn't?

Maybe it's already been processing all the preceding flight plans wrongly, and this is just the first time when this error has occurred in a way that causes the algorithm to error?

Maybe someone's introduced an error in the code or the underlying waypoint mapping database and every flight plan that is coming into the system is being misinterpreted?


An "unexpected error" is always a logic bug. The cause of the logic error is not known, because it is unexpected. Therefore, the software cannot determine if it is an isolated problem or a systemic problem. For a systemic problem, shutting down the system and engaging the backup is the correct solution.


I'm pretty inexperienced, but I'm starting to learn the hard way that more complex error recovery takes more discipline. (Just recently, my implementation of what you're suggesting - limiting the blast radius of server-side errors - meant all my tests kept passing, with only a logged error that I missed, when I made a typo.)

Considering their level 1 and 2 support techs couldn't access the so-called "low level" logs with the actual error message, it's not clear to me they'd be able to keep up with a system with more complicated failure states. For example, they'd need to make sure that every plan rejected by the computer is routed to and handled by a human.


> is essentially totally independent

They physically cannot be independent. The system works on an assumption that the flight was accepted and is valid, but it cannot place it. What if it accidentally schedules another flight in the same time and place?


> What if it accidentally schedules another flight in the same time and place?

Flight plans are not responsible for flight separation. It is not their job and nobody uses them for that.

As a first approximation, they are used so ATC doesn’t need to ask every airplane every five minutes “so, flight ABC123, where do you want to go today?”

I’m starting to think that there is a need for a “falsehoods programmers believe about aviation” article.


Except that you can't be sure this bad flight plan doesn't contain information that will lead to a collision. The system needs to maintain the integrity of all plans it sees. If it can't process one, and there's the risk of a plane entering airspace with a bad flight plan, you need to stop operations.


>> Except that you can't be sure this bad flight plan doesn't contain information that will lead to a collision.

Flight plans don't contain any information relevant for collision avoidance. They only say when and where the plane is expected to be. There is not enough specificity to ensure no collisions. Things change all the time, from late departures, to diverting around bad weather. On 9/11 they didn't have every plane in the sky file a new flight plan carefully checked against every other...


But they have 4 hours to reach out to the one plane whose flight plan didn't get processed and tell them to land somewhere else.


Assuming they can identify that plane.

Aviation is incredibly risk-averse, which is part of why it's one of the safest modes of travel that exists. I can't imagine any aviation administration in a developed country being OK with a "yeah just keep going" approach in this situation.


That's true, but then, why did engineers try to restart the system several times if they had no clue what was happening, and restarting it could have been dangerous?


And that's why I never (or very rarely) put "this should never happen" exceptions anymore in my code

Because you eventually figure out that, yes, it does happen


A customer of mine is adamant in their resolve to log errors, retry a few times, give up and go on with the next item to process.

That would have grounded only the plane with the flight plan that the UK system could not process.

Still a bug, but with less impact on the whole continent: as it was, planes that could not get into or out of the UK could not fly, and that affected all of Europe and possibly beyond.


> That would have grounded only the plane with the flight plan that the UK system could not process.

By the looks of it, the plane was a few hours in the air by the time the system had its breakdown. Considering the system didn't know what the problem was, it seems appropriate that it shut down. No planes collided, so the worst didn't happen.


Couldn't the outcome be "access to the UK airspace denied" only for that flight? It would have checked with an ATC and possibly landed somewhere before approaching the UK.

In the case of a problem with all flights, the outcome would have been the same they eventually had.

Of course I have no idea if that would be a reasonable failure mode.


This here is the true takeaway. The bar for writing "this should never happen" code must be set so impossibly high that it might as well be translated into "'this should never happen' should never happen"


The problem with that is that most programming languages aren't sufficiently expressive to be able to recognise that, say, only a subset of switch cases are actually valid, the others having been already ruled out. It's sometimes possible to re-architect to avoid many of this kind of issue, but not always.

What you're often led to is "if this happens, there's a bug in the code elsewhere" code. It's really hard to know what to do in that situation, other than terminate whatever unit of work you were trying to complete: the only thing you know for sure is that the software doesn't accurately model reality.
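
A minimal sketch of that kind of branch (Python; Phase and the "descent was already filtered out upstream" rule are invented for illustration):

    from enum import Enum, auto

    class Phase(Enum):
        CLIMB = auto()
        CRUISE = auto()
        DESCENT = auto()

    def handling_for(phase: Phase) -> str:
        # Upstream code is supposed to have filtered out DESCENT already,
        # but the type system has no way to express that narrowed set.
        if phase is Phase.CLIMB:
            return "expedite climb"
        if phase is Phase.CRUISE:
            return "maintain level"
        # Reaching this line is not "bad input" -- it means an assumption made
        # elsewhere has been violated, i.e. there is a bug somewhere else.
        raise AssertionError(f"unreachable phase {phase!r}: upstream filtering is broken")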

In this story, there obviously was a bug in the code. And the broken algorithm shouldn't have passed review. But even so, the safety critical aspect of the complete system wasn't compromised, and that part worked as specified -- I suspect the system behaviour under error conditions was mandated, and I dread to think what might have happened if the developers (the company, not individuals) were allowed to actually assume errors wouldn't happen and let the system continue unchecked.


So what does your code do when you hit a "this should never happen" condition you did not handle? Exit and print a stack trace to stdout?


To be fair, the article suggests early on that sometimes these plans are being processed for flights already in the air (although at least 4 hours away from the UK).

If you can stop the specific problematic plane taking off then keeping the system running is fine, but once you have a flight in the air it's a different game.

It's not totally unreasonable to say "we have an aircraft en route to enter UK airspace and we don't know when or where - stop planning more flights until we know where that plane is".

If you really can't handle the flight plan, I imagine a reasonable solution would be to somehow force the incoming plane to redirect and land before reaching the UK, until you can work out where it's actually going, but that's definitely something that needs to wait for manual intervention anyway.


> "we have an aircraft en route to enter UK airspace and we don't know when or where - stop planning more flights until we know where that plane is".

Flight plans don't tell where the plane is. Where is this assumption coming from?


Presumably you need to know where upcoming flights are going to be in the future (based on the plan), before they hit radar etc.


For the most part (although there are important exceptions), IFR flights are always in radar contact with a controller. The flight plan is a tool that allows ATC and the plane to agree on a route so that they don't have to be constantly communicating. ATC 'clears' a plane to continue on the route to a given limit, and expects the plane to continue on the plan until that limit unless it gives further instructions.

In this regard UK ATC can choose to do anything they like with a plane when it comes under their control - if they don't consider the flight plan to be valid or safe they can just instruct the plane to hold/divert/land etc.

I'm not sure the NATS system that failed has the ability to reject a given flight plan back upstream.


Mostly yes; however, there are large parts of the Atlantic and Pacific where that isn't true (no radar contact). I know the Atlantic routes are frequently full of planes that left the US and Canada heading for the UK.

I have no idea what percent of the volume into the UK comes from outside radar control; if they asked a flight to divert, that may open multiple other cans of worms.


> If they asked a flight to divert, that may open multiple other cans of worms.

Any ATC system has to be resilient enough to handle a diversion on account of things like bad weather, mechanical failure or a medical emergency. In fact, I would think the diversion of one aircraft would be less of a problem than those caused by bad weather, and certainly less than the problem caused by this failure. Furthermore, I would guess that the mitigation would be just to manually direct the flight according to the accepted flight plan, as it was a completely valid one.

One of the many problems here is that they could not identify the problem-triggering flight plan for hours, and only with the assistance of the vendor's engineers. Another is that the system had immediately foreclosed on that option anyway, by shutting down.


Flight plans do inform ATC where and when a plane is expected to enter their FIR though, no?


Only theoretically. In practice the only thing that usually matches is from which other ATC unit the plane is coming. But it could be on a different route and will almost always be at a different time due to operational variation.

That doesn't matter, because the previous unit actively hands the plane over. You don't need the flight plan for that.

What does matter is knowing what the plane is planning to do inside your airspace. That's why they're so interested in the UK part of the flight plan. Because if you don't give any other instructions, the plane will follow the filed routing. Making turns on its own, because the departing ATC unit cleared it for that route.


> the previous unit actively hands the plane over. You don't need the flight plan for that.

I thought practically, what's handed over is the CPL (current flight plan), which is essentially the flight plan as filed (FPL) plus any agreed-upon modifications to it?

> Because if you don't give any other instructions, the plane will follow the filed routing. Making turns on its own, because the departing ATC unit cleared it for that route.

Without voice or datalink clearance (i.e. the plane calling the new ATC), would the flight even be allowed to enter a new FIR?


To be fair that is exactly what the article said was a major problem, and which the postmortem also said was a major problem. I agree I think this is the most important issue:

> The FPRSA-R system has bad failure modes

> All systems can malfunction, so the important thing is that they malfunction in a good way and that those responsible are prepared for malfunctions.

> A single flight plan caused a problem, and the entire FPRSA-R system crashed, which means no flight plans are being processed at all. If there is a problem with a single flight plan, it should be moved to a separate slower queue, for manual processing by humans. NATS acknowledges this in their "actions already undertaken or in progress":

>> The addition of specific message filters into the data flow between IFPS and FPRSA-R to filter out any flight plans that fit the conditions that caused the incident.


Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.

Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code. Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.

For these kinds of things the post mortem and remediation have to kinda take as given that eventually a not predictable in advance unhandled unknown error will occur and then work on how it could be handled better. Because of course the solution to a bug is to fix the bug, but the issue and the reason for the meltdown is a DR plan that couldn't be implemented in a reasonable timeframe. I don't care what programming practices, what style, what language, what tooling. Something of a similar caliber will happen again eventually with probability 1 even with the best coders.


I agree with your first paragraph, but your second paragraph is quite defeatist. I was involved in quite a few "premortem" meetings where people think up increasingly improbable failure modes and devise strategies for them. It's a useful meeting to have before large changes to critical systems go live. In my opinion, this should totally be a known error.

> Having found an entry and exit point, with the latter being the duplicate and therefore geographically incorrect, the software could not extract a valid UK portion of flight plan between these two points.

It doesn't take much imagination to surmise that perhaps real world data is broken and sometimes you are handed data that doesn't have a valid UK portion of flight plan. Bugs can happen, yes, such as in this case where a valid flight plan was misinterpreted to be invalid, but gracefully dealing with the invalid plan should be a requirement.


> Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code.

I think there's a world of difference between writing bug free code, and writing code such that a bug in one system doesn't propagate to others. Obviously it's unreasonable to foresee every possible issue with a flight plan and handle each, but it's much more reasonable to foresee that there might be some issue with some flight plan at some point, and structure the code such that it doesn't assume an error-free flight plan, and the damage is contained. You can't make systems completely immune to failure, but you can make it so an arbitrarily large number of things have to all go wrong at the same time to get a catastrophic failure.


> Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.

How many KeyError exceptions have brought down your whole server? It doesn't happen because whoever coded your web framework knows better and added a big try-catch around the code which handles individual requests. That way you get a 500 error on the specific request instead of a complete shutdown every time a developer made a mistake.
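
Something like this WSGI-style middleware sketch (the names are mine, not from any particular framework): one failing request becomes a 500 response instead of a dead server.

    import logging
    import sys

    def error_boundary(app):
        """Wrap a WSGI app so an unhandled exception yields a 500 for that request only."""
        def wrapped(environ, start_response):
            try:
                return app(environ, start_response)
            except Exception:
                logging.exception("unhandled error in request %s", environ.get("PATH_INFO"))
                start_response("500 Internal Server Error",
                               [("Content-Type", "text/plain")],
                               sys.exc_info())
                return [b"internal server error"]
        return wrapped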


Crashing is a feature, though. It's not like exceptions got into interpreter specifications by themselves. It just so happens that web apps don't need airbags that slow the business down.


That line of reasoning is how you have systemic failures like this (or the Ariane 5 debacle). It only makes sense in the most dire of situations, like shutting down a reactor, not input validation. At most this failure should have grounded just the one affected flight rather than the entire transportation network.


On a multi-user system, only partial crashes are features. Total crashes are bugs.

A web server is a multi-user system, just like a country's air traffic control.


I love that phrasing, I'm gonna use that from now on when talking about low-stakes vs high-stakes systems.


> big try-catch around the code which handles individual requests.

I mean, that's assuming the code isolating requests is also bug free. You just don't know.


> Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.

What surprised me more is that the amount of data for all waypoints on the globe is quite small. If I were to implement a feature that queries them by name as an identifier, the first thing I'd do is check the dataset for duplicates, because if there are any, I need to consider that condition in every place where I'd be querying a waypoint by a potentially duplicated identifier.

I had that thought immediately when looking at the flight plan format and noticing the short strings referring to waypoints, well before getting to the section where they point out the name collision issue.

Maybe I'm too used to working with absurd amounts of data (at least in comparison to this dataset); it's a constant part of my job to do some cursory data analysis to understand the parameters of the data I'm working with: which values can be duplicated or malformed, and so on.
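
The kind of cursory check I mean, as a sketch (load_waypoints() is assumed to return the hypothetical global waypoint table):

    from collections import Counter

    # How many short names identify more than one geographic point?
    name_counts = Counter(w.name for w in load_waypoints())
    duplicates = sorted(name for name, n in name_counts.items() if n > 1)
    print(f"{len(duplicates)} waypoint names are ambiguous, e.g. {duplicates[:10]}")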


If there are duplicate waypoint IDs, they are not close together. They can be easily eliminated by selecting the one that is one hop away from the prior waypoint. Just traversing the graph of waypoints in order would filter out any unreachable duplicates.
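
Roughly this kind of lookup, as a sketch (the airway_graph adjacency map and the names are hypothetical):

    def resolve_waypoint(name, candidates, previous, airway_graph):
        """Pick the waypoint with this (duplicated) name that is one hop from the previous one.

        candidates: all waypoints sharing the name
        airway_graph: mapping of waypoint -> set of directly connected waypoints
        """
        reachable = [wp for wp in candidates if wp in airway_graph.get(previous, set())]
        if len(reachable) == 1:
            return reachable[0]
        raise ValueError(f"cannot disambiguate waypoint name {name!r}")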


That it's safety critical is all the more reason it should fail gracefully (albeit surfacing errors to warn the user). A single bad flight plan shouldn't jeopardize things by making data on all the other flight plans unavailable.


That's like saying that because one browser tab tried to parse some invalid JSON then my whole browser should crash.


Well yes because you're describing a system where there are really low stakes and crash recovery is always possible because you can just throw away all your local state.

The flip side would be like a database failing to parse some part of its WAL log due to disk corruption and just said, "eh just delete those sections and move on."


Crash the tab and allow all the others to carry on!

The problem here is that one individual document failed to parse.


The other “tabs” here are other airplanes in flight, depending on being able to land before they run out of fuel. You don’t just ignore one and move on.


Nonsense comparison: your browser's tabs are de facto insulated from each other, while flight paths for 7,000 daily planes over the UK literally share the same space.


You don't know that the JSON is invalid. Maybe the JSON is perfect and your parser is broken.


No, it's more like saying your browser has detected possible internal corruption with, say, its history or cookies database and should stop writing to it immediately. Which probably means it has to stop working.


It definitely isn't. It was just a validation error in one of the thousands of external data files that the system processes. Something very routine for almost any software dealing with data.


The algorithm described in the blog post is probably not implemented as a straightforward piece of procedural code that goes step by step through the input flight plan's waypoints. It may be implemented in a way that incorporates abstractions which obscured the fact that this was an input error.

If from the code’s point of view it looked instead like a sanity failure in the underlying navigation waypoint database, aborting processing of flight plans makes a lot more sense.

Imagine the code is asking some repository of waypoints and routes ‘find me the waypoint where this route leaves UK airspace’; then it asks to find the route segment that incorporates that waypoint; then it asserts that that segment passes through UK airspace… if that assertion fails, that doesn’t look immediately like a problem with the flight plan but rather with the invariant assumptions built into the route data.
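
Something with roughly this hypothetical shape (none of these names come from the real system):

    def uk_exit_segment(plan, waypoints, routes):
        # "Where does this route leave UK airspace?"
        exit_wp = waypoints.exit_point(plan, region="UK")
        segment = routes.segment_containing(exit_wp)
        # If this fires, nothing in scope says "bad flight plan" -- it reads as if
        # the waypoint/route repository itself has violated an invariant.
        assert segment.crosses(region="UK"), "route data violates invariant"
        return segment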

And of course in a sense it is potentially a fatal bug because this issue demonstrates that the assumptions the algorithm is making about the data are wrong and it is potentially capable of returning incorrect answers.


I've had brief glimpses of these systems, and honestly I wouldn't be surprised if it took more than a year for a simple feature like this to be implemented. These systems look like decades of legacy code duct-taped together.


> why could the system not put the failed flight plan in a queue

Because it doesn't look at the data as a "flight plan" consisting of "waypoints" with "segments" along a "route" that has any internal self-consistency. It's a bag of strings and numbers that's parsed and the result passed along, if parsing is successful. If not, give up. In this case, fail the entire system and take it out of production.

Airline industry code is a pile of badly-written legacy wrappers on top of legacy wrappers. (Mostly not including actual flight software on the aircraft. Mostly.) The FPRSA-R system mentioned here is not a flight plan system, it's an ETL system. It's not coded to model or work with flight plans; it's just parsing data from system A, re-encoding it for system B, and failing hard if it can't.


Good ETLs are usually designed to separate good records from bad ones, so even if one or two rows in the stream don't conform to the schema, you can set them aside and process the rest.

Seems like poor engineering.


The problem is that it means you have a plane entering the airspace at some point in the near future and the system doesn't know it is going to be there. The whole point of this is to make sure no two planes are attempting to occupy the same space at the same time. If you don't know where one of the planes will be you can't plan all of the rest to avoid it.

The thing that blows my mind is that this was apparently the first time this situation had happened after 15 million records processed. I would have expected it to trigger much more often. It makes me wonder if there wasn't someone who was fixing these as they came up in the 4 hour window, and he just happened to be off that day.


Bad records aren't supposed to be ignored. They are supposed to be looked at by a human who can determine what to do.

Failing the way NATS did means that all future flight plan data, including for planes already in the sky, is no longer being processed. The safer failure mode was definitely to flag this plan and surface it to a human while continuing to process the other plans.


> It makes me wonder if there wasn't someone who was fixing these as they came up in the 4 hour window, and he just happened to be off that day.

This is very possible. I know of a guy who does (or at least a few years ago did) 24x7x365 on-call for a piece of mission-critical (although not safety-critical) aviation software.

Most of his calls were fixing AWBs quickly because otherwise planes would need to take off empty or lose their take-off slot.

Although there had been some “bus factor” planning and mitigation around this guy’s role, it involved engaging vendors etc. and would have likely resulted in a lot of disruption in the short term.


Please tell me this guy is now wealthy beyond imagination and living a life of leisure?


I would love to. But it wouldn’t be true.


A one-in-15M chance, with 7,000 daily flights over the UK handled by NATS, meant it could be expected to happen about once every 69 months; it took a few months less.
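
Rough check of that figure (assuming ~7,000 plans a day and 1-in-15M odds per plan):

    15_000_000 / 7_000          # ≈ 2,143 days between expected occurrences
    15_000_000 / 7_000 / 30.4   # ≈ 70 months, close to the quoted 69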


I never said it was a good ETL system. Heck, I don't even know if the spec for it even specifies what to do with a bad record - there are at least 300 pages detailing the system. Looking around at other stories, I see repeated mentions of how the circumstances leading to this failure are supposedly extremely rare, "one in 15 million" according to one official[1]. But at 100,000 flights/day (estimated), this kind of situation would occur, statistically, about twice a year.

1 https://news.sky.com/story/major-flights-disruption-caused-b...


This flight plan was correct, though; if there had been validation like that, it should have passed.

The code that crashed had a bug: it couldn't deal with all valid data.


Because some software developers are crap at their jobs.



