You've got it exactly right. There are a lot of people here who are completely deluded by the "move fast and break things" mindset, not realizing that sometimes you really do not want to move fast, because you REALLY do not want to break things. A corrupted file throwing up panics like this is a good thing, because you don't want corrupted files to pass through like everything is fine.
If the corrupted file is in the backup, it DID pass through like everything was fine. What's clear to me is that the FAA has no post deployment validation, hasn't tested its DR strategy, and that errors can go unseen for long periods of time.
> What's clear to me is that the FAA has no post deployment validation, hasn't tested its DR strategy, and that errors can go unseen for long periods of time.
It is possible to have all of those mitigations in place and still experience a failure like this.
Post deployment validation is only as good as the validations executed. 99% coverage still leaves the door open to failure.
A DR strategy is just that - a strategy.
A failure of this sort is not an automatic implication that those things do not exist, just that they failed in this particular case.
I would find it incredibly surprising that an organization of that complexity could have survived as long as they did without a major incident if none of those things were in place.
They’d be either incredibly lucky, or incredibly competent, and if they are the latter, they would not operate without such mitigations in place.
It seems far more believable that an organization of the FAA’s age and complexity missed something along the way.
> incredibly surprising that an organization of that complexity could have survived as long as they did without a major incident
I'm not surprised. The FAA does not fly each plane. Government organizational complexity helps ensure the government organization survives through the next round of Congressional appropriations.
Org complexity + opaque oversight + 'safety' + 'homeland security' + taxpayer funding = playing around and more budget.
The pilot is responsible for safety. Air travel has rules to avoid collisions (eastbound traffic gets different altitude levels than westbound, pilots shall broadcast on known frequencies), and pilots have distributed intelligence to keep their flights safe.
Yes, somehow there needs to be coordination of runway use. There are many ways to provide reservations and queuing.
We can make excuses all day long. A simple query of the database/table would have produced an error. Sure, the FAA does some complex stuff, but the tech I see in airplanes looks ancient. I'm willing to bet most of the FAA's complexity comes from budget (lack thereof) and old computer systems.
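For what it's worth, here's a rough sketch of the kind of "simple query" check being described, assuming a relational store. The notams table, its columns, and the row-count threshold are all made up, and sqlite3 is only a convenient stand-in for whatever the real system actually runs on:

    # Rough sketch of a "simple query" style post-deployment check.
    # Table/column names and thresholds are hypothetical; sqlite3 is a stand-in.
    import sqlite3
    import sys

    MIN_EXPECTED_ROWS = 1000  # hypothetical floor for a healthy NOTAM table

    def validate_notam_table(db_path: str) -> list[str]:
        problems = []
        conn = sqlite3.connect(db_path)
        try:
            # 1. Can we read the table at all? Corruption often fails right here.
            row_count = conn.execute("SELECT COUNT(*) FROM notams").fetchone()[0]
            if row_count < MIN_EXPECTED_ROWS:
                problems.append(f"suspiciously low row count: {row_count}")

            # 2. Basic field sanity: no NULL identifiers, no inverted date ranges.
            bad_rows = conn.execute(
                "SELECT COUNT(*) FROM notams "
                "WHERE notam_id IS NULL OR effective_start > effective_end"
            ).fetchone()[0]
            if bad_rows:
                problems.append(f"{bad_rows} rows with null IDs or inverted dates")

            # 3. Engine-level structural check (SQLite-specific; other databases
            #    have their own equivalents).
            status = conn.execute("PRAGMA integrity_check").fetchone()[0]
            if status != "ok":
                problems.append(f"integrity_check failed: {status}")
        finally:
            conn.close()
        return problems

    if __name__ == "__main__":
        issues = validate_notam_table(sys.argv[1])
        if issues:
            print("validation FAILED:", *issues, sep="\n  ")
            sys.exit(1)
        print("validation passed")

Something this small obviously wouldn't catch every failure mode, but it's the sort of cheap check that turns "corrupted file discovered hours later" into "alarm within minutes of the deployment".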
This has nothing to do with excuses - I’m challenging the assertion that “because something bad happened, they must not have any mitigations in place at all”.
This seems like a bad case of binary thinking, and my point was that the occurrence of an incident like this is not sufficient to support that claim. It’s just as likely that an ancient process that wasn’t accounted for somewhere in the architecture broke down, and this is how it manifested.
Clearly improvements are needed, as is always the case after an outage. That doesn’t justify wild speculation.
Anecdote time: I once worked for a large financial institution that makes money when people swipe their credit cards. The system that authorizes purchases is ancient, battle tested, and undergoes minimal change because the cost of an outage could be measured in the millions of $ per minute.
Every change was scrutinized, reviewed by multiple groups, discussed with executives, and tested thoroughly. The same system underwent regular DR testing that required quite a lot of involvement from all related teams.
So the day it went down, it was obviously a big deal, and raised all of the natural questions about how such a thing could occur.
Turns out it had an unknown transitive dependency on an internal server - a server that had not been rebooted in literally a decade. When that server was rebooted (I think it was a security group insisting it needed patches despite some strong reasons to avoid that when considering the architecture), some of the services never came back up, and everyone quickly learned that a very old change that predated almost everyone there had established this unknown dependency.
The point of this story is really about the unknowability of sufficiently complex legacy enterprise systems.
All of the right processes and procedures won’t necessarily account for that seemingly inconsequential RPC call to an internal system implemented by a grizzled dev shortly before his retirement.
And then you find an obscure service doesn’t come back up on the 10,000th or 100,000th reboot because of <any number of reasons>. And now you have multiple states, because you have to handle failover. It’s turtles all the way down.
It’s always easy to say that in hindsight. But keep in mind this is an environment with many core components built in the 80s. Regular reboots on old AIX systems weren’t a common practice - the sheer uptime capability of these systems was a big selling point in an environment that looks nothing like a modern cloud architecture.
But none of that is really the point. The point is that even with every correct procedure in place, you’ll still encounter failures.
Modern dev teams in companies that build software have more checks and balances in place from the get-go that help head off some categories of failure.
But when an organization is built on core tech born of the 80s/90s, there will always be dragons, regardless of the current active policies and procedures.
The problem is that the cost of replacing some of these systems was inestimable.
I don't even know if I'd call this a disaster recovery fail. Depending on what they meant by "corruption", a roughly 6-8 hour turnaround time is not awful for a database restore.
People like to think that the alternative to "move fast and break things" is "move slowly and not break things" but it's not, it's "move slowly, break things anyway, then take days to resolve the problem because you never learned how to move fast".
You act like "moving fast" is all you need to know to "move fast". As if it's simply the skill of making time move faster, and you don't need any other skills than that, because once you've broken the laws of physics and changed the speed of time, everything just works faster without any differences or consequences. Do you watch a lot of Superhero movies?
You should work smarter, not harder. Just turn up your smart knob. But why didn't you ever think of that before? Probably because you had your smart knob turned all the way down.
A frantic pace is often a natural outcome of “moving faster” when the environment one is moving in is not conducive to that speed of movement.
In my experience, this tendency is multiplied the larger and more complex the organization and architecture become.
The entire point of “move fast” in software circles is to leave behind the constraints of legacy tech and management practices in favor of building something “better”.
In a mature org that grew up before these ideas were mainstream, maybe one or two teams can manage to move faster, but invariably they end up depending on other teams, who in turn depend on deeply ingrained and established company culture and procedures.
We can talk about why those impediments are a Bad Thing, and I wouldn’t advise a consumer startup to adopt those methodologies in 2022, but there’s still the harsh reality that where they exist, “just move faster” doesn’t help much more than telling a depressed person to “just do cardio every day”. There’s often a lot of inner work that’s gotta happen to make way for the new.
The only way I’ve seen this sort of thing work in a large org is when a brand new “emerging tech” group is spun up and given autonomy to work outside of the legacy norms. This is not perfect either, and seems much better suited to greenfield projects. When applied to deeply entrenched legacy systems, all of the problems mentioned above come to a head.
This also creates a weird in/out group dynamic which tends to further stratify the old tech and widen the gap between the old practices and the new.
In the context of this particular conversation though, I think the concept of “move fast” has lost all meaning and has little to offer for an org like the FAA.
When I hear "move fast and break things", I am reminded of a guy that I worked with. He delivered work fast......full of bugs...0 planning...and his daily ritual was to just keep patching the "pile of shit" he put together.
He has now moved on and we sometimes chat; he works for a huge corp and still sucks at writing SQL.
And as I said, people think the alternative to "fast, full of bugs, 0 planning" is automatically "slow, no bugs, lots of planning", but it's often "slow, lots of planning, just as many bugs".
When you are in a complex spiderweb, you simply cannot move fast. If you are moving fast, you are not looking at everything and it will blow up in your face.
The worldwide air traffic control system is not simple or easy to understand and iterate on. And that's not because it was designed by idiots, or that you're so much smarter and more experienced and a vastly better programmer than the combined efforts of everyone in the world working on air traffic control, as you seem to be implying from your comfortable armchair.
Are you proposing the entire world simply give up air travel, because government regulations and industry standards and the laws of physics and chaos theory prevent you from having the simple easy to understand air traffic control system you envision?
Change is the most common reason for breaking things. Moving fast means more broken things, hence the slogan. The alternative is indeed move slow, break things less often. It's a bad strategy when you NEED a LOT of change. But if you don't NEED a LOT of change, and you do need a lot of stability, it seems perfectly valid?
So for the things that don't need a lot of change, what are some characteristics of the system?
Does CI/CD exist? Does CI even exist? Are deployments automated? Is data sanity checked before loading? Is there a development environment? Do things like hourly snapshots exist? Can you easily provision a replacement system from scratch and restore data from a known good snapshot?
Or is every process manual, slow, and error-prone because there's never been a need to move fast?
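To make the "is data sanity checked before loading" question concrete, even a small gate like this goes a long way. Purely a sketch: the pipe-delimited format, the trailing checksum line, and the field count are invented for illustration, not how the real NOTAM feed works:

    # Minimal sketch of a sanity check that runs before a data file is loaded.
    # The file format (pipe-delimited records plus a trailing SHA-256 line) is
    # invented purely for illustration.
    import hashlib
    import sys

    EXPECTED_FIELDS = 5  # hypothetical field count per record

    def check_feed(path: str) -> list[str]:
        errors = []
        with open(path, "rb") as f:
            raw = f.read()

        lines = raw.decode("utf-8", errors="replace").splitlines()
        if not lines:
            return ["file is empty"]

        # Last line is assumed to carry a SHA-256 of everything above it.
        *records, checksum_line = lines
        body = "\n".join(records).encode("utf-8")
        if hashlib.sha256(body).hexdigest() != checksum_line.strip().lower():
            errors.append("checksum mismatch: file is truncated or corrupted")

        for lineno, record in enumerate(records, start=1):
            fields = record.split("|")
            if len(fields) != EXPECTED_FIELDS:
                errors.append(
                    f"line {lineno}: expected {EXPECTED_FIELDS} fields, got {len(fields)}"
                )

        return errors

    if __name__ == "__main__":
        problems = check_feed(sys.argv[1])
        if problems:
            print("REJECTING load:", *problems[:20], sep="\n  ")
            sys.exit(1)
        print("feed looks sane, ok to load")

The point isn't the specific checks, it's that a reject-before-load gate exists at all, so a bad file gets bounced instead of propagating into the live system and its backups.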
Look at this one sentence in the article:
> In the overnight hours of Tuesday into Wednesday, FAA officials decided to shut down and reboot the main NOTAM system -- a significant decision, because the reboot can take about 90 minutes, according to the source.
So they do a reboot that takes 90 minutes for some reason, and then that didn't even fix the problem. Their system that needs a lot of stability is now broken.