“Configuration” in this context means the high-level system configuration, and can cover pretty much anything that falls under that umbrella.
(Configuration changes, that's the source of my sarcasm)
Well... except for those companies dumb enough to farm out their intra-company communications to facebook.
A friend of mine's law firm had to resort to - OMG - the phone - yesterday.
I'm excluding the other common type of long outage, the head-desking "failover didn't work, backups are horked, it'll take tens of hours to restore/cold start" kind.
An organization Facebook's size isn't gonna be applying configuration changes to one server at a time over SSH, either. A server configuration can easily affect thousands of machines across the globe if it's deployed to them all.
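To make that concrete, here's a minimal sketch of why one "configuration change" can mean thousands of machines at once. Everything in it is an assumption for illustration: the host names, the fleet size, and the push_config() transport are made up, not Facebook's actual tooling.

```python
# Hypothetical sketch: a single config value fanned out to an entire fleet.
from concurrent.futures import ThreadPoolExecutor

FLEET = [f"web{i:05d}.example.com" for i in range(20_000)]  # assumed fleet size

def push_config(host: str, blob: bytes) -> bool:
    # Stand-in for whatever agent or API actually delivers the config;
    # deliberately not SSH-in-a-loop.
    return True

def deploy_everywhere(blob: bytes) -> int:
    failures = 0
    with ThreadPoolExecutor(max_workers=200) as pool:
        for ok in pool.map(lambda h: push_config(h, blob), FLEET):
            failures += 0 if ok else 1
    return failures

# One call, and a bad value is live on 20,000 machines at once.
deploy_everywhere(b"cache_ttl=0")
```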
I've had to sit around waiting a couple hours for a Percona database cluster to re-sync after a major networking whoops, and it only had a few hundred gigabytes of data.
Something else happened. This was not a configuration issue. Edit: If it was, I'd expect a post-mortem post-haste.
Certain configuration changes are dangerous at a big enough scale simply because you could hit a terrible corner case when the change rolls out to 50% of capacity, and lose all of it so fast that your magic automatic rollback is pointless, because your infrastructure is already burning.
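Here's a minimal sketch of that staged-rollout-with-automatic-rollback idea, under stated assumptions: apply_change(), healthy(), and rollback() are hypothetical stubs, and the whole point is that the rollback step only works if the fleet is still reachable at all.

```python
# Sketch of a staged rollout with an automatic health check and rollback.
import time

STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of capacity touched per stage

def apply_change(fraction: float) -> None:
    pass  # push the change to this slice of the fleet

def rollback(fraction: float) -> None:
    pass  # revert the same slice -- useless if those hosts are already down hard

def healthy() -> bool:
    return True  # stand-in for real fleet health metrics

def staged_rollout() -> bool:
    for fraction in STAGES:
        apply_change(fraction)
        time.sleep(300)              # bake time before widening the blast radius
        if not healthy():
            rollback(fraction)
            return False
    return True
```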
> Something else happened. This was not a configuration issue. Edit: If it was, I'd expect a post-mortem post-haste.
I've worked on automation projects at a large scale, and Facebook uses an unusual and clever method to deploy their software: BitTorrent.
I can only speculate about why FB went down yesterday. But if you understand that it's being deployed via BT, you can see that there's the potential to have a lengthy rollback window.
I.e., this isn't like uninstalling a single RPM; this could have impacted a significant fraction of their fleet of systems across multiple datacenters, and if so, the amount of data they'd need to move to roll back could have been tremendous.
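A rough back-of-envelope sketch of that rollback-window problem, with every figure below being an assumption for illustration rather than a real Facebook number:

```python
ARTIFACT_GB   = 1.5        # assumed size of the previous build to restore
HOSTS         = 200_000    # assumed fleet size
ORIGIN_GBIT_S = 100        # assumed aggregate capacity of a central artifact store

total_gb  = ARTIFACT_GB * HOSTS                      # ~300 TB has to land on disk again
naive_hrs = (total_gb * 8) / ORIGIN_GBIT_S / 3600    # if a single origin served it all

print(f"{total_gb:,.0f} GB to redistribute; roughly {naive_hrs:.1f} h of pure transfer")
# A BitTorrent-style swarm spreads that load across peers, which is exactly why
# it gets used for the forward push -- but a rollback still means re-seeding the
# old artifact, waiting for the swarm to converge, and restarting every host.
```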
My initial comment was admittedly a bit reactive, and more a reaction to the general tone of their explanation than to the likelihood of a legitimate technical cause. This wasn't one service -- every product was down for nearly 24 hours, and their explanation is basically, "uh, yea, it was a...um...configuration issue." The terseness of that explanation, in my opinion, is insulting to the millions of people and businesses that rely on Facebook to get information and operate their businesses.
Facebook went down for most of a day in ~2009 because a new hire mistakenly removed the memcached server config in sitevars.
Facebook went down for several hours in ~2010 because someone configured a cyclic dependency in GateKeeper.
Circa 2011, Facebook's deployment process was very good, but also very very far from infallible.
Nobody actually spends the money to do that. They all wing it at some level or another. They're just winging it at a scale vastly more massive than the hundred or thousand computers most people manage.
Source: I worked at Amazon back when managing 30,000 servers was a lot, and I can extrapolate.
Nothing. I quite like using them to deploy applications. If you package them right and build your deployment system correctly, they're not the worst way to do things.
Here are a few examples:
> This was not a configuration issue.
> I'm mostly guessing
> my own professional experience
Normally, this level of outage would come with a technical post-mortem. Instead, they issued a super vague statement.