A Reddit r/sysadmin user who claims to be on the "Recovery Team" for this ongoing issue is providing live updates of the incident:
* This is a global outage for all FB-related services/infra (source: I'm currently on the recovery/investigation team).
* Will try to provide any important/interesting bits as I see them. There is a ton of stuff flying around right now and like 7 separate discussion channels and video calls.
* Update 1440 UTC: \
As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC).
There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified.
Part of this is also due to lower staffing in data centers due to pandemic measures.
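As the update above suggests, the DNS failure was a downstream symptom: once the routes in front of Facebook's authoritative nameservers were withdrawn, outside resolvers could no longer reach them and resolution itself failed. A minimal sketch of observing that symptom from any outside host, using only Python's standard library (the domain list is just illustrative):

```python
import socket

# Zones whose authoritative nameservers sat behind the withdrawn prefixes
# (illustrative list; any FB-operated zone showed the same behaviour).
DOMAINS = ["facebook.com", "instagram.com", "whatsapp.com"]

for name in DOMAINS:
    try:
        addrs = sorted({info[4][0] for info in socket.getaddrinfo(name, 443)})
        print(f"{name}: resolves to {addrs}")
    except socket.gaierror as exc:
        # During the outage this branch fired everywhere: recursive resolvers
        # could not reach any authoritative server for the zone.
        print(f"{name}: resolution failed ({exc})")
```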
I mean, when I last worked in a NOC, we used to call ourselves "NOC monkeys", so yeah. If you're in the NOC, you're a NOC monkey; if you're on the floor, you're a floor monkey. And so on.
We even had a site and operation for a long while called:
"NOC MONKEY .DOT ORG"
We called all of ourselves NOC MONKEYS. [[Remote Hands]]
Yeah, that was a term used widely.
I'm 46. I assume you are < #
---
Where were you in 1997 building out the very first XML implementations to replace EDI from AS400s to FTP EDI file retrievals via some of the first Linux FTP servers based in SV?
They were quoted on multiple news sites including Ars Technica. I would imagine they were not authorized to post that information. I hope they don't lose their job.
Shareholders and other business leaders I'm sure are much happier reporting this as a series of unfortunate technical failures (which I'm sure is part of it) rather than a company-wide organizational failure. The fact they can't physically badge in the people who know the router configuration speaks to an organization that hasn't actually thought through all its failure modes. People aren't going to like that. It's not uncommon to have the datacenter techs with access and the actual software folks restricted, but that being the reason one of the most popular services in the world has been down for nearly 3 hours now will raise a lot of questions.
Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.
Operations teams normally have a special room with a secure connection for situations like this, so that production can be controlled in the event of bgp failure, nuclear war, etc. I could see physical presence being an issue if their bgp router depends on something like a crypto module in a locked cage, in which case there's always helicopters.
So if anything, Facebook's labor policies are about to become cooler.
Yup, it's terrifying how much is ultimately dependent on dongles and trust. I used to work at a company with a billion or so in a bank account (obviously a rather special type of account), which was ultimately authorised by three very trusted people who were given dongles.
Sorry, I should have been clearer - the dongles controlled access to that bank account. It was a bank account for banks to hold funds in. (Not our real capital reserves, but sort of like a current account / checking account for banks.)
I was friends with one of those people, and I remember a major panic one time when 2 out of 3 dongles went missing. I'm not sure if we ever found out whether it was some kind of physical pen test, or an astonishingly well-planned heist which almost succeeded - or else a genuine, wildly improbable accident.
The problem is when your networking core goes down, even if you get in via a backup DSL connection or something to the datacenter, you can't get from your jump host to anything else.
It helps if your DSL line is bridging at layer 2 of the OSI model using rotated PSKs, so it won't be impacted by DNS/BGP/auth/routing failures. That's why you need to put it in a panic room.
That model works great, until you need to ask for permission to go into the office, and the way to get permission is to use internal email and ticketing systems, which are also down.
I'm not sure why shareholders are lumped in here. A lot of reasons companies do the secret squirrel routine is to hide their incompetence from the shareholders.
You don't need to consider 'what if a meteor hit the data centre and also it was made of cocaine'. You do need to think through "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."
In a company the size of Facebook, "everything is turned off" has never happened since before the company was founded 17 years ago. This makes it very hard to be sure you can bring it all back online! Every time you try it, there are going to be additional issues that crop up, and even when you think you've found them all, a new team that you've never heard of before has wedged themselves into the data-center boot-up flow.
The meteor isn't made of cocaine, but four of them hitting at exactly the same time is freakishly improbable. There are other, bigger fish to fry, so we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.
>we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.
I think that suggests that there were not bigger fish to fry :)
I take your point on priorities, but in a company the size of Facebook perhaps a team dedicated to understanding the challenges around 'from scratch' kickstarting of the infrastructure could be funded and made part of the BCP planning - this is a good time to have a binder with, if not perfectly up-to-date data, pretty damned good indications of a process to get things working.
>> we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.
> I think that suggests that there were not bigger fish to fry :)
I can see this problem arising in two ways:
(1) Faulty assumptions about failure probabilities: You might presume that meteors are independent, so simultaneous impacts are exponentially unlikely. But really they are somehow correlated (meteor clusters?), so simultaneous failures suddenly become much more likely.
(2) Growth of failure probabilities with system size: A meteor hit on earth is extremely rare. But in the future there might be datacenters in the whole galaxy, so there's a center being hit every month or so.
In real, active infrastructure there are probably even more pitfalls, because estimating small probabilities is really hard.
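A tiny numeric illustration of pitfall (1), with made-up probabilities: under independence, four simultaneous "meteors" look astronomically unlikely, but a single common cause (say, one bad config push that hits all four) dominates the real risk.

```python
# Toy numbers, purely illustrative: per-day probability of one failure, and
# the probability of a shared trigger that takes out all four at once.
p_single = 1e-4
p_common_cause = 1e-5

# Independent model: all four fail on the same day only by coincidence.
p_independent = p_single ** 4

# Correlated model: add the shared trigger that fails everything together.
p_correlated = p_independent + p_common_cause

print(f"independent estimate: {p_independent:.1e}")  # 1.0e-16
print(f"with common cause:    {p_correlated:.1e}")   # ~1.0e-05, eleven orders larger
```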
> "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."
The electricity people have a name for that: black start (https://en.wikipedia.org/wiki/Black_start). It's something they actively plan for, regularly test, and once in a while, have to use in anger.
It's a process I'm familiar with gaming out. For our infrastructure, we need to discuss and update our plan for this from time to time, from 'getting the generator up and running' through to 'accessing credentials when the secret server is not online' and 'configuring network equipment from scratch'.
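One concrete artifact such a plan tends to include is the bring-up order: from a cold start, services have to come back in dependency order, and circular dependencies (credentials stored in a system that itself needs credentials to start) are exactly what bites during recovery. A minimal sketch with a hypothetical dependency map:

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# Hypothetical dependency map: service -> things that must be up first.
deps = {
    "oob-console":   set(),
    "core-routing":  {"oob-console"},
    "dns":           {"core-routing"},
    "secrets-vault": {"dns", "core-routing"},
    "auth":          {"secrets-vault"},
    "chat":          {"auth", "dns"},
}

try:
    order = list(TopologicalSorter(deps).static_order())
    print("bring-up order:", " -> ".join(order))
except CycleError as exc:
    # A cycle means the plan cannot be executed from a cold start,
    # e.g. the vault needing auth while auth needs the vault.
    print("circular dependency, cold start impossible as written:", exc)
```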
Of course you can't think of every possible scenario, but an incorrect configuration and rollback should be pretty high in any team's risk / disaster-recovery / failure-scenario documentation.
This is true, but it's not an excuse for not preparing for the contingencies you can anticipate. You're still going to be clobbered by an unanticipated contingency sooner or later, but when that happens, you don't want to feel like a complete idiot for failing to anticipate a contingency that was obvious even without the benefit of hindsight.
#1 it's a clear breach of corporate confidentiality policies. I can say that without knowing anything about Facebook's employment contracts. Posting insider information about internal company technical difficulties is going to be against employment guidelines at any Big Co.
In a situation like this that might seem petty and cagey. But zooming out and looking at the bigger picture, it's first and foremost a SECURITY issue. Revealing internal technical and status updates needs to go through high-level management, security, and LEGAL approvals, lest you expose the company to increased security risk by revealing gaps that do not need to be publicized.
(Aside: This is where someone clever might say "Security by obscurity is not a strategy". It's not the ONLY strategy, but it absolutely is PART of an overall security strategy.)
#2 just purely from a prioritization/management perspective, if this was my employee, I would want them spending their time helping resolve the problem, not posting about it on reddit. This one is petty, but if you're close enough to the issue to help, then help. And if you're not, don't spread gossip - see #1.
You're very, very right - and insightful - about the consequences of sharing this information. I agree with you on that. I don't think you're right that firing people is the best approach.
Irrespective of the question of how bad this was, you don't fix things by firing Guy A and hoping that the new hire Guy B will do it better. You fix it by training people. This employee has just undergone some very expensive training, as the old meme goes.
Whoever is responsible for the BGP misconfiguration that caused this should absolutely not be fired, for example.
But training about security, about not revealing confidential information publicly, etc is ubiquitous and frequent at big co's. Of course, everyone daydreams through them and doesn't take it seriously. I think the only way to make people treat it seriously is through enforcement.
I feel you're thinking through this with a "purely logical" standpoint and not a "reality" standpoint. You're thinking worst case scenario for the CYA management, having more sympathy for the executive managers than for the engineer providing insight to the tech public.
It seems like a fundamental difference of "who gives a shit about corporate" from my side. The level of detail provided isn't going to get nationstates anything they didn't already know.
Yeah but what is the tech public going to do with these insights?
It's not actionable, it's not whistleblowing, it's not triggering civic action, or offering a possible timeline for recovery.
It's pure idle chitchatter.
So yeah, I do give a shit about corporate here.
Disclosure: While I'm an engineer too, I'm also high enough in the ladder that at this point I am more corporate than not. So maybe I'm a stooge and don't even realize it.
Facebook, the social media website is used, almost exclusively for 'idle chitchatter', so you may want to avoid working there if your opinion of the user is so low. (Actually, you'll probably fit right in at Facebook.)
It's unclear to me how a 'high enough in the ladder' manager doesn't realize that there are easily a dozen people who know the situation intimately but who can't do anything until a system they depend on is back up. "Get back to work" is... the system is down, what do you want them to do, code with a pencil and paper?
ramenporn violated the corporate communication policy, obviously, but the tone and approach of a good manager toward an IC doing this online isn't to make it about corporate vs. them/the team; it's in fact to encourage them to do more such communication, just internally. (I'm sure there was a ton of internal communication; the point is to note where ramenporn's communicative energy was coming from, and to nurture that rather than destroy it in the process of chiding them for breaking policy.)
> Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.
You're conflating working remotely ("a plane ride away") and working from home.
You're also conflating the people who are responsible for network configuration and for coming up with a plan to fix this, and the people who are responsible for physically interacting with systems. Regardless of WFH, those two sets likely have no overlap at a company the size of Facebook.
> I doubt WFH will be impacted by this - not an insider but seems unlikely that the relevant people were on-site at data centers before COVID
I think the issue is less "were the right people in the data center" and more "we have no way to contact our co-workers once the internal infrastructure goes down". In a non-WFH setup you physically walk to your co-worker's desk and say "hey, fb messenger is down and we should chat, what's your number?". This proves that self-hosting your infra (1) is dangerous and (2) makes you susceptible to super-failures if comms go down during WFH.
Major tech companies (GAFAM+) all self-host and use internal tools, so they're all at risk of this sort of comms breakdown. I know I don't have any co-workers' numbers (except one from WhatsApp, which if I worked at FB wouldn't be useful right now).
It is a matter of preparation. You can make sure there are KVM-over-IP or other OOB technologies available on site to allow direct access from a remote location. In the worst case a technician has to know how to connect the OOB device or press a power button ;)
I'm not disagreeing with you, however clearly (if the reddit posts were legitimate) some portion of their OOB/DR procedure depended on a system that's down. From old coworkers who are at FB, their internal DNS and logins are down. It's possible that the username/password/IP of an OOB KVM device is stored in some database that they can't login to. And the fact FB has been down for nearly 4 hours now suggests it's not as simple as plugging in a KVM.
I was referring to the WFH aspect the parent post mentioned. My point was that the admins could get the same level of access as if they were physically on site, assuming the correct setup.
Can you recommend similar others (or maybe how to find them)? I learned of PushShift because snew, an alternative reddit frontend showing deleted comments, was making fetch requests and I had to whitelist it in uMatrix. Did not know about Camas until today.
It's time to decentralize and open up the Internet again, as it once was (ie. IRC, NNTP and other open protocols) instead of relying on commercial entities (Google, Facebook, Amazon) to control our data and access to it.
I'll throw Discord into that mix, the thing that basically killed IRC, and which is yet again centralized despite pretending that it is not.
What are they afraid of? While they are sharing information that's internal/proprietary to the company, it isn't anything particularly sensitive, and having some transparency into the problem is good for everyone.
Who'd want to work for a company that might take disciplinary action because an SRE posted a reddit comment to basically say "BGP's down lol" - If I was in charge I'd give them a modest EOY bonus for being helpful in their outreach to my users in the wider community.
Seems reasonable that at a company of 60k, with hundreds who specialize in PR, you do not want a random engineer making the choice himself to be the first to talk to the press by giving a PR conference on a random forum.
Honestly, from a PR perspective, I’m not sure it’s so bad. Giving honest updates showing Facebook hard at work is certainly better PR for our kind of crowd than whatever actual Facebook PR is doing.
That one guy's comments seem fine from a PR perspective, apart from it not being his role to communicate for the company.
I still think he should be fired for this kind of communication though. One reason is, imagine Facebook didn't punish breaches of this type. Every other employee is going to be thinking "Cool, I could be in a Wired article" or whatever. All they have to do is give sensitive company information to reporters.
Either you take corporate confidentiality seriously or you don't. Posting details of a crisis in progress on your Reddit account is not taking corporate confidentiality seriously. If the Facebook corporation lightly punishes, scolds, or ignores this person then the corporation isn't taking confidentiality seriously either.
Reporters are going to opportunistically start writing about those comments vs having to wait for a controlled message from a communications team. So the reddit posts might not be "so bad", but they're also early and preempting any narrative they may want to control.
Compare Facebook's official tweet: "We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience."
I don't think Facebook could actually say anything more accurate or more honest. "Everything is dead, we are unable to recover, and we are violently ashamed" would be a more fun statement, but not a more useful one.
There will be plenty of time to blame someone, share technical lessons, fire a few departments, attempt to convince the public it won't happen again, and so on.
I agree completely. The target audience Facebook is concerned about is not techies wanting to know the technical issues. It's the huge advertising firms, governments, power users, etc. who have concerns about the platform or have millions of dollars tied up in it. A bland statement is probably the best here - and even if the one engineer gave accurate, useful info, I don't see how you'd want to encourage an org in which thousands of people feel the need to post about what's going on internally during every crisis.
Well, they could at least be specific about how large the outage is. "Some people" is quite different to absolutely everyone. At least they did not add a "might" in there.
Facebook is well known for having really good PR; if they go after this guy for sharing such basic info, that's yet another example of their great PR teams.
A few random guesses (I am not in any way affiliated with FB); just my 2c:
Sharing status of an active event may complicate recovery, especially if they suspect adversarial actions: such public real-time reports can explain to the red team what the blue team is doing and, especially important, what the blue team is unable to do at the moment.
Potentially exposing the dirty laundry. While a postmortem should be done within the company (and as much as possible is published publicly) after the event, such early blurbs may expose many non-public things, usually unrelated to the issue.
I did not read it as "they can't get them on site" but rather that it takes travel to get them on site, and travel takes time, which they desperately do not want to spend.
> If I was in charge I'd give them a modest EOY bonus for being helpful in their outreach to my users in the wider community.
That seems pretty unlikely at any but the smallest of companies. Most companies unify all external communications through some kind of PR department. In those cases usually employees are expressly prohibited from making any public comments about the company without approval.
Unrelated to the outage, but I hate headlines like this.
Facebook is down ~5% today. That's a huge plunge to be sure, but Zuckerberg hasn't "lost" anything. He owns the same number of shares today as he did yesterday. And in all likelihood, unless something truly catastrophic happens the share price will bounce back fairly quickly. The only reason he even appears to have lost $7 billion is because he owns so much Facebook stock.
Unlikely to be related. FB's losses today already happened before FB went down, and are most likely related to the general negative sentiment in the market today, and the whistleblower documents. It's actually kind of remarkable how little impact the outage had on the stock.
As much as all of the curious techies here would love transparency into the problem, that doesn't actually do any good for Facebook (or anyone else) at the moment. Once everything is back online, making a full RCA available would do actual good for everyone. But I wouldn't hold my breath for that.
Do we even know if someone had the account deleted? I think facebook might have their hands full right now solving the issue rather than looking at social media posts that discusses the issue.
There are a lot of people who work at Facebook, and I'm sure the people responsible for policing external comms do not have the skills or access to fix what's wrong right now.
I remember some huge DDOS attacks like a decade ago, and people were speculating who could be behind it. The three top theories were Russian intelligence, the Mossad, and this guy on 4chan who claimed to have a Botnet doing it.
That was the start of living in the future for me.
This felt like something straight out of a post modern novel during the whole WSB press rodeo, where some user names being used on TV were somewhere between absurd to repulsive.
The problem with tweets on transgender bathrooms is that you can be attacked for them by either side at any point in the future, so the user OverTheCounterIvermectin should have known better.
Curious what the internal "privacy" limitations are. Certainly FB must track which Reddit accounts map to which FB accounts, even if they don't actually display it. It just makes sense.
Well, you want the right people to have access. If you're a small shop or act like one, that's your "top" techs.
If you're a mature larger company, that's the team leads in your networking area on the team that deal with that service area (BGP routing, or routers in general).
Most likely Facebook et al. management never understood this could happen because it's "never been a problem before".
I can't fathom how they didn't plan for this. In any business of size, you have to change configuration remotely on a regular basis, and can easily lock yourself out on a regular basis. Every single system has a local user with a random password that we can hand out for just this kind of circumstance...
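A minimal sketch of the break-glass pattern described above, assuming the generated passwords are printed once and escrowed offline (sealed envelope or safe) rather than stored in any system they are meant to unlock; the hostnames and lengths are made up:

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits
HOSTS = ["edge-router-01", "edge-router-02", "oob-console-01"]  # illustrative

def break_glass_password(length: int = 24) -> str:
    """Generate a random local-account password for offline escrow."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

# Print once, seal in an envelope or safe, and rotate after any use.
for host in HOSTS:
    print(f"{host}: breakglass / {break_glass_password()}")
```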
Organizational complexity grows super-linearly; in general, the number of people a company can hire per unit time is either constant or grows linearly.
Google once had a very quiet big emergency that was, ironically(1), initiated by one of their internal disaster-recovery tests. There's a giant high-security database containing the 'keys to the kingdom', as it were... Passwords, salts, etc. that cannot be represented as one-time pads and therefore are potentially dangerous magic numbers for folks to know. During disaster recovery once, they attempted to confirm that if the system had an outage, it would self-recover.
It did not.
This tripped a very quiet panic at Google because while the company would tick along fine for awhile without access to the master password database, systems would, one by one, fail out if people couldn't get to the passwords that had to be occasionally hand-entered to keep them running. So a cross-continent panic ensued because restarting the database required access to two keycards for NORAD-style simultaneous activation. One was in an executive's wallet who was on vacation, and they had to be flown back to the datacenter to plug it in. The other one was stored in a safe built into the floor of a datacenter, and the combination to that safe was... In the password database. They hired a local safecracker to drill it open, fetched the keycard, double-keyed the initialization machines to reboot the database, and the outside world was none the wiser.
(1) I say "ironically," but the actual point of their self-testing is to cause these kinds of disruptions before chance does. They aren't generally supposed to cause user-facing disruption; sometimes they do. Management frowns on disruption in general, but when it's due to disaster recovery testing, they attach to that frown the grain of salt that "Because this failure-mode existed, it would have occurred eventually if it didn't occur today."
Thanks for telling this story as it was more amusing than my experiences of being locked in a security corridor with a demagnetised access card, looooong ago.
EDIT: I had mis-remembered this part of the story. ;) What was stored in the executive's brain was the combination to a second floor safe in another datacenter that held one of the two necessary activation cards. Whether they were able to pass it to the datacenter over a secure / semi-secure line or flew back to hand-deliver the combination I do not remember.
If you mean "Would the pick-pocket have access to valuable Google data," I think the answer is "No, they still don't have the key in the safe on the other continent."
If you mean "Would the pick-pocket have created a critical outage at Google that would have required intense amounts of labor to recover from," I don't know because I don't know how many layers of redundancy their recovery protocols had for that outage. It's possible Google came within a hair's breadth of "Thaw out the password database from offline storage, rebuild what can be rebuilt by hand, and inform a smaller subset of the company that some passwords are now just gone and they'll have to recover on their own" territory.
Maybe because they were planning for a million other possible things to go wrong, likely with higher probability than this. And busy with each day's pressing matters.
Anyone who has actually worked in the field can tell you that a deploy or config change going wrong, at some point, and wiping out your remote access / ability to deploy over it is incredibly, crazy likely.
That someone will win the lottery is also incredibly likely. That a given person will win the lottery is, on the other hand, vanishingly unlikely. That a given config change will go wrong in a given way is ... eh, you see where I'm going with this
Right, which is why you just roll in protection for all manner of config changes by taking pains to ensure there are always whitelists, local users, etc. with secure(ly stored) credentials available for use if something goes wrong; rather than assuming your config changes will be perfect.
I'm not sure it's possible to speculate in a way which is generic over all possible infrastructures. You'll also hit the inevitable tradeoff of security (which tends towards minimal privilege, aka single points of failure) vs reliability (which favours 'escape hatches' such as you mentioned, which tend to be very dangerous from a security standpoint).
Absolutely, and I'd even call it a rite of passage to lock yourself out in some way, having worked in a couple of DCs for three years. Low-level tooling like iLO/iDRAC can sure help out with those, but is often ignored or too heavily abstracted away.
Exactly! Obviously they have extremely robust testing and error catching on things like code deploys: how many times do you think they deploy new code a day? And at least personally, their error rate is somewhere below 1%.
Clearly something about their networking infrastructure is not as robust.
Most likely they did plan for this. Then, something happened that the failsafe couldn't handle. E.g. if something overwrites /etc/passwd, having a local user won't help. I'm not saying that specific thing happened here -- it's actually vanishingly unlikely -- but your plan can't cover every contingency.
Agreed, it’s also worth mentioning that at the end of every cloud is real physical hardware, and that is decidedly less flexible than cloud, if you locked yourself out of a physical switch or router you have many fewer options.
In risk management cultures where consequences from failures are much, much higher, the saying goes that “failsafe systems fail by failing to be failsafe”. Explicit accounting for scenarios where the failsafe fails is a requirement. Great truths of the 1960s to be relearned, I guess.
My company runs copies of all our internal services in air-gapped data centers for special customers. The operators are just people with security clearance who have some technical skills. They have no special knowledge of our service inner workings. We (the dev team) aren’t allowed to see screenshots or get any data back. So yeah, I have done that sort of troubleshooting many times. It’s very reminiscent of helping your grandma set up her printer over the phone.
For all the hours I spent on the phone spelling grep, ls, cd, pwd, raging that we didn't keep nano instead of fucking vim (and I'm a vim person)... I could have stayed young and been solving real customer problems, not imperium-typing on a fucking keyboard with a 5s delay because a colleague is lost in the middle of nowhere and can't remember what file he just deleted and now the system doesn't start anymore. Your software is fragile, just shite.
Yes, and it works if both parties are able to communicate using precise language. The onus is on the remote SME to exactly articulate steps, and on the local hands to exactly follow instructions and pause for clarifications when necessary.
Sometimes the DR plan isn't so much "I have to have a working key"; I just have to know who gets there first with a working key, and "break glass" might be literal.
Not OP, but many times. Really makes you think hard about log messages after an upset customer has to read them line by line over the phone.
One was particularly painful, as it was a "funny" log message I had added to the code for when something went wrong. Lesson learned was to never add funny / stupid / goofy fail messages in the logs. You will regret it sooner or later.
This is not new; this is everyday life with helping hands, on-duty engineers, and L2-L3 staff telling people with physical access which commands to run, etc. etc. etc.
The places I've seen this at had specific verification codes for this. One had a simple static code per person that the hands-on guys looked up in a physical binder on their desk. Very disaster proof.
The other ones had a system on the internal network in which they looked you up, called back on your company phone and asked for a passphrase the system showed them. Probably more secure but requires those systems to be working.
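A minimal sketch of the first variant described above (a static code per person in a physical binder), assuming the codes are generated once, printed, and then kept only on paper at the operator desk; the roster is made up:

```python
import secrets

ENGINEERS = ["alice", "bob", "carol"]  # illustrative roster

def binder_code() -> str:
    """Short, phone-friendly verification code (grouped digits)."""
    digits = "".join(str(secrets.randbelow(10)) for _ in range(9))
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

# Printed once and filed in the physical binder; the caller must read their
# code back before remote hands will run any command on their behalf.
for person in ENGINEERS:
    print(f"{person}: {binder_code()}")
```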
This is not a real datacenter case but normal social hacking. On the datacenter side you have many more security checks, plus much of the time the helping hands and the engineers are part of the same company, using internal communication tools etc., so they are on the same logical footprint anyhow.
Imagine that guy has this big npm repository locally with all those dodgy libraries with uncontrolled origin, in their /lib/node_modules with root permissions.
For something as distributed as Facebook, do multiple somebodies all have to race down to each individual datacenter and plug their laptops into the routers?
As someone with no experience in this, it sounds like a terrifying situation for the admins...
Interesting that they published stuff about their BGP setup and infrastructure a few months ago - maybe a little tweak to rollbacks is needed.
"... We demonstrate how this design provides us with flexible control over routing and keeps the network reliable. We also describe our in-house BGP software implementation, and its testing and deployment pipelines. These allow us to treat BGP like any other software component, enabling fast incremental updates..."
Surely Facebook don't update routing systems between data centres (IIRC the situation) when they don't have people present to fix things if they go wrong? Or have an out-of-band connection (satellite, or dial-up (?), or some other alternate routing?).
I must be misunderstanding this situation here.
[Aside: I recall updating wi-fi settings on my laptop and first checking I had direct Ethernet connection working ... and that when I didn't have anything important to do (could have done a reinstall with little loss). Is that a reasonable analogy?]
Joking aside, I can see how an IRC network has potential to be used in these situations. Maybe FAMANG should work together to set something like this up. The problem is, a single IRC server is not fail safe, but a network of multiple servers would just see a netsplit, in which case users would switch servers.
Also, I remember back in the IRCnet days using simply telnet to connect to IRCnet just for fun and sending messages, so it's a very easy protocol that can be understood in a global disaster scenario (just the PING replies were annoying in telnet).
I heard the same thing from my old coworker who is at FB currently. All of their internal DNS/logins are broken atm so nobody can reach the IRC server. I bet this will spur some internal changes at FB in terms of how to separate their DR systems in the case of an actual disaster.
Good planning! Now, where does the IRC server live, and is it currently routable from the internet?
While normally I know the advice is "Don't plan for mistakes not to happen, it's impossible, murphy's law, plan for efficient recovery for mistakes"... when it comes to "literally our entire infrastructure is no longer routable from the internet", I'm not sure there's a great alternative to "don't let that happen. ever." And yet, here facebook is.
Also, are the users able to reach the server without DNS (i.e. are the IP address(es) involved static and communicated beforehand) and is the server itself able to function without DNS?
Routing is one thing which you can't do without (then you need to fallback to phone communications), but DNS is something that's quite probable to not work well in a major disaster.
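A minimal sketch of the "static IPs communicated beforehand" idea: the client tries DNS first and then falls back to a pre-distributed address list, so the emergency channel stays reachable even when resolution is down. The hostname is hypothetical and the addresses are RFC 5737 documentation placeholders:

```python
import socket

HOST, PORT = "irc.backup.example", 6697
FALLBACK_IPS = ["192.0.2.10", "198.51.100.10"]  # distributed out of band

def connect_with_fallback(timeout: float = 5.0) -> socket.socket:
    candidates = []
    try:
        candidates = [info[4][0] for info in socket.getaddrinfo(HOST, PORT)]
    except socket.gaierror:
        pass  # DNS is down: exactly the scenario being discussed
    for ip in candidates + FALLBACK_IPS:
        try:
            return socket.create_connection((ip, PORT), timeout=timeout)
        except OSError:
            continue  # try the next address
    raise RuntimeError("no reachable address for the emergency channel")
```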
I would think that their internal network would correctly resolve facebook.com even though they've borked DNS for the external world, or if not they could immediately fix that. So at least they'd be able to talk to each other.
To the communication angle, I've worked at two different BigCo's in my career, and both times there was a fallback system of last resort to use when our primary systems were unavailable.
I haven't worked for a FAANG but it would be unthinkable that FB does not have backup measures in place for communications entirely outside of Facebook.
Hmm well I mean for key people, ops and so on.
Not for every employee.
Only a few people need that type of access, and they should have it ready. If they need to bring in more people, there should be an easy way to do it.
Maybe the internal FB Messenger app has a slide button to switch to the backup network for those in need.
> Maybe the internal FB Messenger app has a slide button to switch to the backup network for those in need.
Having worked for 2 FAANG companies, I can tell you most core services like FB Messenger would be using internal database services, and relying on those would be ineffective in a case like this as they would not work; the engineering cost to design them to support an external database would be a lot more than just paying for, like, 5 different external backup products for your SRE team.
"I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally."
You know how after changing resolution and other video settings you get a popup "do you want to keep these changes?" with a countdown and automatic revert in case you managed to screw up and can't see the output anymore?
Well, I wonder why a router that gets a config update but then doesn't see any external traffic for 4 hours doesn't just revert back to the last known good config...
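That "keep these changes?" countdown does exist for routers: Junos, for example, has `commit confirmed`, which rolls the candidate configuration back automatically unless a confirming commit arrives within a set number of minutes. A minimal generic sketch of the same idea, where the apply/revert commands and the health check are placeholders:

```python
import subprocess
import time

def apply_with_auto_revert(apply_cmd, revert_cmd, health_check, window_s=300):
    """Apply a config change, then revert unless a health check passes in time.

    apply_cmd / revert_cmd are placeholder shell commands; health_check is a
    callable that returns True once external reachability is confirmed.
    """
    subprocess.run(apply_cmd, shell=True, check=True)
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        if health_check():
            print("change confirmed, keeping it")
            return True
        time.sleep(10)
    # No confirmation arrived: assume we cut ourselves off and roll back.
    subprocess.run(revert_cmd, shell=True, check=True)
    print("no confirmation within the window, reverted to last known good config")
    return False

# The health check should probe something *outside* the local network, so a
# change that severs external connectivity can never be mistakenly confirmed.
```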
Our security team complained that we have some services, like monitoring or SSH access to some jump hosts, accessible without a VPN, because VPN should be mandatory to access all internal services. I'm afraid that once we comply we could be in a similar situation to the one Facebook is in now...
Fundamentally, how is a 2nd independent VPN into your network a different attack surface than a single, well-secured ssh jumphost? When you're using them for narrow emergency access to restore the primary VPN, both are just "one thing" listening on the wire, and it's not like ssh isn't a well-understood commodity.
On the other hand if you had to break through wireguard first, and then go through your single well-secured bastion, you'd not only be harder to find, you'd have two layers of protection, and of course you tick the "VPN" box
But if your VPN has a zero-day, that only lets you get to the SSH server. It's two layers of protection: you'd have to have two zero-days to get in instead of one.
You could argue it's overkill, but it's clearly more secure
Only if the VPN means you have a VPN and a jump box. If it's "VPN with direct access to several servers and no jump box" there's still only one layer to compromise.
Still wouldn't help if your configuration change wipes you clear off the Internet like Facebook's apparently has. The only way to have a completely separate backup is to have a way in that doesn't rely on "your network" at all.
These are readily available, OpenGear and others have offered them forever. I can't believe fb doesn't have out of band access to their core networking in some fashion. OOB access to core networking is like insurance, rarely appreciated until the house is on fire.
It's quite possible that they have those, but that the credentials are stored in a tool hosted in that datacenter or that the DNS entries are managed by the DNS servers that are down right now.
You are probably right but if that is the case, it isn't really out of band and needs another look. I use OpenGear devices with cellular to access our core networking to multiple locations and we treat them as basically an entirely independent deployment, as if it is another company. DNS and credentials are stored in alternate systems that can be accessed regardless of the primary systems.
I'm sure the logistics of this become far more complicated as the organization scales but IMHO it is something that shouldn't be overlooked, exactly for outlier events like this. It pays dividends the first time it is really needed. If the accounts of ramenporn are correct, it would be paying very well right now.
Out of band access is a far more complicated version of not hosting your own status page, which they don't seem to get right either.
It shouldn't be too stressful. Well-managed companies blame processes rather than people, and have systems set up to communicate rapidly when large-scale events occur.
It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts.
> It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck.
As someone who formerly did Ops for many many years... this is not accurate. Even in a well organized company there are usually stakeholders at every level on IM calls so that they don't need to play "telephone" for status. For an incident of this size, it wouldn't be unusual to have C-level executives on the call.
While those managers are mostly just quietly listening in on mute if they know what's good (e.g. don't distract the people doing the work to fix your problem), their mere presence can make the entire situation more tense and stressful for the person banging keyboards. If they decide to be chatty or belligerent, it makes everything 100x worse.
I don't envy the SREs at Facebook today. Godspeed fellow Ops homies.
I think it comes down to the comfort level of the worker. I remember when our production environment went down. The CTO was sitting with me just watching and I had no problem with it since he was completely supportive, wasn't trying to hurry me, just wanted to see how the process of fixing it worked. We knew it wasn't any specific person's fault, so no one had to feel the heat from the situation beyond just doing a decent job getting it back up.
"it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts"
Well, you'd be surprised about how one person can bring everything down and/or save the day at Facebook, Cloudflare, Google, Gitlab, etc.
Most people are observers/cheerleaders when there is an incident.
Well, individuals will still stress, if anything due to the feeling of being personally responsible for inflicting damage.
I know someone who accidentally added a rule 'reject access to * for all authenticated users' in some stupid system where the ACL ruleset itself was covered by this *, and this person nearly collapsed when she realized even admins were shut out of the system. It required getting low-level access to the underlying software to reverse engineer its ACLs and hack into the system. Major financial institution. Shit like that leaves people with actual trauma.
As much as I hate FB, I really feel for the net ops guys trying to figure it all out, with the whole world watching (most of it with schadenfreude).
> It shouldn't be too stressful. (...) it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck
Earlier comment mentioned that there is a bottleneck, and that people who are physically able to solve the issue are few and that they need to be informed what to do; being one of these people sounds pretty stressful to me.
"but the people with physical access is separate (...) Part of this is also due to lower staffing in data centers due to pandemic measures", source: https://news.ycombinator.com/item?id=28749244
Most big tech companies automatically start a call for every large scale incident, and adjacent teams are expected to have a representative call in and contribute to identifying/remediating the issue.
None of the people with physical access are individually responsible, and they should have a deep bench of advice and context to draw from.
I'm not an IT Operations guy, but as a dev I always thought it was exciting when the IT guys had the destiny of the firm on their shoulders. It must be exciting.
Most teams that handle incidents have well documented incident plans and playbooks. When something major happens you are mostly executing the plan (which has been designed and tested). There are always gotchas that require additional attention / hands but the general direction is usually clear.
> Well-managed companies blame processes rather than people
I feel like this just obfuscates the fact that individuals are ultimately responsible, and allows subpar employees to continue existing at an organization when their position could be filled by a more qualified employee. (Not talking about this Facebook incident in particular, but as a generalisation: not attributing individual fault allows faulty employees to thrive at the expense of more qualified ones).
> this just obfuscates the fact that individuals are ultimately responsible
in critical systems, you design for failure. if your organizational plan for personnel failure is that no one ever makes a mistake, that's a bad organization that will forever have problems.
This goes by many names, like the Swiss cheese model[0]. It's not that workers get to be irresponsible, but that individuals are responsible only for themselves, and the organization is the one responsible for itself.
This isn't what I'm saying, though. The thought I'm trying to express is that if no individual accountability is done, it allows employees who are not as good at their job (read: sloppy) to continue to exist in positions which could be better occupied by employees who are better at their job (read: more diligent).
The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but less than the other person. This is the issue I'm trying to get at.
If someone is sloppy and not willing to change, he should be shown the door, but not because he caused an outage - because he is sloppy.
People who operate systems under fear tend to do stupid things like covering up innocent actions (deleting logs), keeping information instead of sharing it, etc. Very few can operate complex systems for a long time without making a mistake. An organization where the spirit is "oh, an outage, someone is going to pay for that" will never be attractive to good people, and will have a hard time adapting to changes and adopting new tech.
> The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but less than the other person. This is the issue I'm trying to get at.
If you rely on someone triple-checking, you should improve your processes. You need better automation/rollback/automated testing to catch things. Eventually only intentional failure should be the issue (or you'll discover interesting new patterns that should be protected against)
If there is an incident because an employee was sloppy, the fault lies with the hiring process, the evaluation process for this employee, or with the process that puts four eyes on each implementation. The employee fucked up, and they should be removed if they are not up to standards, but putting the blame on them does not prevent the same thing from happening in the future.
If you think about it, it isn't very useful to find a person who is responsible. Suppose someone causes an outage or harm, through neglect or even bad intentions: either the system is set up in a way that one person couldn't cause the outage, or in time it will go down. To build a truly resilient system, especially at global scale, there should never be an option for a single person to bring down the whole system.
I don't think the comment you're replying to applies to your concern about subpar employees.
We blame processes instead of people because people are fallible. We've spent millennia trying to correct people, and it rarely works to a sufficient level. It's better to create a process that makes it harder for humans to screw up.
Yes, absolutely, people make mistakes. But the thought I was trying to convey is that some people make a lot more mistakes than others, and by not attributing individual fault these people are allowed to thrive at the cost of having less error-prone people in their position. For example, someone who triple-checks every parameter that they input, versus someone who has a habit of just skimming or not checking at all. Yes the triple-checker will make mistakes too, but way less than the person who puts less effort in.
But that has nothing to do with blaming processes vs people.
If the process in place means that someone has to triple check their numbers to make sure they’re correct, then it’s a broken process. Because even that person who triple checks is one time going to be woken up at 2:30am and won’t triple check because they want sleep.
If the process lets you do something, then someone at some point in time, whether accidentally or maliciously, will cause that to happen. You can discipline that person, and they certainly won’t make the same mistake again, but what about their other 10 coworkers? Or the people on the 5 sister teams with similar access who didn’t even know the full details of what happened?
If you blame the process and make improvements to ensure that triple checking isn’t required, then nobody will get into the situation in the first place.
Yeah, I've heard this view a hundred times on Twitter, and I wish it were true.
But sadly, there is no company which doesn't rely, at least at one point or another, on a human being typing an arbitrary command or value into a box.
You're really coming up against P=NP here. If you can build a system which can auto-validate or auto-generate everything, then that system doesn't really need humans to run at all. We just haven't reached that point yet.
Edit: Sorry, I just realised my wording might imply that P does actually equal NP. I have not in fact made that discovery. I meant it loosely to refer to the problem, and to suggest that auto-validating these things is at least not much harder than auto-executing them.
I don’t think anyone ever claimed the process itself is perfect. If it were, we obviously would never have any issues.
To be explicit here, by blaming the process, you are discovering and fixing a known weakness in the process. What someone would need to triple check for now, wouldn’t be an issue once fixed. That isn’t to say that there aren’t any other problems, but it ensures that one issue won’t happen again, regardless of who the operator is.
If you have to triple check that value X is within some range, then that can easily be automated to ensure X can’t be outside of said range. Same for calculations between inputs.
To take the overly simplistic triple check example from before, said inputs that need to be triple checked are likely checked based on some rule set (otherwise the person themselves wouldn’t know if it was correct or not). Generally speaking, those rules can be encoded as part of the process.
What was before potentially “arbitrary input” now becomes an explicit set of inputs with safeguards in place for this case. The process became more robust, but is not infallible.
But if you were to blame people, the process still takes arbitrary input, the person who messed up will probably validate their inputs better but that speaks nothing of anyone else on the team, and two years down the line where nobody remembers the incident, the issue happens again because nothing really has changed.
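A minimal sketch of the range-check automation mentioned a couple of comments up; the parameter names and limits are made up, and the point is only that the check runs every time, for every operator, instead of depending on someone triple-checking at 2:30am:

```python
# Made-up rule set: each tunable gets an allowed range instead of relying
# on an operator to eyeball the value three times.
LIMITS = {
    "bgp_hold_time_s": (3, 240),
    "max_withdrawn_prefixes": (0, 1000),
    "maintenance_window_min": (5, 120),
}

def validate(change: dict) -> list[str]:
    """Return a list of violations; an empty list means the change may proceed."""
    errors = []
    for key, value in change.items():
        if key not in LIMITS:
            errors.append(f"unknown parameter: {key}")
            continue
        lo, hi = LIMITS[key]
        if not (lo <= value <= hi):
            errors.append(f"{key}={value} outside allowed range [{lo}, {hi}]")
    return errors

proposed = {"bgp_hold_time_s": 0, "max_withdrawn_prefixes": 50000}
for problem in validate(proposed) or ["ok"]:
    print(problem)
```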
The issue is that this view always relies on stuff like "make people triple check everything".
- How does that relate to making a config change?
- How do you practically implement a system where someone has to triple check everything they do?
- How do you stop them just clicking 'confirm' three times?
- Why do you assume they will notice on the 2nd or 3rd check, rather than just thinking "well, I know I wrote it correctly, so I'll just click confirm"?
I don't think rules can always be encoded in the process, and I don't see how such rules will always be able to detect all errors, rather than only a subset of very obvious errors.
And that's only dealing with the simplest class of issues. What about a complex distributed systems problem? What about the engineer who doesn't make their system tolerant of Byzantine faults? How is any realistic 'process' going to prevent that?
This entire trope relies on the fundamental axiom that "for any individual action A, there is a process P which can prevent human error". I just don't see how that's true.
(If the statement were something like "good processes can eliminate whole classes of error, and reduce the likelihood of incidents", I'd be with you all the way. It's this Twitter trope of "if you have an incident, it's a priori your company's fault for not having a process to prevent it" which I find to be silly and not even nearly proven.)
The stress for me usually goes away once the incident is fully escalated and there's a team with me working on the issue. I imagine that happened quite quick in this case...
Exactly, the primary focus in situations like this, is to ensure that no one feel like they are alone, even if in the end it is one person who has to type in the right commands.
Always be there, help them double check, help monitor, help make the calls to whomever needs to be informed, help debug. No one should ever be alone during a large incident.
This is a one-off event, not a chronic stress trigger. I find them invigorating personally, as long as everybody concerned understands that this is not good in the long run, and that you are not going to write your best code this way.
Also, equally important to note, there was a massive exposé on Facebook yesterday that is reverberating across social media and news networks, and today, when I tried to make a post including the tag #deletefacebook, my post mysteriously could not be published and the page refreshed, mysteriously wiping my post...
This is possibly the equivalent of a corporate watergate if you ask me... Just my personal opinion as a developer though... Not presented as fact... But hrmmm.
If it's anything like my past employers, they probably have a lot of time. They probably also got in a lot of trouble.
When we'd have situation bridges put in place to work a critical issue, there would usually be 2-3 people who were actively troubleshooting and a bunch of others listening in, there because "they were told to join" but with little-to-nothing to do. In the worst cases, there was management there, also.
Most of the time I was one of the 2 or 3 and generally preferred if the rest of them weren't paying much attention to what was going on. It's very frustrating when you have a large group of people who know little about what's going on injecting their opinions while you're feverishly trying to (safely) resolve a problem.
It was so bad that I once announced[0] to a C-Level and a VP that they needed to exit the bridge immediately, because the discussion had devolved into finger-pointing. All of management was "kicked out". We were close to solving it, but technical staff were second-guessing themselves in the presence of folks with the power to fire them. 30 minutes later we were working again. My boss at the time explained that management created their own bridge and the topic was "what to do about 'me'", which quickly went from "fire me" to "get them all a large Amazon gift card". Despite my undiplomatic handling of the situation, that same C-Level negotiated to get me directly beneath them during a reorganization about six months later, and I stayed in that spot for years with a very good working relationship. One of my early accomplishments was to limit management's participation in situation bridges to once per hour, and only when absolutely necessary, for status updates, assuming they couldn't be gotten any other way (phones always worked, but the other communication options may not have).
[0] This was the 16th hour of a bridge that started at 11:00 PM after a full work day early in my career -- I was a systems person with a title equivalent to 'peon', we were all very raw by then and my "announcement" was, honestly, very rude, which I wasn't proud of. Assertive does not have to be rude, but figuring out the fine line between expressing urgency and telling people off is a skill that has to be learned.
User is providing live updates of the incident here:
https://www.reddit.com/r/sysadmin/comments/q181fv/looks_like...