


There's still no connectivity to Facebook's DNS servers:

    > traceroute a.ns.facebook.com
      traceroute to a.ns.facebook.com (129.134.30.12), 30 hops max, 60 byte packets
      1  dsldevice.attlocal.net (192.168.1.254)  0.484 ms  0.474 ms  0.422 ms
      2  107-131-124-1.lightspeed.sntcca.sbcglobal.net (107.131.124.1)  1.592 ms  1.657 ms  1.607 ms 
      3  71.148.149.196 (71.148.149.196)  1.676 ms  1.697 ms  1.705 ms
      4  12.242.105.110 (12.242.105.110)  11.446 ms  11.482 ms  11.328 ms
      5  12.122.163.34 (12.122.163.34)  7.641 ms  7.668 ms  11.438 ms
      6  cr83.sj2ca.ip.att.net (12.122.158.9)  4.025 ms  3.368 ms  3.394 ms
      7  * * *
      ...
So they're hours into this outage and still haven't re-established connectivity to their own DNS servers.


"facebook.com" is registered with "registrarsafe.com" as registrar. "registrarsafe.com" is unreachable because it's using Facebook's DNS servers and is probably a unit of Facebook. "registrarsafe.com" itself is registered with "registrarsafe.com".

I'm not sure of all the implications of those circular dependencies, but it probably makes it harder to get things back up if the whole chain goes down. That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.

Anyway, until "a.ns.facebook.com" starts working again, Facebook is dead.


Notes as Facebook comes back up:

"registrarsafe.com" is back up. It is, indeed, Facebook's very own registrar for Facebook's own domains. "RegistrarSEC, LLC and RegistrarSafe, LLC are ICANN-accredited registrars formed in Delaware and are wholly-owned subsidiaries of Facebook, Inc. We are not accepting retail domain name registrations." Their address is Facebook HQ in Menlo Park.

That's what you have to do to really own a domain.


Out of curiosity, I looked up how much it costs to become a registrar. Based on the ICANN site, it is $4,000 USD per yr, plus variable fees and transactions fees ($0.18/yr). Does anyone have experience or insight into running a domain registrar? Curious what it would entail (aside from typical SRE type stuff).


> transactions fees ($0.18/yr)

Wow, I had no idea it was so cheap[1] once you're a registrar. The implication is that anyone who wants to be a domain squatting tycoon should become a registrar. For an annual cost of a few thousand dollars plus $0.18 per domain name registered, you can sit on top of hundreds of thousands of domain names. Locking up one million domain names would cost you only $180,000 a year. Anytime someone searched for an unregistered domain name on your site, you could immediately register it to yourself for $0.18, take it off the market, and offer to sell it to the buyer at a much inflated price. Does ICANN have rules against this? Surely this is being done?

[1] "Transaction-based fees - these fees are assessed on each annual increment of an add, renew or a transfer transaction that has survived a related add or auto-renew grace period. This fee will be billed at USD 0.18 per transaction." as quoted from https://www.icann.org/en/system/files/files/registrar-billin...


> Surely this is being done?

Personally saw this kind of thing as early as 2001.

Never search for free domains on the registrar's site unless you are going to register them immediately. Even whois queries can trigger this kind of thing, although that mostly happens on obscure gTLD/ccTLD registries which have a single registrar for the whole TLD.


I can sadly attest to this behavior as recently as a couple years ago :(

I searched for a domain that I couldn't immediately grab (one of the more expensive kind) using a random free whois site... and when I revisited the domain several weeks later it was gone :'(

Emailed the site's new owner D: but fairly predictably got no reply.

Lesson learned, and thankfully on a domain that wasn't the absolute end of the world.

I now exclusively do all my queries via the WHOIS protocol directly. Welp.


> Surely this is being done?

Probably every major retail registrar was rumored to do this at some point. Add to your calculation that even some heavyweights like GoDaddy (IIRC) tend to run ads on domains that don't have IPs specified.


Network Solutions definitely did it. I searched for a few domains along the lines of "network-solutions-is-a-scam.com", and watched them come up in WHOIS and DNS.


There are also fees you have to pay to the owner of the TLD. For example, .com has an $8.39 fee. In total that would be $8.57 per .com domain.

You are off by a factor of almost 50.
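
For anyone who wants to check the arithmetic, here is a quick sketch using only the two per-domain figures quoted in this thread ($0.18 ICANN transaction fee, $8.39 .com registry fee). The function name is made up for illustration, and the $4,000 accreditation fee and other variable fees are deliberately left out.

```python
from decimal import Decimal

ICANN_FEE = Decimal("0.18")         # per-transaction fee quoted from ICANN in this thread
COM_REGISTRY_FEE = Decimal("8.39")  # .com registry fee quoted in this thread


def annual_cost(domains: int, per_domain: Decimal) -> Decimal:
    """Yearly cost of holding `domains` names at a flat per-domain price."""
    return per_domain * domains


# One million names, first with the ICANN fee alone, then with the registry fee added.
naive = annual_cost(1_000_000, ICANN_FEE)
actual = annual_cost(1_000_000, ICANN_FEE + COM_REGISTRY_FEE)
ratio = (ICANN_FEE + COM_REGISTRY_FEE) / ICANN_FEE

print(f"ICANN fee only:         ${naive:,.2f}")
print(f"with .com registry fee: ${actual:,.2f}")
print(f"ratio:                  {ratio:.1f}x")
```

So the "one million domains" scenario is in the millions of dollars per year for .com, not $180,000.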


I didn't know that, and you're right. For anyone who's interested, I found the following references regarding the $8.39 additional fee for a .com registration:

https://itp.cdn.icann.org/en/files/registry-agreements/com/c...

https://www.icann.org/en/system/files/correspondence/stewart...

https://www.icann.org/en/announcements/details/icann-and-ver...


They have a pretty interesting page on the topic: https://www.icann.org/resources/pages/financials-55-2012-02-...

They want you to have $70k liquid.


And they want you to be someone other than Peter Sunde:

https://torrentfreak.com/icann-refuses-to-accredit-pirate-ba...


This is not completely accurate. The whole reason a registrar with domain abc.com can use ns1.abc.com is that glue records are established at the registry; this allows a bootstrap that keeps you out of a circular dependency. All that said, it’s usually a bad idea. Someone as large as Facebook should have nameservers across zones, i.e. a.ns.fb.com, b.ns.fb.org, c.ns.fb.co, etc.


There is always a step that involves emailing the domain's contact when a domain updates its information with the registrar. In this case, facebook.com and registrarsafe.com are managed by the same NS. You need these NS to query the MX in order to send that update approval by email and unblock the registrar update. Glue records are more for performance than to make that loop. Maybe I'm missing something, but hopefully they won't need to send an email to fix this issue.


I have literally never once received an email to confirm a domain change. Perhaps the only exception is on a transfer to another registrar (though I can't recall that occurring, either).

To be fair, we did have to get an email from EURid recently for a transfer auth code, but that was only because our registrar was not willing to provide one.

In any case, no, they will not need to send an email to fix this issue.


I just changed the email address on all my domains. My inbox got flooded with emails across three different domain vendors. If they didn't do it before, they sure are doing it now.


Yes I meant for transferring to another DNS server. In this case, they can't.


This is not true when you're the registrar (as in this case). In fact, your entire system could be down and you’d still have access to the registry's system to do this update.


FB is running their own registrar. Supposedly they can sidestep the email procedure if it's even there to begin with.


Facebook does operate their own private Registrar, since they operate tens of thousands of domains. Most of these are misspellings and domains from other countries and so forth.

So yes, the registrar that is to blame is themselves.

Source: I know someone within the company that works in this capacity.


> That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.

That’s not how it works. The info of whether a domain name is available is provided by the registry, not by the registrars. It’s usually done via a domain:check EPP command or via a DAS system. It’s very rare for registrar to registrar technical communication to occur.

Although the above is the clean way to do it, it’s common for registrars to just perform a dig on a domain name to check if it’s available because it’s faster and usually correct. In this case, it wasn’t.
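
A rough sketch of why the DNS shortcut misfires during an outage: a lookup failure is not the same thing as "this name is not registered". The enum and function below are hypothetical names, and the mapping is one cautious interpretation, not how any particular registrar's search page works. The response-code strings are the standard DNS ones (NOERROR, NXDOMAIN, SERVFAIL).

```python
from enum import Enum


class Availability(Enum):
    REGISTERED = "registered"
    UNKNOWN = "unknown"
    MAYBE_AVAILABLE = "maybe available"


def classify_dns_result(rcode):
    """Map a DNS response code (or None for a timeout) to a cautious guess."""
    if rcode == "NOERROR":
        # The name resolves, so it is delegated to someone.
        return Availability.REGISTERED
    if rcode == "NXDOMAIN":
        # No such name in DNS, but only the registry can confirm availability.
        return Availability.MAYBE_AVAILABLE
    # SERVFAIL, REFUSED, timeouts: the servers are unreachable or broken,
    # which says nothing about whether the name is registered.
    return Availability.UNKNOWN


for rcode in ("NOERROR", "NXDOMAIN", "SERVFAIL", None):
    print(rcode, "->", classify_dns_result(rcode).value)
```

A checker that treats every non-answer as "available" would list an unreachable but registered domain for sale, which matches what was reported here.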


When the NS hostname is dependent on the domain it serves, "glue records" cover the resolution to the NS IP addresses. So there's no circular dependency type issue
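
To illustrate the glue-record point, here is a toy model of a parent zone's delegation data. Everything in it is a simplification: the dictionaries and function names are made up, and the only real value is the IP shown for a.ns.facebook.com in the traceroute earlier in the thread. It only answers "can a resolver reach some nameserver for this zone starting from the parent?", and it treats nameserver names outside the toy data as unreachable.

```python
# Toy delegation data as a parent zone (.com) might hold it.
DELEGATIONS = {"facebook.com": ["a.ns.facebook.com"]}
GLUE = {"a.ns.facebook.com": "129.134.30.12"}


def zone_of(hostname, delegations):
    """Return the delegated zone that contains hostname, or None."""
    labels = hostname.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in delegations:
            return candidate
    return None


def can_bootstrap(zone, delegations, glue, visiting=frozenset()):
    """True if some nameserver of `zone` is reachable starting from the parent."""
    if zone in visiting:
        return False  # back at a zone we are still trying to reach: a loop
    for ns in delegations.get(zone, []):
        if ns in glue:
            return True  # the parent supplies the IP alongside the NS record
        holder = zone_of(ns, delegations)
        if holder is None:
            continue  # no data for this name in the toy model
        if can_bootstrap(holder, delegations, glue, visiting | {zone}):
            return True
    return False


print(can_bootstrap("facebook.com", DELEGATIONS, GLUE))  # glue present
print(can_bootstrap("facebook.com", DELEGATIONS, {}))    # glue removed
```

With the glue entry, the first call succeeds immediately. Without it, the only nameserver lives inside the zone it serves and the lookup loops back on itself.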


Good catch. Hopefully, they won't need an email sent to fb.com from registrarsafe.com to update an important record to fix this. What a loop.


It's partially there. C and D are still not in the global tables according to routeviews, i.e. 185.89.219.12 is still not being advertised to anyone. My peers to them in Toronto have routes from them, but I'm not sure how far they are supposed to go inside their network (past hop 2 is them).

    % traceroute -q1 -I a.ns.facebook.com
    traceroute to a.ns.facebook.com (129.134.30.12), 64 hops max, 48 byte packets
     1  torix-core1-10G (67.43.129.248)  0.133 ms
     2  facebook-a.ip4.torontointernetxchange.net (206.108.35.2)  1.317 ms
     3  157.240.43.214 (157.240.43.214)  1.209 ms
     4  129.134.50.206 (129.134.50.206)  15.604 ms
     5  129.134.98.134 (129.134.98.134)  21.716 ms
     6  *
     7  *

    % traceroute6 -q1 -I a.ns.facebook.com
    traceroute6 to a.ns.facebook.com (2a03:2880:f0fc:c:face:b00c:0:35) from 2607:f3e0:0:80::290, 64 hops max, 20 byte packets
     1  toronto-torix-6  0.146 ms
     2  facebook-a.ip6.torontointernetxchange.net  17.860 ms
     3  2620:0:1cff:dead:beef::2154  9.237 ms
     4  2620:0:1cff:dead:beef::d7c  16.721 ms
     5  2620:0:1cff:dead:beef::3b4  17.067 ms
     6  *
     7  *
     8  *


Kevin Beaumont:

   »The Facebook outage has another major impact: lots of mobile apps constantly poll Facebook in the background = everybody is being slammed who runs large scale DNS, so knock on impacts elsewhere the long this goes on.«

https://twitter.com/GossiTheDog/status/1445118907187175427


Oh my gosh, their IPv6 address contains "face:b00c"...

> 2a03:2880:f0fc:c:face:b00c:0:35


Besides being fun and quirky, it is actually useful for their sysadmins as well as sysadmins at other orgs.

Well at least it will in 2036, when IPv6 goes mainstream.


How difficult is it to get such a "vanity" address?


You just need to get a large enough block so that you can throw most of it away by adding your own vanity part to the prefix you are given. IPv6 really isn't scarce so you can actually do that.


The face:b00c part is in the Interface ID, so this did not even need a large block (Though I am sure they have one).
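
This is easy to check with Python's standard `ipaddress` module. The sketch assumes the conventional /64 split between network prefix and interface ID; the thread does not say what prefix length Facebook actually uses on that subnet. The `groups` helper is just for display.

```python
import ipaddress

addr = ipaddress.IPv6Address("2a03:2880:f0fc:c:face:b00c:0:35")

# Conventional /64 split (an assumption here): the upper 64 bits identify
# the subnet, the lower 64 bits are the interface ID picked by the operator.
network_prefix = int(addr) >> 64
interface_id = int(addr) & ((1 << 64) - 1)


def groups(value):
    """Format a 64-bit value as four colon-separated 16-bit hex groups."""
    return ":".join(f"{(value >> shift) & 0xFFFF:x}" for shift in (48, 32, 16, 0))


print("prefix:      ", groups(network_prefix))
print("interface ID:", groups(interface_id))
```

The vanity part falls entirely in the lower half, which the address holder is free to assign however they like.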


dead beef sounds about right


My suspicion is that since a lot of internal comms runs through the FB domain and since everyone is still WFH, it's probably a massive issue just to get people talking to each other to solve the problem.


I don’t know how true it is but a few reports claim employees can’t get into the building with their badges.


I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"


> I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"

Every internet-connected physical system needs to have a sensible offline fallback mode. They should have had physical keys, or at least some kind of offline RFID validation (e.g. continue to validate the last N badges that had previously successfully validated).
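
A minimal sketch of the "last N badges" idea, assuming a door controller that can keep a small local cache. The class and method names are hypothetical, and this says nothing about how Facebook's badge system actually works. A real deployment would also need expiry, revocation, and tamper protection.

```python
from collections import OrderedDict


class OfflineBadgeCache:
    """Remembers the last `capacity` badges the online system approved."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._recent = OrderedDict()

    def record_online_success(self, badge_id):
        """Call after the central system approves a badge."""
        self._recent.pop(badge_id, None)  # re-inserting moves it to the newest slot
        self._recent[badge_id] = True
        while len(self._recent) > self.capacity:
            self._recent.popitem(last=False)  # evict the oldest badge

    def allow_offline(self, badge_id):
        """Fallback decision when the central system is unreachable."""
        return badge_id in self._recent


door = OfflineBadgeCache(capacity=2)
for badge in ("alice", "bob", "carol"):
    door.record_online_success(badge)

print(door.allow_offline("alice"))  # evicted, capacity is 2
print(door.allow_offline("carol"))  # still cached
```

The trade-off is that a badge revoked while the door is offline keeps working until connectivity returns, which is why N and the cache lifetime would need to be small.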


Breaking the glass to get in to fix the service is totally a good business move.

A few hundred bucks of glass vs. a billion wiped off the share price if the service is down for a day and all the users go find alternatives.


In case of emergency, break glass...

...the doors are glass right?


Zuck's personal conference room has 3 glass walls, so I’ve been amusing myself imagining him just throwing a chair through one of the walls


That glass is bullet resistant.


Do they (you?) call him that at FB?


Yes, "Zuck".


I'm assuming someone in building security has watched the end of Ex Machina...and applied some learnings, or not.


All doors are glass with the right combination of a halligan bar, an axe and a gasoline powered saw.

And I guess beyond that point, walls are glass. Or you need explosives.


Aaaaaaand it's down!


maybe they're open by default, like old 7-11 stores when they went 24hrs and had no locks on the doors :)


Link to such claims here: https://news.ycombinator.com/item?id=28750894

I have no doubt that the publicly published post-mortem report (if there even is one) will be heavily redacted in comparison to the internal-only version. But I very much want to see said hypothetical report anyway. This kind of infrastructural stuff fascinates me. And I would hope there would be some lessons in said report that even small time operators such as myself would do well to heed.


I think the real take away is that no one has this figured out.

A small company has to keep all of its customers happy (or at least be responsive when issues arise, at a bare minimum).

Massive companies deal in error budgets, where a fraction of a percent can still represent millions of users.



I guess they didn't have an "emergency ingress" plan.


Then they will have to old-school it and try a brick.


I've heard on Blind this is unrelated, more of a Covid restriction issue.


What is Blind? Or shouldn't I ask?


www.teamblind.com

Enjoy.


A copy of Glassdoor


More like a crossover between Glassdoor... and Gab.


first rule of Blind, never talk about Blind


You mean the same problem as when GMail goes down and Googlers can't reach each other?

I guess good decentralized public communication services could solve those issues for everybody.


Googler here - my opinions are my own, not representing the company

At the lowest level, in case of severe outage we resort to IRC, Plain Old Telephone Service and, sometimes, sticky notes taped to windows...


Around here we use Slack for primary communications, Google Hangouts (or Chat or whatever they call it now) as secondary, and we keep an on-call list with phone numbers in our main Git repo, so everyone has it checked out on their laptop, so if the SHTF, we can resort to voice and/or SMS.

I remembered to publish my cell phone's real number on the on-call list rather than just my Google Voice number since if Hangouts is down, Google Voice might be too.


Where are the tapes though? Colo on separate tectonic tape or nah?


?


I think texasbigdata is talking about backup tapes and maybe mistyped tectonic plate

Backup tapes and in production servers are kept at different colocation sites to protect data from fire and other catastrophes of that level

Using colo sites on separate tectonic plates would protect you from catastrophes on a geological cataclysm level


We don't use tapes, everything we have is in the cloud, at a minimum everything is spread over multiple datacenters (AZ's in AWS parlance), important stuff is spread over multiple regions, or depending on the data, multiple cloud providers.

Last time I used tape, we used Ironmountain to haul the tapes 60 miles away which was determined to be far enough for seismic safety, but that was over a decade ago.


Thank you kind sir.


Some people here say their fallback IRC doesn't work due to DNS reliance. :|


One of my employers once forced all the staff to use an internally-developed messenger (for sake of security, but some politics was involved as well), but made an exception for the devops team who used Telegram.


Telegram? Interesting choice!


Devops like Telegram because it has proper bot API, unlike many other competitors.


Oh! It makes sense. While I don't like telegram for some reasons, their API is totally top notch and a real pleasure to work with.


That would completely defeat the purpose... I have a hard time believing that.


Why? Even if it's not DNS reliance, if they self-hosted the server (very likely) then it'll be just as unreachable as everything else within their network at the moment.


The entire purpose of an IRC backup is in case shit hits the fan. That means having it run on a completely separate stack.

What use is it if it runs on the same stack as what you might be trying to fix?


Clearly "our entire network is down, worldwide" wasn't part of their planning. Don't get too cocky with your 20/20 hindsight.


I don't think it's cocky or 20/20 hindsight. Companies I've worked for specifically set up IRC in part because "our entire network is down, worldwide" can happen and you need a way to communicate.


I bet they never tested taking out their own DNS.

IRC does use DNS at least to get hostnames during connection. I'd be surprised if it didn't use it at other points.


I’ve setup hosts files in case DNS was down to access critical systems before. It’s a perfectly reasonable precaution.


My small org, with maybe 50 IPs/hosts we care about, still maintains a hosts file for those nodes' public and internal names. It's in Git, spread around, and we also have our fingers crossed.
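
For what it's worth, generating such a file from a list kept in Git can be a few lines. This is only a sketch: the function name, hostnames, and addresses are placeholders (documentation and private ranges), not anyone's real inventory.

```python
import ipaddress


def render_hosts(entries):
    """Render (ip, names) pairs as hosts-file lines: IP first, then names."""
    lines = []
    for ip, names in entries:
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        lines.append(f"{ip}\t{' '.join(names)}")
    return "\n".join(lines) + "\n"


# Placeholder inventory; a real one would be loaded from the repo.
CRITICAL_HOSTS = [
    ("192.0.2.10", ["git.example.com", "git"]),
    ("10.0.0.5", ["bastion.internal.example.com", "bastion"]),
]

print(render_hosts(CRITICAL_HOSTS), end="")
```

Validating each address before writing the file avoids shipping a typo to every laptop, which would be a bad surprise in the middle of a DNS outage.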


If only IRC would have been built with multi-server setups in mind, that forward messages between servers, and continues to work if a single - or even a set - of servers would go down, just resulting in a netsplit...Oh wait, it was!

My bet is, FB will reach out to others in FAMANG, and an interest group will form maintaining such an emergency infrastructure comm network. Basically a network for network engineers. Because media (and shareholders) will soon ask Microsoft and Google what their plans for such situations are. I'm very glad FB is not in the cloud business...


> If only IRC would have been built with multi-server setups in mind, that forward messages between servers, and continues to work if a single - or even a set - of servers would go down, just resulting in a netsplit...Oh wait, it was!

yeah if only Facebook's production engineering team had hired a team of full time IRCops for their emergency fallback network...


Considering how much IRCops were paid back in the day (mostly zero as they were volunteers) and what a single senior engineer at FB makes, I'm sure you will find 3-4 people spread amongst the world willing to share this 250k+ salary amongst them.


That is called outbound network :)


I worked on the identity system that chat (whatever the current name is) and gmail depend on and we used IRC since if we relied on the system we support we wouldn’t be able to fix it.


Word is that the last time Google had a failure involving a cyclical dependency they had to rip open a safe. It contained the backup password to the system that stored the safe combination.


The safe in question contained a smartcard required to boot an HSM. The safe combination was stored in a secret manager that depended on that HSM.

The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card. These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.

Source: Chapter 1 of "Building Secure and Reliable Systems" (https://sre.google/static/pdf/building_secure_and_reliable_s... size warning: 9 MB)
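
The loop in that story can be written down as a small dependency graph and found mechanically. The sketch below is a plain depth-first search; the node names paraphrase the description above and are not taken from any Google tooling.

```python
def find_cycle(deps, start):
    """Depth-first search; return the nodes of a cycle reachable from start, or None."""
    path = []
    on_path = set()
    done = set()

    def visit(node):
        if node in on_path:
            # Found a back edge: slice out the loop and close it.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for dep in deps.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(node)
        done.add(node)
        return None

    return visit(start)


# "X: [Y]" means "recovering X requires Y".
RECOVERY_DEPS = {
    "password manager": ["HSM smart card"],
    "HSM smart card": ["safe"],
    "safe": ["safe combination"],
    "safe combination": ["password manager"],
}

print(" -> ".join(find_cycle(RECOVERY_DEPS, "password manager")))
```

Running a check like this against a written-down recovery plan is cheap. The hard part is having the dependencies written down in the first place.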


Lovely.

Safes typically have the instructions on how to change the combination glued to the inside of the door, and ending with something like "store the combination securely. Not inside the safe!"

But as they say: make something foolproof and nature will create a better fool.


I'm sure this sort of thing won't be a problem for a company whose founding ethos is 'move fast and break things.' O:-)


Anyone remember the 90s? There was this thing called the Information Superhighway, a kind of decentralised network of networks that was designed to allow robust communications without a single point of failure. I wonder what happened to that...?


Folks are still chatting here... seems to work as designed...


Aren't we still communicating on HN, even though the possibly largest network is down? Can you send email?


We are a dying breed... A few days ago my daughter asked me "will you send me the file on Whatsapp or Discord?". I replied I will send an email. She went "oh, you mean on Gmail?" :-D


Hahaha... I can relate to that. Email is synonymous with Gmail now, something that only dads and uncles use. :-)


Somehow I gotta figure out how to get kiddos interested in networking...


Setting up a Minecraft server has been a good experience for my kiddo to learn more networking.


I am going to guess it’s one of those things the techies want to get round to, but in reality there is never any chance or will to do it.


I can assure you that Google has a procedure in place for that.


I unfortunately cannot edit the parent comment anymore but several people pointed out that I didn't back up my claim or provided any credentials so here they are:

Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.

I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently a part of my job was to ensure that Google is prepared for various disastrous scenarios and that Googlers can coordinate the response independently from Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.

While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?


In future news: Waymo outage results in engineers unable to get to data center. Engineers don't even know where their servers are.


Give us the dirt on how Google does its disaster planning exercises please! Do you do these exercises all at once or slowly over the year?


Interesting that you are asking for the dirt given that DiRT stands for Disaster and Recovery Testing, at least at Google.

Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.

While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty of tests conducted throughout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams are doing periodically.

If you are interested in reading more about Google's approach to disaster planning and preparedness, see the "DiRT, or how to get dirty" section from "Shrinking the time to mitigate production incidents—CRE life lessons" (https://cloud.google.com/blog/products/management-tools/shri...) and "Weathering the Unexpected" (https://queue.acm.org/detail.cfm?id=2371516).


Why not do both? ;)


Yup, they make a new chat app if the previous one is down.


Google Talk, Google Voice, Google Buzz, Google+ Messenger, Hangouts, Spaces, Allo, Hangouts Chat, and Google Messages.

At some point, they must run out of names, right?


You forgot google meet!


And Google Wave.


You forgot the chat boxes inside other apps like Google docs, Gmail, YouTube, etc.


And Google Pay, apparently.


> Yup, they make a new chat app if the previous one is down.

Continuous Deployment.


For those who don't know who he is: l9i would know this. Just clarifying that this is not an Internet nobody guessing.


He is still an anonymous dude to me.


HN Profile -> Personal Website -> LinkedIn -> Over 10 years experience as Google Site Reliability Engineer


Is the LinkedIn profile linking back to the hn account?


Security Engineer asking?


Ha, no. It just occurred to me that any random Hacker News account could link to somebody's personal account and claim authority on some subject.


Google SRE for 10 years, ending as the Principal Site Reliability Engineer (L8).


s/the//

Google has more than 1 L8 SRE.


I don't know who either he or you are, so...


I was clarifying his comment, since he didn't mention that this is not a guess, but inside knowledge.

I was not trying to establish a trust chain.

Take from it what you will.


Why does it matter if he's guessing or not?


Because, it may shock you to know, but sometimes people just go on the Internet and tell lies.

No shit Google has plans in place for outages.

But what are these plans, are they any good... a respected industry figure whose CV includes being at Google for 10 years doesn't need to go into detail describing the IRC fallback to be believed and trusted that there is such a thing.


I've found that when I post things I learned on the job here it actually causes people to tell me I'm wrong or made it up even more often…


It's kind of amusing given that employers are usually pretty easy to deduce based on comments…


That's just an 'appeal to authority'.

No-one knows or cares who made the statement, it may as well have been 'water is wet', it was useless and adds nothing but noise.


I found a comment that was factually incorrect and I felt competent to comment on that. Regrettably, I wrote just one sentence and clicked reply without providing any credentials to back up my claim. Not that I try to hide my identity, as danhak pointed out in https://news.ycombinator.com/item?id=28751644, my full name and URL of my personal website are only a click away.

I have replied to my initial comment to provide some additional context: https://news.ycombinator.com/edit?id=28752431. Hope that helps.


That’s…not what “appeal to authority” means.


I've read here on HN that exactly this was the issue as they had one of the bigger outages (I think it was due to some auth service failure) and GMail didn't accept incoming mail.


A Gmail outage would be barely an inconvenience as Gmail plays a minor role in Google's disaster response.

Disclaimer: Ex-Googler who used to work on disaster response. Opinions are my own.


What do you think all those superfluous chat apps were for?


I think the issue there is that in exchange for solving the "one fat finger = outage" problem, you lose the ability to update the server fleet quickly or consistently.


BGP is decentralised.


LOL - score one against building out all tooling internally (a la Amazon and apparently Facebook too)


The rate at which some Amazon services lately go down because other AWS services went down proves that this is an unsustainable house of cards anyway.


Netflix knows how to build on top of a house of cards.


There's a joke here somewhere about how bad the final season was


Those communications are done over irc at FB for exactly this purpose.


time to start working at your mfing desk again, johnson


They supposedly can't enter facebook office right now. Their cards don't work.


Why would a system like that have to be in their online infrastructure?


For doing LDAP lookups against the corporate directory? Oh wait, LDAP configuration of course depends on DNS and DNS is kaputt...


source?



Sheera Frenkel @sheeraf Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.


"Something went wrong. Try reloading."

It's not loading for me. Could you say what it said?



From the Tweet, "Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors."


> Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

https://nitter.net/sheeraf/status/1445099150316503057


Disclose.tv @disclosetv JUST IN - Facebook employees reportedly can't enter buildings to evaluate the Internet outage because their door access badges weren’t working (NYT)


What do you think will be the impact on WFH and office requirements?


Unlikely, PagerDuty was invented for this kind of thing


Oh, I'm sure everyone knows what's wrong, but how am I supposed to send an email, find a coworker's phone number, get the crisis team on video chat, etc. if all of those connections rely on the Facebook domain existing?


Hence the suggestion for PagerDuty. It handles all this, because responders set their notification methods (phone, SMS, e-mail, and app) in their profiles, so that when in trouble nobody has to ask those questions and just add a person as a responder to the incident.


Yes, but Facebook is not a small company. Could PagerDuty realistically handle the scale of notifications that would be required for Facebook's operations?


PagerDuty does not solve some of the problems you would have at FB's scale, like how do you even know who to contact? And how do they log in once they know there is a problem?


Sure. As long as you plan for disaster.

The place where I worked had failure trees for every critical app and service. The goal for incident management was to triage and have an initial escalation for the right group within 15 minutes. When I left they were like 96% on target overall and 100% for infrastructure.


Even if it can’t, it’s trivial to use it for an important subset, ie is Facebook.com down, is the ns stuff down etc. So there is an argument to be made for still using an outside service as a fallback


Sure, if you're...

- not arrogant
- or complacent
- haven't inadvertently acquired the company
- know your tech peers well enough to have confidence in their identity during an emergency
- do regular drills to simulate everything going wrong at once

Lots of us know what should be happening right now, but think back to the many situations we've all experienced where fallback systems turned into a nightmarish war story, then scale it up by 1000. This is a historic day, I think it's quite likely that the scale of the outage will lead to the breakup of the company because it's the Big One that people have been warning about for years.


I guarantee you that every single person at Facebook who can do anything at all about this, already knows there's an issue. What would them receiving an extra notification help with?


We kind of got off topic, I was arguing that if you were concerned about internal systems being down (including your monitoring/alerting) something like pager duty would be fine as a backup. Even at huge scale that backup doesn’t need to watch everything.

I don’t think it’s particularly relevant to this issue with fb. I suspect they didn’t need a monitoring system to know things were going badly.


Heck of a coincidence I must say...

I can imagine this affects many other sites that use FB for authentication and tracking.

If people pay proper attention to it, this is not just an average run of the mill "site outage", and instead of checking on or worrying about backups of my FB data (Thank goodness I can afford to lose it all), I'm making popcorn...

Hopefully law makers all study up and pay close attention.

What transpires next may prove to be very interesting.


Indeed, what happened shows a good reason not to rely only on social log-in for various sites.


NYT tech reporter Sheera Frenkel gives us this update:

>Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

https://twitter.com/sheeraf/status/1445099150316503057


Got a good chuckle imagining a fuming Zuckerberg not being allowed into his office, thinking the world is falling apart.


Can’t get in to fix error


I just got off a short pre-interview conversation with a manager at Instagram and he had to dial in with POTS. I got the impression that things are very broken internally.


How much of modern POTS is reliant on VOIP? In Australia at least, POTS has been decommissioned entirely, but even where it's still running, I'm wondering where IP takes over?


I am guessing that most POTS is VOIP now, except for the few places with existing copper infrastructure that has not been decommissioned yet.


This person has a POTS line in their current location, and a modem, and the software stack to use it, and Instagram has POTS lines and modems and software that connect to their networks? Wow. How well do Instagram and their internal applications work over 56K?


He called on his mobile phone. As a result it was a voice-only conversation, no video.


They could have dialed in by their own cell phone though


I read that as POTUS at first and paused for a minute


What is POTS?



Plain old telephone system. Aka a phone.


Plain Old Telephone System


Looks like they misconfigured a web interface that they can't reach anymore now that they're off the net.

"anyone have a Cisco console cable lying around?"


The only one they have is serial and the company's one usb-to-serial converter is missing.


The voices, stories, announcements, photos, hopes and sorrows of millions, no, literally billions of people, and the promise that they may one day be seen and heard again now rests in the hands of Dave, the one guy who is closest to a Microcenter, owns his own car and knows how to beat the rush hour traffic and has the good sense to not forget to also buy an RS-232 cable, since those things tend to get finicky.


Great visual!


Yeah, the patch to fix BGP to reach the DNS is sent by email to @facebook.com. Oops, no DNS to resolve the MX records to send the patch to fix the BGP routers.


Seriously? Is that how it works?


No. A network like Facebook's is vast and complicated and managed by higher-level configuration systems, not people emailing patches around.

If this issue is even to do with BGP it's much more likely the root of the problem is somewhere in this configuration system and that fixing it is compounded by some other issues that nobody foresaw. Huge events like this are always a perfect storm of several factors, any one or two of which would be a total noop alone.


The Swiss cheese model of accidents. Occasionally the holes all align.

https://en.wikipedia.org/wiki/Swiss_cheese_model


The fun part of BGP is they apparently make a lot of use of it within their network, not just advertising routes externally.

https://engineering.fb.com/2021/05/13/data-center-engineerin...

(and yes, fb.com resolves)


No, the backbone of the internet is not maintained with patches sent in emails.


You are very wrong about that ;) https://lkml.org/


Clearly you and the person you replied to are talking about very different things.


I think the sub-comment is confusing the linux kernel with BGP.


In a way, the Linux kernel does power the "backbones of the internet".


There are a hell of a lot of non-Linux OSes running on core routers, but yes, in a way. However, BGP isn't via email.


On the other hand, I and my office mate at the time negotiated the setup of a ridiculous number of BGP sessions over email, including sending configs. That was 20 years ago.


luckily not... would be absolutely terrible to have the backbone only on linux


Interoperability and a thriving ecosystem are necessities for resiliency.

Note that resiliency and efficiency are often working against each other.



I don't know. I doubt. It's just funny to think that you need email to fix BGP, but DNS is down because of BGP. You need DNS to send email which needs BGP. It's a kind of chicken and egg problem but at a massive scale this time.
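
For what it's worth, the chicken-and-egg loop described above is the kind of thing a dependency graph check can catch ahead of time. The sketch below is only an illustration: the graph is the joke scenario from this thread, not Facebook's real setup, and `DEPENDS_ON` and `find_cycle` are hypothetical names.

```python
from typing import Dict, List, Optional

# Hypothetical dependency graph: "X: [Y]" means X needs Y to be working.
DEPENDS_ON: Dict[str, List[str]] = {
    "bgp_fix": ["email"],
    "email": ["dns"],
    "dns": ["bgp_routes"],
    "bgp_routes": ["bgp_fix"],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        stack.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so we closed a loop.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

Running `find_cycle(DEPENDS_ON)` on the graph above walks from the fix to email to DNS to the routes and back to the fix, which is exactly the loop being joked about.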


Sheera Frenkel:

    Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
https://twitter.com/sheeraf/status/1445099150316503057


You'd think they'd have worked that into their DR plans for a complete P1 outage of the domain/DNS, but perhaps not, or at least they didn't add removal of BGP announcements to the mix.


Can someone explain why it is also down when trying to access it via Tor using its onion address: http://facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5t...

Or when trying ips directly: https://www.lifewire.com/what-is-the-ip-address-of-facebook-...

I would have expected a DNS issue to not affect either of these.

I can understand the onion site being down if Facebook implemented it the way a third party would (a proxy server accessing facebook.com) instead of actually having it integrated into its infrastructure as a first-class citizen.


You can get through to a web server, but that web server uses DNS records or those routes to hit other services necessary to render the page. So the server you hit will also time out eventually and return a 500.


The issue here is that this outage was a result of all the routes into their data centers being cut off (seemingly from the inside). So knowing that one of the servers in there is at IP address "1.2.3.4" doesn't help, because no-one on the outside even knows how to send a packet to that server anymore.
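
A toy model of why knowing the IP address doesn't help: a router forwards by longest-prefix match against the routes it has learned, and once the covering prefix is withdrawn there is nothing left to match. This is a simplified sketch, not how any real router is implemented. It assumes a table with no default route, and the prefix (from the 192.0.2.0/24 documentation range), the next-hop label and the `RouteTable` class are all made up.

```python
import ipaddress
from typing import Dict, Optional


class RouteTable:
    """Tiny longest-prefix-match table with no default route."""

    def __init__(self) -> None:
        self._routes: Dict[ipaddress.IPv4Network, str] = {}

    def announce(self, prefix: str, next_hop: str) -> None:
        """Learn a route for prefix via next_hop."""
        self._routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix: str) -> None:
        """Forget the route for prefix, if we had one."""
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str) -> Optional[str]:
        """Return the next hop for address, or None if there is no route."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


table = RouteTable()
table.announce("192.0.2.0/24", "peer-A")
before = table.lookup("192.0.2.10")   # a next hop exists
table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.10")    # nowhere to send the packet
```

After the withdrawal the server at that address may be perfectly healthy, but the lookup returns nothing, so packets from outside are dropped long before they get near it.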


routing was down _everywhere_ so tor is getting a better experience than most people by getting a 500 error


DNS is back, looks like systems are still coming online.



Reddit r/Sysadmin user that claims to be on the "Recovery Team" for this ongoing issue:

>As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC). There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified. Part of this is also due to lower staffing in data centers due to pandemic measures.

User is providing live updates of the incident here:

https://www.reddit.com/r/sysadmin/comments/q181fv/looks_like...


He just deleted all his updates.

user:

https://old.reddit.com/user/ramenporn

some messages:

* This is a global outage for all FB-related services/infra (source: I'm currently on the recovery/investigation team).

* Will try to provide any important/interesting bits as I see them. There is a ton of stuff flying around right now and like 7 separate discussion channels and video calls.

* Update 1440 UTC:

    As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC).

    There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified.

    Part of this is also due to lower staffing in data centers due to pandemic measures.


The 1440 UTC update is also archived on the Wayback Machine: https://web.archive.org/web/20211004171424/https://old.reddi...

And archive.today: https://archive.ph/sMgCi


Essentially, they locked themselves out with an uninspired command line at the exact moment the datacenter was being hijacked by ape-people.

Yup, corporate comms won't love these status updates.


Sorry, are you referring to data center technicians as “ape people”?


As a former data center technician, I wouldn't say it's too far off


But we're all ape people.



I mean, when I last worked in a NOC, we used to call ourselves "NOC monkeys", so yeah. If you're in the NOC, you're a NOC monkey; if you're on the floor, you're a floor monkey. And so on.


Same with "SOC monkeys". (Which carries the additional pun of sounding like the "sock monkey" toy.)


Are you fucking kidding me?

We even had a site and operation for a long while called:

"NOC MONKEY .DOT ORG"

We called all of ourselves NOC MONKEYS. [[Remote Hands]]

Yeah, that was a term used widely.

I'm 46. I assume you are < #

---

Where were you in 1997 building out the very first XML implementations to replace EDI from AS400s to FTP EDI file retrievals via some of the first Linux FTP servers based in SV?

I was there? Remember LinuxCare?


Are you ok, Sir?


Weren't able to get their ego-fill on facebook like normally.


And there his account went poof, thanks for archiving.


They were quoted on multiple news sites including Ars Technica. I would imagine they were not authorized to post that information. I hope they don't lose their job.

Shareholders and other business leaders I'm sure are much happier reporting this as a series of unfortunate technical failures (which I'm sure is part of it) rather than a company-wide organizational failure. The fact they can't physically badge in the people who know the router configuration speaks to an organization that hasn't actually thought through all its failure modes. People aren't going to like that. It's not uncommon to have the datacenter techs with access and the actual software folks restricted, but that being the reason one of the most popular services in the world has been down for nearly 3 hours now will raise a lot of questions.

Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.


Facebook should have had a panic room.

Operations teams normally have a special room with a secure connection for situations like this, so that production can be controlled in the event of BGP failure, nuclear war, etc. I could see physical presence being an issue if their BGP router depends on something like a crypto module in a locked cage, in which case there are always helicopters.

So if anything, Facebook's labor policies are about to become cooler.


Yup, it's terrifying how much is ultimately, ultimately dependent on dongles and trust. I used to work at a company with a billion or so in a bank account (obviously a rather special type of account), which was ultimately authorised by three very trusted people who were given dongles.


What did the dongles do?


Sorry, I should have been clearer - the dongles controlled access to that bank account. It was a bank account for banks to hold funds in. (Not our real capital reserves, but sort of like a current account / checking account for banks.)

I was friends with one of those people, and I remember a major panic one time when 2 out of 3 dongles went missing. I'm not sure if we ever found out whether it was some kind of physical pen test, or an astonishingly well-planned heist which almost succeeded - or else a genuine, wildly improbable accident.


I would be absolutely shocked if they didn't.

The problem is when your networking core goes down, even if you get in via a backup DSL connection or something to the datacenter, you can't get from your jump host to anything else.


It helps if your DSL line is bridging at layer 2 in the OSI model using rotated PSKs, so it won't be impacted by DNS/BGP/auth/routing failures. That's why you need to put it in a panic room.


That model works great, until you need to ask for permission to go into the office, and the way to get permission is to use internal email and ticketing systems, which are also down.


Operations teams don't need permission from some apparatchik to enter the office when production goes down. If they can't get in, they drill.


> nuclear war

I think you need some convincing to keep your SREs on-site in case of a nuclear war ;)


Hey, if I can take the kids and there’s food for a decade and a bunker I’m probably in ;)


I'm not sure why shareholders are lumped in here. A lot of reasons companies do the secret squirrel routine is to hide their incompetence from the shareholders.


That is what I meant, although you have lots of executives and chiefs who are also shareholders.


> an organization that hasn't actually thought through all its failure modes

Thinking about any potential things that can happen is impossible


You don't need to consider 'what if a meteor hit the data centre and also it was made of cocaine'. You do need to think through "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."


In a company the size of Facebook, "everything is turned off" has never happened since the company was founded 17 years ago. This makes it very hard to be sure you can bring it all back online! Every time you try it, there are going to be additional issues that crop up, and even when you think you've found them all, a new team that you've never heard of before has wedged themselves into the data-center boot-up flow.

The meteor isn't made of cocaine, but four of them hitting at exactly the same time is freakishly improbable. There are so many other, bigger fish to fry that we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.


>we're going to treat four simultaneous meteors as impossible. Which is great, but then one the day, five of them hit at the same time.

I think that suggests that there were not bigger fish to fry :)

I take your point on priorities, but in a company the size of facebook perhaps a team dedicated to understanding the challenges around 'from scratch' kickstarting of the infrastructure could be funded and part of the BCP planning - this is a good time to have a binder with, if not perfectly up-to-date data, pretty damned good indications of a process to get things working.


>> we're going to treat four simultaneous meteors as impossible. Which is great, but then one the day, five of them hit at the same time.

> I think that suggests that there were not bigger fish to fry :)

I can see this problem arising in two ways:

(1) Faulty assumptions about failure probabilities: You might presume that meteors are independent, so simultaneous impacts are exponentially unlikely. But really they are somehow correlated (meteor clusters?), so simultaneous failures suddenly become much more likely.

(2) Growth of failure probabilities with system size: A meteor hit on earth is extremely rare. But in the future there might be datacenters in the whole galaxy, so there's a center being hit every month or so.

In real, active infrastructure there are probably even more pitfalls, because estimating small probabilities is really hard.
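
A back-of-the-envelope sketch of both effects, with made-up numbers. The three functions are just the textbook formulas; the common-cause model (a shared trigger with probability `q` that takes out every safeguard at once) is one simple assumption about how correlation could enter, not a claim about any real system.

```python
def p_all_fail_independent(p: float, k: int) -> float:
    """Probability that k independent safeguards all fail, each with probability p."""
    return p ** k


def p_all_fail_common_cause(p: float, k: int, q: float) -> float:
    """Same, but with probability q a shared cause takes out all k at once."""
    return q + (1.0 - q) * p ** k


def p_any_site_hit(p: float, n_sites: int) -> float:
    """Probability that at least one of n_sites independent sites is hit."""
    return 1.0 - (1.0 - p) ** n_sites


# (1) Four safeguards that each fail 1% of the time.
independent = p_all_fail_independent(0.01, 4)
correlated = p_all_fail_common_cause(0.01, 4, q=0.001)

# (2) A one-in-a-thousand event per site, across a growing fleet.
one_site = p_any_site_hit(0.001, 1)
many_sites = p_any_site_hit(0.001, 1000)
```

With these illustrative inputs, even a small shared cause dominates the independent estimate by orders of magnitude, and an event that is rare for one site becomes more likely than not across a thousand of them.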


> "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."

The electricity people have a name for that: black start (https://en.wikipedia.org/wiki/Black_start). It's something they actively plan for, regularly test, and once in a while, have to use in anger.


It's a process I'm familiar with gaming out. For our infrastructure, we need to discuss and update our plan for this from time to time, from 'getting the generator up and running' through to 'accessing credentials when the secret server is not online' and 'configuring network equipment from scratch'.
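
One way to keep a plan like that honest is to write the steps down with their prerequisites and let a topological sort produce the bring-up order, which also flags any step that can never start. A minimal sketch using Kahn's algorithm follows. The step names and their dependencies are invented for illustration, loosely based on the examples above.

```python
from collections import deque
from typing import Dict, List

# Hypothetical black-start plan: each step lists the steps it requires first.
REQUIRES: Dict[str, List[str]] = {
    "generator_power": [],
    "offline_credentials": [],
    "network_equipment_config": ["generator_power", "offline_credentials"],
    "secret_server": ["network_equipment_config"],
    "internal_dns": ["network_equipment_config"],
    "application_services": ["secret_server", "internal_dns"],
}


def bring_up_order(requires: Dict[str, List[str]]) -> List[str]:
    """Return the steps in an order where every prerequisite comes first."""
    remaining = {step: set(deps) for step, deps in requires.items()}
    # Steps with no prerequisites can start immediately.
    ready = deque(sorted(s for s, deps in remaining.items() if not deps))
    order: List[str] = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for other, deps in remaining.items():
            if step in deps:
                deps.remove(step)
                if not deps:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("circular or unsatisfiable dependency in the plan")
    return order


plan = bring_up_order(REQUIRES)
```

If someone later makes the offline credentials depend on the secret server, the function raises instead of returning a plan, which is a cheaper way to find that out than during the outage.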


I love that when you had to think of a random improbable event, you thought of a cocaine meteor. But ... hell YES!


Luckily you don't need to do that exhaustively: all you have to do is cover the general failure case. What happens when communications fail?

This is something that most people aren't good at naturally, it tends to come from experience.


Right, but imagining that DNS goes down doesn’t take a science fiction author.


Of course you can’t think of every potential scenario possible, but an incorrect configuration and rollback should be pretty high in any team’s risk/disaster recovery/failure scenario documentation.


This is true, but it's not an excuse for not preparing for the contingencies you can anticipate. You're still going to be clobbered by an unanticipated contingency sooner or later, but when that happens, you don't want to feel like a complete idiot for failing to anticipate a contingency that was obvious even without the benefit of hindsight.


> I hope they don't lose their job.

I hope they do.

#1 it's a clear breach of corporate confidentiality policies. I can say that without knowing anything about Facebook's employment contracts. Posting insider information about internal company technical difficulties is going to be against employment guidelines at any Big Co.

In a situation like this that might seem petty and cagey. But zooming out and looking at the bigger picture, it's first and foremost a SECURITY issue. Revealing internal technical and status updates needs to go through high-level management, security, and LEGAL approvals, lest you expose the company to increased security risk by revealing gaps that do not need to be publicized.

(Aside: This is where someone clever might say "Security by obscurity is not a strategy". It's not the ONLY strategy, but it absolutely is PART of an overall security strategy.)

#2 just purely from a prioritization/management perspective, if this was my employee, I would want them spending their time helping resolve the problem not post about it on reddit. This one is petty, but if you're close enough to the issue to help, then help. And if you're not, don't spread gossip - see #1.


You're very, very right - and insightful - about the consequences of sharing this information. I agree with you on that. I don't think you're right that firing people is the best approach.

Irrespective of the question of how bad this was, you don't fix things by firing Guy A and hoping that the new hire Guy B will do it better. You fix it by training people. This employee has just undergone some very expensive training, as the old meme goes.


I feel this way about mistakes, and fuckups.

Whoever is responsible for the BGP misconfiguration that caused this should absolutely not be fired, for example.

But training about security, about not revealing confidential information publicly, etc is ubiquitous and frequent at big co's. Of course, everyone daydreams through them and doesn't take it seriously. I think the only way to make people treat it seriously is through enforcement.


I feel you're thinking through this with a "purely logical" standpoint and not a "reality" standpoint. You're thinking worst case scenario for the CYA management, having more sympathy for the executive managers than for the engineer providing insight to the tech public.

It seems like a fundamental difference of "who gives a shit about corporate" from my side. The level of detail provided isn't going to get nationstates anything they didn't already know.


Yeah but what is the tech public going to do with these insights?

It's not actionable, it's not whistleblowing, it's not triggering civic action, or offering a possible timeline for recovery.

It's pure idle chitchatter.

So yeah, I do give a shit about corporate here.

Disclosure: While I'm an engineer too, I'm also high enough in the ladder that at this point I am more corporate than not. So maybe I'm a stooge and don't even realize it.


Facebook, the social media website, is used almost exclusively for 'idle chitchatter', so you may want to avoid working there if your opinion of the user is so low. (Actually, you'll probably fit right in at Facebook.)

It's unclear to me how a 'high enough in the ladder' manager doesn't realize that there are easily a dozen people who know the situation intimately but who can't do anything until a system they depend on is up. "Get back to work" is... the system is down, what do you want them to do, code with a pencil and paper?

ramenporn violated the corporate communication policy, obviously, but the right tone and approach for a good manager toward an IC who was doing this online isn't to make it about corporate vs. them/the team; it is, in fact, to encourage them to do more such communication, just internally. (I'm sure there was a ton of internal communication; the point is to note where ramenporn's communicative energy was coming from, and nurture that, and not destroy it in the process of chiding them for breaking policy.)


> Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.

You're conflating working remotely ("a plane ride away") and working from home.

You're also conflating the people who are responsible for network configuration, and for coming up with a plan to fix this, with the people who are responsible for physically interacting with systems. Regardless of WFH, those two sets likely have no overlap at a company the size of Facebook.


There could be something in the contract that requires all community interaction to go via PR official channels.

It's innocuous enough, but leaking info, no matter what, will be a problem if it's stated in their contract.


100%! comms will want to proof any statement made by anybody along with legal to ensure that there is no D&O liability for sec fraud.


> an organization that hasn't actually thought through all its failure modes

Move Fast and Break Things!


I came here to move fast and break things, and I'm all out of move fast.


In their defense they really lived up to their mission statement today.


I doubt WFH will be impacted by this - not an insider but seems unlikely that the relevant people were on-site at data centers before COVID


> I doubt WFH will be impacted by this - not an insider but seems unlikely that the relevant people were on-site at data centers before COVID

I think the issue is less "were the right people in the data center" and more "we have no way to contact our co-workers once the internal infrastructure goes down". In non-wfh you physically walk to your co-workers desk and say "hey, fb messenger is down and we should chat, what's your number?". This proves that self-hosting your infra (1) is dangerous and (2) makes you susceptible to super-failures if comms goes down during WFH.

Major tech companies (GAFAM+) all self-host and use internal tools, so they're all at risk of this sort of comms breakdown. I know I don't have any co-worker's number (except one from WhatsApp, which wouldn't be useful now if I worked at FB).


Apple is all on Slack.


But is it a publicly hosted slack, or does apple host it themselves?


I don't think it is possible to self-host Slack.


Amazon has a privately managed instance.


Most of the stuff was probably implemented before COVID anyways.

They will fix the issue and add more redundant communication channels, which is either an improvement or a non-event for WFH.

And Zuck is slowly moving (dogfooding) company culture to remote too with their Quest work app experiments


They must have been moving very fast!


shoestring budget on a billion dollar product. you get what you deserve.


> I hope they don't lose their job.

FB has such poor integrity, I'd not be surprised if they take such extreme measures.


It is a matter of preparation. You can make sure there are KVMoIPs or other OOB technologies available on site to allow direct access from a remote location. In the worst case, a technician has to know how to connect the OOB device or press a power button ;)


I'm not disagreeing with you, however clearly (if the reddit posts were legitimate) some portion of their OOB/DR procedure depended on a system that's down. From old coworkers who are at FB, their internal DNS and logins are down. It's possible that the username/password/IP of an OOB KVM device is stored in some database that they can't login to. And the fact FB has been down for nearly 4 hours now suggests it's not as simple as plugging in a KVM.


I was referring to the WFH aspect the parent post mentioned. My point was that the admins could get the same level of access as if they were physically on site, assuming the correct setup.


Pushshift maintains archives of Reddit. You can use camas reddit search to view them.

Comments by u/ramenporn: https://camas.github.io/reddit-search/#{%22author%22:%22rame...


PushShift is one of the most amazing resources out there for social media data, and more people should know about it.


Can you recommend similar others (or maybe how to find them)? I learned of PushShift because snew, an alternative reddit frontend showing deleted comments, was making fetch requests and I had to whitelist it in uMatrix. Did not know about Camas until today.


If it was actually someone in Facebook, their job is gone by now, too.


It's time to decentralize and open up the Internet again, as it once was (ie. IRC, NNTP and other open protocols) instead of relying on commercial entities (Google, Facebook, Amazon) to control our data and access to it.


I'll throw in Discord into that mix, the thing that basically mostly killed IRC. Which is yet again centralized despite pretending that it is not centralized.


The account has been deleted as well.


What are they afraid of? While they are sharing information that's internal/proprietary to the company, it isn't anything particularly sensitive and having some transparency into the problem is good for everyone.

Who'd want to work for a company that might take disciplinary action because an SRE posted a reddit comment to basically say "BGP's down lol" - If I was in charge I'd give them a modest EOY bonus for being helpful in their outreach to my users in the wider community.


Seems reasonable that at a company of 60k, with hundreds who specialize in PR, you do not want a random engineer making the choice himself to be the first to talk to the press by giving a PR conference on a random forum.


Honestly, from a PR perspective, I’m not sure it’s so bad. Giving honest updates showing Facebook hard at work is certainly better PR for our kind of crowd than whatever actual Facebook PR is doing.


That one guy's comments seem fine from a PR perspective apart from it not being his role to communicate for the company.

I still think he should be fired for this kind of communication though. One reason is, imagine Facebook didn't punish breaches of this type. Every other employee is going to be thinking "Cool, I could be in a Wired article" or whatever. All they have to do is give sensitive company information to reporters.

Either you take corporate confidentiality seriously or you don't. Posting details of a crisis in progress on your Reddit account is not taking corporate confidentiality seriously. If the Facebook corporation lightly punishes, scolds, or ignores this person then the corporation isn't taking confidentiality seriously either.


I agree, but try to explain that to PR people...


It's terrible PR for the FB PR team's performance.


Reporters are going to opportunistically start writing about those comments vs having to wait for a controlled message from a communications team. So the reddit posts might not be "so bad", but they're also early and preempting any narrative they may want to control.


You falsely assume Hacker News is even remotely what Facebook PR gives a shit about.


That was their best PR in years


Compare Facebook's official tweet: "We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience."

That's the PR team, clueless.


I don't think Facebook could actually say anything more accurate or more honest. "Everything is dead, we are unable to recover, and we are violently ashamed" would be a more fun statement, but not a more useful one.

There will be plenty of time to blame someone, share technical lessons, fire a few departments, attempt to convince the public it won't happen again, and so on.


I agree completely. The target audience Facebook is concerned about is not techies wanting to know the technical issues. It's the huge advertising firms, governments, power users, etc. who have concerns about the platform or have millions of dollars tied up in it. A bland statement is probably the best here, and even if the one engineer gave accurate, useful info, I don't see how you'd want to encourage an org in which thousands of people feel the need to post about what's going on internally during every crisis.


Well, they could at least be specific about how large the outage is. "Some people" is quite different to absolutely everyone. At least they did not add a "might" in there.


Facebook has never been open and honest about anything, no reason to think they would start now.


To be fair, Facebook has never been open and honest about anything.


Facebook is well known for having really good PR, if they go after this guy for sharing such basic info that's yet another example of their great PR teams.


These few sentences were a better and more meaningful read than what hundreds of PR people could ever come up with


A few random guesses (I am not in any way affiliated with FB); just my 2c:

Sharing status of an active event may complicate recovery, especially if they suspect adversarial actions: such public real-time reports can explain to the red team what the blue team is doing and, especially important, what the blue team is unable to do at the moment.

Potentially exposing the dirty laundry. While a postmortem should be done within the company (and as much as possible is published publicly) after the event, such early blurbs may expose many non-public things, usually unrelated to the issue.


Mentioned in another reply

Shareholders and other business leaders I'm sure are much happier reporting this as a series of unfortunate technical failures (which I'm sure is part of it) rather than a company-wide organizational failure. The fact they can't physically badge in the people who know the router configuration speaks to an organization that hasn't actually thought through all its failure modes. People aren't going to like that. It's not uncommon to have the datacenter techs with access and the actual software folks restricted, but that being the reason one of the most popular services in the world has been down for nearly 3 hours now will raise a lot of questions.


I did not read it as they can't get them on site, but rather that it takes travel to get them on site. Travel takes time, which they desperately don't want to spend.


> If I was in charge I'd give them a modest EOY bonus for being helpful in their outreach to my users in the wider community.

That seems pretty unlikely at any but the smallest of companies. Most companies unify all external communications through some kind of PR department. In those cases usually employees are expressly prohibited from making any public comments about the company without approval.


> What are they afraid of?

Zuckerberg Loses $7 Billion in Hours as Facebook Plunges

https://finance.yahoo.com/news/zuckerberg-loses-7-billion-ho...

Stop the hemorrhaging. Too much bad press for FB lately and it all adds up.


Unrelated to the outage, but I hate headlines like this.

Facebook is down ~5% today. That's a huge plunge to be sure, but Zuckerberg hasn't "lost" anything. He owns the same number of shares today as he did yesterday. And in all likelihood, unless something truly catastrophic happens the share price will bounce back fairly quickly. The only reason he even appears to have lost $7 billion is because he owns so much Facebook stock.

These types of alarmist headlines are inane.


Unlikely to be related. FB's losses today already happened before FB went down, and are most likely related to the general negative sentiment in the market today, and the whistleblower documents. It's actually kind of remarkable how little impact the outage had on the stock.


There was no permanent damage to Facebook as a result of the outage so it's understandable that the stock price wasn't really affected by it


I was thinking the same...


As much as all of the curious techies here would love transparency into the problem, that doesn't actually do any good for Facebook (or anyone else) at the moment. Once everything is back online, making a full RCA available would do actual good for everyone. But I wouldn't hold my breath for that.


FB takes confidentiality very seriously. He crossed a major red line.


They got told explicitly that they shouldn't be sharing updates from the outage meeting, in the outage meeting.


Do we even know if someone had the account deleted? I think Facebook might have their hands full right now solving the issue rather than looking at social media posts that discuss the issue.


There are a lot of people who work at Facebook, and I'm sure the people responsible for policing external comms do not have the skills or access to fix what's wrong right now.


Assuming that Facebook forced the account to be deleted, it wouldn't have been done by anyone who's working on fixing the problem.


> the people with physical access is separate from the people with knowledge of [...]

Welcome to the brave new world of troubleshooting. This will seriously bite us one day.


I like how FB decided to send "ramenporn" as their spokesperson.


A particular facet I love of the internet era is journalists reporting serious events while having to use the completely absurd usernames...

"A Facebook engineer in the response team, ramenporn..."


I remember some huge DDOS attacks like a decade ago, and people were speculating who could be behind it. The three top theories were Russian intelligence, the Mossad, and this guy on 4chan who claimed to have a Botnet doing it.

That was the start of living in the future for me.


4chan is disturbingly resourceful at times. I have heard them described as weaponized autism.


Ya, on hn it's merely productized.


That's a pretty accurate description of the site, lol.

On a side-note, I think you'll enjoy some of the videos by the YouTube 'Internet Historian' on 4chan:

* https://www.youtube.com/watch?v=SvjwXhCNZcU

* https://www.youtube.com/watch?v=HiTqIyx6tBU


My favorite example of this is when I saw references to "Goatse Security" on the front page of the Wall Street Journal


This felt like something straight out of a post modern novel during the whole WSB press rodeo, where some user names being used on TV were somewhere between absurd and repulsive.

Loved it.


I believe that's the exact reason behind the pattern of horrifying usernames on reddit and imgur. It's magnificent in its surrealness.


Exactly, I've been getting déjà vu from Vernor Vinge's Rainbows End constantly lately.


>journalists reporting serious events

A facet I don't love is journalism devolving to reposting unverified, anonymous reddit posts.


"Discussed in Hacker News, the user that goes by the 'huevosabio' handle, stated as a fact that..."


‘He was then subsequently attacked by “OverTheCounterIvermectin” for his tweets on transgender bathrooms from several months ago’.


The problem with tweets on transgender bathrooms is that you can be attacked for them by either side at any point in the future, so the user OverTheCounterIvermectin should have known better.


I got quoted as noir_lord in the press.

My bbs handle from 30 years ago.


Immortality.


I'm worried about that person. I doubt Facebook will look kindly on breaking incident news being shared on reddit.


Apparently Facebook HQ didn't like how ramenporn handled the situation. His account has been deleted, as well as all his messages about the incident.


his account is active, only the incident comments were deleted


> [Reddit logo] u/ramenporn: deleted

> This user has deleted their account.


At least that department at Facebook is still working!


There never was a ramenporn.


That Ramenporn got engagement by Hate Speech


They work at facebook. Can’t imagine they have any illusions regarding their privacy/anonymity.


Curious what the internal "privacy" limitations are. Certainly FB must track reddit users : fb account even if they don't actually display it. It just makes sense.


Thanks to the GDPR at least that's easy to verify for European users.


That said, it will be interesting to read their post-mortem next year and compare it with what ramenporn wrote.


lol no one cares. we're all laughing about this too (all of us except the networks people at least...)


I hope you won't have to delete your account too :)


Well, seems like FB shut down his post...


This is why so many teams fight back against the audit findings:

"The information systems office did not enforce logical access to the system in accordance with role-based access policies."

Invariably, you want your best people to have full access to all systems.


Well, you want the right people to have access. If you're a small shop or act like one, that's your "top" techs.

If you're a mature larger company, that's the team leads in your networking area on the team that deals with that service area (BGP routing, or routers in general).

Most likely Facebook et al. management never understood this could happen because it's "never been a problem before".


I can't fathom how they didn't plan for this. In any business of size, you have to change configuration remotely on a regular basis, and can easily lock yourself out on a regular basis. Every single system has a local user with a random password that we can hand out for just this kind of circumstance...


Organizational complexity grows super-linearly; in general, the number of people a company can hire per unit time is either constant or grows linearly.

Google once had a very quiet big emergency that was, ironically(1), initiated by one of their internal disaster-recovery tests. There's a giant high-security database containing the 'keys to the kingdom', as it were... Passwords, salts, etc. that cannot be represented as one-time pads and therefore are potentially dangerous magic numbers for folks to know. During disaster recovery once, they attempted to confirm that if the system had an outage, it would self-recover.

It did not.

This tripped a very quiet panic at Google because while the company would tick along fine for a while without access to the master password database, systems would, one by one, fail out if people couldn't get to the passwords that had to be occasionally hand-entered to keep them running. So a cross-continent panic ensued because restarting the database required access to two keycards for NORAD-style simultaneous activation. One was in the wallet of an executive who was on vacation, and they had to be flown back to the datacenter to plug it in. The other one was stored in a safe built into the floor of a datacenter, and the combination to that safe was... in the password database. They hired a local safecracker to drill it open, fetched the keycard, double-keyed the initialization machines to reboot the database, and the outside world was none the wiser.

(1) I say "ironically," but the actual point of their self-testing is to cause these kinds of disruptions before chance does. They aren't generally supposed to cause user-facing disruption; sometimes they do. Management frowns on disruption in general, but when it's due to disaster recovery testing, they attach to that frown the grain of salt that "Because this failure-mode existed, it would have occurred eventually if it didn't occur today."


That's not quite how it happened. ;)

<shameless plug> We used this story as the opening of "Building Secure and Reliable Systems" (chapter 1). You can check it out for free at https://sre.google/static/pdf/building_secure_and_reliable_s... (size warning: 9 MB). </shameless plug>


Thanks for telling this story as it was more amusing than my experiences of being locked in a security corridor with a demagnetised access card, looooong ago.


what if the executive had been pick-pocketed


EDIT: I had mis-remembered this part of the story. ;) What was stored in the executive's brain was the combination to a second floor safe in another datacenter that held one of the two necessary activation cards. Whether they were able to pass it to the datacenter over a secure / semi-secure line or flew back to hand-deliver the combination I do not remember.

If you mean "Would the pick-pocket have access to valuable Google data," I think the answer is "No, they still don't have the key in the safe on the other continent."

If you mean "Would the pick-pocket have created a critical outage at Google that would have required intense amounts of labor to recover from," I don't know because I don't know how many layers of redundancy their recovery protocols had for that outage. It's possible Google came within a hair's breadth of "Thaw out the password database from offline storage, rebuild what can be rebuilt by hand, and inform a smaller subset of the company that some passwords are now just gone and they'll have to recover on their own" territory.


> I can't fathom how they didn't plan for this

Maybe because they were planning for a million other possible things to go wrong, likely with higher probability than this. And busy with each day's pressing matters.


Anyone who has actually worked in the field can tell you that a deploy or config change going wrong, at some point, and wiping out your remote access / ability to deploy over it is incredibly, crazy likely.


That someone will win the lottery is also incredibly likely. That a given person will win the lottery is, on the other hand, vanishingly unlikely. That a given config change will go wrong in a given way is ... eh, you see where I'm going with this
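
To put rough numbers on the lottery point: if each change independently has a small chance p of cutting off remote access, the chance that at least one of n changes does so is 1 - (1 - p)^n. A minimal sketch, where the 0.1% per-change figure and the change counts are made-up assumptions and not Facebook data:

```python
def p_any_failure(p_single: float, n_changes: int) -> float:
    """Probability that at least one of n independent changes fails."""
    return 1.0 - (1.0 - p_single) ** n_changes


# Hypothetical numbers, chosen only for illustration.
P_SINGLE = 0.001  # assumed chance that one given change locks you out

for n in (1, 100, 1000, 5000):
    chance = p_any_failure(P_SINGLE, n)
    print(f"{n:>5} changes -> {chance:.1%} chance of at least one lockout")
```

Both sides of the argument show up here: any single change is very likely fine, but over enough changes a lockout of some kind becomes the expected outcome.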


Right, which is why you just roll in protection for all manner of config changes by taking pains to ensure there are always whitelists, local users, etc. with secure(ly stored) credentials available for use if something goes wrong; rather than assuming your config changes will be perfect.


I'm not sure it's possible to speculate in a way which is generic over all possible infrastructures. You'll also hit the inevitable tradeoff of security (which tends towards minimal privilege, aka single points of failure) vs reliability (which favours 'escape hatches' such as you mentioned, which tend to be very dangerous from a security standpoint).


Absolutely, and I'd even call it a rite of passage to lock yourself out in some way, having worked in a couple of DCs for three years. Low-level tooling like iLO/iDRAC can sure help out with those, but is often ignored or too heavily abstracted away.


A config change gone bad?

That’s like failure scenarios 101. That should be the second on the list, after “code change gone bad”.


Exactly! Obviously they have extremely robust testing and error catching on things like code deploys: how many times do you think they deploy new code a day? And at least from what I've personally seen, their error rate is somewhere below 1%.

Clearly something about their networking infrastructure is not as robust.


Right? Especially on global scale. Something doesn't add up!


Curious/unfortunate timing. The day after a whistleblower docu and with a long list of other legal challenges and issues incoming.


Haha sure. They were too busy implementing php compilers to figure out that "whole DR DNS thing"

rotflmao. I'd remove Facebook from my resume.


Most likely they did plan for this. Then, something happened that the failsafe couldn't handle. E.g. if something overwrites /etc/passwd, having a local user won't help. I'm not saying that specific thing happened here -- it's actually vanishingly unlikely -- but your plan can't cover every contingency.


Agreed. It’s also worth mentioning that at the end of every cloud is real physical hardware, and that is decidedly less flexible than cloud. If you lock yourself out of a physical switch or router, you have many fewer options.


In risk management cultures where consequences from failures are much, much higher, the saying goes that “failsafe systems fail by failing to be failsafe”. Explicit accounting for scenarios where the failsafe fails is a requirement. Great truths of the 1960s to be relearned, I guess.


Another Monday morning at a boring datacenter job. I bet they weren't even there yet at 8:30 when the phones started ringing.


You mean the VOIP phones that could no longer receive incoming calls?


Assuming anyone can actually look up the phone numbers to call.


There should be 24/7 on-site rotations. I wonder if physical presence was cut on account of COVID?


phones? how lame.


It certainly wasn't the Messenger.


Phones - the old, analogue, direct cable ones - were self-sustaining, and kept running even when there was a power cut in the house.


yes, indeed. Reliability. That's so 20th century. #lame.

(Actually not lame at all in my eyes)


This sounds like something that might have been done with security in mind. Although generally speaking, remote hands don't have to be elite hackors.


Have you ever tried to remotely troubleshoot THROUGH another person?!


My company runs copies of all our internal services in air-gapped data centers for special customers. The operators are just people with security clearance who have some technical skills. They have no special knowledge of our service inner workings. We (the dev team) aren’t allowed to see screenshots or get any data back. So yeah, I have done that sort of troubleshooting many times. It’s very reminiscent of helping your grandma set up her printer over the phone.


And this is why we should build our critical systems in a way that can be debugged on the phone... With your grandma.


We try to write our ops manuals in a way that my grandma could follow but we don’t always succeed. :)


For all the hours I spent on the phone spelling grep, ls, cd, pwd, raging that we didn't keep nano instead of fucking vim (and I'm a vim person)... I could have stayed young and been solving real customer problems, not imperium-typing on a fucking keyboard with a 5s delay 'cause colleague is lost in the middle of nowhere and can't remember what file he just deleted and the system doesn't start anymore your software is fragile, just shite.


Yes. Depending on the person, it can either go extremely well or extremely poorly. Getting someone else to point a camera at the screen helps.


Yes, and it works if both parties are able to communicate using precise language. The onus is on the remote SME to exactly articulate steps, and on the local hands to exactly follow instructions and pause for clarifications when necessary.


Yeah. Do what you have to.

Sometimes the DR plan isn't so much that I have to have a working key; I just have to know who gets there first with a working key, and break glass might be literal.


Not OP, but many times. Really makes you think hard about log messages after an upset customer has to read them line by line over the phone.

One was particularly painful, as it was a "funny" log message I had added to the code for when something went wrong. Lesson learned: never add funny / stupid / goofy fail messages to the logs. You will regret it sooner or later.


folks with physical access are also denied. source - https://twitter.com/YourAnonOne/status/1445100431181598723


FWIW that's not the original source, just some twitter account reposting info shared by someone else. See this sub-thread: https://news.ycombinator.com/item?id=28750888


IT: "Please do this fix."

Person 1: "I can't, I don't have physical access."

IT: "Please do this fix."

Person 2: "I can't, I don't have digital access."

Why? It's [IT's?] policy.


Let me guess, it is tied to FB systems which are down. That would be hilarious.


this is not new, this is everyday life with helping hands, on duty engineers, l2-l3 levels telling people with physical access which commands to run etc. etc. etc.


Then you have security issues like this, where someone impersonates a client with helping hands and drains your exchange's hot wallet:

https://www.huffpost.com/archive/ca/entry/canadian-bitcoins-...


The places I've seen this at had specific verification codes for this. One had a simple static code per person that the hands-on guys looked up in a physical binder on their desk. Very disaster proof.

The other ones had a system on the internal network in which they looked you up, called back on your company phone and asked for a passphrase the system showed them. Probably more secure but requires those systems to be working.


This is not a real datacenter case but normal social hacking. On the datacenter side you have many more security checks, plus much of the time the helping hands and engineers are part of the same company, using internal communication tools etc., so they are on the same logical footprint anyhow.


Telecommunication satellite communication issues might seriously shut down whole regions if they occur.


I don't think so. I bet nobody is ever going to make that mistake at FB again after today.


I think it's the same with supply chains.


It just bit FB.


like today! xD


> Even in the biggest of organizations, they still have to wait for somebody to race down to the datacenter and plug his laptop into a router.

I love this comment.


Imagine having a huge portion of the digital world internationally riding on your shoulders...


Imagine that guy has this big npm repository locally with all those dodgy libraries with uncontrolled origin, in their /lib/node_modules with root permissions.

Wait, we all do, here.


You can use a custom npm prefix to avoid the mess you're describing. So basically:

See current prefix:

> npm config get prefix

Set prefix to something you can write to without sudo:

> npm config set prefix /some/custom/path


For something as distributed as Facebook, do multiple somebodies all have to race down to each individual datacenter and plug their laptops into the routers?

As someone with no experience in this, it sounds like a terrifying situation for the admins...


Interesting that they published stuff about their BGP setup and infrastructure a few months ago - maybe a little tweak to roll backs is needed.

"... We demonstrate how this design provides us with flexible control over routing and keeps the network reliable. We also describe our in-house BGP software implementation, and its testing and deployment pipelines. These allow us to treat BGP like any other software component, enabling fast incremental updates..."


    # todo: add rollbacks


Surely Facebook don't update routing systems between data centres (IIRC the situation) when they don't have people present to fix things if they go wrong? Or have an out-of-band connection (satellite, or dial-up (?), or some other alternate routing?).

I must be misunderstanding this situation here.

[Aside: I recall updating wi-fi settings on my laptop and first checking I had direct Ethernet connection working ... and that when I didn't have anything important to do (could have done a reinstall with little loss). Is that a reasonable analogy?]


Move fast and break . . . <NO CARRIER>


> don't update routing systems between data centres (IIRC the situation) when they don't have people present

Ha. You put too much faith into people.


Wondering how Facebook communicates internally now - most of their work streams likely depend on Facebook's systems, which are all down.

Can engineers and security teams even access prod systems anymore? Like, would "Bastion" hosts be reachable?

Wonder if they use Signal and Slack now?


There are various non-FB fallback measures, including IRC as a last-ditch method. The IRC fallback is usually tested once a year for each engineer.


I just heard from a contact that the fallback/backup IRC is also down.


Bet it was located at irc.facebook.com ;)

Joking aside, I can see how an IRC network has potential to be used in these situations. Maybe FAMANG should work together to set something like this up. The problem is, a single IRC server is not fail safe, but a network of multiple servers would just see a netsplit, in which case users would switch servers.

Also, I remember back in the IRCnet days simply using telnet to connect to IRCnet and send messages just for fun, so it's a very easy protocol that can be understood in a global disaster scenario (just the PING replies were annoying in telnet).
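
For anyone who never did this by hand: the server periodically sends a `PING` line and expects a `PONG` with the same parameter back, otherwise it drops you. A minimal sketch of just that reply logic, as I understand it from RFC 1459; the function name and the example hostname are made up for illustration, and a real client needs a lot more than this:

```python
from typing import Optional


def pong_reply(line: str) -> Optional[str]:
    """Return the PONG line answering a server PING, or None for any other line."""
    text = line.rstrip("\r\n")
    # Messages may start with an optional ":prefix " naming the sender; skip it.
    if text.startswith(":"):
        _, _, text = text.partition(" ")
    command, _, params = text.partition(" ")
    if command.upper() != "PING":
        return None
    # Echo the PING parameters back unchanged; IRC lines end with CR LF.
    return f"PONG {params}".rstrip() + "\r\n"


if __name__ == "__main__":
    # "irc.example.net" is a placeholder, not a real server.
    print(repr(pong_reply("PING :irc.example.net\r\n")))
```

In telnet you are the one typing that reply before the timeout, which is why it gets annoying fast.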


I heard the same thing from my old coworker who is at FB currently. All of their internal DNS/logins are broken atm so nobody can reach the IRC server. I bet this will spur some internal changes at FB in terms of how to separate their DR systems in the case of an actual disaster.


Good planning! Now, where does the IRC server live, and is it currently routable from the internet?

While normally I know the advice is "Don't plan for mistakes not to happen, it's impossible, murphy's law, plan for efficient recovery for mistakes"... when it comes to "literally our entire infrastructure is no longer routable from the internet", I'm not sure there's a great alternative to "don't let that happen. ever." And yet, here facebook is.


Also, are the users able to reach the server without DNS (i.e. are the IP address(es) involved static and communicated beforehand) and is the server itself able to function without DNS?

Routing is one thing which you can't do without (then you need to fall back to phone communications), but DNS is something that's quite probable to not work well in a major disaster.
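
A minimal sketch of what "communicated beforehand" could look like on the client side: try DNS first, then fall back to a static list handed out in advance. The hostname, the addresses (taken from the documentation ranges) and the function name are all hypothetical, and this says nothing about how FB's tooling actually works:

```python
import socket
from typing import Callable, List, Sequence


def candidate_addresses(
    hostname: str,
    port: int,
    static_ips: Sequence[str],
    resolve: Callable = socket.getaddrinfo,
) -> List[str]:
    """Addresses to try, DNS results first, then pre-shared static IPs."""
    try:
        infos = resolve(hostname, port, type=socket.SOCK_STREAM)
        resolved = [info[4][0] for info in infos]  # sockaddr is the 5th field
    except OSError:  # socket.gaierror (DNS failure) is a subclass of OSError
        resolved = []
    ordered: List[str] = []
    for ip in list(resolved) + list(static_ips):
        if ip not in ordered:  # keep order, drop duplicates
            ordered.append(ip)
    return ordered


if __name__ == "__main__":
    # Hypothetical fallback chat server; 192.0.2.10 is a documentation address.
    print(candidate_addresses("irc.fallback.example", 6667, ["192.0.2.10"]))
```

The server side has the same problem in reverse: anything it looks up by name at startup or login time needs a DNS-free path too.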


A lot of the core 'ops like' teams at FB use IRC on a daily basis.

When I worked there, I wasn't aware of any 'test once per year' concept or directive.

Of course, FB is a really big place, so things are different in different areas.


FB uses a separate IRC instance for these kinds of issues, at least when I used to work there


I would think that their internal network would correctly resolve facebook.com even though they've borked DNS for the external world, or if not they could immediately fix that. So at least they'd be able to talk to each other.


To the communication angle, I've worked at two different BigCo's in my career, and both times there was a fallback system of last resort to use when our primary systems were unavailable.


I haven't worked for a FAANG but it would be unthinkable that FB does not have backup measures in place for communications entirely outside of Facebook.

Hmm well I mean for key people, ops and so on. Not for every employee.

Only a few people need that type of access, and they should have it ready. If they need to bring more people in, there should be an easy way to do it.

Maybe the internal FB Messenger app has a slide button to switch to the backup network for those in need.


> Maybe the internal FB Messenger app has a slide button to switch to the backup network for those in need.

Having worked for 2 FAANG companies, I can tell you that most core services like FB Messenger would be using internal database services, and relying on those would be ineffective in a case like this because they would not work. The engineering cost to design them to support an external database would be a lot more than just paying for like 5 different external backup products for your SRE team.


Facebook does use IRC and Zoom as a fallback.


Actually, in this situation: Discord.


If they planned ahead, they should have had their oncalls practice on the backup systems (like Signal/Slack/Zoom) before now.


My team set up a discord lol


Don't they have a separate instance for internal communications?


"I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally."

Hmm, could be a UI/UX bug then :)


Seems odd to not have a redundant backdoor on a different network interface. Maybe that is too big of a security risk but idk.


You know how after changing resolution and other video settings you get a popup "do you want to keep these changes?" with a countdown and automatic revert in case you managed to screw up and can't see the output anymore?

Well, I wonder why a router that gets a config update but then doesn't see any external traffic for 4 hours doesn't just revert back to the last known good config...
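
A rough sketch of that idea as a "confirm or roll back" timer. Note this swaps in explicit operator confirmation for the "no external traffic" check, since that is simpler to show; Junos' `commit confirmed` works along similar lines. The class name, the 600-second window and the config dict are all hypothetical:

```python
import time
from typing import Callable, Optional


class ConfirmedConfig:
    """Keeps the last known-good config and reverts unconfirmed changes."""

    def __init__(self, initial: dict, clock: Callable[[], float] = time.monotonic):
        self.active = dict(initial)
        self._last_good = dict(initial)
        self._deadline: Optional[float] = None
        self._clock = clock

    def apply(self, new_config: dict, confirm_within: float) -> None:
        """Activate new_config, but only provisionally."""
        self.active = dict(new_config)
        self._deadline = self._clock() + confirm_within

    def tick(self) -> bool:
        """Call periodically. Returns True if an unconfirmed change was reverted."""
        if self._deadline is not None and self._clock() >= self._deadline:
            self.active = dict(self._last_good)
            self._deadline = None
            return True
        return False

    def confirm(self) -> bool:
        """Make the pending change permanent. Fails if it already timed out."""
        if self.tick() or self._deadline is None:
            return False
        self._last_good = dict(self.active)
        self._deadline = None
        return True


if __name__ == "__main__":
    cfg = ConfirmedConfig({"bgp_export": "all-prefixes"})
    cfg.apply({"bgp_export": "none"}, confirm_within=600)
    # If the operator is locked out and never calls confirm(), a later tick()
    # past the deadline puts the old config back.
    print(cfg.active)
```

The catch is that the operator has to still be able to reach the box to confirm, which is exactly the property you want: a change that cuts you off can't be confirmed.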


So, does anyone know where one can buy an LTE gateway with a serial port interface? Asking for a friend.


Our security team complained that we have some services like monitoring or SSH access to some Jump Hosts accessible without a VPN because VPN should be mandatory to access all internal services. I'm afraid that once we comply we could end up in a situation similar to the one Facebook is in now...


But you have two independent VPNs right, using different technologies on different internet handoffs in very different parts of your network, right?


Fundamentally, how is a 2nd independent VPN into your network a different attack surface than a single, well-secured ssh jumphost? When you're using them for narrow emergency access to restore the primary VPN, both are just "one thing" listening on the wire, and it's not like ssh isn't a well-understood commodity.


Zero day sshd vulnerability would be bad.

On the other hand if you had to break through wireguard first, and then go through your single well-secured bastion, you'd not only be harder to find, you'd have two layers of protection, and of course you tick the "VPN" box


Vpn can also have a zero day, and seems about as likely?


But if your vpn has a zero day, that lets you get to the ssh server. It's two layers of protection, you'd have to have two zero days to get in instead of one.

You could argue it's overkill, but it's clearly more secure


Only if the VPN means you have a VPN and a jump box. If it's "VPN with direct access to several servers and no jump box" there's still only one layer to compromise.


Still wouldn't help if your configuration change wipes you clear off the Internet like Facebook's apparently has. The only way to have a completely separate backup is to have a way in that doesn't rely on "your network" at all.


Your OOB network wouldn't be affected by changes to your main network


These are readily available, OpenGear and others have offered them forever. I can't believe fb doesn't have out of band access to their core networking in some fashion. OOB access to core networking is like insurance, rarely appreciated until the house is on fire.


It's quite possible that they have those, but that the credentials are stored in a tool hosted in that datacenter or that the DNS entries are managed by the DNS servers that are down right now.


You are probably right but if that is the case, it isn't really out of band and needs another look. I use OpenGear devices with cellular to access our core networking to multiple locations and we treat them as basically an entirely independent deployment, as if it is another company. DNS and credentials are stored in alternate systems that can be accessed regardless of the primary systems.

I'm sure the logistics of this become far more complicated as the organization scales but IMHO it is something that shouldn't be overlooked, exactly for outlier events like this. It pays dividends the first time it is really needed. If the accounts of ramenporn are correct, it would be paying very well right now.

Out of band access is a far more complicated version of not hosting your own status page, which they don't seem to get right either.


Facebook is likely scrambling private jets as we speak to get the right people to the right places.


Reminds me of that episode in Mr Robot


The cost of the downtime would be


Facebook 2021 revenue is around $100B. That’s $11M an hour. Since it’s peak hour for ad printing, one can assume double or triple this rate.

They are already looking at > $100M in ad loss, not counting reputation damage etc.
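
The arithmetic behind those figures, as a quick sketch. The $100B annual revenue and the 2x to 3x peak multipliers are the rough guesses from this comment, not reported numbers, and spreading revenue evenly over the year is itself a simplification:

```python
ANNUAL_REVENUE_USD = 100e9  # rough 2021 estimate quoted above, not an official figure
HOURS_PER_YEAR = 365 * 24

revenue_per_hour = ANNUAL_REVENUE_USD / HOURS_PER_YEAR


def hours_to_lose(target_usd: float, peak_multiplier: float) -> float:
    """Hours of downtime needed to forgo target_usd at a given peak multiplier."""
    return target_usd / (revenue_per_hour * peak_multiplier)


if __name__ == "__main__":
    print(f"average: ${revenue_per_hour / 1e6:.1f}M per hour")
    for multiplier in (1, 2, 3):
        hours = hours_to_lose(100e6, multiplier)
        print(f"at {multiplier}x average rate: $100M lost after {hours:.1f} hours")
```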


Think of all the influencers who can’t influence and FB addicts who can’t get their fix (+insta and whatsapp)


This tweet seems to confirm it is a bgp issue...

https://twitter.com/GossiTheDog/status/1445063880963674121?s...


Cloudflare also confirmed it:

https://twitter.com/jgrahamc/status/1445068309288951820

Also, the Domain name is for sale???

https://whois.domaintools.com/facebook.com


Weird banner at the top, seems like false advertising as it says a couple lines down: Expires on 2030-03-29


I suspect it's an automated system triggered by DNS not resolving, and they try to "make an offer" if you follow through.


You're right, it's misleading, thanks. Other sites (dreamhost, godaddy) don't list it as for sale.


Just imagine the amount of stress on these people; hope the money is really worth it.


It shouldn't be too stressful. Well-managed companies blame processes rather than people, and have systems set up to communicate rapidly when large-scale events occur.

It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts.


> It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck.

As someone who formerly did Ops for many many years... this is not accurate. Even in a well organized company there are usually stakeholders at every level on IM calls so that they don't need to play "telephone" for status. For an incident of this size, it wouldn't be unusual to have C-level executives on the call.

While those managers are mostly just quietly listening in on mute if they know what's good (e.g. don't distract the people doing the work to fix your problem), their mere presence can make the entire situation more tense and stressful for the person banging keyboards. If they decide to be chatty or belligerent, it makes everything 100x worse.

I don't envy the SREs at Facebook today. Godspeed fellow Ops homies.


I think it comes down to the comfort level of the worker. I remember when our production environment went down. The CTO was sitting with me just watching and I had no problem with it since he was completely supportive, wasn't trying to hurry me, just wanted to see how the process of fixing it worked. We knew it wasn't any specific person's fault, so no one had to feel the heat from the situation beyond just doing a decent job getting it back up.


C levels don't sit on the call with engineers. They aren't that dumb. Managers will communicate upward.


That greatly depends on the incident and the organization. I’ve personally been on numerous incident calls with C-level folks involved.


Yeah hell, I've ended up with one of the big names as my comms lead.

That in itself was stressful, and became an example case later.


"it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts"

Well, you'd be surprised about how one person can bring everything down and/or save the day at Facebook, Cloudflare, Google, Gitlab, etc. Most people are observers/cheerleaders when there is an incident.


> Most people are observers/cheerleaders when there is an incident.

Yeah, a typical fight/flight response.


Or most people simply don't have anything useful to add or do during an incident.


Taking all the available slots in the massive gvc warroom ain't much... but it's honest work.


Well, individuals will still stress, if anything, due to the feeling of being personally responsible for inflicting damage.

I know someone who accidentally added a rule 'reject access to * for all authenticated users' in some stupid system where the ACL ruleset itself was covered by this *, and this person nearly collapsed when she realized even admins were shut out of the system. It required getting low-level access to the underlying software to reverse engineer its ACLs and hack into the system. Major financial institution. Shit like that leaves people with actual trauma.

As much as I hate fb, I really feel for the net ops guys trying to figure it all out, with the whole world watching (most of it with schadenfreude).
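
To make the failure mode in that anecdote concrete, here is a minimal sketch of a guard that refuses any rule change that would lock admins out of the ACL itself. Everything in it is an assumption for illustration: the rule tuple format, the "all" group standing in for "all authenticated users", the `system/acl` resource name, and first-match-wins evaluation. It is not how the system in the story worked.

```python
from fnmatch import fnmatchcase

# Hypothetical model: a rule is (effect, resource_pattern, group).
# Rules are evaluated top to bottom and the first match wins.
ACL_RESOURCE = "system/acl"  # assumed name for the ACL ruleset itself


def is_allowed(rules, group, resource):
    """Return True if the first matching rule allows `group` on `resource`."""
    for effect, pattern, rule_group in rules:
        if rule_group in (group, "all") and fnmatchcase(resource, pattern):
            return effect == "allow"
    return False  # default deny when nothing matches


def add_rule_safely(rules, new_rule, admin_group="admins"):
    """Prepend new_rule, but refuse if admins could no longer edit the ACL."""
    candidate = [new_rule] + list(rules)
    if not is_allowed(candidate, admin_group, ACL_RESOURCE):
        raise PermissionError(
            "rejected: change would lock admins out of the ACL itself"
        )
    return candidate


if __name__ == "__main__":
    rules = [("allow", "*", "admins")]
    try:
        add_rule_safely(rules, ("deny", "*", "all"))
    except PermissionError as exc:
        print(exc)
```

The check simulates the ruleset after the change and tests the one invariant that matters for recovery, so a mistyped wildcard is rejected before it takes effect rather than discovered afterwards.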


As one of the major responders to an incident analogous to this at a different FAANG... you're high, it's still hella stressful.


> It shouldn't be too stressful. (...) it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck

Earlier comment mentioned that there is a bottleneck, and that people who are physically able to solve the issue are few and that they need to be informed what to do; being one of these people sounds pretty stressful to me.

"but the people with physical access is separate (...) Part of this is also due to lower staffing in data centers due to pandemic measures", source: https://news.ycombinator.com/item?id=28749244


Sure, but that's what conference calls are for.

Most big tech companies automatically start a call for every large scale incident, and adjacent teams are expected to have a representative call in and contribute to identifying/remediating the issue.

None of the people with physical access are individually responsible, and they should have a deep bench of advice and context to draw from.


I'm not an IT Operations guy, but as a dev I always thought it was exciting when the IT guys had the destiny of the firm on their shoulders. It must be exciting.


You tend not to think about it…

Most teams that handle incidents have well documented incident plans and playbooks. When something major happens you are mostly executing the plan (which has been designed and tested). There are always gotchas that require additional attention / hands but the general direction is usually clear.


>Well-managed companies

To what extent does this include Facebook?


> Well-managed companies blame processes rather than people,

We're six hours without a route to their network, and counting. I think we can safely rule out well-managed.


> Well-managed companies blame processes rather than people

I feel like this just obfuscates the fact that individuals are ultimately responsible, and allows subpar employees to continue existing at an organization when their position could be filled by a more qualified employee. (Not talking about this Facebook incident in particular, but as a generalisation: not attributing individual fault allows faulty employees to thrive at the expense of more qualified ones).


> this just obfuscates the fact that individuals are ultimately responsible

in critical systems, you design for failure. if your organizational plan for personnel failure is that no one ever makes a mistake, that's a bad organization that will forever have problems.

this goes by many names, like the swiss cheese model[0]. it's not that workers get to be irresponsible, but that individuals are responsible only for themselves, and the organization is the one responsible for itself.

[0] https://en.wikipedia.org/wiki/Swiss_cheese_model


> is that no one ever makes a mistake

This isn't what I'm saying, though. The thought I'm trying to express is that if no individual accountability is done, it allows employees who are not as good at their job (read: sloppy) to continue to exist in positions which could be better occupied by employees who are better at their job (read: more diligent).

The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but less than the other person. This is the issue I'm trying to get at.


If someone is sloppy and not willing to change, he should be shown the door, not because he caused an outage but because he is sloppy.

People who operate systems under fear tend to do stupid things, like covering up innocent actions (deleting logs) or keeping information to themselves instead of sharing it. Very few can operate complex systems for a long time without making a mistake. An organization where the spirit is "oh, outage, someone is going to pay for that" will never be attractive to good people, and will have a hard time adapting to change and adopting new tech.


> The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but less than the other person. This is the issue I'm trying to get at.

If you rely on someone triple-checking, you should improve your processes. You need better automation/rollback/automated testing to catch things. Eventually only intentional failure should be the issue (or you'll discover interesting new patterns that should be protected against)


If there is an incident because an employee was sloppy, the fault lies with the hiring process, the evaluation process for this employee, or with the process that put four eyes on each implementation. The employee fucked up, they should be removed if they are not up to standards, but putting the blame on them does not prevent the same thing from happening in the future.


If you think about it, it isn't very useful to find a person who is responsible. Suppose someone causes an outage or harm, through neglect or even bad intentions: either the system is set up in a way that one person couldn't cause the outage, or sooner or later it will go down. To build a truly resilient system, especially on a global scale, there should never be an option for a single person to bring down the whole system.


By focusing on the process, lessons are learned and systems are put in place which leads to a cycle of improvement.

When individuals are blamed instead, a culture of fear sets in and people hide / cover up their mistakes. Everybody loses as a result.


I don't think the comment you're replying to applies to your concern about subpar employees.

We blame processes instead of people because people are fallible. We've spent millennia trying to correct people, and it rarely works to a sufficient level. It's better to create a process that makes it harder for humans to screw up.


Yes, absolutely, people make mistakes. But the thought I was trying to convey is that some people make a lot more mistakes than others, and by not attributing individual fault these people are allowed to thrive at the cost of having less error-prone people in their position. For example, someone who triple-checks every parameter that they input, versus someone who has a habit of just skimming or not checking at all. Yes the triple-checker will make mistakes too, but way less than the person who puts less effort in.


But that has nothing to do with blaming processes vs people.

If the process in place means that someone has to triple check their numbers to make sure they’re correct, then it’s a broken process. Because even that person who triple checks is one time going to be woken up at 2:30am and won’t triple check because they want sleep.

If the process lets you do something, then someone at some point in time, whether accidentally or maliciously, will cause that to happen. You can discipline that person, and they certainly won’t make the same mistake again, but what about their other 10 coworkers? Or the people on the 5 sister teams with similar access who didn’t even know the full details of what happened?

If you blame the process and make improvements to ensure that triple checking isn’t required, then nobody will get into the situation in the first place.

That is why you blame the process.


Yeah, I've heard this view a hundred times on Twitter, and I wish it were true.

But sadly, there is no company which doesn't rely, at least at one point or another, on a human being typing an arbitrary command or value into a box.

You're really coming up against P=NP here. If you can build a system which can auto-validate or auto-generate everything, then that system doesn't really need humans to run at all. We just haven't reached that point yet.

Edit: Sorry, I just realised my wording might imply that P does actually equal NP. I have not in fact made that discovery. I meant it loosely to refer to the problem, and to suggest that auto-validating these things is at least not much harder than auto-executing them.


I don’t think anyone ever claimed the process itself is perfect. If it were, we obviously would never have any issues.

To be explicit here, by blaming the process, you are discovering and fixing a known weakness in the process. What someone would need to triple check for now, wouldn’t be an issue once fixed. That isn’t to say that there aren’t any other problems, but it ensures that one issue won’t happen again, regardless of who the operator is.

If you have to triple check that value X is within some range, then that can easily be automated to ensure X can’t be outside of said range. Same for calculations between inputs.

To take the overly simplistic triple check example from before, said inputs that need to be triple checked are likely checked based on some rule set (otherwise the person themselves wouldn’t know if it was correct or not). Generally speaking, those rules can be encoded as part of the process.

What was before potentially “arbitrary input” now becomes an explicit set of inputs with safeguards in place for this case. The process became more robust, but is not infallible.
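
As a rough illustration of "encoding the rules as part of the process", here is a minimal sketch in which the limits an operator used to triple check by hand become a rule set that every change must pass before it is applied. The input names (`max_connections`, `timeout_seconds`) and their ranges are hypothetical, chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Allowed closed interval for one named input."""
    low: float
    high: float


# Hypothetical rule set: the limits a person previously checked manually.
RULES = {
    "max_connections": Rule(1, 10_000),
    "timeout_seconds": Rule(0.1, 300),
}


def validate(change):
    """Return a list of violations; an empty list means the change passes."""
    errors = []
    for name, value in change.items():
        rule = RULES.get(name)
        if rule is None:
            errors.append(f"{name}: no rule defined, refusing unknown input")
        elif not (rule.low <= value <= rule.high):
            errors.append(f"{name}={value} outside [{rule.low}, {rule.high}]")
    return errors


def apply_change(change, config):
    """Return a new config with the change applied, or raise if any input fails."""
    errors = validate(change)
    if errors:
        raise ValueError("; ".join(errors))
    return {**config, **change}


if __name__ == "__main__":
    config = {"max_connections": 100, "timeout_seconds": 30}
    print(apply_change({"timeout_seconds": 60}, config))
    print(validate({"timeout_seconds": 9999}))
```

This only catches the class of error the rules describe; an in-range but wrong value still gets through, which is the "more robust, but not infallible" point.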

But if you were to blame people, the process still takes arbitrary input, the person who messed up will probably validate their inputs better but that speaks nothing of anyone else on the team, and two years down the line where nobody remembers the incident, the issue happens again because nothing really has changed.


The issue is that this view always relies on stuff like "make people triple check everything".

- How does that relate to making a config change?

- How do you practically implement a system where someone has to triple check everything they do?

- How do you stop them just clicking 'confirm' three times?

- Why do you assume they will notice on the 2nd or 3rd check, rather than just thinking "well, I know I wrote it correctly, so I'll just click confirm"?

I don't think rules can always be encoded in the process, and I don't see how such rules will always be able to detect all errors, rather than only a subset of very obvious errors.

And that's only dealing with the simplest class of issues. What about a complex distributed systems problem? What about the engineer who doesn't make their system tolerant of Byzantine faults? How is any realistic 'process' going to prevent that?

This entire trope relies on the fundamental axiom that "for any individual action A, there is a process P which can prevent human error". I just don't see how that's true.

(If the statement were something like "good processes can eliminate whole classes of error, and reduce the likelihood of incidents", I'd be with you all the way. It's this Twitter trope of "if you have an incident, it's a priori your company's fault for not having a process to prevent it" which I find to be silly and not even nearly proven.)


> and allows subpar employees to continue existing at an organization when their position could be filled by a more qualified employee.

Not really, their incompetence is just noticed earlier at the review/testing stages instead of in production incidents.

If something reaches production that's no longer the fault of one person, it's the fault of the process and that's what you focus on.


The stress for me usually goes away once the incident is fully escalated and there's a team with me working on the issue. I imagine that happened quite quick in this case...


Exactly, the primary focus in situations like this is to ensure that no one feels like they are alone, even if in the end it is one person who has to type in the right commands.

Always be there, help them double check, help monitor, help make the calls to whomever needs to be informed, help debug. No one should ever be alone during a large incident.


This is a one-off event, not a chronic stress trigger. I find them invigorating personally, as long as everybody concerned understands that this is not good in the long run, and that you are not going to write your best code this way.


Well, those comments have been deleted now... I guess someone's boss didn't like the unofficial updates going out? :)


Also, equally important to note, there was a massive exposé on Facebook yesterday that is reverberating across social media and news networks, and today, when I tried to make a post including the tag #deletefacebook, my post mysteriously could not be published and the page refreshed, mysteriously wiping my post...

This is possibly the equivalent of a corporate watergate if you ask me... Just my personal opinion as a developer though... Not presented as fact... But hrmmm.


So what you're saying is facebook... deleted itself?

The singularity is happening. It realized it would end society, so it ended itself.


They decided that they publish too much misinformation and self censored ;)


This reminds me of the last time the singularity nearly happened.

https://google.com/search?q=google

I beg you, don't go there.


Archived version: https://archive.is/QvdmH



The Reddit post is down but not before it was archived: https://archive.is/QvdmH and https://archive.is/TNrFv


User has now deleted the update.


I am sure this is not what they specifically mean by fail fast and break things often.


> Reddit r/Sysadmin user that claims to be on the "Recovery Team"

They have time to make public posts, and think it's a good idea?

Sure, I'm on the 'Recovery Team' too! How about you?


If it's anything like my past employers, they probably have a lot of time. They probably also got in a lot of trouble.

When we'd have situation bridges put in place to work a critical issue, there would usually be 2-3 people who were actively troubleshooting and a bunch of others listening in, there because "they were told to join" but with little-to-nothing to do. In the worst cases, there was management there, also.

Most of the time I was one of the 2 or 3 and generally preferred if the rest of them weren't paying much attention to what was going on. It's very frustrating when you have a large group of people who know little about what's going on injecting their opinions while you're feverishly trying to (safely) resolve a problem.

It was so bad that I once announced[0] to a C-Level and VP that they needed to exit the bridge immediately, because the discussion had devolved into finger-pointing. All of management was "kicked out". We were close to solving it, but technical staff were second-guessing themselves in the presence of folks with the power to fire them. 30 minutes later we were working again. My boss at the time explained that management created their own bridge and the topic was "what to do about 'me'", which quickly went from "fire me" to "get them all a large Amazon gift card". Despite my undiplomatic handling of the situation, that same C-Level negotiated to get me directly beneath them during a reorganization about six months later, and I stayed in that spot for years with a very good working relationship. One of my early accomplishments was to limit any of management's participation in situation bridges to once/hour, and only when absolutely necessary, for status updates, assuming they couldn't be gotten any other way (phones always worked, but the other communication options may not have).

[0] This was the 16th hour of a bridge that started at 11:00 PM after a full work day early in my career -- I was a systems person with a title equivalent to 'peon', we were all very raw by then and my "announcement" was, honestly, very rude, which I wasn't proud of. Assertive does not have to be rude, but figuring out the fine line between expressing urgency and telling people off is a skill that has to be learned.


Uh oh that user deleted their account. Hope they are OK.


Looks like those updates have now been deleted


Comment now seems to be deleted by user.


That reddit comment has been deleted.


he started deleting the comments

