More details about the October 4 outage (fb.com)
473 points by moneil971 61 days ago | 286 comments



> Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.

I'm so glad to see that they framed this in terms of a bug in a tool designed to prevent human error, rather than simply blaming it on human error.


> I'm so glad to see that they framed this in terms of a bug in a tool designed to prevent human error, rather than simply blaming it on human error.

Wouldn't human error reflect extremely poorly on the company though? I mean, for human error to be the root cause of this mega-outage, that would imply that the company's infrastructure and operational and security practices were so ineffective that a single person screwing up could inadvertently bring the whole company down.

A freak accident that requires all the stars to align to even be possible, on the other hand, does not raise nearly as much concern.


Organisations have a bad habit of using "human error" to pin systemic problems, whose true root cause is inadequate leadership, on individual low-level employees. So, we're glad to see Facebook didn't try this shitty practice.

For a modern example, look for information on Symantec's "A tough day as leaders" post, in which they try to blame an incident that was clearly the result of (at least) incompetence by senior management on a single person whom they'd just fired. This is part of the sequence of events that led to Symantec no longer being a trusted root CA. You won't find the actual post by Symantec because (of course) once they realised it wasn't doing what they wanted they deleted it, but you can find copies and references to it.

For much older examples, look at the early history of the railway in most of the world. Train crashes, blame the (often dead in the crash and thus unable to defend themselves) train driver, hint that they may have been drunk and were certainly incompetent. Owners carry on profiting from the unsafe railway and needn't spend any money making it safer.


Boeing's initial response to the 737MAX crashes comes to mind as well.


> Boeing's initial response to the 737MAX crashes comes to mind as well.

Truth be told, Boeing's response to the 737MAX crashes was to blame people working for other organizations, so that the blame would fall on neither Boeing's engineers/technicians nor the Boeing organization itself. That's a total and complete cop-out.

Pinning the blame on a company employee at least implies that the company itself has some responsibility.


And the Volkswagen emissions scandal too


Yes! I forgot they initially tried to blame it on some low-level engineer. There is an old joke in Germany, predating the VW scandal, saying that it was definitely the night guard.


So, not human error, but inadequate leadership, which is also a human error.

In other words, not human error but human error.


Except an error is supposed to be unintentional.


I mean, sure? mumblemumble is still right though. If you're looking for a cynical reason for everything FB related, then, sure, it's true that a human error looks bad.


Human error is a cop-out excuse anyway, since it's not something you can fix going forward. Humans err, and if a mistake was made once it could easily be made again.


To err is human. To really fsk things up you need a computer


To err requires a computer. To really fsk things up requires automation.


The buggy audit tool was probably made by a human too, though.


But reviewed by other humans. At some headcount, a collective human error becomes a system error.


There actually is a really good tool for auditing such systems.

https://learntla.com/introduction/

I discovered it here on HN just recently, in a comment on a new tool in the same problem space.


System complexity is just a way to avoid blaming individual humans when an error occurs.

- Me, 2021


It's humans all the way down!


something something soylent green


Of course the system was built by humans, but we are discussing the proximate cause of the outage.


"hey let's try this github copilot thingy to write an audit tool"


No they used GitHub Copilot


I wouldn't be surprised if that tool was a shell script with a mistyped conditional somewhere; I really dislike shell scripting.


As opposed to what? Sixteen pages of boilerplate Java/Python?


I wouldn't conflate Java and Python in the boilerplate camp. Python can be very boilerplate-y, but it tends to only happen in the hands of Java developers.

That said, even clean, idiomatic Python isn't as terse as sh. It also isn't as terse as perl. Many would argue that's a good thing. The optimum point for readability isn't found at either of the extremes. Not entirely unlike how the most readable way of writing English is neither shorthand nor blackletter.


Interesting bit on recovery w.r.t. the electrical grid

> flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems ...

I wish there was a bit more detail in here. What's the worst case there? Brownouts, exploding transformers? Or less catastrophic?


Brownouts are probably the most proximate concern - a sudden increase in demand will draw down the system frequency in the vicinity, and if there aren't generation units close enough or with enough dispatchable capacity, there's a small chance they would trip a protective breaker.

A person I know on the power grid side said at one data center there were step functions when FB went down and then when it came up, equal to about 20% of the load behind the distribution transformer. That quantity is about as much as an aluminum smelter switching on or off.


> That quantity is about as much as an aluminum smelter switching on or off.

Interestingly, the mountains east of Portland OR, where all the Aluminum smelters used to be, are now full of FAANG datacenters relying on the power infrastructure (and pricing) the Aluminum industry used to use...

https://www.oregonlive.com/silicon-forest/2015/10/small-town...

And Washington state too:

https://www.bizjournals.com/seattle/blog/techflash/2015/11/p...


That's pretty interesting, I'm sure those aluminium foundries would need to be careful about turning the power on as well.

Tangentially related, aluminium production in the Netherlands may shut down soon; because of a sudden spike in gas prices (due to mismanagement), electricity prices have also gone up, making producing aluminium no longer cost-effective. €2400 in electricity to produce a ton of aluminium worth €2500 kinda cost effectiveness.

I wouldn't be surprised if the big datacenters here will try and offload some of their workloads to datacenters elsewhere with lower energy costs. Mind you, I'm pretty sure these datacenters make long-running deals on electricity prices.


But don't their datacenters all have backup generators? So worst case in a brownout, they fail over to generator power, then can start to flip back to utility power slowly.

Or do they forgo backup generators and count on shifting traffic to a new datacenter if there's a regional power outage?


Edit to be less snarky:

I assume they do have backup generators, though I don’t know.

However, if the sudden increase put that much load on the grid, it could drop the frequency enough to black out the entire neighborhood. That would be bad even if FB was able to keep running through it.


Ah yea I meant brownouts for other people haha. I figure Facebook can handle their own electrical stability just fine


The blackout of the northeast US and parts of Canada in 2003 was really caused by something relatively small. Imagine Facebook, yesterday, causing some weird cascading effect on the power grid, and pulling half of the country with it...


Is there any liability if Facebook had brought everything up at once and caused brownouts? Seems like it would be some form of negligence on their part harming a shared resource, but I don't know if there's any laws or contract terms with the power company that require them to pay if they mess up like that.


My girlfriend works at a large grid operator (in Europe). According to her there are lots of regulations and contracts governing how grid operators must handle reliability. So it's unlikely that Facebook would be liable if this took down half the country, because then it would be the grid operator not living up to its agreements on reliability.

There are a lot of automated fail-safes for this, and apparently larger industry (which a datacenter is as well) will get disconnected from the grid automatically in emergency situations before they drop residential areas. But in the end they will drop, one by one, everything they need to in order to keep the larger grid running. It's not even a networked "smart" management system; the distribution points automatically react to voltage and frequency drops, and they're set up to cut some things, like industry, earlier than others.


For outages the generators are great, but I'm not sure how they assist with brownouts unless they can start instantly or are constantly running to provide a buffer.

Short term they'd help, but an instantaneous or unexpected massive traffic/CPU usage/user surge might pop too fast for the generators to start and kick in properly. Also, it might not be good for those big generators to start and stop over and over, vs. bringing infra back online in waves to limit spikes.


> For outages the generators are great but I'm not sure how they assist with brownouts unless they can start instantly

If the generators will help in an outage, why wouldn't they help in a brownout? You'd transition to generator when the voltage and/or frequency is outside of spec.

You'd typically have some short-term power protection to keep your datacenter running until the generators start.

I was skeptical about a datacenter that had less than 60 seconds of flywheel energy storage. But the data center manager said that if the generator doesn't start within 30 seconds, you're not going to get it started within an hour either, so having a huge battery stack that can power the datacenter for 15 minutes isn't going to help much.


Generators do usually start up very quickly. Under a minute.


I guess a DC would have UPS/battery power on hand to cover an instantaneous brown out. Then the generators could be on and restoring battery power while running the DC.


If your system is pulling 500 watts at 120V, that's around 4A of line current. If you drop down 20% to 100V, the output will happily still pull its regulated voltage, but now the line components are seeing ~20% more, at 5A. For brownout, you need to overrate your components, and/or shut everything off if the line voltage goes too low.
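
A minimal sketch of that arithmetic, assuming an idealized constant-power load (the 500 W figure is just the example above):

    # Idealized constant-power load: the same wattage is drawn regardless of line voltage.
    supply_power_w = 500.0
    for line_voltage in (120.0, 100.0):
        line_current = supply_power_w / line_voltage
        print(f"{line_voltage:.0f} V line -> {line_current:.2f} A input current")
    # 120 V -> ~4.17 A; a 20% sag to 100 V -> 5.00 A, i.e. ~20% more input current.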

I used to do electrical compliance testing in a previous life, with brownout testing being one of our safety tests. You would drape a piece of cheesecloth over the power supply and slowly ramp the line voltage down. At the time, the power supplies didn't have good line-side voltage monitoring. There was almost always smoke, and sometimes cheesecloth fires. Since this was safety testing, pass/fail was mostly based on whether the cheesecloth caught fire, not whether the power supply was damaged.


Why wouldn't the power supply shut down due to overcurrent protection?


That's likely on the output side, which doesn't see overcurrent.


"output will happily still pull its regulated voltage" you mean power, right?


All standard computer components require a regulated voltage, then they consume power as a consequence of their operation. The steady voltage is required because the transistors in ICs will break down if voltages go too high, or stop operating if they go too low. Forcing something like an IC to always use the same amount of power, even if it were idle, isn't really possible, because nobody would build it that way.


I’m very close with someone who works at a FB data center and was discussing this exact issue.

I can only speak to one problem I know of (and am rather sure I can share): a spike might trip a bunch of breakers at the data center.

BUT, unlike me at home, FB's policy is to never flip a circuit back on until you're positive of the root cause of said trip.

By itself that could compound issues and delay ramp-up time, as they'd work to be sure no electrical components actually shorted/blew/etc. A potentially time-sucking task, given these buildings could be measured in whole units of football fields.


Likely tripping breakers or overload protection on UPSes?

Often PDUs used in a rack can be configured to start servers up in a staggered pattern to avoid a surge in demand for these reasons.

I'd imagine there's more complications when you're doing an entire DC vs just a single rack, though.
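
For illustration, a rough sketch of the staggered power-on idea (the outlet names and delay here are made up; real PDUs do this with a configurable per-outlet power-on delay):

    import time

    outlets = [f"outlet-{i}" for i in range(1, 9)]   # hypothetical 8-outlet PDU
    power_on_delay_s = 2                             # assumed stagger interval
    for outlet in outlets:
        print(f"powering on {outlet}")
        time.sleep(power_on_delay_s)                 # spread inrush/startup load over time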


I don't see how suddenly running more traffic is going to trip datacenter breakers -- I could see how flipping on power to an entire datacenter's worth of servers could cause a spike in electrical demand that the power infrastructure can't handle, but if suddenly running CPUs at 100% trips breakers, then it seems like that power infrastructure is undersized? This isn't a case where servers were powered off; they were idle because they had no traffic.

Do large providers like Facebook really provision less power than their servers would require at 100% utilization? Seems like they could just use fewer servers, with power sized for 100%, if their power system is going to constrain utilization anyway?


All of the components in the supply chain will be rated for greater than max load, however power generation at grid scale is a delicate balancing act.

I’m not an electrical engineer, so the details here may be fuzzy, however in broad strokes:

Grid operators constantly monitor power consumption across the grid. If more power is being drawn than generated, line frequency drops across the whole grid. This leads to brownouts and can cause widespread damage to grid equipment and end-user devices.

The main way to manage this is to bring more capacity online to bring the grid frequency back up. This is slow, since spinning up even “fast” generators like natural gas can take on the order of several minutes.

Notably, this kind of scenario is the whole reason the Tesla battery in South Australia exists. It can respond to spikes in demand (and consume surplus supply!) much faster than generator capacity can respond.

The other option is load shedding, where you just disconnect parts of your grid to reduce demand.

Any large consumers (like data center operators) likely work closely with their electricity suppliers to be good citizens and ramp up and down their consumption in a controlled manner to give the supply side (the power generators) time to adjust their supply as the demand changes.

Note that changes to power draw as machines handle different load will also result in changes to consumption in the cooling systems etc. making the total consumption profile substantially different coming from a cold start.


You're talking about the grid, the OP was talking about datacenter infrastructure -- which one is the weak link?

If a datacenter can't go from idle (but powered on) servers to fully utilized servers without taking down the power grid, then it seems that they'd have software controls in place to prevent this, since there are other failure modes that could cause this behavior other than a global Facebook outage.


Unfortunately the article doesn’t provide enough explicit detail to be 100% sure one way or the other, however my read is that it’s probably the grid.

> Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.

“Electrical systems” is vague and could refer to either internal systems, external systems or both.

That said, if the DC is capable of running under sustained load at peak (which we have to assume it is, since that’s its normal state when FB is operational) it seems to me like the externality of the grid is the more likely candidate.

In terms of software controls preventing this kind of failure mode, they do have it - load shedding. They’ll cut your supply until capacity is made available.


The key word is "suddenly".

In the electricity grid, demand and generation must always be precisely matched (otherwise, things burn up). This is done by generators automatically ramping up or down whenever the load changes. But most generators cannot change their output instantly; depending on the type of generator, it can take several minutes or even hours to respond to a large change in the demand.

Now consider that, on modern servers, most of the power consumption is from the CPU, and also there's a significant difference in the amount of power consumed between 100% CPU and idle. Imagine for instance 1000 servers (a single rack can hold 40 servers or more), each consuming 2kW of power at full load, and suppose they need only half that at idle (it's probably even less than half). Suddenly switching from idle to full load would mean 1MW of extra power has to be generated; while the generators are catching up to that, the voltage drops, which means the current increases to compensate (unlike incandescent lamps, switching power supplies try to maintain the same output no matter the input voltage), and breakers (which usually are configured to trip on excess current) can trip (without breakers, the wiring would overheat and burn up or start a fire).
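
The back-of-the-envelope numbers from that example, assuming the stated 2 kW full load and a 50% idle draw per server:

    servers = 1000
    full_load_kw = 2.0            # per server, from the example above
    idle_fraction = 0.5           # assumed: idle draws half of full load
    step_kw = servers * full_load_kw * (1 - idle_fraction)
    print(f"step load: {step_kw:.0f} kW (~{step_kw / 1000:.1f} MW) appearing at once")
    # ~1000 kW = 1 MW of extra demand if all servers jump from idle to 100% together.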

If the load changes slowly, on the other hand, there's enough time for the governor on the generators to adjust their power source (opening valves to admit more water or steam or fuel), and overcome the inertia of their large spinning mass, before the voltage drops too much.


>generated; while the generators are catching up to that, the voltage drops, which means the current increases to compensate...

Close - you won't see an increase in load on a synchronous machine operating at constant throttle manifest as a voltage sag; you'll see it manifest as a decrease in frequency (the generators literally slow down, like a guy on a bike going uphill). Voltage sags are more of a transmission-line phenomenon.


I get that lots of servers can add up to lots of power, but what is a "lot"? Is 1MW really enough demand to destabilize a regional power grid?


No. All balancing authorities are required to keep a certain amount of "spinning reserve" available for fast adjustments like this. But if I do it, and the next guy does it, and a transmission line is down, and... etc.

A lot of horror stories start that way.


If it's all at once at the end of one leg and unplanned? Yes.

The question is somewhat similar to a thought experiment. If a ship is docked and loading cargo, is it a good idea to use all the cranes to suddenly fill up one outer side of the ship?


I don't know the answer. But it's not too uncommon, in general, to provision for reasonable use cases plus a margin, rather than provision for worst case scenario.


Disk arrays have been staggering drive startup for a long time for this reason. Sinking current into hundreds of little starting motors simultaneously is a bad idea.


Think Thundering Herd problem on the scale you already know from context. Partial service is a kind of backpressure.


One case is automated protection systems in the grid detecting a sudden hop of current and assuming an isolation failure along the path - basically, not enough current to trip the short-circuit breakers, but enough to raise an alarm.


This isn't really a thing. Transmission and distribution protection doesn't operate on any kind of di/dt basis, other than those defined by overcurrent, in which case the line trips. A sudden increase in load will just manifest in ACE (area control error) as a load imbalance and be dealt with by increasing generation from the spinning reserve that the balancing authority is required to have on hand.


So someone ran "clear mpls lsp" instead of "show mpls lsp"?


For context, parent comment is trying to decipher this heavily-PR-reviewed paragraph:

> During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.


s/assess/augment/g


Seems like it. It's kinda like typing hostname and accidentally poking your yubikey (not that I've done that...), or the date command that both lets you set the date and format the date.


  ykpersonalize -u -1 -o -fast-trig
(or some variation thereof)


My guess: he executed the command from shell history. Commands look similar enough, and he hit Enter too quickly.


But why would the clear command be in their history?


ctrl+r mpls [enter]

Would fit muscle memory, but if that wasn't caught by the automated tool they have some work to do.


Incidentally the facebook app itself really handled this gracefully. When the app can't connect to facebook, it displays "updates" from a pool of cached content. It looks and feels like facebook is there, but we know it's not. I didn't notice this until the outage and I thought it was neat.


WhatsApp failed similarly, but I thought it was a poor design decision to do so. Anyone waiting on communication through WhatsApp had no indication (outside the media) that it was unavailable and that they should find another communication channel.

Don't paper over connectivity failures. It disempowers users.


I think these features are designed for the timescales like "between subway stations which have cell service" or even "driving into a tunnel". In those cases, it seems totally appropriate to me.


WhatsApp used to have a "status" page in their app; they removed it years ago though.

That said - yeah, I figured there was an issue when hard refreshing WhatsApp Web (it sometimes has a kink that needs to be refreshed away) didn't work.


I’m curious if the app handled posts or likes gracefully too. Did it accept and cache the updates until it could reconnect to Facebook servers?


I'm not a huge facebook user and I mostly use it to keep in touch with my parents. I did notice that trying to message them resulted in "message not delivered" (which gave me a prompt to retry) but I didn't try to post anything during the outage.


On Instagram I was even able to "like" posts while the outage was in effect. Not sure if the app replayed those when the service came back.


I suspect so. From the app's point of view it's likely not terribly different to losing cell service for a while.


Note that contrary to popular reports, DNS was NOT to blame for this outage — for once, DNS worked exactly as per the spec, design and configuration:

> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.


Not the first cause, but it was involved. Before reading I expected to see some combination of (1) automation (2) DNS (3) BGP. I didn't expect to see all three, plus the special "automatically disconnect the backbone from the internet" behavior, with no other way for senior tech staff to get to the backbone, not even a secure dial-up console.

I think the general lesson here is for each thing you automate, assume that it can act in error and have another manual way to do what the automatic action prevents.


> One of the jobs performed by our smaller facilities is to respond to DNS queries. DNS is the address book of the internet, enabling the simple web names we type into browsers to be translated into specific server IP addresses.

What is the target audience of this post? It is too technical for non-technical people, but it is also dumbed down to try to include people who do not know how the internet works. I feel like I'm missing something.


I'm guessing it has multiple target audiences. Those that won't understand some of the technical jargon (e.g., "IP addresses") will still be able to follow the general flow of the article.

Those of us who are familiar with the domain of knowledge, on the other hand, get a decent summary of events.

It's a balancing act. I think the article does a good enough job of explaining things.


With an outage this big, even a post for a technical audience will get read by non-technical people (including journalists), so I'm sure it helps to include details like this.


The media: "Facebook engineer typed command. This is what happened next."


"10 backbone router commands Mark Zuckerberg doesn't want you to know. Number 7 will shock you!"


Huh? I would hardly describe this as technical. Someone with a high school education can read it and get the gist. It's actually somewhat impressive how it walks the line between accessibility and 'just detailed enough'.


I'm reading your comment as a form of "feigning surprise", in other words a statement along the lines of "I can't believe target audience doesn't know about x concept".

more on the concept: https://noidea.dog/blog/admitting-ignorance


There are plenty of technical people (or people employed in technical roles) who don't understand how DNS works. For example, I field questions on why "hostname X only works when I'm on VPN" at work.


The media. This was a huge international story.


Both those groups of people. I imagine they would otherwise be accused of making it either too complicated or too dumbed down, so they do both in the same article.


Poor targeting choices.


Teenagers who are responsible for managing the family router?


> Those data centers come in different forms.

It's like the birds and the bees.


> What is the target audience of this post?

Separate point to your question.

FB is under no obligation to provide more details than they need to just because a small segment of the population (certainly relative to their 'customers') might find it interesting or helpful or entertaining. FB is a business. They can essentially do (and should be able to do) whatever they want. There is no requirement (and there should be no requirement) to provide the general public with more info than they want to, subject to any legal requirements. If the government wants more (and is entitled to more info), they can ask for it and FB can decide if they are required to comply.

FB is a business. Their customers are not the tech community looking to get educated and avoid issues themselves at their companies or (as mentioned) be entertained. And their customers (advertisers or users) can decide if they want to continue to patronize FB.

I always love on HN seeing 'hey, where is the post mortem?' as if it's some type of de facto requirement to air dirty laundry to others.

If I go to the store and there are no paper towels there, I don't need to know why there are no towels and what the company will do going forward to prevent any errors that caused the lack of that product. I can decide to buy another brand or simply take steps to not have it be an issue.


The aviation industry has this solved: it's mandatory to report certain kinds of incidents to avoid them in the future and inform the aviation community. https://www.skybrary.aero/index.php/Mandatory_Occurrence_Rep...

That the main form of personal communication for hundreds of millions of users is down and there is no mandatory reporting is irresponsible. That Facebook is a business does not mean that they do not have responsibilities towards society.

Facebook is not your local supermarket, it has global impact.


One would imagine a large local supermarket going down would owe the people it serves some explanation. That's where their food comes from.

At this point, I am completely sick of the pro-corporate rhetoric to let businesses do whatever they want. They exist to serve the public and they should be treated as such.


> If I go to the store and there is not paper towels there I don't need to know why there are no towels

You don't _need_ to know, but it's human to want to know, and it's also human to want to satisfy other humans' curiosity, especially if it doesn't bring any harm to you.

Also, your post doesn't really answer any of the GP's questions. I presume you wanted to say that FB doesn't _owe_ us any explanation, but the GP asked, given that FB already provided one, whom it is addressed to.


You can see the security/reliability tradeoff problem here.

You need a control plane. But what does it run over? Your regular data links? A problem if it also controls their configuration. Something outside your own infrastructure, like a modest connection to the local ISP as a backup? That's an attack vector.

One popular solution is to keep both the current and previous generation of the control plane up. Both Google and AT&T seem to have done that. AT&T kept Signalling System 5 up for years after SS7 was doing all the work. Having two totally different technologies with somewhat different paths is helpful.


The post mentioned that the out-of-band network was also down, and I’m curious what that entails and how it was also impacted. They must not have been on external DNS or had static IPs to access recovery. I’m sure they won’t share more than this now, but I’d sure love to hear more about the OOB access.


I'm confused by this. Are the DNS servers inside the backbone, or outside?

  To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation,  making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.


If we make a simplified network map, FB looks more or less like a bunch of PoPs (points of presence) at major peering points around the world, a backbone network that connects those PoPs to the FB datacenters, and the FB operated datacenters themselves. (The datacenters are generally located a bit farther away from population centers, and therefore peering points, so it's sensible to communicate to the outside world through the PoPs only)

The DNS servers run at the PoPs, but only BGP advertise the DNS addresses when the PoP determines it's healthy. If there's no connectivity back to a FB datacenter (or perhaps, no connectivity to the preferred datacenter), the PoP is unhealthy and won't advertise the DNS addresses over BGP.

Since the BGP change that was pushed eliminated the backbone connectivity, none of the PoPs were able to connect to datacenters, and so they all, independently, stopped advertising the DNS addresses.

So that's why DNS went down. Of course, since client access goes through load balancers at the PoPs, and the PoPs couldn't access the datacenters where requests are actually processed, DNS being down wasn't a meaningful impediment to accessing the services. Apparently, it was an issue with management (among other issues).
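
In very rough pseudocode (everything here is hypothetical; the real health checks and BGP control are far more involved), the per-PoP decision described above boils down to something like:

    def dns_prefixes_to_announce(reachable_datacenters, anycast_dns_prefixes):
        # Per-PoP view: advertise the anycast DNS prefixes only while at least
        # one datacenter is reachable over the backbone.
        if reachable_datacenters:
            return list(anycast_dns_prefixes)
        return []   # withdraw everything; sensible for one sick PoP, fatal when every PoP does it

    # During the outage every PoP saw an empty reachable list and withdrew:
    print(dns_prefixes_to_announce([], ["dns-anycast-prefix-1/24", "dns-anycast-prefix-2/24"]))  # -> []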

Disclosure: I worked at WhatsApp until 2019, and saw some of the network diagrams. Network design may have changed a bit in the last 2 years, but probably not too much.


/* just observing the data presented in the Cloudflare article (https://blog.cloudflare.com/october-2021-facebook-outage/) and disagreeing with the conclusion :-)

While 129.134.30.0/23 (the subnet where the a and b nameservers reside) had indeed been withdrawn (according to the FB postmortem, by DNS automation tooling), 129.134.0.0/17 — the shorter prefix (perhaps a summary at the edge) — was still present; however, FB didn't have the longer prefixes (e.g. 129.134.30.0/24 and 129.134.31.0/24, which we normally see anycasted externally) internally. In other words, routing towards the FB DNS subnet (I haven't looked into 185.89.218.0/23, which is where the 2 other authoritative nameservers reside) still worked up to the FB border, but the traffic was dropped (routed to Null) by the FB edge, since it didn't have the more-specific routes internally.

This, combined with TTL of 60 seconds led to almost immediate global DNS failure and all other stuff you have been reading about.


That particular subnet has a covering prefix, but I don't think the other two DNS subnets do, and I had checked on the WhatsApp authoritative subnets, because I have greater affinity for WhatsApp. The WhatsApp subnets don't usually have a covering prefix (and I did check a looking glass during the outage and there were no announcements visible at least at that point).

For those with a covering prefix, the diagnosis is a little bit different as you said, traffic would still flow to whichever FB PoPs advertise the covering prefix, but then it loops in FB, because the PoP doesn't know where to send it, since nowhere was advertising the specific /24. As opposed to the addresses with zero announcements, where the traffic doesn't make it to FB, but gets dropped somewhere else.


Ok so the DNS servers at PoPs, outside of backbone, did not go down.

Does it mean they can respond with public IPs meaningful for the local PoP only, and are not able to respond with IPs pointing to other PoPs or FB's main DCs? So that has to mean different public IPs are handed out at different PoPs, right?


I'm not quite sure I understand the question exactly, but let me give it a try.

So, first off, each pop has a /24, so like the seattle-1 pop which is near me has 157.240.3.X addresses; for me, whatsapp.net currently resolves to 157.240.3.54 in the seattle-1 pop. these addresses are used as unicast meaning they go to one place only, and they're dedicated for seattle-1 (until FB moves them around). But there are also anycast /24s, like 69.171.250.0/24, where 69.171.250.60 is a loadbalancer IP that does the same job as 157.240.3.54, but multiple PoPs advertise 69.171.250.0/24; it's served from seattle-1 for me, but probably something else for you unless you're nearby.

The DNS server IPs are also anycast, so if a PoP is healthy, it will BGP advertise the DNS server IPs (or at least some of them; if I ping {a-d}.ns.whatsapp.net, I see 4 different ping times, so I can tell seattle-1 is only advertising d.ns.whatsapp.net right now, and if I worked a little harder, I could probably figure out the other PoPs).

Ok, so then I think your question is, if my DNS request for whatsapp.net makes it to the seattle1 PoP, will it only respond with a seattle-1 IP? That's one way to do it, but it's not necessarily the best way. Since my DNS requests could make it to any PoP, sending back an answer that points at that PoP may not be the best place to send me.

Ideally, you want to send back an answer that is network local to the requester and also not a PoP that is overloaded. Every fancy DNS server does it a little different, but more or less you're integrating a bunch of information that links resolver IP to network location as well as capacity information and doing the best you can. Sometimes that would be sending users to anycast which should end up network local (but doesn't always), sometimes it's sending them to a specific pop you think is local, sometimes it's sending them to another pop because the usual best pop has some issue (overloaded on CPU, network congestion to the datacenters, network congestion on peering/transit, utility power issue, incoming weather event, fiber cut or upcoming fiber maintenance, etc).

But in short, different DNS requests will get different answers. If you've got a few minutes, run these commands to see the range of answers you could get for the same query:

    host whatsapp.net # using your system resolver settings
    host whatsapp.net a.ns.whatsapp.net # direct to authoritative A
    host whatsapp.net b.ns.whatsapp.net # direct to B
    host whatsapp.net 8.8.8.8 # google public DNS
    host whatsapp.net 1.1.1.1 # cloudflare public DNS
    host whatsapp.net 4.2.2.1 # level 3 not entirely public DNS
    host whatsapp.net 208.67.222.222 # OpenDNS
    host whatsapp.net 9.9.9.9 # Quad9
You should see a bunch of different addresses for the same service. FB hostnames do similar things of course.

Adding on, the BGP announcements for the unicast /24s of the PoPs didn't go down during yesterday's outage. If you had any of the PoP-specific IPs for whatsapp.net, you could still use http://whatsapp.net (or https://whatsapp.net ), because the configuration for that hostname is so simple it's served from the PoPs without going to the datacenters (it just sets some HSTS headers and redirects to www.whatsapp.com, which, perhaps despite appearances, is a page that is served from the datacenters and so would not have worked during the outage).


  Ok, so then I think your question is, if my DNS request for whatsapp.net makes it to the seattle1 PoP, will it only respond with a seattle-1 IP? That's one way to do it, but it's not necessarily the best way. Since my DNS requests could make it to any PoP, sending back an answer that points at that PoP may not be the best place to send me.

  Ideally, you want to send back an answer that is network local to the requester and also not a PoP that is overloaded. Every fancy DNS server does it a little different, but more or less you're integrating a bunch of information that links resolver IP to network location as well as capacity information and doing the best you can. Sometimes that would be sending users to anycast which should end up network local (but doesn't always), sometimes it's sending them to a specific pop you think is local, sometimes it's sending them to another pop because the usual best pop has some issue (overloaded on CPU, network congestion to the datacenters, network congestion on peering/transit, utility power issue, incoming weather event, fiber cut or upcoming fiber maintenance, etc).
Right, I was hoping FB's DNS servers would be smarter than usual: say, when the DNS at Seattle-1 cannot reach the backbone, it would respond with the IP of perhaps the NYC/SF PoP before it starts the BGP withdrawal.

Thanks for the write up and I enjoy it.


> Right I was hoping the DNSs of FB ought to be smarter than usual and let's say when DNS at Seattle-1 cannot reach backbone it'd respond with IP of perhaps NYC/SF before it starts the BGP withdrawal.

The problem there is coordination. The PoPs don't generally communicate amongst themselves (and may not have been able to after the FB backbone was broken, although technically, they could have through transit connectivity, it may not be configured to work that way), so when a PoP loses its connection to the FB datacenters, it also loses its source of what PoPs are available and healthy. I think this is likely a classic distributed systems problem; the desired behavior when an individual node becomes unhealthy is different than when all nodes become unhealthy, but the nature of distributed systems is that a node can't tell if it's the only unhealthy node or all nodes became unhealthy together. Each individual PoP did the right thing by dropping out of the anycast, but because they all did it, it was the wrong thing.


You are to the point and precise. This is exactly the problem.

  Each individual PoP did the right thing by dropping out of the anycast, but because they all did it, it was the wrong thing.
Somehow I feel the design is flawed because it abuses DNS server status a bit. I mean, "DNS server down, so BGP withdrawal for the DNS server" is a perfect combination; however, "connectivity between DNS and backend servers down, DNS up, and BGP withdrawal for the DNS server" is not. DNS did not fail, and DNS should just fall back to some other operational DNS, perhaps a regional/global default one.


I think this is not necessarily a flaw of the design. It's a fundamental weakness of the real world.

Either you can take the backbone being unavailable as a signal that the PoP is broken, and kill the PoP; or you can take the backbone being unavailable as a signal that the backbone is broken and do your best.

When either interpretation is wrong, you'll need humans to come around and intervene. It's much more common that only the PoP is broken, so having that case require intervention would result in more effort.

The flaw here is more that the intervention required to get the backbone back was hard to do because internal tools to bring back the backbone relied on DNS which relied on the backbone being up. As well, there were some reports that physical security relied on the backbone being up. And that restoring the backbone needed physical access.

This isn't the first largescale FB outage where the root cause was a bad configuration was pushed globally quickly. It's really something they need to learn not to do. But, even without that, being able to get key things running again, like the backbone, and DNS, and the configuration system(s), and centralized authentication, needs to be doable without those key systems running. I suspect at least some of that will be improved on, and hopefully regularly practiced.


I'm not going to claim to be a BGP expert, but as I understand it the way BGP propagation tends to work makes it a pretty global thing just in terms of how the router hardware handles stuff, which makes it unusually tricky to avoid.

I don't disagree about the general problem mind, I just have a feeling that fixing "don't push configs globally" for BGP specifically is unusually complicated.


  internal tools to bring back the backbone relied on DNS which relied on the backbone being up
So are you referring to the same DNS servers sitting outside the backbone at the various PoPs? I'd imagine some internal DNS servers which stay inside the backbone are in use here, unless of course the FB engineers themselves were disconnected from those internal DNS servers.


I don't recall how internal DNS was setup (and determining from the outside isn't really possible), but there were comments in the incident report that DNS being unavailable made it harder to recover.



Thanks for the very helpful explanation!


If I understand it correctly, they have DNS servers spread out at different locations. These locations are also BGP peering locations. If the DNS server at a location cannot reach the other datacenters via the backbone, it stops advertising its IP prefixes. The hope is that traffic will then instead get routed to some other Facebook location that is still operational.


The wording is a bit unclear (rushed, no doubt) but I expect this means the DNS servers stopped announcing themselves as possible targets for the anycasted IPs Facebook uses for its authoritative DNS [1], since they learned that the network was deemed unhealthy. If they all do that nobody will answer traffic sent to the authoritative DNS IPs and nothing works.

[1] See "our authoritative name servers that occupy well known IP addresses themselves" mentioned earlier


The anycasted IPs for the DNS servers make sense to me, and the BGP withdrawal too, when the common case is that one or a few PoPs lose connectivity to the backbone/DC; rarely does every one of them fail at the same time.

I was hoping perhaps the DNS servers at the PoPs could be improved by responding with public IPs for other PoPs/DCs, and only starting the BGP withdrawal when that is not available. Or, I presume, the available DNS servers at the PoPs would decrease over time, with the remaining ones getting more and more requests, until finally every one of them is cut off from the internet.


Facebook’s authoritative DNS servers are at the borders between Facebook’s backbone and the rest of the Internet.


We want ramenporn


Context for those who didn't see it yesterday: https://news.ycombinator.com/item?id=28749244


Google had a comparable outage several years ago.

https://status.cloud.google.com/incident/cloud-networking/19...

This event left a lot of scar tissue across all of Technical Infrastructure, and the next few months were not a fun time (e.g. a mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust).

I'd be curious to see what systemic changes happen at FB as a result, if any.


To expand on why this made me think of the Google outage:

It was a global backbone isolation, caused by configuration changes (as they all are...). It was detected fairly early on, but recovery was difficult because internal tools / debugging workflows were also impacted, and even after the problem was identified, it still took time to back out the change.

"But wait, a global backbone isolation? Google wasn't totally down," you might say. That's because Google has two (primary) backbones (B2 and B4), and only B4 was isolated, so traffic spilled over onto B2 (which has much less capacity), causing heavy congestion.


Google also had a runaway automation outage where a process went around the world "selling" all the frontend machines back to the global resource pool. Nobody was alerted until something like 95% of global frontends had disappeared.

This was an important lesson for SREs inside and outside Google because it shows the dangers of the antipattern of command line flags that narrow the scope of an operation instead of expanding it. I.e. if your command was supposed to be `drain -cell xx` to locally turn down a small resource pool, but `drain` without any arguments drains the whole universe, you have developed a tool which is too dangerous to exist.
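
As a hypothetical sketch (not what Google's tool actually looked like), the safer shape is to make the scope mandatory and the global case deliberately hard to reach:

    import argparse, sys

    parser = argparse.ArgumentParser(prog="drain")
    scope = parser.add_mutually_exclusive_group(required=True)   # no scope, no action
    scope.add_argument("--cell", help="drain a single named cell")
    scope.add_argument("--all-cells", action="store_true",
                       help="drain every cell (deliberately explicit to type)")
    parser.add_argument("--execute", action="store_true",
                        help="actually perform the drain; default is a dry run")
    args = parser.parse_args()

    target = args.cell if args.cell else "ALL CELLS"
    if not args.execute:
        print(f"dry run: would drain {target}")
        sys.exit(0)
    print(f"draining {target} ...")

With this shape, a bare `drain` fails with a usage error instead of draining everything.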


Agreed, but with an amendment:

If your tool is capable of draining the whole universe, period, it is too dangerous to exist.

That was one of the big takeaways: global config changes must happen slowly. (Whether we've fully internalized that lesson is a different matter.)


As FB opines at the end, at some point it's a trade-off between power (being able to access / do everything quickly) and safety (having speed bumps that slow larger operations down).

The pure takeaway is probably that it's important to design systems where "large" operations are rarely required, and frequent ops actions are all "small."

Because otherwise, you're asking for an impossible process (quick and protected).


SREs live in a dangerous world, unfortunately. It's entirely possible the "tool" in question is a shell script that gets fed a list of bad cells but some bug causes it to get a list of all the cells instead.

Some tools are well engineered, capable of the Sisyphean task of globally deploying updates but others are rapid prototypes that, sure, are too dangerous to exist, but the whole point of SREs being capable programmers is that the work has problems that are most efficiently solved with one-off code that just isn't (because it can't be) rigorously tested before being used. You can bet there was some of that used in recovering from this incident. (I'm sure there were many eyes reviewing the code before being run, but that only goes so far when you're trying to do something that you never expected, like having to revive Facebook.)


The other problem is scale: the standard "save me" for tools like this is a --doit and a --no-really-i-mean-it flag, plus defaulting to a "this is what I would've done" mode. That falls apart the moment the list of actions is longer than the screen, but you're expecting scrolling anyway: after all, how can you really tell the difference unless the console scrolls for a really long time?

There are solutions to that, but of course these sorts of tools all come into existence well before the system reaches a size where how they work becomes dangerous.
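
One hypothetical way around the scrolling problem: summarize the plan instead of listing it, and demand an extra confirmation once the blast radius crosses a threshold (the function and threshold here are made up for illustration):

    def confirm_plan(planned_cells, max_unattended=10):
        preview = ", ".join(sorted(planned_cells)[:5])
        suffix = ", ..." if len(planned_cells) > 5 else ""
        print(f"plan: drain {len(planned_cells)} cells ({preview}{suffix})")
        if len(planned_cells) <= max_unattended:
            return True
        # Force the operator to acknowledge the scale, not just hit Enter.
        answer = input(f"this touches {len(planned_cells)} cells; type that number to proceed: ")
        return answer.strip() == str(len(planned_cells))

    # confirm_plan({"cell-xx"}) returns True silently; confirm_plan(every_cell) makes you type the count.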


> If your tool is capable of draining the whole universe

Why did I think of humans, when I read this. :P


I feel like this explains so much about why the gcloud command works the way it does. Sometimes feels overly complicated for minor things, but given this logic, I get it.


But the FB outage was not a configuration change.

> a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network


From yesterday's post:

"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication.

...

Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear that there was no malicious activity behind this outage — its root cause was a faulty configuration change on our end."

Ultimately, that faulty command changed router configuration globally.

The Google outage was triggered by a configuration change due to an automation system gone rogue. But hey, it too was triggered by a human issuing a command at some point.


I'm inclined to believe the later post as they've had more time to assess the details. I think the point of the earlier post is really to say "we weren't hacked!" but they didn't want to use exactly that language.


This is kind of like Chernobyl where they were testing to see how hot they could run the reactor to see how much power it could generate. Then things went sideways.


The Chernobyl test was not a test to drive the reactor to the limits, but actually a test to verify that the inertia of the main turbines is big enough to drive the coolant pumps for X amount of time in the case of grid failure.


Of possible interest:

https://www.youtube.com/watch?v=Ijst4g5KFN0

This is a presentation to students by an MIT professor that goes over exactly what happened, the sequence of events, mistakes made, and so on.


Warning for others: I watched the above video and then watched the entire course (>30 hours).


Now I know what I'm doing the rest of this week...


As already said the test was about something entirely different. And the dangerous part was not the test itself, but the way they delayed the test and then continued to perform it despite the reactor being in a problematic state and the night shift being on duty, who were not trained on this test. The main problem was that they ran the reactor at reduced power long enough to have significant xenon poisoning, and then put the reactor at the brink when they tried to actually run the test under these unsafe conditions.


I'd say the failure at Chernobyl was that anyone who asked questions got sent to a labor camp and the people making the decisions really had no clue about the work being done. Everything else just stems from that. The safest reactor in the world would blow up under the same leadership.


At first I thought it was inappropriate hyperbole to compare Facebook to Chernobyl, but then I realized that I think Facebook (along with Twitter and other "web 2.0" graduates) has spread toxic waste across a far larger area than Chernobyl. But I would still say that it's not the _outage_ which is comparable to Chernobyl, but the steady-state operations.


>internal tools / debugging workflows were also impacted

That's something that should never happen.


> a mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust

Is that normal at Google? Making people feel bad for an outage doesn't seem consistent with the "blameless postmortem" culture promoted in the SRE book[1].

[1] https://sre.google/sre-book/postmortem-culture/


"Blameless Postmortem" does not mean "No Consequences", even if people often want to interpret it that way. If an organization determines that a disconnect between ground work and a customer's experience is a contributing factor to poor decision making then they might conclude that making engineers more emotionally invested in their customers could be a viable path forward.


Relentless customer service is never going to screw you over in my experience... It pains me that we have to constantly play these games of abstraction between engineer and customer. You are presumably working a job which involves some business and some customer. It is not a fucking daycare. If any of my customers are pissed about their experience, I want to be on the phone with them as soon as humanly possible and I want to hear it myself. Yes, it is a dreadful experience to get bitched at, but it also sharpens your focus like you wouldn't believe when you can't just throw a problem to the guy behind you.

By all means, put the support/enhancement requests through a separate channel+buffer so everyone can actually get work done during the day. But, at no point should an engineer ever be allowed to feel like they don't have to answer to some customer. If you are terrified a junior dev is going to say a naughty phrase to a VIP, then invent an internal customer for them to answer to, and diligently proxy the end customer's sentiment for the engineer's benefit.


I think of this is terms of empathy: every engineer should be able to provide a quick and accurate answer to "What do our customers want? And how do they use our product?"

I'm not talking esoterica, but at least a first approximation.


Why? Like, we're all customers as well as employees.


Because we as engineers create software for our customers, and if you don't understand who your customers are how can you create software that actually suits their needs?

Very rarely are we our own customers


I would argue that SREs are consistently our own customers in a way unique to SRE.


Ironic, as measurability came up in another comment thread I'm in.

I'd say from a technical perspective SREs are, but there's a potential (depends on product) gap between their technical goals and user goals.

e.g. What does "p95 latency is spiking" actually mean to the end user?


From the SRE book: "For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the 'wrong' thing prevails, people will not bring issues to light for fear of punishment."

If it's really the case that engineers are lacking information about the impact that outages have on customers (which seems rather unlikely), then leadership needs to find a way to provide them with that information without reading customer emails about how the engineers "let them down", which is blameful.

Furthermore, making engineers "emotionally invested" doesn't provide concrete guidance on how to make better decisions in the future. A blameless portmortem does, but you're less likely to get good postmortems if engineers fear shaming and punishment, which reading those customer emails is a minor form of.


I work at Google and have written more than a few blameless postmortems. You don't need to quote things to me.

Is what was described above "finger pointing or shaming"? I don't work in TI so I didn't experience this meeting but it doesn't seem like it is. It also doesn't sound to me like this was the only outcome, where the execs just wagged their fingers at engineers and called it a day. Of course there'd be all sorts of process improvements derived from an understanding of the various system causes that led to an outage.


Yes, if I were made to attend a mandatory training in which my leaders read customer emails saying that the outage caused them to lose trust in the company, I would feel ashamed. That was surely the goal of that exercise. The fact that there were also process improvements doesn't make it any less wrong.

Thankfully, other comments in this thread suggest that this is not how Google normally does things.


That's fucked up.


Not the original googler responding, but I have never experienced what they describe.

Postmortems are always blameless in the sense that "Somebody fat fingered it" is not an acceptable explanation for the causes of an incident - the possibility to fat finger it in the first place must be identified and eliminated.

Opinions are my own, as always


> Not the original googler responding, but I have never experienced what they describe.

I have also never experienced this outside of this single instance. It was bizarre, but tried to reinforce the point that something needed to change -- it was the latest in a string of major customer-facing outages across various parts of TI, potentially pointing to cultural issues with how we build things.

(And that's not wrong, there are plenty of internal memes about the focus on building new systems and rewarding complexity, while not emphasizing maintainability.)

Usually mandatory trainings are things like "how to avoid being sued" or "how to avoid leaking confidential information". Not "you need to follow these rules or else all of Cloud burns down; look, we're already hemorrhaging customer goodwill."

As I said, there was significant scar tissue associated with this event, probably caused in large part by the initial reaction by leadership.


I assume it was training for all SREs, like "this is why we're all doing so much to prevent it from reoccurring"


Facebook also had a nearly 24-hour outage in 2019. https://www.nytimes.com/2019/03/14/technology/facebook-whats... (or http://archive.today/O7ycB )


> leadership read out emails from customers telling us how we let them down and lost their trust).

That's amazing. I would never have expected my feedback to a company to actually be read, let alone taken seriously. Hopefully more companies do this than I thought.


From my experience this is done more to make leadership feel better and deflect blame from their leadership.


I just read your comment out loud if it helps.


The most remarkable thing about this is learning that anyone at Google read an email from a customer. Given the automated responses to complaints of account shutdowns, or complaints about app store rejections, etc, this is pretty surprising.


I'd love to get a read receipt each time someone at Google has actually read my feedback. Then it might be possible to determine whether I'm just shaking my fists at the heavens or not.


> mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust

The same leadership that demanded tighter and tighter deadlines and discouraged thinking things through?


> I'd be curious to see what systemic changes happen at FB as a result, if any.

If history is any guide, Facebook will decide some division charged with preventing problems was an ineffective waste of money, shut it down, and fire a bunch of people.


> This event left a lot of scar tissue across all of Technical Infrastructure, and the next few months were not a fun time (e.g. a mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust).

Bullshit.

I'd believe this if it was not completely impossible for 99.999999% of google "customers" to contact anyone at the company. Or for the decade and a half of personal and professional observations of people getting fucked over by google and having absolutely nobody they could contact to try and resolve the situation.

You googlers can't even deign to talk to other workers at the company who are in a caste lower than you.

The fundamental problem googlers have is that they all think they're so smart/good at what they do, it just doesn't seem to occur to them that they could have possibly screwed something up, or something could go wrong or break, or someone might need help in a way your help page authors didn't anticipate...and people might need to get ahold of an actual human to say "shit's broke, yo." Or worse, none of you give a shit. The company certainly doesn't. When you've got a near monopoly and have your fingers in every single aspect of the internet, you don't need to care about fucking your customers over.

I cannot legitimately name a single google product that I, or anyone I know, likes or wants to use. We just don't have a choice because of your market dominance.


Hi there. I'm a Googler and I've directly interfaced with a nontrivial number of customers such that I alone have interfaced with more than 0.000001% of the entire world population.


All you need to do is browse any online forum, bug tracker, subreddit dedicated to a consumer-facing Google product to know that Google does not give a rat's ass about customer service. We know the customer is ultimately not the consumer.


Maps, Mail, Drive, Scholar, and Search are all the best or near the best available. That doesn’t mean I like every one of them or I wouldn’t prefer others, but as far as I can tell the competition doesn’t exist that works better.

GCP and Pixel phones are a toss-up between them and competitors.

It isn’t market dominance, nobody has made anything better.


Search is famously kind of bad the last few years, but even Maps isn’t that great.

(Data errors I’ve seen this week: the aerial imagery over Brisbane Australia is from ~2010 but labeled 2021, the coastline near Barentsburg in Svalbard is wrong and doesn’t match any other map.)


I’m not saying any of it is great, just that there aren’t better replacements that make me want to switch.


> You googlers can't even deign to talk to other workers at the company who are in a caste lower than you.

We must know different googlers then. It's good to avoid painting a group with the same brush



For a lot of people in countries outside the US, Facebook _is_ the internet. Facebook has cut deals with various ISPs outside the US to allow people to use their services without it costing any data. Facebook going down is a mild annoyance for us but a huge detriment to, say, Latin America.


    the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.

    this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them.
Sounds like it was the perfect storm.


> The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.

This makes it sound like Facebook has physically laid "tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers". Is this in fact true?



Likely a mixture of bought, leased, and self laid fiber. This is not at all uncommon and basically necessary if you have your own data center.


"Those translation queries are answered by our authoritative name servers that occupy well known IP addresses themselves, which in turn are advertised to the rest of the internet via another protocol called the border gateway protocol (BGP)."

"To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection."

Correct me if I am wrong, but here "DNS servers" means the computers, not the software running on them, i.e., each computer is running both DNS software and a BGP daemon. I am not aware of DNS server software that disables BGP advertisements but a BGP daemon could do it.

For example, a BGP daemon like ExaBGP can execute a DNS query, check the output and disable advertisements if the query fails.

https://github.com/Exa-Networks/exabgp
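
For illustration, here's a minimal sketch of that pattern, assuming ExaBGP's external-process API (the checker writes announce/withdraw commands to stdout and ExaBGP relays them to its peers); the anycast prefix, probe name, and resolver address are made up:

    #!/usr/bin/env python3
    # Hypothetical ExaBGP health-check process: probe the local DNS daemon
    # and announce or withdraw an anycast prefix accordingly.
    import subprocess
    import sys
    import time

    ANYCAST_PREFIX = "192.0.2.0/24"   # hypothetical prefix this POP advertises
    PROBE = ["dig", "@127.0.0.1", "health-check.example.com", "A",
             "+time=2", "+tries=1", "+short"]

    def dns_healthy() -> bool:
        """Return True if the local DNS server answers the probe query."""
        try:
            result = subprocess.run(PROBE, capture_output=True, timeout=5)
            return result.returncode == 0 and result.stdout.strip() != b""
        except (subprocess.TimeoutExpired, OSError):
            return False

    announced = False
    while True:
        healthy = dns_healthy()
        if healthy and not announced:
            # ExaBGP reads these commands from our stdout and updates its peers.
            sys.stdout.write(f"announce route {ANYCAST_PREFIX} next-hop self\n")
            announced = True
        elif not healthy and announced:
            sys.stdout.write(f"withdraw route {ANYCAST_PREFIX} next-hop self\n")
            announced = False
        sys.stdout.flush()
        time.sleep(5)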


Perhaps the right word is “system” - the combination of software (DNS protocol server software, health checker, BGP agent, and so forth) involved in making the DNS service available (or not, in this case). These could be running on the same computer, or separate ones if you are particularly imaginative.

Unfortunately, “DNS system” means something very different than if you said “load balancing system”, so “server” is simpler.

(Usual disclaimer: work at FB, even on this exact stuff, but not representing it.)


DNS seems to be a massive point of failure everywhere, even taking out the tools to help deal with outages themselves. The same thing happened to Azure multiple times in the past, causing complete service outages. Surely there must be some way to better mitigate DNS misconfiguration by now, given the exceptional importance of DNS?


> DNS seems to be a massive point of failure everywhere

Emphasis on the "seems". DNS gets blamed a lot because it's the very first step in the process of connecting. When everything is down, you will see DNS errors.

And since you can't get past the DNS step, you never see the other errors that you would get if you could try later steps. If you knew the web server's IP address to try to make a TCP connection to it, you'd get connection timed out errors. But you don't see those errors because you didn't get to the point where you got an IP address to connect to.

It's like if you go to a friend's house but their electricity is out. You ring the doorbell and nothing happens. Your first thought is that the doorbell is messed up. And you're not wrong: it is, but so is everything else. If you could ring it and get their attention to let you inside their house, you'd see that their lights don't turn on, their TV doesn't turn on, their refrigerator isn't running, etc. But those things are hidden to you because you're stuck on the front porch.


But DNS didn't actually fail. Their design says DNS must go offline if the rest of the network is offline. That's exactly what DNS did.

Sounds like their design was wrong, but you can't just blame DNS. DNS worked 100% here as per the task that it was given.

> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.


I'm not sure the design was even wrong, since the DNS servers being down didn't meaningfully contribute to the outage. The entire Facebook backbone was gone, so even if the DNS servers continued giving out cached responses clients wouldn't be able to connect anyway.


DNS being down instead of returning an unreachable destination did increase load on other DNS resolvers, though, since empty results cannot be cached and clients continued to retry. This made the outage affect others.


Source?

DNS errors are actually still cached; the claim that they aren't was debunked by DJB a couple of decades ago, give or take:

http://cr.yp.to/djbdns/third-party.html

> RFC 2182 claims that DNS failures are not cached; that claim is false.

Here are some more recent details and the fuller explanation:

https://serverfault.com/a/824873

Note that FB.com currently expires its records in 300 seconds, which is 5 minutes.

PowerDNS (used by ordns.he.net) caches servfail for 60s by default — packetcache-servfail-ttl — which isn't very far from the 5min that you get when things aren't failing.

Personally, I do agree with DJB — I think it's a better user experience to get a DNS resolution error right away, than having to wait many minutes for the TCP timeout to occur when the host is down anyways.
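
If you want to verify the TTL yourself, here's a quick sketch using dnspython (assuming you have the library installed; the 300s figure above is just what fb.com happened to serve at the time):

    # pip install dnspython
    import dns.resolver

    # Ask the system's configured resolver for fb.com's A records and
    # print the TTL the answer was served with.
    answer = dns.resolver.resolve("fb.com", "A")
    print("TTL:", answer.rrset.ttl, "seconds")
    for rr in answer:
        print("A record:", rr.address)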


Exactly. And it would actually be worse, because the clients would have to wait for a timeout instead of getting a name error right away.


How would it have been worse? Waiting for a timeout is a good thing, as it prevents a thundering herd of refresh-smashing (both automated and manual).

I don't know BGP well, but it seems easier for peers to just drop FB's packets on the floor than deal with a DNS stampede.


An average webpage today is several megabytes in size.

How would a few bytes over a couple of UDP packets for DNS have any meaningful impact on anyone's network? If anything, things fail faster, so, there's less data to transmit.

For example, I often use ordns.he.net as an open recursive resolver. They use PowerDNS as their software. PowerDNS has the default of packetcache-servfail-ttl of 60s. OTOH, fb.com A response currently has a TTL of 300s — 5 minutes. So, basically, FB's DNS is cached for roughly the same time whether or not they're actually online.


The rest of the internet sucked yesterday, and my understanding was it was due to a thundering herd of recursive DNS requests. Slowing down clients seems like a good thing.


You cannot blame other operators if your own operator has broken software.

If your network cannot accommodate another network's DNS servers being unreachable, the problem is your network, not the fact that the other network is unreachable.

A network being unreachable is a normal thing. It has been widely advocated by DJB (http://cr.yp.to/djbdns/third-party.html) and others, since decades ago, that it's pointless and counterproductive for single-site operators to have redundant DNS, so, it's time to fix your software if decades later somehow it still makes the assumption that all DNS is redundant and always available.

I didn't notice any slowdowns on Monday, BTW. I don't quite understand why a well written DNS recursive cache software would even have any, when it's literally just a couple of domains and a few FQDNs that were at stake for this outage. How will such software handle a real outage of a whole backbone with thousands of disjoint nameservers, all with different names and IP addresses?


DNS was very much a proximate cause. In most cases you want your anycast DNS servers to shoot themselves in the head if they detect that their connection to origin is interrupted. This would have been a big outage anyway, just at a different layer.

Oddly enough, one could consider that behavior something that was put in place to "mitigate DNS misconfiguration"


Seems like the simplest solution would be to just move recovery tooling to their own domain / DNS?


Apparently they had to bring in the angle grinder to get access to the server room.

https://twitter.com/cullend/status/1445156376934862848?t=P5u...


Was this ever confirmed? NYT tech reporter Mike Isaac issued a correction to his previous reporting about it.

> the team dispatched to the Facebook site had issues getting in because of physical security but did not need to use a saw/ grinder.

https://twitter.com/MikeIsaac/status/1445196576956162050


> so.....they had to use a jackhammer, got it


This. We know they lie, so we have to assume the worst when they say something. The correction only rules out a saw/angle grinder; "no tools were needed" would be clearer. However, I am not sure why it matters whether they used a key or a grinder.



that didn't happen (NY Times corrected their story)


> We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this.

If you correctly design your security with appropriate fallbacks, you don't need to make this tradeoff.

If that story of the Facebook campus having no physical key holes on doors is true it just speaks to an arrogance of assuming things can never fail so we don't even need to bother planning for it.


Can you elaborate on this? There are always going to be security/reliability tradeoffs. Things that fail closed for security reasons will cause slower incident responses. That's unavoidable. Innovation can improve the frontier, but there will always be tradeoffs.


Slower, sure, but not five hours slow.


The moment you need to start moving people around, you are into "hours" territory of recovery.

You don't want the data centre staff to be able to change configurations (security), so once something requires hands-on changing, you are definitely into the "move people around" stage of recovery and it WILL be slow.


So it wasn't a config change, it was a command-of-death.


> Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.

Interesting that they think fluctuations of tens of megawatts would risk electrical systems. If the equipment was handling that much continuous load, wouldn't it also easily handle the resumption of the same load? Also I totally did not understand how power usage would affect caches.


Continuous load is very different from a sudden 0 to 100 increase.


> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.

No, it's (clearly) not a guaranteed indication of that. Logic fail. Infrastructure tools at that scale need to handle all possible causes of test failures. "Is the internet down or only the few sites I'm testing?" is a classic network monitoring script issue.


I think you're misunderstanding. The DNS servers (at Facebook peering points) had zero access to Facebook datacenters because the backbone was down. That is as unhealthy as the network connection can get, so they (correctly) stopped advertising the routes to the outside world.

By that point, the Facebook backbone was already gone. The DNS servers stopping BGP advertisements to the outside world did not cause that.


You're talking about backend network connections to facebook's datacenters as if that's the only thing that matters. I'm talking about overall network connection including the internet-facing part.

Facebook's infrastructure at their peering points loses all contact with their respective facebook datacenter(s).

Their response is to automatically withdraw routes to themselves. I suppose they assumed that all datacenters would never go down at the same time, so that client dns redundancy would lead to clients using other dns servers that could still contact facebook datacenters. It's unclear how those routes could be restored without on-site intervention. If they automatically detect when the datacenters are reachable again, that too requires on-site intervention since after withdrawing routes FB's ops tools can't do anything to the relevant peering points or datacenters.

But even without the catastrophic case of all datacenter connections going down, you don't need to be a facebook ops engineer to realize that there are problems that need to be carefully thought through when ops tools depends on the same (public) network routes and DNS entries that the DNS servers are capable of autonomously withdrawing.


Why were all BGP routes advertised from the same set of servers in the same DC, which all FB-owned domains pointed to?


I don't think they were. But, without knowing more specifically what happened, it is hard to speculate.

Some of the speculation I have seen was that this was an attempt to cut connections to one internet exchange point (IXP), possibly resulting in cutting connections to all IXPs. But, as I said, this is speculation. It'd take a more thorough understanding of how the automation looked prior to the incident, the change that was trying to be made, and the like, to say something sensible.


What kind of bgp command would do that?


When facebook is directly peering to so many other ASes, why would they not have static routes in place for those direct links? Why run BGP for that? It's not like there is going to be a better route than the direct link. If the link goes down, then you can rely on BGP to reroute.


tldr; a maintenance query was issued that inexplicably severed FB's data centers from the internet, which unnecessarily caused their DNS servers to mark themselves defunct, which made it all but impossible for their guys to repair the problem from HQ, which compelled them to physically dispatch field units whose progress was stymied by recent increased physical security measures.


> caused their DNS servers to mark themselves defunct

This is awkward for me too, why should a DNS server withdraw BGP routes? Design fail.


It's a trade-off.

Imagine you have some DNS servers at a POP. They're connected to a peering router there which is connected to a bunch of ISPs. The POP is connected via a couple independent fiber links to the rest of your network. What happens if both of those links fail?

Ideally the rest of your service can detect that this POP is disconnected, and adjust DNS configuration to point users toward POPs which are not disconnected. But you still have that DNS server which can't see that config change (since it's disconnected from the rest of your network) but is still reachable from a bunch of local ISPs. That DNS server will continue to direct traffic to the POP which can't handle it.

What if that DNS server were to mark itself unavailable? In that case, DNS traffic from ISPs near that POP would instead find another DNS server from a different POP, and get a response which pointed toward some working POP instead. How would the DNS server mark itself unavailable? One way is to see if it stopped being able to communicate with the source of truth.

Yesterday all of the DNS servers stopped being able to communicate with the source of truth, so marked themselves offline. This code assumes a network partition, so can't really rely on consensus to decide what to do.
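
A minimal sketch of that "mark myself unavailable" logic, with a hypothetical grace period so a single failed probe doesn't immediately pull the POP out of rotation (the probe and BGP hooks are placeholders, not anyone's real implementation):

    import time

    GRACE_SECONDS = 120     # assumed tolerance for losing the control plane
    PROBE_INTERVAL = 10

    def can_reach_source_of_truth() -> bool:
        """Placeholder: probe the internal control plane over the backbone."""
        raise NotImplementedError

    def set_advertised(up: bool) -> None:
        """Placeholder: tell the local BGP daemon to announce or withdraw."""
        raise NotImplementedError

    last_ok = time.monotonic()
    advertised = True
    while True:
        if can_reach_source_of_truth():
            last_ok = time.monotonic()
            if not advertised:
                set_advertised(True)
                advertised = True
        elif advertised and time.monotonic() - last_ok > GRACE_SECONDS:
            # Partitioned from the rest of the network for too long:
            # stop advertising so clients fail over to healthier POPs.
            set_advertised(False)
            advertised = False
        time.sleep(PROBE_INTERVAL)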


Most of the large DNS services are anycasted via BGP (all POPs announce the same IP prefix). It makes sense to stop the BGP routing if the POP is unhealthy. Traffic will flow to the next healthy POP.

In this case, if the DNS service in the POP is unhealthy, the IP addresses belonging to the DNS service are removed from the POP.


Note those are anycast addresses; my guess is the DNS server gives out addresses for FB names that point your traffic to the POP the DNS server is part of.

If the POP is not able to connect to the rest of Facebook's network, the POP stops announcing itself as available and that DNS and part of the network goes away so your traffic can go somewhere else.


Designed to handle individual POPs/DCs going down


Will somebody lose a job over this?


They shouldn't. If FB has a proper PMA culture in place, you figure out why the processes in place didn't work and allowed this kind of change to happen; then you add more testing, etc. It should be a blameless exercise.


Not if FB has a halfway decent engineering culture. People make mistakes. They're practically fundamental to being a person. You can minimize mistakes, but any system that requires perfect human performance will fail.


Wow


What would they have done if the whole data center was destroyed?


continue working from all other data centers, possibly without users really noticing.


I want to know what happened to the poor engineer who issued the command.


He will be promoted to street dweller while his managers will fail up.


Actually he was promoted to C-Level, CII, Chief Imperial Intern ‘For Life’.

It’s a great accomplishment, to be fair; it comes with a lifetime weekly stipend and access to whatever Frontend books/courses you need to be a great web developer.

Will never touch ops again.


The blog post is putting the blame on a bug in the tooling which should have made the command impossible to issue, which is exactly where the blame ought to go.


Still I'd hate to be the first 'Why' of a multi-billion dollar outage :D


> Our primary and out-of-band network access was down

Don't create circular dependencies.


With something as fundamental as the network, no way around it.

- Okay, we'll set up a separate maintenance network in case we can't get to the regular network.

- Wait, but we need a maintenance network for the maintenance network...


Two is One, One is None. There are absolutely ways around this; it's called redundancy. The marginal cost of laying an extra pair during physical plant installation is basically $0, which is why you'd never go "well, we need a backup for the backup, so there's no point in having two pairs." Similarly, the marginal cost for having a second UPS and PDU in a rack is effectively $0 at scale, so nobody would argue this is unnecessary to deal with possible UPS failure or accidentally unplugging a cable.

In this case, there are likely several things that can be changed systemically to mitigate or prevent similar failures in the future, and I have every faith that Facebook's SRE team is capable of identifying and implementing those changes. There is no such thing as "no way around it" unless you're dealing with a law of physics.


By "no way around it" I mean you're going to need to create a circular dependency at some point, whether it's a maintenance network that's used to manage itself, or the prod network for managing the maintenance network.

I absolutely agree that installing a maintenance network is a good idea. One of the big challenges, though, is making sure that all your tooling can and will run exclusively on the maintenance network if needed.

(Also, while the marginal cost of laying an extra pair of fiber during physical installation may be low, making sure that you have fully independent failure domains is much higher, whether that's leased fiber, power, etc.)


"Okay, we'll pull in a DSL line from a completely separate ISP for the out-of-band access." (guess what else is in that manhole/conduit?)

"Okay, we'll use LTE for out-of-band!" (oops, the backhaul for the cell tower goes under the same bridge as the real network)

True diversity is HARD (not unsolvable, just hard. especially at scale)!


heh i toured a large data center here in dallas and listened to them brag about all the redundant connectivity they had while standing next to the conduit where they all entered the building. One person, a pair of wire cutters, and 5 seconds and that whole datacenter is dark.


Although the difference here is that losing the connection and out-of-band for a single data center shouldn't be as catastrophic for Facebook, so your examples would be tolerable?


That's the trick, though: if you don't do that level of planning for all of your datacenters and POPs (and fiber huts out in the middle of nowhere), it's inevitable that the one you most need to access during an outage will be the one where your OOB got backhoe'd.

Murphy is a jerk.


How do you avoid circular dependencies on an out-of-band-network? Seems like the choice is between a circular dependency, or turtles all the way down.


How do you go from "have a separate access method that doesn't depend on your main system" to "turtles all the way down"? The secondary access is allowed to have dependencies, just not on your network.


And if the secondary access fails, then what? Backup systems are not reliable 100% of the time.


Then you were 1-FT, which is still worlds better than 0-FT.

"Don't put two engines on the plane because both of them might fail" is not how fault tolerance works.


I worked with a guy who was an amateur pilot, and he had an opinion about dual engine planes. He said the purpose of the second engine was to get you to the scene of the crash.


Then you're SOL. What's your point? The backup might fail, so don't have a backup? I don't understand what you're trying to say.


You should read the post I was responding to and consider the context instead of taking mine in a vacuum. You completely missed the point.


During the outage, FB briefly made the world a better place


"The Devil's Backbone"


It is completely logical, but still kind of amazing, that Facebook plugged their globally distributed datacenters together with physical wire.


What do you imagine other companies use to connect their datacenters?


Uh... the cloud?


isn't that still over wires?


Other people's wires :-)


I am curious about something.

It has been quite a while since i had any job that required me to think about DCs.

Back in the day we would have a setup of regular modems.

If all hell broke loose, then we could dial up a modem and have access to the system. It was not fast, and it was a pain, but we could get access that way.

(I am skipping a lot of steps. There was heavy security involved in getting access)

I guess landlines might not be an option anymore??

