A Tale of DNS and BGP: The Facebook Outage, October 2021

fauria · on Oct 4, 2021

> Facebook in this case, operates a set of intermediary DNS servers that are responsible for everything between your ISP's recursers and the roots. These are responsible for facebook.com, instagram.com, whatsapp.com and everything else they operate.

This is not the case for instagram.com, which is hosted on a different provider (AWS Route53) and was resolvable during the whole outage.

I'm not sure why Instagram's frontend servers returned 503, though. Maybe their backend fleet was included in the withdrawn prefixes, or maybe it was referenced through the affected domains.

1vuio0pswjnm7 · on Oct 5, 2021

"I'm not sure why Instagram's frontend servers returned 503, though."

One explanation is that Facebook uses a proxy configuration that requires DNS to resolve the internal IP addresses of the backend servers. High-availability proxy servers like haproxy can easily do lookups from files loaded into memory instead of making DNS requests. Apparently Facebook had no backup plan if the DNS method started failing. Facebook remained down until their DNS servers became available. The proxies continued to work and no doubt the backend servers were available the entire time, but the proxies could not connect to them because the DNS lookups for their internal IP addresses (serv)failed. After the retried DNS queries finally time out, a 503 is returned.
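To make the idea concrete, here is a minimal Python sketch of a proxy-side lookup that falls back to an in-memory map when DNS fails. This is only an illustration of the general technique, not Facebook's or haproxy's actual behavior: the hostname, the IP address, and the names `STATIC_BACKENDS`, `resolve_backend`, `handle_request` and `failing_dns` are all made up for the example.

```python
import socket

# Hypothetical static map of backend names to internal IPs,
# loaded into memory when the proxy starts.
STATIC_BACKENDS = {"backend.internal.example": "10.0.0.12"}


def resolve_backend(name, dns_lookup=socket.gethostbyname, static_map=STATIC_BACKENDS):
    """Return an IP for `name`: try DNS first, then the in-memory map."""
    try:
        return dns_lookup(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return static_map.get(name)


def handle_request(name, dns_lookup=socket.gethostbyname, static_map=STATIC_BACKENDS):
    """Return an HTTP-style status: 200 if a backend address was found, else 503."""
    ip = resolve_backend(name, dns_lookup, static_map)
    return 200 if ip is not None else 503


def failing_dns(name):
    """Stand-in for a resolver during the outage: every lookup fails."""
    raise socket.gaierror("simulated DNS failure")
```

With `failing_dns` plugged in, `handle_request` still returns 200 as long as the name is in the static map, and returns 503 when the map is empty, which is the situation described above.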

"Maybe their backend fleet was included in the withdrawn prefixes..."

According to Cloudflare's writeup the only prefixes withdrawn were for DNS servers.
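As a rough illustration of what "only the DNS prefixes were withdrawn" means for reachability, here is a small Python sketch of longest-prefix matching over a set of announced prefixes. The prefixes and addresses are documentation ranges picked for the example, not Facebook's real ones, and `route_for` is a hypothetical helper, not part of any BGP implementation.

```python
import ipaddress


def route_for(ip, announced):
    """Longest-prefix match: most specific announced prefix containing `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [prefix for prefix in announced if addr in prefix]
    return max(matches, key=lambda p: p.prefixlen) if matches else None


# Hypothetical prefixes (documentation ranges), one for DNS and one for everything else.
DNS_PREFIX = ipaddress.ip_network("192.0.2.0/24")
WEB_PREFIX = ipaddress.ip_network("198.51.100.0/24")
DNS_SERVER = "192.0.2.53"
WEB_SERVER = "198.51.100.10"

before = {DNS_PREFIX, WEB_PREFIX}
after = before - {DNS_PREFIX}  # the DNS prefix is withdrawn

print(route_for(DNS_SERVER, before))  # a route exists
print(route_for(DNS_SERVER, after))   # no route: the DNS server is unreachable
print(route_for(WEB_SERVER, after))   # still routed, but nobody can look up its name
```

In this toy model the web server stays routable after the withdrawal, yet clients that depend on the DNS server to find it are stuck.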

1vuio0pswjnm7 · on Oct 5, 2021

Another possibility is that failing to announce the prefixes for their DNS server IPs was just a symptom of a larger problem, like misconfigured routers.

jvolkman · on Oct 5, 2021

Kind of funny that instagram.com uses Route53, but amazon.com does not.

shric · on Oct 5, 2021

> No two devices on the internet are directly connected.

I get the need for brevity and simplicity in a post like this, but is there really a need for obviously false statements?

xapata · on Oct 5, 2021

You can't get there from here.

thehappypm · on Oct 5, 2021

What’s a counterexample?

shric · on Oct 5, 2021

My router and my desktop are one of several billion counterexamples.

thehappypm · on Oct 5, 2021

Your router and desktop are not an example of an internet connection. That's an intranet connection. I think that's what they mean -- for two devices to be connected on the internet there's always (at least) routers in between.

shric · on Oct 5, 2021

I can infer what they mean, but that doesn't make it a correct statement. Maybe I'm being pedantic, but routers are devices too, and I have computers with multiple NICs that act as routers as well as servers. Intranet vs internet is an arbitrary distinction. If a "device" has an IP address that's reachable from "the internet" then it's on the internet, regardless.

thehappypm · on Oct 5, 2021

The article's point is that getting information from Device A to Device B across the internet never happens over a straight link from Device A to Device B; there are always middlemen whose only purpose is to forward the data along. There's always something between the end nodes.

shric · on Oct 6, 2021

Yes, and that's simply and demonstrably false. There are not always middlemen, as I have already explained.

nazgulsenpai · on Oct 4, 2021

This page makes Brave think it's unavailable and offer an archived version, lol.

19h · on Oct 5, 2021

That's because the status code is 404.

cryptodan · on Oct 5, 2021

This was likely an inside job. This outage prevented employees from entering their office buildings.