A Tale of DNS and BGP: The Facebook Outage, October 2021 (riskledger.com)
66 points by jamescun on Oct 4, 2021 | 16 comments


> Facebook in this case, operates a set of intermediary DNS servers that are responsible for everything between your ISP's recursers and the roots. These are responsible for facebook.com, instagram.com, whatsapp.com and everything else they operate.

This is not the case for instagram.com, whose DNS is hosted by a different provider (AWS Route53) and which stayed resolvable throughout the outage.
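To illustrate why the DNS provider mattered, here is a toy model (not a real resolver) in which each zone delegates to a set of name servers and a lookup succeeds only if at least one of them is reachable. All server names and reachability flags are hypothetical placeholders, not the real records.

```python
# Toy model, not a real resolver: which zones delegate to which name servers,
# and whether each server was reachable during the outage.
# All server names and reachability flags here are hypothetical.
DELEGATIONS = {
    "facebook.com": ["fb-ns1.example", "fb-ns2.example"],
    "instagram.com": ["route53-ns1.example", "route53-ns2.example"],
}

REACHABLE = {
    "fb-ns1.example": False,       # prefix withdrawn: no route to this server
    "fb-ns2.example": False,
    "route53-ns1.example": True,   # hosted on another network, unaffected
    "route53-ns2.example": True,
}


def resolve(name: str) -> str:
    """Return 'ok' if any authoritative server for name's zone is reachable."""
    for zone, servers in DELEGATIONS.items():
        # The name belongs to the zone if it is the apex or a subdomain of it.
        if name == zone or name.endswith("." + zone):
            if any(REACHABLE[server] for server in servers):
                return "ok"
            # A recursive resolver that gets no answer from any authoritative
            # server typically returns SERVFAIL to the client.
            return "SERVFAIL"
    return "unknown zone"  # outside this toy model


print(resolve("www.facebook.com"))   # SERVFAIL
print(resolve("www.instagram.com"))  # ok
```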

I'm not sure why Instagram's frontend servers returned 503, though. Maybe their backend fleet was included in the withdrawn prefixes, or maybe it was referenced through the affected domains.


"I'm not sure why Instagram's frontend servers returned 503, though."

One explanation is that Facebook uses a proxy configuration that relies on DNS to resolve the internal IP addresses of the backend servers. High-availability proxies like HAProxy can easily do these lookups from files loaded into memory instead of making DNS requests, but apparently Facebook had no fallback for when the DNS method started failing, so it stayed down until its DNS servers became available again. The proxies kept working, and no doubt the backend servers were up the entire time, but the proxies could not connect to them because the DNS lookups for their internal IP addresses failed (SERVFAIL). Once the retried DNS queries finally time out, a 503 is returned.
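A minimal sketch of the behavior described above, assuming a proxy that resolves backend names through DNS and optionally falls back to an in-memory static map. The hostname, the 10.0.0.12 address, and the function names are hypothetical; this is not how Facebook's or HAProxy's code actually works.

```python
import socket
from typing import Callable, Dict, Optional, Tuple

# Hypothetical in-memory map (e.g. loaded from a file at startup) used only
# when DNS resolution of an internal backend name fails.
STATIC_BACKENDS: Dict[str, str] = {
    "backend.internal.example": "10.0.0.12",
}


def dns_lookup(host: str) -> str:
    """Resolve host to an IPv4 address through the system resolver."""
    # Raises socket.gaierror (an OSError subclass) on resolution failures.
    return socket.getaddrinfo(host, None, family=socket.AF_INET)[0][4][0]


def route_request(
    host: str,
    lookup: Callable[[str], str] = dns_lookup,
    fallback: Optional[Dict[str, str]] = STATIC_BACKENDS,
) -> Tuple[int, Optional[str]]:
    """Return (status, backend IP): 200 if a backend address was found, else 503."""
    try:
        return 200, lookup(host)
    except OSError:  # covers socket.gaierror and timeouts
        pass
    # DNS failed: try the static map before giving up.
    if fallback is not None and host in fallback:
        return 200, fallback[host]
    # No address for the backend, so the proxy answers 503 Service Unavailable.
    return 503, None


def broken_dns(host: str) -> str:
    """Stand-in resolver that always fails, as during the outage."""
    raise socket.gaierror("simulated SERVFAIL")


print(route_request("backend.internal.example", lookup=broken_dns))
# (200, '10.0.0.12')
print(route_request("backend.internal.example", lookup=broken_dns, fallback=None))
# (503, None)
```

With `fallback=None` the sketch reproduces the commenter's scenario: healthy backends, but a 503 because their addresses cannot be resolved.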

"Maybe their backend fleet was included in the withdrawn prefixes..."

According to Cloudflare's writeup, the only prefixes withdrawn were those for the DNS servers.
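Why withdrawing only the DNS prefixes is enough to break everything can be sketched with a longest-prefix-match table: once a route is withdrawn, addresses inside it have no match, while other prefixes keep working. The prefixes below are RFC 5737 documentation ranges standing in for real ones, and the peer name is hypothetical.

```python
import ipaddress
from typing import Dict, Optional

Table = Dict[ipaddress.IPv4Network, str]


def make_table() -> Table:
    """Hypothetical routing table using RFC 5737 documentation prefixes."""
    return {
        ipaddress.ip_network("192.0.2.0/24"): "peer-A",     # stands in for the DNS servers' prefix
        ipaddress.ip_network("198.51.100.0/24"): "peer-A",  # stands in for everything else
    }


def lookup(table: Table, addr: str) -> Optional[str]:
    """Longest-prefix match: next hop for addr, or None if no route exists."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in table if ip in net]
    if not matches:
        # A router with no default route has nowhere to send the packet.
        return None
    return table[max(matches, key=lambda net: net.prefixlen)]


def withdraw(table: Table, prefix: str) -> None:
    """Simulate a BGP withdrawal by removing the route for prefix."""
    table.pop(ipaddress.ip_network(prefix), None)


routes = make_table()
withdraw(routes, "192.0.2.0/24")
print(lookup(routes, "192.0.2.53"))     # None: DNS servers unreachable
print(lookup(routes, "198.51.100.10"))  # peer-A: other prefixes unaffected
```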


Another possibility is that failing to announce the prefixes for their DNS server IPs was just a symptom of a larger problem, like misconfigured routers.
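One way a withdrawal can be a symptom rather than the root cause: some anycast DNS deployments announce their prefix only while a health check passes. The sketch below assumes that pattern; `DnsNode`, `backbone_ok`, and the prefix are hypothetical and are not taken from Facebook's systems.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DnsNode:
    """Hypothetical anycast DNS node that announces its prefix only while healthy."""
    prefix: str
    backbone_ok: Callable[[], bool]  # hypothetical health check
    announced: bool = False

    def reconcile(self) -> str:
        """Announce when healthy, withdraw when not; return the action taken."""
        healthy = self.backbone_ok()
        if healthy and not self.announced:
            self.announced = True
            return "announce"
        if not healthy and self.announced:
            self.announced = False
            return "withdraw"
        return "no-op"


# Three nodes sharing one anycast prefix (RFC 5737 documentation range).
nodes = [DnsNode("192.0.2.0/24", lambda: True) for _ in range(3)]
print([n.reconcile() for n in nodes])  # ['announce', 'announce', 'announce']

# A backbone failure makes every health check fail at once...
for n in nodes:
    n.backbone_ok = lambda: False
print([n.reconcile() for n in nodes])  # ['withdraw', 'withdraw', 'withdraw']
# ...so no node announces the prefix and it disappears from the internet.
print(any(n.announced for n in nodes))  # False
```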


Kind of funny that instagram.com uses Route53, but amazon.com does not.


> No two devices on the internet are directly connected.

I get the need for brevity and simplicity in a post like this, but is there really a need for obviously false statements?


You can't get there from here.


What’s a counterexample?


My router and my desktop are one counterexample out of several billion.


Your router and desktop are not an example of an internet connection; that's an intranet connection. I think that's what they mean: for two devices to be connected over the internet, there are always (at least) routers in between.


I can infer what they mean, it doesn't make it a correct statement. Maybe I'm being pedantic, but routers are devices too, and I have computers with multiple NICs that act as routers as well as servers. Intranet vs internet is an arbitrary distinction. If a "device" has an IP address that's reachable from "the internet" then it's on the internet, regardless.


The article's point is that getting information from Device A to Device B across the internet never involves a straight link from A to B: there are always middlemen whose only purpose is to forward the data along. There's always something between the end nodes.


Yes, and that's simply and demonstrably false. There are not always middlemen, as I have already explained.


Sigh
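Both sides of this exchange hinge on the host's on-link check: a destination inside the interface's own subnet is delivered directly, anything else is handed to a router. This is a simplified single-interface sketch; the addresses are hypothetical, and real hosts consult a full routing table.

```python
import ipaddress
from typing import Tuple


def next_hop(iface: str, dst: str, gateway: str) -> Tuple[str, str]:
    """Simplified single-interface host routing decision.

    iface is the host's address with prefix length, e.g. "192.168.1.10/24".
    Returns ("direct", dst) when dst is on the same link, otherwise
    ("via gateway", gateway).
    """
    network = ipaddress.ip_interface(iface).network
    if ipaddress.ip_address(dst) in network:
        return ("direct", dst)        # delivered on the link, no router in between
    return ("via gateway", gateway)   # handed to a router to forward


print(next_hop("192.168.1.10/24", "192.168.1.1", "192.168.1.1"))
# ('direct', '192.168.1.1'): desktop talking to its own router
print(next_hop("192.168.1.10/24", "203.0.113.5", "192.168.1.1"))
# ('via gateway', '192.168.1.1'): any remote host on the internet
```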


This page makes Brave think it's unavailable and offer an archived version, lol.


That's because the status code is 404.


This was likely an inside job. This outage prevented employees from entering their office buildings.



