Emphasis on the "seems". DNS gets blamed a lot because it's the very first step in the process of connecting. When everything is down, you will see DNS errors.
And since you can't get past the DNS step, you never see the other errors that you would get if you could try later steps. If you knew the web server's IP address to try to make a TCP connection to it, you'd get connection timed out errors. But you don't see those errors because you didn't get to the point where you got an IP address to connect to.
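A rough sketch of that ordering in Python, to show why the DNS error is the only one you ever see. The function name `first_failing_stage` and its injectable `resolve`/`connect` parameters are made up for illustration; only the `socket` calls are real standard-library APIs.

```python
import socket

def first_failing_stage(host, port,
                        resolve=socket.getaddrinfo,
                        connect=socket.create_connection,
                        timeout=3.0):
    """Return 'dns', 'tcp', or 'ok' depending on where the attempt stops."""
    try:
        # Step 1: name -> IP address. If this fails, nothing below ever runs.
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"
    # Take (ip, port) from the first result; IPv6 sockaddrs carry extra fields.
    address = infos[0][4][:2]
    try:
        # Step 2: TCP connect. Only reachable once we have an address.
        conn = connect(address, timeout=timeout)
    except OSError:  # includes timeouts and "connection refused"
        return "tcp"
    conn.close()
    return "ok"
```

If the resolver is down, this reports "dns" and never learns whether the TCP step would also have timed out, which is the situation described above.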
It's like if you go to a friend's house but their electricity is out. You ring the doorbell and nothing happens. Your first thought is that the doorbell is messed up. And you're not wrong: it is, but so is everything else. If you could ring it and get their attention to let you inside their house, you'd see that their lights don't turn on, their TV doesn't turn on, their refrigerator isn't running, etc. But those things are hidden to you because you're stuck on the front porch.
Sounds like their design was wrong, but you can't just blame DNS. DNS worked 100% here as per the task that it was given.
> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.
DNS errors are actually still cached; the claim that they aren't was debunked by DJB a couple of decades ago, give or take:
> RFC 2182 claims that DNS failures are not cached; that claim is false.
Here are some more recent details and the fuller explanation:
Note that FB.com currently expires its records in 300 seconds, which is 5 minutes.
PowerDNS (used by ordns.he.net) caches servfail for 60s by default — packetcache-servfail-ttl — which isn't very far from the 5min that you get when things aren't failing.
Personally, I do agree with DJB — I think it's a better user experience to get a DNS resolution error right away, than having to wait many minutes for the TCP timeout to occur when the host is down anyways.
I don't know BGP well, but it seems easier for peers to just drop FB's packets on the floor than deal with a DNS stampede.
How would a few bytes over a couple of UDP packets for DNS have any meaningful impact on anyone's network? If anything, things fail faster, so, there's less data to transmit.
For example, I often use ordns.he.net as an open recursive resolver. They use PowerDNS as their software. PowerDNS has the default of packetcache-servfail-ttl of 60s. OTOH, fb.com A response currently has a TTL of 300s — 5 minutes. So, basically, FB's DNS is cached for roughly the same time whether or not they're actually online.
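To make the comparison concrete, here is a toy cache that keeps failures for a fixed time, the same way it keeps answers for their TTL. This is only a sketch and not how PowerDNS is implemented: `ToyResolverCache`, `lookup`, and the `upstream` callback are hypothetical names, and the 60s default just mirrors the packetcache-servfail-ttl value mentioned above.

```python
import time

class ToyResolverCache:
    """Caches answers for their TTL and failures for servfail_ttl seconds."""

    def __init__(self, servfail_ttl=60, clock=time.monotonic):
        self.servfail_ttl = servfail_ttl
        self.clock = clock
        self.entries = {}  # name -> (expires_at, answer or None for SERVFAIL)

    def lookup(self, name, upstream):
        """Return (answer, from_cache); answer is None for a cached failure."""
        now = self.clock()
        cached = self.entries.get(name)
        if cached and cached[0] > now:
            return cached[1], True
        try:
            # upstream(name) is assumed to return (answer, ttl_seconds)
            answer, ttl = upstream(name)
        except LookupError:
            # Authoritative servers unreachable: remember the failure too.
            answer, ttl = None, self.servfail_ttl
        self.entries[name] = (now + ttl, answer)
        return answer, False
```

With these numbers, a resolver asks the authoritative servers about once every 300s while they are up and about once every 60s while they are down, so an outage does not turn every client query into an upstream query.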
If your network cannot accommodate another network's DNS servers being unreachable, the problem is your network, not the fact that the other network is unreachable.
A network being unreachable is a normal thing. DJB (http://cr.yp.to/djbdns/third-party.html) and others have argued for decades that it's pointless and counterproductive for single-site operators to have redundant DNS, so it's time to fix your software if, decades later, it somehow still assumes that all DNS is redundant and always available.
I didn't notice any slowdowns on Monday, BTW. I don't quite understand why well-written recursive DNS caching software would even have any, when it's literally just a couple of domains and a few FQDNs that were at stake for this outage. How will such software handle a real outage of a whole backbone with thousands of disjoint nameservers, all with different names and IP addresses?
Oddly enough, one could consider that behavior something that was put in place to "mitigate DNS misconfiguration".