>Now, here's the fun part. @Cloudflare runs a free DNS resolver, 1.1.1.1, and lots of people use it. So Facebook etc. are down... guess what happens? People keep retrying. Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com
https://twitter.com/jgrahamc/status/1445066136547217413
>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.
https://twitter.com/awlnx/status/1445072441886265355
>This is frontend DNS stats from one of the smaller ISPs I operate. DNS traffic has almost doubled.
https://twitter.com/TheodoreBaschak/status/14450732299707637...
Believe it or not, there are places in the world where FB products (WhatsApp specifically) are used as the primary communication platform for most people.
A second comment was saying there is no point using Signal if they can be down for 2 days. FB has only been down for a few hours so far, but curiously nobody is saying the same about it :)
I wonder if any big DNS resolvers will artificially cache a long-TTL NXDOMAIN response for FB to reduce their load. Done wrong, it would prolong the FB outage even further.
This outage also affected WhatsApp, one of the most widely-used communication technologies in the world. It would also have left me locked out of my apartment, were it not for random chance and the kindness of a stranger, but I'm glad that you can feel smugly superior about it.
Not OP, but such ideas usually stem from a misunderstanding of root cause. Facebook inaccessibility likely exposed poor assumptions or other flaws in e.g. "smart" devices or workflows. Those poor assumptions or other flaws are likely what got OP locked out of his apartment when Facebook went down, not Facebook itself going down.
No, that would not be the direct root cause. The direct root cause would be designing and implementing an apartment-complex entry app that depends on a working internet connection, battery, and network route to a single point of failure.
>but I’m glad that you can feel smugly superior about it
And I'm glad you can feel smug about combating smugness, because obviously the consequences of some social media and chat apps being down can only be measured by anecdotal stories of some unrelated issue like being locked out, not by their general societal impact, shady practices, contribution to disinformation and data mining, etc. Who's being self-centered now?
If anything, the lesson here is to not depend on a single, centralized channel for such communications (e.g. to get your AirBnB key). Now I also feel smug for always giving 2-3 alternative ways in cases where contacting someone/someone contacting me is crucial...
It's not like what the world lacks in 2021 is communication channels. One can use a landline, a mobile phone, SMS, email, and 200 other alternative IM outlets...
Clients weren't getting NXDOMAIN; they were getting SERVFAIL because the nameservers were unreachable, and SERVFAIL responses cannot be cached for more than 5 minutes [1].
Yes, that's the point. If you were running a DNS server and being overwhelmed by this, you might have considered artificially injecting NXDOMAIN with a long cache value to get some relief, which could extend the outage for FB.
Unless the operators were in direct contact with Facebook, it doesn't sound like a good idea. It's certainly not the job of the ISP to reduce an outage for FB. They also weren't sure if the outage would only be 5 minutes or 5 hours. Instead, ISPs should scale up and handle DNS traffic for outages like this. In this case, FB isn't the only company to learn a lesson or two around failure modes and how to improve in the future.
The point isn't reducing an outage for FB, it would actually extend the outage for some. The point would be to help give some relief to a DNS server you're running that's overloaded due to the FB outage...during the "crisis". Yes, of course, better planning ahead of time is nice. In any case, I didn't suggest doing this. I wondered if it was happening.
I think you missed the idea that the FB outage created a really heavy DNS load on other people's DNS servers.
No, I didn't miss the idea (and it's not just an idea, it really happened). I believe you're mistaking the role of the resolver operator, and whether or not they should be manipulating client queries/responses without the user knowing. An NXDOMAIN response doesn't match the actual conditions (the domain exists; its nameservers were just unreachable), and shouldn't be used just to manipulate the clients.
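(Illustrative aside: for anyone curious what the two failure modes look like from a client, here is a rough sketch, assuming dnspython is available, that sends a query straight to a resolver and prints the response code.)

    # Rough sketch (assumes dnspython): query a resolver directly and
    # report the response code, e.g. NOERROR vs NXDOMAIN vs SERVFAIL.
    import dns.exception
    import dns.message
    import dns.query
    import dns.rcode

    def check(name: str, resolver: str = "1.1.1.1") -> str:
        query = dns.message.make_query(name, "A")
        try:
            response = dns.query.udp(query, resolver, timeout=3)
        except dns.exception.Timeout:
            return "timeout (no response at all)"
        # SERVFAIL means the resolver could not get an answer (e.g. the
        # authoritative servers are unreachable); NXDOMAIN asserts the
        # name does not exist. RFC 2308 caps caching of SERVFAIL at 5 min.
        return dns.rcode.to_text(response.rcode())

    print(check("facebook.com"))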
It will have been cached closer to the edge, but once the TTL expires, so does the cache. That means all the DNS requests that would have been served from local caches end up hitting the upstream DNS servers. For a site like Facebook, that creates an absolute deluge of requests.
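(To make that concrete, a minimal sketch of a TTL cache in front of an upstream lookup; names and numbers are made up. While entries are fresh, the upstream never sees the traffic; once they expire, and if failures aren't cached, every retry falls through.)

    # Minimal sketch of a TTL cache in front of an "upstream" lookup.
    # While an entry is fresh it is served locally; once it expires, or
    # when failures are not cached, every request falls through upstream.
    import time

    class TtlCache:
        def __init__(self):
            self._entries = {}  # name -> (value, expires_at)

        def resolve(self, name, upstream, ttl=300):
            entry = self._entries.get(name)
            if entry and entry[1] > time.monotonic():
                return entry[0]            # served from cache, upstream never sees it
            value = upstream(name)         # expired or missing: hit upstream
            if value is not None:          # failures are NOT cached here...
                self._entries[name] = (value, time.monotonic() + ttl)
            return value                   # ...so every retry hits upstream again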
Anecdotal, but the whole of the internet feels sluggish atm.
No, since the positive response will normally be cached for "some time", dependent on a number of factors. The negative response, on the other hand, often won't get cached, again depending on settings.
I know you're just replying to the parent statement but unfortunately in this case the SOA went down with the ship. None of the (admittedly few) clients I've tested are caching the lack of a response for facebook.com's SOA or address records.
Yes.
I handle around a million requests per minute. I exponentially increase the cache period after subsequent misses to keep an outage from DDoSing the whole system.
This tends to be beneficial regardless of the root cause.
Edit: this is especially useful for handling search/query misses, as a query with no results is going to scan any relevant indexes etc. until it is clear no match exists, meaning a no-results query may take up more cycles than a hit.
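(A rough sketch of that approach, not the parent's actual code: cache the "no result" outcome as well, and double its TTL after each consecutive miss, up to a ceiling, so a prolonged outage stops hammering the backend.)

    # Sketch: negative caching with an exponentially growing TTL.
    # Each consecutive miss for the same key doubles how long the
    # "no result" answer is served from cache, up to a ceiling.
    import time

    BASE_TTL = 5        # seconds to cache the first miss
    MAX_TTL = 600       # never cache a miss longer than this
    _negative = {}      # key -> (expires_at, consecutive_misses)

    def lookup(key, backend):
        entry = _negative.get(key)
        if entry and entry[0] > time.monotonic():
            return None                      # still inside the negative-cache window
        result = backend(key)                # the expensive query / index scan
        if result is None:
            misses = entry[1] + 1 if entry else 1
            ttl = min(BASE_TTL * 2 ** (misses - 1), MAX_TTL)
            _negative[key] = (time.monotonic() + ttl, misses)
        else:
            _negative.pop(key, None)         # a hit resets the backoff
        return result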
It's remarkable what an effect even short-TTL caching can have given enough traffic. I recall once caching a value that was being accessed on every page load with a TTL of 1s, resulting in a >99% reduction in query volume, and that's nowhere near Facebook/internet backbone scale.
Yep, pre-priming the cache rather than passively allowing it to be rebuilt by requests/queries can also result in some nice improvements and, depending on replication delay across database servers, can avoid some unexpected query results reaching the end user.
In the past I was the architect of a top-2000 Alexa-ranked social networking site; data synchronization delays were insane under certain load patterns, with high-single-digit to low-double-digit second write propagation delays.
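(Sketch of the pre-priming idea, with a hypothetical key list and loader: walk the hot keys and populate the cache up front from the primary, so traffic lands on a warm cache and replica lag doesn't leak stale results.)

    # Sketch: warm ("pre-prime") a cache before traffic rebuilds it.
    # hot_keys and load_from_primary are hypothetical placeholders for
    # the hottest entries and a consistent (primary) data source.
    import time

    def preprime(cache: dict, hot_keys, load_from_primary, ttl=60):
        now = time.monotonic()
        for key in hot_keys:
            value = load_from_primary(key)   # read the primary to dodge replica lag
            cache[key] = (value, now + ttl)  # requests now land on a warm cache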
It's disappointingly common for cloud-backed apps and device firmware to go into a hot retry loop on any kind of network failure. A lot of engineers just haven't heard of exponential backoff, to say nothing of being able to implement and test it properly for a scenario that almost never happens.
Even if you assume Facebook's own apps have reasonable failure logic, there's all kinds of third-party apps and devices integrating with their API that probably get it wrong. Surprise botnet!
Yes. It's basically turned every device, especially mobile devices with the app running in the background, into botnet clients which are continually hitting their DNS servers.
I don't know what Facebook's DNS cache expiration interval was, but assume it's 1 day. Now multiply the load those Facebook users normally put on DNS by whatever polling interval the apps use.
And then remember what percentage of internet traffic (requests, not bandwidth) Facebook, WhatsApp, and Instagram make up.
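(Back-of-the-envelope version of that multiplication; every number below is hypothetical and only there to show the shape of the effect: with a working cache each device does roughly one lookup per TTL, and with nothing cacheable it does one per poll.)

    # Every number here is hypothetical, only to illustrate the multiplication.
    devices = 1_000_000_000        # devices with an FB-family app (made up)
    ttl = 86_400                   # assumed 1-day cache lifetime, per the comment above
    poll_interval = 30             # assumed app retry/poll interval in seconds (made up)

    normal_qps = devices / ttl               # ~1 lookup per device per TTL
    outage_qps = devices / poll_interval     # nothing cacheable -> 1 lookup per poll

    print(f"normal:  ~{normal_qps:,.0f} queries/sec")
    print(f"outage:  ~{outage_qps:,.0f} queries/sec")
    print(f"amplification: ~{outage_qps / normal_qps:,.0f}x")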
> It's basically turned every device, especially mobile devices with the app running in the background, into botnet clients which are continually hitting their DNS servers
Anecdotally, it also seems to be draining the batteries of those devices with all of those extra queries. At least that seems to be what's happening on my wife's phone.
Well, everything is a bit slow for me. I'm in the UK on Virgin Media, using either Google DNS or the VM ones (I'm not sure and can't be bothered to look).
What has just happened, and it can't be a coincidence, is that I lost internet connectivity about 1 hour ago and had to reboot my cable modem to get it back.
I'm fairly certain that my ISP was affected by this, causing an outage of all internet traffic for my network. So it seems possible, although I imagine using an alternate DNS provider should work ok (if they're not overrun by extra traffic)?
Unfortunately I'm not sure what the default DNS on the modem points to...
I've launched Wireshark to monitor the DNS traffic of roughly 5 phones. I've collected 19.8k DNS packets so far. Of those, 5.1k packets are flagged REFUSED or SERVFAIL. Since roughly half of the captured packets are queries and half are responses, that means about 51% of DNS requests are failing, if I'm not mistaken.
Looking at queries for graph.instagram.com, it looks like there are roughly 20 attempts in a sequence before it gives up.
All in all, this could probably explain the doubling of DNS traffic. But the sample is rather small, so take it with a grain of salt.
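(For anyone who wants to reproduce the count, a sketch assuming scapy and a capture file named dns.pcap: tally DNS responses and how many came back SERVFAIL or REFUSED.)

    # Sketch (assumes scapy and a capture file "dns.pcap"): count DNS
    # responses and how many of them came back SERVFAIL or REFUSED.
    from scapy.all import rdpcap
    from scapy.layers.dns import DNS

    FAIL_RCODES = {2, 5}   # 2 = SERVFAIL, 5 = REFUSED

    responses = failures = 0
    for pkt in rdpcap("dns.pcap"):
        if DNS in pkt and pkt[DNS].qr == 1:    # qr == 1 marks a response
            responses += 1
            if pkt[DNS].rcode in FAIL_RCODES:
                failures += 1

    print(f"{failures}/{responses} responses failed "
          f"({100 * failures / max(responses, 1):.0f}%)")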
Sort of, yeah. Typically a DDoS attack is done on purpose, whereas this is a side effect of so many clients using retry strategies for failed requests. But in both cases a lot of requests are being made, which is how a DDoS attack works.
> Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com
If you aren't using an exponential backoff algorithm for your reconnect scheme, you should be!
I have a device in the field, only a few thousand units total, but we saw issues when our shared cloud went down and every unit hammered it trying to get back up.
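(A minimal sketch of capped exponential backoff with full jitter; connect() is a stand-in for whatever reconnect call the device actually makes. The jitter matters as much as the backoff, so a whole fleet doesn't retry in lockstep when the service comes back.)

    # Sketch: reconnect with capped exponential backoff and full jitter,
    # so a fleet of devices does not retry in lockstep after an outage.
    import random
    import time

    def reconnect_with_backoff(connect, base=1.0, cap=300.0, max_attempts=None):
        attempt = 0
        while max_attempts is None or attempt < max_attempts:
            try:
                return connect()             # stand-in for the real reconnect call
            except Exception:
                delay = random.uniform(0, min(cap, base * 2 ** attempt))  # full jitter
                time.sleep(delay)
                attempt += 1
        raise RuntimeError("gave up reconnecting")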
>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.
It's not crazy; people are panicking over Facebook, Instagram and WhatsApp being down, and they keep trying to connect to those services. I mean, I would panic too if I were a social media junkie.
It’s not just "social media junkies", a very pretentious phrase to use considering you’re writing it in a comment on a social network. Hundreds of thousands of apps use Facebook APIs, often in the background too (including FB's own apps).