
Interesting side effects:

>Now, here's the fun part. @Cloudflare runs a free DNS resolver, 1.1.1.1, and lots of people use it. So Facebook etc. are down... guess what happens? People keep retrying. Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com

https://twitter.com/jgrahamc/status/1445066136547217413

>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.

https://twitter.com/awlnx/status/1445072441886265355

>This is frontend DNS stats from one of the smaller ISPs I operate. DNS traffic has almost doubled.

https://twitter.com/TheodoreBaschak/status/14450732299707637...




Another side effect:

Two of our local mobile operators are experiencing issues with phone calls due to network overload.

https://twitter.com/claroelsalvador/status/14450819333319598...


Believe it or not, there are places in the world where FB products (WhatsApp specifically) are used as the primary communication platform for most people.


Possibly in Norway too (internet though, not phone calls) https://www.nrk.no/nyheter/internett-trobbel-hos-telia-1.156...


The same happened in Romania with two of our mobile operators immediately after FB&all went down.


oh the irony!

can't use the phone network to place a call b/c of fb-errors clogging the pipe


Almost same thing happened when Signal went down:

https://news.ycombinator.com/item?id=25803010 Signal apps DDoS'ed their own server

The second comment was saying there's no point in using Signal if they're down for 2 days. It's only been a few hours for FB so far, but curiously nobody is saying the same :)


I'd say there's no point using FB anytime, even without the outage ;)


Well, for FB there are quite a lot of comments suggesting it stay down. So Signal actually got off quite easy.


I wonder if any big DNS servers will artificially cache a long TTL NXDOMAIN response for FB to reduce their load. Done wrong, it would extend the FB outage longer.


>Done wrong, it would extend the FB outage longer.

Let's hope it's done wrong.


This outage also affected WhatsApp, one of the most widely used communication technologies in the world. It would have left me locked out of my apartment, were it not for random chance and the kindness of a stranger, but I'm glad that you can feel smugly superior about it


Just out of curiosity, and obviously if you can disclose... How does FB availability affect your ability to enter your apartment?


Not OP, but such ideas usually stem from a misunderstanding of root cause. Facebook inaccessibility likely exposed poor assumptions or other flaws in e.g. "smart" devices or workflows. Those poor assumptions or other flaws are likely what got OP locked out of his apartment when Facebook went down, not Facebook itself going down.


An apartment-complex entry app with Facebook login integration seems possible to me. That would be direct root cause.

https://developers.facebook.com/docs/facebook-login/


No, that would not be direct root cause. Direct root cause would be designing and implementing an apartment-complex entry app which depends on a working internet connection, battery, and network route to a single point of failure.


>but I’m glad that you can feel smugly superior about it

And I'm glad you can feel smug about combating smugness, because obviously the consequences of some social media and chat apps being down can't be measured by anything but anecdotal stories of some unrelated issue like being locked out, not by their general societal impact, shady practices, contribution to disinformation and data mining, etc. Who's being self-centered now?

If anything, the lesson here is to not depend on a single, centralized channel for such communications (e.g. to get your AirBnB key). Now I also feel smug for always giving 2-3 alternative ways to reach me in cases where contacting someone/someone contacting me is crucial...

It's not like what the world lacks in 2021 is communication channels. One can use a landline, mobile phone, SMS, email, and 200 other alternative IM outlets...


Clients weren't getting NXDOMAIN, they were getting SERVFAIL because the nameservers were unreachable. These responses cannot be cached for more than 5 minutes [1].

[1] https://datatracker.ietf.org/doc/html/rfc2308#section-7.1


Yes, that's the point. If you're running a DNS server and being overwhelmed by this, you might have considered artificially injecting NXDOMAIN with a long cache value to get some relief. Which could extend the outage for FB.


Unless the operators were in direct contact with Facebook, it doesn't sound like a good idea. It's certainly not the job of the ISP to reduce an outage for FB. They also weren't sure if the outage would only be 5 minutes or 5 hours. Instead, ISPs should scale up and handle DNS traffic for outages like this. In this case, FB isn't the only company to learn a lesson or two around failure modes and how to improve in the future.


The point isn't reducing an outage for FB, it would actually extend the outage for some. The point would be to help give some relief to a DNS server you're running that's overloaded due to the FB outage...during the "crisis". Yes, of course, better planning ahead of time is nice. In any case, I didn't suggest doing this. I wondered if it was happening.

I think you missed the idea that the FB outage created a really heavy DNS load on other people's DNS servers.


No, I didn't miss the idea (and it's not just an idea, it really happened). I believe you're mistaking the role of the resolver operator, and whether or not they should be manipulating client queries/responses without the user knowing. An NXDOMAIN response doesn't match the actual failure condition (the domain exists, its nameservers were just unreachable), and shouldn't be used just to manipulate the clients.


I don't understand that logic. Wouldn't people interacting with the website normally also generate the same number of DNS requests, if not more?


It will have been cached closer to the edge, but once the TTL expires, so does the cache entry. That means all the DNS requests that would have been served from local caches end up hitting the upstream DNS servers. For a site like Facebook that creates an absolute deluge of requests. Anecdotal, but the whole of the internet feels sluggish atm.


Anecdotally, my personal website feels faster than normal. Gandi DNS.


No, since the positive response will normally be cached for "some time", dependent on a number of factors. The negative response, on the other hand, often won't get cached; again, dependent on settings.


Negative responses are cacheable, with the time to live taken from the Start of Authority record for the zone.


I know you're just replying to the parent statement but unfortunately in this case the SOA went down with the ship. None of the (admittedly few) clients I've tested are caching the lack of a response for facebook.com's SOA or address records.


Yep, I always make it a point to cache cache-misses in my code.


So then when I'm on some kind of blocked WiFi and nothing resolves, and I switch to a properly working WiFi your code will continue to fail?

It's not so simple to cache misses - you don't know if it's a real miss or some kind of error.

For example if Facebook cached the miss, then even when they are back up nothing would connect.


Yes. I handle around a million requests per minute. I exponentially increase the cache period after subsequent misses to avoid an outage DDoSing the whole system.

This tends to be beneficial regardless of the root cause.

edit: this is especially useful for handling search/query misses, as a query with no results is going to scan any relevant indexes etc. until it is clear no match exists, meaning a no-results query may take up more cycles than a hit.
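
Roughly what I mean, as a minimal sketch (in-process dict, made-up names and TTLs, not our actual code):

    import time

    NEG_BASE_TTL = 5      # seconds to cache the first miss
    NEG_MAX_TTL = 600     # cap, so a recovered backend isn't ignored for long
    POS_TTL = 60          # positive entries use a fixed TTL here

    MISS = object()
    _cache = {}           # key -> (value_or_MISS, expires_at, consecutive_misses)

    def lookup(key, fetch):
        now = time.time()
        entry = _cache.get(key)
        if entry and entry[1] > now:
            return None if entry[0] is MISS else entry[0]

        misses = entry[2] if entry else 0
        value = fetch(key)                 # the expensive backend query
        if value is None:
            # cache the miss, doubling the negative TTL on each consecutive miss
            ttl = min(NEG_BASE_TTL * (2 ** misses), NEG_MAX_TTL)
            _cache[key] = (MISS, now + ttl, misses + 1)
            return None
        _cache[key] = (value, now + POS_TTL, 0)
        return value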


It's remarkable the effect even short TTL caching can have given enough traffic. I recall once caching a value that was being accessed on every page load with a TTL of 1s resulting in a >99% reduction in query volume, and that's nowhere near Facebook/internet backbone scale.
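
(To put rough, purely illustrative numbers on it: at, say, 500 accesses per second, a 1s TTL means the underlying query runs about once per second instead of 500 times, so roughly 499 of every 500 lookups are served from cache, which is where a >99% reduction comes from. The exact figure depends on the traffic rate and how requests line up with the TTL window.)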


yep, pre-priming the cache rather than passively allowing it to be rebuilt by requests/queries can also result in some nice improvements and, depending on replication delay across database servers, avoid some unexpected query results reaching the end user.

In the past I was the architect of a top-2000 Alexa-ranked social networking site; data synchronization delays were insane under certain load patterns: high single-digit to low double-digit second write propagation delays.


I'm talking back-end, not in-app data caching. I would also cache misses there as well, but with a less aggressive TTL.


It's disappointingly common for cloud-backed apps and device firmware to go into a hot retry loop on any kind of network failure. A lot of engineers just haven't heard of exponential backoff, to say nothing of being able to implement and test it properly for a scenario that almost never happens.

Even if you assume Facebook's own apps have reasonable failure logic, there's all kinds of third-party apps and devices integrating with their API that probably get it wrong. Surprise botnet!
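
For anyone who hasn't seen it spelled out, the usual shape is capped exponential backoff with jitter. A minimal sketch, with arbitrary limits and illustrative error handling (nothing specific to Facebook's SDK):

    import random
    import time

    def call_with_backoff(do_request, base=1.0, cap=300.0, max_attempts=8):
        """Capped exponential backoff with full jitter. Illustrative only."""
        for attempt in range(max_attempts):
            try:
                return do_request()
            except OSError:  # e.g. DNS failure, connection refused/reset
                # sleep somewhere between 0 and min(cap, base * 2^attempt)
                time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
        raise RuntimeError("still failing after %d attempts" % max_attempts)

The jitter matters as much as the exponent: without it, every client that failed at the same instant comes back at the same instant too.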


Normally the request resolves and then gets cached locally, on the edge, by the ISP, … DNS is cached to ridiculous levels.

But if the request does not resolve there’s no caching, the next request goes through the entire thing and hits the server again.
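
You can see the first half of that from a client with e.g. dnspython, assuming it's installed; during an outage like this the lookup raises instead of returning anything cacheable:

    import dns.exception
    import dns.resolver  # pip install dnspython

    try:
        answer = dns.resolver.resolve("facebook.com", "A")
        # downstream caches may hold this record for up to answer.rrset.ttl seconds
        print(answer.rrset.ttl, [r.address for r in answer])
    except dns.exception.DNSException:
        # SERVFAIL / unreachable nameservers: nothing positive to cache,
        # so the next lookup goes all the way upstream again
        print("lookup failed; nothing to cache")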


There's a lot of caching involved in the chain of requests that would alleviate this request volume if things were working.


My best guess is that after n attempts to access the provided IP, the local DNS cache deletes the entry, causing a miss. Then the cycle continues.


am i correct in interpreting this as almost equivalent to a DDoS attack on DNS providers?


Yes. It's basically turned every device, especially mobile devices with the app running in the background, into botnet clients which are continually hitting their DNS servers.

I don't know what facebook's DNS cache expiration interval was, but assume it's 1 day. Now multiply the load that those facebook users put on the DNS by whatever polling interval the apps use.

And then remember what percentage of internet traffic (requests, not bandwidth) facebook, whatsapp, and instagram make up.

It's kindof beautiful.


>It's basically turned every device, especially mobile devices with the app running in the background, into botnet clients which are continually hitting their DNS servers

Anecdotally, it also seems to be draining the batteries of those devices with all of those extra queries. At least that seems to be what's happening on my wife's phone.


Now I'm a bit worried.

Could this bring down the whole internet for a while?


Well, everything is a bit slow for me. I'm in the UK on Virgin Media, using either Google DNS or the VM ones (I'm not sure and can't be bothered to look).

What has just happened, and it can't be coincidence, is that I lost internet connectivity about 1 hour ago, and had to reboot my Cable Modem to get it back.


I'm fairly certain that my ISP was affected by this, causing an outage of all internet traffic for my network. So it seems possible, although I imagine using an alternate DNS provider should work OK (if they're not overrun by extra traffic)?

Unfortunately I'm not sure what the default DNS on the modem points to...


You can try https://dnsleaktest.com/ which shows which DNS server is actually used.


I read it brought down the Vodafone network in Czechia, one of the major providers there.


... and the facebook SDK. Every single app that has the facebook SDK is blowing up now.


Further to this, don't Chrome and Safari quietly auto-ping/reload pages that "fail to connect" if they're left open in a tab or browser?


How often do the apps try to reconnect? Does anyone know?


I've launched Wireshark monitoring DNS traffic of roughly 5 phones. I've collected 19.8k DNS packets so far. Out of that, 5.1k packets are flagged with REFUSED or SERVFAIL. If I am not mistaken, it means that 51% of DNS requests fail.

Looking at queries for graph.instagram.com, it looks like there are roughly 20 attempts in a sequence before it gives up.

All in all, this could probably explain doubling of the DNS traffic. But the sample is rather small, so take it with a grain of salt.
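
If anyone wants to reproduce the tally without Wireshark, here's a rough scapy equivalent (assumes scapy is installed and root privileges; it counts only responses, which sidesteps the request/response double counting raised in the replies below):

    from collections import Counter
    from scapy.all import sniff, DNS

    counts = Counter()

    def tally(pkt):
        if pkt.haslayer(DNS) and pkt[DNS].qr == 1:   # responses only
            counts["responses"] += 1
            if pkt[DNS].rcode in (2, 5):             # 2 = SERVFAIL, 5 = REFUSED
                counts["failed"] += 1

    sniff(filter="udp port 53", prn=tally, timeout=60)
    print(counts)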


5.1 / 19.8 is much closer to a quarter than to half. But your point is still just as poignant at ~25% as it is at 51%.


I think those are DNS round trips. So 1 packet to request and 1 to respond. E.g. 9.9k total requests of which 5.1k fail.


Sort of, yeah. Typically a DDoS attack is done on purpose, this is a side effect of so many clients utilizing retry strategies for failed requests. But in both cases, a lot of requests are being made, which is how a DDoS attack works.


Equivalent how? In volume? In intention?


In volume.


Ah, I getchu. In that case you're probably not wrong. It must be an absolutely redoubtable volume of traffic.


> Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com

If you aren’t using exponential backoff algorithms for your reconnect scheme - you should be!

I have a device in the field, only a few thousand total, but we saw issues when our shared cloud would go down and everyone hammered it to get back up.


>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.

It's not crazy; people are panicking over Facebook, Instagram and WhatsApp being down and they keep trying to connect to those services. I mean, I would panic too if I were a social media junkie.


It’s not just "social media junkies", a very pretentious phrase to use considering you’re writing it in a comment on a social network. Hundreds of thousands of apps use Facebook APIs, often in the background too (including FB's own apps).


Is "alcoholic" a very pretentious word to use considering that the person saying it has a beer once a week?


Hopefully they're not DNS ANY requests? <ducks>

(CF decided not to honour them some years ago)





