>Now, here's the fun part. @Cloudflare runs a free DNS resolver, 1.1.1.1, and lots of people use it. So Facebook etc. are down... guess what happens? People keep retrying. Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com
https://twitter.com/jgrahamc/status/1445066136547217413
>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.
https://twitter.com/awlnx/status/1445072441886265355
>This is frontend DNS stats from one of the smaller ISPs I operate. DNS traffic has almost doubled.
https://twitter.com/TheodoreBaschak/status/14450732299707637...
Believe it or not, there are places in the world where FB products (WhatsApp specifically) are used as the primary communication platform for most people.
A second comment was saying there is no point using Signal if they can be down for 2 days. FB has only been down for a few hours so far, but curiously nobody is saying the same about it :)
I wonder if any big DNS resolvers will artificially cache a long-TTL NXDOMAIN response for FB to reduce their load. Done wrong, it would prolong the FB outage even further.
This outage also affected WhatsApp, one of the most widely-used communication technologies in the world. It would also have left me locked out of my apartment, were it not for random chance and the kindness of a stranger, but I'm glad that you can feel smugly superior about it.
Not OP, but such ideas usually stem from a misunderstanding of root cause. Facebook inaccessibility likely exposed poor assumptions or other flaws in e.g. "smart" devices or workflows. Those poor assumptions or other flaws are likely what got OP locked out of his apartment when Facebook went down, not Facebook itself going down.
No, that would not be the direct root cause. The direct root cause would be designing and implementing an apartment-complex entry app that depends on a working internet connection, battery, and network route to a single point of failure.
>but I’m glad that you can feel smugly superior about it
And I'm glad you can feel smug about combating smugness, because obviously the consequences of some social media and chat apps being down can only be measured by anecdotal stories of some unrelated issue like being locked out, not by their general societal impact, shady practices, contribution to disinformation and data mining, etc. Who's being self-centered now?
If anything, the lesson here is to not depend on a single, centralized channel for such communications (e.g. to get your AirBnB key). Now I also feel smug for always giving 2-3 alternative ways in cases where contacting someone/someone contacting me is crucial...
It's not like what the world lacks in 2021 is communication channels. One can use a landline, a mobile phone, SMS, email, and 200 other alternative IM outlets...
Clients weren't getting NXDOMAIN; they were getting SERVFAIL because the nameservers were unreachable, and SERVFAIL responses cannot be cached for more than 5 minutes [1].
Yes, that's the point. If you were running a DNS server and being overwhelmed by this, you might have considered artificially injecting NXDOMAIN with a long cache value to get some relief, which could extend the outage for FB.
Unless the operators were in direct contact with Facebook, it doesn't sound like a good idea. It's certainly not the job of the ISP to reduce an outage for FB. They also weren't sure if the outage would only be 5 minutes or 5 hours. Instead, ISPs should scale up and handle DNS traffic for outages like this. In this case, FB isn't the only company to learn a lesson or two around failure modes and how to improve in the future.
The point isn't reducing an outage for FB, it would actually extend the outage for some. The point would be to help give some relief to a DNS server you're running that's overloaded due to the FB outage...during the "crisis". Yes, of course, better planning ahead of time is nice. In any case, I didn't suggest doing this. I wondered if it was happening.
I think you missed the idea that the FB outage created a really heavy DNS load on other people's DNS servers.
No, I didn't miss the idea (and it's not just an idea, it really happened). I believe you're mistaking the role of the resolver operator, and whether or not they should be manipulating client queries/responses without the user knowing. An NXDOMAIN response doesn't match the actual conditions (the domain exists; its nameservers were just unreachable), and shouldn't be used just to manipulate the clients.
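(Illustrative aside: for anyone curious what the two failure modes look like from a client, here is a rough sketch, assuming dnspython is available, that sends a query straight to a resolver and prints the response code.)

    # Rough sketch (assumes dnspython): query a resolver directly and
    # report the response code, e.g. NOERROR vs NXDOMAIN vs SERVFAIL.
    import dns.exception
    import dns.message
    import dns.query
    import dns.rcode

    def check(name: str, resolver: str = "1.1.1.1") -> str:
        query = dns.message.make_query(name, "A")
        try:
            response = dns.query.udp(query, resolver, timeout=3)
        except dns.exception.Timeout:
            return "timeout (no response at all)"
        # SERVFAIL means the resolver could not get an answer (e.g. the
        # authoritative servers are unreachable); NXDOMAIN asserts the
        # name does not exist. RFC 2308 caps caching of SERVFAIL at 5 min.
        return dns.rcode.to_text(response.rcode())

    print(check("facebook.com"))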
It will have been cached closer to the edge, but once the TTL expires, so does the cache. That means all the DNS requests that would have been served from local caches end up hitting the upstream DNS servers. For a site like Facebook, that creates an absolute deluge of requests.
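(To make that concrete, a minimal sketch of a TTL cache in front of an upstream lookup; names and numbers are made up. While entries are fresh, the upstream never sees the traffic; once they expire, and if failures aren't cached, every retry falls through.)

    # Minimal sketch of a TTL cache in front of an "upstream" lookup.
    # While an entry is fresh it is served locally; once it expires, or
    # when failures are not cached, every request falls through upstream.
    import time

    class TtlCache:
        def __init__(self):
            self._entries = {}  # name -> (value, expires_at)

        def resolve(self, name, upstream, ttl=300):
            entry = self._entries.get(name)
            if entry and entry[1] > time.monotonic():
                return entry[0]            # served from cache, upstream never sees it
            value = upstream(name)         # expired or missing: hit upstream
            if value is not None:          # failures are NOT cached here...
                self._entries[name] = (value, time.monotonic() + ttl)
            return value                   # ...so every retry hits upstream again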
Anecdotal, but the whole of the internet feels sluggish atm.
No, since the positive response will normally be cached for "some time", dependent on a number of factors. The negative response, on the other hand, often won't get cached, again depending on settings.
I know you're just replying to the parent statement but unfortunately in this case the SOA went down with the ship. None of the (admittedly few) clients I've tested are caching the lack of a response for facebook.com's SOA or address records.
Yes.
I handle around a million requests per minute. I exponentially increase the cache period after subsequent misses to keep an outage from DDoSing the whole system.
This tends to be beneficial regardless of the root cause.
Edit: this is especially useful for handling search/query misses, as a query with no results is going to scan any relevant indexes etc. until it is clear no match exists, meaning a no-results query may take up more cycles than a hit.
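(A rough sketch of that approach, not the parent's actual code: cache the "no result" outcome as well, and double its TTL after each consecutive miss, up to a ceiling, so a prolonged outage stops hammering the backend.)

    # Sketch: negative caching with an exponentially growing TTL.
    # Each consecutive miss for the same key doubles how long the
    # "no result" answer is served from cache, up to a ceiling.
    import time

    BASE_TTL = 5        # seconds to cache the first miss
    MAX_TTL = 600       # never cache a miss longer than this
    _negative = {}      # key -> (expires_at, consecutive_misses)

    def lookup(key, backend):
        entry = _negative.get(key)
        if entry and entry[0] > time.monotonic():
            return None                      # still inside the negative-cache window
        result = backend(key)                # the expensive query / index scan
        if result is None:
            misses = entry[1] + 1 if entry else 1
            ttl = min(BASE_TTL * 2 ** (misses - 1), MAX_TTL)
            _negative[key] = (time.monotonic() + ttl, misses)
        else:
            _negative.pop(key, None)         # a hit resets the backoff
        return result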
It's remarkable what an effect even short-TTL caching can have given enough traffic. I recall once caching a value that was being accessed on every page load with a TTL of 1s, resulting in a >99% reduction in query volume, and that's nowhere near Facebook/internet backbone scale.
Yep, pre-priming the cache rather than passively allowing it to be rebuilt by requests/queries can also result in some nice improvements and, depending on replication delay across database servers, can avoid some unexpected query results reaching the end user.
In the past I was the architect of a top-2000 Alexa-ranked social networking site; data synchronization delays were insane under certain load patterns, with high-single-digit to low-double-digit second write propagation delays.
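(Sketch of the pre-priming idea, with a hypothetical key list and loader: walk the hot keys and populate the cache up front from the primary, so traffic lands on a warm cache and replica lag doesn't leak stale results.)

    # Sketch: warm ("pre-prime") a cache before traffic rebuilds it.
    # hot_keys and load_from_primary are hypothetical placeholders for
    # the hottest entries and a consistent (primary) data source.
    import time

    def preprime(cache: dict, hot_keys, load_from_primary, ttl=60):
        now = time.monotonic()
        for key in hot_keys:
            value = load_from_primary(key)   # read the primary to dodge replica lag
            cache[key] = (value, now + ttl)  # requests now land on a warm cache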
It's disappointingly common for cloud-backed apps and device firmware to go into a hot retry loop on any kind of network failure. A lot of engineers just haven't heard of exponential backoff, to say nothing of being able to implement and test it properly for a scenario that almost never happens.
Even if you assume Facebook's own apps have reasonable failure logic, there's all kinds of third-party apps and devices integrating with their API that probably get it wrong. Surprise botnet!
Yes. It's basically turned every device, especially mobile devices with the app running in the background, into botnet clients which are continually hitting their DNS servers.
I don't know what Facebook's DNS cache expiration interval was, but assume it's 1 day. Now multiply the load those Facebook users normally put on DNS by whatever polling interval the apps use.
And then remember what percentage of internet traffic (requests, not bandwidth) Facebook, WhatsApp, and Instagram make up.
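(Back-of-the-envelope version of that multiplication; every number below is hypothetical and only there to show the shape of the effect: with a working cache each device does roughly one lookup per TTL, and with nothing cacheable it does one per poll.)

    # Every number here is hypothetical, only to illustrate the multiplication.
    devices = 1_000_000_000        # devices with an FB-family app (made up)
    ttl = 86_400                   # assumed 1-day cache lifetime, per the comment above
    poll_interval = 30             # assumed app retry/poll interval in seconds (made up)

    normal_qps = devices / ttl               # ~1 lookup per device per TTL
    outage_qps = devices / poll_interval     # nothing cacheable -> 1 lookup per poll

    print(f"normal:  ~{normal_qps:,.0f} queries/sec")
    print(f"outage:  ~{outage_qps:,.0f} queries/sec")
    print(f"amplification: ~{outage_qps / normal_qps:,.0f}x")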
> It's basically turned every device, especially mobile devices with the app running in the background, into botnet clients which are continually hitting their DNS servers
Anecdotally, it also seems to be draining the batteries of those devices with all of those extra queries. At least that seems to be what's happening on my wife's phone.
Well, everything is a bit slow for me. I'm in the UK on Virgin Media, using either Google DNS or the VM ones (I'm not sure and can't be bothered to look).
What has just happened, and it can't be a coincidence, is that I lost internet connectivity about 1 hour ago and had to reboot my cable modem to get it back.
I'm fairly certain that my ISP was affected by this, causing an outage of all internet traffic for my network. So it seems possible, although I imagine using an alternate DNS provider should work ok (if they're not overrun by extra traffic)?
Unfortunately I'm not sure what the default DNS on the modem points to...
I've launched Wireshark to monitor the DNS traffic of roughly 5 phones. I've collected 19.8k DNS packets so far. Of those, 5.1k packets are flagged REFUSED or SERVFAIL. Since roughly half of the captured packets are queries and half are responses, that means about 51% of DNS requests are failing, if I'm not mistaken.
Looking at queries for graph.instagram.com, it looks like there are roughly 20 attempts in a sequence before it gives up.
All in all, this could probably explain the doubling of DNS traffic. But the sample is rather small, so take it with a grain of salt.
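(For anyone who wants to reproduce the count, a sketch assuming scapy and a capture file named dns.pcap: tally DNS responses and how many came back SERVFAIL or REFUSED.)

    # Sketch (assumes scapy and a capture file "dns.pcap"): count DNS
    # responses and how many of them came back SERVFAIL or REFUSED.
    from scapy.all import rdpcap
    from scapy.layers.dns import DNS

    FAIL_RCODES = {2, 5}   # 2 = SERVFAIL, 5 = REFUSED

    responses = failures = 0
    for pkt in rdpcap("dns.pcap"):
        if DNS in pkt and pkt[DNS].qr == 1:    # qr == 1 marks a response
            responses += 1
            if pkt[DNS].rcode in FAIL_RCODES:
                failures += 1

    print(f"{failures}/{responses} responses failed "
          f"({100 * failures / max(responses, 1):.0f}%)")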
Sort of, yeah. Typically a DDoS attack is done on purpose, whereas this is a side effect of so many clients using retry strategies for failed requests. But in both cases a lot of requests are being made, which is how a DDoS attack works.
> Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com
If you aren't using an exponential backoff algorithm for your reconnect scheme, you should be!
I have a device in the field, only a few thousand units total, but we saw issues when our shared cloud went down and every unit hammered it trying to get back up.
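(A minimal sketch of capped exponential backoff with full jitter; connect() is a stand-in for whatever reconnect call the device actually makes. The jitter matters as much as the backoff, so a whole fleet doesn't retry in lockstep when the service comes back.)

    # Sketch: reconnect with capped exponential backoff and full jitter,
    # so a fleet of devices does not retry in lockstep after an outage.
    import random
    import time

    def reconnect_with_backoff(connect, base=1.0, cap=300.0, max_attempts=None):
        attempt = 0
        while max_attempts is None or attempt < max_attempts:
            try:
                return connect()             # stand-in for the real reconnect call
            except Exception:
                delay = random.uniform(0, min(cap, base * 2 ** attempt))  # full jitter
                time.sleep(delay)
                attempt += 1
        raise RuntimeError("gave up reconnecting")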
>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.
It's not crazy; people are panicking over Facebook, Instagram and WhatsApp being down, and they keep trying to connect to those services. I mean, I would panic too if I were a social media junkie.
It’s not just "social media junkies", a very pretentious phrase to use considering you’re writing it in a comment on a social network. Hundreds of thousands of apps use Facebook APIs, often in the background too (including FB's own apps).