>Now, here's the fun part. @Cloudflare runs a free DNS resolver, 1.1.1.1, and lots of people use it. So Facebook etc. are down... guess what happens? People keep retrying. Software keeps retrying. We get hit by a massive flood of DNS traffic asking for facebook.com
>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.
>This is frontend DNS stats from one of the smaller ISPs I operate. DNS traffic has almost doubled.
Two of our local mobile operators are experiencing issues with phone calls due to network overload.
can't use the phone network to place a call b/c of fb-errors clogging the pipe
https://news.ycombinator.com/item?id=25803010 Signal apps DDoS'ed their own server
The second comment was saying there is no point using Signal if it can be down for 2 days. It's only been a few hours for FB so far, but curiously nobody is saying the same :)
Let's hope it's done wrong.
And I'm glad you can feel smug about combating smugness, because obviously the consequences of some social media and chat apps being down can't be measured except by anecdotal stories of some unrelated issue like being locked out, not by their general societal impact, shady practices, contribution to disinformation and data mining, etc. Who's being self-centered now?
If anything, the lesson here is to not depend on some single, centralized channel for such communications (e.g. to get your AirBnB key). Now I also feel smug for always giving 2-3 alternative ways in cases where contacting someone/someone contacting me is crucial...
It's not like what the world lacks in 2021 is communication channels. One can use land phone, mobile phone, SMS, email, and 200 other alternative IM outlets...
I think you missed the idea that the FB outage created a really heavy DNS load on other people's DNS servers.
It's not so simple to cache misses - you don't know if it's a real miss or some kind of error.
For example, if a resolver cached the Facebook miss, then even when they are back up nothing would connect.
This tends to be beneficial regardless of the root cause.
Edit: this is especially useful for handling search/query misses, as a query with no results is going to scan all relevant indexes etc. until it is clear no match exists, meaning a no-results query may take up more cycles than a hit.
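To make that concrete, here is a minimal sketch of a negative cache that treats the two cases differently. The TTL values and names are illustrative assumptions, not any real resolver's behavior:

    import time

    # Toy negative cache. An authoritative "no such name" (NXDOMAIN) is safe
    # to cache for a while; an error (SERVFAIL/timeout) should be cached very
    # briefly, or not at all, so clients recover as soon as the outage ends.
    NXDOMAIN_TTL = 300   # seconds; illustrative
    ERROR_TTL = 5        # seconds; illustrative

    _cache = {}  # name -> (status, expires_at)

    def remember_miss(name, status):
        ttl = NXDOMAIN_TTL if status == "NXDOMAIN" else ERROR_TTL
        _cache[name] = (status, time.monotonic() + ttl)

    def cached_miss(name):
        entry = _cache.get(name)
        if entry and entry[1] > time.monotonic():
            return entry[0]  # still valid
        return None          # expired or never seen: do the real lookup

If a resolver had cached Facebook's SERVFAIL for a day, nobody would have reconnected until the entry expired, which is exactly the failure mode described above.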
In the past I was the architect of a top-2000 Alexa-ranked social networking site; data synchronization delays were insane under certain load patterns, with high single to low double digit second write propagation delays.
Even if you assume Facebook's own apps have reasonable failure logic, there's all kinds of third-party apps and devices integrating with their API that probably get it wrong. Surprise botnet!
But if the request does not resolve, there's no caching; the next request goes through the entire thing and hits the server again.
I don't know what Facebook's DNS cache expiration interval (TTL) was, but assume it's 1 day. Now multiply the DNS load those Facebook users normally generate by however frequently the apps poll.
And then remember what percentage of internet traffic (requests, not bandwidth) facebook, whatsapp, and instagram make up.
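A rough back-of-envelope sketch of that multiplication (every number below is an illustrative assumption, not a measurement):

    # Clients behind one resolver; assumed 1-day TTL; assumed 30s app retries.
    clients = 1_000_000
    ttl = 86_400           # normally ~one upstream query per client per TTL
    retry_interval = 30    # during the outage, failures aren't cached

    normal_qps = clients / ttl             # ~11.6 queries/second
    outage_qps = clients / retry_interval  # ~33,333 queries/second
    print(f"amplification: {outage_qps / normal_qps:.0f}x")  # 2880x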
It's kind of beautiful.
Anecdotally, it also seems to be draining the batteries of those devices with all of those extra queries. At least that seems to be what's happening on my wife's phone.
Could this bring down the whole internet for a while?
What has just happened, and it can't be coincidence, is that I lost internet connectivity about 1 hour ago, and had to reboot my Cable Modem to get it back.
Unfortunately I'm not sure what the default DNS on the modem points to...
Looking at queries for graph.instagram.com, it looks like there are roughly 20 attempts in a sequence before it gives up.
All in all, this could probably explain doubling of the DNS traffic. But the sample is rather small, so take it with a grain of salt.
If you aren’t using exponential backoff algorithms for your reconnect scheme - you should be!
I have a device in the field, only a few thousand total, but we saw issues when our shared cloud would go down and everyone hammered it to get back up.
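For reference, a minimal sketch of the capped exponential backoff with full jitter mentioned above; the callable and the limits are placeholders:

    import random
    import time

    def reconnect(connect, base=1.0, cap=300.0, max_attempts=20):
        # Double the delay ceiling each attempt (capped), then pick a random
        # point within it ("full jitter") so clients don't retry in lockstep
        # and hammer the server the moment it comes back.
        for attempt in range(max_attempts):
            try:
                return connect()
            except OSError:
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        raise ConnectionError("giving up after max_attempts")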
It's not crazy; people are panicking over Facebook, Instagram and WhatsApp being down and they keep trying to connect to those services. I mean, I would panic too if I were a social media junkie.
(CF decided not to honour them some years ago)
The outage has pretty much buried that story, and perhaps more importantly, stopped its spread on FB networks.
That said, I can't see how FB managers and engineers would actually agree to carry out something like this intentionally.
Strongly disagree. The outage has millions of people entering "Facebook" into their search engines. Most engines will conveniently put related news at the top of the search results page. The most recent and widespread Facebook-related news story is about the whistleblower.
Plus everyone has a lot of spare time to read the article now that Facebook and Instagram are down.
The outage didn't bury the story. It amplified it. Any suggestions that Facebook did this on purpose don't even make sense.
With respect I am pretty sure that the most recent and widespread Facebook-related news story is this one.
Holistically I agree that this isn't the kind of distraction Facebook wants, although it tickles me to imagine Mark in the datacenter going Rambo with a pair of wire cutters.
I am seeing 0 news about the whistleblower when I google Facebook. Only outage news.
The whistleblower is kinda silly
If FB could increase revenue by having a "safer" algorithm then of course they would. Every company is just trying to increase revenue.
Unless another disgruntled employee knew it would amplify the story.
1 article about the whistleblower and 2 about the outage. Both about the outage also mention the whistleblower, so you could say that's 100% of coverage at least mentions the whistleblower.
And 1 out of 3 tweets also mentions the whistleblower.
But I haven't paid this much attention to Facebook in over a year.
So you can't just divide over time like that.
I'm just trying to process that FB is having its historic, all-networks global outage today of all days. And I bet FB would have paid double of whatever this will eventually cost them to make that story go away.
(5)(A) knowingly causes the transmission of a program, information, code, or command, and as a result of such conduct, intentionally causes damage without authorization, to a protected computer;
Not saying this is the work of spies, just that it’s not unimaginable to think some middle manager could convince themselves or a subordinate to do something drastically illegal out of some fear that terrible things would happen otherwise.
bingo. I don't care whether it's in the realm of tinfoil hat or not, this is the very real effect that this outage has had. By the time Facebook is back up, people on Facebook will be talking about the outage, not about the whistleblower report. Intentional or not, it will certainly be in Facebook's favor.
I suspect Facebook was making some change to their DNS generally, and they made some kind of mistake in deployment that blew up this morning.
Yeah, that could happen.
Could a pang of morality have struck one of the employees with the keys to the kingdom?
The more likely scenario is that this was the final straw for some disgruntled employee who decided to pull the plug on the entire thing.
I live in Australia. 60 Minutes exists here as well.
Maybe this is just to cover the fact that they leaked information about 20% of the earth's population?
This is straight up false. It was scrapers extracting data from public profiles. They already incorporate anti-scraping techniques, so there's not much they can do other than require every one to set their profile to private.
If they want to position themselves as the global phonebook, that's fine, but they should be open about that.
Edit to add: If you aren't in the "gather and sell access to everybody's data" business, "private" is a sensible default setting for that information. On the other hand, if you're Facebook...
It's not like this is a new thing. We've been getting [facebook does awful thing] news stories pretty consistently for years now.
Like... what? "Brave whistleblower comes out showing that Facebook isn't doing enough to control what you are thinking!" is sort of arguing past the question. Should Facebook be in charge of deciding what you think?
That is not at all what the whistleblower is alleging. Facebook already controls what content you are seeing through its news feed algorithm. The parameters to that algorithm are not a one-dimensional "how much control"; instead it uses engagement metrics to decide what content to show. The whistleblower claims that the engagement optimization, according to Facebook's own research, prioritizes emotionally angry/hurtful/divisive content.
Is there anything else in the interview the whistleblower alleges, or can prove?
Without regulation, social media will always prioritize profit.
Buried the news ... which is basically as noteworthy as the news that water is still wet. What exactly did she reveal that was not known before, or is it somehow newsworthy that Facebook also knew what everyone else knew? The real news ought to be how that managed to make it to the headlines.
They can either agree to comply with the orders from up above or they face consequences? How is that hard to comprehend?
To me this seems like a million dollar mistake.
A lot. The resulting Wall Street Journal series directly led to the shutdown of Instagram for Kids.
It hasn't on the BBC. They're airing both stories.
Well, they work for Facebook. In my opinion you would have to have no morals to join that corporation in the first place, so I can imagine such an ask would be just another dirty task to do. They seem to love it.
I think Facebook is awful, but her primary complaint seemed to me that she lacked the controls people like her, you know, the good people, should have to prevent anyone else from seeing the wrong things. That she was powerless to stop users from saying the wrong things. How was her motivation anything but a desire for more authoritarianism? She said she specifically took the job on the condition that she could monitor and direct posts to prevent the wrong info from being online; that's the last type of person you want in that position, the one that wants it.
I expect that we're still pretending Facebook is "just a private business", despite it being unlike any in history, and that its ties to government are completely benign.
I'm not saying she was wrong in any claim about internal discussions. But if you cannot imagine yourself being on the wrong side of someone like that, you have limited imagination.
If you wanted to scrub a lot of the data and nefarious evidence the whistleblower brought out, this would be a great way to do it, under the guise of a simple "employee screw-up" cover story.
It's hard for me not to think something more nefarious is afoot, considering FB's track record with a myriad of other things. At this point, it seems more likely that something sketchy is going on and not just some random employee who screwed up and brought down the entire network with a simple change. I would assume there are several layers of decision makers who oversee the BGP records. I have a hard time thinking one person had sole access to these and brought everything down with an innocent change.
FB has too many smart people to allow a single point of failure for their entire network such that if it goes down, it becomes "a simple error on the part of some random employee". This is not some junior dev who broke the build; it's far more serious than that.
Many DoH servers are working fine. DNS isn't a problem for the browser, but it seems to be a problem for Facebook's internal setup. It's like their proxy configuration is 100% reliant on DNS lookups in order to find backends.
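If that guess is right, the failure mode would look something like this sketch; the pool name and helper are hypothetical, not Facebook's actual setup:

    import random
    import socket

    BACKEND_POOL = "backends.internal.example"  # hypothetical internal name

    def pick_backend():
        # Resolving the pool on every request means an internal DNS outage
        # takes the proxy down even though the backends themselves are up.
        addrs = socket.gethostbyname_ex(BACKEND_POOL)[2]
        return random.choice(addrs)

Caching the last known good answer, or falling back to a static list, would keep the proxies alive through a DNS outage.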
The FB content servers are reachable. It is only the Facebook DNS servers that are unreachable.
Don't take my word for it, try for yourself
www.facebook.com 1 IN A 126.96.36.199 (content)
static.facebook.com 1 IN A 188.8.131.52 (content)
a.ns.facebook.com 1 IN A 184.108.40.206 (DNS)
ping -c1 126.96.36.199 |grep -A1 statistics
--- 126.96.36.199 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
ping -c1 188.8.131.52 |grep -A1 statistics
--- 188.8.131.52 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
ping -c1 -W2 184.108.40.206 |grep -A1 statistics
--- 184.108.40.206 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
curl -HUser-Agent --resolve www.facebook.com:443:126.96.36.199 https://www.facebook.com|sed windex.htm
links -dump index.htm
Sorry, something went wrong.
We're working on it and we'll get it fixed as soon as we can.
Facebook © 2020 · Help Center
grep HTTP index.htm
HTTP/1.1 503 No server is available for the request
> server 8.8.8.8
Default Server: dns.google
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to dns.google timed-out
Even if authoritative DNS servers are unreachable, there are, for example, multiple authorised public scans of port 53 available for free download. There are also passive DNS datasets available for free. While recursive caches may be convenient, IMO they are the least trustworthy sources of DNS data.
Keeping those poor network engineers in our thoughts.
Just a reminder: this is not a failure of a single person, though, but of the organization as a whole and the policies in place.
Somebody from my engineering class had an internship at DuPont's main facility/production line. He was implementing something that managed to completely shut down production for an entire shift & cause a large fire; it ended up being something in the millions worth of damages from production loss and fire damage.
The intern wasn't even yelled at, IIRC. He actually went on to do some very helpful things the rest of the internship. But man, did the person who let an intern be in a position to single-handedly cause such a mess get absolutely fucked by his superiors.
In most companies someone is the fall guy, depending on how much impact there is. It is not uncommon for that fall guy to be the CEO if the fuck-up is big enough.
It only means what is kept on their server is encrypted, that's all. E2E has no impact on message expiration.
My (very large) employer had a worldwide outage a few years ago where a single bad DNS update stopped everything in its tracks (at the time many things were still in our own data centers, now more is in Amazon/etc). It took most of the day to restart everything. But it's not something most people would have noticed like FB. Thankfully I worked in mobile so not involved.
It is easy to get this wrong if your company provides internet services that every developer typically depends on in their workflows; you have to keep educating your own developers on how to use them, and on when not to use your own services.
Although, to be fair, that is kind of like praising the arsonist after he put out the fire he started (which had already smoke-damaged the whole neighborhood).
lol! it's like the bicycle, appliance and consumer toilet paper shortages that resulted from changed consumer behavior during last year's lockdowns, but instead with internet distractions.
(even HN is creaking under the load, hah!)
I'm guessing Facebook upgraded everything to the highest level tech and inadvertently got thrown back into the stone age.
There was a great talk on this on the 32C3; unfortunately only in German: https://media.ccc.de/v/32c3-7323-wie_man_einen_blackout_veru...
I hear a rumor that the badge readers at the door also don't work, which would be just amazing if true... [Edit: Apparently partially confirmed.]
FB have really managed to knot their own shoelaces together here.
The regulators in many countries that allowed the purchase failed to protect customers and competition and helped to create a more fragile world prone to systemic disruptions.
However, many services, even government ones, depend on WhatsApp business accounts to provide essential services.
A ton of apps also use FB login as their only or primary authentication mechanism; some of them would definitely be essential services.
Maybe building on Facebook was not the best idea for these apps, but essential services would have been impacted nonetheless.
Why the assumption that I'm American?
Messenger is a bit more popular than WhatsApp in the U.S., compared to the rest of the world where WhatsApp is sometimes the de facto replacement for SMS (phone number vs. Facebook account, I guess).
I have seen WhatsApp or Telegram logos on billboards or printed media as support channels, but never really Messenger anywhere. I have not commonly seen WhatsApp as a support channel in the U.S. yet.
Since you mentioned Messenger, I called the U.S. out as perhaps a different market.
The argument is essentially that these organizations have gotten so successful that they need to hand control over their infrastructure to the state, since the state will manage it better. You might not be making this argument, but some in this thread clearly are when they ask "how can this be allowed to happen". I can't think of a single system managed by the government which actually is run in a way that's as good as Facebook's networks.
Consider it this way: Facebook has crashed less in the past 10 years than the stock market. That should give people thinking of state control for reliability something to think about.
Comparable would be messaging on traditional networks.
The OTP/Erlang-based AXD301 has had exceptionally high uptime; the reliability figure commonly cited is nine nines (99.9999999%), for example. The entire language stack was built primarily for telecom, to have exceptional reliability and uptime. WhatsApp (at least originally) was built on Erlang and the BEAM VM.
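For scale, nine nines works out to roughly 32 milliseconds of downtime per year:

    # 99.9999999% availability over a 365-day year.
    seconds_per_year = 365 * 24 * 3600             # 31,536,000
    downtime = (1 - 0.999999999) * seconds_per_year
    print(f"{downtime * 1000:.1f} ms/year")        # ~31.5 ms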
Telecom systems have likely had lesser downtime than Facebook in the last 30 years.
Uptime for AXD301 over 20 years depends on many other factors, including hardware, architecture, etc., and the nine nines is really the reliability of the Erlang code while the system was up; the caveat is that the reliability numbers cited and uptime are not directly related.
> Telecom systems have likely had lesser downtime than Facebook in the last 30 years
I've had a handful of outages in the past few years for my telecom services. Rogers' network failed in Canada in 2020 for a full day, crippling communications across the country. I've had one outage with Facebook since 2008. At worst, they're just as reliable. Regardless, the argument here really is "should the state step in and take control of Facebook's uptime because a lot of people chose to use it and it goes offline for a few hours a decade?". I still maintain that none of this is an affirmative argument.
Text messaging works fine. There is no serious disruption of public service.
That said, who wants to go back to ICQ?
I'm still leaning heavily towards an "oopsie" with routing that accomplished the same thing, however.
I am talking in a very generalized sense, not for this particular issue. But I don't think the code review/deployment process is entirely safe against internal bad actors.
The whole point is to write C that appears on the level at first, but actually has a subtle exploitable flaw. The flaw is supposed to appear like a simple mistake for plausible deniability. Some of the winning responses are very devious.
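The contest is about C, but as a hypothetical Python analogue of the genre: a comparison that looks on the level and passes casual review, yet accepts any prefix of the real secret.

    def check_pin(supplied, actual):
        # Looks like a careful character-by-character comparison...
        for a, b in zip(supplied, actual):
            if a != b:
                return False
        # ...but zip() stops at the shorter string, so supplied="123"
        # passes against actual="123456". Plausibly deniable as an
        # innocent oversight, which is exactly the contest's point.
        return True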
The fix may have been easy; it's all the tools and comms you need to fix it being down that makes it hard. It's all so interesting. Good riddance to Facebook.
It is amazing how far some people go with trying to explain this.
The whistleblower story is all over every news site.
Turning off everyone's favorite time wasting website is the worst possible way to cover it up. How many people are typing "Facebook down" into Google and getting the Facebook whistleblower news story in the "Related News" section of their results?
It's not a coverup.
While I tend to agree, you can't say that definitively -- no one can at this point, except those at FB super close to this issue.
> The whistleblower story is all over every news site.
As many point out in your other comment above on this, the outage is drowning out the whistleblower story on most search engines.
Myself, I would put it down to either a technical issue, or internal sabotage as an act of protest.
I'm so curious what you mean. How could you come to the conclusion that it is anything other than a technical issue?
> The mass outage comes just hours after CBS’s 60 Minutes aired a much-anticipated interview with Frances Haugen, the Facebook whistleblower who recently leaked a number of internal Facebook investigations showing the company knew its products were causing mass harm, and that it prioritized profits over taking bolder steps to curtail abuse on its platform — including disinformation and hate speech.
> We don’t know how or why the outages persist at Facebook and its other properties, but the changes had to have come from inside the company, as Facebook manages those records internally. Whether the changes were made maliciously or by accident is anyone’s guess at this point.
I think we can't completely rule out the possible connection to this. Again, likely isn't, but answering the question how one might come to the conclusion.
Apparently the people planning the heist went a bit overboard with their misdirection.
I'm hoping that this is just a coincidence
Seems like cutting their ASN off from the world would be a great way to cut off any would-be Discovery Volunteers that might try to collect evidence 4chan style to support the whistleblower’s case.
Still, let's hope that this gets fixed soon for the engineers and users involved.
But then the timing with regard to China sending 50 military aircraft over Taiwan today is also interesting... FB and communication infrastructure would go down first in times of tension, if you want to go full tin-foil hat.
Ok ... enough news reading for me today!
No matter when something happens, other things will be happening in the world around the same time. That doesn't establish a correlation (China has been doing that for a while), much less causation.