What Happened to Facebook, Instagram, and WhatsApp? (krebsonsecurity.com)
584 points by djrogers 23 days ago | 311 comments



Interesting side effects:

>Now, here's the fun part. @Cloudflare runs a free DNS resolver, 1.1.1.1, and lots of people use it. So Facebook etc. are down... guess what happens? People keep retrying. Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com

https://twitter.com/jgrahamc/status/1445066136547217413

>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.

https://twitter.com/awlnx/status/1445072441886265355

>This is frontend DNS stats from one of the smaller ISPs I operate. DNS traffic has almost doubled.

https://twitter.com/TheodoreBaschak/status/14450732299707637...


Another side effect:

Two of our local mobile operators are experiencing issues with phone calls due to network overload.

https://twitter.com/claroelsalvador/status/14450819333319598...


Believe it or not, there are places in the world where FB products (WhatsApp specifically) are used as the primary communication platform for most people.


Possibly in Norway too (internet though, not phone calls) https://www.nrk.no/nyheter/internett-trobbel-hos-telia-1.156...


The same happened in Romania with two of our mobile operators immediately after FB&all went down.


oh the irony!

can't use the phone network to place a call b/c of fb-errors clogging the pipe


Almost the same thing happened when Signal went down:

https://news.ycombinator.com/item?id=25803010 Signal apps DDoS'ed their own server

The second comment there was saying there is no point using Signal if they are down for 2 days. It's only been a few hours for FB so far, but curiously nobody is saying the same :)


I'd say there's no point using FB anytime, even without the outage ;)


Well, for FB there are quite a lot of comments suggesting for it to stay down. So Signal actually got off quite easy.


I wonder if any big DNS servers will artificially cache a long TTL NXDOMAIN response for FB to reduce their load. Done wrong, it would extend the FB outage longer.


>Done wrong, it would extend the FB outage longer.

Let's hope it's done wrong.


This outage also affected whatsapp, one of the most widely-used communication technologies in the world. It also almost caused me to be locked out of my apartment, were it not for random chance and the kindness of a stranger, but I’m glad that you can feel smugly superior about it


Just out of curiosity, and obviously if you can disclose... How does FB availability affect your ability to enter your apartment?


Not OP, but such ideas usually stem from a misunderstanding of root cause. Facebook inaccessibility likely exposed poor assumptions or other flaws in e.g. "smart" devices or workflows. Those poor assumptions or other flaws are likely what got OP locked out of his apartment when Facebook went down, not Facebook itself going down.


An apartment-complex entry app with Facebook login integration seems possible to me. That would be direct root cause.

https://developers.facebook.com/docs/facebook-login/


No, that would not be direct root cause. Direct root cause would be designing and implementing an apartment-complex entry app which depends on a working internet connection, battery, and network route to a single point of failure.


>but I’m glad that you can feel smugly superior about it

And I'm glad you can feel smug about combating smugness, because obviously the consequences of some social media and chat apps being down can only be measured by anecdotal stories of some unrelated issue like being locked out, not by their general societal impact, shady practices, contribution to disinformation and data mining, etc. Who's being self-centered now?

If anything, the lesson here is to not depend on some single, centralized, channel, for such communications (e.g. to get your AirBnB key). Now I also feel smug for always giving 2-3 alternative ways in cases contacting someone/someone contacting me is crucial...

It's not like what the world lacks in 2021 is communication channels. One can use land phone, mobile phone, SMS, email, and 200 other alternative IM outlets...


Clients weren't getting NXDOMAIN, they were getting SERVFAIL because the nameservers were unreachable. These responses cannot be cached for more than 5 minutes [1].

[1] https://datatracker.ietf.org/doc/html/rfc2308#section-7.1
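To make the NXDOMAIN vs. SERVFAIL distinction concrete, here is a rough sketch of how a resolver might pick a cache lifetime for a failed lookup under RFC 2308. The function name, its parameters, and the 30-second default are made up for illustration; only the two rules are from the RFC (negative answers take the smaller of the SOA record's TTL and its MINIMUM field, server failures are capped at 5 minutes).

```python
from typing import Optional

# RFC 2308 section 7.1: server failures must not be cached longer than 5 minutes.
SERVFAIL_MAX_TTL = 300


def failure_cache_ttl(rcode: str,
                      soa_ttl: Optional[int] = None,
                      soa_minimum: Optional[int] = None,
                      resolver_servfail_ttl: int = 30) -> int:
    """Return how many seconds a failed lookup may be cached (hypothetical helper)."""
    if rcode == "NXDOMAIN":
        if soa_ttl is None or soa_minimum is None:
            # No SOA came back with the answer, so there is nothing to derive a TTL from.
            return 0
        # Negative TTL is the smaller of the SOA record's TTL and its MINIMUM field.
        return min(soa_ttl, soa_minimum)
    if rcode == "SERVFAIL":
        # Whatever the resolver is configured with, clamp it to the 5 minute cap.
        return min(resolver_servfail_ttl, SERVFAIL_MAX_TTL)
    raise ValueError(f"not a failure rcode handled by this sketch: {rcode}")
```

With the authoritative servers unreachable there is no SOA to read, which is why the long-lived negative caching path never applied during the outage.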


Yes, that's the point. If you're running a DNS server and being overwhelmed by this, you might have considered artificially injecting NXDOMAIN with a long cache value to get some relief. Which could extend the outage for FB.


Unless the operators were in direct contact with Facebook, it doesn't sound like a good idea. It's certainly not the job of the ISP to reduce an outage for FB. They also weren't sure if the outage would only be 5 minutes or 5 hours. Instead, ISPs should scale up and handle DNS traffic for outages like this. In this case, FB isn't the only company to learn a lesson or two around failure modes and how to improve in the future.


The point isn't reducing an outage for FB, it would actually extend the outage for some. The point would be to help give some relief to a DNS server you're running that's overloaded due to the FB outage...during the "crisis". Yes, of course, better planning ahead of time is nice. In any case, I didn't suggest doing this. I wondered if it was happening.

I think you missed the idea that the FB outage created a really heavy DNS load on other people's DNS servers.


No, I didn't miss the idea (and it's not an idea, it really happened.) I believe you're mistaking the role of the resolver operator and whether or not they should be manipulating client queries/responses without the user knowing. An NXDOMAIN response does not match the conditions, and shouldn't be used just to manipulate the clients.


I don't understand that logic, wouldn't people interacting with the website normally also generate the same amount if not more DNS requests?


It will have been cached closer to the edge, but once the TTL expires, so does the cache. That means all the DNS requests that would have been served via local caches end up hitting the upstream DNS servers. For a site like Facebook that will be creating an absolute deluge of requests. Anecdotal, but the whole of the internet feels sluggish atm.


Anecdotally, my personal website feels faster than normally. Gandi DNS.


No, since the positive response will normally be cached for "some time" dependent on a number of factors. The negative response on the other hand often won't get cached, again, dependent on settings.


Negative responses are cachable with the appropriate time to live from the Start of Authority record for the zone.


I know you're just replying to the parent statement but unfortunately in this case the SOA went down with the ship. None of the (admittedly few) clients I've tested are caching the lack of a response for facebook.com's SOA or address records.


Yep, I always make it a point to cache cache-misses in my code.


So then when I'm on some kind of blocked WiFi and nothing resolves, and I switch to a properly working WiFi your code will continue to fail?

It's not so simple to cache misses - you don't know if it's a real miss or some kind of error.

For example if Facebook cached the miss, then even when they are back up nothing would connect.


Yes. I handle around a million requests per minute. I exponentially increase the cache period after subsequent misses so that an outage doesn't DDoS the whole system.

This tends to be beneficial regardless of the root cause.

Edit: this is especially useful for handling search/query misses, as a query with no results is going to scan any relevant indexes etc. until it is clear no match exists, meaning a no-results query may take up more cycles than a hit.
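For anyone wondering what "exponentially increase the cache period after subsequent misses" could look like, here is a minimal sketch. It is not the commenter's actual code; the class name, method names, and default TTLs are invented for illustration, and the clock is injectable only so it can be tested without sleeping.

```python
import time
from typing import Callable, Dict, Tuple


class MissCache:
    """Remembers keys that recently missed, doubling the hold time on repeated misses."""

    def __init__(self, base_ttl: float = 1.0, max_ttl: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.base_ttl = base_ttl
        self.max_ttl = max_ttl
        self.clock = clock
        # key -> (time the cached miss expires, number of consecutive misses so far)
        self._misses: Dict[str, Tuple[float, int]] = {}

    def is_known_miss(self, key: str) -> bool:
        """True while a previously recorded miss for `key` is still being held."""
        entry = self._misses.get(key)
        return entry is not None and self.clock() < entry[0]

    def record_miss(self, key: str) -> float:
        """Store a miss and return the TTL used: base, 2x base, 4x base, ... up to max_ttl."""
        _, count = self._misses.get(key, (0.0, 0))
        ttl = min(self.base_ttl * (2 ** count), self.max_ttl)
        self._misses[key] = (self.clock() + ttl, count + 1)
        return ttl

    def record_hit(self, key: str) -> None:
        """A successful lookup resets the backoff for `key`."""
        self._misses.pop(key, None)
```

The cap (`max_ttl`) is what addresses the "switch to a working WiFi" objection above: a stale miss can only be served for a bounded time before the real lookup is tried again.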


It's remarkable the effect even short TTL caching can have given enough traffic. I recall once caching a value that was being accessed on every page load with a TTL of 1s resulting in a >99% reduction in query volume, and that's nowhere near Facebook/internet backbone scale.
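A back-of-the-envelope simulation of that effect, under made-up numbers: a steady 200 requests per second for 10 seconds, all asking for the same value, with a 1 second TTL. The function name and the load figures are assumptions for illustration, not the original setup.

```python
def backend_queries(request_times_ms, ttl_ms):
    """Count requests that reach the backend when one value is cached for ttl_ms."""
    queries = 0
    cached_until = None
    for t in request_times_ms:
        if cached_until is None or t >= cached_until:
            # Cache is empty or expired: this request goes to the backend and refills it.
            queries += 1
            cached_until = t + ttl_ms
    return queries


# Hypothetical load: one request every 5 ms (200 per second) for 10 seconds.
# Integer milliseconds keep the timestamps exact.
request_times = [i * 5 for i in range(2000)]
hits_backend = backend_queries(request_times, ttl_ms=1000)
reduction = 1 - hits_backend / len(request_times)
print(f"{hits_backend} backend queries out of {len(request_times)} requests "
      f"({reduction:.1%} reduction)")
```

At this rate only one request per second gets through, so the reduction depends almost entirely on how many requests arrive within a single TTL window.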


Yep, priming the cache up front rather than passively allowing it to be rebuilt by requests/queries can also result in some nice improvements and, depending on replication delay across database servers, avoid some unexpected query results reaching the end user.

In the past I was the architect of a top-2000 Alexa-ranked social networking site. Data synchronization delays were insane under certain load patterns: write propagation delays in the high single to low double digits of seconds.


I'm talking back-end not in app data caching. I would also cache misses there as well but with less aggressive ttl.


It's disappointingly common for cloud-backed apps and device firmware to go into a hot retry loop on any kind of network failure. A lot of engineers just haven't heard of exponential backoff, to say nothing of being able to implement and test it properly for a scenario that almost never happens.

Even if you assume Facebook's own apps have reasonable failure logic, there's all kinds of third-party apps and devices integrating with their API that probably get it wrong. Surprise botnet!
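For reference, the alternative to a hot retry loop is only a few lines. This is a minimal sketch of exponential backoff with "full jitter" (sleep a random amount up to an exponentially growing ceiling); the function names, the defaults, and the choice to retry on `OSError` are assumptions for illustration, not anything Facebook's SDK is known to do.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (0-based), drawn from [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    # Randomizing the whole delay spreads clients out instead of retrying in lockstep.
    return source.uniform(0.0, ceiling)


def call_with_retries(operation: Callable[[], T], max_attempts: int = 6,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, backing off between network-style failures (OSError and subclasses)."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error instead of looping forever
            sleep(backoff_delay(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the rarely exercised failure path can be tested without actually waiting, which is the part that tends to get skipped.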


Normally the request resolves then gets cached locally, on the edge, by the ISP, … DNS is cached to a ridiculous level.

But if the request does not resolve there’s no caching, the next request goes through the entire thing and hits the server again.


There's a lot of caching involved in the chain of requests that would alleviate this request volume if things were working.


My best guess is that after n many attempts to access the provided IP, the local DNS cache deletes the entry causing a miss. Then the cycle continues.


am i correct in interpreting this as almost equivalent to a DDoS attack on DNS providers?


Yes. It's basically turned every device, especially mobile devices with the app running in the background, into botnet clients which are continually hitting their DNS servers.

I don't know what facebook's DNS cache expiration interval was, but assume it's 1 day. Now multiply the load on the DNS that those facebook users put by whatever polling interval the apps use.

And then remember what percentage of internet traffic (requests, not bandwidth) facebook, whatsapp, and instagram make up.

It's kind of beautiful.


> It's basically turned every device, especially mobile devices with the app running in the background, into botnet clients which are continually hitting their DNS servers

Anecdotally, it also seems to be draining the batteries of those devices with all of those extra queries. At least that seems to be what's happening on my wife's phone.


Now I'm a bit worried.

Could this bring down the whole internet for a while?


Well, everything is a bit slow for me. I'm in the UK on Virgin Media, using either Google DNS or the VM ones (I'm not sure and can't be bothered to look).

What has just happened, and it can't be coincidence, is that I lost internet connectivity about 1 hour ago, and had to reboot my Cable Modem to get it back.


I'm fairly certain that my ISP was affected by this causing an outage of all internet traffic for my network. So it seems possible, although I imagine using an alternate DNS provider should work ok (if they're not overrun by extra traffic)?

Unfortunately I'm not sure what the default DNS on the modem points to..


You can try https://dnsleaktest.com/ which shows which DNS server is actually used.


I read it brought down the Vodafone network in Czechia, one of the major providers there.


... and the facebook SDK. Every single app that has facebook SDK is blowing up now.


Further to this, don't Chrome and Safari quietly auto-ping/reload pages that "fail to connect" if they're left open in a tab or browser?


How often do the apps try to reconnect? Does anyone know?


I've launched Wireshark monitoring DNS traffic of roughly 5 phones. I've collected 19.8k DNS packets so far. Out of that, 5.1k packets are flagged with REFUSED or SERVFAIL. If I am not mistaken, it means that 51% of DNS requests fail.

Looking at queries for graph.instagram.com, it looks like there are roughly 20 attempts in a sequence before it gives up.

All in all, this could probably explain doubling of the DNS traffic. But the sample is rather small, so take it with a grain of salt.


5.1 / 19.8 is much closer to a quarter than to half. But your point is still just as poignant at ~25% as it is at 51%.


I think those are DNS round trips. So 1 packet to request and 1 to respond. E.g. 9.9k total requests of which 5.1k fail.
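The two readings of those Wireshark numbers work out as follows. This assumes, as the reply above does, that every query got exactly one response packet, so half the captured packets are queries; the variable names are just for illustration.

```python
total_packets = 19_800   # all DNS packets captured
failed_packets = 5_100   # responses flagged REFUSED or SERVFAIL

# Reading 1: failures as a share of every packet on the wire.
share_of_packets = failed_packets / total_packets

# Reading 2: one query packet plus one response packet per lookup,
# so the number of lookups is half the packet count.
total_queries = total_packets // 2
share_of_queries = failed_packets / total_queries

print(f"{share_of_packets:.1%} of packets, {share_of_queries:.1%} of queries")
```

So roughly a quarter of packets but about half of lookups, which is how both figures in this subthread can come from the same capture.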


Sort of, yeah. Typically a DDoS attack is done on purpose, this is a side effect of so many clients utilizing retry strategies for failed requests. But in both cases, a lot of requests are being made, which is how a DDoS attack works.


Equivalent how? In volume? In intention?


In volume.


Ah, I getchu. In that case you're probably not wrong. It must be an absolutely redoubtable volume of traffic.


> Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com

If you aren’t using exponential backoff algorithms for your reconnect scheme - you should be!

I have a device in the field, only a few thousand total, but we saw issues when our shared cloud would go down and everyone hammered it to get back up.


>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.

It's not crazy; people are panicking over Facebook, Instagram and WhatsApp being down and they keep trying to connect to those services. I mean I would panic too if I were a social media junkie.


It’s not just "social media junkies", a very pretentious phrase to use considering you’re writing it in a comment on a social network. Hundreds of thousands of apps use Facebook APIs, often in the background too (including FB's own apps).


Is "alcoholic" a very pretentious word to use considering that the person saying it has a beer once a week?


Hopefully they're not DNS ANY requests? <ducks>

(CF decided not to honour them some years ago)




I know this is tinhat territory, but it's weird this happens right after the FB whistleblower interview on 60 minutes.

The outage has pretty much buried that story, and perhaps more importantly, stopped its spread on FB networks.

That said, I can't see how FB managers and engineers would actually agree to carry out something like this intentionally.


> The outage has pretty much buried that story,

Strongly disagree. The outage has millions of people entering "Facebook" into their search engines. Most engines will conveniently put related news at the top of the search results page. The most recent and widespread Facebook-related news story is about the whistleblower.

Plus everyone has a lot of spare time to read the article now that Facebook and Instagram are down.

The outage didn't bury the story. It amplified it. Any suggestions that Facebook did this on purpose don't even make sense.


> recent and widespread Facebook-related news story is about the whistleblower

With respect I am pretty sure that the most recent and widespread Facebook-related news story is this one.

Holistically I agree that this isn't the kind of distraction Facebook wants, although it tickles me to imagine Mark in the datacenter going Rambo with a pair of wire cutters.


Yeah but journalists are happy to connect the dots between the two stories and honestly my brain loves the coincidence of these two things being clustered: https://krebsonsecurity.com/2021/10/what-happened-to-faceboo...


Yeah but journalists are happy to connect the dots between the two stories and honestly my brain loves the coincidence of these two thingy being clustered: but the how is clear: Earlier this morning, something inside Facebook caused the company to revoke key digital records that tell computers and other Internet-enabled devices how to find these destinations online.


That is in no way gonna make people forget the whistleblower story - if anything, it's gonna increase the antipathy to having a single point of failure. Face it, everyone hates FB, even the people who spend the most time on it.


datacenterS


> Strongly disagree. The outage has millions of people entering "Facebook" into their search engines. Most engines will conveniently put related news at the top of the search results page. The most recent and widespread Facebook-related news story is about the whistleblower.

I am seeing 0 news about the whistleblower when I google Facebook. Only outage news.


Every outage piece of news I'm seeing mentions the whistleblower.


Who reads the article? If I google "Facebook" to see if there's an outage, I see the first headline that says it's an outage and leave. Maybe curious few percent will.


This wasn't the case six hours ago. I checked, there were scattered outage stories and all whistle blower


Anecdotal, but I just tried Google + Bing and topline Facebook-related news is all about the outage.


Also anecdotal, but I didn't know about the whistleblower until I searched Twitter for "facebook" when I learned about the outage.


I also didn't know about the whistleblower until seeing it as a top tweet, however...

The whistleblower is kinda silly

If FB could increase revenue by having a "safer" algorithm then of course they would. Every company is just trying to increase revenue..


> Any suggestions that Facebook did this on purpose don't even make sense.

Unless another disgruntled employee knew it would amplify the story.


Sample size of one but a quick google shows me zero whistleblower news and 100% outage news.


https://i.imgur.com/IaSoR0w.png

1 article about the whistleblower and 2 about the outage. Both about the outage also mention the whistleblower, so you could say that's 100% of coverage at least mentions the whistleblower.

Also 1 out of 3 tweets also mentions the whistleblower.


I'm one of those who had no idea about the whistleblower story, but I learned of it through reading about Facebook network outage.


Yeah, but reading about it and being able to communicate about it on the largest network (the one in question too) are 2 separate phenomena. No one can go on there right now and say "I'm deleting my account, who's with me?"


Not at all. I just tried searching for "Facebook" on Google. The whistleblower story is not on the first page of search results. The outage is mentioned half a dozen times on that same page.


I assume this outage is costing millions per hour. And it's not exactly great advertising for Facebook, either. I doubt very much they would do something like this on purpose.


Dividing up last quarter's $29B revenue leads to approximately $13.4M per hour of downtime, now past $53M after the 4 hour mark.

But I haven't paid this much attention to Facebook in over a year.
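The division behind that estimate, for anyone who wants to plug in their own numbers. It assumes revenue is spread evenly over every hour of a 90-day quarter, which is the assumption that reproduces the $13.4M figure; a 91-day quarter gives about $13.3M. The function name is hypothetical.

```python
QUARTERLY_REVENUE_USD = 29e9  # the "last quarter" figure quoted above


def revenue_per_hour(quarterly_revenue: float, days_in_quarter: int) -> float:
    """Average revenue per hour, assuming it accrues evenly around the clock."""
    return quarterly_revenue / (days_in_quarter * 24)


per_hour = revenue_per_hour(QUARTERLY_REVENUE_USD, days_in_quarter=90)
four_hours = 4 * per_hour
print(f"${per_hour / 1e6:.1f}M per hour, ${four_hours / 1e6:.1f}M after 4 hours")
```

Real ad revenue is not flat across hours or regions, so treat this as an order-of-magnitude figure only.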


Sounds like a drop in the bucket for them.


That doesn't make a lot of sense though - Facebook generates revenue primarily from ad traffic (on all sorts of sites). It needs to be up for reputation and to harvest ever more detail for 'improving' those ads, sure, but not for revenue. (Modulo blip from ads on its own site.)

So you can't just divide over time like that.


What? It absolutely needs to be up -- those ads being served are on Facebook and Insta, not display banners on random sites.


That's what I meant by 'ads on its own site' - but I was under the impression that Facebook generated most revenue from selling data/ads for display elsewhere (as well as on Facebook.com itself, and other subsidiaries). Perhaps I was wrong about that? Quick search shows up 'audience network', but I'm not sure to what extent that's what I was thinking of.


Nope, for the most part all the ads that Facebook serves are for facebook owned sites and properties. They don't sell data, or have general ad placements on 3rd party websites.


wow...


It sounds like they are not even able to serve ads, on any property. So while far from perfect, it's probably a decent estimate without doing in-depth analysis.


Right, I know that, and I usually try to avoid conspiratorial thinking, but man, Zuck doesn't make it easy.

I'm just trying to process that FB is having its historic, all-networks global outage today of all days. And I bet FB would have paid double of whatever this will eventually cost them to make that story go away.


Unless “they” were one or two disgruntled employees with the access, know-how, and motive to execute a “mistake”. Emphasis added.


If it was intentional, that's serious jail time territory. That's a high price to pay for such limited downtime. I'm pretty sure an intentionally malicious actor with that type of access could do much worse things.


I'm curious as to what law, exactly, they would be breaking. Sabotage in the US code is defined mostly in terms of war material and damages done to physical "national defense" properties. Certainly an employee would be fired and sued by the company, but is deliberately changing a routing policy (and not something like a worm or virus that deletes or otherwise degrades hardware and software) a crime?


IANAL but I would assume computer fraud and abuse act:

(5)(a)knowingly causes the transmission of a program, information, code, or command, and as a result of such conduct, intentionally causes damage without authorization, to a protected computer;


In the cases cited under the CFAA (such as https://scholar.google.com/scholar_case?case=124545279862007...) it seems the employee deleted data and private info. In this case, no data was deleted and no other computing property was damaged; it just became unreachable.


The recent Van Buren decision would make that unlikely.

https://news.ycombinator.com/item?id=27389500


That’s the one.


Proof of intent is a significant burden placed upon prosecution. If that can be overcome, there’s legal precedent for criminal conviction namely under the CFAA.

https://tadlaw.com/can-charged-crime-sabotaging-employers-co...


I’m pretty sure the vast majority of entry level spy craft is about convincing people to do highly illegal and destructive things from a place of fear.

Not saying this is the work of spies, just that it’s not unimaginable to think some middle manager could convince themselves or a subordinate to do something drastically illegal out of some fear that terrible things would happen otherwise.


> stopped its spread on FB networks.

bingo. I don't care whether it's in the realm of tinfoil hat or not, this is the very real effect that this outage has had. By the time Facebook is back up, people on Facebook will be talking about the outage, not about the whistle blower report. Intentional or not, it will certainly be in Facebook's favor.


Facebook controls the algorithm, wouldn't they just be able to down amplify how much that story is spread on its network? (Rather than resort to this?)


Just to clarify...I pretty obviously don't think that Facebook intentionally pulled the plug to suppress a critical story. But the inadvertent effect of the downtime is nonetheless the fact that the critical story will not be the center of discussion on Facebook when Facebook is back up.


A general outage is more deniable than dampening a negative story.


I love a good tinfoil hat theory, but in this case I doubt it. I have FB blocked on my network via pihole, but I don't explicitly block Instagram. Until sometime late last week (I noticed on Saturday), blocking facebook.com also blocked Instagram. As of this weekend, Instagram works just fine even with those blocks in place.

I suspect Facebook was making some change to their DNS generally, and they made some kind of mistake in deployment that blew up this morning.


I'll take the other side of that bet. Who messes with routing tables at noon on a Monday?


Someone who doesn't want to deploy on a Friday?


They deployed this morning; that doesn’t imply they implemented anything then. I can’t think of a better time: that way you have the whole week to work on anything it uncovers. Or, in the case of something this big, they have the rest of the day to freak out.


When I worked there a few years ago, the routing tables were being updated almost all day every day, primarily via automated processes.


It sounds like the perfect time honestly. If you fuck something up, you have the whole week to fix it.


I was thinking more of the ticking time-bomb variety, but that seems as good a time as any?


Nah, a ticking time-bomb would "explode" on Christmas (or Aïd El Kebir, etc.), whenever most of the employees who could do something about it are absent.


Still wasn't clear enough with my analogy! I was thinking more like a dam failure due to operator/designer error, not sabotage (but who knows). The damage shows up as really small signs initially, followed by rolling catastrophic failures.


Ah, my bad, I misread your comment!

Yeah, that could happen.


> it's weird this happens right after the FB whistleblower interview on 60 minutes

Could a pang of morality have struck one of the employees with the keys to the kingdom?


Counterpoint: I had not even heard about the whistle-blower until seeing stories about the outage. One of the largest web services in the world being out of commission for multiple hours is a big deal in 2021. It's a top story on most news sites and other social media (e.g. here at HN, reddit, twitter). If you want something to pass under the radar, it's probably best to not attract global attention.


If I was so inclined to put on my conspiracy theorist robe, I’d guess more likely related to the bulk of Pandora Papers news hitting today.


Or evergrande.


I had no idea about the 60 Minutes thing until people started mentioning it in response to this outage.


Most people outside of the US don't even know what "60 Minutes" is. Even fewer have heard about that report. And even fewer care. But everyone has now heard about the outage. This would be the worst possible way of trying to stop the spread of the story.

The more likely scenario is that this was the final straw for some disgruntled employee who decided to pull the plug on the entire thing.


> Most people outside of the US don't even know what "60 Minutes" is.

I live in Australia. 60 Minutes exists here as well.

https://en.wikipedia.org/wiki/60_Minutes_(Australian_TV_prog...


arguing on the basis of a strong cultural differentiation between the us and aus might not work as well as you think.


Agree. I just did a quick check and 60 Minutes averages around 10 million viewers. It's not like in 1977 when something like 20%+ of the US population was watching that show.


It is pretty well available — and quickly — via piracy means, which I have always thought interesting for its somewhat esoteric content.


Not just that, but another story just broke about the sale of personal info on 1.5 billion FB users.

Maybe this is just to cover the fact that they leaked information about 20% of the earth's population?


> they leaked information about 20% of the earth's population

This is straight up false. It was scrapers extracting data from public profiles. They already incorporate anti-scraping techniques, so there's not much they can do other than require every one to set their profile to private.


If you don't collect the data in one place, there's no chance of leaking it.

If they want to position themselves as the global phonebook, that's fine, but they should be open about that.

Edit to add: If you aren't in the "gather and sell access to everybody's data" business, "private" is a sensible default setting for that information. On the other hand, if you're Facebook...


It's kinda in the name isn't it?

Phonebook... Facebook...


Here's the interview (which I had totally missed btw)

https://www.youtube.com/watch?v=_Lx5VmAdZSI


Hanlon's razor applies here, but it's a lot less fun. :)


That news broke ~12 hours ago right?


> I know this is tinhat territory, but it's weird this happens right after the FB whistleblower interview on 60 minutes.

It's not like this is a new thing. We've been getting [facebook does awful thing] news stories pretty consistently for years now.


I actually think most importantly it shows everyone what the world without FB is like ;)


If we're in "tinhat" territory: it seems extremely odd to me that this whistleblower seems to be "blowing the whistle" on the fact that facebook isn't doing enough to control what people are thinking and talking about.

Like...what? "Brave whistleblower comes out showing that facebook isn't doing enough to control what you are thinking!" is sort of arguing past the question. Should facebook be in charge of deciding what you think?


> this whistleblower seems to be "blowing the whistle" on the fact that facebook isn't doing enough to control what people are thinking and talking about.

That is not at all what the whistleblower is alleging. Facebook already controls what content you are seeing through its news feed algorithm. That algorithm is not governed by a 1-dimensional "how much control" parameter; instead it uses engagement metrics to decide what content to show. The whistleblower claims that the engagement optimization, according to Facebook's own research, prioritizes emotionally angry/hurtful/divisive content.


I didn't have the time to watch the interview yet but... wasn't it common knowledge for years?

Is there anything else in the interview the whistleblower alleges, or can prove?


We all knew Facebook is bad for society. The whistleblower showed us that Facebook has done internal studies and that these studies have shown their products are bad for society/contributed to the insurrection/promote human trafficking/damage teen mental health/etc. But even with these studies, Facebook has decided to prioritize growth and revenue, rather than fix the issues that are bad for society. What this whistleblower leaked will hopefully lead to some sort of government regulation on social media.

Without regulation, social media will always prioritize profit.


They are exercising that power already, they are just explicitly doing so in a way that tears down the trust in society because that makes them money, rather than encouraging a less divisive and more fact-based conversation, because that doesn’t make them as much money.


The problem isn't that Facebook isn't doing enough to control what you are thinking, it's that it's doing way too much!


FWIW, every article I've read has referenced the interview, and I personally find it hard to believe Facebook would be unaware of the Streisand Effect


It’s like watching a hostage over-analysing why the abductor forgot to lock the door. Just get out and enjoy your newfound, albeit temporary, freedom.


> The outage has pretty much buried that story, and perhaps more importantly, stopped its spread on FB networks.

Buried the news ... which is basically as noteworthy as the news that water is still wet. What exactly did she reveal that was not known before, or is it somehow newsworthy that Facebook also knew what everyone else knew? The real news ought to be how that managed to make it to the headlines.


As much as I'd love to imagine FB rage-quitting the internet because people don't seem to appreciate them enough, I'm pretty sure it's a coincidence. Probably has more to do with it being Monday (you don't put big stories on Friday and you sure don't deploy config changes on Friday!) than anything else.


> That said, I can't see how FB managers and engineers would actually agree to carry out something like this intentionally.

They can either agree to comply with the orders from up above or they face consequences? How is that hard to comprehend?


I was thinking more along the lines of the Pandora Papers hitting the MSM.


Ah yes, the best way to bury a moral scandal of the kind that usually gets forgotten in a week is to undermine the trust of almost every single user worldwide. This is a very good conspiracy.


Did the whistleblower reveal something we didn't know already?

To me this seems like a million dollar mistake.


> Did the whistleblower reveal something we didn't know already?

A lot. The resulting Wall Street Journal series directly led to the shut down of Instagram for Kids.


I see it as similar to Snowden, in the sense that everybody kind of knew (actually guessed) but now we actually know. It doesn't come as a shock, but it's important information to have since it can be now argued with authority.


The whistleblower revealed that Facebook knows it is bad for society. The documents also show Facebook actively optimizes its algorithms for "bad for society" content because that drives engagement which makes them more money. Furthermore Facebook doesn't do as much content moderation in regions/languages with low usage numbers because it costs more than those users make them. So calls for genocide in Myanmar basically go unchallenged and unmoderated because Facebook doesn't make much money in Myanmar. Sorry genocided minority, you should have been more valuable to Facebook.


> The outage has pretty much buried that story

It hasn't on the BBC. They're airing both stories.


> would actually agree to carry out something like this intentionally.

Well, they work for Facebook. In my opinion you would have to have no morals to join that corporation in the first place, so I can imagine such an ask would be just another dirty task to do. They seem to love it.


The story that a woman at Facebook doesn't think they're going far enough to control speech they hate and bad-thoughts?

I think Facebook is awful, but her primary complaint seemed to me that she lacked controls for what people like her, you know, the good people have access to prevent anyone else from seeing. That she was powerless to stop users from saying the wrong things. How was her motivation anything but a desire for more authoritarianism? She said she specifically took the job on the condition she could monitor and direct posts to prevent the wrong info from being online, that's the last type of person you want in that position, the one that wants it.

I expect that we're still pretending Facebook is "just a private business", despite it being unlike any in history and that the ties to government are completely benign.

I'm not saying she was wrong in any claim about internal discussions. But, if you can not imagine yourself being on the wrong side of someone like that, you have limited imagination.


Facebook is surprisingly tolerant of controversial subjects. YouTube has gone scorched earth on millions of channels and deleted years of work of many people. Facebook was far more lenient and you could talk about non-official covid information, for example, where YouTube deleted anything that wasn't official narrative with extreme prejudice. Given how much bad stuff all over the world is happening to sacrifice freedom to get everyone to toe the official line on Covid that is complete science fiction level totalitarianism, I am sure Facebook made some very powerful and determined enemies with its more lenient stance. I was downvoted earlier for saying this was an intentional takedown and deleted my comment, but now I think this could be a full blown William Gibson Neuromancer Cyberpunk level corporate takedown attempt in progress!


She said she wanted FB to do something to stop misinformation and hate speech but what we've seen from Reddit is that "are mRNA vaccines actually safe?" becomes misinformation and "we shouldn't perform elective life-altering surgery on pre-teen children" becomes hate speech. There's not much I applaud Facebook for, but not listening to this woman is one of the few I do.


It also looks like it's much deeper than just people not finding the site. Employees are all locked out and there's another story on the front page of HN saying employees are locked out of the building as well.

If you wanted to scrub a lot of the data and nefarious evidence the whistle blower brought out, this would be a great way to do it, under the guise of a simple "employee screw up" cover story.

It's hard for me not to think something more nefarious is afoot, considering FB's track record with a myriad of other things. At this point, it seems more likely something sketchy is going on and not just some random employee who screwed up and brought down the entire network with a simple change. I would assume there are several layers of decision makers who oversee the BGP records. I have a hard time thinking one person had sole access to these and brought everything down with an innocent change.

FB has too many smart people to allow a single point of failure for their entire network such that if it goes down, it becomes "a simple error on the part of some random employee". This is not some junior dev who broke the build, it's far more serious than that.


"As a result, when one types Facebook.com into a web browser, the browser has no idea where to find Facebook.com, and so returns an error page."

Not quite.

Many DoH servers are working fine. DNS isn't a problem for the browser, but it seems to be a problem for Facebook's internal setup. It's like their proxy configuration is 100% reliant on DNS lookups in order to find backends.

The FB content servers are reachable. It is only the Facebook DNS servers that are unreachable.

Don't take my word for it, try for yourself

   www.facebook.com 1 IN A 179.60.192.3 (content)
   static.facebook.com 1 IN A 157.240.21.16 (content)
   a.ns.facebook.com 1 IN A 129.134.30.12 (DNS)

   ping -c1 157.240.21.16 |grep -A1 statistics
   --- 157.240.21.16 ping statistics ---
   1 packets transmitted, 1 received, 0% packet loss, time 0ms

   ping -c1 179.60.192.3|grep -A1 statistics 
   --- 179.60.192.3 ping statistics ---
   1 packets transmitted, 1 received, 0% packet loss, time 0ms

   ping -c1 -W2 129.134.30.12 |grep -A1 statistics
   --- 129.134.30.12 ping statistics ---
   1 packets transmitted, 0 received, 100% packet loss, time 0ms

The browser, i.e., the client (here, curl), does have an idea where to find Facebook.com:

   curl -HUser-Agent --resolve www.facebook.com:443:179.60.192.3 https://www.facebook.com|sed windex.htm

Wait...

   links -dump index.htm 

   [IMG]

                          Sorry, something went wrong.

   We're working on it and we'll get it fixed as soon as we can.

   Go Back

   Facebook (c) 2020 . Help Center


   grep HTTP index.htm

   HTTP/1.1 503 No server is available for the request
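
For anyone who wants to repeat the `curl --resolve` test without curl, here is a minimal Python sketch of the same idea: connect straight to a known IP, skip DNS entirely, and still present the hostname for TLS SNI and the `Host` header. The function names are mine and purely illustrative, and the IP is the one quoted above, which may well answer differently (or not at all) by the time you try it.

```python
import socket
import ssl


def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request with an explicit Host header."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "User-Agent: sketch\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def parse_status_line(raw: bytes) -> tuple[int, str]:
    """Return (status code, reason phrase) from the start of a raw HTTP response."""
    first_line = raw.split(b"\r\n", 1)[0].decode("iso-8859-1")
    _version, code, *reason = first_line.split(" ", 2)
    return int(code), (reason[0] if reason else "")


def fetch_status(host: str, ip: str, port: int = 443, timeout: float = 5.0) -> tuple[int, str]:
    """Connect to `ip` directly (no DNS lookup) while presenting `host` to the server."""
    context = ssl.create_default_context()
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        # server_hostname sets SNI and is what the certificate is checked against.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(build_request(host))
            return parse_status_line(tls.recv(4096))
```

Calling `fetch_status("www.facebook.com", "179.60.192.3")` is roughly the equivalent of the curl line above; only the status line is read here, not the body.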


If I do:

    nslookup
    > server 8.8.8.8
    Default Server:  dns.google
    Address:  8.8.8.8
    > facebook.com
    Server:  dns.google
    Address:  8.8.8.8

    DNS request timed out.
        timeout was 2 seconds.
    DNS request timed out.
        timeout was 2 seconds.
    *** Request to dns.google timed-out

Weird.


At the time, OP likely pulled cached DNS records to interrogate the associated IPs directly for their application (HTTP) level resources.


There's no need to use a cache.

Even if authoritative DNS servers are unreachable, there are, for example, multiple authorised public scans of port 53 available for free download. There are also passive DNS datasets available for free. While recursive caches may be convenient, IMO they are the least trustworthy sources of DNS data.


And thinking on this... a timeout is the right answer if facebook's dns servers are missing in action.
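
To tell "the resolver gave up waiting" apart from "the name does not exist", you can look at what actually comes back on the wire. A rough sketch, assuming a plain UDP query to a recursive resolver and using only the standard library; the helper names are made up for illustration, and which of these outcomes a given resolver produces during an outage like this is not something the sketch can promise.

```python
import socket
import struct

# RCODE values from the DNS header (RFC 1035): only the common ones are named here.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Encode a single-question DNS query for an A record."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def classify_response(response: bytes) -> str:
    """Read the RCODE (low 4 bits of the flags field) from a DNS response."""
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F
    return RCODES.get(rcode, f"RCODE {rcode}")


def probe(name: str, resolver: str, timeout: float = 2.0) -> str:
    """Send one query to `resolver` and report the RCODE, or TIMEOUT if nothing arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver, 53))
        try:
            response, _addr = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
    return classify_response(response)
```

Something like `probe("facebook.com", "8.8.8.8")` would then report `TIMEOUT`, `SERVFAIL`, `NXDOMAIN` or `NOERROR`, which is a bit more informative than nslookup's message.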


Due to DNS being busted, all internal FB services/tooling that they'd use to push DNS config updates are probably completely inaccessible. Someone at FB will have to manually SSH into a production host (assuming they can even identify the right one), and issue some commands to repopulate the DNS records. They'll probably have to do this without any access to internal wikis, documentation, or code.

Keeping those poor network engineers in our thoughts.


Maybe Zuck watched the interview, agreed and pulled the plug.


In all seriousness there is definitely some poor soul out there stressing their brains out that probably pushed the button to set this all off.

Just a reminder, though: this is not a failure of a single person but of the organization as a whole and the policies in place.


Hmm... I'm always reminded of my professor telling me that it's never the fault of whoever pressed the button; responsibility lies with whoever decided to make them able to press a button that can cause such catastrophic issues.

Somebody from my engineering class had an internship at DuPont's main facility/production line. He was implementing something that managed to completely shut down production for an entire shift and cause a large fire; it ended up being something in the millions worth of damages from production loss and fire damage.

Intern wasn't even yelled at IIRC. He actually went on to do some very helpful things the rest of the internship. But man, did the person who let an intern be in the position to single handedly cause such a mess get absolutely fucked by his superiors.


What about the middle manager who gave that supervisor the power to put an intern in such a critical position without review? You can keep going up like that.

In most companies someone is the fall guy, depending on how much impact there is. It is not uncommon for that guy to be the CEO if the fuck-up is big enough.


Said "this is why we can't have nice things"


basically facebook deleted itself


Not only security. Also privacy! I started to see messages that I know 100% that I deleted days or weeks ago?!

https://twitter.com/Pytlicek/status/1445072626729242637


So it appears that WhatsApp are in the process of restoring from backup? Why would they need to do that if it was just a DNS issue? And why would the server be accessible while backup restoration was still in progress? I feel like there is going to be a lot more to this story when it all shakes out.


Once the DNS is back up they need to basically reboot every service. Once server one can’t talk to server two, everything is out of sync and they need to resolve this somehow. They probably have mitigation plans for a few data centers going down, but when it’s all of them at once, that’s going to be a huge pain.


Who knows. I use Pi-hole, where all DNS records are cached. Maybe this is the reason why it happens to me. And judging by Twitter (obviously), I'm not the only one who is facing this weird behaviour.


WhatsApp is (at least supposed to be) e2e, so unless they're restoring from every user's personal backup, it seems an unlikely course of action


I don't know how WhatsApp works but e2e doesn't mean that messages can't be cached/stored in their encrypted form. Actually they almost certainly are since otherwise messages couldn't be delivered to recipients while your phone is off/disconnected.


e2e does not mean FB isn't keeping the messages on their server.

it only means what is kept on their server is encrypted, that's all. e2e has no impact on message expiration.


Even before E2E, to my knowledge, WhatsApp would only store messages until they could be delivered. They never really stored your chats once they made it to their destination, so there shouldn't be any "restoring" of backups that brings back messages unless it's just a re-delivery at most. (And honestly, I'd doubt that gets backed up.)


Hopefully that's pulling from local cache or something but yikes


IMHO no. I see messages that are 1 month old, the same ones I deleted at least 2-3 weeks ago. Terrible.


It must be a local cache, because right now Facebook doesn't exist on the internet. It's not loading them from the server.


If they're restoring from backup that makes sense, right? I assume backups are read-only, so deleting messages won't delete them from the backup as well. It is sloppy, though, that you would see anything before the restore was totally done (including re-deleting messages).


Before messages had unlimited expiry, FB would auto-expire them after a few weeks. When they announced messages would remain forever, I went back to check and kept scrolling up until my arm hurt and voila! There they were: messages that expired YEARS ago were all of a sudden visible!


That happened to me on instagram (DMs) a bit of time ago too.


>In addition to stranding billions of users, the Facebook outage also has stranded its employees from communicating with one another using their internal Facebook tools. That’s because Facebook’s email and tools are all managed in house and via the same domains that are now stranded.

SinglePointOfFailure.NoRedundancies.FB


Having worked for a similar company, I remember there were some good old IRC servers up and running to communicate in an emergency just like that.


Facebook has this too, but they require facebook.com DNS to work, so they are also down.


Nice try, but they have separate communication channels for SREs so don't worry.


You mean they don't receive all their alerts through facebook messenger?


Thanos snapped his fingers and Zuckerberg vanished with the keys.

My (very large) employer had a worldwide outage a few years ago where a single bad DNS update stopped everything in its tracks (at the time many things were still in our own data centers, now more is in Amazon/etc). It took most of the day to restart everything. But it's not something most people would have noticed like FB. Thankfully I worked in mobile so not involved.


It is hard to balance dogfooding (good) with SPOF (bad); many big companies get it wrong all the time (AWS with S3, Slack in the recent past).

It is easy to get it wrong if your company provides internet services that every developer typically depends on in their workflows; you have to keep educating your own developers on how to use them, and on when not to use your own services.


Today, Facebook made the world a better place. For real, this time.


True

Although, to be fair, that is kind of like praising the arsonist after he put out the fire he started (which had already smoke-damaged the whole neighborhood).


https://downdetector.com/

lol! it's like the bicycle, appliance and consumer toilet paper shortages that resulted from changed consumer behavior during last year's lockdowns, but instead with internet distractions.

(even HN is creaking under the load, hah!)


I wonder if they managed to get caught in a catch-22 where they can not access the systems to fix it because the access control is the system.


Yeah; it sounds like that is maybe the case. Reminds me of the concern around a "dark start" if the power grid goes down where you can't bring up certain power plants because they need power to start.


Anybody who's ever played Satisfactory knows to keep a few old-school biomass generators around for that reason...

I'm guessing Facebook upgraded everything to the highest level tech and inadvertently got thrown back into the stone age.


I do know that many plants which require power to bootstrap themselves maintain emergency generation facilities (with battery backup for the diesel/natural gas engine starters). Hopefully there's a sufficient number of these to make the "dark start" issue not much of a concern.


The big problem with a dark start is bringing the grid back up (syncing frequency, overcoming initial load, calculating the load on specific lines ...). Jumpstarting a few plants is going to be the easy part.

There was a great talk on this at 32C3; unfortunately only in German: https://media.ccc.de/v/32c3-7323-wie_man_einen_blackout_veru...


I know that in Ontario, the Bruce Nuclear Plant (with about 8GW capacity) is designed to run indefinitely through a power outage and did during the Northeast blackout in 2003. I assume that sort of power would be enough to bootstrap the grid in Ontario.


Some can bootstrap yes, but it's still a giant mess and takes time to build back up. Jumpstarting it off hydro or neighbours is preferred


I believe so, at least to some degree. Anybody working remotely is almost certainly locked out unless they know the right IP address. And from what I hear, internal email is down as well.

I hear a rumor that the badge readers at the door also don't work, which would be just amazing if true... [Edit: Apparently partially confirmed.]


Facebook's BGP is not advertising any routes (as I understand things), so knowing the IP address won't help you because your ISP will have no idea how to route packets to that address.

FB have really managed to knot their own shoelaces together here.
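
A toy illustration of why the IP alone doesn't help: routers forward by longest-prefix match against the routes they have learned, and once a prefix is withdrawn there is simply nothing to match. This is only a sketch with a two-entry table and no default route; the /24 prefixes and the next-hop strings are assumptions for illustration, not a dump of any real routing table.

```python
import ipaddress


def longest_prefix_match(routes, address):
    """Return the next hop for the most specific matching prefix, or None if unroutable."""
    addr = ipaddress.ip_address(address)
    candidates = [net for net in routes if addr in net]
    if not candidates:
        return None
    best = max(candidates, key=lambda net: net.prefixlen)
    return routes[best]


# Hypothetical routing table as an ISP router might hold it (illustrative entries only).
routes = {
    ipaddress.ip_network("129.134.30.0/24"): "toward AS32934 (hypothetical next hop)",
    ipaddress.ip_network("8.8.8.0/24"): "toward AS15169 (hypothetical next hop)",
}

before = longest_prefix_match(routes, "129.134.30.12")

# Simulate the withdrawal: the prefix disappears from the table.
withdrawn = dict(routes)
del withdrawn[ipaddress.ip_network("129.134.30.0/24")]
after = longest_prefix_match(withdrawn, "129.134.30.12")

print("before withdrawal:", before)
print("after withdrawal: ", after)
```

After the withdrawal the lookup returns `None`: the address is still perfectly valid, there is just no path to it.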


Maybe they shouldn't have hired that new guy, Bobby Tables.


„commit on the first day”


How can it be allowed that two of the most used messaging apps in the world fall at the same time?

The regulators in many countries that allowed the purchase failed to protect customers and competition and helped to create a more fragile world prone to systemic disruptions.


While this is a massive inconvenience, I don't see how messaging apps like this are a government problem if they go offline. These are not state run businesses.


Imagine the SWIFT network (handles all bank transfers) going down. _Technically_, it's a private company, but it can wreak havoc on a country. Similarly, these messaging services are quite essential for some people and this dependency is only going to become stronger. So it can absolutely make sense for a country to have a fallback.


Banking and the ability to use Facebook messenger are not even remotely close to the same thing. SMS services are still alive, so are phone and other actually critical services. Messenger is nice, but really, outside of being able to send stickers and whatnot (which is nice) this isn't critical infrastructure.


Not in the U.S., perhaps.

However, many services, even government ones, depend on WhatsApp business accounts to provide essential services.

A ton of apps also use FB login as their only or primary authentication mechanism; some of them would definitely be essential services.

Maybe building on Facebook was not the best idea for these apps, but essential services would have been impacted nonetheless.


> Not in the U.S. Perhaps

Why the assumption that I'm American?


It was not specific to you.

Messenger is a bit more popular than WhatsApp in the U.S. compared to the rest of the world, where WhatsApp is sometimes the de facto replacement for SMS (phone number vs Facebook account, I guess).

I have seen WhatsApp or Telegram logos on billboards or printed media as support channels but never really Messenger anywhere. I have not seen WhatsApp commonly used as a support channel in the U.S. yet.

Since you mentioned Messenger, I qualified the U.S. out as perhaps a different market.


I see what you're saying. I still don't think there's a case to be made here for letting the government control these services to ensure uptime. The argument instead should be stop using Facebook properties for business, that's not what they're for.

The argument is essentially that these organizations have gotten so successful that they need to hand control over their infrastructure to the state, since the state will manage it better. You might not be making this argument, but some in this thread clearly are when they ask "how can this be allowed to happen". I can't think of a single system managed by the government which actually is run in a way that's as good as Facebook's networks.

Consider it this way: Facebook has crashed less in the past 10 years than the stock market. That should give people thinking of state control for reliability something to think about.


Stock Market is hardly the right comparison

Comparable would be messaging on traditional networks.

The OTP/Erlang-based AXD301 has had exceptionally high uptime; the reliability figure commonly cited is 9 nines (99.9999999%)[1], for example. The entire language stack was built primarily for telecom to have exceptional reliability and uptime. WhatsApp (at least originally) was built on Erlang and the BEAM VM.

Telecom systems have likely had lesser downtime than Facebook in the last 30 years.

[1] Uptime for the AXD301 over 20 years depends on many other factors, including hardware, architecture, etc., and the 9 nines is really the reliability of Erlang while the system was up; the caveat being that the reliability numbers cited and uptime are not directly related.
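
To put the figure in perspective, here is the back-of-the-envelope arithmetic as a small Python sketch. It simply treats "N nines" as an availability fraction over a 365.25-day year, which is my simplifying assumption and says nothing about how the AXD301 number was actually measured.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year if availability is 'N nines' (e.g. 3 -> 99.9%)."""
    unavailability = 10.0 ** -nines
    return SECONDS_PER_YEAR * unavailability


for n in (3, 5, 9):
    print(f"{n} nines: {downtime_seconds_per_year(n):.4g} seconds of downtime per year")
```

Nine nines works out to roughly 32 milliseconds of downtime per year, versus a bit over five minutes for five nines, so an outage of several hours is on a completely different scale.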


The stock market is a fine comparison. It's something that's run by, or at least heavily controlled and regulated by the federal government of many nations. Telecom services run on the telecom company's infrastructure. The government just happens to pay for them to build it where I live. The purpose of the analogy is to highlight how the state is pretty bad at managing things, even when it pours billions of dollars into doing so.

> Telecom systems have likely had lesser downtime than Facebook in the last 30 years

I've had a handful of outages in the past few years for my telecom services. Rogers' network failed in Canada in 2020 for a full day, crippling communications across the country. I've had one outage with Facebook since 2008. At worst, they're just as reliable. Regardless, the argument here really is "should the state step in and take control of Facebook's uptime because a lot of people chose to use it and it goes offline for a few hours a decade?" I still maintain that none of this is an affirmative argument.


> two of the most used messaging apps inn the world fall at the same time

Text messaging works fine. There is no serious disruption of public service.


Regulators don't _see_ every single facet of an acquisition. I'd bet they didn't even think about a scenario like this. Their concerns were probably more along the lines of anti-monopoly, preservation of competition, etc.

That said, who wants to go back to ICQ?


I assume this means that anyone that uses their FB account to login somewhere can't login once their token expires...

Hahahahaha!


You know it's a big outage when other people are writing status updates for you


Too bad Facebook can't write status updates on Facebook. They have a Twitter account tho.


So somebody messed up Facebook's BGP records and traffic couldn't be routed to Facebook servers. I wouldn't be surprised if some angry insider (employee) got his revenge on Facebook for whatever reason.


Could one single person really inflict that much damage without a balance/check or code deployment review process in place?


With the right access it wouldn't be hard for someone to configure some key routers in such a way that all traffic is blocked and no one can get into them over the network. They'd need to send someone physically to the sites to reset or replace them.

I'm still leaning heavily towards an "oopsie" with routing that accomplished the same thing, however.


Rollout gone bad is def more likely. People with that level of access aren’t stupid enough to get sued for tens of millions of $ in damages


It would not be the first time


A network engineer with enough experience to handle Facebook's DNS and BGP configurations can probably design a plausibly deniable mistake/misunderstanding/unfortunate coincidence.


By the time it gets to that point, they may likely have already been identified as not a good cultural fit, unless they were a real sleeper.


I think if you were smart enough, you may be able to mask some needed changes under some legitimate tickets. You make certain changes that you know will break stuff, but you assign a reviewer who doesn't know enough about the particular thing that they may think it seems fine.

I am talking in a very generalized sense, not for this particular issue. But I don't think the code review/deployment process is entirely safe against internal bad actors.


See: The Underhanded C Contest.

http://underhanded-c.org/

The whole point is to write C that appears on the level at first, but actually has a subtle exploitable flaw. The flaw is supposed to appear like a simple mistake for plausible deniability. Some of the winning responses are very devious.


Code reviews can potentially catch bugs and prevent an obvious inside attack but are mostly to keep the code-base healthy and consistent over time. Something that can take down multiple revenue streams for all customers should have some other check besides a peer code review.


I'm sure there are measures in place to prevent it, but at the end of the day someone almost certainly could and would have to be able to.


A system administrator with high authority can pretty much bring down the whole network stack if s/he wants to.


In general, access levels are pretty liberal at FB (less so than in the past, but still liberal compared to others).


Talk about a tactical attack. Whistleblower interview goes up. BGP weakness likely hacked. Facebook down. Facebook internal tools for communicating problem and fix also down. Everyone is WFH because of COVID.

The fix may be easy, but having all the tools and comms you need for it down is what makes it hard. It's all so interesting. Good riddance to Facebook.


Technical issue (most probably the case) or coverup?

It is amazing how far some people go with trying to explain this.


> Technical issue (most probably the case) or coverup?

The whistleblower story is all over every news site.

Turning off everyone's favorite time wasting website is the worst possible way to cover it up. How many people are typing "Facebook down" into Google and getting the Facebook whistleblower news story in the "Related News" section of their results?

It's not a coverup.


> It's not a coverup.

While I tend to agree, you can't say that definitively -- no one can at this point, except those at FB super close to this issue.

> The whistleblower story is all over every news site.

As many point out in your other comment above on this, the outage is drowning out the whistleblower story on most search engines.


I know it is not a coverup, just the question that the "world" seems to be asking.

Myself, I would put it down to either a technical issue or internal sabotage as an act of protest.


Yes, conspiracy theories amok. Mine: considering the 60 minutes piece, could be a disgruntled, internal employee as well?


The proper term is "conspiracy fantasy".


The one with Mark and wire cutters was pretty amusing.


Coverup?

I'm so curious what you mean. How could you come to the conclusion that it is anything other than a technical issue?


While it's certainly possible (likely?) that it's "just" a technical issue, the article talks about this:

> The mass outage comes just hours after CBS’s 60 Minutes aired a much-anticipated interview with Frances Haugen, the Facebook whistleblower who recently leaked a number of internal Facebook investigations showing the company knew its products were causing mass harm, and that it prioritized profits over taking bolder steps to curtail abuse on its platform — including disinformation and hate speech.

> We don’t know how or why the outages persist at Facebook and its other properties, but the changes had to have come from inside the company, as Facebook manages those records internally. Whether the changes were made maliciously or by accident is anyone’s guess at this point.

I think we can't completely rule out a possible connection to this. Again, it likely isn't connected, but that answers the question of how one might come to the conclusion.


I mean, Cambridge Analytica is the example here. Facebook has been privy to some shady shit at the very least. Is it likely that they purposefully took down all their revenue making machines to distract from the 60 minutes piece? No, probably not. But they've demonstrated that they can't be trusted so it's at least worth investigating.


The world should do 'a day without facebook' more often.


> Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

https://twitter.com/sheeraf/status/1445099150316503057

Apparently the people planning the heist went a bit overboard with their misdirection.


> The mass outage comes just hours after CBS’s 60 Minutes aired a much-anticipated interview with Frances Haugen, the Facebook whistleblower who recently leaked a number of internal Facebook investigations showing the company knew its products were causing mass harm, and that it prioritized profits over taking bolder steps to curtail abuse on its platform — including disinformation and hate speech.

I'm hoping that this is just a coincidence


I always try to side with Occam, but let me speculate here: This may be a sign of resistance from within? A hacker group so good, they were hired by FB only to carry out a huge, clandestine hack that results in FB being down for hours, if not days?


I’m just gonna say this. Disclaimer I have no knowledge nor evidence whatsoever that this may be the reality. But speculation seems to be the order of the day…

Seems like cutting their ASN off from the world would be a great way to cut off any would-be Discovery Volunteers that might try to collect evidence 4chan style to support the whistleblower’s case.


I strongly dislike how we are forced into centralizing our online life into a few big corporations. Therefore, it is somewhat nice to read that even the access cards don’t work at Facebook HQ due to them running everything via the Facebook domains.

Still, let's hope that this gets fixed soon for the engineers and users involved.


Krebs just says it's anybody's guess as to whether it's an internal screw up or a hack.


The now infamous /u/ramenporn said in his latest update that they aren't considering the attack hypothesis (yet?):

https://twitter.com/atoonk/status/1445084833017843721


Because of the level of internal access required to do this intentionally I would assume it isn't a hack, but it could be. The timing is interesting with the whistle-blower news though.

But then the timing with regard to China sending 50 military aircraft over Taiwan today is also interesting... FB and communication infrastructure would go down first in times of tension, if you want to go full tin-foil hat.

Ok ... enough news reading for me today!


> timing with regard to ...

No matter when something happens, other things will be happening in the world around the same time. That doesn't establish a correlation (China has been doing that for a while), much less causation.

