>Now, here's the fun part. @Cloudflare runs a free DNS resolver, 1.1.1.1, and lots of people use it. So Facebook etc. are down... guess what happens? People keep retrying. Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com
Believe it or not, there are places in the world where FB products (WhatsApp specifically) are used as the primary communication platform for most people.
A second comment was saying there is no point in using Signal if they're down for 2 days. FB has only been down a few hours so far, yet curiously nobody is saying the same :)
I wonder if any big DNS servers will artificially cache a long TTL NXDOMAIN response for FB to reduce their load. Done wrong, it would extend the FB outage longer.
This outage also affected whatsapp, one of the most widely-used communication technologies in the world. It also almost caused me to be locked out of my apartment, were it not for random chance and the kindness of a stranger, but I’m glad that you can feel smugly superior about it
Not OP, but such ideas usually stem from a misunderstanding of root cause. Facebook inaccessibility likely exposed poor assumptions or other flaws in e.g. "smart" devices or workflows. Those poor assumptions or other flaws are likely what got OP locked out of his apartment when Facebook went down, not Facebook itself going down.
No, that would not be direct root cause. Direct root cause would be designing and implementing an apartment-complex entry app which depends on a working internet connection, battery, and network route to a single point of failure.
>but I’m glad that you can feel smugly superior about it
And I'm glad you can feel smug about combating smugness, because obviously the consequences of some social media and chat apps being down can't be measured except by anecdotal stories of some unrelated issue like being locked out, and not by their general societal impact, shady practices, contribution to disinformation and data mining, etc. Who's being self-centered now?
If anything, the lesson here is to not depend on a single, centralized channel for such communications (e.g. to get your AirBnB key). Now I also feel smug for always giving 2-3 alternative ways in cases where contacting someone (or someone contacting me) is crucial...
It's not like what the world lacks in 2021 is communication channels. One can use a landline phone, mobile phone, SMS, email, and 200 other alternative IM outlets...
Clients weren't getting NXDOMAIN, they were getting SERVFAIL because the nameservers were unreachable. These responses cannot be cached for more than 5 minutes [1].
Yes, that's the point. If you're running a DNS server and being overwhelmed by this, you might have considered artificially injecting NXDOMAIN with a long cache value to get some relief. Which could extend the outage for FB.
Unless the operators were in direct contact with Facebook, it doesn't sound like a good idea. It's certainly not the job of the ISP to reduce an outage for FB. They also weren't sure if the outage would only be 5 minutes or 5 hours. Instead, ISPs should scale up and handle DNS traffic for outages like this. In this case, FB isn't the only company to learn a lesson or two around failure modes and how to improve in the future.
The point isn't reducing an outage for FB, it would actually extend the outage for some. The point would be to help give some relief to a DNS server you're running that's overloaded due to the FB outage...during the "crisis". Yes, of course, better planning ahead of time is nice. In any case, I didn't suggest doing this. I wondered if it was happening.
I think you missed the idea that the FB outage created a really heavy DNS load on other people's DNS servers.
No, I didn't miss the idea (and it's not an idea, it really happened.) I believe you're mistaking the role of the resolver operator and whether or not they should be manipulating client queries/responses without the user knowing. An NXDOMAIN response does not match the conditions, and shouldn't be used just to manipulate the clients.
It will have been cached closer to the edge, but once the TTL expires, so does the cache. That means all the DNS requests that would have been served via local caches end up hitting the upstream DNS servers. For a site like Facebook that will be creating an absolute deluge of requests.
Anecdotal, but the whole of the internet feels sluggish atm.
No, since the positive response will normally be cached for "some time" depending on a number of factors. The negative response, on the other hand, often won't get cached, again depending on settings.
I know you're just replying to the parent statement but unfortunately in this case the SOA went down with the ship. None of the (admittedly few) clients I've tested are caching the lack of a response for facebook.com's SOA or address records.
Yes.
I handle around a million requests per minute. I exponentially increase the cache period after subsequent misses to avoid an outage DDoSing the whole system.
This tends to be beneficial regardless of the root cause.
Edit: this is especially useful for handling search/query misses, as a query with no results is going to scan any relevant indexes etc. until it is clear no match exists, meaning a no-results query may take up more cycles than a hit.
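Purely as illustration of that approach, here is a minimal Python sketch of a negative cache whose TTL doubles on consecutive misses; the class name and the base/max TTL values are made-up assumptions, not the commenter's actual system.

    import time

    class NegativeCache:
        # Remember misses with a TTL that doubles on consecutive misses.
        # base_ttl/max_ttl are illustrative values, not production settings.
        def __init__(self, base_ttl=1.0, max_ttl=300.0):
            self.base_ttl = base_ttl
            self.max_ttl = max_ttl
            self._misses = {}  # key -> (expires_at, current_ttl)

        def should_query_backend(self, key):
            entry = self._misses.get(key)
            return entry is None or time.monotonic() >= entry[0]

        def record_miss(self, key):
            # Each repeated miss doubles the negative TTL, up to max_ttl,
            # so a dead key stops hammering the indexes during an outage.
            _, prev_ttl = self._misses.get(key, (0.0, self.base_ttl / 2))
            ttl = min(prev_ttl * 2, self.max_ttl)
            self._misses[key] = (time.monotonic() + ttl, ttl)

        def record_hit(self, key):
            # A successful lookup clears the negative entry.
            self._misses.pop(key, None)

The caller checks should_query_backend() before running the expensive query and calls record_miss() or record_hit() afterwards.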
It's remarkable the effect even short TTL caching can have given enough traffic. I recall once caching a value that was being accessed on every page load with a TTL of 1s resulting in a >99% reduction in query volume, and that's nowhere near Facebook/internet backbone scale.
Yep, pre-priming the cache rather than passively allowing it to be rebuilt by requests/queries can also result in some nice improvements, and depending on replication delay across database servers it can avoid some unexpected query results reaching the end user.
In the past I was the architect of a top-2000 Alexa-ranked social networking site; data synchronization delays were insane under certain load patterns, with write propagation delays in the high single to low double digit seconds.
It's disappointingly common for cloud-backed apps and device firmware to go into a hot retry loop on any kind of network failure. A lot of engineers just haven't heard of exponential backoff, to say nothing of being able to implement and test it properly for a scenario that almost never happens.
Even if you assume Facebook's own apps have reasonable failure logic, there's all kinds of third-party apps and devices integrating with their API that probably get it wrong. Surprise botnet!
Yes. It's basically turned every device, especially mobile devices with the app running in the background, into botnet clients which are continually hitting their DNS servers.
I don't know what Facebook's DNS cache expiration interval was, but assume it's 1 day. Now multiply the load those Facebook users normally put on DNS by whatever polling interval the apps use.
And then remember what percentage of internet traffic (requests, not bandwidth) facebook, whatsapp, and instagram make up.
> It's basically turned every device, especially mobile devices with the app running in the background, into botnet clients which are continually hitting their DNS servers
Anecdotally, it also seems to be draining the batteries of those devices with all of those extra queries. At least that seems to be what's happening on my wife's phone.
Well, everything is a bit slow for me. I'm in the UK on Virgin Media, using either Google DNS or the VM ones (I'm not sure and can't be bothered to look).
What has just happened, and it can't be coincidence, is that I lost internet connectivity about 1 hour ago, and had to reboot my Cable Modem to get it back.
I'm fairly certain that my ISP was affected by this causing an outage of all internet traffic for my network. So it seems possible, although I imagine using an alternate DNS provider should work ok (if they're not overrun by extra traffic)?
Unfortunately I'm not sure what the default DNS on the modem points to..
I've launched Wireshark to monitor the DNS traffic of roughly 5 phones. I've collected 19.8k DNS packets so far. Out of that, 5.1k packets are flagged with REFUSED or SERVFAIL. If I am not mistaken (roughly half the packets are queries and half are responses, so that's 5.1k failures out of ~9.9k responses), it means that about 51% of DNS requests fail.
Looking at queries for graph.instagram.com, it looks like there are roughly 20 attempts in a sequence before it gives up.
All in all, this could probably explain doubling of the DNS traffic. But the sample is rather small, so take it with a grain of salt.
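For anyone wanting to reproduce those numbers from their own capture, here is a rough Python/scapy sketch that counts SERVFAIL and REFUSED responses in a pcap; the capture file name is a placeholder, and the packet mix will obviously differ.

    from scapy.all import DNS, rdpcap  # pip install scapy

    SERVFAIL, REFUSED = 2, 5  # DNS RCODE values

    packets = rdpcap("dns_capture.pcap")  # placeholder capture file
    responses = failures = 0
    for pkt in packets:
        if pkt.haslayer(DNS) and pkt[DNS].qr == 1:  # qr == 1 -> response
            responses += 1
            if pkt[DNS].rcode in (SERVFAIL, REFUSED):
                failures += 1

    print(f"{failures}/{responses} DNS responses failed "
          f"({100 * failures / max(responses, 1):.0f}%)")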
Sort of, yeah. Typically a DDoS attack is done on purpose, this is a side effect of so many clients utilizing retry strategies for failed requests. But in both cases, a lot of requests are being made, which is how a DDoS attack works.
> Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com
If you aren’t using exponential backoff algorithms for your reconnect scheme - you should be!
I have a device in the field, only a few thousand total, but we saw issues when our shared cloud would go down and everyone hammered it to get back up.
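For what it's worth, a minimal sketch of exponential backoff with full jitter looks something like the Python below; connect, base_delay and max_delay are placeholders, not a recommendation for any particular service.

    import random
    import time

    def reconnect_with_backoff(connect, base_delay=1.0, max_delay=300.0):
        # Retry `connect` until it succeeds, doubling a capped delay and
        # sleeping a random time in [0, cap] so thousands of clients
        # don't all retry in lockstep after an outage.
        attempt = 0
        while True:
            try:
                return connect()
            except OSError:
                cap = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, cap))
                attempt += 1

The jitter is the important part: plain exponential backoff still synchronizes a fleet that all failed at the same moment.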
>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.
It's not crazy; people are panicking over Facebook, Instagram and WhatsApp being down and they keep trying to connect to those services. I mean, I would panic too if I were a social media junkie.
It’s not just "social media junkies", a very pretentious phrase to use considering you’re writing it in a comment on a social network. Hundreds of thousands of apps use Facebook APIs, often in the background too (including FB's own apps).
Strongly disagree. The outage has millions of people entering "Facebook" into their search engines. Most engines will conveniently put related news at the top of the search results page. The most recent and widespread Facebook-related news story is about the whistleblower.
Plus everyone has a lot of spare time to read the article now that Facebook and Instagram are down.
The outage didn't bury the story. It amplified it. Any suggestions that Facebook did this on purpose don't even make sense.
> recent and widespread Facebook-related news story is about the whistleblower
With respect I am pretty sure that the most recent and widespread Facebook-related news story is this one.
Holistically I agree that this isn't the kind of distraction Facebook wants, although it tickles me to imagine Mark in the datacenter going Rambo with a pair of wire cutters.
Yeah, but journalists are happy to connect the dots between the two stories, and honestly my brain loves the coincidence of these two things being clustered. The how is clear, though: earlier this morning, something inside Facebook caused the company to revoke key digital records that tell computers and other Internet-enabled devices how to find these destinations online.
That is in no way gonna make people forget the whistleblower story - if anything, it's gonna increase the antipathy to having a single point of failure. Face it, everyone hates FB, even the people who spend the most time on it.
> Strongly disagree. The outage has millions of people entering "Facebook" into their search engines. Most engines will conveniently put related news at the top of the search results page. The most recent and widespread Facebook-related news story is about the whistleblower.
I am seeing 0 news about the whistleblower when I google Facebook. Only outage news.
Who reads the article? If I google "Facebook" to see if there's an outage, I see the first headline that says it's an outage and leave. Maybe the curious few percent will.
1 article about the whistleblower and 2 about the outage. Both about the outage also mention the whistleblower, so you could say that's 100% of coverage at least mentions the whistleblower.
Also 1 out of 3 tweets also mentions the whistleblower.
Yeah, but reading about it and being able to communicate about it on the largest network (the one in question, too) are 2 separate phenomena. No one can go on there right now and say "I'm deleting my account, who's with me?"
Not at all. I just tried searching for "Facebook" on Google. The whistleblower story is not on the first page of search results. The outage is mentioned half a dozen times on that same page.
I assume this outage is costing millions per hour. And it's not exactly great advertising for Facebook, either. I doubt very much they would do something like this on purpose.
That doesn't make a lot of sense though - Facebook generates revenue primarily from ad traffic (on all sorts of sites). It needs to be up for reputation and to harvest ever more detail for 'improving' those ads, sure, but not for revenue. (Modulo blip from ads on its own site.)
That's what I meant by 'ads on its own site' - but I was under the impression that Facebook generated most revenue from selling data/ads for display elsewhere (as well as on Facebook.com itself, and other subsidiaries). Perhaps I was wrong about that? Quick search shows up 'audience network', but I'm not sure to what extent that's what I was thinking of.
Nope, for the most part all the ads that Facebook serves run on Facebook-owned sites and properties. They don't sell data, or have general ad placements on 3rd-party websites.
It sounds like they are not even able to serve ads, on any property. So while far from perfect, it's probably a decent estimate without doing in-depth analysis.
Right, I know that, and I usually try to avoid conspiratorial thinking, but man, Zuck doesn't make it easy.
I'm just trying to process that FB is having its historic, all-networks global outage today of all days. And I bet FB would have paid double whatever this will eventually cost them to make that story go away.
If it was intentional, that's serious jail time territory. That's a high price to pay for such limited downtime. I'm pretty sure an intentionally malicious actor with that type of access could do much worse things.
I'm curious as to what law, exactly, they would be breaking. Sabotage in the US code is defined mostly in terms of war material and damages done to physical "national defense" properties. Certainly an employee would be fired and sued by the company, but is deliberately changing a routing policy (and not something like a worm or virus that deletes or otherwise degrades hardware and software) a crime?
IANAL but I would assume computer fraud and abuse act:
(5)(a)knowingly causes the transmission of a program, information, code, or command, and as a result of such conduct, intentionally causes damage without authorization, to a protected computer;
In the cases cited under the CFAA (such as https://scholar.google.com/scholar_case?case=124545279862007...) it seems the employee deleted data and private info. In this case, no data was deleted or other computing property damaged it just became unreachable.
Proof of intent is a significant burden placed upon prosecution. If that can be overcome, there’s legal precedent for criminal conviction namely under the CFAA.
I’m pretty sure the vast majority of entry level spy craft is about convincing people to do highly illegal and destructive things from a place of fear.
Not saying this is the work of spies, just that it’s not unimaginable to think some middle manager could convince themselves or a subordinate to do something drastically illegal out of some fear that terrible things would happen otherwise.
bingo. I don't care whether it's in the realm of tinfoil hat or not, this is the very real effect that this outage has had. By the time Facebook is back up, people on Facebook will be talking about the outage, not about the whistle blower report. Intentional or not, it will certainly be in Facebook's favor.
Facebook controls the algorithm; wouldn't they just be able to down-amplify how much that story is spread on its network (rather than resort to this)?
Just to clarify...I pretty obviously don't think that Facebook intentionally pulled the plug to suppress a critical story. But the inadvertent effect of the downtime is nonetheless the fact that the critical story will not be the center of discussion on Facebook when Facebook is back up.
I love a good tinfoil hat theory, but in this case I doubt it. I have FB blocked on my network via pihole, but I don't explicitly block Instagram. Until sometime late last week (I noticed on Saturday), blocking facebook.com also blocked Instagram. As of this weekend, Instagram works just fine even with those blocks in place.
I suspect Facebook was making some change to their DNS generally, and they made some kind of mistake in deployment that blew up this morning.
They deployed this morning; that doesn't imply they implemented anything today. I can't think of a better time: that way you have the whole week to work on anything the deploy uncovers. Or, in the case of something this big, they have the rest of the day to freak out.
Nah, a ticking time-bomb would "explode" on Christmas (or Aïd El Kebir, etc.), whenever most of the employees who could do something about it are absent.
Still wasn't clear enough with my analogy! I was thinking more like a dam failure due to operator/designer error, not sabotage (but who knows). The damage is really small signs initially, followed by rolling catastrophic failures.
Counterpoint: I had not even heard about the whistle-blower until seeing stories about the outage. One of the largest web services in the world being out of commission for multiple hours is a big deal in 2021. It's a top story on most news sites and other social media (e.g. here at HN, reddit, twitter). If you want something to pass under the radar, it's probably best to not attract global attention.
Most people outside of the US don't even know what "60 Minutes" is. Even fewer have heard about that report. And even fewer care. But everyone has now heard about the outage. This would be the worst possible way of trying to stop the spread of the story.
The more likely scenario is that this was the final straw for some disgruntled employee who decided to pull the plug on the entire thing.
Agree. I just did a quick check and 60 Minutes averages around 10 million viewers. It's not like in 1977, when something like 20%+ of the US population was watching that show.
> they leaked information about 20% of the earth's population
This is straight up false. It was scrapers extracting data from public profiles. They already incorporate anti-scraping techniques, so there's not much they can do other than require everyone to set their profile to private.
If you don't collect the data in one place, there's no chance of leaking it.
If they want to position themselves as the global phonebook, that's fine, but they should be open about that.
Edit to add: If you aren't in the "gather and sell access to everybody's data" business, "private" is a sensible default setting for that information. On the other hand, if you're Facebook...
If we're in "tinhat" territory: it seems extremely odd to me that this whistelblower seems to be "blowing the whistle" on the fact that facebook isn't doing enough to control what people are thinking and talking about.
Like...what? "Brave whistelblower comes out showing that facebook isn't doing enough to control what you are thinking!" is sortof arguing past the question. Should facecbook be in charge of deciding what you think?
> this whistleblower seems to be "blowing the whistle" on the fact that Facebook isn't doing enough to control what people are thinking and talking about.
That is not at all what the whistleblower is alleging. Facebook already controls what content you are seeing through its news feed algorithm. The parameters to that algorithm are not a 1-dimensional "how much control", but instead uses engagement metrics for what content to show. The whistleblower claims that the engagement optimization, according to facebooks own research, prioritizes emotionally angry/hurtful/divisive content.
We all knew Facebook is bad for society. The whistleblower showed us that Facebook has done internal studies and that these studies have shown their products are bad for society/contributed to the insurrection/promote human trafficking/damage teen mental health/etc. But even with these studies, Facebook has decided to prioritize growth and revenue, rather than fix the issues that are bad for society. What this whistleblower leaked will hopefully lead to some sort of government regulation on social media.
Without regulation, social media will always prioritize profit.
They are exercising that power already; they are just explicitly doing so in a way that tears down trust in society, because that makes them money, rather than encouraging a less divisive and more fact-based conversation, because that doesn't make them as much money.
> The outage has pretty much buried that story, and perhaps more importantly, stopped its spread on FB networks.
Buried the news ... which is basically as noteworthy as the news that water is still wet. What exactly did she reveal that was not known before, or is it somehow newsworthy that Facebook also knew what everyone else knew? The real news ought to be how that managed to make it to the headlines.
As much as I'd love to imagine FB rage-quitting the internet because people don't seem to appreciate them enough, I'm pretty sure it's a coincidence. Probably has more to do with it being Monday (you don't put big stories on Friday and you sure don't deploy config changes on Friday!) than anything else.
Ah yes, the best way to bury a moral scandal of the kind that usually gets forgotten in a week is to undermine the trust of almost every single user worldwide. This is a very good conspiracy.
I see it as similar to Snowden, in the sense that everybody kind of knew (actually guessed) but now we actually know. It doesn't come as a shock, but it's important information to have since it can be now argued with authority.
The whistleblower revealed that Facebook knows it is bad for society. The documents also show Facebook actively optimizes its algorithms for "bad for society" content because that drives engagement which makes them more money. Furthermore Facebook doesn't do as much content moderation in regions/languages with low usage numbers because it costs more than those users make them. So calls for genocide in Myanmar basically go unchallenged and unmoderated because Facebook doesn't make much money in Myanmar. Sorry genocided minority, you should have been more valuable to Facebook.
> would actually agree to carry out something like this intentionally.
Well, they work for Facebook. In my opinion you would have to have no morals to join that corporation in the first place, so I can imagine such an ask would be just another dirty task to do. They seem to love it.
The story that a woman at Facebook doesn't think they're going far enough to control speech they hate and bad-thoughts?
I think Facebook is awful, but her primary complaint seemed to me that she lacked controls for what people like her, you know, the good people have access to prevent anyone else from seeing. That she was powerless to stop users from saying the wrong things. How was her motivation anything but a desire for more authoritarianism? She said she specifically took the job on the condition she could monitor and direct posts to prevent the wrong info from being online, that's the last type of person you want in that position, the one that wants it.
I expect that we're still pretending Facebook is "just a private business", despite it being unlike any in history and that the ties to government are completely benign.
I'm not saying she was wrong in any claim about internal discussions. But, if you can not imagine yourself being on the wrong side of someone like that, you have limited imagination.
Facebook is surprisingly tolerant of controversial subjects. YouTube has gone scorched earth on millions of channels and deleted years of work by many people. Facebook was far more lenient: you could talk about non-official covid information, for example, where YouTube deleted anything that wasn't the official narrative with extreme prejudice. Given how much bad stuff is happening all over the world to sacrifice freedom and get everyone to toe the official line on Covid (complete science-fiction-level totalitarianism), I am sure Facebook made some very powerful and determined enemies with its more lenient stance. I was downvoted earlier for saying this was an intentional takedown and deleted my comment, but now I think this could be a full-blown William Gibson Neuromancer cyberpunk-level corporate takedown attempt in progress!
She said she wanted FB to do something to stop misinformation and hate speech but what we've seen from Reddit is that "are mRNA vaccines actually safe?" becomes misinformation and "we shouldn't perform elective life-altering surgery on pre-teen children" becomes hate speech. There's not much I applaud Facebook for, but not listening to this woman is one of the few I do.
It also looks like it's much deeper than just people not finding the site. Employees are all locked out, and there's another story on the HN front page saying employees are locked out of the buildings as well.
If you wanted to scrub a lot of the data and nefarious evidence the whistle blower brought out, this would be a great way to do it, under the guise of a simple "employee screw up" cover story.
It's hard for me not to think something more nefarious is afoot, considering FB's track record with a myriad of other things. At this point, it seems more likely that something sketchy is going on and not just some random employee who screwed up and brought down the entire network with a simple change. I would assume there are several layers of decision makers who oversee the BGP records. I have a hard time thinking one person had sole access to these and brought everything down with an innocent change.
FB has too many smart people to allow a single point of failure for their entire network such that, if it goes down, it becomes "a simple error on the part of some random employee". This is not some junior dev who broke the build; it's far more serious than that.
"As a result, when one types Facebook.com into a web browser, the browser has no idea where to find Facebook.com, and so returns an error page."
Not quite.
Many DoH servers are working fine. DNS isn't a problem for the browser, but it seems to be a problem for Facebook's internal setup. It's like their proxy configuration is 100% reliant on DNS lookups in order to find backends.
The FB content servers are reachable. It is only the Facebook DNS servers that are unreachable.
Don't take my word for it, try for yourself
www.facebook.com 1 IN A 179.60.192.3 (content)
static.facebook.com 1 IN A 157.240.21.16 (content)
a.ns.facebook.com 1 IN A 129.134.30.12 (DNS)
ping -c1 157.240.21.16 |grep -A1 statistics
--- 157.240.21.16 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
ping -c1 179.60.192.3|grep -A1 statistics
--- 179.60.192.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
ping -c1 -W2 129.134.30.12 |grep -A1 statistics
--- 129.134.30.12 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
The browser, i.e. the client (here, curl), has an idea where to find Facebook.com
links -dump index.htm
[IMG]
Sorry, something went wrong.
We're working on it and we'll get it fixed as soon as we can.
Go Back
Facebook (c) 2020 . Help Center
grep HTTP index.htm
HTTP/1.1 503 No server is available for the request
Even if authoritative DNS servers are unreachable, there are, for example, multiple, authorised, public scans of port 53 available for free download. There are also passive DNS datasets also available for free. While recursive caches may be convenient, IMO they are the least trustworthy sources of DNS data.
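If you'd rather not rely on ICMP, here is a small Python sketch that tries a plain TCP connection to the same addresses (443 for the content hosts, 53 for the nameserver); the IPs are the ones from the lookups above and may well have changed by the time you read this.

    import socket

    # IPs from the lookups above: two content hosts and one FB nameserver.
    targets = {
        "www.facebook.com (content)": ("179.60.192.3", 443),
        "static.facebook.com (content)": ("157.240.21.16", 443),
        "a.ns.facebook.com (DNS over TCP)": ("129.134.30.12", 53),
    }

    for name, (ip, port) in targets.items():
        try:
            with socket.create_connection((ip, port), timeout=2):
                print(f"{name}: reachable at {ip}:{port}")
        except OSError as exc:
            print(f"{name}: unreachable at {ip}:{port} ({exc})")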
Due to DNS being busted, all internal FB services/tooling that they'd use to push DNS config updates are probably completely inaccessible. Someone at FB will have to manually SSH into a production host (assuming they can even identify the right one), and issue some commands to repopulate the DNS records. They'll probably have to do this without any access to internal wikis, documentation, or code.
Keeping those poor network engineers in our thoughts.
Hmm... I'm always reminded of my professor telling me that it's never the fault of whoever pressed the button; responsibility lies with whoever decided to make them able to press a button that can cause such catastrophic issues.
Somebody from my engineering class had an internship at DuPont's main facility/production line. He was implementing something that managed to completely shut down production for an entire shift and cause a large fire; it ended up being something in the millions in damages from production loss and fire damage.
Intern wasn't even yelled at IIRC. He actually went on to do some very helpful things the rest of the internship. But man, did the person who let an intern be in the position to single handedly cause such a mess get absolutely fucked by his superiors.
What about the middle manager who gave that supervisor the power to put an intern in such a critical position without review? You can keep going up like that.
In most companies someone is the fall guy, depending on how much impact there is. It is not uncommon for that fall guy to be the CEO if the fuck up is big enough.
So it appears that WhatsApp are in the process of restoring from backup? Why would they need to do that if it was just a DNS issue? And why would the server be accessible while backup restoration was still in progress? I feel like there is going to be a lot more to this story when it all shakes out.
Once the DNS is back up they need to basically reboot every service. Once server one can’t talk to server two, everything is out of sync and they need to resolve this somehow. They probably have mitigation plans for a few data centers going down, but when it’s all of them at once, that’s going to be a huge pain.
Who knows. I use PiHole where all DNS records are cached. Maybe this is the reason why it happens to me. And regards Twitter (obviously), I'm not the only one who is facing this weird behaviour.
I don't know how WhatsApp works but e2e doesn't mean that messages can't be cached/stored in their encrypted form. Actually they almost certainly are since otherwise messages couldn't be delivered to recipients while your phone is off/disconnected.
Even before E2E - to my knowledge, whatsapp would only store messages until they could be delivered. They never really stored your chats once they made it to their destination - there shouldn't be any "restoring" of backups that brings back messages unless it's just a re-delivery at most. (And honestly, i'd doubt that gets backed up).
If they're restoring from backup, that makes sense, right? I assume backups are read-only, so deleting messages won't delete them from the backup too. It is sloppy, though, that you would see anything before the restore was totally done (including re-deleting messages).
Before messages had unlimited expiry, FB would auto-expire them after a few weeks. When they announced messages would remain forever, I went back to check and kept scrolling up until my arm hurt, and voila! There they were: messages that expired YEARS ago were all of a sudden visible!
>In addition to stranding billions of users, the Facebook outage also has stranded its employees from communicating with one another using their internal Facebook tools. That’s because Facebook’s email and tools are all managed in house and via the same domains that are now stranded.
Thanos snapped his fingers and Zuckerberg vanished with the keys.
My (very large) employer had a worldwide outage a few years ago where a single bad DNS update stopped everything in its tracks (at the time many things were still in our own data centers, now more is in Amazon/etc). It took most of the day to restart everything. But it's not something most people would have noticed like FB. Thankfully I worked in mobile so not involved.
It is hard to balance dogfooding (good) with SPOF (bad), many big companies do get it wrong (AWS with S3, Slack in the recent past) all the time.
It is easy to get wrong if your company provides internet services that every developer typically depends on in their workflows; you have to keep educating your own developers on how to use them and when not to rely on your own services.
Although, to be fair, that is kind of like praising the arsonist after he put out the fire he started (which had already smoke-damaged the whole neighborhood).
lol! it's like the bicycle, appliance and consumer toilet paper shortages that resulted from changed consumer behavior during last year's lockdowns, but instead with internet distractions.
Yeah; it sounds like that is maybe the case. Reminds me of the concern around a "black start" if the power grid goes down, where you can't bring up certain power plants because they need power to start.
I do know that many plants which require power to bootstrap themselves maintain emergency generation facilities (with battery backup for the diesel/natural gas engine starters). Hopefully there's a sufficient number of these to make the black-start issue not much of a concern.
The big problem with a black start is bringing the grid back up (syncing frequency, overcoming initial load, calculating the load on specific lines ...). Jumpstarting a few plants is going to be the easy part.
I know that in Ontario, the Bruce Nuclear Plant (with about 8GW capacity) is designed to run indefinitely through a power outage and did during the Northeast blackout in 2003. I assume that sort of power would be enough to bootstrap the grid in Ontario.
I believe so, at least to some degree. Anybody working remotely is almost certainly locked out unless they know the right IP address. And from what I hear, internal email is down as well.
I hear a rumor that the badge readers at the door also don't work, which would be just amazing if true... [Edit: Apparently partially confirmed.]
Facebook's BGP is not advertising any routes (as I understand things), so knowing the IP address won't help you because your ISP will have no idea how to route packets to that address.
FB have really managed to knot their own shoelaces together here.
How can it be allowed that two of the most used messaging apps in the world fall at the same time?
The regulators in many countries that allowed the purchase failed to protect customers and competition and helped to create a more fragile world prone to systemic disruptions.
While this is a massive inconvenience, I don't see how messaging apps like this are a government problem if they go offline. These are not state run businesses.
Imagine the SWIFT network (which handles interbank transfers) going down. _Technically_, it's a private company, but it can wreak havoc on a country. Similarly, these messaging services are quite essential for some people and this dependency is only going to become stronger. So it can absolutely make sense for a country to have a fallback.
Banking and the ability to use Facebook messenger are not even remotely close to the same thing. SMS services are still alive, so are phone and other actually critical services. Messenger is nice, but really, outside of being able to send stickers and whatnot (which is nice) this isn't critical infrastructure.
Messenger is a bit more popular than WhatsApp in the U.S. compared to rest of the world where WhatsApp is sometimes defacto replacement for SMS (phone number vs Facebook account I guess).
I have seen WhatsApp or Telegram logos on billboards or in printed media as support channels, but never really Messenger anywhere. I have not yet commonly seen WhatsApp as a support channel in the U.S.
Since you mentioned Messenger, I called out the U.S. as perhaps a different market.
I see what you're saying. I still don't think there's a case to be made here for letting the government control these services to ensure uptime. The argument instead should be stop using Facebook properties for business, that's not what they're for.
The argument is essentially that these organizations have gotten so successful that they need to hand control over their infrastructure to the state, since the state will manage it better. You might not be making this argument, but some in this thread clearly are when they ask "how can this be allowed to happen". I can't think of a single system managed by the government which actually is run in a way that's as good as Facebook's networks.
Consider it this way: Facebook has crashed less in the past 10 years than the stock market. That should give people thinking of state control for reliability something to think about.
Comparable would be messaging on traditional networks.
The OTP/Erlang-based AXD301, for example, has had exceptionally high uptime; the reliability figures commonly cited are 9 nines (99.9999999%)[1]. The entire language stack was built primarily for telecoms to have exceptional reliability and uptime. WhatsApp (at least originally) was built on Erlang and the BEAM VM.
Telecom systems have likely had lesser downtime than Facebook in the last 30 years.
[1] Uptime for the AXD301 over 20 years depends on many other factors, including hardware, architecture, etc.; the 9 nines is really the reliability of the Erlang code while the system was up. The caveat is that the cited reliability numbers and uptime are not directly related.
The stock market is a fine comparison. It's something that's run by, or at least heavily controlled and regulated by the federal government of many nations. Telecom services run on the telecom company's infrastructure. The government just happens to pay for them to build it where I live. The purpose of the analogy is to highlight how the state is pretty bad at managing things, even when it pours billions of dollars into doing so.
> Telecom systems have likely had lesser downtime than Facebook in the last 30 years
I've had a handful of outages in the past few years for my telecom services. Rogers' network failed in Canada in 2020 for a full day, crippling communications across the country. I've had one outage with Facebook since 2008. At worst, they're just as reliable. Regardless, the argument here really is "should the state step in and take control of Facebook's uptime because a lot of people chose to use it and it goes offline for a few hours a decade?". I still maintain that none of this is an affirmative argument.
Regulators don't _see_ every single facet of an acquisition. I'd bet they didn't even think about a scenario like this. Their concerns were probably more along the lines of anti-monopoly, preservation of competition, etc.
So somebody messed up Facebook's BGP records and traffic couldn't be routed to Facebook servers. I wouldn't be surprised if some angry insider(employee) got his revenge on Facebook for whatever reason.
With the right access it wouldn't be hard for someone to configure some key routers in such a way that all traffic is blocked and no one can get into them over the network. They'd need to send someone physically to the sites to reset or replace them.
I'm still leaning heavily towards an "oopsie" with routing that accomplished the same thing, however.
A network engineer with enough experience to handle Facebook's DNS and BGP configurations can probably design a plausibly deniable mistake/misunderstanding/unfortunate coincidence.
I think if you were smart enough, you may be able to mask some needed changes under some legitimate tickets. You make certain changes that you know will break stuff, but you assign a reviewer who doesn't know enough about the particular thing that they may think it seems fine.
I am talking in a very generalized sense, not for this particular issue. But I don't think the code review/deployment process is entirely safe against internal bad actors.
The whole point (this is the Underhanded C Contest) is to write C that appears on the level at first, but actually has a subtle exploitable flaw. The flaw is supposed to look like a simple mistake, for plausible deniability. Some of the winning entries are very devious.
Code reviews can potentially catch bugs and prevent an obvious inside attack but are mostly to keep the code-base healthy and consistent over time. Something that can take down multiple revenue streams for all customers should have some other check besides a peer code review.
Talk about a tactical attack.
Whistleblower interview goes up.
BGP weakness likely hacked.
Facebook down.
Facebook internal tools for communicating problem and fix also down.
Everyone is WFH because of COVID.
The fix may have been easy; having all the tools and comms you'd need to fix it also be down is what makes it hard. It's all so interesting. Good riddance to Facebook.
> Technical issue (most probably the case) or coverup?
The whistleblower story is all over every news site.
Turning off everyone's favorite time wasting website is the worst possible way to cover it up. How many people are typing "Facebook down" into Google and getting the Facebook whistleblower news story in the "Related News" section of their results?
While it's certainly possible (likely?) that it's "just" a technical issue, the article talks about this:
> The mass outage comes just hours after CBS’s 60 Minutes aired a much-anticipated interview with Frances Haugen, the Facebook whistleblower who recently leaked a number of internal Facebook investigations showing the company knew its products were causing mass harm, and that it prioritized profits over taking bolder steps to curtail abuse on its platform — including disinformation and hate speech.
> We don’t know how or why the outages persist at Facebook and its other properties, but the changes had to have come from inside the company, as Facebook manages those records internally. Whether the changes were made maliciously or by accident is anyone’s guess at this point.
I think we can't completely rule out the possible connection to this. Again, likely isn't, but answering the question how one might come to the conclusion.
I mean, Cambridge Analytica is the example here. Facebook has been privy to some shady shit at the very least. Is it likely that they purposefully took down all their revenue making machines to distract from the 60 minutes piece? No, probably not. But they've demonstrated that they can't be trusted so it's at least worth investigating.
> Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
> The mass outage comes just hours after CBS’s 60 Minutes aired a much-anticipated interview with Frances Haugen, the Facebook whistleblower who recently leaked a number of internal Facebook investigations showing the company knew its products were causing mass harm, and that it prioritized profits over taking bolder steps to curtail abuse on its platform — including disinformation and hate speech.
I always try to side with Occam, but let me speculate here: this may be a sign of resistance from within? A hacker group so good they were hired by FB, only to carry out a huge, clandestine hack that results in FB being down for hours, if not days?
I’m just gonna say this. Disclaimer I have no knowledge nor evidence whatsoever that this may be the reality. But speculation seems to be the order of the day…
Seems like cutting their ASN off from the world would be a great way to cut off any would-be Discovery Volunteers that might try to collect evidence 4chan style to support the whistleblower’s case.
I strongly dislike how we are forced into centralizing our online life into a few big corporations. Therefore, it is somewhat nice to read that even the access cards don’t work at Facebook HQ due to them running everything via the Facebook domains.
Still, let's hope that this gets fixed soon for the engineers and users involved.
Because of the level of internal access required to do this intentionally I would assume it isn't a hack, but it could be. The timing is interesting with the whistle-blower news though.
But then the timing with regard to China sending 50 military aircraft over Taiwan today is also interesting... FB and communication infrastructure would go down first in times of tension, if you want to go full tin-foil hat.
No matter when something happens, other things will be happening in the world around the same time. That doesn't establish a correlation (China has been doing that for awhile), much less causation.
I worked at WhatsApp until Aug 2019. WhatsApp hasn't been on completely separate infrastructure for quite some time. It's in FB datacenters, so if FB BGP is messed up, so is WA. There is a separate WA ASN that's used for DNS, but it's still FB infra, and announced through the FB ASN, so that doesn't help either. Instagram still has their DNS with AWS Route 53, so their DNS is still up, but their site isn't because it runs on FB Infra too.
Within the FB datacenters, WhatsApp is somewhat isolated (at least when I left, chat was run on dedicated hosts allocated to WhatsApp only, but using FB container orchestration tools, and FB specified hardware, etc).
Edit to add: WhatsApp has some mitigations against DNS not working, but in this case, it looks like DNS being dead was a symptom of something (probably BGP config error, if the Reddit post earlier was accurate) and that something also broke the underlying servers.
If the issues is DNS-related or routing-related then since it's all owned by FB, they own the records too so it's likely that literally everything that has a DNS record under the FB cost code will be hit.
The WhatsApp servers will be sitting idle most likely!
Leaving aside whether or not this is true, why would you apply DNS changes to all of your properties at once? Especially if it's a sensitive change?
Seems much more likely to me that WA/IG/Oculus somehow rely on facebook.com behind the scenes than that all FB domains were affected by a config change.
Human error when preparing such a big update, and scheduling it for the (European) afternoon and the Asian evening. Maybe some kind of safety system in the code should have caught this exception and not let the computer take control of this.
So it's Facebook's fault, also for the hundreds of other online companies that lost money.
Tomorrow the issue will be fixed; stocks will return to normal, even higher than they were yesterday.
It will be like this never happened. FB will live on.
Not all minutes are equal, so the real number is likely well north of $200k/min. And that is a lot, even for Facebook, with a large enough number of minutes. We're at >4 hours. So they've likely lost north of $50 million in advertising revenue today.
Ops folks: do you have dedicated networking hardware you can push config changes to as a sandbox of prod? Does Facebook? Do they get simulated or shadowed traffic for pre-prod testing?
My guess is no, but I’ve never really worked in a big DC.
My first job was working in data centers for telcos, and my impression was that everything was one cable trip away from never working again.
Networks were really complex, nothing was documented nor deployed as code, most equipment were untestable black boxes, people who deployed stuff moved on, etc... Just thinking about working with on prem again gives me chills.
FB's mistake was using this kind of complicated over-engineered setup. It works great when it works... but when it does not, it blows up everything and its complexity means recovery is extremely complicated.
Imagine if Google went down like this for 8 hours. No Gmail, Google Search, Google Maps, Google Drive, YouTube? I thought these companies were a little more fault tolerant.
This is why, even with keycards, you need __key__ disaster recovery employees to have real keys that really go through the locks and let them in to do what must be done.
Signal is up - everything seems slow online, though. Getting normal congestion at the ISP level, looks like the side-effects on the web in general are noticeable when facebook goes hard down like this.
It's not for me. But several services seem to be a lot slower, possibly due to network effects (haha) on DNS as many devices are repeatedly trying lookup FB domains and not finding them, so they just try again and again.
Why does the article claim the change originated at Facebook? Updates to BGP routing are not authenticated. BGP hijacking is a real thing. To the best of my understanding, another well-positioned AS could publish this evil update to BGP routing tables.
I wrote about what is going on today with FaceBook and many other social media sites long ago. Market-driven social media platforms end up becoming destructive in behavior on their user base over time because profit demand from investors grows over time driving bad practices.
Tom from Myspace really had the concept right. There's no reason why he shouldn't be on CNN right now speaking about what is going on as an informed consultant.
They may possibly be covertly cleaning up obviously harmful content and evidence behind the curtains now that they are closed. Just speculation/opinion, not proven fact in any way though...
Many sites and apps on the Internet also rely on FaceBook for authentication and analytical tracking, so that may explain some cases of service and site outages, but all social media sites operate under the same cloud of non-transparent and profit driven mystery.
Congress is overdue in protecting citizens from psychological, financial, and emotional manipulation, but first they need THE RIGHT people educating them about how to recognize the underlying issues in modern IT and algorithms.
This is a major point in the Internet's history, a point where everything may change.
>Now, here's the fun part. @Cloudflare runs a free DNS resolver, 1.1.1.1, and lots of people use it. So Facebook etc. are down... guess what happens? People keep retrying. Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com
https://twitter.com/jgrahamc/status/1445066136547217413
>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.
https://twitter.com/awlnx/status/1445072441886265355
>This is frontend DNS stats from one of the smaller ISPs I operate. DNS traffic has almost doubled.
https://twitter.com/TheodoreBaschak/status/14450732299707637...