A Facebook crawler was making 7M requests per day to my stupid website (napolux.com)
1074 points by napolux on June 11, 2020 | 407 comments



We've had the same issue. They were doing huge bursts of tens of thousands of requests in very short time several times a day. The bots didn't identify as FB (used "spoofed" UAs) but were all coming from FB owned netblocks. I've contacted FB about it, but they couldn't figure out why this was happening and didn't solve the problem. I found out that there is an option in the FB Catalog manager that lets FB auto-remove items from the catalog when their destination page is gone. Disabling this option solved the issue.


I just had an idea: if you control your own name server I believe you could use a BIND view to send all their own traffic to themselves based on the source address.

By the way, if someone discovers how to trigger this issue it would be easy to use it as a DOS pseudo-botnet.
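
Roughly the shape of what I mean, as a sketch only - the netblocks are examples pulled from elsewhere in this thread, not an authoritative list, and the zone file names are placeholders:

    acl "fb-crawlers" {
        31.13.64.0/18;
        66.220.144.0/20;
        69.171.224.0/19;
        2a03:2880::/29;
    };

    view "facebook" {
        match-clients { "fb-crawlers"; };
        zone "example.com" {
            type master;
            # A/AAAA records in this file point back at Facebook's own hosts
            file "db.example.com.fb";
        };
    };

    view "default" {
        match-clients { any; };
        zone "example.com" {
            type master;
            # the real records for everyone else
            file "db.example.com";
        };
    };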


I had a different idea. Maybe you could craft a zip-bomb response. The bot would fetch the small gzipped content and upon extraction discover it was GBs of data? Not sure that's possible here, when responding to a request, but that would surely turn the admins' attention to it.


Here's an example of things you can do against malicious crawlers: http://www.hackerfactor.com/blog/index.php?/archives/762-Att....


Thanks a lot! I've spent like 2h now reading his blog. Just amazing.


fascinating read. :)


It is!


If their client asks for gzip compression of the http traffic, you could do it.


Nice idea, I like the thinking. I'll tuck that away for use later. PowerDNS has Lua built in, amongst a few other things.

My stack of projects to do is growing at a hell of a rate and I'm not popping them off the stack fast enough.


I know the feeling, it's one of the reasons I'm working on https://github.com/hofstadter-io/hof

Check out the code generation parts and modules, they are the most mature. We have HRDs (like CRDs in k8s for anything) and a scripting language between bash and Python coming out soon too.


I've tried to work out what your project does but I'm none the wiser. GEB is prominent on my bookshelf. I'm a sysadmin and I got as far as "hollow wold" in Go or was it "Hail Marrow"? Can't remember.

I've checked out your repo for a look over tomorrow when I'm cough sober!


Stop by gitter and I'd be happy to explain more


Something about this idea sits uncomfortably with me. I also just had an idea / thought experiment based on your idea.

We think of net neutrality as being for carriers and ISPs, but you could see it applied to a publicly accessible DNS service too. These DNS service providers are just as much part of the core service of the Internet as anyone else. It’s not a huge leap to require that those who operate a publicly accessible DNS service are bound by the same spirit of the regulations: that the infrastructure must not discriminate based on who is using it.

It’s different to operating a discriminatory firewall. DNS is a cacheable public service with bad consequences if poisonous data ends up in the system. Fiddling with DNS like this doesn’t seem like a good idea. Too much weird and bad stuff could go wrong.

Another analogy would be to the use of encryption on amateur radio. It seems like an innocuously good idea, but the radio waves were held open in public trust for public use. If you let them be used for a different (though arguably more useful) purpose, then the resource ends up being degraded.

Also along these lines of thought [begin irony mode]: FCC fines for DNS wildcard abuse / usage.


Principled neutrality is fine for acceptable use. There’s no moral quandary in closing the door to abusers.


Isn't that the argument that providers make for wanting to meter usage? I.e. video streamers, torrenters, Netflix and the like are 'abusing' the network by using a disproportionate amount of their capacity / bandwidth?

I guess my point is that "abuse" in this sense is pretty subjective.


Not really, those things are easy to contrast. Network providers have always been comfortable blackholing DoS routes, and it’s never been controversial. That’s clearly distinct from those wanting to double-dip on transport revenues for routine traffic.

The difference is in whether both endpoints want the traffic, not whether (or on what basis) the enabling infrastructure wants to bear it.


> There’s no moral quandary in closing the door to abusers

Doesn't necessarily apply to this conversation, but the moral mistake that people (and societies) frequently make is underestimating the nuance that should be exercised when identifying others as abusers.


if only there was a way to accidentally amplify that.


You could just probably send an HTTP redirect to their own site, no need to play with DNS for that


Probably better to find the highest-level FB employee's personal site that you can and send the millions of requests from FB there.


Except then you'd still have to deal with all the traffic.


>> The bots didn't identify as FB (used "spoofed" UAs)

That's surprising. What were the spoofed user agents that they used?

We've run into this issue also, but all Facebook bot activity had user agents that contained the string "facebookexternalhit".


I've seen these two user-agents from FB IPs, maybe others:

Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53

Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1

Which also execute javascript in some modified sandbox or something, causing errors, and executing error handlers. Interesting attempt to analyze the crawler here: https://github.com/aFarkas/lazysizes/issues/520#issuecomment...


Yep, these were the UAs we also saw (amongst others). And also in our case those bots were executing the JS, even hitting our Google Analytics. For some reason GA reported this traffic as coming from Peru and the Philippines, while an IP lookup showed it belonged to FB, registered in the US or Ireland.


Set up a robots.txt that disallows Facebook crawlers, sue Facebook if the crawling continues for unauthorized access to computer systems, profit.
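
For what it's worth, a robots.txt for just the FB crawlers could look something like this (facebookexternalhit is the documented crawler token; facebookcatalog is another one that shows up in logs later in this thread):

    User-agent: facebookexternalhit
    Disallow: /

    User-agent: facebookcatalog
    Disallow: /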


I believe this only works if it is coming from a corporation and targeted at an individual -_-


robots.txt is not a legal document. It is asking nicely, and plenty of crawlers purposefully ignore it.


I'm not a lawyer, but I believe I remember people being sued for essentially making a GET request to a URL they weren't supposed to GET.


Legal document is a tricky phrase to use. "No trespassing" signs are usually considered sufficient to justify prosecution for trespassing. If the sign is conspicuously placed, it does not usually matter if you actually see the sign or not.

I am not as familiar with law around accessing computer systems, but I imagine that given some of the draconian enforcement we've seen in the past that a robots.txt should be sufficient to support some legal action against someone who disregards it.


The no trespassing sign does not mean I can't yell at you from the street, 'tell me your life story,' which you are free to ignore.


Sure, but federal law does prevent you from repeatedly placing that request into my mailbox. I don't even need a sign for that.

Making an http request does not fit cleanly as an analogue to yelling from the street, nor does it fit as an analogue to throwing a written request on a brick through a window. It is something different that must be understood on its own terms.


Side question: how did you get in contact with facebook? I've an ad account that was suspended last year and gave up trying to contact them.


Similar experience, closed ad account because of “suspicious activity”, at least ten support tickets (half closed automatically), four lame apologies (our system says no, sry) and then finally, “there was an error in our system, you’re good to go”


I lost it because there was some sort of UI bug in the ad preview page that led the page to reload a bunch of times really quick, and boom, I'm apparently a criminal. You're lucky though, I never got it back and I had to change my business credit card because they somehow kept charging me after suspension.


Really? I've never had problems contacting them by email. They're one of the easiest tech companies to talk to.


In my experience they're one of the most useless and difficult companies I've ever tried to interact with.


The entirety of the universe is contained in the preceding two comments.


Aah, finally it all makes sense!



What did you contact them for? Just curious as I've almost always heard they're like Google and impossible to get a human response from.


What is their email? I've never found one that they'd respond to.


I have an option to email them, live chat, or have our account manager contact us. I guess if you spend enough on ads per month you are entitled to more support...?


Try contacting their NOC, they _may_ give you a human that can help.


Try their live chat.


Do they actually have a live chat available for smaller advertisers? When I looked last year there wasn't


They do, I have used it a couple of times. Go to facebook.com/business/help and click the get started button at the bottom for "find answers or contact support". Follow through and you will see a button called "chat with a representative".


Twitter does the same thing. It sends a bunch of spoofed visitors from Korea, Germany, and the US. The bots are spoofed to make it harder to filter them.


I wonder what the (legitimate?) reason is for them to spoof. Seems intentionally shady. Maybe there's a legit reason we're missing?


Possibly trying to avoid people sending them a different version of the page than users would see (of course they could change the page after the initial caching of a preview, but Twitter might refresh/check them later).

Also, you often need an impressive amount of the stuff that's in a normal UA string for random sites to not break/send you the "unsupported browser, please use Netscape 4 or newer!!!" page/..., although you normally can fit an identifier of what you really are at the end. (As an example, here's an iOS Safari user agent: "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1" - it's not Mozilla, it's not Gecko, but it has those keywords and patterns because sites expect to see that)


Yeah, I once tried to tell my browser to send... I forget; either no UA, or a blank UA string, or nonsense or just "Firefox" or something. I figured, "hey, some sites might break, but it can't be that important!" It broke everything. IIRC, the breaking point was that my own server refused to talk to me. Now, I still think this is insane, but apparently this really is how it is right now.


It should make you realize just how much abuse there is on the internet that it's worth it to just filter traffic with no UA.

Usually people get stuck on the fact that we can't have nice things, so X sucks for not letting us have nice things, yet I seem to never see people acknowledge why we can't have nice things.

If they did, I'd see a lot more "ugh, bad actors suck!" and less "ugh, websites are just trying to make life miserable for me >:("


Is there an official error message for that? Because filtering no UA would trip me up every time I use wget. Most of the time, if I'm casually using wget for something, I don't bother with a UA. If sites started rejecting that, I'd like to get a clear error message, so I would not go crazy trying to figure out what the problem was. If I got a clear message "send a UA" then I would probably started wrapping my wget requests in a short bash script with some extra stuff thrown in to keep everyone happy. But I'd have to know what it is that is needed to keep everyone happy.


wget has a default User-Agent string.
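
Right - by default it sends something like "Wget/1.20.3", so a no-UA filter wouldn't catch it anyway. And if a site insists on a browser-looking string, it's one flag (the UA value below is just a made-up example):

    wget --user-agent="Mozilla/5.0 (compatible; MyFetcher/1.0)" https://example.com/page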


Well, if filtering for UA really makes things difficult for bad actors, then they do suck, but more as a technical opinion than a moral statement.


That's amazing. Any idea what piece of middleware on your own server was doing that?


I don't remember, but what really got me was that I wasn't running anything that I expected to do fancy filtering; I think this was just Apache httpd running on CentOS. But there was no web application firewall, no load balancers, pretty sure fail2ban was only set up for sshd. It at least appeared that just the apache stock config was in play.


You removed a vital part of the http protocol and you're surprised when things break?


Seems like it follows the spec to me:

https://tools.ietf.org/html/rfc7231#section-5.5.3


That makes sense. I couldn't come up with a shady reason why they would do it to be honest, but I was curious.



When you know your target runs Internet Explorer, you serve phishing pages to IE users and some boring content to other users, at the server level. We've had this keep our user tests out of Google Safe Browsing and so on. I'm sure similar tricks end up applied to Facebook's UA and "legitimate" marketing sites.


Thanks man! I'll have a look.


Hey! Facebook engineer here. If you have it, can you send me the User-Agent for these requests? That would definitely help speed up narrowing down what's happening here. If you can provide me the hostname being requested in the Host header, that would be great too.

I just sent you an e-mail, you can also reply to that instead if you prefer not to share those details here. :-)


I'm not sure I'd publicly post my email like that, if I worked at FB. But congratulations on your promotion to "official technical contact for all facebook issues forever".


My e-mail address is already public from my kernel commits and upstream work. :-)


Don't think I used my email for anything important during my time at FB. If it gets out of hand he could just make a request to have a new primary email made and use the above one for "spam"


Curiosity question: does FB use Gmail/Google suite?


FB uses Office365 for email. It was on-premise Exchange many many years ago, but moved "to the cloud" a while back.


Feels odd to read that Facebook uses Office365/Exchange for email. They haven't built their own "fSuite" yet; I thought they would simply promote Facebook Messenger internally. I'm only half joking.


Most communication is via Workplace (group posts and chat). Emails aren't very common any more - mainly for communication with external people and alerting.


My impression is that they pretty much roll their own communication suite.


That’s somewhat correct

But at least for the email/calendar backend it's Exchange

The internal replacement clients for calendar and other things are killer...have yet to find replacements

For the most part though they use Facebook internally for messaging and regular communication (technically now Workplace but before it was just Facebook)

Email is really just for external folks


I don't want to share my website for personal reasons, but here is some data from cloudflare dashboard (a request made on 11 Jun, 2020 21:30:55 from Ireland, I have 3 requests in the same second from 2 different IPs)

user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

ip 1: 2a03:2880:22ff:3::face:b00c (1 request)

ip 2: 2a03:2880:22ff:b::face:b00c (2 requests)

ASN: AS32934 FACEBOOK


Yes, requests are still coming. Thanks Cloudflare for saving my ass.


Hey,

I can by the way confirm this issue. I work at a large newspaper in Norway and around a year ago we saw the same issue. Thousands of requests per second until we blocked it. And after we blocked it, traffic to our Facebook page also plummeted. I assume Facebook considered our website down and thus wouldn't give users content from our Facebook page either, as that would serve them content that would give a bad user experience. The Facebook traffic did not normalize until the attack stopped AND we told Facebook to reindex all our content.

If you want more info, send me an email and I'll dig out some logs etc. thu at db.no


Thanks for looking at this!


Thank goodness for mod_rewrite, which makes blocking/redirecting traffic on basic things like headers pretty easy.

https://www.usenix.org.uk/content/rewritemap.html

You could of course block upstream by IP, but if you want to send the traffic away from a CPU heavy dynamic page to something static that 2xx's or 301's to https://developers.facebook.com/docs/sharing/webmasters/craw... then this could be the answer.
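
A minimal sketch of that idea - the UA tokens are the ones reported in this thread and the target path is a placeholder:

    RewriteEngine On
    # match the Facebook crawler UAs reported in this thread (case-insensitive)
    RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|facebookcatalog) [NC]
    # avoid a redirect loop on the static page itself
    RewriteCond %{REQUEST_URI} !^/fb-static\.html$
    RewriteRule .* /fb-static.html [R=302,L]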


Here are some more details from my report to FB:

"My webserver is getting hit with bursts of hundreds of requests from Facebook's IP ranges. Google Analytics also reports these hits and shows them as coming from (mostly) Philippines and Peru, however, IP lookup shows that these IPs belong to Facebook (TFBNET3). The number of these hits during a burst typically exceeds my normal traffic by 200%, putting a lot of stress at our infrastructure, putting our business at risk.

This started happening after the Facebook Support team resolved a problem I reported earlier regarding connecting my Facebook Pixel as a data source to my Catalog. It seems Facebook is sending a bot to fetch information from the page, but does so very aggressively and apparently calls other trackers on the page (such as Google Analytics)"

69.171.240.19 - - [13/Aug/2018:11:09:52 +0200] "GET /items/ley3xk/ford-dohc-20-sierra-mondeo-scorpio-luk-set.html HTTP/1.1" 200 15181 "https://www.facebook.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"

"e.g. IP addresses 173.252.87.* performed 15,211 hits between Aug 14 12:00 and 12:59, followed by 13,946 hits from 31.13.115.*"

"What is also interesting is that the user agents are very diverse. I would expect a Facebook crawler to identify itself with a unique User-Agent header (as suggested by the documentation page mentioned earlier), but instead I see User-Agent strings that belong to many different browsers. E.g. this file contains 53,240 hits from Facebook's IP addresses with User-Agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"

There are a few Facebook useragents in there, but far fewer than browser useragents:

7,310 hits: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

2,869 hits: facebookexternalhit/1.1

1,439 hits: facebookcatalog/1.0

120 hits: facebookexternalua

Surprisingly, there is even a useragent string that mentions Bing: 6,280 hits: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

These IPs don't only fetch the HTML page, but load all the page's resources (images, css, ..) including all third-party trackers (such as Google Analytics). Not only does this put unnecessary stress on our infrastructure, it drives up the usage costs of 3rd party tracking services and renders some of our reports unreliable."

final response from FB: "Thanks for your patience while our team looked into this. They've added measures to reduce the amount of crawler calls made. Further optimizations are being worked on as well, but for now, this issue should be resolved." <- NOT.


I don't really understand what the issue is. On my welcome page (while all other URLs are impossible to guess) I give the browser something that requires a few seconds of CPU at 100% to crunch. And I track some user action in between, visits to tarpitted URLs, etc. In the last few years no bot came through. Why bother with robots.txt, just give them something to break their teeth on...

(I would give you the URL, but I just don't want it to be visited)


I don't think most users would appreciate having a site spike their CPU for a few seconds when they visit...at least I wouldn't.


The far-fetched "solution" along with the "I don't understand everyone doesn't do <thing that is easy to understand why everyone wouldn't do it>" make it hard for me to believe you're actually doing this. Sounds more like a shower thought, maybe fun weekend project, than something in production.


Sounds like terrible UX! Won't browsers give you a "Stop script" prompt if you do that?


I don’t think the solution is flatlining everyone’s CPUs.


> but I just dont want ti be visited

Maybe you should just take your website offline?


How much have you mined?


Same thing has happened to me: https://twitter.com/moonscript/status/1124888489298808834

The network address range falls under Facebook's ownership, so I don't think it's someone spoofing. I do think it's very possible someone found a way to trigger crawl requests in large quantities. Alternatively, I would not be surprised if it's just a bug on Facebook's end.


They've done this before to me, too. First I tried `iptables -j DROP`, which made the machine somewhat usable, but didn't help with the traffic. After trying a few things, I tried `-j TARPIT`, and that appeared to make them back off.

Of course, sample size of 1, etc. It could have been coincidental.
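
For anyone curious, the rough shape of both (TARPIT isn't in stock iptables, it comes from the xtables-addons package; the netblock is just one of the ranges mentioned in this thread):

    # plain drop - works everywhere
    iptables -A INPUT -s 173.252.64.0/18 -p tcp --dport 443 -j DROP

    # tarpit instead, if xtables-addons is installed
    iptables -A INPUT -s 173.252.64.0/18 -p tcp --dport 443 -j TARPIT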


Tarpits are an underappreciated solution to a pool of bad actors.

You can add artificial wait times to responses, or you can just route all of the 'bad' traffic to one machine, which becomes oversubscribed (be sure to segregate your stats!). All bad actors fighting over the same scraps creates proportional backpressure. Just adding 2 second delays to each request won't necessarily achieve that if multiple user agents are hitting you at once.


I never looked into the TARPIT option in iptables before reading your comment. That seems really useful. I've been dealing with on and off bursts of traffic from a single AWS region for the last month. They usually keep going for about 90 minutes every day, regardless of how many IPs I block, and consume every available resource with about 250 requests per second (not a big server and I'm still waiting for approval to just outright block the AWS region). I'm going to try a tarpit next time rather than a DROP and see if it makes a difference.


Be careful, as tarpitting connections can consume your resources faster than those of the attacker.


Most spiders limit the number of requests per domain, so if it's stupidity and not malice, you probably don't have a runaway situation.

... unless you're hosting a lot of websites for people in a particular industry. In which case the bot will just start making requests to three other websites you also are responsible for.

Then if you use a tarpit machine instead of routing tricks, the resource pool is bounded by the capacity of that single machine. If you have 20 other machines that's just the Bad Bot Tax and you should pay it with a clean conscience and go solve problems your human customers actually care about.


Only if you use conntrack though, no?


Interesting. Implemented DROP for some Cloudflare net blocks a few weeks ago due to lots of weird interrupted TCP connections from them.

That seemed to be good enough, but I'll consider the TARPIT option if it turns out to be needed. :)


one man’s weird interrupted TCP connection is another man’s nmap attempt.


TIL TARPIT!


This was happening to us > 5 years ago. The FB crawlers were taking out our image serving system as we used the og:image thing. What we did was route FB crawler traffic to a separate Auto Scaling Group to keep our users happy while also getting the nice preview image on FB when our content was shared. I can't overstate the volume of the FB requests; I can't remember the exact numbers now but it was insane.


I once made a PHP page which fetches a random wikipedia article, but changes the title to "What the fuck is [TOPIC]?". It was incredibly funny, until I started getting angry emails from people, mostly threatening a form of libel lawsuit.

Turns out, since it was a front for all of Wikipedia[1], Google was aggressively indexing it, but the results rarely made it to the first search page. And since this isn't exactly an important site, old results would stick around.

Hence a pattern:

1. Some Rando creates a page about themselves

2. Wikipedia editors, being holy and good, extinguish that nonsense

1.5. GOOGLE INDEXES IT

3. Rando, by nature of being a rando, googles themselves, doesn't find their wikipedia page anymore (that's gone), but does find a link to my site in the first page of google results with their name.

4. Lawyers get involved, somehow

Details: http://omershapira.com/blog/2013/03/randomfax-net/

[1] Hebrew wikipedia. I can't imagine what would've happened on the English version.


I understand why people got angry, though. The title was roughly "What is this '<article name>' shit?"[1], with "shit" in this Hebrew context also able to be interpreted as calling the subject "shit".

Saying "מה זה לעזאזל X?"[2] is closer to "what the fuck is X?" (except it's more "what the hell" than "what the fuck").

[1] I wasn't sure how to portray this to non-Hebrew speakers but, surprisingly, Google Translate actually nailed it: https://translate.google.com/#view=home&op=translate&sl=auto...

[2] Google Translate got this example right, too: https://translate.google.com/#view=home&op=translate&sl=auto...


Doesn't ignoring headers and taking someone's site down fall afoul of the CFAA? Especially given how comments here are showing that this is a recurring issue? Depending on which side of the fence you're on, there could be standing to go after FB for this, either for money or for their lawyers to help set a precedent to limit the CFAA further.


When someone does this to Facebook it's malicious and they go to jail. When Facebook does it to someone else... "oops".


Never forget Aaron Swartz. All he did was send download requests to JSTOR. https://www.youtube.com/watch?v=9vz06QO3UkQ


I'm going to get a ton of hate for this, but he was doing it with the intent to redistribute the content for free. He didn't own the content. There's a big difference. That being said it's very sad what came about of that. I really don't think the FBI needed to be involved.


> he was doing it with the intent to redistribute the content for free

There is actually no proof of this.


Really? I felt that what happened to Aaron was a tragic injustice on many levels (from the fact that he was charged at all to the number of charges they threw at him), but for some reason I had always thought that it was fairly well-known that that was his intent.

I have no idea where I got that impression from, but I do recall reading several articles about him, as well as his blog around that time. It's possible I just internalized others' assumptions about his behavior, but it certainly seemed like an in-character thing for him to do.


IIRC, he had published a manifesto about how he believed that information that was paid for with public funds should be free for the public to view. However, I don't believe that he ever mentioned why he was making get requests to JSTOR's servers and I know that he never uploaded any of those JSTOR documents to the internet for public download.


Yeah, but any decent criminal lawyer could tear "he once wrote a manifesto" to shreds. His political beliefs would only be pertinent if he wrote the manifesto somehow in connection with what he was doing; otherwise, it would not be sufficient to show intent to distribute.


Criminal law is about proving things, not in-character assumptions.


Intent is the sticking point for the vast majority of prosecutions. That's why you have a jury, not Compute-o-Bot 9000.


Yeah, and Facebook did this crawling with intent to commit CFAA violations - computer fraud and abuse. Now someone call the feds quickly before the perpetrators are zucked into hiding


From the point of view of the law, the intent (malicious or benevolent) is often as important as the action itself.


Nah, it only matters whether you are a large corporation or not. If you are a large corporation you will ruin your opponent financially with legal fees (whether they are guilty or not). If you are a small timer going up against a large corporation they will delay the court case until you cannot afford the legal fees.


The always-classic "what color are your bits?"

https://ansuz.sooke.bc.ca/entry/23


No. The owner needs to send Facebook a cease and desist and block their IPs in order to revoke authorization under the CFAA, and then Facebook would need to continue accessing his website despite explicitly being told not to. That's the precedent set by Craigslist v. 3Taps.

Whether or not the above even holds today (hiQ Labs v. LinkedIn argues it does not) remains to be decided in a current appeal to the Supreme Court.


That's 81 requests per second on average.

Shouldn't anybody doing such thing be liable, and be sued for negligence and required to pay damages?

That sounds like a lot of bandwidth (and server stress).


Yes, this was my first thought too - Facebook is essentially denial of servicing people's sites and it sounds like they're aware of it. Move fast and break your things. If this were your average Joe I'm fairly certain something like the DMCA could be used.


What? How does the DMCA apply here at all?


The DMCA has a couple major parts. The part most people are familiar with deals with safe harbor provisions and "DMCA takedowns". The relevant part for this discussion makes it illegal to circumvent digital access-control technology. So you could argue that if you have reasonable measures in place to stop automated access of your content such as robots.txt then Facebook continuing to send automated requests would be a circumvention of your access controls.


Shouldn't any public facing website have a rate limit? If someone attempts to circumvent simple rate limits (randomizing the source IP or header content), then that could demonstrate intent to cause damage, and you'd have a better case. But if you don't set a limit, how can you be mad that someone exceeded it?

(I know they're ignoring robots.txt, but robots.txt is not a law. And, it doesn't apply to user-generated requests, for things like "link unfurling" in things like Slack. I am guessing the crawler ignores robots.txt because it is doing the request on behalf of a human user, not to create some sort of index. Google is attempting to standardize this widely-understood convention: https://developers.google.com/search/reference/robots_txt)
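
For what it's worth, a basic per-IP limit is only a few lines in nginx (numbers are illustrative, and as others point out below, this doesn't help much against a crawler spread across hundreds of IPs):

    # http block
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

    # server block
    location / {
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
        # ...normal root/proxy_pass config here
    }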


There is a cost to applying throttling: you still have to receive those requests and filter/block them, plus pay for incoming bandwidth.

The post mentions setting `og:ttl` and replying with HTTP 429 - Too Many Requests, and both being ignored.


Plenty of crawlers, including Facebook's, are operating out of huge pools of IPs usually smeared across multiple blocks. If you have an idea on how to ratelimit a crawler doing 2 req/s from 300+ distinct IPs with random spoofed UAs let me know when your startup launches.


There are many features that you can pick up on. TCP fingerprints, header ordering, etc. You build a reputation for each IP address or block, then block the anomalous requests.

If you don't want to do this yourself and you're already using Cloudflare... congratulations, this is exactly why they exist. Write them a check every month for less than one hour of engineering time, and their knowledge is your knowledge. Your startup will launch on time!


The author specified that the crawler ignores the 429 status code. So they do have some rate limiting in place.


I noticed that, but I get the feeling that sending that status code didn't reduce the amount of work the program did, so it's kind of pointless. You can say "go away", but at some point, you probably have to make them go away. (All RFC6585 says about 429 is that it MUST NOT be cached, not that the user agent can't try the request again immediately. It is very poorly specified.)

The author mentioned that they were using Cloudflare, and I see at least two ways to implement rate limiting. One is to have your application recognize the pattern of malicious behavior, and instruct Cloudflare via its API to block the IP address. Another is to use their rate-limiting that detects and blocks this stuff automatically (so they claim).

Like, there are many problems here. One is a rate limit that doesn't reduce load enough to serve legitimate requests. Another is wanting a caching service to cache a response that "MUST NOT" be cached. And the last is expecting people on the Internet to behave. There are always going to be broken clients, and you probably want the infrastructure in place to flat-out stop replying to TCP SYN packets from broken/malicious networks for a period of time. If the SYN packets themselves are overwhelming, welp, that is exactly why Cloudflare exists ;)

The author is right to be mad at Facebook. Their program is clearly broken. But it's up to you to mitigate your own stuff. This time it's a big company with lots of money that you can sue. The next time it will be some script kiddie in some faraway country with no laws. Your site will be broken for your users in either case, and so it falls on you to mitigate it.


Rate limiting websites is a lot of work. You have to figure out what kind of limits you should set, you have to figure out if that should be on an IP or a /24 or an ASN, if it's an ASN, you have to figure out how to turn an IP into an ASN.

If your website is pretty fast, you probably won't necessarily care about a lot of hits from one source, but if it's supposed to be a small website on inexpensive hosting, and all those hits add up to real transfer numbers, maybe it's an issue.


In the ToS, add a charge of 2 guineas, 5 shillings, a sixpence and a peppercorn per request.


Nah, make it a strand of saffron per request and get your money's worth out of them.


I also saw peaks of 300 reqs per second.


Did anyone notice the branded IP address? 2a03:2880:20ff:d::face:b00c

"face:b00c"


My "404-pests" fail2ban-client status, which drops everything making *.php requests (This machine has never had PHP installed...):

2a03:2880:10ff:14::face:b00c 2a03:2880:10ff:21::face:b00c 2a03:2880:11ff:1a::face:b00c 2a03:2880:11ff:1f::face:b00c 2a03:2880:11ff:2::face:b00c 2a03:2880:12ff:10::face:b00c 2a03:2880:12ff:1::face:b00c 2a03:2880:12ff:9::face:b00c 2a03:2880:12ff:d::face:b00c 2a03:2880:13ff:3::face:b00c 2a03:2880:13ff:4::face:b00c 2a03:2880:20ff:12::face:b00c 2a03:2880:20ff:1e::face:b00c 2a03:2880:20ff:4::face:b00c 2a03:2880:20ff:5::face:b00c 2a03:2880:20ff:75::face:b00c 2a03:2880:20ff:77::face:b00c 2a03:2880:20ff:e::face:b00c 2a03:2880:21ff:30::face:b00c 2a03:2880:22ff:11::face:b00c 2a03:2880:22ff:12::face:b00c 2a03:2880:22ff:14::face:b00c 2a03:2880:23ff:5::face:b00c 2a03:2880:23ff:b::face:b00c 2a03:2880:23ff:c::face:b00c 2a03:2880:30ff:10::face:b00c 2a03:2880:30ff:11::face:b00c 2a03:2880:30ff:17::face:b00c 2a03:2880:30ff:1::face:b00c 2a03:2880:30ff:71::face:b00c 2a03:2880:30ff:a::face:b00c 2a03:2880:30ff:b::face:b00c 2a03:2880:30ff:c::face:b00c 2a03:2880:30ff:d::face:b00c 2a03:2880:30ff:f::face:b00c 2a03:2880:31ff:10::face:b00c 2a03:2880:31ff:11::face:b00c 2a03:2880:31ff:12::face:b00c 2a03:2880:31ff:13::face:b00c 2a03:2880:31ff:17::face:b00c 2a03:2880:31ff:1::face:b00c 2a03:2880:31ff:2::face:b00c 2a03:2880:31ff:3::face:b00c 2a03:2880:31ff:4::face:b00c 2a03:2880:31ff:5::face:b00c 2a03:2880:31ff:6::face:b00c 2a03:2880:31ff:71::face:b00c 2a03:2880:31ff:7::face:b00c 2a03:2880:31ff:8::face:b00c 2a03:2880:31ff:c::face:b00c 2a03:2880:31ff:d::face:b00c 2a03:2880:31ff:e::face:b00c 2a03:2880:31ff:f::face:b00c 2a03:2880:32ff:4::face:b00c 2a03:2880:32ff:5::face:b00c 2a03:2880:32ff:70::face:b00c 2a03:2880:32ff:d::face:b00c 2a03:2880:ff:16::face:b00c 2a03:2880:ff:17::face:b00c 2a03:2880:ff:1a::face:b00c 2a03:2880:ff:1c::face:b00c 2a03:2880:ff:1d::face:b00c 2a03:2880:ff:25::face:b00c 2a03:2880:ff::face:b00c 2a03:2880:ff:b::face:b00c 2a03:2880:ff:c::face:b00c 2a03:2880:ff:d::face:b00c


I think we're together in this, my friend


I believe, but have not tried, that you can craft zipped files that are small but when expanding produce multi-gigabyte files. If you can sufficiently target the bad guys, you can probably grind them to a halt by serving only them these files. Some care needs to be taken so you don’t harm the innocent and don’t run afoul of any laws that may apply. Good luck.


Do you think it would be possible to create something similar for gzip? If you then serve with Content-Type: text/html, and Content-Encoding: gzip, the client would accept the payload. And when it tries to expand it, it would get expanded to a large file, eating up their resources.
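
It should be - gzip is just a deflate stream, so a single member of zeros tops out around 1000:1 (see the deflate-limit discussion further down). A rough sketch, with sizes and paths purely illustrative:

    # ~10 GB of zeros compresses down to roughly 10 MB
    dd if=/dev/zero bs=1M count=10240 | gzip -9 > /var/www/bomb.gz

    # then serve the pre-compressed file with the headers you describe,
    # e.g. an nginx location along these lines:
    #   location = /innocent-page.html {
    #       default_type text/html;
    #       add_header Content-Encoding gzip;
    #       alias /var/www/bomb.gz;
    #   }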


That was the idea. There's a name for it - zip bomb. https://en.wikipedia.org/wiki/Zip_bomb

A superficial search leads to things like https://www.rapid7.com/db/modules/auxiliary/dos/http/gzip_bo...

https://stackoverflow.com/questions/1459673

You really want to be careful about potentially breaking laws ...


Here is another article: https://www.blackhat.com/docs/us-16/materials/us-16-Marie-I-...

If FB supports brotli, a much bigger compression factor than 1000 is possible, apparently.


Here's a brotli file I created that's 81MB compressed and 100TB uncompressed[1] (bomb.br). That's a 1.2M:1 compression ratio (higher than any other brotli ratio I see mentioned online).

There's also a script in that directory that allows you to create files of whatever size you want (hovering around that same compression ratio). You can even use it to embed secret messages in the brotli (compressed or uncompressed). There's also a python script there that will serve it with the right header. Note that for Firefox it needs to be hosted on https, because Firefox only supports brotli over https.

Back when I created it, it would crash the entire browser of ESR Firefox, crash the tab of Chrome, and would lead to a perpetually loading page in regular Firefox.

It's currently hosted at [CAREFUL] https://stuffed.web.ctfcompetition.com [CAREFUL]

[1] https://github.com/google/google-ctf/tree/master/2019/finals...


A single layer of deflate compression has a theoretical expansion limit of 1032:1. ZIP bombs with higher ratios only achieve it by nesting a ZIP file inside of another ZIP file and expecting the decompressor to be recursive.

This means you can serve 1M payload and have it come out to 1G at decompression time. Not a bad compression ratio, but it doesn't seem like enough to break Facebook servers without taking on considerable load of your own.

http://www.zlib.net/zlib_tech.html


Just guessing, but maybe a really large file with a limited character set, maybe even just repeating 1 character, should compress really well


1 character repeated 1877302224 times has a compression ratio of 99.9%: ~1.9GB compresses to ~1.8MB.
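
That's easy to verify with a one-liner (NUL bytes here, but any single repeated byte behaves about the same):

    head -c 1877302224 /dev/zero | gzip -9 | wc -c   # prints roughly 1.8 MB worth of bytes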


I believe zip bombs are pretty easily mitigated, with memory limits, cpu time ulimits, etc.

But zip bombs aren't limited to zip files... lots of files have some compression in them. You can make malicious PNGs that do the same thing. Probably tonnes of other file types.


Good points. According to OP, the goal would be to persuade FB to back off crawling at 300 qps, not to bring down the FB crawl. Having each request expand to 1GB is probably already enough to do that, since a crawler implementing the mitigations you mentioned will likely reduce the crawl rate. Should be easily testable.


The mighty power of IPV6 :P


if only they could have convinced IANA to give them a vanity prefix.


FB has 2a00::/12 (and probably other blocks?) and can make something that looks like its company name from [0-9a-f]. Why wouldn't it do something harmless and fun like this? It's not as if it requires special dispensation or is breaking any rules.


I don't think the person was implying it's bad. Just pointing out an easter egg.


Ah, well if that’s the case then I apologise. My bad!


A single company does not get a /12 prefix. 2a00::/12 is almost half of the space currently allocated to all of RIPE NCC. Facebook seems to have 2a03:2880::/29 out of that /12, and a /40 through ARIN (2620:0:1c00::/40)


Ugh right you are. I misread the whois output and didn't stop to consider how realistic a /12 was. This is embarrassing.


ugh why is ipv6 impossible to understand :/


Is the notation the problem here? IPs have 16 bytes now, so the notation is a bit more compact: hex instead of decimal numbers, and :: is a shortcut meaning "replace with as many zeros as necessary to fill the address to 16 bytes"
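
For example, the branded address mentioned above expands like this:

    2a03:2880:20ff:d::face:b00c  ==  2a03:2880:20ff:000d:0000:0000:face:b00c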


It's not. Think IPv4 but bigger.




It bugs me that they require a login to see a bug report, especially in the case of this bug where those affected aren't necessarily facebook users.


I've stumbled into that too.


Reading through the article and comments here I find it odd that a company like FB can't get a crawler to respect robots.txt or 429 status codes.

Even I would stop and think "maybe I should put in some guards against large amounts of traffic" when writing a crawler, and I'm certainly not one of those brilliant minds who manage to pass their interview process.


If the dev respected others’ boundaries and norms of expected behavior, they would probably work somewhere else.


It's an interesting thing to think about, that someone can make a request to your website and you pay for it.

This is not how the mail works, where someone needs to buy a stamp to spam you.


My office still gets spam faxes to this day. The paper and toner only add up to a few cents a month, so it's not worth doing anything about.

I knew a realtor that had a sheet of black paper with a few choice expletives written on it that they would send back to spammers. There was an art to taping it into a loop so it would continuously feed. This was a few decades ago, when the spam faxes could cost more than a stamp.


My desk phone at an old job used to get dialed by a fax machine. Not fun picking that up. I redirected it to a virtual fax line and it turns out it was a local clinic faxing medical records. I faxed them back with some message about you have the wrong number but they never stopped.


If you really wanted to make them stop, you might find a contact for their lawyer and make HIPAA noises at them ;)


I considered it. That was after my time in hospital IT. Really I wanted to stop getting my eardrums blown out by a robot. Luckily I happened to be working at a phone company so changing my number just took a couple clicks.


My dad used to receive a lot of misdirected faxes.

His solution was to send a return fax with disorderly handwriting begging and pleading for them to fax someone else.

The other person stopped faxing him.


I was using a virtual fax service so a digital equivalent would have been easy. Not sure if I was creative enough to invert the colors before sending my response. Really I just wanted the noise to stop.


It’s because we as an industry decided to give in and submit to the cloud providers’ bullshit model of paying overpriced amounts for bandwidth while good old bare-metal providers still offer unmetered bandwidth for very reasonable prices.


The question isn't about pricing. The fact that you pay for your own bandwidth and metal is a problem.


Paying for your bandwidth and metal isn't a problem per-se as long as prices are reasonable (which they are with the old-school bare-metal providers). After all, those services do cost money to provide.

The problem happens when prices are extortionate (or the pricing model is predatory, ie pay per MB transferred instead of a flat rate per 1Gbps link) and relatively minor traffic translates to a major bill.

This particular issue is about 80 requests/second which is a drop in the bucket on a 1Gbps link. It would be basically unnoticeable if it wasn't for the cloud providers nickel & diming their customers.


You're still not getting it. I'm not arguing about whether it should be free or how much it should cost. I'm saying that when you have to pay for someone else's request, it's ripe for abuse. Instead, a system where the requester pays would be similar to how postage works in the US. Although I think it's highly unlikely that would make sense in a digital world.


Most of my junk mail does not have stamps because the spammers drop it off in pallets at the post office and just have it delivered to everyone.

The USPS is like a reverse garbage service.


This is a common pattern in (blocking i/o versus non-blocking i/o) and (sync task versus async task).

The typical analogy I hear for DDoS attacks is if you own a business and a giant group of protesters drives traffic to your store, but no one is interested in buying. They create a backlog to enter the store, they crowd the aisles, they pick up your inventory, and the lines and crowds scare off your legitimate visitors/customers.


This reminds me of a case years ago when I had a small-ish video file on my server that would cause Google Bot to just keep fetching it in a loop. It wasted a huge amount of bandwidth for me since I only noticed it in a report at the end of the month. I had no way of figuring out what was wrong with it, so I just deleted the file and the bot went away.


Was it a real FB crawler or a random buggy homemade bot masquerading as one?

Check here: https://developers.facebook.com/docs/sharing/webmasters/craw...



Someone up to no good using a false flag would not surprise me in the slightest.

"who do people already like to hate? I'll pretend to be them."



IP packets have a source field, which can be fake and not their actual IP. That's a Facebook IP, but the packet might not have actually come from it.


If they made actual requests that went through, a connection got established. That won't happen with a faked source.


Ah okay, thanks for helping me learn :)


Could be the IP address OP is showing is actually from an X-Forwarded-For header or another proxy header (but not the actual source of the packet).


UDP packets can fake their source IP. TCP packets realistically can't.


More precisely, spoofing a TCP handshake is a problem that we know how to prevent. Odds are pretty good that your kernel has these protections enabled by default, but it's not guaranteed and you should check as a matter of due diligence.


Why is it so hard for people here to believe that Facebook could have a crawler misbehaving? It is all over this thread, even though multiple people are saying they saw the same thing and confirmed the IP belonged to FB.

What gives?


If you can verify it's actually coming from Facebook's IP address (as many other comments here suggest), you should absolutely get in touch with them, though I'm not sure how. Perhaps their security team. That's a very serious bug that they'd be glad to catch -- if it's affecting you it's surely affecting others.

Otherwise it's a bot/malware/etc. spoofing Facebook and gone wrong, which sucks. And yeah just block it by UA, and hopefully eventually it goes away.


From the sounds of it the difference between Facebook's actual crawler and malware could be hard to define or differentiate, given the circumstances.


It was from their IPs. I tried to figure out how to get in touch, without success


This is one of the better HN threads in a while. Starting from an easily described problem, it goes into a number of interesting directions including how FB seems to DDOS certain sites for reasons unknown, defenses against such attacks and DDOS in general, GPL licensing, how to collapse HN conversations, etc.

Not a profound comment. But it was really fun to follow the path down the various rabbit holes. This is why I like HN.


Interesting. According to the Facebook Crawler link provided by the OP [1], it makes a range request of compressed resources. I wonder if the response from the OP’s PHP/SQLite app isn’t triggering a crawler bug.

Maybe Cloudflare can step up and act as an intermediary to isolate the problem. Isolating DoS attacks is one of their comparative advantages.

[1] https://developers.facebook.com/docs/sharing/webmasters/craw...


It would be nice to understand how you came to the conclusion that this was a Facebook bot.



You can check the list of IP addresses?


The OP did not (explicitly) check, but you can check whether the IP falls into a range allocated to Facebook, e.g. https://ipinfo.io/AS32934
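
A quick way to sanity-check a single address from the terminal (the fields to grep for are what RIPE/ARIN whois typically returns; exact output varies by registry):

    whois 2a03:2880:22ff:3::face:b00c | grep -Ei 'netname|orgname|origin'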


Of course I've checked. Here is a recent example. https://news.ycombinator.com/item?id=23491455


Make sure you file a bug - there are a myriad of sources internally, but we can often hunt it down easily enough (assuming it gets triaged to eng). Important info is the host of the urls being crawled, and the User Agent attached to the requests (headers are also good too). A timeseries graph of the hits (with date & timezone specified) can also help.


Why is this the owner's problem? Someone at Facebook should be filing the bug, or better yet instrumenting their systems so that incidents like this issue a wake-up-an-engineer alert.

Fuck this culture of "it's up to the victim of our fuckup to file a bug report with us".


Not just “file a bug report” but also compile a time series graph (lol) and then pray that Facebook triages it correctly, which they have no incentive to do.


And to make things even more ridiculous, you need to sign in with a facebook account to even file a bug there.


100%



Did you check the IP and not just the UA? Does it match the one from FB?



Surprise, Facebook not caring about playing nice with others!

Anyway, what reason would they have for so many requests? Why would they need to do 300 per second?


We had that with Google in the past. We put a KML database on a web server and made the Google Maps API read it. Google went to fetch it without any throttling (my guess is that they primed their edge servers straight from us); the server went down so fast we thought there had been a hardware failure. We ended up putting a CDN in place just for Google's servers.


What would happen if you created a "rabbithole" for webcrawlers? Would they fall for it?


Modern web crawlers usually have mechanisms against rabbit holes, e.g. for Apache Nutch you can configure it should only follow to a certain depth of links from a root, it should only follow a redirect a few times and so on. It's always a trade-off though. If you cut off too aggressive you don't get everything, if you cut off too late you waste resources.


Maybe someone at Facebook is testing a project that didn't work as expected?


I wish we had better tools for things like this.

What I've heard happen is someone puts a governor on the system, and much later your new big customer starts having errors because your aggregate traffic exceeds capacity planning you did 2 years ago and forgot about.


Are you sure it's a crawler and not a proxy of some sort? Eg. one of your links is on something high traffic in facebook, and all requests are human, running through fb machines.


Nah, it's a specific user-agent for their crawler


Facebook caches URLs pretty aggressively (often you need to explicitly purge the cache to pick up updates) so I don't quite understand exactly what happened here. The article is very light on details. Do you have 7 million unique URLs that have been shared on Facebook? Or is it the same URL being scraped over and over?

Are you correctly sending caching headers?


When I enabled IPv6 traffic at home and logged default denies, I constantly saw Facebook-owned IPv6 blocks trying to reach the addresses browsing Facebook.

I have no idea what it's trying to do, but it's legitimately the only inbound IPv6 traffic that isn't related to normal browsing.


I’m a bit confused. Facebook is a social media company, why do they send out crawlers?


Facebook shows previews of pretty much everything you share a link to, so at a minimum they have to fetch the page/image to build those previews.


I see, thanks for the explanation.


Why would a crawler want to crawl the same page every second (let alone 71 times a second). Even if I were Google, wouldn't crawling any page every few mins (or say every minute) be way more than enough for freshness?


There was a very old post: https://news.ycombinator.com/item?id=7649025, lots of ways to abuse Facebook network.


> I own a little website I use for some SEO experiments [...] can generate thousands of different pages.

Do you have more details on this? One could read that as if the page is designed to attract crawlers.


Is it worth duplicating the FB-specific parts of your website at another web address as a test site, so that if it were similarly affected, you could provide the address info to help troubleshoot?


Shouldn't Cloudflare have considered this a DDoS attack?


Hate to say but this is probably a FB engineer running a test/experiment and not a production crawler taking robots.txt etc into account.


Or malware infecting some FB-internal machine?


Doubt it. If you're a bad enough dude to get control over Facebook's internal infrastructure, I doubt you'd blow that access by using it to spam random sites.


Detect the crawler and delay the responses as long as you can. Have it use as many resources as you can; that should be interesting.


> And then they’ll probably ask you to pay the bill

Can anyone comment on whether there would be any legal basis for this?


Thanks for sharing. I’d like to hear more about the website described here, it sounds very interesting.


I can add I work in SEO (tech side), so the website is "super optimized", but it's not rocket science.


Super optimized? The site takes longer to load than Hacker News for me.


They explicitly didn't mention the site this affected. It's not the one linked in the OP.


Super optimized with Facebook button on each post then. Not my definition of 'super optimized', but you know.


Super optimized in a SEO sense, not in a pleasing the HN crowd kind of sense I'd assume ;)


You're right, but for some reason the whole SEO thing just winds me up. It's my opinion that 'good' SEO makes sites worse for actual people to use.


SEO is for machines, not for users IMHO.


Still affects users.

Potential Positive: page speed, https

Negative: All blog posts with the same length, keywords dropped in every paragraph


Can't share, sorry :)


Dude something is super sketchy


Nah


I wonder what the crawler does that requires so many requests per day to the same site...


Title change? A stupid Facebook crawler was making 7M requests per day to my website


Nice one. ;)


We had a similar experience with Yandex. Had to block it. The crawler started crawling a larger e-shop by trying out every single possible filter combination. In other words, we got a DDoS attack from Yandex.


Then your website isn't stupid but rather the FB crawler


Do you have some sort of fb plugin in your SEO site?


Why does facebook have crawlers in the first place?


Their 2FA page was broken yesterday...it goes on


Unrelated, but has anyone written a Chrome/Firefox extension to browse the web sending out Googlebot or Facebook user agent?

I wonder if you can bypass paywalls or see things that aren't generally presented to regular users


I surfed through a Google network for a bit a few years ago, and it does reveal things. The most common was Drupal and WordPress sites with some malicious code running that would link to online pharmacies on every page, but only to Google user agents / networks, I assume to boost their ranking. I let a bunch of webmasters know, most of them didn't believe me because they didn't see it themselves.


You wouldn't even need an extension. Both browsers allow you to change your User-agent. An extension just makes it a little easier.

I have a UA switcher on Firefox so I can change my UA in two clicks. One of the built-in UA options is indeed the Googlebot.
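
You can run the same experiment from the command line - fetch a page with a Googlebot UA and with a browser UA, then diff the two (URL is a placeholder, and sites that cloak based on IP rather than UA won't be fooled):

    curl -sA "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/article > as-googlebot.html
    curl -sA "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" https://example.com/article > as-browser.html
    diff as-googlebot.html as-browser.html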



You chose to mention Safari, but link to an article about Chrome, Firefox, and Safari? Looks like all major browsers (probably Edge too) have this built in.



Someone could use this for DDOS or Coinhive


I think sending back a `429 Too Many Requests` can solve this. Unless you want to block their IPs completely, which I doubt you'd want to do.


A well made crawler that would support 429 codes isn't likely to have this problem in the first place.


> HTTP 429 -> ignored

It's in the article


I tried. It was ignored.


It says in the post he did this already

