A Facebook crawler was making 7M requests per day to my stupid website (napolux.com)
1074 points by napolux on June 11, 2020 | 407 comments



We've had the same issue. They were doing huge bursts of tens of thousands of requests in very short time several times a day. The bots didn't identify as FB (used "spoofed" UAs) but were all coming from FB owned netblocks. I've contacted FB about it, but they couldn't figure out why this was happening and didn't solve the problem. I found out that there is an option in the FB Catalog manager that lets FB auto-remove items from the catalog when their destination page is gone. Disabling this option solved the issue.


I just had an idea: if you control your own name server I believe you could use a BIND view to send all their own traffic to themselves based on the source address.

By the way, if someone discovers how to trigger this issue it would be easy to use it as a DOS pseudo-botnet.
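
Roughly the shape of what I mean, as a sketch only - the netblocks are examples pulled from elsewhere in this thread, not an authoritative list, and the zone file names are placeholders:

    acl "fb-crawlers" {
        31.13.64.0/18;
        66.220.144.0/20;
        69.171.224.0/19;
        2a03:2880::/29;
    };

    view "facebook" {
        match-clients { "fb-crawlers"; };
        zone "example.com" {
            type master;
            # A/AAAA records in this file point back at Facebook's own hosts
            file "db.example.com.fb";
        };
    };

    view "default" {
        match-clients { any; };
        zone "example.com" {
            type master;
            # the real records for everyone else
            file "db.example.com";
        };
    };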


I had a different idea. Maybe you could craft a zip-bomb response. The bot would fetch the small gzipped content and upon extraction discover it was GBs of data? Not sure that's possible here, when responding to a request, but that would surely turn the admins' attention to it.


Here's an example of things you can do against malicious crawlers: http://www.hackerfactor.com/blog/index.php?/archives/762-Att....


Thanks a lot! I've spent like 2h now reading his blog. Just amazing.


fascinating read. :)


It is!


If their client asks for gzip compression of the http traffic, you could do it.


Nice idea, I like the thinking. I'll tuck that away for use later. PowerDNS has Lua built in, amongst a few other things.

My stack of projects to do is growing at a hell of a rate and I'm not popping them off the stack fast enough.


I know the feeling, it's one of the reasons I'm working on https://github.com/hofstadter-io/hof

Check out the code generation parts and modules, they are the most mature. We have HRDs (like CRDs in k8s for anything) and a scripting language between bash and Python coming out soon too.


I've tried to work out what your project does but I'm none the wiser. GEB is prominent on my bookshelf. I'm a sysadmin and I got as far as "hollow wold" in Go or was it "Hail Marrow"? Can't remember.

I've checked out your repo for a look over tomorrow when I'm cough sober!


Stop by gitter and I'd be happy to explain more


Something about this idea sits uncomfortably with me. I also just had an idea / thought experiment based on your idea.

We think of net neutrality as being for carriers and ISPs, but you could see it applied to a publicly accessible DNS service too. These DNS service providers are just as much part of the core service of the Internet as anyone else. It’s not a huge leap to require that those who operate a publicly accessible DNS service are bound by the same spirit of the regulations: that the infrastructure must not discriminate based on who is using it.

It’s different to operating a discriminatory firewall. DNS is a cacheable public service with bad consequences if poisonous data ends up in the system. Fiddling with DNS like this doesn’t seem like a good idea. Too much weird and bad stuff could go wrong.

Another analogy would be to the use of encryption on amateur radio. It seems like an innocuously good idea, but the radio waves were held open in public trust for public use. If you let them be used for a different (though arguably more useful) purpose, then the resource ends up being degraded.

Also along these lines of thought [begin irony mode]: FCC fines for DNS wildcard abuse / usage.


Principled neutrality is fine for acceptable use. There’s no moral quandary in closing the door to abusers.


Isn't that the argument that providers make for wanting to meter usage? I.e. video streamers, torrenters, Netflix and the like are 'abusing' the network by using a disproportionate amount of their capacity / bandwidth?

I guess my point is that "abuse" in this sense is pretty subjective.


Not really, those things are easy to contrast. Network providers have always been comfortable blackholing DoS routes, and it’s never been controversial. That’s clearly distinct from those wanting to double-dip on transport revenues for routine traffic.

The difference is in whether both endpoints want the traffic, not whether (or on what basis) the enabling infrastructure wants to bear it.


> There’s no moral quandary in closing the door to abusers

Doesn't necessarily apply to this conversation, but the moral mistake that people (and societies) frequently make is underestimating the nuance that should be exercised when identifying others as abusers.


if only there was a way to accidentally amplify that.


You could just probably send an HTTP redirect to their own site, no need to play with DNS for that


Probably better to find the highest-level FB employee's personal site that you can and send the millions of requests from FB there.


Except then you'd still have to deal with all the traffic.


>> The bots didn't identify as FB (used "spoofed" UAs)

That's surprising. What were the spoofed user agents that they used?

We've run into this issue also, but all Facebook bot activity had user agents that contained the string "facebookexternalhit".


I've seen these two user-agents from FB IPs, maybe others:

Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53

Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1

Which also execute javascript in some modified sandbox or something, causing errors, and executing error handlers. Interesting attempt to analyze the crawler here: https://github.com/aFarkas/lazysizes/issues/520#issuecomment...


Yep, these were the UAs we also saw (amongst others). And also in our case those bots were executing the JS, even hitting our Google Analytics. For some reason GA reported this traffic as coming from Peru and the Philippines, while an IP lookup showed it belonged to FB, registered in the US or Ireland.


Set up a robots.txt that disallows Facebook crawlers, sue Facebook if the crawling continues for unauthorized access to computer systems, profit.
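
For what it's worth, a robots.txt for just the FB crawlers could look something like this (facebookexternalhit is the documented crawler token; facebookcatalog is another one that shows up in logs later in this thread):

    User-agent: facebookexternalhit
    Disallow: /

    User-agent: facebookcatalog
    Disallow: /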


I believe this only works if it is coming from a corporation and targeted at an individual -_-


robots.txt is not a legal document. It is asking nicely, and plenty of crawlers purposefully ignore it.


I'm not a lawyer, but I believe I remember people being sued for essentially making a GET request to a URL they weren't supposed to GET.


Legal document is a tricky phrase to use. "No trespassing" signs are usually considered sufficient to justify prosecution for trespassing. If the sign is conspicuously placed, it does not usually matter if you actually see the sign or not.

I am not as familiar with law around accessing computer systems, but I imagine that given some of the draconian enforcement we've seen in the past that a robots.txt should be sufficient to support some legal action against someone who disregards it.


The no trespassing sign does not mean I can't yell at you from the street, 'tell me your life story,' which you are free to ignore.


Sure, but federal law does prevent you from repeatedly placing that request into my mailbox. I don't even need a sign for that.

Making an http request does not fit cleanly as an analogue to yelling from the street, nor does it fit as an analogue to throwing a written request on a brick through a window. It is something different that must be understood on its own terms.


Side question: how did you get in contact with facebook? I've an ad account that was suspended last year and gave up trying to contact them.


Similar experience, closed ad account because of “suspicious activity”, at least ten support tickets (half closed automatically), four lame apologies (our system says no, sry) and then finally, “there was an error in our system, you’re good to go”


I lost it because there was some sort of UI bug in the ad preview page that led the page to reload a bunch of times really quick, and boom, I'm apparently a criminal. You're lucky though, I never got it back and I had to change my business credit card because they somehow kept charging me after suspension.


Really? I've never had problems contacting them by email. They're one of the easiest tech companies to talk to.


In my experience they're one of the most useless and difficult companies I've ever tried to interact with.


The entirety of the universe is contained in the preceding two comments.


Aah, finally it all makes sense!



What did you contact them for? Just curious as I've almost always heard they're like Google and impossible to get a human response from.


What is their email? I've never found one that they'd respond to.


I have an option to email them, live chat, or have our account manager contact us. I guess if you spend enough on ads per month you are entitled to more support...?


Try contacting their NOC, they _may_ give you a human that can help.


Try their live chat.


Do they actually have a live chat available for smaller advertisers? When I looked last year there wasn't


They do, I have used it a couple of times. Go to facebook.com/business/help and click the get started button at the bottom for "find answers or contact support". Follow through and you will see a button called "chat with a representative".


Twitter does the same thing. It sends a bunch of spoofed visitors from Korea, Germany, and the US. The bots are spoofed to make it harder to filter them.


I wonder what the (legitimate?) reason is for them to spoof. Seems intentionally shady. Maybe there's a legit reason we're missing?


Possibly trying to avoid people sending them a different version of the page than users would see (of course they could change the page after the initial caching of a preview, but Twitter might refresh/check them later).

Also, you often need an impressive amount of the stuff that's in a normal UA string for random sites to not break/send you the "unsupported browser, please use Netscape 4 or newer!!!" page/..., although you normally can fit an identifier of what you really are at the end. (As an example, here's an iOS Safari user agent: "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1" - it's not Mozilla, it's not Gecko, but it has those keywords and patterns because sites expect to see that)


Yeah, I once tried to tell my browser to send... I forget; either no UA, or a blank UA string, or nonsense or just "Firefox" or something. I figured, "hey, some sites might break, but it can't be that important!" It broke everything. IIRC, the breaking point was that my own server refused to talk to me. Now, I still think this is insane, but apparently this really is how it is right now.


It should make you realize just how much abuse there is on the internet that it's worth it to just filter traffic with no UA.

Usually people get stuck on the fact that we can't have nice things, so X sucks for not letting us have nice things, yet I seem to never see people acknowledge why we can't have nice things.

If they did, I'd see a lot more "ugh, bad actors suck!" and less "ugh, websites are just trying to make life miserable for me >:("


Is there an official error message for that? Because filtering no UA would trip me up every time I use wget. Most of the time, if I'm casually using wget for something, I don't bother with a UA. If sites started rejecting that, I'd like to get a clear error message, so I would not go crazy trying to figure out what the problem was. If I got a clear message "send a UA" then I would probably started wrapping my wget requests in a short bash script with some extra stuff thrown in to keep everyone happy. But I'd have to know what it is that is needed to keep everyone happy.


wget has a default User-Agent string.
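
Right - by default it sends something like "Wget/1.20.3", so a no-UA filter wouldn't catch it anyway. And if a site insists on a browser-looking string, it's one flag (the UA value below is just a made-up example):

    wget --user-agent="Mozilla/5.0 (compatible; MyFetcher/1.0)" https://example.com/page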


Well, if filtering for UA really makes things difficult for bad actors, then they do suck, but more as a technical opinion than a moral statement.


That's amazing. Any idea what piece of middleware on your own server was doing that?


I don't remember, but what really got me was that I wasn't running anything that I expected to do fancy filtering; I think this was just Apache httpd running on CentOS. But there was no web application firewall, no load balancers, pretty sure fail2ban was only set up for sshd. It at least appeared that just the apache stock config was in play.


You removed a vital part of the http protocol and you're surprised when things break?


Seems like it follows the spec to me:

https://tools.ietf.org/html/rfc7231#section-5.5.3


That makes sense. I couldn't come up with a shady reason why they would do it to be honest, but I was curious.



When you know your target runs Internet Explorer, you serve phishing pages to IE users and some boring content to other users, at the server level. We've had this keep our user tests out of Google Safe Browsing and so on. I'm sure similar tricks end up applied to Facebook's UA and "legitimate" marketing sites.


Thanks man! I'll have a look.


Hey! Facebook engineer here. If you have it, can you send me the User-Agent for these requests? That would definitely help speed up narrowing down what's happening here. If you can provide me the hostname being requested in the Host header, that would be great too.

I just sent you an e-mail, you can also reply to that instead if you prefer not to share those details here. :-)


I'm not sure I'd publicly post my email like that, if I worked at FB. But congratulations on your promotion to "official technical contact for all facebook issues forever".


My e-mail address is already public from my kernel commits and upstream work. :-)


Don't think I used my email for anything important during my time at FB. If it gets out of hand he could just make a request to have a new primary email made and use the above one for "spam"


Curiosity question: does FB use Gmail/Google suite?


FB uses Office365 for email. It was on-premise Exchange many many years ago, but moved "to the cloud" a while back.


Feels odd to read that Facebook uses Office365/Exchange for email. They haven't built their own "fSuite" yet; I thought they would simply promote Facebook Messenger internally. I'm only half joking.


Most communication is via Workplace (group posts and chat). Emails aren't very common any more - mainly for communication with external people and alerting.


My impression is that they pretty much roll their own communication suite.


That’s somewhat correct

But at least for the email/calendar backend it's Exchange

The internal replacement clients for calendar and other things are killer...have yet to find replacements

For the most part though they use Facebook internally for messaging and regular communication (technically now Workplace but before it was just Facebook)

Email is really just for external folks


I don't want to share my website for personal reasons, but here is some data from cloudflare dashboard (a request made on 11 Jun, 2020 21:30:55 from Ireland, I have 3 requests in the same second from 2 different IPs)

user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

ip 1: 2a03:2880:22ff:3::face:b00c (1 request)

ip 2: 2a03:2880:22ff:b::face:b00c (2 requests)

ASN: AS32934 FACEBOOK


Yes, requests are still coming. Thanks Cloudflare for saving my ass.


Hey,

I can by the way confirm this issue. I work at a large newspaper in Norway and around a year ago we saw the same issue. Thousands of requests per second until we blocked it. And after we blocked it, traffic to our Facebook page also plummeted. I assume Facebook considered our website down and thus wouldn't give users content from our Facebook page either, as that would serve them content that would give a bad user experience. The Facebook traffic did not normalize until the attack stopped AND we told Facebook to reindex all our content.

If you want more info, send me an email and I'll dig out some logs etc. thu at db.no


Thanks for looking at this!


Thank goodness for mod_rewrite, which makes blocking/redirecting traffic on basic things like headers pretty easy.

https://www.usenix.org.uk/content/rewritemap.html

You could of course block upstream by IP, but if you want to send the traffic away from a CPU heavy dynamic page to something static that 2xx's or 301's to https://developers.facebook.com/docs/sharing/webmasters/craw... then this could be the answer.
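
A minimal sketch of that idea - the UA tokens are the ones reported in this thread and the target path is a placeholder:

    RewriteEngine On
    # match the Facebook crawler UAs reported in this thread (case-insensitive)
    RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|facebookcatalog) [NC]
    # avoid a redirect loop on the static page itself
    RewriteCond %{REQUEST_URI} !^/fb-static\.html$
    RewriteRule .* /fb-static.html [R=302,L]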


Here are some more details from my report to FB:

"My webserver is getting hit with bursts of hundreds of requests from Facebook's IP ranges. Google Analytics also reports these hits and shows them as coming from (mostly) Philippines and Peru, however, IP lookup shows that these IPs belong to Facebook (TFBNET3). The number of these hits during a burst typically exceeds my normal traffic by 200%, putting a lot of stress at our infrastructure, putting our business at risk.

This started happening after the Facebook Support team resolved a problem I reported earlier regarding connecting my Facebook Pixel as a data source to my Catalog. It seems Facebook is sending a bot to fetch information from the page, but does so very aggressively and apparently calls other trackers on the page (such as Google Analytics)"

69.171.240.19 - - [13/Aug/2018:11:09:52 +0200] "GET /items/ley3xk/ford-dohc-20-sierra-mondeo-scorpio-luk-set.html HTTP/1.1" 200 15181 "https://www.facebook.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"

"e.g. IP addresses 173.252.87.* performed 15,211 hits between Aug 14 12:00 and 12:59, followed by 13,946 hits from 31.13.115.*"

"What is also interesting is that the user agents are very diverse. I would expect a Facebook crawler to identify itself with a unique User-Agent header (as suggested by the documentation page mentioned earlier), but instead I see User-Agent strings that belong to many different browsers. E.g. this file contains 53,240 hits from Facebook's IP addresses with User-Agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"

There are a few Facebook useragents in there, but far fewer than browser useragents:

7,310 hits: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

2,869 hits: facebookexternalhit/1.1

1,439 hits: facebookcatalog/1.0

120 hits: facebookexternalua

Surprisingly, there is even a useragent string that mentions Bing: 6,280 hits: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

These IPs don't only fetch the HTML page, but load all the page's resources (images, css, ..) including all third-party trackers (such as Google Analytics). Not only does this put unnecessary stress on our infrastructure, it drives up the usage costs of 3rd party tracking services and renders some of our reports unreliable."

final response from FB: "Thanks for your patience while our team looked into this. They've added measures to reduce the amount of crawler calls made. Further optimizations are being worked on as well, but for now, this issue should be resolved." <- NOT.


I don't really understand what the issue is. On my welcome page (while all other URLs are impossible to guess) I give the browser something that requires a few seconds of CPU at 100% to crunch. And I track some user action in between, visits to tarpitted URLs, etc. In the last few years no bot came through. Why bother with robots.txt, just give them something to break their teeth on...

(I would give you the URL, but I just don't want it to be visited)


I don't think most users would appreciate having a site spike their CPU for a few seconds when they visit...at least I wouldn't.


The far-fetched "solution" along with the "I don't understand everyone doesn't do <thing that is easy to understand why everyone wouldn't do it>" make it hard for me to believe you're actually doing this. Sounds more like a shower thought, maybe fun weekend project, than something in production.


Sounds like terrible UX! Won't browsers give you a "Stop script" prompt if you do that?


I don’t think the solution is flatlining everyone’s CPUs.


> but I just dont want ti be visited

Maybe you should just take your website offline?


How much have you mined?


Same thing has happened to me: https://twitter.com/moonscript/status/1124888489298808834

The network address range falls under Facebook's ownership, so I don't think it's someone spoofing. I do think it's very possible someone found a way to trigger crawl requests in large quantities. Alternatively, I would not be surprised if it's just a bug on Facebook's end.


They've done this before to me, too. First I tried `iptables -j DROP`, which made the machine somewhat usable, but didn't help with the traffic. After trying a few things, I tried `-j TARPIT`, and that appeared to make them back off.

Of course, sample size of 1, etc. It could have been coincidental.
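
For anyone curious, the rough shape of both (TARPIT isn't in stock iptables, it comes from the xtables-addons package; the netblock is just one of the ranges mentioned in this thread):

    # plain drop - works everywhere
    iptables -A INPUT -s 173.252.64.0/18 -p tcp --dport 443 -j DROP

    # tarpit instead, if xtables-addons is installed
    iptables -A INPUT -s 173.252.64.0/18 -p tcp --dport 443 -j TARPIT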


Tarpits are an underappreciated solution to a pool of bad actors.

You can add artificial wait times to responses, or you can just route all of the 'bad' traffic to one machine, which becomes oversubscribed (be sure to segregate your stats!). All bad actors fighting over the same scraps creates proportional backpressure. Just adding 2 second delays to each request won't necessarily achieve that if multiple user agents are hitting you at once.


I never looked into the TARPIT option in iptables before reading your comment. That seems really useful. I've been dealing with on and off bursts of traffic from a single AWS region for the last month. They usually keep going for about 90 minutes every day, regardless of how many IPs I block, and consume every available resource with about 250 requests per second (not a big server and I'm still waiting for approval to just outright block the AWS region). I'm going to try a tarpit next time rather than a DROP and see if it makes a difference.


Be careful, as tarpitting connections can consume your resources faster than those of the attacker.


Most spiders limit the number of requests per domain, so if it's stupidity and not malice, you probably don't have a runaway situation.

... unless you're hosting a lot of websites for people in a particular industry. In which case the bot will just start making requests to three other websites you also are responsible for.

Then if you use a tarpit machine instead of routing tricks, the resource pool is bounded by the capacity of that single machine. If you have 20 other machines that's just the Bad Bot Tax and you should pay it with a clean conscience and go solve problems your human customers actually care about.


Only if you use conntrack though, no?


Interesting. Implemented DROP for some Cloudflare net blocks a few weeks ago due to lots of weird interrupted TCP connections from them.

That seemed to be good enough, but I'll consider the TARPIT option if it turns out to be needed. :)


one man’s weird interrupted TCP connection is another man’s nmap attempt.


TIL TARPIT!


This was happening to us > 5 years ago. The FB crawlers were taking out our image serving system as we used the og:image thing. What we did was route FB crawler traffic to a separate Auto Scaling Group to keep our users happy while also getting the nice preview image on FB when our content was shared. I can't overstate the volume of the FB requests; I can't remember the exact numbers now but it was insane.


I once made a PHP page which fetches a random wikipedia article, but changes the title to "What the fuck is [TOPIC]?". It was incredibly funny, until I started getting angry emails from people, mostly threatening a form of libel lawsuit.

Turns out, since it was a front for all of Wikipedia[1], Google was aggressively indexing it, but the results rarely made it to the first search page. And since this isn't exactly an important site, old results would stick around.

Hence a pattern:

1. Some Rando creates a page about themselves

2. Wikipedia editors, being holy and good, extinguish that nonsense

1.5. GOOGLE INDEXES IT

3. Rando, by nature of being a rando, googles themselves, doesn't find their wikipedia page anymore (that's gone), but does find a link to my site in the first page of google results with their name.

4. Lawyers get involved, somehow

Details: http://omershapira.com/blog/2013/03/randomfax-net/

[1] Hebrew wikipedia. I can't imagine what would've happened on the English version.


I understand why people got angry, though. The title was roughly "What is this '<article name>' shit?"[1], with "shit" in this Hebrew context also able to be interpreted as calling the subject "shit".

Saying "מה זה לעזאזל X?"[2] is closer to "what the fuck is X?" (except it's more "what the hell" than "what the fuck").

[1] I wasn't sure how to portray this to non-Hebrew speakers but, surprisingly, Google Translate actually nailed it: https://translate.google.com/#view=home&op=translate&sl=auto...

[2] Google Translate got this example right, too: https://translate.google.com/#view=home&op=translate&sl=auto...


Doesn't ignoring headers and taking someone's site down fall afoul of the CFAA? Especially given how comments here are showing that this is a recurring issue? Depending on which side of the fence you're on, there could be standing to go after FB for this, either for money or for their lawyers to help set a precedent to limit the CFAA further.


When someone does this to Facebook it's malicious and they go to jail. When Facebook does it to someone else... "oops".


Never forget Aaron Swartz. All he did was send download requests to JSTOR. https://www.youtube.com/watch?v=9vz06QO3UkQ


I'm going to get a ton of hate for this, but he was doing it with the intent to redistribute the content for free. He didn't own the content. There's a big difference. That being said it's very sad what came about of that. I really don't think the FBI needed to be involved.


> he was doing it with the intent to redistribute the content for free

There is actually no proof of this.


Really? I felt that what happened to Aaron was a tragic injustice on many levels (from the fact that he was charged at all to the number of charges they threw at him), but for some reason I had always thought that it was fairly well-known that that was his intent.

I have no idea where I got that impression from, but I do recall reading several articles about him, as well as his blog around that time. It's possible I just internalized others' assumptions about his behavior, but it certainly seemed like an in-character thing for him to do.


IIRC, he had published a manifesto about how he believed that information that was paid for with public funds should be free for the public to view. However, I don't believe that he ever mentioned why he was making get requests to JSTOR's servers and I know that he never uploaded any of those JSTOR documents to the internet for public download.


Yeah, but any decent criminal lawyer could tear "he once wrote a manifesto" to shreds. His political beliefs would only be pertinent if he wrote the manifesto somehow in connection with what he was doing; otherwise, it would not be sufficient to show intent to distribute.


Criminal law is about proving things, not in-character assumptions.


Intent is the sticking point for the vast majority of prosecutions. That's why you have a jury, not Compute-o-Bot 9000.


Yeah, and Facebook did this crawling with intent to commit CFAA violations - computer fraud and abuse. Now someone call the feds quickly before the perpetrators are zucked into hiding


From the point of view of the law, the intent (malicious or benevolent) is often as important as the action itself.


Nah, it only matters whether you are a large corporation or not. If you are a large corporation you will ruin your opponent financially with legal fees (whether they are guilty or not). If you are a small timer going up against a large corporation they will delay the court case until you cannot afford the legal fees.


The always-classic "what color are your bits?"

https://ansuz.sooke.bc.ca/entry/23


No. The owner needs to send Facebook a cease and desist and block their IPs in order to revoke authorization under the CFAA, and then Facebook would need to continue accessing his website despite explicitly being told not to. That's the precedent set by Craigslist v. 3Taps.

Whether or not the above even holds today (hiQ Labs v. LinkedIn argues it does not) remains to be decided in a current appeal to the Supreme Court.


That's 81 requests per second on average.

Shouldn't anybody doing such thing be liable, and be sued for negligence and required to pay damages?

That sounds like a lot of bandwidth (and server stress).


Yes, this was my first thought too - Facebook is essentially denial of servicing people's sites and it sounds like they're aware of it. Move fast and break your things. If this were your average Joe I'm fairly certain something like the DMCA could be used.


What? How does the DMCA apply here at all?


The DMCA has a couple major parts. The part most people are familiar with deals with safe harbor provisions and "DMCA takedowns". The relevant part for this discussion makes it illegal to circumvent digital access-control technology. So you could argue that if you have reasonable measures in place to stop automated access of your content such as robots.txt then Facebook continuing to send automated requests would be a circumvention of your access controls.


Shouldn't any public facing website have a rate limit? If someone attempts to circumvent simple rate limits (randomizing the source IP or header content), then that could demonstrate intent to cause damage, and you'd have a better case. But if you don't set a limit, how can you be mad that someone exceeded it?

(I know they're ignoring robots.txt, but robots.txt is not a law. And, it doesn't apply to user-generated requests, for things like "link unfurling" in things like Slack. I am guessing the crawler ignores robots.txt because it is doing the request on behalf of a human user, not to create some sort of index. Google is attempting to standardize this widely-understood convention: https://developers.google.com/search/reference/robots_txt)
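
For what it's worth, a basic per-IP limit is only a few lines in nginx (numbers are illustrative, and as others point out below, this doesn't help much against a crawler spread across hundreds of IPs):

    # http block
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

    # server block
    location / {
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
        # ...normal root/proxy_pass config here
    }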


There is a cost to applying throttling: you still have to receive those requests and filter/block them, plus pay for incoming bandwidth.

The post mentions setting `og:ttl` and replying with HTTP 429 - Too Many Requests, and both being ignored.


Plenty of crawlers, including Facebook's, are operating out of huge pools of IPs usually smeared across multiple blocks. If you have an idea on how to ratelimit a crawler doing 2 req/s from 300+ distinct IPs with random spoofed UAs let me know when your startup launches.


There are many features that you can pick up on. TCP fingerprints, header ordering, etc. You build a reputation for each IP address or block, then block the anomalous requests.

If you don't want to do this yourself and you're already using Cloudflare... congratulations, this is exactly why they exist. Write them a check every month for less than one hour of engineering time, and their knowledge is your knowledge. Your startup will launch on time!


The author specified that the crawler ignores the 429 status code. So they do have some rate limiting in place.


I noticed that, but I get the feeling that sending that status code didn't reduce the amount of work the program did, so it's kind of pointless. You can say "go away", but at some point, you probably have to make them go away. (All RFC6585 says about 429 is that it MUST NOT be cached, not that the user agent can't try the request again immediately. It is very poorly specified.)

The author mentioned that they were using Cloudflare, and I see at least two ways to implement rate limiting. One is to have your application recognize the pattern of malicious behavior, and instruct Cloudflare via its API to block the IP address. Another is to use their rate-limiting that detects and blocks this stuff automatically (so they claim).

Like, there are many problems here. One is a rate limit that doesn't reduce load enough to serve legitimate requests. Another is wanting a caching service to cache a response that "MUST NOT" be cached. And the last is expecting people on the Internet to behave. There are always going to be broken clients, and you probably want the infrastructure in place to flat-out stop replying to TCP SYN packets from broken/malicious networks for a period of time. If the SYN packets themselves are overwhelming, welp, that is exactly why Cloudflare exists ;)

The author is right to be mad at Facebook. Their program is clearly broken. But it's up to you to mitigate your own stuff. This time it's a big company with lots of money that you can sue. The next time it will be some script kiddie in some faraway country with no laws. Your site will be broken for your users in either case, and so it falls on you to mitigate it.


Rate limiting websites is a lot of work. You have to figure out what kind of limits you should set, you have to figure out if that should be on an IP or a /24 or an ASN, if it's an ASN, you have to figure out how to turn an IP into an ASN.

If your website is pretty fast, you probably won't necessarily care about a lot of hits from one source, but if it's supposed to be a small website on inexpensive hosting, and all those hits add up to real transfer numbers, maybe it's an issue.


In the ToS, add a charge of 2 guineas, 5 shillings, a sixpence and a peppercorn per request.


Nah, make it a strand of saffron per request and get your money's worth out of them.


I also saw peaks of 300 reqs per second.


Did anyone notice the branded IP address? 2a03:2880:20ff:d::face:b00c

"face:b00c"


My "404-pests" fail2ban-client status, which drops everything making *.php requests (This machine has never had PHP installed...):

2a03:2880:10ff:14::face:b00c 2a03:2880:10ff:21::face:b00c 2a03:2880:11ff:1a::face:b00c 2a03:2880:11ff:1f::face:b00c 2a03:2880:11ff:2::face:b00c 2a03:2880:12ff:10::face:b00c 2a03:2880:12ff:1::face:b00c 2a03:2880:12ff:9::face:b00c 2a03:2880:12ff:d::face:b00c 2a03:2880:13ff:3::face:b00c 2a03:2880:13ff:4::face:b00c 2a03:2880:20ff:12::face:b00c 2a03:2880:20ff:1e::face:b00c 2a03:2880:20ff:4::face:b00c 2a03:2880:20ff:5::face:b00c 2a03:2880:20ff:75::face:b00c 2a03:2880:20ff:77::face:b00c 2a03:2880:20ff:e::face:b00c 2a03:2880:21ff:30::face:b00c 2a03:2880:22ff:11::face:b00c 2a03:2880:22ff:12::face:b00c 2a03:2880:22ff:14::face:b00c 2a03:2880:23ff:5::face:b00c 2a03:2880:23ff:b::face:b00c 2a03:2880:23ff:c::face:b00c 2a03:2880:30ff:10::face:b00c 2a03:2880:30ff:11::face:b00c 2a03:2880:30ff:17::face:b00c 2a03:2880:30ff:1::face:b00c 2a03:2880:30ff:71::face:b00c 2a03:2880:30ff:a::face:b00c 2a03:2880:30ff:b::face:b00c 2a03:2880:30ff:c::face:b00c 2a03:2880:30ff:d::face:b00c 2a03:2880:30ff:f::face:b00c 2a03:2880:31ff:10::face:b00c 2a03:2880:31ff:11::face:b00c 2a03:2880:31ff:12::face:b00c 2a03:2880:31ff:13::face:b00c 2a03:2880:31ff:17::face:b00c 2a03:2880:31ff:1::face:b00c 2a03:2880:31ff:2::face:b00c 2a03:2880:31ff:3::face:b00c 2a03:2880:31ff:4::face:b00c 2a03:2880:31ff:5::face:b00c 2a03:2880:31ff:6::face:b00c 2a03:2880:31ff:71::face:b00c 2a03:2880:31ff:7::face:b00c 2a03:2880:31ff:8::face:b00c 2a03:2880:31ff:c::face:b00c 2a03:2880:31ff:d::face:b00c 2a03:2880:31ff:e::face:b00c 2a03:2880:31ff:f::face:b00c 2a03:2880:32ff:4::face:b00c 2a03:2880:32ff:5::face:b00c 2a03:2880:32ff:70::face:b00c 2a03:2880:32ff:d::face:b00c 2a03:2880:ff:16::face:b00c 2a03:2880:ff:17::face:b00c 2a03:2880:ff:1a::face:b00c 2a03:2880:ff:1c::face:b00c 2a03:2880:ff:1d::face:b00c 2a03:2880:ff:25::face:b00c 2a03:2880:ff::face:b00c 2a03:2880:ff:b::face:b00c 2a03:2880:ff:c::face:b00c 2a03:2880:ff:d::face:b00c


I think we're together in this, my friend


I believe, but have not tried, that you can craft zipped files that are small but when expanding produce multi-gigabyte files. If you can sufficiently target the bad guys, you can probably grind them to a halt by serving only them these files. Some care needs to be taken so you don’t harm the innocent and don’t run afoul of any laws that may apply. Good luck.


Do you think it would be possible to create something similar for gzip? If you then serve with Content-Type: text/html, and Content-Encoding: gzip, the client would accept the payload. And when it tries to expand it, it would get expanded to a large file, eating up their resources.
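
It should be - gzip is just a deflate stream, so a single member of zeros tops out around 1000:1 (see the deflate-limit discussion further down). A rough sketch, with sizes and paths purely illustrative:

    # ~10 GB of zeros compresses down to roughly 10 MB
    dd if=/dev/zero bs=1M count=10240 | gzip -9 > /var/www/bomb.gz

    # then serve the pre-compressed file with the headers you describe,
    # e.g. an nginx location along these lines:
    #   location = /innocent-page.html {
    #       default_type text/html;
    #       add_header Content-Encoding gzip;
    #       alias /var/www/bomb.gz;
    #   }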


That was the idea. There's a name for it - zip bomb. https://en.wikipedia.org/wiki/Zip_bomb

A superficial search leads to things like https://www.rapid7.com/db/modules/auxiliary/dos/http/gzip_bo...

https://stackoverflow.com/questions/1459673

You really want to be careful about potentially breaking laws ...


Here is another article: https://www.blackhat.com/docs/us-16/materials/us-16-Marie-I-...

If FB supports brotli, a much bigger compression factor than 1000 is possible, apparently.


Here's a brotli file I created that's 81MB compressed and 100TB uncompressed[1] (bomb.br). That's a 1.2M:1 compression ratio (higher than any other brotli ratio I see mentioned online).

There's also a script in that directory that allows you to create files of whatever size you want (hovering around that same compression ratio). You can even use it to embed secret messages in the brotli (compressed or uncompressed). There's also a python script there that will serve it with the right header. Note that for Firefox it needs to be hosted on https, because Firefox only supports brotli over https.

Back when I created it, it would crash the entire browser of ESR Firefox, crash the tab of Chrome, and would lead to a perpetually loading page in regular Firefox.

It's currently hosted at [CAREFUL] https://stuffed.web.ctfcompetition.com [CAREFUL]

[1] https://github.com/google/google-ctf/tree/master/2019/finals...


A single layer of deflate compression has a theoretical expansion limit of 1032:1. ZIP bombs with higher ratios only achieve it by nesting a ZIP file inside of another ZIP file and expecting the decompressor to be recursive.

This means you can serve 1M payload and have it come out to 1G at decompression time. Not a bad compression ratio, but it doesn't seem like enough to break Facebook servers without taking on considerable load of your own.

http://www.zlib.net/zlib_tech.html


Just guessing, but maybe a really large file with a limited character set, maybe even just repeating 1 character, should compress really well


1 character repeated 1877302224 times has a compression ratio of 99.9%: ~1.9GB compresses to ~1.8MB.
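
That's easy to verify with a one-liner (NUL bytes here, but any single repeated byte behaves about the same):

    head -c 1877302224 /dev/zero | gzip -9 | wc -c   # prints roughly 1.8 MB worth of bytes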


I believe zip bombs are pretty easily mitigated, with memory limits, cpu time ulimits, etc.

But zip bombs aren't limited to zip files... lots of files have some compression in them. You can make malicious PNGs that do the same thing. Probably tonnes of other file types.


Good points. According to OP, the goal would be to persuade FB to back off crawling at 300 qps, not to bring down the FB crawl. Having each request expand to 1GB is probably already enough to do that, since a crawler implementing the mitigations you mentioned will likely reduce the crawl rate. Should be easily testable.


The mighty power of IPV6 :P


if only they could have convinced IANA to give them a vanity prefix.


FB has 2a00::/12 (and probably other blocks?) and can make something that looks like its company name from [0-9a-f]. Why wouldn't it do something harmless and fun like this? It's not as if it requires special dispensation or is breaking any rules.


I don't think the person was implying it's bad. Just pointing out an easter egg.


Ah, well if that’s the case then I apologise. My bad!


A single company does not get a /12 prefix. 2a00::/12 is almost half of the space currently allocated to all of RIPE NCC. Facebook seems to have 2a03:2880::/29 out of that /12, and a /40 through ARIN (2620:0:1c00::/40)


Ugh right you are. I misread the whois output and didn't stop to consider how realistic a /12 was. This is embarrassing.


ugh why is ipv6 impossible to understand :/


Is the notation the problem here? IPs have 16 bytes now, so the notation is a bit more compact: hex instead of decimal numbers, and :: is a shortcut meaning "replace with as many zeros as necessary to fill the address to 16 bytes"
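
For example, the branded address mentioned above expands like this:

    2a03:2880:20ff:d::face:b00c  ==  2a03:2880:20ff:000d:0000:0000:face:b00c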


It's not. Think IPv4 but bigger.




It bugs me that they require a login to see a bug report, especially in the case of this bug where those affected aren't necessarily facebook users.


I've stumbled into that too.


Reading through the article and comments here I find it odd that a company like FB can't get a crawler to respect robots.txt or 429 status codes.

Even I would stop and think "maybe I should put in some guards against large amounts of traffic" when writing a crawler, and I'm certainly not one of those brilliant minds who manage to pass their interview process.


If the dev respected others’ boundaries and norms of expected behavior, they would probably work somewhere else.


It's an interesting thing to think about, that someone can make a request to your website and you pay for it.

This is not how the mail works, where someone needs to buy a stamp to spam you.


My office still gets spam faxes to this day. The paper and toner only add up to a few cents a month, so it's not worth doing anything about.

I knew a realtor that had a sheet of black paper with a few choice expletives written on it that they would send back to spammers. There was an art to taping it into a loop so it would continuously feed. This was a few decades ago, when the spam faxes could cost more than a stamp.


My desk phone at an old job used to get dialed by a fax machine. Not fun picking that up. I redirected it to a virtual fax line and it turns out it was a local clinic faxing medical records. I faxed them back with some message about you have the wrong number but they never stopped.


If you really wanted to make them stop, you might find a contact for their lawyer and make HIPAA noises at them ;)


I considered it. That was after my time in hospital IT. Really I wanted to stop getting my eardrums blown out by a robot. Luckily I happened to be working at a phone company so changing my number just took a couple clicks.


My dad used to receive a lot of misdirected faxes.

His solution was to send a return fax with disorderly handwriting begging and pleading for them to fax someone else.

The other person stopped faxing him.


I was using a virtual fax service so a digital equivalent would have been easy. Not sure if I was creative enough to invert the colors before sending my response. Really I just wanted the noise to stop.


It’s because we as an industry decided to give in and submit to the cloud providers’ bullshit model of paying overpriced amounts for bandwidth while good old bare-metal providers still offer unmetered bandwidth for very reasonable prices.


The question isn't about pricing. The fact that you pay for your own bandwidth and metal is a problem.


Paying for your bandwidth and metal isn't a problem per-se as long as prices are reasonable (which they are with the old-school bare-metal providers). After all, those services do cost money to provide.

The problem happens when prices are extortionate (or the pricing model is predatory, ie pay per MB transferred instead of a flat rate per 1Gbps link) and relatively minor traffic translates to a major bill.

This particular issue is about 80 requests/second which is a drop in the bucket on a 1Gbps link. It would be basically unnoticeable if it wasn't for the cloud providers nickel & diming their customers.


You're still not getting it. I'm not arguing about whether it should be free or how much it should cost. I'm saying that when you have to pay for someone else's request, it's ripe for abuse. Instead, a system where the requester pays would be similar to how postage works in the US. Although I think it's highly unlikely that would make sense in a digital world.


Most of my junk mail does not have stamps because the spammers drop it off in pallets at the post office and just have it delivered to everyone.

The USPS is like a reverse garbage service.


This is a common pattern in (blocking i/o versus non-blocking i/o) and (sync task versus async task).

The typical analogy I hear for DDoS attacks is if you own a business and a giant group of protesters drives traffic to your store, but no one is interested in buying. They create a backlog to enter the store, they crowd the aisles, they pick up your inventory, and the lines and crowds scare off your legitimate visitors/customers.


This reminds me of a case years ago when I had a small-ish video file on my server that would cause Google Bot to just keep fetching it in a loop. It wasted a huge amount of bandwidth for me since I only noticed it in a report at the end of the month. I had no way of figuring out what was wrong with it, so I just deleted the file and the bot went away.


Was it a real FB crawler or a random buggy homemade bot masquerading as one?

Check here: https://developers.facebook.com/docs/sharing/webmasters/craw...



Someone up to no good using a false flag would not surprise me in the slightest.

"who do people already like to hate? I'll pretend to be them."



IP packets have a source field, which can be fake and not their actual IP. That's a Facebook IP, but the packet might not have actually come from it.


If they made actual requests that went through, a connection got established. That won't happen with a faked source.


Ah okay, thanks for helping me learn :)


Could be the IP address OP is showing is actually from an X-Forwarded-For header or another proxy header (but not the actual source of the packet).


UDP packets can fake their source IP. TCP packets realistically can't.


More precisely, spoofing a TCP handshake is a problem that we know how to prevent. Odds are pretty good that your kernel has these protections enabled by default, but it's not guaranteed and you should check as a matter of due diligence.


Why is it so hard for people here to believe that Facebook could have a crawler misbehaving? It is all over this thread, even though multiple people are saying they saw the same thing and confirmed the IP belonged to FB.

What gives?


If you can verify it's actually coming from Facebook's IP address (as many other comments here suggest), you should absolutely get in touch with them, though I'm not sure how. Perhaps their security team. That's a very serious bug that they'd be glad to catch -- if it's affecting you it's surely affecting others.

Otherwise it's a bot/malware/etc. spoofing Facebook and gone wrong, which sucks. And yeah just block it by UA, and hopefully eventually it goes away.


From the sounds of it the difference between Facebook's actual crawler and malware could be hard to define or differentiate, given the circumstances.


It was from their IPs. I tried to figure out how to get in touch, without success


This is one of the better HN threads in a while. Starting from an easily described problem, it goes into a number of interesting directions including how FB seems to DDOS certain sites for reasons unknown, defenses against such attacks and DDOS in general, GPL licensing, how to collapse HN conversations, etc.

Not a profound comment. But it was really fun to follow the path down the various rabbit holes. This is why I like HN.


Interesting. According to the Facebook Crawler link provided by the OP [1], it makes a range request of compressed resources. I wonder if the response from the OP’s PHP/SQLite app isn’t triggering a crawler bug.

Maybe Cloudflare can step up and act as an intermediary to isolate the problem. Isolating DoS attacks is one of their comparative advantages.

[1] https://developers.facebook.com/docs/sharing/webmasters/craw...


It would be nice to understand how you came to the conclusion that this was a Facebook bot.



You can check the list of IP addresses?


The OP did not (explicitly) check, but you can check whether the IP falls into a range allocated to Facebook, e.g. https://ipinfo.io/AS32934
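
A quick way to sanity-check a single address from the terminal (the fields to grep for are what RIPE/ARIN whois typically returns; exact output varies by registry):

    whois 2a03:2880:22ff:3::face:b00c | grep -Ei 'netname|orgname|origin'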


Of course I've checked. Here is a recent example. https://news.ycombinator.com/item?id=23491455


Make sure you file a bug - there are a myriad of sources internally, but we can often hunt it down easily enough (assuming it gets triaged to eng). Important info is the host of the urls being crawled, and the User Agent attached to the requests (headers are also good too). A timeseries graph of the hits (with date & timezone specified) can also help.


Why is this the owner's problem? Someone at Facebook should be filing the bug, or better yet instrumenting their systems so that incidents like this issue a wake-up-an-engineer alert.

Fuck this culture of "it's up to the victim of our fuckup to file a bug report with us".


Not just “file a bug report” but also compile a time series graph (lol) and then pray that Facebook triages it correctly, which they have no incentive to do.


And to make things even more ridiculous, you need to sign in with a facebook account to even file a bug there.


100%



Did you check the IP and not just the UA? Does it match the one from FB?



Surprise, Facebook not caring about playing nice with others!

Anyway, what reason would they have for so many requests? Why would they need to do 300 per second?


We had that with Google in the past. We put a KML database on a web server and made the Google Maps API read it. Google went to fetch it without any throttling (my guess is that they primed their edge servers straight from us); the server went down so fast we thought there had been a hardware failure. We ended up putting a CDN in place just for Google's servers.


What would happen if you created a "rabbithole" for webcrawlers? Would they fall for it?


Modern web crawlers usually have mechanisms against rabbit holes, e.g. for Apache Nutch you can configure it should only follow to a certain depth of links from a root, it should only follow a redirect a few times and so on. It's always a trade-off though. If you cut off too aggressive you don't get everything, if you cut off too late you waste resources.


Maybe someone at Facebook is testing a project that didn't work as expected?


I wish we had better tools for things like this.

What I've heard happen is someone puts a governor on the system, and much later your new big customer starts having errors because your aggregate traffic exceeds capacity planning you did 2 years ago and forgot about.


Are you sure it's a crawler and not a proxy of some sort? Eg. one of your links is on something high traffic in facebook, and all requests are human, running through fb machines.


Nah, it's a specific user-agent for their crawler


Facebook caches URLs pretty aggressively (often you need to explicitly purge the cache to pick up updates) so I don't quite understand exactly what happened here. The article is very light on details. Do you have 7 million unique URLs that have been shared on Facebook? Or is it the same URL being scraped over and over?

Are you correctly sending caching headers?


When I enabled IPv6 traffic at home and logged default denies, I constantly saw Facebook-owned IPv6 blocks trying to reach the addresses browsing Facebook.

I have no idea what it's trying to do, but it's legitimately the only inbound IPv6 traffic that isn't related to normal browsing.


I’m a bit confused. Facebook is a social media company, why do they send out crawlers?


Facebook shows previews of pretty much everything you share a link to, so at a minimum they have to fetch the page/image to build those previews.


I see, thanks for the explanation.


Why would a crawler want to crawl the same page every second (let alone 71 times a second). Even if I were Google, wouldn't crawling any page every few mins (or say every minute) be way more than enough for freshness?


There was a very old post: https://news.ycombinator.com/item?id=7649025, lots of ways to abuse Facebook network.


> I own a little website I use for some SEO experiments [...] can generate thousands of different pages.

Do you have more details on this? One could read that as if the page is designed to attract crawlers.


Is it worth duplicating the FB-specific parts of your website at another web address as a test site, so that if it were similarly affected, you could provide the address info to help troubleshoot?


Shouldn't Cloudflare have considered this a DDoS attack?


Hate to say but this is probably a FB engineer running a test/experiment and not a production crawler taking robots.txt etc into account.


Or malware infecting some FB-internal machine?


Doubt it. If you're a bad enough dude to get control over Facebook's internal infrastructure, I doubt you'd blow that access by using it to spam random sites.


Detect the crawler and delay the responses as long as you can. Have it use as many resources as you can; that should be interesting.


> And then they’ll probably ask you to pay the bill

Can anyone comment on whether there would be any legal basis for this?


Thanks for sharing. I’d like to hear more about the website described here, it sounds very interesting.


I can add I work in SEO (tech side), so the website is "super optimized", but it's not rocket science.


Super optimized? The site takes longer to load than Hacker News for me.


They explicitly didn't mention the site this affected. It's not the one linked in the OP.


Super optimized with Facebook button on each post then. Not my definition of 'super optimized', but you know.


Super optimized in a SEO sense, not in a pleasing the HN crowd kind of sense I'd assume ;)


You're right, but for some reason the whole SEO thing just winds me up. It's my opinion that 'good' SEO makes sites worse for actual people to use.


SEO is for machines, not for users IMHO.


Still affects users.

Potential Positive: page speed, https

Negative: All blog posts with the same length, keywords dropped in every paragraph


Can't share, sorry :)


Dude something is super sketchy


Nah


I wonder what the crawler does that requires so many requests per day to the same site...


Title change? A stupid Facebook crawler was making 7M requests per day to my website


Nice one. ;)


We had a similar experience with Yandex. Had to block it. The crawler started crawling a larger e-shop by trying out every single possible filter combination. In other words, we got a DDoS attack from Yandex.


Then your website isn't stupid but rather the FB crawler


Do you have some sort of fb plugin in your SEO site?


Why does facebook have crawlers in the first place?


Their 2FA page was broken yesterday...it goes on


Unrelated, but has anyone written a Chrome/Firefox extension to browse the web sending out Googlebot or Facebook user agent?

I wonder if you can bypass paywalls or see things that aren't generally presented to regular users


I surfed through a Google network for a bit a few years ago, and it does reveal things. The most common was Drupal and WordPress sites with some malicious code running that would link to online pharmacies on every page, but only to Google user agents / networks, I assume to boost their ranking. I let a bunch of webmasters know, most of them didn't believe me because they didn't see it themselves.


You wouldn't even need an extension. Both browsers allow you to change your User-agent. An extension just makes it a little easier.

I have a UA switcher on Firefox so I can change my UA in two clicks. One of the built-in UA options is indeed the Googlebot.
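
You can run the same experiment from the command line - fetch a page with a Googlebot UA and with a browser UA, then diff the two (URL is a placeholder, and sites that cloak based on IP rather than UA won't be fooled):

    curl -sA "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/article > as-googlebot.html
    curl -sA "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" https://example.com/article > as-browser.html
    diff as-googlebot.html as-browser.html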



You chose to mention Safari, but link to an article about Chrome, Firefox, and Safari? Looks like all major browsers (probably Edge too) have this built in.



Someone could use this for DDOS or Coinhive


I think sending back a `429 Too Many Requests` can solve this. Unless you want to block their IPs completely, which I doubt you'd want to do.


A well made crawler that would support 429 codes isn't likely to have this problem in the first place.


> HTTP 429 -> ignored

It's in the article


I tried. It was ignored.


It says in the post he did this already

