We've had the same issue. They were doing huge bursts of tens of thousands of requests in a very short time, several times a day. The bots didn't identify as FB (they used "spoofed" UAs) but were all coming from FB-owned netblocks.
I've contacted FB about it, but they couldn't figure out why this was happening and didn't solve the problem.
I found out that there is an option in the FB Catalog manager that lets FB auto-remove items from the catalog when their destination page is gone. Disabling this option solved the issue.
I just had an idea: if you control your own name server I believe you could use a BIND view to send all their own traffic to themselves based on the source address.
By the way, if someone discovers how to trigger this issue, it would be easy to use it as a DoS pseudo-botnet.
I had a different idea. Maybe you could craft a zip-bomb response. The bot would fetch the small gzipped content and upon extraction discover it was GBs of data? Not sure that's possible here, when responding to a request, but it would surely turn the admins' attention to it.
Check out the code generation parts and modules; they are the most mature. We have HRDs (like CRDs in k8s, but for anything) and a scripting language somewhere between bash and Python coming out soon too.
I've tried to work out what your project does but I'm none the wiser. GEB is prominent on my bookshelf. I'm a sysadmin and I got as far as "hollow wold" in Go, or was it "Hail Marrow"? Can't remember.
I've checked out your repo for a look over tomorrow when I'm *cough* sober!
Something about this idea sits uncomfortably with me. I also just had an idea / thought experiment based on your idea.
We think of net neutrality as being for carriers and ISPs, but you could see it applied to a publicly accessible DNS service too. These DNS service providers are just as much part of the core service of the Internet as anyone else. It’s not a huge leap to require that those who operate a publicly accessible DNS service are bound by the same spirit of the regulations: that the infrastructure must not discriminate based on who is using it.
It’s different to operating a discriminatory firewall. DNS is a cacheable public service with bad consequences if poisonous data ends up in the system. Fiddling with DNS like this doesn’t seem like a good idea. Too much weird and bad stuff could go wrong.
Another analogy would be to the use of encryption on amateur radio. It seems like an innocuous idea, but the radio waves were held open in public trust for public use. If you let them be used for a different (though arguably more useful) purpose, the resource ends up being degraded.
Also along these lines of thought [begin irony mode]: FCC fines for DNS wildcard abuse / usage.
Isn't that the argument that providers make for wanting to meter usage? I.e. video streamers, torrenters and netflix and the like are 'abusing' the network by using a disproportionate amount of their capacity / bandwidth?
I guess my point is that "abuse" in this sense is pretty subjective.
Not really, those things are easy to contrast. Network providers have always been comfortable blackholing DoS routes, and it’s never been controversial. That’s clearly distinct from those wanting to double-dip on transport revenues for routine traffic.
The difference is in whether both endpoints want the traffic, not whether (or on what basis) the enabling infrastructure wants to bear it.
> There’s no moral quandary in closing the door to abusers
Doesn't necessarily apply to this conversation, but the moral mistake that people (and societies) frequently make is underestimating the nuance that should be exercised when identifying others as abusers.
Yep, these were the UAs we also saw (amongst others). And also in our case those bots were executing the JS, even hitting our Google Analytics.
For some reason GA reported this traffic as coming from Peru and the Philippines, while an IP lookup showed it belonged to FB, registered in the US or Ireland.
Legal document is a tricky phrase to use. "No trespassing" signs are usually considered sufficient to justify prosecution for trespassing. If the sign is conspicuously placed, it does not usually matter if you actually see the sign or not.
I am not as familiar with law around accessing computer systems, but I imagine that given some of the draconian enforcement we've seen in the past that a robots.txt should be sufficient to support some legal action against someone who disregards it.
Sure, but federal law does prevent you from repeatedly placing that request into my mailbox. I don't even need a sign for that.
Making an http request does not fit cleanly as an analogue to yelling from the street, nor does it fit as an analogue to throwing a written request on a brick through a window. It is something different that must be understood on its own terms.
Similar experience, closed ad account because of “suspicious activity”, at least ten support tickets (half closed automatically), four lame apologies (our system says no, sry) and then finally, “there was an error in our system, you’re good to go”
I lost it because there was some sort of UI bug in the ad preview page that led the page to reload a bunch of times really quick, and boom, I'm apparently a criminal. You're lucky though, I never got it back and I had to change my business credit card because they somehow kept charging me after suspension.
I have an option to email them, live chat, or have our account manager contact us. I guess if you spend enough on ads per month you are entitled to more support...?
They do; I have used it a couple of times. Go to facebook.com/business/help and click the "Get Started" button at the bottom for "find answers or contact support". Follow through and you will see a button called "chat with a representative".
Twitter does the same thing. It sends a bunch of spoofed visitors from Korea and Germany and the US. The bots are spoofed to make it harder to filter them.
Possibly trying to avoid people sending them a different version of the page than users would see (of course they could change the page after the initial caching of a preview, but Twitter might refresh/check them later).
Also, you often need an impressive amount of the stuff that's in a normal UA string for random sites not to break or send you the "unsupported browser, please use Netscape 4 or newer!!!" page, although you can normally fit an identifier of what you really are at the end. (As an example, here's an iOS Safari user agent: "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1" - it's not Mozilla, it's not Gecko, but it has those keywords and patterns because sites expect to see them.)
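For illustration, here's a hedged Python sketch of that convention: keep the browser-ish "compatible" prefix that legacy sites expect, with an honest identifier and contact URL at the end. The bot name and URLs are made up.

```python
import requests

# Hypothetical bot UA following the common convention: a browser-ish
# "Mozilla/5.0 (compatible; ...)" prefix, plus an honest identifier
# and a contact URL so site operators can see who is really calling.
UA = "Mozilla/5.0 (compatible; ExampleLinkBot/1.0; +https://example.com/bot-info)"

resp = requests.get("https://example.com/",
                    headers={"User-Agent": UA},
                    timeout=10)
print(resp.status_code, resp.headers.get("Content-Type"))
```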
Yeah, I once tried to tell my browser to send... I forget; either no UA, or a blank UA string, or nonsense or just "Firefox" or something. I figured, "hey, some sites might break, but it can't be that important!" It broke everything. IIRC, the breaking point was that my own server refused to talk to me. Now, I still think this is insane, but apparently this really is how it is right now.
It should make you realize just how much abuse there is on the internet that it's worth it for sites to just filter traffic with no UA.
Usually people get stuck on the fact that we can't have nice things, so X sucks for not letting us have nice things, yet I seem to never see people acknowledge why we can't have nice things.
Then I'd see a lot more "ugh, bad actors suck!" and less "ugh, websites are just trying to make life miserable for me >:("
Is there an official error message for that? Because filtering no UA would trip me up every time I use wget. Most of the time, if I'm casually using wget for something, I don't bother with a UA. If sites started rejecting that, I'd like to get a clear error message, so I would not go crazy trying to figure out what the problem was. If I got a clear message "send a UA" then I would probably start wrapping my wget requests in a short bash script with some extra stuff thrown in to keep everyone happy. But I'd have to know what it is that is needed to keep everyone happy.
I don't remember, but what really got me was that I wasn't running anything that I expected to do fancy filtering; I think this was just Apache httpd running on CentOS. But there was no web application firewall, no load balancers, pretty sure fail2ban was only set up for sshd. It at least appeared that just the apache stock config was in play.
When you know your target runs Internet Explorer, you serve phishing pages to IE users and some boring content to other users, at the server level. We've had this keep our user tests out of Google Safe Browsing and so on. I'm sure similar tricks end up applied to Facebook's UA and "legitimate" marketing sites.
Hey! Facebook engineer here. If you have it, can you send me the User-Agent for these requests? That would definitely help speed up narrowing down what's happening here. If you can provide me the hostname being requested in the Host header, that would be great too.
I just sent you an e-mail, you can also reply to that instead if you prefer not to share those details here. :-)
I'm not sure I'd publicly post my email like that, if I worked at FB. But congratulations on your promotion to "official technical contact for all facebook issues forever".
I don't think I used my email for anything important during my time at FB. If it gets out of hand he could just make a request to have a new primary email made and use the above one for "spam".
Feels odd to read that Facebook uses Office 365/Exchange for email. They haven't built their fsuite yet? I thought they would simply promote Facebook Messenger internally. I'm only half joking.
Most communication is via Workplace (group posts and chat). Emails aren't very common any more - mainly for communication with external people and alerting.
But at least for the email/calendar backend, it's Exchange.
The internal replacement clients for calendar and other things are killer... I have yet to find replacements.
For the most part, though, they use Facebook internally for messaging and regular communication (technically it's now Workplace, but before it was just Facebook).
I don't want to share my website for personal reasons, but here is some data from cloudflare dashboard (a request made on 11 Jun, 2020 21:30:55 from Ireland, I have 3 requests in the same second from 2 different IPs)
user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
ip 1: 2a03:2880:22ff:3::face:b00c (1 request)
ip 2: 2a03:2880:22ff:b::face:b00c (2 requests)
ASN: AS32934 FACEBOOK
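For what it's worth, you can sanity-check addresses like these against the Facebook allocation mentioned further down the thread (2a03:2880::/29) with Python's standard ipaddress module; the prefix itself is something you'd want to re-verify against whois.

```python
import ipaddress

# Assumption: 2a03:2880::/29 is the Facebook IPv6 allocation, as noted
# elsewhere in this thread -- verify against whois/RIPE before relying on it.
fb_block = ipaddress.ip_network("2a03:2880::/29")

for ip in ("2a03:2880:22ff:3::face:b00c", "2a03:2880:22ff:b::face:b00c"):
    print(ip, ipaddress.ip_address(ip) in fb_block)
# Both print True, consistent with the ASN shown in the dashboard (AS32934).
```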
I can, by the way, confirm this issue. I work at a large newspaper in Norway and around a year ago we saw the same thing. Thousands of requests per second until we blocked it. And after we blocked it, traffic to our Facebook page also plummeted. I assume Facebook considered our website down and thus wouldn't give users content from our Facebook page either, as that would serve them content that would give a bad user experience. The Facebook traffic did not normalize until the attack stopped AND we told Facebook to reindex all our content.
If you want more info, send me an email and I'll dig out some logs etc. thu at db.no
"My webserver is getting hit with bursts of hundreds of requests from Facebook's IP ranges. Google Analytics also reports these hits and shows them as coming from (mostly) Philippines and Peru, however, IP lookup shows that these IPs belong to Facebook (TFBNET3). The number of these hits during a burst typically exceeds my normal traffic by 200%, putting a lot of stress at our infrastructure, putting our business at risk.
This started happening after the Facebook Support team resolved a problem I reported earlier regarding connecting my Facebook Pixel as a data source to my Catalog. It seems Facebook is sending a bot to fetch information from the page, but it does so very aggressively and apparently calls other trackers on the page (such as Google Analytics)"
"e.g. IP addresses 173.252.87.* performed 15,211 hits between Aug 14 12:00 and 12:59, followed by 13,946 hits from 31.13.115.*"
"What is also interesting is that the user agents are very diverse. I would expect a Facebook crawler to identify itself with a unique User-Agent header (as suggested by the documentation page mentioned earlier), but instead I see User-Agent strings that belong to many different browsers. E.g. this file contains 53,240 hits from Facebook's IP addresses with User-Agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"
There are a few Facebook useragents in there, but far fewer than browser useragents:
7,310 hits: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
2,869 hits: facebookexternalhit/1.1
1,439 hits: facebookcatalog/1.0
120 hits: facebookexternalua
Surprisingly, there is even a useragent string that mentions Bing:
6,280 hits: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
These IPs don't only fetch the HTML page, but load all of the page's resources (images, css, ..) including all third-party trackers (such as Google Analytics). Not only does this put unnecessary stress on our infrastructure, it drives up the usage costs of 3rd party tracking services and renders some of our reports unreliable."
final response from FB:
"Thanks for your patience while our team looked into this. They've added measures to reduce the amount of crawler calls made. Further optimizations are being worked on as well, but for now, this issue should be resolved." <- NOT.
I don't really understand what the issue is. On my welcome page (while all other URLs are impossible to guess) I give the browser something that requires a few seconds of CPU at 100% to crunch. And I track some user actions in between, visits to tarpitted URLs, etc. In the last few years no bot has come through. Why bother with robots.txt, just give them something to break their teeth on...
(I would give you the URL, but I just don't want it to be visited.)
The far-fetched "solution" along with the "I don't understand everyone doesn't do <thing that is easy to understand why everyone wouldn't do it>" make it hard for me to believe you're actually doing this. Sounds more like a shower thought, maybe fun weekend project, than something in production.
The network address range falls under Facebook's ownership, so I don't think it's someone spoofing. I do think it's very possible someone found a way to trigger crawl requests in large quantity. Alternatively, I would not be surprised it's just a bug on facebook's end.
They've done this before to me, too. First I tried `iptables -j DROP`, which made the machine somewhat usable, but didn't help with the traffic. After trying a few things, I tried `-j TARPIT`, and that appeared to make them back off.
Of course, sample size of 1, etc. It could have been coincidental.
Tarpits are an underappreciated solution to a pool of bad actors.
You can add artificial wait times to responses, or you can just route all of the 'bad' traffic to one machine, which becomes oversubscribed (be sure to segregate your stats!). All bad actors fighting over the same scraps creates proportional backpressure. Just adding 2 second delays to each request won't necessarily achieve that if multiple user agents are hitting you at once.
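A minimal sketch of the "artificial wait" variant, written as a WSGI middleware. The prefix list is a placeholder (a real setup would key off ASN or netblock), and, per the caveat above, a fixed per-request delay alone doesn't create the shared backpressure a dedicated tarpit box does.

```python
import time

# Assumption: these are the netblocks you've decided to slow down.
SLOW_PREFIXES = ("2a03:2880:", "173.252.")

class TarpitMiddleware:
    """Delay responses to requests coming from suspected crawler ranges."""
    def __init__(self, app, delay=2.0):
        self.app, self.delay = app, delay

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        if ip.startswith(SLOW_PREFIXES):
            time.sleep(self.delay)  # hold the connection open before answering
        return self.app(environ, start_response)
```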
I never looked into the TARPIT option in iptables before reading your comment. That seems really useful. I've been dealing with on and off bursts of traffic from a single AWS region for the last month. They usually keep going for about 90 minutes every day, regardless of how many IPs I block, and consume every available resource with about 250 requests per second (not a big server and I'm still waiting for approval to just outright block the AWS region). I'm going to try a tarpit next time rather than a DROP and see if it makes a difference.
Most spiders limit the number of requests per domain, so if it's stupidity and not malice, you probably don't have a runaway situation.
... unless you're hosting a lot of websites for people in a particular industry. In which case the bot will just start making requests to three other websites you also are responsible for.
Then if you use a tarpit machine instead of routing tricks, the resource pool is bounded by the capacity of that single machine. If you have 20 other machines that's just the Bad Bot Tax and you should pay it with a clean conscience and go solve problems your human customers actually care about.
This was happening to us > 5 years ago. The FB crawlers were taking out our image serving system, as we used the og:image thing. What we did was route FB crawler traffic to a separate Auto Scaling Group, to keep our users happy while also getting the nice preview image on FB when our content was shared. I can't overstate the volume of the FB requests; I can't remember the exact numbers now, but it was insane.
I once made a PHP page which fetches a random wikipedia article, but changes the title to "What the fuck is [TOPIC]?". It was incredibly funny, until I started getting angry emails from people, mostly threatening a form of libel lawsuit.
Turns out, since it was a front for all of wikipedia[1], Google was aggressively indexing it, but the results rarely made it to the first search page. And since this isn't exactly an important site, old results would stick around.
Hence a pattern:
1. Some Rando creates a page about themselves
2. Wikipedia editors, being holy and good, extinguish that nonsense
1.5. GOOGLE INDEXES IT
3. Rando, by nature of being a rando, googles themselves, doesn't find their wikipedia page anymore (that's gone), but does find a link to my site in the first page of google results with their name.
I understand why people got angry, though. The title was roughly "What is this '<article name>' shit?"[1], with "shit" in this Hebrew context also able to be interpreted as calling the subject "shit".
Saying "מה זה לעזאזל X?"[2] is closer to "what the fuck is X?" (except it's more "what the hell" than "what the fuck").
Doesn't ignoring headers and taking someone's site down fall afoul of the CFAA? Especially given how comments here are showing that this is a recurring issue? Depending on which side of the fence you're on, there could be standing to go after FB for this, either for money or for their lawyers to help set a precedent limiting the CFAA further.
I'm going to get a ton of hate for this, but he was doing it with the intent to redistribute the content for free. He didn't own the content. There's a big difference. That being said it's very sad what came about of that. I really don't think the FBI needed to be involved.
Really? I felt that what happened to Aaron was a tragic injustice on many levels (from the fact that he was charged at all to the number of charges they threw at him), but for some reason I had always thought that it was fairly well-known that that was his intent.
I have no idea where I got that impression from, but I do recall reading several articles about him, as well as his blog, around that time. It's possible I just internalized others' assumptions about his behavior, but it certainly seemed like an in-character thing for him to do.
IIRC, he had published a manifesto about how he believed that information that was paid for with public funds should be free for the public to view. However, I don't believe that he ever mentioned why he was making get requests to JSTOR's servers and I know that he never uploaded any of those JSTOR documents to the internet for public download.
Yeah, but any decent criminal lawyer could tear "he once wrote a manifesto" to shreds. His political beliefs would only be pertinent if he wrote the manifesto somehow in connection with what he was doing; otherwise, it would not be sufficient to show intent to distribute.
Yeah and Facebook did this crawlings with intent to commit CFAA - computer fraud and abuse, now someone call the feds quickly before the perpetrators are zucked into hiding
Nah, it only matters whether you are a large corporation or not. If you are a large corporation, you will ruin your opponent financially with legal fees (whether they are guilty or not). If you are a small timer going up against a large corporation, they will delay the court case until you cannot afford the legal fees.
No. The owner needs to send Facebook a cease and desist and block their IPs in order to revoke authorization under the CFAA, and then Facebook would need to continue accessing his website despite explicitly being told not to. That's the precedent set by Craigslist v. 3Taps.
Whether or not the above even holds today (hiQ Labs v. LinkedIn argues it does not) remains to be decided in a current appeal to the Supreme Court.
Yes, this was my first thought too - Facebook is essentially denial of servicing people's sites and it sounds like they're aware of it. Move fast and break your things. If this were your average Joe I'm fairly certain something like the DMCA could be used.
The DMCA has a couple major parts. The part most people are familiar with deals with safe harbor provisions and "DMCA takedowns". The relevant part for this discussion makes it illegal to circumvent digital access-control technology. So you could argue that if you have reasonable measures in place to stop automated access of your content such as robots.txt then Facebook continuing to send automated requests would be a circumvention of your access controls.
Shouldn't any public facing website have a rate limit? If someone attempts to circumvent simple rate limits (randomizing the source IP or header content), then that could demonstrate intent to cause damage, and you'd have a better case. But if you don't set a limit, how can you be mad that someone exceeded it?
(I know they're ignoring robots.txt, but robots.txt is not a law. And, it doesn't apply to user-generated requests, for things like "link unfurling" in things like Slack. I am guessing the crawler ignores robots.txt because it is doing the request on behalf of a human user, not to create some sort of index. Google is attempting to standardize this widely-understood convention: https://developers.google.com/search/reference/robots_txt)
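For reference, this is roughly how a well-behaved crawler consults robots.txt, using Python's standard urllib.robotparser; the bot name and URLs here are placeholders, and as noted above it's a convention, not a law.

```python
from urllib import robotparser

# Sketch: check robots.txt before fetching. Preview fetchers acting on
# behalf of a human user often skip this step deliberately.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

print(rp.can_fetch("ExampleLinkBot", "https://example.com/some/page"))
```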
Plenty of crawlers, including Facebook's, are operating out of huge pools of IPs usually smeared across multiple blocks. If you have an idea on how to ratelimit a crawler doing 2 req/s from 300+ distinct IPs with random spoofed UAs let me know when your startup launches.
There are many features that you can pick up on. TCP fingerprints, header ordering, etc. You build a reputation for each IP address or block, then block the anomalous requests.
If you don't want to do this yourself and you're already using Cloudflare... congratulations, this is exactly why they exist. Write them a check every month for less than one hour of engineering time, and their knowledge is your knowledge. Your startup will launch on time!
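If you do want to roll it yourself, here's a rough sketch of the per-block reputation idea, assuming a simple log with one source IP per line; the /24 and /56 bucket sizes and the threshold are made-up starting points.

```python
import ipaddress
from collections import Counter

def block_key(ip_str):
    """Bucket an address into its /24 (IPv4) or /56 (IPv6) block."""
    ip = ipaddress.ip_address(ip_str)
    prefix = 24 if ip.version == 4 else 56
    return str(ipaddress.ip_network(f"{ip_str}/{prefix}", strict=False))

hits = Counter()
with open("access.log") as f:          # assumption: one source IP per line
    for line in f:
        hits[block_key(line.split()[0])] += 1

for block, count in hits.most_common(10):
    if count > 1000:                   # assumed threshold for "anomalous"
        print("candidate for blocking:", block, count)
```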
I noticed that, but I get the feeling that sending that status code didn't reduce the amount of work the program did, so it's kind of pointless. You can say "go away", but at some point, you probably have to make them go away. (All RFC 6585 says about 429 is that it MUST NOT be cached, not that the user agent can't try the request again immediately. It is very poorly specified.)
The author mentioned that they were using Cloudflare, and I see at least two ways to implement rate limiting. One is to have your application recognize the pattern of malicious behavior, and instruct Cloudflare via its API to block the IP address. Another is to use their rate-limiting that detects and blocks this stuff automatically (so they claim).
Like, there are many problems here. One is a rate limit that doesn't reduce load enough to serve legitimate requests. Another is wanting a caching service to cache a response that "MUST NOT" be cached. And the last is expecting people on the Internet to behave. There are always going to be broken clients, and you probably want the infrastructure in place to flat-out stop replying to TCP SYN packets from broken/malicious networks for a period of time. If the SYN packets themselves are overwhelming, welp, that is exactly why Cloudflare exists ;)
The author is right to be mad at Facebook. Their program is clearly broken. But it's up to you to mitigate your own stuff. This time it's a big company with lots of money that you can sue. The next time it will be some script kiddie in some faraway country with no laws. Your site will be broken for your users in either case, and so it falls on you to mitigate it.
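On the Cloudflare suggestion above, here's a sketch of the first option: have the application push a block rule via Cloudflare's IP Access Rules API when it spots the pattern. The zone ID and token are placeholders, and the endpoint should be checked against the current API docs before relying on it.

```python
import requests

CF_API = "https://api.cloudflare.com/client/v4"
ZONE_ID = "your-zone-id"        # placeholder
TOKEN = "your-api-token"        # placeholder

def block_ip(ip, note="automated block: abusive crawler burst"):
    """Create a zone-level IP Access Rule that blocks the given address."""
    resp = requests.post(
        f"{CF_API}/zones/{ZONE_ID}/firewall/access_rules/rules",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"mode": "block",
              "configuration": {"target": "ip", "value": ip},
              "notes": note},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```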
Rate limiting websites is a lot of work. You have to figure out what kind of limits you should set, you have to figure out if that should be on an IP or a /24 or an ASN, if it's an ASN, you have to figure out how to turn an IP into an ASN.
If your website is pretty fast, you probably won't necessarily care about a lot of hits from one source, but if it's supposed to be a small website on inexpensive hosting, and all those hits add up to real transfer numbers, maybe it's an issue.
I believe, but have not tried, that you can craft zipped files that are small but when expanding produce multi-gigabyte files. If you can sufficiently target the bad guys, you can probably grind them to a halt by serving only them these files. Some care needs to be taken so you don’t harm the innocent and don’t run afoul of any laws that may apply. Good luck.
Do you think it would be possible to create something similar for gzip? If you then serve with Content-Type: text/html, and Content-Encoding: gzip, the client would accept the payload. And when it tries to expand it, it would get expanded to a large file, eating up their resources.
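It should work in principle: gzip of highly repetitive data compresses at roughly 1000:1 (see the deflate figure a couple of comments down). A minimal Python sketch of the mechanics, assuming you only ever serve this to the offending netblock and that the client actually decompresses eagerly, which is not a given:

```python
import gzip, io

# Build a gzip body that inflates to ~1 GiB of zero bytes; on the wire it
# comes out to roughly 1 MB because deflate tops out near 1000:1.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
    chunk = b"\0" * (1 << 20)          # 1 MiB of zeros
    for _ in range(1024):              # 1024 MiB uncompressed in total
        gz.write(chunk)

body = buf.getvalue()
print(f"{len(body)} compressed bytes -> 1 GiB decompressed")

# Serve only to the offending netblock with headers along the lines of:
#   Content-Type: text/html
#   Content-Encoding: gzip
# so a client that transparently decompresses does the work on its side.
```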
Here's a brotli file I created that's 81MB compressed and 100TB uncompressed[1] (bomb.br). That's a 1.2M:1 compression ratio (higher than any other brotli ratio I see mentioned online).
There's also a script in that directory that allows you to create files of whatever size you want (hovering around that same compression ratio). You can even use it to embed secret messages in the brotli (compressed or uncompressed). There's also a python script there that will serve it with the right header. Note that for Firefox it needs to be hosted on https, because Firefox only supports brotli over https.
Back when I created it, it would crash the entire browser of ESR Firefox, crash the tab of Chrome, and would lead to a perpetually loading page in regular Firefox.
A single layer of deflate compression has a theoretical expansion limit of 1032:1. ZIP bombs with higher ratios only achieve it by nesting a ZIP file inside of another ZIP file and expecting the decompressor to be recursive.
This means you can serve 1M payload and have it come out to 1G at decompression time. Not a bad compression ratio, but it doesn't seem like enough to break Facebook servers without taking on considerable load of your own.
I believe zip bombs are pretty easily mitigated, with memory limits, cpu time ulimits, etc.
But zip bombs aren't limited to zip files... lots of file formats have some compression in them. You can make malicious PNGs that do the same thing. Probably tonnes of other formats, too.
Good points. According to OP, the goal would be to persuade FB to back off crawling at 300 qps, not to bring down the FB crawl. Having each request expand to 1GB is probably already enough to do that, since a crawler implementing the mitigations you mentioned will likely reduce the crawl rate. Should be easily testable.
FB has 2a00::/12 (and probably other blocks?) and can make something that looks like its company name from [0-9a-f]. Why wouldn't it do something harmless and fun like this? It's not as if it requires special dispensation or is breaking any rules.
A single company does not get a /12 prefix. 2a00::/12 is almost half of the space currently allocated to all of RIPE NCC. Facebook seems to have 2a03:2880::/29 out of that /12, and a /40 through ARIN (2620:0:1c00::/40)
Is the notation the problem here? IPs have 16 bytes now, so the notation is a bit more compact: hex instead of decimal numbers, and :: is a shortcut meaning "replace with as many zeros as necessary to fill the address to 16 bytes"
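Python's ipaddress module will happily show both spellings of the same address, if that helps:

```python
import ipaddress

# The :: shorthand expanded: both forms name the same 128-bit address.
ip = ipaddress.ip_address("2a03:2880:22ff:3::face:b00c")
print(ip.exploded)    # 2a03:2880:22ff:0003:0000:0000:face:b00c
print(ip.compressed)  # 2a03:2880:22ff:3::face:b00c
```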
Reading through the article and comments here I find it odd that a company like FB can't get a crawler to respect robots.txt or 429 status codes.
Even I would stop and think "maybe I should put in some guards against large amounts of traffic" when writing a crawler, and I'm certainly not one of those brilliant minds who manage to pass their interview process.
My office still gets spam faxes to this day. The paper and toner only add up to a few cents a month, so it's not worth doing anything about.
I knew a realtor who had a sheet of black paper with a few choice expletives written on it that they would send back to spammers. There was an art to taping it into a loop so it would continuously feed. This was a few decades ago, when a spam fax could cost more than a stamp.
My desk phone at an old job used to get dialed by a fax machine. Not fun picking that up. I redirected it to a virtual fax line and it turns out it was a local clinic faxing medical records. I faxed them back with some message about you have the wrong number but they never stopped.
I considered it. That was after my time in hospital IT. Really I wanted to stop getting my eardrums blown out by a robot. Luckily I happened to be working at a phone company so changing my number just took a couple clicks.
I was using a virtual fax service so a digital equivalent would have been easy. Not sure if I was creative enough to invert the colors before sending my response. Really I just wanted the noise to stop.
It’s because we as an industry decided to give in and submit to the cloud providers’ bullshit model of paying overpriced amounts for bandwidth while good old bare-metal providers still offer unmetered bandwidth for very reasonable prices.
Paying for your bandwidth and metal isn't a problem per-se as long as prices are reasonable (which they are with the old-school bare-metal providers). After all, those services do cost money to provide.
The problem happens when prices are extortionate (or the pricing model is predatory, ie pay per MB transferred instead of a flat rate per 1Gbps link) and relatively minor traffic translates to a major bill.
This particular issue is about 80 requests/second which is a drop in the bucket on a 1Gbps link. It would be basically unnoticeable if it wasn't for the cloud providers nickel & diming their customers.
You're still not getting it. I'm not arguing about whether it should be free or how much it should cost. I'm saying that when you have to pay for someone else's request, it's ripe for abuse. Instead, a system where the requester pays would be similar to how postage works in the US. Although I think it's highly unlikely that would make sense in a digital world.
This is a common pattern in blocking vs. non-blocking I/O and in sync vs. async tasks.
The typical analogy I hear for DDoS attacks is if you own a business and a giant group of protesters drives traffic to your store, but no one is interested in buying. They create a backlog to enter the store, they crowd the aisles, they pick up your inventory, and the lines and crowds scare off your legitimate visitors/customers.
This reminds me of a case years ago when I had a small-ish video file on my server that would cause Google Bot to just keep fetching it in a loop. It wasted a huge amount of bandwidth for me since I only noticed it in a report at the end of the month. I had no way of figuring out what was wrong with it, so I just deleted the file and the bot went away.
More precisely, spoofing a TCP handshake is a problem that we know how to prevent. Odds are pretty good that your kernel has these protections enabled by default, but it's not guaranteed and you should check as a matter of due diligence.
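On Linux the relevant protection is SYN cookies; a quick way to check, equivalent to `sysctl net.ipv4.tcp_syncookies`:

```python
# SYN cookies are what make a spoofed TCP handshake ineffective.
# 1 means "enabled when the SYN backlog overflows", 2 means "always on".
with open("/proc/sys/net/ipv4/tcp_syncookies") as f:
    print("tcp_syncookies =", f.read().strip())
```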
Why is it so hard for people here to believe that Facebook could have a crawler misbehaving? It is all over this thread, even though multiple people are saying they saw the same thing and confirmed the IP belonged to FB.
If you can verify it's actually coming from Facebook's IP address (as many other comments here suggest), you should absolutely get in touch with them, though I'm not sure how. Perhaps their security team. That's a very serious bug that they'd be glad to catch -- if it's affecting you it's surely affecting others.
Otherwise it's a bot/malware/etc. spoofing Facebook and gone wrong, which sucks. And yeah just block it by UA, and hopefully eventually it goes away.
This is one of the better HN threads in a while. Starting from an easily described problem, it goes into a number of interesting directions including how FB seems to DDOS certain sites for reasons unknown, defenses against such attacks and DDOS in general, GPL licensing, how to collapse HN conversations, etc.
Not a profound comment. But it was really fun to follow the path down the various rabbit holes. This is why I like HN.
Interesting. According to the Facebook Crawler link provided by the OP [1], it makes a range request of compressed resources. I wonder if the response from the OP’s PHP/SQLite app isn’t triggering a crawler bug.
Maybe Cloudflare can step up and act as an intermediary to isolate the problem. Isolating DoS attacks is one of their comparative advantages.
Make sure you file a bug - there are a myriad of sources internally, but we can often hunt it down easily enough (assuming it gets triaged to eng). Important info is the host of the urls being crawled, and the User Agent attached to the requests (headers are also good too). A timeseries graph of the hits (with date & timezone specified) can also help.
Why is this the owner's problem? Someone at Facebook should be filing the bug, or better yet instrumenting their systems so that incidents like this issue a wake-up-an-engineer alert.
Fuck this culture of "it's up to the victim of our fuckup to file a bug report with us".
Not just “file a bug report” but also compile a time series graph (lol) and then pray that Facebook triages it correctly, which they have no incentive to do.
We had that with Google in the past. We put a KML database on a web server and made the Google Maps API read it. Google went to get it without any throttling (my guess is that they primed their edge servers straight from us); the server went down so fast we thought there had been a hardware failure. We ended up putting a CDN in place just for Google's servers.
Modern web crawlers usually have mechanisms against rabbit holes, e.g. for Apache Nutch you can configure it to only follow links to a certain depth from a root, to only follow a redirect a few times, and so on. It's always a trade-off though. If you cut off too aggressively you don't get everything; if you cut off too late you waste resources.
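A generic sketch of that depth-cutoff safeguard in Python (this is not Nutch configuration; BeautifulSoup and the seed URL are assumptions for illustration):

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup   # assumption: BeautifulSoup available for link extraction

MAX_DEPTH = 3                   # never wander more than 3 hops from the seed

session = requests.Session()
session.max_redirects = 3       # mirrors "only follow a redirect a few times"

def crawl(seed):
    seen, queue = {seed}, deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        resp = session.get(url, timeout=10)
        if depth >= MAX_DEPTH:
            continue            # fetched, but don't expand links past the cutoff
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            nxt = urljoin(url, a["href"])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))

crawl("https://example.com/")
```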
What I've heard happen is someone puts a governor on the system, and much later your new big customer starts having errors because your aggregate traffic exceeds capacity planning you did 2 years ago and forgot about.
Are you sure it's a crawler and not a proxy of some sort? Eg. one of your links is on something high traffic in facebook, and all requests are human, running through fb machines.
Facebook caches URLs pretty aggressively (often you need to explicitly purge the cache to pick up updates) so I don't quite understand exactly what happened here. The article is very light on details. Do you have 7 million unique URLs that have been shared on Facebook? Or is it the same URL being scraped over and over?
Why would a crawler want to crawl the same page every second (let alone 71 times a second). Even if I were Google, wouldn't crawling any page every few mins (or say every minute) be way more than enough for freshness?
Is it worth duplicating the fb specific parts of your website at another web address as a test site, so that If it was similarly affected, you could provide the address info to help troubleshoot?
Doubt it. If you're a bad enough dude to get control over Facebook's internal infrastructure, I doubt you'd blow that access by using it to spam random sites.
We had a similar experience with Yandex and had to block it. The crawler started crawling a larger e-shop by trying out every single possible filter combination. In other words, we got a DDoS attack from Yandex.
I surfed through a Google network for a bit a few years ago, and it does reveal things. The most common was Drupal and WordPress sites with some malicious code running that would link to online pharmacies on every page, but only to Google user agents / networks, I assume to boost their ranking. I let a bunch of webmasters know, most of them didn't believe me because they didn't see it themselves.
You chose to mention Safari, but link to an article about Chrome, Firefox, and Safari? Looks like all major browsers (probably Edge too) have this built in.
Establish contractual damages in a ToS for the site. Prove violation and offender. Take to court and collect damages.
Converting the effort into cash is tough, but the strategy exists.
Project HoneyPot is an API which allows any website to do this for honeypot email addresses which are injected into the website, along with a ToS which says:
> By continuing to access the Website, You acknowledge and agree that each email address the Website contains has a value not less than US $50 derived from their relative secrecy.[1]
I also found the $1 billion lawsuit (against a bunch, not just one spammer, I believe), and could also not find any sort of resolution - not in legal docs, news pages, or project honeypot itself.
It looks like your site is using a theme based on my website (https://ruudvanasseldonk.com/, source at https://github.com/ruuda/blog). That is fine — it is open source after all, licensed under the GPLv3. But I can’t find the source code for your site, and I can’t find any prominent notices saying that you modified my source. Could you please add those?
Sounds like yes, they are flattered for exactly that. And it's totally ok.
There exists small joys in life, like being flattered for something that other people might not think are much of a big deal.
I guess that's one reason why something may be called flattering in the first place, if they didn't see it as being all that much in their own eyes, but somebody else appreciates it.
The distinction can be difficult to make, but I think it is relatively clear here.
I'm "reading between the lines" here so could be wrong, but "looks like your site is using a theme based on" implies to me that there is enough code (markup, styles, perhaps script) similarly that is enough to not be pure coincidence. That suggests using some of the code not just being inspired to produce a clean-room design, so the GPL is relevant.
My understanding is that unless he's distributing it (not just serving it from a website) he doesn't need to release his changes. That's what the AGPL is for.
It's not the copyright on the server code that is at issue, but the copyright on the HTML and CSS files (and portions thereof) that get distributed by the server.
You see the contradiction, right? If it's about the HTML and the CSS the user has that code directly by visiting the site. No further action would be needed.
If it's about something that creates the HTML and CSS, then OP has no requirement under the GPL just from people visiting the site and accessing the output CSS/HTML, because that's not distributing the code as defined by the GPL. The original author should have used the AGPL for that case - and if it's really just about HTML/CSS, a different license altogether.
> If it's about the HTML and the CSS the user has that code directly by visiting the site. No further action would be needed.
Wrong. Transmitting modified GPL works requires more than just distributing the source:
a) The work must carry prominent notices stating that you modified it, and giving a relevant date.
b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to “keep intact all notices”.
c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so.
First, that is for Conveying Modified Source Versions. See the definition of conveying:
> To “convey” a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying.
Generally the common position here is to use the AGPL if you want to cover regular network access. Probably all a bit murky because the GPL does not really fit well to things like this. But mostly it does not apply here, pretty specifically by definition.
Maybe you'd have a point regardless if the source were directly the HTML and CSS. And it seems like I was wrong about the modification notice (but I wasn't thinking about modifications in particular). But it's not HTML and CSS directly. Having looked at the source in question now, the transmitted HTML and CSS is evidently not the source code, as both are produced by template files.
The actual GPL uses the word "convey", not distribute or serve. [1](section 5). Merriam-Webster defined "convey" as "to transfer or deliver". Pretty hard to argue that a web server does not transfer or deliver HTML + CSS.
Next time, have some courtesy for the author and the rest of us by requesting via personal exchange over email instead of hijacking the thread and distracting from the conversation.
Whoa there. The GPLv3 only requires the GPLv3 declaration to be visible; to be compliant, source code can be provided on demand. There's no need to have a source code link in the object, as long as a contact is available and the code is provided at a reasonable cost.
They look similar at a glance (border-top + Calluna font), so he might have taken inspiration from yours, but doesn't seem to have used any of your assets - the styles are clearly different and based on the WP 'BlankSlate' theme.
(my personal blog had a top border like that a decade ago, when styles on the body were a novelty :))
I checked the source and it does indeed mention https://wordpress.org/themes/blankslate/, but judging from the screenshot there, that theme is just really a blank theme with no style at all, and it was used to include a different stylesheet.
So, just to be clear, you are perfectly fine with someone taking something someone else created and violating the terms upon which they were given that thing?
e.g. If I took code you wrote and lets say released under an MIT license and claimed I wrote it and didn't give you any credit, and in fact released it under another license entirely, you'd be fine with that?
It's perfectly valid to criticize the original license choice.
GPLv3 is a very restrictive license, especially for what is essentially a micro blog (though I dislike the license for most open source software anyway).
Add on the original author going after a bit of CSS, not even the main effort of the project in question, and you've got my "petty" comment.
There is an artist who takes images from magazines, repurposes them for his own art, and sells them for hundreds of thousands of dollars, then gets sued by the magazines & photographers & artists, and guess what, HE WINS AGAINST THOSE LAWSUITS. Copyright is BS, especially concerning HTML & CSS code. What a joke.
Not the parent but your question and the implicit accusation is way too overblown. The design here is so generic that it can easily be used by tons of site out there. It's just a few lines of CSS here and there and if I have a design like that and come across something similar I would just chalk it up to someone with similar taste. Going out of your way to demand attribution for it is the very definition of petty.
> e.g. If I took code you wrote and lets say released under an MIT license and claimed I wrote it and didn't give you any credit, and in fact released it under another license entirely, you'd be fine with that?
If I released it on Github, under any license whatever? I’d more or less be expecting that.
If it was about the 4hr of work that went into my blog theme, I wouldn’t be bothered at all.
But then, I wouldn’t release anything like that under the GPL.
The author probably cares about either copy-left or the right to tinker. Some people believe it's important and others don't.
Basically any derivative product out of GPL3 code needs to either be open-source/copy-left, or if it uses the code as a library needs to let end users substitute that library for their own version.
Worse than unimportant, I find copyleft harmful to the developer community as a whole. Software is hardly "free" when you have to bend to the author's demands to license your entire rest of your project under their ideology if you want to use it, and I think we should stop calling it so. Maybe "Conditionally Free Software" instead. You're hardly helping the world releasing libraries in terms that nobody but hobbyists (who are going to rip your code out and replace it if they start making a proprietary product, which is not evil) will find acceptable. It works for the Linux kernel and self-contained applications but that's it.
Copyleft is about maximizing the users' rights, not the developers' rights. Companies can't take Linux, put it on a router for sale, and then say that their customers/users aren't allowed to know what's going on on the box in terms of backdoors and spying. The users have a right to look at the source code if they wish.
It may focus on user's rights, but it still requires technical expertise to exercise half of the 4 freedoms: #1 (inspection & modification) and #3 (distributing your modifications). Merely knowing to ask "can I see the source code" I would put into the "technical user" realm. Overwhelmingly, most people don't know or don't care.
The other two freedoms #0 (freedom to run) and #2 (freedom to share) are readily obvious to non-technical users. "Double click to run" and "drag and drop <on external drive> to copy". Unfortunately, they are also often permitted by non-free/libre software. So non-technical users that can readily see they can do #0 and #2 and generally have no litmus test to further determine whether the software is "free/libre".
This is a case where the ideology's practical concerns hamper its purity. I critique despite generally liking the FLOSS ideal, but it's important to know its flaws.
The users don't have to be technical to benefit from those freedoms though - there's a level of indirection involved. I might not have personally scrutinized every line of the Linux kernel, but knowing that there are tons of people in the world with the ability and motivation to do that inspires confidence.
I am not arguing that one has to be technical in order to benefit from those freedoms. I like FLOSS and agree users in general benefit.
I am saying that your example, for instance, still falls under "how does a non-technical user simply verify the software they just downloaded respects their freedoms?", which is a very real educational and cultural problem. For example, see the massive money and numerous gun ranges, gun stores, gun clubs, and other gun-associated organizations in the US that work to educate the "unskilled" general public on how to be aware of, recognize, and exercise their rights and freedoms under the 2nd Amendment while respecting local laws. Folks generally are 1) aware they have the right and 2) have a low-friction, no-special-technical-skill path to exercising that right. The FLOSS movement is nowhere near that level of educating non-technical users and making them aware of libre software, their digital rights, and how to act upon and exercise them. Folks generally are 1) unaware of libre software and 2) don't have a low-friction, no-special-technical-skill path to exercising that right.
They can take Linux and put it behind a locked bootloader that makes it impossible to replace with your own kernel. Sure you can have the modified sources but you can't do anything with it. Hence GPL3.
Users don't care about any of that, but developers are certainly hindered by restrictive licenses, which in turn hurts users.
I never found a right to see source code compelling as a real right. To read the assembly and modify something they bought, sure, but not an entitlement to the source.
Nobody is forcing you to use GPL code. It is stinking of entitlement to demand that you get to use other people's code regardless of what they think about it.
I don't feel entitled to it, I just consider the prevalence of copyleft bad for the developer community. License however you want, doesn't mean I'm entitled for criticizing it.
What might be stinking of entitlement is the idea users have an inalienable right to source code.
Users and developers are not different beasts. Users with source code are more likely to become developers. Developers are more likely to become users if they can tinker.
You sound very bitter that a developer who lets others use their code gets to pick the license of their choice. Should everything be locked down like Microsoft Windows code or an Apple phone? If what you want is for everything to be completely free from licenses instead, just code your own version and release it as freeware. If you only complain and don't, then you are just being hypocritical.
>You're hardly helping the world releasing libraries in terms that nobody but hobbyists (who are going to rip your code out and replace it if they start making a proprietary product, which is not evil) will find acceptable
I'm not bitter, just disappointed in all the wasted developer time that happens because people get caught up in these copyleft ideas. Of course you can pick whatever license you want but that doesn't mean it can't be criticized.
> If you only complain and don't then you are just being hypocritical.
You don't need to be an architect to complain about crumbling bridges, but indeed I have released software under more free licenses.
Only if you assume copyleft code would otherwise be available for developers. The other alternative is people who don't want others profiting from their free code just don't release their code.
Well, it's easy to be disappointed in all the developer time wasted because of copyleft. It's transparent.
It's harder to even notice all the developer time wasted because of closed source code, because it's obscured how you as a developer could have benefited from that code.
I made my site open source so others who like it can take a look at how it works, or give it their own twist. Visitors of my site can see that they have that freedom, but visitors of an adaptation might not know if it’s not stated anywhere.
Well, I respect GP's question, if nothing else then on principle. People should get their shit together and start respecting licenses. The state today is horrendous and the culture is evolving towards less and less respect for licenses. If people don't follow them personally they won't care to follow them professionally either.
Design layman here, switching between the two sites the top bar looks similar (it's just a solid line) except a different color and slightly different height. I don't see other similarities or assets coming from your site? Can you explain this request; I'm not sure two sites both using a blank horizontal rule across the top is something I would personally call out as similar.
My point is that you don't want to do this inside the load balancer there. I don't recall any traffic filtering abilities that would suffice, but I'm not fully up-to-date with the configurability either. If the load balancer supports it, a short search in the net or docs should surface an easily-applicable guide, and if not, I'd probably put that blocking closer to my app server.
And the reason against AWS for this would be both the general cost (AWS is not cost-efficient in many cases, unless you have complicated infrastructure that takes a lot of management, where the Infrastructure-as-code approach can give you a sizable benefit), the bandwidth cost in particular, and the lack of configurability of their services to e.g. apply suitable tar-pitting against such crawlers.
Then the action needs to be harmless, because GET is defined as not merely idempotent but also safe.
Didn't we already learn this lesson after all the unsafe-GET problems unveiled when prefetching browser accelerators came on the scene in, IIRC, the late 1990s?
> Then the action needs to be harmless to repeat, because GET is defined as idempotent.
No, this is a terrible response. The action needs to be harmless to execute every time, not just every time after the first time.
HTTP DELETE is conceptually idempotent, but you don't want to be deleting stuff with GET requests. That's why the standard provides a DELETE method! The distinction that really matters is safe/unsafe, not idempotent/unique.
(Do you need to use DELETE for deleting stuff? No, POST is fine.)
As an aside, this kind of hyperbole really gets under my skin. It wasn't a terrible response. That statement is already technically correct: GET is idempotent, and the definition of idempotency is that it is harmless to repeat.
Your gripe is that OP didn't mention that GET is not only idempotent but must also be "safe"; i.e. that it should not alter the resource. OP got it 50% correct.
Does that omission make his comment a "terrible response"? No -- just incomplete.
Yes, it was a terrible response. Here are some examples of idempotent requests:
- Change the email address registered to my account from owner@gmail.com to new_owner@136.com .
- Instead of sending my direct deposit to account XXXX XXXX at Bank of America, from now on, send it to account YYYY YYYY at Wells Fargo.
- Delete my account.
- Drop the database.
None of these have any business being available to GET requests. Objecting to a misconfigured endpoint on the grounds that the functionality it implements is not idempotent implies that the lack of idempotence is what was wrong. That's a bad thing to do - anyone who takes your lesson to heart is still going to screw themselves over, because you gave them terrible advice. They may do it more than they otherwise would have, because you gave them advice that directly endorses really bad ideas. Idempotence or the lack thereof is beside the point.
Messing up on endpoint idempotence means you might hurt the feelings of a document. Messing up on endpoint safety means you might lose all your data as soon as anyone else links to your homepage. Or worse.
Most of the time, GET should not affect system state at all (other than caching, etc). Idempotent can still have effects on the first access. Even deletion can be idempotent.
I generally see PUT and PATCH classified as idempotent (for the same input).
Then, like with literally every search engine and website crawler ever, the action will be performed.
Which is why it's bad practice to design your website using GET for actions. That's what POST is for.
I mean, using GET for actions will break so many things -- browser prefetching, link previews, the list is endless. If you use GET for actions, just... yikes.
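Something like this, sketched with Flask (the endpoints are hypothetical): the GET only renders a confirmation, and the state change sits behind POST, where prefetchers and preview bots won't touch it.

```python
from flask import Flask, redirect

app = Flask(__name__)

@app.get("/items/<int:item_id>/delete")
def confirm_delete(item_id):
    # Safe: GET just renders a confirmation form that POSTs back.
    return f"""<form method="post" action="/items/{item_id}/delete">
                 <button>Really delete item {item_id}?</button>
               </form>"""

@app.post("/items/<int:item_id>/delete")
def delete_item(item_id):
    # The unsafe action only ever happens on POST (or DELETE).
    # db.delete(item_id)  # placeholder for the real deletion
    return redirect("/items")
```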
> Going against the HTTP standard isn't a problem. For example, it's a good practice to ignore HEAD requests as opposed to responding appropriately.
That's only against the standard if you advertise HEAD as a supported method on the resource, which converts it from a good idea in some circumstances to a bad one, so if there is a good example to support your claim, that isn't it.
The most classic example is the JWT specification, which says you need to honor the algorithm declared by the token you receive. (JWT includes a "none" algorithm, making token forgery trivial when the parser implements the standard.)
It's known widely enough now that people have chosen to reinterpret the language of the standard in order to claim that their implementations are compliant -- after making the change specifically to bring themselves out of compliance.
(It's possible that it's been so long that the standard itself has been changed to accommodate this. But regardless, the point stands that standards compliance is not a virtue for its own sake. This wasn't a good idea back when everyone agreed that the standard required it, it's not a good idea now, and future bad ideas do not in general become good ideas by virtue of being specified in standards.)
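For instance, with PyJWT the defensive (non-compliant-in-the-old-reading) pattern is to pin the accepted algorithms on the verifying side rather than trust whatever the token's header declares; the key and claims below are placeholders.

```python
import jwt  # PyJWT

SECRET = "example-secret"   # placeholder key
token = jwt.encode({"sub": "user123"}, SECRET, algorithm="HS256")

# The verifier pins the algorithms it will accept instead of honoring the
# token's "alg" header -- which is exactly how the "none" forgery is blocked.
claims = jwt.decode(token, SECRET, algorithms=["HS256"])
print(claims)
```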
I'm not sure what sort of action you mean, but Facebook doesn't fetch the page in real-time using the client's browser. Some sort of job scrapes the page HTML looking for metadata and then uses that metadata to populate the preview. If you want to play around with what it "sees", you can test using the sharing debugger: https://developers.facebook.com/tools/debug/
Whenever I paste a link in any sort of messaging app - iMessages to Slack - it puts a thumbnail and summary, polluting the entire conversation with tons of noise.
Messaging apps have no business in looking up the URL. Just let it pass as a link.
Fuck everything about this and we need to push back on this nonsense.
I disagree completely, I find it extremely useful.
Most links are unreadable (e.g. an ID to a cloud file), and are truncated anyways even if readable.
The preview gives me the title of the webpage, which is infinitely more useful -- especially letting me know whether it's a document I've already seen (and don't need to open) or something new, and if so, what.
I suspect that most apps have a way to turn this off. Slack certainly does. That doesn't stop other users from seeing previews of your links, if it's set up that way for them, but it does prevent you from seeing previews of others' (and your own) links.
A funny story (well, funny because it didn't happen to me) was chronicled on a recent episode of Corey Quinn's "Whiteboard Confessions" -- https://www.lastweekinaws.com/podcast/aws-morning-brief/whit... -- where Slackbot's auto URL unfurling feature "clicked" an SNS alert unsubscribe link when a tech posted a report including that link into a Slack channel. If nobody lost their job over this triggering a SEV-1 alert, I'm sure they had a laugh about it later.
Another issue with auto-unfurling links or generating previews of URLs is the potential for stalking. If I'm hunting my ex and I know their cell phone number AND that they have an iPhone I can simply send a link to a website and let iOS generate the preview... which then pings the URL of the website which I set up and I can now run geo-IP resolution and narrow down where my ex is (and maybe I follow up with a well-crafted PDF exploit when I'm ready to narrow down even more on an area).
While systemvoltage was a bit brusque in how the complaint was worded above, these "friendly and helpful" features of Slack, SMS, and other apps can have devastating unforeseen consequences. I question the value of having to opt out of these features instead of making them opt-in instead.
1) You can't do effective geolocation on cell phone IP address, except perhaps at a country level [1]
2) Your ex would need to open your message to trigger the link preview loading. If you're stalking them, they're probably either ignoring you or blocking you
3) At most, if the ex is connected to a Wi-Fi network like Starbucks and opens the message, you'll be able to get the city they're in, maybe.
But at the end of the day, what you'd get from link previews is no different from embedding a tracker image in an e-mail. And while it requires someone to suspect they're being stalked, blocking would prevent all of this.
As far as I know the iOS implementation mitigates that by generating the preview on the sender side. The receiver side will not make a request to the URL unless they manually click on it.
The next level would be for the server to see who is requesting content. Facebook IP? "Here's some harmless HTML!". Browser IP? "Here's an executable pretending to be a harmless page!"
Also considering the sibling story, at least the server-side preview builder is definitely entirely unauthenticated to every possible service. A preview builder running on the client side might conceivably pick up your authentication to something in various circumstances.
The issue with doing them client side (other than the missed cache opportunity of using a server) is that it means you're having every user in the chat send a request to an arbitrary URL from their device. Also CORS makes this impossible 99% of the time.
The third approach, which Signal uses as a privacy preservation measure, is for the sender to generate the link preview on send. Of course, this then opens up abuse where the sender can arbitrarily control the "link preview" to say whatever they want.
I just thought of a malicious idea. If I were Amazon or some other cloud provider, I could hammer my customers' web sites with tons of requests to make them pay more for resource usage (network bandwidth, s3 calls, misc per/unit usage etc). It would be hard to trace as well. Wonder if people are already doing that today.
The amount of bad will that would generate when, not if, it came to light would dwarf the possible gains. So ethical concerns aside, it would be idiotic.
Sometimes that doesn't stop the pointy haired bosses, because sometimes they are the perfect storm of unethical and moronic.
This is actually a common tactic from malicious actors - pretending to be other bots to get through to websites. If you reverse-lookup the IP address you can see if it is part of the Facebook network.
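A generic reverse-then-forward DNS check looks like the sketch below. Google documents this method for verifying Googlebot; whether Facebook's crawler IPs carry matching PTR records is an assumption you'd need to verify, and the ASN/netblock check mentioned elsewhere in the thread is the alternative.

```python
import socket

def verify_crawler(ip, expected_suffixes):
    """Reverse-resolve the IP, check the hostname suffix, then forward-confirm."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False                      # no PTR record at all
    if not host.endswith(expected_suffixes):
        return False
    # Forward-confirm: the claimed hostname must resolve back to the same IP.
    forward_ips = {info[4][0] for info in socket.getaddrinfo(host, None)}
    return ip in forward_ips

# Example with a Googlebot-style check; the IP and suffixes are illustrative.
print(verify_crawler("66.249.66.1", (".googlebot.com", ".google.com")))
```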