By the way, if someone discovers how to trigger this issue, it would be easy to use it as a DoS pseudo-botnet.
My stack of projects to do is growing at a hell of a rate and I'm not popping them off the stack fast enough.
Check out the code generation parts and modules, they are the most mature. We have HRDs (like CRDs in k8s for anything) and a scripting language between bash and Python coming out soon too.
I've checked out your repo for a look over tomorrow when I'm cough sober!
We think of net neutrality as being for carriers and ISPs, but you could see it applied to a publicly accessible DNS service too. These DNS service providers are just as much part of the core service of the Internet as anyone else. It’s not a huge leap to require that those who operate a publicly accessible DNS service are bound by the same spirit of the regulations: that the infrastructure must not discriminate based on who is using it.
It’s different to operating a discriminatory firewall. DNS is a cacheable public service with bad consequences if poisonous data ends up in the system. Fiddling with DNS like this doesn’t seem like a good idea. Too much weird and bad stuff could go wrong.
Another analogy would be to the use of encryption on amateur radio. It seems like an innocuously good idea, but the radio waves were held open in public trust for public use. If you let them be used for a different (though arguably more useful) purpose, then the resource ends up being degraded.
Also along these lines of thought [begin irony mode]: FCC fines for DNS wildcard abuse / usage.
I guess my point is that "abuse" in this sense is pretty subjective.
The difference is in whether both endpoints want the traffic, not whether (or on what basis) the enabling infrastructure wants to bear it.
Doesn't necessarily apply to this conversation, but the moral mistake that people (and societies) frequently make is underestimating the nuance that should be exercised when identifying others as abusers.
That's surprising. What were the spoofed user agents that they used?
We've run into this issue also, but all Facebook bot activity had user agents that contained the string "facebookexternalhit".
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53
Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1
I am not as familiar with the law around accessing computer systems, but I imagine that, given some of the draconian enforcement we've seen in the past, a robots.txt should be sufficient to support some legal action against someone who disregards it.
Making an http request does not fit cleanly as an analogue to yelling from the street, nor does it fit as an analogue to throwing a written request on a brick through a window. It is something different that must be understood on its own terms.
Also, you often need an impressive amount of the stuff that's in a normal UA string for random sites to not break/send you the "unsupported browser, please use Netscape 4 or newer!!!" page/..., although you normally can fit an identifier of what you really are at the end. (As an example, here's an iOS Safari user agent: "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1" - it's not Mozilla, it's not Gecko, but it has those keywords and patterns because sites expect to see that)
Usually people get stuck on the fact that we can't have nice things, so X sucks for not letting us have nice things, yet I seem to never see people acknowledge why we can't have nice things.
Then I'd see a lot more "ugh, bad actors suck!" and less "ugh, websites are just trying to make life miserable for me >:("
I just sent you an e-mail, you can also reply to that instead if you prefer not to share those details here. :-)
But at least for the email/calendar backend, it's Exchange
The internal replacement clients for calendar and other things are killer...have yet to find replacements
For the most part though they use Facebook internally for messaging and regular communication (technically now Workplace but before it was just Facebook)
Email is really just for external folks
user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
ip 1: 2a03:2880:22ff:3::face:b00c (1 request)
ip 2: 2a03:2880:22ff:b::face:b00c (2 requests)
ASN: AS32934 FACEBOOK
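For anyone wanting to act on a user agent plus source address like the above, here is a minimal sketch of checking both together, since the User-Agent header alone is trivially spoofed. The function name is made up, and the `2a03:2880::/32` prefix is an assumption picked only because every address quoted in this thread starts with `2a03:2880`; the authoritative list would have to come from the routes announced by AS32934.

```python
import ipaddress

# Assumption: prefix chosen because all addresses quoted in the thread fall
# inside it. Replace with the networks actually announced by AS32934.
ASSUMED_FB_NETWORKS = [ipaddress.ip_network("2a03:2880::/32")]


def looks_like_facebook_crawler(remote_addr: str, user_agent: str) -> bool:
    """Return True only if the UA claims facebookexternalhit AND the
    source address lies inside one of the assumed Facebook networks."""
    if "facebookexternalhit" not in user_agent.lower():
        return False
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in ASSUMED_FB_NETWORKS)
```

A request that claims the crawler's user agent from an address outside those networks is most likely someone else borrowing the name.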
I can, by the way, confirm this issue. I work at a large newspaper in Norway, and around a year ago we saw the same issue: thousands of requests per second until we blocked it. After we blocked it, traffic to our Facebook page also plummeted. I assume Facebook considered our website down and thus wouldn't give users content from our Facebook page either, as that would give them a bad user experience. The Facebook traffic did not normalize until the attack had stopped AND we had told Facebook to reindex all our content.
If you want more info, send me an email and I'll dig out some logs etc. thu at db.no
You could of course block upstream by IP, but if you want to send the traffic away from a CPU heavy dynamic page to something static that 2xx's or 301's to https://developers.facebook.com/docs/sharing/webmasters/craw... then this could be the answer.
"My webserver is getting hit with bursts of hundreds of requests from Facebook's IP ranges. Google Analytics also reports these hits and shows them as coming from (mostly) Philippines and Peru, however, IP lookup shows that these IPs belong to Facebook (TFBNET3). The number of these hits during a burst typically exceeds my normal traffic by 200%, putting a lot of stress at our infrastructure, putting our business at risk.
This started happening after the Facebook Support team resolved a problem I reported earlier regarding connecting my Facebook Pixel as a data source to my Catalog. It seems Facebook is sending a bot to fetch information from the page, but does so very aggressively and apparently call other trackers on the page (such as Google Analytics)"
220.127.116.11 - - [13/Aug/2018:11:09:52 +0200] "GET /items/ley3xk/ford-dohc-20-sierra-mondeo-scorpio-luk-set.html HTTP/1.1" 200 15181 "https://www.facebook.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"
"e.g. IP addresses 173.252.87.* performed 15,211 hits between Aug 14 12:00 and 12:59, followed by 13,946 hits from 31.13.115.*"
"What is also interesting is that the user agents are very diverse. I would expect a Facebook crawler to identify itself with a unique User-Agent header (as suggested by the documentation page mentioned earlier), but instead I see User-Agent strings that belong to many different browsers. E.g. this file contains 53,240 hits from Facebook's IP addresses with User-Agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"
There are a few Facebook useragents in there, but far less than browser useragents:
7,310 hits: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
2,869 hits: facebookexternalhit/1.1
1,439 hits: facebookcatalog/1.0
120 hits: facebookexternalua
Surprisingly, there is even a useragent string that mentions Bing:
6,280 hits: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
These IPs don't only fetch the HTML page, but load all the page's resources (images, css, ..) including all third-party trackers (such as Google Analytics). Not only does this put unnecessary stress at our infrastructure, it drives up the usage costs of 3rd party tracking services and renders some of our reports unreliable."
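A per-user-agent tally like the one quoted above can be produced with a few lines of Python. This is only a sketch: it assumes the access log is in the Combined Log Format shown in the sample line earlier, and `count_user_agents` is a name made up for illustration.

```python
import re
from collections import Counter

# Combined Log Format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def count_user_agents(lines):
    """Count hits per User-Agent string, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("agent")] += 1
    return counts
```

Feeding it an open log file (`count_user_agents(open(path))`) and calling `.most_common()` on the result gives the hits sorted by user agent; filtering by source network first would narrow it to the suspect ranges.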
final response from FB:
"Thanks for your patience while our team looked into this. They've added measures to reduce the amount of crawler calls made. Further optimizations are being worked on as well, but for now, this issue should be resolved." <- NOT.
(I would give you the URL, but I just don't want it to be visited)
Maybe you should just take your website offline?
The network address range falls under Facebook's ownership, so I don't think it's someone spoofing. I do think it's very possible someone found a way to trigger crawl requests in large quantity. Alternatively, I would not be surprised if it's just a bug on Facebook's end.
Of course, sample size of 1, etc. It could have been coincidental.
You can add artificial wait times to responses, or you can just route all of the 'bad' traffic to one machine, which becomes oversubscribed (be sure to segregate your stats!). All bad actors fighting over the same scraps creates proportional backpressure. Just adding 2 second delays to each request won't necessarily achieve that if multiple user agents are hitting you at once.
... unless you're hosting a lot of websites for people in a particular industry. In which case the bot will just start making requests to three other websites you also are responsible for.
Then if you use a tarpit machine instead of routing tricks, the resource pool is bounded by the capacity of that single machine. If you have 20 other machines that's just the Bad Bot Tax and you should pay it with a clean conscience and go solve problems your human customers actually care about.
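To make the "bounded resource pool" idea concrete, here is a rough in-process sketch rather than a dedicated machine. The `Tarpit` class, the slot count, and the 503 response for an exhausted pool are all assumptions for illustration, not something described in the thread: every suspected-bot request has to win one of a fixed number of slow slots, so the total cost stays capped no matter how many clients pile on.

```python
import threading
import time


class Tarpit:
    """Serve suspected-bot requests slowly from a fixed pool of slots."""

    def __init__(self, slots: int, delay_seconds: float):
        self._slots = threading.BoundedSemaphore(slots)
        self._delay = delay_seconds

    def handle(self) -> int:
        """Return an HTTP status code for one suspected-bot request."""
        # All bad traffic competes for the same few slots; when they are
        # taken, shed the request immediately instead of queueing it.
        if not self._slots.acquire(blocking=False):
            return 503
        try:
            time.sleep(self._delay)  # the artificial wait
            return 200
        finally:
            self._slots.release()
```

Unlike a flat per-request delay, the pool size puts a hard ceiling on how many slow responses are in flight at once, which is the property the comment above is after.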
That's seemed to be good enough, but will consider the TARPIT option if it turns out to be needed. :)
Turns out, since it was a front for all of Wikipedia, Google was aggressively indexing it, but the results rarely made it to the first search page. And since this isn't exactly an important site, old results would stick around.
Hence a pattern:
1. Some Rando creates a page about themselves
2. Wikipedia editors, being holy and good, extinguish that nonsense
1.5. GOOGLE INDEXES IT
3. Rando, by nature of being a rando, googles themselves, doesn't find their wikipedia page anymore (that's gone), but does find a link to my site in the first page of google results with their name.
4. Lawyers get involved, somehow
 Hebrew wikipedia. I can't imagine what would've happened on the English version.
Saying "מה זה לעזאזל X?" is closer to "what the fuck is X?" (except it's more "what the hell" than "what the fuck").
 I wasn't sure how to portray this to non-Hebrew speakers but, surprisingly, Google Translate actually nailed it: https://translate.google.com/#view=home&op=translate&sl=auto...
 Google Translate got this example right, too: https://translate.google.com/#view=home&op=translate&sl=auto...
There is actually no proof of this.
I have no idea where I got that impression from, but I do recall reading several articles about him, as well as his blog around that time. It's possible I just internalized others' assumptions about his behavior, but it certainly seemed like an in-character thing for him to do.
Whether or not the above even holds today (hiQ Labs v. LinkedIn argues it does not) remains to be decided in a current appeal to the Supreme Court.
Shouldn't anybody doing such a thing be liable, and be sued for negligence and required to pay damages?
That sounds like a lot of bandwidth (and server stress).
(I know they're ignoring robots.txt, but robots.txt is not a law. And, it doesn't apply to user-generated requests, for things like "link unfurling" in things like Slack. I am guessing the crawler ignores robots.txt because it is doing the request on behalf of a human user, not to create some sort of index. Google is attempting to standardize this widely-understood convention: https://developers.google.com/search/reference/robots_txt)
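For reference, this is roughly what honoring robots.txt looks like from the crawler's side, using Python's standard `urllib.robotparser`. The robots.txt content and the example URL are hypothetical, and nothing here says Facebook's crawler works this way; it only shows the check a well-behaved indexing crawler would make before fetching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out the link-preview crawler, allow the rest.
ROBOTS_TXT = """\
User-agent: facebookexternalhit
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Ask the parsed robots.txt whether this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

The check is purely voluntary on the client's part, which is the point being made above: the file expresses a wish and enforces nothing.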
The post mentions setting `og:ttl` and replying with HTTP 429 - Too Many Requests, and both being ignored.
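For context, the server side of a 429 is simple enough. Below is a minimal sliding-window sketch; the class name, the limits, and the injectable clock are assumptions for illustration, and `Retry-After` is the standard header for telling a client when to come back. Whether the client respects any of it is a separate matter, as the post found out.

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # client key -> timestamps of accepted hits

    def check(self, client_key):
        """Return (status, headers) for one incoming request."""
        now = self._clock()
        hits = self._hits[client_key]
        # Forget hits that have slid out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            # Seconds until the oldest counted hit expires.
            retry_after = max(1, math.ceil(self._window - (now - hits[0])))
            return 429, {"Retry-After": str(retry_after)}
        hits.append(now)
        return 200, {}
```

Keying on the source network rather than the single address would be the obvious variation when the requests arrive from dozens of addresses in one range.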
If you don't want to do this yourself and you're already using Cloudflare... congratulations, this is exactly why they exist. Write them a check every month for less than one hour of engineering time, and their knowledge is your knowledge. Your startup will launch on time!
The author mentioned that they were using Cloudflare, and I see at least two ways to implement rate limiting. One is to have your application recognize the pattern of malicious behavior, and instruct Cloudflare via its API to block the IP address. Another is to use their rate-limiting that detects and blocks this stuff automatically (so they claim).
Like, there are many problems here. One is a rate limit that doesn't reduce load enough to serve legitimate requests. Another is wanting a caching service to cache a response that "MUST NOT" be cached. And the last is expecting people on the Internet to behave. There are always going to be broken clients, and you probably want the infrastructure in place to flat-out stop replying to TCP SYN packets from broken/malicious networks for a period of time. If the SYN packets themselves are overwhelming, welp, that is exactly why Cloudflare exists ;)
The author is right to be mad at Facebook. Their program is clearly broken. But it's up to you to mitigate your own stuff. This time it's a big company with lots of money that you can sue. The next time it will be some script kiddie in some faraway country with no laws. Your site will be broken for your users in either case, and so it falls on you to mitigate it.
If your website is pretty fast, you probably won't necessarily care about a lot of hits from one source, but if it's supposed to be a small website on inexpensive hosting, and all those hits add up to real transfer numbers, maybe it's an issue.
2a03:2880:10ff:14::face:b00c 2a03:2880:10ff:21::face:b00c 2a03:2880:11ff:1a::face:b00c 2a03:2880:11ff:1f::face:b00c 2a03:2880:11ff:2::face:b00c 2a03:2880:12ff:10::face:b00c 2a03:2880:12ff:1::face:b00c 2a03:2880:12ff:9::face:b00c 2a03:2880:12ff:d::face:b00c 2a03:2880:13ff:3::face:b00c 2a03:2880:13ff:4::face:b00c 2a03:2880:20ff:12::face:b00c 2a03:2880:20ff:1e::face:b00c 2a03:2880:20ff:4::face:b00c 2a03:2880:20ff:5::face:b00c 2a03:2880:20ff:75::face:b00c 2a03:2880:20ff:77::face:b00c 2a03:2880:20ff:e::face:b00c 2a03:2880:21ff:30::face:b00c 2a03:2880:22ff:11::face:b00c 2a03:2880:22ff:12::face:b00c 2a03:2880:22ff:14::face:b00c 2a03:2880:23ff:5::face:b00c 2a03:2880:23ff:b::face:b00c 2a03:2880:23ff:c::face:b00c 2a03:2880:30ff:10::face:b00c 2a03:2880:30ff:11::face:b00c 2a03:2880:30ff:17::face:b00c 2a03:2880:30ff:1::face:b00c 2a03:2880:30ff:71::face:b00c 2a03:2880:30ff:a::face:b00c 2a03:2880:30ff:b::face:b00c 2a03:2880:30ff:c::face:b00c 2a03:2880:30ff:d::face:b00c 2a03:2880:30ff:f::face:b00c 2a03:2880:31ff:10::face:b00c 2a03:2880:31ff:11::face:b00c 2a03:2880:31ff:12::face:b00c 2a03:2880:31ff:13::face:b00c 2a03:2880:31ff:17::face:b00c 2a03:2880:31ff:1::face:b00c 2a03:2880:31ff:2::face:b00c 2a03:2880:31ff:3::face:b00c 2a03:2880:31ff:4::face:b00c 2a03:2880:31ff:5::face:b00c 2a03:2880:31ff:6::face:b00c 2a03:2880:31ff:71::face:b00c 2a03:2880:31ff:7::face:b00c 2a03:2880:31ff:8::face:b00c 2a03:2880:31ff:c::face:b00c 2a03:2880:31ff:d::face:b00c 2a03:2880:31ff:e::face:b00c 2a03:2880:31ff:f::face:b00c 2a03:2880:32ff:4::face:b00c 2a03:2880:32ff:5::face:b00c 2a03:2880:32ff:70::face:b00c 2a03:2880:32ff:d::face:b00c 2a03:2880:ff:16::face:b00c 2a03:2880:ff:17::face:b00c 2a03:2880:ff:1a::face:b00c 2a03:2880:ff:1c::face:b00c 2a03:2880:ff:1d::face:b00c 2a03:2880:ff:25::face:b00c 2a03:2880:ff::face:b00c 2a03:2880:ff:b::face:b00c 2a03:2880:ff:c::face:b00c 2a03:2880:ff:d::face:b00c
A superficial search leads to things like
You really want to be careful about potentially breaking laws ...
If FB supports brotli, a much bigger compression factor than 1000 is possible, apparently.
There's also a script in that directory that allows you to create files of whatever size you want (hovering around that same compression ratio). You can even use it to embed secret messages in the brotli (compressed or uncompressed). There's also a python script there that will serve it with the right header. Note that for Firefox it needs to be hosted on https, because Firefox only supports brotli over https.
Back when I created it, it would crash the entire browser of ESR Firefox, crash the tab of Chrome, and would lead to a perpetually loading page in regular Firefox.
It's currently hosted at [CAREFUL] https://stuffed.web.ctfcompetition.com [CAREFUL]
This means you can serve 1M payload and have it come out to 1G at decompression time. Not a bad compression ratio, but it doesn't seem like enough to break Facebook servers without taking on considerable load of your own.
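The flip side, for anyone writing a crawler: the ratio only hurts a client that decompresses without a ceiling. Here is a small sketch using Python's `zlib` (deflate, the same algorithm gzip uses; an assumption, since the comment does not say which encoding it means). The helper name and the limits are made up for illustration.

```python
import zlib


def decompress_capped(data: bytes, limit: int) -> bytes:
    """Inflate zlib data, refusing to produce more than `limit` bytes."""
    inflater = zlib.decompressobj()
    # Ask for one byte more than allowed so an overrun is detectable
    # without ever materializing the full output.
    out = inflater.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed size exceeds limit")
    return out


# Highly repetitive input shows the ratio: 10 MiB of zero bytes.
zeros = bytes(10 * 1024 * 1024)
payload = zlib.compress(zeros, 9)
ratio = len(zeros) / len(payload)  # on the order of the 1M -> 1G figure
```

With a cap of, say, 1 MiB, `decompress_capped(payload, 1024 * 1024)` raises instead of allocating the full 10 MiB, so the cost to the client stays bounded by the limit rather than by whatever the server sends.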
But zip bombs aren't limited to zip files... lots of files have some compression in them. You can make malicious PNGs that do the same thing. Probably tonnes of other files.
Even I would stop and think "maybe I should put in some guards against large amounts of traffic" when writing a crawler, and I'm certainly not one of those brilliant minds who manage to pass their interview process.
This is not how the mail works, where someone needs to buy a stamp to spam you.
I knew a realtor that had a sheet of black paper with a few choice expletives written on it that they would send back to spammers. There was an art to taping it into a loop so it would continuously feed. This was a few decades ago when the spam faxes could cost more than a stamp.
His solution was to send a return fax with disorderly handwriting begging and pleading for them to fax someone else.
The other person stopped faxing him.
The problem happens when prices are extortionate (or the pricing model is predatory, i.e. pay per MB transferred instead of a flat rate per 1Gbps link) and relatively minor traffic translates to a major bill.
This particular issue is about 80 requests/second which is a drop in the bucket on a 1Gbps link. It would be basically unnoticeable if it wasn't for the cloud providers nickel & diming their customers.
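A quick back-of-the-envelope check of the "drop in the bucket" claim. The response size is an assumption: it is the 15,181-byte body from the access log line quoted elsewhere in the thread, and the calculation ignores headers, TCP overhead, and any images or scripts fetched along with the page.

```python
REQUESTS_PER_SECOND = 80              # figure from the comment above
LINK_BITS_PER_SECOND = 1_000_000_000  # 1 Gbps
ASSUMED_RESPONSE_BYTES = 15_181       # assumption: one HTML body, no subresources


def link_utilisation(rps: float, response_bytes: int, link_bps: int) -> float:
    """Fraction of the link consumed by rps responses of the given size."""
    return rps * response_bytes * 8 / link_bps


utilisation = link_utilisation(
    REQUESTS_PER_SECOND, ASSUMED_RESPONSE_BYTES, LINK_BITS_PER_SECOND
)
print(f"{utilisation:.2%} of the link")
```

Under those assumptions the crawler traffic comes to roughly 1% of the link, so on flat-rate bandwidth the pain would be CPU time for dynamic pages rather than transfer.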
The USPS is like a reverse garbage service.
The typical analogy I hear for DDoS attacks is if you own a business and a giant group of protesters drives traffic to your store, but no one is interested in buying. They create a backlog to enter the store, they crowd the aisles, they pick up your inventory, and the lines and crowds scare off your legitimate visitors/customers.
"who do people already like to hate? I'll pretend to be them."
Otherwise it's a bot/malware/etc. spoofing Facebook and gone wrong, which sucks. And yeah just block it by UA, and hopefully eventually it goes away.
Not a profound comment. But it was really fun to follow the path down the various rabbit holes. This is why I like HN.
Maybe Cloudflare can step up and act as an intermediary to isolate the problem. Isolating DoS attacks is one of their comparative advantages.
for example https://www.ultratools.com/tools/ipv6InfoResult?ipAddress=2a...
Fuck this culture of "it's up to the victim of our fuckup to file a bug report with us".
Anyway, what reason would they have for so many requests? Why would they need to do 300 per second?
What's an AWS LoadBalancer way of blocking this traffic? Again, noob here. Thanks.
And the reasons against AWS for this would be the general cost (AWS is not cost-efficient in many cases, unless you have complicated infrastructure that takes a lot of management, where the Infrastructure-as-code approach can give you a sizable benefit), the bandwidth cost in particular, and the lack of configurability of their services to e.g. apply suitable tar-pitting against such crawlers.
What I've heard happen is someone puts a governor on the system, and much later your new big customer starts having errors because your aggregate traffic exceeds capacity planning you did 2 years ago and forgot about.
Are you correctly sending caching headers?
I have no idea what it's trying to do, but it's legitimately the only IPv6 inbound traffic that isn't related to normal browsing.
Do you have more details on this? One could read that as if the page is designed to attract crawlers.
Can anyone comment on whether there would be any legal basis for this?
Potential Positive: page speed, https
Negative: All blog posts with the same length, keywords dropped in every paragraph
I wonder if you can bypass paywalls or see things that aren't generally presented to regular users
I have a UA switcher on Firefox so I can change my UA in two clicks. One of the built-in UA options is indeed the Googlebot.