A Facebook crawler was making 7M requests per day to my stupid website (napolux.com)
1074 points by napolux on June 11, 2020 | 407 comments



We've had the same issue. They were doing huge bursts of tens of thousands of requests in a very short time, several times a day. The bots didn't identify as FB (they used "spoofed" UAs) but were all coming from FB-owned netblocks. I contacted FB about it, but they couldn't figure out why this was happening and didn't solve the problem. I found out that there is an option in the FB Catalog manager that lets FB auto-remove items from the catalog when their destination page is gone. Disabling this option solved the issue.


I just had an idea: if you control your own name server, I believe you could use a BIND view to send all of their traffic back to themselves, based on the source address.
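Roughly, with BIND 9 views, it could look like the sketch below; the netblocks are just a few of Facebook's announced ranges (the full set comes from AS32934) and the zone file names are made up:

    acl facebook-crawler {
        31.13.24.0/21;        // example Facebook netblocks; pull the full
        66.220.144.0/20;      // set from AS32934 rather than hardcoding
        69.171.224.0/19;
        173.252.64.0/18;
        2a03:2880::/32;
    };

    view "facebook" {
        match-clients { facebook-crawler; };
        zone "example.com" {
            type master;
            file "db.example.com.facebook";   // answers every name with a Facebook IP
        };
    };

    view "default" {
        match-clients { any; };
        zone "example.com" {
            type master;
            file "db.example.com";            // the real zone everyone else sees
        };
    };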

By the way, if someone discovers how to trigger this issue, it would be easy to use it as a DoS pseudo-botnet.


I had a different idea. Maybe you could craft a zip-bomb response. The bot would fetch the small gzipped content and upon extraction discover it was GBs of data? Not sure that's possible here, when responding to a request, but that would surely turn the admins' attention to it.


Here's an example of things you can do against malicious crawlers: http://www.hackerfactor.com/blog/index.php?/archives/762-Att....


Thanks a lot! I've spent like 2h now reading his blog. Just amazing.


fascinating read. :)


It is!


If their client asks for gzip compression of the http traffic, you could do it.


Nice idea, I like the thinking. I'll tuck that away for use later. PowerDNS has Lua built in, amongst a few other things.

My stack of projects to do is growing at a hell of a rate and I'm not popping them off the stack fast enough.


I know the feeling, it's one of the reasons I'm working on https://github.com/hofstadter-io/hof

Check out the code generation parts and modules, they are the most mature. We have HRDs (like CRDs in k8s for anything) and a scripting language between bash and Python coming out soon too.


I've tried to work out what your project does but I'm none the wiser. GEB is prominent on my bookshelf. I'm a sysadmin and I got as far as "hollow wold" in Go, or was it "Hail Marrow"? Can't remember.

I've checked out your repo for a look over tomorrow when I'm *cough* sober!


Stop by gitter and I'd be happy to explain more


Something about this idea sits uncomfortably with me. I also just had an idea / thought experiment based on your idea.

We think of net neutrality as being for carriers and ISPs, but you could see it applied to a publicly accessible DNS service too. These DNS service providers are just as much part of the core service of the Internet as anyone else. It’s not a huge leap to require that those who operate a publicly accessible DNS service are bound by the same spirit of the regulations: that the infrastructure must not discriminate based on who is using it.

It’s different to operating a discriminatory firewall. DNS is a cacheable public service with bad consequences if poisonous data ends up in the system. Fiddling with DNS like this doesn’t seem like a good idea. Too much weird and bad stuff could go wrong.

Another analogy would be to the use of encryption on amateur radio. It seems like an innocuously good idea, but the radio waves were held open in public trust for public use. If you let them be used for a different (though arguably a more useful purpose) then the resource ends up being degraded.

Also along these lines of thought [begin irony mode]: FCC fines for DNS wildcard abuse / usage.


Principled neutrality is fine for acceptable use. There’s no moral quandary in closing the door to abusers.


Isn't that the argument that providers make for wanting to meter usage? I.e. video streamers, torrenters and netflix and the like are 'abusing' the network by using a disproportionate amount of their capacity / bandwidth?

I guess my point is that "abuse" in this sense is pretty subjective.


Not really, those things are easy to contrast. Network providers have always been comfortable blackholing DoS routes, and it’s never been controversial. That’s clearly distinct from those wanting to double-dip on transport revenues for routine traffic.

The difference is in whether both endpoints want the traffic, not whether (or on what basis) the enabling infrastructure wants to bear it.


> There’s no moral quandary in closing the door to abusers

Doesn't necessarily apply to this conversation, but the moral mistake that people (and societies) frequently make is underestimating the nuance that should be exercised when identifying others as abusers.


if only there was a way to accidentally amplify that.


You could probably just send an HTTP redirect to their own site; no need to play with DNS for that.


Probably better to find the highest-level FB employee's personal site that you can and send the million requests from FB there.


Except then you'd still have to deal with all the traffic.


>> The bots didn't identify as FB (used "spoofed" UAs)

That's surprising. What were the spoofed user agents that they used?

We've run into this issue also, but all Facebook bot activity had user agents that contained the string "facebookexternalhit".


I've seen these two user-agents from FB IPs, maybe others:

Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53

Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1

These also execute JavaScript in some modified sandbox or something, causing errors and executing error handlers. Interesting attempt to analyze the crawler here: https://github.com/aFarkas/lazysizes/issues/520#issuecomment...


Yep, these were the UAs we also saw (amongst others). And also in our case those bots were executing the JS, even hitting our Google Analytics. For some reason GA reported this traffic as coming from Peru and the Philippines, while an IP lookup showed it belonged to FB, registered in the US or Ireland.


Set up a robots.txt that disallows Facebook crawlers, sue Facebook if the crawling continues for unauthorized access to computer systems, profit.
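Something like this (facebookexternalhit and facebookcatalog are the crawler names that show up later in the thread), whether or not the crawler actually honors it:

    # disallow the documented Facebook crawlers site-wide
    User-agent: facebookexternalhit
    Disallow: /

    User-agent: facebookcatalog
    Disallow: /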


I believe this only works if it is coming from a corporation and targeted at an individual -_-


robots.txt is not a legal document. It is asking nicely, and plenty of crawlers purposefully ignore it.


I'm not a lawyer, but I believe I remember people being sued for essentially making a GET request to a URL they weren't supposed to GET.


Legal document is a tricky phrase to use. "No trespassing" signs are usually considered sufficient to justify prosecution for trespassing. If the sign is conspicuously placed, it does not usually matter if you actually see the sign or not.

I am not as familiar with law around accessing computer systems, but I imagine that given some of the draconian enforcement we've seen in the past that a robots.txt should be sufficient to support some legal action against someone who disregards it.


The no trespassing sign does not mean I can't yell at you from the street, "Tell me your life story," which you are free to ignore.


Sure, but federal law does prevent you from repeatedly placing that request into my mailbox. I don't even need a sign for that.

Making an http request does not fit cleanly as an analogue to yelling from the street, nor does it fit as an analogue to throwing a written request on a brick through a window. It is something different that must be understood on its own terms.


Side question: how did you get in contact with facebook? I've an ad account that was suspended last year and gave up trying to contact them.


Similar experience, closed ad account because of “suspicious activity”, at least ten support tickets (half closed automatically), four lame apologies (our system says no, sry) and then finally, “there was an error in our system, you’re good to go”


I lost it because there was some sort of UI bug in the ad preview page that led the page to reload a bunch of times really quick, and boom, I'm apparently a criminal. You're lucky though, I never got it back and I had to change my business credit card because they somehow kept charging me after suspension.


Really? I've never had problems contacting them by email. They're one of the easiest tech companies to talk to.


In my experience they're one of the most useless and difficult companies I've ever tried to interact with.


The entirety of the universe is contained in the preceding two comments.


Aah, finally it all makes sense!



What did you contact them for? Just curious as I've almost always heard they're like Google and impossible to get a human response from.


What is their email? I've never found one that they'd respond to.


I have an option to email them, live chat, or have our account manager contact us. I guess if you spend enough on ads per month you are entitled to more support...?


try contacting their NOC, they _may_ give you a human that can help.


Try their live chat.


Do they actually have a live chat available for smaller advertisers? When I looked last year there wasn't


They do, I have used it a couple of times. Go to facebook.com/business/help and click the "get started" button at the bottom for "find answers or contact support". Follow through, and you will see a button called "chat with a representative".


Twitter does the same thing. It sends a bunch of spoofed visitors from Korea, Germany, and the US. The bots are spoofed to make it harder to filter them.


I wonder what the (legitimate?) reason is for them to spoof. Seems intentionally shady. Maybe there's a legit reason we're missing?


Possibly trying to avoid people sending them a different version of the page than users would see (of course they could change the page after the initial caching of a preview, but Twitter might refresh/check them later).

Also, you often need an impressive amount of the stuff that's in a normal UA string for random sites to not break/send you the "unsupported browser, please use Netscape 4 or newer!!!" page/..., although you normally can fit an identifier of what you really are at the end. (As an example, here's an iOS Safari user agent: "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1" - it's not Mozilla, it's not Gecko, but it has those keywords and patterns because sites expect to see that.)


Yeah, I once tried to tell my browser to send... I forget; either no UA, or a blank UA string, or nonsense or just "Firefox" or something. I figured, "hey, some sites might break, but it can't be that important!" It broke everything. IIRC, the breaking point was that my own server refused to talk to me. Now, I still think this is insane, but apparently this really is how it is right now.


Should make you realize just how much abuse there is on the internet that it's worth it to just filter traffic with no UA.

Usually people get stuck on the fact that we can't have nice things, so X sucks for not letting us have nice things, yet I seem to never see people acknowledge why we can't have nice things.

Then I'd see a lot more "ugh, bad actors suck!" and less "ugh, websites are just trying to make life miserable for me >:("


Is there an official error message for that? Because filtering no UA would trip me up every time I use wget. Most of the time, if I'm casually using wget for something, I don't bother with a UA. If sites started rejecting that, I'd like to get a clear error message, so I would not go crazy trying to figure out what the problem was. If I got a clear message "send a UA" then I would probably started wrapping my wget requests in a short bash script with some extra stuff thrown in to keep everyone happy. But I'd have to know what it is that is needed to keep everyone happy.
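Something like this is probably all the wrapper would need (the browser UA string is just illustrative; as the reply below notes, wget already sends a default "Wget/<version>" UA on its own):

    #!/bin/sh
    # pass a browser-like User-Agent so UA-filtering sites stay happy;
    # everything else is forwarded to wget unchanged
    exec wget --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0" "$@"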


wget has a default User-Agent string.


Well, if filtering for UA really makes things difficult for bad actors, they do suck, but more as a technical opinion than a moral statement.


That's amazing. Any idea what piece of middleware on your own server was doing that?


I don't remember, but what really got me was that I wasn't running anything that I expected to do fancy filtering; I think this was just Apache httpd running on CentOS. But there was no web application firewall, no load balancers, and I'm pretty sure fail2ban was only set up for sshd. It at least appeared that just the stock Apache config was in play.


You removed a vital part of the http protocol and you're surprised when things break?


Seems like it follows the spec to me:

https://tools.ietf.org/html/rfc7231#section-5.5.3


That makes sense. I couldn't come up with a shady reason why they would do it to be honest, but I was curious.



When you know your target runs Internet Explorer, you serve phishing pages to IE users and some boring content to other users, at the server level. We've had this keep our user tests out of Google Safe Browsing and so on. I'm sure similar tricks end up applied to Facebook's UA and "legitimate" marketing sites.


Thanks man! I'll have a look.


Hey! Facebook engineer here. If you have it, can you send me the User-Agent for these requests? That would definitely help speed up narrowing down what's happening here. If you can provide me the hostname being requested in the Host header, that would be great too.

I just sent you an e-mail, you can also reply to that instead if you prefer not to share those details here. :-)


I'm not sure I'd publicly post my email like that, if I worked at FB. But congratulations on your promotion to "official technical contact for all facebook issues forever".


My e-mail address is already public from my kernel commits and upstream work. :-)


Don't think I used my email for anything important during my time at FB. If it gets out of hand he could just make a request to have a new primary email made and use the above one for "spam".


Curiosity question: does FB use Gmail/Google suite?


FB uses Office365 for email. It was on-premise Exchange many many years ago, but moved "to the cloud" a while back.


Feels odd to read that Facebook uses Office365/Exchange for email. They haven't built their F Suite yet; I thought they would simply promote Facebook Messenger internally. I'm only half joking.


Most communication is via Workplace (group posts and chat). Emails aren't very common any more - mainly for communication with external people and alerting.


My impression is that they pretty much roll their own communication suite.


That’s somewhat correct

But at least for the email/calendar backend it's Exchange.

The internal replacement clients for calendar and other things are killer... I have yet to find replacements.

For the most part, though, they use Facebook internally for messaging and regular communication (technically now Workplace, but before it was just Facebook).

Email is really just for external folks


I don't want to share my website for personal reasons, but here is some data from cloudflare dashboard (a request made on 11 Jun, 2020 21:30:55 from Ireland, I have 3 requests in the same second from 2 different IPs)

user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
ip 1: 2a03:2880:22ff:3::face:b00c (1 request)
ip 2: 2a03:2880:22ff:b::face:b00c (2 requests)
ASN: AS32934 FACEBOOK


Yes requests are still coming. Thanks CloudFlare for saving my ass.


Hey,

I can, by the way, confirm this issue. I work at a large newspaper in Norway and around a year ago we saw the same issue: thousands of requests per second until we blocked it. And after we blocked it, traffic to our Facebook page also plummeted. I assume Facebook considered our website down and thus wouldn't give users content from our Facebook page either, as that would serve them content that would give a bad user experience. The Facebook traffic did not normalize before the attack stopped AND after we told Facebook to reindex all our content.

If you want more info, send me an email and I'll dig out some logs etc. thu at db.no


Thanks for looking at this!


Thank goodness for mod_rewrite, which makes blocking/redirecting traffic on basic things like headers pretty easy.

https://www.usenix.org.uk/content/rewritemap.html

You could of course block upstream by IP, but if you want to send the traffic away from a CPU heavy dynamic page to something static that 2xx's or 301's to https://developers.facebook.com/docs/sharing/webmasters/craw... then this could be the answer.
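A rough sketch of that (Apache 2.4 with mod_rewrite enabled; the UA pattern and the static target path are illustrative, not a tested config):

    RewriteEngine On
    # match the documented Facebook crawler UAs, case-insensitively
    RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|facebookcatalog) [NC]
    # send them to a cheap static page instead of the CPU-heavy dynamic one
    RewriteRule .* /fb-static.html [R=301,L]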


Here are some more details from my report to FB:

"My webserver is getting hit with bursts of hundreds of requests from Facebook's IP ranges. Google Analytics also reports these hits and shows them as coming from (mostly) Philippines and Peru, however, IP lookup shows that these IPs belong to Facebook (TFBNET3). The number of these hits during a burst typically exceeds my normal traffic by 200%, putting a lot of stress at our infrastructure, putting our business at risk.

This started happening after the Facebook Support team resolved a problem I reported earlier regarding connecting my Facebook Pixel as a data source to my Catalog. It seems Facebook is sending a bot to fetch information from the page, but does so very aggressively and apparently call other trackers on the page (such as Google Analytics)"

69.171.240.19 - - [13/Aug/2018:11:09:52 +0200] "GET /items/ley3xk/ford-dohc-20-sierra-mondeo-scorpio-luk-set.html HTTP/1.1" 200 15181 "https://www.facebook.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"

"e.g. IP addresses 173.252.87.* performed 15,211 hits between Aug 14 12:00 and 12:59, followed by 13,946 hits from 31.13.115.*"

"What is also interesting is that the user agents are very diverse. I would expect a Facebook crawler to identify itself with a unique User-Agent header (as suggested by the documentation page mentioned earlier), but instead I see User-Agent strings that belong to many different browsers. E.g. this file contains 53,240 hits from Facebook's IP addresses with User-Agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"

There are a few Facebook useragents in there, but far less than browser useragents:

7,310 hits: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
2,869 hits: facebookexternalhit/1.1
1,439 hits: facebookcatalog/1.0
120 hits: facebookexternalua

Surprisingly, there is even a useragent string that mentions Bing:

6,280 hits: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

These IPs don't only fetch the HTML page, but load all the page's resources (images, css, ..) including all third-party trackers (such as Google Analytics). Not only does this put unnecessary stress at our infrastructure, it drives up the usage costs of 3rd party tracking services and renders some of our reports unreliable."

final response from FB: "Thanks for your patience while our team looked into this. They've added measures to reduce the amount of crawler calls made. Further optimizations are being worked on as well, but for now, this issue should be resolved." <- NOT.


I don't really understand what the issue is. On my welcome page (while all other URLs are impossible to guess) I give the browser something that requires a few seconds of CPU at 100% to crunch, and I track some user actions in between, visiting tarpitted URLs, etc. In the last few years no bot came through. Why bother with robots.txt? Just give them something to break their teeth on...

(I would give you the URL, but I just don't want it to be visited.)


I don't think most users would appreciate having a site spike their CPU for a few seconds when they visit...at least I wouldn't.


The far-fetched "solution" along with the "I don't understand everyone doesn't do <thing that is easy to understand why everyone wouldn't do it>" make it hard for me to believe you're actually doing this. Sounds more like a shower thought, maybe fun weekend project, than something in production.


Sounds like terrible UX! Won't browsers give you a "Stop script" prompt if you do that?


I don’t think the solution is flatlining everyone’s CPUs.


> but I just dont want ti be visited

Maybe you should just take your website offline?


How much have you mined?


Same thing has happened to me: https://twitter.com/moonscript/status/1124888489298808834

The network address range falls under Facebook's ownership, so I don't think it's someone spoofing. I do think it's very possible someone found a way to trigger crawl requests in large quantities. Alternatively, I would not be surprised if it's just a bug on Facebook's end.


They've done this before to me, too. First I tried `iptables -j DROP`, which made the machine somewhat usable, but didn't help with the traffic. After trying a few things, I tried `-j TARPIT`, and that appeared to make them back off.
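For reference, the rules are short; TARPIT comes from the xtables-addons package (it's not in stock iptables), and the netblock below is just one example Facebook range:

    # hold connections from the offending range open with a zero-size TCP
    # window instead of dropping them; consider also exempting these from
    # conntrack so the tarpit doesn't fill your own state table
    iptables -A INPUT -s 69.171.224.0/19 -p tcp --dport 80  -j TARPIT
    iptables -A INPUT -s 69.171.224.0/19 -p tcp --dport 443 -j TARPIT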

Of course, sample size of 1, etc. It could have been coincidental.


Tarpits are an underappreciated solution to a pool of bad actors.

You can add artificial wait times to responses, or you can just route all of the 'bad' traffic to one machine, which becomes oversubscribed (be sure to segregate your stats!). All bad actors fighting over the same scraps creates proportional backpressure. Just adding 2 second delays to each request won't necessarily achieve that if multiple user agents are hitting you at once.


I never looked into the TARPIT option in iptables before reading your comment. That seems really useful. I've been dealing with on and off bursts of traffic from a single AWS region for the last month. They usually keep going for about 90 minutes every day, regardless of how many IPs I block, and consume every available resource with about 250 requests per second (not a big server and I'm still waiting for approval to just outright block the AWS region). I'm going to try a tarpit next time rather than a DROP and see if it makes a difference.


Be careful, as tarpitting connections can consume your resources faster than those of the attacker.


Most spiders limit the number of requests per domain, so if it's stupidity and not malice, you probably don't have a runaway situation.

... unless you're hosting a lot of websites for people in a particular industry. In which case the bot will just start making requests to three other websites you also are responsible for.

Then if you use a tarpit machine instead of routing tricks, the resource pool is bounded by the capacity of that single machine. If you have 20 other machines that's just the Bad Bot Tax and you should pay it with a clean conscience and go solve problems your human customers actually care about.


Only if you use conntracking though no?


Interesting. Implemented DROP for some Cloudflare net blocks a few weeks ago due to lots of weird interrupted TCP connections from them.

That seemed to be good enough, but I will consider the TARPIT option if it turns out to be needed. :)


one man’s weird interrupted TCP connection is another man’s nmap attempt.


TIL TARPIT!


This was happening to us > 5 years ago. The FB crawlers were taking out our image serving system, as we used the og:image thing. What we did was route FB crawler traffic to a separate Auto Scaling Group to keep our users happy while also getting the nice preview image on FB when our content was shared. I can't overstate the volume of the FB requests; I can't remember the exact numbers now, but it was insane.


I once made a PHP page which fetches a random Wikipedia article, but changes the title to "What the fuck is [TOPIC]?". It was incredibly funny, until I started getting angry emails from people, mostly threatening some form of libel lawsuit.

Turns out, since it was a front for all of Wikipedia[1], Google was aggressively indexing it, but the results rarely made it to the first search page. And since this isn't exactly an important site, old results would stick around.

Hence a pattern:

1. Some Rando creates a page about themselves

2. Wikipedia editors, being holy and good, extinguish that nonsense

1.5. GOOGLE INDEXES IT

3. Rando, by nature of being a rando, googles themselves, doesn't find their Wikipedia page anymore (that's gone), but does find a link to my site on the first page of Google results for their name.

4. Lawyers get involved, somehow

Details: http://omershapira.com/blog/2013/03/randomfax-net/

[1] Hebrew wikipedia. I can't imagine what would've happened on the English version.


I understand why people got angry, though. The title was roughly "What is this '<article name>' shit?"[1], with "shit" in this Hebrew context also able to be interpreted as calling the subject "shit".

Saying "מה זה לעזאזל X?"[2] is closer to "what the fuck is X?" (except it's more "what the hell" than "what the fuck").

[1] I wasn't sure how to portray this to non-Hebrew speakers but, surprisingly, Google Translate actually nailed it: https://translate.google.com/#view=home&op=translate&sl=auto...

[2] Google Translate got this example right, too: https://translate.google.com/#view=home&op=translate&sl=auto...


Doesn't ignoring headers and taking someone's site down fall afoul of the CFAA? Especially given how comments here are showing that this is a recurring issue? Depending on which side of the fence you're on, there could be standing to go after FB for this, either for money or for their lawyers to help set a precedent to limit the CFAA further.


When someone does this to Facebook it's malicious and they go to jail. When Facebook does it to someone else... "oops".


Never forget Aaron Swartz. All he did was send download requests to JSTOR. https://www.youtube.com/watch?v=9vz06QO3UkQ


I'm going to get a ton of hate for this, but he was doing it with the intent to redistribute the content for free. He didn't own the content. There's a big difference. That being said, it's very sad what came of that. I really don't think the FBI needed to be involved.


> he was doing it with the intent to redistribute the content for free

There is actually no proof of this.


Really? I felt that what happened to Aaron was a tragic injustice on many levels (from the fact that he was charged at all to the number of charges they threw at him), but for some reason I had always thought that it was fairly well-known that that was his intent.

I have no idea where I got that impression from, but I do recall reading several articles about him, as well as his blog around that time. It's possible I just internalized others' assumptions about his behavior, but it certainly seemed like an in-character thing for him to do.


IIRC, he had published a manifesto about how he believed that information that was paid for with public funds should be free for the public to view. However, I don't believe that he ever mentioned why he was making get requests to JSTOR's servers and I know that he never uploaded any of those JSTOR documents to the internet for public download.


Yeah, but any decent criminal lawyer could tear "he once wrote a manifesto" to shreds. His political beliefs would only be pertinent if he wrote the manifesto somehow in connection with what he was doing; otherwise, it would not be sufficient to show intent to distribute.


Criminal law is about proving things, not in-character assumptions.


Intent is the sticking point for the vast majority of prosecutions. That's why you have a jury, not Compute-o-Bot 9000.


Yeah, and Facebook did this crawling with intent to commit CFAA violations - computer fraud and abuse. Now someone call the feds quickly before the perpetrators are zucked into hiding.


From the point of view of the law, the intent (malicious or benevolent) is often as important as the action itself.


Nah, it only matters whether you are a large corporation or not. If you are a large corporation, you will ruin your opponent financially with legal fees (whether they are guilty or not). If you are a small timer going up against a large corporation, they will delay the court case until you cannot afford the legal fees.


The always-classic "what color are your bits?"

https://ansuz.sooke.bc.ca/entry/23


No. The owner needs to send Facebook a cease and desist and block their IPs in order to revoke authorization under the CFAA, and then Facebook would need to continue accessing his website despite explicitly being told not to. That's the precedent set by Craigslist v. 3Taps.

Whether or not the above even holds today (hiQ Labs v. LinkedIn argues it does not) remains to be decided in a current appeal to the Supreme Court.


That's 81 requests per second on average.

Shouldn't anybody doing such a thing be liable, and be sued for negligence and required to pay damages?

That sounds like a lot of bandwidth (and server stress).


Yes, this was my first thought too - Facebook is essentially denial of servicing people's sites and it sounds like they're aware of it. Move fast and break your things. If this were your average Joe I'm fairly certain something like the DMCA could be used.


What? How does the DMCA apply here at all?


The DMCA has a couple major parts. The part most people are familiar with deals with safe harbor provisions and "DMCA takedowns". The relevant part for this discussion makes it illegal to circumvent digital access-control technology. So you could argue that if you have reasonable measures in place to stop automated access of your content such as robots.txt then Facebook continuing to send automated requests would be a circumvention of your access controls.


Shouldn't any public facing website have a rate limit? If someone attempts to circumvent simple rate limits (randomizing the source IP or header content), then that could demonstrate intent to cause damage, and you'd have a better case. But if you don't set a limit, how can you be mad that someone exceeded it?
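For reference, a basic per-IP limit is only a few lines in nginx (numbers are illustrative), though as the replies note it doesn't help much against a crawler spread over hundreds of IPs:

    # in the http block: track clients by IP, allow roughly 5 requests/second
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

    server {
        listen 80;
        location / {
            limit_req zone=perip burst=20 nodelay;
            limit_req_status 429;   # the same 429 the article says was ignored
        }
    }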

(I know they're ignoring robots.txt, but robots.txt is not a law. And, it doesn't apply to user-generated requests, for things like "link unfurling" in things like Slack. I am guessing the crawler ignores robots.txt because it is doing the request on behalf of a human user, not to create some sort of index. Google is attempting to standardize this widely-understood convention: https://developers.google.com/search/reference/robots_txt)


There is a cost to applying throttling: you still have to receive those requests and filter/block them, plus pay for the incoming bandwidth.

The post mentions setting `og:ttl` and replying with HTTP 429 - Too Many Requests, and both being ignored.


Plenty of crawlers, including Facebook's, are operating out of huge pools of IPs usually smeared across multiple blocks. If you have an idea on how to ratelimit a crawler doing 2 req/s from 300+ distinct IPs with random spoofed UAs let me know when your startup launches.


There are many features that you can pick up on. TCP fingerprints, header ordering, etc. You build a reputation for each IP address or block, then block the anomalous requests.

If you don't want to do this yourself and you're already using Cloudflare... congratulations, this is exactly why they exist. Write them a check every month for less than one hour of engineering time, and their knowledge is your knowledge. Your startup will launch on time!


The author specified that the crawler ignores 429 status code. So they do have some rate limit.


I noticed that, but I get the feeling that sending that status code didn't reduce the amount of work the program did, so it's kind of pointless. You can say "go away", but at some point, you probably have to make them go away. (All RFC6585 says about 429 is that it MUST NOT be cached, not that the user agent can't try the request again immediately. It is very poorly specified.)

The author mentioned that they were using Cloudflare, and I see at least two ways to implement rate limiting. One is to have your application recognize the pattern of malicious behavior, and instruct Cloudflare via its API to block the IP address. Another is to use their rate-limiting that detects and blocks this stuff automatically (so they claim).

Like, there are many problems here. One is a rate limit that doesn't reduce load enough to serve legitimate requests. Another is wanting a caching service to cache a response that "MUST NOT" be cached. And the last is expecting people on the Internet to behave. There are always going to be broken clients, and you probably want the infrastructure in place to flat-out stop replying to TCP SYN packets from broken/malicious networks for a period of time. If the SYN packets themselves are overwhelming, welp, that is exactly why Cloudflare exists ;)

The author is right to be mad at Facebook. Their program is clearly broken. But it's up to you to mitigate your own stuff. This time it's a big company with lots of money that you can sue. The next time it will be some script kiddie in some faraway country with no laws. Your site will be broken for your users in either case, and so it falls on you to mitigate it.


Rate limiting websites is a lot of work. You have to figure out what kind of limits you should set, you have to figure out if that should be on an IP or a /24 or an ASN, if it's an ASN, you have to figure out how to turn an IP into an ASN.

If your website is pretty fast, you probably won't necessarily care about a lot of hits from one source, but if it's supposed to be a small website on inexpensive hosting, and all those hits add up to real transfer numbers, maybe it's an issue.


In the ToS, add a charge of 2 guineas, 5 shillings, a sixpence and a peppercorn per request.


Nah, make it a strand of saffron per request and get your money's worth out of them.


I also saw bursts of 300 reqs per second.


Did anyone notice the branded IP address? 2a03:2880:20ff:d::face:b00c

"face:b00c"


My "404-pests" fail2ban-client status, which drops everything making *.php requests (This machine has never had PHP installed...):

2a03:2880:10ff:14::face:b00c 2a03:2880:10ff:21::face:b00c 2a03:2880:11ff:1a::face:b00c 2a03:2880:11ff:1f::face:b00c 2a03:2880:11ff:2::face:b00c 2a03:2880:12ff:10::face:b00c 2a03:2880:12ff:1::face:b00c 2a03:2880:12ff:9::face:b00c 2a03:2880:12ff:d::face:b00c 2a03:2880:13ff:3::face:b00c 2a03:2880:13ff:4::face:b00c 2a03:2880:20ff:12::face:b00c 2a03:2880:20ff:1e::face:b00c 2a03:2880:20ff:4::face:b00c 2a03:2880:20ff:5::face:b00c 2a03:2880:20ff:75::face:b00c 2a03:2880:20ff:77::face:b00c 2a03:2880:20ff:e::face:b00c 2a03:2880:21ff:30::face:b00c 2a03:2880:22ff:11::face:b00c 2a03:2880:22ff:12::face:b00c 2a03:2880:22ff:14::face:b00c 2a03:2880:23ff:5::face:b00c 2a03:2880:23ff:b::face:b00c 2a03:2880:23ff:c::face:b00c 2a03:2880:30ff:10::face:b00c 2a03:2880:30ff:11::face:b00c 2a03:2880:30ff:17::face:b00c 2a03:2880:30ff:1::face:b00c 2a03:2880:30ff:71::face:b00c 2a03:2880:30ff:a::face:b00c 2a03:2880:30ff:b::face:b00c 2a03:2880:30ff:c::face:b00c 2a03:2880:30ff:d::face:b00c 2a03:2880:30ff:f::face:b00c 2a03:2880:31ff:10::face:b00c 2a03:2880:31ff:11::face:b00c 2a03:2880:31ff:12::face:b00c 2a03:2880:31ff:13::face:b00c 2a03:2880:31ff:17::face:b00c 2a03:2880:31ff:1::face:b00c 2a03:2880:31ff:2::face:b00c 2a03:2880:31ff:3::face:b00c 2a03:2880:31ff:4::face:b00c 2a03:2880:31ff:5::face:b00c 2a03:2880:31ff:6::face:b00c 2a03:2880:31ff:71::face:b00c 2a03:2880:31ff:7::face:b00c 2a03:2880:31ff:8::face:b00c 2a03:2880:31ff:c::face:b00c 2a03:2880:31ff:d::face:b00c 2a03:2880:31ff:e::face:b00c 2a03:2880:31ff:f::face:b00c 2a03:2880:32ff:4::face:b00c 2a03:2880:32ff:5::face:b00c 2a03:2880:32ff:70::face:b00c 2a03:2880:32ff:d::face:b00c 2a03:2880:ff:16::face:b00c 2a03:2880:ff:17::face:b00c 2a03:2880:ff:1a::face:b00c 2a03:2880:ff:1c::face:b00c 2a03:2880:ff:1d::face:b00c 2a03:2880:ff:25::face:b00c 2a03:2880:ff::face:b00c 2a03:2880:ff:b::face:b00c 2a03:2880:ff:c::face:b00c 2a03:2880:ff:d::face:b00c
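A sketch of what such a jail can look like (the regex and paths are illustrative, assuming a combined-format access log):

    # /etc/fail2ban/filter.d/404-pests.conf
    [Definition]
    failregex = ^<HOST> .* "(GET|POST) [^"]*\.php[^"]*"

    # /etc/fail2ban/jail.d/404-pests.local
    [404-pests]
    enabled  = true
    port     = http,https
    filter   = 404-pests
    logpath  = /var/log/nginx/access.log
    maxretry = 2
    bantime  = 86400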


I think we're together in this, my friend


I believe, but have not tried, that you can craft zipped files that are small but when expanding produce multi-gigabyte files. If you can sufficiently target the bad guys, you can probably grind them to a halt by serving only them these files. Some care needs to be taken so you don’t harm the innocent and don’t run afoul of any laws that may apply. Good luck.


Do you think it would be possible to create something similar for gzip? If you then serve with Content-Type: text/html, and Content-Encoding: gzip, the client would accept the payload. And when it tries to expand it, it would get expanded to a large file, eating up their resources.
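Something like this might do it (untested sketch; the ~1030:1 deflate ceiling mentioned further down bounds how big it can get):

    # ~10 GB of zeros gzips down to roughly 10 MB; serve the resulting file
    # verbatim with "Content-Encoding: gzip" and the client inflates it
    dd if=/dev/zero bs=1M count=10240 | gzip -9 > bomb.gz
    ls -lh bomb.gz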


That was the idea. There's a name for it - zip bomb. https://en.wikipedia.org/wiki/Zip_bomb

A superficial search leads to things like https://www.rapid7.com/db/modules/auxiliary/dos/http/gzip_bo...

https://stackoverflow.com/questions/1459673

You really want to be careful about potentially breaking laws ...


Here is another article: https://www.blackhat.com/docs/us-16/materials/us-16-Marie-I-...

If FB supports brotli, a much bigger compression factor than 1000 is possible, apparently.


Here's a brotli file I created that's 81MB compressed and 100TB uncompressed[1] (bomb.br). That's a 1.2M:1 compression ratio (higher than any other brotli ratio I see mentioned online).

There's also a script in that directory that allows you to create files of whatever size you want (hovering around that same compression ratio). You can even use it to embed secret messages in the brotli (compressed or uncompressed). There's also a python script there that will serve it with the right header. Note that for Firefox it needs to be hosted on https, because Firefox only supports brotli over https.

Back when I created it, it would crash the entire browser of ESR Firefox, crash the tab of Chrome, and would lead to a perpetually loading page in regular Firefox.

It's currently hosted at [CAREFUL] https://stuffed.web.ctfcompetition.com [CAREFUL]

[1] https://github.com/google/google-ctf/tree/master/2019/finals...


A single layer of deflate compression has a theoretical expansion limit of 1032:1. ZIP bombs with higher ratios only achieve it by nesting a ZIP file inside of another ZIP file and expecting the decompressor to be recursive.

This means you can serve 1M payload and have it come out to 1G at decompression time. Not a bad compression ratio, but it doesn't seem like enough to break Facebook servers without taking on considerable load of your own.

http://www.zlib.net/zlib_tech.html


Just guessing, but maybe a really large file with a limited character set, maybe even just one repeated character, would compress really well.


1 character repeated 1,877,302,224 times has a compression ratio of 99.9%: ~1.9GB compresses to ~1.8MB.
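You can sanity-check that figure in one line (a sketch using NUL bytes as the repeated character):

    # prints the compressed size in bytes, roughly 1.8 MB given
    # deflate's ~1030:1 ceiling
    head -c 1877302224 /dev/zero | gzip -9 | wc -c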


I believe zip bombs are pretty easily mitigated, with memory limits, cpu time ulimits, etc.

But zip bombs aren't limited to zip files... lots of files have some compression in them. You can make malicious PNGs that do the same thing. Probably tonnes of other files.


Good points. According to OP, the goal would be to persuade FB to back off crawling at 300 qps, not to bring down the FB crawl. Having each request expand to 1GB is probably already enough to do that, since a crawler implementing the mitigations you mentioned will likely reduce the crawl rate. Should be easily testable.


The mighty power of IPV6 :P


if only they could have convinced IANA to give them a vanity prefix.


FB has 2a00::/12 (and probably other blocks?) and can make something that looks like its company name from [0-9a-f]. Why wouldn't it do something harmless and fun like this? It's not as if it requires special dispensation or is breaking any rules.


I don't think the person was implying it's bad. Just pointing out an easter egg.


Ah, well if that’s the case then I apologise. My bad!


A single company does not get a /12 prefix. 2a00::/12 is almost half of the space currently allocated to all of RIPE NCC. Facebook seems to have 2a03:2880::/29 out of that /12, and a /40 through ARIN (2620:0:1c00::/40)


Ugh right you are. I misread the whois output and didn't stop to consider how realistic a /12 was. This is embarrassing.


ugh why is ipv6 impossible to understand :/


Is the notation the problem here? IPs have 16 bytes now, so the notation is a bit more compact: hex instead of decimal numbers, and :: is a shortcut meaning "replace with as many zeros as necessary to fill the address to 16 bytes"
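For example, one of the addresses quoted above expands like this:

    2a03:2880:22ff:3::face:b00c
    = 2a03:2880:22ff:0003:0000:0000:face:b00c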


It's not. Think IPv4 but bigger.




It bugs me that they require a login to see a bug report, especially in the case of this bug where those affected aren't necessarily facebook users.


I've stumbled on that too.


Reading through the article and comments here I find it odd that a company like FB can't get a crawler to respect robots.txt or 429 status codes.

Even I would stop and think "maybe I should put in some guards against large amounts of traffic" when writing a crawler, and I'm certainly not one of those brilliant minds who manage to pass their interview process.


If the dev respected others’ boundaries and norms of expected behavior, they would probably work somewhere else.


It's an interesting thing to think about, that someone can make a request to your website and you pay for it.

This is not how the mail works, where someone needs to buy a stamp to spam you.


My office still gets spam faxes to this day. The paper and toner only add up to a few cents a month, so it's not worth doing anything about.

I knew a realtor that had a sheet of black paper with a few choice expletives written on it that they would send back to spammers. There was an art to taping it into a loop so it would continuously feed. This was a few decades ago, when the spam faxes could cost more than a stamp.


My desk phone at an old job used to get dialed by a fax machine. Not fun picking that up. I redirected it to a virtual fax line and it turns out it was a local clinic faxing medical records. I faxed them back with some message about you have the wrong number but they never stopped.


If you really wanted to make them stop, you might find a contact for their lawyer and make HIPAA noises at them ;)


I considered it. That was after my time in hospital IT. Really I wanted to stop getting my eardrums blown out by a robot. Luckily I happened to be working at a phone company so changing my number just took a couple clicks.


My dad used to receive a lot of misdirected faxes.

His solution was to send a return fax with disorderly handwriting begging and pleading for them to fax someone else.

The other person stopped faxing him.


I was using a virtual fax service so a digital equivalent would have been easy. Not sure if I was creative enough to invert the colors before sending my response. Really I just wanted the noise to stop.


It’s because we as an industry decided to give in and submit to the cloud providers’ bullshit model of paying overpriced amounts for bandwidth while good old bare-metal providers still offer unmetered bandwidth for very reasonable prices.


The question isn't about pricing. The fact that you pay for your own bandwidth and metal is a problem.


Paying for your bandwidth and metal isn't a problem per-se as long as prices are reasonable (which they are with the old-school bare-metal providers). After all, those services do cost money to provide.

The problem happens when prices are extortionate (or the pricing model is predatory, ie pay per MB transferred instead of a flat rate per 1Gbps link) and relatively minor traffic translates to a major bill.

This particular issue is about 80 requests/second which is a drop in the bucket on a 1Gbps link. It would be basically unnoticeable if it wasn't for the cloud providers nickel & diming their customers.


You're still not getting it. I'm not arguing about whether it should be free or how much it should cost. I'm saying that when you have to pay for someone else's request, it's ripe for abuse. Instead, a system where the requester pays would be similar to how postage works in the US. Although I think it's highly unlikely that would make sense in a digital world.


Most of my junk mail does not have stamps because the spammers drop it off in pallets at the post office and just have it delivered to everyone.

The USPS is like a reverse garbage service.


This is a common pattern in (blocking i/o versus non-blocking i/o) and (sync task versus async task).

The typical analogy I hear for DDoS attacks is if you own a business and a giant group of protesters drives traffic to your store, but no one is interested in buying. They create a backlog to enter the store, they crowd the aisles, they pick up your inventory, and the lines and crowds scare off your legitimate visitors/customers.


This reminds me of a case years ago when I had a small-ish video file on my server that would cause Google Bot to just keep fetching it in a loop. It wasted a huge amount of bandwidth for me since I only noticed it in a report at the end of the month. I had no way of figuring out what was wrong with it, so I just deleted the file and the bot went away.


Was it a real FB crawler or a random buggy homemade bot masquerading as one?

Check here: https://developers.facebook.com/docs/sharing/webmasters/craw...
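You can also check the source IPs against the ranges Facebook announces from AS32934; a sketch using the common whois client:

    # list the routes registered to AS32934 (Facebook), then check whether
    # the crawler's source IP falls inside one of them
    whois -h whois.radb.net -- '-i origin AS32934' | grep route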



Someone up to no good using a false flag would not surprise me in the slightest.

"who do people already like to hate? I'll pretend to be them."



IP packets have a source field, which can be fake and not their actual IP. That's a Facebook IP, but the packet might not have actually come from it.


If they made actual requests that went through, a connection got established. That won't happen with a faked source.


Ah okay, thanks for helping me learn :)


Could be the IP address OP is showing is actually from an X-Forwarded header or other proxy header (but not the actual source of the packet).


UDP packets can fake their source IP. TCP packets realistically can't.


More precisely, spoofing a TCP handshake is a problem that we know how to prevent. Odds are pretty good that your kernel has these protections enabled by default, but it's not guaranteed and you should check as a matter of due diligence.


Why is it so hard for people here to believe that Facebook could have a crawler misbehaving? It is all over this thread, even though multiple people are saying they saw the same thing and confirmed the IP belonged to FB.

What gives?


If you can verify it's actually coming from Facebook's IP address (as many other comments here suggest), you should absolutely get in touch with them, though I'm not sure how. Perhaps their security team. That's a very serious bug that they'd be glad to catch -- if it's affecting you it's surely affecting others.

Otherwise it's a bot/malware/etc. spoofing Facebook and gone wrong, which sucks. And yeah just block it by UA, and hopefully eventually it goes away.


From the sounds of it the difference between Facebook's actual crawler and malware could be hard to define or differentiate, given the circumstances.


It was from their IPs. I tried to figure out how to get in touch, without success.


This is one of the better HN threads in a while. Starting from an easily described problem, it goes into a number of interesting directions including how FB seems to DDOS certain sites for reasons unknown, defenses against such attacks and DDOS in general, GPL licensing, how to collapse HN conversations, etc.

Not a profound comment. But it was really fun to follow the path down the various rabbit holes. This is why I like HN.


Interesting. According to the Facebook Crawler link provided by the OP [1], it makes a range request of compressed resources. I wonder if the response from the OP’s PHP/SQLite app isn’t triggering a crawler bug.

Maybe Cloudflare can step up and act as an intermediary to isolate the problem. Isolating DoS attacks is one of their comparative advantages.

[1] https://developers.facebook.com/docs/sharing/webmasters/craw...


It would be nice to understand how you came to the conclusion that this was a Facebook bot.



You can check the list of IP addresses?


The OP did not (explicitly) check, but you can check whether the IP falls into a range allocated to Facebook, e.g. https://ipinfo.io/AS32934


Of course I've checked. Here is a recent example. https://news.ycombinator.com/item?id=23491455


Make sure you file a bug - there are a myriad of sources internally, but we can often hunt it down easily enough (assuming it gets triaged to eng). Important info is the host of the urls being crawled, and the User Agent attached to the requests (headers are also good too). A timeseries graph of the hits (with date & timezone specified) can also help.


Why is this the owner's problem? Someone at Facebook should be filing the bug, or better yet instrumenting their systems so that incidents like this issue a wake-up-an-engineer alert.

Fuck this culture of "it's up to the victim of our fuckup to file a bug report with us".


Not just “file a bug report” but also compile a time series graph (lol) and then pray that Facebook triages it correctly, which they have no incentive to do.


And to make things even more ridiculous, you need to sign in with a facebook account to even file a bug there.


100%



Did you check the IP and not just the UA? Does it match the one from FB?



Surprise, Facebook not caring about playing nice with others!

Anyway, what reason would they have for so many requests? Why would they need to do 300 per second?


We had that with Google in the past. We put a KML database on a web server and made the Google Maps API read it. Google went to fetch it without any throttling (my guess is that they primed their edge servers straight from us); the server went down so fast we thought there had been a hardware failure. We ended up putting a CDN in place just for Google's servers.


What would happen if you created a "rabbithole" for webcrawlers? Would they fall for it?


Modern web crawlers usually have mechanisms against rabbit holes; e.g. for Apache Nutch you can configure that it should only follow links to a certain depth from a root, only follow a redirect a few times, and so on. It's always a trade-off though. If you cut off too aggressively you don't get everything; if you cut off too late you waste resources.


Maybe someone at Facebook is testing a project that didn't work as expected?


I wish we had better tools for things like this.

What I've heard happen is someone puts a governor on the system, and much later your new big customer starts having errors because your aggregate traffic exceeds capacity planning you did 2 years ago and forgot about.


Are you sure it's a crawler and not a proxy of some sort? Eg. one of your links is on something high traffic in facebook, and all requests are human, running through fb machines.


Nah, it's a specific user-agent for their crawler


Facebook caches URLs pretty aggressively (often you need to explicitly purge the cache to pick up updates) so I don't quite understand exactly what happened here. The article is very light on details. Do you have 7 million unique URLs that have been shared on Facebook? Or is it the same URL being scraped over and over?

Are you correctly sending caching headers?


When I enabled IPv6 at home and logged default denies, I constantly saw Facebook-owned IPv6 blocks trying to reach addresses browsing Facebook.

I have no idea what it's trying to do, but it's legitimately the only inbound IPv6 traffic that isn't related to normal browsing.


I’m a bit confused. Facebook is a social media company; why do they send out crawlers?


Facebook shows previews of pretty much everything you share a link to, so at a minimum they have to fetch the page/image to build those previews.


I see, thanks for the explanation.


Why would a crawler want to crawl the same page every second (let alone 71 times a second). Even if I were Google, wouldn't crawling any page every few mins (or say every minute) be way more than enough for freshness?


There was a very old post: https://news.ycombinator.com/item?id=7649025, lots of ways to abuse Facebook network.


> I own a little website I use for some SEO experiments [...] can generate thousands of different pages.

Do you have more details on this? One could read that as if the page is designed to attract crawlers.


Is it worth duplicating the FB-specific parts of your website at another web address as a test site, so that if it were similarly affected, you could provide the address info to help troubleshoot?


Shouldn't Cloudflare have considered this a DDoS attack?


Hate to say but this is probably a FB engineer running a test/experiment and not a production crawler taking robots.txt etc into account.


Or malware infecting some FB-internal machine?


Doubt it. If you're a bad enough dude to get control over Facebook's internal infrastructure, I doubt you'd blow that access by using it to spam random sites.


Detect the crawler and delay the responses as long as you can. Have it use as many resources as you can; that should be interesting.


> And then they’ll probably ask you to pay the bill

Can anyone comment on whether there would be any legal basis for this?


Thanks for sharing. I’d like to hear more about the website described here, it sounds very interesting.


I can add I work in SEO (tech side), so the website is "super optimized", but it's not rocket science.


Super optimized? The site takes longer to load than Hacker News for me.


They explicitly didn't mention the site this affected. It's not the one linked in the OP.


Super optimized with Facebook button on each post then. Not my definition of 'super optimized', but you know.


Super optimized in a SEO sense, not in a pleasing the HN crowd kind of sense I'd assume ;)


You're right, but for some reason the whole SEO thing just winds me up. It's my opinion that 'good' SEO makes sites worse for actual people to use.


SEO is for machines, not for users IMHO.


Still affects users.

Potential Positive: page speed, https

Negative: All blog posts with the same length, keywords dropped in every paragraph


Can't share, sorry :)


Dude something is super sketchy


Nah


I wonder what the crawler does that requires so many requests per day to the same site...


Title change? A stupid Facebook crawler was making 7M requests per day to my website


Nice one. ;)


We had a similar experience with Yandex and had to block it. The crawler started crawling a larger e-shop by trying out every single possible filter combination. In other words, we got a DDoS attack from Yandex.


Then your website isn't stupid, but rather the FB crawler is.


Do you have some sort of fb plugin in your SEO site?


Why does facebook have crawlers in the first place?


Their 2FA page was broken yesterday...it goes on


Unrelated, but has anyone written a Chrome/Firefox extension to browse the web sending out Googlebot or Facebook user agent?

I wonder if you can bypass paywalls or see things that aren't generally presented to regular users


I surfed through a Google network for a bit a few years ago, and it does reveal things. The most common was Drupal and WordPress sites with some malicious code running that would link to online pharmacies on every page, but only to Google user agents / networks, I assume to boost their ranking. I let a bunch of webmasters know, most of them didn't believe me because they didn't see it themselves.


You wouldn't even need an extension. Both browsers allow you to change your User-agent. An extension just makes it a little easier.

I have a UA switcher on Firefox so I can change my UA in two clicks. One of the built-in UA options is indeed the Googlebot.
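For a quick one-off check you don't even need the browser; something like this works from the command line (the UA is Google's published Googlebot string, the URL is just a placeholder):

    curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/paywalled-article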



You chose to mention Safari, but link to an article about Chrome, Firefox, and Safari? Looks like all major browsers (probably Edge too) have this built in.



Someone could use this for DDOS or Coinhive


I think sending back a `429 Too Many Requests` can solve this. Unless you want to block their IPs completely, which I doubt you'd want to do.


A well-made crawler that supports 429 codes isn't likely to have this problem in the first place.


> HTTP 429 -> ignored

It's in the article


I tried. It was ignored.


It says in the post he did this already


Did you reach out to the fb contact?


Bit rot? One bit off anywhere in a DNS request could slam someone.


check these hits for an X-Purpose header or similar


How do you monetize a robot?


Establish contractual damages in a ToS for the site. Prove violation and offender. Take to court and collect damages.

Converting the effort into cash is tough, but the strategy exists.

Project HoneyPot is an API which allows any website to do this for honeypot email addresses which are injected into the website, along with a ToS which says:

> By continuing to access the Website, You acknowledge and agree that each email address the Website contains has a value not less than US $50 derived from their relative secrecy.[1]

[1] https://www.projecthoneypot.org/terms_of_use.php


Has that ever worked? I can't find any record of judgments one way or the other, on their website or elsewhere.


I know ProjectHoneyPot was pushing a $x billion litigation against a spammer. I don't remember if/how that was resolved.

This guy[1] apparently spent years just suing email spammers and occasionally winning.

[1] https://www.danhatesspam.com/index.html


I also found the $1 billion lawsuit (against a bunch, not just one spammer, I believe), and could also not find any sort of resolution - not in legal docs, news pages, or project honeypot itself.


That's a good question. Even better if Facebook can pay me some money :P


Don't know whether the Facebook bots execute JavaScript - https://www.rdegges.com/2017/how-to-monetize-your-website-wi...


If it runs javascript, get it to mine bitcoins for you?


Hi Napolux,

It looks like your site is using a theme based on my website (https://ruudvanasseldonk.com/, source at https://github.com/ruuda/blog). That is fine — it is open source after all, licensed under the GPLv3. But I can’t find the source code for your site, and I can’t find any prominent notices saying that you modified my source. Could you please add those?


Sure man, no problem. The code for my theme is here BTW, with credits to your original blog:

https://github.com/napolux/coding.napolux.com/


Thanks, I’m flattered to see it be used as inspiration :)

I searched quickly but I didn’t find that repository. You might want to link it somewhere in your footer or from a comment in the html.


Sure, no problem


Flattered that your few, boring lines of CSS and HTML (that could easily be reproduced by a monkey) was used on someone else's website?


We've banned this account for breaking the site guidelines.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future.

https://news.ycombinator.com/newsguidelines.html


Toxic comments have no place on HN. Thank you for policing them so diligently.


Sounds like yes, they are flattered for exactly that. And it's totally ok.

There exist small joys in life, like being flattered for something that other people might not think is much of a big deal.

I guess that's one reason why something may be called flattering in the first place, if they didn't see it as being all that much in their own eyes, but somebody else appreciates it.


And it's a beautiful lightweight theme.


That isn't how you hacker news.


I'm inexperienced in the technicalities of this: is "taking inspiration" from someone else's website grounds for a copyright or GPL violation?

When I've worked with designers it usually starts with "pick a site that inspires you", and the deliverable bears some resemblance.


The distinction can be difficult to make, but I think it is relatively clear here.

I'm "reading between the lines" here so could be wrong, but "looks like your site is using a theme based on" implies to me that there is enough code (markup, styles, perhaps script) similarly that is enough to not be pure coincidence. That suggests using some of the code not just being inspired to produce a clean-room design, so the GPL is relevant.


My understanding is that unless he's distributing it (not just serving it from a website) he doesn't need to release his changes. That's what the AGPL is for.


It's not the copyright on the server code that is at issue, but the copyright on the HTML and CSS files (and portions thereof) that get distributed by the server.


You see the contradiction, right? If it's about the HTML and the CSS the user has that code directly by visiting the site. No further action would be needed.

If it's about something that creates the HTML and CSS, then OP has no requirement under the GPL just from people visiting the site and accessing the output CSS/HTML, because that's not distributing the code as defined by the GPL. The original author should have used the AGPL for that case - and if it's really just about HTML/CSS, a different license altogether.


> If it's about the HTML and the CSS the user has that code directly by visiting the site. No further action would be needed.

Wrong. Transmitting modified GPL works requires more than just distributing the source:

    a) The work must carry prominent notices stating that you modified it, and giving a relevant date.  


    b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to “keep intact all notices”.   


    c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.  


    d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so.  


  
https://www.gnu.org/licenses/gpl-3.0.html


First, that is for Conveying Modified Source Versions. See the definition of conveying:

> To “convey” a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying.

Generally the common position here is to use the AGPL if you want to cover regular network access. Probably all a bit murky because the GPL does not really fit well to things like this. But mostly it does not apply here, pretty specifically by definition.

Maybe you'd have a point regardless if the source was directly the HTML and CSS. And it seems like I was wrong about the modification notice (but I wasn't thinking about modifications in particular). But it's not HTML and CSS directly. Having looked at the source in question now, the transmitted HTML and CSS is evidently not the source code, as both are produced by template files.


It is my understanding that style-related code is effectively not copyrightable in the US. Is my knowledge out of date?


NAL, but the resulting style isn't, as there are many ways to generate the same style, but the specific code is.


The actual GPL uses the word "convey", not distribute or serve. [1] (section 5). Merriam-Webster defines "convey" as "to transfer or deliver". Pretty hard to argue that a web server does not transfer or deliver HTML + CSS.

[1] https://www.gnu.org/licenses/gpl-3.0.html


Next time, have some courtesy for the author and the rest of us by requesting via personal exchange over email instead of hijacking the thread and distracting from the conversation.


Boo hiss.

There’s plenty of room here for a polite exchange between professionals.

Collapse the thread and move on.


Come on. I love Ruud's posts and I took inspiration for my blog. He was right asking for a link, which I gladly provided.

That's it for me.


Sorry, I think you might have misinterpreted my comment?


It seems he intended to reply to the parent comment.


Yes, sorry. I was from mobile ;)


Ah, yes, I see that now. Thanks for clearing that up.


The question is why everyone needs to read something that really could have been handled first as a private communication.


Nobody is forcing you to read anything.

Though I will concede: reading is involuntary to literate adults.

Having said that, there’s currently, what? 28 comments in this thread.

Is that indicative enough that at least some people have derived value from it?


If it were a financial thing I'd agree but if you aren't complying with the GPL I think it's good to set an example (but not shaming)


Whoa there. The GPLv3 only requires the GPLv3 declaration to be visible; to be compliant, source code can be provided on demand. There's no need to have a source code link in the object as long as a contact is available and the code is provided at a reasonable cost.


> ... and the rest of us ...

Please don't try to police the thread and speak for yourself only.


“And the rest of us, excluding Dahoon”


You can add me to that list.


They look similar at a glance (border-top + Calluna font), so he might have taken inspiration from yours, but doesn't seem to have used any of your assets - the styles are clearly different and based on the WP 'BlankSlate' theme.

(my personal blog had a top border like that a decade ago, when styles on the body were a novelty :))


I took inspiration, but I didn't know he was sharing the code for his blog:

https://github.com/napolux/coding.napolux.com/


I checked the source and it does indeed mention https://wordpress.org/themes/blankslate/, but judging from the screenshot there, that theme is just really a blank theme with no style at all, and it was used to include a different stylesheet.

The real style is at https://coding.napolux.com/wp-content/themes/coding.napolux...., which looks like normalize.css followed by a Wordpress adaptation of my stylesheet.

The similarity is more than superficial, the footer headers match exactly.


Come on, this is super petty. And that includes your license choice and your enforcement for this.


I would back you up on this. This is super petty to be like that over a couple of lines of css.


So, just to be clear, you are perfectly fine with someone taking something someone else created and violating the terms upon which they were given that thing?

e.g. If I took code you wrote and lets say released under an MIT license and claimed I wrote it and didn't give you any credit, and in fact released it under another license entirely, you'd be fine with that?


It's perfectly valid to criticize the original license choice.

GPLv3 is a very restrictive license, especially for what is essentially a micro blog (though I dislike the license for most open source software anyway).

Add on the original author going after a bit of CSS, not even the main effort of the project in question, and you've got my "petty" comment.


There is an artist who takes images from magazines, repurposes them for his own art, and sells them for hundreds of thousands of dollars, then gets sued by the magazines & photographers & artists, and guess what, HE WINS AGAINST THOSE LAWSUITS. Copyright is BS, especially concerning HTML & CSS code. What a joke.


You didn't answer my question. That pretty much says all that needs to be said.


My answer is less useful to the discussion. But here you go:

1. I wouldn't use GPLv3

2. I wouldn't care if people stole my code that I open sourced or if they tried to license it a different way.

3. I personally follow the license of others when using their code. I wouldn't steal GPLv3 code without proper attribution etc. That's their right.

All that doesn't go against my initial opinion: GPLv3 for a small micro blog templating system is lame. Enforcing it for a bit of CSS is petty.


> All that doesn't go against my initial opinion: GPLv3 for a small micro blog templating system is lame. Enforcing it for a bit of CSS is petty.

It is a matter of principle, probably?


Not the parent, but your question and the implicit accusation are way too overblown. The design here is so generic that it can easily be used by tons of sites out there. It's just a few lines of CSS here and there, and if I had a design like that and came across something similar I would just chalk it up to someone with similar taste. Going out of your way to demand attribution for it is the very definition of petty.


> e.g. If I took code you wrote and lets say released under an MIT license and claimed I wrote it and didn't give you any credit, and in fact released it under another license entirely, you'd be fine with that?

If I released it on Github, under any license whatever? I’d more or less be expecting that.

If it was about the 4hr of work that went into my blog theme, I wouldn’t be bothered at all.

But then, I wouldn’t release anything like that under the GPL.


So, you are okay with people violating other peoples licenses and ignoring copyright. Gotcha.


You can simultaneously follow other people's licenses to the letter while also not caring if other people don't follow yours.


I just think it’s naive to assume they won’t.

It’s a bit like putting a solid gold bar on your lawn and putting a sign next to it saying ‘please don’t take, this is mine’.


Hello there. I've added a link to your website in the footer! :)

https://coding.napolux.com/

Thanks for reaching out!


Awesome, thank you! Sorry for derailing your thread, I did not mean for my comment to escalate like it did.


Despite how it sounds, I ask this with zero judgment and pure curiosity.

Why do you care?


The author probably cares about either copy-left or the right to tinker. Some people believe it's important and others don't.

Basically any derivative product out of GPL3 code needs to either be open-source/copy-left, or if it uses the code as a library needs to let end users substitute that library for their own version.


Worse than unimportant, I find copyleft harmful to the developer community as a whole. Software is hardly "free" when you have to bend to the author's demands and license the entire rest of your project under their ideology if you want to use it, and I think we should stop calling it so. Maybe "Conditionally Free Software" instead. You're hardly helping the world by releasing libraries under terms that nobody but hobbyists (who are going to rip your code out and replace it if they start making a proprietary product, which is not evil) will find acceptable. It works for the Linux kernel and self-contained applications, but that's it.


Copyleft is about maximizing the users' rights, not the developers' rights. Companies can't take Linux, put it on a router for sale, and then say that their customers/users aren't allowed to know what's going on on the box in terms of backdoors and spying. The users have a right to look at the source code if they wish.


It may focus on user's rights, but it still requires technical expertise to exercise half of the 4 freedoms: #1 (inspection & modification) and #3 (distributing your modifications). Merely knowing to ask "can I see the source code" I would put into the "technical user" realm. Overwhelmingly, most people don't know or don't care.

The other two freedoms #0 (freedom to run) and #2 (freedom to share) are readily obvious to non-technical users: "double click to run" and "drag and drop <onto an external drive> to copy". Unfortunately, they are also often permitted by non-free/libre software. So non-technical users can readily see they can do #0 and #2, but generally have no litmus test to further determine whether the software is "free/libre".

This is a case where the ideology's practical concerns hamper its purity. I critique despite generally liking the FLOSS ideal, but it's important to know its flaws.


The users don't have to be technical to benefit from those freedoms though - there's a level of indirection involved. I might not have personally scrutinized every line of the Linux kernel, but knowing that there are tons of people in the world with the ability and motivation to do that inspires confidence.


I am not arguing that one has to be technical in order to benefit from those freedoms. I like FLOSS and agree users in general benefit.

I am saying that your example, for instance, still falls under the "how does a non-technical user simply verify the software they just downloaded respects their freedoms?" question, which is a very real educational and cultural problem. For example, see the massive money and numerous gun ranges, gun stores, gun clubs, and other gun-associated organizations in the US that work to educate the "unskilled" general public on "how to be aware, recognize, and exercise their rights and freedoms" under the 2nd Amendment while respecting local laws. Folks generally are 1) aware they have the right and 2) have a low-friction, no-special-technical-skill path to exercising that right. The FLOSS movement is nowhere near that level of educating and making non-technical users aware of their freedoms, their digital rights, and how to then act upon and exercise them. Folks generally are 1) unaware of libre software and 2) don't have a low-friction, no-special-technical-skill path to exercising that right.


They can take Linux and put it behind a locked bootloader that makes it impossible to replace with your own kernel. Sure you can have the modified sources but you can't do anything with it. Hence GPL3.


Users don't care about any of that, but developers are certainly hindered by restrictive licenses, which in turn hurts users.

I never found a right to see source code compelling as a real right. To read the assembly and modify something they bought, sure, but not an entitlement to the source.


Nobody is forcing you to use GPL code. It is stinking of entitlement to demand that you get to use other people's code regardless of what they think about it.


I don't feel entitled to it, I just consider the prevalence of copyleft bad for the developer community. License however you want; criticizing it doesn't make me entitled.

What might be stinking of entitlement is the idea users have an inalienable right to source code.


What about the prevalence of proprietary software? You can’t use that for anything without making a contract with the author either.


Users and developers are not different beasts. Users with source code are more likely to become developers. Developers are more likely to become users if they can tinker.


You sound very bitter that a developer who lets others use their code gets to pick their license of choice. Should everything be locked down like Microsoft Windows code or an Apple phone? If what you want is for everything to be completely free from licenses instead, just code your own version and release it as freeware. If you only complain and don't, then you are just being hypocritical.

>You're hardly helping the world releasing libraries in terms that nobody but hobbyists (who are going to rip your code out and replace it if they start making a proprietary product, which is not evil) will find acceptable

Now you are just trolling.


I'm not bitter, just disappointed in all the wasted developer time that happens because people get caught up in these copyleft ideas. Of course you can pick whatever license you want but that doesn't mean it can't be criticized.

> If you only complain and don't then you are just being hypocritical.

You don't need to be an architect to complain about crumbling bridges, but indeed I have released software under more free licenses.


>wasted developer time

Only if you assume copyleft code would otherwise be available for developers. The other alternative is people who don't want others profiting from their free code just don't release their code.


Well, it's easy to be disappointed in all the developer time wasted because of copyleft. It's transparent.

It's harder not only to be disappointed by, but even to notice, all the developer time wasted because of closed-source code, since it's obscured how you as a developer could have benefited from that code.


I made my site open source so others who like it can take a look at how it works, or give it their own twist. Visitors of my site can see that they have that freedom, but visitors of an adaptation might not know if it’s not stated anywhere.


You're not wrong to request attribution. I don't know why you're getting so much flak.


Lots of people don't like the GPL (I love it myself, for reasons stated in the GP), and seeing it enforced in real time brings that debate up again.


Well, I respect the question of GP, if nothing else then for principle. People should get their shit together and start respecting the license. The state today is horrendous and the culture is evolving into less and less respect for licenses. If people don't follow them personally, they won't care to follow them professionally either.


I can only guess at motivations, but holistically people need to enforce their license terms or those terms lose their teeth.


Cause it’s his special.


Please read https://sfconservancy.org/copyleft-compliance/principles.htm.... My understanding is that if you do not enforce your copyright (or copyleft in this case) you can lose the copyright.


Whilst you can lose a trademark for not enforcing it, you cannot lose a copyright by not enforcing it (in the United States).


You are confusing this with trademark; that's a completely different thing.


Kind of an odd request for an open source wordpress blog theme imo


Design layman here, switching between the two sites the top bar looks similar (it's just a solid line) except a different color and slightly different height. I don't see other similarities or assets coming from your site? Can you explain this request; I'm not sure two sites both using a blank horizontal rule across the top is something I would personally call out as similar.


It appears to be a serif font and a border top. Is that all it is? Am I missing something?


GPL license states you must provide source code to the end-user. In this case, the end-user is legally Napolux, not the visitors to his site.


In any case, View -> Source

The source code of a public website is automatically provided to everyone who visits it. This fact is unfortunately not well-known.


That's... not how webpages work. This is why the AGPL even exists.


Your blog is awesome! I'm really glad you commented here, or I wouldn't have found it. I even forwarded your most recent post to some friends.


I'm new to web admin work in the cloud, and my website is getting hammered by Baidu, Google, FB, and others, causing traffic I/O costs to increase.

What's an AWS load balancer way of blocking this traffic? Again, noob here. Thanks.


Don't host this stuff on AWS if you care about cost.


So you don't know how to do this in AWS is what I'm hearing?


My point is, that you don't want to do this inside of a load balancer there. I don't recall any traffic filtering abilities that would suffice, but I'm not fully up-to-date with the configurability either. If the load balancer supports it, a short search in the net or docs should surface an easily-applicable guide, and if not, I'd probably put that blocking closer to my app server.

And the reason against AWS for this would be both the general cost (AWS is not cost-efficient in many cases, unless you have complicated infrastructure that takes a lot of management, where the Infrastructure-as-code approach can give you a sizable benefit), the bandwidth cost in particular, and the lack of configurability of their services to e.g. apply suitable tar-pitting against such crawlers.
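
As a sketch of the "closer to my app server" option, something like this WSGI middleware could work if the crawlers do identify themselves in the User-Agent (the substrings below are just examples; as other comments note, the UA can be spoofed or omitted, so IP-based checks are more reliable):

    # Reject requests whose User-Agent contains a known crawler substring.
    BLOCKED_UA_SUBSTRINGS = ("facebookexternalhit", "Baiduspider")

    class CrawlerBlocker:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(s.lower() in ua for s in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return self.app(environ, start_response)

    # usage with e.g. Flask: app.wsgi_app = CrawlerBlocker(app.wsgi_app)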


[flagged]


Ask me anything. That specific website is a personal project of mine which I'd like not to disclose.

As you can see from my blog I'm not selling anything, I don't even have a banner on my blog, so where is your suspicion coming from?


Sorry, I was mostly joking. I definitely do believe Facebook could be responsible for this.


No problem :) I’m not the only one suffering this as you can see from the thread


> Am I just getting too suspicious?

Yes, I think so. What's the conspiracy supposed to be here?


Isn't this what robots.txt is for? Can't you block Facebook?


The article mentions that they tried that and the crawler ignored it.


As the article clearly states: no


whoops, my laziness. sorry


Why does FB need a crawler? Their users provide the content for their site. Is there an FB web search engine?


When you paste in a URL to share it with your friends, Facebook tries to grab some information from that webpage to provide a summary.
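
Roughly speaking, the scraper looks for Open Graph meta tags on the page. A simplified sketch of the idea (using requests and BeautifulSoup; the real crawler handles many more fallbacks and edge cases):

    # Fetch a page and pull out Open Graph title/description/image tags.
    import requests
    from bs4 import BeautifulSoup

    def preview(url):
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        data = {}
        for prop in ("og:title", "og:description", "og:image"):
            tag = soup.find("meta", property=prop)
            if tag and tag.get("content"):
                data[prop] = tag["content"]
        return data

    print(preview("https://example.com"))  # hypothetical URL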


What happens if that link performs an action upon a GET request?

Edit: Folks, I agree with you all, but I've seen a lot of garbage out there. Just asking the question for the discussion.


Then the action needs to be harmless, because GET is defined as not merely idempotent but also safe.

Didn't we already learn this lesson after all the unsafe-GET problems unveiled when prefetching browser accelerators came on the scene in, IIRC, the late 1990s?


> Then the action needs to be harmless to repeat, because GET is defined as idempotent.

No, this is a terrible response. The action needs to be harmless to execute every time, not just every time after the first time.

HTTP DELETE is conceptually idempotent, but you don't want to be deleting stuff with GET requests. That's why the standard provides a DELETE method! The distinction that really matters is safe/unsafe, not idempotent/unique.

(Do you need to use DELETE for deleting stuff? No, POST is fine.)
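
To make the safe/unsafe distinction concrete, here's a minimal Flask sketch (routes and names are hypothetical): the GET route only reads state, and the destructive action sits behind POST so crawlers and prefetchers can't trigger it:

    from flask import Flask, abort

    app = Flask(__name__)
    ITEMS = {1: "first post"}  # stand-in for a real datastore

    @app.get("/items/<int:item_id>")            # safe: only reads state
    def show_item(item_id):
        if item_id not in ITEMS:
            abort(404)
        return ITEMS[item_id]

    @app.post("/items/<int:item_id>/delete")    # unsafe action: POST, never GET
    def delete_item(item_id):
        ITEMS.pop(item_id, None)                # idempotent, but still not safe
        return "", 204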


> The action needs to be harmless to execute every time, not just every time after the first time.

You are correct, and the grandparent post has been updated appropriately.


> No, this is a terrible response.

As an aside, this kind of hyperbole really gets under my skin. It wasn't a terrible response. That statement is already technically correct: GET is idempotent, and the definition of idempotency is that it is harmless to repeat.

Your gripe is that OP didn't mention that GET is not only idempotent but must also be "safe"; i.e. that it should not alter the resource. OP got it 50% correct.

Does that omission make his comment a "terrible response"? No -- just incomplete.


Yes, it was a terrible response. Here are some examples of idempotent requests:

- Change the email address registered to my account from owner@gmail.com to new_owner@136.com .

- Instead of sending my direct deposit to account XXXX XXXX at Bank of America, from now on, send it to account YYYY YYYY at Wells Fargo.

- Delete my account.

- Drop the database.

None of these have any business being available to GET requests. Objecting to a misconfigured endpoint on the grounds that the functionality it implements is not idempotent implies that the lack of idempotence is what was wrong. That's a bad thing to do - anyone who takes your lesson to heart is still going to screw themselves over, because you gave them terrible advice. They may do it more than they otherwise would have, because you gave them advice that directly endorses really bad ideas. Idempotence or the lack thereof is beside the point.

Messing up on endpoint idempotence means you might hurt the feelings of a document. Messing up on endpoint safety means you might lose all your data as soon as anyone else links to your homepage. Or worse.


Most of the time, GET should not affect system state at all (other than caching, etc). Idempotent can still have effects on the first access. Even deletion can be idempotent.

I generally see PUT and PATCH classified as idempotent (for the same input).


Then like literally every search engine and website crawler ever does, the action will be performed.

Which is why it's bad practice to design your website using GET for actions. That's what POST is for.

I mean, using GET for actions will break so many things -- browser prefetching, link previews, the list is endless. If you use GET for actions, just... yikes.


> browser prefetching, link previews

Both of them are pretty bad. I do not see GET actions breaking anything that is not trash.


that's your own problem because you are going against the HTTP protocol standard. GET should be idempotent.


Going against the HTTP standard isn't a problem. For example, it's a good practice to ignore HEAD requests as opposed to responding appropriately.

The problem with unsafe GET is that it conflicts with reality, not that it conflicts with the standard.


> Going against the HTTP standard isn't a problem. For example, it's a good practice to ignore HEAD requests as opposed to responding appropriately.

That's only against the standard if you advertise HEAD as a supported method on the resource, which converts it from a good idea in some circumstances to a bad one - so if there is a good example to support your claim, that isn't it.


The most classic example is the JWT specification, which says you need to honor the encryption algorithm defined by the token you receive. (JWT includes a "none" algorithm, making token forgery trivial when the parser implements the standard.)

It's known widely enough now that people have chosen to reinterpret the language of the standard in order to claim that their implementations are compliant -- after making the change specifically to bring themselves out of compliance.

(It's possible that it's been so long that the standard itself has been changed to accommodate this. But regardless, the point stands that standards compliance is not a virtue for its own sake. This wasn't a good idea back when everyone agreed that the standard required it, it's not a good idea now, and future bad ideas do not in general become good ideas by virtue of being specified in standards.)


I'm not sure what sort of action you mean, but Facebook doesn't fetch the page in real-time using the client's browser. Some sort of job scrapes the page HTML looking for metadata and then uses that metadata to populate the preview. If you want to play around with what it "sees", you can test using the sharing debugger: https://developers.facebook.com/tools/debug/


That "link" doesn't respect the GET semantics then.


Then it performs the action, what else would it do?


And to prevent some URLs from being posted.


This pattern needs to die.

Whenever I paste a link in any sort of messaging app - iMessages to Slack - it puts a thumbnail and summary, polluting the entire conversation with tons of noise.

Messaging apps have no business in looking up the URL. Just let it pass as a link.

Fuck everything about this and we need to push back on this nonsense.


I disagree completely, I find it extremely useful.

Most links are unreadable (e.g. an ID to a cloud file), and are truncated anyways even if readable.

The preview gives me the title of the webpage, which is infinitely more useful -- especially letting me know whether it's a document I've already seen (and don't need to open) or something new, and if so, what.


To me it pollutes the discussion away from the messages and puts a bunch of thumbnails where it could just be a clean text conversation.

The thing that bothers me is also that I, as a user, have no choice - it just does it automatically.

Do you like IRC?


I suspect that most apps have a way to turn this off. Slack certainly does. That doesn't stop other users from seeing previews of your links, if it's set up that way for them, but it does prevent you from seeing previews of others' (and your own) links.


WhatsApp allows you to delete the link box: enter the link, wait for the preview, hit backspace, and send. The hyperlink remains clickable too.


Slack does it as well


this is one of those "people like different things" situations and a config option should be added to support both.


Slack, and I suspect most other messaging apps, has such a config option.


A funny story (well, funny because it didn't happen to me) was chronicled on a recent episode of Corey Quinn's "Whiteboard Confessions" -- https://www.lastweekinaws.com/podcast/aws-morning-brief/whit... -- where Slackbot's auto URL unfurling feature "clicked" an SNS alert unsubscribe link when a tech posted a report including that link into a Slack channel. If nobody lost their job over this triggering a SEV-1 alert, I'm sure they had a laugh about it later.

Another issue with auto-unfurling links or generating previews of URLs is the potential for stalking. If I'm hunting my ex and I know their cell phone number AND that they have an iPhone I can simply send a link to a website and let iOS generate the preview... which then pings the URL of the website which I set up and I can now run geo-IP resolution and narrow down where my ex is (and maybe I follow up with a well-crafted PDF exploit when I'm ready to narrow down even more on an area).

While systemvoltage was a bit brusque in how the complaint was worded above, these "friendly and helpful" features of Slack, SMS, and other apps can have devastating unforeseen consequences. I question the value of having to opt out of these features instead of making them opt-in instead.


That's an interesting idea of a security breach.

In practice it's going to be difficult because:

1) You can't do effective geolocation on a cell phone IP address, except perhaps at a country level [1]

2) Your ex would need to open your message to trigger the link preview loading. If you're stalking them, they're probably either ignoring you or blocking you

3) At most, if the ex is connected to a Wi-Fi network like Starbucks and opens the message, you'll be able to get the city they're in, maybe.

But at the end of the day, what you'd get from link previews is no different from embedding a tracker image in an e-mail. And while it requires someone to suspect they're being stalked, blocking would prevent all of this.

[1] http://www.cs.yale.edu/homes/mahesh/papers/ephemera-imc09.pd...


As far as I know the iOS implementation mitigates that by generating the preview on the sender side. The receiver side will not make a request to the URL unless they manually click on it.


Same for Whatsapp and Wire.


haha, wow. This is some story. Thanks for sharing.


Part of the goal here is to prevent people from clicking on malicious links. A preview helps with that.

Or, more simply, you can't be rickrolled if you know the destination in advance.


The next level would be for the server to see who is requesting content. Facebook IP? "Here's some harmless HTML!". Browser IP? "Here's an executable pretending to be a harmless page!"


I like link previews. I don't like when they are managed server-side instead of client-side.


Caching a link's title and thumbnail server-side will save your site potentially millions of requests from Facebook and elsewhere.

Seems like a benefit for site owners to me.

Not to mention that the server is reducing the image size of a thumbnail, potentially converting from HTTP to HTTPS, and so on.


> Seems like a benefit for site owners to me.

Maybe, but it’s definitely less control for the site owner.

Maybe she wants to do analytics on the previews (how far shared, which ip regions) or delete the content at a specific time. Etc.


Also considering the sibling story, at least the server-side preview builder is definitely entirely unauthenticated to every possible service. A preview builder running on the client side might conceivably pick up your authentication to something in various circumstances.


The issue with doing them client side (other than the missed cache opportunity of using a server) is that it means you're having every user in the chat send a request to an arbitrary URL from their device. Also CORS makes this impossible 99% of the time.


The third approach, which Signal uses as a privacy preservation measure, is for the sender to generate the link preview on send. Of course, this then opens up abuse where the sender can arbitrarily control the "link preview" to say whatever they want.


I just thought of a malicious idea. If I were Amazon or some other cloud provider, I could hammer my customers' web sites with tons of requests to make them pay more for resource usage (network bandwidth, s3 calls, misc per/unit usage etc). It would be hard to trace as well. Wonder if people are already doing that today.


The amount of bad will that would generate when, not if, it came to light would dwarf the possible gains. So ethical concerns aside, it would be idiotic.

Sometimes that doesn't stop the pointy haired bosses, because sometimes they are the perfect storm of unethical and moronic.


This is actually a common tactic of malicious actors - pretending to be other bots to get through to websites. If you reverse-lookup the IP address you can see if it is part of the Facebook network.

UA is very untrustworthy
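
For example, a forward-confirmed reverse DNS check along these lines (which hostname suffixes to trust is an assumption you'd need to verify per crawler; the IP below is just an illustrative Googlebot-style address):

    # Reverse-resolve the client IP, check the hostname suffix, then
    # forward-resolve that hostname and confirm the IP matches.
    import socket

    def fcrdns(ip, trusted_suffixes):
        try:
            host, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
        except socket.herror:
            return False
        if not host.endswith(tuple(trusted_suffixes)):
            return False
        try:
            _, _, addrs = socket.gethostbyname_ex(host)  # forward lookup
        except socket.gaierror:
            return False
        return ip in addrs

    print(fcrdns("66.249.66.1", [".googlebot.com"]))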





