What you can do, however, is make it hard enough that the vast majority of developers can't do it (e.g. my tech crawled billions of pages, but there was a whole team dedicated to keeping it going). If you have money to spend, Distil Networks or Incapsula have good solutions. They block PhantomJS and browsers driven by Selenium, and they rate limit the bots.
What I found really effective, and what some websites do, is to tarpit bots. That is, slowly increase the number of seconds it takes to return the HTTP response, so after a certain number of requests to your site it takes the bot 30+ seconds to get the HTML back. The downside is that your web servers need to hold many more open connections, but the benefit is that you'll throttle the bots to an acceptable level.
I currently run a website that gets crawled a lot, deadheat.ca. I've written a simple algorithm that tarpits bots. I also throw up a captcha now and then when I see an IP address hitting too often over a span of a few minutes. The website is not super popular and, in my case, it's pretty simple to differentiate between a human and a bot.
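A minimal sketch of that kind of tarpit, assuming Flask and an in-memory counter (all thresholds are made up; a real deployment would keep the counts in Redis or similar):

```python
# Minimal tarpit sketch using Flask. Thresholds are illustrative, and the counters
# live in process memory; a real deployment would use Redis or similar.
import time
from collections import defaultdict

from flask import Flask, request

app = Flask(__name__)

hits = defaultdict(list)   # ip -> timestamps of recent requests
WINDOW = 300               # only look at the last 5 minutes
FREE_REQUESTS = 30         # below this, no delay at all
DELAY_STEP = 0.5           # seconds added per request over the threshold
MAX_DELAY = 30.0           # cap so connections eventually complete

@app.before_request
def tarpit():
    ip = request.remote_addr
    now = time.time()
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
    hits[ip].append(now)
    excess = len(hits[ip]) - FREE_REQUESTS
    if excess > 0:
        # Each additional request waits a little longer, up to 30 seconds.
        time.sleep(min(excess * DELAY_STEP, MAX_DELAY))
```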
Hope this helps...
One of the alternatives is charging for my service, but bots are not my users' problem, they are mine.
Wow, what were you doing with the data?
Crawling thousands of websites, mashing up the data to analyze competitiveness between them, and selling it back.
For example, the cost of flights. Different websites offer different prices for the same flight. The technology crawls all the prices, combines the data, then resells it back to the websites. Everyone knows everyone's prices, competition stays high, and consumers get lower prices.
Solutions like Cloudflare and Distil have sophisticated algorithms to separate fake traffic from real, but even they are not close to being perfect.
I am pretty sure crawling robots.txt links was their P1 requirement.
BTW, I'm interested in learning a bit more about your stack; we are on the same route but at a smaller scale.
If you can hit Google 60 times per minute per IP before getting blocked and you need to crawl them 1000 times per minute, you need 17 IPs in rotation. Randomize headers to look like real people coming from schools, office buildings, etc... Lots of work, but possible.
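A rough sketch of that arithmetic plus header and proxy rotation, assuming the Python `requests` library; the proxy pool and user agent strings are placeholders, not anything specific to Google:

```python
# Back-of-the-envelope IP math plus rotating proxies and headers.
# The proxy pool and user agent strings are placeholders, not real endpoints.
import math
import random

import requests

TARGET_RPM = 1000      # requests per minute you need
LIMIT_PER_IP = 60      # requests per minute one IP tolerates before being blocked
IPS_NEEDED = math.ceil(TARGET_RPM / LIMIT_PER_IP)   # -> 17

PROXIES = [f"http://proxy{i}.example.com:8080" for i in range(IPS_NEEDED)]  # hypothetical
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...",
]

def fetch(url: str) -> requests.Response:
    proxy = random.choice(PROXIES)
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.8"]),
    }
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=10)
```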
Note that Google is pretty aggressive about captcha-ing "suspicious" activity and/or throttling responses to suspicious requests. You can easily trigger a captcha with your own manual searching. Just search for something, go to page 10, and repeat maybe 5-20 times and you'll see a captcha challenge.
If Google gets more serious about blocking me, then I'll use ML to overcome their ML (which should be doable because they're always worried about keeping Search consumer-friendly).
My solution would solve the issue in a fair way by having a "reasonable usage" limit applied to everyone, bot or not. This also means it can't be defeated by someone paying humans to do the dirty work to bypass bot restrictions.
I remember working hard on a project for a year, then releasing the data and visualizations online. I was very proud. It was very cool. Almost immediately, we saw grad students and research assistants across the globe scraping our site. I started brainstorming clever ways to fend off the scrapers with a colleague when my boss interrupted.
Him: "WTF are you doing?"
Me: "We're trying to figure out how to prevent people from scraping our data."
Him: "WTF do you want to do that for?"
Me: "Uh... to prevent them from stealing our data."
Him: "But we put it on the public Web..."
Me: "Yeah, but that data took thousands of compute hours to grind out. They're getting a valuable product for free!"
Him: "So then pull it from Web."
Me: "But then we won't get any sales from people who see that we published this new and exciting-- Oh. I see what you mean."
Him: "Yeah, just get a list of the top 20 IP addresses, figure out who's scraping, and hand it off to our sales guys. Scraping ain't free, and our prices aren't high. This is a sales tool, and it's working. Now get back to building shit to make our customers lives easier, not shittier."
Sure enough, most of the scrapers chose to pay rather than babysit web crawlers once we pointed out that our price was lower than their time cost. If your data is valuable enough to scrape, it's valuable enough to sell.
The only technological way to prevent someone crawling your website is to not put it on a publicly-facing property in the first place. If you're concerned about DoS or bandwidth charges, throttle all users. Otherwise, any attempt to restrict bots is just pissing into the wind, IMHO.
Spend your energies on generating real value. Don't engage in an arms race you're destined to lose.
They'd be profitable in a month.
Bear in mind that it’s a large technical feat to be able to ingest the firehose effectively, so it’s not really suitable for consumers.
I don't believe this is true. It used to be the case but Twitter went and put all API sales under Gnip.
We do a bunch of other stuff, like adding fake properties so we can check who is scraping our content, and using tarpits.
Developers always argue with me that it's pissing in the wind, that "content wants to be free", that you shouldn't bother even trying to prevent it since it's inevitable. And yet it has helped. We did a split A/B test on preventing scrapers, and it turns out that it's quite effective.
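A sketch of the "fake properties" idea as honeytokens: each client gets one fabricated record whose ID is derived from who it was served to, so if that record shows up on another site you know who scraped you. All names and the secret below are placeholders:

```python
# Honeytoken sketch: mint one fake record per client, derived from a secret and
# the client's IP, so a leaked record can be traced back. Everything here is invented.
import hashlib
import hmac

SECRET = b"rotate-me-regularly"

def honeytoken_id(client_ip: str) -> str:
    # Stable but unguessable ID tied to the requester.
    return hmac.new(SECRET, client_ip.encode(), hashlib.sha256).hexdigest()[:12]

def listings_for(client_ip: str, real_listings: list) -> list:
    fake = {
        "id": honeytoken_id(client_ip),
        "title": "3 bed house, Maple Street",   # plausible-looking fake entry
        "price": 425000,
    }
    return real_listings + [fake]

def minted_for(found_id: str, suspect_ip: str) -> bool:
    # If a honeytoken ID later turns up on another site, check whose copy it was.
    return hmac.compare_digest(found_id, honeytoken_id(suspect_ip))
```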
If you want to charge rent (via a subscription service etc), then do that, and be clear that you're in the business of charging rent. Don't conflate selling with renting - that just leads to a product gimping death spiral.
To me, it sounds like your business lacks some fundamentals.
Having a free website is not the same as granting unlimited access so that one guy can take the whole pie.
Or worse, let's say you write a free essay that everybody can read and share. But then somebody takes it, deletes your name, and puts their own name on it instead.
If your free content is the sole source of your online revenue, then your reputation is your business. Nefarious crawlers who rebrand your content cannot, by definition, beat you to market. So you stand to gain much more by writing great content, driving readers to it as soon as it's posted, and filing the occasional DMCA takedown than trying to compete with plagiarists in a game of whack-a-mole.
Unauthorized access to a computer system is a different legal category than IP licensing.
Ever heard of KAZAA? Napster? BitTorrent?
All sites where people who have usually purchased data make it freely available to others.
Funny part was that the lawyer on the other side wanted us to return all of the content on disk. Not show what we had copied, but literally return it. My lawyer laughed about it. The other lawyer was smart/savvy enough to be effectively using the internet in '01, but didn't really understand the tech.
The other funny part is that if he had generalized his site beyond law, he would have had a major business these days.
I actually had a client who asked me to record a screen capture of me deleting their files from my computer (and it was actually a developer making the request).
There's a nice Github repo with some advice on blocking scrapers:
Finally, you could use a plugin for your web server that displays a CAPTCHA to visitors from IP addresses that cause a lot of requests to your site.
There are many more strategies available (up to creating fake websites / content to lead crawlers astray), but the CAPTCHA solution is the most robust one. It will not be able to protect you against crawlers that use a large source IP pool to access your site though.
The going rate for CAPTCHA solving is about 1/10 of a USD penny.
You really can't protect against this unless you start making the experience of regular visitors much worse.
For example, it might detect that a mouse click is dispatched at exact intervals (and block it). To which you'd think "I'll just add Math.random() * 2000" which it'll easily detect as well.
It's _definitely_ doable, but it's not as trivial as recording a selenium macro. (Not to mention these tools look for selenium presence and extensions anyway).
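For a feel of what "easily detect" means here, a hedged server-side sketch (thresholds invented) that flags metronome-like event timing; naive uniform jitter needs a distribution-shape comparison on top of this, but the idea is the same:

```python
# Illustrative only: flag sessions whose click/keypress gaps look scripted.
# The 0.05 threshold is made up; real products compare whole timing distributions.
import statistics

def looks_scripted(intervals_ms: list) -> bool:
    if len(intervals_ms) < 10:
        return False                      # not enough events to judge
    mean = statistics.mean(intervals_ms)
    stdev = statistics.pstdev(intervals_ms)
    # Events dispatched at (nearly) exact intervals have near-zero relative spread.
    # Adding `Math.random() * 2000` defeats this particular check, but still leaves
    # a flat, hard-capped distribution that a shape comparison against recorded
    # human sessions would catch.
    return mean > 0 and stdev / mean < 0.05
```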
By doing this you must accept that a certain percentage of legitimate users, which can be quite significant, will be blocked. In case you're wondering: yes, this does happen with solutions such as Cloudflare.
And at that point, either your website isn't popular, in which case you can't afford to lose users, or it is very popular, in which case you'll piss off enough users to receive bad reviews.
Basically you can't afford to do this unless you're Facebook or Google, and then you have to wonder why Facebook and Google do not deploy such protections.
So going back to my main point, of course it's possible, but the experience for users gets significantly worse such that you won't want to do it.
Both Distil and Incapsula _did_ have a small number of false positives (showing a captcha to users). We did have to write some code to overcome that.
Imagine if they changed every day. It would make things a lot more difficult.
There would be disadvantages for actual users with this method (caching wouldn't work very well, for example), but maybe this alternative version of the site could be served only to bots.
The crawler could get smart about it and only use XPaths like "the 6th div on the page", so maybe in the daily update you could also throw in some random useless empty divs and spans in various locations.
It's a lot of work to set up, but I think you would make scraping almost impossible.
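A rough sketch of the idea. The regex rewriting below is only illustrative; in practice you'd hook into the templating and CSS pipeline so stylesheets get the same renaming:

```python
# Sketch of daily markup churn: class names are renamed with a date-based seed and
# useless wrapper divs are injected so positional XPaths ("the 6th div") also break.
import datetime
import hashlib
import random
import re

def daily_class(name: str) -> str:
    seed = datetime.date.today().isoformat()
    return "c" + hashlib.sha256(f"{seed}:{name}".encode()).hexdigest()[:8]

def obfuscate(html: str) -> str:
    # Rewrite class="..." attributes with today's generated names.
    html = re.sub(
        r'class="([^"]+)"',
        lambda m: 'class="' + " ".join(daily_class(c) for c in m.group(1).split()) + '"',
        html,
    )
    # Sprinkle in empty decoy divs, deterministic per day so caching still works within a day.
    rng = random.Random(datetime.date.today().toordinal())
    return re.sub(r"(<div[^>]*>)",
                  lambda m: m.group(1) + "<div></div>" * rng.randint(0, 2),
                  html)
```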
Think about it this way: you could render the whole page as a JPEG with no computer readable text, but someone could still ocr it.
Also, you have to take into account scrapers that just run regexes over the raw HTML.
Otherwise, your best bet (hardest to get around, in my experience) is monitoring for actual user I/O. For example, if someone starts typing in an input field, a real human has to have clicked on it beforehand, and most bots won't.
Or if a user clicks next-page without the selector being visible or without scrolling the page at all. Not natural behavior.
Think like a human.
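A server-side sketch of that I/O-order idea, assuming the page reports a stream of UI events to a beacon endpoint you add yourself (not shown); the event names are invented:

```python
# Flag sessions whose event order is physically implausible for a human.
def suspicious_session(events: list) -> bool:
    clicked = set()
    scrolled = False
    for e in events:
        kind, target = e.get("type"), e.get("target")
        if kind == "click":
            clicked.add(target)
        elif kind == "scroll":
            scrolled = True
        elif kind == "keypress" and target not in clicked:
            return True      # typing into a field that was never clicked/focused
        elif kind == "next_page" and not scrolled:
            return True      # paging forward without ever scrolling
    return False
```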
1) Request fingerprinting - browser request headers have characteristic patterns that depend on the user agent. Matching user agent strings against a database of request header fingerprints lets you filter out anyone who is not using a real browser and hasn't taken the time to correctly spoof the headers. This filters out low-skill, low-energy scrapers and raises their costs (see the sketch after this list).
3) Do access pattern detection. Require valid Referer headers, don't allow API access without page access, check that the page's assets are loaded, etc.
4) Use the MaxMind database and treat as suspicious any access that doesn't come from a consumer ISP. Block access from AWS, GCP, Azure, and other cloud services offering cheap IP rental.
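A toy version of points 1 and 3; the fingerprint table is tiny and invented, and a real one would be built from captured traffic of genuine browsers:

```python
# Header fingerprint and referer checks (illustrative data only).
EXPECTED_HEADERS = {
    # browser family -> headers a real browser of that family always sends
    "chrome":  {"accept", "accept-encoding", "accept-language", "sec-ch-ua", "sec-fetch-site"},
    "firefox": {"accept", "accept-encoding", "accept-language", "sec-fetch-site"},
}

def header_mismatch(user_agent: str, header_names: set) -> bool:
    sent = {h.lower() for h in header_names}
    for family, expected in EXPECTED_HEADERS.items():
        if family in user_agent.lower():
            return not expected.issubset(sent)   # claims a browser, headers don't match
    return True                                  # unknown user agent: treat as suspicious

def bad_api_referer(referer, our_host="example.com") -> bool:
    # Point 3: no API access without a same-site page visit.
    return referer is None or our_host not in referer
```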
Both worked, and both held up well against plain HTTP downloads and Selenium (and the common techniques). Neither worked against someone dedicated enough - there are the usual tricks for bypassing them (which we used to test our own stuff).
We also developed something in-house, but that never helps.
For ill-behaved ones, it depends on why you're trying to block them. Rate throttling, IP blocking, requiring login, or just gating all access to the site with HTTP Basic Auth can all work.
For example, a dictionary site. If someone keeps crawling your site after triggering your "this is a bot" detection, serve bad data in every 20th response: misspell a word, mislabel a noun as a verb, give an incorrect definition.
If you combine this with throttling then the value of scraping your site is greatly reduced. Also, most people won't come up with a super advanced crawler if they never get a "Permission denied, please stop crawling" message.
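A minimal sketch of that kind of poisoning, using the dictionary-site example (the detection itself isn't shown, and the field names are invented):

```python
# Only flagged clients get a bad entry, once every 20 requests.
import random

def maybe_poison(entry: dict, request_count: int, flagged_as_bot: bool) -> dict:
    if not flagged_as_bot or request_count % 20 != 0:
        return entry
    bad = dict(entry)
    # Mislabel the part of speech...
    bad["part_of_speech"] = "verb" if entry["part_of_speech"] == "noun" else "noun"
    # ...and subtly misspell the headword by doubling one letter.
    word = entry["word"]
    i = random.randrange(1, len(word)) if len(word) > 1 else 0
    bad["word"] = word[:i] + word[i] + word[i:]
    return bad
```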
simple demo: http://botbouncer.xyz/
I ran it for a while on some medium-traffic websites that were being heavily scraped. It blocked thousands of IP addresses but, IIRC, only received one Bitcoin payment.
This is for Drupal sites. It has a strong firewall (CSF) and a lot of crawler detection in the nginx configuration. It checks the load and, when the load is high, it blocks the crawlers.
If someone wants to scrape your website badly enough, they'll find a way.
The wily and expensive way is to alter your page.
The company I work for does a large amount of scraping of partner websites. We have contracts that allow us to do it, signed off by someone in their company, but we still get blocked and throttled by tech teams who think they are helping by blocking bots. If we can't scrape a site, we just turn off the partner, and that means lost business for them.
Also, you can do frontend rendering; it's a somewhat bigger roadblock, but a scraper can use PhantomJS or something similar to crawl that.
IIRC there is a PHP framework that mutates your frontend code, but I'm not sure it changes things enough to stop a generalized XPath...
I also used to work for a company that employed people full time for crawling. Their system would even send a notification if a crawler stopped working, so they could update the crawler...
- permanently block known anonymizer service IP addresses
- permanently block known server IP address ranges, such as AWS
- temporarily (short intervals, 5-15 mins) block IP addresses with typical scraping access patterns (more than 1-2 hits/sec sustained over 30+ secs); see the sketch after this list
- add captchas
All of these will cost you a small fraction of legitimate users and are only worth it if scraping puts a strain on your server or kills your business model...
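A sketch of the temporary-block rule, using the thresholds from the list above and an in-memory store (a real setup would persist this in Redis or lean on the web server's own rate limiting):

```python
# Sustained ~2 hits/sec over 30 seconds earns a 10-minute timeout.
import time
from collections import defaultdict, deque

WINDOW = 30        # seconds to look back
MAX_HITS = 60      # ~2 hits/sec sustained over the window
BLOCK_FOR = 600    # 10 minutes

recent = defaultdict(deque)   # ip -> timestamps of recent hits
blocked_until = {}            # ip -> time when the block expires

def allow(ip: str) -> bool:
    now = time.time()
    if blocked_until.get(ip, 0) > now:
        return False
    q = recent[ip]
    q.append(now)
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) > MAX_HITS:
        blocked_until[ip] = now + BLOCK_FOR
        return False
    return True
```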
One technique that bothers me quite a bit is constant random changes in class names or DOM structure, which can make it more difficult. Not impossible but more difficult.
Then there are zip bombs. Most crawlers will make hundreds of requests in five minutes, while legitimate visitors will stay below 100.
This is pretty cheap to do and I've seen it done before in several places I've worked (on the defending side).
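A hedged sketch of that count-then-bomb combination (thresholds and sizes are invented, and this also punishes any legitimate client or proxy that honours Content-Encoding, so treat it as a last resort):

```python
# Serve a tiny gzip body that inflates to a very large payload for greedy clients.
import gzip
import io

def make_gzip_bomb(decompressed_mb: int = 100) -> bytes:
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        chunk = b"\0" * (1024 * 1024)
        for _ in range(decompressed_mb):
            gz.write(chunk)        # zeros compress roughly 1000:1
    return buf.getvalue()          # ~100 KB on the wire, ~100 MB when inflated

BOMB = make_gzip_bomb()

def respond(requests_last_5_min: int, normal_body: bytes):
    # "Hundreds" of requests in five minutes vs. under 100 for real visitors.
    if requests_last_5_min > 300:
        return 200, {"Content-Encoding": "gzip"}, BOMB
    return 200, {}, normal_body
```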