What you can do, however, is make it hard enough that the vast majority of developers can't do it (e.g. my tech crawled billions of pages, but there was a whole team dedicated to keeping it going). If you have money to spend, Distil Networks or Incapsula have good solutions. They block PhantomJS and browsers driven by Selenium, and they rate limit the bots.
What I found really effective, and what some websites do, is to tarpit bots. That is, slowly increase the number of seconds it takes to return the HTTP response, so after a certain number of requests to your site it takes the bot 30+ seconds to get the HTML back. The downside is that your web servers need to hold many more open connections, but the benefit is that you'll throttle the bots to an acceptable level.
I currently run a website that gets crawled a lot, deadheat.ca. I've written a simple algorithm that tarpits bots. I also throw up a captcha now and then when I see an IP address hitting too often over a span of a few minutes. The website is not super popular and, in my case, it's pretty simple to differentiate between a human and a bot.
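A minimal sketch of that kind of tarpit, assuming Flask and an in-memory counter (all thresholds are made up; a real deployment would keep the counts in Redis or similar):

```python
# Minimal tarpit sketch using Flask. Thresholds are illustrative, and the counters
# live in process memory; a real deployment would use Redis or similar.
import time
from collections import defaultdict

from flask import Flask, request

app = Flask(__name__)

hits = defaultdict(list)   # ip -> timestamps of recent requests
WINDOW = 300               # only look at the last 5 minutes
FREE_REQUESTS = 30         # below this, no delay at all
DELAY_STEP = 0.5           # seconds added per request over the threshold
MAX_DELAY = 30.0           # cap so connections eventually complete

@app.before_request
def tarpit():
    ip = request.remote_addr
    now = time.time()
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
    hits[ip].append(now)
    excess = len(hits[ip]) - FREE_REQUESTS
    if excess > 0:
        # Each additional request waits a little longer, up to 30 seconds.
        time.sleep(min(excess * DELAY_STEP, MAX_DELAY))
```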
Hope this helps...
One of the alternatives is charging for my service, but bots are not my users' problem, they are mine.
Wow, what were you doing with the data?
Crawling thousands of websites, mashing up the data to analyze competitiveness between them, and selling it back.
For example, the cost of flights. Different websites offer different prices for the same flight. The technology crawls all the prices, combines the data, then resells it back to the websites. Everyone knows everyone's prices, competition stays high, and consumers get lower prices.
Solutions like Cloudflare and Distil have sophisticated algorithms to separate fake traffic from real, but even they are not close to being perfect.
I am pretty sure crawling robots.txt links was their P1 requirement.
BTW, I'm interested in learning a bit more about your stack; we are on the same route but at a smaller scale.
If you can hit Google 60 times per minute per IP before getting blocked and you need to crawl them 1000 times per minute, you need 17 IPs in rotation. Randomize headers to look like real people coming from schools, office buildings, etc... Lots of work, but possible.
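A rough sketch of that arithmetic plus header and proxy rotation, assuming the Python `requests` library; the proxy pool and user agent strings are placeholders, not anything specific to Google:

```python
# Back-of-the-envelope IP math plus rotating proxies and headers.
# The proxy pool and user agent strings are placeholders, not real endpoints.
import math
import random

import requests

TARGET_RPM = 1000      # requests per minute you need
LIMIT_PER_IP = 60      # requests per minute one IP tolerates before being blocked
IPS_NEEDED = math.ceil(TARGET_RPM / LIMIT_PER_IP)   # -> 17

PROXIES = [f"http://proxy{i}.example.com:8080" for i in range(IPS_NEEDED)]  # hypothetical
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...",
]

def fetch(url: str) -> requests.Response:
    proxy = random.choice(PROXIES)
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.8"]),
    }
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=10)
```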
Note that Google is pretty aggressive about captcha-ing "suspicious" activity and/or throttling responses to suspicious requests. You can easily trigger a captcha with your own manual searching. Just search for something, go to page 10, and repeat maybe 5-20 times and you'll see a captcha challenge.
If Google gets more serious about blocking me, then I'll use ML to overcome their ML (which should be doable because they're always worried about keeping Search consumer-friendly).
My solution would solve the issue in a fair way by having a "reasonable usage" limit applied to everyone, bot or not. This also means it can't be defeated by someone paying humans to do the dirty work to bypass bot restrictions.
I remember working hard on a project for a year, then releasing the data and visualizations online. I was very proud. It was very cool. Almost immediately, we saw grad students and research assistants across the globe scraping our site. I started brainstorming clever ways to fend off the scrapers with a colleague when my boss interrupted.
Him: "WTF are you doing?"
Me: "We're trying to figure out how to prevent people from scraping our data."
Him: "WTF do you want to do that for?"
Me: "Uh... to prevent them from stealing our data."
Him: "But we put it on the public Web..."
Me: "Yeah, but that data took thousands of compute hours to grind out. They're getting a valuable product for free!"
Him: "So then pull it from Web."
Me: "But then we won't get any sales from people who see that we published this new and exciting-- Oh. I see what you mean."
Him: "Yeah, just get a list of the top 20 IP addresses, figure out who's scraping, and hand it off to our sales guys. Scraping ain't free, and our prices aren't high. This is a sales tool, and it's working. Now get back to building shit to make our customers lives easier, not shittier."
Sure enough, most of the scrapers chose to pay rather than babysit web crawlers once we pointed out that our price was lower than their time cost. If your data is valuable enough to scrape, it's valuable enough to sell.
The only technological way to prevent someone crawling your website is to not put it on a publicly-facing property in the first place. If you're concerned about DoS or bandwidth charges, throttle all users. Otherwise, any attempt to restrict bots is just pissing into the wind, IMHO.
Spend your energies on generating real value. Don't engage in an arms race you're destined to lose.
They'd be profitable in a month.
Bear in mind that it’s a large technical feat to be able to ingest the firehose effectively, so it’s not really suitable for consumers.
I don't believe this is true. It used to be the case but Twitter went and put all API sales under Gnip.
We do a bunch of other stuff, like adding fake properties so we can check who is scraping our content, and using tarpits.
Developers always argue with me that it's pissing in the wind, that "content wants to be free", that you shouldn't bother even trying to prevent it since it's inevitable. And yet it has helped. We did a split A/B test on preventing scrapers, and it turns out that it's quite effective.
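A sketch of the "fake properties" idea as honeytokens: each client gets one fabricated record whose ID is derived from who it was served to, so if that record shows up on another site you know who scraped you. All names and the secret below are placeholders:

```python
# Honeytoken sketch: mint one fake record per client, derived from a secret and
# the client's IP, so a leaked record can be traced back. Everything here is invented.
import hashlib
import hmac

SECRET = b"rotate-me-regularly"

def honeytoken_id(client_ip: str) -> str:
    # Stable but unguessable ID tied to the requester.
    return hmac.new(SECRET, client_ip.encode(), hashlib.sha256).hexdigest()[:12]

def listings_for(client_ip: str, real_listings: list) -> list:
    fake = {
        "id": honeytoken_id(client_ip),
        "title": "3 bed house, Maple Street",   # plausible-looking fake entry
        "price": 425000,
    }
    return real_listings + [fake]

def minted_for(found_id: str, suspect_ip: str) -> bool:
    # If a honeytoken ID later turns up on another site, check whose copy it was.
    return hmac.compare_digest(found_id, honeytoken_id(suspect_ip))
```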
If you want to charge rent (via a subscription service etc), then do that, and be clear that you're in the business of charging rent. Don't conflate selling with renting - that just leads to a product gimping death spiral.
To me, it sounds like your business lacks some fundamentals.
Having a free website is not the same as granting unlimited access so that one guy can take the whole pie.
Or worse, let's say you write a free essay that everybody can read and share. But then somebody takes it, deletes your name, and puts their own name on it instead.
If your free content is the sole source of your online revenue, then your reputation is your business. Nefarious crawlers who rebrand your content cannot, by definition, beat you to market. So you stand to gain much more by writing great content, driving readers to it as soon as it's posted, and filing the occasional DMCA takedown than trying to compete with plagiarists in a game of whack-a-mole.
Unauthorized access to a computer system is a different legal category than IP licensing.
Ever heard of KAZAA? Napster? BitTorrent?
All sites where people who have usually purchased data make it freely available to others.
Funny part was that the lawyer on the other side wanted us to return all of the content on disk. Not show what we had copied, but literally return it. My lawyer laughed about it. The other lawyer was smart/savvy enough to be effectively using the internet in '01, but didn't really understand the tech.
The other funny part is that if he had generalized his site beyond law, he would have had a major business these days.
I actually had a client who asked me to record a screen capture of me deleting their files from my computer (and it was actually a developer making the request).
There's a nice Github repo with some advice on blocking scrapers:
Finally, you could use a plugin for your web server that displays a CAPTCHA to visitors from IP addresses that cause a lot of requests to your site.
There are many more strategies available (up to creating fake websites / content to lead crawlers astray), but the CAPTCHA solution is the most robust one. It will not be able to protect you against crawlers that use a large source IP pool to access your site though.
The going rate for CAPTCHA solving is about 1/10 of a USD penny.
You really can't protect against this unless you start making the experience of regular visitors much worse.
For example, it might detect that a mouse click is dispatched at exact intervals (and block it). To which you'd think "I'll just add Math.random() * 2000" which it'll easily detect as well.
It's _definitely_ doable, but it's not as trivial as recording a selenium macro. (Not to mention these tools look for selenium presence and extensions anyway).
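For a feel of what "easily detect" means here, a hedged server-side sketch (thresholds invented) that flags metronome-like event timing; naive uniform jitter needs a distribution-shape comparison on top of this, but the idea is the same:

```python
# Illustrative only: flag sessions whose click/keypress gaps look scripted.
# The 0.05 threshold is made up; real products compare whole timing distributions.
import statistics

def looks_scripted(intervals_ms: list) -> bool:
    if len(intervals_ms) < 10:
        return False                      # not enough events to judge
    mean = statistics.mean(intervals_ms)
    stdev = statistics.pstdev(intervals_ms)
    # Events dispatched at (nearly) exact intervals have near-zero relative spread.
    # Adding `Math.random() * 2000` defeats this particular check, but still leaves
    # a flat, hard-capped distribution that a shape comparison against recorded
    # human sessions would catch.
    return mean > 0 and stdev / mean < 0.05
```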
By doing this you must accept that a certain percentage of legitimate users, which can be quite significant, will be blocked. In case you're wondering: yes, this does happen with solutions such as Cloudflare.
And at that point, either your website isn't popular, in which case you can't afford to lose users, or it is very popular, in which case you'll piss off enough users to receive bad reviews.
Basically you can't afford to do this unless you're Facebook or Google, and then you have to wonder why Facebook and Google do not deploy such protections.
So going back to my main point, of course it's possible, but the experience for users gets significantly worse such that you won't want to do it.
Both Distil and Incapsula _did_ have a small number of false positives (showing a captcha to users). We did have to write some code to overcome that.
Imagine if they changed every day. It would make things a lot more difficult.
There would be disadvantages for actual users with this method (caching wouldn't work very well, for example), but maybe this alternative version of the site could be served only to bots.
The crawler could get smart about it and only use XPaths like "the 6th div on the page", so maybe in the daily update you could also throw in some random useless empty divs and spans in various locations.
It's a lot of work to set up, but I think you would make scraping almost impossible.
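A rough sketch of the idea. The regex rewriting below is only illustrative; in practice you'd hook into the templating and CSS pipeline so stylesheets get the same renaming:

```python
# Sketch of daily markup churn: class names are renamed with a date-based seed and
# useless wrapper divs are injected so positional XPaths ("the 6th div") also break.
import datetime
import hashlib
import random
import re

def daily_class(name: str) -> str:
    seed = datetime.date.today().isoformat()
    return "c" + hashlib.sha256(f"{seed}:{name}".encode()).hexdigest()[:8]

def obfuscate(html: str) -> str:
    # Rewrite class="..." attributes with today's generated names.
    html = re.sub(
        r'class="([^"]+)"',
        lambda m: 'class="' + " ".join(daily_class(c) for c in m.group(1).split()) + '"',
        html,
    )
    # Sprinkle in empty decoy divs, deterministic per day so caching still works within a day.
    rng = random.Random(datetime.date.today().toordinal())
    return re.sub(r"(<div[^>]*>)",
                  lambda m: m.group(1) + "<div></div>" * rng.randint(0, 2),
                  html)
```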
Think about it this way: you could render the whole page as a JPEG with no computer readable text, but someone could still ocr it.
Also, you have to take into account scrapers that just run regexes over the raw HTML.
Otherwise, your best bet (hardest to get around, in my experience) is monitoring for actual user I/O. For example, if someone starts typing in an input field, a real human has to have clicked on it beforehand, and most bots won't.
Or if a user clicks next-page without the selector being visible or without scrolling the page at all. Not natural behavior.
Think like a human.
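A server-side sketch of that I/O-order idea, assuming the page reports a stream of UI events to a beacon endpoint you add yourself (not shown); the event names are invented:

```python
# Flag sessions whose event order is physically implausible for a human.
def suspicious_session(events: list) -> bool:
    clicked = set()
    scrolled = False
    for e in events:
        kind, target = e.get("type"), e.get("target")
        if kind == "click":
            clicked.add(target)
        elif kind == "scroll":
            scrolled = True
        elif kind == "keypress" and target not in clicked:
            return True      # typing into a field that was never clicked/focused
        elif kind == "next_page" and not scrolled:
            return True      # paging forward without ever scrolling
    return False
```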
1) Request fingerprinting - browser request headers have characteristic patterns that depend on the user agent. Matching user agent strings against a database of request header fingerprints lets you filter out anyone who is not using a real browser and hasn't taken the time to correctly spoof the headers. This filters out low-skill, low-energy scrapers and raises their costs (see the sketch after this list).
3) Do access pattern detection. Require valid Referer headers, don't allow API access without page access, check that the page's assets are loaded, etc.
4) Use the MaxMind database and treat as suspicious any access that doesn't come from a consumer ISP. Block access from AWS, GCP, Azure, and other cloud services offering cheap IP rental.
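A toy version of points 1 and 3; the fingerprint table is tiny and invented, and a real one would be built from captured traffic of genuine browsers:

```python
# Header fingerprint and referer checks (illustrative data only).
EXPECTED_HEADERS = {
    # browser family -> headers a real browser of that family always sends
    "chrome":  {"accept", "accept-encoding", "accept-language", "sec-ch-ua", "sec-fetch-site"},
    "firefox": {"accept", "accept-encoding", "accept-language", "sec-fetch-site"},
}

def header_mismatch(user_agent: str, header_names: set) -> bool:
    sent = {h.lower() for h in header_names}
    for family, expected in EXPECTED_HEADERS.items():
        if family in user_agent.lower():
            return not expected.issubset(sent)   # claims a browser, headers don't match
    return True                                  # unknown user agent: treat as suspicious

def bad_api_referer(referer, our_host="example.com") -> bool:
    # Point 3: no API access without a same-site page visit.
    return referer is None or our_host not in referer
```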
Both worked, and both held up well against plain HTTP downloads and Selenium (and the common techniques). Neither worked against someone dedicated enough - there are the usual tricks for bypassing them (which we used to test our own stuff).
We also developed something in-house, but that never helps.
For ill-behaved ones, it depends on why you're trying to block them. Rate throttling, IP blocking, requiring login, or just gating all access to the site with HTTP Basic Auth can all work.
For example, a dictionary site. If someone keeps crawling your site after triggering your "this is a bot" detection, serve bad data in every 20th response: misspell a word, mislabel a noun as a verb, give an incorrect definition.
If you combine this with throttling then the value of scraping your site is greatly reduced. Also, most people won't come up with a super advanced crawler if they never get a "Permission denied, please stop crawling" message.
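A minimal sketch of that kind of poisoning, using the dictionary-site example (the detection itself isn't shown, and the field names are invented):

```python
# Only flagged clients get a bad entry, once every 20 requests.
import random

def maybe_poison(entry: dict, request_count: int, flagged_as_bot: bool) -> dict:
    if not flagged_as_bot or request_count % 20 != 0:
        return entry
    bad = dict(entry)
    # Mislabel the part of speech...
    bad["part_of_speech"] = "verb" if entry["part_of_speech"] == "noun" else "noun"
    # ...and subtly misspell the headword by doubling one letter.
    word = entry["word"]
    i = random.randrange(1, len(word)) if len(word) > 1 else 0
    bad["word"] = word[:i] + word[i] + word[i:]
    return bad
```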
simple demo: http://botbouncer.xyz/
I ran it for a while on some medium-traffic websites that were being heavily scraped. It blocked thousands of IP addresses but, IIRC, only received one Bitcoin payment.
This is for Drupal sites. It has a strong firewall (CSF) and a lot of crawler detection in the nginx configuration. It checks the load and, when the load is high, it blocks the crawlers.
If someone wants to scrape your website badly enough, they'll find a way.
The wily and expensive way is to alter your page.
The company I work for does a large amount of scraping of partner websites. We have contracts that allow us to do it, signed off by someone in their company, but we still get blocked and throttled by tech teams who think they are helping by blocking bots. If we can't scrape a site, we just turn off the partner, and that means lost business for them.
Also, you can do frontend rendering; it's a somewhat bigger roadblock, but a scraper can use PhantomJS or something similar to crawl that.
IIRC there is a PHP framework that mutates your frontend code, but I'm not sure it changes things enough to stop a generalized XPath...
I also used to work for a company that employed people full time for crawling. Their system would even send a notification if a crawler stopped working, so they could update the crawler...
- permanently block known anonymizer service IP addresses
- permanently block known server IP address ranges, such as AWS
- temporarily (short intervals, 5-15 mins) block IP addresses with typical scraping access patterns (more than 1-2 hits/sec sustained over 30+ secs); see the sketch after this list
- add captchas
All of these will cost you a small fraction of legitimate users and are only worth it if scraping puts a strain on your server or kills your business model...
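A sketch of the temporary-block rule, using the thresholds from the list above and an in-memory store (a real setup would persist this in Redis or lean on the web server's own rate limiting):

```python
# Sustained ~2 hits/sec over 30 seconds earns a 10-minute timeout.
import time
from collections import defaultdict, deque

WINDOW = 30        # seconds to look back
MAX_HITS = 60      # ~2 hits/sec sustained over the window
BLOCK_FOR = 600    # 10 minutes

recent = defaultdict(deque)   # ip -> timestamps of recent hits
blocked_until = {}            # ip -> time when the block expires

def allow(ip: str) -> bool:
    now = time.time()
    if blocked_until.get(ip, 0) > now:
        return False
    q = recent[ip]
    q.append(now)
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) > MAX_HITS:
        blocked_until[ip] = now + BLOCK_FOR
        return False
    return True
```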
One technique that bothers me quite a bit is constant random changes in class names or DOM structure, which can make it more difficult. Not impossible but more difficult.
Then there are zip bombs. Most crawlers will make hundreds of requests in five minutes, while legitimate visitors will stay below 100.
This is pretty cheap to do and I've seen it done before in several places I've worked (on the defending side).
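A hedged sketch of that count-then-bomb combination (thresholds and sizes are invented, and this also punishes any legitimate client or proxy that honours Content-Encoding, so treat it as a last resort):

```python
# Serve a tiny gzip body that inflates to a very large payload for greedy clients.
import gzip
import io

def make_gzip_bomb(decompressed_mb: int = 100) -> bytes:
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        chunk = b"\0" * (1024 * 1024)
        for _ in range(decompressed_mb):
            gz.write(chunk)        # zeros compress roughly 1000:1
    return buf.getvalue()          # ~100 KB on the wire, ~100 MB when inflated

BOMB = make_gzip_bomb()

def respond(requests_last_5_min: int, normal_body: bytes):
    # "Hundreds" of requests in five minutes vs. under 100 for real visitors.
    if requests_last_5_min > 300:
        return 200, {"Content-Encoding": "gzip"}, BOMB
    return 200, {}, normal_body
```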