My website is constantly being hit by scrapers running on EC2 machines http://d.pr/i/bLtE . I went aggressive and blocked all access from EC2 IPs http://bit.ly/SUOaof until I realized that quite a few reader proxies, like Flipboard, are hosted on EC2, so blanket blocking of these Amazon machines won't help. How is the community dealing with this problem? Can you advise?
Edit: I've read somewhere that Stack Overflow blocks all EC2 machines. I don't think that's the optimal solution, considering how many legitimate services run there. Also, the hits come from many different IPs.
1) You can block EC2 wholesale. You've already mentioned the issues with this, and the block can be bypassed via a VPN or another network anyway. EC2 is attractive to scrapers because it's so cheap (with spot instances, it starts at around 0.3 cents per hour), but it's not the only option. If you do go this route, Amazon publishes its IP ranges in machine-readable form, so you don't have to maintain the list by hand; see the sketch below.
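A minimal Python sketch of that, assuming you just want an in-memory lookup against Amazon's published ranges (this handles IPv4 only; IPv6 prefixes live in a separate key of the same file):

```python
import ipaddress
import json
import urllib.request

# Amazon publishes its current IP ranges as JSON. Filter for the EC2
# prefixes and check whether a visitor's address falls inside any of them.
RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def load_ec2_networks():
    with urllib.request.urlopen(RANGES_URL) as resp:
        data = json.load(resp)
    return [
        ipaddress.ip_network(p["ip_prefix"])
        for p in data["prefixes"]
        if p["service"] == "EC2"
    ]

def is_ec2(addr, networks):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)

networks = load_ec2_networks()
print(is_ec2("203.0.113.7", networks))  # documentation address; expect False
```

Whether you then block, throttle, or merely flag those addresses is up to you; flagging is the safer start given the Flipboard problem.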
2) Timing. Normal human traffic isn't rapid-fire; many scrapers, though, fire off requests as fast as they can. Throttle or block clients that don't have meaningful pauses between requests, along the lines of the sliding-window check sketched below.
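A minimal sliding-window sketch; the window length and threshold here are made-up values, so tune them against your real traffic (and note the `hits` dict grows unbounded; a real setup would expire idle entries):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # assumed window; tune to your traffic
MAX_REQUESTS = 20     # more than this per window looks like a scraper

hits = defaultdict(deque)

def looks_like_scraper(client_ip):
    now = time.time()
    q = hits[client_ip]
    q.append(now)
    # Drop timestamps that have fallen out of the window.
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS
```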
3) Report addresses to Amazon. I really don't know if they'd take action.
4) Reverse-lookup, or whitelist addresses. If it's a legitimate source (like Flipboard), they'd probably work with you at least a little. A reverse lookup won't always succeed, but it can help you whitelist legit sources that map their AWS IPs to their own DNS names; most scrapers just show the default AWS external hostname. Legit services also tend to send a distinctive user agent, so that's another signal for letting traffic through. A sketch of the reverse-lookup check follows.
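A sketch of that check, assuming a PTR record still ending in the Amazon default domain means nobody bothered to set up their own name (keep in mind PTR records can be misleading unless you forward-confirm them, so treat this as a hint, not proof):

```python
import socket

def reverse_name(addr):
    try:
        return socket.gethostbyaddr(addr)[0]
    except socket.herror:
        return None  # no PTR record at all

def looks_like_default_ec2(addr):
    # Default EC2 hostnames end in .amazonaws.com; legit services usually
    # point their IPs at their own domain instead.
    name = reverse_name(addr)
    return name is not None and name.endswith(".amazonaws.com")
```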
However, if you have a public resource, this is simply an issue you have to deal with.
Analytics: just filter scraped traffic out of your analytics.
Content duping: blocking scrapers won't stop this. If someone stealing your content can't scrape it at 0.5 cents per hour, they'll pay someone 5 cents an hour to copy/paste it. You just have to use the same diligence everyone else does, in terms of reporting to Google, etc.
Performance: put Varnish/nginx/etc. in front of the site to absorb the load from scrapers.
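For the analytics point, one option is scrubbing your access logs before they hit your analytics pipeline. A hypothetical sketch, assuming a combined-format log where the client IP is the first field, and with a placeholder network standing in for the EC2 ranges from the first sketch:

```python
import ipaddress

# Placeholder; in practice, substitute the EC2 ranges loaded earlier.
SCRAPER_NETS = [ipaddress.ip_network("203.0.113.0/24")]

def scrub_log(in_path, out_path):
    # Copy the log, dropping lines whose client IP is in a scraper network.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            try:
                ip = ipaddress.ip_address(line.split(" ", 1)[0])
            except ValueError:
                dst.write(line)  # malformed line; keep it
                continue
            if not any(ip in net for net in SCRAPER_NETS):
                dst.write(line)

scrub_log("access.log", "access.clean.log")  # assumed file names
```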