
Ask HN: How are you dealing with scraping hits from EC2 machines? - digitalpbk
My website is constantly being hit by scrapers running on EC2 machines http://d.pr/i/bLtE . I went aggressive and blocked all access from EC2 IPs http://bit.ly/SUOaof until I realized that quite a few reader proxies like Flipboard are based on EC2, so blanket blocking of these Amazon machines won't help. How is the community dealing with this problem? Can you advise?

Edit: I've seen it mentioned that Stack Overflow blocks all EC2 machines. I don't think that's the optimal solution, considering the many legit services there. Also, the hits come from different IPs.
======
bdcravens
I've done a lot of automation with Selenium, on EC2. (For the curious, it was
on behalf of clients with legit access, not wholesale pillaging of a public
resource.)

1) You can block EC2 wholesale. You've mentioned issues with this, and it can be
bypassed via VPN or another network. EC2 is attractive because it's so
cheap (with spot instances, it starts at 0.3 cents per hour), but it's not the
only option.

2) Timing. Normal traffic isn't rapid fire. Many scrapers, however, fire off
their scripts as quickly as possible. Block traffic that doesn't have enough
meaningful pauses.
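
A rough sketch of what I mean, in Python (the window size and threshold are
made-up numbers, not recommendations):

    import time
    from collections import defaultdict, deque

    WINDOW = 20      # hypothetical: judge each IP on its last 20 requests
    MIN_GAP = 0.5    # hypothetical: flag clients averaging under 0.5s between hits

    recent_hits = defaultdict(lambda: deque(maxlen=WINDOW))

    def looks_like_rapid_fire(ip):
        """Record a hit and report whether this IP is firing too fast on average."""
        hits = recent_hits[ip]
        hits.append(time.time())
        if len(hits) < WINDOW:
            return False
        avg_gap = (hits[-1] - hits[0]) / (len(hits) - 1)
        return avg_gap < MIN_GAP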

3) Report addresses to Amazon. I really don't know if they'd take action.

4) Reverse lookup, or whitelist addresses. If it's a legitimate source
(like Flipboard), they'd probably work with you at least a little bit. Reverse
lookup might not always succeed, but it can help you whitelist any
legit sources that map their AWS IP to their own DNS name; most scrapers just
use the AWS external domain name. Also, I imagine legit sources send a
distinctive user agent, so that can help you let their traffic through.
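
A hedged sketch of that reverse-lookup triage (the whitelist suffix is only an
example, and ideally you'd forward-confirm the hostname as well):

    import socket

    # Example suffix only -- build your own list of sources you trust.
    WHITELISTED_SUFFIXES = ("flipboard.com",)

    def classify_ip(ip):
        """Rough triage: whitelisted rDNS, generic EC2 hostname, or unknown."""
        try:
            hostname = socket.gethostbyaddr(ip)[0]
        except OSError:
            return "unknown"          # no reverse record at all
        if hostname.endswith(WHITELISTED_SUFFIXES):
            return "whitelisted"
        if hostname.endswith(".amazonaws.com"):
            return "generic-ec2"      # default AWS external name, likely a scraper
        return "unknown"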

However, if you have a public resource, this is simply an issue you have to
deal with. Analytics: I'd just filter scraped traffic out of your
analytics. Content duping: blocking scrapers won't stop this. If someone
stealing your content can't scrape at 0.5 cents per hour, they'll pay someone
5 cents an hour to copy/paste. You just have to use the same diligence others
use, in terms of reporting to Google, etc. Performance: use Varnish/nginx/etc.
to combat the performance hit from scrapers.

~~~
logn
> 3) Report addresses to Amazon. I really don't know if they'd take action.

They take action. I've seen it done.

~~~
lifeguard
They take action on _abuse_. I am not sure scraping is abuse.

------
ressaid1
There is no silver bullet for stopping web scrapers. If you just try to block
User Agents or IPs, all you are doing is putting a small hurdle in their way.
You have to employ a lot of different tools to be able to make the wall high
enough that they actually stop trying to scrape you. Some of the key things
you need to do are:

behavioral modeling - rate limiting, bandwidth restrictions, etc

identity verification - make sure they are running the browser they say they
are, allow Google and other search engines by whitelisting their IPs, block
others that are pretending to be Google, etc. (a rough check for the Google
case is sketched after this list)

code obfuscation - make it hard for them to scrape your code. Change up the
CSS, etc.
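
For the "pretending to be Google" case, the usual check is reverse DNS plus a
forward confirmation; a minimal sketch in Python:

    import socket

    def is_real_googlebot(ip):
        """Reverse-resolve the IP, check the Google domain, then forward-confirm."""
        try:
            hostname = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # The claimed hostname must resolve back to the same IP.
        try:
            forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
        except OSError:
            return False
        return ip in forward_ips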

OR you can use an automated service to do all this for you. Check out
www.distil.it. Full disclosure, I'm the CEO of Distil.

------
lifeguard
Use mod_qos, but keep a close eye on it for the first few weeks as you tune it:

<http://opensource.adnovum.ch/mod_qos/>

------
dumbfounder
I had problems with people scraping Twicsy so hard that it was taking the site
down. For a while I would manually review the top IP addresses requesting
pages a couple of times per day, look for patterns, and ban IPs based on that.
Then I created a script based on the patterns I had recognized to do it
automatically.
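
A throwaway sketch of that review step, assuming common-log-format access logs
with the client IP as the first field (the file name is just an example):

    from collections import Counter

    def top_ips(logfile="access.log", n=20):
        """Tally requests per client IP and return the heaviest hitters."""
        counts = Counter()
        with open(logfile) as f:
            for line in f:
                counts[line.split(" ", 1)[0]] += 1
        return counts.most_common(n)

    for ip, hits in top_ips():
        print(f"{hits:8d}  {ip}")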

But then I just made Twicsy fast enough to deal with the traffic so I don't
need to worry about it anymore. I guess it depends on your business model
whether or not that will work for you.

~~~
covati
We've blocked a few of the worst, but mostly just added servers to deal with
the load.

We actually found out who one of the worst offenders was and contacted them. It
turned out to be a major legit proxy, but they had a bug in their proxy code
that caused one of our URLs to be refetched over and over. They were very easy
to work with and they fixed the bug.

------
centdev
Are there known user agents that you can identify as coming from the scrapers?
If so, you can either block that way and not worry about IP addresses, or
disable Google Analytics on those requests so they don't skew the GA data.

~~~
bdcravens
As was identified, most of the scraping occurs via Selenium. Selenium just
automates a real browser, so it'd look like 100% legit traffic with a legit
user agent (it defaults to using Firefox).
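
For context, driving a real browser takes only a few lines with the Selenium
Python bindings, which is why the traffic looks so ordinary:

    from selenium import webdriver

    # Launches an actual Firefox with its stock user agent; it runs JavaScript,
    # sets cookies, and loads assets just like a person's browser would.
    driver = webdriver.Firefox()
    driver.get("http://example.com/")
    html = driver.page_source
    driver.quit()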

~~~
chewxy
So the solution is simple, isn't it? Google Analytics filter > filter out
referrers like Selenium (or the entire EC2 block).

~~~
bdcravens
Not sure how you could filter Selenium. A well written script looks like
normal traffic.

~~~
seats
A well written scraper looks like normal traffic in the same way that a well
written pseudo random number generator looks like a random number generator.
It'll fool your eye but not statistical analysis.

Think about the goal of a scraper: it needs to actually walk through all the
content. That doesn't look like a normal user at all. An individual request
might look OK, but in aggregate the pattern of a robot pops out.

<https://www.usenix.org/conference/usenixsecurity12/pubcrawl-protecting-users-and-businesses-crawlers>
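
One coarse way to see that aggregate pattern in your own logs: per IP, measure
how much of your URL space it has touched compared to a typical visitor. A
rough sketch, assuming common-log-format access logs:

    from collections import defaultdict

    def coverage_by_ip(logfile="access.log"):
        """Fraction of all distinct URLs that each client IP has requested."""
        urls_by_ip = defaultdict(set)
        all_urls = set()
        with open(logfile) as f:
            for line in f:
                parts = line.split('"')
                if len(parts) < 2:
                    continue
                request = parts[1].split()   # e.g. GET /some/page HTTP/1.1
                if len(request) < 2:
                    continue
                ip, url = line.split(" ", 1)[0], request[1]
                urls_by_ip[ip].add(url)
                all_urls.add(url)
        return {ip: len(urls) / len(all_urls) for ip, urls in urls_by_ip.items()}

    # Humans touch a tiny slice of a site; a crawler's coverage creeps toward 1.0.
    for ip, frac in sorted(coverage_by_ip().items(), key=lambda kv: -kv[1])[:10]:
        print(f"{frac:6.1%}  {ip}")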

~~~
brechin
So, would that detection mechanism be able to deal with a number of
coordinated scrapers rotating through lists of proxies, using different User-
Agent strings, making requests with (pseudo-)random delays between requests?

~~~
seats
yup, read the paper

------
craiglaw2
You could add a CAPTCHA for suspect client IPs. You could implement some sort
of real-time log analysis / monitoring based on client IP, X-Forwarded-For
headers, and User-Agent, and possibly block certain geo-IP ranges outside your
core customers. Inserting a JavaScript "puzzle" is also a way to test that a
browser is generating the traffic and not a script.
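
A toy sketch of the JavaScript "puzzle" idea as a hypothetical Flask endpoint
(the cookie name and secret are invented for illustration; note this only
filters out clients that never execute JS, not tools that drive a real
browser):

    import hashlib
    import secrets
    from flask import Flask, make_response, request

    app = Flask(__name__)
    SECRET = secrets.token_hex(16)   # hypothetical per-deployment secret

    def expected_token(ip):
        return hashlib.sha256((ip + SECRET).encode()).hexdigest()

    @app.route("/")
    def index():
        ip = request.remote_addr
        if request.cookies.get("js_check") == expected_token(ip):
            return "real content goes here"
        # No valid cookie yet: serve a tiny page whose JS sets it and reloads.
        page = ('<script>document.cookie = "js_check=%s; path=/"; '
                'location.reload();</script>' % expected_token(ip))
        return make_response(page)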

------
jwr
If these scrapers abide by robots.txt (you do have a robots.txt, right?) and
scrape only whatever you left publicly accessible, then I think you should
work on your server performance, because if a scraper causes problems, you'll
have much worse problems once whatever you've built becomes popular.

Hunting down bots is a waste of time and effort better spent elsewhere.

------
heroic
Use this: <http://wiki.nginx.org/HttpLimitReqModule>

------
chewxy
Let's examine your motivation: why do you want to block said scrapers in the
first place? SEO concerns (dupe content)?

~~~
digitalpbk
Mostly duplicate content & messing up my analytics (increased bounce rate,
decreased time spent on page, etc.)

~~~
digitalpbk
Because it seems to be from Selenium (based on the referer), it is triggering
the JS too; we are using Google Analytics.

~~~
ig1
Why not exclude the EC2 IP range from analytics?

------
reefoctopus
Many have suggested editing your robots.txt. This is absolutely the first step
you should take. You could try blocking the crawlers by name or limiting the
request rate with a crawl delay in the robots.txt.

If the crawler ignores your robots.txt, check its name in your access logs.
Often, people build things and set them loose without thinking about the
consequences. Many crawlers have a homepage / programmer contact information
somewhere on the web. Let them know they are hammering your website.

What is the rate at which requests are being made? Are they making 1000
requests per second? Downloading tons of images? You should probably just
ignore it if it is less than 1 request per second.

------
brechin
Scraping, in itself, is often not prohibited, wrong, illegal, or against most
sites' TOS. Your site would allow me, for example, to scrape all your content
for my own personal use, but I couldn't re-publish or re-sell the info.

It seems hard to limit legitimate uses of a free resource without changing the
requirements on how users access the site (require account signup, use
CAPTCHAs, use CSS/JS to only display properly in a browser).

As one who does a lot of scraping, I have encountered few barriers that can't
be (legally) overcome with a reasonable amount of effort.

~~~
pbhjpbhj
> _Your site would allow me, for example, to scrape all your content for my
> own personal use_ //

On what basis are you claiming this? It sounds like it would be true under the
fair-use clauses of US copyright law, but it's certainly not true in the UK
(and by extension, I presume, for you to do it with content served from the UK,
though I've yet to read a thorough treatment of how the [i.e. any] law works
with server locations).

Commercial considerations are usually much broader than selling too: not only
could you not resell it but you couldn't distribute it (whether by publishing
or otherwise).

~~~
brechin
From the TOS on his site:

"Cucumbertown authorizes you to view, download and/or print the Materials only
for personal, non-commercial use, provided that you keep intact all copyright
and other proprietary notices contained in the original Materials."

So, for example, I could grab everything from the site (minus copyrighted
images, etc.) and make my own personal DB of the content. Obviously that's a
lot of effort for a little reward for one person, but if I created a repo with
a set of tools for people to do this for themselves it could become a big
legitimate source of "scraping" traffic.

------
jarin
CloudFlare's scrapeshield might help:

<https://www.cloudflare.com/apps/scrapeshield>

------
adrianoconnor
Do you have a robots.txt? That's the standard way.

~~~
dumbfounder
Calling them "scrapers" implies they are doing something nefarious (stealing
content). Robots.txt is for law-abiding bots.

~~~
ig1
Not really. "Scraping" just refers to extracting data from a site using an
automated method, it doesn't have any connotations about the motivation or
acceptability of the process.

------
gingerjoos
So the referer is a local Selenium server? How did you figure out it was an
EC2 machine?

~~~
dumbfounder
That second link digitalpbk gives shows all public EC2 IP address ranges:
<http://bit.ly/SUOaof>

------
Toshio
I assume HN itself is also being scraped to death, so ...

s/Ask HN/Ask PG/

