
Ask HN: Best way to block an EC2-hosted scraper? - brandnewlow
Hey, HN,

Windy Citizen gets slooooow about once every hour, and according to our logs, around this same time someone or something is making a ton of requests to our RSS feeds, grabbing hundreds of them at once.

From what I can tell, the thing grabbing this stuff is hosted on Amazon EC2. I've tried blocking the IP address before, but it seems to refresh and then the problem comes back. How do I shut this idiot down?

The feed URLs being grabbed all have the following format: /neighborhood/*/feed

These are old URLs from a prior schema we had. They're not even valid anymore; I think this is part of the problem. Basically this scraper is causing a ton of 404s every 30 minutes.

Is there a way to just block anything trying to hit URLs that match a regex for that URL structure? Something else?

Update: I've added this to my nginx.conf file:

    location ~* /neighborhoods/[-\w]+/feed/?$ {
        deny all;
    }

And it appears to be working. It's successfully sending a 403 when people request those URLs.

Now, anyone have suggestions for fun things I can redirect the scraper to?
======
jrockway
I use an iptables rule like:

    
    
        # connection limit for HTTP
        -A INPUT -i eth0 -p tcp --syn --dport www -m connlimit ! --connlimit-above 128 --connlimit-mask 24 -j ACCEPT
    

This basically means, "only accept new connections to port 80 if the /24 has
128 or fewer connections already open". It's meant to kill things like
Slowloris, but you can bring the limit down and kill scrapers. You can also
block based on connection rate:

    
    
        -N RATE_CHECK
        -A INPUT -p tcp -m multiport --dports www -m state --state NEW -j RATE_CHECK
        -A RATE_CHECK -m recent --set --name RATE
        -A RATE_CHECK -m recent --update --seconds 60 --hitcount 4 --name RATE -j REJECT
        -A RATE_CHECK -p tcp -m multiport --dports www -j ACCEPT
    

This will reject a host's connections once it has opened 4 new connections
within a single minute. I use this rule for ssh and smtp; with some tweaking
it could work for a web server too.

(Note: all these rules are for default deny with a rule like "-A INPUT -m
state --state ESTABLISHED,RELATED -j ACCEPT" to ignore established states. If
your firewall setup is different, then the rules will need to be modified.)

There is also a program called fail2ban that will read logs and apply
temporary bans based on what the logs say.

My solution to this problem is to just cache aggressively. Hundreds of requests
per second is nothing for Varnish.

~~~
ehutch79
Be careful with things like this; it could get you locked out of your server
very quickly.

denyhosts and fail2ban are better ideas for dropping bad actors on the net.

~~~
jrockway
It's not a problem; the rules only apply to eth0, but I connect over tun0.
And, it's time-limited to 60 seconds either way. And, I have web console
access.

So this is not something I worry about. ssh password crackers and script
kiddies taking down my website are a little more worrisome.

------
tptacek
Do you have a document on your site establishing that this idiot is violating
your acceptable use policy? Then you should report them to Amazon's abuse
group:

<http://aws-portal.amazon.com/gp/aws/html-forms-controller/contactus/AWSAbuse>

You can use technical countermeasures to defend your site, but you'd be doing
the rest of the Internet a favor to get them knocked off Amazon.

~~~
brandnewlow
We do. I'll send that over to them. Great idea!

~~~
CWuestefeld
Have you heard back from Amazon? We've got a similar incident going on, which
I reported last Friday (3 business days ago), but haven't yet seen any
acknowledgment from them.

------
joshu
Feed them rss containing uniquely generated strings. Then google for the
unique strings.

Also, block generic user-agents on rss feeds. Most everyone was able to change
theirs for delicious, and we did an insane amount of rss traffic.
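
If you're fronting the feeds with nginx, blocking the generic agents can be a couple of lines; this is just a sketch, and the UA list below is an illustrative guess rather than a vetted one:

    # inside the location serving the feeds: refuse empty or generic library UAs
    if ($http_user_agent ~* "^$|^Java/|^Python-urllib|^libwww-perl|curl|wget") {
        return 403;
    }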

~~~
a5seo
I found this to work very well.

Only thing is you need to append keys randomly, say one out of 5 requests, in
different locations if possible. And you should hand out 20 or so different
unique strings.

Then use Google Alerts to find them republishing your stuff and send DMCA
takedown notices with Google (don't send them a c&d first so you have the
element of surprise and maximum chance of causing real pain). If they don't
take care of the DMCA issue with Google, Google will remove their pages from
the index. They won't know why but their site will slowly die.

~~~
joshu
google alerts = clever.

the other thing to do is include some encoding of the served-to IP address
(maybe an md5 of the first and last half of the IP?)

~~~
ay
To generate the unique string, do SHA1(content + TheirIP + unique secret). The
chance of collision is practically zero. Also, you can then conclusively prove
that it was _you_ who created this unique string - which may come in handy.

(Talking about SHA1 because I used it myself in a lighter version - just a
hash of my name - to check how the search engines work with the content of my
little blog).
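
A minimal sketch of that in Python (the secret and the function name are just placeholders, not anything the site actually uses):

    import hashlib

    SECRET = "some-private-string-only-you-know"  # placeholder secret

    def watermark(content, client_ip):
        # SHA1(content + their IP + secret): unique per item and per requester,
        # and only the holder of the secret can reproduce it later as proof
        token = hashlib.sha1((content + client_ip + SECRET).encode("utf-8"))
        return token.hexdigest()

    # e.g. tack watermark(item_body, request_ip) onto the end of each feed item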

~~~
a5seo
I like it, but you need to do it in a way that also appends a
known/deterministic string, so you can monitor THAT one in Google Alerts.

I don't care whether the unique strings collide (although I love the idea of
being able to tie the IP address of the crawler to the page on which the
content was published for maximum evidence collection), I just want Google
Alerts to find them reliably AND I don't want the spammer to catch on and be
able to easily remove the unique strings before I catch him.

The only way for him to find my little landmines (to strip them) would be to
read through every piece of content he scraped from me.

~~~
ay
By the way, thinking of all this: why not turn this scraper into free
advertising for yourself? If you can detect when it crawls your site, just
randomly insert backlinks to your site into the content you give out to them.

This way you will _want_ them to steal more from you :-)

(How to put these backlinks in a way that makes them difficult to remove is
another story. But if you already use other links, e.g. tinyurl, in your
material, then you could probably use that :-) Still fairly simple to remove,
but it would require more work from them.)

------
subway
Interestingly I don't see any mention of your robots.txt, or if the scraper
identified themselves in the UA string. If the person or organization running
the scraper is legit, they would probably like to know their software is
misbehaving, so that they can fix the issue. Did you attempt to contact them
before launching into a cat and mouse game?

~~~
petercooper
Yeah, definitely try contacting them first. Even if their software is being
idiotic, they might be a fellow startup or someone who's a bit technically dim
but has reasonable intentions. Of course, if you get the brush off, then you
have carte blanche for raising hell ;-)

------
bretpiatt
You can use Varnish to cache all of these requests and just never expire them.
Make the page as small as possible to keep the overhead of serving the
requests minimal.

<http://www.varnish-cache.org/>
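
As a rough sketch of what that looks like in VCL (written against 2010-era Varnish 2.1 syntax; the URL pattern is just borrowed from the OP's regex):

    # vcl_recv: strip cookies on the feed URLs so Varnish will cache them
    sub vcl_recv {
        if (req.url ~ "^/neighborhoods/[-\w]+/feed") {
            remove req.http.Cookie;
            return (lookup);
        }
    }

    # vcl_fetch: hold the (tiny) response for a very long time
    sub vcl_fetch {
        if (req.url ~ "^/neighborhoods/[-\w]+/feed") {
            set beresp.ttl = 30d;
        }
    }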

~~~
buro9
You can also use Varnish ACLs to block the EC2 IP addresses.

And if they have a really obvious user-agent you could block that using
Varnish: <http://omninoggin.com/web-development/block-unwanted-spam-bots-using-varnish-vcl/>

And if you're not afraid of using Varnish's VCL InlineC stuff, then you could
add rate limiting to Varnish: <http://drcarter.info/2010/04/how-fighting-against-scraping-using-varnish-vcl-inline-c-memcached/>

Basically... Varnish has become my Swiss army tool for rejecting crap traffic
(and caching good traffic obviously).
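
For the first two suggestions, a bare-bones VCL sketch (the netblock is a made-up example; substitute the real EC2 ranges from Amazon's list, and whatever user-agent the scraper actually sends):

    acl ec2 {
        "174.129.0.0"/17;    # example range only
    }

    sub vcl_recv {
        if (client.ip ~ ec2) {
            error 403 "Forbidden";
        }
        if (req.http.User-Agent ~ "SomeObviousBot") {
            error 403 "Forbidden";
        }
    }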

------
joewest
Amazon previously kept a semi-official list of all netblocks for EC2 in their
forum. They've recently moved to posting updates as official announcements;
here is the latest:

<https://forums.aws.amazon.com/ann.jspa?annID=877>

It's interesting to see the size of their deployments measured in usable IP
space.

------
kleinmatic
If something like this can slow your servers down, you're probably going to be
in serious trouble when you get an inbound link from a heavily trafficked
site. I'd recommend you put a caching proxy like Varnish in front of your web
server process.

Varnish serves 404s quickly and happily, and will keep the storm from even
reaching your apaches. You won't even notice it when the scrapers come in.

------
bbuffone
You can simply contact Amazon AWS and report them. Having run yottaa.com live
for the last 6 months, we've had our monitoring nodes reported before, and if
they're causing problems we fix them.

We try to provide a valuable service to people and would want to know if we
were causing problems. We have implemented the ability to work with
robots.txt files.

If the nodes are not from a legitimate service then they will be shut down.

------
rphlx
The proper way to fix this is to add a per-IP rate limiter to the entire site.
Most other fixes are just a cat-n-mouse game.
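
On nginx that can be done with the stock limit_req module; a minimal sketch (the zone name, rate, and burst are arbitrary examples to tune):

    # http block: track clients by IP, allow roughly 2 requests/second each
    limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

    # server/location block: enforce the limit, absorbing short bursts
    location / {
        limit_req zone=perip burst=20;
    }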

------
getsat
If you're feeling nefarious, 301 their requests to a static, empty RSS feed.
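
In nginx that could be a one-liner along these lines (the empty-feed path is just a placeholder):

    # redirect the dead feed URLs to one tiny static file
    rewrite ^/neighborhoods/[-\w]+/feed/?$ /static/empty.xml permanent;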

If you can figure out who's actually running it (does the user agent say?),
simply sending them an email and asking them to throttle their script may be
the simplest solution.

~~~
brandnewlow
Agent doesn't say unfortunately.

~~~
joewest
You should report the behavior to Amazon:

<http://aws-portal.amazon.com/gp/aws/html-forms-controller/contactus/AWSAbuse>

In my experience they take it seriously and deal with it promptly.

------
mike-cardwell
Depends on the technology you're using to serve the rss feeds. I'd consider
writing something to tail my access logs and automatically update firewall
rules when a host generates a certain number of 404's in a certain period of
time.

~~~
mooism2
It shouldn't be too difficult to write a fail2ban rule to do this.
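
Something along these lines, for instance (the filter name, log path, and thresholds are all placeholders to adapt):

    # /etc/fail2ban/filter.d/feed-404.conf
    [Definition]
    failregex = ^<HOST> .* "GET /neighborhoods/.*/feed[^"]*" 404
    ignoreregex =

    # /etc/fail2ban/jail.local
    [feed-404]
    enabled  = true
    filter   = feed-404
    action   = iptables-allports[name=feed404]
    logpath  = /var/log/nginx/access.log
    findtime = 60
    maxretry = 20
    bantime  = 3600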

------
cheald
mod_rewrite does exactly what you want here. Rewrite URLs matching that
pattern to a static file, so the request returns quickly and has very little
overhead.
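
A rough sketch for the Apache side (the static file path is a placeholder):

    # hand the dead feed URLs a tiny static file instead of hitting the app
    RewriteEngine On
    RewriteRule ^/neighborhoods/[-\w]+/feed/?$ /static/empty.xml [L]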

------
eli
I assume you're running Apache? Have you taken a look at ModSecurity?

I found it a little tricky to set up (the core rules it ships with by default
are way too aggressive, IMHO), but it does exactly what you're asking: block
requests by regex.
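
A minimal ModSecurity rule to that effect might look like this (untested sketch; exact placement depends on your config):

    # deny anything whose URI matches the dead feed pattern
    SecRule REQUEST_URI "^/neighborhoods/[-\w]+/feed" "id:100001,phase:1,deny,status:403,log"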

~~~
brandnewlow
We're on a hybrid nginx/apache setup.

I just added some lines to my nginx.conf though that appear to be working
based on tips from here and elsewhere:

location ~* /neighborhoods/[-\w]+/feed/?$ { deny all; }

------
neworbit
this is why I love HN, all the suggestions have been actually constructive and
nobody is suggesting 4channy nonsense

