

Ask HN: how to ban aggressive web crawlers - 42tree

They don't respect robots.txt. They have hundreds of different IP addresses.
Sometimes they hit our server heavily, so it's easy to spot. But most of the
time they maintain a well designed crawling speed such that they won't be
easily detected.

Are there any solutions to ban bad bots?
======
42tree
I found the following answer quite interesting:

[http://stackoverflow.com/questions/8404775/how-to-identify-web-crawler](http://stackoverflow.com/questions/8404775/how-to-identify-web-crawler)

"There are two general ways to detect robots and I would call them
"Polite/Passive" and "Aggressive". Basically, you have to give your web site a
psychological disorder. Polite

These are ways to politely tell crawlers that they shouldn't crawl your site
and to limit how often you are crawled. Politeness is ensured through a
robots.txt file in which you specify which bots, if any, should be allowed to
crawl your website and how often your website can be crawled. This assumes
that the robot you're dealing with is polite.
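For example, a minimal robots.txt along these lines asks all crawlers to stay
out of /private/ and to pace themselves (the path is made up, and Crawl-delay
is a non-standard directive that some crawlers, Googlebot included, simply
ignore):

    User-agent: *
    Disallow: /private/
    Crawl-delay: 10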

Aggressive

Another way to keep bots off your site is to get aggressive.

User Agent

Some aggressive behavior includes (as previously mentioned by other users) the
filtering of user-agent strings. This is probably the simplest, but also the
least reliable way to detect if it's a user or not. A lot of bots tend to
spoof user agents and some do it for legitimate reasons (e.g. they only want
to crawl mobile content), while others simply don't want to be identified as
bots. Even worse, some bots spoof legitimate/polite bot agents, such as the
user agents of Google, Microsoft, Lycos and other crawlers which are generally
considered polite. Relying on the user agent can be helpful, but not by
itself.
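As a rough illustration in Python, a user-agent check can be a simple
substring match against a blocklist; the substrings below are examples only,
not an authoritative list:

    # Minimal user-agent filter sketch. The substrings are illustrative
    # examples, not a maintained blocklist.
    BAD_UA_SUBSTRINGS = ["scrapy", "python-requests", "curl", "wget"]

    def looks_like_bad_bot(user_agent: str) -> bool:
        """Return True if the user-agent matches a known-bad substring."""
        ua = (user_agent or "").lower()
        # An empty user agent is itself suspicious for a normal browser.
        if not ua:
            return True
        return any(bad in ua for bad in BAD_UA_SUBSTRINGS)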

There are more aggressive ways to deal with robots that spoof user agents AND
don't abide by your robots.txt file:

Bot Trap

I like to think of this as a "Venus Fly Trap," and it basically punishes any
bot that wants to play tricks with you.

A bot trap is probably the most effective way to find bots that don't adhere
to your robots.txt file without actually impairing the usability of your
website. Creating a bot trap ensures that only bots are captured and not real
users. The basic way to do it is to set up a directory which you specifically
mark as off limits in your robots.txt file, so any robot that is polite will
not fall into the trap. The second thing you do is to place a "hidden" link
from your website to the bot trap directory (this ensures that real users will
never go there, since real users never click on invisible links). Finally, you
ban any IP address that goes to the bot trap directory.

Here are some instructions on how to achieve this: Create a bot trap (or in
your case: a PHP bot trap).
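A sketch of the same idea in Python rather than PHP, assuming a hypothetical
/trap/ path that is disallowed in robots.txt and reachable only through an
invisible link such as <a href="/trap/" style="display: none;"></a>:

    # Bot-trap sketch: anything requesting /trap/ is assumed to be a bot,
    # because polite bots are kept out by robots.txt and humans never see
    # the hidden link. Path name and file-based storage are illustrative.
    BANNED_IPS_FILE = "banned_ips.txt"

    def record_trapped_ip(remote_addr: str) -> None:
        """Add the offending IP to a simple ban list."""
        with open(BANNED_IPS_FILE, "a") as f:
            f.write(remote_addr + "\n")

    def is_banned(remote_addr: str) -> bool:
        """Check whether an IP has previously fallen into the trap."""
        try:
            with open(BANNED_IPS_FILE) as f:
                return remote_addr in {line.strip() for line in f}
        except FileNotFoundError:
            return False

Your front controller (or web server config) would then call is_banned() on
every request and refuse anything on the list.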

Note: of course, some bots are smart enough to read your robots.txt file, see
all the directories which you've marked as "off limits" and STILL ignore your
politeness settings (such as crawl rate and allowed bots). Those bots will
probably not fall into your bot trap despite the fact that they are not
polite.

Violent

I think this is actually too aggressive for the general audience (and general
use), so if there are any kids under the age of 18, then please take them to
another room!

You can make the bot trap "violent" by simply not specifying a robots.txt
file. In this situation ANY BOT that crawls the hidden links will probably end
up in the bot trap and you can ban all bots, period!

The reason this is not recommended is that you may actually want some bots to
crawl your website (such as Google, Microsoft or other bots for site
indexing). Allowing your website to be politely crawled by the bots from
Google, Microsoft, Lycos, etc. will ensure that your site gets indexed and it
shows up when people search for it on their favorite search engine.

Self Destructive

Yet another way to limit what bots can crawl on your website is to serve
CAPTCHAs or other challenges which a bot cannot solve. This comes at the
expense of your users and I would think that anything which makes your website
less usable (such as a CAPTCHA) is "self destructive." This, of course, will
not actually block the bot from repeatedly trying to crawl your website, it
will simply make your website very uninteresting to them. There are ways to
"get around" the CAPTCHAs, but they're difficult to implement so I'm not going
to delve into this too much.

Conclusion

For your purposes, probably the best way to deal with bots is to employ a
combination of the above-mentioned strategies:

    
    
    Filter user agents.
    Set up a bot trap (the violent one).

Catch all the bots that go into the violent bot trap and simply black-list
their IPs (but don't block them). This way you will still get the "benefits"
of being crawled by bots, but you will not have to pay to check the IP
addresses that are black-listed due to going to your bot trap."
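A minimal sketch of that combination, using Flask purely for illustration (the
trap path, blocklist substrings, and in-memory ban set are made-up examples,
not a production setup):

    from flask import Flask, abort, request

    app = Flask(__name__)

    BAD_UA_SUBSTRINGS = ("scrapy", "python-requests", "curl", "wget")
    banned_ips = set()  # would normally be persisted / shared across workers

    @app.before_request
    def screen_client():
        ip = request.remote_addr
        ua = (request.headers.get("User-Agent") or "").lower()
        if ip in banned_ips:
            abort(403)
        if any(bad in ua for bad in BAD_UA_SUBSTRINGS):
            abort(403)

    @app.route("/trap/")
    def trap():
        # Reached only via a hidden link that robots.txt does not mention
        # (the "violent" variant), so ban whoever requests it.
        banned_ips.add(request.remote_addr)
        abort(403)

    @app.route("/")
    def index():
        # The hidden link that only crawlers will follow.
        return '<a href="/trap/" style="display: none;">do not follow</a>Hello!'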

------
42tree
Another interesting answer:

[http://stackoverflow.com/questions/233192/detecting-stealth-web-crawlers?rq=1](http://stackoverflow.com/questions/233192/detecting-stealth-web-crawlers?rq=1)

What options are there to detect web-crawlers that do not want to be detected?

(I know that listing detection techniques will allow the smart stealth-crawler
programmer to make a better spider, but I do not think that we will ever be
able to block smart stealth-crawlers anyway, only the ones that make
mistakes.)

I'm not talking about the nice crawlers such as googlebot and Yahoo! Slurp. I
consider a bot nice if it:

    
    
    identifies itself as a bot in the user agent string
    reads robots.txt (and obeys it)

I'm talking about the bad crawlers, hiding behind common user agents, using my
bandwidth and never giving me anything in return.

There are some trapdoors that can be constructed (updated list, thanks Chris,
gs); a sketch of one of these checks follows the list:

    
    
    Adding a directory only listed (marked as disallow) in the robots.txt,
    Adding invisible links (possibly marked as rel="nofollow"?),
        style="display: none;" on link or parent container
        placed underneath another element with higher z-index
    detect who doesn't understand CaPiTaLiSaTioN,
    detect who tries to post replies but always fails the Captcha.
    detect GET requests to POST-only resources
    detect interval between requests
    detect order of pages requested
    detect who (consistently) requests https resources over http
    detect who does not request image files (this in combination with a list of user-agents of known image-capable browsers works surprisingly well)
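As an illustration of the last two kinds of checks, here is a rough per-IP
tracker that flags clients which fetch many pages without ever requesting an
image, or whose request intervals are suspiciously regular; the thresholds and
data structure are made-up assumptions:

    import statistics
    import time
    from collections import defaultdict

    # Per-IP request history: (timestamp, was_image) tuples.
    # Thresholds below are illustrative guesses, not tuned values.
    history = defaultdict(list)

    def note_request(ip: str, path: str) -> None:
        is_image = path.lower().endswith((".png", ".jpg", ".jpeg", ".gif", ".webp"))
        history[ip].append((time.time(), is_image))

    def looks_like_crawler(ip: str) -> bool:
        events = history[ip]
        if len(events) < 20:
            return False  # not enough data yet
        # Heuristic 1: plenty of page fetches, but never any images.
        if not any(is_image for _, is_image in events):
            return True
        # Heuristic 2: request intervals are almost perfectly regular.
        times = [t for t, _ in events]
        gaps = [b - a for a, b in zip(times, times[1:])]
        return statistics.pstdev(gaps) < 0.05  # near-constant pacing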

Some traps would be triggered by both 'good' and 'bad' bots. You could combine
those with a whitelist (a short sketch follows the list):

    
    
    It triggers a trap
    It requests robots.txt
    It does not trigger another trap because it obeyed robots.txt
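Put together as a simple classification (the boolean names are made up for
this sketch):

    def classify(triggered_trap: bool, fetched_robots_txt: bool,
                 obeyed_robots_txt: bool) -> str:
        """Rough good-bot/bad-bot/human guess from the three signals above."""
        if not triggered_trap:
            return "probably human"
        if fetched_robots_txt and obeyed_robots_txt:
            return "good bot"  # whitelisted: tripped a generic trap but respects robots.txt
        return "bad bot"       # candidate for banning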

One other important thing here is: please consider blind people using screen
readers: give people a way to contact you, or solve a (non-image) Captcha to
continue browsing.

What methods are there to automatically detect web crawlers trying to mask
themselves as normal human visitors?

Update: The question is not "How do I catch every crawler?" but "How can I
maximize the chance of detecting a crawler?"

Some spiders are really good, and actually parse and understand HTML, XHTML,
CSS, JavaScript, VBScript, etc... I have no illusions: I won't be able to beat
them.

You would however be surprised how stupid some crawlers are, with the best
example of stupidity (in my opinion) being: casting all URLs to lower case
before requesting them.
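That particular mistake is easy to exploit: if your site uses mixed-case URLs,
a request whose path matches a real route only case-insensitively is a strong
crawler signal (the route set below is a made-up example):

    # Known canonical routes; the mixed-case paths are illustrative examples.
    KNOWN_ROUTES = {"/Articles/Index", "/Articles/Detail", "/About"}
    KNOWN_ROUTES_LOWER = {route.lower() for route in KNOWN_ROUTES}

    def smells_like_lowercasing_crawler(path: str) -> bool:
        """Flag requests that hit a real route only when compared case-insensitively."""
        if path in KNOWN_ROUTES:
            return False  # exact match: a normal visitor following real links
        return path.lower() in KNOWN_ROUTES_LOWER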

And then there is a whole bunch of crawlers that are just 'not good enough' to
avoid the various trapdoors.

------
amikazmi
If they use different IPs and a normal user request speed, the only way to
differentiate is by irregular usage patterns (too many pages per session,
crawling breadth-first instead of depth-first, etc.)

But if they don't hit your server heavily, why do you care?

------
ohwp
Do they use "normal" user agents?

