"I've decided to block all crawlers to the site other than Google or Bing"
And this is why I find it frustrating when people respond to privacy complaints about Google with "if you want to switch search engines, no one is stopping you".
Additionally, I would like to point out that according to those numbers, there are ~41,000 (418,814 - 199,725 - 40,359 - 36,340 - 33,893 - 26,325 - 13,458 - 10,657 - 6,109 - 5,993 - 4,959) additional robot hits, many of them from bots that only visit one or two pages. For comparison, that's more requests than Googlebot made. Flat-out banning everything frustrates people and encourages them to ignore robots.txt entirely.
If you're being hammered by a bot, contact the bot's owner! Most bots have a link in the user agent that you can follow. Barring that, ban the specific bot. But don't ban everything just because a few (proximic and ADmantX) are hammering the site.
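For reference, a targeted block in robots.txt might look like the sketch below. The user-agent tokens are examples based on the bots mentioned above; check your logs for the exact token each crawler sends, and keep in mind this only works for bots that actually honor robots.txt (truly abusive ones may need to be blocked at the server level instead).

User-agent: proximic
Disallow: /

User-agent: ADmantX
Disallow: /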
> But don't ban everything just because a few (proximic and ADmantX) are hammering the site.
I can understand blocking somebody who has a long-term, clear pattern of disrupting your site, ignoring robots.txt, and providing nothing back to you. But I find the idea of somebody preemptively blocking everything but Google and maybe Bing extremely distasteful.
If everybody out there blocked everything but Google/Bing, it would make it very difficult for anybody to ever create a new search engine, build new types of web services, or analyze data in new ways.
Possibly a better solution is making the Common Crawl initiative a better project: update it more frequently, make it easier to get started with, provide better documentation, etc. If every web service out there that wants to crawl the web could be persuaded to contribute to it, that would lighten the load on everybody. http://commoncrawl.org
This is a fair point; however, blacklisting isn't necessarily a perfect solution either. It would require continuous manual effort going through the logs and blocking bad bots, and if some new bot were to misbehave and crawl too aggressively, blacklisting would only help after the fact.
I do think I made a mistake, though. To your point, I shouldn't block crawlers that both behave and are attempting to help my site in some way by driving traffic to it (i.e. search engines). Whether or not they are currently driving traffic to the site is not important. I'll whitelist Yandex, Baidu, Scoutjet and any other related bots I see, and edit the post.
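A whitelist along those lines might look something like the sketch below in robots.txt. An empty Disallow line means "allow everything"; the user-agent tokens shown are examples and should be checked against each search engine's crawler documentation.

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: Yandex
Disallow:

User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /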
Whitelisting a couple of bots now doesn't help at all for any new search engines trying to start up. What are they to do, contact every site admin individually?
> if some new bot were to misbehave and crawl too aggressively, blacklisting would only help after the fact.
In that case, don't blacklist all bots; simply add a crawl delay for any bots that you haven't specifically allowed:
User-agent: *
Crawl-delay: 10
This allows minor bots to continue crawling the site while cutting back on bandwidth costs from the couple that are being overly aggressive.
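And if you want your preferred search engines crawling at full speed, a more specific group overrides the wildcard for any crawler that matches it, so a sketch like the following should work (the Googlebot token is just an example). Caveat: Crawl-delay is a non-standard directive; Bing and Yandex honor it, but Googlebot is generally documented as ignoring it.

User-agent: Googlebot
Disallow:

User-agent: *
Crawl-delay: 10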