"I've decided to block all crawlers to the site other than Google or Bing"
And this is why I find it frustrating when people respond to privacy complaints about Google with "if you want to switch search engines, no one is stopping you".
Additionally, I would like to point out that according to those numbers, there are ~41,000 (418,814 - 199,725 - 40,359 - 36,340 - 33,893 - 26,325 - 13,458 - 10,657 - 6,109 - 5,993 - 4,959) additional robot hits, many of them from bots that only visit one or two pages. For comparison, that's more requests than Googlebot made. Flat-out banning everything frustrates people and encourages them to ignore robots.txt entirely.
If you're being hammered by a bot, contact the bot's owner! Most bots have a link in the user agent that you can follow. Barring that, ban the specific bot. But don't ban everything just because a few (proximic and ADmantX) are hammering the site.
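For reference, a targeted block in robots.txt might look like the sketch below. The user-agent tokens are examples based on the bots mentioned above; check your logs for the exact token each crawler sends, and keep in mind this only works for bots that actually honor robots.txt (truly abusive ones may need to be blocked at the server level instead).

User-agent: proximic
Disallow: /

User-agent: ADmantX
Disallow: /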
> But don't ban everything just because a few (proximic and ADmantX) are hammering the site.
I can understand blocking somebody who has a long-term, clear pattern of disrupting your site, ignoring robots.txt, and providing nothing back to you. But I find the idea of somebody preemptively blocking everything but Google and maybe Bing extremely distasteful.
If everybody out there blocked everything but Google/Bing, it would make it very difficult for anybody to ever create a new search engine, build new types of web services, or analyze data in new ways.
Possibly a better solution is making the Common Crawl initiative a better project: update it more frequently, make it easier to get started with, provide better documentation, etc. If every web service out there that wants to crawl the web could be persuaded to contribute to it, that would lighten the load on everybody. http://commoncrawl.org
This is a fair point; however, blacklisting isn't necessarily a perfect solution either. It would require continuous manual effort going through the logs and blocking bad bots, and if some new bot were to misbehave and crawl too aggressively, blacklisting would only help after the fact.
I do think I made a mistake, though. To your point, I shouldn't block crawlers that both behave and are attempting to help my site in some way by driving traffic to it (i.e. search engines). Whether or not they are currently driving traffic to the site is not important. I'll whitelist Yandex, Baidu, Scoutjet and any other related bots I see, and edit the post.
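A whitelist along those lines might look something like the sketch below in robots.txt. An empty Disallow line means "allow everything"; the user-agent tokens shown are examples and should be checked against each search engine's crawler documentation.

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: Yandex
Disallow:

User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /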
Whitelisting a couple of bots now doesn't help at all for any new search engines trying to start up. What are they to do, contact every site admin individually?
> if some new bot were to misbehave and crawl too aggressively, blacklisting would only help after the fact.
In that case, don't blacklist all bots; simply add a crawl delay for any bots that you haven't specifically allowed:
User-agent: *
Crawl-delay: 10
This allows minor bots to continue crawling the site while cutting back on bandwidth costs from the couple that are being overly aggressive.
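And if you want your preferred search engines crawling at full speed, a more specific group overrides the wildcard for any crawler that matches it, so a sketch like the following should work (the Googlebot token is just an example). Caveat: Crawl-delay is a non-standard directive; Bing and Yandex honor it, but Googlebot is generally documented as ignoring it.

User-agent: Googlebot
Disallow:

User-agent: *
Crawl-delay: 10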