And this is why I find it frustrating when people respond to privacy complaints about Google with "if you want to switch search engines, no one is stopping you."
Additionally, I would like to point out that according to those numbers there are ~41,000 (418,814 - 199,725 - 40,359 - 36,340 - 33,893 - 26,325 - 13,458 - 10,657 - 6,109 - 5,993 - 4,959) additional robot hits, many of them from bots that only visit one or two pages. For comparison, that's more requests than Googlebot made. Flat-out banning everything frustrates people and encourages them to ignore robots.txt entirely.
If you're being hammered by a bot, contact the bot's owner! Most bots have a link in the user agent that you can follow. Barring that, ban the specific bot. But don't ban everything just because a few (Proximic and ADmantX) are hammering the site.
I can understand blocking somebody who has a long-term and clear pattern of disrupting your site, not following the robots.txt rules, and not providing any links or anything back to you. But I find the idea of somebody preemptively blocking everything but Google and maybe Bing extremely distasteful.
If everybody out there blocked everything but Google/Bing, it would make it very difficult for anybody to ever create a new search engine, build new types of web services, or analyze data in new ways.
Possibly a better solution is making the Common Crawl initiative a better project: update it more frequently, make it easier to get started with, provide better documentation, etc. If there were a way to get every web service out there that wants to crawl the web to contribute to it, it would lighten the load on everybody. http://commoncrawl.org
I do think I made a mistake, though. To your point, I shouldn't block crawlers that both behave and are attempting to help my site in some way (by driving traffic to it, i.e. search engines). Whether or not they are currently driving traffic to the site is not important. I'll whitelist Yandex, Baidu, Scoutjet and any other related bots I see, and edit the post.
> if some new bot were to misbehave and crawl too aggressively, blacklisting would only help after the fact.
In that case, don't blacklist all bots; simply add a crawl delay for any bots that you haven't specifically allowed.
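For example (the bot names and the 30-second figure are only illustrative, and not every crawler honors Crawl-delay):

    # robots.txt
    # Bots you explicitly trust crawl at full speed
    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    # Everyone else is still allowed, just asked to slow down
    User-agent: *
    Crawl-delay: 30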
This allows minor bots to continue to crawl the site, while cutting back on the bandwidth costs caused by the couple that are being overly aggressive.
In the end, we switched to network blocks for the common bots/spiders. Much better.
I run a small wiki that gets just a few thousand human hits a day. But according to the server logs, 90% of server hits are crawlers and spambots, so I'm using 10 times the resources I really need to serve customers.
I finally resorted to blocking entire data centers and companies that crawl constantly but send no traffic.
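For reference, blocking a whole provider is only a few lines per range in Apache 2.4 (the CIDR below is a documentation placeholder, not a real data-center range):

    <RequireAll>
        Require all granted
        Require not ip 192.0.2.0/24
    </RequireAll>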
I feel like search engines should crawl websites in proportion to the traffic they send. For example, Yandex, Baidu, and Bing were all crawling my website hundreds or thousands of times a day but never sending a single visitor (or, in the case of Bing, only single-digit visitors). It's an absurd waste of resources, so I blocked them completely.
I also have a bunch of hobby websites, and I agree that Bing crawls a lot more than Google and sends me a lot less traffic. And Yandex and Baidu have very little English-speaking usage, so they aren't sending me much traffic either.
If I see that a crawler is sending me any traffic at all, I will accept it, but if the amount of traffic is zero, I put it in robots.txt and in the IP block list. Although I try hard to make sure my images are all clean Creative Commons images, I also block the robots that exist to find copyrighted images, because those give me nothing but trouble.
I've never heard of these bots before. Where did you hear about them, and how do you block them?
EDIT: chrsstrm mentions something below
Google is the only crawler I see using the protocol on my site. It does make Google updates occur within minutes, so that alone is reason enough to implement push.
- If it's cached content, there's no limit
- If it's non-cached content and it is a bot, direct it to the pool of servers assigned to uninteresting bots
- If it is an interesting bot (Google, Bing, a couple of others), use a pool of servers for interesting bots instead
Some bots are not only non-traffic generating, they are absurdly aggressive, consuming 99% of server resources for the hour or so needed to pull the content, regular site users be damned. A configuration like this is reasonably low maintenance while limiting the damage by bots.
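In nginx terms, that kind of routing can be sketched roughly like this (not our actual config: the pool names, bot patterns and backend addresses are made up, and the caching layer in front is left out):

    # http {} context: pick a backend pool based on the user agent
    map $http_user_agent $backend_pool {
        default                            app_servers;
        "~*(googlebot|bingbot)"            interesting_bots;
        "~*(bot|crawler|spider|scrapy)"    boring_bots;
    }

    upstream app_servers      { server 10.0.0.10; server 10.0.0.11; }
    upstream interesting_bots { server 10.0.0.20; }
    upstream boring_bots      { server 10.0.0.30; }

    server {
        listen 80;
        location / {
            proxy_pass http://$backend_pool;
        }
    }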
"GET /calls?month=2&year=7206 HTTP/1.1" 200 2423 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
It wouldn't be hard to stop (404 nonsensical years & disable the previous/next links that would get them there), but it is amusing that Googlebot has clicked on that "next month" link over 60,000 times...
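It's not their app, obviously, but the "404 nonsensical years" fix is basically just a range check before rendering; a toy Flask version of the idea (the bounds and the route are placeholders):

    # pip install flask -- a hypothetical sketch, not the actual calendar app
    from flask import Flask, abort, request

    app = Flask(__name__)

    MIN_YEAR, MAX_YEAR = 2000, 2030   # whatever range the real data covers

    @app.route("/calls")
    def calls():
        year = request.args.get("year", type=int)
        month = request.args.get("month", type=int)
        # 404 anything outside the real range so crawlers stop walking
        # an endless chain of previous/next month links
        if not year or not month or not (MIN_YEAR <= year <= MAX_YEAR and 1 <= month <= 12):
            abort(404)
        return f"Calls for {month}/{year}"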
What I've discovered, though, is that bots are not my biggest worry; it's the scrapers that are stealing my client's content and re-posting it. We've successfully filed 5 DMCA complaints at this point, which have been effective at stopping the known offenders, but the crawlers continue and new copycat sites keep popping up. I've found that running 'grep Ruby access_log' returns a good chunk of the offending crawlers (not just Ruby; also search for Python and Java). Running 'host (ip address)' almost always traces back to AWS. These log entries also very rarely list a referrer.
Obviously not all of the grep results are malicious. A little research can reveal that an IP is linked to a service or company you want crawling your site. The unknown ones usually get an IP ban until I can determine otherwise (which the client is totally OK with; they don't want their content re-published off-site).
I've thought about setting up a honeypot, but my issue was keeping the bait links hidden from legit services (plant a hidden link on the page that only a bad actor would follow, then trap their IPs in the log and ban them). Since the DMCAs have been so effective, I haven't been forced to pursue a honeypot, but I would be very interested if anyone else has a good solution.
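For what it's worth, the usual trick for keeping legitimate services out of the trap is to Disallow the bait URL in robots.txt and put rel="nofollow" on the hidden link, so only things that ignore both ever request it. A rough sketch of the trap endpoint (the /trap/ path, the log file name, and Flask itself are just illustrative choices):

    from flask import Flask, abort, request

    app = Flask(__name__)

    @app.route("/trap/")
    def trap():
        # anything requesting this URL ignored both robots.txt and the
        # nofollow on the hidden link, so record it for the ban list
        with open("honeypot_ips.log", "a") as log:
            log.write(request.remote_addr + "\n")
        abort(403)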
I also discovered that image hotlinking was a _HUGE_ problem with this site. The poor site had been neglected for years and hotlinkers were running wild. Shutting that down with a simple htaccess rewrite rule really helped. That is also how I discover content thieves: 'grep 302 access_log' and look at the referrer URLs.
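The hotlink rule itself is only a few lines; something in this spirit (example.com, the extensions and the replacement image are placeholders, not the actual rule):

    RewriteEngine On
    # empty referer = direct requests, or browsers that strip the header
    RewriteCond %{HTTP_REFERER} !^$
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
    # don't loop on the replacement image itself
    RewriteCond %{REQUEST_URI} !hotlink\.png$
    RewriteRule \.(jpe?g|png|gif)$ /hotlink.png [R=302,L]

The 302 redirect is also what makes the 'grep 302 access_log' trick work.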
As a search engine developer who has tried to compete with Google in the past, I find this disheartening.
I agree with blocking anyone behaving badly. But you are cutting out the good with the bad.
You were not able to crawl hundreds of thousands of websites because your crawler was disallowed by their robots.txt while the major crawlers were allowed to do so?
I actually blocked Ahrefs, Ezooms, Yandex and Cyveillance by IP address range in httpd.conf for a while, but I've decided to send them randomly-generated HTML (http://stratigery.com/bork.php) instead. I'm really surprised by Ahrefs' and Ezooms' appetite for gibberish about naked celebrities, condiments and sweater puppies.
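The generator doesn't have to be anything clever. A toy Python version of the same idea (not the linked bork.php, just an illustration):

    import random

    WORDS = ["naked", "celebrity", "mustard", "ketchup", "sweater", "puppy",
             "relish", "paparazzi", "cardigan", "mayonnaise"]

    def gibberish_page(paragraphs=5, words_per=40):
        body = "\n".join(
            "<p>" + " ".join(random.choices(WORDS, k=words_per)) + "</p>"
            for _ in range(paragraphs)
        )
        return "<html><body>\n" + body + "\n</body></html>"

    if __name__ == "__main__":
        print(gibberish_page())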
I just got a visit from the "MJ12Bot" (http://www.majestic12.co.uk/projects/dsearch/mj12bot.php) asking for robots.txt, and then the URL above. So MJ12bot reads Hacker News, or at least follows the URLs listed on it.
I kind of see what he is really trying to do...
Get Google to rank it from the start on long tails of "SEO <name-of-crawler> + other-keyword(s)".
(I'm not complaining; it's pretty smart to start out like that and make it onto HN's front page.)
However, this site is just my personal blog. It has one post because I just started it today, which is why it is fairly sparse. I'm not going to put any ads on my personal blog, so even if it did attract some amount of search traffic it would be worth very little to me. I also highly doubt it will attract that much traffic, anything related to "SEO" has a significant amount of competition for obvious reasons. I prefer to compete in places with high upside and low competition.
Additionally, I don't place much value on transient links (in my personal SEO opinion). When Google crawls the homepage of HN today, there will be a link to my site. But tomorrow and forever on, there won't be. If I'm going to work hard to get a link to my site, I sure as hell want it to be permanent.
I've also noticed some of them used to make requests synchronously (waiting for the previous request to finish before making another), but they have adapted to make requests in parallel and add timeouts so their time isn't wasted for quite as long.
I created a log of the ones who stayed connected the longest.
I don't bother to maintain it anymore, but it was pretty interesting watching them change tactics over time.
You crawl your own site every day (or generate the same content into files)
Put the content into files with file names representing the URLs (or HAR format?)
Zip em up
Put them on bittorrent
Tell the search engines to look there for your content
Wouldn't that save everyone a lot of work?
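A rough sketch of what that could look like with off-the-shelf tools (the domain, the tracker and the tool choices are only examples; the "tell the search engines" step is the part that doesn't exist yet):

    # 1. crawl/mirror your own site
    wget --mirror --no-parent --adjust-extension \
         --directory-prefix=snapshot https://example.com/

    # 2. zip it up
    zip -r site-snapshot.zip snapshot/

    # 3. publish it over bittorrent (mktorrent is one of several CLI creators)
    mktorrent -a udp://tracker.example.org:1337 \
              -o site-snapshot.torrent site-snapshot.zip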
Apache Logs --> Python script --> CSV --> Google spreadsheets --> Manual labor --> Blog post
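For anyone who wants to reproduce the first few arrows, a bare-bones version of the log-to-CSV step (the log path, the combined-log-format regex and the CSV columns are assumptions, not the exact script used):

    #!/usr/bin/env python3
    import csv
    import re
    from collections import Counter

    # the last quoted field of the combined log format is the user agent
    UA = re.compile(r'"[^"]*"\s+\d{3}\s+\S+\s+"[^"]*"\s+"(?P<agent>[^"]*)"')

    counts = Counter()
    with open("access_log", encoding="utf-8", errors="replace") as log:
        for line in log:
            m = UA.search(line)
            if m:
                counts[m.group("agent")] += 1

    with open("user_agents.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["user_agent", "hits"])
        for agent, hits in counts.most_common():
            writer.writerow([agent, hits])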
They run a reward program for running their agent and getting pages crawled (the data is then crunched on a central server cluster owned by the operators): cash payouts seeded by the subscriptions they sell at http://www.majesticseo.com
I want an nginx module that allows a crawler once per specified period (per day or per week, I would imagine, but configurable is better). That is to say, it allows the bot to finish its crawl, then bans that IP/User-Agent for the specified duration.
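As far as I know nothing stock does exactly that. The closest off-the-shelf approximation I can think of is per-IP throttling with the standard limit_req module, exempting the bots you trust via a map; it slows unknown clients down rather than allow-then-ban, and all of the numbers below are arbitrary:

    # http {} context
    map $http_user_agent $limit_key {
        default                    $binary_remote_addr;
        "~*(googlebot|bingbot)"    "";    # empty key = not rate limited
    }

    limit_req_zone $limit_key zone=throttle:10m rate=30r/m;

    server {
        listen 80;
        location / {
            limit_req zone=throttle burst=50 nodelay;
            root /var/www/html;
        }
    }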
At least we can contact/email Yelp and LinkedIn to ask whether a crawler may crawl or not, according to their robots.txt. That's more generous than just allowing the big search engines such as Google and Bing. I'm not quite sure what actually happens if we ask them, though. I'll try that.
It is possible I am misinterpreting what they are offering.
We probably crawl 50 pages for every visitor we deliver. There are a lot more pages on the web than people on the planet. So the ratios will always favor the bots.
# (This is not the SEO site in question)
The article says to whitelist a few and deny everything else.
The crawler could parse that and change its user agent to one of the whitelisted ones on further requests.