Who exactly is crawling my site? (danbirken.com)
97 points by birken on Oct 18, 2013 | 78 comments



"I've decided to block all crawlers to the site other than Google or Bing"

And this is why I find people who respond to privacy complaints about Google with "if you want to switch search engines, no one is stopping you" frustrating.

Additionally, I would like to point out that according to those numbers, there are ~41,000 (418,814 - 199,725 - 40,359 - 36,340 - 33,893 - 26,325 - 13,458 - 10,657 - 6,109 - 5,993 - 4,959) additional robot hits, many of which come from bots that only visit one or two pages. For comparison, that's more requests than Googlebot made. Just flat-out banning everything frustrates people and encourages them to ignore robots.txt entirely.

If you're being hammered by a bot, contact the bot's owner! Most bots have a link in the user agent that you can follow. Barring that, ban the specific bot. But don't ban everything just because a few (proximic and ADmantX) are hammering the site.


> But don't ban everything just because a few (proximic and ADmantX) are hammering the site.

I can understand blocking somebody that has a long-term and clear pattern of disrupting your site, not following the robots.txt rules, and not providing any links or anything back to you. But I find the idea of somebody preemptively blocking everything but Google and maybe Bing extremely distasteful.

If everybody out there blocked everything but Google/Bing, it would make it very difficult for anybody to ever try to create a new search engine, build new types of web services, or analyze data in new ways.

Possibly a better solution is improving the Common Crawl initiative - making it more frequently updated, easier to get started with, better documented, etc. If there were a way to get every web service out there that wants to crawl the web to contribute to it, it would lighten the load on everybody. http://commoncrawl.org


Or maybe all the other crawlers would just claim to be Googlebot. Just like all the other browsers (partially) claim to be Netscape.


This is a fair point; however, blacklisting isn't necessarily a perfect solution either. It would require continuous manual effort going through the logs and blocking bad bots, and if some new bot were to misbehave and crawl too aggressively, blacklisting would only help after the fact.

I do think I made a mistake, though. To your point, I shouldn't block crawlers that both behave and are attempting to help my site in some way (by driving traffic to it - i.e. search engines). Whether or not they are currently driving traffic to the site is not important. I'll whitelist Yandex, Baidu, ScoutJet and any other related bots I see and edit the post.


Whitelisting a couple of bots now doesn't help at all for any new search engines trying to start up. What are they to do, contact every site admin individually?

> if some new bot were to misbehave and crawl too aggressively, blacklisting would only help after the fact.

In that case, don't blacklist all bots; simply add a crawl delay for any bots that you haven't specifically allowed:

  User-agent: *
  Crawl-delay: 10

This allows minor bots to continue to crawl the site, while cutting back on bandwidth costs for the couple that are being overly aggressive.


I agree completely. Blocking everything but Google and Bing is horrible and extremely short-sighted.


I actually did this exact robots.txt for my site this summer. The net effect? A massive loss of secondary market traffic - enough that, after a month, we probably lost 10-15% in revenue. It surprised us. We rolled the old robots.txt back into place and bam - life went back to normal. You can question my pithy 300-character explanation all you want but I'll leave you with this: we were 100% certain that the robots.txt change was the difference.

In the end, we switched to network blocks for the common bots/spiders. Much better.


Do you mind posting your old robots.txt?


It was almost identical to his


Thanks Scott. Please advise how to create network blocks. I run an affiliate site and need to do this for my site too!


It depends on your setup. We have a hardware firewall, so I add/remove networks on it daily. If you are on Apache, you can use .htaccess. On a Windows server, you can add them to Windows Firewall.
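
For the Apache case, a minimal .htaccess sketch (assuming Apache 2.2-style mod_authz_host directives, which were current at the time; the CIDR ranges are placeholders, not real bot networks):

  Order Allow,Deny
  Allow from all
  # Substitute the ranges you see misbehaving in your logs.
  Deny from 192.0.2.0/24
  Deny from 198.51.100.0/24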


Cool. Thank you for your response. My site runs on a Windows server. I'll look into your recommendation. Thanks again.


Crawlers and spambots are the scourge of medium to small websites.

I run a small wiki that gets just a few thousand human hits a day. But according to the server logs 90% of server hits are crawlers and spambots, so I'm using 10 times the resources I really need to serve customers.

I finally resorted to blocking entire data centers and companies that crawl constantly but send no traffic.

I feel like search engines should crawl websites in proportion to the traffic they send. For example, Yandex, Baidu, and Bing were all crawling my website hundreds or thousands of times a day, but never sending a single visitor (or in the case of Bing sending single-digit visitors). It's an absurd waste of resources, so I blocked them completely.


Add a time limit to robots.txt and block anyone who violates it. Maybe also add a nonexistent honeypot (I disallow /wpadmin on my static website haha). I've been fail2banning a lot of shitbirds and my traffic logs before and after are quite different.
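
The fail2ban side is roughly this, assuming an Apache combined-format access log and a honeypot path like the /wpadmin one above (the filter and jail names are made up for illustration):

  # /etc/fail2ban/filter.d/robots-honeypot.conf (hypothetical filter)
  [Definition]
  failregex = ^<HOST> .* "(GET|POST) /wpadmin

  # Added to /etc/fail2ban/jail.local
  [robots-honeypot]
  enabled  = true
  filter   = robots-honeypot
  port     = http,https
  logpath  = /var/log/apache2/access.log
  maxretry = 1
  bantime  = 86400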


Fortunately for us new(er) search engines (I work at blekko), only a small fraction of websites actively block crawlers from new search engines. It's a chicken and egg problem: we're going to have a big index for a long time before we have big traffic.

I also have a bunch of hobby websites, and I agree that Bing crawls a lot more than Google and sends me a lot less traffic. And Yandex and Baidu have very little English-speaking usage, so they aren't sending me much traffic.


It's actually a lot worse if your site is large since crawlers will make even more requests crawling a large site.

If I see that a crawler is sending me any traffic at all, I will accept that, but if the amount of traffic is zero, I put them in robots.txt and in the IP block list. Although I try hard to make sure my images are all clean Creative Commons images, I block the robots that are there to find copyrighted images because these ones exist to give me nothing but trouble.


>I block the robots that are there to find copyrighted images because these ones exist to give me nothing but trouble.

I've never heard of these bots before. Where did you hear about them, and how do you block them?


Is there a list of good IPs to block, similar to the hosts files that people pass around?

EDIT: chrsstrm mentions something below


Chicken and egg. Should they 1. crawl you first, or 2. send you traffic first? If 1, how long until they should either send traffic or stop, and if 2, how?


Why not adjust the crawl rate based on how often the site's content changes? If DuckDuckGoogle crawls me once an hour for a week, and notices that I only updated my content twice, why not scale the crawling back to a more reasonable rate, like once a day?


This seems logical; however, what about the person who updates rarely (once or twice a month) but gets a lot of notice when they do? Search engines wouldn't want to miss the opportunity to get that traffic. I wonder if there is some sort of reverse process you could opt into - i.e. never crawl me until I ask to be crawled. Disclaimer: I know little about this sort of thing.


You could probably do this by having a robots.txt file that blocks all crawling. When you want to be re-crawled, you can edit it to allow the relevant spiders to crawl you. I would imagine that spiders don't frequently re-check sites that disallow everything, but Google (and probably others) lets you manually submit URLs to be crawled.


But a disallow-all robots.txt will kill your site's traffic - it's not going to work like you think it will.


Presumably you could set a minimum interval - an hour, or a day. This would make sure that big updates are noticed, without putting undue stress on the server.


Could they not run webhooks, where my site can call out to them to request a full or partial crawl? PubSub for web crawling, if you will.


Sitemaps (http://www.sitemaps.org/) help a bit in that regard, as crawlers can check the sitemap and only crawl updated content.
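
For illustration, a minimal sitemap.xml with lastmod dates (the example.com URLs and dates are placeholders); the lastmod values are what let a polite crawler skip pages that haven't changed:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://example.com/</loc>
      <lastmod>2013-10-18</lastmod>
      <changefreq>daily</changefreq>
    </url>
    <url>
      <loc>http://example.com/about</loc>
      <lastmod>2013-09-01</lastmod>
      <changefreq>monthly</changefreq>
    </url>
  </urlset>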


Thanks.


PuSH: http://en.wikipedia.org/wiki/PubSubHubbub

Google is the only crawler I see using the protocol on my site. It does make Google updates occur within minutes, so that alone is reason to implement PuSH.
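
Publishing a ping is just a single POST to the hub; a minimal sketch in Python (the hub and feed URLs are placeholders; Google's public hub lived at pubsubhubbub.appspot.com):

  # Sketch: tell a PubSubHubbub hub that a feed has new content.
  from urllib.parse import urlencode
  from urllib.request import urlopen

  def publish(hub_url, topic_url):
      data = urlencode({'hub.mode': 'publish', 'hub.url': topic_url}).encode()
      # Hubs answer 204 No Content when they accept the ping.
      with urlopen(hub_url, data) as resp:
          return resp.status

  if __name__ == '__main__':
      print(publish('https://pubsubhubbub.appspot.com/',
                    'http://example.com/feed.atom'))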


I thought this was only being used for RSS - interesting!


Do you/did you receive traffic from DDG? Because DDG uses Yandex for [some] organic search results and if you block its spider you will also partially block DDG.


I did not know that! Thanks, I'll look.


I don't block completely, but have for a long time configured servers to react like this:

- If it's cached content, there's no limit

- If it's non-cached content and it is a bot, direct it to the pool of servers assigned to uninteresting bots

- If it is an interesting bot (Google, Bing, a couple others), instead use a pool of servers for interesting bots

Some bots are not only non-traffic-generating, they are absurdly aggressive, consuming 99% of server resources for the hour or so needed to pull the content, regular site users be damned. A configuration like this is reasonably low-maintenance while limiting the damage from bots.
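
Not my actual config, but the routing idea sketched in nginx terms (upstream addresses and user-agent patterns are made up):

  # Route known crawlers to their own backend pools so they can't starve
  # regular visitors. Addresses and patterns are illustrative only.
  map $http_user_agent $backend_pool {
      default                        regular_users;
      ~*(Googlebot|bingbot)          interesting_bots;
      ~*(bot|crawler|spider|slurp)   uninteresting_bots;
  }

  upstream regular_users      { server 10.0.0.10:8080; }
  upstream interesting_bots   { server 10.0.0.20:8080; }
  upstream uninteresting_bots { server 10.0.0.30:8080; }

  server {
      listen 80;
      location / {
          # A cache (e.g. proxy_cache) in front of this would cover the
          # "no limit for cached content" rule above.
          proxy_pass http://$backend_pool;
      }
  }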


I'm late, but for this there are companies that sit in the middle and block crawlers, spambots and known-malicious addresses. Here's one I use: http://www.incapsula.com/ So far I've been happy with it!


The irony of the situation is that the crawlers specifically allowed are most likely the only ones that bother to abide by robots.txt in the first place. It'll be interesting to see if the change has any impact whatsoever.


I've got a calendar site that's used by <50 people. 99% of my requests look like:

"GET /calls?month=2&year=7206 HTTP/1.1" 200 2423 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

It wouldn't be hard to stop (404 nonsensical years & disable the previous / next links that will get them there), but it is amusing that Googlebot has clicked on that "next month" link over 60000 times...
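
For example (not my actual code; a Flask-style sketch with a made-up route and year bounds), the fix is just to 404 the absurd years:

  # Sketch: reject absurd years so a crawler following "next month" forever
  # hits a 404 instead of an endless chain of pages.
  from flask import Flask, abort, request

  app = Flask(__name__)

  @app.route('/calls')
  def calls():
      year = request.args.get('year', type=int)
      month = request.args.get('month', type=int)
      # Bounds are arbitrary for illustration; pick whatever your users need.
      if year is None or month is None \
              or not 1990 <= year <= 2100 or not 1 <= month <= 12:
          abort(404)
      return 'calendar for %d-%02d' % (year, month)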


I would think implementing something like this[0] would help.

[0] https://support.google.com/webmasters/answer/96569?hl=en


Aside from nofollow, which someone mentioned, you might want to implement the canonical link element (http://en.wikipedia.org/wiki/Canonical_link_element) or put a block on year= into the robots.txt file.


So I'm dealing with this now too, although at about 10x the traffic mentioned in the post. The first place I started was using the 5G Blacklist/Firewall [http://perishablepress.com/5g-blacklist-2013/] which is really just a great set of .htaccess rules for blocking known bad bots. Legit bots will respect your robots.txt, so if one (looking at you Yandex) is getting too aggressive, slow them down with the 'Crawl-delay: 60' (time in s) directive. Of course rogue bots don't respect this, so they get added to the blacklist rules based on UserAgent.
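
The blacklist-by-UserAgent part of those rules looks roughly like this in .htaccess (the bot names are placeholders, not a recommendation of what to block):

  # Return 403 Forbidden to user agents on the blacklist.
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper|SomeRogueSpider) [NC]
  RewriteRule .* - [F,L]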

What I've discovered though is that bots are not my biggest worry; it's the scrapers that are stealing my client's content and re-posting it. We've successfully filed 5 DMCA complaints at this point which have been effective at stopping the known offenders, but the crawlers continue and new copycat sites keep popping up. I've found that running a 'grep Ruby access_log' returns a good chunk of the offending crawlers (not just Ruby, also search for Python and Java). Running 'host (ip address)' almost always traces back to AWS. These log entries also very rarely list a referrer.

Obviously not all of the grep results are malicious. A little research can reveal that an IP is linked to a service or company you want crawling your site. Those that are unknown usually get an IP ban until I can determine otherwise (which the client is totally OK with; they don't want their content re-published off-site).

I've thought about setting up a honeypot, but my issue was keeping the bait links hidden from legit services (plant a hidden link on the page that only a bad actor would follow, then trap their IPs in the log and ban them). Since the DMCAs have been so effective, I haven't been forced to pursue a honeypot, but I would be very interested if anyone else has a good solution.

I also discovered that image hotlinking was a _HUGE_ problem with this site. The poor site had been neglected for years and hotlinkers were running wild. Shutting that down with a simple htaccess rewrite rule really helped. That is also how I discover content thieves: 'grep 302 access_log' and look at the referrer URLs.
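
The rule is just a few lines of mod_rewrite; a sketch, with example.com standing in for the client's domain:

  # Redirect image requests whose referrer is some other site; the resulting
  # 302s then show up in the access log next to the offending referrer URL.
  RewriteEngine On
  RewriteCond %{HTTP_REFERER} !^$
  RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
  RewriteCond %{REQUEST_URI} !hotlink-blocked\.png
  RewriteRule \.(gif|jpe?g|png)$ /images/hotlink-blocked.png [R=302,L,NC]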


If everybody does this, then a new search engine will never come to be. You'd better hope Google is the best search engine anyone could ever create.

As a search engine developer that has tried to compete with Google in the past, I find this disheartening.

I agree with blocking anyone behaving badly. But you are cutting out the good with the bad.


Can you tell us a little bit about your project to compete with Google, if that's OK with you? "Trying to compete with Google" is exactly what I'm doing right now, and I'm just curious.

Were you unable to crawl hundreds of thousands of websites because your crawler was disallowed by their robots.txt while major crawlers were allowed?


Blocking is never a sustainable strategy, but sometimes you have to kill some good cells with the cancerous ones. Chalk it up to collateral damage. It's the same situation with forum/comment spam: often the IP falls within a range used by OVH, so when it gets out of hand I block the entire IP range.


So, you have a problem with Ahrefs, too? But what about Ezooms? That's a very annoying bot for my site, drives zero traffic. Also, what about Cyveillance? They show up with a lying User-Agent field ("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)"), but run Linux, and they ignore robots.txt by not even asking for it.

I actually blocked Ahrefs, Ezooms, Yandex and Cyveillance by IP address range in httpd.conf for a while, but I've decided to send them randomly-generated HTML (http://stratigery.com/bork.php) instead. I'm really surprised by Ahrefs' and Ezooms' appetite for gibberish about naked celebrities, condiments and sweater puppies.
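
The idea is easy to sketch (this isn't bork.php itself, just an illustration in Python with a made-up vocabulary): pick random words, wrap them in markup, and let the bot index nonsense.

  import random

  WORDS = ('celebrity naked condiment sweater puppy mustard astral '
           'tabloid cardigan pickle scandal').split()

  def bork(paragraphs=5, words_per_paragraph=40):
      # Build a few paragraphs of random-word gibberish.
      body = ''.join(
          '<p>%s.</p>\n' % ' '.join(random.choice(WORDS)
                                    for _ in range(words_per_paragraph))
          for _ in range(paragraphs))
      title = ' '.join(random.choice(WORDS) for _ in range(4))
      return ('<html><head><title>%s</title></head><body>\n%s</body></html>'
              % (title, body))

  if __name__ == '__main__':
      print(bork())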


Does Ahrefs not respect robots.txt? Their website says they do, https://ahrefs.com/robot/index.php


I don't know if I tried that. Originally, I wanted to let them know they were forbidden, but now I just want to jerk them around.


I'd like to note that the above is the first time I've ever referenced that URL. No links exist to it on my web site.

I just got a visit from the "MJ12Bot" (http://www.majestic12.co.uk/projects/dsearch/mj12bot.php) asking for robots.txt, and then the URL above. So, MJ12bot reads Hacker News, or at least follows the URLs listed in it.


Block Yandex?


That website has only one post, and the post talks about "SEO" and various search-engine names... in relation to an SEO business.

I kind of see what he is really trying to do...

Get Google to rank it from the start on long tails of "SEO <name-of-crawler> + other-keyword(s)".

(I'm not complaining; it's pretty smart to start out like that and make it onto HN's front page.)


I do run a few sites whose goal is to attract search traffic (hopefully by containing content useful to people), and the data in this particular post comes from one of them. I'm no stranger to different strategies for making a site attract search traffic.

However, this site is just my personal blog. It has one post because I just started it today, which is why it is fairly sparse. I'm not going to put any ads on my personal blog, so even if it did attract some amount of search traffic it would be worth very little to me. I also highly doubt it will attract that much traffic; anything related to "SEO" has a significant amount of competition for obvious reasons. I prefer to compete in places with high upside and low competition.

Additionally, I don't place much value on transient links (in my personal SEO opinion). When Google crawls the homepage of HN today, there will be a link to my site. But tomorrow and forever on, there won't be. If I'm going to work hard to get a link to my site, I sure as hell would want it to be permanent.


Somewhat strange that HN links don't have the rel="nofollow" attribute, or else this particular angle wouldn't work.


From my understanding, due to massive abuse of the rel="nofollow" linkout, for the last several years Google has 1) ignored them, still passing link juice and/or incoming weight on terms (SEO), and 2) penalized sites that use those types of links exclusively.


How about using Crawl-delay in robots.txt?

http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl...


I deal(t) with the same thing. I made it so my web server would try to stream a page that never ends, and some bots would stay connected for hours and hours. But over time they seem to have adapted.

I've also noticed some of them used to make requests synchronously (waiting for the previous to finish before making another), but they have adapted to make requests in parallel and add timeouts so they don't have their time wasted quite as long.

I created a log of the ones who stayed connected the longest.

https://gist.github.com/scryptonite/5324724

I don't bother to maintain it anymore, but it was pretty interesting watching them change tactics over time.
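
A tarpit like that can be sketched in a few lines of Flask-style Python (not the code I used, just the idea): stream an unending "page" very slowly so a greedy bot ties up its own connection.

  import time
  from flask import Flask, Response

  app = Flask(__name__)

  @app.route('/tarpit')
  def tarpit():
      def endless():
          yield '<html><body>'
          while True:
              yield '<p>loading...</p>\n'
              time.sleep(10)   # drip-feed a chunk every 10 seconds, forever
      return Response(endless(), mimetype='text/html')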


I would really advise against disallowing the archive.org bot. Their mission is a really important one in the long run that would be disrupted by everyone adopting a whitelist approach and cutting them out, and I'm sure that their load is negligible.


What about a system like this:

You crawl your own site every day (or generate the same content into files)

Put the content into files with file names representing the URLs (or HAR format?)

Zip em up

Put them on bittorrent

Tell the search engines to look there for your content

Wouldn't that save everyone a lot of work?


That was almost the original intent behind sitemap.xml (http://sitemaps.org). Except, of course, they were being used erroneously - changes published when no content had changed - and so the spiders stopped treating them as canonical.


This would be another dream come true for SEO. Too easy to game.


Which tool/code did you use to parse the logs and get those numbers?


Just a little python script: https://gist.github.com/danbirken/7047504

Apache Logs --> Python script --> CSV --> Google spreadsheets --> Manual labor --> Blog post
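
The core of it is just a user-agent tally over the access log; a simplified sketch (not the exact script in the gist), assuming a combined-format Apache log:

  import collections
  import re
  import sys

  # Combined log format ends with: "referer" "user-agent"
  UA_RE = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

  def count_user_agents(path):
      counts = collections.Counter()
      with open(path) as f:
          for line in f:
              m = UA_RE.search(line)
              if m:
                  counts[m.group('ua')] += 1
      return counts

  if __name__ == '__main__':
      for ua, n in count_user_agents(sys.argv[1]).most_common(20):
          print('%8d  %s' % (n, ua))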


A blog post about that would be cool as well


I'm glad to see that you have revised your list and added DuckDuckGo. It would be sad to see genuine search engines like DuckDuckGo get left out.


You may want to update this to reflect the fact that "MJ12bot" is selling SEO Services, too.

They run a reward program for using their agent and getting pages crawled (data is then crunched on a central server cluster owned by the operators) - cash payouts that are seeded by the subscription models they have here http://www.majesticseo.com


I would give insane kudos to the first person to implement this, or point me at one which already exists:

I want an nginx module that allows a crawler once per specified period (per day or per week, I would imagine, but configurable is better). That is to say, it allows the bot to finish its crawl, then bans that IP/User-Agent for the specified duration.
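
I don't know of an existing module for this; here's a sketch of the policy itself in Python (the idle gap and ban period are made-up parameters), where a crawl is considered finished once the bot goes quiet:

  import time

  class CrawlWindow(object):
      """Allow a bot to finish one crawl, then refuse it for a period."""

      def __init__(self, idle_gap=600, period=7 * 24 * 3600):
          self.idle_gap = idle_gap    # seconds of silence that end a crawl
          self.period = period        # seconds the bot is refused afterwards
          self.last_seen = {}         # key -> timestamp of last request
          self.banned_until = {}      # key -> timestamp when the ban lifts

      def allow(self, key, now=None):
          # key would be an (IP, User-Agent) pair, per the comment above.
          now = time.time() if now is None else now
          last = self.last_seen.get(key)
          if last is not None and now - last > self.idle_gap:
              # Previous crawl finished; refuse until the period has elapsed.
              self.banned_until[key] = last + self.period
              del self.last_seen[key]
          if now < self.banned_until.get(key, 0):
              return False
          self.last_seen[key] = now
          return True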


I have found this set of htaccess rules to be extremely useful in blocking bad crawling behavior:

https://github.com/bluedragonz/bad-bot-blocker


"It turns out I'm not alone in adding these types of restrictions. Yelp blocks everybody but Google, Bing, ia_archiver (archive.org), ScoutJet (Blekko) and Yandex. LinkedIn also has a similar opt-in robots.txt, though they have whitelisted a larger number of bots than yelp."

At least we can contact/email Yelp and LinkedIn to ask whether a given crawler may crawl, according to their robots.txt. That's more generous than just allowing the big search engines such as Google and Bing. I'm not quite sure what actually happens if we ask them, though. I'll try that.


I see crawlers as enablers - I've even gone so far as spinning up a dedicated server instance to serve content to the bots, so they get good response times, which increases relevancy in search results.


Your assumption is wrong. Ahrefs is a backlink compiling bot that tracks, documents and records in/outbound links. They do not sell services directly.


*not that I am aware of.


I was just going by this page: https://ahrefs.com/pricing_plans.php

It is possible I am misinterpreting what they are offering.


We ( http://samuru.com ) limit our crawling, and we honor robots.txt. We also use Google infrastructure, so we come from the same IPs as Googlebot.

We probably crawl 50 pages for every visitor we deliver. There are a lot more pages on the web than people on the planet. So the ratios will always favor the bots.


At the bottom of the page he mentions Yelp, and looking at their robots.txt they have this weird stuff for each of the allowed bots:

  Disallow: /biz/outlook-autumn-market-fundamental-catwalk-flimsy-roost-legibility-individualism-grocer-predestination-0

And over and over, ten times. Trying to see if the bots ignore robots.txt, I guess?


Probably a honeypot, yeah.


The first thing I did after reading this was look at http://danbirken.com/robots.txt - I guess he thought of that.

  # (This is not the SEO site in question)

  User-Agent: *
  Allow: /


I also noticed on my own site that Bing pulls 5-6x as much traffic as the Google bot. Ridiculous...


You should add some request-rate settings to your robots.txt file to prevent robots from crawling your web site too often. Here is an example: http://www.sparkledb.net/robots.txt


What if a bot sets its user-agent to look like it came from an end-user browser?


It probably wouldn't act like a human-operated browser, even if it said it was.


It's easy for a crawler to parse a robots.txt in different ways.

The article says to whitelist a few and deny everything else.

The crawler could parse that, and change its user agent to a whitelisted one on further requests.


What about the bots that are not identifying themselves as such?



