
Who exactly is crawling my site? - birken
http://danbirken.com/seo/2013/10/17/who-exactly-is-crawling-my-site.html
======
TheLoneWolfling
"I've decided to block all crawlers to the site other than Google or Bing"

And this is why I find it frustrating when people respond to privacy
complaints about Google with "if you want to switch search engines, no one is
stopping you."

Additionally, I would like to point out that according to those numbers, there
are ~41,000 (418,814 - 199,725 - 40,359 - 36,340 - 33,893 - 26,325 - 13,458 -
10,657 - 6,109 - 5,993 - 4,959) additional robot hits, many from bots that
only visit one or two pages. For comparison, that's more requests than
Googlebot made. Flat-out banning everything frustrates bot operators, and
encourages them to just ignore robots.txt entirely.

If you're being hammered by a bot, contact the bot's owner! Most bots have a
link in the user agent that you can follow. Barring that, ban the specific
bot. But don't ban everything just because a few (proximic and ADmantX) are
hammering the site.

~~~
throwaway420
> But don't ban everything just because a few (proximic and ADmantX) are
> hammering the site.

I can understand blocking somebody who has a long-term and clear pattern of
disrupting your site, not following the robots.txt rules, and not providing
any links or anything back to you. But I find the idea of somebody
preemptively blocking everything but Google and maybe Bing extremely
distasteful.

If everybody out there blocked everything but Google/Bing, it would make it
very difficult for anybody to ever create a new search engine, build new
types of web services, or analyze data in new ways.

Possibly a better solution is making the Common Crawl initiative a better
project: update it more frequently, make it easier to get started with,
provide better documentation, etc. If there were a way to get every web
service out there that wants to crawl the web to contribute to it, it would
lighten the load on everybody.
[http://commoncrawl.org](http://commoncrawl.org)

~~~
dripton
Or maybe all the other crawlers would just claim to be Googlebot. Just like
all the other browsers (partially) claim to be Netscape.

------
ScottWhigham
I actually did this exact robots.txt for my site this summer. The net effect?
A massive loss of secondary market traffic - enough that, after a month, we
probably lost 10-15% in revenue. It surprised us. We rolled the old robots.txt
back into place and bam - life went back to normal. You can question my pithy
300-character explanation all you want, but I'll leave you with this: we were
100% certain that the robots.txt change was the difference.

In the end, we switched to network blocks for the common bots/spiders. Much
better.
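
For what it's worth, the blocks were just ordinary Apache deny rules; roughly
like this sketch (the range is a placeholder, not any real bot's netblock):

    # In .htaccess or a <Directory> block (Apache 2.2 syntax);
    # substitute the crawler's actual network range
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.0/24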

~~~
breadtk
Do you mind posting your old robots.txt?

~~~
ScottWhigham
It was almost identical to his.

------
nostromo
Crawlers and spambots are the scourge of medium to small websites.

I run a small wiki that gets just a few thousand human hits a day. But
according to the server logs 90% of server hits are crawlers and spambots, so
I'm using 10 times the resources I really need to serve customers.

I finally resorted to blocking entire data centers and companies that crawl
constantly but send no traffic.

I feel like search engines should crawl websites in proportion to the traffic
they send. For example, Yandex, Baidu, and Bing were all crawling my website
hundreds or thousands of times a day but never sending a single visitor (or,
in the case of Bing, sending single-digit visitors). It's an absurd waste of
resources, so I blocked them completely.

~~~
PaulHoule
It's actually a lot worse if your site is large since crawlers will make even
more requests crawling a large site.

If I see that a crawler is sending me any traffic at all, I will accept that,
but if the amount of traffic is zero, I put them in robots.txt and in the IP
block list. Although I try hard to make sure my images are all clean Creative
Commons images, I block the robots that are there to find copyrighted images
because these ones exist to give me nothing but trouble.

~~~
gabemart
> I block the robots that are there to find copyrighted images because these
> ones exist to give me nothing but trouble.

I've never heard of these bots before. Where did you hear about them, and how
do you block them?

------
meritt
The irony of the situation is the crawlers specifically allowed are most
likely the only ones that bother to abide by robots.txt in the first place.
It'll be interesting to see if the change has any impact whatsoever.

------
bryanlarsen
I've got a calendar site that's used by <50 people. 99% of my requests look
like:

"GET /calls?month=2&year=7206 HTTP/1.1" 200 2423 "-" "Mozilla/5.0 (compatible;
Googlebot/2.1;
+[http://www.google.com/bot.html)"](http://www.google.com/bot.html\)")

It wouldn't be hard to stop (404 nonsensical years & disable the previous /
next links that will get them there), but it is amusing that Googlebot has
clicked on that "next month" link over 60,000 times...
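
Stopping it would be something like this minimal sketch (Flask just for
illustration; render_calendar is a made-up stand-in for the real view):

    from flask import Flask, abort, request

    app = Flask(__name__)

    @app.route("/calls")
    def calls():
        # 404 nonsensical dates so a crawler can't follow "next month"
        # links all the way out to the year 7206
        year = request.args.get("year", type=int)
        month = request.args.get("month", type=int)
        if year is None or not 2000 <= year <= 2100:
            abort(404)
        if month is None or not 1 <= month <= 12:
            abort(404)
        return render_calendar(year, month)  # hypothetical helper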

~~~
RKearney
I would think implementing something like this[0] would help.

[0]
[https://support.google.com/webmasters/answer/96569?hl=en](https://support.google.com/webmasters/answer/96569?hl=en)

------
chrsstrm
So I'm dealing with this now too, although at about 10x the traffic mentioned
in the post. The first place I started was the 5G Blacklist/Firewall
[[http://perishablepress.com/5g-blacklist-2013/](http://perishablepress.com/5g-blacklist-2013/)]
which is really just a great set of .htaccess rules for blocking known bad
bots. Legit bots will respect your robots.txt, so if one (looking at you,
Yandex) is getting too aggressive, slow it down with the 'Crawl-delay: 60'
(time in seconds) directive. Of course rogue bots don't respect this, so they
get added to the blacklist rules based on User-Agent.
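
For example, in robots.txt:

    User-agent: Yandex
    Crawl-delay: 60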

What I've discovered, though, is that bots are not my biggest worry; it's the
scrapers that are stealing my client's content and re-posting it. We've
successfully filed 5 DMCA complaints at this point, which have been effective
at stopping the known offenders, but the crawlers continue and new copycat
sites keep popping up. I've found that running 'grep Ruby access_log' returns
a good chunk of the offending crawlers (not just Ruby; also search for Python
and Java). Running 'host (ip address)' almost always traces back to AWS. These
log entries also very rarely list a referrer.

Obviously not all of the grep results are malicious. A little research can
reveal that an IP is linked to a service or company you want crawling your
site. Those that are unknown usually get an IP ban until I can determine
otherwise (which the client is totally OK with; they don't want their content
re-published off-site).

I've thought about setting up a honeypot, but my issue was keeping the bait
links hidden from legit services (plant a hidden link on the page that only a
bad actor would follow, then trap their IPs in the log and ban them). Since
the DMCAs have been so effective, I haven't been forced to pursue a honeypot,
but I would be very interested if anyone else has a good solution.
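
For reference, the bait itself is trivial; a rough sketch (/trap/ is a made-up
path), with the catch being exactly the hidden-from-legit-services problem
above:

    <!-- in the page: invisible to humans, disallowed in robots.txt -->
    <a href="/trap/" style="display:none" rel="nofollow"></a>

    # in robots.txt: well-behaved bots never fetch /trap/
    User-agent: *
    Disallow: /trap/

Anything that fetches /trap/ anyway has outed itself, and 'grep /trap/
access_log' becomes the ban list.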

I also discovered that image hotlinking was a _HUGE_ problem with this site.
The poor site had been neglected for years and hotlinkers were running wild.
Shutting that down with a simple .htaccess rewrite rule really helped. That is
also how I discover content thieves: 'grep 302 access_log' and look at the
referrer URLs.
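
The rule is the standard referrer check, roughly like this (example.com and
hotlinked.png are placeholders; the 302 is what makes the grep above work):

    RewriteEngine On
    # Empty referrers get a pass (direct requests, privacy proxies)
    RewriteCond %{HTTP_REFERER} !^$
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
    # Don't redirect the placeholder image itself, or we'd loop
    RewriteCond %{REQUEST_URI} !/hotlinked\.png$
    RewriteRule \.(jpe?g|png|gif)$ /hotlinked.png [R=302,NC,L]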

------
dumbfounder
If everybody does this, then a new search engine will never come to be. You'd
better hope Google is the best search engine anyone could ever create.

As a search engine developer who has tried to compete with Google in the past,
I find this disheartening.

I agree with blocking anyone behaving badly. But you are cutting out the good
with the bad.

~~~
seiyak
Can you tell us a little bit about your project to compete with Google, if
that's OK with you? "Trying to compete with Google" is exactly what I'm doing
right now, so I'm just curious.

Were you unable to crawl hundreds of thousands of websites because your
crawler was disallowed by their robots.txt while the major crawlers were
allowed?

------
bediger4000
So, you have a problem with Ahrefs, too? But what about Ezooms? That's a very
annoying bot for my site and drives zero traffic. Also, what about
Cyveillance? They show up with a lying User-Agent field ("Mozilla/4.0
(compatible; MSIE 7.0; Windows NT 5.2)"), but run Linux, and they ignore
robots.txt by not even asking for it.

I actually blocked Ahrefs, Ezooms, Yandex and Cyveillance by IP address range
in httpd.conf for a while, but I've decided to send them randomly-generated
HTML ([http://stratigery.com/bork.php](http://stratigery.com/bork.php))
instead. I'm really surprised by Ahrefs' and Ezooms' appetite for gibberish
about naked celebrities, condiments and sweater puppies.

~~~
devicenull
Does Ahrefs not respect robots.txt? Their website says it does:
[https://ahrefs.com/robot/index.php](https://ahrefs.com/robot/index.php)

~~~
bediger4000
I don't know if I tried that. Originally, I wanted to let them know they were
forbidden, but now I just want to jerk them around.

------
powertower
That website has only 1 post, and the post talks about "SEO" and various
search-engine names... in relation to an SEO business.

I kind of see what he is _really_ trying to do...

Get Google to rank it from the start on long tails of "SEO <name-of-crawler>
+ other-keyword(s)".

(I'm not complaining; it's pretty smart to start out like that and make it
onto HN's front page.)

~~~
morley
Somewhat strange that HN links don't have the rel="nofollow" attribute; if
they did, this particular angle wouldn't work.

~~~
powertower
From my understanding, due to massive abuse of the rel="nofollow" linkout,
for the last several years Google has 1) ignored them, still passing
link-juice and/or incoming weight on terms (SEO), and 2) penalized sites that
use those types of links exclusively.

------
o0Oo0O
How about using Crawl-delay in robots.txt?

[http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive](http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive)

------
Scryptonite
I deal(t) with the same thing. I made it so my web server would try to stream
a page that never ends, and some bots would stay connected for hours and
hours. But over time they seem to have adapted.
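
The never-ending page is easy to sketch; a minimal standalone version (Python
stdlib, with the port and filler text made up):

    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class TarpitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # No Content-Length: the client reads until we close the
            # connection, and we never do -- we just drip bytes forever.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            try:
                while True:
                    self.wfile.write(b"<p>loading...</p>\n")
                    self.wfile.flush()
                    time.sleep(10)
            except (BrokenPipeError, ConnectionResetError):
                pass  # the bot finally gave up

    HTTPServer(("", 8080), TarpitHandler).serve_forever()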

I've also noticed some of them used to make requests synchronously (waiting
for the previous to finish before making another), but they have adapted to
make requests in parallel and add timeouts so they don't have their time
wasted quite as long.

I created a log of the ones who stayed connected the longest.

[https://gist.github.com/scryptonite/5324724](https://gist.github.com/scryptonite/5324724)

I don't bother to maintain it anymore, but it was pretty interesting watching
them change tactics over time.

------
FiloSottile
I would really advise against disallowing the archive.org bot. Its mission is
a really important one in the long run, and it would be disrupted by everyone
adopting a whitelist approach and cutting it out; I'm sure its load is
negligible.

------
elchief
What about a system like this:

You crawl your own site every day (or generate the same content into files)

Put the content into files with file names representing the URLs (or HAR
format?)

Zip em up

Put them on bittorrent

Tell the search engines to look there for your content

Wouldn't that save everyone a lot of work?

~~~
tmarthal
That was almost the original intent behind sitemap.xml
([http://sitemaps.org](http://sitemaps.org)). Except, of course, sitemaps
were used erroneously, with changes published when no content had changed, so
the spiders stopped treating them as canonical.

------
acidity
Which tool/code did you use to parse the logs and get those numbers?

~~~
birken
Just a little python script:
[https://gist.github.com/danbirken/7047504](https://gist.github.com/danbirken/7047504)

Apache Logs --> Python script --> CSV --> Google spreadsheets --> Manual labor
--> Blog post
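
For anyone who just wants the shape of it, a stripped-down sketch of the idea
(not the actual gist) looks like this, reading an Apache combined-format log
on stdin:

    import csv
    import re
    import sys
    from collections import Counter

    # The user agent is the last quoted field in the combined log format
    UA_RE = re.compile(r'"[^"]*" "([^"]*)"\s*$')

    counts = Counter()
    for line in sys.stdin:
        match = UA_RE.search(line)
        if match:
            counts[match.group(1)] += 1

    csv.writer(sys.stdout).writerows(counts.most_common())

Run it as 'python ua_counts.py < access_log > counts.csv' and import the CSV
into the spreadsheet.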

------
jzs
I'm glad to see that you have revised your list and added DuckDuckGo. It would
be sad to see genuine search engines like DuckDuckGo left out.

------
justkez
You may want to update this to reflect the fact that "MJ12bot" is selling SEO
services, too.

They run a reward program for using their agent and getting pages crawled (the
data is then crunched on a central server cluster owned by the operators) -
cash payouts that are seeded by the subscription plans they sell at
[http://www.majesticseo.com](http://www.majesticseo.com)

------
maaku
I would give insane kudos to the first person to implement this, or point me
at one which already exists:

I want an nginx module that allows a crawler once per specified period (per
day or per week, I would imagine, but configurable is better). That is to say,
it allows the bot to finish its crawl, then bans that IP/User-Agent for the
specified duration.
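
Not an nginx module, but the bookkeeping is simple enough to sketch in Python
(thresholds made up; a crawl counts as finished once the client goes quiet):

    import time

    BAN_SECONDS = 7 * 24 * 3600  # the configurable period
    IDLE_GAP = 600               # 10 min of silence ends a crawl

    last_seen = {}  # (ip, ua) -> time of last request
    banned = {}     # (ip, ua) -> time the ban started

    def allow_request(ip, ua):
        """Let a crawler finish one crawl, then refuse it for BAN_SECONDS."""
        now = time.time()
        key = (ip, ua)
        if key in banned:
            if now - banned[key] < BAN_SECONDS:
                return False
            del banned[key]  # ban expired; a fresh crawl may begin
        last = last_seen.get(key)
        if last is not None and now - last > IDLE_GAP:
            banned[key] = now  # previous crawl ended; start the ban
            del last_seen[key]
            return False
        last_seen[key] = now
        return True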

------
agwa
I have found this set of htaccess rules to be extremely useful in blocking bad
crawling behavior:

[https://github.com/bluedragonz/bad-bot-blocker](https://github.com/bluedragonz/bad-bot-blocker)

------
seiyak
"It turns out I'm not alone in adding these types of restrictions. Yelp blocks
everybody but Google, Bing, ia_archiver (archive.org), ScoutJet (Blekko) and
Yandex. LinkedIn also has a similar opt-in robots.txt, though they have
whitelisted a larger number of bots than yelp."

At least we can contact/email Yelp and LinkedIn to ask whether a given crawler
is allowed to crawl, according to their robots.txt. That's more generous than
just allowing the big search engines such as Google and Bing. I'm not quite
sure what would actually happen if we asked them, though. I'll try that.

------
blantonl
I see crawlers as enablers - I've even gone so far as spinning up a dedicated
server instance to serve content to the bots - so they get good response
times, which increases relevancy in search results.

------
v2interactive
Your assumption is wrong. Ahrefs is a backlink-compiling bot that tracks,
documents and records in/outbound links. They do not sell services directly.

~~~
v2interactive
*not that I am aware of.

~~~
birken
I was just going by this page:
[https://ahrefs.com/pricing_plans.php](https://ahrefs.com/pricing_plans.php)

It is possible I am misinterpreting what they are offering.

------
drakaal
We ([http://samuru.com](http://samuru.com)) limit our crawling, and we honor
robots.txt. We also use Google infrastructure, so we come from the same IPs as
Googlebot.

We probably crawl 50 pages for every visitor we deliver. There are a lot more
pages on the web than people on the planet. So the ratios will always favor
the bots.

------
cgore
At the bottom of the page he mentions Yelp, and looking at their robots.txt
they have this weird stuff for each of the allowed bots:

    Disallow: /biz/outlook-autumn-market-fundamental-catwalk-flimsy-roost-legibility-individualism-grocer-predestination-0

That repeats, over and over, ten times. Trying to see if the bots ignore
robots.txt, I guess?

~~~
ceejayoz
Probably a honeypot, yeah.

------
webhat
The first thing I did after reading this was look at
[http://danbirken.com/robots.txt](http://danbirken.com/robots.txt). I guess he
thought of that.

    # (This is not the SEO site in question)
    User-Agent: *
    Allow: /

------
eliben
I also noticed on my own site that Bingbot pulls 5-6x as much traffic as
Googlebot. Ridiculous...

------
ihenriksen
You should add some request-rate settings to your robots.txt file to prevent
robots from crawling your web site too often. Here is an example:
[http://www.sparkledb.net/robots.txt](http://www.sparkledb.net/robots.txt)
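
Something along these lines (Request-rate is a nonstandard extension, so only
some bots honor it):

    User-agent: *
    Request-rate: 1/5    # one page every five seconds
    Crawl-delay: 5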

------
droidlabour
What if a bot sets its user-agent to look like an end-user browser?

~~~
jessaustin
It probably wouldn't _act_ like a human-operated browser, even if it said it
was.

------
txutxu
It's easy for a crawler to parse a robots.txt in different ways.

The article says to whitelist a few and deny everything else.

A crawler could parse that and switch its user agent to a whitelisted one on
further requests.
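
The usual countermeasure is the double DNS check Google documents for
verifying Googlebot: reverse-resolve the IP, check the domain, then
forward-resolve the name and confirm it maps back. A minimal sketch:

    import socket

    def is_real_googlebot(ip):
        # The claimed Googlebot IP must reverse-resolve to a
        # googlebot.com or google.com hostname...
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # ...and that hostname must forward-resolve back to the
        # same IP, or the PTR record is forged.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False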

------
ressaid1
What about the bots that are not identifying themselves as such?

