
Robots.txt Disallow: 20 Years of Mistakes To Avoid - hornokplease
http://www.beussery.com/blog/index.php/2014/06/robots-txt-disallow-20/
======
Asparagirl
This article forgot the _very_ worst use of robots.txt:

    
    
      User-agent: ia_archiver
      Disallow: /
    

Those two lines mean that all content hosted on the entire site will be
blocked from the Internet Archive (archive.org) Wayback Machine, and the
public will be unable to look at any previous versions of the website's
content. It wipes out a public view of the past.

Yeah, I'm looking at you, Washington Post:
[http://www.washingtonpost.com/robots.txt](http://www.washingtonpost.com/robots.txt)

Banning access to history like that is shameful.

~~~
pekk
No, you do NOT have the right to pound my site with requests and serve data
that I decided to pull down.

~~~
angersock
"serve data that I decided to pull down."

If it's on their bandwidth and power, why not?

~~~
bsilvereagle
I think pekk meant that if he deletes a blog post, the IA is still going to
serve it. "Pull down" refers to deleting content, not bandwidth usage.

~~~
blueskin_
If you don't want it on the internet, don't post it. Assuming anything can
ever be made to disappear from the internet is naive, and if people become
aware you're trying, it'll just get Streisanded and become even more widely
posted.

~~~
dredmorbius
There are two sides to this argument. I argue both. All the fucking time.

If you're in the business of providing well-known _public_ content to the
public, then allowing it to be archived makes a lot of sense.

If you're providing _user-generated content_, I'd argue the case for allowing
archival is even stronger. Sites that violate this, and Quora comes
specifically to mind, are violating what many, myself included, consider to
be part of the social contract of the Web.

On the other hand, if you're an individual, and you are posting your own
content and ramblings, and circumstances change for whatever reason: you've
got a job, you've lost a job, you're married, you're divorced, you're
_getting_ divorced, your child is at war in a foreign country, a foreign
country is at war with yours, or you're just sick of the crap you wrote when
you were young and arrogant, and now an old and arrogant you wants it gone:
I'm pretty willing to grant you that right.

If you've committed some terrible crime against humanity, or just a human,
_and have been fairly tried and convicted of it_ , I'd probably _not_ give you
the right to remove large bits of that information.

And yes, there are vast fields, deserts, tundras, plains, steppes, ice-fields,
and oceans of grey about all of this.

Barbra Streisand _got_ Streisanded _because she is_ Streisand.

Ahmed's Falafel Hut likely _wouldn't_ suffer the same fate. His Q-score is
somewhat lower, and there's only so much real estate in the public
consciousness.

------
TheLoneWolfling
What frustrates me is the number of websites that impose additional
restrictions on anything they don't recognize, or worse, restrict (or even
outright ban) anything that isn't Googlebot.

And people wonder why alternative search engines have such a hard time taking
off.

~~~
dredmorbius
I can give you a really simple operational reason for that: complexity.

Google is somewhere between 50-90% of most sites' search referrals (source:
/dev/ass). Add in a handful of other search engines (Bing, DDG, Yahoo, Ask)
and you've pretty much got all of it.

They're maybe 10-20% of your crawl traffic though. And possibly a _lot_ less
than that.

There are a _TON_ of bots out there. If you're lucky, they just fill your logs
and hammer your bandwidth.

If you're not so lucky, they break your site search, overload your servers,
and if you're particularly unlucky, they wake you up with 2:30 am pages for
two weeks straight.

At which point the simplest way to solve the technical problem, that is, you
getting a full night's sleep, is to ban every last fucking bot but Google. Or
maybe a handful of the majors.

Now, of course, you're a data-driven operation and you're relying on Google
Analytics to tell you who's sending traffic your way. But if you block a
search crawler, it's going to stop sending you traffic, so you won't know it's
important.

It's a rather similar set of logic that drives people to set email bans on
entire ccTLDs or ASN blocks for foreign countries. And if you're a smallish
site, it's probably a decent heuristic. And no, it's not just fucking n00bs
who do this. Lauren Weinstein, who pretty much personally birthed ARPANET at
UCLA, was bitching on G+ just a week or so back that the new set of unlimited
TLDs ICANN were selling were rapidly going into his mailserver blocklists.
Because, of course, the early adopters of such TLDs tend to be spammers, or at
least, the early adopters he's likely to hear from.

[https://plus.google.com/114753028665775786510/posts/SsgPNHLGvnF](https://plus.google.com/114753028665775786510/posts/SsgPNHLGvnF)

------
dredge
The article contains some good observations, but I'm struggling to understand
this one:

"Some sites try to communicate with Google through comments in robots.txt"

In the examples given, none appear to be trying to "communicate with Google
through comments" - how is including...

    
    
      # What's all this then? 
      #   \
      # 
      #    -----
      #   | . . |
      #    -----
      #  \--|-|--/
      #     | |
      #  |-------|
    

...a "mistake" to avoid? There's no harm in it at all.

~~~
Istof
"Some sites try to communicate with Google through comments in robots.txt"

I thought that was the whole point of robots.txt

~~~
lmm
No, the point is to communicate with Google through non-comments in
robots.txt.

------
freddielarge
fun fact: robots.txt can also be used by attackers to find admin interfaces or
other sensitive tidbits that you don't want search engines to crawl

lots of target-detection crawlers will look at robots.txt as the first thing
they do, to see if there are any fun pages you don't want the other crawlers
to see
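
for example, a robots.txt along these lines (paths hypothetical) reads like a
map of exactly the spots you wanted hidden:

    
    
      User-agent: *
      Disallow: /admin/
      Disallow: /backups/
    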

~~~
snowwrestler
If you want to hide admin pages, add the robots meta tag to each one and set
noindex, nofollow. Then you don't need to list them all in one place in
robots.txt.
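
For example, a minimal sketch of the tag that goes in each page's <head>:

    
    
      <meta name="robots" content="noindex, nofollow">
    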

That said, obscurity is not really security. Your admin pages should be behind
a password, which, if coded properly, will exclude spiders, bots, and bad
guys.

------
spaulo12
In the past I've created an empty robots.txt just to keep the 404 errors out
of my logs...

------
sp332
Why does Google ignore the crawl delay?

~~~
sbierwagen
Google has millions of spiders, in datacenters all over the world. Maybe
respecting crawl delay added more shared-state overhead than they wanted.
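
For reference, the directive in question is the non-standard Crawl-delay
line, which some other crawlers (Bing, for instance) honor but Google
ignores:

    
    
      User-agent: *
      Crawl-delay: 10
    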

------
pipihu
The main use for robots.txt is to prevent crawling of infinite URL spaces:
[http://googlewebmastercentral.blogspot.com.br/2008/08/to-infinity-and-beyond-no.html](http://googlewebmastercentral.blogspot.com.br/2008/08/to-infinity-and-beyond-no.html)

Alongside tagging links to such resources with nofollow.
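
A minimal sketch, assuming the infinite space lives under a hypothetical
/calendar/ path:

    
    
      User-agent: *
      Disallow: /calendar/
    

with any internal links into it marked up as <a href="/calendar/"
rel="nofollow">.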

~~~
ashmud
Back in the day, I would use httrack for offline web browsing, and these were
a constant irritation.

------
sbierwagen
My server returns 410 GONE to robots.txt requests.

The robots exclusion protocol is a ridiculous anachronism. I don't use it and
neither should you.
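
In nginx, for instance, that's a one-liner (a sketch; other servers have
equivalents):

    
    
      location = /robots.txt {
          return 410;
      }
    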

~~~
ars
And what do you do about sites with an infinite number of pages?

~~~
sbierwagen
By not writing bad software. State shouldn't be stored in URLs, it should be
stored in cookies.

Spiders have to be robust against sites with unlimited numbers of internal
links anyway, or else an attacker could trap a web spider with a malicious
site, or a 13-year-old writing a buggy PHP app could take down Google's entire
spidering system.

~~~
ars
> By not writing bad software. State shouldn't be stored in URLs, it should be
> stored in cookies.

GAH!! So it's you who writes those horrible sites?

I want to be able to middle click on two different URLs and browse two pages
with completely different state at the same time.

I HATE sites that store state in cookies; the two different tabs start
getting completely mixed up about where I am in the site.

The only thing that should be in a cookie is stuff like a shopping cart. But
that's only because the action "add to cart" is like a transaction and should
be remembered.

Viewing a page and changing the sort is ephemeral and should have no effect on
anything else.

> Spiders have to be robust

Who cares about the spider? What about your site that got hit with an unending
stream of completely useless page views?

Your position about robots.txt is simply wrong and you need to change your
mind.

------
franze
yeah, robots.txt is a horrible standard. trust me, i wrote
[https://www.npmjs.org/package/robotstxt](https://www.npmjs.org/package/robotstxt)
just so that i could really understand what is going on. it's based on
[https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt](https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt)

the article is pretty much correct (although strangely worded at times), but
the stuff about "communicating via robots.txt comments to google" is of
course not true. the examples he gives are developer jokes, nothing more.

still, you should not use comments in your robots.txt. why?

you can group user agents i.e.:

    
    
        User-agent: Googlebot
        User-agent: bingbot
        User-Agent: Yandex
        Disallow: /
    

Congrats, you have just disallowed googlebot, bingbot and yandex from
crawling (not indexing, just crawling).

ok, now:

    
    
        User-agent: Googlebot
        #User-agent: bingbot
        User-Agent: Yandex
        Disallow: /
    

so: you have definitely blocked yandex, and you do not care for bingbot
(commented out), but what about googlebot? are googlebot and yandex part of
the same user-agent group? or is googlebot its own group and yandex its own
group? if the commented line is interpreted as a blank line, then googlebot
and yandex are different groups; if it's interpreted as non-existent, they
belong together.

the way i read the spec [https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt](https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt), this behaviour is undefined. (please correct me if i'm
wrong)

simple solution: don't use comments in the robots.txt file.

also, please somebody fork and take over
[https://www.npmjs.org/package/robotstxt](https://www.npmjs.org/package/robotstxt)
it has this undefined behaviour, it does not follow HTTP 301 redirects
(behaviour that was unspecified when i coded it), and it tries to do too much
(fetching and analysing, when it should only do one thing).

by the way, my recommendation is to have a robots.txt file like this

    
    
        User-agent: *
        Disallow: 
    
        Sitemap: http://www.example.com/your-sitemap-index.xml
    

and return HTTP 200

why: if you do not have a file there, then at some point in the future you
will suddenly return HTTP 500, or HTTP 200 with some response that can be
misleading. also it's quite common that the staging robots.txt file spills
over into the real world; this happens as soon as you forget that you have to
care about your real robots.txt

also read the spec [https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt](https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt)

------
blueskin_
There are enough malicious bots that do follow robots.txt to make it still an
important option for most sites.

------
Istof
500kb limit? you call that short and sweet?

