
Analyzing One Million Robots.txt Files - foob
https://intoli.com/blog/analyzing-one-million-robots-txt-files/
======
zorpner
_The next steps towards standardization began when Google, Yahoo, and
Microsoft came together to define and support the sitemap protocol in 2006.
Then in 2007, they announced that all three of them would support the Sitemap
directive in robots.txt files. And yes, that important piece of internet
history from the blog of a formerly 125 Billion dollar company now only exists
because it was archived by Archive.org._

The Internet Archive (archive.org) is currently running their end-of-year
donation drive; if you value the work they do, it's a good time to donate:
[https://archive.org/donate/](https://archive.org/donate/)

(and on the topic of robots.txt, it sounds like they're moving in the
direction of disallowing people from using them indiscriminately to block
access to valuable archival materials:
[https://blog.archive.org/2017/04/17/robots-txt-meant-for-
sea...](https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-
engines-dont-work-well-for-web-archives/) )

~~~
stordoff
Has the IA ever discussed why they retroactively apply robots.txt? I can see
the rationale (though I don't necessarily think it's the best idea given the
IA's goals) for respecting it at crawl time, but applying it retroactively has
always felt unnecessary to me.

~~~
db48x
It seems pretty obvious: copyright restricts distribution, so they hide pages
that the apparent copyright holder apparently doesn't want distributed.

------
benfrederickson
I also wrote up an analysis of the top 1M robots.txt files:
[http://www.benfrederickson.com/robots-txt-
analysis/](http://www.benfrederickson.com/robots-txt-analysis/)

I ended up analyzing very different things from this article though, so this
article was still pretty interesting to me.

------
jp_sc

      “traditionally used for
      vague attempts at humor
      which signal to twenty-something
      white males that this is
      a “cool” place to work.”

WTF with the casual sexism/ageism?

~~~
foob
Just to clarify, this was intended as a tongue-in-cheek critique of tech
companies that actively project superficial images designed to appeal to
specific hiring demographics. I'm sorry if the meaning didn't come across as
clearly as I had hoped for, but the statement was meant to be a condemnation
of sexism and ageism rather than an endorsement.

~~~
digitalsigil
I see what you're saying. And to be fair, I don't think you're being casually
sexist. What you're doing is casually accusing someone else of being sexist.

That said, I think it's a little hypocritical to insult someone for dropping
casually sexist content into their technical work when you're doing something
very similar.

------
feelin_googley
"The web servers might not have cared about the traffic, but it turns out that
you can only look up domains so quickly before a DNS server starts to question
your intentions!"

s/DNS server/third party open resolver/

IME, querying an authoritative server for the desired name triggers no such
limitations.

One does not even need to use DNS to get the IP addresses for those
authoritative servers, if the zone file is made available for free to the
public as most are, under the ICANN rules.
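
A minimal sketch of the zone-file approach in Python, assuming the common
`name TTL class type rdata` record layout (real zone files need a proper
parser, and the names and addresses below are only illustrative):

```python
# Pull glue A records out of a zone-file snippet, so lookups can go
# straight to the authoritative servers instead of a shared open resolver.
zone_snippet = """\
example.com.        172800 IN NS a.iana-servers.net.
example.com.        172800 IN NS b.iana-servers.net.
a.iana-servers.net. 172800 IN A  199.43.135.53
b.iana-servers.net. 172800 IN A  199.43.133.53
"""

nameservers = {}
for line in zone_snippet.splitlines():
    fields = line.split()
    # fields: name, TTL, class, type, rdata
    if len(fields) == 5 and fields[3] == "A":
        nameservers[fields[0].rstrip(".")] = fields[4]

print(nameservers)
```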

I have thought about building a database of robots.txt many times. IMO,
robots.txt has an important role besides thwarting "bots". It can thwart
humans as well. It can be used to make entire websites "disappear" from the
Internet Archive Wayback Machine.

Perhaps others are making mirrors of the IA.

However, I have thought it could be useful to monitor the robots.txt of
important websites on a more frequent basis than IA, in order to (if possible)
preemptively archive the IA's collections if robots.txt changes are ever
detected that would effectively "erase" them from the IA.
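
A monitor along these lines needs nothing beyond the standard library: parse
each fetched robots.txt and check whether `ia_archiver` (the user-agent the
Wayback Machine's crawler has historically honoured) is shut out. A sketch,
with the fetch-and-compare loop left as a comment:

```python
from urllib import robotparser

def blocks_wayback(robots_txt: str, url: str = "http://example.com/") -> bool:
    """Return True if this robots.txt would hide `url` from ia_archiver."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch("ia_archiver", url)

# A change-detection loop would fetch each site's robots.txt on a schedule,
# compare blocks_wayback() against the previous run, and kick off a
# preemptive archive job when a site flips from False to True.
open_txt = "User-agent: *\nAllow: /"
blocked_txt = "User-agent: ia_archiver\nDisallow: /"

print(blocks_wayback(open_txt))     # False
print(blocks_wayback(blocked_txt))  # True
```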

Perhaps the greatest thing about robots.txt is that it is "plain text". This
"rule" _seems_ to be ubiquitously honoured. Did the author ever find any html,
css, javascript or other surprises in any robots.txt file?
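
One cheap way to hunt for such surprises: many "robots.txt" files in the wild
are really HTML error pages served with a 200 status, and a crude heuristic
catches most of them. A sketch (the heuristic is my own assumption, not
anything from the article):

```python
def looks_like_html(robots_body: str) -> bool:
    """Crude heuristic: a robots.txt that is actually an HTML page
    (a common real-world surprise) usually starts with a tag or doctype,
    while a real robots.txt starts with a directive or a comment."""
    return robots_body.lstrip().startswith("<")

print(looks_like_html("User-agent: *\nDisallow:"))         # False
print(looks_like_html("<!DOCTYPE html><html>...</html>"))  # True
```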

~~~
toomuchtodo
I specifically go after sites for archiving that block the Internet Archive in
their robots.txt file.

The Internet Archive is also modifying its policy on retroactive blocking
using robots.txt, although I don’t have the blog post link handy at the
moment.

If you’d like to mirror certain Internet Archive contents, every item is
served as a torrent.

~~~
mynewtb
The Wayback Machine data is not available in bulk, only via the web interface.

------
mindB
The history presented in this post was very interesting, but the analysis
ended up being disappointing. The article ends just after they manage to
narrow their sample of robots.txt files to exclude duplicate and derivative
files; they don't even present any summary statistics for this filtered
sample.

------
tomcam
Surprisingly interesting post that goes into the history of robots.txt and
details how it is not, in fact, a W3C standard or a legal requirement.

~~~
walshemj
It does seem obsessed with the non-standard Crawl-delay. Interestingly, it
doesn't mention robots.txt files that have a BOM, which can stop them from
working - which is why best practice is to have a comment or a blank line at
the top of your robots.txt.

~~~
greglindahl
That's a best practice for everyone?

~~~
walshemj
Yep - if your first line is a disallow and you have a text file with a BOM,
it will be ignored.
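
Python's standard `urllib.robotparser` illustrates the failure mode: a BOM
glued to the first `User-agent` line makes that line unrecognizable, so the
group never opens and the `Disallow` that follows is silently dropped. A
small sketch:

```python
from urllib import robotparser

BOM = "\ufeff"  # UTF-8 byte order mark, as it appears in decoded text

clean = ["User-agent: *", "Disallow: /private/"]
with_bom = [BOM + clean[0]] + clean[1:]

rp_clean = robotparser.RobotFileParser()
rp_clean.parse(clean)

rp_bom = robotparser.RobotFileParser()
rp_bom.parse(with_bom)

# Without the BOM the rule applies; with it, the whole group is lost
# and the parser falls back to allowing everything.
print(rp_clean.can_fetch("*", "http://example.com/private/x"))  # False
print(rp_bom.can_fetch("*", "http://example.com/private/x"))    # True
```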

~~~
greglindahl
I was more thinking that most webmasters don't use editors and OSes where a
BOM might be emitted.

------
CM30
Honestly, I'm kind of surprised that Turnitin's bot listens to robots.txt, or
that the 'anti copyright infringement' bots do the same. Seems like it
provides a very simple way for a cheating site to just thwart their entire
'system'.

But hey, I guess it's one of those cases where the law and basic ethics clash
a bit: with certain laws saying 'unauthorised' access to a server is illegal,
ignoring robots.txt would leave them under fire for that instead.

