
Robots.txt meant for search engines don’t work well for web archives - r721
http://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
======
metafunctor
It appears that IA applies (or did apply) a new version of robots.txt to pages
already in their index, even if they were archived years ago. That's silly,
and stopping that practice would probably solve much of this problem.

~~~
icebraining
I don't think it's "silly". IA operates in a sketchy legal environment.
There's no fair use exclusion for what they're doing, and it made sense to be
extra careful and deferential towards website operators, lest they get hit by
a lawsuit.

~~~
the8472
> There's no fair use exclusion for what they're doing

Fair use is not the only exception to copyright. US copyright law has a
separate section on exceptions for libraries and archives.

~~~
jacquesm
Ask yourself what you would rather have the IA spend its meager funds on:
buying hardware and paying people to do critical work, or paying a bunch of
lawyers to fight lawsuits they would lose anyway against much better funded
opponents.

~~~
matt4077
I asked, and the answer was: "it's important to fight these fights, which is
why I'm donating to the ACLU".

I believe 17 US Code § 108 is relevant here. It starts:


    it is not an infringement of copyright for a library
    or archives, or any of its employees acting within the
    scope of their employment, to reproduce no more than
    one copy or phonorecord of a work[...]


There's obviously more to it that I haven't done research on, but that's a
pretty good start and I wouldn't worry too much about lawsuits. In fact, if
they were at risk of lawsuits, I don't see why respecting robots.txt would
stop them–there's no "but you didn't tell me not to" excuse in copyright.

~~~
biztos
If someone wanted to sue the Archive, they would probably argue that every
time archive.org serves a file they are making a copy... which is true, after
all, if anything reproduced digitally is a "copy" in that sense.

Nice point about the lack of implied permission in copyright. It makes me
think robots.txt probably doesn't have any meaning one way or the other
legally, but is just a community thing.

~~~
icebraining
> If someone wanted to sue the Archive, they would probably argue that every
> time archive.org serves a file they are making a copy... which is true,
> after all, if anything reproduced digitally is a "copy" in that sense.

It's more than a theoretical point - that each "serving" of a file is a copy
is well established legally. In fact, even _loading a program to RAM_ was
considered a copy, per MAI Systems Corp. v. Peak Computer, until Congress made
an explicit exception.

~~~
zerocrates
And that exception only applies to people doing maintenance on your computer.

------
libeclipse
On the linked page, I see comments about ignoring the webmasters' wishes et
al.

All I can say is f*ck that. It's a free and open internet. If you put content
up on a public site, anyone has the right to go and look at it. Stop
complaining when someone saves it.

And sure, some people complain that scrapers slow down their site and that's
why they use robots.txt, but really? Really? It's 2017 and your site is
affected by that? I think you have bigger things to worry about.

~~~
chii
> scrapers slow down their site and that's why they use robots.txt

A poorly written scraper may really slow down your site, especially if the
site wasn't built to be scraped repeatedly. There should be a way to specify
the frequency scrapers should follow (set by the website owner via a
robots.txt-like spec).

But website owners cannot demand unreasonable frequencies (such as once a
year!), and what constitutes unreasonable is up for debate.
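
For what it's worth, there is already a de facto (non-standard) Crawl-delay
directive that some crawlers honour, and Python's stdlib robots.txt parser
understands it. A rough sketch of a well-behaved crawler; the robots.txt
content and URLs here are made up:

    import time
    import urllib.robotparser

    # Made-up robots.txt content. Crawl-delay is a non-standard
    # extension, but urllib.robotparser (Python 3.6+) parses it.
    robots_lines = [
        "User-agent: *",
        "Disallow: /private/",
        "Crawl-delay: 10",
    ]

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)

    pages = ["http://example.com/", "http://example.com/private/a.html"]
    delay = parser.crawl_delay("*") or 1  # fall back to 1s if unspecified

    for url in pages:
        if parser.can_fetch("*", url):
            print("fetching", url)  # actual HTTP fetch omitted
            time.sleep(delay)       # pace requests as the owner asked
        else:
            print("skipping", url)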

~~~
dingo_bat
> (specified by the website owner via a robots.txt like spec).

Nope, if a website wants such a restriction, it must enforce it. Robots.txt is
a request. It's worthless.

~~~
bkor
If a robot misbehaves, it'll either be blocked or reported to the network's
abuse desk, and that bot will be taken down. That a site could possibly have
some kind of technical solution to this doesn't matter.

~~~
problems
Precisely - the solution here needs to be that the server blocks the robot,
if it can differentiate it from other traffic, that is. That's all well and
good, and that's the solution that should be used here. If you don't want to
be archived, block the IP.
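
A minimal sketch of what that block could look like at the application layer,
as WSGI middleware (in practice you'd do this in the web server or firewall;
the CIDR range below is a documentation placeholder, not a real crawler's):

    import ipaddress

    # Ranges to refuse; substitute whatever you've identified for the bot.
    BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

    def is_blocked(remote_addr):
        addr = ipaddress.ip_address(remote_addr)
        return any(addr in net for net in BLOCKED_NETWORKS)

    def blocking_middleware(app):
        """Wrap a WSGI app; return 403 to blocked source addresses."""
        def wrapper(environ, start_response):
            if is_blocked(environ.get("REMOTE_ADDR", "0.0.0.0")):
                start_response("403 Forbidden",
                               [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return app(environ, start_response)
        return wrapper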

------
EvanAnderson
The policy the Internet Archive applies re: "robots.txt" comes from an archive
policy created at U.C. Berkeley in the early 2000s (The Oakland Archive
Policy -
[http://www2.sims.berkeley.edu/research/conferences/aps/remov...](http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html)).

Jason Scott (an employee of the Internet Archive) mentioned that the Archive
doesn't ever delete anything. He stated that items may be removed from public
access because of changes to "robots.txt" but they're not actually deleted.
(That's a little comforting, at least.)

------
TeMPOraL
Archive.org needs to be able to archive itself. We could then use archive.org
to view how archive.org in the past viewed some interesting site, thus
avoiding the whole retroactive robots.txt fail.

;).

------
mushiake
Fantastic news.

Archive Team's take on this[0]

[0][http://www.archiveteam.org/index.php?title=Robots.txt](http://www.archiveteam.org/index.php?title=Robots.txt)

~~~
dingaling
It is great news in general, but seems to be done in a clumsy and
counterproductive manner that may cause the Internet Archive to be banned from
crawling some websites.

 _The problem_: when robots.txt for a website is found to have been made more
restrictive, the IA retroactively applies the _new_ restrictions to
_already-archived_ pages and hides them from view. This can also cause entire
domains to vanish into the deep archive. No one outside the IA thinks this is
sensible.

 _Their solution_: ignore robots.txt altogether. What? That will just annoy
many website operators.

 _My proposed solution_: keep parsing robots.txt on each crawl and obey it
prospectively, without applying the changes to existing archived material.
This is actually less work than what they currently do. If the new robots.txt
says to ignore about_iphone.html, you just stop crawling it; older versions
aren't affected.

Basically they're switching from being excessively obedient to completely
ignoring robots.txt in order to fix a self-made problem. I can only see that
antagonising operators.

~~~
duskwuff
There's some value in allowing site operators to retroactively remove content
which was never intended to be public. A common and unfortunate example is
backups (like SQL dumps) being stored in web-accessible directories, then
subsequently being indexed and archived when a crawler finds the appropriate
directory index.

What needs to be fixed first is just the really common case mentioned in the
blog post, where a domain changes ownership and a restrictive robots.txt is
applied to the parking page.

~~~
Spare_account
Here's a slight modification to the GP proposal:

- Respect robots.txt at the time you crawl it.

- If robots.txt appears later, stop archiving from that date forwards.

- Preserve access to old archived copies of the site by default.

- Offer a mechanism that allows a proven site owner to explicitly request
retrospective access removal.

If archive.org have recorded the date that they first observed a robots.txt on
the sites currently unavailable, they could even consider applying the above
logic today retrospectively. Perhaps after a couple of warning emails to the
current Administrative Contact for the domain.
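
Sketched as code, the policy above might look something like this (a toy
model; the type and field names are all hypothetical):

    from dataclasses import dataclass, field
    from datetime import date
    from typing import Optional, Set

    @dataclass
    class SiteRecord:
        # Date a restrictive robots.txt was first observed, if ever.
        robots_restricted_since: Optional[date] = None
        # Snapshot dates a proven owner explicitly asked to hide.
        removal_requests: Set[date] = field(default_factory=set)

    def should_crawl(site, today):
        """Respect robots.txt going forward: no new crawls once it appears."""
        return (site.robots_restricted_since is None
                or today < site.robots_restricted_since)

    def snapshot_visible(site, snapshot_date):
        """Old snapshots stay public unless the owner requested removal."""
        return snapshot_date not in site.removal_requests

    # e.g. a site whose robots.txt turned restrictive on 2017-01-01:
    site = SiteRecord(robots_restricted_since=date(2017, 1, 1))
    assert not should_crawl(site, date(2017, 4, 17))
    assert snapshot_visible(site, date(2015, 6, 1))  # old copy stays up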

~~~
pbhjpbhj
>mechanism that allows a proven site owner to explicitly request retrospective
access removal. //

It should be "a proven content owner"; just buying a site shouldn't allow
someone to remove it from the archive.

------
c0achmcguirk
In the 90s I spent a lot of time on my website and I loved learning how web
crawlers worked. I started using a robots.txt file without really
understanding it. I ended up blocking everything, thinking it would make my
site faster for visitors because crawlers might crawl the site all the time.

After I graduated from college I lost access to my website which was hosted on
the Computer Science department's web servers.

I wish I hadn't used that robots.txt file. I would love to find the pages I
made that compared interfold vs. exterfold staple strength, or the site I made
with a ranch theme with a cowboy that had humorous advice... I don't have any
content in archive.org because it honored the robots.txt file.

...sigh... wish I had backed up my stuff.

------
laumars
To be honest I see robots.txt as a failed experiment since it relies on trust
rather than security or thoughtful design.

~~~
hdhzy
I don't think it's about security.

For example, I've got a link that does delegated login, like /login-with/github.
When people click it, an OAuth flow starts. But it's useless for robots to
follow, so I disallow it in robots.txt. If they follow it anyway, nothing
breaks and it's not a security issue, but if I can avoid starting unnecessary
OAuth flows, that's an additional benefit.

~~~
laumars
robots.txt wasn't created for security, but it can have security implications
if you publish a list of Disallow paths with the intention of hiding sensitive
content (sadly, I have seen that happen a lot), whereas a better approach
would be IP whitelisting and/or user authentication.

However, I'm not claiming security is the only reason people use (misuse?)
robots.txt. For example, in your case you could avoid the need for a
robots.txt with a nofollow attribute[1]. Sure, bad bots could still crawl your
site and find the authentication URL without probing robots.txt, so the
security implications there are pretty much non-existent. But you've already
got a thoughtful design (the other point I raised) that mitigates the need for
robots.txt anyway, so adding something like "nofollow" may be enough to remove
the robots.txt requirement altogether.

[1]
[https://en.wikipedia.org/wiki/Nofollow](https://en.wikipedia.org/wiki/Nofollow)

~~~
dchest
This is crazy, that's not what robots.txt is for. How can you complain about
the security of a thing that is not meant to provide security?

According to your logic, newspapers are a "failed experiment because they rely
on trust rather than security or thoughtful design". I published an article
with my treasure map and told people not to go there, but they stole it.

~~~
laumars
That was an anecdote since the previous poster raised the point about
security. I'm definitely not claiming robots.txt should be for security nor
was designed for security!

I said that following proper security and design practices renders obsolete
all the edge cases for which people might use robots.txt. I'm saying that if
you design your site properly then you shouldn't really need a robots.txt.
That applies to all the examples HN commenters have raised about their
robots.txt usage thus far.

I would rewrite my OP to make my point clearer but sadly I no longer have the
option to edit it.

~~~
dchest
 _design your site properly then you shouldn't really need a robots.txt_

But how? For example, if you don't want a page to be indexed by Google, you
add this information to robots.txt. Nofollow doesn't work for every case,
because any external website can link to it, and Google will discover it.

~~~
laumars
That's a good point. I'm not sure how you'd get around non-HTML documents
(e.g. PDFs), but web pages themselves can be excluded via a meta tag:


    <meta name="robots" content="noindex">


Source:
[https://support.google.com/webmasters/answer/93710?hl=en](https://support.google.com/webmasters/answer/93710?hl=en)

Interestingly in that article, there is the following disclaimer about not
using robots.txt for your example:

 _" Important! For the noindex meta tag to be effective, the page must not be
blocked by a robots.txt file. If the page is blocked by a robots.txt file, the
crawler will never see the noindex tag, and the page can still appear in
search results, for example if other pages link to it."_

I must admit even I hadn't realised that could happen, and I was critical of
the use of robots.txt to begin with.

~~~
dchest
Ah, that's true, indeed. The page, though, will appear as a link without any
contents, because the bot won't be able to index it.

~~~
laumars
Except it has indexed it. It just hasn't crawled it. But content or not, the
aim you were trying to achieve (namely, your content not being indexed) has
failed. Thus you are once again dependent on other countermeasures that render
the robots.txt irrelevant.

------
cosinetau
For whatever it's worth: [http://humanstxt.org/](http://humanstxt.org/)

------
afandian
I'm not so sure that even Google respects it. I did some digging into the
semantics of robots.txt whilst writing a bot myself, and it seems that Google
doesn't _follow links_ that are excluded, but it will still visit those pages.
Maybe that counts as "paying attention", but I don't think they "respect" it.

~~~
butler14
They respect it, but because it's so frequently misused or plain broken, it's
basically sidelined in favour of more reliable methods for preventing
indexation, or for getting an already-indexed piece of content removed, such
as the noindex tag.

------
dbg31415
I think the biggest argument for honoring robots.txt is that sites, especially
old sites, can have a lot of highly resource-intensive pages. I don't want
someone crawling a page that makes 800+ DB calls, for example. Yes, I should
optimize the page, or whatever... but really, that page may only be useful to
an admin, or to 1 out of 10,000 users. It's not ideal to have someone crawl
all those pages at once.

I think they should honor robots.txt, and the meta tag version on specific
pages or links. Given the site publisher went out of their way to give
instructions to crawlers, it seems reasonable to honor those requests.

------
sengork
Found this[1] via Wikipedia's Talk page for the robots.txt article. It shows
that early on, robots.txt was designed to help preserve the bandwidth of web
servers. Back then it would have been due to bandwidth contention; today it
may be the bandwidth cost that robots.txt helps some operators mitigate.

[1]
[https://yro.slashdot.org/comments.pl?sid=377285&cid=21554125](https://yro.slashdot.org/comments.pl?sid=377285&cid=21554125)

------
yeukhon
I remember writing a dumb parser for robots.txt. I have to agree: robots.txt
is simplistic but so non-standard. I wonder why search engines can't just say
NO to this. Do search engines today still honor robots.txt?

Here's my shameless plug: [https://github.com/yeukhon/robots-txt-scanner](https://github.com/yeukhon/robots-txt-scanner)

I still remember writing most of this on Caltrain one morning, heading to SF
to visit someone I dearly loved...
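
Incidentally, the "dumb parser" part really can be tiny. A toy version (not
the code from that repo) that only understands User-agent and Disallow, which
is roughly the whole original 1994 spec:

    def parse_robots(text):
        """Map each user-agent to its list of disallowed path prefixes."""
        rules = {}
        agents = []          # user-agents in the current record
        seen_rule = False    # whether the current record has rules yet
        for raw in text.splitlines():
            line = raw.split("#", 1)[0].strip()  # drop comments
            if not line or ":" not in line:
                continue
            key, value = (p.strip() for p in line.split(":", 1))
            if key.lower() == "user-agent":
                if seen_rule:              # a new record starts here
                    agents, seen_rule = [], False
                agents.append(value)
                rules.setdefault(value, [])
            elif key.lower() == "disallow" and agents:
                seen_rule = True
                for agent in agents:
                    rules[agent].append(value)
        return rules

    print(parse_robots("User-agent: *\nDisallow: /private/"))
    # {'*': ['/private/']}

All the messy parts (Allow, wildcards, Sitemap, weird casing and whitespace)
are exactly where the non-standard pain lives.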

------
Aissen
Finally. A bit late, since a lot of the archive has been removed because of
new owners' aggressive (or malicious) robots.txt files.

~~~
pjc50
Hidden, rather than removed.

~~~
Aissen
I hope so.

------
6d6b73
There should be a way to direct archiving bots to a file that has the newest,
compressed version of the website for them to download. Wouldn't that be
easier for everyone?

~~~
BHSPitMonkey
Seems like it would just get abused with false content.

~~~
6d6b73
True, but maybe it would be a good option for non-commercial websites that
would like to be archived, and it would make archiving more efficient both for
them and for archive.org.

------
droithomme
I have some sites where I specifically block archiving from some sections for
good reason. (Even if I didn't have a good reason though it would still be my
choice.)

I have a very big problem with them disregarding robots directives. Sure some
crawlers ignore them: Hostile net actors up to no good. This decision means
they are a hostile net actor. I'll have to take extreme measures such as
determining all the ip address ranges they use and totally blocking access.
This inconveniences me, which means they are now my enemy.

 _edit- For those interested: Deny from 207.241.224.0/22_

~~~
Asparagirl
Are you under the impression that individual web archivists don't _also_
scrape websites of interest and submit those WARCs for inclusion in the
Wayback Machine, independent of the IA's crawlers?

Because believe me, we do... good luck banning every AWS and DO IP range.

~~~
droithomme
Thank you for the tip. I wasn't aware of that, but it was no problem to
update the rules to account for the full AWS range based on the new
information. I greatly appreciate your feedback. I am not sure what DO is,
though; would you be so kind as to deacronymize that for me? Thank you.

~~~
Asparagirl
We also run crawlers on our home laptops, on university servers, on every
cheapo hosting service we can find (especially if they offer decent or
"unlimited" bandwidth), and so on. Tools like wget and wpull can randomize the
timing between requests, use regex to avoid pitfalls, change the user-agent
string, work in tandem with phantomjs and/or youtube-dl to grab embedded video
content...

Good luck playing whack-a-mole against the crawlers. I admit I'm very curious:
what are you openly hosting online that you really don't want saved for
posterity?

------
amelius
I think we should write a legal license into our robots.txt files, as
retribution for all those lengthy EULAs these big companies make us read :)

------
madshiva
Yeah, just ignore robots.txt, because there are other solutions.

If a site doesn't want to be scanned, it can adopt a lot of countermeasures;
robots.txt will not save it from abuse.

This reminds me of the old days when my website didn't work from the US,
because I simply faked that the site was down; there was no reason for anybody
to visit my site from the US (I know it's kind of stupid, but when all your
content is in French and you are a kid... :) )

------
rubatuga
One thing that should be considered is the right for an individual to be
forgotten.

~~~
TeMPOraL
In my (current) opinion, it's this law that should be forgotten. What's on the
public Internet is a matter of public interest. All I can see is this law
being used by bad people to hide their bad deeds, especially when those bad
deeds should be known.

~~~
Senderman
At the risk of going off-topic, I loved your usage of the word 'current'
before 'opinion', and I'm going to adopt it.

