
ROBOTS.TXT is a suicide note - dbaupp
http://www.archiveteam.org/index.php?title=Robots.txt
======
datalist
Nothing in their rambling even remotely supports their argument.

Yes, robots.txt is no magic bullet against ill-behaved crawlers (as ArchiveTeam
itself proves), but it was never supposed to be one.

You choose to ignore my specific wish not to be crawled by you? Fair enough,
I'll return the favour and simply block your user agent:

      ArchiveTeam ArchiveBot/[DATECODE] (wpull [VERSION]) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/[VERSION] Safari/537.36

In Apache, for example:

      # Remove the ^ anchor in case text gets prepended
      RewriteCond %{HTTP_USER_AGENT} ^ArchiveTeam
      RewriteRule .* - [F,L]

and, if possible, your IP ranges.
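
For the IP side, a minimal sketch in Apache 2.4 syntax; the ranges below are
documentation placeholders, not ArchiveTeam's actual addresses:

      <RequireAll>
        # Hypothetical example ranges; substitute whatever shows up in your logs
        Require all granted
        Require not ip 192.0.2.0/24 198.51.100.0/24
      </RequireAll>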

~~~
krapp
You're assuming ill-behaved crawlers wouldn't show you a completely mundane
user agent.

I know they often don't... and I wonder why they don't, but there's no reason
they should.

------
metabrew
Shoutout to the Last.fm robots.txt, which contains this:

        Disallow: /harming/humans
        Disallow: /ignoring/human/orders
        Disallow: /harm/to/self

~~~
theandrewbailey
link: [http://www.last.fm/robots.txt](http://www.last.fm/robots.txt)

------
xja
Currently returning:

Resource Limit Is Reached

The website is temporarily unable to service your request as it exceeded
resource limit. Please try again later.

The irony! Cache link:

[http://webcache.googleusercontent.com/search?ei=h0J3WMKKA4er...](http://webcache.googleusercontent.com/search?ei=h0J3WMKKA4erUaechNAI&q=cache%3Awww.archiveteam.org%2Findex.php%3Ftitle%3DRobots.txt&oq=cache%3Awww.archiveteam.org%2Findex.php%3Ftitle%3DRobots.txt&gs_l=mobile-gws-serp.3...49570.58026.0.59184.25.21.4.0.0.0.493.8016.3-16j5.21.0....0...1c.1.64.mobile-gws-serp..16.0.0.bEZD95Bj88Q)

~~~
Artemis2
On the page: "While the onslaught of some social media hoo-hah will demolish
some servers in the modern era, normal single or multi-thread use of a site
will not cause trouble, unless there's a server misconfiguration"

------
cooper12
Absolutely agreed. I work on Wikipedia articles on a specific subject where we
rely on a select few web resources. Many of them have long since closed down
and we use the archived versions instead. It's downright tragic when we've
lost complete websites, perfectly usable sources for hundreds of articles with
quality content, all because some domain parker bought up the URL and added a
robots.txt that retroactively disabled the existing archive for the site. We
need to archive _everything_, regardless of what the site owner thinks
crawlers shouldn't see. Years down the road we might actually wish we had
archives of things that many found uninteresting or not meant to be archived
(for example, to see how sitemaps or RSS feeds were set up).

~~~
Walf
Isn't this a problem with the archiving logic? Why on Earth would one apply
robots.txt rules retroactively?

~~~
detaro
They need _some_ way for people to easily hide archived content, since they
have no clear right to publish their copy. You can send them an e-mail
instead, but e-mails take comparatively more resources to process; robots.txt
can be handled automatically, and at least sort of indicates that you are
actually in a position to demand the removal.

~~~
CM30
Wouldn't the best approach then be to make this system a bit more fine-grained?

So a block in robots.txt would only apply to content added after the
robots.txt is made available, and adding a certain few extra lines to the file
would explicitly hide older content on the domain in the archive.

That way, everyone wins. Sites that really want to remove content for some
insane reason can do so, older sites usually aren't lost if the domain
expires, and holding-page owners/cybersquatters don't accidentally cause older
content to be hidden (since hey, they don't really want to block it, just stop
their holding page from being archived).
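
Sketched in robots.txt syntax, where the Retroactive-disallow line is a purely
hypothetical directive that exists in no standard:

      User-agent: ia_archiver
      # Blocks crawling from this point forward
      Disallow: /
      # Hypothetical: would also hide snapshots made before this file appeared
      Retroactive-disallow: /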

------
shakna
This sounds like a childish rant about why Archive Team don't want to follow
robots.txt, which, incidentally, many, many other crawlers also don't follow.

I think the crux of the matter is found here:

> If you don't want people to have your data, don't put it online.

As much as I agree with this in principle, because of the way web requests
work, I don't want to be associated with this group.

You cannot ignore copyright, and robots.txt is exactly what I would use if I
didn't want something archived by an organisation I have nothing to do with.

~~~
voltagex_
> You cannot ignore copyright, and robots.txt is exactly what I would use if I
> didn't want something archived by an organisation I have nothing to do with.

AFAICT this page is a reaction to an _archive.org_ policy of respecting
robots.txt retroactively - e.g. oldwebsite.com runs from 1999 to 2009, the
domain expires in 2010 and gets bought in 2011, and the new owners add a
robots.txt disallowing IA. Ten years of archive.org copies are now
inaccessible.

~~~
shakna
True, but it is archive.org protecting their own work from being shut down for
breaching copyright.

One group has respect for authorship, and one does not.

It may not be the most palatable solution, but it is hardly cause for a
tantrum, or for declaring an intent to ignore well-established rights.

~~~
pjc50
Should material be lost forever out of "respect for authorship"?

~~~
krapp
Arguably, if I own the material, I should have the right to deny it to the
historical record if I choose.

Then again, Kafka wanted his unpublished works burned after his death - and
the world is arguably a better place for having ignored his wishes.

------
lcw
Of course they would think it's dumb, because some robots.txt rules run
counter to their objective, which is to save the internet in all its glory.
They shouldn't follow robots.txt, I agree. At the same time, robots.txt is not
worthless.

SEO is where robots.txt shines right now. It's not that people are trying to
hide something; it's that we don't want it to conflict with the content we
actually want to promote.

~~~
Walf
Bingo. Everything I remember reading about robots.txt strongly emphasised that
it's not a way to hide content – in case one's reasoning skills are lacking –
and that its main use is to prevent irrelevant or infinite content showing up
in any indices.
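
For instance, a sketch with made-up paths, keeping an internal search page and
an infinite calendar out of indices:

      User-agent: *
      Disallow: /search
      Disallow: /calendar/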

------
popobobo
This is something I simply cannot agree with. People use ROBOTS.TXT for all
kinds of reasons, such as blocking unwanted, careless web crawlers or managing
how a single-page application gets indexed. I mean, come on. How can you say
something like that just because eliminating ROBOTS.TXT would potentially
benefit your business?

~~~
saurik
> ...such as blocking unwanted careless webcrawler...

This file doesn't "block" anything, it simply asks the robot to do something,
which implies that it is probably being abnormally careful: truly annoying,
unwanted, careless robots might follow these guidelines, but that seems like a
stretch. In reality, this file exists so that _extra careful_ robots are able
to get feedback from websites that have extremely narrow bandwidth
availability or extremely high generation cost... concepts which this article
argues, pretty compellingly, just don't make sense. In practice,
this file then makes the owners of websites sometimes think "I can build
something weirdly broken (such as a procedurally generated content tarpit, or
mapping anonymous GET requests to database insertions) and just rely on this
file to explain what I did along with enforcing rate limits and boundaries"...
and then an "unwanted careless webcrawler" comes along and causes them serious
issues. It is akin to having your entire webserver crash if someone sends you
a non-ASCII character in a form field, but thinking "this will work out: I
have a little flag in my HTML file that makes it clear I only accept ASCII".
If you absolutely feel like you need to block something, then _actually block
it_: any robot gracious enough to pay attention to this file is also going to
send a useful user agent, and you can use that to return a legitimate 403.

------
greenspot
Strange advice. I'm not sure I understand what they mean.

One important use case for excluding sections of your website is to avoid
polluting the sitemap which Google crawls, or to be more precise, the daily
crawl volume Google allocates to your site. If you let every page be crawled,
more important pages get crawled less often. Example: in the past, you created
a content category which didn't turn out successful. Before you remove this
category and its many links, which would result in crawl errors, it would be
smarter to exclude it in the ROBOTS file and focus on your core categories.
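
A minimal sketch, assuming a hypothetical /old-category/ path:

      User-agent: *
      Disallow: /old-category/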

------
bluesign
I am sorry, but I can't agree with "if you don't want your content to be
archived, don't put it online."

This is very similar to taking photos or videos of people on the street
without their consent, then archiving and publishing them. Even more, it's
like taking a photo of someone and publishing it when they are wearing a
t-shirt saying "please don't take photos of me".

Sorry, but if you will use my server resources, you will be bound by my
rules.

~~~
jhbadger
Are you familiar with
[https://en.wikipedia.org/wiki/Streisand_effect](https://en.wikipedia.org/wiki/Streisand_effect)?

If anything, a robots.txt would _encourage_ archiving because people would be
annoyed with the asocial attitude.

~~~
bluesign
I understand, but don't you think people should have the option to opt out of
Archive Team archiving their stuff, or Google indexing their site?

------
HeadlessChild
I only use robots.txt for pages that already issue a 403. Something like this:

      User-agent: *
      Disallow: /secret/
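
The server-side half of that might look like this in Apache 2.4, assuming the
same /secret/ path (Require all denied is what produces the 403):

      <Location "/secret/">
        Require all denied
      </Location>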

------
3825
The 508 status code that I currently see on the page is interesting and worth
preserving. I think it validates their stance.

Archived at

[https://archive.fo/http://www.archiveteam.org/index.php?titl...](https://archive.fo/http://www.archiveteam.org/index.php?title=Robots.txt)

------
zamber
Once upon a time I had to manually disable robots.txt parsing in one crawler
just to stress test a staging machine for a friend.

The lesson from this is that robots.txt only works as long as everyone follows
a line drawn in the sand.
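
For reference, the check being disabled looks roughly like this with Python's
standard library; the URL and user agent below are placeholders:

      from urllib import robotparser

      rp = robotparser.RobotFileParser()
      rp.set_url("https://example.com/robots.txt")  # placeholder site
      rp.read()

      # A polite crawler runs this check before every fetch; the stress test
      # simply skipped it.
      if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
          pass  # fetch the page here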

