

Robots.txt is a suicide note - panza
http://www.archiveteam.org/index.php?title=Robots.txt
"Archive Team interprets ROBOTS.TXT as damage and temporary madness, and works around it. Everyone should. If you don't want people to have your data, don't put it online."
======
forgotusername
This is composed of equal parts insight and daftness, though not entirely for
the right reasons.

The daftness: maybe the claim is true that robots.txt was only a stop-gap
measure back when web servers sucked; however, the _de facto_ modern use for
it goes far beyond that, and ignoring the standard is likely to piss off a lot
of people.

The insight: for crawlers, relying on robots.txt to prevent getting stuck
indexing infinite hierarchies of data is a bad idea. A crawler should be able
to figure that much out for itself, so it doesn't explode when faced with
sites that don't exclude such hierarchies via robots.txt.
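
To illustrate, here is a minimal sketch of that kind of self-defence,
independent of robots.txt (all limits and names are illustrative assumptions,
not any real crawler's):

    # Trap avoidance a spider should do for itself, without relying
    # on robots.txt. The limits are made-up numbers.
    from urllib.parse import urlparse

    MAX_DEPTH = 12              # give up on absurdly deep hierarchies
    MAX_PAGES_PER_HOST = 50000  # cap total fetches per host

    seen = set()
    pages_per_host = {}

    def should_fetch(url, depth):
        if depth > MAX_DEPTH or url in seen:
            return False
        host = urlparse(url).netloc
        if pages_per_host.get(host, 0) >= MAX_PAGES_PER_HOST:
            return False  # probably an infinite hallway
        seen.add(url)
        pages_per_host[host] = pages_per_host.get(host, 0) + 1
        return True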

For servers, relying on a client hint to ensure reliability is daft. A server
should have some form of rate limiting built in, as that's the only sensible
design. Hinting at crawl rates seems the only marginally sensible use of
robots.txt from a server standpoint. Using it for any form of security (e.g.
preventing DB scraping) is daft, and a more robust mechanism should be
employed there too.
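
As a sketch of what "built in" rate limiting could mean (a hedged example; the
rates are made-up assumptions, not a recommendation):

    # Per-client token-bucket rate limiting, the kind of protection a
    # server should have regardless of what robots.txt says.
    import time

    RATE = 5.0    # tokens refilled per second (illustrative)
    BURST = 20.0  # bucket capacity (illustrative)

    buckets = {}  # client IP -> (tokens, last refill timestamp)

    def allow_request(ip):
        tokens, last = buckets.get(ip, (BURST, time.monotonic()))
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        if tokens < 1.0:
            buckets[ip] = (tokens, now)
            return False  # e.g. respond 429 Too Many Requests
        buckets[ip] = (tokens - 1.0, now)
        return True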

~~~
chronomex
Archiveteam member here. Robots.txt is useful for a site to keep search-engine
spiders from blindly wandering down an infinite hallway.

Archiveteam projects are different because we run closely-monitored, highly
targeted crawls. If a site has an infinite hallway in it, we'll notice that
and exclude it while taking steps to replicate enough of it to retain its
valuable attributes. We also do our best to avoid retrieving content more
times than necessary. If you operate a database, we try to get enough pages to
replicate a good portion of the underlying data.

~~~
derpy
You do try to contact the "targets" sometimes though?

~~~
sbierwagen
Sure. We ask them to give us database dumps. When they refuse, for some crazy
reason, we just download their website.

~~~
dspillett
I can't say I approve of the "if the crazy fools won't give us what we want,
we'll just take it some other way" attitude.

As the "some crazy reason" might be that the database contains data that is
not publicly addressable for a reason, and sanitising the data-set could be a
non-zero-effort operation, letting you crawl the public pages and do some work
on your side would be the sensible route from most points of view. Heck, even
if there is definitely nothing in the DB beyond what is already displayed to
the public, letting you go ahead and get the information the hard way is
likely to be preferable to lifting a finger to send you a DB dump.

Sort of like a cat's attitude to playing fetch: You want it? You go get it.
I've got more important things to do.

------
gojomo
The great things about 'robots.txt' are (1) it's the simplest thing that could
possibly work; and (2) the default assumption in the absence of webmaster
effort is 'allow'.

(2) is immensely valuable. Without it, search engines and the largest archive
of web content, the Internet Archive (where I work on web archiving), could
not exist at their current scales, as a practical matter.

There's a place for ArchiveTeam's style of in-your-face, adversarial
archiving... but if it were the dominant approach, the backlash from
publishers and the law could result in prevailing conventions that are much
worse than robots.txt, such as a default-deny/always-ask-permission-first
regime. Search and archiving activities would have to be surreptitious, or
limited to those with much deeper pockets for obscuring their actions,
requesting/buying permission, or legal defenses.

So, Jason, be careful what you wish for.

------
sgentle
I disagree with this post almost as strongly as I agree with it.

Robots.txt _is_ a suicide note. It's utter short-sighted hubris to say "this
is MY information and I don't want you spidering it". Are you volunteering to
maintain that information forever? Are you promising to never go out of
business? Never be ordered to remove it by the government? Never be bought out
by Oracle?

Right now there seems to be a lot of confusion over the morality of
information. People are possessed by the strange idea that you, mister content
provider, own that content and have an inalienable right to control it any way
you can get away with. But someday you will die, and your company will die,
just like Geocities, Google Video, and the Library of Alexandria. Society
should have a right to keep that information after you're gone.

Of course, the law disagrees. And without the efforts of criminals like
geohot, the iPhone DevTeam, The Nomad, Muslix64 and, yes, The Archive Team,
people of the future will have no way to access the information we've locked
up through our own paranoia. You don't have to cast your mind a thousand years
into the future - it's happening right now. Vast swathes of data are
disappearing as DRM servers go dark only a few years after they appear
(thanks, MSN Music, Yahoo Music Store).

I believe that we owe it to our descendants to give them access to their
history. I believe it's not our decision whether the things we make are too
valuable or too uncomfortable to be preserved. And I believe that robots.txt
is a suicide note, a product of the diseased minds that think our short-term
desire for control outweighs our legacy.

But I don't know what the fuck the article's talking about. It seems to be
making a bunch of points that don't matter. Use robots.txt to prevent
technical problems if you like, I don't care. Just don't use it to stop people
from crawling your content or you're shitting on the future.

~~~
jamie_ca
While I'm sure the "content provider" would disagree, in many cases it's
perfectly legal to make a copy of their content - the laws in place only
prevent redistribution for a certain time period.

Personally, I'd be ecstatic if there were some organization set up to
manufacture high-quality archive copies of books, music, film and the like,
and store them with the date of initial publication, as well as the date they
will enter the public domain (where it is known - not the case for living
authors) in various locales. Then have a site live-tracking the release of
that content.

There's a site that keeps getting linked every year that posts things that
would be public domain this year were it not for US copyright extensions, and
something like that with accurate PD status and archive copies would be
marvelous.

------
smosher
The rationale is weak. Some data is simply not worth indexing, and not worth
serving up to bots. The flipside is: your crawler doesn't need to fetch
everything on my site, and I'd be happy to ban all non-conforming bots site-
wide.

It's not _just_ about the functionality, but also a show of good faith and
basic respect. If you're a bot author who knowingly violates my site policy
I'd rather you didn't communicate with my web server at all.

robots.txt isn't perfect. Ideally a web server would be configured to deny
bots access to restricted content via some sort of dnsbl mechanism (or a
CPAN/whatever module). Or do both, and ban non-conforming bots site-wide.

The above notwithstanding, I'm voting for this article. It doesn't betray the
usual cowardice by hiding the assertion behind the presumptuous _Why_.

~~~
panza
"Some data is simply not worth indexing, and not worth serving up to bots."

The broader points being made in the article are that the value of information
is determined by the visitor, and that the burden of keeping the site up and
running should fall on the host.

By that reasoning, the presence of a ROBOTS.TXT signifies "damage" or
"temporary madness", and should therefore be ignored.

Also, keep in mind where this article came from - the Archive Team are bloody-
minded about preserving information.

~~~
drivingmenuts
OK, wait.

If the data has value to the user, shouldn't the user be paying the host for
the cost of making that data available (or perhaps considerably more, if it
has a lot of value to the user)?

~~~
sbierwagen
Excellent. How much are you paying ycombinator for your account?

If the answer is "nothing", then I guess you've just told me that all your
comments are useless, and I should ignore them.

~~~
sbierwagen
I'm committing the classic mistake of talking about the karma system here, but
what, precisely, is wrong with my statement?

Drivingmenuts says the user should pay for hosting data. HN is hosting his
data, but he doesn't pay for it. Is there some error in my logic, here, some
flaw in my conclusion?

------
j_baker
This may be a dumb move from a legal perspective. Court cases have suggested
that robots.txt files may count as technological measures in DMCA cases[1].
Granted, that's far from guaranteed. But I certainly wouldn't want to be the
one to go to court over it.

[1] http://www.groklaw.net/article.php?story=20070819090725314&#...

~~~
smosher
While I think the AT stance amounts to arrogance more than anything, enforcing
compliance with robots.txt through the law is absurd and unjust. To the best
of my knowledge it's not part of the HTTP standards or codified in any law.

~~~
elehack
Think of the potential alternatives. One is that everything can be crawled,
with no way for the site owner to say "no, please don't crawl this."

The other major one is to legally bar all crawling without express permission.

The current de facto world - crawling is OK unless robots.txt says otherwise -
is pretty nice. If we want that to be a legal defense in court ("You didn't
put up a robots.txt, so my indexing was legal, so you can't sue me."), which
seems useful, then the necessary flipside of that is that violating robots.txt
exposes the crawler to liability. That's a tradeoff I'm perfectly willing to
accept to allow the web, and necessary services such as indexers and crawlers,
to work while still allowing publishers to have some reasonable control over
their content distribution.

I seem to remember one of the writers at Search Engine Land presenting a nice
description of the robots.txt request in contract negotiation terms. Something
like this:

 _Archiver:_ "Are there any limits on what I can archive or index from your
site?" (translation: GET /robots.txt)

 _Site:_ "Nope." (translation: 404 Not Found)

or

 _Site:_ "Yep, here they are." (translation: 200 OK followed by restrictions
in robots.txt format)

So, by asking for /robots.txt, the crawler can be construed as asking
permission to index, and the response setting up the terms of indexing. That
seems like a _really_ useful defense and sane compromise in this age of
"indexing so you can drive search users to our content is copyright
infringement."

[EDIT: fix formatting]

~~~
_delirium
I guess the first alternative seems like the more sane one to me, as far as
legality goes. Website-crawling policies don't seem like the kind of thing
that rises to the level where it's worth involving courts and laws, so I'd
leave it to technological mechanisms plus voluntary compliance with non-
technological mechanisms. But I suppose I have a pretty high bar for what
problems are severe enough to require a government solution.

And pragmatically, the vast majority of the non-robots.txt-respecting crawls I
see are coming from countries that won't enforce such laws anyway, so
enforcing them in western countries seems like a downside (more entanglement
between the internet and various countries' national laws) with little upside
(won't stop many crawls).

------
kunley
This is a childish argument based on the attitude "don't use robots.txt
because it interferes with what we do, and what we do is aw3s0m3 l337". That
attitude also prevails in the archiveteam's comments here. I doubt their
actions can be taken seriously.

I wonder how this made it to 88 votes here on HN.

------
adaml_623
The Archiveteam.org favicon is a hand making a rude gesture. I think that sums
up many people's opinion of this story.

It certainly is an indicator of how seriously you should take this
organisation.

~~~
Dylan16807
Am I missing something? It looks to me like it's making a 'stop', with the
hand held up and flat.

------
perlgeek
Their attitude can be summed up as "it's on the internet, it's ours to take".
Ok, oversimplified, but that's the essence, no?

So, dear archiveteam, please remember that when I put a server on the
internet, it's a voluntary and public service, and putting 'Disallow:' lines
in the robots.txt means that I have set some rules. It's just rude to ignore
those rules, whatever your motivations are.

You have no right to access my content, just as you have no right to walk into
my house. If I invite you, please behave.

~~~
_delirium
The analogies here always end up all over the map, but the walking-into-a-
house one seems pretty off. The service is set up for anonymous public access,
run by an automated process designed to service requests, and faces a public
thoroughfare. If we have to make IRL analogies, that sounds closer to a
vending machine or other kiosk on the side of a road. In _that_ case, you
probably wouldn't have much faith in whether people will follow any
instructions you tape to the machine.

------
cullenking
Sorry, but this is terrible advice. Yes, you should make sure your site won't
break if it's slammed by a large crawl. Yes, you should hide destructive
actions behind POSTs, not GETs. But robots.txt is insanely useful. If I didn't
have a robots.txt file, google/bing/yahoo would index countless repetitive,
unimportant files and my site would suffer in search engine ranking. In our
case, we host GPX/KML files and textual cuesheets for driving and biking. If
that stuff is indexed, our site's relevant keywords are "left", "right" and
GPS timestamp fragments like "0z01".

So, use it wisely, but don't abandon its usage altogether.
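
For example, something like the following keeps the data files out of the
index while leaving the pages crawlable (paths hypothetical, not our actual
file):

    User-agent: *
    Disallow: /gpx/
    Disallow: /kml/
    Disallow: /cuesheets/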

------
yaix
robots.txt is simple and effective.

I do not want certain bots, especially so-called "archives" to automatically
download all my content. And that's what robots.txt is for and works well.

The article is just stupid, sorry. There is not one real, knowledgeable
argument in it.

~~~
chalst
_that's what robots.txt is for and works well_

Robots.txt doesn't stop anyone doing anything, it is simply a policy. Bots
either respect it or they do not. Archiveteam.org have indicated that they
wish to join the side of the spammers and incompetent spider authors.

~~~
user24
are you confusing archive.org and archiveteam?

~~~
chalst
Thanks, fixed. Actually I was not, that was a typo.

------
GoodIntentions
Wow. What arrogance.

A good reason to honeypot if you aren't already.

It's expected. It's polite. Respect the site owner's published policy or
expect to get IP-banned like any other script kiddie, because when a site
admin sees you ripping content, he isn't thinking "yay! archive team is here
to do a free backup!" - he thinks you are stealing his shit.

------
hammock
I love the tone of this article; I was smiling the whole time. Especially
here:

 _the onslaught of some social media hoo-hah_

edit: just clicked through a few pages - whoever does the writing at
Archiveteam is fantastic!

~~~
zzzo
Jason Scott, very distinctive voice. Maker of the BBS Documentary, etc.
Engaging public speaker too; lots of good examples on his wikipedia page:
<http://en.wikipedia.org/wiki/Jason_Scott_Sadofsky>

~~~
sbierwagen
Point of order: A bunch of us[1] maintain the wiki. Jason just does the big,
visible things, and is thus big and visible.

<http://ascii.textfiles.com/> is his personal blog, and is as profanity-ridden
as you would expect.

1: <http://archiveteam.org/index.php?title=IRC_Channel>

------
mmaunder
I wonder if honeypots that auto-block rogue crawlers have occurred to these
yoyos.

~~~
sbierwagen
We typically use wget running on consumer machines. This isn't like the spider
farm of a typical search engine, with hundreds of machines consuming
gigabytes/second of bandwidth. AT hits are specific and targeted.

------
yuhong
BTW, a robots.txt exclusion retroactively disables access to versions of
pages already archived in the Wayback Machine. I encountered this when
looking for old technotes on developer.apple.com.

~~~
slouch
The wayback machine is the only reason I use robots.txt:

    User-agent: ia_archiver
    Disallow: /

There is no good reason to let old (and possibly embarrassing) website designs
distract from the current version of your website.

~~~
gojomo
Few users prefer the archived versions; only when they need information that
has disappeared (or is temporarily unavailable) do they try the Wayback
Machine.

Many, many web designers are delighted to be able to see their old creations –
or even recover past work from the Archive when all other copies are lost to
organizational upheaval and system problems.

Finally, note that by blocking archiving, you're opting out of history. Those
who look back at the web of today won't see your contributions. Alternatives,
competitors, and other projects that are open to archiving will be seen and
vividly remembered. Being invisible to the future may be OK for your projects,
but many others prefer to be part of the shared memory.

(FYI: I work on web archiving at the Internet Archive.)

------
zbowling
I feel like creating a honeypot for bad bots now. Put an exclude line in
robots.txt, include that URL in my pages, and when a bot hits it anyway, ban
the IP.
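
A minimal sketch of that idea (the /trap/ path and everything else here is
hypothetical):

    # robots.txt would contain:
    #   User-agent: *
    #   Disallow: /trap/
    # Well-behaved bots never request /trap/; anything that does gets banned.
    from flask import Flask, abort, request

    app = Flask(__name__)
    banned = set()

    @app.before_request
    def reject_banned():
        if request.remote_addr in banned:
            abort(403)

    @app.route("/trap/")
    def trap():
        banned.add(request.remote_addr)  # it ignored robots.txt: ban it
        abort(403)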

~~~
catch23
How would you know if the robot is autonomous or not? They could just use a
typical browser user-agent and you wouldn't know...

~~~
tlrobinson
Hide the link from users?

~~~
pyre
I seem to remember there being a thread on HN about spam bots scraping for
email addresses; some of them are sophisticated enough that they can even
determine whether an element has been hidden through rules applied via an
external CSS file.

~~~
Aloisius
Link a 1-pixel image and stick it in the bottom left-hand corner? While
someone could click on it, it is extremely unlikely.

IIRC, technically you can have a 0-pixel image in some formats (gif), though I
imagine some browsers won't like that.

------
ssdsa
I don't really understand why they compare the Robots.txt file to a suicide
note. Could anyone explain this?

~~~
derpy
ArchiveTeam mostly flash-mob archives sites that are going to die an rm -rf /
death, or at least kill their users' content.

Like Geocities, Yahoo! Video, Google Video, Friendster.

------
tomjen3
No, it is about not being willing to waste bandwidth and server capacity on
unworthy projects (no one will ever search for my site through Baidu, but it
still gets indexed).

Google and archive.org are one thing; I will be happy to support them.

------
Super_Jambo
So a little while ago we had a story which was essentially: "promote your
startup / website by causing maximum outrage! Outrage is good! YAY PISSING
PEOPLE OFF!"

Now we get a non-story which is essentially designed to piss off the people of
HN. Looking forward to more of the same, given that it works.

------
jbk
Well, sure, robots.txt is not the best solution, but it works and helps a lot
when you've got msnbot or yandexbot making more than half of the requests to
your MediaWiki (differences between revisions), your gitweb (commitdiffs) or
your phpBB installation, killing performance...

Tired of having our machine killed by those bots, we use robots.txt.

Sure, there are other solutions (proper blocking), but this one works
perfectly fine and avoids having to modify 3rd party applications that we are
running for an open-source development team.

------
jbhelms
The only thing I have ever used robots.txt for is to stop leaking PageRank. I
have a folder called /redirect/ and I exclude that folder in my robots.txt. I
then link to sites like this: /redirect/?l=www.mysite.com
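
The relevant robots.txt lines would be just (a sketch of the setup described,
not the actual file):

    User-agent: *
    Disallow: /redirect/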

Anything I don't want archived, I put behind a login wall.

------
tszming
Some parameters are useful even though they are not part of the standard.

e.g. Crawl-delay, which prevents a DDoS from Yahoo! Slurp.
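
For example (Slurp honored this nonstandard directive; the delay value is
illustrative):

    User-agent: Slurp
    Crawl-delay: 10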

~~~
soult
Is Yahoo! Slurp still active? I thought they turned it off when they switched
to using Bing search results on Yahoo!

~~~
tszming
I don't think so. I still receive a lot of requests from Yahoo! Slurp / Yahoo!
Slurp China this month.

------
djmdjm
Dear "archiveteam", I pay by the MB. How do I opt out of your shit? KTHXBYE

~~~
sbierwagen
Don't accept user-generated content. Failing that, never go down, and never
announce that you're shutting down.

~~~
v21
Or, if you do, hand over the data in a nicely archivable form, and prevent the
need for crawlers.

------
eli
_shrug_

Things like /search?q= are in my robots.txt because crawling those pages is
just a waste of everyone's time and resources.
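
i.e. something like the following (illustrative; robots.txt rules are prefix
matches, so this also covers the query-string variants):

    User-agent: *
    Disallow: /search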

------
edu
Then you should block their crawler :)

