Hacker News
ROBOTS.TXT is a suicide note (archiveteam.org)
Strange advice. Not sure if I understood what they mean.

One important reason to exclude sections of your website is to avoid polluting what Google crawls, or to be more precise, to avoid wasting the daily crawl volume Google allocates to your site. If you let every page be crawled, your more important pages get crawled less. Example: in the past, you created a content category that didn't turn out to be successful. Before you remove this category and its many links, which would result in crawl errors, it would be smarter to exclude it in the ROBOTS file and focus on your core categories.
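
To make the exclusion idea concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The /old-category/ path, the example.com URLs and the ExampleBot agent name are hypothetical; it only shows which URLs a compliant crawler would skip, not how Google actually allocates crawl volume.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes an abandoned category.
ROBOTS_TXT = """\
User-agent: *
Disallow: /old-category/
"""


def crawlable(urls, robots_txt, agent="ExampleBot"):
    """Return the URLs a robots.txt-compliant crawler would still fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(agent, u)]


# Hypothetical crawl queue: two core pages, two pages in the dead category.
queue = [
    "https://example.com/",
    "https://example.com/core/article-1",
    "https://example.com/old-category/post-1",
    "https://example.com/old-category/post-2",
]

remaining = crawlable(queue, ROBOTS_TXT)
print(remaining)
```

With these assumed rules, only the two URLs outside /old-category/ stay in the queue, so a compliant crawler spends its requests on the core pages.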

Currently returning:

The irony! Cache link:

http://webcache.googleusercontent.com/search?ei=h0J3WMKKA4er...

On the page: "While the onslaught of some social media hoo-hah will demolish some servers in the modern era, normal single or multi-thread use of a site will not cause trouble, unless there's a server misconfiguration"

The 508 code that I currently see on the page is interesting and worth preserving. I think it validates their stance.

Archived at

https://archive.fo/http://www.archiveteam.org/index.php?titl...

This is something I simply cannot agree with. People use ROBOTS.TXT for all kinds of reasons, such as blocking unwanted, careless web crawlers or controlling the indexing of single-page applications. I mean, come on. How can you say something like that just because eliminating ROBOTS.TXT would potentially benefit your business?

This sounds like a childish rant about why Archive Team don't want to follow robots.txt, which, incidentally, many many crawlers also don't follow.

I think the crux of the matter is found here:

> If you don't want people to have your data, don't put it online.

As much as I agree in principle with this, because of the way web requests work, I don't want to be associated with this group.

You cannot ignore copyright, and robots.txt is exactly what I would use if I didn't want something archived by an organisation I have nothing to do with.

>You cannot ignore copyright, and robots.txt is exactly what I would use if I didn't want something archived by an organisation I have nothing to do with.

AFAICT this page is a reaction to an archive.org policy of respecting robots.txt retroactively - e.g. oldwebsite.com runs from 1999-2009, domain expires in 2010, gets bought in 2011 and the new owners add a robots.txt disallowing IA. The archive.org copies for 10 years are now inaccessible.
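
A rough Python sketch of that retroactive behaviour, as I understand it from the description above. The Snapshot type, the visible_snapshots helper, the page URLs and the ia_archiver agent name are all assumptions for illustration; this is not archive.org's actual code.

```python
from dataclasses import dataclass
from datetime import date
from urllib.robotparser import RobotFileParser


@dataclass
class Snapshot:
    url: str
    captured: date


def visible_snapshots(snapshots, current_robots_txt, agent="ia_archiver"):
    """Retroactive policy: today's robots.txt decides the visibility of
    every past capture, no matter when the capture was made."""
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    return [s for s in snapshots if parser.can_fetch(agent, s.url)]


# Hypothetical captures from the original owner's era.
snapshots = [
    Snapshot("http://oldwebsite.com/", date(1999, 6, 1)),
    Snapshot("http://oldwebsite.com/articles/1", date(2009, 3, 15)),
]

# Hypothetical robots.txt added by the new owner in 2011.
NEW_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

before = visible_snapshots(snapshots, "")              # no robots.txt rules
after = visible_snapshots(snapshots, NEW_ROBOTS_TXT)   # new owner's rules
print(len(before), len(after))
```

The capture dates never enter the decision, which is the whole problem: a file published in 2011 hides captures made in 1999.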

True, but it is archive.org protecting their own work from being shut down for breaching copyright.

One group has respect for authorship, and one does not.

It may not be the most palatable solution, but there is hardly a need for a tantrum, or for an intent to ignore well-established rights.

I only use robots.txt for pages that already return a 403. Something like this:

  User-agent: *
  Disallow: /secret/
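
For anyone curious how a polite crawler would read those two lines, a small sketch with Python's urllib.robotparser. The AnyBot agent name and the two paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the snippet above.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /secret/"])


def allowed(path, agent="AnyBot"):
    """True if a robots.txt-compliant crawler may fetch this path."""
    return parser.can_fetch(agent, path)


print(allowed("/secret/report.html"))  # False
print(allowed("/public/index.html"))   # True
```

The check is purely advisory: a crawler that never calls it is stopped only by the 403.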

Of course they would think it's dumb, because some robots.txt rules run counter to their objective, which is to save the internet in all its glory. I agree that they shouldn't use robots.txt. At the same time, robots.txt is not worthless.

SEO is where robots.txt shines right now. It's not that people are trying to hide something; it's that we don't want it to conflict with the content we actually want to promote.

Absolutely agreed. I work on Wikipedia articles on a specific subject where we rely on a select few web resources. Many of them have long since closed down and we use the archived versions instead. It's downright tragic when we've lost complete websites, perfectly usable sources for hundreds of articles with quality content, all because some domain parker bought up the URL and added a robots.txt that retroactively disabled the existing archive for the site. We need to archive everything, regardless of what the site owner thinks crawlers shouldn't see. Years down the road we might actually wish we had archives of things that many found uninteresting or not meant to be archived (for example, to see how sitemaps or RSS feeds were set up).

