Why robots.txt and favicon.ico are bad ideas and should be eliminated. (bitworking.org)
11 points by Corrado on June 3, 2009 | 28 comments


Even if Gregorio is right about the proliferation of fixed URI-pattern conventions, he's way, way wrong about robots.txt.

If there were a similarly simple and effective solution, either in 1994 or now, he would have suggested it. But he didn't. We can only guess that he'd want something involving robots rules declared via a 'link' elsewhere.

But layering robots-rules as a 'link' from the headers of the root page, or markup tags (if the root page happens to be HTML), still requires an initial investigative hit to a site -- and if at the '/' page, probably involves more bandwidth than a compact robots.txt or 404.

Assuming a hostname is a unified 'site' is not perfect -- but in 1994 and now, that's the only inherent and well-defined unit of administrative control provided by the HTTP protocol.

By going with this easy-to-understand, easy-to-implement convention, webmasters have had enough freedom to opt out, and crawlers enough freedom to collect, to enable ~15 years of dazzling growth in powerful search applications. And robots.txt files built to the 1994 standard still work, handling 99.999% of what webmasters need to communicate to crawlers.

That success satisfies my design aesthetics.


> But layering robots-rules as a 'link' from the headers of the root page, or markup tags (if the root page happens to be HTML), still requires an initial investigative hit to a site -- and if at the '/' page, probably involves more bandwidth than a compact robots.txt or 404.

Not necessarily. If the location were specified in the HTTP headers, then a robot could use the HEAD method first. That's a relatively minor bandwidth cost and still lets the robot get the site rules without first getting any content.
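Roughly something like this -- just a sketch, purely illustrative, since a HEAD returns only the response headers and no body:

  # Sketch only: a HEAD request fetches headers without any content body,
  # so a robot could inspect them before deciding what to download.
  import urllib.request

  req = urllib.request.Request("http://example.com/", method="HEAD")
  with urllib.request.urlopen(req) as resp:
      print(resp.status, dict(resp.headers))  # headers only; the body is empty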


Using HEAD doesn't make a link-in-headers approach use less bandwidth than the classic /robots.txt convention. In the very best case -- no robots-rules -- a HEAD is equivalent in bandwidth to a /robots.txt 404 without a content-body. But if there are robots-rules, the HEAD-then-robots approach means two hits and double the headers instead of one hit.

(It's possible to strain and devise a convention that uses less bandwidth; that's why I said 'probably'. But that would require even more complexity -- such as the server being smart enough to send each robot only the rules that apply to it.)

Configuring a server to emit special headers is also harder for most webmasters than dropping a text file into a conventional location. And a HEAD-then-robots process is more complicated for robots-writers.

For the purpose of warning off robots, the /robots.txt placement convention was a very, very good solution on many axes, including minimizing traffic and adoption costs.

Only a peculiar premature optimization based on a certain aesthetic sense -- and a concern about setting a bad example for other similar applications, which aren't a pressing issue even now, 15 years later -- can justify Gregorio's opinion that /robots.txt "was a not-so-good idea when the robot exclusion protocol was rolled out".


Yeah, it's more bandwidth than the current "just ask for /robots.txt" method, but only epsilon more. Certainly orders of magnitude less than having the robots link in the HTML itself (since most index pages are multiple orders of magnitude larger than the HTTP headers). The extra HEAD's bandwidth is so minimal it would be lost in the noise.

And you're right, the current scheme is easy. But it doesn't cover all the cases and makes certain types of sites impossible to do correctly (see http://news.ycombinator.com/item?id=639396 for an example). Putting a link to the correct robots.txt for a URL in the HTTP headers would be way more flexible and only slightly raise the bandwidth cost.

You could make it backwards compatible by assuming the current /robots.txt path if there is no header, so it would cost nothing to current site operators. Spider writers would have to do one extra HEAD, but I refuse to believe that's terribly difficult.
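Concretely, the lookup could be as simple as this sketch (the "X-Robots-Location" header name is made up; nothing like it is standardized):

  # Backwards-compatible sketch: honor a hypothetical X-Robots-Location header
  # if the server sends one, otherwise assume the classic /robots.txt path so
  # existing sites need no changes at all.
  import urllib.parse
  import urllib.request

  def robots_rules_url(site_root):
      req = urllib.request.Request(site_root, method="HEAD")
      with urllib.request.urlopen(req) as resp:
          explicit = resp.headers.get("X-Robots-Location")  # hypothetical header
      return explicit or urllib.parse.urljoin(site_root, "/robots.txt")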


The largest problem with the only-use-the-meta-tags solution is that the page to be excluded has to be retrieved before it can be excluded.


Yes, this. The whole purpose of robots.txt is to allow restrictions to be applied to objects that the crawler hasn't looked at yet. A META tag can only apply restrictions after the fact.

Moreover, his proposed replacement is incredibly browser-centric. It requires any compliant crawler to contain an HTML parser. And woe befall anyone who typos the META tag, or manages to confuse the HTML parser before it gets there! The robots.txt specification, by contrast, requires no such heavy lifting: just request a single fixed URL.

META tags make perfect sense for favicon and the like - I won't dispute that. Robots exclusion, however, is a special case - it belongs outside HTML, not inside it.
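For comparison, consuming the classic convention really is trivial -- here's a sketch using Python's standard-library robots.txt parser, which never touches HTML (the crawler name and URLs are illustrative):

  # The classic convention needs only one GET of a fixed URL and a
  # line-oriented parser; Python ships one in the standard library.
  import urllib.robotparser

  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("http://example.com/robots.txt")
  rp.read()  # a single GET of a single fixed URL
  print(rp.can_fetch("MyCrawler", "http://example.com/private/page.html"))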


Just for clarification: 1. robots.txt excludes CRAWLING, i.e. downloading, but NOT indexing, i.e. including a URL / site in a database of known URLs / sites. 2. The robots meta tag disallows INDEXING but NOT crawling.

So it is semantically correct, although most modern search engines do not do this, to index a site / URL that is disallowed via robots.txt, using link data alone.
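In crawler terms the distinction looks roughly like this (a crude sketch; the crawler name and the regex-based meta check are illustrative only):

  # robots.txt gates FETCHING; a robots meta tag in the fetched page gates INDEXING.
  import re
  import urllib.parse
  import urllib.request
  import urllib.robotparser

  def crawl_and_maybe_index(url, user_agent="MyCrawler"):
      rp = urllib.robotparser.RobotFileParser()
      rp.set_url(urllib.parse.urljoin(url, "/robots.txt"))
      rp.read()
      if not rp.can_fetch(user_agent, url):
          return "not fetched (robots.txt); the bare URL may still be listed"
      html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
      if re.search(r'<meta[^>]+name=["\']robots["\'][^>]*noindex', html, re.I):
          return "fetched, but not indexed (meta robots noindex)"
      return "fetched and indexed"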


It's been almost six years since that entry was published. I don't perceive any big problems with robots.txt and favicon.ico files. Am I mistaken? Bandwidth is better and cheaper now, so that's not an issue. Are other solutions just much more complex to implement? I hate over-engineering.


It's not exactly the biggest crime on the modern web, but you have to admit that a hardcoded, root-level URI is pretty inelegant.


Pretty inelegant. Exactly. Those two words sum it up. Not sure what the point of the whole rest of the article was.


One case where they don't work well is when you have multiple users on one domain, like an old school server with the users at /~user1 and /~user2, etc. user2 might write a CGI program and want to create an entry in robots.txt, but he doesn't have access to the global robots.txt, just the stuff in his home directory. Ditto for the favicon.

Nowadays domain names are cheap and I don't see as many sites like that any more...


Yeah, in the past six years I've come to accept that it's the worst possible solution, except for all the others. At this point I would be happy if we could isolate all the new "well-known locations" to a specific sub-path, such as /.well-known/

  http://bitworking.org/news/431/wave-first-thoughts
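A tiny sketch of what that would mean for clients -- one reserved prefix instead of ever more root-level names (the resource name passed in below is illustrative only):

  # Compose a URL under a single reserved prefix instead of inventing
  # new root-level names.
  from urllib.parse import urljoin

  def well_known(site_root, name):
      return urljoin(site_root, "/.well-known/" + name)

  print(well_known("http://example.com/", "robots.txt"))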


If not for a standardized filename (which we already have), how would programs find the file?

Implementing a meta tag on the index page could work, but why change the already-working system that we have?


He never said we should abandon the already-working systems. He said we shouldn't continue to create new services in this manner: "Let's not continue to make the same mistakes over and over again."


After he wrote this article in 2003, he posted a favicon in 2004. Also, possibly due to the lack of a robots.txt, his site isn't in archive.org at all.



Gah! Instead of copy-and-pasting at archive.org, I typed bitworker instead of bitworking. My bad.


A note on my flag: there's nothing wrong with the topicality of this submission, but the headline words "shouldn't be emulated" on the original have been changed to "should be eliminated" here. That misrepresents the original author's intent in a controversy-stirring manner.


Yup, that's my fault. I didn't do it intentionally and blame it on submitting at midnight. :/


What was the upside to "favicon.ico" over some kind of meta tag in the document header? Also: the notion of standardizing on a 16x16 square is intellectually offensive.


I always assumed it was to publicise the capability -- webmasters would see the requests for it and find out what this strange favicon.ico file was. If Microsoft had stuck to the <link> approach, they would have had to rely on other methods for informing webmasters, and their advantage over Netscape would have been slightly less.


I forget: is using a real ico file for favicon.ico well supported? If so, there is no 16x16 limit.


Not just "well supported" - required. And you can, and I always do, put in multiple sizes. I usually have 16, 32, 48, and 64. And for the 16 I also add a 256 color version.
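If you want to script that, here's a sketch assuming the Pillow imaging library is available (the source filename is illustrative):

  # Render one source image into a single .ico containing several resolutions.
  from PIL import Image

  src = Image.open("logo.png")  # illustrative source file
  src.save("favicon.ico", format="ICO",
           sizes=[(16, 16), (32, 32), (48, 48), (64, 64)])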


Not required. You can serve a favicon.gif or favicon.png in your <link> tag just as easily, and it will display in all the browsers I have at hand to test (though that set doesn't include IE, so YMMV.)


I know. But he asked about favicon.ico.


Well I'm schooled now.


Not sure if you are being sarcastic; just in case not, you can use other file types (and file names) with:

  <link rel="icon" type="mime/type" href="....">
But if you are using the default favicon.ico it needs to be an icon, and you can include multiple sizes (which you can't do with PNG, etc).


Nope. Not being sarcastic. Can see why you'd think that.



