
Why robots.txt and favicon.ico are bad ideas and should be eliminated. - Corrado
http://bitworking.org/news/No_Fishing
======
gojomo
Even if Gregorio is right about the proliferation of fixed URI-pattern
conventions, he's way, way wrong about _robots.txt_.

If there were a similarly simple and effective solution, either in 1994 or now,
he would have suggested it. But he didn't. We have to guess that he'd want
something involving a declared robots-rules 'link' elsewhere.

But layering robots-rules as a 'link' from the headers of the root page, or
markup tags (if the root page happens to be HTML), _still_ requires an initial
investigative hit to a site -- and if at the '/' page, probably involves more
bandwidth than a compact robots.txt or 404.

Assuming a hostname is a unified 'site' is not perfect -- but in 1994 and now,
that's the only inherent and well-defined unit of administrative control
provided by the HTTP protocol.

By going with this easy-to-understand, easy-to-implement convention,
webmasters have had enough freedom to opt out, and crawlers enough freedom to
collect, to enable ~15 years of dazzling growth in powerful search
applications. And robots.txt files built to the 1994 standard still work,
handling 99.999% of what webmasters need to communicate to crawlers.
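
For reference, a 1994-style robots.txt is just a plain-text file at a fixed
path; the paths below are placeholder examples:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
```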

That success satisfies my design aesthetics.

~~~
__david__
> But layering robots-rules as a 'link' from the headers of the root page, or
> markup tags (if the root page happens to be HTML), still requires an initial
> investigative hit to a site -- and if at the '/' page, probably involves
> more bandwidth than a compact robots.txt or 404.

Not necessarily. If the location were specified in the http headers then a
robot could use the HEAD command first. That's relatively minor bandwidth cost
and still lets the robot get the site rules without first getting any content.
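
A rough sketch of that idea in Python -- note the "X-Robots-Rules" header
name is purely hypothetical; no such header was ever standardized:

```python
from urllib.parse import urljoin

def robots_rules_url(head_headers, base_url):
    """Given the headers from a HEAD request to a site, return the URL
    of the site's declared robots rules, or None if none is declared.
    The "X-Robots-Rules" header is a hypothetical convention used here
    only for illustration."""
    declared = head_headers.get("X-Robots-Rules")
    return urljoin(base_url, declared) if declared else None
```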

~~~
gojomo
Using HEAD doesn't make a link-in-headers approach use less bandwidth than the
classic /robots.txt convention. In the very best case -- no robots-rules -- a
HEAD is equivalent in bandwidth to a /robots.txt 404 with no content-body.
But if there are robots-rules, the HEAD-then-robots approach means two hits
and two sets of headers instead of one.

(It's possible to strain and devise a convention that uses less bandwidth;
that's why I said 'probably'. But that would require even more complexity --
such as the server being smart enough to send each robot only the rules that
apply to it.)

Configuring a server to emit special headers is also harder for most
webmasters than dropping a text file into a conventional location. And a HEAD-
then-robots process is more complicated for robots-writers.

For the purpose of warning off robots, the /robots.txt placement convention
was a _very, very good solution_ on many axes, including minimizing traffic
and adoption costs.

Only a peculiar early-optimization based on a certain aesthetic sense -- and
being concerned about being a bad example for _other_ similar applications
that aren't a pressing issue even now, 15 years later -- can justify
Gregorio's opinion that /robots.txt "was a not-so-good idea when the robot
exclusion protocol was rolled out".

~~~
__david__
Yeah, it's more bandwidth than the current "just ask for /robots.txt" method,
but only epsilon more. Certainly orders of magnitude less than putting the
robots link in the HTML itself, since most index pages are multiple orders of
magnitude larger than the HTTP headers. The extra HEAD's bandwidth is so
minimal it would be lost in the noise.

And you're right, the current scheme is easy. But it doesn't cover all the
cases and makes certain types of sites impossible to do correctly (see
<http://news.ycombinator.com/item?id=639396> for an example). Putting a link
to the correct robots.txt for a URL in the HTTP headers would be way more
flexible and only slightly raise the bandwidth cost.

You could make it backwards compatible by assuming the current /robots.txt
path if there is no header -- so it would cost nothing to current site
operators. Spider writers would have to do one extra HEAD, but I refuse to
believe that's terribly difficult.
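
A sketch of that backwards-compatible lookup, again treating the
"X-Robots-Rules" header name as a stand-in for whatever would actually be
standardized:

```python
from urllib.parse import urljoin

def effective_robots_url(base_url, head_headers):
    """Use the robots location declared in the HEAD response headers if
    present; otherwise fall back to the classic /robots.txt convention,
    so existing sites need no changes at all."""
    declared = head_headers.get("X-Robots-Rules")
    if declared:
        return urljoin(base_url, declared)
    return urljoin(base_url, "/robots.txt")
```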

------
michaelfairley
The largest problem with the only-use-the-meta-tags solution is that the page
to be excluded has to be retrieved before it can be excluded.

~~~
duskwuff
Yes, this. The whole purpose of robots.txt is to allow restrictions to be
applied on objects that the crawler hasn't looked at yet. A META tag can only
apply restrictions after the fact.

Moreover, his proposed replacement is incredibly browser-centric. It requires
any compliant crawler to contain an HTML parser. And woe befall anyone who
typos the META tag, or manages to confuse the HTML parser before it gets
there! The robots.txt specification, by contrast, requires no such heavy
lifting: just request a single fixed URL.

META tags make perfect sense for favicon and the like - I won't dispute that.
Robots exclusion, however, is a special case - it belongs outside HTML, not
inside it.

------
msie
It's been almost six years since that entry was published. I don't
perceive any big problems with robots.txt and favicon.ico files. Am I
mistaken? Bandwidth is better and cheaper now so that's not an issue. Are
other solutions just much more complex to implement? I hate over-engineering.

~~~
eli
It's not exactly the biggest crime on the modern web, but you have to admit
that a hardcoded, root-level URI is pretty inelegant.

~~~
javert
Pretty inelegant. Exactly. Those two words sum it up. Not sure what the point
of the whole rest of the article was.

------
spicyj
If not for a standardized filename (which we already have), how would the
programs find the filename?

Implementing a meta tag on the index page could work, but why change the
already-working system that we have?

~~~
eli
He never said we should abandon the already-working systems. He said we
shouldn't continue to create new services in this manner: " _Let's not
continue to make the same mistakes over and over again._ "

------
pronoiac
After he wrote this article in 2003, he posted a favicon in 2004. Also,
possibly due to the lack of a robots.txt, his site isn't in archive.org at
all.

~~~
jcgregorio
What are you talking about?

<http://web.archive.org/web/*/http://bitworking.org>

~~~
pronoiac
Gah! Instead of copy-and-pasting at archive.org, I typed bitworker instead of
bitworking. My bad.

------
gojomo
A note on my flag: there's nothing wrong with the topicality of this
submission, but the headline words "shouldn't be emulated" on the original
have been changed to "should be eliminated" here. That misrepresents the
original author's intent in a controversy-stirring manner.

~~~
Corrado
Yup, that's my fault. I didn't do it intentionally and blame it on submitting
at midnight. :/

------
tptacek
What was the upside to "favicon.ico" over some kind of meta tag in the
document header? Also: the notion of standardizing on a 16x16 square is
intellectually offensive.

~~~
DougBTX
I forget: is using a real ico file for favicon.ico well supported? If so,
there is no 16x16 limit.

~~~
ars
Not just "well supported" - required. And you can, and I always do, put in
multiple sizes. I usually have 16, 32, 48, and 64. And for the 16 I also add a
256 color version.

~~~
derefr
Not required. You can serve a favicon.gif or favicon.png in your <link> tag
just as easily, and it will display in all the browsers I have at hand to test
(though that set doesn't include IE, so YMMV.)
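
For reference, the <link> form described here looks roughly like this; the
filename and path are arbitrary examples:

```html
<!-- A PNG icon declared explicitly; no fixed /favicon.ico path needed -->
<link rel="icon" type="image/png" href="/images/icon.png">
```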

~~~
ars
I know. But he asked about favicon.ico.

