

Here is the robots.txt of Google - coliveira
http://google.com/robots.txt

======
westside1506
We actually have a surprising number of customers come to us at 80legs wanting
to crawl google search results. I don't think most of them are trying to
reverse engineer google or anything like that. Most probably just want a fast
way to find relevant topics to crawl.

They are disappointed when they learn we obey robots.txt, so we have them
manually do searches to pull out seed lists for their 80legs crawls. It's a
pain, but there's not really a way around it within the rules.

------
jacquesm
I always figured that the company that spiders everybody else's content should
have a more relaxed policy towards being spidered itself.

After all, Google is datamining the web on an ongoing basis; it should
willingly consent to being mined in return.

~~~
brk
Yes, but...

For the most part "Google" is a condensed version of the web. In theory, you
could "spider" a single site (google.com) and build your own search database
without having to go out and crawl the web at large.

It's an odd paradox, but I think one search site crawling another search site
is not a good idea. And there is probably an infinite spider loop hiding in
that process somewhere.

~~~
antipax
Not like anyone writing a spider _has_ to obey the robots.txt file.

~~~
eli
Sure, and their IP will be blocked by Google some time around the 10th
request.

~~~
jacquesm
Next time you're looking at a PC infected with malware, have a look at the
network traffic using a sniffer; chances are pretty good that you'll see
searches to Google for the weirdest of terms. Apparently this is to get around
the limitation you mention. I'm assuming the results of such searches are
'mailed home' through some kind of dead drop.

~~~
eli
Sure, but responding to the parent: that's why people don't crawl Google, not
the robots.txt.

And incidentally, some of those searches are looking for forms that they can
stuff links into. My sites are constantly getting hit by botnets searching for
Drupal comment forms. Luckily Drupal uses a quirky URL format that's easy for
mod_security to block.
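
As a sketch of the kind of rule eli might mean (the path pattern and rule id
here are assumptions, not his actual config), a ModSecurity v2 rule denying
requests to Drupal's comment-reply URLs could look like:

    # Hypothetical ModSecurity (v2) rule: deny any request whose path
    # starts with Drupal's comment-reply pattern, e.g. /comment/reply/123.
    SecRule REQUEST_URI "@beginsWith /comment/reply/" \
        "id:100001,phase:1,deny,status:403,log,msg:'Drupal comment form probe'"

Matching on the URL prefix in phase 1 rejects the probe before the request
body is even read, which keeps the cost of botnet traffic low.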

------
twoz
Does Yahoo! not have a robots.txt file?

<http://yahoo.com/robots.txt> _Sorry, the page you requested was not found._

~~~
jdrock
If that page doesn't exist, then according to the spec there are no
restrictions, and you can crawl any page.
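
That check is easy to automate with Python's standard `urllib.robotparser`,
which treats a missing robots.txt as "allow everything". The rules below are a
hypothetical excerpt for illustration, not any site's actual robots.txt:

```python
# Minimal sketch: evaluate robots.txt rules before crawling.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (Allow listed first, as Python's
# parser honors the first matching rule line).
rules = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("mybot", "http://example.com/search?q=test"))   # blocked
print(rp.can_fetch("mybot", "http://example.com/searchhistory/"))  # allowed
```

In real use you'd call `rp.set_url(...)` and `rp.read()` instead of
`parse()`; a 404 on the robots.txt URL makes `can_fetch` return True for
every path, matching the "no file, crawl anything" reading above.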

~~~
twoz
Update: Some Yahoo! subdomains have 'em.

__Yes__:

    http://search.yahoo.com/robots.txt
    http://groups.yahoo.com/robots.txt
    http://realestate.yahoo.com/robots.txt

__No__:

    http://maps.yahoo.com/robots.txt
    http://omg.yahoo.com/robots.txt

------
sp332
What is <http://google.com/unclesam> ?

Edit: must have just mistyped it, works fine.

~~~
alexandros
This is even weirder: <http://www.google.com/microsoft>

~~~
benprew
I would guess it's the complement of:

<http://www.google.com/linux>

~~~
josefresco
Maybe it's Google's way of saying "we're not even gonna go there"

------
whalesalad
Not that this was difficult or anything, but here's a list of all of the links
as links. Easier to investigate ;)

<http://dpaste.org/TqnU/>

