
Google's robots.txt - jrstanley
http://www.google.com/robots.txt
======
0x0
Curious as to why someone sat down and added this line to that file:

    
    
      Allow: /maps?hq=http://maps.google.com/help/maps/directions/biking/mapleft.kml&ie=UTF8&ll=37.687624,-122.319717&spn=0.346132,0.727158&z=11&lci=bike&dirflg=b&f=d
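
For what it's worth, an Allow line like that usually carves an exception out of a broader Disallow rule. A quick sketch with Python's stdlib parser — the `Disallow: /maps` line and the shortened Allow path below are assumptions for illustration, not quoted from Google's actual file:

```python
from urllib import robotparser

# Hypothetical excerpt: a long, specific Allow line punches a hole
# in an assumed broader "Disallow: /maps" rule.
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /maps?hq=http://maps.google.com/help/maps/directions/biking/mapleft.kml
Disallow: /maps
""".splitlines())

# The whitelisted biking map URL is crawlable...
print(rp.can_fetch("*", "http://www.google.com/maps?hq=http://maps.google.com/help/maps/directions/biking/mapleft.kml&ie=UTF8"))  # prints True

# ...while other /maps URLs stay blocked.
print(rp.can_fetch("*", "http://www.google.com/maps?q=pizza"))  # prints False
```

(Note that `urllib.robotparser` applies rules in file order, so the Allow line has to come before the Disallow it overrides.)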

~~~
ZirconCode
[https://www.google.com/search?num=100&site=&source=hp&q=http...](https://www.google.com/search?num=100&site=&source=hp&q=https%3A%2F%2Fwww.google.com%2Fmaps%3Fhq%3Dhttp%3A%2F%2Fmaps.google.com%2Fhelp%2Fmaps%2Fdirections%2Fbiking%2Fmapleft.kml%26ie%3DUTF8%26ll%3D37.687624%2C-122.319717%26spn%3D0.346132%2C0.727158%26z%3D11%26lci%3Dbike%26dirflg%3Db%26f%3Dd&oq=https%3A%2F%2Fwww.google.com%2Fmaps%3Fhq%3Dhttp%3A%2F%2Fmaps.google.com%2Fhelp%2Fmaps%2Fdirections%2Fbiking%2Fmapleft.kml%26ie%3DUTF8%26ll%3D37.687624%2C-122.319717%26spn%3D0.346132%2C0.727158%26z%3D11%26lci%3Dbike%26dirflg%3Db%26f%3Dd&gs_l=hp.3...6256.6256.0.6770.1.1.0.0.0.0.0.0..0.0....0...1c.1.35.hp..1.0.0.rCg1jO0B9q4)

Turns up three results on Google. Very weird indeed.

------
ryanpetrich
See also: [http://www.google.com/humans.txt](http://www.google.com/humans.txt)

~~~
Houshalter
This doesn't work if you are using HTTPS Everywhere.

~~~
dfc
Weird, it works for me. Iceweasel Aurora / HTTPS Everywhere 4.0-dev

~~~
daGrevis
I second this. Latest Chromium.

------
logotype
Mine is more awesome:
[http://logotype.se/robots.txt](http://logotype.se/robots.txt)

~~~
russellbeattie
Heh, if only Yandex and Baidu respected robots.txt.

~~~
agwa
I've found this GitHub project to be an invaluable resource for blocking bad
bots: [https://github.com/bluedragonz/bad-bot-
blocker](https://github.com/bluedragonz/bad-bot-blocker)
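
The linked project ships Apache/nginx config, but the core idea is just matching the User-Agent header against a blocklist. A rough sketch of that idea in Python — the bot names here are examples, not the project's actual list:

```python
# Hypothetical blocklist; the real project maintains a much longer
# one as Apache .htaccess / nginx configuration.
BAD_BOTS = ["AhrefsBot", "MJ12bot", "SemrushBot"]

def is_bad_bot(user_agent):
    """Case-insensitive substring match against the blocklist."""
    ua = (user_agent or "").lower()
    return any(bot.lower() in ua for bot in BAD_BOTS)

print(is_bad_bot("Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://mj12bot.com/)"))  # prints True
print(is_bad_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))                      # prints False
```

In practice you'd do this at the web server or firewall layer so the request never reaches your application, which is exactly what the linked config does.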

~~~
anonymfus
_> Unless your website is written in Russian or Chinese, you probably don't
get any traffic from them. They mostly just waste bandwidth and consume
resources._

THIS is evil. You could use this argument for banning any new search engine.

~~~
ScottWhigham
What "new search engine" has actually generated revenue for any webmaster in
the past ten years? You could argue DDG, but that's the only one I can think
of.

~~~
joveian
I think DDG primarily uses Yandex. They at least often put a "Powered by
Yandex" logo along the side.

------
catmanjan
What a weird entry...

[https://www.google.com/maps?hq=http://maps.google.com/help/m...](https://www.google.com/maps?hq=http://maps.google.com/help/maps/directions/biking/mapleft.kml&ie=UTF8&ll=37.687624,-122.319717&spn=0.346132,0.727158&z=11&lci=bike&dirflg=b&f=d)

~~~
hayksaakian
is it a map focused on SF that highlights biking paths?

~~~
catmanjan
Looks like it? Very weird

------
gergles
[http://www.google.com/baraza/en](http://www.google.com/baraza/en)

What a weird little product. It's like Yahoo Answers, but somehow with even
less sorting or categorization.

~~~
lesiki
'baraza' is Swahili for forum/meeting place. It was very much Yahoo Answers,
targeted at the African market - here in Kenya, most people only have internet
access via mobile connections, often using feature phones, hence the
minimalistic stylesheet.

Baraza never really took off.

More about it here: [http://whiteafrican.com/2010/10/05/google-baraza-qa-for-
afri...](http://whiteafrican.com/2010/10/05/google-baraza-qa-for-africa/)

------
d99kris
Glassdoor uses its robots.txt for recruiting:
[http://www.glassdoor.com/robots.txt](http://www.glassdoor.com/robots.txt)

~~~
te_chris
Why would they disallow all their about content? (Bit of an SEO noob).

~~~
JimmyM
Oversimplifying: because they do not want to signal that their site is about
those pages, for whatever reason.

In practice, I'm not entirely sure, but it looks like the file is quite old:
those pages don't seem to exist anymore, even as folders, and they aren't
301-redirected to the current relevant pages.

In fact, they're all 404s. So perhaps they used to be pages, were deleted, and
kept being crawled, which made their site look bad (because of the 404s). They
could use 301s now, but I assume they didn't because they might want to
restructure the site in the future and re-use those URLs. They don't use 302s
because 302s are unreliable and freaky.

Does that sound right to everyone else?

~~~
mattmanser
These guys are the best white-hat SEO growth hackers on the planet, just copy
them and don't question it.

~~~
JimmyM
Oh, I wasn't questioning them - just asking if my assessment sounded right.

------
blossoms
Unrelated, but it looks like www.aol.com's robots.txt is served as text/html:
[http://www.aol.com/robots.txt](http://www.aol.com/robots.txt)

Is this a common mistake?

~~~
mkonecny
It just issues a "HTTP/1.1 302 Moved Temporarily" redirect to their homepage.
Requesting a nonexistent file such as "robots.txtsdfa32r523" has the same
effect, so they probably don't have a robots file at all.

~~~
judk
Huh? No, it is a regular robots.txt file

~~~
tonyedgecombe
It redirects requests from the UK.

------
yRetsyM
I remember when this used to be a source of product leaks.

~~~
zeckalpha
And now it is an archive of discontinued products.

------
plucas
Yelp's is fun:
[https://www.yelp.com/robots.txt](https://www.yelp.com/robots.txt)

------
wupiass
facebook's...

[https://www.facebook.com/robots.txt](https://www.facebook.com/robots.txt)

~~~
johnvschmitt
That really blows my mind. I mean, how can they claim that's any kind of
"agreement"?

If someone writes a curl/wget wrapper script & points it at the top 10
websites, they don't enter into any kind of written contract or agreement.

~~~
MichaelApproved
You're quoting "agreement" as if it's literally in their robots file. It's not.

They're telling the public that it does not have permission to crawl the site,
which they have the right to do. What is the problem with that?

~~~
yeukhon
What is the point of such a silly prohibition? It's silly because anyone can
crawl the site if they want. Facebook may block abusive crawling, but why
bother putting up such a sign when they know it's useless?

~~~
MichaelApproved
They have such a large network, anything they could do to prevent unwanted
crawling is probably helpful.

------
ankurpatel
Yelp has rules set for the robots:
[http://www.yelp.com/robots.txt](http://www.yelp.com/robots.txt)

------
Theodores
There are only 602 pages of Google.com indexed on Google.com, mostly 'plus'
profiles. Quite a few show this message:

A description for this result is not available because of this site's
robots.txt – learn more.

Which is odd.

~~~
eli
Google will still index pages blocked by robots.txt; it just won't crawl them
(so it can't get a description/preview snippet). It indexes them based on the
URL and how people link to them.
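
In other words, robots.txt controls crawling, not indexing. To keep a URL out of the index entirely, the page has to stay crawlable so the crawler can see a noindex directive. A sketch of the distinction (the path here is hypothetical):

```text
# robots.txt -- blocks crawling; the bare URL can still be indexed
User-agent: *
Disallow: /private/

# To de-index a page, leave it crawlable and serve instead:
#   <meta name="robots" content="noindex">
# or the X-Robots-Tag: noindex HTTP response header.
```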

------
Istof
404 for
[http://www.gstatic.com/trends/websites/sitemaps/sitemapindex...](http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml)

------
oevi
Wikipedia's robots.txt is quite verbose:
[http://en.wikipedia.org/robots.txt](http://en.wikipedia.org/robots.txt)

~~~
iso8859-1
Because it's part of the Wiki:
[https://en.wikipedia.org/w/index.php?title=MediaWiki:Robots....](https://en.wikipedia.org/w/index.php?title=MediaWiki:Robots.txt&action=history)

------
codr
Apple allows robots access to everything!
[http://www.apple.com/robots.txt](http://www.apple.com/robots.txt)

~~~
oneeyedpigeon
If you're big enough, doesn't this just make sense? Why waste time maintaining
a robots.txt policy when it must represent a tiny fraction of your traffic,
which your servers can surely handle? And the really 'bad' guys are going to
ignore it anyway. And if you really care, you'll have some much more
sophisticated bandwidth throttling in place.

For the smaller guys, sure it makes sense to have some kind of simple
robots.txt policy.

~~~
keule
That's not quite true. Apple has multiple subdomains, each for some part of
their site. www.apple.com hosts most of the marketing stuff, but have a look
at other subdomains, e.g.:
[http://store.apple.com/robots.txt](http://store.apple.com/robots.txt)

------
techaddict009
Check out youtube.com/robots.txt!

~~~
dabit
For the lazy:
[http://www.youtube.com/robots.txt](http://www.youtube.com/robots.txt)

~~~
Mindless2112
And for those who don't know the reference:
[http://www.youtube.com/watch?v=WGoi1MSGu64](http://www.youtube.com/watch?v=WGoi1MSGu64)

------
cordite
A lot of these are 404's. It kinda makes me wonder what was behind things like
/c/

------
0_o
taobao.com has the shortest robots.txt
[http://www.taobao.com/robots.txt](http://www.taobao.com/robots.txt)

