

Ask HN: Why is Facebook indexed in search engine, against robots.txt rules? - hiby007

Facebook's robots.txt: https://www.facebook.com/robots.txt

Facebook has several different rules for different search engines. But if you go to the end of the file, you will find this one rule:

    User-agent: *
    Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

site:facebook.com returns about 3,520,000,000 results on Google.

So my question is: why is facebook.com indexed by Google and other search engines?
======
eel
It's not against the robots.txt rules. Facebook's robots.txt file has
a section for Googlebot (as well as for several other bots):

    
    
      User-agent: Googlebot
      Disallow: /ac.php
      ...
    

A bot is supposed to follow its own User-agent section. If none is
found, it should follow the User-agent: * section. [1]

[1] [http://www.robotstxt.org/orig.html](http://www.robotstxt.org/orig.html)
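The fallback behaviour can be sketched in a few lines of Python (a toy illustration, not a spec-complete parser; the function and variable names are mine):

```python
# A minimal sketch of the record-selection rule from robotstxt.org:
# a bot obeys the record whose User-agent matches its own name, and
# falls back to the "*" record only if no other record matches.

def select_disallows(robots_txt, bot_name):
    """Return the Disallow paths of the record that applies to bot_name."""
    records = {}            # user-agent (lowercased) -> list of Disallow paths
    current_agents = []     # agents the current record applies to
    last_was_agent = False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            last_was_agent = False
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not last_was_agent:             # a new record starts here
                current_agents = []
            current_agents.append(value.lower())
            records.setdefault(value.lower(), [])
            last_was_agent = True
        elif field == "disallow":
            for agent in current_agents:
                records[agent].append(value)
            last_was_agent = False
    # Prefer the bot's own record; only fall back to the "*" default.
    return records.get(bot_name.lower(), records.get("*", []))

rules = """\
User-agent: Googlebot
Disallow: /ac.php

User-agent: *
Disallow: /
"""

print(select_disallows(rules, "Googlebot"))     # ['/ac.php']
print(select_disallows(rules, "SomeOtherBot"))  # ['/']
```

Googlebot matches its own record, so it never even sees the `Disallow: /` line.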

~~~
techaddict009
Any supporting link or reference for this?

~~~
eel
I edited the original comment, but yes, it is in this document:
[http://www.robotstxt.org/orig.html](http://www.robotstxt.org/orig.html)

The key paragraph is:

    
    
      If the value is '*', the record describes the default
      access policy for any robot that has not matched any of
      the other records. It is not allowed to have multiple
      such records in the "/robots.txt" file.

~~~
techaddict009
Thanks for sharing this !

------
Scryptonite
What if a different robots.txt is being served up for the real Googlebots?

EDIT:

Based on the comments in their robots.txt, it appears that they are
whitelisting certain robots. You would have to apply for your robot to crawl
their site at
[https://www.facebook.com/apps/site_scraping_tos.php](https://www.facebook.com/apps/site_scraping_tos.php)

They probably serve up a uniquely generated robots.txt to each whitelisted
robot. You'd never know what rules it contains unless you are a whitelisted
robot.
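If a site did do this, the server-side logic could look something like the following (purely a speculative sketch of the guess above; the whitelist and rule bodies are invented and say nothing about what Facebook actually does):

```python
# Speculative sketch: pick a robots.txt body based on the requesting
# User-Agent header. The whitelist and rules here are invented for
# illustration only.

WHITELISTED_BOTS = {"googlebot", "bingbot", "baiduspider"}

DENY_ALL = "User-agent: *\nDisallow: /\n"

def robots_txt_for(user_agent_header):
    """Return a robots.txt body tailored to the requesting bot."""
    ua = user_agent_header.lower()
    for bot in WHITELISTED_BOTS:
        if bot in ua:
            # Whitelisted crawler: real rules instead of a blanket deny.
            return "User-agent: %s\nDisallow: /ac.php\n\n%s" % (bot, DENY_ALL)
    return DENY_ALL

print(robots_txt_for("Mozilla/5.0 (compatible; Googlebot/2.1)").splitlines()[0])
# User-agent: googlebot
```

Keying on the User-Agent header alone is trivially spoofable, which is why IP verification comes up in the replies below.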

~~~
beneills
Or your user-agent string is the same as a whitelisted one, unless they are whitelisting by IP.

I tested this with:

    wget --user-agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://www.facebook.com/robots.txt

and diffed, but the results are the same.

~~~
phyalow
Is your IP range the same as Google's?

~~~
tazzy531
Google doesn't release their IP ranges. Requests can come from any IP.

------
lcedp
It's like an if/elsif/else statement.

`User-agent: *` applies to all the bots that didn't match any section above.

Here are the rules for Google:

    
    
        User-agent: Googlebot
        Disallow: /ac.php
        Disallow: /ae.php
        Disallow: /ajax/
        Disallow: /album.php
        Disallow: /ap.php
        Disallow: /autologin.php
        Disallow: /checkpoint/
        Disallow: /confirmemail.php
        Disallow: /contact_importer/
        Disallow: /feeds/
        Disallow: /file_download.php
        Disallow: /l.php
        Disallow: /o.php
        Disallow: /p.php
        Disallow: /photo.php
        Disallow: /photo_comments.php
        Disallow: /photo_search.php
        Disallow: /photos.php
        Disallow: /sharer/
    

\- meaning anything else is allowed.
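This fall-through behaviour can be checked with Python's standard-library robots.txt parser (the Disallow paths are a subset of the rules quoted above; the example URLs are mine):

```python
# Verify the if/elsif/else behaviour with the stdlib robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /ac.php
Disallow: /ajax/
Disallow: /photo.php

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot: only the listed paths are blocked; anything else is allowed.
print(rp.can_fetch("Googlebot", "https://www.facebook.com/ac.php"))       # False
print(rp.can_fetch("Googlebot", "https://www.facebook.com/somepage"))     # True

# Any other bot falls through to "User-agent: *" and is blocked everywhere.
print(rp.can_fetch("SomeOtherBot", "https://www.facebook.com/somepage"))  # False
```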

------
techaddict009
I had the same doubt. Hopefully someone who is an expert in SEO can answer
this better.

~~~
fendmark
Interesting

Yeah, Facebook's robots.txt file whitelists specific search engines and offers
an option to get whitelisted; then there is a wildcard disallow for everyone
else.

Of course, nefarious scrapers can ignore the robots.txt file or even spoof
Googlebot or Bingbot, but at least it sets a precedent and a policy that they
can take further action on if needed.

