Ask HN: Why is Facebook indexed in search engines, against robots.txt rules?
11 points by hiby007 on Sept 14, 2013 | 12 comments
Facebook's robots.txt: https://www.facebook.com/robots.txt

Facebook has several different rules for different search engines.

But if you go to the end of the file, you will find this rule:

  User-agent: *
  Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

site:facebook.com returns about 3,520,000,000 results on Google.

So my question is: why is facebook.com indexed in Google and other search engines?




It's not against the robots.txt rules. In Facebook's robots.txt file, there is a section for Googlebot (as well as for several other bots):

  User-agent: Googlebot
  Disallow: /ac.php
  ...
A bot is supposed to follow its own User-agent section. If that is not found, it should follow the User-agent: * section. [1]

[1] http://www.robotstxt.org/orig.html
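
If you want to see that behaviour mechanically, here is a minimal sketch using Python's stdlib urllib.robotparser against an abbreviated, made-up excerpt (not Facebook's actual rules):

  import urllib.robotparser

  # Abbreviated, made-up excerpt: a Googlebot section plus the catch-all.
  lines = [
      "User-agent: Googlebot",
      "Disallow: /ac.php",
      "Disallow: /photo.php",
      "",
      "User-agent: *",
      "Disallow: /",
  ]

  rp = urllib.robotparser.RobotFileParser()
  rp.parse(lines)

  # Googlebot matches its own section, so only the listed paths are blocked.
  print(rp.can_fetch("Googlebot", "/zuck"))        # True
  print(rp.can_fetch("Googlebot", "/photo.php"))   # False

  # Anything else falls through to "User-agent: *" and is blocked everywhere.
  print(rp.can_fetch("SomeRandomBot", "/zuck"))    # False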


Great reply, thanks.


Any supporting link or reference for this?


I edited the original comment, but yes, it is in this document: http://www.robotstxt.org/orig.html

The key paragraph is:

  If the value is '*', the record describes the default
  access policy for any robot that has not matched any of
  the other records. It is not allowed to have multiple
  such records in the "/robots.txt" file.


Thanks for sharing this!


What if a different robots.txt is being served up for the real Googlebots?

EDIT:

Based on the comments in their robots.txt, it appears that they are whitelisting certain robots. You would have to apply at https://www.facebook.com/apps/site_scraping_tos.php for your robot to be allowed to crawl their site.

They probably serve up a uniquely generated robots.txt based on the whitelisted robots. You'd never know what rules it contains unless you are one of those whitelisted robots.
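
Purely hypothetical, but serving a different robots.txt per user agent would only take a few lines. A sketch in Python/Flask, with a made-up whitelist and made-up rule bodies:

  from flask import Flask, request

  app = Flask(__name__)

  # Made-up whitelist; a real site would presumably also verify the source,
  # since the User-Agent header alone is trivially spoofed.
  WHITELISTED = ("Googlebot", "bingbot", "Baiduspider")

  PERMISSIVE  = "User-agent: *\nDisallow: /ac.php\nDisallow: /ajax/\n"
  RESTRICTIVE = "User-agent: *\nDisallow: /\n"

  @app.route("/robots.txt")
  def robots():
      ua = request.headers.get("User-Agent", "")
      body = PERMISSIVE if any(bot in ua for bot in WHITELISTED) else RESTRICTIVE
      return body, 200, {"Content-Type": "text/plain"}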


Or if your user-agent is the same as one of theirs. Unless they are whitelisting by IP.

I tested this with:

  wget --user-agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://www.facebook.com/robots.txt

and diffed, but the results are the same.


Is your IP range the same as Google's?


Google doesn't release their IP ranges. Googlebot can come from any IP.
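
Instead of publishing IP ranges, Google's documented advice is to verify a crawler with a reverse-then-forward DNS lookup. A rough sketch of that check in Python (the helper name is mine):

  import socket

  def is_real_googlebot(ip):
      # Reverse-resolve the IP, check the hostname belongs to Google,
      # then confirm a forward lookup of that hostname maps back to the same IP.
      try:
          host = socket.gethostbyaddr(ip)[0]
          if not host.endswith((".googlebot.com", ".google.com")):
              return False
          return ip in socket.gethostbyname_ex(host)[2]
      except (socket.herror, socket.gaierror):
          return False

  # Usage: is_real_googlebot(request_ip) is True only when both lookups agree.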


It's like an if/elsif/else statement.

`User-agent: *` applies to all the other bots that didn't match any section above.

Here are the rules for Google:

    User-agent: Googlebot
    Disallow: /ac.php
    Disallow: /ae.php
    Disallow: /ajax/
    Disallow: /album.php
    Disallow: /ap.php
    Disallow: /autologin.php
    Disallow: /checkpoint/
    Disallow: /confirmemail.php
    Disallow: /contact_importer/
    Disallow: /feeds/
    Disallow: /file_download.php
    Disallow: /l.php
    Disallow: /o.php
    Disallow: /p.php
    Disallow: /photo.php
    Disallow: /photo_comments.php
    Disallow: /photo_search.php
    Disallow: /photos.php
    Disallow: /sharer/
- meaning anything else is allowed.
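
To make the if/elsif/else analogy concrete, here's a toy sketch of the group-selection step (a real parser also handles Allow lines and stricter agent-token matching):

    # Toy sketch: pick the User-agent group that matches, else fall back to "*".
    def pick_group(user_agent, groups):
        ua = user_agent.lower()
        for agent, rules in groups.items():
            if agent != "*" and agent.lower() in ua:
                return rules              # e.g. the Googlebot section above
        return groups.get("*", [])        # the catch-all "Disallow: /" section

    groups = {
        "Googlebot": ["/ac.php", "/ae.php", "/ajax/"],   # abbreviated list
        "*": ["/"],
    }
    print(pick_group("Googlebot/2.1", groups))   # ['/ac.php', '/ae.php', '/ajax/']
    print(pick_group("RandomCrawler", groups))   # ['/']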


I had the same doubt too. Hopefully someone who is an expert in SEO can answer this in a better way.


Interesting

Yeah, Facebook's robots.txt file whitelists specific search engines and offers an option to get whitelisted; then there is a wildcard disallow for everyone else.

Of course nefarious scrapers can ignore the robots.txt file or even spoof Googlebot or Bingbot, but at least it sets a precedent and a policy that Facebook can take further action on if needed.



