There are a couple of things I want to clarify. First, we genuinely support data portability: we want users to be able to use their data in other applications without restriction. Our new data policies, which we deployed at f8, clearly reflect this (http://developers.facebook.com/policy/):
"Users give you their basic account information when they connect with your application. For all other data, you must obtain explicit consent from the user who provided the data to us before using it for any purpose other than displaying it back to the user on your application."
Basically, users have complete control over their data, and as long as a user gives an application explicit consent, Facebook doesn't get in the way of the user using their data in your application, beyond basic protections against things like selling data to ad networks and other sleazy data collectors.
Crawling is a bit of a special case. We have a privacy control that lets users decide whether they want their profile page to show up in search engines. Many of the other "crawlers" don't really meet user expectations. As Blake mentioned in his response on Pete's blog post, some sleazy crawlers simply aggregate user data en masse and then sell it, which we view as a threat to user privacy.
Pete's post did bring up some real issues with the way we were handling things. In particular, I think it was bad for us to stray from Internet standards and conventions by having a robots.txt that was open and a separate agreement with additional restrictions. This was just a lapse of judgment.
We are updating our robots.txt to explicitly allow the crawlers of search engines that we currently allow to index Facebook content and to disallow all other crawlers. We will whitelist crawlers when legitimate companies that want to crawl us (presumably search engines) contact us. For other purposes, we really want people using our API, because it has explicit controls around privacy and additional requirements that we feel are important when a company is using users' data from Facebook (e.g., we require that you have a privacy policy and offer users the ability to delete their data from your service).
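For readers unfamiliar with the convention, a whitelist-style robots.txt looks roughly like the sketch below. The agent names are illustrative only, not Facebook's actual list; an empty Disallow line means "allow everything" for that agent, which is the standard way to express a whitelist.

    # Illustrative whitelist-style robots.txt (example agent names)
    # Crawler whitelisting requests: see contact info in this file

    User-agent: Googlebot
    Disallow:

    User-agent: Slurp
    Disallow:

    # Every other crawler is blocked from the whole site
    User-agent: *
    Disallow: /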
This robots.txt change should be deployed today. The change will make our robots.txt abide by conventions and standards, which I think is the main legitimate complaint in Pete's post.
Thanks Bret, I appreciate you taking the time to dig into this. You're right, my complaint is the disconnect between the terms-of-service and robots.txt, so I'm very happy you're addressing this.
I just wish we could have had this conversation a few months ago, before you guys threw your lawyers at me.
It would actually be interesting to have a browser plug-in that shows you if a page you are on is excluded in the robots.txt file or which agents are specifically excluded/allowed to scan the page. Might make for some interesting viewing on sites like Facebook.
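Plug-in UI aside, a minimal sketch of the core check such an extension would perform, using Python's standard urllib.robotparser (the site, path, and agent names below are just examples):

    import urllib.robotparser

    def check_access(site, path, agents):
        """Report which user agents may fetch a given path, per robots.txt."""
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(site + "/robots.txt")
        rp.read()  # fetch and parse the site's live robots.txt
        for agent in agents:
            allowed = rp.can_fetch(agent, site + path)
            print(f"{agent}: {'allowed' if allowed else 'disallowed'} on {path}")

    # Example: compare how different crawlers fare on a hypothetical page
    check_access("http://www.facebook.com", "/some.profile",
                 ["Googlebot", "ChangeDetection", "*"])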
Getting a response back from an executive is super-awesome.
I'm a little concerned about how we (80legs) fit into Facebook's permission scheme, though. We're not using or selling the data ourselves, but we can sell access to an already-configured web crawl - so how does that work? We're like a special case of a special case. Never an enviable spot to be in.
Please post contact information for bots wanting to be whitelisted. (Either here or in the robots.txt file itself).
For example, our ChangeDetection.com site has a couple hundred users monitoring pages on Facebook. (User-Agent: ChangeDetection). We always honor robots.txt, so if we are not whitelisted, all these monitors will be disabled.
Edit: looks like you already have posted contact info in robots.txt
Bret wrote: "I think it was bad for us to stray from Internet standards and conventions by having an robots.txt that was open and a separate agreement with additional restrictions."
You don't have an "agreement" at all. An agreement requires that two parties actually, um, agree. You've published a statement where you assert certain rights and imply that you will sue anyone who accesses data on your site in a way you don't like. You may get away with that, regardless of the legal merits of your position, because you have more money for lawyers than most people you're likely to sue. But don't try to dignify what you're doing by calling it an "agreement". It's like an extortionist telling me that we have an agreement that he won't break my windows if I pay him protection money.