There are a couple of things I want to clarify. First, we genuinely support data portability: we want users to be able to use their data in other applications without restriction. Our new data policies, which we deployed at f8, clearly reflect this (http://developers.facebook.com/policy/):
"Users give you their basic account information when they connect with your application. For all other data, you must obtain explicit consent from the user who provided the data to us before using it for any purpose other than displaying it back to the user on your application."
Crawling is a bit of a special case. We have a privacy control enabling users to decide whether they want their profile page to show up in search engines. Many of the other "crawlers" don't really meet user expectations. As Blake mentioned in his response on Pete's blog post, some sleazy crawlers simply aggregate user data en masse and then sell it, which we view as a threat to user privacy.
Pete's post did bring up some real issues with the way we were handling things. In particular, I think it was bad for us to stray from Internet standards and conventions by having a robots.txt that was open and a separate agreement with additional restrictions. This was just a lapse of judgment.
This robots.txt change should be deployed today. The change will make our robots.txt abide by conventions and standards, which I think is the main legitimate complaint in Pete's post.
I just wish we could have had this conversation a few months ago, before you guys threw your lawyers at me.
IP, location, host of the site you are viewing
whois details for the site
robots.txt rules for the page
Probably as a dropdown info panel.
I'd be surprised if this doesn't already exist, though.
I'm a little concerned about how we (80legs) fit into Facebook's permission form, though. We're not using or selling the data ourselves, but we can sell access to an already-setup web crawl - so how does that work? We're like a special case of a special case. Never an enviable spot to be in.
For example, our ChangeDetection.com site has a couple hundred users monitoring pages on Facebook (User-Agent: ChangeDetection). We always honor robots.txt, so if we are not whitelisted, all these monitors will be disabled.
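To make the whitelisting point concrete, here is a minimal sketch of how a well-behaved crawler might check robots.txt before fetching a page, using Python's standard urllib.robotparser. The robots.txt content, the "SomeOtherBot" agent, the may_fetch helper, and the page URL are all made up for illustration; this is not Facebook's actual file, and not necessarily how ChangeDetection does it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only (NOT Facebook's real file):
# one whitelisted agent, everyone else disallowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: ChangeDetection
Allow: /

User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page URL, used only to exercise the rules above.
page = "http://www.facebook.com/somepage"
allowed = may_fetch(SAMPLE_ROBOTS_TXT, "ChangeDetection", page)
blocked = may_fetch(SAMPLE_ROBOTS_TXT, "SomeOtherBot", page)
print(allowed, blocked)
```

Under a whitelist-style file like this one, any crawler that honors robots.txt and is not listed by name falls through to the catch-all rule and has to stop fetching.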
Edit: looks like you already have posted contact info in robots.txt
You don't have an "agreement" at all. An agreement requires that two parties actually, um, agree. You've published a statement where you assert certain rights and imply that you will sue anyone who accesses data on your site in a way you don't like. You may get away with that, regardless of the legal merits of your position, because you have more money for lawyers than most people you're likely to sue. But don't try to dignify what you're doing by calling it an "agreement". It's like an extortionist telling me that we have an agreement that he won't break my windows if I pay him protection money.