Hacker News

"Their contention was robots.txt had no legal force and they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained. The only legal way to access any web site with a crawler was to obtain prior written permission."

This is ridiculous. Can someone release the data so this can be tested in court? EFF?



For obvious reasons this would never happen, but it would be pretty cool to see Google unindex Facebook entirely. They could state that, since Facebook has made it clear that robots.txt is not sufficient, Google did not have prior written permission to index Facebook's content (Google had simply assumed the rules outlined in robots.txt were sufficient, as they are for every other website).

They would state that Google would be happy to reindex Facebook, but it would take several weeks for their lawyers to meet with Facebook's counsel, draft documents, reindex privately, audit their cache to ensure compliance, etc.

We could then sit back and watch how fast Facebook backpedals.


I had a strangely similar problem to this in college with the network admins and filesystem permissions.

This was back in the day when everybody was using the unix network (pine for email, etc). Many people had webpages on their university accounts and would also store homework online. A very popular thing to do was to go into someone else's home directory and copy their .fvwmrc or .bashrc file (yes this was a few years ago).

The IT people came up with a new policy that said you were not allowed to look into other people's directories without explicit (possibly written) permission from the owner. They said that copying files could be considered "cheating" and possibly copyright infringement and you'd be referred to the university disciplinary court. I assume this all stemmed from fear of cheating.

In any event, I got into a long debate with the IT people about the whole point of unix file permissions. Basically: if you don't want people to look at your stuff, don't let them. As you might guess, when dealing with university IT, I was on the losing end of the argument.


It would seem fair to argue that "rwxrwxrwx" is explicit permission to read your files!
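That argument can be made concrete with a short Python sketch (the function name is just for illustration, and it assumes a Unix filesystem):

```python
import os
import stat

def world_readable(path):
    """True if the 'other' read bit is set - the trailing 'r' in rwxrwxrwx."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)

# chmod 644 (rw-r--r--): anyone on the system may read the file.
# chmod 600 (rw-------): only the owner may read it.
```

A mode of rwxrwxrwx (0777) sets every one of those bits, so by this reading the owner has granted everyone access.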


robots.txt is good for telling robots they have no business reading a section, or will be severely disappointed if they do. There is no way one can encode the TOS of a site into the syntax of robots.txt, so there is no reason to believe it embodies the TOS.

I've never used Facebook, but Section 9.2.6 of their terms says: "You will delete all data you received from Facebook if we disable your application or ask you to do so."

I don't see how that could be encoded in robots.txt.


User-agent: *
Disallow: /directory/people/*
Disallow: /directory/pages/*

Done. Google crawls Facebook. Every social media monitoring company crawls Facebook (and every other social network). Tons of other companies do the same every day. I know this because many of these companies are our customers.
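A minimal sketch of how a well-behaved crawler would consult rules like those, using Python's stdlib parser (the hostname and user agent are hypothetical; note that RobotFileParser does plain prefix matching and does not interpret "*" wildcards in paths, so the trailing "/*" is dropped here - it's redundant for prefix matching anyway):

```python
from urllib import robotparser

# Parse the rules as a polite crawler would before fetching any page.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /directory/people/",
    "Disallow: /directory/pages/",
])

rp.can_fetch("MyCrawler", "http://example.com/directory/people/alice")  # blocked
rp.can_fetch("MyCrawler", "http://example.com/press/")                  # allowed
```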

What really irks me is that Pete collected valuable data and whoever he shared it with was probably able to derive added value from that data - in ways that Facebook is not doing. Facebook is prohibiting value creation.


Google can do it because Facebook doesn't really want to go up against money in making this a test case, and Google has baskets of cash.

Facebook doesn't give a tinker's damn about prohibiting your value creation.


Or because Google would simply comply with their demands and remove all mention of facebook from their indexes... That would be pretty interesting.


Ha! Yeah, I hadn't thought of that aspect. Half of their userbase wouldn't even be able to sign on any more!


Rather more than half, I think.


I guess when they have something to gain directly, i.e. people reaching Facebook profiles through Google searches, they are more than happy for this to occur.


Surely the ToS is only applicable if you use Facebook - e.g., by creating an account.

Either Facebook wants to allow people to crawl it, or it doesn't. robots.txt should be binary: yes or no.


That's not exactly true. The TOS usually applies to people crawling the servers and mining data. Still, there is no clear way to know how a court would rule on something like this; each case is different.

See http://en.wikipedia.org/wiki/Browse_wrap


You don't waive your copyright by having a robots.txt, and while I believe most people think Google style indexing and searching is fair use - that doesn't mean anything you do with the data is fair use.


User content on Facebook may not be copyrightable. If I make a list of my personal interests, I haven't necessarily produced a creative work by the standards of US law.

Check out this site (just found it via search): http://www.canyoucopyrightatweet.com/


Correct - lists of facts without styling aren't something you can copyright. The specific form in which they are printed is copyrightable, but no IP is created by a list of facts. Phone numbers, game scores, colors of rocks, etc. are not copyrightable.


The courts have generally required only a minimum of creativity to make something copyrightable. Also, sentences are surprisingly easy to make unique - see http://go-to-hellman.blogspot.com/2009/11/uniqueness-of-sent...


I don't see any support for statement #1.

Statement #2 is a false dichotomy.

Facebook's Terms are a terrible document. They are written to be intelligible to humans but are full of ambiguity and undefined terms. Any lawsuit built on the theory "robots.txt did not forbid my actions; therefore they are legal" would probably disintegrate into arguments over how the Terms language is interpreted.

Their lawyers probably leapt from high windows when it was released.


Surely, if that's true, it means any web crawler must first locate and understand every single website's terms of service before it can be sure what it is allowed to do.

You could argue that by freely allowing access to all of their data, without requiring you to read and agree to the terms of use, Facebook has undermined its own claims.

If they did want to restrict access, or make sure every crawler had first agreed to the T&C, it wouldn't take long for them to add that.


I think this is missing the point. Facebook's attorneys don't care about the technical details. They already know that whatever is technically possible is entirely subordinate to how they can threaten the owners of the technology.

It hasn't been tested in court, and there's a truly excellent chance Facebook would lose - either in terms of the court or in terms of what's left of their privacy reputation - but that doesn't matter one little bit. They have attorneys and money, and that's all that matters in this instance.

Really. The facile assumption that "it's possible to aggregate thus it's OK to aggregate" is exactly the way normal people think (by which I mean, tongue-in-cheek, us), but corporate attorneys see all this in terms of power relationships and contracts. As far as they're concerned, the poster took pictures through their front windows, and they're damn well going to threaten him with kneecapping until he gives them his negatives.


As a long-time (but sadly former) EFF donor, I hope they pick it up too. This case is tailor made for them.


There's nothing ridiculous about that. There's a clearly visible link on the bottom of each Facebook page that links to the Terms & Conditions of accessing Facebook pages. Some relevant parts:

"If you collect information from users, you will: obtain their consent, make it clear you (and not Facebook) are the one collecting their information, and post a privacy policy explaining what information you collect and how you will use it."

"You will only use the data you receive for your application, and will only use it in connection with Facebook."

"By "application" we mean any application or website (including Connect sites) that uses or accesses Platform, as well as anything else that receives data." - note that by this definition a Facebook crawler is an application

"You will have a privacy policy or otherwise make it clear to users what user data you are going to use and how you will use, display, or share that data."

"You will not transfer the data you receive from us (or enable that data to be transferred) without our prior consent."

"You will make it easy for users to remove or disconnect from your application."

etc.

These things are easy to look up; it took me about five minutes to find them. In fact, they have the right to limit how even publicly facing content on their site is used. Things like copyright and data protection laws come to mind. You might not agree with that but it is absolutely in their right to do so.

The FB Terms & Conditions don't mention robots.txt at all.


Doesn't apply. "If you collect information from users" is not equivalent to "if you collect information about users". Pete collected information about users that Facebook freely gave him without requiring any binding agreement from him at all.

It's probable they would lose. But it's far, far more probable that Pete would be bankrupt well in advance of that point in time, and that's how the court system works.


"You might not agree with that but it is absolutely in their right to do so."

Well, they claim it is their right. The courts will decide if it actually is.

Plenty of places make claims to rights and make you sign waivers (dojos, for example), but fail in court. This could be interesting.


As far as I can see, they weren't creating a facebook application, so I'm not sure why the Terms & conditions are relevant.

If you run a public website, you have to accept that people may crawl your website. If you want to prevent it, don't let them crawl it. That's what robots.txt is for.


As far as I understand, the problem was not crawling the website; it was the way he used the data he gained by crawling.


I think it's quite ridiculous.

If I buy a map, can I use it to be a guide, giving people directions and make a living?

If I buy a story book, can I use it to read to children and collect money from their parents?

Or, if I spend some time to index all the books in the library, can I sell the index and make money? Or did I violate the rights of either the library or the books' copyright owners?

I think crawling and making use of the crawled data does not infringe copyright, because it's not actually a copy of the data. Rather, it's a service that transforms the data and helps other people get to the original data more easily.

Technically, they could crawl the data on the spot when receiving a request from their clients. To speed up the process, they cache data crawled on earlier occasions, but that's a technical detail, and it's encapsulated. Otherwise, they could have stored URLs and offsets into the pages rather than the data itself. I don't think that breaks the law; if it did, there would be no lawful way to refer to anything.


A couple of examples: if you rent a DVD from one of those DVD rental places, normally you can't play it in public venues like bars. If you buy a CD, you can't broadcast it on your radio station; you need a special licence for that.

Another possible related issue is the question of derivative works (e.g. the index of a library). There's a Wikipedia article about derivative works, it's a bit too complex to summarize here: http://en.wikipedia.org/wiki/Derivative_work


Two extraordinarily poor examples!

Both films and music (plus songwriting too) have been explicitly enshrined by US copyright law as having separate rights for home use and public performance. Each has multiple spheres of statutory licensing organizations with special exemptions from anti-trust law.


I was referring to the copyright laws of the United Kingdom, which is the country where I live (Copyright, Designs and Patents Act 1988). Here, public performance infringement is not limited solely to films and music (Section 16.1c).

I'm not very familiar with US copyright legislation, but I assumed without checking any sources that the copyright laws there are similar - which was apparently a mistake.


Indeed. Knowledge about social graphs is probably the core of Facebook's business. No wonder they want a monopoly on knowledge of the social graphs they host.

But, really, they can't stop it. And when (not if) nominative aggregated data gets out, maybe the general population will begin to actually understand how expensive Facebook really is.


The information is copyrighted. Redistribution without prior written permission from the copyright owner is illegal.


Can you copyright someone else's name and the names of their friends?


You can't copyright somebody's name and address, but you can copyright a map showing where everybody lives. That actually sounds like what Pete was doing, but Facebook could make similar claims about some of the information they provide.

Everything from stock prices to the temperature is in theory uncopyrightable, but you can still run into trouble if you just scrape some site and republish it.


You can't copyright facts, but you can copyright collections of facts.

This could have been an interesting case.

And regarding robots.txt: it's nothing more than a suggestion and common courtesy. It's a way to tell crawlers what to ignore and how to behave on your site, to their benefit and yours. By itself it's not legally binding at all; it's just a convention by which all parties unofficially cooperate.


"You can't copyright facts, but you can copyright collections of facts."

It depends; see the oft-cited example of phone books being held uncopyrightable (Feist v. Rural Telephone).


There has to be some kind of creative component to the compilation. And arguably Facebook didn't even compile the data in the first place -- it'd be like a college claiming a compilation copyright on the band fliers, roommate requests, and lost-pet ads pinned to a bulletin board in the student union building.


No


Why don't I just sue Google for crawling my website and get this court case done with? Once precedent is set, it would be much harder for Facebook to make these kinds of claims. "Your honor, in the case of Google Canada vs. Zach Aysan the courts clearly decided that..." DONE. Actually, I'm surprised this hasn't happened already.


I'd guess you'd need to burn $100k before you even saw a judge. Are you up for that?


Could someone explain why it is so ridiculously expensive? At least in the US?


(Not a lawyer)

From my point of view, lots of individually useful rules have been added to the court system that, taken in aggregate, have negative effects. There are lots of pre-trial motions with specific uses, designed to stop a problem or plug a loophole in the law. The problem comes when you have multiple dozens of them - or even just a few big ones.

For instance, discovery. The idea is that fairness is not helped by surprise evidence, so the solution is that both sides get access to the other side's evidence. But how do you do that? It takes lawyer time, calendar time, etc., to collect and go through it all.

That's just one (although one of the larger ones). Add in requests for everything else, and you can see how much lawyer time is "wasted".


A lawyer's initial purpose is almost always intimidation.


Time. You will need a lawyer working for you nearly fulltime for months. I imagine that when you work for someone fulltime for months, you expect to get paid well enough that you can eat, pay rent, bills and maybe have some disposable income. Even the cheapest lawyer will have the same expectations.


You'd have to show damages. Otherwise (and correct me if I'm wrong on this point) you have no standing to file suit.


But wouldn't Facebook have to show damages here as well?


If someone could show that they could extract personally identifiable information from the aggregated database that isn't immediately obvious from browsing the site, I'd expect Facebook could make the case that distributing the database could cause serious damages, much like it did for Netflix.


How is it ridiculous? The more ridiculous thing is to expect robots.txt to actually have some sort of legal force.


Implied agreement.

The law acknowledges standard practice and expectations. That's what it is built on.

If you put something up on the web, then although you have copyright, you're also implying permission for people to be able to do what they normally do with it (at a minimum, view it in a web browser).

Similarly, if you've put up a robots.txt, then you're also quite clearly giving permission to crawlers (within certain bounds). There's also reasonableness and taking into account what robots.txt is capable of, of course. You clearly aren't implying permission for me to DoS you.

Explicitly agreed terms trump this. If it could be shown that the person operating the crawler had agreed to an AUP, that would change things. I imagine this would make the case quite different from Google crawling your site.


Army of Facebook Lawyers:

"Their contention was robots.txt had no legal force and they could sue anyone for accessing their site ..."

Lone data collector dude:

<blink> <blink>

It turns out the lawyers are right. Huh.


Oh. I actually wasn't saying that the guy "blinked." I was trying to evoke the image of a relatively powerless person staring at an army who had just given an order; the guy would of course have no response available except OK. Wasn't deriding the guy at all.



