

Ask HN: Legality of Web Scraping User-Submitted Content? - brandon272

Is it legal to scrape another website for user-submitted content? For example, if there was a news site that got its news solely from users submitting their own personally-written articles and stories -- who owns that content? Does the end user relinquish ownership once they submit that content?

Is it legal for another company to "scrape" that content and use it on their site, only removing it if the user who submitted it in the first place asks them to, either directly or via legal means?

Thoughts and insight would be appreciated!
======
micks56
It is most likely not legal.

The user owns copyright to the article or story that he wrote. This ownership
of copyright gives the user the right to decide how it is distributed. For you
to use the material legally, you must get permission of the copyright holder.
The user may be the copyright holder, or the site may be if the user transfers
the ownership. Either way, someone owns the copyright and it is not you. You
need permission.

There are two defenses to copyright infringement: 1) fair use, and 2) parody.

Parody probably doesn't fit here. So that means you need to make a case for
fair use.

Fair use has 4 elements:

1\. The purpose of the use (commercial vs. non-commercial/educational) - if you
are going to make money on this, fair use is out.

2\. Nature of the copyrighted work - This doesn't really apply here, so I
won't go into the lengthy explanation.

3\. Amount of the work used in relation to the whole - Did you extract a quote?
That is probably ok. Did you copy the entire article? Probably not ok.

4\. Effect upon the market - if your site harms the market of the other site,
no fair use.

~~~
lacker
It is not true that commercial purpose precludes a fair use defense. From
Wikipedia:

"While commercial copying for profit work may make it harder to qualify as
fair use, it does not make it impossible."

see: <http://en.wikipedia.org/wiki/Fair_use>

~~~
micks56
In this case it does.

The use in the case you are referring to had several extra elements which
prompted the court to not find copyright infringement.

2 Live Crew was fair use because their work was transformative. They took a
beat and changed words to a Roy Orbison song. They did not blindly copy, as the
question asker will be doing. They took the score, not the words.

The poster is taking the story, not quotes. His site is not transformative,
and therefore his commercial use is not fair.

~~~
dhimes
And, they lost the Van Halen lawsuit.

------
mattmaroon
One of the most dangerous things you can do is rely on non-lawyers for legal
advice. Ask an attorney.

~~~
tstegart
Meh, if everyone listened to their lawyer we wouldn't have YouTube or
BitTorrent :)

~~~
drandall
_insert JP Morgan quote_

~~~
nickb
For those who've never heard of this gem :)

 _Well, I don't know as I want a lawyer to tell me what I cannot do. I hire
him to tell how to do what I want to do. ~ J. P. Morgan_

------
gm
Technically, you own whatever you write. So I own the copyright on this
message, unless I granted the copyright to YC News when I signed up (I don't
remember). Assuming that I did not, then I retain copyright, and you have to
get the ok from me, individually.

That's the theory anyway. Talk to a lawyer.

------
emmett
The real question is usually not "Is this legal" but rather "Will this get us
sued". We do many "illegal" things like jaywalking all the time, but everyone
with sense knows which things are truly illegal.

------
nickh
Check to see if the site that you're considering scraping has any sort of
"terms and conditions", "terms of use", "legals", etc page. If it does, read
that page in detail. If it doesn't, ask the site for permission to scrape.

If the site doesn't have one of those pages and you don't ask for permission,
you're not only putting yourself in danger of legal action, but you're also
depending on a data source that isn't reliable.

Remember, it'd only be a matter of time before they notice you scraping, and
take measures to stop it.

------
tstegart
Well, there's the real world answer, and the legal answer. Not too sure about
the legal answer, that's always murky. Some things are copyrighted, some
aren't. You can't just take someone's story, or art, and use it on your own
site without permission. Usually the creator has all rights until they are
given away. Even if they have uploaded it to another site, that doesn't mean
they have given you permission though. On the other hand, some information
can't be controlled. If Bob says it's 70 degrees in San Francisco right now,
you can totally say it's 70 degrees. If the Mets won, you're welcome to say the
Mets won. The Drudge Report does nothing except report headlines.

In the real world, however, people steal information all the time. It's not
polite, though. Usually people ask for attribution. I don't think a policy of
only removing it when someone asks you to would be polite. That's like
stealing something when no-one is home and leaving a note saying you'll return
it if they ask you to.

Also, pure scraping, even of non-copyrighted information, can get you into
trouble if the other person had paid for that information, like a news site.
They pay for their news. Scraping it and making your own news site (with full
content, not just headlines) is illegal.

So the short answer is that the original creator still owns the content, and
no you probably can't have it.

~~~
brandon272
I guess my followup with that would be this:

What if the user is submitting content that they would very clearly want
redistributed? I guess my question with that is, if you take the end-user who is
submitting the content out of the equation, does the site that is being
scraped _from_ have any kind of leg to stand on if they don't want you
scraping their content (excluding technical means that they might implement),
assuming that the site being scraped from does not force their end-user who is
submitting the content to agree that the site that they are submitting the
content to "owns" the content, once it is submitted?

~~~
tstegart
The MySpace suicide case really muddles this and makes it a problem. If the
site you want to scrape from has TOS that say you can't re-use the info, (and
it most likely would), then re-using the info would be a violation of the
terms of service. Basically, in the MySpace case the feds are trying to make
it a crime to violate a website's terms of service.

But like someone else said, if someone wants their content distributed, the
user is not going to give you any trouble, and if they do, you're very
protected if you take the offending material down right away, as someone else
mentioned.

The website you're stealing from does have legal measures they can take,
especially if you're directly competing with them. I think someone else
mentioned that they could also just reconfigure their website to mess up your
scrape, which is probably what they'll keep doing. They'll also likely
publicize your bad business practices and you'll end up with a horrible
reputation. Nobody likes a copier. It's like people who steal designs. They
hardly ever outperform the website they're copying.

------
drewcrawford
In the legal world, whenever text is set in fixed form, it is automatically
copyrighted by its author, whether they claim it or not, unless they
specifically say otherwise. "Specifically say otherwise" probably includes
anything they might have agreed to in the TOS for a particular site, although
the legal waters there are largely untested. The only exemptions to this
auto-copyright are statistics, phone numbers, and other non-copyrightable
content (see CBC vs. MLB).

IANAL, but depending on the nature of your service and specifically how the
content is collected, you _may_ qualify for DMCA safe harbor protections. This
means that if you remove things in a timely manner upon request nobody can sue
you. This is how Google caches the whole internet without getting sued.

That's all legal mumbo-jumbo. The real world answer is that some people will
get mad regardless of the law, so take their content down and apologize.
Follow robots.txt guidelines. Don't post takedown replies a la PirateBay.
Generally act sane. If you do all of the above you'll probably be ok.
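
For anyone wondering what "follow robots.txt guidelines" looks like in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt text, the `example-bot` user agent, the `example.com` URLs and the `may_scrape` helper are all made up for illustration. Passing this check says nothing about copyright or a site's terms of service; it only shows whether the site has asked crawlers to stay out of a path.

```python
from urllib.robotparser import RobotFileParser


def may_scrape(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url.

    Hypothetical helper: it parses robots.txt content that has already been
    downloaded, so the example runs without any network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt for an imaginary site that keeps crawlers out of /stories/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /stories/
"""

if __name__ == "__main__":
    for url in ("http://example.com/stories/1", "http://example.com/about"):
        allowed = may_scrape(EXAMPLE_ROBOTS, "example-bot", url)
        print(url, "allowed" if allowed else "disallowed")
```

Against a live site you would fetch that site's actual `/robots.txt` first (for example with `RobotFileParser.set_url` and `read`) and skip any URL for which `can_fetch` returns False.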

~~~
micks56
DMCA safe harbor applies when you are a service provider.

Examples are an ISP that merely provides access to the internet. The ISP
cannot be sued for copyright infringement just because infringing bits passed
through its servers.

Also, a message board that posts whatever a user writes would obtain DMCA safe
harbor. They just provide a service and don't screen out for content. An
example is Craigslist.

This person's site DOES NOT enjoy DMCA safe harbor. He scraped the other site
and populated his own. He is not even acting as a service provider as the DMCA
statute prescribes.

------
sh1mmer
My "not a lawyer" answer to this is:

It depends on the terms of service of the site.

#1 The TOS of the site may not let you use a robot on their site at all

#2 The TOS will define who owns the user generated content (UGC), either the
user or the site

#3 Depending on who owns the UGC you may or may not be able to scrape it. If
it's the site, it's against their TOS; if it's the user, you would need
permission from the user.

#4 As other people have said fair use might come into play. If the site owns
the material using a single user contribution might be fair use within the
context of the whole site. If the users own the content you are likely to be
using all their content, thereby not able to use fair use.

Again all of these are my observations. Hopefully it will give you something
to think about. If you are starting a business based on this, you do need to
consult a lawyer. Also starting a business based on page scraping is a pretty
risky thing to do. If the scraped site turns you off you could be pretty
screwed.

------
pedalpete
I'm not a lawyer, but had to do a bunch of legal research regarding this topic
for some of my sites.

I think what most of the responses so far are missing is the importance of
crediting the content to the content owner (likely the site, not the
contributing users), and providing a link to the source.

Check out this pdf,
[Sableman's authorized linking](http://www.law.berkeley.edu/journals/btlj/articles/vol16/sableman/sableman.pdf),
and search on Google v. Perfect 10.

You haven't really given much to go on with respect to what you are scraping,
and what you plan to do with it. But I think a bunch of common sense and
ensuring that your site in no way harms the original source's site
(defamation, etc), are the most important things to consider.

Hope this helps.

------
jonmc12
A few years back I wanted to set up a menu service - putting a bunch of
restaurant menus online so that others could search. My lawyer advised that
because the restaurant made these materials open in the public domain, we
could basically do whatever we wanted with them. As if they were public
property.

I'm not sure how this relates to other kinds of content from a legal
standpoint, but I've used it to ask myself 'did this person intend for this
information to be public' as sort of an ethical guideline.

~~~
timcederman
Woah, that is seriously not correct advice. Was your lawyer even a copyright
lawyer?

~~~
tstegart
Yeah, I'm not sure "chicken, $10" can be copyrighted. The image of a menu
might be a problem, but the text? Not to mention his lawyer may have been
giving practical advice (hopefully acknowledged as such): what restaurant
would really argue with their menu being available?

~~~
lacker
Isn't there a defense for factual information? The fact that a particular
restaurant sells chicken for $10 cannot be copyrighted. The look and feel of
their menu (like, the squiggly decorations, the font, the layout) can be
copyrighted.

~~~
tstegart
That's what I was thinking. A phone directory can't be copyrighted. Stealing
the database is a crime though. I think that's what the original question was
really about: How far can he go?

~~~
timcederman
It's a sticky area. <http://en.wikipedia.org/wiki/Feist_v._Rural>

------
uvince
News is one case, but how about the millions of ratings & reviews floating out
there on the web? What about this site?
[http://www.boorah.com/restaurants/CA/palo-alto/the-counter/A...](http://www.boorah.com/restaurants/CA/palo-alto/the-counter/A1E0B14D68-reviews.html?f=y)

They scrape, re-present abstracts from and supposedly do calculations based on
the entirety of user-submitted data collected by a number of sites.

Does that violate fair use in anyone's opinion?

------
okeumeni
I would suggest that you mention clearly the origin of anything you get from
anywhere on the internet.

The truth is, it’s hard to find a site without a mention of copyright in
its terms and conditions page.

Overall it will depend on the use you make of the copied content. Trouble starts
when you use someone’s material in a line of business that competes with the
owner's.

------
webwright
Google does it, as do vertical search companies like Indeed and SimplyHired,
so it's certainly legal under some conditions. Taking only a snippet and
linking back to the source is generally okay (as the original publisher
benefits from traffic and SEO juice).

~~~
jcl
I assume you're talking about Google re-serving cached content. I think
Google's defense boils down to "Any site without a properly configured
robots.txt file implicitly grants permission to be spidered, cached, and
linked to.":

[http://www.google.com/support/webmasters/bin/answer.py?answe...](http://www.google.com/support/webmasters/bin/answer.py?answer=35301&src=top5)

I don't know if this "opt-out" strategy has been tested in court, but I
wouldn't assume that wholesale copying of user content is analogous to
Google's situation unless the users have some similarly accepted way of opting
out of the copying.

------
smakz
Relevant recent discussion on slashdot:

<http://yro.slashdot.org/article.pl?sid=08/07/08/1245204>

------
bprater
I'm similarly curious about the legality of using screenshots.

