

Ask HN: Ethics and laws regarding scraping websites? - schtog

What are the ethics and laws regarding scraping internet sites?

For one, we have robots.txt: http://en.wikipedia.org/wiki/Robot.txt

Is that the only thing that prevents (or prohibits) a robot/spider from scraping a site? Can there be copyrighted material that is not allowed to be scraped?

I ask because I want to search/scrape a few hundred pages with similar content and present those results all the way through to buying the product.
I.e. not just like Google, where you get a page that has the information: I want to present the options (maybe in ranked order) where one click is needed to choose the product to buy, pretty much. Or one click to search, one to choose and get to the product site, where you obviously have to do some more confirmation, but you get the point...
======
dazzawazza
I think this is a very grey area. If it's on the internet it's scrapable. Does
that mean you can use it? You really need a lawyer.

Take Amazon reviews, for example. They are there to be scraped, but some of
them are not available via the web service API due to copyright restrictions.
That's a clear sign to me that they wouldn't want this information scraped.
I'm not sure how a judge would see it, though.

Another example is licensed material. Take TV listings. They are listed all
over the place, but they are still under copyright, so you can't just scrape
them and use them on your site.

IANAL, but my basic rule is that I scrape if I think I am going to offer the
scraped site something in return (usually traffic). So if it's mutually
beneficial I feel OK. That doesn't mean it's legal though :(

~~~
0x44
Only the layout of the TV listing information is copyrighted. You can't
copyright facts.

~~~
dazzawazza
This seems reasonable. However, I seem to remember that the actual listings in
the UK are 'owned' by some central authority (I forget who, but it's got
something to do with the Radio Times) and you license the right to publish
them.

While it's a fact that a show is on, it's also content owned by some authority.
This is where lawyers make their money.

~~~
0x44
My apologies; the copyright laws of the UK are quite different from those of
the US. I don't know whether or not facts can be copyrighted there.

------
tylercarbone
So, copyright law doesn't care whether you scraped the data or acquired it in
some other way. That is, either you can use the information under copyright
law, or you can't. If you can't use it, then getting it through some method
other than scraping won't help you -- and if you can use it, then scraping
won't change the legality.

So you need to make sure there isn't a copyright violation, which is going to
depend on the specific information you're looking at.

There's still a potential problem, though, unrelated to copyright. After eBay
v. Bidder's Edge, it can be trespass to chattels to scrape data in violation
of a site's TOS. In the eBay case, the court held that it was in violation of
trespass law because the eBay TOS prohibited robots... so it would have been
fine if Bidder's Edge had taken the data manually.

Basically, you need to make sure the data isn't copyrighted, and you need to
make sure that scraping the data isn't in violation of the site's TOS.

------
webwright
Search engines all scrape... Titles, meta data, and some or all of the
content.

I think a good rule of thumb is to consider whether the scraping target will
benefit from being scraped. Most sites are delighted to be scraped by Google.
Will your scraping drive sales/visitors to the target? Or will it cost
sales/visitors? Will you link back (which helps them from an SEO standpoint)?

~~~
calvins
Search engines do not all scrape.

They scrape if the owners have indicated they wish their site to be scraped by
not forbidding access via the robots.txt file.
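
As a concrete illustration of that check, Python's standard library can answer "may this user agent fetch this URL?" from a robots.txt policy. A minimal sketch (the rules, bot name, and URLs here are made up for illustration; in practice you would point `set_url()` at the site's live robots.txt and call `read()` instead of `parse()`):

```python
import urllib.robotparser

# Build a parser from robots.txt rules and ask whether a given user agent
# may fetch a given URL.
def allowed(robots_lines, agent, url):
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)  # parse rules given as a list of lines
    return rp.can_fetch(agent, url)

rules = ["User-agent: *", "Disallow: /private/"]
print(allowed(rules, "MyBot", "https://example.com/private/page"))  # blocked
print(allowed(rules, "MyBot", "https://example.com/products/1"))    # permitted
```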

~~~
webwright
Um, so yeah-- they all scrape-- like I said.

But no, they don't scrape all sites.

------
a-priori
Ethically, I see no problem in doing any scraping as long as you credit the
source and obey the robots.txt file. I figure they were the ones who made the
information freely available on the web.

Legally, however, it's a whole different beast that I'm not qualified to talk
about.

------
cduan
You should look into the DMCA safe harbor provisions for Internet Service
Providers. The basic idea behind them is that, if you comply with certain
requirements (establish a DMCA officer, respond to takedown notices, etc.),
then your service will be immune from many sorts of copyright infringement
claims.

This is not legal advice, and it's not even a very good picture of the DMCA
safe harbors, but hopefully it's enough to point you in the right direction.
The Electronic Frontier Foundation has good resources on this, for example:

<http://w2.eff.org/bloggers/lg/faq-ip.php>

------
schaaf
IANAL, but...

To reuse copyrighted content, you have to consider fair use eligibility and
such -- but fortunately, not all data is locked down by copyright.

One of my ten favorite USSC rulings:

[http://en.wikipedia.org/wiki/Feist_Publications_v._Rural_Tel...](http://en.wikipedia.org/wiki/Feist_Publications_v._Rural_Telephone_Service)

""" It is a long-standing principle of United States copyright law that
"information" is not copyrightable, O'Connor notes, but "collections" of
information can be. Rural claimed a collection copyright in its directory. The
court clarified that the intent of copyright law was not, as claimed by Rural
and some lower courts, to reward the efforts of persons collecting
information, but rather "to promote the Progress of Science and useful Arts"
(U.S. Const. 1.8.8), that is, to encourage creative expression. Since facts
are purely copied from the world around us, O'Connor concludes, "the sine qua
non of copyright is originality". However, the standard for creativity is
extremely low. It need not be novel, rather it only needs to possess a "spark"
or "minimal degree" of creativity to be protected by copyright. ... In the
late 1990s, Congress attempted to pass laws which would protect collections of
data, but these measures failed. By contrast, the European Union has a sui
generis (specific to that type of work) intellectual property protection for
collections of data."""

~~~
xirium
It is rumoured that some encyclopedias include biographies of fictitious
people to enforce collective copyright. This wouldn't concern you if you
wanted information about a historical figure. However, it would be very
problematic if you worked for a rival encyclopedia or wanted to make your own
website out of the encyclopedia.

~~~
nertzy
Dictionaries have long done the same thing with regard to definitions.

<http://en.wikipedia.org/wiki/Copyright_trap>

~~~
xirium
I didn't know that dictionaries also do it. Nor did I know that fake entries
are also known as mountweazels (after the surname of a fictitious entry) or
nihilartikels. How very cromulent ( <http://en.wikipedia.org/wiki/Cromulent>
).

------
jexe
Definitely check the TOS of the sites you plan to scrape. Some data, in
particular factual data, isn't copyrightable, but that doesn't necessarily
give you the rights to automatically collect it from any place you can find
it. Businesses spend a lot of money and time collecting and hosting data, so
they can get pretty particular about how you're allowed to use their service.

Here's an excerpt from the yellowpages.com TOS. Without it, their company
names, addresses, and phone numbers would be more or less fair game for any
competitor:

"You are prohibited from data mining, scraping, crawling, or using any process
or processes that send automated queries to the YELLOWPAGES.COM Web site. You
may not use the YELLOWPAGES.COM Web sites to compile a collection of listings,
including a competing listing product or service."

------
nickb
Depends on what you intend to use the scraped data for, and how. I'd recommend
you also check the TOS of the site you intend to scrape. Many sites explicitly
say how their data can be used and how you should scrape it. Some sites also
ban IPs that are scraping them (they assume it's a DoS-type attack).

~~~
xlnt
You should sleep between requests to avoid overloading their servers.

I don't know how much delay is best. Does anyone else know?

~~~
dhotson
I think the rule of thumb is roughly 1 request every 2 seconds.
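
A minimal sketch of that rule of thumb in Python. The 2-second default just reflects the figure above, not any standard, and the injectable `fetch` hook is an illustrative convenience (it lets the pacing logic be tested without hitting the network):

```python
import time
import urllib.request

# Fetch URLs one at a time with a fixed pause between requests, so the
# target server is never hit with a burst of traffic.
def polite_fetch(urls, delay=2.0, fetch=None):
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u).read()
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause between requests, not before the first
        pages.append(fetch(url))
    return pages
```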

------
ntoshev
If the websites you scrape don't want you to do so, they will update their
robots.txt and ban you. If you don't respect that, they will ban your IP
and/or take legal action.

So just try not to piss them off.

~~~
cmars232
If the websites don't want you to scrape, use proxies and botnets!

------
puppetsock
You just need to respect copyright. So if they explicitly allow you to scrape
the data, or if you're operating within the "Fair Use" doctrine, then you're
fine.

(IANAL)

------
dhotson
Apart from the copyright issues, if you're going to scrape or crawl someone's
site.. it's best to be polite about it.

I wrote a web crawler a few years back and at the time I didn't really
understand the implications of having a crawler grab 20+ pages concurrently
from a site.

I learned pretty fast when I found a few sites had banned my crawler.. Oops,
sorry guys!
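
One way to avoid that mistake is to cap the number of in-flight requests to any one host. A rough Python sketch using a semaphore (the class name, limit, and `fetch_fn` hook are all hypothetical, just to show the shape of the idea):

```python
import threading

# Limit how many fetches may run against one host at the same time.
# Threads beyond the cap block until an earlier fetch finishes.
class HostLimiter:
    def __init__(self, max_concurrent=2):
        self._sem = threading.Semaphore(max_concurrent)

    def fetch(self, url, fetch_fn):
        with self._sem:  # blocks while max_concurrent fetches are in flight
            return fetch_fn(url)
```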

------
toddcw
This might be informative: [http://blog.screen-scraper.com/2008/04/21/screening-scraping-ethics/](http://blog.screen-scraper.com/2008/04/21/screening-scraping-ethics/)

------
schtog
But if we take Google as an example: they give the searched sites something
in return; they make them easy to find. But how does the Google spider know
that it can index a site? It checks for robots.txt, obviously, but does it
check for copyright?

<http://en.wikipedia.org/wiki/Fair_Use> Fair use seems to exist only in the
US; I potentially want to reach the whole world, even if the USA is a good
place to start.

