Ask HN: Ethics and laws regarding scraping websites?
22 points by schtog 3521 days ago | 24 comments
What are the ethics and laws regarding scraping internet sites?

For one, we have robots.txt: http://en.wikipedia.org/wiki/Robot.txt

Is that the only thing that prevents (or prohibits) a robot/spider from scraping a site?

Can there be copyrighted material that is not allowed to be scraped?

I ask because I want to search/scrape a few hundred pages with similar content and present the results all the way through to buying the product. That is, not like Google, where you just get a page that has the information. I want to present users with the options (maybe in ranked order) so that pretty much one click is needed to choose the product to buy. Or one click to search and one to choose, which takes you to the product site, where you obviously have to do some more confirmation, but you get the point...

I think this is a very grey area. If it's on the internet it's scrapable. Does that mean you can use it? You really need a lawyer.

Take for example Amazon reviews. They are there to be scraped but some of them are not available via the web service API due to copyright restrictions. That's a clear sign to me that they wouldn't want this information scraped. I'm not sure how a judge would see it though.

Another example is licensed material. Take TV listings. They are listed all over the place but they are still under copyright so you can't just scrape them and use them on your site.

IANAL but my basic rule is that I scrape if I think I am going to offer the scraped site something in return (usually traffic). So if it's mutually beneficial I feel OK. That doesn't mean it's legal though :(

Only the layout of the TV listing information is copyrighted. You can't copyright facts.

This seems reasonable. However I seem to remember that the actual listings in the UK are 'owned' by some central authority (I forget who but it's got something to do with the Radio Times) and you license the right to publish them.

While it's a fact that a show is on, it's also content owned by some authority. This is where lawyers make their money.

My apologies, the copyright laws of the UK are quite different than those of the US. I don't know whether or not facts can be copyrighted there.

So, copyright law doesn't care whether you scraped the data or acquired it some other way. That is, either you can use the information under copyright law, or you can't. If you can't use it, then getting it through some method other than scraping won't help you -- and if you can use it, then scraping won't change the legality.

So you need to make sure there isn't a copyright violation, which is going to depend on the specific information you're looking at.

There's still a potential problem, though, unrelated to copyright. After eBay v. Bidder's Edge, it can be trespass to chattels to scrape data in violation of a site's TOS. In the eBay case, the court held that it was in violation of trespass law because the eBay TOS prohibited robots... so it would have been fine if Bidder's Edge had taken the data manually.

Basically, you need to make sure the data isn't copyrighted, and you need to make sure that scraping the data isn't in violation of the site's TOS.

Search engines all scrape... Titles, meta data, and some or all of the content.

I think a good rule of thumb is to consider whether the scraping target will benefit from being scraped. Most sites are delighted to be scraped by Google. Will your scraping drive sales/visitors to the target? Or will it cost sales/visitors? Will you link back (which helps them from an SEO standpoint)?

Search engines do not all scrape.

They scrape if the owners have indicated they wish their site to be scraped by not forbidding access via the robots.txt file.
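For anyone who wants their own scraper to honor robots.txt the same way, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the example.com URLs are all made up for illustration, and `is_allowed` is a hypothetical helper name, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded
# from the root of the site you intend to scrape.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/products/1",
                "http://example.com/private/page"):
        print(url, is_allowed(EXAMPLE_ROBOTS_TXT, "example-bot", url))
```

Against a live site you would normally call `set_url()` and `read()` on the parser instead of passing the text in yourself. Note that this only tells you what the site owner asks of robots; it says nothing about copyright or the TOS.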

Um, so yeah-- they all scrape-- like I said.

But no, they don't scrape all sites.

Ethically, I see no problem in doing any scraping as long as you credit the source and obey the robots.txt file. I figure, they were the ones who made the information freely available on the web.

Legally, however, it's a whole different beast that I'm not qualified to talk about.

You should look into the DMCA safe harbor provisions for Internet Service Providers. The basic idea behind them is that, if you comply with certain requirements (establish a DMCA officer, respond to takedown notices, etc.), then your service will be immune from many sorts of copyright infringement claims.

This is not legal advice, and it's not even a very good picture of the DMCA safe harbors, but hopefully it's enough to point you in the right direction. The Electronic Frontier Foundation has good resources on this, for example:


IANAL, but...

To reuse copyrighted content, you have to consider fair use eligibility and such -- but fortunately, not all data is locked down by copyright.

One of my favorite ten USSC rulings:


""" It is a long-standing principle of United States copyright law that "information" is not copyrightable, O'Connor notes, but "collections" of information can be. Rural claimed a collection copyright in its directory. The court clarified that the intent of copyright law was not, as claimed by Rural and some lower courts, to reward the efforts of persons collecting information, but rather "to promote the Progress of Science and useful Arts" (U.S. Const. 1.8.8), that is, to encourage creative expression. Since facts are purely copied from the world around us, O'Connor concludes, "the sine qua non of copyright is originality". However, the standard for creativity is extremely low. It need not be novel, rather it only needs to possess a "spark" or "minimal degree" of creativity to be protected by copyright. ... In the late 1990s, Congress attempted to pass laws which would protect collections of data, but these measures failed. By contrast, the European Union has a sui generis (specific to that type of work) intellectual property protection for collections of data."""

It is rumoured that some encyclopedias give biographies of fictitious people to enforce collective copyright. This wouldn't concern you if you wanted information about a historical figure. However, it would be very problematic if you worked for a rival encyclopedia or wanted to make your own website of the encyclopedia.

Dictionaries have long done the same thing with regard to definitions.


I didn't know that dictionaries also do it. Nor did I know that fake entries were also known as mountweazels (after the surname of a fictitious entry) or nihilartikels. How very cromulent ( http://en.wikipedia.org/wiki/Cromulent ).

It depends on what data you scrape and how you intend to use it. I'd recommend you also check the TOS of the site you intend to scrape. Many sites explicitly say how their data can be used and how you should scrape it. Some sites also ban IPs that are scraping them (they assume it's a DoS-type attack).

you should have a sleep between requests to avoid overloading their servers.

i don't know how much delay is best. does anyone else know?

I think the rule of thumb is roughly 1 request every 2 seconds.
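A rough sketch of that delay in Python, assuming the 2-second figure above is just a rule of thumb and not something any particular site has agreed to. `fetch_politely` is a hypothetical helper; the `fetch` callable stands in for whatever you actually use to download a page, and `sleep` is injectable only so the loop can be tried out without really waiting.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   delay_seconds: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch urls one at a time, pausing delay_seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetcher so the example runs without touching the network.
    pages = fetch_politely(["page-1", "page-2"],
                           fetch=lambda url: "contents of " + url,
                           delay_seconds=0.1)
    print(pages)
```

If a site publishes its own crawl rate in robots.txt or its TOS, that number should win over the default here.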

Definitely check the TOS of the sites you plan to scrape. Some data, in particular factual data, isn't copyrightable, but that doesn't necessarily give you the rights to automatically collect it from any place you can find it. Businesses spend a lot of money and time collecting and hosting data, so they can get pretty particular about how you're allowed to use their service.

Here's an excerpt from the yellowpages.com TOS. Without it, their company names, addresses, and phone numbers would be more or less fair game for any competitor:

"You are prohibited from data mining, scraping, crawling, or using any process or processes that send automated queries to the YELLOWPAGES.COM Web site. You may not use the YELLOWPAGES.COM Web sites to compile a collection of listings, including a competing listing product or service."

If the websites you scrape don't want you to do so, they will update their robots.txt and ban you. If you don't respect that, they will ban your IP and/or take legal action.

So just try not to piss them off.

If the websites don't want you to scrape, use proxies and botnets!

You just need to respect copyright. So if they explicitly allow you to scrape the data, or if you're operating within the "Fair Use" doctrine, then you're fine.


Apart from the copyright issues, if you're going to scrape or crawl someone's site.. it's best to be polite about it.

I wrote a web crawler a few years back and at the time I didn't really understand the implications of having a crawler grab 20+ pages concurrently from a site.

I learned pretty fast when I found a few sites had banned my crawler.. Oops, sorry guys!
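In hindsight, something like the following would probably have avoided the bans. It is only a sketch in Python, not what my crawler actually did: `PerHostLimiter` and `crawl` are hypothetical names, the default of one request in flight per host is an assumption, and `fetch` stands in for your real download function. The worker pool can still be large; the limiter just makes sure the workers don't all hit the same site at once.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlsplit


class PerHostLimiter:
    """Caps how many requests may be in flight to any single host."""

    def __init__(self, max_per_host: int = 1) -> None:
        self._max_per_host = max_per_host
        self._lock = threading.Lock()
        self._semaphores: Dict[str, threading.BoundedSemaphore] = {}

    def _semaphore_for(self, host: str) -> threading.BoundedSemaphore:
        # Create each host's semaphore lazily, under a lock, so two threads
        # never end up with separate semaphores for the same host.
        with self._lock:
            if host not in self._semaphores:
                self._semaphores[host] = threading.BoundedSemaphore(self._max_per_host)
            return self._semaphores[host]

    def run(self, url: str, fetch: Callable[[str], str]) -> str:
        host = urlsplit(url).netloc
        # Block until this host has a free slot, then fetch.
        with self._semaphore_for(host):
            return fetch(url)


def crawl(urls: Iterable[str], fetch: Callable[[str], str],
          workers: int = 20, max_per_host: int = 1) -> List[str]:
    """Fetch urls with a thread pool while limiting concurrency per host."""
    limiter = PerHostLimiter(max_per_host)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input urls.
        return list(pool.map(lambda url: limiter.run(url, fetch), urls))
```

This only limits concurrency; it doesn't add any delay between requests, so you'd still want a pause on top of it for small sites.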

But take Google, for example. They give the searched sites something in return: they make them easy to find. But how does the Google spider know that it can index a site? It checks for robots.txt, obviously, but does it check for copyright?

http://en.wikipedia.org/wiki/Fair_Use Fair use seems to exist only in the US. I potentially want to reach the whole world, even if the USA is a good place to start.
