

Ask HN: Screen Scraping Rules? - DanielBMarkham

For a couple of years now, I've been kicking around an idea to write a program that gathers information from other sites on the web.

The problem is that I am unsure of the legality of such a program.

If I understand correctly, you can't write a web/server application that goes out and gets stuff on the internet and re-packages it. But you can write a "personal assistant" client-side program that basically visits web sites for the user and presents the information in a different format.

The difference, I think, is between a service that re-aggregates other people's content and sells it, and a service that simply reformats information available through the web browser into something else -- much like a blind person would use a web-reading device.

Is that correct? Anybody have any experience in this area?
======
olefoo
If you are going to scrape other people's web sites there are a few basic
courtesies that you are expected to observe.

1. Respect the instructions in robots.txt: if the site doesn't want you
crawling anything under /foo/*, don't request /foo/bar or /foo/baz. A
corollary is to identify your bot as a robot in the User-Agent string you
send.

2. Observe response times and be gentle with the sites you are scraping.
Remember that you are not a paying customer and should not be requesting
pages faster than the site can produce them. If response times suddenly
rise, back off the rate at which you are requesting pages. Similarly, if
you start getting 500s or other errors, stop and wait at least 10 minutes
before resuming.
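The two courtesies above can be sketched in a few lines. This is a toy illustration, not anyone's production code: the bot name, the robots.txt rules, and the delay numbers are invented, and only Python's standard library is assumed.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot/1.0"  # hypothetical name; identify yourself as a robot


def allowed(robots_txt_lines, path):
    # Courtesy 1: check the path against the site's robots.txt rules
    # before requesting it.
    rp = RobotFileParser()
    rp.parse(robots_txt_lines)
    return rp.can_fetch(USER_AGENT, path)


def next_delay(current_delay, status, elapsed, baseline):
    # Courtesy 2: back off when the site errors or slows down,
    # otherwise drift back toward a polite baseline delay (seconds).
    if status >= 500:
        return 600.0                # stop and wait at least 10 minutes
    if elapsed > 2 * baseline:      # response times suddenly rose
        return current_delay * 2
    return max(baseline, current_delay / 2)
```

A crawler loop would sleep for `next_delay(...)` seconds between requests and skip any URL for which `allowed(...)` is false.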

FWIW, IANAL, etc. I've been told that it's perfectly fine to read information
that is publicly posted to the web; where you get in trouble is when you
start republishing material without attribution and without transforming it
in some manner.

I have worked with sites that gathered a lot of pricing information in
specific product categories in order to generate pricing recommendations for
merchants, and there was no problem with using that information that I know
of. But your specific situation may be different enough that my experience is
not at all applicable.

Your mileage will vary, ask your own lawyer, Hacker News is not legal advice.

------
matthodan
In March 2005, Agence France Presse (AFP) sued Google for $17.5 million,
alleging that Google News infringed on its copyright because "Google includes
AFP’s photos, stories and news headlines on Google News without permission
from Agence France Presse."[1][2] It was also alleged that Google ignored a
cease and desist order, though Google countered that it had opt-out
procedures which AFP could have followed but did not. Google now hosts
Agence France-Presse news, as well as content from the Associated Press,
the Press Association and the Canadian Press; this arrangement started in
August 2007.[2] In 2007 Google announced it was paying for Associated Press
content displayed in Google News; however, the articles are not permanently
archived.[3][4]

Wikipedia-- <http://en.wikipedia.org/wiki/Google_News>

------
nostrademons
AFAIK (and IANAL), information on the web is public but copyrighted. That
means that you can read all the data you want (though it's polite to obey
robots.txt), but if you redistribute it you have to obey the normal copyright
provisions, i.e. you can only take what's considered "fair use". Fair use
depends on a lot of factors: how much of the original work you are
redistributing, how much of the re-aggregated work it makes up, whether the
use is commercial or non-commercial, and whether you are harming the original
content owner's business model. Talk to a lawyer for specifics; there are a
lot of grey areas, and if you're close to one you really need a professional
to tell you where the line is.

You can't copyright facts, so the mere act of taking data off the web is not
illegal. For example, if you came up with an algorithm that could extract
stock prices off webpages and then used that to put together a stock chart,
you're well within your rights to publish that. But if you extracted the
_text_ from those pages and then put up a news article with each inflection
point, you _may_ have a legal problem, depending on how much of the text you
used and a bunch of other factors. Again, talk to a lawyer for specifics.
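The facts-versus-expression distinction above can be made concrete with a toy sketch. The markup and the ticker are invented, and a real scraper would use a proper HTML parser rather than a regex; this only shows the difference between pulling out a number and keeping the page's wording.

```python
import re

# Hypothetical page fragment; in practice you'd fetch and parse real HTML.
html = '<span class="quote">GOOG last traded at $142.57</span>'

# Extracting the number pulls out an uncopyrightable fact...
match = re.search(r"\$(\d+(?:\.\d+)?)", html)
price = float(match.group(1))

# ...whereas keeping the sentence itself reuses the page's expression,
# which is where the copyright questions begin.
sentence = re.sub(r"<[^>]+>", "", html)
```

Republishing `price` in a chart is the stock-quote case above; republishing `sentence` verbatim is the one that needs a lawyer.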

------
jacquesm
It depends. If it is for research purposes then I'd think you're just another
bot, so that's fine (it's up to you to make sure you don't get kicked off, so
be nice about how heavily you load up someone else's server, and keep in mind
that you're possibly not the only bot active on a site).

If it is to repackage content then I think that's (a) not nice and (b) breach
of copyright.

Unless, of course the original data is in the public domain or effectively in
the public domain. Keep in mind that simply placing content on the web is not
the same as placing it in the public domain.

Some data simply cannot be copyrighted, for instance names of people or
places, but a collection of such data _can_ be copyrighted.

If you aggregate in order to do some intricate datamining which results in a
product that has no direct link to the original you have created a derivative
work, and possibly even an original work, depending on how strong/tenuous the
link is (you might be able to claim copyright on the aggregate).

For more information see your friendly local copyright lawyer, they're
expensive as hell though.

~~~
DanielBMarkham
I got to thinking about this the other day when I blogged, then posted the
article on HN. One of the posters made a great comment, and I thought our
back-and-forth helped explain the article (which was hastily written). So I
copied his comment and my reply over to my blog. (This was noted in the
comment thread, and I offered to remove it if it bothered the poster.)

What was interesting was the commenter's reply when he found out his comment
had been copied: "I thought that was just a feature of his blog."

Got me to thinking: why couldn't it just be a feature of my blog?

I understand some folks might have a problem with that -- I haven't sorted
through how I feel about it ethically. But from a value proposition to the
user it is an interesting idea: the web (to me) is about conversations with
people of different views. Wouldn't it be great to be able to harvest feedback
on your articles, wherever the feedback happens, and share it with readers
right alongside the article itself?

It's just an idea I'm kicking around. The legal part of it looks tricky,
though. And then there's the fact that it turns some people off, which isn't
good either.

~~~
jacquesm
I think that if you post a blog and someone comments on it 'elsewhere', their
comments are fair game for citing, with attribution (which is how I would do
it).

------
gtani
This comes up regularly, but since there aren't any defining court cases for
"fair use", you have to set your own bounds:

<http://searchyc.com/copyright>

<http://news.ycombinator.com/item?id=411555>

<http://news.ycombinator.com/item?id=806548>

this blog is also a good resource:

[http://www.coultertm.com/blog/2008/02/mashup-fair-use-or-
inf...](http://www.coultertm.com/blog/2008/02/mashup-fair-use-or-infringing-
derivative-work.html)

[http://www.coultertm.com/blog/2009/05/remix-culture-fair-
use...](http://www.coultertm.com/blog/2009/05/remix-culture-fair-use-is-your-
friend.html)

<http://www.coultertm.com/SocialNetworking.pptx>

~~~
gtani
Also, the NY Times lawsuit was settled and didn't produce any defining
decision:

[http://www.boston.com/business/articles/2009/01/23/lawsuit_o...](http://www.boston.com/business/articles/2009/01/23/lawsuit_over_website_links_in_spotlight/)

[http://www.mediapost.com/publications/index.cfm?fa=Articles....](http://www.mediapost.com/publications/index.cfm?fa=Articles.showArticle&art_aid=98929)

[http://investors.gatehousemedia.com/releases.cfm?ReleasesTyp...](http://investors.gatehousemedia.com/releases.cfm?ReleasesType=&Year=2009)

