
Ask HN: What's your philosophy on page scraping? - TimH
Hey guys, I'm curious to hear what you think about page scraping. Are you happy to page scrape other sites? Are you happy to have yours scraped? What's your view on the ethics?

I'm asking because I'm aware my site has started to be scraped (by more than just search engines), and I'm trying to figure out how I feel about it. In this case I'm happy because I know the audience/user base is smallish. If an app that did it went mainstream, I wouldn't be happy.

Of course the pragmatic answer is simple - 'build an API' - and with a few more weekends I might.

But right now I'm interested to hear people's opinions on it.
======
rmanocha
I think that if you're generally respectful of the target websites, scraping
them is OK. For example, I scrape various government websites for my website.
I use a random delay between requests and am generally very careful about not
requesting the same page multiple times (this is hard 'cause a lot of the
pagination on these pages happens via JS calls).

I am ok if someone decides to scrape my websites in a similar fashion -
although if I do see that starting to happen, I'd rather just go ahead and
build an API.
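
For anyone wondering what those two habits (a random delay between requests, and never requesting the same page twice) might look like in code, here is a minimal Python sketch. It is only an illustration, not rmanocha's actual scraper: the `PoliteFetcher` class, the `http_get` helper, and the 1-5 second delay range are all made-up assumptions, and it does nothing about the JS pagination problem mentioned above.

```python
import random
import time
import urllib.request


class PoliteFetcher:
    """Fetch each URL at most once, pausing a random interval between requests.

    Hypothetical helper for illustration; the delay bounds are arbitrary.
    """

    def __init__(self, fetch, min_delay=1.0, max_delay=5.0, sleep=time.sleep):
        self._fetch = fetch          # callable: url -> page body
        self._min_delay = min_delay  # seconds
        self._max_delay = max_delay  # seconds
        self._sleep = sleep          # injectable so it can be stubbed out
        self._cache = {}             # url -> body, so no page is requested twice
        self._has_requested = False

    def get(self, url):
        # Serve repeat requests from the cache instead of hitting the site again.
        if url in self._cache:
            return self._cache[url]
        # Wait a random interval before every real request except the first.
        if self._has_requested:
            self._sleep(random.uniform(self._min_delay, self._max_delay))
        body = self._fetch(url)
        self._has_requested = True
        self._cache[url] = body
        return body


def http_get(url):
    """Plain standard-library fetch; swap in whatever HTTP client you prefer."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be something like `fetcher = PoliteFetcher(http_get)` followed by `fetcher.get(url)` for each page. The cache here is in memory only, so a long-running scraper would presumably want to persist it between runs.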

------
DanielBMarkham
On the consumer side, I'm happy with the following rules.

I'm happy writing a program to let individual users scrape from their
computers. After all, they have a right to visit the site and retrieve their
data in whatever format suits them.

I'm not so keen on setting up a server to scrape data, or having a server
scrape a huge pile of data for a list of users. After all, whoever is running
the service is keeping stuff for all of the users. My taking it all is just
stealing.

On the provider side, I think my feelings are about the same. I think you have
to be careful about how you leverage scraping: let scrapers come in and get
enough stuff that it makes people want to visit, but not so much that they
have everything. If executed effectively, you can use scraping to great
benefit.

------
_delirium
I guess it depends on what they're doing with it. I'm not particularly against
scraping per se, but I would look askance at some of the more sleazy uses,
like just republishing (slightly modified versions of) blog posts on some
AdSense-laden blog as if it were their own post. The key issues to me are: 1)
transformativity, i.e. it produces something genuinely new and different from
the content it scraped; and 2) proper credit to the source of the original
content.

------
nostrademons
I'm happy to page scrape other sites and not happy to have mine scraped. ;-)

More seriously, if there're bots that you don't want scraping, just robots.txt
them away. If they ignore that, _then_ they're being rather rude and you can
figure out some way to auto-block them.
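
To make the robots.txt point concrete, here is a rough sketch of the check a well-behaved scraper would run before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents, the bot names (`BadBot`, `GoodBot`), and the example.com URLs are all invented for illustration; a real bot would load the site's actual file (for instance with `set_url()` and `read()`) rather than parse a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish: one named bot is
# banned outright, everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, user_agent, url):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for agent in ("BadBot", "GoodBot"):
        for url in ("https://example.com/index.html",
                    "https://example.com/private/data"):
            print(agent, url, allowed(parser, agent, url))
```

Of course this only helps with bots that bother to run a check like this; the ones that ignore it are the ones you end up auto-blocking.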

------
benologist
Depends what it's being scraped for: MFA (made-for-AdSense) spam blogs
stealing content, or some valid use that could further your own interests.

------
evancureton
hey can put you on my page

