On the rare occasion that I scrape a site for a project, I'm usually less worried about the legal issues than about the site noticing my script hitting it repeatedly and blocking my IP. Does anyone know how prevalent this practice is? I've heard that Google blocks scripts scraping their search results, but I haven't experienced it myself.
(I know it's easy to spoof the user agent, but that seems underhanded/nefarious.)
I recently had to scrape the Yellow Pages for a list of doctors around Chicago. My first script got my IP address blocked. The warning page was cute: it said I had violated one or more of Asimov's Three Laws of Robotics.
I wrote a second script that added delays between page loads (just 10-15 seconds, with longer pauses every 10 pages or so), and it ran for a whole day without being blocked.
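A minimal sketch of that kind of pacing in Python, assuming the `requests` library and placeholder URLs (the real selectors and parsing are omitted):

    import random
    import time

    import requests

    # Hypothetical listing pages; substitute the actual URLs being scraped.
    urls = [f"https://example.com/listings?page={i}" for i in range(1, 101)]

    for i, url in enumerate(urls, start=1):
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        # ... parse resp.text here ...

        # Wait 10-15 seconds between requests, with a longer break
        # every 10 pages, roughly matching the pacing described above.
        time.sleep(random.uniform(10, 15))
        if i % 10 == 0:
            time.sleep(random.uniform(60, 120))

Randomizing the delay (rather than sleeping a fixed interval) makes the traffic look a little less mechanical to simple rate-limit heuristics.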
Each time I've had to scrape (as for http://hackerbooks.com or http://learnivore.com), I chose to ask the owner first. I think it's better to get in touch and just ask.
I thought of something similar, though less surreptitious. I run a website that relies on a fairly enormous quantity of scraped data. To keep its dataset current I need a huge, continuous scraping cluster. I thought about letting people run a little thing in their system tray so that when the network interface on their computer is idle, it becomes part of my scraping cluster. I would be completely upfront with them about what it is doing - it would serve as a non-monetary way for them to contribute meaningfully to the operation of my site.
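A rough sketch of what such a tray client's worker loop might look like, assuming a hypothetical coordinator API (`/next-job`, `/results`) and using `psutil` to decide when the local link is idle; none of these names come from an actual implementation:

    import time

    import psutil
    import requests

    COORDINATOR = "https://example.org/api"   # hypothetical coordinator URL
    IDLE_BYTES_PER_SEC = 10_000               # below this, treat the link as idle

    def link_is_idle(sample_seconds=5):
        # Sample total bytes sent/received over a short window.
        before = psutil.net_io_counters()
        time.sleep(sample_seconds)
        after = psutil.net_io_counters()
        rate = (after.bytes_sent + after.bytes_recv
                - before.bytes_sent - before.bytes_recv) / sample_seconds
        return rate < IDLE_BYTES_PER_SEC

    while True:
        if link_is_idle():
            # Ask the coordinator for one URL, fetch it, and post the HTML back.
            job = requests.get(f"{COORDINATOR}/next-job", timeout=30).json()
            page = requests.get(job["url"], timeout=30)
            requests.post(f"{COORDINATOR}/results",
                          json={"url": job["url"], "html": page.text},
                          timeout=30)
        else:
            time.sleep(30)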