
On the rare occasion that I scrape a site for a project, I'm usually less worried about the legal issues than about the site recognizing my script hitting it repeatedly and blocking my IP. Does anyone know how prevalent this practice is? I've heard that Google blocks scripts that scrape their search results, but I haven't experienced it myself.

(I know it's easy to spoof the user agent, but that seems underhanded/nefarious.)
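
For what it's worth, if you do set the user agent yourself, the less underhanded option is to identify your bot honestly and include contact info, so the site owner can reach you instead of just banning you. A minimal Python sketch using requests (the bot name, URL, and contact address are placeholders):

    import requests

    # Identify the bot honestly instead of spoofing a browser; the
    # name, URL, and contact address below are placeholders.
    headers = {
        "User-Agent": "MyBot/1.0 (+http://example.com/bot; contact@example.com)"
    }
    resp = requests.get("http://example.com/some-page", headers=headers, timeout=30)
    print(resp.status_code)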



I recently had to scrape the Yellow Pages for a list of doctors around Chicago. My first script caused them to block my IP address. The warning page was cute: it said I had violated one or more of Asimov's Three Laws of Robotics.

I wrote a second script that included delays between page loads (just 10-15 seconds, with some longer pauses every 10 pages or so), and it ran for a whole day without being blocked.
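
Roughly what that second script did, as a Python sketch (the URL pattern and page count are made up):

    import random
    import time

    import requests

    for page in range(1, 101):
        url = "http://example.com/doctors?page=%d" % page  # hypothetical URL
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        # ... parse resp.text here ...

        time.sleep(random.uniform(10, 15))       # base delay between pages
        if page % 10 == 0:
            time.sleep(random.uniform(30, 60))   # longer pause every 10 pages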


I've done a lot of scraping on a variety of projects, and basically as long as you keep your scrapers from going out of control, you won't get banned.
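
One concrete way to keep a scraper under control is to honor robots.txt, including any Crawl-delay it declares. A sketch using Python's standard robotparser (the site and user agent string are placeholders):

    import time
    from urllib import robotparser

    import requests

    UA = "MyBot/1.0"  # placeholder user agent
    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    def polite_get(url):
        if not rp.can_fetch(UA, url):
            raise RuntimeError("robots.txt disallows %s" % url)
        time.sleep(rp.crawl_delay(UA) or 5)  # fall back to 5s if none declared
        return requests.get(url, headers={"User-Agent": UA}, timeout=30)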


Each time I've had to use scraping (like for http://hackerbooks.com or http://learnivore.com), I chose to ask the owner first. I think it's better to get in touch and just ask.


Just an (evil) idea: create a free iPhone app/game, scrape the site from users' iPhones, and send the results to your server.


Less evil: read about how DuckDuckGo uses the Yahoo BOSS and Bing APIs:

http://www.gabrielweinberg.com/blog/2010/08/thoughts-on-yaho...

---

and check http://www.reddit.com/r/datasets and Infochimps; maybe the dataset is already out there.

---

and... never spider from home; you'll always screw something up: the Referer HTTP header, the user agent, the random wait time between requests. (See the sketch after the links below.)

http://streamhacker.com/2010/10/04/perils-web-crawling/

http://news.ycombinator.com/item?id=2463058 (see the first comment)
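
Getting those details right looks something like this; a sketch, with illustrative values only:

    import random
    import time

    import requests

    session = requests.Session()
    session.headers.update({
        # Pick a realistic user agent and keep it consistent.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Referer": "http://example.com/",
    })

    def fetch(url):
        resp = session.get(url, timeout=30)
        session.headers["Referer"] = url  # next request looks like a click-through
        time.sleep(random.uniform(2, 8))  # randomized wait between requests
        return resp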


I thought of something similar, though less surreptitious. I run a website that relies on a fairly enormous quantity of scraped data, and to keep its dataset current I need a huge, continuous scraping cluster. I thought about letting people run a little app in their system tray so that when their computer's network interface is idle, it becomes part of my scraping cluster. I would be completely upfront with them about what it does; it would serve as a non-monetary way for them to contribute meaningfully to the operation of my site.
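
The client side of that could be as simple as a loop that asks a coordinator for work and posts results back. A rough Python sketch; the coordinator URL and its /job and /result endpoints are hypothetical:

    import time

    import requests

    COORDINATOR = "http://example.com/api"  # hypothetical coordinator

    def run_worker():
        # (Detecting an idle network interface is platform-specific
        # and omitted here.)
        while True:
            job = requests.get(COORDINATOR + "/job", timeout=30).json()
            if not job.get("url"):
                time.sleep(60)  # no work queued; check back later
                continue
            page = requests.get(job["url"], timeout=30)
            requests.post(COORDINATOR + "/result",
                          json={"id": job["id"],
                                "status": page.status_code,
                                "body": page.text},
                          timeout=30)

    if __name__ == "__main__":
        run_worker()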



