
On the rare occasion that I scrape a site for a project, I'm usually less worried about the legal issues than about the site recognizing my script hitting it repeatedly and blocking my IP. Does anyone know how prevalent this practice is? I've heard that Google blocks scripts that scrape their search results, but I haven't experienced it myself.

(I know it's easy to spoof the user agent, but that seems underhanded/nefarious.)
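
For what it's worth, if you do set the user agent yourself, the less underhanded option is to identify your bot honestly and include contact info, so the site owner can reach you instead of just banning you. A minimal Python sketch using requests (the bot name, URL, and contact address are placeholders):

    import requests

    # Identify the bot honestly instead of spoofing a browser; the
    # name, URL, and contact address below are placeholders.
    headers = {
        "User-Agent": "MyBot/1.0 (+http://example.com/bot; contact@example.com)"
    }
    resp = requests.get("http://example.com/some-page", headers=headers, timeout=30)
    print(resp.status_code)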



I recently had to scrape the Yellow Pages for a list of doctors around Chicago. My first script caused them to block my IP address. The warning page was cute: it said I had violated one or more of Asimov's Three Laws of Robotics.

I wrote a second script that included delays between page loads (just 10-15 seconds, with some longer pauses every 10 pages or so), and it ran for a whole day without being blocked.
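
Roughly what that second script did, as a Python sketch (the URL pattern and page count are made up):

    import random
    import time

    import requests

    for page in range(1, 101):
        url = "http://example.com/doctors?page=%d" % page  # hypothetical URL
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        # ... parse resp.text here ...

        time.sleep(random.uniform(10, 15))       # base delay between pages
        if page % 10 == 0:
            time.sleep(random.uniform(30, 60))   # longer pause every 10 pages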


I've done a lot of scraping on a variety of projects, and basically as long as you keep your scrapers from going out of control, you won't get banned.
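
One concrete way to keep a scraper under control is to honor robots.txt, including any Crawl-delay it declares. A sketch using Python's standard robotparser (the site and user agent string are placeholders):

    import time
    from urllib import robotparser

    import requests

    UA = "MyBot/1.0"  # placeholder user agent
    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    def polite_get(url):
        if not rp.can_fetch(UA, url):
            raise RuntimeError("robots.txt disallows %s" % url)
        time.sleep(rp.crawl_delay(UA) or 5)  # fall back to 5s if none declared
        return requests.get(url, headers={"User-Agent": UA}, timeout=30)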


Each time I've had to use scraping (like for http://hackerbooks.com or http://learnivore.com), I chose to ask the owner first. I think it's better to get in touch and just ask.


Just an (evil) idea: create a free iPhone app/game, scrape the site from users' iPhones, and send the results to your server.


Less evil: read about how DuckDuckGo uses the Yahoo BOSS and Bing APIs:

http://www.gabrielweinberg.com/blog/2010/08/thoughts-on-yaho...

---

and check http://www.reddit.com/r/datasets and Infochimps; maybe the dataset is already out there.

---

and... never spider from home; you'll always screw something up: the Referer HTTP header, the user agent, the random wait time between requests. (See the sketch after the links below.)

http://streamhacker.com/2010/10/04/perils-web-crawling/

http://news.ycombinator.com/item?id=2463058 (see the first comment)
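
Getting those details right looks something like this; a sketch, with illustrative values only:

    import random
    import time

    import requests

    session = requests.Session()
    session.headers.update({
        # Pick a realistic user agent and keep it consistent.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Referer": "http://example.com/",
    })

    def fetch(url):
        resp = session.get(url, timeout=30)
        session.headers["Referer"] = url  # next request looks like a click-through
        time.sleep(random.uniform(2, 8))  # randomized wait between requests
        return resp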


I thought of something similar, though less surreptitious. I run a website that relies on a fairly enormous quantity of scraped data, and to keep its dataset current I need a huge, continuous scraping cluster. I thought about letting people run a little app in their system tray so that when their computer's network interface is idle, it becomes part of my scraping cluster. I would be completely upfront with them about what it does; it would serve as a non-monetary way for them to contribute meaningfully to the operation of my site.
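
The client side of that could be as simple as a loop that asks a coordinator for work and posts results back. A rough Python sketch; the coordinator URL and its /job and /result endpoints are hypothetical:

    import time

    import requests

    COORDINATOR = "http://example.com/api"  # hypothetical coordinator

    def run_worker():
        # (Detecting an idle network interface is platform-specific
        # and omitted here.)
        while True:
            job = requests.get(COORDINATOR + "/job", timeout=30).json()
            if not job.get("url"):
                time.sleep(60)  # no work queued; check back later
                continue
            page = requests.get(job["url"], timeout=30)
            requests.post(COORDINATOR + "/result",
                          json={"id": job["id"],
                                "status": page.status_code,
                                "body": page.text},
                          timeout=30)

    if __name__ == "__main__":
        run_worker()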



