

Scraping, cleaning, and selling big data - uberstart
http://radar.oreilly.com/2011/05/data-scraping-infochimps.html

======
wattsbaat
On the rare occasion that I scrape a site for a project, I'm usually less
worried about the legal issues than about the site recognizing my script
hitting it repeatedly and blocking my IP. Does anyone know how
prevalent this practice is? I've heard that Google blocks scripts scraping
their search results, but I haven't experienced it myself.

(I know it's easy to spoof the user agent, but that seems
underhanded/nefarious.)

~~~
tszming
Just an (evil) idea: create a free iPhone app or game, scrape the site from
the user's iPhone, and send the results to your server.

~~~
gtani
Less evil: read about how DuckDuckGo uses the Yahoo BOSS and Bing APIs

<http://www.gabrielweinberg.com/blog/2010/08/thoughts-on-yahoo-boss-monetization-ii.html>

\-----------------

and check <http://www.reddit.com/r/datasets> and Infochimps; maybe the dataset
is already out there

\--------------------

and... never spider from home; you'll always screw something up: the referrer
HTTP header, the user agent, the random wait time between requests, etc.

<http://streamhacker.com/2010/10/04/perils-web-crawling/>

<http://news.ycombinator.com/item?id=2463058> see first comment
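
For what it's worth, here is a rough sketch of the three things listed above (referrer header, user agent, random wait between requests) using only the Python standard library. Everything named here is made up for illustration: the `USER_AGENT` string, the 2 to 6 second delay range, and the helper names `build_request`, `next_delay`, and `crawl`. Sending the previous URL as the `Referer` is just one possible choice, not something the linked posts prescribe.

```python
import random
import time
import urllib.request

# Hypothetical values for illustration; tune them for the site you are hitting.
USER_AGENT = "example-research-bot/0.1 (contact: you@example.com)"
MIN_DELAY_S = 2.0
MAX_DELAY_S = 6.0


def build_request(url, referer=None):
    """Return a Request with an explicit User-Agent and an optional Referer."""
    headers = {"User-Agent": USER_AGENT}
    if referer:
        # The HTTP header really is spelled "Referer".
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def next_delay(rng=random):
    """Pick a random wait in seconds between MIN_DELAY_S and MAX_DELAY_S."""
    return rng.uniform(MIN_DELAY_S, MAX_DELAY_S)


def crawl(urls, fetch=None, sleep=time.sleep, rng=random):
    """Fetch urls in order, sleeping a random interval between requests.

    `fetch` and `sleep` are injectable so the loop can be exercised offline.
    """
    if fetch is None:
        def fetch(req):
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()

    pages = []
    previous = None
    for i, url in enumerate(urls):
        if i > 0:
            sleep(next_delay(rng))  # no wait before the very first request
        pages.append(fetch(build_request(url, referer=previous)))
        previous = url
    return pages
```

Calling `crawl(["http://example.com/a", "http://example.com/b"])` would fetch both pages with one random pause in between. Identifying the bot honestly in the user agent, rather than spoofing a browser, also sidesteps the "underhanded" worry raised at the top of the thread.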

------
hessenwolf
Google's policies are a bit all over the place. Google Translate will stop
you, but some other services don't.

