What are the ethics and laws regarding scraping internet sites?
For one, we have robots.txt:
http://en.wikipedia.org/wiki/Robot.txt
Is that the only thing that prevents (or prohibits) a robot/spider from scraping a site?
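As a practical aside, you can at least check a site's robots.txt rules programmatically before crawling. Here's a minimal sketch using Python's standard-library `urllib.robotparser`; the site, paths, and bot name are made-up placeholders, and the rules are parsed inline so nothing is fetched over the network:

```python
from urllib import robotparser

# Hypothetical robots.txt body for an example site, parsed directly
# so this sketch runs without network access. In practice you would
# call rp.set_url("https://example.com/robots.txt") and rp.read().
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# can_fetch(user_agent, url) answers "may this bot crawl this URL?"
print(rp.can_fetch("MyScraperBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyScraperBot", "https://example.com/products"))      # True
```

Note that robots.txt is purely advisory: the parser tells you what the site asks, but nothing technically stops a crawler from ignoring it.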
Can there be copyrighted material that is not allowed to be scraped?
I ask because I want to search/scrape a few hundred pages with similar content and present those results, carrying the user all the way through to buying the product.
That is, not just a results page like Google's where you only get the information. I want to present the options (maybe in ranked order) so that one click is all it takes to choose a product to buy. Or one click to search and one to choose, which takes you to the product site, where you obviously have to do some more confirmation, but you get the point...
Take Amazon reviews, for example. They are there to be scraped, but some of them are not available via the web service API due to copyright restrictions. That's a clear sign to me that they wouldn't want this information scraped. I'm not sure how a judge would see it, though.
Another example is licensed material. Take TV listings: they are published all over the place, but they are still under copyright, so you can't just scrape them and use them on your site.
IANAL, but my basic rule is that I scrape only if I think I'm offering the scraped site something in return (usually traffic). So if it's mutually beneficial, I feel OK. That doesn't mean it's legal, though :(