
Scraping content issues - andre

======
snorkel
Anyone can legally complain if you copy their content. So they send a C&D
letter and you remove whatever offended them, no harm done.

If you want to prevent bots from scraping your content, take advantage of
the fact that most bots don't run JavaScript: in your server code, render the
content of each page with some simple encoding that makes the text unreadable,
then add a piece of JavaScript to window.onload that decodes and displays the
content.
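A minimal sketch of the encode/decode trick described above, using ROT13 as the "simple encoding" (the encoding choice and the element id are assumptions, not anything snorkel specified):

```javascript
// Rough sketch of the trick above. ROT13 stands in for the "simple
// encoding"; anything a scraper won't bother reversing will do.

// "Server side": encode the text before rendering it into the page.
function encode(text) {
  return text.replace(/[a-z]/gi, function (c) {
    var base = c <= "Z" ? 65 : 97;
    return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
  });
}

// ROT13 is its own inverse, so the client-side decoder is the same function.
var decode = encode;

// In the page, the onload hook would then be something like
// (assuming the encoded text sits in a hypothetical element id "content"):
//
// window.onload = function () {
//   var el = document.getElementById("content");
//   el.textContent = decode(el.textContent);
// };
```

A bot that doesn't execute JavaScript only ever sees the ROT13 gibberish; a browser decodes it on load.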

~~~
tocomment
I've always wondered: are there any bots that can run JavaScript? Is there a
JS engine you can stick in your bot so it reads everything on a page just like
a user would see it?

~~~
lupin_sansei
Yes. You can either automate IE <http://support.microsoft.com/kb/167658>
(or Firefox <http://www.iol.ie/~locka/mozilla/mozilla.htm>) from your code
and make a bot that way.

Or use a "bot" like this which uses IE under the hood:
<http://search.cpan.org/dist/Win32-IE-Mechanize/lib/Win32/IE/Mechanize.pm> or
this one which uses Mozilla under the hood:
<http://search.cpan.org/~slanning/Mozilla-Mechanize-0.05/lib/Mozilla/Mechanize.pm>

------
andre
If a company has no terms of use or any other kind of policy on their site,
what are the issues in scraping their content? Is there any way to prevent it?

~~~
ks
You always have a copyright, even if you don't say so on the page.

They could of course add a robots.txt and stop nicely behaved scrapers that
way, but stopping all scraping is impossible. There's always a way. The best
you can hope for is to make it so hard that they don't bother creating a
custom-made scraper.
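For what it's worth, a minimal robots.txt along the lines ks describes might look like this (the path is a made-up example, and compliance is entirely voluntary; only well-behaved crawlers honor it):

```
# Placed at the site root as /robots.txt.
# Asks all crawlers to stay out of a hypothetical /articles/ section.
User-agent: *
Disallow: /articles/
```

It's a polite request, not an access control: a custom scraper can simply ignore the file.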

