
Crawling - The Most Underrated Hack - ttunguz
http://tomtunguz.com/crawling-the-most-under-rated-hack
======
eykanal
Very useful, thanks for sharing.

For those who use this, keep in mind that if you're acting as a crawler, you
may be going against the site's TOS and/or robots.txt file. For example,
LinkedIn's User Agreement [1] specifically states:

    3. YOUR RIGHTS.
    
    On the condition that you comply with all your obligations under this
    Agreement, including, but not limited to, the Do’s and Don’ts listed in
    Section 10, we grant you a limited... right to access the Services,
    through a generally available web browser, mobile device or application
    (but not through scraping, spidering, crawling or other technology or
    software used to access data without the express written consent of
    LinkedIn or its Users)...

And their robots.txt file [2] states at the bottom:

    
    
      User-agent: *
      Disallow: /
     
      # Notice: If you would like to crawl LinkedIn,
      # please email whitelistcrawl@linkedin.com to apply 
      # for white listing.
    
      Sitemap:   http://partner.linkedin.com/sitemaps/smindex.xml.gz
    

[1]: <http://www.linkedin.com/static?key=user_agreement>

[2]: <http://www.linkedin.com/robots.txt>
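If you do want to honor a policy like the one above, Python's standard
library can check it for you. A minimal sketch, parsing the quoted LinkedIn
rules from a string rather than fetching them live, so it runs offline (the
user-agent string here is a made-up placeholder):

```python
# Check a robots.txt policy before crawling, using only the standard library.
from urllib.robotparser import RobotFileParser

# The rules quoted above: a blanket disallow for all user agents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "Disallow: /" forbids every path for any user agent not whitelisted,
# so this should be False for our hypothetical crawler.
print(parser.can_fetch("MyCrawler/1.0", "http://www.linkedin.com/in/someone"))  # → False
```

In a real crawler you'd call `parser.set_url(...)` and `parser.read()` to
fetch the live robots.txt instead of parsing a hard-coded string.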

------
gilrain
Mechanize is excellent. If you'd like to do some scraping but prefer to work
in Python, check out Scrapy[1], discussed[2] a while back and only getting
better. I've been very pleased with it for some personal projects that rely on
scraping.

[1]: <http://scrapy.org/>

[2]: <http://news.ycombinator.com/item?id=411733>
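For anyone curious what these libraries are doing under the hood, here's a
rough sketch of the parse step they wrap, using only the Python standard
library. The HTML string is an inline stand-in for a fetched page, so the
example runs offline:

```python
# Extract links from a page the way a crawler's parse step would,
# using only the standard library's html.parser.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Stand-in for the body of a fetched page.
PAGE = '<html><body><a href="/about">About</a> <a href="/jobs">Jobs</a></body></html>'

extractor = LinkExtractor()
extractor.feed(PAGE)
print(extractor.links)  # → ['/about', '/jobs']
```

Mechanize and Scrapy layer the useful parts on top of this loop: fetching,
link following, form handling, throttling, and (in Scrapy's case) a
configurable robots.txt check.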

