Hacker News
Best Practices for Scraping/Data Mining and interpreting?
9 points by n_coats on June 4, 2013 | 6 comments
I'd be interested in hearing the various ways people scrape content from sites, mainly in real time (e.g. social media sites).

Of equal interest would be data mining methods and ways to sort and interpret mined or scraped data from real-time sources.

Any examples, articles, and tutorials are welcome!




Sure, those instructions aren't for real-time sources, but I believe they're worth a look.

We built a domain-specific scraper for http://otomobul.com . We used python-scrapy and had to write a custom crawler spider to deal with high memory usage on big sites. We also added resuming capability, so when the crawler crashes we don't need to scrape from scratch: it travels only the pages it hasn't visited yet. (I think they resolved those issues in newer versions.) I found Scrapy very well designed (it borrows lots of ideas from Django and uses Twisted as its backend), and it's flexible enough for our requirements. For text processing, Python is also my favorite language, though I can't say whether it suits your needs 100%.
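A minimal sketch of the resume idea in plain Python, assuming a JSON state file (Scrapy itself now provides this natively via its `JOBDIR` setting): persist the set of visited URLs so a restarted crawl skips pages it has already seen. The class and field names here are illustrative, not from any real crawler.

```python
import json
import os

class ResumableCrawler:
    """Sketch of crawl resumption: visited URLs are persisted to disk,
    so a restarted crawl travels only not-yet-visited pages."""

    def __init__(self, state_path):
        self.state_path = state_path
        self.visited = set()
        # Reload state from a previous (possibly crashed) run, if any.
        if os.path.exists(state_path):
            with open(state_path) as f:
                self.visited = set(json.load(f))

    def should_visit(self, url):
        # Skip pages the previous run already processed.
        return url not in self.visited

    def mark_visited(self, url):
        # Persist after every page; cheap insurance against crashes.
        self.visited.add(url)
        with open(self.state_path, "w") as f:
            json.dump(sorted(self.visited), f)
```

In a real spider you would checkpoint less often (e.g. every N pages) to avoid the per-page write, but the skip-if-seen logic is the whole trick.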

I can't say we did real data mining in this side project, but we use processing middleware that takes the raw data from the spider and dumps it into Apache Solr for indexing. Solr is really good when it comes to full-text faceted search, and its web API is easy enough for fast prototyping. We were able to connect our PHP backend to Solr and never worry about searching and indexing again. As an alternative, you could use Elasticsearch, which also has great features.
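A rough sketch of that kind of middleware, assuming a local Solr core named `listings` and hypothetical item/field names (`title_t`, `price_f` follow Solr's dynamic-field naming convention; match them to your own schema). Documents are posted to Solr's JSON update handler over plain HTTP.

```python
import json
from urllib import request

# Assumed core name and local Solr; adjust for your deployment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/listings/update?commit=true"

def to_solr_doc(item):
    """Normalize a raw scraped item into a flat Solr document.
    Field names here are hypothetical examples."""
    return {
        "id": item["url"],  # the page URL makes a convenient unique key
        "title_t": item.get("title", "").strip(),
        "price_f": float(item["price"]) if item.get("price") else None,
    }

def index_items(items, url=SOLR_UPDATE_URL):
    """POST a batch of normalized documents to Solr's JSON update handler."""
    docs = [to_solr_doc(i) for i in items]
    req = request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status
```

Batching the documents into one POST (rather than one request per item) keeps the indexing overhead low when the spider is producing items quickly.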


Is anyone familiar with the legal implications of scraping sites? Just for things that are easily viewable by anyone with a computer, such as tweets, a store's prices, descriptions of items, etc.?


Twitter has an API that allows you to more or less scrape tweets.
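As a sketch of what that looks like with the v1.1 `search/tweets` endpoint and application-only auth (you still have to obtain the bearer token separately via Twitter's OAuth2 token endpoint, which this code assumes you have done):

```python
import json
from urllib import parse, request

# Twitter REST API v1.1 search endpoint.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_search_request(query, bearer_token, count=100):
    """Build an authenticated GET request for the tweet search API.
    The bearer token is assumed to have been fetched beforehand."""
    url = SEARCH_URL + "?" + parse.urlencode({"q": query, "count": count})
    return request.Request(url, headers={"Authorization": "Bearer " + bearer_token})

def search_tweets(query, bearer_token):
    """Return the list of matching tweets from one page of results."""
    with request.urlopen(build_search_request(query, bearer_token)) as resp:
        return json.load(resp)["statuses"]
```

Rate limits apply per 15-minute window, so for anything approaching real time you would poll on a timer (or use the streaming API instead) rather than hammering the search endpoint.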


I'm seeing a fair bit about Node.js and Bobik; has anyone worked with either?


Use APIs when possible... the right methods depend on what your goal is.


Would you suggest any specific scraping APIs, or an API for each source I'd like to scrape (granted they offer one)?





