Hacker News
Best Practices for Scraping/Data Mining and interpreting?
9 points by n_coats on June 4, 2013 | 6 comments
I'd be interested in hearing the various ways people scrape content from sites, mainly in real time (e.g. social media sites).

Of equal interest would be data mining methods and ways to sort and interpret mined or scraped data from real-time sources.

Any examples, articles, and tutorials are welcome!




Sure, those instructions aren't for real-time sources, but I believe they're worth a look.

We built a domain-specific scraper for http://otomobul.com . We used python-scrapy and had to write a custom crawler spider to deal with high memory usage on big sites. We also added resuming capability, so when the crawler crashes we don't need to scrape from scratch: it travels only the pages it hasn't visited yet. (I think they resolved those issues in newer versions.) I found Scrapy very well designed (it borrows lots of ideas from Django and uses Twisted as its backend), and it's flexible enough for our requirements. For text processing, Python is also my favorite language, though I can't say whether it suits your needs 100%.
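A minimal sketch of the resume idea in plain Python, assuming a JSON state file (Scrapy itself now provides this natively via its `JOBDIR` setting): persist the set of visited URLs so a restarted crawl skips pages it has already seen. The class and field names here are illustrative, not from any real crawler.

```python
import json
import os

class ResumableCrawler:
    """Sketch of crawl resumption: visited URLs are persisted to disk,
    so a restarted crawl travels only not-yet-visited pages."""

    def __init__(self, state_path):
        self.state_path = state_path
        self.visited = set()
        # Reload state from a previous (possibly crashed) run, if any.
        if os.path.exists(state_path):
            with open(state_path) as f:
                self.visited = set(json.load(f))

    def should_visit(self, url):
        # Skip pages the previous run already processed.
        return url not in self.visited

    def mark_visited(self, url):
        # Persist after every page; cheap insurance against crashes.
        self.visited.add(url)
        with open(self.state_path, "w") as f:
            json.dump(sorted(self.visited), f)
```

In a real spider you would checkpoint less often (e.g. every N pages) to avoid the per-page write, but the skip-if-seen logic is the whole trick.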

I can't say we did real data mining in this side project, but we use processing middleware that takes the raw data from the spider and dumps it into Apache Solr for indexing. Solr is really good when it comes to full-text faceted search, and its web API is easy enough for fast prototyping. We were able to connect our PHP backend to Solr and never worry about searching and indexing again. As an alternative, you could use Elasticsearch, which also has great features.
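A rough sketch of that kind of middleware, assuming a local Solr core named `listings` and hypothetical item/field names (`title_t`, `price_f` follow Solr's dynamic-field naming convention; match them to your own schema). Documents are posted to Solr's JSON update handler over plain HTTP.

```python
import json
from urllib import request

# Assumed core name and local Solr; adjust for your deployment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/listings/update?commit=true"

def to_solr_doc(item):
    """Normalize a raw scraped item into a flat Solr document.
    Field names here are hypothetical examples."""
    return {
        "id": item["url"],  # the page URL makes a convenient unique key
        "title_t": item.get("title", "").strip(),
        "price_f": float(item["price"]) if item.get("price") else None,
    }

def index_items(items, url=SOLR_UPDATE_URL):
    """POST a batch of normalized documents to Solr's JSON update handler."""
    docs = [to_solr_doc(i) for i in items]
    req = request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status
```

Batching the documents into one POST (rather than one request per item) keeps the indexing overhead low when the spider is producing items quickly.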


Is anyone familiar with the legal implications of scraping sites? Just for things that are easily viewable by anyone with a computer, such as tweets, a store's prices, descriptions of items, etc.?


Twitter has an API that allows you to more or less scrape tweets.
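As a sketch of what that looks like with the v1.1 `search/tweets` endpoint and application-only auth (you still have to obtain the bearer token separately via Twitter's OAuth2 token endpoint, which this code assumes you have done):

```python
import json
from urllib import parse, request

# Twitter REST API v1.1 search endpoint.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_search_request(query, bearer_token, count=100):
    """Build an authenticated GET request for the tweet search API.
    The bearer token is assumed to have been fetched beforehand."""
    url = SEARCH_URL + "?" + parse.urlencode({"q": query, "count": count})
    return request.Request(url, headers={"Authorization": "Bearer " + bearer_token})

def search_tweets(query, bearer_token):
    """Return the list of matching tweets from one page of results."""
    with request.urlopen(build_search_request(query, bearer_token)) as resp:
        return json.load(resp)["statuses"]
```

Rate limits apply per 15-minute window, so for anything approaching real time you would poll on a timer (or use the streaming API instead) rather than hammering the search endpoint.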


I'm seeing a fair bit about Node.js and Bobik; has anyone worked with either?


Use APIs when possible... the right methods depend on what your goal is.


Would you suggest any specific scraping APIs, or an API for each source I'd like to scrape (granted they offer one)?





