I'd be interested in hearing the various ways people scrape content from sites, mainly in real time, e.g. social media sites.
Of equal interest would be data mining methods and ways to sort and interpret mined or scraped data from real-time sources.
Any examples, articles, and tutorials are welcome!
We built a domain-specific scraper for http://otomobul.com . We used python-scrapy and had to write a custom crawler spider to deal with high memory usage on big sites. We also added resuming capability, so if the crawler crashes we don't have to scrape from scratch; it only visits pages it hasn't seen yet (I think they resolved those issues in newer versions). I found Scrapy very well designed (it borrows lots of ideas from Django and uses Twisted as its backend), and it was flexible enough for our requirements. For text processing, Python is also my favorite language, though I can't say whether it suits your needs 100%.
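To give a rough idea, here is a minimal sketch of a Scrapy CrawlSpider with the resume behaviour I mentioned (Scrapy's built-in JOBDIR persistence, not our exact code). The domain, URL pattern, and CSS selectors are made-up placeholders; only the Scrapy APIs (CrawlSpider, Rule, LinkExtractor, JOBDIR) are real.

    # Hypothetical spider -- adjust domain, URL pattern, and selectors to your site.
    from scrapy.crawler import CrawlerProcess
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class ListingSpider(CrawlSpider):
        name = "listings"
        allowed_domains = ["example.com"]          # assumption: your target site
        start_urls = ["https://example.com/cars"]  # assumption: a listing index page

        rules = (
            # Follow listing links and parse each detail page.
            Rule(LinkExtractor(allow=r"/cars/\d+"), callback="parse_listing", follow=True),
        )

        def parse_listing(self, response):
            # Placeholder selectors for illustration only.
            yield {
                "url": response.url,
                "title": response.css("h1::text").get(),
                "price": response.css(".price::text").get(),
            }


    if __name__ == "__main__":
        process = CrawlerProcess(settings={
            # JOBDIR persists the request queue and the set of seen URLs to disk,
            # so a crashed or paused crawl resumes without re-visiting pages.
            "JOBDIR": "crawls/listings-1",
            "CONCURRENT_REQUESTS": 8,
        })
        process.crawl(ListingSpider)
        process.start()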
I can't say we did real data mining on this side project, but we use processing middleware that takes the raw data from the spider and dumps it into Apache Solr for indexing. Solr is really good when it comes to full-text faceted search, and its web API is easy enough for fast prototyping. We were able to connect our PHP backend to Solr and never worry about searching and indexing again. As an alternative you could use Elasticsearch, which also has great features.
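For the "dump to Solr" step, something like the following Scrapy item pipeline works; this is a sketch rather than what we actually ran, and the core name ("listings"), field names, and batch size are assumptions you'd replace with your own Solr schema. It just posts batches of documents to Solr's standard JSON update endpoint.

    # Hypothetical Scrapy pipeline that indexes scraped items into Solr.
    import json
    import requests

    SOLR_UPDATE_URL = "http://localhost:8983/solr/listings/update"  # assumption: core "listings"


    class SolrIndexPipeline:
        def __init__(self):
            self.buffer = []

        def process_item(self, item, spider):
            # Map the raw item onto the fields defined in your Solr schema.
            self.buffer.append({
                "id": item["url"],
                "title_t": item.get("title"),
                "price_t": item.get("price"),
            })
            if len(self.buffer) >= 100:   # flush in batches to cut HTTP overhead
                self.flush()
            return item

        def close_spider(self, spider):
            self.flush()

        def flush(self):
            if not self.buffer:
                return
            requests.post(
                SOLR_UPDATE_URL,
                params={"commit": "true"},
                data=json.dumps(self.buffer),
                headers={"Content-Type": "application/json"},
                timeout=10,
            )
            self.buffer = []

Once indexed, any backend (PHP in our case) can query Solr over HTTP for full-text and facet search without touching the scraping code.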