

Best Practices for Scraping/Data Mining and interpreting? - n_coats

Would be interested in hearing various ways people scrape content from sites, mainly in real time ie: social media sites, etc.<p>Of equal interest would be data mining methods and ways to sort and interpret mined or scraped data from real-time sources.<p>Any examples, articles, and instructionals are welcomed!
======
bbayer
These suggestions aren't for real-time sources, but I believe they're worth a
look.

We built a domain-specific scraper for <http://otomobul.com> . We used
python-scrapy and had to write a custom crawler spider to deal with high
memory usage on big sites. We also added resuming capability, so when the
crawler crashes we don't need to scrape from scratch; it only visits pages it
hasn't seen yet (I think they resolved those issues in newer versions). I
found scrapy very well designed (it borrows lots of ideas from Django and
uses Twisted as its backend) and it is flexible enough for our requirements.
For text processing, Python is also my favorite language, but I can't say
whether it suits your needs 100%.
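The resume-on-crash idea — persist the set of visited URLs and skip them on
restart — can be sketched framework-free like this (scrapy's JOBDIR setting
gives you the same behavior out of the box in newer versions; the class and
file names here are just illustrative):

```python
import json
import os

class ResumableCrawler:
    """Toy crawler that persists visited URLs to disk, so a crashed
    run can resume without re-fetching pages it already scraped.
    (scrapy's JOBDIR setting provides this for real spiders.)"""

    def __init__(self, state_file):
        self.state_file = state_file
        self.visited = set()
        if os.path.exists(state_file):
            # A previous run left a checkpoint; reload the seen-URL set.
            with open(state_file) as f:
                self.visited = set(json.load(f))

    def crawl(self, urls, fetch):
        """Fetch only URLs not seen in a previous run; checkpoint after each."""
        results = []
        for url in urls:
            if url in self.visited:
                continue  # already scraped before the crash; skip it
            results.append(fetch(url))
            self.visited.add(url)
            self._save()  # checkpoint so a crash here loses nothing
        return results

    def _save(self):
        with open(self.state_file, "w") as f:
            json.dump(sorted(self.visited), f)
```

Checkpointing after every page is wasteful for large crawls; in practice you'd
batch the saves, which is essentially what scrapy's disk queues do.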

I can't say we did real data mining in this side project, but we use a
processing middleware that takes raw data from the spider and dumps it into
Apache Solr for indexing. Solr is really good when it comes to full-text and
facet search, and its web API is easy enough for fast prototyping. We were
able to connect a PHP backend to Solr and never worry about searching and
indexing again. As an alternative, you could use Elasticsearch, which also
has great features.
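The spider-to-Solr step is roughly: POST a JSON array of scraped documents to
the core's /update handler. A minimal standard-library sketch (the core name
"listings" and the localhost URL are assumptions, not from the project):

```python
import json
import urllib.request

SOLR_CORE = "http://localhost:8983/solr/listings"  # hypothetical core

def solr_update_request(docs, base_url=SOLR_CORE):
    # Solr accepts a JSON array of documents POSTed to /update.
    # commit=true makes them searchable immediately — fine for
    # prototyping, but use autoCommit settings in production.
    return {
        "url": base_url + "/update?commit=true",
        "data": json.dumps(docs).encode("utf-8"),
        "headers": {"Content-Type": "application/json"},
    }

def index_docs(docs, base_url=SOLR_CORE):
    """Send scraped docs to Solr and return its JSON response."""
    parts = solr_update_request(docs, base_url)
    req = urllib.request.Request(
        parts["url"], data=parts["data"], headers=parts["headers"]
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

After that, queries against /select (with facet parameters) give you the
full-text and facet search from any backend that can speak HTTP.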

------
hojoff79
Is anyone familiar with the legal implications of scraping sites? Just for
things that are easily viewable to anyone with a computer, such as tweets, a
store's prices, descriptions of things, etc.?

~~~
n_coats
Twitter has an API that allows you to more or less scrape tweets.
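A hedged sketch of what a search call looks like — the endpoint version and
the OAuth credentials it requires change over time, so treat the URL as
illustrative rather than current:

```python
from urllib.parse import urlencode

def twitter_search_url(query, count=100):
    # Builds a request URL for Twitter's search endpoint. Actually
    # calling it requires OAuth headers, and the API version/path
    # has changed over the years, so this is only a sketch.
    base = "https://api.twitter.com/1.1/search/tweets.json"
    return base + "?" + urlencode({"q": query, "count": count})
```

For real-time use, the streaming endpoints (which push matching tweets to you
over a long-lived connection) are a better fit than polling search.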

------
n_coats
I'm seeing a fair bit about node.js and Bobik — has anyone worked with either
of them?

------
zeruch
Use APIs when possible... the right method depends on what your goal is.

~~~
n_coats
Would you suggest any specific scraping APIs, or an API for each source I'd
like to scrape (granted they offer one)?

