
Sentiment Analysis on Web-Scraped Data - shrig94
http://blog.kimonolabs.com/2014/12/17/guest-blog-sentiment-analysis-on-web-scraped-data-with-kimono-and-monkeylearn/
======
dingdingdang
This is a very interesting and well-written article. I must admit that the fully
online nature of the tools discourages rather than encourages me: why
take the time to learn the complexities of something as ephemeral as what seems
to be a brand-new web service? Especially when even large players like Google
routinely retire whole platforms when they are not popular enough.

All the same, the tech itself seems solid and the article is, as mentioned,
superb, so I'm really just beating the proverbial drum for proper distributed
services here (or plain old offline-capable apps).

~~~
logn
There are lots of open source tools for these niches.

For sentiment analysis, I'd recommend:
[http://nlp.stanford.edu/sentiment/code.html](http://nlp.stanford.edu/sentiment/code.html)

For web scraping, a popular option is Scrapy:
[http://scrapy.org/](http://scrapy.org/)

And an unknown web scraping option (and shameless plug):
[https://github.com/MachinePublishers/ScreenSlicer](https://github.com/MachinePublishers/ScreenSlicer)

For browser automation, see PhantomJS or Selenium:
[http://phantomjs.org/](http://phantomjs.org/)
[http://docs.seleniumhq.org/](http://docs.seleniumhq.org/)

For an open source IFTTT-inspired project:
[https://github.com/cantino/huginn/](https://github.com/cantino/huginn/)

~~~
smartpants
This is a great list. Thanks!

------
Profan
If you haven't yet tried to build some sort of sentiment analysis yourself, be
it rule-based or based on statistical analysis, you should. Even a rudimentary
rule-based one is a lot of fun to implement, and it works surprisingly well
[0].
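
To give a feel for how small a rudimentary rule-based scorer can be, here is a minimal sketch. The word lists, the negator rule, and the function name are all made up for illustration; this is not the method from the linked post [0], just one plausible starting point.

```python
import re

# Tiny hand-made lexicons; purely illustrative assumptions.
POSITIVE = {"good", "great", "superb", "fun", "love", "solid"}
NEGATIVE = {"bad", "terrible", "poor", "hate", "boring", "broken"}
NEGATORS = {"not", "never", "no"}


def rule_based_sentiment(text):
    """Return a signed score: > 0 positive, < 0 negative, 0 neutral."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            polarity = 1
        elif token in NEGATIVE:
            polarity = -1
        else:
            continue
        # Flip polarity when the previous token is a negator ("not good").
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


print(rule_based_sentiment("The tech is solid and the article is superb"))
print(rule_based_sentiment("Not good, the docs are terrible"))
```

A scorer like this misses sarcasm, contractions such as "isn't", and anything outside its word lists, which is where the statistical approaches come in.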

One of the harder parts of making a decent one based on statistical analysis,
however, is the lack of good training data other than the analyzed Twitter
dataset [1] and a movie review dataset [2].

[0] [http://fjavieralba.com/basic-sentiment-analysis-with-python.html](http://fjavieralba.com/basic-sentiment-analysis-with-python.html)

[1] [http://help.sentiment140.com/for-students/](http://help.sentiment140.com/for-students/)

[2] [http://www.cs.cornell.edu/people/pabo/movie-review-data/](http://www.cs.cornell.edu/people/pabo/movie-review-data/)

~~~
jlees
Good training data is hard to come by partly because there's often fairly
poor inter-annotator agreement on sentiment datasets -- that is to say, humans
disagree a lot in how we interpret a phrase. What reads like sarcasm to you
might read like genuine enthusiasm to someone else.
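
One common way to put a number on inter-annotator agreement is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below is a plain-Python version for two annotators; the label values and example data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations of six phrases.
annotator_1 = ["pos", "pos", "neg", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "pos", "pos", "neg"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Here the two annotators agree on four of six items, but half of that is what chance alone would give, so kappa comes out at roughly 0.33 rather than 0.67.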

It's pretty easy to load up a set of data into a crowdsourcing tool and use
microtasks to rate it, but my experiences doing so weren't superb (even
restricting to native English speakers alone).

A better source of data is starred reviews, where you have the star rating and
the review itself -- these come free with a sentiment rating, although there
are plenty of caveats around normalization. There are lots of places with
review systems like this, and some (like Yelp) even make the data available:
[https://www.yelp.com/academic_dataset](https://www.yelp.com/academic_dataset)
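
As a rough sketch of turning starred reviews into training labels: the thresholds below (2 and under is negative, 4 and over is positive, 3 is dropped as ambiguous) are an assumption, as are the function names and the sample reviews. Real data would likely need more careful normalization, for example per reviewer.

```python
def stars_to_label(stars, low=2, high=4):
    """Map a 1-5 star rating to a label; middling ratings return None."""
    if stars <= low:
        return "negative"
    if stars >= high:
        return "positive"
    return None


def build_training_set(reviews):
    """Turn (text, stars) pairs into (text, label) pairs, skipping ambiguous ones."""
    labeled = []
    for text, stars in reviews:
        label = stars_to_label(stars)
        if label is not None:
            labeled.append((text, label))
    return labeled


# Invented sample reviews, not taken from any real dataset.
reviews = [
    ("Best tacos in town", 5),
    ("It was fine, I guess", 3),
    ("Cold food and rude staff", 1),
]
print(build_training_set(reviews))
```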

~~~
Profan
I wasn't aware that Yelp provided a dataset; that's very interesting!

I had this very problem while working on using the output of sentiment
analysis to modify sentences so as to invert their sentiment polarity
(positive to negative, negative to positive). The datasets I found were either
never general enough (movie reviews have many domain-specific terms, which made
the text generation step hard) or had a lot of noise (the Twitter dataset).

Evaluating the system was also very hard, for the reasons you stated:
inter-annotator agreement was beyond terrible.

I'll have to look into whether other review services expose their data as
well; it seems appropriate.

------
hnriot
this is cool, but you can do the same with BeautifulSoup and TextBlob in far
fewer lines of code, and you wouldn't need any web services. if TextBlob isn't
your thing, there are plenty of SVM implementations out there.

for more interesting sentiment analysis approaches, check out sentence vectors;
that's the current bleeding edge of research in this area.

most sentiment analysis systems need to use an ensemble classifier because the
domain of the text matters a great deal: identifying the domain and applying
the appropriate domain-specific model is key.
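
A toy sketch of the "identify the domain, then use a domain-specific model" idea follows. It is a simple router rather than a true ensemble, and the domains, keyword sets, and lexicons are all invented for illustration. The point is that a word like "unpredictable" can be praise for a movie plot and a complaint about a car.

```python
import re

# Hypothetical domains, keywords, and per-domain word scores.
DOMAIN_KEYWORDS = {
    "movies": {"film", "movie", "plot", "actor"},
    "cars": {"car", "steering", "engine", "brakes"},
}

DOMAIN_LEXICONS = {
    "movies": {"unpredictable": 1, "boring": -1},
    "cars": {"unpredictable": -1, "reliable": 1},
}


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def detect_domain(tokens):
    """Pick the domain whose keyword set overlaps the text the most.

    Ties (including no overlap at all) fall back to the first domain listed.
    """
    token_set = set(tokens)
    return max(DOMAIN_KEYWORDS, key=lambda d: len(DOMAIN_KEYWORDS[d] & token_set))


def domain_sentiment(text):
    """Return (detected domain, signed score from that domain's lexicon)."""
    tokens = tokenize(text)
    domain = detect_domain(tokens)
    lexicon = DOMAIN_LEXICONS[domain]
    return domain, sum(lexicon.get(token, 0) for token in tokens)


print(domain_sentiment("The plot of this movie is unpredictable"))
print(domain_sentiment("The steering on this car is unpredictable"))
```

In a real system the keyword lookup would be replaced by a trained domain classifier and the lexicons by per-domain models, but the routing structure stays the same.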

------
silentrob
Very cool. MonkeyLearn looks promising. It would be nice if their docs were a
little clearer about uploading CSVs and the expected data structure.

It would also be cool if it did unsupervised learning.

