

Tips for Sentiment Analysis projects  - bsims
http://blog.datumbox.com/10-tips-for-sentiment-analysis-projects/

======
jlees
Good to see more on this subject. Although I haven't been working much in
sentiment analysis recently, here are a few thoughts from several years of
study into building sentiment analysis algorithms, a startup, and many toy
projects, including a lot of work on Twitter:

#1 - You can use lexicon and learning together. My most successful sentiment
analysis work has used a malleable lexicon which itself is trained using
learning techniques, rather than classifying the whole example naively.

The rest of the tips apply to most NLP.

#3 - There's 'neutral', there's 'mixed' and there's 'unclassifiable'.
Depending on the application you might want to filter out stuff that's
contradictory or just not usable rather than assign it the stronger label of
'neutral'. Also bear in mind that in some domains most examples will be
heavily biased towards some strong opinion (reviews is a big one - people
often don't leave "It was ok" reviews) so you might need to tune the degree as
well as the direction. Point #9 mentions the probability of
positive/neutral/negative being equal; I rarely found this to be true.
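
For what it's worth, the routing I mean looks roughly like this, assuming some upstream scorer produces separate positive and negative evidence scores (all thresholds here are made up for illustration):

```python
def route_label(pos_score, neg_score, strong=0.6, weak=0.2):
    """Separate 'neutral', 'mixed' and 'unclassifiable' instead of
    lumping everything weak into 'neutral'."""
    if pos_score >= strong and neg_score >= strong:
        return "mixed"           # strong evidence both ways: contradictory
    if pos_score < weak and neg_score < weak:
        return "unclassifiable"  # no usable signal either way
    if abs(pos_score - neg_score) < weak:
        return "neutral"         # real but balanced signal
    return "positive" if pos_score > neg_score else "negative"
```

Tuning those thresholds per domain is exactly the degree-vs-direction knob mentioned above.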

#7 - Domain specificity is really, really important for sentiment, but in two
different ways: domain (topic) and platform/format. Twitter vs. Yelp reviews
is one example of where you might want to use a totally different base
classifier because the context, language, length and relationship between
individual items is very different -- but you should also be looking at using
different variables (lexicon, classifier weights, etc.) between restaurant
reviews, bar reviews, gym reviews and so on. A single word can have
drastically different implications in different topic contexts, and I found
that getting this right (or rather close to right) was the most important
thing.
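
As a toy illustration of per-topic variables, the same word can carry a different weight in each domain's lexicon (the words and weights below are entirely made up):

```python
# Same word, different implication per topic. Values are illustrative only.
DOMAIN_LEXICONS = {
    "restaurant": {"cheap": -0.5, "crowded": -1.0, "strong": 0.0},
    "bar":        {"cheap": +0.5, "crowded": +0.5, "strong": +1.0},
    "gym":        {"cheap": -1.0, "crowded": -1.0, "strong": 0.0},
}

def domain_score(text, domain):
    """Score text against the lexicon for its topic domain."""
    lex = DOMAIN_LEXICONS[domain]
    return sum(lex.get(tok, 0.0) for tok in text.lower().split())
```

"Cheap" being good for a bar and bad for a gym is the kind of flip a single shared lexicon can't capture.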

#8 - Many sentiment analysis techniques are not very good, but don't fall into
the trap I did of over-optimising an algorithm without taking into account the
acceptable level for the use case. One of the reasons my sentiment analysis
startup didn't get market traction was that naive algorithms were 'good
enough' for the market I was focusing on, and trying to sell a 5-10%
improvement without the necessary market intelligence just didn't help anyone.
Note that depending on your domain, humans will disagree with each other as
much as 25% of the time on sentiment, which automatically caps how good you
can get programmatically. It's also worth considering whether you care more
about false positives, false negatives, etc.
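
Tracking the error types separately rather than a single accuracy number is straightforward; a sketch:

```python
def precision_recall(preds, golds, target="positive"):
    """Count false positives and false negatives for one target label,
    since which error type matters more depends on the use case."""
    tp = sum(p == target and g == target for p, g in zip(preds, golds))
    fp = sum(p == target and g != target for p, g in zip(preds, golds))
    fn = sum(p != target and g == target for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```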

#9 - Admittedly I haven't looked at the current state-of-the-art in annotated
corpora, but if you're working on noisy, modern data then most formally
annotated corpora will be useless. Figure out a way to get real humans to
annotate for you.
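
If you do collect your own human labels, a simple majority-vote aggregation goes a long way, and dropping high-disagreement items ties back to the accuracy cap in #8 (the agreement threshold here is a made-up example, not a recommendation):

```python
from collections import Counter

def aggregate(annotations, min_agreement=2/3):
    """Collapse several human labels per item into one gold label,
    dropping items the annotators disagree on too much."""
    gold = {}
    for item, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            gold[item] = label
    return gold
```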

#10 - Ensemble learning on its own doesn't work so well, but a classifier
that combines fundamentally different approaches is much better (in my
experience) than over-tuning a single classifier.
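
A sketch of what combining different approaches can look like, blending a lexicon score with a learned model's probability (the blend weights and the scorer interfaces are assumptions for illustration):

```python
def hybrid_classify(text, lexicon_fn, model_fn, w_lex=0.4, w_model=0.6):
    """Blend a lexicon scorer (returns a score in [-1, 1]) with a
    learned model (returns P(positive) in [0, 1]) via a weighted vote."""
    model_signed = 2 * model_fn(text) - 1  # map [0, 1] -> [-1, 1]
    s = w_lex * lexicon_fn(text) + w_model * model_signed
    return "positive" if s > 0 else "negative"
```

The point is that the two components fail in different ways, so the blend is more robust than squeezing either one harder.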

Also, don't forget that sentiment analysis needs input. Depending on what
you're working on, your data feed might need filtering first. There are some
cool information retrieval challenges involved in doing that, which can often
(again) make far more difference than the quality of the sentiment analysis
algorithm itself.
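
A crude example of that pre-filtering step, with placeholder term lists standing in for a real retrieval component:

```python
def relevant(text, topic_terms, min_hits=1, exclude=("rt", "giveaway")):
    """Keep feed items that mention the topic and drop obvious noise
    (retweets, spam) before they ever reach the sentiment classifier.
    The exclude list is a toy stand-in for real spam/noise detection."""
    tokens = set(text.lower().split())
    if tokens & set(exclude):
        return False
    return len(tokens & set(topic_terms)) >= min_hits
```

Even a filter this crude changes what the downstream classifier sees more than most algorithm tweaks do.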

