
Self-Improving Bayesian Sentiment Analysis for Twitter - spxdcz
http://danzambonini.com/self-improving-bayesian-sentiment-analysis-for-twitter/
======
randomwalker
For those interested in machine learning, the "self improvement" technique
that the author talks about falls under _semi-supervised learning_,
specifically _self-training_, which is apparently "still extensively used in
the natural language processing community."
<http://en.wikipedia.org/wiki/Semi-supervised_learning>
<http://pages.cs.wisc.edu/~jerryzhu/icml07tutorial.html>

Semi-supervised learning is a good idea in this type of situation, given that
unlabeled samples are far more abundant than labeled samples, but there are
gotchas to watch out for. In general, SSL helps when your model of the data is
correct and hurts when it is not.

Here's an example of what can go wrong in this particular application: let's
say the word 'better' is mildly positive, but when it appears in high-
confidence samples, it's usually because it appears together with the words
'business' and 'bureau', as in "I just reported Company X to the Better
Business Bureau", i.e., strongly negative. This means that the new self-
training samples containing the word better will all be negative, which will
bias the corpus until eventually 'better' is treated as a strongly negative
feature.
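
To make the feedback loop concrete, here's a minimal self-training sketch in
Python (scikit-learn's CountVectorizer/MultinomialNB and the 0.95 threshold
are my own choices for illustration, not the article's code). The bias above
creeps in at the pseudo-labeling step, since the corpus only grows with
samples the current model is already sure about:

    # Self-training sketch: pseudo-label high-confidence unlabeled tweets
    # and fold them back into the training set.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def self_train(labeled_texts, labels, unlabeled_texts,
                   threshold=0.95, rounds=5):
        texts, y = list(labeled_texts), list(labels)
        vec, clf = CountVectorizer(), MultinomialNB()
        for _ in range(rounds):
            if not unlabeled_texts:
                break
            clf.fit(vec.fit_transform(texts), y)
            probs = clf.predict_proba(vec.transform(unlabeled_texts))
            still_unlabeled = []
            for text, p in zip(unlabeled_texts, probs):
                if p.max() >= threshold:
                    # high-confidence prediction becomes a "real" label
                    texts.append(text)
                    y.append(clf.classes_[p.argmax()])
                else:
                    still_unlabeled.append(text)
            if len(still_unlabeled) == len(unlabeled_texts):
                break  # nothing crossed the threshold; stop early
            unlabeled_texts = still_unlabeled
        return vec, clf

Capping the number of rounds is one crude way to get the "turn it off after a
while" behavior suggested below.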

Occasional random human spot-checks of the high-confidence classifications
would be useful :-) Also, self-training gives diminishing returns in accuracy,
whereas the possibility of craziness remains, so turning it off after a while
might be best.

A survey of semi-supervised learning:
<http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf>

~~~
jacquesm
Single words may not be the best way to attack this problem, though; multi-
word expressions would do a better job. You could then label an occurrence of
'reported*better business bureau' as strongly negative.

Context is everything in natural language processing, and by dropping all
context the problem becomes harder to solve.
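
For illustration, here's a minimal n-gram feature extractor in plain Python
(my own sketch); it keeps 'better business bureau' intact as a single feature
instead of three unrelated words:

    # Turn a tweet into unigram, bigram and trigram features.
    def ngram_features(text, max_n=3):
        words = text.lower().split()
        feats = []
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                feats.append(" ".join(words[i:i + n]))
        return feats

    print(ngram_features("reported them to the Better Business Bureau"))
    # ... includes 'better business bureau' as a single feature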

~~~
_delirium
Learning becomes problematic if your features get too complex, though,
especially if you go up to learning things like regexes. Simply ramping up the
complexity, e.g. from word counts to word-pair counts (or other low-n
n-grams), does sometimes give you gains, but there have been a number of cases
where enlarging the feature space like that gave surprisingly little or no
gain, which is one reason the simple models keep being used (besides
simplicity and speed).

~~~
jacquesm
That's true. I built a 'chatbot' long ago that used simple regexes
(words+wildcards) to match incoming text. It worked well because at the higher
levels of the conversation you'd use single words to steer toward a portion of
the conversation tree, and lower down you could make decisions on very
specific differences in the input.

For a classifier that's a less useful approach, but I think single words are
too narrow. 3-grams are probably the sweet spot for something like this.
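
As a rough sketch of that words+wildcards matching, in Python (fnmatch here is
just a stand-in for whatever pattern syntax the bot actually used):

    # Match incoming text against a words+wildcards pattern.
    from fnmatch import fnmatch

    def matches(pattern, text):
        return fnmatch(text.lower(), pattern.lower())

    print(matches("*reported*better business bureau*",
                  "I just reported Company X to the Better Business Bureau"))
    # True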

------
ThomPete
Stanford also has an entire semester of lectures on iTunes U:

[http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunes...](http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048)

~~~
eclark
I just started watching them. While they're pretty math-heavy and
implementation-light, they're a great watch for the morning train ride.

------
herdrick
Very cool. That seemed to work better than I'd have expected.

The explanation of the 'naive' part of Naive Bayes isn't quite right, though.
Throwing out the possibility of the animal being human based on the data point
"four legs" is orthogonal to naivety. A more sophisticated Naive Bayes system
could decline to classify an animal as human because it has four legs, and
conversely a non-naive system, i.e. one that used joint probabilities of the
features, might be no more likely to do so.
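
To pin down what's at stake: Naive Bayes multiplies per-feature likelihoods as
if the features were independent, and with raw counts a single unseen feature
zeroes out the whole product. A toy sketch in Python, with invented numbers:

    # Naive Bayes scores a class as prior * product of per-feature
    # likelihoods (the independence assumption). A raw zero count for
    # one feature ("four legs" never observed for humans) vetoes the
    # class no matter what the other features say.
    def nb_score(prior, likelihoods):
        score = prior
        for p in likelihoods:
            score *= p
        return score

    # invented P(feature | class) values for: talks, thinks, four legs
    human = nb_score(0.5, [0.9, 0.8, 0.0])
    dog = nb_score(0.5, [0.001, 0.01, 0.95])
    print(human, dog)  # 0.0 vs. a small positive number: 'human' loses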

I guess the most rational way to do that would be to express the conditional
probabilities in terms of some statistical distance, like the number of
standard deviations. Then 100k examples of humans without a single one having
the "four legs" feature would make that feature a very strong indicator of
non-humanness. And that would work just as well with a naive algorithm as with
any other.

Is there a name for this?

~~~
alextp
You seem to be talking about smoothing: how many non-human four-legged animals
and non-four-legged humans do you have to see before you can estimate that
humans are not likely to have four legs? This is reasonably well solved: you
either use Dirichlet (Laplace, add-one) smoothing or something more closely
corresponding to your domain, like Good-Turing smoothing. This reweights the
probabilities in your classifier so that (by analogy) if you see a talking,
thinking, four-legged banker, you will probably still classify him as human,
since the other features overwhelmingly point in that direction. These methods
all assume some measure of confidence in the features, and make sure that a
feature that has been observed only once or twice will not move the class
boundaries too much.
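
For concreteness, add-one (Laplace) smoothing is just a small change to the
count-based estimate; a sketch in Python, with invented counts:

    # Add-one (Laplace) smoothing: pretend every feature was seen alpha
    # extra times per class, so an unobserved feature gets a small
    # nonzero probability instead of zeroing the product.
    def smoothed_likelihood(feature_count, class_total, vocab_size,
                            alpha=1.0):
        return (feature_count + alpha) / (class_total + alpha * vocab_size)

    # "four legs" never seen across 100,000 human feature observations
    p = smoothed_likelihood(0, 100_000, vocab_size=50_000)
    print(p)  # ~6.7e-06: still points strongly away from 'human', but
              # the talking/thinking features can now overwhelm it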

