
Ask HN: Sentiment Analysis – how to handle biased word list lengths? - markovbling
I've tried posting this on Stack Exchange but no luck, so I figured I might have more luck here:

I'm implementing a simple sentiment analysis algorithm where the authors of the paper have word lists of positive and negative words, count the number of occurrences of each in the analysed document, and score the document with:

sentiment = (#positive_matches - #negative_matches) / (document_word_count)

This normalises the sentiment score by document length, BUT the corpus of negative words is 6 times larger than the positive word corpus (around 300 positive words and 1800 negative words), so by the measure above the sentiment score will likely be negatively biased, since there are more negative words to match than positive ones.

How can I correct for the imbalance in the length of the positive vs. negative corpuses?

When I calculate the above sentiment score, around 70% of my 2000-document set gets a negative sentiment score, BUT there is no a priori reason my document set should be biased towards the negative, and I would expect the true 'unobserved' sentiment of the documents to be approximately symmetrical, with around half the documents positive and half negative.

I need to come up with a methodology that produces representative sentiment scores, removing the bias introduced by asymmetrical word lists.

Any thoughts / ideas much appreciated :)
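
For concreteness, the scheme described above can be sketched in a few lines of Python (the word lists here are tiny hypothetical stand-ins for the paper's lexicons):

```python
# Toy stand-ins for the paper's ~300-word positive / ~1800-word negative lexicons.
POSITIVE = {"good", "strong", "gain"}
NEGATIVE = {"bad", "weak", "loss", "terrible", "poor", "decline"}

def sentiment(doc):
    """(#positive_matches - #negative_matches) / document_word_count."""
    words = doc.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)
```

With 6x more negative entries, `neg` simply has more chances to fire, which is the bias being asked about.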
======
PaulHoule
(1) sentiment analysis is the one area where bag of words really goes to die;
there is a limit to how good the results you get will be, and it won't be good.

(2) the right way to do this is to train a probability estimator on your
scores, that is, put +/- labels on some of your documents, then apply
logistic regression.

[http://en.wikipedia.org/wiki/Logistic_regression](http://en.wikipedia.org/wiki/Logistic_regression)

A lot of machine learning people think this is harder than it is and worry
more about regularization, overfitting and such, but in the case of turning a
score into a probability estimator you are (a) fitting a small number of
variables and (b) if you have a lot of data and make a histogram you will
ALWAYS get a logistic curve for any reasonable score; I think it has something
to do with the central limit theorem.

This seems to be one of the best kept secrets in machine learning. I used to
be the bagman who supplied data to people at the Cornell CS department, and we
ran into a problem where there was an imbalance between the positive and
negative sets; in that case the 0 threshold for the SVM is not in the right
place, because it gets the wrong idea about the prior distribution, and
T Joachims told us to do the logistic regression trick.

Also if you read the papers about IBM Watson they tried just about everything
to fit probability estimators and wound up concluding that logistic regression
"just works" almost all the time.
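
A minimal sketch of that calibration step, assuming you have hand-labelled a few documents (pure-Python gradient descent on a single score feature; the scores and labels below are synthetic, not real data):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(scores, labels, lr=0.5, epochs=2000):
    """Fit P(positive | score) = sigmoid(a*score + b) by gradient descent."""
    a = b = 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # prediction error for this document
            ga += err * s
            gb += err
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

# Synthetic labelled raw scores: 0 = labelled negative, 1 = labelled positive.
scores = [-1.0, -0.6, -0.3, 0.3, 0.6, 1.0]
labels = [0, 0, 0, 1, 1, 1]
a, b = fit_logistic(scores, labels)
```

`sigmoid(a*s + b)` then turns any raw counting score `s` into a calibrated probability, and the 50% crossing point gives a decision threshold that corrects for the skewed prior.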

~~~
gibrown
Expanding on the point that bag of words doesn't work that great for
sentiment... Good examples of why are phrases like "not bad", "not great",
"not a good idea". Just scoring based on unigrams really doesn't capture the
context well, and people use negators a lot. You could maybe try filtering
these out or detecting them with some clever rules.
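
One such crude-but-clever rule is to flip the polarity of a sentiment word that immediately follows a negator. A sketch with hypothetical mini-lexicons (adjacency alone is a known oversimplification):

```python
NEGATORS = {"not", "never", "no"}
POSITIVE = {"good", "great"}      # hypothetical mini-lexicons
NEGATIVE = {"bad", "terrible"}

def polarity_counts(doc):
    """Count positive/negative hits, flipping a hit preceded by a negator."""
    pos = neg = 0
    negated = False
    for w in doc.lower().split():
        if w in NEGATORS:
            negated = True
            continue
        if w in POSITIVE:
            if negated:
                neg += 1   # e.g. "not great" counts as negative
            else:
                pos += 1
        elif w in NEGATIVE:
            if negated:
                pos += 1   # e.g. "not bad" counts as positive
            else:
                neg += 1
        negated = False
    return pos, neg
```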

I did a bit of work on this using the JDPA Sentiment Corpus for my thesis
about 5 years ago. It's hand-annotated for things like negators and inversions
of sentiment. There's a bunch of code and examples here:
[https://verbs.colorado.edu/jdpacorpus/](https://verbs.colorado.edu/jdpacorpus/)

Warning: code/corpus is academic licensed, but even reading the papers may
give you some ideas.

~~~
PaulHoule
It is worse than that, because:

(i) negators aren't always next to the word they negate, so to get good
accuracy you need more of a parse;

(ii) sentiment is highly dependent on the domain. For instance, if one was
looking at people's opinions on stocks, "Buy" and "Sell" are considered
sentiment, but these are emotionally neutral words in general;

(iii) there is also sarcasm, which sometimes even people can't figure out.

------
wiresurfer
I see you have mentioned TF-IDF as something which you are planning to try.
That should be interesting.

The way I see it (and I may very well be slightly off point), you have a
corpus of 2000 docs and 2 lists -> [Wpos] & [Wneg], with count[Wneg] a factor
larger than count[Wpos].

If you compute a [0-1] normalized tf-idf score for each term in the sets
[Wpos] & [Wneg] and sum them up over all words in each of those two sets, you
get a score proportional to the count of positive and negative words.
Normalized here means using relative frequencies rather than absolute
frequencies [I prefer calling the latter term counts].

This takes document_word_count-based normalization out of the picture and
makes it implicit in the tf-idf step.

Now you have two numbers, Sum(positive normalized TF-IDFs) and Sum(negative
normalized TF-IDFs), which you can individually normalize for your list sizes
and then use for sentiment classification. A dirty hack, and somewhat
inefficient if you don't maintain a reverse index.

A second approach could be this: use your word lists, both positive and
negative, to do Okapi BM25 scoring against your docs, using each list as a
query set. You would then get a BM25 score for each doc and can use that to
define sentiment.

Corpus = D
Di = document in the corpus you want to classify
Query1 = {set of positive words}
Query2 = {set of negative words}

PositiveScore = BM25(Query1, Di)
NegativeScore = BM25(Query2, Di)

Then some combination to do classification, e.g. if PositiveScore >
NegativeScore: call it positive!

Just a thought. BM25 has some flexibility in tuning it for length
normalizations. Check footnote.

PS: There is the British National Corpus too for word frequencies :)

[1] BM25 and normalizations: [http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html](http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html)
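
A from-scratch sketch of the BM25 scoring described above (documents as token lists; `k1` and `b` are the usual default parameters, and the tiny corpus is made up):

```python
import math
from collections import Counter

def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` against `query` (a set of lexicon words)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N   # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(term in d for d in corpus)              # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = tf[term]                                     # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Then `bm25(positive_words, doc, corpus)` and `bm25(negative_words, doc, corpus)` give the two scores to compare, with length normalization built in via `b`.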

~~~
markovbling
Wow, thank you so much for pointing out BM25 - hadn't heard of it but looks
very cool. Implementing it ASAP.

------
alexbecker
You could use the word frequency lists at
[http://www.wordfrequency.info/](http://www.wordfrequency.info/) in order to
normalize, e.g. add up the frequencies of the positive words and the negative
words, and divide the number of matches by these frequencies.
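
A sketch of that normalisation, with made-up per-million baseline frequencies standing in for the wordfrequency.info data:

```python
# Hypothetical baseline frequencies (occurrences per million words of English).
POS_FREQ = {"good": 900.0, "great": 400.0}
NEG_FREQ = {"bad": 350.0, "terrible": 60.0, "awful": 40.0}

def normalized_sentiment(doc):
    """Divide each side's match count by its lexicon's total baseline frequency."""
    words = doc.lower().split()
    pos = sum(w in POS_FREQ for w in words) / sum(POS_FREQ.values())
    neg = sum(w in NEG_FREQ for w in words) / sum(NEG_FREQ.values())
    return (pos - neg) / len(words)
```

The idea is that a long, rare negative list no longer dominates just by having more entries, because its matches are discounted by how often those words appear in ordinary text.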

~~~
markovbling
Great idea! I've looked at term-weighting approaches such as TF-IDF, but I
don't have a training set of positive / negative sentences, so I would have to
term-weight just the occurrences of each positive/negative list and compute a
net sentiment on that basis.

Will implement and see if this fixes the bias introduced by asymmetrical
corpus sizes.

Fundamentally, I'm not sure a 'solution' is available, or even necessary, for
the issue of a bigger negative word list than positive word list. So what if
there are more ways of saying negative things; all that matters is how many
times positive or negative things are said. My problem, however, is that even
if the range of terms that can create a match is greater for negative words,
you should still get representative positive/negative counts, and that doesn't
explain why I'm getting much more negative sentiment than would be expected
given my test documents.

~~~
tfgg
I think the issue is whether

P(detected|positive) and P(detected|negative)

are the same, right? That is, whether they have equal coverage of all possible
phrases. Having more positive or negative words, as you say, doesn't
inherently bias things. Is your hypothesis that your corpus doesn't skew
negative (which seems to be the basis of this question) actually correct? Can
you do some manual sampling to get a good bound on it?

~~~
markovbling
Totally agree.

I suspect the bias currently giving me
P(detected|negative) > P(detected|positive) results from my simplification of
looking only at 1-grams.

------
QuantumRoar
If I understood correctly, you are trying to get a sentiment score that is
always correct for single sentences but that can extrapolate word frequencies
when words don't appear in your list; i.e. for large neutral documents you
want it to be neutral, although your negative words match statistically more
often.

My intuition tells me that you can't really do both: either use only those
words in your dictionary and get the behaviour right for single sentences, or
extrapolate as if both sets were the same size (a weighted average would be
the easiest). By extrapolating you assume that for each positive match you
get, you miss other positive matches. That means you generally underestimate
positive matches compared to negative matches. This only works on large
datasets.

But really, how bad is it to get a sentiment of -0.17 for a single sentence?
It tells you that it was a negative sentence but that you have a high chance
that there was a positive word in there that you missed, which is what you
need to implement to get neutral sentiment for large neutral documents.

------
wrath
1. You can try using bi-grams or even tri-grams to make your word list a
little more precise.

2. Create a validation set by manually identifying each review as positive or
negative. Each time you modify your algorithm, run it through your validation
set and note the results in a spreadsheet. If you don't do that, you'll never
know if and how you've improved the results. The bigger the validation set,
the better. Similarly, you can use part of your validation set as a training
set for a classifier.

3. Find a scale that works to bias your score. For example, I would try
biasing your negative score using a log scale: the fewer negative words you
have, the more they are worth; the more you have, the less they are worth.

~~~
markovbling
Definitely think I should look at using bi-grams and tri-grams

It's an interesting reflection on society if there are more 1-gram ways of
communicating negativity than positivity, e.g. I'm more inclined to say
'terrible' for something very bad, while it feels more natural to say 'very
good' than 'excellent'. If that makes any sense :)

~~~
namecast
I found this paper useful for a side project I worked on a few months ago, one
that made use of n-grams in a naive bayesian classifier:

[http://arxiv.org/pdf/1305.6143v2.pdf](http://arxiv.org/pdf/1305.6143v2.pdf)

and the lead author's github repos are:

[https://github.com/vivekn/sentiment](https://github.com/vivekn/sentiment)
[https://github.com/vivekn/sentiment-web](https://github.com/vivekn/sentiment-web)

He's implemented 'negative bi-gram detection' (my phrasing, not his) with this
function:

[https://github.com/vivekn/sentiment/blob/master/info.py#L26-...](https://github.com/vivekn/sentiment/blob/master/info.py#L26-L56)

...which I found useful as a jumping off point. Good luck!

------
dhammack
If you think the true sentiment is symmetric, you can just change the decision
threshold so that your algorithm answers positively about half the time. Just
say positive when the sentiment is greater than the mean sentiment over your
training set.
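
This re-thresholding is a one-liner; a sketch, here applied to some made-up raw scores:

```python
def mean_centered_labels(scores):
    """Label a document positive iff its raw score beats the corpus mean."""
    mean = sum(scores) / len(scores)
    return [s > mean for s in scores]
```

Even if every raw score is negative, roughly half the documents come out labelled positive, which matches the symmetry assumption.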

~~~
markovbling
This is an interesting approach to normalisation - will give it a go :)

------
barneso
Two simple things you could do:

1. Insert each negative example six times into your training set (or weight
negative examples accordingly, i.e. use (#positive matches - 6 * #negative
matches) / (2 * positive word count) as your score).

2. Take your distribution of sentiment scores as calculated over held-out
data (or the training set itself, but be warned that this will skew your
results), and calculate the mean and standard deviation. Normalize your
results by subtracting the mean and dividing by the standard deviation. You
can then say that positive sentiment is > 0 and negative sentiment < 0, with
the absolute value being the strength of the classification.
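
Suggestion 2 is just standardising the score distribution; a minimal sketch using the standard library:

```python
import statistics

def standardize(scores):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    return [(s - mu) / sigma for s in scores]
```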

~~~
markovbling
I have a list of positive and negative words and a set of documents I want to
score, so I'm not sure I have a 'training set'.

I think you mean to upweight my positive list by 6 (since it is 1/6 the size
of the negative list), but the problem with this is the same as in my reply to
the other comment: you just shift the bias.

Consider the sentence: 'there are strong and weak divisions in company X's
Europe operations'

The only word matches in your word lists are 'strong' on your positive list
and 'weak' on your negative list.

If you weight these counts as you describe, your sentiment for this sentence
will be -0.44 + 1 = 0.56, even though the sentence is clearly 'neutral' and
should have a score of 0.

~~~
barneso
You are right; this does just shift the bias, which is sometimes all you need
(you have a simple algorithm, presumably for a reason).

I did misunderstand that you don't have a training set, just a list of
positive and negative words. You could still apply a similar idea.

You could test your hypothesis that the score is biased by looking at the
average number of positive and negative words per document, and slightly
modify your factors. For example if you found that the average document had 6
negative words and 4 positive words, but you think that the average sentiment
is neutral across your documents, you could multiply the positive word count
by 1.5. It's a less brutal way to accomplish a similar outcome without
increasing the complexity of your algorithm.

Otherwise, you will need to use an algorithm with more discriminative power,
and this will likely mean you need a training set. You can go very deep down
that rabbit hole, but I would consider starting with Naive Bayes, which
essentially learns a weight per positive and negative word and combines them
in a similar manner to what you're doing now. It has the advantage of being a
simple algorithm.

~~~
SergeyHack
I like your "2." suggestion more, because the initial sentiment score
distribution may not be normal.

So there is the option of trying to make it normal, for example by taking the
logarithm, and calculating the mean etc. after that.

~~~
barneso
I would still expect it to tend towards the normal distribution across a large
set of documents. If you model positive and negative word counts as binomial
distributions, you have the difference of two samples from different binomial
distributions, which would still tend towards normal (I think, though I'm not
100% sure; certainly it's true within my experience). A logarithm would skew
away from positive towards negative sentiment and is undefined for negative
values.

~~~
markovbling
It only tends to a normal distribution if you estimate P(negative | matches in
-ve list) & P(positive | matches in +ve list) with an unbiased, consistent
estimator.

A simple 1-gram model like the one in the question does not model many
complexities of natural language, e.g. negation ("not bad" != "bad"), so you
would expect your estimator to over-represent the dictionary with more words
that are equal to their adverb-adjusted equivalents; e.g. "not bad" can be
described as 'terrible' more readily than 'very good' can be described as
'excellent', since people assign a hyperbolic weighting to their own happiness
(utility theory 101).

The sentiment would only tend to a normal distribution if we had perfect
estimators for document sentiment, which requires advanced POS tagging and
models more complex than a 1-gram bag-of-words aggregation :)

------
bulte-rs
Only using match counts is IMO a bit simplistic (don't get me wrong,
simplistic can be good). Do you have any information like "how many times does
this negatively annotated word occur in a document"? Then you could use a
simple calculation (like cosine similarity) to compute a measure of matching
for each case.

Also, consider using bigrams (i.e. word-pairs) to do sentiment matching which
will make matching more precise.
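
Extracting those word pairs is a one-liner, sketched here for matching against a (hypothetical) phrase-level lexicon:

```python
def ngrams(words, n):
    """All contiguous n-grams from a token list, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

E.g. you could match `('not', 'bad')` against a bigram lexicon instead of matching `'bad'` alone.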

~~~
markovbling
Cosine similarity is a measure of comparison between 2 vectors, so what would
you use as the 2 vectors in this case?

Definitely going to look into n-grams for the production implementation, but
right now I'm trying to resolve the negative bias issue.

------
peterhi
Word counting can be pretty ropey, but there are some things you should check.
Of the 1800 negative words, how many actually occurred in the documents?

Or you could simply count negative words as 0.44 rather than 1 (800/1800 if
those numbers are correct).

Is "not" a negative word? This might be causing problems with things like
"This cake is not bad", which has positive sentiment even though it contains 2
negative words.

~~~
markovbling
I tried weighting the terms by the relative sizes of the corpuses (as you
suggest) but the problem is you just shift the bias instead of removing it.

Consider the sentence: 'there are strong and weak divisions in company X's
Europe operations'

The only word matches in your word lists are 'strong' on your positive list
and 'weak' on your negative list.

If you weight these counts as you describe, your sentiment for this sentence
will be -0.44 + 1 = 0.56, even though the sentence is clearly 'neutral' and
should have a score of 0.

:)

~~~
danieldk
_If you weight these counts as you describe, your sentiment for this sentence
will be -0.44 + 1 = 0.56 even though the sentence is clearly 'neutral' and
should have a score of 0._

If you want to stick to simple counting (it's a fun exercise at least ;)) and
L is the large lexicon and S the small, why don't you:

- Generate L' by randomly picking |S| words from L.

- Compute the score using L' and S.

- Rinse and repeat for the same text N times.

- Compute an aggregate over the N scores, e.g. the average, the score with the
largest number of hits from L, the score with the largest number of hits from
both, ...

This way, the lexicons have the same size during each scoring attempt, but you
do use the extra vocabulary of the larger lexicon.

P.S.: don't stick to simple counting. It doesn't work ;).
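
The resampling recipe above can be sketched like so (word lists and document are made up, and averaging is used as the aggregate):

```python
import random

def resampled_score(words, pos_lex, neg_lex, rounds=200, seed=0):
    """Counting score averaged over rounds in which the (larger) negative
    lexicon is subsampled down to the size of the positive one."""
    rng = random.Random(seed)
    neg_list = sorted(neg_lex)              # fixed order for reproducibility
    pos = sum(w in pos_lex for w in words)  # positive hits don't vary per round
    total = 0.0
    for _ in range(rounds):
        sample = set(rng.sample(neg_list, len(pos_lex)))
        neg = sum(w in sample for w in words)
        total += (pos - neg) / len(words)
    return total / rounds
```

Each round scores with equal-sized lexicons, while across rounds the full vocabulary of the larger lexicon still gets used.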

~~~
markovbling
Ahh resampling |S| words from L is a great idea! :)

I know simple counting isn't the greatest approach, but I started out by
trying to replicate a research paper put out by a stockbroker (not the most
advanced research, haha!)

I would love some suggestions for a method that does work :)

~~~
markovbling
On second thought, I'm not sure resampling |S| words from L is a good idea,
because you want 100% coverage of your sentiment universe, and an asymmetrical
corpus is not a priori incorrect. Resampling doesn't solve anything except
reweighting the match lists, and that may not even lead to symmetrical
sentiment, since the distribution of words across the language (TF-IDF helps
here) is not necessarily the same in your negative and positive lists.

------
ai_maker
Do you have gold-standard labels for your dataset? Can you ensure that the
number of pos/neg labels is symmetrical?

You can heuristically tune the weights of your lexicon to fit your intuition,
but evidence is necessary to progress adequately.

In case you find an unbalanced number of examples, apply an effectiveness
score suited to unbalanced data, like the F-measure, to obtain a fair measure
of your system's performance.

~~~
markovbling
This is part of my problem - I don't have a labeled dataset outside of my
'positive words' / 'negative words' lists.

I don't think asymmetrical test sets would be a problem if I had training data
for documents, since you can reweight to compensate. It seems my problem is
that over-representing the universe of matches for negative points, due to a
bigger 'negative word list', is introducing bias, and I'm not sure how to
solve that.

Please see my reply on reweighting in this thread (if you reweight positive
words to normalize the over-represented negative word count then a neutral
sentence will have a positive sentiment score)

------
var_eps
Assuming there is no inherent bias in terms of sentiment and vocabulary, one
approach would be to repeatedly randomly sample 300 negative words from the
corpus and generate a vector of sentiments. You could then average the
elements of the vector to get an average sentiment, or use another metric from
basic stats. That could decrease the bias.

~~~
markovbling
But wouldn't you miss sentiment terms in the text if you sample a subset of
your negative dictionary?

------
SergeyHack
You can try to find a dataset that contains the equal number of positive and
negative documents (sentences, etc.) and use it as the validation set. I.e. to
tune your hyperparameters on it.

In the simple case your hyperparameter can be α in

sentiment = (α * #positive_matches - #negative_matches) /
(document_word_count)
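
A sketch of that tuning loop, with a made-up balanced validation set of (pos_matches, neg_matches, length, label) tuples standing in for real labelled documents:

```python
def alpha_score(p, n, length, alpha):
    """sentiment = (alpha * #positive_matches - #negative_matches) / length"""
    return (alpha * p - n) / length

def tune_alpha(val_docs, alphas):
    """Pick the alpha whose sign predictions best match the validation labels."""
    def accuracy(a):
        return sum((alpha_score(p, n, l, a) > 0) == (y > 0)
                   for p, n, l, y in val_docs) / len(val_docs)
    return max(alphas, key=accuracy)

# (pos_matches, neg_matches, doc_length, true label +1/-1) -- synthetic data
val_docs = [(1, 2, 10, +1), (1, 2, 10, +1), (0, 3, 10, -1)]
best = tune_alpha(val_docs, [1, 2, 3, 4])
```

With these synthetic documents, alpha = 1 misclassifies the two positive ones, while a larger alpha recovers them without flipping the genuinely negative one.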

------
amelius
Out of interest, did you define some kind of measure by which you can test how
well the chosen method performs?

(There are a lot of suggestions here, so it would be nice if at least you
could choose the "best" one)

~~~
markovbling
To be honest, I haven't thought about measuring the performance of different
approaches, but I have thought about a metric that will signal poor
performance, and right now I'm interested in eliminating poor performance in
my simplistic methodology.

What I mean is that if my output 'sentiment scores' are skewed towards the
negative and centred around a negative number (~70% of my documents are scored
as 'negative sentiment' using the above scheme), then I know my model is
broken, because I know my document "test set" is necessarily neutral (or even
positively skewed).

Mathematically, I need to ensure my 'cost function' reflects reality, which
means my sentiment scores at the end of the day need to be symmetrical, or
slightly positively skewed, with a mean of approximately 0.

I can use regularization to trick the model into looking like this, but I
don't want to overfit, and I can't think of a theoretical reason for the
negative bias in this simple model, except perhaps that my positive corpus is
missing 'true positive sentiment' (à la a parameter versus an estimate in
frequentist statistics), which could be a byproduct of the simplistic 'bag of
words' assumption (words are assumed independent), aka breaking my analysis
down into 1-grams (single words). As per my other comment, 2-grams could be
the more natural way to express positive sentiment, while 1-gram negative
words are more readily available. Perhaps this is a sign of our pessimism as a
species, haha :)

