

Show HN: News headline analysis of over 140,000 headlines over several years - Luiz7
http://headlines-and-data.herokuapp.com/
======
Luiz7
This is the result of an 8-day final project for DBC Chicago. Our team scraped
over 140,000 headlines from several news agencies, stretching back several
years. We then fed those headlines through the AlchemyAPI sentiment analysis
engine to assign each one a score. The scores were then plotted in a couple of
different ways using D3.

This is far from perfect and even farther from scientific. It was done in 8
days by some passionate amateur developers. It was, however, a lot of fun and
very interesting.

You can read about it and the team in more detail on the repo page here:

[https://github.com/kelmerp/headline_sentiment_rating](https://github.com/kelmerp/headline_sentiment_rating)

and see some slightly more technical slides here:

[https://speakerdeck.com/luizneves77/sentimental-headlines](https://speakerdeck.com/luizneves77/sentimental-headlines)

This was written in RoR, Postgres (Memcached), and JavaScript + D3.

I'm also the creator of:

onionornot.com
reddesigned.com
[http://luiz-n.github.io/route-search/](http://luiz-n.github.io/route-search/)

and am interviewing for web dev (and data visualization) positions in the
Chicago area, if you would like to reach out to me: @hey_luiz

------
DjangoReinhardt
Awesome job!

A few things immediately jumped out at me:

1\. On average, Fox seems to be more positive than CNN. Hmm.

2\. The headlines were more positive than most {sources/days} on CNN Politics
in the second half of 2009. Obama's election is the only story I can think of
that could have affected all the sources, but I may be wrong.

3\. Around the same time, (shortly after the election results, in fact)
HuffPost saw a sharp decline in the positivity of their headlines and dipped
into negative. Hmm...

4\. The overall trend of headlines is more positive than negative, leading me
to wonder why it is that, as a people, we are so cynical.

I'd love to see a graph that plots the median instead of the average of the
daily scores. Also, I dunno much about the Alchemy scoring, but quite a few of
the headlines (~30-50% per day) seem to have scored 0.0 - is that bad
detection or consistently neutral reporting? With that many zeros, I suspect
the median graph will look a lot different.
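To illustrate why the two graphs could diverge, here is a minimal Python sketch. The dates and scores are made up for illustration and are not taken from the project; `daily_summary` is a hypothetical helper name. It groups scores by day and reports the mean, the median, and the share of exact 0.0 scores.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (date, sentiment score) pairs; NOT real data from the project.
scored_headlines = [
    ("2009-11-01", 0.0),
    ("2009-11-01", 0.0),
    ("2009-11-01", 0.6),
    ("2009-11-02", -0.4),
    ("2009-11-02", 0.0),
    ("2009-11-02", 0.0),
    ("2009-11-02", 0.8),
]


def daily_summary(rows):
    """Return mean, median and share of 0.0 scores for each day."""
    by_day = defaultdict(list)
    for day, score in rows:
        by_day[day].append(score)

    summary = {}
    for day in sorted(by_day):
        scores = by_day[day]
        zeros = sum(1 for s in scores if s == 0.0)
        summary[day] = {
            "mean": mean(scores),
            "median": median(scores),
            "zero_share": zeros / len(scores),
        }
    return summary


for day, stats in daily_summary(scored_headlines).items():
    print(day, stats)
```

When half or more of a day's headlines score exactly 0.0, the median sits at 0.0 while the mean still moves with the few non-zero scores, so a median plot would likely be much flatter.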

I'd also love to see a similar analysis for non-mainstream media, especially
aggregators like reddit, HN and slashdot. Maybe you could add the post
ranking/comments/karma as other variables and attempt to refine the analysis
further?

Just bouncing a few ideas, that's all. All in all, good job. :)

------
chris_va
If you want to have a fun time with this, break it down by title keywords
(e.g. politician names). You can get a very good breakdown of political bias
by news source.

In a more automated way, you can extract all keywords from your titles, and
then auto-extract the keywords that divide the sources the most.

(Source: former Google News TL, we used to have way too much fun doing stuff
like this)
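
The keyword breakdown described above could be sketched roughly like this in Python. Everything here is an assumption for illustration: the rows, source names and scores are invented, the stopword list is a tiny placeholder, and "most dividing" is interpreted as the largest gap between per-source mean scores for a keyword, which is just one plausible definition.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical (source, headline, score) rows; sources and scores are made up.
rows = [
    ("SourceA", "Senator Smith wins budget vote", 0.5),
    ("SourceA", "Smith praised for budget plan", 0.7),
    ("SourceB", "Smith under fire over budget", -0.6),
    ("SourceB", "Budget vote passes Senate", 0.1),
    ("SourceA", "Storm hits coast", -0.3),
    ("SourceB", "Storm hits coast hard", -0.3),
]

# Tiny placeholder stopword list; a real one would be much longer.
STOPWORDS = {"a", "the", "of", "for", "over", "under"}


def keywords(headline):
    """Lowercase the headline and return its distinct non-stopword tokens."""
    words = re.findall(r"[a-z']+", headline.lower())
    return {w for w in words if w not in STOPWORDS}


def most_dividing(rows, min_sources=2):
    """Rank keywords by the gap between per-source mean scores."""
    scores = defaultdict(lambda: defaultdict(list))
    for source, headline, score in rows:
        for word in keywords(headline):
            scores[word][source].append(score)

    spreads = {}
    for word, by_source in scores.items():
        if len(by_source) < min_sources:
            continue  # keyword appears in too few sources to compare
        means = [mean(values) for values in by_source.values()]
        spreads[word] = max(means) - min(means)
    return sorted(spreads.items(), key=lambda item: item[1], reverse=True)


for word, spread in most_dividing(rows)[:3]:
    print(f"{word}: {spread:.2f}")
```

With real data you would likely want a minimum headline count per keyword and per source as well, since a spread computed from one or two headlines is mostly noise.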

~~~
Luiz7
Yeah, we talked about that as being the obvious next step, but we had neither
the time nor the skill set for it off the bat. A NoSQL DB would probably be the
better option for that as well. (Correct me if I'm wrong.)

~~~
chris_va
With 150K titles, you don't have enough data to warrant a NoSQL database. You
can set up MySQL or Postgres in about 10 minutes and dump in the data.
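
As a rough sketch of how little is needed, the following Python example uses the standard library's `sqlite3` as a stand-in for MySQL or Postgres so that it runs without a server. The table layout, column names and sample rows are assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical rows: (source, published_on, title, score). Not real data.
sample_rows = [
    ("SourceA", "2009-11-01", "Budget vote passes", 0.5),
    ("SourceA", "2009-11-02", "Storm hits coast", 0.0),
    ("SourceB", "2009-11-01", "Budget vote under fire", -0.5),
]


def build_db(rows):
    """Create an in-memory database with one headlines table and load rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE headlines (
            source       TEXT NOT NULL,
            published_on TEXT NOT NULL,
            title        TEXT NOT NULL,
            score        REAL NOT NULL
        )
        """
    )
    # An index on (source, published_on) keeps per-source, per-day queries fast.
    conn.execute(
        "CREATE INDEX idx_headlines_source_day "
        "ON headlines (source, published_on)"
    )
    conn.executemany("INSERT INTO headlines VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def average_by_source(conn):
    """Return (source, average score, headline count) per source."""
    query = (
        "SELECT source, AVG(score), COUNT(*) "
        "FROM headlines GROUP BY source ORDER BY source"
    )
    return conn.execute(query).fetchall()


conn = build_db(sample_rows)
for source, avg_score, count in average_by_source(conn):
    print(source, avg_score, count)
```

A single table with an index like this is comfortably within what any relational database handles at a few hundred thousand rows.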

