

Hack idea: text classification for political propaganda - dfranke

Reading the campaign material of political candidates in order to figure out what they stand for is a rather annoying exercise. Invariably, they're 95% fluff. Look at

http://www.johnmccain.com/Informing/Issues/65bd0fbe-737b-4851-a7e7-d9a37cb278db.htm

It isn't my intent to pick on McCain, but this makes a good example. These five paragraphs are almost but not quite semantically null. I can glean from the first paragraph that he believes in man-made global warming, and from the fourth paragraph that he supports nuclear energy. This is useful information, but having to parse five paragraphs to discover it seems inefficient.

I think text classifiers might be able to improve this process. "Nuclear energy" seems like it would be a pretty strong ham token, while "for our children" and "addressing the challenges" seem pretty spammy. Trying this out is not quite as simple as just feeding things into CRM114, because we're trying to classify parts of messages rather than complete messages. It ought to be possible to work around that, though: perhaps score each clause, each sentence, and each paragraph, and then somehow derive an overall score from these inputs.

Anyone think this could work?
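Roughly what I have in mind, as a toy sketch. The seed snippets, the per-token scoring, and the min-over-sentences rule are all invented for illustration; a real version would use a proper classifier like CRM114 and a hand-labeled corpus:

```python
import re
from collections import Counter

# Hypothetical hand-labeled seed data: "substance" (ham) vs. "fluff" (spam).
SUBSTANCE = [
    "expand nuclear energy production",
    "cap carbon emissions at 2005 levels",
]
FLUFF = [
    "for our children and our children's children",
    "addressing the challenges we face together",
]

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

# Token frequencies per class.
ham_counts = Counter(t for s in SUBSTANCE for t in tokens(s))
spam_counts = Counter(t for s in FLUFF for t in tokens(s))

def fluff_score(sentence):
    """Naive per-token score in [0, 1]: 1 means pure fluff.
    Tokens seen equally often in both classes count as neutral (0.5)."""
    ts = tokens(sentence)
    if not ts:
        return 0.5
    total = 0.0
    for t in ts:
        h, s = ham_counts[t], spam_counts[t]
        total += 0.5 if h == s else (1.0 if s > h else 0.0)
    return total / len(ts)

def score_paragraph(paragraph):
    """Score each sentence, then take the minimum: a paragraph is only as
    fluffy as its least fluffy sentence, so one concrete claim rescues it."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return min(fluff_score(s) for s in sentences if s)
```

The min rule is the part I'm least sure about; averaging would punish a paragraph that buries one real policy position under four sentences of padding, which is exactly the case we want to catch.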
======
xirium
This verbose steganography also applies to economists. Apparently, Alan
Greenspan wasn't happy with a speech until he read the newspaper coverage the
following day. If one newspaper reported the opposite of another newspaper
then he was satisfied. You don't want to spook the market.

------
danohuiginn
Even very simple text classification can get you somewhere. Last year a couple
of economists, Matthew Gentzkow and Jesse Shapiro, published a wonderful paper
in which they did something similar.

Take a look at it:
[http://home.uchicago.edu/~jmshapir/biasmeas052507_formatted....](http://home.uchicago.edu/~jmshapir/biasmeas052507_formatted.pdf)

"To measure news slant, we examine the set of all phrases used by members of
Congress in the 2005 Congressional Record, and identify those that are used
much more frequently by one party than by another. We then index newspapers by
the extent to which the use of politically charged phrases in their news
coverage resembles the use of the same phrases in the speech of a
congressional Democrat or Republican."

They then go on to compare their index of politically-slanted language in
newspapers to the politics of the newspapers' readers, and conclude that
newspaper bias is driven more by readers wanting their own prejudices
confirmed than by the politics of the newspaper owners. It's a really great
piece of work: worth reading in full if you have any interest in text
classification or politics.
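Their slant measure boils down to something like the following toy sketch. The phrase counts here are invented, not from the paper, and the actual index uses a more careful statistical fit, but the shape is the same: score each phrase by how lopsided its party usage is, then average over the phrases an article uses:

```python
from collections import Counter

# Invented stand-ins for phrase counts from the 2005 Congressional Record.
dem = Counter({"estate tax": 2, "workers' rights": 9, "tax relief": 1})
rep = Counter({"estate tax": 1, "death tax": 8, "tax relief": 7,
               "workers' rights": 1})

def partisanship(phrase):
    """+1 = used only by Republicans, -1 = used only by Democrats."""
    d, r = dem[phrase], rep[phrase]
    return (r - d) / (r + d) if (r + d) else 0.0

def slant(article_phrases):
    """Average partisanship of the charged phrases an article uses,
    weighted by how often it uses each one."""
    counts = Counter(article_phrases)
    total = sum(counts.values())
    return sum(partisanship(p) * n for p, n in counts.items()) / total
```

So an article leaning on "death tax" scores toward +1 and one leaning on "workers' rights" toward -1, which is the sense in which the index measures resemblance to one party's speech.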

------
iamwil
I like the idea, though if it came into widespread use, writers would do as
spammers do--write to get past your filter.

One concern is that building up a large enough labeled corpus, with some
metric to compare against so you know how well you're doing, would be labor
intensive. Marking it up by hand takes a lot of work, and if you spread it
across a lot of humans, the labels might be a little inconsistent.

~~~
dfranke
I don't think they'd do that. The purpose of the fluff is not (usually) to
deliberately conceal their beliefs. It's there because it's what some people
want to read. They want a candidate who "cares". I don't think the people who
write the fluff would have any motivation to try to beat fluff filters.

~~~
dfranke
Come to think of it, instead of trying to beat the filters, what if they
specifically cooperated with them so that they'd filter out nothing at all?
Think of what a great boast it would make: "I'll take you straight to the
point, and these folks over here will prove it mathematically". I could
picture Ralph Nader trying this.

------
darose
Semi-related:

You might want to take a look at this blog entry:

[http://billburnham.blogs.com/burnhamsbeat/2008/02/skygrid-and-the.html](http://billburnham.blogs.com/burnhamsbeat/2008/02/skygrid-and-the.html)

It's about a startup that has accomplished some cool things in the area of
recognizing meaning from text. They've developed some "sentiment"-measuring
algorithms that allow them to classify articles into "good" or "bad", for the
purposes of deciding whether to invest in a stock. Pretty cool stuff! Perhaps
it might give you some ideas on similar approaches you could apply to your
problem.

------
neilc
A professor at the university where I did my undergrad has done research on
this topic:

<http://www.cs.queensu.ca/home/skill/papers.html>

For example, he took the Enron email data set, and applied machine learning
techniques to try to distinguish "dishonest" emails where something was being
concealed from normal emails. A student of his has applied similar techniques
to speeches given by MPs in Parliament (Canadian equivalent of Congress).

------
dkokelley
So, a program that loads certain text, recognizes the primary points of
the text, and prints the filtered version?

I'd think that there would be much better uses for it than as a BS-O-Meter for
political literature.

Think about a program that could generate the summary of an essay or book, or
assist novice writers in writing concisely. Maybe use it as a service that
takes news items and abbreviates them for busy people who want to stay on top
of things. You could call it snapnews.com or something.

~~~
dfranke
I think you'd find that usefulness diminishes pretty rapidly as the problem
domain expands. The reason spam filters work as well as they do is that most
spam pretty much looks alike. Likewise, there are only so many ways you can
phrase soothing platitudes that will get soccer moms to vote for you.

------
pius
I'm working on something different, but related. I'm hoping to open-source
part of it this week, so stay tuned. :)

~~~
andreyf
Shoot me an e-mail when you do, please?

fedorov@rutgers.edu

------
cedsav
A bit off topic, but that reminds me of Isaac Asimov's Foundation. At one
point, a politician comes, makes a big speech with lots of promises and
everybody is happy, until a day-long sophisticated semantic analysis of the
speech reveals that it was actually devoid of any meaningful content.

------
tocomment
Good idea, I think it could be done. Perhaps you could look at congressional
speeches and match them to how each speaker voted?
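A toy version of that vote-matching idea, with the speech excerpts and vote labels entirely made up; a real attempt would need the actual Congressional Record and roll-call votes, and a real learner rather than this bag-of-words overlap:

```python
import re
from collections import Counter

# Invented training data: speech excerpts labeled by the speaker's vote.
speeches = [
    ("we must cut taxes and shrink government", "nay"),
    ("lower taxes will grow the economy", "nay"),
    ("fund schools and protect social programs", "yea"),
    ("invest in education and health care", "yea"),
]

def bag(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Build one bag-of-words centroid per vote label.
centroids = {}
for text, label in speeches:
    centroids.setdefault(label, Counter()).update(bag(text))

def predict_vote(speech):
    """Assign the label whose centroid shares the most word mass."""
    words = bag(speech)
    def overlap(centroid):
        return sum(min(words[w], centroid[w]) for w in words)
    return max(centroids, key=lambda lab: overlap(centroids[lab]))
```

The nice thing about using votes as labels is that they're free: nobody has to hand-mark the corpus, which answers the labeling-cost objection upthread.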

------
edw519
Just invert this:

<http://www.dack.com/web/bullshit.html>

------
curi
where do you get good and bad seed phrases?

~~~
byrneseyeview
Read transcripts from every speaker of the house you've never heard of. Being
powerful but creating nothing memorable is an indicator.

