How to compare two articles?
2 points by bdouglas on July 31, 2008 | 4 comments

Trying to figure out what ways there are to compare two separate articles and determine whether they are the same...

Currently researching semantic analysis, but figured I'd turn here as well...




Hi Bd, ironically, just yesterday I uploaded a tech demo of something I call kindling, which attempts to correlate articles against news feeds from social websites.

I read a book called Programming Collective Intelligence by Toby Segaran. It's basically machine learning for dummies, very example-heavy, all in Python.

He talks about clustering to group like things together in an unsupervised way. The way this works is to build a vector of word counts from each article and compare the vectors using the Pearson correlation as the distance measure. The vector of words is known as a feature set. Early on you create this vector in a naive way (i.e. eliminate words that don't show up often enough and words that show up too often). At the end of the book he talks about feature detection (which I assume is building this vector in a smarter way).
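
To make that concrete, here is a rough sketch of the naive version: count words per article, drop words that appear in too few or too many articles, and score each pair with the Pearson correlation. This is my own minimal take and not the book's code. The function names, the regex tokenizer, and the min_frac/max_frac thresholds are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count runs of letters as words (a crude tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def article_vectors(texts, min_frac=0.0, max_frac=1.0):
    """Turn each text into a word-count vector over a shared vocabulary.

    A word is kept only if the fraction of articles containing it lies in
    [min_frac, max_frac]. The defaults keep every word; the thresholds are
    hypothetical knobs to tune for your own corpus.
    """
    counts = [word_counts(t) for t in texts]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each article counts a word at most once
    n = len(texts)
    vocab = sorted(w for w, df in doc_freq.items() if min_frac <= df / n <= max_frac)
    return [[c[w] for w in vocab] for c in counts]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors, in [-1, 1].

    Returns 0.0 when either vector has no variance (or is empty), since the
    correlation is undefined in that case.
    """
    n = len(x)
    if n == 0:
        return 0.0
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


if __name__ == "__main__":
    articles = [
        "the cat sat on the mat with another cat",
        "a cat and a mat and the cat",
        "stock markets fell as investors sold shares",
    ]
    vecs = article_vectors(articles)
    print("article 0 vs 1:", pearson(vecs[0], vecs[1]))
    print("article 0 vs 2:", pearson(vecs[0], vecs[2]))
```

A higher score means the two articles use words in more similar proportions. If you need a distance for clustering, one common choice is 1 minus the correlation.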

The book really helped me. Pearson correlation is pretty easy to grasp and implement as well.

Good luck.

There's a great Google tech talk on this subject:


What do you mean "are the same?"

By "are the same", I'm trying to determine whether the articles are basically talking about the same or a similar topic...

This of course would/might take into account similar phrases/words, possibly similar titles, a similar timeframe of creation, possibly a priori knowledge about the author (past works), etc...


