

How to compare two articles.. - bdouglas

hi...<p>trying to figure out what ways there are to compare/determine if two separate articles are the same...<p>currently researching semantic analysis, but figured i'd turn here as well...<p>thoughts/comments...<p>thanks<p>bd
======
raffi
Hi bd. Ironically, just yesterday I uploaded a tech demo of something I call
kindling, which attempts to correlate articles against news feeds from social
websites.

I read a book called Programming Collective Intelligence by Toby Segaran. It's
basically machine learning for dummies: very example-heavy, all in Python.

He talks about clustering to group like things together in an unsupervised
way. The way this works is to build a vector of word counts from each article
and compare the vectors using something known as Pearson distance (a distance
based on the Pearson correlation). The vector of words is known as a feature
set. Early on you create this vector in a naive way (i.e. eliminate words that
show up too rarely and words that show up too often). At the end of the book
he talks about feature detection (which I assume is building this vector in a
smarter way).
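
For what it's worth, here is a minimal sketch of the word-vector plus Pearson idea described above. It is not the book's code: the function names (`word_counts`, `build_vectors`, `pearson`, `pearson_distance`) and the `min_docs` / `max_frac` thresholds are made up for illustration, and taking the distance as 1 minus the correlation is just one common convention that I'm assuming here.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def build_vectors(articles, min_docs=1, max_frac=1.0):
    """Turn each article into a word-count vector over a shared vocabulary.

    Naive feature selection (thresholds are illustrative assumptions):
    keep a word only if it appears in at least `min_docs` articles and
    in at most `max_frac` of all articles.
    """
    counts_list = [word_counts(a) for a in articles]
    doc_freq = Counter()
    for counts in counts_list:
        doc_freq.update(counts.keys())  # count each word once per article
    n = len(counts_list)
    vocab = sorted(
        w for w, df in doc_freq.items()
        if df >= min_docs and df / n <= max_frac
    )
    # A Counter returns 0 for words an article does not contain.
    return [[counts[w] for w in vocab] for counts in counts_list]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors, in [-1, 1].

    Returns 0.0 when either vector has zero variance (correlation undefined).
    """
    n = len(x)
    if n == 0:
        return 0.0
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def pearson_distance(x, y):
    """Assumed convention: 0 means perfectly correlated, 2 means opposite."""
    return 1.0 - pearson(x, y)
```

To use it, pass a list of article texts to `build_vectors` and compare any two of the resulting vectors with `pearson_distance`; a smaller distance suggests more similar word usage. With a real corpus you would probably raise `min_docs` and lower `max_frac` so that very rare and very common words drop out.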

The book really helped me. Pearson correlation is pretty easy to grasp and
implement as well.

Good luck.

------
MaysonL
There's a great Google tech talk on this subject:

<http://www.youtube.com/watch?v=AyzOUbkUf3M>

------
jfarmer
What do you mean by "are the same"?

~~~
bdouglas
by "are the same", i mean i'm trying to determine whether the articles are
basically talking about the same or a similar topic...

this of course could take into account similar phrases/words, possibly
similar titles, a similar timeframe of creation, possibly a priori knowledge
about the author (past works), etc...

thanks

