

Clustering related stories  - mattdeboard
http://blog.getprismatic.com/blog/2012/4/17/clustering-related-stories.html

======
smackfu
The underlying dilemma is that so many of these stories are not really
"related". They're just the same story, rewritten off of a press release. The
ideal system would pick out clusters, but also have sub-clusters within the
cluster that would contain articles on the same subject but with diverse info.

~~~
mattdeboard
Doesn't Google do something similar to this for their news aggregator?

~~~
rabidsnail
They also have access to the link graph, so they probably use that for
clustering instead of looking at the text. Pages with many inbound links in
common are likely to be similar.
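That link-based intuition can be sketched in a few lines; the inbound-link data below is entirely made up for illustration:

```python
# Sketch: score page similarity by the overlap of their inbound-link sets
# (co-citation). The link data here is hypothetical.

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# inbound[page] = set of pages linking to it (made up)
inbound = {
    "story_a": {"hn", "reddit", "blog1", "blog2"},
    "story_b": {"hn", "reddit", "blog1", "blog3"},
    "story_c": {"site9"},
}

sim_ab = jaccard(inbound["story_a"], inbound["story_b"])  # 3/5 = 0.6
sim_ac = jaccard(inbound["story_a"], inbound["story_c"])  # 0.0
```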

------
cgs1019
At first glance I interpreted the title as "Clustering-related stories", as in
"Stories about clustering" which I thought was a very cleverly self-
referential title. Oh well...

------
papercruncher
What has worked best for us is to narrow the category as much as possible
before attempting to do unsupervised clustering.

We focus solely on sports and our (supervised) classifiers reduce the scope
first to the sport and then to the specific team before we apply any sort of
clustering (k-means, LDA, etc). That allows us to reduce the vocabulary to
what is mostly a list of named entities for the sport/team and key words such
as 'injury', 'quarterback', etc. With a significantly reduced vocabulary, even
algorithms such as Hierarchical LDA work surprisingly well.
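The narrow-then-cluster idea can be sketched roughly like this; the vocabulary, documents, and similarity measure are all invented for illustration, not the commenter's actual system:

```python
# Sketch: restrict the vocabulary to named entities and key terms for one
# team, then compare docs on the reduced vectors. Everything here is made up.

from collections import Counter
from math import sqrt

# Hypothetical reduced vocabulary after the supervised narrowing step
VOCAB = {"brady", "patriots", "injury", "quarterback", "trade"}

def vec(text):
    """Term-frequency vector over the reduced vocabulary only."""
    return Counter(t for t in text.lower().split() if t in VOCAB)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "brady injury update patriots quarterback status",
    "patriots quarterback brady injury report",
    "patriots trade rumors heat up",
]
vs = [vec(d) for d in docs]
# docs 0 and 1 collapse to the same reduced-vocabulary terms; doc 2 doesn't
```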

------
jnazario
nice!

years ago i got my feet wet with some rudimentary text classification and
grouping techniques. it was when google news came out, and it was fluffier
than i like my news to be (it has since been tuned a lot better). so i wrote
my own.

i wrote up my (admittedly naive and simplistic - but effective) methods here:

<http://www.informit.com/articles/printerfriendly.aspx?p=398884>

------
ygyg
Did you consider weighting of the different feature types (on top of tf/idf)?

~~~
jrfinkel
We actually do weight them differently, but the post was already pretty long
so I had to gloss over some of the details.
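One plausible form of such per-feature-type weighting; the post doesn't give the real values, so the multipliers below are invented:

```python
# Sketch: scale each feature's tf-idf score by a per-type multiplier.
# The weights are hypothetical, not Prismatic's actual values.

TYPE_WEIGHT = {"entity": 3.0, "bigram": 1.5, "unigram": 1.0}  # made up

def weighted(features):
    """features: {(ftype, term): tfidf_score} -> reweighted scores."""
    return {(ftype, term): score * TYPE_WEIGHT[ftype]
            for (ftype, term), score in features.items()}

doc = {
    ("entity", "obama"): 0.4,
    ("bigram", "health care"): 0.4,
    ("unigram", "said"): 0.4,
}
w = weighted(doc)
# entities now dominate equally-scored unigrams
```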

------
rwynn
You might find this library helpful in this area:
<https://github.com/rwynn/rugroupy>.

~~~
mattdeboard
You have to manually mark entities though; how is this advantageous over, say,
indexing your data with Solr and using even the built-in clustering tools?
(Carrot by default)

~~~
rwynn
I didn't realize solr had clustering. One difference might be the ability to
pass in scoring, include, and dynamic tagging functions at clustering time.
Does carrot cluster on arbitrary document fields?

~~~
mattdeboard
I'm not sure if Carrot does, but with Solr you have to specify which fields to
cluster on, then restart the instance. The clustering functionality is a
contrib library, I believe.

------
basicallydan
I worked on something similar in Python for my final-year project at
university, at <https://github.com/basicallydan/Onda>, clustering related
articles in the online media.

In retrospect the code is pretty ugly, but it worked well. I really wanted to
implement named entities and n-grams but never got around to it. I'm glad you
guys did :)

------
chrislo
I worked on something similar for clustering BBC news articles. The (Ruby)
code I used is here: <https://github.com/bbcrd/similarity>

I didn't account for named entities or n-grams in the feature vector though.
That's a very interesting idea.

@mattdeboard - what algorithm did you use to determine the number and size of
the clusters?
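The n-gram features mentioned here are straightforward to bolt onto a bag-of-words vector; a purely illustrative sketch:

```python
# Sketch: extend a bag-of-words feature list with n-grams.
# Tokenization and feature format are illustrative only.

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(text, max_n=2):
    """Unigrams through max_n-grams of the lowercased text."""
    toks = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(toks, n))
    return feats

f = features("New York mayor speaks")
# contains the unigram "mayor" as well as the bigram "new york"
```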

------
gms
Since you have feature vectors, have you looked at using LSH to reduce the
number of comparisons/memory consumption?

~~~
aria
In general, LSH is great for detecting near-duplicates that are very close,
but we have two issues: (1) the stories we're looking for are farther apart
than LSH can detect efficiently, and (2) because our feature vector has
structure (named entities, n-grams, etc.), we have a problem-specific 'coarse'
feature vector that is good at cheaply detecting possible related candidates.
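For context, minhash signatures (the sketching stage of one common LSH scheme) estimate Jaccard similarity and suit exactly the near-duplicate case described above; a toy sketch with illustrative parameters:

```python
# Toy minhash: the fraction of matching signature positions estimates the
# Jaccard similarity of the token sets. 32 hash functions is illustrative.

import hashlib

def minhash(tokens, num_hashes=32):
    """Signature: for each salted hash, keep the minimum value over tokens."""
    return [
        min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for i in range(num_hashes)
    ]

def est_jaccard(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox jumps over a lazy cat".split())  # near-duplicate
c = set("completely unrelated text about clustering news".split())
# a and b share most minhash positions; a and c share essentially none
```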

------
lookforr
nice article. a few questions:

1. scalability: does your system ingest multiple documents in parallel? if
so, how often do you observe over-segmentation, if any?

2. thresholds: how did you set the thresholds at various parts of the
system?

~~~
jrfinkel
1. We do ingest multiple documents in parallel, but we also have a system in
place for merging clusters, so it ends up not being a problem.

2. Pure hackery. Tweak, look at data, repeat.

~~~
felipehummel
Can you elaborate on how you merge clusters? Many times clusters will be
just single-doc clusters; how do you check whether you need to merge two
single-doc clusters? (It seems like the same process that was used in the
first place to create the clusters: doc-vs-doc similarity.)

~~~
jrfinkel
Merging is pretty simple, and could probably use a little more TLC. When we
get a new document into the system, if its similarity score is above some
threshold for documents in two different clusters, we will consider merging
those clusters. We then make the yes/no decision by comparing random documents
from both clusters and averaging the scores, but the threshold we use here is
a bit lower than for the non-merging decisions (since we have the additional
information that this new document does a good job of linking the clusters).
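The merge decision described above can be sketched as follows; the similarity function, sample count, and threshold are stand-ins, not Prismatic's real ones:

```python
# Sketch of the described merge step: sample random cross-cluster document
# pairs, average their similarity, and merge if it clears a (lower)
# threshold. All parameters and the similarity measure are stand-ins.

import random

MERGE_THRESHOLD = 0.5  # hypothetical, lower than the assignment threshold

def similarity(a, b):
    """Stand-in doc similarity: Jaccard overlap of word sets."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def should_merge(cluster_a, cluster_b, samples=5, rng=random):
    """Average similarity over random cross-cluster pairs vs. threshold."""
    pairs = [(rng.choice(cluster_a), rng.choice(cluster_b))
             for _ in range(samples)]
    avg = sum(similarity(x, y) for x, y in pairs) / samples
    return avg >= MERGE_THRESHOLD
```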

~~~
lookforr
this doesn't work in the case where two (and only two) similar documents
get ingested into the system as new singleton clusters at the same time;
that case is very rare, so i guess it's not a big issue for you.

------
rabidsnail
What corpus did you get your term frequencies from?

~~~
lmichelbacher
If I understand the blog correctly, their corpus consists of all the articles
they have processed so far. Maybe they have some additional source, e.g. a
more general collection of web pages.

------
brey
interesting approach. did you consider k-means for this? if so, what
influenced your choice?

------
senthil_rajasek
how are you doing entity extraction?

~~~
jrfinkel
It's an in-house system similar to the one I built in grad school
(<http://nlp.stanford.edu/ner/index.shtml>).

