
Ask HN: How Do Techmeme and Google News Cluster Stories? - babyshake
Gabe Rivera's Techmeme site is able to mostly automatically cluster different stories together into one larger "story". And Google News can do the same, fully automatically.

Is there an algorithm particularly well suited for this task, or even better, APIs that can accomplish it?
======
thesethings
The Drupal memetracker module (admittedly pre-beta) uses Python's Pycluster.
With Techmeme, linking is definitely part of it, but not all. I found this out
when experimenting with Onespot.com, a site that aspires to be, but is not yet,
white-label memetracking. They cluster _only_ on links, and as a result barely
pick up any clusters.

Aside: I think Techmeme does a much better job than Google News. It not only
clusters, it kind of threads. Techmeme's Gabe Rivera, who jokes so much you
can't tell when he's serious, has said that Techmeme "has always been biased,"
and I sometimes take that to mean not that it's biased in terms of what's
"hot" (that's the political interpretation), but that it's biased in terms of
what's a cluster.

Techmeme also recently openly added human curation, but that's a different
level of intervention from the clustering we're talking about.

~~~
thesethings
Weird. RossM and I were just voted down in one swoop. Normally I wouldn't
comment on this, but our posts are so innocuous, I can't help but wonder if
there's a bug.

------
kylemathews
Techmeme's algorithm seems to rely on a combination of clustering similar
articles and an analysis of the interlinking of various articles. So if you do
a textual analysis of two articles and discover that both use very similar
words -- e.g. both discuss possible merger talks between Yahoo and Microsoft --
they would be placed in a meme together. Also, if an article links to the main
article of a meme, odds are it's on the same subject as that main article. The
larger the meme and the more clicks it gets, the higher on the page it rises.
And I'm sure there are a number of other factors considered.
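
To make that concrete, here's a rough Python sketch of the two signals combined. It is not Techmeme's actual algorithm, which isn't public. The article dicts (`url`, `text`, `links`), the `same_meme_score` function and the `link_boost` weight are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def same_meme_score(article, lead, link_boost=0.5):
    """Text similarity to the lead article, plus a bonus if the article links to it.

    link_boost is an arbitrary illustrative weight; the result is capped at 1.0.
    """
    score = cosine_similarity(article["text"], lead["text"])
    if lead["url"] in article["links"]:
        score += link_boost
    return min(score, 1.0)
```

An article that shares most of its words with the lead and also links to it scores near 1.0; an unrelated article with no link scores 0.0.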

Google News seems less sophisticated. I haven't worked as hard at reverse
engineering it but it seems to rely solely on clustering.

I wrote the Drupal Memetracker module
([http://drupal.org/project/memetracker](http://drupal.org/project/memetracker))
last summer as part of Google Summer of Code. At this point it's a usable but
fairly simple + unsophisticated memetracker (but patches are highly welcome!).

It relies solely on clustering right now to find memes. To step a bit into
machine learning concepts -- to cluster items, you first find the "distance"
between different items. Then the clustering algorithm runs through the
items, bringing together those closest to each other until it reaches the
distance threshold you've set (i.e. the point at which the remaining items are
too far apart to bring together). This type of clustering is called
hierarchical agglomerative clustering; see
[http://nlp.stanford.edu/IR-book/html/htmledition/hierarchica...](http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html)
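
Here's a minimal pure-Python sketch of that merge loop, in case it helps. It is not the Memetracker code and not the Pycluster API; the function name `agglomerative_clusters` is hypothetical, and I've assumed average linkage (mean distance across all cross-cluster pairs), which is only one of several common choices.

```python
def agglomerative_clusters(dist, threshold):
    """Merge the closest clusters until the closest pair is farther apart than threshold.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a sorted list of item indices.
    """
    # Start with every item in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average linkage: mean distance over all pairs drawn from a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters (x < y always holds).
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > threshold:
            break  # everything left is too far apart to bring together
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters
```

A lower threshold gives more, tighter memes; a higher one lumps more articles together. This naive loop recomputes every linkage on each pass, so it is only reasonable for small batches of articles.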

Memetracker uses two tools to do the clustering. It uses the MySQL FullText
search to find the distance between news articles. Basically, it searches every
article against every other article, which returns a very accurate distance
score.
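
To illustrate the every-article-against-every-other step without a database, here's a small Python sketch. The `relevance` function is a hypothetical stand-in for whatever score the search engine returns (it is not MySQL's scoring), and the `1 / (1 + score)` conversion from "higher is more relevant" to "lower is closer" is just one assumed choice.

```python
import re


def words(text):
    """Set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(query_text, doc_text):
    """Stand-in search score: the number of distinct words the two texts share."""
    return float(len(words(query_text) & words(doc_text)))


def distance_matrix(texts, score=relevance):
    """Score every article against every other and turn the scores into distances."""
    n = len(texts)
    dist = [[0.0] * n for _ in range(n)]  # the diagonal stays at 0
    for i in range(n):
        for j in range(i + 1, n):
            # Average both directions so the matrix is symmetric
            # even when score() itself is not.
            s = (score(texts[i], texts[j]) + score(texts[j], texts[i])) / 2
            # Higher relevance means smaller distance, in the range (0, 1].
            dist[i][j] = dist[j][i] = 1.0 / (1.0 + s)
    return dist
```

The comparison is quadratic in the number of articles, so this only works if you limit it to a recent window of stories.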

Then for clustering, it uses the PyCluster library as thesethings mentioned.

------
qeorge
I've found this tutorial on 4 of the big clustering algorithms helpful:

[http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/...](http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html)

I've always been curious about the subject, and would love to know more. I've
admittedly not gotten very far.

------
RossM
I experimented a little with this a while back but I didn't get very far. My
first assumption was to rely on keywords in the article, and match them with
articles from other sources, but I wasn't able to find a good way of
extracting those keywords. My suggestion would be to go that route however.
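
One common way to pick keywords is TF-IDF: favour words that are frequent in one article but rare across the rest. Below is a rough Python sketch of that, not something I actually had working back then. The function names and the choice of `k` are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(texts, k=5):
    """Return, for each text, its k highest-scoring TF-IDF words as a set."""
    docs = [Counter(tokenize(t)) for t in texts]
    # Document frequency: in how many texts does each word appear?
    df = Counter(word for doc in docs for word in doc)
    n = len(docs)
    result = []
    for doc in docs:
        # Words that appear in every text get a score of 0 (log(1) == 0).
        scores = {w: tf * math.log(n / df[w]) for w, tf in doc.items()}
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(set(ranked[:k]))
    return result


def keyword_overlap(a, b):
    """Jaccard overlap between two keyword sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0
```

Articles from different sources whose keyword sets overlap heavily are candidates for the same story. Common words like "the" drop out on their own because they appear in every article.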

