

Similarity Measures for Text Document Clustering (2012) [pdf] - dang
http://www.milanmirkovic.com/wp-content/uploads/2012/10/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

======
bcbrown
A summary: they use the bag-of-words model with stopword removal and stemming,
and compute document vectors where each word is a dimension and the weight is
that word's TF-IDF value. The metrics covered are Euclidean distance, cosine
similarity, the Jaccard coefficient, the Pearson correlation coefficient, and
averaged Kullback-Leibler divergence. They use each of these metrics with
K-means clustering, and evaluate the generated clusters on purity and entropy.
The purity measure evaluates the coherence of a cluster, that is, the degree
to which a cluster contains documents from a single category. The entropy
measure evaluates the distribution of categories within a given cluster; in
general, the smaller the entropy value, the better the quality of the cluster.
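To make the metric list concrete, here is a minimal sketch of four of the
measures on plain weighted vectors. The exact averaged-KL formula in the paper
may weight its two terms differently; the `avg_kl` below is one common
symmetrized variant (Jensen-Shannon-style averaging against the midpoint
distribution) and is my assumption, not the paper's definition.

```python
from math import sqrt, log2

def cosine(a, b):
    # cosine similarity: dot product over the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    # straight-line distance between the two vectors
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard(a, b):
    # extended (Tanimoto) Jaccard coefficient for weighted vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

def avg_kl(p, q):
    # symmetrized KL divergence averaged against the midpoint M;
    # assumes p and q are normalized term distributions, smoothed so
    # that no component is zero
    m = [(x + y) / 2 for x, y in zip(p, q)]
    kl = lambda a, b: sum(x * log2(x / y) for x, y in zip(a, b))
    return (kl(p, m) + kl(q, m)) / 2
```

Note that cosine and Jaccard are similarities (higher means closer) while
Euclidean distance and averaged KL are divergences (lower means closer), which
matters when plugging them into a clustering algorithm.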

------
peter_l_downs
I'm new to the CS paper-reading game, so maybe someone can enlighten me: is it
normal for a paper (like this one) to come without any source code or
instructions to reproduce the results? I've played around with some tf-idf
code before and this paper was fun to read, but I was disappointed that I
couldn't find any source code.

~~~
jkldotio
It's from a University of Waikato researcher so they are likely using Weka to
do all the "textbook" transformations before the final analysis.[1]

You could also go a long way towards reproducing the results in Python by
installing NLTK and scikit-learn: NLTK for tokenizing, stemming and
similarity metrics,[2] scikit-learn for tf-idf vectorization, k-means and even
some of the datasets like 20news.[3] I use NLTK and scikit-learn for
[http://jkl.io](http://jkl.io)
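For example, a minimal scikit-learn sketch of the paper's pipeline might look
like the following. The four-document corpus is a toy stand-in for a real
dataset like 20news (`fetch_20newsgroups` works the same way), and note two
simplifications: scikit-learn does not stem out of the box, and its `KMeans`
uses Euclidean distance rather than the paper's full range of metrics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus: two "pets" documents and two "finance" documents
docs = [
    "the cat chased the dog around the house",
    "my dog and my cat are friendly pets",
    "the stock market fell sharply today",
    "investors watched the stock market closely",
]

# Bag-of-words tf-idf vectors with English stopword removal
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# K-means with k=2 on the tf-idf vectors (Euclidean distance)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(km.labels_)
```

To try the other metrics from the paper, you would need a clustering routine
that accepts an arbitrary distance function, since swapping the metric is not
something scikit-learn's `KMeans` supports directly.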

Weka has a partially free book and NLTK has a free book for you to look into
as well. Scikit-learn has some of the best documentation and examples I've
seen. But I agree that more papers should come with working implementations.
The problem is that academics are judged on their publications, and
publications frequently build on previous ones. The work for this paper might
be the starting point for two more papers the author is planning, and they
don't want someone to come along and publish those logical extensions of
their hard work; they want the credit for themselves.

It will be hard to change this kind of academic culture until getting your
dataset or code used and cited becomes more prestigious in metrics used for
academic promotions.

[1][http://www.cs.waikato.ac.nz/ml/weka/](http://www.cs.waikato.ac.nz/ml/weka/)
[2][http://www.nltk.org/](http://www.nltk.org/)
[3][http://scikit-learn.org/stable/](http://scikit-learn.org/stable/)

~~~
peter_l_downs
I'm not saying I cannot reproduce these results on my own – that's something I
might do because it would be fun. What confuses me is that a paper describing
results that depend so heavily on software behavior doesn't include said
software. That's a good point about getting research-sniped, though, and not
one I'd heard before.

~~~
deutronium
Yes, in general in computer science I've found people tend not to include or
reference the source code they write, which is quite a pity.

