I know I've tooted its horn before, but Orange3 is a pretty neat Python-based GUI platform that makes this and a metric buttload of other statistical/ML techniques available to non-programmer types.
Just watch out for null characters (`\x00`) in the corpus. Those always seem to kill it stone dead.
A fairly interesting topic that I think remains important to this day. Some notes:
* The article misses out on HDBSCAN [1], a relatively new and quite effective technique (there's a sketch after this list that pairs it with sentence embeddings).
* With the arrival of embeddings, especially sentence embeddings, you now have a very powerful additional lever for producing results closer to what you want for text. Changing the embeddings can often lead to drastically improved results (as an example, see [2], where the authors observe that clustering with embeddings can often replace topic models).
* Clustering is (or can be, depending on the context) an ill-defined problem (see [3]), with some hard-to-specify degrees of freedom, e.g., the text representation, the distance metric, and the loss or accuracy metric being optimized. For a specific use case it helps to pay attention to these, but they are typically baked into the algorithm rather than exposed as levers. An appropriate embedding can sometimes compensate for settings that are misaligned with your problem.
* There is also some interesting work on incorporating user feedback into clustering (as an example, see [4]). While this area has been around for a while (see [5], an early work by Andrew Ng), it isn't big, which is a pity: I'd have been interested to see the continuum where you start with unsupervised groupings, i.e., clustering, and move all the way to supervised, precise groupings, i.e., classification, based on minimal user interaction. Although I suppose some metric-learning work can be repurposed for this use case.
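To make the first two bullets concrete, here's a minimal sketch pairing sentence embeddings with HDBSCAN. It assumes the `sentence-transformers` package and scikit-learn >= 1.3 (which ships `sklearn.cluster.HDBSCAN`); the model name and `min_cluster_size` are illustrative choices of mine, not settings from the cited papers.

```python
# Toy sketch: sentence embeddings + HDBSCAN for text clustering.
# Assumes sentence-transformers and scikit-learn >= 1.3 are installed;
# the model name and min_cluster_size are illustrative, not tuned values.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import HDBSCAN

docs = [
    "the cat sat on the mat",
    "dogs are loyal companions",
    "my parrot repeats everything",
    "stock prices fell sharply today",
    "the market rallied after the announcement",
    "bond yields rose this quarter",
]

# Embed each document as a dense vector; swapping the model here is the
# "additional lever" mentioned above.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)

# Density-based clustering: no need to pick k up front, and points in
# low-density regions are labeled -1 (noise).
labels = HDBSCAN(min_cluster_size=2).fit_predict(embeddings)
print(labels)
```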
At my work we deal with a lot of text data, and hierarchical clustering as a first step enables us to make sense of the topics, topic spread, volumes, etc.
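For what it's worth, that first-pass hierarchical view takes only a few lines with SciPy. In this sketch the random matrix stands in for real document embeddings, and Ward linkage plus the cut into 10 groups are arbitrary illustrative choices.

```python
# Sketch of a first-pass hierarchical view over document vectors.
# The random matrix stands in for real embeddings; the linkage method
# and cut threshold are arbitrary illustrative choices.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 384))  # stand-in for 100 document embeddings

Z = linkage(X, method="ward")                     # bottom-up merge tree
labels = fcluster(Z, t=10, criterion="maxclust")  # cut into at most 10 groups
print(np.bincount(labels)[1:])                    # rough per-group volumes
```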
An excellent comment, but I would stress that topic models such as LDA and stm are not ordinary clustering methods. They are latent variable models in which each document is represented as a mixture of latent topics.
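A toy illustration of that distinction, using scikit-learn's LDA on a made-up corpus: the per-document output is a distribution over topics rather than a single cluster label.

```python
# Toy illustration: LDA yields per-document topic *mixtures*, not hard labels.
# The corpus and hyperparameters are made up for demonstration.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cats chase dogs",
    "dogs and cats play",
    "stocks and bonds fell",
    "bond markets rallied",
]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

theta = lda.transform(X)  # rows sum to 1: each doc is a mixture over topics
print(theta.round(2))
```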
Sorry if I didn't make that clear: I wasn't saying they are methodologically equivalent (although there are similarities, e.g., the Dirichlet process is a distribution over distributions, and a Dirichlet process mixture is one way to formulate a clustering problem), but that the results can be equivalent, as the cited paper shows.
A while back I implemented a sort of hierarchical clustering on points using a recursive k-means algorithm. I thought it worked quite well, though I admit it's probably considered a naive approach.
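Since there's no code in the comment, here's my own guess at what recursive (bisecting) k-means looks like: split with k=2 and recurse into each half until a cluster is small. The `min_size`/`max_depth` thresholds and the nested-list tree representation are assumptions of mine.

```python
# My own sketch of recursive (bisecting) k-means: split with k=2 and recurse
# into each half. min_size/max_depth and the nested-list tree are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def recursive_kmeans(points, indices=None, min_size=5, depth=0, max_depth=8):
    """Return a nested list of original row indices; leaves are flat clusters."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) <= min_size or depth >= max_depth:
        return indices.tolist()
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(points[indices])
    if labels.min() == labels.max():  # degenerate split (e.g., duplicate points)
        return indices.tolist()
    return [
        recursive_kmeans(points, indices[labels == k], min_size, depth + 1, max_depth)
        for k in (0, 1)
    ]

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])
print(recursive_kmeans(data))
```

Incidentally, scikit-learn now ships a BisectingKMeans estimator built around the same idea, so the approach isn't as naive as you might think.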
https://orangedatamining.com/
https://orange3.readthedocs.io/projects/orange-visual-progra...