
Comparing Clustering Algorithms - merusame
http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb
======
mattnedrich
I wrote an article about mean-shift a while back if anyone is interested in
more details about it - [https://spin.atomicobject.com/2015/05/26/mean-shift-
clusteri...](https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/)

Some comments on K-Means - one large limitation of K-Means is that it assumes
spherically shaped clusters. It will fail badly on any other cluster shape.
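
A quick toy sketch of that failure mode (my own example, not from the
notebook): K-Means partitions space into convex cells around centroids, so
it cuts each half-moon in two rather than separating the two moons.

    # Toy demo: K-Means on two half-moon clusters, a shape it cannot
    # recover because it assumes compact, roughly spherical clusters.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_moons

    X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)
    kmeans_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
    # Compare kmeans_labels to true_labels: each moon gets split in half.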

It's interesting that the author compared results on the same data set for the
different algorithms. Each clustering approach is going to work best on a
specific type of data set. It would be interesting to compare them across
several different data sets to get a better feel for strengths/weaknesses,
etc.

~~~
lmcinnes
I agree that on some level more data sets would be nice, but I felt that it
cluttered and obscured the exposition. Instead I used the one synthetic
dataset, but crafted it to have various properties (noise, cluster shape,
variable density, non-standard distributions) that will confound many
different clustering approaches ... it is meant to be the "hard" case,
with all the difficulties and confounding factors rolled into one dataset.

~~~
mattnedrich
Cool, I think you did a great job. Do you have run time data for each
algorithm on that data set?

~~~
lmcinnes
It's included in the upper left corner of the plots. To be fair, these are for
the sklearn implementations, some of which are excellent, but I can't speak
for the performance of all of them.

------
jey
This is very specific to 2D data. I bet the story is a lot different for high-
dimensional data. The challenges you encounter with this sort of clumping in
2D are unlikely to occur in high-dimensional data due to the "curse" of
dimensionality. Clustering in high dimensions has its own quirks and gotchas
too, but they're quite distinct from the gotchas of low dimensionality.

Here's more from an actual expert: [http://research.microsoft.com/en-
US/people/kannan/book-chapt...](http://research.microsoft.com/en-
US/people/kannan/book-chapter-2.pdf)

~~~
lmcinnes
The examples are all in 2D because it allows someone to visualise what's going
on. In higher dimensions things get messier and you have to rely on cluster
quality measures ... which are often bad (or, more often, defined to be the
objective function that a particular clustering algorithm optimizes, and hence
give a false sense of how "well" the clustering has done). I've worked with
HDBSCAN quite successfully on mid-range dimensionality data (50 to 100
dimensions). I agree that it doesn't work as well on truly high dimensional
data, but then little can once the curse of dimensionality truly kicks in.
Your best bet, at that point, is to assume that your data actually lies on
some lower dimensional manifold (i.e. the intrinsic dimensionality is much
lower), apply suitable dimension reduction techniques (t-SNE, Robust PCA,
etc.), and then cluster.
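
A minimal sketch of that reduce-then-cluster pipeline (PCA standing in for
the reduction step here; t-SNE or Robust PCA would slot in the same way,
and the blob data is just a stand-in for real high dimensional data):

    # Reduce to an assumed lower intrinsic dimensionality, then run a
    # density-based clusterer on the reduced coordinates.
    import hdbscan
    from sklearn.datasets import make_blobs
    from sklearn.decomposition import PCA

    X, _ = make_blobs(n_samples=1000, n_features=100, centers=5,
                      random_state=0)
    X_reduced = PCA(n_components=10).fit_transform(X)
    labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X_reduced)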

~~~
visarga
Or to apply neural nets to compute some kind of vector representation that
contains semantic information.

------
merusame
One of the better comparisons of clustering algorithms online. It's also worth
checking out how HDBSCAN works:
[http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/mas...](http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb)

------
makmanalp
These all look like clustering based on a 2d space, but does anyone know
methods to tackle clustering on a network?

Is it just a matter of redefining density / distance in terms of the number of
hops, or is it a different problem entirely? I can see how with 0 or 1 hops
the data would be a very smushed distribution, whereas 2D distance is much
richer and more spread out.

~~~
PeterisP
Clustering algorithms generally only need a distance metric. Here 2d space is
used for illustration just because it is easy to visualize.

For n-dimensional space, geometric distance is often (but not always) used,
but for a network you can just use the number of hops instead.
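
To make that concrete, here is a rough sketch (a made-up toy graph, not
from the article): compute pairwise shortest-path hop counts and hand them
to a clusterer that accepts a precomputed distance matrix.

    # Cluster graph nodes using hop counts as the distance metric.
    import networkx as nx
    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two 5-cliques joined by a 2-node path.
    G = nx.barbell_graph(5, 2)
    n = G.number_of_nodes()

    # Pairwise shortest-path lengths, in hops.
    hops = np.zeros((n, n))
    for src, lengths in nx.all_pairs_shortest_path_length(G):
        for dst, d in lengths.items():
            hops[src, dst] = d

    # min_samples=4 keeps the sparse bridge nodes from becoming core
    # points, so the two cliques come out as separate clusters.
    labels = DBSCAN(eps=1.5, min_samples=4,
                    metric='precomputed').fit_predict(hops)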

~~~
makmanalp
Brilliant, that's what I figured, thank you!

------
Xcelerate
I would say there are two aspects of clustering that are important: accuracy
of the model, and accuracy of the model fitting process.

A model is a guess about the underlying process that "generates" the data. If
you're trying to use hyperplanes to divide data that lies on a manifold, then
you are going to have poor results no matter how good your fitting algorithm
is.

On the other hand, even if you know the true model, high levels of noise can
prevent you from recovering the correct parameters. For instance, Max-Cut is
NP-hard, and the best we can do is a semidefinite programming approximation.
Beyond a certain noise threshold, the gap between the SDP solution and the
true solution becomes very large very quickly.

~~~
leecarraher
A soft-margin SVM would be useful here. And there are manifold-aware
topological clustering algorithms, like persistent homology techniques;
Gunnar Carlsson is the man on this new clustering offshoot.

~~~
lmcinnes
Oddly enough HDBSCAN can be recast as a persistent homology computation; the
trick is in simplicial complex construction; you need something slightly more
density aware than traditional Rips complexes. I am currently working on
designing a version of Carlsson and Memoli's Multiparameter Hierarchical
Clustering based on the persistent homology interpretation of HDBSCAN.

------
bmordue
This is a great resource. As someone who knows little about clustering, this
was clear and very informative. It covers potential pitfalls much better than
other similar documents I've seen, which is a useful approach.

------
kasperset
How about t-SNE for clustering?
[https://lvdmaaten.github.io/tsne/](https://lvdmaaten.github.io/tsne/)

~~~
eigenman
The t-SNE algorithm isn't a clustering algorithm but a nonlinear embedding.
Clustering algorithms assign a label (or no label) to each point in the data
set; t-SNE instead finds low-dimensional coordinates for each point such
that points that are nearby in the original data are nearby in the lower
dimensional representation.
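
A minimal sketch of that distinction (using the sklearn digits data as a
stand-in):

    # t-SNE returns coordinates, not labels.
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, _ = load_digits(return_X_y=True)
    coords = TSNE(n_components=2, random_state=0).fit_transform(X)
    # coords has shape (n_samples, 2): an embedding with no cluster
    # assignments. Producing labels (e.g. DBSCAN on coords) is a
    # separate, second step.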

That being said, I've seen t-SNE used as a preprocessing step prior to
clustering, as in ACCENSE [1].

[1]: [http://www.cellaccense.com/](http://www.cellaccense.com/)

------
popra
As someone interested in the subject but with zero insight or knowledge:
would these algorithms be a good match for short text clustering? For
example, identifying identical products in a price comparison app based on
their similar but not identical title/name, and possibly other attributes.

~~~
karanlyons
The answer is maybe. You need to reduce that text to a set of features that
you can pipe into a clustering algorithm, and what’ll really make or break
your approach is how you convert the text.

There are two types of features, with corresponding clustering algorithms:
categorical and numerical. Numerical features are the most commonly supported
by out-of-the-box clustering tools (meaning most of the algorithms made easily
available to you will expect numerical features). Categorical features can be
reduced to binary numerical features, though whether that makes sense for your
data depends: for example, if products come with a set of categories they can
be sorted into, you may want to just use that data along with K-Modes
clustering or some other approach that handles categorical features directly.
(Though to me those don’t really seem to produce useful results.)
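
(For what it’s worth, that categorical-to-binary reduction is just one-hot
encoding; a minimal sketch with made-up categories:)

    # One-hot encode a categorical feature into 0/1 numerical columns.
    from sklearn.preprocessing import OneHotEncoder

    categories = [['electronics'], ['kitchen'], ['electronics'], ['toys']]
    binary = OneHotEncoder().fit_transform(categories).toarray()
    # Each row is now a 0/1 vector with one column per distinct
    # category, which any numerical clustering algorithm can consume.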

For short text clustering I’d try an ngram frequency approach, wherein you
reduce the text to a set of features describing the frequency of ngrams from,
say, unigrams up to trigrams. You’re attempting to balance the number of
features you end up needing to process against the amount of locality
information you need to get a useful result. If you end up with far too many
features, you could attempt to cull them with an approach such as PCA, but
that may not even be necessary.
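
A rough sketch of that ngram-frequency idea, with made-up product titles
(character ngrams here; word ngrams work the same way via analyzer='word'):

    # TF-IDF over character 1- to 3-grams, then density-based clustering.
    import hdbscan
    from sklearn.feature_extraction.text import TfidfVectorizer

    titles = [
        "Acme Widget 2000 (blue)",
        "ACME Widget-2000, Blue",
        "SuperClean Vacuum X9",
        "Super Clean Vacuum X-9",
    ]
    features = TfidfVectorizer(analyzer='char',
                               ngram_range=(1, 3)).fit_transform(titles)
    labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(
        features.toarray())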

Do note that I’m only a little above 0 in terms of insight or knowledge, so
take my advice with a good heap of salt. Clustering this data would group
together products that are described “similarly”, but “similarly” here
doesn’t imply the products are in fact similar, only that they’re described
in a similar manner.

You may also want to explore graph clustering over vector clustering,
which—while there may be no out of the box solutions readily available—is
likely a much better fit for text in general.

~~~
popra
Oh wow, I’m still amazed at the level of effort the HN crowd puts into being
helpful. You, sir, rock!

------
Wei-1
Why not use density peak clustering, based on this 2014 Science paper?
[http://science.sciencemag.org/content/344/6191/1492](http://science.sciencemag.org/content/344/6191/1492)

~~~
lmcinnes
Density Peak clustering is an interesting idea, but it is still centroid
based and will have many of the same issues as K-Means (though it may pick
better centroids for this case). It also involves a little bit of parameter
tuning (particularly with regard to bandwidth), and can be relatively slow if
not suitably accelerated with spatial indexing structures such as kd-trees or
cover trees.

------
leecarraher
It would be interesting to know which agglomerative clustering linkage the
author is using here. I suspect that for this two-dimensional, density-based
cluster dataset, single-link agglomerative clustering would perform much
better than what is shown (likely average link).

~~~
lmcinnes
It was actually Ward. I agree that single linkage would have performed better,
however the noise would have greatly confused the issue. Robust Single Linkage
(Chaudhuri and Dasgupta, 2010) would be the better choice in the presence of
noise, as with the test dataset (which was designed to be as difficult as
possible for clustering algorithms, while being obvious to a human viewer).
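
A quick sketch of the linkage difference on moon-shaped data (my own toy
example, not the article's dataset, and assuming a scikit-learn recent
enough to support linkage='single'):

    # Ward favors compact, evenly sized clusters; single linkage chains
    # through density-connected shapes (and through noise, when present).
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
    ward = AgglomerativeClustering(n_clusters=2,
                                   linkage='ward').fit_predict(X)
    single = AgglomerativeClustering(n_clusters=2,
                                     linkage='single').fit_predict(X)
    # single recovers the two moons; ward tends to cut across them.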

------
amelius
Interesting, but the subtitle "Why you should use HDBSCAN" makes little sense
on a dataset of N=1.

~~~
lmcinnes
That's fair, but the subtitle was intended to be a little controversial. I
could have included more datasets, but ultimately that just clutters the
exposition -- instead I chose a dataset that can illustrate several different
ways clustering algorithms can break. Ultimately it is meant to be a teaser as
to why you should use HDBSCAN; the real answer is to grab the code and run it
on your data, or your favorite test datasets, and see how it performs for you
(because there is no substitute for that). I'm actually pretty confident
you'll find the results impressive enough to make HDBSCAN a go-to choice for
clustering.
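
If you do want to try it, the basic usage is minimal (toy blob data
standing in for your own):

    # Basic hdbscan usage; noise points get the label -1.
    import hdbscan
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples=1000, centers=4, random_state=0)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
    labels = clusterer.fit_predict(data)
    strengths = clusterer.probabilities_  # per-point membership strength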

------
chestervonwinch
Cool. I have not looked into the *DBSCAN methods, yet. This post makes me
think I should.

~~~
merusame
Then you should also have a look at OPTICS, which was one of the first
density-based clustering algorithms:
[https://en.wikipedia.org/wiki/OPTICS_algorithm](https://en.wikipedia.org/wiki/OPTICS_algorithm)
and [http://stackoverflow.com/questions/5515675/python-
implementa...](http://stackoverflow.com/questions/5515675/python-
implementation-of-optics-clustering-algorithm)
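
(A minimal sketch, assuming a scikit-learn version recent enough to ship
sklearn.cluster.OPTICS:)

    # OPTICS on toy blob data; noise points get the label -1.
    from sklearn.cluster import OPTICS
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
    labels = OPTICS(min_samples=10).fit_predict(X)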

------
dweinus
Great article! Does anyone know of an implementation of HDBSCAN for R?

~~~
lmcinnes
I don't know of any implementations yet. On the other hand it isn't that hard
to get the basics working (see
[http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/mas...](http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb)
for an explanation of the algorithm). The trickier part is getting good
performance, but for small datasets that doesn't really matter. Hopefully
someone will put together an implementation for R soon.

~~~
dweinus
Thank you for the response and also all of your hard work in the Python
package!

------
graycat
You can find such comparisons by Glenn W. Milligan (Ohio State) going back to
at least 1980.

