
Clustering in R - gk1
https://blog.dominodatalab.com/clustering-in-r/
======
tom_b
Glad to see the article mention dropping in the fastcluster package to replace
the default R hclust. I'd also suggest the parallelDist package as a drop-in
replacement for the standard dist.

Clustering in general, and hierarchical clustering in particular, is something
I've spent some recent time coming up to speed on. The current state of the
art seems to be graph-based community detection (e.g., the Louvain method),
where a graph for a set of samples with N features is built by starting from a
K-nearest-neighbors graph and assigning some weight to the edges.
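The kNN-graph construction step described above can be sketched roughly as follows (in Python for brevity; the brute-force neighbor search and the Gaussian edge weight are both assumptions for illustration, since the comment only says "some weight" - real pipelines use approximate nearest neighbors and then run Louvain on the resulting graph):

```python
# Minimal sketch of kNN-graph construction with weighted edges.
import math

def knn_graph(points, k, sigma=1.0):
    """Return weighted undirected edges {(i, j): w} of a kNN graph."""
    n = len(points)
    edges = {}
    for i in range(n):
        # distance from sample i to every other sample (brute force)
        dists = sorted(
            (math.dist(points[i], points[j]), j)
            for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            # undirected edge, weighted by a Gaussian kernel of distance
            edges[(min(i, j), max(i, j))] = math.exp(-d**2 / (2 * sigma**2))
    return edges

# two well-separated groups of samples
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
g = knn_graph(points, k=2)
```

A community-detection routine (e.g., igraph's cluster_louvain in R) would then be run on these weighted edges.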

~~~
sargram01
> Current state of the art seems to be using graph-based

Do you know how this compares to UMAP?

~~~
Der_Einzige
I strongly suspect that it's worse than UMAP or Ivis. Graph-based methods are
great for some things, but not clustering.

~~~
tom_b
We've found that graph-based community approaches have some really nice
benefits in our bioinformatics data.

In particular, we have found that these approaches seem to preserve very small
cluster structure "better" than traditional approaches. Meaning, we have a
small group of cells that we know belong to their own cluster group and the
graph-based community approaches preserve these "small" groups outside of
other clusters nicely.

But we have also noticed (and had some feedback) that we wind up with final
modularity scores that are very high - greater than or equal to 0.90 (on a
scale of -1 to 1). Applied math folks in the graph algorithms world kind of
seem to look at that and go "eh, that is so high you should probably just do
PCA and move on . . . "
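For reference, the modularity score being quoted here (Newman's Q, the quantity Louvain maximizes) can be computed directly. A stdlib-Python sketch for an unweighted, undirected graph, using Q = (1/2m) * sum over ij of [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j):

```python
# Modularity Q of a partition, on the [-1, 1] scale mentioned above.
def modularity(edges, communities):
    """edges: list of (u, v); communities: dict node -> community id."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # fraction of edges that fall inside a community
    q = sum(1.0 for u, v in edges if communities[u] == communities[v]) / m
    # minus the fraction expected under a random degree-preserving rewiring
    two_m = 2 * m
    for c in set(communities.values()):
        deg_c = sum(d for n, d in degree.items() if communities[n] == c)
        q -= (deg_c / two_m) ** 2
    return q

# two triangles joined by a single bridge edge -> two obvious communities
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
q = modularity(edges, part)  # about 0.357 for this split
```

Values near 0.9 mean almost all edge mass sits inside communities, which is the "suspiciously clean" situation described above.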

Especially given that you could (and people seem to) use UMAP as a precursor
to Louvain methods, I'll probably be looking into UMAP to see how it goes. Our
current computational bottleneck is the clustering step itself (the Louvain
community detection on the graph), so we'd like to whittle that runtime down
as much as possible.

------
RosanaAnaDana
What a great article. Great visualizations, and I particularly like the focus
on post-clustering analysis. This doesn't get addressed enough in
undergraduate/graduate-level training on clustering.

I will raise one 'issue' that I expected to see in the first section regarding
k-means, and clustering in general: variable importance/variable selection.

I didn't see anything addressing normalization of values, how to deal with
factor variables, or variable weighting. Clustering for data science is
something I engage in daily, and while it's a bit of a detail, how you
normalize (if you do so) and how you weight variables determines whether you
end up with a useful set of clusters or not. Maybe it's an aside, but maybe
step through the k-means algorithm early on and show _why_ normalization and
weights matter. Just a suggestion.
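A toy illustration of the normalization point (sketched in Python rather than R, with made-up numbers): with one feature measured in dollars and one in years, the raw Euclidean distances that k-means relies on are dominated by the dollar column until both columns are z-scored.

```python
# Why scaling matters for k-means: one large-scale feature swamps the rest.
import math
from statistics import mean, stdev

def zscore(col):
    mu, sd = mean(col), stdev(col)
    return [(x - mu) / sd for x in col]

# rows: (income_in_dollars, years_of_experience)
rows = [(30000, 1), (31000, 9), (90000, 2), (91000, 8)]

# Raw distances: income dominates, so row 0 looks far closer to row 1
# than to row 2, regardless of the experience feature.
raw_d01 = math.dist(rows[0], rows[1])
raw_d02 = math.dist(rows[0], rows[2])

# After z-scoring each column, both features contribute comparably,
# and row 0 (1 year) is now nearer to row 2 (2 years) than to row 1.
cols = list(zip(*rows))
scaled = list(zip(*(zscore(c) for c in cols)))
scaled_d01 = math.dist(scaled[0], scaled[1])
scaled_d02 = math.dist(scaled[0], scaled[2])
```

In R the same effect shows up with dist() on a raw versus scale()'d matrix; variable weighting is then just multiplying a scaled column by a chosen factor.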

Edit: I suppose this is partly addressed through PCA, though there are also
ways of addressing these issues outside of PCA.

