
How UMAP Works - anigbrowl
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
======
haptork
Nice article.

We recently used t-SNE and UMAP in the field of radiation damage, for
visualising the damage shapes (point defect clusters of different shapes). The
results were interesting in many ways.

We found that the general layout of classes of shapes is more or less the same
in UMAP and t-SNE, so the global vs. local relationship argument didn't really
apply to our data. Since the data was around 1000 points of roughly
50-dimensional histograms, efficiency was not a big differentiator either. For
us, the real advantage of UMAP turned out to be how well it works with HDBSCAN
and its ability to embed new test data. We are excited to use it to further
categorise cascade and sub-cascade shapes (bigger damage areas).

To check the results, see the following:

\- [https://haptork.github.io/csaransh/presentation/index.html#/8/1](https://haptork.github.io/csaransh/presentation/index.html#/8/1)
: select t-SNE or UMAP in the left pane.

\- [https://haptork.github.io/csaransh/](https://haptork.github.io/csaransh/)
: go to the last pane, "Cluster Classes", and click on a point to see the shape
on the right. Select t-SNE or UMAP in the left pane. (It might take a while to
load.)

\- [https://arxiv.org/abs/1811.10923](https://arxiv.org/abs/1811.10923) :
arXiv paper

\- [https://github.com/haptork/csaransh](https://github.com/haptork/csaransh)
: GitHub repo

Ideas and suggestions are welcome.
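
The workflow described above (fit UMAP on existing data, cluster the embedding
with HDBSCAN, then project new points into the same embedding) might look
roughly like the sketch below. It is not taken from the csaransh code base: the
function names `make_histograms` and `embed_and_cluster`, the synthetic
stand-in data, and all parameter values are assumptions chosen for
illustration. It also assumes the `umap-learn` and `hdbscan` packages are
installed.

```python
import random


def make_histograms(n_points, n_bins, seed=0):
    """Synthetic stand-in data: each row is a histogram normalised to sum to 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_points):
        counts = [rng.random() + 1e-9 for _ in range(n_bins)]
        total = sum(counts)
        rows.append([c / total for c in counts])
    return rows


def embed_and_cluster(train, new, n_neighbors=15, min_cluster_size=10):
    """Fit UMAP on `train`, cluster the 2-D embedding, and embed `new` points."""
    # Imported lazily so the helper above works without these packages.
    import hdbscan  # pip install hdbscan
    import umap  # pip install umap-learn

    reducer = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=0.0,  # assumed value; tight packing tends to suit density-based clustering
        n_components=2,
        random_state=42,
    )
    # Learn the embedding from the existing data only.
    train_2d = reducer.fit_transform(train)
    # HDBSCAN marks points it cannot assign to a cluster with the label -1.
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(train_2d)
    # New points are projected into the already-fitted embedding.
    new_2d = reducer.transform(new)
    return train_2d, labels, new_2d
```

For example, `embed_and_cluster(make_histograms(1000, 50), make_histograms(20, 50, seed=1))`
returns the training embedding, one HDBSCAN label per training point, and the
embedded new points. On random data like this the clusters carry no meaning;
the point is only the split between fitting on old data and transforming new
data.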

------
zetazzed
UMAP is awesome, and we usually recommend it to people who are struggling with
t-SNE performance issues. If you have access to a GPU, it's worth checking out
cuML's UMAP implementation
([https://rapidsai.github.io/projects/cuml/en/0.10.0/api.html#cuml.UMAP](https://rapidsai.github.io/projects/cuml/en/0.10.0/api.html#cuml.UMAP)),
which is closely based on McInnes' original Python code but is much faster.

------
ejstronge
How do you people who work with high-dimensional data outside of biology feel
about t-SNE and/or UMAP?

Some of the points against t-SNE feel like comments that only non-computer
scientists would make (e.g., t-SNE must be run on a cluster/needs a lot of RAM
- despite the fact that rather few genetic datasets can be analyzed on the
commodity laptops most common among biologists).

~~~
Der_Einzige
UMAP is basically one of the most innovative things I've ever seen. It's
widely used in NLP with extremely good results.

~~~
throwaway66920
Example? Anything besides visualizing embedding spaces?

~~~
Der_Einzige
Any type of clustering

------
jmrko
Very interesting read. My interpretation is that clustering (e.g. HDBSCAN) on
UMAP-projected data makes at least some sense, unlike with t-SNE. Are there
any differing opinions on this? Interesting related discussion:
[https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne](https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne)

~~~
jointpdf
Here’s a pretty comprehensive answer on the topic from the original UMAP
author:
[https://github.com/lmcinnes/umap/issues/25](https://github.com/lmcinnes/umap/issues/25)

Clustering the output of UMAP is also given a nice tutorial in the docs:
[https://umap-learn.readthedocs.io/en/latest/clustering.html](https://umap-learn.readthedocs.io/en/latest/clustering.html)

Basically, the answer is yes you can do this, but verify and analyze the
output to ensure it makes sense (e.g. coloring points by known
features/labels). For example, if you have a small number of points in the
dataset (<1000), UMAP tends to display a dense cluster that is quite separated
from the remaining data. However, this apparent cluster is spurious and
contains noisy data points that UMAP couldn’t “figure out what to do with”
(they are similar in their dissimilarity to the other data).
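
One way to do the "verify against known labels" step numerically, alongside
colouring the plot, is to check how mixed each cluster is. The sketch below is
an illustration only: `cluster_purity` is a hypothetical helper, not part of
UMAP or HDBSCAN, and it assumes noise points carry the label -1, as HDBSCAN
assigns them.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_labels, known_labels, noise_label=-1):
    """Summarise how well each cluster matches the known labels.

    Returns {cluster_id: (majority_label, purity, size)}, where purity is the
    share of the cluster's points that carry its most common known label.
    Points labelled as noise are skipped.
    """
    members = defaultdict(list)
    for cluster, known in zip(cluster_labels, known_labels):
        if cluster != noise_label:
            members[cluster].append(known)

    report = {}
    for cluster, knowns in members.items():
        label, count = Counter(knowns).most_common(1)[0]
        report[cluster] = (label, count / len(knowns), len(knowns))
    return report


clusters = [0, 0, 0, 1, 1, -1, 1, 1]
known = ["a", "a", "b", "b", "b", "a", "b", "a"]
for cluster_id, (label, purity, size) in sorted(cluster_purity(clusters, known).items()):
    print(f"cluster {cluster_id}: majority={label} purity={purity:.2f} size={size}")
```

A well-separated cluster whose purity is low, meaning it mixes many known
labels, is a candidate for the kind of spurious cluster described above.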

------
twic
Mediumwalled - I can't read this without signing in to something. Does anyone
happen to know if there are cookies I can clear to work around this?

~~~
amrrs
Outline is another great option -
[https://outline.com/5MNPHn](https://outline.com/5MNPHn)

