
First 1e6 integers represented as binary vectors indicating their prime factors - tobr
https://mobile.twitter.com/jhnhw/status/1031829726757900288
======
tobr
There's a longer explanation of how the images were produced:

[https://johnhw.github.io/umap_primes/index.md.html](https://johnhw.github.io/umap_primes/index.md.html)

I had not heard of UMAP before, but it seems to be a tool to visualise
structure in high-dimensional datasets:

[https://github.com/lmcinnes/umap](https://github.com/lmcinnes/umap)
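For reference, the binary-vector representation from the title can be built like this (a minimal sketch of one plausible encoding, an indicator over the distinct prime divisors of each number; the author's exact code may differ):

```python
def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return [p for p in range(n + 1) if sieve[p]]

def factor_vectors(n):
    """For k = 1..n, a 0/1 vector marking which primes divide k."""
    primes = primes_up_to(n)
    index = {p: i for i, p in enumerate(primes)}
    vectors = []
    for k in range(1, n + 1):
        v = [0] * len(primes)
        m = k
        for p in primes:
            if p * p > m:
                break
            while m % p == 0:
                v[index[p]] = 1
                m //= p
        if m > 1:          # leftover prime factor
            v[index[m]] = 1
        vectors.append(v)
    return primes, vectors

primes, vectors = factor_vectors(20)
# e.g. 12 = 2^2 * 3 -> 1s at the positions of primes 2 and 3
```

Note the vectors only indicate *which* primes divide each number, not their multiplicities, which matches the "binary vectors" wording in the title.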

------
BenoitP
The prime factors of a number form the ultimate high-dimensional space.

Damn. The more I see UMAP, the more I think it is going to be a central and
generic tool for high-dimensional analysis. I haven't taken the time to dig
into it in depth yet, though :/

So far, my understanding of it is: t-SNE on steroids

* t-SNE is great at preserving local proximity, but it 'rips apart' global structure in high-dimensional data too early. UMAP handles both scales by finding transformations that stitch together the overlapping, locally relevant lower-dimensional patches.

* It is faster than t-SNE, and scales better with dataset size.

* t-SNE is about moving the points, whereas UMAP is about finding the transformations that move the points, which means:

a) it yields a model that you can use to create embeddings for unseen data.
This means you can share your work by contributing to public model zoos.

b) And you can also do supervised dimension reduction as you create your
embedding. I.e. you can judge whether the shape looks good for unseen data
(aka it generalizes well), and then correct the embedding by choosing which
unseen instances to add to the training set. This lets you control the cost of
labeling data: you can see where your errors are and feed them back into the
collection process in a cost-effective manner, even for high-dimensional data.

* You can choose your metric! Specify a distance function and you're good to go: Haversine for a great orange peeling, Levenshtein for visualizing word spelling (and maybe providing an embedding for ML-based spell checking?)

* You can choose an output space with more than 2 or 3 dimensions, in order to stop the compression at a specified level.

I believe it will replace t-SNE in the long term.

Here is a great video of the author presenting his work:

[https://www.youtube.com/watch?v=nq6iPZVUxZU](https://www.youtube.com/watch?v=nq6iPZVUxZU)

------
BenoitP
It'd be very interesting to have an interactive visualization, to see what the
clusters are made of.

------
jacknews
Insane - I first thought this might be an April Fools' type thing and he'd
posted one of his kids' paintings.

Is this structure really from the numbers or an artifact of the
representation?

