
A visualization of the prime factors of the first million integers - fanf2
https://johnhw.github.io/umap_primes/index.md.html
======
OscarCunningham
> A very pretty structure emerges; this might be spurious in that it captures
> more about the layout algorithm than any _"true"_ structure of numbers.

What would it look like if we used PCA rather than UMAP? PCA is simpler than
UMAP, so it's in some sense "less arbitrary". If the image is similar then we
know we're seeing something about numbers rather than about our methods.

~~~
thraway180306
PCA is dimensionally invalid; it destroys rather than preserves structure, and
consists of arbitrary linear algebra operations. It is "less arbitrary" the
way x86 assembly is "less arbitrary" with respect to C (in fact it ties you to
a certain mode of thinking).

~~~
mturmon
I don't think "arbitrary linear algebra operations" is a valid critique. If
you understand PCA as "take the SVD of the data", then the operations seem
arbitrary. But if you understand it as, "construct a low-rank approximation in
the L2 sense to the data, or its covariance", then it's not.

Also, I don't think that the (very legitimate) "dimensional" critique of PCA
applies here. The units on the coordinates of the representation are the same:
the presence or absence of that prime factor.
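
To make that representation concrete: each integer maps to a 0/1 vector
indexed by potential prime factors. A small stdlib-Python sketch (my own
illustration, not code from the thread):

```python
def prime_factor_vector(k, n_max):
    """0/1 vector over indices 0..n_max; entry j is 1 iff prime j divides k."""
    vec = [0] * (n_max + 1)
    d, m = 2, k
    while d * d <= m:  # trial division
        while m % d == 0:
            vec[d] = 1
            m //= d
        d += 1
    if m > 1:  # any leftover factor > 1 is prime
        vec[m] = 1
    return vec

# 12 = 2^2 * 3, so only positions 2 and 3 are set
v = prime_factor_vector(12, 20)
```

Every coordinate carries the same "unit" (divides / does not divide), which is
why the dimensional critique doesn't bite here.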

To the original question: my suspicion is that PCA might pull out the even
numbers (first PC) and the divisible by 3 numbers (second PC), because these
two factors may explain the most variability in the underlying vector
representation. If it did, that would be pretty intuitive, although not as
interesting.

---

Edited to add: Suspicion turned out to be true. For the first 2000 integers,
the top 6 PCs turned out to correspond to the first 6 primes (2, 3, 5, 7, 11,
13).

Plot at: [https://imgur.com/a/qi2Sx5u](https://imgur.com/a/qi2Sx5u)

    
    
      function [nums, pcs] = pca_prime(nMax, nPC)
      % binary representation: row k has ones at the distinct prime factors of k
      nums = zeros(nMax, nMax);
      for k = 2:nMax
        nums(k, factor(k)) = 1; % vector representation of "k"
      end
      % 2:end because we don't care about 1 as a "prime"
      pcs = pca(nums(2:end, :), 'NumComponents', nPC);
    

--

    
    
      [nums,pcs]=pca_prime(2000,10); % "svd" would work too
      plot(pcs(:,1:6)); % first 6 PCs
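
For anyone without MATLAB, the same experiment can be sketched in pure stdlib
Python (my own reimplementation, not the code above): build the binary factor
matrix, form its covariance, and power-iterate for the dominant principal
direction, which should land on the column for the prime 2.

```python
def distinct_prime_factors(k):
    """Set of distinct prime factors of k, by trial division."""
    fs, d = set(), 2
    while d * d <= k:
        while k % d == 0:
            fs.add(d)
            k //= d
        d += 1
    if k > 1:
        fs.add(k)
    return fs

n = 100
# Rows: integers 2..n; column j flags whether the number (j + 2) divides k.
rows = []
for k in range(2, n + 1):
    fs = distinct_prime_factors(k)
    rows.append([1 if (j + 2) in fs else 0 for j in range(n - 1)])

m, d = len(rows), n - 1
means = [sum(r[j] for r in rows) / m for j in range(d)]
cov = [[sum(r[a] * r[b] for r in rows) / m - means[a] * means[b]
        for b in range(d)] for a in range(d)]

# Power iteration: converges to the dominant eigenvector (first PC direction).
v = [1.0] * d
for _ in range(300):
    w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    norm = sum(x * x for x in w) ** 0.5
    v = [x / norm for x in w]

top_factor = max(range(d), key=lambda j: abs(v[j])) + 2  # expect 2
```

The dominant component concentrates on the column for 2, consistent with the
plot linked above.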

~~~
murbard2
If you think of the covariance matrix, entry i,j for i ≠ j will be

    
    
       floor(n / (p[i]*p[j])) / n - floor(n / p[i]) * floor(n/p[j]) / n^2
    

and the ith diagonal entry will be

    
    
       floor(n / p[i]) / n - ( floor(n / p[i]) / n )^2
    

For large n, this is approximately a diagonal matrix with diagonal entries /
eigenvalues 1/p[i] - 1/p[i]^2.
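
Those expressions are exact for the empirical covariance of the divisibility
indicators, which a throwaway check confirms (my own snippet; p = 3 and q = 5
chosen arbitrarily):

```python
n, p, q = 10000, 3, 5

xs = [1 if k % p == 0 else 0 for k in range(1, n + 1)]  # indicator of p | k
ys = [1 if k % q == 0 else 0 for k in range(1, n + 1)]  # indicator of q | k

mean_x, mean_y = sum(xs) / n, sum(ys) / n
emp_cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
emp_var = sum(xs) / n - mean_x ** 2  # entries are 0/1, so x^2 == x

# The closed forms from the comment above:
off_diag = (n // (p * q)) / n - (n // p) * (n // q) / n ** 2
diag = (n // p) / n - ((n // p) / n) ** 2
```

The off-diagonal entry is tiny (order 1e-4 here) while the diagonal is close
to 1/3 - 1/9, matching the near-diagonal-covariance argument.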

~~~
mturmon
Smart observation. Another way to say it is that, for distinct primes p1 and
p2, the events “p1 divides n”, and “p2 divides n”, are approximately
statistically independent. So you get a near-diagonal covariance with entries
as you wrote.

------
BenoitP
Reposting my comment from here:

[https://news.ycombinator.com/item?id=17816981](https://news.ycombinator.com/item?id=17816981)

----

The prime factors of a number form the ultimate high-dimensional space.

Damn. The more I see UMAP, the more I think it is going to be a central and
generic tool for high-dimensional analysis. I haven't taken the time to go
into it in depth yet, though :/

So far, my understanding of it is: t-SNE on steroids.

* t-SNE is great for local proximity, but it 'rips' high-dimensional global structure too early. UMAP handles both scales by using transformations that map together the overlapping, locally relevant lower-dimensional patches.

* It is faster than t-SNE, and has a better scale factor.

* t-SNE is about moving the points, whereas UMAP is about finding the transformations that move the points, which means:

a) it yields a model that you can use to create embeddings for unseen data.
This means you can share your work by contributing to public model zoos.

b) And you can also do supervised dimension reduction as you create your
embedding. I.e., you can judge whether the shape looks good for unseen data
(i.e. whether it generalizes well), and then correct the embedding by choosing
which unseen instances to add to the training set. This means you control the
cost of labeling data: you can see where your errors are, and back-propagate
them to the collection process in a cost-effective manner. For
high-dimensional data.

* You can choose your metric! Specify a distance function and you're good to go. Haversine for a great orange peeling, Levenshtein for visualizing word spelling (and maybe providing an embedding for ML-based spell checking?)

* You can choose the output space to be greater than 2 or 3, in order to stop the compression at a specified level.
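
On the custom-metric point above: any symmetric pairwise distance will do. For
instance, a stdlib-Python Levenshtein distance of the kind one might plug in
for word-spelling data (a sketch of the metric itself, not of UMAP's API):

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]
```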

I believe it will replace t-SNE in the long term.

Here is a great video of the author presenting his work:

[https://www.youtube.com/watch?v=nq6iPZVUxZU](https://www.youtube.com/watch?v=nq6iPZVUxZU)

~~~
macleginn
> * You can choose your metric! Specify a distance function and you're good
> to go. Haversine for a great orange peeling, Levenshtein for visualizing
> word spelling (and maybe provide an embedding for ML-based spell checking?)

t-SNE, or at least the implementation I usually work with (from the R Rtsne
package), happily accepts any distance matrix as input. I've successfully used
all kinds of distance measures with it.

------
ttoinou
Nice. Reminds me of Buddhabrot images
[http://erleuchtet.org/2010/07/ridiculously-large-buddhabrot.html](http://erleuchtet.org/2010/07/ridiculously-large-buddhabrot.html)

~~~
NKosmatos
That’s why I love HN, you start reading a post along with the comments and you
see nice links like this one that take you down another (similar) road ;-)

------
tzs
I'd like to see this for various pseudo-random number generators, both
ordinary and cryptographically secure.

------
kiki_jiki
I didn't really understand how they reduce to 2 dimensions. Can somebody
explain?

~~~
throwawaymath
If what you're asking about is the math, the steps are (essentially) as
follows:

1. A Riemannian manifold is constructed from the dataset.

2. The manifold is approximately mapped to an _n_-dimensional topological
structure.

3. The reduced embedding is an (_n_ - _k_)-dimensional projection equivalent
to the initial topological structure, where _k_ is the number of dimensions
you'd like to reduce by.

I don't know how well that answers your question because it's difficult to
simplify the math beyond that. But you can also check out the paper on arXiv.
[1]

The underlying idea is to transform the data into a topological
representation, analyze its structure, then find a much smaller (dimensionally
speaking) topological structure which is either the same thing ("equivalent")
or very close to it. You get most of the way there by thinking about how two
things which look very different can be topologically the same based on their
properties. A pretty accessible demonstration of that idea is the classical
donut <-> coffee mug example on the Wikipedia page for homeomorphisms. [2]

__________________

1.
[https://arxiv.org/pdf/1802.03426.pdf](https://arxiv.org/pdf/1802.03426.pdf)

2.
[https://en.wikipedia.org/wiki/Homeomorphism](https://en.wikipedia.org/wiki/Homeomorphism)

~~~
digitaLandscape
Is this actually capturing any properties of the original set, or is this a
set of operations that will make any input look similar? (I.e., is this just a
pretty picture with no real connection to the math?)

~~~
thraway180306
It captures more of the spatial (metric/topological) arrangement of the set.
The example they give in the paper is the MNIST dataset, where
distinct-looking digits like 1 and 0 are separated farther apart and similar
ones clump together, whereas t-SNE, while correctly delimiting the individual
clusters, clumps them all into one blob.

~~~
digitaLandscape
Cool. Thanks.

------
d33
I wonder what it would look like for completely random numbers?

~~~
mathgenius
Yeah it's hard to tell if this visualization is picking up anything
interesting about prime numbers or not.

------
tbirrell
I don't understand. Why are the points of light moving?

~~~
Mary-Jane
The color scale runs from min to max at the time each frame is rendered. Each
frame adds 1000 integers to the previous set and halves the brightness of the
previously plotted points. This creates the illusion of movement.
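
If I'm reading that right, a point's brightness decays geometrically with its
age in frames, i.e. something like (my interpretation, not the author's actual
rendering code):

```python
def brightness(frame_added, current_frame):
    """Brightness of a point: halved once per frame since it first appeared."""
    return 0.5 ** (current_frame - frame_added)
```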

------
baxtr
Looks a bit like our universe...

------
thanatropism
It's curious to see UMAP compared primarily to t-SNE and not to MDS, Isomap,
LLE, LTSA, etc.

------
pp19dd
Was I the only one who saw a pattern resembling the flying spaghetti monster?

------
Razengan
Looking at such visualizations of mathematical axioms, like these and the Ulam
Spiral [0], gives me a kind of vague... intuition? apprehension? feeling/fun-
idea-to-muse-about: that maybe Reality began from these.

As in, at the root of the "What caused the Big Bang?" or "But who created
God?" questions: stuff like 1+1=2 shouldn't need a root cause, and would give
rise to patterns like these.

[0]
[https://en.wikipedia.org/wiki/Ulam_spiral](https://en.wikipedia.org/wiki/Ulam_spiral)

------
benrbray
What are the loopy things? What do the starbursts correspond to??

~~~
mturmon
My conjecture is that the loops (especially the loops outside the main clump
at the center) might correspond to newly-introduced prime factors.

As a new batch of integers is introduced in going from frame N to frame N+1,
the prime numbers within that batch would become new points in the 2d
projection, because they have not been seen before.

Then, as you go from frame N+1 to frame N+2, etc., the new prime factors from
frame N start to re-occur in those successive frames, and new points are added
to the loop.

------
frayesto
There's a lot of interesting structure. Does this suggest some kind of
structure to the space of prime factors or is it just trying to attach meaning
where none exists?

~~~
vesinisa
Given that there is definite visual structure to the primes themselves[1], it
would be rather surprising if prime factorization had none.

[1]
[https://en.wikipedia.org/wiki/Ulam_spiral](https://en.wikipedia.org/wiki/Ulam_spiral)

~~~
tess0r
Ulam spirals are super interesting. I developed a visualization (and some
explanation) once just to understand the problem a little better. Maybe this
sparks some interest in someone :)

[https://tessi.github.io/walking-the-ulam-spiral/](https://tessi.github.io/walking-the-ulam-spiral/)

------
davidrusu
It seems the markdown wasn't rendered to HTML; here's a link to the generated
image:
[https://johnhw.github.io/umap_primes/primes_umap_1e6_4k.png](https://johnhw.github.io/umap_primes/primes_umap_1e6_4k.png)

~~~
anc84
The site uses JavaScript to render the Markdown to HTML client-side. Reminds
me of when I thought XML with client-side XSLT for rendering was a good idea,
lol.

I enjoyed the plain markdown page though, very readable.

~~~
davidrusu
I see. I've set my browser to block third-party JavaScript; on inspection, it
seems the author has decided to load the markdown rendering script from what
looks to be an analytics firm
([https://casual-effects.com/markdeep/latest/markdeep.min.js](https://casual-effects.com/markdeep/latest/markdeep.min.js))

Interesting choice

~~~
asicsp
Markdeep is an extension of markdown; see
[https://casual-effects.com/markdeep/](https://casual-effects.com/markdeep/)
for details.

~~~
davidrusu
Aha! I mistyped the domain when exploring and got to this site
[https://www.causal-effects.com/](https://www.causal-effects.com/)

The site you linked does look much friendlier (still... self-host your JS when
you can!)

------
pighive
This is so refreshing! Awesome to see and feel such a different perspective
over numbers.

------
rajacombinator
Cute and colorful collection of squiggly lines, but I don’t think this
visualization is useful for capturing any insights into prime factorization...

------
mgalka
Really cool! I'd love to see this go even further and see what other patterns
appear.

------
dbelchamber
BUT WHAT DOES IT MEAN?!

