Hacker News new | comments | ask | show | jobs | submit login
A visualization of the prime factors of the first million integers (johnhw.github.io)
438 points by fanf2 6 months ago | hide | past | web | favorite | 59 comments

> A very pretty structure emerges; this might be spurious in that it captures more about the layout algorithm than any "true" structure of numbers.

What would it look like if we used PCA rather than UMAP? PCA is simpler than UMAP, so it's in some sense "less arbitrary". If the image is similar then we know we're seeing something about numbers rather than about our methods.

PCA is dimensionally invalid, it destroys, not preserves structure and consists of arbitrary linear algebra operations. It is "less arbitrary" the way x86 assembly is "less arbitrary" wrt. C (actually it ties you to a certain mode of thinking).

I don't think "arbitrary linear algebra operations" is a valid critique. If you understand PCA as "take the SVD of the data", then the operations seem arbitrary. But if you understand it as, "construct a low-rank approximation in the L2 sense to the data, or its covariance", then it's not.

Also, I don't think that the (very legitimate) "dimensional" critique of PCA applies here. The units on the coordinates of the representation are the same: the presence or absence of that prime factor.

To the original question: my suspicion is that PCA might pull out the even numbers (first PC) and the divisible by 3 numbers (second PC), because these two factors may explain the most variability in the underlying vector representation. If it did, that would be pretty intuitive, although not as interesting.


Edited to add: Suspicion turned out to be true. For the first 2000 integers, the top 6 PCs turned out to correspond to the first 6 primes (2, 3, 5, 7, 11, 13).

Plot at: https://imgur.com/a/qi2Sx5u?

  function [nums,pcs]=pca_prime(nMax,nPC)

  nums = zeros(nMax, nMax);
  for k = 2:nMax,
    nums(k,factor(k)) = 1; % vector representation of "k"
  % 2:end because don't care about 1 as a "prime"
  pcs = pca(nums(2:end,:), 'NumComponents', nPC);

  [nums,pcs]=pca_prime(2000,10); % "svd" would work too
  plot(pcs(:,1:6)); % first 6 PCs

If you think of the covariance matrix, entry i,j for i ≠ j will be

   floor(n / (p[i]*p[j])) / n - floor(n / p[i]) * floor(n/p[j]) / n^2
and the ith diagonal entry will be

   floor(n / p[i]) / n - ( floor(n / p[i]) / n )^2
for n large, you approximately get a diagonal matrix with diagonal entries / eigenvalues 1/p[i] - 1/p[i]^2.

Smart observation. Another way to say it is that, for distinct primes p1 and p2, the events “p1 divides n”, and “p2 divides n”, are approximately statistically independent. So you get a near-diagonal covariance with entries as you wrote.

> If you understand PCA as "take the SVD of the data", then the operations seem arbitrary. But if you understand it as, "construct a low-rank approximation in the L2 sense to the data, or its covariance", then it's not.

Those are the same thing. Is your first case just a reader who has no idea what the SVD is?

Yes, they are both descriptions of the same thing. I'm trying to say that PCA does have a justification. It's not just an "arbitrary linear algebra operation", although the application of the SVD algorithm to perform PCA can be presented that way.

Are you saying the kth principal component is equal to the kth prime number?

Yes. See the plot: for example, PC #1 is essentially a 0/1 vector with all its weight (1 in this case) placed on the "2", which is the representation of the prime number 2 in the scheme used in the OP.

> dimensionally invalid

What does this mean?

Without (arbitrary) scaling normalization PCA gives different results with change of dimensional units, that is principal components depend on choice of units or scaling.

Oh okay. I did know about that weakness of PCA. But it seems like in this case letting one prime factor correspond to one unit of distance is a nonarbitrary scaling, so I would expect PCA to give sensible results.

Reposting my comment from here:



Prime factors of a number is the ultimate high-dimensional space.

Damn. The more I see UMAP, the more I think it is going to be a central and generic tool for high-dimensional analysis. I haven't taken the time to go in depth into it yet, though :/

So far, my understanding of it is: t-SNE on steroids

* t-SNE is great for local proximity, but it 'rips' high dimensional global structures too early. UMAP solves both scales by using transformations to map overlapping points of the different lower dimensional spaces that are locally relevant.

* It is faster than t-SNE, and has a better scale factor.

* t-SNE is about moving the points when UMAP is about finding the transformations that move the points.. which means:

a) it yields a model that you can use to create embeddings for unseen data. This means sharing your work by contributing to the public model zoos.

b) And you can also do supervised dimension reduction as you create your embedding. Ie You can judge if the shape looks good for unseen data (aka it generalizes well), and then correct the embedding by choosing which unseen instances to add to the training set. This means you control the cost of labeling data. You can see where your errors are, and back-propagate them to the collection process in a cost effective manner. For high dimensional data.

* You can choose your metric! Specific a distance function and you're good to go. Haversine for a great orange peeling, Levenshtein for visualizing word spelling (and maybe provide an embedding for ML-based spell checking?)

* You can choose the output space to be greater than 2 or 3, in order to stop the compression at a specified level.

I believe it will replace t-SNE in the long term.

Here is a great video of the author presenting his work:


> * You can choose your metric! Specific a distance function and you're good to go. Haversine for a great orange peeling, Levenshtein for visualizing word spelling (and maybe provide an embedding for ML-based spell checking?)

t-SNE, at least the implementation I usually work with (from the R Rtsne package) happily accepts any distance matrix as input. I successfully used all kinds of distance measures with it.

That’s why I love HN, you start reading a post along with the comments and you see nice links like this one that take you down another (similar) road ;-)

I'd like to see this for various pseudo-random number generators, both ordinary and cryptographically secure.

I didn't really understand how they reduce to 2 dimensions. Can somebody explain?

If what you're asking about is the math, the steps are (essentially) as follows:

1. A Riemannian manifold is constructed from the dataset.

2. The manifold is approximately mapped to an n-dimensional topological structure.

3. The reduced embedding is an (n - k)-dimensional projection equivalent to the initial topological structure, where k is the number of dimensions you'd like to reduce by.

I don't know how well that answers your question because it's difficult to simplify the math beyond that. But you can also check out the paper on arXiv. [1]

The underlying idea is to transform the data into a topological representation, analyze its structure, then find a much smaller (dimensionally speaking) topological structure which is either the same thing ("equivalent") or very close to it. You get most of the way there by thinking about how two things which look very different can be topologically the same based on their properties. A pretty accessible demonstration of that idea is the classical donut <-> coffee mug example on the Wikipedia page for homeomorphisms. [2]


1. https://arxiv.org/pdf/1802.03426.pdf

2. https://en.wikipedia.org/wiki/Homeomorphism

This is a good summary. For a video that explains some of this (but which still hand waves), see the author's presentation at SciPy 2018: https://www.youtube.com/watch?v=nq6iPZVUxZU

Is this actually capturing any properties of the original set, or is this a set of operations that will make any input look similar? (i.e. is this just a pretty picture with no real connection to the math.)

You're asking about two somewhat different things.

In the strict sense two things which are equivalent share the same properties, yes. This is the topological generalization of algebraic homomorphisms and analytic bijections. See the example about coffee mugs and donuts both being topological tori.

That being said I can't really comment on the potential artifacting details of this specific algorithm. In theory the overarching idea makes sense because if you find structure preserving maps between sets of varying dimensions you should expect relations within the set to be preserved (i.e. the relational information in the smaller set is equivalent, there's just less of it). But practically speaking not all datasets can be usefully abstracted to a manifold in this way, which means that (efficiently) finding an equivalent lower dimensional projection for the embedding might involve a fair amount of approximation.

With enough approximation you'll introduce spurious artifacts. But that's precisely where all the innovation comes in - finding ways to efficiently find equivalent structures with the representative data in fewer dimensions. This isn't the first proposal for topological dimension reduction (see information geometry); the devil is really in the details here.

It captures more of the spatial (metric topological) arrangement in the set. Example they give in the paper is the MINST dataset where distinctly looking digits like 1 and 0 get separated farther apart and similar ones clump together, whereas t-SNE while correctly delimiting individual clusters clumps them all in a blob.

Cool. Thanks.

sidenote: i only recently realized the X in arXiv is a greek chi, so its sonically "archive"...

Yep, just like LaTeX :)

"latechi?" ive always heard it "lah-tek"

Latech is how I have heard it.

im assuming you mean hard "a" like lay-tech. ive heard that too.

Probably similar to how a camera takes the photons from a 3D source and projects it onto film (or photosensitive plate).

I wonder what it would look like for completely random numbers?

Yeah it's hard to tell if this visualization is picking up anything interesting about prime numbers or not.

The last image in the post, which is in section, has an image formed from random numbers. For some definition of random. I imagine you would get different results by choosing how the randomness is applied.

I don't understand. Why are the points of light moving?

The color scale is min to max at the time the frame was rendered. Each frame adds 1000 to the previous set, halving their brightness from the previous frame. This creates the illusion of movement.

Looks a bit like our universe...

It's curious to see UMAP compared primarily to t-SNE and not to MDS, Isomap, LLE, LTSA, etc.

Was I the only one who saw a pattern resembling the flying spaghetti monster?

Looking at such visualizations of mathematical axioms, like these and the Ulam Spiral [0], gives me a kind of vague .. intuition? apprehension? feeling/fun-idea-to-muse-about, that maybe Reality began from these.

As in the root of the "What caused the Big Bang" or "but who created God?" questions. Stuff like 1+1=2 shouldn't need a root cause, and would give rise to patterns like these.

[0] https://en.wikipedia.org/wiki/Ulam_spiral

What are the loopy things? What do the starbursts correspond to??

My conjecture is that the loops (especially the loops outside the main clump at the center) might correspond to newly-introduced prime factors.

As a new batch of integers is introduced in going from frame N to frame N+1, the prime numbers within that batch would become new points in the 2d projection, because they have not been seen before.

Then, as you go from frame N+1 to frame N+2, etc., the new prime factors from frame N start to re-occur in those successive frames, and new points are added to the loop.

There's a lot of interesting structure. Does this suggest some kind of structure to the space of prime factors or is it just trying to attach meaning where none exists?

Prime factors are defined by a pretty rigid structure; you see a factor of 2 every two numbers; a factor of 3 every three numbers; a factor of 5 every five numbers...

For extra fun, if you plot the number of times a specific prime factor occurs, you get a fractal pattern. E.g. for the prime factor 2:

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
  0  1  0  2  0  1  0  3  0  1  0  2  0  1  0

structure to the space of prime factors

There is topological structure https://math.stackexchange.com/questions/2879/mapping-natura... http://ncatlab.org/nlab/show/arithmetic+topology as well as geometric (arithmetic geometry). Question is how does the picture correspond to any of it. Glancing at the UMAP paper https://arxiv.org/abs/1802.03426 it looks like a proper invariant preserving embedding, so maybe loops in the image are real. Even if not, it does hint artistically at sorts of stuff that exist.

Given that there is definitive visual structure to primes themselves[1], it would be rather surprising if prime factorization had none.

[1] https://en.wikipedia.org/wiki/Ulam_spiral

Ulam sprials are super interesting. I developed a visualization (and some explanation) once just to understand the problem a little better. Maybe this sparks some interest in someone :)


"A very pretty structure emerges; this might be spurious in that it captures more about the layout algorithm than any “true” structure of numbers."

Seems the markdown wasn't rendered to html, here's a link to the generated image https://johnhw.github.io/umap_primes/primes_umap_1e6_4k.png

The site uses JavaScript to render the Markdown to HTML client-side. Reminds me of when I thought XML with client-side XSLT for rendering was a good idea, lol.

I enjoyed the plain markdown page though, very readable.

I see, I've set my browser to block third party javascript, on inspection, it seems the author has decided to download the markdown render script from what looks to be an analytics firm. (https://casual-effects.com/markdeep/latest/markdeep.min.js)

Interesting choice

markdeep is an extension of markdown, see https://casual-effects.com/markdeep/ for details

Aha! I mistyped the domain when exploring and got to this site https://www.causal-effects.com/

The site you linked does look much friendlier (still.. self host your js when you can!)

This is so refreshing! Awesome to see and feel such a different perspective over numbers.

Cute and colorful collection of squiggly lines but I don’t think this visualization is useful for capturing any insights into prime factorization...

Really cool! I'd love to see this go even further and see what other patterns appear.


Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact