Principal component analysis: pictures, code and proofs (2018) (joellaity.com)
179 points by stuffypages on March 11, 2019 | hide | past | favorite | 31 comments

There's a running gag at the local polytechnic: All the engineering fields have one thing in common – you measure data, put it into a matrix, calculate its covariance matrix and find the eigenvectors and eigenvalues.
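The recipe from the gag can be sketched in a few lines of NumPy (synthetic data, all numbers made up): measure, form the covariance matrix, eigendecompose.

```python
import numpy as np

# Hypothetical data: 200 measurements of 3 variables, with column 2
# deliberately made to track column 0.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
data[:, 2] = data[:, 0] + 0.1 * rng.normal(size=200)

# The recipe: covariance matrix, then eigenvectors and eigenvalues.
cov = np.cov(data, rowvar=False)          # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh, since cov is symmetric
order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The top eigenvector captures the shared direction of columns 0 and 2.
print(eigvals)
```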

In the case of radio direction finding, PCA allows you to distinguish multiple overlapping signals on the same frequency if they're hitting the antenna array from different directions.

If you have any other interesting applications in your own field, I'd like to hear them, too.

One of the more remarkable results in 21st-century human genetics was that the top two PCs of a genotype matrix capture geography to an amazing degree [1]. This is the PCA of an N x M matrix where you measure M genetic markers in N people in a relatively homogeneous population (Europeans in the paper above).

1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735096/

It's not my field, but I saw a video where a guy generated faces from pictures from his yearbook: https://www.youtube.com/watch?v=4VAkrUNLKSo

The interesting part was he used PCA and then assigned sliders to each component, and sorted them by importance. So the first slider would affect the image the most.

Does anyone have an explanation of PCA that's more accessible to laypeople / the less mathematically inclined?

By way of example, suppose you have a weather dataset with 30 parameters such as temperature, cloud cover, solar radiation, etc.

Your objective is to classify points on a map that have similar weather conditions.

You can do PCA on these geographical points at a given time. Your matrix will have as columns weather parameters and as rows the different geographical points.

PCA will determine how to best represent the variation in the data in fewer dimensions. If you are lucky, 80% of the variation can be shown in a 2D plot, as in OP's post.

Points in the x/y Cartesian plane that have similar weather will plot closer to each other.

This process reduces dimension. For example, temperature and solar radiation are strongly correlated and will tend to "push" in the same direction, so you can replace those parameters with a single virtual parameter. (For weather data, that would tend to be the x-axis in the PCA biplot, as temperature and similar parameters are often sufficient to classify a climate for the purposes I have encountered in agriculture.)

In a nutshell, this is a real-world example of when one can use PCA.
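The weather example above can be sketched with NumPy on synthetic data (sites, parameters, and the two hidden climate factors are all made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sites, n_params = 500, 30

# Hypothetical weather matrix: rows are geographical points, columns are
# parameters (temperature, cloud cover, solar radiation, ...). Two hidden
# climate factors drive all 30 parameters, plus small measurement noise.
climate = rng.normal(size=(n_sites, 2))
loadings = rng.normal(size=(2, n_params))
weather = climate @ loadings + 0.1 * rng.normal(size=(n_sites, n_params))

# PCA via eigendecomposition of the covariance matrix.
centered = weather - weather.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # decreasing variance

# Fraction of variance captured by the first two components, and each
# site's position in the 2D plot.
explained = eigvals[:2].sum() / eigvals.sum()
coords_2d = centered @ eigvecs[:, :2]
print(f"variance captured in 2D plot: {explained:.1%}")
```

Sites with similar underlying climate end up close together in `coords_2d`.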

A key point that I feel is missing from this explanation is that the concept of distance becomes almost meaningless in very high dimensional space [1]. PCA is a means of dimensionality reduction, but the reason you want to reduce dimensions is so that you can measure and compare the distance between pairs of samples.

[1] https://en.m.wikipedia.org/wiki/Clustering_high-dimensional_...

The "meaninglessness" of distances in high-dimensional spaces, as defined in that Wikipedia article (i.e. the maximum and minimum distances of points becoming relatively close to each other) only happens when all those dimensions are independent. If the data is actually low-dimensional, such that PCA loses no information (trivial case: tack a billion zeros to all your vectors) then it won't change the distances at all, so it doesn't matter whether you compute them in the high-dimensional space or the low-dimensional projection obtained with PCA.
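The trivial case above (tacking on zero dimensions) is easy to verify numerically with NumPy; this sketch uses made-up 3-dimensional data padded out to 1000 dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
points = rng.normal(size=(50, 3))

# Pad with zero dimensions: the data is still genuinely 3-dimensional,
# so projecting down to 3 components with PCA loses nothing.
padded = np.hstack([points, np.zeros((50, 997))])   # now "1000-dimensional"

centered = padded - padded.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:3].T                     # keep top 3 components

def pairwise(x):
    d = x[:, None, :] - x[None, :, :]
    return np.sqrt((d ** 2).sum(-1))

# Pairwise distances are identical in the high- and low-dimensional forms.
print(np.allclose(pairwise(padded), pairwise(projected)))
```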

PCA works under the assumption that you're dealing with a low-dimensional signal that was projected into a high-dimensional space and then corrupted by low-magnitude high-dimensional noise. By taking only the top k components, you filter out the noise and get a better signal. But if you have a low-magnitude high-dimensional signal corrupted by highly-correlated noise, then naively applying PCA will filter out the signal and leave you with only noise.

So whether PCA makes your data more meaningful or less really depends on whether that assumption is satisfied or not. Dimensionality reduction is no silver bullet.
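The favorable case, where the assumption holds, can be sketched with NumPy on synthetic data (a made-up rank-3 signal in 50 dimensions plus low-magnitude noise):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 300, 50, 3

# Low-dimensional signal projected into a high-dimensional space, then
# corrupted by low-magnitude high-dimensional noise.
signal = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
noisy = signal + 0.05 * rng.normal(size=(n, d))

# Keep only the top k components to filter out the noise.
centered = noisy - noisy.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
denoised = (u[:, :k] * s[:k]) @ vt[:k] + noisy.mean(axis=0)

err_before = np.linalg.norm(noisy - signal)
err_after = np.linalg.norm(denoised - signal)
print(err_after < err_before)
```

With the assumption reversed (high-dimensional signal, correlated noise), the same truncation would keep the noise and discard the signal.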

That's very interesting. Can you please clarify "if you have a low-magnitude high-dimensional signal corrupted by highly-correlated noise"?

I don't understand what "highly-correlated noise" means. I thought noise was uncorrelated to the signal by definition.

Also, I understand that PCA is not scale invariant, but I'm having trouble relating that to "low-magnitude, high-dimensional signal".

Thank you!

Noise can be correlated with itself. For example, if you're using a pressure sensor to detect whether the person carrying it has fallen, then the baseline atmospheric pressure is noise, while the signal is in relatively small changes corresponding to a meter or so of height difference. Because atmospheric pressure changes only slowly, it is highly correlated across successive measurements. If you did PCA on timelines recorded under different atmospheric pressure conditions, that would appear as the top component.

I used that example because something similar actually happened (although there's no indication PCA was used): https://semiengineering.com/training-a-neural-network-to-fal...

This is, I think, another example where one might consider discarding factor 1 and looking only at factors 2 and 3.

(Of course, a better approach if possible would be baseline normalisation.)

I sometimes think of survey responses as being low-magnitude, high-dimensional signals.

Take a list of 100 questions all answered between 1-5 (disagree to agree). Sometimes, PCA can help you group these statements into like concepts, which means instead of having to look at variation across all 100 statements you only need to look at a handful.

However, in doing so, you might miss out on an interesting pattern in a subset of the data where a given subset of respondents are consistently answering a single choice differently. Maybe in aggregate that signal doesn’t show up, and so you would wash it out in PCA. Obviously, in this case, each of the 100 questions is a dimension.

This distance that you speak of falls under my use of the word "classify", which in retrospect was a poor choice of word.

However, as explained by another comment, it does not always become "meaningless". In fact, it becomes variable. For example, if you are interested in secondary structure in your data, you can plot factors 2, 3, 4 or higher against another factor and often the change in pairwise distance at different factors is meaningful.

For example, I've had an example where factor 1 would classify low resolution parameters and factor 2 and further are the only factors of relevance.

When one examined the data, some parameters were in 40 square kilometer boxes and others in more like 0.5 square kilometer boxes. Factor 1 would tell you which parameters are in the 40 square kilometer boxes.

Another interesting example is that one can use this, for example, to counteract the effects of inflation on sales prices.

Here is a neat demo which uses PCA for image compression: https://timbaumann.info/svd-image-compression-demo/

Let's take the terms: first, principal; second, component; third, analysis.

Principal is kind of mathematics' way of saying something is important or critical. Component basically refers to parts. Analysis refers to this method's ability to help you understand what's going on with your data.

So the main idea here is actually pretty simple.

Let's say you want to understand what makes a difference to a person's SAT score. You track number of hours of study, maybe the average school SAT scores, age of the person, average of mock SAT scores.

Now you know that some of the columns of data are more important than others, and you want to know which ones.

So in regression, what you would do is fit a line that best lies in the middle of these points by playing around with the weights, or 'importance', of the columns until you get the least distance from the points.

What PCA does is it tells you which variables are important and gives you a sense of ranking of these variables.

So it has the ability to select variables, rank their importance, and tell you which columns of data matter more than others.

Ergo it tells you which components are principal and helps you analyse them through their importance ranking.

Because this method does so many things, you can do a ton of cool stuff. If you have a ton of data, this method can tell you to focus on the 3 or 4 columns which have the biggest impact. So it can help you prioritize.

Second, if you are looking at optimizing your system, let's say SAT scores, this method can tell you that a better school can make a bigger difference than just brutal hours of practice.

In networks like social networks, it can tell you who is the most important/ prestigious/ coolest person by looking at friendship or social messaging links between people.

So to sum up, it gives you an idea of what is important in your data, gives a sense of the quantum of its importance and hence gives a deeper feel for what's going on.
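The ranking described above can be sketched with NumPy on a made-up SAT-style dataset (all columns and numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400

# Hypothetical data: hours of study drives the mock-exam average,
# while age is unrelated to either.
hours = rng.normal(20, 5, size=n)
mock_avg = 3 * hours + rng.normal(0, 5, size=n)
age = rng.normal(17, 0.5, size=n)
X = np.column_stack([hours, mock_avg, age])

# Standardize first (PCA is scale-sensitive), then eigendecompose.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# "Importance ranking": each component's share of the variance, and the
# first component's loadings showing which original columns drive it.
share = eigvals / eigvals.sum()
print(share)          # first component dominates
print(eigvecs[:, 0])  # loads on hours and mock_avg, barely on age
```

Strictly speaking, PCA ranks components (combinations of columns) rather than the columns themselves; the loadings are what connect a component back to the original columns.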

One big point with PCA is that it's a linear method, which means variables that have an exponential impact on your study will not be signalled well. So transforming and preprocessing your data is critical for this method to work.

Hope this helped.

I found this explanation extremely helpful when I did PCA in my postgrad multivariate analysis course:


PCA fits a high-dimensional ellipse to a cloud of points. The axes of the ellipse correspond to the eigenvectors. The length of each axis corresponds to its eigenvalue.

You can reduce dimensions by saying some of the eigenvalues are "noise" and discarding them.
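The ellipse picture is easy to check numerically with NumPy: sample an elongated 2D Gaussian cloud, rotate it by a known angle, and confirm that the top eigenvector of the covariance matrix points along the long axis (all parameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

# An elongated 2D Gaussian cloud, rotated by 30 degrees: an "ellipse"
# of points whose long axis we know in advance.
theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
cloud = (rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])) @ R.T

eigvals, eigvecs = np.linalg.eigh(np.cov(cloud, rowvar=False))
major = eigvecs[:, np.argmax(eigvals)]   # direction of the longest axis

# The top eigenvector should align with the 30-degree rotation we applied
# (eigenvector sign is arbitrary, so reduce the angle modulo 180).
angle = np.degrees(np.arctan2(major[1], major[0])) % 180
print(f"recovered axis angle: {angle:.1f} degrees")
```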

Which way does the blob point?

Slightly less tersely: if you fit a multivariate Gaussian to some points, you could think of it as something like an ellipse (at a contour of constant probability). The semi-major/minor axes of the ellipse are the principal component directions. The sizes of the principal components are like the radii of the ellipse.

I just read this today. You may find it helpful.


A similar example (using the popular Iris dataset):


Draw a banana on a piece of paper. Chances are in doing so you subconsciously eliminated the dimension of the banana that is 'non curvy' / 'not really varying as much'. You've mentally done a form of PCA, and eliminated the least interesting dimension in reducing it to 2 dimensions instead of 3.

It is a means of defining a new coordinate system that is ordered by dimensions of decreasing variance.

You have certain values that you use to predict another value. Now you have maybe a lot of criteria, some more and some less important ones. You try to find correlations here between the inputs, so that you can describe the output more simply with a combination of these input values.

It takes a cloud of points as input, and gives a set of shapes that approximate that cloud, noise filtered out.

So a component is one of those shapes. In the case of PCA, the shapes are ellipses. For SOM, the shapes could be more complex, like a zigzag. For k-means, the shapes would be like Voronoi cells.

The problem with PCA is that it is very dependent on the metric you put on your feature space. Dividing or multiplying the coordinates of one feature by three can move it from an important axis to a minor one.

In most cases where PCA feeds an ML algorithm, this is a very important thing to take into account, since you can lose a feature that is quite discriminating but gets squeezed into a discarded axis because its coordinates are too small.

There are obviously methods to avoid this, like whitening the input data, but they don't cut it completely.

You can use a correlation matrix instead of the covariance matrix. If I understand correctly, the problem you're describing also occurs if you try to interpret a linear model without normalizing first (for the same reasons).
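The difference is easy to demonstrate with NumPy on made-up data: two equally informative features on wildly different scales.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300

# Two independent features carrying the same information content,
# one on unit scale and one a thousand times larger.
a = rng.normal(size=n)
b = rng.normal(size=n) * 1000
X = np.column_stack([a, b])

cov_eig = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
corr_eig = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

# Covariance-based PCA is dominated by the large-scale feature; the
# correlation matrix (equivalent to standardizing first) treats both
# features equally.
print(cov_eig[0] / cov_eig.sum())    # nearly 1
print(corr_eig[0] / corr_eig.sum())  # near 0.5
```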

Has anyone run PCA on biggish (~100M rows, ~100 columns) SQL table? Any strategies or pitfalls?

One note is that PCA will give you the same results on any matrix as it will on its transpose. However, the running time can be very different in the two cases because if you have an MxN design matrix, the algorithm will try to generate an NxN matrix. If you have the correct transpose, this will be a nice 100x100 matrix, but if you have the wrong one this will be a 100Mx100M matrix.
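The point about working with the small matrix can be sketched with NumPy; a 10000 x 100 matrix stands in here for the 100M x 100 case, and the SVD cross-check is my addition:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 10000, 100   # tall design matrix: many rows, few columns

X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)

# Eigendecompose the small d x d matrix, never the n x n one.
small = Xc.T @ Xc                    # 100 x 100
eigvals, eigvecs = np.linalg.eigh(small)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]       # principal directions, one per column

# Cross-check against the SVD of Xc, which also avoids the n x n matrix:
# the eigenvalues of Xc.T @ Xc are the squared singular values of Xc.
_, s, vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(eigvals[order], s ** 2))
```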

Another trick for datasets with high dimensionality is that you can randomly project the data to a lower dimensional space using random Gaussian vectors. By the Jordan-Lindenstrauss lemma, the projected dataset will have statistically similar properties under PCA.

I think you meant the Johnson–Lindenstrauss lemma.

Oops, good catch. I can never remember the name of that lemma.
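The random-projection trick can be sketched with NumPy (dimensions and tolerance are made up; a scaled Gaussian matrix is one common choice of projection):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, k = 200, 5000, 500

X = rng.normal(size=(n, d))

# Project from d to k dimensions with a scaled random Gaussian matrix.
# The lemma guarantees pairwise distances are approximately preserved
# with high probability.
P = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ P

# Spot-check one pair of points.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(f"distance ratio after projection: {proj / orig:.3f}")
```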
