
Principal component analysis: pictures, code and proofs (2018) - stuffypages
https://joellaity.com/2018/10/18/pca.html
======
anilakar
There's a running gag at the local polytechnic: All the engineering fields
have one thing in common – you measure data, put it into a matrix, calculate
its covariance matrix and find the eigenvectors and eigenvalues.
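
A minimal numpy sketch of that recipe, on made-up data (just to spell out the steps):

    import numpy as np

    # Made-up measurements: 200 samples (rows) of 5 quantities (columns).
    X = np.random.randn(200, 5)

    # Center the data, compute the covariance matrix, and take its eigendecomposition.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)           # 5 x 5 covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric

    # Sort components by decreasing eigenvalue (variance explained).
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Project the data onto the principal components.
    scores = X_centered @ eigenvectors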

In the case of radio direction finding, PCA allows you to distinguish multiple
overlapping signals on the same frequency if they're hitting the antenna array
from different directions.

If you have any other interesting applications in your own field, I'd like to
hear them, too.

~~~
bacr
One of the more remarkable results in 21st-century human genetics was that the top two
PCs of a genotype matrix capture geography to an amazing degree [1]. This is
the PCA of an N x M matrix where you measure M genetic markers in N people in a
relatively homogeneous population (Europeans in the paper above).

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735096/

------
hammeiam
Does anyone have an explanation of PCA that's more accessible to laypeople /
the less mathematically inclined?

~~~
mikorym
By way of example, suppose you have a weather dataset with 30 parameters such
as temperature, cloud cover, solar radiation, etc.

Your objective is to classify points on a map that have similar weather
conditions.

You can do PCA on these geographical points at a given time. Your matrix will
have the weather parameters as columns and the different geographical points
as rows.

PCA will determine how best to represent the variation in the data in fewer
dimensions. If you are lucky, 80% of the variation can be shown in a 2D plot,
as in OP's post.

Points in the x/y Cartesian plane that have similar weather will plot closer to
each other.

This process reduces dimension. For example, temperature and solar radiation
are strongly correlated and will tend to "push" in the same direction, so you
can replace those parameters with a single virtual parameter. (For weather
data, that tends to be the x-axis in the PCA biplot, as temperature and
similar parameters are often sufficient to classify a climate for the purposes
I have encountered in agriculture.)

In a nutshell, this is a real-world example of when one can use PCA.
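
A rough sklearn sketch of this weather example, with a made-up matrix (two latent "climate" factors driving 30 correlated parameters) standing in for real measurements:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)

    # Hypothetical weather matrix: 500 geographical points x 30 parameters
    # (temperature, cloud cover, solar radiation, ...), generated from two
    # latent climate factors so that correlated parameters move together.
    climate = rng.normal(size=(500, 2))
    loadings = rng.normal(size=(2, 30))
    X = climate @ loadings + 0.3 * rng.normal(size=(500, 30))

    # Standardise so parameters with different units are comparable.
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    pca = PCA(n_components=2)
    scores = pca.fit_transform(X)               # each row: (PC1, PC2) for one point

    print(pca.explained_variance_ratio_.sum())  # fraction of the variation in the 2D plot
    # Plotting scores[:, 0] against scores[:, 1] puts points with similar
    # weather close to each other.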

~~~
siddboots
A key point that I feel is missing from this explanation is that the concept
of distance becomes almost meaningless in very high dimensional space [1]. PCA
is a means of dimensionality reduction, but the reason you want to reduce
dimensions is so that you can measure and compare the distance between pairs
of samples.

[1] https://en.m.wikipedia.org/wiki/Clustering_high-dimensional_data

~~~
yorwba
The "meaninglessness" of distances in high-dimensional spaces, as defined in
that Wikipedia article (i.e. the maximum and minimum distances of points
becoming relatively close to each other) only happens when all those
dimensions are independent. If the data is actually low-dimensional, such that
PCA loses no information (trivial case: tack a billion zeros to all your
vectors) then it won't change the distances at all, so it doesn't matter
whether you compute them in the high-dimensional space or the low-dimensional
projection obtained with PCA.

PCA works under the assumption that you're dealing with a low-dimensional
signal that was projected into a high-dimensional space and then corrupted by
low-magnitude high-dimensional noise. By taking only the top _k_ components,
you filter out the noise and get a better signal. But if you have a low-
magnitude high-dimensional signal corrupted by highly-correlated noise, then
naively applying PCA will filter out the signal and leave you with only noise.

So whether PCA makes your data more meaningful or less really depends on
whether that assumption is satisfied or not. Dimensionality reduction is no
silver bullet.
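
A small sketch of the favourable case, with a made-up rank-2 signal embedded in 100 dimensions and corrupted by low-magnitude noise:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)

    # A 2-dimensional signal projected into a 100-dimensional space...
    latent = rng.normal(size=(1000, 2))
    embedding = rng.normal(size=(2, 100))
    signal = latent @ embedding

    # ...corrupted by low-magnitude, high-dimensional noise.
    X = signal + 0.1 * rng.normal(size=(1000, 100))

    pca = PCA(n_components=2).fit(X)
    print(pca.explained_variance_ratio_.sum())  # close to 1: the top 2 PCs keep the signal

    # If instead a high-variance, highly-correlated disturbance dominated the data,
    # it would claim the top components and the same truncation would discard the signal.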

~~~
jqgatsby
That's very interesting. Can you please clarify "if you have a low-magnitude
high-dimensional signal corrupted by highly-correlated noise"?

I don't understand what "highly-correlated noise" means. I thought noise was
uncorrelated to the signal by definition.

Also, I understand that PCA is not scale invariant, but I'm having trouble
relating that to "low-magnitude, high-dimensional signal".

Thank you!

~~~
yorwba
Noise can be correlated with itself. For example, if you're using a pressure
sensor to detect whether the person carrying it has fallen, then the baseline
atmospheric pressure is noise, while the signal is in relatively small changes
corresponding to a meter or so of height difference. Because atmospheric
pressure changes only slowly, it is highly correlated across successive
measurements. If you did PCA on time series recorded under different
atmospheric pressure conditions, the baseline pressure would appear as the top
component.

I used that example because something similar actually happened (although
there's no indication PCA was used):
https://semiengineering.com/training-a-neural-network-to-fall/
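
A toy sketch of the pressure-sensor situation above, with made-up recordings where the baseline offset ends up dominating the first component:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)

    # 50 made-up recordings of 200 pressure samples each, taken on days with
    # different baseline atmospheric pressure (hPa).
    baseline = rng.normal(1013.0, 10.0, size=(50, 1))   # varies between recordings
    height_signal = 0.12 * rng.normal(size=(50, 200))   # small changes from ~1 m height differences
    X = baseline + height_signal

    pca = PCA(n_components=3).fit(X)
    print(pca.explained_variance_ratio_)
    # The first component is dominated by the baseline offset (the correlated "noise"),
    # not by the small height-related signal you actually care about.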

~~~
mikorym
This is, I think, another example where one might consider discarding factor 1
and looking only at factors 2 and 3.

(Of course, a better approach, if possible, would be baseline normalisation.)

------
malms
The problem with PCA is that it is very sensitive to the metric you put on your
feature space. Dividing or multiplying the coordinate of one feature by three
can move it from an important axis to a minor one.

In most cases where PCA feeds an ML algorithm, this is very important to take
into account, since you can lose a feature that is quite discriminating but
that gets squeezed into a discarded axis if its coordinate is too small.

There are of course methods to mitigate this, like whitening the input data,
but they don't solve it completely.

~~~
closed
You can use a correlation matrix instead of covariance. If I understand
correctly, the problem you're describing also occurs if you try to interpret a
linear model without normalizing first (for the same reasons).
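
A small sketch of the difference, using two made-up features on wildly different scales (standardising before PCA amounts to using the correlation matrix instead of the covariance matrix):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(3)

    # Two informative features, one with a huge scale and one with a tiny scale.
    X = np.column_stack([rng.normal(scale=1000.0, size=500),
                         rng.normal(scale=0.001, size=500)])

    # Covariance-based PCA: the large-scale feature takes essentially all of PC1,
    # the small-scale feature gets squeezed into the last component.
    print(PCA().fit(X).explained_variance_ratio_)

    # Correlation-based PCA (standardise first): both features contribute comparably.
    X_std = StandardScaler().fit_transform(X)
    print(PCA().fit(X_std).explained_variance_ratio_)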

------
forkandwait
Has anyone run PCA on a biggish (~100M rows, ~100 columns) SQL table? Any
strategies or pitfalls?

~~~
antognini
One note is that PCA will give you essentially the same results on a matrix as
on its transpose (the nonzero eigenvalues are identical). However, the running
time can be very different in the two cases, because for an MxN design matrix
the algorithm will form an NxN covariance matrix. With the correct orientation
this is a nice 100x100 matrix, but with the wrong one it would be a 100Mx100M
matrix.

Another trick for datasets with high dimensionality is that you can randomly
project the data to a lower dimensional space using random Gaussian vectors.
By the Jordan-Lindenstrauss lemma, the projected dataset will have
statistically similar properties under PCA.
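
A rough sketch of the random-projection idea, assuming sklearn's GaussianRandomProjection and a made-up wide dataset:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.random_projection import GaussianRandomProjection

    rng = np.random.default_rng(4)

    # Made-up wide dataset: 500 samples with 10,000 features.
    X = rng.normal(size=(500, 10_000))

    # Project onto random Gaussian vectors; pairwise distances are approximately
    # preserved, so the PCA of the projected data behaves much like the original.
    proj = GaussianRandomProjection(n_components=500, random_state=0)
    X_small = proj.fit_transform(X)      # 500 x 500

    # PCA on 500 columns instead of 10,000 is much cheaper.
    print(PCA(n_components=10).fit(X_small).explained_variance_ratio_)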

~~~
eigenvalue
I think you meant the Johnson–Lindenstrauss lemma.

~~~
antognini
Oops, good catch. I can never remember the name of that lemma.

