
“‘Principal Component Analysis’ is a dimensionally invalid method” - stplsd
http://www.amazon.com/review/R16RJ2PT63DZ3Q/ref=cm_cr_rev_detmd_pl?ie=UTF8&asin=0521642981&cdForum=Fx37214P6NH2KSB&cdMsgID=Mx19BAIRVRARPZV&cdMsgNo=1&cdPage=1&cdSort=oldest&cdThread=Tx2OVZUUHW9MMJ9&store=books#Mx19BAIRVRARPZV
======
christopheraden
There is actually a fix to this problem in classical statistics (as far back
as Pearson in the early 20th century) if the object is to perform PCA on data
matrices: Don't use the covariance matrix for PCA. The issue of units is only
a problem if you're using a matrix that has units itself. This problem is
readily solved by using the correlation matrix instead, which is dimensionless
by definition. The downside to this circumvention is that you have essentially
re-weighted each of your variables, so the weight contributed by each variable
is more similar. This may not be what you want.
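
For concreteness, here's a minimal sketch in Matlab (assuming the Statistics
Toolbox and its bundled hald example dataset, which also shows up later in
this thread):

    % PCA on the covariance matrix depends on the units of each column;
    % PCA on the correlation matrix (dimensionless by construction) does not.
    load hald;                                 % small cement-ingredients dataset
    coeff_cov  = pcacov(cov(ingredients))      % changes if a column is rescaled
    coeff_corr = pcacov(corr(ingredients))     % invariant to per-column rescaling
    % pcacov(corr(X)) gives the same loadings as pca(zscore(X)), up to sign.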

If PCA is not being used to reduce the dimensionality of multivariate data,
this fix might not apply. There are other uses of PCA besides working on data
matrices (the image compression technique using the SVD comes to mind) that he
might be addressing.

If you want a treatment of PCA in a respected text (at least by the
statistical community--not sure what the ML people think ;) ), look no further
than Hastie, Tibshirani and Friedman's Elements of Statistical Learning.

[http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLI...](http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLII_print4.pdf)

~~~
radarsat1
> The downside to this circumvention is that you have essentially re-weighted
> each of your variables, so the weight contributed by each variable is more
> similar.

Is this similar to "whitening"?

~~~
christopheraden
It is somewhat similar. The procedure I describe is normalizing the data
(using z-scores instead of raw values). The difference, as far as I can tell,
is that normalizing retains the correlation structure of the data, while
whitening removes the correlations entirely. R is the lingua franca of
academic statisticians, so you might not derive a huge amount of value from
this, but this question was asked a couple of years ago on Stats.SE:
[http://stats.stackexchange.com/questions/53/pca-on-correlati...](http://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance)
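
A rough Matlab illustration of that difference, again assuming the Statistics
Toolbox and the hald data (just a sketch, not the code from the Stats.SE
answer):

    load hald;
    X = ingredients;
    Z = zscore(X);                                  % normalized (z-scored) data
    corr(Z)                                         % correlation structure is retained
    [V, D] = eig(cov(X));                           % PCA-style whitening
    W = bsxfun(@minus, X, mean(X)) * V * diag(1 ./ sqrt(diag(D)));
    cov(W)                                          % approximately the identity matrix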

------
jules
This kind of complaint applies to _any_ method. Dimensional analysis just
means that your data satisfies a certain kind of symmetry. A scaling symmetry
in this case. It doesn't matter how long you define one meter, as long as you
do it consistently. Hence any answer should not depend on the length of a
meter. There are other symmetries, for example it does not matter where you
put 0 on the temperature scale. Hence your answer should not depend on where
you put that zero (unless that zero has special significance in your context).
This is a kind of additive symmetry. Another additive symmetry is that it
shouldn't matter which year you call year 0 (unless it has special
significance, e.g. you are investigating the birth of Jesus). In the same way
your data can have any kind of symmetry, especially with multi-dimensional
data you often get extra symmetries. For example it shouldn't matter in which
direction you define north and east.

For any given problem, you should generally only use methods that obey the
symmetries that your data has. This doesn't mean that PCA is invalid as a
method; it's just only valid on data where the scaling symmetry does not
apply. Another example would be fitting a line through the origin for
temperature data. That's invalid, because the result you get depends on where
you define your zero (but as before, it might be valid if that zero has
special significance in your context). Does that mean that fitting a line
through the origin is invalid for any data set? No.
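
A tiny Matlab sketch of the temperature example, with made-up numbers just to
show the zero-point dependence:

    t_c = [10; 20; 30];  y = [1.1; 2.2; 2.9];   % hypothetical temperature/response data
    slope_c = t_c \ y;                          % least-squares fit through the origin (Celsius)
    t_k = t_c + 273.15;                         % same temperatures with the zero moved
    slope_k = t_k \ y;
    [t_c * slope_c, t_k * slope_k]              % the fitted values themselves differ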

In other words, the same criticism could be applied to any given method. Just
choose any symmetry that the method does not respect, and then declare it
completely invalid. Hence we cannot dismiss a method outright purely based on
this reasoning. For example PCA is perfectly valid on unitless data. What's
even stranger is that the author does like neural networks, which are certainly
not dimensionally valid; heck, they probably don't satisfy _any_ real-world
symmetries. This is also a case where it can be OK to use a dimensionally
inconsistent method. As long as it works, it works.

~~~
morpher
In addition to unitless data, PCA works when all of your variables have the
same dimension. It simply doesn't make sense to build a principal component
vector that is a mixture of (i.e. weighted sum of) vectors with non-identical
units.

------
tgflynn
_If you change the units that one of the variables is measured in, it will
change all the "principal components"!_

So what? The same is true of most machine learning methods, including neural
networks and SVMs; you just have to use the same units consistently.

~~~
wookietrader
Wrong.

Neural networks are affine invariant. You can rotate, skew, or translate the
input data however you want; the optimum stays the same. Same for SVMs.

~~~
tgflynn
But the weights will be different, which it seems to me is comparable to
saying that the principal components change.

------
binarysolo
Maybe I'm having a brainfart or something... but since PCA is
eigenvector/eigenspace-based and essentially determines linear noncorrelation
of the different vectors, changing units of measurement shouldn't change which
dimensions are most different about said vectors?

Edit: Ah right - <http://en.wikipedia.org/wiki/Whitening_transformation> -
let covariance matrix be I.

~~~
thedufer
That's what I thought when I read this, but I haven't looked at PCA in a while,
so I wasn't sure. It's only relative differences on each axis that matter,
right?

~~~
robrenaud
No.

PCA tries to project to the subspace that preserves as much of the distance in
the input space as possible. If you multiply a coordinate in the input space by a
factor of 2, it will contribute relatively more to the distances, and hence
change the fitted projection beyond just a scaling factor.
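
For example (Matlab, using the hald data that appears later in the thread),
doubling one raw column changes the leading principal direction itself, not
just its length:

    load hald;
    X = ingredients;
    c1 = pca(X);                 % PCA on the raw data
    X(:,1) = X(:,1) * 2;         % "change the units" of the first variable
    c2 = pca(X);
    [c1(:,1), c2(:,1)]           % the first loading vectors are not parallel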

~~~
thisrod
So we're talking about a very general problem. When I noticed that a system of
equations including x=y can have a different least-squares solution when you
change it to 2x=2y, I was quite surprised.

------
aheilbut
Lots of kinds of analyses require some kind of normalization or whitening to
work properly.

The problem with PCA is that everyone knows about it and kind of understands
it, and thinks that it'll magically tell them something interesting about
their data. Especially when you can take the first three components and make
cool-looking 3D plots...

------
kyzyl
> If you change the units that one of the variables is measured in, it will
> change all the "principal components"

By the way that's worded, it sounds like a problem of sensitivity rather than a
flaw in the method. If you change one of the variables to a unit that is way out of scale,
then it's quite possible that the results of PCA, and many other methods, will
change. But that's because it's not scale invariant, and so if you want good
results you need to present your variables in the same units, and/or in some
normalized format (z-scored, etc.) where the scale of one unit doesn't blow the
others out of the water.

These methods are not magical, and they are not intelligent. They do not know
what they're looking at, so it's your job to feed them something reasonable.

That said, if you take some physically grounded data and change units, you
won't recover different eigenvectors as long as your input data is good. The
physics doesn't care about the units, or the coordinate system, or whether you
use Python or anything else.

~~~
tgflynn
_if you take some physically grounded data and change units, you won't recover
different eigenvectors as long as your input data is good._

I don't think that's accurate. Consider a set of points distributed along a
line in 2D. If you do PCA on these points you will find that the 1st
eigenvector points along the line and the second is orthogonal to it.

If you now rescale the axes so that x is measured in meters and y in light
years, the slope of the line will change and so will the two PCA
eigenvectors.

However the relationship between the 2 eigenvectors and the distribution of
the data points will remain the same. The first eigenvector will still point
along the line and the second will still be orthogonal to it.
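
A quick Matlab check of that claim (using about 9.46e15 meters per light year
for the rescaling):

    x = (1:10)';
    X = [x, 3*x];                 % points exactly on the line y = 3x, both axes in meters
    c = pca(X);  c(:,1)           % first direction is along [1 3] / norm([1 3])
    Y = [x, 3*x / 9.46e15];       % same data with the y-axis measured in light years
    c = pca(Y);  c(:,1)           % still points along the (rescaled) line, roughly [1 0]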

In machine learning one is interested in the distribution of the data, not in
whatever units they happen to be measured in; hence I don't understand
MacKay's objection.

~~~
jessriedel
The mathematical way to state this is that a choice of units is equivalent to
a choice of an inner product in linear algebra (or, more generally, the choice
of a metric tensor in differential geometry). Basically, the choice of an
inner product defines what it means for something to be orthogonal in a space.

Principal component analysis is predicated on a choice of inner product, since
component directions are always chosen to be orthogonal. It's not clear what
orthogonality could mean in a 2D plane where one direction is measured in
inches and the other in tons, so naive PCA isn't appropriate in such a case.

Others have mentioned in this thread that "whitening" the data before PCA
fixes this problem, by removing cross-correlations. Presumably, in that case,
the notion of orthogonality is taken from the statistical properties of the
data. (Maybe it normalizes physical units like inches to the standard
deviation of the data's distribution in inches?)

------
tel
In practice if you're working in high dimensions and don't do PCA then you
have very little chance of building a good model. It certainly isn't
dimensionally valid if, for you, validity demands scale invariance, but it's
an essential tool for unsupervised feature engineering.

~~~
SatvikBeri
I definitely agree that you need dimensionality reduction, though there are
methods other than PCA (e.g. auto-encoders.)

~~~
dbecker
Auto-encoders have some big advantages over PCA, but they suffer the same
shortcoming (sensitivity to units of measurement) described in the original
post.

~~~
radarsat1
In practice you have to scale data to [-1,1] or [0,1] for a neural net anyway,
right? (Depending on the kernel function.)

~~~
SatvikBeri
Another common method is to scale to mean 0, variance 1. In my opinion this
makes more sense since it handles outliers a bit better; e.g., consider a case
where most of your values for a feature range from 1 to 10 but there's
one point with value 1,000,000.
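
Both scalings are one-liners in Matlab; here is a sketch with a hypothetical
feature vector X in the 1-to-10-plus-one-million situation above:

    X = [rand(99, 1) * 9 + 1; 1e6];            % 99 values in [1, 10] plus one huge outlier
    Xmm = (X - min(X)) ./ (max(X) - min(X));   % min-max scaling to [0, 1]
    Xz  = zscore(X);                           % mean 0, variance 1
    % Min-max scaling squashes the 99 "normal" points into roughly [0, 1e-5];
    % z-scoring spreads them out somewhat more, though the outlier still dominates.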

~~~
dbecker
I agree that you SHOULD normalize/scale data before running neural networks
and autoencoders.... and this resolves the units issue in most cases (unless
measurements in some units are non-linear functions of measurements in
others).

But this scaling also resolves the issue for PCA. So, I don't see much
difference between autoencoders and PCA with regard to the original post's
"dimensional invalidity" concern.

If anything, the scaling options you mention suggest "dimensional invalidity"
isn't a big deal in practice for either method.

------
brosephius
Wouldn't readers be better off if the author wrote about this problem with PCA
in the book, instead of ignoring it entirely?

~~~
christopheraden
While I agree with you, if the author believed the entire method is invalid in
all situations, he could make a strong argument that it wasn't worth the space
to put it in at all. It would be a long book if it were filled with all the
methods currently in production that did _not_ work!

~~~
jessriedel
PCA is extremely widely known and (ab)used. You can't justify ignoring it.

------
dimatura
"X is a ... method that gives people a delusion that they are doing something
useful with their data." could be applied to pretty much any method in ML if
you don't know what you're doing. PCA is a strange omission, but it's not like
it's hard to find references on it. ITILA is a great book and it's legally
free online, by the way.

------
HelloMcFly
If anyone is interested, PCA is used in scale development to help reduce the
number of items down to something useful. It is typically followed by a factor
analysis to determine factors, though far too many use PCA in place of a
factor analysis. I didn't realize it was used much outside of measurement.

~~~
kyzyl
PCA and other similar forms of dimensionality reduction are used all over the
place, from healthcare to control systems to financial analysis. These days it
is most often used as a preprocessing step for some other analysis routine,
though.

------
revelation
In data compression, there is what is called the Karhunen-Loeve or Hotelling
transform. It's essentially an alias for "Principal Component Analysis".

The Hotelling transform is interesting in that it achieves optimal energy
compaction, but has little practical value since it needs to be constructed
anew for each dataset (usually, images).

Just to pull your thoughts away from massive, unfiltered big data :)

------
pessimizer
I wish that someone would inform the field of psychology about this. Principal
Component Analysis is often a way of simply reifying your prejudices.

------
n00b101
So what alternative method of dimensionality reduction does the author
recommend?

Considering that PCA works fine in the case that the dimensions form a proper
vector space (e.g. geometrical space, stock market returns, temperature,
etc.), it seems questionable to completely dismiss such a useful and
historically important method.

------
streptomycin
Open Matlab and type

    
    
    load hald;
    coeff = pca(zscore(ingredients))
    ingredients(:,1) = ingredients(:,1) * 2;
    coeff = pca(zscore(ingredients))

Magically, you get the same result regardless of whether you change the units of
one variable.

~~~
christopheraden
Yes, of course you will--because you've normalized the data. If you had run
PCA on just ingredients instead of on the normalized ingredients, I would
imagine the results would be different. I can't verify since I don't have
Matlab on this computer, but doing PCA on raw data with one set of units will
produce a different result than doing it on data with another set of units.

~~~
streptomycin
Yes, you get different results if you don't normalize. My point was, I don't
see why "you have to trivially normalize your data first" is a meaningful
argument against anything.

~~~
christopheraden
Because normalization reduces the influence of variables that have a higher
variance. In raw data, if you have marathon times and heights for runners in a
race, and you measure the times in minutes and the heights in inches, the
influence of the times on the principal components will likely be much greater.

As with all problems in statistics, the choice ought to depend on the specific
task at hand. If there is some a priori reason to use the original scale (or a
different re-weighting), it ought to be used. In general, PCA on correlation
matrices is much preferred for exactly the reason you mention.

------
bhickey
I'm still waiting for Information Theory to be updated with the errata I
submitted. In at least two places (pages 458 and 459) he mentions the
possibility of accepting (!?) the null hypothesis.

snark.

~~~
sesqu
What was your replacement suggestion? I would say acceptance is often a fair
description for lack of rejection.

~~~
bhickey
The null hypothesis is never accepted. You can fail to reject the null, but
this is entirely different from accepting it.

If you claimed to accept the null in an intro statistics class, you'd probably
be failed.

~~~
sesqu
That's an extremely dogmatic position. There are certainly situations where it
is not useful or even accurate to proceed under the null hypothesis, but the
converse is also very much true. Checking Wikipedia, I see they also describe
the scenario as "accept or fail to reject".

If you want to argue for why "accept" is materially different from "fail to
reject", feel free to do so - but I suggest that the chasm is by no means
wide.

------
ankitml
You can always make the variables in any system non-dimensional. The whole field
of chemical engineering works on this principle. Once you do that, you are
free of irregularities/issues that stem from scaling.

Apply PCA after non-dimensionalizing any system. Read more here:
<http://en.wikipedia.org/wiki/Nondimensionalization>

------
cf
Yes, this is a well-known problem with PCA, so we often just whiten
(<http://en.wikipedia.org/wiki/Whitening_transformation>) the data first.

~~~
tgflynn
Whitening and (full-rank) PCA use the same rotation, except that whitening
also rescales each rotated axis (by the inverse square root of its eigenvalue)
so that all axes in the transformed system have unit variance.

In other words, whitening the data before applying PCA should result in the
same eigenvectors, expressed in the original coordinate system.

~~~
cf
Yes, but I think in many places where PCA is used, we are precisely
interested in which eigenvectors have the largest eigenvalues. The scaling
beforehand makes a unit change less likely to affect which eigenvectors are
the most important.

~~~
tgflynn
But once you know the eigenvectors, the eigenvalues (and hence their
distribution) are determined, again in the original space.

On the other hand if you're talking about the eigenvalues for the whitened
data they're all 1.

So I'm still not seeing what whitening adds to PCA.

Just rescaling each dimension of the original space so that all dimensions
have unit variance, without doing any rotations, may change things, but I
don't think that's what is usually called whitening (according to Wikipedia).
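
A quick Matlab check of that point, reusing the whitening construction from
earlier in the thread:

    load hald;
    X = ingredients;
    [V, D] = eig(cov(X));
    W = bsxfun(@minus, X, mean(X)) * V * diag(1 ./ sqrt(diag(D)));   % whitened data
    [~, ~, latent] = pca(W);
    latent                        % variance along every principal direction is ~1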

~~~
mturmon
You are absolutely correct, and your interlocutor is mistaken to use a
whitening transformation before PCA.

