The key insight that many are missing is that PCA solves a series of optimization problems: reconstructing the data from the first k PCs gives the best k-dimensional approximation in terms of squared error. Even more, this is equivalent to assuming that the data lives in a k-dimensional subspace and only becomes truly high-dimensional because of normally distributed noise that spills into every direction (dimension).
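A minimal numpy sketch of that equivalence (toy data and variable names made up for illustration): points near a 2-D subspace plus isotropic Gaussian noise, reconstructed from the first k PCs. The leftover squared error equals the sum of the discarded squared singular values, which is the Eckart-Young statement of "best rank-k approximation".

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: points near a 2-D subspace of R^5, plus isotropic Gaussian noise
# (the generative picture behind probabilistic PCA).
n, d, k = 500, 5, 2
latent = rng.normal(size=(n, k))
basis = rng.normal(size=(k, d))
X = latent @ basis + 0.1 * rng.normal(size=(n, d))
X -= X.mean(axis=0)                      # PCA assumes centered data

# PCA via SVD: the first k right singular vectors span the best-fit subspace.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = X @ Vt[:k].T @ Vt[:k]              # reconstruction from the first k PCs

# By Eckart-Young, no other rank-k matrix achieves a smaller squared error,
# and that error is exactly the sum of the trailing squared singular values.
err = np.sum((X - X_k) ** 2)
print(f"rank-{k} reconstruction error: {err:.4f}")
print(f"sum of discarded squared singular values: {np.sum(s[k:] ** 2):.4f}")
```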
Principal Components is a wonderful concept, together with its sister concepts of eigenvalues/eigenvectors and orthogonality. I wish I could force everyone I talk to to internalize these ideas so that I could have more useful discussions with them.
That said, yeah, not everything is linearly separable.
Best thing I’ve ever read on PCA is Madeleine Udell’s PhD thesis [1]. It extends PCA in many directions and shows that well-known techniques fit into the developed framework. (Was also impressed that a 138-page math thesis could be this readable. Quite the achievement.)
Indeed, this seems worth a deep read, especially as it addresses the main PCA shortcomings (heterogeneous data, non-numerical data, etc.).
Thanks mate, I've definitely found a way to keep myself busy this weekend.
It’s kind of crazy that so many people have read this thesis, but it’s really good. I came across it independently a few years ago when I was trying to understand some stuff, but ended up saving it as a reference because I liked it so much.
This is some hot stuff! Thanks for sharing. Very lucid writing; clearly she has a deep understanding of the subject matter to be able to write it down so eloquently.
In the UK eating example, it would be better to examine the feature-space singular vector associated with the first singular value instead of instructing the reader to "go back and look at the data in the table". PCA has already done that work, no additional (error-prone, subjective) interpretation needed.
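For anyone who wants to try that, a rough sketch of what it looks like in code. The file name and the `uk_food` variable are stand-ins for the consumption table in the article (countries as rows, food categories as columns), not its actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical file holding the consumption table from the article:
# rows = countries, columns = food categories.
uk_food = pd.read_csv("uk_food.csv", index_col=0)

X = uk_food.to_numpy(dtype=float)
X = X - X.mean(axis=0)                               # center each food category

# SVD of the centered table; rows of Vt are the feature-space singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The first right singular vector gives one weight per food category.
# The large-magnitude entries are the foods that drive the first principal
# component, i.e. what "go back and look at the table" asks the reader to find by eye.
loadings = pd.Series(Vt[0], index=uk_food.columns)
print(loadings.sort_values(key=np.abs, ascending=False))
```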