The post considers PCA from a visualization perspective, but the exact same technique can also be viewed as a method for reducing the number of dimensions in the original dataset. Now, one of the interesting questions in a dimensionality reduction task is: how do you pick the number of dimensions (principal components)? A good number, chosen in a principled way, instead of just computing the next component and the next and the one after that until you get bored? (That works for visualizations, where you often want only the first two or three components anyway, but suppose we want more than plots.)
I recently learned that there's a fascinating way to do this, presented in Bishop's paper from 1999. In short: the question can be answered by recasting PCA as a Bayesian latent variable model with a hierarchical prior. (Yes, that's a bit of a mouthful. Yes, it is fairly mathematical, unlike the visualization.)
The only issue with this is that with tons of data there is less uncertainty in the principal components, so it will recommend as many components as possible, even when each extra one only decreases the reconstruction error by a tiny bit.
Variations on this:
i) How 'faithfully' does it represent the data, e.g. how many modes (components) are needed to reach a given accuracy in a particular metric, or for the entire system?
ii) What is the cut-off component number, i.e. the component whose signal is of the order of the measurement uncertainty? (See the sketch below.)
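A rough, practical way to explore both variations is to look at the explained-variance spectrum and cut off where a component's variance drops to the order of the measurement noise. Here is a minimal scikit-learn sketch; the synthetic data and the noise level noise_var are assumptions for illustration, not anything from the post:

    import numpy as np
    from sklearn.decomposition import PCA

    # Toy data: 500 samples, 20 features, only ~3 "real" directions plus noise.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 20))
    noise_var = 0.1
    X = latent + rng.normal(scale=np.sqrt(noise_var), size=latent.shape)

    pca = PCA().fit(X)

    # (i) Components needed to reach a chosen reconstruction fidelity (95% of variance):
    k_faithful = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95)) + 1

    # (ii) Cut off where a component's variance falls to the noise level:
    k_noise = int(np.sum(pca.explained_variance_ > noise_var))

    print(k_faithful, k_noise)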
You can find the source code here: https://github.com/vicapow/explained-visually
Wish I had the free time to work on these more.
I prefer to think of the singular vectors in PCA as an ordering of "prototype signals" for which some linear combination best reconstructs the data. That explains, for example, why the largest singular vectors on natural time series data give Fourier-like components, and why the largest singular vectors on aligned faces give variations in lighting.
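One way to make the "prototype signals" reading concrete: take the SVD of the centered data matrix and reconstruct each sample as a linear combination of the top-k right singular vectors. A small sketch with synthetic signals (the data here is made up, just to show the mechanics):

    import numpy as np

    # Synthetic "time series" data: 200 samples of a 100-point signal.
    rng = np.random.default_rng(1)
    t = np.linspace(0, 1, 100)
    X = np.array([np.sin(2 * np.pi * (3 + rng.normal()) * t) for _ in range(200)])
    Xc = X - X.mean(axis=0)

    # Rows of Vt are the "prototype signals", ordered by how much variance they explain.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Best rank-k reconstruction: each sample is a linear combination of the top k prototypes.
    k = 5
    X_hat = (U[:, :k] * s[:k]) @ Vt[:k] + X.mean(axis=0)
    print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))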
It turns out that having too many different features is not necessarily a good thing, a phenomenon known as the curse of dimensionality.
Because of this, we are often interested in reducing the number of attributes our algorithm has to process. There are two big categories of methods for doing that: feature selection and feature extraction.
In feature selection, you try to select the attributes that are "best" at predicting your target value, for example by computing the statistical correlation between each attribute and the value you want to predict and keeping those with the highest correlations.
In feature extraction, you create new attributes that are linear combinations of the original attributes. PCA is a feature extraction algorithm.
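A compact side-by-side of the two ideas, using correlation-based selection and PCA; the dataset and the number of kept features (k = 5) are placeholder choices for illustration:

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.decomposition import PCA

    X, y = load_diabetes(return_X_y=True)
    k = 5

    # Feature selection: keep the k original attributes most correlated with the target.
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    X_selected = X[:, np.argsort(corr)[::-1][:k]]

    # Feature extraction: build k new attributes as linear combinations of all originals.
    X_extracted = PCA(n_components=k).fit_transform(X)

    print(X_selected.shape, X_extracted.shape)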
In your effort to predict whether a person will follow dietary guidelines for healthy eating, you could just assign each activity as its own input to the model. Or you could apply PCA (and something like a varimax factor rotation), and what you might find is that these activities reflect three somewhat separable latent variables: physical fitness, competitive athletics, and friendship/team-based social activity. You have now potentially reduced 50 individual activity measures to 3 dimensions.
Next you would think more deeply about the specific items, combine them into 3 scales, and use those scales as a reduced-dimensional input to the predictive model.
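For this kind of analysis, recent scikit-learn versions (0.24+) support a varimax rotation in FactorAnalysis directly. A rough sketch with made-up activity data (50 measures driven by 3 assumed latent traits), not the actual survey items described above:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    # Fake survey: 1000 respondents, 50 activity measures driven by 3 latent traits.
    rng = np.random.default_rng(2)
    latent = rng.normal(size=(1000, 3))          # e.g. fitness, athletics, social
    loadings = rng.normal(size=(3, 50))
    X = latent @ loadings + rng.normal(scale=0.5, size=(1000, 50))

    # Extract 3 rotated factors; varimax pushes each item to load mainly on one factor,
    # which is what lets you group items into three interpretable scales.
    fa = FactorAnalysis(n_components=3, rotation="varimax")
    scores = fa.fit_transform(X)                 # 1000 x 3: the reduced-dimension input
    print(scores.shape, fa.components_.shape)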
Essentially it's distilling your data down to what is most relevant, which helps, say, a classification algorithm work better by training only on the reduced, more manageable data.
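A common way this shows up in practice is PCA as a preprocessing step in a classification pipeline. A minimal scikit-learn sketch; the digits dataset and 20 components are arbitrary choices for illustration:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Train a classifier on PCA-reduced digits (64 pixels -> 20 components).
    X, y = load_digits(return_X_y=True)
    clf = make_pipeline(StandardScaler(), PCA(n_components=20),
                        LogisticRegression(max_iter=1000))
    print(cross_val_score(clf, X, y, cv=5).mean())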
I recently used it for a class project to explore the distribution of certain French cities with respect to socio-economic variables.
You can see, for example, that Security and Economic Activity point in opposite directions.
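If you want to check that kind of relationship numerically rather than just on the plot, the signs of the loadings on a component tell you which variables are opposed. A tiny sketch with hypothetical column names (the French-cities data itself is not reproduced here):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Hypothetical socio-economic table: rows are cities, columns are indicators.
    df = pd.DataFrame(
        np.random.default_rng(3).normal(size=(100, 4)),
        columns=["security", "economic_activity", "housing", "education"],
    )

    pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))

    # Loadings with opposite signs on the same component point in opposite directions
    # on the correlation circle (e.g. security vs. economic activity above).
    loadings = pd.DataFrame(pca.components_.T, index=df.columns, columns=["PC1", "PC2"])
    print(loadings)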