Principal Component Analysis (oranlooney.com)
138 points by olooney 12 days ago | hide | past | web | favorite | 48 comments

BTW, for real-world use, if you want something like PCA but better than an algorithm that makes linearity assumptions, there are two really hot algorithms for dimensionality reduction right now:

UMAP - topology manifold learning based method

Ivis - Siamese triplet network autoencoder

Both of them will blow PCA out of the water on basically all datasets. PCA's only advantages are speed and interpretability (it's easy to see the explained variance).

PCA is simply eigenvector extraction on the covariance matrix. While more impressive techniques exist, PCA is so simple it will never be passé.
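To make that concrete, here is a minimal NumPy sketch of PCA as eigenvector extraction on the covariance matrix. It is only an illustration: the function name `pca_eig` and the random toy data are made up and are not taken from the article.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                 # center each column
    cov = np.cov(Xc, rowvar=False)          # (d, d) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    components = eigvecs[:, order]          # (d, k) principal directions
    return Xc @ components, components, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
scores, components, variances = pca_eig(X, k=2)
```

The variance of each column of `scores` equals the corresponding eigenvalue, which is why the eigenvalues are read as "explained variance".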

Agreed, there's something to be said for simple models that are "good enough," especially when their limitations are clear. k-NN also comes to mind.

Indeed. And, e.g., the generalized eigenproblem extends this to the case of two competing data sets.

Ignoring the eigen aspect would miss a lot of both theory and practice.

> PCAs only advantages are speed and interpretability

Um, these seem to be perhaps the 2 most important advantages something could have? What are advantages that UMAP and Ivis have that are more important than speed and interpretability?

Not competent on Ivis but UMAP is very good at making sure that points that are close in the original space are close in the output space.

When you want to perform some sort of clustering on top of the dimensionality reduction, it's a very useful property.

Note that it is very common to get rid of 90% of the dimensionality with PCA and then apply something like UMAP on top to improve speed (and because PCA can be trusted to keep the relevant information for later postprocessing, which cannot be said of other techniques).

I'm just wondering if leaving out t-SNE was intentional? I'm a big fan of UMAP too. I just created this Kaggle Kernel [1] a couple of days back to show how UMAP works on Kannada MNIST data.

I also think PCA still rules in a lot of cases, like when people try to create a new index that represents a bunch of numeric variables.

[1] https://www.kaggle.com/nulldata/umap-dim-reduction-viz-on-ka...

I am not the OP, but I would say that UMAP is a clear superset of t-SNE: it can do the same thing, but faster and with better preservation of large-scale distances.

t-SNE doesn’t really tell you how to apply its transformation to new data (UMAP and PCA do). As a result, t-SNE is great for visualization or for “we know everything from the start” data, but not as useful in most data processing pipelines.

Nice kernel, hope to use this in the near future.

PCA has speed and interpretability on its side, but also the ability to easily add new points (UMAP has a quick fix to do so, but it's a hack) and to go back and forth between the original space and the compressed space, which makes it usable in a wide range of use cases.
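A rough sketch of what "adding new points" and "going back and forth" look like in practice, using plain NumPy. The helper names `transform` and `inverse_transform` and the random data are assumptions chosen for illustration; they mirror common library naming but are defined here by hand.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))
new_points = rng.normal(size=(3, 10))      # points not seen during fitting

# "Fit": store the training mean and the top-k right singular vectors.
k = 4
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
W = Vt[:k].T                                # (10, k) projection matrix

def transform(X):
    return (X - mean) @ W                   # original space -> compressed space

def inverse_transform(Z):
    return Z @ W.T + mean                   # compressed space -> original space

Z_new = transform(new_points)               # new points need no refitting
X_back = inverse_transform(Z_new)           # lossy reconstruction, shape (3, 10)
```

Because the projection is just a matrix multiply, new points cost almost nothing, and the round trip compressed -> original -> compressed returns the same coordinates.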

You can also use a nonlinear transform e.g. Gaussianization

PCA is great. I like this paper where it holds its own against all the fancy nonlinear techniques:


When have you needed something stronger than PCA? Anybody have good stories?

Yes. I do some creative work in audio and I wanted to cluster a fairly large database of audio. I needed a dimensionality reduction technique to take a set of audio descriptors down to two values (to use as coordinates) in order to create a 2D explorable space. I actually ended up using MFCCs[1] and not typical audio descriptors[2] for my analysis data, as there are lots of issues with making the numbers meaningful to begin with. I ended up munging the data to get around 140 numbers for each audio file by taking the MFCCs and computing statistics over 3 derivatives so that the data somewhat represented change over time. I tried out a number of reduction techniques, and PCA was one of them. Perceptually, the groupings it produced were weak; the techniques I found useful were ISOMAP[3], t-SNE[4] and lately UMAP[5]. t-SNE and UMAP have given me the best perceptual groupings of the audio files.

You can see some of the code here on Github, although a lot of it depends on having some audio to test among other closed source projects (sorry I have no control over that).


[1] - https://en.wikipedia.org/wiki/Mel-frequency_cepstrum

[2] - http://recherche.ircam.fr/anasyn/peeters/ARTICLES/Herrera_19...

[3] - https://scikit-learn.org/stable/modules/generated/sklearn.ma...

[4] - https://scikit-learn.org/stable/modules/generated/sklearn.ma...

[5] - https://umap-learn.readthedocs.io/en/latest/

The quality of the mapping to this 2D space would seem to be pretty subjective, but incredibly powerful once an intuition for it is developed. For example, I've been watching a bunch of guitar pedal reviews, and it would be very cool to visualize the 'topography of tone' comparatively between products.

Does the work you're doing give you the ability to compare the outcomes of various reduction schemes?

The quality of the mapping is entirely subjective. From person to person and corpus to corpus, a different kind of mapping will work better, depending on the known and tacit goals for the results.

The work I am doing now has the ability for me to compare different plots as well as to experiment with another layer of processing that runs clustering on the output of the dimensionality reduction. I then manually play through the clusters to see how well the groupings seem to me, in terms of how 'close' the audio samples are together and how well a cluster is differentiated from another.

I hope to have some documentation on my project soon, as it's going to be part of my dissertation in music technology.

Also, to answer your question about guitar pedals - Stefano Fasciani has done some work on finding smoother parameter combinations across chaotic synthesisers.


That is perhaps up your alley!

Very cool! If it's appropriate I hope you post about it here when complete.

Also, that TSAM vid is exactly what I had in mind, thanks!

Ooh, corpus based concatenative synthesis! I look forward to reading through these links. The only features that I have had luck with are spectral centroid (for ‘brightness’) and some measure of loudness, but I have always wanted more. Thanks for posting this stuff!

I'm not sure if I'm parsing this conversation correctly, but 'corpus based concatenative synthesis' reminds me of Scrambled? Hackz! - https://www.youtube.com/watch?v=eRlhKaxcKpA

Hey no worries! In the realm of audio descriptors centroid and brightness go far in terms of mapping on to something perceptual. I use them all the time for doing basic corpus navigation and concat synthesis. This indeed was an experiment because I wanted something that respected more perceptual features of the sound. I am in the process of writing my results up into a thesis actually.

Can someone knowledgeable please give us examples of real-life use of PCA? Not "could be used here, could be used there" toy examples, but actual use.

If you want to do inference and hypothesis testing.

You need to save your degrees of freedom. You want 10 to 20 observations per predictor. So you can use PCA to collapse a subset of the predictors and keep the predictors you are making inferences about. This will help the sensitivity of your test.

Another case is linear regression, or any modeling where multicollinearity is a problem, that is, where predictors are confounded or affect each other. PCA changes the basis so that the new predictors are orthogonal to each other, which gets rid of the multicollinearity problem.

A toy example is:

student gpa, sat score, math grade, height, hour of study

Where student GPA is the response or what you want to predict.

If you apply PCA to SAT score (x1), math grade (x2), height (x3), and hours of study (x4), it will give you new predictors that are linear combinations of those predictors. Some statistics books refer to this as a sort of regression.

Anyway you may get new variables as:

new_x1 = 0.4* sat score + 1.2* math grade

new_x2 = 0.1* height + 0.5* hour of study

These new predictors are orthogonal to each other, so they don't suffer from multicollinearity. You can now do linear regression using these predictors.
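A hedged sketch of that toy example in NumPy. All the numbers are synthetic and invented for illustration (the latent `ability` variable, the noise levels and the GPA formula are assumptions, not real student data); the point is only that the rotated predictors come out uncorrelated and can be fed to ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
# Synthetic, made-up predictors: the SAT and math columns are deliberately correlated.
ability = rng.normal(size=n)
sat = ability + 0.3 * rng.normal(size=n)
math_grade = ability + 0.3 * rng.normal(size=n)
height = rng.normal(size=n)
hours = rng.normal(size=n)
X = np.column_stack([sat, math_grade, height, hours])
gpa = 0.5 * ability + 0.3 * hours + 0.1 * rng.normal(size=n)

# Standardize, then rotate onto the principal axes.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = Xs @ Vt.T                          # new predictors, one column per component

# The new predictors are uncorrelated: off-diagonal correlations vanish.
corr = np.corrcoef(scores, rowvar=False)

# Ordinary least squares on the first two components (principal component regression).
A = np.column_stack([np.ones(n), scores[:, :2]])
coef, *_ = np.linalg.lstsq(A, gpa, rcond=None)
```

Each row of `Vt` holds the weights of one new predictor in terms of the standardized originals, which is where combinations like the ones above come from.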

The problem is interpretation: sometimes you get a grouping like height + hours of study.

Actually just look here for example: https://www.whitman.edu/Documents/Academics/Mathematics/2017...

Under "6.4 Example: Principal Component Analysis and Linear Regression"

I use it for exactly that kind of purpose - highlighting interesting relative strengths and weaknesses in a 42-point assessment. So much better than benchmarking against some average, with the added advantage that it will keep finding interesting points even as scores improve.

Amazingly little code too. NumPy and SciPy are awesome :-)

It's been a long time, but we used PCA in remote sensing to reduce the number of bands to a smaller subset that is easier to handle.

Satellite data is collected using sensors that are multispectral/hyperspectral (for example, Landsat has 11 bands, but sometimes there are over 100), and this can be cumbersome to work with. PCA can be applied to the data so that you have a smaller subset that contains most of the original information, which makes further processing faster and easier.
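For illustration, band reduction along these lines might look like the sketch below. The image is synthetic (a made-up 32x32 scene with 11 bands built from 3 underlying signals), so the sizes and the choice of `k = 3` are assumptions and not a real Landsat workflow.

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, B = 32, 32, 11                        # made-up scene: 32x32 pixels, 11 bands
# Synthetic image whose bands are mixtures of 3 underlying signals plus small noise.
sources = rng.normal(size=(H * W, 3))
mixed = sources @ rng.normal(size=(3, B)) + 0.01 * rng.normal(size=(H * W, B))
cube = mixed.reshape(H, W, B)

pixels = cube.reshape(-1, B)                # treat every pixel as one observation
centered = pixels - pixels.mean(axis=0)
_, s, Vt = np.linalg.svd(centered, full_matrices=False)

k = 3
reduced = (centered @ Vt[:k].T).reshape(H, W, k)   # 11 bands -> 3 component "bands"
explained = (s[:k] ** 2).sum() / (s ** 2).sum()    # fraction of variance retained
```

The reduced cube keeps the spatial layout, so downstream image processing can run on 3 layers instead of 11.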

Sounds very cool. However, when you transform the data using PCA, the interpretation of the signals is different, right? How do you approach that problem?

I see, this is another way to look at it. I was asking about how to interpret the components themselves. Your link suggests converting the coefficients of the PCA regression back to coefficients for the original variables.

Since PCA is geared towards reducing dimensions, it would be an example of data which has many features (aka dimensions). Data on 'errors in a manufacturing line' would be a good example because you could be capturing a large number of variables which may be contributing towards a defective product. You would be capturing features like ambient temperature, speed of the line, which employees were present, etc. You would (virtually) be throwing in the kitchen sink for features (variables) in the hope of finding what could be causing defective Teslas, for instance.

What PCA does (to reduce this large number of dimensions) is hang this data on a new set of dimensions by letting the data itself indicate them. PCA starts off by choosing its first axis based on the direction of highest variance. The second axis is then chosen by looking perpendicular (orthogonal) to the first and finding the direction of highest variance there. Basically, you continue until you've captured a majority of the variance, which should be feasible with fewer dimensions than you started with. Mathematically, these axes are found via the eigenvectors of the covariance matrix.
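The "continue until you've captured a majority of the variance" step can be sketched as follows. The data is synthetic, and the 90% cutoff is an arbitrary assumption picked for the example, not a recommended default.

```python
import numpy as np

rng = np.random.default_rng(4)
# Eight features with very different spreads (made-up scales).
X = rng.normal(size=(400, 8)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.3, 0.2, 0.1])

cov = np.cov(X, rowvar=False)                     # np.cov centers the columns internally
eigvals = np.linalg.eigvalsh(cov)[::-1]           # variance along each axis, largest first
ratio = eigvals / eigvals.sum()                   # share of total variance per component
cumulative = np.cumsum(ratio)

threshold = 0.90                                  # arbitrary cutoff chosen for illustration
k = int(np.searchsorted(cumulative, threshold) + 1)   # smallest k reaching the threshold
```

Plotting `cumulative` against the component number gives the usual "scree"-style curve people eyeball when picking the number of dimensions.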

This real life example I am sure will help:

My teacher wanted to buy a car and needed help choosing; he wanted a "good" deal, so he applied PCA to all the car models on sale:

His real question was:

* what are the most important variables that make up a car's PRICE? or, said another way

* if I have to compare two cars that have the same price, which car gives me the most for my money?

The answer was pretty surprising:

the most important variable is WEIGHT


So, when you choose a car, always check its weight! Do two cars have the same price? Take the heavier one :) (This result dates from the 90s. Is it still valid now? Not sure: we need PCA.)

I've been using this result since then, applying it in different contexts (which is, of course, not correct): when I am in doubt about which product to choose, I always choose the heavier one. I would not use this 'method' to buy a speed bicycle, ...or to choose the best girl ;)

Then, you even have somebody stating that we are using this method even without this explanation https://www.securityinfowatch.com/integrators/article/122343...

Haha nice one.

In the area of cheminformatics, in order to discover new types of chemicals to address some disease, one approach is to associate members of a large chemical database with some coordinate space and consider those chemicals which fall, in some sense, close to known useful pharmaceuticals. (As a simple example, let's say molecular weight along one axis, polarizability along another, number of hydrogen bond donors/acceptors, rotatable bonds, radius of gyration, and perhaps hundreds more.) But there are problems with such a high-dimensional space[1], particularly if one wants to do some useful statistics, cluster analysis, etc. So enter PCA as a means to lower the dimensionality to something more tractable. At the same time it gives you eigenvalues with a sense of what your target "cares" about among known chemical descriptors (low variability along one axis might indicate relative importance) versus physical factors with more permissible variation.

[1] https://en.wikipedia.org/wiki/Curse_of_dimensionality

It's been used in computer graphics to speed up rendering. One technique which was quite popular IIRC back in the days was to use clustered PCA for precomputed radiance transfer[1]. It even made its way into DirectX 9[2].

Can't comment on longevity, I went for realism over real-time not long after.

[1]: https://www.microsoft.com/en-us/research/video/clustered-pri...

[2]: https://docs.microsoft.com/en-us/windows/win32/direct3d9/pre...

We just published a typical GWAS paper that used PCA to sanity check whether the "ethnicity" reported by our patients aligned with what their genome told us.

We had 200,000 dimensions (ACGT's), which we reduced into 2 via PCA and sure enough if someone said they were "Filipino" then they generally appeared close to the other folks who said they were "Filipino".

https://breckuh.github.io/eopegwas/src/main.nb.html (chart titled: QC: PCA of SNPs shows clustering by reported ethnicity, as expected)

You can perform outlier detection with the 'autoencoder' architecture. Usually you hear this term in the context of neural networks, but it actually applies to any method which performs dimensionality reduction and which also has an inverse transform defined.


1) Reduce the dimensionality of your data, then perform the inverse transform. This will project your data onto a subset of the original space.

2) Measure the distance between the original data and this 'autoencoded' data. This measures the distance from the data to that particular subspace. Data which is 'described better' by the transform will be closer to the subspace and is more 'typical' of the data and its associated underlying generative process. Conversely, the data which is far away is atypical and can be considered an outlier or anomalous.
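The two steps above can be sketched with PCA as the reduction. This is a hedged illustration on synthetic data: the 2-D plane, the noise level and the single planted outlier are all assumptions, and the subspace is fitted on the typical points only.

```python
import numpy as np

rng = np.random.default_rng(5)
# Typical data lies near a 2-D plane inside a 6-D space (synthetic, for illustration).
basis = rng.normal(size=(2, 6))
typical = rng.normal(size=(300, 2)) @ basis + 0.05 * rng.normal(size=(300, 6))
outlier = 5.0 * rng.normal(size=(1, 6))           # a point with no reason to sit on the plane
X = np.vstack([typical, outlier])

mean = typical.mean(axis=0)
_, _, Vt = np.linalg.svd(typical - mean, full_matrices=False)
W = Vt[:2].T                                       # the learned 2-D subspace

# Step 1: reduce, then invert. Step 2: distance between original and reconstruction.
reconstructed = (X - mean) @ W @ W.T + mean
errors = np.linalg.norm(X - reconstructed, axis=1)

most_anomalous = int(np.argmax(errors))            # index of the largest reconstruction error
```

In real use you would pick a threshold on `errors` (for example a high percentile of the errors on known-good data) instead of just taking the maximum.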


Precisely which dimensionality reduction technique (PCA, neural networks, etc.) is chosen depends on which assumptions you wish to encode into the model. The vanilla technique for anomaly/outlier detection using neural networks relies on this idea, but encodes almost zero assumptions beyond smoothness in the reduction operation and its inverse.

In addition to everything else that's been mentioned, you can simply use PCA as a preprocessing step for other algorithms. For example, you can apply a linear regression algorithm using the principal components as input instead of the original features in the dataset.

Principal components are often used to construct Deprivation Indexes for geographic areas (neighbourhoods and administrative areas). They combine multiple socioeconomic indicators into a single measure (usually the score on the first principal component).
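A toy sketch of how such an index could be built, assuming the index is the score on the first principal component of standardized indicators. The three indicators and the latent factor are invented for the example; real deprivation indexes use their own published indicator sets and weighting rules.

```python
import numpy as np

rng = np.random.default_rng(6)
n_areas = 50
# Made-up indicators driven by one latent "deprivation" factor plus noise.
latent = rng.normal(size=n_areas)
indicators = np.column_stack([
    latent + 0.4 * rng.normal(size=n_areas),     # e.g. unemployment rate
    latent + 0.4 * rng.normal(size=n_areas),     # e.g. share of overcrowded housing
    latent + 0.4 * rng.normal(size=n_areas),     # e.g. share without qualifications
])

Z = (indicators - indicators.mean(axis=0)) / indicators.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
pc1 = Vt[0]
if pc1.sum() < 0:                                 # the sign of a component is arbitrary;
    pc1 = -pc1                                    # orient it so higher means more deprived
index = Z @ pc1                                   # one score per area
```

The sign flip matters in practice: PCA only defines each component up to sign, so an index has to be oriented explicitly before areas are ranked.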


An intro example: principal component regression -- simplifies the inputs into a regression technique (can be used with other ML)

The general algorithm is called Singular Value Decomposition, and it can be used in lossy compression and other similar simplifications.
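As a small illustration of the lossy-compression use, here is a truncated SVD in NumPy. The matrix is random stand-in data and `k = 10` is an arbitrary assumption; with a real image the singular values usually decay much faster, so the approximation looks far better than it does on noise.

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(60, 40))                 # stand-in for an image or data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
A_k = (U[:, :k] * s[:k]) @ Vt[:k]             # best rank-k approximation (Eckart-Young)

stored = k * (A.shape[0] + A.shape[1] + 1)    # numbers kept: U_k, V_k and k singular values
error = np.linalg.norm(A - A_k)               # Frobenius norm of what was thrown away
```

The error is exactly the root of the sum of squares of the discarded singular values, so the spectrum `s` tells you in advance how much a given `k` will lose.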

Chemical spectroscopy. You might have spectra collected from a variety of samples, and want to highlight how they actually differ from one another, possibly en route to identifying an impurity or a manufacturing variation.

Topic modeling of text documents. The so-called LSA topic-modeling technique is basically SVD applied to text. And, as we all know, SVD is simply PCA without mean-centering the data.

Am I missing something or is equation (9) incorrect unless the mean of the random variable x is zero (which is never specified)?

You are correct, eqn 9 is technically wrong. But it is traditional in PCA to de-mean the columns of X.
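A quick numerical check of why the de-meaning matters. This assumes the equation in question has the usual X^T X / (n - 1) form, which I have not verified against the article; the data is random with a deliberately non-zero mean.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(loc=10.0, scale=1.0, size=(1000, 3))   # columns have mean ~10, not 0
n = X.shape[0]

uncentered = X.T @ X / (n - 1)                # only equals the covariance if the mean is zero
Xc = X - X.mean(axis=0)                       # de-mean each column
centered = Xc.T @ Xc / (n - 1)                # the sample covariance matrix

reference = np.cov(X, rowvar=False)           # NumPy centers internally
```

Here `centered` matches `reference`, while `uncentered` is dominated by the squared mean, so its leading eigenvector mostly points at the mean instead of along the direction of greatest spread.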

I've been using a somewhat related technique in my research: Principal Coordinate Analysis (PCoA) also called Multidimensional Scaling (MDS) which works on a dissimilarity matrix. See [1] for the differences.

[1] http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_ex...
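For readers who have not seen it, classical MDS (the PCoA variant that works on Euclidean-like distances) can be sketched in a few lines. The function name `pcoa` and the random 2-D points are made up for the example; with a non-Euclidean dissimilarity some eigenvalues go negative, which this sketch simply clips to zero.

```python
import numpy as np

def pcoa(D, k):
    """Classical MDS: embed points in k dimensions from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]     # k largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)

rng = np.random.default_rng(8)
points = rng.normal(size=(30, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # Euclidean distances
coords = pcoa(D, k=2)
D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

The recovered coordinates differ from the originals by a rotation, reflection or shift, but the pairwise distances are reproduced, which is all the method promises.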

Excellent explanation!

I find this visualization really helpful when I need to explain PCA: http://setosa.io/ev/principal-component-analysis/

This was a great article - I'd been wanting to understand PCA. Particularly liked the digression on Lagrange multipliers.

Some nice intuitive explanations in there.

Anyone want to comment on how to choose between PCA and ICA?

(Some of the hardest parts of ML, imho, are in selection.)

From a practical point of view, it's really in the name: Independence. PCA is great for finding a lower dimensional representation capturing most of what is going on (the basis vectors will be uncorrelated but can be hard to interpret). ICA is great for finding independent contributions you might want to pull out or analyze separately (the basis vectors are helpful in themselves).

PCA is very practical for dimensional reduction, ICA for blind source separation.

You wouldn't usually use ICA for dimensional reduction unless you have a known contribution you want to get rid of, but for some reason have difficulty identifying it.

Shooting from the hip here, but I think ICA was originally designed for the blind source separation problem. PCA is over 100 years old and the original dimensionality reduction algorithm.
