
Linear compression in Python: PCA vs unsupervised feature selection - efavdb
http://efavdb.com/unsupervised-feature-selection-in-python-with-linselect/
======
wjn0
I found this line confusing:

> The printed lines above show that both algorithms capture more than 50% of
> the variance exhibited in the data using only 4 of the 50 stocks.

Based on the sklearn PCA documentation [1] this has nothing to do with the
coefficients on individual stocks, and for PCA should read more like: "[...]
capture more than 50% of the variance exhibited in the data using only 4
components [...]" which is not the same thing.

1\. [http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
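
For what it's worth, the distinction is easy to see directly in sklearn. A minimal sketch on random data (the 50-column matrix is a stand-in for the post's returns data, not the actual dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 50)  # stand-in: 200 days x 50 stocks of returns

pca = PCA(n_components=4)
pca.fit(X)

# Summing explained_variance_ratio_ gives the variance captured by the
# 4 *components* -- hybrid linear combinations of all 50 stocks --
# not by 4 of the stocks themselves.
captured = pca.explained_variance_ratio_.sum()
print(captured)
```

Note that `pca.components_` has shape `(4, 50)`: every retained component mixes all 50 columns, which is exactly why "4 components" and "4 stocks" are not the same claim.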

~~~
pacbard
Not really. PCA extracts components (or factors) from the individual items (in
this case, each stock). OP points out that those 4 stocks load strongly on the
first PCA component (a large share of their variation goes into computing it).
I would interpret those explained variances more as a way to make sense of the
component (maybe tech-company stocks load on it) than as a measure of
extraction "quality". Usually, the eigenvalues of the correlation matrix can
be used to get a sense of how well the PCA is performing and how many
dimensions are present in the raw data.

~~~
wjn0
The idea (ranking stocks by sum of the absolute value of the coefficients) is
valid, but that line of code (variance explained by each component) doesn't do
that.
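
The ranking described above would look something like this (a sketch, not the post's actual code; `X` here is a hypothetical returns matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = rng.randn(200, 50)  # hypothetical returns matrix, 50 stocks

pca = PCA(n_components=4).fit(X)

# Rank stocks by the sum of absolute loadings across the retained
# components. This uses pca.components_ (per-stock coefficients),
# not pca.explained_variance_ratio_ (which is per-component).
scores = np.abs(pca.components_).sum(axis=0)
ranking = np.argsort(scores)[::-1]
print(ranking[:4])  # indices of the 4 most heavily loaded stocks
```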

~~~
efavdb
I think your criticism is right. In that line I was thinking about the feature
selector, which does pull out 4 of the individual stocks -- these capture more
than 50% of the variance in the full set. As you pointed out, that description
isn't quite right for the PCA line, which uses four hybrid components.

------
samfisher83
Does it even make sense to run PCA on the percentage change of a stock? To me
it would make more sense to use it with physical properties of the underlying
company. PCA helps you reduce a higher-dimensional space to a lower-dimensional
one so you can group stocks together. I am a little confused by what the
author is trying to do.

~~~
wjn0
Sure it does. Author could've been clearer about the goals, though. Say that
rather than pre-selecting 50 stocks, we ran the analysis on the whole market -
then each component intuitively corresponds to some segment of the market. For
example, some component might correspond to agro companies whose prices
fluctuate with the weather. Another component might correspond to NFLX and
other companies who rely on AWS, which fluctuate together based on cloud
storage price changes.

This interpretation falls out of the math (the eigendecomposition/SVD/
covariance matrix view of PCA in particular).
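
A toy version of that intuition, using synthetic "segments" rather than real tickers: build two groups of co-moving series, one driven by a strong common factor and one by a weak factor, and the leading component concentrates on the stronger group.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(2)
n = 500
# Hypothetical market segments: factor a drives columns 0-2 (strong),
# factor b drives columns 3-4 (weak).
a = 2.0 * rng.randn(n)
b = 0.5 * rng.randn(n)
X = np.column_stack([
    a + 0.05 * rng.randn(n),
    a + 0.05 * rng.randn(n),
    a + 0.05 * rng.randn(n),
    b + 0.05 * rng.randn(n),
    b + 0.05 * rng.randn(n),
])

pca = PCA(n_components=2).fit(X)
loadings = np.abs(pca.components_)
# The first component's loadings concentrate on the high-variance group.
print(loadings[0].round(2))
```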

------
thanatropism
I wish people were better acquainted with the literature, e.g.
[https://www.nowpublishers.com/article/Details/ECO-002](https://www.nowpublishers.com/article/Details/ECO-002)

(Ed: yeah, that's just a sample of the book but has a large bibliography at
the end.)

------
rubatuga
I can't seem to make the COD reach 1.0

    
    
    >>> selector.ordered_cods
    [0.43298218, ... , 0.5068577, 0.5068577]
    

Would you think this a problem/bug?

~~~
efavdb
Did you use the code for a supervised application? i.e., did you pass both an
`X` and a `y` to the selector? If so, then getting less than 1 just means you
can't get a perfect fit to `y` with your features. Please let me know if
that's not it.
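
The underlying point is plain R²: the COD is the coefficient of determination of a linear fit, so it only reaches 1.0 when `y` is exactly linear in the features. A sketch with sklearn's `LinearRegression` standing in for the selector's fits (not linselect itself):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(3)
X = rng.randn(300, 5)
w = np.array([1.0, 0.5, 0.0, 0.0, 0.0])

# y with irreducible noise: no linear model in X reaches R^2 = 1.
y_noisy = X @ w + rng.randn(300)
r2_noisy = LinearRegression().fit(X, y_noisy).score(X, y_noisy)

# y exactly linear in X: R^2 hits 1 once the right features are in.
y_exact = X @ w
r2_exact = LinearRegression().fit(X, y_exact).score(X, y_exact)

print(round(r2_noisy, 3), round(r2_exact, 3))
```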

If interested, you can find some detailed examples in a tutorial below
[https://github.com/EFavDB/linselect_demos](https://github.com/EFavDB/linselect_demos)

------
squigs25
Another technique for unsupervised feature selection is Principal Feature
Analysis (PFA): [http://venom.cs.utsa.edu/dmz/techrep/2007/CS-TR-2007-011.pdf](http://venom.cs.utsa.edu/dmz/techrep/2007/CS-TR-2007-011.pdf)

------
octopod
This dataset could be interesting as it consists of stocks and cryptos
[https://vectorspace.ai/recommend/datasets](https://vectorspace.ai/recommend/datasets)

------
closed
This title seems a bit confusing, since PCA is a form of unsupervised feature
selection (or rather, feature weighting).

The title seems like it has the form "<Specific method> vs <Broader category
method fits in>".

~~~
notafraudster
I tend to think of feature selection as being methods like LASSO that induce
conceptual sparsity in the feature space; and use the label "dimensional
reduction" for methods which reduce covariant dimensionality without inducing
conceptual sparsity -- PCAing 100 features down to 3 principal components may
or may not actually lend itself to a simpler interpretation but often it just
susbstitutes the problem of labeling loaded principal components in lieu of
the original problem.

Not disagreeing with you, just spitballing.
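
The contrast can be made concrete on synthetic data: LASSO drives coefficients on irrelevant features exactly to zero, while every PCA component mixes all the features (a sketch, with the regularization strength chosen arbitrarily):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.decomposition import PCA

rng = np.random.RandomState(4)
X = rng.randn(200, 10)
# y depends on only the first two of ten features.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.randn(200)

# LASSO: conceptual sparsity -- irrelevant coefficients hit exactly zero.
lasso = Lasso(alpha=0.3).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))

# PCA: dimensionality reduction -- each component loads on all 10 features.
pca = PCA(n_components=3).fit(X)
n_nonzero_loadings = int(np.sum(pca.components_ != 0.0))

print(n_zero, n_nonzero_loadings)
```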

~~~
closed
That's fair and was helpful to hear! I had dimensionality reduction in the
back of my mind, and now that you mention it, your point about conceptual
sparsity definitely seems important here.

