

Six Ways to Address Collinearity in Regression Models - goumie
http://learnitdaily.com/six-ways-to-address-collinearity-in-regression-models/

======
dbecker
This is a good article, but I think it leaves out an important option, which
is "do nothing." This option isn't always right, but it has some compelling
justifications when attempting causal inference and hypothesis testing.

Leaving all the variables in a regression leads to large standard errors.
These large standard errors accurately reflect the uncertainty of having
highly related explanatory variables. That is, the data literally can't show
which variable is causing the outcome.

Removing covariates leads to smaller standard errors, giving the illusion of
certainty.
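
A quick simulation makes this concrete. This is my own sketch (assuming Python
with numpy and statsmodels, neither of which the article mentions), not
anything from the article:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200

    # Two highly collinear predictors: x2 is x1 plus a little noise.
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

    # Full model: both coefficient estimates are unbiased, but their
    # standard errors are large, honestly reflecting that the data
    # cannot separate the two variables' effects.
    full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(full.bse)

    # Reduced model: dropping x2 shrinks the standard error on x1, but
    # the x1 coefficient now absorbs x2's effect (about 2.0 instead of
    # 1.0). The smaller standard error is an illusion of certainty
    # about a biased estimate.
    reduced = sm.OLS(y, sm.add_constant(x1)).fit()
    print(reduced.params, reduced.bse)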

Obviously, if you are just trying to get p-values less than 0.05, this isn't
going to help. Or if you are presenting the data to someone who doesn't
understand standard errors, then having standard errors that accurately
reflect inferential uncertainty isn't important. And, if the end goal is
prediction, this whole point is irrelevant.

But "leave everything in there" is frequently written off when it's the most
appropriate approach for representing relationships in the data.

~~~
n00b101
> This is a good article, but I think it leaves out an important option, which
> is "do nothing."

I don't see how "do nothing" is a good option. If two predictor variables are
perfectly collinear, then the linear regression equation will not have a
unique solution at all. But this is an edge case. The more common problem with
collinearity is that it leads to unstable regression parameters. For example,
all variables in a dataset could have very high, positive correlation, but one
of the regression parameters could come out as a negative number due to
collinearity (which would be counter-intuitive). If you are basing decisions
on the regression parameters, then it could lead to some very silly decisions.
Moreover, if you resample the data and re-fit the regression model, you could
get very different regression parameters (i.e. the parameters will be
unstable). Stability of the parameters over time (or over different training
sets) is one of the most important properties of a good model.
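
A small resampling sketch (my own illustration, assuming Python with numpy;
nothing here is from the thread) shows the instability directly:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)  # nearly perfect collinearity
    y = x1 + x2 + rng.normal(size=n)     # both true effects are +1
    X = np.column_stack([np.ones(n), x1, x2])

    # Refit OLS on bootstrap resamples and watch the coefficients swing.
    for _ in range(5):
        idx = rng.integers(0, n, size=n)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        # The x1/x2 coefficients vary wildly, often with opposite signs,
        # even though every variable in the data is positively correlated.
        print(beta[1:])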

To see why "do nothing" is not a logical option, you can imagine that your
data points are plotted within a 3-dimensional cube (1 response variable, and
2 predictor variables). If the data is collinear, then your data points will
lie along a straight line that cuts across the diagonal of the cube. If you
"do nothing," then it is directly equivalent to trying to fit a PLANE through
a LINE in 3 dimensions. The problem is that points along a LINE do not
uniquely identify a plane. The slope along one axis of the plane will be
stable, but the slope along the other axis will be fitted arbitrarily by the
regression model and will be unstable. The proper thing to do here is to model
the line in 2 dimensions. To do this, you can either eliminate one of the two
predictor variables from the model, or you can project the line in 3
dimensions onto a 2-dimensional subspace (which is what PCA does). Of course,
a line formed by data points in 2 dimensions (x and y) is very nicely modeled
by ordinary linear regression, and it will give a very stable slope (Beta)
parameter.
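
For what it's worth, a minimal sketch of that projection step (my example,
assuming Python with numpy and scikit-learn): collapse the two collinear
predictors onto their first principal component, then fit the plain
2-dimensional regression on it.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + 0.02 * rng.normal(size=n)  # predictors lie along a line
    y = x1 + x2 + rng.normal(size=n)

    # Project the 2-D predictor cloud onto its first principal
    # component, i.e. model the line instead of trying to fit a plane
    # through it.
    z = PCA(n_components=1).fit_transform(np.column_stack([x1, x2]))

    # Ordinary regression of y on the single projected predictor gives
    # one stable slope instead of two arbitrary ones.
    model = LinearRegression().fit(z, y)
    print(model.coef_)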

~~~
dbecker
_I don't see how "do nothing" is a good option._

It is explained in the rest of the comment. I would suggest you read that. For
a more in-depth explanation, this is also covered in most econometrics
textbooks.

_Stability of the parameters over time (or over different training sets) is
one of the most important properties of a good model._

In some cases it is; in many cases, unbiasedness or consistency are more
important.

When stability is a priority, collinearity clearly needs to be addressed.

If unbiasedness and/or consistency is a priority, doing nothing is the best
option (since removing variables leads to omitted variable bias, PCA does not
yield the parameter of interest, and regularization techniques such as ridge
regression are both biased and inconsistent).
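
To sketch the omitted variable bias point (my own simulation, assuming Python
with numpy and statsmodels): when x1 and x2 are correlated and both affect y,
dropping x2 biases the coefficient on x1.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 5000
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # correlated with x1
    y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

    # "Do nothing": keep both variables. The x1 estimate is unbiased
    # (about 1.0), with an honest, wider standard error.
    full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

    # Drop x2 to "fix" collinearity: the x1 coefficient now picks up
    # part of x2's effect (about 1.0 + 2.0 * 0.8 = 2.6), i.e. omitted
    # variable bias.
    reduced = sm.OLS(y, sm.add_constant(x1)).fit()

    print(full.params[1], reduced.params[1])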

------
n00b101
These methods fall under the general categories known as Feature Selection,
Feature Extraction and Dimension Reduction:
[https://en.wikipedia.org/wiki/Dimensionality_reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction)

------
tlarkworthy
If you are doing _linear_ _regression_ with correlated variables, PLS is the
correct choice. Decision trees are for categorical variables and are
non-linear. PCA is basically PLS without regression. The other methods throw
data away. PLS is fast and can be done online, e.g. with LWPR.
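
For anyone who wants to try it, a minimal PLS sketch (my example, assuming
Python with scikit-learn's PLSRegression; the comment itself names no
library):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(4)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)  # strongly correlated predictors
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(size=n)

    # PLS picks components that maximize covariance with y, so a single
    # component captures the shared signal in the correlated predictors.
    pls = PLSRegression(n_components=1).fit(X, y)
    print(pls.predict(X[:5]))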

~~~
n00b101
> If you are doing linear regression with correlated variables, PLS is the
> correct choice.

False. PLS (Partial Least Squares) is an ad-hoc algorithm that was invented by
people working in chemometrics. PLS was not based on any rigorous statistical
theory; it is merely a heuristic. It might appear to work reasonably well on
some data sets, but asserting that "PLS is the correct choice" is cargo cult
science.

~~~
tlarkworthy
"cargo cult science" Um, no, unless you have decided a ton of empirical peer
reviewed journal articles are written by cargo cult scientists?

the same general argument you make can be said of PCA ... but it works. We
_know_ it works becuase it can be used to predict the _real world_.

