A bold headline, and then the stepping back later down the article:
> Some of the problems don’t matter as much if your goal for the model is just prediction, not interpretation of the model and its coefficients.
"It depends" strikes again.
> As my professor once told our class:
> "If you choose the variables in your model based on the data and then run tests on them, you are Evil; and you will go to Hell."
Again, it depends. If one wants to be pedantic, then choosing to omit any variable in the universe in a non-random fashion is based on some data. Choosing variables via "subject matter expertise" is the same thing but "smarter."
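The professor's point is easy to demonstrate. Here is a small sketch (mine, not from the article, and in Python rather than the article's R): all predictors are pure noise, but if you first pick the column most correlated with `y` and *then* t-test it, the nominal 5% error rate is blown far past 5%.

```python
# Sketch: "choose the variables based on the data, then run tests on them."
# Every column of X is pure noise, unrelated to y, yet the selected column
# "passes" a 5% test far more than 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, reps = 50, 15, 2000
false_positives = 0
for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)          # y is independent of every column of X
    # Selection step: pick the column most correlated with y.
    cors = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)]
    x_best = X[:, int(np.argmax(cors))]
    # Inference step: simple-regression t-test on the selected column.
    if stats.linregress(x_best, y).pvalue < 0.05:
        false_positives += 1

print(false_positives / reps)  # well above the nominal 0.05
```

Roughly 1 - 0.95^15 ≈ 0.54 of the runs come out "significant", which is the Hell the professor was referring to.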
> To explore this I wrote a function (code a little way further down the blog) to simulate data with 15 X correlated variables and 1 y variable.
> mod <- lm(y ~ ., data = d)
If one wants to be pedantic, then why are we using a model that assumes independent predictors on data that is correlated by construction?
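To illustrate the objection, here is a sketch (mine, in Python, not the blog's R code): two predictors sharing a latent factor, so their correlation is controlled by `rho`. Predictions stay fine either way, but the sampling variability of an individual OLS coefficient balloons as the predictors become correlated, which is exactly what makes coefficient interpretation fragile.

```python
# Sketch: how correlated predictors inflate the sampling variability
# of an individual OLS coefficient.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 500

def coef_sd(rho):
    """Std. dev. of the first slope across simulated datasets."""
    coefs = []
    for _ in range(reps):
        z = rng.standard_normal(n)                      # shared latent factor
        x1 = np.sqrt(rho) * z + np.sqrt(1 - rho) * rng.standard_normal(n)
        x2 = np.sqrt(rho) * z + np.sqrt(1 - rho) * rng.standard_normal(n)
        y = x1 + x2 + rng.standard_normal(n)            # true slopes are both 1
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta[1])
    return float(np.std(coefs))

print(coef_sd(0.0), coef_sd(0.95))  # the correlated case is far noisier
```

With corr(x1, x2) = 0.95 the standard deviation of the fitted slope is roughly three times what it is with independent predictors, even though both fits predict `y` about equally well.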
The important thing is that it isn't based on the data you are attempting to analyse. It's fine to use subject matter expertise beforehand to decide on what is appropriate to include or not in your analysis.
The article seems to analyze a statistical practice from a theoretical perspective.
Using the same perspective, another way to formulate this discussion is:
1. Look at all the data in the universe.
2. Choose some to examine (using a non-random procedure).
3. From those, employ a variable selection procedure (the article argues against stepwise selection and somewhat for Lasso).
4. Fit a model to the remaining data.
In reality, there are at least two variable selections occurring. The first (choosing which data to examine out of the universe of data) picks variables via some procedure that is ultimately grounded in data.
This is a catch-22: unless you look at all the data that exists, any subset you choose is itself chosen based on some data.
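For concreteness, step 3 of the procedure above, done the way the article leans toward, looks something like this (my sketch, assuming scikit-learn; the simulated setup loosely mirrors the blog's "15 correlated X variables", with only the first three truly affecting `y`):

```python
# Sketch: Lasso-based variable selection (step 3 above) on simulated
# correlated predictors where only columns 0, 1, 2 actually matter.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p = 200, 15
z = rng.standard_normal((n, 1))                      # shared latent factor
X = 0.7 * z + 0.7 * rng.standard_normal((n, p))      # mutually correlated columns
y = X[:, 0] + X[:, 1] + X[:, 2] + rng.standard_normal(n)

mod = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(mod.coef_ != 0)
print(selected)  # the truly relevant columns 0, 1, 2 should typically appear
```

Note that even here, deciding that these 15 columns were the candidates in the first place was itself the earlier, unacknowledged selection step.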