>> Some of the problems don’t matter as much if your goal for the model is just prediction, not interpretation of the model and its coefficients. But most of the time that I see the method used (including recent examples distributed by so-called experts as part of their online teaching), the end model is indeed used for interpretation, and I have no doubt this is also the case with much published science. Further, even when the goal is only prediction, there are better methods, like the Lasso, for dealing with a high number of variables.
I use this method often for prediction applications. First, it’s a sort of hyperparameter selection, so you should obviously use a holdout and test set to help you make a good choice.
Second, I often see the method dogmatically shut down like this in favor of lasso. Yet every time I have compared the two they give similar selections — so how can one be “evil” and the other so glorified? I prefer the stepwise method, though, as you can visualize the benefit of adding each additional feature. That can help guide further feature development — a point that I’ve seen significantly lift the bottom line of enterprise-scale companies.
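To make that workflow concrete, here is a minimal R sketch of one way it could look (all data and variable names are made up for illustration): greedy forward selection where each candidate feature is scored on a validation set, so the benefit of each addition can be plotted. A final untouched test set would still be needed for an honest performance estimate.

    ## Hypothetical sketch: greedy forward selection scored on a validation set.
    ## All names and data here are made up for illustration.
    set.seed(42)
    n <- 300
    d <- data.frame(matrix(rnorm(n * 10), n, 10))
    names(d) <- paste0("x", 1:10)
    d$y <- 2 * d$x1 - d$x2 + rnorm(n)

    train <- d[1:200, ]
    valid <- d[201:300, ]

    rmse <- function(mod, newdata) sqrt(mean((predict(mod, newdata) - newdata$y)^2))

    vars   <- setdiff(names(train), "y")
    chosen <- character(0)
    path   <- numeric(length(vars))

    for (k in seq_along(vars)) {
      cand  <- setdiff(vars, chosen)
      score <- sapply(cand, function(v) {
        rmse(lm(reformulate(c(chosen, v), "y"), data = train), valid)
      })
      chosen  <- c(chosen, names(which.min(score)))  # add the feature that helps most
      path[k] <- min(score)
    }

    ## Visualize the benefit of each added feature, as described above.
    plot(path, type = "b", xlab = "Number of features", ylab = "Validation RMSE")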
> Yet every time I have compared the two they give similar selections — so how can one be “evil” and the other so glorified?
Frequentist and Bayesian approaches often yield similar results but are philosophically different. In general I favor and recommend lasso because I see it perform as well as or better than stepwise at variable selection without all the baggage.
Lasso avoids the multiple comparisons problem by applying a regularization penalty instead of sequentially fitting multiple models and performing hypothesis tests. This also helps to prevent overfitting. If you want to see which variables would be included or excluded, you can turn the regularization up or down (it is pretty easy to spit out an automated report, as in the sketch below).
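As a rough illustration of that kind of report, here is a sketch using the glmnet package (my choice; the comment names no package): fit the whole lasso path once, then list which variables survive a heavy versus a light penalty.

    ## Sketch with the glmnet package (an assumption; the comment names no package).
    library(glmnet)

    set.seed(1)
    X <- matrix(rnorm(100 * 15), 100, 15)  # toy data
    y <- X[, 1] - 2 * X[, 2] + rnorm(100)

    fit <- glmnet(X, y)          # fits the entire lasso path in one call
    plot(fit, xvar = "lambda")   # variables enter/leave as the penalty changes

    ## A crude "automated report": which variables survive a given penalty?
    included <- function(fit, s) {
      cf <- as.matrix(coef(fit, s = s))
      setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")
    }
    included(fit, s = 0.5)    # regularization turned up: few variables remain
    included(fit, s = 0.01)   # regularization turned down: most variables kept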
Stepwise selection comes in different flavors: forward, backward, or bidirectional; R-squared, adjusted R-squared, AIC, BIC, etc. These often lead to different models, so the choices must be justified, and I rarely see any defense of them.
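A quick demonstration of that point with base R's step() on simulated data: the same data, three flavor choices, often three different formulas.

    ## Toy demonstration: three stepwise flavors, potentially three models.
    set.seed(2)
    n <- 80
    d <- data.frame(matrix(rnorm(n * 8), n, 8))
    names(d) <- paste0("x", 1:8)
    d$y <- d$x1 + 0.3 * d$x2 + rnorm(n)

    full_form <- reformulate(paste0("x", 1:8), "y")  # y ~ x1 + ... + x8
    full <- lm(full_form, data = d)
    null <- lm(y ~ 1, data = d)

    back_aic <- step(full, direction = "backward", trace = 0)              # backward, AIC (k = 2)
    back_bic <- step(full, direction = "backward", k = log(n), trace = 0)  # backward, BIC
    fwd_aic  <- step(null, scope = full_form, direction = "forward", trace = 0)

    formula(back_aic)  # compare the three selected formulas;
    formula(back_bic)  # they frequently disagree, and the choice
    formula(fwd_aic)   # of flavor is rarely defended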
Of course, if the point is prediction over coefficient estimation and interpretability, then neither of these is a great choice.
> I use this method often for prediction applications. First, it’s a sort of hyperparameter selection, so you should obviously use a holdout and test set to help you make a good choice.
What the article is talking about is inference, not prediction. That's a different problem domain: it's not about telling a company whether design A or B leads to more engagement; it's about finding the (true!) causal drivers of that difference. The distinction may seem subtle, but it's important. The key problems outlined all concern common (frequentist) statistical tests and how they get messed up by variable selection. Holdout sets don't address this, because if the holdout set comes from the same distribution as the training set (as it should), the biases would be the same there. Bayesian inference isn't a panacea either: the core problem is structuring the model based on the data and then drawing conclusions about their relationships (Bayesian analysis gives you tools to help avoid this, but comes with its own set of traps to fall into, such as the difficulty of finding truly non-informative priors).
Yeah, the title is a bit hyperbolic. I have not used selection methods that much, but it's not too surprising that they would give results similar to LASSO as a selection or predictive method for people who think of it in terms of "feature development".
The distaste for stepwise selection comes from its typical use. If one reads Harrell's complaints quoted in the blog post carefully, quite a few of them are less about the selection method than about what the analyst does with it, namely, interpretation of inferential statistics. When you see stepwise in the wild, the practitioner has often used stepwise or some other selection method and then reports the usual test statistics and p-values for the final fitted model ... statistics derived under assumptions that don't take the selection steps into account. That is quite unfortunate in fields where people put a lot of faith in coefficient estimates, p-values, and Wald confidence intervals when writing the conclusions of their papers.
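A small simulation (mine, not from the blog post) makes the complaint concrete: the response is pure noise, yet the p-values printed for the stepwise-selected model can still look convincing, because they are computed as if the model had been specified in advance.

    ## Pure-noise simulation: y is unrelated to every x, yet the final model's
    ## printed p-values ignore the selection process that produced it.
    set.seed(3)
    n <- 100
    d <- data.frame(matrix(rnorm(n * 15), n, 15))
    names(d) <- paste0("x", 1:15)
    d$y <- rnorm(n)  # no true relationship to any predictor

    sel <- step(lm(y ~ ., data = d), direction = "backward", trace = 0)
    summary(sel)$coefficients  # typically shows some "significant" effects anyway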
With LASSO and its cousins, the standard packages and literature strongly encourage the user to focus on predictions and run cross-validation right from the beginning.
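For contrast with the stepwise workflow, here is a minimal sketch of that prediction-first pattern, again assuming glmnet as the package:

    ## Prediction-first lasso workflow (sketch, assuming glmnet).
    library(glmnet)

    set.seed(4)
    X <- matrix(rnorm(200 * 15), 200, 15)
    y <- X[, 1] - 2 * X[, 2] + rnorm(200)

    cvfit <- cv.glmnet(X, y)   # 10-fold cross-validation over the lambda path
    plot(cvfit)                # CV error as a function of the penalty
    preds <- predict(cvfit, newx = X, s = "lambda.1se")  # predict at the CV-chosen penalty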
There are many concepts called stepwise regression; it's strange that statistics, as a field, is so bad at delineating concepts.
I teach my students what you see in most social science papers, and in light of the article at hand, I would call it "stepwise presentation of multivariate regression".
When it comes to the task of explaining, I think presenting different models, with a discussion of which one you pick and why, provides good value to readers.
That said, I agree with the sentiment of the article but not the wording. "Blind" or manipulative stepwise deletion will decrease the falsifiability of your work. That should be more provocative to scientists than "evil".
I came to stats through the social sciences, and I get the feeling that's where the author's beef is coming from.
If you're doing say, econometrics for publication and you don't have a solid theoretical basis for all your variables, you're just flapping about.
There are, however, plenty of other use cases where this sort of approach may be valid. The author mentions prediction, but regression summaries are really helpful tools in a variety of domains.
I also came to stats through the social sciences (and now have a master's in stats). I was actually taught about stepwise regression in a stats course before I encountered it in social science.
The problems with stepwise are numerous. In my view, many people confuse model fit with variable selection. All too often the analyst will use an automated approach to select the variables that give the best model fit, without consideration for their meaning. It isn't uncommon to see missing or imbalanced categories. Almost none of these analyses apply any corrections for multiple comparisons.
I encourage lasso over stepwise for variable selection and then fitting an OLS model if precise coefficient estimates are needed.
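A sketch of that two-stage recipe, assuming glmnet for the selection step; note that the refitted p-values still inherit the post-selection caveats discussed above.

    ## Two-stage sketch: lasso (via cv.glmnet, my assumption) to select,
    ## then an unpenalized OLS refit on the surviving variables.
    library(glmnet)

    set.seed(5)
    n <- 150
    X <- matrix(rnorm(n * 15), n, 15)
    colnames(X) <- paste0("x", 1:15)
    y <- X[, "x1"] - 2 * X[, "x2"] + rnorm(n)

    cvfit <- cv.glmnet(X, y)
    cf    <- as.matrix(coef(cvfit, s = "lambda.1se"))
    keep  <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")

    ## Refit OLS on the selected variables (assumes at least one was selected).
    refit <- lm(reformulate(keep, "y"), data = data.frame(X, y = y))
    summary(refit)  # unpenalized estimates; selection caveats still apply to inference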
Bold headline, and then the stepping back further down the article:
> Some of the problems don’t matter as much if your goal for the model is just prediction, not interpretation of the model and its coefficients.
"It depends" strikes again.
> As my professor once told our class:
> "If you choose the variables in your model based on the data and then run tests on them, you are Evil; and you will go to Hell."
Again, it depends. If one wants to be pedantic, then choosing to omit any variable in the universe in a non-random fashion is based on some data. Choosing variables via "subject matter expertise" is the same thing but "smarter."
> To explore this I wrote a function (code a little way further down the blog) to simulate data with 15 X correlated variables and 1 y variable.
> mod <- lm(y ~ ., data = d)
If one wants to be pedantic, then why are we fitting a model that assumes independent variables to correlated data?
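For readers who want to poke at this themselves, here is a sketch of the kind of simulation the blog describes (the blog's actual function differs; this one assumes multivariate normal predictors with constant pairwise correlation via MASS::mvrnorm). Note that OLS does not strictly assume uncorrelated predictors, only the absence of perfect collinearity; correlation inflates the variance of the coefficient estimates.

    ## Sketch of the kind of simulation the blog describes (its actual function
    ## differs): 15 correlated predictors via MASS::mvrnorm, then the lm() fit.
    library(MASS)

    set.seed(6)
    n <- 100; p <- 15
    Sigma <- matrix(0.5, p, p); diag(Sigma) <- 1  # constant pairwise correlation 0.5
    X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
    d <- data.frame(X)
    d$y <- drop(X %*% c(1, -1, rep(0, p - 2))) + rnorm(n)

    ## OLS doesn't forbid correlated predictors (only perfect collinearity),
    ## but correlation inflates the variance of the coefficient estimates.
    mod <- lm(y ~ ., data = d)
    summary(mod)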
The important thing is that the selection isn't based on the data you are attempting to analyse. It's fine to use subject matter expertise beforehand to decide what is appropriate to include in your analysis.
The article seems to analyze a statistical practice from a theoretical perspective.
Using the same perspective, another way to formulate this discussion is:
1. Look at all the data in the universe.
2. Choose some to examine (using a non-random procedure).
3. From those, employ a variable selection procedure (the article argues against stepwise selection and somewhat for Lasso).
4. Fit a model to the remaining data.
In reality, there are at least two variable selections occurring. In the first (choosing which data to examine from the universe of data), we are choosing those variables based on some procedure that is ultimately grounded in data.
This is a catch-22: unless you look at all the data that exists, you choose some subset based on all the data that exists.