
The Crusade Against Multiple Regression Analysis - gpresot
http://edge.org/conversation/richard_nisbett-the-crusade-against-multiple-regression-analysis
======
p4wnc6
One of the must-read papers in this area is a very short, very readable piece
by Chris Achen called "Let's Put Garbage-Can Regressions and Garbage-Can
Probits Where They Belong." [0]

The basic idea is that if your data picks up non-linear coding errors from
some external process before it reaches you, then relative to the true causal
relationship between your variables, you're actually fitting a (possibly)
non-linear transformation of that relationship.

If you do this and still use the classical cook-book t-stat / p-value stuff,
you can run into big problems. As the paper shows, even for an extremely
simple data set where there is merely a non-linear coding error, the
coefficient can be very statistically significant (t-stat magnitude > 2) and
yet _it will have the wrong sign._

This is a really remarkable thing. By just jittering your data a bit, you can
obtain a result that passes the usual, simplistic significance tests but for
which the effect size _is negative when it should be positive_ (or vice
versa).
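
To make that concrete, here's a minimal simulation in the spirit of Achen's
point (not his exact construction; the data-generating process, numbers, and
variable names below are all made up for illustration). The true effect of z1
on y is +1, but because the correlated covariate z2 arrives non-linearly
recoded, the fitted coefficient on z1 comes out negative and "significant":

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 5_000

    # True process: y depends POSITIVELY on both z1 and z2 (+1 and +3),
    # and the two covariates are negatively correlated with each other.
    z2 = rng.standard_normal(n)
    z1 = -z2 + 0.5 * rng.standard_normal(n)
    y = 1.0 * z1 + 3.0 * z2 + rng.standard_normal(n)

    # The analyst never sees z2 itself; an upstream step delivers a
    # monotone but non-linear recoding (vendor transform, cleaning bug).
    x2_coded = np.exp(z2)

    X = sm.add_constant(np.column_stack([z1, x2_coded]))
    fit = sm.OLS(y, X).fit()
    print(f"coef on z1: {fit.params[1]:+.3f}  t-stat: {fit.tvalues[1]:+.1f}")
    # The coefficient on z1 comes out around -0.9 with |t| far above 2,
    # even though the true coefficient is +1.

The mechanism is ordinary omitted-variable bias: the linear term in x2_coded
only partially controls for z2, and the leftover variation loads onto the
correlated z1 with the opposite sign.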

I remember when I worked in quant finance, where we used a terrible,
shamefully bad automated framework for frequentist model fitting; the sense of
dread was huge. If the data we used had been jittered by the data vendor, or
if there was a slight bug in the code that did variable cleaning, outlier
manipulation, or scoring, or any of a hundred other steps where tiny non-
linearities could be introduced, then it would be entirely possible for us to
see "significant" results that pointed in the opposite direction from the
truth.

There are many other papers which point out the pitfalls of naive dump-it-all-
on-the-right-hand-side regression models, but I think this one from Achen is
unique in that it is extremely short, extremely simple to follow, and yet it
is fully devastating to this entire technique of modeling.

[0]
[http://www.columbia.edu/~gjw10/achen04.pdf](http://www.columbia.edu/~gjw10/achen04.pdf)

~~~
jhbadger
There certainly are poorly done regression analyses out there, and yes, things
like coding errors and correlated variables can screw things up. But the point
of regression (at least in the natural/medical sciences) isn't to give a
definitive answer but to suggest actual experiments to be done. I don't think
it's a coincidence that both your link and the original link are from social
scientists. There it is much harder (and often impossible) to conduct
follow-up experimental studies based on the regression, and so the regression
often ends up as the end result.

~~~
p4wnc6
I don't think very many practitioners share your view that regressions are
meant to be simplistic first steps on the path to more sophisticated analyses
or experiments. If that were true, then I would definitely agree with you that
we can forgive most of the shortcomings of frequentist regressions.

In practice and in academia, though, I've only ever encountered researchers
who see regressions as the main result, and who intend to use regression
models exclusively even before they have any sense of what the data looks
like or what the challenges will be.

In a lot of cases, they even go out of their way to build frameworks that
hard-code the assumption that whatever model is used must be representable as
a set of coefficients and their standard errors. It's painful to see it
happen: people have been using regression models for so long (due to the easy
manipulation of what the results "mean") that they can't even imagine any
other way of doing things.
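
As a caricature, the hypothetical reporting layer below (all names invented
for illustration) is the entire output contract in systems like that;
anything that can't be flattened into coefficient / standard-error rows
simply can't flow through it:

    from dataclasses import dataclass

    @dataclass
    class CoefficientRow:
        """The only result shape the hypothetical legacy pipeline accepts."""
        name: str
        coefficient: float
        std_error: float

    def publish_model(rows: list[CoefficientRow]) -> None:
        # Downstream reporting, monitoring, and sign-off all consume this
        # format, so tree ensembles, Gaussian processes, or hierarchical
        # models can't get in without being squeezed into misleading
        # pseudo-coefficients.
        for r in rows:
            print(f"{r.name:>12}  {r.coefficient:+8.4f}  (se {r.std_error:.4f})")

    publish_model([CoefficientRow("vitamin_e", -0.8770, 0.0421)])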

Even in cases where someone might want to earnestly prioritize a move to a
more accurate or more flexible modeling framework, they often can't, because
decades of legacy code have been tailored to work solely with coefficient-based
regression models. The software development cost of switching would be so high
that they would rather find excuses to hide the model's inaccuracy and "sell"
the results than bite the bullet and invest in systems where the results are
actually statistically rigorous.

------
gpresot
One interesting point about regression analysis, though, is that its output
is perfectly suited to exploitation by marketing & advertising. Explaining all
the caveats of an experiment is hard, but saying that "people who eat X have a
lower probability of getting cancer" is very powerful. So, in a way, I suspect
these types of studies (regressions) are "pulled" by marketing rather than
pushed by research (at least for corporate-sponsored studies).

~~~
p4wnc6
The trouble is that researchers and universities also act like marketers, and
even different factions within a single company will act like marketers,
self-promoting their work in order to compete for a better bonus or promotion
or something.

Since regression-type analyses are so amenable to this, and it's so easy to
data mine for arbitrary thresholds of success, like "significance" levels, the
problem tends to spiral out of control.
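
As a toy illustration of how cheap "significance" is to mine, regress pure
noise on a pile of unrelated noise features (everything here is simulated;
there is no real relationship anywhere):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n, k = 200, 50
    X = rng.standard_normal((n, k))   # 50 junk features
    y = rng.standard_normal(n)        # y is unrelated to every column of X

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    hits = int((fit.pvalues[1:] < 0.05).sum())
    print(f"{hits} of {k} junk features pass p < 0.05")
    # Typically two or three "pass" at the 5% level, which is exactly
    # what you'd expect from chance alone.

Run that in a loop over specifications and keep the winners, and you can
"discover" whatever the marketing deck needs.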

I wish it was just confined to advertisers and SEO stuff. But the ugly truth
is that if you peel back the layers in many organizations that claim to do
"quantitative" or "data-driven" business, it's just more of this same poor
statistical hygiene.

------
thyrsus
Does factor analysis/principal component analysis help any? The article begins
by talking about how looking at a single variable (vitamin E) that is strongly
correlated with a bunch of other variables (healthy lifestyle) points one in
the wrong direction. If you attempt to capture "all" (I know - not a priori
possible) the relevant variables, are you going to do better?

~~~
p4wnc6
Factor analysis can sometimes help by removing human biases at the feature /
covariate creation stage. If you can devise a statistical process that
reliably identifies which features are important, most would agree that is
better than relying on human intuition, or on first principles from some
domain science, to simply assert what they are.

But once you've constructed the features via that statistical process, if you
plan to just dump them onto the RHS of a regression formulation, then it's no
better. The same kinds of non-linearities can creep in whether you manually
computed a human-level description of a factor or let a statistical
decomposition pop the factor out automatically.
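
A quick sketch of why, reusing the made-up data-generating process from my
earlier comment: PCA is a linear decomposition, so factors built from a
non-linearly mis-coded feature are themselves mis-coded, and the wrong sign
survives intact:

    import numpy as np
    import statsmodels.api as sm
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n = 5_000
    z2 = rng.standard_normal(n)
    z1 = -z2 + 0.5 * rng.standard_normal(n)
    y = 1.0 * z1 + 3.0 * z2 + rng.standard_normal(n)
    feats = np.column_stack([z1, np.exp(z2)])  # 2nd feature mis-coded

    # Swap the raw features for their principal components, then regress.
    pca = PCA(n_components=2).fit(feats)
    fit = sm.OLS(y, sm.add_constant(pca.transform(feats))).fit()

    # Map the factor coefficients back to the original features: the
    # implied coefficient on z1 is still around -0.9 (true value: +1).
    implied = pca.components_.T @ fit.params[1:]
    print(f"implied coef on z1: {implied[0]:+.3f}")

Using both components just reparameterizes the same regression; keeping only
one component doesn't help either, since it only changes which linear
combination of the mis-coded inputs you condition on.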

The main point is that you cannot use most of the success metrics from a
frequentist regression for the model selection and policy selection outcomes
that most people want to use them for. Theoretically, it is just unsound. You
have to dig deeper into the data, look at conditional views of your data, and
try to use model fitting procedures that are provably robust to these kinds of
artifacts, multicollinearity, and so forth.
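
Here's a rough sketch of what a conditional view buys you on that same
made-up data: within narrow slices of the mis-coded covariate, the recoding
is nearly constant, and the sign of the z1 effect comes back correct:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 50_000
    z2 = rng.standard_normal(n)
    z1 = -z2 + 0.5 * rng.standard_normal(n)
    y = 1.0 * z1 + 3.0 * z2 + rng.standard_normal(n)

    df = pd.DataFrame({"y": y, "z1": z1, "x2_coded": np.exp(z2)})
    df["slice"] = pd.qcut(df["x2_coded"], 25, labels=False)

    # Slope of y on z1 within each slice of the mis-coded covariate.
    slopes = df.groupby("slice").apply(
        lambda g: np.polyfit(g["z1"], g["y"], 1)[0]
    )
    print(slopes.median())  # roughly +0.9, versus the pooled fit's -0.9

That's not a cure, just an example of the kind of digging that exposes the
problem before the headline coefficient gets shipped.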

I think the biggest problem is that non-statistician practitioners who work
under tight deadlines in a domain science tend to just want some kind of
cookbook system that "just works 100% of the time." But there simply is no
such thing. There is no cookbook procedure for taking a bunch of variables,
dumping them into some fitting procedure, and automatically getting
comparable, consistent results that report back real relationships; that
belief is exactly the kind of "pseudotheorem" the Achen paper dismantles.

We need practitioners in fields like quant finance, quant marketing, chemical
informatics, and so on to stop _mis-wanting_ a cookbook, and to instead accept
that creating robust and trustworthy research results simply has to be done
slowly and with great care. It's a corner that you just cannot cut, and you
have to stop wanting to cut it.

