
Statistically controlling for confounding constructs is hard (2016) - Veedrac
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152719
======
Veedrac
This paper discusses a major issue that I think deserves way more attention
from the academically literate crowd. This is the kind of knowledge that
should be taught in secondary school, at least in the modern age with papers
at everyone's fingertips.

A more approachable summary is in narrative form at

[https://www.talyarkoni.org/blog/2016/06/11/the-great-minds-journal-club-discusses-westfall-yarkoni-2016/](https://www.talyarkoni.org/blog/2016/06/11/the-great-minds-journal-club-discusses-westfall-yarkoni-2016/)

> “But there’s a problem: statistical control–at least the way people
> typically do it–is a _measurement-level_ technique. Meaning, when you
> control for the rate of alcohol use in a regression of cancer on bacon,
> you’re not really controlling for alcohol use. What you’re actually
> controlling for is just one particular _operationalization_ of alcohol
> use–which probably doesn’t cover the entire construct, and is also usually
> measured with some error.”

I strongly suggest reading the comment on that page from one of the authors as
well.

------
zwaps
Causal inference has been a huge issue in statistics and its subject fields
for a while; see DAGs, instruments, regression discontinuity, RCTs, natural
experiments...

For the main figures in the field, there is no dispute that causal inference
is important and that traditional research design is problematic.

The issue is with all these big-ego guys. For example, Pearl pretty much
states that DAGs are the only way, his conception is perfect, and everyone
who disagrees doesn't know statistics (even though DAGs cannot capture all
models of causal inference).

Then Angrist and Pischke come along and say everyone using parametrized designs,
including Pearl, is not believable and it's RCT logic or bust.

And Gelman fights with everyone anyway.

Much of this is less academic dispute than internet blogfighting, sadly.

The issue is that in practice, we need to do something with imperfect
observational data. It's really easy to critique any study on identification
and endogeneity. It's difficult to solve the issue!

~~~
tomrod
I don't think it is sad, I think it is wonderful! This is philosophy of
science in action. We don't know the right answer yet, it's not obvious, and
so we should absolutely argue it out, if only to challenge the status quo.

------
a_bonobo
There was an amazing pair of papers recently on polygenic risk scores (now
trendy in human genetics in place of GWAS, since PRS can include tiny effects
from genetic variants that GWAS ignores).

Editorial on both papers here:
[https://elifesciences.org/articles/45380](https://elifesciences.org/articles/45380)

In summary, when you run a GWAS, you control for population structure by using
the first two or three principal components as covariates. In PRS, the model
built is so complex that controlling with PCs doesn't remove all influence of
population structure -- which is why some famous large-scale human genetics
studies now fail to replicate.
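
As a rough illustration of what "using the first few principal components as
covariates" looks like in practice, here is a minimal sketch with entirely
made-up data (the genotype matrix, phenotype, and the choice of three PCs are
all hypothetical, not taken from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_snps = 500, 200

# Hypothetical genotype matrix (0/1/2 minor-allele counts) and phenotype.
G = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)
y = rng.normal(size=n_people)

# The top principal components of the centered genotype matrix act as
# proxies for population structure.
Gc = G - G.mean(axis=0)
U, S, Vt = np.linalg.svd(Gc, full_matrices=False)
pcs = U[:, :3] * S[:3]                      # first three PCs

# Per-SNP regression: phenotype ~ intercept + SNP + PC1..PC3.
intercept = np.ones((n_people, 1))
for j in range(5):                          # first few SNPs, for illustration
    X = np.column_stack([intercept, G[:, j:j+1], pcs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"SNP {j}: effect estimate after PC adjustment = {beta[1]:+.4f}")
```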

(This has large implications on a societal level -- Twitter is full of 'race
realists', i.e., racists, who are using spurious PRS to prop up their foregone
conclusions.)

~~~
gwern
It's not remotely that bad... You're leaving out important context here: what
failed to replicate was not the GWASes, but inferred selection signals from
the GWAS (which is very different); actually, two thirds of them replicated anyway
(so they're doing much better than, say, social psychology or medicine); _and_
the third one, which didn't replicate, failed because the check that would
have caught it, the sibling comparison (which was done), turned out to
have incorrectly validated it due to a Plink software bug (which is the kind
of error that could happen to literally any research result these days). The
situation remains precisely as it was before: GWAS results are generally
trustworthy, especially when validated by sibling comparisons, and human
selection is pervasive.

~~~
chrchang523
Can you point me to a reference re: what happened for the third study? If this
was caused by an actual Plink bug instead of incorrect usage, I need to verify
that the bug has been fixed...

------
graycat
From 10,000 feet up, a two step solution:

(1) Get a lot of data.

(2) Do finely grained cross tabulation.

Why? Because cross tabulation is a discrete version of the most powerful
foundation, but for challenging questions it can need a lot of data. Curve
fitting is what we are pushed into when we don't have that much data.

Or suppose we build a model of the probability of an auto accident. Okay, we
want to evaluate the model for someone who is 5' 2", 105 pounds, 17, blond,
speaks only Swedish and Russian, and is in the US in LA driving an 18-wheel
truck for the first time while talking on her cell phone with her sister in
Berlin.

So, for that query, instead of a model, we just have a lot of data and cross
tabulate, and the cell containing that person delivers the answer immediately,
directly. Moreover the answer is unbiased and minimum variance ( _least
squares_ ). But did I mention we need a lot of data?
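
A minimal sketch of the cross-tabulation idea, assuming made-up driver records
and purely illustrative categories (none of this comes from the comment above
beyond the general recipe):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1_000_000            # cross tabulation only works with a *lot* of data

# Made-up driver records; the columns and categories are purely illustrative.
df = pd.DataFrame({
    "age_band":     rng.choice(["16-19", "20-29", "30-59", "60+"], size=n),
    "vehicle":      rng.choice(["car", "motorcycle", "18-wheeler"], size=n),
    "phone_in_use": rng.choice([True, False], size=n),
    "had_accident": rng.random(n) < 0.05,
})

# Finely grained cross tabulation: the accident rate in each cell is the
# direct, model-free estimate for anyone who falls in that cell.
table = (df.groupby(["age_band", "vehicle", "phone_in_use"])["had_accident"]
           .agg(["mean", "size"]))

# Query a single cell, e.g. a teenage driver in an 18-wheeler on the phone.
print(table.loc[("16-19", "18-wheeler", True)])
```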

~~~
tomrod
This is fundamentally why we have deep neural networks. Fitting a simplified
curve in a huge space.

The combination you named may never have been observed before. The curse of
dimensionality strikes.

~~~
graycat
Yup. Did I mention that with cross tabulation we'd need a lot of data!!!!
:-)!!!

~~~
AnthonyMouse
But isn't that the whole problem? You quickly run into the case where the
required sample size is larger than the entire population size.

~~~
graycat
So, you expect us to boil it all down to just F = ma, and f'get about the
apple, the apple tree, the time of day, the phase of the moon, the temperature
of the day, what Newton was wearing?????

That was a really good shot at causality. Let me just say, quite generally,
_causality_ is super tough to find.

------
TadaScientist
How were those embedded charts built? They look like base R but I really like
the download option. Is the library under GMT?

------
masnick
Anyone interested in this article should also take a look at some of Pearl’s
work on causal inference. Here is one article:
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2836213/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2836213/)

------
plainOldText
And a somewhat related paper: _How Much Data Do You Need? A Pre-asymptotic
Metric for Fat-tailedness_ [1]

[1]
[https://arxiv.org/pdf/1802.05495.pdf](https://arxiv.org/pdf/1802.05495.pdf)

------
nanis
Of course it is. This is only news to people who think being able to use a
couple of Python libraries makes them a "scientist" of some sort.

See also instrumental variables[1]. Oh, sorry, I forgot, who needs all of the
thought that went into the development of econometrics as a discipline? We now
have data mining as a career.

[1]:
[https://www.nuffield.ox.ac.uk/teaching/economics/bond/instrumental%20variables1.pdf](https://www.nuffield.ox.ac.uk/teaching/economics/bond/instrumental%20variables1.pdf)

~~~
soVeryTired
Translating to the language of econometrics, I think the point of this paper
is that oftentimes, the instrument that you're using has errors in its
measurement, or is just a proxy for the 'real' instrument that you'd like to
use.

If that's the case, when IV is interpreted as a two-stage regression, you
might _still_ have covariates that are correlated with the residuals in the
second stage.
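
For readers who haven't seen IV written out, here is a minimal sketch of the
"two-stage regression" interpretation with a simulated data set (the data
generating process and the true effect of 2.0 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Made-up data generating process: x is endogenous because it shares the
# unobserved confounder u with y; z shifts x but affects y only through x.
u = rng.normal(size=n)
z = rng.normal(size=n)
x = z + u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)    # true causal effect of x is 2.0

# Naive OLS of y on x is biased upward by the confounder.
X = np.column_stack([np.ones(n), x])
print("OLS estimate: ", np.linalg.lstsq(X, y, rcond=None)[0][1])

# IV as two stages: (1) regress x on z, (2) regress y on the fitted x.
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X2 = np.column_stack([np.ones(n), x_hat])
print("2SLS estimate:", np.linalg.lstsq(X2, y, rcond=None)[0][1])
```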

------
graycat
So, the article is about regression analysis. When I first studied the topic,
I saw the interest in assigning _importance_ to variables one at a time based
on the sizes of their regression coefficients.

Well, there is some math supporting regression analysis if we want to use it,
i.e., confirm the assumptions -- nearly never doable.

So, I did see that if the independent variables were all orthogonal, this
one-at-a-time work had some support. E.g., in linear algebra, look at Bessel's
inequality.

Otherwise, easily can get confused: So, we're trying to predict Y from U and
V. U doesn't do at all well; neither does V, but together U and V predict Y
essentially perfectly. How, why? U spans a vector space, and Y is not in that
space. We can project Y onto that vector space and have the projection small
(coefficient of U small). Same for V. But U and V together span a vector
space, one that includes Y -- so U and V predict Y perfectly. Just simple
vector space geometry. It can happen.
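
A quick numerical illustration of this geometry, with made-up vectors
constructed so that U and V are nearly collinear and Y points exactly along
their small difference:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# U and V are nearly collinear; Y points along their small difference.
z = rng.normal(size=n)
w = rng.normal(size=n)
U = z + 0.01 * w
V = z - 0.01 * w
Y = w                        # Y = (U - V) / 0.02, so span{U, V} contains Y

def r_squared(X, y):
    """R^2 of the least-squares fit of y on the columns of X plus an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print("R^2 using U alone:", r_squared(U[:, None], Y))                 # ~ 0
print("R^2 using V alone:", r_squared(V[:, None], Y))                 # ~ 0
print("R^2 using U and V:", r_squared(np.column_stack([U, V]), Y))    # ~ 1
```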

This stuff about _controlling_ can make some good sense in cross-tabulation
(see my post here on that), but in regression analysis? _Controlling_? Where
do they get the really strong funny stuff they've been smoking to believe in
that?

------
repolfx
Mindblowing. No wonder so much science doesn't replicate.

I read scientific papers fairly often compared to the average person, and this
is probably the most important one I've read in years. It feels like a similar
problem to Spectre in the CPU design space: a problem nobody noticed for
decades, that breaks fundamental assumptions in ways that appear nearly
impossible to fix. Except this is way bigger and more important than Spectre.

Here's a quick summary of the paper and its conclusions:

1. Virtually all claims of the form "We have found X is an important factor
when controlling for other factors" are wrong, because they ignore the
possibility of measurement error, which has unintuitive and enormous effects.

2. As a consequence, vast swathes of so-called findings in medical and social
sciences are spurious to an even greater extent than previously realised. The
paper cites Google Scholar searches to show that hundreds of thousands of
papers contain trigger phrases like "after controlling for".

3. The problem is fundamental and can't be fixed.

For (3) I should explain, because the paper does propose what initially looks
like a fix: an improved form of statistical modelling that explicitly
incorporates measurement error (structural equation modelling, SEM). This can
restore statistical validity.

Unfortunately, as we study this idea more closely, it becomes clear that there's
no way scientists are going to adopt it:

• It explodes the sample sizes and therefore cost.

• Frequently we have no idea what the measurement error actually is.

• Therefore it becomes impossible to say with any confidence whether a claim
is true or false.

Even very small amounts of measurement error can push the confidence level of
the outcome under the already arbitrary and probably too lax 95% threshold. To
detect a weak effect size of 0.1, handling realistic error rates could easily
take you from needing hundreds of samples (grad student populations) to tens
of thousands.

To clarify the term "measurement error" here, what the paper gets at is that
in social sciences they are often trying to measure things indirectly via e.g.
surveys which have some inherent noisiness due to people lying, not
understanding the question, not being sure of the answer but being unwilling
to admit it etc. And then on top of that they are often trying to measure
something ill-defined to begin with, e.g. mood, intelligence, wealth.

But as noise rates go up, the chance of spuriously detecting correlations
that don't exist reaches _100%_. The paper does various simulations to show
that this can easily happen with sample sizes, p values and error rates that
are standard.

The part that really nailed it for me was where they use a hypothetical
example of a correlation between ice cream sales and swimming pool deaths.
Common sense tells us ice cream doesn't cause people to die in pools, but
rather, on hot days more people go swimming and buy ice cream: it's a
confounding variable. In the case where the confounding variable is accurately
measured with a thermometer, measurement errors are zero and the typical sort
of regression analysis scientists use shows no correlation as expected. But if
temperature is measured using a survey where people self report perceived body
temperatures on the Likert scale, then you get an error rate of about 0.4. At
that point a regression analysis shows a strong direct correlation between
eating ice cream and dying in a pool with p < 0.001 - much lower than the p <
0.05 threshold needed to trigger publishing.
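
A rough simulation in the spirit of that example (the variable names, noise
levels, and effect sizes below are invented for illustration, not taken from
the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Hot days drive both ice cream sales and pool deaths; ice cream itself has
# no causal effect on deaths at all.
temperature = rng.normal(size=n)
ice_cream   = temperature + rng.normal(scale=0.5, size=n)
deaths      = temperature + rng.normal(scale=0.5, size=n)

def coef_controlling_for(control, treatment, outcome):
    """Coefficient on `treatment` when regressing `outcome` on treatment + control."""
    X = np.column_stack([np.ones_like(outcome), treatment, control])
    return np.linalg.lstsq(X, outcome, rcond=None)[0][1]

# Controlling for the accurately measured confounder: effect ~ 0, as expected.
print(coef_controlling_for(temperature, ice_cream, deaths))

# Controlling for a noisy, survey-style proxy of temperature: a sizeable
# "effect" of ice cream on deaths reappears, driven purely by measurement error.
noisy_temperature = temperature + rng.normal(scale=1.0, size=n)
print(coef_controlling_for(noisy_temperature, ice_cream, deaths))
```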

A big question is what does this mean for science and the academy? I can see
nothing happening in the short run: it seems this paper wasn't picked up by
journalists, or at least I've not seen any mention of it even in discussions
of the replication crisis. Scientists themselves can't really pay attention to
it without breaking the entire funding model of non-STEM sciences.

In the long run this could possibly lead to a major breakdown in society's
notions of expertise, trustworthiness of scientists and the value of
government funded academia (corporate funding of social sciences is close to
zero). How many decades have people been reading stories of the form "$foo
causes $bar" or "$baz (good thing) is highly correlated with $zap (bad
thing)"? Many people have already clocked that this kind of science is
unreliable - now we seem to know precisely why.

~~~
graycat
For measurements, the social scientists have been considering what they call
_reliability_ and _validity_ for a long time. So, they do consider the
accuracy of measurements.

In science we collect data and then use the data. If the data has some
measurement error, no big surprise. Instead, just say that the _model_ is in
terms of the measured values with their errors, instead of some much more
accurate values that we don't have.

Or, if we have more accurate values, then use them. Otherwise, don't feel
guilty and beat yourself up.

See also my earlier post in this thread on cross tabulation.

~~~
tomrod
Why isn't total least squares used to control for measurement error in
regressors?

~~~
graycat
By "total least squares" you mean to pick regression coefficients that
minimize the squared errors between the observed values of the dependent
variable and the predicted value from the regression? The predicted values are
from a perpendicular projection, and we do get the Pythagorean theorem with

total sum of squares = regression sum of squares + error sum of squares

So, we minimize the error sum of squares. Does that really _control_ on
measurement error? Not in any simple way I can see.

Or, if we have errors in the measurements of the independent variables, then
we are facing one of the facts of life, we don't have the error-free, _true_
values. Not good but usually not as bad as having your marriage fail or losing
your pet dog or cat! Like the video clip of a Heifetz master class where a
student tried to play the D flat minor scale and at the end Heifetz assured
the student that they were still alive!

Maybe what you are saying is that with enough mathematical assumptions, e.g.,
the famous _homoscedasticity_ with independent, identically distributed mean
zero Gaussian errors, as the number of observations goes to infinity the
errors wash out, much as in the weak/strong laws of large numbers, and we get to
f'get about the errors -- maybe there is such a theorem; I should get out my
copy of the old

C. Radhakrishna Rao, _Linear Statistical Inference and Its Applications:
Second Edition_ , ISBN 0-471-70823-2, John Wiley and Sons, New York.

and look or do some such derivations for myself.

But, to what end if we don't believe the mathematical assumptions for the
mathematical theorems, e.g., _homoscedasticity_ with independent,
identically distributed mean zero Gaussian errors?

Sorry, from 50,000 feet up, it seems to me that having _control_ variables in
regression is shaky stuff. And without some careful derivations, we should not
be surprised at the effects of various errors.

Also, the usual derivations of the math are in the context of just some one
regression _model_ where we make all those assumptions. Instead, given one
dependent variable and 10 independent variables, plus five more we believe are
_causes_, plus 10 more we want to use as _controls_ (the dependent variable
plus 25 more variables in all), last time I checked we were short on guidance
for picking among the 2^25 possible sets of independent variables and making
sense of the different, maybe wildly different, coefficients we get.

Here's a simple view: If we have 5 independent variables and they are all
orthogonal, then we can get the regression coefficients one at a time just
from 5 projections, covariances, inner products (all essentially the same
things except how we scale things), and those coefficients are the same
for any of the 2^5 regression analyses. That is, if we have orthogonal
independent variables U, V, W, X, and Y and dependent variable Z, then we can
get the coefficients one at a time and be done -- have all the regression
coefficients for all 2^5 regressions. Otherwise, without orthogonality, we face
some possibly tricky math derivations -- maybe they are in Rao's book, it's
thick enough -- and are asking a bit too much from regression analysis.
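
A quick numerical check of the orthogonal case, with made-up data (the columns
are made exactly orthonormal via a QR decomposition, which is a convenience
not mentioned above):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 200, 5

# Five exactly orthonormal independent variables, via a QR decomposition.
X, _ = np.linalg.qr(rng.normal(size=(n, k)))
Z = X @ rng.normal(size=k) + 0.1 * rng.normal(size=n)   # dependent variable

# Coefficients from the five single-variable regressions (pure projections,
# since the columns of X are orthonormal)...
one_at_a_time = np.array([X[:, j] @ Z for j in range(k)])

# ...match the coefficients from the full multiple regression exactly.
all_at_once, *_ = np.linalg.lstsq(X, Z, rcond=None)

print(np.allclose(one_at_a_time, all_at_once))           # True
```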

Others have seen this swamp, and a current idea from the _machine learning_
community, and going back at least to L. Breiman, is that we are not really
looking for coefficients, t-tests on the coefficients, F-ratio tests on the
regressions, confidence intervals on the predictions (for that might try some
resampling ideas), importance of coefficients, causes, control variables, etc.
but are just looking for a fit that can predict: To this end we put the data
into at least two buckets, fit to one bucket and test in another one. Our main
criterion is just that the _model_ predicts well for the data we have. That
is, all the data in all the buckets has all the same statistical assumptions,
whatever the heck they are, and we are just _fitting_ and then _testing_
(confirming) on _simple random samples_ of data all from some one big bucket.
Yes, we still run into the issue of _overfitting_: fitting well in the first
bucket but flopping terribly when testing on the second bucket. Okay, a bit
crude, uncouth, vulgar, primitive, ..., etc., but maybe useful in some cases --
apparently Breiman made it useful in some cases of medical data.
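
The "two buckets" recipe in its simplest form, sketched with made-up data and
a plain least-squares fit standing in for whatever model is being evaluated:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000

# Made-up data with many irrelevant regressors -- a recipe for overfitting.
X = rng.normal(size=(n, 50))
y = X[:, 0] + rng.normal(size=n)            # only the first column matters

# Two buckets: fit in one, confirm in the other.
X_fit, X_test = X[:500], X[500:]
y_fit, y_test = y[:500], y[500:]

beta, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)

def mse(A, b):
    return float(np.mean((b - A @ beta) ** 2))

print("error in the fitting bucket:", mse(X_fit, y_fit))
print("error in the testing bucket:", mse(X_test, y_test))   # the one that counts
```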

~~~
tomrod
I inspired a great soliloquy! Thanks for the thoughts.

My point: total least squares includes X in the error minimization, not just Y
and a linear combination of X. There is a good introductory discussion on the
wiki page -- essentially, in standard regression we typically assume no
measurement error in the independent variables.[0]
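
A minimal sketch of total least squares for a single regressor, using the
standard SVD construction (the data, noise levels, and true slope of 2.0 are
made up; equal noise variance in x and y is assumed, which is the case where
plain TLS is appropriate):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# Made-up data: the true relationship is y = 2*x, but *both* x and y are
# observed with (equal-variance) noise.
x_true = rng.normal(size=n)
x = x_true + rng.normal(scale=0.5, size=n)
y = 2.0 * x_true + rng.normal(scale=0.5, size=n)

xc, yc = x - x.mean(), y - y.mean()

# Ordinary least squares: attenuated toward zero by the noise in x.
print("OLS slope:", (xc @ yc) / (xc @ xc))

# Total least squares: the right singular vector for the smallest singular
# value of the augmented data matrix [x, y] gives the fitted line's direction.
_, _, Vt = np.linalg.svd(np.column_stack([xc, yc]), full_matrices=False)
v = Vt[-1]
print("TLS slope:", -v[0] / v[1])
```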

As much value as machine learning brings, there is a need for explaining as
much as there is for predicting![1]

Your point on whether there is "true control" seems to agree with Pearl's main
point of contention -- does the causal graph (which is testable) make sense
in a theoretical, experiential, or systemic sense?

[0]
[https://en.wikipedia.org/wiki/Total_least_squares#/media/File:Total_least_squares.svg](https://en.wikipedia.org/wiki/Total_least_squares#/media/File:Total_least_squares.svg)

[1]
[https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf](https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf)

~~~
graycat
Okay, "total least squares" as in your [0]!!!! WOW!!! Back when I knew nothing
of regression or curve fitting and was first considering the issue, the
question I asked was, if we are trying to _fit_ to the given data, why not
have the line as close as possible to each of the points on the scatter
diagram, just as in the quite good picture at [0]!! Gee, value of ignorance!!

Again I believe we are trying to make too much out of regression.

Or, maybe, if somehow we DO have causes, we really know they are the real
causes, and we have some data, good data, and the data likely satisfies the
usual assumptions as in the reference I gave to Rao, THEN, maybe, on a good
day, with luck, do the regression calculations, t-tests, the F-ratio, get the
confidence intervals on the coefficients and the predicted values, etc. and if
all that looks solid, then take it seriously.

Here, however, we KNOW the independent variables, all of them, KNOW that they
are the causes, don't need _controls_, etc., and are not fishing for the
variables; we are not trying to have statistics tell us about causes, ...,
then maybe okay.

But, sure, if there really are causes and if we really do have variables that
do well measuring those causes, then maybe in the regression the variables
that are candidates as causes will become fairly obvious.

~~~
tomrod
Regression is useful because it allows us to interpolate within observed
populations using relatively light assumptions. Extrapolation requires higher
order theories and structure. Agreed that it can be a logical mess when one
uses it bluntly, but like all tools it has its uses and misuses.

~~~
graycat
Ah, cruel, you are so cruel, how could you be so cruel; the OP was hoping for
something so much better than just some interpolation!!!

Cruel or not, at least in practice, you are on solid ground.

~~~
tomrod
Then you may be interested in pursuing structural models!

Thanks for the chat.

------
kingo55
Observational studies claiming causality are flawed compared to randomised
controlled experiments.

Hopefully in time research will gravitate toward RCTs.

Ronny Kohavi, a Distinguished Engineer at Microsoft, has a great presentation
on flawed observational studies here:

[https://www.exp-platform.com/Documents/2016-11BestRefutedCausalClaimsFromObservationalStudies.pdf](https://www.exp-platform.com/Documents/2016-11BestRefutedCausalClaimsFromObservationalStudies.pdf)

~~~
mlthoughts2018
But the problem is in many application domains, you cannot obtain holdout
sets. For example, if you measure the causal impact of ads for a client, they
usually will not agree to accept the cost of holdout inventory in order to
measure the effects. They’d rather ensure all the inventory is used, and let
the observational post-hoc attempt at causal inference have errors; they
knowingly prefer that trade-off.

There are actually a lot of areas like this, where you are just given a one-
sided observational data set and tasked with recovering causal effects with no
option to collect a holdout set. Often nobody consciously chose it that way
over an RCT, it’s just how it happened.

~~~
SubiculumCode
or simple covariates like a subject's age. If we could randomly assign people
to an age, then...we'd have conquered aging I guess.

------
harlanji
Confounding variables like destitute candidates in the hiring process with a
model trained by educated people with savings?

