
ML Beyond Curve Fitting: An Intro to Causal Inference and Do-Calculus - dil8
http://www.inference.vc/untitled/
======
smallnamespace
Something to note about this formulation is the explicit assumption that in
p(y|do(x)), the 'do' operation is supposed to be completely independent of
prior observed variables, i.e. the doers are 'unmoved movers' [1].

That fits the model where you randomly 'do' one thing or another (e.g. blinded
testing); however this is _not_ the same thing as p(y|do'(x)), where do' is
your empirical observation of when you yourself have set X=x in a more natural
context.

E.g. let's say you will always turn on the heat when it's cold outside. P(cold
outside | do(turn on heat)) = P(cold outside), because turning on the heat
does not affect the temperature outdoors.

However, P(cold outside | do'(turned on heat)) > P(cold outside), because
empirically, you actually only _choose_ to turn on the heat when it's cold
outdoors.

These two are also different from P(cold outside | heat was turned on) (since
_someone else_ might have access to the thermostat).
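
To make this concrete, here's a minimal Python sketch (a toy model of my own,
not from the article) of a world where you turn on the heat exactly when it's
cold. Conditioning filters the observed runs; do() overrides the mechanism
that normally sets the heat:

```python
import random

def simulate(do_heat=None, n=100_000):
    # Toy structural model: cold is exogenous; heat normally follows cold.
    # Passing do_heat overrides the heat-setting mechanism (an intervention).
    samples = []
    for _ in range(n):
        cold = random.random() < 0.3                 # P(cold outside) = 0.3
        heat = cold if do_heat is None else do_heat  # heat goes on iff cold
        samples.append((cold, heat))
    return samples

obs = simulate()
# Conditioning: among observed runs where the heat is on, how often is it cold?
p_cold_given_heat = (sum(c for c, h in obs if h) /
                     sum(1 for _, h in obs if h))
# Intervening: force the heat on, independently of the weather.
forced = simulate(do_heat=True)
p_cold_do_heat = sum(c for c, _ in forced) / len(forced)

print(p_cold_given_heat)  # ~1.0: the heat is only ever on when it's cold
print(p_cold_do_heat)     # ~0.3: forcing the heat on tells you nothing
```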

In reality our choices and actions are also products of the initial states
(including our own beliefs, and our own knowledge of what would happen if we
did x). Our actions move the world, but we are also moved by the world.

Does do-calculus have a careful treatment of 'mixed' scenarios where actions
are both causes _and_ effects of other causes?

[1]
[https://en.wikipedia.org/wiki/Unmoved_mover](https://en.wikipedia.org/wiki/Unmoved_mover)

~~~
Darmani
You can write P(cold outside | do(turned on heat=X)), where X is another
random variable. So P(cold outside | do(turned on heat=X=1)) will be equal to
P(cold outside | do(turned on heat=1))P(X=1 | cold outside) = P(cold
outside)P(X=1 | cold outside).

But, you might want to consider making "turned on heat" part of the system in
this case, and go back to using the classic conditioning operator instead of
the do operator.

This is covered in chapter 4 of Pearl's Causality.

------
Darmani
For those trying to understand the difference between action and observation,
here's a good example from a friend:

Every bug you fix in your code increases your chances of shipping on time, but
provides evidence that you won't.

------
phkahler
I really enjoyed the humility the author had in the introduction to this
piece. He paused and took a hard look at what seemed to be harsh or arrogant
criticism of his field and found insight.

~~~
mlthoughts2018
Can you cite any parts of the article that support your view on this? I’ve
read it a few times now and don’t see any. The author describes glossing past
do-calculus before but for practical reasons, and doesn’t mention anything
about “harsh or arrogant criticism” — and in fact doesn’t make reference to
_fair_ criticisms, like Rubin’s & Gelman’s.

~~~
phkahler
>> Can you cite any parts of the article that support your view on this?

How about this: "In the interview, Pearl dismisses most of what we do in ML as
curve fitting. While I believe that's an overstatement (conveniently ignores
RL for example), it's a nice reminder that most productive debates are often
triggered by controversial or outright arrogant comments. Calling machine
learning alchemy was a great recent example."

When a person is dismissive of an entire field and claims to have a better
way, that often comes off as arrogant (even if it is true). My interpretation
is "harsh" while the author uses the word "overstatement". You'll also see
"arrogant" in there and that last line calling it "alchemy" really has to be
interpreted with negative connotations. Perhaps I read more into it than was
written, but that was the impression I got.

~~~
mlthoughts2018
Though the author mentions that one comment of Pearl's, all of the causal
inference / graphical model work takes the opposite stance.

The popular academic writing in that field claims _everyone else_ is being
arrogant. It's not a statement that Pearl is arrogant for dismissing huge
chunks of ML; rather that, since causal inference is such a cure-all,
_everyone else_ is arrogant for not dropping everything to use it everywhere.

There’s no spirit in this article of saying, “boy, it looked like a short-
sighted criticism of ML, but now that I look at it, the causal inference
people _are right_ after all, and ML people are wrong.”

It may try to disingenuously frame it that way, but this is not what they are
saying.

------
thadk
Here is a paper explaining the essentials of how 45+ years of Causal Inference
applies to ML:
[http://www.nber.org/chapters/c14009.pdf](http://www.nber.org/chapters/c14009.pdf)

In this podcast, the same author explains the potential of sharing lessons
between the two worlds, if you're not in the mood for an academic paper:
[http://www.econtalk.org/archives/2016/09/susan_athey_on.html](http://www.econtalk.org/archives/2016/09/susan_athey_on.html)

~~~
mlthoughts2018
It's _very_ important to note that the term 'causal inference' in this
research paper is not the same thing as Pearl's causal inference techniques,
and in fact the main two statistics and econometrics researchers cited in your
linked article are Imbens and Rubin, two of the biggest critics of Pearl's
methods.

The linked paper mostly goes into instrumental variables and mixed effects
modeling for how classical econometrics has dealt with trying to understand
the causality of intentionally varying a treatment. And, despite citing Rubin
heavily, the paper doesn't go much into the Bayesian methods for solving
similar problems (hierarchical models), even though they are a
state-of-the-art approach with modern computational MCMC techniques.

The last few sections do offer some interesting research citations for how
classical instrumental variables models have been combined with advances in
machine learning, with things like causal trees.

But just look at one of the takeaway points of the survey, in section 5:

> "4\. No fundamental changes to theory of identification of causal effects"

Overall, the link you've shared would be strongly in favor of ML-extended
classical econometrics and possibly Bayesian hierarchical models or latent
variable approaches, but almost surely would be _against_ the notion that do-
calculus could lead to a widespread or real-world set of applicable models.

------
gowld
How does someone _use_ do-calculus? It's a nice mathematization of Goodhart's
law,
[https://en.wikipedia.org/wiki/Goodhart%27s_law](https://en.wikipedia.org/wiki/Goodhart%27s_law)

but how would it help an algorithm make better predictions?

Sure, the reason a person turns on the heat affects our belief in the outside
weather (were they feeling cold, or were they just trolling?), but how do you
_know_ the reason a person turned on the heat, and couldn't you learn which
reasons are predictive by measuring correlations with other observables? If you
_know_ the reason directly ("I'm just playing with the dial because I'm 4
years old") that's a data point you could throw into your ML model _without_
explicitly knowing it's a _reason_.

~~~
sjg007
See [http://www.michaelnielsen.org/ddi/if-correlation-doesnt-imply-causation-then-what-does/](http://www.michaelnielsen.org/ddi/if-correlation-doesnt-imply-causation-then-what-does/)

And

[https://www.statisticssolutions.com/structural-equation-modeling/](https://www.statisticssolutions.com/structural-equation-modeling/)

------
mlthoughts2018
I am interested in a companion phenomenon to the recent interest in causal
models in machine learning: namely, the fact that at least in computer vision,
the idea is not new at all and has been important for many decades.

One of the original sources that took this approach is "The Ecological
Approach to Visual Perception" (1979) [0] by James Gibson, which discussed at
length the idea of "affordances", similar in some respects to topics in
reinforcement learning as well. Affordances represent the information about
outcomes you gain by varying your degrees of observational freedom: you learn
to generalize beyond occluded objects by moving your head a little to the left
or right and seeing how the visual input varies. This lets you get food, or
hide from a predator that's partially blocked by a tree, so over time
generalizing past occlusions becomes better and better. That is much more
interesting than a naive approach like data augmentation, where a labeled
data set is extended with synthetically occluded variations, as is often done
to improve rotational invariance.

Then this idea was extended with a lot of formality in the mid-to-late 00's by
Stefano Soatto in his papers on "Actionable Information" [1].

I wish more effort had been made by e.g. Pearl to look into this and unify
his approach with what had already been thought of. It turns me off a lot
when someone tries to create a "whole new paradigm" and it starts to feel
like they want to generate sexy marketing hype about it, rather than saying:
hey, this is an extension of, connection to, or alternative to an older idea
_already in machine learning_. Instead it comes off as saying, "Us over here
in causal inference world already know so much more about what to do ... so
now let's apply it to your domain where you never thought of this." Pearl has
a history of doing this, too, like with his previous debates with Gelman
about Bayesian models. It almost feels to me like he is shopping around for
some sexy application area where his one-upmanship approach will catch on, to
give him a chance at the hype gravy train or something.

[0]:
[https://en.wikipedia.org/wiki/James_J._Gibson#Major_works](https://en.wikipedia.org/wiki/James_J._Gibson#Major_works)

[1]:
[http://www.vision.cs.ucla.edu/papers/soatto09.pdf](http://www.vision.cs.ucla.edu/papers/soatto09.pdf)

~~~
joe_the_user
_I wish more effort had been made by e.g. Pearl to look into this and unify
his approach with what had already been thought of. It turns me off a lot
when someone tries to create a "whole new paradigm" and it starts to feel
like they want to generate sexy marketing hype about it, rather than saying:
hey, this is an extension of, connection to, or alternative to an older idea
already in machine learning..._

I think you wind up with a situation where none of the less-than-mainstream
conceptions of intelligence will have further parts added. Instead, each
becomes associated with a single individual's career. It's something of the
nature of academia, a situation that made sense when scientific models and
approaches were "small" enough to be fully encompassed by an individual.

But then you have the problem that models aren't naturally modular. Whether
model X extends model Y is something of a judgment call. What makes one model
like or unlike another is a matter of both the structure of the model and the
reasoning behind it.

Moreover, consider that ten programmers creating one computer program tend to
be proportionately less productive than one programmer creating a program
(i.e., they work much less than 10x as fast, as a rule). Ten theorists putting
together one single theory may face a similar or greater problem of
diminishing returns and coordination.

~~~
philipov
The development of Quantum Field Theory is a good example where >10 people all
collaborated to come up with a framework that integrated the viewpoints of
multiple theorists with radically different approaches, rather than every new
contributor forking a personalized version of the previous theory.

Consider, for example, the way Freeman Dyson combined the graphical approach
of Feynman with Schwinger's more formal methods.

~~~
joe_the_user
_The development of Quantum Field Theory is a good example where >10 people
all collaborated to come up with a framework that integrated the viewpoints of
multiple theorists with radically different approaches, rather than every new
contributor forking a personalized version of the previous theory._

Sure, and I hope I was clear that I don't think ten theorists (or ten
programmers) collaborating is impossible. I would simply say that
collaborating has an extra cost to it, and in a competitive academic world,
any cost needs some degree of payoff. This makes extending a mainstream
theory advantageous, but not so much for less-known theories.

And quantum field theory had the advantage that the experiments for
demonstrating its truth or falsehood were relatively straightforward. With
AI, the question of a theory's truth is more debatable.

------
carapace
Worth mentioning, perhaps, that Cybernetics originated from the study of
"circular loops of causality", systems where e.g. A causes B, B causes C, and
in turn C causes A, etc...

------
thanatropism
This is really sexy.

------
offpolicy
Nothing to see here. The do-calculus is just fancy notation for what
reinforcement learning is already doing: trying different possible actions
and trying to maximize reward. If you know the possible actions in advance,
this is basically minimizing the regret of wrong policy actions.

~~~
Darmani
First, RL and causal inference do fundamentally different things. RL is trying
to train a controller; causal inference gives you a theory so that you can
predict the results of a randomized controlled experiment without running one.

Second, consider this: Classic ML techniques will tell you that you should
never go to the doctor because it increases the probability that you have a
disease. Causal inference does not have this problem.
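
Here's a toy sketch of what I mean (made-up numbers; in this little world,
disease drives doctor visits and visits have no effect on disease):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
disease = rng.random(n) < 0.1
# Sick people visit the doctor far more often; visiting causes nothing.
visit = rng.random(n) < np.where(disease, 0.9, 0.2)

# A "classic ML" model trained on observational data:
clf = LogisticRegression().fit(visit.reshape(-1, 1), disease)
p_visit, p_no_visit = clf.predict_proba([[1], [0]])[:, 1]
print(p_visit, p_no_visit)  # ~0.33 vs ~0.01: "avoid doctors!"
# But in this model P(disease | do(visit)) = P(disease) ~ 0.10,
# because there is no causal arrow from visits into disease.
```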

How does RL dodge this?

~~~
Eridrus
Not an RL expert, but Model-Based RL is a thing, where you try to train a
model of how actions affect the world, and then use that model to
choose/influence your actions.

But I don't think it's true that we always need a model, or at least I don't
necessarily think we always need a human-understandable model.

Your doctor example is weird to me tbh. A non-causal ML approach would seek to
determine whether a patient has a disease based on some symptoms, and then
send them to a doctor based on those results, sidestepping the need for causal
models.

To rephrase it in a way that makes a bit more sense to me: let's assume we
want to know if a specific procedure would be good for a patient (basically
the same example). With a non-causal approach, we would want to predict
whether a patient would have a better outcome from doing the procedure than
from not doing it.

A natural way to solve this (to me) would be to build one model that estimates
the probability of various outcomes from the procedure, and one that estimates
the probability of various outcomes from not undergoing the procedure.

Or if you're working in the world of Neural Nets/Deep RL, have a model that
takes all the non-intervention data as input and outputs the expected outcomes
from the procedure and the expected outcomes from not doing the procedure, and
when you train it, you only supervise the outcomes that you had data for.
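
A minimal sketch of that two-model idea (sometimes called a T-learner in the
heterogeneous-treatment-effects literature; the data and the effect here are
made up):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Made-up observational data: features X, treatment flag, binary outcome.
rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))
treated = rng.random(n) < 0.5
# The procedure helps only patients with X[:, 0] > 0 in this toy world.
recovered = rng.random(n) < 0.5 + 0.1 * (treated & (X[:, 0] > 0))

# One outcome model per arm, each trained only on that arm's patients.
model_treat = GradientBoostingClassifier().fit(X[treated], recovered[treated])
model_control = GradientBoostingClassifier().fit(X[~treated], recovered[~treated])

# For a new patient, compare predicted outcomes under each arm.
x_new = rng.normal(size=(1, 5))
benefit = (model_treat.predict_proba(x_new)[0, 1]
           - model_control.predict_proba(x_new)[0, 1])
print(benefit)  # estimated gain from doing the procedure for this patient
# This is only a causal estimate if treatment is as-good-as-random given X,
# which is exactly the kind of assumption causal analysis forces you to state.
```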

This ignores the Bayesian/distributional shift issue, but I don't think the
do-calculus has a real answer to that either.

I would be interested in knowing if this ad-hoc modelling approach is any
different from the causal modelling that Pearl is arguing for, or if causal
modelling only becomes necessary when you have more complicated causal
relationships than a single intervention.

~~~
Darmani
> A non-causal ML approach would seek to determine whether a patient has a
> disease based on some symptoms, and then send them to a doctor based on
> those results, sidestepping the need for causal models.

I saw an ML presentation a few months ago on training a decision tree to do
the same thing as a neural net, so we could understand what the neural net
was doing.

They used this on a neural net trying to diagnose people with diabetes. It
showed that having any other diagnosis would increase its probability of
diagnosing them with diabetes. Why? Because it meant they're more likely to
have gone to the doctor to get diagnosed. (Along with detecting general health
indicators that weren't screened out.)

You can try to partition your data into intervention/non-intervention, or do
something else to try to stop your model from detecting spurious correlations.
Causal modeling makes this more formal: it tells you which things you should
include/exclude, gives you formulas for adjusting them out, and tells you how
much bias you introduce by failing to do so.
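
For example, the backdoor adjustment formula, P(y | do(x)) = sum_z P(y | x,
z) P(z), can be computed from purely observational data once you know which
confounders z to adjust for. A rough sketch with a made-up confounder (think
"general health"):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.random(n) < 0.3                    # confounder: poor health
x = rng.random(n) < np.where(z, 0.8, 0.2)  # poor health -> more visits
y = rng.random(n) < np.where(z, 0.5, 0.1)  # poor health -> more diagnoses

# Naive conditional estimate, biased upward by the confounder:
p_y_given_x = y[x].mean()

# Backdoor adjustment: P(y | do(x)) = sum_z P(y | x, z) * P(z)
p_y_do_x = sum(y[x & (z == zv)].mean() * (z == zv).mean()
               for zv in (True, False))

print(p_y_given_x)  # ~0.35: visits "predict" diagnoses
print(p_y_do_x)     # ~0.22: equals P(y); forcing visits changes nothing here
```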

The theory of causal inference is also immune to distributional shift, and
serves as nice guidance for what actual systems should do (usually: failing
to return an answer).

(Yes, I've fully drunk the Pearl Kool-Aid.)

~~~
Eridrus
Thanks for the example, it does motivate it a bit better when there are more
complicated (but still relatively simple) causal relationships.

