ML Beyond Curve Fitting: An Intro to Causal Inference and Do-Calculus (inference.vc)
184 points by dil8 on May 25, 2018 | 41 comments

Something to note about this formulation is the explicit assumption that in p(y|do(x)), the 'do' operation is supposed to be completely independent of prior observed variables, e.g. the doers are 'unmoved movers' [1].

That fits the model where you randomly 'do' one thing or another (e.g. blinded testing); however this is not the same thing as p(y|do'(x)), where do' is your empirical observation of when you yourself have set X=x in a more natural context.

E.g. let's say you will always turn on the heat when it's cold outside. P(cold outside | do(turn on heat)) = P(cold outside), because turning on the heat does not affect the temperature outdoors.

However, P(cold outside | do'(turned on heat)) > P(cold outside), because empirically, you actually only choose to turn on the heat when it's cold outdoors.

These two are also different from P(cold outside | heat was turned on) (since someone else might have access to the thermostat).
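The gap between do and do' in the heating example can be checked by exact enumeration. This is only a minimal sketch: the prior `P_COLD` and the `natural_policy` function are assumptions made for illustration, not part of the original example.

```python
from fractions import Fraction

P_COLD = Fraction(3, 10)  # assumed prior probability that it is cold outside


def natural_policy(cold):
    # Assumed behavior from the example: the heat goes on exactly when it is cold.
    return 1 if cold else 0


def joint(policy):
    """Return {(cold, heat): probability} when heat is chosen by `policy`."""
    dist = {}
    for cold, p in ((1, P_COLD), (0, 1 - P_COLD)):
        key = (cold, policy(cold))
        dist[key] = dist.get(key, Fraction(0)) + p
    return dist


def p_cold_given_heat_on(dist):
    """P(cold = 1 | heat = 1) under the given joint distribution."""
    heat_on = sum(p for (cold, heat), p in dist.items() if heat == 1)
    return dist.get((1, 1), Fraction(0)) / heat_on


# do'(heat on): the heat is on because the usual policy chose it.
observed = p_cold_given_heat_on(joint(natural_policy))
# do(heat on): the heat is forced on regardless of the weather.
intervened = p_cold_given_heat_on(joint(lambda cold: 1))

print(observed, intervened)  # prints "1 3/10"
```

Under the intervention the answer collapses to the prior, while under the natural policy a running heater is perfect evidence of cold weather.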

In reality our choices and actions are also products of the initial states (including our own beliefs, and our own knowledge of what would happen if we did x). Our actions move the world, but we are also moved by the world.

Does do-calculus have a careful treatment of 'mixed' scenarios where actions are both causes and effects of other causes?

[1] https://en.wikipedia.org/wiki/Unmoved_mover

You can write P(cold outside | do(turned on heat=X)), where X is another random variable. So P(cold outside | do(turned on heat=X=1)) will be equal to P(cold outside | do(turned on heat=1))P(X=1 | cold outside) = P(cold outside)P(X=1 | cold outside).

But, you might want to consider making "turned on heat" part of the system in this case, and go back to using the classic conditioning operator instead of the do operator.

This is covered in chapter 4 of Pearl's Causality.

For those trying to understand the difference between action and observation, here's a good example from a friend:

Every bug you fix in your code increases your chances of shipping on time, but provides evidence that you won't.

I really enjoyed the humility the author had in the introduction to this piece. He paused and took a hard look at what seemed to be harsh or arrogant criticism of his field and found insight.

Can you cite any parts of the article that support your view on this? I’ve read it a few times now and don’t see any. The author describes glossing past do-calculus before but for practical reasons, and doesn’t mention anything about “harsh or arrogant criticism” — and in fact doesn’t make reference to fair criticisms, like Rubin’s & Gelman’s.

>> Can you cite any parts of the article that support your view on this?

How about this: "In the interview, Pearl dismisses most of what we do in ML as curve fitting. While I believe that's an overstatement (conveniently ignores RL for example), it's a nice reminder that most productive debates are often triggered by controversial or outright arrogant comments. Calling machine learning alchemy was a great recent example."

When a person is dismissive of an entire field and claims to have a better way, that often comes off as arrogant (even if it is true). My interpretation is "harsh" while the author uses the word "overstatement". You'll also see "arrogant" in there and that last line calling it "alchemy" really has to be interpreted with negative connotations. Perhaps I read more into it than was written, but that was the impression I got.

Though the author mentions that one comment of Pearl's, all of the causal inference / graphical model work takes the opposite stance.

The popular academic writing in that field claims everyone else is being arrogant. It’s not a statement that Pearl is arrogant for dismissing huge chunks of ML, rather that since causal inference is such a cure-all, then everyone else is arrogant for not dropping everything to use it everywhere.

There’s no spirit in this article of saying, “boy it looked like a short-sighted criticism of ML, but now that I look at it, the causal inference people are right after all, and ML people are wrong.”

It may try to disingenuously frame it that way, but this is not what they are saying.

That was what I took away from basically the whole introduction. The first paragraph describes his reaction to the criticism as “harsh” and “arrogant” (author’s words), the second describes his change of heart, and the third describes himself as “embarrassed” at having previously dismissed do-calculus.

It is written in a way that suggests he still regards the criticism as harsh and arrogant, but not incorrect, if that makes sense.

It seems like you must have been reading a different article than me, or else are disingenuously describing what you read. When I do a control-f search for "harsh", it is not found anywhere in the article, so it certainly is not the author's own words.

The only part that mentions anything being "arrogant" is this quote:

> "it's a nice reminder that most productive debates are often triggered by controversial or outright arrogant comments"

which would actually be entirely counter to your point (the author is saying that 'arrogant' comments actually promote stimulating debate -- while I disagree with that too, it's clear the author did not at all say the criticism itself was arrogant, only that arrogant comments, many of which are Pearl's own comments, lead to debates).

When I read the introduction (which I have done now about 10 times), I see the author found practical reasons to dismiss do-calculus before (it was not pragmatic or applicable to real-world problems). Now coming back to it later, he seems to be academically more interested in it and willing to invest more time in the nuance (while still nothing in the article gives an indication of its larger-scale practical applicability). He does say he was 'embarrassed' to not look deeply into it before, but does not say this is because of how effective it is in real-world cases (which no one in this thread seems able to point to).

Here is a paper explaining the essentials of how 45+ years of Causal Inference applies to ML: http://www.nber.org/chapters/c14009.pdf

If you're not in the mood for an academic paper, this podcast with the same author explains the potential of sharing lessons from both worlds: http://www.econtalk.org/archives/2016/09/susan_athey_on.html

It's very important to note that the term 'causal inference' in this research paper is not the same thing as Pearl's causal inference techniques, and in fact the main two statistics and econometrics researchers cited in your linked article are Imbens and Rubin, two of the biggest critics of Pearl's methods.

The linked paper mostly goes into instrumental variables and mixed effects modeling for how classical econometrics has dealt with trying to understand the causality of intentionally varying a treatment. And, despite citing Rubin heavily, the paper doesn't go much into the Bayesian methods for solving similar problems (hierarchical models), even though they are a state of the art approach with modern computational MCMC techniques.

The last few sections do offer some interesting research citations for how classical instrumental effects models have been morphed with advances in machine learning, with things like causal trees.

But just look at one of the take away points of the survey, in section 5:

> "4. No fundamental changes to theory of identification of causal effects"

Overall, the link you've shared would be strongly in favor of ML-extended classical econometrics and possibly Bayesian hierarchical models or latent variable approaches, but almost surely would be against the notion that do-calculus could lead to a wide-spread or real-world set of applicable models.

How does someone use do-calculus? It's a nice mathematization of Goodhart's law, https://en.wikipedia.org/wiki/Goodhart%27s_law

but how would it help an algorithm make better predictions?

Sure, the reason a person turns on the heat affects our belief in the outside weather (were they feeling cold, or were they just trolling?), but how do you know the reason a person turned on the heat, and couldn't you learn which reasons are predictive by measuring correlations with other observables? If you know the reason directly ("I'm just playing with the dial because I'm 4 years old") that's a data point you could throw into your ML model without explicitly knowing it's a reason.

I am interested in a companion phenomenon with the recent interest in causal models in machine learning. Namely, the fact that at least in computer vision, it is not new at all and has been an important idea for at least many decades.

One of the original sources that took this approach is "The Ecological Approach to Visual Perception" (1979) [0] by James Gibson, which discussed at length the idea of "affordances" of an algorithmic model, similar in some respects to topics in reinforcement learning as well. Affordances represented the information about outcomes you gained by varying your degrees of observational freedom (i.e. you learn how to generalize beyond occluded objects by moving your head a little to the left or right and seeing how the visual input varies). This lets you get food, or hide from a predator that's partially blocked by a tree, etc., so over time generalizing past occlusions becomes better and better. This is much more interesting than a naive approach, like using data augmentation to augment a labeled data set with synthetically occluded variations, as is often done to improve rotational invariance, for example.

Then this idea was extended with a lot of formality in the mid-to-late 00's by Stefano Soatto in his papers on "Actionable Information" [1].

I wish more effort had been made by e.g. Pearl to look into this and unify his approach with what had already been thought of, especially because it turns me off a lot when someone tries to create a "whole new paradigm" and it starts to feel like they want to generate sexy marketing hype about it, rather than to say hey, this is an extension or connection or alternative of this older idea already in the topic of machine learning, rather than appearing like one is saying, "Us over here in causal inference world already know so much more about what to do ... so now let's apply it to your domain where you never thought of this". Pearl has a history of doing this stuff too, like with his previous debates with Gelman about Bayesian models. It almost feels to me like he is shopping around for some sexy application area where his one-upsmanship approach will catch on to give him a chance at the hype gravy train or something.

[0]: https://en.wikipedia.org/wiki/James_J._Gibson#Major_works

[1]: http://www.vision.cs.ucla.edu/papers/soatto09.pdf

> It almost feels to me like he is shopping around for some sexy application area where his one-upsmanship approach will catch on to give him a chance at the hype gravy train or something.

This doesn't seem like a very fitting description of Pearl. In his work, he is very careful to cite existing approaches (structural equation model literature, various topics from graphical models). In his various discussions with Gelman, he comes off as freakishly polite and not looking to one up.

I'm sorry but I simply don't agree about the politeness comment. As linked from a Quora post that goes into it, this was one of Pearl's original statements about the disagreement (the link to the original at the UCLA site appears to have been taken down) [0]:

> "I therefore invite my colleagues... to familiarize themselves with the miracles of do-calculus. Take any causal problem for which you know the answer in advance, submit it for analysis through the do-calculus and marvel with us at the power of the calculus to deliver the correct result in just 3–4 lines of derivation. Alternatively, if we cannot agree on the correct answer, let us simulate it on a computer, using a well specified data-generating model, then marvel at the way do-calculus, given only the graph, is able to predict the effects of (simulated) interventions. I am confident that after such experience all hesitations will turn into endorsements. BTW, I have offered this exercise repeatedly to colleagues from the potential outcome camp, and the response was uniform: “we do not work on toy problems, we work on real-life problems.” Perhaps this note would entice them to join us, mortals, and try a small problem once, just for sport."

This is absolutely the cheeky spirit of one-upsmanship I am talking about. The offers are always framed in terms of "look how causal inference supersedes everything," which is not a charitable take on approaches from others, especially in historical applied ML, that might have already developed some of the same underlying ideas.

[0]: https://www.quora.com/Why-is-there-a-dispute-between-Judea-P...

I don't know. The issue he is addressing in your quote is that people often leverage criticisms of his approach that are just verbal statements. Pearl wants people to use data-generating models to make their concerns explicit.

The link you used explains the situation pretty well. If anything Pearl's regular acknowledgement of graphical models seems to be an indication that he is mindful of at least one very common approach in current ML.

Isn't it incumbent on Pearl or someone in the do-calculus school to run an experiment to show it performs better than popular ML systems?

It's a beautiful theory, but it hearkens back to the symbolic AI era that has had limited effectiveness.

In theory, yes. However, I think in practice addressing the concerns of critics is often out of Pearl's hands.

Until they supply a "ground truth" or data generating model, he has a dilemma:

* if he doesn't create a data generating model, then arguments for / against his approach will be specious.

* if he creates a data generating model, they can claim it doesn't reflect reality.

In the case of Judea Pearl and Andy Gelman, it seems like the point of contention is much broader than the do-calculus. Andy Gelman does not seem to be a fan of structural equation modeling / similar graphical models.

How is it out of Pearl’s hands? Also, Gelman & Rubin already did look into Pearl’s models, and even agreed that for some toy model examples, the technique works as intended, but that there are serious how-things-work-in-practice reasons why Pearl’s models are unlikely to be mathematically appropriate for some real world use cases.

It’s really a fair response from them to Pearl, especially when the whole time Pearl is presenting it like causal inference is a miracle cure-all.

All I am seeing in your comments is hand waving attempts to shift the burden of proof onto the group of practitioners who already looked into this stuff and weren’t convinced!

So why does it being incumbent on Pearl or on another causal inference practitioner to demonstrate it scaling up to a more complicated in-practice problem still get qualified with an “in theory” from you? Why isn’t it resoundingly obvious by this point that the burden of proof lies with Pearl, and that people would be happy to hear if he can use these models for large-scale, practical use cases, but they (rightfully) don’t see a reason (even after looking into the models) to spend their own time doing it?

As an outsider, that reads as quite polite, especially given the context of his quote as what seems to be a request fallen on deaf ears.

I read the Quora link. It seems most people are supportive of Pearl's ideas. Why not try his proposed thought experiment and see?

Many people have already and continue to try his experiment, and indeed work hard on scaling up the problems that his method is applied to.

That’s part of the problem. It was offered up as a “miracle” that supersedes and fully, logically subsumes the hard fought and real-world tested methods of others, who rightfully weren’t going to advocate the use of some other, unproven thing claimed to be a cure-all, yet they still did engage heavily with it and worked through the math and agreed with derivations for simple models under various collections of assumptions.

I’m still not seeing anything convincing. Pearl’s tone is not polite or even quirky, and strays badly from any measure of humility or collaborative spirit of inquiry. Really. I mean, sure, he’s not coming out dropping f-bombs, but that’s utterly not the point. Trying to justify it as if it’s a diplomatic and earnest request is silly. “Hey, your career’s worth of work is totally wrong. Here look, my work completely supersedes yours because I worked through some canonical, highly academic cases. But I’m being fair & balanced about it, I swear!”

And then when folks did look into it and still felt unconvinced, “I guess you’re just dogmatic & closed minded. I offered to look into it with you, but you ignored me.”

> I wish more effort had been made by e.g. Pearl to look into this and unify his approach with what had already been thought of, especially because it turns me off a lot when someone tries to create a "whole new paradigm" and it starts to feel like they want to generate sexy marketing hype about it, rather than to say hey, this is an extension or connection or alternative of this older idea already in the topic of machine learning...

I think you wind up with a situation where none of the less-than-mainstream conceptions of intelligence will have further parts added. Instead, each becomes associated with a single individual's career. It's something of the nature of academia, a situation that made sense when scientific models and approaches were "small" enough to be fully encompassed by an individual.

But you have the problem that models aren't naturally modular. Whether model X extends model Y is something of a judgment call. What makes one model like or unlike another is a matter of both the structure of the model and the reasoning behind the model.

Moreover, consider that ten programmers creating one computer program tend to be proportionately less productive than one programmer creating a program (i.e., as a rule they work much less than 10x as fast). Ten theorists putting together one single theory may face a similar or greater problem of diminishing returns and coordination.

The development of Quantum Field Theory is a good example where >10 people all collaborated to come up with a framework that integrated the viewpoints of multiple theorists with radically different approaches, rather than every new contributor forking a personalized version of the previous theory.

Consider, for example, the way Freeman Dyson combined the graphical approach of Feynman with Schwinger's more formal methods.

> The development of Quantum Field Theory is a good example where >10 people all collaborated to come up with a framework that integrated the viewpoints of multiple theorists with radically different approaches, rather than every new contributor forking a personalized version of the previous theory.

Sure, I hope I was clear that I don't think ten theorists (or ten programmers) collaborating is impossible. I would simply say that collaborating has an extra cost to it, and in a competitive academic world, any cost needs some degree of payoff. This makes extending a mainstream theory advantageous, but not so much less-known theories.

And Quantum Field Theory had the advantage that the experiments for demonstrating its truth or falsehood were relatively straightforward. With AI, the question of a theory's truth is more debatable.

You make good points, and particularly to explain why there might not be much effort to unify approaches, this makes sense.

But it still doesn't explain Pearl's generally thorny disposition regarding other approaches. Most practitioners and researchers will err on the side of humility, and assume that broad swaths of comparable research are valuable and that many of their ideas have probably been thought of before, in one form or another, even if the researcher's approach is deserving of praise for its innovation or novelty.

David Mumford, in his 'dawning of the age of stochasticity' lecture, mentioned the idea of a 'hubris quotient' -- for him it was the idea of claiming to adequately summarize thousands of years of math progress to the point that someone could actually say something novel, in the span of a single career. If you've only been working on it for 30-40 years, and you're claiming to upend something that's been central for hundreds of years, that's a poor hubris quotient, and so maybe you should proceed with a lot of humility and caution.

It just never quite feels like Pearl accepts this for causal inference. Maybe he feels like it has not gotten the attention it deserves and needs to advocate in a more no-nonsense kind of way, but it just seems like somewhat of a bad hubris quotient to start speaking about how it is a novel take on something he feels ML has historically not adequately accounted for.

Well, I'm not qualified to judge Pearl's integrity.

I would note that Pearl is not necessarily the first or the only person to note that modern machine learning has problems associated with it, problems often described by "correlation is not the same as causation." We can see actual practical problems appear when machine learning systems are deployed in situations where they make definite judgments affecting people's lives based only on factors correlated with a condition. In the extreme, if factors X, Y, and Z are associated with someone acting criminally, are we allowed to arrest the person without a crime being committed? (etc).

So Pearl has some credibility stepping into this "breach" with his (perhaps self-branded but) more mathematically grounded and statistically sound approach. Of course, the problem is that no statistics really gives a "sound" way to unambiguously predict a future datum only from past data. The Bayesian approach does describe how to make sound predictions when you happen to know prior probabilities, a view that "kicks the can down the road" as others have mentioned.

The thing is, in contrast to math, AI has involved a group of models, theories and ideas which have all broadly moved forward across the decades with their stars rising and falling but few being utterly discarded. This is because little to nothing can be proven and, moreover, because despite being presented as alternatives, they intersect like fat Venn diagrams if considered only formally (though as specific programs of research, they may be exclusive). Moreover, publicity is one key to a given approach getting more concrete implementations and ultimately getting funding, more researchers and a chance to go on to the next generation. The relative speed of a neural net on a GPU might well be a key to this sort of model showing promising practical applications. Is this speed inherent or are other models waiting for optimized implementations? If such an optimized implementation is possible, it would require a specialized programmer and hence funding.

And this means? Well, I'm not sure what it means. Perhaps one could deduce a correct model of machine intelligence if one could determine and correct for the biases which currently drive the process.

This comment doesn't really make much sense to me, especially since none of Pearl's techniques have been convincingly demonstrated to work in real situations. It's one thing to take pot shots at practical engineering problems and point out flaws and locations for improvement, but it's quite different to claim that a new framework would solve them when (a) elements of that framework have already existed a while and practitioners knew about them, and (b) the framework hasn't been shown to give state of the art performance or to actually solve cases when algorithmic decision making made improper judgments.

Do you have examples to dispute this... actual examples where a causal inference based model was used for large-scale deployed machine learning problems and demonstrably fixed some type of judgment error that had previously been leading to bad outcomes for people?

I mean there are structural equation models that preceded Pearl's work that Pearl cites. And before that the Neyman-Rubin work. Neyman first wrote about it in 1923. I think Pearl's principal insight was to use graph theory to reason about either Bayesian things (see probabilistic graphical models) or causal things (see causality). This is a fairly fundamental insight.

Pearl's attention to the do conditionality -- i.e., P(Y|do(X)) versus P(Y|X) is interesting and important in a certain sense, but I'm not sure it's really resolved debates about causality in any practical sense.

I don't really mean that in a dismissive sense, just to point out that his notation just begs the question of what do(X) means, in terms of why it is actually important. To me it just kind of formalizes a certain notation and kicks the hard theoretical can down the road.

In the books and papers I've read of Pearl's, he makes reasonable logical arguments for certain types of causal inferences, but when, in discussion with colleagues, we've tried to think of how they would be implemented outside the context of an experiment, we've been sort of at a loss. I say this as someone who identifies with observational study professionally, but who recognizes the importance of experiments.

My broader point is that I think Pearl's do-calculus can be reexpressed in traditional graph theory/structural equations/statistics without introducing anything new. In that sense, although I think his writings have drawn attention to important issues, I don't think they have solved anything.

> just to point out that his notation just begs the question of what do(X) means,

It's very formally specified. The key object of study in Pearl-style causal inference is a structural causal model. A structural causal model is composed of equations like the following:

Y = f(X, Z, U)

Here, X and Z are observed inputs, other random variables in your system. U is unobserved. In other words, "Y is computed by a deterministic function which takes an unknown random input."

Then, P(Y = 1 | do(X=1, Z=2)) is defined as P(f(1,2, U) = 1).
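To make that definition concrete, here is a minimal sketch that evaluates P(Y = 1 | do(X=1, Z=2)) by summing over U. The particular `f` and the distribution `P_U` are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from fractions import Fraction

# Hypothetical distribution of the unobserved input U (an assumption).
P_U = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}


def f(x, z, u):
    # Hypothetical structural equation for Y (an assumption).
    return 1 if x + z + u >= 4 else 0


def p_y_do(y, x, z):
    """P(Y = y | do(X = x, Z = z)), defined as P(f(x, z, U) = y)."""
    return sum(p for u, p in P_U.items() if f(x, z, u) == y)


print(p_y_do(1, 1, 2))  # prints "2/3"
```

The intervention simply plugs the chosen values into f and leaves the distribution of U untouched, so no conditioning on how X and Z normally arise is involved.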

My gripe isn't with the importance of what Pearl published-- of course it's important. I just mean the concept of conditioning on how the target or observational outcome varies when you intentionally vary some conditional variables, that concept for use in machine learning is not new at all. Causal models would just be one more take on it, with interconnections and differences and pros and cons compared with what came before. But it's always disingenuously framed like, "ML practitioners never knew about doing this, but it's the only way to truly go further with our models."

Worth mentioning, perhaps, that Cybernetics originated from the study of "circular loops of causality", systems where e.g. A causes B, B causes C, and in turn C causes A, etc...

This is really sexy.

Nothing to see here. The do-calculus is just fancy notation for what reinforcement learning is already doing: trying different possible actions and trying to maximize reward. If you know possible actions in advance, this is basically minimizing regret of wrong policy actions.

First, RL and causal inference do fundamentally different things. RL is trying to train a controller; causal inference gives you a theory so that you can predict the results of a randomized controlled experiment without running one.

Second, consider this: Classic ML techniques will tell you that you should never go to the doctor because it increases the probability that you have a disease. Causal inference does not have this problem.

How does RL dodge this?

Not an RL expert, but Model-Based RL is a thing, where you try to train a model of how actions affect the world, and then use that model to choose/influence your actions.

But I don't think it's true that we always need a model, or at least I don't necessarily think we always need a human understandable model.

Your doctor example is weird to me tbh. A non-causal ML approach would seek to determine whether a patient has a disease based on some symptoms, and then send them to a doctor based on those results, sidestepping the need for causal models.

To rephrase it in a way that makes a bit more sense to me is: let's assume we want to know if a specific procedure would be good for a patient (basically the same example). With a non-causal approach we would want to predict whether a patient would have a better outcome from doing a procedure than not.

A natural way to solve this (to me) would be to build one model that estimates the probability of various outcomes from the procedure, and one that estimates the probability of various outcomes from not undergoing the procedure.

Or if you're working in the world of Neural Nets/Deep RL, have a model that takes all the non-intervention data as input and outputs the expected outcomes from the procedure and the expected outcomes from not doing the procedure, and when you train it, you only supervise the outcomes that you had data for.
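The "only supervise the outcomes that you had data for" idea might look roughly like the loss below. It is a sketch under assumptions: the function name, the two-headed prediction format and the squared-error choice are all illustrative. On its own it does nothing about confounding in how patients were assigned to the procedure.

```python
def masked_squared_error(predictions, treated, outcome):
    """Mean squared error computed on the observed head only.

    predictions: list of (pred_if_untreated, pred_if_treated) pairs
    treated:     list of 0/1 flags for the arm each patient actually received
    outcome:     list of observed outcomes for that arm
    """
    total = 0.0
    for (pred0, pred1), t, y in zip(predictions, treated, outcome):
        # Pick the head matching what actually happened; the other head
        # contributes nothing to the loss for this patient.
        pred = pred1 if t == 1 else pred0
        total += (pred - y) ** 2
    return total / len(outcome)


preds = [(0.2, 0.9), (0.5, 0.4)]
loss = masked_squared_error(preds, treated=[1, 0], outcome=[1.0, 0.0])
print(round(loss, 4))  # prints "0.13"
```

Changing the prediction of a head that was not observed for a patient leaves the loss unchanged, which is exactly the masking behavior described above.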

This ignores the Bayesian/Distributional Shift issue, but I don't think the do calculus has a real answer to that either.

I would be interested in knowing if this ad-hoc modelling approach is any different from the causal modelling that Pearl is arguing for, or if causal modelling is more necessary when you have more complicated causal relationships than a single intervention.

> A non-causal ML approach would seek to determine whether a patient has a disease based on some symptoms, and then send them to a doctor based on those results, sidestepping the need for causal models.

I saw an ML presentation a few months ago, on training a decision tree to do the same thing as a neural net, so we can understand what the neural net was doing.

They used this on a neural net trying to diagnose people with diabetes. It showed that having any other diagnosis would increase its probability of diagnosing them with diabetes. Why? Because it meant they're more likely to have gone to the doctor to get diagnosed. (Along with detecting general health indicators that weren't screened out.)

You can try to partition your data into intervention/non-intervention, or do something else to try to stop your model from detecting spurious correlations. Causal models make this more formal: they tell you which things you should include/exclude, give you formulas for adjusting them out, and tell you how much bias you introduce by failing to do so.
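One such formula is the backdoor adjustment, P(y | do(x)) = sum over z of P(y | x, z) P(z). The sketch below applies it to a made-up doctor-visit example; all of the probabilities are invented for illustration, and it assumes that sickness Z is the only confounder of doctor visits X and bad outcomes Y.

```python
from fractions import Fraction as F

# Invented numbers: Z = 1 means sick, X = 1 means visited the doctor.
P_Z = {1: F(1, 5), 0: F(4, 5)}
P_X1_GIVEN_Z = {1: F(9, 10), 0: F(1, 10)}
# P(Y = 1 | X = x, Z = z), keyed by (x, z); Y = 1 is a bad outcome.
P_Y1_GIVEN_XZ = {(1, 1): F(2, 5), (0, 1): F(4, 5),
                 (1, 0): F(1, 20), (0, 0): F(1, 10)}


def p_x_given_z(x, z):
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1 - p1


def naive(x):
    """P(Y = 1 | X = x): plain conditioning on the observed visit."""
    num = sum(P_Z[z] * p_x_given_z(x, z) * P_Y1_GIVEN_XZ[(x, z)] for z in P_Z)
    den = sum(P_Z[z] * p_x_given_z(x, z) for z in P_Z)
    return num / den


def adjusted(x):
    """P(Y = 1 | do(X = x)) via backdoor adjustment over Z."""
    return sum(P_Z[z] * P_Y1_GIVEN_XZ[(x, z)] for z in P_Z)


print(naive(1), naive(0))        # prints "19/65 22/185"
print(adjusted(1), adjusted(0))  # prints "3/25 6/25"
```

With these numbers, conditioning makes doctor visits look harmful because sick people visit more often, while the adjusted estimate shows the visit halving the risk of a bad outcome.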

The theory of causal inference is also immune to distributional shift, and serves as a nice guidance for what actual systems should do (usually: failing to return an answer).

(Yes, I've fully drunk the Pearl Kool-Aid.)

Thanks for the example, it does motivate it a bit better when there are more complicated (but still relatively simple) causal relationships.

I think what offpolicy was trying to clumsily say is that policy evaluation (I come from the economic policy econometrics world originally) can be used for RL.

Maybe it can, but isn't Bayesian stuff really costly most of the time?

How? AFAIK policy analysis is based on asking causal and counterfactual questions, primarily by trying to find a quasi-control-population that can be a proxy for the intervention and then regressing against it vs the observational data. I forget the name specifically. Causal models would represent this explicitly and you could reason about the model to see if your economics question is well posed. Nobody does this now but it is important to do because it lays out the assumptions which can't be hidden behind politics.

RL is for training system parameters based on positive or negative reinforcement from a critic. RL is based on a Markov decision process. RL has a policy search idea, but that is separate from economic policy evaluation.
