That fits the model where you randomly 'do' one thing or another (e.g. blinded testing); however, this is not the same thing as P(y | do'(x)), where do' denotes your empirical observation of the occasions when you yourself have set X = x in a more natural context.
E.g. let's say you will always turn on the heat when it's cold outside. P(cold outside | do(turn on heat)) = P(cold outside), because turning on the heat does not affect the temperature outdoors.
However, P(cold outside | do'(turned on heat)) > P(cold outside), because empirically, you actually only choose to turn on the heat when it's cold outdoors.
These two are also different from P(cold outside | heat was turned on) (since someone else might have access to the thermostat).
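A quick toy simulation makes the gap concrete (the 0.3 probability and the "heat iff cold" policy are assumptions made up for this sketch):

```python
import random

random.seed(0)

# Toy model: it is cold with probability 0.3, and (observationally)
# you turn on the heat exactly when it is cold.
def sample_world():
    cold = random.random() < 0.3
    heat = cold  # your natural policy: heat iff cold
    return cold, heat

samples = [sample_world() for _ in range(100_000)]

# P(cold) equals P(cold | do(heat on)), since setting the thermostat
# does not change the weather.
p_cold = sum(c for c, _ in samples) / len(samples)

# P(cold | heat was on): conditioning on the *observed* action.
heated = [c for c, h in samples if h]
p_cold_given_heat = sum(heated) / len(heated)

print(round(p_cold, 2))   # ~0.3
print(p_cold_given_heat)  # 1.0: observing the action is strong evidence
```

Intervening leaves the marginal untouched; conditioning on the observed action is (here, perfect) evidence about the weather.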
In reality, our choices and actions are themselves products of the initial state (including our own beliefs and our own knowledge of what would happen if we did x). Our actions move the world, but we are also moved by the world.
Does do-calculus have a careful treatment of 'mixed' scenarios where actions are both causes and effects of other causes?
But you might want to consider making "turned on heat" part of the system in this case, and go back to using the classic conditioning operator instead of the do operator.
This is covered in chapter 4 of Pearl's Causality.
Every bug you fix in your code increases your chances of shipping on time, but provides evidence that you won't.
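That aphorism is the same intervention-vs-observation split; here's a toy simulation of it (all the numbers -- 0 to 9 latent bugs, fixing up to 3, shipping iff at most 2 remain -- are made up for illustration):

```python
import random

random.seed(0)

# Toy model: a project has 0-9 latent bugs, you fix up to 3 of them,
# and you ship on time iff at most 2 bugs remain.
def sample(extra_fix=0):
    bugs = random.randint(0, 9)
    fixed = min(bugs, 3) + extra_fix
    ship = bugs - fixed <= 2
    return fixed >= 1, ship

base = [sample() for _ in range(200_000)]
p_ship = sum(s for _, s in base) / len(base)

# Conditioning: among projects where a bug was fixed, shipping is *less*
# likely, because fixing bugs is evidence the project is buggy.
fixed_runs = [s for f, s in base if f]
p_ship_given_fix = sum(fixed_runs) / len(fixed_runs)

# Intervening: forcing one extra fix on every project raises the rate.
forced = [sample(extra_fix=1) for _ in range(200_000)]
p_ship_do_fix = sum(s for _, s in forced) / len(forced)

print(p_ship_given_fix < p_ship < p_ship_do_fix)  # True
```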
How about this: "In the interview, Pearl dismisses most of what we do in ML as curve fitting. While I believe that's an overstatement (conveniently ignores RL for example), it's a nice reminder that most productive debates are often triggered by controversial or outright arrogant comments. Calling machine learning alchemy was a great recent example."
When a person is dismissive of an entire field and claims to have a better way, that often comes off as arrogant (even if it is true). My interpretation is "harsh" while the author uses the word "overstatement". You'll also see "arrogant" in there and that last line calling it "alchemy" really has to be interpreted with negative connotations. Perhaps I read more into it than was written, but that was the impression I got.
The popular academic writing in that field claims everyone else is being arrogant. It’s not a statement that Pearl is arrogant for dismissing huge chunks of ML; rather that, since causal inference is such a cure-all, everyone else is arrogant for not dropping everything to use it everywhere.
There’s no spirit in this article of saying, “boy, it looked like a short-sighted criticism of ML, but now that I look at it, the causal inference people are right after all, and the ML people are wrong.”
It may try to disingenuously frame it that way, but this is not what they are saying.
It is written in a way that suggests he still regards the criticism as harsh and arrogant, but not incorrect, if that makes sense.
The only part that mentions anything being "arrogant" is this quote:
> "it's a nice reminder that most productive debates are often triggered by controversial or outright arrogant comments"
which would actually be entirely counter to your point (the author is saying that 'arrogant' comments actually promote stimulating debate -- while I disagree with that too, it's clear the author did not at all say the criticism itself was arrogant, only that arrogant comments, many of which are Pearl's own comments, lead to debates).
When I read the introduction (which I have done now about 10 times), I see the author found practical reasons to dismiss do-calculus before (it was not pragmatic or applicable to real work problems). Now coming back to it later, he seems to be academically more interested in it and willing to invest more time in the nuance (while still nothing in the article gives an indication of its larger scale practical applicability). He does say he was 'embarrassed' to not look deeply into it before, but does not say this is because of how effective it is in real-world cases (which no one in this thread seems able to point to).
If you're not in the mood for an academic paper, this podcast by the same author explains the potential of sharing lessons from both worlds:
The linked paper mostly goes into instrumental variables and mixed-effects modeling, covering how classical econometrics has dealt with understanding the causal effect of intentionally varying a treatment. And despite citing Rubin heavily, the paper doesn't go much into the Bayesian methods for solving similar problems (hierarchical models), even though they are a state-of-the-art approach with modern computational MCMC techniques.
The last few sections do offer some interesting research citations for how classical instrumental effects models have been morphed with advances in machine learning, with things like causal trees.
But just look at one of the takeaway points of the survey, in section 5:
> "4. No fundamental changes to theory of identification of causal effects"
Overall, the link you've shared would be strongly in favor of ML-extended classical econometrics and possibly Bayesian hierarchical models or latent variable approaches, but almost surely against the notion that do-calculus could lead to a widespread or real-world set of applicable models.
but how would that help an algorithm make better predictions?
Sure, the reason a person turns on the heat affects our belief about the outside weather (were they feeling cold, or were they just trolling?), but how do you know the reason a person turned on the heat, and couldn't you learn which reasons are predictive by measuring correlations with other observables? If you know the reason directly ("I'm just playing with the dial because I'm 4 years old"), that's a data point you could throw into your ML model without explicitly knowing it's a reason.
One of the original sources that took this approach is "The Ecological Approach to Visual Perception" (1979) by James Gibson, which discussed at length the idea of "affordances," similar in some respects to topics in reinforcement learning. Affordances represent the information about outcomes you gain by varying your degrees of observational freedom (i.e. you learn to generalize past occluded objects by moving your head a little to the left or right and seeing how the visual input varies; this lets you get food, or hide from a predator that's partially blocked by a tree, etc., so over time generalizing past occlusions gets better and better -- much more interesting than a naive approach like augmenting a labeled data set with synthetically occluded variations, as is often done to improve rotational invariance).
Then this idea was extended with a lot of formality in the mid-to-late 00's by Stefano Soatto in his papers on "Actionable Information".
I wish more effort had been made by e.g. Pearl to look into this and unify his approach with what had already been thought of. It turns me off a lot when someone tries to create a "whole new paradigm" and it starts to feel like they want to generate sexy marketing hype about it, rather than saying: hey, this is an extension of, connection to, or alternative for an older idea already in machine learning. Instead it appears he is saying, "Us over here in causal inference world already know so much more about what to do... so now let's apply it to your domain where you never thought of this." Pearl has a history of doing this stuff too, like with his previous debates with Gelman about Bayesian models. It almost feels to me like he is shopping around for some sexy application area where his one-upsmanship approach will catch on, to give him a chance at the hype gravy train or something.
: < https://en.wikipedia.org/wiki/James_J._Gibson#Major_works >
: < http://www.vision.cs.ucla.edu/papers/soatto09.pdf >
This doesn't seem like a very fitting description of Pearl. In his work, he is very careful to cite existing approaches (structural equation model literature, various topics from graphical models). In his various discussions with Gelman, he comes off as freakishly polite and not looking to one up.
> "I therefore invite my colleagues... to familiarize themselves with the miracles of do-calculus. Take any causal problem for which you know the answer in advance, submit it for analysis through the do-calculus and marvel with us at the power of the calculus to deliver the correct result in just 3–4 lines of derivation. Alternatively, if we cannot agree on the correct answer, let us simulate it on a computer, using a well specified data-generating model, then marvel at the way do-calculus, given only the graph, is able to predict the effects of (simulated) interventions. I am confident that after such experience all hesitations will turn into endorsements. BTW, I have offered this exercise repeatedly to colleagues from the potential outcome camp, and the response was uniform: “we do not work on toy problems, we work on real-life problems.” Perhaps this note would entice them to join us, mortals, and try a small problem once, just for sport."
This is absolutely the cheeky spirit of one-upsmanship I am talking about. The offers are always framed in terms of "look how causal inference supersedes everything," which is not a charitable take on approaches from others, especially in historical applied ML, that might have already developed some of the same underlying ideas.
The link you used explains the situation pretty well. If anything Pearl's regular acknowledgement of graphical models seems to be an indication that he is mindful of at least one very common approach in current ML.
It's a beautiful theory, but it hearkens back to the symbolic AI era that has had limited effectiveness.
When it comes to supplying a "ground truth" or data-generating model, he has a dilemma:
* if he doesn't create a data generating model, then arguments for / against his approach will be specious.
* if he creates a data generating model, they can claim it doesn't reflect reality.
In the case of Judea Pearl and Andy Gelman, it seems like the point of contention is much broader than the do-calculus. Andy Gelman does not seem to be a fan of structural equation modeling / similar graphical models.
It’s really a fair response from them to Pearl, especially when the whole time Pearl is presenting it like causal inference is a miracle cure-all.
All I am seeing in your comments is hand-waving attempts to shift the burden of proof onto the group of practitioners who already looked into this stuff and weren’t convinced!
So why does it being incumbent on Pearl or on another causal inference practitioner to demonstrate it scaling up to a more complicated in-practice problem still get qualified with an “in theory” from you? Why isn’t it resoundingly obvious by this point that the burden of proof lies with Pearl, and that people would be happy to hear if he can use these models for large-scale, practical use cases, but they (rightfully) don’t see a reason (even after looking into the models) to spend their own time doing it?
I read the Quora link. It seems most people are supportive of Pearl's ideas. Why not try his proposed thought experiment and see?
That’s part of the problem. It was offered up as a “miracle” that supersedes and fully, logically subsumes the hard-fought, real-world-tested methods of others. Those others rightfully weren’t going to advocate some unproven thing claimed to be a cure-all, yet they still engaged heavily with it, worked through the math, and agreed with the derivations for simple models under various collections of assumptions.
I’m still not seeing anything convincing. Pearl’s tone is not polite or even quirky, and strays badly from any measure of humility or collaborative spirit of inquiry. Really. I mean, sure, he’s not coming out dropping f-bombs, but that’s utterly not the point. Trying to justify it as if it’s a diplomatic and earnest request is silly. “Hey, your career’s worth of work is totally wrong. Here look, my work completely supersedes yours because I worked through some canonical, highly academic cases. But I’m being fair & balanced about it, I swear!”
And then when folks did look into it and still felt unconvinced, “I guess you’re just dogmatic & closed minded. I offered to look into it with you, but you ignored me.”
I think you wind up with a situation where none of the less-than-mainstream conceptions of intelligence will have further parts added. Instead, each becomes associated with a single individual's career. It's something of the nature of academia, a situation that made sense when scientific models and approaches were "small" enough to be fully encompassed by an individual.
But you have the problem that models aren't naturally modular. Whether model X extends model Y is something of a judgment call. What makes one model like or unlike another is a matter of both the structure of the model and the reasoning behind it.
Moreover, consider that ten programmers creating one computer program tend to be proportionately less productive than one programmer creating a program (i.e., as a rule they work much less than 10x as fast). Ten theorists putting together one single theory may face a similar or greater problem of diminishing returns and coordination.
Consider, for example, the way Freeman Dyson combined the graphical approach of Feynman with Schwinger's more formal methods.
Sure, and I hope I was clear that I don't think ten theorists (or ten programmers) collaborating is impossible. I would simply say that collaboration has an extra cost to it, and in a competitive academic world any cost needs some degree of payoff. This makes extending a mainstream theory advantageous, but not so much a lesser-known theory.
And quantum field theory had the advantage that the experiments for demonstrating its truth or falsehood were relatively straightforward. With AI, the question of a theory's truth is more debatable.
But it still doesn't explain Pearl's generally thorny disposition regarding other approaches. Most practitioners and researchers will err on the side of humility, and assume that broad swaths of comparable research is valuable and that many of their ideas have probably been thought of before, in one form or another, even if the researcher's approach is deserving of praise for its innovation or novelty.
David Mumford, in his 'dawning of the age of stochasticity' lecture, mentioned the idea of a 'hubris quotient' -- for him it was the idea of claiming to adequately summarize thousands of years of math progress to the point that someone could actually say something novel, in the span of a single career. If you've only been working on it for 30-40 years, and you're claiming to upend something that's been central for hundreds of years, that's a poor hubris quotient, and so maybe you should proceed with a lot of humility and caution.
It just never quite feels like Pearl accepts this for causal inference. Maybe he feels like it has not gotten the attention it deserves and needs to advocate in a more no-nonsense kind of way, but it just seems like somewhat of a bad hubris quotient to start speaking about how it is a novel take on something he feels ML has historically not adequately accounted for.
I would note that Pearl is not necessarily the first or the only person to note that modern machine learning has problems associated with it, problems often described by "correlation is not the same as causation." We can see actual practical problems appear when machine learning systems are deployed in situations where they make definite judgments affecting people's lives based only on factors correlated with a condition. In the extreme, if factors X, Y, and Z are associated with someone acting criminally, are we allowed to arrest the person before a crime is committed? (Etc.)
So Pearl has some credibility stepping into this "breach" with his (perhaps self-branded, but) more mathematically grounded and statistically sound approach. Of course, the problem is that no statistics really gives a "sound" way to unambiguously predict a future datum only from past data. The Bayesian view does describe how to make sound predictions when you happen to know prior probabilities, a view that "kicks the can down the road," as others have mentioned.
The thing is, in contrast to math, AI has involved a group of models, theories, and ideas which have all broadly moved forward across the decades, with their stars rising and falling but few being utterly discarded. This is because little to nothing can be proven, and moreover because, despite being presented as alternatives, they intersect like fat Venn diagrams if considered only formally (though as specific programs of research, they may be exclusive). Moreover, publicity is one key to a given approach getting more concrete implementations and ultimately getting funding, more researchers, and a chance to go on to the next generation. The relative speed of a neural net on a GPU might well be a key to this sort of model showing promising practical applications. Is this speed inherent, or are other models waiting for optimized implementations? If such an optimized implementation is possible, it would require a specialized programmer and hence funding.
And this means? Well, I'm not sure what it means. Perhaps one could deduce a correct model of machine intelligence if one could determine and correct for the biases which currently drive the process.
Do you have examples to dispute this... actual examples where a causal inference based model was used for large-scale deployed machine learning problems and demonstrably fixed some type of judgment error that had previously been leading to bad outcomes for people?
I don't really mean that in a dismissive sense, just to point out that his notation just begs the question of what do(X) means, in terms of why it is actually important. To me it just kind of formalizes a certain notation and kicks the hard theoretical can down the road.
In the books and papers I've read of Pearl's, he makes reasonable logical arguments for certain types of causal inferences, but when, in discussion with colleagues, we've tried to think of how they would be implemented outside the context of an experiment, we've been sort of at a loss. I say this as someone who identifies with observational study professionally, but who recognizes the importance of experiments.
My broader point is that I think Pearl's do-calculus can be reexpressed in traditional graph theory/structural equations/statistics without introducing anything new. In that sense, although I think his writings have drawn attention to important issues, I don't think they have solved anything.
It's very formally specified. The key object of study in Pearl-style causal inference is a structural causal model, which is composed of equations like the following:
Y = f(X, Z, U)
Here, X and Z are observed inputs (other random variables in your system), and U is unobserved. In other words, "Y is computed by a deterministic function which takes an unknown random input."
Then, P(Y = 1 | do(X=1, Z=2)) is defined as P(f(1,2, U) = 1).
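A minimal sketch of that definition, with an assumed toy f and noise distribution (both invented here, not from the theory):

```python
import random

random.seed(0)
N = 100_000

# Assumed toy structural equation: Y = f(X, Z, U) with U ~ Bernoulli(0.5).
def f(x, z, u):
    return int(x + z + u > 3)

# P(Y=1 | do(X=1, Z=2)) is, by definition, P(f(1, 2, U) = 1):
# hold X and Z fixed and average only over the unobserved noise U.
u_samples = [random.random() < 0.5 for _ in range(N)]
p_do = sum(f(1, 2, u) for u in u_samples) / N

print(round(p_do, 2))  # ~0.5, since f(1, 2, U) = 1 exactly when U = 1
```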
Second, consider this: Classic ML techniques will tell you that you should never go to the doctor because it increases the probability that you have a disease. Causal inference does not have this problem.
How does RL dodge this?
But I don't think it's true that we always need a model, or at least I don't necessarily think we always need a human understandable model.
Your doctor example is weird to me tbh. A non-causal ML approach would seek to determine whether a patient has a disease based on some symptoms, and then send them to a doctor based on those results, sidestepping the need for causal models.
To rephrase it in a way that makes a bit more sense to me: let's assume we want to know whether a specific procedure would be good for a patient (basically the same example). With a non-causal approach, we would want to predict whether a patient would have a better outcome from doing the procedure than from not doing it.
A natural way to solve this (to me) would be to build one model that estimates the probability of various outcomes from the procedure, and one that estimates the probability of various outcomes from not undergoing the procedure.
Or, if you're working in the world of neural nets/deep RL, have a single model that takes all the non-intervention data as input and outputs both the expected outcomes from doing the procedure and the expected outcomes from not doing it; when you train it, you only supervise the outcomes you actually have data for.
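That two-model idea can be sketched on synthetic data (the data-generating numbers, and the per-stratum-mean "models", are assumptions for illustration only):

```python
import random
from collections import defaultdict

random.seed(0)

# Synthetic data (all numbers made up): severity drives both whether the
# procedure happens and the outcome; the procedure helps severe patients.
def sample():
    severe = random.random() < 0.5
    treated = random.random() < (0.8 if severe else 0.2)
    p_good = (0.6 if treated else 0.3) if severe else (0.7 if treated else 0.8)
    return severe, treated, random.random() < p_good

data = [sample() for _ in range(200_000)]

# Two-model approach: fit one outcome model on treated records and one on
# untreated records. Here each "model" is just per-stratum outcome means.
def fit(records):
    stats = defaultdict(lambda: [0, 0])
    for severe, _, good in records:
        stats[severe][0] += good
        stats[severe][1] += 1
    return {severe: s / n for severe, (s, n) in stats.items()}

model_treated = fit([r for r in data if r[1]])
model_control = fit([r for r in data if not r[1]])

# Estimated benefit of the procedure for a severe patient:
effect_severe = model_treated[True] - model_control[True]
print(round(effect_severe, 1))  # ~0.3 under the generating model above
```

Note that this only recovers the true effect because severity, the lone confounder here, is stratified on; deciding *which* variables must be stratified on is exactly what the causal-graph machinery formalizes.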
This ignores the Bayesian/distributional-shift issue, but I don't think the do-calculus has a real answer to that either.
I would be interested in knowing whether this ad-hoc modelling approach is any different from the causal modelling that Pearl is arguing for, or whether causal modelling becomes more necessary when you have more complicated causal relationships than a single intervention.
I saw an ML presentation a few months ago, on training a decision tree to do the same thing as a neural net, so we can understand what the neural net was doing.
They used this on a neural net trying to diagnose people with diabetes. It showed that having any other diagnosis would increase its probability of diagnosing them with diabetes. Why? Because it meant they're more likely to have gone to the doctor to get diagnosed. (Along with detecting general health indicators that weren't screened out.)
You can try to partition your data into intervention/non-intervention, or do something else to try to stop your model from picking up spurious correlations. Causal models make this more formal: they tell you which things you should include/exclude, give you formulas for adjusting them out, and tell you how much bias you introduce by failing to do so.
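For a single confounder, the adjustment formula this gives you is the back-door formula, P(y | do(x)) = sum over z of P(y | x, z) P(z). A toy check (the numbers, and the "X has no real effect" structure, are assumptions of this sketch):

```python
import random

random.seed(0)
N = 400_000

# Assumed toy model: a confounder Z (think "went to the doctor") drives
# both X and Y, while X itself has no effect on Y at all.
rows = []
for _ in range(N):
    z = random.random() < 0.5
    x = random.random() < (0.8 if z else 0.2)
    y = random.random() < (0.7 if z else 0.2)
    rows.append((z, x, y))

def cond_mean(num, den):
    sel = [r for r in rows if den(r)]
    return sum(num(r) for r in sel) / len(sel)

# Naive conditional P(Y=1 | X=1): inflated by the spurious correlation.
naive = cond_mean(lambda r: r[2], lambda r: r[1])

# Back-door adjustment: P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, Z=z) P(Z=z).
p_z = sum(r[0] for r in rows) / N
adjusted = (cond_mean(lambda r: r[2], lambda r: r[1] and r[0]) * p_z
            + cond_mean(lambda r: r[2], lambda r: r[1] and not r[0]) * (1 - p_z))

print(round(naive, 2))     # ~0.6
print(round(adjusted, 2))  # ~0.45, the true interventional probability here
```

The naive conditional makes X look strongly predictive of Y; adjusting over Z recovers the fact that intervening on X does nothing.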
The theory of causal inference is also immune to distributional shift, and serves as a nice guidance for what actual systems should do (usually: failing to return an answer).
(Yes, I've fully drunk the Pearl Kool-Ade.)
Maybe it can, but isn't Bayesian stuff really costly most of the time?
RL is for training system parameters based on positive or negative reinforcement from a critic. It is built on a Markov decision process. RL does have the idea of policy search, but that is separate from economic policy evaluation.