Causal inference as a blind spot of data scientists (dzidas.com)
225 points by Dzidas on Oct 15, 2023 | 102 comments



The main reason for not using causal inference is not that data scientists don't know about the different approaches or can't imagine something equivalent (there is a lot of reinvention); forecasting is one of the most common tasks, after all.

The main reason is that they generally work for software companies where it's easier, and less susceptible to analyst influence, to implement the suggested change and test it with a Randomized Controlled Trial. I remember running an analysis that found that gender was a significant explanatory factor for behavior on our site; my boss asked (dismissively): What can we do with that information? If there is an assumption of how things work that doesn't translate to a product change, that insight isn't useful; if there is a product intuition, testing the product change itself is key, and there's no reason to delay that.

There are cases where RCTs are hard to organize (for example, multi-sided platform businesses) or changes that can't be tested in isolation (major brand changes). Those tend to benefit from the techniques described in the article, and they have dedicated teams. But this is a classic case of a complicated tool that doesn't fit most use cases.


Causal inference is actually also really hard to benchmark. A colleague of mine started an effort to be able to actually reproduce and compare results. The algorithms also often do not scale well.

Every time we wanted to use this on real data it was just a little too much effort, and the results were not conclusive because it is hard to verify huge graphs. My colleague, for example, wanted to apply it to explain risk confounders in investment funds.

I personally also do not like the definition of causality they base it on.


One way to test this is through a placebo test, where you shift the treatment, such as moving it to an earlier date, which I have seen used successfully in practice. Another approach is to test the sensitivity of each feature, which is often considered more of an art than a science. In practice, I haven't observed much success with this method.
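(To make the placebo idea concrete, here is a minimal sketch on simulated data; the series, dates, and effect size are all made up, and the naive difference-in-means estimator stands in for whatever effect estimator you actually use.)

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic daily metric: flat baseline plus a true lift after day 70.
    days = 100
    y = 50 + rng.normal(0, 2, size=days)
    true_start = 70
    y[true_start:] += 5                       # the real treatment effect

    def diff_in_means(series, start):
        """Naive effect estimate: post-period mean minus pre-period mean."""
        return series[start:].mean() - series[:start].mean()

    print("effect at the real date:", round(diff_in_means(y, true_start), 2))

    # Placebo test: pretend the treatment happened at day 40, using only
    # pre-treatment data. A large "effect" here would point to a trend or
    # confounder rather than a causal lift.
    pre = y[:true_start]
    print("effect at a placebo date:", round(diff_in_means(pre, 40), 2))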


You don’t need to look at a graph at all though, right? There are plenty of tests that can help you identify factors that could be significantly affecting your distribution.


If you want to make causal inferences you really do have to look at a graph that includes both observed and probable unobserved causes to get any real sense of what’s going on. Automated methods absent real thinking about the data generating process are junk.


“Graph” here means the directed acyclic graph encoding the causal relationships, not a chart of a distribution.


You can only select among features that you have measured.


Go on, please. What definition, and algorithms with scaling problems?


A/B experiments are definitely a gold standard, as they provide a true causal measurement (if implemented correctly). However, they are often expensive to run: you need to implement the feature in question (which will turn out to work less than 50% of the time) and then collect data for 1-4 weeks before being able to make the decision. As a result, only a small number of business decisions today rely on A/B tests. Observational causal inference can help bring causality into many of the remaining decisions, which need to be made quicker or cheaper.
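(As a rough illustration of why the wait is measured in weeks, here is a back-of-the-envelope sample-size calculation for a two-proportion z-test; the baseline rate, lift, and traffic figures below are invented.)

    from scipy.stats import norm

    def n_per_arm(p_base, lift, alpha=0.05, power=0.8):
        """Approximate sample size per arm to detect an absolute lift in a
        conversion rate with a two-sided two-proportion z-test."""
        p1, p2 = p_base, p_base + lift
        z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
        var = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_a + z_b) ** 2 * var / (p2 - p1) ** 2

    n = n_per_arm(0.05, 0.005)              # 5% baseline, +0.5pp absolute lift
    print(f"~{n:,.0f} users per arm")       # roughly 31k per arm
    daily_users_per_arm = 5_000             # hypothetical traffic split
    print(f"~{n / daily_users_per_arm:.0f} days of data collection")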


The “gold standard” has failure modes that seem to be ignored.

E.g.: making UI elements jump around unpredictably after a page load may increase the number of ad clicks simply because users can’t reliably click on what they actually wanted.

I see A/B testing turning into a religion where it can’t be argued with. “The number went up! It must be good!”


That's generally because the metrics you are looking at don't represent what users care about. That is a different problem from the testing methodology, one that's often overlooked and a lot more important.

I’ve argued that A/B testing training should focus on that skill a lot more than Welch’s theory, but I had to record my own classes for that to happen.


But those metrics are hard to move, so you target secondary metrics.

The problem with that strategy becomes obvious when you spell out the consequences: measurably improving the product is hard, so you measure something else and hope you get product improvements.


There can be a real ethical dilemma when applying A/B testing in a medical setting. Placing someone with an incurable disease in a control group is condemning them to death, while in the treatment group they might have a chance. On the other hand, without a proper A/B testing methodology the drug's efficacy cannot be established. So far no perfect solution to the dilemma has been found.


> in a control group

The control group gets the current standard treatment, not nothing (in case that was a source of confusion). Plus they typically don't have to pay for it which is a benefit for them.

Large trials today will typically conduct interim analyses and will have pre-defined guidelines for when to stop the trial because the new treatment is either clearly providing a benefit or is clearly futile.

Here is an example of such a study: https://www.ahajournals.org/doi/10.1161/CIRCHEARTFAILURE.111...


Most therapeutic trials are nowadays "intent to treat", so subjects receive either the standard tx or the experimental tx in the randomization. Many of them also have crossovers, such that when measurable benefit (as defined by the protocol) is seen, subjects on the standard tx can be moved over to the experimental arm.


It's not really an ethical dilemma until you know it works, and then usually if the evidence is strong enough they'll cut the trial early.


All the alternative methods require the same sacrifice. More importantly, most suggested treatments fail to cure deadly conditions or have major side effects or risks that are just as unethical to thrust upon people untested.

If you look at it properly, i.e. evaluate what your possible actions are before the test (do nothing, impose the untested treatment, or test with a proper control to learn what to do for the majority of the population), the answer is rarely ambiguous.

There is a debate to be had on how much pre-clinical work to be done before clinical testing, but those are increasingly automated, cheap, and fast, so we often reach the point where a double-blind test is the next logical step.

The argument you present is based on either an unwarranted confidence in treatments, or information that wasn’t available when the decision had to be made.


You can end the trial early when it's clear the treatment is working. This just happened last week with Ozempic for diabetes-caused kidney disease. https://www.wxyz.com/news/health/ask-dr-nandi/novo-nordisk-e...


Causal inference is useful, but it's neither quicker nor cheaper.


Agree that it is hard today. A person you might know is trying to prove that it doesn't have to be: https://www.motifanalytics.com/blog/bringing-more-causality-... .

We’d love to chat more with you on the topic - feel free to hit Sean or me on LinkedIn.


I am a big fan of what Sean and you are trying to do; I wrote up a chapter about it this weekend, actually. My worry is that you both have worked for companies where a lot of work had already been done to identify relevant dimensions (metrics and categories), so automating causality (or rather: estimating factors on a pre-existing causal graph, because that's the sleight of hand the word "causality" performs) made sense once you'd reached that level of maturity.

But to reach that point, before having relevant dimensions, there has to be a lot of work, generally motivated by disappointing experiments. “Why didn’t that work?” is often answered by “Because our goal is too remote from our actions—here’s a better proxy” or “Because this change only makes sense to 8% of our users, here’s how we can split them.”

I'm worried that too many people will think the tool itself is enough, rather than a complement to the maturity of understanding a company's users. This 'solutionism' is widespread among data tools: https://www.linkedin.com/posts/bertilhatt_the-potential-gap-...


Thank you for clarifying.

Reading some of your posts I think we agree more than disagree. A big difference from most new analytics tools you see today is that we don't want to provide a magic "solution" (which is bound to over-promise and under-deliver) but rather a generic tool to quickly define and try out different business categories on the data.

Followed you on LinkedIn for more in-depth takes.


It is likely to be cheaper and quicker to run a counterfactual test in the computer than in real life.

The question is how reliable it is.


> As a result, only a small number of business decisions today rely on A/B tests.

The default for all code changes at Netflix is they’re A/B tested.


An expensive test is better than an expensive mistake :) At the scale of hundreds of decisions made with the inherent biases of the product/biz/ops teams, that directional misalignment can be catastrophic.


You can apply it to estimate the impact of any business decision if you have data, so it's not only IT companies that can benefit from it. However, the problem arises when the results don't align with the business's expectations. I have firsthand experience with projects being abandoned simply because the results didn't meet expectations.


For a hands-on introduction to causality, I would recommend "Causal Inference in Python" by M. Facure (https://amzn.to/46byWnl). Well written and to the point.

<ShamelessSelfPromotion> I also have a series of blog posts on the topic: https://github.com/DataForScience/Causality where I work through Pearl's Primer: https://amzn.to/3gsFlkO </ShamelessSelfPromotion>


The Facure text is good, can confirm


Thanks for the recommendation


Thank you for sharing


For what it's worth, my undergraduate degree was in Economics with an emphasis in econometrics, and this article touched on probably 80% of the curriculum.

The only problem is by the time I graduated I was somewhat disillusioned with most causal inference methods. It takes a perfect storm natural experiment to get any good results. Plus every 5 years a paper comes out that refutes all previous papers that use whatever method was in vogue at the time.

This article makes me want to get back into this type of thinking though. It’s refreshing after years of reading hand-wavy deep learning papers where SOTA is king and most theoretical thinking seems to occur post hoc, the day of the submission deadline.


Yeah, the only common theme I see in causal inference research is that every method and analysis eventually succumbs to a more thorough analysis that uncovers serious issues in the assumptions.

Take, for instance, the running example of Catholic schooling's effect on test scores used by the book Counterfactuals and Causal Inference. Subsequent chapters re-treat this example with increasingly sophisticated techniques and more complex assumptions about causal mechanisms, and each time they uncover a flaw in the analysis done with the techniques from previous chapters.

My lesson from this: the outcome of causal inference is very dependent on assumptions and methodologies, of which the options are many. This is a great setting for publishing new research, but it's the opposite of what you want in an industry setting, where the bias is (or should be) towards methods that are relatively quick to test, validate, and put in production.

I see researchers in large tech companies pushing for causal methodologies, but I'm not convinced they're doing anything particularly useful, since I have yet to see convincing validation on production data showing their methods are better than simpler alternatives, which tend to be more robust.


> My lesson from this: the outcome of causal inference is very dependent on assumptions and methodologies, of which the options are many.

This seems like a natural feature of any sensitive method, not sure why this is something to complain about. If you want your model to always give the answer you expected you don't actually have to bother collecting data in the first place, just write the analysis the way pundits do.


> This seems like a natural feature of any sensitive method, not sure why this is something to complain about.

I am complaining exactly that it is sensitive. If there are robust alternatives, why would I put this in prod?


Because you care about accuracy?


Because with real-world data, like what you have in production in tech, there are so many factors to account for. Brittle methods are more susceptible to unexpected changes in the data, or to unexpected ways in which complex assumptions about the data fail.


But really, how accurate are your results if they depend on strong assumptions about your data?


Just use propensity scores + IPW and you have the same thing as an RCT. :)


In my experience, propensity scores + IPW really don't get you far in practice. Propensity scoring models rarely balance all the covariates well (more often, one or two are marginally better and some may be worse than before). On top of that, IPW either assumes you don't have any cases of extreme imbalance, or, if you do, you end up trimming weights to avoid adding additional variance, and in some cases you still add it even with trimmed weights.
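(For readers who haven't seen the terms: a minimal sketch of propensity scoring plus inverse probability weighting on simulated data. The single confounder, the effect size, the clipping thresholds, and the use of sklearn's LogisticRegression are all illustrative, and the sketch quietly assumes every confounder is observed, which is exactly the assumption questioned above.)

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 20_000

    x = rng.normal(size=n)                        # observed confounder
    t = rng.binomial(1, 1 / (1 + np.exp(-x)))     # treatment depends on x
    y = 2.0 * t + 3.0 * x + rng.normal(size=n)    # true treatment effect = 2

    # Naive comparison is biased because x drives both t and y.
    print("naive diff:", y[t == 1].mean() - y[t == 0].mean())

    # Propensity model + inverse probability weighting (Horvitz-Thompson form).
    X = x.reshape(-1, 1)
    e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
    e = np.clip(e, 0.01, 0.99)                    # crude weight trimming
    ate = np.mean(t * y / e - (1 - t) * y / (1 - e))
    print("IPW ATE estimate:", ate)               # should land near 2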


not necessarily unless you skim over meaningful confounding factors :)


In corporate and medical data science, people are beginning to accept causal inference. It is difficult, as the subject is still in flux and under development.

I am aware of three reputable causal inference frameworks:

1. Judea Pearl's framework, which dominates in CS and AI circles

2. Neyman-Rubin causal model: https://en.wikipedia.org/wiki/Rubin_causal_model

3. Structural equation modelling: https://en.wikipedia.org/wiki/Structural_equation_modeling

None of them would acknowledge each other, but I believe the underlying methodology is the same/similar. :-)

It's good to see that it is becoming more accepted, especially in Medicine, as it will give more, potentially life-saving, information to make decisions.

In Social Sciences, on the other hand, causal inference is being completely willfully ignored. Why? Causal inference is an obstacle to reaching preconceived conclusions based on pure correlations: something correlates with something, therefore ... invest large sums of money, change laws in our favor, etc. This works for both sides. Sadly, I don't think this can be fixed.


> In Social Sciences, on the other hand, causal inference is being completely willfully ignored. Why? Causal inference is an obstacle to reaching preconceived conclusions based on pure correlations: something correlates with something, therefore ... invest large sums of money, change laws in our favor, etc. This works for both sides. Sadly, I don't think this can be fixed.

This remark is totally ignorant of the reality in the social sciences. Certainly in economics (which I know well) this hasn't described the reality of empirical work for more than 30 years. Political Science and Sociology are increasingly concerned with causal methods as well.

Medicine on the other hand is the opposite. Medical journals generally publish correlations when they aren't publishing experiments.


> In Social Sciences, on the other hand, causal inference is being completely willfully ignored.

This conflicts with what the article says:

> Economists and social scientists were among the first to recognize the advantages of these emerging causal inference techniques and incorporated in their research.


Economist here. Causal inference is more alive than ever, in Economics at least. A publication in a top applied journal practically has to use causal methods.

The DiD (difference-in-differences) literature, for instance, has been expanding at the speed of light -- it has never been so hard to keep up as it is now.


Social sciences haven't ignored causal inference. Perhaps it's not everywhere you'd like to see it, but it's common in quant papers, it's the backbone of econometrics, and you'd probably have trouble finding a single top-ranked PhD program which doesn't provide at least cursory coverage of the methods.


Pearl's framework isn't really distinct from SEM as I understand it. SEM is really just one tool to achieve the sort of adjustments that Pearl describes to make causal inferences from observational data.


Social scientist here. It has been thriving under the name Qualitative Comparative Analysis for a quarter of a century. This is a good paper for more on the epistemological foundations: https://doi.org/10.1177/1098214016673902


An important topic. Today most tech companies worship A/B experiments as the main way of being data-driven and bringing causality into decision-making. It deserves to be the gold standard.

However, experiments are usually expensive: they require investing in building the feature in question and then collecting data for 1-4 weeks before being certain of the effects (plus there are long-term effects to worry about). Some companies report that fewer than 50% of their experiments prove truly impactful (my experience as well). That's why only a small number of business decisions are made using experiments today.

Observational causal inference offers another approach, trading off full confidence in causality for speed and cost. It has been pretty hard to run correctly so far, so it is not widely adopted. We are working on changing that with Motif Analytics and wrote a post with an in-depth exploration of the problem: https://www.motifanalytics.com/blog/bringing-more-causality-... .


Interestingly, recent research suggests that you can make better decisions by combining experimental and observational data than by using either alone:

https://ftp.cs.ucla.edu/pub/stat_ser/r513.pdf

> Abstract: Personalized decision making targets the behavior of a specific individual, while population-based decision making concerns a sub-population resembling that individual. This paper clarifies the distinction between the two and explains why the former leads to more informed decisions. We further show that by combining experimental and observational studies we can obtain valuable information about individual behavior and, consequently, improve decisions over those obtained from experimental studies alone.


I 100% agree with this blind spot. Most data science coursework avoids the very thing that makes it a science: the explanation of what change causes what effect. I've been surprised that year after year, programs at so many "Schools of Data Science" keep glossing over this area, perhaps alluding to it in an early stats course if at all.

It's an important part of validating that your data-driven output or decision is actually creating the change you hope for. So many fields either do poor experimentation or none at all; others are prevented from doing the usual "full unrestricted RCT": med and fin svcs and other regulated industries have legal constraints on what they can experiment with; in other cases, data privacy restricts the measures one can take.

I've had many data folks throw up their hands if they can't do a full RCT, and instead turn to pre-post comparisons with lots of methodological errors. You can guess how those projects end up. (No, not every change needs a full test, and some things are easy to roll back. But think of how many others would have benefitted from some uncertainty reduction.)

Sure, "LLM everything" and "just gbm it!" and "ok, just need a new feature table and I'm done!" are all important and fun parts of a data science day. But if I can't show that a data driven decision or output makes things better, then it's just noise.

Causal modeling gets us there. It improves the impact of ml models that recognize the power of causal interventions, and it gives us evidence that we are helping (or harming).

It's (IMO) necessary, but of course, not sufficient. Lots of other great things are done by ML eng and data scientists and data eng and the rest, having nothing to do with causal inference... But I keep thinking how much better things get when we apply a causal lens to our work.

(And next on my list would be having more data folks understanding slowly changing dimension tables, but this can wait for another time).


I realize this is nitpicking a minor point in your comment, but I don't agree with your characterization of RCTs in medical research as being primarily constrained by laws and regulations. Any time I've discussed research on human subjects with doctors doing that research, the discussion of what is and is not an acceptable experiment has always been primarily driven by the risks of harm to the people involved in the study. Any time the law comes up, it's usually because the law requires an RCT in a specific setting, as opposed to preventing it (e.g. drug trials). (Of course in the setting of starting a company based on some medical product, the situation may be quite different.)

Biologists, if not data scientists, are used to considering indirect evidence for causality. It's why we sometimes accept studies performed in other organisms as evidence for biology in humans; it's why we sometimes accept research performed on post-mortem human tissue as being representative of the biology of living humans; to name but a few examples. A big part of a compelling high-impact biology (or bioinformatics) paper is often the innovative ways that one comes up with to show causality when a direct RCT is not feasible, and papers are frequently rejected because they don't do the follow-up experiments required to show causality.


That's a very fair point. I didn't mean to suggest that harm to the patients or subjects was not the overriding factor, nor that bio, pharma, and other medical fields never do RCTs.

But there are a slew of laws and requirements around _how_ to run an RCT across the world of bio-related work, esp as it becomes a product. From marketing to manufacture to packaging, there are strict limits around where variation is allowed, at least anything involving the FDA in the US. (Some would say too many regs, others say not enough).

And in those cases, having a wider collection of ways to impute cause would be great.


Yes, that's true, legal requirements definitely become much more of a factor the closer you get from research to product.


I've self-learned for a long time in the causal inference space, and model evaluation is a concern for me. My biggest concern is falsification of hypotheses. In ML, you have a clear mechanism to check estimation/prediction through holdout approaches. In classical statistics, you have model metrics that can be used to define reasonable rejection regions for hypothesis tests. But causal inference doesn't seem to have this, outside traditional model fit metrics or ML holdout assessment? So the only way a model is deemed acceptable is by prior biases?

If my understanding is right, this means that each model has to be hand-crafted, adding significant technical debt to complex systems, and we can't get ahead of the assessment. And yet, it's probably the only way forward for viable AI governance.


> In ML, you have a clear mechanism to check estimation/prediction through holdout approaches.

To be clear, you can overfit even while your validation loss looks fine. If your train and test data are too similar then no holdout will help you measure generalization. You have to remember that datasets are proxies for the thing you're actually trying to model; they are not the thing you are modeling themselves. You can usually see this when testing on in-class but out-of-train/test-distribution data (e.g. data from someone else).

You have to be careful because there are a lot of small and non-obvious things that can fuck up statistics. There are a lot of aggregation "paradoxes" (Simpson's, Berkson's), and all kinds of things that can creep in. This is more perilous the bigger your model, too. The story of the Monty Hall problem is a great example of how easy it is to get the wrong answer while it seems like you're doing all the right steps.

For the article, the author is far too handwavy with causal inference. The reason we tend not to do it is because it is fucking hard and it scales poorly. Models like autoregressive models (careful here) and normalizing flows can do causal inference (and causal discovery) fwiw (essentially you need explicit density models with tractable densities: referring to Goodfellow's taxonomy). But things get funky as you get a lot of variables, because there are indistinguishable causal graphs (see Hyvärinen and Pajunen). Then there are also the issues with the types of causality (see Judea Pearl's ladder), and counterfactual inference is FUCKING HARD, but the author just acts like it's no big deal. Then he starts conflating it with weaker forms of causal inference. Correlation is the weakest form of causal evidence, despite our often-chanted saying that "correlation does not equal causation" (which is still true; correlation is just in the class, and the saying is more getting at confounding variables). This very much does not scale. Similarly, discovery won't scale, as you have to permute so many variables in the graph. The curse of dimensionality hits causal analysis HARD.


To be clear, the mechanism for checking ML doesn't really check ML. There's really little value in a confidence interval conditional on the same experimental conditions that produced the dataset on which the model is trained. I'd often say it's actively harmful, since it's mostly misleading.

Insofar as causal inference has no such 'check', it's because there never was any. Causal inference is about dispelling that illusion.


> Insofar as causal inference has no such 'check', it's because there never was any. Causal inference is about dispelling that illusion.

Aye, and that's the issue I'm trying to understand. How to know if model 1 or model 2 is more "real" or, for lack of a better term, more useful and reflective of reality?

We can focus on a particular philosophical point, like parsimony / Occam's razor, but as far as I can tell that isn't always sufficient.

There should be some way to determine a model's likelihood of structure beyond "trust me, it works!" If there is, I'm trying to understand it!


> How to know if model 1 or model 2 is more "real" or, for my lack of a better term, more useful and reflective of reality?

I just want to second MJ's points here. You have to remember that 1) all models are wrong and 2) it's models all the way down. Your data is a model: it models the real-world distribution, what we might call the target distribution, which is likely intractable and often very different from your data under various conditions. Your metrics are models: obviously given the previous point, but also less obviously in that even with perfect data they are still models. Your metrics all have limitations, and you must be careful to clearly understand what they are measuring, rather than what you think they are measuring. This is an issue of alignment, and the vast majority of people do not consider precisely what their metrics mean and instead rely on the general consensus (great ML example: FID does not measure fidelity, it is a distance measurement between distributions. But you shouldn't stop there, that's the start). These get especially fuzzy in higher dimensions where geometries are highly non-intuitive. It is best to remember that metrics are guides and not targets (Goodhart).

> There should be some way to determine a model's likelihood of structure beyond "trust me, it works!" If there is, I'm trying to understand it!

I mean, we can use likelihood ;) if we model density, of course. But that's not the likelihood that your model is the correct model; it is the likelihood that, given the data you have, your model's parameterization can reasonably model the sampling distribution of the data. These are subtly different, and the difference comes from the point above. And then we gotta know if you're actually operating in the right number of dimensions. Are you approximating PCA like a typical VAE? Is the bottleneck enough for proper parameterization? Is your data in sufficient dimensionality? Does the fucking manifold hypothesis even hold for your data? What about the distribution assumption? IID? And don't get me started on indistinguishability in large causal graphs (references in another comment).

So rather, in practice it is just best to try to make a model that is robust to your data but always maintain suspicion of it. After all, all models are wrong, and you're trying to model the thing behind the data, not just have a model of the data.

Evaluation is fucking hard (it is far too easy to make mistakes)


I love this comment to bits. Thanks, from a fellow applied researcher embedded in the tech world.


I always love to find/know there are others in the tech world that care about the nuance around evaluation math and not just benchmarks. Often it feels like I'm alone. So thank you!


Would love to connect. Feel free to reach out, msg box address available if a bit mangled in my profile.


In general, you can't, and most of reality isn't knowable. That's a problem with reality, and with us.

I'd take a bayesian approach across an ensemble of models based on the risk of each being right/wrong.

Consider whether Drug A causes or cures cancer. If there's some circumstantial evidence of it causing cancer at rate X in population Y with risk factors Z -- and otherwise broad circumstantial evidence of it curing at rate A in pop B with features C...

then what? Then create various scenarios under these (likely contradictory) assumptions. Formulate an appropriate risk. Derive some implied policies.

This is the reality of how almost all actual decisions are made in life, and necessarily so.

The real danger is when ML is used to replace that, and you end up with extremely fragile systems that automate actions of unknown risk -- on the basis that they were "99.99% accurate", i.e., considered uncontrolled experimental condition E1 and not E2...10_0000, which actually occur.


> How to know if model 1 or model 2 is more "real" or, for lack of a better term, more useful and reflective of reality?

You don't. Given observational data alone, it's typically only possible to determine which d-separation equivalence class you're in. Identifying the exact causal structure requires intervening experimentally.

> There should be some way to determine a model's likelihood of structure

Why? If the information isn't there, it isn't there. No technique can change that.
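(A toy illustration of the equivalence-class point, on simulated data: the chains A -> B -> C and C -> B -> A imply exactly the same observable independence, A independent of C given B, so no amount of purely observational data picks between them. The linear-Gaussian setup and the partial-correlation check are my own simplifications.)

    import numpy as np

    rng = np.random.default_rng(2)
    n = 50_000

    def partial_corr(a, c, b):
        """Correlation between a and c after regressing b out of both."""
        ra = a - np.polyval(np.polyfit(b, a, 1), b)
        rc = c - np.polyval(np.polyfit(b, c, 1), b)
        return np.corrcoef(ra, rc)[0, 1]

    # Graph 1: A -> B -> C
    a1 = rng.normal(size=n)
    b1 = a1 + rng.normal(size=n)
    c1 = b1 + rng.normal(size=n)

    # Graph 2: C -> B -> A (all arrows reversed)
    c2 = rng.normal(size=n)
    b2 = c2 + rng.normal(size=n)
    a2 = b2 + rng.normal(size=n)

    # Both graphs: A and C are clearly correlated, yet independent given B.
    print(np.corrcoef(a1, c1)[0, 1], partial_corr(a1, c1, b1))   # ~0.58, ~0
    print(np.corrcoef(a2, c2)[0, 1], partial_corr(a2, c2, b2))   # ~0.58, ~0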


Acyclic structure on variables is a very strong presupposition that, honestly, does not describe many systems in engineering well, so I don't like this idea of boiling causality down solely to DAG-dependent phrases like "d-separation" or "exact causal structure". Exact causal structure, a.k.a. actual causality, is particular to one experimental run on one intervention.


D-separation still works for cyclic graphs, it just can't rule out causal relationships between variables that lie on the same cycle. And neither can any other functional-form-agnostic method, because in general feedback loops really do couple everything to everything else.

More rigorously: given a graph G for a structural equation model S, construct a DAG G' as follows

- Find a minimal subgraph C_i transitively closed under cycle membership (so a cycle, all the cycles it intersects, all the cycles they intersect, and so on)

- Replace each C_i with a complete graph C'_i on the same number of vertices, preserving outgoing edges.

- Add edges from the parents of any vertices in C_i (if not in C_i themselves) to all vertices in C'_i

- Repeat until acyclic

d-separation in G' then entails independence in S given reasonable smoothness assumptions I don't remember the details of off the top of my head.


There is, though. You run two linear models and you get numbers back that tell you how well these different models fit the data.


This isn't a quality of fit issue (and even if it were, linear models are not always sufficient). The problem is that different causal structures can entail the same set of correlations, which makes them impossible to distinguish through observation alone.


Grandparent commenter here -- I'm glad I sufficiently communicated my concern; I feel like you and mjburgess have nailed it. Fit metrics alone aren't sufficient to determine appropriate model use (even ignoring the issues of p-hacking and other ills).


Such errors can be estimated with other models

https://pubmed.ncbi.nlm.nih.gov/22364439/


I would argue it's more a blind spot of big data, which tends to tacitly imply just doing correlational studies on data that happens to be lying around.

Most data scientists work for companies that don't really want to pay for controlled experiments outside of maybe letting the UI team do A/B tests. Natural experiments can be hard to come by in a business setting. And all of the wild mathematical gyrations that econometricians and political scientists have developed to try to do causal inference from correlational data have a tendency not to be as popular in business because, outside of some special domains such as politics and consumer finance, it can be rather difficult to get away with dressing your emperor in math that nobody can understand instead of actual clothing.


Exactly. This is the primary difference between observational and experimental studies (controlled experiments). Experimental studies control for the hypothesized mechanism as part of the experimental design, but observational studies do not or often cannot. Good data from controlled experiments is difficult, costly, and time-consuming to generate in general, and that often does not mesh with the notion of "big data". I think we are running into this problem more and more as we discover that our data sets really are superficial --- collections of a lot of data that is easy to collect, rather than a representative sample of everything (especially in a controlled manner). Good data isn't cheap.


The Atlantic/American Causal Inference Conference (ACIC) hosts a data challenge every year, I think. Useful to see many different methods compared on simulated data.

Does anyone know similar challenges/competitions?

ACIC links to years I could find:

- 2016: https://arxiv.org/abs/1707.02641

- 2017: https://arxiv.org/abs/1905.09515

- 2019: https://sites.google.com/view/acic2019datachallenge/data-cha...

- 2022: https://acic2022.mathematica.org/results

- 2023: https://sci-info.org/data-competition/


I’ve tried to understand causal inference several times and failed. Tutorials seem unnecessarily long winded. I wish authors would give simple, to the point examples.

Say I have a simple table of outdoor temperatures and ice cream sales.

What can the machinery of causal inference do for me in this situation?

If it doesn’t apply here, what do I need to add to my dataset to make it appropriate for causal inference? More columns of data? Explicit assumptions?

If I can use causal inference, what can it tell me? If I think of it as a function CA(data), can it tell me if the relationship is actually causal? Can it tell me the direction of the relationship? If there were more columns, could it return a graph of causal relationships and their strength? Or do I need to provide that graph to this function?

I know a wet pavement can be caused by rain or spilled water or that an alarm can go off due to an earthquake or a burglary. I have common sense. I also understand the basics of graph traversal from comp sci classes.

How do I practically use causal inference?

To the authors of future articles on this (or any technical tutorial), please explain the essence, the easy path, then the caveats and corner cases. Only then will abstract philosophizing make sense.


> Say I have a simple table of outdoor temperatures and ice cream sales. What can the machinery of causal inference do for me in this situation?

Not much. Causal inference works over networks of variables, specifically a DAG. But usually you know more than one variable association, so this is more an issue of pedagogy than the tool itself.

Probably the shortest, most persuasive example I can give you is a logical resolution to Simpson's Paradox: when the correlation between two variables can change depending on whether you consider a third variable or not.

The classic example is gender discrimination in college admissions. When looking at admissions rates across the entire university, women are less likely to be accepted than men. But when (in this example) you break that down into departments, every department favors women over men. This is a paradoxical contradiction, and worrying in that your science is only as good as the dimensions your data captures. Worse, the data offers no clean way to say which is the correct answer: the aggregate numbers or the per-department ones. Statisticians stumbled for a long while on this, and it's kind of wild that we were able to declare that smoking causes cancer without a resolution to this.

Pearl wrote a paper on how bayesian approaches resolve the paradox[1], but it does presume familiarity with terms like "colliders," "backdoor criterion" and "do-calculus." His main point is that causal inference techniques give us the language and tools to resolve the paradox that frequentist approaches do not.

[1]: https://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf


When looking at admissions rates across the entire university, women are less likely to be accepted than men. But when (in this example) you break that down into departments, every department favors women over men.

If every department favored women then the entire university would also favor women. Parity is guaranteed in that scenario. What happened in the Berkeley case is that not every department favored women, and women applied disproportionately to the departments with lower admissions rates (including some that didn't favor them), while men did the opposite.


Yes, apologies, what I meant by "favored" was that in every department, women applicants were more likely to get an admission than men. But I'm pretty sure the admission rate can still be lower for women overall than men overall, using exactly the same scenario you described. If the sociology department admits 10 percent of applicants and the physics department admits 90, it seems very easy for gender bias in applications to shift women towards 10 and men towards 90, even if the rate is a few percent higher for women.
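(Using roughly the numbers above, all of them hypothetical, the reversal is easy to reproduce:)

    # Within each department women have the higher admission rate, yet the
    # aggregate rate favors men, because women apply mostly to the selective
    # department. Counts below are invented for illustration.
    #
    #                applicants  admitted   rate
    # Physics   men        900       810    90%
    #           women      100        95    95%
    # Sociology men        100         8     8%
    #           women      900        90    10%

    men_admitted   = 810 + 8      # 818 of 1000 applicants
    women_admitted =  95 + 90     # 185 of 1000 applicants
    print(men_admitted / 1000)    # 0.818
    print(women_admitted / 1000)  # 0.185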


I get your point now. You're quite right that you can construct scenarios that arbitrarily favor men in the aggregate but women in specific departments, given the right ratio of applicants.


> Or do I need to provide that graph to this function?

You need to do that, and the math can help you measure how much each arrow contributes. The idea that you need to provide your model of the world is strangely not a key part of most introductions, but it’s crucial.

> outdoor temperatures and ice cream sales

That's too simple: a simple regression can handle that. Causal inference can handle cases with three variables, assuming you provide an interaction graph. Say: your ice cream truck goes either to a fancy neighborhood or a working-class plaza. After observing the weather, you decide where to go, so you know that wealth and weather influence sales, but sales can't influence the other two. Assuming you have data for all cases (sunny/poor, sunny/rich, rainy/poor, rainy/rich), then you can separate the two effects.
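(A minimal sketch of that separation on simulated data, using the standard backdoor adjustment E[sales | do(location)] = sum over weather of P(weather) * E[sales | location, weather]. The numbers and the rule for choosing the neighborhood are made up.)

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    n = 100_000

    sunny = rng.binomial(1, 0.5, n)                        # weather
    # On sunny days the truck is much more likely to pick the rich neighborhood.
    rich = rng.binomial(1, np.where(sunny == 1, 0.8, 0.2))
    # Sales depend on both: sun adds 30, the rich neighborhood adds 10.
    sales = 50 + 30 * sunny + 10 * rich + rng.normal(0, 5, n)
    df = pd.DataFrame({"sunny": sunny, "rich": rich, "sales": sales})

    # Naive comparison of neighborhoods mixes in the weather effect.
    naive = df[df.rich == 1].sales.mean() - df[df.rich == 0].sales.mean()

    # Backdoor adjustment: compare within each weather stratum, then average
    # the per-stratum differences weighted by P(weather).
    adjusted = sum(
        df.sunny.eq(s).mean()
        * (df[(df.rich == 1) & (df.sunny == s)].sales.mean()
           - df[(df.rich == 0) & (df.sunny == s)].sales.mean())
        for s in (0, 1)
    )
    print(round(naive, 1), round(adjusted, 1))   # ~28 vs ~10 (the true effect)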


> > outdoor temperatures and ice cream sales

> That's too simple: a simple regression can handle that.

Not quite. Regression by itself will not answer the causal (or equivalently, the counterfactual) question.

I strongly suspect you already know this and was elaborating on a related point. But just for the sake of exposition, let me add a few words for the HN audience at large.

Let me give an example. In an email corpus, mails that begin with "Honey sweetheart," will likely have a higher than baseline open rate. A regression on word features will latch on to that. However, if your regular employer starts leading with "Honey sweetheart" that will not increase the open rate of corporate communications.

Causal or counterfactual estimation is fundamentally about how a dependent variable responds to interventional changes in a causal variable. Regression and relatedly, conditional probabilities are about 'filtering' the population on some predicate.

An email corpus when filtered upon the opening phrase "Honey sweetheart" may have disproportionately high email open rates, but that does not mean that adding or adopting such a leading phrase will increase the open rate.

Similarly, regressing dark hair as a feature against skin cancer propensity will catch an anti-correlation effect. Dyeing blonde hair dark will not reduce melanoma propensity.


Your model needs to introduce a third piece of information: whether an email is a corporate communication—or a deliberate intervention.


My understanding (that might be out of date) is that the tools are weak. Ideally you would have tabular data and it would give you a digraph for the causal structure between variables. You can try this but the tools don't work reliably yet. Otherwise everyone would use them.


Agreed. Afaict, in practice, you set up your own causal graphs and test them. This seems very academic, very 1950s.

Interestingly, folks are finally doing more realistic experiments in the causal equivalent of architecture search, and genAI is giving these efforts a second wind. It still feels like it's at the toy stage, or for academics & researchers with a lot of time on their hands, vs relevant for most data scientists.

I'm still on the sidelines, but I keep checking in, in case it's finally practical for our users.


Same here, I check in every year or so because it would be fantastic to have.


> Say I have a simple table of outdoor temperatures and ice cream sales.

You have more than that! You have knowledge about the world!

> What can the machinery of causal inference do for me in this situation?

Well, (I'm being purposefully pedantic here) you haven't really asked a question yet. The first thing it can do is help you while you're formulating one. It can answer questions like, "how will the things I have and haven't measured affect the estimates I'm interested in making?"

> If it doesn’t apply here, what do I need to add to my dataset to make it appropriate for causal inference? More columns of data? Explicit assumptions?

The first thing you need to do is articulate what you’re actually interested in. Then you need to be explicit about the causal relationships between things relevant to those questions. The big thing (to me) is that particular causal structures have testable conditional independence structures and by assessing these, you can build evidence for or against particular diagrams of the context.


Judea Pearl's The Book of Why gives more practical and easy-to-understand examples; I recommend that.


It's pretty simple. You cannot infer causality from observational data. No matter how sophisticated your statistical tools are.

You need to perform properly controlled experiments to infer causality. And even then it's hard.

Inferring causality from observational data is cargo cult science.


TL;DR: Causal inference is a complex topic, not a simple tool.

How's the ice cream example better than the sugary snacks example given in the article?

Here's the part about needing to add more columns to the data:

> When dealing with a causal question, it’s crucial to include variables known as confounders. These are variables that can influence both the treatment and the outcome. By including confounding variables, we can better isolate and estimate the true causal effect of the treatment. Failing to add or account for confounding variables may lead to incorrect estimates.


> How's the ice cream example better than the sugary snacks example given in the article?

Not the OP, but because that fails to explain how the basic hypothetical example works(!)

You want to know how much your sales would be in a parallel world where kids were stuck with bland snacks compared to your sweet treats. This is where causal inference steps in to provide the solution. (nice graph follows)

So how is that done?


> TL;DR: Causal inference is a complex topic, not a simple tool.

The simple version using graphical models and joint probabilities isn't difficult to explain or teach. The issue is that to do anything useful with it at scale you need either MCMC or variational inference, and that's an entirely different can of worms altogether. For medical datasets you rarely have "scale"; instead you have very few sample cases and a large expert model (the doctor/specialist).


There is an excellent video on YouTube by MIT Prof. Sontag on causal inference worth checking out [1]

And if you like it, 2nd part is here [2]

[1] https://youtu.be/gRkUhg9Wb-I?si=6oMUgdjia_4g6-DR

[2] https://www.youtube.com/watch?v=g5v-NvNoJQQ


> The DoubleML method is founded on machine learning modeling and consists of two key steps. First, we build a model that predicts the treatment variable based on the input variables. Then, we create a separate model that predicts the outcome variable using the same set of input variables. Subsequently, we calculate the residuals from the former model and regress them against the residuals from the latter model. An important feature of this method is its flexibility in accommodating non-linear models, which allows us to capture non-linear relationships — a distinctive advantage of this approach.

Just... don't do this. You're not going to be able to math your way to better conclusions. Make your model, make your plots, and use critical thinking to evaluate your results.
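(For readers who haven't seen it: a bare-bones sketch of the residual-on-residual step the quoted passage describes, on simulated data. A real double/debiased ML setup would also use cross-fitting, which is omitted here, and the gradient-boosting models are just stand-ins for whatever ML model you prefer.)

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(4)
    n = 5_000

    x = rng.normal(size=(n, 3))                   # observed confounders
    t = x[:, 0] ** 2 + rng.normal(size=n)         # treatment driven by x, non-linearly
    y = 1.5 * t + np.sin(x[:, 0]) + x[:, 1] + rng.normal(size=n)   # true effect = 1.5

    # Step 1: predict the treatment and the outcome from the confounders.
    t_hat = GradientBoostingRegressor().fit(x, t).predict(x)
    y_hat = GradientBoostingRegressor().fit(x, y).predict(x)

    # Step 2: regress the outcome residuals on the treatment residuals.
    t_res = (t - t_hat).reshape(-1, 1)
    y_res = y - y_hat
    effect = LinearRegression().fit(t_res, y_res).coef_[0]
    print(effect)                                 # should land near 1.5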


Not sure what point you are trying to make here. Double ML is a valid approach for debiasing confounding effects.


I disagree. It's vulnerable to all sorts of mishaps. You're now having to worry about data leakage between your treatment group AND your target variable. Causal inference without experimental data is all just a mathematical exercise to make a one-size-fits-all approach to identifying relationships. Yes, correlation has weaknesses. But the name "causal inference" is grossly misleading. It's "well, if we assume X, Y, and Z, then the effect which we have already assumed is causal is probably around this order of magnitude". And hey, maybe that will help you identify cases where a confounding variable is actually the thing that matters. But you're not going to do better than just doing an analysis of the variables and their interactions. You don't have the brainpower to do this at any scale larger than the one where pretty much all causal methods begin to fail. It does not offer you the legitimacy the name implies.

I think it confuses far more than it helps.


Contrasting frequentist statistics and causal inference, and saying the latter often goes beyond the former, makes for a bizarre opening. It's like saying apples have nutritional value, unlike soccer balls. It's like saying trigonometry often goes beyond the scope of calculus.


I think in practice most of these techniques are useless (or worse, confer a false sense of precision) because they require so many nuanced judgement calls that they become little more than a way to launder biases.


A co-worker pointed me to this e-book, which I thought did a great job of presenting the concepts in a relatable and applied way:

https://matheusfacure.github.io/python-causality-handbook/la...

But I agree with other comments here: at the end of the day it seems like causal analysis often boils down to whether you trust the analyst and/or their techniques, since it is hard to validate the results.


We built some causal discovery and inference features with graph visualization in Kanaries RATH: https://docs.kanaries.net/rath/discover-causals/causal-analy...

It's also open source. You're welcome to give it a try.


One gripe with this article: a regression coefficient doesn't provide the ATE under most circumstances using observational data.


It's not that hard. If the causality cannot make sense logically or plausibly, then you can reasonably reject it. No reasonable person would ever get the umbrellas/puddles thing confused.


> It's not that hard. If the causality cannot make sense logically or plausibly, then you can reasonably reject it. No reasonable person would ever get the umbrellas/puddles thing confused.

It's an illustrative example; taking it literally is missing the point. The reason you know it doesn't make sense for umbrellas to cause rain is that you already have an applicable causal model. The situations where you need to do causal inference are exactly those where you don't, and can't just rely on "reasonableness" or "plausibility".


This is not true. These causal methods generally require you to have a pre-established framework for how the thing works. If you cannot supply additional variables that you, with your domain knowledge, know cover the confounding elements, it won't do anything.

It's mathematical soup for trying to normalize out the effects of other variables to see what remains, and calling it "causal".


You should open an epidemiological journal these days. Half the papers are either as bad as "umbrellas cause puddles" or obviously confounded with socio-economic status.



