
Causal Analytics - bmahmood
https://blog.clearbrain.com/posts/introducing-causal-analytics
======
otterk10
Scott here from ClearBrain - the ML engineer who built the underlying model
behind our causal analytics platform.

We’re really excited to release this feature after months of R&D. Many of our
customers want to understand the causal impact of their products, but are
unable to iterate quickly enough running A/B tests. Rather than taking the
easy path and serving correlation-based insights, we took the harder approach
of automating causal inference through what's known as an observational study,
which can simulate A/B experiments on historical data and eliminate spurious
effects. This involved a mix of linear regression, PCA, and large-scale custom
Spark infra. Happy to share more about what we did behind the scenes!

~~~
bertil
I’ve noticed two questions on Twitter:

- Do you use a causal graph? Would it make sense?

- Spark seems overkill for what you yourself describe as regression: is
there something more intensive here that we could be missing?

~~~
otterk10
Our analysis runs over our users’ customer data (usually collected through
either a tag manager or a CDP such as Segment), which is a few petabytes of
data for some of our larger customers. The reason for using Spark is to
quickly transform this massive amount of raw data into a ML-ready format.
You’re correct that the regression itself does not need to be done inside of
Spark.

------
cuchoi
Very exciting to see causal theory being productionized!

From the article, this seems like a normal regression to me. Would be
interesting to know what makes it causal (or at least better) compared to an
OLS. PCA has been used for a long time to select the features to use in
regression. Would it be accurate to say that the innovation is on how the
regression is calculated rather than the statistical methodology?

Either way, it would be interesting to test this approach against an A/B test
and check how much an observational study differs from the A/B estimates, and
how sensitive this approach is to including (or not) a given set of features.
It would also be interesting to compare it to other quasi-experimental
methodologies, such as propensity score matching.

Is there a more extended document explaining the approach?

Good luck!

~~~
otterk10
Yes, you're correct that the underlying algorithm used is very close to OLS.
What allows the regression to provide an estimate of average treatment
effects is how it is structured - namely, adding in pre-treatment confounders
as well as interactions between the treatment and those confounders. I found
that this chapter
([http://www.stat.columbia.edu/~gelman/arm/chap9.pdf](http://www.stat.columbia.edu/~gelman/arm/chap9.pdf))
on causal inference does a good job of outlining the approach.
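
A toy numpy sketch of that structure (made-up data and coefficients, not ClearBrain's production model): regress the outcome on the treatment, a pre-treatment confounder, and the treatment x centered-confounder interaction, then average the predicted treated-minus-untreated difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Toy data: one observed confounder x drives both treatment take-up and the outcome.
x = rng.normal(size=n)
t = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(float)
y = 1.0 + 2.0 * t + 1.5 * x + 0.5 * t * x + rng.normal(scale=0.5, size=n)

# Design: intercept, treatment, confounder, treatment x (centered) confounder.
xc = x - x.mean()
D = np.column_stack([np.ones(n), t, x, t * xc])
beta, *_ = np.linalg.lstsq(D, y, rcond=None)

# Average treatment effect: mean predicted difference between "everyone
# treated" and "everyone untreated".
D1 = np.column_stack([np.ones(n), np.ones(n), x, xc])
D0 = np.column_stack([np.ones(n), np.zeros(n), x, np.zeros(n)])
ate = float(np.mean(D1 @ beta - D0 @ beta))

# For contrast: the naive difference in means is badly biased, because x
# pushes users toward treatment and toward higher outcomes at the same time.
naive = float(y[t == 1].mean() - y[t == 0].mean())
```

In this simulation the true effect is 2; the regression recovers it, while the naive difference in means overshoots substantially.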

Yes, we actually explored other approaches such as PSM. The main reason we did
not initially go with PSM was the compute power required - you would need to
train a model for each treatment variable. However, we're actually in the
midst of developing a way to train a model per treatment variable
efficiently, which will allow us to add techniques such as inverse
propensity weighting (or explore other approaches such as PSM).
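
For the curious, the inverse propensity weighting idea itself fits in a few lines of numpy - a toy sketch with a single observed confounder (made-up data; Newton-Raphson for the propensity model, Hajek-normalized weights):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Toy data: observed confounder x drives both treatment and outcome.
x = rng.normal(size=n)
t = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(float)
y = 1.0 + 2.0 * t + 1.5 * x + rng.normal(scale=0.5, size=n)

# Fit a logistic propensity model P(t=1 | x) by Newton-Raphson.
X = np.column_stack([np.ones(n), x])
w = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ w))
    grad = X.T @ (t - p)                              # log-likelihood gradient
    hess = -(X * (p * (1 - p))[:, None]).T @ X        # log-likelihood Hessian
    w -= np.linalg.solve(hess, grad)

p = np.clip(1 / (1 + np.exp(-X @ w)), 1e-3, 1 - 1e-3)

# Hajek (normalized) IPW estimate of the average treatment effect.
w1 = t / p
w0 = (1 - t) / (1 - p)
ate = float((w1 @ y) / w1.sum() - (w0 @ y) / w0.sum())

# Unweighted difference in means, for contrast: biased by the confounder.
naive = float(y[t == 1].mean() - y[t == 0].mean())
```

The true effect in this simulation is 2; reweighting by the estimated propensities removes most of the confounding that the naive comparison suffers from.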

~~~
lern_too_spel
This approach only works if all confounders are known, which is never the case
in practice, so the model you fit is correlational and not suitable for causal
inference. Propensity matching suffers from the same issue if the propensities
are estimated from the same features. If not all confounders are known, you
must be able to find instrumental variables to build a causal model.
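
A toy numpy illustration of the instrumental-variables point (made-up data): OLS is biased by an unobserved confounder, while the Wald/IV ratio recovers the true effect because the instrument moves the treatment but not the outcome directly:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

u = rng.normal(size=n)               # unobserved confounder
z = rng.normal(size=n)               # instrument: affects t, never y directly
t = z + u + rng.normal(size=n)       # treatment driven by both
y = 2.0 * t + 3.0 * u + rng.normal(size=n)

# OLS slope of y on t: biased, because u moves both t and y.
naive = float(np.cov(t, y)[0, 1] / t.var())

# Wald / IV estimate: cov(z, y) / cov(z, t) isolates the z-driven
# variation in t, which is clean of the confounder.
iv = float(np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1])
```

With a true effect of 2, the OLS slope lands near 3 here while the IV ratio lands near 2.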

------
6gvONxR4sf7o
I only skimmed it, so forgive me if I got this wrong. The causal model used
here makes some incredibly strong (unlikely to be close enough to accurate)
assumptions. Are these results valid if there are unobserved confounders or
selection bias?

~~~
benmaraschino
Well, at the end of the day, you can never really be sure of strongly
ignorable treatment assignment (SITA)/unconfoundedness, no matter what problem
you’re working on. Especially if you’re an economist or an epidemiologist
working with data that’ve been collected by someone else - you can’t exactly
go back and measure more predictors of treatment assignment. But if you’re
running a website, there are a _lot_ of variables you can measure on the user
end and more opportunities to iterate, so SITA starts to look like a better
bet.

~~~
6gvONxR4sf7o
Maybe you run a different kind of website or ask different kinds of questions,
but despite being able to measure all kinds of things, there's so much at my
job that you need experimentation for. You do the observational study, and it
points in this direction. Sometimes it's true and sometimes it's not. Selling
this kind of observational analysis as 'you don't need A/B tests anymore' is
totally disingenuous.

~~~
otterk10
Thanks for the feedback! I totally agree that observational studies are
suggestive but don't replace A/B tests - that’s why the main use case I listed
in the blog (and how current customers have used the product so far) is
“prioritization of a/b tests”, not replacing a/b tests themselves. The
language around “simulating a/b tests” is just a way to concisely explain the
idea, at a high level, to someone who may not be very technical or has never
heard of an observational study. Happy to take suggestions on how to better
explain observational studies to less technical customers without over-
selling! It’s something we’ve been iterating on ourselves.

------
mrbonner
I have been involved in causal inference analysis since 2015. We use a mixed
model of decision trees and fixed-effect regressions. I read your paper and
could not find a reference explaining why, when one cannot run an A/B test to
verify a relationship, an observational analysis can be used to do it instead.
Could you share a reference, please? Thank you for this insightful article!

~~~
otterk10
You can definitely do an A/B test to verify the causal relationship - in fact,
that's the preferred method! Our platform is for situations where you didn't
run an A/B test - either because you can't run as many as you'd like, or
because you forgot - and gives you an estimate after the fact.

------
whoisnnamdi
Cool stuff, thanks for sharing publicly.

Did you all consider using Double Selection [1] or Double Machine Learning
[2]?

The reason I ask is that your approach is very reminiscent of a Lasso style
regression where you first run lasso for feature selection then re-run a
normal OLS with only those controls included (Post-Lasso). This is somewhat
problematic because Lasso has a tendency to drop too many controls if they are
too correlated with one another, introducing omitted variable bias.
Compounding the issue, some of those variables may be correlated with the
treatment variable, which increases the chance they will be dropped.

The proposed solution is to run two separate Lasso regressions - one with the
original dependent variable, and another with the treatment variable as the
dependent variable - recovering two sets of potential controls, and then to
use the union of those sets as the final set of controls. This is explained in
simple language at [3].
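
A toy numpy sketch of that double selection recipe (hand-rolled coordinate-descent Lasso, made-up data in which one control drives both the treatment and the outcome; the penalty `lam=0.1` is arbitrary):

```python
import numpy as np

def lasso(X, y, lam, n_sweeps=200):
    """Plain coordinate-descent Lasso: min 0.5*||y - Xb||^2 + lam*n*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]        # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam * n, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(2)
n, p = 2_000, 20
Z = rng.normal(size=(n, p))                          # candidate controls
t = Z[:, 0] + 0.5 * Z[:, 1] + rng.normal(size=n)     # treatment driven by Z0, Z1
y = 2.0 * t + 3.0 * Z[:, 0] + rng.normal(size=n)     # outcome also driven by Z0

# Double selection: Lasso of y on Z, Lasso of t on Z, keep the union.
keep = (np.abs(lasso(Z, y, 0.1)) > 1e-8) | (np.abs(lasso(Z, t, 0.1)) > 1e-8)

# Final OLS of y on the treatment plus the union of selected controls.
D = np.column_stack([np.ones(n), t, Z[:, keep]])
beta, *_ = np.linalg.lstsq(D, y, rcond=None)
effect = float(beta[1])
```

Because the union keeps every control that predicts either the outcome or the treatment, the key confounders survive selection and the final OLS recovers the true effect of 2.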

Now, you all are using PCA, not Lasso, so I don't know if these concerns apply
or not. My sense is that you still may be omitting variables if the right
variables are not included at the start, which is not a problem that any
particular methodology can completely avoid. Would love to hear your thoughts.

Also, you don't show any examples or performance testing of your method. One
example would be demonstrating that, in a situation where you "know" (via an
A/B test, perhaps) what the "true" causal effect is, your method recovers a
similar point estimate. As presented, how do we / you know that this is
generating reasonable results?

[1] [http://home.uchicago.edu/ourminsky/Variable_Selection.pdf](http://home.uchicago.edu/ourminsky/Variable_Selection.pdf)

[2] [https://arxiv.org/abs/1608.00060](https://arxiv.org/abs/1608.00060)

[3] [https://medium.com/teconomics-blog/using-ml-to-resolve-experiments-faster-bd8053ff602e](https://medium.com/teconomics-blog/using-ml-to-resolve-experiments-faster-bd8053ff602e)

~~~
otterk10
Thanks! Yes, the concerns you mentioned would also apply to PCA. What we've
actually done to help alleviate this is take a union of components from
y-aware[1] and normal PCA, to capture variables that are correlated with the
dependent variable and (hopefully) most of the treatment variables. This is
similar to the double selection approach you mention - the difference being
that since we are trying to run this at scale for thousands of treatment
variables, running a feature selection with each treatment variable as the
dependent variable isn't feasible, so the normal PCA acts as a proxy for that
part of the double selection.
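
For readers unfamiliar with y-aware PCA, a toy numpy sketch (made-up data; the idea is from the win-vector post linked below): rescale each column by its univariate regression slope on y before extracting components, so high-variance-but-irrelevant features stop dominating:

```python
import numpy as np

def y_aware_components(X, y, k):
    """Rescale each column by its univariate regression slope on y, then PCA."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    slopes = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    Xs = Xc * slopes                                  # y-aware scaling step
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    return Xs @ Vt[:k].T                              # top-k component scores

rng = np.random.default_rng(3)
n = 5_000
x1 = rng.normal(size=n)                   # low variance, predictive of y
x2 = 10.0 * rng.normal(size=n)            # high variance, pure noise
y = 2.0 * x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

# Ordinary PCA's first component chases x2's variance; y-aware PCA's first
# component tracks the predictive direction x1 instead.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]
yw1 = y_aware_components(X, y, 1)[:, 0]

corr = lambda a, b: abs(float(np.corrcoef(a, b)[0, 1]))
```

Here `corr(yw1, y)` is near 1 while `corr(pc1, y)` is near 0, which is the whole point of the y-aware scaling.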

Regardless, we're never going to completely remove omitted variable bias, as
we're never going to capture 100% of relevant variables. One way we monitor
our model's bias is by looking at the error distribution between users in the
treatment vs control. If these aren't similar, there's too much bias in our
estimate of the treatment effect, so we wouldn't want to serve an estimate of
the treatment effect for this variable to our customers.
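
A toy numpy sketch of that kind of residual check (made-up data; a two-sample Kolmogorov-Smirnov statistic stands in for whatever distribution comparison is used in production): when the model omits a confounder, the residual distributions in treatment and control visibly diverge, and when the confounder is included they line up:

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic."""
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

rng = np.random.default_rng(4)
n = 10_000
u = rng.normal(size=n)                          # confounder
t = (u > 0.5).astype(float)                     # treatment selected on u
y = 2.0 * t + 2.0 * u + rng.normal(scale=0.3, size=n)

def group_residual_gap(design):
    """Fit OLS, then compare residual distributions across treatment groups."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return ks_stat(resid[t == 1], resid[t == 0])

# Model that omits the confounder: residual shapes differ across groups,
# flagging a biased treatment-effect estimate.
gap_bad = group_residual_gap(np.column_stack([np.ones(n), t]))

# Model that includes the confounder: residuals look alike in both groups.
gap_good = group_residual_gap(np.column_stack([np.ones(n), t, u]))
```

In this simulation `gap_bad` is large and `gap_good` is at sampling-noise level, so a threshold on the gap would correctly withhold the biased estimate.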

The current product is in beta and we're working with some of our current
customers to try to re-create our results with A/B tests. I'm hoping that by
our GA release in the fall we'll have some case studies with specific
examples!

[1] [http://www.win-vector.com/blog/2016/05/pcr_part2_yaware/](http://www.win-vector.com/blog/2016/05/pcr_part2_yaware/)

------
kk58
Did you guys look into partial mutual information for confounding-variable
selection?

Or Granger causality for estimating Granger causes?

------
whirlofpearl
Looks like you lifted this straight out of Judea Pearl's seminal research.

Congratulations! Just remember to patent it :)

~~~
otterk10
Thanks! You're correct in surmising that our approach was heavily influenced
by Judea Pearl's research.

And yes, the timing of the blog post isn't a coincidence - we actually filed a
patent last week :)

~~~
bertil
If you are implementing a documented technique, what are your claims of
originality for a patent?

I’m asking because we are building our own implementation of mSPRT. There are
some variants, but I didn’t expect enough novelty to patent. We are having
internal debates about this, and I’d rather have actual examples than the
ageing memory of my law class.

~~~
bmahmood
Thanks for the interest! (Cofounder of Clearbrain here).

The patent covers a combination of statistical techniques and engineering
systems we built. The tricky part of this is the infrastructure needed to
select confounding variables and estimate treatment effects for thousands of
variables at scale in seconds. That was what we filed a patent on.

------
move-on-by
An analytics platform without a privacy policy? :(

404: [https://www.clearbrain.com/privacy](https://www.clearbrain.com/privacy)

404: [https://www.clearbrain.com/terms](https://www.clearbrain.com/terms)

~~~
bmahmood
Apologies! Looks like the site was mid-update when you noted the 404s. It's
back live now :)

------
Rainymood
Interesting to note that ClearBrain is in YC.

