

Scalable A/B Experiments at Pinterest - folz
http://engineering.pinterest.com/post/95378137929/scalable-a-b-experiments-at-pinterest

======
freedom123
"We have hundreds of A/B experiments running at anytime" which means you have
a very big attribution problem.

~~~
dfrankow
I work at Pinterest, and have some involvement in how we learn from
experiments.

In each experiment, we assign users randomly based on a hash of experiment
name and userid. So, we assume (and have tested) that the groups of one
experiment have a fairly uniform mix of users from the groups of any other
experiment. This means the primary effect of being in a particular experiment
group G should be much larger than the effect of being in G and some other
experiment's group. So, we usually analyze the effects of an experiment
independently of any other experiments. Having millions of users helps here.
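
In code, that deterministic assignment looks roughly like the sketch below
(an MD5 hash of experiment name plus user id; the group names and split are
illustrative, not our actual configuration):

    import hashlib

    def assign_group(experiment_name, user_id, groups=("control", "treatment")):
        """Deterministically assign a user to a group by hashing experiment + user id.

        Because the hash mixes in the experiment name, the same user lands in
        unrelated buckets across different experiments, which is what makes
        analyzing each experiment on its own reasonable.
        """
        key = "%s:%s" % (experiment_name, user_id)
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % len(groups)  # or % 100 for percentage splits
        return groups[bucket]

    # The same user falls into independent buckets in two different experiments.
    print(assign_group("new_home_feed", 12345))
    print(assign_group("signup_flow_v2", 12345))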

I think this is how trials normally work: assume that random assignment into
groups will spread all other factors equally across the groups.

There are experiments that depend on each other (they change the same UI, or
one assumes a precondition of another), and that situation is complicated. We
try to avoid it when we can, and negotiate carefully when we find ourselves in
it.

You might also be referring to the fact that looking at hundreds of things
will find "statistically significant" effects simply at random. As my advisor
said, "95% confidence means wrong 1 in 20 times." That is a risk to be
managed. We always have to ask things like: is this a reasonable outcome for
this experiment? Do we have other corroborating evidence? Is the effect
consistent? Do we have reason to disbelieve the result? Some fraction of our
experiments are run incorrectly or have a broken implementation; we detect
some, likely not all.
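
To put numbers on that 1-in-20 risk, a quick back-of-the-envelope sketch
(assuming, for illustration, independent tests each run at the 95% level):

    # Chance of at least one spurious "significant" result among n
    # independent tests, each run at alpha = 0.05.
    alpha = 0.05
    for n in (1, 10, 100, 300):
        p_any = 1 - (1 - alpha) ** n
        print("%4d tests -> %5.1f%% chance of at least one false positive"
              % (n, 100 * p_any))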

Also: is the risk higher to run an experiment that may be misinterpreted, or
not to experiment at all and just release things? We like to learn from data
if we think we can. There is art in the tradeoffs.

~~~
tmarthal
Did you guys look into Bayesian Bandits? They seem much more tradeoff-able.

~~~
dfrankow
Using Bayesian methods to learn about our experimental effects is very
interesting to me. For example
http://www.evanmiller.org/bayesian-ab-testing.html, which may be the type of
bandit you refer to.

In the best case, it might be more robust to low-data outliers that cause
random blips that look significant, and it might also address the problem of
looking at an experiment over and over (which makes frequentists
uncomfortable: more chances to succeed).
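
For a conversion-style metric, the Beta-Binomial comparison that page
describes can be sketched with simple Monte Carlo; the counts and the
Beta(1, 1) prior below are purely illustrative:

    import random

    def prob_b_beats_a(successes_a, trials_a, successes_b, trials_b,
                       prior_alpha=1.0, prior_beta=1.0, samples=100000):
        """Estimate P(rate_B > rate_A) under independent Beta posteriors."""
        wins = 0
        for _ in range(samples):
            rate_a = random.betavariate(prior_alpha + successes_a,
                                        prior_beta + trials_a - successes_a)
            rate_b = random.betavariate(prior_alpha + successes_b,
                                        prior_beta + trials_b - successes_b)
            if rate_b > rate_a:
                wins += 1
        return wins / float(samples)

    # Illustrative counts, not real experiment data.
    print(prob_b_beats_a(successes_a=200, trials_a=10000,
                         successes_b=230, trials_b=10000))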

However, it is non-trivial to understand how this works for us in practice.
Examples: we want overall results as well as days-in results (to look for
novelty effects); we would have to choose priors with consideration, because
they have a huge impact on most results; etc.

This takes time and effort, and there are a lot of other things competing for
those resources.

------
rboling91
"There is no persistence layer involved for experiment group allocation, so
that we could minimize the latency/load on our production services. We left
all of the complicated metric computations to offline data processing." Does
that mean the experiment a user is allocated to is recorded only through an
in-memory log, rather than being stored on disk (or on disk and in memory
through logs)? If so, are there frequent trade-offs in memory/CPU usage when
running experiments?

~~~
dfrankow
Putting ("triggering") someone into an experiment group is most often some
function calls: do they meet the criteria (e.g., country, ..), do they fall in
the right bucket. The fact that they've been triggered is logged to Kafka, an
open-source messaging system whose logs we push (eventually) into Hive.
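
A hypothetical sketch of that flow (the eligibility criteria, the bucket
split, and the log_to_kafka stand-in are illustrative, not our actual code):

    import hashlib
    import json
    import time

    def log_to_kafka(topic, event):
        """Placeholder for a real Kafka producer send (async, fire-and-forget)."""
        print(topic, json.dumps(event))

    def trigger(experiment_name, user, num_buckets=100,
                treatment_buckets=range(0, 50)):
        """Check eligibility, compute the bucket, and log the triggering event.

        No group assignment is persisted by the service; the Kafka log is the
        record that offline jobs later join against metrics in Hive.
        """
        if user.get("country") not in ("US", "CA"):  # illustrative criteria
            return None
        key = "%s:%s" % (experiment_name, user["id"])
        bucket = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % num_buckets
        group = "treatment" if bucket in treatment_buckets else "control"
        log_to_kafka("experiment_triggers", {
            "experiment": experiment_name,
            "user_id": user["id"],
            "group": group,
            "ts": int(time.time()),
        })
        return group

    print(trigger("new_home_feed", {"id": 12345, "country": "US"}))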

