
A/B Testing vs MAB algorithms - It's complicated - btilly
http://bentilly.blogspot.com/2012/09/ab-testing-vs-mab-algorithms-its.html
======
patio11
I really want to tell you all to read this.

The reality is (as it points out in the text) the overwhelming majority of you
don't A/B test. If you don't A/B test already you should stop reading about
A/B testing and start A/B testing. It will be the most important thing you do
today. It will probably be the most important single thing you do in quite a
while.

For the overwhelming majority of people A/B testing, this is similarly not
really a priority for you today. You're fine doing what you're doing. (But...
test more.)

But for the sliver of a sliver of people who really need to care about moving
the state of the art in conversion optimization forward, this essay is _very
important_. If I had an A/BatSignal I would be shining it on this.

------
psb217
Over the last few years, some people have started to work on a problem that
has been described as "pure exploration" in multi-armed bandits [1,2]. The
objective in this problem is, roughly speaking, to maximize the rate at which
you become certain about whether or not you've correctly identified the best
bandit arm (or, more generally, the top _n_ arms). This solves some of the
more common complaints about classic MAB algorithms in that, when run for a
finite number of trials, the resulting sampling of the arms produces a far
more confident decision than either the regret-minimizing policies followed by
standard MAB algorithms or the uniform allocation policies typically used in
A/B testing.

The original article mentions problems in MAB algorithms dealing with delayed
feedback. Such issues are largely ameliorated by the use of algorithms related
to "Thompson sampling" [3], which induces stochastic trial allocation policies
rather than the deterministic policies induced by UCB's selection process.
It's definitely possible to develop Thompson-like methods for the exploration-
oriented MAB problem, and such methods can rapidly distinguish the best among
a rather large set of options, as might be required in applications like MVT
(see note below).
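
For anyone curious what this looks like in code, here is a minimal
Beta-Bernoulli Thompson sampling sketch (the conversion rates below are
invented, and a real deployment would batch posterior updates to cope with
delayed feedback):

    import random

    def thompson_choose(wins, losses):
        # Sample each arm's conversion rate from its Beta posterior and
        # play the arm with the largest sampled rate.
        draws = [random.betavariate(w + 1, l + 1)
                 for w, l in zip(wins, losses)]
        return max(range(len(draws)), key=lambda i: draws[i])

    true_rates = [0.04, 0.05, 0.06]   # hypothetical arms
    wins, losses = [0, 0, 0], [0, 0, 0]
    for _ in range(10000):
        arm = thompson_choose(wins, losses)
        if random.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    print(wins, losses)               # most traffic ends up on the best arm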

Note: I'm currently doing academic research in this area and, if anyone's
particularly interested, I could share some of the (simulated) empirical data
and algorithmic details pertinent to what I said above.

[1]: Multi-Bandit Best Arm Identification, Gabillon et al., NIPS 2011

[2]: PAC Subset Selection in Multi-Armed Bandits, Kalyanakrishnan and Stone,
ICML 2012

[3]: An Empirical Evaluation of Thompson Sampling, Chapelle and Li, NIPS 2011

------
ckluis
At the end of the day, I think A/B testing is utilized more just because it is
easier to understand & implement. I would love to utilize something like
unbounce for MAB testing, but I haven't seen good MAB frameworks that are
ready for the average marketer to use.

It would be nice to have a MAB framework that utilizes MVT so that you could
preload 100s of alternates (c2a, headline, copy, images, placement) and let it
go to town.

~~~
btilly
You make good points. However the ultra-simple MAB for home pages is a
solvable problem. Noel at Myna has been responsive to me. If you give him
feedback on how he can target that market segment, particularly if you're
willing to be a paying customer, he'll likely listen.

MVT is inherently more complex. I guarantee you that there are MAB based
systems that do MVT on problems of insane complexity. I pointed at one system
at Yahoo that likely is an example. And the potential win isn't small. Yahoo
claims a 300% increase in clicks on their home page module since implementing
it.

But I haven't noticed anything like that publicly available. (I have seen
companies that do it behind the scenes - but the one that first comes to mind
hides that detail and tries to be priced on a pay for performance model.) Of
course a company that already does MAB could build a custom solution. Or you
can try to build your own if you think there is money in it. My personal life
does not currently permit me the time commitment that a startup would require.
But if you or someone else wants to build it, I'd be willing to consult on how
to make it work.

~~~
ckluis
My first piece of advice is that screenshots & videos are good.

A picture is worth 1,000 words. A video is 30,000 words a second.

I dream of a MAB/MVT combo solution. I have a project coming up that I might
try this out on.

------
adrianhoward
I thank you from the bottom of my heart for writing this - since I can now
cross off the "Write something comparing bandit & a/b" item from my to-do list
and point people to your post instead :-)

------
cmansley
I think the real distinction here is stationary versus non-stationary
distributions. Many of the arguments made in this article hinge on the fact
that the responses to the same input change over time (nights are different
from daytime, which is different from weekends). By continuously running the
A/B test, you are looking at a small window in time, which you assume is
stationary so you can do your t-tests or whatever statistical test.

But, to be clear, this is a heuristic in A/B testing. If you have a window of
time over which you know your distributions are stationary, you should always
use a logarithmic regret MAB algorithm because it is theoretically better.

I think the best way to frame this argument is that because the domain does
not match the assumptions of MAB, A/B testing has been shown to be robust and
easy to modify for non-stationary domains, while logarithmic regret algorithms
are somewhat more fragile.

~~~
btilly
Correction.

A/B testing does not need to assume that it is testing over a small window in
time that has stationary conversion rates. In fact in practice tests run for a
long window of time over which you have good evidence that the distribution is
_not_ stationary. For example in the middle of running a long test that
eventually found a small lift, I've often run and rolled out a second test
that generated a much stronger lift.

The weaker assumption that you need is that the preference between versions is
stable across your fluctuations. Then because the mix of versions is time
independent, that non-stationary fluctuation is not statistically different
between the slice that was put into A and the slice that was put into B. And
therefore the variation between different samples becomes just another unknown
random factor that does not interfere with your statistical analysis of
whether there is a difference.

This is an advantage of A/B testing over the current state of the art in MAB
algorithms. When this came up before, Noel (at Myna) and I privately did
an admittedly brief search of the literature for discussion of this point with
regards to MAB algorithms. We turned up a number of things that would work in
the long run, but none that directly addressed the problem.

But in discussion we did manage to come up with effective MAB algorithms,
whose regret is only a constant factor worse than standard MAB algorithms,
that will accurately identify stable preferences in the face of constantly
fluctuating conversion rates. To the best of my knowledge nobody, including
Noel, has yet implemented such algorithms in practice. But in principle it can
be done.

However even if you do it, several of my other points still apply as real
differences.

~~~
cmansley
Wait.

I am slightly confused and this may demonstrate my ignorance, but I was under
the impression that A/B testing worked by allocating two different approaches
to the users and then scoring the responses. This provides a sampling from the
population of users as to how effective A or B is. You can then run some
statistical test on the averages of the scores for each test to determine
which one is the winner.

If what I said is true, these statistical tests almost always assume that the
distribution that is being drawn from is stationary. So, the only way things
work out is if you have an underlying stationary distribution. Otherwise, your
statistical test might indicate the wrong thing.

I freely admit that many of your points are still valid, but I don't see how
A/B is a more powerful algorithm with fewer assumptions.

~~~
btilly
Perhaps stepping back can clarify.

The necessary, reasonable, and much weaker assumption needed for A/B testing
is that users are independent of each other, and arrive by some Poisson
process. Meaning that at any given point in time there is some average rate
that users arrive, but each user's arrival is independent of all other
arrivals. (More mathematically precisely, the number of people who will arrive
in any specified time period follows a Poisson distribution. Poisson processes
accurately model everything from nuclear decay counts to emergency room
arrivals.) If you randomly divide a group of people who arrived by a Poisson
process into two subgroups in a fixed ratio (say, evenly), those subgroups
will also turn out to be generated by a Poisson process.
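
If that last claim sounds surprising, a quick simulation makes it concrete
(the rate below is made up): thinning Poisson arrivals with a fair coin gives
two subgroups whose hourly counts are again Poisson, each with half the rate.

    import numpy as np

    rate, hours = 10.0, 100000
    arrivals = np.random.poisson(rate, hours)    # visitors per hour
    to_a = np.random.binomial(arrivals, 0.5)     # coin flip per visitor
    to_b = arrivals - to_a
    # For a Poisson distribution the mean equals the variance;
    # both subgroups should show mean ~ variance ~ 5.
    print(to_a.mean(), to_a.var(), to_b.mean(), to_b.var())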

Now we're going to take those subgroups, and feed them into our versions.
Let's focus on what happens with version A. Depending on when a user arrived,
that user will have some chance of converting to a success, and some chance of
not doing so. If we look at a random user that arrived and ignore when they
arrived, their probability of converting will be the time-averaged conversion
probability, weighted by the rate of arrival. Furthermore our assumption that
users are arriving from a Poisson process means that each user is
statistically independent of all the others. Now it is true that if we pay
attention to the fact that certain users arrived close to others, there are
likely correlations to be found between those specific users. But if
you sample two random users from all of the users who could have arrived, they
will be independent and have an identical probability of conversion, which is
that average. (This flows out of the assumption that the initial population
came from a Poisson process.)

This same analysis can be done for A and for B. Now we don't know the actual
conversion rates over our trial. But suppose that we did know them, and it
happened to be that the conversion rate for B is better at every time than for
A. Then it is easy to show that the average conversion rate (weighted by
arrival rate of course) over the whole interval for B would be better than for
A.

Therefore if we assume that there is a consistent preference, statistical
evidence over the sample that B converts better than A is valid statistical
evidence that B is actually the better version at any given point in time.
This holds even if the difference between their conversion rate is much lower
than the fluctuation of the conversion rates of both over the interval we
sampled from.

(Very helpfully for A/B test practitioners, this math works whether or not you
happen to understand it. As long as you're not adjusting sampling rates, A/B
tests are statistically valid.)

Now take a traditional MAB algorithm. The whole analysis that I just did falls
apart. The fact that we send traffic to the versions at different rates at
different points of time means that the average for random people in the two
versions is weighted differently over the interval. This opens up the
possibility that the average conversion rate of the whole sample for B can be
better than for A, yet A might at every point in time have had a better
conversion rate than B.

See <http://en.wikipedia.org/wiki/Simpsons_paradox> if that seems impossible
to you. If you've read that and understand how being better in the sample that
a MAB algorithm collected is not statistical evidence that you're actually
better, then you may want to re-read this post to understand why an A/B test
is still statistically valid.
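
To make the trap concrete, here is a toy calculation (every number invented):
A beats B in both time periods, but an allocation that happens to shift B's
traffic toward the high-converting period makes B's pooled average look better.

    # conversion rates per period: A is better in both
    conv_a = [0.02, 0.10]        # [night, day]
    conv_b = [0.01, 0.09]

    def pooled(conv, counts):
        return sum(n * c for n, c in zip(counts, conv)) / sum(counts)

    # Fixed 50/50 split: both versions see the same mix of periods.
    print(pooled(conv_a, [1000, 1000]), pooled(conv_b, [1000, 1000]))  # A wins

    # Adaptive split: B gets most of its traffic during the day.
    print(pooled(conv_a, [1800, 200]), pooled(conv_b, [200, 1800]))    # B "wins"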

To avoid this trap, what you need to do in the MAB algorithm is either
subsample or scale data (subsampling is provably more robust but not much so,
scaling is simpler and converges faster) so that the statistical decision that
you're basing your MAB choices on avoids this pitfall, at least in the limit.
But as I said before, a detailed discussion of how to make this work and the
necessary trade-offs would get fairly involved.
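
As a rough sketch of the subsampling option (the general idea only, not any
particular product's implementation): within each time bucket, keep only as
many observations from each version as the least-sampled version received, so
the comparison is once again over identically weighted mixtures.

    import random

    def equalized_totals(buckets):
        # buckets: list of (wins_a, n_a, wins_b, n_b) per time bucket.
        # Downsample each bucket to the smaller sample size before pooling.
        tot_a = tot_b = tot_n = 0
        for wins_a, n_a, wins_b, n_b in buckets:
            m = min(n_a, n_b)
            if m == 0:
                continue
            tot_a += sum(random.sample([1] * wins_a + [0] * (n_a - wins_a), m))
            tot_b += sum(random.sample([1] * wins_b + [0] * (n_b - wins_b), m))
            tot_n += m
        return tot_a, tot_b, tot_n   # feed these into your comparison statistic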

~~~
cmansley
I don't know how we got onto arrival rates of our visitors. I would just like
to state that Simpson's paradox is exactly why we shouldn't compare percent
conversions. They are meaningless. However, many of the statistical tests like
Student's t-test compensate for this paradox by including the number of
samples in the tests. See
<http://en.wikipedia.org/wiki/Student%27s_t-test#Unequal_sample_sizes.2C_unequal_variance>

I think you said one thing that is at the heart of the issue. We assume that
the E[conversion of B] > E[conversion of A] for the entire period sampled.

I think all of the details about Poisson processes are not required if you
just assume that each person is drawn IID from the population.

I just don't think you are answering the right questions here.

Let's assume that each person arrives IID from the infinite population. Then
we have a Bernoulli process for each A or B query. A "conversion" results in a
1, a failure results in a 0. Since these people are arriving IID, we can
select a sub-sample which is also IID. We would now like to estimate the
parameters for each process and/or compare the two processes. We can do this
using a t-test. This will give us the statistical significance that the one
group had a higher "conversion rate" than the other group. Note: arrival rate
does not factor into this problem at all because we assume the participants
were IID, so the t-test (used correctly, accounting for different numbers of
samples) will tell us which group's rate is larger.
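
For concreteness, that comparison might look like this with a two-proportion
z-test standing in for the t-test (all counts below are made up):

    import math

    def two_proportion_z(conv_a, n_a, conv_b, n_b):
        # z statistic and two-sided p-value for H0: equal conversion rates
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        return z, math.erfc(abs(z) / math.sqrt(2))

    print(two_proportion_z(conv_a=120, n_a=5000, conv_b=160, n_b=5000))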

My question is now what happens when the parameters of your queries for A or B
change over time. Still under the assumption that E[B] > E[A], it now matters
greatly in which order you use your samples.

I think the only reason you brought the Poisson model into the discussion is
to weight the more recent samples higher and down-weight the earlier samples
in your basket of samples. This is a heuristic for considering a fixed
interval in which the samples are stationary. It effectively considers a
window that slides with the time of arrival.

~~~
btilly
First let me mention that you should not try to use the Student's t-test. One
of the first assumptions of the Student's t-test is that _Each of the two
populations being compared should follow a normal distribution._ In A/B tests
this assumption is almost never true, and therefore the Student's t-test is
an inappropriate choice of statistical test.

OK, now to what I said about Poisson distributions. Assuming that people
arrive according to a Poisson process allows us to conclude two key facts:

1. The statistics will behave exactly as they would if each person arrived
IID from an infinite population.

2. Simpson's paradox will not apply to the theoretical distribution of the
samples for A and B.

Assuming #1 without #2 does not get you very far. But having facts #1 and #2
allows us to use statistics.

I have no idea why you would speculate that I am attempting to weight recent
samples higher and downweight earlier samples. All samples are, in fact,
weighted exactly the same. This fact notwithstanding, different times of
arrival are not weighted the same. That is because the sample rate fluctuates
over time depending on factors such as traffic levels on your webserver. But
it fluctuates in an identical way for the two versions. (This fact is critical
in being able to conclude point #2.)

Does this help?

~~~
cmansley
First, I was using the Student's t-test as a stand-in for whatever test or
statistical measure you would like to use. I believe the popular one is
Hoeffding's inequality in the bandit literature, hence the log term in the MAB
algorithms. I agree this was a poor choice of example.

Second, I believe I am getting hung up on the fact that different arrival
times are "weighted" differently. I think you are claiming that the Poisson
assumption gives us equal numbers of A and B trials, so we can combine the
statistics (counts) and avoid Simpson's paradox. This is fine, but why would
you say "different times of arrival are not weighted the same". Does this mean
you are somehow weighting periods of heavy traffic down and weighting low
traffic up?

So, what happens when trial A becomes less favorable over time or is less
favorable for brief periods? This means that the underlying random variable's
mean is changing over time. Most statistical bounds cannot handle this
situation.

I am not saying that A/B testing is not something we should do in general. I
am saying that it is a good heuristic with very few provable properties
compared with logarithmic regret MAB algorithms.

~~~
btilly
_This is fine, but why would you say "different times of arrival are not
weighted the same". Does this mean you are somehow weighting periods of heavy
traffic down and weighting low traffic up?_

You keep on reversing the exact point that I keep on making, and then fail to
understand what I said. So I guess that I'll keep repeating the same point in
different ways and hope that at some point you'll get it.

Why do I say reversing? Because the weight a time period gets is directly
proportional to the expected traffic. Therefore each observation is weighted
the same, and periods of heavy traffic are the ones that are weighted most
heavily.

Anyways, let's suppose, for the sake of argument, that from 2 AM to 3 AM
observations arrive at an average rate of 1 every 10 minutes. Suppose that
from 8 AM to 9 AM that they arrive at an average rate of one per minute.

Then, on average, we expect to have 6 observations from the hour in the middle
of the night, and 60 observations from the hour from 8 AM to 9 AM.

Thus when we calculate average returns across the entire interval, on average
we'll have 10x as many observations from 8 AM to 9 AM. Therefore on average
the latter time period will have 10x the impact on the final results.

The conclusion is that different time periods are naturally weighted
differently. However the weighting is the same across the two different
versions.

If you want to get more mathematical about it, suppose that r(t) is the
average rate at which observations are arriving in our subgroups. (So r(t) is
the same for versions A and B.) Suppose that cA(t) is the rate at which
version A converts, and suppose that cB(t) is the rate at which version B
converts.

Here is what I claim:

Average conversion of A = integral(r(t) * cA(t)) / integral(r(t))

Average conversion of B = integral(r(t) * cB(t)) / integral(r(t))

Therefore if at all points cA(t) < cB(t) then the difference between their
conversion rates is:

integral(r(t) * cB(t)) / integral(r(t)) - integral(r(t) * cA(t)) / integral(r(t))
  = integral(r(t) * (cB(t) - cA(t))) / integral(r(t))

which is always positive. (It should be noted that this analysis remains the
same whether we're looking for a binary convert/no convert, or whether we're
looking at a more complex signal, such as amount paid. If we add the
complication that people entering the test may convert to payments at one or
multiple later points, the analysis becomes more complicated, but the result
remains the same.)
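
If it helps, the claim is easy to check numerically with a discretized r(t),
cA(t), and cB(t) (all values below invented): as long as cB(t) > cA(t) at
every point, the arrival-weighted averages preserve the ordering no matter how
wildly r(t) fluctuates.

    # hourly traffic rate over one day, and conversion rates per hour
    r  = [5, 3, 2, 2, 4, 10, 30, 60, 80, 70, 60, 50,
          55, 50, 45, 40, 45, 55, 65, 60, 40, 25, 15, 8]
    cA = [0.02 + (0.01 if 8 <= h <= 18 else 0.0) for h in range(24)]
    cB = [c + 0.005 for c in cA]              # B beats A at every hour

    def avg(conv):
        return sum(ri * ci for ri, ci in zip(r, conv)) / sum(r)

    print(avg(cA), avg(cB))                   # avg(cB) > avg(cA)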

 _So, what happens when trial A becomes less favorable over time or is less
favorable for brief periods? This means that the underlying random variable's
mean is changing over time. Most statistical bounds cannot handle this
situation._

As long as there is a consistent preference between A and B, fluctuations in
either or both do not alter the validity of the statistical analysis. If the
preference is not consistent then, of course, A/B testing stops being valid.

 _I am not saying that A/B testing is not something we should do in general. I
am saying that it is a good heuristic with very few provable properties
compared with logarithmic regret MAB algorithms._

The fact that you are not following this proof does not mean that the proof I
am offering you is invalid. In fact A/B testing has provable properties that,
in a common real-world situation, are _much_ better than current state of the
art logarithmic regret MAB algorithms.

I am also claiming (without proof) that this deficiency in current MAB
algorithms is fixable, at the cost of a constant factor worse regret in the
ideal situation where conversion rates do not change.

~~~
cmansley
Ok, you are making a point about sampling. In periods of high traffic you have
more samples that will bias the calculation. But, since the number of samples
will be consistent between A and B, everything is fine. Fine.

I believe further discussion will not be productive. But if you have a proof
about A/B testing as it relates to the MAB problem, or even a proof about A/B
testing in general that drops the normality assumption or tells you exactly
when to switch from exploring to exploiting, I suggest you write it up and
publish it on arXiv.

Thanks for your time.

~~~
btilly
Yes, I was indeed making a point about sampling. Hopefully we're on the same
page now.

But here is an honest question for you. Why do you think that it would make
sense for me to try to write up and publish a paper on arXiv?

My view is that doing so would take a considerable amount of work. And the
real-world constraints that matter to my clients don't seem to be in a
direction that academics care much about, so I can't see them getting
particularly interested in it. So it does not seem like it positively impacts
my life.

I say this as someone who has several publications to my name. This fact has
only once mattered to me. That once was when I needed to get sign-off from my
current employer to have something I did before they hired me get published.
Unfortunately my employer at the time was eBay, they had just purchased
Skype, and the paper that I was publishing implied, among other things, that
Skype was unlikely to be worth what eBay had paid for it. This was... not fun.

If some research mathematician particularly wanted to sit down, pick my
brains, and try to formalize the real-world constraints I've observed in my
clients, that would be fine by me. It would only be fair in that case for me
to be listed as a co-author. But unless that happens, I'm not going to try to
publish a formal paper.

~~~
cmansley
I was encouraging you to write down your mathematical evidence of better
performing algorithms, because I know from my experience that when I try to
write down proofs, my assumptions and reasoning become clearer. And often a
step in the proof or logic that seems trivial becomes less trivial once you
actually try to lay out the proof.

I believe that one of two things will arise when you go to write up a
mathematical proof. One, you will be dependent on a statistical test that
makes a normal distribution assumption implicitly or explicitly. Two, you will
be dependent on a bound that only applies in the limit. There is a third
possibility, the most interesting to me, which is that you are dependent on a
period of stationarity in your data; in other words, the distribution you are
measuring must be stationary.

The proofs in the MAB papers are not intentionally obtuse or ignoring
mainstream statistical ideas. They are written that way to say very explicit
things under well defined assumptions. Locking down assumptions and being very
precise is what writing down a mathematical proof is all about.

~~~
btilly
You know, after disbelieving what I said about A/B testing only to find out
that I was right, you might have adjusted your expectations of me. I've
written formal proofs before. I have published papers in mathematics. I know
what's involved. Please spare me the lecture.

In this case none of the three possibilities that you stated are correct. I
have an algorithm about which the following statement is true:

Suppose that users arrive according to a Poisson process. Suppose further that
the random reward function of putting them into different versions is
variable but satisfies the following properties:

1. There is a static upper bound on the expected value of each version.

2. There is a static upper bound on the variance of each version.

3. The cumulative distribution of the reward function is integrable. (We
don't even need continuity!)

4. One of the versions at all times has an expected return that is at least
epsilon greater than all other versions.

Then I have a MAB algorithm which achieves logarithmic regret in the long run.
It is only worse by a constant factor than MAB algorithms for the case with
fixed returns. That factor is likely to be somewhere in the neighborhood of
sqrt(2), but I can't guarantee that.

Those are limit statements, but you can get more concrete bounds in the finite
case. In principle it should be possible to derive an explicit formula - you
tell me the bounds on expected value and variance for the versions, and the
size of epsilon, and I can put an explicit bound on the odds that the best
version is winning after N observations.

The principle behind the proof is exactly what it was for the A/B test case.
The assumption of Poisson processes allows us to subselect equal samples from
the versions for the purposes of statistical inference, and make provable
statistical statements about how the behavior of the statistical sample allows
us to make inferences about which version is best, even though actual
conversion behavior is constantly changing.

------
paraschopra
Thanks for writing this Ben. It's a very fair and balanced assessment of both
techniques.

