
Stop A/B Testing and Make Out like a Bandit - fezzl
http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-make-out-like-a-bandit/
======
paraschopra
This is interesting (I've yet to study the algorithm in detail), but I guess
you are comparing two dissimilar problems. One is hypothesis testing and the
other is continuous optimization.

The reason A/B testing has many parameters is that at the end of data
collection it employs a hypothesis test to determine whether the variation
performed better than the control (the interpretation of "better" is done
according to the parameters defined). The confidence level and other
parameters allow you to do a proper risk-reward analysis and decide whether to
go ahead with the variation. Moreover, a minimum sample size
(visitors/conversions) ensures that local (initial) fluctuations do not
unnecessarily bias the test results. In fact, in an A/B test the first 100
visitors are virtually identical to the last 100 visitors with regard to their
impact on the final results.

However, I am guessing that in bandit algorithms (since their primary usage is
continuous optimization, not hypothesis testing) local fluctuations can have
an adverse impact on the ultimate results. That may be acceptable for
continuous optimization problems, but not for projects where you need to
determine, with a defined level of confidence, whether one version works
better than the other.

Different needs, different algorithms.

~~~
noelwelsh
In short, no. In the domain that we're talking about (website optimisation)
bandit algorithms and A/B testing address exactly the same problem, and bandit
algorithms do so more efficiently.

Bandit algorithms actually arose out of hypothesis testing when people started
to think about stopping experiments early. One of the first papers on this is
Chernoff's "Sequential Design of Experiments"
(<http://www.jstor.org/pss/2237415>). Reading just the abstract on JSTOR will
give the flavour of the idea, and bandit algorithms are the natural extension
of this idea to a continuously running problem.

~~~
asharp
Interesting.

So you take the output of the bandit algorithm and use that as an input (i.e.
X was the "best" design) to the next stage of design?

What type of bandit algorithm do you use?

~~~
noelwelsh
My opinion is that it's best to just keep running the bandit algorithm. There
is no need to stop it -- you can add and remove options as it is running.

As for the type of algorithm, I've referenced various bits of the literature
elsewhere in the comments. I can't give details on our precise approach
unfortunately.

------
ojilles
The article doesn't do a very good job of explaining "Bandit algorithms". The
closest s/he comes is this, but that really doesn't enlighten me:

 _So, what is the bandit problem? You have a set of choices you can make. On
the web these could be different images to display, or different wordings for
a button, and so on. Each time you make a choice you get a reward. For
example, you might get a reward of 1 if a button is clicked, and reward of 0
otherwise. Your goal is to maximise your total reward over time. This clearly
fits the content optimisation problem._

Edit: Anyone have better pointers? (Other than the UU article referenced in
the post)

~~~
charliepark
I could be misreading it, but I believe the premise is "more traction => more
action".

If I'm right, the idea is that the algorithm dynamically weights the "display
frequency" of the two (or n) options. So as one of the a/b options shows
itself to be more successful, it's shown more frequently. Because the test is
self-correcting, you as an A/B test runner don't have to decide when the
results are significant enough, and the program itself will automatically
choose the more successful option.

~~~
datadon
Going purely off of your boiled-down description, this seems like a more
advanced version of Genetify (<https://github.com/gregdingle/genetify/wiki>).

@noelwelsh - care to comment on how what you're doing at Untyped differs from
what Genetify offers?

~~~
StavrosK
That looks like it uses genetic algorithms, which are much less optimal and
more exploratory than bandit algorithms. What noelwelsh is proposing would
lead to better results much more quickly and without random permutations of
elements.

~~~
noelwelsh
I agree :)

------
coffeemug
I think the big issue here is that the bandit problem assumes independence
between machines, while this is almost certainly an incorrect assumption to
make when analyzing user behavior. For example, I might be able to increase
conversion by changing the button text to "buy now" _and_ changing the
background color to black, but not by making each change independently.
Conversely, changing the button text to "buy now" might _hurt_ conversion if
the background color is black, but improve conversion if the background color
is white.

Essentially that means that if I make N changes at the same time and
conversion changes, it's not possible to tell which combination of the changes
affected conversion, and how (at least not without a new series of hypothesis
tests to establish controls). Perhaps only one change made the difference and
the rest were irrelevant, perhaps multiple changes made the difference, etc.

The bandit problem is much simpler because it guarantees that none of the
variables depend on each other. If there is no such guarantee, we're
effectively stuck with searching through an exponentially large space of
possibilities, and using NHST to tell the difference.

~~~
noelwelsh
What you say is absolutely true of the basic bandit problem. More complex
algorithms can deal with more complex problems, handling the interdependencies
you describe. See the "bandit slate" problem, for example.

~~~
coffeemug
The slate problem still assumes each machine is independent (i.e. you want to
pick k machines, but picking a particular machine does not affect the
probability function of the other machines). The ordered slate problem adds
one more variable and deals with two dimensions, but it doesn't deal with an
arbitrary number of dimensions.

Think about it - if you don't know which of the variables aren't independent,
you're effectively searching an exponential space. You can prune it a bit by
establishing that some variables are in fact independent, but the space of
possibilities is still enormous. You can use standard multivariable function
optimization techniques and set up correlation tests to guide the pruning, but
you still need to use NHST at each one of these steps.

~~~
noelwelsh
I agree with the general thrust of what you're saying -- it's a hard problem
and no amount of algorithmic finesse can get around the basic problem of the
exponentially growing space. In the context of this post, however, I think you
can still do a lot better than A/B testing or MVT :).

Here's a paper that I think (I haven't read it in detail yet) addresses the
problem you're talking about: <http://arxiv.org/pdf/1105.4871v1>

Another approach would be to model interactions as an (undirected?) Bayesian
network and try to learn the structure of the network from data. I've had
reasonable success doing this in the past, but with a much simpler problem.

This is certainly something we're looking at, but I think there is enough work
in building a profitable business that addresses the basic problem.

~~~
coffeemug
Yes, I think it's a huge commercial opportunity. I could use such a product
too, and I sincerely wish you luck. However, the bandit approach is almost
certainly not the right way to do it, and most of the value will come from the
tools and integration, not a magical statistical model.

I do believe Bayesian network construction will work, though I don't know how
much better it would do than NHST.

~~~
noelwelsh
Woah, we've been talking past each other in a big way. Let me try to clarify:

1. The setup in the bandit problem is identical to the setup in standard A/B
testing as applied to web content optimisation. The only difference is that in
the bandit problem you are allowed to make decisions as data arrives; in A/B
testing you have to wait till your experiment completes (otherwise, see "early
stopping", which in fact is how the bandit problem came to be). Algorithms for
the bandit problem are strictly superior to A/B testing in this setup.

2. The case you seem to be interested in is where you have n possible items
to display and you display k <= n simultaneously. In A/B testing land this is
known as multivariate testing. The problem comes from dependencies between the
items; otherwise it just reduces to k bandit problems. Typical MVT setups
assume linear relationships between items. You can do the same in a bandit
setup, and this is what (I think, from a quick read) the arxiv paper I linked
above does.

3. NHST (null hypothesis statistical testing, right?) is _not_ more powerful
than a bandit algorithm. Consider this: in your hypothesis test you have a
probability of making a mistake (determined by the p-value and the probability
of a type II error, which you only indirectly control). The expected regret is
thus Pr(error) * Cost(error) * forever (once you make your decision you're
stuck with it). Thus the expected regret is infinite (due to that "forever"
term); there is a rough numeric sketch of this at the end of this comment. If
you decide instead to continue making decisions, the probability of making an
error rises rapidly. If you decide to control for this, you're reinventing
sequential design of experiments / the bandit problem.

4. I blogged about the bandit problem because it's the direct analogue of A/B
testing. That doesn't mean there aren't more powerful algorithms available in
the field of decision theory. If you display your k items in sequence you're
doing reinforcement learning, for which there are algorithms with optimal
regret bounds. I've discussed k items simultaneously above. No doubt this is a
hard problem. The key idea to take away is that you have to control for your
uncertainty in the correct action, something that hypothesis testing doesn't
do.

That was long; I hope it sheds some light. Oh, and drop me an email -- I'd
love to at least ask you more questions about the kind of product you'd use.

------
StavrosK
This is actually revolutionary, if it works properly. I had never considered
the problem as an exploration vs exploitation issue, which it clearly is.

Imagine throwing a few alternatives at the problem, in any way you like, and
having an algorithm select the best ones automatically and optimally, without
you needing to hand-tune anything.

You wouldn't even need to select the best-performing variation, as the
algorithm would converge to it in the end. You could also throw in new ones to
be tested at any time, or have new ones produced automatically, e.g. in the
context of new items in e-stores (we have these new featured items, select the
best one to display on the front page).

I'm sure there's a catch (although, from what I remember of the algorithm, it
looks like there isn't), but I don't remember it very well and I don't have
time to read the paper thoroughly now.

~~~
e-dard
In a nutshell, the problem is that as you add more arms (e.g., landing-page
variations), the number of page views you need (i.e., actions tested) in order
to learn reasonably accurate reward distributions grows significantly.

Further, there are some other challenging aspects to this. 1) The environment
-- visitors' preferences over time, your competitors' pages -- is very
dynamic; 2) because you have to test and learn in the real world, your actions
may impact your future performance.

I used and designed these types of algorithms for my PhD research, but applied
them to trader selection between market exchanges; my thesis is in my sig if
anyone is interested.

~~~
StavrosK
Very true, but these are rather trivial concerns compared to A/B testing. With
A/B testing you still need a large number of samples, but on top of that you
also need to specify the (usually arbitrary) parameters and decide when to
stop the test.

The challenging aspects seem more of a limitation, but you would still be
testing things all the time, ostensibly, rather than testing once at the
beginning and staying there. I'd imagine some of the algorithms consider a
window of data, rather than everything since the beginning.

You could also discard old data to make sure past actions don't impact your
performance now.

Overall, the technique isn't a miracle cure, of course, but it's leaps and
bounds better than the split tests we have now.

------
snippyhollow
Relevant: <http://explo.cs.ucl.ac.uk/> (International Conference on Machine
Learning 2011, workshop on "exactly that".)

~~~
noelwelsh
Nice link. Thanks.

------
d2
From the Wikipedia entry:

"Originally [the bandit problem was] considered by Allied scientists in World
War II, it proved so intractable that it was proposed the problem be dropped
over Germany so that German scientists could also waste their time on it."

------
noelwelsh
Author here. We have a beta implementation of the idea available. Drop me a
line (noel at untyped dot com) if you're interested in trying it out.

~~~
djm
Hi, I'm getting a db connection error when trying to view the article so I'm
not sure what it's all about yet.

I'm aware of your work with Racket (I read your paper from a few years ago
about deploying the web server in a university environment).

Is this something I can plug into a Racket web app? If so, I'd be interested,
as I am building web apps with Racket. I'm not in a position where I can
actually deploy anything yet, but I'll want to build in statistics collection
when I am a little closer.

Edit: OK, I read your article via the cached link another user posted. I'll
have a look at your linked papers etc. when I get a chance, but it definitely
looks interesting. Your post wasn't clear on what form this beta will take -
is it a commercial venture to compete with Optimizely et al, or one of your
open source projects that I could require into my own app and deploy myself?

~~~
noelwelsh
We're making a commercial system. Implementing a bandit algorithm yourself is
quite straightforward. You could implement, say, UCB1 in a day. I might open
source some of that code if I get the time. Packaging that algorithm into a
usable and scalable system (unlike our blog ;-) is a lot more work.

PS: I've installed a cache so the blog should stay up now.

------
asharp
Just as a general note re: repeated significance-testing errors: wouldn't it
be possible to run a standard A/B test over a small but not insignificant
number of iterations, and then iterate over that a number of times?

You could then use Bayes to find a final estimate of, say, H1. As each
high-level iteration is fairly small, feedback can be provided to the user,
although it couldn't be acted on.

Speaking of which, if we have an expected number of false positives for any
given number of test scores, couldn't you treat the number of positives
generated as a random variable and then try to determine whether it differs
from the expected number of false positives?

It seems as though this type of error relies on the fact that a single false
positive stops the testing, rather than continuing on and allowing regression
to the mean. Preventing that should then stop, or at least reduce, this type
of error.

------
StavrosK
historious cache, because it's intermittently dropping for me:

<http://cache.historious.net/cached/1369875/>

------
asharp
Interesting.

Some of the claims made seem strange. Adding in additional choices is fine,
dealing with multiple choices is fine, modifying each page as you give it to
the user is fine; you're just adding in additional assumptions which, when
wrong, would completely ruin your test. Similarly, these results would
completely ruin a bandit algorithm, because it relies on a much larger set of
assumptions than a standard A/B test does.

One quick example: you lose temporal independence while testing X and Y. For
the first 15 rounds, X's metric is 100 and Y's is -100; after that, they are
reversed. With an epsilon-first bandit algorithm with N=15, the algorithm will
simply choose X forever.

That said, they are a very interesting set of algorithms and it'd be
interesting to see how brittle they are in practice.

------
wccrawford
It sounds exactly like A/B testing, but using a specific algorithm to
determine the winner.

It talks about comparing the current situation to the best situation... But in
most A/B testing, A would be the current 'best' and B would be the challenger.
Same thing.

It also talks about a reward for certain button-presses, but that isn't
actually what you want to optimize. You want to optimize revenue. So it's
possible this could send you down the wrong path.

And if it's saying you should pit the current site against the best the site
has done historically, that's ridiculous. You couldn't possibly put controls
on all of the factors involved. That's why A/B testing is special: All the
other factors are guaranteed to be as identical as possible.

~~~
noelwelsh
No, it's not A/B testing, for the reasons I try to explain in the post. A/B
testing can't change the choices while the experiment is running, doesn't
adjust to customer preferences in real time, etc.

You can optimise for revenue. Button presses was just a simple example.

~~~
mwexler
Can you compare to MVT (Multi-Variate Testing)? From an admittedly surface
reading of the post, this sounds like the somewhat common "A/B sucks, MVT is
more accurate at optimizing and showing impact" with a newer optimization
approach.

~~~
noelwelsh
In MVT you're interested in testing the interactions between different
elements. If the elements are encountered in sequence the bandit analogue is
reinforcement learning. If you have one page with different "slots" to fill
then "bandit slate" algorithms might be appropriate. The key advantage is that
all these approaches are online: they take advantage of information as it is
received, and you can change things and the algorithms adapt. A/B testing and
MVT don't do either.

------
wglb
So we have Patrick's descriptions of his experience of how A/B testing
produces measurable results for him with only a tiny bit of theory.

Now we have bandit theory ready for market, saying that A/B is not optimal.

I know you are thinking I am about to say "premature optimization". But
instead I'll just ask: what results does the group at Untyped have to show
versus Patrick's freely-available, simple-to-implement, proven results?

~~~
jplewicke
When I read this article, my first thought was that this would make a great
option in a/bingo. UCB1, which is one of the bandit methods discussed in this
thread, looks like it'd be relatively easy to implement: you just calculate a
simple formula for each alternative in the test and choose the alternative
with the highest result.

While it would definitely take some time and testing to see whether the bandit
method worked better in practice, it might be even easier to work with than
the current state of a/bingo. Instead of writing one line of code to choose
between alternatives and then checking back after a bit to see which one
worked best, you could just write a single line of code once and not worry
about it until you wanted to clean that program up and standardize on one
choice.

------
jamescoops
Is this the same as multi-variate testing?

------
tintin
Maybe a little off-topic, but are there places where you can find tips on
best-practice content? To be clear: I once read that a button labeled "read
more" is not very good because people don't like to read, but when you name it
something like "more about this subject" people get greedy.

------
michaelfairley
The multi-armed bandit solution has seen most of its real-world use in medical
trials, where randomly assigning patients to treatments when you have more
information available is highly unethical, as it can literally result in extra
loss of life.

------
jasonkolb
There is a standard way to do this called Taguchi Testing that has been around
in the manufacturing world for years. I have a Java API that does it that I've
been thinking about open sourcing, ping me if you have any interest in it.

------
d2
<http://www.math.ucla.edu/~tom/Stopping/Contents.html>

See chapter 7.

------
fedd
it says "Error establishing a database connection" when i click the link. it's
true and even a bit on topic ;)

edit: hallelujah, the db is up now, after 30 minutes. can somebody upvote me
back now? ;)

------
hammerbrostime
Anything simple out there to make it available to the masses?

~~~
noelwelsh
If you're interested in trying it, drop me a line at noel at untyped dot com.

~~~
golden_apples
I'm curious as to how you plan on implementing your results... Are you working
on integration with existing CMS systems (WP/Drupal/etc) or are you building
something more abstracted?

