
Why Multi-armed Bandit algorithms are superior to A/B testing (with Math) - yummyfajitas
http://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs_ab.html
======
patio11
For what it's worth, I've been following this cross-Internet debate with more
than a little professional interest. Cards on the table: I have coded A/B
testing software, I frequently code and/or administer it for clients (often in
ways which are provably suboptimal), and I am a dirty loyalty-free scientist-
cum-capitalist-pig who would stab A/B testing in the back in a second if I
thought there were an easier way to extract more money for the same amount of
work.

I strongly, strongly suggest that anyone attempting to look at this problem
from the perspective of a site owner rather than a mathematical abstraction
read and digest btilly's comment from earlier this week:

<http://news.ycombinator.com/item?id=4040616>

The issues he lays out are _very real in the course of practical use of site
testing to actually make money_. In particular, his #2 would scare the heck
out of me, in a much deeper way than "A/B testing provably doesn't minimize
regret" worries me in the other direction. (Or e.g. other flaws with
particular A/B testing implementations. For example, repeatedly checking the
results of your A/B test and deciding to end it when you see significance has
been explained quite a few times as a bad idea with stats to match. However,
even if you check like a hyperactive squirrel, you're still winning, you're
just winning less often than you think you are. Take your B- in stats class
but proceed to make motivational amounts of money for the business.)

The worst possible takeaway you, personally, the typical HN reader, could
possibly have from this debate is "Oh, I guess I shouldn't A/B test then."
Pick A/B testing, bandit testing, whatever -- any option in the set, even with
poor algorithms and/or the easiest errors I can think of, strictly dominates
not testing at all. (Actually testing today also is better than "testing...
someday", which from my own experience and that of clients I know is something
which is very easy to slip into even if you theoretically know you should be
doing it.)

~~~
yummyfajitas
_Pick A/B testing, bandit testing, whatever -- any option in the set, even
with poor algorithms and/or the easiest errors I can think of, strictly
dominates not testing at all._

So I submitted this post and went off to boxing. On the train ride back, I
thought "I hope I don't make people think they shouldn't A/B test." And at the
same time you were writing your comment, I added a conclusion to my blog post
saying the same thing.

Bad A/B testing is an 80% solution. Good A/B testing is a 90% solution. Good
bandit is a 95% solution. 80% >> 0.

A/B testing has another benefit to software engineers that bandit doesn't - it
lets you delete code.

~~~
patio11
Unless I misunderstand bandit algorithms, there's a trivial modification that
makes the actual, practical administration of them essentially identical to
A/B testing with regards to when you can rip out code.

If A smashes B, then bandit will converge in a very obvious manner on A, and
you pick it and delete the B code branch, accepting future regret from the
possibility that B was in fact better as a cost of doing business. If A
doesn't smash B, then at an arbitrary point in the future you realize it has
not converged on either A or B, pick one branch using a traditional method
like "I kind of like A myself", delete the B code, and go on to something that
will actually matter for the business rather than trying to minimize your
regret function where both possible solutions are (provably) likely non-
motivational amounts of money between each other.
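
Concretely, the kind of rule I have in mind -- a rough sketch only, with
epsilon-greedy standing in for whatever bandit algorithm you actually use,
and the convergence threshold and deadline as arbitrary knobs:

    import random

    def run_bandit_then_decide(arms, epsilon=0.1, deadline=100_000, leader_share=0.9):
        """Serve traffic via epsilon-greedy; stop early if one arm clearly wins.
        arms: dict of name -> zero-argument function returning 1 on conversion, else 0.
        (Illustrative sketch only; the thresholds are arbitrary.)"""
        shows = {name: 0 for name in arms}
        wins = {name: 0 for name in arms}

        def rate(name):
            return wins[name] / shows[name] if shows[name] else 0.0

        for trial in range(1, deadline + 1):
            # Explore a random arm with probability epsilon, else exploit the leader.
            arm = random.choice(list(arms)) if random.random() < epsilon else max(arms, key=rate)
            shows[arm] += 1
            wins[arm] += arms[arm]()

            # "A smashes B": one arm has soaked up nearly all the traffic, so call it.
            leader = max(shows, key=shows.get)
            if trial > 1000 and shows[leader] / trial >= leader_share:
                return leader

        # Deadline hit without an obvious winner: pick by fiat and go do something
        # that actually matters for the business.
        return max(arms, key=rate)

Either way you end up deleting the losing branch; the only difference from
classical A/B is how the traffic was split while you waited.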

Please feel free to correct me if I'm wrong on this -- I have a bad flu today
and take only marginal responsibility for my actions.

~~~
yummyfajitas
You are correct for practical purposes.

However, you are mathematically wrong. Doing things the way you describe is
another form of A/B testing, just one designed to reduce the regret incurred
during the test. You still get the linear growth in regret I described (i.e.,
there is some % chance you picked the wrong branch, and you are leaving money
on the table forever after).

Of course, your way is also _economically_ correct, and the theoretical CS
guys are wrong. They shouldn't be trying to minimize regret, they should be
minimizing _time discounted_ regret.
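
To put rough symbols on it (my notation: $\Delta$ is the true per-visitor
difference in value between the branches, $p$ is the chance the test picked
the worse one, and $T$ is the number of future visitors):

    E[\text{regret}(T)] \approx p \, \Delta \, T

whereas with a discount factor $0 < \gamma < 1$,

    E[\text{discounted regret}] \approx p \, \Delta \sum_{t=0}^{\infty} \gamma^{t} = \frac{p \, \Delta}{1 - \gamma}

The first grows without bound; the second is finite, which is the
mathematical version of "pick one and go work on something that matters."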

(Yes, I'm being extremely pedantic. I'm a mathematician, it comes with the
territory.)

~~~
greendestiny
Well, fundamentally website optimization is not a multi-armed bandit problem.
You are not simply choosing between machines. Instead each website change is
another node on a tree, and you want to find the highest-performing path in a
constantly changing world, with the additional constraint that you can't keep
incompatible machines around for very long.

A guess at a good strategy in those situations would be something that enables
you to choose a good path from a small number of options and keep doing so -
on the basis that the cumulative advantage will be far more important than any
particular choice in itself. It sounds like A/B testing would win if you could
actually map the complexity of the problem correctly.

------
noelwelsh
Let's settle this with science rather than rhetoric. I'd like to do some
proper comparisons between bandit algorithms and A/B testing. Unfortunately we
haven't been saving time series data at Myna, so we don't have any test data.
If anyone has time series data from an A/B test, and is happy to donate it to
the cause, please get in touch (email in profile).

Updated for clarity.

~~~
patio11
Write out what .CSV columns you need and what formats they need to be in, and
I will happily get you this for a handful of A/B tests. (Though probably not
faster than late July. As much as I love A/B testing there is the small matter
of a wedding and honeymoon to throw a wee bit of a wrench into my near term
schedule...)

~~~
nyellin
Gosh, congratulations Patrick. You have helped so many of us. Good luck ;)

------
tmoertel
What’s being glossed over here, and explains a lot of the confusion around
which is the best method of “solving” the multi-arm-bandit problem, is the
classical bias-variance tradeoff. All of the methods presume some model of the
problem, and some of those models are more flexible than others. When a model
is more flexible, it allows solutions to take on more shapes and must burn
more of its training data choosing among those shapes. Models that are more
biased toward a particular shape, on the other hand, can use more of their
data for convergence and so converge more rapidly.
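
(The classical decomposition, for reference, with $\hat{\theta}$ a method's
estimate of an arm's value and $\theta$ the true value:

    E\big[(\hat{\theta} - \theta)^2\big] = \underbrace{\big(E[\hat{\theta}] - \theta\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}(\hat{\theta})}_{\text{variance}}

More flexible models shrink the first term at the cost of inflating the
second, and vice versa.)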

Which method is “best” depends on what you know about the problem. Does its
optimal solution look a certain way? change over time? and so on. If you’re
willing to bet on your answers to those questions, you can choose a method
that’s biased toward your answers, and you’ll converge more rapidly on a
solution. The risk, however, is that you’ll bet wrong and converge on a poor
solution (because your biases rule out better solutions).

If you’re not willing to bet on your answers, you can choose a method that
will place bets for you based on what it sees in the data. But now you’re
burning some of your data on betting. So that’s the tradeoff: you can use more
of _your_ knowledge to place bets (and risk placing the wrong bets), or more
of the _data_ ’s knowledge to place bets (and burn some of your data on
betting). Where you adjust the slider between those two is up to you.

Which brings us back to our original question. Which method of solving the
multi-arm-bandit problem is best? It depends a lot on where you want to adjust
the slider. Which depends on your knowledge, aversion to risk, and expected
payoffs.

In life, sometimes one size does not fit all. If you’re going to test one shoe
size against another, make sure you know which foot will end up wearing the
winner. Likewise, if you’re going to compare algorithms for solving the multi-
arm-bandit problem, make sure you know the particulars of the problem _you_
to solve.

------
mxfh
What is the half-life of this debate? 300 days or less?
<http://news.ycombinator.com/item?id=2831455>

Sadly, the option to drop this problem over Germany doesn't exist any more.[1]

[1] [http://en.wikipedia.org/wiki/Multi-armed_bandit#Empirical_mo...](http://en.wikipedia.org/wiki/Multi-armed_bandit#Empirical_motivation)

------
blueskittle
As someone who works for a major e-commerce site, I am often the one who has
the most influence when it comes time to decide which testing method to adopt.
Multi-armed Bandit testing can be good, just as standard A/B testing can be
good. But the factor which trumps all of these is the total cost of testing
(and the return on investment to the business). One must consider the
following before undertaking any of these testing methods:

1\. Implementation Costs - How much time will it take to implement the testing
code? Some tests are easier to implement than others.

2\. Maintenance Costs - How much time will it cost to maintain the test for
the duration of the testing period? We've ignored this in the past only to
realize on occasion that implementation introduces bugs which incur cost and
can be disruptive.

3\. Opportunity Costs - What is the cost of doing the test versus not doing
the test? Consider setup time, analysis, and final implementation.

After going through a few tests now, we have a pretty good sense for what the
total cost to the business is. We don't really look at it as adopting one test
method over the other, but instead rely upon the projected ROI to test this
versus that, versus doing nothing.

~~~
btilly
If you've conducted multiple tests and "time to implement the testing code" is
a major consideration, then you're doing it wrong. If ROI is also a major
consideration, then again you're doing it wrong.

Seriously, to add an email test right now at the company I'm contracting for
takes 2 lines of code. One appears in the program that sends email and looks
like:

    
    
        $email_state_contact->ab_test_version("test_1234", {A => 1, B => 1});
    

where test is the name of a test, and 1234 is a ticket number to avoid
accidental conflicts of test names. The other appears in a template and looks
something like this:

    
    
    .../[% ab_test.test_1234 == 'A' ? 'button1' : 'button2' %].png...
    

That's it. The test automatically shows up in the daily reports. When it wins,
you get rid of that code and put the right thing in the template.

Done.
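
(For the curious: the helper itself is not magic either. It just has to
assign each contact to an arm deterministically and record the assignment for
the report. A rough illustrative sketch in Python - not our actual Perl, and
the names are made up:)

    import hashlib

    def ab_test_version(contact_id, test_name, weights):
        """Deterministically bucket a contact into one arm of a test.
        weights: e.g. {'A': 1, 'B': 1} for a 50/50 split. Same contact, same
        test => same arm every time. (Illustrative sketch only.)"""
        digest = hashlib.md5(f"{test_name}:{contact_id}".encode()).hexdigest()
        bucket = int(digest, 16) % sum(weights.values())
        for arm, weight in sorted(weights.items()):
            if bucket < weight:
                return arm
            bucket -= weight

The other half is just logging the assignment so it shows up in the daily
report.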

~~~
polyfractal
I can imagine a number of situations where the implementation is significantly
more complex. While _ideally_ A/B tests should be looking at relatively small
changes, where each change is independent, many times people are making
profoundly larger changes.

If you are testing the conversion rate in shopping carts, and the change
involves drastic redesigns of the flow through the shopping cart process, that
could be a serious technological difference and require substantial time to
implement.

Not every test is as easy as changing the copy on an email.

~~~
btilly
Even if you're making larger and more complex changes, the overhead of your
testing methodology remains the same. That is, how you measure things should
be a fixed (small) effort; the cost of building the test is whatever the test
itself costs.

In other words multi-armed bandit versus A/B test is something that you
shouldn't be deciding based on the effort of the testing methodology.

~~~
polyfractal
I don't think he was referring to the technology behind the A/B test itself,
but rather the technology behind the change that was being made.

That's how I interpreted his statement. I agree with you that the actual A/B
testing overhead should be minimal and fairly trivial to put into place.

------
conductrics
Disclaimer: I also have software for running AB/MVT as well as adaptive
control problems (so bandits as well as extended sequential decisions) at
www.conductrics.com.

I wouldn't sweat UCB methods vs e-greedy or other heuristics for balancing
explore/exploit too much. E-greedy (and e-greedy decreasing) is nice because
it is simple. Softmax/Boltzmann is interesting in that it selects arms
weighted by their estimated means, and UCB-Tuned and UCB-Normal are nice
because, like AB testing, they take variance measures directly into account
when selecting an arm. Take a look at this paper from Doina Precup (who is
super nice BTW) and Volodymyr Kuleshov,
<http://www.cs.mcgill.ca/~vkules/bandits.pdf>, which has comparisons between
various methods. Guess what - the simple methods work just fine.

Of course there are various Bayesian versions - esp. of UCB. Nando de Freitas
over at UBC has a recent ICML paper on using Gaussian Processes for bandits
(based on a form of UCB); see
<http://www.cs.ubc.ca/~nando/papers/BayesBandits.pdf>. I have not given it a
tight read, but I'm not sure what the practical return would be. Plus you have
to fiddle with picking a kernel function, and I imagine length scales and the
rest of the hyperparameters associated with GPs. I did read a working paper
from Nando a few years back that used a random forest as a prior - I can't
seem to find it now.

BTW - John Langford is program chair of this year's ICML over in Edinburgh. If
you are in the UK it might be worth it to pop up and attend. Plus Chris
Williams is there at Edinburgh, so maybe you can corner him about GPs.
Although he has moved on from GPs, he still wrote (well, co-wrote) the book
and is one of the smartest people I have ever met.
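
If you want to see how small the differences really are, here are three of
the selection rules side by side (e-greedy, softmax, and plain UCB1) - my own
illustrative Python, not code from the paper:

    import math
    import random

    def e_greedy(means, counts, epsilon=0.1):
        # With prob epsilon pick a random arm, otherwise the current best mean.
        # (counts unused here; kept for a uniform signature.)
        if random.random() < epsilon:
            return random.randrange(len(means))
        return max(range(len(means)), key=lambda i: means[i])

    def softmax(means, counts, temperature=0.1):
        # Pick arms with probability proportional to exp(mean / temperature).
        weights = [math.exp(m / temperature) for m in means]
        return random.choices(range(len(means)), weights=weights)[0]

    def ucb1(means, counts):
        # Play each arm once, then pick the arm with the highest upper bound:
        # mean + sqrt(2 * ln(total plays) / plays of this arm).
        for i, c in enumerate(counts):
            if c == 0:
                return i
        total = sum(counts)
        return max(range(len(means)),
                   key=lambda i: means[i] + math.sqrt(2 * math.log(total) / counts[i]))

Everything else (tracking counts and means, updating them after each reward)
is the same across all three, which is part of why the simple methods hold up
so well in the comparisons.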

------
gcanyon
Doesn't this method also (over)simplify? It goes into pseudo-explore mode
anytime there is a >0 probability that the presumed worst case is actually
better than the presumed best/better case. Shouldn't there be a threshold on
that process, so that the presumed worst case must have at least a (for
example) .05 probability of being better before the algorithm gives it a shot?

------
luigi
So am I getting this right?

(Edit) So the two steps to run are:

1\. Run a traditional A/B test until 95% confidence is reached. This is full
exploration.

2\. Then switch to the MAB, showing the better-performing variant most of the
time. As time increases, the display of the worse-performing variants
decreases.

~~~
aaronjg
Option 1 will NOT give you the correct answer. You CANNOT use confidence
intervals as a stopping criterion. If you do this, you end up effectively
running many tests, and then you need to apply a multiple-testing correction
to account for this. Otherwise you run a VERY HIGH risk of picking the wrong
result.

I emphasize, because this is a common mistake made by A/B test practitioners.
For a fuller discussion of the problems, check out the papers by Armitage
(frequentist) and Anscombe (Bayesian) on the topic. Or see my summary of the
issue here:

[http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-te...](http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-testing/)
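
If you want to convince yourself, an A/A simulation makes the problem
obvious: two identical arms, with the z-test re-run after every batch of
visitors, get declared "significant at 95%" far more often than 5% of the
time. A rough sketch (batch size and horizon are arbitrary):

    import random
    from statistics import NormalDist

    def peeking_false_positive_rate(n_sims=2000, batch=100, n_batches=50, rate=0.05):
        """Fraction of A/A tests (identical arms) declared 'significant' when the
        two-proportion z-test is re-run after every batch. Illustrative sketch."""
        z_crit = NormalDist().inv_cdf(0.975)   # two-sided 5% threshold
        false_positives = 0

        for _ in range(n_sims):
            wins = [0, 0]
            n = [0, 0]
            for _ in range(n_batches):
                for arm in (0, 1):
                    n[arm] += batch
                    wins[arm] += sum(random.random() < rate for _ in range(batch))
                p1, p2 = wins[0] / n[0], wins[1] / n[1]
                pooled = (wins[0] + wins[1]) / (n[0] + n[1])
                se = (pooled * (1 - pooled) * (1 / n[0] + 1 / n[1])) ** 0.5
                if se > 0 and abs(p1 - p2) / se > z_crit:
                    # Arms are identical, so any "significant" stop is spurious.
                    false_positives += 1
                    break

        return false_positives / n_sims

With dozens of peeks the false positive rate lands well above the nominal 5%,
which is exactly the multiple-testing problem described above.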

~~~
luigi
Sorry I wasn't clear. I meant run #1 first, then run #2. I didn't mean them as
different options.

------
raverbashing
What, now it's indirect argument over HN?

Dear UX specialists with no knowledge of statistics: you can use the MAB algo
with 2 choices, no problem. And it is a better way of getting 'the right
choice'.

Dear statisticians: there's more to life (and to UX) than A/Bing (or MABing)
everything.

~~~
dwc
_> What, now it's indirect argument over HN?_

This has obviously been standard procedure for a while now. I see this on
almost a daily basis. Afraid your comment on a HN thread won't get enough
traction? Make a blog post instead, meant expressly for submission to HN (or
reddit, or ...)

------
ygmelnikova
My guess is 98% of developers use neither.

~~~
patio11
You are, sadly, overshooting the worldwide population of A/B testing
developers by at least an order of magnitude. Great news for my consulting
business, bad news for everyone else.

