Stop A/B Testing and Make Out like a Bandit

paraschopra · on Aug 1, 2011

This is interesting (and I'm yet to study the algorithm in detail) but I guess you are comparing two dissimilar problems. One is hypothesis testing and other is continuous optimization.

The reason A/B testing has many parameters is because at the end of data collection it hopes to employ a hypothesis test to determine whether variation was better performing as compared to control (the interpretation of "better" is done according to parameters defined). The confidence level and other parameters allow you to do proper risk-reward analysis and see if you want to go ahead with variation. Moreover, minimal sample size (visitors/conversions) ensures that local (initial) fluctuations do not unnecessarily bias the test results. In fact, in A/B test first 100 visitors are virtually identical to last 100 visitors with regards to impact on final results.

However, I am guessing in bandit algorithms (since their primary usage is continuous optimization and not hypothesis testing) local fluctuations can have an adverse impact on ultimate results which may be okay for continuous optimization problems but not for projects where you need to determine with a defined level of confidence whether one version works better than the other.

Different needs, different algorithms.

noelwelsh · on Aug 1, 2011

In short, no. In the domain that we're talking about (website optimisation) bandit algorithms and A/B testing address exactly the same problem, and bandit algorithms do so more efficiently.

Bandit algorithms actually arose out of hypothesis testing when people started to think about stopping experiments early. One of the first papers on this is Chernoff's "Sequential Design of Experiments" (http://www.jstor.org/pss/2237415). Reading just the abstract on JStor will give the flavour of the idea, and bandit algorithms are the natural extension of this of this idea to a continuously running problem.

asharp · on Aug 1, 2011

Interesting.

So you take the output of the bandit algorithm and use that as an output (ie. X was the "best" design) for the next stage of design?

What type of bandit algorithm do you use?

noelwelsh · on Aug 1, 2011

My opinion is that it's best to just keep running the bandit algorithm. There is no need to stop it -- you can add and remove options as it is running.

As for the type of algorithm, I've referenced various bits of the literature elsewhere in the comments. I can't give details on our precise approach unfortunately.

ojilles · on Aug 1, 2011

The article doesn't do a very good job of explaining "Bandit algorithms". The closest s/he comes is this, but that really doesn't enlighten me:

So, what is the bandit problem? You have a set of choices you can make. On the web these could be different images to display, or different wordings for a button, and so on. Each time you make a choice you get a reward. For example, you might get a reward of 1 if a button is clicked, and reward of 0 otherwise. Your goal is to maximise your total reward over time. This clearly fits the content optimisation problem.

Edit: Anyone have better pointers? (Other than the UU article referenced in the post)

charliepark · on Aug 1, 2011

I could be misreading it, but I believe the premise is "more traction => more action".

If I'm right, the idea is that the algorithm dynamically weights the "display frequency" of the two (or n) options. So as one of the a/b options shows itself to be more successful, it's shown more frequently. Because the test is self-correcting, you as an A/B test runner don't have to decide when the results are significant enough, and the program itself will automatically choose the more successful option.

datadon · on Aug 1, 2011

Going purely off of your boiled down description, this seems like a more advanced version of Genetify (https://github.com/gregdingle/genetify/wiki).

@noelwelsh - care to comment on now what you're doing at Untyped differs from what Genetify offers?

e-dard · on Aug 1, 2011

The difference in performance between using evolutionary approaches and bandit algorithms is complex and problem dependent. There is no "one algorithm is better than the other".

I've used both approaches for a different application and neither dominate the other.

stavros · on Aug 1, 2011

That looks like it uses genetic algorithms, which is much less optimal and more exploratory than bandit algorithms. What noelwelsh is proposing would lead to better results much more quickly and without random permutations of elements.

noelwelsh · on Aug 1, 2011

I agree :)

noelwelsh · on Aug 1, 2011

That's correct.

stavros · on Aug 1, 2011

The original formulation of the problem is that you have a slot machine (one-armed bandit) with multiple arms, each arm having different payoffs.

You want to maximise payoff while minimizing time spent on arms that don't pay off as well, but you obviously need to spend enough money on everything to estimate the probabilities.

That's the gist of the problem.

tel · on Aug 1, 2011

The gist (from reading the paper) is to compute a ranking of the choices based on past performance that optimally selects the best bandit based on incomplete (and stochastic) observations.

I can understand some hesitance to post the exact algorithm because, like anything that fixes the large number of heinous flaws in basic hypothesis testing, it gets complicated.

Worse, the paper presents a number of algorithms which appear, at first glance, to be very simple and performant. The primary problem with that is they are analyzing a very simplified version of the problem; assuming independence greatly reduces the practical flexibility of the algorithm.

Anyway, the basic flavor of the algorithm is to try to spend most of your time running the choice with the highest average reward (click-throughs, dollars) with some deterministic or stochastic correction for the chance that a worse performing model just happened to have the highest average at this point in your test.

fhars · on Aug 1, 2011

Here is another paper that looks promising: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.109....

From the abstract: The multi-armed bandit problem for a gambler is to decide which arm of a K-slot machine to pull to maximize his total reward in a series of trials.

The problem is: should he continue to pull the arm that has given the best payout until now, or try another arm he doesn't yet know much about to see if it has a higher payout?

rakkhi · on Aug 1, 2011

Agreed. So what I love about the A/B test is how simple it is to explain and run. How do I run a bandit test? Is it effectively what Google is doing when you tick the adwords option display successful ad's more often?

Is there a simple article that explains how to run a bandit test on exactly the example provided by the OP: which button text (e.g. click here, buy now) should I use to get maximum number of conversions?

noelwelsh · on Aug 1, 2011

There are a bunch of papers I can point you to. Is that what you're looking for?

smu · on Aug 1, 2011

I would like that.

noelwelsh · on Aug 1, 2011

Check out "Prediction Learning and Games" for a fairly recent introduction to the fundamental. See the work by Daniel Hsu, John Langford, and Jean-Yves Audibert (to name a few of the more prominent researchers) for interesting recent work. If you want more detail drop me a line.

MLnick · on Aug 1, 2011

+1 for Langford. He and many others (e.g. Deepak Agarwal) at Y! are among the most prolific publishers on this topic. Check out: http://hunch.net/~exploration_learning/ for a good, but pretty technical overview.

coffeemug · on Aug 1, 2011

I think the big issue here is that the bandit problem assumes independence between machines, while this is almost certainly an incorrect assumption to make when analyzing user behavior. For example, I might be able to increase convergence by changing the button text to "buy now" and changing the background color to black, but not each one independently. Conversely, changing the button text to "buy now" might hurt convergence if the background color is black, but improve convergence if the background color is white.

Essentially that means that if I make N changes at the same time and convergence changes, it's not possible to tell which combination of the changes affected convergence, and how (at least not without a new series of hypothesis tests to establish controls). Perhaps only one change made the difference and the rest were irrelevant, perhaps multiple changes made the difference, etc.

The bandit problem is much simpler because it guarantees that none of the variables depend on each other. If there is no such guarantee, we're effectively stuck with searching through an exponentially large space of possibilities, and using NHST to tell the difference.

asharp · on Aug 1, 2011

It also assumes that the distribution remains constant over time, an assumption that is, well, wrong.

But that's not very interesting. What's interesting is if it's still useful regardless, or if there is some sort of modification that would improve it taking that into account.

rfergie · on Aug 1, 2011

Couldn't that be modelled as 4 bandits:

1. Original colour/Original text 2. Original colour/"Buy Now" 3. Black colour/Original Text 4. Black colour/"Buy Now"

A bit of a curse of dimensionality for anything except the simplest cases, but that is a different problem.

noelwelsh · on Aug 1, 2011

What you say is absolutely true of the basic bandit problem. More complex algorithms can deal with more complex problems, handling the interdependencies you describe. See the "bandit slate" problem, for example.

coffeemug · on Aug 1, 2011

The slate problem still assumes each machine is independent (i.e. you want to pick k machines, but picking a particular machine does not effect the probability function of other machines). The ordered slate problem adds one more variable and deals with two dimensions, but it doesn't deal with an arbitrary number of dimensions.

Think about it - if you don't know which of the variables aren't independent, you're effectively searching an exponential space. You can prune it a bit but establishing that some variables are in fact independent, but the space of possibilities is still enormous. You can use standard multivariable function optimization techniques and set up correlation tests to guide the pruning, but you still need to use NHST at each one of these steps.

noelwelsh · on Aug 1, 2011

I agree with the general thrust of what you're saying -- it's a hard problem and no amount of algorithmic finesse can get around the basic problem of the exponentially growing space. In the context of this post, however, I think you can still do a lot better than A/B testing or MVT :).

Here's a paper that I think (I haven't read it in detail yet) addresses the problem you're talking about: http://arxiv.org/pdf/1105.4871v1

Another approach would be to model interactions as a (undirected?) Bayesian network and try to learn the structure of the network from data. I've had reasonable success doing this in the past but with a much simpler problem.

This is certainly something we're looking at, but I think there is enough work building a profitable business addressing the basic problem.

coffeemug · on Aug 1, 2011

Yes, I think it's a huge commercial opportunity. I could use such a product too, and I sincerely wish you luck. However, the bandit approach is almost certainly not the right way to do it, and most of the value will come from the tools and integration, not a magical statistical model.

I do believe Bayesian network construction will work, though I don't know how much better it would do than NHST.

noelwelsh · on Aug 1, 2011

Woah, we've been talking past each other in a big way. Let me try to clarify:

1. The setup in the bandit problem is identical to the setup in standard A/B testing as applied to web content optimisation. The only difference is that in the bandit problem you are allowed make decisions as data arrives; in A/B testing you have to wait till your experiment completes (otherwise, see "early stopping" which in fact is how the bandit problem came to be). Algorithms for the bandit problem are strictly superior to A/B testing in this setup.

2. The case you seem to be interested in is where you have n possible items to display and you display k <= n simultaneously. In A/B testing land this is known as multivariate testing. The problem comes from dependencies between the items, otherwise it just reduces to k bandit problems. Typical MVT setups assume linear relationships between items. You can do the same in a bandit setup, and this what (I think from a quick read) the arxiv paper I linked above does.

3. NHST (null hypothesis statistical testing, right?) is not more powerful than a bandit algorithm. Consider this: in your hypothesis test you have a probability of making a mistake (determined by the p-value and probability of a type II error which you only indirectly control). The expected regret is thus Pr(error) * Cost(error) * forever (once you make your decision you're stuck with it). Thus the expected regret is infinite (due to that "forever" term). If you decide instead to continue making decisions the probability of making an error rises rapidly. If you decide to control for this you're reinventing sequential design of experiments / the bandit problem.

4. I blogged about the bandit problem because it's the direct analogue of A/B testing. That doesn't mean there aren't more powerful algorithms available in the field of decision theory. If you display your k items in sequence you're doing reinforcement learning, for which there are algorithms with optimal regret bounds. I've discussed k items simultaneously above. No doubt this is a hard problem. The key idea to take away is that you have to control for your uncertainty in the correct action, something that hypothesis testing doesn't do.

That was long; I hope it sheds some light. Oh, and drop me an email -- I've love to at least ask you more questions about the kind of product you'd use.

jerf · on Aug 1, 2011

OK, but so what? I stipulate up front that there is no magically perfect algorithm. What are you proposing be done about this?

Your criticism applies to any practical algorithm equally, so we're going to have to judge their value on other dimensions.

coffeemug · on Aug 1, 2011

You misunderstand my criticism. The problem isn't that they're proposing an imperfect algorithm to a problem that's too computationally expensive to solve perfectly. The problem is that what they're proposing is strictly worse than NHST (which is the current status quo).

Look at it this way. If you're testing N different pieces of your site, trying Ni changes for each, and some of the variables are dependent on each other (which is almost certainly the case), then doing local improvements to each part is as likely to make the result worse as it is to make it better. NHST does not have that problem because you switch to a new layout only when you have enough evidence that it increases convergence. So the proposed algorithm is strictly worse than NHST. Not all imperfect algorithms are created equal.

eru · on Aug 1, 2011

Or look up Fractional Factorial Designs (https://secure.wikimedia.org/wikipedia/en/wiki/Fractional_fa...).

stavros · on Aug 1, 2011

This is actually revolutionary, if it works properly. I had never considered the problem as an exploration vs exploitation issue, which it clearly is.

Imagine throwing a few alternatives at the problem, in any way you like, and having an algorithm select the best ones automatically and optimally, without you needing to hand-tune anything.

You wouldn't even need to select the best-performing variation, as the algorithm would converge to it in the end. You could also throw in new ones at any time to have tested, or have new ones produced automatically, e.g. in the context of new items in e-stores (we have these new featured items, select the best one to display on the front page).

I'm sure there's a catch, but I don't remember the algorithm very well (and, from what I remember, it looks like there isn't), and I don't have time to read the paper thoroughly now.

noelwelsh · on Aug 1, 2011

I like the way you talk :)

As @e-dard says, the more page variations (arms) you add, the more traffic you need. Non-stationary behaviour isn't a huge problem in this setup. The "non-stochastic" (aka adversarial) algorithms address this, but I don't actually see web pages traffic displaying a great deal of non-stationary behaviour to begin with. A bandit designed with I.I.D. assumptions and, say, a weighted history seems to work fine. Ad optimisation, which is an interesting area for extending the product, seems to be highly non-stationary.

The other issue I'm concerned with is interactions between different variations (called multi-variate testing in the A/B world). This becomes a reinforcement learning problem. There are good algorithms for RL but I'm not aware of much work that also includes generalisation (aka context or function approximation) in a principled way.

tansey · on Aug 1, 2011

>There are good algorithms for RL but I'm not aware of much work that also includes generalisation (aka context or function approximation) in a principled way.

Why wouldn't you use TD-learning? You have what seems like a relatively small state space that conforms well to a table representation. You need to assume it's Markovian, but you can just artificially inflate the state to include 3-4 clicks of history. Seems straight-forward, right?

noelwelsh · on Aug 1, 2011

TD-learning is only optimal in the limit. More modern algorithms bound their loss at any point in time. The basic idea is the same, however.

tansey · on Aug 1, 2011

Sure. I was merely giving an entry-level example of a "principled" approach that would handle this situation.

So the concern then is the time it takes to converge on an optimal value? Is that even possible?

I'm not a content optimization expert by any means, so I'm just curious if we have any evidence that the landscape is static. It seems entirely possible (maybe even probable) that an optimal option now may grow stale and actually be suboptimal next week/month.

noelwelsh · on Aug 1, 2011

You're basically correct in your assessment. The time to converge to the optimal action is formalised as the accumulated regret. This is the sum of the error made over time. If you assume the landscape is static -- called stationary in the literature -- you can achieve within a constant factor of optimal regret. Good algorithms only differ in their constant factors.

If the landscape is not static it is called non-stationary. Amazingly you can still be optimal (for a slightly different definition of optimal) in the non-stationary case! This is called the nonstochastic bandit in the literature.

Finally, are websites static or not? It depends :) However it seems that most are static. At least the one where we're applying our tech. are. Something like Amazon probably wouldn't be, but the non-stationary bits of their problem are probably best viewed as a product recommendation problem. UI tweaks might still be reasonably modelled as stationary.

e-dard · on Aug 1, 2011

In a nutshell, the problem is that as you add more arms (e.g., landing page variations) the amount of page views you need (i.e., actions tested), in order to learn reasonably accurate reward distributions, grows significantly.

Further, there are some other challenging aspects to this. 1) The environment -- visitors' preferences over time, your competitors' pages -- is very dynamic; 2) because you have to test and learn in the real world, your actions may impact your future performance.

I used and designed these types of algorithms for my PhD research, but applied them to trader selection between market exchanges; my thesis is in my sig if anyone is interested.

stavros · on Aug 1, 2011

Very true, but these are rather trivial concerns compared to A/B testing. With A/B testing you still need a large amount of samples, but you also need to specify the (usually arbitrary) parameters, and decide when to stop the test on top of that.

The challenging aspects seem more of a limitation, but you still would be testing things all the time, ostensibly, rather than test once in the beginning and then stay there. I'd imagine some of the algorithms have a window of data to consider, rather than everything since the beginning.

You could also discard old data to make sure past actions don't impact your performance now.

Overall, the technique isn't a miracle cure, of course, but it's leaps and bounds better than the split tests we have now.

jerf · on Aug 1, 2011

I would imagine that in practice you can do things like incrementally fold in new alternatives and expire out old ones as the algorithms evaluate them and still get nearly optimal performance, even if the theory screams bloody murder about it. The theory is screaming about the pathological cases, but I suspect that for things like "the optimal product special to display" you're not looking at something that is in practice all that pathological.

MLnick · on Aug 1, 2011

I have been thinking for a while now about the applications of bandits to financial markets (not so much the Q- and TD- learning approaches as I am a bit less familiar with them). I will definitely have a detailed look at the thesis, sounds very interesting!

snippyhollow · on Aug 1, 2011

Relevant: http://explo.cs.ucl.ac.uk/ (International Conference of Machine Learning 2011, Workshop on "exactly that".)

noelwelsh · on Aug 1, 2011

Nice link. Thanks.

d2 · on Aug 1, 2011

From the Wikipedia entry:

"Originally [the bandit problem was] considered by Allied scientists in World War II, it proved so intractable that it was proposed the problem be dropped over Germany so that German scientists could also waste their time on it."

noelwelsh · on Aug 1, 2011

Author here. We have a beta implementation of the idea available. Drop me a line (noel at untyped dot com) if you're interested in trying it out.

djm · on Aug 1, 2011

Hi, I'm getting a db connection error when trying to view the article so I'm not sure what it's all about yet.

I'm aware of your work with racket (read your paper from a few years ago about deploying the web server in a uni environment).

Is this something I can plug into a racket web app? If it is I'd be interested as I am building web apps with racket. I'm not in a position where I can actually deploy something yet but I'll want to build in statistics collection when I am a little closer.

Edit: OK, I read your article via the cached link another user posted. I'll have a look at your linked papers etc when I get a chance but it definitely looks interesting. Your post wasn't clear on what form this beta will take - is it a commercial venture to compete with optimizely et al or one of your open source projects I could require into my own app and deploy myself?

noelwelsh · on Aug 1, 2011

We're making a commercial system. Implementing a bandit algorithm yourself is quite straightforward. You could implement, say, UCB1 in a day. I might open source some of that code if I get the time. Packaging that algorithm into a usable and scalable system (unlike our blog ;-) is a lot more work.

PS: I've installed a cache so the blog should stay up now.

asharp · on Aug 1, 2011

Just as a general note re repeated sig testing errors: Wouldn't it be possible to run a standard A/B testing run over a small but not insignificant number of iterations, and then iterate over that a number of times?

You could then use bayes to find a final estimate of, say, H1. As each high level iteration is fairly small, feedback can be provided to the user, although it couldn't be acted on.

Speaking of which, if we have an expected number of false positives for any given number of test scores, couldn't you take the average number of positives generated as an rv and then try and determine if it is different to the expected number of false positives?

It seems as though this type of error relies on the fact that a single false positive stops the testing rather then continuing on and allowing regression to the mean. By stopping this, it should then stop, or at least reduce this type of error.

stavros · on Aug 1, 2011

historious cache, because it's intermittently dropping for me:

http://cache.historious.net/cached/1369875/

asharp · on Aug 1, 2011

Interesting.

Some of the claims made seem to be strange. Adding in additional choices is fine, dealing with multiple choices is fine, modifying each page as you give it to the user is fine, you're just adding in additional assumptions, which, when wrong would completely ruin your test. Similarly these results would completely ruin a bandit algorithm, because it relies on a much larger set of assumptions then a standard a/b test.

One quick example: You lose temporal independence and you are testing X and Y. For the first 15 rounds, X's metric is 100 and Y's is -100. after that, they are reversed. With an epsilon first gambit algorithm with N=15, the algorithm will simply choose X forever.

That said, they are an very interesting set of algorithms and it'd be interesting to see how brittle they are in practice.

wccrawford · on Aug 1, 2011

It sounds exactly like A/B testing, but using a specific algorithm to determine the winner.

It talks about comparing the current situation to the best situation... But in most A/B testing, A would be the current 'best' and B would be the challenger. Same thing.

It also talks about a reward for certain button-presses, but that isn't actually what you want to optimize. You want to optimize revenue. So it's possible this could send you down the wrong path.

And if it's saying you should pit the current site against the best the site has done historically, that's ridiculous. You couldn't possibly put controls on all of the factors involved. That's why A/B testing is special: All the other factors are guaranteed to be as identical as possible.

noelwelsh · on Aug 1, 2011

No, it's not A/B testing for the reasons I try to explain the in the post. A/B testing can't change the choices while the experiment is running, doesn't adjust to customer preferences in real-time, etc.

You can optimise for revenue. Button presses was just a simple example.

mwexler · on Aug 1, 2011

Can you compare to MVT (Multi-Variate Testing)? From an admittedly surface reading of the post, this sounds like the somewhat common "A/B sucks, MVT is more accurate at optimizing and showing impact" with a newer optimization approach.

noelwelsh · on Aug 1, 2011

In MVT you're interesting in testing the interactions between different elements. If the elements are encountered in sequence the bandit analogue is reinforcement learning. If you have one page with different "slots" to fill then "bandit slate" algorithms might be appropriate. The key advantage is that all these approaches are online: they take advantage of information as it is received, and you can change things and the algorithms adapt. A/B testing and MVT don't do either.

jamescoops · on Aug 1, 2011

Yes what is the difference between bandit and MVT?

wglb · on Aug 1, 2011

So we have Patrick's described experiences about how A/B testing produces measurable results for him with only a tiny bit of theory.

Now we have the bandit theory ready for market, saying that A/B is not optimal.

I know you are thinking i am about to say "premature optimization". But instead I'll just ask what results to the group at untyped have to show versus Patrick's freely-available simple-to-implement proven results?

jplewicke · on Aug 2, 2011

When I read this article, my first thought was that this would make a great option in a/bingo. UCB1, which is one of the bandit methods discussed in this thread, looks like it'd be relatively easy to implement: you just calculate a simple formula for each alternative in the test and choose the alternative with the highest result.

While it would definitely take some time and testing to see whether the bandit method worked better in practice, it might be even easier to work with than the current state of a/bingo. Instead of writing one line of code to choose between alternatives and then checking back after a bit to see which one worked best, you could just write a single line of code once and not worry about it until you wanted to clean that program up and standardize on one choice.

jamescoops · on Aug 1, 2011

Is this the same as multi-variate testing?

tintin · on Aug 1, 2011

Maybe a little oftopic, but are there places where you can find tips about best-practice content? To be clear: I once read that a button labeled "read more" is not very good because people don't like to read. But when you name it something like "more about this subject" people get greedy.

michaelfairley · on Aug 1, 2011

The multi-armed bandit solution has seen most real world use in medical trials. Randomly assigning patients to treatments when you have more information available is highly unethical, as it can literally result in extra loss of life.

jasonkolb · on Aug 1, 2011

There is a standard way to do this called Taguchi Testing that has been around in the manufacturing world for years. I have a Java API that does it that I've been thinking about open sourcing, ping me if you have any interest in it.

d2 · on Aug 1, 2011

http://www.math.ucla.edu/~tom/Stopping/Contents.html

See chapter 7.

fedd · on Aug 1, 2011

it says "Error establishing a database connection" when i click the link. it's true and even a bit on topic ;)

edit: hallelujah, the db is up now, after 30 minutes. can somebody upvote me back now? ;)

hammerbrostime · on Aug 1, 2011

Anything simple out there to make it available to the masses?

noelwelsh · on Aug 1, 2011

If you're interested in trying it, drop me a line at noel at untyped dot com.

golden_apples · on Aug 1, 2011

I'm curious as to how you plan on implementing your results... Are you working on integration with existing CMS systems (WP/Drupal/etc) or are you building something more abstracted?

karambahh · on Aug 13, 2011

I'm definitely interested in trying it out...