The reason A/B testing has many parameters is because at the end of data collection it hopes to employ a hypothesis test to determine whether variation was better performing as compared to control (the interpretation of "better" is done according to parameters defined). The confidence level and other parameters allow you to do proper risk-reward analysis and see if you want to go ahead with variation. Moreover, minimal sample size (visitors/conversions) ensures that local (initial) fluctuations do not unnecessarily bias the test results. In fact, in A/B test first 100 visitors are virtually identical to last 100 visitors with regards to impact on final results.
However, I am guessing in bandit algorithms (since their primary usage is continuous optimization and not hypothesis testing) local fluctuations can have an adverse impact on ultimate results which may be okay for continuous optimization problems but not for projects where you need to determine with a defined level of confidence whether one version works better than the other.
Different needs, different algorithms.
Bandit algorithms actually arose out of hypothesis testing when people started to think about stopping experiments early. One of the first papers on this is Chernoff's "Sequential Design of Experiments" (http://www.jstor.org/pss/2237415). Reading just the abstract on JStor will give the flavour of the idea, and bandit algorithms are the natural extension of this of this idea to a continuously running problem.
So you take the output of the bandit algorithm and use that as an output (ie. X was the "best" design) for the next stage of design?
What type of bandit algorithm do you use?
As for the type of algorithm, I've referenced various bits of the literature elsewhere in the comments. I can't give details on our precise approach unfortunately.
So, what is the bandit problem? You have a set of choices you can make. On the web these could be different images to display, or different wordings for a button, and so on. Each time you make a choice you get a reward. For example, you might get a reward of 1 if a button is clicked, and reward of 0 otherwise. Your goal is to maximise your total reward over time. This clearly fits the content optimisation problem.
Edit: Anyone have better pointers? (Other than the UU article referenced in the post)
If I'm right, the idea is that the algorithm dynamically weights the "display frequency" of the two (or n) options. So as one of the a/b options shows itself to be more successful, it's shown more frequently. Because the test is self-correcting, you as an A/B test runner don't have to decide when the results are significant enough, and the program itself will automatically choose the more successful option.
@noelwelsh - care to comment on now what you're doing at Untyped differs from what Genetify offers?
I've used both approaches for a different application and neither dominate the other.
You want to maximise payoff while minimizing time spent on arms that don't pay off as well, but you obviously need to spend enough money on everything to estimate the probabilities.
That's the gist of the problem.
I can understand some hesitance to post the exact algorithm because, like anything that fixes the large number of heinous flaws in basic hypothesis testing, it gets complicated.
Worse, the paper presents a number of algorithms which appear, at first glance, to be very simple and performant. The primary problem with that is they are analyzing a very simplified version of the problem; assuming independence greatly reduces the practical flexibility of the algorithm.
Anyway, the basic flavor of the algorithm is to try to spend most of your time running the choice with the highest average reward (click-throughs, dollars) with some deterministic or stochastic correction for the chance that a worse performing model just happened to have the highest average at this point in your test.
From the abstract: The multi-armed bandit problem for a gambler is to decide which arm of a K-slot machine to pull to maximize his total reward in a series of trials.
The problem is: should he continue to pull the arm that has given the best payout until now, or try another arm he doesn't yet know much about to see if it has a higher payout?
Is there a simple article that explains how to run a bandit test on exactly the example provided by the OP: which button text (e.g. click here, buy now) should I use to get maximum number of conversions?
Essentially that means that if I make N changes at the same time and convergence changes, it's not possible to tell which combination of the changes affected convergence, and how (at least not without a new series of hypothesis tests to establish controls). Perhaps only one change made the difference and the rest were irrelevant, perhaps multiple changes made the difference, etc.
The bandit problem is much simpler because it guarantees that none of the variables depend on each other. If there is no such guarantee, we're effectively stuck with searching through an exponentially large space of possibilities, and using NHST to tell the difference.
But that's not very interesting. What's interesting is if it's still useful regardless, or if there is some sort of modification that would improve it taking that into account.
1. Original colour/Original text
2. Original colour/"Buy Now"
3. Black colour/Original Text
4. Black colour/"Buy Now"
A bit of a curse of dimensionality for anything except the simplest cases, but that is a different problem.
Think about it - if you don't know which of the variables aren't independent, you're effectively searching an exponential space. You can prune it a bit but establishing that some variables are in fact independent, but the space of possibilities is still enormous. You can use standard multivariable function optimization techniques and set up correlation tests to guide the pruning, but you still need to use NHST at each one of these steps.
Here's a paper that I think (I haven't read it in detail yet) addresses the problem you're talking about: http://arxiv.org/pdf/1105.4871v1
Another approach would be to model interactions as a (undirected?) Bayesian network and try to learn the structure of the network from data. I've had reasonable success doing this in the past but with a much simpler problem.
This is certainly something we're looking at, but I think there is enough work building a profitable business addressing the basic problem.
I do believe Bayesian network construction will work, though I don't know how much better it would do than NHST.
1. The setup in the bandit problem is identical to the setup in standard A/B testing as applied to web content optimisation. The only difference is that in the bandit problem you are allowed make decisions as data arrives; in A/B testing you have to wait till your experiment completes (otherwise, see "early stopping" which in fact is how the bandit problem came to be). Algorithms for the bandit problem are strictly superior to A/B testing in this setup.
2. The case you seem to be interested in is where you have n possible items to display and you display k <= n simultaneously. In A/B testing land this is known as multivariate testing. The problem comes from dependencies between the items, otherwise it just reduces to k bandit problems. Typical MVT setups assume linear relationships between items. You can do the same in a bandit setup, and this what (I think from a quick read) the arxiv paper I linked above does.
3. NHST (null hypothesis statistical testing, right?) is not more powerful than a bandit algorithm. Consider this: in your hypothesis test you have a probability of making a mistake (determined by the p-value and probability of a type II error which you only indirectly control). The expected regret is thus Pr(error) * Cost(error) * forever (once you make your decision you're stuck with it). Thus the expected regret is infinite (due to that "forever" term). If you decide instead to continue making decisions the probability of making an error rises rapidly. If you decide to control for this you're reinventing sequential design of experiments / the bandit problem.
4. I blogged about the bandit problem because it's the direct analogue of A/B testing. That doesn't mean there aren't more powerful algorithms available in the field of decision theory. If you display your k items in sequence you're doing reinforcement learning, for which there are algorithms with optimal regret bounds. I've discussed k items simultaneously above. No doubt this is a hard problem. The key idea to take away is that you have to control for your uncertainty in the correct action, something that hypothesis testing doesn't do.
That was long; I hope it sheds some light. Oh, and drop me an email -- I've love to at least ask you more questions about the kind of product you'd use.
Your criticism applies to any practical algorithm equally, so we're going to have to judge their value on other dimensions.
Look at it this way. If you're testing N different pieces of your site, trying Ni changes for each, and some of the variables are dependent on each other (which is almost certainly the case), then doing local improvements to each part is as likely to make the result worse as it is to make it better. NHST does not have that problem because you switch to a new layout only when you have enough evidence that it increases convergence. So the proposed algorithm is strictly worse than NHST. Not all imperfect algorithms are created equal.
Imagine throwing a few alternatives at the problem, in any way you like, and having an algorithm select the best ones automatically and optimally, without you needing to hand-tune anything.
You wouldn't even need to select the best-performing variation, as the algorithm would converge to it in the end. You could also throw in new ones at any time to have tested, or have new ones produced automatically, e.g. in the context of new items in e-stores (we have these new featured items, select the best one to display on the front page).
I'm sure there's a catch, but I don't remember the algorithm very well (and, from what I remember, it looks like there isn't), and I don't have time to read the paper thoroughly now.
As @e-dard says, the more page variations (arms) you add, the more traffic you need. Non-stationary behaviour isn't a huge problem in this setup. The "non-stochastic" (aka adversarial) algorithms address this, but I don't actually see web pages traffic displaying a great deal of non-stationary behaviour to begin with. A bandit designed with I.I.D. assumptions and, say, a weighted history seems to work fine. Ad optimisation, which is an interesting area for extending the product, seems to be highly non-stationary.
The other issue I'm concerned with is interactions between different variations (called multi-variate testing in the A/B world). This becomes a reinforcement learning problem. There are good algorithms for RL but I'm not aware of much work that also includes generalisation (aka context or function approximation) in a principled way.
Why wouldn't you use TD-learning? You have what seems like a relatively small state space that conforms well to a table representation. You need to assume it's Markovian, but you can just artificially inflate the state to include 3-4 clicks of history. Seems straight-forward, right?
So the concern then is the time it takes to converge on an optimal value? Is that even possible?
I'm not a content optimization expert by any means, so I'm just curious if we have any evidence that the landscape is static. It seems entirely possible (maybe even probable) that an optimal option now may grow stale and actually be suboptimal next week/month.
If the landscape is not static it is called non-stationary. Amazingly you can still be optimal (for a slightly different definition of optimal) in the non-stationary case! This is called the nonstochastic bandit in the literature.
Finally, are websites static or not? It depends :) However it seems that most are static. At least the one where we're applying our tech. are. Something like Amazon probably wouldn't be, but the non-stationary bits of their problem are probably best viewed as a product recommendation problem. UI tweaks might still be reasonably modelled as stationary.
Further, there are some other challenging aspects to this. 1) The environment -- visitors' preferences over time, your competitors' pages -- is very dynamic; 2) because you have to test and learn in the real world, your actions may impact your future performance.
I used and designed these types of algorithms for my PhD research, but applied them to trader selection between market exchanges; my thesis is in my sig if anyone is interested.
The challenging aspects seem more of a limitation, but you still would be testing things all the time, ostensibly, rather than test once in the beginning and then stay there. I'd imagine some of the algorithms have a window of data to consider, rather than everything since the beginning.
You could also discard old data to make sure past actions don't impact your performance now.
Overall, the technique isn't a miracle cure, of course, but it's leaps and bounds better than the split tests we have now.
"Originally [the bandit problem was] considered by Allied scientists in World War II, it proved so intractable that it was proposed the problem be dropped over Germany so that German scientists could also waste their time on it."
I'm aware of your work with racket (read your paper from a few years ago about deploying the web server in a uni environment).
Is this something I can plug into a racket web app? If it is I'd be interested as I am building web apps with racket. I'm not in a position where I can actually deploy something yet but I'll want to build in statistics collection when I am a little closer.
Edit: OK, I read your article via the cached link another user posted. I'll have a look at your linked papers etc when I get a chance but it definitely looks interesting. Your post wasn't clear on what form this beta will take - is it a commercial venture to compete with optimizely et al or one of your open source projects I could require into my own app and deploy myself?
PS: I've installed a cache so the blog should stay up now.
You could then use bayes to find a final estimate of, say, H1. As each high level iteration is fairly small, feedback can be provided to the user, although it couldn't be acted on.
Speaking of which, if we have an expected number of false positives for any given number of test scores, couldn't you take the average number of positives generated as an rv and then try and determine if it is different to the expected number of false positives?
It seems as though this type of error relies on the fact that a single false positive stops the testing rather then continuing on and allowing regression to the mean. By stopping this, it should then stop, or at least reduce this type of error.
Some of the claims made seem to be strange. Adding in additional choices is fine, dealing with multiple choices is fine, modifying each page as you give it to the user is fine, you're just adding in additional assumptions, which, when wrong would completely ruin your test. Similarly these results would completely ruin a bandit algorithm, because it relies on a much larger set of assumptions then a standard a/b test.
One quick example: You lose temporal independence and you are testing X and Y. For the first 15 rounds, X's metric is 100 and Y's is -100. after that, they are reversed. With an epsilon first gambit algorithm with N=15, the algorithm will simply choose X forever.
That said, they are an very interesting set of algorithms and it'd be interesting to see how brittle they are in practice.
It talks about comparing the current situation to the best situation... But in most A/B testing, A would be the current 'best' and B would be the challenger. Same thing.
It also talks about a reward for certain button-presses, but that isn't actually what you want to optimize. You want to optimize revenue. So it's possible this could send you down the wrong path.
And if it's saying you should pit the current site against the best the site has done historically, that's ridiculous. You couldn't possibly put controls on all of the factors involved. That's why A/B testing is special: All the other factors are guaranteed to be as identical as possible.
You can optimise for revenue. Button presses was just a simple example.
Now we have the bandit theory ready for market, saying that A/B is not optimal.
I know you are thinking i am about to say "premature optimization". But instead I'll just ask what results to the group at untyped have to show versus Patrick's freely-available simple-to-implement proven results?
While it would definitely take some time and testing to see whether the bandit method worked better in practice, it might be even easier to work with than the current state of a/bingo. Instead of writing one line of code to choose between alternatives and then checking back after a bit to see which one worked best, you could just write a single line of code once and not worry about it until you wanted to clean that program up and standardize on one choice.
See chapter 7.
edit: hallelujah, the db is up now, after 30 minutes. can somebody upvote me back now? ;)