The problem with this article is that it's a) not very detailed, and b) the conclusion is linkbait. Bandit optimization is a useful tool, but it has drawbacks, and it's not always better than A/B testing. In particular, bandit approaches take longer to converge (on average), and don't give you reliable ways to know when to stop testing (when all you know is that you're using an approach that's optimal in the limit of a large N, your only guarantee is that things get better as N gets large). These techniques also make assumptions that aren't valid for a lot of web experiments: identical "bandit" distributions that are constant over time. Throw a few choices that are optimal at different times of day/week/month/year at a bandit optimizer, and it'll just happily fluctuate between them.
Also, there's a lot of variation in performance depending on the parameters of your test -- some of which are completely unknowable. So if you want to really learn about this method, you need to read more than a blog post where the author has concluded that bandit optimization is the new pink. For example, here's a pretty readable paper that does an empirical analysis of the various popular bandit algorithms in different paramterizations:
This is just one article, but there's tons of literature on this problem. (In fact, if you use the 'softmax' criterion mentioned in that article, you're doing something very similar to simulated annealing, which is a rather elderly optimization technique.)
This type of error (which can happen very easily on a website going through continuous improvement) can take a very long time to recover from.
A/B tests do not have an issue with this because all versions will have similar mixes of data from before and after the improvement.
Better yet, make x dependent in some way on time since the last change, and relative change in performance of all options from before and after the change.
My problem with the bandit method is that I want to show the same test choice to the same person every time he sees the page so you can hide that there is a test. If I do this with the bandit algo then it warps the results because different cohorts have different weightings of the choices and differing cohorts behave very differently for lots of reasons.
A forgetting factor will help.
This is a variant of the cohort issue that you're talking about.
The cohort issue that you're talking about raises another interesting problem. If you have a population of active users, and you want to test per user, you often will find that your test population ramps up very quickly until most active users are in, and then slows down. The window where most users are assigned is a period where you have poor data (you have not collected for long, users have not necessarily had time to go to final sale).
It seems to me that if you want to use a bandit method in this scenario, you'd be strongly advised to make your fundamental unit the impression, and not the user. But then you can't hide the fact that the test is going on. Whether or not this is acceptable is a business problem, and the answer is not always going to be yes.
Absolutely. That's my biggest objection to bandit methods, but it's also the fuzziest objection, and the one least likely to appeal to hyper-analytical people. There's a strong temptation (as we can see from this article) is to treat bandit optimization as a black box that just spits out an infallible answer (i.e. as a Lazy Button).
It's the same human tendency that has led to "you should follow me on twitter" to be one of the more common n-grams on the interwebs (even though it probably never worked for more than Dustin Curtis, and likely causes a counter-intuitive backlash now).
Of course, these algorithms are cool in certain contexts where you never want to think about it again. For example, ad placement for a giant ad network running against a lot of UGC content. They remind me of neural networks like that: the post office doesn't care how the machine recognizes hand-written zip codes, just as long as they do it reliably.
But startups are all about learning, and interface design is where you have direct contact with users. I want to soak in the data. I want to hang out in people's living rooms when they're using my site. A fancy Lazy Button (great phrase) is the last thing I need.
The A/B test just gives you the conversion rate of each option. And so does the bandit. As I understand, the only difference is that the bandit will be bugging your users with bad results less often.
Of course, the real answer to why not is "we have an A/B system and I'm not going to add bandit stuff for no benefit". But even if I were doing it from scratch, it seems more complex. The benefit of these approaches seems to be that one no longer has to think. We want to think.
No, because the assumptions that underpin many statistical techniques are violated when you're not assigning people to cohorts consistently, and at random.
The standard confidence tests -- t-tests, G-tests, chi-squared tests, etc. -- based on distributions of independent, identically distributed (iid) data.
I'd have to think about it more, but I believe that btilly's examples are also the most intuitive reasons why independence matters. If your data is time-dependent, then assigning users to cohorts based on past performance lets the time dependency dominate. There may be other good examples.
Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing. In both multi-armed bandit and A/B testing you need to set ahead of time the number of users you are going to run your test against and stop the test at that point regardless of if your result is significant or not.
See http://elem.com/~btilly/effective-ab-testing/index.html#asli... for part of a presentation that I did where I actually set up some reasonable fake tests, and ran simulations. What I found is that if there is a significant difference, the probability of coming to the wrong conclusion was (as you would expect) higher, but not that high before the underlying difference made mistakes incredibly unlikely. Conversely if there is only a small real difference, the amount of data needed before you have a significant chance of having accidentally come to a erroneous conclusion is very, very long.
So avoid accepting any result where you don't have at least a few hundred successes and set your thresholds reasonably high. You will make fewer mistakes than you probably fear, and the ones that you make will almost always be very minor. (Oops, I had a 3% chance of accepting the 1% worse solution as probably better.)
Of course if you're aiming to publish academic research, your standards need to be higher. But if you're more interested in getting useful results than publishable ones, you can relax your standards. A lot.
Nobody said that it was. But when you do regular split testing, you can use power analysis to estimate the length of time you need to run an experiment to get a significant result at a certain precision:
You can't do this (at least, not easily) when you're using bandit models, because none of the assumptions are valid.