The point of multi-armed bandit situations is that there is a trade-off to be made between gaining new knowledge and exploiting existing knowledge. This comes up in your charts - the "MAB" runs always have better conversion rates because they balance the two modes, while the "A/B testing" runs always gain information more quickly because they ignore exploitation and focus only on exploration.
I should also say that multi-armed bandit algorithms aren't supposed to be run as a temporary "campaign" - they are "set it and forget it". In epsilon-greedy, you never stop exploring, even after the campaign is over. In this way, you don't need to achieve "statistical significance", because you're never taking the risk of choosing one path for all time. In traditional A/B testing, there's always the risk of picking the wrong choice.
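For the curious, here's a minimal epsilon-greedy sketch in Python (toy code of my own, not any vendor's implementation) - note that the explore branch never goes away:

    import random

    def choose_arm(counts, rewards, epsilon=0.1):
        # counts[i] = times arm i was shown; rewards[i] = conversions on arm i
        if random.random() < epsilon:
            # explore: pick a uniformly random arm (this never stops happening)
            return random.randrange(len(counts))
        # exploit: pick the arm with the best observed conversion rate
        rates = [r / c if c else 0.0 for r, c in zip(rewards, counts)]
        return max(range(len(rates)), key=rates.__getitem__)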
You aren't really comparing A/B testing to a multi-armed bandit algorithm, because both are multi-armed bandit strategies - you're in a bandit situation either way. The strategy you were already using for your A/B tests is a different common bandit strategy, called "epsilon-first" by Wikipedia, and there is a bit of literature on how it compares to epsilon-greedy.
Also, despite this being a slightly pro-A/B-testing post, I have to say it's actually made me more interested in trying out Myna's MAB approach.
/former A/B test software dev who fought my users to try to stop them from misinterpreting results, and failed.
Acceptance of this fact has avoided a lot of potential ulcers for me.
See my longer top-level comment for some of the trade-offs.
I'm concerned that the assumptions necessary for those statistical tests are not being met. As far as I know, significance tests are generally based on the assumption that all samples are independent, but MAB violates that assumption, and I do mean grossly. MAB doesn't pass those tests because it is, in some sense, itself a statistical significance test, and it is "deliberately" not feeding the independence-based significance algorithms the data they expect to be getting. This is not a bug, it's a feature.
(Pardon the anthropomorphizing there. It's still sometimes the best way to say something quickly in English.)
In fact, your agreement that MAB has a higher conversion rate - which in this context means nothing more and nothing less than that it works better - while doing worse on some measure Q is probably better interpreted as evidence that Q is not a useful measurement, rather than that the thing that works better shouldn't be used due to its lack of Q-ness.
Also, I don't see the problem of lack of independence. People are still randomly assigned to one of the conditions in the MAB scheme, people in one sample don't affect people in the other sample, and each person submits one independent datapoint. The problem is just one of unequal sample sizes in your two conditions, so when you make the comparison (in a one factor ANOVA, say), you lose lots of statistical power because your effective sample size is essentially the harmonic mean of the two sample sizes (which is heavily biased towards the lower number of the two).
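To illustrate the harmonic-mean effect with made-up numbers (a tiny sketch; 1100 visitors in both scenarios):

    def harmonic_mean(n1, n2):
        return 2.0 / (1.0 / n1 + 1.0 / n2)

    # 1100 visitors split evenly: effective per-group n is the full 550
    print(harmonic_mean(550, 550))   # 550.0
    # the same 1100 visitors split 1000/100 by a bandit:
    print(harmonic_mean(1000, 100))  # ~181.8, far less power for the comparison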
No, they're only partially randomly assigned to one of the conditions. MAB has a memory based on what the results of the previous runs are, which is basically another way of phrasing that there's a dependence.
By contrast, randomly allocating X% of your population to cohort C_i is not a dependence on anything but the free parameter (X).
If the assignments are the source of the problem, you can't just take them as "given" and assume that things are OK.
A simpler example: if I assign people to cohorts based on their age, then the results of my experiment may be "conditionally independent" with respect to (say) eye color, but it probably won't be conditionally independent with respect to income or reading level or pant size. In other words, the assumption of independence is violated with respect to all variables correlated with age.
With bandit optimization, our bucket allocation scheme is correlated with time, which makes it nearly impossible to apply conventional statistical tests on any metric that may also be correlated with time. And in web testing, that's nearly everything.
tldr: The methodology is flawed.
Though like a commenter said, comparing these two algorithms is actually like comparing apples to oranges, and that was precisely the conclusion.
Which may remove the bias (I'm not sure about this) but it doesn't inspire confidence.
In their results, when they compare a MAB with a 50% exploration rate to their split test, you start to see a comparable amount of time to converge. Also, they only show the results of one simulation with a lot of randomness in it. Given we're all stats nerds, it would've been handy to see a box plot for each style of simulation across multiple runs.
Having said all of that, my biggest concern with MAB as it's being sold is the lack of thought given to the experiments. In the end it's just data, and it doesn't mean anything without human intuition and preconceptions guiding it. Examples: day of week, time of year/day, a new page people are still getting used to, an old one that garners lots of repeat traffic, etc. When running tests there's a lot more to consider than just "whatever the data says".
Your simulation is a fiction. It only informs you of the behavior of the code you wrote. To connect statistical significance to reality requires honoring criteria you have not met.
Applying frequentist evaluation concepts to Bayesian and Markov learning methods will lead to mistaken conclusions, because they follow from a different set of axioms. Both lead to numbers, but these numbers are not interchangeable, and you must remember their qualifications when applying them to predicting reality.
In more frequentist terms, multi-armed bandit is searching for minimum total regret via a random walk, not rapid convergence to a significance predicate. It can do this in the face of changing conditions and weaker assumptions about control than frequentist significance requires, which is why such methods are now the norm in machine learning.
There are huge opportunities that you will not see with your current perspective. If you do not learn them, your competitors will.
I do not know of a particular reference for multi-armed bandit algorithms specifically, but they are a specialized case of Markov model learning, described with vocabulary that predates a more modern, broad, and uniform understanding of these topics.
David Barber's recent book is very good.
If you want a broad understanding of common mistakes due to unstated metaphysical assumptions concerning statistics, experiment and action, read Judea Pearl's 2nd edition.
In reality, however, tests like the Wilcoxon test are 99% as powerful and more robust to model misspecification.
I bring this up because while statistical power and significance are very important metrics in the theory of picking good tests, they're actually pretty terrible ones in practice. Comparing MAB, which optimizes an entirely different loss function, to t/z-tests on power is sort of meaningless.
MAB can produce a cleaner workflow for many kinds of websites. Underperforming classes will be underrepresented and eventually pruned. The increased power of a batch test isn't necessarily so important in this context. I'm not actually advocating MAB over other tests, just that you shouldn't spend too much time worrying about power comparisons unless you're genuinely comparing apples to apples.
A very common example: you might have a form that converts higher on weekends than weekdays just because of some external factor. You fire up a test with this "multi-armed bandit method" on Friday morning with two variations that are exactly equal. By random chance, variation A starts winning by the end of Friday, and it gets shifted to 90% of impressions. Due to the external factor - it is now the weekend, which naturally increases conversions - variation A's conversion rate is going to pull ahead of variation B's, maybe even significantly. However, it in fact isn't better at all.
Now this may be uncommon, but think about this: Your website is changing all the time. If you happen to make a change on your website that affects conversion rates and you do not keep consistent bucket assignment ratios, then any ongoing experiment is going to be tainted.
I appreciate that the "multi-armed bandit" method increases the conversion rates on your site during the test, but to be honest, when I am running a test I'm not worried about increasing the conversion rate during the test. I'm worried about increasing the conversion rates in the much longer future, and the best way to do that accurately is to get accurate and significant results from my test as quickly as possible.
This also debunks one of the advantages claimed in the blog post, that you can add variations at any time. You cannot add variations at any time, because if you do, you are comparing apples to oranges. The conversion rates from time period A-->C cannot be assumed to be the same as from B-->C, because there have likely been changes between A and B that could affect the "base" conversion rates (like website changes, or external factors like weekend vs. weekday). When you add a new variation, you are creating a new A/B test that must be analyzed independently.
However I have been thinking about it since, and it is possible to design a multi-armed bandit approach with logarithmic regret (though higher by a constant factor than a traditional approach), that can handle the varying performance ratio. It also would allow you to add variations at any time.
There remain operational differences, but this problem is fixable.
Here is the absurd case. The conversion rate is 10% on Friday and 50% on Saturday:
Friday (~10% conversion):
A: 10 / 100
B: 11 / 100
Saturday (~50% conversion):
A: 5 / 10
B: 45 / 90
Combined:
A: (10 + 5) / (100 + 10) = 13.6%
B: (11 + 45) / (100 + 90) = 29.5%
This is 99.9% statistically significant even though both variations are exactly the same: http://www.thumbtack.com/labs/abba/#A=15%2C110&B=56%2C19...
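If you want to check that yourself, here's a quick two-proportion z-test over those pooled numbers (my own code, not the linked tool's; it comes out around p ~= 0.002, i.e. "significant" at better than the 99% level, even though A and B are identical):

    from math import sqrt
    from scipy.stats import norm

    a_conv, a_n = 10 + 5, 100 + 10    # A pooled: 15 / 110 ~= 13.6%
    b_conv, b_n = 11 + 45, 100 + 90   # B pooled: 56 / 190 ~= 29.5%

    p_a, p_b = a_conv / a_n, b_conv / b_n
    p = (a_conv + b_conv) / (a_n + b_n)           # pooled rate under the null
    se = sqrt(p * (1 - p) * (1 / a_n + 1 / b_n))  # standard error under the null
    z = (p_b - p_a) / se
    print(2 * (1 - norm.cdf(abs(z))))             # two-sided p-value ~= 0.002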
Here are some things I've found:
1) Absolute conversion rate. After a certain point, whatever you add will just detract from the performance of something else. That detraction could either be from the landing page itself (if you're lucky) or some longer-term variable (hope those "conversions" don't cost you too much money). I have had both occur.
2) "Statistically significant" can just be noise when variables are fairly close to each other. After getting rid of the obvious losers, I've watched the "winner" of elements like button color change back and forth daily for weeks, with no clear winner, even with 30,000+ conversions a day flowing through. This is the kind of thing visualwebsiteoptimizer would write a case study on 1 hours worth of traffic and declare a winner.
3) Your brand can be shit on when dealing with returning users. They are used to seeing one thing and now they see something else. Imagine if the colors, theme, and button locations changed every day (or, to be fair, once a week) when you visited HN. Often "better converting" designs actually convert worse when introduced to existing customers.
4) Failure to account for external variables, especially when dealing with small sample sizes. Testing is often done most vigorously with paid traffic sources as the monetary incentive is direct. The traffic source itself is often a bigger determining factor behind the conversion rate than the design. Small budget/sample size, and you could end up with some pretty poor test results that the math will tell you are correct.
I am not saying don't test. I am saying A/B testing, split testing, multivariate testing, etc. are abused.
"So, comparing A/B testing and multi-armed bandit algorithms head to head is wrong because they are clearly meant for different purposes. A/B testing is meant for strict experiments where focus is on statistical significance, whereas multi-armed bandit algorithms are meant for continuous optimization where focus is on maintaining higher average conversion rate."
Like any business owner, at the end of the day I care more about conversion rates than about statistical significance.
I'm not a scientist trying to advance the sum of humanity's knowledge. I'm a business owner trying to find the shortest path between what my customers need and what I can profitably offer them.
In a way, statistical significance strikes me as a bit of a fool's errand, because significant results in one context may not be generalizable to another, which means even if we know for a near certainty what worked best, it's hard to apply that knowledge reliably in the future.
Of course, with MAB we could still wait for statistical significance if we want it, before turning off variations that are performing worse. And we can certainly still try to draw conceptually useful conclusions by designing our test variations in ways that facilitate easy comparison.
But with MAB and Myna it sounds like we can pretty well count on higher conversion rates at the end of the day, as well, and that counts for a lot in a business context.
I'm grateful to the VWO team for writing up their analysis and findings, and being so frank about the relative advantages of A/B and MAB. Their summary above tells me what I need to know.
Anyone can run "simulations" to prove anything. Providing just a summary table is of little use. I am not saying that the Wingify folks are trying to mislead people - just that this article doesn't have sufficient rigor to justify its conclusions.
OTOH, many CS papers, even published ones, don't provide source code or datasets so people can replicate the results, so perhaps this is the "new normal" ;).
I had double-checked the code, but it is quite possible that I made an oversight somewhere.
However, running experiments to prove that one 'process' is superior to another in a real-world situation often involves more than running a chi-square test on randomly generated data ;) (which is what I understood your code to be doing on a very brief glance at it - sorry if I got it wrong).
There is nothing wrong with starting your exploration with something like this, but it really isn't sufficient to make the claims in the blog post about specific ways in which MAB is better/worse than A/B testing.
This seems to be methodologically dubious. I am not a stats expert, though I use statistics in my work, and I could be wrong, but testing convergence to statistical significance doesn't seem to mean anything (mathematically/statistically)!
I could be wrong - there are people with stats PhDs here and I'll leave it to them to tell me if I am. But I've never heard of an experiment (in the formal sense) which ran until a significance-checking algorithm crossed some tripwire level.
Is the sampling random? (especially for an MAB) What are the underlying distributions (and assumptions about them)?
In your article you (imo) either need to say something like "this test isn't really valid and so the conclusions shouldn't be taken very seriously" (which would be weird thing for a company with your product to blog!) or you should do some rigorous experiment design and be able to defend your results.
Right now it reads like some MBA type plugged two black-box formulae into an Excel sheet and drew conclusions from the results. (Please note: I am not saying you did it that way - just that the article reads like not enough thought went into setting up a proper experiment.)
(Statistical) experiment design is a lot of work, surprisingly hard to carry off well, and involves much subtlety - and theory! There are whole books written about it. Maybe time to buy a few? :)
The goal of A/B testing is to decide on the best variation as quickly as possible.
The goal of a bandit is to optimize the given content.
I wouldn't use a bandit to "decide" the button color, but rather as a simple recommendation system. This seems by far more natural to me, as it reacts better to changing optima.
Example: Let's say I run a larger "fun content" media webpage. I have "awesome videos", "funny images" and "goofy articles". On the bottom and right side of each content page I show follow-up content. I would use a bandit here to optimize the mix of images, videos, and text I recommend.
To spice it up: I would create cohorts for my typical user behaviours (e.g. registered male user) and only feed interactions from the last two weeks into my bandit calculations.
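A rough sketch of that setup (the cohort keys, window length, and arm names are all just illustrative):

    import random, time

    WINDOW = 14 * 24 * 3600  # only count interactions from the last two weeks
    ARMS = ["videos", "images", "articles"]
    history = {}  # (cohort, arm) -> list of (timestamp, reward) pairs

    def recent_stats(cohort, arm, now):
        events = [r for t, r in history.get((cohort, arm), []) if now - t < WINDOW]
        return len(events), sum(events)

    def recommend(cohort, epsilon=0.1):
        # per-cohort epsilon-greedy over the sliding window
        now = time.time()
        if random.random() < epsilon:
            return random.choice(ARMS)
        def rate(arm):
            n, wins = recent_stats(cohort, arm, now)
            return wins / n if n else 0.0
        return max(ARMS, key=rate)

    def record(cohort, arm, clicked):
        history.setdefault((cohort, arm), []).append((time.time(), 1 if clicked else 0))

So recommend("registered male user") reacts only to that cohort's recent behaviour, and old fads age out of the window automatically.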
Who cares about conversion rates, right? I'd much rather have statistical significance than more clicks on my buy button / signups for my site / etc.
Statistics is about managing unknowns, not creating knowns, and everyone ignores this!
Let's stop a bad meme before it gets going.
Multi-armed bandit is a general problem, and there are many strategies for tackling it. Epsilon-greedy is one such strategy. It is not the only one, and it is not even a particularly good one. In fact A/B testing is itself a solution to the multi-armed bandit problem.
In fact the standard way to classify strategies is by their asymptotic average regret - how much time you spend pulling the bad lever. An epsilon-greedy strategy has linear regret: even after a version has won, you keep pulling the losing lever a certain fraction of the time.
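To put a number on "linear": with a fixed exploration rate \epsilon over K arms, the expected regret of epsilon-greedy grows roughly as

    R(T) \approx \frac{\epsilon}{K} \sum_i \Delta_i \cdot T

where \Delta_i is the payoff gap between arm i and the best arm (a standard back-of-envelope result; notation mine). Because \epsilon never decays, regret accrues at a constant rate forever.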
Let's consider an alternate approach. Run an A/B test, then forever after exploit it. This is also a linear regret algorithm. Given two levers and the way you're going to run the test, there is a probability that you will come to the wrong conclusion and forever after do the wrong thing. Most of the time your regret is fixed, but sometimes it is not.
So they are equal? Not exactly. A/B testing converges quickly, and has a very small amount of linear regret forever after. An epsilon-greedy approach converges more slowly. Furthermore to lower the long-term regret you have to drop epsilon, which slows convergence. By the time you lower it to a level that is comparable to an A/B testing approach, convergence is so slow as to be operationally infeasible.
Thus the conclusion of this article is that A/B testing is a better operational strategy than epsilon-greedy. And it is. But there are known strategies that are better still. The algorithms that Myna uses, for instance, have logarithmic regret. In a multi-armed bandit situation, they will soundly beat A/B testing just as A/B testing soundly beats epsilon-greedy.
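Myna doesn't publish its exact algorithm, so as a stand-in, here is UCB1, the textbook logarithmic-regret strategy (my own sketch):

    from math import log, sqrt

    def ucb1_choose(counts, rewards):
        # counts[i] = number of pulls of arm i; rewards[i] = total reward from arm i
        for i, c in enumerate(counts):
            if c == 0:
                return i  # pull every arm once before doing anything clever
        total = sum(counts)
        # mean payoff plus a confidence bonus; the bonus shrinks as an arm
        # accumulates pulls, so losing arms get tried less and less often,
        # and total regret grows only as O(log T)
        def score(i):
            return rewards[i] / counts[i] + sqrt(2 * log(total) / counts[i])
        return max(range(len(counts)), key=score)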
At that point you're left with two questions. The first is whether multi-armed bandit is a good model for website optimization. The second is whether there are any operational advantages to one approach over the other.
The answer to the first, for traditional multi-armed bandit algorithms, is a sound no. While relative conversion rates remain comparable, absolute conversion rates vary for all sorts of reasons. (Time of day, day of week, longer cycles, external website improvements, etc.) That was my big criticism the last time this discussion popped up, and there was no good answer to it.
But behind the scenes Noel (the CEO of Myna) and I have been in a discussion. I believe that I can produce a multi-armed bandit algorithm that has logarithmic regret and will work in the face of ever varying conversion rates, as long as there is a consistent relative difference between the versions.
At that point you are left with the operational differences. If you want to test at a granularity of "user", a website often has to put most of its active population into the test, before collecting enough data to know which version was good. With A/B testing you will eventually put everyone into the better version. With a multi-armed bandit, half your current users are stuck in the wrong version.
In effect you have a limited population of levers available to pull, and have to pull a good chunk of them before you start getting feedback on the rewards.
You can avoid this problem by testing at the granularity of an impression. Whether or not this is OK to do is a business question, and the answer will not always be yes. If it is not, then a multi-armed bandit approach is inappropriate to use.
That sounds interesting. As you undoubtedly know, there is a lot of literature on multi-armed bandit problems and also on the non-stochastic multi-armed bandit problem. The latter has the advantage of not assuming anything about the way in which the sequence of rewards is generated.
There is a line of work in the non-stochastic setting which sounds related to what you are describing. It's called the problem of tracking the best expert: http://www.cse.ucsc.edu/~mark/papers/track-long.ps
The idea in this problem is to do well as compared not to the best fixed action but rather the best piecewise constant sequence of actions. There are many variations of this problem, but the basic idea is that you can have low regret as compared to a changing sequence of actions so long as that sequence doesn't change too "quickly" for some definition of quickly.
Here is the specific problem that I am interested in.
Suppose we have a multi-armed bandit where the probabilities of reward are different on every pull, and the distributions of rewards are not known. It is known that there is one arm which, on every pull, is more likely by some minimum percentage to produce a reward than any other arm. Which arm this is is unknown, as is the minimum percentage.
In other words payoff percentages change quickly, but the winner is always a winner, and your job is to find it. I have an algorithm that succeeds with only logarithmic regret. I have no idea whether anyone else has come up with the same idea, and there is so much literature (much of which I presume is locked up behind paywalls) that I cannot easily verify whether anyone else has looked at this variant.
There are many, many different variations of the method and result. For example, you can show improved results by assuming different things about the rewards, such as O(log(T)) regret under certain additional assumptions. This is a really good book on different non-stochastic online learning problems:
The book looks like it would take a long time to work through, and they don't get to the multi-armed bandit problem until chapter 6. It goes into my todo list, but is likely to take a while to get off of it...
I agree though that if the difference in conversion rate were stochastic that could potentially make the problem much easier.
A/B can be thought of as more of a form of epoch-greedy - you play uniformly at random, then play greedy. One advantage of e-greedy is that if your environment is not stationary, you are still sampling from the arms - it's sort of an insurance premium.
To address differences in reward signals based on the environment, there is the option of modeling the environment with features - it may be that Fridays require a different selection than Sundays, and I'm not sure why you would a priori assume that the relative values, or even just the arm rankings, are independent of environmental variables.
One other point: if you just want a super quick hack to pick winners (or at least to pick from the set of higher-performing arms), you can just optimistically seed your estimates for each arm and then pick greedily. I'm not claiming it's optimal or anything, but it requires almost no coding overhead. Of course, you need the problem to be stationary.
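Something like this, in other words (the seed of one phantom success per arm is arbitrary; seed higher for more initial exploration):

    n_arms = 4
    # optimistic seeding: pretend every arm already has one show and one
    # conversion, so every arm's estimated rate starts at 1.0 and the greedy
    # rule is forced to try each arm before it can settle on anything
    counts = [1.0] * n_arms
    rewards = [1.0] * n_arms

    def pick_greedy():
        return max(range(n_arms), key=lambda i: rewards[i] / counts[i])

    def update(arm, converted):
        counts[arm] += 1
        rewards[arm] += 1 if converted else 0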
Regardless of which algorithmic approach you use, I do think it is useful to at least think in terms of bandits or reinforcement learning when tackling these problems.
Suppose you pulled decision X for some user two minutes ago, and the user hasn't converted yet. Suppose that you pulled decision Y for some user two weeks ago, and the user hasn't converted yet. Do these two give you the same information? No: the user that got Y is less likely to convert than the user that got X, simply because of the time difference.
What you could do to incorporate this extra information is model the conversion process more explicitly, for example as each lever resulting in a particular exponentially distributed time-to-conversion (and then do Bayesian inference over those parameters).
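A sketch of that idea, using the Gamma prior that is conjugate to the exponential (the lever names and prior parameters here are arbitrary):

    import numpy as np

    # Gamma(alpha, beta) prior on each lever's conversion rate lambda; with an
    # exponential time-to-conversion model, the posterior updates are closed-form
    alpha = {"X": 1.0, "Y": 1.0}
    beta = {"X": 1.0, "Y": 1.0}

    def observe_conversion(lever, elapsed):
        # conversion after `elapsed` seconds: alpha += 1, beta += elapsed
        alpha[lever] += 1
        beta[lever] += elapsed

    def observe_no_conversion_yet(lever, elapsed):
        # a user who *hasn't* converted after `elapsed` seconds still informs us:
        # two weeks of silence pushes the rate estimate down more than two minutes
        beta[lever] += elapsed

    def sample_rate(lever):
        # posterior draw of the conversion rate (e.g. for Thompson sampling)
        return np.random.gamma(alpha[lever], 1.0 / beta[lever])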
Why do you say so? Can you elaborate?
Statisticians, academics, math enthusiasts, and People Who Spend Too Much Time Building Tables and Graphs want concrete blocks. They want to build a case or a report or something that suggests a point using a mathematical model. They deal with "statistical significance" because that's how the tools they understand work. Thus A/B testing.
Web workers, businesses, advertisers, people looking for better conversion rates, etc. want to eat. They want apples. Talking about statistical significance, trial sizes, and frequency measures doesn't help if they don't get more apples. The article showed that MAB gets more apples than A/B, and by a wide margin.
Comparing the two isn't just missing the point, it's drawing an equivalency between "statistical significance" (whatever the tools say it means) and actual real world significance, a.k.a. better results. If A/B testing fails to even simulate better results, the proper word for that is worse.
(And don't even get me started about trying to run a statistical experiment until significant results are found... someday that practice will be labeled as fraud.)
If you have a question where the answer can be used to inform other decisions (e.g. Which is more important for our customers, price or support level?) then an A/B test will give you a clear answer that you can apply elsewhere.
If you don't really care what the answer is and just want the best option (e.g. should the hero shot be looking up or down) then a bandit is the way to go.
2. You do not indicate whether you've run enough simulations to attain high statistical significance (over the meta-result that each campaign's statistical significance has been attained).
3. Correct me if I'm wrong, but I don't think an A/B test should be run "until statistical significance has been attained", because that defeats the purpose of a statistical test. Rather, you should define a number of iterations beforehand, wait until they have been completed, and then check whether the result gives you statistical significance (see the sketch after this list).
4. The purpose of the epsilon-greedy algorithm is that it can be run for a very long duration without knowing which version is best, because it adapts so that the most commonly shown version of your site is the best-performing one (i.e. it is made to be run almost indefinitely, and will - in effect - automatically remove the non-working versions after a while!).
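Point 3 in code terms (a sketch; the discipline is just that N is fixed before any data comes in):

    from scipy.stats import chi2_contingency

    N = 10_000  # chosen up front, e.g. from a power calculation - NOT adaptively

    def fixed_horizon_p(a_conv, a_total, b_conv, b_total):
        # evaluated exactly once, after the pre-registered N is reached;
        # checking after every visitor and stopping at the first p < 0.05
        # inflates the false-positive rate well past the nominal 5%
        table = [[a_conv, a_total - a_conv],
                 [b_conv, b_total - b_conv]]
        _, p, _, _ = chi2_contingency(table)
        return p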
Of course you get statistical significance with fewer observations if you have fewer variables to model - although you'd reach significance with even fewer observations if you had a continuous independent variable. But more variables produce a higher r-squared; that is, you can explain more of the variation in your conversion rate.
The prevalence of A/B testing demonstrates a sad lack of statistics training. It's a problem in many science fields as well, especially biology.
If so, that ignores one of the best advantages of MAB: you can compare variations A, B, C, D, E, F, ... all simultaneously, whereas with A/B testing you'd have to coordinate a tournament between all of those variations to determine the best of the bunch. So it would take just as long with A/B testing (and thus a large sample population) to arrive at the best variation from among many choices.
Like others here, I believe testing until you see significance is Doing It Wrong.