Multi-armed bandits are also well known in game AI. They got popular with the introduction of Monte Carlo Tree Search, where MABs are used to select which subtree to search for the largest expected payoff, e.g.:
https://www.aaai.org/ocs/index.php/AIIDE/AIIDE13/paper/view/... https://courses.cs.washington.edu/courses/cse599i/18wi/resou... etc.
For what it's worth, the MAB algo in the original post looks like epsilon-greedy. It's probably better to look into Upper Confidence Bound variants like UCB1, which dynamically adjust how much to explore vs. exploit, e.g.:
https://towardsdatascience.com/comparing-multi-armed-bandit-...
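To make the difference concrete, here's a minimal UCB1 sketch (after Auer et al. 2002) -- an illustrative example, not the code from the original post. Each arm's score is its mean reward plus an exploration bonus that shrinks the more that arm has been tried, so exploration tapers off on its own instead of staying at a fixed epsilon:

    import math

    # Arms are page variants; reward is 1 for a conversion, 0 otherwise.
    class UCB1:
        def __init__(self, n_arms):
            self.counts = [0] * n_arms    # times each arm has been shown
            self.values = [0.0] * n_arms  # running mean reward per arm

        def select_arm(self):
            # Show every arm once before applying the UCB formula.
            for arm, count in enumerate(self.counts):
                if count == 0:
                    return arm
            total = sum(self.counts)
            # Mean reward plus an exploration bonus that shrinks with more plays.
            scores = [self.values[a] + math.sqrt(2 * math.log(total) / self.counts[a])
                      for a in range(len(self.counts))]
            return scores.index(max(scores))

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]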
You are correct. MCTS + UCB and other variants were state of the art leading up to AlphaGo. And even then, MCTS was also used in AlphaGo.
The main change in AlphaGo was using deep networks: a value network for evaluating positions (alongside fast rollouts) and a policy network to guide move selection, rather than relying on the UCB rule alone. Later versions merged the policy and value networks and dropped the rollouts entirely, but even AlphaZero still uses MCTS.
The purpose of an A/B test isn't to always show the best performing result, it's to perform a _controlled scientific experiment_ with a control group, from which you can learn things.
Also, I work in this field and I will just say that people _do_ behave differently based on traffic source: users coming from Facebook behave like one another, but differently from traffic coming from Reddit, who in turn act similarly to each other. If you were running a self-optimizing thing like this it _would_ make sense to split it up by traffic source and handle each one separately.
What you said is correct, but I'd like to point out that controlled scientific experiments may not be the right approach for e.g. optimizing conversions on a website.
The reason is that websites are a dynamic environment. All things being equal, better-controlled experiments are great. However, visitor behavior changes all the time, especially when traffic comes from dynamic sources (Google SERPs change weekly).
And that's why I prefer MAB over A/B tests -- A/B tests don't adapt to a dynamic system, so we often wish away the changes in the system in order to use them. Does anyone go back and re-test their biggest wins?
> Does anyone go back and re-test their biggest wins?
Yes, absolutely! We do research first, then come up with simple, well-controlled tests. Once we have a winner we can lock it in, but often we continue to research and experiment on the new knowledge we gained. A hefty minority of the tests I implement build on past wins to further flesh out what works and what doesn't, with knowledge and data to back it up.
How do you do knowledge transfer when you're documenting what you've learned? I have found that even simple changes often don't reproduce on the same site. It feels like an easy trap for an expert practitioner to fall into -- the belief that they can predict the outcome of a test based on a different test they ran.
Sometimes, yes, but often it's not for the reasons people think. This is very much Daniel Kahneman territory -- we can't trust what our brain is telling us, namely that this experience will have similar results in some other context.
I recently worked at a place that ran > 20 websites, huge traffic numbers, with around 100 simultaneous tests. We would poll each employee about their guess for the test winner. The results were about on par with everyone randomly guessing.
Very interesting. It seems to me that doing incremental work like this might end up at a local minimum/maximum. Do you have any advice on how to avoid pitfalls like that? Are you testing radically different ideas alongside your incremental improvements?
From a workflow perspective, it's difficult to find radical improvements with MAB. The radical improvements come from fundamental design changes, and it would be very expensive to create a bunch of radically different variants.
MAB is best used where you can generate a bunch of variants cheaply and hope for a 30% gain.
Yes. Bandits will often converge more quickly to the optimal strategy, but it is much more difficult to understand why that strategy is optimal and generalize from the bandit outcomes to predict future performance and performance of other strategies.
It isn't impossible - bandits are seeing adoption in medical trials to avoid precisely the problem discussed - but the standard experiment design and analysis techniques you learn in a decent college statistics class or introductory statistics text no longer apply. That's one of the beauties of A/B testing: while it does require substantial thought to do well, the basic statistics of the setup are very well-understood at this point.
But for results to generalize, or to understand why, the confounders must be accounted for in the randomization. This is really hard to do well -- there are often subtle influences whose impact on these non-linear systems isn't sufficiently understood. What makes someone convert? A million different factors; changing the color of a button in one context doesn't necessarily tell me much about how people would respond to that experience in another context.
It's easy to underestimate how complex things are, because we only see some superficial aspects of e.g. a user/software interaction model. This flaw is down to how our brains work -- ref "What you see is all there is".
That could be post-hoc reasoning, though. It would be interesting to pre-register your hypotheses, or see whether you could tell bandit outcomes from random ones.
This is literally the logical fallacy at work. You could get lucky. Maybe you have obvious gains to chase. But bad logical arguments are bad because they never work forever. They are corrupted heuristics that can get you in trouble without critical thinking.
Edit: added in forever. Phone dropped some wording I originally had. I think.
Incremental wins can still lead to dead ends. My phrasing was off in my post. I meant to say that the fallacy isn't that the tactics never work, just that they can stop working without you really realizing it. A heuristic that can lead you down a dead end.
By all means, keep doing it if it is working for you. But don't confuse it as good advice. And stay vigilant.
Products exist in human reality, not in some science paper. There are no absolute truths; everything dead-ends eventually. It's like trying to prove that one set of genes is better than another for future survival -- an impossible task.
This betrays a belief that science doesn't reflect the real world. It absolutely does.
Again, it may be working in your case. Argument from authority can go a long way. Even ad hominem attacks often exist because of a "smell" about the person speaking. It is not, however, logically sound, and it can easily lead to unsupportable positions.
So, take care. And realize that a lot of the damage of poor practices may be tangential -- for example, a belief that the real world cannot be described by science.
Isn't this problem also an issue when people talk about transferring what they learn from one test to another test? That is frequently cited as a benefit of A/B testing.
> The purpose of an A/B test isn't to always show the best performing result
The key part is 'not always'.
Using A/B tests just to find the better performing version is a valid use case too. And if you have the traffic, multi-armed bandits are a nice way of automating the whole procedure. Even scientifically speaking there is nothing wrong with them. Their biggest issue is that they require a lot of traffic for significant results.
In practice, MAB should require less traffic than an A/B test for the same level of significance. Although it's much more difficult to rigorously describe that significance level with a MAB.
Wouldn't a simple solution be to create a handler that runs a MAB for each traffic source?
If you're worried about the site changing for people between visits from different traffic sources, you can cookie their MAB/test-assignment on the first visit.
If your ultimate goal is just to convert, convert, convert, and your content strategy can generate enough different content to sell to different types of people (usually that content strategy is what's missing), you can set up tracking for everything you can imagine: did the user scroll down and then scroll all the way back up to the top? How long was the page loaded before the user scrolled the first time? All kinds of stuff is trackable. You can also split the traffic by origin (because traffic sources seem to behave alike). I know a guy who does lead qualification, and with his setup he can predict with 90% accuracy whether a user will convert, after 1000 visitors from a given traffic source. He's tracking about 30 variables to come to that prediction.
These things are easy to track, but if your content strategy has no alternate content to act on, having that ability is useless.
Some things that you could do to alter a sales funnel to try to convert a customer:
- if you think they need more priming before they're ready to buy, add more pages into the funnel for that user
- if you think they're in a place to buy immediately, make sure you show them a Call to Action right then so they can buy
- swap entire blocks of text based on what kind of user we think they are
- change the length of pages (adding or rearranging elements on the page to highlight different things to different users, based on what we think they care about more)
- etc
You can also often just ask users to self-identify a 'role' that is useful for content strategy, like "Are you a [student] or [teacher]" and people will click the button, and even expect the page to change and customize to that experience. All this stuff is so easy to do technologically, but you have to have a content strategy that actually makes use of the insights available.
> Also, I work in this field and I will just say that people _do_ behave differently based on traffic source: users coming from Facebook behave like one another, but differently from traffic coming from Reddit, who in turn act similarly to each other.
Out of curiosity, what is your hypothesis for explaining this difference in behavior? Would you say it's primarily due to differing contexts in which a link is posted, or differing populations on each platform? Or maybe these both contribute about equally?
Stated another way: would you expect the same individual to behave differently coming from Facebook versus coming from Reddit, if they happen to be a user of both?
Not a statistician by any means, but could traffic source be a factor that's evaluated alongside conversion by a bandit algorithm when calculating the chance to show a particular option? Or other factors as well (detected device capabilities, the user's location, etc.)?
These could just be weighting factors, so instead of a single % chance per option, every time there's a successful interaction the victory is spread across the factors for that option.
New users could then be shown the option whose chance is weighted by the factors they match. Is this a valid approach, or would it introduce some kind of selection bias?
Yes, of course. You can use a contextual MAB and include the traffic source in the context. Contextual MABs are much more complicated and expensive to implement though.
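The very simplest version -- and roughly what the grandparent suggested -- is just keeping an independent bandit per traffic source; proper contextual bandits (e.g. LinUCB) go further and share information across contexts. A rough sketch, assuming a UCB1-like class with select_arm/update methods (like the one sketched earlier in the thread); the names here are illustrative, not from any particular library:

    from collections import defaultdict

    # One independent bandit per traffic source -- the crudest form of "context".
    # `bandit_factory` could be UCB1, Thompson sampling, etc.
    class PerSourceBandit:
        def __init__(self, n_variants, bandit_factory):
            self.bandits = defaultdict(lambda: bandit_factory(n_variants))

        def select(self, traffic_source):
            return self.bandits[traffic_source].select_arm()

        def update(self, traffic_source, variant, converted):
            self.bandits[traffic_source].update(variant, 1.0 if converted else 0.0)

    # Usage (hypothetical):
    #   router = PerSourceBandit(n_variants=3, bandit_factory=UCB1)
    #   variant = router.select("facebook")
    #   ...later, once the visit resolves...
    #   router.update("facebook", variant, converted=True)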
> The purpose of an A/B test isn't to always show the best performing result, it's to perform a _controlled scientific experiment_ with a control group, from which you can learn things.
Occasionally there is pure scientific interest.
But far more frequently, the purpose of the A/B is to optimize the outcome.
This is why Google Analytics has chosen multi-armed bandits exclusively for its A/B test framework.
If you want to use the results from one test to inform what to test next, then A/B tests are better at optimizing for truth.
Most website changes don't make a significant difference in conversion. If you use MAB, then you don't know if the winner is really better or just the result of random variation.
> If you use MAB, then you dont know if the winner is really better or the result of random variation.
No, you know the outcome of MAB is the best.
If it's clearly the best, it converges quickly. If it's less clearly the best, it converges slowly. Either way, you haven't lost any more conversions than necessary to find that out, or needed to put in non-mathematical failsafes.
The first time I stumbled across the term "multi-armed bandit" was when I read Koza's "Genetic Programming: On the Programming of Computers by Means of Natural Selection" in 1992. When I later got involved in e-commerce projects, it was immediately clear to me that this was the way to tackle the optimization tasks involved.
We (Facebook) just released our Adaptive Experimentation platform as open source: https://github.com/facebook/Ax. It supports bandit optimization for discrete sets of candidates and Bayesian optimization for continuous/very large search spaces.
Just a small nitpick: this doesn't take implementation cost into account. If you want something dynamic like this, your app needs read access to all of the recorded analytics (or at least the ones needed for optimization). Most of the time apps only send data to the analytics services; developers read and analyze the data and act on it separately. I personally haven't worked on any apps that used analytics read access within the app itself. So in most cases it's a lot more than 20 lines to implement an approach like this (hopefully your analytics platform exposes a read API that you can then integrate into your app), compared to A/B testing, where you just show several versions, analyze the data, and iterate.
Bandit algorithms can also be approached from a Bayesian point of view! This lets you quantify the uncertainty in your estimates (e.g. how uncertain you are that ad A has a higher CTR than ad B), which a lot of other bandit methods don't offer.
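For the curious, the classic example is Thompson sampling with a Beta prior over each ad's CTR. A minimal sketch, not tied to any particular library:

    import random

    # Thompson sampling with a Beta(1, 1) prior over each arm's click-through rate.
    # Reward is 1 for a click, 0 otherwise. Illustrative only.
    class BetaBandit:
        def __init__(self, n_arms):
            self.alpha = [1] * n_arms  # successes + 1
            self.beta = [1] * n_arms   # failures + 1

        def select_arm(self):
            # Sample a plausible CTR from each arm's posterior and play the best one.
            samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
            return samples.index(max(samples))

        def update(self, arm, clicked):
            if clicked:
                self.alpha[arm] += 1
            else:
                self.beta[arm] += 1

        def prob_first_beats_second(self, a, b, n_samples=10000):
            # Monte Carlo estimate of P(CTR_a > CTR_b) -- the kind of
            # uncertainty statement mentioned above.
            wins = sum(random.betavariate(self.alpha[a], self.beta[a]) >
                       random.betavariate(self.alpha[b], self.beta[b])
                       for _ in range(n_samples))
            return wins / n_samples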
It seems close to simulated annealing for the travelling salesman problem. The basic idea is that you start with a random tour and adjust path segments incrementally. Most of the time you choose a new path segment that decreases the global route cost, but occasionally you randomly accept one that increases it. This random factor is decreased over time, so the route anneals and settles on a close-to-optimal result.
It shares the same principle of choosing reasonable options most of the time, but allowing variation to keep from getting stuck in a local optimum.
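For reference, a bare-bones sketch of that annealing loop (2-opt moves on a tour; `dist` is assumed to return the distance between two cities -- illustrative only):

    import math
    import random

    def tour_length(tour, dist):
        return sum(dist(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour)))

    def anneal(cities, dist, temp=10.0, cooling=0.999, steps=100000):
        tour = cities[:]
        random.shuffle(tour)  # start from a random tour
        cost = tour_length(tour, dist)
        for _ in range(steps):
            i, j = sorted(random.sample(range(len(tour)), 2))
            candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]  # reverse a segment (2-opt)
            delta = tour_length(candidate, dist) - cost
            # Always accept improvements; accept worse tours with probability
            # exp(-delta / temp), which shrinks as the temperature cools.
            if delta < 0 or random.random() < math.exp(-delta / temp):
                tour, cost = candidate, cost + delta
            temp *= cooling
        return tour, cost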
Basically you are comparing one strategy to prevent local optima in one optimization algorithm to another strategy in a different optimization algorithm. There are such strategies for basically every optimization problem.
Other examples: momentum in deep learning, tabu search, random search, etc.
The benefit of doing a pure binary A/B test is that the experiment is so simple that, as long as you don't break the cardinal rules (you need a fixed experiment size! Most people don't even do this. You also need to avoid biasing your control/experiment sets), it is easy to get statistically valid results. When doing multivariate optimization on something that is measured (such as user engagement) rather than evaluated, you need a good amount of data for each configuration, or you run the risk of optimizing for random noise. This is true even of the multi-armed bandit approach: if you vary the exploitation strategy over time, then confounding temporal variables (for example, purchases are higher on weekdays because that's when people make business decisions at work) can invalidate the experiment if not controlled for.
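To make "fixed experiment size" concrete: decide the sample size up front with a standard power calculation for two proportions, then stop when you hit it. A rough sketch using the textbook approximation (not from the parent comment):

    import math
    from statistics import NormalDist

    # Per-group sample size for detecting a lift from baseline rate p1 to p2,
    # at significance level alpha and the given power.
    def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_beta = NormalDist().inv_cdf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

    # e.g. detecting a lift from a 5% to a 6% conversion rate:
    #   sample_size_per_group(0.05, 0.06)  -> roughly 8,200 visitors per variant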
Also, one reason a lot of teams can't do more than two options (A/B/C/D/E/F/G testing, etc.) is that you need a TON of traffic for the results to be statistically significant.
Statistical significance isn't necessary for deriving value from information! The point of multi-armed bandit is that you use the best information you currently have, while also not taking that information too seriously. In a context where experiments have a cost and you need results, this makes more sense than gathering more and more data until you meet statistical significance thresholds.
Guessing based on uncertain data seems better than guessing randomly? Also, if the signal-to-noise ratio is low, my data is likely just noise, but at the same time the decision doesn't matter much. The clearer the difference, the more important the decision and the more likely my data is right.
Go take a look at some solutions to the multi-armed bandit problem. A common trait is assigning a value (e.g. monetary) to each hypothesis. A weaker positive-outcome hypothesis is worth less. The method takes that into account.
I published work at CHI on the use of multi-armed bandits in educational games. My biggest takeaway was the importance of choosing the right metric -- because our optimization worked too well.
This was an interesting read. I played the game, too! Interesting to see how the optimizer converged on such a large ship size, even though you only allowed that case to verify whether the optimizer would successfully reject it.
Microsoft had a good talk about contextual bandits and machine learning. The business case discussed was webpage conversion optimization, which increased conversions by 80%.
Sure you have to read a bit more to know why it works, but if you write your code well you could plug this in without any extra trouble. It's not like you need a special optimization solver as a dependency.
And, for not much more effort (computing the variance of your samples), you can use UCB1-tuned [0], which gets rid of the 'c' parameter and tends to be even better.
I personally think it should replace UCB1 as the baseline when trying bandit algorithms.
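For reference, here's the UCB1-tuned index as I understand it from the Auer et al. paper: the fixed exploration constant is replaced by an empirical variance estimate, capped at 1/4 (the maximum variance of a Bernoulli reward). Sketch only:

    import math

    class UCB1Tuned:
        def __init__(self, n_arms):
            self.counts = [0] * n_arms
            self.sums = [0.0] * n_arms     # sum of rewards per arm
            self.sq_sums = [0.0] * n_arms  # sum of squared rewards per arm

        def select_arm(self):
            # Play each arm once first.
            for arm, count in enumerate(self.counts):
                if count == 0:
                    return arm
            t = sum(self.counts)
            scores = []
            for a, n in enumerate(self.counts):
                mean = self.sums[a] / n
                # Empirical variance plus a confidence term, capped at 1/4.
                v = self.sq_sums[a] / n - mean ** 2 + math.sqrt(2 * math.log(t) / n)
                scores.append(mean + math.sqrt((math.log(t) / n) * min(0.25, v)))
            return scores.index(max(scores))

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.sums[arm] += reward
            self.sq_sums[arm] += reward ** 2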
It's funny, I had read that paper a few times while learning about bandit learning, and I never noticed their version, which funnily enough outperforms vanilla UCB1 in all of their tests!
Ahhh, the epsilon-greedy strategy... I forgot I had read this until I was a few paragraphs in. If you have rare positive results, it gives you better statistics on your highest-signal tests, while still evaluating the others (i.e. when there's more than one alternative).
At least for us, most A/B/... tests require a stable assignment of variant to user, as key metrics like retention would obviously be skewed by randomly re-assigning on each visit/page view.
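One common way to get that stable assignment is to hash a stable user id (or a cookie set on first visit) into a bucket, so the same user always lands in the same variant. A tiny sketch -- the names are placeholders, not anyone's production code:

    import hashlib

    def assign_variant(user_id: str, experiment: str, n_variants: int) -> int:
        # Deterministic: the same user always gets the same bucket for a given experiment.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % n_variants

With a bandit the allocation probabilities shift over time, so there you'd instead draw a variant on the first visit and freeze it in a cookie, as suggested elsewhere in the thread.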
"hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads."
Just out of curiosity... have you ever purposefully clicked on an ad on the internet? I honestly don't think I ever have.
ps. I mean an outright overt straight up ad, not, for example, some article linked on HN that is a thinly veiled promo piece for something.
Does anyone have a reference for solving multi-armed bandit problems with a finite time horizon? I would like something that derives rules or heuristics for how your explore/exploit tradeoff changes as the horizon approaches.
This seems like an obvious extension, and something that someone should have worked on given how long this problem has been around, but I've been unable to find anything on it. Any pointers?
What do you mean? Most analyses of multi-armed bandit algorithms assume a finite time horizon. And if not, they use the doubling trick to extend to infinite time horizons.
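For anyone unfamiliar, the doubling trick just runs a fixed-horizon algorithm in phases of length 1, 2, 4, 8, ..., restarting it with the phase length as its horizon. A rough sketch, with `make_bandit` and `pull_arm` as placeholder hooks:

    def run_with_doubling(make_bandit, pull_arm, total_rounds):
        t = 0
        horizon = 1
        while t < total_rounds:
            bandit = make_bandit(horizon)  # fresh instance tuned for this phase's horizon
            for _ in range(min(horizon, total_rounds - t)):
                arm = bandit.select_arm()
                reward = pull_arm(arm)     # interact with the environment
                bandit.update(arm, reward)
                t += 1
            horizon *= 2                   # next phase is twice as long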
MAB seems to find a local maximum subject to input biases, whereas an A/B test aims to figure out a scientific truth and isolate all potential biases in the system. I'd be curious to hear about cases where a MAB approach and an A/B test did not yield the same results, and why that happened.
Before I even clicked I knew this was going to be about multi-armed bandits. I agree with the author -- why don't more people know about this? It's so effective and easy, and you get results much quicker.
I encounter a/b testing primarily in article headlines, where the choice is always "accurate, informative title" or "inflammatory sensational bullshit clickbait" and the clickbait wins every time.