Let's stop a bad meme before it gets going.
Multi-armed bandit is a general problem, and there are many strategies for tackling it. Epsilon-greedy is one such strategy. It is not the only strategy, and it is not even a particularly good one. In fact A/B testing is itself a solution to the multi-armed bandit problem.
In fact the standard way to classify strategies is by their asymptotic average regret: how much time you spend pulling the bad lever. An epsilon-greedy strategy has linear regret: even after a version has won, you keep pulling the losing lever a fixed fraction of the time.
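To make the linear-regret claim concrete, here is a minimal epsilon-greedy simulation on a two-armed Bernoulli bandit. The function name and parameters are mine, purely for illustration; regret here is the expected shortfall versus always pulling the best lever.

```python
import random

def epsilon_greedy(rewards, epsilon, pulls):
    """Epsilon-greedy on a two-armed Bernoulli bandit.

    rewards: true conversion probabilities, e.g. [0.04, 0.05].
    Returns total expected regret versus always pulling the best arm.
    """
    counts = [0, 0]
    totals = [0.0, 0.0]
    best = max(rewards)
    regret = 0.0
    for _ in range(pulls):
        if random.random() < epsilon or 0 in counts:
            arm = random.randrange(2)          # explore uniformly
        else:                                  # exploit current best estimate
            arm = 0 if totals[0] / counts[0] >= totals[1] / counts[1] else 1
        regret += best - rewards[arm]          # expected regret of this pull
        counts[arm] += 1
        totals[arm] += 1 if random.random() < rewards[arm] else 0
    return regret
```

Even once the estimates have converged, a fraction epsilon/2 of all pulls still go to the worse arm, so expected regret grows like (epsilon/2) × gap × T: linear in T.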
Let's consider an alternate approach: run an A/B test, then exploit the winner forever after. This is also a linear-regret algorithm. Given two levers and the way you're going to run the test, there is some probability that you will reach the wrong conclusion and forever after do the wrong thing. Most of the time your regret stops growing once the test ends, but with that probability it grows linearly forever.
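The same simulation style shows the structure of that regret (again, names and parameters are mine): a fixed cost during the test, then either zero or linear regret depending on whether the test picked the right lever.

```python
import random

def ab_then_exploit(rewards, test_pulls, total_pulls):
    """Run a 50/50 A/B test for test_pulls, then exploit the winner forever.

    Regret is linear in expectation: with some probability the test
    picks the wrong arm, and every later pull is then regretted.
    """
    counts = [0, 0]
    totals = [0, 0]
    best = max(rewards)
    regret = 0.0
    for i in range(test_pulls):
        arm = i % 2                            # alternate arms during the test
        regret += best - rewards[arm]
        counts[arm] += 1
        totals[arm] += 1 if random.random() < rewards[arm] else 0
    winner = 0 if totals[0] / counts[0] >= totals[1] / counts[1] else 1
    # After the test: zero regret if winner is right, linear if wrong.
    regret += (total_pulls - test_pulls) * (best - rewards[winner])
    return regret
```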
So they are equal? Not exactly. A/B testing converges quickly, and has a very small amount of linear regret forever after. An epsilon-greedy approach converges more slowly. Furthermore to lower the long-term regret you have to drop epsilon, which slows convergence. By the time you lower it to a level that is comparable to an A/B testing approach, convergence is so slow as to be operationally infeasible.
Thus the conclusion of this article is that A/B testing is a better operational strategy than epsilon-greedy. And it is. But there are known better strategies still. The algorithms that Myna uses, for instance, have logarithmic regret. In a multi-armed bandit situation, they will soundly beat A/B testing just as A/B testing soundly beats epsilon-greedy.
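Myna's actual algorithms aren't described here, but UCB1 (Auer, Cesa-Bianchi, and Fischer) is the classic example of a logarithmic-regret strategy for the stochastic bandit, and gives the flavor: pull the arm whose estimated mean plus a confidence bonus is largest, so under-sampled arms get revisited at a logarithmically decaying rate. A minimal sketch, with names of my own choosing:

```python
import math
import random

def ucb1(rewards, pulls):
    """UCB1: pull the arm with the highest mean plus confidence bonus.

    Achieves O(log T) expected regret on a stochastic bandit.
    """
    n = len(rewards)
    counts = [0] * n
    totals = [0.0] * n
    best = max(rewards)
    regret = 0.0
    for t in range(pulls):
        if t < n:
            arm = t                            # pull each arm once to start
        else:
            arm = max(range(n), key=lambda a:
                      totals[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        regret += best - rewards[arm]          # expected regret of this pull
        counts[arm] += 1
        totals[arm] += 1 if random.random() < rewards[arm] else 0
    return regret
```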
At that point you're left with two questions. The first is whether multi-armed bandit is a good model for website optimization. The second is whether there are any operational advantages to one approach over the other.
The answer to the first, for traditional multi-armed bandit algorithms, is a sound no. While relative conversion rates remain comparable, absolute conversion rates vary for all sorts of reasons. (Time of day, day of week, longer cycles, external website improvements, etc.) That was my big criticism the last time this discussion popped up, and there was no good answer to it.
But behind the scenes Noel (the CEO of Myna) and I have been in a discussion. I believe that I can produce a multi-armed bandit algorithm that has logarithmic regret and will work in the face of ever-varying conversion rates, as long as there is a consistent relative difference between the versions.
At that point you are left with the operational differences. If you want to test at a granularity of "user", a website often has to put most of its active population into the test before collecting enough data to know which version was good. With A/B testing you will eventually put everyone into the better version. With a multi-armed bandit, half your current users are stuck in the wrong version.
In effect you have a limited population of levers available to pull, and have to pull a good chunk of them before you start getting feedback on the rewards.
You can avoid this problem by testing at the granularity of an impression. Whether or not this is OK to do is a business question, and the answer will not always be yes. If it is not, then a multi-armed bandit approach is inappropriate to use.
That sounds interesting. As you undoubtedly know, there is a lot of literature on multi-armed bandit problems and also on the non-stochastic multi-armed bandit problem. The latter has the advantage of not assuming anything about the way in which the sequence of rewards is generated.
There is a line of work in the non-stochastic setting which sounds related to what you are describing. It's called the problem of tracking the best expert: http://www.cse.ucsc.edu/~mark/papers/track-long.ps
The idea in this problem is to do well as compared not to the best fixed action but rather the best piecewise constant sequence of actions. There are many variations of this problem, but the basic idea is that you can have low regret as compared to a changing sequence of actions so long as that sequence doesn't change too "quickly" for some definition of quickly.
Here is the specific problem that I am interested in.
Suppose we have a multi-armed bandit where the probabilities of reward are different on every pull. The distributions of rewards are not known. It is known that there is one arm which, on every pull, is more likely to produce a reward than any other arm, by at least some minimum percentage. Which arm this is is unknown, as is the minimum percentage.
In other words payoff percentages change quickly, but the winner is always a winner, and your job is to find it. I have an algorithm that succeeds with only logarithmic regret. I have no idea whether anyone else has come up with the same idea, and there is so much literature (much of which I presume is locked up behind paywalls) that I cannot easily verify whether anyone else has looked at this variant.
There are many, many different variations of the method and result. For example, you can show improved results assuming different things about the rewards, such as O(log(T)) regret under certain additional assumptions. This is a really good book on different non-stochastic online learning problems:
The book looks like it would take a long time to work through, and they don't get to the multi-armed bandit problem until chapter 6. It goes into my todo list, but is likely to take a while to get off of it...
I agree though that if the difference in conversion rate were stochastic that could potentially make the problem much easier.
A/B testing can be thought of as more of a form of epoch-greedy: you play uniform random, then play greedy. One advantage of epsilon-greedy is that if your environment is not stationary, you are still sampling from the arms; it's sort of an insurance premium.
To address the differences in reward signals based on the environment, there is the option to model the environment with features, since it may be that Fridays require a different selection than Sundays. I'm not sure why you would a priori assume that the relative values, or even just the arm rankings, are independent of environmental variables.
One other point: if you just want a super quick hack to pick winners (or at least pick from the set of higher-performing arms) you can just optimistically seed your estimates for each arm, then pick greedily. Not claiming it is optimal or anything, but it requires almost no coding overhead. Of course you need the problem to be stationary.
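The optimistic-seeding trick above can be sketched in a few lines. The function name, prior value, and reward setup are mine, for illustration only: each arm starts as if it had already produced one success in one pull (estimate 1.0), so pure greedy is forced to try every arm until its estimate falls below the others'.

```python
import random

def optimistic_greedy(rewards, pulls, prior=1.0):
    """Greedy selection with optimistically seeded estimates.

    Each arm starts with `prior` phantom successes in `prior` phantom
    pulls, so every arm's initial estimate is 1.0 and greedy must
    sample each arm until disappointed.
    """
    n = len(rewards)
    counts = [prior] * n
    totals = [prior] * n
    picks = [0] * n
    for _ in range(pulls):
        arm = max(range(n), key=lambda a: totals[a] / counts[a])
        picks[arm] += 1
        counts[arm] += 1
        totals[arm] += 1 if random.random() < rewards[arm] else 0
    return picks
```

As the comment thread says, this only makes sense when the problem is stationary; a nonstationary environment will eventually strand greedy on a stale estimate.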
Regardless of which algorithmic approach you use, I do think it is useful to at least think in terms of bandits or reinforcement learning when tackling these problems.
Suppose you pulled decision X for some user two minutes ago, and the user hasn't converted yet. Suppose that you pulled decision Y for some user two weeks ago, and the user hasn't converted yet. Do these two give you the same information? No: the user that got Y is less likely to convert than the user that got X, simply because of the time difference.
What you could do to incorporate this extra information is model the conversion process more explicitly, for example as each lever resulting in a particular exponentially distributed time-to-conversion (and then do Bayesian inference over those parameters).
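One way to sketch that idea: if each lever's time-to-conversion is modeled as exponentially distributed with rate lambda, a Gamma prior on lambda is conjugate, and a user who hasn't converted yet is simply a right-censored observation that adds exposure time without adding an event. The function and variable names below are mine, not from the comment:

```python
def update_gamma(alpha, beta, elapsed, converted):
    """Conjugate update for an exponential time-to-conversion model.

    Prior on the conversion rate lambda is Gamma(alpha, beta).
    A conversion after `elapsed` time adds one event and `elapsed`
    exposure; a user who has not yet converted adds exposure only
    (a right-censored observation). Posterior mean rate is alpha/beta.
    """
    if converted:
        return alpha + 1, beta + elapsed
    return alpha, beta + elapsed

# The user who got X two minutes ago barely moves the posterior;
# the user who got Y two weeks ago pulls the rate estimate way down.
a_x, b_x = update_gamma(1.0, 1.0, 2 / (60 * 24), converted=False)  # 2 minutes, in days
a_y, b_y = update_gamma(1.0, 1.0, 14.0, converted=False)           # 2 weeks, in days
```

This captures the asymmetry in the example: both users are non-conversions so far, but the two-week-old observation carries far more evidence that its lever converts slowly.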