Hacker News new | comments | show | ask | jobs | submit login

(For previous discussion on this topic, see http://news.ycombinator.com/item?id=4040022.)

Let's stop a bad meme before it gets going.

Multi-armed bandit is a general problem. There are many strategies for tackling it. An epsilon-greedy strategy is one strategy. It is not the only strategy, and it is not even a particularly good one. There are many others. In fact A/B testing is itself a solution to the multi-armed bandit approach.

In fact the standard way to classify strategies is their asymptotic average regret - how much time you spend pulling the bad lever. An epsilon-greedy strategy has linear regret - even after a version has won you keep pulling it a certain fraction of the time.

Let's consider an alternate approach. Run an A/B test, then forever after exploit it. This is also a linear regret algorithm. Given two levers and the way you're going to run the test, there is a probability that you will come to the wrong conclusion and forever after do the wrong thing. Most of the time your regret is fixed, but sometimes it is not.

So they are equal? Not exactly. A/B testing converges quickly, and has a very small amount of linear regret forever after. An epsilon-greedy approach converges more slowly. Furthermore to lower the long-term regret you have to drop epsilon, which slows convergence. By the time you lower it to a level that is comparable to an A/B testing approach, convergence is so slow as to be operationally infeasible.

Thus the conclusion of this article is that A/B testing is a better operational strategy than epsilon-regret. And it is. But there are known better strategies still. The algorithms that Myna uses, for instance, have logarithmic regret. In a multi-armed bandit situation, they will soundly beat A/B testing just as A/B testing soundly beats epsilon-regret.

At that point you're left with two questions. The first is whether multi-armed bandit is a good model for website optimization. The second is whether there are any operational advantages to one approach over the other.

The answer to the first, for traditional multi-armed bandit algorithms, is a sound no. While relative conversion rates remain comparable, absolute conversion rates vary for all sorts of reasons. (Time of day, day of week, longer cycles, external website improvements, etc.) That was my big criticism the last time this discussion popped up, and there was no good answer to it.

But behind the scenes Noel (the CEO of Myna) and I have been in a discussion. I believe that I can produce a multi-armed bandit algorithm that has logarithmic regret and will work in the face of ever varying conversion rates, as long as there is a consistent relative difference between the versions.

At that point you are left with the operational differences. If you want to test at a granularity of "user", a website often has to put most of its active population into the test, before collecting enough data to know which version was good. With A/B testing you will eventually put everyone into the better version. With a multi-armed bandit, half your current users are stuck in the wrong version.

In effect you have a limited population of levers available to pull, and have to pull a good chunk of them before you start getting feedback on the rewards.

You can avoid this problem by testing at the granularity of an impression. Whether or not this is OK to do is a business question, and the answer will not always be yes. If it is not, then a multi-armed bandit approach is inappropriate to use.

> I believe that I can produce a multi-armed bandit algorithm that has logarithmic regret and will work in the face of ever varying conversion rates, as long as there is a consistent relative difference between the versions.

That sounds interesting. As you undoubtedly know, there is a lot of literature on multi armed bandit problems and also on the non-stochastic multi armed bandit problem. The latter has the advantage of not assuming anything about the way in which the sequence of rewards is generated.

There is a a line of work in the non stochastic setting which sounds related to what you are describing. It's called the problem of tracking the best expert: http://www.cse.ucsc.edu/~mark/papers/track-long.ps The idea in this problem is to do well as compared not to the best fixed action but rather the best piecewise constant sequence of actions. There are many variations of this problem, but the basic idea is that you can have low regret as compared to a changing sequence of actions so long as that sequence doesn't change too "quickly" for some definition of quickly.

I know there is a lot of literature, which I'm unfortunately less familiar with than I'd like to be.

Here is the specific problem that I am interested in.

Suppose we have a multi-armed bandit where the probability of reward are different on every pull. The distribution of rewards are not known. It is known that there is one arm which, on every pull, has a minimum percentage chance by which it is more likely to produce a reward if pulled than any other arm. Which arm this is is unknown, as is the minimum percentage.

In other words payoff percentages change quickly, but the winner is always a winner, and your job is to find it. I have an algorithm that succeeds with only logarithmic regret. I have no idea whether anyone else has come up with the same idea, and there is so much literature (much of which I presume is locked up behind paywalls) that I cannot easily verify whether anyone else has looked at this variant.

I think what you're describing is basically the non-stochastic setting. The idea there is to assume absolutely nothing about the rewards (they can be generated by quickly changing distributions or even an adversary) but still try to do as well as the best single arm. The standard algorithm for this problem is the multiplicative weights update method, also called the exponentially weighted average method or the hedge algorithm. It's a really simple method where you keep a probability distribution over the arms and then reweight the probabilities at each time step using something proportional to the exponential of the rewards of the arms. If you run it for T time steps on a problem involving N arms, it gives you regret O(sqrt(T log N)). This is actually the best possible bound without additional assumptions. This is a good paper on it: http://cns.bu.edu/~gsc/CN710/FreundSc95.pdf

There are many, many different variations of the method and result. For example, you can show improved results assuming different things about the rewards. For exapmle, you can show O(log(T)) regret with certain additional assumptions. This is a really good book on different non-stochastic online learning problems: http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Predi...

The article is interesting, but assumed you get to pull all of the arms at once, and see all of the results. That's different from the website optimization problem.

The book looks like it would take a long time to work through, and they don't get to the multi-armed bandit problem until chapter 6. It goes into my todo list, but is likely to take a while to get off of it...

Yes, I realized after posting that this is probably a better paper to link to: http://cseweb.ucsd.edu/~yfreund/papers/bandits.pdf The algorithms for the partial information setting are sometimes surprisingly similar to the algorithms where you see all the results. The algorithm in the paper linked above is essentially the same algorithm but with a small exploration probability. The regret bound gets worse by a factor of sqrt(N), however.

I don't think the non-stochastic setting is entirely appropriate. What we've been discussing is a situation where the differences in conversion rates are constant in expectation, though absolute conversion rates vary. So it's basically a stochastic setting, which should allow more leverage. Also, as I recall for the non-stochastic setting the definition of regret changes so the regret bounds are not directly comparable.

My take on btilly's problem was that the difference between the convergence rates is not constant but rather known to be bounded away from zero by some unknown constant. That is, there is some value epsilon such that the difference in conversion rate is on every round at least epsilon. However, at some rounds it could be more than epsilon, and it doesn't necessarily vary according to any fixed distribution. I think this is therefore roughly equivalent to the non stochastic setting additionally assuming the best arm has large (relative) reward.

I agree though that if the difference in conversion rate were stochastic that could potentially make the problem much easier.

Thanks for the link. I'll read it asap. For anyone following along at home, I believe Yisong Yue's work on the duelling bandits is also relevant here: http://www.yisongyue.com/publications/jcss2012_dueling_bandi... and http://www.yisongyue.com/publications/icml2011_beat_the_mean... (I haven't had a chance to read these in detail, yet.)

This is a good coversation. So the Disclaimer, my company Conductrics.com allows you to use algos similar to bandits as well as AB/MVT.

AB can be thought of as more of a form of epoch- greedy - you play uniform random then play greedy. One advantage of e-greedy is that if your environment is not stationary - you are still sampling from the arms - its sort of an insurance premium. To address the differences in reward signals based on the environment, there is the option to model the environment with features - since it maybe that Fridays require a different selection than Sundays - not sure why you would a priori assume that the relative values, or even just the arm rankings are independent of environmental variables.

One other point- if you just want a super quick hack to pick winners (or at least pick from te set of higher performing arms) you can just optimistically seed your estimates for each arm - then just pick greedy. Not claiming it is optimal or anything but it requires almost no coding overhead. Of course you need the problem to be stationary.

Regardless of which algo approach you use, I do think it is useful to at least think in terms of bandits or reimforcemt learning when tackling these problems.

There is also another area where improvement can be had. As far as I know most A/B testing just looks at the decisions that led to a conversion. But what about those that didn't lead to a conversion? There is information in that too, and it depends on time.

Suppose you pulled decision X for some user two minutes ago, and the user hasn't converted yet. Suppose that you pulled decision Y for some user two weeks ago, and the user hasn't converted yet. Do these two give you the same information? No: the user that got Y is less likely to convert than the user that got X, simply because of the time difference.

What you could do to incorporate this extra information is model the conversion process more explicitly, for example as each lever resulting in a particular exponentially distributed time-to-conversion (and then do Bayesian inference over those parameters).

Applications are open for YC Summer 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact