
> I believe that I can produce a multi-armed bandit algorithm that has logarithmic regret and will work in the face of ever varying conversion rates, as long as there is a consistent relative difference between the versions.

That sounds interesting. As you undoubtedly know, there is a lot of literature on multi-armed bandit problems and also on the non-stochastic multi-armed bandit problem. The latter has the advantage of not assuming anything about the way in which the sequence of rewards is generated.

There is a line of work in the non-stochastic setting that sounds related to what you are describing. It's called the problem of tracking the best expert: http://www.cse.ucsc.edu/~mark/papers/track-long.ps The idea is to do well compared not to the best fixed action but to the best piecewise-constant sequence of actions. There are many variations of this problem, but the basic point is that you can have low regret against a changing sequence of actions so long as that sequence doesn't change too "quickly", for some definition of quickly.




I know there is a lot of literature, which I'm unfortunately less familiar with than I'd like to be.

Here is the specific problem that I am interested in.

Suppose we have a multi-armed bandit where the probability of reward is different on every pull. The distributions of rewards are not known. It is known that there is one arm which, on every pull, is more likely to produce a reward than any other arm by at least some minimum percentage. Which arm this is is unknown, as is the minimum percentage.

In other words payoff percentages change quickly, but the winner is always a winner, and your job is to find it. I have an algorithm that succeeds with only logarithmic regret. I have no idea whether anyone else has come up with the same idea, and there is so much literature (much of which I presume is locked up behind paywalls) that I cannot easily verify whether anyone else has looked at this variant.
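To make the setup concrete, here is a toy simulation of what I mean in Python (the arm count, the base-rate range, the additive gap, and the choice of arm 0 as the winner are all made up for illustration; the real gap could just as well be relative):

    import random

    # Toy sketch of the setting: base conversion rates drift arbitrarily
    # from pull to pull, but arm 0 always beats every other arm by at
    # least `gap`. All numbers here are illustrative placeholders.
    def make_environment(n_arms=3, gap=0.05):
        def pull(arm):
            base = random.uniform(0.1, 0.5)      # changes on every pull
            rates = [base] * n_arms
            rates[0] = min(1.0, base + gap)      # persistent winner
            return 1 if random.random() < rates[arm] else 0
        return pull

The algorithm only ever sees the 0/1 outcome of the arm it actually pulled.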


I think what you're describing is basically the non-stochastic setting. The idea there is to assume absolutely nothing about the rewards (they can be generated by quickly changing distributions or even an adversary) but still try to do as well as the best single arm. The standard algorithm for this problem is the multiplicative weights update method, also called the exponentially weighted average method or the hedge algorithm. It's a really simple method where you keep a probability distribution over the arms and then reweight the probabilities at each time step using something proportional to the exponential of the rewards of the arms. If you run it for T time steps on a problem involving N arms, it gives you regret O(sqrt(T log N)). This is actually the best possible bound without additional assumptions. This is a good paper on it: http://cns.bu.edu/~gsc/CN710/FreundSc95.pdf
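For concreteness, here is a minimal sketch of that reweighting (my own illustrative rendering, not the paper's pseudocode; the learning-rate tuning and the get_rewards helper, which returns every arm's reward for round t, are assumptions):

    import math, random

    # Exponentially weighted average (Hedge) in the full-information
    # setting: after each round you observe the reward of *every* arm
    # and reweight exponentially.
    def hedge(n_arms, T, get_rewards, eta=None):
        if eta is None:
            eta = math.sqrt(2.0 * math.log(n_arms) / T)   # standard tuning
        weights = [1.0] * n_arms
        total_reward = 0.0
        for t in range(T):
            s = sum(weights)
            probs = [w / s for w in weights]
            arm = random.choices(range(n_arms), weights=probs)[0]
            rewards = get_rewards(t)             # rewards of all arms this round
            total_reward += rewards[arm]
            weights = [w * math.exp(eta * r) for w, r in zip(weights, rewards)]
        return total_reward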

There are many, many different variations of the method and result. For example, you can show improved results assuming different things about the rewards; with certain additional assumptions you can get O(log T) regret. This is a really good book on different non-stochastic online learning problems: http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Predi...


The article is interesting, but it assumes you get to pull all of the arms at once and see all of the results. That's different from the website optimization problem.

The book looks like it would take a long time to work through, and they don't get to the multi-armed bandit problem until chapter 6. It goes into my todo list, but is likely to take a while to get off of it...


Yes, I realized after posting that this is probably a better paper to link to: http://cseweb.ucsd.edu/~yfreund/papers/bandits.pdf The algorithms for the partial information setting are sometimes surprisingly similar to the algorithms where you see all the results. The algorithm in the paper linked above is essentially the same algorithm but with a small exploration probability. The regret bound gets worse by a factor of sqrt(N), however.
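If it helps, here is a rough sketch of that exploration-mixed variant (my own loose rendering of the Exp3 idea in the linked paper, not the authors' exact pseudocode; the gamma value and the pull(arm) helper are placeholders):

    import math, random

    # Exp3-style update under bandit feedback: mix the exponential
    # weights with a little uniform exploration, and importance-weight
    # the single reward you actually observe.
    def exp3(n_arms, T, pull, gamma=0.1):
        weights = [1.0] * n_arms
        total_reward = 0.0
        for t in range(T):
            s = sum(weights)
            probs = [(1 - gamma) * w / s + gamma / n_arms for w in weights]
            arm = random.choices(range(n_arms), weights=probs)[0]
            reward = pull(arm)                   # only this arm's reward is seen
            total_reward += reward
            estimate = reward / probs[arm]       # unbiased importance-weighted estimate
            weights[arm] *= math.exp(gamma * estimate / n_arms)
        return total_reward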


I don't think the non-stochastic setting is entirely appropriate. What we've been discussing is a situation where the differences in conversion rates are constant in expectation, though absolute conversion rates vary. So it's basically a stochastic setting, which should allow more leverage. Also, as I recall, for the non-stochastic setting the definition of regret changes, so the regret bounds are not directly comparable.


My take on btilly's problem was that the difference between the conversion rates is not constant but rather known to be bounded away from zero by some unknown constant. That is, there is some value epsilon such that the difference in conversion rate is, on every round, at least epsilon. However, on some rounds it could be more than epsilon, and it doesn't necessarily vary according to any fixed distribution. I think this is therefore roughly equivalent to the non-stochastic setting with the additional assumption that the best arm has a large (relative) reward.
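To put that in symbols (my reading of the assumption, not btilly's exact wording): if p_{i,t} is the conversion rate of arm i on round t and i* is the unknown best arm, the assumption is

    \exists \epsilon > 0 : \forall t, \forall j \ne i^*, \quad p_{i^*,t} - p_{j,t} \ge \epsilon

with no further assumptions on how the p_{i,t} themselves evolve (the same statement could also be made with a relative rather than additive gap).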

I agree though that if the difference in conversion rate were stochastic that could potentially make the problem much easier.


Thanks for the link. I'll read it asap. For anyone following along at home, I believe Yisong Yue's work on dueling bandits is also relevant here: http://www.yisongyue.com/publications/jcss2012_dueling_bandi... and http://www.yisongyue.com/publications/icml2011_beat_the_mean... (I haven't had a chance to read these in detail yet.)



