> I believe that I can produce a multi-armed bandit algorithm that has logarithmic regret and will work in the face of ever varying conversion rates, as long as there is a consistent relative difference between the versions.That sounds interesting. As you undoubtedly know, there is a lot of literature on multi armed bandit problems and also on the non-stochastic multi armed bandit problem. The latter has the advantage of not assuming anything about the way in which the sequence of rewards is generated.There is a a line of work in the non stochastic setting which sounds related to what you are describing. It's called the problem of tracking the best expert: http://www.cse.ucsc.edu/~mark/papers/track-long.ps The idea in this problem is to do well as compared not to the best fixed action but rather the best piecewise constant sequence of actions. There are many variations of this problem, but the basic idea is that you can have low regret as compared to a changing sequence of actions so long as that sequence doesn't change too "quickly" for some definition of quickly.
 I know there is a lot of literature, which I'm unfortunately less familiar with than I'd like to be.Here is the specific problem that I am interested in.Suppose we have a multi-armed bandit where the probability of reward are different on every pull. The distribution of rewards are not known. It is known that there is one arm which, on every pull, has a minimum percentage chance by which it is more likely to produce a reward if pulled than any other arm. Which arm this is is unknown, as is the minimum percentage.In other words payoff percentages change quickly, but the winner is always a winner, and your job is to find it. I have an algorithm that succeeds with only logarithmic regret. I have no idea whether anyone else has come up with the same idea, and there is so much literature (much of which I presume is locked up behind paywalls) that I cannot easily verify whether anyone else has looked at this variant.
 I think what you're describing is basically the non-stochastic setting. The idea there is to assume absolutely nothing about the rewards (they can be generated by quickly changing distributions or even an adversary) but still try to do as well as the best single arm. The standard algorithm for this problem is the multiplicative weights update method, also called the exponentially weighted average method or the hedge algorithm. It's a really simple method where you keep a probability distribution over the arms and then reweight the probabilities at each time step using something proportional to the exponential of the rewards of the arms. If you run it for T time steps on a problem involving N arms, it gives you regret O(sqrt(T log N)). This is actually the best possible bound without additional assumptions. This is a good paper on it: http://cns.bu.edu/~gsc/CN710/FreundSc95.pdfThere are many, many different variations of the method and result. For example, you can show improved results assuming different things about the rewards. For exapmle, you can show O(log(T)) regret with certain additional assumptions. This is a really good book on different non-stochastic online learning problems: http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Predi...
 The article is interesting, but assumed you get to pull all of the arms at once, and see all of the results. That's different from the website optimization problem.The book looks like it would take a long time to work through, and they don't get to the multi-armed bandit problem until chapter 6. It goes into my todo list, but is likely to take a while to get off of it...
 Yes, I realized after posting that this is probably a better paper to link to: http://cseweb.ucsd.edu/~yfreund/papers/bandits.pdf The algorithms for the partial information setting are sometimes surprisingly similar to the algorithms where you see all the results. The algorithm in the paper linked above is essentially the same algorithm but with a small exploration probability. The regret bound gets worse by a factor of sqrt(N), however.
 I don't think the non-stochastic setting is entirely appropriate. What we've been discussing is a situation where the differences in conversion rates are constant in expectation, though absolute conversion rates vary. So it's basically a stochastic setting, which should allow more leverage. Also, as I recall for the non-stochastic setting the definition of regret changes so the regret bounds are not directly comparable.
 My take on btilly's problem was that the difference between the convergence rates is not constant but rather known to be bounded away from zero by some unknown constant. That is, there is some value epsilon such that the difference in conversion rate is on every round at least epsilon. However, at some rounds it could be more than epsilon, and it doesn't necessarily vary according to any fixed distribution. I think this is therefore roughly equivalent to the non stochastic setting additionally assuming the best arm has large (relative) reward.I agree though that if the difference in conversion rate were stochastic that could potentially make the problem much easier.
 Thanks for the link. I'll read it asap. For anyone following along at home, I believe Yisong Yue's work on the duelling bandits is also relevant here: http://www.yisongyue.com/publications/jcss2012_dueling_bandi... and http://www.yisongyue.com/publications/icml2011_beat_the_mean... (I haven't had a chance to read these in detail, yet.)
 This is a good coversation. So the Disclaimer, my company Conductrics.com allows you to use algos similar to bandits as well as AB/MVT.AB can be thought of as more of a form of epoch- greedy - you play uniform random then play greedy. One advantage of e-greedy is that if your environment is not stationary - you are still sampling from the arms - its sort of an insurance premium. To address the differences in reward signals based on the environment, there is the option to model the environment with features - since it maybe that Fridays require a different selection than Sundays - not sure why you would a priori assume that the relative values, or even just the arm rankings are independent of environmental variables.One other point- if you just want a super quick hack to pick winners (or at least pick from te set of higher performing arms) you can just optimistically seed your estimates for each arm - then just pick greedy. Not claiming it is optimal or anything but it requires almost no coding overhead. Of course you need the problem to be stationary.Regardless of which algo approach you use, I do think it is useful to at least think in terms of bandits or reimforcemt learning when tackling these problems.
 There is also another area where improvement can be had. As far as I know most A/B testing just looks at the decisions that led to a conversion. But what about those that didn't lead to a conversion? There is information in that too, and it depends on time.Suppose you pulled decision X for some user two minutes ago, and the user hasn't converted yet. Suppose that you pulled decision Y for some user two weeks ago, and the user hasn't converted yet. Do these two give you the same information? No: the user that got Y is less likely to convert than the user that got X, simply because of the time difference.What you could do to incorporate this extra information is model the conversion process more explicitly, for example as each lever resulting in a particular exponentially distributed time-to-conversion (and then do Bayesian inference over those parameters).

Applications are open for YC Summer 2018

Search: