This is not an uncommon scenario when testing an email program. And points to operational reasons why A/B testing may be preferred even if you believe that it is statistically worse.
It's not even a question of statistical power, it's just a matter of typing. Typical bandit algorithms use the function play_machine: Machine -> Float. In the situation you describe, the type signature is Machine -> Future[Float].
I'm working on a modified version which handles those situations, however. It's along the lines of UCB2, tweaking the formula so that the numerator is proportional to the number of plays you've submitted while the denominator is the number of plays you've received. Once I finish the math I'll write it up.
The biggest theoretical problem is how it performs when conversion rates are changing out from under it. Which can happen either because your business has natural fluctuations, or because you run overlapping tests - and adoption of one good test can mess up the analysis of unrelated tests that are running at the same time.
You can simply solve this problem by only using data collected in your exploration phase, and throwing away all data collected during exploitation for statistical purposes. However I've yet to see anyone promoting MAB approaches - including you - express any awareness of this rather significant problem.
This is a problem for A/B testing also - if the world changes after your test is finished, you just got the wrong answer. Or am I missing something?
This is why you run tests which you expect to be statistically independent of each other and why you stick to results you expect to be durable.
I'd love it if you email me about it (or post publicly on the topic). I'm thinking of writing a "real world bandit benchmarking suite" which will take into account as many such situations as possible.
A/B testing is robust in the face of absolute changes in conversion rates so long as preferences remain consistent. Of course preferences do not always remain consistent, but that is substantially more likely to happen than that conversion rates do not budge.
Traditional MAB approaches are not robust in the face of absolute changes in conversion rates, even if preferences remain consistent. The problematic change is what happens if conversion rates improve while the worse version is ahead. Then you can come to very solidly to the conclusion that the worse version is better, and be stuck on that for a depressingly long time. The smarter the MAB algorithm, the more solidly you can make that mistake.
Is this likely? Well most businesses have regular fluctuations in conversion rates. Furthermore websites under continuous improvements are constantly introducing changes. It is therefore not uncommon to, while testing one thing, make independent changes that are likely to improve conversion rates.
But as I said, the simple change to throw away data collected during exploitation makes a MAB approach also robust in the face of absolute changes in conversion rates, as long as preferences remain consistent. Doing so increases average regret by a constant factor in the case where conversion rates never change.
(There are complications in the more than 2 arm case. In particular the fact that you're uncertain about A vs C doesn't mean that you should continue exploring version B. So some observations should count as exploration between a strict subset of the versions. But that's a technical detail.)