
Let's settle this with science rather than rhetoric. I'd like to do some proper comparisons between bandit algorithms and A/B testing. Unfortunately we haven't been saving time series data at Myna, so we don't have any test data. If anyone has time series data from an A/B test, and is happy to donate it to the cause, please get in touch (email in profile).

Updated for clarity.




Write out what .CSV columns you need and what formats they need to be in, and I will happily get you this for a handful of A/B tests. (Though probably not faster than late July. As much as I love A/B testing, there is the small matter of a wedding and honeymoon to throw a wee bit of a wrench into my near-term schedule...)


Gosh, congratulations Patrick. You have helped so many of us. Good luck ;)


Thanks. I'll be in touch.


Rather than try to mine historical data, run an experiment to pit UCB against Neyman-Pearson inference. For some A/B tests, split the users into two groups: treatment A is A/B testing, treatment B is UCB.

In treatment group A, follow appropriate A/B testing procedures: pick a sample size prior to the experiment that gives you appropriate power, or use Armitage's rule for optimal test termination. (Email me if you're interested, I'm happy to send over papers/scan relevant pages from his book.) However, it's probably best to use a fixed sample size, as that is what most real-life A/B test practitioners use. Picking the sample size can be a bit tricky, but as a rule of thumb, pick something large enough to detect differences between treatments as small as 1 percentage point.
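To make that rule of thumb concrete, here is a rough Python sketch of the fixed-sample-size calculation using the usual normal approximation for two proportions. The 5% baseline rate, 5% significance level and 80% power are my own illustrative assumptions, not figures from this thread.

    # Rough per-arm sample size for detecting a lift from p_baseline to
    # p_variant, via the standard two-proportion normal approximation.
    # Baseline/lift values below are illustrative assumptions.
    from statistics import NormalDist

    def sample_size_per_arm(p_baseline, p_variant, alpha=0.05, power=0.8):
        z = NormalDist().inv_cdf
        z_alpha = z(1 - alpha / 2)   # two-sided test
        z_beta = z(power)
        p_bar = (p_baseline + p_variant) / 2
        numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                     + z_beta * (p_baseline * (1 - p_baseline)
                                 + p_variant * (1 - p_variant)) ** 0.5) ** 2
        return numerator / (p_variant - p_baseline) ** 2

    # Detecting a lift from 5% to 6% at 80% power needs roughly 8,000
    # users per arm (~16,000 total).
    print(round(sample_size_per_arm(0.05, 0.06)))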

In treatment group B, use the UCB1 procedure. Subject the users to whichever design UCB1 picks, and continue with the learning.
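For reference, a minimal Python sketch of the UCB1 selection rule; the two-arm setup and the simulated conversion rates are illustrative assumptions, not part of the proposed experiment.

    # Minimal UCB1 sketch: play each arm once, then always pick the arm
    # with the highest mean reward plus exploration bonus.
    import math, random

    def ucb1(reward_fns, rounds):
        counts = [0] * len(reward_fns)    # plays per arm
        totals = [0.0] * len(reward_fns)  # summed rewards per arm
        for t in range(rounds):
            if t < len(reward_fns):
                arm = t                   # play every arm once to start
            else:
                arm = max(range(len(reward_fns)),
                          key=lambda i: totals[i] / counts[i]
                          + math.sqrt(2 * math.log(t) / counts[i]))
            counts[arm] += 1
            totals[arm] += reward_fns[arm]()
        return counts, totals

    # Illustrative only: arm 0 converts at 5%, arm 1 at 6%.
    arms = [lambda: float(random.random() < 0.05),
            lambda: float(random.random() < 0.06)]
    print(ucb1(arms, 10_000))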

Do not share any information between treatment groups A and B.

Run these tests for a sufficient amount of time over a largish number of clients, and then use permutation tests to determine which treatment, UCB1 vs Neyman-Pearson, performs better.
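A bare-bones sketch of that final comparison might look like the following; the per-client conversion-rate inputs are placeholders.

    # Two-sample permutation test on per-client conversion rates.
    # group_a / group_b would hold each client's observed conversion rate
    # under A/B testing and UCB1 respectively; the data here is made up.
    import random

    def permutation_test(group_a, group_b, n_permutations=10_000):
        """One-sided p-value for seeing mean(B) - mean(A) this large by chance."""
        observed = sum(group_b) / len(group_b) - sum(group_a) / len(group_a)
        pooled = group_a + group_b
        count = 0
        for _ in range(n_permutations):
            random.shuffle(pooled)
            a, b = pooled[:len(group_a)], pooled[len(group_a):]
            if sum(b) / len(b) - sum(a) / len(a) >= observed:
                count += 1
        return count / n_permutations

    print(permutation_test([0.050, 0.048, 0.052], [0.055, 0.057, 0.054]))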

In all the simulations I've seen, UCB outperforms simple A/B testing, but it would be great to see some empirical evidence as well.


Most websites lack sufficient traffic to reach statistical significance in a short time frame. Sure, Google and Facebook can run a test and get real results in a day (or even hours), but the rest of us need weeks or months to do things properly.


That really depends on the level of significance and the size of the effect you wish to detect.


Unfortunately most of the A/B tests that I am involved with at the moment simply do not look anything like what you'd want for a traditional bandit approach. For a start, the data is fundamentally two-dimensional: I am testing by user, and users may take days to hit a conversion metric. Furthermore, a large chunk of the population has to go into the test before I get any data at all about how any of them will perform.

This is not an uncommon scenario when testing an email program, and it points to operational reasons why A/B testing may be preferred even if you believe it is statistically worse.


Actually you are correct that in situations with latency (i.e., you have the opportunity to play a machine at t=0, another opportunity to play at t=1, but you don't receive results of the plays until t=10), UCBx doesn't work.

It's not even a question of statistical power; it's just a matter of typing. Typical bandit algorithms use the function play_machine: Machine -> Float. In the situation you describe, the type signature is Machine -> Future[Float].
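To make the typing point concrete, here is my own Python rendering of those two signatures; Machine is just a placeholder type.

    # Immediate vs. delayed feedback, in Python terms (my own rendering
    # of the Scala-style signatures above). Machine is a placeholder.
    from __future__ import annotations
    from concurrent.futures import Future

    class Machine: ...

    def play_machine(machine: Machine) -> float:
        ...  # classic bandit assumption: the reward is known as soon as we play

    def play_machine_delayed(machine: Machine) -> Future[float]:
        ...  # the email scenario: the reward only resolves days later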

I'm working on a modified version which handles those situations, however. It's along the lines of UCB2, tweaking the formula so that the numerator is proportional to the number of plays you've submitted while the denominator is the number of plays you've received. Once I finish the math I'll write it up.
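In the meantime, here is a sketch of one possible reading of that tweak (my guess, not the finished algorithm): keep the UCB1 form, but let the log term count plays submitted while the per-arm denominator counts plays whose results have been received.

    # Speculative sketch of a delayed-feedback UCB variant, per the idea
    # described above. Not the author's finished algorithm.
    import math

    def choose_arm(submitted, received, reward_sums):
        """submitted/received: per-arm counts; reward_sums: totals for received plays."""
        total_submitted = sum(submitted)
        best, best_score = None, float("-inf")
        for i in range(len(submitted)):
            if received[i] == 0:
                return i  # always try an arm we have no feedback for yet
            mean = reward_sums[i] / received[i]
            bonus = math.sqrt(2 * math.log(total_submitted) / received[i])
            if mean + bonus > best_score:
                best, best_score = i, mean + bonus
        return best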


The problems with UCBx in the real world are much bigger than that.

The biggest theoretical problem is how it performs when conversion rates are changing out from under it. This can happen either because your business has natural fluctuations, or because you run overlapping tests - and adoption of one good test can mess up the analysis of unrelated tests that are running at the same time.

You can solve this problem simply by using only the data collected in your exploration phase, and throwing away all data collected during exploitation for statistical purposes. However, I've yet to see anyone promoting MAB approaches - including you - express any awareness of this rather significant problem.


Could you explain the problem a little more? From what you wrote here, it sounds like the problem is that if the world changes, it can mess up your answers, and the bandit will take exp(N) time to catch up.

This is a problem for A/B testing also - if the world changes after your test is finished, you just got the wrong answer. Or am I missing something?

This is why you run tests which you expect to be statistically independent of each other and why you stick to results you expect to be durable.

I'd love it if you emailed me about it (or posted publicly on the topic). I'm thinking of writing a "real world bandit benchmarking suite" which will take into account as many such situations as possible.


The problem is that the world changes - constantly. What changes could mess you up? When we apply math to the real world, we should assume as little as possible.

A/B testing is robust in the face of absolute changes in conversion rates so long as preferences remain consistent. Of course preferences do not always remain consistent, but consistent preferences are substantially more likely than conversion rates that never budge.

Traditional MAB approaches are not robust in the face of absolute changes in conversion rates, even if preferences remain consistent. The problematic change is what happens if conversion rates improve while the worse version is ahead. Then you can come very solidly to the conclusion that the worse version is better, and be stuck on that for a depressingly long time. The smarter the MAB algorithm, the more solidly you can make that mistake.

Is this likely? Well, most businesses have regular fluctuations in conversion rates. Furthermore, websites under continuous improvement are constantly introducing changes. It is therefore not uncommon, while testing one thing, to make independent changes that are likely to improve conversion rates.

But as I said, the simple change of throwing away data collected during exploitation also makes a MAB approach robust in the face of absolute changes in conversion rates, as long as preferences remain consistent. Doing so increases average regret by a constant factor in the case where conversion rates never change.

(There are complications in the more than 2 arm case. In particular the fact that you're uncertain about A vs C doesn't mean that you should continue exploring version B. So some observations should count as exploration between a strict subset of the versions. But that's a technical detail.)
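For the two-arm case, a sketch of that fix might look like the following, grafted onto a simple epsilon-greedy bandit rather than UCB purely for brevity (my simplification): exploitation plays still serve the currently leading arm, but only the uniformly random exploration plays feed the estimates, so a shift in absolute conversion rates hits every arm's data evenly.

    # Epsilon-greedy with exploration-only statistics: outcomes of
    # exploitation plays are discarded for estimation purposes.
    import random

    def epsilon_greedy_explore_only(reward_fns, rounds, epsilon=0.1):
        n_arms = len(reward_fns)
        explore_counts = [0] * n_arms
        explore_totals = [0.0] * n_arms
        for _ in range(rounds):
            if 0 in explore_counts:
                arm = explore_counts.index(0)        # ensure every arm has some data
                explore_counts[arm] += 1
                explore_totals[arm] += reward_fns[arm]()
            elif random.random() < epsilon:
                arm = random.randrange(n_arms)       # exploration play: counted
                explore_counts[arm] += 1
                explore_totals[arm] += reward_fns[arm]()
            else:
                arm = max(range(n_arms),             # exploitation play: shown,
                          key=lambda i: explore_totals[i] / explore_counts[i])
                reward_fns[arm]()                    # but its outcome is not counted
        return explore_counts, explore_totals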


First off, I'm all for settling this with data and math.

However, I'm afraid that some practical matters were lost in the previous discussions. As I see it, things happened somewhat like this: 1) A/B testing starts to become a Big Deal; 2) There's a flurry of articles about A/B testing of various quality; 3) Lots of people implement A/B testing, mostly in a poor way; 4) The 20 lines of code article, which would probably really help most people who haven't done A/B testing correctly; 5) Intense discussion about correctness and efficacy that won't impact those people.

I think getting to the bottom of this is important, but I think the multi-armed bandit article that started this would do far more good than harm for the droves of people trying and largely failing to get good data from their poorly done A/B testing.


I'm all for multi-armed bandits. So much so that I've founded a startup to bring them to the masses (http://mynaweb.com/ - sign up now!). I'm absolutely certain that an appropriate MAB algorithm will outperform A/B in the overwhelming majority of cases and be simpler and more flexible to use in practice.

However, some reasonable objections have been raised in the previous discussions. (In case anyone is keeping a record, here are the ones I have been involved in:

- http://news.ycombinator.com/item?id=4052997
- http://news.ycombinator.com/item?id=4040022
- http://news.ycombinator.com/item?id=3928929
- http://news.ycombinator.com/item?id=3867380
- http://news.ycombinator.com/item?id=2831455 )

The typical way for a company to handle these objections would be to ignore them or publish a few pretty graphs and drown the objections in FUD. I think the community deserves better, and would like to do a proper comparison. It will not only be informative but also help us design a better algorithm. Everyone wins.


> I've founded a startup to bring them to the masses ... It will not only be informative but also help us design a better algorithm. Everyone wins.

Yes, this is how everyone wins. Those with the ability and drive can end up making a difference, largely through providing a framework or library for doing this right. This is why I think a continuing exploration of this is worthwhile. Until that time, I think the masses should be encouraged to do what works best (or at least better) for them in the present.

For now, I think I'll let this rest. I'm on the verge of ranting against a general trend (checking off boxes by following steps rather than understanding what you're doing), but this is the wrong place for that and it deserves more thought and time than I'd give it here.



