
For what it's worth, I've been following this cross-Internet debate with more than a little professional interest. Cards on the table: I have coded A/B testing software, I frequently code and/or administer it for clients (often in ways which are provably suboptimal), and I am a dirty loyalty-free scientist-cum-capitalist-pig who would stab A/B testing in the back in a second if I thought there were an easier way to extract more money for the same amount of work.

I strongly, strongly suggest that anyone attempting to look at this problem from the perspective of a site owner rather than a mathematical abstraction read and digest btilly's comment from earlier this week:


The issues he lays out are very real in the course of practical use of site testing to actually make money. In particular, his #2 would scare the heck out of me, in a much deeper way than "A/B testing provably doesn't minimize regret" worries me in the other direction. (Or e.g. other flaws with particular A/B testing implementations. For example, repeatedly checking the results of your A/B test and deciding to end it when you see significance has been explained quite a few times, with stats to match, as a bad idea. However, even if you check like a hyperactive squirrel, you're still winning -- you're just winning less often than you think you are. Take your B- in stats class, but proceed to make motivational amounts of money for the business.)
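To make the hyperactive-squirrel point concrete, here's a minimal simulation (all parameters invented for illustration) of peeking at an A/A test -- both arms identical, so by construction there is nothing to find, yet peeking "finds" something far more often than the nominal 5%:

```python
import random

def peeking_false_positive_rate(n_trials=1000, n_peeks=20, batch=100, seed=0):
    """Run A/A tests (both arms convert at 50%) and peek at a two-sided
    z-test after every batch, stopping at the first 'significant' result.
    The nominal error rate is 5%, but the fraction of trials that ever
    declare significance comes out several times higher."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_trials):
        a = b = n = 0
        for _ in range(n_peeks):
            for _ in range(batch):
                a += rng.random() < 0.5
                b += rng.random() < 0.5
            n += batch
            pooled = (a + b) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            if se > 0 and abs(a - b) / n / se > 1.96:  # nominal 5% cutoff
                false_positives += 1
                break
    return false_positives / n_trials
```

Note the error rate is inflated, not 100%: when a real difference exists, the peeker still picks the true winner more often than not, which is the "you're still winning, just less often than you think" point.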

The worst possible takeaway you, personally, the typical HN reader, could have from this debate is "Oh, I guess I shouldn't A/B test then." Pick A/B testing, bandit testing, whatever -- any option in the set, even with poor algorithms and/or the easiest errors I can think of, strictly dominates not testing at all. (Actually testing today is also better than "testing... someday", which, from my own experience and that of clients I know, is something it is very easy to slip into even if you theoretically know you should be doing it.)

Pick A/B testing, bandit testing, whatever -- any option in the set, even with poor algorithms and/or the easiest errors I can think of, strictly dominates not testing at all.

So I submitted this post and went off to boxing. On the train ride back, I thought "I hope I don't make people think they shouldn't A/B test." And at the same time you were writing this, I added a conclusion to my blog post saying the same thing.

Bad A/B testing is an 80% solution. Good A/B testing is a 90% solution. Good bandit is a 95% solution. 80% >> 0.

A/B testing has another benefit to software engineers that bandit doesn't - it lets you delete code.

Unless I misunderstand bandit algorithms, there's a trivial modification that makes the actual, practical administration of them essentially identical to A/B testing with regards to when you can rip out code.

If A smashes B, then the bandit will converge in a very obvious manner on A; you pick it and delete the B code branch, accepting future regret from the possibility that B was in fact better as a cost of doing business. If A doesn't smash B, then at an arbitrary point in the future you notice it has not converged on either A or B, pick one branch using a traditional method like "I kind of like A myself", delete the B code, and go on to something that will actually matter for the business, rather than trying to minimize your regret function in a case where the two candidates (provably) likely differ by non-motivational amounts of money.
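If it helps, the modification I have in mind looks roughly like this sketch -- Thompson sampling standing in for "bandit", with the deadline and the "smash" threshold entirely made up by me:

```python
import random

def bandit_then_delete(true_rates, deadline=5000, smash_share=0.8, seed=1):
    """Beta-Bernoulli Thompson sampling run until an arbitrary deadline,
    then the practical rule from the comment: if the bandit has clearly
    converged on one arm (it received `smash_share` of all traffic),
    keep it and delete the other branch; otherwise pick by fiat
    ('I kind of like A myself') and delete B anyway."""
    rng = random.Random(seed)
    wins, losses, pulls = [0, 0], [0, 0], [0, 0]
    for _ in range(deadline):
        # Sample a plausible conversion rate from each arm's posterior.
        sampled = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in (0, 1)]
        arm = 0 if sampled[0] > sampled[1] else 1
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    leader = 0 if pulls[0] >= pulls[1] else 1
    if pulls[leader] >= smash_share * deadline:
        return leader   # A smashed B: delete the B code branch
    return 0            # no smash: pick A by fiat, delete B anyway
```

Either way the losing branch gets deleted at the deadline, which is the point: administratively it ends up identical to A/B testing.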

Please feel free to correct me if I'm wrong on this -- I have a bad flu today and take only marginal responsibility for my actions.

You are correct for practical purposes.

However, you are mathematically wrong. Doing things the way you describe is another form of A/B testing, just one designed to reduce the regret incurred during the test. You still get the linear growth in regret I described (i.e., there is some % chance you picked the wrong branch, and you are leaving money on the table forever after).

Of course, your way is also economically correct, and the theoretical CS guys are wrong. They shouldn't be trying to minimize regret; they should be minimizing time-discounted regret.

(Yes, I'm being extremely pedantic. I'm a mathematician, it comes with the territory.)
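To put rough numbers on the pedantry (the error probability, effect size, and discount factor here are invented for illustration):

```python
def expected_regret(p_wrong=0.05, delta=0.01, horizon=100_000, gamma=0.999):
    """After committing to a branch, you picked the worse one with
    probability p_wrong, and it converts worse by delta per visitor.
    Undiscounted expected regret grows linearly with the horizon,
    without bound; discounted by gamma per visitor, it sums to at most
    p_wrong * delta / (1 - gamma), a finite number."""
    per_visit = p_wrong * delta
    linear = per_visit * horizon
    discounted = per_visit * (1 - gamma ** horizon) / (1 - gamma)
    return linear, discounted
```

The theoretical literature effectively keeps gamma at 1, which is where the unbounded "money left on the table forever" framing comes from; any positive discount rate caps it.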

Well, fundamentally, website optimization is not a multi-armed bandit problem. You are not simply choosing between machines. Instead, each website change is another node on a tree, and you want to find the highest-performing path in a constantly changing world, with the additional constraint that you can't keep incompatible machines around for very long.

A guess at a good strategy in those situations would be something that enables you to choose a good path from a small number of options and keep doing so -- on the basis that the cumulative advantage will be far more important than any particular choice in itself. Sounds like A/B testing would win, if you could actually map the complexity of the problem correctly.

I think that, assuming stationary behavior (a bad assumption), given enough time either A will smash B or B will smash A. As in, it's almost impossible that they are exactly equal in performance, and given enough impressions even the smallest difference will be detected and optimized for.
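For scale, a standard rule of thumb (roughly 80% power at 5% significance -- my numbers, not the parent's) shows how fast "enough impressions" grows as the difference shrinks:

```python
def visitors_needed(p=0.05, delta=0.001):
    """Approximate visitors required per arm to detect an absolute
    lift of `delta` over a baseline conversion rate `p`, using the
    common rule of thumb n ~= 16 * p * (1 - p) / delta**2. Halving
    the effect size quadruples the required traffic."""
    return 16 * p * (1 - p) / delta ** 2
```

At a 5% baseline, a 0.1% absolute lift needs on the order of 760,000 visitors per arm -- so "given enough impressions" can mean a very long wait for small effects.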

I think that's a bad thing, though. Optimizing a single conversion metric is only good down to some (large or medium) effect size. Past that point, criteria like "I kind of like A" are much more important.

In A/B testing you see this when your ambitious testing campaign returns "insignificant". In MAB you see it when two choices run at roughly 50/50 enrollment for a long period of time.

So at these two extremes, I think practical use of A/B and MAB should be roughly identical. In the middle ground, where A is usefully but not incredibly better than B, I feel they must differ.

This is correct. One can also modify most bandit algorithms so they stop exploring at some point.
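The crudest version of that modification is explore-then-commit: play the arms uniformly for a while, then stop exploring entirely. A sketch, with invented parameters:

```python
import random

def explore_then_commit(true_rates, explore=1000, horizon=10000, seed=2):
    """The simplest 'stop exploring at some point' bandit: alternate
    between the two arms for `explore` pulls, then commit to the
    empirical best arm for the rest of the horizon."""
    rng = random.Random(seed)
    pulls, rewards = [0, 0], [0, 0]
    best = 0
    for t in range(horizon):
        arm = t % 2 if t < explore else best   # explore, then exploit only
        pulls[arm] += 1
        rewards[arm] += rng.random() < true_rates[arm]
        if t == explore - 1:                   # exploration ends: freeze choice
            best = 0 if rewards[0] / pulls[0] >= rewards[1] / pulls[1] else 1
    return best
```

Once exploration stops, the committed choice never changes, so the losing branch can be deleted at that point, exactly as with a concluded A/B test.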

This, I think.

Once you get your expected regret below the cost of maintaining the old code, rip out the old code!

Wouldn't bad A/B testing be a 50/50 solution? Is making decisions on data without statistical significance any better than throwing darts blind?

Consider the bad A/B testing algorithm implemented by Dr. Awesome. Instead of reporting statistical significance, if a result would be statistically significant at below a 10% chance of coincidence, he reliably reports "It was Awesome!" Dr. Awesome then promptly burns his notes to warm his awesome heart.

Even though many people would take issue with using 10% (a lot of practitioners like 5%) and Dr. Awesome has some serious issues with data retention policies, if you always follow Dr. Awesome's advice, you'll win 9 times for every time you lose. I'll take those odds.

Say Dr. Awesome also has one additional problem: one time in twenty, regardless of the results of the A/B test, he just can't help himself and says "It was Awesome!" anyhow. If you follow his advice, you'll now win approximately 5 times for every time you lose. I'll take those odds, too.
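A back-of-envelope check of those odds, under assumptions that are mine, not the parent's: suppose one test in five genuinely clears the 10% bar, and Dr. Awesome's spontaneous reports are right only half the time, like a coin toss:

```python
def dr_awesome_odds(sig_rate=0.2, alpha=0.10, blurt_rate=0.05):
    """Wins-per-loss ratio of following Dr. Awesome's advice.
    Assumed: `sig_rate` of tests reach significance at level `alpha`
    (those reports are right 1 - alpha of the time); of the remaining
    tests, `blurt_rate` draw a spurious report that is right half the
    time. sig_rate = 0.2 is an illustrative guess, not from the comment."""
    wins = sig_rate * (1 - alpha) + (1 - sig_rate) * blurt_rate * 0.5
    losses = sig_rate * alpha + (1 - sig_rate) * blurt_rate * 0.5
    return wins / losses
```

Without the blurts the ratio is (1 - alpha) / alpha = 9 wins per loss; with them it comes out around 5, matching the comment's arithmetic under this guess at the base rate.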

There is a large gap between true statistical significance and a fair coin toss. Bad testing (of whatever flavor: A/B, MAB, etc.) is likely to land somewhere in that gap -- most likely worse than proper testing, but quite likely better than tossing the coin or throwing darts.

A prerequisite for making any of this work is having a statistically significant number of visits to your site on a daily basis, right?

I think many people first have to figure out how to cross that bridge before they start to worry about optimizing what's on the other side.

Thank you for saying that. For people who want to dive in deeper, the discussion below http://news.ycombinator.com/item?id=4053739 is highly worthwhile as well.

And you are absolutely right. Even with bad assumptions and techniques, actually testing beats not testing by such a ridiculous margin that you need to test.

In practice, since we're talking about it: variations which perform statistically significantly better one day may perform worse the next -- for example, because happy people like to click on one variation while stressed people prefer another.

Keep A/B tests running even after you've decided on the best variation, to make sure that the uplift you've observed is real! See http://john.freml.in/ab-testing-significance

I think concern #2 was addressed in the original article: add a "fade out" threshold so that results are fixed to some time span.

