
Agreed. And I'd add that for me, the point of A/B testing is to learn something. We're not just interested in whether A or B is better; we're interested in getting better at doing what we do. Studying the A/B results is half the fun.



"We're not just interested in whether A or B is better; we're interesting in getting better at doing what we do."

Absolutely. That's my biggest objection to bandit methods, but it's also the fuzziest objection, and the one least likely to appeal to hyper-analytical people. There's a strong temptation (as we can see from this article) to treat bandit optimization as a black box that just spits out an infallible answer (i.e. as a Lazy Button).

It's the same human tendency that has led "you should follow me on twitter" to become one of the more common n-grams on the interwebs (even though it probably never worked for anyone but Dustin Curtis, and likely causes a counter-intuitive backlash now).


Well put. I get why they don't get it, but from my perspective it looks like technical gold-plating a lot of the time. I generally aim to amplify the power of the human mind, not eliminate it from the system.

Of course, these algorithms are cool in certain contexts where you never want to think about the problem again. For example, ad placement for a giant ad network running against a lot of UGC content. In that way they remind me of neural networks: the post office doesn't care how the machine recognizes hand-written zip codes, as long as it does so reliably.

But startups are all about learning, and interface design is where you have direct contact with users. I want to soak in the data. I want to hang out in people's living rooms when they're using my site. A fancy Lazy Button (great phrase) is the last thing I need.


Why can't you do the same study from the bandit results? From what I understand, this is the same as A/B testing, except it will only show a suboptimal result to 10% of the users instead of 50% (or more). After a few days of testing, can't you take a look at the statistics and analyze them the same way that you would for an A/B test? Then just stop testing?

The A/B test just gives you the conversion rate of each option. And so does the bandit. As I understand it, the only difference is that the bandit will bug your users with bad results less often.
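To make that concrete, here's a toy sketch of the difference (all names and counters are made up for illustration; an epsilon-greedy bandit with epsilon = 0.1 is the "10%" case, versus a 50/50 A/B split):

    import random

    counts = {"A": 0, "B": 0}        # times each variant was shown
    conversions = {"A": 0, "B": 0}   # times each variant converted
    EPSILON = 0.1                    # fraction of traffic used to explore

    def rate(arm):
        return conversions[arm] / counts[arm] if counts[arm] else 0.0

    def choose_variant():
        # A/B test would just be: return random.choice(["A", "B"])
        # Epsilon-greedy: explore 10% of the time, exploit the rest
        if random.random() < EPSILON:
            return random.choice(["A", "B"])
        return max(counts, key=rate)

Either way you log impressions and conversions per variant; the bandit just shifts most of the traffic to the current leader while it runs.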


Sure, but wouldn't that take longer to get us our answer, and therefore keep us from moving on to the next experiment sooner?

Of course, the real answer to why not is "we have an A/B system and I'm not going to add bandit stuff for no benefit". But even if I were doing it from scratch, it seems more complex. The benefit of these approaches seems to be that one no longer has to think. We want to think.


"After a few days of testing, can't you take a look at the statistics and analyze them the same way that you would for an A/B test? Then just stop testing?"

No, because the assumptions that underpin many statistical techniques are violated when you're not assigning people to cohorts consistently, and at random.


If you are using an epsilon-greedy approach (or something similar), then I believe the data collected during the exploration portion (the random calls) is open to standard hypothesis testing, albeit with less power due to the reduced sample size. Think of it this way: you might normally run your experiment on only a subset of your traffic (population), say 20%, with the rest (80%) getting the current experience. With the e-greedy type of approach you are just swapping the 'current experience' for the maximum estimated experience, but that other 20% is still a random draw.
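A sketch of the separation I mean (everything here is invented for illustration; epsilon = 0.2 to match the 20% in the example):

    import random

    EPSILON = 0.2          # the 20% exploration slice from the example above
    records = []           # (arm, explored) impression log
    stats = {"A": [0, 0], "B": [0, 0]}   # [shown, converted] per arm

    def best_estimated_arm():
        return max(stats, key=lambda a:
                   stats[a][1] / stats[a][0] if stats[a][0] else 0.0)

    def serve():
        if random.random() < EPSILON:
            arm = random.choice(["A", "B"])   # pure random assignment
            records.append((arm, True))       # explored: this slice stays iid
        else:
            arm = best_estimated_arm()        # the "current experience" slot
            records.append((arm, False))      # biased by past observations
        stats[arm][0] += 1
        return arm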


The draws aren't independent. At any given time, the probability of assigning a user to a cohort depends on a function of the previous observations (in other words, it's a Markov model).

The standard confidence tests -- t-tests, G-tests, chi-squared tests, etc. -- are based on distributions of independent, identically distributed (iid) data.

I'd have to think about it more, but I believe that btilly's examples are also the most intuitive reasons why independence matters. If your data is time-dependent, then assigning users to cohorts based on past performance lets the time dependency dominate. There may be other good examples.
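For what it's worth, this is the kind of test being assumed here (the counts are invented; scipy's chi2_contingency does the 2x2 comparison):

    from scipy.stats import chi2_contingency

    # 2x2 contingency table: rows = variants, cols = [converted, not]
    table = [[30, 970],   # variant A
             [45, 955]]   # variant B
    chi2, p, dof, expected = chi2_contingency(table)
    print(p)  # the p-value is only meaningful if assignment was iid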


Is that true in the e-greedy case? Sure, during the exploit call they are not independent, but during the explore portion I would assume they are, since users have been randomly assigned into the exploration pool (epsilon) and then given a variant by a uniform random draw. There is no information from prior draws being used, as far as I can see.
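If that's right, the analysis would just filter down to the explore slice before testing -- something like this (the log rows are invented for illustration):

    from scipy.stats import chi2_contingency

    # Hypothetical impression log: (arm, explored, converted) records
    records = [("A", True, True), ("B", True, False), ("A", False, True),
               ("B", True, True), ("A", True, False), ("B", False, False)]

    explore_only = [r for r in records if r[1]]   # keep only the iid draws

    def cell(arm, converted):
        return sum(1 for a, _, c in explore_only
                   if a == arm and c == converted)

    table = [[cell("A", True), cell("A", False)],
             [cell("B", True), cell("B", False)]]
    print(chi2_contingency(table)[1])   # p-value from the random slice only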



