Hacker News

Convergence to statistical significance is a bad metric (just as early stopping is such a problem), but you can map it to the rigorous notion of statistical power, perhaps. From there, it's well established that the t-test is the most powerful test (nothing better exists) given that the data really do come from two normal distributions.

In reality, however, tests like the Wilcoxon test are 99% as powerful and far more robust to model misspecification.
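To make the comparison concrete, here's a rough simulation sketch of the kind of power comparison I mean, using SciPy. The sample size, effect size, and trial count are all arbitrary choices for illustration, not anything from a real experiment:

```python
# Hypothetical power simulation: t-test vs. Wilcoxon rank-sum test
# when both groups really are normal (the t-test's home turf).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect, alpha, trials = 50, 0.5, 0.05, 2000  # arbitrary illustrative values

t_hits = w_hits = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)       # control group
    b = rng.normal(effect, 1.0, n)    # treatment group with a true shift
    t_hits += stats.ttest_ind(a, b).pvalue < alpha
    w_hits += stats.ranksums(a, b).pvalue < alpha

print(f"t-test power:   {t_hits / trials:.3f}")
print(f"Wilcoxon power: {w_hits / trials:.3f}")
```

Under these (normal) conditions the t-test wins, but only barely; swap in heavy-tailed data and the ranking can flip.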

I bring this up because while statistical power and significance are very important metrics in the theory of picking good tests, they're actually pretty terrible ones in practice. Comparing MAB, which optimizes an entirely different loss, to t/z-tests on power is close to meaningless.

MAB can produce a cleaner workflow for many kinds of websites. Underperforming variants will be underrepresented and eventually pruned. The increased power of a batch test isn't necessarily so important in this context. I'm not even actually advocating MAB over other tests, just that you shouldn't spend too much time worrying about power comparisons unless you're genuinely comparing apples to apples.
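That pruning behavior is easy to see in a toy bandit. This is a minimal Thompson-sampling sketch, with made-up conversion rates and Beta(1, 1) priors; the point is only that the weaker arm's share of traffic shrinks on its own, without a batch test ever declaring a winner:

```python
# Minimal Thompson-sampling bandit over two page variants.
# true_rates are hypothetical conversion rates, not real data.
import random

true_rates = [0.05, 0.10]   # hypothetical: variant 1 genuinely converts better
wins = [1, 1]               # Beta prior successes (Beta(1, 1) = uniform)
losses = [1, 1]             # Beta prior failures
pulls = [0, 0]

random.seed(0)
for _ in range(5000):
    # Sample a plausible conversion rate for each arm, show the best-looking one.
    samples = [random.betavariate(wins[i], losses[i]) for i in range(2)]
    arm = samples.index(max(samples))
    pulls[arm] += 1
    if random.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

print(pulls)  # the underperforming variant ends up with a small share of traffic
```

Note what this optimizes: cumulative conversions during the experiment, not the power to certify a difference at the end, which is exactly why comparing it to a t-test on power is apples to oranges.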

Yes, comparing MAB with an A/B test is not an apples-to-apples comparison, and that is essentially the conclusion. These algorithms optimize for different things, and our customers should know that.
