I thought about multiple comparison corrections. Here is what my thoughts were: 1. ...

cschmidt · on July 30, 2024

If you work for a large website (as I used to), it probably runs hundreds of tests a week across various groups. So false positives are a real problem, and often you don't see the gain suggested by the A/B test when rolling it out.

I agree that Bonferroni is often too pessimistic. If you Bonferroni correct, you'll usually find nothing is significant. And I take your point that you could adjust the $\alpha$. But then, of course, you can make things significant or not as you like through that choice.
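To put rough numbers on this, here is a minimal sketch. It assumes `m` independent tests in which every null hypothesis is true; the values `m = 100` and `alpha = 0.05` are illustrative and not taken from any real experiment.

```python
def familywise_error_rate(m: int, alpha: float) -> float:
    """P(at least one false positive) across m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(m: int, alpha: float) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / m


m, alpha = 100, 0.05  # illustrative values only
uncorrected = familywise_error_rate(m, alpha)
per_test = bonferroni_threshold(m, alpha)
corrected = familywise_error_rate(m, per_test)

print(f"uncorrected FWER: {uncorrected:.3f}")
print(f"Bonferroni per-test threshold: {per_test:.4f}")
print(f"FWER after Bonferroni: {corrected:.3f}")
```

Without a correction, at least one false positive is almost guaranteed at this scale. With Bonferroni, each test has to clear a threshold 100 times stricter, which is why so little survives it.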

False Discovery Rate is less conservative, and I have used it successfully in the past.

People have strong incentives to find significant results that can be rolled out, so you don't want that person choosing $\alpha$. They will also be peeking at the results every day of a weekly test, and wanting to roll it out if it bumps into significance. I just mention this because the most useful A/B libraries are ones that are resistant to human nature. PMs will talk about things being "almost significant" at 0.2 everywhere I've worked.
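A small simulation can show how much daily peeking inflates false positives. This is a sketch under simplifying assumptions: there is no true effect (an A/A test), traffic is equal on each day, and the test is a two-sided z-test at $\alpha = 0.05$. Under those assumptions the z statistic after `k` days is a sum of `k` independent standard normals divided by `sqrt(k)`, so the statistic is simulated directly. The function name and parameters are hypothetical.

```python
import math
import random

Z_CRIT = 1.96  # two-sided critical value for alpha = 0.05


def simulate_false_positive_rates(n_sims: int, n_days: int, seed: int = 0):
    """Return (fixed_rate, peeking_rate) for A/A tests with no true effect."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        running = 0.0
        z = 0.0
        crossed = False
        for day in range(1, n_days + 1):
            running += rng.gauss(0.0, 1.0)   # one more day of data
            z = running / math.sqrt(day)     # z statistic after `day` days
            if abs(z) > Z_CRIT:
                crossed = True               # significant on at least one day
        fixed_hits += abs(z) > Z_CRIT        # look only once, at the end
        peeking_hits += crossed              # roll out on any significant day
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate_false_positive_rates(n_sims=20000, n_days=7)
print(f"fixed horizon: {fixed_rate:.3f}, daily peeking: {peeking_rate:.3f}")
```

The fixed-horizon rate should land near the nominal 0.05, while the rate with daily peeking should come out noticeably higher, even though nothing was changed between the two arms.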

e10v_me · on July 31, 2024

Thank you for the explanation and for drawing a vivid picture :) I will add FWER and FDR to the roadmap. Which specific controlling procedures do you find the most useful in practice?

I'm considering the following:
- FWER: Holm–Bonferroni, Hochberg's step-up.
- FDR: Benjamini–Hochberg, Benjamini–Yekutieli.
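For reference, one procedure from each family can be sketched in a few lines of plain Python. This is only an illustration of the textbook Holm–Bonferroni and Benjamini–Hochberg rules, not the API of any library; the function names and the p-values are made up.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down FWER control: return one reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank is 0-based
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p-value that fails its threshold
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR control: return one reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:  # reject the k_max smallest p-values
        reject[i] = True
    return reject


p_values = [0.001, 0.012, 0.014, 0.03, 0.2]  # made-up p-values
print(holm_bonferroni(p_values))     # [True, True, True, False, False]
print(benjamini_hochberg(p_values))  # [True, True, True, True, False]
```

On this toy input Benjamini–Hochberg rejects one more hypothesis than Holm–Bonferroni, which matches the usual expectation that FDR control is less conservative than FWER control.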

cschmidt · on July 31, 2024

Personally, I've used FDR, but FWER is meant to be good as well. I guess I don't have a preference.