In reality, however, tests like the Wilcoxon test are nearly as powerful (about 95% asymptotic efficiency relative to the t-test when the data really are normal) and more robust to model misspecification.
I bring this up because while statistical power and significance are very important metrics in the theory of picking good tests, they're actually pretty terrible ones in practice. Comparing MAB, which optimizes an entirely different loss function, to t/z-tests on power is sort of meaningless.
MAB can produce a cleaner workflow for many kinds of websites. Underperforming classes will be underrepresented and eventually pruned. The increased power of a batch test isn't necessarily so important in this context. I'm not even actually advocating MAB over other tests, just that you shouldn't spend too much time worrying about power comparisons unless you're genuinely comparing apples to apples.
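
To make the "underperforming classes will be underrepresented" point concrete, here's a rough sketch of one common bandit strategy, Thompson sampling with Beta posteriors. This is just an illustration, not a claim about how any particular MAB tool works: the function name `thompson_sampling`, the conversion rates, and the round count are all made up for the example.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Bernoulli Thompson sampling; return how often each arm was shown.

    true_rates are hypothetical conversion rates, unknown to the algorithm.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms

    for _ in range(rounds):
        # Draw one plausible conversion rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then show the arm with the highest draw.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        arm = max(range(n_arms), key=samples.__getitem__)
        pulls[arm] += 1

        # Simulate whether this visitor converted.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return pulls


# Assumed rates for three page variants; the last one is the best.
pulls = thompson_sampling([0.02, 0.05, 0.15], rounds=10000)
print(pulls)
```

The thing to look at is the pull counts: the weak variants get shown less and less as evidence accumulates, which is the workflow benefit. Nothing here computes a p-value or a power figure, because the bandit is minimizing lost conversions during the experiment rather than controlling error rates.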