Hacker News
Can you look at experimental results along the way or not? (johndcook.com)
35 points by tie-in 3 months ago | 13 comments



I'm actually a big fan of certain alpha spending approaches. In one of them (I forget the name), you basically peek along the way, only looking for massive signs, and then at the end you get to do a nearly ordinary analysis. You have to test at p=0.045 instead of p=0.05 at the end, but that's pretty negligible, so you don't have to treat the data too carefully in general. You just get to promise nagging partners that yes, if it's wildly good (or bad), we've accounted for that and will stop early.
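
Roughly, the decision rule looks like the sketch below (Python; the 0.001 interim threshold and the 0.045 final threshold are illustrative numbers of the kind such a rule uses, not an exact prescription):

    # Sketch of a "peek only for overwhelming evidence" rule.
    # interim_pvalues: p-values computed at each planned early look
    # final_pvalue:    p-value from the full-sample analysis at the end
    def decide(interim_pvalues, final_pvalue,
               interim_alpha=0.001, final_alpha=0.045):
        for i, p in enumerate(interim_pvalues):
            if p < interim_alpha:            # wildly good (or bad): stop early
                return f"stopped early at look {i + 1}"
        # otherwise, the nearly-ordinary final analysis, just against
        # a slightly tightened threshold
        return "significant" if final_pvalue < final_alpha else "not significant"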


I'm interested in how this might work out (or not!) in practice. Turning p=0.05 into p=0.045 is a mortal sin in some circles (see also: p-hacking). I can't see how you can look at your data for only "massive signs" without also picking up the smaller signs. Have a computer do it?


The point is that a larger sample size increases precision: the confidence with which you can reject a hypothesis depends on that and on the size of the effect. If I tell you I've sampled a process and got the values -2, 0, and 15, you can reject the hypothesis that the process is Gaussian with mean 10000 and standard deviation 0.1 with extreme confidence. Sampling further will almost certainly not change that (in fact, you would need an impractical number of samples to overturn it in this extreme case).
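
To put numbers on that (my framing, as a z-test on the sample mean, using numpy/scipy):

    import numpy as np
    from scipy.stats import norm

    x = np.array([-2.0, 0.0, 15.0])
    mu0, sigma0 = 10000.0, 0.1                    # hypothesized process
    z = (x.mean() - mu0) / (sigma0 / np.sqrt(len(x)))
    print(z)                   # roughly -1.7e5 standard errors from the mean
    print(norm.logsf(abs(z)))  # log tail probability, on the order of -1e10
    # The p-value underflows to zero; no practical amount of further
    # sampling is going to walk this back.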

The reason you need to tighten your bounds for an early stop is that you are effectively running multiple experiments: it's possible that, purely by chance, you stop at a point where random variation makes the data seem to show an effect. But because each look reuses all the samples so far rather than drawing a fresh set, the looks are highly correlated and there are far fewer effectively independent outcomes, so the tightening you need is not that large. (p-hacking is simply failing to apply the appropriate correction for this.)
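
A small simulation makes both halves of that concrete: peeking at accumulating data with an unadjusted threshold inflates the false-positive rate, but by much less than the same number of independent experiments would, because the looks share samples. Everything below (look schedule, test) is just an illustration:

    # Monte Carlo under a true null: declare "significant" if ANY of
    # several looks at the accumulating data crosses p < 0.05.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n_sims = 20_000
    looks = [20, 40, 60, 80, 100]                 # sample sizes at each peek
    false_pos = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, size=looks[-1])  # null is true: mean 0, sd 1
        for n in looks:
            z = x[:n].mean() * np.sqrt(n)         # z-test against mean 0, known sd 1
            if 2 * norm.sf(abs(z)) < 0.05:
                false_pos += 1
                break
    # Noticeably above the nominal 0.05, but well below the ~0.23 that
    # five independent 0.05-level tests would give.
    print(false_pos / n_sims)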


I was speaking very roughly. Basically, you can peek early just a couple of times and stop early if you see p < 0.001. You probably won’t, which “leaves you” 0.045 or so to “spend” at the end. Then you have to pretend that when you see p = 0.045 you actually saw p = 0.05, and that when you see p = 0.04999 it actually said 0.053 or something (not significant at the 0.05 level even though 0.04999 < 0.05!), so it’s more restrictive at the end… but only by a little!
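
A quick way to convince yourself (or the nagging partner) that this still controls the overall error rate is to simulate it under the null; the 0.001/0.045 thresholds are the ones above, and the look schedule is made up:

    # Under a true null: stop early if any interim look has p < 0.001,
    # otherwise test at p < 0.045 on the full sample.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n_sims = 50_000
    looks, n_final = [100, 200], 400              # peek twice, then finish
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, size=n_final)    # null: mean 0, sd 1
        interim = [2 * norm.sf(abs(x[:n].mean() * np.sqrt(n))) for n in looks]
        final_p = 2 * norm.sf(abs(x.mean() * np.sqrt(n_final)))
        if any(p < 0.001 for p in interim) or final_p < 0.045:
            rejections += 1
    # Should come out a bit under the nominal 0.05.
    print(rejections / n_sims)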

The adjustment at the end is small, so it’s not a big complication, and people don’t have to radically change their interpretations the way they would with some other alpha spending approaches. And there’s the big plus that, since the end interpretation only changes by a little bit, any secondary analyses don’t get crazy complicated. They probably carry just a little asterisk, rather than the big “this is simple but super untrustworthy” asterisk that secondary analyses get after other alpha spending approaches, or requiring something very expert.

If you want to read more, look up alpha spending functions in early stopping. I think the one I’m describing might be called Peto? Sorry, I’m being lazy.


The context is A/B testing, so all the data will be on a computer already.

It makes sense to abort early when the results are clear. This happens all the time in clinical trials.


Just because it happens all the time doesn't mean it's statistically correct. And, I don't see what the data being on a computer has to do with anything; that also "happens all the time," even in clinical trials.


It's statistically correct if your statistics take into account the fact that you might stop early. That's the point. It is more efficient to stop early, but you can't stop early while using statistics that assume you won't.


As I said, just because it happens all the time does not mean it's statistically correct. I've dug into enough statistical analyses to know that researchers whose field is not statistics simply don't understand statistics well enough to know when they're screwing up. Most of them don't even know to correct for multiple tests, and you're assuming they'll use statistics that account for the possibility of stopping early? Come on. You're imputing a lot more competence to the general case than I believe is actually present, which is my point.


> I’d recommend taking either a Frequentist approach or a Bayesian approach, but not complicating things by hybrid approaches such as alpha spending or designing Bayesian experiments to have desired Frequentist operating characteristics. The middle ground is more complicated and prone to subtle mistakes, though we can help you navigate this middle ground if you need to.

Not a compelling conclusion to what’s probably a reference article meant to promote a consultancy.


<strike>This article confuses more than it informs.</strike>

The question of whether one can look at experimental data along the way in A/B tests (or, by analogy, clinical trials) can be answered with yes, even though the article doesn't make this very clear. Instead, it gets lost in superficial Frequentist/Bayesian (keyword) name-dropping.

The concept of adjusting for early checking is interesting, but this article is less useful than just reading the original alpha spending paper: https://pubmed.ncbi.nlm.nih.gov/7973215/

For a Bayesian approach: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.47800404...


My takeaway is to avoid mixing the frequentist and Bayesian approaches. Choose one method: either follow the frequentist approach and avoid early data analysis, or use the Bayesian approach to compute posterior probabilities once data are available. Mixing the two without expertise can lead to errors.


Mixing seems like a straw man; why would one?

I don't see what's wrong with a Frequentist approach with alpha spending. The downside is that one needs to understand alpha spending, but doing Bayesian without understanding it can be just as bad.


Unfortunately, your comment does neither, which is the far worse offense.

See also: HN guidelines regarding shallow dismissals.




