If you're serious about this, let me know. I've been thinking about doing the same for some time now. How would you solve for accounts with 2-factor auth enabled?
A much simpler approach is to AABB test instead of AB test. Rather than splitting your users into 2 buckets (A and B), split them into 4 buckets (A1, A2, B1, B2). Give groups A1 and A2 one variation and groups B1 and B2 the other variation. When A1 equals A2 and B1 equals B2, you know you have statistical significance and you can compare A1+A2 to B1+B2.
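The four-bucket split itself is straightforward to do deterministically. Here's a minimal sketch that hashes user IDs into the buckets described above; the `bucket` function and the user-ID strings are hypothetical names for illustration, not from any particular tool:

```python
import hashlib

def bucket(user_id: str) -> str:
    """Deterministically assign a user to one of four equal buckets."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return ("A1", "A2", "B1", "B2")[h % 4]

# A1 and A2 see one variation; B1 and B2 see the other,
# so the two A buckets (and the two B buckets) should behave identically.
variation = {"A1": "control", "A2": "control",
             "B1": "treatment", "B2": "treatment"}

b = bucket("user-42")
print(b, variation[b])
```

Hashing (rather than random assignment at request time) means a returning user always lands in the same bucket, which matters if the test runs across sessions.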
That's a good strategy, but it solves a different problem. You still have to collect N samples. The proposed strategy is useful in situations where each additional sample has a high cost and the ability to terminate early, when possible, is highly desirable.
This is great advice. One of the best things about AABB testing is that when your two A groups or two B groups don't converge, you can identify bugs in your testing procedure or measure the margin of error (since those groups are seeing the same thing and should be performing identically). Seeing two identical A groups with wildly different results will make you more skeptical of generic A/B results and more rigorous about your testing.
> since you know those groups are seeing the same thing and should be performing identically
That's not how A/B testing works. 95% confidence means you should expect a 5% false positive rate, i.e., you should expect the difference measured in an A/A test to be statistically significant 5% of the time. You'll always measure some difference, since no two random samples will be 100% identical in every regard.
The procedure you and the parent propose is tantamount to selecting 1 out of every 20 test results and discounting it for no real reason. It adds extra cost to your A/B testing without producing more reliable results.
It's a different matter if you're running multiple A/A-type tests over an extended period of time to ensure that the false positive rate is actually 5%, a kind of meta-statistical test. As a sanity check this is sound, but vastly more expensive than what the OP is proposing (for example). I've never seen anyone use A/A, A/A/B, A/A/B/B, etc. tests in this way. Rather, I've only ever seen them used as you and the OP suggest: the two A buckets should be "the same" and if they aren't, the results should be thrown out.
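The "you should expect a statistically significant difference in an A/A test about 5% of the time" point is easy to demonstrate by simulation. Here's a rough sketch using a standard two-proportion z-test on two buckets that see the identical variation; the function names and the 5% conversion rate are made up for the example:

```python
import math
import random

def two_proportion_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # P(|Z| > |z|) for a standard normal, via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
TRUE_RATE = 0.05   # both buckets convert at the same underlying rate
N = 5000           # users per bucket
TRIALS = 2000      # number of simulated A/A tests

false_positives = 0
for _ in range(TRIALS):
    a1 = sum(random.random() < TRUE_RATE for _ in range(N))
    a2 = sum(random.random() < TRUE_RATE for _ in range(N))
    if two_proportion_p(a1, N, a2, N) < 0.05:
        false_positives += 1

# the observed rate hovers near the nominal 5%
print(false_positives / TRIALS)
```

In other words, even with perfect instrumentation, about 1 in 20 A/A comparisons will look "significant" at the 95% level, which is exactly why throwing out A/B results whenever the A buckets differ is just discarding data at random.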
You would need to propose a method to compare the confidence intervals of A1, A2 and B1, B2 without hurting our target coverage probability on the outer confidence interval. This is starting to sound really complicated.
I prefer AABB testing because it doesn't take me much time to figure out if the test needs to run longer and you get used to the patterns your conversion rates follow pretty quickly in tools like VWO. I've talked to some large CRO shops who do the same thing, as they are working with several accounts at the same time and need to be able to tell quickly how things are going.
Then again, CRO is only a piece of my job and I have lots of other stuff to look at every day, so anything I don't have to expend too much mental energy on, but can still trust, is a nice convenience. If I were optimizing full-time, I'd be a lot smarter and more clever about how I went about it.