OTOH, it depresses me - we try to do a fair bit of analysis on surfing patterns, A/B testing features at http://SmartFlix.com, looking at average basket size, etc.
...and now I've got to worry that I've got "confounding variables". Maybe feature X decreases check out rates ... but actually increases check out rates both among male and female customers....
ARGH. My head hurts.
Where this stuff really gets you is in retrospective data analysis, where with the right choice of would-be confounding variables you can pretty much argue both directions on any question.
I certainly wouldn't be making rash decisions on data without a good effect metric.
Simpson's paradox can't occur if the two treatment groups are the same size because the distribution of any external variable is constant across the two treatments (within some known, maximal error).
For example, if women are 10% of your population then there will be ~10% women in all your treatments.
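To make that concrete, here is a rough sketch of why an identical mix rules out a reversal: the aggregate rate is just a mix-weighted average of the per-group rates, so if both treatments use the same weights, the one that wins in every group also wins overall. All the rates and the skewed mixes below are made-up numbers for illustration, not data from any real test.

```python
def aggregate_rate(mix, rates):
    """Overall conversion rate: per-group rates weighted by group share."""
    return sum(mix[group] * rates[group] for group in mix)

# Hypothetical per-group checkout rates; A beats B among women AND among men.
rates_a = {"women": 0.30, "men": 0.10}
rates_b = {"women": 0.25, "men": 0.05}

# Same mix in both treatments (10% women, as in the example above).
same_mix = {"women": 0.10, "men": 0.90}

# Hypothetical skewed assignment: B happens to receive mostly women.
skewed_a = {"women": 0.10, "men": 0.90}
skewed_b = {"women": 0.90, "men": 0.10}

balanced = (aggregate_rate(same_mix, rates_a), aggregate_rate(same_mix, rates_b))
skewed = (aggregate_rate(skewed_a, rates_a), aggregate_rate(skewed_b, rates_b))

print("same mix:   A=%.3f B=%.3f" % balanced)  # A still ahead overall
print("skewed mix: A=%.3f B=%.3f" % skewed)    # B appears ahead overall
```

With the same mix, A stays ahead in aggregate; only when the mixes differ can B look better overall despite losing in both groups.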
For example, if you're comparing users from Slashdot to users from Hacker News, you might get the following data set for, say, the average click-through rate over 1 week:
Slashdot: 500 hits, 50% click-through rate (250 click-throughs)
Hacker News: 10 hits, 90% click-through rate (9 click-throughs)
Based on this data, you might decide to tweak the page to be more Hacker-News friendly and less Slashdot-friendly. But, hey, maybe you want to run the experiment for one more week, just to be sure.
Slashdot: 10 hits, 10% click-through rate (1 click-through)
Hacker News: 500 hits, 20% click-through rate (100 click-throughs)
Wow, we have confirmation! Hacker News is clearly a better source of traffic than Slashdot. So you forget about Slashdot, since HN is consistently giving you a better click-through rate than Slashdot.
Except, of course, when you look at the combined numbers...
Weeks 1 and 2 combined:
Slashdot: 49.2% (251 click-throughs for 510 hits)
Hacker News: 21.4% (109 click-throughs for 510 hits)
So here, the paradox applies in reverse - looking at the weekly data gives you a false impression about how to tweak things to achieve your overall goals.
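If you want to check the arithmetic yourself, here is a quick sketch that recomputes the weekly and combined rates from the hit counts above. It takes week 2 Slashdot at its stated 10% rate on 10 hits, i.e. 1 click-through; the function names are just mine for illustration.

```python
# (hits, click-throughs) per source per week, from the figures above.
# Assumption: week 2 Slashdot is 10% of 10 hits, i.e. 1 click-through.
weeks = {
    "week 1": {"Slashdot": (500, 250), "Hacker News": (10, 9)},
    "week 2": {"Slashdot": (10, 1), "Hacker News": (500, 100)},
}
sources = ["Slashdot", "Hacker News"]

def rate(hits, clicks):
    """Click-through rate as a fraction between 0 and 1."""
    return clicks / hits

def combined(source):
    """Total (hits, click-throughs) for a source across all weeks."""
    hits = sum(weeks[w][source][0] for w in weeks)
    clicks = sum(weeks[w][source][1] for w in weeks)
    return hits, clicks

def weekly_winner(week):
    """Source with the higher click-through rate in a single week."""
    return max(sources, key=lambda s: rate(*weeks[week][s]))

def overall_winner():
    """Source with the higher click-through rate over all weeks pooled."""
    return max(sources, key=lambda s: rate(*combined(s)))

for week in weeks:
    print(week, "winner:", weekly_winner(week))
for source in sources:
    hits, clicks = combined(source)
    print(f"{source}: {clicks}/{hits} = {rate(hits, clicks):.1%}")
print("combined winner:", overall_winner())
```

Hacker News wins each week on its own, but Slashdot wins once the weeks are pooled, because each source's pooled rate is dominated by the week in which it sent most of its traffic.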
Now, you might say, this is not A/B testing, just monitoring click-through rates of various sources. However, if you simply double this test (so that you're testing two versions of a page), you might once again reach erroneous conclusions about how each version works for each referrer, because of the Simpson effect.
So while this is not immediately relevant to the most basic A/B testing, it can easily become relevant to more complex examples. And, of course, if you do your A/B testing without splitting the incoming users into segments at all, you might miss out on some other subtleties about your users, so there's a strong temptation to split.
Everything you said is right, though, and those are exactly the types of situations where Simpson's paradox can occur.
There was a TV program about a shoe manufacturer in the UK a few years back that started making larger women's boots. They realised later that the demand was driven by transvestite males and that they could make more money in that market, so they transitioned from being a women's shoe maker to catering for leather-fetishists and transvestite males. They simply followed the profit.
In line with your comment, not all companies would be happy to make such a profit-driven move; but I suspect few shareholders would prefer that they didn't.
We see this in so many areas: the "China's GDP is growing 12% every year and the US has grown just 2%" type of arguments.
I haven't been able to improve on my yearly salary raise of 200% in the last 15 years, when I went from $5/hr to $15/hr.