152 points by waldrews on Aug 28, 2009 | 17 comments

 This should be part of a required reading regimen for anyone about to post yet another 'Bullshit Study Reveals Whimsical Quirk' article.
 More importantly, it ought to have been part of the required reading regimen for the authors of said study.
 po on Aug 29, 2009: Even more importantly, it ought to have been part of the required reading regimen for every single high school student.
 That would assume the authors haven't already been instructed to prove a particular point, which they often have.
 This also comes up in immigration studies. Say you have two countries A and B. A has an average income of 5, and B has an average income of 20. If a resident of A earning 10 moves to B where he earns 15, the average income of both countries goes down even though total income increases.
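A quick way to check the arithmetic in the immigration example is to write it out. This is only a sketch: the individual incomes below are hypothetical, picked so that A averages 5 and B averages 20 as in the comment, and the population sizes are an assumption.

```python
# Hypothetical populations (not from the comment) chosen so that
# country A averages 5 and country B averages 20.
def mean(xs):
    return sum(xs) / len(xs)

a_before = [10, 4, 4, 2]    # mean 5; includes the mover, who earns 10
b_before = [20, 20, 20]     # mean 20

a_after = [4, 4, 2]         # the mover has left A
b_after = [20, 20, 20, 15]  # the mover now earns 15 in B

print("A average:", mean(a_before), "->", mean(a_after))
print("B average:", mean(b_before), "->", mean(b_after))
print("Total income:", sum(a_before) + sum(b_before),
      "->", sum(a_after) + sum(b_after))
```

Both averages fall because the mover was above A's average and ends up below B's average, while the total rises by the mover's gain of 5.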
 This is fascinating. OTOH, it depresses me - we try to do a fair bit of analysis on surfing patterns, A/B testing features at http://SmartFlix.com, looking at average basket size, etc. ... and now I've got to worry that I've got "confounding variables". Maybe feature X decreases check out rates ... but actually increases check out rates both among male and female customers ... ARGH. My head hurts.
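For anyone who wants to see how the "feature X" worry can actually happen, here is a minimal sketch with made-up numbers. The visitor and checkout counts are entirely hypothetical; the only point is that the two variants get very different mixes of men and women, which is what an unrandomized comparison can produce.

```python
# Hypothetical (checkouts, visitors) counts, invented for illustration.
# "with_x" happens to receive mostly the segment with the lower base rate.
data = {
    "without_x": {"men": (320, 400), "women": (30, 100)},
    "with_x":    {"men": (85, 100),  "women": (140, 400)},
}

def rate(checkouts, visitors):
    return checkouts / visitors

def overall(variant):
    checkouts = sum(c for c, _ in data[variant].values())
    visitors = sum(v for _, v in data[variant].values())
    return rate(checkouts, visitors)

for segment in ("men", "women"):
    print(segment,
          "without X:", rate(*data["without_x"][segment]),
          "with X:", rate(*data["with_x"][segment]))

print("overall without X:", overall("without_x"))
print("overall with X:", overall("with_x"))
```

With these counts, feature X wins within each segment (85% vs 80% for men, 35% vs 30% for women) but loses overall (45% vs 70%), purely because of who ended up in which group.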
 The good news is that when you do A/B testing, you can create a properly randomized experiment, where variations of e.g. the number of men/women assigned to the A and B group are accounted for as part of the sampling error. Where this stuff really gets you is in retrospective data analysis, where with the right choice of would-be confounding variables you can pretty much argue both directions on any question.
 The problem that I have seen in these frameworks that handle A/B testing is that they ignore the basic laws of statistics. All groups are different given large enough n. That's what p is about: do you have enough data to tell the means apart? The closer the means, the more data you need. What you really care about is how large the difference is. That's called effect size, which is really just how far apart the means are, divided by a combined standard deviation for both groups. I certainly wouldn't be making rash decisions on data without a good effect metric.
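One common way to compute the effect size described above is Cohen's d with a pooled standard deviation. The comment does not name a specific formula, so treat this as one reasonable reading rather than the commenter's exact method; the function name and the sample basket sizes are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Difference in means divided by the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    # Pooled variance: each group's sample variance weighted by its
    # degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical basket sizes for two variants (made up for illustration).
variant_a = [2, 4, 6]
variant_b = [1, 3, 5]
print(cohens_d(variant_a, variant_b))
```

Unlike a p-value, this number does not shrink toward "significant" just because n grows; it only reports how far apart the groups are relative to their spread.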
 tel on Aug 28, 2009: And that's the crux of scientific modeling! There's a pretty famous quote to the effect that scientists don't stay up at night wondering whether they're testing the right thing, but whether they chose the right controls.
 It doesn't apply in the A/B testing case because you're randomly assigning a person to one group or another. Simpson's paradox can't occur if the two treatment groups are the same size because the distribution of any external variable is constant across the two treatments (within some known, maximal error). For example, if women are 10% of your population then there will be ~10% women in all your treatments.
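The claim about random assignment is easy to simulate. The sketch below is an illustration under assumptions: the function name, the 100,000 simulated users, and the fixed seed are all hypothetical, and the 10% share of women is taken from the example in the comment.

```python
import random

def assign(n_users, share_women=0.10, seed=0):
    """Randomly assign simulated users to A or B, recording who is a woman."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    groups = {"A": [], "B": []}
    for _ in range(n_users):
        is_woman = rng.random() < share_women
        # Assignment ignores is_woman entirely, which is the whole point.
        groups[rng.choice("AB")].append(is_woman)
    return groups

groups = assign(100_000)
shares = {name: sum(members) / len(members) for name, members in groups.items()}
print(shares)
```

Because assignment is independent of sex, both groups should show close to 10% women, with the gap between them shrinking as the number of users grows.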
 On the contrary, it does apply if you're doing any kind of splitting of your users along various axes.
 For example, if you're comparing users from Slashdot to users from Hacker News, you might get the following data set for, say, the average click-through rate over 1 week:
 Week 1:
   Slashdot: 500 hits, 50% click-through rate (250 click-throughs)
   Hacker News: 10 hits, 90% click-through rate (9 click-throughs)
 Based on this data, you might decide to tweak the page to be more Hacker-News friendly and less Slashdot-friendly. But, hey, maybe you want to run the experiment for one more week, just to be sure.
 Week 2:
   Slashdot: 10 hits, 10% click-through rate (1 click-through)
   Hacker News: 500 hits, 20% click-through rate (100 click-throughs)
 Wow, we have confirmation! Hacker News is clearly a better source of traffic than Slashdot. So you forget about Slashdot, since HN is consistently giving you a better click-through rate than Slashdot.
 Except, of course, when you look at the combined numbers...
 Weeks 1 and 2 combined:
   Slashdot: 49.2% (251 click-throughs for 510 hits)
   Hacker News: 21.4% (109 click-throughs for 510 hits)
 So here, the paradox applies in reverse - looking at the weekly data gives you a false impression about how to tweak things to achieve your overall goals.
 Now, you might say, this is not A/B testing, just monitoring click-through rates of various sources. However, if you simply double this test (so that you're testing two versions of a page), you might once again reach erroneous conclusions about how each version works for each referrer, because of the Simpson effect.
 So while this is not immediately relevant to the most basic A/B testing, it can easily become relevant to more complex examples. And, of course, if you're doing your A/B testing without doing any splitting of the incoming users, you might miss out on some other subtleties about your users, so there's a strong temptation to do so.
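The Slashdot / Hacker News numbers can be checked in a few lines. One assumption to flag: the week 2 Slashdot row is taken here as 10% of 10 hits, i.e. 1 click-through, since that is what the stated rate implies. The dictionary layout and function name are hypothetical.

```python
# (click-throughs, hits) per source per week, following the example.
# Assumption: week 2 Slashdot is 1 click-through (10% of 10 hits).
weeks = {
    "week 1": {"Slashdot": (250, 500), "Hacker News": (9, 10)},
    "week 2": {"Slashdot": (1, 10),    "Hacker News": (100, 500)},
}

def combined_rate(source):
    clicks = sum(week[source][0] for week in weeks.values())
    hits = sum(week[source][1] for week in weeks.values())
    return clicks / hits

for name, week in weeks.items():
    for source, (clicks, hits) in week.items():
        print(f"{name} {source}: {clicks / hits:.1%}")

for source in ("Slashdot", "Hacker News"):
    print(f"combined {source}: {combined_rate(source):.1%}")
```

Hacker News has the higher rate in each week taken separately, yet Slashdot has the higher rate once the weeks are pooled, because each source's big-traffic week dominates its combined figure.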
 You're right, that isn't A/B testing. :) Everything you said is right, though, and those are exactly the types of situations where Simpson's paradox can occur.
 You don't need to worry: does change X make you more profit? Forget the details as to whether it reduces your profit from men and increases your profit from women.
 Of course, if you're a women's clothing store you might wind up iterating yourself into a men's clothing store if you follow that advice too strictly. :)
 Which is only a problem if you want to have a women's clothing store in preference to making a profit. There was a TV program about a shoe manufacturer in the UK a few years back that started making larger women's boots. They realised later the demand was driven by transvestite males and that they could make more money in that market, and so transitioned from being a women's shoe maker to catering for leather-fetishists and transvestite males. They simply followed the profit. In line with your comment, not all companies would be happy to make such a profit-driven move; but I suspect few shareholders would object.
 In the example of editing wikipedia it mentions: "This imagined paradox is caused when the percentage is provided but not the ratio." We see this in so many areas, e.g. arguments of the type "China's GDP is growing 12% every year and the US has grown just 2%". I haven't been able to improve on my yearly salary raise of 200% in the last 15 years, when I went from $5/hr to $15/hr.
 Just comparing percentages off different bases isn't quite what the paradox is getting at, but it's also a fine approach to representing data in a misleading way. There's a great collection of such less technical tricks in the book _How to Lie with Statistics_ by Darrell Huff.
