Stein's paradox in Statistics (1977) [pdf] 69 points by xtacy on March 5, 2016 | 13 comments

 I looked into Stein's paradox more deeply last year, as I could not wrap my head around it. The baseball example is easy. It is quite intuitive that some of those who scored well probably did so by luck and are actually somewhat less skilled (and vice versa for the lowest scorers). However, Stein's paradox is deeper than that. According to Wikipedia, it apparently also applies to unrelated variables, for example the population of Ulan Bator, the temperature on Mars, and the yearly chocolate consumption of Switzerland. This runs counter to all my intuition, and I found the following weakness in the theory behind Stein's paradox: the improved estimators all seem to depend on things like the mean or the variance among the included variables. For example, if you estimate Ulan Bator to have 1 million inhabitants and the temperature on Mars to be 200 Kelvin, you would adjust the former estimate a little downwards and the latter estimate a little upwards (towards the common mean). However, this implicitly assumes that the population and the temperature have been drawn from a distribution whose mean exists. My guess is that this is not the case. Obviously, you can always calculate a sample mean and a sample variance, but they might be meaningless if the sample stems from a distribution such as the Cauchy.
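 For anyone who wants to poke at this numerically, here is a minimal sketch, not a definitive implementation. It assumes each estimate is the true value plus independent normal noise with known unit variance, and it uses the variant of the James-Stein estimator that shrinks toward the sample mean (factor p - 3, so it needs at least 4 coordinates). The true values theta are held fixed and are never drawn from any distribution; the function names and the example theta are made up for illustration. It does not address heavy-tailed noise such as Cauchy, which falls outside the normal-noise assumption.

```python
import numpy as np

def james_stein_toward_mean(x, sigma2=1.0):
    """Shrink each row of x toward that row's own mean.

    Assumes independent normal noise with known variance sigma2
    on every coordinate. Needs at least 4 coordinates per row.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    p = x.shape[1]
    if p < 4:
        raise ValueError("shrinking toward the sample mean needs at least 4 coordinates")
    xbar = x.mean(axis=1, keepdims=True)
    spread = np.sum((x - xbar) ** 2, axis=1, keepdims=True)
    factor = 1.0 - (p - 3) * sigma2 / spread
    return xbar + factor * (x - xbar)

def average_losses(theta, n_trials=20000, seed=0):
    """Average total squared error of the raw and the shrunken estimates.

    theta is a fixed vector of true values (hypothetical, unrelated quantities).
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    # Each row is one simulated set of noisy measurements of the same theta.
    x = theta + rng.standard_normal((n_trials, theta.size))
    raw_loss = np.sum((x - theta) ** 2, axis=1).mean()
    js_loss = np.sum((james_stein_toward_mean(x) - theta) ** 2, axis=1).mean()
    return raw_loss, js_loss
```

 Calling average_losses([1.0, -2.0, 0.5, 3.0, -1.5, 0.0]) returns the two average losses, so the raw and shrunken estimates can be compared directly. The comparison is on the total squared error summed over all coordinates; nothing here says that any single coordinate is estimated better.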
 The "correct" way to understand this phenomenon is via regression, in which case it boils down to the simple fact that the regression lines E(θ|X) and E(X|θ) are different. (This "Galtonian perspective" is the subject of one of my favorite papers of all time, http://projecteuclid.org/euclid.ss/euclid.ss/1177012274). This is also the only intuitive explanation I am aware of which explains why Stein's phenomenon only occurs in dimensions three and higher.
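 To make the two regression lines concrete, here is a worked special case rather than the paper's general argument. It assumes a normal model that the comment does not spell out: theta has mean 0 and variance tau^2, and X is theta plus independent unit-variance normal noise. The symbol tau is introduced only for this illustration.

```latex
\theta \sim N(0, \tau^2), \qquad X \mid \theta \sim N(\theta, 1)

E(X \mid \theta) = \theta
\qquad \text{but} \qquad
E(\theta \mid X) = \frac{\tau^2}{1 + \tau^2}\, X
```

 The first line has slope 1 and the second has slope strictly less than 1, so the best guess of theta given X is X pulled toward the origin. This sketch only shows that the two lines differ; it does not reproduce the argument for why three or more dimensions are needed.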
 Thanks for the elaboration. One thing I don't understand is how quantities with different units can be compared. For instance, you give the example of 1,000,000 people living in Ulan Bator and 200 Kelvin being the temperature on Mars. If you follow this procedure and nudge them toward one another (bringing 1,000,000 down and 200 up), then you're supposed to end up with a better predictor. But what if our units had been millions of people and milliKelvins? Then our quantities would have been 1 and 200,000, respectively. The procedure would have us nudge our estimates in the opposite directions. And that surely wouldn't also improve our estimates, right? Clearly I'm misunderstanding something, so I'm going to read some of these papers. Edit: It seems from the Galtonian perspective paper that all the distributions are assumed to have a constant standard deviation. So perhaps we shouldn't be measuring these quantities in terms of people or milliKelvins, but rather in terms of standard deviations? E.g., the mean is +4 standard deviations above 0?
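 The idea in the edit above can be sketched as follows, assuming each estimate comes with a known standard error. Divide every estimate by its standard error, apply the plain James-Stein shrinkage toward zero in those standardized units, then convert back. The function name and the numbers in the usage note are hypothetical. Any improvement is then in total squared error measured in standard-error units, and the choice of origin still matters even though the choice of units does not.

```python
import numpy as np

def shrink_standardized(estimates, std_errors):
    """James-Stein shrinkage toward zero, applied in standard-error units.

    Assumes independent normal errors with known standard errors.
    Needs at least 3 quantities.
    """
    x = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    p = x.size
    if p < 3:
        raise ValueError("James-Stein shrinkage needs at least 3 quantities")
    z = x / se                               # unit-free: each z has unit variance
    factor = 1.0 - (p - 2) / np.sum(z ** 2)  # same factor for every coordinate
    return factor * z * se                   # back to the original units
```

 Rescaling a quantity rescales its standard error by the same amount, so z and the shrinkage factor are unchanged. For example, shrink_standardized([1e6, 200, 10], [5e4, 20, 2]) and shrink_standardized([1.0, 200000, 10], [0.05, 20000, 2]) describe the same made-up measurements in people and Kelvin versus millions of people and milliKelvins, and their results differ only by those unit conversions.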
 The "paradox" is that if high batting averages are predicted to fall and low averages are predicted to rise, then these predictions work better than just guessing they will stay the same. This doesn't seem like a paradox to me. Rather, it seems kind of obvious. If the statistics are anything like a random walk, then random walk theory (usually revisiting the starting point) would predict this.
 I think the article by Richard Samworth lays out the paradox better. The whole article is worth a read, but here's the paradox part: "To give an unusual example to emphasise the point, suppose that we were interested in estimating the proportion of the US electorate who will vote for Barack Obama, the proportion of babies born in China that are girls and the proportion of Britons with light-coloured eyes. Then our James–Stein estimate of the proportion of democratic voters depends on our hospital and eye colour data!" Surely that's paradoxical! The OP's post is an outstanding exposition of James-Stein estimators though, so thanks for the post. There seem to be lots of connections between these and doing linear regressions with regularisation in machine learning.
 Yep, there's a link with regularization and also with informative priors – James-Stein works so well because across an incredibly wide range of scenarios, a parameter estimate of infinity is not nearly as likely as a parameter estimate of 0, yet that's what ordinary least squares linear regression assumes.
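 The regularization link can be sketched with ridge regression, a swapped-in illustration rather than James-Stein itself. Ordinary least squares corresponds to a flat prior on the coefficients, while ridge corresponds to a zero-centered Gaussian prior and pulls the fitted coefficients toward 0. The function name and the penalty parameter lam are assumptions made for this example.

```python
import numpy as np

def ols_and_ridge(X, y, lam=1.0):
    """Closed-form OLS and ridge coefficients (no intercept term).

    Assumes X has full column rank so the OLS solution exists.
    lam > 0 is the ridge penalty; lam = 0 would reproduce OLS.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    gram = X.T @ X
    ols = np.linalg.solve(gram, X.T @ y)
    # Adding lam to the diagonal is what shrinks the solution toward zero.
    ridge = np.linalg.solve(gram + lam * np.eye(X.shape[1]), X.T @ y)
    return ols, ridge
```

 For any lam > 0 the ridge coefficient vector has a smaller Euclidean norm than the OLS one (unless OLS is exactly zero), which is the same "pull toward 0" that the comment describes.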
 Yeah, I think you're right. Everyone expects a reversion to the mean -- even the batters themselves. It becomes a self-fulfilling prophecy. But maybe it's more than that. The willpower to overcome our intrinsic limits eventually runs out. We can only focus on the ball so long before our thoughts wander. And at some point below the average of our capabilities, we get tired of taking a break and start to perform again as well as we know we can.
 In the case of the baseball example, at least, isn't the increased accuracy of the Stein estimator a result of incorporating a good Bayesian prior into the "observed average" result of the individual players -- that prior being the batting average of a "typical" player (i.e., the average of the averages)?
 That doesn't explain adding in a spurious statistic (in the case of the OP, % imported cars in Chicago)
 It does if that statistic offers a better prior as well. stdbrouw's answer explains why that will often be the case.
 I have seen several mentions of the James-Stein estimator being almost an empirical Bayesian estimator, but this article made it clearer. Thanks for sharing.
 Regression to the mean
