

Ask HN: Regression to the mean math question - noaharc

This might be too off-topic, but just kill it if you think it is. Otherwise, here goes:

I have a question about regression to the mean.

Suppose you have a set of pairs (a,b) corresponding to students in a class. a = the student's score on the first midterm, b = score on second midterm.

If you plot the pairs with a on x-axis, b on y-axis, then get the least-squares line, you have an upward sloping line.

The line slope should be less than 1, indicating regression to the mean.

If you plot b on x-axis, a on y-axis, the slope is necessarily now greater than 1. But I fail to see what has changed in the analysis -- a and b are both just supposed to be samples from the same distribution, right?

This has been driving me crazy, so I'd love some help.

Thank you!
======
roundsquare
Don't do a least squares line. That doesn't help. In the first plot, you'll
see that in general:

x < mean => y > x

x > mean => y < x

This holds if the scores are normalized. Regression to the mean says that most
people move towards the mean in subsequent games/attempts/whatever.

 _But I fail to see what has changed in the analysis -- a and b are both just
supposed to be samples from the same distribution, right?_

Not at all. b is not independent of a; that's the whole point of regression to
the mean. If you take ordered pairs where there is no connection between a and
b, then you won't get any regression to the mean, you'll get points
essentially randomly placed on the plane.

~~~
noaharc
Unfortunately what is really tripping me up is when we draw that line. I get
the principle of regression to the mean, and why it occurs. What I don't get
is this particular manifestation of it.

Fair point about "no connection between a and b".

What I should have said was something like: Why is it important that a come
before b chronologically? If we were mistaken, and we thought that b came
first, then what we would be seeing is "progression from the mean".

Does the concept of regression to the mean depend on the chronology of events?
That would be weird -- most probability doesn't, right?

~~~
roundsquare
Hmm, I read this:

<http://en.wikipedia.org/wiki/Regression_toward_the_mean>

And I guess I made an assumption about the situation you described. If the
students were all answering in identical random ways, then you'd see what I
describe. I think this part of the wikipedia article describes it well:

"Consider a simple example: a class of students takes a 100-item true/false
test on a subject. Suppose that all students choose randomly on all questions.
Then, each student’s score would be a realization of one of a set of i.i.d.
random variables, with a mean of 50. Naturally, some students will score
substantially above 50 and some substantially below 50 just by chance. If one
takes only the top scoring 10% of the students and gives them a second test on
which they again guess on all items, the mean score would again be expected to
be close to 50. Thus the mean of these students would “regress” all the way
back to the mean of all students who took the original test. No matter what a
student scores on the original test, the best prediction of their score on
second test is 50."

So, to your question, why is time important? It's important in the sense that
you need the first test to determine who your "high flyers" are for the second
experiment.
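
The Wikipedia example is easy to check numerically. Here is a minimal sketch, assuming 2,000 students who guess every item with probability 0.5 on two 100-item tests; the function name, class size, and seed are my own choices for illustration, not anything from the article. It picks the top 10% using the first test only, then compares that group's mean on both tests.

```python
import random
from statistics import mean

def simulate_guessing(n_students=2000, n_items=100, top_fraction=0.10, seed=0):
    """Two true/false tests on which every student guesses every item."""
    rng = random.Random(seed)

    def guess():
        # Number of correct answers out of n_items fair coin flips.
        return sum(rng.random() < 0.5 for _ in range(n_items))

    scores = [(guess(), guess()) for _ in range(n_students)]
    # Select the "high flyers" using the first test only.
    scores.sort(key=lambda pair: pair[0], reverse=True)
    top = scores[: int(n_students * top_fraction)]
    return mean(a for a, _ in top), mean(b for _, b in top)

top_first, top_second = simulate_guessing()
print(f"top 10% on test 1: mean {top_first:.1f}")
print(f"same students on test 2: mean {top_second:.1f}")
```

The first number should sit well above 50 and the second should land near 50, which is the "regress all the way back" case from the quote.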

------
mbrubeck
Regression to the mean does not imply that the first slope should be less than
one.

If for some reason only the above-average students regressed, then the slope
would be < 1. But regression to the mean also affects the scores of students
who started below average; as a group we should expect them to regress
_upward_ toward the mean. Combine the two groups, and the effects exactly
cancel out, leaving a slope of 1.

(Since you say the slope "should be" one, I assume the scores are normalized
somehow so that the mean score for exam A is the same as the mean for exam B.)

~~~
noaharc
Yep, I assumed they were normalized. I don't see the canceling out, though.

If you do poorly on the first test, the x-coordinate is low/close to y-axis.
You're then expected to do better on the second test, so the y-coordinate is
high. This will flatten the left half of the line.

As you said, if you do well on the first test, the x-coordinate is high, and
the y-coordinate is low. This will flatten the right half of the line.

Right?

~~~
mbrubeck
I'll assume up front that the scores on the two tests are independent, since
that's the scenario in which regression to the mean applies. (It also applies
if they are correlated but have some independent "noise" component, but that
complicates things.)

Your mistake is here:

 _"If you do poorly on the first test... you're then expected to do better on
the second test, so the y-coordinate is high."_

Actually, under my assumption, a student who does poorly on the first test is
no more or less likely than anyone else to do well on the second test. Their
y-coordinate will not be "high" in absolute terms; on average, it will be the
same as the mean for the whole class. The regression exists because the group
has a low starting point, not because it has a high ending point. As a group
the high scorers will regress _to_ the mean, not past it. (In the case where
scores are partially correlated, the group will regress _toward_ the mean.)
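
For the partially correlated case, here is a rough sketch under an assumed model that is not stated anywhere in this thread: each score is a shared "ability" term plus independent noise of the same size, with scores centered at 0. The helper `ols_slope` and all the constants are hypothetical. It fits the least-squares line in both directions and also looks at how the above-average group on A does on B.

```python
import random
from statistics import mean

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
n = 20_000
# Assumed model: score = shared ability + independent noise.
ability = [rng.gauss(0, 1) for _ in range(n)]
a = [t + rng.gauss(0, 1) for t in ability]
b = [t + rng.gauss(0, 1) for t in ability]

slope_b_on_a = ols_slope(a, b)  # a on x-axis, b on y-axis
slope_a_on_b = ols_slope(b, a)  # axes swapped

# Students above the class mean (0) on test A, and their mean on each test.
high_a = [(x, y) for x, y in zip(a, b) if x > 0]
high_a_mean_a = mean(x for x, _ in high_a)
high_a_mean_b = mean(y for _, y in high_a)

print(f"slope of b on a: {slope_b_on_a:.2f}")
print(f"slope of a on b: {slope_a_on_b:.2f}")
print(f"high scorers on A: mean A {high_a_mean_a:.2f}, mean B {high_a_mean_b:.2f}")
```

Under this model both fitted slopes come out close to the correlation between the two tests (about 0.5 here), so swapping the axes does not turn the slope into its reciprocal. The high-A group's mean on B sits between the class mean and their mean on A, i.e. they move toward the mean without reaching it.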

For example, suppose scores are independently, uniformly distributed on both
tests. Then your scatterplot will have dots distributed uniformly over its
entire area - obviously this does not change if you switch axes. And yet there
is regression to the mean. Divide the graph into quadrants. If you look at the
right half of the plot (high scorers on first exam), you'll see that there are
as many in the upper quadrant as the lower (their mean on the second exam is
the class mean). Same for the left side of the plot. This isn't order-
dependent; you'll find that the high-scorers on B also regress to the mean
when you look at their A scores.
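
The quadrant argument can be checked with a quick simulation. This is only a sketch, assuming scores drawn independently and uniformly from 0 to 100 on both exams, with a class mean of 50; the sample size and seed are arbitrary.

```python
import random
from collections import Counter
from statistics import mean

rng = random.Random(2)
n = 40_000
# Independent, uniformly distributed scores on exams A and B.
points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(n)]

# Count dots per quadrant, keyed by (high on A?, high on B?).
quadrants = Counter((a > 50, b > 50) for a, b in points)

# High scorers on A, judged by their B scores, and vice versa.
mean_b_given_high_a = mean(b for a, b in points if a > 50)
mean_a_given_high_b = mean(a for a, b in points if b > 50)

for key in sorted(quadrants):
    print(f"high on A={key[0]!s:5} high on B={key[1]!s:5} count={quadrants[key]}")
print(f"mean B score of high-A students: {mean_b_given_high_a:.1f}")
print(f"mean A score of high-B students: {mean_a_given_high_b:.1f}")
```

The four quadrant counts should be roughly equal, and both conditional means should sit near 50, whichever exam you use to pick the high scorers.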

