
The 50 million dollar lie - ColinWright
http://garyrubinstein.teachforus.org/2013/01/09/the-50-million-dollar-lie/
======
feral
I know nothing about the domain of teacher measurement, and have no opinion on
it.

But the first chart in the blog post, with the big mass of blue on it, is
surely not the strong evidence of weak correlation that the author makes it
out to be?

There could still be a strong correlation in that data, even if it looks like
a blob of blue: there are so many data points on the chart that we can no
longer tell where the density of points is. It's possible that there is a
dense line of points in there that still gives a strong correlation, which we
just can't see.

A heatmap with some level of binning, or a hex bin chart, or 2d histogram,
would surely be more appropriate in these circumstances? Something like this:
[http://matplotlib.org/examples/pylab_examples/hexbin_demo.ht...](http://matplotlib.org/examples/pylab_examples/hexbin_demo.html)

Maybe along with some direct math analysis of the correlation. Am I missing
something?
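The binning and direct-correlation ideas above can be sketched without any plotting library. Everything here is a hypothetical illustration on simulated data (the sample size and the true correlation of 0.35 are assumptions, not figures from the study); `numpy.histogram2d` is the text-mode analogue of the heatmap/hex-bin chart suggested:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the value-added data: many points with a
# modest true correlation, which over-plotting would hide in a scatter.
n = 18000
x = rng.normal(0, 1, n)
y = 0.35 * x + rng.normal(0, 1, n) * np.sqrt(1 - 0.35**2)

# A 2-D histogram (the text form of a heatmap / hex-bin chart):
# each cell counts points, so the dense core becomes visible even
# when a scatter plot would be a solid blob.
counts, xedges, yedges = np.histogram2d(x, y, bins=25)

# Direct math analysis of the correlation, as suggested above.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.2f}")  # close to the assumed true value of 0.35
print(f"densest cell holds {int(counts.max())} points")
```

The point is that the eye can't estimate `r` from 18,000 over-plotted dots, but `np.corrcoef` can, and the binned counts show where the mass actually sits.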

~~~
vy8vWJlco
A heatmap would be more revealing, but I'd eat my hat if it weren't a standard
distribution (and I had a hat). It's nature (but hey, anything's possible).
The scales are only slightly skewed, but we're not talking logarithms. The
merit of the scales ("value-added") on the other hand, is very questionable.

But the relation, whatever it is describing, is pretty weak since the ellipse
isn't very irregular (it's only "slightly" irregular).
[http://www.emeraldinsight.com/content_images/fig/08702801090...](http://www.emeraldinsight.com/content_images/fig/0870280109015.png)
Think Kepler's laws.

Edit: The first post in the series shows a scatter of the direct (non-delta)
scores, and it looks like San Francisco in the morning:
[http://garyrubinstein.teachforus.org/2012/02/26/analyzing-re...](http://garyrubinstein.teachforus.org/2012/02/26/analyzing-released-nyc-value-added-data-part-1/)

~~~
spikels
Sorry, I don't mean to belabor this point, but I think "standard distribution"
(changed from "standard deviation") is still incorrect. "Standard" refers to
standardized parameters (i.e. mean, variance), as in "standard normal", which
has a mean of zero and a variance of one (and a standard deviation of one).
Certainly not the case here. Maybe you mean "bivariate normal with positive
correlation"?

[http://en.wikipedia.org/wiki/Multivariate_normal_distributio...](http://en.wikipedia.org/wiki/Multivariate_normal_distribution)

The interesting thing is that the (positive) means and variances seem similar,
which is consistent with the underlying random variables being the same.

~~~
vy8vWJlco
Yes, you are right. I wasn't being especially rigorous, and technically
correct is the "best kind" of correct. :)

~~~
spikels
No worries. Statistics can be very confusing. Part of the reason it is so
confusing is all the sloppy usage we come across in the media and on the
internet.

Everyone wants to use statistics as a tool to bolster their point of view, but
no one wants to bother to actually learn enough statistics to deal with the
situation they are trying to analyze.

------
kenjackson
This is a surprising blog post in that I draw the complete opposite conclusion
from the author. The author seems to think that the averaging hides volatility
(which it does), which leads to incorrect conclusions being drawn. Whereas to
me it looks like it removes visual noise to show an actual trend.

In his original scatter plot, because of the big ball in the middle it's easy
to handwave and say, "look a random blob" -- but even a slightly longer look
seems to indicate there is a positive correlation in the data. His averaging
that he does at the end, IMO, makes it clear.

~~~
FourthProtocol
The other thing the author does is to derive absolute findings when the author
himself admits to making an assumption about how the numbers were obtained. To
me, using the word "lie" in the title is therefore nothing more than click-
baiting.

And on a separate note - there's a lot of anti-Gates sentiment flying around
lately. Whether his methods are the best or not, he's _using his own money_. I
find it really difficult to give any criticism directed at Bill Gates'
foundation any credence. And that's despite not knowing much at all about what he does.

Maybe people would like Bill more if he did a Larry Ellison and bought a yacht
instead.

~~~
jacalata
People are arguing that he is using his own money to actively damage society.
I'm not sure if I agree with that, but to say you can't criticize someone for
what they do with their money is ridiculous - as an extreme, he could use his
money to poison people, or to lobby for life sentences for marijuana use.

~~~
FourthProtocol
I said Bill Gates, and not someone. And while I don't know how people think
he's damaging society, if the article being discussed here is representative
then I remain unconvinced. This is not an invitation to enlighten me. I'm not
wasting my time reading yet another opinion, assumption or half-baked armchair
allegation. The investment dollars and opinion of the likes of Warren
Buffett are a much more compelling endorsement to me than (I mean this
respectfully) members of HN.

------
ColinWright
I suspect there'll be a lot of differing opinions on this, and I'm looking
forward to seeing the discussion. What would be helpful would be if people
could say how much experience they have in hard statistics, and how much what
they say is driven _a priori_ from the data.

I know that hackers, in particular, have real problems with "Argument from
Authority", but stats is one place where it's really, really easy to go wrong,
so knowing how much formal training someone has can be an important indicator.

~~~
absherwin
I'll bite. I won't discuss my experience in my posts since I don't want to
argue from authority but will do so below. But first, I should point out that
you're committing a similar sin to that which is alleged in the article. What
you really care about is the correctness of an argument. You hypothesize that
formal training is an important indicator of correctness. Presumably there's
also noise in that scatterplot but we don't even have it. Since it's the best
thing available, you are choosing to rely on it.

As for my formal stats training: None beyond High School. That said, I've
worked with and interviewed many with far more training. One thing I learned
is that most people don't have great intuition for statistical problems, and
it's not terribly highly correlated with years of schooling. I've met PhDs in
economics and statistics who have made basic conceptual errors, as well as
people with less training who were more reliable.

So while training may be correlated with accuracy, you should still demand a
well-reasoned argument and think critically about it.

~~~
ColinWright
I agree almost entirely. I'm specifically not asking for people to say "I'm a
PhD in statistics, and here's the answer. Accept it because I know better than
you." What I'm asking for is a complete and reasoned response, along with some
evidence as to how much I should listen to you in the first place.

If you have no such evidence then the onus will be on you to make your
argument more complete, more coherent, and more comprehensive. If you have
evidence (note: evidence, not proof) then you can be a little less rigorous in
what you say, and rely on people giving you the benefit of the doubt while
they work through the argument.

What I see a lot of is long, apparently good arguments that then turn out not
to be as complete or coherent as they seemed. They sometimes just don't hang together.

A significant amount of formal study in a subject is evidence that someone
might just have a better understanding. After spending a lot of time on the
internet I'm tired of having to wade through every single argument in detail
looking for all the possible chinks.

Maybe that's just impossible. Maybe every person has to redo every analysis
for every argument. Seems like a complete waste of almost everyone's time.
What about "Don't Repeat Yourself" or "Don't re-invent the wheel." I guess we
are doomed to reinvent the wheel in every single discussion.

~~~
absherwin
Reinventing the wheel would be a major problem if our goal is to solve the
education problem with this discussion. No one here has done even the basic
work I would expect of someone trying to understand teacher evaluation as a
solution and compare it to other alternatives.

I would argue that the whole point of HN is to think through arguments in
other domains and build intuition by reasoning through problems and arguments.
Otherwise, what's the point? No one is going to arrive at this thread and scan
the top-rated comments for the solution to his school district's problems.

In cases where actual decisions are being made, where the analyses are much
more thorough and full validation is much more expensive, other techniques are
available. First, one generally builds an awareness of the strengths of each
team member, which suggests where errors may be more likely. Additionally, one
can check a random set of the most likely problem areas. Perhaps most
importantly, while everyone won't re-do every analysis, it's highly unlikely
that any important analysis will only be done once. So one can expect that
the high-level results are generally consistent.

------
ubasu
This blog post, linked from the comments in the article, explains the
"ecological correlation fallacy"[1] being described in the article:

[http://ed2worlds.blogspot.com/2013/01/gates-foundation-waste...](http://ed2worlds.blogspot.com/2013/01/gates-foundation-wastes-more-money.html)

Key quote:

"...it has been known for more than 60 years now that correlating averages of
groups grossly overstates the strength of the relationship between two
variables"

Apparently, this was also used in the past to show high correlation between
African-Americans and illiteracy.

[1] <http://en.wikipedia.org/wiki/Ecological_fallacy>
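The quoted claim about correlating group averages is easy to reproduce with simulated data. The numbers below are hypothetical (a true individual-level correlation of 0.3 and 20 buckets are assumptions for illustration, not figures from the study):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated individual-level data with a deliberately weak correlation.
n = 10000
year1 = rng.normal(0, 1, n)
year2 = 0.3 * year1 + rng.normal(0, 1, n) * np.sqrt(1 - 0.3**2)

r_individual = np.corrcoef(year1, year2)[0, 1]   # weak, near 0.3

# Sort by year-1 score and average within 20 equal-size groups,
# mimicking the percentile-bucket averaging criticized in the article.
order = np.argsort(year1)
g1 = year1[order].reshape(20, -1).mean(axis=1)
g2 = year2[order].reshape(20, -1).mean(axis=1)
r_grouped = np.corrcoef(g1, g2)[0, 1]            # much stronger

print(f"individual r = {r_individual:.2f}, grouped r = {r_grouped:.2f}")
```

Averaging 500 points per bucket shrinks the noise by roughly a factor of sqrt(500) while keeping the group-level signal, so the grouped correlation lands near 1 even though the individual-level relationship is weak; this is exactly the overstatement the quote describes.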

------
graeham
There are two problems with this article. First, the "evidence" is a scatter
plot with no statistical measurements attached to it. Stats really aren't
something you can rely on your eyes for. To me, it looks like a weak, positive
correlation between the previous year's score and the current year's score.
The thing to look for is greater densities in the NE and SW quadrants than in
the NW and SE. Having a zero-mean "blob" in the data makes it tricky on the
eye, but NE and SW points are the ones the model "correctly" predicts, while
the NW and SE points are the ones the model "gets wrong". I think the issue is
that the eye is looking for a line to read a slope off of, but such data
doesn't stretch into a line. What it should look for is: "given a good score
last year, is a good score given this year?" or "given a bad score last year,
is a bad score given again?"

The second issue is that the author does not offer a better solution to the
problem. Some information is almost always better than no information in
decision making. Statistics on human-based samples are always tricky, and it's
tough to get strong correlations easily. As an aside, this is part of why
clinical trials are so tricky and expensive.

(I am not a stats professional, but am a biomedical research scientist with
some training in human-population based stats).

~~~
lancewiggs
Exactly. Charts are useful for eyeballing, but only the use of proper
statistical tools is valid for showing trends. Charts can easily be very
misleading.

There seem to be a lot of comments about the charts on this page. Arguably
basic statistics should be a part of every education - is that happening in
the USA?

~~~
graeham
Can't comment on the USA - in Canada we had an intro in high school, then
about a half-course-credit module per year from second year until grad school
in the engineering programs.

------
keithpeter
" _In New York City, the value-added score is actually not based on comparing
the scores of a group of students from one year to the next, but on comparing
the ‘predicted’ scores of a group of students to what those students actually
get. The formula to generate this prediction is quite complicated, but the
main piece of data it uses is the actual scores that the group of students got
in the previous year._ "

[http://garyrubinstein.teachforus.org/2012/03/06/analyzing-re...](http://garyrubinstein.teachforus.org/2012/03/06/analyzing-released-nyc-value-added-data-part-iii/)

Some _qualitative_ comments (from the UK)

An experienced and resourceful teacher may be allocated more 'problem'
students than a relatively new teacher as a matter of policy. Teaching
students at the extremes of the ability range is harder than at the median, as
the reasons for (say) severe underachievement can be very diverse (learning
difficulties, home situation, undiagnosed medical issues - diabetes in my
recent experience with one student - and so on) and each student may need a
different approach.

I have taught in a few settings over the years. The best teaching happens in
my experience in settings where people work together. Performance related pay
based on dodgy statistics in an atmosphere of public ridicule does not seem
the best way to encourage sharing of practice and teamwork!

"Good luck with that" as I believe the late Mr Jobs used to say to
competitors.

~~~
spikels
You are correct that if students were assigned in some biased manner it could
invalidate any relationship between teacher quality and subsequent student
performance.

That is exactly what makes this study interesting. All the students were
assigned randomly! Much like in a randomized medical study this allows the
establishment of a causal relationship.

~~~
keithpeter
" _All the students were assigned randomly! Much like in a randomized medical
study this allows the establishment of a causal relationship._ "

Two thoughts...

1) Ethics: You put some children with dyslexia or with challenging behaviour
with newly trained teachers with little experience? Did their parents have any
say in this? Was there an ethics process? Was there support available for
newly trained teachers meeting major needs in their randomly allocated
classes? Amazing.

2) Correlation does not imply a causal relationship. You need a theory. Theory
in this area is not positivist, it is fuzzier.

As I said before, 'good luck with that'.

~~~
spikels
1) Same students, same teachers and same support in same schools but the
assignment of students and teachers to classes was randomized. Consent was
obtained from principals, teachers and parents.

2) Randomized experiments can establish causation if done properly. This is
how medical treatments are evaluated. The theory is simple: teachers who were
more effective in the past will be more effective in the future, and using
objective metrics (tests, classroom observation and surveys) we can measure
teacher effectiveness. This is certainly positivist but, more important,
likely repeatable.

You seem like you are interested in this, but I get the sense you already have
strong views. Maybe you should put those aside for a moment and take a closer
look at this new and interesting study; you might learn something:

<http://www.metproject.org/reports.php>

~~~
keithpeter
"You seem like you are interested in this but I get the sense you already have
strong views."

I will read the reference, however, to those of us in Europe, please remember
that this methodology resembles a group of earnest steam punks analysing flies
in amber with brass and ebony microscopes on an ice floe that is breaking
up... I'm suggesting that we suspect positivist descriptions do not capture
the important values in education. 90% of what children learn is learned
outside school I suggest.

I am prepared to be astounded with contrary evidence.

------
streptomycin
I'm surprised at the comments here. Seems that nearly everyone disagrees with
the blog? Is Bill Gates such a saint among the HN crowd that we ignore
statistics for him?

Also, see this blog for some more discussion along the same lines:
[http://ed2worlds.blogspot.com/2013/01/gates-foundation-waste...](http://ed2worlds.blogspot.com/2013/01/gates-foundation-wastes-more-money.html)

Seems pretty clear that Gates' study presents the data in a very misleading
manner to make a very weak correlation look like a very strong correlation.

~~~
ef4
The blog post doesn't have any statistics to ignore.

Graphs are not statistics. Why doesn't he give us a correlation coefficient on
his big blue blob graph?

~~~
streptomycin
You criticize a short blog post for something that is not done in the Gates
publication. Do you have some bias, perhaps? Why does Gates not even give us a
scatter plot, let alone a correlation coefficient? This is the critical point
raised by the blog.

~~~
spikels
There are quite a few statistics reported in the Gates Foundation publication.
Take a look for yourself:

[http://www.metproject.org/downloads/MET_Validating_Using_Ran...](http://www.metproject.org/downloads/MET_Validating_Using_Random_Assignment_Research_Paper.pdf)

~~~
streptomycin
I never said they don't report "quite a few statistics". I said they don't
report this one specific particularly important statistic, and instead obscure
it behind a ridiculous averaging procedure. This was either done because they
are incompetent or because they are purposely trying to mislead. That's the
whole point of TFA.

~~~
spikels
What one particular statistic do you want?

Take a look at the FULL report. All the statistics behind the graph you
criticize are in Table 1 on page 10.

The next 40 pages seem to be a pretty competent description of both the
strengths and weaknesses of their study.

------
DataWorker
Using LOC to hire or fire programmers ignores important contextual factors,
just as using test scores of students to hire or fire teachers ignores
important contextual factors. What language is being used? How large is the
project? Existing code? Platform? Bug count? Does the person contribute to the
team in other ways?

In teacher evaluation you also have critically important contextual factors.
What course is being taught? Are the students motivated? What's the classroom
like? A lot of ELL kids in the class; does the teacher speak their language?

Many people see _some amount_ of value in these imperfect metrics because they
a) seek to measure the critical output (programmers are paid to make code and
teachers are paid to make kids learn), and b) simple numeric metrics aren't
biased by human fallibility. Popularity amongst peers won't have any impact on
LOC or student test scores.

The author feels they shouldn't be used as 33% of the overall evaluation,
which is the suggestion made by Gates.

In terms of the specific points the author made about Gates' evidence in favor
of the suggested 33%, I think he's basically right. I haven't read the
original study, so I'm assuming for the sake of discussion that he's conveying
it correctly.

Oversaturated scatterplots are not very useful, and in the hands of someone
with no training or knowledge of basic statistics they may be dangerous, but
they're not wrong. A trained consumer of research, seeing a plot like the
author's first, will only be able to conclude that the correlation is not
perfect. That rules out only a subset, and a subset that's fairly uncommon
outside of the physical sciences, of the possible relationships between two
variables. Not very useful, but not misleading unless there's an assumption
that the number of data points is relatively small (which may be the case in
some research contexts).

Averaging points and then taking the fake aggregated points and plotting those
is wrong. It's using a tool to show variability in the wrong way. Averaging
hides the variability that is supposed to be shown in a scatterplot. Why not
average down to 2 points and draw the line then? Because that's wrong.

I'm thankful that one or two other people here understand, but a bit dismayed
at the kneejerk groupthink.

------
danso
I see the OP's point, that visual noise can be filtered in such a way as to
create a more compelling trend, and perhaps the filtering that Gates has done
has simplified the results...so, OK. Let's not look at charts, let's look at
actual calculations that have the various factors accounted for...what are
they?

And I guess I'll throw in my two cents into the inevitable teachers-vs-data
debate. I believe teachers may be too quick to toss aside performance analysis
because they are so close to the personal stories and circumstances...no line
chart can seem to account for the student who was improving greatly but then
missed an insurmountable amount of classtime due to problems at home.

Most humans are unwilling/unable to see their lives and work measured as
averages and trend lines and deterministic outcomes, and I think teachers feel
this acutely due to the nature of their work. And unfortunately,
this blinds them to the possibility that while there is no cookie-cutter
solution that _always_ works, there are surely some measures/guidelines that
_often_ work, or at least mitigate even the most extreme circumstances. As it
is, there's not much incentive or opportunity to see "the bigger picture"
outside of what happens in one's own classroom year after year.

That said, this is not even close to being mainly the fault of the teachers.
Among the other problems is that administrators are just as prone to
shortsightedness, and "the numbers" can be wielded by a petty/mediocre
principal/district officer to apply pressure to a teacher. It's no wonder that
teachers are quick to distrust analyses, given the possibility that they're
abused for political means.

So, essentially, it's a giant crapfest with no easy solutions. From the
perspective of Gates' efforts, we can only hope that he pours in so much money
and advocacy that there's eventually a shift towards best practices...though
it's safe to say there will still be unintended consequences that aren't all
favorable.

~~~
yummyfajitas
_Among the other problems is that administrators are just as prone to
shortsightedness, and "the numbers" can be wielded by a petty/mediocre
principal/district officer to apply pressure to a teacher._

The more we rely on objective numbers, the less a good teacher has to fear
from a bad principal. The principal has no ability to make the teacher's
students do worse than their statistical predictor, after all.

On the other hand, many other popular metrics (e.g. principal's
opinion/coworker's opinion) can easily be tweaked by a principal who wants to
put pressure on a teacher.

------
absherwin
Both sides are right but they seem to be concerned with different things.

Grouping data points does mask volatility. This doesn't make the effect any
less real on the societal level. If we seek to improve student performance,
skewing the pool of teachers towards the better ones will do that.

Yet there is also an enormous amount of volatility, which is also real.
Ideally we would just create better predictors. Constraining ourselves for the
moment to this set: since these metrics have high error rates at the
individual level, it's also reasonable to assert that in relying on them, we
will make unfair decisions for many teachers.

The right solution depends on how much you value optimizing student outcomes
vs. optimizing fairness as well as second order effects such as driving away
teachers. My bias would be to use these scores for economic incentives to
attempt to ensure that retention for the best teachers is higher than for the
worst. Without additional work, termination likely isn't warranted based on
this data.

To those who argue that pay differentials require better evidence, I would
suggest comparing the validity of performance evaluations across any other
job. It's imperfect but better for society than doing nothing.

For long-term fairness, it's important that there isn't a single value-added
model. Competitive pressure can do wonders to improve model quality and it's
also fairer since those who don't do well on one model can move to a different
rubric.
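The individual-level error rates mentioned above can be made concrete with a simulation. All numbers here are hypothetical (a year-to-year correlation of 0.35 is an assumed stand-in, not the study's figure):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sketch of the trade-off: a real population-level effect
# can coexist with high error rates for individual teachers.
n = 50000
score1 = rng.normal(0, 1, n)
score2 = 0.35 * score1 + rng.normal(0, 1, n) * np.sqrt(1 - 0.35**2)

# Of teachers in the top fifth one year, what fraction stays there?
top1 = score1 > np.quantile(score1, 0.8)
top2 = score2 > np.quantile(score2, 0.8)
stay = np.mean(top2[top1])

# And what fraction falls all the way into the bottom two-fifths?
fall = np.mean(score2[top1] < np.quantile(score2, 0.4))

print(f"top-quintile teachers staying on top: {stay:.0%}")
print(f"top-quintile teachers dropping to the bottom 40%: {fall:.0%}")
```

Under this assumed correlation, most "top" teachers do not repeat in the top quintile, and a sizable minority lands in the bottom 40%; that is the volatility that makes individual terminations hard to justify even when skewing retention toward higher scores helps on average.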

------
eli_awry
He doesn't actually say what the positive correlation is, which I find
disingenuous and suspicious. Because his scatterplot is so lacking in detail,
it's about as useful as the following scatterplot of SAT scores:

    X | X
    --+--
    X | X

Where the axes are scores above and below 600 in math on the first and the
second time taking the test. There are some individuals who do better or worse
- but just because we can draw a detail-obfuscating graph of data doesn't mean
there is no detail in the data, or no correlation, or that SAT scores won't
help us predict future SAT scores. It just means we can draw a detail-
obfuscating graph, and that there's not a perfect correlation.

One point the author does have is that these scores are not perfectly
predictive, so bad performance one year shouldn't mean that the teacher gets,
say, fired. OK. He seems to be protesting against using information to draw any
conclusions at all. Perhaps it would be more effective to show what fallacious
or irrational conclusions are being falsely drawn from this data and used to
fire teachers, and then object to those specific instances of irrational
actions. This value-added measure has not been shown to be intrinsically
unreliable or wrongheaded but maybe some applications of it are.

------
carbocation
I'm working on reproducing the graph with the data available from the NYT, but
I'm not sure on one or two details of the author's method. If someone sees it,
please let me know! Things I can't tell so far:

1) How is a "teacher" defined? Just (hash firstname lastname)? If so...

2) When a teacher teaches multiple subjects per year, which one is chosen? Or
are they both represented, so that "teacher" actually means (hash firstname
lastname subject)?

~~~
tmoertel
If you want to see how I recreated the OP's graph, just fork my GitHub repo.
It has R code and the data sets, too:

<https://github.com/tmoertel/nyva-cursory>

Anyway, here are the variables I used to identify comparable observations:

    
    
        ## these variables identify comparable teacher observations
        id_vars <- c("subject", "grade", "dbn",
                     "teacher_name_first_1", "teacher_name_last_1")
    

So I used subject, grade, school (identified by dbn) and the teacher's first
and last names.

~~~
carbocation
Nice, thanks for sharing. I used the same variables with the exception of dbn;
my code is here <https://gist.github.com/4652968> (I later started applying
filters such as 4th-grade teachers only, etc., which is reflected in the
gist.)

I was pretty curious how the author specifically munged his data, since then
we could put to rest the speculation about the degree of correlation.

------
bramcohen
Looking at the correlation in students' scores from one year to the next makes
clear what's going on - [http://garyrubinstein.teachforus.org/2012/03/06/analyzing-re...](http://garyrubinstein.teachforus.org/2012/03/06/analyzing-released-nyc-value-added-data-part-iii/)

There's such a high correlation between an individual student's score from one
year to the next, and the number of data points is so small, that trying to
tweeze out the teacher's contribution in a statistically meaningful way is
fairly hopeless.

What I'd like to see are correlations with a teacher's own scores on tests of
the material they're supposed to be teaching (oh, you think they can all get
perfect scores easily? Hah!), and whether teachers banned from giving homework
do any worse, and whether dumping the enforced curriculum in favor of letting
students study at their own pace makes any difference. Given how little
difference what the teachers actually do seems to make, it would be logical to
at least dump the things which make school dull and unpleasant.

------
WalterBright
There is no objective way to measure teacher performance. Any evaluation
method that can be written as a list of rules can and will quickly be gamed.

The thing is, it's easy to find out who the best teachers are. Simply ask the
students, parents, and staff. They all know who the good ones and the bad ones
are - and that can't be gamed.

~~~
colanderman
Your claim makes no sense. If the evaluation method measures what we want
teachers to do, then it, by definition, can't be "gamed": if the measurement
is high, then the teacher is doing what we want.

Are you claiming that there's something special about teaching that makes
teacher performance much more difficult, or even impossible, to quantify? If
so, what is the point of teaching if one cannot measure results?

Your solution to "ask the students, parents, and staff" is precisely a
measurement method (albeit a more qualitative than quantitative one), and
moreover, one that can be gamed easily by anyone who's socially shrewd.
Qualitative measures like that are exactly why we have so many crappy
politicians.

~~~
WalterBright
You can't specify everything in the rules. For example, I know a case where
management decided to hand out bonuses to the staff for minimizing inventory,
because inventory costs money. The staff minimized inventory, all right, and
got their bonuses. Meanwhile, the production line would regularly get halted
because they'd run out of things like 5 cent resistors. It was a disaster.

The same company decided to rate programmers based on the bug list. A big
chart was posted on the wall that showed the daily bug count. It wasn't a week
before huge fights erupted over the bug count - over what was and was not a
bug. The programmers quickly gamed it. They'd hide bugs, they'd refuse to fix
one bug if that fix would produce some other minor bugs (e.g. if the bug was
"feature X missing", then feature X is added, but had a couple issues with it,
then X would not get fixed). They'd even add "bugs" blamed on other
programmers and then "fix" them to get the credit.

Management gave up on that after two weeks and pulled the banner down.

~~~
yummyfajitas
For teachers, the metric is (roughly speaking) "% of students capable of
multiplying/dividing numbers up to 4 digits in a standardized test setting".
How do you game this metric?

The fact that one company used a couple of bad objective metrics doesn't mean
all objective metrics are bad. They are used with a great deal of success in
many fields. Sales people are paid on commission, traders are paid
proportionally to (risk adjusted) profits, etc. All it means is that if you
set the wrong goals for your company, you'll probably succeed at the wrong
thing.

So tell us - is "maximize the % of students capable of multiplying/dividing 4
digit numbers" the wrong goal? If so, what is the right goal, and why can't it
be measured?

~~~
WalterBright
One popular way of gaming the metric is systematic cheating on the tests by
the teachers. I say popular because it has happened on a large scale, most
recently in Atlanta as I recall.

Another way to manipulate the test results is to manipulate which students are
in your class or your school.

The point is, people are endlessly creative in subverting rules to their own
benefit - so they conform to the letter of the rules but not the spirit.

Consider also the "work to rule" technique used by unions as a bargaining
tactic. It's as simple as the workers literally adhering to their job
descriptions. It doesn't work out well for the company.

~~~
yummyfajitas
To prevent cheating, test administration should be handled by someone other
than teachers.

The fact that the current system has a bunch of cheaters is not an argument
against more carefully and objectively measuring the current system. What next
- bankers sometimes engage in rogue trading, so we should reduce monitoring of
their behavior?

 _Another way to manipulate the test results is to manipulate which students
are in your class or your school._

This is very difficult with VAM, since the goal is to increase (actual score -
statistically predicted score). You need to reliably identify students who
will do better than their statistical predictor.

I.e., you need to discover students who will improve drastically _this year_
and then pack your student body with them.
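The residual idea can be sketched with a toy simulation (Python; the linear predictor, the 0.7 slope, and the teacher-effect sizes are invented stand-ins, not the real VAM model):

```python
import random

random.seed(0)

# Toy model: a student's current score depends on a prior score, a
# teacher effect, and noise. All coefficients here are illustrative.
def simulate_class(teacher_effect, n=30):
    students = []
    for _ in range(n):
        prior = random.gauss(0, 1)
        current = 0.7 * prior + teacher_effect + random.gauss(0, 0.5)
        students.append((prior, current))
    return students

def value_added(students, slope=0.7):
    # "Value added" here is the mean of (actual - predicted) residuals.
    residuals = [current - slope * prior for prior, current in students]
    return sum(residuals) / len(residuals)

strong = value_added(simulate_class(+0.5))
weak = value_added(simulate_class(-0.5))
print(strong > weak)  # the stronger teacher shows the higher mean residual
```

Because the predicted score already accounts for the prior score, gaming the measure requires finding students who will beat their own prediction, which is the difficulty described above.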

------
dropdownmenu
The author's argument in this post is rather poor. He claims that because the
raw data plot is noisy and weakly correlated, no valid relationship can be
drawn by averaging the data into buckets.

While the author is correct in that averaging hides the variance of data, he
does not try to answer the question of "Can a trend be found from noisy data?"
or show other accepted statistical methods that show no correlation in the
data.

Instead the author dismisses the report from Gates through name calling: "It
seems like the point of this ‘research’ is to simply ‘prove’ that Gates was
right about what he expected to be true. "
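For what it's worth, that question can be explored directly: with enough points, data with a modest underlying correlation looks like a blob in a scatterplot yet yields near-perfectly ordered bucket averages. A quick sketch (Python; the 0.35 slope and the 20 buckets are arbitrary choices for illustration):

```python
import random

random.seed(1)
n = 10000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.35 * x + random.gauss(0, 1) for x in xs]  # weak true relationship

def pearson(a, b):
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b)) / m
    va = sum((p - ma) ** 2 for p in a) / m
    vb = sum((q - mb) ** 2 for q in b) / m
    return cov / (va * vb) ** 0.5

# Point-level correlation is modest (around 0.33): the scatterplot is a blob.
r_points = pearson(xs, ys)

# Sort by x, split into 20 equal buckets, and average each bucket.
pairs = sorted(zip(xs, ys))
size = n // 20
x_means = [sum(p for p, _ in pairs[i:i + size]) / size for i in range(0, n, size)]
y_means = [sum(q for _, q in pairs[i:i + size]) / size for i in range(0, n, size)]

# Bucket-level correlation is close to 1: the averaged chart looks clean.
r_buckets = pearson(x_means, y_means)
print(round(r_points, 2), round(r_buckets, 2))
```

Both charts describe the same data, which is why a direct correlation figure (or a heatmap) would settle the question better than either plot alone.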

------
acslater00
As far as I can tell, the author breaks the universe of teachers into 5%
groups and then proceeds to show that the percentile groups are ordinally
stable between these two years. In other words, the top scoring group in year
1 beat all other groups in year 2, and so forth, with perhaps some slight
shuffling in the absolute middle of the distribution [hard to see on the
graph].

This is supposed to be evidence that the correlation is weak??

------
IEatShortPeople
Even if Bill Gates isn't doing the best job, every time I see one of these
articles there never seems to be any advice on better ways to evaluate
teachers. Did I miss that in the article? Are there other people in the field
doing better work? (I would really like to read about other ways of analyzing
teachers if anyone can provide links to papers.)

~~~
crusso
_Did I miss that in the article_

The blog linked in this thread is a propaganda site. It conspicuously doesn't
reference any of the original materials that it is attempting (weakly) to
refute.

There was an article on HN last week that referenced more of the source
material about the project.

Here's the link to the METProject's web site: <http://www.metproject.org/>

~~~
IEatShortPeople
Thanks

------
thedufer
I'm confused as to why the "‘raw’ score (for value added, this is a number
between -1 and +1)" goes up to ~1.6 in 2009-2010 and up to ~1.1 in 2008-2009.
Why does the data go outside the ranges he states? I find it hard to believe
someone who can't read something so simple off his own graph.

~~~
carbocation
The author incorrectly stated that "value added" is a score between -1 and +1.
In fact, "value added" scores are z scores, which can in theory range from
(-∞,∞).
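A z score is just (value - mean) / standard deviation, so any teacher more than one standard deviation from the mean lands outside [-1, 1], consistent with the ~1.6 on the chart. A minimal example with made-up scores:

```python
def z_scores(values):
    # Standard z scores: (value - mean) / population standard deviation.
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# One teacher far above the pack produces a z score well beyond 1.
scores = [50, 55, 60, 65, 95]
print([round(z, 2) for z in z_scores(scores)])
# → [-0.95, -0.63, -0.32, 0.0, 1.9]
```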

------
absherwin
To build intuition, let's consider a hypothetical society. We divide a group
of people in two. The first group rolls a die with 100 sides labeled 1 to
100. The second group rolls a die with 110 sides labeled 1 to 110. The roll
determines their annual ability level, which is otherwise unknowable. If you
were hiring, which group would you prefer? Is it fair to the second group?

Note that this isn't strictly equivalent for several reasons. In the game
above, the error is part of the generating process instead of the measurement
process. Also, the distribution is discrete rather than continuous. The key
commonality is that a weak predictor at an individual level can be a
meaningful predictor at a group level.
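That commonality is easy to check with a quick simulation (Python; the group size is arbitrary, the 1-100 and 1-110 dice follow the hypothetical above):

```python
import random

random.seed(2)
n = 5000
group_a = [random.randint(1, 100) for _ in range(n)]  # dice labeled 1-100
group_b = [random.randint(1, 110) for _ in range(n)]  # dice labeled 1-110

# Weak at the individual level: in a random A-vs-B pairing, the A member
# still out-rolls the B member about 45% of the time.
a_wins = sum(1 for a, b in zip(group_a, group_b) if a > b) / n
print(round(a_wins, 2))

# Meaningful at the group level: B's average is reliably about 5 higher
# (expected means are 55.5 vs 50.5).
print(sum(group_b) / n - sum(group_a) / n)
```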

~~~
streptomycin
To provide a contrary hypothetical...

Imagine if one group had 55% of its dice labeled up to 110 and 45% up to 100,
and the other group had the opposite. So if all you use for hiring is the
group membership, you're going to very arbitrarily miss out on a lot of 110s
and hire a lot of 100s. And now also imagine that this process of identifying
who is in what group is ridiculously expensive.

~~~
absherwin
It's not contrary at all. It's the same point. The changing mix is merely a
difference of degree. The key argument you raise is that discrimination has a
cost. I don't have the numbers offhand to determine whether the cost of
monitoring exceeds the benefit of higher quality education. Regardless, that's
a different argument that should be backed by a different set of data.

~~~
streptomycin
It's contrary in that a naive interpretation leads to the opposite conclusion.
That's all I meant. And I'm not sure how you define what is a different
argument and what isn't. It's all about the value of these scores.

------
MarkMc
If you want to crunch the numbers yourself, here's the raw data:
[http://www.ny1.com/content/top_stories/156599/now-
available-...](http://www.ny1.com/content/top_stories/156599/now-available--
2007-2010-nyc-teacher-performance-data)

------
MarkMc
I don't know enough about statistics to judge the strength of Mr Rubinstein's
critique, but I find this a little odd: the axes on his graph are a 'raw
score' between -1 and 1, but the axes on the MET study graph are 'standard
deviations'. Is that significant?

------
tokenadult
More weekends than not, Colin shares here on HN an interesting link about
education policy. I see from Colin's own top level comment that this one is
shared for methodological discussion. I'm still mulling over what I think
about this link after reading it, but I'm not sure I can agree that the Gates
Foundation research group is engaged in a "lie" about anything. Education
policy is the issue that drew me to participate on Hacker News,

<http://news.ycombinator.com/item?id=4728123>

and I'm glad to see that so many participants, from the founder on to the
newest member, enjoy thinking about and checking facts on education issues.

There is other research on the issue of teacher quality and value-added
measures to assess teacher quality,

<http://hanushek.stanford.edu/publications/teacher-quality>

and I encourage Hacker News participants to dig through some of the other
publications to figure out what the best available current data show.

The other comments here suggest that the blog author of the submitted blog
post is overstating his own case at least as much as he accuses the Gates
Foundation supporters of overstating their case. If people who oppose the kind
of teacher ratings proposed by the Gates Foundation really had the courage of
their convictions, they might try to let learners have full power to shop for
teachers, on the grounds that teacher quality measures miss many dimensions of
teacher performance that are important to individual learners. In fact, very
few states even have as easy and time-proven a policy reform as statewide open
enrollment (which my state has had for almost a generation now)

[http://education.state.mn.us/MDE/StuSuc/EnrollChoice/index.h...](http://education.state.mn.us/MDE/StuSuc/EnrollChoice/index.html)

and no state yet lets learners shop for schools on an equal per-capita funding
basis whether the school is state-operated or not (as the Netherlands has for
more than a century). Teachers will have plenty of deserved professional
respect and regard from their clients if their clients have power to shop. But
if policy proposals for learner power to shop are rejected, hand in hand with
rejecting proposals for evaluating teachers for their effectiveness in
publicly subsidized instruction, I wonder what the true agenda is. What
assurance do we (we parents, we taxpayers, we members of the general public)
have that all schools are staffed with the best available teachers unless
someone is checking on how the teachers are doing their work?

------
keithwarren
Typically when someone posts a baiting title like that I know not to trust
them. Scholarship and evidence don't need hyperbolic headlines.

~~~
ColinWright
I think you are mistaken. In today's attention economy even the very best
scholarship and the very best evidence-based conclusions will be completely
ignored unless there is a hook to get people, preferably lots of people, to
read it.

Even here on the allegedly rational Hacker News I've seen item after item sink
without trace, whereas others with significantly fewer details and
significantly worse evidence get voted up and read by thousands.

Alex Bellos recently passed on to me what his editor told him: it doesn't
matter how good your article is; if no one reads it, or no one gets to the
point, it's no use and a waste of your time.

I'd like to think "Scholarship and evidence don't need hyperbolic headlines."
Sadly, I think that scholarship and evidence say otherwise.

~~~
Evbn
When is HN alleged to be rational?

------
lifeisstillgood
Isn't the point that _before_ starting a measurement exercise one should have
defined how one was going to interpret the results - and that the exercise is
or is not successful according to whether it meets its own success criteria?

------
jellicle
So if the teacher evaluation method is alleged to be inferior, please provide
a new evaluation method which is superior to it.

Can't? Won't? Then be quiet.

~~~
itg
The idea would be not to waste vast resources and money implementing a broken
idea, and instead to spend some time coming up with a better one. Why put the
education of our children and the careers of good teachers on the line with
such a poorly thought out and half-assed idea?

~~~
jellicle
If you think that no time has been spent coming up with this idea, you are
coming very late to the party. This debate is decades old, at a minimum. Dates
from 1971, says Wikipedia.

<http://en.wikipedia.org/wiki/Value-added_modeling>

It's not something that "Bill Gates" or anyone else made up yesterday. It's a
state-of-the-art approach to a difficult problem. If you have a better way,
let's hear it. Otherwise, be quiet.

~~~
streptomycin
You: "Yeah, okay, this method doesn't work well enough to accomplish anything
and it's ridiculously expensive. HOWEVER, it's been used for decades, and I
don't know any other methods that actually work. So let's continue pissing
money down the drain."

------
derpmaster
This is the corporatist way. Everything is measured so it can be 'tweaked' by
managers and technocrats because school is a business now.

Having a bad teacher to me doesn't matter when any kid can turn on his phone
or laptop and watch free MIT lectures, free Stanford lectures, free
Khan Academy lectures, free Android and iOS developer courses all over
YouTube, etc.

Gates's obsession with teacher metrics would be better redirected toward
building an entirely free university with worldwide standardized credits, so
somebody in Pakistan can have their education credentials accepted if they go
to work overseas, instead of being forced to drive a cab around Montreal
while spending thousands to retake what they already learned.

~~~
crusso
_Having a bad teacher to me doesn't matter when any kid can turn on his phone
or laptop and watch free MIT lectures_

Alternative self-study education is small compensation for having your child
waste a school year in the class of a teacher who has no clue how to keep
control of the kids, and no desire to go over the material in class rather
than assigning everything for homework. In my daughter's case, her
fourth-grade teacher was too busy using that vaunted Internet access to check
her Facebook page.

Also, your "doesn't matter" argument would hold a lot more water if I didn't
have to pay significant taxes to support such a lame public education system.

------
saosebastiao
Never trust anybody who tries to explain statistics using Excel.

