

Teacher effectiveness ratings, programmers, and Khan Academy's data - kamens
http://bjk5.com/post/18609999923/teacher-effectiveness-ratings-programmers-and-khan

======
DarkShikari
Teaching is a lot like programming in this respect -- if you try to enforce
metrics of "performance", you usually end up with optimization for those
metrics and dumber, less-educated students. Similarly, _students_ tend to
optimize too, just from a different standpoint. Examples of this:

If you rate teachers based on the grades their students get, they will usually
award their students higher grades for the same (or worse) performance. This
isn't just a joke; many, many professors have reported their deans bearing
down on them for awarding _too many bad grades_ , often to students who are so
incompetent that they cannot write basic English sentences or perform long
division. The solution? Not remedial courses or letting them drop out (that
would hurt financial aid income), but letting them pass even though they
lack the basic skills to learn in that class, let alone pass it. This also leads
to the student mentality that they are being "given" grades (and thus can beg
for higher grades) as opposed to _earning them_ through their own effort in
the class.

If you rate teachers based on the number of their students who pass a standardized
test, the teachers will focus solely on the contents of that standardized test
and put a huge amount of effort into educating the few students who Don't Get
It, at the cost of all the students that do. I had this experience in primary
and secondary school, where teachers lamented that they had to go over
material for the Standards of Learning tests constantly and repeatedly,
dramatically lowering the amount they actually had time to teach. The end
result is a quick dropping of everyone to the lowest common denominator --
which often just ends up shifting the entire bell curve to a lower point.

If you rate teachers based on student evaluations, your results are nearly
useless. Most students don't bother to write in much detail, since they're
just glad to be done with the final exam (I did this too!) -- and often they
use it for "revenge" on professors they thought were too hard. The teachers
who get the best evaluations are often (though not always) the teachers who
gave the easiest work and grades, and whose students turn out to be the most
woefully unprepared for what comes next. Just like customers of your software
are not always good judges of what they really want, students are often
terrible judges of what good education actually is.

As far as I can see, nobody has come up with a magical metric for rating
teachers. If anything, most of the metrics out there are outright
counterproductive -- and you can see it in schools across America, where
teachers are being constantly urged to inflate grades for their 'customers',
overlook plagiarism, ignore the most talented students in favor of the least
talented ones, and help drive down the quality of education for the next
generation.

~~~
moonchrome
>As far as I can see, nobody has come up with a magical metric for rating
teachers. If anything, most of the metrics out there are outright
counterproductive -- and you can see it in schools across America, where
teachers are being constantly urged to inflate grades for their 'customers',
overlook plagiarism, ignore the most talented students in favor of the least
talented ones, and help drive down the quality of education for the next
generation.

You can do things to improve the quality of the data: use multiple metrics,
don't announce them up front, and certainly don't standardize them, so that
optimizing for a specific test doesn't justify the cost. Also change the
testing as needed, with dynamic feedback, and use the results to get insight
into the problem. Computers can only help with this sort of thing. Ultimately
it's about defining the objective of education - what environment students are
supposed to be prepared for - and exposing them to that environment. Just as
you want to test a system in production as soon as you can: benchmarks are
always synthetic and can't replace "real world" testing.

Data won't capture everything and could in fact be 100% misleading - that's
the nature of imperfect knowledge - but used correctly it can help provide
insight into the situation. Just don't conflate benchmarks and reality; use
the data as a tool for improved feedback and analysis. Of course, this
approach is absolutely incompatible with bureaucratic systems. Heck, the
entire education system is based around optimizing for arbitrary/artificial
evaluation metrics (all that crap about higher education and signaling starts
with primary education).

As for favoring the least talented over the most talented, that's just a
matter of different objectives, and I don't think it changes with more
quantifiable information about students' performance: do you want to raise
the bottom at the expense of the top, or not?

------
sbucher
The author is making a straw man argument, though it does not seem to be
intentional (edit: I am referring to Rubinstein's original analysis). The
author points out that using data for a single class and single year is not a
reliable indicator (second graph). That is uncontested and exactly why ratings
are based on 3 years of data and multiple classes where possible. Even this
cannot provide a single, accurate percentile value. That is why confidence
intervals are used and are rather prominently displayed in all the graphs;
e.g.
[http://www.nytimes.com/schoolbook/school/656-ps-009-teunis-g...](http://www.nytimes.com/schoolbook/school/656-ps-009-teunis-g-bergen/teachers)

Given the example above, there is one teacher who is at the 50th percentile
for career math, but the confidence interval (CI) indicates that this may
really be anywhere from well below average to well above average. In contrast,
there is another teacher with a value of 3, and even the highest bound of the
CI still places them well below average. Conversely, there is a teacher at the
90th percentile, and even the lowest bound of the confidence interval still
makes this a highly effective teacher.
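
That CI logic is easy to make concrete. A minimal sketch, using made-up
percentile and interval values echoing the examples above (not the real
report data):

    # Hypothetical (teacher, career-math percentile, 95% CI) tuples
    teachers = [
        ("A", 50, (10, 90)),  # wide CI: could be anywhere
        ("B", 3, (1, 20)),    # whole CI below the 50th percentile
        ("C", 90, (75, 99)),  # whole CI above the 50th percentile
    ]

    for name, pctl, (lo, hi) in teachers:
        if hi < 50:
            verdict = "unambiguously below average"
        elif lo > 50:
            verdict = "unambiguously above average"
        else:
            verdict = "indistinguishable from average"
        print(f"{name}: {pctl}th pctl, CI [{lo}, {hi}] -> {verdict}")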

In sum, the data make evident that a few teachers are fairly unambiguously
ineffective (at least with regards to test results), a few are unambiguously
effective, but for the majority the most that can be said is that they are
neither the best nor the worst, but somewhere in the middle.

So if you restrict yourself to conclusions merited by the data, then
there is some value to this (i.e. identifying the extremely ineffective and
effective). As it stands now, the teacher with a 3 may very well have tenure
and be paid considerably more than the teacher with a 90 (given that pay is
largely a function of seniority and education, and nearly all teachers have
tenure). That is what this exercise is meant to address.

So you definitely cannot assign a fine-grained single value to every teacher
(and the author's analyses speak to that issue), but that does not mean you
cannot draw any conclusions at all. If folks draw conclusions that exceed what
is supported by the data, that is a fault of the analyst, or perhaps a function
of poor communication or visualization, not evidence that the data is junk.

That said, the tests could be improved (and supposedly are being improved), and
at the very least the measurements place more focus on the subjects being tested.

And the teacher effectiveness data is only 40% of the new approach to
evaluating teachers in NY; the other 60% includes peer evaluation. It would be
useful if that became part of the public data so that a more balanced picture
is available, but it is unclear whether that data will be public.

~~~
tokenadult
_So if you restrict yourself to conclusions merited by the data, then
there is some value to this (i.e. identifying the extremely ineffective and
effective)._

Yes. Anyone who cares about the education of children will welcome any further
improvements in evaluating teachers, but there is already actionable
information available today. Bill Gates makes the good point that the best
system of evaluating teachers should lead to interventions to help each
teacher do better. But the existing performance gap, which may also be revealed
by which teachers other teachers choose for their own children, could guide a
system of encouraging the worst performers to seek other occupations

[http://hanushek.stanford.edu/sites/default/files/publication...](http://hanushek.stanford.edu/sites/default/files/publications/Hanushek%202009%20Teacher%20Deselection.pdf)

while rewarding teachers who teach effectively and are willing to take on
challenging classrooms with students who have the greatest need.

[http://hanushek.stanford.edu/sites/default/files/publication...](http://hanushek.stanford.edu/sites/default/files/publications/Haycock%2BHanushek%202010%20EdNext%2010%283%29.pdf)

~~~
sbucher
Current accountability reform places too much faith in the power of incentives
and hence many metrics (not only teacher effectiveness and not only the NYC
accountability system) are not directed to offering any sort of guidance
towards improving schools. The NYC Progress Reports, which give schools
grades, are another prime example of this. Personally, I hope that the next
stage in accountability will see a shift to making the data more actionable
from the perspective of practitioners and it is something I have been
encouraging for a few years. I think part of the obstacle, though perhaps not
the primary one, is the criticism that measurements are not accurate. So given
the option of re-thinking how to make reports and measurements actionable vs.
just improving what exists (via extending and revising statistical methods,
refining business rules, improving data quality, and including additional data
sources), the latter is winning out, even in areas where there may be
diminishing returns to doing this.

On the other hand, one could argue that the issue of how to improve the
schools is addressed by other projects or other areas in the system, and that
it is sufficient for these tools just to make clear and accurate evaluations. We are
dealing with a big problem, both in terms of depth and significance. So there
is room to approach this from many angles.

Regarding the issue of making the data public (i.e. Gates' Op Ed), there are
competing values at play. Yes, it will be misused and some folks will be
unfairly hurt. I would rather have the data public (not just the teacher
effectiveness data) and allow everyone the option to draw their own
conclusions and perform their own analyses (like the one linked to), so that
as a society we can become more sophisticated and knowledgeable about these
issues. The alternative is that the only people who will have access to this
data are bureaucrats with a specific policy agenda and a handful of academics.
Broader society, especially parents, has a right to weigh in on these issues
and be a part of this discussion, and they cannot do this effectively without
open data.
To reiterate, I don't want technocrats driving education (that is part of the
broader problem with current reform); hopefully open data such as this can
limit that.

The solution is to do a better job of disclosing the information and providing
the necessary context rather than just hiding it. The NYTimes did a pretty
sweet job in this respect. I like where they are going with schoolbook.

I would like to see the media and public consistently push government for
better open data. Right now that is not happening, and agencies are not going
to make big changes in this area in the absence of that pressure.

------
crazygringo
Of course, the real problem is crappy metrics. And this is why "teach to the
test" is given such a bad rap.

If the tests were actually well-written, testing all sorts of critical
reasoning and problem-solving abilities, then teaching-to-the-test would be
the correct strategy, and metrics would be tremendously useful.

It makes me think somewhat of exams used for foreign students of English. The
American TOEFL exam is extremely limited, and test prep can be extremely
effective, without actually improving your day-to-day English. The British
IELTS exam is extremely wide-ranging, and test prep is basically worthless,
because it really does a good job of testing your English. To improve your
test scores, you truly do have to improve your real-world English.

------
tristan_juricek
I don't see a problem with effective metrics; they could help identify where
teachers need help.

The problem is that usually nobody really does proper statistical analysis.
The metrics I've seen are usually things like "mean student grade
performance", in which case what you'd want to do is have a really small class
of smart kids. (Ergo, kudos to the teacher who finds a way to dump the idiots.)
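
A toy simulation of how that gaming works (assumed grade distributions, not
real data): two teachers add the exact same five points of learning, but the
raw-mean metric rewards the one who started with the smaller, stronger class.

    import numpy as np

    rng = np.random.default_rng(0)
    effect = 5.0                                 # same hypothetical teaching effect for both
    strong_small = rng.normal(85, 5, size=12)    # small class of smart kids
    mixed_large = rng.normal(70, 15, size=30)    # typical mixed class

    print("teacher A mean grade:", (strong_small + effect).mean())  # ~90
    print("teacher B mean grade:", (mixed_large + effect).mean())   # ~75

A value-added metric compares growth against each student's starting point
instead, which is what the NYC ratings attempt.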

I do think, however, you could do polling of teachers by asking the students,
teachers, and other staff. In most cases you could easily have enough
coworkers to figure out who's respected and who needs help.

------
yummyfajitas
I'm looking at the original posts that inspired this:

[http://garyrubinstein.teachforus.org/2012/02/26/analyzing-re...](http://garyrubinstein.teachforus.org/2012/02/26/analyzing-released-nyc-value-added-data-part-1/)

[http://garyrubinstein.teachforus.org/2012/02/28/analyzing-re...](http://garyrubinstein.teachforus.org/2012/02/28/analyzing-released-nyc-value-added-data-part-2/)

In particular, the "same grade, different subjects" plot looks like a good
correlation. Some noise, but otherwise exactly what you'd expect. Anyone know
where to get the raw data so I can do more careful plots (e.g., a density plot
instead of a scatter plot [1])?

Note that noise simply means you need more than a single year's data to make a
decision.
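
To see why: if each year's score is the true effect plus independent noise,
averaging n years shrinks the noise by a factor of sqrt(n). A quick check with
an assumed (made-up) noise level, not the real NYC error model:

    import numpy as np

    rng = np.random.default_rng(0)
    true_effect, noise_sd = 60.0, 25.0  # hypothetical percentile and noise level
    for n_years in (1, 3, 5):
        # 10,000 simulated teachers, each averaged over n_years of noisy scores
        est = true_effect + rng.normal(0, noise_sd, (10_000, n_years)).mean(axis=1)
        print(n_years, round(est.std(), 1))  # ~25.0, ~14.4, ~11.2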

[1] In a scatterplot, if two points overlap (which they likely do in this
data), the apparent density is understated.
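
For the density plot itself, a matplotlib hexbin makes the overlap visible
where a scatter hides it. A sketch with synthetic correlated scores standing
in for the released data:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    math_score = rng.uniform(0, 100, 5_000)
    ela_score = np.clip(math_score + rng.normal(0, 20, 5_000), 0, 100)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(math_score, ela_score, s=5, alpha=0.3)  # overlapping points understate density
    ax1.set_title("scatter")
    hb = ax2.hexbin(math_score, ela_score, gridsize=30)  # bin counts show true density
    ax2.set_title("hexbin density")
    fig.colorbar(hb, ax=ax2, label="count")
    plt.show()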

[edit: found the data here:
[http://www.ny1.com/content/top_stories/156599/now-available-...](http://www.ny1.com/content/top_stories/156599/now-available--2007-2010-nyc-teacher-performance-data) ]

------
mathattack
Any field that produces only one metric will find people optimizing for it.
Eventually the metric loses its value. This is true in teaching, programming,
and most everything. Life's not so one dimensional.

