

Student Course Evaluations Get an 'F' - chrismealy
http://www.npr.org/blogs/ed/2014/09/26/345515451/student-course-evaluations-get-an-f

======
tokenadult
The methodology used in the study from Italy[1] mentioned in the article was
quite interesting, because it had actual random assignment of comparable
groups of students to different instructors. The students were then followed
up over their further study in the same university. That's a good kind of data
set to have for examining issues like this.

The alternative means of evaluating teachers mentioned in the article are also
quite reasonable. Both peer review and content analysis of instructor-prepared
materials can identify better teachers with different sources of bias in the
evaluation, allowing a triangulation with student ratings.

Here on Hacker News, from another participant, I learned about a more rigorous
method of student ratings of teachers[2] that ought to be applied (with
appropriate adaptations) to higher education teaching. It appears to work well
in K-12 teaching.

[1]
[http://www.sciencedirect.com/science/article/pii/S0272775714...](http://www.sciencedirect.com/science/article/pii/S0272775714000417)

[2]
[https://news.ycombinator.com/item?id=4559682](https://news.ycombinator.com/item?id=4559682)

------
yummyfajitas
So here is a fun fact about professor performance. When I was a grad student,
various calculus classes had a shared final. We had some professors with great
evaluations who devoted their lives to teaching. We had some German and
Chinese instructors whom students perceived as barely speaking English [1] and
hated. Student
evals were fairly predictable - hard exams would reduce scores, jokes and
sympathy would increase them.

Everyone's students followed the exact same normal curve.

I've been told that the only way one can make an observable difference in
group final scores is to schedule a class at the same time as sports practice
or remedial education. You need better students, not a better teacher.

[1] I've noticed that either students are unable/unwilling to listen to a
foreign accent, or perhaps I am unable to notice them. Recently a friend of
mine said my secretary had a heavy accent - I barely notice it.

~~~
shanusmagnus
I understand the spirit of your comment, and the statistical difficulties in
doing ratings in a meaningful way as described in TFA, but it is absolute
truth that there are teachers who are absolutely _wretched_ and when I was an
undergrad it drove many of us nuts to have no way to identify them. The
problem might have been especially severe when I was an undergrad, since I
went to one of the most populous schools in the US, before the proliferation
of social media and the like made a lot of this information gathering easier.
But it was a real problem. The fact that in the case you mention all the
students had the same normal curve (by which I assume you mean they had the
same means and also the same shape) is interesting, but there's more to being
a good or bad teacher than the numeric outcome of the class, which ought to be
a familiar argument by this time.

Another problem that you mention, and that was also very real, was that some
of the teachers literally could not be understood. I use 'literally' literally
-- I took an automata theory class from a guy who seemed to have been
speaking his own native tongue and not English. That class was an absolute
delight, as you might imagine. Most of my teachers had strong accents, and I
could deal with it, but some were incomprehensible and had no business
teaching in an American university.

Acting like these are problems that don't exist, and that complaints are just
small-mindedness or excuse-making, is just as bad as going the other way and
throwing a fit when you have to 'endure' an unfamiliar accent. Everything's a
matter of degree.

~~~
yummyfajitas
_...there are teachers who are absolutely _wretched_..._

Are there? We thought the same thing about our "worst" professors/TAs. Yet
students' ability to differentiate/integrate seemed unaffected.

Now this suggests one of two things. First, that there really are no "worst"
professors/TAs. Second, that students have the ability to replace a $3-4,000
class with reading a $50-150 book (+ Khan academy, etc).

In the former case it suggests teacher quality doesn't matter much. In the
latter case it suggests teachers are completely unnecessary and our usage of
them should be reduced. This experiment at MIT is promising:

[http://phys.org/news/2014-09-online-classes.html](http://phys.org/news/2014-09-online-classes.html)

_...there's more to being a good or bad teacher than the numeric outcome of
the class, which ought to be a familiar argument by this time._

This is a familiar platitude. I've never seen an actual argument that didn't
appeal to invisible dragons.

[http://www.godlessgeeks.com/LINKS/Dragon.htm](http://www.godlessgeeks.com/LINKS/Dragon.htm)

~~~
shanusmagnus
_Are there? We thought the same thing about our "worst" professors/TAs. Yet
students' ability to differentiate/integrate seemed unaffected._

That's a data point worth paying attention to. However, I'll submit that the
intellectual world is larger than integration and differentiation. I'll
further suggest that there is a world of difference between understanding
enough calculus to integrate and differentiate, and understanding calculus in
sufficient depth to apply its principles to problems and in settings that do
not afford simple template matching -- I wrote an integration/differentiation
package in Scheme my freshman year, but would not claim that it understands
calculus.

This is one of the ways in which a good teacher and a bad teacher can vary by
an order of magnitude. You might counter that your tests were so rigorous that
any such differences would fall out in the data. Perhaps you have achieved
this apex of testing methodology, but I'm suspicious.

 _Now this suggests one of two things. First, that there really are no "worst"
professors/TAs. Second, that students have the ability to replace a $3-4,000
class with reading a $50-150 book (+ Khan academy, etc)._

Or that your measurements are not sufficiently granular, as described above.

_This is a familiar platitude. I've never seen an actual argument that
didn't appeal to invisible dragons._

Many parents produce children who survive long enough to successfully
reproduce themselves. By one of the most fundamental biological heuristics
that's a rip-roaring success, and yet presumably you would aspire to more with
your own children. How's that for an invisible dragon?

EDIT: figured out how to make quotes italicized.

~~~
yummyfajitas
_...good teacher and a bad teacher can vary by an order of magnitude...your
measurements are not sufficiently granular..._

These two claims contradict each other.

If you want to argue that the exams couldn't measure anything at all, go
ahead. You can find examples here:
[http://cims.nyu.edu/~stucchio/classes/fall2008/index.html](http://cims.nyu.edu/~stucchio/classes/fall2008/index.html)

_I'll submit that the intellectual world is larger than integration and
differentiation._

Calculus 1 and 2 are not. I don't see a reason why the non-simple problems you
describe can't be put on an exam, apart from the fact that _it's not part of
Calc 1_. I've also never observed a group of students from any professor who
could reliably do it, so I suspect that your "order of magnitude" better
teachers are a rarity.

_How's that for an invisible dragon?_

Are you seriously asserting that outcomes beyond reproduction are not
measurable?

------
leoh
When I was at Columbia, we used an interesting website called CULPA ("Columbia
Undergraduate Listing of Professor Ability";
[http://culpa.info/](http://culpa.info/)). The site is distinctive in that,
unlike many other sites, students can only give written evaluations and there
are no numerical scores. Reviews were generally thoughtful and, in some cases,
particularly useful in choosing a professor for a course.

~~~
Alex3917
Non-numerical evaluations make much more sense.

It's funny how most of the good universities in the US started out as being
explicitly Christian, but over the last couple centuries the Christianity has
been pretty much wholesale replaced with numerology -- student grades, teacher
evaluations, impact factors, statistical significance, etc. The entire system
is built upon the worship of numbers, and every phenomenon is assigned some
sort of numerical value (to of course be aggregated and statistically
'analyzed') even if it's plainly non-quantitative in nature.

You've got to wonder whether this is some sort of temporary societal blind
spot, or whether this ideology is so toxic and so subtle and complex to refute
that it will just be around forever as the endpoint of scientific history.

~~~
jeangenie
Doesn't matter if it's big business or academia, shitty managers everywhere
love bad metrics.

------
SynchrotronZ
The way things work at my university, at least in the science faculty, is that
word of good or bad teachers just travels through the grapevine. And it is
usually quite accurate.

Then we have (an active, paid) representation in the faculty's decision-making
body, which leads to, in my experience, the faculty actually dealing
with bad professors.

Not everything has to be boxed in by numerics, sometimes simply speaking up
and listening is the easiest and best solution.

~~~
blahedo
> _And it is usually quite accurate._

How do you know? How _would_ you know?

------
Redoubts
And of course, those with legitimate complaints with the Prof simply drop the
class a few days or weeks in, and never get a chance to fill out an
evaluation...

------
JoeAltmaier
Poor professors, getting arbitrary ratings from people they meet once or twice
a week for a few weeks, which determine their professional future. Sounds like
being a student or something. Go suck on a lemon, professors.

------
bradleyjg
In determining whether course evaluations are correlated with teacher quality,
they compared them to grades received in a follow on course, and found that
they had a negative correlation. So far so good. But if grades received in
follow on courses are the gold standard for teacher effectiveness, then why
not use that, instead of the other proxies the article goes on to recommend
(peer evaluations primarily)? Certainly it won't be possible in every case as
not every class has a follow on, but at the very least they should subject
their new proposed methods to the same analysis.

~~~
Alex3917
> if grades received in follow on courses are the gold standard for teacher
> effectiveness, then why not use that

Because grades received in later courses aren't actually a meaningful metric.
There was a theory that you could measure teacher quality this way using
something called Value Added Metrics (VAM), but it's now been completely
debunked. Cf.

[http://www.epi.org/publication/bp278/](http://www.epi.org/publication/bp278/)

[http://www.nber.org/papers/w14442](http://www.nber.org/papers/w14442)

[http://annenberginstitute.org/sites/default/files/product/21...](http://annenberginstitute.org/sites/default/files/product/211/files/valueAddedReport.pdf)

[http://ies.ed.gov/ncee/pubs/20104004/pdf/20104004.pdf](http://ies.ed.gov/ncee/pubs/20104004/pdf/20104004.pdf)

[http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1142237](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1142237)

[http://educationnext.org/files/ednext20022_10.pdf](http://educationnext.org/files/ednext20022_10.pdf)

~~~
yummyfajitas
Your links do not remotely come close to debunking the concept of VAM.

Your first (accurately) says "measurements are noisy" and then inaccurately
says "and other things matter for learning too". Unless I missed something, it
doesn't provide a reason to expect bias in the other things.

The second link says that some VAM models assume student assignment is random
when it is not, and also that measured educational outputs don't matter much
for outcomes anyway. The logical conclusion of this is simply that VAM needs
modification, and maybe we should reduce education. Claiming this debunks the
idea of VAM is like saying "hah, this paper proves OLS doesn't always work
well, therefore regression is debunked".

The third is really long and I couldn't figure out its point. The fourth one
says VAM is noisy.

If you believe VAM is unable to measure the effects of teachers, are you
willing to follow through with the logical conclusions? Namely, the effect
teachers have on students is so small it can't be measured and we should
therefore focus on other things (e.g. reducing cost) instead?

~~~
Alex3917
> Unless I missed something, it doesn't provide a reason to expect bias in the
> other things.

"VAM’s instability can result from differences in the characteristics of
students assigned to particular teachers in a particular year, from small
samples of students (made even less representative in schools serving
disadvantaged students by high rates of student mobility), from other
influences on student learning both inside and outside school, and from tests
that are poorly lined up with the curriculum teachers are expected to cover,
or that do not measure the full range of achievement of students in the
class."

> If you believe VAM is unable to measure the effects of teachers, are you
> willing to follow through with the logical conclusions? Namely, the effect
> teachers have on students is so small it can't be measured ...

We already knew that the contributions of teachers to student achievement were
relatively small long before VAM was invented. The only people who seem to
think that teachers have a very large impact are the pro-VAM folks (e.g. Bill
Gates and Malcolm Gladwell), who all cite the same flawed study.

That said, the reason why VAM can't measure teacher performance isn't that
the contributions of teachers are small per se, but rather that there are
all sorts of problems with all of the proposed measurement systems.

In order to create a VAM system that actually worked, assuming such a thing is
possible, you'd have to make a number of changes to the way that the school
system is designed... e.g. assigning students to teachers at random, designing
standardized tests to measure teacher quality, implementing standardized tests
in all subject areas, etc. But implementing all of those changes may well be
worse for student learning than firing all the bad teachers or whatever. Not
to mention that there is currently a major crisis in teacher retention, with
half of all teachers either getting fired or quitting in their first five
years. (After spending two years and god knows how much money getting their
master's degrees to go into the profession.)

~~~
yummyfajitas
_VAM’s instability_

Your quote concerns variance, not bias.

(The main exception is the bit about "full range of achievement". It's quite
true that teachers of upper-class Asian students will likely have lower VAM
scores than they deserve since the test caps out at 100%.)

_We already knew that the contributions of teachers to student achievement
were relatively small long before VAM was invented._

You dodged the question. Do you think policymakers and politicians should stop
trying to improve teachers, and instead simply try to reduce costs? I.e., pay
teachers less, increase their turnover (so that we don't have to pay expensive
but ineffective senior ones), etc?

~~~
Alex3917
> Your quote concerns variance, not bias.

What were you originally referencing in terms of bias?

> Do you think policymakers and politicians should stop trying to improve
> teachers, and instead simply try to reduce costs? I.e., pay teachers less,
> increase their turnover (so that we don't have to pay expensive but
> ineffective senior ones), etc?

I mean, the answer to any of those types of questions really depends on what
kind of school system you want to create as a whole; you could go either way
on any of them and have either a good or a bad school system depending on the
overall design choices. But my point was that I don't think that VAM has much
to say about any of these questions, because whether or not it's theoretically
possible to measure teacher performance doesn't say much about whether or not
good and bad teachers actually exist and whether or not they have an impact on
students' learning/lives. My point was that the reason we can't measure
teacher performance isn't that the difference between good and bad teachers is
small, even though it is relatively small in the grand scheme of things.

~~~
Alex3917
I know what they are; where do you see Ravitch talking about bias, though?

~~~
yummyfajitas
She's not. She's talking about variance, which does not invalidate VAM.

~~~
Alex3917
Seeing as the entire use case of VAM is as a tool to decide which teachers to
give bonuses to and which ones to fire, variance is pretty much the only thing
that matters.

E.g., if the scores have no stability, so that who gets fired is basically
just a coin toss (as these links suggest), then that's really important; but
if these tests systematically overestimate or underestimate teacher
performance, then that doesn't really matter.

~~~
yummyfajitas
If scores have high variance but are unbiased then you simply need more
samples. I.e., wait 2 years and take a teacher's aggregate VAM over ~12-16
classes.
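
To illustrate just the statistics here (a sketch of the claim, not any linked
study's model): if each class-level score is an unbiased but noisy draw around
a teacher's true effect, aggregating more classes shrinks the spread without
moving the mean. All numbers below are invented.

    import random
    import statistics

    # Sketch of the unbiased-but-noisy claim: each class-level VAM score
    # is a hypothetical true teacher effect plus zero-mean noise.
    random.seed(0)
    TRUE_EFFECT = 0.20   # invented teacher effect, in test-score SDs
    NOISE_SD = 0.50      # invented per-class measurement noise

    def aggregate_vam(n_classes):
        """Average of n_classes unbiased but noisy class-level scores."""
        return statistics.mean(
            random.gauss(TRUE_EFFECT, NOISE_SD) for _ in range(n_classes))

    for n in (1, 4, 16):
        estimates = [aggregate_vam(n) for _ in range(10_000)]
        print(f"classes={n:2d}  mean={statistics.mean(estimates):+.3f}  "
              f"sd={statistics.stdev(estimates):.3f}")
    # The mean stays near TRUE_EFFECT (no bias) while the sd falls
    # roughly as 1/sqrt(n). Bias, by contrast, would survive any amount
    # of aggregation -- which is exactly the distinction at issue here.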

~~~
Alex3917
Right, the links I provided addressed this:

"As expected, the level of uncertainty is higher when only one year of test
results are used [...] as against three years of data [...]. But in both
cases, the average range of value-added estimates is very wide. For example,
for all teachers of math, and using all years of available data, which
provides the most precise measures possible, the average confidence interval
width is about 34 points (i.e., from the 46th to 80th percentile). When
looking at only one year of math results, the average width increases to 61
percentile points. That is to say, the average teacher had a range of value-
added estimates that might extend from, for example, the 30th to the 91st
percentile. The average level of uncertainty is higher still in ELA [English
Language Arts]. For all teachers and years, the average confidence interval
width is 44 points. With one year of data, this rises to 66 points."

"The Mathematica models, which apply to teachers in the upper elementary
grades, are based on two standard approaches to value-added modeling, with the
key elements of each calibrated with data on typical test score gains, class
sizes, and the number of teachers in a typical school or district.
Specifically, the authors find that if the goal is to distinguish relatively
high or relatively low performing teachers from those with average performance
within a district, the error rate is about 26% when three years of data are
used for each teacher. This means that in a typical performance measurement
system, more than one in four teachers who are in fact teachers of average
quality would be misclassified as either outstanding or poor teachers, and
more than one in four teachers who should be singled out for special treatment
would be misclassified as teachers of average quality. If only one year of
data is available, the error rate increases to 36%. To reduce it to 12% would
require 10 years of data for each teacher."

So even with three years of data, the scores are fairly unstable, and this
isn't even taking into account the fact that even if they are stable they
aren't necessarily correct -- there are many more sources of error that would
still be stable over time.

Furthermore, short-term student gains correlate poorly with long-term student
gains, so if you wanted to measure long-term student gains (presumably the
most important metric) then you'd have to wait an additional couple of years
after each grade before testing. This means that, in addition to the 3+ years
of data you'd need to have even a semi-stable variance, it would take at least
5 years total to have even decent data:

Economist Jesse Rothstein found that students' learning decays over time, and
that the teachers who had the biggest added value gains at the end of the
first year didn't necessarily have the biggest gains at the end of years two
or three when the same students were retested. According to the Rothstein
paper, "Three aspects of the results are of note. First, there is much more
variation in fourth grade teachers' effects on fourth grade scores than in
those same teachers' effects on fifth grade scores. [...] Second, the average
persistence of fourth grade teachers' effects one year later is only around
0.3, again consistent with recent evidence. Third, the data are not even
approximately consistent with the notion that this persistence rate is uniform
across teachers: The correlation between teachers' first-year effects and
their two year cumulative effects is much less than one, ranging between .33
and .51 depending on the model and subject. Three-year cumulative effects show
a similar pattern, correlated around .4 with the immediate effect. Even if we
assume that the VAM-based estimates can be treated as causal, a teacher's
first-year effect is a poor proxy for his or her longer-run impact. [...] Only
about a third of teachers in the top quintile of the distribution of two year
cumulative effects are also in the top quintile of the one-year effect
distribution. It is apparent that misspecification of the outcome variable
produces extreme amounts of misclassification. [...] A full account of the
utility of VAMs for identifying good teachers would need to combine the
analyses of lagged effects and endogenous classroom assignments. This would
imply even higher rates of misclassification than are produced by either on
its own."

~~~
yummyfajitas
It sounds like a) we don't have a good measure of what we hope to accomplish
by having teachers and b) most of their measurable effect vanishes quickly.

Sounds to me like "teachers don't matter, let's try to cut costs". Do you
disagree with this?

------
mjhoy
Back in college I used to check "rate my professor" when picking classes. I
learned pretty quickly to avoid the highly-rated.

------
breen
What about taking into account each student's performance in the class? In
other words, partition the evaluations into sets based on the students'
performance. Then you'd end up with a spectrum of instructor ratings, from F
to A, not an average.

If the institution wants to make a choice of instructors for an introductory
course, they might want to select the instructor with the higher low-grade
rating. The intuition here is that well-performing students in an introductory
course might have prior knowledge, so their ratings add noise.

If the institution wants to compare two instructors overall, they can compare
their spectra after subtracting the skew that results from low-performers
evaluating poorly and high-performers evaluating highly. What remains is some
picture of how the instructor handles students of different strengths and
weaknesses, which (I think) is the best measure of a teacher.
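
A rough sketch of the partitioning step; the grades, ratings, and records
below are invented for illustration:

    from collections import defaultdict
    from statistics import mean

    # Invented evaluation records: (student's own grade in the course,
    # the 1-5 rating that student gave the instructor).
    evaluations = [
        ("A", 4), ("A", 5), ("A", 4), ("B", 5), ("B", 4),
        ("C", 3), ("C", 2), ("D", 2), ("D", 3), ("F", 1),
    ]

    def rating_spectrum(evals):
        """Mean instructor rating within each student-grade bucket."""
        buckets = defaultdict(list)
        for grade, rating in evals:
            buckets[grade].append(rating)
        return {g: round(mean(rs), 2) for g, rs in sorted(buckets.items())}

    print(rating_spectrum(evaluations))
    # {'A': 4.33, 'B': 4.5, 'C': 2.5, 'D': 2.5, 'F': 1}
    # Choosing an intro-course instructor would then read off the C/D/F
    # buckets instead of a single overall mean.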

~~~
watwut
"If the institution wants to make a choice of instructors for an introductory
course, they might want to select the instructor with the higher low-grade
rating. The intuition here is that well-performing students in an introductory
course might have prior knowledge, so their ratings add noise."

The complaint in the article is that, with the exception of well-performing
students, student ratings do not measure how much students learned (as
measured by their performance in subsequent courses). Lower-performing
students tend to give lower marks to those teachers who pushed them to learn
more.

Your plan maximizes the satisfaction of low-performing students in an
introductory course. It does not maximize how much they will learn; quite the
opposite. A university should strive for the best teaching, not for the
highest immediate satisfaction. Satisfaction matters too, but it should not be
more important than actual learning.

------
j2kun
From all of my experiences and everything every faculty member and career
panel has ever told me (about research university positions in mathematics and
computer science), teaching does not matter at all. Hiring boards don't look
at your teaching statement even though they require one. Nobody reads a letter
of recommendation if it is about teaching. Nobody reads your course
evaluations. And in particular, the more you neglect teaching (and the more
time you spend on research) the easier it will be to get tenure.

As sad as it sounds, that's how my field is. So is this report specific to
other fields?

------
dzink
UChicago's rating system was quite useful, actually. The survey is given
before the final, so exam difficulty doesn't matter. Professors who deliver
insights you really learn from get the high ratings, and those who don't (aka:
dwell in detail instead of creating understanding, can't communicate, read
slides, make you fall asleep, or teach outdated material) lose out. Over
the years, professor ratings and course bidding points became a much better
indicator of the quality of a course section than the description of a class.

------
pfooti
At the end of the semester, students who enjoyed the course should just tip
their professors. Adding a real free-market incentive would increase class
size and teaching quality, right? Let's let the invisible hand of the market
decide; it is certainly more motivating than course evaluations.

I have taught at two higher-ed institutions. In one, each student was given a
20-question evaluation. Of those twenty questions, two were of any importance
to the instructor: overall instructor and overall course. The other eighteen
were frequently ignored by administration and had absolutely zero influence on
tenure, promotion, raises or retention. Those two examined scores were
averaged over all courses over the entire year, but not weighted - you'd have
a six-person course count as much toward your final average as a sixty-person
course (a quick sketch of this arithmetic follows below). The scores ran from
1 (bad) to 5 (good). You needed an overall average of 3 on each scale to avoid
being fired. Higher scores yielded no tangible (or intangible) rewards. Every
faculty member taught enough grad-level courses (PhD courses in particular
tend to yield all 5s) that nobody ever really worried about their evaluations
from an enforcement perspective. Those of us who cared about actually
improving our practice had better ways to get honest reflections from our
students. Personally, I tended to get low-ish
evaluations from students expecting an easy A ("Professor __ expects his
students to work really hard and learn" "He thinks he's the King of Science"
etc etc), and high evaluations from students who wanted to learn the stuff the
course was about.
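
A quick sketch of that unweighted-averaging quirk, with invented enrollments
and scores:

    # (enrollment, mean 1-5 overall-instructor score) -- invented numbers.
    courses = [(6, 5.0),    # small grad seminar, straight 5s
               (60, 3.2)]   # large intro lecture

    unweighted = sum(s for _, s in courses) / len(courses)
    weighted = sum(n * s for n, s in courses) / sum(n for n, _ in courses)

    print(f"unweighted: {unweighted:.2f}")  # 4.10 -- comfortably above 3
    print(f"weighted:   {weighted:.2f}")    # 3.36 -- much nearer the line
    # Unweighted, six happy PhD students offset sixty middling undergrads;
    # weighting by enrollment would remove that lever.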

Of course, I also got better over time and so did my course evaluations. I
also got better at gaming the system. Add too much reading to the syllabus,
and remove an assignment or two so you can be "responsive to student needs",
for example.

At the other institution, course evaluations were ignored entirely. They
played no role whatsoever in anything. On the other hand, I had a better time
teaching those students, and according to my evaluations that had no impact on
my career, I was doing a better job of it to boot.

Of course, anecdotes vs data, etc etc. Still, I think tips are the way to go.
We already know mean course evaluations can be predicted by the grades
students expect to get [1], so let's just flip that around a little.

[1]:
[https://stat.duke.edu/~dalene/chance/chanceweb/153.johnson.p...](https://stat.duke.edu/~dalene/chance/chanceweb/153.johnson.pdf)

~~~
naner
_At the end of the semester, students who enjoyed the course should just tip
their professors. Adding a real free-market incentive would increase class
size and teaching quality, right? Let's let the invisible hand of the market
decide_

First of all, broke college students certainly aren't going to 'tip' their
professors. I guess perhaps you're thinking part of tuition will be used for
this?

Still, what students 'enjoy' is not necessarily what is most beneficial for
them in a college course. I think you're setting up the incentives wrong.

~~~
userbinator
_Still, what students 'enjoy' is not necessarily what is most beneficial for
them in a college course. I think you're setting up the incentives wrong._

Exactly. The majority of them just want the highest grades with the least
effort, and the professors, who will quite naturally want to get tipped more,
will optimise toward that. It'll create a situation where teaching next-to-
nothing and giving everyone perfect scores gets the highest rewards; i.e. not
much of an education.

~~~
WalterBright
That may be true for some, but not for me. Not only was I paying most of the
tuition bill, I selected classes based on what I thought was knowledge I needed
for the career I wanted. An "easy A" didn't figure into it.

I wasn't interested in paying lots of money for nothing. I also wasn't
interested in attending a university known for easy grading.

~~~
watwut
Quoting article: "There's an intriguing exception to the pattern: Classes full
of highly skilled students do give highly skilled teachers high marks. Perhaps
the smartest kids do see the benefit of being pushed."

Maybe you are in that "highly skilled" group. Unfortunately, the average
student behaves differently.

~~~
WalterBright
I don't know if that necessarily corresponds to being highly skilled or not.

But I like being pushed hard in a class - it brings out the best in me. Easy
A's just engender boredom and contempt.

I remember once being pushed to enter an athletic competition where the field
was not so good. "It'll be an easy win for you" I was told. That made me
completely uninterested. I entered one where I was told "You'll be in way over
your head." and I was. I came in last - but there was never a more satisfying
one to me, because it was my best performance ever, and just being on the field with
those other excellent competitors was exciting.

An analogy would be just qualifying for the Olympic team, even if you have no
hope of getting a gold.

