

New Test for Computers: Grading Essays at College Level - jfc
http://www.nytimes.com/2013/04/05/science/new-test-for-computers-grading-essays-at-college-level.html

======
lmkg
There are a few reasons I'm not excited about this sort of thing, but there's
one big area where I think it would be a distinct improvement over the status
quo: standardized testing.

Most standardized tests in the US, especially the ones deployed at a national
scale, are designed for grading first and testing second. This is a simple
concession to feasibility: if you're trying to evaluate a few dozen million
students, Scantron is the only current tool that is time-efficient,
cost-efficient, and consistent ("fair") at scale. The constraints imposed by
that tool are significant, and tying so many incentives to those tests has
warped the whole education system.

Automated essay grading, no matter how imperfect, would still be an expansion
of the available testing techniques that could be deployed at the national
scale. It would expand the range of skills that are measurable, and therefore
incentivized to teach.

The cynical way of putting that is, no matter how shitty computers are at
grading essays, they're still an improvement when the competition is multiple-
choice questions. I have decided to be excited by that.

~~~
cma
Instead of bothering with grammar, unique personal takes on subject matter,
and other complicated things, teachers can just teach students how to spew out
incomprehensible Markov-chain crap peppered with highbrow vocabulary.
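
For the record, that kind of generator is trivial to build. A minimal sketch,
assuming a word-level bigram chain and a toy corpus (everything here is
invented for illustration):

    # Hypothetical sketch: a word-level bigram Markov chain that strings
    # together plausible-looking but meaningless prose from a source text.
    import random
    from collections import defaultdict

    def build_chain(text):
        # Map each word to the list of words observed directly after it.
        words = text.split()
        chain = defaultdict(list)
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
        return chain

    def babble(chain, start, length=30):
        # Walk the chain, picking a random successor at each step.
        out = [start]
        for _ in range(length):
            successors = chain.get(out[-1])
            if not successors:
                break
            out.append(random.choice(successors))
        return " ".join(out)

    corpus = ("the paradigm shifts the discourse and the discourse "
              "interrogates the paradigm notwithstanding the evidence")
    print(babble(build_chain(corpus), "the"))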

------
a_p
A quote from the article:

“My first and greatest objection to the research is that they did not have any
valid statistical test comparing the software directly to human graders,” said
Mr. Perelman.

That is a good enough reason not to use the software. If it does eventually
pass such a test, I doubt it would ever be better than the _best_ human
grader; at most it would be better than the _average_ grader.

~~~
SilasX
I'm not sure a "good" result for such software would be much progress though
-- any success would be the result of students writing for a human grader. If
you actually employ the software grader, students write with that in mind and
start looking for "tricks" and "holes" in the algorithm that result in a good
grade but diverge from what we humans regard as good writing.

~~~
Retric
You're assuming 100% automation. Having a person skim for obviously bad-faith
essays is probably enough to combat such cheating while still saving 90% of
the effort of fully grading each essay.

------
SatvikBeri
Of course there are a million-and-one possible problems, but the upside is
also massive, especially if an automated grader is used to supplement human
graders. Some examples:

-A student can iterate and improve their essays on their own using an auto-grader. They can get it up to a decent level just through iteration before having to get a human involved.

-Human graders are highly subjective

-Human graders tend to be strongly affected by factors such as "number of hours since they last ate"

-Different humans have different levels of harshness; a machine could help calibrate these (see the sketch after this list)

-Outside, say, the top 10% of colleges, the vast majority of human graders suck. Especially for standardized tests, they typically get paid near minimum wage and the qualifications are along the lines of "have a degree." While automated graders will probably never be as good as the best graders, they don't have to be.
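
For what it's worth, the calibration point above is a standard statistical
trick. A minimal sketch, assuming each grader's raw scores are z-scored
against their own grading history (graders and scores below are invented):

    # Hypothetical sketch: put graders of different harshness on one scale
    # by standardizing each grader's marks against their own history.
    from statistics import mean, stdev

    def calibrate(scores_by_grader):
        # scores_by_grader: {grader_name: [raw scores given by that grader]}
        calibrated = {}
        for grader, scores in scores_by_grader.items():
            mu, sigma = mean(scores), stdev(scores)
            calibrated[grader] = [(s - mu) / sigma for s in scores]
        return calibrated

    # A harsh grader and a lenient grader scoring comparable essays.
    raw = {"harsh": [4, 5, 6, 5, 4], "lenient": [7, 8, 9, 8, 7]}
    print(calibrate(raw))  # both graders' scores now center on 0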

So overall, I'm pretty excited. Even a half-baked solution has a lot of
potential value.

~~~
notahacker
_A student can iterate and improve their essays on their own using an auto-
grader. They can get it up to a decent level just through iteration before
having to get a human involved._

If awarding people degrees based on essays whose form is dictated by trial and
error rather than actual comprehension and interpretation is an _upside_...

~~~
SatvikBeri
I also hold the opinion that it's much easier to learn programming if you can
get feedback from the compiler in a few seconds or minutes as opposed to
waiting two weeks between each compile.

~~~
solistice
You don't wait for 2 weeks to get your compiler errors? How do you handle the
subjectivity of stack traces then? Dude, that's madness.

But seriously, short feedback loops facilitate learning, and if this shortens
the feedback loop, then I'm excited about it.

------
TheCapn
Feedback is more than a letter grade. Real teachers provide insights and
direction on material like an essay to stimulate critical thought and
development. How well can a computer grading system instill this type of
learning in students? Can it at all?

------
geoffpado
I actually had to use something like this for a class in college… maybe 2
years ago? It was the worst thing. It would auto-grade, tell you your score,
and how you could improve it. However, there was almost no way to actually
respond to the grade. It would often tell you that you didn't cover a
specific topic, even when you had. There was no way to point out where you
had discussed it, no way within the program to request a human review,
nothing.

All this, of course, was exacerbated by the class being run by possibly the
laziest professor I had all throughout college (all essays were graded by
this, all tests in class were taken by clicker, all notes were Powerpoint
slides provided by the book's publisher). Possibly, in the hands of someone
who actually cared about teaching, a tool like this wouldn't have been so
bad—but it seems to me that someone who cared about teaching would just grade
manually to begin with.

------
marianne_navada
As an educator, I appreciate technology that provides students a quicker way
to get feedback, but this software only reinforces students' frustration with
essays: no one reads them except the professor. Most are written to be read by
one person. Artists are able to share their work, but for most majors that
require writing, assignments are meant to be forgotten. This is exactly the
reason why students who major in these fields graduate without a portfolio of
their work.

My husband and I developed <https://chalktips.com/> to solve this problem. We
wanted to make essays and school work for college students engaging and
shareable. Students publish booklets and slideshows as part of their
assignment. Students can tweet or share their work on Facebook or Tumblr.

Last semester as part of the final, I had professionals from different fields
comment on student work, and the feedback from students was amazing.

Using AI to comment on essays might be efficient, and probably comparable to
having ONE person grade an assignment. But it doesn't make use of the power of
community. It also does not address the fundamental problem with essays: after
they're graded, so what?

We currently have 1,400 users with more than 2,500 booklets and slideshows
published. Students are embracing the platform. Of course, I'm a community
college teacher without the clout of MIT professors. But we're hopeful that
more students will start demanding assignments that have utility after the
course is over.

------
rollo_tommasi
The problematic aspects of this software are applicable to MOOCs and online
education in general. Truly valuable, high-level education requires precise,
individually calibrated feedback in order to train students to think and
express themselves critically and rigorously (there are also obvious
signalling, credentialing, and networking benefits that will never be
replicable by MOOCs, but that's not relevant to a discussion of pure
educational quality).

Computer-driven mass education will never be able to provide that level of
instruction. At best, it's a tool for bringing the workforce up to the basic
level of competence necessary to function in an information economy, much like
how the original purpose of the public education system was to crank out semi-
skilled factory workers and low-level clerks. This may be a valuable end - it
may help to stem the commoditization of the BA and the attendant tsunami of
student debt - but there needs to be wider acknowledgement that it is actually
fulfilling a purpose that is different from that of traditional higher
education.

------
scarmig
One big advantage of this, if it ever becomes decent: iteration.

Make it available to students. They can write an essay, submit, and get
feedback on it. Repeat.

It'd be difficult to get a machine to bring someone to great writing. But
clear, concise prose that gets to the point? That seems reachable. And that
would be an improvement over the status quo.

In other words, machines won't anytime soon be able to know that a Wodehouse
is superior to an Orwell. But an Orwell to, say, a Thomas Friedman or typical
ninth grader? Totally.

~~~
notahacker
Machines will also "know" that the course leaders' model essays put through a
bog-standard article-spinner are superior to Orwell.

------
ISL
One problem: Essay quality, and that of writing in general, is subjective.

It's possible to assess objective properties of writing, but doing so cuts off
the top end of the spectrum.

    
    
      Three quarks for Muster Mark!
      Sure he has not got much of a bark
      And sure any he has it's all beside the mark.
    

Grade? F. Author: Joyce. Importance: Inspired Gell-Mann's naming of the quark
[1].

[1] <http://en.wikipedia.org/wiki/Quark>
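
To make "objective properties" concrete: a minimal sketch of a crude
Flesch-style reading-ease score, assuming a rough vowel-group syllable
counter (a heuristic, not a real linguistic tool). It duly rates a pedestrian
sentence as more "readable" than the Joyce lines, which is exactly the
top-end blindness described above:

    # Hypothetical sketch: a crude Flesch reading-ease score. Objective
    # and repeatable -- and blind to whether the text is Joyce or junk.
    import re

    def count_syllables(word):
        # Rough heuristic: count groups of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch(text):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return (206.835 - 1.015 * (len(words) / sentences)
                - 84.6 * (syllables / len(words)))

    joyce = ("Three quarks for Muster Mark! Sure he has not got much of a "
             "bark And sure any he has it's all beside the mark.")
    plain = "The cat sat on the mat. The dog sat on the log."
    print(flesch(joyce), flesch(plain))  # plain prose scores higher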

~~~
solistice
But going into a writing class to be graded for the subjective quality of my
writing doesn't really make sense, does it? I mean it boils down to whether I
get lucky with the grader, whether his subjective evaluation of my work is
positive, and it tells me almost nothing. I now know Professor Manning likes
my writing. But Professor Sieboldt doesn't. I honestly don't think subjective
art should be graded; there's no use in grading it. For example, imagine I want
to hire a painter. An objective appraisal of his skill would be useful for
that. A for Stroke Quality, B+ for Perspective Drawing. Fair enough. But if I
want him to create an artistic work for me, I'd look through his portfolio
first. Do I like the images he's done before? Do I like his style of painting?
What Professor Manning thinks about his style is absolutely useless to me, and
it's even worse if he mixed that grade with his technical prowess, because
then all the grades he gave me are completely useless.

But Hackers and Painters, right? We do it too, don't we? 5 years of Java
experience, 7 years of PHP, and 94 years of Haskell. So is it 5 years of Java,
or 5x1 year of Java, or Java as something you do for 2 hours every 2 months?
"5 years of Java" tells you very little, doesn't it? As does A+ for Stylistic
Expression.

We're grading the Arts like the Sciences. Judge a fish by its ability to
climb a tree and...

------
noonespecial
They're facing the same problem as testing (diagnosis) in large-scale health
care: how to test (and teach, really) without having to take on the
bothersome, unscalable task of actually getting to know the student.

Wouldn't it be great if we could just figure out "education" once, code it up
and then let it run on all of the students? It's easy really; all we need are
identical students.

------
danso
So, for this to be even a feasible solution, EdX needs to show more proof of
concept...like, here's a sample question, here's the 100 sample answers that
were used to machine-learn against, and here's how the auto-grader graded
these 10 different answers (both good and bad).
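
The demo being asked for is easy enough to sketch in outline. A minimal
version, assuming scikit-learn and a bag-of-words ridge regression; the essays
and scores below are made-up placeholders, not anything EdX has published:

    # Hypothetical sketch of the demanded proof of concept: train on
    # human-scored sample answers, then show predictions on held-out ones.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge

    # Placeholder data; a real demo would use the 100 graded sample answers.
    train_essays = [
        "photosynthesis converts light into chemical energy in chloroplasts",
        "plants are green because of leaf stuff and they like the sun",
    ]
    train_scores = [9.0, 3.0]  # human-assigned scores

    vec = TfidfVectorizer(ngram_range=(1, 2))
    model = Ridge().fit(vec.fit_transform(train_essays), train_scores)

    held_out = ["chlorophyll absorbs light to drive the synthesis of sugars"]
    print(model.predict(vec.transform(held_out)))  # predicted score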

Why should anyone have faith that EdX has cracked the perfect mix of machine
learning and NLP and other associated technologies needed to provide accurate
assessments of essays? Even Google from time to time has trouble guessing
intent. Wolfram Alpha even more so. If the engineers at these companies can't
always get it right -- and it's not just engineering talent, but data and data
analysis -- why should a school entrust one of its most important functions to
EdX?

Grading is something _critical_ to get right, not just "almost there." Think
of how much time you spent arguing with a professor that your score deserved an
8 instead of a 6, enough points to bring you from a B to a B+ for the
semester...Think of the incentive to do so (career prospects). If the machine
is ever wildly off just even in one case, would you ever take its assessments
as gospel? Multiply yourself by 20 or 50 or whatever a usual professor's
lecture load is, and now a ton of lag has been introduced into the grading
workflow.

Obviously, there are ways to mitigate this. One would be to write questions
that are so narrowly focused that there are very clearly right and wrong
answers...Which of course raises the problem of: why not just give everyone
multiple choice tests, then?

The sad thing is that even if these machine graders were empirically better
than human graders, they can't just be better; they have to be almost perfect.
If a plane's autopilot failed, on the whole, 1 out of 1,000 times compared to
5 out of 1,000 times for human pilots...how much more pissed do you think
victim families are going to be when they find out their loved ones died
because of an algorithmic malfunction rather than just pilot error? People, for
obvious reasons, don't like thinking of their work or their destinies being
defined by deterministic machines...so if those machines aren't 99.99% right,
then the fallout and pushback may cost schools more than the savings in
professor-grading time.

~~~
jurassic
> Think of how much time you spent arguing with a professor that your score
> deserved an 8 instead of a 6, enough points to bring you from a B to a B+
> for the semester.

I spent zero time doing this, and I never even realized how widespread it was
to try to negotiate grades until I helped TA a few classes. Now I see how much
professors hate the students who do.

Seriously, what job is going to care that you got a B vs B+ in a class? If
that subject is really so important to their mission, they are probably
looking for A+ candidates anyway.

~~~
danso
GPA is, unfortunately, one metric that is used to quickly sift applicants. So
one B+ over a B may not matter for one course, but can be quite significant
over the course of four years.

~~~
hmsimha
Additionally, certain scholarships require a minimum GPA be met. A B vs. a B-
can be a huge difference to someone faced with losing their tuition check.

------
JEVLON
Well, I referenced a Harvard Business School public journal and it got flagged
as being from Wikipedia. So I do not believe we should let an automated system
mark our essays just yet.

------
therobot24
As a precursor to grading/giving feedback on an essay, I usually pull out a
Naive Bayes classifier for topic classification and see if the student's
report is correctly classified. It's kind of fun and a good indicator of what
I'm about to read.
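
A minimal sketch of that kind of pre-grading check, assuming scikit-learn's
multinomial Naive Bayes and an invented set of previously labeled reports
(all topics and text are placeholders):

    # Hypothetical sketch: classify a report's topic with Naive Bayes and
    # flag it when the predicted topic doesn't match the assignment.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Invented training data: past reports with known topics.
    reports = ["gradient descent converges on convex loss surfaces",
               "the mitochondria regulate cellular respiration",
               "stochastic optimization and learning rate schedules",
               "enzyme kinetics in the citric acid cycle"]
    topics = ["ml", "bio", "ml", "bio"]

    vec = CountVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(reports), topics)

    new_report = "we tuned the learning rate of our classifier"
    predicted = clf.predict(vec.transform([new_report]))[0]
    if predicted != "ml":  # expected topic for this assignment
        print("possibly off-topic; predicted:", predicted)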

