
The Algorithm Didn’t Like My Essay - aaronbrethorst
http://www.nytimes.com/2012/06/10/business/essay-grading-software-as-teachers-aide-digital-domain.html
======
InclinedPlane
We keep paying more and more per student for public education. And what do we
get in return? Robo-grading and prison-like treatment of students (closed
campuses, zero tolerance taken to a ridiculous extent). And again we come
around to the notion that if only we'd spend more, everything would get
better, which has worked so well every time it's happened in the past.

The increasing reliance on automated tools and standardized tests is just one
of many symptoms of the increasing detachment between teachers and students.
We have to stop doubling down on the Prussian-style educational system; it
stopped returning dividends decades ago. Moreover, we have to stop acting as
though we can solve our educational problems by throwing more money at
them. We need to empower good teachers and parents, and we need to encourage and
support students. Right now all of the incentives in the system run counter to
those things.

~~~
noonespecial
The problem with "more money" is that the educational system seems to have a
sort of hull speed when it comes to adding money.

I'd guess that the same _or less_ is being spent on actual education today and
the excesses, whatever they may be, are simply siphoned off by the parasites
in the system.

It doesn't seem to matter how much $ per student gets added, teachers still
end up buying classroom supplies with their own money. The system just has
more $180k "administrators" now than it used to.

~~~
btilly
University education has been going up at 10%/year for decades now. Way above
inflation.

It is clear that this growth is driven by an internal dynamic that will
continue until it finally collides with reality. That collision will be...
painful. Given the crushing levels of student debt in this country, it may
happen sooner rather than later.

If I were in charge, I'd institute a simple change. Improve financial aid.
Beef up the federal loan guarantees. Only make either available at schools
whose tuition was in the bottom 70% when you applied. With that incentive, I
think that a lot of schools would find ways to reduce tuition toward levels
that (inflation adjusted) were more than sufficient several decades ago.

~~~
_delirium
> University education has been going up at 10%/year for decades now.

Accounting for inflation and number of students, university budgets have not
been going up 10%/year; at some universities, the real per-student cost of
higher education has actually been declining. What's been going up at 10%/year
is the tuition "sticker price" for students who receive no scholarships or
need-based aid. But this is offset by two contrary trends: 1) a decrease in
the percentage of students who are actually paying the sticker price,
especially at private universities; and 2) large decreases in state funding
for public universities.

Consider the University of California system (all figures below in 2012
dollars). In 1990, it spent $21,000 per student (dividing its total budget by
its total enrollment). Today it spends $16,500 per student, a _decrease_ in
the cost of education of about 25%. Tuition has nonetheless gone up, because
the state-funded portion of its budget has declined even faster: from $16,000
per student in 1990 to $9,500 per student today, and possibly to $8,500 per
student in the coming year. As a result, the student-funded portion has risen
from $5,000 to $7,000 and soon $8,000 per student on average (with a sticker
price a bit over double that).

If you want the UC system to be able to run on tuitions that were sufficient
in 1990, you'd have to also return state funding to what it was in 1990...

------
kenrikm
If it's OK for them to use an algorithm to grade essays, then it would be OK
for students to use an algorithm to write the papers, no? How is that
different from using a calculator? It's a tool, right? Mathematics used to be
only pen and paper. (I know the article says essays would still be graded by
people; I'm making a point.)

Writing is an art form and as such should be treated and graded as one.
Sometimes you need to use language that is grammatically incorrect to get a
point across, and software is not nuanced enough to pick up on this. I can see
this helping to turn out students who are good grammarians but terrible
writers - optimizing for the wrong outcome.

~~~
numlocked
(disclaimer - I work at Kaggle)

I would recommend downloading the dataset[1] and taking a look at some of the
essays. There is a tendency for an educated audience to (justifiably) worry
about exactly what you've articulated, but when you read through the essays,
you get an appreciation of how valuable a tool like this could be. These are
elementary and middle school essays rife with rudimentary mistakes (which are
at times extremely charming) and generally not filled with subtle arguments
prone to misinterpretation.

If algorithms can help every seventh grader in the country write a solid 5
paragraph essay, that's progress.

[1]: <http://www.kaggle.com/c/asap-aes/data>
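
If you want to do that quickly, a minimal sketch with pandas is below. It
assumes the training file is the tab-separated training_set_rel3.tsv with
'essay_set', 'essay' and 'domain1_score' columns; check the data page above
for the exact file and column names.

    import pandas as pd

    # Assumed: a tab-separated file with one row per graded essay.
    df = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

    # Print a couple of graded essays from each prompt to get a feel for them.
    for essay_set, group in df.groupby("essay_set"):
        for _, row in group.sample(2, random_state=0).iterrows():
            print(f"--- set {essay_set}, score {row['domain1_score']} ---")
            print(row["essay"][:300], "...\n")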

~~~
gammarator
I participated in the Kaggle competition, but I'm less optimistic about the
pedagogic value of these systems. A black box algorithm won't explain to a
student what their rudimentary mistakes _are_.

I wrote about this more extensively here:
[http://bellm.org/blog/2012/04/23/arguing-to-the-algorithm-
ma...](http://bellm.org/blog/2012/04/23/arguing-to-the-algorithm-machine-
learned-scoring-of-student-essays/)

~~~
numlocked
Great write up! I hadn't seen it, thanks for sharing.

Two things: 1) Agree about the explanation bit. In fairness, that was not part
of the competition. But I do believe that, given the way the features were
constructed, many of the algorithms could be modified to also provide an
"explanation" of their scoring (a rough sketch of the idea follows after point
2). The value is highly dependent on the features, though (length, for
instance, would probably be a prominent "explanation"), and I haven't actually
tried it.

2) The "This would grade Dickinson/Hemingway/DFW/etc poorly" argument is true
but, again, read the essays. These are serious edge cases - ones that should
be addressed (by, for instance, correlating algorithmic scoring with human
scoring), but this argument doesn't diminish the applicability of the
algorithm to the vast majority of essays and students.
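
To make point 1 concrete, a rough sketch of the idea (the feature names and
numbers are made up; this isn't anything from the competition): with a linear
scoring model over hand-built features, each feature's contribution to the
predicted score is just its coefficient times its value, which could be
surfaced to the student as a crude "explanation".

    import numpy as np

    # Hypothetical features and a hypothetical fitted model's coefficients.
    feature_names = ["word count", "misspelling rate", "rare-word ratio"]
    coefficients = np.array([0.012, -3.5, 1.8])
    essay_features = np.array([412.0, 0.04, 0.22])  # from one essay

    # Contribution of each feature to the predicted score.
    contributions = coefficients * essay_features
    for name, value in sorted(zip(feature_names, contributions),
                              key=lambda t: -abs(t[1])):
        print(f"{name}: {value:+.2f} points toward the predicted score")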

------
alextp
Giving students unrestricted access to a program that evaluates essays over
and over would be the worst possible thing to do. People are generally very
good at adapting to systems, and the algorithm would break down pretty quickly
as people figure out which factors it favors and which ones it penalises
(while human graders are arguably less rigid in their criteria and can hence
recognize these attempts at writing to the letter of the criteria).

~~~
biot
This inevitably leads to an arms race, with the students and software getting
better and better. At some point, sufficiently advanced gaming of essay scores
is indistinguishable from students who naturally write well by following the
rules which lead to good writing. As long as creativity and originality aren't
sacrificed, is it really a problem?

~~~
gammarator
You may be overestimating the sophistication of these algorithms.

At least for this training set, my algorithm rewarded the length of the essay
most of all (something like 65% of the total prediction). The only other
significant factors were misspellings and prevalence of certain parts of
speech.

That model matched the accuracy of human graders and several commercial essay
grading packages.

Students reverse-engineering comparable algorithms won't necessarily have to
write well to score well.
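
For readers curious what such a model looks like, here's a rough sketch along
those lines (a regression over shallow features like length and a misspelling
rate). It's only an illustration, not my actual competition code, and the tiny
word list stands in for a real spellchecker.

    from sklearn.linear_model import Ridge

    # Toy word list standing in for a real dictionary/spellchecker.
    KNOWN_WORDS = {"the", "a", "and", "to", "of", "in", "is", "was", "it"}

    def shallow_features(essay):
        words = essay.lower().split()
        n = max(len(words), 1)
        misspelled = sum(w.isalpha() and w not in KNOWN_WORDS for w in words)
        return [
            len(words),            # essay length - the dominant signal
            misspelled / n,        # misspelling rate
            essay.count(",") / n,  # crude proxy for sentence complexity
        ]

    # Fit on (essay, human score) pairs from the training data:
    # model = Ridge().fit([shallow_features(e) for e in essays], scores)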

~~~
andrewcooke
currently, if a student writes nonsense, there's a fairly significant chance
that they will be caught and penalised. a human can detect nonsense in three
minutes.

in contrast, i suspect algorithmic approaches can be gamed more easily because
they don't adapt in the same way. they're not solving the hard ai problem;
they're grading essays (currently) written for a human reviewer.

for example, what happens if a child learns an existing text by heart and then
substitutes appropriate nouns and verbs to suit the context? say they learn
"We hold these truths to be self-evident, that all men are created equal" and
then, for an essay on their favourite pet, they hand in "We hold these kittens
to be furry, that all kittens are created hungry". That's good grammar; it's
got suitable references to the subject; it's clearly nonsense.

~~~
terangdom
No, it's grammatically incorrect. Compare "these truths... that all men are
created equal, (...)" with "these kittens... that all kittens are created
hungry". The "that" in the second sentence is wrong.

------
mingpan
I think this says more about the assignment and the human grading procedures
than anything else. Even the supposedly creative sections in standardized
tests are surprisingly narrow in what they consider to be good.

~~~
frankydp
Exactly what I was thinking. The ability to reliably predict human scores
says more about the terrible way we "standardize" creative writing than about
the quality of the algorithms.

If the grading rubric for an essay is a checklist, how is this finding a
surprise?

------
ericd
This article is coming to the wrong conclusion from the facts given. That the
grades given by these algorithms matched the grades given by human essay
reviewers on standardized tests almost exactly doesn't mean that our
algorithms are great and that they should be rolled out to normal school
environments to help students learn to write. It means that our method of
grading essays on standardized tests is absolute garbage, because the
algorithms can't judge semantic meaning or depth of insight. We're basically
grading people on how prepared they are for writing About.com content farm
articles.

------
PaulHoule
I was told by the head search engineer for a well-known job board site that
they'd trained a bag-of-words classifier to find badly written resumes that
would benefit from professional editing.

They'd found it was one of the easiest text classification problems that had
ever been tried. With a very small training set they'd attained off-the-charts
accuracy.
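
Something along these lines would be the standard approach (a minimal sketch
with scikit-learn and placeholder data; obviously not the job board's actual
system):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Placeholder training data: 1 = would benefit from professional editing.
    resumes = [
        "Responsible for managing a team of five engineers.",
        "i am a hard working peoples person looking for oportunities",
    ]
    labels = [0, 1]

    # Bag-of-words features feeding a simple classifier.
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(resumes, labels)

    print(clf.predict(["seeking a position were i can use my skils"]))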

------
dhx
How well did the algorithms detect insightful analysis, deep understanding
beyond the immediate subject matter, factual correctness, salience and an
ability to write to a specific audience?

When reading articles on the Internet I look out for superficial factors
including:

‣ Misuse of U+0022 typewriter double quotation marks in place of U+201C and
U+201D double quotation marks

‣ Misuse of U+002D hyphens in place of U+2013 en dashes and U+2014 em dashes

‣ Poor understanding of comma and semicolon usage

‣ Choice of typeface, use of ligatures and micro-typographical features,
amount of leading used…

‣ Misuse of their, there, they’re, where, were, we’re, affect, effect…

‣ RAS syndrome ( _sīc_ )

These factors can assist with analysis of writing. An author can generate
credibility in certain contexts by using triangular Unicode bullets instead of
ASCII asterisks. Use of lengthy and complex words in place of simple English
can also be mistaken for credibility. Analysis of these techniques will show
that the writer is often attempting to belittle readers. Documentaries utilise
similar tactics that the untrained eye will likely mistake for credibility.
Soft filter effects, use of bookcase backdrops for expert interviews and
silence have profound manipulative effects on viewers.

Grading algorithms will likely reward manipulative and belittling writing
techniques and penalise honest superficial mistakes. I would rather read the
honest opinion of an author who misuses _their_ and _there_ than have to pick
apart the carefully crafted writings of a dishonest linguist.

As an aside, I recommend bookmarking The Browser[1] ( _Writing Worth Reading_
). It would be interesting to see the results of algorithmic essay grading
against this curated collection of articles from across the Internet.

[1] <http://thebrowser.com/>

~~~
DougBTX
_How well did the algorithms detect insightful analysis, deep understanding
beyond the immediate subject matter, factual correctness, salience and an
ability to write to a specific audience?_

I'm far from well informed, but my understanding of standardised tests is that
the standard specifies the algorithm, which already ignores your good points
above in order to achieve standardised grading. All that really changes is
whether a human or a robot executes the algorithm; the human insight has
already been squeezed out of the system.

------
_delirium
I wonder how the state-of-the-art in algorithmic essay grading compares to the
state-of-the-art in scoring the quality of online content. Does Google have
content-quality-estimating algorithms that could improve algorithmic essay
grading?

------
feral
A colleague and I entered this competition. We came 9th. We were doing this
for fun, and aren't experts in the domain, but I think our score was within
'diminishing returns' of the better teams.

There are a couple of things to realize about this challenge:

I wouldn't conceptualise the challenge as trying to find features of good
essays. It is more about trying to find features that are predictive of the
essay being good.

This is a subtle but important distinction. One example is that the _length_
of the essay was hugely predictive of the score it would get - longer meant
better.

Is a longer essay really a better one? No. But, at the level the students were
at, it just so happened that the students who were able to write better were
also able to write longer essays.

While you get good accuracy using techniques like this, it's debatable how
useful or robust this general approach is - because you aren't really
measuring the _quality_ of the essay, so much as you are finding features
that just happen to be predictive of the quality. Certainly, it would seem
fairly easy for future students to game, if such a system were deployed.

This isn't a general attack on machine learning competitions - but I wonder
whether, for situations like this that are in some sense adversarial (in that
future students would have an incentive to game the system), some sort of
iterated challenge would be better. After a couple of rounds of grading and
attempting to game the grading, we'd probably have a more accurate assessment
of how a system would work in practice.

There is another important feature of this essay grading challenge, that
should be taken into consideration. There were 8 sets of essays, each on a
different topic. So, for example, essay set 4 might have had the topic 'Write
an essay on what you feel about technology in schools'. To improve accuracy,
competitors could (and I would guess most of the better teams did) build
separate models for each individual essay-set/topic. This then increased the
accuracy of, say, a bag-of-words approach - if an essay-set-1 essay mentioned
the word 'Internet' then maybe that was predictive of a good essay-set-1
essay, even though the inclusion of 'Internet' would not be predictive of
essay quality across all student essays.

It's important to remember this when thinking about the success of the
algorithms. The essay grading algorithms were not necessarily general purpose,
and could be fitted to each individual essay topic.

Which is fine, as long as we realize it. The fact that it was so easy to
surpass inter-annotator agreement (how predictive one human grader's scores
were of the other human grader's scores) was interesting. It's just important
to realize the limits of the machine learning contest setup.

I would guess that accuracy would go down on essays of older, more advanced
students, or in an adversarial situation where there was an incentive to game
the system.
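
To make the per-essay-set modelling concrete, here's a minimal sketch of the
idea (one bag-of-words model per prompt). It assumes the file and column names
from the Kaggle data page ('essay_set', 'essay', 'domain1_score') and is not
our actual competition code.

    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge

    df = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

    # One model per prompt, so prompt-specific words (e.g. 'Internet' for a
    # technology prompt) can carry weight without affecting other prompts.
    models = {}
    for essay_set, group in df.groupby("essay_set"):
        model = make_pipeline(TfidfVectorizer(max_features=5000), Ridge())
        model.fit(group["essay"], group["domain1_score"])
        models[essay_set] = model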

~~~
yummyfajitas
_Is a longer essay really a better one? No. But, at the level the students
were at, it just so happened that the students who were able to write better,
also were able to write longer essays._

It could also be the case that length is one of the features your human
graders are using to grade essays. I.e., it might really be causal, rather
than merely correlated.

In my (anecdotal) experience, teachers certainly do this. While in college I
developed the skill of utilizing excessively long and verbose language while
elucidating simple points simply to incrementally increase the length of
essays [1].

Luckily a great prof in grad school (thanks Joel) beat this bad habit out of
me.

[1] In college I learned to pad my essays with verbose language.

~~~
feral
It's possible. More generally, it's also possible the human graders were doing
a bad job; the ML system can only learn 'essay quality' to the extent that the
training data reflects it.

However, the Kaggle-supplied 'straw-man' benchmark, which worked solely on
the count of characters and words in the essay, had a score of .647 with
the training data. (The score metric used isn't trivial to interpret - it was
'Weighted Mean Quadratic Weighted Kappa' - but for reference the best entries
had a score of ~.8 at the end.)

The score of .647, just using length, is quite high. For length to have this
powerful a causal predictive effect, the human graders would have to be
weighting for length, as a feature, very heavily.

I can't rule that out, but I think it's highly likely that a major component
of the predictive effect of length was correlative rather than causal.
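
For anyone curious, the per-essay-set metric itself is simple to compute; a
minimal sketch is below (the competition then combined the per-set kappas
across the eight essay sets into the 'weighted mean' figure).

    import numpy as np

    def quadratic_weighted_kappa(rater_a, rater_b):
        a, b = np.asarray(rater_a), np.asarray(rater_b)
        lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
        n = hi - lo + 1

        # Observed agreement matrix and the matrix expected from marginals.
        observed = np.zeros((n, n))
        for x, y in zip(a, b):
            observed[x - lo, y - lo] += 1
        expected = np.outer(observed.sum(axis=1),
                            observed.sum(axis=0)) / len(a)

        # Quadratic disagreement weights: ((i - j) / (n - 1))^2.
        idx = np.arange(n)
        weights = ((idx[:, None] - idx[None, :]) / (n - 1)) ** 2

        return 1.0 - (weights * observed).sum() / (weights * expected).sum()

    # Perfect agreement scores 1.0; chance-level agreement scores around 0.
    print(quadratic_weighted_kappa([1, 2, 3, 4, 2], [1, 2, 3, 4, 2]))  # 1.0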

------
fiatmoney
As long as we're training algorithms to recognize correlates of "high-quality
writing" rather than high-quality writing itself, why not use as many
predictive features as possible? I'll bet parental income and education level,
average home price in the school district, and the percentage of students at
the school receiving free or reduced-price lunches are incredibly correlated
with writing quality.

------
reubensutton
My instinct is that most algorithms for this will end up optimizing for essay
length and word complexity rather than actually assessing the content.
However, I imagine this is also how a lot of teachers grade, given that the
article states that essays receive three minutes of attention each.

------
keithpeter
"And humans are not necessarily ideal graders: they provide an average of only
three minutes of attention per essay, Ms. Chow says. "

Well, I'm hardly surprised that an algorithm can duplicate that kind of
superficial marking of an essay!

Good heavens, what are you doing to your children in the US?

------
SagelyGuru
At best, these algorithms can only pick up on obvious spelling, grammar,
punctuation and style errors. They have no concept of semantics. Therefore
students will have to polish the form and abandon the content.

To prove this point, I suggest setting up another competition, this time for
algorithms to generate meaningless 'essays'. I expect they would easily obtain
full marks.

So, what will be the educational outcome of this? In three simple words: more
dumbing down.

Is this another example of the obsession of _officials_ everywhere with the
pedestrian form at the expense of original content? I think it tells us a lot
about their own limited mentality.

