
How to assign partial credit on an exam of true-false questions? - one-more-minute
https://terrytao.wordpress.com/2016/06/01/how-to-assign-partial-credit-on-an-exam-of-true-false-questions/
======
gjm11
As one of the commenters points out, Tao has rediscovered the notion of a
_proper scoring rule_ \--
[https://en.wikipedia.org/wiki/Scoring_rule#Proper_scoring_ru...](https://en.wikipedia.org/wiki/Scoring_rule#Proper_scoring_rules)
\-- and the specific (rather nice, if you don't mind all the scores being
negative and the infinite penalty when something happens that you said
definitely wouldn't) _logarithmic scoring rule_ \--
[https://en.wikipedia.org/wiki/Scoring_rule#Logarithmic_scori...](https://en.wikipedia.org/wiki/Scoring_rule#Logarithmic_scoring_rule).

The Brier score -- you score minus the average squared error between your
prediction and [1 for the right outcome, 0 for the others] -- is also a proper
scoring rule (i.e., incentivizes you to report your probabilities accurately)
and doesn't penalize maximally-wrong answers infinitely. For some purposes
it's a better choice than the logarithmic score.
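
For a concrete feel for the two rules, here's a minimal sketch (my own
illustration, not from the article); p is the probability the forecaster
put on the outcome that actually occurred:

```python
# Logarithmic vs. Brier scoring on a single true/false question.
import math

def log_score(p):
    """Log score for the outcome that occurred: 0 at p=1, diverging
    toward -infinity as p -> 0 (the "infinite penalty")."""
    return math.log2(p)

def brier_score(p):
    """Negated squared error: 0 at p=1, bounded below by -1 at p=0."""
    return -(1 - p) ** 2

for p in (1.0, 0.9, 0.5, 0.1):
    print(f"{p:.1f}  log: {log_score(p):7.3f}  brier: {brier_score(p):6.3f}")
```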

~~~
shoyer
In the traditional formulation, the Brier score is 0 for correct guesses with
p=1, 1 for incorrect guesses with p=1 and 0.25 for guesses with p=0.5:
[https://en.wikipedia.org/wiki/Brier_score#Example](https://en.wikipedia.org/wiki/Brier_score#Example)

When translated into a grading rule (e.g. grade = 1 - 4*Brier, so that a
correct p=1 guess gives +1 and a p=0.5 guess gives 0), this gives us a
score range from -3 to +1. So it's not as harsh as logarithmic scoring,
but it still penalizes confident wrong guesses more harshly than it
rewards confident correct guesses.
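
A tiny sketch of that translation (the affine rescaling here is my
reconstruction of the grading rule described above):

```python
# Binary Brier score, rescaled into the [-3, +1] grading range.

def brier(p_correct):
    """p_correct is the probability placed on the answer that was right."""
    return (1 - p_correct) ** 2   # 0 if sure and right, 1 if sure and wrong

def grade(p_correct):
    return 1 - 4 * brier(p_correct)

print(grade(1.0), grade(0.5), grade(0.0))   # 1.0 0.0 -3.0
```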

~~~
Retric
Sounds interesting. The problem with logarithmic scoring is you're
penalized for reporting your actual percentages vs gaming the system.

EX: If you're 99% sure of each question and there are 100 questions, then
reporting 99% gives the best score when you get exactly 1 wrong. But
getting all 100 correct only adds about +1 point, while randomly there is
a long tail of missing several. Further, scores are not linearly
valuable: trading a lower max score for a higher chance of getting an A
is a positive result.

Ideally you should set things up so accurate estimates give the best results.

~~~
shoyer
> Ideally you should set things up so accurate estimates give the best
> results.

This is true for both the logarithmic score and the Brier score. These are
both "strictly proper scoring rules", which is a formal way of stating your
requirement that it should not be possible to game the system. In both cases,
you get the highest score (on average) if your guess is the true
distribution.
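
As a quick illustration (my own sketch, not from the thread): with true
probability p, the expected log score of reporting q peaks exactly at
q = p.

```python
# Expected log score for reporting q on a binary question whose true
# probability is p; strict propriety means the maximum sits at q == p.
import math

def expected_log_score(p, q):
    return p * math.log2(q) + (1 - p) * math.log2(1 - q)

p = 0.7
best = max((q / 100 for q in range(1, 100)),
           key=lambda q: expected_log_score(p, q))
print(best)  # 0.7 -- honest reporting maximizes the average score
```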

~~~
Retric
The problem is if you're X% sure on each of a set of questions and you
want to get an A, you want to minimize your chances of getting a bad
grade more than you want to maximize your expected score. Alternatively,
there is no point in setting your odds for any question below the level
implied by the minimum number of correct answers needed to pass. EX: If
you need to get X% correct to pass, then don't set your odds below X%.

For a similar example, consider Final Jeopardy: what you bet does not
depend just on your estimate of your odds, but on other factors as well.

Of course, even thinking about this stuff is distracting; you may be
better off picking from a very small set of odds. Say 99%, 90%, 70%, and
50%.

------
whack
There used to be a Decision Science class at Stanford, where this exact scheme
was used. Students were warned all the time to never indicate 100% certainty
on any question, because if you ever did that and turned out to be wrong, you
would fail the entire course because of that one question: even if it happened
to be a minor homework assignment. I always thought this was a great way to
teach people the lesson that you should (almost) never claim 100% certainty in
anything, and that you should view knowledge through a probabilistic
perspective.

~~~
ianai
Unless you're arguing with someone that will take any missing confidence as
sign that you're completely wrong.

~~~
avn2109
>> "Unless you're arguing with someone that will take any missing confidence
as sign that you're completely wrong."

This is 90% of people.

~~~
marxidad
No. You have to stick to your guns that uncertainty is a valid disposition
(maybe).

------
Houshalter
I'm working on something like this right now. We asked 8 trivia questions
on a survey of users of our website, and had people estimate the
probability that each of their answers was right.

First of all, it seems like everyone had exactly the same probability of
getting any given question right. Some people got every question right,
and some got none right, but in exactly the proportions you would expect
by random chance: some were just particularly lucky or unlucky.

The second finding is that probability estimates did not vary much with
the number of questions gotten right. Everyone expected to get about 44%
of questions right, regardless of how many they actually got right.
People who got only 1 right assigned the same probabilities as people who
got 5 right.

Likewise, people who estimated higher probabilities of getting questions
right got the same number of questions right. And a decent percentage of
people were underconfident too, assigning probabilities that were too low
(while getting the same number of questions right).

Lastly, people are really uncalibrated. Some people are just bad at
estimating probability: when they say "80% chance of something", that
thing happens only 58% of the time. You can be trained, in a relatively
short time, to become calibrated, by estimating probabilities and seeing
how many you actually got right. But most people aren't trained, so it
would be a bit unfair to put this on a real test.
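
A rough sketch of how such a calibration check could be computed
(hypothetical data and helper names, nothing from the actual survey):

```python
# Bucket stated probabilities into 10% bins and compare each bin's
# average claim to the fraction of answers actually correct.
from collections import defaultdict

def calibration(stated, correct):
    """stated: probabilities in [0, 1]; correct: matching 0/1 outcomes."""
    buckets = defaultdict(list)
    for p, c in zip(stated, correct):
        buckets[round(p, 1)].append(c)
    return {b: sum(cs) / len(cs) for b, cs in sorted(buckets.items())}

# A well-calibrated person's 0.8 bucket should come out near 0.8; the
# miscalibration described above would show up as {0.8: 0.58} instead.
```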

------
syphilis2
Could this be simplified by asking students what they think their score on the
test will be and adjusting their final score based on how accurate that
estimate was?

Overall it seems like the biggest flaws in this system are that

1: Scores still get mapped to discrete letter grades.

2: A student's goal is not to get the highest score possible, rather it is to
ensure that he or she is most likely to get an "A".

For example:

Consider a 10 question quiz, a student who knows (in truth) she is 80%
likely to answer each question correctly, and who needs to achieve a
score of 0.2 to pass. The student is led to believe that by accurately
estimating her confidence for each question at 80% she is giving herself
the biggest advantage. If she does this and gets 4 of the questions
wrong, her final score will be -1.219 and she will fail. However, if she
had instead underestimated her confidence and given each question a
confidence of 0.6, her final score would instead have been 0.290 and she
would have passed. She could of course go even further: determine the
likelihood of getting N questions wrong, and use that to pick the
confidence level that maximizes her expected score while ensuring she is
most likely to pass the class.
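
For anyone who wants to check the arithmetic, a minimal sketch using
Tao's log2(2p) rule (just re-deriving the -1.219 and 0.290 figures
above):

```python
# Per-question score is log2(2p) when the favored answer is right and
# log2(2(1-p)) when it is wrong; totals match the worked example.
import math

def total_score(p, n_right, n_wrong):
    return (n_right * math.log2(2 * p)
            + n_wrong * math.log2(2 * (1 - p)))

print(round(total_score(0.8, 6, 4), 3))  # -1.219 (honest 80%: fails)
print(round(total_score(0.6, 6, 4), 3))  #  0.29  (sandbagged 60%: passes)
```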

~~~
syphilis2
[http://pastebin.com/zahgmVnk](http://pastebin.com/zahgmVnk)

[http://octave-online.net/](http://octave-online.net/)

I put sample code in a pastebin and a link to octave-online if you want to
verify how I'm thinking about this.

------
scraft
I suppose if you combine this approach with the theory that nothing is
100% certain, you end up with a test that is impossible to get 100% on
(which I guess makes perfect sense, as nothing is 100% certain). But back
to reality: if you set students tests where one wrong 100%-confident
answer means instant failure, it feels like you are no longer testing the
student on the subject matter in question and are instead testing their
ability to play the probability meta-test game.

The task itself of working out how 'confident' you feel about an answer
also feels nearly impossible (almost like a non-technical project manager
asking a developer how long a task will take). It would be fun to see the
results of tests taken using this approach, and to compare the scores
under the new system and the old system against what the students
ultimately get as their final grade in the subject.

------
savanaly
Fantastic and frankly hilarious scheme he has developed here. My favorite part
is that if you state your absolute certainty in true or false and turn out to
be wrong, your score is negative infinity. Only in a math class would I expect
a score of negative infinity I suppose.

~~~
repsilat
That weirdness is kinda just a consequence of wanting the scores for
different questions to be added together. A simpler (to me) but
equivalent way to frame it is to use scores that you multiply together.
Each score is "the probability of the true answer occurring, according to
the student's stated odds."

That is, if the student gave a probability of 1 and was correct, their
score for that question is 1. If they were wrong, they get a score of
zero.

Their final score is interpreted similarly, as a joint probability,
assuming independent questions.
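
A quick sketch of the equivalence (my own illustration): under Tao's
log2(2p) rule, the additive total and the product of the probabilities
placed on the actual answers carry the same information.

```python
# Summing log scores is the same as multiplying the probabilities the
# student put on what actually happened: product = 2**(total - n).
import math

ps = [0.9, 0.7, 0.99]                      # probabilities on true answers
total = sum(math.log2(2 * p) for p in ps)  # additive (log) score
product = math.prod(ps)                    # multiplicative framing
print(abs(product - 2 ** (total - len(ps))) < 1e-12)  # True
```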

------
SloopJon
The SAT used to have a wrong answer penalty to discourage random guessing.
This seems to me a simple, effective way to reflect a student's certainty in
an answer.

~~~
daxfohl
Funny thing is the weights they assigned didn't discourage random
guessing at all; they merely made your expected gain from guessing equal
to the expected gain from not answering, i.e. zero (with five choices and
a 1/4-point penalty, a random guess is worth (1/5)(1) - (4/5)(1/4) = 0).

So even if you had the very slightest idea, you should absolutely guess.
If you took the test multiple times, then probabilistically you'd be
_better_ off always guessing even when you had no clue: some of your
scores would be higher and some lower than if you hadn't guessed, and
usually only your highest score is considered. Most guides overemphasized
the penalty.

That said, it can of course also screw you if you happen to be unlucky in
your guessing. This is what Terry's scheme overcomes.

~~~
tamana
Terry's scheme punishes well-educated people who are conservative in
their claims of confidence.

~~~
Jabbles
Why should being conservative in your confidence be rewarded above having an
accurate assessment of your confidence?

~~~
Jtsummers
In the case of an exam, students are under stress and pressure that will often
lead them to underestimate themselves, especially knowing that if they're
wrong that 100% probability will net them a -infinity score for the exam, even
if they get everything else right.

This also means questions have to be _very_ carefully worded. I recall a
number of exam questions (over many years of school, not one class in
particular) where the "wrong" answers were arguably right depending on how the
sentence was parsed just based on poor use of punctuation or other ambiguous
wording.

------
tikhonj
Here's a totally different idea: lets assume that the true-false questions are
on a related topic. We can give partial credit if wrong answers are
_consistent_ with each other: that is, if there is some (simple) model of the
topic we're asking about that is close to the correct model and produces those
particular wrong results.

Intuitively, if there are a bunch of questions about the same "feature" and
you get them _all_ wrong, all those mistakes stem from the same
misunderstanding. I guess in a well-written test where questions are
conceptually spaced out this is not as much of an issue...

So to give partial credit, we try to find a simple model consistent with the
T/F answers and award credit based on how wrong the model is.

In particular, this would catch the problem where you have a single
misunderstanding that cascades into a whole bunch of wrong answers even if you
actually understood the rest of the system fine. That's one of the main uses
for partial credit in longhand answers, isn't it?

How would we do this systematically? Well, we can't, really. But there are
places where we could. I worked with a professor who did research in program
synthesis, and I remember he had an interesting idea for education: when a
student submits an incorrect program in, say, Scheme, we could try to
synthesize an interpreter _that would make the program correct_ and then
extract what the student's misunderstanding was. (For example: they used
dynamic scoping instead of static scoping.) If you seeded your synthesis
system with the various kinds of things students actually get wrong, this
could be both useful and practical.

You can apply the same idea to grading a test about Scheme. Award partial
credit if somebody has a consistent mental model that just happens to be
wrong. If they got a whole bunch of questions wrong just because they mixed up
scoping rules—but understood everything else—partial credit seems fair.

I guess this is pretty explicitly rewarding _consistency_ with partial credit,
but that also seems fair in a lot of classes like CS.

To be clear, I don't actually think this approach is really practical: it's
more of a thought experiment on what could be _interesting_. Even if it was
possible, doing it on T/F questions would likely require _a lot_ of questions
since each one only provides one bit of input. If you had questions along the
lines of "what does this Scheme program produce", you could get away with
significantly fewer if you were clever about choosing them—but you'd still
want some redundancy to be at least a bit robust to the student making typos
_as well_ as conceptual mistakes.
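
Here's a toy sketch of the consistency-credit idea; the model names, the
agreement metric, and the weighting are all invented for illustration:

```python
# Grade against the candidate "mental model" that best explains the
# student's answers, discounted by how accurate that model itself is.

def consistency_credit(answers, models, truth):
    """answers: student's T/F list; models: {name: predicted key};
    truth: the correct key. Returns fractional credit in [0, 1]."""
    def agree(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a)
    best = max(models, key=lambda m: agree(answers, models[m]))
    return agree(answers, models[best]) * agree(models[best], truth)

truth = [True, False, True, True]
models = {"correct": truth,
          "mixed-up scoping": [True, True, False, True]}
# Perfectly consistent with one misconception: half credit here.
print(consistency_credit([True, True, False, True], models, truth))  # 0.5
```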

------
SubiculumCode
Interesting derivation, but why not receiver operating characteristics
(ROC)? In recognition memory research we often collect confidence ratings
on a 6-point scale for yes/no decisions, for example, plot ROC curves,
and calculate d' discrimination scores.

~~~
SubiculumCode
[https://en.m.wikipedia.org/wiki/Receiver_operating_character...](https://en.m.wikipedia.org/wiki/Receiver_operating_characteristic)

Or am I missing something?

------
primodemus
Nitpick: His name is Terence, not Terrance.

~~~
marxidad
Both spellings are equivalent modulo permutation.

~~~
nightcracker
I understand the attempt at humor, but I can't help pointing out that the
two names are not permutations of each other.

------
cvick
There is nothing in that method that accounts for the possibility that
the test itself may be flawed in some way. I suppose the assumption here
is that the test is 'perfect', in that each question is worded so that
there is one and only one 'right' answer. But as a taker of this kind of
true/false test, if I can only report my confidence in my answer, without
an explanation of "why" I assigned that value, then there is no way to
avoid being penalized on a question that I don't feel I can answer as
asked.

Consider assigning an additional value of '1' to each question initially,
as a representation of the author's confidence that the question is not
flawed in any way. Then, if a significant segment of the test-taking
population indicates that they believe the question is flawed in some
way, the "author's confidence" for that question would be reduced by an
amount that wouldn't penalize me for identifying that flaw, while still
allowing less 'partial credit' for those who answer incorrectly without
being aware of the flaw, or more specifically, not giving as much credit
to those who answered 'correctly' in spite of the flaw.

It's also possible that the question I think is a 'flaw' is really a
'trick' question, and that in order to answer 'correctly' you must
discover the trick. This would still allow for a better result, in that
it more accurately assesses whether I really understand the question or
not.

It also occurs to me that if the test administrator wants to give a test where
they can assign some manner of partial credit to an answer, then they
shouldn't give a true/false test in the first place.

------
pierrebai
I don't think his scheme buys anything. Any point system in which a wrong
answer has more weight than a right answer will simply incentivize
students to aim for the "fair" confidence level, the one that gives -1
for a bad answer. It takes so few 'bad confidence' results to completely
wreck your overall score that it is not worth it.

The main flaw of the scheme is that it is a purely mathematical analysis.
Answers are not based solely on confidence but often rely on reading
comprehension skills. (Even at the purely mathematical level; for
example, missing a squared factor or a minus sign.) So you can give a
100%-confident incorrect answer just because you misread or
misinterpreted the question. Then you get punished hard. Given one's
incapacity for introspectively detecting such misreadings, and the harsh
punishment for such undetectable failures on the part of the student, his
scheme is wrongheaded.

~~~
mabbo
I imagine the first test or two would be a period of the students learning to
understand the system, but after that it's perfectly fair. It's a good lesson
in "You aren't 100% sure of anything, so don't claim you are".

~~~
tamana
Why fail a student for making one specific mistake once?

~~~
kwikiel
Because that will make them more focused on correctly approaching
problems instead of rushing for a solution without checking intermediate
work.

------
soreal
The author calls out: [Important note: here we are not using the term
“confidence” in the technical sense used in statistics, but rather as an
informal term for “subjective probability”.]

But then goes on to use a very technical, statistical version of confidence
where 0% confidence is somehow equal to 100% confidence you picked the wrong
answer.

All of this leads me to assert that despite the math being internally
consistent, it does not apply well to the situation. A rational outside
observer who hadn't read such an article would assume that 0% confidence
in your answer means a 50% chance of being right.

The phrase "confidence that the answer is 'true'" can easily be interpreted as
"confidence that the answer I marked is correct".

~~~
hfanson1
It is confidence that the answer is true: 100% means fully sure the
answer is true, 0% means fully sure the answer is false, and 50% should
be used when true and false are equally likely.

------
tamana
This moves the problem to a quiz of introspection, not a quiz of the
material.

~~~
sophacles
There's a pretty solid case to be made that introspection is missing from a
lot of pedagogy anyway, so this isn't necessarily bad. I can see this being
useful in many places - when testing on bits of interrelated knowledge, having
a confidence score on the test may help people learn to puzzle out information
and synthesize conclusions. Both of those skills are more valuable to the
student than the ability to dump random facts - we have google for that these
days.

------
lebca
That was beautiful. It's been a long time since I've read anything derived as
elegantly and clearly described as that. Got any more?

------
daxfohl
But a possibility of negative infinity raises the question: how certain
is the _professor_ of the answer?

------
tzs
A comment on the reddit discussion of this said that a similar scheme was used
at CMU for midterm and final exams in the "Decision Systems and Decision
Support Analysis" class offered by the decision theory department. The exams
were multiple choice with four choices per question. The commenter linked to
this handout describing the grading for the midterm:
[http://www.contrib.andrew.cmu.edu/%7Esbaugh/midterm_grading_...](http://www.contrib.andrew.cmu.edu/%7Esbaugh/midterm_grading_function.pdf)

If you assigned probability p to the correct choice, your score for that
question was 1 + ln(p)/ln(4). You were not allowed to assign a probability of
0 or 1 to a choice.
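
A minimal sketch of that rule (my own code; clamping to [0.001, 0.997] is
one way to implement the 0/1 substitution the handout describes below):

```python
# Score = 1 + ln(p)/ln(4) for the probability p placed on the correct
# choice; a uniform 0.25 guess scores 0, near-certainty scores ~1.
import math

def cmu_score(p):
    p = min(max(p, 0.001), 0.997)   # the handout's 0/1 substitution
    return 1 + math.log(p) / math.log(4)

print(round(cmu_score(0.25), 3))   #  0.0   (admitted ignorance)
print(round(cmu_score(0.997), 3))  #  0.998 (near full credit)
print(round(cmu_score(0.001), 3))  # -3.983 (the ~ -4 penalty)
```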

The handout points out the importance of thinking about how to approach these
tests ahead of time, and also explains the benefits of using this scoring
system:

\---- begin quote ----

I cannot stress strongly enough the need for each of you to sit down and think
about different strategies for answering the questions. This grading technique
completely removes any benefit of random guessing. Such a guess could be
disastrous. You're much better off admitting that you don't know the answer to
a question. (Placing a 0.25 probability by every option indicates that you
have no idea which answer is correct, and your score will be 0 for that
question). Assessments of probability 0 (0%) or 1 (100%) are not allowed.
These answers will be interpreted as probability 0.001 (0.1%) and 0.997
(99.7%) respectively. Your probability assessments must sum to 1 (100%). A
probability of 0.001 by the correct answer will result in a score of -4. In
contrast, a probability of 0.997 on the correct answer only earns a score of
1. Think about the implications of this before the day of the test.

I strongly recommend that you analyze the grading problem from a decision
analysis perspective. Calculate expected values (or expected utilities) for
various levels of personal uncertainty. Notice what happens if you are
overconfident or underconfident.

This grading scheme makes the midterm harder than a standard multiple choice
test, but this is the point. It has many benefits from a teaching/learning
perspective.

1) It teaches you to apply the techniques that have been discussed in class.
You have to assess your own personal probabilities and apply them to problems
that have very real (and potentially) important payoffs. It is impossible to
get these points across with the few simple lotteries demonstrated during
class.

2) It helps to remove the element of chance from the test. Because of the
severe penalty for guessing, the test will more accurately measure your
knowledge.

3) The test will also measure what you know about your knowledge.

4) By analyzing how you answer the questions, I will be able to determine
which questions are hard and which questions you believe are hard. (A
"flatter" distribution for a question indicates a lower confidence in the
question, and therefore, a belief that it is hard.)

5) It will allow you to appreciate how hard it is to assess probabilities.

\---- end quote ----

------
caf
Interestingly, another way of looking at this result is that the score is a
measure of the total information content of the respondent's answers.

------
amelius
Sorry to be negative, but this sounds like a solution to make something which
is horribly broken a teeny bit less horribly broken.

~~~
frenchy
I think it's a little extreme to say that the current popular way of quickly
assessing large numbers of students is "horribly broken", but I think the most
interesting part of this has nothing to do with student assessment and more to
do with forcing the student to think about how well they know things.

A quiz can be a teaching tool as well as an assessment tool. It's often poorly
designed for the first function, but that doesn't mean it has to be.

