
Algorithmic decision making and the cost of fairness (2017) - shawndumas
https://arxiv.org/abs/1701.08230
======
slavik81
I just saw another article[1] on a study using the same data set. It concluded
that the COMPAS algorithm is no more accurate than random people given the
task on Mechanical Turk.

Those researchers also designed a simple predictive model using only two
factors as inputs: age and number of previous convictions. By comparison,
COMPAS uses more than a hundred. To me, it doesn't sound like those extra
factors end up contributing much.

[1]: [https://www.economist.com/news/science-and-technology/217349...](https://www.economist.com/news/science-and-technology/21734986-short-answer-two-are-about-same-are-programs-better)
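
For intuition, here's roughly what such a two-factor model could look like. This
is a minimal sketch, assuming a CSV export of the Broward County data; the file
name and column names are my guesses, not taken from the study:

    # Hedged sketch of a two-factor recidivism model; the CSV file and
    # column names are assumptions, not taken from the study itself.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("broward.csv")   # hypothetical export of the dataset
    X = df[["age", "priors_count"]]   # the two factors
    y = df["two_year_recid"]          # 1 if the defendant reoffended

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))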

~~~
indubitable
That is a very poorly designed study, precisely because systems like Mechanical
Turk actually work surprisingly well. This is a common theme in the 'wisdom of
the masses.' If you ask a single person how many beans are in a jar, you're
going to get an answer that's generally very wrong. Yet ask 100 people and
average their answers, and you tend to get an answer that's oddly _extremely_
close to the correct one. As the number of people _independently_ asked
approaches infinity, the error approaches 0.

So when you ask '_x_' people to independently judge something, let alone
something that is multiple choice with a correct answer, you're going to get
answers that are far more accurate than a single individual would give you. So
looking at the average answer of '_x_' people, comparing it to an AI system,
and arguing that because they're relatively close, _individual_ people must be
relatively close to the AI, is completely fallacious.

\---

As a tangential aside, there's another quirk to the wisdom of the masses. When
you let the people communicate and try to intelligently organize and use
expertise to come to an answer, this effect disappears and the final answer
again tends to be very wrong. Kind of an interesting perspective on the current
zeitgeist of society and work.

=== _EDIT_ ===

The authors were obviously aware of the wisdom of the masses. Quoting the
paper itself:

 _To determine whether there is “wisdom in the crowd” (7) (in our case, a
small crowd of 20 per subset), participant responses were pooled within each
subset using a majority rules criterion. This crowd-based approach yields a
prediction accuracy of 67.0%. A one-sided t test reveals that COMPAS is not
significantly better than the crowd (P = 0.85)._

That's quite a silly p-value, and on top of that I'm not sure how they can
claim their setup actually controls for the wisdom of the masses.

~~~
IshKebab
> As the number of people independently asked approaches infinity, the error
> approaches 0.

Only if people are an _unbiased estimator_!!

~~~
indubitable
You'd think so, but that's not correct. This is not a straightforward result
of probability with a filter of complexity. It works regardless of bias,
though obviously if everybody were biased in the exact same way then that would
cause things to break down - which is perhaps the reason that coordination
produces a worse result than independent averages.

For another example of it, consider the television show 'Who Wants to Be a
Millionaire?' It's a quiz show where one of the options is for the participant
to ask the audience. And the audience tends to do absurdly well on even the
most esoteric questions, though individually its members are certainly far
from trivia experts. But very few are randomly guessing - their own
experiences and biases lead them to entirely different conclusions. Yet
somehow it produces the correct result time and again.

It's a strange phenomenon, and one that has to be constantly guarded against in
anything involving sampling of people. This is a textbook example of a study
that gets destroyed by it.

~~~
_dps
>>> As the number of people independently asked approaches infinity, the error
approaches 0.

>> Only if people are an unbiased estimator!!

> You'd think so, but that's not correct.

Either you're misinterpreting the technical term "unbiased estimator" here, or
you are aware of some research that I would like to read.

In context, "unbiased" means that if you pick people at random and ask for
their estimates, then on average the too-high estimates cancel out the too-low
estimates (i.e. there is not a bias in one direction or another).

But people as a whole have a poor understanding of many things. One common
finding that appears in social science research, and is often replicated, is
that people grossly overestimate the size of the homosexual population in the
US (the "wisdom of the crowds" often estimates it around 20%, whereas the best
available polling data suggests 3-5%). Here's just one source for this
phenomenon:

[http://news.gallup.com/poll/183383/americans-greatly-overest...](http://news.gallup.com/poll/183383/americans-greatly-overestimate-percent-gay-lesbian.aspx)

"Wisdom of the crowds" is occasionally reliable, but it should not be assumed
to be reliable for any particular problem without verification. It often fails
terribly even on problems that are not very esoteric.

Edit: changed phrasing of final paragraph

~~~
indubitable
As mentioned, there is a difference between bias and uniform bias. In the US,
the media, politics, social media, and even miseducation _(e.g. in my deviance
class we focused on Kinsey's 10%, yet oddly enough never contrasted that
against contemporary results)_ have heavily and uniformly biased the
population on sexuality, leading people to _vastly_ overestimate the number of
homo/bi/trans individuals.

Where it works phenomenally well is in areas where biases have not been
directly instilled into people. This does not mean people are unbiased,
however. Again, trivia knowledge is a good example: crowds can generally do
phenomenally well even at very esoteric questions where biases would lead
individuals to very different conclusions, yet they will invariably fail to
answer ostensibly trivial questions like 'What is the capital of Australia?'
You'll get Sydney; it's not. You'd likely get a similar result for things like
the capital of Pennsylvania.

In a way I view the wisdom of the masses as analogous to machine learning
systems. They do an oddly good job of providing extremely precise answers to a
wide array of questions even when trained with models that do not directly
represent the 'questions'. Yet you can also break the systems, at times
comically, with certain types of queries designed to do precisely that.

And as was the case with machine learning for quite some time, I think people
remain reluctant to utilize it due to its black-box nature. The implication of
your comment is that the wisdom of the masses is little more than incorrect
answers canceling out on average, leaving nothing but a survey of experts. Yet
I think there's no evidence for this (even if it may be a perfectly logical
'kneejerk' reaction), as it works even on things where _nobody_ is an expert;
and if this were the case, then we ostensibly should be able to get comparable
answers from coordination - yet coordination causes the entire system to
collapse.

~~~
_dps
> As mentioned, there is a difference between bias and uniform bias.

You are simply mistaken in your interpretation of the technical term "unbiased
estimator". This has a specific meaning in statistics, and is required for the
convergence property you specified earlier. From Wikipedia [0]:

"In statistics, the bias (or bias function) of an estimator is the difference
between this estimator's expected value and the true value of the parameter
being estimated."

In lay terms, this means that the estimator process "ask lots of people and
average the result" is unbiased only if all the too-high errors, in aggregate,
cancel out the too-low errors.

[0]
[https://en.wikipedia.org/wiki/Bias_of_an_estimator](https://en.wikipedia.org/wiki/Bias_of_an_estimator)
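
To make the distinction concrete, here is a minimal simulation sketch (all
numbers invented for illustration): independent noise averages away as the
crowd grows, but a bias shared by everyone survives averaging untouched.

    # Crowd averaging removes independent noise but not shared bias.
    # Purely illustrative numbers.
    import numpy as np

    rng = np.random.default_rng(0)
    true_value = 1000                    # e.g. beans in the jar
    for n in [1, 10, 100, 10_000]:
        unbiased = true_value + rng.normal(0, 300, n)
        biased = 1.5 * true_value + rng.normal(0, 300, n)  # all 50% high
        print(n, round(unbiased.mean()), round(biased.mean()))
    # The unbiased crowd's mean converges to 1000 as n grows; the
    # uniformly biased crowd's converges to 1500, no matter how large n.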

~~~
indubitable
I'm not sure if we're now going in circles or if you failed to read what I
just wrote. Repeating it:

 _" The implication of your comment is that the wisdom of the masses is little
more than incorrect answers canceling out on average leaving nothing but a
survey of experts. Yet I think there's no evidence for this (even if it may be
a perfectly logical 'kneejerk' reaction) as it works even on things where
nobody is an expert, and if this were the case then we ostensibly should be
able to get comparable answers from coordination - yet coordination causes the
entire system to collapse. ... _"

I'm not sure you even realize all the assumptions you're making. You are
assuming, for instance, that 'guesses' are regularly distributed. That
assumption may be correct in some cases - I expect in many it is not.
Alternatively, you could claim that you're referring not to the individuals in
question as the estimators, but to the entire group. In that case you've spent
a lot of time saying nothing, as it boils down to "people are only correct if
they're correct."

~~~
IshKebab
Sorry, but you're mathematically wrong. The "wisdom of the crowds" does not
work on things where everybody is wrong in the same way. For example, if you
ask a lot of people to estimate income inequality, _you get the wrong answer_
because everybody underestimates it.

[https://www.scientificamerican.com/article/economic-inequali...](https://www.scientificamerican.com/article/economic-inequality-it-s-far-worse-than-you-think/)

The "wisdom of the crowds" is pseudo-nonsense.

------
adjkant
It seems like the problem is that any model that works statistically will not
be fair to individuals who are, intuitively, exceptions to the general trend.
It sounds like algorithms relying on statistics are pretty much guaranteed to
fail the fairness test. It's the wrong tool for this problem, I think.

There are so many factors in morality that I think individual, case-by-case
judgments are likely the best option. Racial/class bias seems to be just as
present in these algorithms, so let's at least allow for human intervention
rather than forcing ourselves to be constrained by an inherently unfair
system. Yes, humans are flawed too, but perhaps we should be spending more
time trying to adjust for and account for those flaws in different ways.

Note: Bias in sentencing/convictions is a huge problem that should absolutely
be worked on. I'm only claiming that these approaches are inherently flawed,
not that what we have now is anywhere close to good enough.

~~~
quotemstr
> There are so many factors in morality that I think individual, case-by-case
> judgments are likely the best option.

What are the inputs to "case-by-case judgments" except Bayesian priors and
observations about a specific case? Your brain isn't doing anything a computer
model can't in principle do. My point isn't that any particular bail pricing
model is acceptable. Instead, I'm arguing that a desire to escape "statistics"
by giving up on models and punting to the brain is futile, since the brain is
just going to use its own statistical model whether the owner of that brain is
aware of it or not --- it _must_, since statistics is baked into the
structure of knowledge, and the brain has no private source of truth.

There is no accessible realm of knowledge about the real world somehow exempt
from statistics, since we never have complete knowledge, and statistics models
our uncertainty.
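
To make that concrete: a case-by-case hunch like "this defendant seems likely
to flee" has the same shape as an explicit Bayesian update. A minimal sketch,
where every number is a hypothetical prior or likelihood rather than real data:

    # Bayes' rule on a made-up flight-risk judgment; all numbers are
    # hypothetical, purely for illustration.
    prior_flee = 0.10            # assumed base rate of flight
    p_ev_given_flee = 0.60       # P(evidence | will flee), e.g. no local ties
    p_ev_given_stay = 0.20       # P(evidence | will appear)

    posterior = (p_ev_given_flee * prior_flee) / (
        p_ev_given_flee * prior_flee + p_ev_given_stay * (1 - prior_flee)
    )
    print(posterior)  # 0.25: the evidence moves the estimate from 10% to 25%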

~~~
adrianN
While I agree with your general point, artificial neural networks and
biological neural networks are sufficiently different that it's not really
fair to compare them just because they happen to share a name. A biological
neuron does a lot more than computing some affine transformation of its
inputs.

Also, humans draw from an absurdly complex pool of inputs, usually called
"common knowledge", that so far has given us a hard time when trying to
replicate it in computer systems. So while in principle there is, imho,
nothing stopping a statistical method from producing decisions at least as
good as a human's, in practice this might not be the case.

~~~
noobermin
To expand on your point and further strengthen your original argument, these
models are trained on a smaller subset of the observables and ideas we
consider when we decide policy. Even then, the comparison between a matrix and
a human brain is poor.

Another issue I have is that we don't know whether statistical bias exists
in the metrics they train on, something they discuss in the paper. We already
know, for example, that blacks are arrested more often and convicted more
harshly than whites for similar crimes[0]. The paper just says that COMPAS
predicts the likelihood of violent crime, not the likelihood of being arrested
for a crime. Moreover, while Table 1 talks about a prediction regarding violent
crime, Figure 2 talks about recidivism with regards to _all crime_, so which
is it? It still seems nebulous to me what COMPAS is actually for.

I'd have to read it more deeply than I have, but I'm not sure how I feel about
the paper.

[0] [http://www.politifact.com/truth-o-meter/statements/2016/feb/...](http://www.politifact.com/truth-o-meter/statements/2016/feb/26/hillary-clinton/hillary-clinton-says-blacks-more-likely-be-arreste/)

------
nabla9
Assume that you are an insurance company with access to customers' social
media profiles.

You find out that if a customer has a friend who posts pictures of fast
motorcycles, that increases the probability of the customer making a $100,000
claim in the future from 1% to 2%. Is it fair to double their insurance
premium to make up for the increase in expected value, even if the number of
false positives is huge? General efficiency increases, and customers can save
money if they select their friends better. This might be how the Chinese
social scoring system works in the future. Society as a whole optimizes itself.
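
(Concretely: the expected claim cost rises from 0.01 × $100,000 = $1,000 to
0.02 × $100,000 = $2,000 per customer, so a pure expected-value premium
doubles, even though 98% of the flagged customers will never make that claim.)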

What if, instead of the insurance claim, the risk is homicide or rape? Should
an algorithmic judge take that into account when weighing the evidence?

Personally I think an unconstrained algorithm in criminal cases is ethically
wrong. Justice should be individualized or it's not justice.

In some other cases it might be justifiable.

~~~
taneq
Don't they do precisely that, across a whole bunch of data? Not "friends on
facebook" but location, habits, age, etc.

~~~
nabla9
Yes. That's the idea when managing insurance risk.

The question is whether there are variables that should be excluded because
optimizing through them has harmful effects on society and is unfair. Smoking
or alcohol use are noncontroversial variables. The quality of your friends or
your political leanings are controversial.

------
sean_anandale
So in sum: "analyzing data from Broward County, we find that optimizing for
public safety yields stark racial disparities; conversely, satisfying past
fairness definitions means releasing more high-risk defendants, adversely
affecting public safety."

In other words, black defendants actually are more dangerous to release and
there is no magic algorithm that bypasses this fact.

~~~
hateduser2
On a topic like this, I think it's important to address the elephant in the
room. I don't think this implies a genetic issue! I'm not an expert, but from
what I know this shouldn't lead _me_ to the conclusion that "black Americans
are predisposed to violence", and, if you know approximately the same things I
do, I think that's probably fair for you too!

~~~
toomanybeersies
I don't think anyone here on HN thinks that it implies a genetic issue, but
rather an issue caused by generations of discrimination and failed attempts to
stop the cycle of poverty and crime that plagues many of these communities.

~~~
trowawee
I am 100% certain, based on 5 years of observation, that there is a
significant portion of users on this site who absolutely believe that racial
disparities are genetic and immutable.

~~~
rdl
I would be amazed if most important physical, mental, or social
characteristics had zero underlying biological or genetic drivers, or if they
were completely driven by underlying biology. Even small underlying
differences can be amplified through compounding effects over a lifetime of
decisions, through culture, etc.

(The easy ones are male/female; racial differences are far less.)

------
femto
Statistics tell you something significant about a group, but little of
significance about an individual. Statistical measures should not be used to
make decisions about individuals. Other names for doing so are: "Sacrificing
the one for the many" or "Presumption of guilt."

------
avinium
I've only had a chance to skim this, so maybe I misunderstood.

That being said, the authors' point seems to be right there in the
introduction - "it's not fair to hold all individuals to the same standard,
irrespective of race."

In other words, the broader community should tolerate laxer pre-trial
sentencing standards applied to people of race A, precisely because they
commit more crime than people of race B?

Pretty odd concept of "fairness" if you ask me.

Also interesting that explicitly adjusting for race does not affect recidivism
predictions by the COMPAS model. In other words, as it stands, the model is
not inadvertently discriminating on the basis of race. I wonder if the same
testing has been applied to sentencing decisions.

~~~
wonnage
re: COMPAS

> As noted above, a major criticism of COMPAS is that the rate of false
> positives is higher among blacks than whites [2].

Policing in America is racist. So anything that takes arrests into account
(prior arrests, recidivism, etc.) will be in error as well.

The line you quoted misses a lot of context - the 'standard' is an algorithmic
risk score with a single threshold. The paper mentions why applying different
treatment by race might pose legal problems (e.g. the 14th Amendment). So this
standard is what we're stuck with. The paper asserts that it isn't fair.
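
To see the tension concretely, here is a minimal simulation sketch (all
numbers invented): the scores are perfectly calibrated by construction, yet a
single shared cutoff gives the group with the higher base rate a higher false
positive rate.

    # Calibrated scores + one shared cutoff => unequal false positive
    # rates when base rates differ. All numbers are invented.
    import numpy as np

    rng = np.random.default_rng(1)

    def false_positive_rate(a, b, n=200_000, cutoff=0.5):
        risk = rng.beta(a, b, n)         # true risk; the score equals it,
                                         # so it is calibrated by construction
        reoffend = rng.random(n) < risk  # outcomes drawn from that risk
        flagged = risk >= cutoff         # detained if score >= cutoff
        return (flagged & ~reoffend).mean() / (~reoffend).mean()

    print(false_positive_rate(2, 6))     # group with mean risk 0.25
    print(false_positive_rate(4, 4))     # mean risk 0.50 -> higher FPR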

~~~
Houshalter
>Policing in America is racist.

Citation needed. Cops visit black neighborhoods more, sure. But they do that
because those neighborhoods have vastly more crime. Which is the cause and
which the effect?

It's also a lot more complicated than that. People throw out studies showing
that blacks are more likely to be caught for drug crimes. But they are less
likely to be caught for more serious crimes, possibly because they trust
police less and are less likely to volunteer as witnesses. E.g.
[https://www.wsj.com/articles/the-underpolicing-of-black-amer...](https://www.wsj.com/articles/the-underpolicing-of-black-america-1422049080)

~~~
pimmen
When you make judgements about individuals based on their race and not their
individual actions, you're being racist. It's as simple as that. This is what
the paper is looking at.

You can argue for being racist if you want but you can't say "it's not racist,
because 'racist' is a word with bad connotations and I think what they do is
good" without being called out on it.

~~~
Houshalter
>When you make judgements about individuals based on their race

No one is doing that. The algorithm does not take race into account. It's just
an assertion that it uses other factors as a proxy for race. I've not seen
this proven anywhere.

And indeed, if it were true, you would have to admit that blacks are more
likely to be criminals. Even after controlling for all relevant variables.
Which goes against the standard narrative that it's just socioeconomic status
or whatever.

>you can't say "it's not racist, because 'racist' is a word with bad
connotations and I think what they do is good" without being called out on it.

It isn't racist. And the word 'racist' is becoming almost meaningless because
of how often it's overused.

------
jimmywanger
There is no cost of fairness here. They're playing percentages. I'd rather
have an algorithm make predictions than a human judge or prosecutor. If a
certain group of people is disposed towards recidivism or not showing up for
trial, it doesn't matter how much they've suffered in the past. Whether to
reach out to them is a separate policy decision.

------
candiodari
If justice were truly blind (which is what I would consider fair), then the
number of defendants incorrectly classified as dangerous would depend on the
likelihood of all the groups they're part of committing dangerous crimes.

You want it fair based on race? OK, don't input race. Everything else stays
the same. Why this complexity?

~~~
dspoka
So this idea seems intuitive at first but turns out to be one of the worst
things to treat unfairness.

There are several reasons for this from both technical and legal perspective.

It is incredibly easy to find statistically significant correlations given
just a few (more than 7) different views of the data. In general these ml
models are not working with less than hundreds or thousands.

If the model learned this suppose racial bias, once, you deleting this column
is not going to stop it from learning it again, and I believe some research
showed that it actually can make the unfairness more severe.

from a legal standpoint a company that may or may not be infringing on rights
could just say, oh we can't be because we don't have these fields in our data:
which makes it harder to monitor and audit wrong doing.

most of the methods that I am familiar try to ease the effects of the learned
biases as a post-processing step for the model.
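
A minimal sketch of that redundant-encoding point, on synthetic data
(everything below is invented for illustration): even after the protected
column is deleted, a correlated feature lets a model reconstruct it almost
perfectly.

    # Deleting the protected column doesn't help when another feature
    # encodes it. Synthetic data, purely illustrative.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 10_000
    group = rng.integers(0, 2, n)                # protected attribute
    # A feature like neighborhood that tracks group 90% of the time:
    neighborhood = np.where(rng.random(n) < 0.9, group, 1 - group)
    income = rng.normal(50 + 10 * group, 15, n)  # weakly correlated

    X = np.column_stack([neighborhood, income])  # protected column dropped
    clf = LogisticRegression().fit(X, group)
    print(clf.score(X, group))  # ~0.9: the attribute is recovered anyway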

~~~
bluecalm
>>It is incredibly easy to find statistically significant correlations given
just a few (more than 7) different views of the data. In general these ML
models are working with hundreds or thousands.

I would be interested in seeing examples. So far in this thread the arguments
were along the lines of: "but then the algorithm punishes the poor
neighborhood instead of the race", but you shouldn't have the address in the
data either, as (I hope) nobody is OK with punishing people for living in a
bad neighborhood. We should only include data we would like to see in the
explanation of the sentencing.

"You are not getting parole because you live in a poor neighborhood" is unfair,
while most people would be OK with:

"You are not getting parole because you willingly associated with people who
committed crimes".

------
thundergolfer
Bernhard Scholkopf explored this at the hour mark in his 2017 ICML keynote.
[https://vimeo.com/238274659](https://vimeo.com/238274659)

Was interesting, and worth a look.

------
crb002
This hits home in Iowa. They just started using it to determine bail.
[https://businessrecord.com/Content/Law-Government/Law-Govern...](https://businessrecord.com/Content/Law-Government/Law-Government/Article/Polk-County-courts-begin-using-pretrial-risk-assessments-/164/788/80952)

------
dblotsky
Personal two cents: the actual maximally "fair" and maximally safe strategy is
not to release anyone on bail. The true positive rate is 100%, and the false
negative rate is 0%.

Beyond that, I agree with nabla9:

> I think an unconstrained algorithm in criminal cases is ethically wrong.
> Justice should be individualized or it's not justice.

------
extension
Imprisoning people without trial, for predicted future crimes, is what's
unfair. Letting an algorithm make the prediction instead of a judge only
punctuates the injustice.

What is a "fair" criteria to base this decision on? Is it fair to throw
someone in jail because they are young, or they got layed off, or they don't
have friends or family? How is any of that better than jailing them for their
skin color?

These are exactly the things that justice is supposed to be blind to.

I'm rooting for the algorithms here, simply because they make the inherent
injustice of pre-trial detention harder to ignore. We can convince ourselves
that this injustice is somehow corrected by the presumed wisdom and compassion
of a human judge. But by formalizing the logic, we have to acknowledge that we
are literally throwing people in jail for plainly unfair reasons.

~~~
quotemstr
Is there any circumstance in which you would support detention before trial?
If not, would you oppose the detention of a serial killer caught in the act?
If you do support pre-trial detention in some cases, how do you distinguish
these cases from those cases for which pre-trial detention is unjustified?
Could such a decision scheme be "fair" in principle? What would make it fair?

~~~
extension
I think detention before trial is always unfair in principle, but likely
unavoidable in practice. I would like to see the issue acknowledged and taken
more seriously, but it's a tricky problem and I have no easy solutions to
offer. Practical mitigations may be the best we can do, which I'll grant may
be expensive, non-trivial to implement, and allow more criminals to roam free.
Here are some vague ideas off the cuff:

* Base decisions only on things that would be relevant in a trial, like evidence and criminal history.

* Nobody should be detained just because they haven't paid bail money. If we decide that someone can be released, it should be immediate and unconditional. The court should charge no more than they can immediately collect.

* Make detention as pleasant and convenient as possible for the accused. We should have facilities specifically for this purpose that are more like hotels than prisons, at least in principle.

* Eliminate any trial delays that aren't strictly necessary, i.e. those due to congestion or bureaucracy.

If a serial killer is caught in the act, there would presumably be enough
evidence available at the bail hearing to justify detention.

------
Houshalter
Getting rid of algorithms means replacing them with humans. And humans are
_far_ worse.

Humans are terribly biased. Racial bias isn't even anywhere near the strongest
bias we have. Unattractive people get sentences twice as long as attractive
people. Judges give far harsher sentences before lunch, when they are
hungriest. Socially awkward people seem to be pretty strongly discriminated
against. Studies have found people discriminate by politics even more than by
race. Job interviews have actually been shown to degrade the quality of hiring
decisions compared to just judging resumes. Before statistics and credit
ratings, getting a good loan required being an old friend of the banker.

It's not just that humans are unfair. We are objectively terrible. Very simple
statistical algorithms beat human "experts" in almost every domain they get
tried on. As early as the 1920s, a statistician came up with a formula that
was better at predicting recidivism than a group of 3 prison psychologists.

Simple linear regression has predicted the success of medical treatments
better than doctors, diagnosed psychosis better than trained psychiatrists,
predicted academic success much better than admissions officers, predicted
loan risk better than bank officers, etc., etc. To say nothing of modern
machine learning methods on modern computers. It's insane we allow humans to
continue doing these tasks at all.

But there has been huge resistance to algorithms in every domain. From people
who stand to lose their jobs and be put to shame by them, of course. But even
outsiders tend to reject algorithms, and to overly trust humans.
Psychologists have actually studied this. They call it a bias labelled
"Algorithm Aversion".
[http://opim.wharton.upenn.edu/risk/library/WPAF201410-Algort...](http://opim.wharton.upenn.edu/risk/library/WPAF201410-AlgorthimAversion-Dietvorst-Simmons-Massey.pdf)
The study showed that humans were willing to forgive the mistakes of humans
far more than those of algorithms, even when the algorithm made far fewer of
them.

This is why they aren't everywhere already. The last thing we need is fear
mongering like this. As shown, humans are far worse. If an algorithm shouldn't
do it, then a human certainly should not.

A big part of the case these people make is a reference to a ProPublica study
that once found a slight racial bias in an algorithm. Yet that study wasn't
peer reviewed, and its findings weren't statistically significant.
[https://www.chrisstucchio.com/blog/2016/propublica_is_lying....](https://www.chrisstucchio.com/blog/2016/propublica_is_lying.html)

Also, A.I. ‘Bias’ Doesn’t Mean What Journalists Say it Means:
[https://jacobitemag.com/2017/08/29/a-i-bias-doesnt-mean-what...](https://jacobitemag.com/2017/08/29/a-i-bias-doesnt-mean-what-journalists-want-you-to-think-it-means/)

~~~
quotemstr
It depends on your objective function, doesn't it? Computer models definitely
do better than humans on the _ostensible_ objective function --- predicting
recidivism --- but computer models don't have decades of exposure to subtle
social cues that signal to humans that the real objective function is one
subtly different from the one that a naive understanding of the system would
suggest.

That's really the beautiful thing about our shift toward algorithms: we'll no
longer be able to rely on these subtle social pressures. If we want to
optimize a specific function, we need to put _that function_ in the open,
where everyone can see it. Then, as the linked article demonstrates, we can
compare the idealized and actually-desired functions and perform a real cost-
benefit analysis.

