
Notes on AI Bias - andrevoget
https://www.ben-evans.com/benedictevans/2019/4/15/notes-on-ai-bias
======
twa927
> Until about 2013, if you wanted to make a software system that could, say,
> recognise a cat in a photo, you would write logical steps. You’d make
> something that looked for edges in an image, and an eye detector, and a
> texture analyser for fur, and try to count legs, and so on, and you’d bolt
> them all together...

I write a lot of algorithms like this (well, not for images). Does anyone know
if this class of algorithms has a name? I've been calling them "heuristics",
and I think they fall under "AI".

~~~
layoutIfNeeded
I would call it “classical” machine learning.

~~~
twa927
Hmm, I don't think there's any "machine learning" here: a human hard-codes
some thought process, mostly using simple statistics/thresholds to define,
e.g., what a "fur texture" looks like.

~~~
pedrosorio
Machine learning was extensively used in image processing before 2013 / deep
learning.

The main difference is that you’d write code to extract features from the
image and then learn a model using those features (as opposed to using the
pixel data directly and learning a model from that as in CNNs).

As an example, you wouldn’t necessarily write code for “fur texture” but
instead would extract histograms of pixel brightness gradients and feed those
(along with other things) to a machine learning algorithm. In this example,
fur texture would generate a different histogram (to be used as a feature)
than skin texture.

[https://en.m.wikipedia.org/wiki/Histogram_of_oriented_gradie...](https://en.m.wikipedia.org/wiki/Histogram_of_oriented_gradients)
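
To make this concrete, here's a minimal sketch of that kind of pipeline, assuming scikit-image and scikit-learn are available; the images and labels are random placeholders, not a real dataset:

```python
# Sketch of a pre-deep-learning pipeline: hand-coded feature extraction
# (HOG) feeding a classical learner (linear SVM). Data is a placeholder.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_features(image):
    # Histogram of oriented gradients: summarizes local edge directions,
    # so fur and skin produce different histograms.
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))        # stand-in grayscale images
labels = rng.integers(0, 2, size=20)     # stand-in cat / not-cat labels

X = np.array([extract_features(img) for img in images])
clf = LinearSVC().fit(X, labels)         # the model is learned on the features,
print(clf.predict(X[:3]))                # not on the raw pixels
```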

~~~
twa927
OK, so this depends on which algorithms are used for the feature detection
("edges in an image, and an eye detector, and a texture analyser for fur").
I'm guessing that hand-coding an algorithm to detect edges in an image can be
done successfully, but it looks much harder for "an eye detector", so that
needs "machine learning".

What I meant when asking for a name was the class of algorithms where the
feature extraction is done using hand-coded algorithms.

~~~
layoutIfNeeded
You can call them “handcrafted decision trees” then.
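
For illustration, a toy of what that might look like; every threshold here is a made-up number a human chose, not a learned parameter:

```python
# Toy "handcrafted decision tree": every threshold is a number a human
# chose by hand, not a parameter learned from data.
def looks_like_cat(edge_density, fur_score, eye_count, leg_count):
    if eye_count < 1:
        return False
    if fur_score > 0.6:                # hand-tuned fur-texture threshold
        return leg_count in (3, 4)     # allow for one occluded leg
    return edge_density > 0.4 and leg_count == 4

print(looks_like_cat(edge_density=0.5, fur_score=0.8,
                     eye_count=2, leg_count=4))  # True
```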

------
fvdessen
> Since Amazon’s current employee base skews male, the examples of ‘successful
> hires’ also, mechanistically, skewed male and so, therefore, did this
> system’s selection of resumés. Amazon spotted this and the system was never
> put into production.

Couldn't they have retrained the system with a 50/50 mix of male/female
resumes? Or restricted the algorithm to sorting male resumes? Or maybe
resumes don't actually correlate with success at Amazon at all...
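
A sketch of the first suggestion, rebalancing the training data by downsampling the majority group (the DataFrame and its column names are hypothetical):

```python
# Downsample the majority group so the training set is 50/50.
# The DataFrame and its columns are hypothetical.
import pandas as pd

resumes = pd.DataFrame({
    "gender": ["m"] * 8 + ["f"] * 2,
    "hired":  [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
})

per_group = resumes["gender"].value_counts().min()
balanced = resumes.groupby("gender").sample(n=per_group, random_state=0)
print(balanced["gender"].value_counts())  # m: 2, f: 2
```

Note this only equalizes the counts; if the "successful hire" labels themselves encode a biased process, balancing the inputs doesn't fix that.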

~~~
DuskStar
One situation I could see leading to this result (Amazon cancelling their
resume filtering software with the excuse that it 'skewed male') is that

1. The AI system accurately predicted employee success across both genders

AND

2. The AI system predicted that women would do worse than men

That's politically embarrassing and something that you can't necessarily 'fix'
by improving the system. (see: all the 'will this person commit a crime if let
out on parole' systems that end up _accurately_ discriminating based on race)

This isn't to say that women are worse engineers than men, or anything of that
sort - only that the applicant pool to Amazon was skewed, or women were
treated worse in the workplace and thus performed worse, or a dozen other
possible causes. (And only in this hypothetical scenario! I have no inside
info from Amazon!)

~~~
Bartweiss
In this case, it appears to instead be a matter of journalists focusing on
totally the wrong aspect of a story for more drama. Buried deep in the
original Reuters piece is this offhand mention:

> _Gender bias was not the only issue. Problems with the data that underpinned
> the models’ judgments meant that unqualified candidates were often
> recommended for all manner of jobs, the people said. With the technology
> returning results almost at random, Amazon shut down the project, they
> said._

Apparently the recommendation system really did _create_ gender bias,
inherited neither from real differences nor from replicated human biases. (It looks
like an issue with mismatched training data and task.) But that initial bias
was found and corrected (2015) more than a year before the project was
cancelled (2017) for providing "random" results. I think this is the most
extreme case of algorithmic bias I've ever seen, but also the least commonly
relevant; Amazon appears to have built a model which contained almost no rules
_except_ sexism, and scrapped it for not knowing anything worthwhile.

[https://www.reuters.com/article/us-amazon-com-jobs-automatio...](https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G)

~~~
DuskStar
That is certainly another plausible explanation - and a less culture-war
infused one, too. Thanks!

------
gambler
_> The most obvious and immediately concerning place that this issue can be
manifested is in human diversity._

I swear, when someone starts building autonomous killer robots, the first set
of concerned articles will probably be asking whether robots were properly
trained to target all genders and races with equal accuracy. This is not a
sensible way to approach AI ethics.

 _> It was recently reported that Amazon had tried building a machine learning
system to screen resumés for recruitment. Since Amazon’s current employee base
skews male, the examples of ‘successful hires’ also, mechanistically, skewed
male and so, therefore, did this system’s selection of resumés._

There is nothing "mechanistic" about this. It depends on how you select sample
resumes and how you split them between "good" and "bad" labels.

I worked on a similar thing as an "encouraged" side-project at a certain
company. Except I realized from day 1 that using AI on resumes is a bad idea
and aimed to show this with data. My model aimed to detect people who would
quit or get fired within the first 6 months (with the intent of lowering
their priority for interviews, supposedly). It miraculously achieved 85%
accuracy... by figuring out how to detect summer interns.
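
Something like this synthetic reconstruction (all features and numbers invented) shows how that happens: one proxy feature dominates, and headline accuracy looks great while the model learns nothing about everyone else.

```python
# Synthetic illustration of a spurious proxy: interns always "leave
# within 6 months" here, so accuracy is high while the model learns
# nothing useful about the other candidates. All numbers are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
is_intern = rng.random(n) < 0.3
other_feature = rng.random(n)                     # carries no real signal
left_in_6mo = is_intern | (rng.random(n) < 0.1)   # interns always leave

X = np.column_stack([is_intern, other_feature])
model = LogisticRegression().fit(X, left_in_6mo)
print(model.score(X, left_in_6mo))  # ~0.9 "accuracy" from the proxy alone
```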

Framing this problem as "bias" and especially hyper-focusing everyone's
attention on the diversity aspect of it is extremely irresponsible. (I'm not
saying that's what the author is doing, but that's definitely what's being
done at large.) Fundamentally, there are _significant_ higher-level problems
with using statistical ML models for things like hiring or crime prediction.

~~~
joshuamorton
>Framing this problem as "bias"

Except that's exactly what it is. Much as your model was biased against
interns.

> and especially hyper-focusing everyone's attention on diversity aspect of it
> is extremely irresponsible.

Why? Pointing out a specific and concrete harm that badly designed ML models
cause is irresponsible? Just because the same kind of methodological flaw can
cause other harms, it's irresponsible to use a motivating example?

~~~
gambler
_> Why? Pointing out a specific and concrete harm badly designed ML models
cause is irresponsible?_

In my opinion, yes, if it leads most readers to misjudge some fundamental
properties of the problem as a whole. Again, I'm not saying this article is
guilty, but most are.

~~~
joshuamorton
> In my opinion, yes, if it leads most readers to misjudge some fundamental
> properties of the problem as a whole.

Which problem? The general statement of this problem is "models, trained on
[somehow] misrepresentative data [or even technically representative data] can
draw unintended conclusions that lead to harm". Specifically in this case, the
harm was "the model was basically just trained to ignore all women applicants
due to bad inference of conditional probabilities".

This is a common thing. Because our society draws lines and has bias, it's
fairly common for modelling failures to exist along those lines. Indeed,
sometimes the failures are mostly harmless and immediately obvious, but often
they aren't. People building models should be made aware of those failure
scenarios, and be especially aware of the ones that affect underrepresented
groups, because those are the groups the model is most likely to fail on if
you aren't actively looking.

And this stuff is pervasive. Facial recognition tech is much worse at noticing
the faces of darker-skinned people [1]. Some of this is because the people
building the common models (eigenfaces etc.) didn't use diverse skin tones,
but some of it goes back further: white balance in film was tuned for lighter
skin tones until the 90s [2]. Some of that has likely persisted into modern
film and camera technology, unfortunately. People working with data need to
understand their data, and that means understanding how bias infests their
data.

> fundamental properties of the problem as a whole

You've yet to state the "whole problem" or the fundamental properties that
people might misjudge. So I'm unclear what they are.

[1]: Arguably an advantage now.

[2]: [https://petapixel.com/2015/09/19/heres-a-look-at-how-color-f...](https://petapixel.com/2015/09/19/heres-a-look-at-how-color-film-was-originally-biased-toward-white-people/)

~~~
gambler
_> Which problem? The general statement of this problem is "models, trained on
[somehow] misrepresentative data [or even technically representative data] can
draw unintended conclusions that lead to harm"._

Throwing AI at answering an ill-formed question or optimizing a process that
shouldn't happen in the first place is not something that can be corrected by
getting better training data.

Moreover, automation can have consequences that aren't detectable by analyzing
some test set.

------
chobeat
I've just added this post to my reading list, which I'm sharing in case
anybody is interested in this and similar topics:
[https://github.com/chobeat/awesome-critical-tech-reading-lis...](https://github.com/chobeat/awesome-critical-tech-reading-list/)

------
Zolomon
There is a course on this at New York University:
[https://dataresponsibly.github.io/courses/spring19/](https://dataresponsibly.github.io/courses/spring19/)

------
killjoywashere
I actually think this is where ML really shines. You _can_ pick things apart.
Sure, you might need carefully designed experiments, but you _can_ subtract
"female" from the resume and look for other data that cause some trained
machine to activate, like patterns of word choice, etc. This is akin to Go
players learning from AlphaGo. It's actually a richly rewarding investigation
for those of us who have done it. To discover a whole class of failure modes,
that's success! And, unlike courts of law, the process is much more
efficient, because you don't have to contend with a defendant appealing to
matters of intent or the emotions of a jury.
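
A sketch of that kind of probe, with invented features and a toy model: perturb one input and measure how much the trained model's output moves.

```python
# Sketch of the probing idea above: flip a single (hypothetical)
# feature and see how much the trained model's output moves.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.random((500, 3))        # columns: gendered_words, years_exp, gpa
y = (X[:, 0] + 0.5 * X[:, 1] > 0.9).astype(int)   # synthetic biased labels
model = LogisticRegression().fit(X, y)

resume = X[:1].copy()
before = model.predict_proba(resume)[0, 1]
resume[0, 0] = 0.0              # "subtract" the gendered signal
after = model.predict_proba(resume)[0, 1]
print(before - after)           # how much that one signal drove the score
```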

------
Someone
Short way to describe the problem: we want to build systems that detect
causation, but statistical models can only detect correlation.

~~~
wongarsu
That's not entirely true: it's hard to show causation, but with enough data
you can. If A correlates with B, you know that either A causes B, B causes A,
some C causes both A and B, or the correlation is a coincidence. If you have
the data to rule out three of those, the remaining possibility is the
causation.
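
For concreteness, a tiny simulation of the third case (a minimal sketch, with invented numbers): C drives both A and B, so they correlate strongly even though neither causes the other, and the correlation vanishes once C is held (nearly) fixed.

```python
# Tiny simulation of "some C causes both A and B": A and B correlate
# strongly even though neither causes the other.
import numpy as np

rng = np.random.default_rng(3)
C = rng.normal(size=10_000)
A = C + rng.normal(scale=0.5, size=10_000)
B = C + rng.normal(scale=0.5, size=10_000)

print(np.corrcoef(A, B)[0, 1])              # ~0.8: strong correlation
mask = np.abs(C) < 0.1                      # hold C (nearly) fixed
print(np.corrcoef(A[mask], B[mask])[0, 1])  # ~0: the correlation vanishes
```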

~~~
Someone
So, how do you, for example, rule out “some C causes both A and B“, if you may
not even know of the existence of C?

More importantly, the only way to really show causation is by positing a
mechanism.

------
eanzenberg
>>Now, suppose that 75% of the bad turbines use a Siemens sensor and only 12%
of the good turbines use one (and suppose this has no connection to the
failure). The system will build a model to spot turbines with Siemens sensors.
Oops.

Given a statistically large enough sample, there are two possible outcomes: 1)
The Siemens sensor actually is at fault. 2) The Siemens sensor is part of a
larger system, which differs in non-Siemens turbines, and that system is
failing.

Either way, the model's prediction of turbine failures is improved by that
Siemens feature. But to even get to this granularity, you're diving into model
explainability: which features were important for each prediction. Here, you
try to understand the black box to find the reasons for a particular
input->output mapping.
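
One common way to do that is permutation importance; a minimal sketch with synthetic data built to match the quoted proportions:

```python
# Sketch of an explainability step: which features does the model lean
# on? Data is synthetic, built to match the quoted proportions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)
n = 2000
failed = rng.random(n) < 0.2
# 75% of bad turbines have a Siemens sensor, 12% of good ones.
siemens = np.where(failed, rng.random(n) < 0.75, rng.random(n) < 0.12)
noise = rng.random(n)                          # an irrelevant feature

X = np.column_stack([siemens, noise])
model = LogisticRegression().fit(X, failed)

imp = permutation_importance(model, X, failed, random_state=0)
for name, score in zip(["siemens_sensor", "noise"], imp.importances_mean):
    print(name, round(score, 3))  # the model has learned to "spot" Siemens
```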

~~~
munificent
I think you assume here that the historical effects that led to Siemens
sensors correlating with failure will continue to be true in the future. And I
think that is the key fallacy that makes AI bias a problem.

We aren't just looking for patterns. We are looking for patterns so that we
can _take action_ and affect the future. If the patterns, which are real
enough in the historical data, don't correctly predict the impact of a choice,
then they are anti-helpful bias.

For example, it may be that the company bought Siemens sensors years ago and
then switched to another brand later. Unsurprisingly, older turbines fail more
than newer ones. So, really, it's _age_ that is the causative factor and the
concrete action you want to take is to pay closer attention to older turbines.
Even though the correlation to Siemens is real, if the action you take is
"replace all the Seimens sensors with another brand", that won't make those
old turbines work any better.

In other words, understanding data doesn't just mean "see which bits are
correlated with which other bits". In order to be useful, we need to
understand which _changes_ to those bits in the future will be correlated with
which desired outcomes. Anything less than that and you don't yet have
information, just data.

~~~
MiroF
> I think you assume here that the historical effects that led to Siemens
> sensors correlating with failure will continue to be true in the future.

Yes, AI systems presume induction to be true. But so does... uh, science and
most other things we do?

~~~
gizmo686
Science has trained experts _thinking_ about the data.

If you set a team of scientists to find a way of predicting failure of
turbines, they might notice a correlation between Siemens sensors and failure.
They would then look for and attempt to prove theories to explain this
descrepency. In doing so, they would likly discover that, not only can they
not find a causative theory, but the correlation goes away when they control
for age.

AI systems stop after the first step, yet somehow are perceived as better than
expert humans.
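
Roughly what "control for age" means in practice, on synthetic data where only age drives failure but older turbines mostly carry Siemens sensors (all numbers invented):

```python
# "Controlling for age" on synthetic data: only age drives failure,
# but old turbines mostly carry Siemens sensors.
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
age = rng.uniform(0, 20, n)
siemens = rng.random(n) < np.where(age > 10, 0.8, 0.2)
failed = rng.random(n) < age / 40

print(np.corrcoef(siemens, failed)[0, 1])   # positive: Siemens "predicts" failure
band = (age > 14) & (age < 16)              # hold age roughly fixed
print(np.corrcoef(siemens[band], failed[band])[0, 1])  # ~0 within the band
```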

~~~
bumby
That's an interesting way to frame it. AI may stop at proximate causes rather
than finding root causes.

~~~
munificent
Or: AI shows correlation which we then implicitly treat as causation.

------
jgon
This quote stands out to me:

"just as a dog is much better at finding drugs than people, but you wouldn’t
convict someone on a dog’s evidence. And dogs are much more intelligent than
any machine learning."

Because in my head I followed it with the sentence "but we're all confident
that we will have dogs driving our cars in about 5 years." Food for thought
for sure.

~~~
dmix
So dogs are better than humans at detecting drugs because they have a better
sense of smell, one that can penetrate packaging. What does that have to do
with technology being better/worse than humans at driving, exactly?

They didn't say dogs were better than technology at solving problems, in any
sort of general sense.

