
Professor who says facial recognition can tell if you're gay - uxhacker
https://www.theguardian.com/technology/2018/jul/07/artificial-intelligence-can-tell-your-sexuality-politics-surveillance-paul-lewis
======
myWindoonn
This response to the research is quite good:
https://medium.com/@blaisea/do-algorithms-reveal-sexual-orientation-or-just-expose-our-stereotypes-d998fafdf477

The takeaway: AIs can learn our stereotypes.

------
YeGoblynQueenne
The "deep gaydar" paper reports that the training data had 50% gay men and 50%
straight men and the same for women (for subjects with at least one picture,
i.e. all). This equal distribution is, of course, quite unlike the
distribution in the real world. The paper itself quotes a 7% prevalence of
homosexuality.

The question, then, is this: if they believed their sources that homosexual
men and women are 7% of the population, why did they choose to train their
classifier on a radically different distribution? One answer may be that it's
much harder to train an accurate classifier on unbalanced classes, and that
artificially including more of the minority class improves the classifier's
performance on its training _and_ its test set, thereby inflating its
reported evaluation.

Of course, the resulting model is useless for predictions in the real world,
because it violates the central assumption in PAC learning (the computational
learning theory behind modern machine learning): that the data used to train a
machine learning algorithm has the same distribution as unseen data.
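
To put numbers on that: suppose, hypothetically, a classifier with 90%
sensitivity and 90% specificity. On a 50/50 test set it looks excellent, but
apply it at the paper's own 7% prevalence and most of its positive calls are
wrong. A back-of-the-envelope sketch (the 0.9 figures are made up for
illustration):

    # Precision at a given base rate, via Bayes' rule.
    # sensitivity = P(flagged | gay), specificity = P(not flagged | straight).
    # Both set to 0.9 purely for illustration.
    sensitivity, specificity = 0.9, 0.9

    def precision(prevalence):
        true_pos = sensitivity * prevalence
        false_pos = (1 - specificity) * (1 - prevalence)
        return true_pos / (true_pos + false_pos)

    print(precision(0.50))  # balanced test set: 0.90
    print(precision(0.07))  # 7% real-world prevalence: ~0.40

At the real-world base rate, roughly six out of every ten people the model
flags would in fact be straight.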

In other words, the "deep gaydar" paper is a pointless exercise in classifier
tuning. It proves that, with enough knowledge of how classifiers work, you can
squeeze any kind of results you want out of them. But it doesn't tell us
anything else.

~~~
dual_basis
I would disagree. The assumption that the data has the same distribution in
your training set as it will in application is not required at all; in fact,
it is often better to have more data around the edge of the classification
set.

~~~
YeGoblynQueenne
Like I say, the central assumption in PAC learning is that a data set is drawn
from the same distribution as unseen data. Otherwise, there can be no
guarantees that the model is capable of representing anything beyond its data
set.

That is not controversial. I'm surprised at your statement that distributional
consistency is "not required at all" and I wonder where it's coming from.

I don't understand what you mean by "data around the edge of the
classification set".

~~~
BrandoElFollito
Why should it be? You present an image and state that it is type A, then
another that is type B, and so on.

The images are independent; each carries a label, and how many you have of
each type does not need to reflect their actual distribution.
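
To be fair, if you want the output probabilities to match the real world, you
have to correct them afterwards. A sketch of the standard correction, assuming
the only mismatch between training and deployment is the class proportions
(label shift); the 50/50 and 7% figures come from the thread above:

    # Rescale a probability from a model trained at train_prior so it is
    # calibrated for the deployment prior true_prior (label-shift correction).
    def correct_prior(p, train_prior=0.5, true_prior=0.07):
        num = p * true_prior / train_prior
        den = num + (1 - p) * (1 - true_prior) / (1 - train_prior)
        return num / den

    print(correct_prior(0.9))  # a confident 0.9 becomes roughly 0.40

The raw score only misleads if you read it as a real-world probability
without this kind of correction.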

------
spunker540
I think this kind of research is worth trying despite the potential risks.
It’s the role of researchers to figure out what is possible.

It’s like the atomic bomb: sure, it’s incredibly dangerous technology and
we’d prefer that no one had it. But we can’t prevent others from having it,
and if everyone has it, mutually assured destruction actually promotes peace.

------
RickJWagner
I think we had better start looking hard at ethics around these kinds of
things.

For instance, what if an AI can read the same thing from children -- say,
even in the womb? There are already babies killed before birth because they
are the wrong sex, or because they show the potential for some disease.

Caution is warranted.

