
Speech2Face: Learning the Face Behind a Voice - grej
https://arxiv.org/abs/1905.09773
======
wanderfowl
I'm a speech scientist. This paper is a neat idea, and the results are
interesting, but not in the way I'd expected. I had hoped it would explore the
domain of how much _person-specific_ information can be deduced from a voice,
e.g. lip aperture, overbite, size of the vocal tract, openness of the nares.
This is interesting from a speech perception standpoint. Instead, it's interesting
more in the domain of how much _social_ information it can deduce from a
voice. This appears to be a relatively efficient classifier for gender, race,
and age, taking voice as input.

I'm sure this isn't the first time it's been done, but it's pretty neat to see
it in action, and it's a worthwhile reminder: If a neural net is this good at
inferring social, racial, and gender information from audio, humans are even
better. And the idea of speech as a social construct becomes even more
relevant.

~~~
jobigoud
> If a neural net is this good at inferring social, racial, and gender
> information from audio, humans are even better.

Why would humans automatically be better than machines at that task?

~~~
zawerf
I think that's the scary thing. We don't even know what we know. It's all
subconscious.

For example, most people can easily picture a person's gender, race, age, and
where they are from based on their accent.

But I never realized that I also picture how fat they are, and can do it
pretty well! It wasn't until I saw that this project does it very reliably
that I realized I do it all the time too.

What else are we subconsciously picking up on? And as a counter defense, how
can we better hide it? Do I need to change my vocabulary and topic choices to
something more posh so they think I am eating healthier? What other info leaks
are there?

~~~
samvher
This is a bit different but also an example that made me realize I
unconsciously recognize some things I'm unaware of (the difference between
pouring hot and cold water):
[https://www.youtube.com/watch?v=Ri_4dDvcZeM](https://www.youtube.com/watch?v=Ri_4dDvcZeM)

------
daenz
One of the really cool things from the pdf is the ability to construct a
normalized face from a video frame:
[https://i.imgur.com/SdRHCJ0.png](https://i.imgur.com/SdRHCJ0.png) Basically
they are taking a video frame, making it face forward, cancelling out lighting
for pure albedo, and removing accessories like glasses. This is incredible in
terms of tracking individuals. It's a digital mugshot.

Also, I think it's interesting that the classifiers can pick out ethnicity
with a high degree of accuracy. Seems like an easy way to fool this tech, from
a privacy perspective, is to talk stereotypically like a specific ethnicity.

~~~
GorgeRonde
[https://www.youtube.com/watch?v=72n34fRkA3I](https://www.youtube.com/watch?v=72n34fRkA3I)

------
echopom
Absolutely terrifying.

If you add this to the model that guesses identity based on the sound produced
by inputs (keyboard, mouse...), you basically end up with an "ambient sound
fingerprinting" tech, where the sounds emitted near a device can be used to
accurately determine the individual standing close to it...

If you add this to China's facial recognition, it scares me to think how
"Gattaguesque" our societies are turning thanks to Machine Learning and Big
Data...

~~~
splatzone
What do you mean by "Gattaguesque"?

~~~
etaioinshrdlu
Gattaca-esque:
[https://en.wikipedia.org/wiki/Gattaca](https://en.wikipedia.org/wiki/Gattaca)

~~~
splatzone
Cheers!

------
hedgew
I don't think that we're prepared at all for how much, and from how little,
machine learning might be able to deduce about us. Combined with how our
behavior is tracked in high resolution by Facebook, Google, and the rest,
we're heading straight for the kind of future depicted in films like Gattaca,
but they won't even need to test your DNA for it. Just upload a short video of
your friend/employee/kid to _videotherapist.ai_ and find out if they're more
likely to be an arsonist or a physicist!

~~~
nothis
Absolutely fascinating/terrifying (why do those always come in pairs with AI
research?) but I think it's also interesting to use this to make some
assumptions about how human brains work. When I hear a voice, I _kinda_ have
an idea of what the person would look like. I can't really explain it, but
there are definitely some assumptions I can make about age and weight - gender
is easy, but even within that, a deeper voice easily makes me think of more
stereotypically male facial features. The model is also slightly racist (as we
all are) so if I hear an Asian language, well, I might picture an Asian
person.

So, basically, a lot of this stuff is actually likely happening right now in
our brain. A lot of crazy complicated stuff, happening entirely subconsciously
and never having been tested before because doing so empirically is somewhere
between incredibly tedious and impossible.

~~~
adrianN
How is it racist to think of an Asian person when you hear something with a
typically Asian pronunciation?

------
arendtio
For a quick overview, I found the Github page to be more useful:
[https://speech2face.github.io](https://speech2face.github.io)

------
Mizza
Cool, but this seems like a very fancy age, race and gender approximator.

I wonder if the NSA has an in-house version already.

------
craftinator
I found this scary from the perspective of its use by Law Enforcement. There
have already been serious misuses of facial recognition software by LE,
including using celebrity pictures as a query [1] because a witness described
a perpetrator as looking like that celebrity. From the perspective of someone
familiar with machine learning, this is a terrible misuse that is likely to
lead to wrongful arrests. I can easily imagine LE using Speech2Face software
to extract a facial reconstruction from a recording, then feeding this
reconstruction into facial recognition software, not understanding how this
will WILDLY propagate error. I'm not against industry use of machine learning,
but this software has the propensity for extreme misuse.

[1]
[https://www.washingtonpost.com/technology/2019/05/16/police-have-used-celebrity-lookalikes-distorted-images-boost-facial-recognition-results-research-finds/](https://www.washingtonpost.com/technology/2019/05/16/police-have-used-celebrity-lookalikes-distorted-images-boost-facial-recognition-results-research-finds/)

~~~
taurish
Check this: [https://www.afcea.org/content/mind-blowing-promise-ai-driven-voice-profiling](https://www.afcea.org/content/mind-blowing-promise-ai-driven-voice-profiling)

~~~
craftinator
I may have slept better tonight not having known about this.

------
geowwy
I'm underwhelmed.

It fits age, sex, ethnicity and face shape.

The part it does well is age, sex and ethnicity. It's not really surprising
that voice can give those away. Most people can guess those correctly from a
voice sample.

Face shape is the interesting part, and in my opinion it doesn't do that very
well at all. I wouldn't recognise any of those people from their reconstructed
images.

------
adrianN
I like how having a goatee correlates sufficiently with the way you speak for
a neural network to learn it.

~~~
GorgeRonde
There is a whole field dedicated to this kind of study, namely
sociolinguistics (see [1] for a short summary of the seminal experiment).
Sociolinguistics studies surface variation in language use that cannot be
accounted for by dialectology alone (i.e. geographical factors) among speakers
of the same language. There are clusters among linguistic usages, and it turns
out they map to clusters in the space of social practices. From what I have
studied of the field (not that much), it seems most of the time the variation
is driven by the desire to belong, or to show you belong, to the community of
the users of the trait you adopt. It's said to be unconscious (pretty much
like my masterful ability at handling language), but at the same time the
subjects at hand can arrive at the same conclusions with the help of some
introspection.

What's uncanny here is that having a goatee doesn't make you belong to any
social group you could think of explicitly and enjoy belonging to. I guess the
relationship is mostly driven by a mix of physiognomic traits (gender + age)
and the fact that they correlate well with having a goatee (which isn't a tiny
class anyway). Or there are indeed "deep" social structures to which we belong
and are yet unable to identify.

[1] [http://all-about-linguistics.group.shef.ac.uk/branches-of-linguistics/sociolinguistics/research-in-sociolinguistics/william-labov-marthas-vineyard/](http://all-about-linguistics.group.shef.ac.uk/branches-of-linguistics/sociolinguistics/research-in-sociolinguistics/william-labov-marthas-vineyard/)

Edit: there may well be an immense data trove hidden in people's voices. That
could be a very useful way to enrich datasets internally, a bit like
recommendation engines work: if my neighbour speaks like I do, then he must
enjoy the same things as I do.
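The recommendation-engine analogy above can be sketched as a toy nearest-neighbour lookup. All names, vectors, and interests below are made up for illustration; a real system would use learned voice embeddings rather than hand-written ones:

```python
# Toy "if my neighbour speaks like me, he likes what I like" recommender.
# Voices are represented as small feature vectors (hypothetical data).
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Known people: (voice embedding, known interest) - purely invented.
profiles = {
    "alice": ([0.9, 0.1, 0.3], "jazz"),
    "bob":   ([0.2, 0.8, 0.5], "metal"),
}

def recommend(voice_vec):
    """Recommend the interest of the acoustically nearest known speaker."""
    best = max(profiles, key=lambda p: cosine(voice_vec, profiles[p][0]))
    return profiles[best][1]

print(recommend([0.88, 0.15, 0.25]))  # closest to "alice" -> prints jazz
```

The point of the sketch is only the mechanism: similarity in voice space stands in for similarity in taste, which is exactly the (questionable) assumption the comment describes.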

~~~
jcims
Or it could be that a goatee physically influences the sound of the voice in a
way that is perceptible to the algorithm. For example, damping transmission
through the skin and attenuating reflections off the chin and upper lip.

~~~
anigbrowl
It doesn't. I worked as a sound engineer in film for a decade and I'm
extremely forensically minded.

~~~
solarkraft
But are you absolutely sure there aren't tiny, tiny differences? I don't think
it's likely, but how can you be so sure?

~~~
anigbrowl
I'm dubious because of the frequency range of audio recordings (especially
post-compression on YouTube) and the size of cells, even large ones like hair
follicles.

------
grittygrease
Wow. It looks like my image and voice were used in this study. I was never
contacted.

------
taurish
Interestingly, the US Coast Guard has been using this technology for a long
time via Carnegie Mellon (CMU).

CMU also presented similar research at the World Economic Forum last year:
[https://www.afcea.org/content/mind-blowing-promise-ai-driven-voice-profiling](https://www.afcea.org/content/mind-blowing-promise-ai-driven-voice-profiling)

with a recent paper:
[https://arxiv.org/pdf/1905.10604.pdf](https://arxiv.org/pdf/1905.10604.pdf)

------
vackosar
Would be useful for game designers if it worked in reverse.

~~~
drusepth
Honestly, still probably pretty useful for game designers if they can do some
style mapping on top of the generated image. Being able to generate
"believable-enough" portraits for NPCs based on voice actor data + style
mapping would cut down on a sizeable art task of portrait assets.

------
circinus
Seems like pure overfitting. Since the model gets to see during training the
same co-occurrences that are used for evaluation, it is just memorizing the
training set, which is trivial.

------
sandworm101
>> In our experimental section, we mention inferred demographic categories
such as "White" and "Asian". These are categories defined and used by a
commercial face attribute classifier (Face++), and were only used for
evaluation in this paper.

So the software accurately depicts the race of a person according to what some
other software has determined their race to be? This is so circular it is
laughable. I cannot wait to see how many L's I need to mispronounce for this
thing to assume I'm an Asian from the Bronx, or how many stutters are needed
before it thinks I'm an octogenarian.

I'm reminded of the similar tool that could identify sexual orientation from a
photo. It only worked on those who fit certain stereotypical behaviors,
persons who actively self-identified as being in a particular category. When
tied to immutable characteristics (skull dimensions) it fell apart.

~~~
nl
_I'm reminded of the similar tool that could identify sexual orientation from
a photo. It only worked on those who fit certain stereotypical behaviors,
persons who actively self-identified as being in a particular category. When
tied to immutable characteristics (skull dimensions) it fell apart._

That paper was particularly bad. To a large degree it was a dataset detector
(the "gay" and "non-gay" facial datasets came from different sources, based on
different geography).

This paper is _much_ more limited in the claims it is making - only that it
correlates well on faces that appear white and "Asian", and that it doesn't
correlate as well for "Indian" and black faces. They speculate that this is
because of under-representation of those classes.

 _So the software accurately depicts the race of a person according to what
some other software has determined their race to be?_

Are you arguing that Face++ is inaccurate? It would surprise me if a machine
learning model isn't pretty much as good as humans at this. I don't see any
numbers quoted by Face++, but a paper claims 93% accuracy [1].

_I cannot wait to see how many L's I need to mispronounce for this thing to
assume I'm an Asian from the Bronx_

So if you make yourself talk like an "Asian from the Bronx" and it detects
that then... it is working, no?

[1]
[http://www.bernardjjansen.com/uploads/2/4/1/8/24188166/janse...](http://www.bernardjjansen.com/uploads/2/4/1/8/24188166/jansen_facial_icwsm2018.pdf)

------
etaioinshrdlu
The results look so good it makes me suspicious it's overfitting! It would be
nice if the authors said whether the examples were from a held-out set or
not...

~~~
lopmotr
Is it really much better than it would be if it only knew age, race and sex
though? Those seem fairly easy to determine from speech, especially when you
consider language - and it made a mistake in a case where the race didn't
match the language.

------
splatzone
This is absolutely absurd and really scary. I'm not a hugely scientific
person, so I might have misunderstood, but it seems like we are getting close
to being able to figure out the whole of a person just from a snapshot of one
part of them, like their voice.

I really worry for people born 100 years from now. We need to be really
careful with technology like this. This could lead to a dystopia greater than
Orwell could have ever predicted, and I don't want to sit idly by while it
happens.

~~~
tntn
There are some recent developments that are potentially scary, but I don't see
how this is one of them. Humans are really good at guessing approximate age,
sex, and ethnicity from a voice. I'm not surprised in the least that some
machine learning can do it too. I think the "whole of a person" is a lot more
than age, sex, and ethnicity, so I don't agree that we are "getting close."

------
anigbrowl
Good thing I rarely use phones any more. Maybe I should start making portable
vocoders to help maintain privacy against ill-conceived projects like this
one.

------
amelius
What exactly is the endgoal here?

~~~
tirpen
Science.

~~~
amelius
That's not really an answer. You can research anything in the name of science,
but one should hope that there is some rationale behind the effort.

