Speech2Face: Learning the Face Behind a Voice (arxiv.org)
143 points by grej 23 days ago | 54 comments

I'm a speech scientist. This paper is a neat idea, and the results are interesting, but not in the way I'd expected. I had hoped it would be interesting in terms of how much person-specific information can be deduced from a voice, e.g. lip aperture, overbite, size of the vocal tract, openness of the nares. That would be interesting from a speech perception standpoint. Instead, it's interesting more in terms of how much social information can be deduced from a voice. This appears to be a relatively efficient classifier for gender, race, and age, taking voice as input.

I'm sure this isn't the first time it's been done, but it's pretty neat to see it in action, and it's a worthwhile reminder: If a neural net is this good at inferring social, racial, and gender information from audio, humans are even better. And the idea of speech as a social construct becomes even more relevant.

Do you think it's more difficult to guess physiological features from a voice or a voice from a picture?

I'm mostly deaf (cochlear implant) and one thing I've noticed is that if I watch things without my processor on (i.e., completely deaf), I can generally "guess" what a voice sounds like fairly accurately... I've wondered for a long time if it's a trick of my mind, a quirk of statistics, or something that's actually possible.

In both cases, there are a lot of hidden variables. With voice, you miss out on non-acoustic things like beards, cheekbones, and other sorts of face-distinguishing features.

With just a face, you miss things like the fundamental frequency (pitch) of the voice, dialect, and other linguistic variables.

In both cases, much is missing, and impossible to reconstruct beyond a stereotype.

I think that's part of the motivation for Blindpad: https://github.com/blindpad

It's a tool for pair programming in interviews. It includes audio (no video), and alters the audio to reduce/eliminate cues that would indicate the interviewee's race, age etc.

> If a neural net is this good at inferring social, racial, and gender information from audio, humans are even better.

Why would humans automatically be better than machines at that task?

We don't know this for sure, certainly, but given that things like social group, race and gender are fundamentally sociocultural phenomena (albeit with some physiological basis in some cases), I would assume that humans will have a considerable advantage. We are natively social beings with decades of social knowledge and learning, whereas these sorts of algorithms are at best seeing these things as epiphenomena in large datasets.

Plus, we have the advantage of understanding what social cues certain speech traits directly 'index', or serve to mark. For instance, I'll bet you can picture a voice of somebody who you could clearly identify as white and male, but who would be exceedingly unlikely to have a long, bushy beard and wear a camouflage jacket. This is not anatomical but social, and these associations are not coincidence but broadcast social information. Sure, with enough data, we might be able to pick up on these as sort of emergent stereotypes, but we're attuned to such cues through our social experience. And these things are culturally specific, perhaps more so than a YouTube dataset would be.

I view this as a similar situation to using ML for evaluating things like humor, irony, or aesthetic beauty in cloudscapes: They might be able to bootstrap a model which starts with human judgements, or cluster things in such a way that a 'funny' category emerges, but they're a ways off from understanding the categories themselves, and I think that's relevant.

I think that's the scary thing. We don't even know if we know. It's all subconscious.

For example most people can easily picture a gender, race, age, and where a person is from based on accent.

But I never realized that I also picture how fat they are, and can do it pretty well! It wasn't until I saw that this project can do it very reliably that I realized I do it all the time too.

What else are we subconsciously picking up on? And as a counter defense, how can we better hide it? Do I need to change my vocabulary and topic choices to something more posh so they think I am eating healthier? What other info leaks are there?

This is a bit different but also an example that made me realize I unconsciously recognize some things I'm unaware of (the difference between pouring hot and cold water): https://www.youtube.com/watch?v=Ri_4dDvcZeM

Actually I see no reason for humans to be any better for these tasks.

One of the really cool things from the pdf is the ability to construct a normalized face from a video frame: https://i.imgur.com/SdRHCJ0.png Basically they are taking a video frame, making it face forward, cancelling out lighting for pure albedo, and removing accessories like glasses. This is incredible in terms of tracking individuals. It's a digital mugshot.

Also, I think it's interesting that the classifiers can pick out ethnicity with a high degree of accuracy. Seems like an easy way to fool this tech, from a privacy perspective, is to talk stereotypically like a specific ethnicity.

Absolutely terrifying.

If you add this to the model that guesses identity based on the sound produced by inputs (keyboard, mouse...) you basically end up with an "ambient sound fingerprinting" tech, where the sounds emitted near a device can be used to accurately determine the individual that's standing close to it...

If you add this to China's facial recognition, it scares me to think how "Gattaguesque" our societies are turning thanks to Machine Learning and Big Data...

Am I missing something?

This research stops short of tying speech to any individual's appearance. It isn't even an advancement toward that goal, which it explicitly doesn't have.

The facial identity part seems little more than an average/example visualization of traits (age, gender, etc), which can be inferred from speech data with some accuracy (as we've always attempted in the form of mental models in our brains).

Not trying to be contrarian, genuinely wondering why I'm not among the concerned, and if I'm missing something.

The scary security applications were exactly my first thought too. It's funny, because if they were to continue this research in the other direction (creating speech from faces) you could get useful applications that help come up with realistic voices for deaf/mute people (which sounds much less evil). Made me wonder if there was a financial angle: focusing on security to get the research money more easily.

What do you mean by "Gattaguesque"?


I don't think that we're prepared at all for how much, and from how little, machine learning might be able to deduce about us. Combined with how our behavior is tracked in high resolution by Facebook, Google, and the rest, we're heading straight for the kind of future depicted in films like Gattaca, but they won't even need to test your DNA for it. Just upload a short video of your friend/employee/kid to videotherapist.ai and find out if they're more likely to be an arsonist or a physicist!

Absolutely fascinating/terrifying (why do those always come in pairs with AI research?) but I think it's also interesting to use this to make some assumptions about how human brains work. When I hear a voice, I kinda have an idea what the person would look like. I can't really explain it, but there are definitely some assumptions I can make about age, weight – gender is easy but even within that, a deeper voice easily makes me think of more stereotypically male facial features. The model is also slightly racist (as we all are) so if I hear an Asian language, well, I might picture an Asian person.

So, basically, a lot of this stuff is actually likely happening right now in our brain. A lot of crazy complicated stuff, happening entirely subconsciously and never having been tested before because doing so empirically is somewhere between incredibly tedious and impossible.

How is it racist to think of an Asian person when you hear something with a typically Asian pronunciation?

For a quick overview, I found the Github page to be more useful: https://speech2face.github.io

Cool, but this seems like a very fancy age, race and gender approximator.

I wonder if the NSA has an in-house version already.

I found this scary from the perspective of its use by Law Enforcement. There have already been serious misuses of facial recognition software by LE, including using celebrity pictures as a query [1] because a witness described a perpetrator as looking like that celebrity. From the perspective of someone familiar with machine learning, this is a terrible misuse that is likely to lead to wrongful arrests. I can easily imagine LE using Speech2Face software to extract a facial reconstruction from a recording, then feeding this reconstruction into facial recognition software, not understanding how this will WILDLY propagate error. I'm not against industry use of machine learning, but this software has the propensity for extreme misuse.

1. https://www.washingtonpost.com/technology/2019/05/16/police-...

I might have slept better tonight not having known about this.

I'm underwhelmed.

It fits age, sex, ethnicity and face shape.

The part it does well is age, sex and ethnicity. It's not really surprising that voice can give those away. Most people can guess those correctly from a voice sample.

Face shape is the interesting part, and in my opinion it doesn't do that very well at all. I wouldn't recognise any of those people from their reconstructed images.

I like how having a goatee correlates sufficiently with the way you speak for a neural network to learn it.

There is a whole field dedicated to this kind of study, namely sociolinguistics (see [1] for a short summary of the seminal experiment). Sociolinguistics studies surface variations in language use, among speakers of the same language, that cannot be accounted for by dialectology alone (i.e. geographical factors). There are clusters among linguistic uses, and it turns out they map to clusters in the space of social practices. From what I have studied of the field (not that much), it seems that most of the time the variation is driven by the desire to belong, or to show you belong, to the community of users of the trait you adopt. It's said to be "unconscious" (pretty much like my masterful ability at handling language), but at the same time the subjects at hand can arrive at the same conclusions with the help of some introspection.

What's uncanny here is that having a goatee doesn't make you belong to any social group you could think of explicitly and enjoy belonging to. I guess the relationship is mostly driven by a mix of physiognomic traits (gender + age) and the fact they correlate well with having a goatee (which isn't a tiny class anyway). Or there are indeed "deep" social structures to which we belong and are yet unable to identify.

[1] http://all-about-linguistics.group.shef.ac.uk/branches-of-li...

Edit: there may well be an immense data trove hidden in people's voices. That could be a very useful way to enrich datasets internally, a bit like recommendation engines work: if my neighbour speaks like I do, then he must enjoy the same things as I do.

Or it could be that a goatee physically influences the sound of the voice in a way that is perceptible to the algorithm. For example, damping transmission through the skin and attenuating reflections off the chin and upper lip.

It doesn't. I worked as a sound engineer in film for a decade and I'm extremely forensically minded.

You motivated me to investigate.

Extreme example, compare audio at 4m and 21m - https://www.youtube.com/watch?v=6dbQ2OA4SRA

Clearly a different top end.

Did a little tinkering in Audacity, and with the beard there's a standard roll-off from 3 kHz to 10 kHz. Without it, there's a weird flat spot in the same area (both are averaged over 20 seconds or so).
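
For anyone who wants to repeat this kind of comparison outside Audacity, here is a minimal sketch in Python of averaging band energy over frames. It runs on synthetic tones, not the audio from the video, and `band_energy_db`, `two_tone`, the sample rate, and the frame size are all illustrative assumptions rather than anything used above.

```python
import cmath
import math

def band_energy_db(samples, sample_rate, lo_hz, hi_hz, frame=256):
    """Average power (in dB) inside [lo_hz, hi_hz], over non-overlapping frames."""
    # Hann window to limit leakage between neighbouring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame - 1)) for n in range(frame)]
    # DFT bins whose centre frequency falls inside the requested band.
    bins = [k for k in range(frame // 2) if lo_hz <= k * sample_rate / frame <= hi_hz]
    total, count = 0.0, 0
    for start in range(0, len(samples) - frame + 1, frame):
        chunk = [samples[start + n] * window[n] for n in range(frame)]
        for k in bins:
            # Naive single-bin DFT: slow, but needs no third-party library.
            coeff = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame)
                        for n in range(frame))
            total += abs(coeff) ** 2
            count += 1
    return 10 * math.log10(total / count + 1e-12)

def two_tone(sample_rate, n_samples, high_amp):
    """Hypothetical test signal: a 1 kHz tone plus a 5 kHz tone of given amplitude."""
    return [math.sin(2 * math.pi * 1000 * n / sample_rate)
            + high_amp * math.sin(2 * math.pi * 5000 * n / sample_rate)
            for n in range(n_samples)]

RATE = 32000
bright = two_tone(RATE, 2048, high_amp=1.0)   # full-strength top end
dull = two_tone(RATE, 2048, high_amp=0.25)    # top end attenuated 4x in amplitude
diff_db = band_energy_db(bright, RATE, 3000, 10000) - band_energy_db(dull, RATE, 3000, 10000)
print(f"3-10 kHz band difference: {diff_db:.1f} dB")
```

A 4x amplitude difference should come out near 20*log10(4), i.e. about 12 dB. With real recordings you would load the two clips instead of calling `two_tone`, and the caveats about mic distance and YouTube compression still apply.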

But are you absolutely sure there aren't tiny, tiny differences? I don't think it's likely, but how can you be so sure?

I'm dubious because of the frequency range of audio recordings (especially post compression on YouTube) and the size of cells, even large ones like hair follicles.

You can't know this.

How do you know he can't know this?

This feels very unlikely. Hairs are thin, non-absorptive, and spaced far enough apart to avoid much absorption. Unless it were an exceptionally bushy and thick goatee, I don't imagine there's much effect.

They trained this model on YouTube videos. I'd be surprised if whatever fine differences goatees may induce didn't get crushed away by YouTube's lossy compression.

Wow. It looks like my image and voice were used in this study. I was never contacted.

Interestingly, the US Coast Guard has been using this technology for a long time via Carnegie Mellon (CMU).

CMU also presented similar research at the World Economic Forum last year: https://www.afcea.org/content/mind-blowing-promise-ai-driven...

with a recent paper: https://arxiv.org/pdf/1905.10604.pdf

Would be useful for game designers if it worked in reverse.

Honestly, still probably pretty useful for game designers if they can do some style mapping on top of the generated image. Being able to generate "believable-enough" portraits for NPCs based on voice actor data + style mapping would cut down on a sizeable art task of portrait assets.

Interesting and probably doable!

Seems like pure overfitting. Since the model is able to see during training the same co-occurrences that are used for evaluation, it is just about memorizing the training set, which is trivial.

>> In our experimental section, we mention inferred demographic categories such as "White" and "Asian". These are categories defined and used by a commercial face attribute classifier (Face++), and were only used for evaluation in this paper.

So the software accurately depicts the race of a person according to what some other software has determined their race to be? This is so circular it is laughable. I cannot wait to see how many L's I need to mispronounce for this thing to assume I'm an Asian from the Bronx, or how many stutters are needed before it thinks I'm an octogenarian.

I'm reminded of the similar tool that could identify sexual orientation from a photo. It only worked on those who fit certain stereotypical behaviors, persons who actively self-identified as being in a particular category. When tied to immutable characteristics (skull dimensions) it fell apart.

> I'm reminded of the similar tool that could identify sexual orientation from a photo. It only worked on those who fit certain stereotypical behaviors, persons who actively self-identified as being in a particular category. When tied to immutable characteristics (skull dimensions) it fell apart.

That paper was particularly bad. To a large degree it was a dataset detector (the "gay" and "non-gay" facial datasets came from different sources, based on different geography).

This paper is much more limited in the claims it is making: only that it correlates well on faces that appear white and "Asian", and that it doesn't correlate as well for "Indian" and black. They speculate that this is because of under-representation of those classes.

> So the software accurately depicts the race of a person according to what some other software has determined their race to be?

Are you arguing that Face++ is inaccurate? It would surprise me if a machine learning model isn't pretty much as good as humans at this. I don't see any numbers quoted by Face++, but a paper claims 93% accuracy [1].

> I cannot wait to see how many L's I need to mispronounce for this thing to assume I'm an Asian from the Bronx

So if you make yourself talk like an "Asian from the Bronx" and it detects that then... it is working, no?

[1] http://www.bernardjjansen.com/uploads/2/4/1/8/24188166/janse...

> Seems like pure overfitting. Since the model is able to see during training the same co-occurrence, it is just about memorizing the training set, which is trivial.

The results look so good it makes me suspicious it's overfitting! Would be nice if the authors said whether the examples were from a held-out set or not...
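
To be concrete about what "held-out" would need to mean here, a minimal sketch of a speaker-disjoint split, so that no voice/face pair seen in training shows up in evaluation. This is not the paper's procedure; `split_by_speaker`, the `(speaker_id, clip_id)` layout, and the 20% fraction are all hypothetical.

```python
import hashlib

def split_by_speaker(clips, eval_fraction=0.2):
    """Put every clip of a given speaker on the same side of the split.

    `clips` is a list of (speaker_id, clip_id) pairs (hypothetical layout).
    """
    train, held_out = [], []
    for speaker_id, clip_id in clips:
        # Hash the speaker id so the assignment is deterministic and does
        # not depend on clip order or on how many clips a speaker has.
        digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
        bucket = digest[0] / 256.0
        target = held_out if bucket < eval_fraction else train
        target.append((speaker_id, clip_id))
    return train, held_out

# Toy data: 50 made-up speakers with 10 clips each.
clips = [(f"speaker{i % 50}", f"clip{i}") for i in range(500)]
train, held_out = split_by_speaker(clips)
train_speakers = {speaker for speaker, _ in train}
held_out_speakers = {speaker for speaker, _ in held_out}
print(len(train_speakers), "training speakers,", len(held_out_speakers), "held-out speakers")
```

Splitting by clip instead of by speaker would let the same person land on both sides, which is exactly the memorization scenario the quoted comment is worried about.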

Is it really much better than it would be if it only knew age, race and sex though? Those seem fairly easy to determine from speech, especially when you consider language - and it made a mistake in a case where the race didn't match the language.

This is absolutely absurd and really scary. I'm not a hugely scientific person, so I might have misunderstood, but it seems like we are getting close to being able to figure out the whole of a person just from a snapshot of one part of them, like their voice.

I really worry for people born 100 years from now. We need to be really careful with technology like this. This could lead to a dystopia greater than Orwell could have ever predicted, and I don't want to sit idly by while it happens.

There are some recent developments that are potentially scary, but I don't see how this is one of them. Humans are really good at guessing approximate age, sex, and ethnicity from a voice. I'm not surprised in the least that some machine learning can do it too. I think the "whole of a person" is a lot more than age, sex, and ethnicity, so I don't agree that we are "getting close."

Good thing I rarely use phones any more. Maybe I should start making portable vocoders to help maintain privacy against ill-conceived projects like this one.

What exactly is the end goal here?


That's not really an answer. You can research everything in the name of science, but one should hope that there is some rationale behind the effort.
