I'm sure this isn't the first time it's been done, but it's pretty neat to see it in action, and it's a worthwhile reminder: If a neural net is this good at inferring social, racial, and gender information from audio, humans are even better. And the idea of speech as a social construct becomes even more relevant.
I'm mostly deaf (cochlear implant) and one thing I've noticed is that if I watch things without my processor on (i.e., completely deaf), I can generally "guess" what a voice sounds like fairly accurately... I've wondered for a long time if it's a trick of my mind, a quirk of statistics, or something that's actually possible.
With just a face, you miss things like the fundamental frequency (pitch) of the voice, dialect, and other linguistic variables.
In both cases, much is missing, and impossible to reconstruct beyond a stereotype.
It's a tool for pair programming in interviews. It includes audio (no video), and alters the audio to reduce/eliminate cues that would indicate the interviewee's race, age, etc.
Why would humans automatically be better than machines at that task?
Plus, we have the advantage of understanding what social cues certain speech traits directly 'index', or serve to mark. For instance, I'll bet you can picture the voice of somebody you could clearly identify as white and male, but who would be exceedingly unlikely to have a long, bushy beard and wear a camouflage jacket. That isn't anatomical but social, and such cues aren't coincidence but broadcast social information. Sure, with enough data, a model might pick up on these as a sort of emergent stereotype, but we're attuned to such cues through our social experience. And these things are culturally specific, perhaps more so than a YouTube dataset would capture.
I view this as a similar situation to using ML for evaluating things like humor, irony, or aesthetic beauty in cloudscapes: They might be able to bootstrap a model which starts with human judgements, or cluster things in such a way that a 'funny' category emerges, but they're a ways off from understanding the categories themselves, and I think that's relevant.
For example, most people can easily picture a person's gender, race, and age, and guess where they're from based on their accent.
But I never realized that I also picture how fat they are, and can do it pretty well! It wasn't until I saw that this project can do it very reliably that I realized I do it all the time too.
What else are we subconsciously picking up on? And as a counter defense, how can we better hide it? Do I need to change my vocabulary and topic choices to something more posh so they think I am eating healthier? What other info leaks are there?
Also, I think it's interesting that the classifiers can pick out ethnicity with a high degree of accuracy. Seems like an easy way to fool this tech, from a privacy perspective, is to talk stereotypically like a specific ethnicity.
If you add this to the model that guesses identity from the sound produced by inputs (keyboard, mouse...), you basically end up with an "ambient sound fingerprinting" tech, where the sounds emitted near a device can be used to accurately determine the individual standing close to it...
If you add this to China's facial recognition, it scares me to think how Gattaca-esque our societies are turning thanks to Machine Learning and Big Data...
This research stops short of tying speech to any individual's appearance. It isn't even an advancement toward that goal, which it explicitly doesn't have.
The facial identity part seems little more than an average/example visualization of traits (age, gender, etc.), which can be inferred from speech data with some accuracy (as we've always attempted in the form of mental models in our brains).
Not trying to be contrarian, genuinely wondering why I'm not among the concerned, and if I'm missing something.
So, basically, a lot of this stuff is actually likely happening right now in our brain. A lot of crazy complicated stuff, happening entirely subconsciously and never having been tested before because doing so empirically is somewhere between incredibly tedious and impossible.
I wonder if the NSA has an in-house version already.
It fits age, sex, ethnicity and face shape.
The part it does well is age, sex and ethnicity. It's not really surprising that voice can give those away. Most people can guess those correctly from a voice sample.
Face shape is the interesting part, and in my opinion it doesn't do that very well at all. I wouldn't recognise any of those people from their reconstructed images.
What's uncanny here is that having a goatee doesn't make you belong to any social group you could explicitly name and enjoy belonging to. I guess the relationship is mostly driven by a mix of physiognomic traits (gender + age) and the fact that they correlate well with having a goatee (which isn't a tiny class anyway). Or there are indeed "deep" social structures that we belong to but are as yet unable to identify.
Edit: there may well be an immense data trove hidden in people's voices. That could be a very useful way to enrich datasets internally, a bit like recommendation engines work: if my neighbour speaks like I do, then he must enjoy the same things as I do.
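To make the recommendation-engine analogy concrete, here's a toy sketch (the embedding vectors are entirely made up for illustration): "speaks like I do" would reduce to a similarity score between voice-embedding vectors, the same cosine-similarity primitive collaborative-filtering systems use.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional voice embeddings (placeholders, not real data).
me        = np.array([0.9, 0.1, 0.4])
neighbour = np.array([0.8, 0.2, 0.5])   # speaks like me
stranger  = np.array([0.1, 0.9, 0.2])   # speaks very differently

# A recommender built on this would treat high similarity as a licence
# to transfer preferences from neighbour to me.
print(cosine(me, neighbour), cosine(me, stranger))
```

In a real system the embeddings would come from a speaker-encoder network and have hundreds of dimensions, but the nearest-neighbour logic is the same.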
Extreme example, compare audio at 4m and 21m - https://www.youtube.com/watch?v=6dbQ2OA4SRA
Clearly a different top end.
Did a little tinkering in Audacity, and with the beard there's a standard roll-off from 3 kHz to 10 kHz. Without it, there's a weird flat spot in the same area (both averaged over 20 seconds or so).
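For anyone who wants to reproduce that comparison outside Audacity, here's a rough Python sketch (file names are placeholders) that averages the magnitude spectrum over a whole clip, so you can eyeball the 3-10 kHz band for the two recordings:

```python
import numpy as np
from scipy.io import wavfile

def avg_spectrum_db(path: str, n_fft: int = 4096):
    """Average magnitude spectrum (in dB) over all frames of a WAV file."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                      # mix stereo down to mono
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float64)
    # Split into non-overlapping windowed frames and average their spectra,
    # roughly what Audacity's "Plot Spectrum" does over a selection.
    n_frames = len(samples) // n_fft
    frames = samples[: n_frames * n_fft].reshape(n_frames, n_fft)
    spectra = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / rate)
    return freqs, 20 * np.log10(spectra.mean(axis=0) + 1e-12)

# freqs, db = avg_spectrum_db("bearded_clip.wav")   # placeholder file
# band = (freqs >= 3000) & (freqs <= 10000)
# print(db[band].mean())   # compare against the clean-shaven clip
```

Comparing the mean level (or the slope) of the 3-10 kHz band between the two clips should show the roll-off difference described above.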
CMU also presented similar research at the World Economic Forum last year: https://www.afcea.org/content/mind-blowing-promise-ai-driven...
with a recent paper: https://arxiv.org/pdf/1905.10604.pdf
So the software accurately depicts the race of a person according to what some other software has determined their race to be? This is so circular it is laughable. I cannot wait to see how many L's I need to mispronounce for this thing to assume I'm an Asian from the Bronx, or how many stutters are needed before it thinks I'm an octogenarian.
I'm reminded of the similar tool that could identify sexual orientation from a photo. It only worked on those who fit certain stereotypical behaviors, persons who actively self-identified as being in a particular category. When tied to immutable characteristics (skull dimensions) it fell apart.
That paper was particularly bad. To a large degree it was a dataset detector (the "gay" and "non-gay" facial datasets came from different sources, based on different geography).
This paper is much more limited in the claims it makes: only that it correlates well on faces that appear white and "Asian", and that it doesn't correlate as well for "Indian" and black faces. They speculate that this is because of under-representation of those classes.
So the software accurately depicts the race of a person according to what some other software has determined their race to be?
Are you arguing that Face++ is inaccurate? It would surprise me if a machine learning model isn't pretty much as good as humans at this. I don't see any numbers quoted by Face++, but a paper claims 93% accuracy.
I cannot wait to see how many L's I need to mispronounce for this thing to assume I'm an Asian from the Bronx
So if you make yourself talk like an "Asian from the Bronx" and it detects that then... it is working, no?
I really worry for people born 100 years from now. We need to be really careful with technology like this. It could lead to a dystopia greater than anything Orwell predicted, and I don't want to sit idly by while it happens.