
A voice separation model that distinguishes multiple speakers simultaneously - venmul
https://ai.facebook.com/blog/a-new-state-of-the-art-voice-separation-model-that-distinguishes-multiple-speakers-simultaneously
======
wenc
This is known as Blind Source Separation [1], and it's been a field of study
for decades. The specific problem here seems to be the "cocktail party
problem", where you want to isolate a single speaker (or in this case 5?) in a
room full of conversations.

When I was in grad school, I knew an EE research group in the building next to
mine working on this problem using ICA (independent component analysis) --
this was circa 2004, before the resurgence of deep learning. Even with plain
ICA, useful results could be obtained.
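
For anyone curious, here's roughly what that classical pipeline looks like -- a toy sketch using scikit-learn's FastICA on a synthetic two-microphone, two-source mixture (my own illustration, not that group's actual setup):

```python
# Minimal toy sketch of classical ICA-based separation (illustrative only):
# two synthetic "talkers" recorded by two "microphones", then unmixed with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 220 * t)              # stand-in for speaker 1
s2 = np.sign(np.sin(2 * np.pi * 3 * t))       # stand-in for speaker 2 (non-Gaussian)
S = np.c_[s1, s2]                             # (n_samples, n_sources)

A = np.array([[1.0, 0.5],                     # unknown room/mixing matrix
              [0.4, 1.0]])
X = S @ A.T                                   # the two "microphone" recordings

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                  # recovered sources, up to scale and order
```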

The results of the FB work [2] with RNNs are pretty impressive (audio
samples).

[1]
[https://en.wikipedia.org/wiki/Signal_separation](https://en.wikipedia.org/wiki/Signal_separation)

[2]
[https://enk100.github.io/speaker_separation/](https://enk100.github.io/speaker_separation/)

~~~
ShamelessC
Are the audio embeds on that second site working for you? Can't get most of
them to play.

~~~
wenc
Yep, they're working for me.

------
boublepop
I feel that they are underplaying just how big a deal this would be in hearing
aids. It’s not just a case of “slightly better noise filtering”; for some it is
the difference between being able to go to social events or not. For a large
group of people using hearing aids, the cocktail party effect means they can’t
hear anything at all in social settings, so they avoid them completely because
of the negative effects that come from everyone assuming you’re able to follow
group conversations when you’re in fact sitting in your own little bubble, only
able to pick up what’s going on when someone semi-yells directly at you.

In any case, the box you’d be selling them this product in wouldn’t say “better
sound”; it would say: “Get back your ability to attend and enjoy parties, enjoy
group conversations, and socialize unencumbered”. That’s a huge quality-of-life
improvement.

You still have the issue of how to figure out which voices to boost and which
to reduce, but I’d expect that to be a simpler issue of using multiple receivers
and directional detection.
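
As a rough sketch of the “multiple receivers and directional detection” idea (my own toy example, not an actual hearing-aid algorithm), a delay-and-sum beamformer time-aligns and averages the channels of a small mic array so that sound from a chosen direction adds up coherently:

```python
# Toy delay-and-sum beamformer: align and average channels so that sound
# from the steering direction adds coherently and off-axis sound does not.
import numpy as np

def delay_and_sum(x, mic_positions_m, angle_rad, fs, c=343.0):
    """x: (n_mics, n_samples) recordings from a linear array.
    mic_positions_m: mic positions along the array axis, in meters.
    angle_rad: steering angle relative to broadside."""
    mic_positions_m = np.asarray(mic_positions_m, dtype=float)
    n_mics, n_samples = x.shape
    # Plane-wave arrival delay at each mic, in samples, for the steering direction.
    arrival = mic_positions_m * np.sin(angle_rad) / c * fs
    # Delay each channel by the complement so all channels line up.
    comp = np.round(arrival.max() - arrival).astype(int)
    out = np.zeros(n_samples)
    for ch, shift in zip(x, comp):
        out[shift:] += ch[:n_samples - shift]
    return out / n_mics
```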

------
yodon
Facebook's work on separating multiple sources in an audio stream is
fundamentally different from prior ICA-based methods of Blind Source
Separation [0] in ways that are both interesting and seem to be part of a
broader trend at FB Research.

ICA-based BSS requires at least n microphones to separate n sources of sound.
This work does the separation with one microphone.
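
To make that difference concrete (my own toy framing, not from the paper): classical ICA assumes an instantaneous mixing model x = A·s and recovers the sources by effectively inverting A, which only works when there are at least as many microphones as sources:

```python
# Why single-microphone separation can't be solved by simple unmixing:
# with one mic the mixing matrix is 1 x n, so inversion is impossible and the
# model has to fall back on learned priors about what speech sounds like.
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 16000))            # two independent, non-Gaussian sources

A_two_mics = np.array([[1.0, 0.5],
                       [0.3, 1.0]])         # 2x2, invertible: s = inv(A) @ x exists
x_two = A_two_mics @ s

A_one_mic = np.array([[1.0, 1.0]])          # 1x2: x is just a weighted sum
x_one = A_one_mic @ s                       # infinitely many (s1, s2) give the same x_one
```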

What makes this more broadly interesting is that FB Research has separately
developed the capability to reconstruct full 3D models from single-image
photos [1].

Both of these reconstruct-from-single-sensor problems are MUCH harder than
their associated reconstruct-from-multiple-sensors variants (ICA in the case
of audio, stereo separation or photogrammetry in the case of video) so they
aren't efforts one undertakes casually.

The obvious motivation for this single-sensor approach is augmenting existing
video and audio clips, most of which are single-camera, single-microphone (or
very closely spaced stereo microphones with minimal separation), and which
people have already uploaded to Facebook in massive numbers.

The more interesting motivation could be that FB (Oculus) is widely believed
to be developing next generation AR or VR glasses. Most of the discussion
around AR/VR headsets focuses on the displays, but if you wanted to keep both
your physical size and hardware parts cost to an absolute minimum, one of the
things you'd want to minimize is your sensor count.

FB Research seems to have a strong interest in things that reduce the number
of sensors required to provide high grade AR/VR experiences and that make it
possible to explore pre-existing conventional media in spatialized 3D
contexts.

[0]
[https://en.m.wikipedia.org/wiki/Independent_component_analysis](https://en.m.wikipedia.org/wiki/Independent_component_analysis)

[1]
[https://ai.facebook.com/blog/facebook-research-at-cvpr-2020/](https://ai.facebook.com/blog/facebook-research-at-cvpr-2020/)

------
thaumasiotes
This is a really interesting problem to work on. A couple obvious points:

1. This is a task that humans must do _all the time_. It's very important in
all kinds of different circumstances.

2. This is also a task that humans find very difficult. It's not like
recognizing someone by their face, where humans do it effortlessly but
struggle to describe how. We frequently fail at this.

Combining (1) and (2) with the assumption that this task has been just as
important historically as it is now, we might conclude that this is a
_really hard problem_ and that AI is unlikely to reach the level of performance
we might hope for.

And if AI quickly jumps to superhuman levels of performance, that too would
have many interesting implications.

~~~
mlthoughts2018
I have some skepticism. Humans can’t process multiple audio streams
simultaneously. Recognizing faces is a poor analogue to compare with; it would
be more similar to photostitching several faces all superimposed on top of
each other and asking a human to identify each component face, which humans
are also terrible at.

But we also rarely need realtime solutions for this. Casually tracking gross
features from a mixed audio source, like a big crowd, is easy, but we almost
never need to be actively listening and disambiguating between several
deliberate audio signals all at once. Human communication just isn’t set up
that way, though I’m sure there are niche exceptions.

Note here that video conferencing is emphatically not an example of an
exception. You still need to have synchronous order of speaking, not because
technology lacks the ability to separate the streams for the listener, but
because the listener cannot pay attention to more than one stream at a time.

To me this technology seems almost exclusively useful for surveillance and
offline audio analysis or audio synthesis / mixing.

Still possibly valuable, but definitely not in any fundamental way that would
significantly change or augment realtime verbal communication.

~~~
newsbinator
We are much better at faces than we are at voices:

• Imagine I show you 10 photos of faces. Then I come back 15 minutes later and
ask you to pick them out from a set of 20 faces.

• Imagine I play you 10 audio snippets of someone talking. Then I come back 15
minutes later and ask you to pick them out from a set of 20 audio snippets of
people talking.

That said, I agree with you that primarily this technology would be useful for
surveillance.

If you can zoom in on any given conversation happening at New York Penn
Station, you're basically in the Bourne Identity, with all the privacy-
destroying overreach that entails.

~~~
mlthoughts2018
I think you completely missed the point. The general fact that humans are
better at face identification than voice identification isn’t what’s at issue
here; that high-level fact has nothing to do with this specific problem or the
comparison being made.

~~~
sillysaurusx
Can you explain the point? Because I've missed it too.

~~~
mlthoughts2018
The task of disambiguating speakers, relative to skill at other audio
processing tasks, is more similar to disambiguating _superimposed_ faces,
relative to skill at other vision tasks.

It is _not_ similar to quickly recognizing several _different_ faces that are
present in a video stream.

Tasks requiring simultaneous processing of focused attention on multiple input
streams are _not_ very related to human visual or audio processing. These
types of tasks are _not_ similar to attending to things in peripheral vision
or peripheral audio.

------
ComputerGuru
Not an expert in this domain but I'm not sure this can be done (well) without
a physical component.

Recent studies have shown that we can consciously and subconsciously
physically manipulate the position and directionality of our outer ear and
some of the mechanics in the inner ear to "zero in" on noises and affect the
frequency response of the ear. Our ears move imperceptibly when we look from
side to side to synchronize what we hear with what we see. Try listening to
what one person in a busy room is saying, then try doing the same while
looking somewhere else.

There is hardware actively filtering out interfering sounds based on location
and frequency, then there's the wetware that further processes the incoming
signals and attempts to strip unwanted noise. I don't believe the second can
be effectively done without a feedback loop to the first.

~~~
sgk284
If you've ever listened to a podcast or talk radio then you've done all of
this without the physical component.

------
Yhippa
The "Why it matters" section is interesting. Cynically I'm trying to think of
commercial uses of this for FB. I'm thinking if you built a device that you
could put into public places, restaurants, or stores:

* People could order food from their table without summoning a server. I guess some restaurants have tablets or other devices at their table but it seems to break immersion if you're enjoying your company.

* In a big-box store, someone could come help you wherever you are, instead of workers roaming the store and you hoping to run into one.

* Fingerprint people in public or private for targeted advertising.

------
iandanforth
The assumption that this is possible comes from our ability to isolate voices
in a crowd by paying attention to one or more of them. However, our ability to
do so rests on two important factors that don't exist in these datasets: 1. We
have two ears to allow for sound localization, and 2. The sounds we distinguish
are colocated in space, allowing us to use ambient information for
disambiguation.

This means that the problem being solved here is _harder_ than the natural
problem we have evolved and learned to solve.
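
To make the two-ears point concrete, here's a tiny sketch (my own toy example, nothing to do with the FB model) of the kind of binaural cue a single-channel model never sees: estimating the interaural time difference by cross-correlating the two ear signals.

```python
# Estimate the interaural time difference (ITD): the lag at which the right-ear
# signal best matches the left-ear signal, searched over head-sized delays.
import numpy as np

def estimate_itd(left, right, fs, max_itd_s=0.0008):
    max_lag = int(max_itd_s * fs)             # ~0.8 ms, about 12 samples at 16 kHz
    lags = np.arange(-max_lag, max_lag + 1)
    ref = left[max_lag:len(left) - max_lag]
    scores = [np.dot(ref, right[max_lag + lag : len(right) - max_lag + lag])
              for lag in lags]
    return lags[int(np.argmax(scores))] / fs  # positive lag => source nearer the left ear
```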

This is both impressive and possibly problematic. Some feature of training in
a goal-directed fashion in naturalistic environments could be essential for
higher-quality speaker isolation, or it might not matter at all. The
multiplicity-of-models phenomenon tells us there are likely many solutions to
this problem.

~~~
milesvp
It’s potentially even harder than this. Not only do we have two ears, our ears
are shaped to help determine where sounds are coming from, and we make
constant movements to further help process audio. I remember years ago I
played an FPS capture-the-flag type game with headphones, and (not everyone
knows this) the math works out such that the same sound can appear to come from
in front or behind, so I developed a strong tendency to make small mouse
movements, rotating left and right, to help me identify where footsteps were
coming from. I didn’t even know I was doing it until I talked to an audio
engineer who mentioned this problem, and the next time I played, I realized how
much I relied on it to avoid getting blindsided.

------
fredmonroe
i'm excited that FB develops and shares their research, and simultaneously
terrified of what they will do with it given past behavior.

it's very disconcerting - i feel this way every time i use pytorch - which i
love

------
atum47
Nice, now Facebook can spy on several people at once.

