
Amazon releases public data set to aid research on speech separation - georgecarlyle76
https://developer.amazon.com/blogs/alexa/post/6963ff40-6e62-4d6a-975d-fea600affa46/amazon-releases-new-public-data-set-to-help-address-dinner-party-problem
======
dawg-
I'm a speech science student, not an engineer. So take this with a grain of
salt, but here's what I know.

In order to process speech with a computer you need to know two things: what
is the source (vocal cords) doing? And what is the filter (vocal tract) doing?
And you need to extract those two pieces of information from a single audio
signal, and figure out what each is doing independently of the other.
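If you want a concrete picture of that source/filter split, the classic trick is
the cepstrum: take the log magnitude spectrum of a frame, transform it again, and
the slowly-varying filter shape lands at low "quefrencies" while the source shows
up higher. A minimal numpy sketch of the idea (the cutoff of 30 samples is just an
illustrative value, not a tuned one):

    import numpy as np

    def real_cepstrum(frame):
        # Real cepstrum of one windowed speech frame
        spectrum = np.fft.rfft(frame)
        log_mag = np.log(np.abs(spectrum) + 1e-10)  # small offset avoids log(0)
        return np.fft.irfft(log_mag)

    def split_source_filter(frame, cutoff=30):
        # Crude "liftering": low quefrencies ~ filter, the rest ~ source
        c = real_cepstrum(frame)
        filter_part = c.copy()
        filter_part[cutoff:] = 0.0
        source_part = c - filter_part
        return source_part, filter_part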

Is a sound voiced or unvoiced? Is it, for example, labial (made with the lips) or
alveolar (made by pressing the tongue against the alveolar ridge just behind the
teeth)? That info is the difference between "p" and "t". You need to know both to
figure out what sound someone is making, and you need to put all those sounds
together to make words.

Most speech recognition uses one method of processing the audio signal: Mel-
frequency cepstral coefficients (MFCCs). They're a set of numbers that lets you
figure out what the source and the filter are doing independently.
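In practice you almost never compute these by hand; a library like librosa will
hand you a matrix of MFCCs per frame. A quick sketch (the file name and parameter
choices here are just placeholders):

    import librosa

    y, sr = librosa.load("utterance.wav", sr=16000)      # mono audio at 16 kHz
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 coefficients per frame
    print(mfccs.shape)                                    # (13, number_of_frames)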

Doing this with multiple (human) speakers is wicked hard, because you are only
working with one audio input. That's everybody's voice combined into one
vibration traveling through the air and hitting the microphone as a signal.
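A toy illustration of why that's hard (pure tones standing in for voices, and the
pitch values are made up):

    import numpy as np

    t = np.linspace(0, 1, 16000)           # one second at 16 kHz
    voice_a = np.sin(2 * np.pi * 120 * t)  # stand-in for a ~120 Hz voice
    voice_b = np.sin(2 * np.pi * 210 * t)  # stand-in for a ~210 Hz voice
    mixture = voice_a + voice_b            # the only thing the mic ever sees
    # Separation means recovering voice_a and voice_b from `mixture` alone.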

We use MFCCs because they approximate how humans perceive sound: they're built on
the Mel scale, which is not linear like Hz but instead spaces frequencies
according to what the human ear is most sensitive to. And MFCCs seem to work
pretty well for single speakers.
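For reference, the usual Hz-to-Mel mapping (one common variant of the formula) is
roughly linear below ~1 kHz and logarithmic above it:

    import math

    def hz_to_mel(f_hz):
        return 2595.0 * math.log10(1.0 + f_hz / 700.0)

    print(hz_to_mel(1000))  # ~1000 mel
    print(hz_to_mel(4000))  # ~2146 mel - high frequencies get squashed together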

So the question for this problem should come from the same place as why we use
MFCCs: how do humans do it? I don't 100% know the answer, but my guess is we
use other cognitive and sensory systems to augment our reception of the sound
signal.

On the sensory side, vision is a good example because it's super important in
speech processing. Among a sea of signals, humans have an incredible ability
to tell what sound someone is making by watching their mouth move and matching
it with the sounds we are taking in. Maybe having a model that incorporates
visual input of a speaker's face would be one approach to distinguish between
different speakers in a group.

Another issue to consider is that humans _don't_ process all the speech
happening in a crowded room. We are really good at filtering our attention down
to the person we are directly listening to. If a smart speaker could match a
voice to a unique set of fundamental frequencies, it could identify which talker
to pay attention to. But I don't think MFCCs give you that info in very good
detail, so now you are looking for another method. Google Assistant does this
with its Voice Match feature, which does seem to improve how well it recognizes
voices. But it has to build a model of your voice beforehand, and it would be
nice to have a system that can differentiate between speakers "on the fly".
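As a rough sketch of the kind of cue I mean, per-frame fundamental frequency (F0)
is easy to pull out with an off-the-shelf pitch tracker; the file name and
frequency bounds below are just placeholders, and a real system would need much
more than this:

    import numpy as np
    import librosa

    y, sr = librosa.load("dinner_party.wav", sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    print(np.nanmedian(f0))  # rough pitch of the dominant voice, in Hz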

