
Generating natural-sounding synthetic speech using brain activity - techben
https://humanbioscience.org/2019/04/new-brain-machine-interface-can-generate-natural-sounding-synthetic-speech-using-brain-activity.html
======
bjackman
Amazing. One thing I'm not clear on is this: did they have to re-establish the
brain activity -> muscle movements model for each patient? Because presumably
that wouldn't have worked for a paralysed patient. In that case, the question
is: how hard is it to generalise a brain activity model so that it can be
trained on one population and then used to get data from a paralysed person's
brain?

~~~
fundamental
Yes, typically in brain-computer interface tasks there is a need to retrain
models for different individuals, or even for the same individual on
different days/weeks/months. Another summary page states:

> The researchers also found that the neural code for vocal movements
> partially overlapped across participants, and that one research subject’s
> vocal tract simulation could be adapted to respond to the neural
> instructions recorded from another participant’s brain. Together, these
> findings suggest that individuals with speech loss due to neurological
> impairment may be able to learn to control a speech prosthesis modeled on
> the voice of someone with intact speech.

As for paralyzed individuals, that is a primary target of this area of
research. Things get considerably more complex in those cases, however, as they
would need to start out with a pre-trained model which couldn't be naturally
adapted by listening to their own speech. Additionally, any brain damage which
may have contributed to their condition can impair some of the signals in
question. Overall, the particular problem seems to have advanced since I
last worked with a locked-in patient, but there's still a good ways to go.

------
liability
I wonder if this tech could one day be useful for general users without
disabilities.

> _Even when the researchers provided the algorithm with brain activity data
> recorded while one participant merely mouthed sentences without sound, the
> system was still able to produce intelligible synthetic versions of the
> mimed sentences in the speaker’s voice._

That already seems borderline usable for people with speech. I wonder if it
could be made to work when you're merely thinking about the mouth movements
without actually making them. That would be ideal.

~~~
tjchear
It'd be great if this technology matures to the point where we can perform
telepathic communication, à la Ghost in the Shell.

~~~
liability
Beyond radio comms, I think a telepathic _"Hey [Siri/Alexa/etc]..."_ could
have some startling social implications.

It might even be the killer app to make elective brain implants mainstream.
(Of course if this could be done without hazardous and expensive implants, all
the better.)

~~~
nielsole
When you just have to think about buying something, and Alexa orders it for
you :O

~~~
liability
That's something I've been thinking about a lot recently. Not so much
accidental orders, but rather who these software assistants will be owned and
controlled by, in a foreseeable future where this sort of tech creates
tighter couplings between software assistants and our own minds.

If the software assistants are sufficiently useful and tightly coupled with
the human mind, I think it quite likely that the line between self and
software might get blurry for some users. The ability to think a question and
hear a _correct_ answer as a voice in your head is the sort of profoundly
powerful user experience that I think might plausibly alter the assumptions
people make about what it means to be themselves.

If these software assistants become a part of the users' own minds in their own
perception of themselves, what responsibilities do the owners/operators of
those systems have to their users?

I guess we'll cross that bridge when we get there, but the relative immaturity
of FOSS software assistants is starting to unnerve me. In 2040, when Amazon
starts selling "god in a box" to the general public, a two-way telepathic
connection to a state-of-the-art quasi-AGI living in the cloud, will there be
a viable FOSS alternative?

~~~
tjchear
In a way, we're already there, no? We listen to what Yelp reviewers tell us.
We stop trying as hard to memorize facts as we offload our cognitive functions
to Google search.

I guess it's the tighter coupling, compared to what we have now, that makes
the idea repulsive.

Also, as far as a FOSS alternative goes, I wouldn't count on it. It's not so
much the code; in time it'd be the huge data-crunching that counts, something
only big corporations are able to do.

~~~
liability
Even if we just had _reliable_ FOSS voice recognition without the rest, I
think hackers could create powerful user experiences. But alas, even that
seems to be asking too much. There are some FOSS efforts to implement
state-of-the-art solutions with lots of training data (Mozilla has been
working on this, from what I understand), but last I checked nothing was
really ready yet, and the stuff Mozilla is working on needs really beefy
server hardware to run, which I think unfortunately disqualifies it as a
viable competitor to the commercial offerings (which also use expensive
hardware, but don't require end users to know anything about it).

------
_august
Video (buried in the article):
[https://www.youtube.com/watch?v=3pv0vT82Cys](https://www.youtube.com/watch?v=3pv0vT82Cys)

~~~
CriticalCathed
At 2 minutes:

They can produce words from brain activity even when the subject does not
speak them. Interrogation tool?

~~~
fundamental
No, this approach is not reading thoughts. The video states that the subjects
still mouth the sentences, which triggers similar muscle-movement signals,
and those signals appear to be the target of this method. Without the signals
to move the muscles, no intelligible speech should be detectable using this
approach.

~~~
CriticalCathed
I mean, we can already read subvocalizations.

What I mean is, when we think about moving our arms, the same parts of our
brains activate as when we actually move them. Maybe when we think about
talking, a similar thing happens.

------
benzine
Here is the Google-cached version. It seems the site is offline.

[http://webcache.googleusercontent.com/search?q=cache:https:/...](http://webcache.googleusercontent.com/search?q=cache:https://humanbioscience.org/2019/04/new-brain-machine-interface-can-generate-natural-sounding-synthetic-speech-using-brain-activity.html)

Copy the link above to view the research.

------
mkagenius
Why aren't they replacing the mumbling with the closest-match words in the
synthesizer?

Anyway, putting electrodes inside the brain is not for the general public. Is
this at all possible without those intrusions?

~~~
fundamental
Approaches like this are not possible without invasive methods. Placing
electrodes in or on the brain provides considerably higher signal fidelity.

~~~
raidicy
That's unfortunate. I have RSI in both my hands and throat and something like
this would really be a life changer.

------
macawfish
Slap on a style transfer layer and it's a wrap.

Disclaimer: I know very little about how to actually do ML stuff, it just
seemed like something that'd be possible in the near future.

~~~
whuffman
I work in this space, so I'd love to give a bit of detail on the ML if you're
interested:

Style transfer is trickier to do with speech than with images! One significant
issue is the lack of a good "content" versus "style" distinction. In images
you can get great results by calling the higher-level features of an
object-classifier network "content" and holding that constant. Some people
have tried this for audio with e.g. a phoneme classifier, but there are
additional characteristics (such as inflection) that convey the emotional
content of speech and wouldn't be held constant.

Another issue is that much of the speech-classification work is done in
spectrogram (or, with further processing, MFCC) space, which lets you treat
audio similarly to images and leverage a bunch of technology that we have for
classifying those. But for synthesizing speech, spectrograms aren't a
fantastic representation, because small errors in spectrogram space can
translate into large, clearly audible errors in the waveform, and humans in
general are pretty sensitive to audio errors. There are cool neural
spectrogram-inversion methods out there which can help, but those still need
to be trained to be robust to the kinds of errors that a style transfer
algorithm would make, so it's still pretty tricky.
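The spectrogram point can be made concrete with a minimal NumPy sketch
(hypothetical frame/hop sizes of my choosing, not anyone's actual pipeline):
the magnitude spectrogram throws away phase, so anything synthesized in that
space has to be inverted back to a waveform, and errors made there land
directly in the audio.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=512, hop=128):
    """STFT magnitude with a Hann window; note the phase is discarded."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000                              # assumed 16 kHz sample rate
t = np.arange(sr) / sr                  # one second of samples
tone = np.sin(2 * np.pi * 440 * t)      # a 440 Hz test tone
spec = magnitude_spectrogram(tone)      # shape: (frames, frame_len // 2 + 1)

# The strongest bin sits near 440 Hz, quantized to sr/frame_len = 31.25 Hz bins
peak_bin = spec.mean(axis=0).argmax()
peak_hz = peak_bin * sr / 512
```

Turning `spec` back into audio means estimating the discarded phase (e.g.
with Griffin-Lim or a neural vocoder), which is exactly the step where small
spectrogram errors become audible artifacts.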

My company, Modulate, is building speech style transfer tech, and we've found
a lot more success with adversarial methods on raw audio synthesis, where the
adversary forces the generator to produce plausible speech from the target
speaker!

One of the coolest parts of the kind of BMI research in this article, to me,
is the potential to buy back some latency margin for speech conversion! If
you're working on already-produced speech, there are super tight latency
requirements if you want to hear your own speech in the converted voice:
exceed 20-30ms for the entire audio loop and you start to get echo-like
feedback that makes speaking difficult. Even without looping back, you don't
want more than 100-200ms of latency in a conversation before it starts
impeding the flow of dialogue. This means your style transfer algorithm gets
almost no future context, which limits the kinds of manipulations you can do
(not to mention the size of the network you can do them with, depending on
available compute power!).
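To put rough numbers on that budget (the 20-30ms figure is from the comment;
the 16 kHz sample rate and the even three-way split are hypothetical
assumptions of mine):

```python
SR = 16_000                   # assumed sample rate, Hz
LOOP_BUDGET_MS = 30           # upper end of the 20-30ms audio-loop budget

# Total samples available for the entire capture -> convert -> playback loop
budget_samples = SR * LOOP_BUDGET_MS // 1000   # 480 samples

# If capture buffering, inference, and playback buffering split the budget
# evenly (a hypothetical three-way split), the model's future context is:
lookahead_samples = budget_samples // 3        # 160 samples
lookahead_ms = 1000 * lookahead_samples / SR   # 10 ms of lookahead
```

A few hundred samples of lookahead is tiny compared to the length of a
phoneme, which is why working from pre-vocalization brain signals, arriving
before the audio would, buys back so much margin.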

------
bpchaps
This is super cool! Does anyone have any insight into whether this sort of
muscle-movement based vocalization has been done in the past?

~~~
effakcuL
The Cognitive Systems Lab at the University of Bremen[1], Germany, does a lot
of research in that field, and I had the pleasure of visiting them a few
months ago. If you are interested, you should find a lot of research on their
homepage, ranging from regular speech-to-text, through silent speech (aka
muscle movement to text), up to brain-to-text.

[1] [https://www.uni-bremen.de/en/csl/](https://www.uni-bremen.de/en/csl/)

edit: small correction of my bad English :P

------
dalbasal
The proofs of concept accumulating in this space are so exciting. It's hard
not to jump straight into runaway speculation.

------
a-dub
I'm curious: what's the effective bit rate? Speech and language have excellent
priors, and fancy speech synthesis has been a thing for a while.

Either way, cheers for getting something to work well!

------
retpirato
So they use the brain to bypass the part of the system that isn't working
properly. That makes sense.

