
Reconstructing intelligible speech from the human auditory cortex - Jerry2
http://naplab.ee.columbia.edu/reconstruction.html
======
hprotagonist
This is a fantastic bit of work, and I'm excited to see this technique
actually demonstrated to work.

From their paper:

 _We used invasive electrocorticography (ECoG) to measure neural activity from
five neurosurgical patients undergoing treatment for epilepsy as they listened
to continuous speech sounds. Two of the five subjects had high-density
subdural grid electrodes implanted in the left hemisphere with coverage
primarily over the superior temporal gyrus (STG), and four of the five
subjects had depth electrodes with coverage of Heschl’s gyrus (HG). All
subjects had self-reported normal hearing. Subjects were presented with short
continuous stories spoken by four speakers (two females, total duration: 30
minutes). To ensure that the subjects were engaged in the task, the stories
were randomly paused, and the subjects were asked to repeat the last sentence.
The test data consisted of continuous speech sentences and isolated digit
sounds. We used eight sentences (40 seconds total) to evaluate the objective
quality of the reconstruction models. The sentences were repeated six times in
random order, and the neural data was averaged over the six repetitions to
reduce the effect of neural noise on comparison of reconstruction models (see
Supp. Fig. 1 for the effect of averaging). The digit sounds were used for
subjective intelligibility and quality assessment of reconstruction methods
and were taken from a publicly available corpus, TI-46. We chose 40 digit
sounds (zero to nine), spoken by four speakers (two females) that were not
included in the training of the models. Reconstructed digits were used as the
test set to evaluate subjective intelligibility and quality of the models._

Do _not_ expect this to generalize easily to a noninvasive hairnet: ECoG is
so, so much nicer than EEG from a signal-to-noise perspective, not to mention
latency and localization. The only drawback is you have to crack someone's
skull to do it...

~~~
taneq
> Do not expect this to generalize easily to a noninvasive hairnet: ECoG is
> so, so much nicer than EEG from a signal-to-noise perspective, not to
> mention latency and localization. The only drawback is you have to crack
> someone's skull to do it...

Do you know if these implanted electrodes interfere with transcranial EEG
measurement? If not, is anyone gathering data from these invasively-
instrumented humans to see if it's possible to learn a mapping between the
two? It'd be a long shot, but still...
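
Even the dumbest version of that mapping, a linear map learned with ridge
regression, would be a sanity check. A hypothetical sketch, with made-up
dimensions and random arrays standing in for simultaneous recordings:

    # Hypothetical: learn a linear map from scalp EEG features to ECoG
    # features, assuming simultaneous recordings existed. Random data
    # stands in for real recordings.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    n_samples, n_eeg, n_ecog = 10000, 64, 128      # made-up dimensions
    X_eeg = np.random.randn(n_samples, n_eeg)      # stand-in EEG features
    Y_ecog = np.random.randn(n_samples, n_ecog)    # stand-in ECoG features

    X_tr, X_te, Y_tr, Y_te = train_test_split(X_eeg, Y_ecog, test_size=0.2)
    model = Ridge(alpha=1.0).fit(X_tr, Y_tr)       # one map, all channels
    print("held-out R^2:", model.score(X_te, Y_te))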

~~~
etrautmann
It's easier to just replicate the work with EEG and see how well you do.
Having worked with these signals, my guess is: extremely poorly.

------
selimnairb
Can this method be used to extract speech from people who subvocalize, that
is, where auditory processing is going on in the absence of speech by another?
Also, this confirms to me that we are merely, or at least mostly, made of
electrified meat.

------
king_magic
I can imagine this having really profound implications for folks with ALS,
etc. down the line. Really impressive.

------
protomikron
Interesting work.

This proves that the Borg's collective voice uses "DNN + spectrogram" to
synthesize their voice from the drones' auditory cortex, compare:
[https://www.youtube.com/watch?v=ql83z8yBx2M&feature=youtu.be...](https://www.youtube.com/watch?v=ql83z8yBx2M&feature=youtu.be&t=32)

------
Sephr
Here's a link to the research paper itself:
[https://www.nature.com/articles/s41598-018-37359-z.pdf](https://www.nature.com/articles/s41598-018-37359-z.pdf)

------
est31
I wonder why they used the WORLD vocoder and whether quality could be improved
by using other vocoders like WaveNet or LPCNet.

~~~
resiros
That would be a nice project idea: swapping the vocoder for something like
WaveNet and making it end-to-end.

The cool thing is that they published here
([http://naplab.ee.columbia.edu/naplib.html](http://naplab.ee.columbia.edu/naplib.html))
all their training data and code, so that should not actually be difficult.

Edit: My mistake, the data is not available, but they say in the paper that it
is available upon request.
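
For reference, the WORLD analysis/synthesis loop that a neural vocoder would
replace looks roughly like this (a sketch using the pyworld bindings, not the
authors' pipeline; "speech.wav" is a placeholder):

    # Rough sketch of the WORLD analysis/synthesis loop that a neural
    # vocoder (WaveNet, LPCNet, ...) would replace. Not the authors' code.
    import numpy as np
    import soundfile as sf
    import pyworld as pw

    x, fs = sf.read("speech.wav")      # placeholder: any mono wav file
    x = np.ascontiguousarray(x, dtype=np.float64)

    f0, sp, ap = pw.wav2world(x, fs)   # pitch, spectral envelope, aperiodicity
    # The paper compresses these per-frame parameters (516 values,
    # including voicing) to 256 with an autoencoder before decoding
    # them from the neural recordings.
    y = pw.synthesize(f0, sp, ap, fs)  # resynthesize audio from parameters
    sf.write("resynth.wav", y, fs)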

------
amelius
Can someone summarize what mathematical operations are used in the
reconstruction? E.g. is there an (inverse) Fourier transform? (Asking since
the first thing the ear does is basically a Fourier transform, separating the
auditory signal into different frequency bands.)
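
For concreteness, the forward step I'm picturing is a short-time Fourier
transform over the audio, something like this (purely illustrative):

    # Illustrative only: an STFT splits the signal into time-frequency
    # bins, loosely analogous to what the cochlea does with sound.
    import numpy as np
    from scipy import signal

    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t)     # one second of a 440 Hz tone

    f, times, Zxx = signal.stft(x, fs=fs, nperseg=512)
    spectrogram = np.abs(Zxx)           # magnitude spectrogram

    # Inverting a magnitude-only spectrogram needs phase estimation
    # (e.g. Griffin-Lim); with the full complex STFT it's exact.
    _, x_rec = signal.istft(Zxx, fs=fs, nperseg=512)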

Or do they use a more opaque ML approach?

~~~
dr_zoidberg
I just skimmed the paper (don't really have time to sit down and read it
right now), but it seems to go roughly like this:

1. They captured the ECoG of people listening to stories (for training). They
also captured ECoG of people listening to the digits zero to nine.

2. They trained an autoencoder on the vocoder parameters, going from 516
parameters down to 256 and back up to 516 (see the sketch at the end of this
comment). The 516 parameters are the spectral envelope, pitch, voicing and
aperiodicity.

3. They trained the models to receive the HG and LF signals as input (from
the ECoG, I think -- I don't really know what these names mean, just what the
figures in the paper show). The expected output for the models was either the
audio spectrogram or the vocoder's 256-parameter encoding.

So I'd go with "opaque ML approach" to answer your last question. As for the
mathematical operations involved (in general), I didn't read the detailed
explanation (if there is one) of how the networks work, but it's probably
ReLU activations and weighted sums, like most DNNs.
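
If it helps, point 2 boils down to something like this (a hypothetical PyTorch
sketch, not the authors' code; the 516 and 256 come from the paper, everything
else is guessed):

    # Hypothetical sketch of the vocoder-parameter autoencoder
    # (516 -> 256 -> 516). Layer sizes other than input/bottleneck
    # are guesses; not the authors' code.
    import torch
    import torch.nn as nn

    class VocoderAE(nn.Module):
        def __init__(self, n_params=516, n_latent=256):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(n_params, 384), nn.ReLU(),
                nn.Linear(384, n_latent),
            )
            self.decoder = nn.Sequential(
                nn.Linear(n_latent, 384), nn.ReLU(),
                nn.Linear(384, n_params),
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))

    ae = VocoderAE()
    frames = torch.randn(32, 516)   # stand-in for vocoder parameter frames
    loss = nn.functional.mse_loss(ae(frames), frames)
    loss.backward()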

------
pmoriarty
Now can it go the other way?

