
Recurrent Neural Nets for Speech Synthesis - Katydid
http://arxiv.org/abs/1601.02539
======
julespitt
Went looking for audio samples, here's some from one of the researchers:

[http://www.zhizheng.org/demo/is15_mte/demo.html](http://www.zhizheng.org/demo/is15_mte/demo.html)
[http://www.zhizheng.org/demo/dnn_tts/demo.html](http://www.zhizheng.org/demo/dnn_tts/demo.html)

~~~
raverbashing
Also here
[http://homepages.inf.ed.ac.uk/zwu2/demo/icassp16/lstm.html](http://homepages.inf.ed.ac.uk/zwu2/demo/icassp16/lstm.html)

~~~
sawwit
I thought this would be about text-to-speech applications, while this seems
more like an encoder-decoder problem (make the network learn a pattern and
then let it reproduce it). I'm wondering how long it will be until we see
working TTS based on LSTM RNNs.

~~~
ghayes
Yeah, can someone explain the exact problem of "statistical parametric speech
synthesis"? I can't find a general overview of the problem itself.

~~~
amelius
I'm a newbie to all this, but I can imagine it could be useful for speech
compression.

------
nicklo
This paper focuses on statistical parametric speech synthesis (SPSS). SPSS is
only half of the text-to-speech problem.

SPSS is the problem of going from linguistic features, phonemes, etc., to
speech audio. These features are more or less golden, either derived from the
audio itself or hand-labeled. So things like tonality, cadence, and emphasis
on words are already encoded as features, which is why these samples sound so
good.

Deriving these features from pure text is very hard, and this failing is the
main reason most text-to-speech systems sound so dull and tone-deaf.

That being said, these results are seriously impressive, sounding very
natural. I would love to see someone try to train an end-to-end system from
pure text to speech. I think we'd see some big improvements, like what Baidu
has done for end-to-end speech-to-text.
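To make the SPSS framing above concrete, here is a minimal sketch of the mapping it describes: per-frame linguistic features in, per-frame acoustic parameters out, with a recurrent network in the middle. All dimensions and names are hypothetical (not from the paper), and the untrained vanilla RNN stands in for whatever learned model is actually used:

```python
import numpy as np

# Hypothetical sizes: per-frame linguistic features (phoneme identity,
# stress, position-in-phrase, etc.) map to acoustic parameters
# (e.g. spectral coefficients plus F0 and a voicing flag).
FEAT_DIM, HIDDEN_DIM, ACOUSTIC_DIM = 300, 128, 62

rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.01, (HIDDEN_DIM, FEAT_DIM))   # input -> hidden
W_hh = rng.normal(0, 0.01, (HIDDEN_DIM, HIDDEN_DIM))  # hidden -> hidden
W_hy = rng.normal(0, 0.01, (ACOUSTIC_DIM, HIDDEN_DIM))  # hidden -> output

def synthesize_params(linguistic_frames):
    """Map a (T, FEAT_DIM) sequence of linguistic feature frames to a
    (T, ACOUSTIC_DIM) sequence of acoustic parameters with a vanilla RNN.
    A vocoder would then turn those parameters into a waveform."""
    h = np.zeros(HIDDEN_DIM)
    out = []
    for x in linguistic_frames:
        h = np.tanh(W_xh @ x + W_hh @ h)
        out.append(W_hy @ h)
    return np.array(out)

frames = rng.normal(size=(10, FEAT_DIM))  # 10 frames of made-up features
params = synthesize_params(frames)
print(params.shape)  # (10, 62)
```

The point is only the shape of the pipeline: the hard text-analysis step (text to linguistic features) happens before this function is ever called.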

------
mempko
The most interesting part of this paper is their RNN structure, which is
simpler than the LSTM.

------
Moshe_Silnorin
Slightly unrelated question: has there been any effort toward hardware
acceleration of such networks? How amenable are modern machine learning
algorithms to hardware acceleration?

~~~
michael_h
The GPU is pretty well optimized for the sort of operations an RNN needs.

There were a few efforts to make actual silicon neurons, plus the whole
neuromorphic movement, but they generally delivered less than people were
expecting, were slow, and were difficult to interface with.

~~~
gcr
I've seen some work that attempts to recreate "spiking" neural networks
(i.e. neurons that fire when their inputs pass a threshold), intended to
mimic the biochemistry of real neurons.

That work seems to spin its contribution as reducing the power required to
evaluate the neural network, though. If I recall correctly, the accuracy of
those models on everyday tasks is typically much, much lower than that of
conventional ANNs, and they're a pain to train. So, still not very common.
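The fire-when-inputs-pass-a-threshold behavior described above can be sketched with a leaky integrate-and-fire neuron (my choice of model for illustration; the comment doesn't name one, and all constants here are made up):

```python
def lif_spikes(input_current, threshold=1.0, leak=0.95):
    """Leaky integrate-and-fire neuron: the membrane potential decays
    toward zero, accumulates input each step, and emits a spike (then
    resets) whenever it crosses the threshold."""
    v, spikes = 0.0, []
    for i in input_current:
        v = leak * v + i
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # reset after firing
        else:
            spikes.append(0)
    return spikes

# Constant drive below threshold per step: the neuron still fires
# periodically once the potential accumulates past the threshold.
print(lif_spikes([0.4] * 10))  # [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
```

The discrete, non-differentiable spike is exactly what makes these networks hard to train with gradient descent, which matches the "pain to train" point above.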

~~~
michael_h
That is exactly what I made circa 2008. I used the Izhikevich model for
spiking. It was certainly faster on the GPU (2000x), but, yeah, getting the
network to converge on _anything_ was terrible. Debugging it was fun/awful
though.

1:"Hey, do you see the first squiggle with the two fuzzes after."

2:"Next to Beaker's eyebrows?"

The low power work seems to have been aiming to be a rough filter, rather than
a full system. Still fun to use.
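For reference, the Izhikevich model mentioned above is a two-variable neuron model (Izhikevich, 2003): a membrane potential v and a recovery variable u, with a reset whenever v hits 30 mV. A minimal single-neuron simulation, using the standard "regular spiking" parameters (the simulation settings here are my own, not from the comment):

```python
def izhikevich(I, a=0.02, b=0.2, c=-65.0, d=8.0, dt=0.5, steps=2000):
    """Simulate one Izhikevich neuron with forward-Euler steps.
    v: membrane potential (mV), u: recovery variable.
    dv/dt = 0.04*v^2 + 5*v + 140 - u + I
    du/dt = a*(b*v - u)
    On v >= 30: v -> c, u -> u + d (spike and reset).
    Defaults a, b, c, d give 'regular spiking' behavior."""
    v, u = c, b * c
    trace, spikes = [], 0
    for _ in range(steps):
        v += dt * (0.04 * v * v + 5 * v + 140 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:          # spike: clamp the recorded peak, then reset
            trace.append(30.0)
            v, u = c, u + d
            spikes += 1
        else:
            trace.append(v)
    return trace, spikes

trace, n = izhikevich(I=10.0)  # constant input current drives tonic spiking
print(n, "spikes in", len(trace), "steps")
```

Plotting `trace` gives exactly the kind of squiggly membrane-potential traces the debugging exchange above is joking about.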

