
Speech-to-Text-WaveNet: End-to-end sentence level English speech recognition - dudisbrie
https://github.com/buriburisuri/speech-to-text-wavenet
======
gambler
_" Some of Deepmind's recent papers are tricky to reproduce. The Paper also
omitted specific details about the implementation, and we had to fill the gaps
in our own way."_

So, I'm not the only one seeing this issue. It seems like many recent AI
papers want to look as impressive as possible, while giving you as little
implementation info as possible. This bothers me, because it defeats the very
purpose of research publication.

~~~
deepnotderp
This is more specific to DeepMind, actually; Facebook and others have been
pretty good about publishing code.

------
bmc7505
A few weeks ago, a deep learning researcher at one of the world's leading
speech groups told me off-the-record that offline, human-parity speech
recognition would be "coming soon" to mobile devices. Not sure s/he realized
just how soon that would be. Even though state-of-the-art ASR is really
expensive to train, recognition is extremely cheap to run, even on low-power
devices. [1][2] With specialized silicon, you can do this continuously, for
free, on something like a smartwatch. You don't need to open a websocket or
call an API running on some beefy server to do this; speech-to-text is now a
basic commodity. Fully offline, ubiquitous speech recognition is right around
the corner. With human-level speech synthesis [3], speech applications are
going to get very interesting, very quickly.

[1] [http://niclane.org/pubs/deepx_ipsn.pdf](http://niclane.org/pubs/deepx_ipsn.pdf)

[2] [https://www.ibr.cs.tu-bs.de/Cosdeo2016/talks/invitedTalk.pdf](https://www.ibr.cs.tu-bs.de/Cosdeo2016/talks/invitedTalk.pdf)

[3] [https://github.com/ibab/tensorflow-wavenet](https://github.com/ibab/tensorflow-wavenet)

~~~
braindead_in
A consumer-focused, human-parity ASR service will disrupt so many industries,
including mine. I run a human-powered transcription service where we
transcribe files with high accuracy. I am just waiting for the day when our
transcribers can work off an auto-generated transcript instead of typing it
all up manually. I'll pay good money for a service where I can just send a
file and get an 80-90% accurate transcript with speaker diarization.

~~~
imaginenore
I hope you realize your business is about to go out of business. The only
reason you can charge people now is because the automatic recognition sucks
compared to humans.

~~~
braindead_in
We do super-human-parity transcripts. Our transcripts are insanely accurate,
even for challenging files. I'm sure computers will be able to do that one
day, but the Singularity would have already happened by then, wiping out many
businesses. I for one look forward to the Singularity and hope that we will
contribute to it in some way.

~~~
imaginenore
What's super-human parity? And how do you achieve it using humans?

~~~
epistasis
Presumably more accurate than a single human, which you can achieve by using
multiple humans and having them reach a consensus. I remember an anecdote from
physics class where an experiment required counting events over a period of
time. A single person would occasionally blink and miss an event. But if you
have two people and count how many of them observed each event, you can solve
for super-human accuracy using each person's estimated error rate.
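
As a back-of-the-envelope sketch of that consensus idea (assuming, purely for
illustration, that each transcriber makes errors independently):

```python
# If each transcriber independently mislabels a word with probability p,
# a majority vote among an odd number of transcribers drives the error rate
# well below any single transcriber's.
from math import comb

def majority_error(p, n):
    """Probability that a majority of n independent transcribers (n odd)
    mislabel a given word, each with individual error rate p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_error(0.05, 1))  # 0.05   -- one transcriber
print(majority_error(0.05, 3))  # ~0.007 -- three transcribers, majority vote
```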

See also this usage in the context of ML:

[https://arxiv.org/pdf/1602.05314v1.pdf](https://arxiv.org/pdf/1602.05314v1.pdf)

~~~
gwern
Ensembles are well known to be more accurate. But this is not an advantage
exclusive to humans: an ensemble of NNs will do better than any of the
individual NNs.

There's no reason one couldn't train 5 or 10 RNNs for transcription and
ensemble them. (Indeed, one cute trick from this year's ICLR was how to get an
ensemble of NNs for free so you don't have to spend 5 or 10x the time
training: simply lower the learning rate during training until it stops
improving and save the model, then jack the learning rate way up for a while,
start lowering it until it stops improving again, save _that_ model, and when
finished you have _n_ models you can ensemble.) And computing hardware is
cheaper than humans, so it will be cheaper to have 5 or 10 RNNs process an
audio file than to have 2 or 3 humans independently check it, so the
ensembling advantage is actually bigger for the NNs in this scenario.
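
A rough sketch of that snapshot trick (not the paper's exact schedule;
`train_one_epoch`, `evaluate`, and `save_checkpoint` are placeholders for
whatever training loop you already have):

```python
def snapshot_ensemble(train_one_epoch, evaluate, save_checkpoint,
                      high_lr=1e-3, decay=0.5, patience=3, n_snapshots=5):
    """Collect several checkpoints from a single training run: anneal the
    learning rate until validation loss stops improving, save a snapshot,
    then restart from a high learning rate and repeat."""
    snapshots = []
    for _ in range(n_snapshots):
        lr, best_val, stale = high_lr, float("inf"), 0
        while stale < patience:
            train_one_epoch(lr)      # one pass over the training data at this rate
            val_loss = evaluate()    # loss on a held-out validation set
            if val_loss < best_val:
                best_val, stale = val_loss, 0
            else:
                stale, lr = stale + 1, lr * decay  # keep lowering the rate
        snapshots.append(save_checkpoint())
    return snapshots  # average these models' predictions at inference time
```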

Humans still have the advantage of more semantic understanding, but RNNs can
be trained on much larger corpora and read all related transcripts, so even
there the human advantage is not guaranteed.

~~~
visarga
Yeah, but you don't want to run an ensemble of 10 RNNs on your phone, or in
the cloud for that matter, when you have billions of queries. It's too
expensive.

In practice, the ensemble is compactly transferred into a single network. To
do that, they train a new network to copy the outputs of the ensemble,
exploiting "dark knowledge".

Recurrent Neural Network Training with Dark Knowledge Transfer -
[https://arxiv.org/abs/1505.04630v5](https://arxiv.org/abs/1505.04630v5)
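
A minimal numpy sketch of the distillation idea (illustrative names, not the
paper's code): the ensemble's softened outputs become the training targets for
the smaller student network.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the student's softened predictions and the
    ensemble teacher's softened ("dark knowledge") targets."""
    soft_targets = softmax(teacher_logits, T)
    log_student = np.log(softmax(student_logits, T) + 1e-12)
    return -(soft_targets * log_student).sum(axis=-1).mean()

# Dummy example: teacher_logits would be the averaged logits of the ensemble.
teacher_logits = np.random.randn(4, 30)  # (batch, output classes)
student_logits = np.random.randn(4, 30)
print(distillation_loss(student_logits, teacher_logits))
```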

------
kcorbitt
This is really exciting. I previously worked at a startup that could have
benefited enormously from even 90%-accurate speech recognition. As of six
months ago when I last looked, there were no open-source speech-to-text
libraries with anything approaching the performance of the proprietary work by
Google, Microsoft, Baidu, etc. The closest thing was CMU Sphinx, but its
accuracy was unacceptable.

Props to the author, and especially to the DeepMind researchers who published
their work! I look forward to living in a world where this type of technology
is ubiquitous and mostly commoditized.

~~~
bmc7505
The CMU Sphinx project as it stands is basically dead. Even though they
recently implemented some sequence-to-sequence deep learning techniques for
g2p [1], the core stack is still based on an ancient GMM/HMM pipeline, and
current state-of-the-art projects (even open-source ones) have leapfrogged it
in terms of accuracy. If you're implementing offline speech recognition today,
start with something like this or Kaldi-ASR [2]. It will take a bit of work to
get your models running on a mobile device, but the end result will be much
more usable.

[1] [http://cmusphinx.sourceforge.net/2016/04/grapheme-to-phoneme...](http://cmusphinx.sourceforge.net/2016/04/grapheme-to-phoneme-tool-based-on-sequence-to-sequence-learning/)

[2] [http://kaldi-asr.org/](http://kaldi-asr.org/)

~~~
snadal
We've worked with CMU Sphinx in the past too, and the advances in this area
over the last few months are absolutely amazing.

A little bit off-topic, but do you know of any recent work or papers on speech
recognition in the language-teaching area? (I mean analysing and rating a
speaker's accuracy, detecting incorrect pronunciation of phones, and so on.)

~~~
bmc7505
> Do you know of any recent work or papers on speech recognition in the
> language-teaching area?

What you're describing is called "speech verification". Language education is
an application I'm personally very interested in, and one that almost no one
discusses in the speech community (I assume because of machine translation),
so if you find any research papers please let me know! I wrote a little about
it: [http://breandan.net/2014/02/09/the-end-of-illiteracy/](http://breandan.net/2014/02/09/the-end-of-illiteracy/)

The task is actually much simpler than STT. You display some text on the
screen, wait for an audio sample, then check the model's confidence that the
sample matches the text. If the confidence is lower than some threshold, then
you play the correct pronunciation through the speaker. The trick is doing
this rapidly, so a fast local recognizer is key. I've got a little prototype
on Android, and it's pretty neat for learning new words. I'd like to get it
working for reading recitation, but that's a lot of work.
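
The loop is roughly the following, where `recognizer` and `synthesizer` are
stand-ins for whatever local ASR and TTS engines you plug in, not a real API:

```python
def practice_word(expected_text, recognizer, synthesizer, threshold=0.8):
    """Show a prompt, listen, and correct the learner if the recognizer's
    confidence that the utterance matches the prompt is too low."""
    print(expected_text)                                  # display text on screen
    audio = recognizer.record_utterance()                 # wait for an audio sample
    confidence = recognizer.score(audio, expected_text)   # how well does it match?
    if confidence < threshold:
        synthesizer.play(expected_text)                   # play correct pronunciation
    return confidence
```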

~~~
snadal
Hey, thank you for the link to your article. I've read it thoroughly and I
could not agree more. And that was written two and a half years ago, before
the AI "explosion" that we saw later.

Actually, checking against confidence is something that we've tried to play
with, but to my knowledge there is no model that lets you compare speech
confidence against a specific text. Public APIs like MS ProjectOxford.ai can
return a confidence, but against the "recognised" text, not against a
predefined text.

Going further, this kind of approach can be very effective for words and short
sentences, but I'd really love to see which specific phones the learner is
failing on, which would help in analysing full speaking exercises.

It works, but I am sure it should be possible to do better.

------
brandoncarl
To the authors: did you try any of your own recordings? I've used my own
recordings and clips from online, in WAV and other formats, at various
sampling rates.

All of the results come back as gibberish, while the results on the training
data seem just fine. Curious if you've tested the above to ensure it didn't
overfit.

------
craigbaker
Is this really speech recognition from raw waveforms? It looks like they're
extracting MFCC features from the raw audio, and using that as input to the
neural network. I thought that the point of WaveNet was that it took the raw
waveform directly as input, unlike previous architectures which first extract
spectral features such as MFCCs to use as the input.

~~~
bmc7505
Apparently, they tried to use the raw audio waveform with the original setup
from the WaveNet paper but couldn't get it to train on their TitanX, so they
used MFCCs instead. It's not exactly clear why this is the case.

"Second, the Paper added a mean-pooling layer after the dilated convolution
layer for down-sampling. We extracted MFCC from wav files and removed the
final mean-pooling layer because the original setting was impossible to run on
our TitanX GPU." [1]

[1] [https://github.com/buriburisuri/speech-to-text-wavenet#speec...](https://github.com/buriburisuri/speech-to-text-wavenet#speech-to-text-wavenet--end-to-end-sentence-level-english-speech-recognition-using-deepminds-wavenet)
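
For reference, a stack of dilated convolutions looks roughly like this in
tf.keras (a generic sketch with made-up layer sizes, not the repo's actual
code; the real WaveNet block also adds gated activations and skip
connections):

```python
import tensorflow as tf

def dilated_conv_stack(num_features=20, filters=128, dilations=(1, 2, 4, 8, 16)):
    """WaveNet-style stack: each layer doubles the dilation rate, so the
    receptive field grows exponentially with depth."""
    inputs = tf.keras.Input(shape=(None, num_features))  # (time, MFCC coefficients)
    x = inputs
    for d in dilations:
        x = tf.keras.layers.Conv1D(filters, kernel_size=2, padding='causal',
                                   dilation_rate=d, activation='relu')(x)
    return tf.keras.Model(inputs, x)
```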

------
RandomInteger4
How much bandwidth is consumed by voice communications, such as when speaking
to someone on Skype or over the phone, vs. the same words transmitted via
text?

Perhaps future communication applications can have a WaveNet on either end,
which learns the voice of the person you're communicating with and then only
sends text after a certain point in the conversation?

I'm coming at this from a point of ignorance though, so correct me if I've
made erroneous assumptions.

~~~
dest
Text communication is much lighter (a few bytes/s vs. kb/s), but you may miss
the non-verbal content of voice.
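
Rough numbers, assuming about 150 words per minute of speech, roughly 6 bytes
per word of text, and a 16 kbit/s voice codec:

```python
text_bytes_per_sec = 150 * 6 / 60                # ~15 B/s of plain text
voice_bytes_per_sec = 16 * 1000 / 8              # 2000 B/s at 16 kbit/s
print(voice_bytes_per_sec / text_bytes_per_sec)  # voice is ~130x heavier
```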

~~~
RandomInteger4
By non-verbal do you mean like ambient sound? Dogs barking, a child yelling, a
garbage truck garbage-trucking? I don't know. If they can do voice, then it
might be possible to do ambient sounds if there is a separate net trained on a
library of ambient sounds, tuned so the sound isn't identical every time it
plays, similar to how tiled graphics use algorithms that remove the unnatural
sameness from one tile to the next.

This could have interesting implications for Foley artists of the 21st
century.

How likely is it that such tech could help lower-budget companies who want to
implement voice communication within their software, say for video games or
similar?

Hmm, now this has me wondering what implications this has for voice acting as
well.

EDIT: We can call the ambient sound symbols sent over the wire "Soundmojis" or
"amojis" or "audiomojis"

~~~
dest
I was thinking about voice intonation. For example, the sentences "this is
really great" or "how do you do? -> I'm fine, thank you" can have opposite
meanings depending on the intonation. This explains a lot of the
misunderstandings on written forums.

It should be possible to train a neural network to catch those special
intonations, but it is IMHO substantially harder than the initial project,
with uncertain results.

~~~
RandomInteger4
Oh, right. I can't believe I forgot about intonation ... I should really get
out and talk to people via voice more ...

------
throwaway13337
This seems super useful for most speech recognition: understanding context.

It doesn't seem like the mainstream engines (Alexa, Google Voice, Siri) are
context-aware. Why not?

~~~
doublerebel
Context involves location, which 99% of the time those bots don't take into
consideration. Context does not mean knowing everything about your email or
being able to search the entire web; it's much more connected to what you just
did and where and when you are doing it.

This is what I'm solving at Optik: helping you manage the things that you care
about in the place that you are, and NOT exposing your personal details to
cloud computation.

------
teajunky
Wow, train.py contains only 83 lines of code (including a few empty lines and
comments). And recognize.py is only a little bit longer at 108 lines. Very
impressive.

~~~
bra-ket
typical of machine learning, a whole lot of talking about a few lines of code

~~~
hyperbovine
The FFT is 4 lines; what's your point?

------
IshKebab
Can someone explain why MFCC is used rather than allowing the neural network
to learn from the raw waveform? I looked back in the literature and the
intention of MFCC and PLP seems to be to remove speaker-dependent features
from the audio in order to reduce the dimensionality of the input. But I
thought the whole point of neural nets is that they can learn from very
high-dimensional inputs, no?

I had a go at implementing wave->phoneme recognition using a simple neural net
and it seemed to work pretty well.

------
Karlozkiller
This is exactly what I would have wanted for my master's thesis about half a
year ago, where I wanted to use speech-to-text with good control over the
system without having to implement everything myself.

------
echelon
Did the original WaveNet text-to-speech demo come with a paper or source code?
(I didn't see either.) I'm interested in techniques, particularly
neural-network-related ones, to improve the quality of my Donald Trump
text-to-speech engine [1].

Does anyone on HN do active research in this field? Could I pick your brain
for a survey of the best papers (especially review papers) on the subject?

[1] [http://jungle.horse](http://jungle.horse)

~~~
bmc7505
> Did the original WaveNet text-to-speech demo come with a paper or source
> code?

Paper, yes. [1] Source code, no.

[1]
[https://arxiv.org/pdf/1609.03499.pdf](https://arxiv.org/pdf/1609.03499.pdf)

------
londons_explore
Looking at the training loss graph, it looks like training for more time would
produce even better results...

Anyone want to volunteer a few weeks of GPU time to train this better?

~~~
gwern
Training loss pretty much always decreases. NNs are extremely powerful models,
so they can overfit most data. What you want to see is the _validation_ loss
graph.

------
mo1ok
This is awesome. I was just reading the WaveNet paper and wondering how I
would go about a DIY approach...

------
EGreg
Does this require an internet connection, though? Compared to, say, OpenEars?

------
amelius
Perhaps Linux can now finally get a speech recognition input device.

