
DeepSpeech 0.6 - pionerkotik
https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-speech-to-text-engine/
======
bmn__
I do not understand how to use Deepspeech even in the most simple use case.

1. I want to teach it ten words. How do I do this?

2. I want to speak into my microphone (available as a PulseAudio device) and
have it recognise the words and output them as a text stream on stdout. How do
I do this?

This is the documentation:

[https://deepspeech.readthedocs.io/en/v0.6.0/Python-Examples.html](https://deepspeech.readthedocs.io/en/v0.6.0/Python-Examples.html)
[https://deepspeech.readthedocs.io/en/v0.6.0/Python-API.html](https://deepspeech.readthedocs.io/en/v0.6.0/Python-API.html)

It does not answer the questions I have.

[https://deepspeech.readthedocs.io/en/v0.6.0/DeepSpeech.html](https://deepspeech.readthedocs.io/en/v0.6.0/DeepSpeech.html)

The introduction page is full of incomprehensible jargon.

~~~
sillysaurusx
So, funny story: I wanted #2 and did it myself. I was similarly frustrated
with the lack of documentation.

It's been a little while since I got it running, but I basically got a Siri
clone working. If you want to test it out, I can try to answer questions and
help with whatever problems pop up.

The code is here:
[https://github.com/shawwn/DeepSpeech/commit/01f5cf8d39c356ae...](https://github.com/shawwn/DeepSpeech/commit/01f5cf8d39c356aecf84b9daaa5b38747d64c209)

As far as I know, you can simply run speech_to_text.sh. It will connect to
your microphone and start dumping out transcribed audio to stdout.

It wasn't super easy, but once you spend a little time with the code you can
sort of figure out ways to get it to do what you want.
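
If you'd rather not dig through my fork, a bare-bones version of the same idea
looks roughly like the sketch below. It assumes the deepspeech 0.6.0 and
pyaudio Python packages plus the released pre-trained model files, and the
method names follow my reading of the v0.6 Python API, so double-check them
against the docs if something errors out.

    import numpy as np
    import pyaudio
    from deepspeech import Model

    # Paths to the released 0.6.0 model files; adjust to wherever you unpacked them.
    model = Model("deepspeech-0.6.0-models/output_graph.pbmm", 500)  # 500 = beam width
    model.enableDecoderWithLM("deepspeech-0.6.0-models/lm.binary",
                              "deepspeech-0.6.0-models/trie", 0.75, 1.85)

    # The model expects 16 kHz, 16-bit, mono audio.
    pa = pyaudio.PyAudio()
    mic = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                  input=True, frames_per_buffer=1024)

    sctx = model.createStream()
    try:
        while True:
            chunk = mic.read(1024, exception_on_overflow=False)
            model.feedAudioContent(sctx, np.frombuffer(chunk, dtype=np.int16))
            # Partial transcript so far; finishStream() gives the final one.
            print(model.intermediateDecode(sctx), flush=True)
    except KeyboardInterrupt:
        print(model.finishStream(sctx))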

EDIT: By the way, people will try to convince you that you're nuts and that
the documentation is crystal clear and so on. Know this: It's not just you. I
had the exact same experience. It's a recurring theme in AI programming.

The only thing to do is to either find someone else who feels similarly, or
roll up your sleeves and dive into the code.

~~~
logicallee
>By the way, people will try to convince you that you're nuts and that the
documentation is crystal clear and so on

No it really is simple: the original implementation had a base of prefabulated
amulition grammar, this was surmounted by a malleable logarithmic function in
such a way that the two main spurving vocabularies were in a direct line with
the panametric grammar fields. The latter simply consists of marzlevanes
fitted to the ambifacient morpheme waneshaft to eliminate side utterances.
This main winding is of the normal lotus-o-deltoid type, just placed in
panendermic semi-boloid stators. Basically every second conductor is connected
by a nonreversible tremmie pipe to the differential grammeters.

All these words are in:
[https://en.wikipedia.org/wiki/Turboencabulator](https://en.wikipedia.org/wiki/Turboencabulator)

------
detaro
> _It achieves a 7.5% word error rate on the LibriSpeech test clean benchmark_

Anyone have a comparison for how good/bad that is compared to other solutions,
and what it means for practical usage, if that can be guessed at from a single
number?

~~~
reubenmorais
Author here. I would add to what the sibling comments have mentioned by saying
that SotA results should be taken with a grain of salt. Our engine is capable
of streaming (processing the audio as it's being recorded), which is not
doable with architectures that have bidirectional decoders or attention
mechanisms that require the whole encoder input ahead of time.

For real-world applications this is absolutely crucial: users want latency
numbers on the order of milliseconds, not seconds. This is why, if you run a
standard test set like LibriSpeech on, say, a commercial offering from Google,
it will perform considerably worse than state of the art according to Google
papers.

This repository [0] has a benchmark of some commercial offerings. Our model
beats all of those on Librispeech clean and other (except for Speechmatics on
Librispeech clean), as well as on Common Voice. But note that the Common Voice
corpus used in that benchmark is very old.

In sum, I would compare this against solutions that go for the same space:
fast, client-side ASR, rather than state of the art.

[0] [https://github.com/Franck-Dernoncourt/ASR_benchmark#benchmark-results](https://github.com/Franck-Dernoncourt/ASR_benchmark#benchmark-results)
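
For anyone unfamiliar with the metric: WER is word-level edit distance
(substitutions + insertions + deletions) divided by the number of words in the
reference, so you can sanity-check numbers like these on your own transcripts
with a few lines of Python. A rough sketch (not our actual evaluation code):

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level Levenshtein distance / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between the first i ref words and first j hyp words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution / match
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer("hello this is a test message",
              "hello this is the best message"))  # 2 errors / 6 words = 0.33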

~~~
albertzeyer
Hi! I'm from the RWTH team
([https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean](https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean)).
Our best system (2.3% WER, trained only on the Librispeech data, i.e. much less
data than your model) is currently a hybrid HMM/NN model, and you are right,
the acoustic model uses a BLSTM. However, we have shown in other work
([https://www-i6.informatik.rwth-aachen.de/publications/download/1009/Zeyer--2016.pdf](https://www-i6.informatik.rwth-aachen.de/publications/download/1009/Zeyer--2016.pdf))
that you can expect similar performance from an online system, or not much
worse (there is also a lot of related work, which is often maybe 5-10% relative
worse than the offline system).

I guess you prefer an "end-to-end" model over a hybrid HMM/NN model, for
simplicity, right? As far as I remember, you use CTC, right? I always wondered
why you have chosen CTC, and not some better model, like RNN-T, RNA, or some
of the streaming attention variants. They should all give you the same
properties (online capable, low latency, simple, end-to-end), but with much better
WER performance. Or is this simply because there currently is no simple ready-
to-use implementation for those? Note that we published some TF code recently
for some streaming attention variants, and plan to publish some RNN-T/RNA code
soon.

~~~
reubenmorais
Hi!

> I guess you prefer an "end-to-end" model over a hybrid HMM/NN model, for
> simplicity, right?

Simplicity and ease of targeting other languages, yes. We're a small team.

> As far as I remember, you use CTC, right? I always wondered why you have
> chosen CTC, and not some better model, like RNN-T, RNA, or some of the
> streaming attention variants.

We started DeepSpeech in 2016, before these recent developments for end-to-end
ASR were mainstream/SotA.

> Or is this simply because there currently is no simple ready-to-use
> implementation for those?

Implementing the model architecture for training is only part of the problem
for us. We have a hand-crafted inference graph to allow for small and
efficient client code and inference models, and the more complex the
architecture is, the trickier it gets to make sure it all works on all
platforms, including TFLite, with quantization, etc.

We're investigating alternative architectures as well as mixed CTC/RNN LMs to
deal with language model size, but no final decisions have been made yet.

> Note that we published some TF code recently for some streaming attention
> variants, and plan to publish some RNN-T/RNA code soon.

Nice! Can you share a link to the streaming attention code?

~~~
mostlyjason
I really hope you adopt the latest models, particularly streaming attention
variants. I think you should validate with users the assumption that latency
is more important than WER.

IMHO the WER is more important than latency improvements in the millisecond
range. The most frustrating thing is having to dictate over and over because
the transcription is incorrect each time.

Consider that the time to a correct transcription is the latency plus error
correction. If error correction is manual it will be orders of magnitude
slower, so optimize for WER.
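
To put rough numbers on that (the utterance length and per-correction cost
below are assumptions for illustration, not measurements):

    # Back-of-envelope: time to a correct transcript for a 20-word utterance,
    # assuming it takes ~4 seconds to manually fix each wrong word.
    words = 20
    fix_seconds_per_error = 4.0

    def time_to_correct(latency_s, wer):
        return latency_s + words * wer * fix_seconds_per_error

    print(time_to_correct(latency_s=0.2, wer=0.075))  # 200 ms latency, 7.5% WER -> ~6.2 s
    print(time_to_correct(latency_s=0.5, wer=0.030))  # 500 ms latency, 3.0% WER -> ~2.9 s

A few hundred extra milliseconds of latency is dwarfed by the correction cost
once the error rate climbs.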

In terms of competition, Siri has latency in the 5+ second range due to the
network call, especially in areas with poor data rates. I think a client-side
model like yours will easily win in this category. If you’re already ahead
here, why not focus on WER next?

Another great capability is to generate alternative transcriptions for words
with low confidence values to allow for quick error correction. Do you offer
something like this today?

Also, consider the long term view that new models are constantly being
released and refined. It’d be best to have an architecture that allows quick
replacement without a lot of hand tuning, or where the tuning can be automated
to a greater extent.

~~~
whyleyc
Thanks for all the hard work you have put in so far, @reubenmorais.

+1000 to @mostlyjason's comment - Great latency figures mean nothing if the
word error rate is high, since it dents confidence in the output (so why use
DeepSpeech?) and (as the parent comment notes) necessitates manual error
correction.

I would love to see a future release focus on optimizing WER for these
reasons.

------
bayesian_horse
I just found [https://voice.mozilla.org/](https://voice.mozilla.org/)

Besides being a great resource for speech analysis, this could be a real game
changer for acquiring listening comprehension in a foreign language.

I feel that even after a few years of learning a new language I still have
trouble with listening. Part of that is that it's often all or nothing: even
one or two unknown words in a sentence mean I can't understand the sentence.
But worse is that most language teaching materials use a very small set of
native speakers, which deprives the learner's brain of being able to
generalize.

~~~
pete762
Some of the English samples by non-natives were completely incomprehensible. I
hope the system learns to be accent free.

~~~
bayesian_horse
They have a validation process, and the data is mainly meant to train speech-
to-text.

For that, it's important to include accents and even non-native speakers,
especially in English.

I think those WaveNet models for TTS could be conditioned on gender and maybe
accent, if they had the data.

------
mncharity
Perhaps there's an opportunity to create a demo webpage using TensorFlow.js,
especially with the new TFLite support, to raise visibility and awareness. The
posenet webcam demo [1], for example, seems to have ~2k links.

[1] [https://storage.googleapis.com/tfjs-models/demos/posenet/camera.html](https://storage.googleapis.com/tfjs-models/demos/posenet/camera.html)

------
est31
Congrats to the team on the new release! I've been following their progress
for a while. It's an important project, and I'm very happy they delivered the
size reductions along with a WER improvement over the last release. Amazing!

~~~
reubenmorais
Thank you for your work on the Rust bindings :)

------
jcims
This reminds me of something I would love to see happen but I don't have the
skills to put it all together. I really think there's some potential merit to
a reading coach app(lication) that listens to someone read and looks for
weaknesses/disorders/etc compared to a trained model. It could provide those
diagnostics to an educator, guide the content to focus on those, coach the
reader directly, etc.

It all seems very doable based on what I see in the technology today, I just
don't have the skills to do it.

~~~
Enginerrrd
I'll be honest... that seems like a pretty tough sell to me. You could be
looking at anything from a speech disorder to a learning disability to a
perfectly healthy, normal child with idiosyncrasies, and those things are
usually left to professionals with graduate degrees. The liability seems high:
you've got both the potential stigma associated with incorrectly flagging a
child and, worse, the risk of missing an otherwise obvious issue, meaning
treatment is delayed during critical years.

~~~
jcims
I totally understand where you're coming from, but the reality is that there
is a lot of imbalance and delay in this type of assessment and educational
support already. My wife has been an English language arts teacher and
instructional coach for almost 20 years. While they definitely have standard
diagnostics and regular assessments to try to keep kids on track, it's
essentially best effort. And when you move out of 4th/5th grade in many
districts you start getting to 90:1 student-teacher ratios with 1-2 interns
available to do 1:1 reading support across maybe 500 kids. And this is in a
middle class school district.

There are definitely some ways it can go bad, but even basic fluency monitoring
without any active remediation attempts would be a good addition.

~~~
yorwba
> even basic fluency monitoring without any active remediation attempts would
> be a good addition.

How do you define fluency? I guess it involves some combination of speed and
accuracy. Speed shouldn't be too hard to measure, but for accuracy you'll
probably need a model that's more accurate than 7.5% WER and can handle the
difference in vocal range between children and adults. Otherwise the speech
disfluencies you want to detect will be drowned out by the model not correctly
recognizing actually pretty clear speech.

~~~
jcims
In the US, fluency metrics in ELA vary across standards/states but are
generally coarse grained and somewhat subjective. Here's an example of the
standard that a student is expected to meet for fluency in the fifth grade
(bottom half): [http://www.corestandards.org/ELA-Literacy/RF/5/](http://www.corestandards.org/ELA-Literacy/RF/5/)

You _definitely_ would have to use a model trained from content in the target
audience (probably down to the grade level as things change so dramatically
from year to year) as well as probably some labeled examples of students with
various reading/speech deficiencies. This of course would lead to a lot of
challenges from a privacy/regulatory/etc perspective and the entire thing
would be challenging from an optics perspective (AI teachers taking over, etc
etc).

Edit: One thing to keep in mind with accuracy is that there is no ambiguity
about what the word _should_ be; the question is how closely the utterance
matches the expected sounds of the word.
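
As a toy illustration of that last point: since the passage is known ahead of
time, you can align whatever the recognizer heard against the expected words
and flag the divergences as candidate trouble spots, rather than trying to
diagnose anything from the transcript alone. A sketch (difflib is just the
simplest word-level aligner here, not a claim about how a real system would do
it):

    import difflib

    def flag_misreads(expected: str, recognized: str):
        """Return (expected, heard) pairs where the reading diverged from the passage."""
        exp, rec = expected.lower().split(), recognized.lower().split()
        flags = []
        for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, exp, rec).get_opcodes():
            if op != "equal":
                flags.append((" ".join(exp[i1:i2]) or "<inserted>",
                              " ".join(rec[j1:j2]) or "<omitted>"))
        return flags

    passage = "the quick brown fox jumps over the lazy dog"
    heard = "the quick brown fox jump over lazy dog"
    print(flag_misreads(passage, heard))
    # [('jumps', 'jump'), ('the', '<omitted>')]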

------
z3t4
I was very excited about this, then I tried it with the pre-trained model: I
recorded "Hello this is a test message" 3 times and it was inferred as "a
sassanian", "he is a paris", "he states that" ...

------
testbed
I can't seem to find pre-trained models for other languages (French). How long
did training take for the English ones? Do you think it makes sense to start
from the English one? Thank you!

------
tobylane
I'm starting to look into MycroftAI. It sounds like with this release I could
use Mycroft + DeepSpeech on a new Raspberry Pi for a completely offline smart
speaker, do I understand that right?

~~~
reubenmorais
Correct. The latest release with the TensorFlow Lite model runs in real time
on a Raspberry Pi 4. I'm not sure if the Mycroft integration is updated to the
latest version though, as it was just released.

------
maxbrunsfeld
> DeepSpeech v0.6 with TensorFlow Lite runs faster than real time on a single
> core of a Raspberry Pi 4

This is great news.

I'm not very familiar with the deep learning framework ecosystem. Does anyone
know what the simplest way would be to incorporate a DeepSpeech model into a
WebAssembly project?

For instance, is it straightforward to compile Tensorflow Lite with
emscripten? If not, can this TensorFlow model be run with tensorflow.js? Or
can it be converted to some other format that is easier to use with WASM?

------
franciscop
TensorFlow Lite, interesting. Has anyone tried this with the Google Coral USB
accelerator? I have one lying around but very little experience with ML. I
could get the USB Accelerator working with the pretrained posenet after much
messing around, but my chances with DeepSpeech are slim to none. This seems
like the perfect fit, though.

~~~
dabinat
To the best of my knowledge, DeepSpeech currently isn’t compatible with any
inference or training accelerators (other than NVIDIA GPUs).

But the whole point of the TFLite model is that it has low resource
requirements so an accelerator for inference shouldn’t be necessary.

~~~
reubenmorais
TFLite on Android with the NNAPI delegate is theoretically supposed to
leverage hardware accelerators. We haven't tested it extensively though.

------
bayesian_horse
I wonder if it is any good at out-of-vocabulary words? Is it hard to teach it
new things like medical terms and such?

~~~
reubenmorais
In the post I mention a Brazilian company that's using DeepSpeech for
vocabulary-constrained medical applications. They've trained their own
Brazilian Portuguese models.

The English acoustic model that we released does not have a fixed vocabulary.
What determines the vocabulary is the language model, which can be created
from text. So if you have in-domain text, and would like to try it out, I
would say the first step is to create a language model using your text and
then experiment with it.

We have some documentation on how to create the LM here:
[https://github.com/mozilla/DeepSpeech/tree/v0.6.0/data/lm](https://github.com/mozilla/DeepSpeech/tree/v0.6.0/data/lm)

It's not super detailed, but I'd be happy to answer questions on our
Discourse: [https://discourse.mozilla.org/c/deep-speech](https://discourse.mozilla.org/c/deep-speech)
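
The rough flow: build an ARPA model from your in-domain text with KenLM's
lmplz, convert it with build_binary, generate a trie with the generate_trie
tool from native_client, and then point the decoder at the results. On the
Python side that last step looks something like this (the file names are
placeholders, and the alpha/beta values are just starting points to tune):

    from deepspeech import Model

    # The released acoustic model stays the same; only the decoder's LM changes.
    model = Model("deepspeech-0.6.0-models/output_graph.pbmm", 500)

    # my_domain.binary / my_domain.trie are the (hypothetical) outputs of the
    # lmplz -> build_binary -> generate_trie steps described above.
    model.enableDecoderWithLM("my_domain.binary", "my_domain.trie", 0.75, 1.85)

    # From here, model.stt(audio) decodes against your in-domain vocabulary.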

------
kbumsik
Looks great. Is it related to any of Mozilla's products?

~~~
reubenmorais
We'll have more to share about that soon :)

------
therealmarv
So who will be developing an open-source assistant like Google Home / Alexa?
:)

~~~
Erlich_Bachman
If you do a one-minute Google search you will find that there are multiple
open source voice assistant offerings already, for example Mycroft.

~~~
detaro
Most of which are not fully open, e.g. do not have open speech recognition.

------
travisporter
Any recommendations on a good Raspberry Pi HAT or microphone to test this?

