
A TensorFlow implementation of Baidu's DeepSpeech architecture - rhakmi
https://github.com/mozilla/DeepSpeech
======
amelius
By the way, there's currently a Kaggle competition on speech recognition for
TensorFlow [1].

[1] [https://www.kaggle.com/c/tensorflow-speech-recognition-chall...](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge)

~~~
woodson
That challenge seems to be more about speech command recognition (isolated
words). They supply 1-second-long recordings of 30 short words.

> There are only 12 possible labels for the Test set: yes, no, up, down, left,
> right, on, off, stop, go, silence, unknown.

This won’t work for large-vocabulary continuous speech recognition, which is
what you need if you want to transcribe podcasts, phone calls, or
human-to-human spoken interaction in general.

------
worldsayshi
It would be awesome to see open source speech-to-text becoming a viable
option.

Has anyone tested this out? Impressions of its usability?

~~~
kleiba
What about Kaldi? Do you not consider that a viable option?

~~~
hiddencost
I made an embarrassing amount of money back when I was a consultant, just
helping people get Kaldi working in their production environments or making
small tweaks to their models. I really don't consider it viable unless you're
in the field, and it's especially not viable in a production environment.

~~~
confounded
That’s a shame to hear, I’ve always had high hopes for it.

What makes it less suitable for production compared to TF?

Is it just the Kafkaesque build process (admittedly I last tried a year ago),
or is model-fitting or prediction especially buggy?

------
iandanforth
Thanks so much for this! The effort to package this model must have been
significant. As awesome as it is to see source code implementations, being
able to pip install a pretrained model is even better. I hope others emulate
this!

------
rav
I wrote a VAMP plugin that can be used in Audacity to run DeepSpeech on
selected ranges of audio:
[https://github.com/Mortal/vampdeepspeech](https://github.com/Mortal/vampdeepspeech)

------
contingencies
1997: Nobody believes the NSA is listening to phone calls. Netscape is rich.
China barely has the internet.

2017: Kids can run real time E1+ voice transcription systems made exclusively
of free software on commodity gaming hardware. Bizarrely, the dominant
implementation is based upon the "free" browser community Mozilla, based upon
work released by a "don't be evil" global megacorporation, but they are
reduced to imitating China to get there.

~~~
gok
I’m tempted to just ignore this troll but this is highly uninformed. There’s
nothing “dominant” about this implementation or the DeepSpeech architecture in
general. And just to address the Sinophobia at the end of your post: the Deep
Speech papers were published by Baidu’s Silicon Valley lab, not “China.”

~~~
typon
I don't understand this "China == Evil" attitude on this website.

~~~
irq11
On the weekend, HN is overrun with nationalist/conservative trolls.

~~~
sctb
This doesn't help. Please follow the guidelines and flag comments you think
break them, and don't comment that you did.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

~~~
irq11
Was I supposed to flag the parent comment? Perhaps instead of attacking
well-meaning people for observing problems, you should focus on solving the
problem.

~~~
sctb
If you go to a comment's page by clicking on the timestamp (e.g.
[https://news.ycombinator.com/item?id=15837946](https://news.ycombinator.com/item?id=15837946))
you should see a “flag” link.

------
ssttoo
I ran into some minor glitches trying to install and use DeepSpeech a couple
of days ago. I’m sure they’ll be fixed soon enough, but meanwhile I hope this
helps:
[https://www.phpied.com/taking-mozillas-deepspeech-for-a-spin...](https://www.phpied.com/taking-mozillas-deepspeech-for-a-spin/)

~~~
kdavis
It only works on "short" audio clips, about 5 seconds or so. (We should have
documented this better, but I just put in a PR adding this to the
documentation.)

However, you can use voice activity detection (VAD), for example webrtcvad
from PyPI, to chop long audio into smaller chunks it can digest.

Maybe we should just put VAD in the client and have this occur automatically?
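The chop-at-silence idea above can be sketched with a naive energy-threshold
splitter. This is a much cruder stand-in for webrtcvad (which uses a trained
model rather than a fixed RMS threshold); the function names, frame size, and
threshold values here are all illustrative:

```python
import math

def frame_rms(frame):
    """Root-mean-square amplitude of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def split_on_silence(samples, frame_len=320, threshold=500.0, min_silence=5):
    """Split a sample sequence into speech chunks at silent gaps.

    A frame is treated as silence when its RMS falls below `threshold`;
    a run of `min_silence` silent frames ends the current chunk. Short
    silences inside a chunk are kept so words aren't clipped.
    """
    chunks, current, pending = [], [], []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = list(samples[i:i + frame_len])
        if frame_rms(frame) < threshold:
            pending.append(frame)
            if current and len(pending) >= min_silence:
                chunks.append(current)   # long gap: close the chunk
                current, pending = [], []
        else:
            if current:                  # keep brief in-chunk silence
                for f in pending:
                    current.extend(f)
            pending = []
            current.extend(frame)
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be fed to the recognizer separately; a real
pipeline would use webrtcvad's 10/20/30 ms frame classification instead of a
fixed threshold.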

~~~
TheAceOfHearts
Personally, I'd love to see that as part of the client.

Just this week I started looking into how I could generate transcripts for a
bunch of videos. Even if the transcripts aren't perfect, it helps with tagging
and searching through large video collections that include certain keywords.

Sadly, I didn't have any luck with local solutions. I managed to generate a
few transcripts using GCP's Cloud Speech API with minimal hassle, but I'd much
prefer to do it locally.

I was planning on trying this out later today, and had already downloaded the
Common Voice corpus. Having to add another step to break up the input into
smaller chunks probably isn't a huge deal, but I wouldn't have known what tool
to use in order to achieve that.

Do you know of any comparisons between various speech-to-text tools? I've
avoided commercial tools so far because I'm hesitant to drop $250+ just for
playing around, but I'd be interested in seeing if they're truly superior to
existing open alternatives.

~~~
kdavis
I've added an enhancement request, issue 1064 [1], on GitHub asking for the
clients to support longer audio clips.

I can't promise when we'll get to it, as the period from now until the New
Year is a bit of a wash.

I don't know of any detailed comparisons of commercial solutions. However,
with respect to pure word error rate, the paper [2] compares several engines,
circa 2015.

[1]
[https://github.com/mozilla/DeepSpeech/issues/1064](https://github.com/mozilla/DeepSpeech/issues/1064)

[2] [https://arxiv.org/abs/1412.5567](https://arxiv.org/abs/1412.5567)
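For reference, the word error rate used in such comparisons is conventionally
the word-level edit distance (substitutions + deletions + insertions) divided
by the reference length. A minimal sketch, assuming a non-empty reference
transcript:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words.

    Assumes `reference` contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j              # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis inserts many spurious words.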

