
Guide to Speech Recognition with Python - hn17
https://realpython.com/python-speech-recognition/
======
kastnerkyle
For people who want simple, out-of-the-box stuff (not necessarily in Python)
for just getting phonemes, I can also recommend [0]. Not amazing recognition
quality, but dead simple setup, and it is possible to integrate a language
model as well (I never needed one for my task). The author showed it in [1]
as well, but kind of skimmed right by it - to me, if you want to learn speech
recognition in detail, pocketsphinx-python is one of the best ways.
Customizing the language model is a _huge_ boost in domain-specific
recognition.

Large company APIs will usually be better at generic speaker, generic language
recognition - but if you can do speaker adaptation and customize the language
model, there are some insane gains possible since you prune out a lot of
uncertainty and complexity.

If you are more interested in recognition and alignment to a script, "gentle"
is great [2][3]. The guts also have raw Kaldi recognition, which is pretty
good for a generic speech recognizer but you would need to do some coding to
pull out that part on its own.

For a decent performing deep model, check into Mozilla's version of Baidu's
DeepSpeech [4].

If doing full-on development, my colleague has been using a bridge between
PyTorch (for training) and Kaldi (to use their decoders) to good success [5].

[0] how I use pocketsphinx to get phonemes:
[https://github.com/kastnerkyle/ez-phones](https://github.com/kastnerkyle/ez-phones)

[1] [https://github.com/cmusphinx/pocketsphinx-python](https://github.com/cmusphinx/pocketsphinx-python)

[2] [https://github.com/lowerquality/gentle](https://github.com/lowerquality/gentle)

[3] how I use gentle for forced alignment:
[https://github.com/kastnerkyle/raw_voice_cleanup/tree/master...](https://github.com/kastnerkyle/raw_voice_cleanup/tree/master/alignment)

[4] [https://github.com/mozilla/DeepSpeech](https://github.com/mozilla/DeepSpeech)

[5] [https://github.com/mravanelli/pytorch-kaldi](https://github.com/mravanelli/pytorch-kaldi)

------
capo64
No mention of DNN-based ASR like DeepSpeech? There are even open-source
Python implementations available from Mozilla and Paddle.

These models are way easier to train, have surprisingly good accuracy, and are
robust to noise.

~~~
sdenton4
afaik, pure DNN models still lag seriously behind 'traditional' HMM-based
frameworks augmented by neural networks (using DNNs for specific parts of the
pipeline). Last I checked a couple of months ago, state of the art for HMM+DNN
was something like 6% word error rate (WER). The best seq2seq DNN I know of
hit 18% WER, dropping to 10% when a secondary language model was integrated.
(My guess is that part of the problem is leaning too heavily on the
attention mechanism... a more 'streaming friendly' framework should help
reduce the load on it.)

[https://arxiv.org/pdf/1610.03022.pdf](https://arxiv.org/pdf/1610.03022.pdf)
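For reference, the WER numbers above are just word-level Levenshtein distance
(substitutions + insertions + deletions) divided by the number of reference
words - a minimal sketch (function name is mine, not from any library):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # cost of deleting all reference words so far
    for j in range(len(h) + 1):
        d[0][j] = j          # cost of inserting all hypothesis words so far
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

So a single substitution in a two-word reference is already 50% WER, which is
why those single-digit percentages are hard-won.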

~~~
kastnerkyle
This has changed recently, full seq2seq is now matching hybrid models [0].

[0] [https://arxiv.org/abs/1712.01769](https://arxiv.org/abs/1712.01769)

~~~
sdenton4
Oh, thanks! Now I know what I'm reading on the commute tomorrow.

------
nshm
The SpeechRecognition module is pretty popular, but it has some important API
design flaws. The thing is that speech is always a continuous stream of data,
and you need a streaming-like API for a proper user experience - you need to
respond to events as soon as they appear. You need to filter silence and wait
for actual words. You need to delay reaction to input until the user has
clearly expressed the goal. Such a streaming API is provided by major engines
like Google and CMUSphinx and enables a natural and responsive experience.
Unfortunately, the SpeechRecognition module does not support streaming, so
developers often restrict themselves. A proper guide should also cover
Google's streaming API.
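The interaction pattern looks roughly like this - a toy sketch with a stand-in
per-chunk recognizer instead of a real engine, just to show the event shape
(interim results as chunks arrive, a final result at a silence endpoint):

```python
def stream_recognize(chunks, recognize_chunk, on_partial, on_final,
                     endpoint_silence=3):
    """Toy streaming loop: fire on_partial with interim hypotheses as
    audio chunks arrive, and on_final once enough trailing silence
    (an "endpoint") has been observed."""
    transcript, silent_run = [], 0
    for chunk in chunks:
        word = recognize_chunk(chunk)  # stand-in: returns None for silence
        if word is None:
            silent_run += 1
            if transcript and silent_run >= endpoint_silence:
                on_final(" ".join(transcript))  # utterance endpoint reached
                transcript, silent_run = [], 0
        else:
            silent_run = 0
            transcript.append(word)
            on_partial(" ".join(transcript))   # interim hypothesis
    if transcript:
        on_final(" ".join(transcript))         # flush at end of stream
```

A batch-style API forces you to wait for all of this to finish before
reacting; the streaming shape is what lets the UI respond immediately.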

~~~
da12
That's a really good point. I am the author of the article, and this is
something I debated during writing. In the end the goal was to provide an "in-
depth enough" tutorial on adding speech recognition to an app for people who
were new to it and possibly intimidated by the topic. For that, I think
SpeechRecognition is a fantastic module.

I had to leave a lot out of this that I wish could have gone in, simply due to
length constraints. In that regard, perhaps "The Ultimate Guide to Speech
Recognition" wasn't the best choice of title. I'm sure we'll be updating
this article as time goes on, and Google's streaming API is something I want
to make sure goes into it.

Also, something that was left out of the article is SpeechRecognition's
listen_in_background method, which does solve this problem somewhat. My issue
with it is that SpeechRecognition uses a somewhat crude RMS-energy-based VAD
for detecting speech.
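For the curious, the idea behind an energy VAD is just thresholding per-frame
RMS - a minimal sketch (frame size and threshold are illustrative, not
SpeechRecognition's actual values):

```python
import math

def frame_rms(frame):
    """Root-mean-square amplitude of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def energy_vad(samples, frame_len=160, threshold=500.0):
    """Label each frame True (speech) if its RMS exceeds the threshold."""
    return [frame_rms(samples[i:i + frame_len]) > threshold
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

The obvious failure mode is that any loud non-speech noise trips it, which is
exactly why it feels crude.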

Thanks for your feedback!

~~~
kastnerkyle
It may be possible to do this with an LTSD VAD - I always had really good luck
with that. I tried a few random ones in here for silence removal - no quality
guarantee [0].

I found LTSD pretty robust compared to simpler energy-based approaches, as
long as you have a small chunk of background sound at the start. The LTSD
implementation is largely from my friend Joao, so I can't take credit for the
cool part, only the bugs.

[0] [https://gist.github.com/kastnerkyle/a3661d6be10a0ae9e01fd429...](https://gist.github.com/kastnerkyle/a3661d6be10a0ae9e01fd4299f0a38af)
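Roughly, LTSD scores each frame by comparing a long-term spectral envelope
(max magnitude over neighboring frames) against a noise spectrum estimated
from that leading chunk of background. A simplified sketch of the idea - not
Joao's implementation, and the parameters are illustrative:

```python
import numpy as np

def ltsd_scores(signal, frame_len=512, hop=256, order=6):
    """Simplified LTSD: per-frame dB divergence between a long-term
    spectral envelope and a noise spectrum estimated from the first
    few frames (so the signal should start with background only)."""
    window = np.hanning(frame_len)
    n_frames = (len(signal) - frame_len) // hop + 1
    # Magnitude spectrogram, one row per frame.
    spec = np.abs(np.array([np.fft.rfft(window * signal[i * hop:i * hop + frame_len])
                            for i in range(n_frames)]))
    noise = spec[:order].mean(axis=0) + 1e-10  # noise spectrum from leading frames
    scores = []
    for j in range(n_frames):
        lo, hi = max(0, j - order), min(n_frames, j + order + 1)
        ltse = spec[lo:hi].max(axis=0)  # long-term spectral envelope around j
        scores.append(10 * np.log10(np.mean(ltse ** 2 / noise ** 2) + 1e-10))
    return np.array(scores)
```

Thresholding these scores gives the speech/non-speech decision; because the
comparison is against an estimated noise *spectrum* rather than raw energy, it
holds up better when the background isn't quiet.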

~~~
da12
Cheers! I'll definitely check this out.

------
vram22
I had experimented with Python libraries for both speech recognition and
speech synthesis a while ago. It was very basic stuff, but fun:

Speech recognition with the Python "speech" module:

[https://jugad2.blogspot.in/2014/03/speech-recognition-with-p...](https://jugad2.blogspot.in/2014/03/speech-recognition-with-python-speech.html)

Speech synthesis in Python with pyttsx:

[https://jugad2.blogspot.in/2014/03/speech-synthesis-in-pytho...](https://jugad2.blogspot.in/2014/03/speech-synthesis-in-python-with-pyttsx.html)

Check out the synthetic voice announcing an arriving train in Sweden (near the
top of the 2nd post above).

------
payne92
> Most modern speech recognition systems rely on what is known as a Hidden
> Markov Model (HMM).

This is not correct. Most modern speech recognition systems are based on deep
neural networks (DNNs).

~~~
bmc7505
It depends on what you mean by "most". While it is true that Google, Amazon,
Baidu, et al. have DNN-based implementations, most open-source ASR systems
(e.g. CMU Sphinx, HTK, Julius) are still HMM-based. There are still very few
DNN-based modern speech recognition systems available to developers; most are
behind a cloud API. Mozilla runs one:
[https://github.com/mozilla/DeepSpeech/](https://github.com/mozilla/DeepSpeech/)

~~~
kleiba
I think the discrepancy lies between "open source" and "modern" in this case.
It is true that DNNs outperform traditional HMM models, but speech recognition
systems are complex. It's not trivial to simply "port" an existing open source
system to switch to DNNs if you don't have the manpower and training data that
Google and the like possess.

