Guide to Speech Recognition with Python (realpython.com)
143 points by hn17 on March 25, 2018 | 20 comments



For people who want simple, out-of-the-box stuff (not necessarily in Python) for just getting phonemes, I can also recommend [0]. Not amazing recognition quality, but dead simple setup, and it is possible to integrate a language model as well (I never needed one for my task). The author showed it in [1] as well, but kind of skimmed right by it. To me, if you want to learn speech recognition in detail, pocketsphinx-python is one of the best ways in. Customizing the language model is a huge boost in domain-specific recognition.

Large company APIs will usually be better at generic-speaker, generic-language recognition - but if you can do speaker adaptation and customize the language model, there are some insane gains possible, since you prune out a lot of uncertainty and complexity.
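To make that concrete, here is a toy illustration (all scores, vocab, and the `rescore` helper are invented for the example) of how a domain-restricted language model prunes uncertainty: two hypotheses that tie acoustically are separated by the LM, because only one of them is plausible in the domain.

```python
# Toy illustration: a domain-restricted language model resolving an
# acoustically ambiguous hypothesis. All numbers here are invented.
acoustic_scores = {"recognize": 0.50, "wreck a nice": 0.50}  # acoustic tie

# Bigram probabilities from a hypothetical domain-specific LM: the
# out-of-domain continuation is effectively pruned (probability 0).
domain_bigrams = {("speech", "recognize"): 0.9,
                  ("speech", "wreck a nice"): 0.0}

def rescore(prev_word, candidates):
    """Combine acoustic and language-model scores; the LM breaks the tie."""
    return max(candidates,
               key=lambda w: acoustic_scores[w] * domain_bigrams.get((prev_word, w), 0.0))

best = rescore("speech", ["recognize", "wreck a nice"])  # "recognize"
```

A generic LM would assign both continuations non-trivial probability; restricting to domain text is what collapses the search space.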

If you are more interested in recognition and alignment to a script, "gentle" is great [2][3]. The guts also have raw Kaldi recognition, which is pretty good for a generic speech recognizer but you would need to do some coding to pull out that part on its own.

For a decent performing deep model, check into Mozilla's version of Baidu's DeepSpeech [4].

If doing full-on development, my colleague has been using a bridge between PyTorch (for training) and Kaldi (to use their decoders) to good success [5].

[0] how I use pocketsphinx to get phonemes, https://github.com/kastnerkyle/ez-phones

[1] https://github.com/cmusphinx/pocketsphinx-python

[2] https://github.com/lowerquality/gentle

[3] how I use gentle for forced alignment, https://github.com/kastnerkyle/raw_voice_cleanup/tree/master...

[4] https://github.com/mozilla/DeepSpeech

[5] https://github.com/mravanelli/pytorch-kaldi


No mention of DNN-based ASR like DeepSpeech? There are even open source Python implementations available from Mozilla and Paddle.

These models are way easier to train, have surprisingly good accuracy, and are robust to noise.


Seems all the methods in the writeup are APIs (not sure about wit or sphinx), so what's missing is locally-run engines like DeepSpeech. But on that same note, I'd like to see better accuracy comparisons across all these methods, and pricing (Google comes to around $1.44 per recorded hour?), since that's a significant factor.

From prior use, Google's speech API (at least the "video" model) is freakishly accurate compared to DeepSpeech, to the point where I wondered if they used closed captioning to help train their model. But I haven't seen the rest of these at work: https://i.imgur.com/cdOlARO.png


afaik, pure DNN models still lag seriously behind 'traditional' HMM-based frameworks augmented by neural networks (using DNNs for specific parts of the pipeline). Last I checked a couple months ago, state of the art for HMM+DNN was something like 6% word error rate (WER). The best seq2seq DNN I know of hit 18% WER, dropping to 10% when a secondary language model was integrated. (My guess is that part of the problem is leaning too heavily on the attention mechanism... a more 'streaming friendly' framework should help reduce the load on it.)

https://arxiv.org/pdf/1610.03022.pdf
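For reference, WER is just word-level edit distance normalized by reference length; a minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length.

    Counts substitutions, insertions, and deletions between the word
    sequences via the standard dynamic-programming recurrence.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / len(ref)

score = wer("the cat sat on the mat", "the cat sat on mat")
# one deletion out of six reference words, i.e. 1/6
```

So a "6% WER" system gets roughly one word in seventeen wrong on the benchmark transcript.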


This has changed recently, full seq2seq is now matching hybrid models [0].

[0] https://arxiv.org/abs/1712.01769


Oh, thanks! Now I know what I'm reading on the commute tomorrow.


The majority of the APIs mentioned are probably using DNNs. But those are all online-only, so higher-quality offline engines would definitely be an improvement. I wonder how much effort it would require to integrate them into the SpeechRecognition package.


+1


The SpeechRecognition module is pretty popular, but it has some important API design flaws. The thing is that speech is a continuous stream of data, and you need a streaming-style API for a proper user experience - you need to respond to events as soon as they appear. You need to filter silence and wait for actual words. You need to delay reacting to input until the user has clearly expressed their goal. Such a streaming API is provided by major engines like Google and CMUSphinx and enables a natural, responsive experience. Unfortunately, SpeechRecognition does not support streaming, so developers often restrict themselves. A proper guide should cover Google's streaming API.
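The event-driven loop a streaming API enables can be sketched in a few lines of plain Python. This is not any real engine's API - `is_speech` stands in for whatever VAD the engine provides - it just shows the endpointing behavior described above: buffer while the user is speaking, and emit the utterance as soon as enough trailing silence arrives.

```python
def stream_utterances(chunks, is_speech, max_trailing_silence=3):
    """Group a stream of audio chunks into utterances.

    Buffers chunks while `is_speech` (a placeholder VAD) fires, then
    yields the buffered utterance once `max_trailing_silence` silent
    chunks in a row signal the user is done - so the app can react
    immediately instead of waiting for the whole recording to finish.
    """
    buffer, silence = [], 0
    for chunk in chunks:
        if is_speech(chunk):
            buffer.append(chunk)
            silence = 0
        elif buffer:
            silence += 1
            if silence >= max_trailing_silence:
                yield buffer          # utterance endpoint detected
                buffer, silence = [], 0
    if buffer:
        yield buffer                  # flush anything left at stream end

# toy demo: 1 = speech chunk, 0 = silence chunk
chunks = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
utterances = list(stream_utterances(chunks, is_speech=bool))  # [[1, 1], [1, 1, 1]]
```

A batch API like SpeechRecognition's `recognize_*` calls only gives you a result after the whole recording is captured, which is exactly the responsiveness problem.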


That's a really good point. I am the author of the article, and this is something I debated during writing. In the end the goal was to provide an "in-depth enough" tutorial on adding speech recognition to an app for people who were new to it and possibly intimidated by the topic. For that, I think SpeechRecognition is a fantastic module.

I had to leave a lot out of this that I wish could have gone in, simply due to length constraints. In that regard, perhaps "The Ultimate Guide to Speech Recognition" wasn't the best choice of title. I'm sure that we'll be updating this article as time goes on, and Google's streaming API is something I want to make sure goes in it.

Also, something that was left out of the article was SpeechRecognition's listen_in_background method, which does solve this problem somewhat. My issue with it is that SpeechRecognition uses a somewhat crude RMS-energy-based VAD for detecting speech.
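For context, an RMS-energy VAD of the kind being criticized is roughly this (frame size and threshold below are made-up values, not SpeechRecognition's actual defaults):

```python
import math

def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def energy_vad(samples, frame_size=160, threshold=0.1):
    """Crude energy-based VAD: flag each frame as speech when its RMS
    energy exceeds a fixed threshold. This is the style of detector
    that breaks down in noisy backgrounds, since loud non-speech noise
    clears the same threshold as speech.
    """
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [rms(f) > threshold for f in frames]

quiet = [0.01] * 160   # toy "silence" frame
loud = [0.5] * 160     # toy "speech" frame
flags = energy_vad(quiet + loud)  # [False, True]
```

The whole decision reduces to one number per frame, which is why it is cheap but fragile.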

Thanks for your feedback!


It may be possible to do this with an LTSD VAD; I always had really good luck with that. I tried a few random ones in here for silence removal - no quality guarantee [0]

I found LTSD pretty robust compared to simpler energy-based things, as long as you have a small chunk of background sound at the start. The LTSD implementation is largely from my friend Joao, so I can't take credit for the cool part, only the bugs.

[0] https://gist.github.com/kastnerkyle/a3661d6be10a0ae9e01fd429...


Cheers! I'll definitely check this out.


For me as a beginner (in speech recognition), this tutorial (and others on the RealPython site) is a good start to dig into the topic. That's why I linked it here.

One of the things I like about HN is that there's often someone who can join the discussion and add something to it.

Clearly it's worth learning from and discussing the experience of others to see a broader spectrum. Thanks to the author for looking in here and dropping a few lines.


I had experimented with Python libraries for both speech recognition and speech synthesis a while ago. It was very basic stuff, but fun:

Speech recognition with the Python "speech" module:

https://jugad2.blogspot.in/2014/03/speech-recognition-with-p...

Speech synthesis in Python with pyttsx:

https://jugad2.blogspot.in/2014/03/speech-synthesis-in-pytho...

Check out the synthetic voice announcing an arriving train in Sweden (near top of 2nd post above).


> Most modern speech recognition systems rely on what is known as a Hidden Markov Model (HMM).

This is not correct. Most modern speech recognition systems are based on deep neural nets (DNNs).


It depends on what you mean by "most". While it is true that Google, Amazon, Baidu, et al. have DNN-based implementations, most open source ASR systems (e.g. CMU Sphinx, HTK, Julius) are still HMM-based. There are still very few DNN-based modern speech recognition systems available to developers. Most are behind a cloud API. Mozilla runs one: https://github.com/mozilla/DeepSpeech/


I think the discrepancy lies between "open source" and "modern" in this case. It is true that DNNs outperform traditional HMM models, but speech recognition systems are complex. It's not trivial to simply "port" an existing open source system to switch to DNNs if you don't have the manpower and training data that Google and the like possess.


There is no contradiction here, state of the art systems use both DNN and HMM (Kaldi, for example). It is GMM (Gaussian mixture model) that was replaced by DNN, HMM is still here.


[flagged]



Yes. OP: please stop. It's OK to share your stuff, but please don't be spammy about it - this is a community site.



