
Voice Recognition and Text to Speech in Python - ggulati
https://ggulati.wordpress.com/2016/02/24/coding-jarvis-in-python-3-in-2016/
======
danso
FWIW, IBM has a wonderful speech-to-text API...I've put together a repo of
examples and Python code:

[https://github.com/dannguyen/watson-word-watcher](https://github.com/dannguyen/watson-word-watcher)

One of the great things about it is the word-level timestamp and confidence
data it returns...here are a few supercuts I've made from the presidential
primary debates:

[https://www.youtube.com/watch?v=VbXUUSFat9w&list=PLLrlUAN-Lo...](https://www.youtube.com/watch?v=VbXUUSFat9w&list=PLLrlUAN-LoO73FrSa6yn8gsPpi7J9TJb7&index=14)

It's not perfect by any means, but the granular results give you a place to
start from...here's a supercut of cuss words from a well-known episode of The
Wire...only 59 such words were heard by Watson, even though one scene alone
contains 30+ F-bombs:

[https://www.youtube.com/watch?v=muP5aH1aWUw&feature=youtu.be](https://www.youtube.com/watch?v=muP5aH1aWUw&feature=youtu.be)

The service is free for the first 1000 minutes each month.

~~~
pbw
It took me a while to understand what you did here. I was waiting for some
kind of subtitles showing the recognition ability.

But you are saying you performed speech recognition on the full video, then
edited it according to where the words you targeted were found. I liked the
bomb/terrorist one; the others didn't seem to be "saying" anything.

~~~
danso
Yeah, I was a bit lazy...I could have used moviepy (which I currently use, but
merely as a wrapper around ffmpeg) to add subtitles showing which word was
identified...I'm hoping to make this into a command-line tool for myself to
quickly transcribe things...though making supercuts is just a fun way to
demonstrate the concepts.

The important takeaway is that the Watson API parses a stream of spoken audio
(other services, such as Microsoft's Oxford, work only on 10-second chunks,
i.e. they're optimized for user commands) and tokenizes it...what you get is a
timestamp for when each recognized word appears, as well as a confidence level
and alternatives if you so specify. Other speech-transcription options don't
always provide this...I don't think PocketSphinx does, for example, nor does
sending your audio to an mTurk-based transcription service.

Here's a little more detail about The Wire transcription, along with the JSON
that Watson returns, and a simplified CSV version of it:

[https://github.com/dannguyen/watson-word-watcher/tree/master...](https://github.com/dannguyen/watson-word-watcher/tree/master/examples/the-wire-season-1-ep-4)
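
A minimal sketch of pulling the word-level data back out of that JSON (the
field names here follow the Watson v1 /recognize response as I remember it,
so treat them as an assumption):

    import json

    # results.json: a response saved with timestamps=true and word_confidence=true
    with open('results.json') as f:
        response = json.load(f)

    for result in response['results']:
        best = result['alternatives'][0]
        # timestamps are [word, start_sec, end_sec]; word_confidence is [word, score]
        for (word, start, end), (w, conf) in zip(best['timestamps'],
                                                 best['word_confidence']):
            print('%-15s %6.2f %6.2f  %.3f' % (word, start, end, conf))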

------
kleiba
Kids, it's called "speech recognition". Voice recognition also exists, but
it's the task of identifying a user based on his/her voice, not the task of
transcribing spoken input as text.

~~~
jwitko
Kids?

~~~
DecoPerson
He jests.

------
giancarlostoro
It really would be amazing to get voice recognition software that recognizes
at least a useful fraction of our language without having to reach the cloud.
It is definitely a dream I hope we one day achieve. Thanks for the article; I
will test it on my day off and play with it a bit.

~~~
lovelearning
Pocketsphinx/Sphinx with a small, use-case-specific dictionary showed much
better accuracy for my accent and speech defects than any of these cloud-based
recognition systems. I used a standard acoustic model, but it probably would
have been even more accurate had I trained a custom acoustic model.

For simple use cases like home automation or desktop automation, I think it's
a more practical approach than depending on a cloud API.

~~~
danso
I haven't tried out PocketSphinx myself...could you describe the training
process? E.g., how long did it take, how much audio did you have to record,
and how easy was it to iterate to improve accuracy?

~~~
lovelearning
PocketSphinx/Sphinx use three models - an acoustic model, a language model and
a phonetic dictionary. I'm no expert, but as I understand them, the acoustic
model converts audio samples into phonemes(?), the language model contains
probabilities of sequences of words, and the phonetic dictionary is a mapping
of words to phonemes.

Initially, I just used the standard en-US acoustic model, the generic US
English language model, and its associated phonetic dictionary. This was the
baseline for judging accuracy. It was OK, but neither fast nor very accurate
(likely due to my accent and speech defects). I'd say it was about 70%
accurate.

Simply reducing the size of the vocabulary boosts accuracy because there is
that much less chance of a mistake. It also improves recognition speed. For
each of my use cases (home and desktop automation), I created a plain text
file with the relevant command words, then used their online tool [1] to
generate a language model and phonetic dictionary from it.
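
With those two files, a minimal sketch using the pocketsphinx Python bindings
(assuming a recent pocketsphinx-python; the bundled en-us acoustic model is
used since no hmm is given) looks like:

    from pocketsphinx import LiveSpeech

    # lm/dic point at the lmtool-generated language model and dictionary
    speech = LiveSpeech(lm='commands.lm', dic='commands.dic')

    # LiveSpeech yields one hypothesis per utterance heard on the microphone
    for phrase in speech:
        print(phrase)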

For the acoustic model, there are two approaches - "adapting" and "training".
Training is from scratch, while adapting adapts a standard acoustic model to
better match personal accent or dialect or speech defects.

I found training as described in [2] rather intimidating, and never tried it
out. It would likely take a lot of time (a couple of days at least, I think,
based on my adaptation experience).

Instead I "adapted" the en-us acoustic model [3]. About an hour to come up
with some grammatically correct text that included all the command words and
phrases I wanted. Then reading it aloud while recording using Audacity. I
attempted this multiple times, fiddling around with microphone volume and
gain, trying to block ambient noise (I live in a rather noisy env), redoing
it, final take. Took around 8 hours altogether with breaks. Finally generating
the adapted acoustic model. About an hour.

About 95% of the time it understands what I say. About 5% of the time I have
to repeat myself, especially with phrases.

Did this on both a desktop and raspberry pi. The Pi is the one managing home
automation. I'm happy with it :)

[1]: [http://www.speech.cs.cmu.edu/tools/lmtool-new.html](http://www.speech.cs.cmu.edu/tools/lmtool-new.html)

[2]: [http://cmusphinx.sourceforge.net/wiki/tutorialam](http://cmusphinx.sourceforge.net/wiki/tutorialam)

[3]: [http://cmusphinx.sourceforge.net/wiki/tutorialadapt](http://cmusphinx.sourceforge.net/wiki/tutorialadapt)

PS: Reading their documentation and searching for downloads takes more time
than the actual task. They really need to improve those.

~~~
vram22
If not confidential, can you describe what kinds of automation you used this
for, particularly the desktop automation?

I was interested in automating transcription to text of my own reminders to
myself and other such audio files, say taken on the PC or on a portable voice
recorder, hence the earlier trials I did. But at the time nothing worked out
well enough, IIRC.

~~~
lovelearning
Nothing confidential at all :). I was playing with them because I personally
don't like using the keyboard and mouse, and I also have some ideas for making
computing easier for handicapped people.

My current desktop automation does command recognition: commands like "open
editor / email / browser", "shutdown", "suspend"...about 20 commands in all.
'pocketsphinx_continuous' is started as a daemon at startup and keeps
listening in the background (I'm on Ubuntu).
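
A rough Python sketch of that kind of always-listening loop, with an
illustrative (not actual) phrase-to-action table:

    import subprocess
    from pocketsphinx import LiveSpeech

    # Hypothetical mapping of recognized phrases to shell actions
    COMMANDS = {
        'open browser': ['xdg-open', 'http://example.com'],
        'suspend': ['systemctl', 'suspend'],
    }

    for phrase in LiveSpeech(lm='commands.lm', dic='commands.dic'):
        action = COMMANDS.get(str(phrase).strip().lower())
        if action:
            subprocess.call(action)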

I think from a speech recognition internals point of view transcription is
more complex than recognizing these short command phrases. The training or
adaptation corpus would have to be much larger than what I used.

~~~
vram22
Thanks. Good uses.

He he, the voice "shutdown" command you mention reminds me of a small assembly
language routine that I used to use to reboot MSDOS PCs; it was just a single
instruction to jump to the start of the BIOS (cold?) boot entry point, IIRC
(JMP F000:FFF0 or something like that). I used to enter it into DOS's
DEBUG.COM utility with the A command (for Assemble) and then write it out to
disk as a tiny .COM file. (IOW, you did not even need an assembler to create
it.)

Then you could reboot the PC just by typing:

REBOOT

at the DOS prompt.

I did all kinds of tricks of the trade like that (and many others) in the
earlier DOS and, even more, UNIX days ... Good fun, and useful to customers
many a time too, including saving their bacon (aka data) multiple times (with,
of course, no backups on their part).

------
IshKebab
Don't expect this to be anything like modern "good" speech recognition. Sphinx
is definitely from the '00s, when it seemed like speech recognition would
never be solved.

Apparently Kaldi is a lot better, but good luck setting it up!

------
privong
Another project along similar lines is the Jasper Project[0], which has
received some HN coverage in the past several years[1]. It interfaces with
many of the same speech recognition and text-to-speech libraries.

[0] [https://jasperproject.github.io/](https://jasperproject.github.io/)

[1]
[https://hn.algolia.com/?query=Jasper%20Project&sort=byPopula...](https://hn.algolia.com/?query=Jasper%20Project&sort=byPopularity&prefix&page=0&dateRange=all&type=story)

------
squeaky-clean
Very cool! I just started playing with speech recognition in Python for home
automation this week. I'm controlling some WeMo switches and my PC with an
Android tablet using Autovoice, and it works well as a proof of concept, but
Autovoice doesn't always register commands, and the "Okay, Google" speech-to-
text can be slow sometimes. I'd like it to take less than 5 seconds between
saying "TV off" and the TV actually turning off; with Autovoice it's anywhere
from 3s to 25s depending on the lag. I also figure that with real code, I can
get commands that are more flexible than Autovoice's regexes.
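
A bare-bones version of that command loop with the SpeechRecognition library
might look like this (turn_tv_off is a hypothetical stand-in for the WeMo
call):

    import speech_recognition as sr

    def turn_tv_off():
        pass  # hypothetical stand-in for the WeMo switch call

    r = sr.Recognizer()
    with sr.Microphone() as source:
        r.adjust_for_ambient_noise(source)
        while True:
            audio = r.listen(source)
            try:
                # recognize_sphinx runs offline via PocketSphinx
                text = r.recognize_sphinx(audio).lower()
            except sr.UnknownValueError:
                continue  # couldn't make out any words
            if 'tv off' in text:
                turn_tv_off()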

Aside from circumventing lag, I can also give it some personality. I want to
name it Marvin, after the robot from H2G2, so that I can say:

"Marvin, turn the TV off"

"Here I am, brain the size of a planet, and you ask me to turn off the tv.
Call that job satisfaction, 'cause I don't."

------
afsina
They should move from Sphinx to Kaldi and from GMM to DNN acoustic models.
Instant 30% improvement.

~~~
luke-stanley
[http://kaldi.sourceforge.net/about.html](http://kaldi.sourceforge.net/about.html)

~~~
turnip1979
Does Kaldi need Windows? I only saw installation instructions for Windows.
Also... I just tried PocketSphinx... it says it works on Windows and Linux.
So... no non-Apple or cross-platform speech rec for us Mac devs?

~~~
afsina
AFAIK it's the contrary: they officially support only Linux, but the community
provides Windows support. I don't know about Mac support. AFAIK again, Kaldi
mainly targets server and desktop applications.

------
ivan_ah
For folks who want to try this at home on Mac OS X, you'll need to change
'sapi5' to 'nsss' on the line 'speech_engine = pyttsx.init('sapi5')'.
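
A sketch of picking the driver per platform, using the driver names pyttsx
documents (sapi5, nsss, espeak):

    import sys
    import pyttsx

    # Pick the TTS driver for the current platform
    if sys.platform == 'win32':
        speech_engine = pyttsx.init('sapi5')   # Windows SAPI 5
    elif sys.platform == 'darwin':
        speech_engine = pyttsx.init('nsss')    # OS X NSSpeechSynthesizer
    else:
        speech_engine = pyttsx.init('espeak')  # eSpeak on Linux/*nix

    speech_engine.say('Hello from pyttsx')
    speech_engine.runAndWait()

(pyttsx.init() with no argument is supposed to pick the platform default,
too.)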

I also had to 'brew install portaudio flac swig' and a bunch of other Python
libs. By the time it ran, 'pip freeze' returned:

    altgraph==0.12
    macholib==1.7
    modulegraph==0.12.1
    py2app==0.9
    PyAudio==0.2.9
    pyobjc==3.0.4
    pyttsx==1.1
    SpeechRecognition==3.3.0
    pocketsphinx==0.0.9

My fork of the gist is here:
[https://gist.github.com/ivanistheone/b988d3de542c1bdd6a90](https://gist.github.com/ivanistheone/b988d3de542c1bdd6a90)

------
vram22
Nice work, ggulati. I had done some roughly similar stuff, but more basic,
using the same / similar libraries (though you have researched more libs), a
while ago:

Recognizing speech (speech-to-text) with the Python speech module

[https://code.activestate.com/recipes/579115-recognizing-spee...](https://code.activestate.com/recipes/579115-recognizing-speech-speech-to-text-with-the-python-/?in=user-4173351)

and

Python text-to-speech with pyttsx

[https://code.activestate.com/recipes/578839-python-text-to-s...](https://code.activestate.com/recipes/578839-python-text-to-speech-with-pyttsx/?in=user-4173351)

Good stuff. I like this area.

------
whizzkid
Microsoft's translation API has a free tier of 1 million characters/month for
text to speech with male/female voices.

The quality is good enough, and it's a good start for those who cannot afford
to pay for Google's API.

~~~
iamcreasy
Just checked; it's 2 million characters/month for free.

------
archiebunker
Excellent post. Very interesting. I see how it works, but I'm using Python
2.7, so based on your headline I suppose it won't work for me. This is the
first real lead I've seen for integrating it easily. Pricing isn't terrible if
it goes to production. Too bad there is no way to test it first for
development. But we're lucky to have this at all.

The link to the VLC library is pretty handy.

~~~
ggulati
Most of the stuff I found was for Python 2.7! I'll edit that into the post. My
focus was on finding libraries that work with new Python code, e.g. Python
3.5.

All of those libraries have Python 2.7 versions. Actually, for all of them you
pip install the same library; for pyttsx, `pip install pyttsx` and ignore
jpercent's update.

I'm not sure what you mean about pricing and testing for development. Are you
referring to Google's services? They offer 50 reqs/day for voice recognition
on a free developer API key ([https://www.chromium.org/developers/how-tos/api-keys](https://www.chromium.org/developers/how-tos/api-keys)).
Google Translate can also be used via gTTS; it will rate limit or block you if
you send too many reqs/min or per day without an appropriately registered API
key, but you could play around with it for sure.
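
For instance, a minimal gTTS snippet (to the best of my knowledge of its API)
is just:

    from gtts import gTTS

    # Sends the text to Google Translate's TTS endpoint and saves the audio
    tts = gTTS(text='Hello world', lang='en')
    tts.save('hello.mp3')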

If voice recognition is important, it might be worth investigating Sphinx more
and putting in the time to tweak their English language model files. Synthesis
is more difficult, though I think the Windows SAPI, OS X NSSS, and eSpeak on
*nix are all "good enough." There is also a range of commercial libraries.

~~~
dr_zoidberg
I too thought it was Python 3 only before I read it. Maybe a better title
would be "Coding Jarvis in Python in 2016" and then explaining in the first
paragraph that this is Python 2 and 3 compatible, with your personal focus on
3?

~~~
ggulati
Thanks for the feedback; I updated the blog post.

------
Karlozkiller
I have had a problem with the speech_recognition library in that it does not
stop listening when silence occurs.

After trying to tweak the threshold parameters without success, I just figured
I'd add a custom key command to break the listening loop in my project.
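
For reference, these are the knobs I believe are meant here (the values below
are guesses to experiment with, not recommendations):

    import speech_recognition as sr

    r = sr.Recognizer()
    r.energy_threshold = 4000           # minimum audio energy to count as speech
    r.dynamic_energy_threshold = False  # don't auto-adjust the threshold
    r.pause_threshold = 0.5             # seconds of silence that end a phrase

    with sr.Microphone() as source:
        audio = r.listen(source)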

------
infocollector
Does this work without an internet connection (once downloaded)? If yes, how
big is the download footprint? I still haven't gone through the webpage
carefully.

~~~
akerro
There is the Sirius project, which does this; take a look:

[http://sirius.clarity-lab.org/category/watch/](http://sirius.clarity-lab.org/category/watch/)

