
Open Source Speech Recognition - ashitlerferad
http://chrislord.net/index.php/2016/06/01/open-source-speech-recognition/
======
johnwheeler
"Amazon Alexa skills JS API...There isn’t really any alternative right now..."

I want to take this opportunity to plug an Alexa Skills Kit API (in Python) I
just dropped:

[https://github.com/johnwheeler/flask-ask](https://github.com/johnwheeler/flask-ask)

It does a lot of things the JS API doesn't: Jinja templates, decorator-based
intent routing, slot defaults/conversions, and request signature verification
to name a few.
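The decorator-based intent routing mentioned above can be sketched in plain Python. This is an illustration of the pattern, not Flask-Ask's actual API; the class and handler names are made up:

```python
# Minimal sketch of decorator-based intent routing with slot conversion,
# the pattern Flask-Ask uses. Names here are illustrative only.

class IntentRouter:
    def __init__(self):
        self._handlers = {}

    def intent(self, name, convert=None):
        """Register a handler for an intent; `convert` maps slot names to types."""
        def decorator(func):
            self._handlers[name] = (func, convert or {})
            return func
        return decorator

    def dispatch(self, intent_name, slots):
        """Look up the handler and pass converted slots as keyword arguments."""
        func, convert = self._handlers[intent_name]
        kwargs = {k: convert.get(k, str)(v) for k, v in slots.items()}
        return func(**kwargs)

router = IntentRouter()

@router.intent('AgeIntent', convert={'age': int})
def age_handler(age):
    return "You are %d years old" % age

# Slot values arrive as strings in the request JSON; `convert` coerces them.
print(router.dispatch('AgeIntent', {'age': '42'}))  # -> You are 42 years old
```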

~~~
jamornh
I was just looking for something like this, and as of yesterday I only found
this project: [https://github.com/anjishnu/ask-alexa-pykit](https://github.com/anjishnu/ask-alexa-pykit)

Yours looks much simpler to get started with. Would you care to comment on the
differences between yours and the project I just posted?

~~~
johnwheeler
Yes!

I just put mine up a month ago - not discoverable for "Python Alexa" or any
keywords. Working on it :-)

So for the differences -

Alexa Skills are deployable as AWS Lambda functions or behind an HTTPS
endpoint. Currently, ask-alexa-pykit works on Lambda, while Flask-Ask also
implements the request signature verification required for HTTPS deployments
(i.e. Flask-Ask works on your own HTTPS server or on Lambda).

Another difference is in the intent mapping design. Flask-Ask is based on the
same architectural patterns of Flask with context locals, parameter mapping /
conversion, and of course, Jinja templates!

For example, mapping an intent with ask-alexa-pykit looks like this:

[http://pastebin.com/raw/hQJLKnHL](http://pastebin.com/raw/hQJLKnHL)

Flask-Ask is like this:

[http://pastebin.com/raw/9fWrGNYY](http://pastebin.com/raw/9fWrGNYY)

Flask-Ask also converts slots like firstname from the example above into
arbitrary datatypes, and has stock conversions for AMAZON.DURATION (e.g.
'P2YT3H10M' into a Python datetime.timedelta). Full parameter mapping docs
here: [https://johnwheeler.org/flask-ask/requests.html#mapping-inte...](https://johnwheeler.org/flask-ask/requests.html#mapping-intent-slots-to-view-function-parameters)
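The AMAZON.DURATION conversion described above can be sketched with a small ISO 8601 duration parser. This is a hedged illustration, not Flask-Ask's actual implementation; note that `timedelta` has no calendar awareness, so years and months are approximated here as 365 and 30 days:

```python
import re
from datetime import timedelta

# Sketch: parse an ISO 8601 duration like 'P2YT3H10M' into a timedelta.
# Years/months are approximated (365/30 days); a real library may differ.
_DURATION = re.compile(
    r'P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?'
    r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$'
)

def parse_duration(text):
    m = _DURATION.match(text)
    if not m:
        raise ValueError('not an ISO 8601 duration: %r' % text)
    g = {k: int(v or 0) for k, v in m.groupdict().items()}
    return timedelta(
        days=g['years'] * 365 + g['months'] * 30 + g['days'],
        hours=g['hours'], minutes=g['minutes'], seconds=g['seconds'])

print(parse_duration('P2YT3H10M'))  # -> 730 days, 3:10:00
```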

Flask-Ask templates are grouped together in the same file, since utterances
are typically small phrases, which makes them easier to manage. Templates are
of course an optional feature, but they're encouraged!

It's still very early, but I'm working my butt off, full-time, on it! I have a
5-min tutorial that shows how to get up and running with Flask-Ask and ngrok:
[https://www.youtube.com/watch?v=eC2zi4WIFX0](https://www.youtube.com/watch?v=eC2zi4WIFX0).
The API has changed a little; if you try it out and have any questions, you
can open an issue or hit me up: john at johnwheeler.org

Thank you!

~~~
edraferi
That video was great! Just started thinking about Alexa, this looks like an
easy way to get moving with it.

~~~
johnwheeler
Thank you!

------
kleiba
Please do not confuse the terms "voice recognition" and "speech recognition":
the former refers to identifying people by their voice, the latter to
transcribing spoken words to text.

~~~
nxzero
Not an expert, but I might go even further and say that voice recognition
doesn't even require identifying the speaker, only detecting that words are
being spoken.

~~~
richarme
That would be speech detection / voice activity detection.

------
afsina
Any reason not to use Kaldi ([https://github.com/kaldi-asr/kaldi](https://github.com/kaldi-asr/kaldi))?
AFAIK it uses most of the state-of-the-art algorithms.

~~~
jandrese
Frankly, Kaldi is nearly impossible for mere mortals to use. It's 100%
targeted at people doing PhD work in speech recognition who have a colleague
who already knows how it works and can set it up for them.

IMHO, there is a big opportunity for someone to come along and repackage it in
a user friendly way, but the people who actually understand it are too busy
doing "real work" to bother with such frivolity.

------
bruth
A bit of a tangent, but are there good speech-to-text tools that can take an
audio file (e.g. a voice memo) and transcribe it?

~~~
nfriedly
I've done a lot of work with the Watson Speech to Text service (I work on the
Watson team), and I've been pretty impressed with the results. You can upload
an audio file (wav, flac, or ogg/opus) to the demo to try it out:
[https://speech-to-text-demo.mybluemix.net/](https://speech-to-text-demo.mybluemix.net/)
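An upload like the one the demo does can also be made directly against the HTTP API. This is a hedged sketch of how the Watson Speech to Text REST call looked around 2016; the endpoint, Basic-auth credentials scheme, and response shape are assumptions to verify against the current service docs:

```python
import base64
import urllib.request

# Hedged sketch: POST an audio file to the (circa-2016) Watson Speech to Text
# REST endpoint. Endpoint URL and auth scheme are assumptions; check the
# current service documentation before relying on them.
def build_recognize_request(audio_bytes, username, password,
                            content_type='audio/flac'):
    url = 'https://stream.watsonplatform.net/speech-to-text/api/v1/recognize'
    creds = base64.b64encode(('%s:%s' % (username, password)).encode()).decode()
    return urllib.request.Request(
        url, data=audio_bytes,
        headers={'Content-Type': content_type,
                 'Authorization': 'Basic ' + creds})

req = build_recognize_request(b'...flac bytes...', 'user', 'pass')
# resp = urllib.request.urlopen(req)
# The JSON response contains transcript alternatives with confidence scores.
```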

~~~
bruth
That is a super cool service! I love the stream of alternate words with
confidence scores.

------
teekert
Maybe relevant: a FOSS AI project:

[https://mycroft.ai/](https://mycroft.ai/)

------
bitL
Does anyone know which one I should use for offline real-time speech
recognition on a humanoid robot running a Raspberry Pi 2?

~~~
smcameron
Pocketsphinx. Here is a demo:
[https://www.youtube.com/watch?v=tfcme7maygw](https://www.youtube.com/watch?v=tfcme7maygw),
and a blog post about how to do it:
[https://scaryreasoner.wordpress.com/2016/05/14/speech-recogn...](https://scaryreasoner.wordpress.com/2016/05/14/speech-recognition-and-natural-language-processing-in-space-nerds-in-space/).
The code is in here (GPL2):
[https://github.com/smcameron/space-nerds-in-space](https://github.com/smcameron/space-nerds-in-space).
In particular, look at snis_nl.h and snis_nl.c.

The trick with pocketsphinx is to limit the vocabulary you want to recognize,
and create a corpus of the types of things you want to be able to recognize
and feed it through here: [http://www.speech.cs.cmu.edu/tools/lmtool-new.html](http://www.speech.cs.cmu.edu/tools/lmtool-new.html)
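The corpus step above is just a text file of the phrases you want recognized. A minimal sketch (the phrases here are illustrative, not from the linked project):

```python
# Hedged sketch: build a tiny corpus of phrases for pocketsphinx to
# recognize, one phrase per line, then upload it to the CMU lmtool to get a
# matching language model (.lm) and pronunciation dictionary (.dic).
phrases = [
    "set shields to maximum",
    "raise the shields",
    "lower the shields",
    "full throttle",
]

with open("corpus.txt", "w") as f:
    f.write("\n".join(phrases) + "\n")

# Feed corpus.txt to http://www.speech.cs.cmu.edu/tools/lmtool-new.html,
# then point pocketsphinx at the generated files, e.g.:
#   pocketsphinx_continuous -lm 1234.lm -dict 1234.dic
```

Keeping the vocabulary this small is exactly why recognition accuracy stays high.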

If you try to use pocketsphinx to recognize arbitrary English (e.g. dictation)
it's not going to work very well in my experience.

------
doc_holliday
Slightly related: does anyone know of a speech recognition library that also
analyses the intonation / emotion in speech?

For example, one that can recognise anger / happiness / questioning with
reasonable accuracy?

~~~
knodi123
_NO!_ BUT THAT'S A GOOD QUESTION!

But seriously, I've seen emotion recognition, and speech recognition, but
haven't come across anything that provides both in one package.

~~~
doc_holliday
Do you have link to good emotion recognition if you have used one?

Perhaps, could use wrapper library to combine functionality. It's a very
interesting area.

~~~
nfriedly
I work on the Watson team, so I'm probably biased, but I think our
AlchemyLanguage Emotion Analysis is pretty good:
[http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercl...](http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/alchemy-language.html)

------
avodonosov
I tried Sphinx in 2005; we needed to recognize separate words, not even
phrases. It wasn't viable, and only commercial alternatives were satisfactory.

~~~
falcolas
It has been 11 years since your last attempt, then. Have you checked it for
improvements since?

~~~
CaptSpify
I have. It's... not great. It gets the job done, but it's a bit cumbersome.

It does recognize separate words, but, IMO, the biggest flaw is that it only
recognizes pre-defined commands.

------
gravypod
My main issue with doing anything voice-related was that, the last time I
looked into Pocketsphinx, I needed to define terms/dictionaries to parse
against.

I'd love to mix and match NLP libraries, voice synthesis, voice
identification, and speech recognition to make a comfortable "user interface"
to some systems in my house.

I think it'd be a fun project, but nothing seems to be able to take arbitrary
audio streams and give me a "User identification" based on voice patterns and
also arbitrary spoken text.

I know, yes, this is a VERY tall order, but it's something that should be
possible. At the very least, the identification part isn't needed. It's just
important that it works offline and provides a text stream.

------
jessyuan
How can I study speech recognition?

~~~
nshm
Read the book Automatic Speech Recognition: A Deep Learning Approach:
[http://rd.springer.com/book/10.1007%2F978-1-4471-5779-3](http://rd.springer.com/book/10.1007%2F978-1-4471-5779-3).
Also read another book, Spoken Language Processing:
[http://www.amazon.com/Spoken-Language-Processing-Algorithm-D...](http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165).
In parallel, play with the open source toolkits (CMUSphinx, Kaldi); just run
examples from the tutorials to understand how things look in practice.

------
10dpd
[http://www.politepix.com/openears/](http://www.politepix.com/openears/) for
iOS has been around since 2010.

------
tianlins
How accurate is CMU Sphinx for speech recognition compared to what's inside
Alexa?

~~~
IshKebab
Sphinx is pretty awful (remember the time before good speech recognition
existed?). Alexa is far better.

Kaldi is much better, but _very_ difficult to set up.

None of the open source speech recognition systems (or commercial for that
matter) come close to Google.

~~~
amelius
> None of the open source speech recognition systems (or commercial for that
> matter) come close to Google.

Is that because of the data they have, or because of their superior
algorithms?

~~~
kuschku
It’s because they used data from the public to train their models.

If someone were suddenly to apply the fact that copyright bans remixes to the
training of neural networks, and the fact that licenses for such use have to
be granted explicitly, Google would lose 90% of their advantage over other
companies.

Personally, I’d be for requiring companies to open-source their trained
models if the training data contains data supplied by users rather than paid
employees.

~~~
davexunit
I think the world needs the equivalent of OpenStreetMap but for speech data,
so that the data is under a copyleft license that legally enforces
reciprocation when the corpus is used or modified.

~~~
ashitlerferad
The closest is VoxForge:

[http://voxforge.org/](http://voxforge.org/)

~~~
davexunit
Didn't know about VoxForge. Thanks!

------
EGreg
What about OpenEars? Is this better somehow?

[http://www.politepix.com/openears/](http://www.politepix.com/openears/)

------
praccu
cobaltspeech.com -- in case you can't get what you want from a default Kaldi
install.

------
PSeitz
Pocketsphinx is too inaccurate to be useful. I even tried training custom
speech models, with no luck.

I guess we need something new and shiny here in the open source space,
probably based on neural networks.

~~~
roel_v
Kaldi does this, but it's a lot harder than pocketsphinx to get running.

~~~
skykooler
Julius works well, but has no continuous dictation models for English yet (it
was originally developed for Japanese).

------
stephengillie
[Deleted]

~~~
skykooler
.NET speech recognition is Windows-only. I doubt Mozilla would go for this as
they are aiming for cross-platform software.

