Open Source Speech Recognition (chrislord.net)
311 points by ashitlerferad on June 2, 2016 | 60 comments



"Amazon Alexa skills JS API...There isn’t really any alternative right now..."

I want to take this opportunity to plug an Alexa Skills Kit API (in Python) I just dropped:

https://github.com/johnwheeler/flask-ask

It does a lot of things the JS API doesn't: Jinja templates, decorator-based intent routing, slot defaults/conversions, and request signature verification to name a few.


I was just looking for something like this, and as of yesterday I only found this project: https://github.com/anjishnu/ask-alexa-pykit

Yours looks much simpler to get started with; would you care to comment on the differences between yours and the project I just posted?


Yes!

I just put mine up a month ago - not discoverable for "Python Alexa" or any keywords. Working on it :-)

So for the differences -

Alexa Skills are deployable as AWS Lambda functions or behind HTTPS. Currently, ask-alexa-pykit works on Lambda and Flask-Ask implements the signature verification required for HTTPS deployments. (i.e. Flask-Ask works on your own HTTPS server or Lambda).

Another difference is in the intent mapping design. Flask-Ask is based on the same architectural patterns as Flask, with context locals, parameter mapping / conversion, and of course, Jinja templates!

For example, mapping an intent with ask-alexa-pykit looks like this:

http://pastebin.com/raw/hQJLKnHL

Flask-Ask is like this:

http://pastebin.com/raw/9fWrGNYY

Flask-Ask also converts slots like firstname from the example above into arbitrary datatypes, and has stock conversions for AMAZON.DURATION (e.g. 'P2YT3H10M' into a Python datetime.timedelta). Full parameter mapping docs here: https://johnwheeler.org/flask-ask/requests.html#mapping-inte...
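
To give a flavor without clicking through, here's a rough sketch of a handler that uses mapping, conversion, and defaults together. The intent name, slot names, and values are made up for illustration, and the exact keyword arguments may shift slightly since the API is still settling:

    from flask import Flask
    from flask_ask import Ask, statement

    app = Flask(__name__)
    ask = Ask(app, '/')

    # 'HelloIntent' and its slots are illustrative, not from a real skill
    @ask.intent('HelloIntent',
                mapping={'name': 'firstname'},    # slot 'firstname' -> parameter 'name'
                convert={'length': 'timedelta'},  # AMAZON.DURATION slot -> datetime.timedelta
                default={'name': 'friend'})       # used when the slot isn't filled
    def hello(name, length):
        seconds = length.total_seconds() if length else 0
        return statement("Hello, {}. That's {} seconds.".format(name, int(seconds)))

    if __name__ == '__main__':
        app.run(debug=True)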

Flask-Ask templates are grouped together in the same file to make them easier to manage, since utterances are typically small phrases. Templates are of course an optional feature, but they're encouraged!

It's still very early, but I'm working my butt off, full-time on it! I have a 5-minute tutorial that shows how to get up and running with Flask-Ask and ngrok: https://www.youtube.com/watch?v=eC2zi4WIFX0 - The API has changed a little since then, so if you try it out and have any questions, you can open an issue or hit me up: john at johnwheeler.org

Thank you!


Thanks for the detailed reply! I've been interested in the Echo and have wanted to get into it for a while, so I'll definitely dig into Flask-Ask and try to get up and running with it.


That video was great! Just started thinking about Alexa, this looks like an easy way to get moving with it.


Thank you!


OK, you have written an API. When can we expect you to publish code for a service that implements this API?


Hi dozzie!

Please check out the samples directory for client code:

https://github.com/johnwheeler/flask-ask/tree/master/samples

These are direct ports from the Java samples:

https://github.com/amzn/alexa-skills-kit-java/tree/master/sa...


Please do not confuse the terms "voice recognition" and "speech recognition": the former refers to identifying people by their voice, the latter to transcribing spoken words to text.


Speaker recognition unambiguously describes identifying people by their voice, whereas voice recognition is frequently used for both. This blog post isn't confusing the terms – they've already been confused.


Not an expert, but I might go even further and say that voice recognition doesn't even require identifying the speaker, only that words are being spoken.


That would be speech detection / voice activity detection.


Any reason not to use Kaldi (https://github.com/kaldi-asr/kaldi)? AFAIK it uses most of the state-of-the-art algorithms.


Frankly, Kaldi is nearly impossible for mere mortals to use. It's 100% targeted at people doing PhD work in speech recognition who have a colleague who already knows how it works and can set it up for them.

IMHO, there is a big opportunity for someone to come along and repackage it in a user friendly way, but the people who actually understand it are too busy doing "real work" to bother with such frivolity.


Like others are saying, it's just much harder to use. The official tutorial even says "The intended audience for this tutorial is either speech recognition researchers, or graduates or advanced undergraduates who are studying this area anyway." in the first paragraph. It seems like Kaldi is meant for people who actually know how speech recognition works, while other tools are meant for people who just want some text from some audio without really understanding how.

For example, I've been playing with home automation and speech recognition, and have been able to get any Sphinx-based recognizer working in a single sitting, in a few hours or less. But I've yet to get Kaldi working after several nights of effort. It seems much more powerful, and based on my reading, it's more accurate than Sphinx. But that doesn't do me any good if I can't get it to run, haha.


I was thinking the same. Kaldi's documentation is a bit lacking, and it's non-trivial to use besides their provided "recipes".


It's (a lot) more difficult to set up.


A bit of a tangent, but are there good speech-to-text tools that can take an audio file (e.g. a voice memo) and transcribe it?


I've done a lot of work with the Watson Speech to Text service (I work on the Watson team), and I've been pretty impressed with the results. You can upload an audio file (wav, flac, or ogg/opus) to the demo to try it out: https://speech-to-text-demo.mybluemix.net/
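
If you'd rather call it programmatically than use the demo page, something like this worked against the REST API at the time. The hostname, auth scheme, and response shape below are from memory, so treat them as assumptions and check the current service docs:

    import requests

    # Service credentials are placeholders; use the ones from your service instance
    resp = requests.post(
        'https://stream.watsonplatform.net/speech-to-text/api/v1/recognize',
        auth=('YOUR_USERNAME', 'YOUR_PASSWORD'),
        headers={'Content-Type': 'audio/wav'},
        data=open('memo.wav', 'rb'),
    )

    # Each result carries one or more alternatives with transcripts
    for result in resp.json().get('results', []):
        print(result['alternatives'][0]['transcript'])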


That is a super cool service! I love the stream of alternate words with confidence scores.


Nuance has a suite that does that. We use their software and it's pretty good—despite being a headache to configure. But it's neither free nor open source.


This is also what CastingWords does, if you want high-quality results transcribed by people: https://castingwords.com/support/transcription-api.html

We use Speechmatics (and sphinx) for caption alignment and timestamps. We certainly recommend them if that’s the service you need.


That's exactly what we do at speechmatics.com - feel free to give us a try and make the most of the free credits.


Great, I just submitted the first audio file for transcription. Looking forward to seeing the output.


Google is starting to offer this service as well. https://cloud.google.com/speech/


Maybe relevant: a FOSS AI project:

https://mycroft.ai/


Does anyone know which one I should use for offline real-time speech recognition on a humanoid robot running a Raspberry Pi 2?


Pocketsphinx. Here is a demo: https://www.youtube.com/watch?v=tfcme7maygw and a blog post about how to do it: https://scaryreasoner.wordpress.com/2016/05/14/speech-recogn... The code is in here (GPL2): https://github.com/smcameron/space-nerds-in-space In particular, look at snis_nl.h and snis_nl.c.

The trick with pocketsphinx is to limit the vocabulary you want to recognize, and create a corpus of the types of things you want to be able to recognize and feed it through here: http://www.speech.cs.cmu.edu/tools/lmtool-new.html

If you try to use pocketsphinx to recognize arbitrary English (e.g. dictation) it's not going to work very well in my experience.
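
Roughly, once you have the .lm and .dic files back from lmtool, wiring them into pocketsphinx from Python looks something like this. The file names are placeholders, and the exact keyword arguments depend on which pocketsphinx Python package/version you're using:

    import os
    from pocketsphinx import LiveSpeech, get_model_path

    model_path = get_model_path()

    speech = LiveSpeech(
        hmm=os.path.join(model_path, 'en-us'),  # stock US English acoustic model
        lm='my_corpus.lm',                      # small language model from lmtool
        dic='my_corpus.dic',                    # matching pronunciation dictionary
    )

    # prints each recognized phrase as it's decoded from the microphone
    for phrase in speech:
        print(phrase)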


There are two realistic options: one is pocketsphinx, the other is Kaldi. When running on a Pi, pocketsphinx will be your only realistic option for real-time detection. You'll want to move to a Raspberry Pi 3 as well, and you'll want to use a customized dictionary to try and get your recognition speed up. Lastly, there are several parameters you can tweak that'll affect recognition speed.

Raw processing power will be the bottleneck on a Raspberry Pi.
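
For what it's worth, the usual speed knobs look something like this with the Python bindings. The values here are illustrative, not recommendations; you'd tune them against your own accuracy/latency trade-off:

    import os
    from pocketsphinx import Decoder, get_model_path

    config = Decoder.default_config()
    config.set_string('-hmm', os.path.join(get_model_path(), 'en-us'))
    config.set_string('-lm', 'my_corpus.lm')     # limited-vocabulary LM, as mentioned above
    config.set_string('-dict', 'my_corpus.dic')
    config.set_int('-maxhmmpf', 3000)  # cap active HMMs evaluated per frame
    config.set_int('-maxwpf', 5)       # cap word exits per frame
    config.set_int('-ds', 2)           # downsample acoustic scoring to cut CPU load
    decoder = Decoder(config)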


Slightly related, but does anyone know of a speech recognition library that also analyses the intonation / emotion in speech?

For example, one that has the ability to recognise anger / happiness / questioning with reasonable accuracy?


NO! BUT THAT'S A GOOD QUESTION!

But seriously, I've seen emotion recognition, and speech recognition, but haven't come across anything that provides both in one package.


Do you have a link to a good emotion recognition library, if you have used one?

Perhaps a wrapper library could be used to combine the functionality. It's a very interesting area.


I work on the Watson team, so I'm probably biased, but I think our AlchemyLanguage Emotion Analysis is pretty good: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercl...


I have used none, but at the time I researched it, a lot of people were talking positively about

https://www.informatik.uni-augsburg.de/lehrstuehle/hcm/proje...


I tried Sphinx in 2005 - we needed to recognize separate words, not even phrases. It wasn't viable; only the commercial alternatives were satisfactory.


It has been 11 years since your last attempt then - have you checked it for improvements since?


I have. It's... not great. It gets the job done, but it's a bit cumbersome.

It does recognize separate words, but, IMO, the biggest flaw is that it only recognizes pre-defined commands.


My main issue with doing anything voice related was that the last time I looked into using Pocketsphinx, I needed to define the terms/dictionaries to parse from up front.

I'd love to mix and match NLP libraries, voice synthesis, voice identification, and speech recognition to make a comfortable "User Interface" to some systems in my house.

I think it'd be a fun project, but nothing seems to be able to take arbitrary audio streams and give me a "User identification" based on voice patterns and also arbitrary spoken text.

I know, yes, this is a VERY tall order, but it's something that should be possible. At the very least, the identification part isn't needed. It's just important that it works offline and provides a text stream.


How can I study speech recognition???


Read the book Automatic Speech Recognition: A Deep Learning Approach http://rd.springer.com/book/10.1007%2F978-1-4471-5779-3. Also read Spoken Language Processing: http://www.amazon.com/Spoken-Language-Processing-Algorithm-D.... In parallel, play with the open source toolkits (CMU Sphinx, Kaldi). Just run the examples from the tutorials to understand how things look in practice.


Take a natural language processing course if possible


http://www.politepix.com/openears/ for iOS has been around since 2010


How accurate is CMU Sphinx for speech recognition compared to what's inside Alexa?


Sphinx is pretty awful (remember the time before good speech recognition existed?). Alexa is far better.

Kaldi is much better, but very difficult to set up.

None of the open source speech recognition systems (or commercial for that matter) come close to Google.



Ah interesting link, I hadn't seen that.


> None of the open source speech recognition systems (or commercial for that matter) come close to Google.

Is that because of the data they have, or because of their superior algorithms?


It’s because they used data from the public to train their models.

If someone suddenly applied the principle that copyright bans remixes to the training of neural networks, along with the fact that licenses for this have to be granted explicitly, Google would lose 90% of their advantage over other companies.

Personally, I'd be in favor of requiring companies to open source their trained models if the training data contains data supplied by users rather than paid employees.


I think the world needs the equivalent of OpenStreetMap but for speech data, so that the data is under a copyleft license that legally enforces reciprocation when the corpus is used or modified.


The closest is VoxForge:

http://voxforge.org/


Didn't know about VoxForge. Thanks!


I sympathize, but good luck with that :)


Both, I think, but mostly the data. Baidu's Deep Speech is said to be very good, and its design is public (they even open sourced one component of it).


What about OpenEars? Is this better somehow?

http://www.politepix.com/openears/


cobaltspeech.com -- in case you can't get what you want from a default Kaldi install.


Pocketsphinx is too inaccurate to be useful. I even tried training custom speech models, with no luck.

I guess we need something new and shiny here in the open source space, probably based on neural networks.


Kaldi does this, but it's a lot harder than pocketsphinx to get running.


Julius works well, but has no continuous dictation models for English yet (it was originally developed for Japanese).


[Deleted]


.NET speech recognition is Windows-only. I doubt Mozilla would go for this as they are aiming for cross-platform software.



