It does a lot of things the JS API doesn't: Jinja templates, decorator-based intent routing, slot defaults/conversions, and request signature verification, to name a few.
I just put mine up a month ago - not discoverable for "Python Alexa" or any keywords. Working on it :-)
So for the differences -
Alexa Skills are deployable as AWS Lambda functions or behind HTTPS. Currently, ask-alexa-pykit works on Lambda, while Flask-Ask implements the signature verification required for HTTPS deployments (i.e., Flask-Ask works on your own HTTPS server or on Lambda).
Another difference is in the intent mapping design. Flask-Ask is built on the same architectural patterns as Flask: context locals, parameter mapping/conversion, and, of course, Jinja templates!
For example, mapping an intent with ask-alexa-pykit looks like this:
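(I'm sketching this from memory of their README, so treat the exact decorator and helper names as approximate.)

    @alexa.intent_handler('HelloIntent')
    def hello_intent_handler(request):
        # Slot values get pulled off the raw request object by name
        firstname = request.slots.get('firstname', 'friend')
        return alexa.create_response(message='Hello, {}!'.format(firstname))

The handler receives the whole request and digs the slot values out itself.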
Flask-Ask also converts slots, like firstname from the example above, into arbitrary datatypes, and it has stock conversions for AMAZON.DURATION (e.g. 'P2YT3H10M' into a Python datetime.timedelta). Full parameter mapping docs are here: https://johnwheeler.org/flask-ask/requests.html#mapping-inte...
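For comparison, here's roughly what the same mapping looks like in Flask-Ask, with a duration slot thrown in to show the stock conversion (the intent and slot names are just placeholders, not from a real interaction model):

    from flask import Flask
    from flask_ask import Ask, statement

    app = Flask(__name__)
    ask = Ask(app, '/')

    # Slots arrive as keyword arguments. 'timedelta' converts an
    # AMAZON.DURATION value like 'P2YT3H10M' into a datetime.timedelta,
    # and default fills the slot when Alexa doesn't send it.
    @ask.intent('HelloIntent',
                convert={'length': 'timedelta'},
                default={'firstname': 'friend'})
    def hello(firstname, length):
        return statement('Hello, {}!'.format(firstname))

    if __name__ == '__main__':
        app.run(debug=True)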
Flask-Ask templates are grouped together in the same file to make them easier to manage, since utterances are typically small phrases. Templates are, of course, an optional feature, but they're encouraged!
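For instance (a minimal sketch along the lines of the tutorial; the 'hello' template name is made up): templates live in a templates.yaml file next to your app, and you render them with Flask's regular render_template:

    # templates.yaml
    hello: Hello, {{ firstname }}!

    # in your handler module
    from flask import render_template

    @ask.intent('HelloIntent')
    def hello(firstname):
        return statement(render_template('hello', firstname=firstname))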
It's still very early, but I'm working my butt off on it, full-time! I have a 5-min tutorial that shows how to get up and running with Flask-Ask and ngrok: https://www.youtube.com/watch?v=eC2zi4WIFX0 - The API has changed a little since then, so if you try it out and have any questions, you can open an issue or hit me up: john at johnwheeler.org
Thanks for the detailed reply! I've been interested in the Echo and have wanted to get into it for a while, so I'll definitely dig into Flask-Ask and try to get up and running with it.
Please do not confuse the terms "voice recognition" and "speech recognition": the former refers to identifying people by their voice, the latter to transcribing spoken words to text.
Speaker recognition unambiguously describes identifying people by their voice, whereas voice recognition is frequently used for both. This blog post isn't confusing the terms – they've already been confused.
Not an expert, but I might go even further and say that voice recognition doesn't even require identifying the speaker, only recognizing that words are being spoken.
Frankly, Kaldi is nearly impossible for mere mortals to use. It's 100% targeted at people doing PhD work in speech recognition who have a colleague who already knows how it works and can set it up for them.
IMHO, there is a big opportunity for someone to come along and repackage it in a user friendly way, but the people who actually understand it are too busy doing "real work" to bother with such frivolity.
Like others are saying, it's just much harder to use. The official tutorial even says "The intended audience for this tutorial is either speech recognition researchers, or graduates or advanced undergraduates who are studying this area anyway." in the first paragraph. It seems like Kaldi is meant for people who actually know how speech recognition works, while other tools are meant for people who just want some text from some audio without really understanding how.
For example, I've been playing with home automation and speech recognition, and I've been able to get any Sphinx-based recognizer working in a single sitting, in a few hours or less. But I've yet to get Kaldi working after several nights of effort. It seems much more powerful, and based on my reading, it's more accurate than Sphinx. But that doesn't do me any good if I can't get it to run, haha.
I've done a lot of work with the Watson Speech to Text service (I work on the Watson team), and I've been pretty impressed with the results. You can upload an audio file (wav, flac, or ogg/opus) to the demo to try it out: https://speech-to-text-demo.mybluemix.net/
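If you want to skip the demo page and hit the service directly, it's a single HTTP call. A minimal sketch using the requests library, assuming the classic username/password service credentials and the public v1/recognize endpoint (your instance's URL and auth scheme may differ):

    import requests

    # Placeholders -- use the credentials from your own service instance
    STT_URL = 'https://stream.watsonplatform.net/speech-to-text/api/v1/recognize'
    USERNAME = 'your-stt-username'
    PASSWORD = 'your-stt-password'

    with open('sample.flac', 'rb') as audio:
        resp = requests.post(STT_URL,
                             data=audio,
                             auth=(USERNAME, PASSWORD),
                             headers={'Content-Type': 'audio/flac'})

    # JSON response with one or more transcript alternatives per result
    print(resp.json())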
Nuance has a suite that does that. We use their software and it's pretty good, despite being a headache to configure. But it's neither free nor open source.
The trick with pocketsphinx is to limit the vocabulary you want to recognize, and create a corpus of the types of things you want to be able to recognize and feed it through here: http://www.speech.cs.cmu.edu/tools/lmtool-new.html
If you try to use pocketsphinx to recognize arbitrary English (e.g. dictation) it's not going to work very well in my experience.
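lmtool hands back a small language model (.lm) and pronunciation dictionary (.dic) for your corpus; you then point pocketsphinx at those instead of the default model. A rough sketch using the pocketsphinx-python LiveSpeech wrapper (the 0123.lm/0123.dic names stand in for whatever lmtool generates for you):

    from pocketsphinx import LiveSpeech

    # Restrict recognition to the lmtool-generated vocabulary
    for phrase in LiveSpeech(lm='0123.lm', dic='0123.dic'):
        print(phrase)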
There are two realistic options: one is pocketsphinx, the other Kaldi. When running on a Pi, pocketsphinx will be your only realistic choice for realtime detection. You'll want to move to a Raspberry Pi 3 as well, and you'll want to use a customized dictionary to get your recognition speed up. Lastly, there are several parameters you can tweak that'll affect recognition speed.
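On the parameter front, here's a rough sketch with the lower-level Decoder API; the values are illustrative knobs for trading accuracy against speed, not tested recommendations:

    import os
    from pocketsphinx import Decoder, get_model_path

    config = Decoder.default_config()
    config.set_string('-hmm', os.path.join(get_model_path(), 'en-us'))
    config.set_string('-lm', '0123.lm')      # small custom LM (e.g. from lmtool)
    config.set_string('-dict', '0123.dic')   # matching dictionary
    # Narrower beams and fewer active HMMs per frame run faster,
    # at some cost in accuracy
    config.set_float('-beam', 1e-20)
    config.set_float('-wbeam', 1e-20)
    config.set_int('-maxhmmpf', 3000)
    decoder = Decoder(config)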
Raw processing power will be the bottleneck on a Raspberry Pi.
My main issue with doing anything voice-related was that, the last time I looked into using Pocketsphinx, I needed to define the terms/dictionaries to parse from.
I'd love to mix and match NLP libraries, voice synthesis, voice identification, and speech recognition to make a comfortable "User Interface" for some systems in my house.
I think it'd be a fun project, but nothing seems to be able to take arbitrary audio streams and give me a "User identification" based on voice patterns and also arbitrary spoken text.
I know, yes, this is a VERY tall order, but it's something that should be possible. At the very least, the identification part isn't needed. It's just important that it works offline and provides a text stream.
It’s because they used data from the public to train their models.
If someone were suddenly to apply the fact that copyright bans remixes to the training of neural networks, along with the fact that licenses for this have to be granted explicitly, Google would lose 90% of their advantage over other companies.
Personally, I'd be for requiring companies to open source their trained models if the training data contained data supplied by users rather than paid employees.
I think the world needs the equivalent of OpenStreetMap but for speech data, so that the data is under a copyleft license that legally enforces reciprocation when the corpus is used or modified.
I want to take this opportunity to plug an Alexa Skills Kit API (in Python) I just dropped:
https://github.com/johnwheeler/flask-ask