I think Google's API will usher in a lot of innovative new applications.
It is amazingly easy to create speech recognition without going out to any API these days.
This was 2 years ago, so maybe it's simple now, but I didn't find it "amazingly easy" back then.
I do have a number of projects where I could definitely use a local speech recognition library. I have used [Python SpeechRecognition](https://github.com/Uberi/speech_recognition/blob/master/exam...) to essentially record and transcribe from a scanner (a rough sketch of the setup is below). I wanted to take it further, but Google at the time limited the number of requests per day. Today's announcement seems to indicate they will be expanding their free usage, but a local setup would be much better. I'd like to deploy this in a place that might not have reliable Internet.
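The core of it looks roughly like this; the PocketSphinx fallback assumes the pocketsphinx package is installed, and sr.Microphone() needs PyAudio:

    # Rough sketch: listen on the mic, send to Google's free recognizer,
    # fall back to a local PocketSphinx decode when the network gives out.
    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.Microphone() as source:          # needs PyAudio
        r.adjust_for_ambient_noise(source)   # calibrate for scanner hiss
        audio = r.listen(source)             # blocks until a phrase ends

    try:
        print(r.recognize_google(audio))     # free tier, request-limited
    except sr.RequestError:                  # no Internet / over quota
        print(r.recognize_sphinx(audio))     # fully local
    except sr.UnknownValueError:
        print("(unintelligible)")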
We've written detailed, up-to-date instructions for installing CMU Sphinx, and now also provide prebuilt binaries!
If you're interested in not sending your audio to Google, CMU Sphinx and other libraries (like Kaldi and Julius) are definitely worth a second look.
But these days, if you go all the way through their tutorial and give it a proper read, it's very doable to set up.
Also, they give you the tools and knowledge to build better models (and explain the theory), which is where most of the competitive advantage is IMHO.
Google's engine also works fine (I've been trying it on phones), but the pricing may or may not be a deal breaker.
CMUSphinx is not a neural-network-based system, but it does use language and acoustic modeling.
Not really. The hard part is not the algorithm; it is the millions of samples of training data behind Google's system. They pretty much have every accent and way of speaking covered, which is what allows them to deliver such a high-accuracy speaker-independent system.
CMUSphinx is remarkable as an academic milestone, but in all honesty it's basically unusable from a product standpoint. If your speech recognition is only 95% accurate, you're going to have a lot of very unhappy users. Average Joes are used to things like microwave ovens, which work 99.99% of the time, and expect new technology to "just work".
CMUSphinx is also an old algorithm; AFAIK Google is neural-network based.
Baidu open sourced their CTC implementation
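That's warp-ctc, IIRC. The CTC trick is training on (audio, transcript) pairs without per-frame alignments. A minimal sketch of the objective, using PyTorch's built-in nn.CTCLoss as a stand-in for Baidu's implementation (shapes and sizes are made up):

    import torch
    import torch.nn as nn

    T, N, C = 50, 1, 28   # 50 acoustic frames, batch of 1, 27 chars + blank
    logits = torch.randn(T, N, C, requires_grad=True)  # stand-in for a model
    log_probs = logits.log_softmax(dim=2)

    targets = torch.randint(1, C, (N, 10))             # 10-symbol transcript
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 10, dtype=torch.long)

    ctc = nn.CTCLoss(blank=0)  # index 0 is reserved for the CTC blank symbol
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()            # gradients flow back into the acoustic model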
I think we will have an easy to install OSS speech recognition library and accurate pretrained networks not far off from Google/Alexa/Baidu, running locally rather than in the cloud, within 1-2 years. Can't wait.
Speech Intent Recognition
... the server returns structured information about the incoming speech so that apps can easily parse the intent of the speaker, and subsequently drive further action. Models trained by the Project Oxford LUIS service are used to generate the intent.
Do others offer something like this?
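For reference, the structured result from a LUIS-style endpoint looks roughly like this; the exact field names vary by service and version, so treat the schema below as an assumption:

    # Hypothetical intent payload for one utterance, as a Python dict:
    response = {
        "query": "turn on the kitchen lights",
        "topScoringIntent": {"intent": "HomeAutomation.TurnOn", "score": 0.97},
        "entities": [
            {"type": "thing", "entity": "lights"},
            {"type": "location", "entity": "kitchen"},
        ],
    }

    # Apps branch on the intent and fill in arguments from the entities:
    if response["topScoringIntent"]["intent"] == "HomeAutomation.TurnOn":
        things = [e["entity"] for e in response["entities"]
                  if e["type"] == "thing"]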
Still, having this paid and cloud-based puts a limit on the types of things you'd use it for. I will use it in my own apps for now but will swap to an OSS speech recognition library running locally as soon as one emerges that is good enough.
I've been thinking a lot lately about where the next major areas of technology-driven disruption might be in terms of employment impact, and things like this make me wonder how long it will be before call centers stacked wall to wall with customer service reps become a relic of the past...
Fun to play with, but don't expect it to last...
Disclosure: I work on Compute Engine.
Google has a history of shutting down useful products; why should people trust this one for long-term integration?
IMO a better option for Google, when considering closing an API, is to enforce payment and hike the price enough to justify maintaining it. If, and only if, enough users drop out at the higher price, then shut it down for good.
Doesn't this mean you could spend time developing and building on the platform without knowing if your application is economically feasible? Seems like a huge risk to take for anything other than a hobby project.
On a past project, I think we got 45% off list without too much trouble.
The App Engine pricing change resulted in a nasty surprise for a lot of people and a lot of vitriol towards Google. Just one article on this, from many: http://readwrite.com/2011/09/02/google-app-engine-pricing-an...
While the platform has moved way ahead of where it was in 2011, the memory of this is still in the back of people's minds, and it would be a good move on Google's part to at least try to engender more trust by giving directional pricing.
I just wish Google wouldn't bring back memories of that by withholding the pricing of a very promising API.
GAE has moved quite far ahead since then, but many people still won't consider it after the bad experience. Perceptions die hard...
As is, the ambiguity is a deterrent to investing too much time.
Side note: if anyone is interested in helping with an embedded voice recognition project please ping me.
Sphinx is supported by Carnegie Mellon and Julius by Kyoto University/Nagoya Institute of Technology.
I think the easier choice even today might still be Sphinx, given the excellent documentation (they touch on pretty much all the basics you need to know) and the availability of pocketsphinx (C) and Sphinx4 (Java).
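Getting a first transcription going with the Python bindings is only a few lines; this uses the bundled US English model, and the API details are from memory:

    # Live transcription with pocketsphinx and the bundled en-US model.
    from pocketsphinx import LiveSpeech

    for phrase in LiveSpeech():  # opens the default mic, segments utterances
        print(phrase)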
There are also projects like this:
When did you try it and which languages? Any particular issues you can share?
Voice recognition was supposed to be a cherry on top, but it ended up occupying one of our senior developers for the duration of the month-long project, and we were ultimately unable to get it working in the time we had available.
A glimpse of what's involved:
Eg: "Switch on the lights" becomes
"thing" : "lights"
etc.. I'm trying really hard to remember the name but it escapes me.
Speech recognition and <above service> will go very well together.
Since your service is completely free, how do you plan on surviving? Would you open source any parts of Wit.ai should you go under?
I feel these are important questions to ask before investing time & energy into using your otherwise awesome service...
Ah, so it will probably be discontinued
> The Google Cloud Speech API, which will cover over 80 languages and will work with any application in real-time streaming or batch mode, will offer full set of APIs for applications to “see, hear and translate,” Google says.
I don't know how I should feel about Google taking even more data from me (and other users). How would integrating this service work legally? Would you need to alert users that Google will keep their recordings on file (probably indefinitely, and without users being able to delete them)?
It would be nice if they just open sourced it, though I imagine that is at cross purposes with their business.
Since Android is open-source, would that mean that the voice recognition software (and/or trained coefficients) could, in principle, be ported to Linux?
You haven't used Android since 2010, have you?
In the latest versions, there is no more Open Source anything.
Calendar, Contacts, Home screen, Phone app, Search, are all closed source now.
(btw, all of them, including the Google app, used to be open in Gingerbread)
You can't do TLS without going through Google apps (or packaging spongycastle), you can't do OpenGL ES 3.2, you can't use location anymore, nor use WiFi for your own location implementation.
Since Marshmallow, you are also forced to use Google Cloud Messaging, or the device will just prevent your app from receiving notifications.
To "save battery power" and "improve usability", Google monopolized all of Android.
Not that I mind personally.
Oh, wait, there were no announcements, they were dropped silently.
It includes lots of links to relevant research, tools, and services. Also includes discussion of the pros and cons of various services (Google/MS/Nuance/IBM/Vocapia etc.) and the value of vocabulary uploads and speaker profiles.
PyCon made an admirable effort to live-caption their talks last year, but some of those transcripts never got uploaded along with the talks, which is puzzling; I suppose it could be due to a lack of timecodes.
I've subscribed to your blog and hope to contribute whatever I can to make this work out.
My trial of a Python speech library on Windows: speech recognition with the Python "speech" module, and also the opposite (text back to speech).
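From memory it went something like this; the module wraps the built-in SAPI engine, and the exact function names are as I recall them, so treat this as a sketch:

    # Windows-only: the "speech" module wraps the built-in SAPI engine.
    import speech

    heard = speech.input()   # blocks until a phrase is recognized
    print("You said:", heard)

    speech.say(heard)        # and the opposite: text back to speech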
I've never used Nuance but I've played around with IBM Watson, which gives you 1,000 free minutes a month, and then 2 cents a minute afterwards. Watson allows you to upload audio in 100MB chunks (or is it 10-minute chunks? I forgot), whereas Google currently allows 2 minutes per request (edit: according to their signup page)...but both Watson and Google allow streaming, so that's probably a non-issue for most developers.
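A one-shot request is simple enough with requests; the endpoint and response fields below are as I remember the current REST API, and the credentials are placeholders:

    import requests

    # POST a WAV file to Watson Speech to Text and print the transcript.
    with open('clip.wav', 'rb') as f:
        resp = requests.post(
            'https://stream.watsonplatform.net/speech-to-text/api/v1/recognize',
            auth=('USERNAME', 'PASSWORD'),
            headers={'Content-Type': 'audio/wav'},
            data=f,
        )

    # Join the top alternative from each result chunk.
    print(' '.join(r['alternatives'][0]['transcript']
                   for r in resp.json()['results']))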
From my non-scientific observation...Watson does pretty well, such that I would consider using it for quick, first-pass transcription...it even gets a surprising number of proper nouns correct, including "ProPublica" and "Ken Auletta" -- though it fudges things in other cases...its vocab does not include "Theranos", which is variously transcribed as "their in house" and "their nose"
It transcribed the "Trump Steaks" commercial nearly perfectly...even getting the homophones right in "when it comes to great steaks I just raise the stakes the sharper image is one of my favorite stores with fantastic products of all kinds that's why I'm thrilled they agree with me trump steaks are the world's greatest steaks and I mean that in every sense of the word and the sharper image is the only store where you can buy them"...though later on, it messed up "steak/stake"
It didn't do as great a job on this Trump "Live Free or Die" commercial, possibly because of the booming theme music...I actually did a spot check with Google's API on this and while Watson didn't get "New Hampshire" at the beginning, Google did. Judging by how well YouTube manages to caption videos of all sorts, I would say that Google probably has a strong lead in overall accuracy when it comes to audio in the wild, just based on the data it processes.
edit: fixed the Trump steaks transcription...Watson transcribed the first sentence correctly, but not the other "steaks"
...Isn't that specifically what anticompetition laws were written to prevent?
But maybe they only do that with consumer facing items?
As others here have pointed out, the value now for GOOG is in building the best training data-set in the business, as opposed to just racing to find the best algorithm.
But, assuming that was their plan, they'd have a couple options:
- Like you said, they could turn it into supervised training examples by transcribing it. I'm sure they'd at least like to transcribe some of it so that they can measure their performance. Also, while Google does have a lot of 1st party applications feeding them training data, customer data might help them fill in some gaps.
- They might also be able to get some value out of it without transcribing it. Neural networks can sometimes be pre-trained in an unsupervised manner. One example would be pre-training the network as an autoencoder, which just means training it to reproduce its input as its output. This can reduce convergence time.
First, you take the huge input (because something like sound has a huge amount of data in it, similarly with images there are a lot of pixels) and learn a simpler representation of it.
The second problem of mapping these nice dense features to actual things can be solved in different ways, even simple classifiers can perform well.
This doesn't actually need any labelled data. I just want to learn a smaller representation. For example, if we managed to learn a mapping from bits of audio to the phonetic alphabet then our speech recognition problem becomes one of just learning the mapping from the phonetic alphabet to words which is a far nicer problem to have.
Some ways of "deep learning" solve this first problem (of learning neater representations) through a step by step process of what I like to refer to as laziness.
Instead of trying to learn a really, really high level representation of your input data just learn a slightly smaller one. That's one layer. Then once we've got that we try and learn a smaller/denser representation on top of that. Then again, and again, and again.
How can you learn a smaller representation? Well a good way is to try and get a single layer to be able to regenerate its input. "Push" the input up, get the activations in the next layer, run the whole thing backwards and see how different your input is. You can then use this information to tweak the weights to make it slightly better the next time. Do this for millions and millions of inputs and it gets pretty good. This technique has been known about for a long time, but one of the triggers for the current big explosion of use was Hinton working out that this back and forth only really needs to be done once rather than 100 times (which was thought to be required beforehand).
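In (very simplified) code, one layer of that push-up/run-backwards loop might look like the numpy sketch below, with tied weights; the layer sizes and the "data" are made up, and a real setup would feed in actual audio features:

    import numpy as np

    rng = np.random.RandomState(0)
    n_in, n_hid = 256, 64        # e.g. 256 input features -> 64-dim code
    W = rng.normal(0.0, 0.01, (n_in, n_hid))
    lr = 0.1

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_step(x, W):
        h = sigmoid(x @ W)        # "push" the input up to the code layer
        x_hat = sigmoid(h @ W.T)  # run it backwards to reconstruct the input
        err = x_hat - x           # how different is the reconstruction?
        # Tweak the (tied) weights to shrink the squared error next time.
        d_out = err * x_hat * (1.0 - x_hat)
        d_hid = (d_out @ W) * h * (1.0 - h)
        W -= lr * (np.outer(x, d_hid) + np.outer(d_out, h))
        return (err ** 2).mean()

    for step in range(5000):
        x = rng.rand(n_in)        # stand-in for one frame of input
        loss = train_step(x, W)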
Hinton says it made things 100,000 times faster because it was 1% of the computation required and it took him 17 years to realise it in which time computers got 1000 times faster. Along with this, GPUs got really really fast and easier to program. I took the original Hinton work that took weeks to train and had it running in hours back in 2008 on a cheap GPU. So before ~2006 this technique would have taken years of computer time, now it's down to minutes. Of course, that's then resulted in people building significantly larger networks that take much longer to train but would have been infeasible to run before.
But you still need a lot of unlabelled data. While I doubt Google is doing that with this setup, they have done something like it before: they set up a question-answering service in the US that people could call, for free I think, to collect voice data.
You need labelled data. But it turns out you can learn most of what you need with unlabelled data, leaving you with a much simpler problem to solve. That's great because labelled data is massively more expensive than unlabelled data.
Or, at least, that's my best guess with zero research and little knowledge.
Like if neural networks trained with user data should be un-copyrightable, and public domain by default.
Do they provide a way to send audio via WebRTC or WebSocket from a browser?
Congress has video for all of its sessions and it is transcribed. So does the Supreme Court (though not timestamped).
This is enforced by the FCC, but as more and more "internet" content gets consumed I imagine the same regulations will eventually come, at which point you've got a fantastic training set.
The Word Error Rates (lower is better) for each recognizer on two different corpora, VM1 and WSJ1:
    Recognizer         VM1    WSJ1
    HDecode v3.4.1     22.9   19.8
    Julius v4.3        27.2   23.1
    pocketsphinx v0.8  23.9   21.4
    Sphinx-4           26.9   22.7
    Kaldi              12.7    6.5
Especially because Google's version is trained with illegally obtained user data (no, changing your ToS doesn't allow you to use previously collected data for new purposes in the EU).
We, as a society, should discuss whether software trained on user data should be required to be available to those who provided that data, and whether any copyright can even exist for software developed by training neural networks, or whether it's public domain by definition.
Software built by training neural networks on huge datasets gives existing monopolies an immense advantage and makes it extremely hard for competitors to catch up.
If this trend continues, startups will become impossible for software that depends on being trained with huge datasets.
Try Googling "speech recognition api"...
Really, GOOG should democratize quant stuff next... DIY hedge fund algos.
Have a look at CMUSphinx/Pocketsphinx. I wrote a comment about training it for command recognition in a previous discussion.
It supports BNF-grammar-based training too, so I've a vague idea that it may be possible to use your programming language's BNF to make it recognize language tokens. I haven't tried this out though.
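If you want to experiment, restricting pocketsphinx to a JSGF grammar looks roughly like this; the grammar and file names are made up, and the kwargs are as I remember the Python bindings:

    # commands.gram (JSGF):
    #   #JSGF V1.0;
    #   grammar commands;
    #   public <command> = (turn | switch) (on | off) the (lights | fan);
    from pocketsphinx import LiveSpeech

    # lm=False disables the default language model so the grammar is used.
    for phrase in LiveSpeech(lm=False, jsgf='commands.gram'):
        print(phrase)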
Either way, be prepared to spend some time on the training. That's the hardest part with sphinx.
Also, have you seen this talk for doing it on Unix/Mac? He does use natlink and dragonfly, but perhaps some concepts can be transferred to Sphinx too?
I just applied for early access.
Regardless, this is still very exciting. I haven't found anything that's as good as Google's voice recognition. I only hope this ends up being cheap and accessible outside of their platform.
Similarly, you can use the Google Translate API without using Compute Engine, App Engine, etc.
Note: I work for Google (but not on any of these products).