Hacker News
Google opens access to its speech recognition API (techcrunch.com)
605 points by jstoiko 548 days ago | 168 comments

This is HUGE in my opinion. Prior to this, in order to get near state-of-the-art speech recognition in your system/application you either had to have/hire expertise to build your own or pay Nuance a significant amount of money to use theirs. Nuance has always been a "big bad" company in my mind. If I recall correctly, they've sued many of their smaller competitors out of existence and only do expensive enterprise deals. I'm glad their near monopoly is coming to an end.

I think Google's API will usher in a lot of new innovative applications.

Other "state-of-the-art" speech recognition solutions already exist. For example, Microsoft has been offering it through its Project Oxford service. https://www.projectoxford.ai/speech

Also, CMUSphinx and Julius:

CMU Sphinx: http://cmusphinx.sourceforge.net/

Julius: http://julius.osdn.jp/en_index.php

It is amazingly easy to create speech recognition without going out to any API these days.

I first learned about CMUSphinx from the [Jasper Project](https://jasperproject.github.io/). While Jasper provided an image for the Pi, I decided to go ahead and make a scripted install of CMUSphinx. I spent something like 2 frustrating days attempting to get it installed by hand in a repeatable fashion before giving up.

This was 2 years ago, so maybe it's simple now, but I didn't find it "amazingly easy" back then.

I do have a number of projects where I could definitely use a local speech recognition library. I have used [Python SpeechRecognition](https://github.com/Uberi/speech_recognition/blob/master/exam...) to essentially record and transcribe from a scanner. I wanted to take it further, but google at the time limited the number of requests per day. Today's announcement seems to indicate they will be expanding their free usage, but a local setup would be much better. I'd like to deploy this in a place that might not have reliable Internet.

In my experiences, the issues with building CMU Sphinx are mainly unspecified dependencies, undocumented version requirements, and forgetting to sacrifice the goat when the MSVC redistributable installer pops up.

We've written detailed, up-to-date instructions [1] for installing CMU Sphinx, and now also provide prebuilt binaries [2]!

If you're interested in not sending your audio to Google, CMU Sphinx and other libraries (like Kaldi and Julius) are definitely worth a second look.

[1] https://github.com/Uberi/speech_recognition/blob/master/refe... [2] https://github.com/Uberi/speech_recognition/tree/master/thir...

Yeah I'm gonna leave a reply here just in case I need to find this again (already opened tabs, but you never know). This might be big for a stalled project at work. If this can un-stall that, I'll sure owe you a beer ;)

Would you mind submitting this documentation to CMU? I get the feeling they'd love to at least host a link to it to enhance their own documentation.

Thanks for providing this. Will definitely give it a fresh look.

That sounds like my experience with it from about 5 years ago. I gave up on it too. It also didn't help that CMUSphinx has had more than one version in development, in different languages.

I would note that as a positive... But yeah, 5 years ago things were much much rougher (which is partly why I didn't think it got so much press).

But these days, if you go all the way through their tutorial, and give it a proper read, it's very doable to set up.

Unfortunately, the situation hasn't improved much. Besides, even if you get it set up, the recognition quality isn't even close to Google's.

As someone who's worked with a lot of these engines: Nuance and IBM are the only really high-quality players in the space. CMUSphinx and Julius are fine for low-volume operations where you don't need really accurate recognition, but if you want high accuracy, neither comes close in my experience.

Right, but they do offer you a fantastic starting point. If Nuance is 100%, I'd say CMUSphinx is at least 40%.

Also, they give you the tools and knowledge to build better models (and explain the theory), which is where most of the competitive advantage is IMHO.

As someone who has actually done objective tests, Google are by far the best, Nuance are a clear second. IBM Watson is awful though. Actually the worst I've tested.

Do you have a report of your tests? I'm interested in using speech recognition, but there are so many start-ups and big players that it would be quite time-consuming to do a quality/price analysis.

For the "dialect" of Spanish that we speak in Argentina, Watson misses every single word. So, to me, CMUSphinx is valuable in that it allows me to tweak it, while IBM fails miserably at every word. It must've been trained on "neutral" Spanish from Spain or Mexico.

Google's engine also works fine (I've been trying it on phones), but the pricing may or may not be a deal breaker.

Is Julius really state-of-the-art? It looks like they use n-grams and HMMs. Those were the methods that achieved SotA 5+ years ago. My understanding is that Google and Microsoft are using end-to-end (or nearly) neural network models; these outperformed the older methods a few years ago. Not sure how CMUSphinx works under the hood.
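For a sense of what the "older" n-gram side of that approach looks like, here is a toy bigram language model of the kind classic HMM-based decoders combine with acoustic scores. The corpus is made up purely for illustration:

```python
from collections import Counter, defaultdict

# Toy bigram language model: P(word | previous word) estimated by counting.
# Classic ASR decoders multiply scores like these with HMM acoustic scores
# while searching for the most likely word sequence.
corpus = "the cat sat on the mat the cat ate the fish".split()

bigrams = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigrams[prev][cur] += 1

def p_next(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

print(p_next("the", "cat"))  # "the" is followed by "cat" 2 times out of 4 -> 0.5
```

Real systems use much higher-order n-grams with smoothing for unseen word pairs, which this sketch omits.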

They might not be considered state-of-the-art (if you consider both approaches in the same category), but they are definitely one valid approach to voice recognition, which works surprisingly well.

CMUSphinx is not a neural-network-based system; it uses traditional acoustic and language modeling.

Check out https://github.com/yajiemiao/eesen for an LSTM- and CTC-based library instead of HMMs.

CMUSphinx is really easy to set up, and then being able to train it for one's specific domain probably beats state of the art with one-size-fits-all training.

> It is amazingly easy to create speech recognition without going out to any API these days.

Not really. The hard part is not the algorithm, it is the millions of samples of training data that have gone behind Google's system. They pretty much have every accent and way of speaking covered in their system which is what allows them to deliver such a high-accuracy speaker-independent system.

CMUSphinx is remarkable as an academic milestone, but in all honesty it's basically unusable from a product standpoint. If your speech recognition is only 95% accurate, you're going to have a lot of very unhappy users. Average Joes are used to things like microwave ovens, which work 99.99% of the time, and expect new technology to "just work".

CMUSphinx is also an old algorithm; AFAIK Google is neural-network based.

Eesen looks promising, uses LSTM and CTC rather than older tech.


Baidu open-sourced their CTC implementation


I think we will have an easy to install OSS speech recognition library and accurate pretrained networks not far off from Google/Alexa/Baidu, running locally rather than in the cloud, within 1-2 years. Can't wait.

From the Microsoft Project Oxford Speech API link:

Speech Intent Recognition

... the server returns structured information about the incoming speech so that apps can easily parse the intent of the speaker, and subsequently drive further action. Models trained by the Project Oxford LUIS service are used to generate the intent.

Do others offer something like this?

Microsoft LUIS is almost identical to the intent classification and entity extraction in the Alexa Skills Kit, but it's easier to use because you can pipe in your own text from any source instead of having to use a specific speech recognition engine. LUIS also has a pretty nice web interface that prompts you to label utterances that it's seen that it had trouble with.

But Google's is by far the best.

Exactly: this is what people miss when they try to compare Google's speech recognition to other services. Google uses deep neural networks to continuously train and improve the quality of its speech recognition, and it gets its training data from the hundreds of millions of Android users around the world using speech-to-text every day. No other company has a comparable amount of training data, continuously being expanded. http://googleresearch.blogspot.ca/2015/09/google-voice-searc...

Kaldi is probably the best option now. https://github.com/kaldi-asr/kaldi

Interesting - I saw this as a defensive response to the rising number of developers using Amazon's Alexa APIs, rather than anything related to Nuance.

It's probably been on their roadmap for a while, before Alexa came out. Re: Alexa/Echo, I think there is an opportunity for someone to manufacture cheap USB array mics for far-field capture.

Still, having this paid and cloud-based puts a limit on the types of things you'd use it for. I will use it in my own apps for now, but will swap to an OSS speech recognition library running locally as soon as a good-enough one emerges.

You're right - this could lead to a lot of new innovation as a bunch of developers who wouldn't have bothered before can now start hacking away to see what they can do.

I've been thinking a lot lately about where the next major areas of technology-driven disruption might be in terms of employment impact, and things like this make me wonder how long it will be before call centers stacked wall to wall with customer service reps become a relic of the past...

If it's anything like Google's other APIs, people will build applications on top of it, and then Google will decide to shut down the API with no notice.

Fun to play with, but don't expect it to last...

That's incorrect. This is a Google Cloud Platform service, and when it reaches General Availability (GA) it will be subject to our Deprecation Policy, just like Compute Engine, Cloud Storage, etc., which requires us to give at least a one-year heads-up.

Disclosure: I work on Compute Engine.

It's nice there's a policy around that, but I can understand the fears of someone considering using this to start a product - or even worse, a business.

Google has a history of shutting down useful products; why should people trust this one for long-term integration?

Because we don't have a history of violating promises like this when done in writing? Seriously, I'd love to call us just "Cloud Platform" so you don't have to think "oh yeah, those guys cancelled reader on me" but if you look at the Cloud products we don't play games with this (partly because we hold ourselves to our binding Deprecation Policy, but mostly because we really care).

Google Search API, Autocomplete, Finance, Voice all closed with tons of active users. I'm not blaming Google; they were acting in their best interest, but the consequence is less enthusiasm for building software that depends on their APIs.

IMO a better option for Google, when considering closing an API, is to enforce payment and hike the price enough to justify maintaining it. Then, if and only if enough users drop out at the higher price, shut it down for good.

Even outside of services with a formal deprecation policy, Google rarely shuts anything down with no notice (their frequently cited shutdowns had long notice.)

Has Google sued many of their smaller competitors out of business?

No, and they don't need to. They have too many other advantages -- low customer acquisition cost via already-present cloud customers, economies of scale, ease of hiring the best talent, natural integrations via their Android platform...who needs nastiness when you have all these amazing benefits!

Isn't Google's strategy typically to just start offering the same services as their smaller competitors but for free and then let them starve? ... Kind of like what's probably happening here? Sounds like this is terrible news for Nuance, for example.


> To attract developers, the app will be free at launch with pricing to be introduced at a later date.

Doesn't this mean you could spend time developing and building on the platform without knowing if your application is economically feasible? Seems like a huge risk to take for anything other than a hobby project.

Only if you spend a huge amount of time integrating with their API. It won't be more expensive than Nuance, so prepare with their pricing in mind and you'll be fine ($.008 / transaction over 20k monthly transactions).
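Back-of-the-envelope, under the pricing scheme quoted above (first 20k monthly transactions free, $0.008 each beyond that; these are the Nuance-style figures from the comment, not confirmed Google pricing):

```python
def monthly_cost(transactions, free_tier=20_000, rate=0.008):
    """Estimated monthly bill: first `free_tier` transactions free,
    `rate` dollars per transaction beyond that. Figures are the
    Nuance-style numbers quoted above, not confirmed Google pricing."""
    return max(0, transactions - free_tier) * rate

print(monthly_cost(15_000))   # under the free tier, so nothing owed
print(monthly_cost(100_000))  # 80,000 billable transactions at $0.008 each
```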


If you do any kind of volume with Nuance you can get pretty significant discounts.

On a past project, I think we got 45% off list without too much trouble.

I was really hoping they would post at least a preview of what the pricing will be.

AppEngine pricing change resulted in a nasty surprise for a lot of people and a lot of vitriol towards Google. Just one article on this from many: http://readwrite.com/2011/09/02/google-app-engine-pricing-an...

While the platform has moved way ahead of where it was in 2011, the memory of this is still in the back of people's mind and it would be a good move on Google's part to at least try to engender more trust by giving directional pricing.

It is funny reading this in hindsight. We went straight to AWS, also for long-running TCP connections, and never looked back. After that, I never recommended App Engine.

Wasn't that five years ago? When it came out of preview (and prices were expected to change)?

It was; 2011 is mentioned in both the article and my comment.

I just wish Google wouldn't bring back memories of that by not disclosing pricing of a very promising API.

GAE has moved quite far ahead since then, but many people still won't consider it after the bad experience. Perceptions die hard...

They probably are using usage data in the free period to drive decisions on pricing.

Sure, would expect nothing less, but I'm also sure they have some sense of the range they are looking for. Such a range could help developers to establish whether their use case is viable or not.

As is, the ambiguity is a deterrent from too much time investment.

Or the data determines that they have to price outside the hypothetical early-release range, and then you have people complaining that you lied.

They did something very similar to this for their translate API. Realtime IRC client plugins that translated lines in foreign languages into yours in near-realtime went from "awesome" to "just kidding, nevermind" virtually overnight.

If (for any reasonably probable value of final pricing) the cost of voice recognition is a material factor in determining whether the application is economically feasible, you have larger problems. In other words: if your application's economics are going to swing materially depending on how this is priced, it's probably not going to work out even if they keep it free.

Apps can involve users speaking for ten seconds making selections, or an hour getting transcribed. Right now, we don't know the order of magnitude; adding voice interaction to an app could be $.01/user/year or it could be $100/user/year.

It depends. Imagine a hardware device that uses the speech API, and you're looking at the 3 year cost of an average of 5 speech interactions per day (something like the Amazon Echo). Using Nuance pricing of $0.008/request, you're adding 365 * 3 * 5 * $.008 = $43.80 in cost above the BOM cost; It's entirely possible that adding $20 to the BOM is viable, but adding $40 is not.
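The arithmetic above is easy to sanity-check; a throwaway sketch using the same assumed figures ($0.008/request, 5 interactions/day, a 3-year supported lifetime):

```python
def lifetime_api_cost(requests_per_day, years, price_per_request):
    """Per-device speech API cost over the device's supported lifetime."""
    return requests_per_day * 365 * years * price_per_request

cost = lifetime_api_cost(requests_per_day=5, years=3, price_per_request=0.008)
print(f"${cost:.2f} added on top of the BOM")  # prints "$43.80 added on top of the BOM"
```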

That may just be poor phrasing by TechCrunch. It's entirely possible Google will announce the pricing on launch day, but not charge for some period of time so people have the ability to develop the initial apps without paying.

Work in the pricing now (assuming it's the same as the competition) and architect your app so it's easy to replace the provider down the line if you need to, without breaking the clients.

I came across the CMU Sphinx speech recognition library (http://cmusphinx.sourceforge.net), which has a BSD-style license, and they just released a big update last month. It supports embedded and remote speech recognition. Could be a nice alternative for someone who may not need all of the bells and whistles and prefers to have more control, rather than relying on an API which may not be free for long.

Side note: if anyone is interested in helping with an embedded voice recognition project please ping me.

There is a project for the Raspberry Pi that uses Sphinx to roll your own Amazon Echo-like device. You might want to take a look at that.

Thanks for the link, but this is the Raspberry Pi project: https://github.com/jasperproject

Sphinx is an order of magnitude worse than Google. They're not in the same league.

Came in here to say this (and also to mention Julius), but yeah, CMUSphinx is awesome.

I haven't seen Julius before, thank you! How do you decide which one to use?

The last time I did interesting stuff with these libs, I used CMUSphinx -- The documentation for Julius wasn't quite good enough yet, and CMUSphinx has great documentation.

Sphinx is supported by Carnegie Mellon and Julius by Kyoto University/Nagoya Institute of Technology.

I think the easier choice even today might still be Sphinx, given the excellent documentation (they cover pretty much all the basics you need to know) and the availability of PocketSphinx (C) and Sphinx4 (Java).

There's also projects like this: https://github.com/syl22-00/pocketsphinx.js

It's been years since I used CMU Sphinx, but don't you have to bring your own training data? Sure, there are free data sets out there, and pre-trained models, but they are not as good as what Google et al. have.

Yes. And it's not just the data sets, it's the fundamental technology. You really need state of the art, i.e. LSTM/CTC, to deal with noisy input data and to get to 99% accuracy (in addition to excellent data sets, of course).

Tried it for a project; it's decent for English, but the support for non-English languages is not there.

The website says they provide language models for "many languages".

When did you try it and which languages? Any particular issues you can share?

9 months ago, for Portuguese. The amount of effort required to build a language model is too great.

Voice recognition was supposed to be a cherry on top, but it ended up taking out one of our senior developers for the duration of the month-long project, and we were ultimately unable to get it working in the time we had available.

A glimpse of what's involved: http://cmusphinx.sourceforge.net/wiki/tutoriallm

Tangentially related: Does anyone remember the name of this startup/service that was on HN (I believe), that enables you to infer actions from plaintext.

Eg: "Switch on the lights" becomes

{"action": "switch_on", "thing" : "lights" }

etc.. I'm trying really hard to remember the name but it escapes me.

Speech recognition and <above service> will go very well together.
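A naive version of that kind of service is just pattern matching. A toy sketch (the hard-coded patterns here are invented for illustration; real services infer intent with statistical models trained on labelled utterances):

```python
import re

# Hypothetical, hand-written patterns -- a stand-in for the trained models
# that real intent-extraction services use.
PATTERNS = [
    (re.compile(r"(?:switch|turn) on the (\w+)"), "switch_on"),
    (re.compile(r"(?:switch|turn) off the (\w+)"), "switch_off"),
]

def parse_intent(text):
    """Return an {action, thing} dict for recognized commands, else None."""
    for pattern, action in PATTERNS:
        match = pattern.search(text.lower())
        if match:
            return {"action": action, "thing": match.group(1)}
    return None

print(parse_intent("Switch on the lights"))  # {'action': 'switch_on', 'thing': 'lights'}
```

The appeal of the hosted services is precisely that you don't maintain a pattern list like this; they generalize to phrasings you never wrote down.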

Our service Wit.ai (YC W14) does just that.

Demo: https://labs.wit.ai/demo/index.html

This is it! Thank you!

I've been meaning to ask someone at Wit.ai this for a while:

Since your service is completely free, how do you plan on surviving? Would you open source any parts of Wit.ai should you go under?

I feel these are important questions to ask before investing time & energy into using your otherwise awesome service...

They've been acquired by Facebook; the main guys behind it are now making Facebook M.

> They've been acquired by Facebook

Ah so it will probably be discontinued

Which languages does your service support?

From their website: de, en, es, et, fr, it, nl, pl, pt, ru, sv.

In case you're not interested in having google run your speech recognition:

CMU Sphinx: http://cmusphinx.sourceforge.net/

Julius: http://julius.osdn.jp/en_index.php

If you're having trouble (like me) finding your "Google Cloud Platform user account ID" to sign up for Limited Preview access, it's just the email address for your Google Cloud account. Took me only 40 minutes to figure that one out.

I wrote a client library for this in C# by reverse engineering what chrome did at the time (totally not legit/unsupported by google, possibly against their TOS). I have never used it for anything serious, and am glad now there is an endorsed way to do this.


Key sentence:

> The Google Cloud Speech API, which will cover over 80 languages and will work with any application in real-time streaming or batch mode, will offer full set of APIs for applications to “see, hear and translate,” Google says.

Pretty impressive from the limited look the website (https://cloud.google.com/speech/) gives: the fact that Google will clean the audio of background noise for you and supports streamed input is particularly interesting.

I don't know how I should feel about Google taking even more data from me (and other users). How would integrating this service work legally? Would you need to alert users that Google will keep their recordings on file (probably indefinitely and without being able to delete them)?

Unless I have gone crazy, Google has had an STT API available to tinker with for a while. It is one of the options for Jasper [1]. Hopefully this means it will be easier to set up now.

It would be nice if they just open-sourced it, though I imagine that is at cross purposes with their business.

[1] https://jasperproject.github.io/documentation/configuration/

SoundHound released Houndify[1], their voice API last year which goes deeper than just speech recognition to include Speech-to-Meaning, Context and Follow-up, and Complex and Compound Queries. It will be cool to see what people will do with speech interfaces in the near future.

[1] https://www.houndify.com/

Why isn't speech recognition just part of the OS? Like keyboard and mouse input.

Because speech recognition occurs on Google servers, not locally.

Speech recognition can occur either locally or on Google's servers. Since about 2012 [1], Android has been able to do some types of speech recognition, like dictation, on local devices. Additionally, Google Research has recently expanded on this functionality and it seems like much more of the speech recognition will be done locally [2].

[1] http://www.androidcentral.com/jelly-bean-brings-offline-voic...

[2] http://www.zdnet.com/article/always-on-google-ai-gives-andro...

> always-on-google-ai-gives-android-voice-recognition-that-works-on-or-offline

Since Android is open-source, would that mean that the voice recognition software (and/or trained coefficients) could, in principle, be ported to Linux?

"Android" as it is distributed in mobile devices has some Google proprietary components.

> since Android is Open Source

You haven't used Android since 2010, have you?

In the latest versions, there is no more Open Source anything.

Calendar, Contacts, Home screen, Phone app, Search, are all closed source now.

(btw, all of them, including the Google app, used to be open in Gingerbread)

You can't do TLS without going through Google apps (or packaging spongycastle), you can't do OpenGL ES 3.2, you can't use location anymore, nor use WiFi for your own location implementation.

Since Marshmallow, you are also forced to use Google Cloud Messaging, or the device will just prevent your app from receiving notifications.

To "save battery power" and "improve usability", Google monopolized all of Android.

To whomever downvoted the above post: Please clarify why you think it isn’t relevant to the discussion, or provide counterarguments. All the points I made can be easily sourced (if you wish, I can even post them in here), and are all verifiable.

It could also have been the tone of your first sentence.

Not that I mind personally.

It wasn’t meant aggressively, just as a question. It’s quite possible that the author of the comment I answered to had not used Android for a few years, or had never cared – or had just missed the announcements of the official apps not being supported anymore.

Oh, wait, there were no announcements, they were dropped silently.

It used to be. I remember "back in the day" Windows 95/98 had it. It wasn't perfect, but it was decent with a bit of training, such that you could dictate reasonably well.

Houndify launched last year and provides both speech recognition and natural language understanding. They have a free plan that never expires and transparent pricing. It can handle very complex queries that Google can't.

FWIW I'd just finished a large blog post researching ways to automate podcast transcription and subsequent NLP.

It includes lots of links to relevant research, tools, and services. Also includes discussion of the pros and cons of various services (Google/MS/Nuance/IBM/Vocapia etc.) and the value of vocabulary uploads and speaker profiles.


As a hard of hearing aspiring software developer, this would be a godsend for me if someone came up with a reliable automated transcription service. I'm often dismayed by the amount of valuable information locked in podcasts and non-transcribed videos and have to rely on goodwill of volunteers to give me transcripts.

PyCon made an admirable effort to live-caption their talks last year, but some of those transcripts never got uploaded along with the talks, which is puzzling; I suppose it could be due to a lack of timecodes.

I've subbed to your blog and hopefully I can contribute whatever I can to make this work out.

For anyone who wants to try these areas a bit:

My trial of a Python speech library on Windows:

Speech recognition with the Python "speech" module:


and also the opposite:


FWIW, Google followed the same strategy with Cloud Vision (IIRC): they released it in closed beta for a couple of months [0], then made it generally available with a pricing structure [1].

I've never used Nuance, but I've played around with IBM Watson [2], which gives you 1000 free minutes a month, and then 2 cents a minute afterwards. Watson allows you to upload audio in 100MB chunks (or is it 10-minute chunks? I forget), whereas Google currently allows 2 minutes per request (edit: according to their signup page [5])...but both Watson and Google allow streaming, so that's probably a non-issue for most developers.

From my non-scientific observation...Watson does pretty well, such that I would consider using it for quick, first-pass transcription...it even gets a surprising number of proper nouns correct, including "ProPublica" and "Ken Auletta" -- though it fudges things in other cases...its vocab does not include "Theranos", which is variously transcribed as "their in house" and "their nose" [3]

It transcribed the "Trump Steaks" commercial nearly perfect...even getting the homophones in "when it comes to great steaks I just raise the stakes the sharper image is one of my favorite stores with fantastic products of all kinds that's why I'm thrilled they agree with me trump steaks are the world's greatest steaks and I mean that in every sense of the word and the sharper image is the only store where you can buy them"...though later on, it messed up "steak/stake" [4]

It didn't do as great a job on this Trump "Live Free or Die" commercial, possibly because of the booming theme music...I actually did a spot check with Google's API on this and while Watson didn't get "New Hampshire" at the beginning, Google did [4]. Judging by how well YouTube manages to caption videos of all sorts, I would say that Google probably has a strong lead in overall accuracy when it comes to audio in the wild, just based on the data it processes.

edit: fixed the Trump steaks transcription...Watson transcribed the first sentence correctly, but not the other "steaks"

[0] http://www.businessinsider.com/google-offers-computer-vision...

[1] http://9to5google.com/2016/02/18/cloud-vision-api-beta-prici...

[2] https://github.com/dannguyen/watson-word-watcher

[3] https://gist.github.com/dannguyen/71d49ff62e9f9eb51ac6

[4] https://www.youtube.com/watch?v=EYRzpWiluGw

[5] https://services.google.com/fb/forms/speech-api-alpha/

It's 100MB chunks, and you can compress it with opus or flac to squeeze in more audio per chunk :)
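Related: if you're batching long recordings for APIs with per-request limits, splitting a WAV by duration is straightforward with the Python stdlib. A sketch (you'd still want flac/opus compression before upload, which needs an external tool):

```python
import io
import wave

def split_wav(source, chunk_seconds):
    """Split a WAV file (path or file-like object) into chunks of at most
    `chunk_seconds` each. Returns a list of bytes objects, each a complete
    standalone WAV file ready to upload as one request."""
    chunks = []
    with wave.open(source, "rb") as src:
        frames_per_chunk = int(chunk_seconds * src.getframerate())
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                # copy the audio format, then write this slice of frames;
                # the wave module fixes up the frame count on close
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

For example, `split_wav("interview.wav", 110)` would keep each chunk safely under a 2-minute-per-request cap.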

"Google may choose to raise those prices over time, after it becomes the dominant player in the industry."

...Isn't that specifically what anticompetition laws were written to prevent?

As a developer I might be more worried about it not becoming at least one of the dominant players, because then they might just drop it.

But maybe they only do that with consumer facing items?

Google kills off APIs often. Remember the whole Translate API fiasco? Though that was for overuse, not underuse.

This is a Google Cloud Platform service, and is subject to long deprecation policies at the very least.

I would say that Google's main goal here is in expanding their training data set, as opposed to creating a new revenue stream. If it hurts competitors (e.g. Nuance) that might only be a side-effect of that main objective, and likely they will not aim to hurt the competition intentionally.

As others here have pointed out, the value now for GOOG is in building the best training data-set in the business, as opposed to just racing to find the best algorithm.

Question from a machine learning noob: how would they use an unlabeled dataset for training? Would Google employees listen to it all (if the ToS would even allow that) and transcribe it, then use the result for training? Or is there another way to make it useful without being labeled?

I doubt that using their customers' audio as training data was a major motivation for offering this service.

But, assuming that was their plan, they'd have a couple options:

- Like you said, they could turn it into supervised training examples by transcribing it. I'm sure they'd at least like to transcribe some of it so that they can measure their performance. Also, while Google does have a lot of 1st party applications feeding them training data, customer data might help them fill in some gaps.

- They might also be able to get some value out of it without transcribing it. Neural networks can sometimes be pre-trained in an unsupervised manner. One example would be pre-training the network as an autoencoder, which just means training it to reproduce its input as its output. This can reduce convergence time.

Couldn't they go the same route they took with classifying objects in images, by using their users to label them? I know reCAPTCHA provides an option to verify yourself by transcribing audio.

While aware of unsupervised learning, I thought I had read somewhere that it wasn't yet a solved problem to use it for this kind of learning. But that might be wrong, and the Wikipedia article mentions some interesting examples.

Quite a few methods of running recognition work in two stages (as a very large simplification).

First, you take the huge input (because something like sound has a huge amount of data in it, similarly with images there are a lot of pixels) and learn a simpler representation of it.

The second problem of mapping these nice dense features to actual things can be solved in different ways, even simple classifiers can perform well.

This doesn't actually need any labelled data. I just want to learn a smaller representation. For example, if we managed to learn a mapping from bits of audio to the phonetic alphabet then our speech recognition problem becomes one of just learning the mapping from the phonetic alphabet to words which is a far nicer problem to have.

Some ways of "deep learning" solve this first problem (of learning neater representations) through a step by step process of what I like to refer to as laziness.

Instead of trying to learn a really, really high level representation of your input data just learn a slightly smaller one. That's one layer. Then once we've got that we try and learn a smaller/denser representation on top of that. Then again, and again, and again.

How can you learn a smaller representation? Well a good way is to try and get a single layer to be able to regenerate its input. "Push" the input up, get the activations in the next layer, run the whole thing backwards and see how different your input is. You can then use this information to tweak the weights to make it slightly better the next time. Do this for millions and millions of inputs and it gets pretty good. This technique has been known about for a long time, but one of the triggers for the current big explosion of use was Hinton working out that this back and forth only really needs to be done once rather than 100 times (which was thought to be required beforehand).

Hinton says it made things 100,000 times faster because it was 1% of the computation required and it took him 17 years to realise it in which time computers got 1000 times faster. Along with this, GPUs got really really fast and easier to program. I took the original Hinton work that took weeks to train and had it running in hours back in 2008 on a cheap GPU. So before ~2006 this technique would have taken years of computer time, now it's down to minutes. Of course, that's then resulted in people building significantly larger networks that take much longer to train but would have been infeasible to run before.

But you still need a lot of unlabelled data. I doubt Google is doing that with this setup, but they have done something like it before: they set up a question-answering service in the US that people could call, I think for free, partly to collect voice data.


You need labelled data. But it turns out you can learn most of what you need with unlabelled data, leaving you with a much simpler problem to solve. That's great because labelled data is massively more expensive than unlabelled data.

All of your audio is recorded (google.com/dashboard) then the audio files are thrown into a lottery. Individual people transcribe what they hear for the computer to build associations.

Or, at least, that's my best guess with zero research and little knowledge.

Google gets plenty of speech training data from Android phones; I doubt they need more from startups.

And it opens up legal questions.

Like whether neural networks trained on user data should be un-copyrightable, and public domain by default.

Has anyone tried adding OpenEars to their app, to prevent having to send things over the internet from e.g. a basement? Is it any good at recognizing basic speech?

Pretty bad except for keyword spotting.

In the sign-up form they state that "Note that each audio request is limited to 2 minutes in length." Does anyone know what "audio request" is? Does it mean that it's limited to 2 minutes when doing real-time recognition, or just that longer periods will count as more "audio requests" and result in a higher bill?
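Whatever the billing semantics turn out to be, one defensive approach is to split long recordings into sub-2-minute chunks client-side before sending them. A minimal sketch using only Python's stdlib `wave` module (the chunk-naming scheme is made up):

```python
import wave

MAX_SECONDS = 120  # the stated per-request limit

def split_wav(path, max_seconds=MAX_SECONDS):
    """Split a WAV file into chunks no longer than max_seconds each.

    Returns the list of chunk filenames written.
    """
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * max_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            # Hypothetical naming: original path plus a part suffix.
            out = f"{path}.part{index}.wav"
            with wave.open(out, "wb") as dst:
                dst.setparams(params)  # nframes is fixed up on close
                dst.writeframes(frames)
            out_paths.append(out)
            index += 1
    return out_paths
```

Each resulting chunk can then be submitted as its own request.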

Do they provide a way to send audio via WebRTC or WebSocket from a browser?

Nice. But what I want is open-source speech recognition.

Like most machine-learning applications, the source code isn't the interesting part, the data is. Google started by training on millions of phrases from Google 411, and then they've been able to continue training anytime someone issues a voice command to an Android device. They have orders of magnitude more data than you could fit into a GitHub repository.

Couldn't you just download subtitles for old movies and train using those?

There aren't many old movies (in the public domain) that are subtitled and/or have good audio... However, in the U.S., we have a huge amount of government audio/video that is captioned. Here's an example of using youtube-dl to download one of President Obama's speeches and extract the closed captions (which I believe are produced by the government) into its own VTT file:
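The command itself didn't survive in the thread, but a sketch of that kind of youtube-dl invocation looks like this (the video URL is a placeholder):

```shell
# Download only the English closed captions as a WebVTT file,
# skipping the video itself.
youtube-dl --write-sub --sub-lang en --sub-format vtt --skip-download \
  "https://www.youtube.com/watch?v=VIDEO_ID"
```

The caption file lands next to where the video would have gone, named after the video title with a `.en.vtt` extension.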


Congress has video for all of its sessions and it is transcribed. So does the Supreme Court (though not timestamped).

Doesn't even need to be old movies. Certain types of video content in the US are legally required to have subtitles (e.g. a lot of YouTube content). You could programmatically download them and use that as your training set. And, since it is a transformative work, you can freely train your models even on copyrighted works.

Much of the YouTube content has auto-generated subtitles, i.e. Google is running their speech-recognition software on the audio stream and then using that to caption the video. If you used that as your training set, you'd effectively be training on the output of an AI. Which is kind of a clever way to get information from Google into your open-source library, but will necessarily be lower-fidelity than just using the Google API directly.

In the US, if it's ever been played out on broadcast TV then it must have Closed Captions.

This is enforced by the FCC [0], but as more and more "internet" content gets consumed I imagine the same regulations will eventually come, at which point you've got a fantastic training set.

0: https://www.fcc.gov/node/23883

That's actually a pretty good idea. Lyrics for rap music might also be a good training data set. It'd bias strongly toward English, though, and particularly American English. I suspect the size of the resulting data set is also quite a bit smaller than what Google has.

This paper from 2014 is a comparison of 5 different open source speech recognizers: http://suendermann.com/su/pdf/oasis2014.pdf

The Word Error Rates (lower is better) for each recognizer on two different corpora, VM1 and WSJ1:

  Recognizer         VM1   WSJ1
  HDecode v3.4.1    22.9   19.8
  Julius v4.3       27.2   23.1
  pocketsphinx v0.8 23.9   21.4
  Sphinx-4          26.9   22.7
  Kaldi             12.7    6.5

At least offer a self hosted version. Maybe it's just me, but I'm not comfortable sending every spoken word to Google.

Exactly this.

Especially because Google's version is trained with illegally obtained user data (no, changing your ToS doesn't allow you to use previously collected data for new purposes in the EU).

We, as a society, should discuss whether software trained on user data should be required to be available to those who provided that data, and whether any copyright can even exist for software developed by training neural networks, or if it's by definition public domain.

Software trained with neural networks provides an immense advantage for existing monopolies, and makes it extremely hard for competitors.

If this trend continues, startups will become impossible for software that depends on being trained with huge datasets.

I thought I read open source, then I realized open access. I believe in the past there was a similar API, or maybe it was based on Google Translate. But I swear at one point people wrote hackathon projects using some voice APIs.

Nice! Curious how it compares to Amazon's AVS, which went public this week.


I think this more directly competes with the IBM Watson speech API, not Nuance?

Why? This is almost exactly what Nuance provides:


I wanted to see if there was good speech recognition and couldn't find anything about Nuance via Google... Now that it's going to be "disrupted", I've found out about it... ironic.

Try Googling "speech recognition api"...

IBM's offering is using the Nuance engine, with Watson using its magic to make more accurate predictions based on context.

IBM split from Nuance a year or so ago; it's now its own engine.

Not sure how many users IBM's speech API has, but Nuance is big in this area (it also offers offline conversion products).

Well NUAN took a 5% price hit after the announcement today, so maybe there's some overlap?

I would be hesitant to build an entire application that relied on this API only to have it removed in a few months or years when Google realizes it sucks up time and resources and makes them no money.

It seems there are a few options. Devs could use an abstraction layer and change providers with very little effort. Also, the way things seem to be going, voice will become more and more prevalent.
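That abstraction layer can be as thin as one vendor-neutral interface plus an adapter per provider. A sketch in Python (the provider classes are hypothetical stand-ins, not real client libraries):

```python
from abc import ABC, abstractmethod

class Transcriber(ABC):
    """Vendor-neutral interface the rest of the app codes against."""

    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

class GoogleTranscriber(Transcriber):
    """Hypothetical adapter: would call Google's speech API here."""

    def transcribe(self, audio: bytes) -> str:
        raise NotImplementedError("wire up the real client library")

class EchoTranscriber(Transcriber):
    """Test double: pretends every clip says the same thing."""

    def __init__(self, canned: str):
        self.canned = canned

    def transcribe(self, audio: bytes) -> str:
        return self.canned

def caption(audio: bytes, engine: Transcriber) -> str:
    # App code depends only on the interface, so swapping providers
    # means changing one constructor call, not every call site.
    return engine.transcribe(audio)
```

If Google ever pulled the API, you'd write one new adapter instead of rewriting the application.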

Cool. Next up is a way to tweak the speech API to recognize patterns in stocks and capex... wasn't that what Renaissance Technologies did?

Really, Google should democratize quant stuff next... DIY hedge fund algos.

I'm reading about many libraries here; I wonder what's the best open, multi-platform software for speech recognition to code with vim, Atom, etc. I've only seen a hybrid system working with Dragon + Python on Windows. I would like to train/customize my own system since I'm starting to have pain in my tendons and wrists. Do you think this Google API can manage it? Not being local looks like a limiting factor for speed/lag.

Nothing out of the box as far as I know. You'll have to DIY.

Have a look at CMUSphinx/Pocketsphinx [1]. I wrote a comment about training it for command recognition in a previous discussion[2].

It supports BNF-grammar-based training too [3], so I have a vague idea that it may be possible to use your programming language's BNF to make it recognize language tokens. I haven't tried this out though.

Either way, be prepared to spend some time on the training. That's the hardest part with sphinx.

Also, have you seen this talk for doing it on Unix/Mac [4]? He does use natlink and dragonfly, but perhaps some concepts can be transferred to sphinx too?

[1]: http://cmusphinx.sourceforge.net/ [2]: https://news.ycombinator.com/item?id=11174762 [3]: http://cmusphinx.sourceforge.net/doc/sphinx4/edu/cmu/sphinx/... [4]: https://www.youtube.com/watch?v=8SkdfdXWYaI

Great, now when will Google let us use the OCR engine they crowdsourced from us over the last decade with reCAPTCHA? Tesseract is mediocre.

Will be interesting to compare with http://www.speechmatics.com

There's no hint of pricing that I could find on their page. Do you know about their pricing model?

If you sign up you can scroll down and see the pricing on https://www.speechmatics.com/account/

What is the difference from a speech recognition API and [NLP libraries](https://opennlp.apache.org/)? This information was not easily found with a few google searches, so I figured others might have the same question.

NLP libraries process written text. This new API processes speech and extracts text from it.

What is the best speech recognition engine, assuming one has no internet?

At this time, CMU Sphinx or Julius are good choices, not great but worth a look.

I hope this opens up some new app possibilities for the Pebble Time. I believe right now they use Nuance and it's very limited to only responding to texts.

I'm not sure what will happen to Google's Web Speech API in the future, or whether it will be continued as a free service.

I think they are pushing back against Amazon's Echo speech APIs, which I have experimented with.

I just applied for early access.

Fuck. Yes. IBM has a similar API as well as part of their Watson APIs but I really wanted to use Google's.

Sounds like this is bad news for Nuance.

Finally! This is something that will be the main means of communication in the future.

Anybody got the API docs yet? I wonder if I can stream from Chrome via WebRTC.

Can't you just use the Web Speech API for that? https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.htm...

I want to use the Cloud Speech API for various reasons.

How well does this work with conversational speech? Any benchmarks?

So this was very, very exciting until I realized you have to be using Google Cloud Platform to sign up for the preview. Unfortunately all of my stuff is in AWS; I could move it over, but I'm not going to (far too much hassle to preview an API I may not end up using, ultimately).

Regardless, this is still very exciting. I haven't found anything that's as good as Google's voice recognition. I only hope this ends up being cheap and accessible outside of their platform.

I don't think there's any requirement that you use Google Compute Engine in order to use this API. Yes, you sign up for an account, but of course you have to sign up for an account to use it. This API is part of the platform.

Similarly, you can use the Google Translate API without using Compute Engine, App Engine, etc.

Note: I work for Google (but not on any of these products).

Ah that makes sense. It gave me the impression I had to run my services on there. I don't know if I'm the only person to screw that up or not. Thanks for clearing that up!

I think you just need to register for Google Cloud and can then use the service as you want. There is probably an advantage if you upload directly from a Google VM (no costs for uploading from within a Google service, very low latency), but it would surprise me if you had to have a Google VM. I can, after all, upload files to S3 and Google's cloud storage even if I host the application on my own hardware.
