HTML5 Speech Recognition (in Chrome) (html5rocks.com)
83 points by philfreo on Dec 5, 2010 | 72 comments

Wow, this is pretty intriguing and may actually be a solid differentiator in the browser space since it probably requires a considerable stored database of speech samples coupled with a decent back end server farm to do it effectively. Hard for the other players to replicate. Clever move Google!

I wonder if they will add this as a standard feature for any text field at some point? It's probably not going to get much sunlight if it requires a chrome-specific attribute on the field.

Couldn't the other browsers yank the code from Chrome and use Google's servers too?

Not really. I mean, if Google chose to let them, then sure. Otherwise it wouldn't be that hard to have Chrome (which is based on an open source project but is not itself open source) send an encrypted key in each packet to prove it came from a Chrome browser.

That said, both Microsoft and Apple have voice recognition engines built into their client OSes, which seems like a much better option latency-wise.

If they're going to support it in Chromium, they'd have to support it for all browsers, else they'd get bad PR, as Chromium is open source.

I was speaking in the theoretical. Honestly I think Google would be overjoyed if other browsers decided to make themselves reliant on Google's server farm to function. I think that's why the issue isn't really addressed in the proposed standard...


And other browsers could just extract the key and ship it as well.

I'm assuming there would be some kind of algorithm to generate the encrypted key.

The algorithm has to be seeded with a secret (stream ciphers work this way). You can't get around Chrome having to hold a secret, and it's really hard to protect a secret when you're in a hostile environment (e.g. the user's PC). This is part of what makes DRM really hard.
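
To make the point above concrete, here is a minimal sketch of why a secret shipped inside the client does not identify the client. Everything in it is an assumption for illustration: the secret value, the function names, and the tag function, which is a toy keyed checksum (FNV-1a) standing in for a real MAC. Nothing here reflects how Chrome actually talks to Google's servers.

```javascript
// Toy keyed checksum (FNV-1a over secret + message). NOT cryptographically
// secure; a real scheme would use a proper MAC. It only shows where the
// secret lives and who can use it.
function toyTag(secret, message) {
  let hash = 2166136261;
  const input = secret + "|" + message;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 16777619) >>> 0;
  }
  return hash.toString(16);
}

// Hypothetical secret that would have to ship inside the browser binary.
const EMBEDDED_SECRET = "hypothetical-client-secret";

// What the official client would do before each request.
function signAsOfficialClient(payload) {
  return { payload: payload, tag: toyTag(EMBEDDED_SECRET, payload) };
}

// What the server would do to check a request.
function serverAccepts(request) {
  return request.tag === toyTag(EMBEDDED_SECRET, request.payload);
}

// A third party that extracted the secret from the binary can do the same.
function signAsThirdParty(extractedSecret, payload) {
  return { payload: payload, tag: toyTag(extractedSecret, payload) };
}

const forged = signAsThirdParty(EMBEDDED_SECRET, "audio-chunk-1");
console.log(serverAccepts(forged)); // true: the server cannot tell the clients apart
```

Swapping the toy checksum for a real MAC would not change the outcome, because the weakness is the location of the secret, not the strength of the algorithm.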

And get sued for doing so.

It might not actually be that difficult to build some level of speech recognition into the browser.

See http://cmusphinx.sourceforge.net/

I suspect this slide's covert purpose is to make me (yes, just me) sit here and say 'hello' to my computer like a moron for 3 minutes while seeing no effect whatsoever. In this it has succeeded brilliantly.

Running Chrome, mic is on, no one is home... sigh. I so wanted to be wowed.

You need to click the little microphone icon - sorry if you've already tried this.

Interesting... no Mic icon here. Just a text box centered in the slide. Both in Chrome 7.0.517.44 and Safari 5.0.2. The mystery deepens.

Edit: Aha! Update to 8.0.552 and presto... Mic icon. Very slick.

Edit again: "Hello" ---> Lowes. "Hello Mr. Webpage" ---> homeless services "This is a test" ---> test

Probably a crappy notebook mic to blame. Ah well. Recognition on my Nexus One is quite solid so I can't blame Google's algorithms.

Where does this icon appear? (It's not obvious in the latest Chrome or Safari on MacOS.)

The icon appears in the right of the input box. Appears perfectly on Chrome 8.0.552.215 which my Mac shows is up-to-date.

Thanks, restarting Chrome got me to that version, and then the icon appeared.

(It seems this feature must have just been pushed; it was less than a week ago when I last restarted.)

It actually does a pretty good job at simple words and sentences. So, jumping in the deep end, I tried "The reflected binary code was originally designed to prevent spurious output from electromechanical switches". Can anyone get it to recognize that? I did manage to get it to respond correctly to every word by itself (sometimes only after a couple of tries), but not the whole thing.

(non-native speaker)

Californian here, and it was really close on my first try:

the reflected binary code was originally designed to prevent spurious output from electrode mechanical switches

I'm kinda blown away. Here's an mp3 of what I sound like, if anyone is curious: http://cl.ly/3WDv

Thanks for this. From now on Chrome will be the judge of how clearly and accent free I speak :)

"reflected by originally designed the spirit outlet switches"

"side by side spirit album auto mechanic"

"reflective vinyl ether design factory outlet switches"

I'm from New Jersey.

Varies from poor to amazing.

"I have met Jesus, he was a nice guy" -> "ice melt cheese"

"hacker news is amazing" -> "hacker news"

"are you afraid of santa claus?" 100% correct

"if a woodchuck could chuck wood how much wood would a woodchuck chuck" 100% correct

Did anybody check out the slide before this, device orientation? http://slides.html5rocks.com/#slide23

That's pretty awesome too, I could see this being great for mobile web apps, especially games.

Yeah, though support is pretty limited and, worst of all, the axes are not standard, so there need to be branches for basically every sensor/computer configuration to ensure Y+ is up and X+ is right. Coming from iOS land it's exciting until you realize it's going to be tricky to do widely. Here's to hoping it improves soon though!
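
One way to contain the per-configuration branching described above is a lookup table of axis corrections, sketched below. The profile names and their sign and axis choices are hypothetical placeholders, not measured values for any real device. The `beta` and `gamma` fields are the tilt angles carried by the `deviceorientation` event.

```javascript
// Hypothetical per-configuration corrections: which raw angle feeds each
// screen axis, and with which sign. Real values would have to be measured.
const AXIS_PROFILES = {
  standard: { x: { from: "gamma", sign: 1 }, y: { from: "beta", sign: 1 } },
  flippedY: { x: { from: "gamma", sign: 1 }, y: { from: "beta", sign: -1 } },
  swapped: { x: { from: "beta", sign: 1 }, y: { from: "gamma", sign: -1 } },
};

// Map a raw reading ({ beta, gamma }) to { x, y } so that application code
// can assume X+ is right and Y+ is up, whatever the hardware reports.
function normalizeTilt(reading, profileName) {
  const profile = AXIS_PROFILES[profileName] || AXIS_PROFILES.standard;
  return {
    x: profile.x.sign * reading[profile.x.from],
    y: profile.y.sign * reading[profile.y.from],
  };
}

// In a browser, the event object itself can serve as the reading:
//   window.addEventListener("deviceorientation", function (event) {
//     const tilt = normalizeTilt(event, "standard");
//   });
console.log(normalizeTilt({ beta: 10, gamma: -5 }, "flippedY")); // { x: -5, y: -10 }
```

The table keeps the branching in data rather than in code, but choosing the right profile still needs some form of device detection, which is the hard part the comment is pointing at.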

This worked great on my aging macbook. That's just amazing, I had no idea it came with sensors to support such functionality.

All Apple laptops since 2006 have had a sudden motion sensor to detect when the laptop is falling and lock the hard drive platters in place to prevent damage. Apple exposes this to software basically just because they can, AFAIK. And to support cool features like this.


It appears the new MacBook Air no longer includes the motion sensor, since the sensor was originally designed to protect the hard drive and the new Air has no hard drive.

Thanks for the link and an explanation!

Pity the Android browser ignores URL fragments.

How are they accessing my laptop mic? Is it the Google voice plugin?

Shouldn't the browser ask for permission before allowing access?

Because it's a Chrome feature? x-webkit-speech

It's also a little concerning that Google is sending this data to its own servers without warning. GOOG-411 at least warned you it was a collection tool.

They're treating a click on the microphone icon as permission. (You did have to click the icon, didn't you?)

Hm, I wonder if it's susceptible to clickjacking

Good point.

This could easily be used for spying.

Next up wikileaks will use this when government IPs are discovered on the site ;p

Why does this have anything to do with HTML5 -- isn't it up to the UA to determine how best to accept form input? Specifying in the form that a particular field is a "voice recognition" field seems to be encoding presentation details in what should be structure.

I can understand that it's important to mark a particular form field as more "important" than others (and thus more likely that a user would like to use their voice to input text to it), but wouldn't this be better served by semantic markup declaring the field as a "primary" field or some such?

Is there a speech recognition engine built in to chrome this is leveraging?

I busted out wireshark to answer my own question. The data is actually getting encoded as speex and being posted to http://www.google.com/speech-api/v1/recognize . Maybe google is about to open up its speech recognition API to the masses?

They are doing POST requests to their own servers for recognition.

reference: http://src.chromium.org/viewvc/chrome/trunk/src/chrome/brows...

api url: https://www.google.com/speech-api/v1/recognize

If it's anything like speech recognition in Android, it's all server-side, so this should be technically possible in any browser as long as Google allows it.

from unscientifically looking at network sent packet counts, it's definitely phoning home for the answer.

You can't see it in the Chrome Developer Tools/Network UI, but the browser is sending the recorded audio input to a webserver. Unfortunately it's HTTPS, so a bit hard to decode the URL/content of the message without using a proxy.

That does not specify where the recognition is happening.

I whipped up a Chrome extension for voice search if anyone is interested:



Aw, curse words are censored? That's pretty ####### lame.

Indeed. I hope they did it to avoid accidentally offending people by falsely recognizing a swear word. If so, they would probably be better off taking the non-swear neighbor word.
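
The suggestion above could look something like the sketch below, assuming the recognizer produced a ranked list of alternative transcriptions. That list, the block list, and the function name are all hypothetical. The sketch also works on whole hypotheses rather than single neighboring words, which is a simplification of what the comment proposes.

```javascript
// Hypothetical block list; a mild stand-in word is used here.
const BLOCKED = new Set(["darn"]);

function containsBlocked(text) {
  return text.toLowerCase().split(/\s+/).some(function (word) {
    return BLOCKED.has(word);
  });
}

// nBest is assumed to be sorted from most to least confident.
function firstCleanHypothesis(nBest) {
  if (nBest.length === 0) return "";
  for (const hypothesis of nBest) {
    if (!containsBlocked(hypothesis.text)) return hypothesis.text;
  }
  // Every alternative contains a blocked word: fall back to masking the top one.
  return nBest[0].text
    .split(/\s+/)
    .map(function (word) {
      return BLOCKED.has(word.toLowerCase()) ? "#".repeat(word.length) : word;
    })
    .join(" ");
}

console.log(firstCleanHypothesis([
  { text: "that is darn lame" },
  { text: "that is barn lame" },
])); // "that is barn lame"
```

The trade-off is that a speaker who really did swear gets a silently wrong transcription instead of a visibly censored one.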

I thought speech recognition was still very inaccurate & hasn't improved much in the last 5-10 years. Has it suddenly become usable?

Not for me in this case. Every sentence I tried was mangled in the traditional way:

Once upon a time in america -> ants on a time in america

The owl and the pussycat went to sea in a beautiful pea green boat. -> the owl and pussycat when to see in a beautiful p cream

Google is not evil -> google is evil

I'm not joking about that last one.

I tried your sentence "The owl and the pussycat went to sea in a beautiful pea green boat" and got "seattle hookers gatwick to see a beautiful pizza ri boat". You win.

Actually I think you win!

Speech recognition has steadily improved over the past 10 years. You can see a few examples: Android system wide speech input, Siri iPhone app (which Apple bought), Dragon Naturally Speaking advertises upwards of 99% accuracy for general purpose dictation, and the latest speech recognition IVRs do pretty darn well; try calling Amtrak or United. If you try to dick around with it, have a strong accent, or are in a noisy environment, you won't get (as) good results. However, as a whole, it's greatly improved.

This is exactly like the speech recognition on Android. It works brilliantly with short phrases that also happen to be popular searches on Google (or Google Voice Search) but fails at longer or obscure sentences. It's all about the data, baby.

I use Voice Search heavily on my Desire, but I prefer to type out my communications because of this exact limitation.

That is awesome, works even for German without a problem. I couldn’t get it to recognize an English sentence properly (which probably only means that my English pronunciation is horrible). I’m wondering, however, how they manage to recognize the language in the three word sentences I tried.

A lot of it is statistical inference. I've run into weird glitches where it chooses the completely wrong word that still fits. For example I used a sentence that ended with "cool!" but it transcribed it to "excellent!"

Obviously, "excellent" sounds nothing like "cool" but the sentence still worked because it was using the neighboring words to try and guess what should go there.
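
A rough sketch of how neighboring words can override the audio: each candidate word gets an acoustic score (how well it matches the sound) and a language model score (how likely it is after the previous word), and the two are combined. All numbers, names, and the bigram table below are made up for illustration. This is not a description of Google's recognizer.

```javascript
// Pick the candidate with the best combined log score.
// acoustic:   assumed probability that the audio matches the word
// bigramProb: assumed probability of the word given the previous word
// lmWeight:   how much the language model counts relative to the audio
function pickWord(previousWord, candidates, bigramProb, lmWeight) {
  let best = null;
  let bestScore = -Infinity;
  for (const candidate of candidates) {
    const pair = previousWord + " " + candidate.word;
    const lm = bigramProb[pair] || 1e-6; // small floor for unseen pairs
    const score = Math.log(candidate.acoustic) + lmWeight * Math.log(lm);
    if (score > bestScore) {
      bestScore = score;
      best = candidate.word;
    }
  }
  return best;
}

const candidates = [
  { word: "cool", acoustic: 0.6 },
  { word: "excellent", acoustic: 0.3 },
];
const bigramProb = { "is cool": 0.01, "is excellent": 0.2 };

console.log(pickWord("is", candidates, bigramProb, 1.0)); // "excellent"
console.log(pickWord("is", candidates, bigramProb, 0.0)); // "cool"
```

With the language model switched off (`lmWeight` of 0) the better acoustic match wins. With it on, the word that fits the sentence wins even though it matches the audio less well.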

If speech recognition can with some accuracy identify the language of speech after only three words it already exceeds my own capabilities. Whenever I truly don’t know which language someone is going to talk to me I nearly always need more than three words to orient myself. That’s why I’m so impressed.

It was rather good, but not even close to being reliable enough for anything practical. It felt a bit like this http://www.youtube.com/watch?v=5FFRoYhTJQQ

What version of chrome does this work on? Either I'm missing something or on an older version of chromium: Chromium 5.0.375.127 (Developer Build 55887) Ubuntu 10.04.

I believe it's available in later 7 builds and all dev channel builds after that. Note that it is not yet available in the beta or stable channel builds (since we're still perfecting it).

Also, I really recommend you upgrade your Chromium version! I believe security updates are only back-ported to the current stable release, which means you haven't gotten any such updates for a while! (Stable is now at 8.)

Works on Mac OS with Chromium 10.0.601.0 (68155)

Chrome 8.0.552.215 also works on Mac.

Two things I would want upon seeing this.

1. Chrome extension to use speech recognition in every text box.

2. Speech recognition inside the google apps: Gmail, etc.


Right now it turns any text input into a speech input, but I might change that later. Or, at least, have an option to disable it on certain sites (have never created a chrome extension, no clue how long that'd take).

<input x-webkit-speech> works for me, so you may want to just apply the attribute to all input elements regardless of type (and textarea elements for future support). I suspect that support for voice input on input types such as date may also be added eventually.

Edit: More at the HTML5 speech input proposal at https://docs.google.com/View?id=dcfg79pz_5dhnp23f5#y1f9 . It's apparent from this that you should also use the attribute on select elements too. I also can get x-webkit-speech working in current stable Chrome with an input type of speech.

Thanks for the link. I've gone ahead and made it work on any input field and textarea that isn't in the not-allowed list.

Don't forget select elements too. An easier way for you to do your whole extension could be to use an XPath expression or `document.querySelectorAll('textarea, select, input:not([type="' + notAllowed.join('"]):not([type="') + '"])')` (results in `textarea, select, input:not([type="checkbox"]):not([type="radio"]):not([type="file"]):not([type="submit"]):not([type="image"]):not([type="reset"]):not([type="button"])`). Also, may I ask why are you abstracting Array.indexOf away and extending the Array prototype with a non-standard method for such a simple problem?
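
The selector trick above can be pulled into a small helper, sketched here. The function names are hypothetical, and the extension's real code is not shown in the thread. `buildSpeechSelector` builds one `:not(...)` clause per excluded type, and `enableSpeechInput` applies the `x-webkit-speech` attribute to every match.

```javascript
// Input types that do not take free text, as listed in the comment above.
const notAllowed = ["checkbox", "radio", "file", "submit", "image", "reset", "button"];

// Build "textarea, select, input:not([type=...])..." from the excluded types.
function buildSpeechSelector(excludedTypes) {
  const exclusions = excludedTypes
    .map(function (type) {
      return ':not([type="' + type + '"])';
    })
    .join("");
  return "textarea, select, input" + exclusions;
}

// Browser only: add the attribute to every matching element under `root`.
function enableSpeechInput(root) {
  const elements = root.querySelectorAll(buildSpeechSelector(notAllowed));
  for (let i = 0; i < elements.length; i++) {
    elements[i].setAttribute("x-webkit-speech", "");
  }
  return elements.length;
}

console.log(buildSpeechSelector(["checkbox", "radio"]));
// textarea, select, input:not([type="checkbox"]):not([type="radio"])
```

Mapping each type to its own clause, rather than joining with a separator and wrapping the ends, also gives a sensible selector (`textarea, select, input`) when the excluded list is empty.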

Your line of code is so long it's causing this Hacker News comments page to have horizontal scrolling! I've never seen that before. :)

In case you didn't know, HN will format code if you prefix it with two spaces like this:

  document.querySelectorAll('textarea, select, input:not([type="' + notAllowed.join('"]):not([type="') + '"])') (results in `textarea, select, input:not([type="checkbox"]):not([type="radio"]):not([type="file"]):not([type="submit"]):not([type="image"]):not([type="reset"]):not([type="button"])`)

It can preserve indentation too:

    'textarea, select, input:not([type="' 
    + notAllowed.join('"]):not([type="') 
    + '"])')

  Also, may I ask why are you abstracting Array.indexOf away and extending the Array prototype with a non-standard method for such a simple problem?

Mostly a personal preference, I guess, haven't looked into the downsides of extending prototypes.

Anywho, I'm fairly new at JS outside of jQuery, so thanks for your input and critiques. I'll look into the rest when I get back from dinner.
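
On the downsides of extending prototypes, one well-known problem is that a method added by plain assignment is enumerable, so it shows up in every `for...in` loop over any array in the same script context. The sketch below demonstrates this. The method name `contains` is a hypothetical stand-in, since the extension's actual helper is not shown in the thread.

```javascript
// Hypothetical helper added the simple way, as an enumerable property.
Array.prototype.contains = function (value) {
  return this.indexOf(value) !== -1;
};

// for...in now reports the helper alongside the real indices.
const leakedKeys = [];
for (const key in ["a", "b"]) {
  leakedKeys.push(key);
}
console.log(leakedKeys); // [ '0', '1', 'contains' ]

// Undo the change and use a standalone function instead.
delete Array.prototype.contains;

function contains(list, value) {
  return list.indexOf(value) !== -1;
}

const cleanKeys = [];
for (const key in ["a", "b"]) {
  cleanKeys.push(key);
}
console.log(cleanKeys); // [ '0', '1' ]
console.log(contains(["a", "b"], "b")); // true
```

A standalone function avoids the issue entirely. If a prototype method is still wanted, `Object.defineProperty` with `enumerable: false` keeps it out of `for...in`, though it can still clash with a future standard method of the same name.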

Shucks, you beat me to it, and with a better name too. Congratulations for being quick on the draw. I'm going to fork it and help work on it.

Heh, I had nothing better to do on a Sunday afternoon. :P

"Hack the planet" -> "Mayo clinic"

You win this round Google.

Wow, this is really good. It recognizes almost everything, and I am not even a native speaker.


HTML5 is a lot more than an AJAX/DHTML "rehash". And even if it weren't more than that, you say it like that's a bad thing....it's not! None of that was ever properly standardized, so building a markup language that's standardized across all implementations is extremely valuable. Not to mention, there are numerous important features of HTML5 that you're not thinking about.

Geolocation is extremely powerful for the mobile world, which is growing faster than anything else. A web page can - with your permission - read your GPS coordinates and provide you with location-aware information.
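
As a minimal sketch of the permission-gated flow described above: the page calls `getCurrentPosition` on `navigator.geolocation`, the browser asks the user, and one of two callbacks fires. The helper names and the output format are assumptions for illustration. The geolocation object is passed in as a parameter so the function can also be exercised outside a browser.

```javascript
// Turn a position object into a short, human-readable string.
function describePosition(position) {
  const coords = position.coords;
  return (
    coords.latitude.toFixed(4) + ", " + coords.longitude.toFixed(4) +
    " (within " + Math.round(coords.accuracy) + " m)"
  );
}

// Ask for the current position and report the outcome as text.
function requestLocation(geolocation, onText) {
  if (!geolocation) {
    onText("Geolocation is not available");
    return;
  }
  geolocation.getCurrentPosition(
    function (position) {
      onText(describePosition(position));
    },
    function (error) {
      // Fires when the user denies permission or no fix can be obtained.
      onText("Location unavailable: " + error.message);
    }
  );
}

// In a browser: requestLocation(navigator.geolocation, console.log);
```

The error callback matters as much as the success one, because a user who declines the permission prompt ends up there.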

The combination of WebSockets, WebWorkers and local storage lets devs build sites that more resemble Applications. This is done in an easy and standards-focused way, and not with the hacked-together way that "DHTML" sites were built.

Come on...multithreaded Javascript with a socket interface and local storage? Added with location-aware information!

HTML5 really is much more than a rehashing of current web tech.
