Wow, this is pretty intriguing and may actually be a solid differentiator in the browser space since it probably requires a considerable stored database of speech samples coupled with a decent back end server farm to do it effectively. Hard for the other players to replicate. Clever move Google!
I wonder if they will add this as a standard feature for any text field at some point? It's probably not going to get much sunlight if it requires a chrome-specific attribute on the field.
Not really. I mean, if Google chose to let them, then sure. Otherwise it wouldn't be that hard to have Chrome (which is based on an open-source project but is not itself open source) send an encrypted key in each packet ensuring it came from a Chrome browser.
That said both Microsoft and Apple have voice recognition engines built into their client OS which seems like a much better option latency wise.
I was speaking in the theoretical. Honestly I think Google would be overjoyed if other browsers decided to make themselves reliant on Google's server farm to function. I think that's why the issue isn't really addressed in the proposed standard...
The algorithm has to be seeded with a secret (stream ciphers work this way). You can't get around having Chrome hold a secret, and it's really hard to protect a secret when you're in a hostile environment (e.g. the user's PC). This is part of what makes DRM really hard.
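To make that concrete, here's a toy (deliberately insecure) keystream sketch. The point is that encryption and "authenticity" both derive from one shared seed, so whoever extracts that seed from the client binary can forge packets that look like they came from Chrome. Everything here is illustrative; xorshift32 just stands in for a real stream cipher's state update.

```javascript
// Toy keystream cipher -- NOT a real cipher, just an illustration of
// why a client-held secret is the weak point.
function keystream(seed, length) {
  const out = [];
  let state = seed >>> 0;
  for (let i = 0; i < length; i++) {
    // xorshift32 PRNG, standing in for a real stream cipher's state update
    state ^= state << 13; state >>>= 0;
    state ^= state >>> 17;
    state ^= state << 5;  state >>>= 0;
    out.push(state & 0xff);
  }
  return out;
}

function xorWithKeystream(bytes, seed) {
  const ks = keystream(seed, bytes.length);
  return bytes.map((b, i) => b ^ ks[i]);
}

const secretSeed = 0xC0FFEE;              // baked into the client: the weak point
const message = [72, 101, 108, 108, 111]; // "Hello" as bytes
const ciphertext = xorWithKeystream(message, secretSeed);
const roundTrip = xorWithKeystream(ciphertext, secretSeed); // XOR is symmetric
```

Anyone who pulls `secretSeed` out of the shipped binary can produce the same keystream, which is exactly the DRM problem the comment describes.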
I suspect this slide's covert purpose is to make me (yes, just me) sit here and say 'hello' to my computer like a moron for 3 minutes while seeing no effect whatsoever. In this it has succeeded brilliantly.
Running Chrome, mic is on, no one is home... sigh. I so wanted to be wowed.
It actually does a pretty good job at simple words and sentences. So, jumping in the deep end, I tried "The reflected binary code was originally designed to prevent spurious output from electromechanical switches". Can anyone get it to recognize that? I did manage to get it to respond correctly to every word by itself (sometimes only after a couple of tries), but not the whole thing.
Yeah, though support is pretty limited and, worst of all, the axes aren't standardized, so there need to be branches for basically every sensor/computer configuration to ensure Y+ is up and X+ is right. Coming from iOS land it's exciting until you realize it's going to be tricky to do widespread. Here's hoping it improves soon!
All Apple laptops since 2006 have had a sudden motion sensor to detect when the laptop is falling and lock the hard drive platters in place to prevent damage. Apple exposes this to software basically just because they can, AFAIK. And to support cool features like this.
It appears the new MacBook Air no longer includes the motion sensor, since it was originally designed to protect the hard drive and the new Air doesn't have one.
It's also a little concerning that Google is sending this data to its own servers without warning. GOOG-411 at least warned you it was a collection tool.
Why does this have anything to do with HTML5 -- isn't it up to the UA to determine how best to accept form input? Specifying in the form that a particular field is a "voice recognition" field seems to be encoding presentation details in what should be structure.
I can understand that it's important to mark a particular form field as more "important" than others (and thus more likely that a user would like to use their voice to input text to it), but wouldn't this be better served by semantic markup declaring the field as a "primary" field or some such?
I busted out wireshark to answer my own question. The data is actually getting encoded as speex and being posted to http://www.google.com/speech-api/v1/recognize . Maybe google is about to open up its speech recognition API to the masses?
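Based on that capture, the request presumably looks something like the sketch below. Only the endpoint path comes from the packet trace; the `lang` query parameter, the Speex Content-Type string, and the sample-rate parameter are my guesses about what the browser sends.

```javascript
// Hedged sketch of the speech request Chrome might be making. The URL
// path is from the wireshark capture above; everything else (query
// params, Content-Type details) is assumed, not observed.
function buildRecognizeRequest(lang, sampleRate) {
  const url = 'http://www.google.com/speech-api/v1/recognize' +
              '?lang=' + encodeURIComponent(lang);
  return {
    url,
    method: 'POST',
    headers: {
      // Speex-encoded audio per the capture; the rate parameter is a guess.
      'Content-Type': 'audio/x-speex; rate=' + sampleRate,
    },
    // body: the recorded audio bytes would go here
  };
}

const req = buildRecognizeRequest('en-US', 16000);
```

If Google does open this endpoint up, something shaped like this is probably all a third-party client would need.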
If it's anything like speech recognition in Android, it's all server-side, so this should be technically possible in any browser, as long as Google allows it.
You can't see it in the Chrome Developer Tools/Network UI, but the browser is sending the recorded audio input to a webserver. Unfortunately it's HTTPS, so a bit hard to decode the URL/content of the message without using a proxy.
Indeed. I hope they did it to avoid accidentally offending people by falsely recognizing a swear word. If so, they would probably be better off taking the nearest non-swear neighbor word.
I tried your sentence "The owl and the pussycat went to sea in a beautiful pea green boat" and got "seattle hookers gatwick to see a beautiful pizza ri boat". You win.
Speech recognition has steadily improved over the past 10 years. You can see a few examples: Android's system-wide speech input, the Siri iPhone app (which Apple bought), Dragon NaturallySpeaking advertising upwards of 99% accuracy for general-purpose dictation, and the latest speech recognition IVRs do pretty darn well; try calling Amtrak or United. If you try to dick around with it, have a strong accent, or are in a noisy environment, you won't get (as) good results. However, as a whole, it's greatly improved.
This is exactly like the speech recognition on Android. It works brilliantly with short phrases that also happen to be popular searches on Google (or Google Voice Search) but fails at longer or obscure sentences. It's all about the data, baby.
I use Voice Search heavily on my Desire, but I prefer to type out my communications because of this exact limitation.
That is awesome, works even for German without a problem. I couldn’t get it to recognize an English sentence properly (which probably only means that my English pronunciation is horrible). I’m wondering, however, how they manage to recognize the language in the three word sentences I tried.
A lot of it is statistical inference. I've run into weird glitches where it chooses the completely wrong word that still fits. For example I used a sentence that ended with "cool!" but it transcribed it to "excellent!"
Obviously, "excellent" sounds nothing like "cool" but the sentence still worked because it was using the neighboring words to try and guess what should go there.
If speech recognition can with some accuracy identify the language of speech after only three words it already exceeds my own capabilities. Whenever I truly don’t know which language someone is going to talk to me I nearly always need more than three words to orient myself. That’s why I’m so impressed.
What version of chrome does this work on?
Either I'm missing something or on an older version of chromium: Chromium 5.0.375.127 (Developer Build 55887) Ubuntu 10.04.
I believe it's available in later 7 builds and all dev channel builds after that. Note that it is not yet available in the beta or stable channel builds (since we're still perfecting it).
Also, I really recommend you upgrade your Chromium version! I believe security updates are only back-ported to the current stable release, which means you haven't gotten any such updates for a while! (Stable is now at 8.)
Right now it turns any text input into a speech input, but I might change that later. Or, at least, have an option to disable it on certain sites (have never created a chrome extension, no clue how long that'd take).
<input x-webkit-speech> works for me, so you may want to just apply the attribute to all input elements regardless of type (and textarea elements for future support). I suspect that support for voice input on input types such as date may also be added eventually.
Edit: More at the HTML5 speech input proposal at https://docs.google.com/View?id=dcfg79pz_5dhnp23f5#y1f9 . It's apparent from this that you should use the attribute on select elements too. I can also get x-webkit-speech working in current stable Chrome with an input type of speech.
Don't forget select elements too. An easier way to do your whole extension could be an XPath expression or `document.querySelectorAll('textarea, select, input:not([type="' + notAllowed.join('"]):not([type="') + '"])')` (which results in `textarea, select, input:not([type="checkbox"]):not([type="radio"]):not([type="file"]):not([type="submit"]):not([type="image"]):not([type="reset"]):not([type="button"])`). Also, may I ask why you're abstracting Array.indexOf away and extending the Array prototype with a non-standard method for such a simple problem?
HTML5 is a lot more than an AJAX/DHTML "rehash". And even if it weren't more than that, you say it like that's a bad thing... it's not! None of that was ever properly standardized, so building a markup language that's standardized across all implementations is extremely valuable. Not to mention, there are numerous important features of HTML5 that you're not thinking about.
Geolocation is extremely powerful for the mobile world, which is growing faster than anything else. A web page can - with your permission - read your GPS coordinates and provide you with location-aware information.
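The API surface for that is tiny. Here's a minimal sketch; the formatting helper is a pure function of my own invention so it runs anywhere, while `getCurrentPosition` itself is browser-only and always asks the user for permission first:

```javascript
// Pure helper (hypothetical, for illustration): turn the position object
// the browser hands back into a display string.
function describePosition(pos) {
  const { latitude, longitude } = pos.coords;
  return 'lat ' + latitude.toFixed(4) + ', lon ' + longitude.toFixed(4);
}

// In a browser, with user consent:
// navigator.geolocation.getCurrentPosition(
//   pos => console.log(describePosition(pos)),
//   err => console.error('denied or unavailable:', err.message));

// Mock object mirroring the shape the browser passes to the callback:
const mock = { coords: { latitude: 37.4219, longitude: -122.084 } };
```
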
The combination of WebSockets, Web Workers, and local storage lets devs build sites that more closely resemble applications. This is done in an easy, standards-focused way, not the hacked-together way "DHTML" sites were built.
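As a tiny illustration of the local-storage half of that: in a real page you'd use `window.localStorage` directly; here a minimal stand-in object (my own, so the snippet runs outside a browser) mimics its string-only `setItem`/`getItem` contract:

```javascript
// Stand-in for window.localStorage, matching its string-only contract:
// values are coerced to strings, and getItem returns null for misses.
const storage = (() => {
  const m = new Map();
  return {
    setItem: (k, v) => m.set(k, String(v)),
    getItem: (k) => (m.has(k) ? m.get(k) : null),
  };
})();

// Web Storage only holds strings, so structured state is JSON-encoded.
storage.setItem('draft', JSON.stringify({ text: 'hello', savedAt: 0 }));
const draft = JSON.parse(storage.getItem('draft'));
```

That JSON round-trip is the usual pattern for persisting app state across page loads without any server involvement.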
Come on... multithreaded JavaScript with a socket interface and local storage? Add in location-aware information on top of that!
HTML5 really is much more than a rehashing of current web tech.