
Google Translate, Now With Voice Input - Kylekramer
http://googlesystem.blogspot.com/2011/04/google-translate-now-with-voice-input.html
======
micheljansen
Might be worth mentioning that the Google Translate app for iOS has had this
feature for a while. It actually works surprisingly well in all the languages
I have tried (even though I speak some of them with a horrible accent). I
suspect they just added a Chrome interface to an existing back-end.

Update: I just noticed that this is actually an implementation of the HTML
Speech Input API (http://lists.w3.org/Archives/Public/public-xg-
htmlspeech/2011Feb/att-0020/api-draft.html), so technically this could appear
on any site. That actually makes things a lot more interesting :)

~~~
MatthewPhillips
It's because Chrome just added support for the Speech Input API:

http://chrome.blogspot.com/2011/03/talking-to-your-computer-with-html5.html

~~~
micheljansen
I just noticed that too; that is of much broader interest :)

------
dstein
I wonder why Google is doing it like this. Sending an audio file of your voice
over the wire seems like unnecessary overhead. From an API perspective I'd
much rather have client-side speech-to-text built into the browser (accessible
via a JavaScript API). I think they're already doing STT client-side in
Android, so what's the hold-up with embedding it in Chrome and letting web
developers go nuts with voice-enabling their web apps?

~~~
tel
I'm not familiar with Android, so I'm not sure what its client-side speech
recognition is capable of, but the short answer is that competent
speech-to-text is currently a very computationally intensive problem.

SR is roughly divided into acoustic and language modeling. The acoustic model
proposes words that might have been said given some chunk of speech and the
language model tells you what the most likely actual word is given what's been
said.

The acoustic model can be solved in a large number of ways (production
technologies use very large hidden Markov models), but decoding a word
sequence from speech might scale like O(knm^2), with n the size of your
vocabulary (often _large_), m the complexity of the acoustic model (the
number of phonemes modelled, perhaps), and k the number of acoustic frames.
The n at least can be parallelized (embarrassingly so), but the m and k
cannot.

The language model involves a search through an exponential space of
orderings of words in the vocabulary (n^l choices, where the sentence length
l is also unknown). Anything sophisticated will also have an incredibly large
(in-memory) model, since it has to keep parameters across words, pairs of
words, triples of words, grammatical categories, topics, etc.
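A minimal bigram model shows why the parameter count blows up: the score of a
sentence factors into per-word conditionals P(w_i | w_{i-1}), so the model
needs a count for every observed word pair (and trigram models need every
triple). The two-sentence corpus and add-alpha smoothing below are purely
illustrative, assuming nothing about any production system.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigram contexts and bigram pairs, with sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])             # denominator contexts
        bigrams.update(zip(words, words[1:]))   # one parameter per pair
    return unigrams, bigrams

def score(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed log-probability of a sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((bigrams[(w1, w2)] + alpha) /
                        (unigrams[w1] + alpha * vocab_size))
               for w1, w2 in zip(words, words[1:]))

corpus = ["recognize speech", "wreck a nice beach"]
uni, bi = train_bigram(corpus)
vocab = {w for s in corpus for w in s.split()} | {"<s>", "</s>"}

# Acoustically similar candidates get different language-model scores:
s1 = score("recognize speech", uni, bi, len(vocab))
s2 = score("wreck a nice speech", uni, bi, len(vocab))
```

This is exactly the disambiguation job the language model does after the
acoustic model has proposed its candidate words.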

Solving both of these problems well simultaneously is not a task for a
consumer computer. Speedy algorithms with small vocabularies and simple
models exist and are implemented (Dragon NaturallySpeaking, for instance),
but Google didn't go and record a million hours of GOOG-411 calls just to
reproduce Dragon's technology.

---

Finally, there's a lot of work on front-end signal processing in SR. Before
you get into acoustic and language modeling, you often transform your input
into another representation (often spectral components from sliding 10ms
frames). A growing camp in SR research, though, focuses on finding _sparse_
front-end representations of speech. If the client-side software could
quickly compute a sparse representation of the speech, that could
dramatically reduce the latency and bandwidth issues.
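For a sense of what that front end does, here is a bare-bones sketch: slide a
10ms window over audio sampled at 16kHz and take the magnitude spectrum of
each frame. This is a stdlib-only illustration of the idea (a naive DFT, no
windowing function, no overlap, no mel filterbank), not what any production
recognizer actually ships.

```python
import cmath, math

def dft_magnitudes(frame):
    """Naive DFT magnitude spectrum (first n/2 bins) of one frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2)]

def spectral_frames(samples, rate=16000, frame_ms=10):
    """Cut audio into 10 ms frames and return each frame's spectrum."""
    size = rate * frame_ms // 1000          # 160 samples per 10 ms at 16 kHz
    return [dft_magnitudes(samples[i:i + size])
            for i in range(0, len(samples) - size + 1, size)]

# 20 ms of a 1 kHz sine wave -> two frames, each peaking in the 1 kHz bin
# (bins are spaced 100 Hz apart here, so the peak lands in bin 10).
samples = [math.sin(2 * math.pi * 1000 * t / 16000) for t in range(320)]
frames = spectral_frames(samples)
```

A sparse front end would instead try to describe each frame with only a
handful of active components, which is what makes the bandwidth savings
plausible.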

~~~
tel
By the way, the reason Google is willing to spend this computational effort
for your convenience is probably the same as with GOOG-411: they are
definitely recording every translation they do, to use later as a huge
training corpus.

 _The best data is more data._

------
MatthewPhillips
What's interesting to me about the Speech Input API is that most browser
vendors don't have access to the underlying technology needed to make the API
useful. Microsoft has the tech, and Apple has it a little bit, but Mozilla
and Opera certainly do not. They _might_ be able to afford providing it
through third-party access, but anyone smaller than that (read: community
browsers) isn't going to be able to implement this.

It's the first HTML standard that I can think of where this is true.

