Well, it is actually very demanding. ASR systems usually run at about 1 RT (RT = real-time factor; 1 RT means recognizing 1 second of speech takes 1 second). Roughly 60-70% of that processing goes to acoustic scoring. The rest is search in a large graph of sub-phonetic units and words, plus feature extraction (which actually takes only a tiny percentage).
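To make the real-time factor concrete, here is a minimal sketch in Python. The function name and the timings are made up for illustration, and the 65% acoustic-scoring share is just a midpoint of the rough 60-70% range above, not a measurement.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent recognizing / duration of the audio (1.0 = as fast as speech)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical example: 10 seconds of speech recognized in 10 seconds.
audio_seconds = 10.0
processing_seconds = 10.0
rtf = real_time_factor(processing_seconds, audio_seconds)

# Assumed split: 65% of the processing time goes to acoustic scoring,
# the remainder to graph search and feature extraction.
acoustic_share = 0.65
acoustic_seconds = processing_seconds * acoustic_share
other_seconds = processing_seconds - acoustic_seconds

print(f"RTF: {rtf:.2f}")
print(f"acoustic scoring: {acoustic_seconds:.1f} s, search + features: {other_seconds:.1f} s")
```

An RTF below 1 means the recognizer keeps up with live speech; above 1, it falls behind.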
Nowadays acoustic scoring is done by large deep neural networks, and they are quite computation intensive. One can use a GPU for that, and indeed it works really fast if you have all the speech beforehand (off-line or batch mode). But for live recognition, GPUs lose quite a bit of their advantage. That is probably why Google worked on quantized vectorization and other tricks to make DNNs fast on CPUs [1].
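As a rough illustration of the quantization idea, here is a generic symmetric 8-bit scheme in plain Python. This is an assumption for the sake of example, not necessarily the method in [1]; the function names and the numbers are hypothetical. It only shows the arithmetic: the speed-up on a real CPU would come from doing the integer multiply-accumulate with SIMD instructions, which this sketch does not do.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization: floats -> signed integers plus one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [int(round(v / scale)) for v in values], scale


def quantized_dot(weights, activations):
    """Dot product computed on the quantized integers, then rescaled back to float."""
    q_weights, weight_scale = quantize(weights)
    q_acts, act_scale = quantize(activations)
    # Integer multiply-accumulate: the part a CPU can vectorize cheaply.
    accumulator = sum(w * a for w, a in zip(q_weights, q_acts))
    return accumulator * weight_scale * act_scale


# Hypothetical weights and activations for a single neuron.
weights = [0.12, -0.5, 0.33, 0.9]
activations = [1.0, 0.25, -0.75, 0.5]

exact = sum(w * a for w, a in zip(weights, activations))
approx = quantized_dot(weights, activations)
print(f"float result: {exact:.4f}, quantized result: {approx:.4f}")
```

The quantized result is close to the float one but not identical, which is the trade-off: a little accuracy for much cheaper arithmetic.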
I am quite sure this puts immense pressure on their servers when tens of thousands of concurrent speech streams are queued for recognition. Perhaps today's GPUs are better in that respect and more work can be delegated to them to ease the pressure. There has also been other interesting work that does almost all of the processing on the GPU [2].
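A back-of-the-envelope sketch of that server pressure: if one stream keeps one core busy for an RTF-sized fraction of each second, the number of busy cores is roughly streams times RTF. The stream count and per-core RTF below are purely hypothetical, and the model ignores batching, GPUs and idle time between utterances.

```python
import math


def cores_needed(concurrent_streams: int, rtf_per_core: float) -> int:
    """Cores kept fully busy if each live stream costs rtf_per_core core-seconds per second."""
    return math.ceil(concurrent_streams * rtf_per_core)


# Hypothetical load: 20,000 live streams, each decoded at 0.5 RT on a single core.
streams = 20_000
rtf = 0.5
print(f"{streams} streams at {rtf} RT need about {cores_needed(streams, rtf)} busy cores")
```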
In short, ASR systems are very, very processing hungry and a challenge for everyone, probably even for Google.
This isn't entirely accurate. Or rather, it is accurate as far as it goes, but doesn't tell the whole story.
Training a neural network uses a lot of computational power. From memory, I think training the Android voice recognition took weeks on Google's GPU cluster ([1] talks about 95 hours for partial training, but I don't think that's the production system).
However, once the network is trained it doesn't use much power at all. The trained network can run on a mobile phone, and it doesn't even drain the battery much.
I was talking about run-time operation, not training. Yes, training DNNs is much more time consuming, but my point is that using them is not cheap either. As mentioned, processing 1 second of speech in, let's say, 0.5 seconds is expensive, considering that a web search is done in sub-millisecond time. Of course, I am assuming speech recognition is done on the server side.
Very interesting links, thanks for sharing. I'm not sure if you're familiar with Android's speech recognition, but it seems to work offline as well. I wonder if they offload the computation to their servers when you're online and compute it locally when you're not. However the latency seems to be on the same order of magnitude.
Yes, it works offline and it is a marvel IMO. It seems all the work is done on the phone when you are offline, and it performs close to its server counterpart. The latency is probably due to the nature of live ASR processing: the system cannot recognize word sequences immediately, just as humans cannot.
[1] http://static.googleusercontent.com/media/research.google.co...
[2] http://www.cs.cmu.edu/~ianlane/hydra/#&panel1-1