Google’s new voice recognition system works instantly and offline (Pixel only) (techcrunch.com)
340 points by Errorcod3 42 days ago | 151 comments

These are much much better links

Interesting that they're using RNN transducer. I thought everyone's moved to CNN lately.

CNNs are mostly mentioned for processing reasons: computationally, they are easier to deal with. But given that speech is so highly temporally correlated, it shouldn't be a surprise that a recurrent neural network would be used. This is pretty much exactly the purpose the RNN type of model architecture was designed for.

Also, if you haven't looked into how exactly an RNN transducer functions, I highly recommend doing so. They help resolve a great deal of problems that traditional RNNs and CNNs are unable to deal with.

I take it you haven't been following the field lately, have you? It is surprising because convnets just work better for speech recognition ([1] is the latest SOA). I'm guessing gated convolutional LMs are slower than RNN transducers when deployed on mobile. Can someone confirm?

[1] https://arxiv.org/abs/1812.06864

There is also the transformer approach (possibly with local attention to bound latency), like I'm doing in my project (work in progress): https://github.com/GistNoesis/Wisteria/blob/master/SpeechToT... Though it's in the same line of thought as the convolutional CTC.

The RNN-T is a nice idea, though. If I understand it correctly, it's another approach to the alignment problem. In CTC you are generating sequences like TTTTHHHEE CCCAAATT, which means your language model must deal with these repetitions, and you can't train using text without repetitions. In RNN-T you are learning to advance a cursor on either the audio sequence or the text sequence so as to maintain alignment, kind of like you do when you merge-sort two sorted lists; therefore it outputs THE CAT, and you can use a standard language model.
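A minimal sketch of that collapse rule (my own toy illustration, not any real system's code; "-" stands for the CTC blank symbol):

```javascript
// CTC collapse: merge runs of repeated symbols, then drop the blank ("-").
function ctcCollapse(frames, blank = "-") {
  let out = "";
  let prev = null;
  for (const c of frames) {
    if (c !== prev && c !== blank) out += c;
    prev = c;
  }
  return out;
}

ctcCollapse("TTTTHHHEE CCCAAATT"); // "THE CAT"
// A genuine double letter needs a blank between the repeats: "LL" comes from "L-L".
ctcCollapse("HEL-LO");             // "HELLO"
```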

We explored self-attention + CTC in our ICASSP 2019 paper (https://arxiv.org/abs/1901.10055). Our implementation uses internal infra, so we're a ways from releasing code :(

Hoping the paper's details suffice and help with the parameter search; happy to respond to questions over e-mail. Would love to see an open-source implementation with local or directed attention built out!

Very interesting! Perhaps I should revise my original comment to “everyone is moving to transformers lately” :)

What is "convolutional CTC"?

Gated convolutions as an LM are similar to the RNN-T idea [1], but you have to deal with the softmax, so I'm not sure how well this would work in practice, especially on a mobile processor.

[1] https://arxiv.org/abs/1612.08083

I'm guessing Connectionist Temporal Classification: https://www.cs.toronto.edu/~graves/icml_2006.pdf

The paper you cite gets performance equal to SOA on the Wall Street Journal and LibriSpeech test sets, both of which are clean, read speech and not at all representative of what a phone or assistant recognizer deals with. The convnet described there also does not perform streaming recognition.

The primary reason to be interested in convnets for speech is computational parallelism, not because they have especially strong results for accuracy.

I'm curious why people think that recurrent architectures are somehow more noise-tolerant. Where did this come from?

They're not, as far as I know. But the paper cited is just showing relative parity or slight improvement on relatively toy examples. The claim that convnets are the clear winner for speech in general, or what everyone is doing now, is just not true.

I work in the field; a more accurate summary would be that there are a number of viable architectures that currently get fairly similar accuracy, but that have other pros/cons with respect to streaming, memory use, parallelism, model size, integration with external language models and context, complexity of the decoder, friendliness to different types of hardware, etc.

Ok, so what are the advantages of RNN over CNN based models for speech to text, with respect to any one of those factors you mentioned?

Well for example, some comparisons to the CNN paper you pointed to:

- No comparison is given of number of model parameters. If optimizing strictly for model size, RNNs tend to be nice and compact.

- The computational advantage of the CNN at training time is throughput; the advantage of the RNN at decoding time is streaming latency. Running the CNN frame by frame as frames are received removes the ability to run them in parallel, and if the CNN is larger it will run slower; depending on its receptive fields, it may not stream well at all.

- That particular CNN system uses a strictly external LM that is not jointly trained, and has an additional decoding-time hyperparameter to weight the LM, which requires extra tuning.

- It is still autoregressive in the beam search, so the LM will still be run many times sequentially, adding tokens just like an RNN LM, and is likely to be more expensive. The throughput advantage a conv LM has in scoring whole sentences is totally lost. In fact, there doesn't seem to be anything special about the choice of a conv LM for that paper except that it is fun to make all the parts convolutional.

- CNNs frequently require more total flops, but are high-throughput on e.g. a GPU because they expose so much parallelism. On an embedded CPU this can be a bad tradeoff.

As a side note, there's no reason that CNN architecture, which in the paper is trained with a close relative of CTC and is decoded identically to an RNN CTC AM + external LM, couldn't be trained as an RNN transducer. Despite the name, neither the AM nor the LM has to actually be an RNN.

Haha, I ran a straight convolutional net as an encoder for ASR in a toy project while learning seq2seq. It worked fine on the small datasets I was using, like the voice commands set...

Thank you for the detailed answer. This is exactly what I was looking for when starting this thread.

Feedforward CNNs cannot tolerate as much noise and real-world variability as RNNs.


There is this new type of trolling where someone just puts in bare-minimal counterarguments and asks for citations, knowing full well it's labour-intensive to provide them. If you're not trolling, you're not doing a good job of avoiding these comparisons. The onus should be on you here to provide citations first that disprove the person you're replying to.

I'm not trolling. The statement that RNNs are somehow more noise-tolerant than CNNs does not make sense to me, and is not based on any literature about noise tolerance in NNs that I'm familiar with. Also, no arguments have been provided as to why this could possibly be the case.

Hi p1esk, RNNs can be more tolerant to noise because they can learn transient or dynamic attractors. If the inputs move an RNN into an attractor, small changes due to noise make little difference to the state.

Recurrence can help with robustness in some other very important ways as well.

Citations for this date from the '80s and '90s. I don't know the best reference offhand. You could look at some old Hinton material if you're a fan; lots has been published on this.

That's handwaving. The only way to find out whether RNNs are more robust to noise than CNNs is to test them on the same task, with the same inputs and the same noise, preferably using a similar number of parameters and achieving similar accuracy before noise is applied. Then gradually increase the amount of noise and compare the impact.

Nothing like this has been published AFAIK.

After you have the results of this experiment you can try to explain them with attractors and what not, but I would be surprised if there was much difference. Would make a good paper though!
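The protocol described above could be sketched roughly like this (all names are mine, and a trivial sign classifier stands in for the trained RNN/CNN pair you would actually compare; a seeded PRNG keeps the noise draws identical across models):

```javascript
// Compare models on the same task, same inputs, same noise, at increasing noise levels.
function mulberry32(seed) {          // tiny seeded PRNG so every model sees identical noise
  return function () {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
function gaussian(rand) {            // Box-Muller transform: uniform -> standard normal
  const u = 1 - rand(), v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}
function accuracyUnderNoise(model, inputs, labels, sigma, seed) {
  const rand = mulberry32(seed);
  let correct = 0;
  for (let i = 0; i < inputs.length; i++) {
    if (model(inputs[i] + sigma * gaussian(rand)) === labels[i]) correct++;
  }
  return correct / inputs.length;
}

// Toy stand-in "model": sign classifier on 1-D inputs at +/-1.
const model = x => (x >= 0 ? 1 : 0);
const inputs = [], labels = [];
for (let i = 0; i < 1000; i++) { labels.push(i % 2); inputs.push(i % 2 ? 1 : -1); }
for (const sigma of [0, 0.5, 1, 2]) {
  console.log(sigma, accuracyUnderNoise(model, inputs, labels, sigma, 42));
}
```

With real acoustic models you would swap in the trained networks and noisy audio, but the shape of the comparison (fixed data, fixed noise seed, sweep the noise level) stays the same.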

> The onus should be on you here to provide citations first that disprove the person you're replying to.

Unless you're dealing with opinions, I disagree. The onus is on the person trying to give evidence to actually give evidence.

There was a post here 9 days ago discussing "intellectual DoS attacks" which basically work like this: https://news.ycombinator.com/item?id=19293036

Closely related to sea-lioning, in my opinion.


Sorry to go on a tangent, but this is the first time I've heard the word "transducer" outside of a conversation about clojure. Is it the same concept?

Not quite. A Clojure transducer from A to B is a pure function which takes As to lists of Bs. To apply it to a sequence of As you flatMap it: you apply it to each A in sequence and get a sequence of Bs, which you splice into your final resulting sequence. Maps and filters are special cases, for example

    const mapTrans = fn => function* (x) {
      yield fn(x);
    };
    const filterTrans = predicate => function* (x) {
      if (predicate(x)) {
        yield x;
      }
    };
    const dupeTrans = n => function* (x) {
      for (let i = 0; i < n; i++) {
        yield x;
      }
    };
Clojure just observed that an isomorphism of these functions under Church encoding looked like f<B> -> f<A> for a special parametric type f, so could be composed with ordinary function composition, albeit backwards.
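For completeness, applying one of those transducers to a sequence is just a flatMap, as described; here is a small self-contained sketch (redefining the helpers so it runs on its own, with names of my choosing):

```javascript
// A transducer maps one input to zero-or-more outputs; applying it to a whole
// sequence is a flatMap over the per-element generators.
const mapTrans = fn => function* (x) { yield fn(x); };
const dupeTrans = n => function* (x) { for (let i = 0; i < n; i++) yield x; };

const applyTrans = (trans, xs) => xs.flatMap(x => [...trans(x)]);

applyTrans(mapTrans(x => x * 2), [1, 2, 3]); // [2, 4, 6]
applyTrans(dupeTrans(2), ["a", "b"]);        // ["a", "a", "b", "b"]
```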

An RNN transducer is fundamentally three functions:

One takes a list of recent As to some C1.

Another takes a list of recent Bs to some C2.

A final one takes a C1 and C2 to a B.

Rather than mapping over each input independently, the RNN-T is allowed to learn something about the relationship of recent outputs to the next output, and the relationship of nearby inputs. Clojure transducers thus have an order insensitivity that RNN transducers are allowed to be sensitive to.
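That two-cursor idea can be sketched as a toy greedy decode loop (names are mine, and the three learned networks are stubbed with trivial stand-ins; a blank/null output means "advance the audio cursor"):

```javascript
// Greedy RNN-T-shaped decode: joint(C1, C2) either emits a label (advancing
// the text cursor) or returns null, i.e. blank (advancing the audio cursor).
function greedyDecode(frames, encode, predict, joint) {
  const out = [];
  let t = 0;
  while (t < frames.length) {
    const sym = joint(encode(frames[t]), predict(out));
    if (sym === null) t++;   // blank: consume more audio
    else out.push(sym);      // emit: future steps are conditioned on this label
  }
  return out;
}

// Trivial stand-ins: identity "encoder", last-label "prediction network", and a
// joint that emits the frame's symbol only when it differs from the last output.
const encode = frame => frame;
const predict = out => (out.length ? out[out.length - 1] : null);
const joint = (c1, c2) => (c1 !== c2 ? c1 : null);

greedyDecode(["T", "T", "T", "H", "H", "E"], encode, predict, joint).join(""); // "THE"
```

In a real RNN-T the three functions are learned networks producing distributions, but the loop over the two cursors has this shape.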

DeepSpeech 3 from Baidu also uses transducers. http://research.baidu.com/Blog/index-view?id=90.

"But it’s sort of funny considering hardly any of Google’s other products work offline. Are you going to dictate into a shared document while you’re offline? Write an email? Ask for a conversion between liters and cups? You’re going to need a connection for that!"

While offline, you might write email drafts, your blog, or even a book:


What's missing is the ability to make edits using your phone. You can probably speak at over 100 words a minute but then you need to stop to bring up the software keyboard.

The offline aspect is hardly the main draw here though. As mentioned earlier in the article, the latency reduction is huge. Another aspect they didn't really cover is privacy implications. Lastly, you may not be offline, but dodgy connections can also be a pain if you need a stable stream of packets going back and forth.

I refuse to put an amazon/apple/google surveillance device in my home, so I am very interested in a DIY digital assistant device. I'm aware of a few options but it seems like offline voice recognition is always a little sub-par. I am really looking forward to the day when an offline, open source digital assistant can compare in quality to a proprietary/cloud device.

> I refuse to put an amazon/apple/google surveillance device in my home...

Do you have a smartphone? Because that's most likely an Apple or Google surveillance device.

It doesn't have to be offline; doing recognition online but against your own cloud would be a way to do it too.

Uh, "cloud" isn't a magical thing that has specialized hardware for voice recognition. It is just a computer, just like the device you have locally. So if the local device can run the NN, there is no need to have a "cloud" do it; you only add latency.

On-demand Xeon with GPU in the cloud is quite different from local ESP32.

>As mentioned earlier in the article, the latency reduction is huge.

Well, on macOS offline voice recognition is actually much slower than online, not to mention that the choice of words and vocabulary is quite limited. I'd love to get an offline version, but so far every online version seems to be better.

I don't think that means this one will be slower. Google has always been the leader in the voice recognition field. I'll bet they'll do offline processing better than Apple.

>Google has always been the leader in the voice recognition field

Interestingly, that may be true only for English. In my experience Apple does far better in Japanese and Chinese (both Mandarin and Cantonese).

FWIW, Google Translate (including the "translate from picture" feature) is an example of a product that has had an offline option for quite some time. You have to tell it to download each language pair, IIRC.

Can't wait for different language models to be available to download for recognition, so one could have a genuine offline dictation between languages.

Offline voice-to-voice translation would be pretty damn useful.

For the record, it wasn't always this way; over the last couple of years, though, they have made a lot of improvements on this front. I think it may have something to do with Google's "next billion devices" being in countries with bad connectivity.

With that said I especially like the Google Maps offline features which have been added recently. You can even have it calculate driving directions completely offline if you have the starting and ending addresses.

Google Maps has an excellent offline mode on iOS and Android. I wish Apple Maps had that too.

If only it didn't force-expire downloaded maps after a while...

My offline maps expire in years, not in months or weeks. I'm not sure that's a huge issue, roads change over time and eventually maps will be so old that they are harmful.

On my Pixel 3 it's more like a month. All expired in the next few weeks, when I updated one it moved the expiration date to 4 weeks into the future.

I get a popup/notification once a month or so, when on wifi. It just asks if I want an update. There are settings to automatically update offline maps and one to control the automatic downloading of offline maps.

Seems pretty reasonable to me, after all roads do change over time.

My main motivation is to make nav less bandwidth intensive (I pay per GB with Google FI) and to ensure I have maps even if I don't have a good data connection.

It's reasonable to warn and suggest updating. It's not reasonable to disallow using the old maps just because you haven't updated, but the Maps app enforces that.

It might have something to do with licensing. Either way, OsmAnd doesn't have that problem.

I get updates every 30 days. Do they expire if you don’t update?


That's stupid. I do offline maps mainly for the times when I don't have a good connection. It makes no sense to lose the maps when you are spending an extended amount of time off-grid. That's exactly when you need offline maps.

License issues. Wouldn't happen if all the data was OpenStreetMap and Google data only.

I suspect very very few people pay to have a smart phone and don't get it on WAN or wifi at least once a month.

If I remember correctly, the maps aren't auto-updated by default; you do get notifications, but those are too easy to ignore (esp. since Maps is chatty in general).

And then for someone on an expensive metered or slow connection - which is a lot of people in the developing countries - they might not want auto-update at all. So if they didn't notice expiration, they'll find out that their maps aren't there next time they try to navigate offline.

It's one of those features that makes one think... Why am I being notified about this at all? Can't this be taken care of without my input?

I am being notified but the process starts automatically. No input needed.

Why bother you then?

Yes, there are two settings in Maps under offline maps. You can set it to auto-update and download if you wish.

Most of their products work offline and sync when a connection is regained. That includes Google Docs and Gmail.

It means google doesn't need to pay for all the servers busy doing speech recognition. They shifted that work to the user's device.

It's hilarious how they can't do anything right. If it's in the cloud it's evil because Google, if it happens on device it's evil because Google.

Who said it's evil?

I just switched my Pixel 1 to airplane mode and tried voice input. Sure enough, it worked offline and it was fast! Very impressive work. (I've tried that before, but in the past it could only understand a few special phrases.) I suppose this new feature came with the security update my phone downloaded a few days ago.

There are lots of ways to spin this, but I see it as a significant improvement for any app that could benefit from voice input. It's immediate and not susceptible to network glitches. The benefit for Google, IMHO, is primarily more sales of updated Android devices.

Unless you very recently (meaning today) accepted a download of a new language pack for English, it's likely just the old model, which is perfectly functional, while not being as accurate as the online version.

More specifically:

Gboard > Voice Typing > Faster voice typing

It says it's an 85MB download for US English.

Looks like this is on Android. Gboard iOS app doesn't have this setting.

Yes, it's specifically pixel phones.

Neither my Android 8 Gboard. Maybe it's only for Pixels or Android 9.

OK, thanks for the clarification.

> But it’s sort of funny considering hardly any of Google’s other products work offline.

I dunno, Android and a lot of Google's mobile apps that aren't about online communication work fine offline. Actually, a lot of the online-communication ones do too, as much as is even conceivable; they just don't transmit and receive offline, because how would they?

Just to be clear: This has nothing to do with "Wake Words" (e.g. OK Google, Alexa, Hey Siri, etc) which have always been handled offline/locally.

This is translating what you said after the wake word from voice to text on the local [Pixel] hardware rather than sending it into Google's Cloud.

The biggest benefits here are speed and reliability. It could also handle some actions offline.

Another benefit is privacy, this eliminates an entire set of potentially personal data from being handed off to Google.

On the other hand, when you can transcribe locally, uploading whole days' worth of eavesdropping would not cause a noticeable spike in traffic. I'd consider it more a lateral change than an improvement.

Install a firewall (there are root and no-root options) and block Google Keyboard's internet access.

What if the data is sent by something other than the keyboard, though? :)

This conspiracy goes all the way to the kernel!

My first thought as well.

Typed from my s̶u̶r̶v̶e̶i̶l̶l̶a̶n̶c̶e̶ ̶d̶e̶v̶i̶c̶e̶ smartphone.

I'm unclear on whether this moves the privacy needle. It says they do the translation offline, but they may still attempt to send the audio clip to compare with the resulting text.

It could be used to improve privacy, I just don't know if it will be used that way.

To me it's clear that this is in its early phases, and as it's only available on Pixel devices and not to the general public, I think it's safe to say this is part of its testing.

However, as you said, if this is always a requirement then it doesn't affect privacy at all, which to me would be a real shame but this is Google after all. We just have to wait and see for now.

I doubt it.

I generally think of Google the same way I think of the NSA. If they stop doing something invasive, either it didn't work, they found a better way of doing it, or it was transferred to a legally distinct category, and we only hear about it because of PR considerations.

That's a pessimistic way of looking at it. Personally, from my experience, it's the exact opposite: If they're doing something invasive, it's because the data actually powers a feature so valuable that it's generally worth it.

Since I use dictation so much, I hope others don't use it as much, and Google uses/prioritizes more of my speech. But I don't use dictation with sensitive data, so I don't worry about privacy in this particular instance.

Gboard is governed by Google's catch all privacy policy, that allows them to gather all data and mine everything.

If you have an android device with Google services and a firewall, you'll see that the device is constantly phoning home, which is also noted in the privacy policy.

This does nothing for privacy; rather, it provides the illusion of privacy.

Gboard uses differential privacy.

I've only seen articles that say Google was going to "explore" adding differential privacy to Gboard analytics [0]. Do you know if the feature ever shipped, and whether it's the only way Gboard sends data to Google?

I'm mistrustful of Google's privacy stance, since they have a history of changing their privacy policy, then misleading users about it. Remember when they implemented personally-identifiable web tracking and sold it to users as "new features for your Google account"? Merging Doubleclick's tracking data with my Google account doesn't seem like a feature to me.

[0]: https://venturebeat.com/2017/04/06/following-apple-google-te...

[1]: https://www.propublica.org/article/google-has-quietly-droppe...

Does the Pixel have some specific hardware that this uses, or is it simply limited to Pixel to limit the rollout? I am curious if I should get my hopes up to see this on gboard with non-Pixel Android devices.

The Pixel 2+ does have a coprocessor for compute workloads (the Visual Core). However users here have reported this working on a Pixel 1, which doesn't have that chip.

The Verge says it may reach other devices later.

It sounds like it's both better than the old dictation model, and significantly smaller.

AI systems that are able to work offline are great for privacy.

The thought that every interaction with my phone is being streamed in realtime to a third party server freaks me out.

Kudos to Google for working on this.

They can still send the information they collected from your microphone later, when you connect to the internet ...

You want an open source solution, not just an offline solution.

Even so, offline AI solutions have been piss poor and Google moving the state of the art in spite of their vested interest in keeping people online is a good thing.

Yes, we want an open source solution, but I'm not going to work on it. So who's going to work on it? Are you?

In absence of resources working towards the ideal, I'll applaud any step in the right direction.

To be fair, there are others that have been pushing the needle in this space for a considerable time. The standout for me is https://snips.ai: offline, multiple platforms, multiple languages, many parts open source and more OSS parts in the pipeline. It's certainly not currently aimed at dictation, but instead at assistant building and automation; in this space, on-device speed, privacy, and offline operation are critical. In the case of Snips, "piss poor" falls short of the reality of what I have experienced. YMMV.

Nonetheless we all benefit from this progress

I hope they add this to the Web Speech API in Chrome. It doesn't have punctuation right now and that's a killer for me.

Didn't they advertise something like this a few years ago? I seem to remember trying it and finding that it didn't really work as well as the online recognition at the time.

EDIT: Looks like something was added in Jelly Bean: https://stackoverflow.com/questions/17616994/offline-speech-...

This will be great when ported to Lineage!

I can't pinpoint when exactly, but on Windows XP there used to be a speech-to-text engine that worked locally. When you set it up, you had to read some text to train it with your voice, and you could keep training it to improve it.

This was before the cloudamagig, so I wonder what it ran on.

Edit: found the link https://www.techrepublic.com/article/solutionbase-using-spee...

At the same time pre-pixel phones get features stripped. "OK Google" now requires phone to be awake and unlocked, or plugged in to work.

The other, gigantic shoe that will someday drop will be Google transcribing every incidental conversation. It can already do that, on-device, for every song that's heard, ever. It's a super power, being able to remember every word spoken around you, time and place, but of course it has privacy implications even if all the work is done without their cloud.

I've noticed a trend in older folk where they get halfway into a sentence knowing they don't know the name of what they are referencing only to ask those around them for help identifying the reference on the way to making some other point, e.g., "That's like that time in It's a Wonderful Life where _______, gosh, I can't think of his name. What was the guy that was in that movie? Oh yeah, James Stewart. That's like the time in It's a Wonderful Life where yada yada yada"

I'm hopeful that voice recognition assistants will help the burden during Christmas visits :D

This is great. I've been working on voice systems for VR and AR applications. On the HoloLens, it's a dream once you have your entire interface speech enabled. Can't wait to start porting to Android. Daydream and ARCore apps are going to see a huge improvement.

These end-to-end speech recognition systems are very intriguing. One major limitation is that since they don't model phonetics, they have no great way to deal with highly irregular orthography that doesn't show up in the training data. For example, there is no great way for the system to learn that the pronunciation "black" can be spelled "6LACK" sometimes.

The paper on arXiv goes into how they deal with this. Basically they run a traditional WFST decoder over the output of the RNN-T to take spelling context into account. Still, it's impressive how far the neural system can get with no explicit lexicon or acoustic modeling in general.

I'm wondering if we'll see it write emoji at some point.

Hrmm, Gboard only? Does it mean they don't/can't use this model for voice commands? I do sometimes dictate messages to my phone but my main use of Android voice recognition is Android Auto commands to navigate or play music.

Call me when it can figure out my wife's Italian name, pronounced correctly :-(

Downvoter(s): sorry, but it's true. Google can't figure out my wife's name, which is pretty freaking lame, as she's the person I interact with the most in terms of sending messages and emails and whatnot.

Perhaps check if this could work for you?


Call me cynical but I cannot picture Google not tapping into everything you run through their voice recognition software, even if it does work offline. Doesn't mean it won't phone home later.

For what? They only really make their money on ads for things you're actively searching for. Everything else they have in ads works rather poorly in the text world. Trying to interpret interests out of task-driven voice commands is way beyond their capabilities.

But, enough of that. I'm holding out until decent voice dictation is standard everywhere and a well understood engineering problem with good open source implementations.

Mostly so I don't have to type address into my car's GPS.

I think this is only true under the assumption "voice recognition (and transcription) is solved". I think most would consider this assumption to not be true. If it is not true, then there is value to that voice data, as it can be used to help train.

I guess another assumption could be "they have more than enough voice data to train any future network improvements." While I feel they have a lot of voice data, I am skeptical to say that they wouldn't view more data as useful, or at least potentially useful.

I cannot think of a compelling reason for why they would stop collecting data all of a sudden. Can you? (Serious question, I don't mean to sound snarky)

I think you're completely right. I interpreted data collection here as a mining activity for ads preferences. Currently, they fully collect the voice recordings and only seem to provide options on whether the data is connected to your account. I don't see any option to "forget what I said after you've responded."

You might be waiting a while. The problem you describe, with GPS, is an acoustic modeling problem, and a very difficult one at that. Cars come in many shapes and make lots of odd noises at various speeds, not to mention that these devices are mounted in many places relative to the speaker.

The voice recognition on Android Auto works fine in the interior of every car where I've tried it. The manufacturers include well-placed microphones dedicated to the purpose.

Hopefully those abilities and integration will filter into open source implementations. Right now, it seems like AM is owned by the big players and will be for the foreseeable future.

Right, but once this is implemented there is no excuse for audio to be "sent home".

Of course there will still be usage analytics, etc. But it does increase privacy to some degree, especially when compared to sending all audio after a certain phrase is mentioned.

Google does not need any excuse.

They do need an excuse, however most likely it will be "to further improve our models with more data", which could certainly be true, much to all of our distaste.

We have to wait and see. I'm sure we all look forward to a completely offline solution.

That seems orthogonal to this anyway. If they wanted to tap into what you're saying, why not just upload the text? It's both smaller and more useful. And more general because you get the data regardless of how the user chose to input it.

And if you're uploading the text, then it doesn't really matter where the speech to text translation happens.

Finally. Over the past year or so I've noticed significant increases in the voice recognition lag across a handful of devices and across multiple wireless carriers.

Voice on my pixel 3 is incredible. I normally have problems with voice recognition but this understands me better than some friends I have. It really is magical.

What's so special about it? Just tried this on the BlackBerry keyboard and there it works instantly without being connected to the internet as well.

Google and its dominance in both AI and reach into everyone’s private lives really scares me.

There is a machine that can work totally offline, listen to audio, transcribe it, have a basic understanding and blast me with ads everywhere I go in the digital universe.

It can then slowly and psychologically manipulate my behavior via ads, making us buy or do things without even realizing it.

It’s gonna be a scary world for my kids.

It would be nice if Siri would at least allow me to turn cellular data back on with a voice command. Turn-by-turn navigation tends to consume a lot of data when I'm abroad using a temporary SIM, so I drive without a network connection on offline maps, but that kills Siri, meaning I can't do anything anymore without touching my phone.

Disable mobile network for Maps specifically, while keeping it on for everything else.

Now they can save valuable CPU time and your phone will extract advertising keywords from your conversations for them, even without an internet connection. It's way more efficient to cache speech converted to text while offline rather than audio clips. The servers get cleaned up text data, saving bandwidth and storage.

For applications where you don't want to source the audio from the microphone: is it possible for an Android app to feed audio to Gboard from sources other than the microphone? Maybe the Pixel has a mixer that allows this?

There is an excellent overview of their speech recognition system: http://iscslp2018.org/images/T4_Towards%20end-to-end%20speec...

This is an impressive engineering feat. Imagine the applications at edge devices! Microsoft is also trying hard to get their "Intelligent Edge" right.

At the risk of being downvoted, any Pixel users enable "Hey Google" recognition on their phones only to regret it?

I'm constantly dealing with the phone interpreting commands intended for a Google Home speaker, which sometimes results in both the speaker and the phone acting on the same command. To my dismay, there's no way to disable Hey Google recognition on the phone after it's been enabled.

Perhaps someone here has run into this issue as well? It's a huge pain point for me.

You can register different trigger words, or at least you used to be able to. I had my phone wired to OK Google and the Home wired to Hey Google. Wasn’t an issue once I made the distinction. I no longer have an Android so I can’t comment on this still working. If you have external parties at your home regularly, that would obviously complicate things.

I never understood the need for server-side speech recognition. I did an internship in 2013 on speech recognition on a BeagleBoard with Julius (https://github.com/julius-speech/julius); the thing worked with ninety-ish % accuracy (Japanese) and a delay comparable to what my tablet gives, but locally.

I always figured doing it server-side was just to capture the users' data, either because the company wants a big training set, or for more nefarious purposes like targeted advertising.

Hmm, is this how the "what song is playing" feature works? Google claims it works offline (I haven't tested it) but I have a hard time believing that Google is storing information related to every song out there. What about new songs?

It was covered on HN when this feature was released. There's a database of >10,000 song fingerprints on-device (IIRC, people found it and it was ~100MB) that's updated based on the most popular songs from GPM/Youtube Music.
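The on-device lookup described above can be sketched as Shazam-style fingerprinting: reduce each track to a small set of hashes built from dominant spectral peaks, then match a snippet by counting overlapping hashes. The toy scheme below (frame size, hash layout, synthetic "tracks") is invented for illustration and is not Google's actual format:

```python
import cmath
import math
import random

FRAME = 64  # samples per analysis frame (toy value)

def peak_bin(frame):
    """Index of the dominant DFT bin (naive O(n^2) DFT; fine for a toy)."""
    n = len(frame)
    mags = []
    for k in range(1, n // 2):  # skip the DC bin
        s = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame))
        mags.append((abs(s), k))
    return max(mags)[1]

def fingerprint(signal):
    """Hash pairs of consecutive per-frame peak bins into a compact set."""
    peaks = [peak_bin(signal[i:i + FRAME])
             for i in range(0, len(signal) - FRAME + 1, FRAME)]
    return {(a << 8) | b for a, b in zip(peaks, peaks[1:])}

def best_match(snippet, database):
    """Return the stored name whose fingerprint set overlaps the snippet most."""
    q = fingerprint(snippet)
    return max(database, key=lambda name: len(q & database[name]))

# Toy "on-device" database: two synthetic one-second tracks at 2048 Hz.
rate = 2048
tracks = {
    "song_a": [math.sin(2 * math.pi * 200 * t / rate) for t in range(rate)],
    "song_b": [math.sin(2 * math.pi * 600 * t / rate) for t in range(rate)],
}
db = {name: fingerprint(sig) for name, sig in tracks.items()}

# A noisy mid-song snippet of song_a should still match song_a.
random.seed(0)
snippet = [x + 0.1 * random.gauss(0, 1) for x in tracks["song_a"][300:1600]]
print(best_match(snippet, db))  # song_a
```

Because only hash sets are stored, ~10,000 tracks fit in roughly 100MB, and handling new songs is just a matter of shipping an updated hash table rather than the audio itself.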


You never heard of Google Music? I believe getting fingerprints of all published music is probably one of the least complex tasks Google faces.

been using Google Voice for several years now for most of my communications in text, email, slack, whatever (only on phone, of course).

it is quite good, and very fast. but it's still not there. it has trouble with nuances like "call" vs "called" -- can't hear that suffix very well in regular speech. for me, it also has a really hard time with pronouns.

many times I'll start off with regular speech, go to look at what was transcribed, notice a couple errors that would make me look like a fool, backspace the whole thing, and then repeat it all again in a very robot-like voice.

it's almost there.

> been using Google Voice for several years now for most of my communications in text, email, slack, whatever (only on phone, of course).

Just a heads up, Google Voice is the name of a product that offers telephony service including SMS, and has been around for a decade or so.


got it this morning on the way in to work. already used it a bunch and it's GREAT.

Dictation has worked offline on iPhone since iOS 10.

Offline transcription has been available on Android as well. This is an announcement of a faster, more accurate model.

Finally google has caught up to 1997: https://en.wikipedia.org/wiki/Dragon_NaturallySpeaking

Sure it might work better now, but that's expected when computers are much more powerful than a pentium 100 with 32MB of RAM. Uploading voice to google servers for processing was always just a data grab.

Well, try to train your Dragon (no pun intended) and then let another person speak, possibly someone with a different accent.

GBoard works somewhat reliably with zero training and my Italian accent; that's an apples-to-oranges comparison.

> Well, try to train your Dragon (no pun intended)

The training was in part due to the hardware limitations. Some versions let you skip the training, or have only a small training session and let it learn as it went. I'd be surprised if Google's systems weren't doing some sort of continuous learning too.

> and then let another person speak, possibly someone with a different accent.

It had profiles to handle that, although it needed training for each profile. With Google, does it automatically switch profiles, or does it mess up what it's learned about your voice? It's very opaque.

> GBoard works somewhat reliably with 0 training and my Italian accent, that's an apple to oranges comparison.

I haven't tried the new system but all of google voice recognition has struggled with my Australian accent to the point of being unusable.

They've had offline voice recognition since Honeycomb at least. This is just their newer/better system coming soon. If you wanted you could go into airplane mode right now and dictate a novel.

I used Dragon dictation in the 90s.

You had to spend an age just training it to your own specific voice and choose microphone carefully, neither of which are steps this requires. Not to mention the syntax for punctuation etc.

So no, I would not argue this is catching up to the 90s.

While not a demo of Dragon NaturallySpeaking specifically, I think of this video when someone mentions 90s/00s-level consumer dictation software. For me, the video absolutely captured what it felt like to use those products.


It is more than just newer hardware. These CNN/RNN-based methods are vastly more robust than older Bayesian/Markov-based methods.

> These CNN/RNN-based methods are vastly more robust than older Bayesian/Markov-based methods.

They were invented in the '80s; they weren't used because of limited processing power.

Can't imagine why you're being downvoted for this accurate assessment. The current resurgence of machine learning is based on the realization that the temporarily abandoned academic research of the early 80s/90s was failing only because the hardware of the time couldn't handle such gigantic parameter spaces. Now we do have the hardware.
