
Google’s new voice recognition system works instantly and offline (Pixel only)
https://techcrunch.com/2019/03/12/googles-new-voice-recognition-system-works-instantly-and-offline-if-you-have-a-pixel/
======
modeless
Google AI blog: [https://ai.googleblog.com/2019/03/an-all-neural-on-device-sp...](https://ai.googleblog.com/2019/03/an-all-neural-on-device-speech.html)

arXiv: [https://arxiv.org/abs/1811.06621](https://arxiv.org/abs/1811.06621)

~~~
p1esk
Interesting that they're using an RNN transducer. I thought everyone had moved
to CNNs lately.

~~~
lawrenceyan
CNNs are mentioned mainly because of processing considerations:
computationally, they are easier to deal with. But given the nature of speech
recognition, which is so highly temporally correlated, it shouldn't be a
surprise that a recurrent neural network would be used. This is pretty much
exactly what the RNN family of model architectures was designed for.

Also, if you haven't looked into how exactly an RNN Transducer functions, I
highly recommend doing so. They resolve a great deal of problems that
traditional RNNs and CNNs are unable to deal with.

~~~
p1esk
I take it you haven't been following the field lately? It is surprising
because convnets just work better for speech recognition ([1] is the latest
state of the art). I'm guessing gated convolutional LMs are slower than RNN
transducers when deployed on mobile. Can someone confirm?

[1] [https://arxiv.org/abs/1812.06864](https://arxiv.org/abs/1812.06864)

~~~
GistNoesis
There is also the transformer approach, possibly with local attention to bound
latency, like I'm doing in my project (work in progress):
[https://github.com/GistNoesis/Wisteria/blob/master/SpeechToT...](https://github.com/GistNoesis/Wisteria/blob/master/SpeechToText.md),
though it's in the same line of thought as the convolutional CTC.

The RNN-T is a nice idea though. If I understand it correctly, it's another
approach to the alignment problem. In CTC, you are generating sequences like
TTTTHHHEE CCCAAATT, which means that your language model must deal with these
repetitions, and you can't train using text without repetitions. In RNN-T, you
are learning to advance a cursor on either the audio sequence or the text
sequence so as to maintain alignment, kind of like you do when you merge two
sorted lists. It therefore outputs THE CAT directly, and you can use a
standard language model.
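
To make the CTC side concrete, here is a minimal toy sketch (my own
illustration, not any production decoder) of how per-frame outputs collapse
into a transcript:

```python
# Toy CTC decoding: merge consecutive duplicate symbols, then drop the
# special blank. The blank is what lets CTC still emit genuine double
# letters (without it, "HELLO" would collapse to "HELO").
from itertools import groupby

BLANK = "-"

def ctc_collapse(frames):
    merged = [sym for sym, _ in groupby(frames)]  # "TTTTHHH..." -> "TH..."
    return "".join(s for s in merged if s != BLANK)

print(ctc_collapse("TTTTHHHEE CCCAAATT"))  # -> "THE CAT"
print(ctc_collapse("HEL-LO"))              # -> "HELLO"
```

An RNN-T, by contrast, emits each symbol once while deciding at every step
whether to consume more audio or emit more text, so no collapsing pass is
needed.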

~~~
JulianSlzr
We explored self-attention + CTC in our ICASSP 2019 paper
([https://arxiv.org/abs/1901.10055](https://arxiv.org/abs/1901.10055)). Our
implementation uses internal infra, so we're a ways from releasing code :(

Hoping paper details suffice and help with the parameter search, and happy to
respond to questions over e-mail. Would love to see an open-source
implementation with local or directed attention built out!

~~~
p1esk
Very interesting! Perhaps I should revise my original comment to “everyone is
moving to transformers lately” :)

------
melling
"But it’s sort of funny considering hardly any of Google’s other products work
offline. Are you going to dictate into a shared document while you’re offline?
Write an email? Ask for a conversion between liters and cups? You’re going to
need a connection for that!"

While offline, you might write email drafts, your blog, or even a book:

[https://medium.com/@augustbirch/what-i-learned-writing-an-en...](https://medium.com/@augustbirch/what-i-learned-writing-an-entire-novel-on-my-phone-f1655d09b00b)

What's missing is the ability to make edits using your phone. You can probably
speak at over 100 words a minute but then you need to stop to bring up the
software keyboard.

~~~
ehsankia
The offline aspect is hardly the main draw here though. As mentioned earlier
in the article, the latency reduction is huge. Another aspect they didn't
really cover is privacy implications. Lastly, you may not be offline, but
dodgy connections can also be a pain if you need a stable stream of packets
going back and forth.

~~~
pault
I refuse to put an amazon/apple/google surveillance device in my home, so I am
very interested in a DIY digital assistant device. I'm aware of a few options
but it seems like offline voice recognition is always a little sub-par. I am
really looking forward to the day when an offline, open source digital
assistant can compare in quality to a proprietary/cloud device.

~~~
pdog
_> I refuse to put an amazon/apple/google surveillance device in my home..._

Do you have a smartphone? Because that's most likely an Apple or Google
surveillance device.

------
hathawsh
I just switched my Pixel 1 to airplane mode and tried voice input. Sure
enough, it worked offline and it was fast! Very impressive work. (I've tried
that before, but in the past it could only understand a few special phrases.)
I suppose this new feature came with the security update my phone downloaded a
few days ago.

There are lots of ways to spin this, but I see it as a significant improvement
for any app that could benefit from voice input. It's immediate and not
susceptible to network glitches. The benefit for Google, IMHO, is primarily
more sales of updated Android devices.

~~~
tacomonstrous
Unless you very recently (meaning today) accepted a download of a new language
pack for English, it's likely just the old model, which is perfectly
functional, though not as accurate as the online version.

~~~
ehsankia
More specifically:

Gboard > Voice Typing > Faster voice typing

It says it's an 85MB download for US English

~~~
rayshan
Looks like this is on Android. Gboard iOS app doesn't have this setting.

~~~
IanCal
Yes, it's specifically pixel phones.

------
dragonwriter
> But it’s sort of funny considering hardly any of Google’s other products
> work offline.

I dunno, Android and a lot of Google's mobile apps that aren't _about_ online
communication work fine offline. Actually, a lot of the online communication
ones do too, as much as is even conceivable; they just don't transmit and
receive offline, because how would they?

------
Someone1234
Just to be clear: This has nothing to do with "Wake Words" (e.g. OK Google,
Alexa, Hey Siri, etc) which have always been handled offline/locally.

This is translating what you said after the wake word from voice to text on
the local [Pixel] hardware rather than sending it into Google's Cloud.

The biggest benefits here are speed and reliability. It could also handle some
actions offline.

~~~
penagwin
Another benefit is privacy: this eliminates an entire set of potentially
personal data from being handed off to Google.

~~~
usrusr
On the other hand, when you can transcribe locally, uploading whole days'
worth of eavesdropping would not cause a noticeable spike in traffic. I'd
consider it more a lateral change than an improvement.

~~~
gruez
Install a firewall (there are root and no-root options) and block Google
keyboard's internet access.

~~~
konart
What if the data is sent by something other than the keyboard though? :)

~~~
ClassyJacket
This conspiracy goes all the way to the kernel!

------
adzm
Does the Pixel have some specific hardware that this uses, or is it simply
limited to Pixel to limit the rollout? I am curious if I should get my hopes
up to see this on gboard with non-Pixel Android devices.

~~~
joshvm
The Pixel 2+ does have a coprocessor for compute workloads (the Visual Core).
However users here have reported this working on a Pixel 1, which doesn't have
that chip.

The Verge says it may reach other devices later.

It sounds like it's both better than the old dictation model, and
significantly smaller.

------
bad_user
AI systems that are able to work offline are great for privacy.

The thought that every interaction with my phone is being streamed in realtime
to a third party server freaks me out.

Kudos to Google for working on this.

~~~
amelius
They can still send the information they collected from your microphone later,
when you connect to the internet ...

You want an open source solution, not just an offline solution.

~~~
bad_user
Even so, offline AI solutions have been piss poor and Google moving the state
of the art in spite of their vested interest in keeping people online is a
good thing.

Yes, we want an open source solution, but I'm not going to work on it. So
who's going to work on it? Are you?

In absence of resources working towards the ideal, I'll applaud any step in
the right direction.

~~~
hipitihop
To be fair, there are others that have been moving the needle in this space
for a considerable time. The standout for me is
[https://snips.ai](https://snips.ai): offline, multiple platforms, multiple
languages, many parts open source and more OSS parts in the pipeline. It's
certainly not currently aimed at dictation, but instead at assistant building
and automation, a space where on-device speed, privacy, and offline operation
are critical. In the case of Snips, "piss poor" falls well short of the
reality of what I have experienced, YMMV.

Nonetheless, we all benefit from this progress.

------
jsight
Didn't they advertise something like this a few years ago? I seem to remember
trying it and finding that it didn't really work as well as the online
recognition at the time.

EDIT: Looks like something was added in Jelly Bean:
[https://stackoverflow.com/questions/17616994/offline-speech-...](https://stackoverflow.com/questions/17616994/offline-speech-recognition-in-android-jellybean)

------
berbec
This will be great when ported to Lineage!

------
firefoxd
I can't pinpoint when exactly, but on Windows XP there used to be a
speech-to-text engine that worked locally. When you set it up, you had to read
some text to train it with your voice, and you could keep training it to
improve it.

This was before the whole cloud thingamajig, so I wonder what it ran on.

Edit: found the link [https://www.techrepublic.com/article/solutionbase-using-spee...](https://www.techrepublic.com/article/solutionbase-using-speech-recognition-in-windows-xp/)

------
lostmsu
At the same time, pre-Pixel phones are getting features stripped. "OK Google"
now requires the phone to be awake and unlocked, or plugged in, to work.

------
davidy123
The other, gigantic shoe that will someday drop will be Google transcribing
every incidental conversation. It can already do that, on-device, for every
song that's heard, ever. It's a super power, being able to remember every word
spoken around you, time and place, but of course it has privacy implications
even if all the work is done without their cloud.

~~~
rocky1138
I've noticed a trend in older folk where they get halfway into a sentence
knowing they don't know the name of what they are referencing only to ask
those around them for help identifying the reference on the way to making some
other point, e.g., "That's like that time in It's a Wonderful Life where
_______, gosh, I can't think of his name. What was the guy that was in that
movie? Oh yeah, James Stewart. That's like the time in It's a Wonderful Life
where yada yada yada"

I'm hopeful that voice recognition assistants will help ease the burden during
Christmas visits :D

------
moron4hire
This is great. I've been working on voice systems for VR and AR applications.
On the HoloLens, it's a dream once you have your entire interface speech
enabled. Can't wait to start porting to Android. Daydream and ARCore apps are
going to see a huge improvement.

------
gok
These end-to-end speech recognition systems are very intriguing. One major
limitation is that since they don't model phonetics, they have no great way to
deal with highly irregular orthography that doesn't show up in the training
data. For example, there is no great way for the system to learn that the
pronunciation "black" can be spelled "6LACK" sometimes.

The paper on arXiv goes into how they deal with this. Basically they run a
traditional WFST decoder over the output of the RNN-T to take spelling context
into account. Still, it's impressive how far the neural system can get with no
explicit lexicon or acoustic modeling in general.
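
A toy illustration of that idea (my own sketch, nothing like the paper's
actual WFST machinery, and the table entry is hypothetical): rescore the
RNN-T's literal output with a context-dependent spelling table.

```python
# Hypothetical contextual respelling pass over an RNN-T hypothesis.
# A real system would compose the hypothesis lattice with a biasing
# WFST; this just substitutes from a lookup table when a context
# (say, a music domain) is active.
CONTEXT_SPELLINGS = {
    "black": "6LACK",  # hypothetical entry
}

def respell(hypothesis: str, music_context: bool) -> str:
    if not music_context:
        return hypothesis
    return " ".join(CONTEXT_SPELLINGS.get(w.lower(), w)
                    for w in hypothesis.split())

print(respell("play black", music_context=True))   # -> "play 6LACK"
print(respell("play black", music_context=False))  # -> "play black"
```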

~~~
fouronnes3
I'm wondering if we'll see it write emoji at some point.

------
shereadsthenews
Hrmm, Gboard only? Does it mean they don't/can't use this model for voice
commands? I do sometimes dictate messages to my phone but my main use of
Android voice recognition is Android Auto commands to navigate or play music.

------
davidw
Call me when it can figure out my wife's Italian name, pronounced correctly
:-(

~~~
davidw
Downvoter(s): sorry, but it's true. Google can't figure out my wife's name,
which is pretty freaking lame, as she's the person I interact with the most in
terms of sending messages and emails and whatnot.

~~~
convivialdingo
Perhaps check if this could work for you?

[https://www.google.com/amp/s/www.howtogeek.com/340108/how-to...](https://www.google.com/amp/s/www.howtogeek.com/340108/how-to-add-phonetic-names-to-contacts-in-android/amp/)

------
dotdi
Call me cynical but I cannot picture Google not tapping into everything you
run through their voice recognition software, even if it does work offline.
Doesn't mean it won't phone home later.

~~~
lallysingh
For what? They only really make their money on ads for things you're actively
searching for. Everything else they have in ads works rather poorly in the
text world. Trying to interpret interests out of task-driven voice commands is
way beyond their capabilities.

But, enough of that. I'm holding out until decent voice dictation is standard
everywhere and a well understood engineering problem with good open source
implementations.

Mostly so I don't have to type addresses into my car's GPS.

~~~
godelski
I think this is only true under the assumption "voice recognition (and
transcription) is solved". I think most would consider this assumption to not
be true. If it is not true, then there is value to that voice data, as it can
be used to help train.

I guess another assumption could be "they have more than enough voice data to
train any future network improvements." While I feel they have a lot of voice
data, I am skeptical that they wouldn't view more data as useful, or at least
potentially useful.

I cannot think of a compelling reason for why they would stop collecting data
all of a sudden. Can you? (Serious question, I don't mean to sound snarky)

~~~
lallysingh
I think you're completely right. I interpreted data collection here as a
mining activity for ads preferences. Currently, they fully collect the voice
recordings and only seem to provide options on whether the data is connected
to your account. I don't see any option to "forget what I said after you've
responded."

------
Causality1
Finally. Over the past year or so I've noticed significant increases in the
voice recognition lag across a handful of devices and across multiple wireless
carriers.

------
thrax
Voice on my pixel 3 is incredible. I normally have problems with voice
recognition but this understands me better than some friends I have. It really
is magical.

------
tlepsh
What's so special about it? I just tried this on the BlackBerry keyboard, and
there it works instantly without an internet connection as well.

------
nojvek
Google and its dominance in both AI and reach into everyone’s private lives
really scares me.

There is a machine that can work totally offline, listen to audio, transcribe
it, form a basic understanding, and blast me with ads everywhere I go in the
digital universe.

It can then slowly, psychologically manipulate my behavior via ads, making us
buy or do things without our even realizing it.

It’s gonna be a scary world for my kids.

------
dep_b
It would be nice if Siri would at least allow me to turn cellular data back on
with a voice command. Turn-by-turn navigation tends to consume a lot of data
when I'm abroad on a temporary SIM, so I drive with the network connection off
using offline maps, but that kills Siri, meaning I can't do anything anymore
without touching my phone.

~~~
zulln
Disable mobile network for Maps specifically, while keeping it on for
everything else.

------
nukeop
Now they can save valuable CPU time, and your phone will extract advertising
keywords from your conversations for them, even without an internet
connection. It's way more efficient to cache speech converted to text while
offline than to cache audio clips: the servers get cleaned-up text data,
saving bandwidth and storage.

------
camkego
For applications where you don't want to source the audio from the
microphone: is it possible for an Android application to feed audio to Gboard
from other sources? Maybe the Pixel has a mixer which allows audio input from
sources other than the microphone?

------
beatle_sauce
There is an excellent overview of their speech recognition system:
[http://iscslp2018.org/images/T4_Towards%20end-to-end%20speec...](http://iscslp2018.org/images/T4_Towards%20end-to-end%20speech%20recognition.pdf)

------
sidcool
This is an impressive engineering feat. Imagine the applications at edge
devices! Microsoft is also trying hard to get their "Intelligent Edge" right.

------
stanley
At the risk of being downvoted: have any Pixel users enabled "Hey Google"
recognition on their phones, only to regret it?

I'm constantly dealing with the phone interpreting commands intended for a
Google Home speaker, which sometimes results in both the speaker and the phone
acting on the same command. To my dismay, there's no way to disable Hey Google
recognition on the phone after it's been enabled.

Perhaps someone here has run into this issue as well? It's a huge pain point
for me.

~~~
_asummers
You can register different trigger words, or at least you used to be able to.
I had my phone wired to OK Google and the Home wired to Hey Google. Wasn’t an
issue once I made the distinction. I no longer have an Android so I can’t
comment on this still working. If you have external parties at your home
regularly, that would obviously complicate things.

------
jcelerier
I never understood the need for server-side speech recognition. I did an
internship in 2013 doing speech recognition on a BeagleBoard with Julius
([https://github.com/julius-speech/julius](https://github.com/julius-speech/julius));
the thing worked with ninety-ish percent accuracy (Japanese) and delay
comparable to what my tablet gives, but locally.

~~~
foobarbecue
I always figured doing it server-side was just to capture the users' data,
either because the company wants a big training set, or for more nefarious
purposes like targeted advertising.

------
40acres
Hmm, is this how the "what song is playing" feature works? Google claims it
works offline (I haven't tested it) but I have a hard time believing that
Google is storing information related to every song out there. What about new
songs?

~~~
mynameisvlad
It was covered on HN when this feature was released. There's a database of
>10,000 song fingerprints on-device (IIRC, people found it and it was ~100MB)
that's updated based on the most popular songs from GPM/Youtube Music.

[https://venturebeat.com/2017/10/19/how-googles-pixel-2-now-p...](https://venturebeat.com/2017/10/19/how-googles-pixel-2-now-playing-song-identification-works/)
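
As a rough mental model (my own toy sketch, not how the actual matcher
works), an on-device lookup like that can be as simple as hashing short audio
windows and voting over the local table:

```python
# Toy on-device song matching: each short audio window is reduced to a
# hash; a song wins if enough of its window hashes appear in the local
# fingerprint database.
from collections import Counter

def match_song(window_hashes, local_db, min_votes=5):
    """local_db maps window-hash -> song_id."""
    votes = Counter(local_db[h] for h in window_hashes if h in local_db)
    if not votes:
        return None
    song, count = votes.most_common(1)[0]
    return song if count >= min_votes else None
```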

------
legohead
been using Google Voice for several years now for most of my communications in
text, email, slack, whatever (only on phone, of course).

it is quite good, and very fast. but it's still not there. it has trouble with
nuances like "call" vs "called" -- it can't hear that suffix very well in
regular speech. for me, it also has a _really_ hard time with pronouns.

many times I'll start off with regular speech, go to look at what was
transcribed, notice a couple of errors that would make me look like a fool,
backspace the whole thing, and then repeat it all again in a very robot-like
voice.

it's _almost_ there.

~~~
reificator
> _been using Google Voice for several years now for most of my communications
> in text, email, slack, whatever (only on phone, of course)._

Just a heads up, Google Voice is the name of a product that offers telephony
service including SMS, and has been around for a decade or so.

[https://www.google.com/voice](https://www.google.com/voice)

------
jdc0589
got it this morning on the way in to work. already used it a bunch and it's
GREAT.

------
tonywastaken
Dictation has worked offline on the iPhone since iOS 10.

~~~
tacomonstrous
Offline transcription has been available on Android as well. This is an
announcement of a faster, more accurate model.

------
flukus
Finally google has caught up to 1997:
[https://en.wikipedia.org/wiki/Dragon_NaturallySpeaking](https://en.wikipedia.org/wiki/Dragon_NaturallySpeaking)

Sure, it might work better now, but that's expected when computers are much
more powerful than a Pentium 100 with 32MB of RAM. Uploading voice to Google
servers for processing was always just a data grab.

~~~
gattilorenz
Well, try to train your Dragon (no pun intended) and then let another person
speak, possibly someone with a different accent.

GBoard works somewhat reliably with zero training and my Italian accent; it's
an apples-to-oranges comparison.

~~~
flukus
> Well, try to train your Dragon (no pun intended)

The training was in part due to hardware limitations. Some versions let you
skip the training, or have only a small training session and let it learn as
it went. I'd be surprised if Google's systems weren't doing some sort of
continuous learning too.

> and then let another person speak, possibly someone with a different accent.

It had profiles to handle that, although it needed training for each profile.
With Google, does it automatically switch profiles, or does it mess up what
it's learned about your voice? It's very opaque.

> GBoard works somewhat reliably with zero training and my Italian accent;
> it's an apples-to-oranges comparison.

I haven't tried the new system but all of google voice recognition has
struggled with my Australian accent to the point of being unusable.

