
Speech to Text on iPhone vs. Pixel - tosh
https://twitter.com/jamescham/status/1265512829806927873
======
walterbell
Some hackers have been trying to reuse Google's offline speech recognition
models within other software toolkits,
[https://hackaday.io/project/164399-android-offline-speech-re...](https://hackaday.io/project/164399-android-offline-speech-recognition-natively-on-pc)

 _> Especially the offline part is very appealing to me, as it should to any
privacy conscious mind. Unfortunately this speech recognizer is only available
to Pixel owners at this time. Since GBoard uses TensorFlow Lite, and the blog
post is also mentioning the use of this library, I was wondering if I could
get my hands on the model, and import it in my own projects, maybe even using
LWTNN._

Recent (May 2020) news suggests that these models may be coming to Chromium,
which would make them widely accessible for offline transcription and
dictation, e.g. WebRTC or video captions,
[https://hackaday.io/project/164399-android-offline-speech-re...](https://hackaday.io/project/164399-android-offline-speech-recognition-natively-on-pc/log/176945-soda-speech-on-device-api/discussion-144497)

 _> Google is building speech recognition into Chromium, to bring a feature
called Live Caption to the browser. To transcribe videos playing in the
browser a new API is slowly being introduced: SODA ... What is especially
interesting is that it seems it will be using the same language packs and RNNT
models as the Recorder and GBoard apks_

In Aug 2019, the Live Transcribe engine was open-sourced,
[https://github.com/google/live-transcribe-speech-engine](https://github.com/google/live-transcribe-speech-engine) &
[https://opensource.googleblog.com/2019/08/bringing-live-tran...](https://opensource.googleblog.com/2019/08/bringing-live-transcribes-speech-engine.html)

~~~
dandare
Laptops urgently need to have a hardware switch for the microphone. A mobile
phone should have one too - but that will never happen. At least the laptop is
large enough that one little switch is not a problem (unless you are Apple and
even essential peripherals are a problem).

~~~
GeekyBear
>Apple has brought its hardware microphone disconnect security feature to its
latest iPads.

The microphone disconnect security feature aims to make it far more difficult
for hackers to use malware or a malicious app to eavesdrop on a device’s
surroundings.

The feature was first introduced to Macs by way of Apple’s T2 security chip
last year. The security chip ensured that the microphone was physically
disconnected from the device when the user shuts their MacBook lid.

[https://techcrunch.com/2020/04/03/apple-hardware-microphone-...](https://techcrunch.com/2020/04/03/apple-hardware-microphone-disconnect-ipads/)

~~~
arvinsim
As someone who uses his laptop in clamshell mode most of the time, this was an
anti-feature for me.

Having to lift the lid and wait for macOS to readjust the display every time
I wanted to have a meeting was frustrating. It got to the point where I
bought a cheap USB microphone just to be done with it.

I support having a hardware switch for it so that I can have the choice.

~~~
danielscrubs
For me it certainly is a feature. I want the microphone off when closing
the lid.

I mean, the microphone is under the lid, so you are not doing whoever
you are talking to any favours by using it in clamshell mode anyway.

------
cl0rkster
There are already some novel applications of this. Google's call screening is
extremely useful for unknown numbers that might be a call I'm expecting from a
new number. I can't recall the last time I actually listened to a voicemail. I
know that at least voicemail transcription works elsewhere, and I can't speak
to how well, but I do know that on the Pixel, transcription is really quite
impressively accurate.

I'm sure it's not much different from the test shown in the Twitter post, but
I've enjoyed calling myself from another phone and then screening the call and
doing my best Micro Machines guy fast-talking impression just to see how well
it really can transcribe conversation at real-time speeds.

~~~
garaetjjte
Is voicemail popularity some US thing? I actually don't know anybody who uses
it.

~~~
derefr
I don’t expect voicemail from other human beings, but I do expect it from my
bank, my doctor, the government, my landlord, my plumber, my child’s school,
the repair shop my car is in, etc. For local in-person-visitable companies, a
phone call is still the #1 way they update you on things. And, as they make
most of these update calls while everyone is at work, inevitably these calls
all become voice messages.

~~~
jiofih
At least in Western Europe, there is near zero chance you’ll get an unexpected
call from any of those. Either you’re the one calling, or updates will be sent
via email / message / WhatsApp / some specialized app. Probably another
efficiency consequence of human labor being insanely expensive.

~~~
maccard
I'm in the UK, and my dentist and doctor both communicate by phone, including
appointment reminders. The garage I use has a website but no email, and they
call if you're waiting on a part (not exactly unexpected but close enough). My
car insurance and home insurance and bank (much to my dismay) all
semi-regularly phone me; I've had an unplanned phone call from each of the
ones listed in the last 6-12 months.

~~~
abrowne
And on the other side, in the US I still expect calls and voicemails as was
suggested, but the two bike repair shops I frequent and my dentist have
switched to texting, while my medical provider emails.

------
tialaramex
Similar offline versus online privacy thing for music recognition.

My Pixel is sat here charging. Billie Eilish is singing in the background, by
the time she sings "I should have known..." it has recognised the music and
passively displayed "No Time To Die by Billie Eilish" on the lock screen.

Google's team were building an ML model for music recognition and they
realised oh - the smallest model fits on a phone. We don't need to spin this
up as a cloud service we can just deliver it to the phones of people who want
it as a feature. "Now your phone can tell you what that music is".

Of course down the road this awareness of context allows even cleverer agent
interaction. "Which Bond was that?" you ask as Adele sings "Skyfall". "In the
movie Skyfall, James Bond was played by Daniel Craig".

~~~
swamp40
How in the world is that possible with 97 million+ songs in existence?

~~~
tdonovic
I believe their model only works for the top 50k songs in your region. Pretty
rarely does my Pixel not recognise music that is playing around me, so that
seems to be a large enough number.

------
barnabee
Buried in the replies is a comparison with iOS’s on device speech recognition
enabled, which paints a somewhat (though not entirely) different story:
[https://twitter.com/BenLumenDigital/status/12657182691908321...](https://twitter.com/BenLumenDigital/status/1265718269190832128)

~~~
gonational
I don’t know what the “new feature” bit is about; I’ve been using off-line
dictation on iOS for years. Anytime I need to dictate something very long, I
turn the phone on airplane mode so that I can dictate without Apple
inevitably deciding, at some point in the middle of my dictation, to delete my
entire message and begin to retype it in slow motion (has anybody else
experienced this? I’ve been dealing with this for at least five years, since
an early version of their online dictation).

Unfortunately, airplane mode is the only way to enable off-line dictation. Or,
is there another way?

The first version of dictation had no online mode at all; when Apple added
that, presumably as a Slowloris attack on the entire world, it ruined the
entire experience.

~~~
jdtang13
Yes. It is absolutely horrible. You can press-hold on the mic icon to force it
to offline mode.

~~~
jamescham
Oh! Good to know!

------
dhruvmittal
Since I got my Pixel 4, I've found myself using more and more speech to text
as... it's just that quick and easy. Every so often, I try to use speech to
text on my iPad Pro and I have an experience pretty similar to Poor Michael
Geer over here.

~~~
notyourwork
+1 The Apple experience is frustrating at best and usually flat out wrong,
especially for use cases it is supposed to be good at, like hands-free driving,
where accuracy is paramount to the use case's success.

------
fenwick67
I can't look at my phone screen when doing speech-to-text for full sentences;
it does something to derail me and I basically cannot finish a sentence,
similar to how DAF [1] trips people up. Am I alone in this?

[1]
[https://en.wikipedia.org/wiki/Delayed_Auditory_Feedback](https://en.wikipedia.org/wiki/Delayed_Auditory_Feedback)

~~~
cl0rkster
I love when the "Good Mythical Morning" guys demonstrated the effect for
entertainment.

[https://www.youtube.com/watch?v=TB2rEddp-Oo](https://www.youtube.com/watch?v=TB2rEddp-Oo)

------
payne92
Latency matters a lot, and it's one of the least-appreciated aspects of UX
design.

I (accidentally) dropped my desktop Linux system into pure console mode the
other day and realized how much FASTER it felt just because of the improvement
in keyboard latency.

~~~
snazz
Some applications have noticeably better keyboard latency than others on my
Linux system. Kitty, Emacs and Chromium beat most Qt apps and Firefox by a
mile. Also, latency is much better for me in Wayland than in (composited) X
(sway vs i3). Not sure why that is.

------
kartayyar
The Pixels get a lot of hate from people who haven't used them though I think
they are great phones and I'll never buy a Samsung phone again.

There are so many nice software touches on them that you just have to
experience to appreciate.

~~~
lvs
This is not specific to a phone model. It works reasonably well on the
venerable Nexus line of phones too, and on any Android device that uses Gboard. I
still won't actually use it for anything, but it works within a limited
vocabulary. It's still not what I would call generally useful, although I'm
sure it is very helpful for those with disabilities.

~~~
muyuu
Can I try this in an older Android tablet? I thought it was a TF-lite thing
under the hood.

------
deeesstoronto
Latency is super important.... High latency is one of the largest problems
with modern software.

But what about transcription accuracy? I mainly use Android but also use an
iPhone.... I have found transcription accuracy is so much higher on my iPhone
than Android.

I pulled out my old BlackBerry (Android) when I sent my Android phone in for
repair recently. The voice transcription via BlackBerry's keyboard is hands
down better than either stock Google or iPhone. It surprisingly feels like a
regression going back to the new phone (other than speed and battery life).

~~~
bangonkeyboard
Did you watch the video? The iPhone went back and retroactively inserted
errors into its transcribed text.

------
gregsadetsky
I have an extension in the Chrome store [0] that brings dictation into Gmail.
I piggyback on the Web Speech API, which in the case of Chrome uses Google's
servers.

Considering the audio stream upload & processing & network jitter/lag, the
speed at which text results come back is simply incredible. I don't remember
the exact timing, but it was a roundtrip of 40-100?ms which is... crazy/magic.

I made a small experiment with this same Chrome speech-to-text engine, which
triggered a Google image search and showed, in near real time, image results
for the spoken words. The slowest part of that Rube Goldberg machine was the
Google Images search + loading the images. [1]

I also.... "secretly" believe that very fast speech recognition is one of
"the" secrets to building a smarter / better digital voice assistant. It's one
of the key components (with a ton of groundbreaking NLP and/or the right
regexes) that might allow closing the "strange feeling" you get when talking
to the robot... voice... and... it... answers....... not ... exactly
when...... you were expecting it.... to.

I'm super mega busy these days but also super mega interested in this. Reach
out? :-)

[0] [https://chrome.google.com/webstore/detail/dictation-for-gmai...](https://chrome.google.com/webstore/detail/dictation-for-gmail/eggdmhdpffgikgakkfojgiledkekfdce?hl=en-US)

[1]
[https://www.instagram.com/p/BwqFQgWFsYu/](https://www.instagram.com/p/BwqFQgWFsYu/)

------
oh_hello
As an iPhone user, I very often find speech recognition far too slow. It is to
the point I rarely use the feature because it often causes frustration. The
Google performance here is amazing.

Is this a feature only of Pixel phones?

~~~
datguacdoh
The more recent Pixel devices have a local accelerator for different machine
learning tasks; they brand them Visual Core and, more recently, Neural Core.
[https://www.androidauthority.com/google-pixel-4-neural-core-...](https://www.androidauthority.com/google-pixel-4-neural-core-1045318)
has more details, and a lot of the speed-up is probably tied to these being
onboard.

------
kayoone
Some iOS Keyboards (SwiftKey) and apps like Draft support speech to text via
Microsoft, which seems to work very well.

------
walterbell
A printed dictionary captures extended history of human language. An audio
recognition model based on global mass surveillance can snapshot similar
history. After a good model exists and can be run entirely offline, the (sunk)
privacy costs are dwarfed by new value that can be created by speech-to-text
enablement of human expression.

This is what Google has done and it was a monumental human achievement,
deserving a place in history alongside Gutenberg, due to the small size of the
model that could operate fully offline on power and space constrained mobile
phones.

Compare this approach (partial mass surveillance generating privacy-preserving
offline models) with the approaches of competitors like Alexa & Siri: both use
mass surveillance for model training, but neither of them make their models
available for offline and privacy-preserving public use.

------
jay_kyburz
Wow, speech to text does not work anywhere near that well when I speak.
Australian accent perhaps?

~~~
geomark
I was wondering about that. And also about other languages, because I see Thai
speakers using speech-to-text on their Android phones and it seems to work really well in
terms of speed and accuracy.

------
nshm
Would be nice to test something opensource alongside. Like
[https://github.com/alphacep/vosk-api](https://github.com/alphacep/vosk-api)
which runs on Android and iPhone offline.

------
dirtyid
The speech-to-text closed captions on Google Meet are also extremely useful;
in some features Google is just years ahead.

------
jotm
Somewhat related: the Google Translate app uses older voices for some
languages, Google Assistant uses the newer, better ones. Not sure why, but if
you're using that feature, you should use Google Assistant ("be my translator,
please")

------
jonplackett
QUESTION

On a scale of 1 to impossible, how hard would it be for Apple, or someone
else, to just grab the model Google is using from the Pixel, reverse engineer
it, and steal all their years of research?

~~~
anchpop
It would probably be easy for any on-device model. But it would be illegal

~~~
rytill
What part of it is illegal? What if they reverse engineered the model, and
then understood the fundamentals of how it worked, and implemented and trained
the same architecture with different data?

Or trained their own architecture with data sampled from the Google model?

Is it "stealing" data, the architecture, the parameters, or the act of reverse
engineering and productizing the knowledge?

~~~
chrischen
That's what patents are for. The tricky part is figuring out what aspects were
obvious or not to an industry professional.

Models are also usually black boxes, and the techniques used are published.

------
kccqzy
In case anyone didn't notice, this is comparing Gboard (Google's virtual
keyboard) on iOS versus Android. It's NOT comparing Apple's voice recognition
tech with Google's.

~~~
mwest217
That's actually (sneakily) not the case. Gboard does have voice transcription,
but it's triggered by pressing the microphone button at the upper right corner
of the keyboard. Apple's voice transcription is still triggered from the
bottom right button, even if a 3rd party keyboard is being used.

~~~
kccqzy
Thanks. I stand corrected.

------
causality0
For me it seems like Google voice typing is getting worse and worse. It
capitalizes random Words in a sentence Like this. It specifically ignores the
word "o'clock".

~~~
jefftk
Google Pixel 3a, reading your comment: "for me it seems like Google voice
typing is getting worse and worse. It capitalizes random words in a sentence
like this. It's specifically ignores the word o clock."

~~~
causality0
See! I don't know who at Google has this personal vendetta but it has to be on
purpose. You say "We start at three o'clock" and it just types "We start at 3"
when what you want it to type is "We start at 3:00". You can sit there saying
"one o'clock, two o'clock, three o'clock, four o'clock" and it just types out
"1234". It _knows_ the word I'm using but ignores it on purpose.

It's a real issue for me as "4:00" has three significant digits, it's
specific, whereas "4" might mean anything between 3:50 and 4:10.
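
For illustration, the formatting asked for here could be done as a post-processing pass over the transcript. The following is a minimal sketch only, assuming the recognizer hands back plain words separated by spaces; `formatOClock` and `hourWords` are hypothetical names, and nothing here reflects how Gboard actually formats its output.

```swift
// Illustrative word list; a real normalizer would cover far more cases.
let hourWords: [String: Int] = [
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
]

/// Rewrites "<hour> o'clock" as "<hour>:00".
/// Hypothetical helper, not part of any speech API.
func formatOClock(_ text: String) -> String {
    let tokens = text.split(separator: " ").map(String.init)
    let marker = "o'clock"
    var output: [String] = []
    var index = 0
    while index < tokens.count {
        let token = tokens[index]
        // Accept a spelled-out hour ("three") or a digit hour ("3").
        let hour = hourWords[token.lowercased()] ?? Int(token)
        if let hour = hour, (1...12).contains(hour), index + 1 < tokens.count {
            let next = tokens[index + 1]
            if next.lowercased().hasPrefix(marker) {
                // Keep punctuation that follows the marker, such as a comma.
                let trailing = String(next.dropFirst(marker.count))
                if trailing.allSatisfy({ $0.isPunctuation }) {
                    output.append("\(hour):00\(trailing)")
                    index += 2
                    continue
                }
            }
        }
        // Anything that is not an hour followed by "o'clock" passes through.
        output.append(token)
        index += 1
    }
    return output.joined(separator: " ")
}
```

With this sketch, `formatOClock("We start at three o'clock")` gives "We start at 3:00". It only handles the straight apostrophe and whole hours from one to twelve.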

~~~
aikinai
It's a pain, but can you say "four colon zero zero?"

~~~
causality0
Nope! Google ignores that too and you just get "400".

------
tootie
I'd like to see these side-by-side with someone typing with thumbs or typing
with a full keyboard and see who really wins.

------
tsycho
It's amusing (from the comments) how much people are biased towards Apple.

1\. Apple does it online, Google does it offline, hence Apple being slower is
okay. But why is Apple's transcription more inaccurate then?

2\. Google violates privacy because it used 411 data to train its models,
hence its speech transcription quality is better. But Apple is the one doing
it online, are we sure it's not using the voice data too, similar to how it
has/was doing it with Siri?

It will sound like I am trying to defend/promote Google here, but that's very
far from my intent.

As an iPhone user since the first iPhone, I just want iPhones to be better. I
use iPhones for the same privacy concerns as many of you, but let's not give a
pass to Apple. Let’s demand higher quality from them so that they spend the
time, money and effort to improve. I don’t want to feel like I am compromising
on a worse overall experience in favor of better privacy when I buy my next
phone.

~~~
scblock
Apple does not do the transcription online, at least not on any modern iPhone.
Turn on airplane mode and disconnect wifi and give it a try yourself.

~~~
_nhynes
It requires iOS 13. Here's the code that does it:

    
    
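        import Speech
        // SFSpeechRecognitionRequest is abstract, so create a concrete subclass.
        let req: SFSpeechRecognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        if #available(iOS 13, *) {
            req.requiresOnDeviceRecognition = true
        }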
    

And the docs link:
[https://developer.apple.com/documentation/speech/sfspeechrec...](https://developer.apple.com/documentation/speech/sfspeechrecognitionrequest/3152603-requiresondevicerecognition)
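
A slightly fuller sketch of the same idea, assuming an iOS target: check `supportsOnDeviceRecognition` on the recognizer before requiring on-device mode, since not every device and language supports it. `makeOfflineRequest` is a hypothetical helper name; the Speech framework types and properties are Apple's, but the structure is only one way to arrange them.

```swift
import Speech

/// Builds a recognition request that prefers on-device processing.
/// Returns nil when no recognizer exists for the locale.
/// Hypothetical helper, not an Apple API.
func makeOfflineRequest(localeID: String = "en-US") -> SFSpeechAudioBufferRecognitionRequest? {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: localeID)) else {
        return nil
    }
    let request = SFSpeechAudioBufferRecognitionRequest()
    request.shouldReportPartialResults = true
    if #available(iOS 13, *) {
        // Only force on-device mode when this device and locale support it;
        // otherwise the request keeps its default behavior.
        if recognizer.supportsOnDeviceRecognition {
            request.requiresOnDeviceRecognition = true
        }
    }
    return request
}
```

The request would still need to be fed audio buffers and passed to the recognizer's `recognitionTask(with:resultHandler:)`, after the user has granted speech recognition permission.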

------
yalogin
Wow does anyone know why it’s slow on the iPhone?

~~~
ksec
iPhone is doing this online, sending data packets back to Apple.

Pixel is doing this completely offline.

~~~
tonywastaken
iPhone is doing it on device as well.

~~~
datguacdoh
Fairly certain that isn't true. On my wife's iPhone, she was never able to use
voice to text while in airplane mode. Maybe that's changed with the latest 11
but it definitely didn't work on the X.

~~~
tonywastaken
Go to Settings -> General -> Keyboards. If your language and phone support it,
it should say "You can use Dictation for {language} when you are not connected
to the internet." below Dictation Languages. It works since the 6s.

------
soneca
Is it a side-effect of Google expertise in analyzing people’s private data in
order to transform it into something useful to sell ads?

------
xenonite
The iPhone keyboard seems not to be the original one.

Is this really calling Apple’s transcription service? And does it maybe even
add some latency?

------
caycep
Granted, neither Google nor Apple really works well with my voice for some
reason...

------
markdog12
Must be a Pixel 4. My Pixel 3 result is arguably worse than the iOS one. It's
too painful to ever use.

------
kapsteur
Twitter video is a mess. I can't read anything in this video

------
gonational
I will never support google in anyway, at least in tension fully, but Apple
voice to text is absolutely a palling. I just dictated this message here is my
iPhone, so I will leave it as is without fixing the fuck ups.

------
m0zg
And Apple could literally do this faster than Google due to much better
onboard hardware. They're just years behind at AI at this point. It's getting
embarrassing. Some _individual researchers_ can do a better job than their
entire speech-to-text team is doing here.

Although to me, neither example is useful without:

1\. Automatic punctuation

2\. Robust recognition in noisy environments

And neither system is capable of that yet, although Google's system is better
at #2.

------
WhyNotHugo
I'm curious about the privacy policies behind both engines and their
development.

Google's gotten an advantage in many fields by simply trampling over users'
rights. Apple has been more respectful of user rights, and that's kind of made
life harder for them too.

~~~
pen2l
Here's my controversial post of the day:

I'm okay with Google trampling on my rights, in exchange for the things it
gives. Personalized search results, better ad hoc translation, all of these
things work well only because they have my data. And for me, I like these
things enough that I'm perfectly okay with the trade.

~~~
blinkingled
In this case however the transcription is done fully offline on device - that
is why it is so fast. Yes Google may have trampled on your voicemail data and
411 calls to create a model that works so amazingly well across different
accents and languages - but it is forgivable in this case given how good and
useful it is!

------
ksec
Well, the iPhone is doing this online, while the Pixel is doing it offline,
which is why there is a latency difference. (Specific to English only; some
languages still require a connection on the Pixel.)

I think dictation on iPhone is good enough, at least 10 times better than pre
iPhone / Machine learning era. But it is also not as good as google. Partly
because Google has more Data and started research way earlier, and partly
because this isn't a fundamental strength of Apple.

I am pretty sure Apple is working on Offline Dictation. If Google can do it
with Snapdragon on a much smaller transistor budget, there is no reason why
Apple can't do it with more transistors.

~~~
jonplackett
That just makes it even worse!

Look how much better the Pixel is doing on accuracy too. It's almost flawless
while the iPhone, despite the advantage of online processing, is getting
maybe 10% wrong and writing complete nonsense words in there. Who the hell
is "Poor Michael Geer"?

(for clarity, I'm an iPhone user and won't be switching because privacy, but
this makes me jealous)

~~~
azinman2
As pointed out in another comment, this is using Google's 3rd party keyboard,
so in effect it's Google vs. Google.

~~~
SquareWheel
No, it's still using Apple's native speech-to-text feature.

