
How Apple fumbled a five-year lead in voice control - charlysl
http://highscalability.com/blog/2018/1/22/how-apple-fumbled-the-voice-first-future.html
======
manmal
The author makes it seem that everybody is eager to use voice assistants to
replace their Android or iOS screen interactions. But I don’t know anybody who
would dictate their private messages on the subway, or even at home - they are
called private for a reason. And just imagine requesting "tell me the current
value of a bitcoin" while at work - I don’t think this would go well.

There would be a lot of noise pollution everywhere, with people constantly
murmuring to themselves, alternated with an unresponsive silence while
listening to the device’s answer. It would seem like everybody is on the
phone, all the time. Also, for me, digesting information via text is way
easier, because I can jump back and forth to support short term memory - good
luck with that with a voice assistant.

The iPhone (and friends) disrupted the PC because they made reading and
writing mobile. Voice assistants are not iterating on this paradigm, but they
are rather iterating on phone calls. I’m skeptical that this is suitable to
replace the iPhone I’m writing this on.

~~~
princekolt
I don't think you've seen enough people using WhatsApp to send voice messages
to their contacts. I do see a _lot_ of people doing that on the subway, the
bus, the street, and everywhere else around Barcelona. And it is the same in
every other place I've been to recently.

And I can see a very obvious reason why: it's easier. Typing is annoying and
slow, and you make lots of mistakes. People don't care others are listening.
When was the last time you memorized something irrelevant someone else next to
you was saying on the phone?

Maybe _you_ don't want a single word of what you talk about to be heard by
anyone else. But most other people don't care, and the reality is that most
other people around are too bothered with their own lives to pay any attention
to it.

~~~
qubex
Whenever I get a voice message I just ignore it and write back "sorry, I’m not
going to open that because I am in a delicate situation, could you please
resend as text?" People being lazy doesn’t mean they can expect me to disrupt
my existence to receive their lazy communication. If it isn’t worth their time
to type it then it isn’t worth my time to read it.

~~~
unobtaniumstool
I bet you have lots of friends.

~~~
dang
Could you please not post uncivil or unsubstantive comments to Hacker News?
It's against the spirit and rules of the site:
[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html).

~~~
unobtaniumstool
I find my comment civil, substantive, and a worthwhile contribution to the
discussion. Maybe you should talk to the guy I replied to.

------
dustinmoris
> When Apple bought Siri it had a solid 5 year lead in voice control.

Ehm, what? Siri was initially released in 2011 and Google Assistant, I think,
in 2016, so yes, there's a 5-year gap between Siri and a product called
"Google Assistant". But before that Google had "Google Now", which was also
controlled by voice, and many years before that Google pioneered the space
with Google Voice Search and other apps which could be controlled by voice.

So let's get the facts right: Apple didn't have a 5-year lead. Apple was so
far behind that they were forced to buy another product in order to even have
a chance to catch up!

- [https://en.wikipedia.org/wiki/Google_Voice_Search](https://en.wikipedia.org/wiki/Google_Voice_Search)

------
shabbyrobe
> In the AR and VR future we won’t be typing.

Oh please god no full stop imagine how hard it would be to even write a simple
hello world program example ellipsis linebreak linebreak three backticks no
not that delete word delete word backtick backtick backtick c linebreak void
main open bracket int argc comma char asterisk argv close bracket space open
curly bracket line break tab print-EFF open bracket double-quote percent d
backslash n double-quote comma argc close bracket semi-colon line break
backtick backtick backtick

Chance of success? Zero. Time taken? Three minutes.

~~~
lunixbochs
Whether you think it's a good idea or not, there are those of us who can't
function without voice programming/input.

This video demonstrates full coding proficiency with my Talon project, which
still needs a lot of work (e.g. no auto-spacing yet):
[https://youtu.be/ddFI63dgpaI?t=30](https://youtu.be/ddFI63dgpaI?t=30). This
demo took me around 9 minutes vs. a 90wpm typist doing it in a little over 6.
I've also benchmarked inputting over 280 commands per minute with no false
recognitions.

This is my project, with which I aim to change minds about voice input (even
making it viable for real-time gaming):
[https://talonvoice.com/](https://talonvoice.com/)

The Perl video is embarrassing, but that voice engine is only designed for
dictating English text. voicecode.io has some major architectural limitations
(no dynamic grammars, so it requires a Dragon restart to change commands or
word lists; it adds commands to the English vocabulary instead of a strict
command tree, so recognition seriously suffers). aenea, caster, and dragonfly
are built on a 90's Python project called natlink and have their own issues.
"Continuous command recognition" is pretty much a must, which Tavis Rudd's
video lacks (notice the pausing between words).

~~~
fpig
That looks great! Both the speech recognition and the eye tracking look really
impressive. You should make this into a commercial project or something. Is
there any popular application that already does this stuff and is as good (I
am not familiar with this type of software)?

I had no idea eye tracking could work this well.

~~~
lunixbochs
As far as I'm concerned, there's nothing remotely in Talon's ballpark for
voice, and these demos are just a sort of raw baseline. Every part of Talon
will see significant improvements as I complete more of the upcoming features.
Tech like Siri/Alexa/Google Assistant is too focused on natural language
processing to be reflective of the performance I've already demonstrated with
a precise control scheme.

To my knowledge, mine is the only available high-performance eye-mousing
project that doesn't require a separate head tracker (I only require a $150
Tobii 4C eye tracker, and I do not use its integrated webcam head tracking
system, as it is not responsive enough).

I'm additionally planning to remove the ~$50-150 Dragon requirement for Mac
users in the medium term.

I'm very committed to keeping the baseline software not just free, but a net
negative cost over using other free solutions. As far as making it commercial,
my current plan is to eventually charge a small subscription fee for extremely
well integrated video game control systems, and a moderate subscription for
professional input schemes like CADD, A/V products, and entire language models
built around natural language programming that would ideally enable people who
become injured to continue to work without much of any downtime.

These are some of the competition:

Voice:

- [http://voicecode.io/](http://voicecode.io/)

- [https://github.com/dictation-toolbox/aenea](https://github.com/dictation-toolbox/aenea)

- [https://github.com/synkarius/caster](https://github.com/synkarius/caster)

- [https://www.voicebot.net/](https://www.voicebot.net/)

Eye/Head Tracking:

- [http://precisiongazemouse.com/](http://precisiongazemouse.com/)

- [http://iris.xcessity.at/](http://iris.xcessity.at/)

- [https://github.com/trishume/PolyMouse](https://github.com/trishume/PolyMouse)

- [https://github.com/trishume/FusionMouse](https://github.com/trishume/FusionMouse)

(huge shout out to trishume, who was very helpful and a big inspiration for my
tracking system)

One of my upcoming goals is to get a hands-free world-record game speedrun
into the Awesome Games Done Quick stream.

------
idreyn
> Over 60% of text iMessages are composed using voice

This seems dubious and I'd love to see a source for this claim.

~~~
alwillis
100% of iMessages are composed by voice on the Apple Watch, which may be
weighting this figure somewhat.

~~~
ramzyo
Err, huh? I use the scribble feature (not voice) on the Apple Watch to compose
iMessages all the time.

------
ken
I disagree with the initial premise, i.e., that:

> "We think in a voice in our head. Anyone trying to type has to first put it
> in a voice in their head before typing. You’re transcribing your inner voice
> onto the keyboard. When you speak it’s quicker."

Anyone who has heard me speak probably realizes I don't think in voice. I can
type as fast as I can talk because words->bits (or words->sounds) isn't the
bottleneck. Thoughts->words is.

Feynman [pdf:
[http://calteches.library.caltech.edu/607/2/Feynman.pdf](http://calteches.library.caltech.edu/607/2/Feynman.pdf)]
noted this, too. Some thinking can't easily be done in words. Some people
don't naturally think in words.

There's a TV show about an autistic young doctor which, for all its flaws,
tries to show how a person can have great insight even when they can't
articulate well. I hope it might help inform the public understanding of other
mechanisms of thought.

Spoken language is wonderful as an art form or as a tool, but I find it
troubling when people with one mode of thinking present theirs as the only
kind, and then use it to push their agenda on others. I suspect this is behind
the "open floor plan" office fad, too.

~~~
Someone
Yes, it seems as bad to me as the similar _"We think in a voice in our head.
Anyone trying to read has to speak out what is written."_, which was
considered true until the invention of "silent reading"
([https://web.stanford.edu/class/history34q/readings/Manguel/Silent_Readers.html](https://web.stanford.edu/class/history34q/readings/Manguel/Silent_Readers.html))

However, there may be some half-truth in that statement: to type as fast as
text spoken at top speed you must use a stenotype.

I’m not sure that means speaking, in general, _is_ faster than typing, though,
as a) it involves the extra handicap of having to process the spoken sound,
and b) typing may be faster in the common/average/mean case.

------
headmelted
Apple didn't fumble a five-year lead in voice control, they've just not closed
Google's 15 year lead in AI.

The article kind of touches on this topic but misses the point - Google _is_
using data about you to inform its AI. One could argue that the author not
realizing the extent to which it does so is more likely a sign that they've
become very good at it.

~~~
unobtaniumstool
Right well Google is preternaturally good at that shit. I want to know Apple's
excuse for failing to outperform Amazon.

~~~
headmelted
Yeah fair enough. I'd be interested in knowing that myself.

------
thestephen
What makes me sad about Siri on the iPhone is that it would be such a boon if
I was able to use it with actually good apps and if I didn't feel like a
second-class citizen for having an iPhone outside the US.

I'm forced to use the buggy Reminders app, because Siri only supports creating
reminders with Reminders. Apple refuses to support anything other than Apple
Music, which means I can't use Siri for music. (Yet, when the 4S was released,
it only took weeks until I could control Spotify with Siri via jailbreak.) I
can't use Siri for navigation either, in any form ("Hey Siri, when does the
next train leave from here?"), because Apple has never provided public transit
or good navigation in Sweden, nor announced a timeline for when that might
happen – and of course, Google Maps can't use Siri. And even if it could, Siri
does not gracefully support foreign words such as addresses.

I could easily love Siri if Apple would stop gimping it by tying it to other
subpar products.

~~~
dbbk
> I'm forced to use the buggy Reminders app, because Siri only supports
> creating reminders with Reminders.

Siri has third-party reminders support now, I use it with Things.

~~~
thestephen
That is good to know, and I've considered switching to Things for a while.
This makes it easier. Thank you!

------
hishnash
i do see a lot of people believing the Alexas system is much better but int he
end alaxa is just build up on regex.

that is ok-ish for simple languages like English were the words themselves
dont change.

but in many other languages, the words in sentences change depending on their
relationship to each other.

eg simple information in English like "the cat is sitting on the bench" is
encoding with the order of the words whereas in other languages the
information the `cat` is `sitting` on the bench rather than the bench `sitting
on the cat` is encoded in how the word `cat` and `sitting` and `bench` (and
maybe `on`) change. This change can be rather complex based on lots and lots
of factors such as gender of the `cat` and `bench` the type of verb eg
`sitting` and the preposition `on` (some languages don't even bother with this
since it's added to the verb).

apple's approach is much more upfront work since they do all this natural
language logic in every language they support. But for developers who use
Sirikit is is much better.

~~~
madeofpalk
I'm sure it's fixed now, but a few months ago, right in the middle of the
whole "Siri is a complete failure, Alexa is much better" narrative, I asked
Alexa to "turn all my lights off" and she responded with a very daft "I'm
sorry, but you don't have a device called 'All my lights' on your account".

Siri knew how to handle that just fine.

------
ghostcluster
Notably, unlike Xerox, which developed the Alto and its technologies in-house
and let them rot, Apple launched Siri through acquisition of a fully
functioning application which was already distributed through the App Store.
It's more like Flickr.

There are some weird sweeping statements in this piece that I don't think
represent reality, but I do agree that it seems Apple held Siri back from what
it should have been, as a platform with a third party API, because it was
afraid it would bypass Apple's control of the overall iOS platform somehow.
Fatal mistake, and they lost a huge number of Siri's team by letting it rot
the way they have.

~~~
alwillis
_Fatal mistake, and they lost a huge number of Siri's team by letting it rot
the way they have._

It’s not _fatal_; let’s not be melodramatic here.

I suspect they know they screwed up and have taken steps to correct it.
Remember that Siri on iOS, the Watch, macOS, and Apple TV were all different;
they're working on a unified Siri across platforms so that the things Siri
knows how to do aren’t in silos.

We’re already starting to see the results of this: the new “read me the news”
works on iOS 11.2.5 and also on HomePod:
[https://www.macrumors.com/2018/01/23/apple-releases-ios-11-2-5/](https://www.macrumors.com/2018/01/23/apple-releases-ios-11-2-5/)

~~~
ghostcluster
It's not a good sign that they've lost so much of the original talent on the
project.

The WSJ had a very public exposé on the topic over the summer:
[https://www.wsj.com/articles/apples-siri-once-an-original-now-struggles-to-be-heard-above-the-crowd-1496849095](https://www.wsj.com/articles/apples-siri-once-an-original-now-struggles-to-be-heard-above-the-crowd-1496849095)

I don't know of any good redemption narratives for tech products that lose
their founding teams after acquisition and exist in the headless/nebulous
state that Siri appears to be in now — outside of the Macintosh itself, and
that took hiring back its founder along with his spinoff company.

------
jxdxbx
"Voice first" seems like a misguided philosophy pushed by people who think
that we're due for some sort of "revolution."

Voice is important but mostly for narrow use cases (in the car, music control
in the house) or for shortcuts ("Siri, open accessibility settings").

Though I can imagine some kinds of routine, tedious computer tasks that could
be automated with voice and AI someday -- "Computer, convert all the files in
this folder to jpegs and then put today's date at the start of their
filenames, then move them to Dropbox and share them with my wife" -- stuff
like that. But it seems like for all that to happen we need something
approaching general AI. Most voice assistants just move you along
predetermined paths.
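Even the mundane middle step of that imagined command ("put today's date at
the start of their filenames") is a small program once spelled out. A minimal
POSIX C sketch of just that renaming step, with the JPEG conversion and
Dropbox parts left to other tooling (`prefix_with_date` is an invented name):

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper covering only the renaming step of the imagined
   command: prefix every non-hidden file in `dir` with today's date
   (YYYY-MM-DD-). Conversion and sharing would each need their own tools. */
int prefix_with_date(const char *dir) {
    char date[16], from[512], to[600];
    time_t now = time(NULL);
    strftime(date, sizeof date, "%Y-%m-%d", localtime(&now));

    DIR *d = opendir(dir);
    if (!d)
        return -1;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;  /* skip dotfiles, "." and ".." */
        if (strncmp(e->d_name, date, strlen(date)) == 0)
            continue;  /* already prefixed; guards against re-scanning */
        snprintf(from, sizeof from, "%s/%s", dir, e->d_name);
        snprintf(to, sizeof to, "%s/%s-%s", dir, date, e->d_name);
        rename(from, to);
    }
    closedir(d);
    return 0;
}
```

The point being: today's assistants can't generate even this much on the fly;
they can only invoke a handler somebody already wrote.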

~~~
cooper12
One place where voice might prove very useful is to get a generation of
illiterate people onto the internet. This article by WSJ talks about
uneducated porters in India who rely on voice to search, including a job ad
app that can be used by voice:
[https://www.wsj.com/articles/the-end-of-typing-the-internets-next-billion-users-will-use-video-and-voice-1502116070](https://www.wsj.com/articles/the-end-of-typing-the-internets-next-billion-users-will-use-video-and-voice-1502116070)

~~~
unobtaniumstool
Exactly. "The technology isn't there yet" and "I wouldn't be comfortable using
voice control" are said from a place of privilege. There are people out there
benefiting from it right now, despite some people's very first-world hangups
about it.

------
gambiting
>>Apple is also fumbling the future—the Voice First future. Voice First simply
means our primary mode of interacting with computers in the future will be
with our voice.

Oh god, please no. Interacting with computers using voice is probably the
second worst way of doing it, just after using motion gestures. There are
_some_ good uses for it (although I think even voice control for lights is
pushing it - it's more pain than it's worth), but having it as a primary mode
of operation? No thanks.

------
sparkpeasy
What do people actually use voice for besides playing music, smart home,
transcribing and "widget" stuff like timers and weather?

Are there lots of people out there actually using it for productivity like
scheduling, communicating, researching, comparing?

To me it seems like until voice AI gets smart enough to respond to a query
like "Alexa, how will our Q4 projected net profits change if I switch widget
vendors from AlphaCorp to BetaCorp?" an interactive screen will be needed.

~~~
yourapostasy
Companies see the Cambrian explosions of growth (and equities appreciation)
that follow successful UI shifts and want to reproduce that with voice UI.

You could do a hell of a lot by establishing a framework for third parties to
use such that they don’t stomp on each other or on the system-reserved
words/phrases, adding homophone rejection/discrimination, and then
brute-forcing lots of recognized cases. Right now, I’m seeing second-system
effects in voice UI efforts by everyone (possibly excepting the dedicated
voice recognition outfits like Nuance): trying to apply a general machine
learning approach to all of it and eschewing brute force as not pure/clean
enough.

~~~
jxdxbx
Well, Alexa seems to be going the brute force route--using third-party skills
is just like using a spoken command line utility. The problem is, while this
approach is simple and it works, very few skills are useful.

------
lobster_johnson
I'm probably living in a bubble, but I don't understand the current focus on
all of these voice-operated AI things.

For one, I don't see anyone using them. I've never seen anyone use Siri on a
phone, with one exception: I have a friend who uses voice transcription to
write emails and text messages. (Not sure if that counts as Siri.) I don't
know anyone who uses speakers such as Amazon Echo or Google Home, or mentions
wanting them. I've used Siri on my phone a few times to try out silly things
like asking about the weather or converting units, but it seems so obviously
bad that I've never tried going "Siri first". It's on my Apple TV, but I've
not used it. Are people actually using this stuff as much as this article (and
all the current focus on AI-powered speakers etc.) implies?

I'd love true AI -- a personal assistant that actually understood me, knew my
tastes, remembered my choices and so on -- but that doesn't exist. So we get
an uncanny valley where speaking to a device comes with uncertainty about
whether the device will actually be able to service the request (see the
numerous annoyed threads on Reddit about Siri not understanding incredibly
basic things), requiring the user to carry around with them a mental model of
the receiver's inadequacies. And we get the problem where many tasks that
would be super useful to have done by an assistant are actually too difficult
for a current AI. For example, booking a trip. I've never booked a flight that
didn't involve poring over pages of results, weighing the various compromises
(optimizing for price vs departure time vs airport distance vs layovers vs
airline crappiness etc.) and of course filling in a whole lot of information
about myself. Let alone booking holidays (I spend many days on research for
such a task). Bus and train tickets -- _maybe_, if it's a simple route with
few variables. Dinner reservations and movie tickets, sure. Home automation,
well. Once a home is fully outfitted with all lights, doors etc. being
connected to a home automation system -- maybe. None of this is automated in
my apartment, and I've never seen a home that is set up like this.

There's also the inherent limited utility that comes with an AI being
unintelligent. For example, my Apple TV. I rarely know what I want to watch.
Only rarely would I be able to say "Siri, play The Grand Budapest Hotel".
Usually it would have to be something like "Siri, list me some sci-fi movies".
But then I would still have to navigate the results. If it could respond to
"Siri, show me French movies with an IMDb score of 7.0 or higher, or with a
four-star review from Roger Ebert, that came out after 1990", well, now I'm
interested. But somehow I doubt that those sort of multi-dimensional cross-
referencing mechanics are currently available.
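The query logic itself is the easy part, for what it's worth. A toy C sketch
of that exact filter over an invented in-memory catalogue schema (field names
made up for illustration); the hard part the comment is pointing at is mapping
free-form speech onto something like this reliably:

```c
#include <stddef.h>
#include <string.h>

/* Invented in-memory catalogue schema, for illustration only. */
struct movie {
    const char *title;
    const char *country;
    int year;
    double imdb;
    int ebert_stars;
};

/* The exact filter from the comment: French, released after 1990, and
   either an IMDb score of 7.0 or higher or a four-star Ebert review. */
int matches(const struct movie *m) {
    return strcmp(m->country, "France") == 0
        && m->year > 1990
        && (m->imdb >= 7.0 || m->ebert_stars == 4);
}

/* How many entries in a catalogue pass the filter. */
size_t count_matches(const struct movie *catalogue, size_t n) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (matches(&catalogue[i]))
            hits++;
    return hits;
}
```

Three boolean clauses; any database can do this. The missing piece is an
assistant that can turn the spoken sentence into them.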

Edit: With the comments saying how good Google Assistant is compared to Siri,
I tried it out. My first queries failed ("What is the weather in Washington
State like in March") because it thought I was finished speaking and cut me
off at "What is the weather", and showed me today's weather in my city. After
a few attempts I got results, but then it returned Fahrenheit units (my phone
is set to use metric/international units), and then it continued giving me
Fahrenheit after I changed it, because apparently climate information is a
"web search" returning text, not data:
[https://imgur.com/gallery/LI8Sc](https://imgur.com/gallery/LI8Sc). Not so
impressive. Meanwhile, Siri on the phone doesn't understand the query at all,
ignoring the "in March" part and giving me today's weather, no matter how I
phrase the question.

I can see a future where AI voice control is incredible, but it requires an
enormous advance, and it seems like even the best AI of today just comes with
frustration and failure. That's surely why companies like Apple and Google are
working on this, but it doesn't explain, to me, why the current voice AI is
being pushed so hard.

~~~
jsharf
I'm a software engineer. I code in vim and use Ubuntu on my desktop which I
built myself. I say all of this to give the reader context for how I tend to
interact with tech. I'm 24.

I do a lot of texting, and a lot of the time it's a mixture between plain old
thumb texting and swiping for longer words. But about 30% of the time I use
voice to text.

I use Google's assistant and the voice to text is excellent. It even uses
context to go back and correct ambiguities (synonyms) and even cooler --
mistakes. If I say "I baked a match of cookies", it will first transcribe
match, but as soon as I say cookies, it will in real-time go back and change
match to what is obviously correct -- batch. Note this doesn't work with
direct assistant commands ("okay Google, what time is it?"), but it definitely
does work with text messages you send while talking to the assistant ("okay
Google, send Brandon a text message: I'm baking a match of cookies" gets
corrected). This helps with stuttering or mistakes you make while speaking.

It's certainly a bit of a mental shift to get used to speaking text messages.
Saying punctuation marks out loud is awkward, but it's definitely at the point
where it's convenient.

~~~
jxdxbx
Transcription is a hard problem, but it's pretty close to solved. The unsolved
problem is telling a voice assistant to do something useful and having it do
it. An AI assistant would be something that could act on "Book me a trip to
Tokyo and inform the relevant people" -- not interacting with a spoken command
line, invoking some API by voice, or laboriously filling out a form by talking
to a computer.

------
skgoa
Eh, this seems to be missing one important detail. At least it's massively
important to me as a potential buyer of an AI home assistant.

That detail is that HomePod doesn't spy on me. It does not upload audio
recordings of my home and it even shows a visual indication when it is
listening at all.

Doing speech recognition on the device is way harder, so it's not a surprise
to see Apple come later to market than others. However, this makes Apple's
product the only viable one on the market for me.

------
charlysl
This is about much more than just voice control. If I got it right, much of
its added value could be realized by just typing, or more generally, by
conversational text generated in any other way, of which online voice-to-text
is one instance.

Also interesting that, just as structuring everything around Windows hurt
Microsoft, Apple seems to be falling into the same trap with the iPhone.

------
baybal2
95% trained accuracy was already there by the late eighties.

97% by the mid-nineties.

98% by the late nineties, and I still remember how wonderful the voice
recognition was on old black-and-white Ericsson phones and Sony PDAs of that
era.

And then things stalled at a claimed 98.5%, without much improvement in
marginal utility.

The fact that misrecognition is there, and will stay there, removes much of
the utility from applications without a constrained dictionary, or from ones
whose commands are not free of side effects.

And yes, people talking to nobody feels and looks creepy. It almost begs
others to come and point a finger at you, saying "hey look, a computer creep
is talking to his phone."

------
mrarjen
I'd still like to see someone make a "Her"-type OS, even just as a
prototype/demo piece touching on new functionality. Currently Siri indeed
doesn't offer much, but even Google and Alexa aren't yet at a level where I'm
likely to use them for much more than setting a timer or putting on some music
in the long run...

~~~
ladberg
It's not that people aren't trying to do that, it's just insanely hard. That
is the end goal for all virtual assistants, but it's still years (I would even
say decades) away.

------
mnm1
I'd love to see such technologies developed more, but the focus needs to be on
accuracy and speed. No current systems I've tried excel in those areas when
used by people with accents. I've seen others code by voice, but for me, the
voice recognition systems are about fifty percent accurate most of the time,
aka useless.

------
braindead_in
Here's an automated transcript of the podcast referenced in the post, in case
you want to skim through it.

[https://scribie.com/transcript/236de3526ba3499a9259bd9fecb9e...](https://scribie.com/transcript/236de3526ba3499a9259bd9fecb9eb13c60fbfe2)

------
arkh
> People are listening to music and setting timers, but they are also getting
> things done.

I'd like some examples of those things getting done.

------
fwdpropaganda
> Amazon has 12,000 people working in Alexa.

Is this true?

