How Apple fumbled a five-year lead in voice control (highscalability.com)
78 points by charlysl 10 months ago | 93 comments

The author makes it seem that everybody is eager to use voice assistants to replace their Android or iOS screen interactions. But I don't know anybody who would dictate their private messages on the subway, or even at home - they are called private for a reason. And just imagine requesting "tell me the current value of a bitcoin" while at work - I don't think this would go well.

There would be a lot of noise pollution everywhere, with people constantly murmuring to themselves, alternated with an unresponsive silence while listening to the device's answer. It would seem like everybody is on the phone, all the time. Also, for me, digesting information via text is way easier, because I can jump back and forth to support short-term memory - good luck with that with a voice assistant.

The iPhone (and friends) disrupted the PC because they made reading and writing mobile. Voice assistants are not iterating on this paradigm; rather, they are iterating on phone calls. I'm skeptical that this is suitable to replace the iPhone I'm writing this on.

I don't think you've seen enough people using WhatsApp to send voice messages to their contacts. I do see a lot of people doing that on the subway, the bus, the street, and everywhere else around Barcelona. And it is the same in every other place I've been to recently.

And I can see a very obvious reason why: it's easier. Typing is annoying and slow, and you make lots of mistakes. People don't care that others are listening. When was the last time you memorized something irrelevant that someone else next to you was saying on the phone?

Maybe you don't want a single word of what you talk about to be heard by anyone else. But most other people don't care, and the reality is that most other people around are too bothered with their own lives to pay any attention to it.

Whenever I get a voice message I just ignore it and write back "sorry, I'm not going to open that because I am in a delicate situation, could you please resend as text?" People being lazy doesn't mean they can expect me to disrupt my existence to receive their lazy communication; if it isn't worth their time to type it, then it isn't worth my time to read it.

I don't use the voice features of whatsapp, but my wife often does. Nuances of voice communication, such as "how worried does X sound" are important to both her and the group she is in.

I dislike it too, but there's a huge number of people that don't. I noticed a lot of South Americans and Spanish people using WhatsApp via voice recordings. Coming from Asia, I gotta say we prefer communicating with emojis. :p

I bet you have lots of friends.

Could you please not post uncivil or unsubstantive comments to Hacker News? It's against the spirit and rules of the site: https://news.ycombinator.com/newsguidelines.html.

I find my comment civil, substantive, and a worthwhile contribution to the discussion. Maybe you should talk to the guy I replied to.

This is the future that Her presents[1]. There's a scene on the subway where everyone's chatting away to their AI, and you can see how it could be normalised.

[1]: http://www.imdb.com/title/tt1798709/

Sure, but we're nowhere near that at the moment.

I'm sure this is what a lot of people at Apple told themselves, and the downside risk of that thinking has already arrived. Now voice assistants power connected speakers, a market Apple is trying to enter, and they're years behind on the fundamental UI technology that powers it. Whoops.

Switching to voice is a fundamental change. Amazon made a dent in the universe:


You make a very good point on this.

As you say, a big win for text is that you can quickly skim it, ignore the parts you don't deem relevant, and essentially use working/short-term memory only for the information you are most interested in. Therefore, if the device responds to you in voice, even if it's an entirely human-like experience, that's still inferior (at least in terms of speed) to the text-based experience.

Even for using voice as an input-only mechanism, it's still going to contain lots of redundant/useless information and verbosity.

Compare "OK Google, what is the price of bitcoin today?" vs. googling "bitcoin price".

Before the Internet, you had to talk to people to do all the things we now do with apps and web browsers and texts and so on. A lot of time was spent on the phone. So the world has gotten quieter, but it's not like it was that quiet to begin with.

We had those phones on the wall though, confined to a dedicated location, and not in our ears.

I think people are going to switch to voice-only interfaces, but not because they are eager to use them. People are starting to realize the mobile phone is a time black hole, and will restrict themselves from using it. But they still need to stay connected, and that's where the voice interface comes in.

I could easily see a chat thread with Alexa where I can ask things I don't want to say out loud.


- "Number for city of [myCity] utilities"

- "[what's the current] bitcoin price"

- "what's my checking account balance"

- "remind me on the first to change the air filter every [X] months"

Etc. Essentially this would be used when I don't want to bother people around me. I'll be eager to use a voice assistant when it becomes convenient for me. Siri isn't there yet, but my Alexa interactions are damn near close already (no experience with Google or others). If this saves me from opening up a Google tab, or some little-used app, then I'm all for it.

What Alexa considers Utterances can simply be interpreted as text instead of speech[-to-text]. The testing tools for Alexa Skills already allow this. Hell, I developed a skill with a strong user base without an Echo device at all.

I use Alexa/Amazon in this case because I've used it first-hand. Any other AI would work just as well (as talking directly to it).
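To illustrate the "utterances as text" idea, here is a rough sketch of a typed-intent matcher. The intent names and sample utterances are made up, and this is not the actual Alexa Skills Kit API — just the general shape of treating an utterance as plain text once speech-to-text is out of the picture:

```python
import re

# Hypothetical intents and sample utterances, in an Alexa-like "{slotName}" style.
SAMPLE_UTTERANCES = {
    "GetBitcoinPrice": ["bitcoin price", "what's the current bitcoin price"],
    "GetBalance": ["what's my {account} account balance"],
}

def match_intent(text):
    """Resolve typed text to an intent, filling any {slot} values."""
    for intent, samples in SAMPLE_UTTERANCES.items():
        for sample in samples:
            # Turn each "{slot}" into a named capture group.
            pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>\\w+)", sample)
            m = re.fullmatch(pattern, text)
            if m:
                return intent, m.groupdict()
    return None, {}
```

The same handler that backs a voice skill could then be driven from a chat thread, since by this point the input is just text.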

Interestingly, for the items you listed, I am skeptical that needing to jump through the hoops of NLP would result in a more convenient experience for some users.

It feels like the reason why some of these voice systems actually have utility is because they are usable in scenarios where you typically can't/don't want to type (such as while driving, lying on the couch at your house, etc). When you require somebody to type in their query, I'm struggling to see why an Alexa/AI chat thread would perform better than just a search engine with AI features. Or how, practically, it's different than just a "thread" where what you type goes to a search engine.

> When Apple bought Siri it had a solid 5 year lead in voice control.

Ehm, what? Siri was initially released in 2011, and Google Assistant, I think, in 2016... so yeah, there's a 5-year difference between Siri and a product called "Google Assistant", but before that Google had "Google Now", which was also controlled by voice, and MANY years before that Google pioneered with Google Voice Search and other apps which could be controlled by voice.

So let's get the facts right... Apple didn't have a 5 year lead, Apple was so far behind that they were forced to buy another product in order to even have a chance to catch up!

- https://en.wikipedia.org/wiki/Google_Voice_Search

> In the AR and VR future we won’t be typing.

Oh please god no full stop imagine how hard it would be to even write a simple hello world program example ellipsis linebreak linebreak three backticks no not that delete word delete word backtick backtick backtick c linebreak void main open bracket int argc comma char asterisk argv close bracket space open curly bracket line break tab print-EFF open bracket double-quote percent d backslash n double-quote comma argc close bracket semi-colon line break backtick backtick backtick

Chance of success? Zero. Time taken? Three minutes.

Whether you think it's a good idea or not, there are those of us who can't function without voice programming/input.

This video demonstrates full coding proficiency with my Talon project, which still needs a lot of work (e.g. no auto-spacing yet) - https://youtu.be/ddFI63dgpaI?t=30 - this demo took me around 9min vs a 90wpm typist doing it in a little over 6min. I've also benchmarked inputting over 280 commands per minute with no false recognitions.

This is my project, with which I aim to change minds about voice input (even making it viable for real-time gaming): https://talonvoice.com/

The Perl video is embarrassing, but that voice engine is only designed for dictating English text. voicecode.io has some major architectural limitations (no dynamic grammars, so it requires a Dragon restart to change commands or word lists, and it adds commands to the English vocabulary instead of using a strict command tree, so recognition seriously suffers). aenea, caster, and dragonfly are built on a 90's Python project called natlink and have their own issues. "Continuous command recognition" is pretty much a must, which Tavis Rudd's video lacks (notice the pausing between words).
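The "strict command tree" idea can be sketched in a few lines. The commands below are invented; the point is that a recognizer constrained this way only has to consider the handful of words valid at each position, rather than the entire English vocabulary:

```python
# Toy command tree: each level lists only the words that are legal next.
COMMAND_TREE = {
    "delete": {"word": None, "line": None},
    "go": {"up": None, "down": None},
}

def parse_command(words):
    """Walk the tree, rejecting as soon as a word isn't a valid continuation."""
    node = COMMAND_TREE
    for w in words:
        if node is None or w not in node:
            return False
        node = node[w]
    return node is None  # True only for a complete command
```

Because every prefix narrows the legal continuations, an input like "delete cookies" is rejected outright instead of being force-fit to the nearest-sounding English words.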

That looks great! Both the speech recognition and the eye tracking look really impressive. You should make this into a commercial project or something. Is there any popular application that already does this stuff and is as good (I am not familiar with this type of software)?

I had no idea eye tracking could work this well.

As far as I'm concerned, there's nothing remotely in Talon's ballpark for voice, and these demos are just a sort of raw baseline. Every part of Talon will see significant improvements as I complete more of the upcoming features. Tech like Siri/Alexa/Google Assistant is too natural language processing focused to be reflective of the performance I've already demonstrated from a precise control scheme.

To my knowledge, mine is the only available high-performance eye mousing project that doesn't require a separate head tracker (I only require a $150 Tobii 4C eye tracker, and I do not use their integrated webcam head tracking system as it is not responsive enough).

I'm additionally planning to remove the ~$50-150 Dragon requirement for Mac users in the medium term.

I'm very committed to keeping the baseline software not just free, but a net negative cost over using other free solutions. As far as making it commercial, my current plan is to eventually charge a small subscription fee for extremely well integrated video game control systems, and a moderate subscription for professional input schemes like CADD, A/V products, and entire language models built around natural language programming that would ideally enable people who become injured to continue to work without much of any downtime.

These are some of the competition:


- http://voicecode.io/

- https://github.com/dictation-toolbox/aenea

- https://github.com/synkarius/caster

- https://www.voicebot.net/

Eye/Head Tracking:

- http://precisiongazemouse.com/

- http://iris.xcessity.at/

- https://github.com/trishume/PolyMouse

- https://github.com/trishume/FusionMouse

(huge shout out to trishume, who was very helpful and a big inspiration for my tracking system)

One of my upcoming goals is to get a hands-free world-record game speedrun into the Awesome Games Done Quick stream.

Your speed and accuracy are impressive, but that looks very unpleasant to recite. Human speech tends to have rhythm and cadence, but that sounded to me like reading a phone book. Very precisely spoken, very choppy, no comfortable flow.

First, it doesn't feel unpleasant as I do it. There are a few reasons why it might look that way.

I have very little practice at this point, especially considering I'm still changing my grammars regularly, so the editing video there is more a technical demo of the software's performance in spite of my own awkwardness while learning. That was my first and only take. I'm basically learning to type again, with cache misses as I try to remember commands and keep track of my position (like trying to remember where keys are, or the next note in a song as you pluck it out).

It's also possible the most accurate form of voice input does not sound like natural language. Finally, it's a different mental exercise to copy text that is already written versus writing and editing my own code, so I'm triply out of my element when recording these early videos.

Compare to Tavis Rudd's video from 2013, where he pauses between every command: https://youtu.be/8SkdfdXWYaI?t=1050

I've recorded my own version of his demo today for comparison: https://youtu.be/wt4PR5j7vBE

A voice interface to programming would either necessitate different programming languages, or macros - maybe gestural or vocal - of some sort to take care of the various ordering syntax (like `{`) that languages require. As well as absolutely perfect voice recognition, including adjusting to when you have a sore throat or other problems.

I foresee a problem with maintaining code that has been written with gestures or voice.

How would that work? Would your IDE play back the voice or display animations of the gestures to you?

It just feels as if text is the simplest solution, but I'm open to suggestions.

I agree, just trying to think of how it would have to work if it was ever to exist.

I would suppose that voice/gesture-written code would be stored as text; when reading it back, you could edit via keyboard or via your voice/gesture interface. This comes back to languages written for voice - if this became feasible, a lot of these languages would probably be compiled from voice input down to another, traditional text target.

But like many people here I have a hard time actually believing this would ever be useful.

In seriousness though, while voice input would not be how you do complex programming, you could argue that there will always be more specific and specialized tools for that.

For people who spend most of their time clicking with a mouse or pointing at pictures on a phone screen, I could see voice working with things like IFTTT to create simple "programs" or actions. That's essentially what happens when I say "hey Google, turn on living room lights at 6pm".

Someone needed to program the framework for that to happen, but someone also needed to develop the Arduino IDE that I use when I engage in the very limited hobbyist programming that I do. I get to use a simpler interface because someone more specialized and experienced created it with lower level tools.

In a sense, that "if time=18:00 then set livingroom_lights=on" command is a very simple program entered via voice command.
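As a sketch of how a spoken sentence could be compiled down to that kind of rule, here is a toy parser. The grammar and the rule format are invented for illustration; a real assistant's language understanding is far more flexible:

```python
import re

def rule_from_utterance(utterance):
    """Parse 'turn on|off <device> at <HH:MM>' into a trigger/action rule."""
    m = re.fullmatch(r"turn (on|off) (.+) at (\d{1,2}:\d{2})", utterance)
    if not m:
        return None
    state, device, time = m.groups()
    return {"if": {"time": time}, "then": {device.replace(" ", "_"): state}}

rule_from_utterance("turn on living room lights at 18:00")
# → {"if": {"time": "18:00"}, "then": {"living_room_lights": "on"}}
```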

This one is old, but I don't think things have improved much since then: https://www.youtube.com/watch?v=MzJ0CytAsec

It's worth a watch. Hilarious, despite the bad sound quality.

There are improvements on it :) https://voicecode.io/

and improvements on that: https://talonvoice.com/

I found this to be very ironic based on the time it probably took you to type all this out haha!

Hehe yep but the major difference here is that I was chuckling the whole time I was typing. Can you imagine how bad chuckling would make the end result if you were speaking it? Chance of success: -1 billion.

I won't be switching any time soon, but it's certainly possible: https://www.youtube.com/watch?v=8SkdfdXWYaI

That's why only brain interfaces can replace keyboards.

That's why you use Python.

While you were yucking it up, someone already did it. I'd say it takes about a minute:


Of course, there's plenty of room for improvement. Why not have the assistant understand keywords and perform high-level tasks?


"create function named main"


"Add two int arguments"

> Over 60% of text iMessages are composed using voice

This seems dubious and I'd love to see a source for this claim.

Judging by the error-riddled messages I receive, most of my correspondents using iPhones to send me email or text messages appear to be using this method of text input on their phones.

As a result, I have become fluent in "dictato" much the same as I became fluent in reading typos in the 1980s when I first started using text chat a lot (on dial-up BBS systems).

I dictate, and I've found it's actually a good exercise.

Plotting out a text and plotting out spoken dialogue require different mental gymnastics and it was actually pretty hard to dictate entire sentences without pause until I'd been using it for about two months.

Yes, I simply don’t buy that either. I try comparatively hard to use Siri, and I don’t think I write more than 5% via Siri. It doesn’t work that well even while driving.

I would guess that this statistic is about using the voice button on the keyboard, rather than commanding Siri "send a message to...". I've watched people who do this constantly, and perhaps they send more messages overall as a result of the behavior.

From a single datapoint, senior citizens prefer speaking into the phone instead of typing out long messages.

Anecdatally, it seems to be quite common in Brazil.

100% of iMessages are composed by voice on the Apple Watch, which may be weighting this figure somewhat.

Err, huh? I use the scribble feature (not voice) on the Apple Watch to compose iMessages all the time.

I only use the writing feature on my watch. But I may be weird.

Composed or merely sent as a voice message?

I disagree with the initial premise, i.e., that:

> "We think in a voice in our head. Anyone trying to type has to first put it in a voice in their head before typing. You’re transcribing your inner voice onto the keyboard. When you speak it’s quicker."

Anyone who has heard me speak probably realizes I don't think in voice. I can type as fast as I can talk because words->bits (or words->sounds) isn't the bottleneck. Thoughts->words is.

Feynman [pdf: http://calteches.library.caltech.edu/607/2/Feynman.pdf] noted this, too. Some thinking can't easily be done in words. Some people don't naturally think in words.

There's a TV show about an autistic young doctor which, for all its flaws, tries to show how a person can have great insight even when they can't articulate well. I hope it might help inform the public understanding of other mechanisms of thought.

Spoken language is wonderful as an art form or as a tool but I find it troubling when people with one mode of thinking can present theirs as the only kind, and then use it to push their agenda on others. I suspect this is behind the "open floor plan" office fad, too.

Yes, it seems as bad to me as the similar "We think in a voice in our head. Anyone trying to read has to speak out what is written.", which was considered true until the invention of "silent reading" (https://web.stanford.edu/class/history34q/readings/Manguel/S...)

However, there may be some half-truth in that statement: to type as fast as text spoken at top speed, you must use a stenotype.

I’m not sure that means speaking, in general, _is_ faster than typing, though, as a) it involves the extra handicap of having to process the spoken sound, and b) typing may be faster in the common/average/mean case.

I truly agree with this. I still remember the first time I heard of that idea of thinking in words. It was from my Portuguese teacher (I'm Brazilian) on 6th or 7th grade in school.

I was blown away by that idea so I started observing my own thoughts.

On very few occasions did words come to my mind that described what I was thinking. Most of the time, thoughts were very abstract and indescribable, or simple sensory information (shades of colours whose names I didn't know, smells that I couldn't name or describe even if I wanted to, sounds, etc).

Regardless, there seem to be people who think everybody thinks the same way, and the most vocal ones seem to be the ones who consider thoughts to be 100% language-based.

Apple didn't fumble a five-year lead in voice control, they've just not closed Google's 15 year lead in AI.

The article kind of touches on this topic but misses the point - Google is using data about you to inform its AI. One could argue that the author not realizing the extent to which it does so is more likely a sign they've become very good at it.

Right well Google is preternaturally good at that shit. I want to know Apple's excuse for failing to outperform Amazon.

Yeah fair enough. I'd be interested in knowing that myself.

What makes me sad about Siri on the iPhone is that it would be such a boon if I was able to use it with actually good apps and if I didn't feel like a second-class citizen for having an iPhone outside the US.

I'm forced to use the buggy Reminders app, because Siri only supports creating reminders with Reminders. Apple refuses to support anything else than Apple Music, which means I can't use Siri for music. (Yet, when the 4S was released, it only took weeks until I could control Spotify with Siri using jailbreak). I can't use Siri for navigation either in any form ("Hey Siri, when does the next train leave from here?"), because Apple has never provided public transit or good navigation in Sweden, nor announced a timeline on what year that might happen – and of course, Google Maps can't use Siri. And even if it could, Siri does not gracefully support foreign words such as addresses.

I could easily love Siri if Apple would stop gimping it by tying it to other subpar products.

> Apple refuses to support anything else than Apple Music

I was pretty appalled when I installed the first version of OS X with Siri: the first thing I did was ask it to play a song from my iTunes Match library, and it had no idea what I was talking about because the artist (Kanye) wasn't on Apple Music.

My iPhone 4S I was using at the time could do this feat just fine but the so called latest and greatest can't manage it because Apple wants to force Apple Music on me.

I disabled Siri a few moments after this.

> I'm forced to use the buggy Reminders app, because Siri only supports creating reminders with Reminders.

Siri has third-party reminders support now, I use it with Things.

That is good to know, and I've considered switching to Things for a while. This makes it easier. Thank you!

I do see a lot of people believing the Alexa system is much better, but in the end Alexa is just built on regex.

That is ok-ish for simple languages like English, where the words themselves don't change.

But in many other languages, the words in sentences change depending on their relationship to each other.

E.g., simple information in English like "the cat is sitting on the bench" is encoded in the order of the words, whereas in other languages the information that the `cat` is `sitting` on the bench, rather than the bench `sitting on the cat`, is encoded in how the words `cat`, `sitting`, and `bench` (and maybe `on`) change. This change can be rather complex, based on lots and lots of factors such as the gender of the `cat` and the `bench`, the type of verb, e.g. `sitting`, and the preposition `on` (some languages don't even bother with this, since it's added to the verb).

Apple's approach is much more upfront work, since they do all this natural language logic in every language they support. But for developers who use SiriKit it is much better.
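A rough illustration of the point: a positional pattern (what a regex-driven assistant effectively uses) works for English, but for a free-word-order language you'd have to read the roles off the words themselves. The inflectional endings below ("-u" for subject, "-ol" for location) are entirely invented for the example:

```python
import re

# English encodes grammatical roles by word order, so a fixed pattern works:
def roles_english(sentence):
    m = re.fullmatch(r"the (\w+) is sitting on the (\w+)", sentence)
    return {"sitter": m.group(1), "surface": m.group(2)} if m else None

# In a free-word-order language, the same roles ride on (invented) case
# endings, so both orderings of the nouns mean the same thing:
def roles_inflected(sentence):
    roles = {}
    for word in sentence.split():
        if word.endswith("u"):
            roles["sitter"] = word[:-1]
        elif word.endswith("ol"):
            roles["surface"] = word[:-2]
    return roles
```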

I'm sure it's fixed now, but a few months ago, right in the middle of the whole "Siri is a complete failure, Alexa is much better" narrative, I asked Alexa to "turn all my lights off" and she responded with a very daft "I'm sorry, but you don't have a device called 'All my lights' on your account".

Siri knew how to handle that just fine.

Notably, unlike Xerox, which developed the Alto and its technologies in-house and let them rot, Apple launched Siri through acquisition of a fully functioning application which was already distributed through the App Store. It's more like Flickr.

There are some weird sweeping statements in this piece that I don't think represent reality, but I do agree that it seems Apple held Siri back from what it should have been, as a platform with a third party API, because it was afraid it would bypass Apple's control of the overall iOS platform somehow. Fatal mistake, and they lost a huge number of Siri's team by letting it rot the way they have.

> Fatal mistake, and they lost a huge number of Siri's team by letting it rot the way they have.

It's not fatal; let's not be melodramatic here.

I suspect they know they screwed up and have taken steps to correct. Remember that Siri on iOS, the watch, macOS and AppleTV were all different; they're working on a unified Siri across platforms so that the things Siri knows how to do aren’t in silos.

We’re already starting to see the results of this: the new “read me the news” works on iOS 11.2.5 and also on HomePod: https://www.macrumors.com/2018/01/23/apple-releases-ios-11-2...

It's not a good sign that they've lost so much of the original talent on the project.

The WSJ had a very public exposé on the topic over the summer: https://www.wsj.com/articles/apples-siri-once-an-original-no...

I don't know of any good redemption narratives for tech products that lose their founding teams after acquisition and exist in the headless/nebulous state that Siri appears to be in now — outside of the Macintosh itself, and even that required hiring back its founder and his spinoff product.

"Voice first" seems like a misguided philosophy pushed by people who think that we're due for some sort of "revolution."

Voice is important but mostly for narrow use cases (in the car, music control in the house) or for shortcuts ("Siri, open accessibility settings").

Though I can imagine some kinds of routine, tedious computer tasks that could be automated with voice and AI someday--"Computer, convert all the files in this folder to jpegs and then put today's date at the start of their filenames, then move them to Dropbox and share them with my wife"--stuff like that. But it seems like for all that to happen we need something approaching general AI. Most voice assistants just move you along predetermined paths.
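For a sense of what such a request would have to compile down to, here is a minimal sketch of just the rename step, in Python. The jpeg conversion and the Dropbox/share steps are omitted, and the function itself is hypothetical:

```python
from datetime import date
from pathlib import Path

def prefix_files_with_date(folder):
    """Rename every file in `folder` to start with today's ISO date."""
    prefix = date.today().isoformat()
    renamed = []
    for f in sorted(Path(folder).iterdir()):
        if f.is_file():
            target = f.with_name(f"{prefix}_{f.name}")
            f.rename(target)
            renamed.append(target.name)
    return renamed
```

The gap between "things a short script can do" and "things a voice assistant can be told to do" is exactly what a general AI would have to close.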

One place where voice might prove very useful is to get a generation of illiterate people onto the internet. This article by WSJ talks about uneducated porters in India who rely on voice to search, including a job ad app that can be used by voice: https://www.wsj.com/articles/the-end-of-typing-the-internets...

Exactly. "The technology isn't there yet" and "I wouldn't be comfortable using voice control" are said from a place of privilege. There are people out there benefiting from it right now, despite some people's very first-world hangups about it.

>>Apple is also fumbling the future—the Voice First future. Voice First simply means our primary mode of interacting with computers in the future will be with our voice.

Oh god please no. Interacting with computers using voice is probably the second worst way of doing it, just after using motion gestures. There are some good uses for it (although I think even voice control for lights is pushing it; it's more pain than it's worth), but having it as the primary mode of operation? No thanks.

What do people actually use voice for besides playing music, smart home, transcribing and "widget" stuff like timers and weather?

Are there lots of people out there actually using it for productivity like scheduling, communicating, researching, comparing?

To me it seems like until voice AI gets smart enough to respond to a query like "Alexa, how will our Q4 projected net profits change if I switch widget vendors from AlphaCorp to BetaCorp?" an interactive screen will be needed.

Companies see the Cambrian explosions of growth (and equities appreciation) that follow successful UI shifts and want to reproduce that with voice UI.

You could do a hell of a lot by establishing a framework for third parties to use such that they don’t stomp upon each other and the system-reserved words/phrases, as well as homophone rejection/discrimination, then brute-forcing lots of recognized cases. Right now, I’m seeing second-system effects in voice UI efforts by everyone (possibly excepting the dedicated voice recognition outfits like Nuance), by trying to apply a general machine learning approach to all of it and eschewing brute-force as not pure/clean enough.

Well, Alexa seems to be going the brute force route--using third-party skills is just like using a spoken command line utility. The problem is, while this approach is simple and it works, very few skills are useful.

I'm probably living in a bubble, but I don't understand the current focus on all of these voice-operated AI things.

For one, I don't see anyone using them. I've never seen anyone use Siri on a phone, with one exception: I have a friend who uses voice transcription to write emails and text messages. (Not sure if that counts as Siri.) I don't know anyone who uses speakers such as Amazon Echo or Google Home, or mentions wanting them. I've used Siri on my phone a few times to try out silly things like asking about the weather or converting units, but it seems so obviously bad that I've never tried going "Siri first". It's on my Apple TV, but I've not used it. Are people actually using this stuff as much as this article (and all the current focus on AI-powered speakers etc.) implies?

I'd love true AI -- a personal assistant that actually understood me, knew my tastes, remembered my choices and so on -- but that doesn't exist. So we get an uncanny valley where speaking to a device comes with uncertainty about whether the device will actually be able to service the request (see the numerous annoyed threads on Reddit about Siri not understanding incredibly basic things), requiring the user to carry around with them a mental model of the receiver's inadequacies. And we get that problem where many tasks that would be super useful to have done by an assistant are actually too difficult for a current AI. For example, booking a trip. I've never booked a flight that didn't involve poring over pages of results, weighing the various compromises (optimizing for price vs departure time vs airport distance vs layovers vs airline crappiness etc.) and of course filling in a whole lot of information about myself. Let alone booking holidays (I spend many days on research for such a task). Bus and train tickets -- maybe, if it's a simple route with few variables. Dinner reservations and movie tickets, sure. Home automation, well. Once a home is fully outfitted with all lights, doors etc. being connected to a home automation system -- maybe. None of this is automated in my apartment, and I've never seen a home that is set up like this.

There's also the inherent limited utility that comes with an AI being unintelligent. For example, my Apple TV. I rarely know what I want to watch. Only rarely would I be able to say "Siri, play The Grand Budapest Hotel". Usually it would have to be something like "Siri, list me some sci-fi movies". But then I would still have to navigate the results. If it could respond to "Siri, show me French movies with an IMDb score of 7.0 or higher, or with a four-star review from Roger Ebert, that came out after 1990", well, now I'm interested. But somehow I doubt that those sort of multi-dimensional cross-referencing mechanics are currently available.

Edit: With the comments saying how good Google Assistant is compared to Siri, I tried it out. My first queries failed ("What is the weather in Washington State like in March") because it thought I was finished speaking and cut me off at "What is the weather", and showed me today's weather in my city. After a few attempts I got results, but then it returned Fahrenheit units (my phone is set to use metric/international units), and then it continued giving me Fahrenheit after I changed it, because apparently climate information is a "web search" returning text, not data: https://imgur.com/gallery/LI8Sc. Not so impressive. Meanwhile, Siri on the phone doesn't understand the query at all, ignoring the "in March" part and giving me today's weather, no matter how I phrase the question.

I can see a future where AI voice control is incredible, but it requires an enormous advance, and it seems like even the best AI of today just comes with frustration and failure. That's surely why companies like Apple and Google are working on this, but it doesn't explain, to me, why the current voice AI is being pushed so hard.

I'm a software engineer. I code in vim and use Ubuntu on my desktop which I built myself. I say all of this to give the reader context for how I tend to interact with tech. I'm 24.

I do a lot of texting, and a lot of the time it's a mixture between plain old thumb texting and swiping for longer words. But about 30% of the time I use voice to text.

I use Google's assistant and the voice-to-text is excellent. It even uses context to go back and correct ambiguities (synonyms) and, even cooler, mistakes. If I say "I baked a match of cookies", it will first transcribe "match", but as soon as I say "cookies" it will go back in real time and change "match" to what is obviously correct -- "batch". Note this doesn't work with direct assistant commands ("okay Google, what time is it?"), but it definitely does work with text messages you send while talking to the assistant ("okay Google, send Brandon a text message: I'm baking a match of cookies" gets corrected). This helps with stuttering or mistakes you make while speaking.
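Presumably (this is a toy sketch of the general idea, not Google's actual pipeline) the recognizer keeps acoustically confusable alternatives for each word and rescores them against a language model as later context arrives, which is why "match" can flip to "batch" only after "cookies" is heard:

```python
# Toy context-based transcript revision: keep acoustically plausible
# alternatives for a word, then pick whichever one a tiny bigram
# "language model" scores highest given the surrounding words.
# All bigram scores below are made up for illustration.
BIGRAM = {
    ("a", "match"): 0.4, ("a", "batch"): 0.4,
    ("match", "of"): 0.1, ("batch", "of"): 0.3,
    ("of", "cookies"): 0.5,
}

def score(words):
    """Product of bigram scores over the sequence (small floor for unseen pairs)."""
    total = 1.0
    for a, b in zip(words, words[1:]):
        total *= BIGRAM.get((a, b), 0.01)
    return total

def revise(prefix, alternatives, suffix):
    """Pick the alternative that best fits both the left and right context."""
    return max(alternatives, key=lambda w: score(prefix + [w] + suffix))

print(revise(["a"], ["match", "batch"], []))                 # tie; first alternative wins
print(revise(["a"], ["match", "batch"], ["of", "cookies"]))  # -> batch
```

With no right-hand context the two candidates tie, but once "of cookies" arrives, "batch of cookies" outscores "match of cookies" and the earlier word gets revised.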

It's certainly a bit of a mental shift to get used to speaking text messages. Saying punctuation marks out loud is awkward, but it's definitely at the point where it's convenient.

Transcription is a hard problem, but it's pretty close to solved. The unsolved problem is telling a voice assistant to do something useful and having it do it. An AI assistant would be something that could act on "Book me a trip to Tokyo and inform the relevant people" -- not a spoken command line, not invoking some API by voice, and not laboriously filling out a form by talking to a computer.

On the other hand, in my experience most of the people who complain about voice-operated AI being useless turn out to be using Siri. Google Assistant works really well for me, and I use it all the time (on my phone and with Google Home). I find it amazingly useful to be able to ask my phone the time or my next meetings while I'm shaving, taking a bath, or cooking, or when I'm just feeling too lazy to get up and pick up the phone. It's also useful for setting reminders, which I do a lot. I also send WhatsApp/text messages with it, and play YouTube videos or Google Play Music just by asking for the song I want. Sure, sometimes it doesn't recognize me, but most of the time it does. It adds incredible value for me. Just being able to ask a question in my natural voice, without manually unlocking with a PIN and so on, is so useful.

That's what I thought, until I witnessed people in my family using 1. Google voice and 2. Alexa (and raving about them).

I think tech people just find typing much faster and voice interfaces kind of annoying, but a lot of more 'average' users really do like the voice interface better.

I wouldn't have predicted it, but voice is here to stay, and it has long legs. It seems that casual home use and in-car use (where the next battle will be) are the best fits for it -- not mobile use while out.

How do people use those things?

Yeah, it doesn't work well for me either. However my 6-year old son uses the Google assistant on my phone a lot. "Lego Ninjago Kai video please", "Play". "Why do dogs have tails?" That kind of thing.

I think it works brilliantly for certain narrow domains -- answering factual questions, directions, distances, top-rated attractions, etc. -- and it's quicker than typing and scrolling. But I agree the AI behind it has yet to demonstrate any real broad understanding: assimilating various contrasting opinions from text, providing context, and so on.

I also personally don't want to sit on a train, on a couch, or in a dentist's waiting room listening to people talking to their phones or watches.

> I don't understand the current focus on all of these voice-operated AI things

> I can see a future where AI voice control is incredible

It's not going to work overnight. I think it's great that Apple integrated Siri into their product even though it wasn't very useful. There has been great progress in just a few years. I was skeptical too, but I use it more and more, for things like "remind me to ..." and "play ...". I also use Google voice quite often with my Android TV box.

That being said, I'm always much faster with a keyboard. I prefer a smart text interface that understands basic actions, fuzzy search and so on.

I use voice transcription out of necessity because my touch-screen is broken.

It would be pretty nice if it actually understood what I was saying, instead of sometimes randomly capitalizing words (and not offering a lowercase alternative when you press the word), and occasionally outputting something wildly phonetically different from its input.

I've come across several phrases that simply will not translate and must be entered manually.

Android, FWIW

I'm more in your camp, but I bought my dad an iPad for Xmas and Siri has made all the difference with him. He's very smart but not computer literate, so the typing, tapping, apps, settings, app store—all that just confused him. Being able to control the iPad with voice really opened it up to him. Though I do get some weird emails now :-)

When I wake up in the middle of the night, I ask Siri what time it is, so I don't have to open my eyes.

That's about it.

Siri, Echo, and the other voice-activated command lines without screens are not meant to be used by muggles; they fit the needs of Silicon Valley engineers and no one else. Maybe they can retrofit some parts into muggle-worthy products some day.

Can be useful when you're looking after kids -- you're carrying a child, or stuck away from your phone.

I'd guess voice assistance could be useful in other situations too. But for me the error rate still seems too high, and the applications are limited.

Eh, this seems to be missing one important detail. At least it's massively important to me as a potential buyer of an AI home assistant.

That detail is that the HomePod doesn't spy on me. It does not upload audio recordings of my home, and it even shows a visual indication whenever it is listening at all.

Doing speech recognition on the device is way harder, so it's no surprise to see them come to market later than others. However, this makes Apple's product the only viable one on the market for me.

This is about much more than just voice control. If I got it right, much of its added value could be realized by just typing -- or, more generally, by conversational text generated in any other way, of which online voice-to-text is one instance.

Also interesting: just as structuring everything around Windows hurt Microsoft, Apple seems to be falling into the same trap with the iPhone.

95% trained accuracy was already there by the late eighties.

97% by the mid-nineties.

98% by the late nineties -- and I still remember how wonderful the voice recognition was on old black-and-white Ericsson phones and Sony PDAs of that era.

And then things stalled at a claimed 98.5%, without much improvement in marginal utility.

The fact that misrecognition is there, and will stay there, removes much of the utility from any application that lacks a constrained dictionary or whose actions have side effects.

And yes, people talking to nobody feels and looks creepy. It almost begs others to come point a finger at you: "hey look, a computer creep is talking to his phone."

I'd still like to see someone make a "Her"-type OS, even just as a prototype/demo piece touching on new functionality. Currently Siri indeed doesn't offer much, but even Google and Alexa aren't yet at the level where I'm likely to use them for much more than setting a timer or putting on some music in the long run...

It's not that people aren't trying to do that, it's just insanely hard. That is the end goal for all virtual assistants, but it's still years (I would even say decades) away.

I'd love to see such technologies developed more, but the focus needs to be on accuracy and speed. No current systems I've tried excel in those areas when used by people with accents. I've seen others code by voice, but for me, the voice recognition systems are about fifty percent accurate most of the time, aka useless.

Here's an automated transcript of the podcast referenced in the post, in case you want to skim through it.


> People are listening to music and setting timers, but they are also getting things done.

I'd like some examples of those things getting done.

> Amazon has 12,000 people working in Alexa.

Is this true?
