There would be a lot of noise pollution everywhere, with people constantly murmuring to themselves, alternating with unresponsive silence while listening to the device's answer. It would seem like everybody is on the phone, all the time. Also, for me, digesting information via text is way easier, because I can jump back and forth to support short-term memory - good luck doing that with a voice assistant.
The iPhone (and friends) disrupted the PC because they made reading and writing mobile. Voice assistants are not iterating on that paradigm; rather, they are iterating on phone calls. I'm skeptical that this is suited to replace the iPhone I'm writing this on.
And I can see a very obvious reason why: it's easier. Typing is annoying and slow, and you make lots of mistakes. People don't care that others are listening. When was the last time you memorized something irrelevant that someone next to you was saying on the phone?
Maybe you don't want a single word of what you talk about to be heard by anyone else. But most other people don't care, and the reality is that most people around you are too preoccupied with their own lives to pay any attention.
As you say, a big win for text is that you can quickly skim it, ignore the parts you don't deem relevant, and essentially reserve working/short-term memory for the information you are most interested in. So if the device responds to you in voice, even if it's an entirely human-like experience, that's still inferior (at least in terms of speed) to the text-based experience.
Even for using voice as an input-only mechanism, it's still going to contain lots of redundant/useless information and verbosity.
Compare "OK Google, what is the price of bitcoin today?" vs googling "bitcoin price".
- "Number for city of [myCity] utilities"
- "[what's the current] bitcoin price"
- "what's my checking account balance"
- "remind me on the first to change the air filter every [X] months"
Etc. Essentially I'd use this when I don't want to bother people around me. I'll be eager to use a voice assistant when it becomes convenient for me. Siri isn't there yet, but my Alexa interactions are damn-near close already (no experience with Google or others). If this saves me from opening up a Google tab, or some little-used app, then I'm all for it.
What Alexa considers Utterances can simply be interpreted as text instead of speech[-to-text]. The testing tools for Alexa Skills already allow this. Hell, I developed a skill with a strong user base without an Echo device at all.
I use Alexa/Amazon in this case because I've used it first-hand. Any other AI would work just as well (as talking directly to it).
It feels like the reason why some of these voice systems actually have utility is because they are usable in scenarios where you typically can't/don't want to type (such as while driving, lying on the couch at your house, etc). When you require somebody to type in their query, I'm struggling to see why an Alexa/AI chat thread would perform better than just a search engine with AI features. Or how, practically, it's different than just a "thread" where what you type goes to a search engine.
Ehm, what? Siri was initially released in 2011, and Google Assistant, I think, in 2016. So yes, there's a 5-year difference between Siri and a product called "Google Assistant" - but before that Google had "Google Now", which was also controlled by voice, and many years before that Google pioneered voice control with Google Voice Search and other apps.
So let's get the facts right... Apple didn't have a 5 year lead, Apple was so far behind that they were forced to buy another product in order to even have a chance to catch up!
Oh please god no full stop imagine how hard it would be to even write a simple hello world program example ellipsis linebreak linebreak three backticks no not that delete word delete word backtick backtick backtick c linebreak void main open bracket int argc comma char asterisk argv close bracket space open curly bracket line break tab print-EFF open bracket double-quote percent d backslash n double-quote comma argc close bracket semi-colon line break backtick backtick backtick
Chance of success? Zero. Time taken? Three minutes.
This video demonstrates full coding proficiency with my Talon project, which still needs a lot of work (e.g. no auto-spacing yet) - https://youtu.be/ddFI63dgpaI?t=30 - this demo took me around 9min vs a 90wpm typist doing it in a little over 6min. I've also benchmarked inputting over 280 commands per minute with no false recognitions.
This is my project, with which I aim to change minds about voice input (even making it viable for real-time gaming): https://talonvoice.com/
The Perl video is embarrassing, but that voice engine is only designed for dictating English text. voicecode.io has some major architectural limitations (no dynamic grammars, so it requires a Dragon restart to change commands or word lists, and it adds commands to the English vocabulary instead of a strict command tree, so recognition seriously suffers); aenea, caster, and dragonfly are built on a '90s Python project called natlink and have their own issues. "Continuous command recognition" is pretty much a must, which Tavis Rudd's video lacks (notice the pausing between words).
I had no idea eye tracking could work this well.
To my knowledge, mine is the only available high-performance eye-mousing project that doesn't require a separate head tracker (it only requires a $150 Tobii 4C eye tracker, and I don't use Tobii's integrated webcam head tracking, as it isn't responsive enough).
I'm additionally planning to remove the ~$50-150 Dragon requirement for Mac users in the medium term.
I'm very committed to keeping the baseline software not just free, but a net negative cost over using other free solutions. As far as making it commercial, my current plan is to eventually charge a small subscription fee for extremely well integrated video game control systems, and a moderate subscription for professional input schemes like CADD, A/V products, and entire language models built around natural language programming that would ideally enable people who become injured to continue to work without much of any downtime.
These are some of the competition:
(huge shout out to trishume, who was very helpful and a big inspiration for my tracking system)
One of my upcoming goals is to get a hands-free world-record game speedrun into the Awesome Games Done Quick stream.
I have very little practice at this point, especially considering I'm still changing my grammars regularly, so the editing video there is more a technical demo of the software's performance in spite of my own awkwardness while learning. That was my first and only take. I'm basically learning to type again, with cache misses as I try to remember commands and keep track of my position (like trying to remember where keys are, or the next note in a song as you pluck it out).
It's also possible the most accurate form of voice input does not sound like natural language. Finally, it's a different mental exercise to copy text that is already written versus writing and editing my own code, so I'm triply out of my element when recording these early videos.
Compare to Tavis Rudd's video from 2013, where he pauses between every command: https://youtu.be/8SkdfdXWYaI?t=1050
I've recorded my own version of his demo today for comparison: https://youtu.be/wt4PR5j7vBE
How would that work? Would your IDE play back the voice, or display animations of the gestures to you?
It just feels as if text is the simplest solution, but I'm open to suggestions.
I would suppose that voice/gesture-written code would be stored as text; when reading it, you could edit via keyboard or via your voice/gesture interface. This comes back to languages designed for voice - if this became feasible, a lot of these languages would probably be compilers from voice to a traditional text target.
But like many people here I have a hard time actually believing this would ever be useful.
For people who spend most of their time clicking with a mouse or pointing at pictures on a phone screen, I could see voice working with things like IFTTT to create simple "programs" or actions. That's essentially what happens when I say "hey Google, turn on living room lights at 6pm".
Someone needed to program the framework for that to happen, but someone also needed to develop the Arduino IDE that I use when I engage in the very limited hobbyist programming that I do. I get to use a simpler interface because someone more specialized and experienced created it with lower level tools.
In a sense, that "if time=18:00 then set livingroom_lights=on" command is a very simple program entered via voice command.
It's worth a watch. Hilarious, despite the bad sound quality.
Of course, there's plenty of room for improvement. Why not have the assistant understand keywords and perform high-level tasks?
"create function named main"
"add two int arguments"
This seems dubious and I'd love to see a source for this claim.
As a result, I have become fluent in "dictato" much the same as I became fluent in reading typos in the 1980s when I first started using text chat a lot (on dial-up BBS systems).
Plotting out a text and plotting out spoken dialogue require different mental gymnastics and it was actually pretty hard to dictate entire sentences without pause until I'd been using it for about two months.
> "We think in a voice in our head. Anyone trying to type has to first put it in a voice in their head before typing. You’re transcribing your inner voice onto the keyboard. When you speak it’s quicker."
Anyone who has heard me speak probably realizes I don't think in voice. I can type as fast as I can talk because words->bits (or words->sounds) isn't the bottleneck. Thoughts->words is.
Feynman [pdf: http://calteches.library.caltech.edu/607/2/Feynman.pdf] noted this, too. Some thinking can't easily be done in words. Some people don't naturally think in words.
There's a TV show about an autistic young doctor which, for all its flaws, tries to show how a person can have great insight even when they can't articulate well. I hope it might help inform the public understanding of other mechanisms of thought.
Spoken language is wonderful as an art form or as a tool but I find it troubling when people with one mode of thinking can present theirs as the only kind, and then use it to push their agenda on others. I suspect this is behind the "open floor plan" office fad, too.
However, there may be some half-truth in that statement: to type as fast as text spoken at top speed, you must use a stenotype.
I’m not sure that means speaking, in general, _is_ faster than typing, though, as a) it involves the extra handicap of having to process the spoken sound, and b) typing may be faster in the common/average/mean case.
I was blown away by that idea so I started observing my own thoughts.
On very few occasions did words come to mind that described what I was thinking. Most of the time, thoughts were very abstract and indescribable, or simple sensory information (shades of colours whose names I didn't know, smells I couldn't name or describe even if I wanted to, sounds, etc.).
Regardless, there seem to be people who think everybody thinks the same way, and the most vocal ones seem to be the ones who consider thought to be 100% language-based.
The article kind of touches on this topic but misses the point - Google is using data about you to inform its AI. One could argue that the author's not realizing the extent to which it does so is more likely a sign they've become very good at it.
I'm forced to use the buggy Reminders app, because Siri only supports creating reminders with Reminders.
Apple refuses to support anything else than Apple Music, which means I can't use Siri for music. (Yet, when the 4S was released, it only took weeks until I could control Spotify with Siri using jailbreak).
I can't use Siri for navigation either in any form ("Hey Siri, when does the next train leave from here?"), because Apple has never provided public transit or good navigation in Sweden, nor announced a timeline on what year that might happen – and of course, Google Maps can't use Siri. And even if it could, Siri does not gracefully support foreign words such as addresses.
I could easily love Siri if Apple would stop gimping it by tying it to other subpar products.
I was pretty appalled when I installed the first version of OS X with Siri: the first thing I did was ask it to play a song from my iTunes Match library, and it had no idea what I was talking about because the artist (Kanye) wasn't on Apple Music.
My iPhone 4S I was using at the time could do this feat just fine but the so called latest and greatest can't manage it because Apple wants to force Apple Music on me.
I disabled Siri a few moments after this.
Siri has third-party reminders support now, I use it with Things.
That is OK-ish for simple languages like English, where the words themselves don't change.
But in many other languages, the words in a sentence change depending on their relationship to each other.
E.g., simple information in English like "the cat is sitting on the bench" is encoded in the order of the words, whereas in other languages the information that the `cat` is `sitting` on the bench, rather than the bench sitting on the cat, is encoded in how the words `cat`, `sitting`, and `bench` (and maybe `on`) change. This change can be rather complex, depending on lots and lots of factors such as the gender of the `cat` and the `bench`, the type of verb (e.g. `sitting`), and the preposition `on` (some languages don't even bother with a separate preposition, since it's folded into the verb).
Apple's approach is much more upfront work, since they do all this natural-language logic for every language they support. But for developers who use SiriKit it is much better.
Siri knew how to handle that just fine.
There are some weird sweeping statements in this piece that I don't think represent reality, but I do agree that Apple seems to have held Siri back from what it should have been - a platform with a third-party API - because it was afraid Siri would somehow bypass Apple's control of the overall iOS platform. Fatal mistake, and they lost a huge portion of Siri's team by letting it rot the way they have.
It's not fatal - let's not be melodramatic here.
I suspect they know they screwed up and have taken steps to correct. Remember that Siri on iOS, the watch, macOS and AppleTV were all different; they're working on a unified Siri across platforms so that the things Siri knows how to do aren’t in silos.
We’re already starting to see the results of this: the new “read me the news” works on iOS 11.2.5 and also on HomePod: https://www.macrumors.com/2018/01/23/apple-releases-ios-11-2...
The WSJ had a very public exposé on the topic over the summer: https://www.wsj.com/articles/apples-siri-once-an-original-no...
I don't know of any good redemption narratives for tech products that lose their founding teams after acquisition and exist in the headless/nebulous state that Siri appears to be in now — outside of the Macintosh itself, but it had to hire back its founder and spinoff product.
Voice is important but mostly for narrow use cases (in the car, music control in the house) or for shortcuts ("Siri, open accessibility settings").
Though I can imagine some kinds of routine, tedious computer tasks that could be automated with voice and AI someday--"Computer, convert all the files in this folder to jpegs and then put today's date at the start of their filenames, then move them to Dropbox and share them with my wife"--stuff like that. But it seems like for all that to happen we need something approaching general AI. Most voice assistants just move you along predetermined paths.
Oh god please no. Interacting with computers using voice is probably the second worst way of doing it, just after using motion gestures. There are some good uses for it (although I think even voice control for lights is pushing it; it's more pain than it's worth), but having it as a primary mode of operation? No thanks.
Are there lots of people out there actually using it for productivity like scheduling, communicating, researching, comparing?
To me it seems like until voice AI gets smart enough to respond to a query like "Alexa, how will our Q4 projected net profits change if I switch widget vendors from AlphaCorp to BetaCorp?" an interactive screen will be needed.
You could do a hell of a lot by establishing a framework for third parties to use such that they don’t stomp upon each other and the system-reserved words/phrases, as well as homophone rejection/discrimination, then brute-forcing lots of recognized cases. Right now, I’m seeing second-system effects in voice UI efforts by everyone (possibly excepting the dedicated voice recognition outfits like Nuance), by trying to apply a general machine learning approach to all of it and eschewing brute-force as not pure/clean enough.
For one, I don't see anyone using them. I've never seen anyone use Siri on a phone, with one exception: I have a friend who uses voice transcription to write emails and text messages. (Not sure if that counts as Siri.) I don't know anyone who uses speakers such as Amazon Echo or Google Home, or mentions wanting them. I've used Siri on my phone a few times to try out silly things like asking about the weather or converting units, but it's seems so obviously bad that I've never tried going "Siri first". It's on my Apple TV, but I've not used it. Are people actually using this stuff as much as this article (and all the current focus on AI-powered speakers etc.) implies?
I'd love true AI -- a personal assistant that actually understood me, knew my tastes, remembered my choices and so on -- but that doesn't exist. So we get an uncanny valley where speaking to a device comes with uncertainty about whether the device will actually be able to service the request (see the numerous annoyed threads on Reddit about Siri not understanding incredibly basic things), requiring the user to carry around with them a mental model of the receiver's inadequacies. And many tasks that would be super useful to have done by an assistant are actually too difficult for a current AI. For example, booking a trip. I've never booked a flight that didn't involve poring over pages of results, weighing the various compromises (optimizing for price vs departure time vs airport distance vs layovers vs airline crappiness etc.) and of course filling in a whole lot of information about myself. Let alone booking holidays (I spend many days on research for such a task). Bus and train tickets -- maybe, if it's a simple route with few variables. Dinner reservations and movie tickets, sure. Home automation, well. Once a home is fully outfitted with all lights, doors etc. connected to a home automation system -- maybe. None of this is automated in my apartment, and I've never seen a home that is set up like this.
There's also the inherent limited utility that comes with an AI being unintelligent. For example, my Apple TV. I rarely know what I want to watch. Only rarely would I be able to say "Siri, play The Grand Budapest Hotel". Usually it would have to be something like "Siri, list me some sci-fi movies". But then I would still have to navigate the results. If it could respond to "Siri, show me French movies with an IMDb score of 7.0 or higher, or with a four-star review from Roger Ebert, that came out after 1990", well, now I'm interested. But somehow I doubt that those sorts of multi-dimensional cross-referencing mechanics are currently available.
Edit: With the comments saying how good Google Assistant is compared to Siri, I tried it out. My first queries failed ("What is the weather in Washington State like in March") because it thought I was finished speaking and cut me off at "What is the weather", and showed me today's weather in my city. After a few attempts I got results, but then it returned Fahrenheit units (my phone is set to use metric/international units), and then it continued giving me Fahrenheit after I changed it, because apparently climate information is a "web search" returning text, not data: https://imgur.com/gallery/LI8Sc. Not so impressive. Meanwhile, Siri on the phone doesn't understand the query at all, ignoring the "in March" part and giving me today's weather, no matter how I phrase the question.
I can see a future where AI voice control is incredible, but it requires an enormous advance, and it seems like even the best AI of today just comes with frustration and failure. That's surely why companies like Apple and Google are working on this, but it doesn't explain, to me, why the current voice AI is being pushed so hard.
I do a lot of texting, and a lot of the time it's a mixture between plain old thumb texting and swiping for longer words. But about 30% of the time I use voice to text.
I use Google's assistant and the voice to text is excellent. It even uses context to go back and correct ambiguities (synonyms) and even cooler -- mistakes. If I say "I baked a match of cookies", it will first transcribe match, but as soon as I say cookies, it will in real-time go back and change match to what is obviously correct -- batch. Note this doesn't work with direct assistant commands ("okay Google what time is it?"),but it definitely does work with text messages you send while talking to the assistant ("okay Google, send Brandon a text message: I'm baking a match of cookies" gets corrected). This helps with stuttering or mistakes you make while speaking.
It's certainly a bit of a mental shift to get used to speaking text messages. Saying punctuation marks out loud is awkward, but it's definitely at the point where it's convenient.
I think tech people just find typing much faster and voice interfaces kind of annoying, but a lot of more 'average' users really do like the voice interface better.
I wouldn't have predicted it, but voice is here to stay, and it has long legs. It seems that casual home use and in-car use (where the next battle will be) are its best applications - not mobile use while out.
I think it works brilliantly for certain narrow domains - answering factual questions, directions, distances, top rated attractions etc. And is quicker than typing and scrolling. But I agree the AI behind it has yet to demonstrate any real broad understanding - assimilating various contrasting opinions from text, providing context etc.
I also personally don't want to sit on a train, couch or a dentist waiting room listening to people talking to their phones or watches.
> I can see a future where AI voice control is incredible
It's not going to work overnight. I think it's great that apple integrated Siri in their product even though it wasn't very useful. There has been great progress in just a few years. I was skeptical too but I use it more and more. Things like "remind me to ...". "Play ...". I also use google voice quite often with my android TV box.
That being said, I'm always much faster with a keyboard. I prefer a smart text interface that understands basic actions, fuzzy search and so on.
It would be pretty nice if it actually understood what I was saying, instead of sometimes randomly capitalizing words (and not offering a lowercase alternative when you press the word), and occasionally outputting something wildly phonetically different from its input.
I've come across several phrases that simply will not translate and must be entered manually.
That's about it.
I’d guess voice assistance could be useful in other situations too. But for me the error rate still seems too high, and there are limited applications.
That detail is that HomePod doesn't spy on me. It does not upload audio recordings of my home and it even shows a visual indication when it is listening at all.
Doing speech recognition on the device is much harder, so it's no surprise to see Apple later to market than others. But this makes Apple's product the only viable one on the market for me.
Also interesting that just like structuring everything around Windows hurt Microsoft, Apple seems to be falling into this with the iPhone.
97% by the mid-nineties.
98% by the late nineties - and I still remember how wonderful the voice recognition was on old black-and-white Ericsson phones and Sony PDAs of that era.
And then things stalled at a claimed 98.5%, without much improvement in marginal utility.
The fact that misrecognition is there, and always will be, removes much of the utility from any application that lacks a constrained dictionary or whose commands have side effects.
And yes, people talking to nobody feels and looks creepy. It almost begs others to come and point a finger at you, saying "hey look, a computer creep is talking to his phone."
I'd like some examples of those things getting done.
Is this true?