This would solve so many problems. Sometimes I like to think a bit about what level to set the lights at, but by that time the machine thinks I'm done with my request.
We need richer interaction models, and this is an area where it'd be nice if we could focus on programmers first to find the appropriate mechanisms.
I think modal systems that set contexts would also be amazing.
"Robot, let's adjust the lights"
Which could set the system into dealing only with lighting-related requests.
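A modal context like that could be a very small piece of state. This is a hypothetical sketch (the class and intent names are invented, not any real assistant API) of a session that, once "let's adjust the lights" opens a lighting mode, only accepts lighting-related intents until the user exits:

```python
# Illustrative intent set for the "lights" mode
LIGHTING_INTENTS = {"set_brightness", "set_color", "lights_off"}

class ModalSession:
    def __init__(self):
        self.mode = None          # None means "accept any intent"
        self.allowed = set()

    def enter_mode(self, mode, allowed_intents):
        # "Robot, let's adjust the lights" would land here
        self.mode = mode
        self.allowed = allowed_intents

    def exit_mode(self):
        self.mode = None

    def accepts(self, intent):
        # In a mode, out-of-scope intents are rejected instead of guessed at
        return self.mode is None or intent in self.allowed

session = ModalSession()
session.enter_mode("lights", LIGHTING_INTENTS)
print(session.accepts("set_brightness"))  # True
print(session.accepts("play_music"))      # False while in lights mode
```

The nice side effect is that ambiguous phrases ("a bit warmer") can be resolved against the active mode instead of the whole command space.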
Do you find that different types of users interact with the product differently? Children, engineers, laypeople...?
Computer, set lights at... 50%. Over.
BTW, always on the lookout for an offline only voice activated clock with multiple timers. wifi for NTP only is also acceptable; no cloud allowed. 98% of what we use the voice assistants for is task timers; it's nice to have those voice activated so everyone knows they were set.
I really wish these text-to-speech interfaces would begin notifications with an unobtrusive ding, wait a moment so you can stop talking... or, even better, listen and slightly delay their notice so they're not talking over you if at all possible.
That's not to mention the overly verbose, repetitive, and late instructions whenever you're taking some ramp from one highway to another. God forbid you try to have a conversation during that.
I say "try" because I've failed at this myself a couple of times. But it may help.
I wish there was a level of detail slider for the navigation system, because by default I think it's overly detailed and verbose in many instances. An option to skip the first n instructions or any instructions within an x mile radius of my home would be nice too. I don't need the navigation to tell me how to get to the major highways near my home, just which one to go to.
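The "skip instructions near home" option could be a simple distance filter. This is a sketch under assumed data shapes (the prompt records and coordinates are made up), dropping any turn-by-turn prompt within a configurable radius of a saved home location:

```python
import math

def km_between(a, b):
    # Haversine great-circle distance between two (lat, lon) points, in km
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def filter_prompts(prompts, home, quiet_km=3.0):
    # Keep only prompts outside the "I know my own neighborhood" radius
    return [p for p in prompts if km_between(p["at"], home) > quiet_km]

home = (47.61, -122.33)
prompts = [
    {"text": "Turn left onto Main St", "at": (47.612, -122.331)},  # ~0.2 km from home
    {"text": "Take exit 5 toward I-90", "at": (47.70, -122.20)},   # ~14 km from home
]
print([p["text"] for p in filter_prompts(prompts, home)])
```

A verbosity slider could work the same way, filtering prompts by a priority field instead of a distance.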
"And so now begins the most interesting twi--"
"Get ready to make a left turn up ahead"
"--and it was all because--"
"Left turn in 100 meters"
"--but what really--"
"--and that would change things forever."
The Overcast podcast player tries to back up to complete a word or phrase when it is interrupted.
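That back-up-to-a-word-boundary behavior is easy to sketch. Assuming you have word-start timestamps for the audio (the transcript format here is invented; I don't know how Overcast actually implements it), resuming means rewinding to the start of whatever word was playing when the interruption hit:

```python
import bisect

def resume_position(interrupt_time, word_starts):
    """Rewind to the start of the word that was playing at interrupt_time,
    so the listener hears the whole word again after the interruption."""
    i = bisect.bisect_right(word_starts, interrupt_time) - 1
    return word_starts[max(i, 0)]

word_starts = [0.0, 0.4, 0.9, 1.5]  # seconds at which each word begins
print(resume_position(1.2, word_starts))  # 0.9: back to the current word's start
```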
With iOS adding an indicator to let the user know whether an app is listening or not, I'm sure this is going to cause a lot of pushback from users in the form of "omg Google is eavesdropping on everything we say!!!".
At least the Amazon- and Google-produced devices have notification sounds in their accessibility options: one sound when the device wakes up, and another when it stops listening. Sometimes you can hear it figure out that it woke by mistake and cancel itself (like when the La Cucaracha car horn in Ant-Man triggered a Google Home? I dunno, sometimes it wakes on the weirdest stuff).
I hate saying "Alexa, dismiss" for Reminder notifications rather than accepting the natural conversational "OK", "Thanks", "Got it", or "OK, thanks, I got it"...
Polite language sometimes conveys information: "please" signals an expectation of action, "thank you" can act as an acknowledgement or mark the end of a conversation, etc.
Me: Alexa, text my spouse please
Alexa: What would you like to say?
Me: Thank you for reminding me to pack my lunch, the cafeteria options were terrible today!
Alexa: what would you like to say?
Mapping the space of possible solutions is delegated to your brain. You are responsible for remembering what's possible, what you can do, and what you should say to make that happen.
The majority of the time you will fail, because you forgot what the magic spell actually was.
A visual interface has an unobtrusive way of showing you what you can do. More importantly, it can show you multiple options at once. Voice feedback can only tell you one thing at a time, and very slowly.
Voice assistants are called that for a reason - the core assumption is that you share your intent, and the assistant is smart enough to make it happen.
But that does not work with people either! Nobody can read your mind (yet) - that is why project requirements and management exist. I would love to tell my team one sentence of what I want done, but I have to review it at some point.
I am not using voice assistant to do human things badly. I want it to do computer things well.
Re: your comment about discovery, I use an old version of a voice recognition program (Dragon). From the help:
"Say "What Can I Say?" to see a list of sample commands you can say in the current application you are using"
So not a problem, at least in voice transcription software.
The article criticizes the recognition loss that comes with brevity, which is very true:
> “Paste” is almost never detected correctly.
Yup, so Dragon simply expands the phrase to "paste that". It uses such redundancy to fix the problem, and because words are cheap (if you have no voice disability), this IME works well.
I have to say, having used Dragon a long time now, I don't see much value in voice command interfaces - but it totally shines in plain old text transcription.
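The redundancy trick generalizes: give the recognizer longer, more acoustically distinct phrases in place of lone monosyllables. A minimal sketch (the command-to-phrase mapping is illustrative, not Dragon's actual grammar):

```python
# Short commands mapped to longer, harder-to-misrecognize phrases
EXPANSIONS = {
    "paste": "paste that",
    "copy": "copy that",
    "cut": "cut that",
}

def grammar_phrases(commands):
    # Build the phrase list handed to the recognizer: expanded where an
    # expansion exists, the bare command otherwise
    return [EXPANSIONS.get(c, c) for c in commands]

print(grammar_phrases(["paste", "undo"]))  # ['paste that', 'undo']
```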
- Feels more like an outline than an article or blog post
- Covers all the essential points
- Wastes no space on superfluous language
- I think I like it
This is the bullet-list phase, which is essentially just gathering data. The next phase should be to digest this data to reveal patterns and identify best practices - that article would be more interesting, worthy of a good essay. The phase after that would be to synthesize new facts, bring a new perspective, and fix old issues. That article would be worthy of a research paper.
Unedited GPT-3 output:
I think this problem is especially bad when you have to use a specific vocabulary, and when you have to speak in a specific order. It’s not just the laggy UI, it’s the idea that your vocabulary and phrasing is not flexible. The laggy UI is an artifact of that lack of flexibility.
I don’t think it’s hard to understand how frustrating that can be.
A lot of people in the voice community are concerned with “flow”. This is the idea that the conversation should just flow, that we shouldn’t have to interrupt it to do work.
Flow can be achieved in other interfaces by turning off prompts, and putting in defaults. But it’s hard to do in voice. First, if you have to wait too long, then you are in trouble. Second, you can’t do defaults, because the only way to get the answer is to wait for the user to speak.
I think that is true. But there are ways to achieve a kind of flow, by turning off prompts and being very responsive. We don’t have to show you all the information on the screen at once. We can use highlights to let you know the computer is paying attention, or make use of non-visual modalities to let you know something happened. If we show you something on the screen, it’s because we are trying to convey something important, and you’ll be able to act on it, but we can’t expect you to act on everything.
Even with this in mind, flow is a design principle, not a technical requirement. I don’t think we should design systems that are trying to achieve flow, because it’s very hard to do. I do think we should design systems that are responsive and forgiving.
Voice systems have a unique capability: they can communicate with you without being able to see you. This is the key to being able to make an AI feel more like a person. If I could have a computer read my mind, that would be much more powerful than if it just read my words. This is the reason that voice is so compelling, and why it’s so interesting to implement it.
Flow is a principle of interaction design, and it’s about putting in a lot of work to make a system seem less work. Voice does not make interaction feel less work. In fact, voice can be the opposite. It is work. Voice requires us to work harder, and because it is so personal, we feel that more.
Voice is a more intimate interface. You are using your voice to speak to a computer.
There is a large class of problems which are better solved with a screen. I believe this to be
One thing on my mind with voice is that humans generate useful shorthand very dynamically. It would be great if I could conversationally set up more context with Alexa about tasks I do often or music I like, and have that improve efficiency and accuracy when I repeat. Instead, I experience the opposite: memorized commands often stop working because the outside world has changed; for example a song I like getting shadowed by a crappy (but newer) cover.
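The shorthand idea could be as simple as letting the user pin a phrase to a resolved result once, so later repetitions bypass the live catalog search entirely (and can't be shadowed by a newer cover). A hypothetical sketch - the store, the search callback, and the track names are all invented for illustration:

```python
class ShorthandStore:
    def __init__(self, search):
        self.search = search  # fallback: live catalog search, which can drift
        self.pinned = {}      # phrase -> result the user explicitly confirmed

    def pin(self, phrase, result):
        # "Yes, that's the one - remember it"
        self.pinned[phrase] = result

    def resolve(self, phrase):
        # Pinned shorthand wins; only unknown phrases hit the search
        return self.pinned.get(phrase) or self.search(phrase)

store = ShorthandStore(search=lambda q: f"top search hit for {q!r}")
store.pin("the skeleton dance song", "Super Simple Songs - The Skeleton Dance")
print(store.resolve("the skeleton dance song"))  # the pinned original, not a new cover
```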
Not sure this is what the author meant in the first section when he wrote:
> I suspect you can push the user into a conversational skeuomorphism if you think that’s best, and the user will play along, but it’s no more right than another metaphor. It’s a question of quality of interaction, not ease or familiarity.
...but it's an interesting example of non-conversational multiplexed voice communication.
I've always felt a bit disappointed in myself that I haven't created the kind of optimized experience that Engelbart put forward – like maybe I'm not just being inefficient, but leaving room to overengage with my own distraction.
OTOH, there are limits to our own ability to be engaged and efficient. I am a good typist, which means I can write faster than my own thoughts... do I need to type faster? I could switch to Dvorak, but I don't think that's the limiting factor. Some of the user interfaces Engelbart advocated for feel like the Dvorak of interfaces, optimizing for something that is already past human ability. Looking at some of the interfaces in the Mother Of All Demos, it feels like they would require years of living in the tool to make it worth it, to make it natural and comprehensible and focusing instead of distracting. (I haven't done it, so I might be wrong!)
A high-bandwidth vocal signal feels similar, optimizing past our mental speed, and single-system-focused. At least that's my take on it.
I feel that way when I'm writing text and code.
But when doing visual design, or playing with a demoscene setup, I tend to feel limited by the number of variables I can simultaneously control. I have tried using MIDI knobs and sliders, but the bottleneck is how many fingers I have, rather than the number of input devices in front of me.
People think it's crazy that we still have people editing 50 year old COBOL programs today. Imagine how annoying it'd be to have to use a 50 year old computer that hasn't had "accent updates" and needing to emulate the speech patterns of your grandparents to be recognized.
(Autocorrected SMS was unpopular in my country when it was first introduced, because it corrected to Germany-German spellings, and ever since the mid-twentieth century we've been proud of regional orthographies. I imagine francophones who use "SMS language" also have to turn off autocorrect.)
And then he provides four bullet points explaining exactly why we should not abuse assistants...
Dare I say it, but maybe we should look to the likes of Zork for inspiration for voice input.
On that note, there's nothing intrinsic here, really, to why these need to be voice interfaces. Based on watching the video and my experience using voice interfaces, if my hands are already on the keyboard, I'd wager I can type some of these things as quickly as if I'd spoken them and waited for the agent to keep up (even assuming they can).
I'm also surprised that the Awesomebar-/Sublime Text-/VSCode-style command palette hasn't by now become a ubiquitous feature of user interfaces. I'm really surprised it hasn't replaced, for many use cases, the combo of Bash running in a terminal emulator, given how well-known a problem it is to have to remember the args to infrequently used commands. And I'd rather type "it" (as in "delete it") than ever type "$_" even once - let alone ask someone with a straight face to remember that this is the way to do things when you're talking to the computer.
This makes me think of programs on the Mac like Alfred and LaunchBar. While they're usually called "application launchers," in a lot of ways they're more like simple, always-available command lines. I use Alfred for launching nearly every program, but also use it to do various web searches (like "g foobar" for Google, "wp foobar" for Wikipedia, etc.), perform simple calculations ("=123+(45/67)"), and even actions like converting Markdown on the clipboard to BBCode or creating Jira URLs for pasting into GitHub PRs. The Mac's native Spotlight has evolved to do some of these tasks, but Alfred, LaunchBar, et al. are more powerful in part because they're more explicit - Spotlight makes educated guesses about what you're searching for, but Alfred operates more like, well, a command line.
I presume programs like these exist for Linux, too, although I haven't found any when I've gone looking. The lack of one would actually be a pretty major stumbling block for me for platform switching. (As would the lack of anything quite like Keyboard Maestro, but that's a different topic, and maybe that's out there, too!)
Actually I'm using DuckDuckGo but still type "g <thing>" because of years of muscle memory.
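The core of an Alfred-style launcher is just a keyword dispatcher. A minimal sketch - the keywords and URL templates mirror the examples above ("g", "wp", "=") but are otherwise made up, and a real launcher would use a safe expression parser rather than a restricted eval:

```python
SEARCHES = {
    "g":  "https://www.google.com/search?q={}",
    "wp": "https://en.wikipedia.org/wiki/Special:Search?search={}",
}

def dispatch(query):
    keyword, _, rest = query.partition(" ")
    if keyword in SEARCHES and rest:
        # Keyword search: expand the rest of the query into a URL
        return SEARCHES[keyword].format(rest.replace(" ", "+"))
    if query.startswith("="):
        # Naive calculator; eval with empty builtins as a rough guard
        return str(eval(query[1:], {"__builtins__": {}}))
    # Default: treat the query as a fuzzy app-launch request
    return f"launch app matching {query!r}"

print(dispatch("wp command palette"))
print(dispatch("=123+(45/67)"))
print(dispatch("Safari"))
```

The "more explicit" point from the comment falls out of the structure: the leading keyword removes the guessing that Spotlight has to do.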
Come to think of it, I usually type into my Google Assistant app on my phone, rather than talk to them with voice, so there's no need to make these voice exclusive.
People are very adaptable and computers are hard to adapt. Tell me what I'm supposed to say, and I'll say it.
We make extensive use of Sonos and Amazon Echo products in our household and have two young children (3 and 5). The older one doesn't like issuing instructions to the Echo devices, because he pauses occasionally and it somewhat aggressively replies "SORRY, I CAN'T HELP RIGHT NOW".
A bigger problem we've noticed is that things "drift". We've got muscle memory of asking "Alexa, play classical lullabies" or "Echo, play the skeleton dance song". Occasionally though these commands will be "hijacked" by some new content in Spotify or somewhere else.
"Alexa, play music for children" used to play nice kids songs, now it plays some murder-metal album called (one assumes ironically) "music for children", prior to that match, it seems like it was a search, or a playlist?
And there seems to be no way to deny-list music like this. It's driving us off Spotify in our household; "skeleton dance" plays either a screaming metal artist or the cutesy kids' song with about 50% accuracy, it seems.
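The deny-list this comment wishes for would be a trivial filter on the result ranking. A sketch under invented data - the catalog records and artist names are made up, and this is not how any real assistant resolves music requests:

```python
# Artists the household has blocked; a real product would key on stable IDs
DENYLIST = {"Murder Metal Collective"}

def pick_track(results, denylist=DENYLIST):
    # Take the best-ranked result whose artist isn't blocked
    for track in results:
        if track["artist"] not in denylist:
            return track
    return None  # every match was blocked

results = [
    {"title": "Music for Children", "artist": "Murder Metal Collective"},
    {"title": "The Skeleton Dance", "artist": "Super Simple Songs"},
]
print(pick_track(results)["artist"])  # Super Simple Songs
```

The hard part isn't the filter - it's that no current assistant exposes a hook like this to the user.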
FWIW I had the same problem with Amazon FreeTime where for 5 bucks a month you get access to 10,000 kids shows and games, which is about 9950 more than I had time to vet to check if they were appropriate for my kids.
Please, hold your criticism if you want to berate me for raising kids in a household with voice assistants and tablets, I am trying to balance exposing them to useful technology without exposing them to the underworld of utter shite that is to be found beneath those interfaces.
The only addition I can think of is during the wake word section. I'd like to remove the wake word requirement if I'm in the middle of an interaction.
For example, sending a text on Siri.
"Hey Siri text my wife. Tell her I'm on my way"
Siri shows me the text and then I have to again say "Hey Siri send it" when just "Send it" should do.
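Dropping the wake word mid-interaction is basically a follow-up window: after the assistant asks a question or shows a draft, it keeps listening briefly so a bare "Send it" works. A hypothetical sketch - the timing and state names are illustrative, not how Siri actually works:

```python
FOLLOW_UP_SECONDS = 8.0  # assumed grace period after a prompt

class Dialog:
    def __init__(self):
        self.expecting_reply_until = 0.0

    def prompt(self, now):
        # Assistant just asked something ("Here's your text - send it?"),
        # so open the wake-word-free window
        self.expecting_reply_until = now + FOLLOW_UP_SECONDS

    def needs_wake_word(self, now):
        # Outside the window, utterances must start with the wake word again
        return now >= self.expecting_reply_until

d = Dialog()
d.prompt(now=100.0)
print(d.needs_wake_word(now=103.0))  # False: a bare "Send it" is accepted
print(d.needs_wake_word(now=120.0))  # True: window expired
```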
Using AirPods Pro is supposed to allow somewhat reduced vocal utterances to Siri, but I haven’t spent time learning that.
"Dialogue designers replace graphic designers when creating voice interfaces"
Don't try to use full sentences to command a computer. People are happy to learn a few words if accuracy increases to >99% and it is more convenient.
Also, incorporate humming in tones.
If those things were more active, they would work much better.