Voice assistants are not doing it for big tech (theregister.com)
257 points by rntn on Nov 23, 2022 | 361 comments



Voice assistants are basically just mainstream non-visual command-lines, and it's unsurprising to me that something that relies heavily on memorization and extremely specialized "skills" isn't quite taking off in the way it was imagined. A voice system that can do literally everything one can do with a keyboard and a mouse would be magical, but no system offers that.

Instead, it's a guessing game about syntax and semantics, and frequently a source of frustration. There are many failure points: it can "hear" you wrong, it can miss the wake word, it can hear correctly but interpret wrong, miss context clues, or simply be unable to process whatever the request is. In my experience, most normal people relegate voice commands to ultra-specific tasks, like timers, weather, and music, and that's that. Google and Alexa are relatively good at "trivia" questions, but Siri is a complete failure. All systems have edge cases that make them brittle.

I think there's potential here. Cortana was the most promising: an assistant that's integrated into the OS and can change any setting or perform anything on-screen would, again, be really awesome. We just don't have that. I think maybe OS-wide + GPT 4 (or later) might get closer to what we expect, but it's just not great right now. I really want to be able to say something as unstructured as "hey siri, create alarms every 5 minutes starting at 6am tomorrow" or "hey siri, when I get home every day, turn on all of the lights, change my focus to personal, and turn on the news". There /is/ power to-be-had, but nobody has really tapped it.


The problem isn't voice, it's natural language.

Natural language is a fundamentally wrong vehicle to convey information to a computer. It can be useful for some specific tasks - automated Q/A, simple interfaces to databases, stuff where I can't be properly f_ed to remember the syntax or the shortcut, like IDE commands.

But the idea it can replace formal language is fundamentally and dangerously incorrect. I agree with Dijkstra's quip, we shouldn't regard formal language as a burden, but rather as a privilege.


I'd be perfectly happy with a list of Siri commands that I would have to learn to be able to do things. I don't care if I ended up sounding like:

Hey Siri

Turn lights on 50 percent

For one hour

Dim over that time

Play music.

I can learn what I need to do; JUST LET ME KNOW THE MAGIC WORDS!


It's like playing Zork all over again.


A lisp compiler in a voice assistant would seem like an improvement, in that the user could define objects and then express the actions to be performed in the same room. But these assistants seem to drop objects between commands, making them hard to program conversationally.

I guess a Lisp-like language would be ideal, and the pauses would be like parentheses.
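Something like this toy sketch, maybe (Python; the "<pause>" markers and the phrasing are made up, not any real assistant's format):

    def parse(tokens):
        # Each pause-delimited phrase becomes one sub-list, so the
        # utterance ends up grouped a bit like an s-expression.
        sexpr, phrase = [], []
        for tok in tokens:
            if tok == "<pause>":
                if phrase:
                    sexpr.append(phrase)
                    phrase = []
            else:
                phrase.append(tok)
        if phrase:
            sexpr.append(phrase)
        return sexpr

    parse("turn on <pause> bedroom lights <pause> fifty percent".split())
    # => [['turn', 'on'], ['bedroom', 'lights'], ['fifty', 'percent']]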


But with the added complexity that sometimes the speech-to-text will just crap out completely.


Alexa, turn on lights

...I don't know how to do that

Alexa, turn lights on

...What do I turn the lights with?

Alexa, activate lights

...I don't know what you mean

...It is pitch black. You are likely to be eaten by a grue.

ALEXA TURN ON THE DAMN LIGHTS

...I don't know the word "lights"

...Oh no! You have walked into the slavering fangs of a grue!

** You have died **


Siri, turn on bathroom lights.

Downstairs or upstairs bathroom?

Downstairs.

Sorry, I didn’t understand. Downstairs or upstairs bathroom?

Downstairs bathroom.

Sorry, I didn’t understand. Downstairs or upstairs bathroom?

Cancel.

Ok. Cancelling.

Siri turn on downstairs bathroom lights.

(Turns off all lights)


For me, about once a week it's

"hey siri?"

(no response, no icon),

"hey siri?"

(no response, no icon),

"hey siri?" (louder)

(no response, no icon),

"hey siri?" (louder and slower)

(no response, no icon),

reboot iphone 13 pro

"hey siri?"

works


“Did you mean ‘bathroom LED’ or ‘bathroom’?”

Because god help you if your device names are similar to your room names…


I’ve taken to naming my lights things like Greg, The Beacons, etc.

And I added scenes so I can say “Gondor calls for aid” and the beacons will light.


Yes. And it may be worth noting that Zork is literally something like 50-year-old parser technology.


Not to take away from your point (I'd like the magic list too), but to some degree this can be worked around using Shortcuts. If you use inputs, Siri will prompt for them, which is a bit slow, but you could even use a Dictate Text action and parse the result yourself if desired.


I highly doubt there is "a" magic list. I'll bet the magic list changes constantly.


I noticed a drop in usability about the time they went with ML.


Same with the predictive keyboard, it feels more random now.


i don’t know that you can do exactly all these things, but isn’t this the use case for custom routines in the amazon ecosystem?

you create the prompt and add one or more actions to take.


On the other side, humans have been fine using natural language to delegate commands to each other.

So maybe it's just that the subfield of natural language understanding is still too early to be really useful. Speech recognition itself has gotten really good but then understanding the context, the intent, etc, all that is natural language understanding, and that is often the problem.


> have been fine

Citation needed; there are a lot of disagreements and misunderstandings (some have cost lives) that could've been avoided if we didn't have 10 different ways to say the same vague thing that can be interpreted in 20 ways. You think the military uses a phonetic alphabet and specifically structured communications for fun? Or the way planes talk to ATC, for example. Where precision and unambiguity are crucial, natural language always gets ditched for something more formal.


This is actually an interesting point. In the Army, we used terms that limited ambiguity, thereby increasing efficiency. Even if one eliminates the complexity of language, there's still a specification problem.

I only use voice assistants to set alarms. I cannot imagine voice as a primary input. Then again, many have opted out of owning desktops and laptops in favor of mobile phones. That also seems terribly inefficient.


>Then again, many have opted out of owning desktops and laptops in favor of mobile phones. That also seems terribly inefficient

A lot of people don't need computers in the general purpose sense. I admit my mind boggles a bit when co-workers tell me their kids don't want a computer to do their school papers because their phone is fine. But, then, I'm used to keyboards and what we think of as a "computer" and have been using one for decades--and grab one when I can for any remotely complex or input-heavy task.


> A lot of people don't need computers in the general purpose sense. I admit my mind boggles a bit when co-workers tell me their kids don't want a computer to do their school papers because their phone is fine.

I grew up in the 1980s, when handwritten papers were still the norm. I do see the advantages of using a word-processor for writing papers, but don't see why it would be a necessity (at least, until University).


I think the implication is that the kids use a word processor on their phone.


It sounds ridiculous, but I'll admit that when you've got something like DeX, which lets you dock the phone for USB and HDMI out and gives you close to a full desktop OS, I'd imagine it really is enough for the casual user.


I certainly know colleagues in the industry who travel with just a tablet and external keyboard. No, they're not running IDEs etc., but they find it OK for emails, editing docs, taking notes, etc. Personally I'll spend the extra few pounds to also carry along a laptop. But I can imagine not needing/wanting a dedicated laptop when I travel at some point.


Is a tablet and keyboard really much lighter than a laptop?

https://www.theverge.com/2020/4/20/21227741/apple-ipad-pro-m...

Suggests a keyboard and large tablet is heavier than a laptop


I'm usually carrying a tablet anyway though for entertainment/reading purposes. So it's usually a choice of tablet + laptop vs. tablet + keyboard. (I admittedly don't really have a weight optimized travel laptop these days either.)

I actually do wish there were good Mac or Chromebook choices for a travel 11" or so laptop but the market seems to have settled on a thin 13" as the floor and, admittedly, the weight/size difference isn't huge.


While I am mostly a Mac person, for travel I often prefer a tiny and cheap Lenovo Chromebook that does everything (a bit poorly): Linux containers for lightweight programming and writing, and consuming media like books, audiobooks, and streaming.

In response to a grandparent comment about weight for tablets: I prefer Apple’s old folio style of cases/keyboards because of the weight. I have one for both my small and large iPad Pros. Whenever I travel, I usually just take one of my iPads if I don’t need a dev environment [1].

[1] but with GitHub Codespaces and Google Colab, development on an iPad is sort of OK.


I still don't see the point of tablets. It's just a smartphone with a larger screen, and practically all people already carry phones.

Might as well go for the laptop at that point given that it can actually do far more imo, unless you ditch the phone and go for one of those half phone half tablets I guess.


I'd rather watch movies, read, play certain games, etc. on my tablet than on a phone. (Obviously there are also specific use cases like digital art.) That said, I mostly use my tablet when traveling and it's a distant third in necessity compared to either a laptop or a phone--and only somewhat more useful than a smartwatch.


Watching movies on a tablet is terrible, though. All methods for propping the device up so you can watch the movie are inferior to the way a laptop screen props itself up via hinges and a base.


On a plane I'd rather use the tablet in my lap than have to put the tray table down. And in a hotel room I'm watching on the couch if there is one. (I do also have an attachment for my tablet that will let you prop it up on a table but I mostly don't use it because it adds weight.)

For reading, I'm probably bringing my Kindle along if I don't bring my tablet.


I bought a Surface for that reason. I like the portability, and it is just a normal PC with a pretty bad keyboard.


If you do not have one, buy a dock! I have an SP6 and an SP4, and having the dock makes it quite the device. Speakers, multiple external monitors, keyboard, mouse -- a full desktop setup; and I can grab it and either stick a keyboard cover on or just use it as a reading device on the couch.

Back to work? Set it on the table, plug in one cable, and it's back to being a desktop and charging up again.

Makes the whole thing make far more sense.


How old are you? Because larger screens become really nice as your eyes go bad. And I don't need the full size of a laptop for things I'd want to do on a tablet.


The obsession with being lighter definitely has diminishing returns. At some point another few ounces doesn't make any difference in a real, practical sense. I think people have just started to associate "lightness" == "better" despite there being no actual benefit past a certain threshold.


Right, at some point. But at the current point my tablet is too heavy to hold in hand for more than 20 secs, perhaps. Phone is ok. Tablet is not (for me). I only use the tablet by placing it on a table or a stand. And then actually using a laptop is much better than a tablet.

The killer tech will be when we have a tablet that is as light as a phone.


Thanks for that. A lot of energy is currently sunk because of natural language, and I'd argue the gains from employing software (instead of human processes) for various tasks are in part due to scaling up the results of many confusing discussions in natural language about what a specific process actually comprises.


This is part of the reason Google search sucks more and more.

Around when Android appeared, and the first voice searches began, Google suddenly started to alias everything.

Search for 'Andy', 'Andrew' appears. Search for 'there', and 'they're' appears.

This has been taken further; now silly aliases such as debian → ubuntu exist, and since google happily drops words from your search to find a match, precision becomes impossible.

But, that's the only way to make voice search remotely work, so...


I don't think this is to support voice search: Google generally knows whether a query was initiated by voice or typing. Instead, I think it's because most users find what they're looking for faster with it.

If you have terms you don't want interpreted broadly you can put them in quotes.


Google "helpfully" ignores the quotes sometimes too. They're not the hard and fast rule they used to be.

I preached the Gospel of Google when the competition was composed of web rings and Altavista, but Google in its infinite wisdom has abandoned the advanced user with changes of this nature.


Pretty sure quote support has improved recently.

https://blog.google/products/search/how-were-improving-searc...


Considering the article lies and tries to claim quotes are always respected, I wouldn't put much faith in it.


So what is the gospel du jour, or are we forsaken in these benighted times?


Most people are not precise enough in their terminology.


I often find a voice assistant useful for operating the phone, such as opening a given setting, say, making the display brighter. Trying to navigate the settings pages is very error-prone. There seems to be no universal standard as to where each setting should be found.


The real problem is people keep reorganizing where the settings are found.


There is a widely accepted and straightforward notion that humans have ideas, which are expressed in language, and that language being ambiguous is problematic: this I'm starting to have doubts about.

Maybe we don't have clear intentions in the first place; maybe languages are not just ambiguous, but only meant to narrow the realm of valid interpretations down to a desired precision, rather than intended to form logically fully constrained statements. Maybe this is why intelligent entities are needed to "correctly" interpret natural language statements, because the act of interpretation is itself a decision and an action.

Just my thoughts, but I do think there is more to be said than "natural languages are ambiguous".


> On the other side, humans have been fine using natural language to delegate commands to each other.

Using language to instruct humans goes wrong all the time. Just a short while ago on British Bake Off I saw 2 of the contestants make white chocolate feathering on their biscuits by making actual feathers out of white chocolate and placing them on their biscuits. And I'm sure that will confuse quite a few people reading this too. It certainly confuses image searches. Language is a fuzzy interface. Compare that to an interface like clicking on a button that does the thing I want done.


How would you (easily) describe the concept of chocolate feathering to a computer without using natural language? (e.g. if you wanted the computer to generate an image, or search for an image of / recipe with chocolate feathering).


> On the other side, humans have been fine using natural language to delegate commands to each other.

And that's why all of aviation has moved to a tight phraseology, such that delegated commands are universally understood and their meaning is set in stone.

Natural language has cost many lives.


> humans have been fine using natural language to delegate commands to each other.

Not always resulting in unambiguous instructions:

"Lord Raglan wishes the cavalry to advance rapidly to the front, follow the enemy, and try to prevent the enemy carrying away the guns." ~Lord Raglan, Balaclava

"I wish him to take Cemetery Hill if practicable." ~Robert E. Lee, Gettysburg


> On the other side, humans have been fine using natural language to delegate commands to each other.

On the other hand, legalese exists and is the lingua franca of telling people what to do, and math exists.


> On the other side, humans have been fine using natural language to delegate commands to each other.

I think this is really a characterization. Mostly human communication is full of errors and problems.

What is true is that when it is important enough, humans have come up with ways that minimize communication errors and frameworks to deal with ambiguity - mostly these involve training and effort, though; it really doesn't come naturally.


"really a problematic characterization"...


> humans have been fine using natural language to delegate commands to each other.

Every time we try to minimize errors, we formalize a language. I don't even think people use natural language to issue commands often. Commanding people is often considered rude.


I agree with this. We have evidence that natural language works well enough to run most of the world. AI will eventually get there.


The problem is that it's not actually a conversation. To significantly improve it, you'd want to:

- identify users by voice

- ask them clarifying questions

- remember the answers on a per-user basis

- understand "no, that was the wrong answer"

If you're going to provide a formal interface to the computer, you also have to provide teaching in that formal interface, which is far more of a burden to the user than the cost of the device. And we've completely moved away from that model (not necessarily a good thing, but that's what the market has chosen).


Calling it a burden is an assumption that ignores and belittles the end user. Sure, there are people who won't want to train their personal ai.

But I imagine there are significantly more who would appreciate clarifying requests by a teachable assistant capable of interacting with the entire digital world on their behalf, efficiently and intelligently.


I think you're right. There are glimpses of this in the voice interfaces right now. For example, Alexa will distinguish between voices and take actions accordingly: when I say "Play Music" it plays Spotify, and for my kids it plays Amazon Music.


An example backing this is voice assistants that DO work, e.g. Talon voice. But these require defining a language, and then they are very accurate and powerful.

I don't see why a voice assistant for the masses couldn't "train its own users", for example by suggesting the language it does expect. But even then, most of the time people are talking in noisy environments, or talk too fast, or don't have an understanding of how the machine might work. Regardless, who cares. They ruin the audio environment of a home. They're good for setting timers while you're cooking, and that's about it.


Car voice assistants do this, but they're still clunky and it takes them forever to list their options. Voice interfaces, just like CLIs, suffer from extremely bad discoverability and presentation compared to GUIs and thus will always be limited to specialty applications. CLIs at least have a league of try-hards and hobby linux users to keep them alive.


They're also fantastic at playing soothing music while your hands are busy holding a crying baby.


Only thing I use Siri for as well.


Right - natural language works for people because we have minds that are communicating. A virtual assistant has a list of things it can do, and uses language as an interface to them. So the language just becomes obfuscation instead of allowing clarification.

I've said before, I would prefer a voice assistant optimized for traversing its menu system in response to unambiguous noises (could be high and low pitch hums or whatever), letting me bypass the guessing game and use the menu it's hiding.


Like this: https://www.youtube.com/watch?v=8SkdfdXWYaI ? Here you traverse the AST, but the idea is similar, I think.


The problem is that it doesn't make money.

Otherwise, it works great :-) We love the hands-off usage mode because we cook a lot, so adding things to shopping lists or looking stuff up doesn't require cleaning hands in the middle of prep. Also the speakers are pretty darn good for the size and work well for music.

Doing complicated things is right out though. But the simple stuff works fine.


I'm just waiting for someone to finally release a voice assistant built around an actual language model, like GPT-3 or LaMDA.

It would be more error prone in a lot of ways, which is probably why nobody's done it yet, but it would also be a _lot_ more powerful, and fulfill the vision of conversational AI in a way the current rules-based assistants do not.

I think if powerful language models were easily accessible to normal people (in an inexpensive and completely unrestricted fashion, like with Stable Diffusion) we'd already see this happening in the open source world. Companies are going to be a lot more hesitant to try it though until they have a way to 100% prevent the models from making mistakes that could reflect poorly on the company, which is going to take _way_ longer to achieve.


Are you trying to say, Alexa should be funding the synthetic language nerds over at Lojban[0] or the Universal Networking Language[1]???

That would be a fun universe.

[0] https://mw.lojban.org/index.php?title=Lojban&setlang=en-US

[1] https://en.wikipedia.org/wiki/Universal_Networking_Language


Natural language conveys information to other people just fine. So the problem isn't that "Natural language is a fundamentally wrong vehicle to convey information to a computer". The problem is getting the computer to understand natural language to the same level as a human.


The problem is both


> we shouldn't regard formal language as a burden, but rather as a privilege

What the hell? Is riding public transport or riding a bike either a burden or a privilege? Is driving a car?

I am trying to control shit in my home; it should be neither.


Dijkstra's full essay[1] is a bit more illuminating, but essentially it's about how, for example, developing a system of symbols and formal language around mathematics has allowed "school children [to] learn to do what in earlier days only genius could achieve".

1: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...


I think his argument even generalizes to literacy in general. Remember that reading and writing skills don't develop naturally (as opposed to spoken language). They require a large educational investment, and used to be reserved for the wealthy and the privileged.


> I think there's potential here.

But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click. You have to be somewhere where talking out loud doesn't disturb the people around you. That excludes most situations: open space offices, restaurants, coffee shops, public transport, cars with passengers, and most places in the home except maybe the bathroom.

And even if you're all alone in a silent place, giving instructions out loud takes more time than configuring a screen, and will always be error prone, because the feedback will always be ambiguous and imprecise.

Except maybe if the feedback is on a screen, but then if there's already a screen, why not use it.


I think the best use cases for voice assistants are when you don’t have free hands. I have two scenarios where I use voice assistants: setting a timer while cooking and changing the music while showering. Both could be done by other means as well but they wouldn‘t be more convenient.


Exactly. For instance, in the mornings Google Assistant has been really useful for when I say "OK Google, Good Morning". It then runs through and tells me:

* Current time, and weather forecast for the day

* Upcoming meetings today

* My current commute time to work, including traffic

* NPR news podcast

So during my routine of letting the dogs out, starting the coffee, etc. in the morning, I get the daily "essential" info.


Also when driving, but Siri / Google Assistant are more applicable for that use case.


Asking the time whilst getting ready.


Seems like a perfect fit for a clock?


Or a watch?


Apple watch does have Siri, I suppose. They could be really bold and remove the screen.


> They could be really bold and remove the screen.

Then it would be called AirPods.


Both or either would suffice.


> But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click. You have to be somewhere where talking out loud doesn't disturb the people around you. That excludes most situations: open space offices, restaurants, coffee shops, public transport, cars with passengers, and most places in the home except maybe the bathroom.

I would separate out the two, actually. There's a "natural language control system for the entire OS" and then there's the actual voice part. Voice is often mostly useful for accessibility purposes -- hands full, running, driving, etc. However, the other side is that a text-based NL assistant would also be profoundly useful. On iOS, you can enable "Type-to-siri" and you can just type sentences and Siri will respond back in text.

If we make progress on NL-driven command-lines, we can actually make progress on voice-assistants, and vice versa. The catch is that the voice side still needs recognition work.


Well, you are not trying to operate heavy machinery with an Amazon Echo - hopefully. Voice as a common interface - I agree with all of that, but to me the everyday utility of being able to add something to my shopping list or my TODO list without having to fire up an app greatly increases my quality of life. That part is magical, but I don't expect a lot more from it.


I used to use Alexa for my shopping list. I guess over time I came to the conclusion that adding something to a steno pad or my whiteboard was even easier.


If the assistant AI was advanced enough for pleasant conversations to occur, it would be useful.

It would be trivial to use the interface on screen when appropriate, and a truly smart assistant should be able to follow the context and be aware of your preferences and mood.

This is not fundamentally impossible, we're simply not there yet.


> But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click.

Smart home lights/etc. while hands are occupied, like with a baby. But the use cases are quite limited.


> But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click

Working from home changes that. I can see many more opportunities for a multimodal input interface. Examples:

1. My fingertips now are closer to the "reply" button below this text area than they are even to the touchpad. Touching "reply" is half a second; moving one hand to the touchpad, aiming the pointer at the button, and clicking takes longer. With a mouse: much longer. Anyway, my screen is not a touchscreen. I'll click.

2. Or, with an assistant, I could have said "Click reply", provided that the assistant knows where the focus is and that it can read the form I'm typing in.


Your fingertips while typing are even closer to the Tab and Enter keys on your keyboard, which, if pressed in sequence, have the exact same effect. Much simpler and much faster than either of your options.


Faster, don't know. Simpler, I didn't even think about it. However I'm doing it now. Thanks.


Wow, the second point is really interesting. Binding a voice command to a key, in this case Enter.


"hey siri, when I get home every day, turn on all of the lights, change my focus to personal, and turn on the news"

I think the problem with that is that even I, as a human, struggle to know for sure what you want.

You want to turn all the lights on in the house? Does that include the lamps in the bedroom? How about new lights that you add later? Or the ones in the garden? It's full of ambiguity. What device do you want to watch the news on? Or did you mean the radio? Do you want this to apply when you get back at 2am one night, meaning your family gets woken up when you turn on all the lights and start playing the news in their bedrooms?

I think that's probably why voice interfaces aren't likely to work well for anything beyond direct, specific, well-scoped requests: turn on the lights in the bedroom; turn off the heating at home; roll up the blinds; what's the weather like today; what's the remaining range on my car. They really struggle to deal with anything more complex – not so bad in theory, but really incredibly irritating when they make the wrong decision.

If you had some kind of 24-hour live-in assistant (a butler, maybe?), then they probably have the knowledge and intuition to make sensible decisions in response to fairly unstructured requests. But I think we're miles off getting a voice assistant to do it – not because they can't, necessarily, but because if they mess it up at all it's infuriating.


You can do some of this with shortcuts, and then use Siri to trigger the shortcut. But that involves thinking; the magic of Jeeves is that he knows what you want even before you do.


The problem is there are more different combinations I might want as a shortcut than I have time to program/remember. I can remember something like a dozen commonly used shortcuts. However, when 5 years from now I arrive home at 2am (for the first time in several decades, but it will probably happen at some point again in my life), will I remember the correct shortcut - and assuming I do, is it up to date with whatever changes have been made to my house?

What about the shortcut for when I need to leave at 3am for some reason? Then a different shortcut for when it isn't just me, but my whole family leaving at 3am. And still another for my son having to leave that early.

Jeeves can figure it out when I arrive at 2am so I don't need to program it.


You've reminded me of some aspects of these platforms that I like in a more general sense – like for example the way the Apple Watch will automatically ring the alarm on my phone if I forget to put my watch on, or if I get up before my alarm goes off the watch will notice and ask if I want to skip the alarm for the day. This stuff genuinely feels almost like magic sometimes – the risk is that when anything like this goes wrong it's awful.


Yeah, these are graceful - and the watch will start out with very light buzzing and then get louder.


I might be in the minority, but I also don't want to add things to my life that make my environment noisier or that require me or others living with me to speak more. As much of a Star Trek fan as I am, I never found "The Computer" to be appealing, and always thought of it more as an artistic device. It's a lot easier to communicate a character's intent / action if they are vocalizing it for performance. Even in scenes where they are "typing" something into the computer, they will inevitably be communicating to the captain or another character what they are doing.

In practical reality these interfaces feel, to me, extremely inefficient. As someone who doesn't particularly like to speak, and prefers silent environments, I find these interfaces require more energy from me to use. Unless they are serving someone who has a physical impairment, I don't see what problems they solve, but I can identify lots of problems that they introduce (not only noise but privacy / security vulnerabilities, etc.)

Personal preference.


Timers and reminders alone are enough to make them a pretty nice thing to have though.

I don't really want them to be all that much more powerful, because natural language can be imprecise, and... there's just not much that I want to automate in a home setting beyond some real simple timers for lights and stuff.

What if I had a bad day and didn't want to see depressing news? Or what if I came home and was talking on the phone when it turned the news on?

True automation as opposed to just telemetry and remote control can easily be annoying more than helpful.

I like the idea of automation... but I don't actually... automate anything aside from timers and reminders.


I think that's generally true though playing music is a little more freeform. (And, guess what? Voice assistants tend to be worse at that.)

The problem is that many, many billions of dollars have been sunk into making these devices about more than setting alarms and timers. There's actually been a lot of pretty amazing progress. But it's yet another one of those things that's stuck at 90%, which isn't good enough for anyone but techies who want to fiddle with their smarthome stuff or otherwise play with the technology.


They might have a sudden increase in usefulness when smarthome stuff is more common, although smart bulbs are a bit of a hassle in most switched outlets, because the switch is usually more convenient.

Maybe they'll add an app that lets you browse possible commands so it's more discoverable.


It's probably true that a well-integrated smarthome would benefit from voice control.

But I'd observe that I'm going up to my brother's tomorrow and he has all manner of timers and other WiFi-connected stuff and none of it has any sort of centralized control and that's pretty normal even for people who have a lot of that sort of thing.

And, yeah, the only smart light thing I have at home is one thing that doesn't have a controlling light switch and I used X10 for it for years before I got an Alexa.


If I were in this space, I would just build voice assistants for very specific situations where you cannot type, like driving, cooking, doing some sport, etc. There is lots of potential, but the big players are kinda trying to build a generic tool for every situation, which is a super hard problem.


You want utility. The big players want a product that can be monetized and milked for revenue.


My Alexa asked me today if I wanted an Avatar theme. No I really do not, Alexa. I was reminded of the article a few days ago how they can’t monetize this well and are somehow losing $10 billion. :)


Voice assistants have reached the Unhelpful Valley stage.

When they were a novelty I recall the excitement of trying new commands and layering in context; after many failures I've been conditioned to only attempt, and expect success with, generic queries.


To me what’s interesting is that MS smelled that it was a problem a while ago and pulled the plug before it ate a hole in their wallet, but Amazon and Google keep plugging along, ploughing money into a bottomless pit. Apple has a different play; it looks like they are controlling their losses there quite well, and Siri may act as a slight loss leader for other products.


I can't fathom how they managed to spend so much on it, though. The product has been around for quite a while, as well, so it's not some initial ramp-up cost. $3B/quarter, $10B/year? Wow.

Edit: Maybe things like this happen because there are various nerds who lead these products and are good at talking the businesspeople into funding them. Maybe this was only possible during the big-tech growth stage, while the business side wasn't that good at assessing the value proposition. So, end result: lots more engineers get paid, which is great in my book :-)


> Instead, it's a guessing game about syntax and semantics, and frequently a source of frustration

My biggest frustration with Alexa is getting it to play the podcasts I want to listen to. Even popular podcasts with English names are hard to get just right for Alexa. The same goes for song titles and bands that are not popular or are in other languages.

Usually when I want to take a shower, I try to get the podcasts/music to play for 2 minutes, then sigh, give up and just say "Alexa play Britney Spears".


And discoverability. For a long drive I probably want to pick out some specific podcast episodes rather than play whatever. I'm just not a whatever background sound sort of person. The interfaces aren't really good enough to present me with some options with voice control only. So I end up mostly pre-populating a "Car" playlist.


>> A voice system that can do literally everything one can do with a keyboard and a mouse would be magical, but no system offers that.

And even then, a voice assistant is essentially a user interface, not a product or service.

It could be a service if you could reliably say "Alexa, plan my trip to customer X the week of the 30th and send me my itinerary". But for now they are an alternative to a phone UI.


The reality is that even a human personal assistant can rapidly devolve to being more of a hindrance than a help if they're not very good once you get beyond simple mechanical tasks. Even with all the knowledge about the world that most adults carry around in their heads. Yes, a poor human assistant can fall down in other ways such as forgetting to do something--but they have a lot of context.

This seems a really high bar for voice assistants aspiring to do much more than set alarms or turn the odd light etc. on or off.


These days few people have personal secretaries, but back when they were common they really were personal - once you got a personal secretary, she (nearly always she; I feel like we should acknowledge the sexism even though it is irrelevant to my point) would follow you (nearly always male) as you moved job to job and up the ladder. She went with you because once you had spent a few years training her in how you worked, a new secretary would greatly limit your effectiveness.

These days a computer can do a large part of what people relied on secretaries for, and faster, so you only see them at the highest levels. There are still secretaries at the low levels, but not nearly as many, and they are not doing the same tasks.


That's pretty much it. We call them executive admins these days where they exist.

And, yeah, assistants shared with a bunch of other people--as with travel agents in general--aren't really all that useful. If I'm mostly just giving fairly mechanical instructions to execute, it's probably easier for me to go online and figure out the options myself.

A secretary made a lot more sense when you dictated memos for inter-office mail and retrieving information often involved making multiple phone calls.


>> This seems a really high bar for voice assistants aspiring to do much more than set alarms or turn the odd light etc. on or off.

That's kind of my point. A voice assistant is just a fancy UI until it reaches the level of AGI, and I don't see the point in spending billions of dollars on them to be a simple UI, as Amazon seems to be doing.


If that voice assistant were self-hosted in the little device, I would agree. But those simple interfaces are connected directly to a significantly larger machine that literally knows everything about you and half of everyone you know. It's not unreasonable to expect it to be more useful than setting timers and playing music.


They "know" a bunch of discrete facts. They don't know that if you book me on a red-eye unnecessarily to save $100 I'll be hunting you down. Or any of a zillion other flexible preferences--some of which I'm not even very consistent about.


I don't know about you personally, but google definitely knows I've never booked a red-eye and that I haven't booked a layover since the early aughts. I'm fairly sure Google could easily figure out not only where I'd be interested in flying to in the next few months, but when and for how long, and at what price points I'd consider upgrading my flight.

I know they know this about me not only because of my Gmail account but also because I use Google flights to find the flights before I book them.

Unfortunately they're not using this data to help me. Rather they're using it to target advertising to me. But they definitely have the data and the machinery to be more useful to me with more than just a few facts


Maybe my travel is more complicated, but I not infrequently get annoyed even with "past me" over various travel-related decisions. I avoid red-eyes, but at some price point I won't - or maybe only if it's someone else's money. And maybe I don't have a choice based on my schedule or just what flights are available. Normally I won't do an unnecessary layover, but maybe I will to fly my preferred airline.

It gets complicated in a hurry and for the cases where it is relatively simple (and when it gets into very complex international travel a voice interface is going to be completely useless), I can look up my options pretty quickly on a computer.


The potential would be there if they would focus on the assistant part, and take voice just as one means to interact with the assistant, besides other means like clicking, typing, showing complex information on a screen, etc.

Voice alone sucks; it's just too limited to be useful on a grand scale. Similarly, command lines suck too. The shell in general has the same problems that voice assistants have, just that it has more value and has had decades to mature into something actually useful. And today we have unix shells which reduce the problematic parts by many levels, and still receive constant improvements. This is missing for voice assistants, because unix shells grow and improve in an open space, where everyone can add their own things. This is not happening in big tech.


I don't think this is actually reliably possible, due to the fact that while grammar does tend to follow patterns sometimes, we're fundamentally dealing with an exponential number of ways to say things to a voice assistant.

In the spirit of the title of this post, someone else also has to say something.

If your argument is that this is a "non-visual command line" there's slim hope of the layperson learning a whole secret grammar without even a goddamn man page just to do their menial tasks.


I really doubt *nix would have made it so far if the cli were audio based, too. It's a fundamentally slower and lower bandwidth communication channel.


*nix was optimized for low-bandwidth channels. That's why the command names and options are extremely terse and typically return trivial output on success. OTOH it was assumed that input would be reliable, so there's no confirmation required for potentially dangerous commands. A "*nix for voice" would need to address that, at the very least.


I’d sure be lost if I had to listen to the entirety of a manpage or dmesg output or /var/log/messages read out by voice. Some of those could take hours to read out. Nothing actually trivial about *nix command output. Just sometimes terse.


>Voice assistants are basically just mainstream non-visual command-lines, and it's unsurprising to me that something that relies heavily on memorization and extremely specialized "skills" isn't quite taking off in the way it was imagined.

This got me thinking. Voice recognition is basically a commodity now .. there are open source AI engines that can do it offline really well. So the recognition part is solved, you can just grab it from your distro's package manager. Now there's just the language part.

Thing is, I don't want to speak to my computer using English. Aside from the enormous practical problems in natural language processing you've outlined, I just find the idea creepy[1].

What I want is to unambiguously tell it to do arbitrary things. I.e. use it as an actual computer, not a toy that can do a few tricks. I.e. actually program it. In some kind of Turing complete shell language that is optimized for being spoken aloud. You would speak words into the open source voice recognizer, it writes those to stdout, then an interpreter reads from stdin and executes the instructions.

Is there any language like this? What should it look like?

And yeah that would take effort to learn to use it right, just like any other programming language; so be it. This would be a hobbyist thing.
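As a rough sketch of the plumbing I have in mind (every name here is hypothetical; assume some recognizer binary that prints one recognized word per line):

    # my-recognizer | python3 spoken_shell.py
    import subprocess, sys

    # Tiny verb table: spoken keyword -> command to execute.
    VERBS = {
        "list": ["ls", "-l"],
        "where": ["pwd"],
    }

    for line in sys.stdin:
        word = line.strip().lower()
        if word == "quit":
            break
        if word in VERBS:
            subprocess.run(VERBS[word])
        elif word:
            print("unknown word:", word, file=sys.stderr)

The real design question is the grammar on top of this, not the plumbing.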

[1] https://i.kym-cdn.com/photos/images/original/002/054/961/748...


> So the recognition part is solved

If you're using an averaged American voice - maybe. But it's really not solved for everyone. Google Assistant can't set the right timer for me 1 in 10 times. And that's before we get to heavily accented Scots and others.


Even my "affected Edinburgh accent", as someone once described it, causes no end of trouble with voice recognition.



> Voice recognition is basically a commodity now .. there are open source AI engines that can do it offline really well. So the recognition part is solved, you can just grab it from your distro's package manager.

This is potentially far from true, depending on how exactly you draw the line between "voice recognition" and "language". I've looked at quite a few transcription services, and they fail a lot of the time for most people - those who either have a non-native accent (even if very slight!) or those who do any amount of stammering or other vocal tics.


I find the ML transcription services, given 2 people speaking English with high quality sound and without heavy accents/a lot of jargon, to be adequate for having a skimmable record--such as for extracting quotations (and just go back to the recording to confirm the exact words if it's not obvious). But if I'm publishing a transcript I get a human transcription. Cleaning up the ML stuff takes way too much time and I wouldn't publish a transcript without cleaning it up.


I was in fact looking at some transcriptions of my recent meetings, and found one that captures how even small mistakes can make for completely not-understandable transcripts, unless they are manually cleaned up.

Manual transcription:

> So no: long story short, Slum is basically the way we can have an individual [, uhhh,] instance that carries all the licenses.

(Slum is a project name in this case)

Computer transcription (MS Teams):

> So no.

> A long story shorts. Love is basically the way we can have an individual.

> OHS instance that carries all the license.


> Voice recognition is basically a commodity now .. there are open source AI engines that can do it offline really well. So the recognition part is solved, you can just grab it from your distro's package manager.

I personally don't consider this a fully solved problem. The best transcription system I've used is OpenAI Whisper, and it doesn't work in realtime. Maybe it's fine on small amounts of audio, but it's still not perfect. You really need error rates to be driven down dramatically. Zoom auto-captions are a joke in terms of how badly they work for me, and Live Text (beta) on macOS is equally dreadful. YouTube auto-captions suck. All of these use industry-leading APIs. If I'm speaking a voice command and one single word is wrong, usually the whole thing fails.

There's an entirely separate issue about things that are Proper Nouns that don't exist. For example, "Todoist" is often misunderstood by Siri. Thus, people started saying "Two doist (where doist rhymes with joist)" to fool it into understanding "Todoist". Media like anime with strange titles from other languages often flat out trolls these transcription systems. ("Hey Siri, remind me to watch Kimetsu no Yaiba tomorrow".)


That reminds me of the handwriting recognition approach [1] used in old Palm Pilot devices. Even though the shapes it expected you to draw resembled the corresponding letters, you would never draw them like that if you were writing on paper.

You knew that you were drawing something designed for a computer to recognise as unambiguously as possible, while being efficient to draw quickly and easy to learn for you. I feel like that's the kind of notion that voice interfaces should somehow expand upon.

[1] https://en.wikipedia.org/wiki/Graffiti_(Palm_OS)


> And yeah that would take effort to learn to use it right, just like any other programming language; so be it. This would be a hobbyist thing.

There are quite a few hobbyists working on local, on-prem, privacy-focused voice assistants with conversation support.

https://www.home-assistant.io/integrations/#voice https://www.home-assistant.io/integrations/conversation/

Have fun. It is a rabbit hole.


To me the hardest problem is simply remembering what every light on my network is named. Did I call the light next to my desk “desk light” or did I call it “office light”? If I don’t get the name exactly right, I cannot control the light. Multiply that by every other light in the house and it becomes a lot to remember. I have probably 15 lights controlled by Alexa and I can only remember the name of like three of them. Thus most of the time it is just “Alexa turn on the lights” so it can turn everything on in a room.

If these voice assistants were smarter about “alternative” names for every device it might be easier to use. But as it stands, it’s kind of a pain because the way you phrase each request is so unforgiving…

Oh yeah, and god help you if your device name is similar to your room name. If your room is “office” (or did I name it “the office”?) and your light is “office light”, Alexa is gonna have a bad time telling the two apart.

I have no clue how to fix this…

PS: this is why I question steering wheel free self driving cars. How will we tell these things exactly where to go when we cannot even reliably tell our voice assistants exactly what light to turn on?
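PPS: the naming half seems at least partly addressable with plain fuzzy matching - here's a sketch using Python's stdlib (not how Alexa actually resolves names, just a hobbyist idea):

    import difflib

    DEVICES = ["desk light", "office light", "kitchen light", "the office"]

    def resolve(heard):
        # Return the closest known device name, or None if nothing is close.
        matches = difflib.get_close_matches(heard.lower(), DEVICES, n=1, cutoff=0.6)
        return matches[0] if matches else None

    print(resolve("office lamp"))  # -> 'office light'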


I think the biggest potential is with Microsoft Teams in business. It is ubiquitous in people's work lives, has access to data, and is integrated with everything. And adding Cortana to calls would be an easy step for people to understand and learn. People would say "Cortana, share my screen". People would learn phrases from each other.


But Teams hasn't figured out how to send text in a coherent way.

It's used because companies can cheap out on buying a license for other communication applications; it is fundamentally worse than anything else by any other metric. If voice lets me respond to a message without hunting for the hidden reply button because Teams shoves it below the bottom of the screen, then it could be a win. But considering how low the UX bar is for Teams, I doubt it will.


> There /is/ power to-be-had, but nobody has really tapped it.

This kind of thing can't be built for modern mainstream operating systems because they generally prevent subjugation of the OS components and other programs, even if the user wants that, ostensibly for security reasons.

Unlike a human operator, an assistant "app" can only operate within the bounds of APIs defined by the OS vendors and third-party developers. Gone are the days of third-party software that extends the operating system in ways that the overlords couldn't (or wouldn't) dream of.


That's not entirely true. Accessibility APIs on macOS, for example, would let you control many aspects of the OS from userland apps, provided the permissions are granted. But voice assistants are not up to the task.


I think you're identifying some of the right problems here. All voice assistants are based on turn-taking, and when the VoiceAI hits one of those failure points and just comes back with "I didn't get that" it leaves the user in a frustrating state trying to debug what's wrong.

I work at SoundHound where we've been worried about these issues. (I'm going to plug our recent work...) Our new approach is to do natural language understanding in real-time instead of at the utterance (turn) taking level. That way we can give the user constant feedback in real-time. In the case of a screen that means the user sees right away that they are understood, and if not, a better hint of what went wrong. For example a likely mistake is an ASR mistranscription for a word or two.

We still need to prove this is a better paradigm for VoiceAI in products that people can try for themselves, and are working towards that goal. I hope that voice interfaces that were clunky with turn-taking will finally be more naturally usable with real-time NLU.

https://www.youtube.com/watch?v=5WLYH1qHfq8


I tried Amazon's Alexa, the top-end model with a display. Often it would taunt you about new/interesting things on the screen, but I could never get them to work. I had to memorize things to get even the basics working. Ended up unplugging it.

However, Google's Assistant in comparison worked great: no memorization, and very useful. Sure, time, weather, timers, and alarms worked great with a very flexible set of natural language queries. Even more complex things like the temperature tomorrow at 10pm, simple calculations, and unit conversions. But also things like IMDB-like queries about directors, actors, which movies someone was in, etc. generally worked well. It seemed to really understand things, not just "A web search returned ...". Even more complex things like the wheelbase of a 2004 WRX would return an answer, not a search result.

With all that said I'm looking for a non-cloud/on site solution, even if it requires more work, most recently noticed https://github.com/rhasspy/rhasspy


The big issue is that there's no clearly defined interface for users. What commands are possible? Nobody knows. So people default to the most obvious things, like setting a timer. Is it possible to set up your own commands and build your own workflows? AFAIK, no. So the tech is essentially dead in the water until companies fundamentally rethink what they're trying to do with voice assistants.


Yup. At the risk of being glib I would say this is 90% of the issue. Or more like 'the big blocking issue' at the moment.

Voice can do way more than we know, but we have no idea what it does or how to use it.

Standardizing the interface and providing tutorials would possibly change things dramatically.

And this goes for the back-end protocols as well.

The tech is way, way ahead of the UI and integration.

Imagine getting the power of 'git' with no tutorial and without really understanding what it does. Good luck with that.

90% of us would be using it in the car to do a lot of things if we really knew how to do it:

You: "Siri: Command. Open. Mail. Prompt. Recipients starting with S"

Siri: "Sarah, Sue, Sundar"

You: "Stop. Command. Message. To: Sunar. Thanks for the note. Stop. Send without Review"

Some of this already exists, but it's product-specific, etc.; there needs to be some kind of natural universal interface - or we have to wait until the AI is really, really that good.
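To illustrate how little machinery such a fixed grammar needs, here's a toy sketch (keywords invented for illustration, not any shipping product's syntax):

    def run(words):
        # Reserved words delimit structure, so nothing is guessed:
        # "command" starts an instruction, "stop" executes it.
        mode, buf = None, []
        for w in words:
            if w == "command":
                mode, buf = "collect", []
            elif w == "stop":
                print("execute:", buf)
                mode = None
            elif mode:
                buf.append(w)

    run("command open mail stop command message to sundar stop".split())
    # execute: ['open', 'mail']
    # execute: ['message', 'to', 'sundar']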


Talon voice can do everything a keyboard and mouse offer, plus more (contextual awareness, higher-level abstraction). Very powerful in combination with modal editing. I'm not affiliated, just a user.

Granted, this is for a specific user base and yes, not in coffee shops.


This timeline is such a mishmash of mediocrity. Voice assistants could have been a vibrant ecosystem of different personalities - say, buying a Darth Vader voice pack or having your computer sound like a snooty English butler.

There's a great little game series called Megaman Battle Network (Rockman.exe in Japan) which diverges from the mainline by showing an alternate universe where scientists focused on AI instead of robotics, resulting in a world where "Navis" are ubiquitous.

I wonder: what if our early software engineers had focused on bringing natural voice control to CLIs before perfecting GUIs?


> There /is/ power to-be-had

This is not power. This is just first-world problems.


I think these assistants just need to give the user a way to edit interpretations.

A 'debug' area that lets you speak a command, see what was interpreted - and immediately edit it or click "that's not what I wanted". But not an afterthought, and not a cumbersome process like setting up an automation that is triggered by specific commands.

Imagine telling your voice assistant "You're wrong, as usual" and, instead of it giving you the boilerplate "I'm sorry…", it actually offered a way to improve itself.


I would think that a good command-line is one that responds to me within milliseconds on a crapbox i386 machine, and one that I can COMMAND what to do. A good command-line is not a binary blob that cannot parse simple instructions correctly.

At the same time, Siri seems to be getting slower and fatter with every iteration, so perhaps it is becoming more human ;)


> "hey siri, create alarms every 5 minutes starting at 6am tomorrow"

“OK, I’ve created an infinite number of alarms, every five minutes, starting at 6 AM tomorrow!”

(As a native English speaker, I'm not sure what specific outcome you want to happen from that request. That's the one that makes the most sense.)


As a native English speaker, that seems a profoundly odd request but that is what you asked for.

And you now have me wondering how open-ended calendar requests are actually implemented given that they can't literally have entries out to infinity. (I assume they go out some finite period and some background process periodically re-populates future entries.)


A recurrence rule is added to a start event, then an occurrence cache is either generated on the fly for periods of interest, or, yes, a rolling cache a year or two in the future is maintained and updated daily.
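
For the curious, here's a minimal sketch of the on-the-fly expansion approach using Python's dateutil library (purely an illustration; I have no idea what the actual calendar backends use):

    # Expand a recurrence rule lazily for a window of interest, instead of
    # materializing an infinite set of occurrences.
    from datetime import datetime, timedelta
    from dateutil.rrule import rrule, MINUTELY

    # "every 5 minutes starting at 6am tomorrow" as a recurrence rule
    start = datetime(2022, 11, 24, 6, 0)
    alarms = rrule(MINUTELY, interval=5, dtstart=start)

    # Occurrence "cache" for just the period being displayed:
    window_end = start + timedelta(hours=1)
    for occurrence in alarms.between(start, window_end, inc=True):
        print(occurrence)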


Perhaps trivial, but it actually seems like an interesting question, given you potentially have to trade off RPCs for routine queries (and the number of database records) vs. being wrong for the random "Am I free on this day three years from now?" query. Of course, the answer may be that, in general, the differences don't really matter.


Another pitfall of most voice assistants is that they are really designed first with the corporation in mind rather than the user. Most are proxies for surveillance, advertising, or are just steering consumers back to a preferred set of walled-garden services.


Yeah, the whole idea has a lot of potential that seems like it should be within reach, but somehow it's 2022 and my phone still can't handle "hey Google, play my driving playlist on Spotify."


Your queries continue to be money-sinks -- even in your ideal case, you aren't buying anything! This query costs them money but earns them nothing. This is useless.


> an assistant that's integrated into the OS and can change any setting

That sounds like a security nightmare. Someone walks by and starts changing your system settings? No thank you


Voice assistants and I are like me on the ballroom dance floor. I loved taking the lessons and learning all sorts of moves and chaining them all together and looking impressive, but when I got onto the floor with a partner, I just wouldn't know what to do or where to start. I kept to the "basic" steps and maybe a timid little turn once in a while.

Maybe it's possible to learn a working vocabulary and know how to command a voice assistant. I know my way around several command lines, but I have no idea what to say to Hey Google.


It almost sounds like you are describing how it feels to learn a new language. And if that's the case, and people need to learn "voice assistant" to communicate with their device effectively, hasn't it utterly failed as a natural language processor?

Also, I know this is true in other domains as well: there is obviously a common "Google-ese" that people learn to narrow down their searches.


I think it's fairly clear now that the only time a voice-based UI is better is when the user is unable to use their hands. Driving, or in the kitchen when cooking, seem to be the most successful. There are barely any other strong use cases.

On top of that, the general distrust of the privacy of these systems has stopped a significant number of people (myself included) from wanting to use them at all. I don't have an in-home device, and have turned off Siri on my Apple devices.


And then it frequently fails without feedback. I've tried so many times to tell Siri "Send a message to x that I'm 10 mins away", only to realize much later that "message delivery failed".

No clear feedback, and a weird timing issue where it just stalls and shows the message it's about to send in case it got it wrong.

It's just a terrible UX all around.


I've had to stop using Google Assistant to send messages. It used to ask the user to choose the correct word when it misheard something. Now it just makes a wild-ass guess and sends it on. It's caused me to send some very odd messages to people and/or look like an idiot.


Totally agree. "Hey Siri, start a timer for X minutes" and "Hey Siri, play Y on Spotify" are the entire extent of my voice assistant interactions.


And I’m always annoyed that it doesn’t tell me how long I set each timer to when I have multiple timers, just the remaining time, which makes it hard for me to know which is which.


It’s not encouraging that even the most common use cases for these systems still have rough edges.

The Google ones do support named timers, so you can say “start a pasta timer” and later ask “what’s left on my pasta timer” etc. I thought Siri added this at some point, but I wouldn’t be surprised if not.


It does. "Hey Siri, set a timer for 45 minutes called pasta bake", "Hey Siri, how long left on pasta bake?" works.


Actually, it seems that Google is now able to cope with things a bit better. I've just tried, and I can now ask "how long left on my 45-minute timer?" and it does finally answer. It also shows up in the UI. That is sort of recent, though, as I had that problem earlier this year. It seems that Siri is now unable to set multiple timers, though.

Anyway, it does seem to have improved, but I wonder why that stuff wasn't in from day one. It seems pretty obvious to me.


Nope, Siri can also do multiple timers. It asks for a name if an unnamed timer of the same length already exists.


Weirdly, the ability to set multiple timers varies by the device you are using.

A HomePod can set multiple timers. A Watch, iPhone, or iPad can only set one timer. There is no obvious technical reason for it. It just seems like only the HomePod team thought it was an important feature.

This becomes annoying if you have multiple devices set to respond to “hey Siri” and the wrong one picks up the request and then refuses to comply.


This used to be true, but at least the Apple Watch lets me do this all the time. I forget which OS update added the capability.


Oddly enough, on iPadOS 15.7.1 at least, if I say "Hey Siri, set a timer for 20 minutes" it will say "20 minutes, starting now". Then if I say "Hey Siri, set a timer for 5 minutes" it says "There's already a 20-minute timer. Replace it?"

If I say "set a timer for 20 minutes called A" it just ignores the "called A" part.


I do "Hey Siri set a timer for Foo for 5 minutes" I get a timer named Foo, and I can then set concurrent timers named Bar, Baz, etc. without replacement.


Yep, that seems to work, which brings us back to the need to memorize fairly precise incantations.


It even plays a little Italian jingle if you start a pasta timer.


I discovered recently that you can set a timer just by saying "Hey Siri, four minutes," or however long you want, and she will set a timer to that length. Not that I'm doing anything with the extra second I've saved myself, but it feels good anyway.


I don't think that's true. It's way faster for things like smart lights and playing music. The problem is it's still so limited for other things. For example I can tell Alexa to play The Simpsons on Fire TV, it will do so on Disney+ but always the first episode even though the last watched one was in Season 15. It also can't seem to find my purchased episodes from iTunes (watchable with the Apple TV app on Fire TV). Simple searches also have a high chance of being misunderstood still with poor results.

I think if the accuracy was better and more content/things were available through voice it would be a pretty good input method for any scenario where you don't need visual feedback.


Selecting content via voice only works when you can either name exactly the content you want, or you are choosing a general category to play.

Browsing content, or looking it up, via voice is slow and painful, and it always will be. Who wants to be saying "next", "scroll down", or having a full-on conversation with an AI to try and work out what you want to play? The fingers on our hands have evolved to be incredible at interacting with things; we are good at using them. Touch screens and physical UIs will never be superseded by voice.

So yes, there is a small use case for voice for controlling music/TV, or controlling a few things in the home (heating, blinds, lights), but that's it. I don't believe there is this massive opportunity to expand it into our everyday lives where we are constantly interacting with devices via voice.


You don't need to name content exactly. Saying "Alexa, play the sandwich song from Frozen" for example will correctly play Love is an Open Door. Ideally this kind of thing would work for TV and movies too, web searches as well.

Humans evolved language over thousands of years to communicate ideas, wants, and desires to others. Obviously voice UI is not there now, but maybe someday the experience won't be much different from asking the movie rental store clerk for their recommendations for a romantic comedy.


>Saying "Alexa, play the sandwich song from Frozen" for example will correctly play Love is an Open Door.

I bet ya some engineer at Amazon hooked that up manually when they saw a bunch of requests failing, so that’s only gonna work for popular fuzzy naming conventions. I don’t want to have to think “is this a way lots of people are gonna request this song?” before saying it that way.


That is still an example of knowing exactly the content that you want to play; you just don't remember its official name - and yes, search engines have gotten pretty good at this.

The more relevant use case is "hmm, I'd like to listen to some prog rock, let me browse what Spotify has and see what takes my fancy". Sure, I could say "VA, play prog rock", but I don't want it to choose for me: I want to browse the available content to remind myself what are my options, and choose one when I see one that looks interesting.


> asking the movie rental store clerk for their recommendations for a romantic comedy

The only people who ever did that were in a romantic comedy.


This is a real problem, though the problem is not technical, it's purely contractual/legal: Disney+, Apple TV (all content providers, actually) refuse to allow third parties to know what you've watched, because viewer data is closely protected.

That's why an open-source, ToS-violating assistant has a chance to work better than the legal ones: it can just scrape all that info off the internet. But then, once you go into that grey area, you might as well be pirating the content anyway.


To be fair, that's an issue with Siri integration, not Siri itself. Kinda sad that Apple's own products don't implement it properly :/


>the only time a voice based UI is better is when the user is unable to use their hands. Driving

My observation of people on the road has led me to conclude that driving is an activity where people think they can do absolutely anything else while engaged in it.


Put it like that and you will get a lot of upvotes. But name almost any specific activity and you will get a ton of downvotes, and probably someone saying no, that one is okay. (These days "you can't drink alcohol while driving" is safe to say; 30 years ago, if there had been an internet, people would have said you were wrong.)


Things I've seen that I think are probably not OK:

1. sending messages on phone while driving, one hand on steering wheel.

2. having sex / receiving oral sex.

3. turned around, yelling at kid in back seat to not fight with other kid in back seat.

4. girlfriend having argument with boyfriend, slapping him on arm some, about how she was smart too just a different kind of smart while swerving back and forth in fast merging traffic near the Haight (I was in back seat)

5. it's getting hot in here, time to take my jacket off!

your mileage may vary of course.


> Driving or in the kitchen when cooking seem the be the most successful.

Since the voice assistants are incredibly stupid I find it extremely stressful and distracting to ask them for anything while driving.


Figuring out what works well or not while driving isn't a great idea, but using the ones that work well seems fine for most people.

Saying "Hey Siri, text Fred <pause> I'm on my way but stuck in traffic, eta 4 o'clock" or something along those lines nearly always works fine for me and is no more distracting than having a conversation with somebody in the car with me. If Siri gets some of the message wrong I'll either send a new one using clearer speech or wait until I'm not driving to fix it if the mistake isn't important.

Sure, it would be possible to then allow myself to get distracted by focussing too much on some weird aspect of it, but equally it would be possible to get so emotional in a conversation with somebody sat next to you that you stop paying attention to the road. And we (most people at least) don't say "it's not safe to talk at all while driving", we just make sure not to go over that line of getting too distracted by the conversation.


> or something along those lines nearly always works fine for me and is no more distracting than having a conversation with somebody in the car with me.

Until you have several Freds in your contact list. Until you have friends with foreign/uncommon names. As long as you have near-perfect American pronunciation. As long as...

There are too many variables to consider and think of. Sometimes I can't get Siri to reliably understand what music I want (and my English is pretty darn good), much less anything more advanced.


> As long as you have near-perfect American pronunciation.

There isn't such a thing as an American accent. Ask anyone who is not a native speaker, or who hasn't been in the US that long, to try to understand my natural deep Southern accent. I can adjust my accent if needed and if I think about it.


There is an accent that is typically called "standard American" or something along those lines, which is what you'll hear on things like national news programs. I'm not sure how many Americans actually speak with this accent, but it's usually the one that all of these devices target initially.


Yeah, I was thinking along the lines of "General American"/"California English":

- General American https://www.babbel.com/en/magazine/united-states-of-accents-...

- California English https://www.babbel.com/en/magazine/the-united-states-of-acce...


The "Standard American" accent is the west coast (mostly California) accent. It is far from universal.


It is Midwestern and, while not universal, it is how the majority of speakers on the coasts, and the people I have met from the Midwest, speak.


It's essentially how a homogenized middle+ class of people born/raised in/around metropolitan areas that aren't in the South/Texas speak. (In general, stereotypical accents associated with various cities are mostly more of a working class thing, e.g. Southie in Boston.) The South is the main outlier. Colleagues I work with from and living in North Carolina generally have a distinct southern accent albeit a mostly slight one. But, yeah, historically we'd have called it Midwestern.


I think historically we'd probably have said it was Midwestern. But, yeah, however the network news anchors speak.


> Hey Siri, text Fred <pause> I'm on my way but stuck in traffic, eta 4 o'clock

That only works well if you have an accent it recognizes, if your speech is clear (not slurred, not lisping, etc.), if you don't stammer, if you don't have any verbal tics that you don't want to show up in the message, and if "Fred" is actually a simple unambiguous name.

Otherwise, at best when you want to send a message to "Ioana" it may end up sending a message to "Anna" that says "I'm, ummm, oh my way! and stalking traffic ate a what was it like 4 like maybe 4 and you know what <pause>" (followed by the "4 o'clock" that will no longer be included).


Google (android auto) was significantly better at this early on than it is now. I used to be able to search random topics by voice while driving, and it would read me excerpts and results. I used it often. Now it's map-specific, messaging, or music-specific and nothing else.


>Driving

While driving, I wanted to have Siri read a lengthy webpage to me. I pulled up the page, got in the car and asked Siri to "speak screen." Siri says it can't do that when I am driving! What idiot thought that was a necessary safety measure? What if I were the passenger?

Overall, I am stunned at how bad Siri is at things that don't even require AI. It's almost as if this insanely profitable company failed to invest a tiny bit of money into researching ways that people would like to use Siri.


> What if I were the passenger?

I often go places with my sister (she drives). Her car doesn't allow pairing or swapping bluetooth connections to the car's entertainment system while it's moving. If we want to switch to my phone we have to come to a complete stop.


I disagree: a voice-based UI is better IFF:

1. The command set is broad enough, or user input is complex enough, to make other UIs inefficient.

2. The voice UI is up to the task of correctly interpreting the voice input most of the time.

What "most of the time" means for the second item is somewhat personal and use-case specific.

For item 1, examples where voice is better right now, or could be with reasonable NLU improvements:

"Text my wife that I'll be there in five minutes."

"Get me driving directions to the nearest Indian restaurant with at least 3 stars on Yelp."

"Order six rolls of paper towels and a bottle of Windex from Walmart, delivered to my home address, for delivery by Saturday"

"Remind me tomorrow morning to review this web page"

"Create a shopping list with the items from this recipe"

"Create a basic presentation with one slide each for each entry in the table of contents for this book"

Voice can be better. As others have pointed out, as long as it's like playing Zork where half the time the response is "you can't do that" or "I don't understand", voice interfaces will continue to flounder.


> Driving

Siri never triggers when I'm driving, it just doesn't hear me. I think it's because of the noise of the car or because of my music, but it doesn't work. I have to move my face closer to my phone so that it can hear me, but that's even more dangerous than using the controls.

Same when I'm in the shower and I ask it to change the music, it doesn't hear me, I have to shout and get angry every time.


Is your phone mounted close to the air vent? Siri hears me perfectly in my car, even when the phone is in my pocket but it doesn't hear me at all when I mount the phone on the air vent.


Yes, it is. Maybe it doesn't hear me because of that.

For what it's worth, it doesn't work either when it's in my pocket. When I come home and ask it to turn the lights on, it doesn't answer if it's not in my hand.


That might be a separate issue - you have to specifically enable a setting which will make the phone listen for Hey Siri when it's face down (or in the pocket).

See here: https://support.apple.com/en-gb/guide/iphone/iphaff1d606/ios...


You might try blowing out the microphone holes with a can of air. They may be clogged with pocket lint.


> when the user is unable to use their hands

This is still potentially a huge domain. One could imagine a benign scenario where voice assistants enhance people's ability to interact with each other (and with digital devices) when a more potent UI is not within reach.

Privacy concerns (-> controversial business models) and the technical ability to deliver a desirable service (that people would pay for) might indeed prevent this vision from catching on in the short term.

Another factor that may complicate adoption is simply cultural perceptions. It is a somewhat odd thing to be shouting at devices, especially in the presence of other people. User interfaces that interfere strongly with communication habits and behaviors established over millennia (see also wearing VR goggles) might have a harder time seeing adoption outside very specific scenarios.


Fully agree on the privacy distrust.

BTW, another use case for speech recognition is when you're carrying a baby around.


I've gotten pretty good at doing chores one handed...


I'd say my baby preferred me playing around with my phone to me speaking up. Doubly so when they were being carried around sleeping.


Even while driving, it's useless past basic commands.

My most egregious example of this for me is that there's a grocery store near me that the Google assistant is incapable of finding because of a few people in my contacts list. Whenever I try to ask it for directions to that store, it picks (at pretty much random) one of three of my contacts instead. This is despite the only common part of said contacts' names and the grocery store is that their names all start with the same letter.

Basically, imagine asking for directions to Albertsons, and the assistant giving you directions to Andrew.


Or in countries where we get snow during autumn. Getting messages read out loud, and responding with speech-to-text, is great too.


The main issue for me is that they are not stateful. Perhaps the main thing in the role of an assistant is to keep state. You want someone or something that understands you and what you want, so that you don't have to put too much thought into it.

If you tell it you want more coffee, it should know what you like and suggest a mixture of brands you've bought before and new ones you may enjoy. If you tell it you're hungry, depending on the time of day it could suggest some takeaway you've ordered previously or something else you may like. If you say the same some other time, it may suggest recipes based on what you have at home, or it may suggest nearby restaurants. It should keep track of your friends and so on, tell you when their birthdays are coming, and it would be nice if it could even suggest some presents based on things you've told the assistant before, or their wishlist on Amazon or something else.

There are a lot of things assistants could do, but it needs to know you. The model where everyone has the same assistant doesn't quite work out.


Complicating the situation is that I don't trust any of the companies making virtual assistants with this level of personal data; so the first thing I do on a new device is block the assistant's access to my location or any other behavior learning functionalities.


I think that lack of trust is the single biggest factor holding back this technology.

I imagine that a truly top-notch virtual assistant would always be listening and aware of your behavior and context. However, the level of trust required for such monitoring is usually reserved only for one or two people in our lives, and even then, it's quite incomplete awareness on their part.

I don't know how a for-profit company can reconcile this disconnect, though I imagine that someone will eventually try.


I believe tech companies are aware of the creepiness factor associated with surprising customers with too much context about them. I think they cripple some of the more context aware features that they could be doing because it draws a lot of attention on just how much data they have about you.


That could maybe be introduced gradually, like that story about the frog that doesn't feel the water getting warm. What puts people off is having it know everything up front through inference on data you were not even aware you had shared.

Instead, you could go easy, by first suggesting the user set up the assistant by linking it to Amazon, Deliveroo/Uber Eats/etc., Facebook, and verbally sharing information as it asks. Then, over time, it could spread its inference further and further, and you won't be sure whether you already shared that information or not, but you will just assume you did, as you share everything anyway. For people who are not so open, it could stick to inferring less and being less useful.


I would want a company where I pay them for their assistant, knowing that everything it holds is not shared with anyone and is completely under my control.

The issue, of course, as with paying for YouTube, is that consumers will wonder why they would want to pay for an assistant.


The market for something you pay for is small. The market for free things is seemingly never ending.


>creepiness factor

This is a complicated social construct even with people. If a public relations person (or whoever in a professional context) reaches out to me, I hope they've done some basic research on what my interests are. I saw you were at $CONFERENCE last month? Sure, probably. Start asking me about my vacation last month that they found photos from on Flickr? Probably getting over a line if I don't know them.


Cortana on Windows Phone had a "notebook" of everything it had learned, and you could modify it to your liking. For example, it detected my home and work locations and my transit hours, and displayed bus information every weekday at 17:15 before I left, and I could modify that info if it was wrong.


Google also identifies your commute; it's actually a Maps feature and not the assistant's, though.


Yeah, I wanted something like "voice notes". For example, I gave my kids credit scores to see who earns the most each week; I want my smart speaker to have a simple voice-activated k-v store ready that can increase or decrease, get, and reset values.
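
A minimal sketch of the logic such a feature would need (the command phrasing here is entirely made up):

    # Toy voice-activated key-value counter store; 'utterance' stands in
    # for whatever the speech-to-text layer produces.
    import re

    scores = {}

    def handle(utterance: str) -> str:
        m = re.match(r"(increase|decrease|get|reset) (\w+)(?: by (\d+))?", utterance)
        if not m:
            return "I only know increase/decrease/get/reset."
        verb, key, amount = m.group(1), m.group(2), int(m.group(3) or 1)
        if verb == "increase":
            scores[key] = scores.get(key, 0) + amount
        elif verb == "decrease":
            scores[key] = scores.get(key, 0) - amount
        elif verb == "reset":
            scores[key] = 0
        return f"{key} is {scores.get(key, 0)}"

    print(handle("increase alice by 5"))  # alice is 5
    print(handle("decrease alice"))       # alice is 4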


It's useful for trivial unambiguous tasks where you have your hands full or don't want to touch your device or it's dangerous to. That's all I can muster mine for.

"Hey Siri, add more toilet paper to the shopping list" (while pooping)

"Hey Siri, shuffle my music" (while driving)

"Hey Siri, countdown 10 minutes" (while shoving a pizza in the oven)

Anything else is a shit show. Anything where trust or accuracy is involved i.e. mutating data, spending money, absolutely no way can I trust it at all and never will.


Agreed, but I find for even these simple tasks it's hit-and-miss for accuracy. My Google device will randomly not know what a "shopping list" is, or the interactions go something like this:

"Hey Google, put dishwasher salt on the shopping list" "OK, I added 'put dishwasher salt'" (strangely, this particular bug only manifests for dishwasher salt).

Timers are useful, but sometimes they can't be shut off by voice command.


Yeah, it doesn't always work well. I say "hey Siri, add green milk to the shopping list". I want "green milk" added to the shopping list, which in the UK is semi-skimmed milk. What does it do? Adds "green" and "milk" separately, because it thinks I'm a weed smoker...


Trust and accuracy is involved in the first and last of your examples - I'd end up having to check that the TP was actually added to the list, and that the timer had actually begun and was set to 10 mins.

Shuffling music, turning lights on, yes fine - because confirmation that the right thing has happened is instant and effortless. Anything else, I'll use a button or a screen.


Definitely agree with this. You do get that confirmation with Siri. I mostly use my watch for it, and it will show me what it did on the screen without my having to touch anything.

Confirmation is required when dealing with humans as well ... https://www.youtube.com/watch?v=11fCIGcCa9c (this reminds me of Alexa)


Google is pretty good about that. It will say "Ok, your alarm is set for 7 hours and 40 minutes from now" and similar.


Not really - adding toilet paper to a shopping list is not clicking the "buy" button. And if you set up a timer you get quick confirmation that it has been set. If the timer is accidentally set for 100 mins it's easily corrected.


I think the parent meant that you need to check if these commands are executed properly, otherwise you get into trouble later. For example, if the toilet paper isn't added to the shopping list, and you go shopping with this list the next day trusting it contains everything you need, you're not buying the toilet paper. Similarly, if the timer is accidentally set to 100, you only notice it after, say, 20 minutes when there's black smoke coming out of the oven.


I know, that's why I said you get a quick confirmation - "Ok, setting a timer for 10 minutes." is spoken back to you via the speaker.


I just asked Alexa to set a timer for 2 mins, and you're right - she did then ponderously state that a timer for 2 mins was starting. Then she asked me if I'd like to hear tips about using timers? No. Then she told me I had two notifications, would I like to hear them? No.

Then I timed myself setting a timer on my phone, which took 9 seconds from pocket to running.

Adding to a shopping list isn't clicking the "buy" button, no - but if it's not on the list I won't buy it and then I will have no toilet paper. I would not need a list if I could simply remember everything.


> Then she asked me if I'd like to hear tips about using timers? No. Then she told me I had two notifications, would I like to hear them? No.

Are you saying this for comedic effect, or does the Alexa really do this? (I'd look it up myself, but good luck with that query...) To each their own, but I'd throw the device into the street if it pulled a stunt like that.

> Then I timed myself setting a timer on my phone, which took 9 seconds from pocket to running.

To the Homepod or my Apple Watch: "hey, siri, tea timer for three minutes".

"Three minute tea timer, starting now."

I didn't think a product could screw that up. I would suppose it's a design decision between "assistant" and "servant that carries out my command without backtalk". There are times that I wish the Apple product were more "assistant" than "servant", but the Alexa product just sounds pushy.


I use Alexa for shopping lists, I get a “toilet paper added to your shopping list” confirmation after adding items to my list.

It's not perfect, though; for example, when trying to add fruit and fibre cereal it will often add two items, "fruit" and "fibre". But it's close enough that when I get to the store and check the list, I know what I intended to add.


Mmhmm, I never handle my phone while pooping, no siree.


> "Hey Siri, add more toilet paper to the shopping list" (while pooping)

This is the main reason why I have an Echo in my bathroom! The one advantage Alexa has over everything else is that you can voice shop -- "alexa buy more toilet paper" solves the problem that much faster than a reminder for later.


I don't want that to happen because the price variation in toilet paper is huge based on deals and offers available, and Amazon is rarely the cheapest provider these days, so it's actually worth me spending a few minutes on it to save some money.

The reason Alexa exists is to sell you Amazon's prices, not necessarily a good deal.


Also, I think I'd rather just add stuff to my shopping list so that I can at a later time order everything together, rather than have multiple deliveries.


The sort of consumables I might order on Amazon on a regular basis--like those that the Amazon Dash buttons were intended to address--can vary a fair bit in price and quantity. I'm not going to have Amazon just ship whatever.

And it's not even a very frequent thing. Mostly, every few months, I look through what consumables need replenishing and I fill up the car with plus-size packages from Walmart.


Lights on / lights off is also useful, especially when in bed or carrying a basketful of washing.


They still just don't work very well unless you memorize very specific exact commands.

The other day, we had to remember to book a school thing for the kid. I said "Hey Google, set a reminder for 9pm to [book the thing]".

Google replied "Here are web search results for set a reminder...."

When they fail constantly at the most basic tasks, usage is going to drop way off.


They also refuse to ever show a reference so that you can learn what you can say, or teach you any other way. Refusing to show and teach your users feels arrogant to me. They had to promise that you can "just speak naturally" and now they can't roll it back.

Meanwhile my success rate for the way I speak is below 20%.


They do chime in and tell you all sorts of things ("by the way, I can also play music while you're getting ready for bed") while I'm trying to concentrate and live my life, lol.

Just compile a list of all commands into an email and send that to me. I don't actually want to hear you talk, Siri.


What annoys me the most is this:

"Hey Google, turn off the den light switch in 30 minutes"

"Sorry, for safety reasons we cannot...."

It's a light. Because it heard "switch", it thinks there might be some power tool connected to it and won't let me set delayed actions. I want to be intelligent with it, like "Hey Google, turn the lights on when the sun comes up every day", but no one has gone to that next step.

Or how about "Hey Google, turn off the tone played when you say Hey Google". These settings aren't accessible from the voice interface itself.

Can't wait for Alexa to fail so my SmartTV will stop nagging me to use integrations I will never use. Anti-competitive but whatevs.


Some of the stuff Google doesn't put effort into is super weird, like how, if the device is in a different "room", it has to loudly announce what it's doing, but if it's in the same room it turns down the volume on all of the devices for 10 seconds.

It really makes me think they almost want to discontinue it, but it's so integrated with Android and the Chromecast that they can't really kill it.


I think you can set 'routines' now, might be able to help with sunrise actions - https://support.google.com/assistant/answer/7672035?hl=en-GB...


I have explored those, but I ran into many, many issues when I wanted to build routines of routines. I wanted to compose a routine from others, but I couldn't simply/directly invoke a routine from another without voice interpretation, and it led to strangely phrased triggers just to get it to work. It's frustrating that I can't make these explicit actions in the app; it's just very clumsy. In the ideal scenario, I should be able to configure all of that through the voice interface itself: "Hey Google, create a routine that initiates from 'It's dinner time' that turns on all the lights in the dining room and sets them to 50% with the 'candlelight' color".

I hate how many products need to be configured from apps nowadays. You gave me the voice assistant, and I can't use the voice assistant to manage the voice assistant. I shouldn't even need an iPad or phone to set it up. Give me a minimum of local processing/interpretation so I can get it online and it can begin interpreting more articulate requests.

We are so far away from zen.
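
To illustrate what's missing, here's a sketch of routines as plain composable data, where one routine can invoke another directly instead of going through a voice-phrase trigger (everything here is hypothetical, not the actual Google Home model):

    # Hypothetical composable routines: a routine is a list of steps, and
    # a step can be another routine -- no voice interpretation round-trip.
    ROUTINES = {
        "dining room lights": [("lights", "dining room", 50, "candlelight")],
        "it's dinner time":   [("routine", "dining room lights"),
                               ("announce", "dinner is served")],
    }

    def run(name):
        for step in ROUTINES[name]:
            if step[0] == "routine":
                run(step[1])  # routines invoke routines directly
            elif step[0] == "lights":
                _, room, level, color = step
                print(f"{room}: lights {level}%, {color}")  # device-layer stub
            elif step[0] == "announce":
                print(f"announcing: {step[1]}")

    run("it's dinner time")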


You can change the device type in Google home to work around the scheduling issue


Voice assistants are shit. The number of times my friends have got Alexa to turn the light on first time is functionally zero.

And, they don't really explain the syntax constraints. Which are massive.

Try, ab initio, without knowing how to do it, to get OK Google to open an arbitrary Google-authored app and direct it to do something. Compare that to learning the OS UI keyboard shortcuts or AppleScript (which, by the way, like Windows, is basically fully documented, because all the libraries are self-documenting for their call structures).

The voice interfaces are universally badly designed because spoken command sentences are not well understood as a modality of command, distinct from mouse, gesture, touch or keyboard.

Until voice is baked in with a documented syntax in "man" format, I won't believe it's first class.

How do I even know for any arbitrary app what voice directives it uses? How do they correlate to any other command input? How consistent is this with other commands in other apps? Does "stop now" always mean the same thing between a mapping routing app, and a tape backup app? Isn't "stop" contextually defined in a way ^C isn't?


I bet you could sell a lot of fully offline voice assistants that just did timers and maybe reminders that sync to the phone or other smart speakers with bluetooth. No privacy concern if there's no WiFi at all!

I'll stick to Google assistant for the extra occasionally useful stuff, but the idea of a device that won't stop working if the server goes away is pretty cool.
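
The pieces for a timers-only offline device mostly exist already. A rough sketch using the Vosk offline recognizer and the sounddevice package (the model path and the number-word table are assumptions, and a real device would add a wake word):

    import json, queue, re, threading

    import sounddevice as sd
    from vosk import Model, KaldiRecognizer

    audio = queue.Queue()
    NUMBERS = {"one": 1, "two": 2, "five": 5, "ten": 10, "twenty": 20}

    def on_audio(indata, frames, time, status):
        audio.put(bytes(indata))

    def set_timer(minutes):
        print(f"timer set for {minutes} minutes")
        threading.Timer(minutes * 60, lambda: print("DING!")).start()

    model = Model("model")  # path to a downloaded Vosk model directory
    rec = KaldiRecognizer(model, 16000)

    with sd.RawInputStream(samplerate=16000, blocksize=8000, dtype="int16",
                           channels=1, callback=on_audio):
        while True:
            if rec.AcceptWaveform(audio.get()):
                text = json.loads(rec.Result())["text"]
                # Fixed grammar: "set a timer for N minutes". N comes back
                # as a word (e.g. "ten"), so map it through a small table.
                m = re.search(r"timer for (\w+) minute", text)
                if m and m.group(1) in NUMBERS:
                    set_timer(NUMBERS[m.group(1)])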


> The number of times my friends have got alexa to turn the light on first time is functionally zero.

I have a room called "Study" and the only lights there are called "Main Lights".

- Hey, Siri, study lights 100%

- Did you mean these lights: "Study: Main Lights"

:facepalm:


I don't disagree with the stupidity sometimes, but I have probably upwards of 50 light bulbs/plugs connected to my Echo speakers and use them exclusively for turning on my lamps, dimming lights, etc., so I can say confidently: it makes mistakes sometimes, but once you learn the quirks and set things up with proper names it's actually pretty impressive. And you can learn the oddities, you just have to be dedicated, sorta like learning anything else.

I’ve learned that I can’t say “what’s the temperature” because it will tell me the temperature my Dyson fan is set to lol. It’s not wrong, it’s just missing some context (me holding my jacket wondering if I should wear it) that I can provide going forward. Maybe I should just ask it “should I wear my jacket today”


> set things up with proper names

The hard part isn't naming the stuff, it's remembering what each thing is named. I could call the lamp next to the couch several different things depending on the context. If you don't remember exactly what it is named, you're gonna have a rough time. I've been tempted to just put a label with the device name on everything controlled by Alexa.

And good luck getting a guest to turn anything on or off.


> I could call the lamp next to the couch several different things depending on the context.

Or just call it dude :) https://reddit.com/r/tumblr/comments/548b3p/everything_is_du...


That's the issue, though: there are too many quirks even for the simplest of interactions. You have to be constantly aware of quirks, and issues, and workarounds. It's like walking on broken glass.


Yeah, a single-member set shouldn't require full enumeration; it's detectably the only matching device in context.

If it was the only "main light" across all rooms I'd hesitate to say the same mind you, nesting scopes should be respected.

What would you want it to do when you add a second light bulb? Tell you the old command is no longer unique or turn on both?


> What would you want it to do when you add a second light bulb? Tell you the old command is no longer unique or turn on both?

I think when the command is generic "lights" it should turn on all lights. But both are valid behaviours.


> The number of times my friends have got alexa to turn the light on first time is functionally zero.

Do your friends have accents? Alexa frequently fucks up due to my South African/Zimbabwean accent. I joke that Alexa is racist. Google seems to handle it way better. Here's hoping that Mycroft does better (which I plan to switch to in the coming months).


They managed to control the tech so tightly that no potential is fulfilled and they’re boring.

Truly opened to developers they could have been really interesting and fun.

This is what happens when big companies develop technologies and think they are too valuable to share.


This also pissed me off. Why not start the day with a weather report, short news, and some trivia? When I checked a long time ago I didn't find any way to integrate stuff like that together.


It's possible to do that with a morning routine in Alexa. I know this because it tells me the weather (and other things) after I kill my alarm in the morning.


If you check again now, this is an out-of-the-box feature on Google Assistant. You just say "good morning" and it tells you the time, weather, news, and more.


A lot of the more useful information retrieval tasks involve a feedback loop. If I’m shopping for a product, I may enter a generic term. Then the system sends me images of products matching that term. Then I tell the system which product image is closest to what I want. Then the system sends me reviews of that product. Then I read the reviews and realize this is not what I actually want…repeat until I’m satisfied.

I can’t do this loop with modern voice assistants.

A voice assistant with contextual conversation skills, and access to an “always on” visual monitor (home projector or AR glasses) would definitely increase utility by 10x or more


I use voice assistants for one task:

Setting a timer.

The oven is in another area of the house, so when I come back to squeeze in some work, I often just say "Voice Assistant, set a timer for 10 minutes!"

And that's about it.

Apart from that, I worked on some chatbots in the past, and this is the very same thing to me, just a bit worse because of audio.

Natural language processing simply isn't there, AND there are just a few very niche use cases in my eyes.

So if they go away again, I won't cry or rather my tears will dry fast.


This (for me) just shows a normal trend in tech.

Somebody develops something, gets a lot of buzz, doesn't deliver, buzz dies slowly, and then several years in the future somebody else actually builds the tech stack needed for it, and it takes off again.

Voice assistants, chatbots [1]... the metaverse.

[1] Several years back I attended a talk in my city where some people from IBM showed us how to implement a chatbot, I think with Watson.

Beyond the trivial examples, it was nearly impossible to implement anything (or the people giving the talk didn't know how), and the documentation was nowhere to be found beyond those trivial examples.


While I get why Amazon, Microsoft, and Google (?) might want to reduce development costs for voice interfaces, it seems like Apple is locked into supporting Siri, since one of its products, the Apple Watch, really needs Siri to get full value from it. I like to go about my day without carrying a phone (if I don't need to take pictures), and the Apple Watch is a great compromise that allows getting calls and text messages but at the same time is not as intrusive as an iPhone. No one sits in a restaurant staring lovingly at content on their Apple Watch, ignoring nearby people and their environment.

Pardon my getting a bit off topic, but it is interesting to see the “belt tightening” by FANGs: it seems like everyone is cutting excess staff and looking hard at which products make money and which are money losers. This may seem like a good thing except for newer product categories like AR/VR that will need a lot of experimentation to get right.


It's not the technology. Voice transcription works great, and from the point of view of extracting meaning, even pattern matching would do better than the embarrassing failure of today's voice assistants. It's a matter of product. We are living in an IT world where there are no great people able to turn the technological potential into useful things.


I agree, but I think things will likely change soon. The main thing holding back progress in this field seems to be privacy issues and sending data to the cloud. However, we already have human-level voice transcription that you can run on low-to-moderate hardware at home, so all it takes is for the average developer to be able to run GPT-like LLMs on their machine. At that point I think the quality will improve really fast.


Conversational interfaces just aren't any good, and until they're perfect they won't be useful. The primary issue I've seen (having worked with Slack bots) is discoverability: how do you find out what a bot can do without asking lots of open questions? The most useful bots I saw were those that didn't try to be conversational but had a fixed grammar for commands and questions (along with a good help response). And at least with chatbots you've got the textual context you can scroll back on and read, as opposed to trying to keep track of what's happening in a conversation with a non-human you can't see.
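
A minimal sketch of that fixed-grammar style, with made-up commands; the key property is that unrecognized input gets the full grammar back rather than a shrug:

    import re

    # command name -> (pattern, usage string shown in help)
    COMMANDS = {
        "deploy": (r"deploy (\w+) to (staging|prod)", "deploy <service> to <env>"),
        "status": (r"status (\w+)", "status <service>"),
    }

    def handle(message: str) -> str:
        for name, (pattern, _) in COMMANDS.items():
            m = re.fullmatch(pattern, message.strip())
            if m:
                return f"ok: {name} {m.groups()}"  # dispatch stub
        return "I know:\n" + "\n".join(u for _, u in COMMANDS.values())

    print(handle("deploy api to staging"))  # ok: deploy ('api', 'staging')
    print(handle("do something"))           # prints the help grammar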


Despite the tremendous amount of effort that's gone into creating large language models, there is still no way to hold a goal-directed conversation with voice assistants. There is a lot of implicit context in normal human speech that needs to be inferred or clarified. None of these speech assistants can handle anything more than the most rudimentary clarification dialog.


I'm really not surprised. Some sample items I bought in the last months in order of difficulty both for me and (I think) for a voice assistant:

1. Hey Alexa/Google/Whatever, buy me a saddle, Selle Italia SMP Extra, white. I expect to get a proposal for the best price + shipment and that's it. Easy.

2. Buy me a replacement head for my Parktool pump. Oops, it's a PFP-3 pump. Ah, the shop selling the saddle doesn't have one... I eventually discovered that that shop has an identical part for a pump of a different brand, probably the same Chinese manufacturer. I think this would be hard for an assistant, borderline impossible.

3. Buy me a magnetic mosquito screen of at least X x Y cm, not adhesive or velcro. Oops, I need a frame to mount it on? I'll spare you the discoveries I made along the way. Either the voice assistant is equivalent to a professional installer and can see my door, or I'll always do a search with a browser, and watch many videos too.


Voice assistants are the most useful new UI to me since smartphones.

I use them to get travel time estimates, reference facts, music and podcasts, rewind / forward / pause / resume / next / previous, timers, lights, and occasionally broadcast messages on home speaker devices.

I use voice UI while cooking, running, mowing, raking, driving, changing diapers, putting on my kids' clothes, and while sitting with my wife.

My two young children (5 and almost 2) love voice UI. We ask about the animal of the day, what does this animal sound like, play that song. My older child is beginning to set timers with it. My wife recently said having a nearby voice assistant is important in a home configuration discussion.

My family and I sometimes have frustrating experiences with voice UI, like failed hotwords, shifting syntax, answering on an unexpected device, and slow responses. But we still use it frequently, and our overall sentiment with voice UI is positive.


It's quite funny that none of the things you mention involve shopping, or any activity that would bring revenue to the voice assistant provider.


Voice UI could be a decisive feature for profitable equipment, or generate advertising revenue.

Because it lacks VUI, my family recently didn't buy a new ~$400 product. We bought an older version that was marked down and otherwise inferior, but has a built-in voice assistant.


I don't want my voice assistant to be a sock puppet for a corporate identity proctologist.

Can you imagine hiring a production assistant who gave everything you asked them to do to a corporate competitor?

Why would I want to do that with my life and Amazon and Apple?


Conversational Computing is just not ready. Even a better AI, alone, will not solve it 100%.

But because of FAANG's meteoric rise in the 2010s (in the public imagination and the stock market), they think they can quickly push all these immature paradigms, like IVA, AR, and now VR, into the mainstream.

All these technologies have existed for decades, but they are not ready! Billions of dollars of investment and marketing are not enough to make them so.

Even as small a leap as a tactile portable device took us almost 20 years to reach a conclusive mainstream form (the iPhone).

We need to accept that IVA/AR/VR are exponentially larger leaps and should remain side-shows for a very long time to come.

For example, Microsoft is finally acknowledging this, with HoloLens now being positioned just "to help you solve real business problems".


At least for my family, the voice recognition seems to be getting worse and worse with each update. Nowadays Google doesn't seem to turn off the alarms or the music after several tries, so my son just comes and unplugs it (the touch controls are also a fucking mystery), and my wife tries to make her phone ring so many times that she finds it without help in the meantime. Alexa doesn't understand what I try to order, and it's easier to just find my phone and type it. I thought that somehow the devices were just developing physical issues over time, but even a new one has the same problems. So far it seems like I would get a better experience by rolling my own stuff and training it with our specific voices.


Can we lump chatbots in with this too? Chatbots have totally failed to live up to their hype. It seems like we spent billions of dollars chasing chatbots and voice assistants only to realize that you need strong AI to have something truly useful.


I don't like the centralized nature of them.

If I could download it and run it on a private server in the basement, without any ties to the cloud, and with my own settings for privacy, then I'd be more willing to use the things.


Same. I actually really enjoy using it for basic functions but I am too concerned about privacy to use one for anything more advanced. I don't know if I would ever trust a Siri or Alexa or Google Assistant even if they claimed to work offline because all providers have had a terrible track record. If the Linux Foundation or reputable open source project had a solid open source solution that promised to work 80% as well and had a fairly straightforward installer, then I'd be more apt to interact with it regularly. One other issue I've found with Alexa and others is their integration with proprietary lock-in ecosystems for calendars, reminders, lists, etc. Ideally with an open source solution you could do things like set a write only calendar that uses your CalDav calendar and CardDav for contacts, or some non-invasive solution for messaging and calling.


Mycroft.ai has some of what you need. They still host some functionality but it's on their road map to make their voice assistant offline capable.


This is exactly how I feel. On the one hand, this technology seems completely miraculous (I remember watching Star Trek TNG back in the day and thinking the "computer" was about the coolest thing imaginable).

Now it's here, it works amazing well (considering what it's doing), and . . . I'm talking to Apple, Google, and Amazon? No thanks . . .


I think Amazon is one of the worst platforms possible for voice shopping. For something that can barely understand what you want, you want a small store that doesn't have multiple products or sellers per type. I could imagine using voice ordering at my low-carb store: when I ask for "white almond flour", there will be exactly one product they can deliver. At Amazon? I might as well order from AliExpress. It's a marketplace; you need human-level intelligence to order something, not a robot that can't even figure out which band to play if there are two similar-ish sounding ones.


Perhaps what value they're getting out of voice services is not what you think it is.

Data, training, pattern recognition, language.


But then you have news articles talking about how Alexa lost $10 billion and I'm wondering if these companies even know what they're getting out of it.


Voice assistants are useless tech that has been shoved down users' throats in an attempt to gain adoption. Google attempted to hold me hostage to this with Android Auto. Previously, voice recognition worked just fine. At some point they decided to integrate their stupid voice assistant, which probably was the same back end anyway... but now it required all sorts of permissions (like web and search history) just to do basic things like send a text message.

Lost yet another user to iOS.


That's right, they're doing it for big surveillance.

Though, as much as these services are apparently bleeding, the in-kind payments from big surveillance had better really be worth it.


I've been using these ever since they came out. I've tried enough of them, including the open source, self-hosted things. I have a tonne of them all over the place, and after almost 8 years of this, here are all of my use cases. To be clear, I just don't get why people care:

1. Play me music or a podcast (this fails a shockingly high amount)

2. Call person X

3. Convert unit to unit

4. Turn devices on or off

5. Timer set

6. Reminder set

7. What time does Shabbat (my Sabbath) start?

All of the above can be done with my phone, so there's really no point in them - for me.


They were too early. A decade from now, when NLP is bulletproof and AI has made significant advances, we'll revisit this. Pragmatically, it makes sense at this juncture to focus on the API behind the voice interface that lets you issue commands. That's the true power: effectively a global catalog of services that can do anything, driven through one API. We can revisit voice, but this is where the starting line is, IMO.


A huge source of subjective user frustration with voice assistants is the lack of responsiveness to both commands and interruptions.

Siri in particular will wait as long as a full second to “ding” letting you know she’s ready to process a command … often interrupting your attempt to give the command. Alexa will very often fail to hear me say “Alexa” despite the audio conditions being quite favorable.

Both Siri and Alexa will misunderstand what you asked for and be completely indifferent to your attempts to correct them unless you interrupt them correctly. Afaict Siri can’t be interrupted by voice alone and Alexa can only be interrupted by something starting with “Alexa”. Once you are already “in” an interaction session with either of these tools you should be able to simply say “no, not that” or “no, <correction>” and the assistant should stop talking immediately, instantly, and not continue to blather on with the previous wrong interpretation.

No matter how useful these tools are in the hands of a power user, Siri and Alexa give the impression of being morons largely because of unresponsiveness, not lack of capability.
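
For what it's worth, the mechanics of barge-in aren't exotic. A toy sketch (all names illustrative; this assumes nothing about how Siri or Alexa are actually built): chunk the speech output and poll the recognizer between chunks, so a "no" can cut the assistant off mid-response.

    def speak(sentences, interrupted):
        for s in sentences:
            if interrupted():
                return False  # stop the instant the user objects
            print(f"assistant: {s}")
        return True

    def demo():
        heard = iter(["", "no, not that"])  # simulated mic transcripts
        def interrupted():
            return next(heard, "").startswith("no")
        done = speak(["Playing The Killers...", "on Spotify.", "Enjoy!"],
                     interrupted)
        if not done:
            print("assistant: OK, what did you mean?")

    demo()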


My issue with trying to use voice assistants to purchase items is that I've found the experience not that great, and I've had to resort to using my phone/computer anyway, so it wasn't really worth it for anything other than adding an item to my cart as a reminder that I wanted to purchase it when I wasn't at my desk.

For me, the utility I get out of my Echos is home automation: controlling heating, lighting, and audio when my hands are full or the controls are out of reach. But you are never going to make a fortune out of house voice control when you sell the voice assistants at or near cost (I don't think I've ever paid full price for any of my Echos, and I tell people considering getting one to wait till they go on sale, as Amazon always seems to be offering a deal/discount on them every few weeks).

But what does frustrate me is that Amazon doesn't seem to uniformly "monetise" things across the Alexa platform. For example, when I found out about the Samuel Jackson voice for Alexa I wanted to buy it, even if it was just a novelty, but it's not available in the U.K.


Not surprising.

What we were promised: a personal well-trained butler.

What we got: a voice input field that you know will be incorrectly interpreting your intents far too often.


My hot take: this is the start of vastly better/worse surveillance, which can be a force for good or bad. Let's start with business meetings: the tech is close enough to be able to record, transcribe, and summarise all decisions and action points, which sounds great and may get put into place. Then it can assess how well a manager was "coaching" the meeting, getting appropriate feedback from all participants. Then, were all the necessary participants there? Then we can see training sessions geared specifically towards manager A and their problem encouraging positive performance from a team with vague goals.

Wouldn't we all like better managers with clearer goals?

Roll this out to a doctor's bedside manner, or the mistaken order given to a nurse that was perhaps misheard, with the wrong medicine given,

or ...

We can surveil all our waking lives - and almost the only thing the social sciences get right is large-population statistics - so we can find out who leads the happiest, best lives and be taught how to do it better.

Or we can stomp out individuality and live in a totalitarian nightmare.

But we don't get to not play the game.


Well, there are so many problems with it (Siri).

1) Doesn't always understand what I'm saying (voice recognition)

2) If it does understand (voice recognition), it actually changes it to something else (super dumb, no, idiotic "AI": it changes someone I call almost daily into someone I never call)

3) If it asks me what I want to do with a contact, it doesn't understand "call"

4) There's no way to 'correct' what has been said.

5) I can't open "podcasts", because it will open Deezer (I don't use Deezer anymore, but it's still on my phone).

It's really just a really, really dumb command-line interface. And if it were just that, without the 'smart' things, it would be a lot more helpful. There are only a few things I use it for:

1) Call XXX

2) Alarm at XXX

3) Open Google Maps. Sometimes I say "route to xxx using Google Maps", and that works

Even these three common use cases fail about 10-20% of the time.

When I'm in the car and want to change the route, add a POI, open an app, or text something, I try, but I get so frustrated that it's more distracting than just typing on the phone. So that's what I do.

Google's voice recognition is a lot better and faster. Also their search understands more context.

Voice assistants are like a Skinner experiment. Solution: make them very rigid in terms of operations. People are quick to adapt to this. Somehow the AI crowd doesn't understand that a UI works best when the things you operate are in the same location and always respond and work in the same way.

I'd compare the voice assistant experiment to a GUI where the buttons, their functions, and their labels always change.
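To make "rigid" concrete, here's a minimal sketch of a fixed-grammar command set. The command names and actions are invented for illustration, not from any real assistant:

    # A toy fixed-grammar dispatcher: utterances must match exactly, and
    # unknown input fails fast the same way every time, instead of being
    # "helpfully" reinterpreted. All commands here are made up.
    COMMANDS = {
        "lights on": lambda: print("lights: on"),
        "lights off": lambda: print("lights: off"),
        "timer ten minutes": lambda: print("timer: 600s started"),
    }

    def handle(utterance: str) -> None:
        action = COMMANDS.get(utterance.strip().lower())
        if action is None:
            print(f"unknown command: {utterance!r}")
        else:
            action()

    handle("Lights on")           # lights: on
    handle("turn on the lights")  # unknown command (rigid on purpose)

The point is that the mapping never moves: the same words always hit the same button.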


I use Alexa in a house filled with Sonos speakers every day. She barely works and is the worst advertisement for Amazon.

"Alexa turn on my FireTV" (doesn't work, weirdly FireTV and Alexa speakers don't seem to get along)

"Alexa play The Killers" (responds with "Playing the killers on Spotify" then silence)

"Alexa play The Killers on Spotify" (works)

"Alexa play The Killers" (sometimes tells me to start scanning devices and install a skill for Sonos)

"Alexa wikipedia 'history of Bulgaria'" (works but asks me if I want to continue after every sentence, so I give up)

"Alexa stop" (doesn't stop for some apps, Amazon should have support for silencing as a minimum requirement for apps and implement worst-case scenarios if an app doesn't respond to stop in time)

"Alexa rewind 30 seconds" (doesn't rewind for some apps, Amazon should have support for rewinding played recordings as a minimum requirement for apps)


There is a monetization path, but likely not as easy as selling direct to consumers. As others have commented, voice (and Google Glass!!) is useful when you don't have hands free. For example, a dentist frequently needs to look at the computer screen or have it change displays while working on a patient in the chair. Issuing a command via voice seems useful ("change to the next screen", etc.), as does potentially showing some data in a heads-up display. Another vertical could be hospitals: being able to use a generic voice UI to turn on the lights and summon help seems useful.

Is the monetization path hard? Yes. Breaking into specific industries takes a ton of work. Additionally, you already have staff in these cases. Would a dentist get rid of her assistant for some voice commands? Not likely. A hospital still needs nurses.


I bought the original Echo back when it first came out.

It was a neat internet radio device and Bluetooth speaker but outside of making it play the radio I never really used the voice functionality much.

I remember one of the places where I buy groceries offering an incentive if you ordered your shopping via their Alexa skill. It was so painful to use that I added everything to my basket using the website and then just added one unambiguous item using the skill and checked out. They gave me the discount/incentive, but I never used it again.

I moved house but never plugged it back in and don't really miss it.

I also remember going to AWS re:Invent in 2016, where they gave every attendee an Echo Dot, with a lot of fanfare to encourage devs to make skills. I tried to make one, but it was so convoluted that I gave up in the end and just gave the Dot to my parents. It broke after about a year; they never replaced it either.


I'd be full on board for a voice assistant, but I control the server and the traffic, so Amazon is a non-starter.



I personally use Alexa for some light control as well as smart blinds I have installed. Sometimes some basic questions ("What is XY?") and besides that nothing more.

The number of wrong answers, or it just telling me something it found on the web that has nothing to do with the topic, is just annoying. Even basic questions don't seem to work reliably, and looking at the whole Alexa ecosystem, it doesn't even feel close to a "smart" home.

The lack of context awareness and of connections to basic things like parcel delivery, the washing machine, and similar (things that would make my life a little easier, because I'm too lazy to open an app to see when my parcels will arrive or when the washing machine will finish) just ruins the whole thing for me.

Besides that, I've tried setting the male voice, but for some responses it just switches back to female Alexa??


Voice assistants are infuriating. Every day Siri and Alexa seem to understand less and get worse.


I got into the habit of asking things like "what is the weather like today" every morning before work, it worked fine for months and I quite liked it - while I'm occupied with something else it can give me some useful information. Then one day it started giving me the dictionary definition of the word "weather". That continued for over a week, so I just lost the habit and don't think I've really used it since.


Alexa is an important tool for our family. My hope is that Amazon doesn't abandon Alexa, but limits what it does and charges a subscription. We use it for alarms, timers, drop-ins (inside and outside the home), announcements, weather, news, music on Amazon Prime, and making calls. Simple things that don't require add-on skills or advanced commands. It works solidly for our needs and would be difficult to replace. We never buy anything using voice commands.

The problem is a family like ours. We haven't bought a new device in 2 years and still use some of the original pucks. Amazon will need to make money somehow, so I think a yearly subscription would work. Then again, if it isn't a growth segment, I can't see a tech company keeping it around in today's world.


For me, subscriptions need to really earn their keep. I suppose I'd miss my devices if Amazon started charging a subscription, but I certainly wouldn't pay and maybe I'd replace them with something else simpler.


The problem is simply that voice is a 1-D (linear) interface, whereas a computer screen is 2-D.

So when I interact with the screen, it can show me all my possible choices at the same time in each situation. I click a menu item and a sub-menu pops up.

With a voice assistant I would have to wait for the assistant to list all the possible choices, and if there are many, that takes a long time. You can only hear one word at a time. Combine that with the inaccuracies of voice recognition and it is clear that a computer screen + mouse is a much better interface.

Hey it's the same difference as with Radio vs. TV.

A computer screen interface might be usefully augmented with voice recognition. But working with voice alone is like working blind. Gee, who could've thought voice is not the killer tech of the 3rd millennium.


I think, or at least for me, one of the major frustration points with voice AI is how dumb it has been for the last two decades.

When you tell a Mercedes to navigate to an exact road and address, it barely gets half of it right, and usually that's just the number.

Now, I don't have a speech impairment, and I have tried in 3 languages; none of them work to my satisfaction.

The worst thing though, personally, is all the shitty patents around voice activation keywords. It's disgusting.

Finally, I am also an IoT guy with a lot of toys, and I would have written my own personal assistant if it weren't for the patents. After all, the hardware is surprisingly cheap.

All the problems are solved, too. But if you believe I would write free code for megacorp Amazon without pay, you might be heavily mistaken.


Voice assistants do have a lot of potential due to the fantastic ergonomics. "Voice command line" is used as a pejorative here but it's true and a good thing. The original visionaries were right to pursue this idea.

To me, where it's gone wrong is that, like many other things, the A team conceives it but the B team is responsible for its development. The A team asks "what will users want to say to it?" But the B team says "well, if they say 'blah', how do we know that they mean 'blah'?" Mediocrity creeps in.

As an example, here is a typical dialog between me and my Alexa.

"Alexa, play Brahms C Minor Piano Quartet."

"Playing: music like Coldplay."

"NO"

[music like Coldplay]

"Alexa NO"

[music like Coldplay]

"STOP"

[music like Coldplay]

"ALEXA STOP STOP"


As Shank said, there needs to be a GPT element to voice assistants. At the moment, Alexa is a voice interface but is essentially "dumb": there's no understanding, intonation, memory, etc. A voice assistant that actually worked would be a fully fledged AI that grows over time, as in most sci-fi smart houses. I think it's possible that we'll eventually get units with base AIs that then grow with time, but we're not there yet, not close. The current problem with GPT-like AIs is that you can't trust what they're going to say. They're interesting and useful, but they still feel like nothing more than mathematical, probability-based auto-completion.


I have a stereo pair of OG HomePods, and it never ceases to amaze me how stupid they are. Songs I played yesterday are mysteriously unavailable, or I'll ask for a song I've played a million times for my daughter and it will decide to play a totally different, highly inappropriate rap song just because it also has the word "toothbrush" in the title or something.

I use my Alexa as a timer only, and when it started saying things to me like I have notifications or "did you know you can use Alexa to order such and such" when I would say "Alexa, stop the timer", I almost threw it out the window.


I hope this means people will be looking at jailbreaking these devices. I'd do it myself but that stuff is beyond me at this time.

I'd love to repurpose my Alexas into satellite Rhasspy[1] devices if Alexa retires.

1: on HN the other day https://news.ycombinator.com/item?id=33705938

& direct link to the satellite info https://rhasspy.readthedocs.io/en/latest/tutorials/#server-w...
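For anyone curious what the satellite setup buys you: Rhasspy publishes recognized intents over MQTT (the Hermes protocol), so a handler can be as small as the sketch below. This assumes a broker on localhost, paho-mqtt (1.x API) installed, and an intent name I made up ("TurnOnLights"):

    # Minimal Hermes intent listener for Rhasspy.
    import json
    import paho.mqtt.client as mqtt

    def on_connect(client, userdata, flags, rc):
        client.subscribe("hermes/intent/#")  # every recognized intent

    def on_message(client, userdata, msg):
        payload = json.loads(msg.payload)
        intent = payload["intent"]["intentName"]
        print("recognized:", intent, payload.get("slots", []))
        if intent == "TurnOnLights":   # hypothetical intent name
            pass                       # call your home-automation API here

    client = mqtt.Client()
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect("localhost", 1883)
    client.loop_forever()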


They're half-baked, and I sense consumers starting to lose interest. Third-party developers have to do a big deal with Amazon to have a "skill", rather than there being an open marketplace where consumers can choose what their device will do. The devices don't really cover all the things consumers might want from them. Years ago, I had assumed Google or Amazon would launch a marketplace akin to Google Play, and whoever got there first would be the category winner. Instead, the category became a race to the bottom: big tech cynically chasing the big deals rather than opening up the platforms.


Following the argument that the biggest problem is natural language understanding, I wonder what happens when you put in a GPT-3-like model and give it all the relevant context: past conversations, other user information, etc. That should be much more capable of understanding what the user wants than current systems.

There are obviously still many open research questions, like how this is then combined with actually performing the commands. But there are solutions to this too.

This is still active research, but given such powerful models, I think in principle we could make such devices much more useful.
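A rough sketch of what that could look like, using the OpenAI completion API as it exists today; the prompt, the action schema, and the context line are all invented, and the hard part (executing the command safely) is skipped entirely:

    # Sketch: let a GPT-3-class model turn an utterance plus context into a
    # structured command, instead of hand-written intent rules.
    # Assumes openai.api_key is configured.
    import json
    import openai

    PROMPT = """Convert the user's request into JSON: {{"action": ..., "args": {{...}}}}.
    Known actions: set_timer, play_music, lights.

    Context: user is in the kitchen; last request was a 10 minute timer.
    User: {utterance}
    JSON:"""

    def parse(utterance: str) -> dict:
        resp = openai.Completion.create(
            model="text-davinci-003",
            prompt=PROMPT.format(utterance=utterance),
            max_tokens=100,
            temperature=0,
        )
        return json.loads(resp["choices"][0]["text"])

    print(parse("actually make that fifteen minutes"))
    # hopefully: {"action": "set_timer", "args": {"minutes": 15}}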


They did take off. Everyone in here uses them. People all over are using transcription (speech to text) to dictate instead of typing on the terribly small mobile phone keyboards.

The issue, apparently, is that no one knows how to make money.


Perhaps it would be a useful experiment to pick a representative group, pair them with real human assistants who have complete authority to fulfil their requests, and collect all the requests that are made.


I wish voice assistants like Siri could just do the functions that are obvious, even if that requires learning commands. This goes especially for in-car voice assistants. I should be able to issue any critical command, like turning on the wipers, via voice, without looking at a screen or even feeling around for tactile controls with a free hand. If this were reliable and consistent, even if being reliable required learning a list of commands, it would be a big help while driving.


This is an interesting thread :)

I see lots of conjecture on the nature of voice assistants, or why they haven’t taken off.

As someone who was at Amazon, and close to this area, let me offer a simpler explanation.

Alexa has stagnated because its leadership has little to no direction. The incentive structure driving the product teams and tech falls under two categories: 1. rest and vest, or 2. write documents to build your promotion portfolio, get promoted, then leave the org to find a job in AWS.

In fact, Alexa could be used as a textbook study in empire building.

* Myriad teams that maintain or increase head count each year, with absolutely no meaningful deliverables.

* Services that could be maintained by a 2-3 person on-call rotation, yet have an entire team of 6-8 engineers.

* Tech directors, Sr. SDMs, and SDMs in a race to build their empires to get promoted to the next level. SDMs hiring head count solely for the sake of having more reports. Other SDMs hiring SDMs below them, even though there's no need for additional management (the team is idle), so they can show a larger footprint and get promoted to the next level.

Voice assistant tech may well be hitting a brick wall of technical or human-computer interaction gaps that need to be thought through in much more depth. But saying this was just an insanely hard technical problem is misleading and misses the bigger picture.

If you build an organization as a pet project and then throw money at it to do whatever the hell it likes, with practically no accountability, of course you won’t get meaningful results.

In recent years, Alexa became a place to chill and wait for your RSUs to vest. I personally know great engineers in Alexa who are now stressed out of their minds about their work and visa situations because of the layoffs. But look: you decided to join a team where you basically just hang out at work, have a series of useless meetings, or where, half the time, your manager is complaining that his team doesn't all show up to daily stand-ups. A ton of extremely talented folks became extremely lazy, and so here we are.

To root-cause it: misaligned incentives, a lack of accountability, and a poor work ethic, which is shockingly poor given that at least some other parts of the company are still at the opposite extreme.


The reason voice assistants haven't taken off is quite simple, imo. You don't know what their capabilities are beyond the well-known use cases ("set a timer for 15 min"), but you are well aware that they're not going to understand everything. Their limitations are obscure, and you've been misunderstood in the past, so why waste more time? With a human, who understands everything you say, you have full confidence using voice.


The only thing missing from the experience is occasionally being eaten by a grue.


I find Siri mostly useful for things like taking messages and playing music when my hands and eyes are occupied with other things, but as many people have pointed out, figuring out what exactly one has to say to get Siri to do its thing is often hard, even when you realize you've got a parsing problem. I mean, ever try to get Siri to play a song by "Them"? Unpossible by band name. :/ What I wouldn't do for a manual or keyword guide...


I own a 3rd gen Echo Dot. I am not sure where it is right now. It needs to be plugged in, so it's useless as a portable speaker. I don't control anything via voice in my home. The only time I use voice commands is while driving, to tell my phone to play some specific song or call someone. I have a few smart lights in my home, and I always find it much quicker to change the color etc. by using the app, since the phone is always within reach, than to call out a voice command.


I've had Google Home for quite a while now and, apparently like everyone else, I only really use it to start and stop music, as a kitchen timer, and to make animal noises when kids are visiting. There was no obvious improvement over the years. However, with GPT-3, DALL-E, and all the other amazing stuff coming out, I had just assumed Google/Amazon must be working on a big update that makes Assistant/Alexa tenfold better. Is this hope in vain?


I love Google Assistant, but it doesn't recognize my wife's or daughter's voices well. Thus they find it annoying and do not use it.

I use it primarily to turn on and off the lights (multiple times per day), play music, turn off the TV (but not to play things on TV, too unreliable) and to raise/lower the temperature on the Nest.

We have tried to use the Nest Cams with the Hubs as a baby monitor, but the camera feeds freeze without telling you, so it is actually dangerous.


I would never choose to talk to the machine unless there's no alternative...

There's a reason voicemail has largely been replaced by email, etc...

Then there's the spying issue.


The harsh reality is that the AI powering those assistants is simply not smart enough to converse with.

Using voice to send commands to a machine is not practical or pleasant for everyone.

This should not be a surprise and it could have been easily predicted, but hype and FOMO pushed big companies to sink billions into this tech.

This is not the first time, and it will happen again, especially because anything touching AI leads to highly inflated/magical expectations.


I wrote off Siri when I was driving, said “play episode 6 of XYZ podcast”, and it was completely incapable. If it can’t do something like that, then what’s the point? It’s no different from those hands-free Bluetooth adapters for your car that my dad uses with his old Android phone.

There are many other seemingly simple tasks it has failed at. All I use it for now is sending texts and turning on navigation when I’m driving.


Has anyone also considered that people have voice recognition systems in their smartphones, and thus a standalone voice recognition device isn't necessary?

Or that the results on a smartphone can be visual, whereas these home devices don't have direct IO apart from voice? [And yes, while they can be hooked to exogenous devices like a television, that's an extra step of configuration to finagle...]


It's really shocking how bad all voice assistants are considering how amazing LLMs are. There must be a major effort underway at all of these companies to move their backends to LLMs.

OpenAI charges 2 cents per roughly 750 words (for their best model), or about 1 cent per minute of talking. Maybe they could add LLMs as a premium feature; $3 a month for an actually smart home assistant seems like a deal.
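The back-of-envelope arithmetic, with the speaking rate as an assumption:

    price_per_1k_tokens = 0.02   # USD, davinci-class model
    words_per_1k_tokens = 750    # OpenAI's rule of thumb
    speech_wpm = 150             # assumed typical speaking rate

    cost = price_per_1k_tokens * speech_wpm / words_per_1k_tokens
    print(f"${cost:.3f} per spoken minute")  # ~$0.004 one-way

Since prompt and completion are both billed, and the conversation context gets re-sent on every turn, roughly 1 cent per conversational minute is a fair round number.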


No wonder it's not doing great. There's zero progress in this sphere. Google Assistant, if anything, feels worse today than it did many years ago.


I just don't think they have improved in any meaningful way since launch, and in fact some of the experiences have gotten worse.

At launch I could reliably "Hey Siri" a timer from across the room; now it just doesn't work, presumably because Apple downgraded the long-range microphone tech at some point to save costs.

Eventually I just stopped bothering and set them manually.


They may not be very useful at the moment, but if progress in large language models continues at the current pace then I can imagine that conversational AI might be useful in several years. Investments by companies to establish and maintain voice assistant market share might thus eventually pay off.


Admittedly I only skimmed the comments, but “how will they make money from it?” wasn’t acknowledged. It’s a hell of a lot of effort to build up an inference engine for natural language processing, and they’re going to want to profit from it. Unless, as some have noted, surveillance will be their funding.


It reminds me of the chatbot fever of a few years back; it seemed like every company wanted a chatbot, even though as a user I much prefer a normal menu.

I think voice assistants will continue to have good use cases in cars, kitchens etc. where you can't use your hands so the trade-off is worthwhile.


Speaking to a voice assistant is like speaking to a toddler. They have limited vocabulary, limited comprehension and limited ability to perform what you’ve asked them. The only difference is that a toddler stops being a toddler after a year or so, while voice assistants stay perpetually dumb.


I find it perplexing that voice assistants actually seem worse at basic functions than they were ten years ago. I remember being amazed at how well they could interpret requests to set reminders with flexible language. Now Google gets it wrong so often I've nearly stopped using it.


I've been dabbling with Talon to reduce the burden on my precious arm tendons. It's the most useful voice interface I've used due to its customizability, but with that great customizability comes a great learning curve that it seems only techies would bother with.


The technology just isn't there yet. A lot of progress has been made, but I still need to repeat myself too often. It's more frustrating than using an app. We'll nail it in another 20 years though, and the voices will be indistinguishable from human ones.


"consumers were just as happy to sit down and click away until they had the basket they wanted" — same reason why although many people hate lifting their butt to go shopping, some would happily do. Nobody likes checkout lines though, that's universal.


I've had a voice assistant (Google Home) for a few years. I tried to use it for many things, but in the end I settled on only playing music, setting alarms, finding my phone, asking the time, and "what sound does a monkey make" when I am holding a baby.


I feel like voice assistants sound super awesome in theory, and a lot of nerds, me included, first think "Computer, replicate me some Wiener Schnitzel", but the reality is really something completely different, even ignoring the replication part.



No matter how free they make it to use, I just do not want a robot listening to me all the time.


Muahahaha, there's no escape! Maybe dial down the paranoia. It's a bit unhealthy.


They just can't do useful things for me. I can do things like turn off a light or get the weather, which are only marginally better than just flipping a switch or pressing one button on my phone. It's not a large gain in efficiency.


i have seen a lot of people with physical challenges and disabilities discussing how critical these devices have become. they may not solve mainstream problems, but they have contributed materially to the quality of life for a marginalized community, and i hate the idea that they will disappear because big tech hasn't found a way to monetize them. i hope that the remnants of these projects can be open-sourced or sold to companies that will run with them and build and maintain products to support those who may have mobility or physical manipulation challenges, that are benefiting from voice control.


I’d argue this space was technically successful but not commercially so. We love our “smart speakers” and have integrated them into home automation. Does anyone make any money when we use them? No.


This is not the most politically correct statement, but I wonder what the situation is like in, say, China or other major non-Western countries. I think voice AI is more hyped in China.


I worked on a project where some guy was trying to sell a voice assistant to a company selling LPG to housewives in Brazil, and of course nobody ever used it.


"Someone has to say it:" It's weird that they are pretending to do some cutting edge/against-the-grain journalism/research in this article.


What's entertaining is that they were just too early. Give it at the very most a couple more years and the state of the field will be very, very interesting.


My favorite part is when my Google Home stops the music upon my request and then tells me in a complete and unnecessary sentence that it has stopped the music.


Well, guess what happens without open platforms? We need open tech that allows anyone to create voice-driven “skills”, like websites.

Not everything can live on ad dollars.



Love the utility, but there’s clearly little business value. A great example of where “making something people want” doesn’t alone support the bottom line.


Good, I hate these spyware devices.


I think it has to do with privacy and the limitations on training data. Or am I being naive?


Yeah, cuz it’s not 2011 anymore…

Kids in India have smartphones now, no one’s impressed


What are some privacy respecting voice assistants?


Computer, earl grey, hot.





