25 Years In Speech Technology and I still don’t talk to my computer (matthewkaras.medium.com)
251 points by samizdis on Oct 26, 2020 | 293 comments



A big problem with the assistants is that as soon as they fail at a query they seem stupid, I feel stupid, and I stop using them for a long time.

Last weekend I had the following failed queries:

"OK Google, what is the air quality like at Mt. Shasta today?"

"OK Google, add a waypoint for the last gas station before the mountain pass"

"OK Google, what percentage of people can you detect to be wearing masks on recent Instagram photos tagged at a location within a 10 mile radius of Mt. Shasta?"

These are all things I would expect a computer assistant to do really well. They have access to so much data, and so many APIs, that they should be able to break down these sentences into a SQL-like query and give me results. The third, for example:

"recent Instagram photos" -> Instagram has an API

"tagged within a 10 mile radius" -> parse the cities within a 10 mile radius and look for tags in all of them

"people" -> use your wonderful person detection networks your friends at Waymo developed

"wearing masks" -> I'm sure your internal datasets have this label, so run an object detector

Then compile and reduce the data to give me the number I want.
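
To make that concrete, here is a rough Python sketch of the pipeline being described; every function name is hypothetical (standing in for the Instagram API, a person detector, and a mask classifier), since none of this exists as a public API:

    # Hypothetical sketch only: these functions stand in for "Instagram search",
    # "person detection" and "mask detection"; none are real, public APIs.
    from datetime import datetime, timedelta

    def mask_wearing_rate(center, radius_miles=10, days=7):
        since = datetime.utcnow() - timedelta(days=days)   # "recent" = last week
        photos = search_instagram_by_location(center, radius_miles, since=since)
        people = masked = 0
        for photo in photos:
            for person in detect_people(photo):            # person detector
                people += 1
                if detect_mask(person):                    # mask classifier
                    masked += 1
        return masked / people if people else None

    # e.g. mask_wearing_rate(geocode("Mt. Shasta"))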

That's what I want an assistant to do. But it couldn't even do the first, which just involves a single API query to fetch air quality index information. Bleh. And as for the second, it has no idea what "last gas station" or "mountain pass" means; it's a query a human would know to be extremely commonplace.

It turns out that the current generation of "assistants" are mostly just template-matchers which really doesn't help me much at all. I can set my own alarms, thank you.


As soon as you see them for what they are, it makes perfect sense. They're basically a proxy to google search with extra dumb features on top. Every single feature that isn't google search has been hard coded (map directions, set up alarms, &c.); as soon as you get out of the hard coded cases it's blatantly obvious these things are anything but "smart".

imho they're just used to give google/amazon a few more data points for their graphs


Yeah, and that wouldn't be a problem if the discovery and browsing stories weren't dismal, but they definitely are. On both counts.


Discovery and browsing are not good on the assistant interface, but I'd argue that's a constraint of voice vs. visual. On a desktop/mobile, the screen holds the state so you can go back to previous entries, etc. Over voice, your mind holds the state, which scales much, much worse.


To the credit of the Shortcuts team at Apple, being able to visually select and define certain phrases for commonly completed tasks is helpful in this regard, but I’d guess only a couple % of the user base is even aware that Siri Shortcuts are possible.


Is this just normal phone Siri or HomePod Siri?


Normal phone Siri. Not sure if it works for the HomePod


And this would be a much better situation if users could hardcode their own features and definitions.


Which is exactly why I prefer formal query languages over NLP queries. In both cases (at least with most state of the art NLP techniques), you have to learn certain patterns and ways to phrase a query so that the system will reliably understand it. With formal query languages, these patterns are well-defined, can be looked up and will most likely not change significantly (so there is value in memorizing them). With NLP systems, the patterns are completely opaque, you have to learn them through trial and error, they may change anytime (e.g. because the model is retrained) and they are usually significantly less powerful.

I sometimes feel that the trend to prefer NLP over formal query languages is comparable to the trend to prefer GUIs over consoles in the '80s and '90s.


Agreed; back in the day when we/I used to play text adventures, or interact with MUDs/MOOs those systems had English-like interaction languages but the semantics of them were relatively clear -- you mostly had to follow the verb/prep/object formula and once you figured that out, you could manage the system fairly well, without running into a lot of terrible corner cases.
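
(For the curious, a toy Python sketch of that verb/prep/object formula; it's an illustration, not any real MUD parser.)

    # Toy illustration of the verb / preposition / object formula, not a real parser.
    PREPOSITIONS = {"on", "in", "with", "to", "at"}

    def parse(command):
        words = command.lower().split()
        verb, rest = words[0], words[1:]
        for i, w in enumerate(rest):
            if w in PREPOSITIONS:
                return verb, " ".join(rest[:i]), w, " ".join(rest[i + 1:])
        return verb, " ".join(rest), None, None

    # parse("put lamp on table") -> ("put", "lamp", "on", "table")
    # parse("look")              -> ("look", "", None, None)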

I'd rather have an assistant type system with a fairly well defined query system that exposed its capabilities and limitations directly, rather than me having to guess at the corner cases and failure points.

Disclaimer: I work @ Google on display assistant devices, but I don't work on the actual assistant interaction pieces.


> Which is exactly why I prefer formal query languages over NLP queries.

People like to reference Star Trek for stuff like NLP queries, but if you go back to TNG and pay close attention to the verbal queries to the computer, much of the time it isn't natural language. They seem to actually use some sort of formal query language that fits English a bit closer, but is still distinct from when the characters speak to each other.


> They seem to actually use some sort of formal query language that fits English a bit closer, but is still distinct from when the characters speak to each other.

"Computer, begin auto-destruct sequence, authorization Picard 4-7 Alpha Tango."

- Wake word. Command. Authorization stanza. (I bet the computer would prompt for authorization if missing.)

"Computer, Commander Beverly Crusher. Confirm auto-destruct sequence, authorization Crusher 2-2 Beta Charlie."

- Wake word (possibly superfluous). Identification stanza (probably superfluous for the usual crew, but I can see from an HCI perspective that you might want to make people provide it specifically for such a consequential protocol, and it may also be of merit if some random admiral usually halfway across the galaxy pops in to confirm). Command confirmation, authorization stanza.

"This is Captain Jean-Luc Picard. Destruct sequence Alpha-One. Fifteen minutes, silent countdown. Enable."

- Computer is very awake at this point, no wake word. Identification stanza. Sequence parameter. Time parameter. Verbosity parameter. Commit.


To be fair, that's just how Picard speaks (e.g. "engage"). I haven't noticed anyone else saying "Tea. Earl Grey. Hot".

In any case I think this kind of speech is formulaic for the benefit of the audience, most of all, who are made aware through the formality that the speaker is addressing a machine. Additionally, we're watching navy men and women in space, so we expect them to speak to each other and to their computers in a formulaic manner ("Deck 5! Report!" etc, I can't think of good examples, brain's too tired).

Or perhaps the idea is that Trek AI is not really as advanced as to be able to understand natural language and that makes Data such a unique specimen.

Then again, there's the example of the Doctor in Voyager. I'm confused, I admit.


The Doctor in Voyager is, IIRC, an early prototype, and Voyager is both later than The Next Generation and set on a newer ship. And the Doctor, IIRC, benefited from upgrades during the show, having been more limited in scope initially.

In any case, Trek isn't super internally consistent, anyhow.


Maybe I'm misremembering, but I distinctly remember that basically everyone ever shown interacting with a replicator follows the generic->specific parameter hierarchy.


You and dragonwriter are probably right. I'm going by memory, here :)


It may be easier to train the humans to be precise. This is the solution adopted by armies which need unambiguous communication over noisy radio.

https://en.m.wikipedia.org/wiki/Operations_order


Well, to extend the GUI/console metaphor, it means that at some point soon, we'll all be using NLP because it's dramatically more user-friendly for the vast majority of people.


GUIs won over console workflows because GUIs have better discoverability and the "recall vs recognize" difference; it's mentally much easier to recognize the option you want when presented it than to recall the existence or the naming of that option.

In those aspects of UX, voice interfaces have the same drawbacks as console apps when compared to a good GUI.

Also, they have to work within the "bandwidth bottleneck" of audio - just imagine a phone system that tells you all the options you have, "Press 1 for something, Press 2 for another thing..." - they are so annoying because they are slow and inherently linear; a GUI can show the same options all at once, and you can read it much faster than listen to them.

So NLP as such is not dramatically more user-friendly unless it is at the "do what I mean" level, which likely requires full human-level general artificial intelligence; before that it's just a voice equivalent of a console app, sharing all the problems of discoverability and needing to remember what the system can do and how it should be invoked.


> Also, they have to work within the "bandwidth bottleneck" of audio - just imagine a phone system that tells you all the options you have, "Press 1 for something, Press 2 for another thing..." - they are so annoying because they are slow and inherently linear

They're even slower now because the brain trust decided adding voice control to the phone menu system was a great idea. So before, it said "For prescription refills, press 1." Now I have to wait for "For prescription refills, press 1, or say prescription refills." How on earth does that improve anything? I can just as easily press 1 as I can say a word, and when I press 1, there is a near 100% chance that the computer on the other end will understand my command.

Some phone menu voice automation is even worse. "Tell me what you want! <silence>" Then you say something, and it says "I didn't recognize that. Please tell me what you want!" Then it fails again and says "I didn't recognize that. For prescription refills, press 1, or say prescription refills..." Oh great, so there was a menu? Why did you waste my time earlier?

Voice is just a terrible, low-fidelity, low-bandwidth way of commanding a computer. You might as well have handwriting input while you're at it: You write what you want on a piece of paper, and hold it up to the camera and the computer tries to figure out what you wrote. Just as silly.


> I can just as easily press 1 as I can say a word, and when I press 1, there is a near 100% chance that the computer on the other end will understand my command.

So, I would rarely say something when I could push a button, but when using a smartphone on a call, it's not always easy or obvious how to push a button. Some people may have mobility issues making it hard to push a button, or be on speaker phone far away from the buttons. Or, maybe they haven't updated their telephone equipment in 50 years, and only have a rotary dial. Or, maybe on a terrible VoIP system that can't manage to get the tones through.

There's probably some way to clean up the script.

"(Please listen carefully, as our options have changed.) Please choose from the following options: Say prescription refills or press 1; say insurance denied or press 2; say referral to veterinary care or press 3"

I could get behind voice interfaces for more things if the commands words were documented, clear, and consistent, and the damn things worked. Until then, buttons seem good to me.


Totally. I cringe at FedEx's system.

"Welcome to FedEx. [... blah blah blah ...] Tell me what I can help you with today."

"a package"

I mean, what the hell else can you help me with today anyway?


Honestly, "Tell me what you want!" is better than a system that forces you to listen to all of the options since "representative" is what I want 99% of the time when I have exhausted all other options and decided to do battle with an automated phone system.


> You write what you want on a piece of paper, and hold it up to the camera and the computer tries to figure out what you wrote. Just as silly.

Ironically I used to love me some graffiti on PalmOS and Google Handwriting Input on my Droid 2, but I agree with the spirit of your comment.


I see you've run into CVS or Walgreens' automated phone system.


This "recall vs recognize" point should be raised in every console vs GUI debate. It's pretty much the final word.


You can get the recognise experience on the command line with the smarter autocomplete available on shells like fish or zsh.

    foo --<tab>
You're now presented with a list of options and, depending on your config, the man page one-liner descriptions.


I understand your point, but I'm not sure that GUIs won out because they were dramatically more user-friendly. It certainly helped, but I think they won because they made multitasking possible. Multitasking from the user's perspective, that is: the ability to interact with more than one application at the same time. That was just not possible on a console, so even people who didn't need user friendliness were able to do things they couldn't do before. I was young at the time, but that's how I remember it at least.


That's just not true, though. As with many console things, multitasking is totally possible but its discoverability is terrible. Ctrl+z and `jobs` is the entry level, with tmux being the end-state reached via GNU screen. This lack of discoverability is the same problem voice assistants have, only more so; no `apropos` and no tab completion.

GUIs are for discovery, CLIs are for power via composability, voice/NLP assistants are for convenience.


> That was just not possible on a console,

Was it? Even if you discard stuff like tmux as already a GUI, you can still send whatever is running at the moment to the background with CTRL-Z and typing "bg" on any modern Unix system. "jobs" will then list all your processes, and "fg <ID>" will bring it to the foreground. I am sure this functionality predates most modern GUIs.


Aside from the usability POV, GUI provided significantly more features to the user, such as visualization of information and data. Images, Audio, Video, Multi-Media, 3D/2D Video games. You could have more information on the screen and at your fingertips at the same time. You can load many of these things from the CLI, but it's not as convenient as within a GUI.


> That was just not possible on a console

You may be thinking of DOS, which yes had almost no multitasking ability available.

However there were multiple timesharing operating systems that existed before the PC and GUIs, Unix being the most famous and still around.

Multitasking is quite possible on a Linux console for example. It has 5 or more consoles, each handling different users, each being able to be split via screen/tmux. Each shell can run jobs in the background as well.


From my observations, people have reduced their voice assistants to objects that sometimes tell them the weather or switch their lights on and off, and sometimes do something completely unrelated when activated.


On your searches:

1) I get the correct response; Assistant first asks me if I want to use "air check", I say yes, and get the correct response.

2) I get appropriate responses when I ask "What is the closest gas station to Mt. Shasta." Because the approaches are mostly from Rt. 5, you'll have no problem getting a usable response. Another approach from Rt 89 exists, so I don't know how you expect Google to know which "mountain pass" you mean. However, should you ask Google "What is the closest gas station to Mt Shasta on Rt 89", you will get an appropriate response. Should you be approaching from the opposite side of the mountain, unlikely though that is, Rt 5 or Rt 89 would still be the closest

3) I don't know why you expect a computer to answer this question well at all. I'm not aware of an API that would track this sort of thing. Google doing this itself, at scale, for any popular destination would be a very large project in ML with a very high error rate: Instagram photos are unlikely to be representative of the whole, the data set for a given location may be sparse, and especially for an outside location it is entirely likely (as I did at a local pumpkin farm) that people at an outside tourist destination will move away from a crowd, remove their mask, and take a nice photo safely, while people in close proximity still adhere to appropriate social distancing. Effective use would also need to be real-time: from day to day a given location could attract people that wear masks, and the next day a large group does not, especially in the > 300 square mile area you specified. It is not reasonable to expect this question to be meaningfully answered by either a human or a computer.


I don’t believe he’s looking for a voice assistant. He’s in the market for Lieutenant Data.


Rather, I think he's looking for the shipboard computer in Star Trek. Which is actually an excellent example of the untapped potential that voice-based computing holds.


The ship computer in Star Trek was always hilariously underpowered. They had to put an android on the bridge to operate manual controls during battle, for god’s sake.


It's also an excellent example of designing a user interface to look and sound cool on screen, rather than for usability, which is how I feel about most voice controls.


The operative word is assistant. Voice assistants should be able to do everything a human assistant can do.


the entire value of a (voice) assistant is that it can quickly and reliably deliver results. You don't need Data, but the system needs to be robust enough to work quickly. If you have to ask five times or spend minutes thinking about how to phrase the question, it's not assisting you in anything but wasting your time.


[Commander] Data


Presumably, at some point Data, who was a Lieutenant Commander at the outset of TNG, was a Lieutenant. (This may even be the case in one of the novels in which he features set prior to TNG.)


> 1) I get the correct response; Assistant first asks me if I want to use "air check", I say yes, and get the correct response.

Yeah, I don't. I just get "Sorry. I could not understand that." if I remember correctly.

> 3) I don't know why you expect a computer to answer this question well at all. [...] Instagram photos are unlikely to be representative of a whole

Sure, but I defined the query pretty clearly, and I just want an answer to the "general" question of "Are people generally wearing masks in that area or not?", and I already babysat that question into a data query that could use the Instagram API plus some object detectors.

I understand it's hard for one person to write code that could formulate and piece together that graph, but I feel like it should be a tractable engineering problem for Google.


This mindset is exactly why these assistants are bad. Start with the UX then work backwards from there and make it happen.


No. Asking a question in an unambiguous way is not a requirement that can be done away with. It can't be done in many normal text-based searches, it can't be done in face-to-face conversations with real people, so there should be no expectation that a voice search would yield any better results. Having actual data that exists to answer the question, as with the mask example, is also an essential requirement.

If you expect more, then your problem does not lie with voice assistants; it lies with search technology itself. "Expecting" these questions to be answerable is unrealistic given current capabilities. Working backwards from the UX would produce nothing better because your expectations are thwarted not by poor design, but by the limits of state-of-the-art technology.


The three questions are completely answerable by a human. He might ask for a bit more information to answer correctly, which could also be expected of the device.


Rt. 5? Do you mean I-5? I'm curious if using "Rt." instead of "I-" is a regional thing. Where are you located? Typically Rt. would only be used for a small state road, not a major interstate highway.


It's a regional thing. In New England, we call Interstate 95 "Route 95", Interstate 93 is "Route 93", etc. US Route 20 that stretches from Boston to Oregon is "Route 20". And both the US Route 3 that goes north from Boston through New Hampshire, and MA Route 3 that goes south from Boston down to Cape Cod are called "Route 3".


> In New England, we call Interstate 95 "Route 95",

Actually, in the Boston area, it's "Route 128." Despite it never being indicated as such at exits or on maps.


The correct title for Californians is "The 5".


> The correct title for Californians is "The 5".

You have omitted the rather important qualifier “Southern” from your description, as that usage is a key differentiator between Northern and Southern California.


I live in the bay area and everyone I know calls it "I-5". "Everyone I know" of course could be biased though as most of my circle did not grow up in California.


I lived in Southern California (Orange County) in the 70s and 80s. We called I-5 the 5. Same for the 405, the 101, the 15, the 10, the 99. Also, we called State Route 1 (aka Highway 1) PCH no matter where it was located in the state, even though SR1 only has the Pacific Coast Highway designation between Oxnard and Dana Point.

I moved to Northern California (Santa Clara County) in the late 80s and I still use the same monikers. If I take a trip to Big Sur, I tell folks I am going to take the 17 to PCH and drive down. Old habits die hard.

https://www.kcet.org/shows/lost-la/the-5-the-101-the-405-why...


“In Southern California, the definite article “the” gets placed before just about every freeway or highway, whereas in the Bay Area just the numbers are said.”

https://www.kcbx.org/post/how-you-refer-us-101-says-lot-abou...


It's used interchangeably for the number of any road, at least in my experience. It certainly gave appropriate responses when I tested it out.


Where do you live? It sounds like this is a regional thing in the NE? I live in the PNW and in my 50 years I’ve never heard anyone use route in place of I-.


“Alexa, play punk rock”

“Playing <super specific punk rock song>”

“Alexa stop. Alexa play punk rock playlist”

“Cannot find punk rock playlist”

“Alexa play punk rock 00’s playlist”

“Can’t find”

“Alexa play early 2000’s punk rock”

“Playing punk rock 00’s playlist”

It’s like she’s trying to mock me.


Burglar: "Put your hands up and show me where the money is, I won't hurt you..."

Me: "Alright man, I'll tell you where is money ... ALEXA CALL THE POLICE!"

Alexa: "Shuffling songs by The Police"

* EVERY BREATH YOU TAKE plays as I get punched 24 times *

from https://twitter.com/ppathole/status/1092034892249079813


Jokes aside, the police probably want to know whether your complaint is that there is a burglar holding you at gunpoint, or that you're trying to order a pizza. They probably get both types of calls at about the same frequency.


Even if it works: do you expect the police to stop the burglar before they shoot you?


I don't expect a burglar to shoot me. Raising the charge from breaking and entering to armed robbery is dumb enough, raising it to murder is downright stupid.

Drug addiction makes people do crazy things, but there's a reason why most burglars flee as soon as they are noticed


How do you say "00's"? Does Alexa say it the same way?


Yep, oh-ohs. We said it the same.

My fav is when she decides to interpret your words instead of playing the exact playlist you’re telling her to play.


“I'm sorry, I couldn't find any results for ‘OK Google, begin navigation to Denver, Colorado.’”

“OK Google, what's the speed limit?” → “The speed limit is defined as the maximum rate at which a vehicle can legally travel on a given stretch of road.”


"The speed limit is 299792458 metres per second, also known as the Speed of Light."


> I can set my own alarms, thank you.

And not only that, but a specialized user interface is often preferable to voice even when the assistant passes the Turing test. There's a reason that people use apps to get food delivery rather than calling.


It's funny, but for years I've felt like "Hey Siri, set an alarm for 7am" was 1000x easier than using the clunky Clock UI and it's almost exclusively what I use Siri for. Tasks that are so perfectly well-defined are exactly what this "smart" tech is useful for. Except that recently, Siri screwed me. I said "hey Siri, set an alarm for 6:30am" and her response was "Ok, your 6:30am alarm is on," but what she actually meant was "Ok, I ensured that your standing alarm for every tuesday at 6:30am is on" which meant she wasn't enabling the alarm for 6:30am tomorrow, so I overslept. Very annoying and completely blew my trust in a very useful function.


Alarms and reminders are my most frequent uses of Siri (and to entertain/frustrate the kids...). It's also pretty good for hands-free quick texts while driving and hands-free calling, though I don't do those often.

Your alarm example is frustrating. Of course, you could say "set alarm for tomorrow at 6:30 am" which would work, but then you're back in the realm not of natural language commands, but formal commands that exist in the uncanny valley, just similar enough to natural language to be irritating when they fail.


Yeah, I would think it would be obvious to most humans that "set an alarm for 6:30am" never means "6:30am four days from now." It seems like it should just be interpreted as the next occurring 6:30am unless otherwise specified, but hey, I'm not a computer.

Edit: also, your "tomorrow at 6:30am" example is also open to interpretation if you're saying it two minutes after midnight. I'd really like it to recognize these sort of ambiguities and prompt me for clarification. "It's after midnight. By 'tomorrow at 6:30am', do you mean you want your alarm to go off about 6 hours from now or the next day?"


Siri does resolve that ambiguity; it will ask you “do you mean 6:30 am on day X, or 6:30 am on day Y?”, using weekday names for good measure. That’s actually one thing they got pretty right.


... and this is exactly the sort of thing where I feel much more comfortable looking at a GUI that clearly says the date, day, and time all in one place. It would take me 2 seconds to verify that it's exactly as I want it instead of parsing a voice assistant's response.


Ouch. I just did a quickie test asking Siri to set me a reminder for 'tomorrow at eighteen'. No dice. It set it for 8 am.

It doesn't even support 24 hour time - at least in English.


"...set an alarm for tomorrow @ eighteen hundred." works just fine. At least in U. S. English, I don't think I've ever heard anyone refer to 6 p. m. as "eighteen".


But that's the problem: most people don't speak U.S. English as their primary language.

They use all kinds of languages with all kinds of accents and tics.

In my experience, and from the time I did a bit of NLP, the situation is often along these lines: it works for mostly accent-free, simple English and fails to get anywhere usable on most other languages and/or accents. Sure, that is to some degree because of missing training data, but for a consumer this doesn't change the fact that, for very many of them, these features work terribly badly.

Just out of interest I tried out the YouTube auto-generated subtitles for a German video, but even at the parts where the speech was super clear and well pronounced, the result was hardly distinguishable from randomly picking arbitrary words. It wasn't even that the algorithm chose similar-sounding words; they were completely different words in many cases. I think in a sentence of ~10 words, on average 1 or 2 were correct. And that was at the parts where the speech was unusually clear and understandable. At other parts it wasn't even able to recognize that there were words...


Except it isn't a hundred, it's sixty, and I protest that usage. >.<. I prefer eighteen o'clock.


It's standard US military terminology and part of US culture and language.

Just imagine all the idioms they have to account for in these voice algorithms. I know in German they also handle "halb neun" correctly as 8:30.


That works fine in Siri.

Although it does have a problem with hours over 20 (“22 o’clock” becomes 20:02 for some reason, possibly because of my thick Italian accent).


but you would still understand them if they did or take 5 seconds to ask a clarifying question.


I live in Redmond, WA. Frankly, if someone asked to meet for a meal at eighteen, I would assume they would like to get together in a Microsoft cafeteria closest to the (what I believe to be non-existent) Building 18. My backup option would be to assume that they have received a hard blow to the head at some point in their life.

Again, whether I would understand them or not, no one to my knowledge speaks like that in U. S. English. It is a great example to use to show the quirks of language. It is a bad example to use to show that Siri "doesn't even support 24 hour time - at least in English".


I'm not English. We don't say eighteen hundred for 18:00, we say just eighteen. And I don't have localized Siri.

And in any case I said eighteen, not eight a.m.


Officially building 18 does not exist. Anything you may have heard about building 18 is just rumour. If you have any questions regarding the purpose of building 18, you should direct them to Shelley in HR.


Hey, it's been well over a decade since I worked there. :-)

https://campusbuilding.com/b/microsoft-building-18/


"report to Shelley in HR for re indoctrination" would be better :)


..."at eighteen"? "Eighteen hundred hours" is how it's pronounced.


On the google side, "wake me up at 7" results in a 7AM alarm 95% of the time and a 7 PM alarm the other 5%. Just frequent enough to screw you over…


Agreed, I find this to be a great use case for voice assistants. Or on the occasion I feel like napping after lunch "Hey Siri, wake me up in an hour".


> There's a reason that people use apps to get food delivery rather than calling.

Well, if I had some type of personal assistant who worked for me (as in, a real human), I would just call out "Hey, Sam, please order pizza for me" and continue with whatever I was doing.

The reason people don't make phone calls is they add a lot of additional friction. You have to dial the number, and wait for someone to pick up, and give them your address, and deal with frequently-questionable voice quality.


Pizza is a weird example because pizza joint menus are more-or-less standard and unsurprising. For any non-pizza restaurant I'll want to start by scanning a menu, which is way more efficient than listening to a menu be read to me.


And if you're already scanning the menu, just clicking on the stuff you want is easier than explaining it to someone.

However, if you frequently order the same thing from a given restaurant, having a way to order "the usual" from that place via voice might be convenient. But I'm not sure it's much more convenient than being able to do the same thing with a button click or something.


Your assistant is probably smart enough to remember what kind of pizza you like and from where, and will just order that unless you tell them something else. In theory, there's no reason your phone auto-assistant couldn't do the same thing, but we seem to be a long way from any of them having that level of intelligence.


Exactly! We aren't there yet technologically, but I think it's a worthy goal.


I don't know, I like swiping through photos of food to make my decisions on what to eat, and then it only takes about 30 more seconds to complete the order at most.


This is just the optimal stopping problem: sometimes you want to spend time finding the exact thing you want, and sometimes you just want something that's good enough. Both problems don't need the same solution.


I still often use a phone call (or sms) for takeaway. It's faster than using an app or website if you already know what you want, and usually cheaper, since there is no middle-man.

For delivery, I agree. It is easier and faster to use a service, since my address and payment information are already saved.


That doesn't seem correct. People prefer apps because they are better at providing the information we need (menus, products, services etc) and the alternative for engaging through calls is either talking with overworked humans or a painful IVR service.

For me, the ideal interface to engage (order something, create a note etc) when you know what you want is through speech. It just feels so effortless.


I can definitely say that for food I MUCH prefer the web/app interface. I can swipe through a bunch of nice-looking food and just tap on something that looks appetizing.

You can't get that with voice no matter how you do it. It's like if a restaurant had no menu and the wait staff just recited the menu in your face and asked you to pick. Even with real humans, voice isn't the best interface for presenting food choices.


> if a restaurant had no menu and the wait staff just recited the menu

That's exactly how food is presented to you at lots of high-end restaurants. Once you get above the fast-casual tier, menus never have pictures. If you know a lot about food and you trust the chef, voice is a perfect medium for explaining the choices. Pictures are just a discovery aid.


That's exactly what I meant. The app interface is good for conveying information. But if you know what you'll order, eg - popcorn every weekend, just saying "Jarvis, order my usual popcorn combo" is so much more seamless.


The only two questions I ever ask Siri are "how cold is it outside?" and "wake me up {at HHam, in X minutes}"


I'm curious that you want to know the temperature specifically, rather than the weather. I'm guessing that's for reasons other than clothing choice?


I just feel like I get a more actionable answer than the "It's a bit chilly and overcast" I get when I ask about the weather – how do I know if Siri and I agree on what's chilly or not?


Cool, pretty sure for weather Alexa says something like "in $place today it's 10 degrees with a light breeze and 40% chance of rain; with a high of 13 and a low of 8; have a good $day".


But your examples are way harder than they sound. Speech or non-speech analysers have a hard time with context. What do you mean by "recent" photos? And what percentage? Of people wearing masks in each photo, or of photos with 1 or more masked people across the whole set? Or the percentage of photos where all people are wearing a mask? We humans make a lot of deductions from context. We haven't been able to teach computers this aspect for 25 years. It has started more recently; deep learning shows potential.


Absolutely, but you could start with reasonable assumptions and refine from there. Just the percentage of individual humans wearing a mask. Recent = past week. Start somewhere. Give me an answer first, or ASK for clarification instead of "Sorry, I couldn't understand your query."

I understand it's a hard problem, but with current software capabilities and given Google's compute infrastructure I honestly think some of these things are well within the realm of what a team of several hundred Google engineers can do.

I'm not asking it to write Shakespeare, I'm asking it to crunch data with a sentence that could reasonably easily be parsed into a graph and be turned into a MapReduce query. I thought they were good at that stuff.

Context? I know it's hard, but I thought Google has been working on that. I'm very much adjusting my expectations to what I think Google should be able to accomplish in a decade. I expect some basic context capability now, at the very least these data-crunching type use cases.


> But your examples are way harder than they sound.

There are two ways to look at this problem: (1) what is hard and what is easy for our current tech to do, and (2) what are the things that humans actually want to pawn off to assistants?

The problem is that those are two very different answers. I don't think I agree with the grandparent comment that some of those things should be easy, but I do think that comment contains good examples of the level of sophistication that would make an AI assistant more than just a curiosity for the couple times each day you prefer not to hit the button yourself (or if you listen to the radio while you shower).


Exactly, which is why these assistants are not very useful beyond simple tasks you could just do yourself. If they're only good at things that are easy for you to do, then what's the point of them besides not needing your hands to do simple tasks?


Recognizing masks is the hard part, indeed. But it can be done semi-decently by just analyzing tags (#wearamask or whatever).

The rest is easy, it could even be done completely on-device if you have a recent high end chipset. That's how powerful phones are nowadays.

Tbf, it is easier to just run your own server, and pre-program everything for yourself, a truly personalized experience.

Though Google could easily do it for the millions of people using Android. I really wish they allowed custom modules or something for their Assistant, the voice recognition is unmatched.


Then parameterize & return results for one specific duration, return that duration with results, and make it easy to adjust afterward.

“Here are the results I found for the last week...”

“Show the same for the last 10 days”


Context is hard, but it seems like “recent” means (99% of the time) order by date descending, grab the first 15 or so, and then how many of those photos contain a person with a mask.

Maybe the difficult part is whether or not you look for the 15 most recent photos containing people, or the 15 most recent photos of anything.
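
(A minimal Python sketch of that "newest 15" reading, assuming a list of photo records with invented fields; it also covers the "photos containing people" variant.)

    # Sketch of the "recent = newest 15" interpretation; `photos` is assumed to be
    # a list of dicts with invented fields ("date", "people", "masked_people").
    def recent_mask_share(photos, n=15, require_people=True):
        newest = sorted(photos, key=lambda p: p["date"], reverse=True)
        if require_people:                  # "15 most recent photos containing people"
            newest = [p for p in newest if p["people"] > 0]
        newest = newest[:n]
        people = sum(p["people"] for p in newest)
        masked = sum(p["masked_people"] for p in newest)
        return masked / people if people else None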


No, it doesn't mean that 99% of the time. It's more like 99% contextual.

If I ask for recent wildfire news and I'm in a state that doesn't experience wildfires often, are you going to return 15 news articles about wildfires spread out over the 200 year history of the state? I almost certainly want 15 news articles about the current wildfires in some other parts of the country. Your algorithm doesn't really say what to do here.

If I ask for "recent relatively rare astronomical event" recent might mean hundreds of years or more. If I ask for "recent PC game releases" it might mean a month or the current year. If I ask for "recent public events in my town" it might mean over the last week.

In many cases, "there are no recent events" is a better answer than "here are the last 15 events."


FWIW, I tried the first query on Google Home and it suggested I talk to a third party extension called "Air Check." I didn't go through with connecting it but I assume it would give you what you wanted. It sounds like you were on mobile so maybe it's different there. I actually appreciate it prompting to connect before making the request because there's probably some privacy tradeoff there.

The second query was hard for me to test but it seems like a reasonable one to make.

The third one is probably impossible due to Instagram's TOS (among other things.) Their TOS states: "You must not crawl, scrape, or otherwise cache any content from Instagram including but not limited to user profiles and photos." [0] After Cambridge Analytica I'm not surprised that this is the case. Even if Instagram allowed scraping of their data, this feels like a somewhat specialized (though currently very relevant) request. I tried reformulating this request as "What is the mask compliance rate in Shasta county?" which also didn't turn up results, but I'm not surprised by that either. I suspect that data doesn't exist anywhere, so it's hard to fault the assistant for not pulling it up.

[0] https://www.instagram.com/about/legal/terms/before-january-1....


Well put.

I don't know how much of FAANG's budget goes towards improving voice assistants, but considering how much cash these companies have on hand and their operating budgets (the size of some smaller European countries), the progress in that area is just super disappointing.

The 3 most common use cases were refined a long time ago (Directions, Alarms and Play Music) and everything else ran into a hard wall.


Even with directions, it fails miserably IMO. The most it seems to be able to do is "navigate to X". I want:

- "Take me on the most scenic route to X." Can't you figure that out from social media tags? Simple first order solution: routes that have more photos with more likes = more scenic. Took me 1 minute to think of that. And 1000 engineers at Google couldn't implement that? These data crunching tasks are the kind of stuff computers are supposed to be good at.

- "Navigate to X but make sure you stay on paved roads." An actual problem if you are trying to use Google Maps in the back roads of California and don't have a 4WD vehicle. Google Maps loves taking you on 4WD dirt road detours. Don't you have satellite maps and street view? Can't you differentiate paved roads from unpaved ones? What the hell does your machine learning department do?

- "Stop at the last grocery store before the highway 120 junction." Yeah, it doesn't even begin to understand this type of query.

- "OK Google zoom out the map slightly." Nope. Sorry.


Scenic: Garmin's devices appear to do some calculation on number of times the road doesn't go straight over a given distance. Seems to work well enough. OTOH, someone at Google has to make this a feature.
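
(A back-of-the-envelope Python version of that idea, scoring "heading change per unit distance" along a route polyline; the scoring is invented for illustration, not Garmin's actual algorithm.)

    import math

    # Invented "curviness" score: total heading change divided by distance travelled.
    # `points` is a list of (x, y) coordinates along the candidate route.
    def curviness(points):
        total_turn = total_dist = 0.0
        for a, b, c in zip(points, points[1:], points[2:]):
            h1 = math.atan2(b[1] - a[1], b[0] - a[0])
            h2 = math.atan2(c[1] - b[1], c[0] - b[0])
            turn = abs(math.atan2(math.sin(h2 - h1), math.cos(h2 - h1)))  # wrapped to [0, pi]
            total_turn += turn
            total_dist += math.dist(a, b)
        return total_turn / total_dist if total_dist else 0.0

    # Higher score = twistier road; a router could prefer these for a "scenic" option.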

Paved/unpaved: that's one suck-ass assistant you've got there. I know the Garmin on the dash of my motorcycle will give that option. The Garmin RV-specific GPS has loads of other options, such as avoid any low overpasses that will rip the solar panels off my RV. Though it doesn't look like Apple is any better than Google, only because Apple Maps doesn't offer the option. Anyway, satellite and street view? Rand McNally had this information since dirt was first created, no need to get the satellites out.


Yeah, I would have expected Google to even take it to the next level: Ask what model of car the user has (better yet! Identify it from a combination of the car's Bluetooth MAC address and a machine learning model trained on the audio spectrum of the engine noise). Look at where all cars of different types are able to travel, where they make U-turns and turn around, and at what GPS locations they make calls to roadside assistance numbers. Incorporate weather data as well, e.g. what season, and whether it has rained lately. Between all of that data you should be able to advise whether a given vehicle should be able to traverse a given road on a given day of the year.


Agreed, the error rate before I just give up on something is pretty low.

The real frustration is when it fails on tasks I swear it has done before easily.

Like, 'ok google' sometimes ends up googling the command itself, word for word: 'ok google set timer 10 minutes'.

As soon as I have to babysit an assistant like that I'd rather just make sure the clock is on my home screen so I set a timer.


> A big problem with the assistants is that as soon as they fail at a query they seem stupid, I feel stupid, and I stop using them for a long time.

Raise your hand if you remember the early days of Xbox "recognition", as you listen to other people in your party:

"Xbox, record. XBOX, RECORD. X...BOX, RE...CORD."

It's the reason that "oh, we can't sell an Xbox without a Kinect" Kinect went into the garage after a few months.


What I find even worse is when you play FIFA and some random unrelated British commentary triggers the "xbox" speech command. Infuriating, especially since you can't turn it off.


And it can be used to troll people:

https://www.youtube.com/watch?v=anslUJ5SCIs


Kinect, actually awesome tech, implementation on the xbox... not great. But the heart and soul of a ton of installation art pieces.

It's really too bad that MS went with a windows only sdk for the 2.0, the 1.0 being external tech had a multiplatform sdk.


The big power of GUI is that it was an alternate control scheme to do anything you can do on the CLI. Voice control is not a UI. It's a limited set of functions not meant to offer analogous control over any aspect of your device really; if it does it's happenstance. That's why Siri is for setting alarms and not doing complicated workflows that would require your thumb. I wish it was a voice ui, though. I wish I could stick my phone on a shelf and step back and tell Siri to take a picture, to do that function on the phone my thumb can do, but the best Siri does is open the camera app because the voice controls are just that, a limited set of control levers to pull and not a complete user interface.


That is all doable with Huginn/IFTTT/Automate/Tasker/Snowboy.

It could even be done on the phone itself, although battery life and heat would likely be a problem.

The fact that it doesn't work just shows that Google isn't that interested in their Assistant.

Which is rather strange, but maybe they're looking for better ways to monetize it?

Or maybe it's going the way of Reader once AR gains momentum or something... then again Google Lens is a really nice product and no one I know even heard of it.

I'm amazed at how well it can recognize any kind of writing, even my chicken scratch, and how it can look up any products/labels with decent results.


Probably, they get more ad revenue from:

- you doing a manual search, and being tracked/targeted

- them displaying ads as a result

- them displaying contextual, higher paying ads

I've noticed that some things over the years, many things in fact, have been removed from Google Maps. I hypothesize that the entire reason is, these things reduce profit.

For example, you used to be able to pause Google Maps. You can't now, and therefore, if you have 'history' off, you have to stop, and manually re-start your destination by typing it in.

Well, history profits them.

And pausing is not the best, because then, you're not active 100% of the trip. Having you active lets them determine all sorts of things, like traffic patterns, where you go, who you're near, and so on.

There are lots of little things like this, which seem to be gone from earlier versions of Maps. Again, I presume, all to make more $.

Which is logical, and fine, but it gets a bit tiresome and sad at times.


But would you have gotten better results if you had typed those same queries into some website?

While an artificial general intelligence would be capable of fluent speech, a perfectly good speech system does not necessarily need to be an AGI.


The problem, as you note, is that you are actually doing more parsing of the input than virtual assistants do. Tom Scott has some good videos on the subject (https://www.youtube.com/watch?v=m3vIEKWrP9Q&list=PL96C35uN7x...)

To give a more concrete example, here's a UPenn demo (https://cogcomp.seas.upenn.edu/page/demo_view/ShallowParse) for your instagram query:

> NP What percentage PP of NP people can NP you VP detect to be wearing NP masks PP on NP recent Instagram photos VP tagged PP at NP a location PP within NP a 10 mile radius PP of NP Mt. Shasta ?

Part of the reason we don't progress beyond that is that speech recognition like in the OP article is quite bad: 95 percent accuracy is considered "good." But it means we expect 1-2 words of your query to be misrecognized, so even if it did parse the query as you proposed, it would probably be answering the wrong question!


This reminds me of the issue in machine translation where even sophisticated systems cannot understand the importance of a word like "no". So, for example, a phrase like "please press this button" could be translated as "please do not press this button" and it could be automatically rated as a good translation because most of the words are there.

But, on the other hand, I have dictated these paragraphs (pretty much just adding punctuation and minor edits at the end). I think the most useful feature I've found related to voice is dictation (speech to text). It is almost perfect, at least in English (and I am not a native speaker).


Because this is HN, I want to say that I think the above is now doable and more people should be working on startups in this space.

The thing we're discovering is how to marry NLP (which works pretty well) with structured databases and automated tools (which work really really well).

A pure AI play probably won't get this done. But if your AI starts knowing how to use information tools then I think we'll see a lot of near term progress.


Sounds fascinating. Could you share any pointers to papers or academic work on this subject?


Data Agnostic RoBERTa based Natural Language to SQL Query Generation

https://arxiv.org/abs/2010.05243v1

more

https://scholar.google.com/scholar?as_ylo=2020&q=text-to-sql...


I use my voice assistant on my phone to set timers.

That's it.


The other day, I wanted it to open an app and start recording when I say "Hey Google, take a voice note"... but no dice :/


Just say "Hey Google, open Live Transcribe" (assuming you had that app installed)


This highlights part of the problem. Google sell/advertise products for companies but people want to use tools.

People want "[wake word], [action] [modifiers]".

Companies want "[trademark product] [trademark product] [modifiers]" (eg the parents example "Hey Google (RTM), Live Transcribe (RTM) 'words to transcribe'").

I've only used Alexa (and only for fun) but the replacement of verbs with companies/products, and of nouns with proper nouns, is really annoying to me and gets in the way IMO.


Weather is also useful! Sometimes reminders, which are basically timers. But yea that's it.


Siri doesn't get the weather right.

"What's the weather in Yosemite" and "Yosemite weather" can get you two drastically different results. As of right now (10/26/2020 10:35 AM) the two results I get are 43F and 32F.


Siri is always mistaken. Alexa is pretty good.


Alexa is super easy to make a skill for (IIRC it's one XML/JSON file hosted somewhere and registered with Amazon). So you could probably make your own with ~IFTTT to scrape the data. Alexa has "daily briefing" which will then go through a list of skills which can include your own [public] skills.

I made an "and finally ..." to add funny news to the end of the daily briefing and was super impressed how easy it was.
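
(For anyone curious what "easy" means here: a custom skill's backend is essentially an endpoint, often an AWS Lambda, that returns a small JSON response, while the interaction model is a separate JSON file configured in the Alexa developer console. A minimal Python handler looks roughly like this; the intent name is made up.)

    # Minimal sketch of an Alexa custom-skill backend (e.g. an AWS Lambda handler).
    # "FunnyNewsIntent" is a made-up intent name defined in the skill's interaction model.
    def lambda_handler(event, context):
        request = event.get("request", {})
        intent = request.get("intent", {}).get("name", "")
        if intent == "FunnyNewsIntent":
            text = "And finally... here is today's funny story."
        else:
            text = "Welcome to the demo skill."
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": text},
                "shouldEndSession": True,
            },
        }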


Something I'd love to see is a move towards "multiple possibilities" in voice-based UI/UX. Almost all these systems give weighted probabilities to different parses, and if you gave the human a choice between the top 3, you could be much more exploratory/risk-taking in choosing candidates. Or in machine translation or transcription, why do Google Translate and YouTube automatic captioning not communicate the system's level of comfort with what it gives you, and provide alternative possibilities if that is low?

We're humans - we can deal with ambiguity. Systems should trust that we'll respect them more if they tell us they're unsure, rather than either jumping to the wrong conclusion or simply being unwilling to guess!
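
(A small Python sketch of what that could look like, assuming the recognizer exposes its scored hypotheses, which most do internally; the thresholds are arbitrary.)

    # Only commit to the top hypothesis when it is clearly ahead; otherwise
    # surface the best few candidates and let the human pick.
    def present(hypotheses, margin=0.3, top_n=3):
        # hypotheses: list of (text, probability), sorted best-first
        best = hypotheses[0]
        runner_up = hypotheses[1] if len(hypotheses) > 1 else ("", 0.0)
        if best[1] - runner_up[1] >= margin:
            return best[0]
        options = "; ".join(f"({i + 1}) {text}" for i, (text, _) in enumerate(hypotheses[:top_n]))
        return f"I'm not sure I heard that right. Did you mean: {options}?"

    # present([("call mom", 0.46), ("call tom", 0.41), ("call bob", 0.02)])
    # -> "I'm not sure I heard that right. Did you mean: (1) call mom; (2) call tom; (3) call bob?"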


Agreed, "recent photos" should get a response like "showing photos from the last day" and you can say "no, from the last month" and that then will be used to weight similar requests (about photos) in the future (weight mind you, not naively then make all requests for 'recent' by replaced by '1 month').


I think the issue is that the virtual assistant is there to assist, not replace an assistant.

I have an admin. I can text, call, talk, email her and say “let’s get bob, mary and finance on the phone at 4” and she’ll move stuff around to make sure that happens. I can also delegate complex administrative tasks that need to be done in my name.

When I’m at the dentist, I usually do a “hey Siri” to book it as it saves time. At home, “hey Siri turn on the lights” is helpful. It’s magic, but not the same.

The tech companies try to frame this stuff “bigger”, but in doing so they create an unreasonable expectation. Google home, Siri, Alexa are amazing, but we bitch and moan about them as a result.


Maybe it's the marketing.

I don't have a human assistant, but if I did, the biggest things I would expect from them isn't about scheduling a meeting or scheduling a barber appointment. I can handle that stuff myself.

What I would REALLY expect of a human assistant:

"Can you call up my health insurance and fight this stupid bill of $400 for COVID testing that should have been $0. Please escalate if necessary. Thanks."

"Can you figure out how to fight this red light ticket that I got due to a malfunctioning red light camera? Thanks."

"Can you fight this parking ticket? I had a valid permit to park there, and here's documentation of that. Thanks."

"Can you register me to vote? Here's my ID. Thanks. Make sure they don't sign me up for spam."

"Can you call up Comcast and fight this bill increase? Threaten to switch to another provider if necessary, I heard that works."

"Can you call up this company that posted my personal information and ask them to remove it? If they refuse, threaten legal action."

"Can you dispute this electric bill for me? My heating shouldn't have been $400 a month. Something must be wrong with my meter or someone is leeching power from my line."

"Can you call up the 10 different grocery stores in the area and figure out which one has X in stock?"

In all honesty I do wish the Google automated assistant could do all of the above. Sorry Pichai, I don't care for scheduling haircuts automatically. I want your assistant to use its hundreds of thousands of hours of human conversations and use machine learning to craft and engineer responses to humans to know EXACTLY when to escalate, EXACTLY when to ask for a manager, HOW to threaten legal action, and basically fight tooth-and-nail with language to get me what I want against the companies and institutions I need to fight with on the phone. The job of an "assistant" should be to get sh*t done and get me what I want. Use machine learning and lots of data to master the art of negotiation with customer service reps.


Most of the time when I ask that at home, it tells me that it's afraid it can't turn off the lights, since they are already off.


> It turns out that the current generation of "assistants" are mostly just template-matchers which really doesn't help me much at all.

All the cases you've indicated as "so simple" are only so simple as template matching.

When generalized, they are hard problems.

Sure, you could have humans sketch out a few thousand templates enabling "Ok Google" to support such simple things like "What is the air quality at <location> <Time>" which looks up from an API. But that doesn't generalize "for free" outside of your templates.
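
(A toy Python example of what such a handcrafted template looks like; the pattern and the lookup function are invented for illustration.)

    import re

    # One hand-written template. It only fires when the request is phrased
    # almost exactly this way; anything off-template falls through.
    TEMPLATE = re.compile(
        r"what is the air quality (?:like )?(?:at|in) (?P<location>.+?)"
        r" (?P<time>today|tomorrow|right now)\??$",
        re.IGNORECASE,
    )

    def handle(utterance):
        m = TEMPLATE.match(utterance.strip())
        if not m:
            return "Sorry, I couldn't understand that."
        return lookup_air_quality(m["location"], m["time"])   # hypothetical AQI lookup

    # handle("What is the air quality like at Mt. Shasta today?")  -> calls the lookup
    # handle("How's the air near Shasta?")                         -> falls through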

Google seems to want to avoid handcrafted templates and go straight for the generalized solution.


> "OK Google, what percentage of people can you detect to be wearing masks on recent Instagram photos tagged at a location within a 10 mile radius of Mt. Shasta?"

Ok, so you need the assistant to:

* Already have a trained dataset of people wearing masks

* Fetch ALL instagram pictures it can find

* Not only detect if there's a mask in the picture, but count them

* Fetch the location of Mount Shasta

* Calculate a 10 mile radius around it

* Apply that calculation to the people counted in the first step

* Calculate a percentage of people wearing masks which needs:

* Count people not wearing masks.

Those steps are sub-optimal. So you need not only understand that, but to run some sort of 'query planner' in order to get the results you are looking for.

You are thinking of Jarvis from Iron Man. Forget assistants. Allow someone to construct a query like this in a couple of minutes and you have a great product you can sell. Existing ones require quite a bit of domain knowledge and setup. Not even things like Wolfram Alpha would be able to parse this query.

This reminds me of: https://xkcd.com/1425/


> Not even things like Wolfram Alpha

Sure, but we're talking about Google here, not Wolfram. The masters at query optimization and MapReduce. I would have expected them to be able to parse this query, or at least fetch the AQI near Mt. Shasta but can't even do that.


Heck, I ask Siri something like "what is the pollen count" and I get back a page that it can't read, describing what a pollen count is, not what the current measured pollen count is in my area. If it could answer your questions, I would have hope for my own question getting the expected results.


If google assistant or whatever worked like a person of average intelligence that I could talk to and get to do stuff for me, that would be incredible. I've been on a roadtrip and your examples made me realize how much of a time/attention saver something like that would have been.


"Hey Siri, remind me to add [x] to the grocery list at [y]."

"There is no 'Grocery' list. Would you like to make one?"

An example of how Apple's ecosystem breaks down if you want to use something outside it. Saying no should offer to add a reminder as requested, but it doesn't.


Try getting Google to voice type the word "o'clock". Just try it.


I set up a routine for the words "Music Time" on my Alexa. Every day I say those words a few times. The hit rate on an accurate response is not above 90%.

It's the only thing I use my Alexa for.


Without speech recognition our startup would not exist today. We are developing a text-based video editor where the video is cut over the transcribed speech.


Actually, the ability to delete all iPhone alarms at once via Siri is a lifesaver. I know no other way to bulk delete/disable alarms in iOS.


Sounds like a UI bug/misfeature, rather than a voice feature.


Or my current peeve:

"Siri, play song “song-name”"
"I'm sorry, there was an error with Apple Music."
"Siri, play song “exact-same-song-name”"
[song plays]


When they do work though you feel like you’re at the helm of the future.

I’ve figured out what queries almost never seem to fail and use those almost exclusively. I don’t get creative.


Siri answers the first query correctly, fwiw


Back in the day (the '80s) they were joking about the programming languages used in early (Star Trek) holodecks.

  > (1st attempt) "Computer! coffee please"
  > (computer dumps coffee on Kirk) "argh!"
  > (2nd attempt) "Computer! coffee in a mug please"
  > (3rd attempt) "Computer! hot coffee in a mug please"
  ....
  > (25th attempt) "Computer! 10cl of coffee at 50C with 3cl of fresh milk at 6C in a bottoms down ceramic mug of 15cl"


This means that you would only be able to choose between several (discrete) choices instead of assigning the computer arbitrary tasks. But when the number of such choices is really small (e.g. make coffee), then a simple physical button to initiate the task is better in almost every way.


“Computer. Press the coffee button.”


One thing I haven't seen discussed is the poor affordance / discoverability of speech technology.

Google Home can do some clever things; however, it also lacks the ability to do some very basic stuff. As a user, how do you know what Google Home can and cannot do?

It is just trial and error. And if Google Home introduces a new feature that can handle new types of queries, what then? How does a user know that last month it wasn't able to do something and this month it is?

And lastly, the voice interface is very clunky. It has no concept of temporal memory. For example:

Me: "Ok Google, navigate to the nearest Safeway"
Google: "Navigating you to the nearest gas station"

The natural thing to say is "no, I meant Safeway, not gas station"; however, I now have to say "Ok Google, navigate to the nearest Safeway" all over again.

This is analogous to a keyboard with no backspace, where you have to retype everything every time you make a typo. Well, that's the state of speech technology right now.


>As a user, how do you know what Google Home can do and cannot do?

Amazon's workaround for this problem is to have it tell you when new "options" are available for a command (i.e., if you set an alarm it will confirm the alarm, then tell you it can wake you up to the sound of birds, then give an example command), and Amazon sends out a "what's new with Alexa" email every so often that's 90% example commands.


Agreed. More broadly, I'd say that no one has made a good UI yet. The mac/lisa/star had a UI that people could learn. iOS...

In some ways a voice UI has bigger problems to deal with than PC GUIs or iOS. Those UIs were replacing pre-existing UIs (e.g. BlackBerry, DOS, Unix, Norton) and they could target whatever tasks a smartphone/PC needed to do. For voice UIs, it's a cold start. It's not even obvious what an audio-only computer should do. Our mental model for a "virtual assistant" is a person-to-person exchange, and computers still aren't great at communicating like people.

FWIW, I think slipping into existing niches is the way to go. That's where a useful voice UI will be discovered. Car stuff, accessibility software, living room controls... at least these have clear goals. Voice-operating Spotify, Netflix, or just an iPhone is something people actually need and will use if it's useful.


What you say about "temporal memory" is not exactly true for Google's assistant. You can try two separate queries:

1. "Who is the president of the United States?"
2. "What is his wife's name?"

And it will resolve the deictic pronoun.

I haven't tried this feature out extensively, but it has worked for a few years now.


Once they start hooking them up to conversational language models that can also submit queries, I think it is going to get a lot better. The conversational model results are starting to look very good.


Where can i learn more about the leading edge of those conversational language models ?



For most short interactions, the mouse/trackpad/finger is simply faster.

Now, for long-form typing, I'd love to use dictation, and I sometimes do for taking down short thoughts I e-mail myself from my iPhone.

But the problem is not just that it still makes tons of mistakes. (Probably a quarter of my notes-to-self involve errors so big it's even impossible for me to later figure out what I even meant by trying to sound it out phonetically.)

The problem is that I can't correct those mistakes using voice. There's no way to say "pause, correct affect to effect" or anything like that.

Even more maddeningly, the words keep changing in real time. Sometimes I'll utter a sentence it gets right, then it "re-analyzes" it and completely messes half of it up.

I just wish there were a kind of dictation where I could say a phrase, pause, see if it's right (and it wouldn't change after), say the wrong part with a kind of emphasis that lets the system know I'm issuing a correction, the system would look for the next most probable alternative, repeat as desired. Then I could actually dictate successfully.

This UX where the words are always changing back and forth according to updated statistical probabilities, even as long as 15 seconds after I said them, and where there's no ability to go back and correct them with voice... it's just so so dumb.

The problem isn't voice recognition anymore. It's voice correction.
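One way to picture the missing piece: the recognizer already produces an n-best list of hypotheses internally, so a voice-driven correction loop could be as small as walking that list. A toy sketch, with the recognizer output and the user's replies stubbed out:

  def correct_by_voice(hypotheses, listen):
      # hypotheses: recognizer's n-best list, most probable first.
      # listen(): returns the user's next utterance; saying "correct"
      # rejects the current guess and moves to the next alternative.
      for text in hypotheses:
          print(text)
          if listen() != "correct":
              return text
      return None  # ran out of alternatives; fall back to asking again

  nbest = ["affect the outcome", "effect the outcome", "a fact the outcome"]
  replies = iter(["correct", "ok"])
  print("accepted:", correct_by_voice(nbest, lambda: next(replies)))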


Picks up the mouse and speaks into it: "Computer?"

Classic scene: https://youtu.be/xaVgRj2e5_s?t=171

Star Trek isn't my favorite sci-fi show/movie, but they really nailed some futurism, and in this case, the lack thereof.


100% this. Dragon used to have more correction features, but at least the Apple dictation doesn't really have any. Even just "correct sentence" where I try again with the same sentence would be a huge improvement.


I'm dictating this. Not going to make any corrections, so that you caan see what the state-of-the-art is right now, including bugging us (that supposed to be bugging this. Apparently it doesn't recognize that word it's the quality of being buggy).speech is my primary interface my computer due to health issues, and my God is it infuriating.it's never going to really take off until it just works.

imagine if your keyboard had about a 2nd latency and every couple words got messed up in some way. Not only that, but those same words that got messed up are probably getting be messed up again when you try to go back and fix it with the same broken keyboard. you wouldn't say that typing is nearly solved, you'd say that typing absolutely sucks and keyboards just don't work.

I firmly believe that speech is going to be the main interface-actually want to scratch that last sentence, but correcting it is can the pain with voice. I firmly believe that speech is going to be a game changer of an interface, especially for coding, eventually. But until it stops sucking, it's only can be used where where there is no other choice.

(By the way, this is me, a native English speaker with a standard salmon Cisco accent-that's a standard San Francisco accent- dictating it on a many hundred dollar microphone into a many hundred dollar speech engine, speaking relatively slowly and enunciating the hell out of out of things.when I first started with that, it was even worse. and yet somehow a lot of people treat speech recognition as if it is in some way solved, or the error rate is better than human.or loader ship-that's supposedd to be what a load of ship-whatever you see what I mean)

-----

(edit: After the fact, I decided to calculate the word error rate dictating thhis. About 7%. If you ignore the non-speech-related bugs, it's about 4%, which is supposedly "superhuman." take that as you will. my take is that "human level" is not the same as a human trying to work as fast as they can, and maybe not paying particularly close attention. And that 4% is ridiculously far from 0% in terms of usability)


I think the issues with typing via voice are: 1) speech accuracy, and 2) correcting things by voice.

It's super unnatural to correct things by voice and until we re-imagine what it looks like if we HAD to type via voice, it's gonna be painful.

(Speaking as someone who types via voice as well. I have used Dragon and currently use Talon.)


For me, the thing that killed voice commands had nothing to do with speech technology. It was latency and error handling.

At the start of my morning commute, I would say, "Ok Google, navigate to work".

Often, this would fail because I was in the network limbo area outside my house, where my phone struggles to transition from home WiFi to data.

Worst of all: The failure would be horribly slow. I would have to drive for another 30s before my phone realized, yes, we are really out of WiFi range now. And the voice command wouldn't be auto-retried. I would have to tell my phone again. It didn't remember.

I added one-touch "Home" and "Work" Google Maps widgets to my home screen and never looked back.

As an engineer, I realize why this is a tricky problem. As a consumer, I want it to "just work".


It took me months to work out that "navigate to X" was the magic phrase to get Google Maps to do what you would expect car navigation to do. The phrase that came more naturally to me was "give me directions to X", but that only gets you to the screen with the route, and you still have to manually press the "start" button with your finger. And then it would randomly pick other modes of transport unless I remembered to say "by car". Systems that are inherently unrecoverable like voice commands need actual documentation.


Typo correction: unrecoverable -> undiscoverable.


Why would you need directions to the workplace you go to every day? Unless you're doing onsite stuff in new places every day?


Not the OP, but many people use Waze etc for driving directions on everyday commute because the same destination does not imply the same route - due to construction, accidents, traffic jams, etc the best route can vary significantly, and simply driving the same route as yesterday can take much more time than it did yesterday.


Traffic.


A few years ago, when Microsoft was pushing Cortana with Windows 10, I decided that since I was wearing a headset at my desk 90% of the time anyway, I might as well try using a voice assistant so I could multi-task. So the next time I needed to do a calculation in the middle of something, I said "hey Cortana, what's 50 times 12.5?" while typing something up... and it opened the Start menu, stole focus from my window, and then searched for it on Bing in Edge. I just wanted it to read me the answer.


Another frustrating experience with a voice assistant was with Google Assistant. I was on a train, and I didn't want to miss my stop if I got distracted (as had happened before). Since I had my headphones on, I tried getting Google Assistant to notify me when I arrived at my destination. It could not do it; the devs hadn't made a template for this scenario at the time.

Nowadays it might work via a location-based reminder, but I can't trust that to work within the 15 second window I have to get off the train.


The main problem now is not speech recognition. It's a kind of uncanny valley effect.

You can speak to these assistants, but the language is still restricted. They show little to no common sense. It's a lot of party tricks bundled together.

You can't interrupt them and it's hard to correct them.

On the speech recognition side, an issue I've found (although it is a rather niche one) is triggered because I'm bilingual (I'm fluent in Spanish and English).

Speech recognition only works well on a single language.

I have Alexa set to speak English, for example. When I'm searching for a song with a Spanish title, I have to fudge the name into a fake English pronunciation for it to produce the right phonemes that will match the song title, rather than just saying the name properly.

Also, if it misses the match, there's no easy way to stop it and say "No, not that one", and be presented with a list of similar matches.


I don't think it's a niche issue outside of the United States, in countries where English is widely known. It makes it pretty much impossible to use Siri on Apple TV for example, because so many movies or TV series are named in English but also in the local language.


I suck at talking. I speak in halting, quiet sentence fragments. My mind wanders and I lose my point. I get in my head a lot. I'm much better at writing and reading as a communication method. I'm open to speech stuff, but currently no speech recognition solution meets any needs I have.

Just my 2c, I'm sure other people have uses for it. The most interesting one (to me) has popped up a few times on HN, which is voice-based programming. I would love to see that mature and become more widespread, there are a few things that are annoying enough to do that if I had a voice shortcut or eye tracking it would be pretty cool.


I do the same when talking. Never realized others have this condition as well :)


There are lots of reasons it hasn't taken off like some hoped (accuracy, social aspects, privacy), but it's not faster unless you have no other option for input. If I have a computer, typing is faster; if I have a phone, which we almost always do, it's faster to trigger the command or type.

Speech is competing against every other tech item trying to be convenient, from laptops to phones to watches... the only space where I'd want it is something like baking or cooking when I can't interface with a computer.

(edit) And what about the response back? With a visual interface I can confirm at a glance whether my input was accurate; with a voice one I have to listen to the whole thing. I don't even like map directions in the car, as half the time I already know the next direction and don't want the interruption to music or whatever I'm listening to.


It's also the unprompted responses that have started to bug me. I don't know if Google is having a bad rollout or this is deliberate, but my Google Home is being triggered a lot more often now, and I don't recall anything remotely close to the wake phrase being said. Also, the responses to questions when actually prompted are not at all what I'm expecting. In particular, Google has trouble understanding a lot of context around Spotify playlists. It still cannot distinguish between a song and an album with the same name despite me prepending the request with "Play the song...". Overall it's just a terrible experience, and that makes the whole assistant thing less like an assistant and more like a toddler that refuses to cooperate.


I had this problem a lot with Cortana popping up during meetings and seizing control of my microphone. In the end I spent probably 15 minutes trying to figure out how to turn off voice activation because Microsoft doesn't make it easy.


Audio feedback doesn’t give the user the same reassuring sense of certainty as a graphical user interface. One glance will confirm that I have typed my card number correctly, but you don’t have to be unusually impatient for your heart to sink, when you hear the inhumanly calm words, “I heard 4659 1234 1234 1234. Is that correct? Say yes or press one to confirm”.

This is the main reason I rarely use voice controls. Even if voice is faster than typing for most cases, typing never fails catastrophically like voice does.

I could use voice only when I'm confident it will work, but then there's the mental load of making that prediction. It's easier to just always go with typing.


Just this morning on my way to work: "Hey Siri, remind me to take my meds at lunch"

Siri created a reminder (good) called "Take my meds at lunch" - no time (bad), just a simple reminder.

I saw this post, looked at the time, and wondered why Siri hadn't reminded me, given it was 1:00. Now I know to be more specific, but in reality I'll just stop trying, like most people.


Question: Can I write my own assistant on mobile devices that is an actual assistant? As in, can listen for the key activation word in the background? Like the "official" assistants.

As far as I can tell, the assistant APIs seem to be like plugins? On Android for example, it appears custom assistants still run through Google assistant.

I want to be able to say "TriggerWord, do X and Y" and the OS activates my app, passes the voice sample and I take care of all the language processing from there. Which doesn't seem possible...


The custom home automation system I built relies on google's (very good) off-line voice recognition + automagic [i] + opencv face detection.

I did not want the home assistant listening all the time, as it's highly likely to falsely trigger just from hearing the radio or TV. So instead I use Python + OpenCV to detect whether the assistant (an old rooted Samsung Note) is being looked at directly; then it wakes up to listen for the trigger word and the command. Of course, I can also manually trigger it via any device in the house.

[i] http://automagic4android.com/
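For anyone wanting to try the same trick, the gaze-gating half is only a few lines with OpenCV's stock frontal-face cascade; the wake_up_and_listen() below is just a placeholder for whatever hotword/command pipeline you hand off to:

  import cv2

  def wake_up_and_listen():
      # Placeholder: hand off to your hotword detector / assistant here.
      print("face detected - listening for the trigger word")

  # Stock Haar cascade bundled with opencv-python; roughly-frontal faces are
  # a decent proxy for "someone is looking directly at the device".
  cascade = cv2.CascadeClassifier(
      cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

  cam = cv2.VideoCapture(0)
  while True:
      ok, frame = cam.read()
      if not ok:
          break
      gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
      if len(cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)):
          wake_up_and_listen()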


https://snowboy.kitt.ai/ is about the only good third party hotword detection engine. You can hook it into Assistant with a bit of work, and with Tasker/Automate and a rooted phone, you can open apps, press buttons, pass voice commands, and more.

IMO, Assistant is limited because of privacy; see how it asks you to opt in for a more personalized experience, and it still won't unlock your phone for you.


I've had a similar experience; I worked in voice for a long time. I simply have no desire whatsoever to speak to my electronics. I don't find it helpful or useful in any way.


I'm not quite that extreme, but I'd use voice if it was at least as smart as a human. And it isn't - yet.

"Turn on the lights" isn't exciting. "Make dinner and do the laundry" would be exciting, but that's at least 25 years and some major advances in robotics away.

IoT is very crude and contrived compared to what would be possible with an active technology that could do useful physical things of all kinds.


For something like that I'd rather press a button or have it happen on a timer.

Where it could really shine is in rarer commands. "Make me some Pad Thai." Unless you're a huge fan, you won't want a button for this. And it's faster to say than to type into your phone.


I will consider talking to my computer when my computer is the only one that would hear me.

I also never talk to my phone.



Yeah, it's weird to have a conversation with someone else listening. But I will bring it in if I'm discussing something with another person, and want to look something up. It feels more normal to have a voice interface involved when I'm already talking.


Same here. What's the point of using things like Facebook Container in Firefox and blocking Google tracking as much as I can, and then happily sharing my private life with them in an even more direct way?


I can barely manage to talk to people. Why on earth would I want to talk to a box?


It might be a good way to learn to talk to people, or just talk... Actually, just use Discord for that, join random servers and the voice chats.


That sounds even worse.


No no, it actually works. I am working from home most of the time, and chatting on Discord with random strangers helps me keep my speaking skills, and just stay sane(ish).


Three major problems I see:

1. When these systems go offline, they’re nearly useless. One WiFi glitch and the system gives me the audio equivalent of a blank stare. This directly reduces the chance I’ll casually use voice and instead prefer my more-reliable phone.

2. Most systems haven’t figured out reasonable responses and are basically chatty and full of sounds. Make it Unix-like (silence on success = golden)!!! If I say “turn on the light” and you turn it on, I CAN SEE THE LIGHT ON so I don’t also need to hear a loud chime and some voice confirmation like “sure, no problem”! The fact that I can silently do things quickly in other ways (e.g. phone) is another strike against voice. Yet this is something they could easily fix.

3. Voice systems are not 100% perfect at comprehension yet they tend to babble out long responses. This puts me in the situation of trying to shut them up for long enough to listen for my intended query. Maybe they can improve this by erring on the side of fewer words in replies, with more pauses? Not sure.


I dearly wish I could have a speech recognition aspect to the 'creative' and CAD-like software I use. I think it would be fantastic to be able to do two or three operations 'at the same time', e.g. click an object with the mouse and hit Ctrl on the keyboard to lock to some plane or angle as I use the mouse to move, whilst verbally instructing the software to 'zoom viewport out' or achieve whatever other function that is vaguely complicated or buried in the GUI somewhere.

Otherwise, I have absolutely no desire to 'talk' to my computer and have it understand me, unless that tech comes packaged with an empathic AI module so I can tell it off and repay it a small percentage of the emotional pain computers have inflicted upon me over the decades.


Since this thread is dealing with user interaction in general: I do know there are also foot pedals people use as modifiers or input actions, along the lines of what you're saying.


Thanks - I've played around with tertiary inputs a couple of times over the years, from MIDI-based level boards to a '3D' input device for CAD. Speech recognition would, I think, be ideal for function interactions harder to describe than what a switch or a lever could achieve, e.g. "Fill highlighted image with red and make 50% transparent" or something.

The critical benefit would be being able to do this whilst 'controlling' the focus with a mouse and keyboard at the same time.


We've been pretty good at word recognition for a while - speech, not so much. Conflating the two has led to a lot of confusion.


Word recognition and sentence recognition.

It's honestly quite shocking how sparse the research and implementations are for everything beyond a single sentence/command that you shout at your personal assistant.


Frames used to be an idea in AI, but they seem to have been sidelined and possibly forgotten now.

Frames mean that words and sentences have a context, and you can't understand conversations unless you understand the context.

This starts from simple and obvious distinctions. E.g. - as a silly example - "make dinner" usually means "prepare and cook an evening meal". But if you have a project called "dinner", it might mean "build and compile 'dinner'". An AGI should be able to understand the difference, and ask for clarification if it doesn't.

Eventually you end up with subtextual and implied communication - e.g. "I'm fine" can mean two completely opposite things depending on tone of voice and the contents of minutes-to-years of previous conversations.

All of this is many orders of magnitude harder to handle than "Bedroom lights off."


Oh, I didn't mean AGI levels of understanding, but even "simple" technical things that are likely building blocks necessary to get to that point like sentence boundary detection.


That's fair, NLP has done decently with mechanical deconstruction of normal sentences for quite a while now. But as you note, mapping that onto a template for response is a long way from "understanding".


I think trying to replace regular GUIs and mechanical inputs in every use case is naive; it seems pretty obvious to me that clicking a button or pressing a keyboard shortcut is pretty much always going to be faster than uttering a voice command.

But there's a huge swathe of use cases where it does make sense, and I think we should be focusing on those - situations when you can't use your hands. Voice assistant technology doesn't even need to be great for this, just good enough that you can look up unit conversions with your hands covered in bread dough, or navigate while driving or whatever.


And on the flip-side of this story... "Hands-Free Coding: How I develop software using dictation and eye-tracking" (https://news.ycombinator.com/item?id=24846887)

The author notes that they only get about 50% of regular speed with this approach, and that may be a significant part of the challenge---speech can encode complex concepts into a few words (especially given context), but the actual baud rate isn't particularly impressive. Keyboard interface, where possible, seems to still win out.


Very unlikely I will ever talk to my computer, irrespective of how good the speech technology gets. If I have to talk to my computer, how will I work in crowded places? How does that work?


Even at home without any risk of disturbing coworkers, it would have to be extremely intuitive to offer any speed advantage.

I can open Microsoft Word faster than I can say "open microsoft word". It would have to be smart enough to short circuit the entire process of doing something useful with Word.


You're not thinking creatively enough. Part of the technology would be to disable any other way of opening Word.


If you had two mics, you could probably work out a filter that captures audio roughly 'in front of the laptop', which would likely work well enough. But I think the wins are going to be in places where you don't normally have a computer, where a mouse and keyboard aren't natural companions to the task at hand. Yes, some environments will be noisy enough that speaking is a bad modality, but not all of them.
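The simplest version of that 'in front of the laptop' filter is delay-and-sum beamforming, and for sources directly ahead (equidistant from both mics) the delay is zero, so it degenerates to averaging the channels. A toy numpy sketch; the anti-phase case below is the extreme off-axis illustration:

  import numpy as np

  def front_bias(left, right):
      # Broadside delay-and-sum with zero delay: frontal sources arrive in
      # phase and are reinforced; off-axis sources arrive offset in time and
      # partially cancel.
      return 0.5 * (left + right)

  t = np.arange(0, 1.0, 1 / 16000)
  tone = np.sin(2 * np.pi * 440 * t)
  print(np.abs(front_bias(tone, tone)).max())    # ~1.0: frontal source kept
  print(np.abs(front_bias(tone, -tone)).max())   # 0.0: fully out-of-phase source gone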


I do not see the advantage of talking to a computer. I can see the advantage of talking to my Google Mini in my bathroom while I am showering. I see the advantage of asking my car to change music while driving. But literally zero advantages to talking to my laptop. My hands can move a lot faster than my mouth. I can set shortcuts that help me do things a lot faster than the whole pain of talking to the computer.


No, the point is that you'd be annoying the people around you if you were talking to your computer the whole time.


That is a lot of reading into the question that was actually asked: "how does it work?"


Sorry, I was not clear. The problem here is that I am annoying other people and I am getting annoyed by other people around me. Imagine saying "Hey Computer, what is my credit card bill for this year" or "Hey Computer, open Word".


Shades of that asshole in line at Starbucks holding a conference call over a Bluetooth headset.


In a crowded place the noise has to be cancelled, or the speech recognition application must learn to recognise speech with noise in the input audio.


I think the real problem is social, no one wants to hear you talking at your computer in a coffee shop.


That would quickly change if people actually wanted to do it though. Social norms are just unwritten agreements - if enough people decide to dictate their novels and blog posts in coffeeshops then the default will quickly change so talking is acceptable.


I think you underestimate the resistance. The problem isn't just that people don't want to do it, it's also that people don't want other people to do it because it's annoying.

It might pass in a coffee shop, it probably won't in an airplane, it will almost certainly never pass in a library. People will try it regardless of the appropriateness of the location, because some people aren't aware or are assholes, and as a result the entire technology will get a bad rap. See google glass as an example.


People talk to each other all the time in coffee shops. If talking to your computer becomes as natural as talking to a friend, why won't it be acceptable in a shop?


People talk to each other all the time in airplanes too, and even libraries, but it turns out that people are both more understanding of people talking to other humans currently present (rather than say on a cell phone) and people are better at talking to other humans currently present respectfully than they are at talking to devices (or at least people via devices, but I think we can extrapolate).

I agree if it was already the social norm it wouldn't be a problem (same with Google glass), but it turns out that the technology being ready isn't always enough to make the social norms change.


I think part of the issue is that we're pretty good at focusing and filtering. If we're trying, we can focus on a single voice when the room is noisy. The computer doesn't know which to choose and all voices seem equally important.


I've often asked Google Assistant things in packed bars and it works great.


Wear some sort of sound containing bubble around your head, fits in nicely with the whole pandemic thing :)



I often use Siri when I'm driving to send text messages and it's OK. The speech recognition in Google Maps also works great.

But beyond that, I don't really use these "digital assistants" because you can't teach them when they're wrong, so after 2-3 failed attempts at a request, you lose interest because you can't trust the system.


I generally loathe voice input on computers & phones. I have very specific use cases for the two assistants I've created. "Open the pod bay doors Hal" = open the garage door (but only because it is funny).

"Start the dust extractor Hal" = start the workshop dust extractor, but I no longer use this since upgrading to remote start/power line detection on my dust extractor.

"Hey producer, <various commands to control video production software and PTZ cameras>" = start recording/cut cameras/re read the prompt/extreme close-up on keyboard/pan the camera to me/etc for video production in a one-man multi-camera recording setup.

Everything else, when it comes to NLP, chat bots and voice recognition is "too much trouble" and I find hitting a dedicated button or punching in to a few numbers on a physical keypad to be easier and more reliable than any voice interface.


In my opinion, the speech-driven computing in Star Trek is the best depiction of the technology.

The interaction is very fluid, low-latency, and accurate, and the system doesn't force itself on you: there are still plenty of non-speech-based user interfaces to be found all over a starship.


To avoid wrist pain, I have started dictating quite a lot as a way to enter large amounts of text into my laptop and phone. It's really very good, especially on android. Aside from the relatively large downside of it being noisy, I like it almost as much as I did typing.


Do you find yourself going back and revising what you've dictated? When I type, I'm frequently pausing, going to other parts of the document, and deleting things I've already written. All of these actions I find more annoying when dictating. Not to mention some of the baffling Random capitalization Choices and, over/under insertion of punctuation.

In general, I find dictation useful on my phone to take notes to myself. And maybe to send a message in a chat situation. But it just doesn't work for me for anything slightly more formal / long.


I do need to go back and fix problems, such as poor handling of punctuation on android and poor handling of capitalization on Mac. Fixing a few things is much less work on my wrists than typing the whole thing out, however, so I'm still much happier with speech to text.

At this point I'm dictating even my (work) design docs and (personal) blog posts. I do think I'm a bit less fluent when using speech to text, because it's harder for me to jump around and make slight word choice improvements, but not by much?


Just say "dictated but not read" at the end and leave all the mistakes in there. Problem solved.


I expect you're joking, but in case you're not I would consider that very rude in almost any situation. I don't always catch all of my mistakes, but I definitely read over and check for them.


I think a less terse disclaimer like "Dictated, apologies for mistakes" could help.


Dictating is a much easier task, since it doesn't necessarily require comprehension.


Dictating in the sense of entering text, as one might dictate to a secretary. Definitely involving comprehension.

(For example, I dictated this reply)


We really need to stop expecting so much from speech in technology. It simply isn't a great input method. It's loud, it lacks privacy, and for short commands it takes way too long.

I think a lot of people are counting on speech to bring us into a sort of Star Trek future.

The real game changer for input is along the lines of what the neural lace is supposed to be. Cognitive input. Silent, fast, efficient. In many cases, once the technology is mature, people won't even have to internally verbalize commands. Just look at a light and desire it to be dimmer... it dims. "Typing" at the speed of internalized thought will also be amazing.

Every time I hear someone (including myself) tripping over "OK Google" I cringe.


> It simply isn't a great input method

Loud is relative, privacy is an aimless indictment that's orthogonal, and as for brevity?

Speech is a fantastic tool for communication, which includes input and output. It's part of why most large animals, for which quick communication is imperative, use it. It's imprecise, which is the problem that machines are not good at dealing with. It was a good direction back when machines were initially trained with our speech for better accuracy, but now passive listening on devices isn't even used for that!

> The real game changer for input is along the lines of what the neural lace is supposed to be. Cognitive input. Silent, fast, efficient.

Silent, sure. The human mind is rather random and highly variable between individuals and ages; I would not call it fast or efficient. Then again, speech-to-text is contextual cognitive input. Without drugs or intentional (minor) damage to the brain, I don't expect neural implants to be very effective, even in the next 100 years.


And then you have a thought pop into your head to send your manager an email calling them a fuckhead, so your phone makes it happen.


Meanwhile, Google trains their systems on 3 million hours of YouTube audio, gaining 30% better accuracy:

https://arxiv.org/abs/2010.12096

It improves much faster than you might think.


I always thought it was a bit dumb for Google to name their voice assistant 'Google'. First, you cannot ever upgrade the name; it's THE name. Second, every time the assistant screws up, Google gets the blame tied directly to their name.


It's been a while, but I was able to get Google to recognize a different phrase by repeating it several times during the initial training.


Recently I switched on the auto-generated subtitles on a German video, and they were not just bad. It was basically impossible to even guess what the original sentence was meant to be. Totally useless technology in that case.


There are good use cases apart from Dictation for Speech tech:

1. Voice-based operations in factories; construction workers who want to keep both hands free but still need to navigate via a device

2. Use cases while Driving. E.g. A Driver who is delivering goods.

3. Call centre - Analytics of audio calls etc. Can have many use cases

4. Voice assistants like Alexa and Siri. Mind that Alexa and Siri have a vision to do more than just music.

5. Any use case where visual interface is either not there or visual is not an easy option for user.

Speech tech is challenging when you have to deal with noise or want to do speech-to-text on low-profile devices (on the edge).


> 1. Voice based operations in factories, construction workers who want to have both the hands free but want to navigate via a device

They can't understand speech properly in a quiet environment, and you want them to get your commands on a factory floor or construction site? :)


Navigating a menu (with a limited set of choices) could actually be useful, and - given the limited number of "valid" commands - it should be possible to overcome/filter background noise.
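With a closed command set, the system doesn't even need open-ended understanding; fuzzy-matching the (noisy) transcript against the known phrases goes a long way. A sketch using only the Python standard library, with made-up example commands:

  import difflib

  VALID_COMMANDS = ["start conveyor", "stop conveyor", "next item", "repeat order"]

  def match_command(noisy_transcript, cutoff=0.6):
      # Returns the closest valid command, or None if nothing is similar enough.
      hits = difflib.get_close_matches(noisy_transcript.lower(),
                                       VALID_COMMANDS, n=1, cutoff=cutoff)
      return hits[0] if hits else None

  print(match_command("stob conveyer"))            # -> "stop conveyor"
  print(match_command("open the pod bay doors"))   # -> None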


Google/YouTube "voice picking in warehouses"... there are products built for this.


A warehouse a factory is not. Try that in the same hall with 50 lathes.


They wear powerful noise-cancelling headsets to cancel the noise.


IMHO, the issue with speech technology is that it is being developed by companies that want to abuse your home (device, ...).

There is just no way I will let a Google device that listens to my conversations into my home (let me be quick - no, my phone doesn't listen, and yes, I am sure; its maker wouldn't recognize it any more).

Or Amazon. Or Facebook. Sure, speech technology could be useful, but not from those companies.

On the other hand, no one else has found a usable way to monetize the technology (i.e. other than spying on people).

I think that this is THE problem of speech recognition.


I'd like to go back in time, fire up Dragon Dictation, and see how far we've come.

The Pentium processor's power over the 486 was supposed to be the missing link to 'working' speech-to-text.


Please let me know of the results when you do. It unfortunately doesn't have a free trial.


Petty gouda wince thestraining fun.

AFAIK it was at the level of super-funky chording keyboards; with a lot of investment it could be a cool and impressive input method, but the personal investment side was too much for most users. There was an "assistive technology" user base that I don't think they ever realized they had.


It's the classic IT scam that is the entire industry.

When people demo Speech Technology they have the computer doing general AI in response.

Things that are impossible with text or any other computer input - things that won't be possible for decades.

The reality is, no one wants speech technology; we want a computer that can order a pizza from the simple typed words "get me a pizza".

Heck, if you pulled that off, people would even put up with the far more annoying 'talking' to the computer to access that tech.

What people really want is text to speech. But that's too hard to do the AI scam on.


Because none of them are an improvement over a shell.

Maybe if there were a better way to pronounce "open paren" and "make-vector" we could just yell scheme at the nearest computer.

IMO you're never going to get anywhere trying to make shells speak natural language. Even if you crammed a person into the box there's just way too much you could ask to narrow it down without something formal. At the end of the day you'll have to have something that looks like sh or any other formal language.


I like dictating memos or text messages, especially when I'm walking around town and need to pay attention to traffic. It looks like I'm just on my phone as normal, so nobody sees that I'm actually just talking to myself. Sometimes the mistakes the AI makes are hilarious, but usually they're just annoying, as I need to manually edit them. But in the end, doing that takes less time than thumbing it in.


I may date myself here, but the first thing that popped into my head when I read the title of this post was this scene from ST IV:

https://youtu.be/hShY6xZWVGE

It's kinda funny watching someone without a background in CS repeatedly ask google/alexa/siri a question in different ways that you know isn't going to elicit any kind of useful response.


This is a great example of "it's not my preferred mode, so it must not be anyone else's."

Dictation is widely used in medical transcription.

Dictation is a killer way to write a first draft quickly, transcribe rough written notes after a meeting, etc. Also, about half of my emails are dictated, and I know I'm not the only one. It takes some time to get used to, but once you're there (like touch typing!), you can't go back.

..etc..


Progress does seem very slow since I could first talk to an Android phone - which was at least 10 years ago. I'm using iPhones now but even simple things such as "show me pictures of a Boeing 737" or "call the Home Depot on 1st Avenue in Amarillo" only have about a 50% success rate - enough that unless my hands are actively busy, I'd rather just type it.


I don't know what speech tech will be good for, probably a lot of things, but typing I think will remain the preferred mode of communication indefinitely. It is already the more sophisticated approach. Using speech is kind of like taking a new kind of paint that only works on paper and trying to make it work on cave walls too. Nevertheless, we might discover something in the process.


You do talk to your computer. And the accuracy and speed of it has improved dramatically.

Your computer has yet to process meaning.

That is where we are stuck right now.


Something the VR/AR/SR crowds don't always seem to understand: we have been manipulating tools with opposable thumbs for longer than we have been using language. A keyboard and mouse are not a shoddy stopgap for the utopian future in which we control things directly with our minds. They simply are how we control things with our minds.


I use Siri on my phone all the time, but for one thing and one thing only: reminders.

"Hey Siri, remind me in 2 hours to do X"

"Hey Siri, remind me when I get home to do Y"

"Hey Siri, remind me next time I go to Costco to buy Z"

It's pretty quick to set up reminders by voice, especially location-based reminders. Almost anything else I'd rather do via a UI.


My first experience with conversational tech was using IBM's ViaVoice 98. I still speak to these modern assistants at the same slow, spaced-out pace.

I was surprised by Google Meet's transcription system, which to me is the most accurate I've used so far. Same with Google Docs dictation.


Pretty much the only thing I've ever used the Google assistant for was to ask it for definitions of terms from Urban Dictionary, then giggle like a schoolchild while Google read them out to me.

That was fun for about 10 minutes. It's been disabled ever since.


I still don't get why Google doesn't integrate the Google Now function into Google Chrome.

Like imagine saying "Hey, open a reddit/meme subreddit on the side" while you are watching YouTube. Or imagine saying "hey, Wikipedia that".


I always type at my computer, and 90% of the time use voice dictation on my phone because I don't text enough to be good at it.

Google's voice capture often astonishes me with its accuracy, and nearly as often makes me laugh at how bad it can get things.


I talk to my phone in the car all the time.

Hands free.

"send message to john"

"how far is it to"

"get directions to"

"play podcast"

"play audiobook"


And then your phone transfers the recorded sounds to someone else's computer to do the work. You still don't talk to your computer phone. You talk through it.


It takes all of 2 minutes of trying to use Alexa each day to make me stop. "Alexa, play the damn song I have you play every day; I will cite title and author to you so you don't screw it up." "A random song? OK!"


More than anything, the problem with these assistants is it’s still socially awkward to talk to a computer.

I can imagine people using it exponentially more when you no longer get weird looks when you say “Ok google” or whatever in a supermarket.


Interestingly, I've noticed that my friends who are less technically savvy have actually started to use voice control on a regular basis, probably because they are more easily frustrated with the existing interaction modalities.


Voice-activated code snippets or command-line switches would be nice. "Hey ffmpeg, crop the first 15 seconds of input.mp4 and increase the volume by 15%."

Modular, so I can switch out the speech recognition, the NLP, and the fulfillment.
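A toy sketch of just the fulfillment end of that pipeline, assuming some upstream speech + NLP stage has already produced a structured intent (the intent format is invented, and "crop the first 15 seconds" is read here as "trim them off"; only the ffmpeg flags are real):

  import subprocess

  def fulfill(intent):
      # Invented intent shape: trim_start_seconds drops the first N seconds,
      # volume_gain of 1.15 means +15%.
      cmd = ["ffmpeg",
             "-ss", str(intent["trim_start_seconds"]),
             "-i", intent["input"],
             "-filter:a", "volume={}".format(intent["volume_gain"]),
             intent["output"]]
      subprocess.run(cmd, check=True)

  fulfill({"input": "input.mp4", "output": "out.mp4",
           "trim_start_seconds": 15, "volume_gain": 1.15})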



Understanding speech (language context) is a much harder problem, I think.


I hate talking to people; I absolutely do not want to talk to machinery, unless I'm cussing it and beating on it with something heavy.


In fairness, I don't talk to people either.


I talk to my phone instead of using my thumbs quite frequently....but never to do anything serious.


Good. Please don't. And if you do, please don't do it in an open-plan office next to me.


The privacy point in the article is legit, and a concern with all kinds of AI-mediated alternative input.

It may mean that if you aren't comfortable sharing rich context about your life with a cloud platform, you'll get left behind by technology.


I'd rather type all day than talk. But Comcast voice remote is awesome.


The problem is we (as in HN) can type faster than we can think of the words to describe what we want.

My wife always talks to Google. It seems backwards to me. The next generation should skip speaking and wirelessly read thoughts. Language/speaking is a bottleneck.


> The problem is we (as in hn) can type faster than we can think of the words to describe.

Do you suppose it's possible that speed isn't really what counts? In the era of typewriters, authors like C. S. Lewis opined that typing obscured their thinking because it was too fast. It didn't let them savor the words effectively. Maybe what we really need is to slow down?


> Maybe what we really need is to slow down?

In some cases, we should slow down. But in general, I disagree, especially because of the way in which we use technology now. Our smartphones are becoming a bit of an external brain to us. They hold contacts, conversations, searches, notes, musings, etc. We use them to recall a fact or answer a question without really even thinking about it. I think it is only a problem now because smartphones and the like aren't really designed to improve our lives, but as ad delivery platforms to manipulate us into buying products and services we probably don't need. I would love better tech that helped offload mental tasks my human brain isn't great at, but computer brains are, and let me focus on things my human brain is good at as well as things I just would rather think about.


Good point. I never use speech commands for the same reason I don't use handwriting recognition. I just hate handwriting; typing is my default mode of operating a machine. I can't play with a joystick either, come to think of it.


Comparing creative work to turning on a lamp or checking the weather doesn't seem quite right to me



