25 Years In Speech Technology and I still don’t talk to my computer (matthewkaras.medium.com)
251 points by samizdis on Oct 26, 2020 | 293 comments



A big problem with the assistants is that as soon as they fail at a query they seem stupid, I feel stupid, and I stop using them for a long time.

Last weekend I had the following failed queries:

"OK Google, what is the air quality like at Mt. Shasta today?"

"OK Google, add a waypoint for the last gas station before the mountain pass"

"OK Google, what percentage of people can you detect to be wearing masks on recent Instagram photos tagged at a location within a 10 mile radius of Mt. Shasta?"

These are all things I would expect a computer assistant to do really well. They have access to so much data, and so many APIs, that they should be able to break down these sentences into a SQL-like query and give me results. The third, for example:

"recent Instagram photos" -> Instagram has an API

"tagged within a 10 mile radius" -> parse the cities within a 10 mile radius and look for tags in all of them

"people" -> use your wonderful person detection networks your friends at Waymo developed

"wearing masks" -> I'm sure your internal datasets have this label, so run an object detector

Then compile and reduce the data to give me the number I want.
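
To make that concrete, here is a rough Python sketch of the pipeline being described; every function name is hypothetical (standing in for the Instagram API, a person detector, and a mask classifier), since none of this exists as a public API:

    # Hypothetical sketch only: these functions stand in for "Instagram search",
    # "person detection" and "mask detection"; none are real, public APIs.
    from datetime import datetime, timedelta

    def mask_wearing_rate(center, radius_miles=10, days=7):
        since = datetime.utcnow() - timedelta(days=days)   # "recent" = last week
        photos = search_instagram_by_location(center, radius_miles, since=since)
        people = masked = 0
        for photo in photos:
            for person in detect_people(photo):            # person detector
                people += 1
                if detect_mask(person):                    # mask classifier
                    masked += 1
        return masked / people if people else None

    # e.g. mask_wearing_rate(geocode("Mt. Shasta"))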

That's what I want an assistant to do. But it couldn't even do the first, which just involves a single API query to fetch air quality index information. Bleh. And as for the second, it has no idea what "last gas station" or "mountain pass" means; it's a query a human would know to be extremely commonplace.

It turns out that the current generation of "assistants" are mostly just template-matchers which really doesn't help me much at all. I can set my own alarms, thank you.


As soon as you see them for what they are, it makes perfect sense. They're basically a proxy to google search with extra dumb features on top. Every single feature that isn't google search has been hard coded (map directions, set up alarms, &c.); as soon as you get out of the hard coded cases it's blatantly obvious these things are anything but "smart".

imho they're just used to give google/amazon a few more data points for their graphs


Yeah, and that wouldn't be a problem if the discovery and browsing stories weren't dismal, but they definitely are. On both counts.


Discovery and browsing are not good on the assistant interface, but I'd argue that's a constraint of voice vs. visual. On a desktop/mobile, the screen holds the state so you can go back to previous entries, etc. Over voice, your mind holds the state, which scales much, much worse.


To the credit of the Shortcuts team at Apple, being able to visually select and define certain phrases for commonly completed tasks is helpful in this regard, but I’d guess only a couple % of the user base is even aware that Siri Shortcuts are possible.


Is this just normal phone Siri or HomePod Siri?


Normal phone Siri. Not sure if it works for the HomePod


And this would be a much better situation if users could hardcode their own features and definitions.


Which is exactly why I prefer formal query languages over NLP queries. In both cases (at least with most state of the art NLP techniques), you have to learn certain patterns and ways to phrase a query so that the system will reliably understand it. With formal query languages, these patterns are well-defined, can be looked up and will most likely not change significantly (so there is value in memorizing them). With NLP systems, the patterns are completely opaque, you have to learn them through trial and error, they may change anytime (e.g. because the model is retrained) and they are usually significantly less powerful.

I sometimes feel that the trend to prefer NLP over formal query languages is comparable to the trend to prefer GUIs over consoles in the '80s and '90s.


Agreed; back in the day when we/I used to play text adventures, or interact with MUDs/MOOs those systems had English-like interaction languages but the semantics of them were relatively clear -- you mostly had to follow the verb/prep/object formula and once you figured that out, you could manage the system fairly well, without running into a lot of terrible corner cases.
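
(For the curious, a toy Python sketch of that verb/prep/object formula; it's an illustration, not any real MUD parser.)

    # Toy illustration of the verb / preposition / object formula, not a real parser.
    PREPOSITIONS = {"on", "in", "with", "to", "at"}

    def parse(command):
        words = command.lower().split()
        verb, rest = words[0], words[1:]
        for i, w in enumerate(rest):
            if w in PREPOSITIONS:
                return verb, " ".join(rest[:i]), w, " ".join(rest[i + 1:])
        return verb, " ".join(rest), None, None

    # parse("put lamp on table") -> ("put", "lamp", "on", "table")
    # parse("look")              -> ("look", "", None, None)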

I'd rather have an assistant type system with a fairly well defined query system that exposed its capabilities and limitations directly, rather than me having to guess at the corner cases and failure points.

Disclaimer: I work @ Google on display assistant devices, but I don't work on the actual assistant interaction pieces.


> Which is exactly why I prefer formal query languages over NLP queries.

People like to reference Star Trek for stuff like NLP queries, but if you go back to TNG and pay close attention to the verbal queries to the computer, much of the time it isn't natural language. They seem to actually use some sort of formal query language that fits English a bit closer, but is still distinct from when the characters speak to each other.


> They seem to actually use some sort of formal query language that fits English a bit closer, but is still distinct from when the characters speak to each other.

"Computer, begin auto-destruct sequence, authorization Picard 4-7 Alpha Tango."

- Wake word. Command. Authorization stanza. (I bet the computer would prompt for authorization if missing.)

"Computer, Commander Beverly Crusher. Confirm auto-destruct sequence, authorization Crusher 2-2 Beta Charlie."

- Wake word (possibly superfluous). Identification stanza (probably superfluous for the usual crew, but I can see from an HCI perspective that you might want to make people provide it specifically for such a consequential protocol, and it may also be of merit if some random admiral usually halfway across the galaxy pops in to confirm). Command confirmation, authorization stanza.

"This is Captain Jean-Luc Picard. Destruct sequence Alpha-One. Fifteen minutes, silent countdown. Enable."

- Computer is very awake at this point, no wake word. Identification stanza. Sequence parameter. Time parameter. Verbosity parameter. Commit.


To be fair, that's just how Picard speaks (e.g. "engage"). I haven't noticed anyone else saying "Tea. Earl Grey. Hot".

In any case I think this kind of speech is formulaic for the benefit of the audience, most of all, who are made aware through the formality that the speaker is addressing a machine. Additionally, we're watching navy men and women in space, so we expect them to speak to each other and to their computers in a formulaic manner ("Deck 5! Report!" etc, I can't think of good examples, brain's too tired).

Or perhaps the idea is that Trek AI is not really as advanced as to be able to understand natural language and that makes Data such a unique specimen.

Then again, there's the example of the Doctor in Voyager. I'm confused, I admit.


The Doctor in Voyager is, IIRC, an early prototype, and Voyager is both later than The Next Generation and set on a newer ship. And the Doctor, IIRC, benefited from upgrades during the show, having been more limited in scope initially.

In any case, Trek isn't super internally consistent, anyhow.


Maybe I'm misremembering, but I distinctly remember that basically everyone ever shown interacting with a replicator follows the generic->specific parameter hierarchy.


You and dragonwriter are probably right. I'm going by memory, here :)


It may be easier to train the humans to be precise. This is the solution adopted by armies which need unambiguous communication over noisy radio.

https://en.m.wikipedia.org/wiki/Operations_order


Well, to extend the GUI/console metaphor, it means that at some point soon, we'll all be using NLP because it's dramatically more user-friendly for the vast majority of people.


GUIs won over console workflows because GUIs have better discoverability and the "recall vs recognize" difference; it's mentally much easier to recognize the option you want when presented it than to recall the existence or the naming of that option.

In those aspects of UX, voice interfaces have the same drawbacks as console apps when compared to a good GUI.

Also, they have to work within the "bandwidth bottleneck" of audio - just imagine a phone system that tells you all the options you have, "Press 1 for something, Press 2 for another thing..." - they are so annoying because they are slow and inherently linear; a GUI can show the same options all at once, and you can read it much faster than listen to them.

So NLP as such is not dramatically more user-friendly unless it is at the "do what I mean" level, which likely requires full human-level general artificial intelligence; before that it's just a voice equivalent of a console app, sharing all the problems of discoverability and needing to remember what the system can do and how it should be invoked.


> Also, they have to work within the "bandwidth bottleneck" of audio - just imagine a phone system that tells you all the options you have, "Press 1 for something, Press 2 for another thing..." - they are so annoying because they are slow and inherently linear

They're even slower now because the brain trust decided adding voice control to the phone menu system was a great idea. So before, it said "For prescription refills, press 1." Now I have to wait for "For prescription refills, press 1, or say prescription refills." How on earth does that improve anything? I can just as easily press 1 as I can say a word, and when I press 1, there is a near 100% chance that the computer on the other end will understand my command.

Some phone menu voice automation is even worse. "Tell me what you want! <silence>" Then you say something, and it says "I didn't recognize that. Please tell me what you want!" Then it fails again and says "I didn't recognize that. For prescription refills, press 1, or say prescription refills..." Oh great, so there was a menu? Why did you waste my time earlier?

Voice is just a terrible, low-fidelity, low-bandwidth way of commanding a computer. You might as well have handwriting input while you're at it: You write what you want on a piece of paper, and hold it up to the camera and the computer tries to figure out what you wrote. Just as silly.


> I can just as easily press 1 as I can say a word, and when I press 1, there is a near 100% chance that the computer on the other end will understand my command.

So, I would rarely say something when I could push a button, but when using a smartphone on a call, it's not always easy or obvious how to push a button. Some people may have mobility issues making it hard to push a button, or be on speaker phone far away from the buttons. Or, maybe they haven't updated their telephone equipment in 50 years, and only have a rotary dial. Or, maybe on a terrible VoIP system that can't manage to get the tones through.

There's probably some way to clean up the script.

"(Please listen carefully, as our options have changed.) Please choose from the following options: Say prescription refills or press 1; say insurance denied or press 2; say referral to veterinary care or press 3"

I could get behind voice interfaces for more things if the commands words were documented, clear, and consistent, and the damn things worked. Until then, buttons seem good to me.


Totally. I cringe at FedEx's system.

"Welcome to FedEx. [... blah blah blah ...] Tell me what I can help you with today."

"a package"

I mean, what the hell else can you help me with today anyway?


Honestly, "Tell me what you want!" is better than a system that forces you to listen to all of the options since "representative" is what I want 99% of the time when I have exhausted all other options and decided to do battle with an automated phone system.


> You write what you want on a piece of paper, and hold it up to the camera and the computer tries to figure out what you wrote. Just as silly.

Ironically I used to love me some graffiti on PalmOS and Google Handwriting Input on my Droid 2, but I agree with the spirit of your comment.


I see you've run into CVS or Walgreens' automated phone system.


This "recall vs recognize" point should be raised in every console vs GUI debate. It's pretty much the final word.


You can get the recognise experience on the command line with the smarter autocomplete available on shells like fish or zsh.

    foo --<tab>
You're now presented with a list of options and, depending on your config, the man page one-liner descriptions.


I understand your point, but I'm not sure that GUIs won out because they were dramatically more user-friendly. It certainly helped, but I think they won because they made multitasking possible. Multitasking from the user's perspective, that is: the ability to interact with more than one application at the same time. That was just not possible on a console, so even people who didn't need user friendliness were able to do things they couldn't do before. I was young at the time, but that's how I remember it at least.


That's just not true, though. As with many console things, multitasking is totally possible but its discoverability is terrible. Ctrl+z and `jobs` is the entry level, with tmux being the end-state reached via GNU screen. This lack of discoverability is the same problem voice assistants have, only more so; no `apropos` and no tab completion.

GUIs are for discovery, CLIs are for power via composability, voice/NLP assistants are for convenience.


> That was just not possible on a console,

Was it? Even if you discard stuff like tmux as already a GUI, you can still send whatever is running at the moment to the background with CTRL-Z and typing "bg" on any modern Unix system. "jobs" will then list all your processes, and "fg <ID>" will bring it to the foreground. I am sure this functionality predates most modern GUIs.


Aside from the usability POV, GUI provided significantly more features to the user, such as visualization of information and data. Images, Audio, Video, Multi-Media, 3D/2D Video games. You could have more information on the screen and at your fingertips at the same time. You can load many of these things from the CLI, but it's not as convenient as within a GUI.


> That was just not possible on a console

You may be thinking of DOS, which yes had almost no multitasking ability available.

However there were multiple timesharing operating systems that existed before the PC and GUIs, Unix being the most famous and still around.

Multitasking is quite possible on a Linux console for example. It has 5 or more consoles, each handling different users, each being able to be split via screen/tmux. Each shell can run jobs in the background as well.


From my observations, people have reduced their voice assistants to objects that sometimes tell them the weather or switch their lights on and off, and sometimes do something completely unrelated when activated.


On your searches:

1) I get the correct response; Assistant first asks me if I want to use "air check", I say yes, and get the correct response.

2) I get appropriate responses when I ask "What is the closest gas station to Mt. Shasta." Because the approaches are mostly from Rt. 5, you'll have no problem getting a usable response. Another approach from Rt 89 exists, so I don't know how you expect Google to know which "mountain pass" you mean. However, should you ask Google "What is the closest gas station to Mt Shasta on Rt 89", you will get an appropriate response. Should you be approaching from the opposite side of the mountain, unlikely though that is, Rt 5 or Rt 89 would still be the closest

3) I don't know why you expect a computer to answer this question well at all. I'm not aware of an API that would track this sort of thing. Google doing this itself, at scale, for any popular destination would be a very large project in ML with a very high error rate: Instagram photos are unlikely to be representative of the whole, the data set for a given location may be sparse, and especially for an outside location it is entirely likely (as I did at a local pumpkin farm) that people at an outside tourist destination will move away from a crowd, remove their mask, and take a nice photo safely, while people in close proximity still adhere to appropriate social distancing. Effective use would also need to be real-time: from day to day a given location could attract people that wear masks, and the next day a large group does not, especially in the > 300 square mile area you specified. It is not reasonable to expect this question to be meaningfully answered by either a human or a computer.


I don’t believe he’s looking for a voice assistant. He’s in the market for Lieutenant Data.


Rather, I think he's looking for the shipboard computer in Star Trek. Which is actually an excellent example of the untapped potential that voice-based computing holds.


The ship computer in Star Trek was always hilariously underpowered. They had to put an android on the bridge to operate manual controls during battle, for god’s sake.


It's also an excellent example of designing a user interface to look and sound cool on screen, rather than for usability, which is how I feel about most voice controls.


The operative word is assistant. Voice assistants should be able to do everything a human assistant can do.


the entire value of a (voice) assistant is that it can quickly and reliably deliver results. You don't need Data, but the system needs to be robust enough to work quickly. If you have to ask five times or spend minutes thinking about how to phrase the question, it's not assisting you in anything but wasting your time.


[Commander] Data


Presumably, at some point Data, who was a Lieutenant Commander at the outset of TNG, was a Lieutenant. (This may even be the case in one of the novels in which he features set prior to TNG.)


> 1) I get the correct response; Assistant first asks me if I want to use "air check", I say yes, and get the correct response.

Yeah, I don't. I just get "Sorry. I could not understand that." if I remember correctly.

> 3) I don't know why you expect a computer to answer this question well at all. [...] Instagram photos are unlikely to be representative of a whole

Sure, but I defined the query pretty clearly, and I just want an answer to the "general" question of "Are people generally wearing masks in that area or not?", and I already babysat that question into a data query that could use the Instagram API plus some object detectors.

I understand it's hard for one person to write code that could formulate and piece together that graph, but I feel like it should be a tractable engineering problem for Google.


This mindset is exactly why these assistants are bad. Start with the UX then work backwards from there and make it happen.


No. Asking a question in an unambiguous way is not a requirement that can be done away with. It can't be done in many normal text-based searches, it can't be done in face-to-face conversations with real people, so there should be no expectation that a voice search would yield any better results. Having actual data that exists to answer the question, as with the mask example, is also an essential requirement.

If you expect more, then your problem does not lie with voice assistants; it lies with search technology itself. "Expecting" these questions to be answerable is unrealistic given current capabilities. Working backwards from the UX would produce nothing better because your expectations are thwarted not by poor design, but by the limits of state-of-the-art technology.


The three questions are completely answerable by a human. He might ask for a bit more information to answer correctly, which could also be expected of the device.


Rt. 5? Do you mean I-5? I'm curious if using "Rt." instead of "I-" is a regional thing. Where are you located? Typically Rt. would only be used for a small state road, not a major interstate highway.


It's a regional thing. In New England, we call Interstate 95 "Route 95", Interstate 93 is "Route 93", etc. US Route 20 that stretches from Boston to Oregon is "Route 20". And both the US Route 3 that goes north from Boston through New Hampshire, and MA Route 3 that goes south from Boston down to Cape Cod are called "Route 3".


> In New England, we call Interstate 95 "Route 95",

Actually, in the Boston area, it's "Route 128." Despite it never being indicated as such at exits or on maps.


The correct title for Californians is "The 5".


> The correct title for Californians is "The 5".

You have omitted the rather important qualifier “Southern” from your description, as that usage is a key differentiator between Northern and Southern California.


I live in the bay area and everyone I know calls it "I-5". "Everyone I know" of course could be biased though as most of my circle did not grow up in California.


I lived in Southern California (Orange County) in the 70s and 80s. We called I-5 the 5. Same for the 405, the 101, the 15, the 10, the 99. Also, we called State Route 1 (aka Highway 1) PCH no matter where it was located in the state, even though SR1 only has the Pacific Coast Highway designation between Oxnard and Dana Point.

I moved to Northern California (Santa Clara County) in the late 80s and I still use the same monikers. If I take a trip to Big Sur, I tell folks I am going to take the 17 to PCH and drive down. Old habits die hard.

https://www.kcet.org/shows/lost-la/the-5-the-101-the-405-why...


“In Southern California, the definite article “the” gets placed before just about every freeway or highway, whereas in the Bay Area just the numbers are said.”

https://www.kcbx.org/post/how-you-refer-us-101-says-lot-abou...


It's used interchangeably for the number of any road, at least in my experience. It certainly gave appropriate responses when I tested it out.


Where do you live? It sounds like this is a regional thing in the NE? I live in the PNW and in my 50 years I’ve never heard anyone use route in place of I-.


“Alexa, play punk rock”

“Playing <super specific punk rock song>”

“Alexa stop. Alexa play punk rock playlist”

“Cannot find punk rock playlist”

“Alexa play punk rock 00’s playlist”

“Can’t find”

“Alexa play early 2000’s punk rock”

“Playing punk rock 00’s playlist”

It’s like she’s trying to mock me.


Burglar: "Put your hands up and show me where the money is, I won't hurt you..."

Me: "Alright man, I'll tell you where is money ... ALEXA CALL THE POLICE!"

Alexa: "Shuffling songs by The Police"

* EVERY BREATH YOU TAKE plays as I get punched 24 times *

from https://twitter.com/ppathole/status/1092034892249079813


Jokes aside, the police probably want to know whether your complaint is that there is a burglar holding you at gunpoint, or that you're trying to order a pizza. They probably get both types of calls at about the same frequency.


Even if it works: do you expect the police to stop the burglar before they shoot you?


I don't expect a burglar to shoot me. Raising the charge from breaking and entering to armed robbery is dumb enough, raising it to murder is downright stupid.

Drug addiction makes people do crazy things, but there's a reason why most burglars flee as soon as they are noticed


How do you say "00's"? Does Alexa say it the same way?


Yep, oh-ohs. We said it the same.

My fav is when she decides to interpret your words instead of playing the exact playlist you’re telling her to play.


“I'm sorry, I couldn't find any results for ‘OK Google, begin navigation to Denver, Colorado.’”

“OK Google, what's the speed limit?” → “The speed limit is defined as the maximum rate at which a vehicle can legally travel on a given stretch of road.”


"The speed limit is 299792458 metres per second, also known as the Speed of Light."


> I can set my own alarms, thank you.

And not only that, but a specialized user interface is often preferable to voice even when the assistant passes the Turing test. There's a reason that people use apps to get food delivery rather than calling.


It's funny, but for years I've felt like "Hey Siri, set an alarm for 7am" was 1000x easier than using the clunky Clock UI and it's almost exclusively what I use Siri for. Tasks that are so perfectly well-defined are exactly what this "smart" tech is useful for. Except that recently, Siri screwed me. I said "hey Siri, set an alarm for 6:30am" and her response was "Ok, your 6:30am alarm is on," but what she actually meant was "Ok, I ensured that your standing alarm for every tuesday at 6:30am is on" which meant she wasn't enabling the alarm for 6:30am tomorrow, so I overslept. Very annoying and completely blew my trust in a very useful function.


Alarms and reminders are my most frequent uses of Siri (and to entertain/frustrate the kids...). It's also pretty good for hands-free quick texts while driving and hands-free calling, though I don't do those often.

Your alarm example is frustrating. Of course, you could say "set alarm for tomorrow at 6:30 am" which would work, but then you're back in the realm not of natural language commands, but formal commands that exist in the uncanny valley, just similar enough to natural language to be irritating when they fail.


Yeah, I would think it would be obvious to most humans that "set an alarm for 6:30am" never means "6:30am four days from now." It seems like it should just be interpreted as the next occurring 6:30am unless otherwise specified, but hey, I'm not a computer.

Edit: also, your "tomorrow at 6:30am" example is also open to interpretation if you're saying it two minutes after midnight. I'd really like it to recognize these sort of ambiguities and prompt me for clarification. "It's after midnight. By 'tomorrow at 6:30am', do you mean you want your alarm to go off about 6 hours from now or the next day?"


Siri does resolve that ambiguity; it will ask you “do you mean 6:30 am on day X, or 6:30 am on day Y?”, using weekday names for good measure. That’s actually one thing they got pretty right.


... and this is exactly the sort of thing where I feel much more comfortable looking at a GUI that clearly says the date, day, and time all in one place. It would take me 2 seconds to verify that it's exactly as I want it instead of parsing a voice assistant's response.


Ouch. I just did a quickie test asking Siri to set me a reminder for 'tomorrow at eighteen'. No dice. It set it for 8 am.

It doesn't even support 24 hour time - at least in English.


"...set an alarm for tomorrow @ eighteen hundred." works just fine. At least in U. S. English, I don't think I've ever heard anyone refer to 6 p. m. as "eighteen".


But that's the problem: most people don't speak U.S. English as their primary language.

They use all kinds of languages with all kinds of accents and tics.

In my experience, and from the time I did a bit of NLP, the situation is often along these lines: it works for mostly accent-free, simple English and fails to get anywhere usable on most other languages and/or accents. Sure, that is to some degree because of missing training data, but for a consumer this doesn't change the fact that, for very many of them, these features work terribly badly.

Just out of interest I tried out the YouTube auto-generated subtitles for a German video, but even at the parts where the speech was super clear and well pronounced, the result was hardly distinguishable from randomly picking arbitrary words. It wasn't even that the algorithm chose similar-sounding words; they were completely different words in many cases. I think in a sentence of ~10 words, on average 1 or 2 were correct. And that was at the parts where the speech was unusually clear and understandable. At other parts it wasn't even able to recognize that there were words...


Except it isn't a hundred, it's sixty, and I protest that usage. >.<. I prefer eighteen o'clock.


It's standard US military terminology and part of US culture and language.

Just imagine all the idioms they have to account for in these voice algorithms. I know in German they also handle "halb neun" correctly as 8:30.


That works fine in Siri.

Although it does have a problem with hours over 20 (“22 o’clock” becomes 20:02 for some reason, possibly because of my thick Italian accent).


but you would still understand them if they did or take 5 seconds to ask a clarifying question.


I live in Redmond, WA. Frankly, if someone asked to meet for a meal at eighteen, I would assume they would like to get together in a Microsoft cafeteria closest to the (what I believe to be non-existent) Building 18. My backup option would be to assume that they have received a hard blow to the head at some point in their life.

Again, whether I would understand them or not, no one to my knowledge speaks like that in U. S. English. It is a great example to use to show the quirks of language. It is a bad example to use to show that Siri "doesn't even support 24 hour time - at least in English".


I'm not English. We don't say eighteen hundred for 18:00, we say just eighteen. And I don't have localized Siri.

And in any case I said eighteen, not eight a.m.


Officially building 18 does not exist. Anything you may have heard about building 18 is just rumour. If you have any questions regarding the purpose of building 18, you should direct them to Shelley in HR.


Hey, it's been well over a decade since I worked there. :-)

https://campusbuilding.com/b/microsoft-building-18/


"report to Shelley in HR for re indoctrination" would be better :)


..."at eighteen"? "Eighteen hundred hours" is how it's pronounced.


On the google side, "wake me up at 7" results in a 7AM alarm 95% of the time and a 7 PM alarm the other 5%. Just frequent enough to screw you over…


Agreed, I find this to be a great use case for voice assistants. Or on the occasion I feel like napping after lunch "Hey Siri, wake me up in an hour".


> There's a reason that people use apps to get food delivery rather than calling.

Well, if I had some type of personal assistant who worked for me (as in, a real human), I would just call out "Hey, Sam, please order pizza for me" and continue with whatever I was doing.

The reason people don't make phone calls is they add a lot of additional friction. You have to dial the number, and wait for someone to pick up, and give them your address, and deal with frequently-questionable voice quality.


Pizza is a weird example because pizza joint menus are more-or-less standard and unsurprising. For any non-pizza restaurant I'll want to start by scanning a menu, which is way more efficient than listening to a menu be read to me.


And if you're already scanning the menu, just clicking on the stuff you want is easier than explaining it to someone.

However, if you frequently order the same thing from a given restaurant, having a way to order "the usual" from that place via voice might be convenient. But I'm not sure it's much more convenient than being able to do the same thing with a button click or something.


Your assistant is probably smart enough to remember what kind of pizza you like and from where, and will just order that unless you tell them something else. In theory, there's no reason your phone auto-assistant couldn't do the same thing, but we seem to be a long way from any of them having that level of intelligence.


Exactly! We aren't there yet technologically, but I think it's a worthy goal.


I don't know, I like swiping through photos of food to make my decisions on what to eat, and then it only takes about 30 more seconds to complete the order at most.


This is just the optimal stopping problem: sometimes you want to spend time finding the exact thing you want, and sometimes you just want something that's good enough. Both problems don't need the same solution.


I still often use a phone call (or sms) for takeaway. It's faster than using an app or website if you already know what you want, and usually cheaper, since there is no middle-man.

For delivery, I agree. It is easier and faster to use a service, since my address and payment information are already saved.


That doesn't seem correct. People prefer apps because they are better at providing the information we need (menus, products, services etc) and the alternative for engaging through calls is either talking with overworked humans or a painful IVR service.

For me, the ideal interface to engage (order something, create a note etc) when you know what you want is through speech. It just feels so effortless.


I can definitely say that for food I MUCH prefer the web/app interface. I can swipe through a bunch of nice-looking food and just tap on something that looks appetizing.

You can't get that with voice no matter how you do it. It's like if a restaurant had no menu and the wait staff just recited the menu in your face and asked you to pick. Even with real humans, voice isn't the best interface for presenting food choices.


> if a restaurant had no menu and the wait staff just recited the menu

That's exactly how food is presented to you at lots of high-end restaurants. Once you get above the fast-casual tier, menus never have pictures. If you know a lot about food and you trust the chef, voice is a perfect medium for explaining the choices. Pictures are just a discovery aid.


That's exactly what I meant. The app interface is good for conveying information. But if you know what you'll order, eg - popcorn every weekend, just saying "Jarvis, order my usual popcorn combo" is so much more seamless.


The only two questions I ever ask Siri are "how cold is it outside?" and "wake me up {at HHam, in X minutes}"


I'm curious that you want to know the temperature specifically, rather than the weather. I'm guessing that's for reasons other than clothing choice?


I just feel like I get a more actionable answer than the "It's a bit chilly and overcast" I get when I ask about the weather – how do I know if Siri and I agree on what's chilly or not?


Cool, pretty sure for weather Alexa says something like "in $place today it's 10 degrees with a light breeze and 40% chance of rain; with a high of 13 and a low of 8; have a good $day".


But your examples are way harder than they sound. Speech or non-speech analysers have a hard time with context. What do you mean by "recent" photos? And what percentage? Of people wearing masks in each photo, or of photos with 1 or more masked people across the whole set? Or the percentage of photos where all people are wearing a mask? We humans make a lot of deductions from context. We haven't been able to teach computers this aspect for 25 years. It has started more recently; deep learning shows potential.


Absolutely, but you could start with reasonable assumptions and refine from there. Just the percentage of individual humans wearing a mask. Recent = past week. Start somewhere. Give me an answer first, or ASK for clarification instead of "Sorry, I couldn't understand your query."

I understand it's a hard problem, but with current software capabilities and given Google's compute infrastructure I honestly think some of these things are well within the realm of what a team of several hundred Google engineers can do.

I'm not asking it to write Shakespeare, I'm asking it to crunch data with a sentence that could reasonably easily be parsed into a graph and be turned into a MapReduce query. I thought they were good at that stuff.

Context? I know it's hard, but I thought Google has been working on that. I'm very much adjusting my expectations to what I think Google should be able to accomplish in a decade. I expect some basic context capability now, at the very least these data-crunching type use cases.


> But your examples are way harder than they sound.

There are two ways to look at this problem: (1) what is hard and what is easy for our current tech to do, and (2) what are the things that humans actually want to pawn off to assistants?

The problem is that those are two very different answers. I don't think I agree with the grandparent comment that some of those things should be easy, but I do think that comment contains good examples of the level of sophistication that would make an AI assistant more than just a curiosity for the couple times each day you prefer not to hit the button yourself (or if you listen to the radio while you shower).


Exactly, which is why these assistants are not very useful beyond simple tasks you could just do yourself. If they're only good at things that are easy for you to do, then what's the point of them besides not needing your hands to do simple tasks?


Recognizing masks is the hard part, indeed. But it can be done semi-decently by just analyzing tags (#wearamask or whatever).

The rest is easy, it could even be done completely on-device if you have a recent high end chipset. That's how powerful phones are nowadays.

Tbf, it is easier to just run your own server, and pre-program everything for yourself, a truly personalized experience.

Though Google could easily do it for the millions of people using Android. I really wish they allowed custom modules or something for their Assistant, the voice recognition is unmatched.


Then parameterize & return results for one specific duration, return that duration with results, and make it easy to adjust afterward.

“Here are the results I found for the last week...”

“Show the same for the last 10 days”


Context is hard, but it seems like “recent” means (99% of the time) order by date descending, grab the first 15 or so, and then how many of those photos contain a person with a mask.

Maybe the difficult part is whether or not you look for the 15 most recent photos containing people, or the 15 most recent photos of anything.
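
(A minimal Python sketch of that "newest 15" reading, assuming a list of photo records with invented fields; it also covers the "photos containing people" variant.)

    # Sketch of the "recent = newest 15" interpretation; `photos` is assumed to be
    # a list of dicts with invented fields ("date", "people", "masked_people").
    def recent_mask_share(photos, n=15, require_people=True):
        newest = sorted(photos, key=lambda p: p["date"], reverse=True)
        if require_people:                  # "15 most recent photos containing people"
            newest = [p for p in newest if p["people"] > 0]
        newest = newest[:n]
        people = sum(p["people"] for p in newest)
        masked = sum(p["masked_people"] for p in newest)
        return masked / people if people else None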


No, it doesn't mean that 99% of the time. It's more like 99% contextual.

If I ask for recent wildfire news and I'm in a state that doesn't experience wildfires often, are you going to return 15 news articles about wildfires spread out over the 200 year history of the state? I almost certainly want 15 news articles about the current wildfires in some other parts of the country. Your algorithm doesn't really say what to do here.

If I ask for "recent relatively rare astronomical event" recent might mean hundreds of years or more. If I ask for "recent PC game releases" it might mean a month or the current year. If I ask for "recent public events in my town" it might mean over the last week.

In many cases, "there are no recent events" is a better answer than "here are the last 15 events."


FWIW, I tried the first query on Google Home and it suggested I talk to a third party extension called "Air Check." I didn't go through with connecting it but I assume it would give you what you wanted. It sounds like you were on mobile so maybe it's different there. I actually appreciate it prompting to connect before making the request because there's probably some privacy tradeoff there.

The second query was hard for me to test but it seems like a reasonable one to make.

The third one is probably impossible due to Instagram's TOS (among other things.) Their TOS states: "You must not crawl, scrape, or otherwise cache any content from Instagram including but not limited to user profiles and photos." [0] After Cambridge Analytica I'm not surprised that this is the case. Even if Instagram allowed scraping of their data, this feels like a somewhat specialized (though currently very relevant) request. I tried reformulating this request as "What is the mask compliance rate in Shasta county?" which also didn't turn up results, but I'm not surprised by that either. I suspect that data doesn't exist anywhere, so it's hard to fault the assistant for not pulling it up.

[0] https://www.instagram.com/about/legal/terms/before-january-1....


Well put.

I don't know how much of FAANG's budget goes towards improving voice assistants, but considering how much cash these companies have on hand and their operating budgets (the size of some smaller European countries), the progress in that area is just super disappointing.

The 3 most common use cases were refined a long time ago (Directions, Alarms and Play Music) and everything else ran into a hard wall.


Even with directions, it fails miserably IMO. The most it seems to be able to do is "navigate to X". I want:

- "Take me on the most scenic route to X." Can't you figure that out from social media tags? Simple first order solution: routes that have more photos with more likes = more scenic. Took me 1 minute to think of that. And 1000 engineers at Google couldn't implement that? These data crunching tasks are the kind of stuff computers are supposed to be good at.

- "Navigate to X but make sure you stay on paved roads." An actual problem if you are trying to use Google Maps in the back roads of California and don't have a 4WD vehicle. Google Maps loves taking you on 4WD dirt road detours. Don't you have satellite maps and street view? Can't you differentiate paved roads from unpaved ones? What the hell does your machine learning department do?

- "Stop at the last grocery store before the highway 120 junction." Yeah, it doesn't even begin to understand this type of query.

- "OK Google zoom out the map slightly." Nope. Sorry.


Scenic: Garmin's devices appear to do some calculation on number of times the road doesn't go straight over a given distance. Seems to work well enough. OTOH, someone at Google has to make this a feature.
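
(A back-of-the-envelope Python version of that idea, scoring "heading change per unit distance" along a route polyline; the scoring is invented for illustration, not Garmin's actual algorithm.)

    import math

    # Invented "curviness" score: total heading change divided by distance travelled.
    # `points` is a list of (x, y) coordinates along the candidate route.
    def curviness(points):
        total_turn = total_dist = 0.0
        for a, b, c in zip(points, points[1:], points[2:]):
            h1 = math.atan2(b[1] - a[1], b[0] - a[0])
            h2 = math.atan2(c[1] - b[1], c[0] - b[0])
            turn = abs(math.atan2(math.sin(h2 - h1), math.cos(h2 - h1)))  # wrapped to [0, pi]
            total_turn += turn
            total_dist += math.dist(a, b)
        return total_turn / total_dist if total_dist else 0.0

    # Higher score = twistier road; a router could prefer these for a "scenic" option.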

Paved/unpaved: that's one suck-ass assistant you've got there. I know the Garmin on the dash of my motorcycle will give that option. The Garmin RV-specific GPS has loads of other options, such as avoid any low overpasses that will rip the solar panels off my RV. Though it doesn't look like Apple is any better than Google, only because Apple Maps doesn't offer the option. Anyway, satellite and street view? Rand McNally had this information since dirt was first created, no need to get the satellites out.


Yeah, I would have expected Google to even take it to the next level: Ask what model of car the user has (better yet! Identify it from a combination of the car's Bluetooth MAC address and a machine learning model trained on the audio spectrum of the engine noise). Look at where all cars of different types are able to travel, where they make U-turns and turn around, and at what GPS locations they make calls to roadside assistance numbers. Incorporate weather data as well, e.g. what season, and whether it has rained lately. Between all of that data you should be able to advise whether a given vehicle should be able to traverse a given road on a given day of the year.


Agreed, the error rate before I just give up on something is pretty low.

The real frustration is when it fails on tasks I swear it has done before easily.

Like, 'ok google' sometimes ends up googling the command itself, word for word: 'ok google set timer 10 minutes'.

As soon as I have to babysit an assistant like that I'd rather just make sure the clock is on my home screen so I set a timer.


> A big problem with the assistants is that as soon as they fail at a query they seem stupid, I feel stupid, and I stop using them for a long time.

Raise your hand if you remember the early days of Xbox "recognition", as you listen to other people in your party:

"Xbox, record. XBOX, RECORD. X...BOX, RE...CORD."

It's the reason that "oh, we can't sell an Xbox without a Kinect" Kinect went into the garage after a few months.


What I find even worse is when you play FIFA and some random unrelated British commentary triggers the "xbox" speech command. Infuriating, especially since you can't turn it off.


And it can be used to troll people:

https://www.youtube.com/watch?v=anslUJ5SCIs


Kinect, actually awesome tech, implementation on the xbox... not great. But the heart and soul of a ton of installation art pieces.

It's really too bad that MS went with a windows only sdk for the 2.0, the 1.0 being external tech had a multiplatform sdk.


The big power of GUI is that it was an alternate control scheme to do anything you can do on the CLI. Voice control is not a UI. It's a limited set of functions not meant to offer analogous control over any aspect of your device really; if it does it's happenstance. That's why Siri is for setting alarms and not doing complicated workflows that would require your thumb. I wish it was a voice ui, though. I wish I could stick my phone on a shelf and step back and tell Siri to take a picture, to do that function on the phone my thumb can do, but the best Siri does is open the camera app because the voice controls are just that, a limited set of control levers to pull and not a complete user interface.


That is all doable with Huginn/IFTTT/Automate/Tasker/Snowboy.

It could even be done on the phone itself, although battery life and heat would likely be a problem.

The fact that it doesn't work just shows that Google isn't that interested in their Assistant.

Which is rather strange, but maybe they're looking for better ways to monetize it?

Or maybe it's going the way of Reader once AR gains momentum or something... then again Google Lens is a really nice product and no one I know even heard of it.

I'm amazed at how well it can recognize any kind of writing, even my chicken scratch, and how it can look up any products/labels with decent results.


Probably, they get more ad revenue from:

- you doing a manual search, and being tracked/targeted

- them displaying ads as a result

- them displaying contextual, higher paying ads

I've noticed that some things over the years, many things in fact, have been removed from Google Maps. I hypothesize that the entire reason is, these things reduce profit.

For example, you used to be able to pause Google Maps. You can't now, and therefore, if you have 'history' off, you have to stop, and manually re-start your destination by typing it in.

Well, history profits them.

And pausing is not the best, because then, you're not active 100% of the trip. Having you active lets them determine all sorts of things, like traffic patterns, where you go, who you're near, and so on.

There are lots of little things like this, which seem to be gone from earlier versions of Maps. Again, I presume, all to make more $.

Which is logical, and fine, but it gets a bit tiresome and sad at times.


But would you have gotten better results if you had typed those same queries into some website?

While an artificial general intelligence would be capable of fluent speech, a perfectly good speech system does not necessarily need to be an AGI.


The problem, as you note, is that you are actually doing more parsing of the input than virtual assistants do. Tom Scott has some good videos on the subject (https://www.youtube.com/watch?v=m3vIEKWrP9Q&list=PL96C35uN7x...)

To give a more concrete example, here's a UPenn demo (https://cogcomp.seas.upenn.edu/page/demo_view/ShallowParse) for your instagram query:

> NP What percentage PP of NP people can NP you VP detect to be wearing NP masks PP on NP recent Instagram photos VP tagged PP at NP a location PP within NP a 10 mile radius PP of NP Mt. Shasta ?

Part of the reason we don't progress beyond that is that speech recognition like in the OP article is quite bad: 95 percent accuracy is considered "good." But it means we expect 1-2 words of your query to be misrecognized, so even if it did parse the query as you proposed, it would probably be answering the wrong question!


This reminds me of the issue in machine translation where even sophisticated systems cannot understand the importance of a word like "no". So, for example, a phrase like "please press this button" could be translated as "please do not press this button" and it could be automatically rated as a good translation because most of the words are there.

But, on the other hand, I have dictated these paragraphs (pretty much just adding punctuation and minor edits at the end). I think the most useful feature I've found related to voice is dictation (speech to text). It is almost perfect, at least in English (and I am not a native speaker).


Because this is HN, I want to say that I think the above is now doable and more people should be working on startups in this space.

The thing we're discovering is how to marry NLP (which works pretty well) with structured databases and automated tools (which work really really well).

A pure AI play probably won't get this done. But if your AI starts knowing how to use information tools then I think we'll see a lot of near term progress.


Sounds fascinating. Could you share any pointers to papers or academic work on this subject?


Data Agnostic RoBERTa based Natural Language to SQL Query Generation

https://arxiv.org/abs/2010.05243v1

more

https://scholar.google.com/scholar?as_ylo=2020&q=text-to-sql...


I use my voice assistant on my phone to set timers.

That's it.


The other day, I wanted it to open an app and start recording when I say "Hey Google, take a voice note"... but no dice :/


Just say "Hey Google, open Live Transcribe" (assuming you had that app installed)


This highlights part of the problem. Google sell/advertise products for companies but people want to use tools.

People want "[wake word], [action] [modifiers]".

Companies want "[trademark product] [trademark product] [modifiers]" (eg the parents example "Hey Google (RTM), Live Transcribe (RTM) 'words to transcribe'").

I've only used Alexa (and only for fun) but the replacement of verbs with companies/products, and of nouns with proper nouns, is really annoying to me and gets in the way IMO.


Weather is also useful! Sometimes reminders, which are basically timers. But yea that's it.


Siri doesn't get the weather right.

"What's the weather in Yosemite" and "Yosemite weather" can get you two drastically different results. As of right now (10/26/2020 10:35 AM) the two results I get are 43F and 32F.


Siri is always mistaken. Alexa is pretty good.


Alexa is super easy to make a skill for (IIRC it's one XML/JSON file hosted somewhere and registered with Amazon). So you could probably make your own with ~IFTTT to scrape the data. Alexa has "daily briefing" which will then go through a list of skills which can include your own [public] skills.

I made an "and finally ..." to add funny news to the end of the daily briefing and was super impressed how easy it was.
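
(For anyone curious what "easy" means here: a custom skill's backend is essentially an endpoint, often an AWS Lambda, that returns a small JSON response, while the interaction model is a separate JSON file configured in the Alexa developer console. A minimal Python handler looks roughly like this; the intent name is made up.)

    # Minimal sketch of an Alexa custom-skill backend (e.g. an AWS Lambda handler).
    # "FunnyNewsIntent" is a made-up intent name defined in the skill's interaction model.
    def lambda_handler(event, context):
        request = event.get("request", {})
        intent = request.get("intent", {}).get("name", "")
        if intent == "FunnyNewsIntent":
            text = "And finally... here is today's funny story."
        else:
            text = "Welcome to the demo skill."
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": text},
                "shouldEndSession": True,
            },
        }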


Something I'd love to see is a move towards "multiple possibilities" in voice-based UI/UX. Almost all these systems give weighted probabilities to different parses, and if you gave the human a choice between the top 3, you could be much more exploratory/risk-taking in choosing candidates. Or in machine translation or transcription, why do Google Translate and YouTube automatic captioning not communicate the system's level of comfort with what it gives you, and provide alternative possibilities if that is low?

We're humans - we can deal with ambiguity. Systems should trust that we'll respect them more if they tell us they're unsure, rather than either jumping to the wrong conclusion or simply being unwilling to guess!
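
(A small Python sketch of what that could look like, assuming the recognizer exposes its scored hypotheses, which most do internally; the thresholds are arbitrary.)

    # Only commit to the top hypothesis when it is clearly ahead; otherwise
    # surface the best few candidates and let the human pick.
    def present(hypotheses, margin=0.3, top_n=3):
        # hypotheses: list of (text, probability), sorted best-first
        best = hypotheses[0]
        runner_up = hypotheses[1] if len(hypotheses) > 1 else ("", 0.0)
        if best[1] - runner_up[1] >= margin:
            return best[0]
        options = "; ".join(f"({i + 1}) {text}" for i, (text, _) in enumerate(hypotheses[:top_n]))
        return f"I'm not sure I heard that right. Did you mean: {options}?"

    # present([("call mom", 0.46), ("call tom", 0.41), ("call bob", 0.02)])
    # -> "I'm not sure I heard that right. Did you mean: (1) call mom; (2) call tom; (3) call bob?"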


Agreed, "recent photos" should get a response like "showing photos from the last day" and you can say "no, from the last month" and that then will be used to weight similar requests (about photos) in the future (weight mind you, not naively then make all requests for 'recent' by replaced by '1 month').


I think the issue is that the virtual assistant is there to assist, not replace an assistant.

I have an admin. I can text, call, talk, email her and say “let’s get bob, mary and finance on the phone at 4” and she’ll move stuff around to make sure that happens. I can also delegate complex administrative tasks that need to be done in my name.

When I’m at the dentist, I usually do a “hey Siri” to book it as it saves time. At home, “hey Siri turn on the lights” is helpful. It’s magic, but not the same.

The tech companies try to frame this stuff “bigger”, but in doing so they create an unreasonable expectation. Google home, Siri, Alexa are amazing, but we bitch and moan about them as a result.


Maybe it's the marketing.

I don't have a human assistant, but if I did, the biggest things I would expect from them isn't about scheduling a meeting or scheduling a barber appointment. I can handle that stuff myself.

What I would REALLY expect of a human assistant:

"Can you call up my health insurance and fight this stupid bill of $400 for COVID testing that should have been $0. Please escalate if necessary. Thanks."

"Can you figure out how to fight this red light ticket that I got due to a malfunctioning red light camera? Thanks."

"Can you fight this parking ticket? I had a valid permit to park there, and here's documentation of that. Thanks."

"Can you register me to vote? Here's my ID. Thanks. Make sure they don't sign me up for spam."

"Can you call up Comcast and fight this bill increase? Threaten to switch to another provider if necessary, I heard that works."

"Can you call up this company that posted my personal information and ask them to remove it? If they refuse, threaten legal action."

"Can you dispute this electric bill for me? My heating shouldn't have been $400 a month. Something must be wrong with my meter or someone is leeching power from my line."

"Can you call up the 10 different grocery stores in the area and figure out which one has X in stock?"

In all honesty I do wish the Google automated assistant could do all of the above. Sorry Pichai, I don't care for scheduling haircuts automatically. I want your assistant to use its hundreds of thousands of hours of human conversations and use machine learning to craft and engineer responses to humans to know EXACTLY when to escalate, EXACTLY when to ask for a manager, HOW to threaten legal action, and basically fight tooth-and-nail with language to get me what I want against the companies and institutions I need to fight with on the phone. The job of an "assistant" should be to get sh*t done and get me what I want. Use machine learning and lots of data to master the art of negotiation with customer service reps.


Most of the time when I ask that at home, it tells me that it's afraid it can't turn off the lights, since they are already off.


> It turns out that the current generation of "assistants" are mostly just template-matchers which really doesn't help me much at all.

All the cases you've indicated as "so simple" are only so simple as template matching.

When generalized, they are hard problems.

Sure, you could have humans sketch out a few thousand templates enabling "Ok Google" to support such simple things like "What is the air quality at <location> <Time>" which looks up from an API. But that doesn't generalize "for free" outside of your templates.
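
(A toy Python example of what such a handcrafted template looks like; the pattern and the lookup function are invented for illustration.)

    import re

    # One hand-written template. It only fires when the request is phrased
    # almost exactly this way; anything off-template falls through.
    TEMPLATE = re.compile(
        r"what is the air quality (?:like )?(?:at|in) (?P<location>.+?)"
        r" (?P<time>today|tomorrow|right now)\??$",
        re.IGNORECASE,
    )

    def handle(utterance):
        m = TEMPLATE.match(utterance.strip())
        if not m:
            return "Sorry, I couldn't understand that."
        return lookup_air_quality(m["location"], m["time"])   # hypothetical AQI lookup

    # handle("What is the air quality like at Mt. Shasta today?")  -> calls the lookup
    # handle("How's the air near Shasta?")                         -> falls through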

Google seems to want to avoid handcrafted templates and go straight for the generalized solution.


> "OK Google, what percentage of people can you detect to be wearing masks on recent Instagram photos tagged at a location within a 10 mile radius of Mt. Shasta?"

Ok, so you need the assistant to:

* Already have a trained dataset of people wearing masks

* Fetch ALL instagram pictures it can find

* Not only detect if there's a mask in the picture, but count them

* Fetch the location of Mount Shasta

* Calculate a 10 mile radius around it

* Apply that calculation to the people counted in the first step

* Calculate a percentage of people wearing masks which needs:

* Count people not wearing masks.

Those steps are sub-optimal. So you need not only understand that, but to run some sort of 'query planner' in order to get the results you are looking for.

You are thinking of Jarvis from Iron Man. Forget assistants. Allow someone to construct a query like this in a couple of minutes and you have a great product you can sell. Existing ones require quite a bit of domain knowledge and setup. Not even things like Wolfram Alpha would be able to parse this query.

This reminds me of: https://xkcd.com/1425/


> Not even things like Wolfram Alpha

Sure, but we're talking about Google here, not Wolfram. The masters at query optimization and MapReduce. I would have expected them to be able to parse this query, or at least fetch the AQI near Mt. Shasta but can't even do that.


Heck, I ask Siri something like "what is the pollen count" and I get back a page that it can't read, describing what a pollen count is, not what the current measured pollen count is in my area. If it could answer your questions, I would have hope for my own question getting the expected results.


If google assistant or whatever worked like a person of average intelligence that I could talk to and get to do stuff for me, that would be incredible. I've been on a roadtrip and your examples made me realize how much of a time/attention saver something like that would have been.


"Hey Siri, remind me to add [x] to the grocery list at [y]."

"There is no 'Grocery' list. Would you like to make one?"

An example of how Apple's ecosystem breaks down if you want to use something outside it. Saying no should offer to add a reminder as requested, but it doesn't.


Try getting Google to voice type the word "o'clock". Just try it.


I set up a routine for the words "Music Time" on my Alexa. Every day I say those words a few times. The hit rate on an accurate response is not above 90%.

It's the only thing I use my Alexa for.


Without speech recognition our startup would not exist today. We are developing a text-based video editor where the video is cut over the transcribed speech.


Actually, the ability to delete all iPhone alarms at once via Siri is a lifesaver. I know no other way to bulk delete/disable alarms in iOS.


Sounds like a UI bug/misfeature, rather than a voice feature.


Or my current peeve:

"Siri, play song “song-name”"
"I'm sorry, there was an error with Apple Music."
"Siri, play song “exact-same-song-name”"
[song plays]


When they do work though you feel like you’re at the helm of the future.

I’ve figured out what queries almost never seem to fail and use those almost exclusively. I don’t get creative.


Siri answers the first query correctly, fwiw


Back in the day (the '80s) they were joking about the programming languages used in early (Star Trek) holodecks.

  > (1st attempt) "Computer! coffee please"
  > (computer dumps coffee on Kirk) "argh!"
  > (2nd attempt) "Computer! coffee in a mug please"
  > (3rd attempt) "Computer! hot coffee in a mug please"
  ....
  > (25th attempt) "Computer! 10cl of coffee at 50C with 3cl of fresh milk at 6C in a bottoms down ceramic mug of 15cl"


This means that you would only be able to choose between several (discrete) choices instead of assigning the computer arbitrary tasks. But when the number of such choices is really small (e.g. make coffee), then a simple physical button to initiate the task is better in almost every way.


“Computer. Press the coffee button.”


One thing I haven't seen discussed is the poor affordance / discoverability of speech technology.

Google Home can do some clever things; however, it also lacks the ability to do some very basic stuff. As a user, how do you know what Google Home can and cannot do?

It is just trial and error. And if Google Home introduces a new feature that can handle new types of queries, what then? How does a user know that last month it wasn't able to do something and this month it is?

And lastly, the voice interface is very clunky. It has no concept of temporal memory. For example:

Me: "Ok Google, navigate to the nearest Safeway"
Google: "Navigating you to the nearest gas station"

The natural thing to say is "no, I meant Safeway, not gas station"; however, I now have to say "Ok Google, navigate to the nearest Safeway" all over again.

This is analogous to a keyboard with no backspace, where you have to retype everything every time you make a typo. Well, that's the state of speech technology right now.


>As a user, how do you know what Google Home can do and cannot do?

Amazon's workaround for this problem is to have it tell you when new "options" are available for a command (i.e., if you set an alarm it will confirm the alarm, then tell you it can wake you up to the sound of birds, then give an example command), and Amazon sends out a "what's new with Alexa" email every so often that's 90% example commands.


Agreed. More broadly, I'd say that no one has made a good UI yet. The mac/lisa/star had a UI that people could learn. iOS...

In some ways a voice UI has bigger problems to deal with than PC GUIs or iOS. Those UIs were replacing pre-existing UIs (e.g. BlackBerry, DOS, Unix, Norton) and they could target whatever tasks a smartphone/PC needed to do. For voice UIs, it's a cold start. It's not even obvious what an audio-only computer should do. Our mental model for a "virtual assistant" is a person-to-person exchange, and computers still aren't great at communicating like people.

FWIW, I think slipping into existing niches is the way to go. That's where a useful voice UI will be discovered. Car stuff, accessibility software, living room controls... at least these have clear goals. Voice-operating Spotify, Netflix, or just an iPhone is something people actually need and will use if it's useful.


What you say about "temporal memory" is not exactly true for Google's assistant. You can try two separate queries:

1. "Who is the president of the United States?"
2. "What is his wife's name?"

And it will resolve the deictic pronoun.

I haven't tried this feature out extensively, but it has worked for a few years now.


Once they start hooking them up to conversational language models that can also submit queries, I think it is going to get a lot better. The conversational model results are starting to look very good.


Where can i learn more about the leading edge of those conversational language models ?



For most short interactions, the mouse/trackpad/finger is simply faster.

Now, for long-form typing, I'd love to use dictation, and I sometimes do for taking down short thoughts I e-mail myself from my iPhone.

But the problem is not just that it still makes tons of mistakes. (Probably a quarter of my notes-to-self involve errors so big it's even impossible for me to later figure out what I even meant by trying to sound it out phonetically.)

The problem is that I can't correct those mistakes using voice. There's no way to say "pause, correct affect to effect" or anything like that.

Even more maddeningly, the words keep changing in real time. Sometimes I'll utter a sentence it gets right, then it "re-analyzes" it and completely messes half of it up.

I just wish there were a kind of dictation where I could say a phrase, pause, see if it's right (and it wouldn't change after), say the wrong part with a kind of emphasis that lets the system know I'm issuing a correction, the system would look for the next most probable alternative, repeat as desired. Then I could actually dictate successfully.

This UX where the words are always changing back and forth according to updated statistical probabilities, even as long as 15 seconds after I said them, and where there's no ability to go back and correct them with voice... it's just so so dumb.

The problem isn't voice recognition anymore. It's voice correction.
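One way to picture the missing piece: the recognizer already produces an n-best list of hypotheses internally, so a voice-driven correction loop could be as small as walking that list. A toy sketch, with the recognizer output and the user's replies stubbed out:

  def correct_by_voice(hypotheses, listen):
      # hypotheses: recognizer's n-best list, most probable first.
      # listen(): returns the user's next utterance; saying "correct"
      # rejects the current guess and moves to the next alternative.
      for text in hypotheses:
          print(text)
          if listen() != "correct":
              return text
      return None  # ran out of alternatives; fall back to asking again

  nbest = ["affect the outcome", "effect the outcome", "a fact the outcome"]
  replies = iter(["correct", "ok"])
  print("accepted:", correct_by_voice(nbest, lambda: next(replies)))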


Picks up the mouse and speaks into it: "Computer?"

Classic scene: https://youtu.be/xaVgRj2e5_s?t=171

Star Trek isn't my favorite sci-fi show/movie, but they really nailed some futurism, and in this case, the lack thereof.


100% this. Dragon used to have more correction features, but at least the Apple dictation doesn't really have any. Even just "correct sentence" where I try again with the same sentence would be a huge improvement.


I'm dictating this. Not going to make any corrections, so that you caan see what the state-of-the-art is right now, including bugging us (that supposed to be bugging this. Apparently it doesn't recognize that word it's the quality of being buggy).speech is my primary interface my computer due to health issues, and my God is it infuriating.it's never going to really take off until it just works.

imagine if your keyboard had about a 2nd latency and every couple words got messed up in some way. Not only that, but those same words that got messed up are probably getting be messed up again when you try to go back and fix it with the same broken keyboard. you wouldn't say that typing is nearly solved, you'd say that typing absolutely sucks and keyboards just don't work.

I firmly believe that speech is going to be the main interface-actually want to scratch that last sentence, but correcting it is can the pain with voice. I firmly believe that speech is going to be a game changer of an interface, especially for coding, eventually. But until it stops sucking, it's only can be used where where there is no other choice.

(By the way, this is me, a native English speaker with a standard salmon Cisco accent-that's a standard San Francisco accent- dictating it on a many hundred dollar microphone into a many hundred dollar speech engine, speaking relatively slowly and enunciating the hell out of out of things.when I first started with that, it was even worse. and yet somehow a lot of people treat speech recognition as if it is in some way solved, or the error rate is better than human.or loader ship-that's supposedd to be what a load of ship-whatever you see what I mean)

-----

(edit: After the fact, I decided to calculate the word error rate dictating thhis. About 7%. If you ignore the non-speech-related bugs, it's about 4%, which is supposedly "superhuman." take that as you will. my take is that "human level" is not the same as a human trying to work as fast as they can, and maybe not paying particularly close attention. And that 4% is ridiculously far from 0% in terms of usability)


I think the issues with typing via voice are: 1) speech accuracy, and 2) correcting things by voice.

It's super unnatural to correct things by voice and until we re-imagine what it looks like if we HAD to type via voice, it's gonna be painful.

(Speaking as someone who types via voice as well. I have used Dragon and currently use Talon.)


For me, the thing that killed voice commands had nothing to do with speech technology. It was latency and error handling.

At the start of my morning commute, I would say, "Ok Google, navigate to work".

Often, this would fail because I was in the network limbo area outside my house, where my phone struggles to transition from home WiFi to data.

Worst of all: The failure would be horribly slow. I would have to drive for another 30s before my phone realized, yes, we are really out of WiFi range now. And the voice command wouldn't be auto-retried. I would have to tell my phone again. It didn't remember.

I added one-touch "Home" and "Work" Google Maps widgets to my home screen and never looked back.

As an engineer, I realize why this is a tricky problem. As a consumer, I want it to "just work".


It took me months to work out that "navigate to X" was the magic phrase to get Google Maps to do what you would expect car navigation to do. The phrase that came more naturally to me was "give me directions to X", but that only gets you to the screen with the route, and you still have to manually press the "start" button with your finger. And then it would randomly pick other modes of transport unless I remembered to say "by car". Systems that are inherently unrecoverable like voice commands need actual documentation.


Typo correction: unrecoverable -> undiscoverable.


Why would you need directions to the workplace you go to every day? Unless you're doing onsite stuff in new places every day?


Not the OP, but many people use Waze etc for driving directions on everyday commute because the same destination does not imply the same route - due to construction, accidents, traffic jams, etc the best route can vary significantly, and simply driving the same route as yesterday can take much more time than it did yesterday.


Traffic.


A few years ago, when Microsoft was pushing Cortana with Windows 10, I decided that since I was wearing a headset at my desk 90% of the time anyway, I might as well try using a voice assistant so I could multi-task. So the next time I needed to do a calculation in the middle of something, I said "hey Cortana, what's 50 times 12.5?" while typing something up... and it opened the Start menu, stole focus from my window, and then searched for it on Bing in Edge. I just wanted it to read me the answer.


Another frustrating experience with a voice assistant was with Google Assistant. I was on a train, and I didn't want to miss my stop if I got distracted (as had happened before). Since I had my headphones on, I tried getting Google Assistant to notify me when I arrived at my destination. It could not do it; the devs hadn't made a template for this scenario at the time.

Nowadays it might work via a location-based reminder, but I can't trust that to work within the 15 second window I have to get off the train.


The main problem now is not speech recognition. It's a kind of uncanny valley effect.

You can speak to these assistants, but the language is still restricted. They show little to no common sense. It's a lot of party tricks bundled together.

You can't interrupt them and it's hard to correct them.

On the speech recognition side, an issue I've found (although it is a rather niche one) is triggered because I'm bilingual (I'm fluent in Spanish and English).

Speech recognition only works well on a single language.

I have Alexa set to speak English, for example. When I'm searching for a song with a Spanish title, I have to fudge the name into a fake English pronunciation for it to produce the right phonemes that will match the song title, rather than just saying the name properly.

Also, if it misses the match, there's no easy way to stop it and say "No, not that one", and be presented with a list of similar matches.


I don't think it's a niche issue outside of the United States, in countries where English is widely known. It makes it pretty much impossible to use Siri on Apple TV for example, because so many movies or TV series are named in English but also in the local language.


I suck at talking. I speak in halting, quiet sentence fragments. My mind wanders and I lose my point. I get in my head a lot. I'm much better at writing and reading as a communication method. I'm open to speech stuff, but currently no speech recognition solution meets any needs I have.

Just my 2c, I'm sure other people have uses for it. The most interesting one (to me) has popped up a few times on HN, which is voice-based programming. I would love to see that mature and become more widespread, there are a few things that are annoying enough to do that if I had a voice shortcut or eye tracking it would be pretty cool.


I do the same when talking. Never realized others have this condition as well :)


There are lots of reasons it hasn't taken off like some hoped (accuracy, social aspects, privacy), but it's not faster unless you have no other option for input. If I have a computer, typing is faster; if I have a phone, which we almost always do, it's faster to trigger the command or type.

Speech is competing against every other tech item trying to be convenient, from laptops to phones to watches... the only space where I'd want it is something like baking or cooking when I can't interface with a computer.

(edit) And what about the response back? With a visual interface I can confirm at a glance whether my input was accurate; with a voice one I have to listen to the whole thing. I don't even like map directions in the car, as half the time I already know the next direction and don't want the interruption to music or whatever I'm listening to.


It's also the unprompted responses that have started to bug me. I don't know if Google is having a bad rollout or this is deliberate, but my Google Home is being triggered a lot more often now, and I don't recall anything remotely close to the wake phrase being said. Also, the responses to questions when actually prompted are not at all what I'm expecting. In particular, Google has trouble understanding a lot of context around Spotify playlists. It still cannot distinguish between a song and an album with the same name despite me prepending the request with "Play the song...". Overall it's just a terrible experience, and that makes the whole assistant thing less like an assistant and more like a toddler that refuses to cooperate.


I had this problem a lot with Cortana popping up during meetings and seizing control of my microphone. In the end I spent probably 15 minutes trying to figure out how to turn off voice activation because Microsoft doesn't make it easy.


Audio feedback doesn’t give the user the same reassuring sense of certainty as a graphical user interface. One glance will confirm that I have typed my card number correctly, but you don’t have to be unusually impatient for your heart to sink, when you hear the inhumanly calm words, “I heard 4659 1234 1234 1234. Is that correct? Say yes or press one to confirm”.

This is the main reason I rarely use voice controls. Even if voice is faster than typing for most cases, typing never fails catastrophically like voice does.

I could use voice only when I'm confident it will work, but then there's the mental load of making that prediction. It's easier to just always go with typing.


Just this morning on my way to work: "Hey Siri, remind me to take my meds at lunch"

Siri created a reminder (good) called "Take my meds at lunch" - no time (bad), just a simple reminder.

I saw this post, looked at the time, and wondered why Siri hadn't reminded me, given it was 1:00. Now I know to be more specific, but in reality I'll just stop trying, like most people.


Question: Can I write my own assistant on mobile devices that is an actual assistant? As in, can listen for the key activation word in the background? Like the "official" assistants.

As far as I can tell, the assistant APIs seem to be like plugins? On Android for example, it appears custom assistants still run through Google assistant.

I want to be able to say "TriggerWord, do X and Y" and the OS activates my app, passes the voice sample and I take care of all the language processing from there. Which doesn't seem possible...


The custom home automation system I built relies on google's (very good) off-line voice recognition + automagic [i] + opencv face detection.

I did not want the home assistant listening all the time, as it's highly likely to falsely trigger just from hearing the radio or TV. So instead I use Python + OpenCV to detect whether the assistant (an old rooted Samsung Note) is being looked at directly; then it wakes up to listen for the trigger word and the command. Of course, I can also manually trigger it via any device in the house.

[i] http://automagic4android.com/
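For anyone wanting to try the same trick, the gaze-gating half is only a few lines with OpenCV's stock frontal-face cascade; the wake_up_and_listen() below is just a placeholder for whatever hotword/command pipeline you hand off to:

  import cv2

  def wake_up_and_listen():
      # Placeholder: hand off to your hotword detector / assistant here.
      print("face detected - listening for the trigger word")

  # Stock Haar cascade bundled with opencv-python; roughly-frontal faces are
  # a decent proxy for "someone is looking directly at the device".
  cascade = cv2.CascadeClassifier(
      cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

  cam = cv2.VideoCapture(0)
  while True:
      ok, frame = cam.read()
      if not ok:
          break
      gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
      if len(cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)):
          wake_up_and_listen()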


https://snowboy.kitt.ai/ is about the only good third party hotword detection engine. You can hook it into Assistant with a bit of work, and with Tasker/Automate and a rooted phone, you can open apps, press buttons, pass voice commands, and more.

IMO, Assistant is limited because of privacy; see how it asks you to opt in for a more personalized experience, and it still won't unlock your phone for you.


I've had a similar experience; I worked in voice for a long time. I simply have no desire whatsoever to speak to my electronics. I don't find it helpful or useful in any way.


I'm not quite that extreme, but I'd use voice if it was at least as smart as a human. And it isn't - yet.

"Turn on the lights" isn't exciting. "Make dinner and do the laundry" would be exciting, but that's at least 25 years and some major advances in robotics away.

IoT is very crude and contrived compared to what would be possible with an active technology that could do useful physical things of all kinds.


For something like that I'd rather press a button or have it happen on a timer.

Where it could really shine is in rarer commands. "Make me some Pad Thai." Unless you're a huge fan, you won't want a button for this. And it's faster to say than to type into your phone.


I will consider talking to my computer when my computer is the only one that would hear me.

I also never talk to my phone.



Yeah, it's weird to have a conversation with someone else listening. But I will bring it in if I'm discussing something with another person, and want to look something up. It feels more normal to have a voice interface involved when I'm already talking.


Same here. What's the point of using things like Facebook Container in Firefox and blocking Google tracking as much as I can, and then happily sharing my private life with them in an even more direct way?


I can barely manage to talk to people. Why on earth would I want to talk to a box?


It might be a good way to learn to talk to people, or just talk... Actually, just use Discord for that, join random servers and the voice chats.


That sounds even worse.


No no, it actually works. I am working from home most of the time, and chatting on Discord with random strangers helps me keep my speaking skills, and just stay sane(ish).


Three major problems I see:

1. When these systems go offline, they’re nearly useless. One WiFi glitch and the system gives me the audio equivalent of a blank stare. This directly reduces the chance I’ll casually use voice and instead prefer my more-reliable phone.

2. Most systems haven’t figured out reasonable responses and are basically chatty and full of sounds. Make it Unix-like (silence on success = golden)!!! If I say “turn on the light” and you turn it on, I CAN SEE THE LIGHT ON so I don’t also need to hear a loud chime and some voice confirmation like “sure, no problem”! The fact that I can silently do things quickly in other ways (e.g. phone) is another strike against voice. Yet this is something they could easily fix.

3. Voice systems are not 100% perfect at comprehension yet they tend to babble out long responses. This puts me in the situation of trying to shut them up for long enough to listen for my intended query. Maybe they can improve this by erring on the side of fewer words in replies, with more pauses? Not sure.


I dearly wish I could have a speech recognition aspect to the 'creative' and CAD-like software I use. I think it would be fantastic to be able to do two or three operations 'at the same time', e.g. click an object with the mouse and hit Ctrl on the keyboard to lock to some plane or angle as I use the mouse to move, whilst verbally instructing the software to 'zoom viewport out' or achieve whatever other function that is vaguely complicated or buried in the GUI somewhere.

Otherwise, I have absolutely no desire to 'talk' to my computer and have it understand me, unless that tech comes packaged with an empathic AI module so I can tell it off and repay it a small percentage of the emotional pain computers have inflicted upon me over the decades.


Since this thread is dealing with user interaction in general: I do know there are also foot pedals people use as modifiers or input actions, along the lines of what you're saying.


Thanks - I've played around with tertiary inputs a couple of times over the years, from MIDI-based level boards to a '3D' input device for CAD. Speech recognition would, I think, be ideal for function interactions harder to describe than what a switch or a lever could achieve, e.g. "Fill highlighted image with red and make 50% transparent" or something.

The critical benefit would be being able to do this whilst 'controlling' the focus with a mouse and keyboard at the same time.


We've been pretty good at word recognition for a while - speech, not so much. Conflating the two has led to a lot of confusion.


Word recognition and sentence recognition.

It's honestly quite shocking how sparse the research and implementations are for everything beyond a single sentence/command that you shout at your personal assistant.


Frames used to be an idea in AI, but they seem to have been sidelined and possibly forgotten now.

Frames mean that words and sentences have a context, and you can't understand conversations unless you understand the context.

This starts from simple and obvious distinctions. E.g. - as a silly example - "make dinner" usually means "prepare and cook an evening meal". But if you have a project called "dinner", it might mean "build and compile 'dinner'". An AGI should be able to understand the difference, and ask for clarification if it doesn't.

Eventually you end up with subtextual and implied communication - e.g. "I'm fine" can mean two completely opposite things depending on tone of voice and the contents of minutes-to-years of previous conversations.

All of this is many orders of magnitude harder to handle than "Bedroom lights off."


Oh, I didn't mean AGI levels of understanding, but even "simple" technical things that are likely building blocks necessary to get to that point like sentence boundary detection.


That's fair, NLP has done decently with mechanical deconstruction of normal sentences for quite a while now. But as you note, mapping that onto a template for response is a long way from "understanding".


I think trying to replace regular GUIs and mechanical inputs in every use case is naive; it seems pretty obvious to me that clicking a button or pressing a keyboard shortcut is pretty much always going to be faster than uttering a voice command.

But there's a huge swathe of use cases where it does make sense, and I think we should be focusing on those - situations when you can't use your hands. Voice assistant technology doesn't even need to be great for this, just good enough that you can look up unit conversions with your hands covered in bread dough, or navigate while driving or whatever.


And on the flip-side of this story... "Hands-Free Coding: How I develop software using dictation and eye-tracking" (https://news.ycombinator.com/item?id=24846887)

The author notes that they only get about 50% of regular speed with this approach, and that may be a significant part of the challenge---speech can encode complex concepts into a few words (especially given context), but the actual baud rate isn't particularly impressive. Keyboard interface, where possible, seems to still win out.


Very unlikely I will ever talk to my computer, irrespective of how good the speech technology gets. If I have to talk to my computer, how will I work in crowded places? How does that work?


Even at home without any risk of disturbing coworkers, it would have to be extremely intuitive to offer any speed advantage.

I can open Microsoft Word faster than I can say "open microsoft word". It would have to be smart enough to short circuit the entire process of doing something useful with Word.


You're not thinking creatively enough. Part of the technology would be to disable any other way of opening Word.


If you had two mics, you could probably work out a filter that captures audio roughly 'in front of the laptop', which would likely work well enough. But I think the wins are going to be in places where you don't normally have a computer, where a mouse and keyboard aren't natural companions to the task at hand. Yes, some environments will be noisy enough that speaking is a bad modality, but not all of them.
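The simplest version of that 'in front of the laptop' filter is delay-and-sum beamforming, and for sources directly ahead (equidistant from both mics) the delay is zero, so it degenerates to averaging the channels. A toy numpy sketch; the anti-phase case below is the extreme off-axis illustration:

  import numpy as np

  def front_bias(left, right):
      # Broadside delay-and-sum with zero delay: frontal sources arrive in
      # phase and are reinforced; off-axis sources arrive offset in time and
      # partially cancel.
      return 0.5 * (left + right)

  t = np.arange(0, 1.0, 1 / 16000)
  tone = np.sin(2 * np.pi * 440 * t)
  print(np.abs(front_bias(tone, tone)).max())    # ~1.0: frontal source kept
  print(np.abs(front_bias(tone, -tone)).max())   # 0.0: fully out-of-phase source gone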


I do not see the advantage of talking to a computer. I can see the advantage of talking to my Google Mini in my bathroom while I am showering. I see the advantage of asking my car to change music while driving. But literally zero advantages to talking to my laptop. My hands can move a lot faster than my mouth. I can set shortcuts that help me do things a lot faster than the whole pain of talking to the computer.


No, the point is that you'd be annoying the people around you if you were talking to your computer the whole time.


That is a lot of reading into the question that was actually asked: "how does it work?"


Sorry, I was not clear. The problem here is that I am annoying other people and I am getting annoyed by other people around me. Imagine saying "Hey Computer, what is my credit card bill for this year" or "Hey Computer, open Word".


Shades of that asshole in line at Starbucks holding a conference call over a Bluetooth headset.


In a crowded place the noise has to be cancelled, or the speech recognition application must learn to recognise speech with noise in the input audio.


I think the real problem is social, no one wants to hear you talking at your computer in a coffee shop.


That would quickly change if people actually wanted to do it though. Social norms are just unwritten agreements - if enough people decide to dictate their novels and blog posts in coffeeshops then the default will quickly change so talking is acceptable.


I think you underestimate the resistance. The problem isn't just that people don't want to do it, it's also that people don't want other people to do it because it's annoying.

It might pass in a coffee shop, it probably won't in an airplane, it will almost certainly never pass in a library. People will try it regardless of the appropriateness of the location, because some people aren't aware or are assholes, and as a result the entire technology will get a bad rap. See google glass as an example.


People talk to each other all the time in coffee shops. If talking to your computer becomes as natural as talking to a friend, why won't it be acceptable in a shop?


People talk to each other all the time in airplanes too, and even libraries, but it turns out that people are both more understanding of people talking to other humans currently present (rather than say on a cell phone) and people are better at talking to other humans currently present respectfully than they are at talking to devices (or at least people via devices, but I think we can extrapolate).

I agree if it was already the social norm it wouldn't be a problem (same with Google glass), but it turns out that the technology being ready isn't always enough to make the social norms change.


I think part of the issue is that we're pretty good at focusing and filtering. If we're trying, we can focus on a single voice when the room is noisy. The computer doesn't know which to choose and all voices seem equally important.


I've often asked Google Assistant things in packed bars and it works great.


Wear some sort of sound containing bubble around your head, fits in nicely with the whole pandemic thing :)



I often use Siri when I'm driving to send text messages and it's OK. The speech recognition in Google Maps also works great.

But beyond that, I don't really use these "digital assistants" because you can't teach them when they're wrong, so after 2-3 failed attempts at a request, you lose interest because you can't trust the system.


I generally loathe voice input on computers & phones. I have very specific use cases for the two assistants I've created. "Open the pod bay doors Hal" = open the garage door (but only because it is funny).

"Start the dust extractor Hal" = start the workshop dust extractor, but I no longer use this since upgrading to remote start/power line detection on my dust extractor.

"Hey producer, <various commands to control video production software and PTZ cameras>" = start recording/cut cameras/re read the prompt/extreme close-up on keyboard/pan the camera to me/etc for video production in a one-man multi-camera recording setup.

Everything else, when it comes to NLP, chat bots and voice recognition is "too much trouble" and I find hitting a dedicated button or punching in to a few numbers on a physical keypad to be easier and more reliable than any voice interface.


In my opinion, the speech-driven computing in Star Trek is the best depiction of the technology.

The interaction is very fluid, low-latency, and accurate, and the system doesn't force itself on you: there are still plenty of non-speech-based user interfaces to be found all over a starship.


To avoid wrist pain, I have started dictating quite a lot as a way to enter large amounts of text into my laptop and phone. It's really very good, especially on android. Aside from the relatively large downside of it being noisy, I like it almost as much as I did typing.


Do you find yourself going back and revising what you've dictated? When I type, I'm frequently pausing, going to other parts of the document, and deleting things I've already written. All of these actions I find more annoying when dictating. Not to mention some of the baffling Random capitalization Choices and, over/under insertion of punctuation.

In general, I find dictation useful on my phone to take notes to myself. And maybe to send a message in a chat situation. But it just doesn't work for me for anything slightly more formal / long.


I do need to go back and fix problems, such as poor handling of punctuation on android and poor handling of capitalization on Mac. Fixing a few things is much less work on my wrists than typing the whole thing out, however, so I'm still much happier with speech to text.

At this point I'm dictating even my (work) design docs and (personal) blog posts. I do think I'm a bit less fluent when using speech to text, because it's harder for me to jump around and make slight word choice improvements, but not by much?


Just say "dictated but not read" at the end and leave all the mistakes in there. Problem solved.


I expect you're joking, but in case you're not I would consider that very rude in almost any situation. I don't always catch all of my mistakes, but I definitely read over and check for them.


I think a less terse disclaimer like "Dictated, apologies for mistakes" could help.


Dictating is a much easier task, since it doesn't necessarily require comprehension.


Dictating in the sense of entering text, as one might dictate to a secretary. Definitely involving comprehension.

(For example, I dictated this reply)


We really need to stop expecting so much from speech in technology. It simply isn't a great input method. It's loud, it lacks privacy, and for short commands it takes way too long.

I think a lot of people are counting on speech to bring us into a sort of Star Trek future.

The real game changer for input is along the lines of what the neural lace is supposed to be. Cognitive input. Silent, fast, efficient. In many cases, once the technology is mature, people won't even have to internally verbalize commands. Just look at a light and desire it to be dimmer... it dims. "Typing" at the speed of internalized thought will also be amazing.

Every time I hear someone (including myself) tripping over "OK Google" I cringe.


> It simply isn't a great input method

Loud is relative, privacy is an aimless indictment that's orthogonal, and as for brevity?

Speech is a fantastic tool for communication, which includes input and output. It's part of why most large animals, for which quick communication is imperative, use it. It's imprecise, which is the problem that machines are not good at dealing with. It was a good direction back when machines were initially trained with our speech for better accuracy, but now passive listening on devices isn't even used for that!

> The real game changer for input is along the lines of what the neural lace is supposed to be. Cognitive input. Silent, fast, efficient.

Silent, sure. The human mind is rather random and highly variable between individuals and ages; I would not call it fast or efficient. Then again, speech-to-text is contextual cognitive input. Without drugs or intentional (minor) damage to the brain, I don't expect neural implants to be very effective, even in the next 100 years.


And then you have a thought pop into your head to send your manager an email calling them a fuckhead, so your phone makes it happen.


Meanwhile, Google trains their systems on 3 million hours of YouTube audio, gaining 30% better accuracy:

https://arxiv.org/abs/2010.12096

It improves much faster than you might think.


I always thought it was a bit dumb for Google to name their voice assistant 'Google'. First, you cannot ever upgrade the name; it's THE name. Second, every time the assistant screws up, Google gets the blame tied directly to their name.


It's been a while, but I was able to get Google to recognize a different phrase by repeating it several times during the initial training.


Recently I switched on the auto-generated subtitles on a German video, and they were not just bad. It was basically impossible to even guess what the original sentence was meant to be. Totally useless technology in that case.


There are good use cases apart from Dictation for Speech tech:

1. Voice-based operations in factories; construction workers who want to keep both hands free but still need to navigate via a device

2. Use cases while Driving. E.g. A Driver who is delivering goods.

3. Call centre - Analytics of audio calls etc. Can have many use cases

4. Voice assistants like Alexa and Siri. Mind that Alexa and Siri have a vision to do more than just music.

5. Any use case where visual interface is either not there or visual is not an easy option for user.

Speech tech is challenging when you have to deal with noise or want to do speech-to-text on low-profile devices (on the edge).


> 1. Voice based operations in factories, construction workers who want to have both the hands free but want to navigate via a device

They can't understand speech properly in a quiet environment, and you want them to get your commands on a factory floor or construction site? :)


Navigating a menu (with a limited set of choices) could actually be useful, and - given the limited number of "valid" commands - it should be possible to overcome/filter background noise.
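With a closed command set, the system doesn't even need open-ended understanding; fuzzy-matching the (noisy) transcript against the known phrases goes a long way. A sketch using only the Python standard library, with made-up example commands:

  import difflib

  VALID_COMMANDS = ["start conveyor", "stop conveyor", "next item", "repeat order"]

  def match_command(noisy_transcript, cutoff=0.6):
      # Returns the closest valid command, or None if nothing is similar enough.
      hits = difflib.get_close_matches(noisy_transcript.lower(),
                                       VALID_COMMANDS, n=1, cutoff=cutoff)
      return hits[0] if hits else None

  print(match_command("stob conveyer"))            # -> "stop conveyor"
  print(match_command("open the pod bay doors"))   # -> None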


Google/YouTube "voice picking in warehouses"... there are products built for this.


A warehouse a factory is not. Try that in the same hall with 50 lathes.


They wear powerful noise-cancelling headsets to cancel the noise.


IMHO, the issue with speech technology is that it is being developed by companies that want to abuse your home (device, ...).

There is just no way I will let a Google device that listens to my conversations into my home (let me be quick - no, my phone doesn't listen, and yes, I am sure; its maker wouldn't recognize it any more).

Or Amazon. Or Facebook. Sure, speech technology could be useful, but not from those companies.

On the other hand, no one else has found a usable way to monetize the technology (i.e. other than spying on people).

I think that this is THE problem of speech recognition.


I'd like to go back in time, fire up Dragon Dictation, and see how far we've come.

The Pentium processor's power over the 486 was supposed to be the missing link to 'working' speech-to-text.


Please let me know of the results when you do. It unfortunately doesn't have a free trial.


Petty gouda wince thestraining fun.

AFAIK it was at the level of super-funky chording keyboards; with a lot of investment it could be a cool and impressive input method, but the personal investment side was too much for most users. There was an "assistive technology" user base that I don't think they ever realized they had.


It's the classic IT scam that is the entire industry.

When people demo Speech Technology they have the computer doing general AI in response.

Things that are impossible with text or any other computer input - things that won't be possible for decades.

The reality is, no one wants speech technology; we want a computer that can order a pizza from the simple typed words "get me a pizza".

Heck, if you pulled that off, people would even put up with the far more annoying 'talking' to the computer to access that tech.

What people really want is text to speech. But that's too hard to do the AI scam on.


Because none of them are an improvement over a shell.

Maybe if there were a better way to pronounce "open paren" and "make-vector" we could just yell scheme at the nearest computer.

IMO you're never going to get anywhere trying to make shells speak natural language. Even if you crammed a person into the box there's just way too much you could ask to narrow it down without something formal. At the end of the day you'll have to have something that looks like sh or any other formal language.


I like dictating memos or text messages, especially when I'm walking around town and need to pay attention to traffic. It looks like I'm just on my phone as normal, so nobody sees that I'm actually just talking to myself. Sometimes the mistakes the AI makes are hilarious, but usually they're just annoying, as I need to manually edit them. But in the end, doing that takes less time than thumbing it in.


I may date myself here, but the first thing that popped into my head when I read the title of this post was this scene from ST IV:

https://youtu.be/hShY6xZWVGE

It's kinda funny watching someone without a background in CS repeatedly ask google/alexa/siri a question in different ways that you know isn't going to elicit any kind of useful response.


This is a great example of "it's not my preferred mode, so it must not be anyone else's."

Dictation is widely used in medical transcription.

Dictation is a killer way to write a first draft quickly, transcribe rough written notes after a meeting, etc. Also, about half of my emails are dictated, and I know I'm not the only one. It takes some time to get used to, but once you're there (like touch typing!), you can't go back.

..etc..


Progress does seem very slow since I could first talk to an Android phone - which was at least 10 years ago. I'm using iPhones now but even simple things such as "show me pictures of a Boeing 737" or "call the Home Depot on 1st Avenue in Amarillo" only have about a 50% success rate - enough that unless my hands are actively busy, I'd rather just type it.


I don't know what speech tech will be good for, probably a lot of things, but typing I think will remain the preferred mode of communication indefinitely. It is already the more sophisticated approach. Using speech is kind of like taking a new kind of paint that only works on paper and trying to make it work on cave walls too. Nevertheless, we might discover something in the process.


You do talk to your computer. And the accuracy and speed of it has improved dramatically.

Your computer has yet to process meaning.

That is where we are stuck right now.


Something the VR/AR/SR crowds don't always seem to understand: we have been manipulating tools with opposable thumbs for longer than we have been using language. A keyboard and mouse are not a shoddy stopgap for the utopian future in which we control things directly with our minds. They simply are how we control things with our minds.


I use Siri on my phone all the time, but for one thing and one thing only: reminders.

"Hey Siri, remind me in 2 hours to do X"

"Hey Siri, remind me when I get home to do Y"

"Hey Siri, remind me next time I go to Costco to buy Z"

It's pretty quick to set up reminders by voice, especially location-based reminders. Almost anything else I'd rather do via a UI.


My first experience with conversational tech was using IBM's ViaVoice 98. I still speak to these modern assistants at the same slow, spaced-out pace.

I was surprised by Google Meet's transcription system, which to me is the most accurate I've used so far. Same with Google Docs dictation.


Pretty much the only thing I've ever used the Google assistant for was to ask it for definitions of terms from Urban Dictionary, then giggle like a schoolchild while Google read them out to me.

That was fun for about 10 minutes. It's been disabled ever since.


I still don't get why Google doesn't integrate the Google Now function into Google Chrome.

Like imagine saying "Hey, open a reddit/meme subreddit on the side" while you are watching YouTube. Or imagine saying "hey, Wikipedia that".


I always type at my computer, and 90% of the time use voice dictation on my phone because I don't text enough to be good at it.

Google's voice capture often astonishes me with its accuracy, and nearly as often makes me laugh at how bad it can get things.


I talk to my phone in the car all the time.

Hands free.

"send message to john"

"how far is it to"

"get directions to"

"play podcast"

"play audiobook"


And then your phone transfers the recorded sounds to someone else's computer to do the work. You still don't talk to your computer phone. You talk through it.


It takes all of 2 minutes of trying to use Alexa each day to make me stop. "Alexa, play the damn song I have you play every day; I will cite title and author to you so you don't screw it up." "A random song? OK!"


More than anything, the problem with these assistants is it’s still socially awkward to talk to a computer.

I can imagine people using it exponentially more when you no longer get weird looks when you say “Ok google” or whatever in a supermarket.


Interestingly, I've noticed that my friends who are less technically savvy have actually started to use voice control on a regular basis, probably because they are more easily frustrated with the existing interaction modalities.


Voice-activated code snippets or command-line switches would be nice. "Hey ffmpeg, crop the first 15 seconds of input.mp4 and increase the volume by 15%."

Modular, so I can switch out the speech recognition, the NLP, and the fulfillment.
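A toy sketch of just the fulfillment end of that pipeline, assuming some upstream speech + NLP stage has already produced a structured intent (the intent format is invented, and "crop the first 15 seconds" is read here as "trim them off"; only the ffmpeg flags are real):

  import subprocess

  def fulfill(intent):
      # Invented intent shape: trim_start_seconds drops the first N seconds,
      # volume_gain of 1.15 means +15%.
      cmd = ["ffmpeg",
             "-ss", str(intent["trim_start_seconds"]),
             "-i", intent["input"],
             "-filter:a", "volume={}".format(intent["volume_gain"]),
             intent["output"]]
      subprocess.run(cmd, check=True)

  fulfill({"input": "input.mp4", "output": "out.mp4",
           "trim_start_seconds": 15, "volume_gain": 1.15})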



Understanding speech (language context) is a much harder problem, I think.


I hate talking to people; I absolutely do not want to talk to machinery, unless I'm cussing it and beating on it with something heavy.


In fairness, I don't talk to people either.


I talk to my phone instead of using my thumbs quite frequently....but never to do anything serious.


Good. Please don't. And if you do, please don't do it in an open-plan office next to me.


The privacy point in the article is legit, and a concern with all kinds of AI-mediated alternative input.

It may mean that if you aren't comfortable sharing rich context about your life with a cloud platform, you'll get left behind by technology.


I'd rather type all day than talk. But Comcast voice remote is awesome.


The problem is we (as in HN) can type faster than we can think of the words to describe what we want.

My wife always talks to Google. It seems backwards to me. The next generation should skip speaking and wirelessly read thoughts. Language/speaking is a bottleneck.


> The problem is we (as in hn) can type faster than we can think of the words to describe.

Do you suppose it's possible that speed isn't really what counts? In the era of typewriters, authors like C. S. Lewis opined that typing obscured their thinking because it was too fast. It didn't let them savor the words effectively. Maybe what we really need is to slow down?


> Maybe what we really need is to slow down?

In some cases, we should slow down. But in general, I disagree, especially because of the way in which we use technology now. Our smartphones are becoming a bit of an external brain to us. They hold contacts, conversations, searches, notes, musings, etc. We use them to recall a fact or answer a question without really even thinking about it. I think it is only a problem now because smartphones and the like aren't really designed to improve our lives, but as ad delivery platforms to manipulate us into buying products and services we probably don't need. I would love better tech that helped offload mental tasks my human brain isn't great at, but computer brains are, and let me focus on things my human brain is good at as well as things I just would rather think about.


Good point. I never use speech commands for the same reason I don't use handwriting recognition. I just hate handwriting; typing is my default mode of operating a machine. I can't play with a joystick either, come to think of it.


Comparing creative work to turning on a lamp or checking the weather doesn't seem quite right to me



