
Voice and the uncanny valley of AI - kawera
http://ben-evans.com/benedictevans/2017/2/22/voice-and-the-uncanny-valley-of-ai
======
pesenti
Having developed and sold virtual agents for Watson, this was one of the major
issues I faced. I had to make it clear to our customers (and especially to the
sales and marketing teams) that agents only work in a narrow domain, whose
boundaries need to be clear to the user. But that often got lost in
translation and expectations were never met (especially in the early days of
Watson, when our PR was out of control).

That said, it's not as limited as the author makes it out to be. There are
classes of open requests that can be handled directly from content (i.e., not
like an expert system or IVR, but more like a very smart NL search) without
having to identify a precise intent. The Jeopardy system was a perfect example
of that, and Google can now answer many question types (not just factoids but
also questions with long answers, like how-tos) directly from content.
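A toy illustration of the "answer directly from content" idea: score candidate passages by word overlap with the question and return the best one, with no intent schema in between. The passages and question here are invented examples, and real systems use far richer models than bag-of-words overlap.

```python
import re

def tokens(text):
    """Lowercase bag of words, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def score(question, passage):
    # Crude relevance: count of shared words between question and passage.
    return len(tokens(question) & tokens(passage))

# Hypothetical content the system can answer from.
passages = [
    "To reset the router, hold the recessed button for ten seconds.",
    "The router ships with a twelve month limited warranty.",
]

question = "how do I reset my router"
best = max(passages, key=lambda p: score(question, p))
print(best)  # the reset instructions, no intent classification needed
```

The point is that no precise intent ("RESET_DEVICE") was ever identified; the answer fell out of matching the request against content.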

Dialog is the real limiting factor. Right now most dialog in production is
scripted, with some smart features like slot filling, but it is still much
closer to an expert system than to a statistical, example-based system.
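The scripted slot-filling pattern mentioned above can be sketched in a few lines: the "dialog" is a fixed frame of required slots, the system extracts what it can from each utterance, and a scripted prompt asks for whatever is still missing. The slot names and regex patterns here are hypothetical illustrations, not anything from a real production system.

```python
import re

# A fixed frame: the dialog is "done" when all of these are filled.
REQUIRED_SLOTS = ["origin", "destination", "date"]

# Toy extraction rules; production systems use trained taggers instead.
PATTERNS = {
    "origin": re.compile(r"\bfrom (\w+)"),
    "destination": re.compile(r"\bto (\w+)"),
    "date": re.compile(r"\bon (\w+)"),
}

def extract_slots(utterance, slots):
    """Fill any empty slots whose pattern matches the utterance."""
    for name, pattern in PATTERNS.items():
        match = pattern.search(utterance.lower())
        if match and name not in slots:
            slots[name] = match.group(1)
    return slots

def next_prompt(slots):
    """Return a scripted prompt for the first missing slot, or None."""
    for name in REQUIRED_SLOTS:
        if name not in slots:
            return f"What is the {name}?"
    return None  # frame complete, the action can fire

slots = {}
extract_slots("Book a flight from Boston to Paris", slots)
print(next_prompt(slots))  # asks for the date, the only missing slot
extract_slots("On Tuesday", slots)
print(next_prompt(slots))  # None: frame complete
```

Everything statistical could be swapped into `extract_slots`, but the dialog flow itself stays a script, which is exactly why it feels closer to an expert system.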

------
ghaff
I'm not sure uncanny valley is the right term for this though I see the
analogy.

Rather, I think it's yet another case where success (or at least impressive
advances) in a narrow domain gets conflated with something much broader:
because if a human can do A, they're clearly at least on the cusp of doing B
and C too.

In the case of our digital personal assistants, they've actually gotten pretty
good at voice recognition, at least given certain parameters of accent etc.
But what that means is that they're good at recognizing the appropriate wizard
incantations and taking the corresponding action. Get off script? It's worse
than talking to one of those outsourced call centers we all hate.

~~~
bsenftner
I'd say he is spot on with the analogy. He's describing a situation where the
actual solution to the problem at hand (a quality Voice experience) is just
the opening of a door to a much more complex charade than originally
understood.

I spent a decade becoming an expert at creating digital doubles of people,
which is where the term "uncanny valley" originates. The 3D-graphics-person
situation is very similar. Once one develops a method of generating a likeness
of someone, it's not right because the model does not have the real person's
hair; then, after creating their hair, the human model needs clothing models
in the style that person would actually wear. But it is still not right,
because the character does not exhibit the specific facial expressions the
real person uses that are characteristic of their personality. After that come
their characteristic body movements: how they stand, sit, how they idle.

This is very similar to the issue with quality Voice. It is not just creating
a Voice, just as it is not just creating a 3D model; it is in fact creating a
model of reality that the software holds, one that can be queried both by
external humans and by the software itself, since that self-conversation
within the software is necessary to fulfill the simulation and, finally,
generate an experience that pierces the uncanny valley for both voice and
avatars. But that is going to require 2-200 times more computational capacity
than we are throwing around now. The devil is in the details.

~~~
ThomPete
Uncanny valley is about knowing something is off, even though it's almost
impossible to see what, and feeling eerie about it.

I don't see the analogy here, and as a user of Google Home with my entire
family, it's never been anything close to what uncanny valley is about.

~~~
tkxxx7
There is another usage I've seen in AI discussions: the uncertainty of the gap
between current tech and general AI, and the chance we may cross it
accidentally.

------
falcolas
IVR is mentioned but not described in the article. It stands for Interactive
Voice Response, which is the industry term for phone trees, of all things.

An interesting tidbit that could have benefited from fleshing out in the
article, especially since it so accurately describes the current state of
voice assistants.
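The comparison is apt because an IVR phone tree is nothing more than a fixed nested menu keyed by inputs. A minimal sketch (menu entries and action names are invented):

```python
# Each node maps a keypress either to a submenu (dict) or to a
# terminal action name (str). The "dialog" is entirely hard-coded.
MENU = {
    "prompt": "Press 1 for billing, 2 for support.",
    "1": {"prompt": "Press 1 for balance, 2 for payments.",
          "1": "read_balance",
          "2": "take_payment"},
    "2": "route_to_support",
}

def navigate(menu, keypresses):
    """Follow a sequence of keypresses down to a terminal action name."""
    node = menu
    for key in keypresses:
        node = node[key]
        if isinstance(node, str):
            return node
    return None  # caller stopped mid-menu

print(navigate(MENU, ["1", "2"]))  # take_payment
```

Replace keypresses with fixed voice "incantations" and this is essentially the structure today's assistants fall back to once you leave the happy path.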

------
viewtransform
I remember when voice recognition was barely usable (the '90s and early
'00s); then, all of a sudden, within the last decade it started showing up
everywhere and was amazingly good. Siri can understand most of what I say to
it, even while driving with music in the background.

What are the successful algorithms that made this leap forward possible?
HMMs? Neural networks? More data and compute? What changed since Dragon
NaturallySpeaking in 1997? Can anyone recommend overview papers or blogs on
this topic I could use to get up to speed?

