
This Uncanny Valley of Voice Recognition - hodgesmr
http://zachholman.com/posts/uncanny-valley/
======
anatari
Voice recognition is not in an uncanny valley. Uncanny valley means there is a
point where something that is less real is better than something that is more
real. Pixar improves the scene by adding elements that are unrealistic.
Another example is a preference for lower frame rate movies.

Right now, every incremental improvement to voice recognition improves its
usefulness. It might appear that we're in an uncanny valley because voice
recognition is barely usable right now versus completely unusable in the past,
but no one prefers worse voice recognition over better voice recognition.

~~~
matthew-wegner
I think there is a real danger that people are modifying/learning how to speak
to computer voice recognition software. If voice recognition can't quickly
become able to parse natural language, it will inevitably have to parse "I'm
talking to a dumb computer" cadence and inflection instead.

Incremental improvements are very bad in this regard.

These things are hard to reverse, too (people still speak with a very distinct
"I'm speaking on a telephone" cadence today).

~~~
woodson
IMHO, people (and hence language) will always adapt in certain ways to get the
message across. People already learned how to "google" and expect the same
style of search queries to be effective elsewhere. When speaking on the
telephone, people tend to slightly change their voice to counteract the
channel noise (with acoustic consequences such as increased fundamental
frequency ["pitch"], etc.). I would be surprised if a similar adaptation
didn't happen for human-computer voice interaction, which would ultimately
help make it work well enough to be useful. (Of course, using speech
recognition to transcribe human-to-human interaction will still be barely
usable.)

------
51Cards
"The Uncanny Valley is a term that originated from the computer animation
industry. In 1992, while finishing A Bug’s Life, Pixar had to build a digital
valley for..."

Ummm....

Wikipedia: "The term was coined by the robotics professor Masahiro Mori as
Bukimi no Tani Genshō in 1970. The hypothesis has been linked to Ernst
Jentsch's concept of the "uncanny" identified in a 1906 essay "On the
Psychology of the Uncanny"."

~~~
crazygringo
If you read the rest of the paragraph, it's _quite_ clear that the entire
description is intentionally humorous nonsense. I mean, the third sentence
_had_ to make that clear... ;)

> _They ended up illustrating a crate of Campbell’s® Tomato Soup™ in the
> corner to make it feel a bit more canny._

~~~
nicolethenerd
The missing word kind of broke the joke for me ("so he [can/could] get a
vasectomy") - I got so hung up on wondering whether part of the sentence was
missing or whether something had been lost in translation that I missed the
fact that it was just an attempt at humor.

Copy editing - it's important!

~~~
mtVessel
Agreed. Without that auxiliary verb, it was really hard to know the author was
kidding about Buzz Lightyear getting a vasectomy in A Bug's Life.

------
b6
I apologize, I'm pretty sure I feel the way I do because I'm getting old, but
here's how I feel: talking to computers is a really, really bad interface, so
I don't do it.

One reason it's bad is that the sounds we make are mush. It's a miracle if a
computer system can correctly retrieve the words from an utterance. Another
reason it's bad is that the words we say are nonsense. Our sentences aren't
parseable; they don't conform to any actual grammar.

So I see it as another example of people selling something that's supposed to
be more convenient than what we already have, but for many reasons, it
probably isn't. One day it may be, but it wouldn't be surprising for people to
be selling it as more convenient for many years before it actually is.

I'm not criticizing the technology -- it's amazing. It's just clear to me that
it isn't ready to be invited into my life. I consider it inevitable that we
will eventually lose control of technology, but we can at least try to be
judicious.

~~~
wodenokoto
Obviously your sentences are parseable. Millions of humans around the world
understand what you are saying.

The usefulness of this didn't dawn on me until I read an interview with
Andrew Ng where he talked about the huge number of voice searches in China.
Many adults can't type, so searching by voice is much more convenient than
drawing the characters. Many are downright illiterate, or young children not
old enough to read that much yet.

~~~
b6
> Obviously your sentences are parseable. Millions of humans around the world
> understand what you are saying.

No, I mean parseable in the way that source code is parsed. I'm not aware of
any human language that is parseable by a computer. Humans are able to
understand each other because our brains are, loosely speaking, magical.

~~~
wodenokoto
That's because natural language is ambiguous. Even a simple sentence like "I
saw a man on the hill with a telescope" has multiple meanings.

This isn't solved by magic, but by statistics. Ask your friends who has the
telescope in the previous sentence: some will say the man, and some will say
"I". Without context we can't tell, but we can judge which reading is more
likely given who had the item in previous sentences of similar structure (the
prior). Then we usually also have some context.

Now we can add context: "I got a telescope for my birthday and was eager to
use it. The next day I saw a man on the hill with my telescope". Now most
people would expect the speaker to be looking through the telescope, but the
man might have stolen it and taken it to the hill. Even humans have to guess.
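
A toy sketch of the "prior" idea above: estimate from counts of similar sentences how often "with a telescope" attaches to the verb (the speaker holds it) versus the noun (the man holds it), and pick the likelier reading. Plain Python; the counts are invented for illustration.

```python
from collections import Counter

# Invented observations of (verb, attachment) in sentences similar to
# "I saw a man on the hill with a telescope": "verb" means the phrase
# modified the verb (the speaker holds the telescope), "noun" means it
# modified the noun (the man holds it).
observations = [
    ("saw", "verb"), ("saw", "verb"), ("saw", "verb"),
    ("saw", "noun"),
]
counts = Counter(observations)

def attachment_probability(verb, attachment):
    """P(attachment | verb), estimated by relative frequency."""
    total = sum(c for (v, _), c in counts.items() if v == verb)
    return counts[(verb, attachment)] / total

# With no further context, prefer the more probable parse (the prior).
p_verb = attachment_probability("saw", "verb")
p_noun = attachment_probability("saw", "noun")
best = "verb" if p_verb > p_noun else "noun"
print(best, p_verb, p_noun)  # -> verb 0.75 0.25
```

Real systems condition on far richer context, but the decision rule is the same: score each reading and take the argmax.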

------
Jgrubb
This post reads like he's experimenting with the Uncanny Valley of Auto-
generated Blogging.

------
zanny
I'm still bummed that, with all these companies implementing voice
recognition, there is still nothing close to a FOSS option. It is a major
field, and the kind of software that takes a huge amount of work to get right.
I feel like in the future free operating systems are going to look archaic
without it, but it does not seem like the kind of thing any small club of
friends can pick up and build to match Google or Apple.

The same applies to OCR and other photo recognition techniques like faces or
red eye. Tesseract is probably the largest free software OCR project but it
still seems to do so much worse than proprietary Adobe and Microsoft products.
At least the OCR reader that came with my S4 does a terrible job, though it
might be using Tesseract behind the scenes, since I think it's the one from
F-Droid.

Digikam does all right at red-eye correction, but it does it with a layered
filter rather than any recognition of eyes. It also sometimes can find faces,
but not nearly as accurately as Google can.

All these fuzzy logic fields are things that take huge code bases and a lot of
R&D to get right and nobody in the free software movement has the organization
or just the raw bank to make them happen from what I can see. Red Hat surely
is not investing in them (kind of outside their enterprise / server domain)
and they are about the only company prominent and powerful enough to do it.

~~~
modeless
> it does not seem like the kind of thing any small club of friends can pick
> up and build to match Google or Apple at

Actually I think it's not out of the question now. The recent advances in
recognition accuracy are mostly due to deep neural nets. The research is all
published open access, and the cutting-edge tools are mostly open source
(Theano, Torch, Caffe). Training neural nets is actually a lot simpler than
the old methods of doing speech recognition; I think it's much more accessible
to a small team. The only really difficult requirement is lots and lots of
clean labeled data for training.
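
To illustrate how little speech-specific machinery the training loop itself needs, here is a complete two-layer net trained by plain gradient descent in numpy on a toy task (XOR). This is a sketch of the general workflow the big frameworks scale up, not a speech model; the architecture, seed, and learning rate are arbitrary choices.

```python
import numpy as np

# A minimal two-layer network trained by full-batch gradient descent on
# XOR: a toy stand-in for the "train a net on labeled data" workflow.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # Forward pass: tanh hidden layer, sigmoid output.
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass (squared-error loss), then a gradient step.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(0)

predictions = (out > 0.5).astype(int).ravel()
print(predictions)
```

Swap the toy arrays for spectrogram frames and phone (or character) labels and the loop is conceptually the same; the hard part, as noted above, is the mountain of clean labeled data, not the code.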

~~~
woodson
I don't really see how the "old" methods were less accessible. There were
tools such as HTK, CMU Sphinx, etc., or SRILM for language modelling, each
with documentation and a large user base. Granted, a lot of fiddling is
involved if one wants to use speaker adaptive training (MLLR, VTLN), feature
transforms (HLDA, MLLT), MLP features (TANDEM), etc., but DNN approaches come
with their own set of screws to tweak.

It's just hard to make something work really well for a specific use case;
when contributors to an open-source project are all trying to scratch their
own itch (make it work for their specific use [language, vocabulary, etc.]),
the result may not be universally satisfying.

~~~
modeless
The difference is that the old methods were large systems made up of many
different pieces that all required a ton of domain knowledge specific to
speech and language. Training DNNs requires a lot of knowledge about DNNs, but
not nearly as much knowledge about speech. Knowledge of how to train DNNs is
highly transferable between domains like speech and vision. Similarly, the
actual code can be mostly shared as well; something like Theano would be just
as suited to running speech nets as vision nets.

I don't think we're quite there yet, but DNNs have the potential to replace
every piece of the speech pipeline with one single net that gets audio samples
on one side and spits out characters on the other. All those acronyms you
mentioned (with many, many PhD theses behind them) will be irrelevant, in the
same way that tons of previously successful specialized computer vision
feature detectors (HoG, SIFT, SURF, etc.) are now irrelevant to the state of
the art in object recognition.

~~~
kylebgorman
Your prediction that DNNs will replace much of the pipeline is very
interesting to me, but I hypothesize that you're at least partially wrong. I
predict that DNNs will impact early stages of the pipeline which operate on
continuously valued inputs, but I am skeptical that DNNs will ultimately
be the best solution for late discrete processing (e.g., decoding, language
modeling). That DNNs ever perform well in discrete classification tasks just
tells me we haven't spent enough time feature-engineering.
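
For context, the "language modeling" stage in classical pipelines is typically a count-based n-gram model. A toy add-alpha-smoothed bigram scorer over an invented mini-corpus, used to rank two competing recognition hypotheses:

```python
from collections import Counter
from math import log

# Tiny invented corpus; real language models are trained on billions
# of words, but the counting is the same in spirit.
corpus = "the cat sat on the mat the cat ate".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def log_prob(sentence, alpha=1.0, vocab=None):
    """Add-alpha smoothed bigram log-probability of a word sequence."""
    vocab = vocab or len(unigrams)
    words = sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += log((bigrams[(prev, word)] + alpha) /
                     (unigrams[prev] + alpha * vocab))
    return total

# The grammatical hypothesis outscores the scrambled one.
print(log_prob("the cat sat") > log_prob("cat the sat"))  # -> True
```

The decoder then combines scores like these with the acoustic model's scores to pick the best word sequence, which is exactly the discrete late-stage processing in question.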

~~~
modeless
It's the ability of DNNs to replace feature engineering that makes them
interesting. They have completely obsoleted feature engineering in object
recognition in just a few short years. Have you seen the latest DNN results in
translation and image captioning? I think DNNs are quickly going to surpass
the state of the art in language modeling.

------
aaronpk
Is it just me or is the author using the term "Uncanny Valley" completely
wrong? Ignoring the silly Pixar story, I still don't understand how voice
recognition (or more accurately, speech recognition) is currently in the
uncanny valley.

You know when your GPS says "recalculating" in a condescending voice?
_That's_ the uncanny valley of text-to-speech.

~~~
recursive
I'm no valley expert, but it seems to me that uncanny valley refers to an
artificial system intended to mimic a natural one. It mostly gets the mimicry
right, but not enough that we are completely fooled. This freaks some people
out.

Siri's UI intends to mimic a person that understands what you are saying. In
practice, it gets it wrong in hilarious and frustrating ways, breaking the
illusion.

So "uncanny valley" seems ok to me.

~~~
brianmcc
Is the term not about getting so close to real that the remaining shortfall
makes us uncomfortable?

E.g. something "50% real" is so far off, we psychologically dismiss it as a
cartoon, drawing, whatever. It's not trying to compete with real.

Something "98% real" freaks us out though. Surprised I have not yet seen this
link:

[http://blog.codinghorror.com/avoiding-the-uncanny-valley-of-user-interface/](http://blog.codinghorror.com/avoiding-the-uncanny-valley-of-user-interface/)

------
VLM
One important point about voice recognition is that in the short term it's OK
if it's slower and harder to use than superior technologies, as long as
everyone knows it costs a lot of money.

Once that fad aspect blows over, usage plummets and it's forgotten. See the
Kinect, the Nintendo Power Glove, QR codes, Google Glass, the CueCat, or a
zillion other examples that are in, or now entering, 8-track-hood.

~~~
eru
QR codes have found some useful niches, e.g. for setting up my phone as a
second-factor authentication device.
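
The QR code in that setup flow typically just encodes an otpauth:// provisioning URI (the de facto "Key Uri" format popularized by Google Authenticator). A sketch with made-up issuer, account, and secret:

```python
from base64 import b32encode
from urllib.parse import quote

# Build the otpauth:// URI that a TOTP setup QR code encodes. The
# issuer, account, and secret below are invented example values.
def totp_uri(issuer, account, raw_secret: bytes):
    secret = b32encode(raw_secret).decode().rstrip("=")  # base32, no padding
    label = quote(f"{issuer}:{account}")
    return f"otpauth://totp/{label}?secret={secret}&issuer={quote(issuer)}"

uri = totp_uri("ExampleCorp", "alice@example.com", b"supersecretkey12")
print(uri)
```

The phone's authenticator app scans the QR code, parses this URI, and stores the shared secret; from then on both sides can derive the same time-based codes.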

------
wodenokoto
Uncanny valley did not originate in the computer animation industry. It
originated in robotics, and was coined by Masahiro Mori in 1970.

[http://en.m.wikipedia.org/wiki/Masahiro_Mori](http://en.m.wikipedia.org/wiki/Masahiro_Mori)

------
idbehold
> The Uncanny Valley is a term that originated from the computer animation
> industry.

Uhh, I don't think so:
[https://en.wikipedia.org/wiki/Uncanny_valley#Etymology](https://en.wikipedia.org/wiki/Uncanny_valley#Etymology)

~~~
soylentcola
In fairness, this is followed up with:

> In 1992, while finishing A Bug’s Life, Pixar had to build a digital valley
> for Buzz Lightyear to drive his Ford® F-150™ pickup through on the way to
> the hospital so he get a vasectomy.

So I'm pretty sure the author is being deliberately silly.

~~~
zik
Just not actually funny.

------
yehat
The Uncanny Valley of HN comments gives plenty of credit to the author's post
(which I enjoyed). Most of the comments sound like they're coming from an
underdeveloped AI: awkward perception and a total lack of a human sense of
humor.

------
kleiba
"Voice recognition" sounds more reminiscent of speaker identification than
speech recognition to me. Although I don't work on that myself, my day job is
in a related field, and even IBM's "speech to text" is a term I never hear
being used (unlike for instance "text to speech"). People around me either say
"speech recognition" or "ASR" (for automatic speech recognition).

I'd be interested to learn, though, if/where alternative terms are in more
widespread use.

~~~
woodson
I agree. I find the use of the term odd and it is not commonly used in an
academic context; however, I think I recall having seen it being used
informally in the context of voice commands (i.e., controlling a PC or
appliance) or, more generally, voice user interfaces.

------
alttab
The truly best voice recognition pretty much has to be hooked up to an
uber-AI. Gaining a friend, yes.

Imagine if they could program things into it that would make your overall
life better by slightly altering your behavior. For instance, if you asked
"Siri" to remind you every 40 minutes to take a cigarette break, I can
imagine her slowly weaning you off, etc.

------
sukilot
I think Zach was drunk on this one.

Also, "eat a dick", really?

------
shanselman
Um. No.

------
tashoecraft
I'd really like it if someone condensed that article and removed all attempts
to be funny or sound extremely clever.

~~~
smlacy
"Siri's voice recognition is kind of bad some of the time. I've extrapolated
this observation to all other voice recognition systems"

~~~
StavrosK
I have been extremely impressed by Google's voice recognition. It got things
right that I never thought it would, such as homophones or proper names.

~~~
gok
I don't mean to belittle you or Google's speech team, but neither homophones
nor proper names are considered hard problems in modern automatic speech
recognition.

~~~
StavrosK
I don't remember the case exactly, but it was something that was hard to
resolve, at least it seemed so to me.

