Voice recognition is not in an uncanny valley. The uncanny valley means there is a point where something less real is preferable to something more real. Pixar improves a scene by adding elements that are unrealistic. Another example is the preference for lower-frame-rate movies.
Right now, every incremental improvement to voice recognition improves its usefulness. It might appear that we're in an uncanny valley because voice recognition is barely usable right now versus completely unusable in the past, but no one prefers worse voice recognition over better voice recognition.
Of course nobody prefers less accurate transcription, but we're talking about more than transcription here. It's entirely reasonable to prefer a "command line" style interface like the Xbox's over a "conversational" style interface like Siri's, because the "conversational" interface is less reliable even though it's in some sense more "real".
> The uncanny valley is a hypothesis in the field of aesthetics which holds that when features look and move almost, but not exactly, like natural beings, it causes a response of revulsion among some observers. The "valley" refers to the dip in a graph of the comfort level of beings as subjects move toward a healthy, natural likeness described in a function of a subject's aesthetic acceptability.
Wording by Wikipedia. So apparently, there might be points on the graph where a slight increase in recognition performance actually freaks out some users.
I think there is a real danger that people are modifying/learning how to speak to computer voice recognition software. If voice recognition can't quickly become able to parse natural language, it will inevitably have to parse "I'm talking to a dumb computer" cadence and inflection instead.
Incremental improvements are very bad in this regard.
These things are hard to reverse, too (people still speak with a very distinct "I'm speaking on a telephone" cadence today).
IMHO, people (and hence language) will always adapt in certain ways to get the message across. People already learned how to "google" and expect the same style of search queries to be effective elsewhere. When speaking on the telephone, people tend to slightly change their voice to counteract the channel noise (with acoustic consequences such as increased fundamental frequency ["pitch"], etc.).
I would be surprised if a similar adaptation didn't happen for human-computer voice interaction, which would ultimately help make it work well enough to be useful. (Of course, using speech recognition to transcribe human-to-human interaction will still be barely usable...)
In text adventures there was a limited, English-like grammar that the parser understood, and you had to express your actions in it: get lamp, put lamp on table, eat fish, look sword. This is easier than a parser that tried to let you use fancy English sentences: it would probably get them wrong often, and you would make more mistakes because the line between what you're allowed to say and what you aren't is blurred. That would be a point between the simplistic text adventure grammar and perfect natural language parsing, with the simpler grammar being preferable. I can see the same being true for voice commands.
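For the curious, here's a minimal sketch of the kind of restricted grammar such a parser accepts; the verb and preposition lists are made up for illustration:

```python
# Minimal sketch of a text-adventure-style command parser.
# The vocabulary is hypothetical; real games shipped much larger lists.

VERBS = {"get", "put", "eat", "look", "drop"}
PREPOSITIONS = {"on", "in", "under", "at"}

def parse_command(line):
    """Parse commands of the form VERB NOUN [PREP NOUN]."""
    words = line.lower().split()
    if not words or words[0] not in VERBS:
        return None  # "I don't understand that."
    verb, rest = words[0], words[1:]
    for i, w in enumerate(rest):
        if w in PREPOSITIONS:  # e.g. "put lamp on table"
            return (verb, " ".join(rest[:i]), w, " ".join(rest[i + 1:]))
    return (verb, " ".join(rest), None, None)

print(parse_command("put lamp on table"))  # ('put', 'lamp', 'on', 'table')
print(parse_command("get lamp"))           # ('get', 'lamp', None, None)
```

The point is the narrow surface area: anything outside VERB NOUN [PREP NOUN] is rejected up front, so the user quickly learns where the boundary is instead of probing a blurry one.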
As a side note, it might be funny to hook one of these voice command systems up to a z-machine VM. The command set is very limited as you pointed out, so it should easily be able to handle the input side. And the voices, while still robotic, seem pretty good as well. With games like Zork being fully text based you could easily turn it into a conversational game.
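Something like that hookup is actually feasible today. A hedged sketch, assuming the third-party SpeechRecognition package (plus PyAudio for microphone access) and a dumb-terminal z-machine interpreter like dfrotz on the PATH; "zork1.z5" is a placeholder story file:

```python
# Sketch: pipe recognized speech into a z-machine interpreter.
# Assumes `pip install SpeechRecognition pyaudio` and a dfrotz binary.
# Reading one output line per turn is naive -- real code would need to
# drain the interpreter's output until it prompts again.

import subprocess
import speech_recognition as sr

zm = subprocess.Popen(["dfrotz", "zork1.z5"],
                      stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                      universal_newlines=True)

recognizer = sr.Recognizer()
with sr.Microphone() as mic:
    while True:
        print(zm.stdout.readline(), end="")
        audio = recognizer.listen(mic)
        try:
            command = recognizer.recognize_google(audio)
        except sr.UnknownValueError:
            continue  # didn't catch that; listen again
        zm.stdin.write(command + "\n")
        zm.stdin.flush()
```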
"The Uncanny Valley is a term that originated from the computer animation industry. In 1992, while finishing A Bug’s Life, Pixar had to build a digital valley for..."
Ummm....
Wikipedia: "The term was coined by the robotics professor Masahiro Mori as Bukimi no Tani Genshō in 1970. The hypothesis has been linked to Ernst Jentsch's concept of the "uncanny" identified in a 1906 essay "On the Psychology of the Uncanny"."
If you read the rest of the paragraph, it's quite clear that the entire description is intentionally humorous nonsense. I mean, the third sentence had to make that clear... ;)
> They ended up illustrating a crate of Campbell’s® Tomato Soup™ in the corner to make it feel a bit more canny.
I have mixed feelings about misleading stealth jokes like this (I wouldn't want someone reading to "learn" that Pixar coined the term in 1992), but this one's pretty obvious and probably worth it for the laugh.
The missing word kind of broke the joke for me ("so he [can/could] get a vasectomy") - I got so hung up on wondering whether part of the sentence was missing or whether something had been lost in translation that I missed the fact that it was just an attempt at humor.
I apologize, I'm pretty sure I feel the way I do because I'm getting old, but here's how I feel: talking to computers is a really, really bad interface, so I don't do it.
One reason it's bad is that the sounds we make are mush. It's a miracle if a computer system can correctly retrieve the words from an utterance. Another reason it's bad is that the words we say are nonsense. Our sentences aren't parseable; they don't conform to any actual grammar.
So I see it as another example of people selling something that's supposed to be more convenient than what we already have, but for many reasons, it probably isn't. One day it may be, but it wouldn't be surprising for people to be selling it as more convenient for many years before it actually is.
I'm not criticizing the technology -- it's amazing. It's just clear to me that it isn't ready to be invited into my life. I consider it inevitable that we will eventually lose control of technology, but we can at least try to be judicious.
Yes, you are getting old. Just observe a kid using an iPad and you'll understand how useful it can be. My 5-year-old is able to find pretty much anything she wants (like Play-Doh videos of Frozen characters) through voice search.
Well, current voice recognition tech makes it a crappy user interface, but it could be improved. It does make a really handy feedback device, though. Much like haptic feedback, humans can detect a wide range of audio cues and use them as feedback in an interface. You probably already recognize that the bip on your phone is a Facebook message, the bong is your e-mail alert, and the ding is your text message.
Back in the day when I made my own CarPC, I naturally made my own interface for it too, for, you know... reasons. The input was a mini numpad, but the output was audio, primarily text-to-speech. I could drive around and use the interface (mostly) safely, and be exactly sure of my input choices thanks to the one-handed, nailed-down keypad. Worked like a charm.
If in the future we nail down the voice input to be more accurate, fast and flexible, we can avoid the author's concerns and get something as useful as touch screens but based on our ears and mouths.
Obviously your sentences are parseable. Millions of humans around the world understand what you are saying.
The usefulness of this didn't dawn on me until I read an interview with Andrew Ng where he talked about the huge volume of voice searches in China. Many adults can't type, so searching by voice is much more convenient than drawing the characters. Many are downright illiterate, or are young children not old enough to read that much yet.
As an interesting data point, my 2 year old knows how to find my wife's phone now. "Ok Google, find my phone." And wouldn't you know it, the damned phone starts making a noise for her. Both of my kids can use the phone to search things. Neither knows how to write yet and my son is just learning to read. It's amazing to watch how touch screens and voice commands have made high tech accessible to very young children.
> Obviously your sentences are parseable. Millions of humans around the world understand what you are saying.
No, I mean parseable in the way that source code is parsed. I'm not aware of any human language that is parseable by a computer. Humans are able to understand each other because our brains are, loosely speaking, magical.
That's because natural language is ambiguous. Even simple sentences like "I saw a man on the hill with a telescope" have multiple meanings.
This isn't solved by magic, but by pure statistics. Ask your friends who has the telescope in the previous sentence and some will say the man and some will say "I". Without context we can't tell, but we can judge which reading is more likely given who had the item in previous sentences of similar structure (the prior). Then usually we also have some context.
Now we can add context: "I got a telescope for my birthday and was eager to use it. The next day I saw a man on the hill with my telescope". Now most people would expect the speaker to be looking through the telescope, but the man might have stolen it and taken it to the hill. Even humans have to guess.
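To make the guessing concrete, here's a toy illustration of picking a parse by prior plus context; every probability below is invented for the example:

```python
# Two readings of "I saw a man on the hill with a telescope",
# with made-up priors and a made-up context likelihood.

priors = {
    "instrument": 0.6,  # [I saw [a man on the hill] [with a telescope]]
    "possession": 0.4,  # [I saw [a man on the hill with a telescope]]
}

# "I got a telescope for my birthday..." shifts the evidence:
likelihood_given_context = {"instrument": 0.9, "possession": 0.1}

posterior = {p: priors[p] * likelihood_given_context[p] for p in priors}
best = max(posterior, key=posterior.get)
print(best, posterior)  # "instrument" wins, but "possession" never hits zero
```

Which is the parent's point: even with context the losing reading keeps nonzero probability, so sometimes the guess is wrong, for humans and machines alike.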
I'm with you. Speech interaction might be fine in some situations, but they all seem quite superficial to me. The day I have to dictate a regex to my IDE is the day I hang up my boots.
I would prefer a keyboard in any situation except where environment/circumstance prevents it, e.g. when driving or wearing thick gloves to protect me from -20 degree temperatures.
Some interactions with a computer are much better with a keyboard and maybe a mouse. Those are more edge cases for the creators. So many people now use computers just for consumption or for super simple creation that keyboards and mice are things they don't need.
I disagree. It's great, I Drive lyft professionally and it is very useful to be able to use the phone without removing it from its holster between ride.
Also, I just dictated that entire paragraph while sitting in a noisy Korean restaurants
I'm still bummed that with all these companies implementing voice recognition, there is still nothing close to a FOSS option. It is a major field, and the kind of software that takes a huge amount of work to get right; I feel like future free operating systems are going to look archaic without it, but it does not seem like the kind of thing any small club of friends can pick up and build to match Google or Apple.
The same applies to OCR and other photo recognition techniques like faces or red eye. Tesseract is probably the largest free-software OCR project, but it still seems to do much worse than proprietary Adobe and Microsoft products. At least the OCR reader that came with my S4 does a terrible job, though it might be using Tesseract behind the scenes, since I think it's the one from F-Droid.
digiKam does all right at red-eye correction, but it does it with a layered filter rather than any recognition of eyes. It can also sometimes find faces, but not nearly as accurately as Google can.
All these fuzzy logic fields are things that take huge code bases and a lot of R&D to get right and nobody in the free software movement has the organization or just the raw bank to make them happen from what I can see. Red Hat surely is not investing in them (kind of outside their enterprise / server domain) and they are about the only company prominent and powerful enough to do it.
There's no shortage of FOSS software for automatic speech recognition. Most academic research these days uses the Kaldi toolkit (http://kaldi.sourceforge.net/about.html), which has an Apache license.
There are (at least) three problems that prevent the widespread availability of FOSS speech recognizers:
1. Data. Large corpora are available via Penn's Linguistic Data Consortium (LDC). Big academic institutions can afford their all-inclusive licenses; small corporations have to settle for their expensive a la carte options, and startups and hobbyists have to go without. Fortunately, there is now more and more freely available data, such as the LibriSpeech database (http://www.voxforge.org/home/forums/message-boards/audio-dis...), which is extracted from the LibriVox website (of public-domain audiobooks).
2. Task-specificity. Speech recognition systems need extensive customization to the intended use case to achieve good performance. Conventional wisdom is that your recognizer can be tuned to work with a wide variety of speakers, or support a large vocabulary, but not both. This customization requires lots of time, data, and expertise.
3. Expertise. Speech recognition development is a PhD-level activity. After 5 years of stable or declining real income, smart students either go work for big bucks at a large multinational, or can't get a visa and go back to their home country.
While there's some hope for #1, #2 and #3 aren't going away anytime soon.
> it does not seem like the kind of thing any small club of friends can pick up and build to match Google or Apple at
Actually I think it's not out of the question now. The recent advances in recognition accuracy are mostly due to deep neural nets. The research is all published open access, and the cutting-edge tools are mostly open source (Theano, Torch, Caffe). Training neural nets is actually a lot simpler than the old methods of doing speech recognition; I think it's much more accessible to a small team. The only really difficult requirement is lots and lots of clean labeled data for training.
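To give a feel for how compact the core recipe is, here's a minimal numpy sketch of a one-hidden-layer net trained by SGD; the "acoustic features" are random placeholders and every size and hyperparameter is made up:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(256, 40)            # stand-in for 40-dim acoustic features
y = rng.randint(0, 10, size=256)  # stand-in for 10 phone/state labels

W1 = rng.randn(40, 64) * 0.1; b1 = np.zeros(64)
W2 = rng.randn(64, 10) * 0.1; b2 = np.zeros(10)
lr = 0.1

for step in range(200):
    h = np.maximum(0, X @ W1 + b1)                 # ReLU hidden layer
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)                   # softmax
    d_logits = p.copy()                            # cross-entropy gradient
    d_logits[np.arange(len(y)), y] -= 1
    d_logits /= len(y)
    dW2 = h.T @ d_logits; db2 = d_logits.sum(0)
    dh = d_logits @ W2.T; dh[h <= 0] = 0           # backprop through ReLU
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

That's essentially the whole training loop; frameworks like Theano or Torch mostly automate the gradient half of it.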
I don't really see how the "old" methods were really less accessible. There were tools such as HTK, CMU Sphinx, etc., or SRILM for language modelling, each with documentation and a large user base. Granted, a lot of fiddling is involved if one wants to use speaker adaptive training (MLLR, VTLN), feature transforms (HLDA, MLLT), MLP features (TANDEM), etc., but DNN approaches come with their own set of screws to tweak...
It's just hard to make something work really well for a specific use case; when contributors to an open-source project are all trying to scratch their own itch (make it work for their specific use [language, vocabulary, etc.]), the result may not be universally satisfying.
The difference is that the old methods were large systems made up of many different pieces that all required a ton of domain knowledge specific to speech and language. Training DNNs requires a lot of knowledge about DNNs, but not nearly as much knowledge about speech. Knowledge of how to train DNNs is highly transferable between domains like speech and vision. Similarly, the actual code can be mostly shared as well; something like Theano would be just as suited to running speech nets as vision nets.
I don't think we're quite there yet, but DNNs have the potential to replace every piece of the speech pipeline with one single net that gets audio samples on one side and spits out characters on the other. All those acronyms you mentioned (with many, many PhD theses behind them) will be irrelevant, in the same way that tons of previously successful specialized computer vision feature detectors (HoG, SIFT, SURF, etc) are now irrelevant to the state of the art in object recognition.
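For the audio-in-characters-out idea, the standard trick is CTC-style training and decoding (my gloss, not the parent's words). A greedy-decode sketch, with random numbers standing in for a trained net's per-frame output:

```python
import numpy as np

ALPHABET = list("_abcdefghijklmnopqrstuvwxyz ")  # '_' is the CTC blank

rng = np.random.RandomState(1)
frame_probs = rng.rand(50, len(ALPHABET))  # 50 frames of fake net output
best = frame_probs.argmax(axis=1)          # greedy per-frame choice

# Collapse repeats, then drop blanks: "hh_ee_ll_ll_oo" -> "hello"
decoded, prev = [], None
for idx in best:
    if idx != prev and ALPHABET[idx] != "_":
        decoded.append(ALPHABET[idx])
    prev = idx
print("".join(decoded))
```

The blank symbol is what lets the net emit doubled letters (the second "l" in "hello" survives because a blank separates the two runs).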
A lot of the methods I listed don't have anything to do with speech per se. They DO have something to do with how to use data to e.g. remove unwanted variability or achieve better class separation. What makes them old methods is that they were developed in the context of using Gaussian mixture models for modelling Hidden Markov model state output probabilities. As such you could perhaps apply them in classifying birdsong, gunshots, or whatever else (with varying success, of course...).
I have no doubt that these methods and their acronyms will become irrelevant (perhaps they already are), but I guess some of the basic underlying ideas about variability will re-emerge in the training regime of DNNs. Sure, the algorithms (and implementations) for training DNNs are the same, but these ideas are incorporated in the preparation and handling of training data (compare that with augmentation, like creating translated images etc.).
Your prediction that DNNs will replace much of the pipeline is very interesting to me, but I hypothesize that you're at least partially wrong. I predict that DNNs will impact early stages of the pipeline, which operate on continuously valued inputs, but I am skeptical that DNNs will ultimately be the best solution for late discrete processing (e.g., decoding, language modeling). That DNNs ever perform well in discrete classification tasks just tells me we haven't spent enough time feature-engineering.
It's the ability of DNNs to replace feature engineering that makes them interesting. They have completely obsoleted feature engineering in object recognition in just a few short years. Have you seen the latest DNN results in translation and image captioning? I think DNNs are quickly going to surpass the state of the art in language modeling.
Is it just me or is the author using the term "Uncanny Valley" completely wrong? Ignoring the silly Pixar story, I still don't understand how voice recognition (or more accurately, speech recognition) is currently in the uncanny valley.
You know when your GPS says "recalculating" in a condescending voice? That's the uncanny valley of text-to-speech.
I'm no valley expert, but it seems to me that uncanny valley refers to an artificial system intended to mimic a natural one. It mostly gets the mimicry right, but not enough that we are completely fooled. This freaks some people out.
Siri's UI intends to mimic a person that understands what you are saying. In practice, it gets it wrong in hilarious and frustrating ways, breaking the illusion.
> I still don't understand how voice recognition (or more accurately, speech recognition) is currently in the uncanny valley.
We are OK with stupid computers; we don't expect anything from them, so we order them around in very formal language.
But now there is an attempt to use natural language - and it doesn't work right. So it's actually better not to use natural language, and just stick to the formal language.
"Uuuugh I can't believe you missed the fucking turn, I've been warning you for TWO MINUTES, Jesus! Fine, I'll plot another course for your stupid ass!"
One important point about voice recognition is that in the short term it's OK if it's slower and harder to use than superior technologies, as long as everyone knows it costs a lot of money.
Once that fad aspect blows over, usage plummets and it's forgotten. See Kinect, or the Nintendo Power Glove, or QR codes, or Google Glass, or the CueCat, or a zillion other examples that are in, or now entering, 8-track-hood.
> In 1992, while finishing A Bug’s Life, Pixar had to build a digital valley for Buzz Lightyear to drive his Ford® F-150™ pickup through on the way to the hospital so he get a vasectomy.
So I'm pretty sure the author is being deliberately silly.
The Uncanny Valley of HN comments gives plenty of credit to the author's post (which I enjoyed). Most of the comments sound like they're coming from underdeveloped AI: awkward perception and a total lack of human sense of humor.
"Voice recognition" sounds more reminiscent of speaker identification than speech recognition to me. Although I don't work on that myself, my day job is in a related field, and even IBM's "speech to text" is a term I never hear being used (unlike for instance "text to speech"). People around me either say "speech recognition" or "ASR" (for automatic speech recognition).
I'd be interested to learn, though, if / where alternative terms are in more wide-spread use.
I agree. I find the use of the term odd and it is not commonly used in an academic context; however, I think I recall having seen it being used informally in the context of voice commands (i.e., controlling a PC or appliance) or, more generally, voice user interfaces.
The truly best voice recognition pretty much has to be hooked up to an uber-AI. Gaining a friend, yes.
Imagine if they could program things into it that would make your overall life better by slightly altering your behavior. For instance, if you asked "Siri" to remind you every 40 minutes for a cigarette break, I can imagine her slowly weaning you off, etc.
> Buzz Lightyear walks into a bar called the "Uncanny Valley" and asks the bartender for a vodka soda. The bartender gives him a vasectomy. Voice recognition is important!
I don't mean to belittle you or Google's speech team, but neither homophones nor proper names are considered hard problems in modern automatic speech recognition.
"But if we can ever jump past this uncanny valley, that’s where we’ll basically build AI."
To me this seemed like the main conclusion of the post.
And I agree. Voice recognition seems like an AI-complete problem. I think conversations will always be awkward and frustrating until Siri can construct a mental model of my habits, my particular turns of phrase and accent, what I'm up to right now, what I think is important, who else is in the room, etc. I don't think you can (only) throw deep learning at the problem and expect anything but superficial responses. (Maybe if you had one neural net per user?)