
Hound voice search NLP demo [video] - sandGorgon
https://youtube.com/watch?v=M1ONXea0mXg
======
Neurocynic
Google play link -
[https://play.google.com/store/apps/details?id=com.hound.andr...](https://play.google.com/store/apps/details?id=com.hound.android.app)

apkmirror link - [http://www.apkmirror.com/apk/soundhound-inc/hound/soundhound...](http://www.apkmirror.com/apk/soundhound-inc/hound/soundhound-hound-1-0-0-beta-android-apk-download/)

Edit: Currently, only US devices are supported. You can sideload using the apk
link.

In either case, you need an invite code, which you can request from within the
app or on the website - www.soundhound.com/

~~~
nly
Google Play won't let me install it, saying it's incompatible with my Nexus 4
(on Lollipop)

~~~
christop
That's why the GP also posted a direct APK link.

In any case, you need an invite code in order to actually use the app.

------
codeshaman
"What the fuck was that??" were the exact words I spoke out loud after I saw
this video.

Is this for real, with no editing and no time compression?

Does it really understand those questions, or is it preprogrammed?

It looks almost too incredible to be true. If it is, though, I'm in awe.

~~~
christop
It worked for me with "what's the capital of the country where the Brandenburg
Gate is", and "what's the population of the country with the Eiffel Tower",
then "what's the current time there?"

Similarly, the restaurant and hotel demos from the other promo video worked
fine. Also with follow-up questions like "and what about ones with free wifi?"

~~~
iand
Does it hold context for subsequent questions? For example can you ask "what
city is the Brandenburg Gate in?", "which country is that in?", "what is its
population?"

~~~
iamcasen
Yes it does. That's one of the main features.
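
For intuition, here's a toy sketch of how that kind of context carry-over could work in a QA system (purely illustrative; this is not Hound's actual implementation, and the tiny knowledge base is made up):

```python
# Toy sketch of multi-turn context carry-over in a QA system.
# Purely illustrative; not how Hound actually works.

class ContextualQA:
    def __init__(self, kb):
        self.kb = kb          # entity -> {relation: value}
        self.focus = None     # entity carried over from the last turn

    def ask(self, relation, entity=None):
        # "that"/"its"-style follow-ups reuse the previous focus entity
        entity = entity or self.focus
        answer = self.kb[entity][relation]
        # If the answer is itself a known entity, it becomes the new focus
        self.focus = answer if answer in self.kb else entity
        return answer

kb = {
    "Brandenburg Gate": {"city": "Berlin"},
    "Berlin": {"country": "Germany"},
    "Germany": {"population": "~83 million"},
}

qa = ContextualQA(kb)
print(qa.ask("city", "Brandenburg Gate"))  # Berlin
print(qa.ask("country"))                   # Germany ("which country is that in?")
print(qa.ask("population"))                # ~83 million
```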

------
tomp
They advertise it as "speech-to-meaning". I've been thinking about meaning in
AI and how important it is for understanding and interacting with the world to
be able to answer "What does this mean?" and "Does this make sense?"

Does anyone have any insight or references to recent research about how to
model, train for and represent _meaning_?

~~~
nl
Here's a chronological reading list of techniques

Traditional NLP -> 2012, IBM Watson papers[1]

Watson is probably the pinnacle of "traditional"-style NLP (i.e.
tokenization/lemmatization/part-of-speech tagging/framing/knowledge
engineering/etc.)

Word2Vec -> 2013, Google paper[2]

Word2Vec kind of exploded everyone's mind for a while.

Subgraph Embedding -> 2015, Facebook AI Group [3][4]

[1]
[http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=617771...](http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6177717)

[2] [http://arxiv.org/pdf/1301.3781.pdf](http://arxiv.org/pdf/1301.3781.pdf)

[3]
[https://research.facebook.com/publications/1473550739586509/...](https://research.facebook.com/publications/1473550739586509/question-answering-with-subgraph-embeddings/)

[4] [http://arxiv.org/abs/1502.05698](http://arxiv.org/abs/1502.05698)
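
For a feel of what word2vec's objective is defined over, here's a minimal sketch of skip-gram (word, context) pair extraction. This is my own illustration, not code from any of the papers above:

```python
# Minimal sketch: extracting the skip-gram (word, context) training pairs
# that word2vec's objective is defined over.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

sentence = "the gate is in berlin".split()
for w, c in skipgram_pairs(sentence, window=1):
    print(w, "->", c)
```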

~~~
syllogism
You might be interested to read this:
[http://arxiv.org/pdf/1402.3722v1.pdf](http://arxiv.org/pdf/1402.3722v1.pdf)

word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding
Method. Yoav Goldberg and Omer Levy. arXiv 2014. [pdf]

The word2vec software of Tomas Mikolov and colleagues has gained a lot of
traction lately, and provides state-of-the-art word embeddings. The learning
models behind the software are described in two research papers. We found the
description of the models in these papers to be somewhat cryptic and hard to
follow. While the motivations and presentation may be obvious to the neural-
networks language-modeling crowd, we had to struggle quite a bit to figure out
the rationale behind the equations.

~~~
nl
Yeah. Pulling a quote:

 _Why does this produce good word representations? Good question. We don’t
really know. The objective above clearly tries to increase the quantity vw·vc
for good word-context pairs, and decrease it for bad ones. Intuitively, this
means that words that share many contexts will be similar to each other (note
also that contexts sharing many words will also be similar to each other).
This is, however, very hand-wavy._

I love an honest paper.
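
The quoted objective can be made concrete in a few lines. This is just the negative-sampling loss from that write-up, with made-up toy vectors (one positive context, one negative sample):

```python
# The quoted objective, numerically: the negative-sampling loss pushes
# sigmoid(vw . vc) toward 1 for observed (word, context) pairs and toward
# 0 for negative samples. Vectors here are invented for illustration.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ns_loss(vw, vc_pos, vc_negs):
    loss = -math.log(sigmoid(dot(vw, vc_pos)))       # reward good pair
    for vc in vc_negs:
        loss += -math.log(sigmoid(-dot(vw, vc)))     # penalize bad pairs
    return loss

vw = [0.5, -1.2, 0.3, 0.8]
aligned = [0.4, -1.0, 0.2, 0.7]     # similar direction: a "good" context
opposed = [-0.4, 1.0, -0.2, -0.7]   # opposite direction: a "bad" context

print(ns_loss(vw, aligned, [opposed]))  # low loss
print(ns_loss(vw, opposed, [aligned]))  # high loss
```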

------
nl
What is it that impresses people here?

The voice understanding is impressive, but obviously a video can show the
best-case.

The factual questions themselves aren't hard. I've written a toy QA system,
and it could handle the base case of those - they are just straight querying
on Freebase/DBpedia.
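
For a sense of what such "straight querying" looks like, here is a sketch of building a chained SPARQL query for a third-level question. The `dbo:child`/`dbo:spouse` property names are assumptions about the DBpedia ontology, not verified here:

```python
# Illustrative only: a "third level" question chains two knowledge-base
# relations (Bill Clinton -> child -> spouse) into one graph pattern.

def chained_sparql(entity, relations):
    lines, subj = [], f"dbr:{entity}"
    for i, rel in enumerate(relations):
        obj = f"?x{i}"
        lines.append(f"  {subj} {rel} {obj} .")
        subj = obj                       # each hop's object feeds the next hop
    return "SELECT %s WHERE {\n%s\n}" % (subj, "\n".join(lines))

print(chained_sparql("Bill_Clinton", ["dbo:child", "dbo:spouse"]))
```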

The longer question ("the population, land area and capitals of Japan, India
and China") was good.

If anyone has a working copy, I'd like to know which of these it can handle:

"Who is Bill Clinton?"

"Who is Bill Clinton's daughter?"

"Who did Bill Clinton's daughter marry?"

If it can get the third level, then it's pretty good. From memory, something
like OpenEphyra can sometimes get that third level, but usually fails.

I thought the contextual querying (where one query led to another and it had
to remember the previous details) was pretty good.

~~~
a3n
Who is Bill Clinton's daughter's husband?

Who is Bill Clinton's daughter's husband's daughter?

Who is Bill Clinton's daughter's husband's daughter's grandfather?

Who is Bill Clinton's house?

~~~
agildehaus
Google Now does surprisingly well on all but the third question, though if you
ask "Who is Chelsea Clinton's husband's father?" it gets the right answer.

Siri passes every single question, verbatim, to WolframAlpha and it gets none
of them right.

------
iLoch
Hmm... When your tech works this well, do you want to get acquired? This
reminds me of Pied Piper in the way that it's just so much better than
anything else out there. Applications for something like this stretch beyond a
personal assistant; combining it with something like Watson or WolframAlpha
could be very useful. I feel like I could actually use this to control my
computer with confidence, for example.

~~~
jdiez17
They seem to have (plans for) a way[1] to integrate this into other
applications. Looks like it would reduce a lot of the friction that exists
with current-gen voice recognition systems.

[1] [https://www.houndify.com/](https://www.houndify.com/)

~~~
hackersssss
this already exists --> api.ai

------
mrkmcknz
If this is indeed a non-scripted and uncompressed demo, this technology is
pretty outstanding.

SoundHound[1] are behind this product.

[1] [http://www.soundhound.com/houndify](http://www.soundhound.com/houndify)

------
rasz_pl
Hope it's real. It reminds me of those cool expert-system demos we saw in the
eighties; they had all the answers ... for a select, narrow group of questions
with very specific semantics.

------
froo
Incoming acquisition in 3...2.....1

~~~
joeyspn
Exactly the words I was going to post... This looks like a great addition to
the google app. Just the speech-to-text is impressive enough...

~~~
smackfu
The integration is the hard part. Do you just throw out the whole existing OK
Google engine?

------
jdiez17
I am very impressed. The voice recognition seems significantly faster and more
accurate than Google's. The interactive back and forth with the mortgage
calculations was the coolest part, I think. How are you able to access
population and location data that quickly? It feels like it must be stored
locally.

~~~
spyder
"significantly faster and more accurate than Google's" Yes it's impressive,
but it could be because Google has a lot more users.

~~~
jdiez17
Google is in the business of making things go _very fast_. I don't think the
primary bottleneck in Google's speech recognition is server load; surely they
could add more processing power if that were the issue.

~~~
afsina
Well, it is actually very demanding. ASR systems usually run at around 1 RT
(RT = real-time factor, meaning recognizing 1 second of speech takes 1 second
of processing). Approximately 60-70% of that processing goes to acoustic
scoring. The rest is search in a large sub-phonetic/word graph, plus feature
extraction (which actually takes only a tiny percentage).
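
For concreteness, the real-time factor is just processing time divided by audio duration; a trivial sketch:

```python
# Real-time factor (RTF): processing time divided by audio duration.
# RTF = 1.0 means 1 second of speech takes 1 second to recognize;
# lower is faster than real time.
def real_time_factor(processing_seconds, audio_seconds):
    return processing_seconds / audio_seconds

print(real_time_factor(0.5, 1.0))  # 0.5: twice as fast as real time
```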

Nowadays acoustic scoring is done by large deep neural networks, and they are
quite computation intensive. One can use a GPU for that, and indeed it works
really fast if you have all the speech beforehand (off-line or batch mode).
But for live recognition, GPUs lose their advantage quite a bit. That is
probably why Google worked on quantization, vectorization and other tricks to
make DNNs fast on CPUs [1].

I am quite sure this creates immense pressure on their servers when tens of
thousands of concurrent speech streams are queued for recognition. Perhaps
today's GPUs are better in that respect and more work can be delegated to them
to ease the pressure. There has also been interesting work that does almost
all of the processing on the GPU [2].

In short, ASR systems are very processing hungry and a challenge for everyone,
probably even for Google.

[1]
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37631.pdf)

[2]
[http://www.cs.cmu.edu/~ianlane/hydra/#&panel1-1](http://www.cs.cmu.edu/~ianlane/hydra/#&panel1-1)

~~~
nl
This isn't entirely accurate. Or rather, it is accurate as far as it goes, but
doesn't tell the whole story.

 _Training_ a neural network uses a _lot_ of computational power. From memory,
I think training the Android voice recognition took weeks on Google's GPU
cluster ([1] talks about 95 hours for partial training, but I don't think
that's the production system).

However, once the network is trained, it doesn't use much power at all. The
trained network can run on a mobile phone, and it doesn't even drain the
battery much.

[1]
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42652.pdf)
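
A back-of-envelope illustration of why inference is cheap (the layer sizes are invented for the sake of the example, not Google's actual architecture):

```python
# Back-of-envelope: a DNN acoustic model's forward pass is just a few
# matrix multiplies. Layer shapes below are made up for illustration
# (input features -> two hidden layers -> output scores).
layer_shapes = [(440, 2048), (2048, 2048), (2048, 8000)]

# Multiply-accumulate operations for one 10 ms frame of audio
macs_per_frame = sum(rows * cols for rows, cols in layer_shapes)
frames_per_second = 100                      # one frame every 10 ms
macs_per_second = macs_per_frame * frames_per_second

print(f"{macs_per_frame:,} MACs/frame")      # ~21 million
print(f"{macs_per_second / 1e9:.1f} GMACs/s")  # ~2.1: feasible on a phone CPU
```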

~~~
afsina
I was talking about run-time operations, not training. Yes, training DNNs is
much more time consuming, but my point is that using them is also not cheap.
As mentioned, processing 1 second of speech in, let's say, 0.5 seconds is
expensive, considering that a web search is done in sub-millisecond time. Of
course, I assume speech recognition is done server-side.

------
tux1968
Looks good. Bit frustrating that it's not available in Canada though.

~~~
IshKebab
Or the UK, for that matter. I wish Google Play had a "Download Anyway" button.
Region blocking is ridiculous.

------
WhitneyLand
I don't believe it will be exactly this good in everyday usage, if for no
other reason than speed.

Even if the parsing is done locally, broad data queries will have to hit the
cloud.

------
kolev
How can I cut in line and get an invite? I find it a bit unfair that those who
supported SoundHound didn't get any special treatment.

------
nerdy
The voice recognition and speech parsing is incredible. Combining it with some
calculations and interactivity is over the top!

------
beenpoor
Very impressive!

