

Microsoft Research makes breakthrough in audio speech recognition - sparknlaunch
http://blogs.technet.com/b/next/archive/2012/06/20/a-breakthrough-in-speech-recognition-with-deep-neural-network-approach.aspx

======
acqq
The most interesting bit for me is at the end of another blog entry:

[http://blogs.technet.com/b/inside_microsoft_research/archive...](http://blogs.technet.com/b/inside_microsoft_research/archive/2012/06/14/deep-neural-network-speech-recognition-debuts.aspx)

"An intern at Microsoft Research Redmond, George Dahl, now at the University
of Toronto,

<http://www.cs.toronto.edu/~gdahl/>

contributed insights into the working of DNNs and experience in training them.
His work helped Yu and teammates produce a paper called Context-Dependent Pre-
trained Deep Neural Networks for Large Vocabulary Speech Recognition.

[http://research.microsoft.com/pubs/144412/DBN4LVCSR-TransASL...](http://research.microsoft.com/pubs/144412/DBN4LVCSR-TransASLP.pdf)

In October 2010, Yu presented the paper during a visit to Microsoft Research
Asia. Seide was intrigued by the research results, and the two joined forces
in a collaboration that has scaled up the new, DNN-based algorithms to
thousands of hours of training data."

~~~
gdahl
For people interested: some (currently) undocumented research code in Python
implementing DNNs is also on my website. The code is only an initial release.
I will improve it later, but if I waited until it wasn't embarrassing I would
never release it, so I just posted it.

~~~
pathdependent
Thank you for doing so!

Overwhelmingly, it is my experience that researchers in computational
disciplines publish papers with half-finished code "available on request" --
and requests are often ignored. It's refreshing to hear someone say, "Yes, the
code needs work, but it should be available."

------
kpozin
The demo site (<http://www.msravs.com/audiosearch_demo/>) blocks browsers
other than IE and Firefox based on the user agent string. Use WebKit's
developer tools to change your user agent and you'll be able to get in.

~~~
no_more_death
Why alienate such a large segment of users after pouring so much money into
their technology? The web is getting weirder.

If a company invests in multiple markets, they should be prepared to do well
in some markets and badly in others. Bing isn't as good as Google. Android
isn't as well-designed as Metro. Yes, Android stole Apple's market, and, yes,
Apple stole someone else's market. The large technology companies are
deadlocked on multiple fronts. That fuels fierce competition and inspires
excellence and choice. However, companies should accept they just aren't the
best at everything. Let us make our own choices based on what's best for us.

~~~
georgemcbay
I think you are attributing to malice what is probably just laziness. It is
fairly common for modern websites to drop the ball on support of some browser
or other. I doubt Microsoft as a corporation made a deliberate decision to
support IE and Firefox but not Chrome or Safari or Opera or whatever.

~~~
kpozin
It's one thing to not test a site in a particular browser and to just put up
an unobtrusive warning saying that some things might not work perfectly. It's
quite another to actively block access based on the user agent string.

------
richardlblair
Imagine the power of this for students. This would have made school so much
easier. Simply record every lecture and then use this to search for keywords.

Awesome.

~~~
toemetoch
On a different note, imagine the power of this for DRM control and censorship.

But impressive and very useful.

~~~
nddrylliog
What? What do speech recognition and fingerprinting have in common? I don't
see how this research applies to DRM...

Censorship, maybe. And even then, you can't filter conversations in real-time,
only maybe 'flag' people with forbidden words.

~~~
toemetoch
Pretty much all DRM content has unique patterns.

Want to prohibit videos of the Starcraft game? Simply search for a few
sentences like "more vespene gas" and "require more minerals".

Want to find online copies of "Aliens"? Just enter a few catchphrases or part
of a dialogue like "They come mostly at night. Mostly."

 _And even then, you can't filter conversations in real-time, only maybe
'flag' people with forbidden words._

Yes, it's really reassuring to know it's not in real time.

~~~
lt
Those don't really require textual matching, just regular audio
fingerprinting. In fact, doing that would match Starcraft or movie podcasts,
where people are quoting the source.

~~~
toemetoch
With audio fingerprinting, the content provider must provide a way to
fingerprint its own audio and have access to fingerprints of the internet's
audio/video. This means a partnership between e.g. YouTube and a studio. I'm
fairly sure that limits it to studios above a certain size, with resources for
programming and an API, plus a fair bit of paperwork and testing for robustness,
as there are ways to mess with the technique.

With this technique you just enter a few words and look at what comes out.

You're suggesting that the first option is easier?

~~~
lt
Yes. Not only easier, but more reliable. The examples you gave are perfectly
static sound bites - they don't change. It doesn't make sense to transcribe
them to text; just match the audio. Soundhound/Shazam/etc. do this easily. I'm
pretty sure YouTube already has some kind of similar mechanism in place.

This technology gets a lot more interesting if you want to search for people
talking about you or your products.

------
bornhuetter
Can someone please explain senones to me? Can't find much on Google.

The article says that they are a fragment of a phoneme, but how small a
fragment are we talking? 2-3 per phoneme, or many more?

Also - I'd be curious how much a phoneme in a word can vary based on accent.

~~~
dewiz
<http://cmusphinx.sourceforge.net/wiki/tutorialconcepts>

"Speech is a continuous audio stream where rather stable states mix with
dynamically changed states. In this sequence of states, one can define more or
less similar classes of sounds, or phones.

Words are understood to be built of phones, but this is certainly not true.
The acoustic properties of a waveform corresponding to a phone can vary
greatly depending on many factors - phone context, speaker, style of speech
and so on. The so called coarticulation makes phones sound very different from
their “canonical” representation. Next, since transitions between words are
more informative than stable regions, developers often talk about diphones -
parts of phones between two consecutive phones. Sometimes developers talk
about subphonetic units - different substates of a phone. Often three or more
regions of a different nature can easily be found.

The number three is easily explained. The first part of the phone depends on
its preceding phone, the middle part is stable, and the next part depends on
the subsequent phone. That's why there are often three states in a phone
selected for HMM recognition.

Sometimes phones are considered in context. There are triphones or even
quinphones. But note that unlike phones and diphones, they are matched with
the same range in waveform as just phones. They just differ by name. That's
why we prefer to call this object senone. A senone's dependence on context
could be more complex than just left and right context. It can be a rather
complex function defined by a decision tree, or in some other way."
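
To make the triphone/senone distinction concrete, here is a toy sketch of senone
tying (my own illustration with made-up context classes, not code from Sphinx):
each triphone gets three HMM states, and a tying function maps states from
different triphones onto shared senone ids. Real systems do the tying with a
phonetic decision tree; the dictionary below is just a placeholder.

    # Toy illustration of triphones and senone tying (not real Sphinx code).
    from itertools import count

    _senones = {}
    _next_id = count()

    def senone_id(left, phone, right, state):
        # Real systems tie HMM states with a phonetic decision tree over the
        # context; here we key on the phone, the state index, and a crude
        # "context class" (first letter of each neighbour) as a stand-in.
        key = (phone, state, left[:1], right[:1])
        if key not in _senones:
            _senones[key] = next(_next_id)
        return _senones[key]

    # "cat" as a triphone sequence: sil-k+ae  k-ae+t  ae-t+sil
    for left, phone, right in [("sil", "k", "ae"), ("k", "ae", "t"), ("ae", "t", "sil")]:
        states = [senone_id(left, phone, right, s) for s in range(3)]
        print(f"{left}-{phone}+{right} -> senones {states}")

The point is that a senone is indexed by the HMM state and its context, not by
a fixed slice of audio, so the same stretch of waveform can be scored against
different senones depending on the surrounding phones.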

~~~
bornhuetter
Thanks. So senones are not just fragments of phones - two senones could sound
exactly the same, but be classified differently depending on their context
within the audio stream.

------
Dn_Ab
For those keeping score, Google's image feature extractor shares the same core
principles as Microsoft's speech recognizer.

EDIT: by keeping score I mean keeping track of which techniques are being used
where.

~~~
mikedmiked
What are these core principles?

~~~
Dn_Ab
The main characters of both papers are many-layered neural network
architectures, autoencoders, and stochastic gradient descent. The interesting
thing is that all these ideas are from the 80s; the breakthrough was in how
to use unsupervised learning to seed the networks so that a many-layered
neural network did not get mired in local optima.

The key idea is that if you train each layer in an unsupervised manner and
then feed its outputs as features to the next layer, the network performs better
when you go on to train it in a supervised way. That is, back-propagation on the
pre-trained neural net learns a far more robust set of weights than without
pretraining. Stochastic gradient descent is a very simple technique that is
useful for optimization when you are working with massive data.

The architecture Dahl used layers RBMs (very similar to autoencoders) to
seed a regular old, but many-layered, feedforward network. SGD is used to do the
back-propagation. The RBMs themselves are trained using a generative technique -
see contrastive divergence for more.

The Google architecture is more complex and based on biological models. It is
not trying to learn an explicit classifier; instead they train a many-layered
autoencoder network to learn features. I only skimmed the paper, but they have
multiple layers specialized to particular types of processing (think
Photoshop, not Intel), and using SGD they optimize an objective that
essentially learns an effective decomposition of the data.

The main takeaway is that if you can find an effective way to build layered
abstractions, then you will learn robustly.
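
As a rough illustration of that layer-wise idea (my own sketch, with made-up
layer sizes and fake data, and using plain autoencoders as a stand-in for the
RBMs discussed above): each layer is trained unsupervised on the previous
layer's outputs, and the resulting weights initialize an ordinary feedforward
net for supervised fine-tuning.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def pretrain_layer(data, n_hidden, lr=0.1, epochs=5):
        """Train one tied-weight autoencoder on `data`; return encoder weights."""
        n_visible = data.shape[1]
        W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
        b_h = np.zeros(n_hidden)
        b_v = np.zeros(n_visible)
        for _ in range(epochs):
            for x in data:                          # plain SGD, one example at a time
                h = sigmoid(x @ W + b_h)            # encode
                x_hat = sigmoid(h @ W.T + b_v)      # decode with tied weights
                err = x_hat - x                     # reconstruction error
                grad_v = err * x_hat * (1 - x_hat)  # gradient at decoder pre-activation
                grad_h = (grad_v @ W) * h * (1 - h) # gradient at encoder pre-activation
                W -= lr * (np.outer(x, grad_h) + np.outer(grad_v, h))
                b_h -= lr * grad_h
                b_v -= lr * grad_v
        return W, b_h

    # Fake data: 200 examples with 50 features -- purely illustrative.
    X = rng.random((200, 50))

    # Greedy layer-wise pretraining: each layer's hidden activations become
    # the training data for the next layer.
    weights, biases, inp = [], [], X
    for n_hidden in (30, 20):
        W, b = pretrain_layer(inp, n_hidden)
        weights.append(W)
        biases.append(b)
        inp = sigmoid(inp @ W + b)

    # Supervised fine-tuning would now run back-propagation (with SGD) through
    # the whole stack plus a softmax output layer, starting from these
    # pretrained weights instead of random ones.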

~~~
tylerhobbs
There's a very good Google Tech Talk by Geoff Hinton (who has worked closely
with Dahl on a lot of this research and developed some of the key algorithms
in this field) that explains how to build deep belief networks using layers of
RBMs: <http://www.youtube.com/watch?v=AyzOUbkUf3M>

That video focuses on handwritten digit recognition, but it's great for
understanding the basics. There's a second Google Tech Talk video from a few
years later that talks directly about phoneme recognition as well:
<http://www.youtube.com/watch?v=VdIURAu1-aU>

------
MichaelGG
On an immediately useful, practical note, OneNote also has this
functionality (obviously not as powerful). I've used it to record a meeting's
audio synced to my notes, and then been able to search the audio to jump exactly
to where someone mentioned something and review the context. Saved my ass on at
least one occasion.

------
droz
Research paper on the system:
<http://www.se.cuhk.edu.hk/hccl/publications/pub/HLT2006.pdf>

------
brutuscat
This seems very related to this <http://www.youtube.com/watch?v=ZmNOAtZIgIk>
talk by Andrew Ng. It's a 40-minute talk, but he explains very simply how all
this works for images, with some examples from the audio case. It is incredible
how, using these deep learning techniques, we can teach these "neural networks" to
recognize such complicated patterns. It is like reverse engineering the
brain's algorithms.

BTW I took his Coursera course on Machine Learning and it was great! I
also recommend it A LOT for picking up basic ML knowledge.

~~~
JabavuAdams
Are you still able to access the course materials? I took the course as well
(and enjoyed it!) but I'd like to access the PDFs, especially.

~~~
brutuscat
Yes, I have downloaded all the PDFs. Email me (check my profile) and I will
share them with you via Dropbox ;)

------
tsumnia
How does this compare to Microsoft's old HTK (HMM Toolkit)? The language used
on the website seems to point to a lot of the same things. Is this breaking it
down to actual IPA phonemes?

I'm mostly curious because I used the HTK for my thesis and would like to know
how they compare (besides, one being just 'newer').

~~~
ezy
This approach still uses HMMs; it's just that the observation probabilities
now come from a DNN (deep neural network) instead of a GMM (Gaussian mixture
model). "Senones" are not new: HTK can use various context-dependent phoneme
models, and the HMM states (typically 3) within each context-dependent phoneme
essentially boil down to what they call a "senone" here. Interestingly, they
use GMMs to bootstrap the DNN training -- which I suppose you could avoid
once you have a reasonable DNN lying around.

The main difference here is hooking the DNN output to an HMM decoder, replacing
the GMMs, and, possibly even more important, the training process they use to get
the DNN fairly efficiently. That's the biggest thing -- GMMs, at least the
last time I looked, can be trained and adapted much more quickly than a DNN.
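
A minimal sketch of that hybrid wiring (illustrative shapes and names only, not
the paper's code): the HMM decoder wants p(frame | senone), the DNN's softmax
gives p(senone | frame), so a common trick is to divide out the senone priors
and use the result as a scaled likelihood in place of the GMM score.

    import numpy as np

    def scaled_log_likelihoods(dnn_posteriors, senone_priors, eps=1e-10):
        """Turn DNN senone posteriors into HMM observation scores.

        dnn_posteriors: (n_frames, n_senones) softmax outputs, p(senone | frame)
        senone_priors:  (n_senones,) senone frequencies from the training alignment
        returns:        (n_frames, n_senones) log p(frame | senone), up to a constant
        """
        return np.log(dnn_posteriors + eps) - np.log(senone_priors + eps)

    # Toy usage: 3 frames, 4 senones.
    posteriors = np.array([[0.7, 0.1, 0.1, 0.1],
                           [0.2, 0.5, 0.2, 0.1],
                           [0.1, 0.1, 0.2, 0.6]])
    priors = np.array([0.4, 0.3, 0.2, 0.1])
    obs_scores = scaled_log_likelihoods(posteriors, priors)
    # obs_scores then feed the standard Viterbi/HMM decoder in the slots where
    # GMM log-likelihoods would otherwise go.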

------
cmicali
Vlingo, Siri, and others have been doing speaker-independent, auto-adapting
speech recognition for years, and talk of systems requiring 'training' and of
improvements there makes this article sound like it's 5 years old. Great to see
innovation in this space, but this article is very light on detail.

~~~
breckinloggins
It is my understanding (albeit based on limited knowledge) that Siri, like
other Nuance-powered systems that make a call to a server, is actually
"trained" continuously by the huge amount of sample speech received from
real users.

The true "breakthrough" here would be if Microsoft made a voice recognition
system that could run entirely on a device (no internet connection needed) and
accurately understand speech without terabytes of training data or a local
user training session. I can't tell from the article whether this is what
Microsoft is claiming.

Also, it appears that "Deep Neural Network" isn't the most common term of art
here. DNN appears to be a synonym for "Deep Belief Network".[1] Can anyone
confirm?

[1] <http://www.scholarpedia.org/article/Deep_belief_networks>

~~~
rck
I believe that in this system, "deep neural network" just means a regular
feed-forward network that has a larger number of hidden layers. There is a
relationship to DBNs though, because they initialize the weights of the neural
net by doing unsupervised pre-training with a set of DBNs.

------
dewiz
related link: [http://research.microsoft.com/en-us/news/features/speechreco...](http://research.microsoft.com/en-us/news/features/speechrecognition-082911.aspx)

~~~
Nogwater
related link comments: <http://news.ycombinator.com/item?id=2936371>

