
Single Speaker Speech Recognition with Hidden Markov Models - kastnerkyle
http://kastnerkyle.github.io/blog/2014/05/22/single-speaker-speech-recognition/
======
lumpypua
Ask HN: Anybody got a piece of software to split an MP3 by speaker? (or output
list of times where each speaker is speaking)

I've got an audiobook with a particularly obnoxious commentator interjecting
dumb things every couple of minutes and I'd like to slice those out.

I'd do it manually if it were only a couple of edits, but there are easily
150+ comments spread through 6 hrs of material. Should be easy for a speech
recognizer, just two voices with different genders.

~~~
kastnerkyle
You might be able to do this with independent component analysis (ICA). There
is an implementation in sklearn (FastICA), but you would have to write some
Python.
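
A minimal sketch of what that could look like with sklearn's FastICA (the
filename is made up, and this assumes both voices are mixed into both
channels, since ICA needs at least as many mixed channels as sources):

    # Unmix two simultaneous sources with FastICA (a sketch, not the
    # blog post's code). Assumes a stereo file where each channel is
    # a different mix of the two voices.
    import numpy as np
    from scipy.io import wavfile
    from sklearn.decomposition import FastICA

    rate, stereo = wavfile.read("audiobook.wav")  # hypothetical filename
    X = stereo.astype(np.float64)                 # shape (n_samples, 2)

    sources = FastICA(n_components=2, random_state=0).fit_transform(X)

    # Rescale each recovered source to 16-bit range and save it out.
    for i in range(sources.shape[1]):
        s = sources[:, i]
        wavfile.write("source_%d.wav" % i, rate,
                      np.int16(s / np.abs(s).max() * 32767))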

Alternatively, you might check whether the speakers are in separate channels
(if the recording is stereo). Then it would be really simple: just take one
channel out and resave it as mono!
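
For the stereo case, something this small should do it (again, the filenames
are hypothetical):

    # Keep only the left channel of a stereo WAV and save it as mono.
    from scipy.io import wavfile

    rate, stereo = wavfile.read("audiobook.wav")   # hypothetical filename
    wavfile.write("mono.wav", rate, stereo[:, 0])  # column 0 = left channel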

If you have a sample I'd be curious to take a look - sounds like an
interesting problem.

~~~
relate
ICA helps you separate a superposition of two or more signals. Assuming the
speakers are not speaking at the same time, he might rather want to use
speaker identification methods to find which parts to mute (e.g. train a
binary classifier that operates on a spectrogram of the signal).
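
Roughly, that idea could be sketched like this (the filename and the
hand-marked time spans are made up):

    # Sketch of the speaker-identification idea: hand-mark a region of
    # each speaker, train a binary classifier on spectrogram frames,
    # then predict a speaker for every frame of the file.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram
    from sklearn.linear_model import LogisticRegression

    rate, audio = wavfile.read("audiobook_mono.wav")  # hypothetical file
    f, t, S = spectrogram(audio, fs=rate, nperseg=512)
    X = np.log(S.T + 1e-10)   # one row of log-power features per frame

    # Hand-marked training spans in seconds (made-up numbers).
    narrator = (t > 0.0) & (t < 10.0)
    comment = (t > 12.0) & (t < 15.0)
    train = narrator | comment

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train], comment[train].astype(int))

    # 1 where the commentator is predicted; those are the frames to mute.
    frame_speaker = clf.predict(X)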

~~~
kastnerkyle
Good point. I was thinking of two people speaking at the same time, but that
would probably be hard to listen to :). Speaker-dependent muting is a much
more reasonable approach.

------
rttlesnke
I've been studying HMMs lately. I think getting an initial estimate of HMM
emission and transition parameters using Segmental K-means training (or
Viterbi training) before applying Baum-Welch re-estimation should result in
the latter converging better. That's what the HInit tool in HTK does. AFAIK,
it's done in the following way:

- Divide all examples (observation sequences) uniformly into as many segments
as the number of states.

- Cluster the observations corresponding to each state, and estimate the GMM
using the cluster set so that each cluster corresponds to one multivariate
Gaussian.

- Do this repeatedly until convergence: get the Viterbi alignment of all
examples, use it to get new segments, and re-estimate the parameters as in
the previous step (a rough code sketch follows).
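
To make the initialization concrete, a rough sketch (my variable names, not
HTK's, with sklearn's GaussianMixture standing in for whatever HTK uses
internally):

    # Rough sketch of the uniform-segmentation initialization: split
    # each observation sequence evenly across states, then fit one
    # GMM per state on the frames that landed in it.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def uniform_init(sequences, n_states, n_mix):
        """sequences: list of (T_i, n_features) arrays."""
        per_state = [[] for _ in range(n_states)]
        for seq in sequences:
            for state, chunk in enumerate(np.array_split(seq, n_states)):
                per_state[state].append(chunk)
        gmms = []
        for frames in per_state:
            g = GaussianMixture(n_components=n_mix, covariance_type="diag")
            g.fit(np.vstack(frames))
            gmms.append(g)
        return gmms

The convergence loop would then replace the uniform split with the segments
given by the Viterbi alignment and refit until the alignment stops changing.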

Please correct me if I'm wrong. Also, I have two questions:

- What kind of accuracy increase should be expected if using both Viterbi
training and Baum-Welch re-estimation, instead of just the latter?

- What kind of accuracy should be expected if only using Viterbi training?

~~~
iori42
In my experience Viterbi training alone is often enough to get reasonable
accuracy, at least in speech recognition. Instead of doing the more costly
Baum-Welch training, you can spend your time better elsewhere, e.g. using
deep neural networks instead of GMMs or collecting more data.

~~~
kastnerkyle
Interesting - I've never tried Viterbi training. Maybe it is worth
implementing after all. I plan to do a hybrid DNN-HMM (or whatever it is
called now) with pylearn2 in a follow-up post.

