
Kaldi Speech Recognition Toolkit
https://github.com/kaldi-asr/kaldi
======
braindead_in
We have an open position for a Kaldi Engineer, in case anyone's interested.

[https://news.ycombinator.com/item?id=12630618](https://news.ycombinator.com/item?id=12630618)

------
blueyes
I think this used to be the basis of Siri. Not sure it still is, and not sure
it's the state of the art in speech recognition, given what neural nets can
do.

~~~
galv
It is state of the art in speech recognition exactly because it uses neural
networks. I helped develop its nnet3 neural network library and have published
based on work I've done on it. No other open-source toolkit is better than it
right now, both because it has an open license (Apache 2.0) and because it has
a larger community than any other speech toolkit.

You can see a lot of state of the art work done using Kaldi at Dan Povey's web
page:
[http://www.danielpovey.com/publications.html](http://www.danielpovey.com/publications.html)
He is the main maintainer of the project.

------
redgetan
What's a good resource for learning more about speaker diarization, so that I
can learn how to use existing tools properly (tweaking and modifying them to
my needs, such as improving accuracy)?

Some diarizers I'm interested in trying so far are PyCASP, pyAudioAnalysis,
and dlg-segmenter.

~~~
nshm
Previous-generation technology used GMMs and agglomerative clustering;
examples are PyCASP and LIUM diarization. For the best performance you need
i-vector extraction and i-vector clustering, and that technology is not fully
available in open source yet.
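
To give a feel for what the clustering step in that GMM-era pipeline does,
here is a toy numpy sketch of greedy bottom-up (agglomerative) clustering over
per-segment embeddings. The segment vectors, distance metric, and threshold
are invented for illustration; real diarizers score merges with BIC or
similar, not plain Euclidean distance:

```python
import numpy as np

def agglomerative_cluster(embeddings, threshold):
    """Greedy bottom-up clustering: repeatedly merge the two closest
    clusters (by centroid Euclidean distance) until no pair is closer
    than `threshold`. Returns one cluster label per segment."""
    clusters = [[i] for i in range(len(embeddings))]
    centroids = [np.asarray(e, dtype=float) for e in embeddings]
    while len(clusters) > 1:
        # find the closest pair of cluster centroids
        best = None
        for a in range(len(centroids)):
            for b in range(a + 1, len(centroids)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # no pair close enough; stop merging
        merged = clusters[a] + clusters[b]
        clusters[a] = merged
        centroids[a] = np.mean([embeddings[i] for i in merged], axis=0)
        del clusters[b], centroids[b]
    labels = np.empty(len(embeddings), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Toy 2-D "embeddings": two well-separated speakers
segs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(agglomerative_cluster(segs, threshold=1.0))  # → [0 0 1 1]
```

The stopping threshold is what decides how many speakers you end up with,
which is exactly the part that is hard to tune in practice.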

------
skykooler
Is there a good resource for getting started with Kaldi? I wanted to test it
as an alternative to Julius, but all the documentation seems to assume you
have a doctorate in speech recognition technologies or something.

~~~
galv
This guide was written by a user who thought the learning curve was too high:
[http://kaldi-asr.org/doc/kaldi_for_dummies.html](http://kaldi-asr.org/doc/kaldi_for_dummies.html)

The reality is that you will have to study a lot to truly understand the
toolkit though. Speech recognition isn't as simple as image recognition where
you can just throw a neural network at the problem (that might come off as
offensive, but it really is more complicated). Note that you do not need a
doctorate in speech recognition to understand it, as I don't have one.

In hindsight, the resources most useful to me when I first began working with
Kaldi were:

\- the HTK book (behind a login wall, annoyingly)

\- this book:
[http://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf](http://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf)

\- the finite state transducers paper:
[http://www.cs.nyu.edu/~mohri/pub/hbka.pdf](http://www.cs.nyu.edu/~mohri/pub/hbka.pdf)

~~~
skoocda
To add to this, Daniel Povey's lectures [0] are fairly useful (although not
exactly colourful, per se).

Would definitely second the FST paper. You'll get shunned from the forums if
you haven't read that one.

[0] [http://www.danielpovey.com/kaldi-lectures.html](http://www.danielpovey.com/kaldi-lectures.html)

------
cnorthwood
Kaldi is used in production at the BBC:
[http://bbcnewslabs.co.uk/projects/transcriptor/](http://bbcnewslabs.co.uk/projects/transcriptor/)

------
jomamaxx
Does anyone have any benchmarks for this?

~~~
galv
Every dataset has scripts to run reproducible baselines. Look at the RESULTS
file for each dataset to see the results of various models, e.g.
[https://github.com/kaldi-asr/kaldi/blob/master/egs/tedlium/s5/RESULTS](https://github.com/kaldi-asr/kaldi/blob/master/egs/tedlium/s5/RESULTS)

The training of these models can be reproduced on your own machine by running
a run*.sh script plug-and-play style, but you'll have to do some digging to
find which scripts correspond to which models in the RESULTS files.

------
lobius
Still based on MFCC features, which are very noisy / high entropy.

~~~
gok
This is completely wrong. Kaldi has code to compute all kinds of features
(filter bank, pitch, PLP...), and many of the recipes use them.
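
For what it's worth, MFCCs and filterbank features are closely related:
MFCCs are just log mel-filterbank energies followed by a DCT. A rough numpy
sketch for a single frame (this is not Kaldi's implementation; its
compute-mfcc-feats binary has its own windowing, dithering, and liftering
options):

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filters over the positive FFT bins."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, center, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, center):          # rising edge of triangle
            fbank[m - 1, k] = (k - lo) / max(center - lo, 1)
        for k in range(center, hi):          # falling edge of triangle
            fbank[m - 1, k] = (hi - k) / max(hi - center, 1)
    return fbank

def frame_features(frame, sr, n_filters=23, n_ceps=13):
    """Log mel-filterbank energies for one windowed frame, plus the
    MFCCs obtained by taking a DCT of those energies."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft
    log_mel = np.log(mel_filterbank(n_filters, n_fft, sr) @ power + 1e-10)
    mfcc = dct(log_mel, type=2, norm='ortho')[:n_ceps]
    return log_mel, mfcc

# 25 ms frame of a 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(400) / sr
log_mel, mfcc = frame_features(np.sin(2 * np.pi * 440 * t) * np.hamming(400), sr)
print(log_mel.shape, mfcc.shape)  # → (23,) (13,)
```

Dropping the final DCT (and the truncation to 13 coefficients) gives you the
filterbank features that most of the neural-net recipes consume directly.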

~~~
lobius
But not spectral peaks, which is what audio fingerprinting services like
Shazam use (very successfully too it seems)
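
For context, the fingerprinting "constellation" idea is to keep only
time-frequency bins that are local maxima of the spectrogram. A toy
numpy/scipy sketch (the FFT size, neighborhood, and magnitude floor are
arbitrary choices for illustration, not Shazam's actual parameters):

```python
import numpy as np
from scipy.ndimage import maximum_filter

def spectral_peaks(signal, n_fft=256, hop=128, neighborhood=5, min_mag=1.0):
    """Constellation-style peak picking: compute an STFT magnitude
    spectrogram, then keep bins that are local maxima within a small
    time-frequency neighborhood and above a magnitude floor."""
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1))      # shape (time, freq)
    local_max = maximum_filter(spec, size=neighborhood) == spec
    return np.argwhere(local_max & (spec > min_mag))  # rows: (frame, bin)

sr = 8000
t = np.arange(sr) / sr                       # one second of audio
peaks = spectral_peaks(np.sin(2 * np.pi * 1000 * t))
# a pure 1 kHz tone should peak at FFT bin 1000 / sr * n_fft = 32
print(sorted({int(b) for b in peaks[:, 1]}))  # → [32]
```

The resulting sparse set of (time, frequency) points is what fingerprinting
systems hash; it's robust for matching recordings, but it throws away most of
the spectral envelope detail that speech recognizers need.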

~~~
skoocda
There's been very little use of spectral peaks in LVCSR for the past ~10
years.

