

Kaldi Speech Recognition Toolkit - adamnemecek
https://github.com/kaldi-asr/kaldi

======
96701
Are there any getting-started guides? Say I wanted to transcribe a podcast:
would I have to feed it a transcript to learn from?

~~~
pella
[http://kaldi-asr.org/doc/tutorial.html](http://kaldi-asr.org/doc/tutorial.html)

~~~
96701
"This tutorial assumes that you know the basics of speech recognition using
the HMM-GMM approach."

So basically, not for beginners or hobbyists.

------
WalterGR
a. What is a speech recognition "toolkit"? Does it transform audio data into
text?

b. How does Kaldi compare to any other "toolkit"?

I clicked the GitHub link, read the README, found the link to the project home
page, read that, clicked the Documentation link, and read "About Kaldi", and
haven't found the answer for a., and the answer for b. is clearly beyond
me.[1]

[1] "Kaldi is similar in aims and scope to HTK... Important features include:
Code-level integration with Finite State Transducers (FSTs). We compile
against the OpenFst toolkit (using it as a library)." etc.

~~~
hiddencost
Many industrial speech recognition systems start with Kaldi, add their own
data and any modifications to the recognizer, and then spend a while tuning
the model. Speech recognition is one of those problems where you need a Ph.D.
and hundreds of hours of transcribed audio plus a large amount of in-domain
text to build a good model. So it's not really accessible without serious
resources.

~~~
WalterGR
Thanks for the clear description of the situation.

So (metaphorically speaking) this sounds like Google's Tesseract OCR
"engine"[1]: yes it'll recognize letters - but it does no layout analysis.
It's only one piece of a much larger puzzle.

[1] [https://code.google.com/p/tesseract-ocr/](https://code.google.com/p/tesseract-ocr/)

------
somberi
If anyone could point me towards speech recognition in the field of
healthcare, I would be grateful.

~~~
PeterisP
Healthcare is really a prime example of a field where each speech recognition
system (currently) needs adaptation to your particular needs; the language is
very specific to each subfield, general-use systems don't work well for
healthcare use cases, and a system built for one use case doesn't work for
others.

But Kaldi seems to be a common way to go - you'd take that, add samples of
your audio data, add a lot of text from your domain (to get a language model
that captures your terminology and common phrasing), retrain the system on
this data and you'd have something useful. Well, something useful that can be
tuned ad infinitum for more accuracy.
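
To make the "language model that captures your terminology" part concrete, here is a toy sketch in plain Python. It is not Kaldi's tooling or recipe format; the function names (`train_bigram_lm`, `log_prob`) and the three sample sentences are made up for illustration, and real systems use far larger corpora and better smoothing. It only shows the idea: count word pairs in domain text, then score candidate transcriptions so that in-domain phrasing wins.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab = len(unigrams) + 1  # +1 leaves room for unseen words
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
    return total

# Hypothetical in-domain text; a real system would use a large corpus.
domain_text = [
    "patient presents with acute myocardial infarction",
    "patient denies chest pain",
    "acute myocardial infarction ruled out",
]
uni, bi = train_bigram_lm(domain_text)

# Same words, different order: the domain-like ordering should score higher.
in_domain = log_prob("patient presents with chest pain", uni, bi)
out_domain = log_prob("pain with patient chest presents", uni, bi)
print(in_domain > out_domain)
```

In a recognizer, scores like these are combined with the acoustic model's scores, which is why adding domain text helps the system prefer plausible medical phrasing over similar-sounding alternatives.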

