What is the current state of open source speech recognition?
It would be immensely useful to be able to run a monologue or dialog WAV file through a program and get more-or-less good text, even if there are some errors. As far as I know, this is still quite a difficult problem that requires an immense amount of data and good language models. Still, I wouldn't be surprised if some pre-trained models are now available that can be run with one of the many Python machine learning toolkits. Are there any?
Kaldi is the best, in my opinion. It has a few open acoustic models. Unfortunately, speech is just hard; acoustic models are only one part of these systems.
Given the Baidu work, I think we can safely say that Hinton et al.'s forecasts from 4 years ago were on the money. Deep approaches are now clearly dominant and have yielded fantastic performance.
The linked paper is 4 years old. DNNs have been dominant in speech since 2012. No one uses GMM systems anymore.
Baidu's approach isn't even the best: IBM's system tends to beat theirs on accuracy, and Google tends not to publish numbers on known benchmarks. It's notable mostly for its use of RNNs to do pronunciation and language modeling (although they also tack on a modified Kneser-Ney LM).
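For anyone wondering what "accuracy" means when systems are compared on benchmarks: it is usually reported as word error rate (WER), the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. Here is a minimal sketch in plain Python. The function name `word_error_rate` and the whitespace tokenization are my own assumptions for illustration; real scoring tools also normalize case and punctuation before counting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch: splits on whitespace only and does no text
    normalization, unlike real benchmark scoring tools.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words, so the WER is 2/6. Note that WER can exceed 1.0 when the hypothesis contains many extra words.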