It would be immensely useful to be able to run a monologue or dialogue WAV file through a program and get more-or-less good text, even if there are some errors. As far as I know, this is still quite a difficult problem, requiring an immense amount of data and good language models. Still, I wouldn't be surprised if there are now some pre-trained models available that can be run using one of the many Python machine learning toolkits. Does anyone know of any?
The linked paper is 4 years old. DNNs have been dominant in speech since 2012. No one uses GMM systems anymore.
Baidu's approach isn't even the best (IBM's system tends to beat theirs on accuracy, and Google tends not to publish numbers on known benchmarks); it's notable mostly for its use of RNNs to do pronunciation and language modeling (although they also tack on a modified Kneser-Ney LM).