The CMU Sphinx project as it stands is basically dead. Even though they recently implemented some sequence-to-sequence deep learning techniques for g2p [1], the core stack is still based on an ancient GMM/HMM pipeline, and current state-of-the-art projects (even open-source ones) have leapfrogged it in terms of accuracy. If you're implementing offline speech recognition today, start with something like this or Kaldi-ASR [2]. It will take a bit of work to get your models running on a mobile device, but the end result will be much more usable.
We've worked with CMU Sphinx in the past too, and the advances in this area over the last few months have been absolutely amazing.
A little bit off-topic, but do you know of any recent work or papers on speech recognition in the language-teaching area? (I mean analysing and rating the accuracy of a speaker, detecting incorrect pronunciation of phones, and so on.)
> Do you know of any recent work or papers on speech recognition in the language-teaching area?
What you're describing is called "speech verification". Language education is an application I'm personally very interested in, and one that almost no one discusses in the speech community (I assume because of machine translation), so if you find any research papers please let me know! I wrote a little about it: http://breandan.net/2014/02/09/the-end-of-illiteracy/
The task is actually much simpler than STT. You display some text on the screen, wait for an audio sample, then check the model's confidence that the sample matches the text. If the confidence is lower than some threshold, then you play the correct pronunciation through the speaker. The trick is doing this rapidly, so a fast local recognizer is key. I've got a little prototype on Android, and it's pretty neat for learning new words. I'd like to get it working for reading recitation, but that's a lot of work.
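A minimal sketch of that check, assuming pocketsphinx's keyphrase search (the model paths, audio format, and threshold below are placeholders you'd tune): constrain the decoder to the target phrase, and if it never reports the phrase above the threshold, treat the attempt as a miss and play the reference audio.

    # Sketch only: assumes pocketsphinx's keyphrase search; paths and threshold are placeholders.
    from pocketsphinx.pocketsphinx import Decoder

    def pronounced_ok(raw_audio_path, target_phrase,
                      hmm='model/en-us', dic='model/cmudict.dict',
                      threshold=1e-20):
        """Return True if the utterance matches target_phrase above the detection threshold."""
        config = Decoder.default_config()
        config.set_string('-hmm', hmm)              # acoustic model directory (placeholder path)
        config.set_string('-dict', dic)             # pronunciation dictionary (placeholder path)
        config.set_string('-keyphrase', target_phrase)
        config.set_float('-kws_threshold', threshold)
        config.set_string('-logfn', '/dev/null')    # keep the decoder quiet

        decoder = Decoder(config)
        decoder.start_utt()
        with open(raw_audio_path, 'rb') as f:       # 16 kHz, 16-bit mono raw PCM assumed
            while True:
                buf = f.read(1024)
                if not buf:
                    break
                decoder.process_raw(buf, False, False)
        decoder.end_utt()

        # hyp() is None when the keyphrase never scored above the threshold.
        return decoder.hyp() is not None

    if __name__ == '__main__':
        if not pronounced_ok('attempt.raw', 'pronunciation'):
            print('Below threshold: play the reference pronunciation')

On a phone you'd feed the microphone stream into process_raw instead of a file, but the accept/reject logic stays the same.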
Hey, thank you for the link to your article. I've read it thoroughly and I couldn't agree more. And that was written two and a half years ago, before the AI "explosion" that we saw later.
Actually, checking against a confidence score is something we've tried to play with, but to my knowledge there isn't a model that lets you compare speech confidence against a specific text. Public APIs like MS ProjectOxford.ai can return a confidence, but against the "recognised" text, not against a predefined text.
Going further, this kind of approach can be very effective on words and short sentences, but I'd really love to see which specific phones the learner is failing on, which would help in analysing full speaking exercises.
It works, but I'm sure it should be possible to do better.
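For the phone-level feedback mentioned above, one rough approach is to align the expected phone sequence (looked up in the pronunciation dictionary) against whatever phone sequence the recognizer produced, and report the mismatches. Where the recognized phones come from (pocketsphinx's allphone mode, a forced aligner, etc.) is an open choice; this sketch only shows the comparison step.

    # Sketch only: aligns an expected phone sequence against a recognized one and
    # reports which expected phones were substituted or dropped. The source of the
    # recognized phones (allphone decoding, forced alignment, ...) is left open.

    def phone_errors(expected, recognized):
        """Levenshtein alignment over phone symbols; returns a list of error tuples."""
        n, m = len(expected), len(recognized)
        # cost[i][j] = edit distance between expected[:i] and recognized[:j]
        cost = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            cost[i][0] = i
        for j in range(1, m + 1):
            cost[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = cost[i - 1][j - 1] + (expected[i - 1] != recognized[j - 1])
                cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

        # Backtrace to recover which expected phones went wrong.
        errors, i, j = [], n, m
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and
                    cost[i][j] == cost[i - 1][j - 1] + (expected[i - 1] != recognized[j - 1])):
                if expected[i - 1] != recognized[j - 1]:
                    errors.append(('substituted', expected[i - 1], recognized[j - 1]))
                i, j = i - 1, j - 1
            elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
                errors.append(('deleted', expected[i - 1], None))
                i -= 1
            else:
                errors.append(('inserted', None, recognized[j - 1]))
                j -= 1
        return list(reversed(errors))

    # Example: learner says "tree" (T R IY) when "three" (TH R IY) was expected.
    print(phone_errors(['TH', 'R', 'IY'], ['T', 'R', 'IY']))
    # [('substituted', 'TH', 'T')]

The ARPAbet symbols here are just the CMU dictionary's notation; any consistent phone set works for the alignment itself.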
[1] http://cmusphinx.sourceforge.net/2016/04/grapheme-to-phoneme...
[2] http://kaldi-asr.org/