Main site: http://userbase.kde.org/Simon
Just to be clear, it wouldn't take you from a bucket of raw sound samples to a string like "I'd like a Coke, please"?
"There is a simple rule of thumb in speech recognition: The smaller the application domain, the better the recognition accuracy. [...] Simon can now re-configure itself on-the-fly as the current situation changes. Through "context conditions" Simon 0.4 can automatically activate and deactivate selected scenarios, microphones and even parts of your training corpus. For example: Why listen for "Close tab" when your browser isn't even open? Or why listen for anything at all when you're actually in the next room listening to music? Yes, Simon is watching you."
It is one of the best subconscious techniques humans use both for listening and reading, so it makes complete sense to implement it here. I do find the choice of words in the final sentence somewhat ominous though!
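A rough sketch of how "context conditions" of the kind the release notes describe might work: each scenario carries a predicate over the current context, and only phrases from active scenarios stay in the recognizer's search vocabulary. All names and structures here are illustrative guesses, not Simon's actual implementation.

```python
# Hypothetical sketch of context-gated vocabulary (not Simon's real API).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Scenario:
    name: str
    phrases: List[str]
    # Active only when this predicate over the current context is true.
    condition: Callable[[Dict[str, bool]], bool] = lambda ctx: True

def active_vocabulary(scenarios: List[Scenario], ctx: Dict[str, bool]) -> List[str]:
    """Return only the phrases whose scenario condition currently holds."""
    vocab: List[str] = []
    for s in scenarios:
        if s.condition(ctx):
            vocab.extend(s.phrases)
    return vocab

scenarios = [
    Scenario("browser", ["close tab", "new tab"],
             condition=lambda ctx: ctx.get("browser_open", False)),
    Scenario("global", ["what time is it"]),
]

# Browser closed: the recognizer never even listens for "close tab".
print(active_vocabulary(scenarios, {"browser_open": False}))
# Browser open: browser commands join the search space.
print(active_vocabulary(scenarios, {"browser_open": True}))
```

Shrinking the active vocabulary this way is exactly the "smaller domain, better accuracy" rule in action: fewer candidate phrases means fewer ways to mis-hear.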
This requires massive amounts of labeled data. This is why Nuance is king and few others come close - the amount of labeled data necessary to catch up is astounding. Not to mention a patent minefield to navigate.
This is unfortunately one field in which open-source alternatives face real obstacles and won't be viable in the near future.
I use CMU Sphinx (PocketSphinx, even though I'm running on a full-blown server) and for my use case it works fairly well.
I downloaded a couple here:
and it didn't seem to include per-word labels with timestamp offsets into the audio recording, which is the vital part for training a recognizer. Am I missing something?
It is very interesting, but unfortunately it just appears to be too much hassle for sane people to tackle (although it would be extremely worthwhile if someone innovated in this space and lowered the barrier to entry; most people are using commercial acoustic models with the FOSS software).
My sibling (using Debian Testing) had wrist RSI this summer, and we tried to set up Simon Listens (the previous version; 0.4 looks, from the release notes, like an improvement). We are both Linux nerds. We were not able to get it to do anything useful after a few days of work, and I estimated a 50-50 chance that working harder on it would help. I did not find any other FOSS speech-to-text that I could get working either. (FOSS Linux text-to-speech is much better; e.g. Orca is good.) We did not try any commercial products; Dragon NaturallySpeaking is the only one I know of with a good reputation, but it is Windows-based. Also, it's hard to integrate non-FOSS software well into the Linux stack. A list of products we looked at: https://en.wikipedia.org/wiki/Speech_recognition_in_Linux
Our issues: compiling, dependencies, figuring out the conceptual model Simon Listens uses, and trying to work out whether Simon-not-doing-anything was caused by a miscompile, the audio input, incompatible or misconfigured dependency versions, or us just doing the wrong thing because the English documentation wasn't very thorough... Imagine setting up Apache, MySQL and PHP if there weren't a billion tutorials online, you'd never used Apache, and MySQL wasn't compatible with your GCC unless you pulled the git version and hoped you didn't get confused by dev-version-only bugs.
It's really frustrating to see projects with such potential botch basic developer usability so badly.
I really wish there were better options out there. Hopefully Simon will help improve the landscape; the automatic contributions to VoxForge in particular should help.
Until now, I'd assumed the voice input on Android was open source, but if it were, it could presumably already have been taken and integrated into desktop apps like this. How does it actually work? Is it a closed-source plugin? Does it send a recording of what you say to a Google API?
Also, there is no Wikipedia page for them!
I wonder if speech recognition software could be developed with:
- A Dynamic Time Warping (DTW) algorithm for comparing utterances/words.
- A recording tool for users to record their own words.
- Context separation, as Simon uses, to limit the phrases listened for at any given time and improve accuracy.
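The DTW idea in the first bullet fits in a few lines of Python. A template-matching recognizer built this way would compare an incoming utterance's feature frames against each stored recording and pick the nearest; the scalar sequences below stand in for real MFCC frame vectors, purely for illustration.

```python
# Minimal dynamic-time-warping distance between two 1-D feature sequences.
# Real systems compare frames of MFCC vectors; this shows only the
# dynamic-programming core of the technique.

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW alignment cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# A slowed-down rendition of the same "word" aligns at zero cost...
print(dtw_distance([1, 2, 3, 2, 1], [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]))  # 0.0
# ...while a different pattern accumulates a positive cost.
print(dtw_distance([1, 2, 3, 2, 1], [3, 1, 3, 1, 3]))
```

This is roughly how small speaker-dependent recognizers worked before HMMs took over: it needs only the user's own recordings (the second bullet), and restricting which templates are compared at any moment is exactly the context separation of the third bullet.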
Do you mean a software device like some kind of control panel? While that's a solution that eases the software developer's job, that's not how people want their software to work. I'm a software developer myself and I don't want my speech software to require training. Or if it's going to require training, fake it for me. Maybe a wizard: "Hi, I'm Simon! I need to hear your voice a bit before we get started. Please read the following sentence: ..." or something.
Sure, this is a <1.0 release, and maybe this recorder will help the devs learn their problem domain a bit more deeply, but I sincerely hope it doesn't become an engineer's crutch. IMO, it's Not Good to expect users to adapt themselves to the technology that's supposed to be serving them.
"If you are a packager and would like to package Simon 0.4, please do get in touch with us. Thank you."