Open-Source Speech Recognition: Simon 0.4.0 Released (simon-listens.blogspot.com)
197 points by Tsiolkovsky 1784 days ago | 34 comments

This might be better described as scenario-based command recognition, where a scenario is something like "Firefox", "Skype", and so on with commands specific to the scenario you're in. In other words, if you're looking to do automated voice transcription these aren't the libraries you're looking for.

Main site: http://userbase.kde.org/Simon

Wait, so if your universe is ["Coke", "Pepsi", "Sprite"], then this library will tell you which one the user spoke?

Just being clear, it wouldn't take you from a bucket of raw sound samples to a string like "I'd like a Coke, please" ?

Yeah, this tool is focused on recognizing specific utterances given the current context, e.g. (to take an example they use) voice-operating a browser. So you say things like "close tab" and it realizes that you've said one of the command words it recognizes in that context, and acts accordingly. They also have an example somewhere about controlling a calculator by speaking digits and arithmetic operations. Doesn't seem to be aimed at free-form audio-to-text transcription, though some of the technologies it's built on, such as CMU Sphinx, are fairly general.

This seems to be the most important/interesting part:

"There is a simple rule of thumb in speech recognition: The smaller the application domain, the better the recognition accuracy. [...] Simon can now re-configure itself on-the-fly as the current situation changes. Through "context conditions" Simon 0.4 can automatically activate and deactivate selected scenarios, microphones and even parts of your training corpus. For example: Why listen for "Close tab" when your browser isn't even open? Or why listen for anything at all when you're actually in the next room listening to music? Yes, Simon is watching you."
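To make the "context conditions" idea concrete, here is a minimal sketch in Python. It is not Simon's actual implementation or API: the `Scenario` class, the `required_process` condition, and the process names are all hypothetical, chosen only to illustrate how the active vocabulary can shrink to match what is currently running.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """A named group of voice commands plus the condition that activates it."""
    name: str
    commands: set
    required_process: str  # hypothetical condition: this program must be running


def active_commands(scenarios, running_processes):
    """Return the union of commands from scenarios whose condition holds."""
    vocab = set()
    for scenario in scenarios:
        if scenario.required_process in running_processes:
            vocab |= scenario.commands
    return vocab


# Illustrative scenarios only; the command lists are made up.
scenarios = [
    Scenario("Firefox", {"close tab", "new tab"}, "firefox"),
    Scenario("Calculator", {"one", "two", "plus", "equals"}, "kcalc"),
]

# With only the calculator open, "close tab" is not even a candidate.
print(sorted(active_commands(scenarios, {"kcalc"})))
```

The recognizer then only has to choose among the returned commands, which is the "smaller application domain" the quote refers to.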

It is one of the best subconscious techniques humans use both for listening and reading, so it makes complete sense to implement it here. I do find the choice of words in the final sentence somewhat ominous though!

Bottom line: speech recognition in the general case (more than a few predetermined words) is only as good as the 1) acoustic model (which utterances were heard), and 2) language model (how do we group the utterances into words).
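As a rough illustration of how the two models interact: in the usual formulation, the recognizer picks the word sequence W that maximizes log P(audio | W) + log P(W), where the first term comes from the acoustic model and the second from the language model. The sketch below assumes the candidate hypotheses and their scores are already given; the numbers are made up for illustration, and the `lm_weight` parameter is a hypothetical tuning knob, not something taken from any particular toolkit.

```python
def best_hypothesis(acoustic_logp, lm_logp, lm_weight=1.0):
    """Pick the hypothesis maximizing acoustic score + weighted LM score.

    Both arguments map a candidate word sequence to a log-probability.
    """
    def score(words):
        return acoustic_logp[words] + lm_weight * lm_logp[words]

    return max(acoustic_logp, key=score)


# Toy, made-up scores: the acoustics slightly prefer the wrong phrase,
# but the language model considers it much less likely.
acoustic_logp = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
lm_logp = {"recognize speech": -4.0, "wreck a nice beach": -9.0}

print(best_hypothesis(acoustic_logp, lm_logp))
```

Setting `lm_weight=0.0` in this toy example makes the acoustically closer but less plausible phrase win, which is why both models need to be good.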

This requires massive amounts of labeled data. This is why Nuance is king and few others come close - the amount of labeled data necessary to catch up is astounding. Not to mention a patent minefield to navigate.

This is unfortunately one field in which open-source alternatives face real obstacles and won't be viable in the near future.

Seems like there ought to be a way to crowdsource some of that.

Yeah there is http://www.voxforge.org/home

I have CMU Sphinx (pocket edition even though I'm running on a full blown server) and for my use case it works fairly well.

This is cool. But are these audio files transcribed, or just provided?

I downloaded a couple here:


and it didn't seem to have a log of words labeled each by timestamp offset into the audio recording - which is the vital part for training a recognizer. Am I missing something?

It's trickier than just matching word sounds. The Sphinx docs are first rate: http://cmusphinx.sourceforge.net/wiki/tutorialam

It is very interesting but unfortunately just appears to be too much hassle for sane people to tackle (although it'd be extremely worthwhile if someone would innovate in this space and lower the barrier to entry - most are using commercial acoustic models with the FOSS software)

I'd love to hear some opinions from anyone with speech recognition experience on how this stacks up to the commercial alternatives.

I don't think this is quite the sort of answer you want:

My sibling (using Debian Testing) had wrist RSI this summer and we tried to set up Simon Listens (the previous version; 0.4 looks from the release notes like it's improving). We are both Linux nerds. We were not able to get it to do anything useful after a few days of work, and I estimated a 50-50 chance that working harder on it would help[1]. I did not find any other FOSS speech-to-text that I could get working either. (FOSS Linux text-to-speech is much better; e.g. Orca is good.) We did not try any commercial products; Dragon Naturally Speaking is the only one I know of having a good reputation but it is Windows-based. Also, it's hard to integrate well into the Linux stack without being FOSS. A list of products we looked at: https://en.wikipedia.org/wiki/Speech_recognition_in_Linux

[1] Issues with compiling, dependencies, figuring out the conceptual model Simon Listens uses, trying to figure out whether Simon-not-doing-anything was because we miscompiled it, or audio input, or incompatible dep versions or misconfigured deps, or us just doing the wrong thing because the English documentation wasn't super thorough... Imagine setting up Apache, MySQL and PHP if there weren't a billion tutorials online, you'd never used Apache, and MySQL wasn't compatible with your GCC unless you pulled the git version and hoped you didn't get confused by dev-version-only bugs.

That is unfortunate, but somehow that sort of user experience doesn't come as a surprise to me. I think it is the "blog as the main project page, and no version control link in sight" thing that gives off that feeling. The only other thing that gives off vibes that bad is a sourceforge page with no hint of a real project page.

Really frustrating when you see projects with potential botch basic developer usability so badly.

There's also http://www.simon-listens.org/ . If you speak German there might be some info there that we couldn't read (some is in English, but I think the developers are mostly in Germany).

Ah, that's a bit better. Found the version control: http://sourceforge.net/projects/speech2text/

Actually, the Simon project has recently moved to KDE infrastructure: http://dot.kde.org/2012/04/08/simon-speech-recognition-proje... and the code is here: https://projects.kde.org/projects/extragear/accessibility/si...

I've spent a good bit of time trying to get CMUSphinx to transcribe audio with any amount of reasonable accuracy. I never was very successful. Resorted to using paid third party APIs.

I really wish there were better options out there. Hopefully Simon will help improve the landscape. The automatic contributions to Voxforge should help.

What third party APIs do you use? And how accurate are they at transcribing long audio recordings with a reasonable amount of noise?

I have the most experience with Nuance's NDEV API. It is not really for very long audio. I'd say < 2 minutes.

This, and the other comments about the poor state of open-source speech recognition, are very disappointing.

Until now, I'd assumed the voice input on Android was open source. But if so it could clearly be taken and integrated into desktop apps like this. How does it actually work? Is it a closed-source plugin? Does it send a recording of what you say to a Google API?

The voice recognition is 1) part of the Google apps (the Google Search app), which are mostly not open source, and 2) sends the input to Google to be analyzed.

On Android 4.0 (afaik) the voice recognition is offline! But surely still a non-free Google app. :(

I would love to see a good, quality, open source alternative to Nuance. They are pretty much a monopoly (patent and market wise) in this area. It also has to be open source, Nuance is known for either suing into oblivion (for patent infringement) or buying the competition.

This is actually the sort of project that will/can never be open-source. Not because of the patents, but because it's hard and requires 1000s of hours of thankless data collection. There's no shortcut to something cool, and open source loves cool demos that are easy to build.

I'm really glad somebody's working on this. When my ex got RSI a couple of years ago it seemed like there was no option but to go back to Windows. There aren't many unsurmountable issues left for Linux users, but this seemed to be one of them.

I had some success (five to ten years ago now) with having a Windows machine purely to run the dictation software, and using a remote access tool like x2x or synergy to pass the keystrokes through to a Linux box which ran my actual desktop. Obviously you lose some of the application-awareness but for people who need the voice recognition but find Windows drives them up the wall it's better than nothing, and it has the incidental advantage that the dictation software isn't competing with anything else for CPU and RAM.

Does anyone know if this does full speech-to-text transcription, i.e., I speak, it fills my speech into a text box? Or is it just for controlling the desktop via speech? I tried googling, but couldn't come up with much.

Also, there is no Wikipedia page for them!

On a side note: this is part of the HTML5 specification. If people are not afraid of Google/Chrome and if being online is not an issue, it's as simple as this: http://jsfiddle.net/dirkk0/pGFuR/

I only found https://wiki.mozilla.org/SpeechAPI for the FF implementation. It looks like it has no Linux support either, but I only glanced at the patch they made... It seems to support recording for OSX and Windows at least.

I would like to include speech recognition in my commercial projects, but the license is GPL and it is tied to KDE. I've also tried Sphinx before, but the recognition is kind of poor and it lacks a GUI for user/developer configuration of the grammars.

I wonder if speech recognition software can be developed with:
- A Dynamic Time Warping (DTW) algorithm for comparing utterances/words.
- A recording device for the users to record their words.
- Context separation like Simon uses for limiting the phrases to listen for at any time and improving accuracy.
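For reference, the DTW part of that idea can be sketched in a few lines of Python. This is a textbook dynamic-programming version under simplifying assumptions: each utterance is a plain list of numbers (a real system would compare per-frame feature vectors such as MFCCs), and the `classify` helper and the templates are hypothetical stand-ins for the words a user recorded.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of a
    # with the first j items of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the best of: stretch a, stretch b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(utterance, templates):
    """Return the label of the recorded template closest to the utterance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


# Made-up one-dimensional "recordings" standing in for real audio features.
templates = {"yes": [1, 3, 1], "no": [5, 5, 5]}
print(classify([1, 1, 3, 3, 1], templates))
```

Because DTW lets one sequence stretch against the other, a slower rendition of the same word still matches its template; restricting `templates` to the currently active context keeps the comparison set small.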

"A recording device for the users to record their words."

Do you mean a software device like some kind of control panel? While that's a solution that eases the software developer's job, that's not how people want their software to work. I'm a software developer myself and I don't want my speech software to require training. Or if it's going to require training, fake it for me. Maybe a wizard: "Hi, I'm Simon! I need to hear your voice a bit before we get started. Please read the following sentence: ..." or something.

Sure, while this is a <1.0 release, maybe this recorder will help the devs learn their problem domain a bit more deeply, but I'd sincerely hope it doesn't become an engineer's crutch. IMO, it's Not Good to expect users to adapt themselves to the technology that's supposed to be serving them.

Looks like it depends on KDE4, but they distributed the binaries under windows? Anyone know if they plan a DEB or RPM release?

I just compiled it on Ubuntu 12.04. It wasn't too bad. Although I agree, a PPA would be nice.

Looks like they're looking for package maintainers; see the quote at the bottom of the post:

"If you are a packager and would like to package Simon 0.4, please do get in touch with us. Thank you."

Can it be used for making education programs? For example a program to help improve English pronunciation.
