

Open-Source Speech Recognition: Simon 0.4.0 Released - Tsiolkovsky
http://simon-listens.blogspot.com/2012/12/simon-040.html

======
biot
This might be better described as scenario-based command recognition, where a
scenario is something like "Firefox", "Skype", and so on with commands
specific to the scenario you're in. In other words, if you're looking to do
automated voice transcription these aren't the libraries you're looking for.

Main site: <http://userbase.kde.org/Simon>

~~~
gcr
Wait, so if your universe is ["Coke", "Pepsi", "Sprite"], then this library
will tell you which one the user spoke?

Just being clear, it wouldn't take you from a bucket of raw sound samples to a
string like "I'd like a Coke, please" ?

~~~
_delirium
Yeah, this tool is focused on recognizing specific utterances given the current
context, e.g. (to take an example they use) voice-operating a browser. So you
say things like "close tab" and it realizes that you've said one of the
command words it recognizes in that context, and acts accordingly. They also
have an example somewhere about controlling a calculator by speaking digits
and arithmetic operations. Doesn't seem to be aimed at free-form audio-to-text
transcription, though some of the technologies it's built on, such as CMU
Sphinx, are fairly general.

------
CKKim
This seems to be the most important/interesting part:

"There is a simple rule of thumb in speech recognition: The smaller the
application domain, the better the recognition accuracy. [...] Simon can now
re-configure itself on-the-fly as the current situation changes. Through
"context conditions" Simon 0.4 can automatically activate and deactivate
selected scenarios, microphones and even parts of your training corpus. For
example: Why listen for "Close tab" when your browser isn't even open? Or why
listen for anything at all when you're actually in the next room listening to
music? Yes, Simon is watching you."

It is one of the best subconscious techniques humans use both for listening
and reading, so it makes complete sense to implement it here. I do find the
choice of words in the final sentence somewhat ominous though!
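
A minimal sketch of the idea in the quoted passage, assuming a "context condition" is nothing more than "a given process is running". The `Scenario` class, the `required_process` field and the process names are hypothetical illustrations, not Simon's actual API:

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """A named group of commands that is only active under some condition."""
    name: str
    commands: frozenset
    required_process: str  # hypothetical condition: this process must be running


def active_commands(scenarios, running_processes):
    """Return the union of commands from scenarios whose condition holds."""
    active = set()
    for scenario in scenarios:
        if scenario.required_process in running_processes:
            active |= scenario.commands
    return active


scenarios = [
    Scenario("browser", frozenset({"close tab", "new tab"}), "firefox"),
    Scenario("calculator", frozenset({"plus", "minus", "equals"}), "kcalc"),
]

# Only the browser is open, so only its commands are listened for.
print(sorted(active_commands(scenarios, {"firefox"})))
```

The smaller the set returned here, the fewer candidates the recognizer has to tell apart, which is the rule of thumb the release notes describe.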

------
plainsman
Bottom line: speech recognition in the general case (more than a few
predetermined words) is only as good as the 1) acoustic model (which
utterances were heard), and 2) language model (how do we group the utterances
into words).
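
As a rough sketch of how the two models interact, a recognizer commonly picks the word sequence that maximizes the acoustic log-probability plus a weighted language-model log-probability. The function name, the `lm_weight` parameter and all of the numbers below are made up for illustration:

```python
import math


def best_hypothesis(acoustic_logp, lm_logp, lm_weight=1.0):
    """Pick the hypothesis maximizing acoustic + weighted LM log-probability."""
    def score(hypothesis):
        return acoustic_logp[hypothesis] + lm_weight * lm_logp[hypothesis]
    return max(acoustic_logp, key=score)


# Made-up scores: the acoustic model slightly prefers the wrong phrase,
# but the language model considers it far less likely as English text.
acoustic_logp = {
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
}
lm_logp = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.0001),
}

print(best_hypothesis(acoustic_logp, lm_logp))
```

With `lm_weight=0` the acoustic score alone decides and the second phrase wins, which is why both models need good training data.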

This requires massive amounts of labeled data. This is why Nuance is king and
few others come close - the amount of labeled data necessary to catch up is
astounding. Not to mention a patent minefield to navigate.

This is unfortunately one field in which open-source alternatives face real
obstacles and won't be viable in the near future.

~~~
DennisP
Seems like there ought to be a way to crowdsource some of that.

~~~
3amOpsGuy
Yeah there is <http://www.voxforge.org/home>

I have CMU Sphinx (pocket edition even though I'm running on a full blown
server) and for my use case it works fairly well.

~~~
plainsman
This is cool. But are these audio files transcribed, or just provided?

I downloaded a couple here:

<http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit/>

and it didn't seem to have a log of words labeled each by timestamp offset
into the audio recording - which is the vital part for training a recognizer.
Am I missing something?

~~~
3amOpsGuy
It's trickier than just matching word sounds. The Sphinx docs are first rate:
<http://cmusphinx.sourceforge.net/wiki/tutorialam>

It is very interesting but unfortunately just appears to be too much hassle
for sane people to tackle (although it'd be extremely worthwhile if someone
would innovate in this space and lower the barrier to entry - most are using
commercial acoustic models with the FOSS software)

------
rpm4321
I'd love to hear some opinions from anyone with speech recognition experience
on how this stacks up to the commercial alternatives.

~~~
idupree
I don't think this is quite the sort of answer you want:

My sibling (using Debian Testing) had wrist RSI this summer and we tried to
set up Simon Listens (the previous version; 0.4 looks from the release notes
like it's improving). We are both Linux nerds. We were not able to get it to
do anything useful after a few days of work, and I estimated a 50-50 chance
that working harder on it would help[1]. I did not find any other FOSS speech-
to-text that I could get working either. (FOSS Linux text-to-speech is much
better; e.g. Orca is good.) We did not try any commercial products; Dragon
NaturallySpeaking is the only one I know of having a good reputation but it
is Windows-based. Also, it's hard to integrate well into the Linux stack
without being FOSS. A list of products we looked at:
<https://en.wikipedia.org/wiki/Speech_recognition_in_Linux>

[1] Issues with compiling, dependencies, figuring out the conceptual model
Simon Listens uses, trying to figure out whether Simon-not-doing-anything was
because we miscompiled it, or audio input, or incompatible dep versions or
misconfigured deps, or us just doing the wrong thing because the English
documentation wasn't super thorough... Imagine setting up Apache, MySQL and
PHP if there weren't a billion tutorials online, you'd never used Apache, and
MySQL wasn't compatible with your GCC unless you pulled the git version and
hoped you didn't get confused by dev-version-only bugs.

~~~
jlgreco
That is unfortunate, but somehow that sort of user experience doesn't come as
a surprise to me. I think it is the _"blog as the main project page, and no
version control link in sight"_ thing that gives off that feeling. The only
other thing that gives off vibes that bad is a sourceforge page with no hint
of a real project page.

_Really_ frustrating when you see projects with potential botch basic
developer usability so badly.

~~~
idupree
There's also <http://www.simon-listens.org/> . If you speak German there might
be some info there that we couldn't read (some is in English, but I think the
developers are mostly in Germany).

~~~
jlgreco
Ah, that's a bit better. Found the version control:
<http://sourceforge.net/projects/speech2text/>

~~~
Tsiolkovsky
Actually, the Simon project has recently moved to KDE infrastructure:
<http://dot.kde.org/2012/04/08/simon-speech-recognition-project-moves-kde>
and the code is here:
<https://projects.kde.org/projects/extragear/accessibility/simon/repository>

------
sunsu
I've spent a good bit of time trying to get CMUSphinx to transcribe audio with
any amount of reasonable accuracy. I never was very successful. Resorted to
using paid third party APIs.

I really wish there were better options out there. Hopefully Simon will help
improve the landscape. The automatic contributions to Voxforge should help.

~~~
karterk
What third party APIs do you use? And how accurate are they in transcribing
long audios with a reasonable amount of noise?

~~~
sunsu
I have the most experience with Nuance's NDEV API. It is not really for very
long audio. I'd say < 2 minutes.

------
rdtsc
I would love to see a good, quality, open source alternative to Nuance. They
are pretty much a monopoly (patent and market wise) in this area. It also has
to be open source, Nuance is known for either suing into oblivion (for patent
infringement) or buying the competition.

~~~
JoeAltmaier
This is actually the sort of project that will/can never be open-source. Not
because of the patents, but because it's hard and requires 1000s of hours of thankless
data collection. There's no shortcut to something cool - and open source loves
cool demos that are easy to build.

------
dakota
Does anyone know if this does full speech-to-text transcription, i.e., I
speak, it fills my speech into a text box? Or is it just for controlling the
desktop via speech? I tried googling, but couldn't come up with much.

Also, there is no Wikipedia page for them!

~~~
dirkk0
on a side note: this is part of the HTML5 specification. If people are not
afraid of Google/Chrome and if being online is not an issue, it's as simple as
this: <http://jsfiddle.net/dirkk0/pGFuR/>

~~~
manveru
I only found <https://wiki.mozilla.org/SpeechAPI> for the FF implementation,
looks like it has no Linux support either, but I only glanced at the patch
they made... It seems to support recording for OSX and Windows at least.

------
Joeboy
I'm really glad somebody's working on this. When my ex got RSI a couple of
years ago it seemed like there was no option but to go back to Windows. There
aren't many unsurmountable issues left for Linux users, but this seemed to be
one of them.

~~~
pm215
I had some success (five to ten years ago now) with having a Windows machine
purely to run the dictation software, and using a remote access tool like x2x
or synergy to pass the keystrokes through to a Linux box which ran my actual
desktop. Obviously you lose some of the application-awareness but for people
who need the voice recognition but find Windows drives them up the wall it's
better than nothing, and it has the incidental advantage that the dictation
software isn't competing with anything else for CPU and RAM.

------
smogzer
I would like to include speech recognition in my commercial projects, but
the license is GPL and it is tied to KDE. I've also tried Sphinx before, but the
recognition is kind of poor and it lacks a GUI for user/developer
configuration of the grammars.

I wonder if speech recognition software could be developed with:

- A Dynamic Time Warping (DTW) algorithm for comparing utterances/words.
- A recording device for the users to record their words.
- Context separation, like Simon uses, for limiting the phrases to listen for
  at any time and improving accuracy.
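
A minimal sketch of the DTW idea, assuming each utterance has already been reduced to a short 1-D feature sequence. Real systems compare frames of multi-dimensional features such as MFCCs; the template values and function names here are invented for illustration:

```python
def dtw_distance(a, b):
    """DTW distance between two 1-D sequences, using |x - y| as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def closest_template(utterance, templates):
    """Return the name of the recorded template nearest to the utterance."""
    return min(templates, key=lambda name: dtw_distance(utterance, templates[name]))


# Hypothetical user-recorded templates, one per word.
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 1, 1]}

# The same shape as "yes", but spoken more slowly (some frames repeated).
spoken = [1, 1, 3, 4, 4, 3, 1]
print(closest_template(spoken, templates))
```

Because DTW lets either sequence repeat frames, the slower utterance still aligns with its template at zero cost. Restricting `templates` to the words valid in the current context would shrink the search in the same way the third point suggests.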

~~~
delinka
"A recording device for the users to record their words."

Do you mean a software device like some kind of control panel? While that's a
solution that eases the software developer's job, that's not how people want
their software to work. I'm a software developer myself and _I_ don't want my
speech software to require training. Or if it's going to require training,
fake it for me. Maybe a wizard: "Hi, I'm Simon! I need to hear your voice a
bit before we get started. Please read the following sentence: ..." or
something.

Sure, while this is a <1.0 release, maybe this recorder will help the devs
learn their problem domain a bit more deeply, but I'd sincerely hope it
doesn't become an engineer's crutch. IMO, it's Not Good to expect users to
adapt themselves to the technology that's supposed to be serving them.

------
lifelongUU
Looks like it depends on KDE4, but they distributed the binaries for
Windows? Anyone know if they plan a DEB or RPM release?

~~~
hippich
I just compiled it on Ubuntu 12.04. It wasn't too bad. Although I agree, a PPA
would be nice.

------
Egregore
Can it be used for making education programs? For example a program to help
improve English pronunciation.

