

On Speech Recognition: Pointers for Newbies & Lessons from a failed startup - fjabre
http://www.teabuzzed.com/2009/08/on-speech-recognition-web-app-integration-pointers-for-newbies-lessons-learned-from-a-failed-startup/

======
Caligula
Long 5am post alert, broken into sections corresponding to the author's.

I wish I had read this 8 months ago so it could have prepared me for some of
the misery to come. I initially thought it would be simple: use some of the
open source tools or commercial SDKs, then quickly move on to the important
stuff. It did not turn out that way. In fact, speech recognition is a
time-killing whore, but it sure is interesting, sometimes at least.

1\. Telephony & VoIP:

I disagree with the author. FreeSWITCH comes with speech recognition built in
and recently added UniMRCP, so I think FS is definitely better, not even
taking into account how shady Asterisk is.

The prices he lists are also on the high side. You can get local DIDs for
half what he lists, probably even less in bulk.

2\. Web Services:

Agree with his comments on web services. Speech recognition is perfect for
scaling onto the magical cloud. Dictation is more difficult because, for very
large vocabulary at least, you're going to tie up a whole system to process it
and be lucky to get 1RT (1x real time).
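
For anyone unfamiliar with the "RT" shorthand: it is the ratio of decoding
time to audio duration, so 1RT means the decoder needs one second of compute
per second of speech. A minimal sketch (the function name and the timings are
made up for illustration, not taken from any decoder):

```python
def real_time_factor(decode_seconds, audio_seconds):
    """Return decode time divided by audio duration (1.0 == 1x real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return decode_seconds / audio_seconds


# Hypothetical timings: 45 s of compute for a 30 s utterance.
rtf = real_time_factor(45.0, 30.0)
print(f"{rtf:.2f}x real time")  # anything above 1.0 is slower than real time
```

Anything above 1.0 means callers wait longer than they spoke, which is why
large-vocabulary dictation tends to eat a whole machine per stream.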

3\. Embedded

Disagree with this. Some decoders are made specifically for this. Well,
pocketsphinx is. It even uses ARM assembly code to speed up calculations on
embedded devices, so you can get better results on the iPhone, for example.
The developer of pocketsphinx recently made this fantastic demo for the N800,
which I recall was similar to the iPhone in power.

<http://www.youtube.com/watch?v=OEUeJb6Pwt4>

And I am sure it can be tweaked to be even better. So much of speech
recognition is tweaking. As he states correctly, LumenVox is based on
Sphinx (Sphinx2, I recall reading); it's just very tweaked.

Commercially available:

Very funny, I agree with most. MS, AT&T, and IBM suck, at least at providing
APIs, but their tech is very good. IBM, for example, released theirs as open
source but changed their mind and just let it die. Nuance is 'contact our
salesperson' expensive. LumenVox is the most affordable.

Open Source:

Minor nitpick, he should have included HTK with Julius. Acoustic models
definitely have given me the most grief.

I use sphinx4 and pocketsphinx and am very pleased. They are state of the art
decoders. The acoustic model, or lack thereof, is the reason why commercial
engines are perceived as superior. If you have 50k to spend, you can get an
LDC membership with tonnes of transcribed data. Slap it into the trainer's
format, BAM. Unfortunately I was not going to do this. I disliked the LDC for
what I thought was gouging: same prices for mega corporations as individuals.
But after making my own model, and still being in the neverending process of
tweaking it, etc., I appreciate the misery that is collecting and organizing
transcribed data, and I appreciate the work they do even if I can't use it.

Notice something wrong with your model? You need to retrain it. That takes
days with a quad core. In fact my new 8-core beast of a server arrives next
week. I wonder how my parents will take the noise; apparently servers are
loud. Even with that it will take maybe half a day. And spotting errors is
hard. I can't emphasize enough how boring it is to listen to hours on end of
audio and check that it matches up with the text perfectly, in some cases
listening a bunch of times to make sure. Then you notice issues with your
model and have to go figure out why.

Voxforge.org is great. They have ~50 hours of quality data, most at 16 kHz
from computer microphones but a good portion at 8 kHz telephone quality. You
can always downsample. But for dictation you need much more.
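
Downsampling the 16 kHz recordings to 8 kHz means low-pass filtering below
4 kHz and then keeping every second sample. A rough pure-Python sketch,
assuming mono audio as a list of floats; the filter length and cutoff are
arbitrary choices for illustration, and in practice you would reach for a
proper resampler:

```python
import math


def lowpass_taps(num_taps=31, cutoff=0.23):
    """Hamming-windowed sinc low-pass FIR.

    cutoff is in cycles per input sample (0.23 is about 3.7 kHz at 16 kHz).
    num_taps should be odd so the filter has a centre tap.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at DC


def downsample_by_2(samples, taps=None):
    """Low-pass filter, then keep every second sample (16 kHz -> 8 kHz)."""
    taps = taps or lowpass_taps()
    half = len(taps) // 2
    out = []
    for i in range(0, len(samples), 2):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + k - half
            if 0 <= j < len(samples):  # treat audio outside the clip as silence
                acc += t * samples[j]
        out.append(acc)
    return out


# One second of a 440 Hz tone sampled at 16 kHz, reduced to 8000 samples.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
print(len(downsample_by_2(tone)))
```

Skipping the filter and just dropping samples would fold everything above
4 kHz back into the telephone band as aliasing.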

You don't need thousands of hours unless you're doing dictation and that
extra few percent is worth it. You can get good results with low hundreds.
There are other factors, like language models, that he should have mentioned
and that can be just as important as the acoustic model, including how
important it is to have lots of relevant data to train them. The acoustic
model is only one of many factors (as is the decoder, for that matter).
Perhaps he left it out because he did not focus on dictation.
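
To make the point about relevant language model data concrete, here is a toy
sketch: an add-one smoothed bigram model scoring the same sentence after
training on in-domain versus unrelated text. The training sentences and
function names are made up for illustration; real toolkits use far better
smoothing and far more data.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their histories from whitespace-tokenised text."""
    histories, bigrams, vocab = Counter(), Counter(), {"</s>", "<unk>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(words[1:])
        histories.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return histories, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log probability of a sentence under add-one smoothing."""
    histories, bigrams, vocab_size = model
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(h, w)] + 1) / (histories[h] + vocab_size))
        for h, w in zip(words, words[1:])
    )


# Hypothetical training text: one set matches the task, the other does not.
in_domain = train_bigram(["call my voicemail", "check my voicemail"])
unrelated = train_bigram(["the cat sat on the mat", "the dog sat"])

query = "check my voicemail"
print(log_prob(query, in_domain), log_prob(query, unrelated))
```

The in-domain model assigns the query a much higher score, and that gap is
what lets a decoder prefer the right words when the audio is ambiguous.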

Wrapping Up:

FS is better. At least try both; it's trivial to set each up and follow a
simple tutorial. You can still plug LumenVox into FS, although it will cost.
But it would cost the same for Asterisk. I agree that it's difficult but I
don't think that should stop you. Just be aware it's a lot of work, some of it
very boring and frustrating, but I am sure the same can be said for most
things. Ok maybe not :)

------
mahmud
As real and raw as anything written by a badly bitten practitioner. If you're
doing SR, don't read this article and don't bookmark it, wrecking a nice beach
takes a lot of fucking effort; just contact the author and grab the guy while
he is between gigs.

If I had the money I would be hiring these guys with "failed" startups and
pairing them with good sales people. They haven't failed, they just don't have
industry contacts :-(

------
henning
My guess is that if any of this was new to you, it should be a clear sign you
have no business doing a company in this niche.

~~~
fjabre
Is that to say Mark Zuckerberg shouldn't have done a social networking site or
Steve Jobs, Pixar..? In fact maybe anybody who has their bachelor's degree in
liberal arts shouldn't even dream of going to med school or being a lawyer.
This line of reasoning is flawed. Some of the most successful people I've met
in life have ended up doing things they never would have imagined.

------
fjabre
BTW just to clarify: by all means integrate speech rec into your solution if
you feel you get a real value proposition by doing so. For my product it
wasn't really needed and was more of a distraction than anything else. However
I learned so much in the process I wanted to share this back with the
community.

------
danbmil99
He starts out saying "stay the fuck away from this stuff" then goes on to talk
lovingly of the tech. WTF? Where's the explanation of why it didn't work as a
business?

I think the OP is a bit too in love with the technology, that's probably why
he got burnt.

~~~
mbrubeck
And he said exactly that in his previous post: _"So the number one reason my
startup failed was: I was distracted by a cool and shiny feature that didn’t
solve anyone’s problem."_

<http://www.teabuzzed.com/2009/08/the-number-one-reason-my-startup-failed/>

