I would really appreciate it if someone working on speech technologies (speech-to-text or text-to-speech) could provide some insight on this.
For the past couple of months, I have been completely fascinated with what speech technologies (specifically speech recognition and speech synthesis) can do and how they can enhance the user experience. I decided to delve deep into speech synthesis technologies. From my research into available solutions, there is a huge difference in quality between the open source solutions and the commercial ones that cost serious money.
From what I have read about speech recognition, the open source solutions perform extremely poorly when compared to their commercial counterparts.
Has anyone else here looked at the possibility of improving any of the available open source speech technologies to a level close to the commercial ones? Is it even possible to improve Sphinx or Festival to a level where they can be used commercially, without developing everything from scratch?
Is it something even worth investigating?
Is it possible for someone working in this area to articulate the challenges (technical, monetary, etc.) involved?
Okay, thanks a lot for reading this. Looking forward to your comments.
P.S.:
I would really like to get an opinion from someone who has worked, or is working, in this area about their experiences. I am located in the South Bay.
I am also attending Startup School this month.
From what I gathered, it would be pretty difficult to get open source speech synthesis up to the level of the commercial products. A single "voice" is generated by recording tens of hours of audio and using algorithms to splice pieces of that audio together based on the text.
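To make the splicing idea concrete, here is a toy sketch in Python. It is not how Festival or any commercial engine is implemented. It assumes a hypothetical inventory that maps recorded words or phrases to lists of audio samples, picks the longest recorded phrase that matches the text, and crossfades the pieces together. All names (`select_units`, `crossfade`, `synthesize`) are made up for illustration, and real systems typically work with much smaller units, such as diphones, and far more elaborate selection criteria.

```python
from typing import Dict, List


def crossfade(a: List[float], b: List[float], overlap: int) -> List[float]:
    """Join two clips, linearly blending the end of `a` into the start of `b`."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of clip `b`, rising toward 1
        blended.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return a[:-overlap] + blended + b[overlap:]


def select_units(words: List[str], inventory: Dict[str, List[float]],
                 max_span: int = 3) -> List[str]:
    """Greedy longest match: prefer a recorded phrase over single words."""
    units = []
    i = 0
    while i < len(words):
        for span in range(min(max_span, len(words) - i), 0, -1):
            key = " ".join(words[i:i + span])
            if key in inventory:
                units.append(key)
                i += span
                break
        else:
            # Nothing in the recordings covers this word.
            raise KeyError(f"no recorded unit covers {words[i]!r}")
    return units


def synthesize(text: str, inventory: Dict[str, List[float]],
               overlap: int = 2) -> List[float]:
    """Splice the selected recordings into one stream of samples."""
    audio: List[float] = []
    for key in select_units(text.lower().split(), inventory):
        audio = crossfade(audio, inventory[key], overlap)
    return audio


# Hypothetical inventory: in practice each value would be a recorded clip.
inventory = {
    "good": [1.0, 1.0, 1.0, 1.0],
    "morning": [0.0, 0.0, 0.0, 0.0],
}
samples = synthesize("Good morning", inventory)
```

Even in this toy version the core trade-off shows up: the more phrases the inventory contains, the fewer joins are needed and the more natural the result, which is why so many hours of recordings go into a single voice.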
The only significant monetary constraint is going to be if you want a professional voice talent doing your recordings in a studio, but I wouldn't try to tackle the technical issues without a subject matter expert.