It's a shame that the quality of open-source text to speech engines is so much worse than the current commercial state of the art, as that's the most notable difference in the demo videos between this and something like Siri. Would fixing that just be a matter of recording more high-quality free sample libraries, etc., or are there fundamental technical challenges to solve?
It's mostly a question of training data. Google trains its acoustic models on thousands of hours of annotated audio samples. It's very hard for an open source project to i/ get enough data ii/ have the computing power to actually train with it.
Sounds like this field could use some crowd sourcing.
What kind of data would be most useful for learning? Would it be the same phrases read by different people, single words, or is any text good? Do we need lots of recordings from a single person, or many smaller samples from different people?
I'm thinking of a web site where people could contribute to the project by joining and then reading out phrases the system shows them.
Seriously there is no massive CC-licensed source of audio data out there. Most of the fancy algorithms for doing speech recognition are on github. What isn't is a massive and diverse dataset. I encourage others to reply if they have seen otherwise.
I strongly disagree! Text to speech needs the same data as speech to text - a well-annotated collection of raw, single-speaker recordings from a variety of speakers, with accompanying text labels.
It is very, very difficult to find a large, well curated dataset of speech with accompanying text labels. TIMIT is the gold standard of speech recognition despite its simplicity, and it costs ~$150(!) to get access at all. Switchboard is another famous one which is bigger, but only partially labeled and still very hard to get. CMU has an openly available, small dataset called AN4, but it is only 64MB of raw recordings (a lot for 1991 when it was recorded, but still small today).
You could use YouTube and its automatic captions, but the text is shoddy at best and definitely too unreliable to feed to an algorithm without significant cleansing.
Honestly, U.S. Congressional speeches may be the best "open" option IMO - good transcriptions, a variety of speakers, and TONS of data. If you could sync up the transcriptions with recordings of C-SPAN it could work very well, though I have not seen a dataset that puts this together. This still only covers English!
Companies understand that data is valuable - that is why they take the rights to it in every EULA! Well curated internal datasets will be the KFC/Coke secret recipes of the "big data" age.
I can't upvote this enough. There are very few companies that are masters of this game, because it is extremely difficult to build. Every stage is a painful process and there are so many moving parts. To begin with, you need very high quality recordings. Successful companies have voice directors whose job is to record voice at the most optimal settings (and to make sure the speaker speaks in a neutral, dispassionate tone). Then there is the manual process of segmentation and phoneme alignment, which requires a lot of human effort. Then you run the code, which will take quite some time. Then you study the quality of the synthesis and fix issues manually (for example, if you are using concatenative synthesis, you may have to fix offending phoneme alignments based on feedback). And you iterate on this. Unless you have a team, I wouldn't venture into this and would rather use the third-party leaders.
I recently discovered vocalid.org thanks to a TED video. They work to provide people with speech impairments with unique synthetic voices, thanks to voice donors. According to their process, a voice donor needs to read 2-3 hours of text to allow them to build a custom synthetic voice.
I don't know if they have any plan to open their data, but if someone could find a way to increase the number of voice donors while improving an open dataset, it would be interesting. Just a thought; there are most certainly many more issues / options in this field.
I wonder if the audio from television programmes combined with subtitles would make a good training set. Here in the UK many of the subtitles are very good quality (presumably human-transcribed), and it would be easy to strip out non-language noises such as laughter (normally transcribed in brackets).
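For what it's worth, stripping the bracketed annotations out of a SubRip (.srt) file only takes a few lines of Python. A rough sketch (the file name and the exact bracket conventions are my assumptions; broadcasters vary):

    import re

    TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> ")
    NON_SPEECH = re.compile(r"[\[(][^\])]*[\])]")   # [LAUGHTER], (applause), ...

    def clean_srt(path):
        lines = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.isdigit() or TIMING.match(line):
                    continue                         # drop cue numbers, timing lines, blanks
                line = NON_SPEECH.sub("", line).strip()
                if line:
                    lines.append(line)
        return " ".join(lines)

    print(clean_srt("episode01.srt"))

You'd still need to line the cleaned text up with the broadcast audio via the cue timings, but the text side is cheap.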
Economics is the main challenge, because it's a difficult task and the demand for it is low (the market is small, if demanding). But the MBROLA people, who have a nice free (though not open-source) speech synthesis program, say it's a hard technical problem too.
Although people mention the lack of data, it's not really a problem anymore. It was a problem a few years ago, but these days we have access to thousands of hours of speaker data. For example, we have a 100-hour single-speaker database at CMUSphinx; data is not the big problem.
Algorithms are the bigger problem these days. For TTS, for example, one has to implement hybrid speech synthesis, combining hidden Markov models with unit selection. Most commercial companies use this technology, but no open source project has released it. The closest are OpenMary and Festival, but they are either unit selection or HMM only; there is no hybrid synthesis implementation yet.
Care to check a dictionary first? apendleton is not saying open source programmers should be ashamed that they haven't produced a high quality speech engine. Would you feel better if it was phrased like "it's a bummer that..."? Less hurt feelings?
I'm normally not a fan of condescending comments, but you hit the nail right on the head. It doesn't seem like some people grasp that a lot of open source projects are coded by unpaid volunteers in their free time in subjects they're passionate about.
Granted, there are commercial open source projects.
I'm sure everyone on HN is familiar with how open source projects work. The poster wasn't trying to tell contributors to get their act together and produce a higher quality product.
Beyond just lamenting the quality gap, they were trying to understand what it was that open source projects of this nature are missing. Is it lack of research making its way downstream? Poor media quality? There was no blame placed on contributors.
Understanding the answers to these questions can even help someone like the parent find an 'in' to contribute something themselves.
Great suggestions. I've been meaning to take a deeper dive into Wit.ai for a while now. It seems like their intents-entities architecture would actually fit in pretty cleanly with Jasper.
As an aside: I don't think it'd be difficult to develop Jasper modules that use Wit without modifying much of the original source (as long as the speech-to-text systems pick up on the text you'd need to pass to Wit).
Hi Kyle, Wit.AI team here -- thanks for suggestion (1) :-)
Actually our typical latency is less than 0.5s if you stream audio to the API (instead of waiting until a silence, then sending a WAV file). Also, we are working on an embeddable client (you would still use Wit.AI online to train your model, but then they can run locally on your Raspberry Pi).
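If you're in Python, hitting the speech endpoint with requests looks roughly like this -- the token and file name are placeholders, the response format is best checked against our docs, and passing a generator instead of a file object is how you'd stream chunks as they come off the microphone:

    import requests

    WIT_TOKEN = "YOUR_WIT_ACCESS_TOKEN"   # placeholder: your server access token

    def wit_transcribe(wav_path):
        headers = {
            "Authorization": "Bearer " + WIT_TOKEN,
            "Content-Type": "audio/wav",
        }
        # Streaming the body instead of buffering a whole recording is what
        # keeps latency down; a generator yielding audio chunks as they are
        # captured would send them as they arrive.
        with open(wav_path, "rb") as audio:
            resp = requests.post("https://api.wit.ai/speech",
                                 headers=headers, data=audio)
        resp.raise_for_status()
        return resp.json()

    print(wit_transcribe("command.wav"))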
I also did a project like this last year -- I only got asking for the date & time working, and was happy enough to stop there...
I disagree with (1), however: in the interest of making WAN-independent software (as in, I don't want my home automation to stop working if I can't call out to wit.ai), I'd rather not depend on it. I think CMUSphinx can be made extremely accurate with continuous training (something I'm angling to put in my version).
Then again, I'm planning on making a competing product, so differentiation is good for me. I think there are a lot of ways they can improve on the idea.
Nice project--I like the code; it's very clean and well structured. You should check out the Subversion trunk of pocketsphinx: it has support for keyword spotting built in, so you can instantly recognize the persona keyword to enable the system instead of running the transcription through pocketsphinx and hoping for the best.
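With the trunk bindings it looks roughly like this -- model paths are placeholders, the threshold needs tuning per keyword, and the exact import/call signatures vary a bit between pocketsphinx versions:

    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', '/path/to/acoustic/model')     # placeholder
    config.set_string('-dict', '/path/to/cmudict.dict')      # placeholder
    config.set_string('-keyphrase', 'jasper')                 # the persona keyword
    config.set_float('-kws_threshold', 1e-20)                 # lower = more sensitive
    decoder = Decoder(config)

    decoder.start_utt()
    with open('mic_capture.raw', 'rb') as f:                  # 16 kHz, 16-bit mono PCM
        while True:
            buf = f.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
            if decoder.hyp() is not None:
                print('keyword detected')
                decoder.end_utt()
                decoder.start_utt()                            # keep listening
    decoder.end_utt()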
I am looking to use this / something similar at the startup I work at to toggle our robot's operating modes. I have a question regarding the voice recognition. Is the voice recognition stuff done on the Pi itself, or is there a service that Jasper taps into to perform voice recognition?
I'm working with PocketSphinx now; it's totally language-independent. There is a paper floating around for Arabic, in which they used several readings of the Qur'an as training data. For Chinese there's a model on the official Sphinx page.
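Switching language is mostly a matter of pointing the decoder at a different model set, roughly like this (the Mandarin file names below are just the ones I've seen on the Sphinx downloads page; use whichever model you grab):

    from pocketsphinx import Decoder

    config = Decoder.default_config()
    # Acoustic model, language model and pronunciation dictionary must all
    # match the target language; these names are from the CMUSphinx Mandarin model.
    config.set_string('-hmm', 'zh_broadcastnews_ptm256_8000')
    config.set_string('-lm', 'zh_broadcastnews_64000_utf8.DMP')
    config.set_string('-dict', 'zh_broadcastnews_utf8.dic')
    decoder = Decoder(config)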
this might be a stupid question but HOW! :)
After some troubles I finally got Jasper to boot up, and the first thing he does is complain that he can't connect to a network.
So I connect my laptop via ethernet to the pi and try to reach the configuration site http://192.168.1.1:80000/cgi-bin/index.cgi but it just won't connect.
Might be something ridiculously obvious but I'm a total n00b when it comes to the pi...or linux in general :/
I've been working on a freetext question answering service, but more the question answering part (as opposed to the voice recognition side).
Looking at the documentation, it appears there is no way for it to handle free-text questions ("What is the population of X?" - where X is any country), since all words need to be defined in advance. Is that correct, or am I missing something?
This looks great! I'm only missing a USB microphone. I'll be sure to make this once I get my hands on one.
It also seems pretty trivial to set up Wolfram Alpha on here. From what it looks like, you'd just have to:
1) get a developer account at Wolfram Alpha
2) download this promising looking module: https://pypi.python.org/pypi/wolframalpha/1.0.2
3) integrate it into Jasper (create a module -- rough sketch below)
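Something like this is what I have in mind for step 3 -- a rough sketch only: the WORDS/isValid/handle layout follows Jasper's module convention, but the profile key and the pod handling are my own assumptions, and the wolframalpha package's result attributes may differ between versions:

    import re
    import wolframalpha

    WORDS = ["WOLFRAM"]

    def isValid(text):
        # Trigger on phrases like "wolfram what is the population of France"
        return bool(re.search(r'\bwolfram\b', text, re.IGNORECASE))

    def handle(text, mic, profile):
        app_id = profile['keys']['WOLFRAMALPHA']        # assumed profile.yml entry
        client = wolframalpha.Client(app_id)
        query = re.sub(r'\bwolfram\b', '', text, flags=re.IGNORECASE).strip()
        result = client.query(query)
        try:
            # Many queries come back with a pod titled "Result"
            answer = next(p.text for p in result.pods if p.title == 'Result')
        except StopIteration:
            answer = "Sorry, I couldn't find an answer to that."
        mic.say(answer)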
I would love it if that worked offline, too. Even if it is just a very limited use of voice recognition.
Like, Siri has to be online because everything is processed on Apple's servers, since there are so many things Siri could do. But if all I want is to be able to say "play <NameOfMovie/TVShow>", I would love it if that could be done offline, even if I had to train the system myself.
Say I have 30 TV shows in the library; I could see myself training the system on the names of all of them if that meant I was able to actually start them via voice.
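For a library that small, I imagine you could even generate a tiny JSGF grammar from the titles and hand it to an offline recognizer (pocketsphinx takes one via -jsgf). A sketch with made-up titles; unusual words would still need pronunciation dictionary entries:

    # Generate a grammar that only matches "play <title>" for a fixed library.
    shows = ["breaking bad", "game of thrones", "the wire"]   # placeholders

    grammar = "#JSGF V1.0;\n"
    grammar += "grammar media;\n"
    grammar += "public <command> = play ( " + " | ".join(shows) + " );\n"

    with open("media.gram", "w") as f:
        f.write(grammar)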
I was just thinking the same thing - I use XBMC on an Android TV stick now, with a wireless "airmouse" that's absolutely horrible to use, and I'd love a better option... For stuff that can easily be done with the XBMC web client, or a simple media remote it's not such a big deal, but it's not exactly a perfect solution.
Well, yes, this can be done. Noise cancellation & voice extraction algorithms aren't implemented here, but it's totally possible. It wouldn't be too bad to implement if you have a decent understanding of DSP.
If you want to do dictation, you can use the Sphinx4 project. It's in Java, and you have to write a bit of code, but it's an offline recognizer: you record the audio and then transcribe it, so it doesn't run in real time.
I think this is a great project! But a Raspberry Pi might be a bit overpriced/overpowered for this task. Maybe something like the Arduino Yún would be a more appropriate choice. I am really hoping this movement of small GNU/Linux-based home appliances takes off and lowers the price.