What kind of data would be the most useful for learning? Would it be the same phrases read by different people, single words, or is any text good? Do we need lots of recordings from a single person, or many smaller samples from different people?
I'm thinking of a web site where people could contribute to the project by joining and then reading out phrases the system shows them.
Seriously, there is no massive CC-licensed source of audio data out there. Most of the fancy algorithms for doing speech recognition are on GitHub. What isn't is a massive and diverse dataset. I encourage others to reply if they have seen otherwise.
That is a big factor in the accuracy of speech-to-text engines, but I don't think training data is a problem for text-to-speech (which is what the OP was complaining about).
It is very, very difficult to find a large, well curated dataset of speech with accompanying text labels. TIMIT is the gold standard of speech recognition despite its simplicity, and it costs ~$150(!) to get access at all. Switchboard is another famous one which is bigger, but only partially labeled and still very hard to get. CMU has an openly available, small dataset called AN4, but it is only 64MB of raw recordings (a lot for 1991 when it was recorded, but still small today).
You could use YouTube and its automatic captions, but the text is shoddy at best and definitely too unreliable to feed to an algorithm without significant cleansing.
Honestly, U.S. Congressional speeches may be the best "open" option IMO - good transcriptions, a variety of speakers, and TONS of data. If you could sync up the transcriptions with recordings of C-SPAN it could work very well, though I have not seen a dataset that puts this together. This still only covers English!
Companies understand that data is valuable - that is why they take the rights to it in every EULA! Well curated internal datasets will be the KFC/Coke secret recipes of the "big data" age.
I don't know if they have any plans to open their data, but if someone could find a way to increase the number of voice donors while improving an open dataset, it would be interesting. Just a thought; there are most certainly many more issues / options in this field.
That's hardly an issue. You just have to know one person who works at a university and you're pretty much all set.
Algorithms are becoming the bigger problem these days. For TTS, for example, one has to implement hybrid speech synthesis, combining hidden Markov models (HMMs) and unit selection. Most commercial companies are using this technology, but no open source project has released it. The closest ones are OpenMary and Festival, but they are either unit selection or HMM; there is no hybrid synthesis implementation yet.
5. an occasion for regret, disappointment, etc
Please don't use this kind of sarcasm on Hacker News. It lowers the signal/noise ratio and it corrodes civility.
There'd be a good and helpful comment here if one took out the aggressive bits.
Message received, thanks for the reminder.
Granted, there are commercial open source projects.
Beyond just lamenting the quality gap, they were trying to understand what it is that open source projects of this nature are missing. Is it a lack of research making its way downstream? Poor media quality? There was no blame placed on contributors.
Understanding the answers to these questions can even help someone like the parent find an 'in' to contribute something themselves.
Even the "compile Jasper from scratch" installation method involves downloading some binaries: http://jasperproject.github.io/documentation/software/#insta...
edit: more specific link
I think it would help if you briefly described in the docs what https://sourceforge.net/projects/jasperproject/files/usrloca... contains, since it's fairly large (75 MB).
Although smaller, the same goes for https://sourceforge.net/projects/jasperproject/files/usrloca...
Otherwise, great initiative!
The first prototype was on an Arduino, but it eventually ran out of firmware space. So I upgraded to the RPi, and from there it was a breeze.
(1) Use Wit.ai for NLP. There is some added latency, but the capabilities far outstrip Sphinx's in the long run. It's free. Less code to maintain. Easier to deploy and distribute.
(2) Try to find a small mic so that you can put everything in a sleek package.
(3) Add support for Bluetooth speakers (you're on an RPi, so it's basically done for you)
(4) 3D print a custom case, throw some 3M tape on it and it's ready to be wall-mounted!
As an aside: I don't think it'd be difficult to develop Jasper modules that use Wit without modifying much of the original source (as long as the speech-to-text systems pick up on the text you'd need to pass to Wit).
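Something like this, following the WORDS/isValid/handle pattern from Jasper's module docs. This is a hypothetical sketch, not tested code: the WIT_TOKEN constant is a placeholder, and the response handling reflects my recollection of Wit's /message API.

    import re
    import requests

    WORDS = ["WIT"]  # words Jasper should add to its vocabulary
    WIT_TOKEN = "YOUR_SERVER_ACCESS_TOKEN"  # placeholder

    def handle(text, mic, profile):
        # Forward whatever Jasper heard to Wit and speak the intent back.
        resp = requests.get(
            "https://api.wit.ai/message",
            params={"q": text},
            headers={"Authorization": "Bearer " + WIT_TOKEN},
        )
        outcome = resp.json().get("outcome", {})
        mic.say("I think your intent was %s" % outcome.get("intent", "unknown"))

    def isValid(text):
        return bool(re.search(r"\bwit\b", text, re.IGNORECASE))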
Actually, our typical latency is less than 0.5s if you stream audio to the API (instead of waiting until a silence, then sending a WAV file). Also, we are working on an embeddable client (you would still use Wit.ai online to train your model, but it could then run locally on your Raspberry Pi).
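To illustrate the streaming part: with plain Python requests, passing a generator as the request body triggers chunked transfer encoding, so decoding can start before the utterance is finished. Treat the endpoint and header details below as a sketch; the audio source is faked with a file.

    import requests

    WIT_TOKEN = "YOUR_SERVER_ACCESS_TOKEN"  # placeholder

    def audio_chunks():
        # Yield raw audio as it is captured; faked here by reading a file.
        with open("utterance.wav", "rb") as f:
            chunk = f.read(4096)
            while chunk:
                yield chunk
                chunk = f.read(4096)

    # A generator body makes requests use chunked transfer encoding.
    resp = requests.post(
        "https://api.wit.ai/speech",
        headers={"Authorization": "Bearer " + WIT_TOKEN,
                 "Content-Type": "audio/wav"},
        data=audio_chunks(),
    )
    print(resp.json())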
I disagree with (1), however: in the interest of making WAN-independent software (as in, I don't want my home automation to stop working if I can't call out to wit.ai), I'd rather not depend on a hosted API. I think CMUSphinx can be made extremely accurate with continuous training (something I'm angling to put into my version).
Then again, I'm planning on making a competing product, so differentiation is good for me. I think there are a lot of ways they can improve on the idea.
Unfortunately the keyword spotting stuff isn't documented yet, but check out the code for my Demolition Man swear detector project, which is using it: http://hackaday.io/project/531-Demolition-Man-Verbal-Moralit... The important bit is the ps_set_kws function call; this takes either a text keyword or a filename with a list of keywords. Then, after processing audio, call ps_get_hyp and it will return any spotted keywords. Check out the code here in PocketSphinxKWS.h/cpp: https://github.com/tdicola/DemoManMonitor
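If you'd rather drive it from Python, the pocketsphinx bindings expose the same flow, roughly like this (a sketch: model paths are placeholders, and the exact binding API may differ by version; the -kws config option plays the role of ps_set_kws, and decoder.hyp() the role of ps_get_hyp):

    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', '/path/to/acoustic-model')       # placeholder
    config.set_string('-dict', '/path/to/pronunciation.dict')  # placeholder
    config.set_string('-kws', 'keywords.list')  # one keyphrase per line,
                                                # optionally with a /threshold/

    decoder = Decoder(config)
    decoder.start_utt()
    with open('input.raw', 'rb') as f:  # 16 kHz, 16-bit mono PCM
        while True:
            buf = f.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
            if decoder.hyp() is not None:  # a keyphrase was spotted
                print('Spotted:', decoder.hyp().hypstr)
                decoder.end_utt()          # reset and keep listening
                decoder.start_utt()
    decoder.end_utt()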
Might be something ridiculously obvious, but I'm a total n00b when it comes to the Pi... or Linux in general :/
I'm building a tool (prototype phase) for creating trail videos. One of the use cases would be to create a demo movie for a product.
I've been working on a freetext question answering service, though more on the question answering part (as opposed to the voice recognition side).
Looking at the documentation, it appears there is no way for it to handle free-text questions ("What is the population of X?", where X is any country), since all words need to be defined in advance. Is that correct, or am I missing something?
It also seems pretty trivial to set up Wolfram Alpha on here. From what it looks like, you'd just have to do the following (see the sketch after the list):
1) get a developer account at Wolfram Alpha
2) download this promising looking module: https://pypi.python.org/pypi/wolframalpha/1.0.2
3) integrate it into Jasper (create a module)
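If the module works the way its interface suggests, wiring it up is only a few lines. The app ID is a placeholder you'd get from step 1, and the query is just an example:

    import wolframalpha

    client = wolframalpha.Client('YOUR-WOLFRAM-APP-ID')  # from step 1
    res = client.query('What is the population of France?')

    # Print the first pod that has a plain-text answer.
    for pod in res.pods:
        if pod.text:
            print(pod.text)
            break

A Jasper module would then just pass the recognized question text into client.query and speak the pod text back.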
I'll be sure to try it once I get it set up.
I would use Android's TTS (Pico TTS). The audio quality is better.
I can understand CNN or TechCrunch getting confused, but there seems to be a universal confusion here on HN too.
Not ranting. It is a bit exasperating to read comments and articles addressing only speech recognition. Siri is more than that.
How do you plan on doing this?
Like, Siri has to be online because everything is processed on Apple's servers, since there are so many things Siri could do. But if all I want is to be able to say "play <NameOfMovie/TVShow>", I would love it if that could be done offline. Even if I had to train the system myself.
Say I have 30 TV shows in the library; I could see myself training the system on the names of all of them if that meant I could actually start them via voice.
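That fixed-vocabulary case is exactly what offline keyword spotting handles well. As a rough sketch, assuming a PocketSphinx-style keyword list (the /1e-20/ threshold syntax is PocketSphinx's; the library path is a placeholder):

    import os

    # Build a keyword list from a media library: one phrase per line,
    # each with a detection threshold (looser thresholds spot more,
    # at the cost of more false alarms).
    shows = os.listdir('/path/to/tv-library')  # placeholder path
    with open('shows.list', 'w') as f:
        for filename in shows:
            name = os.path.splitext(filename)[0].replace('.', ' ').lower()
            f.write('%s /1e-20/\n' % name)

The resulting shows.list could then be fed to the recognizer's keyword-spotting mode, so only those 30 names ever need to be recognized.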
I like this a lot and I'm encouraged to see it.
You would also need to take the time to write drivers for the USB microphone and do realtime speech recognition, web traffic, and text-to-speech on the Yun's little 16MHz/400MHz processors.
This project is perfect for single-board computers like the RPi, and I can't imagine that you would get that far with a microcontroller.