Show HN: An open-source, Raspberry-Pi-based Siri alternative (jasperproject.github.io)
191 points by shbhrsaha on April 7, 2014 | hide | past | favorite | 81 comments

It's a shame that the quality of open-source text to speech engines is so much worse than the current commercial state of the art, as that's the most notable difference in the demo videos between this and something like Siri. Would fixing that just be a matter of recording more high-quality free sample libraries, etc., or are there fundamental technical challenges to solve?

It's mostly a question of training data. Google trains its acoustic models on thousands of hours of annotated audio samples. It's very hard for an open source project to i/ get enough data ii/ have the computing power to actually train with it.

Sounds like this field could use some crowdsourcing.

What kind of data would be the most useful for learning? Would it be the same phrases read by different people, single words, or is any text good? Do we need lots of recordings from a single person, or many smaller samples from different people?

I'm thinking of a web site where people could contribute to the project by joining and then reading out phrases the system shows them.

You mean something like http://www.voxforge.org/

Seriously, there is no massive CC-licensed source of audio data out there. Most of the fancy algorithms for doing speech recognition are on GitHub. What isn't is a massive and diverse dataset. I encourage others to reply if they have seen otherwise.

I don't think that's true.

That is a big factor in the accuracy of speech-to-text engines, but I don't think training data is a problem for text-to-speech (which is what the OP was complaining about).

I strongly disagree! Text to speech needs the same data as speech to text - a well-annotated collection of raw, single-speaker recordings from a variety of speakers, with accompanying text labels.

It is very, very difficult to find a large, well curated dataset of speech with accompanying text labels. TIMIT is the gold standard of speech recognition despite its simplicity, and it costs ~$150(!) to get access at all. Switchboard is another famous one which is bigger, but only partially labeled and still very hard to get. CMU has an openly available, small dataset called AN4, but it is only 64MB of raw recordings (a lot for 1991 when it was recorded, but still small today).

You could use YouTube and its automatic captions, but the text is shoddy at best and definitely too unreliable to feed to an algorithm without significant cleansing.

Honestly, U.S. Congressional speeches may be the best "open" option IMO - good transcriptions, a variety of speakers, and TONS of data. If you could sync up the transcriptions with recordings of C-SPAN it could work very well, though I have not seen a dataset that puts this together. This still only covers English!

Companies understand that data is valuable - that is why they take the rights to it in every EULA! Well curated internal datasets will be the KFC/Coke secret recipes of the "big data" age.

I can't upvote this enough. There are very few companies that are masters of this game, because it is extremely difficult to build. Every stage is a painful process and there are so many moving parts. To begin with, you need very high quality recordings. Successful companies have voice directors whose job is to record voice at the most optimal settings (and to make sure the speaker speaks in a neutral, dispassionate tone). Then there is the manual process of segmentation and phoneme alignment - that requires a lot of human effort. Then you run the code! It will take quite some time. Then you study the quality of the synthesis and fix issues manually (for example, if you are using concatenative synthesis, you may have to fix offending phoneme alignments based on feedback). And iterate on this. Unless you have a team, I wouldn't venture into this, and would rather use third-party leaders.

I recently discovered vocalid.org thanks to a TED video[1]. They work to provide people with speech impairments with unique synthetic voices, thanks to voice donors. According to their process, a voice donor needs to read 2-3 hours of text to allow them to build a custom synthetic voice.

I don't know if they have any plan to open their data, but if someone could find a way to increase the number of voice donors while improving an open dataset, it would be interesting. Just a thought; there are most certainly many more issues / options in this field.


Finally, a use for the filibuster.

I wonder if the audio data from television programmes combined with subtitles would make a good training set. Here in the UK many of the subtitles are very good quality (presumably human transcribed), and would be easy to strip out non-language noises such as laughter (normally transcribed in brackets).
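As a sketch of the idea above, here's roughly how you might clean subtitle text for use as transcription labels. This is a minimal illustration, not from the project: the SRT layout is standard, but the bracketed-cue convention and regexes are assumptions you'd tune per broadcaster.

```python
import re

# Turn an SRT subtitle file into plain training text, dropping
# non-speech cues like "[LAUGHTER]" or "(APPLAUSE)".

TIMESTAMP = re.compile(r"\d\d:\d\d:\d\d,\d\d\d --> ")
NON_SPEECH = re.compile(r"[\[(][A-Z ]+[\])]")  # bracketed, upper-case cues

def srt_to_text(srt: str) -> str:
    lines = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMESTAMP.search(line):
            continue  # skip cue numbers and timing lines
        line = NON_SPEECH.sub("", line).strip()
        if line:
            lines.append(line)
    return " ".join(lines)

sample = """1
00:00:01,000 --> 00:00:03,000
[LAUGHTER] Hello there.

2
00:00:03,500 --> 00:00:05,000
Nice to see you.
"""
print(srt_to_text(sample))  # -> Hello there. Nice to see you.
```

The timing lines you skip here are exactly what you'd keep if you wanted to align each utterance with the audio track.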

Google used the GOOG-411 automated directory assistance number to collect data for training.

> ii/ have the computing power to actually train with it.

That's hardly an issue. You just have to know one person who works at a university and you are pretty much all set.

Economics is the main challenge, because it's a difficult task and the demand for it is low (the market is small, if demanding [0]). But the MBROLA people, who have a nice free speech synthesis program that nonetheless isn't open source, [1] say it's a hard technical problem too.

[0] http://www.ted.com/talks/roger_ebert_remaking_my_voice

[1] http://tcts.fpms.ac.be/synthesis/mbrola.html

This is using espeak. There are alternatives, such as Festival, which, with decent voices, sounds way better.

Example: https://www.youtube.com/watch?v=BA0z6ztG7qU

The question is: how do you make it sound that good? Is that with commercial voices?

Although people mention the lack of data, it's not a problem anymore. It was a problem a few years ago, but these days we have access to thousands of hours of speaker data. For example, we have a 100-hour single-speaker database at CMUSphinx; it's not a big problem.

Algorithms have become the bigger problem these days. For TTS, for example, one has to implement hybrid speech synthesis technology combining hidden Markov models and unit selection. Most commercial companies are using this technology, but no open source project has released it. The closest ones are OpenMary and Festival, but they are either unit selection or HMM; there is no hybrid synthesis implementation yet.

A lot of the DEC Voice stuff was part of a DARPA contract, it may be possible to liberate from some archive somewhere.

You can use Google TTS. It doesn't have a documented public API, but nor did Google Reader.

It's a shame that no one did something you don't even have any idea how to do? Do you really think it's a shame?

Care to check a dictionary first? apendleton is not saying open source programmers should be ashamed that they haven't produced a high quality speech engine. Would you feel better if it was phrased like "it's a bummer that..."? Fewer hurt feelings?


5. an occasion for regret, disappointment, etc

Haha, well that's an unexpected English-lesson for me. Really sorry for my mistake, and thank you for pointing that out.

Care to check a dictionary first?

Please don't use this kind of sarcasm on Hacker News. It lowers the signal/noise ratio and it corrodes civility.

There'd be a good and helpful comment here if one took out the aggressive bits.

But he started it first! </joke>

Message received, thanks for the reminder.

I just meant that it was unfortunate, not that it was shameful. It's just a figure of speech.

Yes, very sorry, it's all my bad English :)

I'm normally not a fan of condescending comments, but you hit the nail right on the head. It doesn't seem like some people grasp that a lot of open source projects are coded by unpaid volunteers in their free time, on subjects they're passionate about.

Granted, there are commercial open source projects.

I'm sure everyone on HN is familiar with how open source projects work. The poster wasn't trying to tell contributors to get their act together and produce a higher quality product.

Beyond just lamenting the quality gap, they were trying to understand what it was that open source projects of this nature are missing. Is it lack of research making its way downstream? Poor media quality? There was no blame placed on contributors.

Understanding the answers to these questions can even help someone like the parent find an 'in' to contribute something themselves.

Personally, I really think it's a shame, yes. Why wouldn't it be?

It says it's 100% open-source, but I can't seem to find the sources for the Jasper platform (not the Jasper client).

Even the "compile Jasper from scratch" installation method involves downloading some binaries: http://jasperproject.github.io/documentation/software/#insta...

edit: more specific link

The client code at https://github.com/jasperproject/jasper-client comprises all of the source. The modified Raspbian distribution linked to in the Software Guide only includes supporting libraries and some configuration files.

Ok, but why do I have to both install some binaries and clone jasper-client?



I think it would help if you briefly described in the docs what https://sourceforge.net/projects/jasperproject/files/usrloca... contains, since it's fairly large (75 MB).

Although smaller, the same goes for https://sourceforge.net/projects/jasperproject/files/usrloca...

Otherwise, great initiative!

Thanks! Yes, we'll be sure to clear that up-- the binaries were for a few CMUCLTK and Phonetisaurus libraries that were difficult to compile on Raspberry Pi.

I actually made this exact thing Saturday afternoon. Great job!

The first prototype was on an Arduino, and it eventually ran out of firmware space. So then I upgraded to the RPi, and from there it was a breeze.


(1) Use Wit.ai for NLP. There is some added latency, but the capabilities far outstrip Sphinx in the long run. It's free. Less code to maintain. Easier to deploy and distribute.

(2) Try to find a small mic so that you can put everything in a sleek package.

(3) Add support for bluetooth speakers (you're on a RPi, it's basically done for you)

(4) 3D print a custom case, throw some 3M tape on it and it's ready to be wall-mounted!

Great suggestions. I've been meaning to take a deeper dive into Wit.ai for a while now. It seems like their intents-entities architecture would actually fit in pretty cleanly with Jasper.

As an aside: I don't think it'd be difficult to develop Jasper modules that use Wit without modifying much of the original source (as long as the speech-to-text systems pick up the text you'd need to pass to Wit).

Hi Kyle, Wit.AI team here -- thanks for suggestion (1) :-)

Actually, our typical latency is less than 0.5s if you stream audio to the API (instead of waiting until a silence, then sending a WAV file). Also, we are working on an embeddable client (you would still use Wit.AI online to train your model, but the model could then run locally on your Raspberry Pi).
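For readers curious what the non-streaming version of this call looks like, here's a hedged sketch of posting a recorded WAV to Wit's HTTP speech endpoint. The URL, header names, and the `build_wit_request` helper are assumptions based on Wit's public API of the era; check the current docs before relying on them. The request is built but deliberately not sent.

```python
import urllib.request

WIT_SPEECH_URL = "https://api.wit.ai/speech"  # assumed endpoint

def build_wit_request(wav_bytes: bytes, token: str) -> urllib.request.Request:
    """Build (but don't send) a speech-recognition request."""
    return urllib.request.Request(
        WIT_SPEECH_URL,
        data=wav_bytes,
        headers={
            "Authorization": "Bearer %s" % token,
            "Content-Type": "audio/wav",
        },
    )

req = build_wit_request(b"RIFF...fake wav bytes...", "YOUR_TOKEN")
# urllib.request.urlopen(req) would actually send it; skipped here.
print(req.get_full_url())  # -> https://api.wit.ai/speech
```

Streaming (chunked transfer of audio as it's captured, as the Wit team describes) is what cuts the latency; this batch form is just the simplest starting point.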

I also did a project like this last year -- I only got asking for the date & time working, and was happy enough to stop there...

I disagree with (1), however, in the interest of making WAN-independent software (as in, I don't want my home automation to stop working if I can't call out to wit.ai). I think CMUSphinx can be made extremely accurate with continuous training (something I'm angling to put in my version).

Then again, I'm planning on making a competing product, so differentiation is good for me. I think there are a lot of ways they can improve on the idea.

And use festival, not espeak!

Nice project--I like the code, it's very clean and well structured. You should check out the subversion trunk of pocketsphinx, it has support for keyword spotting built in so you can do things like instantly recognize the persona keyword to enable the system instead of running the transcription through pocketsphinx and hoping for the best.

Unfortunately the keyword spotting stuff isn't documented yet, but check out the code for my Demolition Man swear detector project which is using it: http://hackaday.io/project/531-Demolition-Man-Verbal-Moralit... The important bit is the ps_set_kws function call, this takes either a text keyword or filename with list of keywords. Then after processing audio call ps_get_hyp and it will return any spotted keywords. Check out the code here in PocketSphinxKWS.h/cpp: https://github.com/tdicola/DemoManMonitor
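Since the keyword-spotting mode isn't documented yet, here's a small sketch of the keyword-list file it consumes (one phrase per line with a detection threshold), which is what you'd pass via `ps_set_kws` or the `-kws` config option. The phrases and threshold values below are illustrative guesses you'd tune by experiment.

```python
# Generate a pocketsphinx keyword-list file. Format: "phrase /threshold/".
# Lower thresholds make detection stricter; these values are placeholders.

keywords = {
    "jasper": 1e-40,      # wake word
    "play music": 1e-30,  # hypothetical command phrase
}

def write_kws_file(path, keywords):
    with open(path, "w") as f:
        for phrase, threshold in keywords.items():
            f.write("%s /%g/\n" % (phrase, threshold))

write_kws_file("keyphrases.txt", keywords)
print(open("keyphrases.txt").read())
```

With the file in place, a decoder configured for keyword spotting only fires on those phrases instead of attempting a full transcription, which is why it's so much more reliable for a wake word.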

I am looking to use this / something similar at the startup I work at to toggle our robot's operating modes. I have a question regarding the voice recognition. Is the voice recognition stuff done on the Pi itself, or is there a service that Jasper taps into to perform voice recognition?

It's using Pocketsphinx: http://cmusphinx.sourceforge.net/

That's amazing, I'm impressed. I have no idea how voice recognition software works, but is it possible using this open source project to add other non-Latin languages (e.g. Greek, Arabic, Japanese)?

It uses the CMUSphinx project (http://cmusphinx.sourceforge.net/wiki/) that is language independent. You can download language models for English, Chinese, French, Spanish, German, Russian.

I'm working with PocketSphinx now; it's totally language independent. There is a paper floating around for Arabic, in which they used several readings of the Qur'an as training data. For Chinese, there's a model on the official Sphinx page.

I'm not entirely clear on why it requires a WiFi adapter? Can you not use the wired connection? Module writing looks pretty nifty though, can't wait to give it a try over the weekend.

Yes you're right, it works absolutely fine over a wired connection, WiFi adapter not required.

How are you able to achieve this? My install says it's 'attempting to connect with -SSID-' and fails, even with the WiFi adapter removed, and I've confirmed it's got an ethernet-issued IP address.

This might be a stupid question, but HOW! :) After some trouble I finally got Jasper to boot up, and the first thing it does is complain that it can't connect to a network. So I connect my laptop via ethernet to the Pi and try to reach the configuration site, but it just won't connect.

Might be something ridiculously obvious but I'm a total n00b when it comes to the pi...or linux in general :/

Guys, could you tell me where you got the music for your demo? How much did you pay? What was the process? Did you use something like Movie Maker?

I'm building a tool (prototype phase) for creating trailer videos. One of the use cases would be to create the demo movie of a product.

Sure, the music is from http://www.jamendo.com/, where some tracks can be used for non-commercial purposes.

So you downloaded an mp3 and used a tool like Movie Maker to make the video? Thanks

This looks pretty good.

I've been working on a freetext question answering service, but more the question answering part (as opposed to the voice recognition side).

Looking at the documentation[1], it appears there is no way for it to handle free-text questions ("What is the population of X?" - where X is any country), since all words need to be defined in advance. Is that correct, or am I missing something?

[1] http://jasperproject.github.io/documentation/api/standard/

This looks great! I'm only missing a USB microphone. I'll be sure to make this once I get my hands on one.

It also seems pretty trivial to set up Wolfram Alpha on here. From what it looks like, you'd just have to: 1) get a developer account at Wolfram Alpha 2) download this promising looking module: https://pypi.python.org/pypi/wolframalpha/1.0.2 3) integrate it into Jasper (create a module)
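Step 3 might look roughly like this, following the WORDS / isValid / handle shape of Jasper's standard modules. This is a sketch, not the project's code: the `wolfram_api_key` profile key is a made-up name, and the `wolframalpha` client usage is an assumption from the PyPI package's description.

```python
import re

# Hypothetical Jasper module wrapping Wolfram Alpha.

WORDS = ["WOLFRAM"]  # vocabulary hint for the speech-to-text engine

def isValid(text):
    """Claim any utterance that mentions Wolfram."""
    return bool(re.search(r"\bwolfram\b", text, re.IGNORECASE))

def handle(text, mic, profile):
    """Query Wolfram Alpha with the utterance and speak the answer."""
    import wolframalpha  # pip install wolframalpha; needs an API key
    client = wolframalpha.Client(profile["wolfram_api_key"])
    res = client.query(text)
    try:
        answer = next(res.results).text
    except StopIteration:
        answer = "I couldn't find an answer to that."
    mic.say(answer)
```

The catch, per the free-text discussion elsewhere in the thread, is that the speech-to-text side still has to recognize the question words in the first place.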

I'll be sure to try it once I get it set up.

This is pretty cool.

I would use Android's TTS (Pico TTS). The audio quality is better.

Serious question: Why does everyone seem to confuse speech recognition with other parts of NLP (e.g. parsing)?

I can understand CNN or TechCrunch getting confused, but there seems to be a universal confusion here on HN too.

Not ranting. It is a bit exasperating to read comments and articles addressing only speech recognition. Siri is more than that.

Sweet! This is just what I was looking for to command my Sonos speaker to play some music.

I'd be very interested in this; I've been looking for a Sonos API for a long time and could only find an old Perl script that doesn't seem to be maintained.

How do you plan on doing this?


This is a nice python library for controlling a Sonos speaker.


This would be an awesome voice control addon for XBMC/media player services from a Pi!

I would love if that worked offline, too. Even if it is just a very limited use of voice recognition.

Like, Siri has to be online because everything is processed on Apple's servers, since there are so many things Siri can do. But if all I want is to be able to say "play <NameOfMovie/TVShow>", I would love it if that could be done offline. Even if I had to train the system myself.

Say I have 30 TV shows in the library; I could see myself training the system on the names of all of them if that meant I was able to actually start them via voice.
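The restricted-vocabulary idea above is what makes offline recognition tractable: the recognizer only has to distinguish 30 titles, and even a sloppy transcription can be fuzzy-matched against the library. A toy sketch (library contents are made up):

```python
import difflib

# Fixed library of titles the offline recognizer is allowed to pick from.
LIBRARY = ["Breaking Bad", "Game of Thrones", "The Wire"]

def match_title(heard):
    """Fuzzy-match whatever the recognizer heard against the library."""
    hits = difflib.get_close_matches(heard.title(), LIBRARY, n=1, cutoff=0.6)
    return hits[0] if hits else None

print(match_title("game of thornes"))  # -> Game of Thrones
```

In practice you'd go further and feed the title list to the recognizer itself (e.g. as a pocketsphinx grammar or keyword list) so it never hypothesizes words outside the library.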

I was just thinking the same thing - I use XBMC on an Android TV stick now, with a wireless "airmouse" that's absolutely horrible to use, and I'd love a better option... For stuff that can easily be done with the XBMC web client, or a simple media remote it's not such a big deal, but it's not exactly a perfect solution.

Great work! What sort of recognition distance were you able to get with that microphone?

Thanks. Depends on the conditions, but works most of the time around 10-15 feet away!

Excellent work, I've always wanted to see a real world use of pocketsphinx with python. When I looked a year ago the documentation was lacking. The module system looks nicely extensible as well.

I had a very similar (albeit less-complete) hack a while back: https://github.com/rob-mccann/Pi-Voice

Would it be easy to get this running on a normal PC, without a RPi?

The RPi is essentially a PC, the only real difference is that it uses an ARM processor so yes, I'd assume it would be rather easy.

Thanks. I've been looking at the code, and aside from a list of dependencies in client/requirements.txt, it looks pretty simple.

Is there a way to make this subtract any noise being output by the Pi's speaker output, so that it can still understand me if I'm playing music/watching a movie?

Well, yes, this can be done. Noise cancellation & voice extraction algorithms aren't implemented here, but it's totally possible. It wouldn't be too bad to implement if you have a decent understanding of DSP.
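The standard DSP approach here is an adaptive echo canceller: since the Pi knows exactly what it's sending to the speaker (the reference signal), an LMS filter can learn the speaker-to-mic echo path and subtract it, leaving the voice. A toy pure-Python sketch, with filter length and step size as illustrative guesses (real setups would also handle acoustic delay and use a much longer filter):

```python
import random

def lms_cancel(reference, mic, taps=4, mu=0.05):
    """Subtract an adaptively estimated echo of `reference` from `mic`."""
    w = [0.0] * taps  # adaptive FIR weights (echo-path estimate)
    out = []
    for n in range(len(mic)):
        # last `taps` reference samples, most recent first
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_est  # residual: ideally just the voice
        for k in range(taps):  # LMS update toward the true echo path
            w[k] += mu * e * x[k]
        out.append(e)
    return out

# Demo: mic hears only an echo of the playback, so output should go to ~0.
random.seed(0)
ref = [random.uniform(-1, 1) for _ in range(2000)]
echo = [0.6 * ref[n] + 0.3 * (ref[n - 1] if n else 0.0) for n in range(2000)]
cleaned = lms_cancel(ref, echo)
print(sum(abs(s) for s in cleaned[-100:]) / 100)  # small residual
```

Libraries like Speex/PulseAudio already ship production-grade versions of this, which would be the saner route on a Pi.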

It would be cool to trigger commands with my smartphone. I say "Open the door", my phone sends it to my Raspberry Pi, which then opens my door. Is this possible?

How accurate is text recognition (could you use it for dictation?) and how fast between the end of a command and recognition/parsing of said command?

If you want to do dictation, you can use the Sphinx4 project. It's in Java, and you have to write a bit of code, but it's an offline recognizer: you record the audio and process it afterwards, so it doesn't run in real time.

Excellent, I've pondered doing something like this with a BeagleBone Black. Can't wait to try it out and see what I can do.

I want to get this running on my BBB. Hopefully it won't be that hard. It looks like mostly Python code, and a mountain of dependencies.

Open Source project pitched like a product. This is how you do it peeps!

I like this a lot and I'm encouraged to see it.

Great work guys! Looking forward to playing with this!

Seems clean, useful, and well-documented!

I need a french acoustic model, if anyone has one...

French acoustic model for CMUSphinx is available in downloads: https://sourceforge.net/projects/cmusphinx/files/Acoustic%20... You need to use it with French dictionary http://sourceforge.net/projects/cmusphinx/files/Acoustic%20a...

I think this is a great project! But the Raspberry Pi might be a bit overpriced/overpowered for this task. Maybe something like the Arduino Yún would be a more appropriate choice. I am really hoping this movement of small GNU/Linux-based home appliances takes off and lowers prices.

How would using a Yún be cheaper than a Raspberry Pi? The Yún is currently twice the cost of the RPi, and even taking into account the WiFi adapter and SD card, the RPi comes out far ahead.

You would also need to take the time to write drivers for the USB microphone and do realtime speech recognition, web traffic, and text-to-speech on the Yun's little 16MHz/400MHz processors.

This project is perfect for single-board computers like the RPi, and I can't imagine that you would get that far with a microcontroller.
