There is now a (very rudimentary) demo on the GitHub page: zzmp.github.io/juliusjs
Many thanks to @iffy for writing the first pass.
It uses VoxForge's sample vocabulary, so you'll need to say things like "Dial 1 2 3" or "Call Kenneth McDougall" for it to understand you, but the vocabulary is easily swapped out for your own projects, as explained in the README.
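To give a sense of what swapping it entails: a Julius grammar is split across a .grammar file of rules over word categories and a .voca file of words with phoneme transcriptions. A hypothetical minimal pair (illustrative only, not VoxForge's actual files) - the .grammar file defines the sentence structure:

    S    : NS_B SENT NS_E
    SENT : CALL_V NAME

and the .voca file lists the words in each category with their phonemes:

    % NS_B
    <s>     sil
    % NS_E
    </s>    sil
    % CALL_V
    CALL    k ao l
    DIAL    d ay ah l
    % NAME
    STEVE   s t iy v

See the README for the exact steps to plug your own files in.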
Quick question the Julius website says there is no English acoustic model available [1], how did you solve this? Do you provide a default acoustic model?
I used VoxForge [1], a project made specifically to solve this problem:
> VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
VoxForge's sample grammar is provided as a default, in its own folder [2]. It would be nice to get a high-quality acoustic model, as VoxForge's is not that comprehensive yet, but I couldn't find anything with the right licensing and zero cost. If anyone knows of one, I'd love to hear about it.
Hi! Just curious on general emscripten work - how long does it take to port something like this? What are the main hurdles, and things that you can get stuck on?
This was my first emscripten project - I would definitely recommend it. I found it to be a great tool. Two things to point out:
- If something is not documented well, it is probably tested well. If you can find the tests that cover it, you can usually get a good idea of how it works.
- There is no multithreading. I had to fake it by breaking up my loops with setTimeout calls (see the sketch after this list).
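For the curious, the workaround looks roughly like this (a minimal sketch, not JuliusJS's actual code):

    // Split a long-running loop into chunks, yielding between chunks with
    // setTimeout so audio callbacks and UI events can still fire.
    function runChunked(work, total, chunkSize, done) {
      var i = 0;
      (function step() {
        var end = Math.min(i + chunkSize, total);
        for (; i < end; i++) work(i);        // do one chunk of iterations
        if (i < total) setTimeout(step, 0);  // yield to the event loop
        else done();
      })();
    }

Each setTimeout gives the browser a chance to run the Web Audio callbacks that a tight loop would otherwise starve.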
There is an IRC channel if you need a place to go for help. Also, feel free to PM.
This is absolutely amazing, great work. This will open up a whole new world of possibilities.
In my time using offline speech-recognition tech, I was never able to use Julius properly (the docs were a little lacking), so I jumped to CMUSphinx/pocketsphinx, but I think you've just done a huge amount of work to bring Julius out of obscurity (at least in my mind). Thanks very much.
[EDIT] - I really can't stress enough how awesome this is; please add a gittip link or bitcoin address or something.
No, but I think it's a great idea. If you end up using JuliusJS for this, please let me know. I still need to make some example applications to showcase what JuliusJS can do - I'll have them listed in the README when they're done.
I also came across PocketSphinx, which has been around a little longer - that may have some users in the wild already, and I wouldn't be surprised if it was used for navigation somewhere.
Sounds good, thanks! Also, I think you'd get more interest here if there were a simple demo page that people could click through to -- that's the nice thing about JS, after all. Maybe something hosted on a github.io page that just transcribes into a textarea?
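Something like this sketch, say (the constructor and callback names are assumed from the README, so details may differ):

    var julius = new Julius();                         // assumed constructor
    var textarea = document.querySelector('textarea');
    julius.onrecognition = function (sentence) {
      textarea.value += sentence + '\n';               // append each phrase
    };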
I'm using Chrome's speech recognition engine for a virtual-reality-in-WebGL thing I'm building. It's a bit annoying, as it requires a network connection, and is rather buggy (I end up crashing Chrome about once every ten sessions when using it - all of Chrome, every tab). Something like this would fit my needs a lot better.
This is sweet... to get an idea of how much fun this could be for web apps, check out the Annyang library (https://www.talater.com/annyang/), which wraps around the Google web voice recognition API... it works very well but, of course, is subject to Google's terms... so an open source system is very welcome.
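For a sense of scale, annyang's command API is about this small (a sketch from memory; details may differ):

    // annyang maps spoken phrases to callbacks; ':color' captures one word.
    annyang.addCommands({
      'click the :color button': function (color) {
        document.querySelector('button.' + color).click();
      }
    });
    annyang.start(); // begins listening via the browser's speech API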
Pretty cool! When I did my project I had to use https://github.com/kn/speak.js which is an amazing library. The library still worked on Firefox 30 and 31 by the time I finished my project (and the project itself hasn't changed much in a year or two!).
I would definitely give this JuliusJS library a try. I am actually amazed that JuliusJS doesn't carry all the heavy data that speak.js does (though speak.js does support multiple languages). I love the fact that you state it's 100% client-side!
Nice work. Can it return confidence scores? Say I want to load 3 commands in my page:
1. Click blue button
2. Scroll down in the yellow text area
3. Expand image of man
I feed those to the engine, and when somebody speaks, I get a confidence score for each word, so I can determine with a configurable level of certainty that the user is issuing the command:
    {
      click:  0.9878,
      blue:   0.8789,
      button: 0.1889
    }
It does post them back from the worker, but the Julius interface doesn't expose them (yet). The way that Julius deals with confidence scores is also a little different (they're not fractional), so you'd need to account for that.
I'll be sure to include them soon - it's probably just a few more lines of code, so you can expect them in the onrecognition function this afternoon.
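Once they land, consuming them might look something like this (the callback signature and score scale here are assumptions, not the final API):

    // 'julius' is the recognizer instance; Julius scores are log-likelihood
    // values rather than 0-1 fractions, so pick a threshold empirically.
    var THRESHOLD = -5000; // hypothetical value - tune for your grammar
    julius.onrecognition = function (sentence, scores) {
      var confident = scores.every(function (s) { return s > THRESHOLD; });
      if (confident) console.log('command:', sentence);
    };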
pocketsphinx.js looks amazing, but there's definitely a barrier to entry if you've never worked with speech recognition before. That was my biggest goal in porting this tool - a nice, abstracted API. Glad you like it :)
> Each person's voice is different. Some sounds, like "s", sound about the same no matter who says them, but other sounds, like vowels, tend to differ a lot from person to person. We use a special way of representing sound, the cepstrum, that captures lots of information, including the characteristic way you pronounce your vowels. Of course, someone could imitate the way you talk; fortunately, the cepstrum also captures certain fundamental characteristics of voices that are impossible to change. For instance, the length of your vocal tract -- the place where sound is produced in your body -- cannot be changed, and different length vocal tracts tend to produce cepstra with different characteristics. By identifying both the way you talk, and the way your body produces sound, WhisperID can do a great job of figuring out who you are.
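(For reference, and independent of WhisperID: the cepstrum is the standard signal-processing construct, the inverse Fourier transform of the log magnitude spectrum:

    cepstrum(x) = IFFT( log |FFT(x)| )

which is why it separates vocal-tract shape from pitch.)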
As someone working in that area, I have to disagree. Sure, speaker recognition can work very well under certain circumstances, but there is no unique signature. (That would imply that you could successfully discriminate between any two speakers in the world, irrespective of any other factors.)
OK, maybe unique was a strong word. But banks are starting to use speaker ID to log in to mobile apps, so the signature is, let's say, unique enough for practical applications.
In order to get a more comprehensive vocabulary/grammar, you need to swap out the sample that it comes with; there are instructions in the README. The demo just uses the sample grammar that VoxForge provides, which is (as you can see) fairly limited.
This is the kind of technological challenge that must be fun to complete. And it must be quite satisfying for the author.
However, whenever I see an 'XYZ in pure JavaScript', I keep getting the impression we are only delaying the inevitable moment when browsers have to step up to a superior language. Kinda like how quickly ripping off a bandaid is better than slooowwwwllly removing it...
My point exactly. The end result is that we are solidly entrenching JavaScript in browsers, not the contrary. So instead of starting with a proper hammer for nails, we are slowly turning a screwdriver into a blunt object which can neither screw nor hammer nails properly.
True. Transpiled, with a few abstractions written over the transpiled code (such as a worker script), and some tweaks to the transpiled code to fake multithreading so that it can coexist with the Web Audio API.
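In outline, roughly this shape (file names and message format here are assumptions, not the actual code):

    // The transpiled recognizer runs in a worker; the page feeds it audio
    // from the Web Audio API and listens for recognition results.
    var worker = new Worker('julius.worker.js');   // assumed file name
    var context = new AudioContext();              // possibly vendor-prefixed
    navigator.getUserMedia({ audio: true }, function (stream) {
      var source = context.createMediaStreamSource(stream);
      var node = context.createScriptProcessor(4096, 1, 1);
      node.onaudioprocess = function (e) {
        worker.postMessage(e.inputBuffer.getChannelData(0));
      };
      source.connect(node);
      node.connect(context.destination);
    }, function (err) { console.error(err); });
    worker.onmessage = function (e) {
      console.log('recognized:', e.data);          // assumed message shape
    };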