There are at least two cross-platform projects where the biggest expense is a microphone instead of software.
1. My project, Talon. Windows/Linux/Mac support, and a first party local speech recognition engine that is pretty good and getting better. It’s free, but the engine is in a private beta (which is $15/mo to support development, optional if there’s a financial issue).
2. Serenade. They are VC backed. Currently free, unsure about their longer term plans. They use cloud based recognition.
I created Kaldi Active Grammar because I didn't trust relying on closed source software for something so crucial to my productivity, where a decision by an outside party determines whether I can function. As a bonus, open source means I can make it work better to fit my needs than closed source ever could.
Furthermore, the original article mentions Caster (which is built on Dragonfly), but doesn't mention that KaldiAG works with it, and that work is underway to expand Caster's platform support.
I found your measurement here, which is against an unknown wav2letter acoustic+language model pair: the web demo is, at any given point in time, running an arbitrary model (users test in-progress models there), and it has never been running the model I currently ship with Talon.
(As a small example, the unfinished wav2letter experiment I am training right now has a 3.17% WER on speech commands, and 6.86% WER on librispeech clean, both numbers without using a language model)
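For anyone unfamiliar, WER (word error rate) is just word-level edit distance divided by the length of the reference transcript. A minimal sketch (the example sentences are mine, not from any benchmark):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("go to definition now", "go to the definition"))  # 2 edits / 4 words = 0.5
```

So a 3.17% WER means roughly 3 word-level errors per 100 reference words.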
Fun tidbit: Most of the commands I use in vim are just the Talon alphabet.
(For example, I suspect the microphone their school supplies them with may be no good)
Also, Dragon has many commands that cannot be disabled, and if your accuracy is already low, more commands available means more possible things to get wrong. Something like Dragonfly with KaldiAG could allow you to reduce the command set, and improve the practical accuracy.
So, more personalized training/software may work better.
I've been waiting to try Talon for ages - glad I can give it a try now.
But then, after having seen doctors and neurologists, and finally a physical therapist, I came across my salvation:
- Exercising my hands.
I very rarely see this mentioned for some reason. I exercised regularly, but only the bigger muscle groups, rarely grip strength and wrist strength. It felt counter-intuitive to exercise my already extremely painfully aching hands (when typing), but using grip weights and other methods to work out my hands and wrists, the pain went away quickly! If you are not diagnosed with carpal tunnel, and not already doing this, definitely try it, it saved my career.
Look up nerve flossing exercises on YouTube. Routinely doing these had the largest impact for me. You'll feel the ones that work on whatever nerve is inflamed.
Try to improve your posture. My general mantra is "lift your head as much as possible. Pull your shoulders back". It gets easier eventually. If your head lies forward from the spine you have a hunch. You may notice a small lump of muscle behind your neck. That's bad. There are exercises to try and strengthen the opposite muscles.
If you sleep on your arms try to stop as well. I recommend sleeping on your back. Fluffy couches and back rests sacrifice posture. Don't use a laptop in bed.
Sleep, eat nutrient-dense foods, and run or swim. Avoid alcohol. If you do pushups or bench press, make sure to exercise your upper back equally to avoid imbalance.
It is frustrating that there seems to be only trial and error in all of this.
If you haven't tried it, you'll think I'm crazy, but it's amazing how fast computers can recognize individual letter names. You can just blurt them all out. I discovered this while entering a domain name by voice - just spell it out and poof, no problem, no corrections.
Not sure I'd want to spend all day doing it that way, but rather than fighting with voice recognition for misunderstood homonyms, just fall back on individual keys.
The trick is one-syllable words representing each letter of the alphabet, chosen for high phoneme diversity (unique sub-syllable sounds). Care was taken to ensure they can be differentiated even when said "too fast" in a stream without gaps, and I tested them extensively to pick words that don't clog up your mouth too much. The result is I can type 58wpm for short bursts using the alphabet alone, which is enough to play ztype, write entire words that would be hard to dictate, and control vim directly without an additional command set.
The base alphabet is:
air bat cap drum each fine gust harp sit jury crunch look made near odd pit quench red sun trap urge vest whale plex yank zip
Go ahead, see how fast you can read the alphabet in order. Then try reading it even faster, blurring the word boundaries together a bit. You can still tell the words apart!
Some people change a couple of words if they don’t work well in their accent, but the base idea holds strong.
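In software, the alphabet is just a word-to-letter mapping applied to the recognizer's token stream. A minimal sketch (the mapping mirrors the list above; the function names are mine, not Talon's API):

```python
# Word -> letter mapping for the base alphabet quoted above.
# Most words start with the letter they stand for.
ALPHABET = {
    w: w[0] for w in (
        "air bat cap drum each fine gust harp sit jury crunch look made "
        "near odd pit quench red sun trap urge vest whale plex yank zip"
    ).split()
}
# A few words don't start with their letter:
ALPHABET.update({"sit": "i", "crunch": "k", "plex": "x"})

def spell(spoken: str) -> str:
    """Turn a stream of recognized alphabet words into typed letters."""
    return "".join(ALPHABET[w] for w in spoken.split())

print(spell("vest sit made"))  # -> "vim"
```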
How did you settle on jury? Notably it seems to be the only word that is more than one syllable.
And see sibling thread suggesting “jree” like “tree”, which is similar to how it sounds when you’re going quickly.
Edit: For people suggesting new words, please read my comments downthread. Also the biggest complaint I’ve had about jury in practice is it can sound similar to “three”. Not a ton of words are spelled with J, and when doing vim input it’s faster to use numbers or a repeat command than say the letter repeatedly. So as it’s a lower priority word to trim half a syllable on, it’s worth getting it right if you’re going to change it.
The alphabet isn’t strict law, but it’s best to replace individual words after you’ve had specific problems with them and not before, as it’s hard to match the phonetic feel.
For testing juke, I’d take a look at any words in the alphabet that could conflict with it on either side:
> air bat cap drum each fine gust harp sit jury crunch look made near odd pit quench red sun trap urge vest whale plex yank zip
I see: cap, each, gust, crunch, quench
I’d say “juke cap juke each juke gust ....” out loud to test it. If it passes the obvious collisions, I’d also test it out loud against every word in the alphabet.
Ultimately juke doesn’t pass for me for two reasons:
1. “ook” is something of a glottal sound; it can be awkward to reset your mouth afterwards when flowing into another letter
2. When chaining it with some of the other letters in the alphabet, it can sound like you’re saying “Jew”, and it may be confusing to some people if you say that repeatedly. “Juice” had a similar effect.
Like tree with a j.
I’d like to save my voice from spelling the entire word. Often I don’t need to type more than a letter or two before I can select the desired variable, type name, etc.
The "voice coding" space is maybe not a mess, but far from great or even acceptable. However, there seem to be more recent efforts to make better tools. I would definitely check https://serenade.ai/ out.
The main problem, I think, is that "voice coding" is too focused on editor typing, which they can't get right: combined with code syntax, it becomes too complex. Instead, they should focus on higher-level actions (which, btw, Serenade does) along with a different approach to typing. I think Vim is a good example of where editing should be; IntelliJ refactoring is where voice coding should start. With all the AI buzz, it's unbelievable how bad voice recognition is. I'm not talking about "Siri, set an alarm", but about separating context from tone, not having to say things 2-3 times, having good response latency, etc.
Lastly, I wish there was simple voice assistance for code navigation - like go to definition, find usages, etc. This is much simpler to "parse" than code structure. Unfortunately, this is not even tackled by any tool as far as I've seen.
Often we have discussions about advanced code completion on HN. Many developers feel they don’t need it, or that it gets in their way.
Reading stories like this convinces me even more that our editors (tools) need to be smarter. There is so much repetition in coding, it’s hard to believe we can’t do better.
This tool is often mentioned: https://tabnine.com/
There is no news on whether it is being actively developed, and the current implementation is unusable in a corporate environment: it can't dial home through a proxy, so it refuses to activate the license.
It's a shame too because it was basically a "shut up and take my money" reaction from me. I'd pay for this product. I'd pay good money for this product.
So to recap:
1. There is currently no paid license
2. The free version nevertheless needs to be activated online to unlock the full power (otherwise there are severe limitations)
3. It can't handle proxies, so you can't activate it at all in corporate environments
The keys that complete an Intellisense selection (space, period, semi-colon, Enter) are nearly ground down to nubs, and the open-paren and open-bracket keys are worn smooth, while the corresponding close keys indicate near complete disuse. Similarly for F5 (Start Debugger) and F10 (Step Over) compared to the rest of the F-keys.
What we really need are better programmers
Maybe he was better but not by that much. His tools were unquestionably better.
Some people are more susceptible to problems. This was on HN several years ago.
You can be in your 20s and have a problem.
So in that respect it's unlike smoking I guess.
And now he's not working in tech anymore and is a field engineer doing repetitive labor with his hands...
Now, I only type while wearing long-sleeves.
And of course, I still have to take regular breaks.
I no longer suffer RSI symptoms. I'm guessing because it increases blood flow to the area and perhaps the warmth helps keep ligaments and muscles flexible and loose.
Simple solution, but took a while to figure out.
Hopefully this helps someone reading this.
For the longest time, I'm talking near a decade, after working for a few hours I'd get this unbearable neck and shoulder pain. Even after the workday it would last long into the night. I am not ashamed to say that I would be almost in tears some days. Not because of the level of pain, but because it was constant and I was frustrated.
Finally, almost as if by accident, I simply pushed my chair back a bit and extended my arms while typing. The pain went away almost immediately. I had been working with my shoulders hunched up the whole time, causing all manner of muscle tension and fatigue.
All I had to do was extend my arms and let my shoulders drop.
(I've written about my experiences with RSI as well: https://www.gustavwengel.dk/overcoming-rsi)
I used to try to type with my arms/elbows floating -- not resting on anything. This seems to be fairly standard advice. For example, you can see it here: http://ergonomictrends.com/proper-ergonomic-typing-posture-a... I've had various doctors recommend the same thing.
But when I try that, the muscles in my shoulders/back get incredibly inflamed, and tension/pain tends to spread throughout my body, eventually to my hands/wrists.
Eventually I gave up on that, and now my elbows are splayed out and rest on the chair's arm pads. My hands rest on wrist pads, and my keyboard is split about 18" (using a Kinesis Freestyle 2). Basically, my hands/arms are nearly always at rest, and no energy is required to keep them suspended in the air. I also try to avoid "reaching" for the upper rows with my fingers, instead I move my entire arm forward to some extent to hit those keys (which is more a 'push' than a 'lift', because my arm pads have a bit of give in them).
I've had no problems since adopting this setup.
Lots of people have tried many things.
This is the first I've heard of such a therapy, it sounds fascinating.
Anecdotally it's done nothing for my RSI, but it has helped me deal with a 7+ year "chronic" upper back pain, though it did take a bit of time to swallow the concept at first.
Only a few minutes with bare arms and I get this snapping sensation in my forearms as I type. And if I push through, pain.
Doctor's wanted to do surgery. But I didn't have the problem prior to an injury that required me to use crutches, so I was hesitant to resort to surgery.
Keeping my arms covered has been a lower risk solution that's been working fine now for 6 years.
I had RSI for years as well, ended up changing careers away from tech for many years because of it. For an unrelated issue I started taking probiotics and gradually noticed the RSI had improved. It took a while to realize it was the probiotics doing it because I had generally been avoiding doing things that triggered the RSI. After I noticed I could type regularly again though, I've had a couple of instances where the RSI flared up again when I had stopped taking the probiotics for a week or two - that's when I finally realized that was likely the change that made the difference.
Looking back, I had had the onset of significant chronic digestive issues a month or two before the RSI started, and I now believe that was not a coincidence.
The key is not to let pain develop: stop immediately and work out a solution, otherwise you can get past the point of no return.
Yes, laptops are convenient. But not so convenient to risk losing your wrists.
We are working towards cross-platform support for Linux and Mac, as well as adding support for Kaldi. Dragonfly is already cross-platform, so only a few Windows-specific functions remain to be ported in Caster.
Kaldi via daanzu's Kaldi Active Grammar.
Talon may be free, but it is closed source.
For what it's worth, my voice is quite abnormal, so most untrained speech recognition is terrible for me, and even performing the normal "training" for Dragon still resulted in very poor accuracy. However, apparently their training is quite limited, because once I developed Kaldi Active Grammar, and did my own direct training, the results were fantastic in comparison, with orders of magnitude better accuracy.
Open source is what allows this.
I start each day exactly when the problem is revealed, without any prior knowledge of the task, so everything about these videos, including debugging broken code, is completely real.
(Note: the fiddling around with video playback isn’t a fundamental issue with voice input; Google Slides with video embeds turned out to be pretty unreliable even with keyboard input, and I believe that was ironed out in her later talks.)
Is anyone working on decent speech recognition for Mac/Linux or know good resources for that? The ideal output is a stream of what could have been said, as well as some alternatives, each with a confidence.
Every alternative I've tried has not been as effective as the version of Dragon I used from 2011. I think the focus on accents and training is a big thing here -- I'm happy to spend a couple hours training it for better results.
Link to my project for those interested: https://github.com/osprey-voice/osprey. It's still kind of a work in progress but it's been working for me.
Right now it’s about 5 hours total, which isn’t a ton for actually training on, which is why I haven’t prioritized releasing it and haven’t even trained on it myself yet. I’ve been mostly using it for evaluation so far.
If someone approaches me and says “I have a compelling need for a bit of training data in the form of your prompts” I’ll probably prioritize a release higher.
As another perspective, a majority of the people at this point submitting their voice are already using Talon and just want the engine to be more robust.
It's not quite there yet, but I'm working on it.
It can type numbers and symbols reasonably well, I need to do some additional work like build a custom language model to be able to type letters and plug some other gaps in Mozilla's CommonVoice model.
Here's the number typing example:
There'd been some back and forth over there regarding its suitability for coding. Much better support than I expected, apparently...
Still, as a general solution I do believe it has drawbacks: noisy environments, etc.
As a generalization, you seem to be coming up with reasons you _think_ voice coding won’t work well, while ignoring the fact it already does. For example, noisy environments have several very good solutions, such as using microphones designed for them, leaving the environment, or software to reduce noise like Krisp.
The biggest realistic drawback from my perspective is the fact it’s not very quick at mousing, which is why I’ve done a bunch of research on fused eye/head mousing as well.
Now that I am a mac user I finally have the chance to take a look at your solution. I do think that I have a lot to learn and do not think that the areas of research are in conflict at all.
All the best and good luck!
Along with benchmarks for testing their performance and comparing them to each other in a more or less sciency fashion. Overlap would also exist in other areas, such as word prediction and probably many more.
This isn’t quite accurate. To my knowledge the options are: you are either using it in Talon and it just works, or you want to use it outside Talon and you will need to write entirely new glue code to add support for it in your project of choice.
1. I would like to command the speaker to listen for a keyword, like the Fizz Buzz test, when I count to a certain number.
2. Ask the speaker to remind me of something when hearing certain topics during a conversation. Much like the "if" keyword in text based computer programming languages.
3. Program a poem into the speaker over the microphone, tutor my kids to memorize it, and correct the wrong parts. Share the snippet with other parents. Program simple homemade riddles and tests over voice.
4. The ability to store certain list/map structure as global variables. e.g. asking the speaker, who is the second oldest son in this family? Who got up first this morning?
5. Voice memos and search engine. Stored and indexed securely offline on my home NAS.
Most likely, if you want to make it work, you'd either have to build your own smart speaker or make a serverless function that used one of the other voice programming tools mentioned in this thread as its backend.
If I built my own smart speaker, it would surely run on my home server; not so much serverless. But yeah, I get it: the voice commands should count as new "keywords" or "functions". Let there be a general "voice programming" language.
A "no swear words" rule for kids: the speaker needs to maintain a global kid_id/word counter, incremented every time a swear word is heard, and reset every week for rewards/penalties, etc.
The parents (as admins) could add/remove words from the swear list and configure how often the counter resets.
A non-trivial example: I am an ESL speaker trying to teach my kid English, but I had little patience with grammar and pronunciation. (My bad.)
I'd like to record a simple poem, e.g. "The Moon" by Robert Louis Stevenson, and have my kid listen to it without needing to stare at a screen. The aim is to recite it correctly: the programmable speaker would repeat any missing words with stronger emphasis next time, count badly spoken words, and test over several rounds to see whether my kid improves and recites the full text correctly.
There are many family games that can be played around a programmable speaker, e.g. as the dungeon master. The story teller, the on-demand background music composer, etc.
1. “Family database” - you tell it facts and it tries to remember and reproduce them later when asked. Like storing your family tree somehow and asking about it. This has a lot of nuance but is likely possible in some way with current technology.
2. Teaching assist. You give it a task, such as an English text, and it helps train and evaluate your kid.
For (1), I think there are three obvious approaches to me:
1. Manually enter the data with a computer, but use natural language queries (some kind of Intent model) to access it
2. Have a strict set of data that you can enter (such as family tree, grocery lists, voice memos), and allow slightly strict natural language-ish statements to add data, and natural language queries to access it.
3. Convince a general understanding model like GPT-2 to learn your facts, and just pipe your questions directly into it. This is the coolest answer but would likely also be wrong more often.
I think (2) is an easier task, and likely would use an entirely different approach for it than (1).
(2) is especially easiest if the system already knows the text of the poem. If you’re making up a new poem, or speaking a poem it can’t somehow look up, it will be harder to match two voices against the text of the poem.
One caveat to all of this is I’ve heard speech recognition doesn’t work as well on children because they aren’t well represented in the training data.
1) Getting an ergo keyboard, in my case the Microsoft Sculpt
2) Remapping my keys to better match my workflow. Left and right parens are mapped to left and right shift - same for Ctrl-Braces and Alt for brackets. Mapping Caps lock to delete one word back was also a big one. Further, I have the number pad on the left to both make using the mouse require less movement in addition to remapping all the numpad keys to useful programming commands.
Corollary, if you have ulnar (pinky side) issues in your forearm, reduce pinky use.
That said, barring genetic lottery, as the sibling said: enough 15-hour sessions are going to give you RSI no matter what you’re typing on. There’s no way to do that every day and still exercise, take breaks, and sleep enough. Balance is important.
I can't imagine working 15 hours straight would be comfortable for me no matter what tools I was using.
I've never had any RSI type symptoms or even fatigue after long typing sessions.
The sculpt seems to just be a fancier wireless version of the same thing (although I haven't tried it so I could be wrong).
It's still very much a work in progress but it's already been working very well for me and I'm actually using it to type out this response right now.
On the other hand it’s probably better at general (non command) English.
I know it sounds crazy, but I solved a very intense bout of plantar fasciitis by massaging my calves. Took some time, but eventually the pain went away.
And when I had carpal tunnel, I did the same (although not leaning heavily on my wrists helped a lot too).
You'll know if you're hitting the right spot because it'll hurt. A lot.
This is partially good, but dangerously incomplete advice. The spectrum of RSI is far bigger than carpal tunnel, such as cubital tunnel, tennis elbow, tendinitis, and general neuropathy.
It can be caused by many things, including the various muscle groups in your shoulders and neck, posture, past injuries, nerve adhesion, poor flexibility, and lack of strength.
Massage is a wonderful way to address nerve adhesion, flexibility, general tightness, but if you have a different issue it’s not really a panacea.
My philosophy is that you should do these things in strict order if you have symptoms:
1. reduce use (take breaks, get hobbies that take you away from the screen at home)
2. vary conditions of use (standing desk, ergonomics, using more than one kind of keyboard/mouse)
3. work out (yoga, swimming are highly recommended)
4. physical treatment (see a doctor, physical therapist, deep masseuse, etc)
5. alternative input (voice, eye tracking, etc)
If you do these in a different order or skip any of them you’re probably doing yourself a disservice.
Also: don’t ice. Don’t take anti-inflammatories or other pain meds while you continue to work. Blood flow is important to recovery - heat can soothe your pain similarly to ice, but without causing further damage. And pain meds during working hours will make you ignore the damage that is causing your pain, allowing you to cause more. Bad.
I’ve also seen some negative studies on steroids and surgery, suggesting if you get those without changing your habits, your pain will recur within a year in 90% of cases. Do your own research, but you might as well change your habits first and only use surgery/drugs as an extreme last resort. (Unless you have the specific form of carpal tunnel that can be permanently fixed in a day with a minor surgery in exchange for lower grip strength?)
Rest is ultimately the best way to prevent hand injuries and since I spend most of my time in Chrome, this extension lets me do it hands free.
That aside, in terms of worrying about your mic picking up other people's voices and the voice dictation getting confused, most dedicated microphones these days (i.e. not ones that are built into your phone's headphones), are pretty good at background noise reduction.
I've not used the one OP recommends - I'd never have considered a table based mic like that before - but the noise reduction on the Plantronics Blackwire 3215 headset I use is so good that if I move the mic boom a few inches up or down away from my mouth, people can't really hear me on calls. It's superb at getting rid of background noises, and if somebody else was in my home office using voice dictation it would not be picked up by my headset.
I work mostly from home, so I'm interested in this as a programming method to see how or if it changes the way I approach it as work.
1. Cheapest is probably a USB dynamic mic of some kind.
2. Next is a Stenomask at around $250
3. A lot of folks swear by the DPA d:fine cardioid, which is $800-1300 including an interface. There’s also a cardioid shirt-worn lavalier I’m interested in trying sometime, which uses the same interface but the mic is $150 cheaper ($650 -> $500)
If you’re worried about other people hearing you, your options include an isolated area, playing noise (white noise or music?), or using a StenoMask, which blocks sound in both directions.
Remember in the US your employer is required under the ADA to provide “reasonable accommodations” for disability, which may include a private working space, pair programming, or letting you work from home more often.
Voice coding makes me think of audio compression for voice.
This is programming using voice recognition instead.
It should be 'voice-activated coding' (VAC) or 'voice-activated programming' (VAP) or something.
So the clearest way to represent some of this may be to encode the meaning of “voice input/control capable of programming” in a title. We might need a new name for this kind of input to best represent it.
Ironically, I solved my RSI issue after coming across a book on HN. Everyone seemed to recommend the Mind Body Prescription, and after trying it, it did the trick.
Edit: I've been wanting to try this project for several years now: https://www.ctrl-labs.com. I think something like this is the solution moving forward vs voice coding.