I’d say STT is pretty much a solved problem. Every day there is a new product, and any current top-of-the-line LLM can one-shot one. Take a look at this [1]. Apple is just stuck in the past.
> On June 18, 2026, Gemini CLI and Gemini Code Assist IDE extensions will stop serving requests for Google AI Pro and Ultra, as well as those using it free of charge using Gemini Code Assist for individuals.
While SWE-bench Verified is not a perfect benchmark for coding, AFAIK this is the first open-weights model to cross the 80% threshold on it, scoring 80.6%.
Back in Nov 2025, Opus 4.5 (80.9%) was the first proprietary model to do so.
I use speech-to-text tools every day now, especially for dictating detailed prompts to LLMs and coding agents. I personally use VoiceInk, which is open-source.
I looked into what other solutions are available and collected the best open-source ones in this awesome-style GitHub repo. Hope you find something that works for you!
Will check it out. I built a quick-and-dirty custom solution that works with the coding agent we use. Speaking is much faster than typing, but unlike typing, it takes mental effort to lay out your thoughts before you speak.
I love using (tiling) window managers, and one of the most important requirements for me is having a key binding for switching to the last active workspace. The proposed solution in the blog doesn't achieve this. I use AeroSpace on macOS right now and think it's the best solution available.
I generally have fixed workspaces for different things: first for a browser, second for a code editor, third for a terminal, and so on. If I want to switch between the browser and code editor, I can do that with a single key binding, usually Alt+Tab. The same binding lets me switch between the code editor and terminal just as easily.
When you have something like 10 different workspaces, not having this key binding becomes annoying. If you need to alternate between windows on workspace one and workspace eight, you're stuck using both hands to press Control+1 and then Control+8. But with a last-active-workspace key binding, you can just Alt+Tab between them. This is the killer feature I always need.
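To make the behavior concrete, here is a minimal sketch of the toggle logic I mean. It is only an illustration: the `WorkspaceSwitcher` class and its method names are made up for this comment and are not how any particular window manager implements it.

```python
class WorkspaceSwitcher:
    """Hypothetical model of a window manager's workspace state."""

    def __init__(self, initial: int = 1) -> None:
        self.current = initial
        self.previous = initial

    def switch_to(self, workspace: int) -> int:
        # Direct jump (e.g. Control+8). Jumping to the workspace you are
        # already on is a no-op, so it does not clobber the remembered one.
        if workspace != self.current:
            self.previous, self.current = self.current, workspace
        return self.current

    def switch_to_last(self) -> int:
        # Single-key toggle (e.g. Alt+Tab): swap current and previous.
        self.current, self.previous = self.previous, self.current
        return self.current


wm = WorkspaceSwitcher(initial=1)
wm.switch_to(8)       # one direct jump to workspace eight
wm.switch_to_last()   # back on workspace one
wm.switch_to_last()   # back on workspace eight
```

After the first direct jump, the same key alternates between the two workspaces no matter how far apart their numbers are.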
Can you explain how exactly dictation is used for development? I type about 120 WPM, so typing is always going to be way faster for me than talking. Aside from accessibility, is dictation for development meant for slower typists, or is it more so you can relax on a couch while vibe coding? If this comes off as condescending, that's not intended; I am genuinely out of the loop here.
I think most people can speak faster than 120 WPM. For example this site says I speak at 343 WPM https://www.typingmaster.com/speech-speed-test/, and I self-measure 222 WPM on dense technical text.
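If you want to self-measure, the arithmetic is just words divided by minutes. A rough sketch, assuming you time yourself reading a passage aloud and count whitespace-separated words (the function name is made up for this comment):

```python
def words_per_minute(text: str, seconds: float) -> float:
    """Return speaking (or typing) speed in words per minute."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Count whitespace-separated tokens as words.
    word_count = len(text.split())
    return word_count * 60.0 / seconds


# Illustrative numbers only: a 3-word phrase spoken in 1.5 seconds.
rate = words_per_minute("one two three", 1.5)
```

Counting tokens this way treats a long technical term the same as a short word, so results on dense text are not directly comparable with results on casual speech.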
For me personally, it's not really about typing speed. I can type pretty fast, and I most likely speak faster than I type, but typing and dictating are just different ways of doing things for me. The end result is the same, and I don't see it as a competition between the two.
I regularly sit down and describe whatever I'm trying to do in detail, speaking my entire thought process out loud: the trade-offs I'm weighing, my concerns, and any edge cases and patterns I have in mind. I often speak for 5 to 10 minutes at a stretch, sometimes taking breaks in between to think things through.
I'm not doing it just for vibe coding; I use it for everything. That obviously includes driving coding agents, but also describing my thoughts in general, whether for brainstorming or for a critique session with LLMs on my ideas.
One other benefit for me personally is that, since I interact with coding agents and LLMs in general many times every day, I end up giving much more context and detail when I speak out loud than when I type. Sometimes I might feel a little too lazy to type one or two extra sentences. But while speaking, I don't really have that kind of friction.
Author here. My argument is: we give instructions to coding agents dozens of times a day. Over time, speaking those instructions naturally tends to produce more detailed context than typing them out, because the friction of typing makes you abbreviate.
I've been using VoiceInk on macOS for a few months now. The workflow is just: hold the shortcut, speak, release, and the text appears at the cursor. It works in the terminal, editor, chat, wherever.
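For anyone curious what that hold-speak-release loop looks like as logic, here is a minimal sketch. It is not VoiceInk's implementation: the `PushToTalk` class is hypothetical, and the `transcribe` and `paste` callables are placeholders for whatever local model and paste mechanism a given tool uses.

```python
from typing import Callable, List


class PushToTalk:
    """Hypothetical hold-to-record, release-to-paste controller."""

    def __init__(
        self,
        transcribe: Callable[[bytes], str],  # placeholder for a local STT model
        paste: Callable[[str], None],        # placeholder for paste-at-cursor
    ) -> None:
        self._transcribe = transcribe
        self._paste = paste
        self._chunks: List[bytes] = []
        self._recording = False

    def on_shortcut_down(self) -> None:
        # Holding the shortcut starts a fresh recording.
        self._chunks = []
        self._recording = True

    def on_audio_chunk(self, chunk: bytes) -> None:
        # Audio is buffered only while the shortcut is held.
        if self._recording:
            self._chunks.append(chunk)

    def on_shortcut_up(self) -> str:
        # Releasing stops recording, transcribes, and pastes the result.
        if not self._recording:
            return ""
        self._recording = False
        text = self._transcribe(b"".join(self._chunks))
        if text:
            self._paste(text)
        return text
```

Because the paste step is just "insert text into the active window", the same flow works regardless of which application has focus.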
The post covers Handy, Whispering, VoiceInk, OpenWhispr, and FluidVoice. All open-source, all do local transcription, all paste directly into the active window. The differences are mostly platform support, model selection, and how much extra stuff (AI post-processing, voice-activated mode, etc.) they add.
Happy to answer questions about any of these or about the voice-typing-for-agents workflow in general.