I'm deaf as a post and rely on speech-to-text tools and features (Otter, Google Meet), usually paid, and constantly feel a sense of digital precarity: at any moment, these tools may be removed, changed, or experience outages, all of which swings my world from inclusion to exclusion. I hope this works well, because it would give me control over my own tools for accessibility in a way that my workarounds don't.
For me, it's a tough tradeoff between privacy and accessibility – without these tools, I have zero accessibility for audio-based media. My best bridge for now is Google Meet's captions, and when Meet isn't where the call is taking place, piping the audio into Meet as a virtual audio source (rough sketch of the routing below).
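For anyone curious about the routing trick: this is roughly what it looks like on a Linux box with PulseAudio. It's just a sketch, not a polished setup – the sink name "meet_sink" is arbitrary, the last step is done by hand in pavucontrol, and your setup (PipeWire, macOS loopback drivers, etc.) may differ.

```python
# Sketch: create a virtual sink so another app's call audio can be fed
# to Google Meet as a "microphone" for live captioning.
# Assumes PulseAudio (or pipewire-pulse) with pactl available.
import subprocess

def pactl(*args):
    """Run a pactl command and return its output (the module index)."""
    out = subprocess.run(["pactl", *args], check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

# 1. Create a null sink to receive the call's audio.
null_module = pactl("load-module", "module-null-sink",
                    "sink_name=meet_sink",
                    "sink_properties=device.description=MeetCaptionSink")

# 2. Optionally loop the sink's monitor back to the default output
#    so you can still hear (or feel) the audio yourself.
loopback_module = pactl("load-module", "module-loopback",
                        "source=meet_sink.monitor")

print("Loaded modules:", null_module, loopback_module)

# Manual steps afterwards:
#  - In pavucontrol, move the call app's playback stream to "MeetCaptionSink".
#  - In the Google Meet tab, pick "Monitor of MeetCaptionSink" as the mic,
#    so Meet's captions transcribe the other call.
#  - Clean up later with: pactl unload-module <index>
```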
I ran this for the first time today, after reading about it on another HN thread. On my computer (running Ubuntu 20.x), it seems to work fine except that the stop button does nothing, so I can't stop the program and select+copy the displayed text.
How well does this work without a CUDA-capable GPU? I tried out Whisper when it first showed up, on a 12th-gen i7 laptop in CPU mode, and found that the larger, more accurate models would take enormous amounts of time.
I have been contemplating setting up a desktop machine with a discrete GPU so that I can play around with this further. I'd be curious whether smaller sample sizes improve the performance, or whether the smaller models are sufficient when doing voice transcription – something like the sketch below is how I'd compare them.
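A minimal way to compare model sizes on CPU with the openai-whisper package (the file name and model list are just examples; timings will vary a lot with clip length and hardware):

```python
import time
import whisper  # pip install openai-whisper

AUDIO = "audio.wav"  # placeholder: any short speech clip

# Try a few model sizes on CPU and time each transcription.
for size in ["tiny", "base", "small"]:
    model = whisper.load_model(size, device="cpu")
    start = time.time()
    # fp16=False avoids the "FP16 not supported on CPU" warning.
    result = model.transcribe(AUDIO, fp16=False)
    elapsed = time.time() - start
    print(f"{size:>5}: {elapsed:6.1f}s  {result['text'][:80]!r}")
```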