I thought whisper and others took large chunks (20-30 seconds) of speech, or a complete wave file as input. How do you get real-time transcription? What size chunks do you feed it?
To me, STT should take a continuous audio stream and output a continuous text stream.
Whisper and Moonshine both work on chunks, but for Moonshine:
> Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.
Also, with Kyutai you can feed continuous audio in and get continuous text out.
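In other words, Whisper always pays for a full 30-second window (shorter inputs are padded out), while Moonshine's cost follows the actual clip length. A tiny illustration of just the padding difference, no model involved:

```python
import numpy as np

SAMPLE_RATE = 16000

def pad_to_30s(audio: np.ndarray) -> np.ndarray:
    """Whisper-style: every input is zero-padded out to a fixed 30 s window."""
    target = 30 * SAMPLE_RATE
    return np.pad(audio, (0, max(0, target - len(audio))))

# A 10-second clip still costs a full 30 s window under Whisper,
# while Moonshine's compute scales with the 10 s it was actually given.
clip = np.zeros(10 * SAMPLE_RATE, dtype=np.float32)
print(len(pad_to_30s(clip)) / SAMPLE_RATE)  # 30.0
print(len(clip) / SAMPLE_RATE)              # 10.0
```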
Having used Whisper and seen the poor quality caused by its 30-second chunks, I would stay far away from software that works on even shorter durations.
The short duration effectively means the transcription starts producing nonsense as soon as a sentence gets cut off in the middle of a chunk.
Oh, this does sound cool. Couple of questions that aren't clear from the readme (to me).
What exactly does the silence detection mean? Does that mean it'll wait until a pause, then send the audio off to Whisper, return the output, and stop the process?
Same question with continuous. Does that just mean it keeps going until Ctrl+C?
Nvm, answered my own question, looks like yes for both[0][1]. Cool this seems pretty great actually.
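For anyone curious, here's a rough sketch of what that pause-then-flush loop could look like. I'm assuming a simple energy threshold here; the actual project may well use a proper VAD instead:

```python
import numpy as np

SILENCE_RMS = 0.01      # energy threshold; tune for your mic and noise floor
SILENCE_FRAMES = 15     # this many consecutive quiet frames ends an utterance

def utterances(frames):
    """Buffer audio frames and yield a chunk each time a pause is detected."""
    buf, quiet = [], 0
    for frame in frames:
        buf.append(frame)
        rms = float(np.sqrt(np.mean(frame ** 2)))
        quiet = quiet + 1 if rms < SILENCE_RMS else 0
        if quiet >= SILENCE_FRAMES and len(buf) > SILENCE_FRAMES:
            yield np.concatenate(buf)   # hand this chunk off to the model
            buf, quiet = [], 0
    if buf:
        yield np.concatenate(buf)       # flush whatever is left at the end
```

In one-shot mode you'd stop after the first yield; continuous mode just keeps the loop running until you interrupt it.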
Agreed, both of those make sense, but I was thinking realtime. (Pipes can stream data; I'd find it useful to have something that can stream STT output to stdout in realtime.)
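Something like this is the shape I mean: raw 16-bit mono PCM on stdin, text on stdout, flushed as soon as it's ready so the next process in the pipe sees it live. The transcribe() hook is hypothetical, a stand-in for whatever streaming model you plug in:

```python
import sys
import numpy as np

SAMPLE_RATE = 16000
CHUNK = SAMPLE_RATE // 2   # read half a second of audio per iteration

def transcribe(chunk: np.ndarray) -> str:
    """Hypothetical hook: feed the chunk to your streaming STT model here."""
    return ""

while True:
    raw = sys.stdin.buffer.read(CHUNK * 2)   # 2 bytes per int16 sample
    if not raw:
        break
    audio = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    text = transcribe(audio)
    if text:
        print(text, end=" ", flush=True)     # flush so the pipe sees it live
```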
FYI:
owhisper pull whisper-cpp-large-turbo-q8
Failed to download model.ggml: Other error: Server does not support range requests. Got status: 200 OK
But the base-q8 works (and works quite well!). The TUI is really nice. Speaker diarization would make it almost perfect for me. Thanks for building this.
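For what it's worth, that error suggests the host serving the large-turbo file ignores the Range header: a server that supports partial (resumable) downloads answers 206 Partial Content, while 200 OK means it just sent the whole file. You can check with a quick request; the URL below is a placeholder:

```python
import requests

# Placeholder URL; substitute the model file that failed for you.
url = "https://example.com/model.ggml"

resp = requests.get(url, headers={"Range": "bytes=0-1"}, stream=True)
if resp.status_code == 206:
    print("server honors Range requests (resumable download OK)")
else:
    # 200 means the server ignored the Range header and sent the whole
    # file, which is what the downloader refused to accept.
    print(f"no range support: got {resp.status_code}")
```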
Sorry, maybe I missed it, but I didn't see this list on your website; I think it would be a good idea to add this info there. Besides that, thank you for the effort and your work! I will definitely give it a try.
Here is the list of local models it supports:
- whisper-cpp-base-q8
- whisper-cpp-base-q8-en
- whisper-cpp-tiny-q8
- whisper-cpp-tiny-q8-en
- whisper-cpp-small-q8
- whisper-cpp-small-q8-en
- whisper-cpp-large-turbo-q8
- moonshine-onnx-tiny
- moonshine-onnx-tiny-q4
- moonshine-onnx-tiny-q8
- moonshine-onnx-base
- moonshine-onnx-base-q4
- moonshine-onnx-base-q8