We're using AssemblyAI too, and I agree that their transcription quality is good. But as soon as Whisper supports word-level timestamps, I think we'll seriously consider switching, since the price difference is large ($0.36 per hour vs. $0.90 per hour).
Both of those prices strike me as quite high, given that Whisper can be run relatively quickly on commodity hardware. It's not like the bandwidth is significant either; it's just audio.
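For a rough sense of scale, here's a back-of-the-envelope comparison using the two per-hour rates quoted above (the hour counts are illustrative):

```python
# Rough monthly cost comparison at the two per-hour rates quoted above.
WHISPER_API = 0.36   # $/hour of audio
ASSEMBLYAI  = 0.90   # $/hour of audio

def monthly_cost(hours_per_month: float, rate_per_hour: float) -> float:
    """Dollar cost of transcribing the given number of audio hours."""
    return round(hours_per_month * rate_per_hour, 2)

# e.g. 500 hours of audio a month:
print(monthly_cost(500, WHISPER_API))  # 180.0
print(monthly_cost(500, ASSEMBLYAI))   # 450.0
```

At a few hundred hours a month, that 2.5x gap is the kind of line item that makes self-hosting on commodity hardware look attractive.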
It's pretty great from my perspective. I've been creating little supplemental ~10 minute videos for my class (using Descript; I should probably switch to OBS), and the built-in transcription is both wonderful (that it exists at all and is easy to fix) and horrible (the number of errors is very high). I'd happily pay a dime for a higher-quality starting transcription that saves me 5 minutes of fixing...
It has great-quality transcription from video and audio (English only, sorry if that's not you!). It uses Whisper.cpp plus VAD to skip silent / non-speech sections, which normally introduce errors. Give it a try and let me know what you think! :)
I've run Whisper locally via [1] with one of the medium-sized models, and it was damn good at transcribing audio from a video of two people having a conversation.
I don't know exactly what the use case is where people would need to run this via an API; the compute isn't huge (I used CPU only, on an M1) and the memory requirements aren't much.
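For anyone who wants to try this locally, a minimal whisper.cpp run looks roughly like the following (assuming [1] refers to whisper.cpp; the input file name is illustrative):

```shell
# Build whisper.cpp and fetch a medium English model
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
bash ./models/download-ggml-model.sh medium.en

# Whisper expects 16 kHz mono WAV; extract the audio track with ffmpeg
ffmpeg -i conversation.mp4 -ar 16000 -ac 1 -c:a pcm_s16le conversation.wav

# Transcribe on CPU
./main -m models/ggml-medium.en.bin -f conversation.wav
```

The medium model fits comfortably in a few GB of RAM, which matches the "requirements aren't much" experience above.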
> I've run Whisper locally via [1] with one of the medium-sized models, and it was damn good at transcribing audio from a video of two people having a conversation.
Agreed, completely.
I made a Mac app that uses Whisper to transcribe from audio or video files. It also adds VAD to reduce Whisper hallucinations during silent sections, and it's super fast. https://apps.apple.com/app/wisprnote/id1671480366
I recently tried a number of options for streaming STT. Because my use case was very sensitive to latency, I ultimately went with https://deepgram.com/ - but https://github.com/ggerganov/whisper.cpp provided a great stepping stone while prototyping a streaming use case locally on a laptop.
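For anyone else prototyping streaming locally: whisper.cpp ships a real-time microphone example (it needs SDL2 for audio capture); invocation is roughly:

```shell
# build the streaming example (requires SDL2 for microphone capture)
make stream

# live-transcribe mic input: 8 threads, 500 ms step, 5 s sliding window
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
```

The small `base.en` model keeps latency down, which is the usual trade-off for streaming versus batch transcription.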
As far as I can tell it doesn't support word-level timestamps (yet). That's a bit of a dealbreaker for things like promotional clips or the interactive transcripts that we do[^0]. Hopefully they add this soon.
It's also annoying since there appears to be a hard limit of 25 MiB on the request size, requiring you to split up larger files and manage the "prompt" passed to subsequent calls. And as near as I can tell, how you're expected to use that value isn't documented.
I doubt splitting mid-sentence will matter if you pass the previous chunk's transcript as the prompt and split on word boundaries. This is how Whisper does it internally.
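A minimal sketch of that prompt-chaining pattern, where each chunk's transcript is fed back as the prompt for the next call (`transcribe_chunk` stands in for the real API call and is hypothetical):

```python
# Sketch: transcribe a long file in chunks that each fit under the
# 25 MiB request limit, carrying context forward via the prompt.
# `transcribe_chunk(chunk, prompt=...)` is a stand-in for whatever
# transcription call you use; it is not a real library function.

def transcribe_chunks(chunks, transcribe_chunk):
    """chunks: iterable of audio segments; returns the stitched transcript."""
    parts = []
    prompt = ""
    for chunk in chunks:
        # Pass the previous chunk's text so the model keeps context
        # across the split point.
        text = transcribe_chunk(chunk, prompt=prompt)
        parts.append(text)
        prompt = text
    return " ".join(parts)
```

Splitting on word boundaries (or at silence, via VAD) plus this chaining is essentially what Whisper's own long-form transcription loop does with its 30-second windows.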
I suggest you give revoldiv.com a try. We use Whisper and other models together. You can upload very large files and get a transcription of an hour-long file in less than 30 seconds. We use intelligent chunking so the model doesn't lose context, and we're looking to increase the limit even more in the coming weeks. It's also free to transcribe any video/audio with word-level timestamps.
If you're interested in an offline / local solution: I made a Mac App that uses Whisper.cpp and Voice Activity Detection to skip silence and reduce Whisper hallucinations: https://apps.apple.com/app/wisprnote/id1671480366
If it really works for you, I can add command-line params in an update, so you can use it as a "local API" for free.
Like establishing a WebRTC connection, streaming audio to OpenAI, and getting back a live transcription until the audio channel closes.