Whisper as an API is great, but having to send the whole payload upfront is a bummer. Most use cases I can build for would want streaming support.

Like establish a WebRTC connection and stream audio to OpenAI and get back a live transcription until the audio channel closes.




FWIW, AssemblyAI has great transcript quality in my experience, and they support streaming: https://www.assemblyai.com/docs/walkthroughs#realtime-stream...


We're using AssemblyAI too, and I agree that their transcription quality is good. But as soon as Whisper supports word-level timestamps, I think we'll seriously consider switching, since the price difference is large ($0.36 per hour vs. $0.90 per hour).


Both of those prices strike me as quite high, given that Whisper can be run relatively quickly on commodity hardware. It's not like the bandwidth is significant either; it's just audio.


It's pretty great from my perspective. I've been creating little supplemental ~10-minute videos for my class (using Descript; I should probably switch to OBS), and the built-in transcription is both wonderful (that it has it at all, and is easy to fix) and horrible (the number of errors is very high). I'd happily pay a dime for a higher-quality starting transcription that saves me 5 minutes of fixing...


Try my app: https://apps.apple.com/app/wisprnote/id1671480366

It has great-quality transcription from video and audio (English only, sorry if that's not you!). It uses Whisper.cpp plus VAD to skip silent/non-speech sections, which normally introduce errors. Give it a try and let me know what you think! :)


A plug here, but check out https://vidcap.app/

It's based on a fine-tuned Whisper, and you get unlimited transcriptions for $4.99/month.


Why do you need word-level timestamps? I don't understand what they're for...


I've run Whisper locally via [1] with one of the medium-sized models, and it was damn good at transcribing audio from a video of two people having a conversation.

I don't know exactly what the use case is where people would need to run this via the API; the compute isn't huge (I used CPU only, on an M1), and the memory requirements aren't much.

[1] https://github.com/ggerganov/whisper.cpp
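
For reference, the invocation looks roughly like this (a minimal sketch; the binary name, model file, and paths are assumptions based on whisper.cpp's README, and it expects 16 kHz mono WAV input):

    # Convert the source video to the 16 kHz mono WAV whisper.cpp expects,
    # then transcribe with a medium model. Paths and model are assumptions.
    import subprocess

    subprocess.run(
        ["ffmpeg", "-i", "conversation.mp4", "-ar", "16000", "-ac", "1", "audio.wav"],
        check=True,
    )
    subprocess.run(
        ["./main", "-m", "models/ggml-medium.en.bin", "-f", "audio.wav", "-otxt"],
        check=True,  # -otxt writes the transcript to audio.wav.txt
    )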


> I've run Whisper locally via [1] with one of the medium-sized models, and it was damn good at transcribing audio from a video of two people having a conversation.

Totally agree on this.

I made a Mac app that uses Whisper to transcribe audio or video files. It also adds VAD to reduce Whisper hallucinations during silent sections, and it's super fast. https://apps.apple.com/app/wisprnote/id1671480366


The 5 GB model is likely too big for 95% of people's machines, and renting GPUs is likely not much cheaper.

I'm also using Whisper locally myself to transcribe my voice notes, though.


I recently tried a number of options for streaming STT. Because my use case was very sensitive to latency, I ultimately went with https://deepgram.com/ - but https://github.com/ggerganov/whisper.cpp provided a great stepping stone while prototyping a streaming use case locally on a laptop.


As far as I can tell it doesn't support word-level timestamps (yet). That's a bit of a dealbreaker for things like promotional clips or the interactive transcripts that we do[^0]. Hopefully they add this soon.

[^0]: https://www.withfanfare.com/p/seldon-crisis/future-visions-w...


It's also annoying that there appears to be a hard limit of 25 MiB on the request size, requiring you to split up larger files and manage the "prompt" for subsequent calls. As near as I can tell, how you're expected to use that value isn't documented.


You split up the audio and send it over in a loop. Pass in the transcript of the last call as the prompt for the next one. See item 2 here: https://platform.openai.com/docs/guides/speech-to-text/promp...
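
Roughly, that loop looks like this (a hedged sketch using pydub plus the pre-1.0 openai Python package; the 10-minute chunk size and the filenames are my assumptions, not anything the docs mandate):

    # Split audio into chunks, transcribe each, and seed every call with
    # the previous chunk's transcript via the prompt parameter.
    import openai
    from pydub import AudioSegment

    audio = AudioSegment.from_file("podcast.mp3")
    chunk_ms = 10 * 60 * 1000  # 10-minute chunks, comfortably under 25 MB
    pieces, prompt = [], ""

    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        path = f"chunk_{i}.mp3"
        audio[start:start + chunk_ms].export(path, format="mp3")
        with open(path, "rb") as f:
            result = openai.Audio.transcribe("whisper-1", f, prompt=prompt)
        prompt = result["text"]  # context for the next call
        pieces.append(result["text"])

    print(" ".join(pieces))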


And:

> we suggest that you avoid breaking the audio up mid-sentence as this may cause some context to be lost.

That's really easy to put in a document but much harder to do in practice. Granted, it might not matter much in the real world; not sure yet.

Still, this will require more hand-holding than I'd like.


I doubt it will matter if you're breaking up mid-sentence, as long as you pass in the previous transcript as a prompt and split on word boundaries. This is how Whisper does it internally.

It's not absolutely perfect, but splitting on a word boundary is one line of code with the same package in their docs: https://github.com/jiaaro/pydub/blob/master/API.markdown#sil...

25 MB is also a lot. That's 30 minutes to an hour of MP3 at reasonable compression. A 2-hour movie would have three splits.
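
For what it's worth, that split looks like this in context (a sketch; the silence thresholds are guesses you'd tune per recording):

    # Split on silence so chunks end at pauses rather than mid-word,
    # using pydub.silence as in the OpenAI docs' example.
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_file("movie.mp3")
    chunks = split_on_silence(
        audio,
        min_silence_len=700,  # ms of silence that counts as a pause (a guess)
        silence_thresh=-40,   # dBFS below which audio is "silent" (a guess)
        keep_silence=200,     # keep a little padding around each chunk
    )
    # Then greedily merge chunks back into segments under the size cap
    # before uploading each one.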


If that helps: I just wrote a script that splits the audio and uses the prompt parameter to provide context from the n-1 segment's transcription: https://gist.github.com/patrick-samy/cf8470272d1ff23dff4e2b5...


The page includes a five-line Python example of how to split audio without breaking mid-word.


I suggest you give revoldiv.com a try. We use Whisper and other models together. You can upload very large files and get an hour-long file transcribed in less than 30 seconds. We use intelligent chunking so the model doesn't lose context, and we're looking to increase the limit even more in the coming weeks. It's also free to transcribe any video/audio, with word-level timestamps.


I just gave it a try, and the results are impressive! Do you also offer an API?


If you're interested in an offline/local solution: I made a Mac app that uses Whisper.cpp and Voice Activity Detection to skip silence and reduce Whisper hallucinations: https://apps.apple.com/app/wisprnote/id1671480366

If it really works for you, I can add command-line params in an update, so you can use it as a "local API" for free.
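
(If you'd rather wire up the same trick yourself, the general idea, not this app's actual code, looks something like this with py-webrtcvad, assuming 16-bit mono PCM at 16 kHz:)

    # Generic VAD pre-filter: keep only frames classified as speech, so
    # silent stretches never reach Whisper. A sketch, not WhisprNote's code.
    import webrtcvad

    def speech_only(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> bytes:
        vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3 (strict)
        step = int(sample_rate * frame_ms / 1000) * 2  # bytes per 16-bit frame
        frames = (pcm[i:i + step] for i in range(0, len(pcm) - step + 1, step))
        return b"".join(f for f in frames if vad.is_speech(f, sample_rate))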


Contact us at team@revoldiv.com; we are offering an API on a case-by-case basis.



