It generates two files: one with a line per uninterrupted speaker turn, prefixed with the speaker number, and one with timestamps that I believe is meant for subtitles.
I have had very good results using Spectropic [1], a hosted Whisper diarization API service. I found it cheap, and far easier and faster than setting up and running whisper-diarization on my M1. Audiogest [2] is a web service built on Spectropic; I have not used it yet.
Disclaimer: I am not affiliated in any way, just a happy customer! I had some nice email exchanges after bug reports with the (I believe solo) developer behind these tools.
Thomas here, maker of Spectropic and Audiogest. I am indeed focused on building a simple and reliable Whisper + diarization API. I'm also working on providing fine-tuned versions of Whisper for non-English languages through the API.
Feel free to reach out to me if anyone is interested in this!
Great-looking API. Are you able to, or do you have plans to, offer automatic speaker identification based on labeled samples of their voices? It would be great to basically have a library of known speakers that are auto-matched when transcribing.
Thanks! That is something I might offer in the future, and it is definitely possible with a library like pyannote. It would be really cool to add for sure.
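For the curious, the pyannote approach would look roughly like this (a minimal sketch; the model name, threshold, and sample files are placeholder assumptions, not anything Spectropic ships):

```python
# Match a speech segment against a library of known voices by comparing
# pyannote speaker embeddings. Requires a Hugging Face token for the
# gated "pyannote/embedding" model.
import numpy as np
from scipy.spatial.distance import cdist
from pyannote.audio import Model, Inference

model = Model.from_pretrained("pyannote/embedding", use_auth_token="HF_TOKEN")
inference = Inference(model, window="whole")  # one embedding per file

# Library of known speakers built from labeled samples (hypothetical files)
library = {
    "alice": inference("alice_sample.wav"),
    "bob": inference("bob_sample.wav"),
}

def identify(segment_wav: str, threshold: float = 0.5) -> str:
    """Return the closest known speaker, or 'unknown' above the threshold."""
    emb = np.atleast_2d(inference(segment_wav))
    best_name, best_dist = "unknown", float("inf")
    for name, ref in library.items():
        dist = cdist(emb, np.atleast_2d(ref), metric="cosine")[0, 0]
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else "unknown"
```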
I am also experimenting with post-processing transcripts with LLMs to infer speaker names from the transcript. It works pretty decently already, but it's still a bit expensive. The feature is available under the 'enhanced' model if you want to check it out: https://docs.spectropic.ai/models/transcribe/enhanced
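Roughly, the idea looks like this (a minimal sketch; the prompt and model choice here are illustrative, not the exact 'enhanced' pipeline):

```python
# Ask a chat model to map generic diarization labels to real names
# it can infer from the conversation itself.
from openai import OpenAI

client = OpenAI()

def infer_speaker_names(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Speakers in this transcript are labeled SPEAKER_00, "
                "SPEAKER_01, etc. Infer their real names from context and "
                "rewrite the transcript with those names. If a name cannot "
                "be inferred, keep the original label.\n\n" + transcript
            ),
        }],
    )
    return response.choices[0].message.content
```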
I often subtitle old, obscure, foreign language movies with Whisper. Or random clips found on foreign Telegram/Twitter channels. Paired up with some GPT for translation it works great!
You can do this locally if you have enough (V)RAM, but I prefer the OpenAI API, as I usually don't have enough at hand. And the various Llamas aren't really on par with GPT-4 in quality. If you only need Whisper, and no translation, then local execution is very viable: a high-quality Whisper model fits in 4 GB of (V)RAM.
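For reference, local transcription is only a few lines with the reference openai-whisper package (a minimal sketch; "small" is my guess at a model that fits comfortably in roughly 4 GB):

```python
# Local transcription with the reference Whisper implementation.
# The model is downloaded on first use; the source language is
# auto-detected if you don't specify one.
import whisper

model = whisper.load_model("small")
result = model.transcribe("clip.mp3")
print(result["text"])
```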
The problem with using OpenAI Whisper is that it's too slow on CPU-only machines. whisper.cpp is blazing fast compared to Whisper, and I wish people would build better diarization on top of it.
Another advantage of whisper.cpp is that it can use cuBLAS to accelerate models too large for your GPU memory; I can run the medium and large models with cuBLAS on my 1050, but only the small one in pure GPU mode.
Ah I see, thanks. Hm, I would imagine it's not hard to make something that works with both (the API surface should be fairly small); odd that projects use the former and not the latter.
IIRC whisper-diarization uses WhisperX under the hood.
I'll be honest, I haven't dug into this much, as I just needed something transcribed quickly. But when I looked at WhisperX, I couldn't find a CLI that would, out of the box, give me a text file with a line per speaker statement (not per word).
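That post-processing is small enough to script yourself, though. A rough sketch that collapses WhisperX-style segment output into one line per speaker turn (the JSON shape here is an assumption based on its documented output):

```python
# Merge consecutive segments from the same speaker into one line each.
import json

with open("out.json") as f:
    segments = json.load(f)["segments"]

lines, cur_speaker, cur_text = [], None, []
for seg in segments:
    speaker = seg.get("speaker", "UNKNOWN")
    if speaker != cur_speaker and cur_text:
        lines.append(f"{cur_speaker}: {' '.join(cur_text)}")
        cur_text = []
    cur_speaker = speaker
    cur_text.append(seg["text"].strip())
if cur_text:
    lines.append(f"{cur_speaker}: {' '.join(cur_text)}")

print("\n".join(lines))
```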
Are there any open-source or paid apps/shareware/freeware that can:
- Transcribe word-by-word in real time as audio is recorded
- Work entirely locally
- Use relatively recent open-source local models?
I've been using otter.ai for real-time meeting transcriptions; it lets me multitask and instantly catch up if I'm asked a question by skimming the most recent few seconds of the transcript. But it's far from perfect: their real-time service occasionally has significant transcription delays, not to mention it requires internet connectivity.
Most of the Whisper-based apps out there, though, as well as (when I last checked) the whisper.cpp demo code, require an entire recording to be ingested at once. Others rely on e.g. Apple's dictation frameworks, which are a bit dated in capability at the moment.
I have built my own local-first solution to transcribe entirely locally in real time word by word, driven by a different need (I'm hard of hearing). It's my daily driver for transcribing meetings, interviews, etc. Because of its local-first capability, I do not have to worry about privacy concerns when transcribing meetings at work as all data stays on my machine. It's about as fast as Otter.ai although there's definitely room for improvements in terms of UX and speed. Caveat is that it only works on MacBooks with Apple silicon. Happy to chat over email (see my HN profile).
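For anyone who wants to experiment with the general approach (a naive sketch, not my actual implementation), a chunked loop with faster-whisper and sounddevice is a starting point; a real app would add a sliding window and voice activity detection:

```python
# Naive near-real-time transcription: record fixed 5-second chunks
# and transcribe each one. faster-whisper accepts 16 kHz float32 audio.
import sounddevice as sd
from faster_whisper import WhisperModel

model = WhisperModel("base.en", compute_type="int8")
SAMPLE_RATE, CHUNK_SECONDS = 16000, 5

while True:
    audio = sd.rec(CHUNK_SECONDS * SAMPLE_RATE, samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the chunk is recorded
    segments, _ = model.transcribe(audio.flatten())
    for segment in segments:
        print(segment.text, end=" ", flush=True)
```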
I have some staff with combined hearing and visual needs. Have you researched the one-, two-, and all-party consent requirements? I ask because I hope to identify transcription as "non-recording".
California has an exception for hearing aids and other similar devices, but it’s unclear if transcription aids count, or if this has been tested in court. https://codes.findlaw.com/ca/penal-code/pen-sect-632/ (Not a lawyer, this is not legal advice.)
Do you mean ephemeral, or are you actually wondering about something implanted under the skin? I'd think/hope if it goes under the skin, it ends up in "hearing aid" territory. I'm less sure about if it doesn't persist.
Two-party and all-party consent laws are hacky workarounds for the actual harm being inflicted. The valid goals include not having your microwave inform Google's ad servers, and not recording out-of-context jokes as evidence to imprison people. The invalid goals caught up in the collateral damage include cases like the current one about hearing issues. (Note that a sufficiently accurate transcription service has all the same privacy problems two-party consent tries to protect against, maybe more, since a transcript is more easily searchable.)
I'd be in favor of some startup pulling an Uber or AirBnB and blatantly violating those laws to the benefit of the deaf or elderly if it meant we could get something better on the books.
I've been using Transcribro [0] on Android/GrapheneOS. It's FOSS and local-only, and while it's not word-for-word real-time, it doesn't have to wait for the whole audio to be ingested before it can work. This is on a Pixel 5a, so hardly impressive hardware.
It works well enough that I use it with Telegram to shove messages over to my Linux machine when I don't feel like typing them out, which is such an unsophisticated hack, but is getting the job done. I spent a couple hours trying to find a Linux-native alternative, or even get this running in Waydroid, and couldn't find anything that worked as well, so I decided not to let the "smooth" become the enemy of the "good enough to get the job done."
Live Transcribe in the accessibility settings. AFAIK it's available on any fairly recent Android phone. I bought a Pixel tablet for no other reason but to run it -- nothing else I've tried comes close for local-only continuous transcribe-as-they-speak. (iOS has a similar feature also under accessibility; it's good but not at the same level. Of course I'd love to see an open-source solution.)
This was for English. One problem it took me a while to realize: when I switched it to transcribe a secondary language, it was not doing it on-device anymore. You can tell the difference by setting airplane mode.
There's a captioning button under the volume slider, and I think it's called "live captions" or something in settings. Just tap the button and it'll start.
I helped code oTranscribe+ [0], which does something similar to what you are asking for. The desktop application was built with ElectronJS on top of the then-current version of oTranscribe. It also exists as a web version and PWA [1].
The language models were those from BSC (Barcelona Supercomputing Center) at the time. Transcription is done via WASM, using Vosk [2] as the base.
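The browser build runs Vosk compiled to WASM; the same recognizer API in Python looks roughly like this (a sketch; the input file and model choice are placeholders):

```python
# Streaming recognition with Vosk: feed PCM chunks, get partial results.
import json
import wave
from vosk import Model, KaldiRecognizer

wf = wave.open("speech.wav", "rb")  # expects 16 kHz mono PCM
model = Model(lang="en-us")  # downloads a small English model on first use
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)  # include per-word timestamps in results

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result())["text"])
print(json.loads(rec.FinalResult())["text"])
```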
Punctuation is enough of the hard work of generating a transcript that the output isn't very useful to me without it, unfortunately. What use cases are you planning for where the bare words themselves are the desired output?
futo.org has a FOSS voice-input Android app (voiceinput.futo.org) and live captions for Linux (https://github.com/abb128/LiveCaptions). They specifically developed their own model that does fast real-time transcription.
I've been using this for the past two or three weeks and I have been very impressed with it. I ended up giving them five dollars because it's now my primary keyboard!
You do still need to proof and QA even AI results if you want a publication-quality result, and do things like attribute who is speaking when (at least Whisper can't do that) and correct "unusual" last names and such. So I feel like people using AI still need good tools for the correcting/finishing/proofing, similar to the tools for non-assisted transcription.
Does oTranscribe automatically convert audio into text?
Sorry! It doesn't. oTranscribe makes the manual task of transcribing audio a lot less painful. But you still have to do the transcription.
I currently use Aiko, a free iOS app that does offline transcription using OpenAI's Whisper model. It has been working pretty well for me so far. It can export to formats like SRT, TXT, CSV, JSON, and text with timestamps.
https://sindresorhus.com/aiko
You're always welcome to try my service TurboScribe https://turboscribe.ai/ if you need a transcript of an audio/video file. It's 100% free up to 3 files per day (30 minutes per file) and the paid plan is unlimited and transcribes files up to 10 hours long each. It also supports speaker recognition, common export formats (TXT, DOCX, PDF, SRT, CSV), as well as some AI tools for working with your transcript.
Thanks! I've had good results with Turboscribe (paid plan) and appreciate having this as a service. I typically use it for 2-3 hour long video recordings with a number of speakers and appreciate the editing tools to clean things up before export.
I was curious how good a transcription I could get from what may be the best multimodal LLM currently, Gemini-1.5-Pro-Experiment-0801, so I had it transcribe five minutes of an interview between Ezra Klein and Nancy Pelosi from earlier today. The results are here:
Aside from some minor punctuation and capitalization issues, Gemini’s transcription looks nearly perfect to me. There were only one or two words that I think it misheard. If I had transcribed the audio myself, I would have made more mistakes than that.
One passage struck me in particular:
And then he comes up with "weird," which becomes viral and the rest, and here he is.
How did Gemini know to put “weird” in quotation marks, to indicate—correctly—that the speaker was referring to Walz’s use of the word as a word? According to Politico, Walz first used the word in that context in the media on July 23.
Maybe two factors helped achieve the impressive result with the quotation marks:
- auditory cues
- the sentence would be grammatically incorrect and make no sense without them
Just guessing out of the blue.
But I think it's likely that LLMs (and other speech recognition systems) need to exploit sentence context to recognize individual words and punctuation, and this is an example where it went well.
Human listening is similar in a way: we can recognize words even when they are mumbled or spoken very fast, if we have context.
The NYT’s closed captions do not put “weird” in quotation marks; they also divide sentences weirdly and have other mistakes. But they get some things better than Gemini, such as capitalizing “House” when it means the House of Representatives.
I haven’t compared the audio-only podcast version and the video version carefully; it’s possible that parts of the audio were edited or re-recorded for one or the other.
Just pitching in with a transcription tool that lets you transcribe video and audio files using Whisper and WASM in your browser, and get a .txt, .srt, or .vtt file. Maybe in the future it will support Whisper Turbo?
I use this a lot. It's nice and simple and has exactly the tools you need (playback speed control, easy pause/play) and nothing more. I greatly prefer it over automatic transcription tools that give you 40 pages of 'umm's and 'ahhh's to filter through and edit.
People not used to AI have blind spots that prevent them from seeing evident use cases like this one.
I'm always surprised at the amazed look of my friends when they see me concretely use the tool. They just didn't picture it until they saw it in action.
It's not even people not used to AI, I developed a tool that uses AI to do something, and then kind of couldn't be bothered to fix some of the output manually. It only occurred to me days later that I can ask the AI to fix it.
I started making an open source macOS app to do this with whisper and potentially pyannote.
It is functional but a bit slow. I think using whisper directly instead of the Swift bindings will help a lot.
Really interested in adding diarisation, but I'm having a lot of trouble converting pyannote to CoreML. pyannote runs very slowly with torch on CPU. I haven't gotten around to putting my latest work on that up on GitHub yet.
Yeah, definitely switch over to ggerganov's whisper implementation. I use it in a little home-brewed Python app on my M1 for handling speech transcripts. The base EN model chews through minutes of audio in seconds; it's insanely fast.
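If you'd rather skip bindings entirely, shelling out to the whisper.cpp CLI works fine for a homebrew app (a sketch; the binary and model paths depend on your build):

```python
# Drive the whisper.cpp CLI from Python. With -otxt it writes a plain
# transcript next to the input file (e.g. audio.wav.txt).
import subprocess

def transcribe(wav_path: str, model: str = "models/ggml-base.en.bin") -> str:
    subprocess.run(
        ["./main", "-m", model, "-f", wav_path, "-otxt"],
        check=True,
    )
    with open(wav_path + ".txt") as f:
        return f.read()

print(transcribe("audio.wav"))
```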
I'm working on a tool that includes AI. My original target is to test it on my https://www.youtube.com/c/VectorPodcast by offering something like what Lex Fridman does for his episodes.
Current features:
1. Download from YT
2. Transcribe using Vosk (output has time codes included)
3. Speaker diarization using pyannote - this isn't perfect and needs a bit more ironing out (a rough sketch of this step follows at the end of this comment).
What needs to be done:
4. Store the transcription in a search engine (can include vectors)
5. Implement a webapp
If anyone here is interested to join forces, let me know.
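Here is the sketch of step 3 mentioned above, using pyannote's publicly documented pretrained pipeline (a minimal version, not the project's actual code):

```python
# Minimal speaker diarization with pyannote's pretrained pipeline.
# Requires a Hugging Face token for the gated model.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN"
)
diarization = pipeline("episode.wav")

# Print one line per speaker turn with time codes
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.1f}s {turn.end:7.1f}s {speaker}")
```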
Fantastic tool; I used it a lot to transcribe interviews during plane travel, where there was no internet and I needed to fill the time. Really useful to have if you do a lot of interviews.
Am I missing something? From what I checked, it supports every language, since you are the one transcribing by hand. This is just a UI for watching the video or listening to the audio while you type it out.
I made a similar tool for making tables of contents for youtube videos: https://youtoc.by/
I'm not developing it actively anymore; I created tables of contents for the several videos I needed years ago. If I ever need it again, I will probably work on the mobile UI (i.e., make it responsive).
If you are looking for something automatic that also allows you to interact with your transcripts chatgpt style then I would recommend https://www.videototextai.com/
That cookie box though... A dark pattern ("accept lots" + "accept all", fake drag affordance, covering a quarter of the page) for cookies doesn't bode well for privacy protections around the transcripts.
You can delete any transcription you make, and once you do, we do not keep any copy of it :) . The cookie banner is there to comply with EU law.
Ah this brings back memories. When I was in college with limited money, I used to pirate movies and most of them didn't have subtitles and I used to daydream of writing a VLC plug-in which would real time generate subtitles. But I had better things to do like play video games...
I don't understand what the issue is. You don't know how to type the different diacritical marks? Or the textbox isn't accepting them? (Which seems like it would be a browser issue, not an issue with the site.)
SubtitleEdit is one of the most complete options and has many online tutorials from users.
Make sure they are recent tutorials, because those will probably cover the automated generation tools/plugins that weren't available years ago.
https://github.com/McCloudS/subgen worked very well for me. I had a TV series where the timestamps of the last few seasons somehow did not match up with any subtitle files I could find online. I used subgen and it worked surprisingly well.
Worked excellently.