OTranscribe: A free and open tool for transcribing audio interviews (otranscribe.com)
482 points by zerojames 41 days ago | 103 comments



I needed to do this this week (transcribe an interview with multiple speakers) and used https://github.com/MahmoudAshraf97/whisper-diarization

Worked excellently.

It generates both a file that just contains a line per uninterrupted speaker speech prefixed with the speaker number, as well as a file with timestamps which I believe would be used as subtitles.


I have had very good results using Spectropic [1], a hosted Whisper diarization API service. I found it cheap, and way easier and faster than setting up and running whisper-diarization on my M1. Audiogest [2] is a web service built on top of Spectropic; I have not used it yet.

Disclaimer: I am not affiliated in any way, just a happy customer! I had some nice email exchanges after bug reports with the (I believe solo) developer behind these tools.

---

[1] https://spectropic.ai/

[2] https://audiogest.app/


Thanks for the shout-out and kind words!

Thomas here, maker of Spectropic and Audiogest. I am indeed focused on building a simple and reliable Whisper + diarization API. I am also working on providing fine-tuned versions of Whisper for non-English languages through the API.

Feel free to reach out if you're interested in this!


Great-looking API. Do you support, or plan to support, automatic speaker identification based on labeled samples of voices? It would be great to basically have a library of known speakers that are auto-matched when transcribing.


Thanks! That is something I might offer in the future; it's definitely possible with a library like pyannote. Would be really cool to add, for sure.
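As a rough illustration of the idea (a sketch using pyannote's embedding model, not how Spectropic does it; the model name, file paths, and threshold are assumptions):

    # Match a clip against a library of known voices by embedding similarity.
    # Requires pyannote.audio and a Hugging Face token with access to the model.
    import numpy as np
    from pyannote.audio import Inference, Model

    model = Model.from_pretrained("pyannote/embedding", use_auth_token="hf_...")
    embed = Inference(model, window="whole")  # one embedding per audio file

    # Labeled samples of known speakers -> reference embeddings.
    library = {name: embed(f"{name}.wav") for name in ["alice", "bob"]}

    def identify(clip_path, threshold=0.6):
        """Return the closest known speaker by cosine similarity, or None."""
        e = embed(clip_path)
        scores = {name: np.dot(e, ref) / (np.linalg.norm(e) * np.linalg.norm(ref))
                  for name, ref in library.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] >= threshold else None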

I am also experimenting with post-processing transcripts with LLMs to infer speaker names from a transcript. It works pretty decently already, but it's still a bit expensive. I have this feature available under the 'enhanced' model if you want to check it out: https://docs.spectropic.ai/models/transcribe/enhanced


Hi! Any plans to support streaming transcription with diarization?


Streaming is definitely on the to-do list! It's quite complex to stream both transcription and diarization, but we will get there eventually.


I often subtitle old, obscure, foreign language movies with Whisper. Or random clips found on foreign Telegram/Twitter channels. Paired up with some GPT for translation it works great!

You can do this locally if you have enough (V)RAM, but I prefer the OpenAI API, as I usually don't have enough at hand. And the various Llamas aren't really on par with GPT-4 in quality. If you only need Whisper, and no translation, then local execution is indeed very viable: high-quality Whisper fits in 4GB of (V)RAM.
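If you want to try local execution, a minimal sketch with the openai-whisper package (the file name is a placeholder):

    # pip install openai-whisper; the model is downloaded on first run.
    import whisper

    model = whisper.load_model("medium")        # ~5 GB; "small" fits in ~2 GB
    result = model.transcribe("interview.mp3")  # language is auto-detected
    print(result["text"])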


The problem with using OpenAI Whisper is that it's too slow on CPU-only machines. whisper.cpp is blazing fast compared to Whisper, and I wish people would build better diarization on top of it.


Another advantage of whisper.cpp is that it can use cuBLAS to accelerate models too large for your GPU memory; I can run the medium and large models with cuBLAS on my 1050, but only small if I use pure GPU mode.


What's OpenAI Whisper vs whisper.cpp? Do you mean whisper-diarization uses the API?


https://github.com/openai/whisper

vs

https://github.com/ggerganov/whisper.cpp

They are two inference engines for running the Whisper ASR model, each with its own API, AFAIK.


Ah, I see, thanks. Hm, I would imagine it's not hard to make something that works with both (the API surface area should be fairly small); odd that projects use the former and not the latter.
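A thin adapter over both seems doable; a sketch (the whisper.cpp example binary and its -m/-f/-otxt flags vary by version, so the paths here are assumptions):

    import subprocess

    def transcribe_openai(path: str, model_name: str = "base") -> str:
        import whisper  # pip install openai-whisper
        return whisper.load_model(model_name).transcribe(path)["text"]

    def transcribe_cpp(path: str, model_path: str = "models/ggml-base.bin") -> str:
        # whisper.cpp's example binary writes <path>.txt when passed -otxt
        subprocess.run(["./main", "-m", model_path, "-f", path, "-otxt"], check=True)
        with open(path + ".txt") as f:
            return f.read()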


I had better success with whisperX, as whisper-diarization sometimes has weird issues I couldn't resolve: https://github.com/m-bain/whisperX


iirc whisper-diarization uses whisperx under the hood.

I'll be honest, I haven't dived much into this, as I just needed something transcribed quickly, but when I was looking at WhisperX I couldn't find a CLI that would, out of the box, give me a text file with a line per speaker statement (not per word).


I use it like this:

whisperx $file --compute_type int8 --min_speakers 3 --max_speakers 3 --language de --hf_token $token --diarize
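To get a line per uninterrupted speaker turn out of that, one option is a small post-processing step over the JSON that --diarize produces; a sketch (whisperx's exact JSON layout may differ by version, and "interview.json" is a placeholder):

    import json

    with open("interview.json") as f:
        segments = json.load(f)["segments"]

    turns = []  # each entry: [speaker, accumulated text]
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        if turns and turns[-1][0] == speaker:
            turns[-1][1] += " " + seg["text"].strip()
        else:
            turns.append([speaker, seg["text"].strip()])

    for speaker, text in turns:
        print(f"{speaker}: {text}")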


> iirc whisper-diarization uses whisperx under the hood.

It seems like it does:

https://github.com/MahmoudAshraf97/whisper-diarization/blob/...


Fascinating how traditionally very complex and hard ML problems are slowly becoming commodities with AI:

- transcription

- machine translation

- OCR

- image recognition


Does it hallucinate when there's dead air?


Yes


Maybe it isn't perfectly clear, but OTranscribe isn't an automatic speech-to-text tool; it's a UI for assisting with manual transcription.

So no AI here, folks.


Yep, it's designed to assist with manual transcription


Are there any open-source or paid apps/shareware/freeware that can:

- Transcribe word-by-word in real time as audio is recorded

- Work entirely locally

- Use relatively recent open-source local models?

I've been using otter.ai for real-time meeting transcriptions - letting me multitask and instantly catch up if I'm asked a question by skimming the most recent few seconds' worth of the transcript - but it's far from perfect, and occasionally their real-time service has significant transcription delays, not to mention it requires internet connectivity.

Most of the Whisper-based apps out there, though, as well as (when I last checked) the whisper.cpp demo code, require an entire recording to be ingested at once. There are others that rely on e.g. Apple's dictation frameworks, which is a bit dated in capability at the moment.
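A naive workaround for the ingest-everything-at-once limitation is to feed fixed-size chunks to a local model; a rough sketch (faster-whisper, sounddevice, and the 5-second chunk length are assumptions, not a real streaming solution):

    import sounddevice as sd
    from faster_whisper import WhisperModel

    model = WhisperModel("base.en", device="cpu", compute_type="int8")
    SAMPLE_RATE, CHUNK_SECONDS = 16000, 5

    while True:
        # Record one chunk, then transcribe it. Audio arriving while
        # transcribe() runs is dropped, and words spanning a chunk
        # boundary can get mangled - hence "naive".
        audio = sd.rec(SAMPLE_RATE * CHUNK_SECONDS, samplerate=SAMPLE_RATE,
                       channels=1, dtype="float32")
        sd.wait()
        segments, _ = model.transcribe(audio.flatten(), beam_size=1)
        for seg in segments:
            print(seg.text, end="", flush=True)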

Anything folks are using out there?


I have built my own local-first solution to transcribe entirely locally in real time word by word, driven by a different need (I'm hard of hearing). It's my daily driver for transcribing meetings, interviews, etc. Because of its local-first capability, I do not have to worry about privacy concerns when transcribing meetings at work as all data stays on my machine. It's about as fast as Otter.ai although there's definitely room for improvements in terms of UX and speed. Caveat is that it only works on MacBooks with Apple silicon. Happy to chat over email (see my HN profile).


I have some staff with combined hearing and visual needs. Have you researched the one-, two-, or all-party consent requirements? Asking because I hope to identify transcription as "non-recording".


California has an exception for hearing aids and other similar devices, but it’s unclear if transcription aids count, or if this has been tested in court. https://codes.findlaw.com/ca/penal-code/pen-sect-632/ (Not a lawyer, this is not legal advice.)


If it were ephemeral, would that change things? Say, recording the meeting locally in a 5-minute window, then updating a meeting summary?


Do you mean ephemeral, or are you actually wondering about something implanted under the skin? I'd think/hope if it goes under the skin, it ends up in "hearing aid" territory. I'm less sure about if it doesn't persist.


It's more likely in cochlear-implant territory (there are different laws and regulations for implants vs. aids depending on locale).


Yup, typo, sorry


Two/all-party consent laws are hacky workarounds for the actual harm being inflicted. The valid goals include not having your microwave inform Google's ad servers, and not recording out-of-context jokes as evidence to imprison people. The invalid goals caught up in the collateral damage include topics like the current one about hearing issues. (Note that a sufficiently accurate transcription service has all the same privacy problems two-party consent tries to protect against, maybe more, since it's more easily searchable.)

I'd be in favor of some startup pulling an Uber or AirBnB and blatantly violating those laws to the benefit of the deaf or elderly if it meant we could get something better on the books.


What did your own research turn up?


I was so excited until the very end. I have the wrong hardware.


I've been using Transcribro[0] on Android/GrapheneOS. It's FOSS and only local, and while it's not word-for-word real-time, it doesn't have to wait for the whole audio to be uploaded before it can work. This is on a Pixel 5a, so hardly impressive hardware.

It works well enough that I use it with Telegram to shove messages over to my Linux machine when I don't feel like typing them out, which is such an unsophisticated hack, but is getting the job done. I spent a couple hours trying to find a Linux-native alternative, or even get this running in Waydroid, and couldn't find anything that worked as well, so I decided not to let the "smooth" become the enemy of the "good enough to get the job done."

[0] https://github.com/soupslurpr/Transcribro


> Are there any open-source or paid apps/shareware/freeware

Google Pixel phones have this feature and it works _very_ well.


Have you tried it with non-English languages?

New Microsoft Surfaces have this feature, but it only works for English.


How is that feature accessed? Or what does Google call it, so I can search for it?


Live Transcribe in the accessibility settings. AFAIK it's available on any fairly recent Android phone. I bought a Pixel tablet for no other reason but to run it -- nothing else I've tried comes close for local-only continuous transcribe-as-they-speak. (iOS has a similar feature also under accessibility; it's good but not at the same level. Of course I'd love to see an open-source solution.)

This was for English. One problem took me a while to realize: when I switched it to transcribe a secondary language, it was no longer doing it on-device. You can tell the difference by turning on airplane mode.


There's a captioning button under the volume slider, and I think it's called "live captions" or something in settings. Just tap the button and it'll start.

https://support.google.com/accessibility/android/answer/9350...


It's the "recorder" app. It was released in 2021 I think and has been improving since the beginning.

If you elect to, the audio + transcription can be accessed/searched via https://recorder.google.com


I helped code oTranscribe+ [0], which does something similar to what you are asking for. It's a desktop application built with ElectronJS on top of the then-current version of oTranscribe. It also exists as a web version and PWA [1].

The language models at the time were those from the BSC (Barcelona Supercomputing Center). The transcription is done via WASM, using Vosk [2] as the base.

I hope it fits.

[0] https://github.com/projecte-aina/oTranscribe-plus

[1] https://otranscribe.bsc.es/

[2] https://github.com/alphacep/vosk-api
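For anyone wanting to try the same recognizer outside the browser, a minimal Vosk sketch (the model directory and 16 kHz mono WAV input are assumptions):

    import json
    import wave
    from vosk import KaldiRecognizer, Model

    wf = wave.open("audio.wav", "rb")
    rec = KaldiRecognizer(Model("model"), wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if not data:
            break
        if rec.AcceptWaveform(data):  # True at utterance boundaries
            print(json.loads(rec.Result())["text"])
    print(json.loads(rec.FinalResult())["text"])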


Is there a way to get it to punctuate? Or does it only jot down words?


It does not punctuate. It just transcribes what is in an audio stream.


Punctuating is enough of the hard work of generating a transcript that it's not very useful to me without it, unfortunately. What use cases are you planning for, where the words themselves are the desired output?


It is just an improvement over oTranscribe. It makes transcription easier, since you only have to polish the result, revising and punctuating it.


Yes! WhisperKit's TestFlight app does all three on Apple Silicon: https://www.takeargmax.com/blog/whisperkit

I wish they had Speaker Diarization, but they are waiting for upstream Whisper to add it: https://github.com/argmaxinc/WhisperKit/issues/31


futo.org has a FOSS voice input Android app (voiceinput.futo.org) and live captions (https://github.com/abb128/LiveCaptions) for Linux. They specifically developed their own model that does fast real-time transcription.

Not sure if that helps for your specific use case.


I've been using this for the past two or three weeks and I have been very impressed with it. I ended up giving them five dollars because it's now my primary keyboard!


Kinda surprised it doesn't have AI integration.

You still need to proof and QA even AI results if you want a publication-quality transcript: attribute who is speaking when (Whisper, at least, can't do that), correct "unusual" last names, and so on. So I feel like people using AI still need good correcting/finishing/proofing tools, similar to the tools for unassisted transcription.


This was written a really long time ago by a former WSJ Graphics reporter (Elliot Bentley) who is now at Datawrapper.

It is now operated by MuckRock and hasn't seen changes in a while.

That's why it doesn't have any of these integrations: the technology just didn't exist when it was written.


Aha, good to know! That's actually important context, that this is not a recent release, and doesn't necessarily have a lot of ongoing development.


From their FAQ:

    Does oTranscribe automatically convert audio into text?
    
    Sorry! It doesn’t. oTranscribe makes the manual task of transcribing
    audio a lot less painful. But you still have to do the transcription.


I currently use Aiko’s free iOS app which does offline transcription using OpenAI’s Whisper model. It has been working pretty well for me so far. It can export in formats like SRT, TXT, CSV, JSON and text with timestamps too. https://sindresorhus.com/aiko


You're always welcome to try my service TurboScribe https://turboscribe.ai/ if you need a transcript of an audio/video file. It's 100% free up to 3 files per day (30 minutes per file) and the paid plan is unlimited and transcribes files up to 10 hours long each. It also supports speaker recognition, common export formats (TXT, DOCX, PDF, SRT, CSV), as well as some AI tools for working with your transcript.


Thanks! I've had good results with Turboscribe (paid plan) and appreciate having this as a service. I typically use it for 2-3 hour long video recordings with a number of speakers and appreciate the editing tools to clean things up before export.


This looks great. Do you have an API, or plan to release one?


Thanks! Nothing to announce on the API front right now, but appreciate you asking :)


I was curious how good a transcription I could get from what may be the best multimodal LLM currently, Gemini-1.5-Pro-Experiment-0801, so I had it transcribe five minutes of an interview between Ezra Klein and Nancy Pelosi from earlier today. The results are here:

https://www.gally.net/temp/20240809geminitranscription/index...

Aside from some minor punctuation and capitalization issues, Gemini’s transcription looks nearly perfect to me. There were only one or two words that I think it misheard. If I had transcribed the audio myself, I would have made more mistakes than that.

One passage struck me in particular:

  And then he comes up with "weird," which becomes viral and the rest, and here he is.

How did Gemini know to put "weird" in quotation marks, to indicate, correctly, that the speaker was referring to Walz's use of the word as a word? According to Politico, Walz first used the word in that context in the media on July 23.

https://www.politico.com/news/2024/07/26/trump-vance-weird-0...


Maybe two factors helped achieve the impressive result with the quotation marks:

- auditory cues

- the sentence would be grammatically incorrect and make no sense without them

Just guessing out of the blue.

But I think it's likely that LLMs (and other speech recognition systems) need to exploit sentence context to recognize individual words and punctuation, and this is an example where it went well.

Human listening is similar in a way: we can recognize words even when they are mumbled or spoken quickly, if we have context.

So we always hear phrases rather than words.


It's very likely that the model is capable of picking up on the verbal cues surrounding quotes.

Do you have the audio or video file?

I'd like to run it through our AI video editor and see how it punctuates the transcript.


The mp3 file that I gave to Gemini (a five-minute excerpt from the audio podcast) is linked in the source code of the page. Here is the full URL:

https://www.gally.net/temp/20240809geminitranscription/inter...

The full interview including video is on the New York Times website, though you might need a subscription to view it:

https://www.nytimes.com/2024/08/09/opinion/ezra-klein-podcas...

The NYT’s closed captions do not put “weird” in quotation marks; they also divide sentences weirdly and have other mistakes. But they get some things better than Gemini, such as capitalizing “House” when it means the House of Representatives.

I haven’t compared the audio-only podcast version and the video version carefully; it’s possible that parts of the audio were edited or re-recorded for one or the other.

Let us know how your AI video editor does!


OK I ran it.

We have a grammar correction step which "fixed" the sentence to "...he comes up with something weird, which becomes viral...".


Just pitching in with a transcription tool that lets you transcribe video and audio files using Whisper and WASM in your browser, and get a .txt, .srt, or .vtt file. Maybe in the future, support for Whisper Turbo?

https://video2srt.ccextractor.org/

Disclaimer: Working on this project.


I use this a lot. It's nice and simple and has exactly the tools you need (playback speed control, easy pause/play) and nothing more. I greatly prefer it over automatic transcription tools that give you 40 pages of 'umm's and 'ahhhh's to filter through and edit.


Can you not give the transcript to an LLM to remove the umms and ahhs?
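A minimal sketch of that idea with the OpenAI client (the model choice and prompt are illustrative, not a recommendation):

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def clean_transcript(raw: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "Remove filler words (um, uh, ah) and false starts "
                            "from this transcript. Keep all other wording intact."},
                {"role": "user", "content": raw},
            ],
        )
        return response.choices[0].message.content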


People not used to AI have blind spots that prevent them from seeing evident use cases like this.

I'm always surprised at the amazed look of my friends when they see me concretely use the tool. They just didn't picture it until they saw it in action.


It's not even people not used to AI: I developed a tool that uses AI to do something, then couldn't be bothered to fix some of the output manually. It only occurred to me days later that I could ask the AI to fix it.


I started making an open source macOS app to do this with whisper and potentially pyannote.

It is functional but a bit slow. I think using whisper.cpp directly instead of the Swift bindings will help a lot.

Really interested in adding diarisation, but I'm having a lot of trouble converting pyannote to CoreML; pyannote runs very slowly with torch on CPU. Haven't gotten around to putting my latest work on that on GitHub yet.

Happy to accept contributions!

Some priorities right now:

* Fixing signing for local builds

* Replace swift whisper with whisper cpp

* Allowing users to provide their own models

https://github.com/Stack-Studio-Digital-Collective/Auditif
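For reference, the plain PyTorch pyannote pipeline being ported looks roughly like this sketch (assuming pyannote.audio 3.x and a Hugging Face token with model access):

    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token="hf_...")

    diarization = pipeline("interview.wav")
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s-{turn.end:.1f}s {speaker}")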


Yeah, definitely switch over to ggerganov's Whisper implementation. I use it in a little home-brewed Python app on my M1 for handling speech transcripts. The base EN model chews through minutes of audio in seconds; it's insanely fast.


I'm working on a tool that includes AI. My original target is to test it on my https://www.youtube.com/c/VectorPodcast by offering something like what Lex Fridman does for his episodes.

Current features:

1. Download from YT

2. Transcribe using Vosk (output includes time codes)

3. Speaker diarization using pyannote (this isn't perfect and needs a bit more ironing out)

What needs to be done:

4. Store the transcription in a search engine (can include vectors)

5. Implement a webapp

If anyone here is interested in joining forces, let me know.


Fantastic tool; I used it a lot to transcribe interviews during plane travel, where there was no internet and I needed to fill the time. Really useful to have if you do a lot of interviews.


From the homepage:

> A free web app to take the pain out of transcribing recorded interviews

How did you use a web app on the plane with no internet?


The web app saves an offline copy for use the first time you open it. https://otranscribe.com/help/#can_i_use_otranscribe_offline


Ran the server on his or her laptop...

You don't need the internet to use a web browser


it works offline if you preload the website :)


It's MIT licensed, so presumably self-hosted.


Any new language support in the future? Fingers crossed for Japanese.


https://tactiq.io is made for meetings, but also does uploaded transcripts and supports Japanese!


Am I missing something? From what I checked, it supports every language, as you are the one transcribing by hand. This is just a UI to watch the video or listen to the audio while you're typing.


I made a similar tool for making tables of contents for youtube videos: https://youtoc.by/

I'm not developing it actively; I created tables of contents for the several videos I needed years ago. If I ever need it again, I will probably work on the mobile UI (aka making it responsive).


If you are looking for something automatic that also allows you to interact with your transcripts chatgpt style then I would recommend https://www.videototextai.com/


That cookie box, though... A dark pattern (accept lots + accept all, fake drag affordance, covering a quarter of the page) for cookies doesn't bode well for privacy protections around the transcripts.


You can delete any transcription you make, and once you do, we do not keep any copy of it :) The cookie banner is there to comply with EU law.


You can also try Scribe (free chrome extension and iOS app) https://www.appblit.com/scribe


Talio.ai lets you do this, with ChatGPT-style chat over the transcript plus numerous other features: https://talio.ai


Does anyone know of one with transcription and translation in real time?

Nowadays I use libretranslate/libretranslate and pluja/whishper for this, but not in real time.


Ah, this brings back memories. When I was in college with limited money, I used to pirate movies, and most of them didn't have subtitles. I used to daydream of writing a VLC plugin that would generate subtitles in real time. But I had better things to do, like play video games...


Many of us have had those ambitious tech ideas...



Looks cool! Unclear from the docs, but does it support non-English languages? How about mixed-language interviews?


Yes! Any language you understand is supported!


Has anybody tested it with Brazilian Portuguese? It's a hard problem, since we have so many accents.


I don't understand what the issue is. You don't know how to type the different diacritical marks? Or the textbox isn't accepting them? (Which seems like it would be a browser issue, not an issue with the site.)


I think the commenter meant accents as in regional dialects.


What would that have to do with anything? Isn't that a problem for the person doing the transcribing?


Pretty amazing what a web app can do. I wish there were more like them, and not all these native apps.


Does anyone know a free tool for generating subtitles for movies and TV series?


SubtitleEdit is one of the most complete, and it has many online tutorials from users.

Make sure they are recent tutorials, because those will probably cover the automated generation tools/plugins that weren't available years ago.

https://github.com/SubtitleEdit/subtitleedit


https://github.com/McCloudS/subgen worked very well for me. I had a TV series where, somehow, the last few seasons' timestamps did not match up with any subtitle files I could find online. I used subgen and it worked surprisingly well.


You can try https://www.transcripo.com/ for free


oTranscribe is a free option for transcription, but in many cases it's just too simple.


See also TranscriberAG: <https://transag.sourceforge.net/>


If you just want quick transcriptions of YouTube videos, this works pretty well: https://www.you-tldr.com/



