Hacker News
Show HN: I made a free transcription service powered by Whisper AI (freesubtitles.ai)
224 points by mayeaux on Nov 18, 2022 | 130 comments

This is a cool project. I’ve been very happy with whisper as an alternative to otter; it works better and solves real problems for me.

I feel compelled to point out whisper.cpp. It may be cheaper for the author but is relevant for others.

I was running whisper on a gtx 1070 to get decent performance; it was terribly slow on M1 Mac. Whisper.cpp has comparable performance to the 1070 while running on M1 CPU. It is easy to build and run and well documented.


I hope this doesn’t come off the wrong way, I love this project and I’m glad to see the technology democratized. Easily accessible high-quality transcription will be a game changer for many people and organizations.

Thanks for sharing! I was looking for an M1 solution weeks ago and couldn't find a working one. Will try that one now! Looking around for servers with GPUs etc. stopped me from playing around with it, as I got overwhelmed by the options.

How long would whisper.cpp take to transcribe 2 hours of audio on M1?

Not sure about M1, but on the Macbook Pro 14" with an M1 Max using 8 threads I transcribed a 44 minute podcast in 16 minutes. So about 3x "real time" speed.

What model are you using? I guess large, as my M1 Max takes about 1.4 min for a 4 min file (35% of total time)?

Yep, large model.
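The two data points above give a rough answer to the 2-hour question. This is a minimal sketch that assumes processing time scales linearly with audio length, which nobody in the thread measured; the function names are made up for illustration.

```python
def realtime_factor(processing_minutes: float, audio_minutes: float) -> float:
    """Fraction of the audio's duration spent transcribing (lower is faster)."""
    return processing_minutes / audio_minutes


def estimate_minutes(audio_minutes: float, factor: float) -> float:
    """Projected processing time, assuming the factor holds for longer files."""
    return audio_minutes * factor


# Figures quoted above: a 44 min podcast in 16 min, and a 4 min file in 1.4 min.
podcast_factor = realtime_factor(16, 44)   # ~0.36
short_factor = realtime_factor(1.4, 4)     # ~0.35

# Projection for the 2-hour (120 min) recording asked about above.
for factor in (podcast_factor, short_factor):
    print(f"{estimate_minutes(120, factor):.0f} min")
```

Both projections land in the low-to-mid 40-minute range for the large model on an M1 Max, if the linear assumption holds.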

It broke when I tried to feed it an entire podcast file, but still, I took this as a push to try out Whisper AI for myself, turns out it's easier to use than I thought. Long story short, I used it to transcriptify a podcast:


Not sure if there's a use for this that's not me, but I like the idea of having subtitles for a podcast I'm listening to.

In that case you could have a look at the Snipd podcast app. They have Whisper built in :)

What tool did you use for the player-text presentation?

Mostly just vanilla JS on that page, and a tiny bit of Python glue code to turn the WebVTT output from Whisper into a data format for the JS.
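For anyone curious what that kind of glue code might look like, here is a minimal sketch. It is not the commenter's code: the output shape (a JSON list of `start`/`end`/`text` objects) and the sample text are assumptions, and it only handles plain cues without styling or cue settings.

```python
import json
import re

# Matches "mm:ss.mmm --> mm:ss.mmm", with an optional leading "hh:" on each side.
CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _seconds(hours, minutes, seconds, millis):
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def vtt_to_cues(vtt_text):
    """Turn WebVTT text into a list of {"start", "end", "text"} dicts."""
    cues = []
    # Cue blocks are separated by blank lines; the "WEBVTT" header has no timing line.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.search(line)
            if match:
                parts = match.groups()
                text = " ".join(l.strip() for l in lines[i + 1:])
                cues.append({
                    "start": _seconds(*parts[:4]),
                    "end": _seconds(*parts[4:]),
                    "text": text,
                })
                break
    return cues


SAMPLE_VTT = """WEBVTT

00:00.000 --> 00:04.500
Hello and welcome.

00:04.500 --> 01:02.250
This is the second cue.
"""

cues = vtt_to_cues(SAMPLE_VTT)
print(json.dumps(cues, indent=2))
```

The JSON can then be embedded in the page or fetched by the player script, which highlights the cue whose `start`/`end` range contains the current playback time.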

It’s stunning how it just looks like a human written copy. The punctuation is close to perfect and I couldn’t find any transcription mistakes.

If somebody wants to run Whisper for dirt cheap, vast.ai is the way to go.

I can usually get an (unverified) 1x RTX 3090 instance for about $0.10/hr, and that processes audio at something like 1.5X speeds. Unverified instances do crash once in a while, but as long as you back up the output every few hours, it's fine, you just set up a new one in case something happens. I wouldn't use this for confidential company meetings, but it's good enough for podcasts, Youtube Videos and other public or semi-public stuff.
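As a rough worked example of what those numbers imply, assuming "1.5X speeds" means 1.5 hours of audio processed per hour of GPU rental (my reading, not confirmed above):

```python
def cost_per_audio_hour(price_per_hour: float, speed_multiple: float) -> float:
    """Dollars to transcribe one hour of audio when the instance processes
    `speed_multiple` hours of audio per rented hour."""
    return price_per_hour / speed_multiple


# $0.10/hr instance at roughly 1.5x real time, as quoted above.
vast_cost = cost_per_audio_hour(0.10, 1.5)
print(f"${vast_cost:.3f} per hour of audio")
```

That works out to well under ten cents per hour of audio, before counting time lost to crashed instances and setup.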

I’m sorry to self promote again - but: https://whispermemos.com

I’m in love with the idea of pressing a button on my Lock Screen and getting a perfect transcription in my inbox.

Also, just added emoji summarization in email subject, a small visual reminder of what your memo was about.

I hope this is useful to someone!

That's a cool idea. How about integrations like github or notion which would write out a markdown file?

is the app open source? I'm on android, so :c

If you don't mind non-OSS, the Google Recorder App is incredible.

I've been testing whisper on AWS - the g4dn machines are the sweet spot of price/performance. It's extremely good, and there will be rapid consolidation in the transcription market as a result of it existing (its one major missing feature is the ability to supply custom dictionaries). The fact that it does a credible job at translation to english is a cherry on top.

Anyway, I'd love to get it running well on g5g, but they seem extremely temperamental. If anybody has, please let me know your secret. :)

I recently tried Whisper to transcribe our local Seattle Fire Department radio scanner -- unfortunately it was not reliable enough for my use case, e.g. "adult male hit by car" gets transcribed as "don't mail it by car".

I imagine future models will allow the user to input some context to disambiguate. Like if I could give it the audio along with the context "Seattle Fire Department and EMS radio traffic", it would bias towards the type of things you'd likely hear on such a channel.

Have you tried the --initial_prompt CLI arg? For my use, I put a bunch of industry jargon and names that are commonly misspelled in there and that fixes 1/3 to 1/2 of the errors.
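A small sketch of how that could be scripted, building the `whisper` command line with a jargon list joined into the prompt. The file name and the helper are hypothetical; the terms are just ones mentioned in this thread, and how you phrase the prompt is a matter of experimentation.

```python
import shlex


def build_whisper_command(audio_path, jargon, model="large", language="en"):
    """Return an argv list for the whisper CLI with a jargon-seeded prompt."""
    prompt = ", ".join(jargon)
    return [
        "whisper", audio_path,
        "--model", model,
        "--language", language,
        "--initial_prompt", prompt,
    ]


# "scanner.wav" is a placeholder path for illustration only.
cmd = build_whisper_command(
    "scanner.wav",
    ["MED6", "AID response", "Seattle Fire Department"],
)
print(shlex.join(cmd))
```

With the CLI installed, the list can be passed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell-quoting problems when the prompt contains spaces or commas.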

I was initially going to use Azure Cognitive Services and train it on a small amount of test data. After Whisper was released for free, I switched to Whisper plus OpenAI GPT-3 fine-tuned to fix the transcription errors by 1) taking a sample of Whisper transcripts, 2) fixing the errors, and 3) fine-tuning GPT-3 using the unfixed transcripts as the prompt and the corrected transcripts as the result text.

Whisper with the --initial_prompt containing industry jargon plus training GPT-3 to fix the transcription errors should be nearly as accurate as using a custom-trained model in Azure Cognitive Services but at 5-10% of the cost. Biggest downside is the amount of labor to set that up, and the snail's pace of Whisper transcriptions.
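The three-step recipe above boils down to preparing prompt/completion pairs. A minimal sketch of that data-preparation step, assuming the JSONL layout with `prompt` and `completion` fields used by GPT-3 fine-tuning at the time; the helper name is made up, the example pair is borrowed from the radio-scanner comment earlier in the thread, and separators or stop sequences are left out.

```python
import json


def to_finetune_jsonl(pairs):
    """Serialize (raw_transcript, corrected_transcript) pairs, one JSON object per line."""
    lines = [
        json.dumps({"prompt": raw, "completion": corrected})
        for raw, corrected in pairs
    ]
    return "\n".join(lines) + "\n"


pairs = [
    ("don't mail it by car", "adult male hit by car"),
]
jsonl = to_finetune_jsonl(pairs)
print(jsonl, end="")
```

The labor mentioned above is in producing the corrected half of each pair; the file format itself is trivial.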

Thanks for the tip, that did improve accuracy a lot.

There have been a lot of hacks to speed up whisper inference

Sweet! Do you have any links to resources on how to speed it up? I couldn't find any while searching Google or the Whisper discussion forums.

Not a hack per se but a complete reimplementation.


This is a C/C++ version of Whisper which uses the CPU. It's astoundingly fast. Maybe it won't work in your use case, but you should try!

The issue here is for most radio systems you end up with about 3 kHz of effective audio bandwidth (sampling). Most ASR/STT models are trained on at least 16 kHz audio.

Did you try a telephony oriented model like aspire or similar? They’re trained on sort-of 8 kHz audio and might work better.

I tried something similar for my SDR feeds and gave up because it’s just too challenging and niche - the sampling, the jargon, the 10 codes, the background noise, static on analog systems/drop outs on digital systems, rate of speech, etc. all contribute to very challenging issues for an ML model.

> the sampling, the jargon, the 10 codes, the background noise, static on analog systems/drop outs on digital systems, rate of speech, etc

Is the reduced bandwidth really the most significant problem? Naively I'd think everything else you mentioned would matter a lot more, I'm curious how much you experimented with that specifically.

When it all comes together it's kind of a nightmare for an ASR model. There were plenty of times in reviewing the recordings and ASR output where I'd listen to the audio and have no idea what they said.

I'm not sure which contributes most but I know from my prior experiences with ASR for telephony even clean speech on pristine connections does much worse with models trained on 16 kHz being fed native 8 kHz audio that gets resampled.

I've done some early work with Whisper in the telephony domain (transcribing voicemails on Asterisk and Freeswitch) and the accuracy already seems to be quite a bit worse.

Could one train an interpolation layer (eg take a bunch of 16k audio, down sample to 8k, train 8k->16k upsampler)? Or better yet (but more expensive), take whisper, freeze it, and train the upsampler on whisper’s loss.

Sure, that's called audio super resolution, there's a few papers/projects doing that. Haven't really seen models which are robust and have good generalization though.

"I understand some of those words."

Hah, in all seriousness I'm more of a practitioner in this space. If this was something I absolutely needed to get done who knows where it would have gone. For a little side hacking project once I encountered these issues I moved on - back in the day expectations were lower for telephony and the 8 kHz aspire models and kaldi were adequate to get that "real work" done.

We built a working version of this using Assembly.ai

I am no longer involved in the project but you're welcome to contact the CTO if you're curious how it worked:


I (used to) use simonsaysai.com to generate subtitles and they had the functionality to input specialized vocabulary, so I suppose it's possible in some sense but I don't know how it would work with Whisper, something to ask on their Github if nobody else has yet I suppose.

But, for me, the English model works really well. Using the 'large' model works about perfectly for me, I can't think of anything I thought the large model got too badly wrong, is that the model you tried?

Yes, the problem is that the radio chatter is just very, very low quality, for a lot of words your brain just needs to know the context to fill in the gaps due to radio static and such. Even as a human some parts are unintelligible.

Yeah, it's a hard case. Whisper with the large model is at the cutting edge, so if the static is bad and the quality is low there's not much you can do but wait for better AI or fix whatever it gets wrong by hand, and you might have to wait for a bit lol

Was there a big difference in accuracy depending on which model you used?

Yes, large was by far the best, but still not accurate enough that I'd be willing to put it into a fully automated pipeline. It would have gotten it right probably 75% of the time. Anything other than the large model was far too bad to even think about using.

What was the performance, resource usage, etc of doing this with large? What's the speed like?

I'm still getting spun up on this but base delivers a pretty impressive 5-20x realtime on my RTX 3090. I haven't gotten around to trying the larger models and with only 24GB of VRAM I'm not sure what kind of success I'll have anyway...

In my case the goal was to actually generate tweets based on XYZ. As I've already said there were serious technical challenges so I abandoned the project but I was also a little concerned about the privacy, safety, etc issues of realtime or near-realtime reporting on public safety activity. I also streamed to broadcastify and it really seems like they insert artificial delay because of these concerns.

You can run the larger models just fine on a 3090. Large takes about 10G for transcribing English.

For a 1:17 file it takes:

6s for base.en, I think 2s to load the model based on the sound of my power supply.

33s for large, I think 11s of which is loading the model.

Varies a lot with how dense the audio file is, this was me giving a talk so not the fastest and quite clean audio.

While I saw near perfect or perfect performance on many things with smaller models, the largest really are better. I'll upload a gist in a bit with Rap God passed through base.en and large.

edit -

Timings (explicitly marked as language en and task transcribe):

base.en => 23s

large => 2m50

Audio length 6m10

Results (nsfw, it's Rap God by Eminem): https://gist.github.com/IanCal/c3f9bcf91a79c43223ec59a56569c...

Base model does well, given that it's a rap. Large model just does incredibly, imo. Audio is very clear, but it does have music too.

> based on the sound of my power supply

Hah, I love that - "benchmark by fan speed".

Good to know - I've tried large and it works but in my case I'm using whisper-asr-webservice[0] which loads the configured model for each of the workers on startup. I have some prior experience with Gunicorn and other WSGI implementations so there's some playing around and benchmarking to be done on the configured number of workers as the GPU utilization of Whisper is a little spiky and whisper-asr-webservice does file format conversion on CPU via ffmpeg. Default was two workers, is now one but I've found as many as four with base can really improve overall utilization, response time, and scale (which certainly won't be possible with large).

OPs node+express implementation shells out to Whisper which gives more control (like runtime specification of model) but almost certainly has to end up slower and less efficient in the long run as the model is obviously loaded from scratch on each invocation. I'm front-ending whisper-asr-webservice with traefik so I could certainly do something like having two separate instances (one for base, another for large) at different URL paths but like I said I need to do some playing around with it. The other issue is if this is being made available to the public I doubt I'd be comfortable without front-ending the entire thing with Cloudflare (or similar) and Cloudflare (and others) have things like 100s timeouts for final HTTP response (Websockets could get around this).

Thanks for providing the Slim Shady examples, as a life-long hip hop enthusiast I'm not offended by the content in the slightest.

[0] - https://github.com/ahmetoner/whisper-asr-webservice
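The trade-off described above (reloading the model on every invocation versus keeping it resident in a worker) can be sketched with a small cache. This is an illustration only: `load_model` here is a stand-in that counts calls rather than loading anything, and the cache size of 2 is an arbitrary choice for a base-plus-large setup.

```python
from functools import lru_cache

LOAD_COUNT = {"n": 0}


def load_model(name):
    """Placeholder for an expensive model load; only counts how often it runs."""
    LOAD_COUNT["n"] += 1
    return {"name": name}


@lru_cache(maxsize=2)
def get_model(name):
    """Load each model at most once per process and reuse it afterwards."""
    return load_model(name)


# Three requests for base and one for large trigger only two loads.
for _ in range(3):
    get_model("base")
get_model("large")
print(LOAD_COUNT["n"])  # 2
```

In a real worker the placeholder would be replaced by the actual load call (for example `whisper.load_model(name)`), which is where the multi-second load times quoted earlier would otherwise be paid on every request.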

Whisper does pretty well, even with background music and things like that, I think you're working with a pretty weird subsection of recorded audio that won't work, for that edge case to work you'll very likely need to train your own model.

why did you want to transcribe it? What would you do with the output?

I wanted to make a twitter bot that posted whenever a pedestrian/cyclist got hit by a car.

I'd need to:

- Wait for a MED6 or AID response code to come across the live event stream

- Listen to the radio chatter to see if it was a pedestrian getting hit (use GPT3 on the transcription to determine if the text was about a ped/cyclist getting hit)

- Maybe also correlate to SPD logging a 'mvc with injuries' at the same location
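The steps above can be sketched as a simple filter chain. Everything concrete here is an assumption: the event and log dictionaries are invented shapes, the location string is a placeholder, and a crude keyword check stands in for the GPT-3 classification step.

```python
TRIGGER_CODES = {"MED6", "AID"}
VICTIM_TERMS = ("pedestrian", "cyclist")


def is_candidate_event(event):
    """Step 1: only MED6 or AID responses are worth listening to."""
    return event.get("code") in TRIGGER_CODES


def mentions_ped_or_cyclist(transcript):
    """Step 2: keyword stand-in for asking a language model about the transcript."""
    text = transcript.lower()
    return any(term in text for term in VICTIM_TERMS)


def corroborated(event, police_log):
    """Step 3 (optional): a matching 'mvc with injuries' entry at the same location."""
    return any(
        entry["type"] == "mvc with injuries" and entry["location"] == event["location"]
        for entry in police_log
    )


def should_post(event, transcript):
    return is_candidate_event(event) and mentions_ped_or_cyclist(transcript)


event = {"code": "MED6", "location": "Example St & 1st Ave"}
log = [{"type": "mvc with injuries", "location": "Example St & 1st Ave"}]
print(should_post(event, "adult male pedestrian hit by car"), corroborated(event, log))
```

The weak link, as discussed in this sub-thread, is step 2: it is only as good as the transcript it is given.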

This is cool. We have integrated Whisper with our human in the loop tech at Scribie [1] and the results have been great. We offer free credits if you want to try it out.

[1] https://scribie.com

There’s a lot of startups starting in the space offering transcription.

Read.ai - https://www.read.ai/transcription

Provides transcription & diarization and the bot integrates into your calendar. It joins all your meetings for zoom, teams, meet, webex, tracks talk time, gives recommendations, etc.

It’s amazing how quickly this space is moving. Particularly, with the increase in remote work. Soon you’ll be able to search all your meetings and find exactly when a particular topic was discussed! It’s exciting.

Yeah, I was paying $100/month for a transcription service, which I chose because it was by far the cheapest ($100/30h) compared with most of the other paid services at $10 an hour, which to me was a bit much. It turns out Whisper with the large model is much more accurate, and I didn't like the paid service's UI, so I much prefer to just use this app. Whisper is really a game changer; I don't know how those companies stay in business.

Recently I’ve been using my iPad as a transcription “keyboard” for my laptop when writing documents in Dropbox paper. Open the document on both computers, then use dictate to enter all text, and then correct if necessary with the laptop. I’m expecting to move to iPad only when I can use stage manager with an external monitor.

Dictation is also great when writing in a foreign language: I speak German ok-ish, writing is harder. Dictation helps writing more correct German.

Yeah I apologize, the queue is a little messed up. It's not showing your progress properly and it's not stopping people's processing if they left (their websocket dies). I'm going to fix these and reboot the server and the experience will be a lot better, sorry. It is working properly though and transcribing all this stuff but the queue needs some TLC, brb.

This site hung basically every time I tried to use it. Only one out of nine attempts I made resulted in the page even decrementing the queue count. Curious if you could just generate a unique url to go back and see the results of a job waiting in the queue for a while?

Should this have the “Show HN” tag?

Also, labels’ [for] attributes are all "file" instead of "language" and "model", so all labels trigger the file selection dialog on click :-)

UPD: https://github.com/mayeaux/generate-subtitles/pull/1

Merged, thanks. Do I just add 'Show HN:' to the beginning of the title?

Done, thanks! I'm at the top, feels good! I got zero traction on Reddit so glad to see Hacker News was there for me lol

Hmmmm, I wonder how well this would run on a local AI chip like Coral or Gyrfalcon. Model sounds like it may be too big but it would be nice to have something like Google Nest devices that _don't_ suck hard.

It should be all local until it needs information from the internet...

I've been running whisper in the terminal with Python and I've found whisper surprisingly accurate with the transcription even from Chinese.

Just given your site a try, nicely done. One feedback - would be great to have a progress indicator on the processing page, I have no idea what stage it's at or how much longer I need to wait.

Yeah Whisper is top of the line, they posted their performance compared to the industry standard and it's right in there at the top.

It should show the data via processing.. it's setup to just take whatever stdout/stderr comes back from Whisper and send it directly to the frontend via websockets, I'm surprised you got stuck there :thinking:

Shouldn't the language and model inputs be dropdowns instead of text input?

I'm going to hope/assume you're doing some sort of sanitisation on those inputs.

Additionally, wouldn't you lose the language detection that's done for no language input? (IIRC, it uses the first 30 seconds to detect language if you don't specify one)

Yeah someone submitted a PR for those to be fixed, I'm just wary about restarting the server because I haven't setup a way to be able to reboot without losing the websockets

Well those inputs should all error unless they are a valid value.

Yes if nothing is input it will automatically detect the language based on the first 30s of input
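Since the service shells out to Whisper, checking both fields against an allowlist before they reach the command line is the kind of sanitisation being asked about. A minimal sketch, not the project's code: the model names are the ones Whisper published at the time, while the language set is a deliberately tiny illustrative subset.

```python
WHISPER_MODELS = {
    "tiny", "tiny.en", "base", "base.en",
    "small", "small.en", "medium", "medium.en", "large",
}
# Illustrative subset only; Whisper supports many more language codes.
EXAMPLE_LANGUAGES = {"en", "de", "fr", "ru", "es", "zh"}


def validate_options(model, language, languages=EXAMPLE_LANGUAGES):
    """Return (model, language_or_None); raise ValueError on anything unexpected.

    An empty language is passed through as None so that Whisper's own
    language detection can run.
    """
    if model not in WHISPER_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    language = language.strip().lower()
    if not language:
        return model, None
    if language not in languages:
        raise ValueError(f"unknown language: {language!r}")
    return model, language


print(validate_options("large", ""))
print(validate_options("medium", "DE"))
```

Dropdowns on the frontend help honest users, but the server-side check is what protects the shell command.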

> I'm just wary about restarting the server because I haven't setup a way to be able to reboot without losing the websockets

Wait what. Not being able to safely restart the server sounds like a disaster waiting to happen.

This was just a personal project a couple hours ago so it's not set up properly to do safe reboots and a lot of other things. I was just using it locally and now it's in the wild; it will take some time to get everything refined and professional.

Isn’t this very expensive to host? Are you aware this could cost A LOT?

in another comment they state:

> I'm just running this off of a 2x RTX A6000 server on Vast.ai at the moment, about $1.30/h

whether that's a lot is a matter of perspective

> losing the websockets

users would lose the session and have to start over, not the end of the world

I'm not even using sessions just localstorage and websockets lol.

I wasn't even aware that it was on GitHub, which I suppose is an issue in itself.

I just rebooted the server, now if the websocket disconnects (person closes the browser) it will kill the processing and move the queue, so that should help unclog the queue. I'm going to add a couple more queue touchups and then it should be stable again (no reboots), but running well in the meantime
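The queue behaviour described here (a fixed number of jobs running, the rest waiting, and a disconnect freeing the slot) can be modelled in a few lines. This is a sketch of the bookkeeping only, with invented names; it does not kill any process or talk to a websocket, and the limit of two active jobs is taken from a later comment in this thread.

```python
from collections import deque


class TranscriptionQueue:
    def __init__(self, max_active=2):
        self.max_active = max_active
        self.active = []          # clients currently being transcribed
        self.waiting = deque()    # clients waiting, in arrival order

    def submit(self, client_id):
        if len(self.active) < self.max_active:
            self.active.append(client_id)
        else:
            self.waiting.append(client_id)

    def position(self, client_id):
        """0 means processing, 1 means next in line, None means not queued."""
        if client_id in self.active:
            return 0
        if client_id in self.waiting:
            return self.waiting.index(client_id) + 1
        return None

    def disconnect(self, client_id):
        """Drop a client whose connection closed and promote the next waiter."""
        if client_id in self.active:
            self.active.remove(client_id)
        elif client_id in self.waiting:
            self.waiting.remove(client_id)
        while self.waiting and len(self.active) < self.max_active:
            self.active.append(self.waiting.popleft())


queue = TranscriptionQueue()
for client in ("a", "b", "c", "d"):
    queue.submit(client)
queue.disconnect("a")
print(queue.position("c"), queue.position("d"))  # 0 1
```

Computing the position from the queue itself on every update, rather than decrementing a counter on the client, is one way to avoid ever showing a negative position.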

I didn't manage to transcribe anything (it just doesn't remember that I submitted anything), but whatever, I didn't need to anyway. I just wanted to ask: how good is Whisper with non-english? At least "major" ones, like, German, French, Russian, Spanish?

I've got people to test and usually with the 'medium' and 'large' models it works really efficiently. Honestly, I just use the large model for everything, because might as well have the best quality if you're going to do the effort.

I've been using it on lots of Russian stuff and it's great, even translates for you.

I’ve tried it on French and results were pretty good

Spanish is great.

Exciting to see.

Curious if there was a benefit to using whisper over something like vosk which can transcribe on mobile device pretty decently.

Whisper has other interesting functionality but for straight transcription it seems a bit heavy. Still learning about it and putting it through its paces.

We did comparison of recent Vosk and Whisper models here:


In general, Whisper is more accurate but much more resource heavy. Vosk runs on single core while Whisper needs all CPU cores.

Accuracy difference for clean speech between Vosk-small and Whisper tiny is 2-3% absolute, 20% relative. Not sure how important it is; I would claim it is not that critical.

Numbers there are for original Whisper. Whisper.cpp recommended here is actually 10% worse than stock Whisper for speed considerations. Not that simple.

Vosk is a streaming design; you get results with a minimum latency of 200ms. Whisper requires you to wait for a significant amount of time. If you refactor Whisper for lower latency you will lose a lot of the accuracy advantage. Latency is very important for interactive applications like assistants.

Whisper is multilingual and has punctuation; that is clearly a good advantage. It can also use context properly, which improves results for long recordings.

So on mobile Vosk is still a viable option, as are many other mobile-focused engines.

For server-based transcription Whisper is certainly better, but not much better than Nvidia NeMo, for example. Not that much publicity for the latter though.

Rebooted the app due to queue upgrades; if you had a pending upload please reupload it, thanks!

What resources do you have for hosting this?

I setup a whisper-asr-api backend this week with gobs of CPU and RAM and an RTX 3090. I’d be interested in making the API endpoint available to you and working on the overall architecture to spread the load, improve scalability, etc.

Let me know!

I'm just running this off of a 2x RTX A6000 server on Vast.ai at the moment, about $1.30/h and then using nginx on another server to reverse proxy it to Vast

Open an issue on the Github repo and we can collab for sure!: https://github.com/mayeaux/generate-subtitles/issues

Cool - will do!

Through a series of events I'm in the beneficial position of my hosting costs (real datacenter, gig port, etc) being zero and the hardware has long since paid for itself. I'm almost just looking for ways to make it more productive at this point.

Hey, I know the feeling, I felt bad when I had my GPU just sitting there and it's just a little Vast server lol. If you want to use your hardware to run this software I'd be more than happy to help get it setup!

For what it's worth, my approach has been running a tweaked whisper-asr-webservice[0] behind traefik behind Cloudflare. Traefik enables end-to-end SSL (with Cloudflare MITM, I know) and also helps put the brakes on a little so even legitimate traffic that makes it through Cloudflare gets handled optimally and gracefully. I could easily deploy your express + node code instead (and probably will anyway because I just like that approach more than python).

Anyway, I'll be making an issue soon!

[0] - https://github.com/ahmetoner/whisper-asr-webservice

Right on, looking forward to it! Yeah I saw that module and was planning to use it but I just wrote up an Express/Node implementation first and never really looked back. But looking forward to collabing I will await your issues, cheers!

Good luck with your AWS bill Haha. But seriously. How?

I am just paying for a somewhat expensive server and I love how it's really fast but also I have a lot of free GPU time so might as well let others use it too lol. It's an experiment to see if people will use it productively or if someone will abuse it and ruin it for others lol

> someone will abuse it and ruin it for others

Maybe I will put in some mechanism to prevent that but for now I just want to see if people could find it useful. I also have the code open source and will write tutorials for people to put up their own instance as well

It would be more useful if one could directly paste links to videos online as well. But yeah, in general this is extremely useful. I'm looking forward to video site integration. Would be great if youtube could finally retire their horrible auto caption function for something that actually works. Being able to easily watch media in different languages from around the world will be an absolute game changer.

I also plan to support automatic language translation; I have that working locally already. I work for one of the big alt-video platforms, and rumour has it that I will be shipping this feature for them soon (auto transcription with auto-translated subtitles).

Really cool, that would be a killer feature. Definitely post here when this gets released!

Yeah it's all ready to go using LibreTranslate, they have about 25 languages, maybe I'll finish that this weekend and put it up, it's really inexpensive to make the translations compared to making the original transcription so may as well. Coming soon!

Also, I have that tested (auto download) with YouTube-Dl, it works fine but haven't put it into the frontend, but may as well, it helps a lot on your own instance so you don't have to download it first and then upload it

I'd love to read a tutorial on how to do this for myself.

Would love to locally host this, do you have a source?

No docs or anything yet but: https://github.com/mayeaux/generate-subtitles

thank you!

Not even abuse, but just intensive use cases. Like the guy who posted a few days ago about recording and transcribing all day.

I set up the server to only transcribe two files at a time, so yeah, someone could abuse it for sure with two big uploads and stick everyone else in the queue. But for me, even a 3 hour video transcribes with the large model in about 30 minutes, so it wouldn't be too bad. Hopefully everyone is conscientious enough not to do that; so far nobody has abused it, which is cool.

Me again - why two at a time? In my initial testing with whisper-asr-webservice and my RTX 3090 I could pretty easily throw ~10 different files at it simultaneously as there is some natural staggering between API entry, CPU conversion/resampling/transcoding of audio, the actual audio length, network effects like upload speed, etc.

I also implemented some anti-abuse-ish features between traefik and Cloudflare that should help it stand up better in the face of bad actors abusing it.

Certainly not something to necessarily depend on but I thought I'd mention it.

> I am just paying for a somewhat expensive server and I love how it's really fast but also I have a lot of free GPU time so might as well let others use it too lol.

They are donating some spare capacity.

I appreciate your initiative!

Thanks! Whisper is a lot of fun but it didn't take long before I wanted to build a frontend. And then I built something that I think came out super nice so why not share it with people. I used to pay $100/month for transcriptions and this works a lot better for me so might as well open-source transcription if I can, but I give all the credit to Whisper that module they put out is amazing

Very generous of you. I made a similar free service 3 years ago using much worse tools and it's so cool to see whisper making it all so much better and more efficient. Thanks for releasing for free

No problem! I am just seeing how it runs, I might throw up a referral link to Vast and put up a tutorial on how to host your own service, maybe that can offset the cost a bit? The current server is $700/month, maybe it could just run off donations who knows

$700/month? Where, DigitalOcean? I am new to Python and ML, curious to know why.

Okay that should be my last reboot for a while, I've got it ready how I'd like it for now, feel free to give any feedback!

I've made some updates and the queue works how I wanted it, I rebooted the server and I think I will leave it like this for a while

What is the maximum length of audio allowed? What are your costs in running this? Are the hardware requirements substantial?

Right now the limit is 100MB because of Cloudflare, with no length limit. It costs $1.30/h to run this; that's enough for a 2x RTX A6000 on Vast.ai, and you can check out the specs there.

Congratulations on taking this to completion and announcing here! Love your approach to this!

Thanks! I wrote it for myself over a weekend and have really enjoyed it ever since. I'm glad others were able to get something out of it! It seems to run pretty well but I have some improvements planned; the first is that I will take the Whisper output and feed it to you while you're in the queue, so you can see the jobs ahead of you progressing. It will be pretty trivial to implement, but I am feeling bored in the queue at the site atm so that is the next killer feature lol

Do you just hold the page open after upload and wait for it to update?

Yeah, there is a websocket connection and when the transcription is done it will update the frontend with the links to .srt, .vtt and .txt file downloads

Thanks. Does the queue position update through the websocket too?

Yeah it does, I think there's a bug with it for saying what your position is, but when the others are done it will start correctly even if the frontend shows like position -2 or something. There's 2 uploads in the queue atm so not bad

Thanks a lot for this. I've wanted to test whisper's usefulness for vintage movie subtitling projects, but haven't had such a straightforward, preconfigured opportunity. I promise I'll beat the subs into some sort of shape as long as the timings are at least vaguely alright, and not waste your money.

Hey, glad I could be of use. The problem with Whisper is that it needs a lot of GPU. Actually my Mac can't even use my GPU so right away I had to get it up on a server, but Whisper is so powerful and it's so amazing that it's open source I am surprised nobody did this yet. I could see them charging for it but may as well use it anyways, the other services are insanely expensive ($10/h?!) and I don't really like their UIs to boot lol

Nothing has come back from the two that I tried (one medium in French, the other large in Spanish), meaning no change on the page since I uploaded them an hour and a half ago. I loaded the page again in another tab, though, and after a few seconds "finishedProcessing" appeared under the form. I suspect that means something.

On Firefox 102.4.0esr, also uBlock Origin.

It's probably due to me rebooting to load new code, I will have a way to send a signal to the frontend to inform them but not implemented atm

I tried it again this morning. I'm getting all of the output properly this time, but it has hung partway through every time I tried.

It's me again. Ran it again, ran perfectly. Thanks for all of your work.

edit: don't know if you'll see this any time soon, but I've had it fail/hang again. You might want to take a hash of uploads, so if the lost connections still end up getting transcribed, if they're reuploaded they won't get transcribed again.
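The upload-hash idea could look roughly like this. It is a sketch with invented names, an in-memory dictionary instead of real storage, and a fake transcriber; the model and language are included in the key on the assumption that the same file may legitimately be transcribed with different settings.

```python
import hashlib


class TranscriptCache:
    """Remember finished transcripts by content hash so reuploads are free."""

    def __init__(self):
        self._results = {}

    @staticmethod
    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def get_or_transcribe(self, data, model, language, transcribe):
        key = (self.digest(data), model, language)
        if key not in self._results:
            self._results[key] = transcribe(data)
        return self._results[key]


calls = []


def fake_transcribe(data):
    """Stand-in for the real (slow) transcription step; records each call."""
    calls.append(len(data))
    return f"transcript of {len(data)} bytes"


cache = TranscriptCache()
audio = b"example audio bytes"
first = cache.get_or_transcribe(audio, "large", "fr", fake_transcribe)
second = cache.get_or_transcribe(audio, "large", "fr", fake_transcribe)
print(first == second, len(calls))  # True 1
```

Hashing on the server after upload still costs the upload itself; hashing in the browser first would let the client skip that too, at the price of more frontend code.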

Also I haven't had success in Firefox, only Chromium.

Where is your server hosted?

2x RTX A6000 server on Vast.ai with another server with nginx as a reverse proxy

Free startup idea: Use Whisper with pyannote-audio[0]’s speaker diarization. Upload a recording, get back a multi-speaker annotated transcription.

Make a JSON API and I’ll be your first customer.

[0] https://github.com/pyannote/pyannote-audio
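The glue between the two tools is mostly interval matching. A minimal sketch, assuming transcript segments shaped like Whisper's (dicts with `start`, `end`, `text`) and diarization output already flattened to `(start, end, speaker)` tuples; that tuple shape and the sample data are my assumptions, not pyannote's API. Each segment gets the speaker whose turn overlaps it the most.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(transcript_segments, speaker_turns):
    """Attach the best-overlapping speaker label (or None) to each segment."""
    labeled = []
    for seg in transcript_segments:
        best_speaker, best_overlap = None, 0.0
        for start, end, speaker in speaker_turns:
            shared = overlap(seg["start"], seg["end"], start, end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


segments = [
    {"start": 0.0, "end": 4.0, "text": "Hi, thanks for joining."},
    {"start": 4.0, "end": 9.0, "text": "Glad to be here."},
]
turns = [(0.0, 4.5, "SPEAKER_00"), (4.5, 10.0, "SPEAKER_01")]

for seg in label_segments(segments, turns):
    print(seg["speaker"], seg["text"])
```

This picks one speaker per segment, so it does nothing useful when people talk over each other within a single segment.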

I think there's been talk to do speaker diarization with whisper-asr-webservice[0] which is also written in python and should be able to make use of goodies such as pyannote-audio, py-webrtcvad, etc.

Whisper is great but at the point we get to kludging various things together it might start to make more sense to use something like Nvidia NeMo[1] which was built with all of this in mind and more.

[0] - https://github.com/ahmetoner/whisper-asr-webservice

[1] - https://github.com/NVIDIA/NeMo

It's not as if people aren't trying to do that: https://github.com/openai/whisper/discussions/264

I tried out this notebook about a month ago, and it was rough. After spending an evening improving it, I got everything "working", but pyannote was not reliable. I tried it against an hour-ish audio sample, and I found no way to tune pyannote to keep track of ~10 speakers over the course of that audio. It would identify some of the earlier speakers, but then it felt like it lost attention and would just start labeling every new speaker as the same speaker. There is an option to force the minimum number of speakers higher, and that just caused it to split some of the earlier speakers into multiple labels. It did nothing to address the latter half of the audio.

So, sure, someone should continue working on putting the pieces together, and I'm sure the notebook in the discussion I linked has probably improved since then, but I think pyannote itself needs some improvement first.

Sadly, I think using separate models for transcription and diarization ends up being clunky to the point that it won't ever be polished, no matter how good pyannote might get. If you have a podcast-like environment where people get excited and start talking over each other, then even if pyannote correctly identifies all of the speakers during the overlapping segments and when they spoke... Whisper cannot be used to separate speakers. You end up with either duplicate transcripts attributed to everyone involved, or something worse. Impressively, I have seen pyannote do exactly that, when it's working.

At the end of the day, I think someone is going to need to either train Whisper to also perform diarization, or we're going to need to wait until someone else open sources a model that does both transcription and diarization simultaneously. Unfortunately, it seems like most of these really big advances in ML only happen when a corporate benefactor is willing to dump money into the problem and then release the result, so we might be waiting a while. I'm trying to learn more about machine learning, but I'm not at the point where I have any realistic chance of making such an improvement to Whisper. Maybe someone else around here can prove me wrong by just making it happen.

Speaker recognition is another piece that isn't usually as high a priority as recognizing the speech.

It's a new thing to me, I hadn't really considered it. Do they have that for movies and stuff? I can't think of a clear case when I've seen it

It sounds pretty good, this is my first time hearing about it but it looks good. Even if it does detect that they are separate entities talking, how does it label it in a way that's helpful/useful for you? I guess it comes out as 'Speaker 1', 'Speaker 2', etc in the end? And you can find/replace the speakers with the actual people?

And you expect the API to be free? If not why not use one of a million other such services?

Well the code is open source, I don't know what I plan to do with this, depends on how people like it, but for the meantime you can use it to transcribe stuff for free which is a victory unto itself

It's the idea that's free, not the API.

This is an exciting one! We are building an open-source low-code alternative to Retool and I think we can build an integration with your project! Take a look at ours and see if you want to collaborate. Here you go: https://github.com/illacloud/illa-builder

Please stop spamming your link everywhere. You know people can see your post history?
