yt-dlp and a batch file run by Task Scheduler have been doing this for me for a couple of years now. I also grab the captions and throw them into a database so that I can search transcripts for a clip I remember when I can't remember which video it's in. It was a fun weekend project.
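For anyone curious what the caption database part could look like, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are all made up for illustration; this is not the actual schema from the project described above, just one plausible shape for it.

```python
import sqlite3


def build_index(conn, captions):
    """Store (video_id, start_seconds, text) caption rows in a plain table."""
    # Hypothetical schema: one row per caption line.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS captions ("
        "video_id TEXT, start_seconds REAL, text TEXT)"
    )
    conn.executemany("INSERT INTO captions VALUES (?, ?, ?)", captions)
    conn.commit()


def search(conn, phrase):
    """Return (video_id, start_seconds) for caption lines containing phrase."""
    rows = conn.execute(
        "SELECT video_id, start_seconds FROM captions "
        "WHERE text LIKE ? ORDER BY video_id, start_seconds",
        (f"%{phrase}%",),
    )
    return rows.fetchall()


# Made-up sample data standing in for parsed caption files.
conn = sqlite3.connect(":memory:")
build_index(conn, [
    ("vid_a", 12.0, "welcome back to the channel"),
    ("vid_a", 95.5, "this solder joint is cold"),
    ("vid_b", 40.0, "a cold solder joint looks dull"),
])
hits = search(conn, "solder joint")
print(hits)
```

A plain LIKE query is fine for a small archive. For a larger one, SQLite's FTS5 full-text extension is probably the better fit, assuming your SQLite build includes it.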
Long ago I had my podcast downloader keep all files it downloads and recently I've been using OpenAI's Whisper to go through and create transcripts of the 8000 or so hours of data I have downloaded over the years.
It's very cool to be able to search through and remind myself of something I heard once. Not exactly life changing, but still, nice to be able to quickly drill down and find audio for something when a curiosity strikes me.
What kind of hardware do you have that makes it feasible to process thousands of hours of podcasts? I want to do the same but I’ve heard that Whisper requires some serious GPU might for decent accuracy (Linux Unplugged podcast specifically).
Yep, it takes a bit of GPU RAM. I'm using 3 machines with NVIDIA 3080 or better. I let them go for a few weeks over the winter break when I was mostly disconnected from the tech world. The workers prioritized podcasts I'm personally likely to want to search, and got through almost a third of my archive.
Now it's down to 1 or 2 machines depending on what's going on, so it'll take much longer to finish up, but I'm in no rush.
This includes data from 1995 on. The early data is backfill from radio shows that transitioned to podcasts and dumped old episodes into their feeds at some point. My reader itself started in 2012, and since then I've downloaded around 7000 hours of new podcasts, which works out to 1.7 hours per day. Call it around 2 hours per listening day, since I don't listen every day. To be fair, I haven't listened to every podcast I've downloaded; some don't interest me. But 1-2 hours of listening a day is the sweet spot for me.
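The per-day figure is easy to sanity-check. A quick sketch, assuming roughly 11 years between the 2012 start and now (the span is my assumption, picked because it matches the 1.7 figure; the exact dates aren't stated):

```python
DAYS_PER_YEAR = 365.25  # average, including leap years


def hours_per_day(total_hours, years):
    """Average hours of audio downloaded per calendar day."""
    return total_hours / (years * DAYS_PER_YEAR)


# 7000 hours is from the comment above; 11 years is an assumed span.
average = hours_per_day(7000, 11)
print(round(average, 1))
```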
I prefer files prefixed with a number that reflects "air date" order, with 01 being the first uploaded video. The default numbering is by playlist index, so the top of the channel or playlist, which is the most recent video, gets number 01.
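The conversion from playlist index to an air-date number is just a reversal. A small sketch of the idea, assuming you know the total number of videos in the playlist; the function name and the two-digit minimum padding are my own choices for illustration, not anything yt-dlp provides:

```python
def air_date_prefix(playlist_index, playlist_count):
    """Map a newest-first playlist index to an oldest-first file prefix.

    playlist_index is 1 for the most recent video; the returned prefix
    is 01 for the first uploaded video.
    """
    if not 1 <= playlist_index <= playlist_count:
        raise ValueError("playlist_index must be between 1 and playlist_count")
    air_number = playlist_count - playlist_index + 1
    # Pad to at least two digits, or wider for playlists of 100+ videos.
    width = max(2, len(str(playlist_count)))
    return f"{air_number:0{width}d}"


print(air_date_prefix(3, 5))      # third from the top of a 5-video playlist
print(air_date_prefix(120, 120))  # oldest video in a 120-video playlist
```

One nice side effect of numbering this way is that existing numbers don't shift when new videos are uploaded, although the padding width changes once the count passes 99. It's worth checking the yt-dlp README for its playlist ordering options before rolling your own.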