yt-dlp and a batch file that runs via Task Scheduler have been doing this for me ...

seligman99 · on Jan 5, 2023

Long ago I had my podcast downloader keep all files it downloads and recently I've been using OpenAI's Whisper to go through and create transcripts of the 8000 or so hours of data I have downloaded over the years.

It's very cool to be able to search through and remind myself of something I heard once. Not exactly life changing, but still, nice to be able to quickly drill down and find audio for something when a curiosity strikes me.

ebb_earl_co · on Jan 6, 2023

What kind of hardware do you have that makes it feasible to process thousands of hours of podcasts? I want to do the same but I’ve heard that Whisper requires some serious GPU might for decent accuracy (Linux Unplugged podcast specifically).

seligman99 · on Jan 6, 2023

Yep, it takes a bit of GPU RAM. I'm using 3 machines, each with an NVIDIA 3080 or better. I let them go for a few weeks over the winter break when I was mostly disconnected from the tech world. The workers prioritized podcasts I'm personally likely to want to search, and got through almost a third of my archive.

Now it's down to 1 or 2 machines depending on what's going on, so it'll take much longer to finish up, but I'm in no rush.

7h3b8duvwi · on Jan 6, 2023

8000 hours? Napkin math time, that's 20 years of 10+ hours daily.

I call BS.

seligman99 · on Jan 6, 2023

It's about an hour or two a day.

This includes data from 1995 on. The early data is backfill of radio shows that transitioned to podcasts and dumped old episodes in their feed at some point. My reader itself started in 2012, and since then I've downloaded around 7000 hours of new podcasts, which works out to 1.7 hours per day. Call it around 2 hours per day, since I don't listen every day. To be fair, I haven't listened to every podcast I've downloaded; some don't interest me. But 1-2 hours of listening a day is the sweet spot for me.

defined · on Jan 6, 2023

My math says 365 days x 10 hours/day = 3650 hours. 8000 hours is just over 2 years, not 20.
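For anyone who wants to check the napkin math, here is a small sketch of the arithmetic in this subthread. The 8000 and 7000 hour totals and the 10 hours/day rate come from the comments above; the 11-year span (2012 to early 2023) is my own assumption, and the function names are just illustrative.

```python
DAYS_PER_YEAR = 365


def years_needed(total_hours: float, rate_hours_per_day: float) -> float:
    """Years it takes to get through total_hours at a given daily rate."""
    return total_hours / (rate_hours_per_day * DAYS_PER_YEAR)


def daily_hours(total_hours: float, years: float) -> float:
    """Average hours per day if total_hours is spread over the given years."""
    return total_hours / (years * DAYS_PER_YEAR)


# 8000 hours at 10 hours/day: a bit over 2 years, not 20.
print(round(years_needed(8000, 10), 2))  # 2.19

# 20 years at 10 hours/day would be far more than 8000 hours.
print(20 * DAYS_PER_YEAR * 10)  # 73000

# 7000 hours over an assumed 11 years (2012 to early 2023).
print(round(daily_hours(7000, 11), 2))  # 1.74
```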

michaelcampbell · on Jan 7, 2023

You need to resize your napkins.

suzumer · on Jan 6, 2023

You might be interested in these YouTube archive scripts: https://github.com/TheFrenchGhosty/TheFrenchGhostys-Ultimate...

Havoc · on Jan 6, 2023

Neat. Thanks for sharing

NegativeLatency · on Jan 5, 2023

I run mine with cron and it puts files in a special folder for plex: https://github.com/nburns/utilities/blob/master/youtube.fish

Pulls from my watch later playlist which is quite handy

jamessb · on Jan 6, 2023

It looks like this depends on a "./add-video.py" script that isn't in the repository.

2OEH8eoCRo0 · on Jan 5, 2023

How do you deal with file numbering?

I prefer each file to be prefixed with a number that indicates "air date", with 01 being the first uploaded video. The default is by index, so the top of the channel or playlist, which is the most recent video, gets number 01.
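To make the numbering I'm after concrete, here is a rough sketch. It assumes you already have the titles in playlist order (newest first); the function name and the sample titles are hypothetical, and this is plain Python, not anything yt-dlp does for you.

```python
def number_by_air_date(titles_newest_first: list[str]) -> list[str]:
    """Prefix each title so that the oldest upload gets 01.

    The input is assumed to be in playlist order, newest first, so the
    last item in the list is the first video ever uploaded.
    """
    total = len(titles_newest_first)
    width = max(2, len(str(total)))  # at least two digits, more if needed
    return [
        f"{total - position:0{width}d} - {title}"
        for position, title in enumerate(titles_newest_first)
    ]


for name in number_by_air_date(["Newest", "Middle", "Oldest"]):
    print(name)
```

The catch with this scheme is that the numbers only stay stable if old videos are never removed from the channel or playlist.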

prometheus76 · on Jan 5, 2023

I just use the publish date in the format of YYYY-MM-DD at the beginning of the filename so that they sort properly.
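Roughly what that looks like, as a sketch rather than my actual setup: it assumes the publish date arrives as a compact YYYYMMDD string (which is how yt-dlp reports its upload_date field), and the helper name and sample titles are made up for illustration.

```python
from datetime import datetime


def date_prefixed_name(upload_date: str, title: str, ext: str) -> str:
    """Build a filename that starts with the publish date as YYYY-MM-DD.

    upload_date is assumed to be a YYYYMMDD string; strptime raises
    ValueError if it is not a valid date in that format.
    """
    day = datetime.strptime(upload_date, "%Y%m%d").date()
    return f"{day.isoformat()} {title}.{ext}"


names = [
    date_prefixed_name("20230105", "Second video", "mp4"),
    date_prefixed_name("20221231", "First video", "mp4"),
]

# A plain alphabetical sort is also a chronological sort.
print(sorted(names))
```

Because the year comes first and every field is zero-padded, new uploads never force you to renumber older files.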