yt-dlp and a batch file that runs via Task Scheduler have been doing this for me for a couple of years now. I also grab the captions and throw them into a database so that I can search transcripts for a clip I remember but can't recall which video it's in. It was a fun weekend project.
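
For anyone curious what that looks like, here's a minimal sketch of the same idea using yt-dlp's Python API plus SQLite FTS5 for the transcript search. The channel URL, file names, and table layout are made up for illustration; the poster's actual batch file surely differs.

    # Sketch: download new videos plus auto-generated captions, then
    # index the caption text in SQLite FTS5 so transcripts are searchable.
    import sqlite3
    from yt_dlp import YoutubeDL

    opts = {
        "download_archive": "archive.txt",  # skip videos already fetched
        "writeautomaticsub": True,          # save YouTube's auto captions
        "subtitleslangs": ["en"],
        "outtmpl": "%(channel)s/%(title)s [%(id)s].%(ext)s",
    }
    with YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/@SomeChannel"])  # hypothetical channel

    # One FTS5 table turns "which video was that clip in?" into a query.
    con = sqlite3.connect("captions.db")
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS captions "
                "USING fts5(video_id, start_time, text)")
    # ...parse the downloaded .vtt files into rows, then search:
    hits = con.execute("SELECT video_id, start_time FROM captions "
                       "WHERE captions MATCH ?", ("half-remembered clip",)).fetchall()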


Long ago I set my podcast downloader to keep every file it downloads, and recently I've been using OpenAI's Whisper to go through and create transcripts of the 8000 or so hours of audio I've accumulated over the years.

It's very cool to be able to search through and remind myself of something I heard once. Not exactly life changing, but still, nice to be able to quickly drill down and find audio for something when a curiosity strikes me.
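
In case it helps anyone replicate this: the core loop is short with the openai-whisper package. A minimal sketch, assuming a folder of mp3s; the directory layout and model choice are assumptions, not the poster's setup.

    # Sketch: batch-transcribe an archive with openai-whisper, skipping
    # episodes that already have a transcript so the job is resumable.
    from pathlib import Path
    import whisper

    model = whisper.load_model("medium")  # bigger models: slower, more accurate
    for episode in Path("podcasts").rglob("*.mp3"):
        out = episode.with_suffix(".txt")
        if out.exists():
            continue
        result = model.transcribe(str(episode))
        out.write_text(result["text"])

result["segments"] also carries per-chunk start/end timestamps, which is what makes "find the audio for this sentence" possible later.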


What kind of hardware do you have that makes it feasible to process thousands of hours of podcasts? I want to do the same (for the Linux Unplugged podcast specifically), but I've heard that Whisper requires some serious GPU might for decent accuracy.


Yep, it takes a bit of GPU RAM. I'm using 3 machines with NVidia 3080 or better. I let them go for a few weeks over the winter break when I was mostly disconnected from the tech world. The workers prioritized podcasts I'm personally likely to want to search, and got through almost a third of my archive.

Now it's down to 1 or 2 machines depending on what's going on, so it'll take much longer to finish up, but I'm in no rush.
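
For sizing purposes: the openai-whisper README lists rough VRAM needs per model (small ~2 GB, medium ~5 GB, large ~10 GB), so a 3080's 10-12 GB fits the large model. Here's a sketch of picking a model by available VRAM; the thresholds are ballpark figures, not exact requirements.

    import torch
    import whisper

    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        name = "large" if vram_gb >= 10 else "medium" if vram_gb >= 5 else "small"
    else:
        name = "base"  # CPU works too, just far slower than realtime
    model = whisper.load_model(name)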


8000 hours? Napkin math time, that's 20 years of 10+ hours daily.

I call BS.


It's about an hour or two a day.

This includes data from 1995 on. The early data is backfill of radio shows that transitioned to podcasts and dumped old episodes into their feeds at some point. My reader itself started in 2012, and since then I've downloaded around 7000 hours of new podcasts, which works out to 1.7 hours per day. So, call it around 2 hours per day, since I don't listen every day. And to be fair, I haven't listened to every podcast I've downloaded; some don't interest me. But 1-2 hours of listening a day is the sweet spot for me.


My math says 365 days x 10 hours/day = 3650 hours. 8000 hours is just over 2 years, not 20.


You need to resize your napkins.


You might be interested in these YouTube archive scripts: https://github.com/TheFrenchGhosty/TheFrenchGhostys-Ultimate...


Neat. Thanks for sharing


I run mine with cron, and it puts files in a special folder for Plex: https://github.com/nburns/utilities/blob/master/youtube.fish

It pulls from my Watch Later playlist, which is quite handy.
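
For reference, the same pull can be done from yt-dlp's Python API. Watch Later is a private playlist, so yt-dlp needs your logged-in browser cookies to see it; the Plex path and browser here are assumptions, not what the linked fish script uses.

    from yt_dlp import YoutubeDL

    opts = {
        "cookiesfrombrowser": ("firefox",),  # Watch Later requires auth
        "download_archive": "archive.txt",   # don't re-download on each cron run
        "outtmpl": "/plex/youtube/%(title)s [%(id)s].%(ext)s",  # hypothetical Plex path
    }
    with YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/playlist?list=WL"])

Run it from cron (or Task Scheduler, as upthread) and Plex picks the files up on its next library scan.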


It looks like this depends on a "./add-video.py" script that isn't in the repository.


How do you deal with file numbering?

I prefer the filename prefixed with a number that indicates air date, with 01 being the first uploaded video. The default is by playlist index, where 01 is the top of the channel or playlist, i.e. the most recent video.


I just use the publish date in YYYY-MM-DD format at the beginning of the filename so that they sort properly.
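
yt-dlp can emit that prefix directly through its output template, which supports strftime formatting of date fields, so no rename step is needed. A sketch; the channel URL is a placeholder:

    from yt_dlp import YoutubeDL

    opts = {"outtmpl": "%(upload_date>%Y-%m-%d)s %(title)s [%(id)s].%(ext)s"}
    with YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/@SomeChannel"])

The CLI equivalent is the -o flag with the same template string.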



