I use this script, because automatically generated subtitles are poorly formatted for use as a transcript (they're only good as subtitles). It works pretty well for archiving videos along with both the transcript and the subtitles.
```
#!/bin/zsh
# download as mp4, get normal subtitles
yt-dlp -f mp4 "$@" --write-auto-sub --sub-format best --write-sub
# download subtitles and convert them to transcript
yt-dlp --skip-download --write-subs --write-auto-subs --sub-lang en -k --sub-format ttml --convert-subs srt --exec before_dl:"sed -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' -i '' %(requested_subtitles.:.filepath)#q" "$@"
```
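To use it, I save it as e.g. `yt-archive.sh` (the name is arbitrary), make it executable, and pass one or more video URLs (VIDEO_ID is a placeholder):
```
chmod +x yt-archive.sh
./yt-archive.sh "https://www.youtube.com/watch?v=VIDEO_ID"
```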
Checking online, this [1] appears to be one of the most heavily referenced solutions on Stack Overflow for downloading both user-entered and automatically generated transcripts (it's Python-based).
Notably, Google really needs an obvious API endpoint for this kind of call. If thousands of programmers are all rolling their own implementations, there's probably a huge number that constantly download the full video and transcribe it themselves as part of data harvesting.
Honestly, I'm kind of surprised it's taken this long for YouTube to fall prey to massive data-harvesting campaigns. Per this article [2] and this paper on YouTube data statistics [3], there are ~14,000,000,000 videos on YouTube with a mean length of 615 seconds (~10 minutes).
My understanding is that YouTube actively _undermines_ the ability of tools like youtube-dl to download videos. I do see the irony that providing an API endpoint (just for transcripts) might save them on egress costs.
But I think they are probably culturally opposed to publicly exposing this sort of thing, even if it only worked via an authenticated account. It's also worth considering that doing so would make it easier for a competitor to capture the value they provide with their generated closed captions.
The only argument I'm making is that if 1,000,000 developers all want to train LLMs on video data because they desperately need to beat Sora, ChatGPT, or Stable Diffusion, then there are probably a lot of them rolling their own scraping software.
Probably rolling it with inefficient methods, and then likely pseudo-DDoSing (read: mostly just irritating) Google with constant scrape attempts.
I could fight forever against petabytes of constant downloads, or simply offer an incredibly small, condensed, easy-to-download summary that minimizes my bandwidth cost and reduces each download to a few bytes or kilobytes rather than hundreds of MB.
At 1,250 Kbps for 480p (roughly Google's recommendation), every user streaming for an hour works out to roughly 550 MB/hr of data. If the situation gets really bad and 50% of traffic is scrapers (crawling has already grown to around 50% of the web), and maybe 50% of those could be cut by a factor of 100 because all they want is the text, then roughly a quarter of that hourly 550 MB (call it ~140 MB) shrinks to ~1.4 MB. Close to a quarter of the bandwidth removed.
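Back-of-the-envelope version of that math, in case anyone wants to poke at it (plain zsh integer arithmetic, numbers rounded):
```
# rough integer math behind the numbers above
bitrate_kbps=1250                                              # ~Google's 480p recommendation
mb_per_hour=$(( bitrate_kbps * 1000 / 8 * 3600 / 1000000 ))    # ~562 MB streamed per user-hour
scraper_mb=$(( mb_per_hour / 2 / 2 ))                          # 50% scrapers, 50% of those text-only: ~140 MB
reduced_mb=$(( scraper_mb / 100 ))                             # cut by 100x: ~1.4 MB (rounds to 1 here)
echo "${mb_per_hour} MB/hr total, ${scraper_mb} MB reducible, ~${reduced_mb} MB after reduction"
```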
There may also be a lot that effectively "are" search crawlers, where all they really want is a summary for categorizing videos and better search indexing. Except they download the whole video, because everybody's rolling their own solution, and huge portions of Stack Overflow and similar sites amount to "use this code, it's invincible." And the people deploying them don't even know what they're doing, because it's all copy-pasta.
Admittedly, this runs into the issue that they might then simply download 100x as many videos. However, reasonable limits on video streams per second, API calls per unit time, and calls per IP address block per unit time could mostly mitigate that.
I appreciate you see the irony in the issue, and their cultural opposition is partially what I'm pointing out. Constantly fighting against a deluge when you could just divert the river.
Why would YT want to give away all this excellent training data for LLMs/AI? My guess is that not doing so keeps it expensive for those wanting to slurp up the data.
PSA: yt-dlp exits non-zero if the destination filename or any intermediate file's name is too long for the filesystem. Use `-o "%(title).150B [%(id)s].%(ext)s"` to limit the filename length (to 150 bytes in this example). `--trim-filenames` doesn't work.
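For example, bolted onto the mp4 line from the script further up (same flags, just adding the output template):
```
# cap the title part of the filename at 150 bytes to avoid "File name too long" errors
yt-dlp -f mp4 -o "%(title).150B [%(id)s].%(ext)s" --write-auto-sub --sub-format best --write-sub "$@"
```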
How do I actually search files with timestamps (preferably from the CLI)?
I can use rg if the search terms happen to be on the same line, but if the terms span multiple lines, the interleaved timestamp metadata prevents the query from being matched.
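Something like this might work (untested sketch; `srtgrep` is just a made-up name): strip the SRT counter/timestamp/blank lines, collapse the newlines, and search the flattened text.
```
# flatten an .srt to plain text, then search it; phrase matches survive line breaks
srtgrep() {
  sed -E -e '/^[0-9]+$/d' -e '/ --> /d' -e '/^[[:space:]]*$/d' "$2" \
    | tr '\n' ' ' \
    | rg --only-matching --ignore-case "$1"
}
# usage: srtgrep "phrase that spans subtitle lines" "Some Video [id].en.srt"
```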