Show HN: I created a free website for downloading YouTube transcripts and subtitles (downloadyoutubesubtitle.com)
125 points by trungnx2605 9 months ago | 27 comments



I use this script because automatically generated subtitles are badly formatted as a transcript (they only work well as subtitles). It works pretty well for archiving videos along with their transcripts and subtitles.

```
#!/bin/zsh

# Download the video as mp4 together with the normal and auto-generated subtitles.
yt-dlp -f mp4 "$@" --write-auto-sub --sub-format best --write-sub

# Download the subtitles again (TTML converted to SRT) and run sed over each file
# to strip the timestamp lines, cue numbers and markup, leaving a plain transcript.
yt-dlp --skip-download --write-subs --write-auto-subs --sub-lang en -k \
  --sub-format ttml --convert-subs srt \
  --exec before_dl:"sed -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' -e '/^[[:space:]]$/d' -i '' %(requested_subtitles.:.filepath)#q" "$@"
```


Checking online, this [1] appears to be one of the most heavily referenced libraries on Stack Overflow for downloading both user-entered and automatically generated transcripts (it's Python-based).

[1] https://github.com/jdepoix/youtube-transcript-api
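
For reference, a minimal sketch of how that library is typically used. The video ID is a placeholder, and the `get_transcript` call is the long-documented entry point; newer releases may expose a slightly different interface:

```
from youtube_transcript_api import YouTubeTranscriptApi

# "VIDEO_ID" is a placeholder for the 11-character YouTube video ID.
transcript = YouTubeTranscriptApi.get_transcript("VIDEO_ID", languages=["en"])

# Each entry carries the caption text plus timing information.
for entry in transcript:
    print(f'{entry["start"]:8.2f}  {entry["text"]}')
```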

Notably, Google really needs to have an obvious API endpoint for this kind of call. If thousands of programmers are all rolling their own implementations, there's probably a huge number that constantly download the full video and transcribe it themselves for data harvesting.

Honestly, I'm kind of surprised it's taken this long for YouTube to fall prey to massive data-harvesting campaigns. From this article [2] and this paper on YouTube statistics [3], there are ~14,000,000,000 videos on YouTube with a mean length of 615 seconds (~10 minutes).

You'd think people would be interested in:

  8,610,000,000,000 seconds
  143,500,000,000 minutes
  2,391,666,666 hours
  3,274,083 months
  272,840 years
  27,284 decades
  2,728 centuries
  273 millennia
Of live action video on nearly every single subject in human existence.
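
A quick back-of-the-envelope check of those figures (both inputs are the ~14 billion videos and 615-second mean length taken from [3]):

```
# Sanity check of the totals above; both inputs come from the paper [3].
videos = 14_000_000_000     # ~14 billion public videos
mean_len_s = 615            # mean video length in seconds (~10 minutes)

seconds = videos * mean_len_s
print(f"{seconds:,} seconds")                         # 8,610,000,000,000
print(f"{seconds // 60:,} minutes")                   # 143,500,000,000
print(f"{seconds // 3600:,} hours")                   # 2,391,666,666
print(f"{seconds / 3600 / 24 / 365.25:,.0f} years")   # ~272,834, depending on day-count convention
```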

Also, the paper's really cool, and it's extremely sobering about being a "content creator", given that the top 1% get all the views.

[2] "What We Discovered on ‘Deep YouTube’", https://www.theatlantic.com/technology/archive/2024/01/how-m...

[3] "Dialing for Videos: A Random Sample of YouTube", https://journalqd.org/article/view/4066/3766


My understanding is that YouTube actively _undermines_ the ability of tools like youtube-dl to download videos. I do see the irony that providing an API endpoint (just for transcripts) might actually save them on egress costs.

But I think they are probably culturally opposed to publicly exposing this sort of thing, even if it only worked via an authenticated account. It's also worth considering that doing so would make it easier for a competitor to steal the value they provide with the generated closed captions.


The only argument I'm making is that if 1,000,000 developers all want to train LLMs on video data, because they desperately need to beat Sora, or ChatGPT, or Stable Diffusion, then there are probably a lot of them rolling their own scraping software.

Probably rolling their own scraping software with inefficient methods, and then likely pseudo-DDoSing (mostly just irritating) Google with constant scrape attempts.

I could fight forever against petabytes of constant downloads, or simply make an incredibly small, condensed, easy-to-download summary that minimizes my bandwidth costs and reduces each download to bytes or kilobytes rather than hundreds of MB.

At 1,250 Kbps for 480p (roughly Google's recommendation), every user streaming for an hour comes to about 550 MB of data. If the situation gets really bad and 50% of traffic is scrapers (the way crawling has grown to roughly 50% of the web), and maybe 50% of those can be reduced by a factor of 100 because all they want is the text, then roughly 150 MB of that hourly 550 MB can be reduced to 1.5 MB. That's close to a quarter of the bandwidth removed.
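
A quick sanity check of that estimate (every number here is an assumption carried over from the paragraph above):

```
# All inputs are the assumed figures from the paragraph above.
bitrate_kbps = 1250                      # ~480p stream
mb_per_hour = bitrate_kbps * 1000 * 3600 / 8 / 1e6
print(f"{mb_per_hour:.0f} MB/hour")      # ~562 MB/hour, i.e. roughly 550 MB

scraper_share = 0.5                      # assume half of the traffic is scrapers
text_only_share = 0.5                    # assume half of those only want the text
reduction_factor = 100                   # a transcript is ~100x smaller than the video

saved = mb_per_hour * scraper_share * text_only_share * (1 - 1 / reduction_factor)
print(f"{saved / mb_per_hour:.0%} of bandwidth saved")   # ~25%
```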

There may also be a lot that effectively "are" search crawlers, where all they really want is a summary for categorizing videos and better search indexing. Except they download the video, because everybody's rolling their own solutions, and huge portions of Stack Overflow and similar sites amount to "use this code, it's invincible." And the people deploying them don't even know what they're doing, because it's all copy-pasta.

Admittedly, it runs into the issue that they then simply download 100x as many videos. However, reasonable limits on video streams per second, API calls per unit time, and calls per IP address block per unit time could mostly mitigate that.

I appreciate that you see the irony in the issue, and their cultural opposition is partly what I'm pointing out: constantly fighting against a deluge when you could just divert the river.


What competitor?


Did you consider that the reason they don't have many competitors is because of that sort of behavior?

To answer more directly: "a hypothetical one". Also, I'm speculating and may be wrong.


Why would YT want to give away all this excellent training data for LLMs/AI? My guess is that not doing so keeps it expensive for those wanting to slurp up the data.


I'm also doing this, but mine adds punctuation, paragraphs, and chapter headers, because most raw YouTube transcripts lack proper punctuation:

https://www.appblit.com/scribe


How are you deriving the punctuation?


While they could re-transcribe it with Whisper, they use an in-browser model. The worker's source code is available at https://www.appblit.com/scribe-worker.js.
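
For anyone who does want to go the re-transcription route, a minimal sketch with the openai-whisper package (the model size and the audio filename are arbitrary choices here, not anything AppBlit actually uses):

```
import whisper  # pip install openai-whisper

# Transcribe audio that was already downloaded, e.g. with `yt-dlp -x`.
model = whisper.load_model("base")        # "base" trades some accuracy for speed
result = model.transcribe("video_audio.mp3")
print(result["text"])                     # Whisper's output is already punctuated
```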


Not the GP, but this prompt works reasonably well for ChatGPT 3.5:

    The following is a raw transcript from a YouTube video. Add paragraphs and punctuation. Do not modify or correct the text:

    <paste raw text here>
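
Scripted against the OpenAI Python SDK, that prompt looks roughly like this (the model choice and the transcript filename are placeholders; adjust to whatever you actually use):

```
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# "transcript.txt" is a placeholder for the raw transcript pulled from YouTube.
raw_transcript = open("transcript.txt", encoding="utf-8").read()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "The following is a raw transcript from a YouTube video. "
                   "Add paragraphs and punctuation. Do not modify or correct the text:\n\n"
                   + raw_transcript,
    }],
)
print(response.choices[0].message.content)
```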


Wow! This is great


yt-dlp[1] can also do this:

```

$ yt-dlp --write-sub --sub-lang "en.*" --write-auto-sub --skip-download 'https://www.youtube.com/watch?v=…'

```

[1] https://github.com/yt-dlp/yt-dlp


I see lots of yt-dlp commands here so…

PSA: yt-dlp exits non-zero if the destination filename, or any intermediate file's name, is too long for the filesystem. Use `-o "%(title).150B [%(id)s].%(ext)s"` to limit the filename length (to 150 bytes in this example). `--trim-filenames` doesn't work.


For whoever may find it useful:

- `.` - precision flag [1]

- `150` - precision amount [2] - the number of bytes to keep once the string is converted to its byte representation

- `B` - special conversion type [3] - bytes

[1]: https://docs.python.org/3/library/stdtypes.html#printf-style...

[2]: https://docs.python.org/3/library/stdtypes.html#printf-style...

[3]: https://github.com/yt-dlp/yt-dlp?tab=readme-ov-file#output-t...
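
The same limit can also be set through yt-dlp's Python API via the `outtmpl` option; a minimal sketch (the URL is a placeholder):

```
import yt_dlp

# Keep the title to at most 150 bytes in the resulting filename,
# mirroring the -o "%(title).150B [%(id)s].%(ext)s" flag above.
opts = {"outtmpl": "%(title).150B [%(id)s].%(ext)s"}

with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder video ID
```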


Here is mine: https://www.val.town/v/taras/scrape2md

Use it like https://taras-scrape2md.web.val.run/https://youtu.be/TJqeCpx...

This is meant to be a general-purpose content-to-Markdown tool for LLM interactions in https://chatcraft.org


What's the copyright license on your scrape2md code?


Updated the description with the license (MIT) and a link to the more fully featured version.


I also liked this one:

https://filmot.com/

Here you can search the subtitles of YouTube videos.


How are you getting the transcripts? Using the private YT API like in https://www.npmjs.com/package/youtube-transcript?


youtube_transcript_api


How do I actually search files with timestamps (preferably from the CLI)?

I can use rg if the search terms happen to be on the same line, but if the terms span multiple lines, the interleaved timestamp metadata prevents the query from being matched.


I still can't find a good service for accurate subtitles on videos that only have auto-generated ones; there's always gibberish here and there.


Hi, I still get an error that says "Client Error: Too Many Requests for URL". So YouTube blocked the IP, right?


Is there any way to extract the transcripts from the JS state on YouTube, instead of making API requests for them?


It uses youtube_transcript_api.


What a great service. Thanks!



