
Looking forward to a document leak about OpenAI using YouTube data to train their models. When asked whether they use it, Murali (CTO) said she doesn't know, which makes you 99% sure that they are using it.



I would say 100%, simply because there is no other reasonable source of video data.


I use multiple websites that have hundreds of thousands of free stock videos that are much easier to label than YouTube videos.


The number of videos is less relevant than the total duration of high-quality video (on YouTube, quality can be approximated with metrics such as view and subscriber counts). Also, while YouTube videos are not labelled directly, you can extract signal from the title, the captions, and perhaps even the comments. Lastly, many online sources host their videos on YouTube and embed them on their own pages, and those pages probably contain more text that can be used as labels.


To be fair, I don't think Google deserves exclusive rights to content created by others just because they own a monopolistic video platform. However, I do think it should be the content owner's right to decide whether anyone, including Google, gets to use their content for AI.


Any other company can start a video platform. In fact, a few have tried and failed.

Nobody has to use YouTube either.

If you want change in the video platform space, be willing to either pay for a subscription or watch ads.

Consumers don't want to do either, and hence no one wants to enter the space.


*Murati


I am surprised to see a pro-copyright take on HN :)



