Looking forward for a document leak about openai using YouTube data for training their models. When asked if they use it, Murali (CTO) told she doesn't know which makes you believe that for 99% they are using it.
Number of videos are less relevant than the total duration of high-quality videos (quality can be approximated on YouTube with metrics such as view and subscriber count). Also, while YouTube videos are not labelled directly, you can extract signal from the title, the captions, and perhaps even the comments. Lastly, many sources online use YouTube to host videos and embed them on their pages, which probably contains more text data that can be used as labels.
To be fair I don’t think Google deserves exclusive rights to contents created by others, just because they own a monopolistic video platform. However I do think it should be the content owner’s right to decide if anyone, including Google, gets to use their content for AI.