I'm not actually sure that's true. Audio and video capture a lot of detail about the world, and presumably large transformers could learn from the textures, shadows, articulated movements, the physics of how sounds are made, etc.