Take this example music video:
And this lyrics video version (more interesting because it's SOMEWHAT changing):
Simply finding the differences between frames would give the first video a higher score than the second.
Nice catch on the thumbnails that YouTube already captures. When I ran a histogram comparison between the second and third auto-generated thumbnails from the lyrics video, the two came out mostly equivalent. That would be a good sign that it's not the actual music video.
Perhaps, on top of that, one could run a histogram comparison of the same frames, which should capture whether the video is pretty much lyrics on a static background.
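To make the histogram idea concrete, here is a rough sketch of the kind of comparison I mean. It assumes each thumbnail has already been decoded into a flat list of 8-bit grayscale values (how you load them is up to you), and the function names, the 32-bin choice, and the 0.9 threshold are all my own placeholders, not anything tuned on real videos.

```python
from typing import Sequence


def gray_histogram(pixels: Sequence[int], bins: int = 32) -> list[float]:
    """Normalized histogram of 8-bit grayscale values (0-255)."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    counts = [0] * bins
    for p in pixels:
        # Map 0..255 onto 0..bins-1 with integer arithmetic.
        counts[p * bins // 256] += 1
    total = len(pixels)
    return [c / total for c in counts]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Overlap of two normalized histograms: 1.0 = identical, 0.0 = disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def looks_static(frame_a: Sequence[int], frame_b: Sequence[int],
                 threshold: float = 0.9, bins: int = 32) -> bool:
    """Guess that two frames share a background if their histograms overlap.

    The threshold is an arbitrary assumption and would need tuning.
    """
    score = histogram_intersection(gray_histogram(frame_a, bins),
                                   gray_histogram(frame_b, bins))
    return score >= threshold


# Two toy "thumbnails": same dark/bright split, slightly different values.
thumb_2 = [10] * 50 + [200] * 50
thumb_3 = [12] * 50 + [205] * 50
print(looks_static(thumb_2, thumb_3))
```

Note that a histogram ignores where the pixels are, so lyrics moving around on the same background would still score as nearly identical, which is the behaviour wanted here.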
2. I'm pretty sure it's not trivial to download just three specific frames of a video. And downloading so many videos would probably be expensive/bannable.