I don't think that's going very far. Humans who figure this out also have access to context. You wouldn't know a song is protesting a war unless you knew about the war it's protesting. 800 million pages might seem like overkill, but it pales in comparison to the amount of information humans (sub)consciously use to reach these conclusions. Think about how much information it would take to adequately describe the concept of a protest song to a machine.
Obviously the whole point is for it to figure it out for itself, not to be told.