I wouldn't call it 'tweaking' the data collection. He is simply normalizing the results so that differences in language distribution don't skew them.
This is normal and has nothing to do with how you choose to represent it.
It would have been meaningless to show any graph or table saying 'Python has the most messages with profanity' if Python projects made up 80% of all the projects out there.
He is right to normalize the results, but the parent's point is that he is wrong to do it by modifying his data collection.
He should just collect as many commit messages as possible, then divide the profanity count for each language by that language's commit message count. That gives a lower standard error [and no more bias] than what he did, because the standard error of a rate shrinks as the sample grows.
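To make that concrete, here is a rough sketch of the calculation. It is not what the author actually ran: the function name, the (language, has_profanity) input format, and the binomial formula for the standard error are all my own assumptions for illustration.

```python
from math import sqrt


def profanity_rates(commits):
    """Return {language: (rate, standard_error)}.

    `commits` is an iterable of (language, has_profanity) pairs.
    This input format is hypothetical, chosen only for the example.
    """
    totals = {}   # commit messages seen per language
    profane = {}  # commit messages containing profanity per language
    for language, has_profanity in commits:
        totals[language] = totals.get(language, 0) + 1
        profane[language] = profane.get(language, 0) + int(has_profanity)

    rates = {}
    for language, n in totals.items():
        p = profane[language] / n
        # Binomial standard error of a proportion: shrinks like 1/sqrt(n),
        # so keeping every commit message can only help.
        se = sqrt(p * (1 - p) / n)
        rates[language] = (p, se)
    return rates


# Made-up toy data: 4 Python commits (1 profane), 2 C commits (1 profane).
sample = (
    [("Python", True)]
    + [("Python", False)] * 3
    + [("C", True), ("C", False)]
)
for language, (rate, se) in sorted(profanity_rates(sample).items()):
    print(f"{language}: rate={rate:.2f} se={se:.2f}")
```

The per-language rate is already comparable across languages no matter how many projects each one has, so there is no need to throw away data to equalize the sample sizes.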