
I wouldn't call it 'tweaking' the data collection. He is simply normalizing the results to remove the effect of differences in language distribution.

This is normal and has nothing to do with how you choose to represent it.

It would have been meaningless to show any graph or table saying 'Python has the most messages with profanity' if Python projects make up 80% of all the projects out there.



He is right to normalize the results, but the parent's point is that he is wrong to do it by modifying his data collection.

He should just collect as many commit messages as possible, then divide the profanity count for each language by that language's commit message count. That has lower standard error [and no more bias] than what he did.
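To make the divide-by-count idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the `profanity_rates` function, the (language, message) input shape, the whole-word matching rule, and the sample data are all made up here, not taken from the original analysis.

```python
from collections import Counter


def profanity_rates(commits, profane_words):
    """Return {language: fraction of commit messages containing profanity}.

    commits: iterable of (language, message) pairs (hypothetical input shape).
    profane_words: set of lowercase words to look for (hypothetical word list).
    """
    totals = Counter()   # all commit messages seen, per language
    profane = Counter()  # messages with at least one profane word, per language
    for language, message in commits:
        totals[language] += 1
        # Naive whole-word match on whitespace-separated, lowercased tokens.
        if set(message.lower().split()) & profane_words:
            profane[language] += 1
    # Normalize after collection: divide by each language's own message count.
    return {lang: profane[lang] / totals[lang] for lang in totals}


# Made-up sample: Python has more messages, but a lower rate.
sample = [
    ("Python", "fix damn bug"),
    ("Python", "add tests"),
    ("Python", "refactor"),
    ("Python", "update docs"),
    ("C", "damn pointer"),
    ("C", "cleanup"),
]
rates = profanity_rates(sample, {"damn"})
print(rates)
```

The point is that every collected message is used, and the normalization happens in the division rather than in how the sample was gathered.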



