
Pie chart? I have no idea how to interpret this...

http://www.flickr.com/photos/amit-agarwal/3196386402/sizes/l...




Pie charts have a lot of drawbacks, sure, but it's ridiculous that we're at the point now where the first (and highest rated) response to a pie chart is always a negative comment about pie charts, regardless of how good or bad the pie chart is.

This one in particular is very clear:

C++, Ruby and Javascript have the most profanity. They're relatively equal to each other and collectively account for more than 50% of the swearing in commit messages.

C is next, with significantly less swearing.

C# and Java are roughly tied a bit below C.

Python and PHP have, comparatively, almost no swearing.

Was that really so hard? When the data is already subjective (what is and isn't a swear word) and intended almost solely for humor, do we really need more precision than a pie chart offers?

It is at best hyperbolic and at worst dishonest to say you "have no idea" how to interpret this. You have an idea. You just don't have precision.


> Python and PHP have, comparatively, almost no swearing.

Of course. Python users are happy people.

I wonder what happens with PHP... ;-)


For PHP, you'd have to search for "ass_hole", "fuckmother", "cock_sucker", and so on to be fair.


Reading the ones with the underscores makes me think this mode of swearing deserves its very own accent for when it's read aloud.


thanks, I almost spit tea on my computer :)


Complete lack of comments?


> I wonder what happens with PHP... ;-)

Ignorance is bliss?

> Of course. Python users are happy people.

Heh, I've been using Python lately and have felt lots of urges to swear, but I can't get myself to commit it. Shoot.


PHP code gets delivered to the customer :)


So does Python.


much fewer lines of it.


Did they check swear words in other languages? ;-)


maybe they don't know about github?


The same number of commits were taken from each language.


> C++, Ruby and Javascript have the most profanity. They're relatively equal to each other and collectively account for more than 50% of the swearing in commit messages.

this is the problem. In the pie chart it's almost impossible to determine which of those three has the most. In the bar chart, it's fairly obvious to my eye that C++ wins, though JS/Ruby are very close.


Rather than being organized by language names, the items in the pie graph should have been grouped by size (largest at 12, proceeding clockwise to the smallest at 11:59, for example). What relationship is there to show between the grouped names of programs that outweighs making this clear?
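The ordering suggested here can be sketched in a few lines. This is only an illustration: the slice names and values are invented placeholders, not the figures from the chart, and `clockwise_slices` is a hypothetical helper name.

```python
def clockwise_slices(shares):
    """Order slices largest-first and assign clockwise angles from 12 o'clock.

    Returns a list of (name, start_deg, end_deg) tuples, where 0 degrees is
    the 12 o'clock position and angles grow clockwise.
    """
    total = sum(shares.values())
    # Largest slice first, so it starts at 12 o'clock.
    ordered = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    slices = []
    angle = 0.0
    for name, value in ordered:
        sweep = 360.0 * value / total
        slices.append((name, angle, angle + sweep))
        angle += sweep
    return slices


# Placeholder data, NOT the values from the original chart.
shares = {"A": 10, "B": 40, "C": 30, "D": 20}
layout = clockwise_slices(shares)
for name, start, end in layout:
    print(f"{name}: {start:.0f} to {end:.0f} degrees")
```

If the chart were drawn with matplotlib, passing the sorted values to `Axes.pie` with `startangle=90` and `counterclock=False` should give the same layout.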


Dude, no. I think he was talking about how you can't tell how the size of the user base of a language is affecting the ranking. So, for example, only 1% of all projects could be in Java, but the swearing could be frequent enough to make it have ~15% of all curse words.


> Note that I ripped an equal amount of commit messages per language so the results aren't based on how many projects there are per language.

All the languages are equally represented by commit count.


but his total number is 929857, which is not divisible by 8


What a bummer... The percentages might be off by a fraction of a fraction of a percent...


Given that his process for ripping an "equal" number of commit messages per language was broken, I see no reason to believe that anything else even approaches validity. It's simple arithmetic; a grade schooler who notices that the last digit is 7 would realize something's off.


What about the process is broken? Did you read the code and find bugs? With a total commit count of 929857, being a single commit away from a perfectly even number of commits in each language is insignificant.
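For anyone who wants to check the arithmetic, here is a quick sketch. The total 929857 and the count of 8 languages come from the comments above; the variable names are just for illustration.

```python
total_commits = 929857  # total quoted in the thread
languages = 8           # number of languages mentioned in the thread

per_language, leftover = divmod(total_commits, languages)
print(per_language, leftover)

# Size of the imbalance relative to the whole sample.
imbalance = leftover / total_commits
print(f"{imbalance:.2e}")
```

So the total is one commit away from an even split across the eight languages, which is on the order of one part in a million.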


Or he had 929857 commits and then he randomly sampled an equal number for each language. Thus, no division etc.


Then, of course, you have problems of sample size. Nearly a million commits is a pretty good sample size; a hundred, not so much.


Sorry, my bad.


[deleted]


Ahem, "Note that I ripped an equal amount of commit messages per language". I do think this is a bad place for a pie graph, but your specific criticism here is misplaced.


Yes, and? The amount of messages is equal, the amount of profanity per a set of commit messages (which is what is measured here) is not.


This was in response to a now deleted comment that claimed that more popular languages would show up as having more profanity because they have more commits, even if the profanity per commit was constant.


I get it now. Sorry, my bad.


What are you getting at?


But he said he sampled an equal number of commit messages from each language.



A good weekend project would be to take an existing graphing library and make a wizard for it that would create a correct type of graph based on the data and your stated intentions with the data, as shown in the flowchart above.


Thanks for reminding me again why I don't bother reading these forums. One day I'll quit clicking links too.


Note that I ripped an equal amount of commit messages per language so the results aren't based on how many projects there are per language.

I like how he had to tweak the data collection process to make the visualization method fit.


That is not the case. He wanted to compare curse words across languages independent of language popularity. If he did not collect the same amount of data per language, then he would have two variables: number of curse words and number of commits. Then there would be the danger that a more popular language would have far more curse words simply because it has far more commits.


I wouldn't call it 'tweaking' the data collection. He is simply normalizing the results to ignore the differences in language distribution.

This is normal and has nothing to do with how you choose to represent it.

It would have been meaningless to show any graph or table saying 'Python has the most messages with profanity' if the number of Python projects is 80% of all the projects out there.


He is right to normalize the results, but parent's point is that he is wrong to do that by modifying his data collection.

He should just collect as many commit messages as possible, then divide the profanity count for each language by the commit message count, because that has lower standard error [and no more bias] than what he did.
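A rough sketch of the per-commit normalization described here, under stated assumptions: the counts are invented placeholders rather than the real data, each commit is treated as an independent profane/not-profane trial, and the standard error uses the usual binomial approximation. `profanity_rate` is a hypothetical helper name.

```python
import math


def profanity_rate(profane, total):
    """Return (rate, standard_error) for `profane` hits out of `total` commits.

    Assumes each commit is an independent yes/no trial (binomial model).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = profane / total
    se = math.sqrt(p * (1 - p) / total)
    return p, se


# Placeholder counts, NOT the real data: (profane commits, total commits).
samples = {
    "lang_a": (50, 10_000),
    "lang_b": (200, 40_000),
}

rates = {name: profanity_rate(*counts) for name, counts in samples.items()}
for name, (rate, se) in rates.items():
    print(f"{name}: rate={rate:.4f} se={se:.6f}")
```

Both placeholder languages come out with the same rate even though the sample sizes differ, and the one with four times as many commits has half the standard error, which is the argument for not throwing data away to equalize counts.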


That's not the case, that's just his personal choice. He could just as well have gone with %age of swear words per commit, which would have made the number of commits per language irrelevant (as long as that number was kept above statistical nonsense) and would have yielded the same result.


What are you talking about?


That flow chart there is helpful, thanks.


I added a pie chart :)



