What makes one appear smarter and more sociable? (judg.me)
255 points by bvi on May 2, 2012 | 83 comments

When you don't start your y-axis at 0, you skew the interpretation of your data. At best, this is a significant mistake, and at worst, this is intentionally misleading.

Take this graph:


It looks like women are rated as more than twice as smart as men. Huge difference.

That is, until you run the numbers. Women are rated about 4.3% "smarter" than men. Not twice as smart, as the graph implies. Not 20% smarter. Not even 5%.

Please, pay attention to your graphs. They're great tools, but they can mislead as much as they can help elucidate.

Gah, this data is so interesting, but is not presented very well.

The first change I would make is to put all the red bars next to each other and all the blue bars next to each other; there is more value in comparing across ethnicities/genders than in comparing across variables. The second change I would make is to get rid of the eye-popping primary colors and choose two more neutral ones (two different shades of beige, maybe?).

Finally, the y-axis scale changes too frequently. Also, error bars would be nice, as has been mentioned.

edit: Now that I think more about it, at a glance it would be nice to show a scatter plot with a dot for each of Male/Female/Ethnicity/etc in the way that you have a scatter plot on your home page. http://judg.me

> Also error bars would be nice as has been mentioned.

Not just nice but absolutely crucial. If you're looking at differences of 2%, you need tens of thousands of samples to say anything definitive. He's only got 1000 photos total, so I'm pretty sure most of this article is just an analysis of random noise.
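To put rough numbers on that, here's a back-of-the-envelope sketch in Python (the standard deviation and group sizes are hypothetical, not the site's actual data):

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# Hypothetical numbers: ratings on a 0-10 scale with sd ~ 2,
# and ~500 photos per group after splitting by gender.
se = standard_error(2.0, 500)        # ~0.09 points per group mean

# 95% CI half-width for the DIFFERENCE of two group means
# (multiply the per-group SE by sqrt(2)):
ci_diff = 1.96 * se * math.sqrt(2)   # ~0.25 points, i.e. ~2.5% of the scale
```

A ~2.5% confidence band around the difference is the same order of magnitude as the reported 4.3% gap, which is why error bars matter so much here.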

You're right.

Also, maybe I'm just dense, but it took me a REALLY long time to wrap my head around the scoring matrix of the website, and even after a few practice photos I still find myself having to think painfully hard about what each quadrant means. I have to imagine that this adds even more noise to the data, if people's heads are hurting and they don't know where to click.

One suggestion would be to provide an intuitive guide for each of the quadrants, not just the axes, e.g. "Outgoing idiot," "Shy nerd" etc. but of course you risk biasing the responses based on the language you use. The other option would be to simplify the input, only rating one variable at a time.

Yep, there are two sources of variance: the variance amongst a sample of 1000 photos selected from a hypothetical pool of all people, and the variance in how the sample of 100 subjects rated any given photo compared to the "intrinsic" qualities of the photo.

I don't see any analysis at all. I don't think there is enough evidence to claim any of these results. In a 1000 photo sample we would expect ~3% variability (not defined), or 0.3 on these scales.

Pick another 1000 samples, redo the blog post and see if there is anything in common (not that that would make it statistically significant).

Don't judg.me by my graphs :)

And worse than that, the y-axis range shifts. For example, the range on the Fat vs. Normal chart is much smaller than the surrounding charts, giving the illusion that the spread is much larger than it is. (site is down or I'd link).

True, and I was having a tough time deciding which option to take (whether to start the y-axis at 0 or at 5). As far as the averages are concerned, the differences are in the tenths and hundredths, so starting the y-axis at 0 would make it more difficult to see any differences across the graphs.

| so starting the y-axis at 0 would make it more difficult to see any differences across the graphs

Then maybe there isn't a real difference. See statistical significance.

A (somewhat blunt) rule of thumb, not just for you, but for anyone in this endeavor:

In the future, if you are making a graph like this, and the differences are so small that they don't show up when your y-axis starts at 0, they probably aren't important enough to talk about.

If you've taken a stats class, you probably know there are times when small differences are meaningful. Any time you're analyzing your own data, though, these differences are rarely worth even considering. Taking a survey and analyzing the resulting data is very hard. There are countless nuances that can affect your results, both mathematical and psychological. Look at political polling - even the people who do it for a living regularly get things wrong.

starting the y-axis at 0 would make it more difficult to see any differences across the graphs.

That's the point: the differences are small.

I would go further: the differences probably are not significant. (Small differences may still be significant. In this case, I doubt it.)

Inter-rater reliability is super important for tests like this. http://en.wikipedia.org/wiki/Inter-rater_reliability The gist is: you can't simply mark one person as "asian" and assume that categorization is correct. In that respect, the data would reveal more about the person sorting the photos than it would reveal about the perceptions of those that are rating the photos.

Second, there is a huge problem with causality here. For instance, the author writes: "Be Asian if you want to appear smart; Latino if you want to appear extroverted." The problem is that there is a methodological flaw. The first photo I was presented with on judg.me was this image: http://images.judg.me/82e7fcbd988dbdcac0d00bd53fb93e96.jpg This appears to me to be a Latino or Hispanic male at a party. I'm highly inclined to rate him highly on the extrovert scale: he's at a party. But that doesn't mean that stereotypically Latino or Hispanic features indicate extroversion. It could be that people with stereotypically Latino or Hispanic features were more likely to upload photos portraying a more stereotypically extroverted activity.

Third, it appears that users can upload a photo to the site and see their feedback from votes. It seems highly possible that users self-select a photo that will best affirm the image of themselves they wish to cultivate. In that respect, there's both a huge confirmation bias and a huge self-selection bias. If I want to think of myself as an academic, I'll upload a picture of myself at my desk studying and watch the "intellectual" ratings pour in. Then I can feel assured that other people perceive me the way I want to be perceived. Additionally, if one wants to conform to social expectations (and things like Asch's line test http://en.wikipedia.org/wiki/Asch_conformity_experiments indicate conformity is common), this data might really show nothing more than the degree to which people post photos affirming their conformity to social expectations (i.e. 'smart' ethnicities posting 'smart-looking' photos) and say nothing at all about how people actually perceive ethnic cues.

There are huge methodological concerns with this 'study'. The real revelation of this data might actually be the insight that "pictures of yourself at social events make you look more social." Taking much of anything at all away from this data set would be rather unwise.

"In that respect, the data would reveal more about the person sorting the photos than it would reveal about the perceptions of those that are rating the photos."

The problem is that none of this matters, since we don't know anything about the people rating the photos. Not their sex, not their age, not their location. Nothing.

To wit:

"and nothing about users who judge the photos."


In addition to the inter-rater reliability issue, there are also a lot of unanswered questions about the statistical distributions involved. The results are reported as population means, but without information about the underlying distribution it's unclear whether the mean is a meaningful measure of central tendency for the data, or how much overlap there was between the distributions. How did the mean compare with the median and mode? What were the standard deviations? Interquartile ranges?

They're using a visual analog scale for the ranking, which is reasonable, but it seems that it's just been assumed that the data can be treated as interval data for the analysis, and the validity of that assumption hasn't been established. If I were doing the analysis I'd have been inclined to bin the data and report the results as odds ratios with 95% confidence intervals (e.g. people wearing glasses are N + or - 95% CI times more likely to be regarded as "smart" than those without glasses, where "smart" is defined as a score >= some reasonable threshold on the "smartness" axis).
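The odds-ratio approach described above could be sketched like this (the counts are made up purely for illustration):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for a 2x2 table with a Wald 95% confidence interval.

    a: glasses & rated "smart"      b: glasses & not
    c: no glasses & rated "smart"   d: no glasses & not
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

# Hypothetical counts, not the site's data:
or_, lo, hi = odds_ratio_ci(120, 80, 90, 110)
# If the interval (lo, hi) straddles 1.0, the "effect" isn't significant.
```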

"it's just been assumed that the data can be treated as interval data"

Which is especially problematic since user-generated ratings are ordinal, not interval data. Since the idea of an interval between points in ordinal data is essentially meaningless, the summary statistics you mentioned are not meaningful either.

It's one thing for Amazon to come up with a mean user rating to give you a sense of how people like something, but it's not a valid method for comparing the data we have here, especially when the differences are so small.

I would think that a multi-level/hierarchical/mixed GLM would be an interesting approach to this data. Multilevel modeling assumes that there is correlation between observations inside the same "level". This is in stark contrast to a regular GLM (even one with dummy variables to represent categories), which assumes that all observations are 100% independent.

E.g. in a model that predicts students' GPA, you could divide your data into a hierarchy consisting of, at the highest level, geographic area, followed by high school, maybe followed by teacher. In that model, the correlation between students who are in the same state, the same school or in the same classroom would be accounted for. You could even go as deep as at an individual level if you have >1 observation per student.

In addition to regular predictor variables, judg.me could probably use their weblogs to group people's judgment scores by country of origin and by individual, among other possibilities.
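As a minimal sketch of the data-shaping step for such a model (the records and level names are hypothetical; the actual fit would use something like statsmodels' mixedlm with the rater as the grouping variable):

```python
from collections import defaultdict

# Hypothetical judgment records: (rater_id, country, photo_id, score)
judgments = [
    ("r1", "US", "p1", 7.0), ("r1", "US", "p2", 6.0),
    ("r2", "US", "p1", 8.0), ("r2", "US", "p3", 5.0),
    ("r3", "DE", "p2", 4.0), ("r3", "DE", "p3", 6.0),
]

# Group scores by the levels a hierarchical model would use:
# country -> rater -> scores.
by_country = defaultdict(lambda: defaultdict(list))
for rater, country, photo, score in judgments:
    by_country[country][rater].append(score)

# Per-rater means show how much of the variance sits at the
# rater level rather than the photo level.
rater_means = {r: sum(s) / len(s)
               for raters in by_country.values()
               for r, s in raters.items()}
```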

While inter-rater reliability is a concern for identifying ethnicities, other traits should be more reliable (gender, hair color). Still, things like that need to be pre-tested: you let the coders code a limited set of photos and check the correlation between their codes. The higher the correlation the better; >.7 is the convention in social science (though that's still pretty bad, and higher would be better).

You should also check intra-coder reliability, i.e. give the same person the same set of photos with two weeks or so in between, and then again calculate the correlation. This tells you whether your categories are too fuzzy (e.g. what exactly is medium-length hair?).
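A minimal sketch of that coder-correlation check, with made-up hair-length codes:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two coders' codes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical hair-length codes (1=short .. 3=long) from two
# coders on the same ten photos:
coder_a = [1, 2, 3, 1, 2, 2, 3, 1, 1, 2]
coder_b = [1, 2, 3, 2, 2, 1, 3, 1, 2, 2]
r = pearson_r(coder_a, coder_b)  # > .7 just clears the usual convention
```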

All in all this has serious methodological flaws; from a social science perspective it's not salvageable, and I haven't even talked about the complete lack of statistical tests (which, to be honest, would just be polishing a turd).

The problem with this blog is so obvious that I'm surprised I haven't seen it in the comments yet (I probably missed it): you can't use a random selection of photographs for this if you expect gross ratings to mean something. You would have to normalize each trait you were comparing against every other trait. Otherwise, when you were trying to isolate how smart people judge black people to be, and black people were wearing caps a quarter more often than the average person, you would think you were getting interesting data about race when really you were getting interesting data about caps.

If you didn't plan to use gross ratings like this blog did (I think), then I'm pretty sure you could do a post-hoc normalization by analyzing the frequencies in the sample, determining how much you'd expect each trait to affect the rating for every other trait, and then trying to determine whether the deviations from that were statistically significant in a universe that contained only those traits.

Honestly, just take the original data, assign every trait a rating of 5, then pick a random trait and pull its value up or down, and see what the gross ratings now say about the other traits.

I apologize if the methodology was more complicated than it looks, and I hope there's a link to a spreadsheet of the original distribution somewhere in the blog that I missed, so someone could make sense of this data.
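The cap/race confound described above is easy to demonstrate with simulated data (all the numbers here are invented; the only point is the mechanism):

```python
import random

random.seed(0)

# Simulated photos with two correlated binary traits. The true
# rating penalty lives ONLY in wearing a cap, yet the per-group
# gross averages will show a spurious race difference because
# caps are more common in one subsample (the confound).
photos = []
for _ in range(10_000):
    black = random.random() < 0.2
    cap = random.random() < (0.6 if black else 0.4)  # cap rate differs
    rating = 5.0 - (1.0 if cap else 0.0) + random.gauss(0, 0.5)
    photos.append((black, cap, rating))

black_scores = [r for b, c, r in photos if b]
other_scores = [r for b, c, r in photos if not b]
black_mean = sum(black_scores) / len(black_scores)
other_mean = sum(other_scores) / len(other_scores)
# black_mean comes out ~0.2 points lower despite race having
# zero true effect on the rating.
```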

I have no idea what the standard deviation on this is. Lots of the numbers look close enough to be noise. Others in this thread have pointed out other missing information that makes this a fairly poor survey.

Agreed. What's the margin of error on these? A sample of 1000 is also far too small to draw any kind of conclusion considering all the variables.

What a tragic waste of data and time. Not one mention of confidence intervals (are _any_ of these differences statistically significant??), selection bias (who was more likely to submit photos, and why did they choose a specific photo??), or sampling errors (who rated the attributes, and how consistent were they?). The OK Cupid blog posts are a great source for similar (but statistically sound) studies.

"thousands of photos have been uploaded and judged by users since."

Who are the users that are judging? What is the breakdown of those users (age, sex, location, education, etc.)? What can possibly be inferred from this without knowing that info?

It's entirely anonymous - I know nothing about the users who are uploading the photos (apart from the email addresses they use when uploading a photo), and nothing about users who judge the photos.

The entire premise of the site is for the user to be judged by strangers. Why would age/sex/location/education of the person doing the judging matter?

Because it's most likely skewed. Without knowing anything about where your votes come from, I'm pretty sure the WEIRD demographic (Western, educated, industrialized, rich, democratic) is over-represented.

Not trying to be overly critical - I like the concept and execution, but these things are really important in statistics.

Research is tricky.

Unless you're very careful about adjusting for factors (like age, location, etc.) and careful with statistics, you end up with garbage.

Either do something that's just purely fun, or do something properly and call it research. But don't do something wrong and call it research, because there are so many people ignorant about how science works and ignorant about numbers.

The judges have IP addresses that you can map to a geographic region, such as a state or country. That would be useful data, and you'd be able to actually say with authority whether or not at least the location of the person doing the judging matters.

I think that everyone would like to see how men and women rated them as separate categories (maximizing sex appeal is a common goal), and age might also be desirable.

> Why would age/sex/location/education of the person doing the judging matter?

It can bias the results. How did your users find your site?

I read just enough to decide it isn't really worth reading. I love the articles OK Cupid does, with hard statistical data backing up their inferences about similar social stuff. This does not strike me as being of that ilk.

I am disappointed. I was recently thinking about how people are judged based on looks (and blogged about it) so was hoping for/looking forward to something meatier.

Your graphs appear to be very misleading, and there is little to be learned from the data. Learn some data analysis, and learn how not to introduce bias via graphs.

I think they drove too far into the details considering their sample size.

It is an interesting study so I hope they update the post once they have been in business longer.

These graphs could use some error bars.

Indeed. While the results are very interesting, and some of them are clearly very strong differences, I'd like to see some t-tests.
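For anyone curious, a t-test on two groups of ratings is only a few lines (the sample values below are made up):

```python
import math

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variance
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# Hypothetical per-group ratings; as a rough rule of thumb,
# |t| > ~2 is needed for significance at the 95% level.
t = welch_t([6.1, 5.8, 6.4, 6.0, 5.9], [5.7, 5.5, 6.0, 5.6, 5.8])
```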

When you mention the strong differences, are you accounting for the y-axis starting at varying locations? Once I realized this, my confidence in the results dropped considerably.

That's usually done when you have some idea of the error in what you're measuring. They are just reporting on a social poll they did; isn't that usually a bit different? I mean, of course this isn't a rigorous scientific study, but that doesn't mean it's useless either.

They're presenting averages, which should always be accompanied by standard deviations; otherwise they're close to meaningless. I actually think box plots would be better here.

Indeed. It's hard to tell if any of the differences are significant, particularly when you compare across so many traits -- the chances of a false positive are very high.

I have a statistics final exam coming up; if the complete data set were available, I'd love to play with it for practice.
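On the false-positive point, a quick back-of-the-envelope (the number of comparisons is a guess, not a count from the post):

```python
# With many trait comparisons, some will look "significant" purely
# by chance. Sketch: ~20 comparisons at a nominal 5% level.
n_tests = 20
alpha = 0.05

# Chance of at least one false positive if you don't correct:
p_any_false_positive = 1 - (1 - alpha) ** n_tests   # ~64%

# A Bonferroni correction (conservative but simple):
corrected = alpha / n_tests                         # p < 0.0025 per test
```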

Why do people assume that the parameters are independent?

If most black women who sent in their photos are fat, and people don't rate fat women highly, then black women will be rated low not because of race but because being a black woman and being fat are correlated in the sample data.

Owners of such sites have large samples of data, and they assume that large equals representative. They go on slicing their data by different parameters without controlling for anything, making statements that are only technically true with respect to their data but strongly misleading in many ways.

The authors conflate "extroversion" and "social skills". For example, based on his pic I'd rate this guy high on extroversion but low on social skills:


Similarly, being introverted doesn't mean you have low social skills.

I was expecting a video. How can you know that he has low social skills from just a picture? Just because of the meathead stereotype?

I don't know whether he has low social skills; I'm making a judgment based on his pic (as if I had seen it on judg.me).

"Similarly, being introverted doesn't mean you have low social skills."

I don't know how you can say that.

The definition here: http://dictionary.reference.com/browse/introvert?s=t

implies low social skills.

Even when people discuss it here, they talk about wanting to be left alone, not going to parties because they aren't interested in socializing, and feeling "weak" after socializing for a short amount of time.

With all of this, I don't know how your social skills could ever be considered high.

As an introvert, I can be a social butterfly at parties but need to leave after an hour because it feels like work. As an extrovert, I can be a wallflower but want to stay out all night because the ambiance is energizing.

No, that definition you linked to does not imply low social skills.

Social skills are the ability to make conversation, to make people feel comfortable, to engage others, etc.

Extroversion is the desire to do those things.

Yes, they are clearly correlated, but by absolutely no means interchangeable. There are plenty of introverts who have perfectly decent social skills, and plenty of extroverts who have terrible social skills, because the desire for something is not the same as having a skill for it. For example, an adolescent extrovert may feel strong motivation to be outgoing, loud, talkative, and active, but that doesn't mean he will have good social skills and be polite.

The most common psychological definition of extrovert and introvert that we currently use was largely shaped by Carl Jung, and his definition involves far more than just sociability. However, sociability is the easiest application. What it really comes down to (as the link you provided points out) is a focus on the "external world" vs the "internal world", so I use the word "are" in those two definitions above very loosely.

Introvert has a definition which is colloquial, which is what you're pointing to.

It also has a definition used in psychology, which is something completely different from the colloquial one.

Enjoying socializing (extraversion) != being good at it (social skills)

the y axes vary a great deal. there's no information on the distribution of ratings for each class. really difficult to tell if there's anything meaningful or even interesting here at all.

Awesome analysis, though I agree the results look like they're within the error bars.

I know you mention a random sample of 1000 images, but what were your overall metrics? Did you have a good data set across the board (i.e. as many Hispanic females as Caucasian males)? What kind of advertising did you do as well?

The reason I ask is that I've been working on building a face-morpher based on different criteria (make you look 80, fat, African), and these are some of the questions bouncing around in my head about how to collect the data.

Error bars!

Law of large numbers says that your error should scale like 1/sqrt(N) where N is the sample size. In this case N = 1000, so 1/sqrt(N) ~ 3%

This is 1 STD (68% of values lie within ~3% of the reported value). To be on the safe side you should take 2 or 3 STDs for the error bars. This already nullifies most of the results!
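In code, the rule of thumb works out to (a sketch):

```python
import math

# Rule of thumb from above: relative error scales like 1/sqrt(N).
N = 1000
one_std = 1 / math.sqrt(N)   # ~0.032, i.e. ~3.2% at one standard deviation
safe_margin = 3 * one_std    # ~9.5% at three standard deviations
```

By this estimate, even the 4.3% gender gap mentioned elsewhere in the thread sits comfortably inside the three-STD margin.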

I'm sorry, but there's no 'unsexy data crunching' here - just a series of ratios compared against one another. There is a whole body of statistical literature about how to do anything of this kind, and they haven't done any of it. I'd quite happily believe that none of these differences have any kind of significance in the statistical sense (i.e., it's due to background variation). But then again, I wasn't given any information to know whether they've even looked. So I can't say...

Prof. Dan Ariely mentioned the Hot or Not website in one of his books. He used the site to get his attractiveness score and other interesting data. The book is a great read, with an analysis of how people perceive you based on looks. As for the website, I think judg.me looks very promising as a source of social data, which is otherwise very difficult to obtain.

That's very interesting - I had no idea. Do you know which book it is?

It's "The Upside of Irrationality", Chapter 7. Great read.

Thanks, appreciate it! I'll try to get a copy.

These people are using the wrong definition of extroversion.

The actual site rates extroversion vs. introversion, but the analysis here mistakenly uses the term "social scale", implying that extroversion and sociability are interchangeable. They are correlated, but by absolutely no means interchangeable. The analysis should have stuck with the original vocabulary more consistently.

For someone in his early 30s whose hair is starting to thin out, the results are interesting, though expected. I won't lose many social points but will pick up a good amount of perceived smartness when the baldness battle is finally lost. And to throw the "I'm a fun-loving extrovert" vibe out there for special occasions, I'll just throw on some shades.

Long live Hot or Not! I don't want to like this sort of site. I'm ashamed that I read the whole post (and found it interesting).

This is cool data, but it would be best if you could release numbers about the distribution more than just the average, ie standard deviation, quartile, medians.

It is hard to determine significance from these graphs, especially as pflats commented that the y-axis are skewed.

Okay, that's it, I'm cutting off my hair. Everyone seems to hate long hair on men.

But I wonder how many long-haired men were in the sample. They are quite rare, and a few ponytailed grad students might lower the score.

The comments here are depressing. Why can't you just enjoy it for what it is? No-one believes this is pioneering research so you don't need to analyze it as such.

That's okay. :) I'm learning quite a bit from the comments, and it's always good to know what others think.

Upvoted. I hope you do turn this into a source for solid analysis of this sort.

Best of luck.

I come here to read informed analysis and thoughts. If you prefer misguided positive commentary instead of critical analysis I suggest you don't read HN comments.

I just wish I could downmod such stark idiocy.

here is a 2D scatter plot of the smart/social ratings: https://docs.google.com/spreadsheet/oimg?key=0AmoarnvJ2W0ndF... . It is still very misleading without axes going to zero, but at least you can see the purported differences clearly.

I'm confused as to what Perceived Smartness `div` Extroversion is meant to represent. Or is it implying that extroverts are perceived as smart?

They phrased that poorly, but the graphs make it pretty clear. Every picture was rated for "looks intelligent" and "looks extroverted". The two are orthogonal in presentation to the voter and as presented in the graphs. The blue bars are for the "smartness" ratings and the red bars are for the "extroversion" ratings.

Be asian and bald = smart! It ain't that simple!

And female?

I guess all future photos of me will have to be full-body shots, with a five o'clock shadow, outside, smiling, and dyed gray hair.

According to this, the smartest, most sociable people should be smiling bald Indian women with glasses and five o'clock shadows.

What would be interesting to see is the likely decline in intelligence as the user base increases.

I had to stop reading after the following entered their glossary:

sistas, ladies.

Sorry but that is a turn-off to me.

I'll give you sistas. But ladies? Really? Why?

It's the context.

It's pretend deference and, to me, mildly sexist. He said men but not 'gentlemen'.

So, please mark what data is statistically significant at a 95% level?

Oh good, just what the Internet needed: two-axis hot-or-not

Nothing says sociable like an iced grill ..

So in other words (for men) to get the most responses from dating sites, be happy and white and wear sunglasses in an outdoor setting with a 5 o'clock shadow and show your fit body?

In other words, be a rugged, outdoorsy, all-American white guy.

Pretty sure this is only confirming what was already common knowledge.

