The Real ‘Stuff White People Like’ (okcupid.com)
204 points by dfreidin on Sept 8, 2010 | 82 comments

On the proper use of this data:

Perhaps the easiest way to think of it is that the phrases are predictors for the race/sex, not the other way around. For example, you shouldn't expect every white male you meet to like Van Halen. However if someone says to you "I have a friend who's a big Van Halen fan", you're pretty safe in assuming that the friend is a white male.

Likewise, it might be that only 10% of blacks like soul food. But if almost no other demographics like it, it will still show up high on their list. So "is black" does not strongly imply "loves soul food", but "loves soul food" does strongly imply "is black".

In other words, http://en.wikipedia.org/wiki/Bayes_theorem
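To make the direction of the inference concrete, here is a minimal sketch of Bayes' theorem for a two-way split (one group versus everyone else). The numbers are made up for illustration and are not OkCupid's; only the "10% of the group likes it" figure echoes the comment above.

```python
def posterior(prior: float, likelihood: float, likelihood_other: float) -> float:
    """P(group | phrase) via Bayes' theorem, group vs. everyone else."""
    evidence = likelihood * prior + likelihood_other * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical inputs (assumptions, not site data):
#   10% of users belong to the group,
#   10% of the group mentions the phrase,
#   0.2% of everyone else mentions it.
p = posterior(prior=0.10, likelihood=0.10, likelihood_other=0.002)
print(f"P(phrase | group) = 0.10, P(group | phrase) = {p:.2f}")
```

With these assumed inputs, "is in the group" only weakly predicts the phrase (10%), while the phrase predicts the group with probability of about 0.85.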

a commenter on there made a good point about the "soul food" term, where that's also the name of a movie, book and tv series so it ends up being 3 (probably black) data points all counting towards 1.

For that reason, I'd be happier if they sorted by KL Divergence instead of just log-odds. That'd give a much better tradeoff between commonality and predictive power.

You have a valid point but I'm guessing they avoided kl-divergence because it's 1) harder to explain 2) it's not symmetric, e.g. kl(asian, white) != kl(white, asian) 3) it needs a smoothing function for comparing distributions where not every element is in both distributions.
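Points 2 and 3 are easy to see in a small sketch. The phrase counts below are invented, and add-one (Laplace) smoothing is just one possible choice of smoothing function, not something OkCupid is known to use.

```python
import math


def smooth(counts: dict[str, int], vocab: list[str], alpha: float = 1.0) -> list[float]:
    """Additive smoothing so every phrase in vocab gets nonzero probability."""
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return [(counts.get(w, 0) + alpha) / total for w in vocab]


def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; requires q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical phrase counts for two groups; "soul food" never appears in group_b,
# which would make the unsmoothed divergence infinite.
group_a = {"soul food": 30, "sushi": 5, "van halen": 1}
group_b = {"sushi": 40, "van halen": 20}

vocab = sorted(set(group_a) | set(group_b))
p = smooth(group_a, vocab)
q = smooth(group_b, vocab)

kl_pq = kl(p, q)
kl_qp = kl(q, p)
print(f"kl(a, b) = {kl_pq:.3f}, kl(b, a) = {kl_qp:.3f}")
```

The two directions give different values, and without the smoothing step the missing phrase would cause a division by zero.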

There's just so much that they execute well on that I hate to pick any bit of it, but one thing everybody with linkbait should probably do is create something spiritually similar to the bar which pops up when you're done with the article. It is a force multiplier for all pillar content you write, it increases the viral factor, and the way it grabs someone's attention just when their brain is known to be vacant is sixteen flavors of brilliant.

I did something very similar for a client today, and after I get a little better at manipulating code to do it, I'm probably going to try something similar for getting trial signups. ("Looks like you're done reading about it. Feeling confused about what to do next? WHAM, signup box.")

Interesting, I was thinking the exact opposite, since those bars are all super annoying.

So there are very few things I would be less likely to read than "47 Shocking New Ways To Please Your Man In Bed", but on an intellectual level I understand that that formula has made Cosmo more money than I will ever see in my lifetime.

You are welcome to your preference, of course, but the relevant question for my client is "Do Tuesdays and people like him control a lot of links? Would he link to the article in the absence of this widget? Is he going to refrain from linking to the article now? Will the aggregate number of links we lose offset the massive gain we are expecting to get [and are capable of measuring] from less-opinionated users?"

If the answer to any one of those questions is "no", your preferences are not economically relevant to my client. (Similarly, in the context of doing software trials with a similar mechanism, I am going to go out on a limb and say the intersection of "hates Javascript tomfoolery like this" and "pays money for software" is so low that I am incapable of devising a system precise enough to measure it.)

Oh wow, that is annoying.

I do like the pop-out bars that the NYT has at the end of their articles, though. See, for example, http://www.nytimes.com/2010/09/07/business/economy/07jobs.ht...

Interesting. I also was reminded of the NYT popout bars, but I hate them. Pop-anything that I didn't ask for really bothers me.

I'm the same way. Whenever I encounter one, I tool around in Adblock Plus for a while until I find the appropriate scripts to block.

They're one of the factors finally pushing me towards just a default-block policy (the other factor is sites that pop something up when you mouseover a link). I've tried really hard to start with a blank adblock list and individually block sites that have particularly egregious stuff, but the whack-a-mole is getting ridiculous. Haven't quite given in yet, but if I don't reach a fixed point soon where I can stop adding 5 new things a day, I'm just going to subscribe to a blocklist.

I wonder how hard it'd be, alternately, to write a heuristic to block absolutely-positioned top/bottom bars. May play around with that a bit.

I too was intrigued by them and thought to myself just a week ago: "Those popup-at-the-bottom-of-the-page things are brilliant. I should email the Bingo guy and get him to A/B test them."

I am rather more receptive to receiving that email than most companies would be, and if it comes with an offer to create MIT-licensed Javascript/CSS to make the UI happen, I'd probably have it live within an hour of receiving the code.

That same test has been on my agenda since at least April, but front end is my weakest skill, so I keep pushing it back. I've got two partial implementations where I tried to follow tutorials and they just blew up on me. (I just started scheming and dreaming about hire #1 yesterday, and it will almost certainly be a front end guy.)

Interesting read, a couple things come to mind though: how does "the people who have ok cupid profiles" vary against "the general population". Several things I suspect are skewed because of the population bias.

Also, they say most of their users are urban, but I'm curious if people aren't prone to list themselves as the nearest big city rather than where they really live. For instance, I suspect everyone within 45 minutes of Des Moines is listing themselves as living there, rather than the tiny farm town / suburb they really live in.

Generally, OKC's demographics are perceived as being college age, liberal, middle class, and alt friendly.

I'm not sure they've ever directly posted demographic data, but that's their perceived demographic, at least.

Agreed. The white male info made me think "divorced 30-45 yo professional type" quite distinctly. (I'm a young white guy.) Also, I bet this thread would get seriously contentious if not for the fact that the HN audience was depicted quite well there (based on my imaginary demographic profile of Hacker News).

I found this, like most of the other blog posts by the OK Cupid team, pretty genius. I am glad the people who sit on top of this goldmine of social information have a good sense of humor.

It would be cool to see a statistician guest post. The OK Cupid people are great at coming up with ideas for analysis, but I'd love to see some solid stats behind some of their analyses.

It's interesting, I am asian, like soul food, but it's not something that would occur to me as putting on my profile. Similarly, I would write sashimi (if I ate it anymore) versus sushi, and I suspect that non-asians like sashimi just fine but wouldn't know to put it on... So the statistics point to self-cultural broadcasting, I think, more than preferences.

I didn't know what "soul food" meant until I just looked it up on Wikipedia. Looking over the list[0], I see a number of things I greatly enjoy, but I'd normally just call it "food", or possibly "Southern food".

I'm from the lower part of California's central valley, which has a distinctly Southern influence due to the dustbowl migration. My town is roughly 40% Hispanic, 45% White, and 5% Pacific Islander, so if the term is only widely used among the black population, it's no wonder I've never heard it.

[0]: http://en.wikipedia.org/wiki/List_of_soul_food_items

the term is widely used among the population, not just the black population, in the mid-atlantic up to new york, chicago, and the south.

Sometimes soul food is conflated with Jamaican food (jamaican soul) but for the most part when I think soul food, I think, okra, ham hocks/hambacks, collard and turnip greens, fried chicken, catfish, biscuits, gravy, cornbread, etc. I can't vouch for all the stuff in that wikipedia list, some of it I thought huh?

Since I stopped eating meat except for one day a week, a lot of that has gone out the window.

On the other hand this is a data point.

Knowing that the stuff you like with fish is called "sashimi" and the rice stuff is called "sushi" is a good indicator that you are asian.

Just like knowing that different types of cheese should actually taste different is a good indicator that you are european.

> Knowing that the stuff you like with fish is called "sashimi" and the rice stuff is called "sushi" is a good indicator that you are asian.

Probably true statistically, but there are plenty of non-Asians like me who know the difference (sashimi is the raw fish, sushi is that which has sushi rice). But I'll never show up on their tests because I don't have an OKCupid profile and neither food is a favorite (they're okay, but definitely not favorites).

>Just like knowing that different types of cheese should actually taste different is a good indicator that you are european.

This is what prolonged exposure to soylent green does to you.

Just kidding, we love Americans.

Continental European. (OK, the British have Stilton.)

Those results are highly odd, but I don't think okcupid is mainstream enough to really glean any insight into racial psychology (if such a thing exists) from their data. Furthermore, they don't provide enough info on their analysis method, but I would be interested in seeing the results of a null-run: randomly assigning profiles to groups (rather than by race) and seeing what "statistically distinct" phrases arise (if the analysis is valid no phrases should arise).
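A null-run like that is essentially a permutation test. Here is a rough sketch on toy data: the profiles, group labels, and the `top_ratio` helper are all invented for illustration, and the score used (group frequency divided by overall frequency) is only a guess at the kind of statistic being ranked.

```python
import random


def top_ratio(profiles, labels, group):
    """Highest (group frequency / overall frequency) over all phrases."""
    in_group = [p for p, g in zip(profiles, labels) if g == group]
    best = 0.0
    for phrase in set().union(*profiles):
        overall = sum(phrase in p for p in profiles) / len(profiles)
        within = sum(phrase in p for p in in_group) / len(in_group)
        best = max(best, within / overall)
    return best


# Toy data: 200 profiles (sets of phrases), 100 per group.
# "music" is spread evenly; "soul food" is concentrated in group "a".
profiles, labels = [], []
for i in range(200):
    group = "a" if i < 100 else "b"
    phrases = set()
    if i % 2 == 0:
        phrases.add("music")
    if i < 30 or i >= 198:
        phrases.add("soul food")
    profiles.append(phrases)
    labels.append(group)

observed = top_ratio(profiles, labels, "a")

# Null-run: shuffle the group labels and recompute the same statistic.
rng = random.Random(0)
null = []
for _ in range(500):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    null.append(top_ratio(profiles, shuffled, "a"))

p_value = sum(r >= observed for r in null) / len(null)
print(f"observed top ratio = {observed:.3f}, null-run p-value = {p_value:.3f}")
```

If the real labels carry no information, the observed ratio should look like a typical draw from the shuffled runs; a ratio far outside that range suggests the phrase list is not just noise.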

It would also be interesting to see them do the same analysis for other features such as height, income, photo attractiveness, etc.

Similar analysis for craigslist personals by city: http://blog.kiwitobes.com/?p=42

They do have an enormous sample size. Much, much larger than any rigorous scientific studies based on getting undergrads to fill out surveys.

They've done similar analyses for things like self-reported heights, and photos:



Not really sure why this is getting upvoted and mine down, but it doesn't matter how large their sample size is since it's restricted to people who voluntarily sign up for an online dating website, specifically okcupid. As okc becomes more mainstream the data is more valid for the general population. But sample size is irrelevant here.

Fair enough. Although the data is presented in a non-rigorous, light-hearted manner, I guess a more accurate title would be "Stuff single, web savvy white people like"

I'm really surprised by how they didn't mention one of the most striking results from the data: Latinos on OKCupid are much more likely to have the word "stationed" in their profile than other demographics. Based on this, it looks like the military contains a large proportion of Latinos ("stationed in [location]"). What are the demographics of the military versus the general population?

According to http://www.armyg1.army.mil/HR/docs/demographics/FY05%20Army%...

Army: White 63.9%, Black 19.0%, Hispanic 10.3%, Asian 3.8%

USA: White 75%, Black 12%, Hispanic 15%, Asian 4%

So Hispanics are underrepresented in the army and blacks overrepresented. Of course, you would have to balance these by age profile for each group and also consider US overseas territories whose residents can join the army.

Maybe we looked at the wrong service, since the latino profile also mentioned "marines":


According to this 2007 report on marines: http://www.cna.org/documents/D0016910.A1.pdf

Young blacks became underrepresented after the Iraq wars, but high NCO ranks are still heavily overrepresented. Conversely, marine accessions are on the rise for young hispanics, and they are now slightly overrepresented. High NCO ranks are underrepresented, but diminishingly so.

So this is definitely a trend, but not a major one. Those mentions may have more to do with what contributes to the trend than be caused by it.

Sorry not American - didn't realize army didn't include marines.

Really interesting, but isn't this more about what white people want other people to think that they like, rather than what they actually like?

It would be interesting to compare this to what they actually like, but I have no idea how to get that data.

Facebook maybe? And I don't mean pulling information from a profile but rather gleaning clues from status updates and comments. I would love to see a public vs private list of works these people would assign to themselves.

Eh their analysis method is not too hot. From the comments section:

"The phrases included in the black boxes are the top 50 phrases most statistically correlated to that group. We calculated this as follows:

1. We calculated the frequency of every 1, 2, and 3 word phrase for the whole population. 2. We calculated those same frequencies within each race/gender pair. 3. For each phrase, we divided #2 by #1. 4. This is the propensity of a given group to use a given phrase. 5. The list you see is the phrases with the 50 highest ratios of #2/#1."
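The quoted steps can be sketched roughly as follows. This is not OkCupid's code: the toy profiles are invented, and counting a phrase once per profile (rather than once per occurrence) is an assumption, since the quote does not say which they used.

```python
from collections import Counter


def ngrams(text, max_n=3):
    """All distinct 1-, 2-, and 3-word phrases in one profile."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def phrase_freq(profiles):
    """Steps 1 and 2: fraction of profiles containing each phrase."""
    counts = Counter()
    for text in profiles:
        counts.update(ngrams(text))
    return {phrase: c / len(profiles) for phrase, c in counts.items()}


def top_phrases(all_profiles, group_profiles, k=50):
    """Steps 3 to 5: ratio of group frequency to overall frequency, top k."""
    overall = phrase_freq(all_profiles)
    within = phrase_freq(group_profiles)
    ratios = {phrase: within[phrase] / overall[phrase] for phrase in within}
    return sorted(ratios.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical data: group_profiles must be a subset of all_profiles.
group_profiles = ["i love soul food", "soul food and music"]
other_profiles = ["i love music", "music and travel"]
all_profiles = group_profiles + other_profiles

for phrase, ratio in top_phrases(all_profiles, group_profiles, k=5):
    print(f"{ratio:.2f}  {phrase}")
```

Note that a plain ratio has no notion of how many profiles back it up, so a phrase seen in a handful of profiles can outrank a common one.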

So even if a group uses a phrase 1.001x more than the population average, it might still be listed, if there are no actual phrase-usage differences (i.e., all phrase ratios will be small, and the top 50 will be arbitrary).

They're talking about the top fifty such phrases. It seems unlikely that there would be a demographic group that is proportionately less likely to use any arbitrary phrase. The only possibility is for a group's phrase usage to be statistically indistinguishable from the average.

Fortunately, we can perform a sanity check: read some of the phrases to someone, and ask that person which group they think the phrases came from. I bet people will guess with high enough accuracy to establish that it's nonrandom.

I'm not saying the results were random. I'm saying they're not really allowing for a "confidence interval" in how many more times a certain group uses a phrase than average. For example if black men use "soul food" 30x greater than average that seems like a solid result. But if it's only 1.01x more than average that seems like noise.

Max Shron, OkCupid Data Scientist here.

In an older version of the post, we did have the actual numbers, but they didn't seem to add much. Black women use "soul food" 20 times (!) more frequently than the site-wide average; for black men, "soul food" is 11 times as frequent.

AFAIK, nothing we put up for this article is less than twice as frequent for that group as it is for the general (OkC) population.

RE: Soul Food, one of the comments on the OKC site mentioned that Soul Food applies not only to a style of cuisine, but also to a movie and subsequent spin-off series on HBO.

The reason it pops up so frequently is because the movie is hailed within the black community as one of the few Hollywood productions that presents black people as multi-dimensional humans that deal with a number of problems related to race, class, and life in general. It's a classic.

Just for completeness, you guys should compute g-test statistics for this so that the statisticians see something they're used to.


I'll check it out. Thanks!

I think it should actually give you better results. It's monotonic in kl divergence, and does a much better job of taking into account how common the feature is rather than just how different it is. You no longer need to do things like throwing out phrases that appear less than x times if you use it.
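For anyone curious what that looks like, here is a small sketch of the G statistic on a 2x2 table (phrase present/absent by group/rest). The counts are hypothetical. G is 2N times the KL divergence between the observed table and the table expected under independence, which is the connection mentioned above.

```python
import math


def g_test(k_group, n_group, k_rest, n_rest):
    """G statistic for a 2x2 table: phrase present/absent by group/rest."""
    observed = [[k_group, n_group - k_group], [k_rest, n_rest - k_rest]]
    total = n_group + n_rest
    row_sums = [n_group, n_rest]
    col_sums = [k_group + k_rest, total - k_group - k_rest]
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
                e = row_sums[i] * col_sums[j] / total
                g += o * math.log(o / e)
    return 2.0 * g


# Hypothetical counts: 1,000 profiles in the group, 9,000 in the rest.
# Rare phrase: 3 vs 1 users, so the group uses it 7.5x the site-wide rate.
g_rare = g_test(3, 1000, 1, 9000)
# Common phrase: 200 vs 900 users, about 1.8x the site-wide rate.
g_common = g_test(200, 1000, 900, 9000)
print(f"rare phrase G = {g_rare:.1f}, common phrase G = {g_common:.1f}")
```

The rare phrase has the larger ratio, but the common phrase gets the far larger G because many more profiles support it, so no minimum-count cutoff is needed.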

Fair enough. OKC should really release an api for people to pull data for their own analyses. Make your data viral!

If they only use a phrase 1.001x more than the population average, it's going to be much further down on the list than the top 50, unless they use nearly every phrase less than the average (which, barring some extremely pathological outlier cases, is impossible).

>Basically this is just noise, and a null-run (random group assignments) would produce just as valid results.

You didn't substantiate this claim.

I thought it was interesting how the largest countries aren't the most nationalistic - no Brazil, Mexico, China, Japan, etc. I also came away with an identity crisis - #1 good food (Soul) and seeing Mos Def, Lupe Fiasco and Talib Kweli in the top "stuff"... dammit I might be black.


As a black man it was refreshing to see a "description" of my demographic that I actually related to, as opposed to a stereotype-based one that you might find on BET.

I liked that Mos Def and Lupe Fiasco beat out Kanye.

I doubt I'm alone here, but when confronted with the "insert fucking theory" I promptly went through the list of what white people like, inserting "fucking" anywhere I could...

"Groundhog Fucking Day" kind of left a bad taste in my mouth.

Re: "I am cool" being in the top phrases. Could be used in a sentence, like "I am cool with that".

Just saying.

I am white and like none of those things other than guitar and software. I suspect the case is similar for most of the HN readership.

This should be called "what the lowest common denominator likes".

It struck me as sort of funny that the minorities each had a common self-description (cool, funny, simple), but the closest thing white guys have is "I'm a country boy".

Judging by the stats (The Red Sox), we have strong US East Coast bias (due to user base).

I noticed this as well and came to the same conclusion considering I've never met a single Red Sox fan on the left coast. Although, it could be that east coasters all rally around one team, whereas we're split between the Dodgers and the Giants, which would present a false bias.

You have now.

seventh word for asian males: software developer....

Yeah I found that a bit strange... mechanical engineer was there too... Also, no Japanese? Many of the other major asian countries were broadcasting.

a software engineer #3 for indians.

Yeah Sox!!

I'm taking this as scientific evidence that liking the Red Sox makes people more attractive.

Considering this is based on profiles from users at a dating site, wouldn't the inverse be more likely?

While normally, I'd agree with you, OKC has a high poly and hookup population, so normal relationship scarcity constraints do not necessarily apply.

Well, one could say it's perceived as attractive, since they're putting it on their profiles.

Of course, I wasn't really attempting to make a serious point. And the Sox are in a distant 3rd in the AL East right now anyways, which presumably explains the downvotes :)

That last stat about reading level bothers me. "Ok, before anyone gets offended about reading level vs race, let's show you a stat that confirms another stereotype: religious people are stupid! And atheists are smartest of all! Scientifically proven with a reading test based on the lengths of words, and metrics I just made up. And don't worry that almost half of the data points belie my analysis, ha ha ha, it confirms your prejudices, so it's ok!"

If the goal of black and latino OKC members is to attract members of the opposite sex within their race (which is most likely the case), then it would make sense that they would communicate in their own vernacular.

That is, a black person who is otherwise educated might use a phrase that makes perfect sense to other black people (e.g. "where dey do dat at?"; often used to express confusion at someone's ridiculous behavior) but that isn't clear to the mainstream. Latinos may pepper their profiles with Spanish words.

Asians might do this too if there wasn't so much ethnic/linguistic diversity among the Asian American population. As such, they likely use "safe" mainstream wording.

All this is to say that there are reasons that have nothing to do with intelligence that could have caused this sort of result.

My technical writing professor always told us to try and reduce the complexity of our writing (I think she used Flesch-Kincaid). Perhaps more of us white people have had similar teachers, and took their advice to heart.*

* Unlikely, but hey, I might as well bring it up, since we're talking about flaws in the analysis.

> Scientifically proven with a reading test based on the lengths of words, and metrics I just made up

Actually they're using the Coleman-Liau Index, a computer readability formula going back to 1975.
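For reference, the published Coleman-Liau formula is CLI = 0.0588 L - 0.296 S - 15.8, where L is letters per 100 words and S is sentences per 100 words. Below is a minimal sketch; the word and sentence splitting is deliberately naive and is an assumption here, since nobody outside OkCupid knows how they tokenized profiles.

```python
import re


def coleman_liau(text: str) -> float:
    """Coleman-Liau index with naive word and sentence splitting."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("text contains no words")
    letters = sum(ch.isalpha() for word in words for ch in word)
    # Count runs of terminal punctuation as sentence ends; assume at least one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = sentences / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


simple = "I like food. I like fun."
wordy = "Statistical methodology requires considerable sophistication."
print(f"simple: {coleman_liau(simple):.1f}, wordy: {coleman_liau(wordy):.1f}")
```

The score is driven almost entirely by average word length and sentence length, which is exactly the property being argued about in this subthread.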

The metric about adherence was made up and without explanation of how they categorized people. The Coleman-Liau Index is based on lengths of words, which proves... what, exactly? Many of the "most" religious people scored higher than the ones rated "meh". Also, agnostics are in the middle of the pack.

So what does this chart predict? That certain self-selected religious categories in their dataset correlate with word length in the essays. Is that good statistics, or a confirmation of prejudices, ie, religious people are less intelligent?

With all of those charts they passed over with "no comment", they saw fit to make a joke about a "Comic Sans Bible".

It bothers me because it is statistically shaky and baldly prejudiced. I would love to have a conversation about that, and I wish people would engage instead of downvote.

I don't know if you've kept up with this OkCupid series, but there has always been an undercurrent of snide commentary towards Christians specifically and religious people in general.

I especially "enjoyed" the breakdown by race. They admit that the site is primarily American in demographic, but there is no accounting for language proficiency being a barrier to entry to the site for users from non-English speaking cultures. If you care enough to learn English as a second language to the point where you can use OkCupid, then I expect you will learn it to a greater degree than a minimally literate, native English speaker.

Si -- they should have corrected for second languages and proficiency, if they were serious. My big beef is the implication that word length obviously correlates with intelligence instead of, say, pretension.

They probably could have been more scientific with their study as you're pointing out. However, despite the flaws, their conclusions seem to be consistent with actual, scientific studies of the correlation between IQ and Religiosity:


Sure, it seems to agree with those studies. But let's suppose that the religion vs word length chart is a valid measure of intelligence.

Are you then prepared to defend the same conclusion about the race vs word length chart? Or will you start looking for confounding factors, eg, Latinos trend Catholic? If so, think about why you accepted the religion claim on its face, but examined the race claim more carefully.

I'm not defending anything, just pointing out as a matter of interest that there's a consistency between religion/word length, which OKCupid implies is an indication of intelligence in their article, and religion/intelligence in actual academic studies.

So, there is at least an apparent correlation between word length and intelligence for whatever that's worth.

If you're trying to draw me into a race/intelligence discussion based on possible implications of the correlation... no thanks.

Because religion is a choice based on personal beliefs and/or reason (but usually the former). Race is a state of being. I'm not quite sure what you're trying to get at.

He's saying that if you consider the race results as a control of sorts, and you operate under the assumption that there is nothing inherent in race that would create the results shown, it is obvious that the variable being measured is not well controlled, and there is plenty of noise in the results.

Religion is mostly based on what you grow up with, i.e. what your parents told you. Turns out, race is also highly correlated to your parents.

Word length also correlates well with being German or Dutch.

They also allow you to input the languages you speak (which include C++ and Lisp, btw) and how well you speak them and then write your profile in each of those languages, so they have extra data about it.

You can take the data without the commentary, though. The results actually fit nicely with a "pseudointellectual, tries to sound smart" stereotype I have about internet atheists. (I'm an atheist myself, but can't stand reddit.com/r/atheism types.)

Yes, there's not too much to talk about in atheism. (Or at least not much worth listening to.)

At least for Muslims, there is an obvious flaw since Islam forbids dating. So people that self-identified as "very serious" are pretty much by-definition not that. And since it is, among all the practices of Muslims, one that is popularly known and probably even overemphasized (relative to other practices), a large population of those that self-identified as "somewhat serious" are probably guilty of being a bit generous with themselves.

At the end of the day, of course, correlation does not imply causation. If religious people get their panties in a bunch and turn overly defensive, or irreligious people start gloating, both are simply demonstrating that their dogma is overriding their intelligence.
