Perhaps the easiest way to think of it is that the phrases are predictors for the race/sex, not the other way around. For example, you shouldn't expect every white male you meet to like Van Halen. However if someone says to you "I have a friend who's a big Van Halen fan", you're pretty safe in assuming that the friend is a white male.
Likewise, it might be that only 10% of blacks like soul food. But if almost no other demographics like it, it will still show up high on their list. So "is black" does not strongly imply "loves soul food", but "loves soul food" does strongly imply "is black".
In other words, http://en.wikipedia.org/wiki/Bayes_theorem
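To put rough numbers on that, here is a minimal sketch of the Bayes calculation. Only the 10% figure comes from the comment above; the 12% prior and the 0.5% rate for everyone else are made-up assumptions for illustration, not real data.

```python
def posterior(prior, likelihood, likelihood_other):
    """P(group | trait) via Bayes' theorem for a two-way split.

    prior            -- P(group)
    likelihood       -- P(trait | group)
    likelihood_other -- P(trait | not group)
    """
    # Total probability of seeing the trait at all.
    evidence = likelihood * prior + likelihood_other * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical inputs: 12% of users are in the group, 10% of the group
# lists the trait, and 0.5% of everyone else does.
p = posterior(prior=0.12, likelihood=0.10, likelihood_other=0.005)
print(round(p, 3))
```

Under those assumed inputs the posterior comes out around 0.73: "is in the group" only implies "has the trait" 10% of the time, but "has the trait" implies "is in the group" about 73% of the time.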
I did something very similar for a client today, and after I get a little better at manipulating code to do it, I'm probably going to try something similar for getting trial signups. ("Looks like you're done reading about it. Feeling confused about what to do next? WHAM, signup box.")
You are welcome to your preference, of course, but the relevant question for my client is "Do Tuesdays and people like him control a lot of links? Would he link to the article in the absence of this widget? Is he going to refrain from linking to the article now? Will the aggregate number of links we lose offset the massive gain we are expecting to get [and are capable of measuring] from less-opinionated users?"
I do like the pop-out bars that the NYT has at the end of their articles, though. See, for example, http://www.nytimes.com/2010/09/07/business/economy/07jobs.ht...
I wonder how hard it'd be, alternately, to write a heuristic to block absolutely-positioned top/bottom bars. May play around with that a bit.
That same test has been on my agenda since at least April, but front end is my weakest skill, so I keep pushing it back. I've got two partial implementations where I tried to follow tutorials and they just blew up on me. (I just started scheming and dreaming about hire #1 yesterday, and it will almost certainly be a front end guy.)
Also, they say most of their users are urban, but I'm curious if people aren't prone to list themselves as the nearest big city rather than where they really live. For instance, I suspect everyone within 45 minutes of Des Moines is listing themselves as living there, rather than the tiny farm town / suburb they really live in.
I'm not sure they've ever directly posted demographic data, but that's their perceived demographic, at least.
It would be cool to see a statistician guest post. The OK Cupid people are great at coming up with ideas for analysis, but I'd love to see some solid stats behind some of their analyses.
I'm from the lower part of California's central valley, which has a distinctly Southern influence due to the dustbowl migration. My town is roughly 40% Hispanic, 45% White, and 5% Pacific Islander, so if the term is only widely used among the black population, it's no wonder I've never heard it.
Sometimes soul food is conflated with Jamaican food (Jamaican soul), but for the most part when I think soul food, I think okra, ham hocks/hambacks, collard and turnip greens, fried chicken, catfish, biscuits, gravy, cornbread, etc. I can't vouch for everything in that Wikipedia list; some of it made me think, huh?
Since I stopped eating meat except for one day a week, a lot of that has gone out the window.
Knowing that the stuff you like with fish is called "sashimi" and the rice stuff is called "sushi" is a good indicator that you are asian.
Just like knowing that different types of cheese should actually taste different is a good indicator that you are european.
Probably true statistically, but there are plenty of non-Asians like me who know the difference (sashimi is the raw fish, sushi is that which has sushi rice). But I'll never show up on their tests because I don't have an OKCupid profile and neither food is a favorite (they're okay, but definitely not favorites).
This is what prolonged exposure to soylent green does to you.
Just kidding, we love Americans.
It would also be interesting to see them do the same analysis for other features such as height, income, photo attractiveness, etc.
Similar analysis for craigslist personals by city: http://blog.kiwitobes.com/?p=42
They've done similar analyses for things like self-reported heights, and photos:
Army: White 63.9%, Black 19.0%, Hispanic 10.3%, Asian 3.8%
USA: White 75%, Black 12%, Hispanic 15%, Asian 4%
So Hispanics are underrepresented in the Army and blacks are overrepresented. Of course, you would have to balance these by age profile for each group and also consider US overseas territories whose residents can join the Army.
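A quick way to see the size of the gap is to divide each Army share by the matching national share. This is only a sketch using the four pairs of percentages quoted above; it makes no age adjustment, and the function name is just an illustrative choice.

```python
# Percentages as quoted in the comment above.
army = {"White": 63.9, "Black": 19.0, "Hispanic": 10.3, "Asian": 3.8}
usa = {"White": 75.0, "Black": 12.0, "Hispanic": 15.0, "Asian": 4.0}


def representation_ratios(group_share, population_share):
    """Ratio > 1 means overrepresented, < 1 means underrepresented."""
    return {g: group_share[g] / population_share[g] for g in group_share}


ratios = representation_ratios(army, usa)
for group, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group}: {ratio:.2f}")
```

On these figures the ratios are roughly 1.58 for Black, 0.95 for Asian, 0.85 for White, and 0.69 for Hispanic.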
Young blacks became underrepresented after the Iraq wars, but high NCO ranks are still heavily overrepresented.
Conversely, marine accessions are on the rise for young hispanics, and they are now slightly overrepresented. High NCO ranks are underrepresented, but diminishingly so.
So this is definitely a trend, but not a major one. Those mentions may have more to do with what contributes to the trend than be caused by it.
It would be interesting to compare this to what they actually like, but I have no idea how to get that data.
"The phrases included in the black boxes are the top 50 phrases most statistically correlated to that group. We calculated this as follows:
1. We calculated the frequency of every 1, 2, and 3 word phrase for the whole population.
2. We calculated those same frequencies within each race/gender pair.
3. For each phrase, we divided #2 by #1.
4. This is the propensity of a given group to use a given phrase.
5. The list you see is the phrases with the 50 highest ratios of #2/#1."
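The five steps quoted above can be sketched roughly as follows. This is not OkCupid's code: the quote doesn't say how "frequency" is defined, so this assumes it means the fraction of profiles containing the phrase, and the tokenization (lowercase, split on whitespace) and all function names are my own guesses.

```python
from collections import Counter


def ngrams(text, max_n=3):
    """Set of distinct 1-, 2-, and 3-word phrases in one profile."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def phrase_frequencies(profiles):
    """Fraction of profiles containing each phrase (assumed definition)."""
    counts = Counter()
    for text in profiles:
        counts.update(ngrams(text))
    total = len(profiles)
    return {phrase: c / total for phrase, c in counts.items()}


def top_phrases(profiles_by_group, group, k=50):
    """Steps 1-5: group frequency divided by whole-population frequency."""
    everyone = [t for texts in profiles_by_group.values() for t in texts]
    overall = phrase_frequencies(everyone)                  # step 1
    within = phrase_frequencies(profiles_by_group[group])   # step 2
    ratios = {p: f / overall[p] for p, f in within.items()} # steps 3-4
    # Step 5: highest ratios first, ties broken alphabetically.
    return sorted(ratios.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy data, purely hypothetical.
sample = {
    "a": ["i love soul food", "soul food and jazz"],
    "b": ["i love van halen", "van halen and hiking"],
}
scores = dict(top_phrases(sample, "a"))
print(scores["soul food"], scores["i love"])
```

Note that the sketch always returns the top k no matter how small the ratios are; there is no minimum ratio or significance cutoff anywhere in it.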
So even if a group uses a phrase 1.001x more than the population average, it might still be listed, if there are no actual phrase-usage differences (i.e., all phrase ratios will be small, and the top 50 will be arbitrary).
Fortunately, we can perform a sanity check: read some of the phrases to someone, and ask that person which group they think the phrases came from. I bet people will guess with high enough accuracy to establish that it's nonrandom.
In an older version of the post, we did have the actual numbers, but they didn't seem to add much. Black women use "soul food" 20 times (!) more frequently than the site-wide average; for black men, "soul food" is 11 times as frequent.
AFAIK, nothing we put up for this article is less than twice as frequent for that group as it is for the general (OkC) population.
The reason it pops up so frequently is that the movie is hailed within the black community as one of the few Hollywood productions that presents black people as multi-dimensional humans who deal with a number of problems related to race, class, and life in general. It's a classic.
You didn't substantiate this claim.
"Groundhog Fucking Day" kind of left a bad taste in my mouth.
This should be called "what the lowest common denominator likes"
I'm taking this as scientific evidence that liking the Red Sox makes people more attractive.
Of course, I wasn't really attempting to make a serious point. And the Sox are in a distant 3rd in the AL East right now anyways, which presumably explains the downvotes :)
That is, a black person who is otherwise educated might use a phrase that makes perfect sense to other black people (e.g. "where dey do dat at?"; often used to express confusion at someone's ridiculous behavior) but that isn't clear to the mainstream. Latinos may pepper their profiles with Spanish words.
Asians might do this too if there wasn't so much ethnic/linguistic diversity among the Asian American population. As such, they likely use "safe" mainstream wording.
All this is to say that there are reasons that have nothing to do with intelligence that could have caused this sort of result.
* Unlikely, but hey, I might as well bring it up, since we're talking about flaws in the analysis.
Actually they're using the Coleman-Liau Index, a computer readability formula going back to 1975.
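For anyone curious, the Coleman-Liau Index is 0.0588 * L - 0.296 * S - 15.8, where L is the average number of letters per 100 words and S is the average number of sentences per 100 words. Here is a minimal sketch; how words and sentences are counted below (whitespace splitting, runs of ".", "!" or "?") is my own simplification, and I have no idea how OkCupid tokenized their essays.

```python
import re


def coleman_liau(text):
    """Approximate Coleman-Liau Index of a piece of English text."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    # Count alphabetic characters only, ignoring punctuation and digits.
    letters = sum(ch.isalpha() for ch in text)
    # Treat each run of terminal punctuation as one sentence end;
    # fall back to 1 so unpunctuated text still gets a score.
    sentences = len(re.findall(r"[.!?]+", text)) or 1
    L = letters / len(words) * 100    # letters per 100 words
    S = sentences / len(words) * 100  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8


simple = "The cat sat. The dog ran."
wordy = "Statistical correlation requires considerable methodological sophistication."
print(coleman_liau(simple) < coleman_liau(wordy))
```

Nothing in the formula looks at meaning: longer words and longer sentences push the score up, and that's all it measures.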
So what does this chart predict? That certain self-selected religious categories in their dataset correlate with word length in the essays. Is that good statistics, or a confirmation of prejudices, i.e., that religious people are less intelligent?
With all of those charts they passed over with "no comment", they saw fit to make a joke about a "Comic Sans Bible".
It bothers me because it is statistically shaky and baldly prejudiced. I would love to have a conversation about that, and I wish people would engage instead of downvote.
I especially "enjoyed" the breakdown by race. They admit that the site is primarily American in demographic, but there is no accounting for language proficiency being a barrier to entry to the site for users from non-English speaking cultures. If you care enough to learn English as a second language to the point where you can use OkCupid, then I expect you will learn it to a greater degree than a minimally literate, native English speaker.
Are you then prepared to defend the same conclusion about the race vs word length chart? Or will you start looking for confounding factors, e.g., Latinos trend Catholic? If so, think about why you accepted the religion claim on its face, but examined the race claim more carefully.
So, there is at least an apparent correlation between word length and intelligence for whatever that's worth.
If you're trying to draw me into a race/intelligence discussion based on possible implications of the correlation... no thanks.
At the end of the day, of course, correlation does not imply causation. If religious people get their panties in a bunch and become overly defensive, or irreligious people start gloating, both are simply demonstrating that their dogma is overriding their intelligence.