
N-gram Analysis of the New York Times Weddings Section - lil_tee
http://news.rapgenius.com/Atodd-when-harvard-met-sally-n-gram-analysis-of-the-new-york-times-weddings-section-lyrics
======
rayiner
Another statistical analysis of the NYT wedding section, looking at the
occurrence frequency of certain characteristics in the NYT wedding
announcements relative to their occurrence in the general population:
[http://www.theatlanticwire.com/entertainment/2011/12/odds-ge...](http://www.theatlanticwire.com/entertainment/2011/12/odds-getting-new-york-times-wedding-section/45440).

You can clearly see the recent tech boom by searching "Google," "Facebook,"
"Twitter," and "Apple"
[http://www.weddingcrunchers.com/?q=facebook%2C%20google%2C%2...](http://www.weddingcrunchers.com/?q=facebook%2C%20google%2C%20twitter%2C%20apple&s=1).

The key takeaway here is Google:

Google has raced ahead of establishment NY law firms:
[http://www.weddingcrunchers.com/?q=wachtell%2C%20cravath%2C%...](http://www.weddingcrunchers.com/?q=wachtell%2C%20cravath%2C%20cromwell%2C%20wardwell%2C%20skadden%2C%20google&s=1).

Google has also recently overtaken top investment banks:
[http://www.weddingcrunchers.com/?q=goldman%20sachs%2C%20morg...](http://www.weddingcrunchers.com/?q=goldman%20sachs%2C%20morgan%20stanley%2C%20%20ubs%2C%20barclays%2C%20google&s=1)

Ditto for consulting:
[http://www.weddingcrunchers.com/?q=mckinsey%2C%20boston%20co...](http://www.weddingcrunchers.com/?q=mckinsey%2C%20boston%20consulting%2C%20bain%2C%20google&s=1).

When do you think Google will start hosting a debutante ball in Chelsea?

~~~
paxtonab
You mean Google's yearly Deb Ops ball?

------
jmduke
Given how horizontally expansive RapGenius is trying to be (this is the first
time I've seen NewsGenius, but I'm familiar with PoetryGenius et al -- there's
an annotated _Iliad_ that's pretty cool), I'm wondering if they'd be better
off as a layer or a plugin as opposed to a stand-alone site. I'm much less
tempted to visit the site for each individual story that pops up than I would
be to peruse the annotations as I browse normally.

Either way -- stuff like this is a delight to read.

~~~
steveklabnik
My understanding was that RapGenius was always a text annotation platform,
with the hip-hop stuff serving as an exemplar, rather than being the actual
product.

~~~
jere
Always a text annotation platform? That's a bit revisionist from what I've
read.

Fred Wilson initially told them "I think lyrics is a very crowded space and
almost entirely reliant on Google for traffic" and they admitted "our pitch
back then was a bit too lyrics-focused.."

[http://news.rapgenius.com/Lemon-how-rap-genius-raised-s18m-i...](http://news.rapgenius.com/Lemon-how-rap-genius-raised-s18m-in-seed-funding-without-knowing-what-we-were-doing-lyrics#lyric)

You can find those quotes in an annotation in the above link (which makes me
realize the problem with annotations is you can't ctrl-f them).

~~~
wwarnerandrew
Here is a permalink for that annotation:
[http://news.rapgenius.com/1900809](http://news.rapgenius.com/1900809)

You can get that by clicking "share" in the annotation footer

------
sengstrom
"This makes it possible to rigorously test our intuitions about trends like."

let me fix that for you

"This makes it possible to put numbers on our preconceived notions and play
around with them."

It may be entertaining, but rigorous? I don't think so.

~~~
mjn
Yes, generally n-gram-based analyses are a huge minefield. Computational
linguists do use them, with a _lot_ of caveats and careful analysis of
confounding factors.

One simple one that comes to mind here is that you need to analyze to what
extent changes over the period of the data set are caused by underlying
societal changes, versus changes in the NYT itself; the end result will be a
mixture of those two changes, some of which may be magnifying and others
offsetting. The 1980 NYT and the 2013 NYT are not the same newspaper, not
edited by the same people, not sold to the same readership demographics, and
not soliciting the same advertisers, so it's somewhat questionable to treat it
as a stable proxy for a social group.

Another common pitfall is language change screwing up all kinds of measures
(since n-gram models just work on word counts). For example, if two words are
used roughly interchangeably in 1980, but by 1990 one of them has fallen out
of usage, and been replaced wholly by the other one, searches for just the one
word will look like the word's on an upwards trend, but it would be misleading
to infer an increase in the underlying concept over the period. Of course, you
can account for this by merging words into equivalence classes (most analyses
will do basic stemming and merging of alternate spellings), but you have to be
very careful to get all the equivalence classes (which is not a well-defined
notion). Just a list of the top words in a year will tend to be some mixture
of 1) top concepts; and 2) concepts expressed using only a small number of
wording variations, so their count doesn't get diluted.
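
A minimal sketch of the equivalence-class point, assuming made-up counts: the words, the numbers, and the `EQUIVALENCE` mapping are all hypothetical and are not taken from the NYT data. It shows a single word appearing to double while the merged concept stays flat.

```python
from collections import Counter

# Hypothetical yearly word counts; not real NYT data.
counts_by_year = {
    1980: Counter({"attorney": 50, "lawyer": 50}),
    1990: Counter({"attorney": 0, "lawyer": 100}),
}

# Hypothetical equivalence classes: each surface word maps to one concept.
EQUIVALENCE = {"attorney": "lawyer", "lawyer": "lawyer"}


def merge_classes(counts, classes):
    """Sum word counts per equivalence class; unmapped words stay as-is."""
    merged = Counter()
    for word, n in counts.items():
        merged[classes.get(word, word)] += n
    return merged


# Trend for the single word "lawyer" versus the merged concept.
raw_trend = {year: c["lawyer"] for year, c in counts_by_year.items()}
merged_trend = {
    year: merge_classes(c, EQUIVALENCE)["lawyer"]
    for year, c in counts_by_year.items()
}

print(raw_trend)     # the single word rises from 50 to 100
print(merged_trend)  # the merged concept stays at 100
```

The hard part is not the merging but deciding what belongs in the mapping, which is exactly the ill-defined step described above.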

------
wonnage
Not to take an entertaining post too seriously, but when your graph scale
ranges from 0 - 0.02% the statistical significance is dubious.

~~~
davvolun
As you said, caveat 'entertaining' blah, blah. That said...

That's actually a slightly dubious analysis. The question you need to ask is
0.02% _of what_? In this case, I would guess it means 0.02% of all the
words analyzed. As a very simple example, imagine analyzing all the letters in
a book. If English were perfectly balanced, we would expect to see each of the
26 letters at 1/26, or about 3.8%, so seeing the letter 'e' appear at, say,
1.0% (or even 0.1%) would be a notable statistical result.
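
A quick sketch of the baseline idea, assuming an arbitrary sample sentence (the sample text and the `letter_shares` helper are illustrative, not part of the original analysis). It compares the uniform 1/26 share with the observed share of one letter.

```python
from collections import Counter


def letter_shares(text):
    """Return each letter's share of all a-z letters in the text."""
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    total = len(letters)
    return {ch: n / total for ch, n in Counter(letters).items()}


# Share each letter would have if all 26 were equally likely.
uniform_share = 1 / 26

# Arbitrary sample text, chosen only for illustration.
sample = "the quick brown fox jumps over the lazy dog"
shares = letter_shares(sample)

print(f"uniform baseline: {uniform_share:.2%}")
print(f"share of 'e' in sample: {shares['e']:.2%}")
```

Whether a small percentage is notable depends on the baseline it is compared against, not on its absolute size.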

~~~
edu
I'd say it's 0.02% of all the weddings; at least, the axis is labeled "NYT
Wedding Frequency", not "Word Frequency".
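
To make the two readings concrete, here is a small sketch on invented announcements (the texts and both helper functions are hypothetical; which definition the site actually uses is not confirmed in this thread). The same term gets very different percentages depending on the denominator.

```python
# Hypothetical announcements; not real NYT text.
announcements = [
    "the bride works at google and the groom works at google",
    "the bride is a lawyer",
    "the groom is a physician",
    "the couple met at harvard",
]


def token_frequency(term, docs):
    """Occurrences of the term divided by the total number of words."""
    tokens = [tok for doc in docs for tok in doc.split()]
    return tokens.count(term) / len(tokens)


def document_frequency(term, docs):
    """Fraction of announcements that mention the term at least once."""
    return sum(term in doc.split() for doc in docs) / len(docs)


print(f"word frequency:    {token_frequency('google', announcements):.2%}")
print(f"wedding frequency: {document_frequency('google', announcements):.2%}")
```

Under the per-announcement reading, repeated mentions inside one announcement count once, so long announcements do not inflate the figure.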

------
bigtech
Are all wedding announcements posted, or does The Times have an editorial role
in which announcements are printed?

~~~
mturmon
The Times certainly picks and chooses which are printed. That's the whole
reason for the interest in the topic. It's a perfect collision between young-
adult ambition and old-school establishment vetting.

I understand that, like college admissions, you can hire a wedding planner or
consultant who can considerably raise the chances of your wedding being
listed.

The NYT obits are another interesting read.

~~~
WillyF
We used a high-end wedding planner, and when we asked her what to do for the
NY Times announcement, she told us to submit via the form on the website.
Maybe she pulled some strings behind the scenes, but it seemed to us that she
had no pull whatsoever on that front.

~~~
larrys
I've read these for years and have known people who appeared.

The factors that enter into getting in (from my observation strictly) are a
combination of things like:

- parents who live in ny metro

- the parties getting married living in ny metro

- having gone to school in ny metro

- parents or parties getting married working in ny metro

- what the parents do for a living

- any lineage "grandparent governor of NY"

- what the parties getting married do for a living

- school attended as far as perceived impressiveness

- whether an impressive job or title of any of the parties mentioned.

..and so on. That's off the top.

For example, "physician" and "went to school in NY" is probably almost assured
to get the announcement printed.

"father a mechanic, mother a homemaker, inlaws are nobodies, parties are
cashiers who work at walmart, no college, live in jersey city"[1] and so on
either don't get in, don't care to get in, or don't have the drive to even
submit a form to get in.

[1] Unless of course one of the parties is related to a famous former
politician or some other mitigating factor.

~~~
mtdewcmu
I found my high school classmate's wedding announcement in the NY Times. She
happened to be a doctor.

------
jlgreco
I am very surprised by the prevalence of " _was_ graduated from". I have only
rarely heard that in 'real life'. Is the NYT's style guide enforcing this
usage?

~~~
lifeisstillgood
I expect so - graduation is something that happens to you

~~~
Casseres
Agreed. When people say, "I graduated college", they are saying the college
graduated from them, not that they graduated from college. The correct usage
should be, "I graduated from college."

------
JeremyMorgan
What they should do is find the divorce stats and throw that in the mix.

------
misiti3780
Really interesting - how did you guys download the 60K articles (is there an
API I do not know about)? Also - what graphing lib are you using (I see it is
not d3)?

~~~
wwarnerandrew
The graphing library is highcharts
([http://www.highcharts.com/](http://www.highcharts.com/))

------
photorized
I was expecting something more interesting, but that's probably because I like
N-gram analysis, among other things.

This is how we do it (examples below are not weddings, but random topics):

[http://blogdotitrendcorporationdotcom.files.wordpress.com/20...](http://blogdotitrendcorporationdotcom.files.wordpress.com/2013/04/2013-04-18_15-52-08.png)

[http://blog.itrendcorporation.com/2013/04/10/social-media-on...](http://blog.itrendcorporation.com/2013/04/10/social-media-on-microsofts-scroogled-ads-attacking-androids-data-sharing/)

------
mrcactu5
How were they able to collect all the wedding announcements? Doesn't NYT limit
the number of articles or what portions of text they can retrieve?

~~~
pbhjpbhj
Reminds me of something a certain Mr Swartz did ...

------
Alex3917
See also a previous scoring guide here:

[http://www.grantland.com/story/_/id/6769919/matrimonial-mone...](http://www.grantland.com/story/_/id/6769919/matrimonial-moneyball)

------
rdl
This is awesome -- it's kind of like the older Priceonomics blogs which used
quantitative analysis to uncover hidden facts in plain sight.

------
the_watcher
Was this designed specifically for Katie Baker (she writes summaries of the
NYT Wedding Section for Grantland)?

------
jsnk
Is it just me or does anyone else dislike reading articles like this with dark
background and light text?

~~~
fatjokes
Personally I prefer it. Dark text on a white background (aka, the norm) often
requires me to turn down the brightness on my laptop.

------
mrcactu5
Why is RapGenius interested in this?

~~~
rdl
I think they're 25-35 year old guys living in NYC...

------
rorrr2
Whoever uses two similar shades of blue in a chart should be beaten with a
stick until they learn more colors.

~~~
cpeterso
Learn black _and_ blue?

