
Unisex Names – Data Analysis Use Case - ScottWRobinson
https://kenandeen.wordpress.com/2015/06/20/unisex-names-data-analysis-use-case/
======
danso
I love the SSA baby names dataset, especially for teaching beginners...not
only is it relatively compact (about 30MB for the nationwide dataset) but it
is about as granular as you need (1,825,433 rows as of 2014) to do a variety
of analyses. It's a great dataset because it is interesting to almost
everyone...because almost everyone has a name, and almost everyone has a
particular interest in their own name, even if only a casual curiosity about
how many other babies were named like them, or whether their name is going out
of style. And the queries to do those are so easy (SELECT name, SUM(count) AS s
FROM babynames GROUP BY name ORDER BY s DESC), and the visualizations make sense.
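
For anyone who wants to try that kind of query, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (babynames with name, sex, year, count columns) and the rows are made up for illustration and are not the actual SSA file format. Note that a per-name total needs a GROUP BY on the name.

```python
import sqlite3

# Toy rows standing in for the SSA data. The table and column names
# (babynames: name, sex, year, count) and the numbers are assumptions.
rows = [
    ("Mary", "F", 1990, 300),
    ("Mary", "F", 1991, 200),
    ("James", "M", 1990, 400),
    ("Casey", "F", 1990, 60),
    ("Casey", "M", 1990, 90),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE babynames (name TEXT, sex TEXT, year INTEGER, count INTEGER)"
)
conn.executemany("INSERT INTO babynames VALUES (?, ?, ?, ?)", rows)

# Total babies per name, most common first. GROUP BY makes SUM(count)
# a per-name total instead of one grand total for the whole table.
totals = conn.execute(
    "SELECT name, SUM(count) AS s FROM babynames GROUP BY name ORDER BY s DESC"
).fetchall()

print(totals)  # [('Mary', 500), ('James', 400), ('Casey', 150)]
```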

But that's only the beginning...it's easy to illustrate the variety of
insights that can come from a single, straightforward dataset. The sums of
names as a timeseries show trends. Faceting the cumulative count into
buckets...such as top 1000 names vs. everyone else, lets you see what appears
to be a measurement of America's increased diversity. And this isn't even
considering all the insights that can be gained from the names-by-state count,
which is a separate and equally massive datafile.
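
The "top N names vs. everyone else" faceting could be sketched roughly like this in plain Python. The function name top_n_share, the (name, year, count) record layout, and all the numbers are hypothetical; with the real data you would use n=1000 rather than n=1.

```python
from collections import defaultdict


def top_n_share(records, n):
    """Return {year: fraction of that year's babies with a top-n name}."""
    per_year = defaultdict(lambda: defaultdict(int))
    for name, year, count in records:
        # Add up rows for the same name (e.g. one row per sex).
        per_year[year][name] += count

    shares = {}
    for year, counts in per_year.items():
        ranked = sorted(counts.values(), reverse=True)
        shares[year] = sum(ranked[:n]) / sum(ranked)
    return shares


# Made-up numbers, only to show the shape of the calculation.
records = [
    ("Mary", 1950, 80), ("Linda", 1950, 15), ("Zelda", 1950, 5),
    ("Emma", 2010, 40), ("Olivia", 2010, 35), ("Nevaeh", 2010, 25),
]

shares = top_n_share(records, n=1)
print(shares)  # {1950: 0.8, 2010: 0.4}
```

A falling share over time would mean the most popular names cover less of the population, which is the pattern described above.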

It's also a great example of how research, domain knowledge, and real-world
pragmatism are fundamental parts of data science. It seems possible that
tracking baby names could be just a matter of the SSA doing a sum/count on
their own database, which is populated by electronic form...but what about the
years before computers? For at least a century, the SSA must have had to deal
with handwritten and typed records, which means that the accuracy of the
records is as potentially flawed as the humans who manually tallied them up.
And that's barely scratching the surface of what could be wrong...The Social
Security Administration didn't just _happen_...it certainly didn't exist in
1880. And it took time for things to be the way they are now. For a long time,
only men were allowed to have SSNs, which means all data before the 1950s is
ridiculously skewed towards men. Only in the 1980s was it commonplace for all
babies to have SSNs...which means many of the prior names are names of
_adults_. [1]

So if things can get that messy for something as simple as baby names, imagine
the flaws of something like a crime or health database.

[1] [http://www.prooffreader.com/2014/07/graphing-problematic-asp...](http://www.prooffreader.com/2014/07/graphing-problematic-aspects-of-us-baby.html)

~~~
Asparagirl
_"For a long time, only men were allowed to have SSNs..."_

This is incorrect; for a long time, jobs that were more likely to be held by
women or minorities (domestic servants and agricultural workers) were excluded
from the Social Security program, so those workers often didn't pay into or
receive benefits from the system, and therefore never applied for a number.
But there was no formal gender or racial discrimination. My sweatshop-working
female relatives certainly had SSNs.

Incidentally, requesting a deceased person's SS-5 (the original handwritten
form used to apply for an SSN) is a _fantastic_ genealogical tool, because it
lists the applicant's parents' names, including mother's maiden name, among
other things. They're available from the Social Security Administration under
FOIA; you can order copies through their website.

See also:

[https://en.wikipedia.org/wiki/History_of_Social_Security_in_...](https://en.wikipedia.org/wiki/History_of_Social_Security_in_the_United_States)

[https://secure.ssa.gov/apps9/eFOIA-FEWeb/internet/main.jsp](https://secure.ssa.gov/apps9/eFOIA-FEWeb/internet/main.jsp)

------
jackcarter
>Let’s say we define 20%-80% as our threshold, so if a name is used on, let’s
say girls, more than 80%, we call the name female-dominant. If it is used on
less than 20%, then we call that name “male-dominant” name. Otherwise, we are
left with the range from 20% to 80% (60% of all names) which I propose to
consider “unisex” names.

Interesting writeup, but this is obviously faulty reasoning.

~~~
lmitchell
It's pretty arbitrary, yeah, but why 'obviously faulty'? Seems to me like it's
just a different threshold than you or I might have picked. I can't think of a
better metric to judge by, though.

It would be interesting to see which names were the 'most' unisex, though -
i.e. those with the split closest to 50%, regardless of popularity.
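
One rough way to rank names by how close they sit to an even split is to sort on the distance of the female share from 0.5. This is only a sketch: the helper most_unisex and the counts are invented for illustration, and it ignores popularity entirely, as suggested above.

```python
def most_unisex(counts):
    """Rank names by how close their female share is to 50%.

    counts maps name -> (female_count, male_count).
    """
    def distance_from_even(item):
        female, male = item[1]
        return abs(female / (female + male) - 0.5)

    return [name for name, _ in sorted(counts.items(), key=distance_from_even)]


# Invented counts, not taken from the SSA data.
counts = {
    "Alice": (950, 50),    # 95% female
    "Bob": (10, 990),      # 1% female
    "Casey": (480, 520),   # 48% female
    "Jordan": (300, 700),  # 30% female
}

ranking = most_unisex(counts)
print(ranking)  # ['Casey', 'Jordan', 'Alice', 'Bob']
```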

~~~
kevb
The "60% of all names" is wrong. Let's imagine there are 3 total names, Alice,
Bob and Casey. Alice is > 80% female, Bob is < 20% female and Casey is 50%
female. Even though we're using the 80/20 cutoff, only 33% of the names are
unisex, not 60%. You can't just subtract 20 from 80 to get the number of
unisex names as the distribution is unknown (or unmentioned). You'd have to
actually measure or take the distribution into account.
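
To make that concrete, here is a small sketch that applies the 20%/80% cutoffs to the three-name example and then measures the unisex share directly. The exact female shares for Alice and Bob (0.9 and 0.1) are assumed values that satisfy the ">80%" and "<20%" conditions; the classify helper is hypothetical.

```python
def classify(female_share, low=0.2, high=0.8):
    """Label a name using the article's 20%/80% thresholds."""
    if female_share > high:
        return "female-dominant"
    if female_share < low:
        return "male-dominant"
    return "unisex"


# Female share per name; 0.9 and 0.1 are assumed illustrative values.
female_share = {"Alice": 0.9, "Bob": 0.1, "Casey": 0.5}

labels = {name: classify(share) for name, share in female_share.items()}
unisex_fraction = sum(
    1 for label in labels.values() if label == "unisex"
) / len(labels)

print(labels)
print(round(unisex_fraction, 2))  # 0.33, not 0.6
```

The width of the threshold band (60 percentage points) says nothing about how many names fall inside it; that has to be counted from the data.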

