
I love the SSA baby names dataset, especially for teaching beginners...not only is it relatively compact (about 30MB for the nationwide dataset) but it is about as granular as you need (1,825,433 rows as of 2014) to do a variety of analyses. It's a great dataset because it is interesting to almost everyone...because almost everyone has a name, and almost everyone has a particular interest in their own name, or at least a casual curiosity about how many other babies shared it, or whether their name is going out of style. And the queries to answer those questions are so easy (SELECT name, SUM(count) AS s FROM babynames GROUP BY name ORDER BY s DESC), and the visualizations make sense.
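For anyone who wants to try the per-name total without downloading anything, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (name, sex, year, count) is my assumption about how you'd load the files, and the rows are invented toy numbers, not SSA figures. Note that the query needs a GROUP BY name to produce one total per name.

```python
import sqlite3

# Toy rows (name, sex, year, count). The numbers are invented for
# illustration and are NOT taken from the SSA files.
ROWS = [
    ("Mary", "F", 1880, 70),
    ("John", "M", 1880, 90),
    ("Mary", "F", 1881, 60),
    ("John", "M", 1881, 80),
    ("Anna", "F", 1881, 25),
]


def total_by_name(rows):
    """Load rows into an in-memory table and return (name, total) pairs, largest first."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema; adjust to however you imported the real data.
    conn.execute(
        "CREATE TABLE babynames (name TEXT, sex TEXT, year INTEGER, count INTEGER)"
    )
    conn.executemany("INSERT INTO babynames VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT name, SUM(count) AS s FROM babynames GROUP BY name ORDER BY s DESC"
    ).fetchall()
    conn.close()
    return result


totals = total_by_name(ROWS)
print(totals)
```

Adding a WHERE name = ? clause and grouping by year instead gives the "is my name going out of style" timeseries.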

But that's only the beginning...it's easy to illustrate the variety of insights that can come from a single, straightforward dataset. The sums of names as a timeseries show trends. Faceting the cumulative count into buckets...such as top 1000 names vs. everyone else, lets you see what appears to be a measurement of America's increased diversity. And this isn't even considering all the insights that can be gained from the names-by-state count, which is a separate and equally massive datafile.
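The "top N names vs. everyone else" bucketing can be sketched in a few lines of plain Python. This is just one way to do it, under my own assumptions: the function name top_n_share is made up, the rows are invented toy numbers, and each (name, year) pair is assumed to appear once (with the real files you'd first sum the male and female rows for a name).

```python
from collections import defaultdict


def top_n_share(rows, n):
    """Return {year: fraction of that year's counted babies with a top-n name}."""
    counts_by_year = defaultdict(list)
    for _name, year, count in rows:
        counts_by_year[year].append(count)

    shares = {}
    for year, counts in counts_by_year.items():
        counts.sort(reverse=True)  # most popular names first
        shares[year] = sum(counts[:n]) / sum(counts)
    return shares


# Toy rows (name, year, count); invented numbers, NOT SSA figures.
ROWS = [
    ("John", 1900, 60), ("Mary", 1900, 30), ("Anna", 1900, 10),
    ("John", 2000, 40), ("Mary", 2000, 30), ("Anna", 2000, 20), ("Liam", 2000, 10),
]

shares = top_n_share(ROWS, 2)
for year in sorted(shares):
    print(year, round(shares[year], 2))
```

A falling share over time is what you'd read as names becoming more varied, with the caveat that the result depends on how complete each year's records are.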

It's also a great example of how research, domain knowledge, and real-world pragmatism are fundamental parts of data science. It seems possible that tracking baby names could be just a matter of the SSA doing a sum/count on their own database, which is populated by electronic form...but what about the years before computers? For at least a century, the SSA must have had to deal with handwritten and typed records, which means that the accuracy of the records is as potentially flawed as the humans who manually tallied them up. And that's barely scratching the surface of what could be wrong...The Social Security Administration didn't just happen...it certainly didn't exist in 1880. And it took time for things to be the way they are now. For a long time, only men were allowed to have SSNs, which means all data before the 1950s is ridiculously skewed towards men. Only in the 1980s did it become commonplace for all babies to have SSNs...which means many of the earlier names are names of people who registered as adults. [1]

So if things can get that messy for something as simple as baby names, imagine the flaws of something like a crime or health database.

[1] http://www.prooffreader.com/2014/07/graphing-problematic-asp...

"For a long time, only men were allowed to have SSNs..."

This is incorrect; for a long time, jobs that were more likely to be held by women or minorities (domestic servants and agricultural workers) were excluded from the Social Security program, so those workers often didn't pay into or receive benefits from the system, and therefore never applied for a number. But there was no formal gender or racial discrimination. My sweatshop-working female relatives certainly had SSNs.

Incidentally, requesting a deceased person's SS-5 (the original handwritten form used to apply for an SSN) is a fantastic genealogical tool, because it lists the applicant's parents' names, including mother's maiden name, among other things. They're available from the Social Security Administration under FOIA; you can order copies through their website.
