
Data mining for the taboo: searching for what isn't there - sideshowb
https://linkingideasblog.wordpress.com/2015/08/19/data-mining-for-the-taboo-searching-for-what-isnt-there/
======
danso
A similar but slightly simpler computational concept is the `LEFT OUTER JOIN`
in SQL, in which for every row on the "left"-side table is listed, whether or
not it has a corresponding match on the "right"-side table...doing a `WHERE`
filter for `NULL` values then shows you all of the left-side records that have
no match.

INNER JOINs are plenty useful in themselves, of course...it's already
difficult for humans to manually compare two lists to see the matches, and
it's not done as routinely as it could be (think of the number of surprising
news stories and investigative journalism that involves discovering that a
person in one list is also in another list, e.g. campaign contributors,
convicted sex offenders, etc). Given the difficulty of matching -- even though
its insights are tangible to most people -- my experience has been that people
don't even consider looking for what _isn 't_ there that _should_ be there.
Maybe it's a corollary of Donald Rumsfeld's apt observation of the "unknown
unknowns".

Here's a toy example I use in one of my classes: Using the U.S. Social
Security baby names data, look for the most popular baby names in 2010 that
did not exist in the 1990 dataset:

[http://2015.padjo.org/assignments/midterm-babysteps-
sql/#the...](http://2015.padjo.org/assignments/midterm-babysteps-
sql/#the-5-most-popular-names-in-2010-that-dont-show-up-at-all-in-1990)

(actually, it's pretty surprising how much churn there is even within a
decade...When I initially used LEFT JOINs just to see which popular names in,
say, 1970, were still among the most popular names today, the result set was
so small (sometimes, just empty), that I thought I had made a mistake in my
query)

