

Datasets You've Likely Never Seen - bane
http://blog.yhathq.com/posts/7-funny-datasets.html

======
velavar
This is the site I usually turn to when I'm on the prowl for interesting
datasets:
[http://rs.io/100-interesting-data-sets-for-statistics/](http://rs.io/100-interesting-data-sets-for-statistics/)
These datasets will make for some nice additions to this list.

------
waterlesscloud
I now have a spreadsheet of Spanish Silver Production 1720-1800 for no other
reason than it was available to me.

Well, and a long-lingering interest in Sid Meier's Pirates! and Neal
Stephenson's Baroque Cycle.

~~~
MrZongle2
Well then, here's timing for you... Jimmy Maher, who has an excellent blog
that chronicles the evolution of the game industry, posted an article about
Pirates! today:
[http://www.filfre.net/2015/07/pirates/](http://www.filfre.net/2015/07/pirates/)

------
jrapdx3
That was fun. A couple of the sets were particularly interesting. No surprise
that using LSD adversely affects cognitive performance, though the strength
and linearity of the effect were perhaps more relentless than expected. It's a
reasonable hypothesis that other drugs would produce similar outcomes, but we
can't say for sure until the studies are done.

Eyeballing the marijuana data reveals that, except for Mississippi (and maybe
Kentucky), the Pacific coast states have the lowest prices. Notably, OR was even
lower than WA or CA. The graphs also show that West Coast prices were consistently
_falling_, whereas in the other low-cost states the price was either rising or
fluctuating over time. OR (where I live) has a "laid back" reputation; maybe
there's a connection.

Don't know what makes cannabis less expensive out here, though I've read our
state is a leading pot grower/producer. Since Oregon is about to launch legal
recreational marijuana sales, prices may drop even more.

Data: entertaining, and educational too.

~~~
ObviousScience
California and Washington have long been major suppliers, both among the top 5
producing states, with California first and Washington fifth. That data is from
the 2006-2008 era, when I studied this more closely; however, I find it unlikely
that either of those two has slipped significantly. In addition to the raw volume
that they produce, they also serve as conduits for illegal drugs moved across
the border (from Mexico and Canada). Oregon also produced, but the volumes
were somewhat lower.

I have a conjecture that the weed prices in those states are artificially
high because the average person in those states is buying a premium product
for which there simply isn't enough supply or demand in states without a) a
direct line to the source, with first selection of choice portions, and b) a
culture where the dominant cash crop in the state is marijuana -- usually by a
non-trivial amount -- making the whole culture of marijuana endemic to the
state. Interestingly, in 2006-2008 Washington (while fifth
overall) was the state that produced the most hydroponically grown marijuana,
which fetches a considerably higher price than much of the outdoor crop.

Oregon, then, is the first state that represents "the rest of the country"
once you step away from weird effects right at the source, and seems to sit in
a clear trough around California and Washington. (Such patterns existed back
in 2006-2008, and also happened around Kentucky, which is another major
producer state.)

tl;dr: Seattle people are pot snobs as well as coffee snobs, and the WA price
of pot is high for the same reason the average cup of coffee in Seattle is
expensive -- $4 lattes instead of $1 gas station coffee.

------
IndianAstronaut
One of the big walls I hit as a data analytics person is how to turn data into
actionable insights. I sent over the pigeon data to a friend who does pigeon
research. Hope to see if it impacts his view of the pigeon world!

~~~
roel_v
This is one of my pet topics to bore people with: how we've now passed the
point where data _collection_, or even accessibility (for most subjects), is
the hard part. 10 years ago, for many things, there simply was no data; or if
there was, you didn't know it existed, or it was very expensive. Today, the
problem is that we don't know what to do with all the data. Of course loading
it into R and making scatter plots is fine and dandy, and one can easily spend
days on writing elaborate dataset-specific analysis reports, trying out
various techniques just because you've never used them.

But turning data into _insights_, or even further, _actionable advice_ -
that's a whole different story; and one that many people aren't really
interested in (yet?), either, both researchers and practitioners...

~~~
ObviousScience
I'm not sure I understand your last sentence; could you elaborate?

~~~
roel_v
What I mean is that many people are still stuck at the 'we need more data'
stage, or at least at the 'better data collection/verification' stage. And that the
focus of much analysis and modeling is less on actionable advice, but more...
well how shall I put it, 'dissecting' data, without having a clear way in mind
how that dissection will lead to insights that are relevant for the
stakeholder.

I should probably mention that this is in the context of academia; I guess
business analytics has an existential intrinsic motivation to be actionable.

~~~
tfgg
Couldn't it still be the case, though, that we have too much data, but it's
also the wrong sort for actionable insights? As a scientist I find the most
actionable data are often in the smallest, custom-made datasets, driven by
some question, not trawling through masses of data collected without a goal,
hoping that they'll have collected the right thing.

~~~
roel_v
Sure, could very well be, and what is a perfect fit in one situation might be
unusable in another situation that on the surface looks almost the
same. It would be silly for me to claim that all the data we need is already being
collected or something like that. But that's not at odds with my abstract
point: the realization that _data_ is usually no longer the problem, and that
_not knowing what to do with it_ is, hasn't sunk in for most people. (This
makes it sound like I think of myself as someone who has seen The Light and
'those others' are chumps, which would obviously be delusional of me, and I
don't mean it that way.)

I guess what I'm failing to articulate here is the shift that has snuck up on
us over the last 10 or so years. My bitching about the quality of datasets
today is about increasingly marginal issues (at the macro scale of course,
there are still crap individual datasets, obviously); whereas 15 years ago, I
didn't even have datasets to bitch about.

------
minimaxir
Those datasets aren't very robust, though. The sample visualizations provided
with them are about the extent of analysis possible.

