
Stanford Large Network Dataset Collection - jonbaer
https://snap.stanford.edu/data/
======
alialkhatib
This is cool (although unless something has changed recently on that page that
I'm not seeing, it's kind of old).

In the HCI group at Stanford we recently had a talk about massive open online
courses (MOOCs), how Harvard and MIT recently made their edX data
available (albeit to other researchers, not completely publicly)[0], and how
Stanford could do the same with its own MOOC data. We were wrestling with the
idea of making it only available to other established institutions the way
Harvard and MIT did it, but the irony (and maybe hypocrisy) of limiting MOOC
data access to those within the ivory towers was not lost on us. The
alternatives (scrubbing the data more rigorously or to varying degrees
depending on our trust level of the entity requesting it) seemed better, but
also have problems; how do you determine those levels, what if someone shares
their privileged data with an untrusted individual, etc... (these are not
unique problems; if we have any hurdle to clear we always have to worry that
someone who clears it will break that wall down and mess the whole thing
up[1]).

We're really struggling to come to a good solution on this problem in part
because IRB protocols were not originally designed for this kind of stuff.
They were imagined for the kinds of experiments where the data collection
itself was what endangered participants, not the _analysis_. As a result, IRB
approval for a protocol outlining the collection of data might not foresee
every imaginable permutation of data analysis that could reveal embarrassing
or incriminating details about participants.

I'm sorry, this is becoming a rant. The point is that we're talking about
making more data - specifically more MOOC data - available for research and
analysis. Hopefully we'll figure something out that will be interesting to
(white hat) hackers without it endangering participants in the hands of black
hat hackers.

0: [https://newsoffice.mit.edu/2014/mit-and-harvard-release-de-i...](https://newsoffice.mit.edu/2014/mit-and-harvard-release-de-identified-learning-data-open-online-courses)

1: case in point:
[http://en.wikipedia.org/wiki/AOL_search_data_leak](http://en.wikipedia.org/wiki/AOL_search_data_leak)

~~~
minimaxir
The MOOC data from Harvard/MIT isn't researcher only; the only limitation is
that you can't redistribute the data. (it's _very_ good data)

I did a blog post on it and have not received any angry emails from either
party:
[http://minimaxir.com/2014/07/online-class-charts/](http://minimaxir.com/2014/07/online-class-charts/)

~~~
alialkhatib
Ah you're right, sorry about that.

That post is really interesting. I know ggplot2 does quite a bit of it, but
the visualizations are really nice.

You mention significance a few times but I don't see alpha levels or
significance testing per se; do you mean significant in the casual sense, or
are you just withholding the stats talk for the audience of (most likely)
laypeople? If it's the former, you might find statistical significance in even
the avg % grade by gender, given a large enough sample size.

I might be overlooking the part where you talk about this, and I apologize if
that's the case.
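
To make the sample-size point concrete, here is a rough sketch (not taken from the blog post). It runs a two-sided z-test on a fixed half-point gap in average grade and shows the p-value shrinking as the groups grow. The means, the standard deviation, the group sizes, and the helper name `two_sample_z_pvalue` are all made up for illustration, and it assumes equal group sizes and a known, shared standard deviation.

```python
import math


def two_sample_z_pvalue(mean_a, mean_b, sd, n_per_group):
    """Two-sided p-value for a difference in means.

    Assumes both groups have the same size and the same known
    standard deviation (a simplification for illustration).
    """
    # Standard error of the difference between two independent means.
    se = sd * math.sqrt(2.0 / n_per_group)
    z = (mean_a - mean_b) / se
    # For a standard normal, the two-sided tail probability is erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical numbers: a 0.5-point gap in average % grade, sd of 20 points.
for n in (100, 10_000, 1_000_000):
    p = two_sample_z_pvalue(70.5, 70.0, 20.0, n)
    print(f"n per group = {n:>9,}  p = {p:.3g}")
```

The gap is the same in every row; only the sample size changes, which is why a "significant" result on a dataset this large says little about whether the difference matters in practice.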

~~~
minimaxir
In the casual sense. I've been having a little difficulty graphically
conveying confidence intervals without making the charts unreadable/too
complicated. The _correct_ way is to use boxplots, but I'm working on other
things too.
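
For reference, a minimal sketch of the interval itself, which is what an error bar on a chart would span. This is a normal-approximation 95% interval for a mean, not the method used in the blog post. The grade values and the helper name `mean_confidence_interval` are invented for illustration, and the 1.96 multiplier is only a reasonable approximation for large samples.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Approximate confidence interval for the mean of `values`.

    Uses the normal approximation (z=1.96 for roughly 95%); for small
    samples a t-based multiplier would be more appropriate.
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean - z * sem, mean + z * sem


# Made-up grades, purely for illustration.
grades = [62.0, 71.5, 80.0, 55.0, 90.5, 68.0, 74.0, 77.5]
low, high = mean_confidence_interval(grades)
print(f"mean = {statistics.fmean(grades):.1f}, 95% CI ~ ({low:.1f}, {high:.1f})")
```

Drawing `low` and `high` as a thin vertical bar through each plotted mean is one common way to show the uncertainty without switching chart types.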

~~~
alialkhatib
If you figure out a more intuitive way to communicate confidence intervals
visually, _please_ post it here :)

------
emu
I might have missed it, but I couldn't find any licensing information for most
of these data sets, which troubles me. Personally, I'm hesitant to download or
work with much of this data for that reason.

------
robmccoll
These are my go-tos for quick testing with real data. I've published a paper
or two using these datasets (obtained from SNAP).

There are also some decent large graphs of different types from various DIMACS
challenges that people may find useful
([http://www.cc.gatech.edu/dimacs10/downloads.shtml](http://www.cc.gatech.edu/dimacs10/downloads.shtml)).

