
Open Source Datasets - amberj
https://deepmind.com/research/open-source/open-source-datasets/
======
zitterbewegung
I have been thinking about what this solves with respect to other datasets.
Nearly all shape recognition datasets carry a restriction that you can't use
them unless you are an academic. I feel that open-sourcing data sets will
allow us to be more democratic with data and the things generated from it.
Creative Commons seems like a good license for this, though. Having the data
is half the battle. The rest is to make open models (Google is good at this);
then you could take pretrained models and not have your data leave your
house. I hope and dream we can do this.

~~~
nerdponx
_Nearly all shape recognition datasets have a restriction that you can't use
unless you are an academic. I feel like that Open Sourcing data sets will
allow us to be more democratic with data_

Just because something is open source doesn't mean it can't have an academia-
only restriction. Data sets _should_ cost money in for-profit uses, open or
not.

~~~
tuukkah
“Open data and content can be freely used, modified, and shared by anyone for
any purpose” [http://opendefinition.org/](http://opendefinition.org/)

"The license must not restrict anyone from making use of the program in a
specific field of endeavor. For example, it may not restrict the program from
being used in a business, or from being used for genetic research."
[https://opensource.org/osd](https://opensource.org/osd)

~~~
nerdponx
Interesting. So is a nominally open-source program like RStudio not actually
open-source because enterprise uses need a license?

~~~
SloopJon
It looks like RStudio is AGPL v3, which is a free and open source license.
Enterprise users only need a different license if they don't want to abide by
the AGPL's strong copyleft requirements.

------
Kpourdeilami
Somewhat unrelated: DeepMind's website is so cluttered and distracting that
it is almost unusable.

~~~
dharness
I came to the comments to say exactly that... What is up with that weird
floating hamburger search bar?

~~~
saagarjha
Especially since it covers a ton of content on the left side…

------
Alexqw85
The lab I work in has published, and continues to extend, the studyforrest
dataset for quite a few years now.

[http://studyforrest.org](http://studyforrest.org)

Most of the consumers so far have been neuroscience researchers and
statisticians, but we do hope (and think) that there's value for a wide
variety of interests.

There's a bunch of different data, but the highlights are fMRI scans of people
watching and/or listening to the movie Forrest Gump, eye tracking, and
detailed annotations of the movie. We are also about to begin acquiring
simultaneous EEG and fMRI.

[http://studyforrest.org/data.html](http://studyforrest.org/data.html)

Accessing the data is easy, and, as great admirers of Joey Hess, we also have
it available in a git annex repo. :-)

[http://studyforrest.org/access.html](http://studyforrest.org/access.html)

---Alex

[EDIT] Given that this thread is about open source datasets, it's probably
worth mentioning that the license is PDDL.

[https://opendatacommons.org/licenses/pddl/1.0/](https://opendatacommons.org/licenses/pddl/1.0/)

------
iandev
Forgive my ignorance, but I'm not sure for what a dataset like the
"Collectible Card Game to Code"[0] might be used. Can anyone explain how and
for what it might be used?

[0]
[https://github.com/deepmind/card2code](https://github.com/deepmind/card2code)

~~~
Houshalter
They use it to train an AI to program. It reads the descriptions of the cards'
effects and produces computer code that generates that behavior.

------
ptero
One question that is not clear to me is what a dataset license should allow
or restrict in a perfect world. For me (just a personal opinion) it would
allow free (as in liberty) use, but somehow encourage those who use it to
share the benefits (data, software or algorithms) under the same license.

Unfortunately, Open Source does not help here -- I do not see how OS can be
used with data sets. The main OS leverage with software development is that if
you use software X to build software Y, X is usually present in some way,
shape or form in your deliverable Y. Not so with training data -- once
algorithm development is done you can (and usually do) strip training data out
and have a finished product that does not require X to run.
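
A minimal sketch of that last point, assuming a toy least-squares fit stands
in for "algorithm development". The data values and the names fit_line and
predict are hypothetical, chosen only for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted parameters; no training data is needed here."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical "dataset X" (made-up values lying on y = 2x + 1).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

# "Algorithm development": the data is reduced to two numbers.
model = fit_line(train_x, train_y)

# Strip the training data out; the deliverable is only the parameters.
del train_x, train_y

prediction = predict(model, 10.0)
```

The shipped artifact here is just the (slope, intercept) pair, so nothing of
X is present in Y in the way a linked library would be.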

Even if one were to require open sourcing derived datasets it is usually easy
to segregate the dataset with a tainted (open source) license as you build up
your data so the new datasets are not formally "derived" and thus would not
need open sourcing.

I would love a better way forward on this, or at least a cleaner explanation
of options.

~~~
nerdponx
OS helps tremendously in reproducibility. Without the underlying data, there
is no way to audit an analysis. Moreover, an algorithm is only ever "done" in
the same way that software is ever "done". New techniques might come along
that could enhance the model, or the business requirements might change in
ways that necessitate re-tuning the algorithm.

The benefits of OS data are the same as the benefits of OS software. The
distinction between "Free" and "Open" is the same as well.

Edit 1: OS data sets are nothing new. The UCI Machine Learning Repository[1]
has been around for years. There is also an entire Open Data Stack Exchange
site [2], and an Open Data Subreddit [3].

Edit 2: OS data sets are essential for developing _new_ algorithms because
they can be used as benchmarks. Nobody should trust a model that's been
developed on a proprietary data set for use on anything other than that one
data set.

[1]: [https://archive.ics.uci.edu/ml/](https://archive.ics.uci.edu/ml/)

[2]:
[https://opendata.stackexchange.com/](https://opendata.stackexchange.com/)

[3]: [https://reddit.com/r/opendata/](https://reddit.com/r/opendata/)

~~~
ptero
Maybe I was not clear -- I am not arguing for proprietary datasets. Validation
on publicly available data is a key component in comparisons, assessments,
etc.

However, there is a whole bestiary of open source licenses that span the
spectrum from "use any way you want" to much more restrictive terms. But they
were mostly thought through for software, and data is different; what may
prevent proprietary abuse in software may not have any teeth for data.

------
jackschultz
This brings up a huge point about how important data sets are to analysis and
machine learning. There are so many libraries out there that make learning
algorithms quick to run, and the absolute most important part of a project of
that type is correct, well-formatted data.

------
deepnet
Dear Deepmind, as you have retired AlphaGo please open source the dataset of
Go games used to train it.

------
caniszczyk
Check out [https://data.world](https://data.world), which is doing a decent
job of organizing a variety of data sets out there.

------
blazespin
Cool. When DeepMind originally joined Google, it was on the condition that
Google would be moral about its use of AI.

