
Datasets for Machine Learning - mromaine
https://gengo.ai/articles/the-50-best-free-datasets-for-machine-learning/
======
benhamner
Ben from Kaggle.

Open up the ~50 different individual datasets linked in separate tabs, and
then quickly flip through all of them trying to get a sense of what each one
is.

That experience will demonstrate one of the main challenges we're aiming to
solve by making Kaggle Datasets your default place to publish data online
([https://www.kaggle.com/datasets](https://www.kaggle.com/datasets)).

~~~
bhnmmhmd
I've heard that Kaggle data sets encourage people to do "supervised" ML only.
Is that true?

~~~
hideo
(Not Ben, but - ) outside of academia, the main thing that seems to encourage
people to do supervised ML is that it's the only thing that seems to work. I
haven't really heard of any success stories with using unsupervised techniques
for most common ML applications.

~~~
raverbashing
Unsupervised works, but your ability to measure "does it work or not" depends
much more on case-by-case evaluation than on a single score.

(Because if you know a priori what it is that you want to measure, it's
supervised.)

~~~
atupis
Yeah, this is my experience too; evaluation ends up being an almost endless
time sink.
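
As a rough illustration of the evaluation point in this subthread, here is a minimal sketch in plain Python. The data points, the threshold rule, and the hand-picked candidate groupings are all made up for the example; nothing here comes from Kaggle or from the linked article. With labels, one accuracy number settles the question. Without labels, an internal metric such as the within-cluster sum of squares keeps shrinking as you add clusters, so it cannot tell you on its own which grouping is the "right" one.

```python
# Toy 1-D data with two visible groups (values invented for illustration).
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = [0, 0, 0, 1, 1, 1]  # only available in the supervised setting


def accuracy(predictions, truth):
    """Supervised case: compare predictions to known labels, get one score."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


def within_cluster_ss(clusters):
    """Unsupervised case: sum of squared distances to each cluster's mean."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        total += sum((x - mean) ** 2 for x in cluster)
    return total


# Supervised: a simple threshold rule is scored directly against the labels.
predictions = [1 if x > 3.0 else 0 for x in points]
supervised_score = accuracy(predictions, labels)

# Unsupervised: three hand-picked candidate groupings of the same points,
# keyed by the number of clusters k (hypothetical, not produced by k-means).
candidates = {
    1: [[1.0, 1.2, 0.8, 5.0, 5.3, 4.7]],
    2: [[1.0, 1.2, 0.8], [5.0, 5.3, 4.7]],
    3: [[1.0, 1.2, 0.8], [5.0, 5.3], [4.7]],
}
wcss_by_k = {k: within_cluster_ss(c) for k, c in candidates.items()}

print("accuracy:", supervised_score)
for k, wcss in wcss_by_k.items():
    print(f"k={k}: within-cluster SS = {wcss:.3f}")
```

The within-cluster number is lower for three clusters than for two, even though two groups is the obvious reading of the data, so somebody still has to look at the result and decide whether it makes sense for the use case.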

------
logancg
The link at the bottom should be emphasized:
[https://github.com/awesomedata/awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets)

It is a very expansive collection of datasets, some well-prepped for ML and
most not (which is part of the fun of it, anyways).

------
danso
Two sources that are missing:

opendatanetwork.com: this is effectively a Google for public Socrata data
portals, and for me, the best way to discover datasets across different
municipalities. For example, when I was interested in trying to replicate the
NYT's _"Do ‘Fast and Furious’ Movies Cause a Rise in Speeding?"_ [0] article,
it was pretty easy to find a bunch of other traffic/motor vehicle violation
datasets with opendatanetwork's search.

Enigma public ([https://public.enigma.com](https://public.enigma.com)): a huge
collection of scraped public datasets, including flattened versions of data
that originally comes in annoying-to-parse formats, such as U.S. lobbying
disclosures [1].

[0] [https://www.nytimes.com/2018/01/30/upshot/do-fast-and-furiou...](https://www.nytimes.com/2018/01/30/upshot/do-fast-and-furious-movies-cause-a-rise-in-speeding.html)

[1] [https://public.enigma.com/datasets/lobbying-disclosures-lobb...](https://public.enigma.com/datasets/lobbying-disclosures-lobbyists-2013/f3ce179f-9171-4754-9f71-71d7596d900a?&filter=%2B%5B%3E%5Blobbyist%5D%5D)

------
andy-wu
Surprised that CIFAR wasn’t mentioned under Images. I feel like that’s one of
the standards, even more so than some of the ones that are listed.

------
rerx
To train machine translation models, parallel corpora in many languages are
provided on the WMT conference site:
[http://www.statmt.org/wmt17/translation-task.html](http://www.statmt.org/wmt17/translation-task.html)
(and on the pages for previous years).

------
Smerity
My original comment was meant for a separate HN article on machine learning
and I posted in the wrong tab.

My apologies.

~~~
rerx
How is this related to the article on gengo.ai?

~~~
pilooch
Oops, I mistook it for [https://modelzoo.co/](https://modelzoo.co/)

------
loisaidasam
Inspired by this post, I was looking for a fun way to browse datasets
randomly, which led me to build this Kaggle Random Dataset Generator:

[https://news.ycombinator.com/item?id=17313374](https://news.ycombinator.com/item?id=17313374)

Thanks Gengo!

------
mohi13
Here are thousands more open datasets for anyone to explore, use or build upon:
[https://dataturks.com/projects/trending](https://dataturks.com/projects/trending)

------
rahimnathwani
From the title 'The 50 Best Free Datasets...' I was expecting a curated list
of datasets. But the list has a mix of individual datasets and sites that
provide/host datasets :(

------
codemetro53
Here is a dataset for abstractive summarization created from Reddit.

Dataset:
[https://zenodo.org/record/1168855#.WyJG3I7pdhE](https://zenodo.org/record/1168855#.WyJG3I7pdhE)

Paper:
[http://aclweb.org/anthology/W17-4508](http://aclweb.org/anthology/W17-4508)

------
mrphilroth
Security-industry datasets always seem to be omitted from this type of
thing. Please check out the excellent
[http://www.secrepo.com/](http://www.secrepo.com/).

------
kokimame
For audio, LibriSpeech, M-AILABS, LJ-Speech, VCTK, TIMIT, Mocha-Timit,
VoxForge, Blizzard Challenge, and so on.

------
greentuna
Does anyone know of good datasets for Concept Drift analysis?

------
bhnmmhmd
Can these datasets be used for academic and research purposes?

------
fwdpropaganda
Can't open this website.

~~~
welly
Click on the link.

~~~
fwdpropaganda
Done, what now?

