
A list of the biggest datasets for machine learning - nickplesha
https://www.datasetlist.com/
======
aepiepaey
Missing Danbooru2018, released Jan 2019:

> Danbooru2018 is a large-scale anime image database with 3.33m+ images
> annotated with 99.7m+ tags

[https://www.gwern.net/Danbooru2018](https://www.gwern.net/Danbooru2018)

~~~
imraj96
99.7m+ tags!! How did they annotate the images? Was it done manually by humans
or by a neural network?

~~~
gwern
All humans, believe it or not. Never underestimate the power of
procrastinating grad students or anime weebs.

At present, the most advanced tagger for Danbooru, DeepDanbooru
[https://www.reddit.com/r/MachineLearning/comments/akbc11/p_t...](https://www.reddit.com/r/MachineLearning/comments/akbc11/p_tag_estimation_for_animestyle_girl_image/)
, still isn't good enough to do annotation by itself but I think that's mostly
because no one has really tried.

------
kyrieeschaton
The Reddit archive is a glaring omission given its recent use in OpenAI's
GPT-2 model.
[https://www.reddit.com/r/datasets/comments/65o7py/updated_re...](https://www.reddit.com/r/datasets/comments/65o7py/updated_reddit_comment_dataset_as_torrents/)

~~~
nickplesha
Thanks for the feedback! I'll be adding it to the list.

------
danbr
Neat.

Would have been nice to add some sort of descriptor indicating what type of
dataset it is. For example, I personally have no clue off the top of my head
what the “MURA” dataset is.

Edit: I now see there is a little icon on the left of all lines. Bit
ambiguous, but it sorta gets the point across.

------
viksit
Great job! A categorization system (especially if it can be
crowdsourced/edited via github) would be great, since these datasets are
likely to be useful across multiple domains - and not just "CV" or "NLP" -
I'm thinking "Stocks", "Finance", etc.

Also [https://registry.opendata.aws/](https://registry.opendata.aws/) from AWS
has a lot of datasets that could either be included en masse, or even linked
to for the page to be more comprehensive. I like their categorization system
(tags/labels) as well. They also have usage examples, which are excellent for
getting a sense of what the data is and what it is useful for.

~~~
nickplesha
Thanks! Those are good ideas!

------
rglullis
Is there anything that stops you from adding a link to the dataset/torrent for
the ones that are freely distributable?

~~~
nickplesha
Good idea, a link to academictorrents seems like a good candidate for adding
to the table. I'll think about how to fit it in there without overloading the
page with information. Thanks!

------
bane
I really wish there was a large-scale corpus of completely benign email
traffic.

~~~
1024core
Enron? [https://www.cs.cmu.edu/~./enron/](https://www.cs.cmu.edu/~./enron/)

500K messages from 150 users.

~~~
kippinitreal
Now I’m just imagining a bunch of NLP models subtly learning large scale
fraud.

~~~
1024core
AGI needs to be adept at fraud too, you know ;-)

------
thecodemonkey
/r/datasets on Reddit also has some surprisingly interesting datasets.

~~~
mindcrime
/r/opendata on Reddit is another one worth checking.

------
jumasheff
Stanford Cars dataset:
[http://ai.stanford.edu/~jkrause/cars/car_dataset.html](http://ai.stanford.edu/~jkrause/cars/car_dataset.html)

------
dstick
Let me be the first to say thanks for taking the time to compile and
sharing this list! Bookmarked it for future reference / use :)

------
mtw
I would like to see the size of each dataset, a form to submit datasets, and
categories (such as person, face, animals, or plants).

~~~
nickplesha
Thanks for the feedback, all seem like good ideas.

------
1024core
If the word "biggest" is being used, then some indication of the size of the
dataset would be useful.

~~~
nickplesha
Thanks for the feedback, perhaps the stats and numbers are too well hidden in
the description field. I'll see how to make it easier to gauge the size of
datasets at a glance.

------
craze3
You should add this movie dialog collection:
[https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_C...](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)

A lot of projects use it!

~~~
nickplesha
Thanks!

------
calibas
Does the licensing of a dataset also apply to the trained neural network?

------
hgasimov
Thanks a lot for sharing this :)

------
ru999gol
Misses a ton of important datasets and corpora, but anyway, it's interesting how so
many of them are non-commercial only :(

Especially annoying are European universities and other government-funded
institutions

