Hacker News
Datasets for Machine Learning (gengo.ai)
456 points by mromaine on June 14, 2018 | 38 comments

Ben from Kaggle.

Open up the ~50 different individual datasets linked in separate tabs, and then quickly flip through all of them trying to get a sense of what each one is.

That experience will demonstrate one of the main challenges we're aiming to solve by making Kaggle Datasets your default place to publish data online (https://www.kaggle.com/datasets).

This is great! Thanks for sharing. Would be awesome if your license filter had a "not for commercial use" vs. "for commercial use" or similar.

Thanks for the feedback! Totally agree this could be clearer.

Same question as parent...but curious...which one is it?

You can also start a new Jupyter notebook session on any of these datasets with a click (click "New Kernel"), and then accelerate your analysis by attaching a GPU to the session with another click (for applications where a GPU helps, e.g. training TensorFlow models on image data).

Ironically, the same challenges are better solved in the world of code: Docker, GitHub, npm, etc.

Some friends and I created Quilt to bring versioning and packaging to data: https://quiltdata.com/. The interface is the familiar Python lifecycle of install and import.

This is a great idea Ben, and I appreciate the work you do. Do you see Kaggle datasets as a tool to encourage better data formatting, or are you also thinking about building tools for automatically visualizing, cleaning, and organising data?

All of the above, and more! One thing I'm really excited about that we're about to release is a much better explorer for tabular data (automated histograms, sorting/filtering/showing the data, and the like).

We also encourage sharing analytics code and visualizations that users create on the data back to the community. For example, see all these visualizations and insights in StackOverflow's developer survey data linked from https://www.kaggle.com/stackoverflow/stack-overflow-2018-dev...

Great, thanks for the link (and to the blog author for her links). I do machine learning at work, but just two very specific use cases involving GANs and RNNs. I appreciate resources to use in my own time to explore other architectures.

I've heard that Kaggle data sets encourage people to do "supervised" ML only. Is that true?

The competitions we host (https://www.kaggle.com/competitions) are supervised and always have a target we can create a numeric leaderboard on, but the public datasets (https://www.kaggle.com/datasets) are used for everything under the sun.

There's some supervised ML use of those, and a lot more open-ended exploration, visualization, cleaning, clustering, language modeling, etc.

Not at all - I released a customer support on Twitter dataset [0] there specifically focused on unsupervised tasks! I think the focus on supervision in what people do with the data shows that there are still a lot of people poking around with the easier supervised tasks.

[0]: https://www.kaggle.com/thoughtvector/customer-support-on-twi...

(Not Ben, but - ) outside of academia, the main thing that seems to encourage people to do supervised ML is that it's the only thing that seems to work. I haven't really heard of any success stories with using unsupervised techniques for most common ML applications.

I'm not an expert, but I feel that:

Unsupervised techniques work really well for language modelling.
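As a toy illustration of why language modelling counts as unsupervised (none of this is from the thread; the corpus and numbers are made up): raw text supplies its own labels, since each word is the "target" for the word before it. A minimal bigram model makes that concrete:

```python
# Toy sketch: a bigram language model learns from raw text alone --
# the next word serves as its own label, so no annotation is needed.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def p_next(prev, nxt):
    """Estimated probability that `nxt` follows `prev`."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total

print(p_next("the", "cat"))  # "the" is followed by cat, mat, cat -> 2/3
```

The same self-supervision idea, scaled up, is what modern pre-trained language models rely on.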

There is also weakly supervised and distant-supervision, where the labels are "noisy" or not exactly what you want.

You're right in that strong supervision, where you basically trust your class label, works really well, because it's probably the easiest case.

Combining unsupervised (e.g. pre-trained language models) with a very small set of strongly labeled data, or a larger set of weakly labeled data, seems to work pretty well too.
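A minimal sketch of that combination, with invented toy data and LSA (TF-IDF plus truncated SVD) standing in for a pre-trained language model: the representation is fit on all the unlabeled text, and a classifier is then trained on only a handful of strong labels.

```python
# Sketch: fit an unsupervised text representation on unlabeled data,
# then train a classifier on a very small strongly-labeled set.
# All strings and labels here are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

unlabeled = [
    "the movie was great", "fantastic film, loved it",
    "terrible plot and bad acting", "awful, a waste of time",
    "great soundtrack and cast", "bad pacing, poor script",
]
# Unsupervised step: no labels needed to fit the representation.
rep = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=4, random_state=0))
rep.fit(unlabeled)

# Supervised step: only four labeled examples (1 = positive, 0 = negative).
labeled_text = ["the movie was great", "loved it",
                "terrible plot", "awful acting"]
labels = [1, 1, 0, 0]
clf = LogisticRegression().fit(rep.transform(labeled_text), labels)

preds = clf.predict(rep.transform(["great film", "bad movie"]))
```

With so little data the predictions are not reliable; the point is only the division of labor between the unsupervised representation and the small supervised head.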

Unsupervised works, but your ability to measure "does it work or not" is much more dependent on a case by case evaluation rather than a score.

(Because if you know a priori what it is that you want to measure - it's supervised.)

Yeah, this is my experience too; evaluation ends up being an almost endless time sink.

I used a very simple unsupervised ML model built with scikit-learn to find good matches on OK Cupid. It worked very well; it found definite boundaries between the clusters of women.

One of the features was a subjective rating of how much I liked some of the women, and scikit-learn then suggested to me other women in the clusters that had my best ratings. It turns out that I like vegetarians, redheads, and left-wingers. Which happens to be true, even though I eat meat and do not identify as left-wing. But those traits correlate with _other_ traits that are more difficult to measure objectively, such as caring about children, liking to hike, and preferring an evening of sex to an evening of television.
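The poster's actual pipeline isn't described, but the idea can be sketched roughly like this (features, profile data, and cluster count are all invented): cluster the profiles, find the cluster whose members got the highest personal ratings, and surface its other members as suggestions.

```python
# Hedged sketch of the clustering approach described above.
# All data is randomly generated; the real features are unknown.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Columns: [is_vegetarian, is_left_wing, my_rating_1_to_5] for 40 fake profiles.
profiles = np.column_stack([
    rng.integers(0, 2, 40),
    rng.integers(0, 2, 40),
    rng.integers(1, 6, 40),
])

# Standardize so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(profiles)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Pick the cluster whose members got the highest average rating,
# then surface its other profiles as suggested matches.
ratings = profiles[:, 2]
best = max(range(3), key=lambda c: ratings[km.labels_ == c].mean())
suggested = np.where(km.labels_ == best)[0]
```

Including the subjective rating as a feature, as the poster describes, is what lets a purely unsupervised method end up producing personalized suggestions.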

I think it's more that supervised ML is sufficient for most of the low hanging fruit. It's relatively easy and well-understood, and there are a lot of things out there where we have copious data that we just need to digest into a model to make it useful.

What about clustering?

The link at the bottom should be emphasized: https://github.com/awesomedata/awesome-public-datasets

It is a very expansive collection of datasets, some well-prepped for ML and most not (which is part of the fun of it, anyways).

Two sources that are missing:

opendatanetwork.com: this is effectively a Google for public Socrata data portals, and for me, the best way to discover datasets across different municipalities. For example, when I was interested in trying to replicate the NYT's "Do ‘Fast and Furious’ Movies Cause a Rise in Speeding?" [0] article, it was pretty easy to find a bunch of other traffic/motor vehicle violation datasets with opendatanetwork's search.

Enigma public (https://public.enigma.com): a huge collection of scraped public datasets, including flattened versions of data that originally comes in annoying-to-parse formats, such as U.S. lobbying disclosures [1]

[0] https://www.nytimes.com/2018/01/30/upshot/do-fast-and-furiou...

[1] https://public.enigma.com/datasets/lobbying-disclosures-lobb...

Surprised that CIFAR wasn’t mentioned under Images. I feel like that’s one of the standards, even more so than some of the ones that are listed.

To train machine translation models, parallel corpora in many languages are provided on the WMT conference site: http://www.statmt.org/wmt17/translation-task.html (and previous years).

My original comment was meant for a separate HN article on machine learning and I posted in the wrong tab.

My apologies.

I had the same reaction. I don't like it when sites copy information and only link to the original content at the bottom of the page.

The collection is good though, it's sad that it looks like it is stealing from the sources.

How is this related to the article on gengo.ai?

Oops, misread it as https://modelzoo.co/

Inspired by this post, I was looking for a fun way to browse datasets randomly, which led me to build this Kaggle Random Dataset Generator:


Thanks Gengo!

Here are thousands more open datasets for anyone to explore, use, or build upon: https://dataturks.com/projects/trending

From the title 'The 50 Best Free Datasets...' I was expecting a curated list of datasets. But the list has a mix of individual datasets and sites that provide/host datasets :(

Here is a dataset for abstractive summarization created from Reddit.

Dataset: https://zenodo.org/record/1168855#.WyJG3I7pdhE
Paper: http://aclweb.org/anthology/W17-4508

Security industry related datasets always seem to be omitted from this type of thing. Please check out the excellent http://www.secrepo.com/.

For audio, LibriSpeech, M-AILABS, LJ-Speech, VCTK, TIMIT, Mocha-Timit, VoxForge, Blizzard Challenge, and so on.

Does anyone know of good datasets for Concept Drift analysis?

Can these datasets be used for academic and research purposes?

Can't open this website.

Click on the link.

Done, what now?
