
Kaggle Datasets – Discover and analyze open data - benhamner
https://www.kaggle.com/datasets
======
benhamner
Our goal with Kaggle Datasets is to provide the best place to publish,
collaborate on, and consume public data.

As a data publisher, you have an easy way to publish data online, see how it's
used, and interact with the users of the data. You can create the dataset via
a simple web interface, and update it through the interface or an API. We
automatically version these updates under the hood.

As a data consumer, you can browse the data online and download it (through
the web or an API). You can see the code and insights others have generated on
the data through Kaggle Kernels (hosted, versioned IPython notebooks that run
in Docker containers). You can fork their code to get started on the data, or
start coding from scratch on your own analysis. If you find improvements that
could be made to the metadata (dataset/file/column-level descriptions), you
can make those directly.

We're rapidly iterating on this product and expanding its functionality, and
would love any feedback and suggestions.

~~~
lgierth
First of all, this looks like a great tool for datasets, thank you.

Do you have plans for adding file hashes to the datasets, e.g. sha256? This
would make it a lot easier to integrate with other systems.

~~~
amrrs
Sorry for the noob question, but could you please explain how adding hashes
would help with better integration?

~~~
lgierth
They mainly help in four ways:

\- avoid data corruption when downloading/transferring/copying datasets

\- notice changes/updates in the original dataset

\- dataset versioning (think how e.g. git turns directories and files into
hash trees -- also called content-addressing)

\- most importantly: stable names without a naming authority
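
A minimal sketch of the first two points and the last one, using only Python's standard library. It assumes the host publishes a SHA-256 hex digest alongside each file, which is the feature being requested here and not something confirmed to exist. The file name `sample.csv` and the `sha256-` name prefix are hypothetical, chosen only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """True if the file on disk matches the published digest."""
    return sha256_of_file(path) == expected_hex.lower()


def content_address(data: bytes) -> str:
    """Name a blob by its own hash: same bytes give the same name,
    so no naming authority is needed. The prefix is hypothetical."""
    return "sha256-" + hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.csv"  # hypothetical dataset file
        sample.write_bytes(b"id,value\n1,10\n2,20\n")
        # Stands in for a digest the host would publish.
        published = sha256_of_file(sample)
        print(verify_download(sample, published))  # intact copy
        # Simulate corruption or an upstream update.
        sample.write_bytes(b"id,value\n1,10\n2,21\n")
        print(verify_download(sample, published))  # changed copy
```

A mismatch alone does not say whether the file was corrupted in transit or updated upstream. Telling those apart needs a digest published per version, which is where the hash-tree style of versioning comes in.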

~~~
prepend
How does this apply when you can filter the data or do conditional exports? Is
the idea that the csv has a fixed hash and if you trust that, you can trust
anything else?

------
QasimK
How about you let me download them without creating an account before calling
them “public”?

~~~
benhamner
Thanks for the feedback. This is likely a "not quite yet" vs. "never".

Definitely understand the motivations from a user standpoint for not needing
to login to download.

There are some non-obvious benefits we get as a small team by requiring login,
in addition to new user growth. Bandwidth for hosting data can be large, and
it's easier to reason about and prevent abuse in the context of authenticated
users.

We do enable previewing the dataset while logged out, and the preview
functionality will become more full-featured.

~~~
shepardrtc
I disagree with the parent. You've taken the time to organize and host these
datasets. The least we can do is create an account to download them.

~~~
diggan
Sure, but then maybe don't call it "public" as people will think it can be
downloaded without creating an account. Calling something public but requiring
account creation is misleading.

~~~
omg_ketchup
I'm not sure I'd agree - the data is accessible to anyone who wants it for
free. There are no restrictions on creating the account.

Radio is free, but you hear advertisements. Probably better to create an
account than to have product placement in the actual data.

~~~
diggan
The restriction I'm talking about is creating the account. "Public" (at least
for me) does not mean that I need to agree to some lengthy "terms of service",
"privacy policy" and create an account. Public means it's public and can be
accessed from curl or my browser of choice without signing a contract.

Not sure where you are from, but where I'm from (Sweden), public radio (not
free but public) and public TV are free of advertisements and do not require
me to sign up for an account to be able to listen to/watch them. That's what I
call public.

\- [https://www.kaggle.com/terms](https://www.kaggle.com/terms)

\- [https://www.kaggle.com/about/privacy](https://www.kaggle.com/about/privacy)

~~~
shepardrtc
When Kaggle is saying "public" dataset, they're implying the origin of the
dataset is public. Meaning, the datasets were created by various
groups/companies/institutions and made available for the general public.
Kaggle is simply hosting them again. They're doing us a service by organizing
them all into one location and eating the bandwidth costs. My argument is that
in return for that service to us, the least we can do is create an account
with them.

~~~
diggan
I have no problem with them offering datasets to the public and just requiring
to sign up for an account. But call them Kaggle-Public, Semi-Public or
anything else, public data has a meaning that is not what they are doing.

For example, the government where I live (Catalunya) has public data. So I can
go to the website and click download, no account required. If that data was
distributed via Kaggle and requires account signup to get, I would not
consider what they are providing public.

------
antirez
This is gold. When I wrote the NeuralRedis module I had so much fun
downloading a few random datasets from Kaggle and wrapping them in a few lines
of Ruby script to check what the results were in terms of predictions. Normally
the data is very high quality, the format well documented, and so forth.
However, make sure to check the license for the details, depending on how you
plan to use the data.

------
Radim
What happens when the company changes direction? If there's a shift of
priorities, an internal restructuring, a "strategic startup pivot", an
acquisition?

Not to assume bad faith on Kaggle's part, but we got burned one too many times
with private companies pushing their proprietary ("open") platforms for
gobbling up data. The "it's free! just create an account — data lock-in — gap
after project death/monetization" pattern leaves me a little cynical.

It's awesome that resources like these exist, but I'd be more comfortable
paying attention if this was hosted as raw data somewhere (Github?), with a
clear licensing and access model.

~~~
benhamner
We joined Google via acquisition one year ago, and Kaggle Datasets has grown
from 450 datasets to over 13,000 in that timeframe. We are firmly committed to
supporting and growing this platform.

------
neuromantik8086
The Awesome Public Datasets Github repo [1] also constitutes a good effort at
organizing all of the open data out there that people can play around with.

[1] [https://github.com/awesomedata/awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets)

------
metakermit
Wonderful, thanks for sharing this! It's useful that the kernels people have
submitted are there as well and that there is an HN-style upvoting mechanism.

As an aside – I'm really curious to explore the datasets with "fake" in the
title :)

[https://www.kaggle.com/datasets?sortBy=relevance&group=publi...](https://www.kaggle.com/datasets?sortBy=relevance&group=public&search=fake&page=1&pageSize=20&size=all&filetype=all&license=all)

------
cosmic_ape
It would help if the datasets were categorized by data type. Timeseries,
multilabel, etc...

~~~
benhamner
Not all the datasets are ML specific, but hopefully this helps:

\- [https://www.kaggle.com/tags](https://www.kaggle.com/tags)

\- [https://www.kaggle.com/tags/linguistics](https://www.kaggle.com/tags/linguistics)

\- [https://www.kaggle.com/tags/multiclass-classification](https://www.kaggle.com/tags/multiclass-classification)

\- [https://www.kaggle.com/tags/text-data](https://www.kaggle.com/tags/text-data)

------
socksy
Is there an announcement of some kind of change? Are they still owned by
Google? Or is this the thing where sometimes existing solutions will hit the
front page of HN? :)

~~~
benhamner
I shared this because another public data portal that I don't think has
changed in years ended up at the top of HN. Kaggle Datasets has grown by over
an order of magnitude in the past year, and jumps in scale fundamentally
change the utility of community products like this.

------
naushit
Any plans to share the same data/files using IPFS?

