
AWS Public Datasets - darshanrai
https://aws.amazon.com/public-datasets/
======
sitkack
Google Cloud has something similar:
[https://cloud.google.com/public-datasets/](https://cloud.google.com/public-datasets/)

I think it would show a kind of gilded age maturity if all the cloud providers
cooperated on their public datasets, because they are for the public good.

~~~
vgt
And the first terabyte of analysis for free in BigQuery!!

(Work at g)

~~~
diggan
That's all fine and dandy, but I assume I could also pull down the entire
dataset for free and run unlimited analysis locally for free?

~~~
IAmEveryone
Yes, although data transfer fees may apply if you go over the free quota.

~~~
diggan
Ah, I see. These "public" datasets are not actually public as they are only
available if you login (and I assume input your payment details before).

Shame on me for thinking "Public Datasets" meant that I could just download
them via rsync/http.

If you're interested in truly public datasets, see
[http://academictorrents.com/](http://academictorrents.com/) which hosts
datasets as torrents.

~~~
komali2
This seems critical - as Google is a corporation, I wonder where this sense
comes from? Obviously we have certain expectations for a company whose tagline
reads "don't be evil," but does that extend to criticizing them for not
providing terabytes of free transfers of terabytes of free data, which they
pay to host and serve?

~~~
diggan
> criticizing them for not providing terabytes of free transfers of terabytes
> of free data

My criticism is not towards them not providing free transfers of data. My
criticism is that they say "these datasets are freely hosted and accessible"
and public, while only being available while logged in to Google, which I
don't think classifies as public.

I would not have any problem if it's just called "Google-hosted Datasets" or
"Google-only Public Datasets", I just think the current naming is misleading.

Edit: compare this to the AWS Public Datasets, which are actually available
without an AWS account. Just go to [https://landsat-
pds.s3.amazonaws.com/c1/L8/139/045/LC08_L1TP...](https://landsat-
pds.s3.amazonaws.com/c1/L8/139/045/LC08_L1TP_139045_20170304_20170316_01_T1/index.html)
for example, which is part of the Landsat dataset.
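
A minimal sketch of what "available without an account" means in practice,
assuming the path layout (`c1/L8/<path>/<row>/<scene id>/index.html`) can be
generalized from the single link above. The function names are made up for
illustration, and the sketch does not check whether the bucket still serves
this path:

```python
from urllib.request import urlopen

# Host copied from the link above; the path layout is inferred from that one URL.
BUCKET_HOST = "landsat-pds.s3.amazonaws.com"


def landsat_index_url(path: str, row: str, scene_id: str) -> str:
    """Build the index.html URL for one scene, mirroring the link's layout."""
    return f"https://{BUCKET_HOST}/c1/L8/{path}/{row}/{scene_id}/index.html"


def fetch_anonymously(url: str, timeout: float = 10.0) -> bytes:
    """Plain HTTPS GET: no AWS credentials, no signed headers, no login."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    url = landsat_index_url(
        "139", "045", "LC08_L1TP_139045_20170304_20170316_01_T1"
    )
    print(url)
    # Uncomment to actually hit the network:
    # print(len(fetch_anonymously(url)), "bytes")
```

The point is only that the request is an ordinary unauthenticated GET, the
same thing curl or a browser would send.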

~~~
kalcode
A public library still requires a library card...

The term public means it's available to the public. Signing up for Google, or
for a library card, doesn't negate the fact that any member of the public can
access the data with a reasonable and insignificant barrier.

~~~
diggan
Sure, but the difference between getting a library card and signing up for a
Google account is that the first usually has a user agreement of about 10
lines in plain language, while the second has a user agreement of 10+ pages of
lawyer speak that probably allows them to sell your data. How is it reasonable
to expect a normal person to understand those kinds of user agreements?

And if that's your definition of public, isn't everything public, if you have
the right amount of money, know the right people and can get access to the
right place?

~~~
kalcode
You only pay a price because you are using someone else's system, which costs
money. Public libraries are public, yet you also pay taxes for them, because
using them as a service still has a cost.

Google gives you 1TB of data to query each month. That is a lot of data; if
you need more for free, collect the data over time.

But you still need a library card, and if the facility gets overused they ask
for a small tax to cover the operating costs.

I fail to see how this isn't considered public. Public literally just means
accessible to the general public. That's all. A public event can still charge
a fee for entry, as long as the fee doesn't create a barrier for most of the
public, so that it stays accessible to the general public.

Also, if a library sold its data, would it no longer be considered public? I
believe it would still be considered public. User data typically isn't sold by
name or email, but by activity. If a library compiled a list of books checked
out and their frequency, the number of people entering every day, etc., that
would be the equivalent of most user data being sold. It is very rare for a
company to sell your actual personal data; when they do, they dissociate your
personal information from it.

So if the above doesn't disqualify a library from being public, then neither
would it disqualify this dataset from being public. If you really disagree
with that, then you are just being pedantic at that point.

------
Mizza
If anybody is interested in AI/bioinformatics projects on AWS, I'm currently
involved in a project to harmonize _all_ of the publicly available RNA data
(many petabytes) into easily sliced, AI/ML-ready datasets for anybody to use:

Website: [http://www.refine.bio/](http://www.refine.bio/)

Source:
[https://github.com/AlexsLemonade/refinebio](https://github.com/AlexsLemonade/refinebio)

~~~
eggie
Would this allow rapid realignment of the data against a given reference
(graph?) model? Or is the alignment somehow baked in?

~~~
Mizza
We do the alignment too, yes.

~~~
eggie
What reference model do you align the RNAseq against? That would seem to have
a huge effect on results, no?

------
benhamner
Over 13,000 community-uploaded public datasets:
[https://www.kaggle.com/datasets](https://www.kaggle.com/datasets)

(I work at Kaggle)

~~~
santiagobasulto
I take the chance to ask you: is there any command line utility similar to pip
to import datasets from kaggle? Something like:

`kik pull titanic`

~~~
Godel_unicode
[https://github.com/Kaggle/kaggle-api](https://github.com/Kaggle/kaggle-api)

------
chrisbaglieri
A lot of these datasets have been available for some time:
[https://aws.amazon.com/blogs/aws/new-aws-public-data-sets-
tc...](https://aws.amazon.com/blogs/aws/new-aws-public-data-sets-tcga-and-
icgc/). Perhaps what is most surprising is that the list hasn't grown a ton
since then.

~~~
yellowbkpk
As someone who has spent a considerable amount of time on data that has ended
up on this page, I think the fact that the list hasn't grown says more about
the priorities of other companies than of AWS. Amazon doesn't (yet) have time
to build and maintain these datasets themselves: they work with others to
build and maintain them and then fund the storage and transmission fees.

I helped build the Terrain Tiles dataset as part of Mapzen, which recently
shut down. The OpenStreetMap data exists on the AWS Public Datasets page
because it's useful to Humanitarian OpenStreetMap Team. If you're able to
convince your company to generate and work with a public dataset, consider
reaching out to the AWS and Google public datasets teams to get it hosted and
publicized.

------
marcinzm
Interesting, they used to have Wikipedia data and now it's gone from the list,
anyone know why?

~~~
arxpoetica
Also curious...

------
minimaxir
How much does it cost to export/process this data from Amazon? Unlike
GCP/BigQuery which has free cloud processing built in, downloading/analyzing
these GB/TB datasets for personal analysis can’t be cheap.

The descriptions note “Educators, researchers and students can apply for free
promotional credits to take advantage of Public Datasets on AWS.” which is not
a good sign.

~~~
sjburt
It's the normal data cost. E.g., the Landsat data is stored in S3, so it's
free to EC2 (in the same region), and somewhere around $0.09/GB to the public
internet.

However, the idea is not that you download it all (there are probably cheaper
ways to acquire Landsat data); the idea is that if you want to do whatever
analysis on AWS, they've already got it neatly ingested for you.
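
As a rough back-of-the-envelope sketch, assuming the approximate $0.09 per GB
egress figure quoted above and free transfer to EC2 in the same region (real
AWS pricing is tiered and changes over time, and the function name here is
made up for illustration):

```python
# Approximate figure from the comment above, not a quoted AWS price.
EGRESS_USD_PER_GB = 0.09


def estimate_transfer_cost(size_gb: float, to_same_region_ec2: bool = False) -> float:
    """Estimate the cost in USD of pulling size_gb out of S3."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if to_same_region_ec2:
        # S3 to EC2 in the same region is treated as free here.
        return 0.0
    return round(size_gb * EGRESS_USD_PER_GB, 2)


if __name__ == "__main__":
    for size in (1, 100, 1000):
        print(f"{size} GB -> ${estimate_transfer_cost(size):.2f}")
```

At that flat rate a full terabyte to the public internet comes to about $90,
which is why doing the analysis next to the data tends to be the cheaper path.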

------
pacificleo11
The importance of data for machine learning algorithms can’t be stressed
enough. I remember talking to a friend on the Google Translate team. They had
a good algorithm, but they were struggling to get quality translation data to
train their service. The problem was more severe for languages that were not
very popular. The translation set was next to nothing for, say, something
like Turkish, Hindi, Latvian etc.

They finally solved this by using meeting notes from the UN Assembly, which
were transcribed by the best of the translators. That access to meeting
transcriptions was an (unfair ??) advantage Google had over other tools. Was
it wrong? I don’t think so. Should those meeting notes have been public? Yes.

~~~
pavlov
When Google Translate was new in my language Finnish, their translation
quality was atrocious.

I tried this sentence: "Hei äiti, puhun suomea". The expected translation
would be "Hi mom, I speak Finnish".

Instead Google's result was: "July's mother, I speak English".

Obviously the engine had been trained on unvetted data sets where the word
"English" occurred in translations in a position where the original had the
word "Finnish", and no context was provided to avoid this kind of mistake.

The word "July" came about because "hei" is also used as an abbreviation for
"heinäkuu" (July). It was sobering that a supposed world-class AI couldn't
distinguish between these two usages. Machine learning needs a lot of old-
fashioned, hand-tuned, human-made heuristics.

~~~
komali2
Sobering perhaps to learn what the state of the art is for "world class" AI
indeed ;)

Cortana when??

------
gajju3588
Adding one more source of open datasets to the thread: manually tagged,
high-quality datasets:

[https://dataturks.com/projects/trending](https://dataturks.com/projects/trending)

~~~
ah-
Seems to be down. What do they have?

------
neuromantik8086
This doesn't appear to be publicized on the linked-to page, but two of the
major open access fMRI dataset repositories (FCP-INDI [1] and OpenfMRI [2])
are hosted in S3.

[1]
[http://fcon_1000.projects.nitrc.org/](http://fcon_1000.projects.nitrc.org/)
[2] [https://openfmri.org/](https://openfmri.org/)

------
elorant
I wonder if anyone has processed Common Crawl's dataset for a list of all the
existing domains (I don't care if the data are old).

~~~
RobAley
You can get the URL data (and parse out the domains yourself) at
[http://index.commoncrawl.org/](http://index.commoncrawl.org/). Be aware that
Common Crawl, while huge, only crawls a fraction of the web, so you won't get
all of the "existing domains" (i.e. all domains that exist) if that is what
you are after.
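
The "parse out the domains yourself" step could look something like the
sketch below. It assumes you have already pulled a list of URLs out of the
index by whatever means; the function name is made up for illustration, and
it collects hostnames rather than registrable domains (reducing
`www.example.co.uk` to `example.co.uk` would need the Public Suffix List,
which this does not attempt):

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return the sorted, de-duplicated hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        try:
            # .hostname is lowercased with any port stripped; None if absent.
            host = urlsplit(url.strip()).hostname
        except ValueError:
            # Malformed URL (e.g. a broken IPv6 literal): skip it.
            continue
        if host:
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    sample = [
        "http://example.com/a",
        "https://EXAMPLE.com/b?q=1",
        "http://sub.example.org:8080/x",
        "not a url",
    ]
    for host in unique_hosts(sample):
        print(host)
```

For an index-sized input you would stream the URLs rather than hold them in a
list, but the per-URL logic stays the same.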

------
tracker1
I wish they'd centralize WHOIS data and clean it up... that is some messy
stuff there.

