
Discovering Millions of Datasets on the Web - Anon84
https://blog.google/products/search/discovering-millions-datasets-web/
======
sixstringtheory
This is great. I've been collecting a list of open data sets for a while now
with an eye to turning it into a blog post at some point. Now maybe I don't have
to... saved me some work!

Some other indices of open data sets I've found:

[https://registry.opendata.aws](https://registry.opendata.aws)

[https://en.m.wikipedia.org/wiki/List_of_datasets_for_machine...](https://en.m.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research)

[https://meta.m.wikimedia.org/wiki/Datasets](https://meta.m.wikimedia.org/wiki/Datasets)

~~~
breck
This is a good list: [https://github.com/awesomedata/awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets)

~~~
subroutine
Also zenodo.org seems like it's gaining traction as a de facto data repo
for scientific journals. I've had to deposit a copy of my raw data here 2x in
the last 6 months for different pubs (1 neurobio, 1 genomics).

------
lettergram
For those interested, I recently wrote a blog post on how to download & parse
USPTO patents for a large free corpus for NLP problems:

[https://austingwalters.com/parsing-uspto-patents-to-create-a...](https://austingwalters.com/parsing-uspto-patents-to-create-a-dataset/)

I actually have found FOIA requests[1] and downloads from government websites
to be the easiest & most effective way to get robust datasets.

[1] [https://austingwalters.com/foia-requesting-100-universities/](https://austingwalters.com/foia-requesting-100-universities/)

------
philshem
A couple good resources for finding datasets:

\+ For individual requests, come over to
[https://opendata.stackexchange.com/](https://opendata.stackexchange.com/) and
ask!

\+ Wikidata has loads of structured data, but using SPARQL is often a barrier.
But you can request help:
[https://www.wikidata.org/wiki/Wikidata:Request_a_query](https://www.wikidata.org/wiki/Wikidata:Request_a_query)
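
For anyone who wants to try SPARQL before asking for help, here is a minimal Python sketch of querying the public Wikidata endpoint with only the standard library. The example query (items that are an instance of "house cat"), the helper names, and the User-Agent string are illustrative assumptions, not anything from the comment above; the response handling assumes the standard SPARQL JSON results layout.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: ten items that are an instance of (P31) house cat (Q146).
EXAMPLE_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""


def build_request(query, endpoint=WIKIDATA_SPARQL_ENDPOINT):
    """Build a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        # Hypothetical client name; use something that identifies your script.
        "User-Agent": "dataset-thread-example/0.1",
        "Accept": "application/sparql-results+json",
    }
    return urllib.request.Request(endpoint + "?" + params, headers=headers)


def flatten_bindings(payload):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in payload["results"]["bindings"]
    ]


def run_query(query):
    """Send the query over the network and return flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=30) as response:
        return flatten_bindings(json.load(response))
```

Calling `run_query(EXAMPLE_QUERY)` needs network access and should return a list of small dicts, one per result row, which is easy to hand to `csv.DictWriter` or a DataFrame constructor.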

------
dzonga
I really feel like the data side of things is under-rated. Mostly, it seems
like when people talk of IP, they talk about the software and forget the
data. Uber, Snapchat etc. are companies mostly in the business of shuffling
data around. Good or bad, that's subjective. And this data-search product is a
nice welcome to those people who are trying to get something off the ground,
do research, or just understand the world and human behaviour better.

~~~
bordercases
I don't think it's underrated, but I do think that there is a gap between
massive data and an idea of what to do with it.

And data is simple, it's parameters plus timestamps plus a lot of storage.

Realtime access is harder but it's a well-specified problem.

The issue is inference. No one does inference extremely well except in limited
circumstances. It's one of our greatest bottlenecks as humans and our software
is going to be limited by it as well insofar as our understanding of what to
build is controlled by what kind of inference we want.

~~~
danso
I think data is underrated because it is actually _not_ "simple". Especially
the collection and curation of it.

------
grogenaut
Trying to teach my wife pandas and the thing she most wants to do is compute
the 10-year projected return on Fortune 500 companies (Buffettology) based on
the last ten years of financial reports. It's really hard to find a good data
source though, as it's either in PDF or Google has been optimized toward
rent-seeking data repackagers where it's hard to see if they have the data
without jumping through hoops. Would love a source for that.
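
Once the data is in hand, the calculation itself is short. The sketch below shows one simplified reading of a Buffettology-style projection, and every modelling choice in it is an assumption: it takes the compound annual growth rate of earnings per share over the history, extrapolates that rate forward, prices the future earnings at an assumed P/E multiple, and ignores dividends. The function names and inputs are hypothetical, and it is plain Python so it can be checked by hand before being ported to pandas columns.

```python
def cagr(first_value, last_value, years):
    """Compound annual growth rate between two positive values."""
    if first_value <= 0 or last_value <= 0 or years <= 0:
        raise ValueError("CAGR needs positive values and a positive span")
    return (last_value / first_value) ** (1 / years) - 1


def projected_annual_return(eps_history, current_price, assumed_pe,
                            horizon_years=10):
    """Annualized return implied by extrapolating historical EPS growth.

    eps_history: yearly earnings per share, oldest first (assumed input).
    assumed_pe:  the P/E multiple you guess the stock trades at in the future.
    """
    # N yearly data points span N - 1 years of growth.
    growth = cagr(eps_history[0], eps_history[-1], len(eps_history) - 1)
    future_eps = eps_history[-1] * (1 + growth) ** horizon_years
    future_price = future_eps * assumed_pe
    return cagr(current_price, future_price, horizon_years)
```

Negative or zero earnings in the first or last year make the growth rate undefined, which is why `cagr` raises instead of returning a number; real filings will need that case handled explicitly.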

~~~
hbcondo714
If you use Google's Dataset Search for SEC Filings[1], you get outdated
information. FTP access was removed years ago, but SEC Filings are still
a great example of large datasets. I built a side business at
[https://Last10K.com](https://Last10K.com) using Buffettology and provide 10
years of company annual reports (10Ks). There's also an API at
[https://dev.Last10K.com](https://dev.Last10K.com) that returns financial data
from these filings in JSON or XML.

[1][https://datasetsearch.research.google.com/search?query=EDGAR...](https://datasetsearch.research.google.com/search?query=EDGAR%20Database%20of%20SEC%20Filings&docid=3%2Bc8oAy8U5sVy2HZAAAAAA%3D%3D)

~~~
grogenaut
Interesting. I was considering having her, as a side hustle, type these sheets
into something I could then sell. Sounds like that was what you did. How did
that work out?

~~~
hbcondo714
Didn't see any contact details on your HN profile so feel free to contact me
directly and I can provide details.

------
blacksmith_tb
I was picturing a link to Shodan, nice to see this is about legitimate sources
instead.

------
unitykid9008
If anyone is looking for cleaned and linked finance datasets and works at a
university, you should double check whether you get access to Wharton Research
Data Services [https://wrds-www.wharton.upenn.edu/](https://wrds-www.wharton.upenn.edu/);
it could save you a lot of time.

------
fudged71
All this (and the comments) has taught me is that the data set I'm looking for
doesn't exist in the public domain. Time to make it myself.

------
igravious
Found a Vocabulary of Philosophy using it, very skookum!
[https://www.loterre.fr/skosmos/73G/en/](https://www.loterre.fr/skosmos/73G/en/)

Unfortunately, most every result for the word `philosophy' is borderline
garbage imho. Keyword indexing of datasets may need improving?

------
davedx
I gave this a try for a few queries, but the results are very varied. Often
you get the landing page for a study with its PDF behind a research/journal
paywall (even with the "Free" filter applied, so not sure what "free" means to
Google). Sometimes the "dataset" is some kind of visualization without any
obvious way to get the raw data. Only a couple of results had a JSON or CSV to
download.

Overall a bit underwhelming.

------
livingmargot
Still can't find the damned 'International Corpus of Learner English', though.

~~~
nl
You seem to be able to order it here:
[https://www.i6doc.com/en/collections/cdicle/](https://www.i6doc.com/en/collections/cdicle/)

~~~
livingmargot
I have no practical way of reading a CD-ROM. Why don't they make a virtual
version available? Thanks anyway.

~~~
ant6n
If you pay $350 for a CD-ROM you should be able to find a way to digitize it,
for example at a copy shop or by just getting a USB drive.

------
sandinmyjoints
Please try searching for datasets on this site on mobile. It needs some work.

------
bg117
Are there any resources where models are available?

