
Google Dataset Search - kmax12
https://toolbox.google.com/datasetsearch
======
danso
Very nice. Worth keeping in mind prior examples for comparison's sake. My
favorites so far:

\- [https://www.opendatanetwork.com](https://www.opendatanetwork.com) \-- what
I would call the "Google, for Socrata datasets"

\- [https://public.enigma.com/](https://public.enigma.com/) \-- One of the
best collections of U.S. federal data, with good taxonomy and lots of useful
options for refining a search, such as filtering by dataset size.

\- [https://www.data.gov/](https://www.data.gov/) \-- Not as useful as what
most people would want -- e.g. unlike Enigma and Socrata, it's a directory of
self-submitted (by the government) data sources, not one in which the data is
stored/provided in a standardized way. But it's a pretty good listing, though
not sure if it's much better than just using Google.

\- [https://data.gov.uk/](https://data.gov.uk/) \-- Better than the U.S.
version in terms of usability and taxonomy.

~~~
wiredfool
[https://data.gov.ie](https://data.gov.ie) \-- this is a repository of a bunch
of the open data produced in the Irish public sector.

(Disclosure, I work on this)

~~~
fnord123
Playing around, a lot of countries use this url scheme:

[https://data.gov.be](https://data.gov.be)

[http://data.gov.ro/](http://data.gov.ro/)

[https://www.dati.gov.it/](https://www.dati.gov.it/) (note, Italy redirects
data. to dati.)

[https://data.gov.pl/](https://data.gov.pl/)

[https://data.gov.za/](https://data.gov.za/) (South Africa's has a cert
problem)

[https://data.gov.in](https://data.gov.in)

[https://data.gov.au/](https://data.gov.au/)

[https://datos.gob.mx/](https://datos.gob.mx/)

~~~
mnx
Although data.gov.pl exists, and even presents a valid cert, it has no
content. The place with the data is:

[https://danepubliczne.gov.pl/en/#](https://danepubliczne.gov.pl/en/#)

(website is being terribly slow for me right now)

------
Dangeranger
Nice. I'm glad Google is making it easier to find public data sets. I wish
that these could be filtered by format, so that you could narrow them to CSV,
XML, JSON, KML, etc.

Another nice resource that I've used in the past is 'toddmotto/public-apis' on
Github [0].

In the end I would prefer all public data sets to be available over the DAT
protocol [1] instead of being hosted only on government or organization
websites. A lot of climate data previously made available by the EPA was taken
down, and only saved by efforts of volunteers.[2]

[0] [https://github.com/toddmotto/public-
apis](https://github.com/toddmotto/public-apis)

[1] [https://datproject.org/](https://datproject.org/)

[2] [https://www.wired.com/2017/01/rogue-scientists-race-save-
cli...](https://www.wired.com/2017/01/rogue-scientists-race-save-climate-data-
trump/)

~~~
neuromantik8086
Dat's pretty cool, but it's not the only game out there. The efforts of git-
annex/Datalad [0], Academic Torrents [1], Quilt[2], DVC[3], and Pachyderm [4]
are also notable in this space. My hopes are broader in the sense that I just
hope that dataset versioning happens in the first place.

[0] [https://www.datalad.org/](https://www.datalad.org/)

[1] [http://academictorrents.com/](http://academictorrents.com/)

[2] [https://quiltdata.com/](https://quiltdata.com/)

[3] [https://dvc.org/](https://dvc.org/)

[4] [http://www.pachyderm.io/](http://www.pachyderm.io/)

~~~
rgardaphe
[5] [https://qri.io](https://qri.io) (we're tackling the dataset versioning
problem head on).

------
heinrichhartman
The state of data sharing seems to be still quite sad.

* Hosting problems. The first link I tried was already broken.

* Format problems. The presented data is in all kinds of formats; some "data sets" even require me to read data off images: [https://www.ceicdata.com/en/indicator/germany/gdp-per-capita](https://www.ceicdata.com/en/indicator/germany/gdp-per-capita) And even if it's JSON, that is not particularly great either (Unicode support? Large (64-bit) integers?).

* Update problems. Many data-sets change over time (e.g. GDP). How can I subscribe to updates? "git pull" would be nice.

* Provenance problems. I want to know who put which record into the dataset, when and why? "git log" would be nice.

* Presentation problems. (This is OK sometimes) I don't necessarily want to download a 5GB file before I've looked into it. The first few rows of the dataset should be presented on the page, along with information about it.
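
On the JSON point about large integers, here is a minimal Python sketch of the failure mode. The JSON grammar puts no limit on number size, but parsers that store every number as an IEEE 754 double (JavaScript's `JSON.parse` is the usual example) only keep integers exact up to 2^53. The `record_id` field name is made up for illustration, and the `parse_int=float` hook is just one way to imitate such a parser from Python.

```python
import json

# Largest signed 64-bit integer. "record_id" is a hypothetical field name.
BIG = 2**63 - 1
payload = json.dumps({"record_id": BIG})

# Python's json module parses integers with arbitrary precision.
exact = json.loads(payload)["record_id"]

# Imitate a double-based parser by converting every JSON integer to float.
lossy = int(json.loads(payload, parse_int=float)["record_id"])

# Workaround sketch: carry the identifier as a string instead of a number.
safe_payload = json.dumps({"record_id": str(BIG)})
round_tripped = int(json.loads(safe_payload)["record_id"])

print(exact == BIG)          # does the integer survive in Python?
print(lossy == BIG)          # does it survive a trip through a double?
print(round_tripped == BIG)  # does the string-encoded value survive?
```

Shipping such identifiers as strings is a common workaround, at the cost of pushing the conversion onto every consumer.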

Wrote down a few more thoughts a while ago here:
[https://github.com/HeinrichHartmann/data-sharing#in-the-
idea...](https://github.com/HeinrichHartmann/data-sharing#in-the-ideal-world)

Approaches I have seen so far in the wild:

* figshare.com -- Addresses Hosting and Presentation.

* [https://quiltdata.com/](https://quiltdata.com/) \-- (!) looks great. Still exploring.

* github.com -- works fine for small datasets (<1GB)

* packaging (yum, pkg, pip) -- (?) Not sure if that works, but at least they solve: Hosting, Update, Provenance.

This seems to be a wide open problem to me.

~~~
rgardaphe
Totally agree! At Qri ([https://qri.io](https://qri.io)) we're working on many
of these problems together - hosting, formatting (interoperability),
provenance and sync. It's an open source project - we'd love to have your
feedback as we design it!

~~~
sixdimensional
This is cool and is a perfect case for IPFS for public datasets. I've not
heard of it before, though, and I think naming / branding is something that
makes finding these things / building momentum more difficult.

For example, someone else mentioned enigma.com. I would have no idea that is
related to data sources / sets unless I knew what it was.

Certainly wish you the best of luck though and will keep an eye on Qri! Cool
project!

~~~
rgardaphe
Thanks! You can follow us on twitter where we announce most updates: @qri_io

------
minimaxir
Interesting vertical integration for Kaggle Datasets:
[https://www.kaggle.com/datasets](https://www.kaggle.com/datasets)

~~~
rufuspollock
Note Kaggle recently adopted the FrictionlessData.io Data Package specs:
[https://github.com/Kaggle/kaggle-api/wiki/Dataset-
Metadata](https://github.com/Kaggle/kaggle-api/wiki/Dataset-Metadata)

------
rebel1
Interestingly enough, I built and released something very similar [1] about a
month ago using a Google Custom Search Engine.

Here is the Show HN for it:
[https://news.ycombinator.com/item?id=17789119](https://news.ycombinator.com/item?id=17789119)

[1] [https://databasd.com/](https://databasd.com/)

------
MrEfficiency
What is the best way for a website to format data to the public?

I already have my presentation, but I can also provide it as a .xls, .csv,
sql, or html table.

What would be best to help programmers/data scientists use my data?

~~~
combatentropy
I vote CSV.

\- simple

\- lightweight

\- open

It's easy for the consumer of your data to convert a CSV to whatever format
they need.

\- spreadsheet, for personal analysis

\- SQL database, for industrial-strength analysis

\- HTML, for pretty output to _their_ users

~~~
lolive
Oh, and don't forget that CSV is a rigorous data format, with many pitfalls
(quoting, embedded commas, embedded newlines). Don't just join some text
together with commas and call it CSV. Instead, use a dedicated lib to create
the CSV for you.
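
A minimal sketch of that advice, assuming Python and its standard `csv` module (the thread doesn't name a language or library). The sample rows and the `rows_to_csv` helper are made up for illustration; the point is that the library quotes commas, quotes, and newlines so the data round-trips, while naive joining does not.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize rows to CSV text, letting the csv module handle quoting."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical fields containing a comma, double quotes, and a newline.
header = ["city", "note"]
rows = [["Cork, Ireland", 'said "hello"'], ["Galway", "line one\nline two"]]

text = rows_to_csv(header, rows)

# Reading the library-generated text back yields the original rows.
parsed = list(csv.reader(io.StringIO(text, newline="")))

# Naive joining turns the two-field first row into three fields.
naive = ",".join(rows[0])
naive_fields = next(csv.reader([naive]))

print(parsed == [header] + rows)
print(len(naive_fields))
```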

------
frabcus
Nice to see Google trying this again!

It's one of those areas they have a long history of attempts in - e.g.
Google Public Data Explorer, which never quite reached its potential, and
Freebase, which although flawed was good and was shut down after Google
acquired it.

I like that this is search based! The web is still the best place to publish
data - in fact in my view normal Google search is still by far the best way to
find datasets, even though it isn't directly designed for that.

There's a link from the about page of Google Dataset Search to this help for
webmasters on how to mark up content for it - although it is a bit odd, mainly
showing how to mark a dataset with a DOI (so good for academics certainly!):

[https://productforums.google.com/forum/#!topic/webmasters/nP...](https://productforums.google.com/forum/#!topic/webmasters/nPq4BW6iPIA)

Just metadata about data feels like a very niche thing to search to me - I'm
still not convinced anyone will maintain the metadata well enough to help.
Possibly will work in particular domains.

Does Dataset Search have some way to search column headings, types or content
(of CSV, Excel, JSON etc)? I can imagine a load of operators that would make
that really powerful for finding badly meta-marked up datasets deep in the
web. Would seem like the obvious extra thing a dataset search would do.

Also previews please!!! Just nicely render the first ten rows of common formats
- CSV and Excel to begin with.
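
For CSV at least, the preview part is cheap. Here is a rough Python sketch that reads only the header and the first ten rows from a text stream. `preview_csv` is a hypothetical helper, the sample values are made up, and nothing here reflects how Dataset Search itself works.

```python
import csv
import io
from itertools import islice


def preview_csv(stream, n_rows=10):
    """Return the header and up to n_rows data rows from a text stream.

    The reader is consumed lazily, so rows past the preview are never parsed.
    """
    reader = csv.reader(stream)
    header = next(reader, [])
    return header, list(islice(reader, n_rows))


# Made-up sample: 25 data rows under a two-column header.
sample = io.StringIO(
    "year,value\n" + "".join(f"{2000 + i},{i * 10}\n" for i in range(25))
)

header, rows = preview_csv(sample)
print(header)
for row in rows:
    print(row)
```

The same function should work on any iterable of text lines, such as an open file, which is what would let a page show a preview without pulling a multi-gigabyte file first.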

What part of Google is doing this?

------
earth2mars
It would be good to know who sponsored the data research so that everyone can
make their own self-biased decision on whether to use/trust the data.

------
neuromantik8086
Looks like academic institutional repositories and figshare are doing the
heavy lifting here. It's still neat to see Google aggregate everything, but
it's not that different from what they do with other services relying on these
sources already, and is largely dependent upon how rich these upstream sources
are in the first place.

------
sbr464
This is nice. I'm working on a similar open source project called
DataLibrary [0], which is being released soon.

It goes further by bringing this kind of data together into a single API,
converting/cleaning into a similar schema where possible.

A small write up can be found on github [1]. Any feedback/ideas would be
appreciated!

[0] [https://www.datalibrary.com](https://www.datalibrary.com) (not online
currently)

[1]
[https://github.com/reactual/datalibrary/blob/master/README.m...](https://github.com/reactual/datalibrary/blob/master/README.md)

------
ToFab123
Pretty lame that it doesn't work in Edge.

------
afettere
[https://data.urbanfootprint.com/](https://data.urbanfootprint.com/) \- browse
thousands of environmental, social, transportation, and land use datasets.

(disclosure - I work there)

------
sprague
Then there's this $500K just awarded by the NSF to build a "Google for data
sets". I wonder if, before making these sorts of grants, the NSF looks at what
Google and other companies are already (or likely) doing.
[https://www.lehigh.edu/engineering/news/faculty/2018/2018082...](https://www.lehigh.edu/engineering/news/faculty/2018/20180820-davison-
heflin-jia-dataset-search-engine-nsf-award.html)

------
nterpo
You may want to try
[https://data.opendatasoft.com](https://data.opendatasoft.com)
\-- thousands of datasets available through the same API, usable online, no
download required.

------
smartvlad
Look at [https://knoema.com](https://knoema.com) which positions itself as a
search engine for data with more than 2.5 billion time series available. They
provide both visual data discovery through search and navigation as well as
API access through Python, R etc.

------
spoiledtechie
Can we list some good data searches to suggest below this comment?

I would love to see some cool data we might be able to use.

------
madisonmay
I also very much like [https://www.figure-eight.com/data-for-
everyone/](https://www.figure-eight.com/data-for-everyone/). It's not
optimized for search but it's an excellent repository of high quality
datasets.

------
michaelmior
Of note is the link below which indicates how you can have your dataset
indexed.

[https://developers.google.com/search/docs/data-
types/dataset](https://developers.google.com/search/docs/data-types/dataset)
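
The linked page describes schema.org `Dataset` markup embedded in the dataset's web page. A minimal sketch of generating it as JSON-LD follows. The dataset name, description, and example.org URLs are placeholders, `dataset_jsonld` is a hypothetical helper, and the linked documentation remains the authority on which properties are required or recommended.

```python
import json


def dataset_jsonld(name, description, url, download_url,
                   encoding_format="text/csv"):
    """Build a schema.org Dataset description as a JSON-LD string."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }
    return json.dumps(doc, indent=2)


# Placeholder values for illustration only.
markup = dataset_jsonld(
    name="Example rainfall measurements",
    description="Hypothetical daily rainfall totals, used as sample metadata.",
    url="https://example.org/datasets/rainfall",
    download_url="https://example.org/datasets/rainfall.csv",
)

# The JSON-LD goes inside a script tag in the HTML of the dataset's page.
script_tag = f'<script type="application/ld+json">\n{markup}\n</script>'
print(script_tag)
```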

------
ece
Nothing for FRED
[https://toolbox.google.com/datasetsearch/search?query=site%3...](https://toolbox.google.com/datasetsearch/search?query=site%3Afred.stlouisfed.org)
hmm...

------
jonbaer
Using Google to search Google ...
[https://toolbox.google.com/datasetsearch/search?query=site%3...](https://toolbox.google.com/datasetsearch/search?query=site%3Agoogle.com)

------
jwillmer
Since we already have a nice reference list of open data portals by country,
I'd like to add the German portal:
[https://www.govdata.de/](https://www.govdata.de/)

------
sytelus
Looks like this doesn’t crawl one of the most important sources of data:
Academic Torrents. For example, I searched for

ilsvrc lmdb

This should have found imagenet data in lmdb format available somewhere but it
returned no results.

------
MR4D
Maybe I’m missing something but this strikes me as underwhelming. It feels
like something that I could build, as opposed to something from the firm that
created Maps and Gmail.

Is it just me?

------
xcubic
Add this one: [https://dados.gov.pt/](https://dados.gov.pt/) (Portugal)

------
a2x
Google is so damn bad at UX. An excellent example of the interaction of
unlimited resources and priorities.

------
xaranke
Placeholder comment to calculate the time before Google inevitably shuts this
down.

------
ericand
What's up with this domain? Is there anything else under toolbox.google.com?

~~~
amadeuspzs
Here you go: [http://bfy.tw/JkG7](http://bfy.tw/JkG7)

~~~
dang
Please don't do this here.

------
lmy86263
It will be helpful for anybody who majors in AI.

