Google Dataset Search (toolbox.google.com)
1035 points by kmax12 on Sept 5, 2018 | hide | past | favorite | 76 comments

Very nice. Worth keeping in mind prior examples for comparison's sake. My favorites so far:

- https://www.opendatanetwork.com -- what I would call the "Google, for Socrata datasets"

- https://public.enigma.com/ -- One of the best collections of U.S. federal data, with good taxonomy and lots of useful options for refining a search, such as filtering by dataset size.

- https://www.data.gov/ -- Not as useful as most people would want -- e.g. unlike Enigma and Socrata, it's a directory of data sources self-submitted by the government, not one in which the data is stored/provided in a standardized way. But it's a pretty good listing, though I'm not sure it's much better than just using Google.

- https://data.gov.uk/ -- Better than the U.S. version in terms of usability and taxonomy.

@danso thanks for the feedback on data.gov. I'm part of the small (3-person) team that helps manage it. If you have a moment to chat, I'd like to reach out to see if you'd be interested in participating in some more in-depth user research in the future. Folks can also always leave feedback via email, GitHub, Twitter, and other means - https://www.data.gov/contact

The Federal Data Strategy will also be opening up for comments again in October - https://strategy.data.gov/feedback/

Data.gov and Federal agencies use the same metadata standard (DCAT) that Google Dataset Search is using, so much of our metadata is also being syndicated there.

I think the biggest blocker with using publicly available datasets is stale data.

If you, or anyone else who aggregates these datasets could make it EASY to find the FREQUENCY of updates, rather than just the LAST UPDATED timestamp, it'd incentivize people to consume APIs more.

I realize having a snapshot from 2014 is better than what was publicly available before. But I feel no one's really talked about why they would or wouldn't use particular data.

I think this is exactly correct. Frequency of updates (and clear documentation of the lag between when data is reported and the period to which it applies) is often missing or hard to find.

The value of increasing the cadence of updates shouldn't be understated either! A lot of public datasets report on annual frequencies with more than a quarter's delay... Although this is a different issue altogether that has more to do with the processes of the reporting agency.

Yes, it's interesting how much difference the data about data management can make in people's engagement with the platform.

Definitely, feel free to email me (in my user bio). Thanks for the info about the upcoming comment period, will have to put a reminder on my calendar for that.

Interesting. There have been a lot of attempts at "meta data portals" that search across portals. Most of them have struggled.

At Open Knowledge we built a really early one called opendatasearch.org in 2011/2012 - now defunct - and were involved in the first version of the pan EU open data portal. We also had the original https://ckan.net/ (and subsites) which is now https://datahub.io/ and has become much more focused on quality data and data deployment. [Disclosure: I was/am involved in many of these projects]

The challenge, as others have mentioned, is that data quality is very variable and searching for datasets is complicated (think of software as an analogy - searching for good code libraries is a bit of an art).

I imagine Google are trying this out before making datasets another "special type" of search result -- after all you can already search google for datasets. In addition, Google are already Google so including datasets will have a level of comprehensiveness and exposure you struggle with elsewhere (part of the power of monopoly in a sense!).

PS: for those looking for data gov sites https://dataportals.org/ has most of these.

https://data.gov.ie -- this is a repository of a bunch of the open data produced in the Irish public sector.

(Disclosure, I work on this)

Playing around, a lot of countries use this URL scheme:



https://www.dati.gov.it/ (note, Italy redirects data. to dati.)


https://data.gov.za/ (South Africa's has a cert problem)




although data.gov.pl exists, and even presents a valid cert, it has no content. The place with the data is:


(website is being terribly slow for me right now)

https://data.gov.sg/ (Singapore)

I'm a particular fan of the boundary "dataset" you have that's a low-resolution TIFF file.

(edit to add something more productive: the site is littered, to the tune of at least 25% and maybe even a third, with junk "data", all obviously added to get the number of records as high as possible, with no regard to whether that data is useful to anybody, machine-readable in any way at all, or -- as in the example above -- even qualifies as "data". Data.gov.ie would be moderately interesting if all the shit in it was removed.)

The quality of the datasets varies greatly depending on the source. Some work well, some, less so. There are data sources that are undergoing active development to harvest them more accurately. None of them were added to pump up the numbers.

The biggest numbers bump recently was ca 1600 Met Eireann rainfall records datasets from all around the country, some of them daily rainfall dating back 60 years. (Spoiler, there’s a lot of rain)

This is specifically a catalog of datasets; it doesn't host the data except for previews, and even that is pretty complicated.

> None of them were added to pump up the numbers.

Then kindly explain these


Or these


(I'd show only the PDF-only ones but your search doesn't work.)

Oh and look two of these also contain no machine-readable data whatsoever


Are you going to release any street address datasets at some point?

I believe that the eircode dataset is one of the most highly requested sources, but it’s a private for profit database.

If you are aware of such a dataset that a public body is hosting, then it would certainly be something to include. Convincing (and helping) the public bodies to publish their data is still a big task.

I believe that An Post built a similar system many moons ago, and they should presumably be more amenable to open-sourcing.

The handling of the whole Eircode thing makes my blood boil, to be honest.

Blatantly self-promoting, but we (in Australia) are trying to develop a general solution for better open data search over at https://search.data.gov.au - solution is all open source at https://github.com/magda-io/magda.

Also, https://data.gov.in -- Lots of self-submitted datasets, and includes an API. Usability is average, much like the other government data sources.

Another example is Elsevier's Data Search project

- https://datasearch.elsevier.com/

I really like NYC Open Data: https://opendata.cityofnewyork.us/

Nice. I'm glad Google is making it easier to find public data sets. I wish that these could be filtered by format, so that you could narrow them to CSV, XML, JSON, KML, etc.

Another nice resource that I've used in the past is 'toddmotto/public-apis' on Github [0].

In the end I would prefer all public data sets to be available over the DAT protocol [1] instead of being hosted only on government or organization websites. A lot of climate data previously made available by the EPA was taken down, and only saved by efforts of volunteers.[2]

[0] https://github.com/toddmotto/public-apis

[1] https://datproject.org/

[2] https://www.wired.com/2017/01/rogue-scientists-race-save-cli...

Dat's pretty cool, but it's not the only game out there. The efforts of git-annex/Datalad [0], Academic Torrents [1], Quilt[2], DVC[3], and Pachyderm [4] are also notable in this space. My hopes are broader in the sense that I just hope that dataset versioning happens in the first place.

[0] https://www.datalad.org/

[1] http://academictorrents.com/

[2] https://quiltdata.com/

[3] https://dvc.org/

[4] http://www.pachyderm.io/

[5] https://qri.io (we're tackling the dataset versioning problem head on.)

Yeah, if they had added an attribute for formats available for each data set, and then added a filter by type to the search (e.g. "type:csv") or similar, it would be great.

Sometimes you really want a specific format for a dataset.

Having dabbled a bit in open data, I think "this data is available in too many formats and going through all the options manually is tiresome" counts as a problem you'd love to have.

The state of data sharing seems to be still quite sad.

* Hosting problems. The first link I tried was already broken.

* Format problems. Also the presented data is in all kinds of formats, some "data sets" even require me to read data off images: https://www.ceicdata.com/en/indicator/germany/gdp-per-capita And even if it's JSON, this is not particularly great either (Unicode support? Large (64bit) integers?).

* Update problems. Many data-sets change over time (e.g. GDP). How can I subscribe to updates? "git pull" would be nice.

* Provenance problems. I want to know who put which record into the dataset, when and why? "git log" would be nice.

* Presentation problems. (This is OK sometimes.) I don't necessarily want to download a 5GB file before I've looked into it. The first few rows of the dataset should be presented on the page, with information about it.

Wrote down a few more thoughts a while ago here: https://github.com/HeinrichHartmann/data-sharing#in-the-idea...

Approaches I have seen so far in the wild:

* figshare.com -- Addresses Hosting and Presentation.

* https://quiltdata.com/ -- (!) looks great. Still exploring.

* github.com -- works fine for small datasets (<1GB)

* packaging (yum, pkg, pip) -- (?) Not sure if that works, but at least they solve: Hosting, Update, Provenance.

This seems to be a wide open problem to me.

Yes, the state is still mediocre. Back in 2011 I did a Chaos Computer Congress talk on an "apt-get for the Debian of Data". This itself came out of building the original data portal and open data catalog/hosting site in 2006, https://datahub.io/ (originally ckan.net).

The ideas there are now getting realised in Frictionless Data https://frictionlessdata.io/

This is an initiative providing a simple way of "packaging" data like software, plus an ecosystem of tools including a package manager etc. - https://frictionlessdata.io/data-packages/

It aims to be minimal, easy to adopt, etc. (e.g. based on CSV). It has gotten significant traction, with integration and adoption into Pandas, OpenRefine, etc.

https://datahub.io/ itself is entirely rebuilt around Data Packages and includes a package manager tool "data".

If you're interested to talk more please come chat on http://gitter.im/datahubio/chat

Shameless plug. Please give me feedback.


I would like to create laymen oriented central repository for all public spending data of the world.

* Hosting problems - I make my own copy of the data.

* Format problems - cleaning and formatting data from different sources is a real pain. Once it is on my website, I offer CSV download or copy/paste tabulated data.

* Update problems - no versioning or public API yet.

* Provenance problems - there is a link to the source of the data.

* Presentation problems - tailored to displaying budgets. Not cross-browser or full mobile support yet.

Open Data portals powered by OpenDataSoft let you see/sort/filter/visualize data in your browser. You can access the data via API (including sorting/filtering) or download static files (CSV/JSON/XLS etc.).

I haven't seen many other platforms offer the same kind of functionality.

e.g. this one is a dataset of CCTVs across Leicester. You can easily see all of the columns, sort the data, display a chart of camera types, see their locations on a map, etc.


Screenshots: https://postimg.cc/gallery/j5e5thv2/

A few portals powered by them:

- Paris https://opendata.paris.fr/explore/

- Mannheim https://mannheim.opendatasoft.com/page/home/

- Durham NC https://opendurham.nc.gov/pages/home/

disclaimer: I was an intern there 4 years ago

Totally agree! At Qri (https://qri.io) we're working on many of these problems together - hosting, formatting (interoperability), provenance and sync. It's an open source project - we'd love to have your feedback as we design it!

This is cool and is a perfect case for IPFS for public datasets. I've not heard of it before, though, and I think naming / branding is something that makes finding these things / building momentum more difficult.

For example, someone else mentioned enigma.com. I would have no idea that is related to data sources / sets unless I knew what it was.

Certainly wish you the best of luck though and will keep an eye on Qri! Cool project!

Thanks! you can follow us on twitter where we announce most updates: @qri_io

That looks interesting. IPFS backed -- nice.

A web interface would be nice so that I don't have to install a tool to browse the content. Going to give it a shot nonetheless.

Hmm: "Qri backend is unavailable". Going to come back tomorrow.

Hey! thanks for checking us out. If you are still having trouble, please head over to our github (http://www.github.com/qri-io/frontend) and I'd be happy to help out.

You would be doing us a huge solid, working through these use cases irl is beyond helpful.

I agree and I have an idea for one possible solution that I've wanted to implement for years. It could be a business but, I think maybe it would be better for the world if it was open source. I just haven't had the time or support to do it, as it would need full-time effort even to get it started.

It's one of those... I know I should just do it kind of things, even just to get it out there, but I haven't found the momentum.

Seeing things like dat, quiltdata, public data sets, etc. made me think what I wanted to do was unnecessary, but I also agree with your comment.

I think a core problem is data democracy / control / politics of data. Too often we still act siloed instead of benefiting from massive data sharing, for a multitude of reasons (especially but not limited to $$$).

Interesting vertical integration for Kaggle Datasets: https://www.kaggle.com/datasets

Note Kaggle recently adopted the FrictionlessData.io Data Package specs: https://github.com/Kaggle/kaggle-api/wiki/Dataset-Metadata

Interestingly enough, I built and released something very similar [1] about a month ago using a Google Custom Search Engine.

Here is the Show HN for it: https://news.ycombinator.com/item?id=17789119

[1] https://databasd.com/

What is the best way for a website to format data to the public?

I already have my presentation, but I can also provide it as a .xls, .csv, sql, or html table.

What would be best to help programmers/data scientists use my data?

1) Put it on figshare or github. Your website might be dead next year. Those services have a better chance of survival.

2) CSV, JSON works fine if your dataset is just a few numbers and strings. GitHub will preview csv files as HTML tables.

3) If you need the efficiency of binary data and more robust data containers, I would look into - Parquet https://parquet.apache.org/documentation/latest/ and - Avro http://avro.apache.org/ http://blog.cloudera.com/blog/2009/11/avro-a-new-format-for-...

4) Data scientists work with R/Pandas "DataFrames". If you are familiar with either one, import the data into a data frame and use an export method to do the serialization for you: https://pandas.pydata.org/pandas-docs/stable/api.html#id12
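As a minimal sketch of that workflow (assuming pandas is installed; the dataset here is a toy placeholder):

```python
import pandas as pd

# Toy stand-in for your dataset
df = pd.DataFrame({"city": ["Dublin", "Cork"], "rainfall_mm": [748.0, 1207.0]})

# Let the DataFrame handle serialization to whatever the consumer needs
csv_text = df.to_csv(index=False)          # CSV for spreadsheets
json_text = df.to_json(orient="records")   # JSON for web consumers

print(csv_text)
```

The same `df` can also go to Parquet (`df.to_parquet(...)`) if you need a binary container.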

I vote CSV.

- simple

- lightweight

- open

It's easy for the consumer of your data to convert a CSV to whatever format they need.

- spreadsheet, for personal analysis

- SQL database, for industrial-strength analysis

- HTML, for pretty output to their users

Oh, and don't forget that CSV is a real format with rules and edge cases (quoting, embedded commas, newlines). Don't just append some text together, separated by commas, and call it CSV. Instead, use a dedicated lib to create the CSV for you.
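To illustrate, here's a minimal sketch using Python's stdlib csv module; note how the writer quotes fields that a naive comma-join would mangle:

```python
import csv
import io

rows = [
    ["name", "notes"],
    ["Widget, large", 'He said "hi"'],  # embedded comma and quotes
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # the library handles quoting/escaping
text = buf.getvalue()

# Round-trip: reading it back recovers the original fields exactly
assert list(csv.reader(io.StringIO(text))) == rows
```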

Suggest CSV plus a Table Schema in a nice little Data Package: https://frictionlessdata.io/data-packages/

There's a wide set of tooling for Tabular Data Packages, plus underlying data is CSV which anyone can use.

If you want this done automatically you can just publish your CSV to https://datahub.io/ and your Tabular Data Package is made automatically for you.
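For context, a Tabular Data Package is essentially just your CSV plus a `datapackage.json` descriptor sitting next to it; a minimal sketch (all names, paths, and fields here are illustrative):

```json
{
  "name": "rainfall-example",
  "resources": [
    {
      "name": "rainfall",
      "path": "rainfall.csv",
      "format": "csv",
      "schema": {
        "fields": [
          {"name": "station", "type": "string"},
          {"name": "date", "type": "date"},
          {"name": "rainfall_mm", "type": "number"}
        ]
      }
    }
  ]
}
```

The schema block is what lets downstream tools type-check and import the CSV automatically.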

CSV and a permissive license like CC0. Provide good metadata to make provenance easier. Mark up your dataset with schema.org.

CSV is your best option. CSVW is CSV plus a metadata file that converts the tabular data to graph data (and types the nodes and relationships of the graph). You may want to have a look at it.

You're probably asking the most controversial question on the Web today here. The fact that CSV is the preferred answer is a HUUUGE fail.

Nice to see Google trying this again!

It's one of those areas they've long been involved in - e.g. Google Public Data Explorer, which never quite reached its potential, and Freebase, which although flawed was good and was shut down after Google acquired it.

I like that this is search based! The web is still the best place to publish data - in fact in my view normal Google search is still by far the best way to find datasets, even though it isn't directly designed for that.

There's a link from the about page of Google Dataset Search to this help for webmasters on how to mark up content for it - although it is a bit odd, mainly showing how to mark a dataset with a DOI (so good for academics certainly!):
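For reference, the markup in question is schema.org/Dataset, usually embedded as JSON-LD in the page; a minimal sketch, with every value here a placeholder:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example rainfall observations",
  "description": "Daily rainfall totals (illustrative placeholder).",
  "url": "https://example.org/dataset/rainfall",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.org/data/rainfall.csv"
  }
}
</script>
```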


Just metadata about data feels like a very niche thing to search to me - I'm still not convinced anyone will maintain the metadata well enough to help. Possibly will work in particular domains.

Does Dataset Search have some way to search column headings, types or content (of CSV, Excel, JSON etc)? I can imagine a load of operators that would make that really powerful for finding badly meta-marked up datasets deep in the web. Would seem like the obvious extra thing a dataset search would do.

Also, previews please!!! Just nicely render the first ten rows of common formats - CSV and Excel to begin with.

What part of Google is doing this?

It would be good to know who sponsored the data research, so that everyone can make their own informed decision about whether to use/trust the data.

Looks like academic institutional repositories and figshare are doing the heavy lifting here. It's still neat to see Google aggregate everything, but it's not that different from what they do with other services relying on these sources already, and is largely dependent upon how rich these upstream sources are in the first place.

This is nice. I'm working on a similar open source project that is releasing soon called DataLibrary[0]

It goes further by bringing this kind of data together into a single API, converting/cleaning into a similar schema where possible.

A small write up can be found on github [1]. Any feedback/ideas would be appreciated!

[0] https://www.datalibrary.com (not online currently)

[1] https://github.com/reactual/datalibrary/blob/master/README.m...

Pretty lame that it doesn't work in Edge.

https://data.urbanfootprint.com/ - browse thousands of environmental, social, transportation, and land use datasets.

(disclosure - I work there)

Then there's this $500K just awarded by the NSF to build a "Google for data sets". I wonder if, before making these sorts of grants, the NSF looks at what Google and other companies are already (or likely) doing. https://www.lehigh.edu/engineering/news/faculty/2018/2018082...

You may want to try https://data.opendatasoft.com -- thousands of datasets available through the same API, usable online, no download required.

Look at https://knoema.com which positions itself as a search engine for data with more than 2.5 billion time series available. They provide both visual data discovery through search and navigation as well as API access through Python, R etc.

Can we list some good data searches to suggest below this comment?

I would love to see some cool data we might be able to use.

I also very much like https://www.figure-eight.com/data-for-everyone/. It's not optimized for search but it's an excellent repository of high quality datasets.

Of note is the link below which indicates how you can have your dataset indexed.


Since we already have a nice reference list of open data portals by country, I'd like to add the German portal: https://www.govdata.de/

Looks like this doesn't crawl one of the most important sources of data: academic torrents. For example, I searched for

ilsvrc lmdb

This should have found imagenet data in lmdb format available somewhere but it returned no results.

Maybe I'm missing something, but this strikes me as underwhelming - something I could build, as opposed to something the firm that created Maps and Gmail could do.

Is it just me?

Add this one: https://dados.gov.pt/ (Portugal)

Google is so damn bad at UX. An excellent example of the interaction of unlimited resources and priorities.

Placeholder comment to calculate the time before Google inevitably shuts this down.

What's up with this domain? Is there anything else under toolbox.google.com?

Here you go: http://bfy.tw/JkG7

Please don't do this here.

It will be helpful for somebody who majors in AI.
