List of high-quality open datasets in public domains (github.com)
388 points by Jasamba on Jan 30, 2016 | 34 comments

I don't quite understand these awesome lists. From what I've seen, they usually end up being a way for creators to promote their stuff and for the list creator to have a big project with a few thousand stars in their profile. So when I did something in, say, Electron, I would go to the awesome-electron list and add it there for promotion's sake.

I haven't found a use case for these lists myself yet. There is no way to verify the quality of a listed project or its activity (stars, for example? last commit date?).
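To make the "stars / last commit date" idea concrete, here is a minimal sketch of the kind of check a list maintainer could run. Everything in it is an assumption for illustration: the `ListEntry` type, the `looks_maintained` helper, the thresholds, and the sample entries are hypothetical, and the star counts and dates would have to come from somewhere (it does not call any real API).

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ListEntry:
    """Hypothetical record for one project linked from an awesome list."""
    name: str
    stars: int
    last_commit: date


def looks_maintained(entry, today, min_stars=50, max_idle_days=365):
    """Crude heuristic: enough stars and a commit within the idle window.

    The thresholds are arbitrary assumptions, not an established standard.
    """
    idle = today - entry.last_commit
    return entry.stars >= min_stars and idle <= timedelta(days=max_idle_days)


# Made-up sample data, not real projects.
entries = [
    ListEntry("old-adapter", stars=900, last_commit=date(2013, 5, 1)),
    ListEntry("new-adapter", stars=3, last_commit=date(2016, 1, 28)),
    ListEntry("steady-adapter", stars=240, last_commit=date(2015, 11, 12)),
]

today = date(2016, 1, 30)
kept = [e.name for e in entries if looks_maintained(e, today)]
print(kept)
```

A filter like this would only catch the "inactive or a few days young" cases; it says nothing about whether a project is actually good.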

In one case I searched for AWS adapters for a language and clicked on all the links inside awesome-{{language}}, just to find that all of them were either inactive or only a few days old. I ended up using something I found on Google instead.

When I started learning Golang last year I found the awesome-go list quite helpful. Not because I needed every single library mentioned, but because it quickly gave me an impression of how the ecosystem plays out and what the typical ways to build stuff are. That's quite important if you're not starting from an existing framework (.NET, Rails, etc.).

It really depends on how well they are maintained and how selective they are. (E.g. a long list with a few links for many categories is often more useful than one that lists everything that possibly fits in one category). And some basic information for each entry is really needed.

GitHub offers a reasonable way to manage contributions to them, compared to many other solutions. It is easy to suggest/fix something for external contributors, but the owner can work as a gatekeeper. This is something many link aggregators or bookmarking sites lack.

One example that isn't perfect, but that I found interesting, is https://github.com/Kickball/awesome-selfhosted . It has a sentence about each project, license information, and tries to purge unmaintained projects.

I think the benefit is as a source of inspiration. Finding data can require a lot of domain knowledge which students especially tend to lack. The problem is that the lists are always incomplete and other data may be more appropriate for a particular use case.

Also, quality is more complex than just stars or activity. You need to know what the advantages or limitations are and how that applies to your project.

I'm quite keen to start learning about data science. A bunch of big datasets like this is exactly what I need.

69 points, #3 on Hacker News, and no comments? :P

This list would be much improved with descriptions for each dataset and indication of schema, as some of the datasets listed have very unfriendly schema. (e.g. the IMDB interfaces link)

Kaggle's recently-released Public Datasets feature (https://www.kaggle.com/datasets) provides an interesting approach to presenting data and qualifying datasets by giving good examples of data robustness.

Exactly my thought as soon as I looked at the lack of comments.

How is it that this community can debate some inconsequential nonsense, yet there is no discussion here of how we get a consistent set of meta-data for these data sources?

There are researchers both in academia and in the commercial world who would thrive if there were such a list with good consistent meta-data on how to interact with it.

Disclosure: I work regularly with open datasets, and the effort it takes to work with each different set overshadows any effort on actual analysis.
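As a rough illustration of what "consistent meta-data" could mean in practice, here is a small sketch of a per-dataset record plus a completeness check. The field names (`license`, `data_format`, `columns`), the `DatasetMetadata` type, and the example URL are all assumptions made up for this sketch; they are not taken from the list or from any existing metadata standard.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetMetadata:
    """Hypothetical minimal metadata record for one open dataset."""
    name: str = ""
    url: str = ""
    license: str = ""          # e.g. terms for reuse and rehosting
    data_format: str = ""      # e.g. "csv", "json"
    columns: dict = field(default_factory=dict)  # column name -> description


def missing_fields(meta):
    """Return the names of fields that are still empty, in declaration order."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name)]


# Placeholder entry; the URL is an example.org address, not a real dataset.
record = DatasetMetadata(
    name="example-dataset",
    url="https://example.org/data.csv",
    data_format="csv",
)
print(missing_fields(record))  # ['license', 'columns']
```

Even a check this simple would flag entries with no license or schema information, which is where most of the per-dataset effort seems to go.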

I'm building a dataset of images with annotations (svg-like as well as clinically relevant metadata). I would really like to know any papers or people who could help guide the effort. Any suggestions?

I think it's really a political thing. How many of these open datasets come with a license/disclaimer that you are also allowed to rehost the data in its entirety for redistribution? It's one thing to standardize a format for open data, it's another thing to be able to make that standardized format available to everyone if you're repackaging someone else's data.

I'm working with a postgis corpus consisting of ACS, OES, CHSI, IPEDS, and many others. If you're in the same domain I'd love to chat sometime (email in profile).

> 69 points, #3 on Hacker News, and no comments?

I expect this is because there is no ability on HN to differentiate (on articles and comments) between bookmark and upvote, most likely the majority of votes are for the purposes of bookmarking. Very often I want to upvote someone for a good comment, but I do that very sparingly now because I try to keep my upvoted comments list minimal so when I try to find something noteworthy I don't have to wade through pages of "liked" comments.

I think you are drastically overestimating how many HN users are aware of the saved comments/stories feature.

I've used HN for years, I've never even heard mention of saved comments/stories. What is this voodoo?

My saved comments: https://news.ycombinator.com/saved?id=cbd1984&comments=t

Your saved comments: https://news.ycombinator.com/saved?id=vosper&comments=t

My saved stories: https://news.ycombinator.com/saved?id=cbd1984

Your saved stories: https://news.ycombinator.com/saved?id=vosper

I have no idea if you'll be able to see mine or if I'll be able to see yours. We'll just have to find out together.

Edited To Add: No. I can't, and you won't.

SolveBio (my startup) has parsed, normalized, and indexed a bunch of the datasets listed under biology. Our goal is to make these kinds of datasets easier to access for programmers and non-programmers alike, similar to some other sites mentioned here (Enigma and Quandl) but for genomics. You can query and filter the data on the website or through one of our API clients: https://www.solvebio.com/library

Sounds like a good idea, especially the normalization part, but your site requires JavaScript and offers not even basic functionality without it... nope.

The author's notion of a "dataset" is weird. Under "Finance", there's a link to the Google Finance page ( http://finance.google.com/ ). How is that a "dataset"?

For those in the UK, the available Government datasets are published on http://www.data.gov.uk

The datasets are not public domain, but licensed under the Open Government Licence (which allows you to use and adapt the data for commercial use).

There's also the Global Open Data Index: a website that ranks countries by how much Government data is available as open datasets, based on certain criteria. The current top spot is taken by Taiwan:

  1. Taiwan
  2. UK
  3. Denmark
  4. Colombia
  5. Finland
  5. Australia
  7. Uruguay
  8. USA
  8. Netherlands
  10. Norway
  10. France

You mean Colombia?

Yes, sorry. Corrected :-)

I wish linkeddata.org or CKAN installs weren't being reinvented here, and that CKAN instead supported pull requests or similar decentralized ways to publish new datasets.

If you have any ideas/suggestions for how this might be implemented in CKAN, please do drop a mail to the list ( https://lists.okfn.org/mailman/listinfo/ckan-dev ) or add an issue at https://github.com/ckan/ideas-and-roadmap/issues for discussion.

For the complex network part, I think the collection missed this one: http://www.networkrepository.com/ The site itself is a collection of several publicly available network datasets.

I noticed no http://commoncrawl.org/ (oh no, naked domain!) or http://www.cochrane.org/

I don't quite understand the criteria for being included in the list since I think it's:


Betfair Historical Exchange Data requires you to have "100 Betfair points" which you acquire by gambling on their site. It's hardly an open dataset.

Check out data.ny.gov

Also nycopendata.socrata.com

Is it too late to create a central registry of datasets, to aid discoverability? A voluntary system maintained by convention?

Perhaps a distributed registration system à la DNS?

We have https://www.biodiversitycatalogue.org/ for biodiversity informatics APIs. A hackathon I attended made an API for registering, but I don't think it was deployed.

Sure, go ahead!

Enigma.io is great for public data too.

I took another look at the Enigma.io public datasets. Over 50% of all the public datasets are from the Federal Reserve Bank of St. Louis. Finance data is boring. :P

Quandl (https://www.quandl.com/browse) is similar to Enigma, except they got rid of all the fun datasets and added more finance/economic datasets. Hmrph.

While the FRED data might be ~50% of the datasets, most of those tables are 200-4000 rows, so it is not nearly 50% of the rows of data.

the remaining 50% of datasets have a lot of gems

quality in the breadth of data is important

State-wide liquor licenses, corp reg. OSHA is great data. AMS shipping records. FDA adverse events. Oil and gas well locations/production. Consolidated weather reports since 1800... I'm almost certainly forgetting some.

Are there sites that have datasets other than financial? (I also think financial data is too dry, boring)

It's strange that they put Wikidata under natural language.
