
List of high-quality open datasets in public domains - Jasamba
https://github.com/caesar0301/awesome-public-datasets
======
dvcrn
I don't quite understand these awesome lists. From what I've seen it usually
ends up being a way for creators to promote their stuff and for the list
creator to have a big project with a few thousand stars in their profile. So
when I did something in, say, Electron, I would go to the awesome-electron list
and add it there for promotion's sake.

I haven't found a use case for these lists myself yet. There is no way to
verify the quality of the product or its activity (stars, for example? last
commit date?).

In one case I searched for AWS adapters for a language and clicked on all the
links inside awesome-{{language}}, only to find that all of them were inactive
or a few days old. I ended up using something I found on Google instead.

~~~
elcapitan
When I started learning Golang last year I found the awesome-go list quite
helpful. Not because I needed every single library mentioned, but because it
quickly gave me an impression of how the ecosystem plays out and what typical
ways to build stuff are. That's quite important if you're not
starting from an existing framework (.NET, Rails, etc.).

------
minimaxir
69 points, #3 on Hacker News, and no comments? :P

This list would be much improved with descriptions for each dataset and an
indication of schema, as some of the datasets listed have very unfriendly
schemas (e.g. the IMDB interfaces link).

Kaggle's recently-released Public Datasets feature
([https://www.kaggle.com/datasets](https://www.kaggle.com/datasets)) provides
an interesting approach to presenting data and qualifying datasets by giving
good examples of data robustness.

~~~
qume
Exactly my thought as soon as I looked at the lack of comments.

How is it that this community can debate some inconsequential nonsense, yet
there is no discussion here of how we get a consistent set of metadata for
these data sources?

There are researchers both in academia and in the commercial world who would
thrive if there were such a list with good, consistent metadata on how to
interact with each source.

Disclosure: I work regularly with open datasets, and the effort it takes to
work with each different set overshadows any effort on actual analysis.

~~~
niels_olson
I'm building a dataset of images with annotations (svg-like as well as
clinically relevant metadata). I would _really_ like to know of any papers or
people who could help guide the effort. Any suggestions?

~~~
developer2
I think it's really a political thing. How many of these open datasets come
with a license/disclaimer that you are also allowed to rehost the data in its
entirety for redistribution? It's one thing to standardize a format for open
data, it's another thing to be able to make that standardized format available
to everyone if you're repackaging someone else's data.

------
davecap1
SolveBio (my startup) has parsed, normalized, and indexed a bunch of the
datasets listed under biology. Our goal is to make these kinds of datasets
easier to access for programmers and non-programmers alike, similar to some
other sites mentioned here (Enigma and Quandl) but for genomics. You can query
and filter the data on the website or through one of our API clients:
[https://www.solvebio.com/library](https://www.solvebio.com/library)

~~~
lap88
Sounds like a good idea, especially the normalization part, but your site
requires JavaScript, with not even basic functionality without it... nope.

------
discardorama
The author's notion of a "dataset" is weird. Under "Finance", there's a link
to the Google Finance page
([http://finance.google.com/](http://finance.google.com/)). How is that a
"dataset"??

------
chestnut-tree
For those in the UK, the available Government datasets are published on
[http://www.data.gov.uk](http://www.data.gov.uk)

The datasets are not public domain, but licensed under the Open Government
Licence (which allows you to use and adapt the data for commercial use).

There's also the Global Open Data Index: a website that ranks countries by how
much government data is available as open datasets, based on certain criteria.
The current top spot is taken by Taiwan:

    
    
      1. Taiwan
      2. UK
      3. Denmark
      4. Colombia
      5. Finland
      5. Australia
      7. Uruguay
      8. USA
      8. Netherlands
      10. Norway
      10. France
    

[http://index.okfn.org/place/](http://index.okfn.org/place/)

~~~
psykovsky
You mean Colombia?

~~~
chestnut-tree
Yes, sorry. Corrected :-)

------
clockwerx
I wish linkeddata.org or CKAN installs weren't being reinvented here, and that
CKAN instead supported pull requests or similar decentralized ways to publish
new datasets.

~~~
rossj
If you have any ideas/suggestions for how this might be implemented in CKAN,
please do drop a mail to the list
([https://lists.okfn.org/mailman/listinfo/ckan-dev](https://lists.okfn.org/mailman/listinfo/ckan-dev))
or add an issue at
[https://github.com/ckan/ideas-and-roadmap/issues](https://github.com/ckan/ideas-and-roadmap/issues)
for discussion.

------
yzh
For the complex network part, I think the collection missed this one:
[http://www.networkrepository.com/](http://www.networkrepository.com/) The
site itself is a collection of several publicly available network datasets.

------
jack9
I noticed no [http://commoncrawl.org/](http://commoncrawl.org/) (oh no, naked
domain!) or [http://www.cochrane.org/](http://www.cochrane.org/)

I don't quite understand the criteria for being included in the list since I
think it's:

[https://groups.google.com/forum/#!forum/awesomepublicdataset...](https://groups.google.com/forum/#!forum/awesomepublicdatasets)

------
patrickk
Betfair Historical Exchange Data requires you to have "100 Betfair points"
which you acquire by gambling on their site. It's hardly an open dataset.

------
Spooky23
Check out data.ny.gov

Also nycopendata.socrata.com

------
lifeisstillgood
Is it too late to create a central registry of datasets to aid
discoverability? A voluntary system maintained by convention?

Perhaps a distributed registration system à la DNS?

~~~
Symbiote
We have
[https://www.biodiversitycatalogue.org/](https://www.biodiversitycatalogue.org/)
for biodiversity informatics APIs. A hackathon I attended made an API for
registering, but I don't think it was deployed.

------
tylercubell
Enigma.io is great for public data too.

~~~
minimaxir
I took another look at the Enigma.io public datasets. Over 50% of _all the
public datasets_ are from the Federal Reserve Bank of St. Louis. Finance data
is boring. :P

Quandl ([https://www.quandl.com/browse](https://www.quandl.com/browse)) is
similar to Enigma, except they got rid of all the fun datasets and added more
finance/economic datasets. Hmrph.

~~~
adkatrit
While the FRED data might be ~50% of the datasets, most of those tables are
200-4000 rows, so it is not nearly 50% of the rows of data.

The remaining 50% of datasets have a lot of gems.

Quality in the breadth of data is important.

State-wide liquor licenses, corp reg. OSHA is great data. AMS shipping records.
FDA adverse events. Oil and gas well locations/production. Consolidated
weather reports since 1800... I'm almost certainly forgetting some.

------
legulere
It's strange that they put Wikidata under natural language.

