
Show HN: Datasets.co – Share and discover new machine learning datasets - mrborgen
http://www.datasets.co/
======
danso
A couple things I would suggest. Besides places like Stanford's SNAP, I would
check out Socrata's OpenData portal [0], which is a place where anyone can
post datasets, and the Open Data Network portal [1], which searches across all
of the Socrata city portals...it's stunning what's out there. The NYPD's
release of felony data is nice (if many years late), but it's nothing compared
to what cities such as Chicago, Dallas, and Los Angeles have, in terms of
quantity of records and detail (and the NYPD's stop-and-frisk data -- which
the ACLU forced them to release -- is also much more voluminous).

I would consider providing numbers about the datasets in the default view,
such as number of observations and variables. That's probably the biggest
weakness, IMO, of current data portals (including data.gov)...you have to
click through every link to then find out there's not much data in the set. In
your situation, this applies to several of the things you've included...that
Gun Ownership and Crime Rates set, for example...unless I'm missing something,
but that has fewer than 40 observations, and relies on highly questionable
numbers from the FBI [2] on a nationwide level...that can't possibly be of any
use in a machine learning context, can it? I'm surprised it's even the basis
for an academic paper (though kudos to the authors for posting their work). If
you still think it's worth keeping that dataset, it'd be nice to know # of
observations before having to click through.

[0] [https://opendata.socrata.com](https://opendata.socrata.com)

[1] [http://www.opendatanetwork.com/](http://www.opendatanetwork.com/)

[2] [http://www.jsonline.com/watchdog/watchdogreports/fbi-crimere...](http://www.jsonline.com/watchdog/watchdogreports/fbi-crimereporting-audits-are-shallow-infrequent-cg5uvel-166665516.html)
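
To make the "numbers in the default view" idea concrete, here is a minimal sketch of how a portal might compute the observation and variable counts for an uploaded CSV. The function name `summarize_csv` and the sample rows are made up for illustration; nothing here reflects how datasets.co or any other portal actually works.

```python
import csv
import io


def summarize_csv(text):
    """Return (n_observations, n_variables) for CSV text with a header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        # Completely empty upload: no header, no rows.
        return 0, 0
    # csv.reader yields an empty list for blank lines, so skip those.
    n_observations = sum(1 for row in reader if row)
    return n_observations, len(header)


# Hypothetical sample, invented purely to exercise the function.
sample = "state,ownership_rate,crime_rate\nA,0.3,410\nB,0.5,380\n"
n_obs, n_vars = summarize_csv(sample)
print(f"{n_obs} observations, {n_vars} variables")
```

A listing page could show those two numbers next to each dataset title, and a count of zero observations would also be an easy way to flag empty uploads.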

~~~
mrborgen
My last comment was a bit hasty. Yes, showing the number of observations is a
very natural feature to add. It should be fairly simple as well, so it'll be
added soon!

------
tryitnow
I was skeptical when I first saw this because there seem to be many other
efforts to collect datasets.

But most of those efforts are pretty poor because they're unnecessarily
difficult to understand.

I like how the data is clearly described with "Feature/Description/Example",
that gives me most of what I want to know at a first glance.

~~~
mrborgen
Thanks! Would you also be interested in having a 'data preview' feature, where
you'd see a tiny part of the data in the browser?

------
stared
It would be nice to have some estimate of the data size (e.g. number of rows).

~~~
mrborgen
Hey, on datasets added from now on, this'll be an option. Check it out in
action here (displayed under the image).

[http://www.datasets.co/dataset/All-UFC-Fights](http://www.datasets.co/dataset/All-UFC-Fights)

Was that what you were thinking?

~~~
stared
Yes, it's what I meant.

But in this case, are there 19 features (as stated) or 10 features (the number
of columns)?

~~~
mrborgen
It's actually 19; I just didn't understand all of them well enough to write
proper descriptions.

~~~
stared
Then it is better to list all the columns, even if some get an "(unknown)"
description.
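
A rough sketch of that idea, assuming the column names come from the file header and the descriptions are whatever the uploader managed to fill in. The helper `describe_columns` and the column names below are hypothetical, not taken from the UFC dataset.

```python
def describe_columns(columns, descriptions):
    """Pair every column with its description, falling back to "(unknown)"."""
    return [(name, descriptions.get(name, "(unknown)")) for name in columns]


# Hypothetical columns: only some have a description written so far.
columns = ["fighter_1", "fighter_2", "method", "ref_code"]
descriptions = {
    "fighter_1": "Name of the first fighter",
    "method": "How the fight ended",
}

for name, desc in describe_columns(columns, descriptions):
    print(f"{name}: {desc}")
```

This way the number of rows in the feature table always matches the number of columns in the data.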

------
mrborgen
Hey all! I built this site because I think it should be easier and more fun to
discover new datasets. I'd love to hear your thoughts, suggestions and criticism!

------
danso
A while ago I put together a list of big (100,000+ rows) public
datasets...not all of them are ideal for machine learning applications, but
you'll probably find a few worth sharing:

[http://cjlab.stanford.edu/2015/09/30/lab-launch-and-data-set...](http://cjlab.stanford.edu/2015/09/30/lab-launch-and-data-sets/)

~~~
mrborgen
Awesome list, thanks!

------
uberneo
Good source of datasets. One point: you can register an empty dataset. You
might want to fix that. Everything else looks great. Good job!

------
phatbyte
This is awesome. As someone who is currently playing with ML, I find this a
great resource for experimenting with datasets. Thanks to everyone involved in
this project. Keep it up.

