A couple things I would suggest. Besides places like Stanford's SNAP, I would check out Socrata's OpenData portal [0], which is a place where anyone can post datasets, and the Open Data Network portal [1], which searches across all of the Socrata city portals...it's stunning what's out there. The NYPD's release of felony data is nice (if many years late), but it's nothing compared to what cities such as Chicago, Dallas, and Los Angeles have, in terms of quantity of records and detail (and the NYPD's stop-and-frisk data -- which the ACLU forced them to release -- is also much more voluminous).
I would consider providing numbers about the datasets in the default view, such as number of observations and variables. That's probably the biggest weakness, IMO, of current data portals (including data.gov)...you have to click through every link only to find out there's not much data in the set. In your situation, this applies to several of the things you've included. Take that Gun Ownership and Crime Rates set, for example: unless I'm missing something, it has fewer than 40 observations and relies on highly questionable nationwide numbers from the FBI [2]...that can't possibly be of any use in a machine learning context, can it? I'm surprised it's even the basis for an academic paper (though kudos to the authors for posting their work). If you still think it's worth keeping that dataset, it'd be nice to know the # of observations before having to click through.
My last comment was a bit quick. Yes, showing the number of observations is a very natural feature to add. It should be fairly simple as well, so it'll be added soon!
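For what it's worth, here is a rough sketch of how the observation/variable counts could be computed for a CSV-backed dataset. This is only an illustration, not how the site actually does it: the `dataset_shape` helper, the column names, and the sample rows are all made up, and it assumes the file has a single header row.

```python
import csv
import io


def dataset_shape(csv_text):
    """Return (observations, variables) for CSV text with one header row.

    Hypothetical helper for illustration; not part of any real portal's API.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        # Completely empty input: no header, no rows.
        return 0, 0
    # csv.reader yields an empty list for blank lines, so skip those.
    n_rows = sum(1 for row in reader if row)
    return n_rows, len(header)


# Made-up sample data, only to show the call.
sample = "state,gun_ownership_rate,crime_rate\nA,0.31,412\nB,0.45,388\n"
rows, cols = dataset_shape(sample)
print(f"{rows} observations x {cols} variables")
```

The counts could then be stored alongside each listing, so they show up in the default view without anyone having to click through.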
Hey all! I built this site because I think it should be easier and more fun to discover new datasets. I'd love to hear your thoughts, suggestions and criticism!
A while ago I put together a list of big (100,000+ rows) public datasets...not all of them are ideal for machine learning applications, but you'll probably find a few worth sharing:
This is awesome. As someone who is currently playing with ML, I find this a great resource for experimenting with datasets. Thanks to everyone involved in this project. Keep it up.
[0] https://opendata.socrata.com
[1] http://www.opendatanetwork.com/
[2] http://www.jsonline.com/watchdog/watchdogreports/fbi-crimere...