
A collection of free datasets from Microsoft Research - activatedgeek
https://msropendata.com/
======
iamdave
I ask this not with snark but with fond(ish) memories, but does Microsoft
providing open data like this make anyone else think of that Northwind sample
database that used to come with SQL Server?

Not sure if it still does or not, as I'm in a different DB engine entirely
these days, but I kind of grinned with a wee bit of nostalgia checking out
this site thinking of that. I cut my teeth on that data set.

Edit: Son of a gun, they still offer it:
[https://docs.microsoft.com/en-us/dotnet/framework/data/adone...](https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/sql/linq/downloading-sample-databases)

~~~
kerng
SQL Server has shipped something called Adventure Works for over a decade. Same
idea as Northwind, but larger, so you can also demo and play around with
Analytics and BI.

~~~
AmericanChopper
Adventure Works, pffft... I’ll take Scott any day of the week.

------
abetusk
This page has problems loading the datasets. Is it overloaded or just broken?

Does anyone happen to know what licenses the various datasets are under?

~~~
Insanity
Can confirm that I'm having the same issues, could be overloaded indeed.

~~~
xinit
Turn off your ad blocker. Fixed it for me.

~~~
alokdhari
How did you find that fix? Just curious.

~~~
SpuriousSignals
Press Ctrl+Shift+I in Chrome and open the Console tab; you may see something
like "blocked by client".

------
antoaravinth
Well, if you are looking for a huge dataset, I would suggest Stack Overflow.

[https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)

I remember I wrote a Node script to download all the questions and dump them
into my Postgres instance. It was fun; with an index, Postgres could answer
questions like these super fast:

How many answers does this question have?

How many questions are unanswered?

With a GIN index, I could do free-text search as well.
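
A rough sketch of the two counting queries above, not the original Node script. To stay self-contained it uses Python's built-in `sqlite3` as a stand-in for Postgres. The `posts` table, its column names, and the sample rows are assumptions made up for illustration (the real dump has many more columns). The two `PG_*` strings show what the GIN-backed free-text part might look like in Postgres; they are not executed here, and the `%s` placeholder assumes a psycopg-style driver.

```python
import sqlite3

# Hypothetical, simplified schema (an assumption, not the real dump layout).
SCHEMA = """
CREATE TABLE posts (
    id        INTEGER PRIMARY KEY,
    post_type INTEGER NOT NULL,  -- 1 = question, 2 = answer
    parent_id INTEGER,           -- for answers: id of the question
    body      TEXT NOT NULL
);
CREATE INDEX posts_parent_idx ON posts (parent_id);
"""

# Postgres-only statements for the free-text search part (not run below).
PG_GIN_INDEX = (
    "CREATE INDEX posts_body_fts ON posts "
    "USING gin (to_tsvector('english', body));"
)
PG_SEARCH = (
    "SELECT id FROM posts "
    "WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s);"
)


def answer_count(conn, question_id):
    """How many answers does this question have?"""
    row = conn.execute(
        "SELECT COUNT(*) FROM posts WHERE post_type = 2 AND parent_id = ?",
        (question_id,),
    ).fetchone()
    return row[0]


def unanswered_count(conn):
    """How many questions have no answer rows pointing at them?"""
    row = conn.execute(
        """
        SELECT COUNT(*) FROM posts q
        WHERE q.post_type = 1
          AND NOT EXISTS (
              SELECT 1 FROM posts a
              WHERE a.post_type = 2 AND a.parent_id = q.id
          )
        """
    ).fetchone()
    return row[0]


def demo_connection():
    """Build an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?, ?)",
        [
            (1, 1, None, "How do I create an index?"),
            (2, 2, 1, "Use CREATE INDEX."),
            (3, 2, 1, "See the docs."),
            (4, 1, None, "Why is my query slow?"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(answer_count(conn, 1))
    print(unanswered_count(conn))
```

The index on `parent_id` is what keeps the per-question answer count cheap; the same idea should carry over to a Postgres B-tree index on that column.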

------
pcurve
Wow, the 230GB celebrity photo dataset caught me off guard.

~~~
ehsankia
Is it the one used in this paper?

[https://youtu.be/XOxxPcy5Gr4](https://youtu.be/XOxxPcy5Gr4)

~~~
ct520
Dude, watching this video is how I imagine a bad acid trip to be.

~~~
lurquer
It's worse. I think I damaged something in my brain by watching that...

------
rspeer
They don't make it easy to find out what the Microsoft Research License
Agreement is, when they put it behind a login wall on a site that doesn't work
with ad-block. But this appears to be a copy of it:
[http://www.cs.toronto.edu/~aditya/langid/msrla.txt](http://www.cs.toronto.edu/~aditya/langid/msrla.txt)

These license terms are a trap. Nobody should ever accept them.

You can't use the data for commercial purposes, which is harsh enough on its
own -- it means the data should only be interesting to people who plan to be
academics permanently, or who don't plan to succeed at anything they do with
it.

You can't even open-source anything you make out of it, because open source
software does not restrict the purpose for which it's used. You can't combine
it with data under any Creative Commons license except CC BY or CC0, because
its restrictions are _definitely_ not Creative Commons compatible.

The truly insulting part is that you have to give an unrestricted license
_back_ to Microsoft for everything you make out of this data. You can't
benefit from what you make out of the data, but _Microsoft can_, with _no
limitations_.

Whatever you make from this data, you can't benefit from it. Most other people
can't benefit from it. Microsoft can benefit from it, and if they want to,
they can just take it without crediting you.

This is like an artist being asked to do something "for the exposure", by
someone who will just take it and give you no exposure.

~~~
jonhendry18
"You can't use the data for commercial purposes, which is harsh enough on its
own -- it means the data should only be interesting to people who plan to be
academics permanently, or who don't plan to succeed at anything they do with
it."

It'd also be useful for someone farting around with machine learning who isn't
anywhere close to being at a point of shipping anything to anyone else.

What you can benefit from is what you learn while using the data. Then you
start a new project using what you've learned, and using different data.

~~~
rspeer
Why would you plan to fail? You could just use different data from the start
that doesn't require you to sign an NDA.

It's not like MSR is providing a rare commodity.

~~~
jonhendry18
"Why would you plan to fail?"

Because it isn't failure if your only goal is to learn.

------
iamaelephant
Absolutely nothing on this website works for me. Search gives an infinite
spinner. Categories gives a blank page. What gives?

~~~
threatofrain
For me it was uBlock Origin.

~~~
_wmd
Confirmed, uBlock is causing it here too.

~~~
kfrzcode
Ditto

------
purplezooey
Do we really need more sample data sets?

~~~
maaaats
Yes.

I can't really fathom the snark here. Is providing this a _bad_ thing?

~~~
FaradayComplex
I think the commenter above you missed that it's not "sample data", but
curated datasets for machine learning tasks, which we most certainly need more
of.

