
50 Years of Data Science (2015) [pdf] - charlysl
http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
======
charlysl
In the first lecture of CMU's Practical Data Science course, the lecturer
explains very well the differences between data science, machine learning,
statistics and big data ("as far as this course is concerned"), between about
minutes 2 and 15: [https://youtu.be/IS279gXmfkQ](https://youtu.be/IS279gXmfkQ)

------
graycat
There's another problem here:

(A) The worker, statistician or data scientist, sends resumes looking for
jobs. Well, that job has to be provided by someone else, likely someone who at
least eventually gets some real value from the worker. And the person's
training has to be at least close to what the job needs. So, the person is
_one step away_ from the actual money making.

(B) The academic teaches those workers-to-be but is _two steps away_ from the
actual money making. So, what the academic teaches has to be quite powerful
quite generally.

Should be able to get a better _fit_ if the academic or the worker is the one
in the business with the problem that has a very valuable use for statistics,
data science, or some such. Then the academic or worker can focus on what is
really valuable for the business. Else the worker and academic are a bit far
from the money, so far they may have a tough time ever seeing the money.

~~~
dxbydt
Even if the DS gets close to the money, the massive assumption here is that
the underlying money making process is amenable to statistics methodology. The
surprise & disappointment is that it mostly isn't. The underlying embedded MC
in most of these CTMCs is not ergodic. It doesn't have a stationary
distribution. Looking at a time series of historical ad clicks & trying to
predict the inventory for the next year etc. is mostly borderline fraud, if
you want to get downright scientific. There's no theorem that says this is how
a social science metric like ad purchases will behave. People clicked on
something yesterday, they'll click on something else tomorrow. Even in the
aggregate, if you go off the deep end of asymptotic theory & invoke the gods
of CLT, there is nothing to prove one way or other. Fortunately, the problem
is bounded, so data scientists throw around a wide enough confidence interval
& make glib predictions. But these are not physical processes. As Morgenthau
famously observed, this is the illusion of a social science imitating a model
of the natural sciences. There is a fundamental difference between the two, namely
the technical matter regarding the stability of behavioral relationships
between the variables. The speed of light is a constant; social relations are
not.
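
To make the ergodicity point concrete, here is a minimal Python sketch. It is not from the comment: the two transition matrices, the function name `time_averages`, and the choice of states are all made-up assumptions for illustration. In the ergodic chain, the long-run fraction of time spent in each state settles to the same values no matter where a run starts, so history is informative. In the non-ergodic chain (two absorbing states), each run locks into one outcome, so the history of one run says nothing about another.

```python
import random

def time_averages(P, start, steps, rng):
    """Simulate a discrete-time Markov chain with transition matrix P
    and return the fraction of steps spent in each state."""
    n = len(P)
    counts = [0] * n
    state = start
    for _ in range(steps):
        counts[state] += 1
        # Draw the next state using row P[state] as the weights.
        state = rng.choices(range(n), weights=P[state])[0]
    return [c / steps for c in counts]

# Hypothetical ergodic chain: irreducible and aperiodic.
# Its unique stationary distribution is (0.6, 0.4).
ERGODIC = [
    [0.8, 0.2],
    [0.3, 0.7],
]

# Hypothetical non-ergodic chain: state 0 jumps to one of two
# absorbing states (1 or 2) and never leaves.
NON_ERGODIC = [
    [0.0, 0.5, 0.5],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    for seed in range(3):
        rng = random.Random(seed)
        print("ergodic    ", time_averages(ERGODIC, seed % 2, 20000, rng))
    for seed in range(3):
        rng = random.Random(seed)
        print("non-ergodic", time_averages(NON_ERGODIC, 0, 20000, rng))
```

The ergodic runs should print occupancy fractions close to each other; the non-ergodic runs put nearly all of their time in whichever absorbing state the first step happened to pick. This toy only illustrates non-uniqueness of long-run behavior in a finite chain; it does not model ad clicks or any real process.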

~~~
graycat
> Even if the DS gets close to the money, the massive assumption here is that
> the underlying money making process is amenable to statistics methodology

Yup.

A famous recipe for rabbit stew starts off "First catch a rabbit ...."

Well, a first recipe for applied math/stat is "First get an application ..."

Still better, pick in a coordinated way the pair of the problem and the method
of solution. For business want lots of people/money to have the problem and
like the solution. Also want a Buffett moat: Network effect, natural monopoly,
a brand name with a lot of power to get and keep customers, maybe some crucial
core technology _secret sauce_ difficult to duplicate or equal. Now try to
exploit computing and the Internet. Then want LUCK of timing, etc.

------
charlysl
There is a companion talk by the paper's author on YouTube:
[https://youtu.be/E7w-gfFKPf8](https://youtu.be/E7w-gfFKPf8)

------
graycat
Classic statistics with _inference_ is one way.

New data science with _predictions_ is another way.

There's also a third way!

~~~
robsalasco
causality?

