50 Years of Data Science (2015) [pdf]

charlysl · on April 14, 2019

In the first lecture of CMU's Practical Data Science course, the lecturer does a very good job of explaining the differences between data science, machine learning, statistics and big data ("as far as this course is concerned"), from roughly minute 2 to minute 15: https://youtu.be/IS279gXmfkQ

graycat · on April 13, 2019

There's another problem here:

(A) The worker, statistician or data scientist, sends resumes looking for jobs. Well, that job has to be provided by someone else, likely someone who at least eventually gets some real value from the worker. And the person's training has to be at least close to what the job needs. So, the person is one step away from the actual money making.

(B) The academic teaches those workers-to-be but is two steps away from the actual money making. So, what the academic teaches has to be quite powerful, quite generally.

There should be a better fit if the academic or the worker is the one in the business, with a problem that has a very valuable use for statistics, data science, or some such. Then the academic or worker can focus on what is really valuable for the business. Otherwise the worker and the academic are a bit far from the money, so far that they may have a tough time ever seeing it.

dxbydt · on April 14, 2019

Even if the DS gets close to the money, the massive assumption here is that the underlying money making process is amenable to statistics methodology. The surprise & disappointment is that it mostly isn't. The underlying embedded MC in most of these CTMCs is not ergodic. It doesn't have a stationary distribution. Looking at a time series of historical ad clicks & trying to predict the inventory for the next year etc. is mostly borderline fraud, if you want to get downright scientific. There's no theorem that says this is how a social science metric like ad purchases will behave. People clicked on something yesterday, they'll click on something else tomorrow. Even in the aggregate, if you go off the deep end of asymptotic theory & invoke the gods of CLT, there is nothing to prove one way or the other. Fortunately, the problem is bounded, so data scientists throw around a wide enough confidence interval & make glib predictions. But these are not physical processes. As Morgenthau famously observed, this is the illusion of a social science imitating a model of the natural sciences. There is a fundamental difference between the two, namely the technical matter regarding the stability of behavioral relationships between the variables. The speed of light is a constant; social relations are not.
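
To make the ergodicity point concrete, here is a toy sketch comparing long-run state frequencies in an ergodic chain and a non-ergodic one. The transition matrices and the `time_average` helper are invented for illustration and are not fitted to any real ad-click data. One caveat: a finite chain always has at least one stationary distribution, so what fails in this sketch is uniqueness, the finite-state analogue of the problem described above; a chain on an infinite state space can lack one altogether.

```python
import random


def time_average(P, start, steps, rng):
    """Simulate a discrete-time Markov chain with transition matrix P and
    return the fraction of steps spent in each state."""
    n = len(P)
    counts = [0] * n
    state = start
    for _ in range(steps):
        counts[state] += 1
        # Draw the next state using the current state's row as weights.
        state = rng.choices(range(n), weights=P[state])[0]
    return [c / steps for c in counts]


# Hypothetical ergodic chain: irreducible and aperiodic.
# Solving pi = pi P gives the unique stationary distribution (0.75, 0.25).
ERGODIC = [
    [0.9, 0.1],
    [0.3, 0.7],
]

# Hypothetical non-ergodic chain: state 0 is transient, states 1 and 2 are
# both absorbing, so every mixture of "all mass on 1" and "all mass on 2"
# is stationary and the long-run behaviour depends on the sample path.
ABSORBING = [
    [0.0, 0.5, 0.5],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    rng = random.Random(0)
    # Ergodic case: the history looks the same regardless of the start state.
    for start in (0, 1):
        print("ergodic, start", start, time_average(ERGODIC, start, 100_000, rng))
    # Non-ergodic case: same start state, yet each run can settle somewhere
    # different, so one observed history says little about another.
    for run in range(4):
        print("absorbing, run", run, time_average(ABSORBING, 0, 10_000, rng))
```

In the ergodic case a single long history is a fair estimate of future frequencies. In the absorbing case it is not, which is the sense in which extrapolating from one observed time series can mislead.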

graycat · on April 14, 2019

> Even if the DS gets close to the money, the massive assumption here is that the underlying money making process is amenable to statistics methodology

Yup.

A famous recipe for rabbit stew starts off "First catch a rabbit ...."

Well, a first recipe for applied math/stat is "First get an application ..."

Better still, pick the problem and the method of solution together, as a coordinated pair. For a business, you want lots of people, with money, to have the problem and like the solution. You also want a Buffett moat: a network effect, a natural monopoly, a brand name with a lot of power to get and keep customers, or maybe some crucial core technology, a secret sauce that is difficult to duplicate or equal. Now try to exploit computing and the Internet. Then you want LUCK of timing, etc.

charlysl · on April 14, 2019

There is a companion talk by the paper's author on YouTube: https://youtu.be/E7w-gfFKPf8

graycat · on April 13, 2019

Classic statistics with inference is one way.

New data science with predictions is another way.

There's also a third way!

robsalasco · on April 13, 2019

causality?