
Statistics: Losing Ground to CS, Losing Image Among Students - sampo
http://blog.revolutionanalytics.com/2014/08/statistics-losing-ground-to-cs-losing-image-among-students.html
======
cs702
I disagree. Statistics is not "losing image" among students. It's the other
way around: _more and more students and software developers have become
interested in statistics_ , inspired by the successes achieved by machine
learning and AI researchers. Because machine learning _is_ statistical
learning. It's almost impossible to build a machine learning system without
learning at least some statistics!

As evidence of the growing popularity of statistics, consider that this
submission made the front page on HN.

The "losing image" meme applies, not to the discipline of statistics, but to
_staid academic departments_ who have lost power and prestige. They don't like
that their subject has become popular in a different department (CS), outside
their purview.

PS. I would add that many (but not all) statistics departments have brought
this upon themselves. For a long time, their approach to real-world, practical
applications has been to apply _existing_ models in ways that can be readily
interpreted by human beings, e.g., with "explanatory variables." Meanwhile,
machine learning and AI researchers have been experimenting with, testing, and
deploying _new_ models and _new_ ways of 'scaling up' old models, focused on
making accurate predictions on previously-unseen data. _Of course_ students
find the latter more exciting.

------
east2west
Talk about head in the sand. CS is subsuming stats (not really) because
computers are a, if not the, major driving force behind statistical advances.
L1 regularization became a fad in stats because computational optimization
could efficiently solve this non-smooth optimization problem. Engineers are
now advancing algorithms and moving them into real-world use, while
statisticians are still re-searching their field and finding applications for
existing algorithms. There is nothing wrong with application, but it means
that they are not at the forefront of research and, more importantly, they
have to wait for other people to solve their problems because they are not
capable of inventing novel algorithms. When the world at large relies on
computers for every little thing and on computer scientists for a lot of
scientific advances, is it really surprising that computer scientists are more
prominent?

And computer scientists have plenty of rigor. The NIPS conference began as an
AI/neuroscience conference and is now prestigious in stats as well. Some
machine learning experts lack mathematical foundations, but statisticians are
not really good mathematicians either. Being able to regurgitate stat theorems
and prove convergence in expectation under two dozen conditions (no kidding;
read average JASA papers) is not the hallmark of analytic rigor. The average
stats Ph.D., even from a top program, lacks a thorough understanding of
measure theory. Just look at standard stats textbooks; examples are aplenty
(using statistical software), but proofs are nowhere to be found. It is
amazing how many statisticians are still using pseudo-inverses in numerical
computation.
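On the pseudo-inverse jab, the underlying numerical-analysis point is standard: to solve a linear system, factor and solve rather than forming an explicit (pseudo-)inverse, which discards backward stability. A quick sketch on a famously ill-conditioned matrix (my toy illustration, not a claim about any particular stats package):

```python
import numpy as np

# Solve X b = y on the 10x10 Hilbert matrix (condition number ~ 1e13),
# once through an explicit pseudo-inverse and once by factor-and-solve.
n = 10
idx = np.arange(n)
X = 1.0 / (idx[:, None] + idx[None, :] + 1.0)   # Hilbert matrix
y = X @ np.ones(n)                              # exact solution: all ones

b_pinv = np.linalg.pinv(X) @ y                  # explicit pseudo-inverse
b_solve = np.linalg.solve(X, y)                 # LU factor-and-solve

err_pinv = np.linalg.norm(X @ b_pinv - y)       # residual of the pinv route
err_solve = np.linalg.norm(X @ b_solve - y)     # residual of the direct solve
```

The direct solve typically leaves a residual near machine precision here, while the explicit pseudo-inverse route is usually orders of magnitude worse.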

Statistical education is a problem. The biggest factor is the very software
this article is promoting: R. It is an atrocious monstrosity. Perhaps
statisticians can live with it. Fine, but don't expect good programmers to
come out of five years of using R. Python would be far better. In the end,
statisticians are not professional programmers, the same way accountants are
not. I don't see too many accountants fretting that CS is subsuming their
field. Let us not forget, most statistical methods have their roots in other
science and engineering fields, and those fields are not particularly
bothered.

~~~
kiyoto
Here is a math major's perspective (I also have a C.S. master's from Stanford,
a decent school for computer science).

The real reason statistics is losing to machine learning is M&M: money and
marketing.

Money is a huge factor when kids choose majors in college. Many of them have
student loans to pay off, and for many of them, getting a high-paying job post
college is a serious consideration. In this regard, statistics is a great
major, but computer science is flat-out ridiculous. When big tech companies
are offering a Stanford graduate with only a few summers of programming
internships under their belt $150k/year, naturally a lot of kids are lured
into computer science.

Then, once they start studying computer science, they discover this thing
called machine learning. While I do think there is a difference in emphasis
between stats and machine learning, the fundamentals are the same, except the
nomenclature sounds much cooler in machine learning. Nonparametric inference
sounds esoteric and cryptic, but unsupervised learning sounds futuristic and
cool.

The biggest problem with both statistics and machine learning education, I
would say, is their lack of emphasis on mathematical foundations. When I was
in college, CS229 (Introduction to Machine Learning) was touted to be the
hardest class at Stanford. Having helped my friends wade through CS229
problems (a lot of which comes down to wading through linear algebra and
multi-variable differential calculus), I do not think this is remotely true:
CS229 is hard because it attracts students who do not have the requisite
mathematical maturity to learn statistics/machine learning in a serious way (I
am not even talking about the real-analytic foundation of probability and
calculus but rudimentary linear algebra and chain rules).

Also, I do think CS229@Stanford is a great class that brings theory and
practice together =)

As for R/Python/Matlab/etc., I will let the zealots argue what's best. To me,
they are like statistics and machine learning: similar with different
emphasis.

~~~
wwweston
> The biggest problem with both statistics and machine learning education, I
> would say, is their lack of emphasis on mathematical foundations.

Any recommendations for someone who'd like to pick up sound (if modest)
foundations?

~~~
kiyoto
Yes. For example, to follow CS229's lecture notes, the following material
should suffice.

- Linear Algebra Done Right by Sheldon Axler

- Linear Algebra by Steven Levandosky

- Linear Algebra and Multivariable Calculus, and Manifolds by Ted Shifrin
(you probably do not need the manifolds part)

Also, as paradoxical as it may sound, CS229 lecture notes themselves. The key
here is to verify the details painstakingly. In math, we joke that whenever a
book says something is "trivial" or "easy to verify" or "clear to show", it is
a substantial exercise in disguise.

~~~
skybrian
Yes, the whole "trivial exercise" thing is in itself a big problem with how
mathematics is taught. Someone who's struggling just to understand the text
could probably really use a worked-out example.

~~~
brational
That's the wrong interpretation. In math, a 'trivial' or 'clearly' is supposed
to indicate that a certain detail is important to understanding the subject.
And if it isn't clear or trivial then the reader is expected to delve deeper
before moving on.

The "big problem" is that math is taught in a restricted time scale - a
semester, just like all other standard courses. In reality it might take
someone many weeks or months to pick up a certain piece of measure theory or
Hilbert spaces. Math builds upon itself and when it is rushed there is a lack
of connection and then ultimately failure to understand.

This is why MOOCs are a great boon for many, as they allow more freedom to
self-dictate the pace.

~~~
skybrian
If that's what it really means then a perfectly good way to say it is: "To
test your understanding, make sure you can do this exercise before
continuing."

I suspect many mathematicians _like_ being intimidating.

------
Homunculiheaded
I'm surprised that they didn't mention Leo Breiman's famous paper "Statistical
Modeling: The Two Cultures"[0]. A worthwhile read for anyone interested in
the topic. I could sum it up, but the abstract does a better job:

----------

Abstract. There are two cultures in the use of statistical modeling to reach
conclusions from data. One assumes that the data are generated by a given
stochastic data model. The other uses algorithmic models and treats the data
mechanism as unknown. The statistical community has been committed to the
almost exclusive use of data models. This commitment has led to irrelevant
theory, questionable conclusions, and has kept statisticians from working on a
large range of interesting current problems. Algorithmic modeling, both in
theory and practice, has developed rapidly in fields outside statistics. It
can be used both on large complex data sets and as a more accurate and
informative alternative to data modeling on smaller data sets. If our goal as
a field is to use data to solve problems, then we need to move away from
exclusive dependence on data models and adopt a more diverse set of tools.

[0].[http://tuvalu.santafe.edu/~aaronc/courses/5352/readings/Brei...](http://tuvalu.santafe.edu/~aaronc/courses/5352/readings/Breiman_01_StatisticalModeling_TheTwoCultures.pdf)

~~~
east2west
Fantastic paper. I wonder what "irrelevant theory, questionable conclusions"
he had in mind.

------
yiransheng
In my opinion, one big reason for the decline of statistics is the asymmetry
in curricula.

If you are in CS/Machine Learning, you will be forced/suggested to take a few
statistics/probability courses, typically up to a regression course. These are
good enough to get you started and familiar with statistical concepts and
methodology.

However, if you come from a typical statistics program, programming/computing
courses are almost never required. Professors in probability theory/statistics
often do not have great computing skills; they might know enough to get around
in R/Matlab and teach you the basics of these tools, but a lot of the time you
are left in the dark in terms of learning how to code.

If your career goal is to be an analyst, whose job is report oriented, you can
get by knowing just enough R (SAS, Matlab, etc.) scripting, as your
deliverables are generally fancy LaTeX PDFs (or presentations). But the
current job market pays and values more of a "data scientist", whatever that
term means. This type of role typically requires a solid software engineering
background, and requires you to be able to work with software developers and
commit to the same codebase on a regular basis. The lack of
programming/computing skills of a stats major will be a huge minus for this
type of responsibility.

~~~
analog31
Granted, I started college 30+ years ago, so my perspective may be outdated,
but I fear that it might not be. What you say about "asymmetry" struck me
because it seems like a more widespread issue. Some anecdotes from my college
years:

* The CS students did all of their work on the college mainframe, and there was a pervasive attitude that personal computers were toys.

* The earliest adopters of PC's, and of using computers to solve problems rather than as objects of study, were in the physics department.

* Math required students to learn programming, but only one course used it -- numerical analysis.

I wonder if doing symbolic math by hand still dominates math education. Of
course I enjoy doing it as a form of recreation, but I think that my use of
math in my subsequent career would be stunted, had I not taken an interest in
programming outside of the math department.

I also wonder if "the hot stuff in discipline A is being taught in department
B" is just a feature of the chaotic nature of scholarship and research,
meaning that a student needs to explore more than one field in order to really
be top notch in their main chosen field.

------
mturmon
Here's another relevant HN discussion about the relationship of CS and
Statistics:

[https://news.ycombinator.com/item?id=6630491](https://news.ycombinator.com/item?id=6630491)

The boosting, random forest, support vector machine, and deep learning ideas
should have been developed in Statistics. Why weren't they?

It's much deeper than TI-83 calculators and cookbook AP courses!

~~~
dthal
Boosting was, I suppose, invented by Schapire and Freund, both CS guys, but
its most successful form, gradient boosting, was invented by Friedman, a stats
professor. Random forests were developed by Breiman and Cutler, both stats
professors. SVMs probably weren't developed in statistics because they don't
have a probability model.

~~~
mturmon
It sounds like you're challenging my central claim: conventional statistics
has largely ignored the impact of computational technologies on inference, and
dismissed the resulting failure as a mere marketing problem.

You are wrong.

Boosting is a generic technique, and reducing it to gradient boosting and then
claiming it as "mainstream statistics" is silly. Boosting is _not_ , even
today, part of the standard statistics curriculum, despite its wide utility.
But the bootstrap is. I wonder why?

Describing Leo Breiman as a stats professor is similarly twisted. He was
always an outlier in traditional statistics; in fact, he left a position in
the math department at UCLA because he didn't fit in, and began independent
consulting, during which time he developed CART. And during his time, years
later, as a bona fide stats professor at UC Berkeley, he spent considerable
effort overcoming a tendency of mainstream stats to ignore the implications of
computing. For instance, he was one of a small number of stats professors who
was active in the NIPS community. (His bio at
[http://projecteuclid.org/download/pdf_1/euclid.ss/1009213290](http://projecteuclid.org/download/pdf_1/euclid.ss/1009213290)
is absolutely fascinating, see pages 195-6 for his feelings about conventional
statistics.)

And don't forget that a lot of the popularity of CART was due to the C4.5
implementation by Ross Quinlan, a CS professor with a physics background,
which allowed easy building of classifiers on top of trees. Which points out
another thing: these advances have in many cases not come, strictly speaking,
from _computer science_ ; physicists and engineers have been very
influential.

SVM's were developed by Vladimir Vapnik and collaborators, who came out of a
Russian math/physics background. Who cares whether SVMs have a model or not?
This is an example of the myopia that afflicts old-fashioned statistics. SVMs
are highly effective predictors with deep mathematical properties and admit a
statistical analysis of their performance and properties. (You don't think
every statistician who uses logistic regression really believes this is a
_model_ for the data generation process! Yet logistic regression is part of
the stats canon.) The SVM analysis ("large margin classifiers") largely took
place, again, outside mainstream statistics.
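To make the "large margin" framing concrete: a linear SVM is just regularized hinge-loss minimization, and nothing in it posits a model of the data-generating process. A hand-rolled sketch using plain subgradient descent (toy data, illustration only; a real implementation would use a dedicated solver):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimize mean hinge loss + lam/2*||w||^2 by subgradient descent.

    Labels y must be in {-1, +1}; no probability model is assumed.
    """
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                    # points inside or past the margin
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy linearly separable data in two dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.3 * X[:, 1] > 0, 1, -1)
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

The classifier is judged purely on its predictions, which is exactly the point about effective predictors that carry no generative story.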

We could say the same thing about neural networks.

And we could say the same thing about deep learning.

NNs: 1980s-1990s. SVMs: 1990s-2000s. Deep learning: 2000s-2010s. In ten years,
we'll be adding another example to the list, because this myopia is not just
poor marketing skills.

------
iends
My first stat class was an introductory statistics class that was geared
towards biology students. It was pretty terrible.

When I got into CS grad school I took more rigorous statistics courses and
fell in love. I mean, head over heels in love. I tried to transfer out of the
CS graduate school and into the statistics graduate school. Because I would
lose funding and would be starting over, I decided just to quit when I was
awarded my masters. (Now I just spend my days coding JavaScript and
CSS... what happened?)

If I had been exposed to statistics earlier in life I'm absolutely certain I
would have majored in it in addition to computer science.

------
viraptor
I did three terms of statistics combined at two different universities and
then took the EdX statistics course a few years later. I was not impressed
with the way it's taught at any of those places - if that's the usual way to
teach statistics, I'm not surprised it's not very popular.

They say "This “engineering-style” research model causes a cavalier attitude
towards underlying models and assumptions." - but CS should be easy to
verify. Everything I ever did for the CS courses could be looked at and
explained exactly why it's right, in simpler terms than what I did in the
first place. With statistics I get the feeling it's completely the other way
around: you can get some data, answer a question, and it takes a really
complicated explanation to prove that what you did is right. If you make a
simple mistake like using a one-tailed instead of a two-tailed test, you still
get a result - but how can you actually test whether it's the right one? I was
never taught, and could never find, any way to do that.
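The one-tailed/two-tailed trap is easy to demonstrate: both calls below run without complaint and both return a plausible-looking p-value, so nothing in the software tells you which alternative hypothesis was appropriate. A small sketch with scipy on made-up numbers:

```python
import numpy as np
from scipy import stats

# Both tests succeed and return a p-value; neither flags which
# alternative hypothesis was the right one for the question asked.
sample = np.array([0.2, 0.5, -0.1, 0.8, 0.3, 0.4, -0.2, 0.6, 0.1, 0.7])

two_sided = stats.ttest_1samp(sample, popmean=0.0)                         # H1: mean != 0
one_sided = stats.ttest_1samp(sample, popmean=0.0, alternative="greater")  # H1: mean > 0
```

For a positive test statistic the one-sided p-value is exactly half the two-sided one, and with borderline data that halving alone can move a result across the 0.05 line.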

In geometry you can fairly easily throw some more equations at the problem and
see your solution matches them too. In CS you can prove some thing in multiple
ways, or throw enough data at some algorithm that you can see that both
results and complexity bounds are what you expect them to be. The statistics
courses I've seen resulted in many people who can go through the usual
assignments, but couldn't tell if what they got back was right or wrong.

------
jtleek
As an academic statistician running the largest data science program in the
world (10,000+ have completed a class) I think that CS/Data Science only pose
a threat to statistics if we don't adapt. I wrote about it here.
[http://simplystatistics.org/2013/04/15/data-science-only-pos...](http://simplystatistics.org/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt/)

------
murbard2
I'm not sure what the difference between statistics and machine-learning is
supposed to be, seriously. The theoretical foundation of any machine learning
algorithm is probability theory and statistics.

Is machine learning implied to be a more "hands on" approach, to focus more on
learning a set of techniques than on learning the theoretical underpinning of
such techniques? Is it basically "statistical engineering"?

The least charitable interpretation would be that it's just a hipper term for
it.

This article pretty much sums up my impression of the divide:
[http://brenocon.com/blog/2008/12/statistics-vs-machine-learn...](http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/)

------
noelwelsh
After reading the title I spent some time thinking about what the causes might
be before reading the article. To my great amusement the reasons I consider
machine learning to be a more vital field than statistics were all given as
"structural problems with the CS research “business model”". :)

Take the great recent success of deep learning. This is a method that could
never have arisen in statistics, because at this point in time it has no
(well, very little) supporting theory. But it works! Given a decade or so
we'll likely have a good theory, but we shouldn't let the lack of theory
hinder development of applications, and in machine learning we generally
don't.

The classic Breiman "Two Cultures" paper is very relevant here.

~~~
1971genocide
The irony is that conventional scientists forgot what it was to be scientists!
The best science was created without knowing what you were creating.

------
moron4hire
His attitude seems to be more "slow CS down" than "speed Stats up". He talks
about "let's consider the problems in CS first", but then never gets around to
seriously discussing the problems of the Stats profession.

The most he seems to admit to is "Stats hasn't been awake, hasn't been
present", like it's some kind of defensible excuse. It's a major problem! If
you don't show up to the game, you can't say anything about the "competition"
playing unfairly or not playing by the right rules or what have you.

We've seen this before in all kinds of discussions. "Them damn kids and their
Rubython on Djrails think they invented everything." It paints a picture of
some new, young beast hunting down and devouring the old guard bit by bit in
some kind of cruel game designed to torture body _and_ mind. Which is
completely ridiculous. The new gang never "attacks" the old. They just walk
away.

For any set of tools, the reason people prefer one over another is that they
are finding answers more quickly and easily with A than with B. Nobody
cares about the purity of form (or in other words, the ones who do care are
_statistically insignificant_ ). They care about getting work done.

None of this is to say that working on theory is inferior or superior to
working on applied use and engineering. They are all equally valid pursuits.
But we all need to be able to admit when people don't share our passions, and
not blame them for having their own.

Show up to the game. The game is industry. It's work. Stop focusing on
academia as the only legitimate outlet for graduating mathematicians and
statisticians. Talking about "adult supervision" is really just armchair
quarterbacking. Sit in a meeting with a person with a 30-year-old, unused
history degree with _absolutely zero_ understanding of statistics--thinks ALL
math is ivory tower bullshit not worth the paper the textbooks are printed on
--and explain to him why his desired results can't be obtained from his
current data. See how much your statistical models stand up to "let's turn
that can't into a can" or "can't is unacceptable, we're not going to leave
here until we figure it out".

------
jostmey
Statistics is taught wrong from the get go. Who cares about z-tables and chi
values? It distracts from the important underlying laws of probability.
Statistics has never been more important, and yet the vocabulary is badly
outdated leaving students with a huge mountain of knowledge to climb before
reaching anything of real utility.

~~~
osazuwa
I don't quite agree, as the standard normal and Chi squared distribution are
obviously important to understand. I think what you mean is that teaching
stats based on hypothesis testing on simple experiments with small samples is
not the best way to introduce students to this world. But the history of
statistics was heavily influenced by designed experiments, while most of the
data that is out there is observational. That is hard for older generations of
statisticians to swallow.

------
kazinator
I was expecting a statement like:

"Based on a random sample of institutions surveyed, Statistics is losing
ground to CS at a relative mean rate of 5.7% per year (by junior and senior
year undergraduate enrollment headcounts, taken over the past five years from
the institutions sampled), with a 95% confidence interval of +/- 0.3%."

Ironically, no.

------
anarchy99
Tsk, tsk, Norman. A good statistician shouldn't use anecdotal evidence with a
sample size of one to support his/her points. If I picked up a random stats
pub, I bet I could find some missing citations as well.

------
lucidguppy
I wonder if CS should be divided into heavy research and then a serious minor
to be applied to other majors.

CS of applications should augment other fields of study. CS of research should
be the heavy duty stuff.

------
andyidsinga
I'm currently reading "Discovering Statistics Using R" .. even though we're
using Python and IPython at work - IPython is simply fantastic! -- I hope to
integrate the R support magic soon too.

