
Nate Silver: What I need from statisticians - carlosgg
http://www.statisticsviews.com/details/feature/5133141/Nate-Silver-What-I-need-from-statisticians.html
======
hharrison
To respond to a bunch of other posters here:

There's a fundamental difference between data scientist and statistician, I
think. I see statistics as an academic discipline and data science as an
applied discipline.

More concretely, the statistics approach is: formulate question --> formulate
hypothesis --> collect data in a controlled environment under a specific set
of assumptions (i.e., perform an experiment) --> determine probability of the
data given the hypothesis (and assumptions).

While the data science approach is: hey look, we already have all this data
--> generate predictions --> collect more data --> refine predictions.

Of course, that's an over-generalization. But I think the different emphasis
on hypothesis testing vs. machine learning/data mining is fundamental.
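
To make that contrast concrete, here's a toy sketch (purely illustrative; the
data, effect sizes, and models below are made up, not anyone's actual method):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Statistics-style workflow: formulate a hypothesis, run a designed
    # experiment, compute the probability of the data under the null
    # (a two-sample t-test on hypothetical treatment vs. control data).
    control = rng.normal(loc=10.0, scale=2.0, size=50)
    treated = rng.normal(loc=11.0, scale=2.0, size=50)
    t_stat, p_value = stats.ttest_ind(treated, control)
    print(f"p-value under H0 (no effect): {p_value:.4f}")

    # Data-science-style workflow: start from data that already exists, fit a
    # predictive model, then check and refine it as more data arrives.
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
    coef, *_ = np.linalg.lstsq(X[:100], y[:100], rcond=None)  # fit on what we have
    mse = np.mean((X[100:] @ coef - y[100:]) ** 2)            # evaluate on new data
    print(f"Held-out MSE: {mse:.3f}")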

------
chimeracoder
> "I think data scientist is a sexed-up term for a statistician."

As a statistician-and-engineer who is currently on the job market (my graduate
program finishes this spring), I feel this pain.

I've been referred to as a "data scientist" multiple times (that's even been
my official title at work before), though I do still cringe sometimes when I
hear the word, for this exact reason.

That said, I don't usually present myself as a statistician, even though my
degree is a statistics degree. Most people who hold statistics degrees are
fairly lousy engineers[0], and I don't know of any other term that (concisely)
expresses that I'm equally competent as a statistician and a (backend)
engineer[1].

Of course, this is because many of these programs haven't yet caught up to the
fact that computers exist and are still teaching statistics as if we're in a
pre-computation era. The perfect solution is to fix this, and thereby fix the
connotation of the word "statistician".

It's the same reason I dislike the term "growth hacker" - really, that's just
the way marketing _should_ be done (ie, based on numbers and verifiable
statistics). In a perfect world, all (competent) marketers would be "growth
hackers". But many marketers aren't, and so we have to make up another cringe-
worthy term for it.

Unfortunately, that's a problem that's beyond my means to solve. So I bite my
tongue and add the word "data scientist" to my resume anyway.

[0] Usually self-proclaimed, too.
[1] i.e., "I could work as a backend engineer if I wanted to/needed to, but
I'm looking for work involving both skillsets"

~~~
textminer
I've actually seen "data scientist" used to indicate "probably not that
proficient as an actual engineer, but knows what he or she is talking about in
terms of theory and can usually get a working model done in R, Matlab, or
Python". There seems to be an additional jump in knowledge of systems, data
structures, and algorithms in building up performant machine learning stacks
(something which can add in practical knowledge and praxis, but steals a
little from raw-ideas work).

Citation: working in data science and machine learning engineer roles for
nearly three years straight out of grad school in math.

------
mturmon
"I think data scientist is a sexed-up term for a statistician."

This statement, delivered by Silver at the Joint Statistical Meetings (the
main cross-organization stats conference), was guaranteed to be a
crowd-pleaser for that audience.

Unfortunately for them, it's not really true.

The problem is that much of conventional academic statistics consists of
proving theorems about model classes. This requires a lot of sophisticated
analysis, but has turned rather vacuous. And much conventional applied
statistics consists of computing diagnostics based on dubious modeling
assumptions. Under pressure in the last 20 or so years from computer science,
machine learning, computer vision, Moore's law, and the data avalanche, the
discipline has changed, but not fast enough.

As a result, a lot of what _should_ be taught and researched in statistics
departments has been co-opted by these other disciplines. And many people with
a real problem would rather work with a "machine learning" person than a
"statistics" person.

The best summary of this state of affairs is Leo Breiman's essay
([http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?handle=euclid.ss/1009213726&view=body&content-type=pdf_1](http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?handle=euclid.ss/1009213726&view=body&content-type=pdf_1)).
The abstract of this essay is brutal:

"There are two cultures in the use of statistical modeling to reach
conclusions from data. One assumes that the data are generated by a given
stochastic data model. The other uses algorithmic models and treats the data
mechanism as unknown. The statistical community has been committed to the
almost exclusive use of data models. This commitment has led to irrelevant
theory, questionable conclusions, and has kept statisticians from working on a
large range of interesting current problems. Algorithmic modeling, both in
theory and practice, has developed rapidly in fields outside statistics. It
can be used both on large, complex data sets and as a more accurate and
informative alternative to data modeling on smaller data sets. If our goal as
a field is to use data to solve problems, then we need to move away from
exclusive dependence on data models and adopt a more diverse set of tools."

Breiman was mathematically sophisticated, so it's not that he couldn't follow
the theory he critiques; rather, he wasn't snowed by detail and could see its
lack of relevance to real problems.

~~~
tel
The two cultures argument is strong, but so is David Hand's "Illusion of
Progress" paper [1], which fits into the debate by highlighting that the
classifier technologies often touted by machine learning aficionados do not
actually deliver better performance when analyzed carefully.

In my experience, improvements in performance tend to come from improvements
in domain knowledge. This usually plays out in feature selection, but may
involve radical changes in the classifier/model design. The latter are more
fun and can more easily be the basis of a PhD. The former are usually swept
under the rug because they're brutal and boring, though fundamental
feature-engineering changes also form great PhDs.

I think it's fairly clear that the success of NLP almost entirely came down to
a feature engineering change.

I don't know what the ideal Data Scientist/Statistician position is. I think
it varies based on many implicit needs that only statistics-savvy businesses
can distinguish clearly. The role may require strong programming skills, the
ability to build and iterate massive multilevel models of social data, a
familiarity with online classifiers, the ability to graph things nicely in R
and D3, the ability to write a non-linear optimizer from scratch over a
completely novel space, a familiarity with NLTK, or, finally and perhaps most
commonly, the ability to peddle bullshit about the benefits of machine
learning.

I love the field and that's why I tend to side with the statisticians who want
to really figure out what's needed to do an objectively good job drawing
inference from data. Many in the new generation of statisticians are quite
successfully crossing the cultural divide. A smaller cohort are brushing up at
least their R coding skills to a suitable degree. Some already have an outside
competency in programming. I'd like these people to eventually form the core
of what real "data science" is.

[1]
[http://arxiv.org/pdf/math/0606441.pdf](http://arxiv.org/pdf/math/0606441.pdf)

~~~
mturmon
That paper by David Hand is good; I had forgotten it. (Even though I presented
it to a reading group at my work...)

He points out a lot of ways the standard "supervised learning, iid feature
vectors, moderate dimensionality, moderate n" problem setting is a theoretical
construct that does not encompass enough of the real-world problem setting.
And he highlights the fact that an epsilon improvement in error rate is not
that remarkable.

So, I agree that his paper was smart as a critique of a certain culture within
ML practice (and publications) of totally eliminating all problem context, and
just saying "Give me a bag of labeled feature vectors -- I will build a
classifier with an error rate better than your classifier, and publish the
result."

But that is largely a straw man -- critiquing ML hackery is worthwhile and
salutary, but it's going after small game. The random forest, support vector
machine, and deep learning approaches should have been developed in
Statistics. Why weren't they?

------
chockablock
If you're a trained scientist, 'Data Science' sounds distinctly odd. What
other kind of science is there? A friend likened it to going to a restaurant
to do some 'Food Eating'.

~~~
finiteloop
In general, if the subject includes "science" in the name, it probably isn't
actually science. Political, data, social, etc. No chemist needs to say they
study "chemical science."

------
nfoz
Data scientist is a sexed-up term for a statistician without any expectation
of mathematical expertise.

~~~
monkeyspaw
What would be a preferable term, data engineer? I don't have a background in
statistics, and I do a lot of things that would make PhD level statisticians
unhappy... but the things I do are sufficient for my own purposes.

Sometimes I don't need a mathematically proven process. I often find a use for
something like a Monte Carlo simulation, without rigorous methodology, because
it works for me, and for what I'm doing.
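
As an illustration of that kind of quick-and-dirty Monte Carlo, a minimal
sketch (the batting-average scenario and numbers are hypothetical, picked only
for illustration):

    import random

    # Quick-and-dirty Monte Carlo: roughly estimate the chance that a .300
    # hitter gets at least one hit in a four-at-bat game. No variance estimate,
    # no confidence interval; just "good enough" for a rough answer.
    def at_least_one_hit(avg=0.300, at_bats=4):
        return any(random.random() < avg for _ in range(at_bats))

    trials = 100_000
    estimate = sum(at_least_one_hit() for _ in range(trials)) / trials
    print(f"P(at least one hit) ~ {estimate:.3f}")  # exact value: 1 - 0.7**4 = 0.7599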

By analogy, not every website needs a multi-tier, doubly redundant
architecture. Sometimes a "dumb" setup with the webserver and database on a
single machine gets the job done.

In fact, when I worry less about scientific validity and focus more on what is
necessary to get the job done... I make a lot more progress.

It feels a lot like scientific elitism to me.

~~~
hharrison
Sounds like you agree with nfoz... you don't always need mathematical
expertise ('scientific validity') to play with data, but when you do, you need
a statistician.

Really, it comes down to whether you need to confirm your assumptions and
quantify your uncertainty, etc. Often, in applied situations, you don't. In
science, you do.

~~~
monkeyspaw
Yup. I don't plan on publishing anything I do in a peer reviewed journal. If I
did, I would have to be way more cautious with a lot of my assumptions.

By freeing myself of that level of effort, I can test a lot more stuff and
build things a lot faster. In what I do, that's more important than having
something be "scientifically accurate".

Similarly, when I am coding a prototype, I don't do unit tests. I don't need
them, and writing them slows down development.

~~~
kyzyl
Pardon me if I come off as rude, but you've made a couple of statements that
strike me as rather silly. Maybe I've misunderstood you.

The function of a peer reviewed publication is to formalize and verify that
some result is correct, reproducible, and relevant (and to advance tenure).
One would hope that whether you publish or not is mostly orthogonal to _how_
you conduct your work. If you're doing data analysis in any professional
capacity, you should be using mathematically sound techniques, _and you should
understand those techniques_. There's no other way about it. I'm not sure what
your line of work is, but I can't think of any technical field where one's
analysis techniques being "scientifically sound" is anything but paramount.
Why else would the work have any value? Perhaps it helps to think of
publishing your work as more like deploying your code than testing it.

That isn't to say that every step of the way you have to conduct yourself with
the utmost rigor. It doesn't mean that you must prove every theorem each time
you use it, nor that you have to be laying out strict tests and hypotheses
every time you load a data set. But to do data analysis in a scientifically
sound manner _does_ mean that you have to understand the background of the
techniques you're using, and how they apply to your data. It _does_ mean that
you have to periodically bring your analysis back to basics and make sure your
ducks are still in a row; that you haven't fallen off the assumption wagon
somewhere along the line. It _does_ mean that you should know why something is
working, or why something is broken, not just that it is.

"Statisticians" don't sit around all day laying out hypothesis tests in latex.
They do the same thing you do; they fire up R, load the data and start playing
with it. It's all about the context you have in your brain when you are
playing with it. This, I think, is where the formal training becomes quite
important. Because as with most things, it's the _unknown_ unknowns that will
destroy you, and having the broad formal foundation gives you the tools to
protect yourself from walking into a minefield of problems you didn't even
know existed yet.

~~~
monkeyspaw
I guess I just see the level of "scientific soundness" as a spectrum. For some
work (peer-reviewed journals, work you hope other people will be able to
reproduce, etc.) you need a higher standard.

I find that higher standard requires significantly more time and effort, and
ultimately prevents me from accomplishing what I need to.

It depends on the application and how you're using the result. Most of the
time, for me, it just doesn't matter.

It may be worth noting that much of what I'm talking about is for personal
understanding (e.g., baseball statistics) or is not mission-critical and will
never see the light of day.

Sometimes good enough is just that.

I do appreciate the thoughtful response you wrote. But often my decision is
this: do I spend two weeks trying to understand a technique, or do I spend one
day using it, understanding that my analysis has limitations? When I only have
one day, the choice becomes "do what seems like it will work" or "do nothing."
I agree that more complex techniques (neural nets, etc.) may need more
understanding. But because they do, they are also inaccessible to me.

Maybe I can sum it up like this: it's not always necessary to do an analysis
in a completely scientifically sound manner, if I can answer the questions to
my satisfaction. That's especially relevant when the other option is to not
get anything done and get stuck in a textbook trying to understand complex
theory.

It's bitten me once in a while, but that doesn't matter in what I'm doing.
(And it's usually pretty easy to recover from.)

Also worth a mention that I would not call myself a data scientist, nor do I
operate in that role.

It's all tradeoffs, all the way down. I just happen to choose mine differently
than someone who might call themselves a data scientist or statistician.

------
srean
The way I have made personal peace with this is that I consider myself a
better programmer than the median statistician, and better at statistics and
machine learning than the median programmer. Whether this is a useful spot to
be in, I have yet to find out. I can see that, depending on the times, this
can be either an asset or a liability.

------
casca
We need a new term for statistician and data scientist is as good as any. For
many years, the terms "statistics" and "statistician" have had negative
undertones within the general public and renaming is a great way to overcome
that.

~~~
nfoz
huh? what are these "negative undertones"?

~~~
johnpmayer
Probably the widely-held belief that anybody can come up with statistics for
anything.

~~~
nfoz
Well, that statement is true. You hire a statistician if you want statistics
that are correct. Maybe people undervalue the latter...

------
rohunati
It could be argued that we shouldn't abandon the term "statistics." Data
science is to statistics what physics is to mathematics, and we don't (nor do
physicists) call mathematics "number science."

------
mrcactu5

      I think data scientist is a sexed up term for a statistician.
    

fuck-yeah

