
What Data Scientists Really Do, According to Data Scientists - pseudolus
https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists
======
sixdimensional
"The only difference between screwing around and science is writing it down"
\- Adam Savage.

In all seriousness, why can't data science simply be about applying the
scientific method in the realm of data analysis? It doesn't need to be
conflated with machine learning, BI, SQL, etc. It can just be about
approaching data analysis with scientific rigor.

My opinion is that the term data science evolved when we started needing
cross-functional people who are a blend of:

\- domain experts;

\- numerical/quantitative specialists (such as statisticians, mathematicians,
physicists, STEM people);

\- business analysts, business intelligence; and,

\- those who traditionally deal with data management, platforms and tools.

That confluence of people was needed amidst the related trends:

\- increased government funding for STEM education and brain research;

\- marketing from companies such as IBM ("Watson"), the democratization of
data and increase in the use of data in daily life;

\- the big data wave, subsequent interest in "internet of things" and "digital
transformation";

\- renewed interest in machine learning and AI (recurrent neural networks and
other breakthroughs);

\- and others, of course.

We needed to apply more discipline to data analysis - thus data science was
born. A formalizing of what many were already doing, to capture the need and
changing paradigm. Or so I like to believe.

~~~
digitalzombie
> why can't data science simply be about applying the scientific method in the
> realm of data analysis?

That's what a statistician does.

I've seen these ML and data science people. The majority of the time, how they
tackle data is radically different from how a statistician does it, and is more
of an art than a science compared to what a statistician does.

But this could be my biased opinion, based on a small data sample from
personal experience.

\---

Actually, on the last day of my internship I met a few statistician interns,
some of them from Cal (UC Berkeley), and they came to the same conclusion (we
have a lot of complaints). The ML/DS group is really just doing black magic
(nicest way of putting it). I wish statistics were better at marketing. Oh well.

~~~
ur-whale
>> why can't data science simply be about applying the scientific method in
>> the realm of data analysis?

> That's what a statistician does.

Mmmh.

Run that experiment for me next time you meet a statistician:

    
    
        - ask him if he can apply Chi-squared to a decision problem
    
        - ask him if he can *explain* how and why Chi-squared works.
    

In my experience, all statisticians can do the first, almost none can do the
second.

Learning how to use a screwdriver to screw screws without understanding
notions of torque and moment doesn't mean you're applying the scientific
method.
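
For readers who want the "why" made concrete, here is a minimal sketch of
Pearson's chi-squared statistic computed by hand for a contingency table. The
counts are made up for illustration, and the helper name
`chi_squared_statistic` is hypothetical. The idea: under independence, each
cell's expected count is (row total x column total) / grand total, and the
statistic sums the squared deviations from that expectation, scaled by the
expectation.

```python
def chi_squared_statistic(table):
    """Pearson's chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical 2x2 table: rows are two groups, columns are two outcomes.
observed = [[30, 10], [20, 40]]
stat = chi_squared_statistic(observed)

# Critical value of the chi-squared distribution, 1 degree of freedom, alpha = 0.05.
CRITICAL_DF1_ALPHA_05 = 3.841
reject_independence = stat > CRITICAL_DF1_ALPHA_05
```

A table whose rows are exact multiples of each other gives a statistic of
zero, which is the "nothing to see here" case the test measures against.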

~~~
pimmen
I would argue that they should be able to understand it to the level that they
can at least defend Chi-squared as a tool for the problem at hand. Then, they
should be able to evaluate whether or not it works correctly.

If a medical researcher is testing a new radio-therapy treatment, but can't
mathematically model every fission problem you can throw at them, they're
still applying the scientific method.

------
NPMaxwell
Tangential: I loved the appearance of the term, "Data Scientist". Scientists
are domain experts. A biologist knows cell membranes. A chemist knows
valences. For decades statisticians had been pitching that they were helpful
without knowing the domain: appear-deliver-and-run consultants. It was
wonderful to see a group embrace the importance of knowing the system that is
producing the data. I regret the more recent movement of "data science"
consultants who try to run models and AI without understanding the systems the
data came from.

~~~
sgt101
This: totally. We don't understand data, or databases. We've only had them
for 20 years really; we don't understand the lifecycle or how value is
accumulated or destroyed, and we don't understand the composite behaviour or
the dynamics with the users.

There is a crossover: data science is often about constructing data resources
from other data resources (and then advanced analysis like: count how many x).
Doing this rigorously and efficiently, and with regard to the underlying
infrastructures and other users (don't kill production), is a big trick.

~~~
mnem
I think databases and analysis have been around a little longer than 20
years...

~~~
sgt101
Not in their current form: we got relational systems in 1970, but data
warehousing came in in the mid-90s. The use of data for multiple purposes
(not just to underpin one grand application) is relatively new - as is the
practice of using data for a purpose for which it was not gathered... remember
all the "all the statistician can do is pronounce what has gone wrong" stuff?

------
em500
This is fairly accurate in my experience:
[https://twitter.com/thesmartjokes/status/684286479401652224](https://twitter.com/thesmartjokes/status/684286479401652224)

Day to day, it's mostly SQL or, worse, Hive queries, which make most things
much slower than they should be.

~~~
minimaxir
One thing rarely discussed with the rise of big data is how to do _efficient_
querying, especially at scale.

I've had a ton of data science interviews which ask how to reimplement binary
search from scratch (which I would never do on the job), but not anything
about how to do efficient JOINs and query nesting.

~~~
bioquestion
I work as a biostatistician and I've been tasked recently with querying large
databases using SQL, in addition to analysing the data. However, my
programming background is very limited and thus I'm sure my queries are very
inefficient.

Could you point me to some materials/texts about how to improve querying
efficiency for SQL? If it's oriented for beginners then that would be ideal.

Thanks in advance.

~~~
collyw
This site is great.

[https://use-the-index-luke.com/](https://use-the-index-luke.com/)
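
As a small taste of the indexing idea, here is a hedged sketch using Python's
built-in `sqlite3` module. The `orders` table, its columns, and the index name
are hypothetical, and other databases report their query plans differently.
It asks SQLite for its query plan for the same lookup before and after an
index exists on the filtered column.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable steps of SQLite's plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The textual description is the last column of each plan row.
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

lookup = "SELECT amount FROM orders WHERE customer_id = ?"

# Without an index on customer_id, SQLite has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# With an index, the same query can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, lookup, (42,))

print(plan_before)
print(plan_after)
```

Reading the plan before and after a change is usually a quicker way to find a
slow query than guessing.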

~~~
bioquestion
Thank you! I've already started reading it and it seems to be just what I
needed.

------
appleiigs
“It has been a common trope that 80% of a data scientist’s valuable time is
spent simply finding, cleaning, and organizing data, leaving only 20% to
actually perform analysis.”

Is it really a trope? In my experience, I almost think collecting data is >80%.

~~~
visarga
I get to spend 90% of my time collecting, cleaning and tagging data.

~~~
pbhjpbhj
Sounds like poor role definition? You don't have your laboratory scientist
cleaning lab equipment or ordering reactants.

------
thibautg
\- years (continuous endeavor): find existing data sources

\- months: convince management to give access to data source

\- weeks: try to find the connection string

\- days: clean up the data (mostly converting dates to yyyy-mm-dd) and
importing/exporting csv files

\- hours: load data in database, write simple SQL query and simple
visualisation

\- seconds: brief moment of satisfaction
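
The "days" step above might look something like this minimal sketch, which
normalises date strings to yyyy-mm-dd with the standard library. The list of
input formats is an assumption for illustration; real data will need its own
list, and day-first versus month-first ambiguity has to be settled per source.

```python
from datetime import datetime

# Assumed input formats, tried in order. Day-first is an assumption here.
ASSUMED_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y", "%d.%m.%Y")


def to_iso_date(raw):
    """Convert a date string in one of the assumed formats to yyyy-mm-dd."""
    text = raw.strip()
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # Not this format; try the next one.
    raise ValueError(f"unrecognised date format: {raw!r}")


cleaned = [to_iso_date(value) for value in ["31/12/2018", " 2018-08-15 ", "01.02.2017"]]
```

Raising on unknown formats, rather than guessing, keeps bad rows visible
instead of quietly turning them into wrong dates.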

------
cpeterso
I just started reading _Weapons of Math Destruction_ by mathematician Cathy
O'Neil. She warns about big data systems that codify racism and classism from
flawed data and self-fulfilling feedback loops. The systems' "unbiased"
decisions are opaque, proprietary, and often unchallengeable.

[https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction](https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction)

~~~
listenallyall
If "big data" informs you that a group of people (race, geographic location,
nationality, occupation, gender, etc) is less likely to, say, pay back a loan,
or successfully complete 4 years at university, or avoid insurance claims...
must that be anything-ist? Or just a fact? Or, as your post suggests, are you
obligated to throw out the conclusion and just assume the inputs must have
been "flawed data"?

~~~
chrischen
Big data may conclude that the data shows black people are more likely to
commit crimes, but it doesn't tell you the why, which may be that black people
are systematically oppressed, more likely to come from a poorer background,
and discriminated against so as to reduce their options in life. The AI, based
on statistics, can tell you the effects, but not the causes.

So if I were to go and beat up every person named John, and then machine
learning takes a crack at it and tells us that people named John are more
likely to get injured, we may end up discriminating against Johns without
realizing it was I who just had a thing against people named John. If this
happens in a system and creates a feedback loop, it can lead to something
becoming a self-fulfilling prophecy when it need not have been.

Based on a naive conclusion the solution may be to deny health coverage to
people named John, but obviously the real solution is to put me in jail.

~~~
ekianjo
Regardless of the data used, not knowing the why does not invalidate a strong
correlation at all. If you only acted once you knew the why, there would be no
progress in science, because there are always new WHYs you uncover as you go.

~~~
tanana
That's why, when studies discover new correlations, they have to replicate
such findings in different study designs that actually allow you to say
something about causation.

In medicine, that would often mean a randomized study between treatment A and
a placebo.

However, with factors such as race or sex, this is obviously impossible.

~~~
ekianjo
Instead of looking at race or sex, looking at individual behaviors usually
explains things better but it is harder to capture.

------
minimaxir
There has been a rise in _romantic_ thought pieces lately about how Data
Scientists are wizards and can solve any problem with the real superpower of
teamwork. (here's an older example from Instacart:
[https://tech.instacart.com/data-science-at-instacart-dabbd2d...](https://tech.instacart.com/data-science-at-instacart-dabbd2d3f279))

In the real world, the state of affairs in Data Science is more practical and
pragmatic. And there's nothing wrong with that.

------
rahimnathwani
Not worth reading. For example, this makes no sense:

"(2) decision science, which is about “taking data and using it to help a
company make a decision”; and (3) machine learning, which is about “how can we
take data science models and put them continuously into production.”"

Machine learning isn't about putting models into production. It's about
machine learning models directly from data.

And if decision science is 'taking data and using it to help a company make a
decision', then pretty much any job involves data science, e.g. the guy
comparing quotes for paperclips and picking a vendor.

~~~
lmkg
> Machine learning isn't about putting models into production. It's about
> machine learning models directly from data.

From a business perspective, the thing that's different about "machine
learning" compared to other things you do with data is that it's possible to
take the human out of the loop. That's a qualitative difference, as opposed to
the quantitative difference of your business analysts giving better
recommendations. We can quibble over terms, but as a broad stroke, things that
are machine learning can do that and things that aren't machine learning
cannot.

That qualitative difference is the main thrust of the quote you pulled,
although it could be more explicit. Rather than the analyst building a model
that tells him what shade of red is best for a button so that he can pass that
information along to a design team, the button color is connected directly to
the model.

~~~
rahimnathwani
The distinction, as you've restated it, still isn't useful:

'From a business perspective, the thing that's different about "machine
learning" compared to other things you do with data is that it's possible to
take the human out of the loop.'

There are many things you can do with data that take humans out of the loop,
that don't involve machine learning. For example, software that automatically
re-orders stock in a supermarket once stock (calculated based on starting
stock less sales) goes below some level.

You could argue that this still has a human in the loop (to define a
threshold) and that you're not removing the human from the loop until the
thresholds themselves are automatically calculated.

But then you're just moving the job of the human from deciding the threshold,
to deciding what % of the time it's acceptable to be out of stock of that
item. Sure, you can automate that, too, but then the job of the human still
exists: she's just deciding the objective function that stock-out percentage
must satisfy, rather than deciding the stock-out percentage for each SKU
directly using a jupyter notebook or Excel sheet.
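
To make the layers in this argument concrete, here is a minimal sketch. The
function names and the quantile-based policy are assumptions for illustration,
not anything from the article. The first two functions are the plain
threshold rule with no machine learning; the third derives the threshold from
an acceptable stock-out rate, which simply moves the human's decision up one
level.

```python
import math


def current_stock(starting_stock, units_sold):
    """Stock on hand, calculated as starting stock less sales."""
    return starting_stock - units_sold


def should_reorder(starting_stock, units_sold, threshold):
    """Human picks the threshold; software does the rest."""
    return current_stock(starting_stock, units_sold) < threshold


def threshold_for_stockout_rate(daily_demand_history, max_stockout_rate):
    """Human picks an acceptable stock-out rate instead of a threshold.

    Returns the smallest observed demand level that would have covered demand
    on at least (1 - max_stockout_rate) of the days in the history.
    """
    ordered = sorted(daily_demand_history)
    days_to_cover = math.ceil((1 - max_stockout_rate) * len(ordered))
    return ordered[max(days_to_cover, 1) - 1]


# Hypothetical daily demand for one SKU.
history = [1, 2, 3, 4]
threshold = threshold_for_stockout_rate(history, max_stockout_rate=0.5)
reorder_now = should_reorder(starting_stock=10, units_sold=9, threshold=threshold)
```

Either way a person still chooses a number; only which number changes.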

~~~
abakker
I sincerely wish more people thought like this. Nothing is different about
machine learning. It only performs better than OLS in a very specific subset
of rich data, where improving prediction/action is important.

~~~
rahimnathwani
I think of OLS as just one type of machine learning.

OLS is great for many types of problem.

For others, other techniques massively outperform it in some way (e.g. CNNs
for classifying camera images or spectrograms of audio data).

Even where OLS performs well, it seems other techniques can frequently do
better.

~~~
abakker
Totally fair. I guess what I was getting at (poorly) was that OLS has been
around for a long time; there's lots of hype for ML now, but plenty of the
techniques here have long been readily available.

------
jamesblonde
Who are Data Scientists' heroes? Seriously.

In AI, it's Hinton, LeCun, Bengio.

In systems, it's D. Ritchie, J. Dean, Berners-Lee, Torvalds.

In distributed systems, it's Lamport, Chandy, J. Dean.

In programming languages, it's D. Ritchie, Gosling, Dijkstra, Knuth, Milner,
etc.

Who are data scientists' heroes or role models?

~~~
em500
Efron, Hastie, Tibshirani (basically Stanford stats).

~~~
f311a
They are statisticians.

~~~
laichzeit0
They wrote the books Elements of Statistical Learning and Introduction to
Statistical Learning in R. Those books are about least squares regression,
clustering, decision trees, random forests, boosting, additive models, support
vector machines, etc.

All these are common statistical learning methods used in Data Science.

~~~
LeanderK
Also, if you read the fantastic Computer Age Statistical Inference by Efron
& Hastie (it's available online!), you notice they are both fans of
data-science! The whole book reads like a big argument for why we need
data-science and why traditional statistics is not always the answer.

This becomes especially obvious in the epilogue, where they try to give a quick
overview of how the concept of "data-science" formed and how statistics
diverged into data-science + ML and the traditional statistics community.

They end the book by arguing that both communities should come together,
because fundamentally they try to do similar things. I also think this is badly
needed. Unfortunately, I experience some arrogance on both sides, which makes
it harder! (DS/ML people have no idea what they are doing and only throw their
algorithms at problems & benchmark them! Statisticians are obsolete & I can
just automate them with NNs!)

------
didibus
After reading this article, I still have no idea.

I mean, it made it sound like data scientist is just the same as a business
analyst? Is this the new computer scientist vs software engineer?

~~~
compcoffee
> _I mean, it made it sound like data scientist is just the same as a business
> analyst?_

I think this demonstrates how hard "titles" are, because a "business analyst",
in the sense that I learned, is not at all like a data scientist (or data
analyst):

_"Business Analysis is a research discipline of identifying business needs
and determining solutions to business problems. Solutions often include a
software-systems development component, but may also consist of process
improvement, organizational change or strategic planning and policy
development."_

Most BA work I've done involved translating business requirements into
technical or software requirements.

In other words, who knows...

~~~
didibus
Hmm, or maybe I meant data analyst. Like you say, who knows.

------
killjoywashere
“Data scientist” is what they call a statistician on the West Coast. - an East
Coast statistician

~~~
snackematician
This is true. I'm a Ph.D. statistician with a "data scientist" job position in
San Francisco.

------
crunchlibrarian
Data science lost all of its appeal to me when I spent a weekend diving in and
found it was about 70% fiddling with weights until you get the answer you want
and 30% trying to figure out why the data was so wrong.

------
pleasecalllater
A friend of mine told me that in his company people write ETLs, send them to
an external service for processing, and get that back - this is what they call
"doing data science" :)

------
dopeboy
As a full stack engineer who knows very little about data science, what
courses, libraries, etc are worth my time to explore? What should I be well
versed in to be competent in the future?

~~~
TheAceOfHearts
A boring response, but have you studied stats? Knowing stats and a bit of SQL
is enough to get you pretty far with a lot of problems. I'd consider those
skills an important pre-requisite to more advanced tools and techniques.

~~~
dopeboy
I haven't, at least not since my last university class. I'd love to find some
kind of course that trains me on the basics along with applying it through
programming.

------
bane
Mostly connecting to data, cleaning it and finding some place to stash all of
it.

Only after that 90% is done can anybody think about modeling data,
transforming it, processing it and lastly that glorious 5% of actually
analyzing it.

Oh, and then somebody wants the results of the analysis to be put into a fully
interactive scalable web application so now we're late.

------
mcrad
It is a hybrid of software engineering and stats.

Calling it science is a stretch. I can understand if you are solving problems
in a traditional scientific field, but if you are doing economic modeling to
manage investment risk and optimize profit for an internet company, it's
hardly science. What a scam!

~~~
natalyarostova
Science isn't a noble and pure endeavor. It's just a methodology to construct
predictive power from information.

~~~
mcrad
Science is more of an institution than what you are describing, which sounds
like some isolated act of making a prediction.

------
DrNuke
Maybe Data Science attains scientific grade when ablation analysis becomes
mandatory?

~~~
sgt101
I like the idea of ablation analysis, but when did fiddling with it until it
changes become science?

~~~
DrNuke
Less black box, more reproducibility / generalisation is what people ask for
these days, so ablative studies exposing how the bricks in the model work
individually? In old terms, sensitivity studies.
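
A toy sketch of the idea, under loud assumptions: the "model" is just a fixed
linear combination of features, the data is made up, and a feature is
"ablated" by zeroing its weight without refitting (a real ablation study
would usually retrain without the component). The helper names are
hypothetical.

```python
def mean_squared_error(predict, rows, targets):
    """Average squared difference between predictions and targets."""
    errors = [(predict(row) - target) ** 2 for row, target in zip(rows, targets)]
    return sum(errors) / len(errors)


def ablation_report(weights, rows, targets):
    """Zero each weight in turn and report how much the error increases."""

    def make_predictor(w):
        return lambda row: sum(wi * xi for wi, xi in zip(w, row))

    baseline = mean_squared_error(make_predictor(weights), rows, targets)
    report = {}
    for i in range(len(weights)):
        ablated = list(weights)
        ablated[i] = 0.0  # Remove this "brick" and leave the rest untouched.
        error = mean_squared_error(make_predictor(ablated), rows, targets)
        report[i] = error - baseline
    return baseline, report


# Made-up data where the target is 2 * x0 + x2 and x1 is irrelevant.
rows = [[1, 5, 0], [2, 3, 1], [3, 1, 2]]
targets = [2, 5, 8]
baseline, report = ablation_report([2.0, 0.0, 1.0], rows, targets)
```

Here removing feature 0 hurts most, removing feature 2 hurts a little, and
removing feature 1 changes nothing, which is the kind of per-brick evidence
being asked for.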

------
jblow
There are some quotes missing from this title ... the proper spelling is Data
“Scientists”.

~~~
glup
Their work can be perfectly scientific. My problem with it is the
redundancy — imagine someone claiming to be a "food chef."

------
nsxwolf
ETL.

------
claydavisss
Query Monkey

