
The Data Science Process - EternalData
https://www.springboard.com/blog/data-science-process/
======
peatmoss
I dislike that "substantive domain knowledge" has here been replaced with
"communications skills." Science stands on the shoulders of giants, and
ignorance of what has come before doesn't do that.

Being able to spin a good yarn isn't really enough here. If data science just
becomes a code word for brogramming your way through a set of black-box ML
algorithms, then I will welcome the inevitable crash of data science.

If insight is the goal, then classic applied statistics plus reproducibility
feels like a much better story. At least if insight rather than "making it go"
is the goal.

~~~
nickdavidhaynes
>If data science just becomes a code word for brogramming your way through a
set of black-box ML algorithms, then I will welcome the inevitable crash of
data science.

A fundamental challenge I see here is how bottom-heavy data science feels now.
There are tons of people out there trying to "get into data science" from
other fields, but the number of people with substantive domain knowledge,
strong programming skills, and the math background to be able to understand
the ML black boxes is quite small relative to the number of people calling
themselves data scientists. In other words, real insight definitely is (or
_should be_ ) the goal, but real insight is really hard, and scikit-learn is
so easy.

My hope is that this improves over the next 5-10 years - the more mature data
science becomes as a discipline/career, the better the education will be and
the more experienced people there will be. There is a risk in the meantime,
though, that a flood of relatively inexperienced people causes a collapse in
expectations for data science, making businesses less eager to hire them in
the future.

~~~
jackgolding
In my experience the biggest hindrance to the future of data science is how
crappy it is to learn statistics. And I think this is why a lot of data
science courses stop at Z-tests and p values or super basic Bayes theorem. I
think mathematicians and statisticians have a lot of work to do to make more
advanced parts of the field more accessible, otherwise we will end up with
people ignoring important assumptions and using tools like a black box.

~~~
claytonjy
I completely agree; I've found it much harder to self-learn the stats than the
software side of things. Sibling post makes a good point, but I think the
history of stats vs. comp sci bears weight here too; having many people want
to learn stats outside academia is a much newer phenomenon than people doing
the same with programming.

Anyone have any good resources for self-teaching stats? I have a BS in math
but only took one stats course, and it was as terrible as all intro-stats
classes are. I have a strong, proof-based understanding of probability theory,
but haven't found a similar approach to stats. It all seems to be "if data
looks like this, use this test, watch for these pitfalls" which is terrible
for building intuition.

~~~
parul
Try the Khan Academy stats resources -
[https://www.khanacademy.org/math/statistics-probability](https://www.khanacademy.org/math/statistics-probability)

Datacamp also launched a bunch of new stats courses recently. I haven't
checked them out yet, but their courses are usually good quality.
[https://www.datacamp.com/courses/topic:probablity_and_statis...](https://www.datacamp.com/courses/topic:probablity_and_statistics)

------
nl
I'm interviewing people for multiple DS positions (subtle recruiting thing
there...) at the moment and it's not fun.

The number of people who can't work out what _kind_ of solution a DS scenario
needs is very disappointing. I'm not even talking about giving a "correct"
solution: most can't even work out the class of problem!

Here's something to think about: Are you doing visualization? Building some
kind of model to explain existing behavior? Building a predictive model? Is it
supervised or unsupervised?

This is pretty basic stuff (surely it's close to the FizzBuzz of data
science?), and yet it is borderline impossible to find people who just nail
it.

Why is this?

~~~
SatvikBeri
I also hire data scientists & have a similar experience. As far as I can tell,
many people are taught to start from a statistical/machine learning method and
apply it to a problem, but very few are taught to start from a problem and
figure out what techniques to use. Honestly, 95% of the time I solve my questions
through iterative SQL queries in a few hours, while I see most people using
laborious statistical methods the first chance they get.
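
To make "iterative SQL queries" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and the toy rows are all made up for illustration; the point is only the shape of the workflow: ask a broad question, look at the answer, then narrow the next query.

```python
import sqlite3

# Hypothetical toy data; table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, channel TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "email", 20.0),
        (1, "email", 35.0),
        (2, "ads", 15.0),
        (3, "ads", 10.0),
        (3, "email", 50.0),
        (4, "ads", 5.0),
    ],
)

# Pass 1: start broad. Which channel brings in the most revenue?
revenue_by_channel = conn.execute(
    "SELECT channel, SUM(amount) FROM orders GROUP BY channel ORDER BY channel"
).fetchall()

# Pass 2: narrow the question. Is the gap driven by more orders or by bigger orders?
orders_by_channel = conn.execute(
    "SELECT channel, COUNT(*), AVG(amount) FROM orders GROUP BY channel ORDER BY channel"
).fetchall()

print(revenue_by_channel)
print(orders_by_channel)
```

In this toy data both channels have the same number of orders, so the second query already suggests the difference comes from order size, with no model involved.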

~~~
cosmie

>95% of the time I solve my questions through iterative SQL queries in a few hours, while I see most people using laborious statistical methods the first chance they get.

The issue seems to be a mismatch between your posting and your workload.

I do hiring for a data team, and explicitly don't advertise a data science
role. While we do have projects that are advanced enough to fall under a data
science moniker, the majority of candidates we got for that role had very...
_academic_ expectations. But a business isn't a static, cleanroom environment
with everything already collected, cleaned, standardized, validated, and
normalized for use.

Re-titling the job posting to Data Specialist or Data Analyst resulted in a
lot more candidates that are perfectly well suited to the type of problem
solving you mentioned. There's an endless number of business problems where
this skillset can be applied, making them very flexible and providing high
labor utilization. Including getting to a "good enough" state for the few
problems we have that could benefit from the more advanced statistical methods
a data science candidate would bring to the table.

~~~
SatvikBeri
Yeah–to be clear, I pretty much totally revamped the hiring process once I
became a manager, and was speaking mostly from previous experience. I've found
splitting up the job into different titles "Data Analyst", "Data Scientist",
and "Data Engineer" depending on the actual role to work pretty well.

That said, even with the vast majority of analyst candidates, I find them very
eager to apply known methods–flexibility and problem-first thinking is rare
and extremely valuable.

~~~
daveguy
Those titles. What type of roles do they cover? Is there a quick summary --
particularly between analyst and scientist. I expect engineer is source
quality, repeatability, accuracy, precision, feature engineering, etc. In
other words making the data stable and easily consumed, whether that is
directly from the instrument or the charts for the final decision.

The nuance between analyst and scientist is less clear. Can you describe what
type of candidates the two draw, or what you look for depending on the title?

~~~
bigger_cheese
My job title is currently "Data Engineer", and I work in an industrial plant.
Here's my two cents:

My background is in Engineering (I'm a materials engineer by qualification).
What differentiates me from a statistician, analyst, etc. is my domain
knowledge. I have almost 15 years of experience working with industrial
processes. I have the background knowledge of chemistry, thermodynamics,
mechanics, etc. that someone with a stats background would be lacking. So when
I am asked to optimize an industrial process I can utilize that expertise
whilst developing models.

I would expect that a data scientist would know more about machine learning
and would have a much stronger stats background than me. They'd also probably
write much better code (I work in C/C++ and SAS, from what I have seen data
scientists tend to be Python/R focused).

------
arrosenberg
Is this data science? This is the process for being a good BI/Marketing/Web
Analyst. I use a variation of this process, but I've never really considered
this to be what Data Scientists are meant to be - I always saw Data Scientists
as being more specialized in statistics and algorithms, with less
specialization in domain knowledge and stakeholder communication.

If someone needs to improve their conversion funnel and help with segmentation
and reporting, they need an analyst. If you want to build an algorithm to
determine what content is shown to each customer when they make a request, you
need data scientists.

~~~
ende
Agreed. Too many people confuse basic BI for data science.

------
nrjames
This is missing an important component of the process. If you don't want to
have to reinvent the wheel every time you are asked to do a certain type of
analysis, you also have to set up some infrastructure to support your analytic
pipeline. That involves understanding databases, writing scripts to
automatically harvest data, possibly creating APIs for your data to support
flexible analytic views, etc. The more time I spend in data science, the more
time I find myself spending on these types of infrastructural tasks. It's
great to work for a company that provides engineers that will do all of this
for you, but those companies aren't super common.

~~~
pea
BTW this is what we're hacking on at NStack.. we're building an analytics
platform which gives you a high-level language which provides an abstraction
over infrastructure. The aim is that data teams can productionize code without
thinking about anything but business-logic and without requiring an
engineering team. So you can write things like..

    
    
      nstack start "Schedule { interval : 'Daily' } | DataWarehouse { sql : "./request.sql" } | YourPythonClassifier | Postgres { insert_table : "Results"}"
    

..which then gets distributed on your cloud-provider. You can kind of think
about it like a type-safe, distributed cloud bash!

I'm clocking off for bed, but would love to give you a demo if you're
interested: leo@nstack.com.

------
mifeng
I think the problem is that the industry hasn't created a product management
layer that interfaces between non technical business folk and the technical
data scientists. In software engineering, we don't expect the customers to
speak directly to the engineers; that's what product managers are for (cue
infamous Office Space scene).

However product managers aren't typically involved in solving data science
related problems. This is primarily because most product managers don't have
the math/stat/compsci background to be useful.

However I predict this will change in the next 5 years.

~~~
plusbzz
Agree with this. In fact, Lead Data Scientist roles often become de facto PM
roles, where the LDS basically spends their time prioritizing the important
research questions DS has to solve based on customer and business needs.

I've been hearing from multiple people that this is a gap that's really hard
to fill right now -- PMs who can work with heavy DS and AI products. It's much
easier to train experienced data scientists to be PMs than the other way
round.

------
vegabook
This looks suspiciously identical to what software engineering has always been
about.

There's nothing new in "data science", as per this post, beyond what has always
been true of building a piece of software for non-technical clients. It has
always been true that having domain expertise provides a huge boost. It has
always been true that requirements are moving targets, that objectives are
fluid, that clients don't talk computer science ("data science" in this case).
Clients (internal or external) often don't know how to describe their own
workflows, and especially edge cases, in rigorous ways. All deja vu. It has
always been true that you need to "frame the problem", "clean the data",
"design and apply the algos", "communicate the results". We've been grappling
with this for 50 years.

------
minimaxir
This article is a good overview of the _why_ of data science and
statistics-based decision making, but doesn't discuss much of the _how_ and the
various warnings that occur during the process (e.g. data gathering/fidelity
issues which invalidate models).

The article is marketing for a data science bootcamp which likely answers
those questions. There has been a lot of discussion on HN about the merits of
bootcamps for developers, but not much about the merits of bootcamps for
statisticians, or even the entire hiring workflow in that field.

I've been looking into Data Analyst/Science jobs at companies in the San
Francisco Bay Area and almost every position wants a Masters/PhD, either
explicitly stated as a requirement or implied. If there is a high demand/low
supply of data science jobs out there, I'm unsure how a data science boot
camp/tutorial would be able to compete.

~~~
boozywoozy
I've been researching various masters programs and bootcamps, including
talking to graduates of both.

My sense is that while there's a huge variance in quality on both, the median
bootcamp seems to be more in touch with industry and better at imparting real-
world skills than the median master's program. I'm not sure if employers have
started to recognize this yet (from your comment, it seems that they haven't).
But once the feedback loop completes, I'd wager that they will.

Also, getting a graduate degree and attending a data science bootcamp doesn't
seem to be mutually exclusive. For instance, there are data science bootcamps
that specifically target PhDs.

~~~
patmcguire
Yeah, there are a bunch of bootcamps for people who have statistics or
statistics-heavy PhDs and need to translate those skills from the academic to
the tech company context. Tends to work out pretty well.

------
DrNuke
Point is there is an explosion of ML usage in soft sciences and applied
industries that treat the tools as a black box. On the other hand, extreme
reliability is really not needed there, it's just a lot of non-math or basic-
math trained people messing around with stats, R packages and novel jargon.

~~~
DrNuke
Another point is the overreaction among the cognoscenti: so many words about
open source leveraging the tools and allowing the masses to focus on real-life
problems, then the rage against the end-users mocked as data-monkeys? Most
industries do not need data science at all and, if they do, a very simple
80/20 approach solves all their problems. It happens the 80/20 approach is
within the reach of every data-monkey able to clean and normalize datasets,
set up Anaconda with Scikit, theano, xgboost, do some ensembling and deploy to
AWS for semi-intensive tasks. You all as an industry wanted that for years, so
what now?

------
kbos87
"As an ethical data scientist concerned with both security and privacy, you
are careful not to extract any personally identifiable information from the
database. All the information in the CSV file is anonymized, and cannot be
traced back to any specific customer."

Honest question - is this really necessary and applicable in this scenario?
We're talking about a full time employee accessing company data, presumably
with any necessary permissions, to generate insights for internal consumption
within the company about its customers?

~~~
plusbzz
Yes, it is. Full time employees usually work on their laptops, which can be
stolen or hacked especially when they're outside work. Ultimately, people and
culture are usually the weakest links in security.
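
For what "not extracting PII" can look like in practice, here is a rough sketch using only the Python standard library. The column names, the sample rows, and the `SECRET_KEY` value are all hypothetical. It drops directly identifying columns and replaces the customer ID with a keyed hash (HMAC-SHA256) before anything is written to CSV. Strictly speaking this is pseudonymization rather than full anonymization: whoever holds the key can recompute the mapping, so the key would need to stay off the laptop.

```python
import csv
import hashlib
import hmac
import io

# Hypothetical secret; in practice it would be kept somewhere other than the laptop.
SECRET_KEY = b"replace-with-a-secret-kept-off-the-laptop"

# Columns treated as directly identifying in this made-up schema.
PII_COLUMNS = {"name", "email"}


def pseudonymize(customer_id: str) -> str:
    """Return a keyed hash of the ID so rows stay joinable but not readable."""
    return hmac.new(SECRET_KEY, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()


def export_rows(rows):
    """Return CSV text with PII columns dropped and customer IDs replaced."""
    out = io.StringIO()
    fields = [k for k in rows[0] if k not in PII_COLUMNS]
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    for row in rows:
        safe = {k: v for k, v in row.items() if k not in PII_COLUMNS}
        safe["customer_id"] = pseudonymize(str(row["customer_id"]))
        writer.writerow(safe)
    return out.getvalue()


rows = [
    {"customer_id": 17, "name": "Ada", "email": "ada@example.com", "plan": "pro"},
    {"customer_id": 17, "name": "Ada", "email": "ada@example.com", "plan": "basic"},
]
csv_text = export_rows(rows)
print(csv_text)
```

The same customer still maps to the same opaque ID in every row, so per-customer aggregation keeps working on the exported file.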

------
ellisv
I understand this is a marketing piece but as a data scientist the narrative
doesn't resonate with me at all. Are there any data scientists that have
actually had an experience like this?

~~~
SatvikBeri
This seems mostly like the "data analyst" job. Jobs with the "data scientist"
title that I know of are usually basically programming jobs with a focus
around machine learning and large scale data.

~~~
ddysgath
Those programming jobs exist because someone at some point discovered a
problem that could be solved through statistical techniques. The difference is
between moving quickly and exploring new problem spaces and really hammering
home on well-defined problems that have solutions that need to be implemented
in a way that looks a lot like normal software engineering.

------
tszymczyszyn
So, are all "data science" jobs related to marketing?

~~~
z2210558
In my experience most are, but there is also some work in operations
management and optimisation, product design (e.g. financial products, digital
products) and other areas.

I think marketing is the obvious first use case, but in large organisations
there are often gains to be made looking at operational data.

------
ExploDron
So this is a question for those of you in the comments.

I'm finishing up a Ph.D. in engineering (heavy into climate change research,
so tons of programming + mathematical + statistical knowledge in addition to
combing through TBs of data with R and other languages).

What kinds of problems are frequently present in the data science industry
that differ from academic research?

~~~
gautambay
I realize this isn't a proper answer to your question, but it reminds me of a
tweet from Monica Rogati:

    
    
      "A decade in academia taught me a bunch of sophisticated algorithms; a decade in industry taught me when not to use them."
    

Source:
[https://twitter.com/mrogati/status/726115691703619584](https://twitter.com/mrogati/status/726115691703619584)

~~~
ExploDron
Welp that's really fair. Thanks for the quote.

------
tshiran
Great article, this is the exact process I've observed multiple times.

------
Apfel
Down for me, but the main springboard page is up.

