
How to learn data science - spYa
https://www.dataquest.io/blog/how-to-actually-learn-data-science/
======
minimaxir
The actual problem with learning "data science" is making inferences and
conclusions which _do not violate the laws of statistics_.

I've seen many submissions to Hacker News and Reddit's /r/dataisbeautiful
subreddit where the author goes "look, the analysis supports my conclusion and
the R^2 is high, therefore this is a good analysis!" without addressing the
assumptions required for those results.
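
A minimal sketch of how this goes wrong, using made-up data: fit a straight line to points that are actually quadratic. The helper names (`fit_line`, `r_squared`) are just for illustration. The R^2 comes out above 0.9, yet the residuals are U-shaped, so the linearity assumption behind the fit is clearly violated.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data: the true relationship is quadratic, not linear.
xs = list(range(1, 21))
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"R^2 = {r2:.3f}")
# Residuals are positive at both ends and negative in the middle:
# a systematic pattern that a high R^2 does not reveal.
print([round(r, 1) for r in residuals])
```

Plotting or even just printing the residuals is the cheap check that the "R^2 is high" argument skips.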

Of course, not everyone has a strong statistical background. But I've seen
_YC-funded big data startups and venture capitalists_, who should really,
really know better, commit the same mistakes.

"Data science" is a buzzword that is successful only due to obscurity and no
one actually _caring_ if the statistics are valid. That's why I've been attempting
to open source all my statistical analyses/visualizations, with detailed steps
on how to reproduce. (see my recent /r/dataisbeautiful submissions on reddit:
[https://www.reddit.com/user/minimaxir/submitted/](https://www.reddit.com/user/minimaxir/submitted/)
)

~~~
nabla9
Roughly 80% of the data scientists I know have a PhD in something very
math-heavy. The rest have master's degrees. There are programmers who can
assist them with the grunt work, but it's just basic programming to help
analysts crunch data.

If you want to do data science for real:

1\. Get a master's or PhD in statistics, computer science, economics, physics
or some other math-heavy field and specialize in data analysis in that field.
You must learn lots of statistics when doing so.

2\. Learn programming, statistical machine learning and tools of the trade.

Good data science is not based on collecting large amounts of data passively
and then mining it mindlessly. You need to ask the right questions and design
the data collection and modeling process based on those questions.

~~~
DataWorker
No. A PhD in statistics or economics means almost nothing at this point. Even
if it did, truly, signal mastery of the content, which it doesn't anymore, it
would signal to most people who do this kind of work that you're way
overqualified while simultaneously being totally ignorant of the day-to-day
work of actual data scientists.

If you want to be a useful data scientist, do a lot of work with data. If you
have strong programming skills and are flexible and a quick learner then you
will do well.

Spending the better part of your young adulthood getting a PhD in statistics,
unless you want to go into academia, just makes you look like a fool.

~~~
Baghard
I've had the complete opposite experience. Do the people who hire for this
kind of work often bet on non-PhD candidates? Do they trust themselves to
separate the wheat from the chaff?

Don't you want a colleague who is able to mention seminal papers for specific
problems? Who is able to read and understand these papers and can distill
useful features and optimizations from them?

People with PhDs who go into business usually end up in the better positions.
They hire other PhDs for the good positions to keep the signal (mastery of
the content) stronger.

As someone who did a lot of work with data I have little problem with my
usefulness, but a lot of problems opening doors to the really interesting data
companies (lacking a proper academic network). I wish I had gotten that PhD,
because right now applying to Google, Microsoft, Facebook, Yahoo or eBay for
data science positions makes me look like a fool.

~~~
GauntletWizard
I've met a lot of fools who've quoted all the right works, in both Computer
Science and Data Science. Computer Science fools usually get fired. Data
Science fools seem to get promoted to Yes-man status. It's a lot harder to lie
about your code than about your statistics; as the old adage goes, statistics
come right behind Lies and Damned Lies.

------
pvnick
Good article for beginners. A couple thoughts, just to build on what the
author said:

First off, data science == fancy name for data mining/analysis. Wanted to
clear that up due to buzzwordy nature of "data science."

Learn SQL - this is the big one. You must be proficient with SQL to be
effective at data science. Whether it's running on an RDBMS or translating to
map/reduce (Hive) or DAG (Spark), SQL is invaluable. If you don't know what
those acronyms mean yet, don't worry. Just learn SQL.
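
As a rough sketch of the kind of everyday query meant here: the `orders` table, its columns and the numbers are all invented for illustration, and SQLite via Python's standard library stands in for a real RDBMS. The same `GROUP BY` shape carries over to Hive or Spark SQL with minor dialect differences.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("alice", 20.0),
        ("bob", 35.0),
        ("alice", 15.0),
        ("carol", 50.0),
        ("bob", 5.0),
    ],
)

# Aggregate per customer: number of orders and total spend, biggest first.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)
```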

Learn to communicate insights - I would add here to try some UI techniques.
Highcharts, d3.js, these are good libraries for telling your data story. You
can also do a ton just with Excel and not need to write any code beyond what
you wrote for the mining portion (usually SQL).

I would also go back to basics with regard to statistical techniques. Start
with the simple Z score; it is such an important tool in your data science
toolbox. If you're just looking at raw numbers, try to Z-normalize the data
and see what happens. You'd be surprised what you can achieve with a high
school statistics textbook, Postgres/MySQL (or even Excel!), and a
moderate-sized data set. These are powerful enough to answer the majority of
your questions, and when they fail, move on to sexier algorithms.
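
A minimal sketch of Z-normalizing, assuming a small made-up series of daily signups (the `daily_signups` numbers and the cutoff of 2 are arbitrary choices for illustration):

```python
import statistics


def z_scores(values):
    """Return (value - mean) / standard deviation for each value."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical raw counts; the last day looks unusual by eye.
daily_signups = [102, 98, 105, 97, 101, 99, 160]
zs = z_scores(daily_signups)

# Flag anything more than 2 standard deviations from the mean.
outliers = [v for v, z in zip(daily_signups, zs) if abs(z) > 2]

for value, z in zip(daily_signups, zs):
    print(f"{value:>4}  z = {z:+.2f}")
print("flagged:", outliers)
```

The raw numbers only say "160 is bigger"; the Z score says how far outside the usual spread it is, which is what makes values comparable across differently scaled columns.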

Edit: one more thing I forgot to mention. After SQL, learn Python. There are a
ton of libraries in the Python ecosystem that are perfect for data science
(numpy, scipy, scikit-learn, etc). It's also one of the top languages used in
academic settings. My preferred data science workspace involves Python,
IPython Notebook, and Pandas (This book is quite good:
[http://www.amazon.com/Python-Data-Analysis-Wrangling-
IPython...](http://www.amazon.com/Python-Data-Analysis-Wrangling-
IPython/dp/1449319793))

~~~
rjusher
But how do you become a good data scientist, rather than a technical person
who knows how to apply an algorithm in Python/R?

What I am trying to ask is how you become good at setting your starting
point (formulating your hypotheses), communicating your insights and selecting
which tools apply where. If you are good at coding and have experience in
things related to computer science, you have the abilities to handle a
dataset (SQL knowledge) and the data tools (Python, Pandas, etc.), but that
doesn't earn you the title of data scientist.

~~~
eitally
Practice. And education, but mostly practice. This is the kind of thing that
is typically taught in formal educational settings (at least in engineering,
which is my experience). As an example, I learned more about probability &
statistics in 1) AP biology in high school, and 2) a "simulation systems"
class in my industrial engineering master's curriculum. We spent much of the
former class learning basic statistical analysis techniques (ANOVA, chi-
square, etc) to apply to our lab data, and the latter class was all about
statistical analysis of process flows (aimed at the real life problem of
factory production planning & scheduling and manufacturing process
optimization).

So, do I consider myself a data scientist? Absolutely not. But I do understand
basic statistical concepts and know how to apply them to several categories of
real life data analysis problems.

I'm a terrible coder, btw.

~~~
rjusher
Would you recommend any approach, or should I go dust off my high school and
college books in search of study material? Or is that material too basic?

------
stdbrouw
Definitely agree that those long lists that tell you to first become awesome
at combinatorics, linear algebra, then learn all about statistical inference
(that is, not actual statistical procedures but the mathematical underpinnings
of statistics that would enable you to construct and evaluate methods you
invent yourself), then move on to stochastic optimization... those are really
more about machismo than about actually helping people to learn data science.
Sure, linear algebra is helpful, but whether it's fundamental really depends
on the kind of data science you're keen to do.

I also generally dislike /r/machinelearning and /r/statistics because they
seem to have been taken over by people who will tell you to either get a PhD
or get out. But, for me, just learning whatever I thought I needed to help me
solve the problem at hand got me stuck really fast. There's so much statistics
where you really just have to learn it first before you can start to see when
and why you'd like to use it.

It never occurred to me to use hierarchical modeling and partial pooling for a
certain set of problems until after I'd read Gelman & Hill. I never thought
that inference on a changing process might require different techniques from
the techniques for stationary processes until I had to study Hidden Markov
Models for an exam. Heck, when I got started with data analysis I didn't even
realize that the accuracy of most statistics improves only in proportion to sqrt(n)
and so the next logical step in my mind was always "get more data!" instead of
"learn more about statistics!" (If you look at the industry's obsession with
unsampled data, data warehouses that store absolutely everything ever and
map/reduce, my hunch is I'm not the only one who lacks or at some point lacked
elementary statistical knowledge because it just never came up on their self-
motivated, self-directed learning path.)
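
To make the sqrt(n) point concrete, here is a small sketch using the textbook standard error of a sample mean, sigma / sqrt(n). The value of `sigma` and the sample sizes are made up, and the formula assumes independent samples.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean of n independent samples with std dev sigma."""
    return sigma / math.sqrt(n)


sigma = 10.0  # assumed population standard deviation (invented for the example)
errors = {n: standard_error(sigma, n) for n in (100, 400, 1600, 10000)}

for n, se in errors.items():
    print(f"n = {n:>6}  standard error = {se:.3f}")
```

Quadrupling the data only halves the error, and going from 100 to 10,000 rows buys 10 times the precision for 100 times the data.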

So I think the ideal learning path incorporates a bit of both: learn more
about what excites you and about what's immediately useful right now, but also
put aside some time to fill out gaps in your knowledge – even things that
don't immediately look useful – and make some time for fundamental/theoretical
study.

(x-posted from DataTau)

~~~
mikedmiked
+1 for DataTau, I didn't know about that.

[http://www.datatau.com/](http://www.datatau.com/)

Check out this too: [http://www.pyquantnews.com/](http://www.pyquantnews.com/)

------
lessthunk
Data science is a stupid buzzword. The ideal candidate knows enough about IT
to massage data, the more they know about the domain under investigation the
better, and for sure some statistics. Most of all, always do sanity checks:
does it make sense? Can it be? Is the data correct?

It is an art, like writing awesome code. Practice, practice, and working
with experienced people are key.

~~~
SG-
At least it's not "something engineer" like every other job in the tech field.

------
neovive
"You need something that will motivate you to keep learning." This is so true
and often forgotten. I am always learning new things, but the concepts that
stick, beyond just the basics, are tied to specific projects or solutions to
real problems. I'm typically ok with being a "jack-of-all-trades" for most
technologies, just to stay aware of new things. However, when it comes to
applying new concepts, skills, or tech to solve problems, a deeper
understanding is required; usually obtained through motivation.

~~~
liviu-
>"You need something that will motivate you to keep learning." This is so true
and often forgotten.

I'm surprised that both you and OP seem to think this advice is rare, as I've
personally seen it mentioned more than a few times. Professors always
brainstorm ways to motivate students, employers always seek methods to
motivate employees, etc. The answer always seems to be something along the
lines of 'do what you love', which is so overused it loses its impact.

Anyway, as someone new to data science, I did not feel like I gained any new
information after reading the article, and all the advice seems either
intuitive or rehashed. Looking forward to reading the HN discussion though.

~~~
studentrob
> Professors always brainstorm ways to motivate students, employers always
> seek methods to motivate employees

Uhh.. You must have quite a fortunate streak of having great teachers and
great employers. Speaking for myself, I had some great teachers, but finding a
boss who seeks to motivate employees with the right challenges is rare. Bosses
and companies generally assume your salary is the primary motivator.

------
washedup
If you want to learn about data science, read this book: [http://www-
bcf.usc.edu/~gareth/ISL/](http://www-bcf.usc.edu/~gareth/ISL/)

I have been going through it and I cannot think of a better resource.

------
acbart
As someone who uses "Data Science" to teach "Computational Thinking", I think
this blog post hits on a lot of really valuable pedagogical notes. Getting
motivated, learning things through doing, and having a strong context for your
learning.

For those wondering why I put my buzzwords in quotes, it's because I don't
want to sound like I'm a huge proponent of either of them. CT is the term I
use to describe how I teach my students about abstractions, algorithms, and
some programming. DS is the term I use to describe how students learn all of
that in the context of working with data related to their own majors. I'm not
trying to claim some crazy paradigm shift, just that it's a great way to
convince students that CS is useful to them.

------
davemel37
Anyone interested in data science should first study cognitive psychology. The
CIA has a manual on the psychology of intelligence analysis that is a must
read for anyone pursuing any analytical job.

If you don't understand how your mind sees, processes, retains and recalls
data...how can you possibly analyze it accurately?

~~~
cwyers
You have a link to where to obtain said manual?

~~~
davemel37
[https://www.cia.gov/library/center-for-the-study-of-
intellig...](https://www.cia.gov/library/center-for-the-study-of-
intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-
analysis/)

~~~
cwyers
Thank you.

------
anderspitman
These principles are useful when learning anything really: human language
(immersion), programming (build something), sports (practice), etc.

That said, as someone who worked in software engineering for 5 years without a
degree, and recently returned to school, I would say be careful not to
discount studying theory at the same time you're practicing your craft. I
really think a combined approach of structured university courses and MOOCs,
including reading textbooks, along with applying the knowledge has been the
best approach for me.

I was arrogant about "not needing" a degree for years, feeling justified by
the fact that I was making very valuable contributions as an engineer, until I
finally went back to school and realized how valuable theoretical knowledge
can be.

------
tyfon
I've been working as an analyst for 7 years; it's only in the last couple of
years that I've heard statistical analysis referred to as data science.

Am I missing something or is it just a new word?

~~~
noelsusman
In many ways it's just a new word for the same thing, but there are a few key
differences. The main difference between traditional statistics and data
science is strength in programming. Data scientists are also expected to be
more well versed in statistical modeling than your average programmer or data
analyst.

With that said, it's not really a new thing; people have been doing data
science for decades. The demand for people who can program and also do more
complex statistical modeling has skyrocketed so I think that's why there's a
new name for it now.

Part of the problem is that even with this definition there's a wide range of
abilities present in data scientists. A long time computer programmer who has
dabbled in statistics and a long time statistician who has dabbled in computer
programming would both be data scientists even though they bring very
different strengths to the table.

------
gbersac
I am doing this course and find it really good:
[https://www.edx.org/course/scalable-machine-learning-uc-
berk...](https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-
cs190-1x)

It is about creating a linear and logistic regression + PCA using Spark
(Python API).

------
graycat
Here are some topics. Are they considered relevant to _data science_?

Matrix row rank and column rank are equal.

In matrix theory, the polar decomposition.

Each Hermitian matrix has an orthogonal basis of eigenvectors.

Weak law of large numbers.

Strong law of large numbers.

The Radon-Nikodym theorem and conditional expectation.

Sample mean and variance are sufficient statistics for independent,
identically distributed samples from a univariate Gaussian distribution.

The Neyman-Pearson lemma.

The Cramer-Rao lower bound.

The martingale convergence theorem.

Convergence results of Markov chains.

Markov processes in continuous time.

The law of the iterated logarithm.

The Lindeberg-Feller version of the central limit theorem.

The normal equations of linear regression analysis.

Non-parametric statistical hypothesis tests.

Power spectral estimation of second order, stationary stochastic processes.

Resampling plans.

Unbiased estimation.

Minimum variance estimation.

Maximum likelihood estimation.

Uniform minimum variance unbiased estimation.

Wiener filtering.

Kalman filtering.

Autoregressive moving average (ARMA) processes.

Order statistics are always sufficient.

Farkas lemma.

Minimum spanning trees on directed graphs.

The simplex algorithm of linear programming.

Column generation in linear programming (Gilmore-Gomory).

The simplex algorithm for min cost capacitated network flows.

Conjugate gradients.

The Kuhn-Tucker conditions.

Constraint qualifications for the Kuhn-Tucker conditions.

Fourier series.

The Fourier transform.

Hilbert space.

Banach space.

Quasi-Newton iteration and updates, e.g., Broyden-Fletcher-Goldfarb-Shanno.

Orthogonal polynomials for numerically stable polynomial curve fitting.

Lagrange multipliers.

The Pontryagin maximum principle.

Quadratic programming.

Convex programming.

Multi-objective programming.

Integer linear programming.

Deterministic dynamic programming.

Stochastic dynamic programming.

The linear-quadratic-Gaussian case of dynamic programming.

~~~
S4M
The topics you mention are maths or applied maths topics. "Data Science" is a
bubbly term that roughly means "take that big dump of data and give me some
advice on how to make more money", so your list, very sadly, has little
relevance to it.

~~~
graycat
Most of those topics I listed are supposed to be good at taking data and
saying how to "make more money"!

------
vermontdevil
Also, learning the R language and using RStudio is a great way to get into
it. R has so many packages to help you do any data analysis. The learning
curve is quite steep though.

------
gtrubetskoy
Read this (free) book: [http://mmds.org/](http://mmds.org/)

------
searine
Moving data around is just grunt work.

Real science requires a creative and critical mind, which takes years to mold.

~~~
stdbrouw
Sounds like you've also spent years molding professional disdain for everyone
who's not a Real Scientist.

~~~
searine
No I've just seen too many people spin their wheels on "analysis" that is not
hypothesis driven.

You got to start with questions to get answers, and the hard part of science
isn't crunching data, it is asking the right question!

~~~
danenania
And how does the Right Question appear if not through exploration and
manipulation of the data?

Theory can obviously be very useful, but much of this stress on advanced
statistics and PhDs is just a smokescreen for academics who suck at
programming.

If you can't program and manipulate data, statistics won't save you because
you won't have the ability to dig deep enough to find valuable insights. On
the other side, if you know how to slice and dice data quickly and reliably,
you can learn a huge amount by applying only the simplest statistical
techniques. Generally the simple techniques are better anyway because they
make mistakes less likely and your findings are easier to communicate.

~~~
searine
>And how does the Right Question appear if not through exploration and
manipulation of the data?

Questions don't magically come out of a data set. Expecting them to is called
a fishing expedition, and it usually produces boring, descriptive results
which have no impact.
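
One statistical side of the fishing problem, sketched under stated assumptions: every column below is pure random noise, and the sizes (20 rows, 200 candidate columns) and the seed are arbitrary. Searching enough unrelated columns still turns up one that looks respectably correlated with the target, purely by chance.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)  # fixed seed so the run is repeatable
n_rows, n_columns = 20, 200

# The "outcome" and every candidate "predictor" are independent noise.
target = [random.gauss(0, 1) for _ in range(n_rows)]
columns = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_columns)]

correlations = [pearson_r(col, target) for col in columns]
best = max(correlations, key=abs)

print(f"strongest correlation found among {n_columns} noise columns: {best:+.2f}")
```

Starting from a question fixes which column you test before you look, which is what keeps a result like this from being mistaken for a finding.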

To answer impactful questions, you must go into your data collection with the
questions in mind. To understand what questions to ask, you need a trained,
critical, and creative mind. That is something you don't get from pushing
bits.

>If you can't program and manipulate data

Programming and manipulating data is easy. Almost every new statistician
these days can, and does, do this routinely.

What's hard is the years of intuition about what is meaningful and what is
noise.

I know. It's hard to hear, and career programmers most of all hate to hear it,
but it's the truth.

~~~
danenania
Anyone I've ever heard say "programming is easy" is without fail a terrible
programmer.

I'm not really sure how to respond to the idea that exploring a dataset isn't
a useful way to help develop questions about it. It's only a "fishing
expedition" if you have no idea what you're doing.

~~~
searine
>Anyone I've ever heard say "programming is easy" is without fail a terrible
programmer.

Development of a world-class application is difficult because of the
complexity built into a program of large scope.

Knowing enough programming to competently move a data set around is easy.
Hell, you could do most of it with just bash.

>I'm not really sure how to respond to the idea that exploring a dataset isn't
a useful way to help develop questions about it. It's only a "fishing
expedition" if you have no idea what you're doing.

Well I've seen a lot of it, in both science and business. People who spend a
lot of time and money to generate a large data set simply because they lack a
question to ask. They expect meaningful answers to just tumble out of it like
manna from heaven, and end up confused and dismayed when the answers aren't
impactful.

Fishing expeditions are looked down upon because they can only describe the
data you generated. That is minimally useful, and can be done without grabbing
a huge sample.

Good science starts with a question, then puts data to work to create new
insight by removing confounding factors through careful design.

------
pvaldes
Learn biology, chemistry and physics, question yourself often and use your
instinct.

------
curiousjorge
If I had some type of practical application that I knew could benefit from
data science (the way making a marketplace app motivates learning RoR, for
example), it would help a lot, as I would have a clear goal and a route to
achieve it. However, data science and machine learning are so broad and
seemingly complicated (I fear complicated math formulas and statistics), and
worse, I don't know what I want to achieve with them or what I want to make,
which really hinders the learning process for me. I need some incentive or
reward at the end of the goal.

~~~
sanderjd
Yes, the step 1 that nobody seems to mention is that you need to have a
question that you're curious about, which data analysis may be able to help
you answer. The reward is having some answer to that question, with an
argument for its validity. Instead of links to a bunch of datasets, I'd love
to see a site that collects questions with the potential for data-driven
answers. This perhaps exists somewhere.

