
The Open Source Data Science Masters - hooande
http://datasciencemasters.org/
======
mailshanx
As many people here have mentioned, mastering everything can take far too
long and is completely unnecessary in practice.

It is much, much more important to master the foundations, which consist of:
1) Probability and statistics, 2) Linear algebra, and 3) Programming
proficiency.

To be more precise, here is what you would need to know in each area:

1) Probability and stats: Properties of the most common distributions (Normal,
Poisson, Exponential), conditional probability, properties of expectation,
Bayes rule and its applications, hypothesis testing, confidence intervals,
bootstrap / jackknife, know how to compute the power of a test, be able to run
Monte Carlo simulations to model real world phenomena.

2) Linear algebra: Know enough to understand and compute eigenvalues /
eigenvectors, the SVD, and other matrix factorizations.

3) Programming proficiency: Algorithms and design patterns. It's enough to
know the typical undergraduate algorithms course content: trees, heaps, hash
tables, various sorting algorithms, and graph algorithms (shortest paths,
spanning trees, max flows).
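
To make item 1 a bit more concrete, here is a minimal sketch of two of the techniques named there: a percentile bootstrap confidence interval and a Monte Carlo estimate of the power of a test. It uses only the Python standard library. The function names, the synthetic exponential data, and the choice of a two-sided one-sample z-test with known sigma are all assumptions made for illustration, not anything prescribed by the curriculum.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


def estimate_power(effect, sigma=1.0, n=30, n_sims=2000, z_crit=1.96, seed=0):
    """Monte Carlo power of a two-sided one-sample z-test (H0: mean = 0)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Simulate one experiment under the alternative (true mean = effect).
        xs = [rng.gauss(effect, sigma) for _ in range(n)]
        z = statistics.mean(xs) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1
    # Fraction of simulated experiments in which H0 was rejected.
    return rejections / n_sims


# Hypothetical data: 200 draws from an exponential distribution with mean 1.
data_rng = random.Random(42)
data = [data_rng.expovariate(1.0) for _ in range(200)]
print("95% bootstrap CI for the mean:", bootstrap_ci(data))
print("estimated power at effect=0.5, n=30:", estimate_power(0.5))
```

Setting `effect=0.0` in `estimate_power` should give a rejection rate near the nominal 5% level, which is a quick sanity check on the simulation.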

If you know this, you can easily pick up everything else on a need-to-know
basis. Depending on the domain you are working in, that might entail either
natural language processing, computer vision, machine learning or big-data
tools like Hadoop / Spark etc. All of that is very easy to learn once you have
the foundations.

~~~
ska
One thing I think you are missing here is

4) Elements of numerical analysis

Understanding how matrix inversion works on paper is very different from
understanding a computer implementation. If you don't understand how floating
point really works, you are bound to make fundamental errors in this kind of
work.
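
As a small illustration of the floating point issue (a sketch only, not taken from the curriculum): the "textbook" variance formula loses all its accuracy when the values are large relative to their spread, while Welford's algorithm stays accurate. The function names and the data are chosen for this example.

```python
import math

# 0.1 and 0.2 have no exact binary representation, so exact equality fails.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True


def naive_variance(xs):
    """Population variance as E[x^2] - E[x]^2: prone to catastrophic cancellation."""
    n = len(xs)
    mean_sq = sum(x * x for x in xs) / n
    mean = sum(xs) / n
    return mean_sq - mean * mean


def welford_variance(xs):
    """Population variance via Welford's algorithm: numerically stable."""
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for k, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)
    return m2 / len(xs)


# Large offset, small spread. The true population variance is 22.5.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
print("naive:  ", naive_variance(data))    # far from 22.5
print("welford:", welford_variance(data))  # 22.5
```

Both functions are mathematically equivalent on paper. The difference only appears once the squares (around 1e18) exceed what a 64-bit float can hold exactly.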

~~~
jeffreyrogers
Are most people doing data analysis actually going to be implementing their
own numerical algorithms though? I'd imagine most people just use whatever
library is available in R, Python, or whatever other language they're using.

~~~
RBerenguel
It's important to have at least a basic knowledge, so you can avoid common,
really stupid pitfalls that libraries can't handle out of the box but humans
can (e.g. non-interpolable functions, small number division...)
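
One possible reading of the "small number division" pitfall, sketched under that assumption: a ratio of two likelihoods, each a product of many small probabilities. Both products underflow to zero, so the direct division fails, while the same quantity computed in log space is fine. The probabilities and function names below are made up for the example.

```python
import math


def naive_ratio(ps, qs):
    """Likelihood ratio computed as a ratio of raw products."""
    return math.prod(ps) / math.prod(qs)


def log_ratio(ps, qs):
    """Log of the same ratio: sums of logs never underflow here."""
    return sum(math.log(p) for p in ps) - sum(math.log(q) for q in qs)


# Hypothetical per-observation probabilities under two models.
ps = [0.02] * 500
qs = [0.01] * 500

try:
    print(naive_ratio(ps, qs))
except ZeroDivisionError:
    print("naive ratio failed: both products underflowed to 0.0")

print(log_ratio(ps, qs))  # about 346.57, i.e. 500 * ln(2)
```

No library can fix this after the fact. The person writing the code has to know to reformulate the computation.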

------
dkarapetyan
Here's the problem with stuff like this. If you actually learn all the stuff
that is listed it will take you at least 3 years if not more. All the subjects
listed require prolonged study if you want them to become part of your
intellectual repertoire and then it takes some more time until you are
comfortable applying that knowledge in non-trivial settings. On top of all
that you have to stay motivated to consistently study and make progress
alongside whatever else you are doing for your day job.

I program every day and I've been studying parsers, virtual machines, and
compilers for some time now and it still requires hard work to not fall back
on bad habits. It takes even more effort to consistently apply techniques I
learned while studying those subjects in day to day programming. There is no
royal road.

~~~
clarecorthell
It's worthwhile to explain why many people find the OSDSM useful.

The topics listed do indeed require deep study before you can understand and
command them adequately. The OSDSM is most useful for people who don't
know where to _start_ , which is often the hardest part. The hardest things
are often the most trivial in hindsight.

Typical interaction:

Q: I want to be a Data Scientist. What do you think I should study first?

A: Build a basic proficiency in linear algebra, programming in Python, and
statistics. Then take cursory classes in the subjects in the OSDSM that
interest you. You'll figure out what depth is most meaningful to you from
there.

Q: What does a Data Scientist actually need to know?

A: Totally and completely depends on what you want to do. There are people who
crunch click logs all day, people who comb the tendrils of search algorithms,
and yet others who seek terrorists and criminals in statistical signs among
the bits and bytes. See my last answer for relevant guidance.

e.g. [https://medium.com/@clarecorthell/the-brief-multi-tweet-
enum...](https://medium.com/@clarecorthell/the-brief-multi-tweet-enumeration-
of-job-titles-encompassed-by-data-science-8d48a988e3a1)

~~~
dkarapetyan
In that case I have some feedback. As a beginner I still don't know where to
start after reading your post. There are 101* bullet points in your post
distributed among IDEs, books, online courses, libraries (the programming kind
not the book kind), programming languages, pure math, applied math, database
theory, machine learning, natural language processing, visualization, etc.
There is no obvious ordering of how hard or easy things are: the very first
link in the math section is another list of bullets tackling some heavy-duty
mathematics, and I say this as someone who studied mathematics at the graduate
level.

Expanding mailshanx's answer will be much more helpful to beginners than the
current 101 bullet points.

*I just counted the bullet points with document.querySelectorAll('li') so there are some false positives.

~~~
clarecorthell
Start here:
[http://bit.ly/uwintrodatascience](http://bit.ly/uwintrodatascience). It's the
first bullet.

------
dantiberian
Here's my personal list for those interested, inspired by the Data Science
Masters: [http://danielcompton.net/2014/03/23/data-science-
curriculum](http://danielcompton.net/2014/03/23/data-science-curriculum)

------
huherto
The field of Data Science is very wide and there are many techniques. How do
you decide what to specialize in when you don't know which techniques you will
need to solve your next problem?

~~~
clarecorthell
The truthful meta-answer which you probably don't want to hear: We only have a
statistical likelihood (of dubious confidence) of what will transpire in the
future. We have no knowledge of the future. None. The same is true of your
knowledge of your future problems. Use some basic Bayesian logic and guess in
an educated manner, like with most prediction problems.

More practical advice: Find out what problems you're interested in working
on, then google, read books, engage people working on those problems to tell
you more about how they solve them. Another great place to start is to look at
your business' biggest inefficiencies and informational gaps, and determine
which of them could potentially be addressed with prediction or statistical
inference. This used to be called Business Intelligence. Even a coffee shop
can benefit from understanding simple seasonality.

------
akbar501
The list of Python libs is very helpful. Thanks for putting this together.

------
clarecorthell
I'm curious about the audience here -

What career goal is driving your interest in learning Data Science?

~~~
KrisAndrew
It's more of a response to the market. Beginning about 5 years ago I was
increasingly asked to do more numerically oriented things. Prior to that I was
mostly writing applications that generated SQL and wrapped the results in some
HTML. Pretty boring. Data science is more compelling.

Over the past 15-20 years there has been a massive amount of information
piling up in databases and log files; not just from web applications but from
desktop and mobile apps too. And there are companies who want to pan for gold
in that data. So if you want to do something more interesting than fiddle with
canvases, or CSS or MVC frameworks, then data science is fairly accessible if
you're not afraid of math. Furthermore, most companies will have a need for it
even if you don't really care to develop their software products directly.

NB: I doubled my salary by moving into data science. Nowadays gas station
attendants can write a Rails app to search a database. The bar has been
lowered. Understanding stochastic gradient descent (among other things) and
knowing where/when to use it commands more earning power.

~~~
dasboth
"I was mostly writing applications that generated SQL and wrapped the results
in some HTML. Pretty boring." \- This is basically why I'm interested in
making the move (eventually). Earning power may turn out to be a perk but the
main motivation is something that is perpetually stimulating.

------
shire
How does someone who has responsibilities and dependents find the time to
learn all this stuff? Time is of the essence.

~~~
clarecorthell
I took six months off for the OSDSM. I am very aware that it was a luxury to
do so. I had to take loans, move out of my apartment, and I studied 10 hours
per day.

If, instead of working 10 hours per day, 6 days per week for six months (1440
hrs), you put in 6 hours per week, it would take you more than 4 years to
finish a similar curriculum.
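
The arithmetic behind that estimate, spelled out as a quick check. Treating six months as 24 weeks is an assumption here; it is the value that reproduces the 1440-hour figure.

```python
HOURS_PER_DAY = 10
DAYS_PER_WEEK = 6
WEEKS_FULL_TIME = 24        # assumption: six months counted as 24 weeks
PART_TIME_HOURS_PER_WEEK = 6

total_hours = HOURS_PER_DAY * DAYS_PER_WEEK * WEEKS_FULL_TIME  # 1440
part_time_weeks = total_hours / PART_TIME_HOURS_PER_WEEK       # 240.0
part_time_years = part_time_weeks / 52                         # roughly 4.6

print(total_hours, part_time_weeks, round(part_time_years, 1))
```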

Managing your time is still the hardest part of self-study. It always will be.
Most people pay institutions to structure their lives with a workload,
deadlines, time off, expectations, and consequences (positive and negative).
Having the time for self-study is a luxury few people have; most people won't
have the time, money, and opportunity to give up for a "classic liberal
education," and by extension they won't have time for self-study, either (one
such conversation on the topic: [http://www.newyorker.com/culture/culture-
desk/loud-nathan-he...](http://www.newyorker.com/culture/culture-desk/loud-
nathan-heller-joshua-rothman-elite-colleges)). That's part of why new forms of
education like The OSDSM are so necessary. Such curriculums fit exceptional
cases that are less and less the exception.

------
afdfdfdfd
Is there something similar but for a CS degree?

~~~
clarecorthell
Try Codecademy and CS106a (how I learned programming at Stanford)
[https://www.udemy.com/cs-106a-programming-
methodology](https://www.udemy.com/cs-106a-programming-methodology)

I've started cataloging some of the best beginning resources here. Main
benefit is that people can battle out what the best resources are in PRs.
[https://github.com/datasciencemasters/go/blob/master/basic-p...](https://github.com/datasciencemasters/go/blob/master/basic-
programming.md)

------
upquark
Thank you for putting this together, looks like a great collection of
resources.

------
misiti3780
No ggplot under visualization - what's up with that?

~~~
clarecorthell
You'll find R resources here:
[https://github.com/datasciencemasters/go/blob/master/r-resou...](https://github.com/datasciencemasters/go/blob/master/r-resources.md)

The OSDSM maintains a focus on Python resources.

