
Show HN: An Open-Source Data Science Curriculum - coderjack
https://github.com/datasciencemasters/go/
======
squigs25
Well done! As a data scientist, I can say that your curriculum is spot on.

And I also think you're taking the right approach: building/hacking/doing
rather than going to an institution. This stuff is so new that I find myself
spending 5-10% of my time just trying to stay up on the latest tech.

One thing I would add, though, is that data science is really 3 things usually:
business knowledge, hacking and lastly stats/machine learning. The stats piece
is shockingly easy as more and more modules/packages/libraries make it
possible to create/train a model in 2 lines of code. (Applying the right model
to your data set is difficult.)

The other shocking thing is that really, 80% of my time is probably spent
hacking, and most of that is just spent on getting data.

------
afaqurk
The original post for this was what inspired me to create this:
[http://afaq.dreamhosters.com/free-cs](http://afaq.dreamhosters.com/free-cs)

~~~
prodev42
i like this. good job.

~~~
afaqurk
Thanks! Any feedback would be welcome.

~~~
the_watcher
Is the best approach to just go straight down the list? Or should I do the
first required course in each before moving on to the next?

~~~
afaqurk
However, within each category, it is best to take the required courses (or
feel confident with the topic) before taking the electives.

------
blakerson
I'm biased as a non-technical analyst with an academic history, but I'm
concerned that the curriculum doesn't meet the most basic needs.

Quoting the section 'An Academic Shortfall': "Academic credentials are
important but not necessary for high-quality data science. The core aptitudes
– curiosity, intellectual agility, statistical fluency, research stamina,
scientific rigor, skeptical nature – that distinguish the best data scientists
are widely distributed throughout the population."

In my estimation, none of those aptitudes are covered by teaching technical
skills like databases, NLP, ML, graphical models, and the other topics this
curriculum covers.

The "core aptitudes" generally boil down to asking the correct questions,
establishing the correct answers, and correctly defending them. Academia
doesn't automatically instill these skills, but it can do a great job of doing
so.

Either way, inside or outside the ivory tower, an ace programmer who masters
NLP, ML, Hadoop, and everything else could easily still come out without the
required core aptitudes, and be thoroughly unprepared to do what data
scientists are really expected to do: answer questions.

~~~
craigching
It's an open source curriculum, so you could get involved and help mold it to
meet what you see as its needs as a curriculum, no? I'm not sure what I think
of it yet as a curriculum, but I am bookmarking it and plan to come back to it
for more resources. Personally, I think it's an interesting idea and something
I want to watch to see how it unfolds.

~~~
blakerson
I'm happy to share what I learned, but sadly, everything I learned was in a
classroom lecture, and the course isn't available online. There may be
something on one of the major MOOC hubs, though. I'll keep an eye out.

------
Beliavsky
I see many books on Python listed. It's a good language for data analysis and
scientific computing, especially with scipy, but there are alternatives, of
course. I like Fortran 95, which is available in gcc as gfortran. A relevant
book for data scientists would be "Developing Statistical Software in Fortran
95" (2005).

~~~
hdevalence
Out of curiosity, why do you prefer Fortran?

Preemptive clarification: this isn't a language flamewar thing, I'm genuinely
curious.

~~~
Beliavsky
Arrays in Fortran 90+ are a powerful feature -- there are whole array
operations and operations on array slices, as in Matlab and Python with numpy.
It's easy to allocate multidimensional arrays. Compilers are good at
optimizing code -- if it's easiest to do something with loops you can go ahead
and not worry about vectorizing the code, as you might with R or Matlab. There
is a lot of statistics code in Fortran, so it's good to have at least a
reading knowledge of it.

~~~
clarecorthell
This is why I'm so in love with pandas (pandas.pydata.org) -- Wes McKinney did
the world a favor creating a library with powerful, manipulable
multidimensional data structures

------
jlees
I feel like lists of resources are OK, but with something like data science,
which has its own branches and specialities, it would be good to have some
kind of stack ranking of topics and information beyond just 'start here'. That
way, a reader gets more and more conversant with the different ideas being
thrown around.

Also, I don't have a list of these handy, but I've found long annotated
notebooks/blog posts of worked data science examples very helpful for
refreshing my memory on applied techniques.
[http://derandomized.com/](http://derandomized.com/) is a great example; maybe
other HN readers have some favourites we could add.

~~~
coderjack
The list is for beginners. Of course, data science is a huge domain and you
can't actually make a roadmap to mastering this field, but I am sure this list
will help people get a basic understanding of what to read and how to work
things out.

~~~
jlees
I think the list is more helpful for someone a step or two beyond beginner. A
true beginner is going to look at that and be scared. Once they've read a few
introductory things, they'll be able to go back and make better sense of it,
for sure. (I showed this to a friend of mine who's interested in learning data
science and that was his reaction, so I am generalising, but I think it's a
fair generalisation.)

------
hootener
I do a lot of statistical work, but wouldn't call myself a data scientist. To
that end:

> I geared the original curriculum toward Python tools and resources, so I've
> explicitly marked when resources use other tools to teach conceptual
> material (like R)

Why did you choose Python over R? Personal preference, a bent toward Python in
the online courses you found, or is Python generally considered the de facto
language choice for professional data scientists?

I imagine you could tackle these courses with any programming language, but if
Python seems to be the way the data science community is going, it would be
helpful to know that. Personally, I'm curious because I'm trying to decide if
I should pick up Python on the side to supplement the knowledge I already have
of R and various other programming languages.

Also, thanks for putting all this together. It's great!

~~~
hadley
Python is definitely not the de facto choice. Python and R both have strengths
and weaknesses, and relative use depends quite a bit on community. A few
recent surveys
([http://blog.revolutionanalytics.com/2014/01/in-data-scientis...](http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html),
[http://www.kdnuggets.com/2013/10/rexer-analytics-2013-data-m...](http://www.kdnuggets.com/2013/10/rexer-analytics-2013-data-miner-survey-highlights.html),
[http://blog.revolutionanalytics.com/2013/09/top-languages-fo...](http://blog.revolutionanalytics.com/2013/09/top-languages-for-data-science.html))
show strong growth for both R and Python, and I expect that will continue in
the future. There's no reason to limit yourself to one language, and knowing
both R and Python can only help. That said, you do need to be careful about
spreading yourself too thin, and you want to make sure you're an expert in at
least one data analysis environment.

~~~
pyoung
And let's not forget SAS, Matlab/Octave, Julia, STATA, SPSS, etc. There are a
ton of choices out there. While they are not currently in vogue, there are a
lot of SAS/STATA/SPSS jobs out there. Generally it is bigger, more established
companies that use these tools/languages, but if your goal is to get into a
more stats-focused position, these languages can be a good choice to learn.

------
jimzvz
I haven't been excited about my career in a while but I am getting more and
more excited about data science. I have a new desire to learn and to actually
finish my masters degree.

Thanks for putting these resources together.

------
the_watcher
It would be nice to have a clear list of assumed capabilities (for example, I
am familiar with basic programming and have built a few websites, but I
haven't taken a math class since senior year of high school, and don't think I
remember enough about Calculus to do anything that assumes knowledge of it).
Just a simple list of what level of math, stats, and programming fluency this
starts from would be great.

~~~
stillsut
Data Science is still approachable through the "Edison"-style: only do as much
math as you're comfortable with, but keep probing discrepancies between
different models. It's more debugging than architecting. Evidence: none of
the top kaggle competitors is an academic/statistician to my knowledge.

~~~
the_watcher
I understand what you are saying, and I am not looking to become a
mathematician or statistician. It has just been so long since I took a math
class that before diving into any of these courses, I'd like to be aware of
the minimal fluency expected.

------
craigching
On the entry "NLP with Python O'Reilly / Book", note that a second edition
might be "in the works." There is an online edition of the work in progress
(updating for Python 3 and NLTK 3) available here [1].

[1] -- [http://nltk.org/book3/](http://nltk.org/book3/)

------
Ajoo
A resource I've found invaluable and that I can't find listed is
videolectures.net

Particularly
[http://videolectures.net/pascal/](http://videolectures.net/pascal/) has
plenty of lectures and tutorials from their summer schools on very relevant
topics for machine learning.

------
it_learnses
Thanks, this is a good resource :) I've explored most of these courses on
Coursera over the last year before finally deciding to go to grad school. For
me, the biggest factors were motivation (in terms of actually doing the
assignments and projects), networking, and getting internship opportunities.
However, I am still using the Coursera lectures as a supplement to my courses
and they help a lot.

Regardless of any path you take, these are very exciting times to be in
computing sciences. All the best to everyone and keep upgrading your skills
and knowledge :)

------
prodev42
For the first course, Intro to Data Science, the only thing you will have
access to is the videos. You can't see other resources like the homework. What
to do about this?

~~~
mparr4
The assignments are, in fact, available:
[https://class.coursera.org/datasci-001/assignment](https://class.coursera.org/datasci-001/assignment)

You'll just be required to log in.

~~~
jimzvz
I'm logged in but get "Looks like you are not enrolled in this course!".

~~~
skadamat
Google around; oftentimes people have downloaded the lecture videos /
homework assignments.

------
emre
Thanks for putting these resources together, this is amazing

~~~
coderjack
All credit goes to the original author.

------
igvadaimon
that's kinda awkward, but something is wrong with my monitor when I'm on that
website.

~~~
coderjack
surely awkward.... :-)

------
Ryel
YES I support this.

