Show HN: An Open-Source Data Science Curriculum (github.com)
135 points by coderjack on Jan 16, 2014 | 43 comments

Well done! As a data scientist, I can say that your curriculum is spot on.

And I also think you're taking the right approach building/hacking/doing rather than going to an institution. This stuff is so new that I find myself spending 5-10% of my time just trying to stay up on the latest tech.

One thing I would add though, data science is really 3 things usually: business knowledge, hacking and lastly stats/machine learning. The stats piece is shockingly easy as more and more modules/packages/libraries make it possible to create/train a model in 2 lines of code. (Applying the right model to your data set is difficult.)

The other shocking thing is that really, 80% of my time is probably spent hacking, and most of that is just spent on getting data.

The original post for this was what inspired me to create this: http://afaq.dreamhosters.com/free-cs

This is fantastic. Addresses my concern about not being quite up to par in my math education. Combining the two of these is exactly what I'm looking for.

i like this. good job.

Thanks! Any feedback would be welcome.

Is the best approach to just go straight down the list? Or should I do the first required course in each before moving on to the next?

However, within each category, it is best to take the required courses (or feel confident with the topic) before taking the electives.

You can do either. As long as you finish the Math and CS required courses first, the others do not require a specific sequence.

I'm biased as a non-technical analyst with an academic history, but I'm concerned that the curriculum doesn't meet the most basic needs.

Quoting the section 'An Academic Shortfall': "Academic credentials are important but not necessary for high-quality data science. The core aptitudes – curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature – that distinguish the best data scientists are widely distributed throughout the population."

In my estimation, none of those aptitudes are covered by teaching technical skills like databases, NLP, ML, graphical models, and the other topics this curriculum covers.

The "core aptitudes" generally boil down to asking the correct questions, establishing the correct answers, and correctly defending them. Academia doesn't automatically instill these skills, but it can do a great job of doing so.

Either way, inside or outside the ivory tower, an ace programmer who masters NLP, ML, Hadoop, and everything else could easily still come out without the required core aptitudes, and be thoroughly unprepared to do what data scientists are really expected to do: answer questions.

The best advice I've both heard and passed on, when asked about hiring "data scientists", is you want someone who can look at raw data, massage it, and develop their own original opinions and insights about it, preferably derived from a deep understanding of the nuances of statistics and probability and the messy real world. It's the judgement and insight you're primarily looking for. So, hire the best mathematician, statistician, probabilist, economist, or heck biologist, epidemiologist, etc. you can find, and then teach them the vocational tools - hadoop, etc.

It's an open source curriculum, so you could get involved and help mold it to meet what you see are its needs as a curriculum, no? I'm not sure what I think of it yet as a curriculum, but I am bookmarking it and plan to come back to it for more resources. Personally, I think it's an interesting idea and something I want to watch to see how it unfolds.

I'm happy to share what I learned, but sadly, everything I learned was in a classroom lecture, and the course isn't available online. There may be something on one of the major MOOC hubs, though. I'll keep an eye out.

Sure, you can learn to use a technology and still be braindead or poor at analysis.

This is an applied curriculum with a focus on specific technologies that enable analysis-bent people to leverage their brains with technology. That's why data work is so beautiful -- it's space to demonstrate unquantifiables like curiosity, diligence, creativity, and grit.

The quality of your projects is likely a good metric for your aptitude for data work, which is why I strongly advise working on a personal project.

I'd love to get more pull requests with more materials that teach analysis!

I see many books on Python listed. It's a good language for data analysis and scientific computing, especially with scipy, but there are alternatives, of course. I like Fortran 95, which is available in gcc as gfortran. A relevant book for data scientists would be "Developing Statistical Software in Fortran 95" (2005).

Out of curiosity, why do you prefer Fortran?

Note: preemptive clarity: this isn't a language flamewar thing, I'm genuinely curious.

Arrays in Fortran 90+ are a powerful feature -- there are whole array operations and operations on array slices, as in Matlab and Python with numpy. It's easy to allocate multidimensional arrays. Compilers are good at optimizing code -- if it's easiest to do something with loops you can go ahead and not worry about vectorizing the code, as you might with R or Matlab. There is a lot of statistics code in Fortran, so it's good to have at least a reading knowledge of it.
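For readers who know Python better than Fortran, the whole-array and slice operations described above look roughly like this in numpy. This is only an illustrative sketch with made-up values, assuming numpy is installed; it is not a translation of any particular Fortran program.

```python
import numpy as np

# Allocate a 3x4 array holding 0..11 in one call.
a = np.arange(12, dtype=float).reshape(3, 4)

# Whole-array operation: every element is doubled, no explicit loop.
doubled = a * 2.0

# Array slice: the first two columns of every row.
first_two_cols = a[:, :2]

# Reduction along an axis: the sum of each column.
col_sums = a.sum(axis=0)

# The same doubling written with explicit loops.
looped = np.empty_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        looped[i, j] = a[i, j] * 2.0

print(np.array_equal(doubled, looped))
print(col_sums)
```

Both versions give the same result. In an interpreted language the loop version is usually the slow one, which is the vectorizing concern mentioned above; the claim there is that a Fortran compiler optimizes the loop form well enough that you do not have to worry about it.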

This is why I'm so in love with pandas (pandas.pydata.org) -- Wes McKinney did the world a favor creating a library with powerful, manipulable multidimensional data structures

I feel like lists of resources are OK, but with something like data science, which has its own branches and specialities, it would be good to have some kind of stack ranking of topics and information beyond just 'start here'. That way, a reader gets more and more conversant with the different ideas being thrown around.

Also, I don't have a list of these handy, but I've found long annotated notebooks/blog posts of worked data science examples very helpful for refreshing my memory on applied techniques. http://derandomized.com/ is a great example, maybe other HN readers have some favourites we could add.

The list is for beginners. Of course data science is a huge domain and you can't actually make a roadmap to becoming a master in this field, but I am sure this list will help people get a basic understanding of what to read and how to work things out.

I think the list is more helpful for someone a step or two beyond beginner. A true beginner is going to look at that and be scared. Once they've read a few introductory things, they'll be able to go back and make better sense of it, for sure. (I showed this to a friend of mine who's interested in learning data science and that was his reaction, so I am generalising, but I think it's a fair generalisation.)

I do a lot of statistical work, but wouldn't call myself a data scientist. To that end:

> I geared the original curriculum toward Python tools and resources, so I've explicitly marked when resources use other tools to teach conceptual material (like R)

Why did you choose Python over R? Personal preference, a bent toward Python in the online courses you found, or is Python generally considered the de facto language choice for professional data scientists?

I imagine you could tackle these courses with any programming language, but if Python seems to be the way the data science community is going, it would be helpful to know that. Personally, I'm curious because I'm trying to decide if I should pick up Python on the side to supplement the knowledge I already have of R and various other programming languages.

Also, thanks for putting all this together. It's great!

Python is definitely not the de facto choice. Python and R both have strengths and weaknesses, and relative use depends quite a bit on community. A few recent surveys (http://blog.revolutionanalytics.com/2014/01/in-data-scientis..., http://www.kdnuggets.com/2013/10/rexer-analytics-2013-data-m..., http://blog.revolutionanalytics.com/2013/09/top-languages-fo...) show strong growth for both R and python, and I expect that will continue in the future. There's no reason to limit yourself to one language, and knowing both R and python can only help. That said, you do need to be careful about spreading yourself too thin, and you want to make sure you're an expert in at least one data analysis environment.

And let's not forget SAS, Matlab/Octave, Julia, STATA, SPSS, etc... There are a ton of choices out there. While not currently in vogue, there are a lot of SAS/STATA/SPSS jobs out there. Generally it is bigger, more established companies that use this software and these languages, but if your goal is to get into a more stats-focused position, they can be a good choice to learn.

In most tech companies, data scientists have adopted python over R or Matlab. The benefit is it's easier to scale python to larger datasets, and of course easier for the engineers in the company to take python code that computes algorithms and such and put them into production / products.

Working with a technology that is more highly documented is better when you're starting out. Especially when you're teaching yourself and StackOverflow is your TA. Python is very appropriate for someone who's new to data work.

Otherwise, asking what technology to use is like asking what mode of transportation to use to get to a destination -- it's not the point. The important part is that you arrive. Some days walking over the mountain is the least sensible method, other days high seas make taking the boat around it impossible. The tool that gets the job done is the best tool.

Apparently the R people already recognize Python over R.


I haven't been excited about my career in a while but I am getting more and more excited about data science. I have a new desire to learn and to actually finish my masters degree.

Thanks for putting these resources together.

It would be nice to have a clear list of assumed capabilities (for example, I am familiar with basic programming and have built a few websites, but I haven't taken a math class since senior year of high school, and don't think I remember enough about Calculus to do anything that assumes knowledge of it). Just a simple list of what level math, stats, and programming fluency this starts from would be great.

Data Science is still approachable in the "Edison" style: only do as much math as you're comfortable with, but keep probing discrepancies between different models. It's more debugging than architecting. Evidence: none of the top kaggle competitors is an academic/statistician to my knowledge.

I understand what you are saying, and I am not looking to become a mathematician or statistician. It has just been so long since I took a math class that before diving into any of these courses, I'd like to be aware of the minimal fluency expected.

On the entry "NLP with Python O'Reilly / Book", note that a second edition might be "in the works." There is an online edition of the work in progress (updating for Python 3 and NLTK 3) available here [1].

[1] -- http://nltk.org/book3/

A resource I've found invaluable and that I can't find listed is videolectures.net

Particularly http://videolectures.net/pascal/ has plenty of lectures and tutorials from their summer schools on very relevant topics for machine learning.

thanks. this is a good resource :) I've explored most of these courses on Coursera over the last year before finally deciding to go to grad school. For me, the biggest factor was motivation (in terms of actually doing the assignments and projects), networking, and getting internship opportunities. However, I still am using the coursera lectures as a supplement to my courses and it helps a lot.

Regardless of any path you take, these are very exciting times to be in computing sciences. All the best to everyone and keep upgrading your skills and knowledge :)

For the first course, Intro to Data Science, the only thing you have access to is the videos. You can't see other resources like the homework. What to do about this?

The assignments are, in fact, available: https://class.coursera.org/datasci-001/assignment

You'll just be required to log in.

I'm logged in but get "Looks like you are not enrolled in this course!".

Google around; oftentimes people have downloaded the lecture videos / homework assignments.

Thanks for putting these resources together, this is amazing

All credit goes to the original author.

that's kinda awkward, but something is wrong with my monitor when I'm on that website.

surely awkward.... :-)

YES I support this.
