Ask HN: What would you put in a course for data science?
6 points by tomrod on Dec 1, 2013 | 6 comments
I'm considering putting together a university course on data science, and would like your input as to the skills and techniques you'd expect a data scientist to have fresh out of school.

You all are the experts on what you need, and I'm all ears. Fire away!

I think a good way to structure a course would be to cover several problems from end-to-end: motivation (what you want to achieve), theory, data preprocessing, algorithm development, and finally setting up a "production grade" system that solves the problem.

In my experience learning ML, learning concepts in theory is good and all, but I never really understood the details until I had to implement the algorithm.

Bill Howe did a solid intro course for the University of Washington. Videos and other materials are available on Coursera.


The one thing I'd really change is to tighten up the range of tools used. It seems helpful to show students a range of tools, but it usually ends up being a major distraction for students and a lot of extra effort for course staff. Any such course is already going to be a blitz of new concepts and technology.

Go full Python, plus interactive tools as helpful (Weka, Tableau). Let them pick up R or D3.js or whatever later, after they have a better appreciation for the concepts that make those tools useful.

The hard part is not the coding or statistics; the hard part is figuring out what to code/analyse.

I would want something on identifying actionable dimensions, and on how to talk to people to figure out how to help them.

It depends on what your background is and what the role looks for. Jobs require very different mixes of programming and statistics background. Having said that, here is a list of basic stuff that may be useful to know.

1. Computer Science:

- Algorithms/D.S.: Enough to be able to tell which problems are C.S. problems and which are statistics problems.

- Systems: You may not have to build a system, but it is useful to know how real-world systems are built, what sort of constraints come into play, and what trade-offs there are, especially if you will be working with large-scale datasets. You don't want to be remembered as the dude who did a select * order by rand limit 10 on an HBase table.

- Programming Language: Learn one programming language well. Depending on your job, you may need to learn more than one. Python is a nice starting language. One useful trick to learning more languages is to learn one language really well and see how stuff you can do changes in the other language. Also, side note: don't get into one true language debates. They are useless. Every language has its pros and cons.
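
On the systems point above: an order-by-rand query has to touch and sort every row just to hand back ten. One alternative that shows the kind of trade-off meant there is reservoir sampling, which takes a uniform sample in a single pass with constant memory. A minimal sketch in plain Python; the function name and the fake "table" (a range of integers) are my own stand-ins, not anything HBase-specific:

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Uniform random sample of up to k rows in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Stand-in for scanning a large table row by row.
sample = reservoir_sample(range(100_000), k=10)
print(sample)
```

Memory stays at k rows no matter how long the scan is, and nothing gets sorted.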

2. Stats/Math/ML:

This is kind of tough to characterize, because the field is so diverse.

- Probability: Get some basic probability under your belt. Getting the intuition right is more useful than learning a lot of stuff. You can pick up more complicated stuff (Stochastic processes, Stochastic Calculus stuff) as and when you progress further anyway.

- Statistics: At the very least, figure out hypothesis testing, biases, p-values, estimators and regression. The more statistics I learn, the more I am of the opinion that the tools matter less than a critical understanding of where statistics should apply: what biases are there, and how you can identify them.

- Linear Algebra: Again, a very basic undergraduate linear algebra course (with vector spaces) should help you understand, say, matrix completion stuff. Off the top of my head, I think grokking how vector spaces work, what independence means, and how dimensionality reduction and kernels work is useful.

- Machine Learning: This mostly ties together the kind of stuff you learn in the math courses. My basic 101 ML grad school course covered the following: Unsupervised Learning (KMeans or some other clustering algorithms) and Supervised Learning (discriminative and generative approaches, bias-variance tradeoffs, etc.). I also learnt some silly bullshit on genetic algorithms.
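
For the hypothesis testing and p-value intuition mentioned in the statistics bullet, a permutation test is one way to see what a p-value actually measures without any distribution tables. A rough sketch in plain Python; the two groups are made-up numbers and the function name is my own:

```python
import random


def permutation_p_value(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled data and re-split it.
        rng.shuffle(pooled)
        left, right = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Made-up measurements for two groups.
control = [1, 2, 3, 4, 5, 6, 7, 8]
treated = [101, 102, 103, 104, 105, 106, 107, 108]
print(permutation_p_value(control, treated))
```

The p-value here is just the fraction of label shufflings that produce a gap at least as large as the one observed.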

So yeah, as long as you learn the basic fundamentals really well, you should be able to pick up stuff fairly easily.

E.g. recommendation systems: I never learnt most of this in school as part of a specific course. However, once you know what goes on in matrix decomposition and know what regression is, you can understand why people do what they do when they solve these problems.
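
To make the recommendation example concrete, here is a toy sketch of one common approach: factor the sparse user-by-item rating matrix into two small matrices by fitting a regularized least-squares problem with stochastic gradient descent. The ratings are made up, and the function names, rank, learning rate and regularization values are illustrative assumptions, not tuned settings:

```python
import random

# Made-up (user, item, rating) triples; most user/item pairs are unobserved.
RATINGS = [
    (0, 0, 5.0), (0, 1, 3.0), (0, 3, 1.0),
    (1, 0, 4.0), (1, 3, 1.0),
    (2, 0, 1.0), (2, 1, 1.0), (2, 3, 5.0),
    (3, 2, 4.0), (3, 3, 4.0),
]


def init_factors(n_rows, rank, rng):
    """Small positive random starting values for one factor matrix."""
    return [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_rows)]


def predict(P, Q, u, i):
    """Predicted rating is the dot product of user and item factors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def rmse(P, Q, ratings):
    """Root mean squared error over the observed ratings only."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


def train(P, Q, ratings, lr=0.01, reg=0.02, epochs=2000):
    """SGD on squared error with L2 regularization, observed entries only."""
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(len(P[u])):
                pu, qi = P[u][f], Q[i][f]  # use the old values for both updates
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)


rng = random.Random(0)
P = init_factors(4, 2, rng)  # 4 users, rank 2
Q = init_factors(4, 2, rng)  # 4 items, rank 2
rmse_before = rmse(P, Q, RATINGS)
train(P, Q, RATINGS)
rmse_after = rmse(P, Q, RATINGS)
print(rmse_before, rmse_after)
print(predict(P, Q, 1, 1))  # a rating user 1 never gave
```

Each update is a small regression step on one observed rating, which is where the "know what regression is" part comes in.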

Great suggestions. I'll probably structure the course in the stats/ml vein--that corresponds to what I already had in mind.

Do you have any recommendations for big data components? Would it be worth teaching how to use a Hadoop cluster, or is a small toy cluster too far removed from what using a large cluster requires?

Well, I took a couple of courses at UT which dealt with "practical" distributed systems. The first was more along the lines of: here is a large dataset, how do we design a distributed system that handles it? There, we learnt about principles/paradigms such as MPI, threading and CUDA to address these issues. I took another course that was basically only Hadoop. Both are good. It completely depends on what the outcome of your course is. If you are looking for someone who just needs a flavor of what it means to think at that scale, spinning up an EC2 cluster and working on a few MapReduce problems should be fine. If you are looking more to inculcate general principles of how to think at scale, the first kind of course should be good.
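
For a flavor of what a "MapReduce problem" looks like before touching a cluster, the classic word count can be sketched in plain Python as three phases: map, shuffle, reduce. This is a single-process illustration of the programming model only, not Hadoop's API, and all the function names are my own:

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

On a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, which is where most of the "thinking at scale" comes in.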
