

Ask HN: What would you put in a course for data science? - tomrod

I'm considering putting together a university course on data science, and would like your input as to the skills and techniques you'd expect a data scientist to have fresh out of school.

You all are the experts on what you need, and I'm all ears. Fire away!
======
eshvk
It depends on what your background is and what the role looks for. Jobs
require a wide spectrum of programming vs. statistics background. Having said
that, here is a list of basic stuff that may be useful to know.

1\. Computer Science:

\- Algorithms/data structures: enough to be able to tell which problems are
C.S. problems and which are statistics problems.

\- Systems: You may not have to build a system, but it is useful to know how
real-world systems are built, what sort of constraints come into play, and
what trade-offs there are, especially if you will be working with large-scale
datasets. You don't want to be remembered as the dude who did a select * order
by rand limit 10 on an HBase table.

\- Programming Language: Learn one programming language well. Depending on
your job, you may need to learn more than one. Python is a nice starting
language. One useful trick for learning more languages is to learn one
language really well and then see how the things you can do in it change in
the other language. Also, side note: don't get into one-true-language debates.
They are useless. Every language has its pros and cons.
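
To illustrate the HBase remark above: sorting a whole table by a random key
just to keep 10 rows typically means scanning and ordering everything. Below
is a minimal sketch of one common alternative, reservoir sampling, which keeps
a uniform sample of k rows in a single pass using O(k) memory. The function
name and the stand-in row iterator are assumptions for illustration; nothing
here is specific to HBase.

```python
import random


def reservoir_sample(rows, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Works in one pass and never holds more than k items, so the input can
    be a stream of unknown length (hypothetically, a table scan).
    """
    rng = rng or random.Random()
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Stand-in for a large scan: any iterable works.
picked = reservoir_sample(range(1_000_000), 10, random.Random(42))
print(picked)
```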

2\. Stats/Math/ML:

This is tough to characterize because the field is so diverse.

\- Probability: Get some basic probability under your belt. Getting the
intuition right is more useful than learning a lot of stuff. You can pick up
more complicated stuff (Stochastic processes, Stochastic Calculus stuff) as
and when you progress further anyway.

\- Statistics: At the very least, figure out hypothesis testing, biases,
p-values, estimators and regression. The more statistics I learn, the more I
am of the opinion that the tools matter less than a critical understanding of
where statistics should apply: what biases there are and how you can identify
them.

\- Linear Algebra: Again, a very basic undergraduate linear algebra course
(with vector spaces) should help you understand, say, matrix completion stuff.
Off the top of my head, I think grokking how vector spaces work, what
independence means, and how dimensionality reduction and kernels work is
useful.

\- Machine Learning: This mostly ties together the kind of stuff you learn in
the math courses. My basic ML 101 course in grad school covered the following:
Unsupervised Learning (KMeans or some other clustering algorithms), Supervised
Learning (discriminative and generative approaches, bias-variance tradeoffs,
etc.). I also learnt some silly bullshit on genetic algorithms.
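
As a small illustration of the hypothesis testing and p-value item in the list
above, here is a minimal sketch of a two-sided permutation test for a
difference in means. It is only one of many possible tests; the function name,
the default number of permutations, and the sample data are all made up for
the example.

```python
import random


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Estimate a two-sided p-value for mean(a) != mean(b).

    Null hypothesis: group labels are exchangeable. We shuffle the pooled
    data and count how often the shuffled difference in means is at least
    as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


control = [12, 15, 11, 14, 13, 12, 16, 14]    # made-up measurements
treatment = [15, 17, 14, 18, 16, 15, 19, 17]  # made-up measurements
print(permutation_p_value(control, treatment))
```

The point of the exercise is less the number that comes out and more the
question it forces: are the two groups actually exchangeable, or is there a
bias in how they were collected?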

So yeah, as long as you learn the basic fundamentals really well, you should
be able to pick up stuff fairly easily.

E.g. recommendation systems: I never learnt most of this in school as part of
a specific course. However, once you know what goes on in matrix decomposition
and know what regression is, you can understand why people do what they do
when they solve these problems.
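
To make the recommendation-system remark concrete, here is a minimal sketch of
low-rank matrix factorization fitted by stochastic gradient descent, which is
essentially a regularized regression on the observed ratings only. It is one
common textbook approach, not necessarily what any particular system uses. The
tiny ratings list, the hyperparameters, and all function names are assumptions
chosen for illustration.

```python
import math
import random


def predict(U, V, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(a * b for a, b in zip(U[u], V[i]))


def factorize(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.02,
              epochs=2000, seed=0):
    """Fit ratings[u][i] ~ U[u] . V[i] on observed (user, item, rating) triples."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for f in range(rank):
                uf, vf = U[u][f], V[i][f]
                # Gradient step on squared error with L2 regularization.
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V


def rmse(ratings, U, V):
    """Root mean squared error over the observed ratings."""
    se = sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))


# Made-up data: (user, item, rating). Some user/item pairs are unobserved.
RATINGS = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 1), (3, 2, 4), (3, 3, 5),
]

U, V = factorize(RATINGS, n_users=4, n_items=4)
print("training RMSE:", round(rmse(RATINGS, U, V), 3))
print("prediction for unobserved (user 0, item 3):", round(predict(U, V, 0, 3), 2))
```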

~~~
tomrod
Great suggestions. I'll probably structure the course in the stats/ml vein--
that corresponds to what I already had in mind.

Do you have any recommendations for big data components? Would it be worth
teaching how to use a Hadoop cluster, or is a small toy cluster too far
removed from what using a large cluster requires?

~~~
eshvk
Well, I took a couple of courses at UT which dealt with "practical"
distributed systems. The first was more along the lines of: here is a large
dataset, how do we design a distributed system that handles it? There, we
learned about principles/paradigms such as MPI, threading, and CUDA to address
these issues. I took another course that was basically only Hadoop. Both are
good. It completely depends on what the intended outcome of your course is: if
your students just need a flavor of what it means to think at that scale,
spinning up an EC2 cluster and working on a few MapReduce problems should be
fine. If you are looking more to inculcate general principles of how to think
at scale, the former kind of course should be good.
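
For a sense of what "a few MapReduce problems" look like, here is a minimal
single-process sketch of the map / shuffle / reduce pattern using word count,
the usual first exercise. It does not use Hadoop or any cluster API; the
function names are made up, and on a real cluster the shuffle step is done by
the framework across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

The useful habit for students is writing map and reduce as functions that only
see one record or one key at a time, since that restriction is what lets a
framework spread the work over many machines.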

------
ivan_ah
I think a good way to structure a course would be to cover several problems
from end-to-end: motivation (what you want to achieve), theory, data
preprocessing, algorithm development, and finally setting up a "production
grade" system that solves the problem.

In my experience learning ML, learning concepts in theory is good and all, but
I never really understood the details until I had to implement the
algorithm.

------
glimcat
Bill Howe did a solid intro course for the University of Washington. Videos
and other materials are available on Coursera.

[https://www.coursera.org/course/datasci](https://www.coursera.org/course/datasci)

The one thing I'd really change is to tighten up the range of tools used. It
seems helpful to show students a range of tools, but it usually ends up being
a major distraction for students and a lot of extra effort for course staff.
Any such course is already going to be a blitz of new concepts and technology.

Go full Python, plus interactive tools as helpful (Weka, Tableau). Let them
pick up R or D3.js or whatever later, after they have a better appreciation
for the concepts and such which make them useful.

------
rfergie
The hard part is not the coding or statistics; the hard part is figuring out
what to code/analyse.

I would want something on identifying actionable dimensions and how to talk to
people to figure out how to help them.

