
Ask HN: How does a beginner best spend one week of learning machine learning? - mrborgen
I'll spend all my time next week studying machine learning. However, I know that it's easy to get lost in overly theoretical material when learning ML.

So I'm interested in some advice on how to best spend my time, given that I'd like to get as much practical knowledge as possible.

Here is the level I'm at now:

I'm currently halfway through Andrew Ng's ML course on Coursera, and will probably finish it within the week. I love the mix of theory and practice this course is built around.

I've also done the Udacity Intro to Machine Learning course, but found it too theoretical.

I roughly understand the basic principles of linear & logistic regression, cost functions, gradient descent, and the normal equation.

By the end of the week, I hope to be able to do linear regression using gradient descent on an actual dataset. If so, the week will have been very well spent!

My preferred language is Python.

All tips and suggestions are highly appreciated :)
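
The stated end-of-week goal fits in a few lines of numpy. A minimal sketch (the synthetic dataset, learning rate, and iteration count are illustrative choices, not prescriptions):

```python
import numpy as np

# Synthetic "dataset": y = 3x + 4 plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X[:, 0] + 4 + 0.1 * rng.randn(100)

# Prepend a bias column so theta = [intercept, slope]
Xb = np.c_[np.ones(len(X)), X]

theta = np.zeros(2)
alpha = 0.5  # learning rate
for _ in range(2000):
    # Gradient of the mean-squared-error cost, as in Ng's course
    grad = Xb.T @ (Xb @ theta - y) / len(y)
    theta -= alpha * grad

print(theta)  # should land close to [4, 3]
```

Swapping the synthetic `X`/`y` for columns of a real dataset is the only change needed to hit the goal above.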
======
geebee
I think you're in a position to learn a lot in a week. My advice would be to
create a simple data set and go through what you learned in Andrew Ng's course
with scikit-learn (even better, since you already prefer Python).

[http://scikit-learn.org](http://scikit-learn.org)

Much of what you learn in Ng's course is how to implement these algorithms -
there's less (no?) emphasis on using existing libraries in R or Python. I
think that implementing your own code base for logistic regression, neural
net, random forest, and so forth is an extremely valuable exercise, but I'd
recommend you put that aside just for the moment. Instead, try using some of
the existing libraries.

For instance, use scikit-learn's logistic regression, neural net, and random
forest (not covered in Ng's class) libraries to do a classification. You don't
want to use these with no understanding of how they work, but you've done the
Coursera course, so see if you can use your knowledge of how these algorithms
differ to create datasets that highlight the benefits of each approach (e.g.,
can you create a dataset that works great for logistic regression but poorly
for neural nets or random forests?). Think about how you'd use an unsupervised
approach to classification, and run it through k-means. I really think that
applying different techniques to the same dataset, at a high level, combined
with general knowledge of what's going on under the hood, can be a great way
to understand how/when to use these algorithms.
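
That compare-on-one-dataset loop might look something like this (a sketch with a toy dataset; the models and settings are illustrative, and recent scikit-learn's `model_selection` module is assumed):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A toy dataset; swap in your own to probe each algorithm's strengths
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean().round(3))

# Unsupervised take on the same data: ignore y and cluster
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```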

~~~
mrborgen
Thanks a lot for all the advice! Using a library at first seems like a good
idea, and then going deeper into the material as I learn the practicalities,
rather than doing it the other way around.

------
chollida1
Honestly, with only a week you can't really learn a lot so I'd recommend doing
something practical. See the canonical answer here:

[http://norvig.com/21-days.html](http://norvig.com/21-days.html)

Go to Kaggle and do one of the competitions.

[https://www.kaggle.com/competitions](https://www.kaggle.com/competitions)

and reference the wiki for help:

[https://www.kaggle.com/wiki/Home](https://www.kaggle.com/wiki/Home)

In my opinion, the only way to learn machine learning without a strong
foundation is to page in learning. That means you'll need to pick a practical
task and just jump into it and find out what you don't know.

If you've never done anything practical, great: you'll need to learn about
the basic data structures of machine learning.

Find out what a data frame is, what time series analysis is and what data
structures you need to do it.

Are you going to do classification or prediction? Pick one and learn some of
the basic tools used for it.
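
A data frame, for instance, takes only a few lines of pandas to explore; this sketch uses made-up prices and assumes a recent pandas with the `rolling` API:

```python
import pandas as pd

# A data frame is a labeled 2-D table; indexing rows by dates is the
# usual starting point for time series work
dates = pd.date_range("2015-01-01", periods=5, freq="D")
df = pd.DataFrame({"price": [10.0, 10.5, 10.2, 10.8, 11.0]}, index=dates)

df["change"] = df["price"].diff()               # day-over-day difference
df["rolling"] = df["price"].rolling(3).mean()   # 3-day moving average
print(df)
```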

~~~
mrborgen
Thanks for the tips!

I'm considering using a dataset the Bank of England has about households in
England. As far as I understand, the data can be used for both prediction and
classification. It has quite a lot of different features.

Which one would you recommend to start with, in general?

~~~
mikehz
I signed up to HN just to reply. I suggest doing the MNIST dataset. The
features are really simple (greyscale pixel values) yet it works very well.
You can also play with different classifiers to see how they behave (training
time and classification performance).

This will let you set up a whole pipeline from feature selection (in this case
just normalise, you can try 0 to 1 or -1 to 1 or subtract mean then divide by
stddev, or don't normalize and see what happens) to training the model and
evaluating its performance with cross validation. Then you can check your CV
results by submitting to the leaderboard.
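
The try-each-normalisation experiment can be sketched with scikit-learn's built-in 8x8 digits as a small stand-in for MNIST (the model choice and cv=3 are my own illustrative settings):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)  # 8x8 digit images, pixel values 0..16

# Three of the normalisation options mentioned above
variants = {
    "raw":     X,
    "0 to 1":  X / 16.0,
    "z-score": (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8),
}

results = {}
for name, Xn in variants.items():
    scores = cross_val_score(LogisticRegression(max_iter=2000), Xn, y, cv=3)
    results[name] = scores.mean()
    print(f"{name:8s} {results[name]:.3f}")
```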

I took Andrew Ng's ML course then played with the MNIST dataset. I learnt
heaps by doing this. Then I got carried away competing in real competitions.
:) That's where more advanced feature selection came into play as well as
making sure your CV split is representative of the test split.

I was using scikit-learn and just swapping classifiers in and out trying
different ones as well as trying different parameters. You can even roll your
own logistic regression if you want and see how regularisation affects
performance etc.

~~~
mrborgen
Awesome, thanks, I'll definitely test out that dataset! Btw, have you
proceeded with any MOOCs after Andrew Ng's ML course? Any you'd like to
recommend?

~~~
bra-ket
[https://www.coursera.org/course/neuralnets](https://www.coursera.org/course/neuralnets)

[http://deeplearning.stanford.edu/tutorial/](http://deeplearning.stanford.edu/tutorial/)

[http://vision.stanford.edu/teaching/cs231n/syllabus.html](http://vision.stanford.edu/teaching/cs231n/syllabus.html)

------
Emore
I wrote a machine learning cheat sheet once, as a way to learn when I was in
your seat: [http://eferm.com/machine-learning-cheat-
sheet/](http://eferm.com/machine-learning-cheat-sheet/)

Perhaps the cheat sheet in itself is useful for you, but mostly I'd recommend
the process of assembling information in a concise way as a good way to learn.

~~~
ivan_ah
Very cool. Thx for making this.

------
moserware
The best way to learn is to try to apply techniques on problems that are
interesting to you.

Random Forest is a very powerful technique these days that's usually pretty
good as a first-pass. Using it with permutation importance usually helps you
identify important variables.
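
Permutation importance is simple enough to hand-roll: shuffle one feature at a time and watch held-out accuracy drop. A sketch (the synthetic dataset, where only the first three columns are informative, is my own choice for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in the first 3 columns
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
base = rf.score(X_te, y_te)

# Shuffle each column in turn; a big accuracy drop = important variable
rng = np.random.RandomState(0)
drops = []
for j in range(X.shape[1]):
    Xp = X_te.copy()
    rng.shuffle(Xp[:, j])
    drops.append(base - rf.score(Xp, y_te))
    print(f"feature {j}: drop = {drops[-1]:.3f}")
```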

I cover several other machine learning "getting started" recommendations at
[http://stackoverflow.com/a/598772/1869](http://stackoverflow.com/a/598772/1869)

------
sputknick
I think you can achieve a lot in a week. The big four Python data
libraries are pandas, numpy, scipy, and scikit-learn. scikit-learn will
provide the most mileage of knowledge gained relative to the time invested
to learn it. If you aren't concerned with how effective the results are, you
can learn a lot from simple implementations. For example, a basic Random
Forest is three lines of code (note that fit() takes the training labels
along with the features):

    # create random forest
    forest = RandomForestClassifier()
    # train random forest
    forest = forest.fit(train_data, train_labels)
    # test random forest
    output = forest.predict(test_data)

That's just one example, but all implementations are reasonably easy for
someone with your foundation to learn quickly. Now... being good at it... that
will be your next challenge :-)

------
projectramo
I think my answer is a bit different. The interesting/hard part of ML is the
algorithm behind it all. It is important to get a really good sense of exactly
what the technique is doing. So I would not use python, or any computer
language if you are a complete beginner to a technique. I would just work it
out with a toy example and pen and paper. Make your own little decision tree,
or work out a Bayesian probability for a given set. The "problem" with a
library like, say, scikit-learn is that it does the "work" for you (sorry
about the heavy use of scare quotes), but you may not know what it is doing
well enough to analyze the output.

My two cents.

~~~
jlees
I'd definitely agree on not starting with scikit-learn, but using Python isn't
so bad. Implement a few things manually -- decision trees and Naive Bayes
really aren't that hard to do, and neither's clustering.
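
As a sense of scale, a from-scratch Gaussian Naive Bayes fits in about twenty lines; this sketch (my own illustrative code, checked on a made-up two-blob dataset) is the kind of manual implementation meant here:

```python
import numpy as np

def fit_gnb(X, y):
    """Per-class means, variances, and priors for Gaussian Naive Bayes."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def predict_gnb(params, X):
    preds = []
    for x in X:
        best, best_lp = None, -np.inf
        for c, (mu, var, prior) in params.items():
            # log P(c) + sum of per-feature Gaussian log-likelihoods
            lp = np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                              + (x - mu) ** 2 / var)
            if lp > best_lp:
                best, best_lp = c, lp
        preds.append(best)
    return np.array(preds)

# Two well-separated blobs as a sanity check
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 3])
y = np.array([0] * 50 + [1] * 50)
acc = (predict_gnb(fit_gnb(X, y), X) == y).mean()
print(acc)  # training accuracy on the toy blobs
```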

I quite like this site for an accessible, well-scoped intro with some good
example code: [http://guidetodatamining.com/](http://guidetodatamining.com/)

------
bjourne
I got a lot of mileage out of this book: [http://www.amazon.com/Programming-
Collective-Intelligence-Bu...](http://www.amazon.com/Programming-Collective-
Intelligence-Building-Applications/dp/0596529325) All his examples are in
Python, so if you already know Python it should work well for you. The only
minor thing is that there are a few typos here and there in the code, but
usually you can just use your common sense and figure out what the author
intended.

------
drd4
This would probably be a good tutorial for you to give a whirl. It's
practical, gradually ramps up in complexity, and it's in Python :)

Framework:
[http://deeplearning.net/software/theano/](http://deeplearning.net/software/theano/)
Tutorial:
[http://deeplearning.net/tutorial/gettingstarted.html#getting...](http://deeplearning.net/tutorial/gettingstarted.html#gettingstarted)

------
bpp
Here's a great repo from my friend Preston Parry
([https://news.ycombinator.com/user?id=ClimbsRocks](https://news.ycombinator.com/user?id=ClimbsRocks)):
[https://github.com/ClimbsRocks/learningmachines](https://github.com/ClimbsRocks/learningmachines)

It's in JavaScript, but if that works for you, clone it down and work through
the instructions until it isn't broken. He created it as part of a lecture at
Hack Reactor called "A Conjurer's Guide to Machine Learning," so it's a great
way to get started without going too deep into the details.

------
ssabev
Get Rodeo - [http://blog.yhathq.com/posts/introducing-
rodeo.html](http://blog.yhathq.com/posts/introducing-rodeo.html) And start
playing around with the data!

------
waitingkuo
The Udacity one (Intro to Machine Learning) is very practical in my
mind. It only describes some intuition and then shows you how to call scikit-
learn's functions to do some simple machine learning work.

~~~
mrborgen
Just realised that I mixed the courses up. The one I did was simply called
"Machine Learning" and not "Intro to Machine Learning". Anyway, thanks for the
tip, I'll check it out!

------
sidmitra
This past week, since I'm between projects, I looked at the data science
tutorials from PyCon 2015.

Especially this: [https://www.youtube.com/watch?v=L7R4HUQ-
eQ0](https://www.youtube.com/watch?v=L7R4HUQ-eQ0)

Here's the ipython notebook

[https://github.com/jakevdp/sklearn_pycon2015](https://github.com/jakevdp/sklearn_pycon2015)

It helped me make my first Kaggle submission, although I'm ranked 1800 out of
1900 on the restaurant review one. But I'm sure with time I'll figure it out.

------
kadder
You should practice on some toy problems. Get some KDD / Kaggle datasets and
try to work on them. The advantage of these two sites is that they already
have solutions published, so you can always refer to a solution for help.
Remember there is no right or wrong answer, just a more accurate one.

Try applying it to some of the problems you want to solve. Mostly, be
patient. Unlike conventional programming, machine learning is
non-deterministic, and it can take some time to become a little comfortable.

------
huac
You're probably best with something very applied (duh), but I want to give a
plug for ISLR: [http://www-bcf.usc.edu/~gareth/ISL/](http://www-
bcf.usc.edu/~gareth/ISL/). This book is really useful for understanding the
statistical underpinnings of most ML things while being approachable enough
for someone who doesn't care (that much) about the math.

------
dome82
Here you can find some interesting podcast about ML and Data Science:
[http://goo.gl/KF4NGE](http://goo.gl/KF4NGE)

Enjoy :)

~~~
mrborgen
Thanks, I've been listening to The Talking Machines for a while. It's really
good!

[http://www.thetalkingmachines.com/](http://www.thetalkingmachines.com/)

------
samirparikh
Thanks for making this post. I am in the EXACT same position as you. Just
finished 65% of the ML course on coursera and was wondering how to dive in
deeper.

Question for the audience: Will self-learning be enough to get me considered
for a job in this area? I work on stuff that's completely unrelated right now
(MSEE in circuits).

------
kdoherty
Give it more than a week. I didn't realize the significance and implications
of lots of really cool ideas in statistical learning for months after I
started. It probably won't click right away, but check out scikit-learn for
Python if you want a good way to dive into data with great resources.

~~~
mrborgen
I'll definitely continue on after next week. It's just meant as a kick start.
This tutorial series seems like a good way to start with scikit-learn btw:

[https://www.youtube.com/watch?v=URTZ2jKCgBc&list=PLQVvvaa0Qu...](https://www.youtube.com/watch?v=URTZ2jKCgBc&list=PLQVvvaa0QuDd0flgGphKCej-9jp-
QdzZ3)

------
parkaboy
Andrew Ng's group's Unsupervised Feature Learning and Deep Learning wiki-
tutorial pairs nicely with the Coursera course.

[http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial](http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial)

------
louden
If you have time, in the long term I would recommend looking at the theory
behind the methods. It will give you a lot of insight into why a particular
method is used and when it is inappropriate to use one.

------
barterbox
While not exactly an answer to the OP's question --
[http://www.datatau.com/](http://www.datatau.com/) is a decent data science
aggregator.

------
compbio
Day 1: Spend some time setting your machine up for doing machine learning. For
Python look at Numpy, IPython Notebook, Scipy, Pandas, Scikit-Learn,
Matplotlib, Seaborn, NLTK, XGBoost wrapper, Vowpal Wabbit wrapper, Theano +
Nolearn.

Day 2: Learn how to manipulate Numpy arrays (
[http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)
) and how to read and manipulate data with Pandas (
[https://www.youtube.com/watch?v=p8hle-ni-
DM](https://www.youtube.com/watch?v=p8hle-ni-DM) ).

Day 3: Do the Kaggle Titanic survival prediction challenge with Random Forests. (
[https://www.kaggle.com/c/titanic/details/getting-started-
wit...](https://www.kaggle.com/c/titanic/details/getting-started-with-python)
)

Day 4: Study Scikit-learn documentation ( [http://scikit-
learn.org/stable/documentation.html](http://scikit-
learn.org/stable/documentation.html) ). Run a few examples. Change
RandomForestClassifier into SGDClassifier and play with the results. Scale the
data to make it perform better. Combine a RF model and a SGD model through
averaging and try to improve the benchmark score.

Day 5: Study the ensemble module of Scikit-learn. Try the examples on the wiki of
XGBoost (
[https://github.com/dmlc/xgboost/tree/master/demo/binary_clas...](https://github.com/dmlc/xgboost/tree/master/demo/binary_classification)
) and Vowpal Wabbit ( [http://zinkov.com/posts/2013-08-13-vowpal-
tutorial/](http://zinkov.com/posts/2013-08-13-vowpal-tutorial/) ). Practically
you want to get to a stage of: Getting the data transformed to be accepted by
the algo, a form of evaluation, and then getting the predictions back out in a
sensible form.

Then next week start competing on Kaggle and form a team to join up with
people at your level. You will learn a lot that way and start to open up the
black box.

I found these series very accessible:
[http://blog.kaggle.com/2015/04/22/scikit-learn-
video-3-machi...](http://blog.kaggle.com/2015/04/22/scikit-learn-
video-3-machine-learning-first-steps-with-the-iris-dataset/)

Kaggle also recently released a feature to run machine learning scripts in
your browser. You could check those out and check out Python, R, common
pipelines and even the more advanced neural nets:
[https://www.kaggle.com/users/9028/danb/digit-
recognizer/big-...](https://www.kaggle.com/users/9028/danb/digit-
recognizer/big-ish-neural-network-in-python) .

~~~
mrborgen
Awesome, this looks like a great plan! Quite a lot of setup, so I'll start
during the weekend :)

------
yavramen
Try out checkio.org - quite a lot of practical machine learning missions in
Python

