Hacker News
Ask HN: How does a beginner best spend one week of learning machine learning?
107 points by mrborgen on April 24, 2015 | 35 comments
I'll spend all my time next week studying machine learning. However, I know that it's easy to get lost in overly theoretical material when learning ML.

So I'm interested in some advice on how to best spend my time, given that I'd like to get as much practical knowledge as possible.

Here is the level I'm at now:

I'm currently halfway through Andrew Ng's ML course on Coursera, and will probably finish it within the week. I love the mix of theory and practice this course is built around.

I've also done the Udacity - Intro to Machine Learning, but found it too theoretical.

I kind of understand the basic principles of linear & logistic regression, cost functions, gradient descent, and the normal equation.

By the end of the week, I hope to be able to do linear regression using gradient descent on an actual dataset. If so, the week has been very well spent!

My preferred language is Python.

All tips and suggestions are highly appreciated :)

I think you're in a position to learn a lot in a week. My advice would be to create a simple data set and go through what you learned in Andrew Ng's course with scikit-learn (even better, since you already prefer Python).


Much of what you learn in Ng's course is how to implement these algorithms - there's less (no?) emphasis on using existing libraries in R or Python. I think that implementing your own code base for logistic regression, neural net, random forest, and so forth is an extremely valuable exercise, but I'd recommend you put that aside just for the moment. Instead, try using some of the existing libraries.

For instance, use scikit-learn's logistic regression, neural net, and random forest (not covered in Ng's class) libraries to do a classification. You don't want to use these with no understanding of how they work, but you've done the Coursera course, so see if you can use your knowledge of how these algorithms differ to create datasets that will highlight the benefits of each approach (i.e., can you create a dataset that works great for logistic regression but poorly for neural nets or random forests?). Think about how you'd use an unsupervised approach to classification, and run it through k-means. I really think that applying different techniques to the same dataset, on a high level, combined with general knowledge of what's going on under the hood, can be a great way to understand how/when to use these algorithms.
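As a sketch of that comparison (assuming scikit-learn is installed; `make_moons` is just one convenient way to build a dataset that a linear model handles poorly):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# "Two moons": not linearly separable, so it should favour the
# random forest over plain logistic regression.
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for model in (LogisticRegression(), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)
    print(type(model).__name__, scores[type(model).__name__])
```

Swapping in other datasets (linearly separable blobs, noisy high-dimensional data) and re-running the same loop is a quick way to see each algorithm's strengths.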

Thanks a lot for all the advice! Using a library at first seems like a good idea, and then going deeper into the material as I learn the practicalities, rather than doing it the other way around.

Honestly, with only a week you can't really learn a lot, so I'd recommend doing something practical. See the canonical answer here:


Go to Kaggle and do one of the competitions.


and reference the wiki for help:


In my opinion, the only way to learn machine learning without a strong foundation is to page knowledge in as you need it. That means you'll need to pick a practical task, jump into it, and find out what you don't know.

If you've never done anything practical, great: you'll need to learn about the basic data structures of machine learning.

Find out what a data frame is, what time series analysis is, and what data structures you need to do it.
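As a minimal illustration of both ideas, here's a toy pandas data frame turned into a time series (the numbers below are made up for illustration):

```python
import pandas as pd

# A data frame is a labelled 2-D table: named columns of possibly
# different types, with indexing, filtering, and grouping built in.
df = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
    "sales": [100, 120, 90],
})
df = df.set_index("date")          # a datetime index makes this a time series
weekly = df.resample("W").mean()   # basic time-series aggregation
print(df["sales"].mean())
```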

Are you going to do classification or prediction? Pick one and learn some of the basic tools used for it.

I second using kaggle. They have a lot of 'competitions' that are really just learning exercises with a lot of information about the techniques used directly linked to the competition.

Thanks for the tips!

I'm considering using a dataset the Bank of England has about households in England. As far as I understand, the data can be used for both prediction and classification. It has quite a lot of different features.

Which one would you recommend to start with, in general?

I signed up to HN just to reply. I suggest doing the MNIST dataset. The features are really simple (greyscale pixel values) yet it works very well. You can also play with different classifiers to see how they behave (training time and classification performance).

This will let you set up a whole pipeline, from feature selection (in this case just normalise; you can try 0 to 1, or -1 to 1, or subtract the mean then divide by the stddev, or don't normalise at all and see what happens) to training the model and evaluating its performance with cross validation. Then you can check your CV results by submitting to the leaderboard.

I took Andrew Ng's ML course then played with the MNIST dataset. I learnt heaps by doing this. Then I got carried away competing in real competitions. :) That's where more advanced feature selection came into play as well as making sure your CV split is representative of the test split.

I was using scikit-learn and just swapping classifiers in and out trying different ones as well as trying different parameters. You can even roll your own logistic regression if you want and see how regularisation affects performance etc.
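A sketch of that pipeline, using scikit-learn's small built-in digits set as a stand-in for the full MNIST data:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)   # 8x8 greyscale digits, pixel values 0-16

# Normalise (subtract mean, divide by stddev), then a linear classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross validation gives you the evaluation step for free.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Swapping the `LogisticRegression` step for another classifier is a one-line change, which is exactly the "swapping classifiers in and out" workflow described above.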

Awesome, thanks, I'll definitely test out that dataset! Btw, have you proceeded with any MOOCs after Andrew Ng's ML course? Any you'd like to recommend?

Sorry, I didn't take any more courses after that. However, I did continue to read every ML link that appeared on the HN front page.

Sometimes I'll go read the original paper to learn specific things, such as DropConnect for NNs (I saw a slight improvement on MNIST with DropConnect but not on my other datasets). Same thing for domain knowledge for feature selection: just read relevant papers. Often I have to cross-reference certain bits with other papers and reread them 10 times before I understand the maths.

It's still my goal to understand convolutional neural nets (and other topics like Bayesian statistics) but I've since stopped on ML and have been learning audio DSP instead. :) Too many things to learn...

I wrote a machine learning cheat sheet once, as a way to learn when I was in your seat: http://eferm.com/machine-learning-cheat-sheet/

Perhaps the cheat sheet in itself is useful for you, but mostly I'd recommend the process of assembling information in a concise way as a good way to learn.

Very cool. Thx for making this.

The best way to learn is to try to apply techniques on problems that are interesting to you.

Random Forest is a very powerful technique these days that's usually pretty good as a first-pass. Using it with permutation importance usually helps you identify important variables.

I cover several other machine learning "getting started" recommendations at http://stackoverflow.com/a/598772/1869
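As a sketch of the permutation-importance idea mentioned above (note: recent scikit-learn ships this as `sklearn.inspection.permutation_importance`, which didn't exist at the time of this thread; the dataset here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic dataset: 3 informative features, 7 pure-noise features.
# shuffle=False keeps the informative features in the first columns.
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each column in turn and measure how much the score drops:
# the bigger the drop, the more the model relied on that variable.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```

The first three entries of `importances_mean` should dominate, flagging the informative variables.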

I think you can achieve a lot in a week. The big four Python data libraries are pandas, NumPy, SciPy, and scikit-learn. scikit-learn will give you the most mileage in knowledge gained relative to the time invested. If you aren't concerned with how effective the results are, you can learn a lot from simple implementations. For example, a basic Random Forest is three lines of code (note that fit takes the labels as well as the features):

  # create random forest
  forest = RandomForestClassifier()
  # train random forest
  forest = forest.fit(train_data, train_labels)
  # test random forest
  output = forest.predict(test_data)

That's just one example, but all implementations are reasonably easy for someone with your foundation to learn quickly. Now... being good at it... that will be your next challenge :-)
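A runnable version of that three-line pattern might look like this (using the iris dataset as a stand-in; the variable names mirror the sketch above):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
train_data, test_data, train_labels, test_labels = train_test_split(
    X, y, random_state=0)

forest = RandomForestClassifier(random_state=0)   # create random forest
forest.fit(train_data, train_labels)              # train: features AND labels
output = forest.predict(test_data)                # test random forest
print((output == test_labels).mean())             # held-out accuracy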

I think my answer is a bit different. The interesting/hard part of ML is the algorithm behind it all. It is important to get a really good sense of exactly what the technique is doing. So I would not use Python, or any computer language, if you are a complete beginner to a technique. I would just work it out with a toy example and pen and paper. Make your own little decision tree, or work out a Bayesian probability for a given set. The "problem" with a library like, say, scikit-learn is that it does the "work" for you (sorry about the heavy use of scare quotes), but you may not know what it is doing well enough to analyze the output.

My two cents.

I'd definitely agree on not starting with scikit-learn, but using Python isn't so bad. Implement a few things manually -- decision trees and Naive Bayes really aren't that hard to do, and neither's clustering.
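To show how small a manual implementation can be, here's a toy Bernoulli-style Naive Bayes over word sets (a from-scratch sketch, not scikit-learn's implementation; the documents below are made up):

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (set_of_words, label). Returns class counts and per-class word counts."""
    labels = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)
    return labels, word_counts

def predict(labels, word_counts, words):
    best, best_score = None, -math.inf
    total = sum(labels.values())
    for label, n in labels.items():
        score = math.log(n / total)  # log prior
        for w in words:
            # Laplace smoothing so unseen words don't zero out a class.
            score += math.log((word_counts[label][w] + 1) / (n + 2))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [({"free", "win"}, "spam"), ({"free", "cash"}, "spam"),
        ({"meeting", "notes"}, "ham"), ({"lunch", "notes"}, "ham")]
labels, counts = train(docs)
print(predict(labels, counts, {"free", "cash"}))   # → spam
```

Working the same numbers out by hand first, as the comment above suggests, makes the code almost write itself.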

I quite like this site for an accessible, well-scoped intro with some good example code: http://guidetodatamining.com/

I got a lot of mileage out of this book: http://www.amazon.com/Programming-Collective-Intelligence-Bu... All his examples are in Python, so if you already know Python it should work well for you. The only minor thing is that there are a few typos here and there in the code, but usually you can just use your common sense and figure out what the author intended.

This would probably be a good tutorial for you to give a whirl. It's applied, it eases you into the more complex material, and it's Python :)

Framework: http://deeplearning.net/software/theano/ Tutorial: http://deeplearning.net/tutorial/gettingstarted.html#getting...

Here's a great repo from my friend Preston Parry (https://news.ycombinator.com/user?id=ClimbsRocks): https://github.com/ClimbsRocks/learningmachines

It's in JavaScript, but if that works for you, clone it down and work through the instructions until it isn't broken. He created it as part of a lecture at Hack Reactor called "A Conjurer's Guide to Machine Learning," so it's a great way to get started without going too deep into the details.

Get Rodeo - http://blog.yhathq.com/posts/introducing-rodeo.html And start playing around with the data!

The Udacity one (Intro to Machine Learning) is very practical in my mind. It gives you some intuition and then shows you how to call scikit-learn functions to do some simple machine learning work.

Just realised that I mixed the courses up. The one I did was simply called "Machine Learning" and not "Intro to Machine Learning". Anyway, thanks for the tip, I'll check it out!

This past week, since I'm between projects, I looked at the data science tutorials from PyCon 2015.

Especially this: https://www.youtube.com/watch?v=L7R4HUQ-eQ0

Here's the ipython notebook


It helped me make my first Kaggle submission, although I'm ranked 1800 out of 1900 on the restaurant review one. But I'm sure with time I'll figure it out.

You should practice on some toy problems. Get some KDD / Kaggle datasets and try to work on them. The advantage of these two sites is that they already have solutions published, so you can always refer to a solution for help. Remember, there is no right or wrong answer, just a more accurate answer.

Try applying it to some of the problems you want to solve. Mostly, be patient. Unlike conventional programming, machine learning is non-deterministic, and it can take some time to become a little comfortable.

You're probably best with something very applied (duh), but I want to give a plug for ISLR: http://www-bcf.usc.edu/~gareth/ISL/. This book is really useful for understanding the statistical underpinnings of most ML topics while being approachable enough for someone who doesn't care (that much) about the math.

Here you can find some interesting podcasts about ML and Data Science: http://goo.gl/KF4NGE

Enjoy :)

Thanks, I've been listening to The Talking Machines for a while. It's really good!


Thanks for making this post. I am in the EXACT same position as you. Just finished 65% of the ML course on coursera and was wondering how to dive in deeper.

Question for the audience: Will self-learning be enough to get me considered for a job in this area? I work on stuff that's completely unrelated right now (MSEE in circuits).

Give it more than a week. I didn't realize the significance and implications of lots of really cool ideas in statistical learning for months after I started. It probably won't click right away, but check out scikit-learn for Python if you want a good way to dive into data with great resources.

I'll definitely continue on after next week. It's just meant as a kick start. This tutorial series seems like a good way to start with scikit-learn btw:


Andrew Ng's group's Unsupervised Feature Learning and Deep Learning wiki-tutorial pairs nicely with the Coursera course.


If you have time, in the long term I would recommend looking at the theory behind the methods. It will give you a lot of insight on why a person is using a particular method and when it is inappropriate to use a certain method.

While not exactly an answer to the OP's question -- http://www.datatau.com/ is a decent data science aggregator.

Day 1: Spend some time setting your machine up for doing machine learning. For Python, look at NumPy, IPython Notebook, SciPy, Pandas, scikit-learn, Matplotlib, Seaborn, NLTK, the XGBoost wrapper, the Vowpal Wabbit wrapper, and Theano + Nolearn.

2: Learn how to manipulate Numpy arrays ( http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf ) and how to read and manipulate data with Pandas ( https://www.youtube.com/watch?v=p8hle-ni-DM ).

3: Do the Kaggle Titanic survival prediction challenge with Random Forests. ( https://www.kaggle.com/c/titanic/details/getting-started-wit... )

4: Study Scikit-learn documentation ( http://scikit-learn.org/stable/documentation.html ). Run a few examples. Change RandomForestClassifier into SGDClassifier and play with the results. Scale the data to make it perform better. Combine a RF model and a SGD model through averaging and try to improve the benchmark score.

5: Study the ensemble module of Scikit-learn. Try the examples on the wiki of XGBoost ( https://github.com/dmlc/xgboost/tree/master/demo/binary_clas... ) and Vowpal Wabbit ( http://zinkov.com/posts/2013-08-13-vowpal-tutorial/ ). Practically you want to get to a stage of: Getting the data transformed to be accepted by the algo, a form of evaluation, and then getting the predictions back out in a sensible form.

Then next week start competing on Kaggle and form a team to join up with people at your level. You will learn a lot that way and start to open up the black box.

I found these series very accessible: http://blog.kaggle.com/2015/04/22/scikit-learn-video-3-machi...

Kaggle also recently released a feature to run machine learning scripts in your browser. You could check those out and check out Python, R, common pipelines and even the more advanced neural nets: https://www.kaggle.com/users/9028/danb/digit-recognizer/big-... .

Awesome, this looks like a great plan! Quite a lot of setup, so I'll start during the weekend :)

Try out checkio.org - quite a lot of practical machine learning missions in Python
