Dive into Machine Learning with Jupyter and Scikit-Learn (github.com/hangtwenty)
204 points by stared on Nov 4, 2015 | 33 comments

I'll add my thoughts on the topic of "how do you know when you're out of the danger zone and can start marketing yourself as a machine learning expert?"

Do competitions on Kaggle (or find them on other sites, but Kaggle is definitely the best place to start). Once you get past the point where you are finishing in the middle of the pack (multiple top 10% or 25% finishes and maybe a prize win), then you are an expert. That is proof that you are separate from the hackers who just throw scikit-learn algorithms at a matrix. The people in the master tier use clever feature engineering and/or code up custom learning algorithms to get themselves above the masses. Looking at a problem and figuring out the correct modeling approach is what the experts do. They don't just create a data frame and run down the list of classification algorithms that they have access to. Read the "No Free Hunch" reports on how the winners did it and you'll quickly see the difference between yourself and the experts.

This is a fantastic contribution, thank you!

I opened an issue ... https://github.com/hangtwenty/dive-into-machine-learning/iss...

I'll wait a bit in case you want to add a note on this in your own words (via PR). Otherwise, tonight or tomorrow I'll paraphrase you or something. Whether you or I make the change I want it in a branch, and then I'll try to get a bit of review for that branch ...

THANKS AGAIN, the guide really needs some insight like this.


The guide's primary recommended course is Andrew Ng's Machine Learning course. The current session started November 2nd, and you must enroll by the 7th. Another session starts November 30th.



It's not time-sensitive at all. After the course has ended, you can still enter the course and do everything; you just won't get the certificate (which doesn't matter).

Good point, people should know that.

What I had in mind is that some people get a lot from the community features on Coursera, which are more active while a class is in session. So that's all I meant.

What exactly is the appeal of the certificate? Does it hold any weight?

Precommitting to finish the course for the certificate is a good motivational hack.

If the author is here, thank you very much for providing this. I wanted to look into Jupyter and machine learning, and this is probably the right way to start. I tried the Udacity course on machine learning (Python, Scikit-Learn), but it is not my way of learning things, since I like to fiddle around instead of going the straight way. If anyone is interested in an alternative, check out the Udacity course: https://www.udacity.com/course/machine-learning-supervised-l...

Author here. Thank you! Glad you find it useful!

Would you say that machine learning is something only for PhDs and very experienced people, or can a dev pick it up and get hired for it?

Like any topic/skill, it can be learnt, but only if you spend significant time and effort on projects, exercises, and asking questions (stackexchange, etc.). It's very important to pay attention to fundamentals and to think from scratch rather than master a laundry list of tips/tricks, because fundamental ideas can be composed in different ways and adapted to a new situation. The fundamentals here would be probability, statistics, linear algebra, and optimisation.

I started my career in machine learning with absolutely no knowledge of it. It is definitely something that you can learn on the job. You do need a background in linear algebra and statistics to understand the theory behind different algorithms, which will help you decide what algorithm to choose (SVM vs. random forest, for example).

As suggested in the other comment, the best place to start is probably by working on projects with open data sets. Try experimenting with different algorithms and feature engineering techniques. This is especially important because there are plenty of algorithms, and identifying which algorithm works for which kind of data set is useful.
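
A minimal sketch of that kind of experiment, assuming scikit-learn is installed. The bundled iris data stands in for an open data set, and the two candidate models, their default hyperparameters, and the helper name compare_models are illustrative choices rather than anything prescribed in this thread.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_models(X, y, cv=5):
    """Return the mean cross-validated accuracy for each candidate model.

    compare_models is a hypothetical helper written for this example.
    """
    candidates = {
        # SVMs are sensitive to feature scale, so standardize inside the pipeline.
        "svm": make_pipeline(StandardScaler(), SVC()),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv).mean()
        for name, model in candidates.items()
    }


# Small built-in data set, used here only as a placeholder for real data.
X, y = load_iris(return_X_y=True)
scores = compare_models(X, y)
for name, score in sorted(scores.items()):
    print(f"{name}: {score:.3f}")
```

Cross-validation keeps the comparison honest on a small data set. On real data the ranking can easily flip, which is the point of trying more than one algorithm.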

I started with no ML experience per se, but I already had a background in linear algebra, differential equation modeling and stochastic processes.

For those interested in picking up machine learning who already know R, I recommend this book:


It's loaded with useful R snippets and practical examples.

Could someone with industry/academic experience in ML comment on the quality and reliability of the resources in the repo?

It's a nice list of resources for starting. The general tools he mentions are both easy to start with and used in practice; also, I like the overview part.

But most importantly, it's not a dump of all possible links that turns into a daunting list "I will never go through".

Source: I run workshops introducing ML and Big Data (http://workshops.deepsense.io/, next one in London) and I made a lot of choices converging with this one (Python + scikit-learn, everything in Jupyter Notebook, etc). Also, a lot of the links there are already in my Delicious list of things I send to friends wanting to jump into data science (and many of them were already on the HN main page).

BTW: See also discussion on the same post on DataTau: http://www.datatau.com/item?id=10093

The quality of Scikit-Learn? It's not bleeding-edge but it's very well tested and documented. Quite good quality.

No one gets fired for using Scikit.

I'm curious if you can speak more to this, or share any resources about it. It seems clear that scikit-learn is a good fit for this kind of hacking-learning. If there's a way I can throw in a sentence (with link to more detail), giving context about where it sits in the eyes of experts ... Would be nice.

What is there to be worried about? scikit-learn is a solid, tested implementation of most machine learning algorithms. If you're doing work in Python and want to run your data through a standard ML algorithm, and the algo is implemented by scikit-learn, then just use scikit-learn. If it isn't implemented by scikit-learn, you find some other implementation or implement it yourself.

Experts use all sorts of things: MATLAB, R, Python (with scikit-learn), etc.

What you're saying -- actually every sentence of your comment -- was my existing impression.

tdaltonc said "No one gets fired for using Scikit." Maybe I read too much into this comment, but it seemed to have a negative tone. So I got the impression that tdaltonc might have more to say about it. Maybe not though!

I'd love one such list about AI in general and other sub-fields like NLP/Computational Linguistics as well. I've recently started the Berkeley AI course on EdX along with Russell & Norvig's standard textbook. :)


This is a really good list of resources on Machine learning and has a section dedicated to NLP/Text mining.

Thank you, now I've added a link to this in the appendix about finding libraries.

How does an academic introduction to and study of machine learning compare to a self-taught one? I know it's a shallow question, but there has to be some sort of line where the difference opens and closes opportunities.

Another way to look at it would be that the difference opens opportunities to improve the self-taught path.

Are there any quality ML courses (of Norvig or Ng quality) that use Python or Java?

I would love to learn ML concepts, but I really don't have the cognitive bandwidth to learn a new language that I most likely will never use in my day job (Python, Ruby, Java).

When I last looked, most of the top quality courses used some variant of proprietary tools or MATLAB, but production code is in Python or Java (with R sometimes).

Learning a new language is very easy compared to learning all sorts of other stuff, including ML.

I agree - the problem is that there are some problems at my work that I can probably solve by applying some concepts of ML. But I don't think I can do that through MATLAB.

I have been having a bad day on HN, so before I get misconstrued - there is nothing wrong with MATLAB. I was just hoping for a go-to-production language like Java or Python for learning ML.

Is there a reason you can't read data into MATLAB (by reading an exported csv/other format file, or querying a database)?

That being said, you can do ML completely in Python.

1. I don't have MATLAB, and I don't want to buy it. 2. When you go into production (say, predicting top customers for an ecommerce site), you are not going to run MATLAB on the server.

Yes - Python/pandas/scikit is pretty popular for writing production ML code. The question really is: are there any good courses? Most of the top courses I see use some variant of MATLAB to teach.

Good point. I don't think it's too much trouble to translate MATLAB to the equivalent NumPy/SciPy/Python library calls.
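
As a rough illustration of what that translation tends to look like, here is a small sketch assuming NumPy is installed. The matrices are made-up values, and the MATLAB lines in the comments are the usual equivalents rather than code from any particular course.

```python
import numpy as np

# MATLAB: A = [1 2; 3 4];
A = np.array([[1.0, 2.0], [3.0, 4.0]])

# MATLAB: b = [5; 6];
b = np.array([5.0, 6.0])

# MATLAB: x = A \ b;   (solve the linear system A x = b)
x = np.linalg.solve(A, b)

# MATLAB: C = A * A;   (matrix product)
C = A @ A

# MATLAB: D = A .* A;  (element-wise product)
D = A * A

# MATLAB: A(1, 2)      (1-based indexing) becomes A[0, 1] (0-based indexing)
top_right = A[0, 1]

print(x)
print(C)
print(D)
print(top_right)
```

The main things to watch for are 0-based indexing and the fact that `*` is element-wise in NumPy, with `@` used for the matrix product.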

Thank you for this! I've started digging into ML and this looks absolutely awesome.
