

Machine Learning Cheat Sheet for scikit-learn - ColinWright
http://peekaboo-vision.blogspot.ca/2013/01/machine-learning-cheat-sheet-for-scikit.html

======
dk8996
Nice work. I wonder if there is a way to build the cheat sheet in a
collaborative fashion, kind of like a wiki. One thing I noticed (I am no expert
in ML): I was recently reading about a variant of the k-means algorithm that
lets you quickly find the number of clusters when that number is unknown. This
seems to be missing in your example, where the clustering branch ends with
"tough luck".

~~~
tlarkworthy
" find the number of clusters"

Plot k against prediction accuracy (averaged over a number of runs) and look
for a step; that is the correct setting of k. I don't believe there exists an
algorithm that can detect that step as well as I can.
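
A minimal sketch of that "plot a score against k and look for a step" idea,
assuming scikit-learn's KMeans; it uses inertia (within-cluster sum of squares)
as the score, since accuracy isn't available without labels, and the step shows
up as the usual "elbow" in the curve:

```python
# Hedged sketch: plot k against k-means inertia and look for the elbow.
# KMeans keeps the best of n_init restarts, standing in for averaging runs.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

ks = list(range(1, 11))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("k")
plt.ylabel("within-cluster sum of squares")
plt.show()  # look for the elbow where the curve flattens out
```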

~~~
ogrisel
Prediction accuracy cannot be computed in a purely unsupervised setting: you
don't have labels for the samples.

You can, however, compare the clusters found by k-means (run several times)
against what you get on a randomized version of your dataset:
<http://blog.echen.me/2011/03/19/counting-clusters/>
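
A rough sketch of that comparison, loosely in the spirit of the gap statistic
described in the linked post (the number of reference draws and the selection
rule are simplified here, and all constants are illustrative):

```python
# Simplified gap-statistic-style check: compare k-means dispersion on the real
# data against reference data drawn uniformly over the same bounding box; a
# large gap at some k suggests real cluster structure at that k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.RandomState(0)
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

def dispersion(data, k):
    # within-cluster sum of squares for the best of several k-means runs
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_

gaps = []
for k in range(1, 11):
    real = np.log(dispersion(X, k))
    refs = [np.log(dispersion(rng.uniform(X.min(axis=0), X.max(axis=0),
                                          size=X.shape), k))
            for _ in range(5)]  # a handful of randomized reference datasets
    gaps.append(np.mean(refs) - real)

print("suggested k:", int(np.argmax(gaps)) + 1)
```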

------
tocomment
I'd love to see a cheatsheet like this for computer vision. I don't suppose
anyone knows of one, or can make one?

------
sadkingbilly
Maybe they're using different names, but where are: Linear Regression,
Logistic Regression, Neural Networks, and SVMs?

~~~
rm999
Linear regression and logistic regression are covered by elastic net/lasso,
SGD regression, and the SGD classifiers. Support vector machines are covered
by SVC and SVR. I don't see neural networks on there (just saw this edit: "the
chart is not really comprehensive, as I focused on scikit-learn. Otherwise I
certainly would have included neural networks").

Some notes that may help clarify: regression is trying to predict a continuous
number (e.g. how much a house will cost). Linear regression is a type of
regression. Classification is trying to predict a category (e.g. condo vs
apartment). Logistic regression is usually used for classification over two
categories (even though it confusingly has 'regression' in the name). So
linear regression falls into the regression bubble, logistic regression falls
into the classification bubble, and support vector machines are split across
the two bubbles based on what you want to do with them.
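
A tiny illustration of that split, assuming scikit-learn's LinearRegression and
LogisticRegression and entirely made-up toy numbers:

```python
# Toy illustration: regression predicts a continuous value, classification
# predicts a category, encoded here as 0 (condo) / 1 (apartment).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a price from, say, floor area.
X_reg = np.array([[50.0], [80.0], [120.0]])
y_reg = np.array([150000.0, 240000.0, 360000.0])
print(LinearRegression().fit(X_reg, y_reg).predict([[100.0]]))  # a number

# Classification: predict which of two categories an example belongs to.
X_clf = np.array([[50.0], [80.0], [120.0], [200.0]])
y_clf = np.array([0, 0, 1, 1])
print(LogisticRegression().fit(X_clf, y_clf).predict([[150.0]]))  # 0 or 1
```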

The difference between SGD (stochastic gradient descent) and elastic net/lasso
is how the models are trained. The final model in both cases is used
(evaluated/predicted with) in the same way.
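
A sketch of that point, assuming scikit-learn's ElasticNet and SGDRegressor;
the hyperparameters are illustrative and not chosen to make the two fits
exactly equivalent:

```python
# Two linear regressors trained differently (coordinate descent vs stochastic
# gradient descent) but used identically once fitted.
import numpy as np
from sklearn.linear_model import ElasticNet, SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = np.dot(X, [1.0, 2.0, 0.0, 0.0, -1.0]) + 0.1 * rng.randn(200)

enet = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X, y)
sgd = SGDRegressor(penalty="elasticnet", alpha=0.01, l1_ratio=0.5,
                   random_state=0).fit(X, y)

# Same interface at prediction time, whichever way the model was trained.
print(enet.predict(X[:3]))
print(sgd.predict(X[:3]))
```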

------
tocomment
So where would deep learning come into play in this flow diagram? (Or did I
miss it?)

~~~
ogrisel
This diagram only presents algorithms as currently implemented in scikit-
learn. Deep learning models are not yet implemented in scikit-learn, although
baselines (stacked RBMs and an MLP) will probably make it in the coming months.

------
tocomment
What do you experts think is a good use of genetic programming? Where would
you put it in this diagram?

~~~
Homunculiheaded
Just as a point of clarification: "genetic programming" is not the same as a
"genetic algorithm". Genetic programming is an area of evolutionary
computation in which the AST of a program is created using a genetic
algorithm. So while genetic programming is definitely "machine learning" in
the truest sense, it's not terribly useful for classification and regression
problems (what we usually think of when we talk about ML).

A genetic algorithm, on the other hand, is any algorithm that encodes
solutions (typically parameters of a cost function) as a gene and optimizes
them through an artificial evolutionary process. GAs are incredibly easy to
learn to implement, but it's much harder to figure out when they should be
used.

In theory, GAs can be used anywhere in this diagram where parameters need to
be chosen. However, there are two major drawbacks to GAs:

1\. Evaluating the cost function is a huge bottleneck.

2\. In many cases there is a known, better method of optimization.

So, for example, you might want to use a GA to determine the parameters for
training an SVM; however (given that you have enough data) this runs into
problem 1, since it might take minutes, hours, or days to train a single SVM.
That is too long, because you would have to train 100s or more SVMs for each
iteration of the GA, with at least 100s of iterations.
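
As a concrete toy version of that SVM example, here is a hedged sketch of a
mutation-only GA choosing SVM hyperparameters C and gamma, with cross-validated
accuracy as the fitness; every fitness call trains several SVMs, which is
exactly drawback 1 above:

```python
# Toy sketch (mutation-only GA, no crossover) tuning SVM hyperparameters
# C and gamma via cross-validated accuracy on a small dataset.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X, y = load_iris(return_X_y=True)

def fitness(genome):
    # genome = (log10 of C, log10 of gamma); each call trains cv=3 SVMs
    C, gamma = 10.0 ** genome[0], 10.0 ** genome[1]
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

pop = rng.uniform(-3, 3, size=(20, 2))            # random initial population
for generation in range(10):
    scores = np.array([fitness(g) for g in pop])  # the expensive step
    parents = pop[np.argsort(scores)[-10:]]       # keep the best half
    children = parents[rng.randint(0, 10, size=10)] \
               + rng.normal(0.0, 0.3, size=(10, 2))  # mutate copies of parents
    pop = np.vstack([parents, children])

best = max(pop, key=fitness)
print("C = %.3g, gamma = %.3g" % (10.0 ** best[0], 10.0 ** best[1]))
```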

GAs have also been used to find optimal weights in neural nets, but here you
run into problem 2 (and I believe problem 1 as well), since backprop usually
performs better.

All that said, GAs are an amazing tool when you have a cost function you need
to optimize that 1) can be evaluated very quickly, and 2) is weird, very
nonlinear, or otherwise has no known 'good' solution already.

So, finally, to your question: most heavily studied areas of ML already have a
better optimization algorithm than a GA (hence no GAs on this diagram).
However, in the real world you never know when you may find some very strange,
difficult, and poorly studied optimization problem, and then GAs can be very
useful.

~~~
huherto
Here is my intuition. Hopefully adding to your explanation.

GAs are a way to solve search problems, but there are other methods that are
more efficient and more specific:
<http://www.pearsonhighered.com/samplechapter/0136042597.pdf>

But solving a problem with a GA may sometimes be cooler and more interesting.

~~~
msellout
Genetic algorithms are just randomized hill climbing with the added assumption
that variables "physically" next to each other in your data structure are
correlated.

------
gtani
Some other things that might be useful:

<http://eferm.com/machine-learning-cheat-sheet/>

<http://www.cs.bris.ac.uk/~flach/mlbook/tmp/mlbook-iptr.pdf>

<http://rise.cse.iitm.ac.in/wiki/index.php/Introduction_to_Machine_Learning>

<http://mlg.eng.cam.ac.uk/creed/Notes/ML_Compendium.pdf>

------
suredo
Comment from the author: "2) if you need a flow-chart to know what to do, I
don't think you will be successful in working with structured models ;)"

