Four years ago I took a class based on that paper where we implemented all ten algorithms (every participant implemented every algorithm). It was a very instructive and somewhat painful experience. The course website is still online if anyone is interested; the presentations of the algorithms and the Matlab/Python code stubs for each algorithm might be useful.
Missing: Logistic Regression. So old school (from the 1930s), so mathematically rigorous (compared to neural networks, for example), so powerful, and so often overlooked.
In the textbook on multivariate stats by Izenman (http://www.amazon.com/gp/product/0387781889/), he claims that stats & ML progressed in parallel. So traditional stats techniques like OLS, multiple regression, nonlinear regression, logistic regression, and GLMs are generally not covered in ML. Similarly, ML topics like k-means, SVMs, random forests, etc. are not taught by the stats department.
What has happened this past decade is a convergence of stats & ML, primarily driven by data scientists working in the domain of big data. The stats folks are slowly incorporating ML techniques into stats and finding rigorous heuristics for when they should be employed. Similarly, the ML guys, who are mostly CS folks who unfortunately have taken only one undergraduate course in stats & probability, are discovering that you can do so much more, without resorting to needless large-scale computation, by sampling intelligently and leveraging plain old statistics.
This schism between stats & ML can be leveraged very profitably during interviews :))
When I interview data science folks, I usually ask very simple questions from plain stats: how would you tell if a distribution is skewed? If you have an rv X with mean mu, and say rv Y = X - mu, then what is the mean of Y? If you have an rv A with mean 0 and variance 1, then what are the chances of being 3 standard deviations away from the mean if you have no clue about the distribution of A? What if you knew A was unimodal? What if A is now normally distributed?
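For the record, a quick sketch of the bounds behind that last question (Chebyshev for an arbitrary distribution, Vysochanskij-Petunin for a unimodal one, and the exact Gaussian tail); the numbers are my own illustration, not part of the original question:

    import math

    k = 3.0  # number of standard deviations from the mean

    # No assumptions at all: Chebyshev's inequality, P(|A| >= k) <= 1/k^2
    chebyshev_bound = 1.0 / k**2                 # ~11.1%

    # Unimodal: Vysochanskij-Petunin inequality, P(|A| >= k) <= 4/(9 k^2), valid for k > sqrt(8/3)
    vp_bound = 4.0 / (9.0 * k**2)                # ~4.9%

    # Standard normal: exact two-sided tail, P(|A| >= k) = erfc(k / sqrt(2))
    normal_tail = math.erfc(k / math.sqrt(2.0))  # ~0.27%

    print(chebyshev_bound, vp_bound, normal_tail)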
Now if it's a stats guy, I ask very simple ML: what is a perceptron, have you heard of a neural network, etc.
Surprisingly, the stats guys do much better on ML than the ML guys do on stats!
Not surprising, really. ML is the shiny new thing, so the MLers don't tend to feel they missed anything while the statisticians need to keep up with the times.
I say this as an MLer still struggling to find out what R^2 is, among other things ..
It's just a stupid fraction.
Say you have a dataset, i.e. a sequence of (x, y) tuples. In OLS, you try to fit a line to the dataset. Now your manager wants to know how well the line fits your dataset. If it does a bang-up job, you say 100%, aka an R^2 of 1. If it does a shoddy job, you say 0%, aka an R^2 of 0. Hopefully your R^2 is much closer to 1 than to 0.
Respectfully, it's not a stupid fraction. It is a fundamental quantity arising from the linear algebraic interpretation of correlation.
Correlation induces an inner product on the set of zero-mean random variables. The regression coefficient is precisely the projection coefficient <x,y>/<x,x>, and R^2 is precisely the Cauchy-Schwarz ratio <x,y>^2 / (<x,x><y,y>) (i.e. the product of the two projection coefficients between x and y).
It is a theoretically natural measure of linear quality-of-fit. It has the added bonus of being equal to the ratio of modeled variance to total variance (variance being the square-norm of a random variable in the norm induced by the correlation inner product).
It's also very very cheap to compute. Though there are more practically useful measures of "predictive power", like mutual information, R^2 does an admirable job for an O(1)-space and O(num data)-time predictiveness metric.
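To make the "cheap to compute" point concrete, here is a minimal one-pass sketch of my own (for a simple y-on-x OLS fit, where R^2 is just the squared Pearson correlation):

    def r_squared(pairs):
        """One-pass R^2 for a simple linear fit of y on x.

        Uses the identity R^2 = <x,y>^2 / (<x,x><y,y>) on centered data,
        accumulated with running sums: O(1) space, O(n) time.
        """
        n = sx = sy = sxx = syy = sxy = 0.0
        for x, y in pairs:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
        cov_xy = n * sxy - sx * sy   # n^2 * Cov(x, y)
        var_x = n * sxx - sx * sx    # n^2 * Var(x)
        var_y = n * syy - sy * sy    # n^2 * Var(y)
        return (cov_xy * cov_xy) / (var_x * var_y)

    # e.g. a slightly noisy line should give an R^2 close to 1
    print(r_squared([(i, 2 * i + 1 + 0.1 * (-1) ** i) for i in range(100)]))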
I suspect that I learned more about R^2 by reading these comments in the order presented (informal, then formal) than I would have had they been reversed.
Learning about the basis of logistic regression in Ng's Stanford class was eye-opening. I also liked that he then motivated generalized linear models and why they're nice (e.g., the natural parameter is linear in the input data; the maximum-likelihood hypothesis is the expectation of the sufficient statistic), and that he explains why we see the logistic function in so many damn places (it's the response function whenever y|x is distributed as an exponential family).
It was great how he spent a lot of time on logistic regression before delving into SVMs or neural nets; it was much easier to understand the cost functions & regularization for other types of classifiers after having understood those for logistic regression.
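For anyone who wants the cost function spelled out, here is a bare-bones numpy sketch of binary logistic regression with an L2 penalty trained by gradient descent (my own toy example, not the course's code; the learning rate and regularization strength are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logreg(X, y, lam=0.1, lr=0.1, steps=1000):
        """Minimize the L2-regularized negative log-likelihood with batch gradient descent."""
        n, d = X.shape
        w = np.zeros(d)
        b = 0.0
        for _ in range(steps):
            p = sigmoid(X @ w + b)                # predicted P(y=1 | x)
            grad_w = X.T @ (p - y) / n + lam * w  # gradient of loss plus L2 penalty
            grad_b = np.mean(p - y)
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    # toy data: two Gaussian blobs
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    w, b = train_logreg(X, y)
    print("accuracy:", np.mean((sigmoid(X @ w + b) > 0.5) == y))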
My takeaway: if you can avoid adding risk to your systems by sticking with simpler models, you should.
Neural networks are well understood as additive models. They may not have nice closed form training algorithms, but they are also more expressive than logistic regression.
Conceptually, logistic regression is a simple neural network. A neural network with no hidden layers and a softmax output layer using cross-entropy loss is precisely equivalent to (multinomial) logistic regression. Though if you were to extend your suggestion of logistic regression to include its locally-weighted variants, I'd gladly support your cause.
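A sketch of that equivalence (my own, in numpy): the entire "network" is a single affine map followed by softmax, which is exactly the multinomial logistic regression model.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def forward(X, W, b):
        """'Network' with no hidden layer: one affine map plus softmax,
        i.e. multinomial logistic regression."""
        return softmax(X @ W + b)

    def cross_entropy(P, y):
        """Average negative log-likelihood of the true class labels y."""
        return -np.mean(np.log(P[np.arange(len(y)), y]))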
Oldie paper but still a goodie. Some other algorithms to consider... Random Forests, Gradient Boosted Machines (similar to AdaBoost), Logistic Regression, Neural Networks.
Random Forests are interesting because they are designed to make many, many bad decisions that cancel each other out leaving only the good decisions behind.
To oversimplify: each of the many, many decision trees is going to do a poor job building a classifier. Each tree only uses a subset of the data and a subset of the features, so the results of each tree won't generalize well (and may not even do well classifying the data it was trained on). But for each tree that misclassifies in one direction, another tree will misclassify in the other direction. The theory (which appears to be validated in practice) is that poor decisions cancel each other out but good decisions don't get canceled out.
Although it isn't quite as simple as a "good" tree and a "bad" tree. All the decision trees are bad, but there should randomly be good parts in the bad trees. Each tree is going to be "mostly crap" (as Jeremy Howard puts it). But "mostly crap" means "a little good." The crap in one tree cancels out the crap in another tree. The good parts are left behind... and... like magic... you have a good classifier that is fast, although somewhat of a black box.
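A quick way to see that effect with scikit-learn (my own toy example; the dataset and parameters are arbitrary): the individual randomized trees are each mediocre on held-out data, but the averaged forest does noticeably better.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    # each tree was grown on a bootstrap sample with randomized feature splits...
    tree_scores = [t.score(X_te, y_te) for t in forest.estimators_]
    print("mean single-tree accuracy:", np.mean(tree_scores))

    # ...but averaging their votes cancels out much of the individual noise
    print("forest accuracy:          ", forest.score(X_te, y_te))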
Jeremy Howard from Kaggle is a big believer, and he is where most of my knowledge of decision tree ensemble algorithms comes from: https://www.kaggle.com/wiki/RandomForests
They were one of the first widespread ensemble methods (based on bagged decision trees), and share with other bagging/boosting methods fairly good empirical performance on a wide range of problems.
You will find no understanding in the grown trees-- by definition, they're each seeing a different angle of the data, and each node a different subset of available features.
You can, however, calculate importance scores for the features used. Breiman's original paper gives a good algorithm for doing this (in short: for each tree, permute the data along some feature for an out-of-bag sample and see how much worse it does).
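scikit-learn ships a held-out-set flavour of that idea as permutation_importance; a minimal sketch of my own (not Breiman's exact out-of-bag procedure, and the toy dataset is arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    # shuffle one feature at a time and measure how much the test score drops
    result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
    for i in result.importances_mean.argsort()[::-1]:
        print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")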
Depends on the number of trees and their depth. Even with 20 trees, with effort, it could be written out on paper as rules, for example. And it helps if most of the trees are depth 3 or something.
But then, some models are pretty much impossible to understand or interpret.
A random forest builds a diversified portfolio of decision trees.
To the extent that the errors of the individual trees are uncorrelated, the random forest keeps the same low expected error of a single tree while reducing the high error variance of a tree. This is the same reason a mutual fund is better than a single stock -- on average they have the same return, but the latter has far more variance.
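A back-of-the-envelope illustration of the variance argument (my own numbers, with idealized uncorrelated "trees"): averaging n unbiased predictors whose errors are uncorrelated with variance sigma^2 leaves the mean error unchanged but shrinks the error variance to sigma^2/n.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma, n_trees, trials = 1.0, 100, 10000

    # each 'tree' makes an unbiased prediction error with variance sigma^2
    errors = rng.normal(0.0, sigma, size=(trials, n_trees))

    single_tree = errors[:, 0]    # error of one tree
    forest = errors.mean(axis=1)  # error of the averaged ensemble

    print("single tree error variance:", single_tree.var())  # ~ sigma^2
    print("forest error variance:     ", forest.var())       # ~ sigma^2 / n_trees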
Visualisations are a very powerful thing for understanding algorithms. For the classical CS undergrad ones (DFS, various shortest path or minimum spanning tree algorithms, and of course all of the sorting algorithms), visualisations are easy enough to find on YouTube, but for less well known algorithms, sometimes you have to draw one yourself.
Another thing I sometimes do is run a few iterations of the algorithm myself on paper.
It's from CMU's course titled "Concepts of Mathematics". The professor, Brendan Sullivan, just wrote the textbook in an easy-to-understand style. Here's a link to the course: https://colormygraph.ddt.cs.cmu.edu/21127-f12/ and if you click through to the AnnotateMyPDF website, you can get a copy of the textbook, practice problems, etc. for a more in-depth understanding.
You could try Knuth's Art of Computer Programming, Volume 1. It's in a formal style but covers some simpler algorithms and it's by an incredible writer. It's painstaking to read but some find it to be nearly a religious experience.
I had the pleasure of studying under Prof Xindong Wu at UVM as an undergraduate and have some slides from, and maybe even a video of a talk he gave to the UVM Computer Science Student Association. If there's interest I'd be happy to track them down.
All the presented algorithms have pros and cons. You have to understand the algorithms (tools) at your disposal and find the best one for the questions you want to ask.
In the K-means case, it is good to know beforehand how many groups (clusters) you are expecting. Otherwise, there are other clustering algorithms (like GDBSCAN) which detect the number of clusters automatically (although you have to provide the expected density). But once again, it all depends on the question you want to answer, or whether you are just cleaning the data a bit.
Know your data, know your algorithms and always question the results.
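A small illustration of that trade-off using scikit-learn (my own toy example; I'm using plain DBSCAN rather than GDBSCAN, and the parameters are arbitrary): k-means needs the cluster count up front, DBSCAN instead needs the density parameters eps and min_samples.

    from sklearn.cluster import KMeans, DBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

    # K-means: you must say how many clusters you expect
    km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # DBSCAN: no cluster count, but you must characterize the expected density
    db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

    print("k-means clusters found:", len(set(km_labels)))
    print("DBSCAN clusters found: ", len(set(db_labels) - {-1}))  # label -1 marks noise points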
The paper enumerates all of the shortcomings it has, so what more did you need? Essentially, it can be highly influenced by where the initial means are chosen and by the existence of outliers. Further, on multidimensional data you have to take care in choosing the distance formula. (There was a great blog post not long ago about using K-means to find the "palette" of an image, where many of these shortcomings can be seen easily.)
That isn't to say the algorithm cannot be used. It just is not bulletproof.
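For the initialization point specifically, a quick way to see it (my own sketch with scikit-learn; dataset and parameters are arbitrary): purely random starts can land in different local optima, which is why k-means++ seeding and multiple restarts (n_init) are the usual defaults.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.5, random_state=0)

    # single random initializations: final inertia (within-cluster SSE) varies from seed to seed
    for seed in range(3):
        km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
        print(f"random init, seed {seed}: inertia = {km.inertia_:.1f}")

    # k-means++ seeding with several restarts is much more stable
    km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)
    print(f"k-means++ best of 10:   inertia = {km.inertia_:.1f}")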
http://www.cis.hut.fi/Opinnot/T-61.6020/2008/