Hacker News new | past | comments | ask | show | jobs | submit login
Top ten algorithms in data mining (2007) [pdf] (googlecode.com)
218 points by daoudc on Dec 18, 2012 | hide | past | favorite | 41 comments

Four years ago I took a class based on that paper where we implemented all ten algorithms (every participant every algorithm). It was a very instructive and somewhat painful experience. The course website is still online, if anyone is interested, the presentations of the algorithms and the matlab/python code stub for each algorithm might be useful.


Missing Logistic Regression: so old school (from the 1930s), so mathematically rigorous (compare to neural networks for example), so powerful, and so often overlooked.

In the textbook "multivariate stats" by izenman, (http://www.amazon.com/gp/product/0387781889/ ) , he claims that stats & ML progressed in parallel. So traditional stats techniques like OLS, multiple regression, nonlinear regression, logistic regression, GLMs are generally not covered in ML. Similarly ML topics like k-means, svm, random forests etc. are not taught by the stats dept.

What is happening in this past decade is a convergence of stats & ML, primarily driven by data scientists working in the domain of big data. The stats folks are slowly incorporating ML techniques into stats & finding rigorous heuristics for when they should be employed. Similarly ML guys, who are mostly CS folk who unfortunately have taken only 1 course on undergraduate stats & probability, are discovering you can do so much more without resorting to needless large-scale computation, by sampling intelligently & leveraging plain old statistics.

This schism between stats & ML can be leveraged very profitably during interviews :))

When I interview data science folks, I usually ask very simple questions from plain stats - how would you tell if a distribution is skewed...if you have an rv X with mean mu, and say rv Y = X-mu, then what is the mean of Y...if you have an rv A with mean 0 variance 1, then what are the chances of being 3 standard deviations away from the mean if you have no clue about the distribution of A ? What if you knew A was unimodal ? What if A is now normally distributed ?

Now if its a stats guy, I ask very simple ML....what is perceptron, have you heard of an neural network etc.

surprisingly, the stats guys do much better on ML than the ML guys on stats!

Not surprising, really. ML is the shiny new thing, so the MLers don't tend to feel they missed anything while the statisticians need to keep up with the times.

I say this as an MLer still struggling to find out what R^2 is, among other things ..

Its just a stupid fraction. say you have a dataset ie. sequence of (x,y) tuples. In OLS, you try to fit a line onto the dataset. So your manager wants to know how well the line fit your dataset. If it does a bang-up job, you say 100% aka rsquare of 1. If it does a shoddy job, you say 0% aka rsquare of 0. Hopefully your rsq is much closer to the 1 than to the 0.

Here I just coded up a 10-liner for you: https://gist.github.com/4333595

Respectfully, it's not a stupid fraction. It is a fundamental quantity arising from the linear algebraic interpretation of correlation.

Correlation induces an inner product on the set of zero-mean random variables. The regression coefficient is precisely the projection coefficient <x,y>/<y,y> and R^2 is precisely the Cauchy-Schwarz ratio <x,y>^2 / <x,x><y,y> (i.e. the product of the two projection coefficients between x and y).

It is a theoretically natural measure of linear quality-of-fit. It has the added bonus of being equal to the ratio of modeled variance to total variance (variance being the square-norm of a random variable in the norm induced by the correlation inner product).

It's also very very cheap to compute. Though there are more practically useful measures of "predictive power", like mutual information, R^2 does an admirable job for an O(1)-space and O(num data)-time predictiveness metric.

> Respectfully, ...

I suspect that I learned more about R^2 by reading these comments in the order presented (informal, then formal) than I would have had they been reversed.

Learning about the basis of logistic regression in Ng's Stanford class was eye-opening. Also like that he then motivated generalized linear models, and why they're nice (e.g., parameters are linear with input data; the maximum-likelihood hypothesis is the expectation of the sufficient statistic) and he explains why we see the logistic function in so many damn places (it's the response function whenever y|x is distributed as an exponential family).

100% agree.

It was great how he spent a lot of time on logistic regression before delving into SVM's or Neural Nets - it was much easier to understand the cost functions & regularization for other types of classifiers after having understood those for logistic regression.

My takeaway: if you can avoid adding risk to your systems by using more complicated models, you should.

Neural networks are well understood as additive models. They may not have nice closed form training algorithms, but they are also more expressive than logistic regression.

Conceptually, logistic regression is a simple neural network. A neural network with no hidden layers and a softmax output layer using cross-entropy loss is precisely equivalent to logistic regression. Though, if you were to extend your suggestion of logistic regression to include its locally-weighted variants, I'd gladly support your cause.

Can you explain more about why you like it?

In the paper they mention logistic regression as an extension of the naive Bayes model (see page 26).

"Another extension to the naive Bayes model was developed entirely independently of it. This is the logistic regression model."

Oldie paper but still a goodie. Some other algorithms to consider... Random Forests, Gradient Boosted Machines (similar to AdaBoost), Logistic Regression, Neurel Networks.

What's so good about random forests? I seem to be hearing it mentioned a lot lately.

Random Forests are interesting because they are designed to make many, many bad decisions that cancel each other out leaving only the good decisions behind.

To oversimplify, each of the many, many decision trees is going to do a poor job building a classifier. Each tree only uses a subset of the data and a subset of the features. So, the results of each tree won't generalize well (and may not even do well classifying the data it was trained on). But, for each tree that misclassifies in one direction, another tree will misclassify in the other direction. The theory (which appears to be validated in practice) is that poor decisions cancel each other out but good decisions don't get canceled out.

Although, it isn't quite a simple as a "good" tree and a "bad" tree. All decision trees are bad but there should randomly be good parts in the bad trees. Each tree is going to be "mostly crap" (as Jeremy Howard puts it). But "mostly crap" means "a little good." The crap in one tree cancels out the crap in another tree. The good parts are left behind... and... like magic... you have a good classifier that is fast although somewhat of a black box.

Jeremy Howard from Kaggle is a big believer and where most of my knowledge of Decision Tree Ensemble algorithms come from: https://www.kaggle.com/wiki/RandomForests

They were one of the first widespread ensemble methods (based on bagged decision trees), and share with other bagging/boosting methods fairly good empirical performance on a wide range of problems.

Are they easy to interpret? Is it ok if I have a lot of columns in my data?

You will find no understanding in the grown trees-- by definition, they're each seeing a different angle of the data, and each node a different subset of available features.

You can, however, calculate importance scores for the features used. Brennan's original paper gives a good algorithm for doing this (in short: for each tree, permuting the data along some feature for an out of bag sample and seeing how much worse it does.)

(Pardon me/blame autocorrect: Breiman.)

Depend on the number of trees, and their depth. Even with 20 trees, with effort, it could be written on paper as rules, for example. And it helps if most of the trees are depth 3 or something.

But then, some models are pretty much impossible to understand or interpret.

They handle large-dimensional spaces fairly well, which is one of their strengths. They are not all that interpretable, though.

Crudely: combine the speed of classification trees with the generally low error rates (flexibility) of the SVM.

There is a good implementation in Mahout and it scales well on Hadoop, so its a pretty good out of the box whole data set learning algorithm.

No advantage over taking a decent statistically valid subset and running C4.5 (or a variant with floating point splits) apart from it's less work.

A random forest builds a diversified portfolio of decision trees.

To the extent that the errors of the individual trees are uncorrelated, the random forest keeps the same low expected error of a single tree while reducing the high error variance of a tree. This is the same reason a mutual fund is better than a single stock -- on average they have the same return, but the latter has far more variance.

As a self-taught programmer I always had a hard time reading descriptive algorithms. Any suggestions on how to learn reading it ?

Visualisations are very powerful thing for understanding algorithms. For the classical CS undergrad ones (DFS, various shortest path or minimal spanning tree ones, and of course all of the sorting), visualisations are easy enough to find on youtube, but with less well known algorithms, sometimes you have to draw one yourself.

Another thing I sometimes do is run a few iterations of the algorithm myself on paper.

Here is a concise PDF of definitions and notation https://docs.google.com/open?id=0B_WU5GXujPOVRDFYRUlKcFF4RWs

It's from CMU's course titled "Concepts of Math" The professor, Brendan Sullivan, just wrote the textbook in an easy to understand style.Here's a link to the course: https://colormygraph.ddt.cs.cmu.edu/21127-f12/ and if you click the AnnotateMyPDF website, you can get a copy of the textbook, practice problems, etc for a more in depth understanding.

I just signed up for the AnnotateMyPDF website but i see nothing regarding the course in there. is it only available to CMU students?

I clicked the AnnotateMyPDF, but it looks like you need to be signed up for the course to see those materials :(

You could try Knuth's Art of Computer Programming, Volume 1. It's in a formal style but covers some simpler algorithms and it's by an incredible writer. It's painstaking to read but some find it to be nearly a religious experience.

Hm, the scribd link is prompting me to download the pdf

I had the pleasure of studying under Prof Xindong Wu at UVM as an undergraduate and have some slides from, and maybe even a video of a talk he gave to the UVM Computer Science Student Association. If there's interest I'd be happy to track them down.

We does he say K-means isn't great? It always seemed like it would be useful if you have lots of data.

You're kind of looking for the most similar records to match each record to. Am I understanding that right?

All the presented algorithms have pros and cons. You have to understand the algorithms(tools) at your disposal and find the best for the questions you want to make.

In the K-means case, it is good to know beforehand how many groups(clusters) you are expecting. Otherwise, there are other clustering algorithms(like gdbscan) which detect the number of cluster automatically(although you have to provide the average expected density). But once again all depends on the question you want to answer, or if you are just cleaning the data a bit.

Know your data, know your algorithms and always question the results.

The paper enumerates all of the shortcomings it has. So, what more did you need? Essentially, it can be highly influenced by where the initial means are chosen and the existence of outliers. Further, on multidimensional data you have to take care in choosing the distance formula. (There was a great blog not long ago about using K-Means to find the "palette" of an image where many of these shortcomings can be seen easily.)

That isn't to say the algorithm can not be used. It just is not bullet proof.

The groups are dependent on the initial configuration. Have a look at the (poor) grouping on page 8.

As Fisher said, the task is to separate the noise from the data!

I would expect GUHA instad of Apriori, but I guess Apriori is more influential. But it's not like either of them is new.


did they datamine a paper on data mining?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact