
Top ten algorithms in data mining (2007) [pdf] - daoudc
http://some-docs.googlecode.com/files/Top%2010%20algorithms%20in%20data%20mining.pdf
======
lkozma
Four years ago I took a class based on that paper where we implemented all ten
algorithms (every participant implemented every algorithm). It was a very
instructive and somewhat painful experience. The course website is still
online; if anyone is interested, the presentations of the algorithms and the
matlab/python code stubs for each algorithm might be useful.

<http://www.cis.hut.fi/Opinnot/T-61.6020/2008/>

------
mooneater
Missing: Logistic Regression. So old school (from the 1930s), so mathematically
rigorous (compared to neural networks, for example), so powerful, and so often
overlooked.

~~~
dxbydt
In the textbook "Multivariate Stats" by Izenman
(<http://www.amazon.com/gp/product/0387781889/>), he claims that stats & ML
progressed in parallel. So traditional stats techniques like OLS, multiple
regression, nonlinear regression, logistic regression, and GLMs are generally
not covered in ML. Similarly, ML topics like k-means, SVMs, random forests,
etc. are not taught by the stats dept.

What has been happening over the past decade is a convergence of stats & ML,
driven primarily by data scientists working in the domain of big data. The
stats folks are slowly incorporating ML techniques into stats and finding
rigorous heuristics for when they should be employed. Similarly, the ML guys,
who are mostly CS folk who have unfortunately taken only one undergraduate
course on stats & probability, are discovering that you can do so much more,
without resorting to needless large-scale computation, by sampling
intelligently and leveraging plain old statistics.

This schism between stats & ML can be leveraged very profitably during
interviews :))

When I interview data science folks, I usually ask very simple questions from
plain stats: how would you tell if a distribution is skewed? If you have an
rv X with mean mu, and say rv Y = X - mu, then what is the mean of Y? If you
have an rv A with mean 0 and variance 1, then what are the chances of it being
3 standard deviations away from the mean if you have no clue about the
distribution of A? What if you knew A was unimodal? What if A is normally
distributed?
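
(Answers, for the curious: Y = X - mu has mean 0, and the three tail answers
follow from Chebyshev, Vysochanskij-Petunin, and the exact normal tail. A
quick numerical check in Python, assuming scipy is available:)

    from scipy.stats import norm

    k = 3.0  # standard deviations from the mean

    # No assumptions on A (mean 0, variance 1): Chebyshev's inequality
    # gives P(|A| >= k) <= 1/k^2.
    print(1.0 / k**2)                 # ~11.1%

    # A unimodal: Vysochanskij-Petunin tightens this to 4/(9 k^2),
    # valid for k > sqrt(8/3).
    print(4.0 / (9.0 * k**2))         # ~4.9%

    # A standard normal: the exact two-sided tail probability.
    print(2.0 * norm.sf(k))           # ~0.27%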

Now if it's a stats guy, I ask very simple ML questions... what is a
perceptron, have you heard of a neural network, etc.

Surprisingly, the stats guys do much better on ML than the ML guys on stats!

~~~
malkarouri
Not surprising, really. ML is the shiny new thing, so the MLers don't tend to
feel they've missed anything, while the statisticians need to keep up with the
times.

I say this as an MLer still struggling to find out what R^2 is, among other
things...

~~~
dxbydt
It's just a stupid fraction. Say you have a dataset, i.e. a sequence of (x,y)
tuples. In OLS, you try to fit a line onto the dataset. So your manager wants
to know how well the line fit your dataset. If it does a bang-up job, you say
100%, aka an r-squared of 1. If it does a shoddy job, you say 0%, aka an
r-squared of 0. Hopefully your r-squared is much closer to 1 than to 0.

Here I just coded up a 10-liner for you: <https://gist.github.com/4333595>
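
(And in case the gist link rots, here's roughly what such a 10-liner might
look like in numpy -- an illustration, not the actual gist contents:)

    import numpy as np

    def r_squared(x, y):
        # Fit the OLS line y ~ a*x + b, then compare leftover
        # variation to total variation.
        a, b = np.polyfit(x, y, 1)
        ss_res = np.sum((y - (a * x + b))**2)   # unexplained
        ss_tot = np.sum((y - y.mean())**2)      # total
        return 1.0 - ss_res / ss_tot            # 1 = bang-up, 0 = shoddy

    x = np.linspace(0, 10, 50)
    y = 2 * x + 1 + np.random.randn(50)         # noisy line
    print(r_squared(x, y))                      # close to 1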

~~~
_dps
Respectfully, it's not a stupid fraction. It is a fundamental quantity arising
from the linear algebraic interpretation of correlation.

Covariance induces an inner product on the set of zero-mean random variables.
The regression coefficient is precisely the projection coefficient <x,y>/<y,y>,
and R^2 is precisely the Cauchy-Schwarz ratio <x,y>^2 / (<x,x><y,y>) (i.e. the
product of the two projection coefficients between x and y).

It is a theoretically natural measure of linear quality-of-fit. It has the
added bonus of being equal to the ratio of modeled variance to total variance
(variance being the squared norm of a random variable under the covariance
inner product).

It's also very, very cheap to compute. Though there are more practically
useful measures of "predictive power", like mutual information, R^2 does an
admirable job for an O(1)-space, O(num data)-time predictiveness metric.
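
(Those identities are easy to verify numerically. A small numpy sketch, with
<u,v> taken as the empirical covariance of centered samples:)

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = 2 * x + rng.normal(size=1000)
    xc, yc = x - x.mean(), y - y.mean()     # zero-mean variables
    ip = lambda u, v: np.mean(u * v)        # covariance inner product

    b_yx = ip(xc, yc) / ip(xc, xc)          # project y onto x
    b_xy = ip(xc, yc) / ip(yc, yc)          # project x onto y
    r2_cs = b_yx * b_xy                     # Cauchy-Schwarz ratio

    y_hat = b_yx * xc                       # modeled part of y
    r2_var = ip(y_hat, y_hat) / ip(yc, yc)  # modeled / total variance

    print(r2_cs, r2_var)                    # equal, up to floating point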

~~~
politician
> Respectfully, ...

I suspect that I learned more about R^2 by reading these comments in the order
presented (informal, then formal) than I would have had they been reversed.

------
larrydag
Oldie paper but still a goodie. Some other algorithms to consider: Random
Forests, Gradient Boosted Machines (similar to AdaBoost), Logistic Regression,
Neural Networks.

~~~
tocomment
What's so good about random forests? I seem to be hearing them mentioned a lot
lately.

~~~
_delirium
They were one of the first widespread ensemble methods (based on bagged
decision trees), and share with other bagging/boosting methods fairly good
empirical performance on a wide range of problems.
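
(If you want to kick the tires, a minimal sketch with scikit-learn -- the
library and dataset are my choices, not something from the thread:)

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Each tree is grown on a bootstrap sample of the rows, and each
    # split considers only a random subset of the features -- bagging
    # plus feature randomness.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    print(cross_val_score(forest, X, y, cv=5).mean())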

~~~
tocomment
Are they easy to interpret? Is it ok if I have a lot of columns in my data?

~~~
textminer
You will find no understanding in the grown trees -- by definition, each tree
sees a different angle of the data, and each node a different subset of the
available features.

You can, however, calculate importance scores for the features used. Brennan's
original paper gives a good algorithm for doing this (in short: for each tree,
permute the data along some feature in the out-of-bag sample and see how much
worse the tree does).
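
(A simplified sketch of that idea: here one held-out set stands in for the
per-tree out-of-bag samples of the original algorithm, and scikit-learn is an
assumed dependency:)

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10,
                               n_informative=3, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_tr, y_tr)
    baseline = forest.score(X_val, y_val)

    rng = np.random.default_rng(0)
    for j in range(X.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_val[:, j])       # destroy feature j
        print(j, baseline - forest.score(X_perm, y_val))  # accuracy drop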

~~~
textminer
(Pardon me/blame autocorrect: Breiman.)

------
phatbyte
As a self-taught programmer I always had a hard time reading algorithm
descriptions. Any suggestions on how to learn to read them?

~~~
sk55
Here is a concise PDF of definitions and notation
<https://docs.google.com/open?id=0B_WU5GXujPOVRDFYRUlKcFF4RWs>

It's from CMU's course titled "Concepts of Mathematics". The professor,
Brendan Sullivan, just wrote the textbook in an easy-to-understand style.
Here's a link to the course: <https://colormygraph.ddt.cs.cmu.edu/21127-f12/>,
and if you visit the AnnotateMyPDF website, you can get a copy of the
textbook, practice problems, etc. for a more in-depth understanding.

~~~
bosie
I just signed up for the AnnotateMyPDF website but I see nothing regarding the
course in there. Is it only available to CMU students?

------
cllns
Hm, the Scribd link is prompting me to download the PDF.

------
canthonytucci
I had the pleasure of studying under Prof. Xindong Wu at UVM as an
undergraduate, and I have some slides from, and maybe even a video of, a talk
he gave to the UVM Computer Science Student Association. If there's interest
I'd be happy to track them down.

------
tocomment
Why does he say K-means isn't great? It always seemed like it would be useful
if you have lots of data.

You're kind of looking for the most similar records to match each record to.
Am I understanding that right?

~~~
kikas
All the presented algorithms have pros and cons. You have to understand the
algorithms (tools) at your disposal and find the best one for the questions
you want to ask.

In the K-means case, it is good to know beforehand how many groups (clusters)
you are expecting. Otherwise, there are other clustering algorithms (like
gdbscan) which detect the number of clusters automatically (although you have
to provide the expected average density). But once again, it all depends on
the question you want to answer, or whether you are just cleaning the data a
bit.

Know your data, know your algorithms, and always question the results.
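
(A minimal sketch of that contrast, with scikit-learn's plain DBSCAN standing
in for gdbscan:)

    from sklearn.cluster import KMeans, DBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # K-means: you must say up front how many clusters you expect.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(len(set(km.labels_)))               # 3, by construction

    # DBSCAN: no cluster count, but you supply a density scale (eps).
    db = DBSCAN(eps=0.8, min_samples=5).fit(X)
    print(len(set(db.labels_) - {-1}))        # found automatically; -1 = noise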

------
sgt101
As Fisher said, the task is to separate the noise from the data!

------
glogla
I would expect GUHA instead of Apriori, but I guess Apriori is more
influential. But it's not like either of them is new.

------
11001
(2007)

------
caycep
Did they data-mine a paper on data mining?

