
Top Algorithms in Data Mining (2008) [pdf] - Anon84
http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
======
joeblau
Wow, this paper is great. Back in November, I implemented the "Trend or No
Trend: A Novel Nonparametric Method for Classifying Time Series"[1] paper in
Node.js and then hooked it up to Twitter's API. The algorithm was able to
find trending topics quickly and accurately. Does anyone know if the
source for these algorithms is implemented anywhere?

[1] -
[http://dspace.mit.edu/bitstream/handle/1721.1/85399/87030495...](http://dspace.mit.edu/bitstream/handle/1721.1/85399/870304955.pdf)

~~~
mendicantB
These algorithms are implemented in just about every major language, as they
are standard tools. I'd say your best bet for leveraging them is Python's
scikit-learn (excellent documentation).
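For instance, nearly everything on the paper's list (k-means, SVM, CART,
naive Bayes, kNN) is a few lines in scikit-learn. A rough sketch, assuming
scikit-learn is installed; the iris dataset is just a stand-in:

```python
# A few of the paper's "top 10" algorithms, as shipped in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # CART-style tree
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier  # kNN

X, y = load_iris(return_X_y=True)

# Unsupervised: k-means clustering.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised: fit each classifier and record its training accuracy.
scores = {}
for clf in (SVC(), DecisionTreeClassifier(), GaussianNB(),
            KNeighborsClassifier()):
    scores[type(clf).__name__] = clf.fit(X, y).score(X, y)
print(scores)
```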

------
tdaltonc
I'm surprised that there aren't any PCA/feature-extraction/dimensionality-
reduction algorithms on this list, and there's also a really strong emphasis
on clustering and classification.

Is this an artifact of how the list was put together or is clustering really
that much more common than feature extraction? Is PCA thought to be too
trivial to qualify?

~~~
micro_cam
PageRank is a similar eigenvalue-based algorithm, and I think PCA can be seen
as a relaxation of k-means, so in a sense they did include them.

But yes, if I wrote the list, eigenvector-based methods would be more
prominent.
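The eigenvalue connection is easy to see in code: PageRank is essentially
power iteration on a damped, column-stochastic link matrix. A toy sketch
(the four-page graph below is made up):

```python
import numpy as np

# Toy web graph (made up): page i links to the pages in links[i].
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85  # number of pages and the usual damping factor

# Column-stochastic link matrix M: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

# Power iteration: r converges to the dominant eigenvector of the
# "Google matrix" G = d*M + (1 - d)/n * ones.
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = d * (M @ r) + (1.0 - d) / n

print(r)  # page 2 should rank highest: every other page links to it
```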

Also, if I wrote it, the tree-based methods would have been replaced with
random forests, which are similar to CART but use the crucial ideas of sample
bagging and random feature subsets, which have proven themselves over and
over.
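For what it's worth, those two ingredients map directly onto scikit-learn's
RandomForestClassifier parameters. A quick illustrative sketch (synthetic
data, not a benchmark):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier      # single CART-style tree
from sklearn.ensemble import RandomForestClassifier  # bagged ensemble of them

# Synthetic data, just for illustration.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,       # sample bagging
    max_features="sqrt",  # random feature subset at each split
    random_state=0,
)

tree_acc = cross_val_score(tree, X, y, cv=5).mean()
forest_acc = cross_val_score(forest, X, y, cv=5).mean()
print(f"single CART tree: {tree_acc:.3f}   random forest: {forest_acc:.3f}")
```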

I think it may just be out of date.

~~~
jessaustin
Any suggestions for something more recent?

~~~
micro_cam
The machine learning subreddit recently started an FAQ:

[http://www.reddit.com/r/MachineLearning/comments/20i0vi/meta...](http://www.reddit.com/r/MachineLearning/comments/20i0vi/meta_collection_of_links_for_beginners_faq/)

They also have a good selection of articles, though they tend to be somewhat
deep-learning/neural-network focused. Those algorithms have seen a big surge
in interest and success of late but may not be as applicable to "data mining"
(as opposed to, say, image or speech recognition).

------
Anon84
This article was eventually expanded into a full-fledged book with more
in-depth descriptions of each algorithm:

[http://www.crcpress.com/product/isbn/9781420089646](http://www.crcpress.com/product/isbn/9781420089646)

------
eterpstra
Is it just me, or was anyone else taken aback by an academic article using a
linkbait-esque title? I'm sure the title is perfectly reasonable, given that
it's a survey paper, but the past few years of being bombarded with articles
like "10 weird tricks to optimize your everything" immediately trip my
skepticism meter. Funny how the internet warps the brain.

~~~
minimaxir
Unlike most top-10 lists, which are deliberately designed with unexpected
choices for shock value, the paper states exactly how the 10 were chosen and
attempts to justify why each choice was made.

If the paper were titled "Wow, The Top 10 Algorithms in Data Mining Are
Incredible, And I Can't Believe Where Naive Bayes Placed," then yeah, that
might be link-baity.

------
mindslight
Q: What's the difference between working on data mining and designing
missiles?

A: When you build missiles, you can at least take comfort knowing that they'll
only be used to harm people who are far away.

~~~
Houshalter
What are you talking about?

~~~
philovivero
He's on about the problem of data mining being used by the NSA and corporate
America to do Bad Things.

And what are you on about?

I'm guessing you're on about "a tool is only as good or evil as its
wielder."

Trying to take this back out of the meta: our society needs to figure out how
to stop hurting each other. I think a good dose of critical-thinking skills
for our youngsters would go a long way toward helping that.

~~~
Houshalter
I guess that's sort of like saying computers are evil because they are used
by the NSA. Is there even any evidence that they use it for anything
significant? Let alone that this guy's work is contributing to it.

~~~
mindslight
Computers have _many_ applications, while data mining is generally used with
working sets consisting of information about large groups of people - either
to get a strategic leg up on them (say, price discrimination or advertising
susceptibility), or to single out deviant individuals.

I don't think the entire field is strictly a universal evil per se, but with
the current climate of centralization I think any good that it can bring is
massively dwarfed by the bad.

Similarly, missiles themselves aren't a universal bad as they can be used for
self defense, but in the current world basically their entire purpose is to
preserve empires.

~~~
Houshalter
I don't think that's the case. In theory it can be used to "find deviants",
but has that actually ever been done, and to what degree? Improving
advertisements isn't _horrible_.

It has a lot of applications in science, could vastly improve medicine, it's
used to predict energy usage for power plants, filter spam, recommend movies
or products, improve harvests for farming, etc. But I guess those things don't
sell papers.

And lastly, calling a technology "bad" isn't actually helpful. We can't stop
bad people from using the technology or "uninvent" it; you'll just hurt the
people using it for good. See the halo effect, where people told about the
risks of nuclear power perceive its benefits as smaller
([http://lesswrong.com/lw/lj/the_halo_effect/](http://lesswrong.com/lw/lj/the_halo_effect/)).

~~~
mindslight
By "finding deviants", I mean exactly what the NSA is doing, and what every
Keystone police department will be doing in ten years. And yes, improving
advertisements _is_ horrible - this is precisely one of the anti-human facets
I am talking about.

> _It has a lot of applications in science, could vastly improve medicine,
> it's used to predict energy usage for power plants, filter spam, recommend
> movies or products, improve harvests for farming, etc._

You're right. My initial comment was way too harsh and broad, indicting the
whole field rather than just the in-our-face web-scene applications.

> _lastly calling a technology "bad" isn't actually helpful. We can't stop bad
> people from using the technology_

Of course you can't uninvent something, but you can slow down or speed up its
progress. _Path dependence_ is a thing. Currently, most people do not feel
empowered with respect to computers. The more we put them at the mercy of
seemingly unpredictable computer systems that are actually trying to outwit
them, the harder it is going to be for them to get over those feelings of
disempowerment so that we can get to a place where people expect their
computers to work for them.

~~~
Houshalter
I've seen no evidence that the NSA does that, and really it makes no sense.
Doing a search for "abnormal" behavior would mostly get you people who visit
odd websites, or are awake at atypical hours, or spend too much time online,
etc. Nothing remotely useful. There is a lot of hype around data mining, but
there are diminishing returns on what can be done with it.

I agree that ads are bad, but they aren't that bad. Targeted advertising is
mostly just finding things that are relevant to the website or viewer. And
it's so incredibly easy to block ads on the internet that I don't have much
sympathy for those who don't and then complain about them. Ads were _way_
worse in the age of television, where 25% of all content was commercials.

>Of course you can't uninvent something, but you can slow down or speed up
its progress. Path dependence is a thing.

Perhaps, but for the most part the technology _already exists_. At its core
it's just basic statistics, or sometimes even simpler techniques. Diminishing
returns were hit long ago; I can't think of much that would improve the "bad"
applications more than the "good".

>The more we put them at the mercy of seemingly unpredictable computer systems
that are actually trying to outwit them, the harder it is going to be for them
to get over those feelings of disempowerment so that we can get to a place
where people expect their computers to work for them.

I... what? On the rare occasions when machine learning is used in apps, it's
to improve the user experience, or to do something that would otherwise be
too complex or inefficient to code by hand. I'm sure someone, somewhere has
abused the technology in their app, just like every programming technology
ever. But in general I don't see how this is bad, or even that it makes
computers more "unpredictable".

------
maldeh
Kevin Murphy's treatment of kernel methods in "Machine Learning: A
Probabilistic Perspective" offers a take on SVMs that I don't often find in
most discussions. He makes the case that SVMs are rather unnatural from a
Bayesian point of view. Apropos of other sparse vector machines, Murphy
opines:

"Note that SVMs are very unnatural from a probabilistic point of view. First,
they encode sparsity in the loss function rather than the prior. Second, they
encode kernels by using an algorithmic trick, rather than being an explicit
part of the model. Finally, SVMs do not result in probabilistic outputs, which
causes various difficulties, especially in the multi-class classification
setting.

"It is possible to obtain sparse, probabilistic, multi-class kernel-based
classifiers, which work as well or better than SVMs, using techniques such as
L1VM or RVM... However, we include a discussion of SVMs, despite their non-
probabilistic nature, for two main reasons. First, they are very popular and
widely used, so all students of machine learning should know about them.
Second, they have some computational advantages over probabilistic methods in
the structured output case... [specifically, SSVMs]

...

"The conclusion is as follows: if speed matters, use an RVM, but if well-
calibrated probabilistic output matters (e.g., for active learning or control
problems), use a GP. The only circumstances under which using an SVM seems
sensible is the structured output case, where likelihood-based methods can be
slow. (We attribute the enormous popularity of SVMs not to their superiority,
but to ignorance of the alternatives, and also to the lack of high quality
software implementing the alternatives.)"

\----

The tl;dr is that SVMs are an important tool because of their historical
impact and popularity, but the modern data scientist has a range of
off-the-shelf algorithms to choose from that may well be more useful.
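Murphy's point about non-probabilistic outputs is easy to demonstrate in
scikit-learn: `SVC` exposes only margin scores unless you ask it to bolt on
Platt-style calibration after the fact, while a natively probabilistic model
gives class probabilities directly. A sketch only (RVMs aren't in
scikit-learn, so logistic regression stands in for a probabilistic model
here):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# A plain SVM gives unbounded margin scores, not probabilities.
svm = SVC().fit(X, y)
margins = svm.decision_function(X[:1])  # margin scores, not probabilities

# probability=True retrofits probabilities via internal cross-validation
# plus Platt scaling: an add-on, not part of the model itself.
svm_prob = SVC(probability=True, random_state=0).fit(X, y)

# A natively probabilistic classifier, for contrast.
logreg = LogisticRegression(max_iter=1000).fit(X, y)

print(svm_prob.predict_proba(X[:1]))  # rows sum to 1
print(logreg.predict_proba(X[:1]))
```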

------
graycat
Okay, given a box of data, shove it into one of these algorithms, let the
computer run, and now what? What do we have?

E.g., for some 'classifier', what the heck do we have? That is, what
guarantees do we have? That is, is there something specific and guaranteed
that the algorithm has actually done?

I know; I know; in some vague, unspecific sense that seems to work in some
practical cases, the algorithm will 'classify'. Still, we don't have a
precise description of (1) in what cases the algorithm works and (2) how
well.

E.g., what is the rate (probability) of false classifications? Sure, we'd
need some hypotheses to say. Okay, what the heck are the algorithms assuming
for hypotheses that would even allow a rate of false classifications to
exist? Then, with those hypotheses, what is the probability of a false
classification?

Without such considerations, about all we have is some code that runs.

Uh, where's the beef?

~~~
srean
The body of literature you are looking for is called statistical learning
theory. If you are interested, you can _start_ with Vapnik's two books, but
as I said, that is only a start. The theory of learning has shifted quite a
bit, but the fundamentals remain the same. Once you are up to speed with
Vapnik, you can start reading the proceedings of COLT and ALT (Computational
Learning Theory and Algorithmic Learning Theory, respectively).

You will find the following three types of guarantees:

i) Strong assumptions on knowledge of the stochastic process generating the
data, backing guarantees on the algorithm.

ii) No assumptions on the data-generating process, except that it is governed
by some unknown independent stochastic process; guarantees are proven on the
output of the algorithm.

iii) No assumption even that the data-generating process is stochastic, yet
guarantees are still provided on the output.

What exactly is guaranteed will of course differ.

A little advice: smugness, as demonstrated with "...I figured it out 16 years
ago, yo bitches," will not help your cause, whatever that might be. BTW, are
you H. A. Dye? Because Dye figured out group structures on measure-preserving
transformations in 1959, a little bit earlier than 16 years ago... just
saying.

~~~
graycat
I was commenting on the OP. I started reading it but found just 'algorithms',
with no indication of anything like guarantees or qualities for them.

Your i) sounds good. E.g., in stochastic dynamic programming with assumptions
reasonably expected, e.g., essentially a Markov assumption, we can easily
prove that the number crunching yields the decisions with the least possible
expected cost from any 'measurable' means of manipulating the data. And once,
when I argued this, I also included a proof that additional data independent
of what we already have wouldn't help either. This result is, of course,
'obvious', although obvious is not a proof. The proof turned out to be just a
light tap with Fubini's theorem.

So, good: some people actually argue that some of the algorithms actually
have some desirable qualities. So the world is not totally coming apart at
the seams or being turned over to naive hackers who only know to watch the
code run and then conclude qualitatively that it seems 'nice'.

On your ii), "some unknown independent stochastic process", that's not so
clear. So, for some index set I and random variables X(i) for i in I, the set

{ X(i) | i in I }

is independent? That's something of an unusual and, in practice, not very
interesting stochastic process. Maybe some EEs would call that 'white noise'.

On your iii), sorry, but essentially everything is 'stochastic'. Even a
'deterministic' process is stochastic, in practice usually is also Markov
(past and future conditionally independent given the present).

Thanks for the reference to H. A. Dye. I went on to show how such a group
leads easily to statistical hypothesis tests. That is, in particular I was
making sense out of 'resampling' and 'bootstrap' plans.

In part I got the idea of considering 'measure preserving', of course, from
ergodic theory where measure preserving transformations are central, but the
transformations I used, and needed for the statistics, are the ones that
ergodic theory throws out.

No, I'm not Dye. That's the first I've heard of Dye's paper. Thanks. At one
point, long after publishing, I saw a remark in one of the volumes of Dunford
and Schwartz, as I recall attributed to Norbert Wiener, that maybe something
could be done with the 'periodic' case where, thus, there is a finite group.

Never heard of Vapnik. Thanks for the reference. From a simple Google search,
found at

[http://vserver1.cscs.lsa.umich.edu/~crshalizi/reviews/vapnik...](http://vserver1.cscs.lsa.umich.edu/~crshalizi/reviews/vapnik-nature/)

a review:

The Bactra Review: Occasional and Eclectic Book Reviews by Cosma Shalizi,
"The Nature of Statistical Learning Theory" by V. N. Vapnik.

Okay, Vapnik is trying to do some mathematics to show some desirable
mathematical properties of some of statistical learning theory, maybe in
particular some of the algorithms. Good.

The review is a bit tough on Vapnik's writing from when he was working in
Russia. That's sad: the famous Russian texts on probability and stochastic
processes, usually from Springer, were exemplary, with beautifully done
translations into English, e.g., E. Dynkin, R. Liptser, etc.

I can't find where I said "yo bitches".

But I do believe that a proposed algorithm should _do_ something, and that
what it does should be clear, good, and demonstrated, hopefully with some
theorems and proofs. Knuth's TAOCP was big on 'analysis' of algorithms, but
from the OP and, say, Prof. Ng's course on machine learning, it appears that
the analysis part is being omitted.

Broadly, it looks like (1) statistics, optimization, and stochastic processes
are old fields and no longer very popular, at least in US departments of
mathematics, which mostly don't like those fields; (2) computing is a new
field and has lots of grant money, interest, and students; (3) for an actual
application, those old fields need computing to do the specified data
manipulations; (4) computing wants to take on some of the problems the old
fields worked on and is then at risk of putting poor versions of good old
wine in new bottles with new labels. Is this a form of 'academic theft'?
Well, if not, then if the new wine is poor, it's a form of academic dilution.
Not good.

~~~
srean
Well, it's an HN comment, not a scientific communication.

Scenario (i) is mostly considered uninteresting and "solved" because it
assumes that the particular stochastic process generating the data is known
(this was the pre-Vapnik model). In most practical cases, however, one does
not know the corresponding measure from which the data was generated (and
will be generated in the future). The old-school thinking was that one should
estimate the generating distribution and then solve for the optimal decision
based on that estimate. Vapnik showed that it is far better to do it
directly, without bothering to estimate the distribution, which is itself a
non-trivial problem.

So (ii) is the more interesting case that machine learners care to handle,
and Vapnik's was one of the first breakthrough fundamental results in this
formalism. It applies to every algorithm that does something called
structural risk minimization, not just one specific algorithm. His bound
applies to all generating distributions, all samples, and all structural risk
minimization algorithms! However, in this setup we still need to assume
independence; some form of independence, even a weak one, needs to be
present. But it is quite remarkable that you can give guarantees on
performance _regardless_ of the actual distribution that generated the data.
I should also mention Leslie Valiant: look up PAC learning and you will
understand the basic premise of this style. One genuine criticism is that it
is overly conservative (and it has to be, because it gives guarantees that
apply for the worst distribution, worst training sample, worst algorithm...).

(iii) removes the need for independent draws. If you want to read up on
these, search for "regret minimization." In this setup the guarantees hold
for every sequence of data, even one generated adversarially by someone who
knows the algorithm you will employ. Some geometric assumptions are necessary
on the domain, and some function-analytic ones on the function you are trying
to minimize (typically convexity; non-differentiable functions are allowed).
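The flavor of (iii) in code: online gradient descent on an arbitrary sequence
of convex losses has regret growing only like sqrt(T) against the best fixed
decision in hindsight, so average regret vanishes with no stochastic
assumption at all. A toy sketch with made-up quadratic losses:

```python
import math
import random

# Online gradient descent on convex losses f_t(x) = (x - z_t)^2, where the
# targets z_t may be an arbitrary (even adversarial) sequence. Here they
# are just random, for illustration.
random.seed(0)
targets = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

x, D = 0.0, 1.0  # play points in the interval [-D, D]
losses = []
for t, z in enumerate(targets, start=1):
    losses.append((x - z) ** 2)          # suffer the loss first
    grad = 2.0 * (x - z)                 # then take a gradient step
    x -= (D / math.sqrt(t)) * grad       # step size shrinking like 1/sqrt(t)
    x = max(-D, min(D, x))               # project back onto [-D, D]

# Compare with the best fixed point in hindsight (the mean of the targets).
best = sum(targets) / len(targets)
best_loss = sum((best - z) ** 2 for z in targets)
regret = sum(losses) - best_loss
print(f"total regret: {regret:.2f}, average: {regret / len(targets):.5f}")
```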

    
    
    -- Shalizi's Review --
    

Shalizi writes good stuff, but this review of Vapnik's book is surprisingly
mean-spirited. The field has of course progressed since Vapnik, and there are
legitimate criticisms one can raise, but the review exhibits bad form, I
think. Prior to Vapnik there were very few results one could state outside of
the statistical physics community (which worked with strong assumptions,
typically Gaussian noise, etc.).

    
    
    -- Aside --
    

No, you did not say the words "yo bitches", but that was the attitude you
demonstrated. Before casting judgments as you did, at least make an effort to
familiarize yourself with the background. And that is exactly what it is:
background. A machine learner will not even mention these things, because you
are expected to know them if you are in this field.

    
    
    -- Re: Analysis being omitted --
    

Think of it this way: one doesn't really need to know the precise chemistry
of hydrocarbon combustion to drive a car; conversely, knowledge of such
chemistry _alone_ is useless unless it is teamed up with good engineering, in
this case an algorithm. There is a huge difference between showing that
minimizing a particular function has nice properties and actually coming up
with a piece of code to efficiently minimize such a function, not just in
theory, where constants can be conveniently swept under the rug, but also in
practice. So don't be too quick to dismiss the "naive coders".

Most importantly, in engineering, practice almost always precedes explanation
and analysis. Usually practitioners find the algorithms in their hacky ways,
and the analyst/theorist smooths out the kinks and explains why the method
works so well after it has been demonstrated to work so well. Not always, but
almost always.

------
KhalilK
These algorithms are all good, but they seem to be missing recurrent ANNs,
that is, the kind of data-mining tool our brains use. I can add this
paper as relevant for those who are interested:
[http://orre.neurologic.org/pubdoc/abstract/complex2003-abstr...](http://orre.neurologic.org/pubdoc/abstract/complex2003-abstract.html)

~~~
tarlar
The actual link is here
[http://orre.neurologic.org/pubdoc/abstract/complex2003-abstr...](http://orre.neurologic.org/pubdoc/abstract/complex2003-abstract.html)

------
KhalilK
It was published in 2007 not 2008 ;)

------
hydralist
i understand some of these words

