

Machine Learning Classifier Gallery (Tom Fawcett) - chromophore
http://home.comcast.net/~tom.fawcett/public_html/ML-gallery/pages/index.html

======
apu
My Observations:

* Instance-based methods can adapt to any kind of pattern if you have enough data. This is a well-known result in machine learning -- the interesting question is simply how much data do you need?

* It's quite remarkable how well RBF SVMs do in almost all cases. They even outperformed a 2nd degree polynomial kernel SVM on data generated from a 2nd degree polynomial!

* Logistic regression is pretty terrible for anything except "easy" data -- simple linear data.

* Random forests are sometimes ok, but tend to prefer axis-aligned data (at least in his formulation).

* Meta-point: all the tests shown are with "clean" data...i.e., there are no "wrong" training examples. This is unrealistic, and in practice it makes a HUGE difference for some of these methods. E.g., a lot of the rule-based methods get demolished by even a little bit of wrong data. In contrast, SVMs have slack variables that can tolerate some amount of noise, and would probably shine even more on such data.
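To make the instance-based point concrete, here is a minimal sketch (toy data and parameters are my own, not from the gallery): a plain 1-nearest-neighbour classifier, given a few hundred memorized examples, recovers a non-linear circular concept that no linear separator could fit.

```python
import random

random.seed(0)

def label(x, y):
    # Non-linear "concept": inside vs. outside a circle of radius 1
    return 1 if x * x + y * y < 1.0 else 0

# Random points in the square [-2, 2] x [-2, 2]
train = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(500)]
test = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(200)]

def nn_predict(p):
    # 1-NN: copy the (memorized) label of the closest training point
    nearest = min(train, key=lambda q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
    return label(*nearest)

acc = sum(nn_predict(p) == label(*p) for p in test) / len(test)
print(f"1-NN accuracy on a circular concept: {acc:.2f}")
```

With more training points the accuracy climbs toward 100%, which is exactly the "enough data" caveat above: errors only occur in a thin band near the decision boundary whose width shrinks as the training set grows.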

~~~
eob
Shouldn't an RBF kernel be expected to perform better than a 2nd degree poly
kernel since, technically, it is infinite-dimensional? (I.e., the radial basis
function can be expanded into an infinite-degree Taylor series.)

I think (not an ML expert by any means) that RBF should probably outperform
most fixed-degree polynomial kernels, but it does so at the cost of
introducing more structural risk into the model.
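The Taylor-series view mentioned above can be written out. Assuming the standard Gaussian kernel with width parameter \(\gamma\), it decomposes into inner-product powers of every degree:

```latex
k(x, y) = e^{-\gamma\|x-y\|^2}
        = e^{-\gamma\|x\|^2}\, e^{-\gamma\|y\|^2}
          \sum_{n=0}^{\infty} \frac{(2\gamma)^n}{n!} \, (x \cdot y)^n
```

Each \((x \cdot y)^n\) term is (up to scaling) a degree-\(n\) polynomial kernel, so the RBF feature space contains the 2nd-degree polynomial features as a subset.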

~~~
alextp
It does not really go like that. The higher the dimension (actually the VC
dimension, but polynomial degree is a similar concept), the more points you
would theoretically need to guarantee that the classifier is not overfitting
(i.e., representing the noise in the data set).
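For reference, the classic Vapnik generalization bound makes this quantitative: with probability at least \(1 - \delta\), for a hypothesis class of VC dimension \(h\) trained on \(m\) points,

```latex
R(f) \;\le\; R_{\mathrm{emp}}(f)
  + \sqrt{\frac{h\left(\ln\frac{2m}{h} + 1\right) + \ln\frac{4}{\delta}}{m}}
```

so the gap between test error \(R\) and training error \(R_{\mathrm{emp}}\) grows with \(h\): a higher-capacity kernel needs more data before the bound says anything useful.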

------
mahmud
OT:

What do the cognoscenti recommend for doing automatic text
classification/categorization? I have been looking at Spam filters, and
they're mostly boolean type predicates that return Spam/NotSpam results along
with a confidence number. I want to be able to do that same for a large number
of categories.

~~~
thomaspaine
For simple classification tasks, naive Bayes performs surprisingly well, is
very easy to set up and run, and is relatively computationally inexpensive.
It's excellent for getting a good baseline performance approximation because
it's just so easy to use.

If you have a lot of training data and computational power, an SVM will almost
always outperform naive Bayes, but it takes longer to train and tune. Try
naive Bayes first.
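A multinomial naive Bayes with add-one smoothing fits in a few lines of plain Python. The toy documents and category names below are made-up placeholders, but the structure is the standard algorithm, and it extends to any number of categories:

```python
import math
from collections import Counter, defaultdict

# Toy training corpus: (text, category) pairs -- stand-ins for real documents
docs = [
    ("cheap pills buy now", "spam"),
    ("limited offer buy cheap", "spam"),
    ("meeting agenda attached", "work"),
    ("project deadline meeting", "work"),
    ("football match tonight", "sports"),
    ("great match great goal", "sports"),
]

# Per-category word counts (multinomial model)
word_counts = defaultdict(Counter)
cat_counts = Counter()
for text, cat in docs:
    cat_counts[cat] += 1
    word_counts[cat].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    best, best_lp = None, float("-inf")
    for cat in cat_counts:
        # log P(cat) + sum over words of log P(word | cat), add-one smoothed
        lp = math.log(cat_counts[cat] / len(docs))
        total = sum(word_counts[cat].values())
        for w in text.split():
            lp += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = cat, lp
    return best

print(classify("buy cheap offer"))  # -> spam
print(classify("match tonight"))    # -> sports
```

Everything happens in two passes: counting at "training" time and a sum of log-probabilities at prediction time, which is why it is so cheap compared to an SVM.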

------
alextp
It bothers me that these data sets are very low-dimensional, noise-free and
pretty-picture-like. This pretty much excludes any interesting data to try to
learn (after all, one could easily hand-code a classifier for most of these
"concepts" that performs at 100%).

~~~
gjm11
Right. This also biases the test toward certain kinds of algorithm. The
implicit requirement that the thing being learned should be something you can
see (and check) by looking at pictures like these pretty much guarantees that
simple "proximity-based" approaches will do well. So it's no surprise that we
see nearest-neighbours and radial basis functions looking good in these tests.

------
greg
So what generalizations can we make from these plots? Instance- and rule-based
methods learn better? Random forests seem to be especially reliable.

~~~
modeless
I don't think any useful generalizations can be made without information about
the CPU and memory usage of the algorithms.

