Machine Learning Classifier Gallery (Tom Fawcett) (comcast.net)
37 points by chromophore on Sept 4, 2009 | 11 comments

My Observations:

* Instance-based methods can adapt to any kind of pattern if you have enough data. This is a well-known result in machine learning -- the interesting question is simply how much data do you need?

* It's quite remarkable how well RBF SVMs do in almost all cases. They even outperformed a 2nd degree polynomial kernel SVM on a 2nd degree polynomial!

* Logistic regression is pretty terrible for anything except "easy" data -- simple linear data.

* Random forests are sometimes ok, but tend to prefer axis-aligned data (at least in his formulation).

* Meta-point: all the tests shown are with "clean" data...i.e., there are no "wrong" training examples. This is unrealistic, and in practice makes a HUGE difference for some of these methods. E.g., a lot of the rule-based methods get demolished by even a little bit of wrong data. In contrast, SVMs have a slack variable that can tolerate some amount of noise, and would probably shine even more on such data.
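To make the slack-variable point in the last bullet concrete, here is a minimal sketch of the standard soft-margin SVM objective in plain Python. The weights, the four data points, and the value of C are made up for illustration and are not taken from the gallery; the last point is deliberately mislabeled.

```python
def slack(w, b, x, y):
    """Slack xi = max(0, 1 - y * (w.x + b)); zero when the point clears the margin."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def soft_margin_objective(w, b, data, C=1.0):
    """0.5 * ||w||^2 + C * sum of slacks (the usual soft-margin formulation)."""
    reg = 0.5 * sum(wi * wi for wi in w)
    return reg + C * sum(slack(w, b, x, y) for x, y in data)

# Hypothetical separator x1 = 0 and toy data; labels are +1 / -1.
w, b = (1.0, 0.0), 0.0
data = [
    ((2.0, 0.0), +1),   # correct, outside the margin
    ((-2.0, 0.0), -1),  # correct, outside the margin
    ((0.5, 0.0), +1),   # correct side, but inside the margin
    ((-2.0, 1.0), +1),  # "wrong" training example on the far side
]

slacks = [slack(w, b, x, y) for x, y in data]
objective = soft_margin_objective(w, b, data, C=1.0)
print(slacks, objective)
```

The mislabeled point only adds a finite penalty scaled by C, so the optimizer can accept it rather than bending the boundary around it. A smaller C tolerates more such points.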

Shouldn't an RBF kernel be expected to perform better than a 2nd-degree poly kernel since, technically, it is infinite-dimensional? (I.e., the radial basis function can be expanded into an infinite-degree Taylor series.)
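The Taylor-series claim can be checked numerically. The sketch below assumes the common form k(x, y) = exp(-gamma * ||x - y||^2) and uses the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, so the kernel factors into a prefactor times the series for exp(2 * gamma * x.y). The function names, gamma, and the sample points are illustrative choices, not anything from the gallery.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Closed form: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def rbf_series(x, y, gamma=1.0, terms=20):
    """Same kernel via a truncated Taylor series in the dot product x.y.

    Term n is proportional to (x.y)^n, i.e. a homogeneous polynomial
    kernel of degree n, so the full sum mixes all polynomial degrees.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x)
    norm_y = sum(b * b for b in y)
    prefactor = math.exp(-gamma * (norm_x + norm_y))
    total = sum((2 * gamma * dot) ** n / math.factorial(n) for n in range(terms))
    return prefactor * total

x, y = (0.5, -1.0), (1.5, 0.25)
exact = rbf_kernel(x, y)
up_to_degree_2 = rbf_series(x, y, terms=3)   # degrees 0, 1, 2 only
many_terms = rbf_series(x, y, terms=20)
print(exact, up_to_degree_2, many_terms)
```

Truncating at degree 2 leaves a visible gap, while twenty terms agree with the closed form to floating-point precision on these points.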

I think (not an ML expert by any means) that RBF should probably outperform most fixed-integer polynomial kernels, but it does so at the cost of introducing more structural risk into the model.

It does not really go like that. The higher the dimension (actually VC dimension, but polynomial dimension is a similar concept), the more points you would theoretically need to guarantee that the classifier is not overfitting (i.e., fitting the noise in the data set).
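As a rough illustration of how fast the "polynomial dimension" grows, the sketch below counts the monomials of total degree at most d in n input variables, which is C(n + d, d). That count is only a proxy for capacity, not a sample-size bound, and the choice of n = 2 (two plotted axes) is an assumption about the gallery's data.

```python
import math

def poly_feature_count(n_inputs, degree):
    """Number of monomials of total degree <= degree in n_inputs variables."""
    return math.comb(n_inputs + degree, degree)

# Growth for 2-D inputs as the polynomial degree increases.
counts_2d = [poly_feature_count(2, d) for d in range(1, 6)]
print(counts_2d)

# The same degree costs far more features in higher-dimensional data.
print(poly_feature_count(10, 2), poly_feature_count(100, 2))
```

An RBF kernel corresponds to letting the degree grow without bound, which is why the argument above says more data is needed to rule out overfitting.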

I think it's interesting how well the instance-based methods perform at comparable numbers of training samples to the SVM-RBF methods. I'd certainly choose instance-based methods for tasks with data similar to these tests.

Doing predictions with instance-based methods can be really expensive.
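A minimal brute-force k-nearest-neighbours sketch shows where that cost comes from: nothing is learned up front, so every prediction computes a distance to every stored example. The function name, the toy points, and the returned distance count are illustrative only; real libraries often use spatial indexes to reduce this work.

```python
import math

def knn_predict(train, query, k=3):
    """Brute-force k-NN.

    train: list of (point, label) pairs; query: a point of the same dimension.
    Returns (predicted_label, number_of_distance_computations).
    """
    # One distance per stored training example, for every single query.
    dists = sorted((math.dist(point, query), label) for point, label in train)
    votes = {}
    for _, label in dists[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get), len(train)

train = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]
label_near_a, cost_a = knn_predict(train, (0.2, 0.1))
label_near_b, cost_b = knn_predict(train, (5.5, 5.5))
print(label_near_a, cost_a, label_near_b, cost_b)
```

Prediction time and memory both scale with the size of the training set, whereas an SVM only needs its support vectors at prediction time.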


What do the cognoscenti recommend for doing automatic text classification/categorization? I have been looking at spam filters, and they're mostly boolean-type predicates that return Spam/NotSpam results along with a confidence number. I want to be able to do the same for a large number of categories.

For simple classification tasks, naive Bayes performs surprisingly well, is very easy to set up and run, and is relatively computationally inexpensive. It's excellent for getting a good baseline performance approximation because it's just so easy to use.

If you have a lot of training data and computational power, an SVM will almost always outperform naive Bayes, but it takes longer to train and tune. Try naive Bayes first.
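For what "try naive Bayes first" might look like with many categories, here is a minimal multinomial naive Bayes sketch in plain Python that returns a category plus a confidence number. The class name, the whitespace tokenizer, the Laplace smoothing constant, and the three toy documents are all assumptions made for illustration; a real setup would need far more training text per category.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial naive Bayes over whitespace-separated, lowercased words."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of documents
        self.vocab = set()

    def train(self, text, category):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def predict(self, text):
        """Return (best_category, confidence), confidence in (0, 1]."""
        # Words never seen in training are skipped.
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for cat, n_docs in self.doc_counts.items():
            counts = self.word_counts[cat]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                score += math.log((counts[w] + self.alpha) / denom)
            log_scores[cat] = score
        # Normalize in log space to get a posterior over categories.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())

nb = NaiveBayesText()
nb.train("cheap pills buy now", "spam")
nb.train("kernel svm margin classifier", "ml")
nb.train("goal match team score", "sports")
category, confidence = nb.predict("svm kernel trick")
print(category, round(confidence, 3))
```

Adding a category is just more calls to `train` with a new label, which is part of why it makes a cheap baseline. The confidence values tend to be overconfident on long documents because of the independence assumption.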

It bothers me that these data sets are very low-dimensional, noise-free, and pretty-picture-like. This pretty much excludes any interesting data to try to learn (after all, one could easily hand-code a classifier for most of these "concepts" that achieves 100% accuracy).

Right. This also biases the test toward certain kinds of algorithm. The implicit requirement that the thing being learned should be something you can see (and check) by looking at pictures like these pretty much guarantees that simple "proximity-based" approaches will do well. So it's no surprise that we see nearest-neighbours and radial basis functions looking good in these tests.

So what generalizations can we make from these plots? That instance- and rule-based methods learn better? Random forest seems to be especially reliable.

I don't think any useful generalizations can be made without information about the CPU and memory usage of the algorithms.
