
Do We Need Hundreds of Classifiers to Solve Real World Classification Problems? [pdf] - msherry
http://jmlr.csail.mit.edu/papers/volume15/delgado14a/delgado14a.pdf
======
benhamner
This is consistent with our experience running hundreds of Kaggle
competitions: for most classification problems, some variation on ensembled
decision trees (random forests, gradient boosted machines, etc.) performs the
best. This is typically in conjunction with clever data processing, feature
selection, and internal validation.
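
Something like this sketch is what I mean (scikit-learn, with a toy
synthetic dataset standing in for real competition data; the parameters are
illustrative only):

    # Tree-ensemble baselines with internal cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    for model in (RandomForestClassifier(n_estimators=500, random_state=0),
                  GradientBoostingClassifier(random_state=0)):
        scores = cross_val_score(model, X, y, cv=5)  # 5-fold internal validation
        print(type(model).__name__, scores.mean())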

One key exception is where the data is richly and hierarchically structured.
Text, speech, and visual data fall under this category. In many such cases,
variations of neural networks (deep neural nets/CNNs/RNNs/etc.) provide
dramatic improvements.
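
As a rough sketch of why (PyTorch here; the architecture and the 32x32 RGB
input shape are made-up minimal assumptions, not a recommendation):

    # A tiny CNN: convolutions exploit the spatial structure of image data,
    # which flat feature-vector methods like tree ensembles cannot see directly.
    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        def __init__(self, n_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local receptive fields
                nn.ReLU(),
                nn.MaxPool2d(2),                             # 32x32 -> 16x16
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),                             # 16x16 -> 8x8
            )
            self.classifier = nn.Linear(32 * 8 * 8, n_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    x = torch.randn(4, 3, 32, 32)  # batch of four 32x32 RGB images
    print(TinyCNN()(x).shape)      # torch.Size([4, 10])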

This study does have a couple of limitations. The datasets used are very
small & form a very biased selection of real-world applications of machine
learning. It also doesn't consider ensembles of different model types (which
I'd expect to provide a consistent but marginal improvement over the results
here).

------
smu3l
Here's some commentary on the article:
http://www.win-vector.com/blog/2014/12/a-comment-on-preparing-data-for-classifiers/

~~~
elliott34
Great commentary. Encoding categorical variables with many levels as
unordered integers, or even as frequency-count maps, is pretty common
practice, but really only valid (as the author says) in tree-based methods.
Given deep enough trees (or enough iterations in the boosting scenario), it
is my anecdotal experience that trees are able to parse out the relevant
values. The problem with indicator variables in general is the enormous
potential increase in computational time and model complexity. My models
have hundreds of vehicle/car types, U.S. states, etc., so this type of
encoding helps quite a bit with training speed without sacrificing (and
sometimes improving, by avoiding the curse of dimensionality) predictive
performance. (IMHO, YMMV)
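
To make that concrete, a small sketch (pandas; the column name and levels
are made up for illustration):

    # Integer codes and frequency counts vs. indicator (one-hot) variables.
    import pandas as pd

    df = pd.DataFrame({"vehicle_type": ["sedan", "suv", "truck", "sedan", "van"]})

    # Unordered integer codes: a single column, valid for tree-based methods.
    df["vehicle_code"] = df["vehicle_type"].astype("category").cat.codes

    # Frequency-count map: replace each level with its occurrence count.
    df["vehicle_freq"] = df["vehicle_type"].map(df["vehicle_type"].value_counts())

    # Indicator variables: one column per level. With hundreds of levels
    # this blows up dimensionality, which is the cost described above.
    indicators = pd.get_dummies(df["vehicle_type"], prefix="vehicle")
    print(indicators.shape)  # (5, 4) here; (n, hundreds) in a real model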

------
hooande
I think the lesson here has been evident for some time: there is no one best
classifier, only classifiers that perform better on particular problems.
Random Forests work well because, as ensemble methods, they can be optimized
to explore many different aspects of a data set, but they don't perform
significantly better than SVMs or most of the other methods.

In practice the solution is often to _use a combination of multiple
methods_: trees, support vector machines, multilayer perceptrons, Gaussian
kernels, bagging, and boosting. In most applications you don't have to
choose. Combining the results of all of them with a weighted average will
outperform any of them individually, and in most cases the whole is greater
than the sum of its parts. Each classifier fits a given data set differently
and provides its own perspective on the prediction problem. The goal isn't
to choose the best one, but to find an ensemble of methods that best
explains the patterns and relationships in the data.
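
A minimal sketch of that weighted average (scikit-learn; the models and
weights here are made up for illustration, and in practice the weights
would be tuned on held-out data):

    # Weighted average of predicted probabilities from unlike classifiers.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    models = [RandomForestClassifier(random_state=0),
              SVC(probability=True, random_state=0),
              MLPClassifier(max_iter=1000, random_state=0)]
    weights = [0.5, 0.3, 0.2]  # hypothetical weights

    probs = sum(w * m.fit(X_tr, y_tr).predict_proba(X_te)
                for w, m in zip(weights, models))
    print("ensemble accuracy:", (probs.argmax(axis=1) == y_te).mean())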

There are many cases where resource and speed limitations dictate that only
one classifier can be tuned and implemented, and in those situations it's
good to know which one is 'best'. But when it's possible to build an
ensemble out of many different methods, that's almost always the way to go.

~~~
zt
Couldn't agree with this comment more. This hit home for me when I read
Grimmer and King's "General Purpose Computer-Assisted Clustering and
Conceptualization"
(http://gking.harvard.edu/files/gking/files/201018067_online.pdf), which
looked at every text clustering/classification algorithm they could find at
the time (as summarized in
http://gking.harvard.edu/files/gking/files/discov-supp.pdf). They basically
end up clustering in the space of clustering algorithms...

