

Support Vector Machines and Hadoop: Theory vs. Practice - techtime77
http://www.distilnetworks.com/support-vector-machines-hadoop-theory-vs-practice/#more-3680

======
HamSession
This is an interesting take, and it's why I love machine learning and its
intersection with HPC.

First, the seemingly blind decision to implement an SVM for improved
performance. An SVM isn't magical; in fact, an SVM and a neural network are
equivalent, with the SVM being the general case. SVMs suffer from the same
problem as neural networks: choosing your kernel function is analogous to
choosing your number of hidden nodes and activation function. When looking at
your problem you have to ask yourself:

 _1) Is the model time-varying? -> NN_

 _2) Is the search space very large and high-dimensional? -> SVM_

Secondly, even without changing algorithms you can get a significant accuracy
improvement by examining your features. Features are the most important part
of machine learning (garbage in, garbage out). Even simple classifiers such as
Naive Bayes can do well if given the right feature set. There are multiple
methods for evaluating your features, such as ReliefF; another is ANOVA. If
you find your features are not good enough, try unsupervised feature learning,
or learn more about the problem domain and come up with your own features.
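As a hedged sketch of the ANOVA idea above: scikit-learn (an assumption on my part, the post names no particular library) ships a univariate ANOVA F-test scorer, `f_classif`, that can rank features on synthetic data:

```python
# Univariate feature scoring with an ANOVA F-test in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Keep the 5 features with the highest F-scores.
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)          # (500, 5)
print(selector.get_support())   # boolean mask of the kept features
```

ReliefF isn't in scikit-learn's standard distribution, so the sketch sticks to the ANOVA route.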

The final issue, specific to HPC and machine learning, is that even given 100
cores your algorithm may not speed up. Many machine learning algorithms are
iterative in nature and do not lend themselves to parallelization, which means
MapReduce must be invoked at each iteration. As you scale up the number of
available cores, the overhead of starting up and shutting down your cluster at
each iteration swamps your gain in performance, since many nodes finish faster
than others and just sit and wait.
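The per-iteration overhead argument can be made concrete with a back-of-the-envelope model (the numbers below are illustrative assumptions, not measurements from any real cluster):

```python
# Toy model of iterative MapReduce scaling: each iteration pays a fixed
# job launch/teardown overhead, so adding cores eventually stops helping.
def wall_time(iterations, work_per_iter, cores, overhead_per_iter):
    """Total seconds for an iterative job with fixed per-iteration overhead."""
    return iterations * (work_per_iter / cores + overhead_per_iter)

# Assumed numbers: 100 iterations, 1000 s of compute per iteration,
# 30 s of cluster startup/shutdown overhead per iteration.
for cores in (1, 10, 100):
    print(cores, wall_time(100, 1000.0, cores, 30.0))
# 1 core:   103000 s
# 10 cores:  13000 s
# 100 cores:  4000 s  -- only 3.25x faster than 10 cores, not 10x,
# because the fixed 3000 s of overhead now dominates.
```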

The solution to all of this is simple:

1) Get your features correct

2) Try new algorithms

       - Try an online learning algorithm first, like Vowpal Wabbit:
         https://github.com/JohnLangford/vowpal_wabbit/


3) HPC

       - Apache Spark (http://spark.incubator.apache.org/),
         GraphLab (http://graphlab.org/),
         or, if limited to a personal computer,
         GraphChi (http://graphlab.org/graphchi/)

       - Spark and GraphLab support HPC with a graph-centric framework

       - Orders of magnitude faster than Hadoop

       - Both are built on top of Hadoop HDFS, so you can connect and go


Hope this helps everyone out there.

Have fun and try to solve some cool problems.
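The "try online learning first" advice can be sketched with scikit-learn's `SGDClassifier` as a stand-in for Vowpal Wabbit's streaming updates (VW itself is a command-line tool, so this is only an illustration of the online-learning style, not of VW's API). The default hinge loss makes this an online linear SVM:

```python
# Online learning sketch: stream data in mini-batches via partial_fit,
# so the full dataset never has to fit in memory at once.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
classes = np.unique(y)  # partial_fit needs the label set up front

clf = SGDClassifier(random_state=0)  # default hinge loss = linear SVM

# Each mini-batch updates the model in place.
for start in range(0, len(X), 1000):
    batch = slice(start, start + 1000)
    clf.partial_fit(X[batch], y[batch], classes=classes)

print(clf.score(X, y))
```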

~~~
gfodor
Nice post. Question for you: feature selection is certainly the most important
part of ML, and yet the focus of most ML texts is on the algorithm zoo, and
they gloss over feature selection. Are there any good references on the
variety of feature selection techniques, with examples of best practices?

~~~
chcleaves
That's a good question. Feature selection is a large field of research and is
a bit too broad for me to summarize in an abbreviated fashion. I would look
into "model selection", specifically into scores of models that weigh both
complexity (the number of variables) and goodness of fit. A good score to look
into first is the Bayesian information criterion (BIC) which is used, for
instance, in model selection in neuroscience.
[http://en.wikipedia.org/wiki/Bayesian_information_criterion](http://en.wikipedia.org/wiki/Bayesian_information_criterion)
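As a hedged illustration of the BIC idea, the score is k·ln(n) − 2·ln(L̂); for a least-squares fit with Gaussian errors this reduces to n·ln(RSS/n) + k·ln(n), which is easy to compute by hand (the data and models below are made up for illustration):

```python
# BIC comparison of two polynomial models on truly linear data.
import numpy as np

def bic_linear(y, y_pred, k):
    """BIC for a least-squares fit with Gaussian errors:
    n*ln(RSS/n) + k*ln(n), where k counts the fitted parameters."""
    n = len(y)
    rss = np.sum((y - y_pred) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + rng.normal(0, 0.1, 200)   # the true model is linear

# A degree-5 fit barely reduces the residuals, but BIC charges it
# for the 4 extra parameters, so the simpler model scores lower (better).
for degree in (1, 5):
    coeffs = np.polyfit(x, y, degree)
    print(degree, bic_linear(y, np.polyval(coeffs, x), degree + 1))
```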

One thing you might want to try is cross-validation
([http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29](http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29)).
Cross-validation should help you determine whether your model is overfitting:
an overfit model will perform significantly better on its training set than on
the held-out data.
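A hedged sketch of that check, again assuming scikit-learn: fit a model prone to overfitting, then compare its training accuracy against 5-fold cross-validated accuracy.

```python
# Using cross-validation to expose overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# An unconstrained decision tree memorizes its training data...
tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X, y).score(X, y)   # perfect on its own training set

# ...but held-out folds reveal the real generalization accuracy.
cv_acc = cross_val_score(tree, X, y, cv=5).mean()
print(train_acc, cv_acc)  # cv_acc is noticeably lower than train_acc
```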

------
pnachbaur
I'm still surprised there is no SVM implementation for Mahout
([http://stackoverflow.com/questions/10482646/recently-svm-implementation-was-added-into-mahout-i-am-planning-to-use-svm-an](http://stackoverflow.com/questions/10482646/recently-svm-implementation-was-added-into-mahout-i-am-planning-to-use-svm-an)).

------
chcleaves
A few years ago some SVM code was contributed to the Mahout project, but as
yet it still doesn't appear to have a working implementation. It seems one
can tweak existing Mahout functions a bit in order to accomplish the same sort
of thing, but Mike went ahead and started working on an SVM implementation
when he initially discovered it wasn't fully implemented. Given a package like
scikit-learn (built on top of the Python SciPy stack), it's not so hard to
implement a scheme similar to the one described in the blog once you know what
to do.
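The blog's actual pipeline isn't shown here; as a hedged sketch of the scikit-learn side only, a linear SVM can be trained and evaluated in a few lines (the data is synthetic, standing in for whatever features the real scheme uses):

```python
# Minimal SVM training/evaluation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A linear kernel keeps training tractable in a large feature space.
clf = LinearSVC(C=1.0, dual=False)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # held-out accuracy
```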

------
PaulHoule
I'd like to see some classification performance numbers, ROC curves, etc.

------
eakyol
Paul,

We don't have any published performance numbers at the moment, as we just
deployed our cluster in our production environment. We're looking to do a
post-facto write-up on that in a bit.

------
konstantintin
how much of an increase in accuracy is gained by using the full collection of
data rather than sampling?

