
Comparison of machine learning libraries used for classification - pzs
https://github.com/szilard/benchm-ml
======
nl
This is some pretty good work.

Vowpal Wabbit does pretty well, which isn't surprising - it has always
benchmarked well.

I'm not really familiar with H2O at all, but those are some pretty impressive
results. Not only is it competitive in terms of speed with Vowpal Wabbit, but
it also looks like it achieved the highest absolute score (AUC = 81.2 on a
GBM using H2O-3).

------
raus22
For the people interested: [http://mlcomp.org/](http://mlcomp.org/) MLcomp is
a free website for objectively comparing machine learning programs across
various datasets for multiple problem domains.

~~~
stared
I see some tests, but are there any public comparison tables/charts? (I.e.
same data, but different algorithms or different packages.)

------
earino
My friend Szilard is the author of this benchmark. He was attempting to answer
the questions in the comments, but the responses did not show up. I am posting
his answers here for him:

~~~~

I was trying to answer each question, but got blocked, so here are some
answers in one comment:

1. Yes, it's WIP.

2. Did RF+GBM so far, did not even start DL, but stay tuned...

3. For Spark I used number of partitions = number of cores.

4. For why Spark is less accurate for RF, see the Databricks comments here:
[http://datascience.la/benchmarking-random-forest-implementat...](http://datascience.la/benchmarking-random-forest-implementations/#comment-53599)
(essentially the current version aggregates votes, while it should aggregate
probabilities).

5. The issue of Spark being slow is not only overhead on small data, but also
lots of serializing/deserializing, garbage collection, etc.

6. Non-linear SVMs scale badly, but one could look at the maximum data size
that e.g. svmlight can handle (results/pull requests are welcome).

7. I think logistic regression in Python scikit-learn can be much improved
(as I mentioned on GitHub) by using a sparse format (someone should do it).
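For context on why a sparse format matters here: the benchmark's categorical features are one-hot encoded, so each row is mostly zeros. A pure-Python sketch of the storage difference (sizes and counts are illustrative; in scikit-learn the actual mechanism would be passing a scipy.sparse CSR matrix to `fit`):

```python
# One-hot encoded rows are mostly zeros. Dense storage keeps every zero;
# sparse storage keeps only the non-zero (column, value) pairs.
n_features = 1000        # rough order of p in the benchmark after encoding
nonzeros_per_row = 10    # one 1 per original categorical column (assumed)

dense_row = [0.0] * n_features
for j in range(nonzeros_per_row):
    dense_row[j * 100] = 1.0

# A CSR-like view of the same row: only the non-zero entries.
sparse_row = {j * 100: 1.0 for j in range(nonzeros_per_row)}

# With 8-byte floats, the dense row stores n_features values, while the
# sparse row stores roughly an (index, value) pair per non-zero.
dense_bytes = 8 * n_features
sparse_bytes = 2 * 8 * nonzeros_per_row
```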

8. The difference between implementations is somewhat surprising, but some of
the tools take different approaches (e.g. RF in Python/xgboost works with the
raw data, while H2O/Spark builds bins from the data, etc.). There are tricks
and options available in only one or two of the tools. I tried to do my best
to match parameter values/setup, but it's not the same.
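The binning difference mentioned in point 8 can be sketched as follows (toy feature values; real histogram-based trees also keep per-bin statistics, not just the edges):

```python
# Exact split finding scans midpoints between all distinct sorted feature
# values; histogram ("binned") trees pre-bucket values and only consider
# bin edges as candidate splits.
values = [0.1, 0.15, 0.2, 3.4, 3.5, 7.8, 8.0, 9.9]

# Exact candidate splits: midpoints of consecutive distinct values.
exact_candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

# Binned candidate splits: a fixed number of equal-width bin edges.
n_bins = 4
lo, hi = min(values), max(values)
binned_candidates = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

# Fewer candidates per feature means less work per tree node, but splits
# can only land on bin edges, so results differ slightly between tools.
```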

I also wrote a blog post (a slightly more organized text for RF than the
GitHub README):
[http://datascience.la/benchmarking-random-forest-implementat...](http://datascience.la/benchmarking-random-forest-implementations/)

There is also a recording of a talk I gave at a Machine Learning meetup if you
want more insights:
[https://www.youtube.com/watch?v=DK87lCLH_6A](https://www.youtube.com/watch?v=DK87lCLH_6A)

~~~
math_and_stuff
Is there any detailed breakdown of the data that led to conclusion 5?

------
snnn
This comparison is silly and unfair. It is tantamount to using anti-aircraft
guns to fight mosquitoes and then concluding that heavy weapons are too slow
for killing mosquitoes.

In these experiments, p is quite small (about 1K), so the size of the weight
vector (or weight gradient vector) is only about 4 KB (for float) or 8 KB
(for double). It should take less than 0.1 second to transfer these vectors,
so there is no need to use tree allreduce or BitTorrent-style broadcast to
transfer them. These big inventions from VW and Spark become useless and
burdensome. Also, in such a setting, Spark's accumulator has no advantage
over traditional map-reduce.
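The arithmetic behind this claim checks out as a back-of-envelope calculation (the 100 MB/s link speed is an assumed figure):

```python
# With p around 1000, the weight/gradient vector is a few kilobytes, so
# any broadcast scheme is fast enough.
p = 1000
float_bytes = 4 * p    # ~4 KB for 32-bit floats
double_bytes = 8 * p   # ~8 KB for 64-bit floats

link_bytes_per_sec = 100 * 1024 * 1024              # assumed 100 MB/s link
transfer_seconds = double_bytes / link_bytes_per_sec  # well under 0.1 s
```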

Another point: though correctness is more important than performance, it's
very difficult to get a correct implementation even for the most basic
problems. E.g. most implementations of OWL-QN (which is for L1-regularized
convex problems) are wrong. More seriously, sometimes the system you rely on
is problematic or unstable by design; e.g. this Spark issue
[https://issues.apache.org/jira/browse/SPARK-5490](https://issues.apache.org/jira/browse/SPARK-5490)
may never be fixed.

I would also recommend Google's sensei
([https://github.com/google/sensei](https://github.com/google/sensei))
as an alternative to be evaluated. I recommend it just because it comes from
Google, but it's not well-known yet.

~~~
math_and_stuff
Since when did Spark invent tree allreduces (which have been standard practice
within MPI for decades)?

~~~
rxin
I don't think he said anything about Spark inventing allreduce. Spark did use
torrent broadcast, which I believe is pretty unique.

~~~
math_and_stuff
"So there [sic] are no need to use tree allreduce [...] These big inventions
from Spark..."

~~~
rxin
"tree allreduce or bittorrent"

------
stared
For logistic regression, scikit-learn seems to be the slowest, but also the
most accurate (BTW: why is it so memory-consuming?). Is it possible to tweak
parameters for the various packages so as to get similar scores? (Or to show
score and time ranges?) Even the number of iterations can be a huge factor.
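The point about iteration counts can be demonstrated with a toy gradient-descent logistic regression in pure Python (all data and settings here are made up): the stopping rule alone changes both runtime and the score, so packages with different defaults aren't directly comparable.

```python
import math
import random

# Toy 1-D logistic regression fit by gradient descent.
random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(500)]
ys = [1 if x + random.gauss(0, 0.5) > 0 else 0 for x in xs]

def fit(n_iter, lr=0.5):
    w = 0.0
    for _ in range(n_iter):
        grad = sum((1 / (1 + math.exp(-w * x)) - y) * x
                   for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

def log_loss(w):
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(1 / (1 + math.exp(-w * x)), eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

loss_few = log_loss(fit(1))      # barely trained
loss_many = log_loss(fit(200))   # near-converged
# Same algorithm, same data: the iteration budget alone moves the score.
```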

------
mikkom
This paper is also worth reading; it's excellent and very little known
(comparisons start on page 5):

[http://www.cs.uic.edu/~tdang/file/CHIRP-KDD.pdf](http://www.cs.uic.edu/~tdang/file/CHIRP-KDD.pdf)

------
sgt101
It's always good to see proper evaluations, and I am sure that this will
provide food for thought and improvement for the tool owners.

However - a couple of points.

- One data set can't be more than an indicative test.
- Spark isn't designed for single-machine environments.
- Spark isn't designed for "small problems", i.e. where it takes 30 seconds
to run the system.
- If the algorithms are right, then the results for accuracy should be
identical. That they aren't indicates a bug, which would be worth unpicking
with the relevant teams?

~~~
Analog24
These algorithms are all fairly complicated with numerous parameters involved
in the calculations. Most of the packages (I can't speak for all of them since
I'm not familiar with all of them) make it easy to implement these algorithms
by providing reasonable default values for most of the necessary parameters.
It's very unlikely that they are set to identical values, thus leading to
differing results.

In addition, some of the algorithms are non-deterministic. Random forests, as
the name implies, involve randomly setting the decision tree parameters
numerous times. You can run the same algorithm with the same implementation
and get different results. Likewise with NNs: it all depends on how you set
the weights initially, which is usually a non-deterministic process as well.

To sum it up, I don't think that obtaining different results from different
implementations of the same ML algorithm is indicative of a bug. It's actually
expected in many situations.
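That non-determinism can be sketched in a few lines of Python, using decision stumps on bootstrap resamples as a stand-in for a real random forest (toy data; `train_forest` is a made-up helper, not any library's API):

```python
import random

# An ensemble of decision stumps trained on bootstrap resamples. Different
# seeds give (slightly) different models; the same seed reproduces the run.
data = [(i / 10.0, 1 if i > 50 else 0) for i in range(100)]

def train_forest(rng, n_trees=25):
    thresholds = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in range(len(data))]
        pos = [x for x, y in sample if y == 1]
        neg = [x for x, y in sample if y == 0]
        # "Train" a stump: split halfway between the class means.
        thresholds.append((sum(pos) / len(pos) + sum(neg) / len(neg)) / 2)
    return sum(thresholds) / len(thresholds)    # averaged ensemble split

run_a = train_forest(random.Random(1))
run_b = train_forest(random.Random(2))
run_c = train_forest(random.Random(1))
# run_a == run_c (same seed), but run_a != run_b (different seeds).
```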

~~~
sgt101
_edited to better explain_

So...

If the variance _of a particular implementation_ from run to run is
significant, then one has to question how real any gain over the bottom
result is. In the old days we used to do things like cross-validate 30+
times and give results x vs y vs z with some confidence interval.
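The 30x-resampling practice might look like this in pure Python (toy data and a trivial threshold classifier; the 1.96 factor assumes a normal approximation):

```python
import random
import statistics

# Toy data: a 1-D feature with a noisy label, and a trivial classifier
# that predicts 1 when the feature is positive.
rng = random.Random(42)
labeled = [(x, 1 if x + rng.gauss(0, 0.3) > 0 else 0)
           for x in [rng.uniform(-1, 1) for _ in range(400)]]

def holdout_accuracy(rng):
    test = rng.sample(labeled, 100)     # one random held-out split
    return sum((x > 0) == (y == 1) for x, y in test) / 100

# Evaluate 30 times and report mean +/- a confidence half-width,
# instead of trusting a single run's score.
scores = [holdout_accuracy(rng) for _ in range(30)]
mean = statistics.mean(scores)
half_width = 1.96 * statistics.stdev(scores) / (30 ** 0.5)
# Two implementations whose intervals overlap may not meaningfully differ.
```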

I believe that any implementation of a particular algorithm should be almost
exactly the same as any other implementation if it is correct. Therefore, over
a number of runs the results should be the same.

~~~
Analog24
I think you have a stricter definition of what an algorithm is than the one
used in the context of this study. The general concept of the random forest
algorithm (for example) is the same in each implementation, but the exact
details of how that general algorithm is implemented (the exact algorithm)
are most likely not the same. Therefore, you shouldn't expect to get
completely identical results regardless of how much data you train them
with. They should all be in the same neighborhood, though, which they are,
for the most part, in the results from the study.

------
thomasrossi
Very cool, thanks for doing/sharing this. I am curious why he says "non-linear
SVMs are the most accurate but can't scale". I've used the good old C svmlight
on pretty large datasets; it was not real-time, but..

~~~
pmelendez
> I've used the good old C svmlight on pretty large datasets

When the dataset size is in GBs, non-linear kernels might not even finish.
Kaggle has good-sized datasets that you could use to see the difference; it
is insane. That said, I would love to see it included in this kind of
benchmark to see how the AUC would compare.
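For readers unfamiliar with the metric: AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, which can be computed pairwise (made-up scores and labels):

```python
# Made-up classifier scores and true labels for five examples.
scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# Count positive/negative pairs ranked correctly (ties count half).
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))   # 5 of 6 pairs correct -> 5/6
```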

------
cozzyd
I'm curious to see how TMVA
([http://tmva.sourceforge.net/](http://tmva.sourceforge.net/)) would stack up.

~~~
jbssm
I would advise anyone to stay away from ROOT. It's a really, really badly
programmed framework that will steal many hours of your time.

~~~
cozzyd
Hey now, it's better than PAW!

------
blueyes
Would be curious to see how they compare to DL tools like
[http://deeplearning4j.org](http://deeplearning4j.org)

------
helloImNew
Didn't set the number of threads for Spark to use. You should oversubscribe.
This is probably the biggest misstep.

------
ilaksh
Where is the table comparing the accuracy of DL tools to others?

It seems like mainly he is just saying "DL takes a shitload of resources",
with no numbers.

My understanding is that the other tools have limited relevance now that DL
exists, since most systems are moving to cloud-based services that do have
the resources for DL, and DL offers much better accuracy. That is not readily
concluded from this report.

~~~
Fede_V
Actually, resources are the least of your concerns when it comes to DL.

DLs have a shitload of hyper-parameters (effectively, the entire architecture)
which is mostly built by trial and error + intuition. There was some really
exciting work on reversible SGD by an applied math group at Harvard
([https://github.com/HIPS/hypergrad](https://github.com/HIPS/hypergrad)) to
obtain derivatives with respect to network architecture, but it's not very
mature yet.

Further, DL needs an immense amount of data to be effective. If your dataset
has only a few hundred features and a few thousand samples, it's unlikely DL
will help you much.

Finally, DL is not really useful out of the box, at all. scikit-learn's
selling point is that it has an amazingly good uniform API, fantastic
documentation, and very clean code. Making a working DL implementation with a
simple 'fit' sklearn-like API is impossible.
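For readers unfamiliar with what an "sklearn-like API" means, here is a minimal toy estimator in that style (not a real scikit-learn class, just the `fit`/`predict` shape that makes models interchangeable):

```python
# A toy estimator exposing the sklearn-style fit/predict pair.
class MajorityClassifier:
    def fit(self, X, y):
        # Memorize the most common label; fit returns self, sklearn-style.
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

clf = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
preds = clf.predict([[5], [6]])   # -> [1, 1]
```

A deep-learning model resists this shape mainly because the architecture itself (layers, optimizer, schedule) is a hyper-parameter the user must specify.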

~~~
ilaksh
I noticed you also did not give any information on the accuracy of DL versus
other techniques.

------
JD557
Is this still WIP? I would really like to see how Spark fares in a
distributed environment.

------
jbssm
Do any of these libraries have support for computation on the GPU using CUDA?

------
felipelalli
(at the end of the README file)

Conclusions: ...

