
Machine Learning Showdown: Apache Mahout vs. Weka - doppenhe
http://blog.algorithmia.com/post/103009975044/machine-learning-showdown-apache-mahout-vs-weka
======
jackhammer
Most data scientists these days use scikit-learn or R. Weka is really out of
fashion. Mahout and MLlib are difficult to use and perform worse. Often it's
better to just down-sample or rent an EC2 instance with a lot of memory.
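The down-sampling jackhammer suggests is easy to do reproducibly; a minimal sketch in pure Python (the data and sample size here are made up for illustration):

```python
import random

def downsample(rows, n, seed=42):
    """Return a reproducible random subsample of n rows."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

# Example: keep 1,000 of 1,000,000 rows so the data fits in memory.
data = list(range(1_000_000))   # stand-in for real rows
sample = downsample(data, 1_000)
print(len(sample))              # 1000
```

Fixing the seed makes experiments repeatable, which matters when you later want to compare models on the same subsample.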

~~~
platypii
Weka is definitely more old-school, but it has a LOT of algorithms available.
Weka and Mahout are the two biggest ML libraries on the JVM, but we couldn't
find any direct head-to-head comparison, so this was the result. We plan to
add scikit-learn, MLlib, and more in the future.

Your point about being difficult to use is exactly the problem that
Algorithmia solves.

------
discardorama
This is almost apples and oranges. Mahout's power lies in its ability to
handle huge amounts of data in a parallel fashion. Weka (which is rarely used
these days anyway) is for smaller problems and experimentation.

Neither of these (Mahout and Weka) is mainstream anymore. For large-scale
classification, people are using packages like VW [1]. And for small-scale
experimentation, scikit-learn or R.

[1] [http://hunch.net/~vw](http://hunch.net/~vw)

~~~
riffraff
IIRC VW doesn't have the easy integration with the Hadoop ecosystem that
Mahout does, so I am not sure everyone has moved to VW.

~~~
jbooth
Doesn't need it. If you're computing a linear/logistic regression via gradient
descent, performance-oriented C code on a single machine using the local
filesystem/caches will beat a Hadoop-based algorithm for just about any size
of dataset.
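The computation jbooth describes is just a sequential pass over the data updating one weight vector, which is why a single fast machine does so well. A toy sketch of logistic regression via stochastic gradient descent in pure Python, using a sparse (index, value) feature representation (the data is synthetic and the hyperparameters are arbitrary; VW's actual implementation is far more sophisticated):

```python
import math
import random

def sgd_logistic(samples, dim, epochs=5, lr=0.1):
    """One weight vector, updated in place on each (x, y) pair.

    samples: iterable of (x, y) where x is a sparse list of
    (feature_index, value) pairs and y is 0 or 1.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in samples:
            z = sum(v * w[i] for i, v in x)
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of log-loss w.r.t. z
            for i, v in x:                   # touch only nonzero features
                w[i] -= lr * g * v
    return w

# Tiny synthetic problem: label is 1 exactly when feature 0 is positive.
rng = random.Random(0)
data = []
for _ in range(500):
    v = rng.uniform(-1, 1)
    data.append(([(0, v), (1, 1.0)], 1 if v > 0 else 0))  # (1, 1.0) is a bias term

w = sgd_logistic(data, dim=2)
print(w[0] > 0)   # the model learns a positive weight on feature 0
```

Because only nonzero features are touched per update, the same loop structure scales to the very sparse, very high-dimensional problems mentioned downthread.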

~~~
riffraff
I meant that shops are likely to already have data sitting in
hdfs/hive/whatever which they can trivially use with mahout, while they may
not have it sitting on a single file system on a beefy machine.

~~~
jbooth
They can hadoop fs -get it or query it from hive.

Because you want to coalesce all your model updates on every pass over the
data, the long startup time for Hadoop jobs compounds the problem. You can
have hundreds of millions of samples, in a sparse space of hundreds of
millions of parameters, and do it faster on a single node using VW than on a
Hadoop cluster of equivalent nodes running Mahout.

------
bayonetz
RapidMiner is the jam for prototyping ML processes. It's so powerfully useful
that I've always been surprised they've kept it free for so long (they have a
paid version, but it's not necessary). In addition to its own algorithms, it
has a plugin that wraps Weka, so you get all of those too if you want them.
I'm in no way connected with them, just a big fan of it over every other ML
library or tool I've seen. If I could buy stock, I would...

------
doppenhe
for our HN friends direct invite
[https://algorithmia.com/signup?invite=HN24hr](https://algorithmia.com/signup?invite=HN24hr)

~~~
MrBuddyCasino
algorithmia.com looks like a fantastic idea, never heard of it before. What do
you use to wrangle your infrastructure?

~~~
doppenhe
Thanks! We use multiple cloud providers (DigitalOcean and AWS being our
primary), Play Framework, Akka, Ansible for deployment, Shippable for our CI,
and a good amount of custom clever code by our back-end team.

------
akbar501
For ML, Spark MLlib is a solid choice.

For large scale, distributed stats I'd go with SparkR.

[https://spark.apache.org/](https://spark.apache.org/)

~~~
doppenhe
We love Spark MLlib.

------
folli
I'm not very experienced in machine learning, just dabbled around a bit, so
maybe someone could explain this to me:

Looking at the graph of number of trees vs. accuracy, I would have expected
the line to asymptotically approach a maximum accuracy given more and more
trees; however, for Weka it looks quite wavy, and for Mahout it even looks as
if there's an optimum and more trees are worse.

Or is it just noise and I'm interpreting too much?

~~~
dthal
Usually RF will improve up to some point, and then the test (or OOB) accuracy
will rattle around a little bit, centered around some final value. RF should
never overfit from having too many trees. Both plots start at 50 trees, so it
looks like the 'improve' part of that is over by then, and what you are seeing
here is variation due to different trees being slightly better/worse.
Incidentally, that means that the variation in both plots is probably not
meaningful and that 'best' score of 99.4% at 250 trees is probably basically
an outlier.

------
dthal
There is something bothering me about this... Weka's accuracy seems quite high
in comparison to the results at Yann LeCun's MNIST page [1]. It's hard for me
to believe that "the answer" to the MNIST problem is "use Weka's RF".

[1] [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)

------
yid
Ugh... comparing random algorithms without showing error bounds on the
accuracies.

~~~
waw76
The point here was more just to get both things running with a slick and more-
or-less common interface. You're quite right that this isn't rigorous enough
to settle the question between two algorithms in a domain where they both work
somewhat; but stay tuned, rigorous analysis and comparison is high on our
priority list.

------
sgwizdak
I'm surprised that Spark's mllib wasn't included in this comparison.

~~~
doppenhe
Next :) We started with these two because they are both available in
Algorithmia now.

------
tsewlliw
Just playing around with it: do typical strategies for using these tools
account for "bad" data? I drew a '-' and got '4' as the guess, which feels
very wrong.

~~~
doppenhe
Just a function of our simple demo: it returns confidence scores for each of
the digits 0-9 and we pick the top one. Not perfect by any means, but it
shows how easily we can compare the libraries against each other.
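Picking the top class from per-digit scores is a one-liner; a minimal sketch (the scores below are invented for illustration, not actual demo output):

```python
# Hypothetical per-digit confidence scores from a classifier.
scores = {0: 0.01, 1: 0.02, 2: 0.05, 3: 0.10, 4: 0.61,
          5: 0.03, 6: 0.04, 7: 0.08, 8: 0.02, 9: 0.04}

best = max(scores, key=scores.get)
print(best)  # 4
```

The demo always answers with *some* digit, which is why a '-' still comes back as a guess: argmax has no "none of the above" option unless you add a score threshold.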

------
spountzy
WEKA is really slow, at least when trying out the 'example'... And why choose
these two? There are a lot more. But anyway, thanks for the comparison.

~~~
doppenhe
We chose these two because they are the most commonly used Java libraries. We
are adding new ones regularly to our library.

------
therobot24
WEKA seems to take forever to classify a digit in their demo. Also, I wonder
why there are drops in performance when using 200 and 300 trees.

~~~
doppenhe
Traffic slowed down our system there for a second; it should be up and
running at speed again.

~~~
therobot24
works now! However it keeps mis-classifying my 4's.

[http://imgur.com/mYRdIC0](http://imgur.com/mYRdIC0)
[http://imgur.com/cLQn74O](http://imgur.com/cLQn74O)
[http://imgur.com/bg2UaJN](http://imgur.com/bg2UaJN)
[http://imgur.com/66oUGdM](http://imgur.com/66oUGdM)
[http://imgur.com/oU92V09](http://imgur.com/oU92V09)
[http://imgur.com/qvILooJ](http://imgur.com/qvILooJ)
[http://imgur.com/URyRiOB](http://imgur.com/URyRiOB)
[http://imgur.com/AybUI20](http://imgur.com/AybUI20)
[http://imgur.com/kugzG2D](http://imgur.com/kugzG2D)

~~~
mx12
IIRC the MNIST data has a mix of styles of 4's. I tried it with an open top,
and it worked correctly. I'm wondering if they just happened to sample a
majority of 4's that have an open top.

Example: [http://imgur.com/mRRz1L3](http://imgur.com/mRRz1L3)

------
mch82
Can anyone summarize the general workflow for using these analysis tools? Just
looking for a high level intro and maybe a link to more detail.

------
coffeemugmugmug
I don't know anybody seriously using either of these. Mahout has bad
implementations and Weka is showing its age.

