
Apache Mahout: Scalable machine learning for everyone - bpuvanathasan
http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
======
law
Honestly, frameworks like Mahout and Weka have their place, and that's
typically for exploratory data analysis. My belief is that for large-scale,
extremely intensive machine learning, your best bet is to implement algorithms
tailored to the job at hand. Algorithms like logistic regression work fine if
your data is linearly separable, but they're not a panacea. No algorithm is.

If you're interested in machine learning and artificial intelligence, I _very
strongly_ recommend "enrolling" in Tom Mitchell's machine learning class at
<http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml> -- the lectures are
long and the mid-term and final are extremely difficult, but the material
covered is an outstanding primer for these types of analyses.

After going through all of the lectures, you will look at things like Mahout
and Weka as mere toys, and will be equipped to write your own implementations
for whatever task you and your company are working on. It's a lot of front-
loading for rewards that may at first glance seem illusory, but investing the
time now will pay dividends later.

~~~
tensor
Libraries like Weka and Mahout are no more _toys_ than any other library that
implements standard and widely applicable algorithms. Yes, you need to do a
lot of extra work to properly model your problems, choose features, and
combine different algorithms into a final product. But it's not often that you
need to tweak the core algorithms that these libraries provide.

If you really understand enough to implement new classifiers or other types of
learning algorithms, these libraries are _still_ useful to you. For one, they
provide a solid framework for allowing your new algorithm to easily interact
with other algorithms. Two, it's not unlikely that your new algorithm is a
variation on an existing one. Don't re-implement it. These libraries are open,
so copy the source and modify it. And three, Mahout uses Hadoop. Distributed
processing systems are another topic altogether. If you are proposing to write
your own, I would hope that you have good reasons for spending the time.
Hadoop is certainly no toy.
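
To illustrate the interop point, here is a minimal sketch (mine, not from any of
these projects) of a custom classifier plugging into Weka's framework, assuming
Weka 3.7+ where AbstractClassifier is the base class; the class name is made up.
Once it extends the base class, Weka's evaluation, cross-validation, and ensemble
machinery can drive it like any built-in algorithm:

    import weka.classifiers.AbstractClassifier;
    import weka.core.Instance;
    import weka.core.Instances;

    // Hypothetical example: always predicts the majority class seen in training.
    public class MajorityClassClassifier extends AbstractClassifier {
        private double majorityClass;

        @Override
        public void buildClassifier(Instances data) throws Exception {
            // Count how often each class label occurs in the training data.
            int[] counts = new int[data.numClasses()];
            for (int i = 0; i < data.numInstances(); i++) {
                counts[(int) data.instance(i).classValue()]++;
            }
            int best = 0;
            for (int c = 1; c < counts.length; c++) {
                if (counts[c] > counts[best]) {
                    best = c;
                }
            }
            majorityClass = best;
        }

        @Override
        public double classifyInstance(Instance instance) {
            // Ignore the instance and return the majority class.
            return majorityClass;
        }
    }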

In summary, don't waste time reimplementing core algorithms unless you are
doing it for a learning exercise. But _do_ still take a good course on machine
learning, because using the provided algorithms in these packages and others
correctly is highly non-trivial.

~~~
dvcat
Dunno about Weka, but my last experience (5 months back) with Mahout was not
good. There still are quite a few bugs, and the fact that the entire code base
is in Java makes it extremely unpleasant for someone who wants to hack and
modify the code to jump right in and start tweaking stuff. However, in its
defense, it is open source, it is probably the only Hadoop-ified ML library out
there, and it has given me a ton of good ideas on how to write custom code.

~~~
mark_l_watson
Wow, we disagree. As much as I like to do my own development in dynamic
languages like Clojure and JRuby, for me:

I would _much_ rather have library and framework code that someone else has
written, debugged, and supports be written in Java: easy to browse in a
good IDE, statically typed, lots of unit tests so you can hack away with some
protection, etc.

~~~
dvcat
Maybe my point wasn't clear enough: 1. I am comfortable with using someone
else's library without having to reinvent the wheel but I want to know exactly
what I am getting into without having to browse through tons of Java code.
There are zillions of variants of algorithm X but I want to know exactly which
implementation/variant Mahout uses without going through the source code.
Unfortunately the docs (at least 4 months back) were pretty bad.

2. Their unit test coverage was not good enough, which incidentally is how I
found that there were bugs. The problem with trying to contribute back to the
community by rectifying these bugs? When I read the source code, I get
the feeling that each algorithm is owned to a great extent by one developer
who brings in their own idiosyncrasies which means that you need to really
study the code to make sure you don't accidentally add more bugs. The other
disadvantage of this approach is that questions regarding potential bugs and
puzzling issues can go unanswered or answered in an unsatisfactory manner
(mainly because of the one developer writing most of the code issue).

Having said all this, I want to be charitable and chalk these up to growing
pains. But if I were building something critical and big-data-ish, I would
either use Python (dumbo) or Scala, which are much more concise languages where
it is easier to express math without introducing bugs.

------
dmk23
Mahout is a great platform, but the real challenge is defining your learning
problems, preparing data sets, and choosing the right algorithms.

Once you are clear as to what you actually want to accomplish, chances are you
are going to need some kind of significantly modified or hybrid algorithm.
Packages like Mahout could help you get started, but it is kinda funny that
quite a few of the examples in this article do not demonstrate especially good
algorithm performance, like this one:

    
    
      Correctly Classified Instances : 41523 61.9219%
      Incorrectly Classified Instances : 25534 38.0781%
      Total Classified Instances : 67057
      =======================================================
      Confusion Matrix
      -------------------------------------------------------
      a b c d e f <--Classified as
      19044.0 12 1069 0 0 | 20125 a= cocoon_apache_org_dev
      2066 0 1 477 0 0 | 2544 b= cocoon_apache_org_docs
      16548.0 2370 704 0 0 | 19622 c= cocoon_apache_org_users
      58 0 0 20109.0 0 | 20167 d= commons_apache_org_dev
      147 0 1 4451 0 0 | 4599 e= commons_apache_org_user

~~~
Radim
There are decimal dots missing in the confusion matrix numbers (i.e., 190440
should read 19044.0, in case anyone else was wondering why the numbers don't
add up).

If anything, the article convinced me _not_ to use Mahout. So, the author
decided to use the simplest algorithm, Naive Bayes, and got miserable results
(from the article: "This is possibly due to a bug in Mahout that the community
is still investigating."). He then changed the problem formulation in order to
get better results, and concluded by saying the outcome is still likely a bug,
but he's happy with it anyway?

This would be probably fine if we were talking about a small, nimble project
that you could go into and hack/fix yourself. But we're talking about a
massive Java codebase. The thought of customizing it makes me shudder.

EDIT: forgot to mention I agree with the parent comment completely, except I
would add "... and choosing the right evaluation process" to the initial
sentence.

------
srowen
Hey all, I'm one of the main devs of Mahout and saw this article and commentary. I
think it's basically right. I'd like to add my own perspective.

I think Mahout has one key problem, and that's its purported scope. The
committers' attitude for a long while, which I didn't like myself, was to
ingest as many different algorithms as possible, anything that had to do with
large-scale machine learning.

The result is an impressive-looking array of algorithms. It creates a certain
level of expectation about coverage. If there were no clustering algorithms,
you wouldn't notice the lack of algorithm X or Y. But there are a few, so
people complain it doesn't support what they're looking for.

But there's also large variation in quality. Some pieces of the project are
quite literally a code dump from someone two years ago. Now, some of it is quite
excellent. But because there's a certain level of interest and hype and usage,
finding anything a bit stale or buggy leaves a negative impression.

I do think Mahout is much, much better than nothing, at least. There is really
only one game in town for "mainstream" distributed ML. If it is only a source
of good ideas, and a framework to build on, then it's added a lot of value.

I also think that some corners of the project are quite excellent. The
recommender portions are more mature as they predate Mahout and have more
active support. Naive Bayes, in contrast, I don't think has been touched in a
while.
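
To give a flavor of that corner, this is roughly what a user-based recommender
looks like with the Taste API; a sketch rather than anything from the article,
with the file name, neighborhood size, and user ID made up:

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderSketch {
        public static void main(String[] args) throws Exception {
            // "ratings.csv" is a placeholder: one "userID,itemID,preference" per line.
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // Consider the 10 most similar users when recommending.
            UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);
            // Top 5 recommendations for (made-up) user 42.
            for (RecommendedItem item : recommender.recommend(42, 5)) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }

The value is less in the math than in the plumbing: the data model, similarity,
and neighborhood are swappable components.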

And I can tell you that Mahout is certainly really used by real companies to
do real work! I doubt it solves everyone's problems, but it sure solves some
problems better than they'd have solved them from scratch.

What I strongly agree with here is that you're never likely to find an ML system
that works well out-of-the-box. It's always a matter of tuning, customizing
for your domain, preparing input properly, etc. If that's true, then
something like Mahout is never going to be satisfying, because any one system
is going to be suboptimal as-is for any given problem.

And for the specialist, no system, including Mahout, is ever going to look as
smart or sophisticated as what you know and have done. There are infinite
variations, specializations, optimizations possible for any algorithm.

So I do see a lot of feedback from smart people along the lines of "hmm, I
don't think this is all that great," and it's valid. For example, I wrote the
recommender bits
(mostly) and I think the ML implemented there is quite basic. But you see
there's somehow a lot of enthusiasm for it, if only because it's managed to
roughly bring together, simplify, and make practical the basic ML that people
here take for granted. That's good!

------
mark_l_watson
Another good article by Grant Ingersoll on Mahout. I used Mahout on a customer
project last year when it was not yet a complete machine learning system
layered on Hadoop. Looking at Table 1 in this article, many of the previous
gaps have been filled. BTW, the book Mahout in Action is a good guide, but
the new MEAP released last week does not cover some of the new features, which
is OK. Also, Grant has been working on "Taming Text" for a while, but a new
MEAP has not been released in a year or two - I would bet that his energies
have been focused on extending and using Mahout.

------
mahmud
I prefer Weka, mostly because it has excellent literature and has academic
leanings, unburdened by real-world issues of performance or scalability, so it
can afford to focus on accuracy.

~~~
paraschopra
The real value proposition of Mahout isn't the algorithms themselves but using
Hadoop to massively parallelize the machine learning algorithms. Do you know of
any port of Weka that can be scaled in such a manner? Just curious.
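
For what it's worth, the piece that parallelizes trivially is the per-record
counting that dominates something like Naive Bayes training. A rough sketch
(made-up class names, not code from Mahout or Weka), assuming input lines of
the form "label<TAB>token token ...":

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LabelTokenCount {

        // Emits (label:token, 1) for every token of every document.
        public static class CountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split("\t", 2);
                if (parts.length < 2) {
                    return;
                }
                for (String token : parts[1].split("\\s+")) {
                    ctx.write(new Text(parts[0] + ":" + token), ONE);
                }
            }
        }

        // Sums the counts for each (label, token) pair after the shuffle.
        public static class CountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }
    }

The counting itself is nothing special; what Hadoop buys you is running it over
terabytes across a cluster, with the shuffling, retries, and data locality
handled for you.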

------
zgoldberg
The Google Prediction API (code.google.com/apis/predict) will help you get
started with machine learning without the need to write any additional code
(other than API calls)!

------
tel
Table 1 reminds me that even if these algorithms are available, it's still a big
step to being able to understand and apply them. It's clear the author doesn't
have a lot of familiarity with them.

~~~
mwexler
He is co-founder of the Mahout project with a pretty extensive background in
text analysis. I suspect he's familiar with the algorithms. In fact, he may be
showing the reader that they _aren't_ as magical as one may believe, by
showing that they don't work perfectly oob.

Unless you are being sarcastic, in which case, forgive me for missing it.

------
reuser
That's cool and stuff, but why do I have to write Java?

~~~
jwr
You don't have to. I use Mahout from Clojure; not a single line of Java needs
to be written.

~~~
ericmoritz
Then you have two problems... I kid.

