

Why Generic Machine Learning Fails - yarapavan
http://metamarketsgroup.com/blog/machine-learning-in-wonderland/

======
oakenshield
> I get pitched regularly by startups doing “generic machine learning” which
> is, in all honesty, a pretty ridiculous idea.

Learned this the hard way. As an academic who used to think you could solve a
problem (e.g., spam filtering) using a single model --- something we routinely
do in academic papers --- I had to wait until I went to an industry internship
to realize how "ugly" a real spam filtering algorithm has to be. Large mail
providers see so many diverse patterns of spam that they need a complicated
mix of ML models, datasets, labels, and sometimes even plain old blacklists
to keep it under control.
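
To make the "ugly" part concrete, here is a minimal sketch of what such a
layered filter might look like (purely illustrative: every name, weight, and
threshold below is invented, not any real provider's setup):

    # Hypothetical layered spam filter: blacklist rules first, then a
    # blend of scores from separately trained models.
    BLACKLISTED_DOMAINS = {"known-spammer.example", "bulk-blast.example"}

    def is_spam(message, text_score, sender_score):
        """message: dict with "from" and "body" keys. text_score and
        sender_score: callables returning a spam probability in [0, 1],
        standing in for models trained on their own datasets and labels."""
        # Plain old blacklist: cheap, precise, catches known-bad senders.
        domain = message["from"].rsplit("@", 1)[-1]
        if domain in BLACKLISTED_DOMAINS:
            return True
        # Blend the model scores; in practice this mix is where much of
        # the "ugliness" lives.
        blended = (0.7 * text_score(message["body"])
                   + 0.3 * sender_score(message["from"]))
        return blended > 0.9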

~~~
tensor
Coming from a bioinformatics background, I had a quite different experience
of academia. Combining multiple sources of data and results from different
prediction algorithms is quite common there.

As for generic machine learning being a ridiculous idea, I don't see why he'd
think that. Nearly all specialized systems use generic machine learning
algorithms as submodules. They can very much be commoditized, like EC2. Even
Google has an upcoming framework for this, though I would agree that by
themselves they are not sufficient.

edit: I also think you miss the point of academic papers. The goal is not to
build a product, but rather to understand algorithms. Testing algorithms in
isolation from other boosters is crucial for this. Only if you are testing a
particular combining framework does it make sense to include multiple
approaches within the context of the proposed idea.

In bioinformatics, you additionally have researchers who actually want an
applied answer for their studies and work. Thus, in that area you _do_
routinely get something more like a product being produced in an academic
setting. The combined systems are often, but not always, published.

~~~
alextp
The issue is that generic machine learning algorithms work well enough as
black boxes, but to squeeze top performance out of them you need to do
feature engineering, architecture/model structure futzing, method selection,
etc., and in practice there are far too many of these meta-hyperparameters to
tune with cross-validation or anything similar.

While the generic ML tools work really well, it takes domain knowledge to
find the best way of applying them to the problem at hand, especially since
it almost never fits the classification/regression-on-IID-training-data model
that most algorithms are designed around. At first this might seem
counter-intuitive, but I've seen dramatic reductions in the error rate just
from picking good features or a reasonable model structure in a way that's
not easy to automate. And while deep learning or structure learning tries to
address these problems, there are issues with nonconvexity and really long
training times that make those algorithms unrealistic in many situations
(and, consequently, make them underperform simpler methods with clever domain
engineering).
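
A tiny synthetic illustration of that effect (the data, and the "domain
insight" that the target depends on an interaction of the two inputs, are
both made up here):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Synthetic target driven by an interaction the raw features hide.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(1000, 2))
    y = X[:, 0] * X[:, 1] + rng.normal(0, 0.01, size=1000)

    # Generic black box: a linear model on the raw features.
    raw_r2 = cross_val_score(Ridge(), X, y, scoring="r2").mean()

    # Domain knowledge: explicitly add the interaction as a feature.
    X_eng = np.column_stack([X, X[:, 0] * X[:, 1]])
    eng_r2 = cross_val_score(Ridge(), X_eng, y, scoring="r2").mean()

    print(f"raw features R^2:        {raw_r2:.3f}")   # near 0
    print(f"engineered features R^2: {eng_r2:.3f}")   # near 1

The same generic algorithm goes from useless to near-perfect with one
hand-crafted feature.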

~~~
tensor
Absolutely. In light of your comment, perhaps I am misunderstanding what the
original article means by a generic learning algorithm?

The points you make are well understood in academia. There are probably
hundreds of papers on feature selection and domain-specific modelling in
bioinformatics, for example.

In terms of boxed learning algorithms, I would assume such a thing would
provide a way to supply models and inputs in a variety of formats. The latter
would allow users to do their own domain-specific feature selection, or other
types of data reduction, before applying a particular learning algorithm. In
that sense, I could see things like Google's prediction API being useful in
principle, even though it won't eliminate the large domain-specific portion
of the work.
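
A minimal sketch of that division of labor (the feature reduction below is a
made-up stand-in for real domain work; only the final learner is the boxed,
generic part):

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.linear_model import LogisticRegression

    def domain_features(X):
        # Stand-in for the user's domain-specific data reduction, e.g.
        # collapsing raw measurements into meaningful summaries.
        return np.column_stack([X.mean(axis=1), X.std(axis=1)])

    model = Pipeline([
        ("reduce", FunctionTransformer(domain_features)),  # the user's job
        ("learn", LogisticRegression()),                   # the commodity part
    ])

    # Toy usage with synthetic data.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 50))
    y = (X.std(axis=1) > 1.0).astype(int)
    model.fit(X, y)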

------
tansey
I appreciate the practical aspects of the author's post. Too often on mailing
lists for various machine learning groups, a novice will ask if they can
"just apply [technique] to [big problem]", usually something like stock
trading or DNA analysis. The obvious answer is "Sure! Now go spend years
understanding how your domain really works in the context of the algorithm
you're trying to use." You can't just feed stock prices into a black box and
get rich; sorry, it doesn't work that way.

As for the idea of "general machine learning" not being feasible, it's worth
noting that the No Free Lunch Theorem [1] applies here.

[1] http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization

------
joe_the_user
I just want to note that the article's main point is _that_ generic machine
learning fails.

Why?

"The Netflix prize is a good example: the last 10% reduction in RMSE wasn’t
due to more powerful generic algorithms, but rather due to some very clever
thinking about the structure of the problem; observations like “people who
rate a whole slew of movies at one time tend to be rating movies they saw a
long time ago” from BellKor."

That's kind of hand-wavy, in the sense that the article hasn't produced the
factor which prevents a generic machine learning algorithm from "being very
clever" or finding that specific useful observation on its own. Sure, we have
an intuition about this, but that's it.

And that is kind of inevitable: if we could get an exact measure of why
current machine learning algorithms fail, we could probably build new ones
that succeed.

~~~
bermanoid
The “people who rate a whole slew of movies at one time tend to be rating
movies they saw a long time ago” example is wonderful, actually: it shows
_exactly_ why humans can guide the choice of algorithms better than machines
can: the data a human uses to come up with that hypothesis is quite literally
unavailable to the machine. It's completely outside the dataset under
analysis, and comes from a human's experience dealing with humans and their
assumptions about how people act. Most humans would probably mark that
statement as "probably true" without even investigating the data, and that's
an _extremely_ valuable prior that an ML algorithm has no access to (unless
we explicitly program it in).
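
For concreteness, "explicitly program it in" could be as simple as turning
the hypothesis into a feature the algorithm can see. A minimal sketch (not
BellKor's actual method):

    from collections import Counter

    def batch_size_feature(ratings):
        """ratings: iterable of (user_id, movie_id, rating, date) tuples.
        Returns, for each rating, how many ratings that user made on the
        same day: a proxy for "bulk-rated, probably seen long ago" that a
        downstream model can learn to discount."""
        ratings = list(ratings)
        counts = Counter((user, date) for user, _, _, date in ratings)
        return [counts[(user, date)] for user, _, _, date in ratings]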

Sure, you might argue that the hypothesis is implicit in the data set, and
(though I'm not familiar with the actual Netflix data, so I'm not sure) that
might be true: if it's in there in some form, it's even conceivable that some
algorithm might eventually pick it up. But a human would likely never even
dream of advancing that hypothesis without at least some vague sense that
other humans would probably act that way, and in many cases, without the high
prior probability that comes from our knowledge of psychology, it wouldn't be
proper to consider that factor at all. So in a sense, we're cheating every
time we use our external domain knowledge to push our ML algos to a better
spot in hypothesis space.

This doesn't say that generic ML fails; it merely says that "the sum total of
human knowledge + ML algo applied to data set" > "ML algo applied to data
set", especially when "data set" has something to do with shit that humans
know very well, like ourselves.

