
What it takes to build great machine learning products  - yarapavan
http://radar.oreilly.com/2012/04/great-machine-learning-products.html
======
gfodor
A great and insightful article. A common theme I've seen in practice is that
folks with a deep understanding of ML often run straight to applying the most
sophisticated algorithms possible on raw data. On the other hand, people who
know a bit about ML but understand the domain better start by applying
intuition to data cleansing and then follow up with simpler algorithms.
Without fail, the latter group ends up with better results.

~~~
tel
I think there are essentially two "deep" understandings of ML prevalent today.
The first is more common: the ability to do the calculus, algebra, and
probability derivations required to design complex ML algorithms combined with
the CS knowledge to find/design a good algorithm and the software design skill
to actually implement it on real, "big" data.

No doubt this is a difficult position to master and those who perform well are
able to tackle lots of mathematical and computational challenges. They also
are model builders who have a tendency to relentlessly seek complex models in
order to solve complex problems.

The other, rarer side is the learning theorist who may or may not understand
the model building, algorithmic, and computational tools but understands well
the theories which allow us to have reasonable expectations that the tools of
the first group will work at all. These guys have a funny story in that they
were the old statisticians who got major egg on their faces after proclaiming
that essentially all of ML was impossible. Turns out the first group managed
to redefine the problem slightly and make major headway (and money).

---

The thing I want to bring to light, however, is that the second group knows
the math that bounds the capacities of ML algorithms. This isn't easy. It's
one thing to say you recognize that the curse of dimensionality exists, but
it's another to have felt its mathematical curves and to have built an
intuition for what forces are sufficient to cause disruption.

The more experience you have with the learning maths, the more likely you are,
I feel, to apply very simple algorithms, to be scared of "little x's" (real
data) enough to treat it with great care, and to explore the problem space
with confidence about which steps will lead you to folly.

---

It's a fine line between the two, though. Stray too far to the first group and
you'll spend a month building an algorithm that does a millionth of a
percentage point better than Fisher's LDA. Spend too much time in the second
camp and you'll confidently state that no algorithm exists that does better
than a millionth of a percentage point over Fisher LDA... and then lose purely
by never trying.

~~~
sireat
Our Data Mining professor (a different field, but somewhat related) gave us
this quote on the first day of university: "All models are wrong, but some
models are useful."

You can build an extremely complicated model that is not useful, where a
simpler one might suffice.

~~~
tel
I've heard that line referred to as Box's Razor. It's definitely the right
heuristic, but it's interesting to see that even if your model is _right_,
you're still in trouble if it's too complex. This is a sort of bias/variance
tradeoff.
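
To make that concrete, here is a minimal sketch, assuming NumPy and
scikit-learn (neither is mentioned in the article), where the richer model
class contains the true linear model yet still loses on held-out data:

    # Minimal illustration: a flexible model that contains the truth can still
    # lose to a simpler one because of variance. Assumes numpy + scikit-learn.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 1, 30))[:, None]
    y = 2 * X.ravel() + rng.normal(scale=0.3, size=30)   # truly linear + noise

    for degree in (1, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        mse = -cross_val_score(model, X, y, cv=5,
                               scoring="neg_mean_squared_error").mean()
        print(degree, round(mse, 3))   # the degree-15 fit typically scores worse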

------
tgflynn
I agree that the big wins in machine learning/(weak)AI are probably going to
come more from figuring out how to better apply existing models and algorithms
to real problems rather than from improving the performance of the algorithms
themselves.

That said, one shouldn't underestimate the amount of commonality between
problems that may appear unrelated to some people. For example, this post
talks about the gains in machine translation performance from including larger
contexts. The same principle applies to many other sequence learning problems.
For example, you have a very similar issue with handwriting recognition, where
it is often not possible (even for a human) to determine the correct letter
classification for a given handwritten character without seeing it within the
context of the word.
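
A hypothetical sketch of that idea, assuming NumPy and scikit-learn with
synthetic stand-in data: classify each character using its neighbors' features
as well as its own.

    # Hypothetical sketch: use a window of neighboring characters as context.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def context_features(glyphs, i, window=1):
        # Concatenate the glyph's feature vector with those of its neighbors,
        # padding with zeros at word boundaries.
        pad = np.zeros_like(glyphs[0])
        return np.concatenate([glyphs[j] if 0 <= j < len(glyphs) else pad
                               for j in range(i - window, i + window + 1)])

    rng = np.random.RandomState(0)
    glyphs = rng.normal(size=(200, 16))      # stand-in per-character features
    labels = rng.randint(0, 5, size=200)     # stand-in letter labels
    X = np.array([context_features(glyphs, i) for i in range(len(glyphs))])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)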

------
chaostheory
The article is light on details. IMO there are two major things your team
needs:

1) Programmers that have the needed math skills, or mathematicians with the
needed coding skills

2) A distributed ML framework

Solving problem one is not easy but it's straightforward.

Solving problem two is harder. While there are a lot of open source machine
learning projects, almost all of them seem to be designed to be used by a
person rather than by a program. Moreover, very few do distributed processing
except for Mahout (<http://mahout.apache.org/>). Mahout is promising, but the
documentation is still thin, and I'm not sure whether it's gaining momentum in
terms of mind share yet.

~~~
suneilp
What kind of math skills? What would a programmer need to learn in order to
work on ML stuff?

~~~
salimmadjd
Aside from the algebra needed for logs, exponents, division, addition, and
multiplication, you have to be versed in statistics. Most ML problems are
solved on a statistical basis. Although many of the algorithms have already
been worked out, you still need to grasp the statistics behind them, which is
a bit more involved than calculating the odds of a die.

~~~
chaostheory
Yes, many people I know working on ML don't remember statistics well enough.
Some stay ignorant and rely on on-staff mathematicians (who can't really
code), while others either start buying college textbooks or go to night
classes. There are too many people who don't understand the algorithms they're
using.

------
Dn_Ab
In case anyone clicks his link to Variational Methods and is confused to find
an article on quantum mechanics, as inspirational and arguably related as it
may be, I think he actually meant to link to:
<http://en.wikipedia.org/wiki/Variational_Bayesian_methods>

------
ma2rten
Right now NLP is mostly limited to niche applications, like sentiment analysis
and clever products built around it. I actually think the reason is that both
natural language processing and machine learning are still in their early
days.

Imagine all the applications for consumer products if algorithms were really
able to understand language (as far as you can understand something if you are
a computer program and not a sentient human being), for example if we were
able to do _real_ text summarization.

I believe this is not only possible, but not as far away as people think.
However, to reach that goal we need to let go of the idea that NLP is mostly
about clever feature engineering, and instead start building algorithms that
derive those features themselves. Part of the problem is how evaluation is set
up in NLP. Which algorithm is best is decided based on who gets the best
performance on some dataset. This sounds nice and objective, but you will
always be able to get the best performance if you try enough combinations of
features (overfitting the test set) [1]. These small improvements say little
about real-world performance.

For the NLP people among you, this is an interesting paper that tries to do a
lot of things differently:
<http://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf>

This is the corresponding tutorial, which is quite entertaining as well:
<http://videolectures.net/nips09_collobert_weston_dlnl/>

[1] I think this is less true for machine translation, where there are more
and bigger test sets and less feature engineering going on.

~~~
brendano
Careful with the Collobert ICML-2008 paper. It has a very negative reputation
among NLP researchers who actually know the area, just for its
setup/evaluation. If you're interested in the methods (which I think are
interesting), that group's later work is much improved.

~~~
ma2rten
Thanks, I will look into it.

------
ogrisel
Very nice article, Aria. You quickly mention Pegasos as a scalable alternative
to SMO. I agree that this works well for linear models. But despite the claim
that Pegasos can be trivially adapted to kernel models, I have never seen any
implementation of a kernel Pegasos and I don't understand how it's even
possible. Have you used a Pegasos-style algorithm to fit non-linear models?

On the other hand, there exist alternatives such as LaSVM that can effectively
scale linearly to large datasets (but the optimizer works in the dual
representation, as with SMO, not like Pegasos).

~~~
srconstantin
So...are you saying you need the dual formulation in order to allow a kernel
model?

~~~
ogrisel
No, actually that's not the case. But I don't know how Pegasos can be adapted
to use kernels. If you look at figure 1 of the paper [1], you will see that
the gradient of the objective function is used to update a single weight
vector `w` at each step of the projected stochastic gradient descent. In a
kernel model, all the support vectors cannot be collapsed into a single weight
vector `w`. You would need to handle the kernel expansion against the support
vectors explicitly. But then how do you select the support vectors out of all
the samples from the dataset while keeping the algorithm online? The Pegasos
paper does not mention it.

[1] [http://eprints.pascal-
network.org/archive/00004062/01/Shalev...](http://eprints.pascal-
network.org/archive/00004062/01/ShalevSiSr07.pdf)

~~~
sparsevector
The set of support vectors is just the set of training examples that have non-
zero alpha parameters. To implement the gradient update you just evaluate the
support vector machine on the example (using the explicit kernel expansion)
and then if the example has signed margin less than 1 you add y * eta to the
corresponding alpha value.
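
A minimal sketch of that update, assuming NumPy; the RBF kernel and the
function names here are illustrative, not taken from the paper's reference
implementation:

    import numpy as np

    def rbf(a, b, gamma=0.5):
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def kernel_pegasos(X, y, lam=0.1, n_iters=1000, kernel=rbf, seed=0):
        rng = np.random.RandomState(seed)
        alpha = np.zeros(len(X))          # one coefficient per training example
        for t in range(1, n_iters + 1):
            eta = 1.0 / (lam * t)
            i = rng.randint(len(X))
            # decision value via the explicit kernel expansion over the
            # current support set (examples with non-zero alpha)
            sv = np.nonzero(alpha)[0]
            f = sum(alpha[j] * kernel(X[j], X[i]) for j in sv)
            alpha *= (1.0 - eta * lam)    # shrink every coefficient
            if y[i] * f < 1:              # margin violation: add y * eta
                alpha[i] += eta * y[i]
        return alpha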

The difficulty with Pegasos for non linear kernels is the support set quickly
becomes very large and so evaluating the model becomes very slow. Note that
since the alpha values are not constrained to be non-negative (unlike the
standard dual algorithms) the alpha values don't ever get clipped to zero--
instead they just slowly converge to zero. It's still (I think) one of the
fastest methods in terms of theoretical convergence guarantees but perhaps not
as fast as LaSVM or something similar in practice.

However, there's been a more general trend in machine learning to use linear
models with lots of features instead of kernel models, partially because of
these sorts of scalability issues.

~~~
ogrisel
Thanks for the reply. I was told on Twitter that this is similar to the kernel
perceptron, which I don't know well either. There is a good introduction with
a Python code snippet here:

[http://www.mblondel.org/journal/2010/10/31/kernel-
perceptron...](http://www.mblondel.org/journal/2010/10/31/kernel-perceptron-
in-python/)

However it seems that you need to compute the kernel expansion on the full set
of samples (or maybe just the accumulated past samples?): this does not sound
very online to me...

~~~
sparsevector
It's true that you need to compute the kernel dot product between every
example you see and every example in the support set (every example that
previously evaluated to a signed margin < 1). Whether it's online depends on
your definition of "online". It's definitely not online in the sense of using
memory independent of the number of examples, since you have to keep around
the support set. I think there are results showing the support set grows
linearly with the size of the training set under reasonable assumptions.
However, it is online in the sense that it operates on a stream of data,
computing predictions and updates for each example one-by-one. It's also
online in the sense that its analysis is based on online learning theory (e.g.
mistake / regret bounds). A lot of learning theory papers use "online" in the
latter two senses, which is confusing if you expect the former.

------
TimPC
It's a very exciting time. I'm incredibly excited to see what goes on here. I
previously explored an online education start-up idea and I'm really looking
forward to seeing Ng and Koller change the world. I'm also very excited to
see machine learning on the radar. For me one of the biggest challenges is
often making AI intuitive. As machine learning becomes more mainstream it will
be on people's design radar and that will make it less hard to turn great
algorithms into great products.

~~~
3pt14159
Partially, although in my experience over the past 4 years doing this stuff,
1 hour of cleaning the input data gets you thrice the output of 1 hour of
tuning the algos. Some algorithms are more sensitive than others, but in
general, garbage in, garbage out.

~~~
TimPC
I think in many cases you're correct. My point wasn't about the performance of
the AI algorithms themselves though. In my experience most of the problems
where I want to use AI the algorithm itself performs adequately. Getting the
interaction with the algorithm sensible for a non-technical user is hard. If
AI becomes prevalent enough that UI/UX people start thinking about it, I
suspect it will be much easier to solve that problem, which to me is the
bigger business problem with AI.

------
mailshanx
I think this is pretty accurate. Here is an example from my own thesis
research: I'm using machine learning to tune an (underwater) communication
link, i.e. decide what modulation / error coding algorithms/parameters will
yield good data rates in a dynamic channel.

At first I tried using an off-the-shelf classifier to figure out which
parameters would work well. That failed because by the time I had sampled a
decent proportion of the possible parameter values, the channel would change
(the number of possible combinations is on the order of a few million).

It turned out that the real problem is not learning the performance of the
available parameters; rather, it lies in "learning how to learn": my ML system
needs to adaptively search the space by responding to the history of previous
explorations and their outcomes. This kind of exploration is effective only
with an understanding of how the underlying modulation/coding algorithms work
and interact with each other.
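
As a loose, hypothetical illustration of that "learning how to learn" idea
(not the thesis system itself): an epsilon-greedy, outcome-driven search over
parameter combinations, assuming NumPy; measure_throughput is a stand-in for
trying a setting on the real channel.

    import numpy as np

    def measure_throughput(params, rng):
        # Placeholder for the actual link measurement.
        return -np.sum((params - 0.3) ** 2) + rng.normal(scale=0.05)

    def adaptive_search(candidates, n_trials=200, eps=0.1, seed=0):
        rng = np.random.RandomState(seed)
        totals = np.zeros(len(candidates))
        counts = np.zeros(len(candidates))
        for _ in range(n_trials):
            if rng.rand() < eps or counts.sum() == 0:
                i = rng.randint(len(candidates))               # explore
            else:
                i = np.argmax(totals / np.maximum(counts, 1))  # exploit
            totals[i] += measure_throughput(candidates[i], rng)
            counts[i] += 1
        return candidates[int(np.argmax(totals / np.maximum(counts, 1)))]

    best = adaptive_search(np.random.RandomState(1).rand(50, 3))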

~~~
danieldk
Indeed. From our own experience: we use pretty much off-the-shelf maximum
entropy parameter estimators for parse disambiguation and fluency ranking. In
the past ~10 years most of the gain has come from smart feature engineering by
using linguistic insights, analyzing common classes of classification errors,
etc. Beyond L1 or L2 regularization, the use of (even) more sophisticated
machine learning algorithms/techniques has not yet given much, if any,
improvement for these tasks in our system.

What did help in understanding models is the application of newer feature
selection techniques that give a ranked list of features, such as grafting.
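
As a rough stand-in for that kind of ranked feature selection (not grafting
itself), one can rank features by the L1 penalty strength at which they first
become active; a sketch assuming scikit-learn and synthetic data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               random_state=0)
    first_seen = {}
    for C in np.logspace(-3, 1, 30):                  # weak -> strong model
        clf = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, y)
        for j in np.nonzero(clf.coef_[0])[0]:
            first_seen.setdefault(j, C)               # first C where active
    ranked = sorted(first_seen, key=first_seen.get)   # earliest entrants first
    print(ranked[:5])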

------
mwexler
Reading this reminded me of this recent post by Chris Dixon, which is also a
good read: [http://cdixon.org/2012/04/14/there-are-two-ways-to-make-
larg...](http://cdixon.org/2012/04/14/there-are-two-ways-to-make-large-
datasets-useful/)

------
seamusabshere
My for-profit company (Brighter Planet) often gets product ideas from our data
scientists; it's exactly what Dr. Haghighi is talking about.

For example: trying to model environmental impact of Bill Gates's 66,000 sq ft
house during a hackathon -> discovery that we need fuzzy set analysis
(<https://github.com/seamusabshere/fuzzy_infer>) -> new, marketable
capabilities in our hotel modelling product
([https://github.com/brighterplanet/lodging/blob/master/lib/lo...](https://github.com/brighterplanet/lodging/blob/master/lib/lodging/impact_model.rb)).

------
salimmadjd
I have enjoyed the author's other posts via his Prismatic blog. It's one of
the most interesting blogs to follow, with only a few posts so far. However,
this article falls a bit short. It feels rushed out, which is understandable.

I think it would have been better if this were just the first part of a multi-
article write-up on ML, with this one being an intro and follow-ups covering
specific approaches.

------
junktest
Probably try PCA (principal component analysis) first, to help select the most
important features of the data, before going further in modeling it.
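
A minimal sketch of that first pass, assuming NumPy and scikit-learn (the 95%
variance threshold and the random stand-in data are arbitrary):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.RandomState(0).normal(size=(200, 50))   # stand-in dataset
    pca = PCA(n_components=0.95)   # keep components explaining 95% of variance
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape, pca.explained_variance_ratio_[:5])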

------
marshallp
The article doesn't mention two important things (and instead focuses on being
clever - the opposite of what machine learning stands for). First, the deep
learning algorithms that automatically create features. Second, the importance
of gathering lots of data, or generating it.

If you have to be really clever with feature engineering, then what's the
point of even calling yourself a machine learning person?

~~~
ogrisel
I agree that deep learning is an interesting approach to learning higher-level
features. However, it's still a long way from being a universal solution: for
instance, deep learning won't help you solve machine translation or multi-
document text summarization automagically. You still need to find a good
(hence often task-dependent) representation for both your input data and the
data structure you are trying to learn a predictive model for.

