
The Three Cultures of Machine Learning - fforflo
http://cs.jhu.edu/~jason/tutorials/ml-simplex.html
======
argonaut
I don't know why people are criticizing this. I did research in ML and while
this is obviously reductionist, it does accurately reflect one way you might
divide AI/ML research into 3 broad categories.

~~~
gunnm
It seems to correspond nicely to intelligence, knowledge, and wisdom.

------
solidrocketfuel
The Geneticists: Use evolutionary principles to have a model organize itself

The Bayesians: Pick good priors and use Bayesian statistics

The Symbolists: Use top-down approaches to modeling cognition, using symbols
and hand-crafted features

The Conspirators: Hinton, LeCun, Bengio et al. End-to-end deep learning
without manual feature engineering

The Swiss School: Schmidhuber et al. LSTMs as a path to general AI.

The Russians: Use Support Vector Machines and their strong theoretical
foundation

The Competitors: Only care about performance and generalization robustness.
Not shy about building extremely slow and complex models.

The Speed Freaks: Care about fast convergence, simplicity, online learning,
ease of use, scalability.

The Tree Huggers: Use mostly tree-based models, like Random Forests and
Gradient Boosted Decision Trees

The Compressors: View cognition as compression. Compressed sensing,
approximate matrix factorization

The Kitchen-sinkers: View learning as brute-force computation. Throw lots of
feature transforms and random models and kernels at a problem

The Reinforcement learners: Look for feedback loops to add to the problem
definition. The environment of the model is important.

The Complexities: Use methods and approaches from physics, dynamical systems
and complexity/information theory.

The Theorists: Will not use a method if there is no clear theory to explain
it

The Pragmatists: Will use an effective method to show that there needs to be
a theory to explain it

The Cognitive Scientists: Build machine learning models to better understand
(human) cognition

The Doom-sayers: ML Practitioners who worry about the singularity and care
about beating human performance

The Socialists: View machine learning as a possible danger to society. Study
algorithmic bias.

The Engineers: Worry about implementation, pipe-line jungles, drift, data
quality.

The Combiners: Try to use the strengths of different approaches, while
eliminating their weaknesses.

The PAC Learners: Search for the best hypothesis that is both accurate and
computationally tractable.

See also [http://www.kdnuggets.com/2015/03/all-machine-learning-models-have-flaws.html](http://www.kdnuggets.com/2015/03/all-machine-learning-models-have-flaws.html)

> It is common for people to learn about machine learning within one framework
> which often becomes their "home framework" through which they attempt to
> filter all machine learning. (Have you met people who can only think in
> terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly
> understanding the existence of these other frameworks can help resolve the
> confusion.

~~~
YeGoblynQueenne
Nice list, has everything and the kitchen sink(ers). Two comments:

a) Inductive Logic Programming (and generally relational learning) is ideal
for feature discovery and firmly in the symbolic camp, so the reliance of the
Symbolists on hand-crafted features is not an absolute.

b) PAC learning should go under symbolic techniques, no? In fact, so should
decision tree learning.

Also, I think it's obvious you can always unify and divide classifications
like the above to come up with as many or as few "tribes" as you like. The
real question is: are there really that many people who are wedded to their
favourite technique, so much so that they won't ever try anything different?

~~~
solidrocketfuel
a) perhaps "feature engineering" was not the right 2-gram. I was looking for
the (cultural) difference in approaches. Logic programming starts with
background knowledge, predicate logic, and hand-written rules on what is valid
and what isn't. Deep learning tries to learn this bottom-up. When DL finds a
rule or fact, it does so from data, not by using any pre-defined rules or
facts.

b) if we apply hierarchical clustering, it would probably be a subset.

Anyway, this was more or less tongue-in-cheek. And yes, you could go on and
on. I should have added "The Logicians", "The Game Theorists" and the NLP'ers
solving object detection problems with visual bag-of-words. Also forgot to
take a jab at business intelligence/operations research.

As for being wedded to a favorite technique, I think that is largely a problem
for beginners (and PhD students with a supervisor who can only think from
within a certain framework). I myself may try SVM, but I rank it pretty low as
an alternative.

~~~
YeGoblynQueenne
a) Weelll, ish :) DL really doesn't need any more direction than what's in the
data, but with ILP (Inductive Logic Programming) you also don't _need_ to have
any background knowledge. And it can totally discover its own features.

Anyway it depends a lot on the specific algorithm; for instance see Alignment-
Based Learning [1] and the ADIOS algorithm [2] for two examples of thoroughly
symbolic grammar induction algorithms (though not quite ILP) that work on
unannotated, tokenised text, so are entirely unsupervised.

And, if I may be so bold, my own Masters dissertation [3], an unsupervised
graph induction algo that learns a Prolog program from unannotated data. You
won't find evidence of that on my github page, but I've used my algorithm to
extract features from text, as in word embeddings. Also, it's a recursive
partitioning algorithm, so essentially an unsupervised decision tree learner,
only of course it learns a FOL theory rather than propositional trees. My
hunch is you could use decision trees unsupervised and let them find their own
features, although that'll have to go under Original Research for now :)

Those just happen to be three algorithms I know well enough, but you can
google around for more examples. In general: relational learning can do away
with the need for feature engineering, it's one of its big strengths.

In fact, I'm starting to think that - unless DL is somehow magical and special
- it should be possible to turn most supervised learners into unsupervised
feature learners, by stringing together instances of an algorithm and having
each instance learn on features the previous one discovered. Again, Original
Research and a big pinch of salt 'cause it's just a hunch.

[1]
[http://ilk.uvt.nl/menno/research/software/abl](http://ilk.uvt.nl/menno/research/software/abl)
[2]
[http://adios.tau.ac.il/algorithm.html](http://adios.tau.ac.il/algorithm.html)
[3] [https://github.com/stassa/THELEMA](https://github.com/stassa/THELEMA)

------
rayuela
That was actually worse than the Buzzfeed slide show I was expecting it to be.
There was absolutely nothing informative about that. @solidrocketfuel
references a far more insightful list and Pedro Domingos' new book and recent
talks also go more in depth into the very nuanced history of ML, for anyone
interested.

------
stared
It may depend on whom you talk to, but I have the impression of three
cultures:

- statisticians (frequentist statistics, Bayesian statistics; trying to find
an underlying probabilistic model, even at the cost of underfitting)

- computer scientists (algorithms; ad-hoc methodology with the goal of
predicting data as well as possible)

- physicists (physics-motivated tools, relatively clean and composable
mathematics; trying to get at some properties of phenomena, even at the cost
of cherry-picking "spheres in a vacuum")

~~~
tormeh
Old joke: A farmer has a problem with his hens not laying any eggs, and he
goes to a physicist for help. "Well", says the physicist, "I can help you if I
can assume that the eggs are perfect cubes, and that the egg-laying process
happens in a frictionless, airless existence with no gravity".

------
asymptotic
For the deep learning vertex the OP states:

"At the right vertex, we have Breiman's know-nothing approach—high-capacity
models like neural nets, decision forests, and nonparametrics that will fit
anything given enough data. This is engineering with less science (see these
remarks). Deep learning people cluster here."

The phrase "will fit anything given enough data" is misleading and incorrect
about cutting-edge machine learning methods. "Fit" on its own is a useless
term; you will instead find people talking about "bias" and "variance".

For any supervised method (where you know the intended outputs) that you apply
to predict data, there are three sources of error: bias, variance, and
irreducible (random) error. Random error is unpredictability that cannot be
modeled. Bias comes from bad assumptions made by the model itself (e.g. the
model is too simple). Variance is sensitivity to small changes in the data the
algorithm is trained on.

High bias means the model is too simple to capture all the variations in the
data set. High variance means the model is overfit to the data at hand and is
not successfully generalizing to unseen data. In real-world problems there is
a direct tradeoff between bias and variance. Nevertheless, the goal of any
supervised learning model is to have both low bias and low variance.

By splitting off a big (~10-20%) chunk of all data available into a "test"
set, training the model on the remaining "train" set, then evaluating it on
the "test" set, it's possible to estimate the generalizability of the model on
future "unseen" data by whatever metric you want. By additionally plotting
learning curves one can crudely estimate whether we have high bias or high
variance.
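The hold-out estimate described above can be sketched in a few lines of plain
Python. The 1-nearest-neighbour "model" and the toy linear data are invented
purely for illustration; the point is the train/test gap:

```python
import random

random.seed(0)

# Toy data: y = 2x + Gaussian noise, with 200 distinct x values
data = [(x, 2 * x + random.gauss(0, 1.0)) for x in (i / 10 for i in range(200))]
random.shuffle(data)

split = int(0.8 * len(data))          # hold out ~20% as a "test" set
train, test = data[:split], data[split:]

def predict_1nn(x):
    # 1-nearest-neighbour: a high-variance model that memorises training data
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(points):
    return sum((predict_1nn(x) - y) ** 2 for x, y in points) / len(points)

train_err, test_err = mse(train), mse(test)
# Train error is exactly zero here (each point is its own nearest neighbour);
# the positive test error is the honest estimate of generalization
print(f"train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

The gap between the two numbers is what the learning-curve diagnostics above
are reading off: near-zero train error with nonzero test error signals high
variance.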

Hence the insinuation that machine learning blindly "fits" data as much as
possible is false. Sophisticated (yet not difficult) methods both minimize and
estimate the generalizability of the model to future unseen data (minimizing
variance), inevitably at the cost of some notion of accuracy (increasing
bias).

I think the OP's objection is that such ML methods "know nothing". This is a
trivial statement to make. Rather, I would turn the objection on its head and
ask "If our methods achieve acceptable estimated generalizability on unseen
data, do we need to know anything?". This reminds me of Alan Turing's
arguments about machines passing the Turing Test vs. "are they really human?".

~~~
argonaut
While everything you've said is technically true, straight out of the
textbook, I don't really see how it contradicts what he says.

These high-capacity models (neural nets, decision trees, boosting) do overfit
like crazy and tend to be used as black boxes without any domain knowledge.
The key in his statement is when he says "given enough data," because having
_tons_ of data is one of the best ways to combat overfitting (given enough
data, variance is negligible). And the fact that we can measure how much they
overfit and take steps to regularize doesn't change the fact that, for
example, deep learning is really way more of an engineering discipline than a
mathematical or statistical discipline. And these are not criticisms of those
areas at all: those are exciting areas of research precisely because there are
so many unsolved problems and areas where we are working without a solid
understanding!
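The "given enough data, variance is negligible" point lends itself to a toy
experiment. Everything here is invented for the sketch (a crude local-average
"model" on synthetic data): retrain on fresh samples of size n and measure the
spread of the prediction at a fixed point, which shrinks as n grows:

```python
import random

random.seed(0)

def fit_mean_near(train, x0, radius=0.25):
    # A crude local-average model; its prediction at x0 changes run to run
    near = [y for x, y in train if abs(x - x0) < radius]
    return sum(near) / len(near) if near else 0.0

def prediction_variance(n, trials=200):
    # Retrain on fresh samples of size n; variance of the prediction at x0 = 1.0
    preds = []
    for _ in range(trials):
        train = [(x, 2 * x + random.gauss(0, 1.0))
                 for x in (random.uniform(0, 2) for _ in range(n))]
        preds.append(fit_mean_near(train, 1.0))
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)

v_small, v_big = prediction_variance(20), prediction_variance(2000)
# The spread with n = 2000 is roughly 100x smaller than with n = 20
print(f"variance with n=20: {v_small:.4f}, with n=2000: {v_big:.4f}")
```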

------
YeGoblynQueenne
* Estimators for [Bayesian and Deep learning] approaches usually have to solve intractable optimization problems. Thus, they fall back on approximations and get stuck in local maxima, and you don't really know what you're getting. *

So, with deep nets, how big a problem is getting stuck in local optima?

I mean, my intuition is that it's not magic, you're optimising a system of
functions so you'd get stuck in local optima in the same way you get stuck
with a single function. Is that generally the case?

~~~
cinjon
It seems that local minima aren't actually a problem in these high-dimensional
spaces, because most of them are either very close to the global minimum or
are saddle points, which can be escaped. There was a lot of work on this out
of Bengio's lab. One such paper was Dauphin's "Identifying and attacking the
saddle point..."
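The saddle-point intuition shows up already on the textbook two-variable
example f(x, y) = x^2 - y^2, which has a saddle at the origin (nothing to do
with real deep nets, just a sketch): gradient descent started exactly on the
saddle's attracting axis stays stuck, while any tiny perturbation off it
escapes:

```python
def grad(x, y):
    # Gradient of f(x, y) = x^2 - y^2, which has a saddle point at the origin
    return 2 * x, -2 * y

def descend(x, y, lr=0.1, steps=100):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# Started exactly on the saddle's attracting axis: y stays 0, x -> 0 (stuck)
print(descend(1.0, 0.0))

# A tiny perturbation in y grows by a factor of 1.2 per step and escapes
print(descend(1.0, 1e-8))
```

In one dimension you have no such luck: at a local minimum every direction
goes up. In high dimensions a critical point needs the curvature to be
positive along *every* axis to trap you, which is what makes saddles so much
more common than bad local minima.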

~~~
YeGoblynQueenne
Thanks, I'll see if I can find this paper.

------
Houshalter
We should cluster machine learning papers or citations and see what clusters
actually form.
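A back-of-the-envelope version of that experiment (with invented toy
"abstracts" standing in for real paper text) could be bag-of-words cosine
similarity plus a most-similar-paper grouping:

```python
from collections import Counter
import math

# Hypothetical toy "abstracts" standing in for real paper text
papers = {
    "svm_theory":  "kernel margin support vector statistical learning theory bound",
    "svm_apps":    "support vector machine kernel classification margin",
    "deepnet_cnn": "convolutional neural network layer gradient backpropagation",
    "deepnet_rnn": "recurrent neural network gradient lstm backpropagation layer",
}

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

vecs = {name: Counter(text.split()) for name, text in papers.items()}

# Group each paper with its most similar other paper
nearest = {name: max((o for o in vecs if o != name),
                     key=lambda o: cosine(vecs[name], vecs[o]))
           for name in vecs}
print(nearest)
```

On this toy input the two SVM "papers" pair up with each other and so do the
two deep-net ones; with real abstracts or citation graphs you would swap in
proper TF-IDF vectors and a clustering algorithm, but the question of which
clusters emerge is the same.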

~~~
liquidmetal
And then submit a paper to the same conference from where we collected the
papers.

