

Large-Scale Machine Learning at Twitter [pdf] - Anon84
http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf

======
bravura
Okay, so the pluses and minuses of this work.

A fair bit of this paper goes into the technical details of the
infrastructure. This is commendable, because these details matter in applied
work.

The authors devise a method for large-scale _supervised_ classification.
Essentially, the data is partitioned between nodes, and each node trains an
L2-regularized logistic model on its data using SGD. The individual models are
then aggregated into an ensemble. This approach is obvious but non-standard
(I'll describe why later).
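For concreteness, here is a toy numpy mock-up of that scheme as I understand it (my own sketch, not the authors' code): split the data across "nodes", fit an L2-regularized logreg per partition with plain SGD, then average the predicted class probabilities.

```python
import numpy as np

def train_logreg_sgd(X, y, lam=1e-3, lr=0.1, epochs=5, seed=0):
    """Fit one L2-regularized logistic model with plain SGD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w -= lr * ((p - y[i]) * X[i] + lam * w)  # grad of log-loss + L2
    return w

def ensemble_predict(models, X):
    """Average per-partition class probabilities, then threshold."""
    probs = np.mean([1.0 / (1.0 + np.exp(-X @ w)) for w in models], axis=0)
    return (probs >= 0.5).astype(int)

# toy data: two Gaussian blobs, shuffled, then split across 4 "nodes"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]

models = [train_logreg_sgd(Xp, yp)
          for Xp, yp in zip(np.array_split(X, 4), np.array_split(y, 4))]
acc = (ensemble_predict(models, X) == y).mean()
```

The paper's real pipeline runs this inside Pig/Hadoop over tweets, of course; the sketch only captures the partition-train-aggregate structure.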

I believe this methodology is of limited applicability. The experiments are
supervised sentiment classification on 100M tweets. But it is simple and
fast to do supervised learning over 1B instances _on a single box_. And there
are _very few_ tasks that have >>1B _labeled_ training examples.

In most "large-scale" real-world tasks, there are 10 to 1M labeled training
examples, and 100M to jillions of _unlabeled_ training examples. So what we
really care about is unsupervised learning (over unlabeled examples) followed
by supervised learning (over labeled examples), or semi-supervised learning
(over unlabeled and labeled examples together). _Distributed_ unsupervised
learning is challenging. It is an active research area with only a few results
and a lot of untapped potential.

One claim in the paper is surprising: "As is often the case for many machine
learning problems, and confirmed in our experiments (see Section 6), an
ensemble of classifiers trained on partitions of a large dataset outperforms a
single classifier trained on the entire dataset (more precisely, lowers the
variance component in error)." First of all, I'm surprised by the claim
itself: that an ensemble trained on partitions outperforms a single classifier
trained on the entire dataset. It is true that an ensemble
(forest) of decision trees is better than a single decision tree. This is like
adding another layer in a neural network, and allows more powerful models to
be expressed more compactly. But, crucially, _an ensemble of linear models is
just a linear model_. So in fact, there should be no variance reduction in the
authors' approach. It should be better to train a single linear model on ALL
the data, rather than an ensemble of linear models on subsets of the data. I
am surprised that the ensemble of linear models outperforms the single model.
Further investigation is needed to understand why. My gut feeling is that
there is a problem in their experiments.
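To spell out the "an ensemble of linear models is just a linear model" point in the raw-score (regression) setting, with toy numbers of my own: averaging the outputs of several linear models is exactly a single linear model whose weights are the average, so the ensemble adds no capacity there.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W = rng.normal(size=(10, 5))  # ten "partition" linear models, one row each

# average each model's raw score vs. one model with averaged weights
avg_of_scores = np.mean(X @ W.T, axis=1)
score_of_avg = X @ W.mean(axis=0)

same = np.allclose(avg_of_scores, score_of_avg)
```

Whether the paper's aggregation (voting or probability averaging) escapes this equivalence is exactly what's discussed below.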

 _tl;dr_ The machine learning models will have limited impact, because they
are purely supervised. But check out the technical details and infrastructure
portion of the paper.

~~~
ot
> an ensemble of linear models is just a linear model

Doesn't this depend on how they aggregate the results? In the paper they say
that they do majority voting or averaging of the class probabilities, and in
both cases the decision surface is not linear: before averaging, the response
of each linear model is passed through a step function (in the first case) or
a logistic function (in the second).

I think what you say is true in a regression setting, if you just average the
predictions of the linear models, but here they are talking about
classification.
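A quick numerical check (toy weights I made up) shows the gap: averaging the logistic outputs of two models is not the same as applying a logistic to the averaged score, so the averaged-probability ensemble is not itself a single logreg model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# two hypothetical per-partition models and one input point
w1, w2 = np.array([4.0, 0.0]), np.array([0.0, 2.0])
x = np.array([2.0, -2.0])

mean_of_probs = (sigmoid(x @ w1) + sigmoid(x @ w2)) / 2  # ensemble average
prob_of_mean = sigmoid(x @ ((w1 + w2) / 2))              # single averaged model

gap = abs(mean_of_probs - prob_of_mean)
```

Here the raw scores are 8 and -4: the ensemble averages a near-1 and a near-0 probability to about 0.51, while the averaged model gives sigmoid(2), about 0.88.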

~~~
bravura
I was being sloppy in my description.

You are right that they are applying a non-linear function (a step function or
a logistic) to the output of each linear model, and then aggregating. I am
_not sure_ whether this has the same modeling power as a single linear or
logreg model. I suspect it does, but I haven't done the math.

I am not aware of results that say that an ensemble of linear or logreg
models, trained over partitions of the data, is better than a single linear or
logreg model trained over _all_ the data. My understanding was that the
opposite was true. Most of the positive results for ensemble methods and
bagging are for cases where the base classifiers are decision stumps or
decision trees, in which case it is clear that the ensemble has more modeling
power.

~~~
ot
I don't see why it should have equivalent modeling power; the set of decision
surfaces it can represent is strictly larger.

~~~
bravura
_puts on theoretician's hat_

You are correct, it can represent more decision surfaces.

 _takes off theoretician's hat_

In practice, my intuition tells me that aggregating over an ensemble of logreg
models trained on subsets of the data is inferior to training a single logreg
model over the entire data.

I apologize for being sloppy, though.

------
ot
Also check out these slides by John Langford from a recent workshop. The first
example is distributed training of a linear model from 17B examples in 70
minutes.

"3 Fun Machine Learning Problems for Big Data"

[http://people.cs.umass.edu/~mcgregor/stocworkshop/langford.p...](http://people.cs.umass.edu/~mcgregor/stocworkshop/langford.pdf)

------
margold
Really interesting to a recent grad like me, who's had a hard time imagining
how things learnt at university get applied at 'big company scale'.

