
Thoughts on Machine Learning – Dealing with Skewed Classes - HaFl
http://florianhartl.com/thoughts-on-machine-learning-dealing-with-skewed-classes.html
======
bravura
This blog post presents good techniques, but misses the point.

The most important thing to do when designing a machine learning algorithm is
to _choose your loss function_. The loss function quantifies how much your
model sucks.

Ideally, your loss function should be the true loss in real life. In a
business, your true loss function is the negative of profit over the lifetime
of the business. A lot of the time it's difficult to measure the true loss, so
your loss function should be a "surrogate" (an approximation) of the true
loss.

The author is poking around with techniques, without considering what he
actually wants to optimize.

Brandon Ballinger said it well:
<http://www.quora.com/What-are-the-keys-to-operationalizing-a-machine-learning-ranking-system-from-an-organization-engineering-management-point-of-view/answer/Brandon-Ballinger>

"Make your success metric user happiness. Traditional accuracy measures like
precision, square error, ROC, etc. don't capture what you really care about--
how users react to your model. For example, if you're running an ad system,
your metrics should be revenue per pageview and click through rate. It's
completely possible to have a model with a reduced error rate which lowers
revenue, due to Simpson's paradox and a host of other reasons."

The author of this post hasn't really considered these questions, and gets all
hand-wavey about precision and recall vs. confusion matrices. How do you know
that any of these are what you _really_ care about?

So: What's the value of a false positive? What's the value of a false
negative?

Here's where it gets interesting. For airport screening, one person might say
that a false negative (terrorist slips through and kills 100 people) is 100M
times worse than a false positive (I strip-search an innocent person). So this
person has made a value judgment that the death of a random innocent person is
1 million times worse than humiliating an innocent person. The TSA, on the
other hand, decides that a false negative is 100B times worse than a false
positive, so they assign 1000 times as much value to catching terrorists vs.
humiliating people as the other person.

Who's right? It's a value judgment, and machine learning can't answer that.

What I'm getting at here is that the choice of a loss function is ultimately
an _aesthetic_ decision. Once we can optimize loss perfectly, what loss will
we optimize?
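
(A concrete footnote on the FP/FN question, mine rather than the author's:
once you do commit to explicit costs, the optimal decision threshold on a
calibrated predicted probability falls straight out of expected-cost
minimization.)

    def optimal_threshold(cost_fp, cost_fn):
        """Predict positive when p(positive) exceeds this threshold.

        Derivation: flag when p * cost_fn > (1 - p) * cost_fp,
        i.e. when p > cost_fp / (cost_fp + cost_fn).
        """
        return cost_fp / (cost_fp + cost_fn)

    # The airport example: valuing a false negative 100M times worse than a
    # false positive drives the threshold for searching someone to nearly zero.
    print(optimal_threshold(cost_fp=1.0, cost_fn=100e6))  # ~1e-8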

~~~
rm999
I used to build fraud models where we had insane amounts of negative data and
precious positive data. Our performance metric was well-reasoned; as you
mention, we cared about the happiness of our customer, and baked all sorts of
useful information into our evaluation metrics.

The problem is that there is not always a cheap way to encode a complex
evaluation metric into a loss function for many machine learning algorithms.
In fact, most of the really popular algorithms are popular because they work
well mathematically - for example, generalized linear models are convex under
a simple quadratic loss, and neural networks have gradients when their loss
functions are differentiable.
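
To make that concrete, here's a small sketch (mine, not rm999's code) of the
cheapest common compromise: a class-weighted logistic loss, which keeps
convexity and clean gradients while nudging the objective toward the rare
class. The weights are illustrative, not from the comment.

    import numpy as np

    def weighted_logistic_loss_grad(theta, X, y, w_pos=10.0, w_neg=1.0):
        """Loss and gradient for class-weighted logistic regression."""
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # predicted probabilities
        w = np.where(y == 1, w_pos, w_neg)     # per-example class weights
        eps = 1e-12                            # numerical safety for log()
        loss = -np.sum(w * (y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
        grad = X.T @ (w * (p - y))             # still convex in theta
        return loss, grad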

Also, I found optimizing for simple but sensible loss functions often gave
pretty good performance. From a practical perspective, most of the gains of
understanding the performance metrics we cared about came in the data
preparation, evaluation and calibration stages of modeling, not model
selection or training.

------
iskander
Another trick I've seen in biomedical data mining is synthesizing new samples
from the under-represented class (using an algorithm like SMOTE).
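
For anyone unfamiliar with it, here's a minimal sketch using the
imbalanced-learn package (my example, not iskander's code). SMOTE synthesizes
new minority samples by interpolating between a minority point and one of its
k nearest minority-class neighbors.

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # Synthetic 1%-positive dataset standing in for real biomedical data.
    X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

    X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
    print(y.mean(), y_res.mean())  # class balance before vs. after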

------
hooande
Another option could be to use an ensemble of exemplar SVMs [1], essentially
training one separate classifier per positive example, with that single
target row set to 1 and all others to 0. This can be useful when trying to
pick out examples that match a very specific pattern.

[1] <http://www.cs.cmu.edu/~tmalisie/projects/iccv11/>
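
A rough sketch of that idea (mine, not the paper's code): one linear SVM per
positive exemplar, with the lone positive heavily up-weighted against the
shared negative set. The weights and C below are illustrative.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_exemplar_svms(X_pos, X_neg, pos_weight=50.0):
        models = []
        for exemplar in X_pos:
            X = np.vstack([exemplar[None, :], X_neg])      # 1 positive row
            y = np.r_[1, -np.ones(len(X_neg), dtype=int)]  # rest negative
            clf = LinearSVC(C=0.1, class_weight={1: pos_weight, -1: 1.0})
            models.append(clf.fit(X, y))
        return models

    # Score new points by the max decision_function over all exemplar models.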

------
nwenzel
Good post on the exact problem I ran into when developing a machine learning
algorithm to identify rule violations (basically fraud) based on self-reported
descriptive text data. The few records that are "fraud" are precious from a
training standpoint.

I found it helpful to start by over-representing the rare class and then
slowly bring the balance back to the observed percentages.
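
In code, that schedule might look something like this (my interpretation, not
nwenzel's implementation; assumes X_train/y_train with a 0/1 label):

    import numpy as np

    def oversample_indices(y, factor, rng):
        """Indices with the rare (positive) class duplicated `factor` times."""
        pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
        idx = np.concatenate([neg] + [pos] * int(factor))
        rng.shuffle(idx)
        return idx

    rng = np.random.default_rng(0)
    for factor in [50, 20, 8, 3, 1]:        # decay back to observed balance
        idx = oversample_indices(y_train, factor, rng)
        # model.fit(X_train[idx], y_train[idx])  # retrain each round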

The main reason for my over-representation of the rare class was that the rare
case was often not correctly identified (many false negatives) in the so-
called "Gold Standard" training data. By over-representing the rare case, I
was able to build a better training set while tuning my algorithm.

I'd be interested to hear if others flush out false positives in a
similar/different manner (for the Skewed/Rare Class problems described by the
article).

------
Rickasaurus
One thing that may work well here is to train two models and ask which one
fits the data better (by likelihood), or to use the full weight of the data
to train submodels and then leverage those in the classifiers.
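
A quick sketch of the two-model version (my reading of the comment, using
scikit-learn mixtures as stand-in generative models):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_per_class(X, y, n_components=5):
        """One generative model per class."""
        return {c: GaussianMixture(n_components).fit(X[y == c])
                for c in np.unique(y)}

    def predict_by_likelihood(models, X):
        """Pick the class whose model assigns the higher log-likelihood."""
        classes = np.array(list(models.keys()))
        scores = np.column_stack([m.score_samples(X) for m in models.values()])
        return classes[scores.argmax(axis=1)]

Adding the log class priors to the scores would turn this into a proper Bayes
classifier; leaving them out effectively up-weights the rare class.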

Fraud is particularly difficult though, because the entities are actively
trying to thwart your attempts to detect them. Outlier detection is certainly
a must. Collective entity resolution helps a lot too. It might even be worth
seeing if you can use LDA to cluster the fraudsters together.

It's just not a domain suited to simple models or a singular approach.

------
tumanian
Precision and recall are a great way of looking at overall accuracy, but ROC
curves and ROC AUC are highly recommended as model selection criteria on
imbalanced datasets. For SVMs you can do a two-dimensional grid search over C
and gamma, looking for the maximum ROC AUC.

I also use per-class weights - by default the SVM (libSVM, libLinear) cost
parameter is the same for both classes. Penalize the classifier for false
negatives more than for false positives (an order of magnitude more).
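
In scikit-learn terms, both suggestions combine into something like this (a
hedged sketch; the grid and weights are illustrative, and the roc_auc scorer
works directly with SVC's decision_function):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
    svm = SVC(kernel="rbf", class_weight={0: 1, 1: 10})  # rare class = 1
    search = GridSearchCV(svm, param_grid, scoring="roc_auc", cv=5)
    # search.fit(X, y); search.best_params_  # assuming X, y are your data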

------
picsoung
Totally new field for me, very interesting thoughts :)

------
HaFl
hooande, thanks for the additional info. I'll definitely take a closer look at
it.

------
marshallp
How about

\- make your classifier get to 100% accuracy (or close enough).

Also, do it automatically:

\- use an unsupervised learning stage - then a simple classifier (so you
don't overfit). Repeat, growing the size of your unsupervised stage, followed
by the simple classifier, until you get 100%.
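
One concrete reading of that recipe (my sketch, not marshallp's): an
unsupervised representation-learning stage feeding a deliberately simple
classifier, so the supervised part has little capacity to overfit.

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    model = Pipeline([
        ("unsup", PCA(n_components=50)),             # unsupervised stage
        ("clf", LogisticRegression(max_iter=1000)),  # simple classifier
    ])
    # model.fit(X, y)  # repeat with larger n_components to grow the stage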

~~~
fiatmoney
That approach is almost exactly that of Deep Learning / stacked Restricted
Boltzmann Machines.

~~~
marshallp
yeah, exactly

