Thoughts on Machine Learning – Dealing with Skewed Classes (florianhartl.com)
41 points by HaFl on Aug 27, 2012 | 14 comments



This blog post presents good techniques, but misses the point.

The most important thing to do when designing a machine learning algorithm is to choose your loss function. The loss function quantifies how much your model sucks.

Ideally, your loss function should be the true loss you face in real life. In a business, your true loss function is the negative of profit over the lifetime of the business. A lot of the time it's difficult to measure the true loss, so your loss function should be a "surrogate" (approximation) of the true loss.

The author is poking around with techniques, without considering what he actually wants to optimize.

Brandon Ballinger said it well: http://www.quora.com/What-are-the-keys-to-operationalizing-a...

"Make your success metric user happiness. Traditional accuracy measures like precision, square error, ROC, etc. don't capture what you really care about--how users react to your model. For example, if you're running an ad system, your metrics should be revenue per pageview and click through rate. It's completely possible to have a model with a reduced error rate which lowers revenue, due to Simpson's paradox and a host of other reasons."

The author of this post hasn't really considered these questions, and gets all hand-wavey about precision and recall vs. confusion matrices. How do you know that any of these are what you really care about?

So: What's the value of a false positive? What's the value of a false negative?

Here's where it gets interesting. For airport screening, one person might say that a false negative (terrorist slips through and kills 100 people) is 100M times worse than a false positive (I strip-search an innocent person). So this person has made a value judgment that the death of a random innocent person is 1 million times worse than humiliating an innocent person. The TSA, on the other hand, decides that a false negative is 100B times worse than a false positive, so they assign 1000 times as much value to catching terrorists vs. humiliating people as the other person.

Who's right? It's a value judgment, and machine learning can't answer that.

What I'm getting at here is that the choice of a loss function is ultimately an aesthetic decision. Once we can optimize loss perfectly, what loss will we optimize?
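
A rough, purely illustrative way to see how much that value judgment matters (all numbers hypothetical): plug the cost ratio into the standard expected-cost decision rule and look at the threshold it implies.

    # Sketch: an asymmetric cost ratio fixes the decision threshold of a
    # probabilistic classifier.  All numbers here are hypothetical.
    cost_fp = 1.0     # cost of a false positive (searching an innocent person)
    cost_fn = 1.0e8   # cost of a false negative, judged 100M times worse

    # Flag when p * cost_fn > (1 - p) * cost_fp, i.e. when
    # p > cost_fp / (cost_fp + cost_fn)
    threshold = cost_fp / (cost_fp + cost_fn)
    print(threshold)  # ~1e-8: with this value judgment, nearly everyone gets flagged

Change the cost ratio and the exact same model makes completely different decisions.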


I used to build fraud models where we had insane amounts of negative data and precious positive data. Our performance metric was well-reasoned; as you mention, we cared about the happiness of our customer, and baked all sorts of useful information into our evaluation metrics.

The problem is that there is not always a cheap way to encode a complex evaluation metric into a loss function for many machine learning algorithms. In fact, most of the really popular algorithms are popular because they work well mathematically - for example, generalized linear models are convex under a simple quadratic loss, and neural networks can be trained by gradient descent only when their loss functions are differentiable.

Also, I found that optimizing for simple but sensible loss functions often gave pretty good performance. From a practical perspective, most of the gains from understanding the performance metrics we cared about came in the data preparation, evaluation, and calibration stages of modeling, not in model selection or training.
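
For what it's worth, a minimal sketch of that split, with toy data and made-up per-error costs: train on a plain surrogate loss, then pick the operating threshold on held-out data using the metric you actually care about.

    # Sketch: train with a simple surrogate loss (log loss), then choose the
    # decision threshold on held-out data using a business-style cost metric.
    # The per-error costs and toy data are made up for illustration.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

    def business_cost(y_true, y_flag, cost_fp=1.0, cost_fn=50.0):
        # Hypothetical per-error costs; a real system would use measured ones.
        fp = np.sum((y_flag == 1) & (y_true == 0))
        fn = np.sum((y_flag == 0) & (y_true == 1))
        return cost_fp * fp + cost_fn * fn

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # plain log-loss training
    p_val = model.predict_proba(X_val)[:, 1]

    # Pick the operating threshold with the metric we actually care about.
    thresholds = np.linspace(0.01, 0.99, 99)
    best_t = min(thresholds, key=lambda t: business_cost(y_val, (p_val >= t).astype(int)))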


It's a great point, and well articulated.

In the case of the example in the post (fraud), the credit card industry has actually termed the false positive rate the "insult rate" (because the customer feels insulted in the store that their charge has been declined), which certainly gives you an idea of how they feel about false positives.


I love this point and reiterate it constantly. ML, on its own, is quite often very dumb. It's simply a good tool that must be wielded with understanding, intuition, and ingenuity if you're going to make it valuable.


Another trick I've seen in biomedical data mining is synthesizing new samples from the under-represented class (using an algorithm like SMOTE).
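
For anyone curious, a minimal sketch of that trick using the imbalanced-learn implementation of SMOTE (toy data; in practice you would resample only the training split):

    # Sketch: oversample the rare class with SMOTE (imbalanced-learn).
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
    print(Counter(y))          # heavily skewed, e.g. {0: 4900, 1: 100}

    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_res))      # balanced via synthetic minority samples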


Another option could be to use an ensemble of exemplar SVMs [1], essentially training separate classifiers, each with a single target row set to 1 and all others set to 0. This can be useful when trying to pick out examples that match a very specific pattern.

[1] http://www.cs.cmu.edu/~tmalisie/projects/iccv11/
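
A rough sketch of that idea (heavily simplified relative to [1]; the per-class weight is an arbitrary choice to keep the lone positive from being ignored):

    # Sketch of an exemplar-SVM ensemble: one linear classifier per positive
    # example, each trained with that single row as the positive class and
    # everything else as negatives.  Simplified relative to [1].
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_exemplar_svms(X, y, C=1.0):
        exemplars = np.where(y == 1)[0]
        models = []
        for i in exemplars:
            labels = np.zeros(len(y), dtype=int)
            labels[i] = 1                                   # a single positive row
            clf = LinearSVC(C=C, class_weight={0: 1.0, 1: 100.0})  # upweight the lone positive
            models.append(clf.fit(X, labels))
        return models

    def max_score(models, X_new):
        # A new example "matches" if any exemplar classifier fires strongly.
        return np.max([m.decision_function(X_new) for m in models], axis=0)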


Good post on the exact problem I ran into when developing a machine learning algorithm to identify rule violations (basically fraud) based on self-reported descriptive text data. The few records that are "fraud" are precious from a training standpoint.

I found it helpful to start by over-representing the rare class and then slowly bring the balance back to the observed percentages.

The main reason for my over-representation of the rare class was that the rare case was often not correctly identified (many false negatives) in the so-called "Gold Standard" training data. By over-representing the rare case, I was able to build a better training set while tuning my algorithm.
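
In case it's useful, a rough sketch of that schedule (the ratios and number of rounds are made up): oversample the rare class heavily at first, then step the proportion back toward the observed rate, retraining each time.

    # Sketch: retrain while annealing the rare-class proportion back toward
    # the observed rate.  Ratios, rounds, and toy data are all hypothetical.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils import resample

    X, y = make_classification(n_samples=10000, weights=[0.98, 0.02], random_state=0)

    def fit_with_schedule(X, y, target_fracs=(0.5, 0.25, 0.1, None)):
        pos, neg = X[y == 1], X[y == 0]
        model = None
        for frac in target_fracs:
            if frac is None:                       # final round: observed proportions
                X_round, y_round = X, y
            else:                                  # oversample positives to the target fraction
                n_pos = int(frac / (1 - frac) * len(neg))
                pos_over = resample(pos, replace=True, n_samples=n_pos, random_state=0)
                X_round = np.vstack([neg, pos_over])
                y_round = np.concatenate([np.zeros(len(neg)), np.ones(len(pos_over))])
            model = LogisticRegression(max_iter=1000).fit(X_round, y_round)
        return model

    model = fit_with_schedule(X, y)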

I'd be interested to hear if others flush out false positives in a similar/different manner (for the Skewed/Rare Class problems described by the article).


One thing that may work well here is to train two models, one per class, and ask which one fits a given example better (by likelihood). Or, use the full weight of the data to train submodels and then feed those into the classifiers.
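
One concrete version of the two-model idea, as a toy sketch (Gaussian mixtures are an arbitrary choice of density model): fit one generative model per class and score new points by prior-weighted log-likelihood.

    # Sketch: one density model per class; classify by comparing prior-weighted
    # log-likelihoods.  Gaussian mixtures are an arbitrary stand-in, and the
    # rare class may need fewer mixture components than the common one.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_per_class(X, y, n_components=2):
        models, priors = {}, {}
        for c in np.unique(y):
            models[c] = GaussianMixture(n_components=n_components,
                                        random_state=0).fit(X[y == c])
            priors[c] = np.mean(y == c)
        return models, priors

    def predict(models, priors, X_new):
        classes = sorted(models)
        scores = np.vstack([models[c].score_samples(X_new) + np.log(priors[c])
                            for c in classes])
        return np.array(classes)[np.argmax(scores, axis=0)]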

Fraud is particularly difficult, though, because the entities are actively trying to thwart your attempts to detect them. Outlier detection is certainly a must. Collective entity resolution helps a lot too. It might even be worth seeing if you can use LDA to cluster the fraudsters together.

It's just not a domain suited to simple models or a singular approach.


Precision and recall are a great way of looking at overall accuracy, but ROC curves and ROC AUC are highly recommended as model selection criteria on imbalanced datasets. For SVMs, you can do a two-dimensional grid search over C and gamma, looking for the maximum ROC AUC.

I also use per-class weights - by default, the SVM (libSVM, libLinear) cost parameter is the same for both classes. Penalize the classifier for false negatives more than for false positives (an order of magnitude more).
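
Roughly what that looks like with scikit-learn's libSVM wrapper (the grid values and the 10x class weight are just illustrative):

    # Sketch: grid search C and gamma with ROC AUC as the model-selection
    # criterion, plus a heavier penalty on the rare positive class.
    # Grid values, class weight, and toy data are illustrative, not tuned.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

    param_grid = {
        "C":     [0.1, 1, 10, 100],
        "gamma": [1e-4, 1e-3, 1e-2, 1e-1],
    }
    svm = SVC(kernel="rbf", class_weight={0: 1, 1: 10})   # penalize missed positives ~10x
    search = GridSearchCV(svm, param_grid, scoring="roc_auc", cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)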


Totally new field for me, very interesting thoughts :)


hooande, thanks for the additional info. I'll definitely take a closer look at it.


How about

- make your classifier get to 100% accuracy (or close enough).

Also, do it automatically

- use an unsupervised learning stage, then a simple classifier (so you don't overfit). Repeat, growing the size of your unsupervised stage followed by the simple classifier, until you get to 100%.


That approach is almost exactly that of Deep Learning / stacked Restricted Boltzmann Machines.
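
For reference, the single-layer version of that shape is easy to sketch in scikit-learn (an RBM as the unsupervised stage feeding a linear classifier; the dataset and sizes are arbitrary):

    # Sketch: one unsupervised feature-learning stage (an RBM) followed by a
    # simple linear classifier - the single-layer version of stacked RBMs.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import BernoulliRBM
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)
    X = X / 16.0                                  # scale pixels into [0, 1] for the RBM

    model = Pipeline([
        ("rbm", BernoulliRBM(n_components=100, learning_rate=0.05, n_iter=20, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X, y)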


yeah, exactly



