The most important thing to do when designing a machine learning algorithm is to choose your loss function. The loss function quantifies how much your model sucks.
Ideally, your loss function should be the true loss, in real life. In a business, your true loss function is negative profit over the lifetime of the business. A lot of the time it's difficult to measure the true loss, so your loss function should be a "surrogate" (an approximation) of the true loss.
The author is poking around with techniques, without considering what he actually wants to optimize.
Brandon Ballinger said it well: http://www.quora.com/What-are-the-keys-to-operationalizing-a...
"Make your success metric user happiness. Traditional accuracy measures like precision, square error, ROC, etc. don't capture what you really care about--how users react to your model. For example, if you're running an ad system, your metrics should be revenue per pageview and click through rate. It's completely possible to have a model with a reduced error rate which lowers revenue, due to Simpson's paradox and a host of other reasons."
The author of this post hasn't really considered these questions, and gets all hand-wavey about precision and recall vs. confusion matrices. How do you know that any of these are what you really care about?
So: What's the value of a false positive? What's the value of a false negative?
Here's where it gets interesting. For airport screening, one person might say that a false negative (terrorist slips through and kills 100 people) is 100M times worse than a false positive (I strip-search an innocent person). So this person has made a value judgment that the death of a random innocent person is 1 million times worse than humiliating an innocent person. The TSA, on the other hand, decides that a false negative is 100B times worse than a false positive, so they assign 1000 times as much value to catching terrorists vs. humiliating people as the other person.
Who's right? It's a value judgment, and machine learning can't answer that.
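To make that concrete: once you do commit to relative costs, the decision threshold falls out mechanically. A minimal sketch (the cost numbers are made up, as in the screening example above):

```python
def expected_cost(p_positive, cost_fp, cost_fn, flag):
    """Expected cost of one decision, given P(example is positive)."""
    if flag:
        return (1 - p_positive) * cost_fp   # risk: flagging an innocent
    return p_positive * cost_fn             # risk: missing a real positive

def optimal_threshold(cost_fp, cost_fn):
    """Flag when p * cost_fn > (1 - p) * cost_fp,
    i.e. when p > cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

# One person's values: a false negative is 100M times worse than a false
# positive, so they'd flag anyone with more than a ~1-in-100M chance.
threshold = optimal_threshold(cost_fp=1.0, cost_fn=1e8)
```

Two people with different cost ratios get different thresholds from the exact same model - the model ranks, the values decide.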
What I'm getting at here is that the choice of a loss function is ultimately an aesthetic decision. Once we can optimize loss perfectly, what loss will we optimize?
The problem is that there isn't always a cheap way to encode a complex evaluation metric into a loss function for many machine learning algorithms. In fact, most of the really popular algorithms are popular because they work well mathematically - for example, generalized linear models are convex on a simple quadratic loss function, and neural networks have gradients when their loss functions are differentiable.
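One common workaround is to keep a differentiable surrogate and push the asymmetric costs in through per-example weights - the objective stays convex and has gradients. A rough sketch (the weighting scheme and toy data are my own illustration, not from the thread):

```python
import numpy as np

def weighted_logloss_grad(w, X, y, cost_fp, cost_fn):
    """Gradient of a cost-weighted logistic loss: positive examples are
    weighted by cost_fn (missing them is expensive), negatives by cost_fp."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))           # predicted P(y = 1)
    sample_w = np.where(y == 1, cost_fn, cost_fp)
    return X.T @ (sample_w * (p - y)) / len(y)

# Toy data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(3)
for _ in range(200):                             # plain gradient descent
    w -= 0.1 * weighted_logloss_grad(w, X, y, cost_fp=1.0, cost_fn=10.0)
```

The usual machinery still applies; the weights just tilt the surrogate toward whichever error the business cares about more.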
Also, I found that optimizing for simple but sensible loss functions often gave pretty good performance. From a practical perspective, most of the gains from understanding the performance metrics we cared about came in the data preparation, evaluation and calibration stages of modeling, not model selection or training.
In the case of the example in the post (fraud), the credit card industry has actually termed the false positive rate the "insult rate" (because the customer feels insulted in the store that their charge has been declined), which certainly gives you an idea of how they feel about false positives.
I found it helpful to start by over-representing the rare class and then slowly bring the balance back to the observed percentages.
The main reason for my over-representation of the rare class was that the rare class was often not correctly identified (many false negatives) in the so-called "Gold Standard" training data. By over-representing the rare class, I was able to build a better training set while tuning my algorithm.
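A sketch of that annealing schedule, assuming a binary problem - the helper and the specific fractions are my own illustration, not the parent's actual procedure:

```python
import random

def resample(examples, labels, rare_fraction, rare_label=1, seed=0):
    """Build a training set in which the rare class makes up
    `rare_fraction` of the examples, by downsampling the common class."""
    rng = random.Random(seed)
    rare = [(x, y) for x, y in zip(examples, labels) if y == rare_label]
    common = [(x, y) for x, y in zip(examples, labels) if y != rare_label]
    n_common = round(len(rare) * (1 - rare_fraction) / rare_fraction)
    sampled = rare + rng.sample(common, min(n_common, len(common)))
    rng.shuffle(sampled)
    return sampled

# Start balanced, then anneal back toward the observed base rate (here 1%).
examples, labels = list(range(1000)), [1] * 10 + [0] * 990
for frac in (0.5, 0.25, 0.1, 0.01):
    train = resample(examples, labels, rare_fraction=frac)
    # ... retrain and re-check the "Gold Standard" labels at each stage ...
```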
I'd be interested to hear if others flush out false positives in a similar/different manner (for the Skewed/Rare Class problems described by the article).
Fraud is particularly difficult though, because the entities are actively trying to thwart your attempts to detect them. Outlier detection is certainly a must. Collective entity resolution helps a lot too. It might even be worth seeing if you can use LDA to cluster the fraudsters together.
It's just not a domain suited to simple models or a singular approach.
I also use per-class weights - by default the SVM (libSVM, liblinear) cost parameter is the same for both classes. Penalize the classifier for false negatives more than for false positives (an order of magnitude more).
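In libSVM that's the -wi flag (a per-class C). It's the same idea as a class-weighted hinge loss - here's a minimal linear subgradient sketch of my own, not libSVM itself:

```python
import numpy as np

def train_weighted_svm(X, y, c_pos=10.0, c_neg=1.0, lr=0.01, epochs=200):
    """Linear SVM by subgradient descent, labels in {-1, +1}.
    Positives carry a 10x misclassification cost, so false negatives
    are penalized an order of magnitude more than false positives."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        cost = np.where(y == 1, c_pos, c_neg)
        active = margins < 1                     # examples inside the margin
        grad = w - (cost * active * y) @ X / len(y)
        w -= lr * grad
    return w

# Toy data: a noisy boundary along the first feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.2 * rng.normal(size=200) > 0, 1, -1)
w = train_weighted_svm(X, y)
```

With libSVM itself you'd express the same thing on the command line, e.g. `svm-train -w1 10 -w-1 1`, rather than writing the loop by hand.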
- Make your classifier get to 100% accuracy (or close enough). Also, do it automatically.
- Use an unsupervised learning stage, then a simple classifier (so you don't overfit). Repeat, growing the size of your unsupervised stage followed by the simple classifier, until you get to 100%.
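One way to read that loop (my interpretation - "unsupervised stage" as clustering whose assignments feed a trivially simple classifier, with the cluster count grown each round; all names and data are mine):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: the unsupervised stage."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return centers

def cluster_features(X, centers):
    """One-hot nearest-center encoding."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.eye(len(centers))[dists.argmin(1)]

def majority_vote(F, y):
    """About the simplest classifier possible: each cluster
    predicts its majority training label."""
    votes = F.T @ (2 * y - 1)                # net +/- vote per cluster
    return (votes >= 0).astype(int)

# Grow the unsupervised stage until training accuracy is close to 100%.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # nonlinear boundary
for k in (2, 4, 8, 16, 32):
    centers = kmeans(X, k)
    F = cluster_features(X, centers)
    cluster_label = majority_vote(F, y)
    acc = (cluster_label[F.argmax(1)] == y).mean()
    if acc >= 0.99:
        break
```

The classifier itself stays simple at every round; only the unsupervised stage grows, which is what keeps the overfitting risk down.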