
Learning from Imbalanced Classes - bradaallen
http://www.svds.com/learning-imbalanced-classes/
======
throw_away_777
Solid article. Especially important are the suggestions to consider a
probabilistic evaluation metric.

If you want to work with really imbalanced data, try working with data from
the LHC. There were on the order of 1,000 Higgs events in a year, while there
are around 600 million proton-proton collisions per second.
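
A minimal sketch (not from the article; it assumes scikit-learn and a
synthetic 2% minority dataset) of why a probabilistic metric such as log loss
or ROC AUC says more than raw accuracy here:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic, heavily imbalanced data: roughly 2% minority class.
    X, y = make_classification(n_samples=20000, weights=[0.98, 0.02],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]

    # Accuracy is dominated by the 98% majority class; the probabilistic
    # metrics reflect how well the rare class is ranked and calibrated.
    print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
    print("log loss:", log_loss(y_te, proba))
    print("ROC AUC: ", roc_auc_score(y_te, proba))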

~~~
lifestil
LHC data = EPIC

~~~
haddr
This is one of those rare moments when the term Big Data is actually used
appropriately.

~~~
mrjaeger
There's a really great episode of Linear Digressions[0] (a Data Science
podcast) that goes into the different scales of data that exist in the world.
Everyone thinks of Google and Facebook as Big Data, but the Australian Square
Kilometer Array Pathfinder collects 7.5B TB of data per second!

[0]: https://soundcloud.com/linear-digressions/whats-the-biggest-bigdata?in=linear-digressions/sets/linear-digressions

~~~
nl
And it's going to end up _quite_ a bit more than that.

Check out slides 21 and 22 from [1]. There are parts that will process 4 PB/s
(!)

[1] http://www.slideshare.net/SparkSummit/distributed-data-processing-using-spark-by-panos-labropoulosand-sarod-yatawatta

~~~
haddr
It's no secret that automated manufacturing produces terabytes of data in a
matter of minutes. Of course, how much of that data is of any use is another
story, but it still requires storage and processing power capable of handling
such volumes.

------
paulrosenzweig
There is a current Kaggle competition that is a good learning opportunity for
imbalanced data. The goal is to predict parts that will be rejected by quality
control.

https://www.kaggle.com/c/bosch-production-line-performance

------
denzil_correa
Quite an interesting article!

I had two research projects on finding low-quality - deleted and closed -
questions on Stack Overflow [0, 1]. Since these question classes always
suffered from class imbalance (2% closed and 8% deleted), I decided to
undersample the majority class. Later, I used ensemble methods - Gradient
Boosted Decision Trees, Random Forests - for classification. However, rather
than taking bootstrap samples, I took several random samples from the
majority class (each the same size as the minority class) and built an
ensemble classifier for each random sample. In the end, I checked the
variation in the final classification results. It just seemed intuitive to
me. I had no idea about Wallace et al.!

[0] Denzil Correa and Ashish Sureka. 2014. Chaff from the wheat:
characterization and modeling of deleted questions on stack overflow. In
Proceedings of the 23rd international conference on World wide web (WWW '14).
ACM, New York, NY, USA, 631-642.
DOI: http://dx.doi.org/10.1145/2566486.2568036

[1] Denzil Correa and Ashish Sureka. 2013. Fit or unfit: analysis and
prediction of 'closed questions' on stack overflow. In Proceedings of the
first ACM conference on Online social networks (COSN '13). ACM, New York, NY,
USA, 201-212.
DOI: http://dx.doi.org/10.1145/2512938.2512954

------
fizx
In addition to unbalanced classes, where let's say 2% of the data is from a
minority class and 98% is from the majority, how do people handle the case
where 2% of the data is from the minority and 98% is of an unknown class, but
pulled from a known distribution?

~~~
ausvisaissues
I am not completely following what you mean by "98% is of an unknown class,
but pulled from a known distribution".

Are you suggesting that:

2 percent of samples are positive, drawn from p(x|y=1);

98 percent of samples are drawn from a distribution p(x), but may be either
positive or negative?

The setting you describe is called "positive and unlabeled (PU)" learning.
This paper: http://cseweb.ucsd.edu/~elkan/posonly.pdf is one of the seminal
articles on the topic (although the equation at the bottom of page 214
contains a statement that may not necessarily hold true). There are quite a
few more recent papers on this topic.
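
A minimal sketch (assuming scikit-learn and a made-up dataset) of the
correction from the linked Elkan & Noto paper: train a classifier to separate
labeled from unlabeled examples, estimate c = p(s=1|y=1) on held-out labeled
positives, then divide the scores by c:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=20000, weights=[0.9, 0.1],
                               random_state=0)
    # s = 1 means "labeled positive"; only ~30% of true positives are labeled,
    # the rest of the data is unlabeled.
    rng = np.random.RandomState(0)
    s = ((y == 1) & (rng.rand(len(y)) < 0.3)).astype(int)

    X_tr, X_val, s_tr, s_val = train_test_split(X, s, stratify=s,
                                                random_state=0)

    # 1. Fit a probabilistic classifier for "labeled vs unlabeled".
    g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)

    # 2. Estimate c = p(s=1|y=1) as the mean score on held-out labeled positives.
    c = g.predict_proba(X_val[s_val == 1])[:, 1].mean()

    # 3. Correct the scores: p(y=1|x) is approximately p(s=1|x) / c.
    p_y = np.clip(g.predict_proba(X_val)[:, 1] / c, 0, 1)
    print("estimated c:", c)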

------
autokad
I usually find that SVMs do really well with very unbalanced data because of
the hinge loss.
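
A minimal sketch (assuming scikit-learn and a synthetic 2% minority dataset)
of a linear SVM trained with the hinge loss on imbalanced data;
class_weight="balanced" is an extra adjustment the comment doesn't mention,
shown only because it is a common companion to the hinge loss in this setting:

    from sklearn.datasets import make_classification
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=20000, weights=[0.98, 0.02],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # loss="hinge" uses the plain hinge loss; class_weight="balanced" upweights
    # the rare class so the margin is not dominated by the majority.
    svm = LinearSVC(loss="hinge", class_weight="balanced",
                    max_iter=5000).fit(X_tr, y_tr)
    print("minority-class F1:", f1_score(y_te, svm.predict(X_te)))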

