
The Area Under an ROC Curve (2001) - dedalus
http://gim.unmc.edu/dxtests/roc3.htm
======
pveierland
The AUC for ROC does not work well in comparisons involving imbalanced
datasets, and Precision-Recall curves can be a better option.

[https://classeval.wordpress.com/simulation-analysis/roc-and-...](https://classeval.wordpress.com/simulation-analysis/roc-and-precision-recall-with-imbalanced-datasets/)
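
To make the effect concrete, here is a small simulation in the spirit of the
linked analysis (the Gaussian score distributions and sample counts are my own
assumptions): the per-class score distributions are held fixed while only the
class ratio changes between runs.

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(0)

    def scores_for(n_pos, n_neg):
        # Identical per-class score distributions at every imbalance level,
        # so only the class ratio differs between runs.
        y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
        s = np.concatenate([rng.normal(1.0, 1.0, n_pos),
                            rng.normal(0.0, 1.0, n_neg)])
        return y, s

    for n_pos, n_neg in [(1000, 1000), (100, 10000)]:
        y, s = scores_for(n_pos, n_neg)
        print(roc_auc_score(y, s),            # stays near 0.76 in both runs
              average_precision_score(y, s))  # area under PR drops sharply

ROC AUC barely moves because both of its axes (true and false positive rates)
are normalized within each class, while precision mixes the two classes and so
feels the imbalance directly.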

~~~
gcmac
I disagree with the analysis of this article. In a typical machine learning
process, the response variable stays the same (at a distributional level) while
you cycle through candidate models. So regardless of what the class
distribution is, a higher AUC score indicates a better model.

It might be true that classifier performance is worse on an imbalanced data
set than on a balanced one with the same AUC score, but that just reflects the
fact that classifiers are harder to build for imbalanced data.

~~~
letitgo12345
Having a better AUC score does not guarantee a better AUPR score, so a model
with a better AUC is not universally "better".

See
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.4362&rep=rep1&type=pdf)
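
A tiny toy case (my own numbers, not from the paper) where the two rankings
disagree: model B wins on ROC AUC while model A wins on average precision,
the usual summary of the PR curve.

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    y = np.array([1, 0, 0, 0, 1, 0])
    # Model A ranks the instances P N N N P N; model B ranks them N P P N N N.
    a = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
    b = np.array([0.8, 0.9, 0.6, 0.5, 0.7, 0.4])

    print(roc_auc_score(y, a), roc_auc_score(y, b))
    # 0.625 vs 0.75 -- B looks better by AUC
    print(average_precision_score(y, a), average_precision_score(y, b))
    # 0.70 vs ~0.58 -- A looks better by AUPR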

------
hprotagonist
I would dearly love for this to be a more commonly used measure of
performance. It's much harder to hide your sins if you show an ROC curve.

It's not naively useful for multi-category classifiers, but it's great
otherwise.

~~~
mikebenfield
I came to the conclusion that area under ROC is pretty much garbage. What you
really want is two separate steps: estimate the probability that instance A
belongs to class X, and then a decision step where you decide how to classify
A based on a loss function (which varies depending on how harmful a false
positive is).

Area under ROC forces you to conflate these two steps.

Why not evaluate the probability estimation directly, using Brier score [1] or
something similar?

[1]
[https://en.wikipedia.org/wiki/Brier_score](https://en.wikipedia.org/wiki/Brier_score)
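
For what it's worth, the decision step described above has a standard closed
form once you have calibrated probability estimates: classify as positive
exactly when the expected cost of a miss exceeds that of a false alarm. A
minimal sketch, with made-up costs and probability estimates:

    import numpy as np

    def bayes_threshold(cost_fp, cost_fn):
        # Flag as positive when p * cost_fn > (1 - p) * cost_fp,
        # i.e. when p > cost_fp / (cost_fp + cost_fn).
        return cost_fp / (cost_fp + cost_fn)

    p_hat = np.array([0.05, 0.30, 0.65, 0.90])     # step 1: probability estimates
    t = bayes_threshold(cost_fp=1.0, cost_fn=9.0)  # step 2: loss-based cutoff
    print(t, p_hat > t)  # 0.1 -- misses cost 9x more, so flag aggressively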

~~~
rm999
>I came to the conclusion that area under ROC is pretty much garbage.

I cannot disagree more. IMO AUC is a rare great metric: it's principled,
useful, universally applicable (e.g. invariant to class imbalances), and easy
to explain adequately to non-statisticians ("the probability that a randomly
chosen positive sample is ranked above a randomly chosen negative one").

>What you really want is two separate steps: estimate the probability that
instance A belongs to class X, and then a decision step where you decide how
to classify A based on a loss function (which varies depending on how harmful
a false positive is).

Yes, but those are two inherently separate steps and should be measured
separately. AUC is a metric of the first step (the "model"). The second step
is a business decision and will often be made separately from the modeling
process, by different people, with a different cadence, and with a different
goal in mind.

For example, if I am designing a model to find a disease, I just want to make
the best prediction I can, which is cleanly measured by AUC. Then, when it
comes to actual diagnosis, someone else will choose cutoffs based on various
factors like false positive costs (treatment cost, human toll), false negative
cost (disease damage, death toll), supply of treatment, etc. I can picture
scenarios where the same model is used for decades, but the cutoffs change
seasonally.
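
That parenthetical is literal, which is part of why the metric is easy to
explain: AUC equals the fraction of (positive, negative) pairs the model
orders correctly, with ties counted as half. A quick sketch with synthetic
scores of my own:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 1000)
    s = rng.random(1000) + 0.5 * y  # noisy scores, positives shifted upward

    # All (positive, negative) score differences.
    diff = s[y == 1][:, None] - s[y == 0][None, :]
    pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()
    print(pairwise, roc_auc_score(y, s))  # the two numbers match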

~~~
mikebenfield
> invariant to class imbalances

I think this is a red herring that comes from not thinking probabilistically.
If the distribution of your training data does not resemble the distribution
of your real-world data (or cannot be made to resemble it), you're just
guessing anyway. If it does resemble the real distribution, then you _want_
those class imbalances. In fact, I tend to think that the fact that ROC is
"invariant to class imbalances" is a significant downside: it means in some
sense your score is just as sensitive to things that rarely happen as it is to
things that happen all the time.

> "probability of choosing a positive sample over a negative one"

I find the practical implications of this pretty opaque, and it's never been
clear to me whether this is measuring anything I actually care about. As far
as I know there aren't theoretical guarantees that a better AUC score means
anything real. I haven't thought deeply about it, but I am reasonably sure I
could find some simple examples illustrating how to "cheat" AUC by getting a
higher score with predictions that are worse in any practical sense.

I still like the Brier score: just give me a number indicating how well my
estimated probability predictions do on a test/validation set. There are even
theoretical guarantees about it, because it's a proper scoring rule.
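
Concretely, "proper" means that if the event truly occurs with probability p,
the expected Brier score p(q - 1)^2 + (1 - p)q^2 is minimized by reporting
q = p, so honest probabilities are the optimal strategy. A quick numeric check
with an assumed p:

    import numpy as np

    p = 0.3                      # assumed true event probability
    q = np.linspace(0, 1, 101)   # candidate reported probabilities
    expected_brier = p * (q - 1) ** 2 + (1 - p) * q ** 2
    print(round(q[np.argmin(expected_brier)], 2))  # -> 0.3, honesty wins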

------
assafmo
This page actually helped me a lot in the past. Thank you for sharing!

------
WilliamSt
I really wish people could make it a habit to define acronyms.

~~~
curiousgal
I really wish people could make it a habit to read articles.

The article defines the acronym and even mentions where it comes from.

~~~
sxg
True, but the definition is in the very last line of the whole page. For
people unfamiliar with an ROC curve, which seems to be the target audience, it
should be defined the first time it's used.

Edit: on navigating the site a little more, the previous section does define
the acronym on its first use.

