
Can a classifier that never says “yes” be useful? - jmount
http://www.win-vector.com/blog/2014/03/can-a-classifier-that-never-says-yes-be-useful/
======
TrainedMonkey
This is semantics. The difference is only in the question you ask. If you ask
"Statistically, does this person have a 30% chance of defaulting this year?"
[0] the answer becomes a very clear yes or no.

[0] The problem definition states that the selected 2% of accounts are
indistinguishable, thus the question is identical to: "Does this account belong
to the 2% of the population that statistically had a high default rate?" which
is what the author uses.

~~~
jmount
In my opinion there is a difference.

You can build a classifier that says "does the person have a 30% chance of
defaulting this year." And you can in aggregate estimate the accuracy of such
a classifier. However, for an individual example you don't know if such a
classifier is right or wrong, which can confuse partners. You say somebody has
a 30% chance of default and they don't default (were you right?), or they do
default (again, were you right on this single instance?). Obviously you can
work something out, but you have moved to building a classifier where you no
longer have an easily observable ground truth on individuals (only on
aggregates).
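A tiny simulation (my own sketch, not from the post) makes the aggregate-vs-individual point concrete: a well-calibrated score of 0.3 can be checked against the observed default rate in aggregate, but a single account's outcome is just a yes or no.

```python
import random

random.seed(42)

# Hypothetical accounts that all truly default with probability 0.3,
# matching the calibrated score the classifier assigns them.
n = 100_000
predicted = 0.3
defaults = [random.random() < predicted for _ in range(n)]

observed_rate = sum(defaults) / n
print(f"predicted: {predicted}, observed in aggregate: {observed_rate:.3f}")

# For any single account the outcome is simply True or False, so
# "was the 30% score right on this one?" has no answer at the
# individual level -- only the aggregate rate is checkable.
```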

You then want to move to forecasting and rating of scores or quality of
sorting (which is what we encourage in the writeup).

~~~
TrainedMonkey
"And we build a good forecast or scoring procedure that for 2% of the
population returns a score of 0.3 and for the remaining 98% of the population
returns a score near 0.01. Further suppose our scoring algorithm is well
calibrated and excellent: the 2% of the population that it returns a score of
0.3 and above on actually tends to default at a rate of 30%."

In this case your classifier has two outputs. Let's label them safe and
default prone respectively. Now we can transform the question for each account
into: "Is this account default prone?". In the example provided the answers are
obviously yes or no. If your classifier has two options, there must exist a
question that would yield a yes/no answer. The other alternative is multi-class
classifiers (neural networks, decision trees, etc...).

Part of the confusion might stem from the fact that your example is simplified.
From the problem statement and description it sounds like you guys built a
scoring algorithm that, for each account, outputs the probability that the
account will default this year. In that case 0.01 and 0.3 would directly
translate into a 1% and a 30% chance of default respectively. Obviously the
algorithm would not just output two options; it would output a range from 0 to
1.

This is not yet a classifier; however, it is trivially easy to build a
classifier with the desired number of buckets on top of it. For example, with
two buckets: ask the question "does this account have at least a 30% chance of
defaulting this year?"; with a scoring algorithm, answering that is as simple
as filtering for accounts with a score in excess of 0.3.
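As a sketch of that last step (hypothetical code, not the authors'), thresholding the score at 0.3 turns the scorer into the two-bucket classifier described above:

```python
def classify(score, threshold=0.3):
    """Answer: does this account have at least a 30% chance of defaulting?"""
    return "default prone" if score >= threshold else "safe"

# Example scores shaped like the post's: most near 0.01, a few near 0.3.
scores = [0.01, 0.01, 0.30, 0.02, 0.31]
labels = [classify(s) for s in scores]
print(labels)
# -> ['safe', 'safe', 'default prone', 'safe', 'default prone']
```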

------
graycat
Sure: It's easy to have no false alarms; just turn off the detector! It's easy
to have no missed detections of real problems; just sound the alarm all the
time.

Then it's easy to have whatever false alarm rate you want; just 'interpolate'
between these two detectors.

Such detectors are known as 'trivial' with good justification.

If some monitoring software claims a low false alarm rate, then, sure, that's
easy to get -- just use a trivial detector!

And as for being useful: if you have two detectors that actually are good and
you want to 'interpolate' between them, then a trivial detector can do that and
be 'useful'.
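A minimal sketch of the 'interpolation' trick (my own illustration, under the assumption that interpolation means randomly choosing between the two detectors): flipping a biased coin between the always-quiet and always-alarming detectors hits any target false alarm rate.

```python
import random

random.seed(0)

def never_alarm(x):
    return False  # no false alarms: the detector is turned off

def always_alarm(x):
    return True   # no missed detections: the alarm sounds all the time

def interpolate(det_a, det_b, p):
    """Return a detector that uses det_b with probability p, else det_a."""
    def det(x):
        return det_b(x) if random.random() < p else det_a(x)
    return det

# A 'trivial' detector with a ~10% false alarm rate, built from the two above.
det = interpolate(never_alarm, always_alarm, 0.10)
trials = 100_000
false_alarm_rate = sum(det(None) for _ in range(trials)) / trials
print(f"false alarm rate: {false_alarm_rate:.3f}")
```

The same coin-flip construction works between any two detectors, which is the sense in which a trivial mechanism can still be 'useful'.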

------
lerchmo
It may be useful to get the probability of a yes, even if it's a .0001%
chance, for things like rare event detection.

