
The problem with metrics is a big problem for AI - QuitterStrip
https://www.fast.ai/2019/09/24/metrics/
======
mikorym
> However, the researchers found that several of the most predictive factors
> (such as accidental injury, a benign breast lump, or colonoscopy) don’t make
> sense as risk factors for stroke. So, just what is going on? It turned out
> that the model was just identifying people who utilize health care a lot.

A good analogous example of this is PCA. If your first component has a
dominating effect, it will drown everything else out (you can compensate by
looking at components 2 and 3). (Examples: [monetary] inflation, or year and
month effects.) It's a cool exercise to run PCA on datasets and see whether
things like this pop out. This is also why PCA (which maximises explained
variance) is an _exploratory_ analysis and should not be used as an
authoritative "result" or "metric". You could, but you have to be careful:
even in picking your components you introduce curation and bias.
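A minimal sketch of that exercise with scikit-learn, on synthetic data (the
feature names and loadings below are made-up assumptions): one shared
"utilization" factor dominates the first component, and the genuinely
distinct signal only surfaces in later components.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n = 1000

    # Hypothetical dominating factor: overall healthcare utilization.
    utilization = rng.normal(size=n)

    # Observed features that all load heavily on utilization,
    # plus one weaker, genuinely distinct signal.
    X = np.column_stack([
        3.0 * utilization + rng.normal(scale=0.5, size=n),  # e.g. visit count
        2.5 * utilization + rng.normal(scale=0.5, size=n),  # e.g. tests ordered
        2.0 * utilization + rng.normal(scale=0.5, size=n),  # e.g. referrals
        rng.normal(size=n),                                 # unrelated signal
    ])

    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_)
    # The first component soaks up the shared utilization effect; the
    # distinct signal only shows up in the later components.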

Imagine a scenario where we regress back to the dark ages and a eugenics-
inclined doctor says: "Sorry Stan, your second component value is an outlier,
no kids for you; your genes are not considered adequate."

~~~
vlz
PCA probably means Principal Component Analysis, if anybody wonders.

[https://en.m.wikipedia.org/wiki/Principal_component_analysis](https://en.m.wikipedia.org/wiki/Principal_component_analysis)

~~~
EdwardDiego
Thanks :)

------
throwaway_bad
Metrics are really meant as a social tool for humans.

For example, a commonly repeated piece of advice is that you need a "north
star" metric. For Facebook this was daily active users (for an arbitrary
definition of active). For machine translation research it was the BLEU score
(which is also fairly arbitrary and flawed).
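To make "arbitrary and flawed" concrete, here is a minimal sketch of a
sentence-level BLEU (real BLEU uses up to 4-grams, smoothing, and
corpus-level statistics; the sentences below are made up): a faithful
paraphrase can score exactly zero because it shares no n-grams with the
reference.

    from collections import Counter
    import math

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu(reference, hypothesis, max_n=2):
        ref, hyp = reference.split(), hypothesis.split()
        precisions = []
        for n in range(1, max_n + 1):
            ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
            overlap = sum((ref_counts & hyp_counts).values())  # clipped matches
            precisions.append(overlap / max(sum(hyp_counts.values()), 1))
        if min(precisions) == 0:
            return 0.0  # one missing n-gram order zeroes the whole score
        bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    ref = "the cat sat on the mat"
    print(bleu(ref, "the cat sat on the mat"))        # 1.0
    print(bleu(ref, "a feline rested upon the rug"))  # 0.0, same meaning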

Humans need these simplistic metrics because we can't agree on progress
otherwise.

But it's also because every human has conflicting goals that we can trust
the guiding metric not to fail too horribly. At all points in time, people
are also evaluating progress in terms of their personal values, from their
own point of view. Gaming of the metrics won't go unnoticed by other humans.
Edge cases can be patched. Conflicting goals are surfaced, decided on, and
resolved. The guiding metric, and the real mission of the organization,
evolves organically. All of this is done using biological intelligence.

I don't know if it's really possible to build an AI system with the same
property. The course correcting of metrics might for a long time remain in the
domain of vigilant humans, as we are the only ones who know what our values
are.

~~~
rm_-rf_slash
Your comment made me think of the Vietnam War. Many of the military leaders in
the war came of age during World War II and Korea, where the most obvious
metric was land: if Nazis or communists were there, then bad. If allies, then
good.

But that completely broke down in Vietnam, when the placid villagers of South
Vietnam by day became the ruthless Viet Cong by night. American generals had
no way to approximate their success as they had in previous wars, so they
settled on kill scores: as long as more Vietnamese died than Americans, the
DoD could go on television and claim that America was “winning”, even though
the incentives behind kill scores resulted in a lot of bystander villages
being torched and civilians being killed, which only turned public opinion
further against the Americans.

I wonder if a sufficiently advanced AI system could correct its use of (or
abandon) a bad metric even when it is incentivized to use it.

The problem I cited above was so hard for the people at the time that it was
effectively intractable. I’m not arguing that an AI would have had to solve
it (setting aside the likelihood that even if an AI had come up with a better
solution, other inertial forces like politics and sunk cost would have
prevented that solution from being implemented), but any system that claims
to be a “strong” AI would at least have to be up to the task of trying.

~~~
Spooky23
Vietnam was worse than that. There were no criteria for winning and no
visible end, so the conduct of the conflict was insane.

Kill counts were an objective measurement demonstrating that something was
being done, and they turned into evidence of “victory”.

Intelligence, whether human or artificial, cannot fix problems that cannot be
defined.

------
mlthoughts2018
The problem with discussions like this is that they never provide systematic
examples of how a portfolio of metrics or qualitative checking can be
integrated into a modeling problem. There’s a lot of finger pointing at
metrics and complacency about problems, but the solutions are super vague,
like the sanctimonious passage in this article about hiring from under-indexed
groups in tech companies and just listening to first-person accounts (which is
probably a bad idea if you actually want to help).

Ultimately I agree with the underlying idea, but I think that to be helpful
you have to present case studies that reproduce research with metric
optimization swapped out for a holistic variety of metrics plus qualitative
checking.

I recommend the books Bayesian Data Analysis by Gelman et al and Data Analysis
Using Regression and Multilevel/Hierarchical Models by Gelman and Hill if you
want to read good accounts of doing this in practice with real data sets.

There’s definitely room for a book like this that focuses on more domain-
specific models in NLP, computer vision, and deep neural networks.

~~~
inimino
> There’s a lot of finger pointing at metrics and complacency about problems,
> but the solutions are super vague

The solution is obvious, and not vague at all: stop over-relying on metrics,
and stop pretending that what matters can in most cases be measured.[1]
However, I think you dislike this answer (for reasons given in my other reply
in this thread) so you are looking for ways to replace bad metrics with better
metrics. Which is worthwhile, but not the immediate answer.

> just listening to first-person accounts (which is probably a bad idea if you
> actually want to help).

I didn't see in the article where anyone suggested _only_ listening to first-
person accounts. However my strong belief is that if you don't listen to and
seek out first-person accounts, you have almost no chance whatsoever of doing
any good in basically any kind of complex social problem, and a high
likelihood of doing harm while patting yourself on the back because some
metric you settled on is going up.

[1]: Edit to add: This implies not using AI/ML in some areas where it is
currently being used. The infamous example of AI grading essays, linked from
the article, is one of the most egregious misuses of ML: grading essays is
something that, to me, is unimaginably, breathtakingly, forehead-slappingly
idiotic to give to anyone but people, and not just any people, but the people
closest to the students whose work is being graded.

~~~
mlthoughts2018
> “The solution is obvious, and not vague at all: stop over-relying on
> metrics, and stop pretending that what matters can in most cases be
> measured.”

That is vague. You, like the article, are not being specific or explaining how
this can be systematically applied.

~~~
inimino
"systematically applied"...

I'm not giving you a system; I'm specifically telling you not to rely on
systems and rules in cases where they do not work. I'm arguing that people
need to take responsibility for the correspondence between their own actions
and their values. This is not a system, it's an attitude.

If you're using ML to grade essays, then you need to stop doing that, because
the things modern ML can measure are not the things that make an essay good or
bad.

If you're using metrics to drive a business process, and the output of that
process that corresponds to your values cannot be captured by a metric, then
you need to stop using metrics and instead design your process according to
principles that reflect your values.

------
dr_dshiv
The main thing is to know, deeply, that the metrics are not the goal; they
are merely signals that tend to correlate with the goal. Having a clear link
between values, goals, and metrics can be really helpful for maintaining
alignment, even when situations change.

~~~
dr_dshiv
Don't mistake metrics for strategy!

------
sanxiyn
The obvious answer is to learn the metric itself too. OpenAI has done
interesting work in this area: "Deep reinforcement learning from human
preferences", for example.
[https://arxiv.org/abs/1706.03741](https://arxiv.org/abs/1706.03741)
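The core idea of that paper, roughly sketched below (the network shape and
the toy data are illustrative assumptions, not the paper's code): fit a
reward model so that the human-preferred trajectory segment of each pair
gets the higher predicted reward, via a Bradley-Terry likelihood.

    import torch

    # Tiny reward model over 8-dimensional state features (assumption).
    reward_model = torch.nn.Sequential(
        torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
    )
    opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

    def preference_loss(seg_a, seg_b, prefer_a):
        # Bradley-Terry: P(a preferred) = exp(r_a) / (exp(r_a) + exp(r_b)).
        r_a = reward_model(seg_a).sum()  # total predicted reward, segment a
        r_b = reward_model(seg_b).sum()
        logits = torch.stack([r_a, r_b]).unsqueeze(0)
        target = torch.tensor([0 if prefer_a else 1])
        return torch.nn.functional.cross_entropy(logits, target)

    # Fake data: two 10-step segments, then one gradient step on the label.
    seg_a, seg_b = torch.randn(10, 8), torch.randn(10, 8)
    loss = preference_loss(seg_a, seg_b, prefer_a=True)
    opt.zero_grad(); loss.backward(); opt.step()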

~~~
inimino
I'm not sure this is the obvious answer, if what it means in practice is
that a metric which may be wrong, but which is at least visible, gets
replaced by a metric that is invisible. In the first case we can at least
reason about the metric and why optimizing for it might go wrong.

------
motohagiography
Stepping back a bit: I worked on productizing ML and found a basic principle
that causes consternation.

The metric is not the ROC curve; the metric is the overall value the
algorithm generates in aggregate, based on the unique business situation, as
a function of risk.

When you look at a trading bot, it is not evaluated on its predictions of
market moves, but on how much money it makes - a higher order effect of using
the method over time vs. others.

You have to look at the symmetry (or distribution) of risk in the problem
it's being applied to (famously described in a blog post on ROC vs.
indifference curves).
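A rough sketch of what pricing that risk distribution can look like (the
payoff numbers and the toy classifier below are made-up assumptions): the
same scores, priced under an asymmetric cost matrix, yield very different
aggregate values at different thresholds, which no single ROC summary shows.

    import numpy as np

    def expected_value(y_true, scores, threshold,
                       v_tp=100.0, v_fp=-5.0, v_fn=-500.0, v_tn=0.0):
        # Price each cell of the confusion matrix; a false negative is
        # assumed to be far more costly than a false positive.
        pred = scores >= threshold
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        return tp * v_tp + fp * v_fp + fn * v_fn + tn * v_tn

    rng = np.random.default_rng(1)
    y = rng.integers(0, 2, size=1000)
    s = np.clip(0.3 * y + rng.normal(0.4, 0.25, size=1000), 0, 1)

    for t in (0.3, 0.5, 0.7):
        print(t, expected_value(y, s, t))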

In the case of cancer detection, ML is immensely valuable for alerting people
who might not otherwise have been tested, but terrible as a diagnostic tool
when it is substituted for human judgment. Not because it is inaccurate, but
because the consequences of its being wrong are fatal to the patient, whereas
the benefits of its being right are life-saving. You need a problem with
asymmetric upside to benefit from ML. That is, the downside of using it can't
be serious, because the machine cannot hold risk or accountability, and no
sane organization would expose itself to catastrophic liability over a
probabilistic algorithm being wrong.

This view pisses off managers, executives and investors because most of the
ML products out there hide their downside. Luckily, most customers are smart
enough to recognize this - which is why AI companies are having such trouble
finding product-market fit.

Self driving cars are another example of a problem with a catastrophic failure
mode. You could argue that if we made self driving cars as safe as flying,
that would be acceptable. Except the net benefit/upside of a self driving car
is not sufficient to justify the downside of its failure mode.

Self-flying vehicles will be more acceptable much sooner than cars because
most people don't fly. The upside benefit anchors people to a higher degree of
risk appetite than for something they already do. If I were in the self
driving car business, I'd switch to the self flying car business for that
reason.

From a management perspective, AI/ML does not produce something you can
manage. It's not as though there are rational levers and incentives you can
adjust to achieve outcomes. It's a curve: you tweak weights and try to keep
it producing aggregate net benefit. If you are not directly engaged in that
process, it's a lottery with outcomes distributed along its ROC curve. In
many ways, it is a substitute for management.

AI/ML is useful for marginal fraud detection, policing of various kinds, and
other relationships where you are in a position of power over a distribution
of outcomes. It is not something you can be honestly served by.

This is an unpopular view because it bursts the AI/ML bubble, which has been
an excellent source of greater fools for investment. But if you have money in
the space, the question to ask is: does the problem this company is solving
provide a net benefit that compensates the subjects of this system for its
failure mode?

The question is not how low you can get the MTBF or FP/FN rate; that's the
hustle. The real question is whether a given individual user or subject of
the system can afford the cost of it being wrong. If you don't have that
answer, metrics are useless.

~~~
MauranKilom
> You could argue that if we made self driving cars as safe as flying, that
> would be acceptable. Except the net benefit/upside of a self driving car is
> not sufficient to justify the downside of its failure mode.

Aren't you ignoring the human failure modes here? If self-driving cars were
safer than human-driven ones, we would definitely be in the positive, and even
more so for "safer than flying". There would simply be no downside (in terms
of failure mode/rate) to begin with.

I get your overall argument, but for this point you are ignoring precisely
what you are advocating for - you are looking at the solution in isolation,
not comparing to the alternatives.

~~~
motohagiography
The crux of my argument is that even if self driving cars were safer than
human driven ones, their adoption would still be thinner than expected,
because: why would I accept potential fatality and give up my agency to a
machine for something I can already do fine myself (even if that's only my
perception)?

You'd have to convince people that they were terrible drivers and that
driving was super dangerous to get most of them to put their kid in a self
driving taxi in traffic. I'd agree there is a cultural-change effect in play,
as cars were farcically unsafe (and yet still 1000x safer than horses) for
the first 60 years of their use, but the business model where people switch
to self-driving is optimistic.

That isolated, individual decision-making is what will have to scale. It's a
question of the perception of risk/reward, which is very different from
statistical risk modelling.

The AI/ML will be great, the products? Whole different class of problem.

~~~
MauranKilom
> You'd have to convince people they were terrible drivers and that driving
> was super dangerous to get most people to put their kid in a self driving
> taxi in traffic.

I don't think so. All things equal, who wouldn't want to spend their time
doing something more fun than navigating traffic? Whether it can be affordable
to everyone is a different question, but you wouldn't have to insult anyone's
driving skills to offer incentives for a switch.

Also, "self-flying cars" have about 100x more problems that affect adoption
(aside from having virtually nothing in common with self-driving ones). Having
to constantly generate 10 m/s² of upwards acceleration is _really_ expensive.
You can't go to work in a flying car. People are huge wimps about crashing.
Basic stuff like this. I mean, I can't/won't stop you leaving the self-driving
car business for the self-flying car business, but it seems incredibly ill-
advised to me.
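As a rough check on the hover-cost point, a back-of-envelope sketch (the
1500 kg mass and 10 m² of rotor disk area are assumptions, and real
rotorcraft are less efficient than this ideal):

    import math

    # Ideal actuator-disk hover power: P = T**1.5 / sqrt(2 * rho * A),
    # with thrust T = m * g, air density rho, total rotor disk area A.
    m, g, rho, A = 1500.0, 9.81, 1.225, 10.0
    thrust = m * g
    hover_power = thrust**1.5 / math.sqrt(2 * rho * A)
    print(f"ideal hover power: {hover_power / 1000:.0f} kW")  # ~360 kW

    # Compare: a sedan cruising at highway speed needs on the order of 20 kW.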

------
minimaxir
Metrics will always be gamed as long as a) there is an incentive to game them
and b) there are no proportionate consequences for gaming said metrics.

To use YouTube video recommendations as an example, (a) is reflected in the
title/thumbnail brinksmanship seen recently, as well as in the shift toward
"alt-right" topics, and (b) is reflected in YouTube contorting itself to
justify not punishing such methods. In the case of recommendation rabbit
holes (e.g. alt-right rabbit holes), that is why journalism highlighting
these trends is especially important as a correcting force.

------
m0zg
The fundamental problem with academic metrics is that if you're solving a
_real_ problem (a rarity in academia, BTW), they are merely a proxy for how
well you've solved the problem, and often not a good one.

Case in point: say you're building a berry-picking robot using computer
vision. As part of that you'll probably use an object detection system which
lets your robot see the berries and know where they are. Commonly you will
use a combination of losses to optimize the system, and a combination of
mean average precision (mAP) metrics to evaluate how good it is. But here's
the issue: even the evaluation mAP (let alone the loss) does not tell you how
good the robot will be at picking berries. Moreover, there's no point of
reference to tell you whether, e.g., an mAP of 80% is "good enough". And the
"goodness of the robot", while it can be defined as a real-world metric
(e.g. the robot successfully picked 90% of ripe berries and destroyed only
0.5% of the plants), usually can't be formulated as a direct optimization
objective. So you end up futzing with the metrics that are easier to work
with, hoping and praying that your result is good in the end.
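A toy sketch of that gap (the IoU thresholds and the gripper tolerances
below are made-up assumptions): detections counted "correct" at the usual
mAP threshold can still be localized too loosely for the task itself.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500

    # Simulated IoU of each detection with the true berry location:
    # a detector that looks "pretty good" on paper.
    ious = rng.beta(6, 2, size=n)

    # Proxy metric: fraction counted correct at the common mAP threshold.
    proxy = np.mean(ious >= 0.5)

    # Task metric (assumed): the gripper only picks cleanly above IoU 0.8
    # and damages the plant below IoU 0.3.
    picked = np.mean(ious >= 0.8)
    destroyed = np.mean(ious < 0.3)

    print(f"detections 'correct' at IoU 0.5: {proxy:.0%}")
    print(f"berries actually picked:         {picked:.0%}")
    print(f"plants damaged:                  {destroyed:.1%}")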

If that doesn't sound scary enough, think of autonomous driving systems. :-)

~~~
XuMiao
Most of the time, surrogate metrics align with the true (unknown) metric
while scores are modest, e.g. 80%; they can start to diverge at 90%. When I
see people report a 0.5% improvement over 98% accuracy and claim statistical
significance, I strongly doubt the real-world significance. The long tail is
always longer than we think. On the other hand, there are no coincidences:
if some data points turn out to be outliers that the metric miscalculates,
it may be that the problem isn't set up right, e.g. some information is
missing from the data. In-person interviews help not just because people
have better judgement, but because we can find out more details about the
sample in question. Current AI/ML methods lack an active information-seeking
process; they don't have the freedom to explore the world beyond the defined
state/reward space.
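A toy illustration of how a headline 0.5% gain can hide exactly that kind of
long-tail regression (all numbers below are made-up assumptions):

    # Model A: 98.0% overall accuracy, decent on the rare tail.
    # Model B: "improves" to 98.5% overall while halving tail accuracy.
    n_head, n_tail = 9800, 200  # the long tail is 2% of the data

    acc_a = (0.985 * n_head + 0.75 * n_tail) / (n_head + n_tail)
    acc_b = (0.995 * n_head + 0.50 * n_tail) / (n_head + n_tail)

    print(f"A: overall {acc_a:.2%}, tail 75%")  # 98.03%
    print(f"B: overall {acc_b:.2%}, tail 50%")  # 98.51%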

