
Goodhart's Law and how systems are shaped by the metrics you chase - neonate
https://whyisthisinteresting.substack.com/p/why-is-this-interesting-the-goodharts
======
hn_throwaway_99
One of the best examples of Goodhart's Law I think is what has happened over
the past 20 years in sports like gymnastics and figure skating.

The International Skating Union changed the figure skating scoring system
after the 2002 Olympics scandal. The goal was to make the system more
transparent and objective. Instead of a couple of 6.0 values, each element in
a program would get a score based on difficulty and quality of execution.

The end result is that things that are essentially easily measured (like the
number of revolutions in a jump) went up in value, while many feel like the
overall quality and artistry of skating has suffered greatly. I'm not going to
argue about how the "artistry" has suffered, even though I agree; artistry is
subjective, and it is supposed to be a sport, after all. What I think is
tragic about the scoring system, though, especially in women's figure skating,
is that it is greatly biased towards _girls_, not women. The top "women"
skaters these days are a trio of 16 year olds from Russia who excel at some of
the quadruple jumps that were rare even among the men just a decade or so ago.

The issue is that it's nearly impossible for a fully developed woman (i.e.
with breasts and hips) to do quadruple jumps - the physics just make it
extremely difficult. So, sadly in my opinion, it has turned ladies' figure
skating into girls' figure skating. Gone are the old days where you could
follow a skater and see her improve and mature over multiple Olympics - now
it's basically become a sport of mastering the difficult jumps and quads as
quickly as possible until physical development eventually takes over.

I see more of this in society in general now, where with so much data and
analytics there is a rush to "measure" everything. But I think it's important
to acknowledge what is lost when "subjective" becomes a dirty word.

~~~
cloogshicer
Thank you for the last paragraph.

I feel that a lot of people have this assumption, unconscious or not, that the
world is entirely objective. That everything can be measured, and what isn't
measurable doesn't exist. I think this belief (which essentially elevates the
scientific method to a religion) is causing a huge amount of grief in the
world.

~~~
doctor_eval
Agreed. We had a doctor who refused to believe that my partner’s symptoms were
relieved by a specific procedure - because there were no reports of other
people’s symptoms being relieved by this procedure. But by this logic, no such
report could ever get recorded!

It is critical that subjective, experiential data is cross referenced with
objective data in order to control for errors in both. Both kinds of data are
fallible for different reasons.

In my opinion, the erosion of subjectivity from science is probably a driver
of the loss of objectivity in politics.

~~~
jacquesm
I'd say it is the reverse. Less influence of science in politics is a driver
of loss of objectivity in politics.

~~~
doctor_eval
Well, ISTM that science has had a pretty good run in politics, especially
relative to the rest of history - but now, particularly in the US, science
based governance is well and truly out the door.

Why is this? In my opinion it’s because science has rejected people’s
subjective experience for too long, to the point where science has become just
“someone else’s opinion”. I can think of three examples from the past decade,
off the top of my head.

Gluten makes you feel yukky? Nonsense, said Science, only people with celiac
disease can be gluten intolerant... oh wait, there are other types of gluten
intolerance? Who knew?

Qualms about GM? Nonsense, said Science, what we’re doing is perfectly safe...
oh wait, GM crops can unexpectedly spread into non GM paddocks, ruining a
bunch of farmers? Who knew?

Prefer organic food? Nonsense says Science, pesticides are safe and
harmless... oh wait, all the bees are dying from Neonicotinoids? Who knew?

It’s not that there is anything wrong with the scientific method, but
scientists who run around and demean people’s subjective view of the world -
right or wrong - have IMO really contributed to the current crisis of faith in
science. People’s subjective opinion and experience must necessarily be taken
into account, since that’s part of their lived experience and what we are here
to explain. You don’t have to believe everything people tell you, but you do
have to listen and be tolerant.

It’s like a special application of Goodhart’s law, “if we haven’t already seen
it, it doesn’t exist”. As a huge fan of the scientific method, this attitude
makes me very sad.

~~~
jacquesm
I think the main reason is that science has simply become too hard to be
explained in layman's terms in a way that you can still connect all the dots.
Even plenty of scientists are now so narrow in their knowledge that within the
same domain they won't be able to keep up with their peers if it isn't their
exact specialization.

It's logical: in the Renaissance a single individual could still hold 'all of
science' in their head; by the early 1900s that had split up into a whole
bunch of domains each of which a single individual could still comprehend in
their entirety. By the mid 1970s I don't think any single scientist still had
complete command of their domain and that has only gotten worse.

It was bound to happen sooner or later, and this is not a sad thing per se,
more of a measure of how far we have come and how quickly we've done that. The
scientific method is the most amazing invention we've made; it succeeded where
everything else has failed at explaining how the world and in fact much of the
universe works.

------
cs702
Short, well-written, and interesting.

One question that popped in my mind as I was reading this was, how about
alpha, correlation, and volatility in financial markets? Has the behavior of
markets changed over time as alpha, correlation, and volatility have become
dominant metrics to be chased?

~~~
maest
A lot more has changed in the financial markets than just the way performance
is measured and which alphas are known.

Some examples (not in chronological order, mostly US equities focused):
electronic market makers, electronic order execution (which means large orders
are harder to detect now), retail access, the way retail flow is directed
around, ZIRP + QE, the rise of indexing, Volcker Rule.

My point is that it would be quite hard to isolate the impact of just how
people think of performance.

~~~
cs702
Thank you. Yes, I agree.

My question was only whether the behavior of the whole system has been shaped
by the chasing of these metrics.

Your comment implies that you think the answer is yes, even if you won't
venture a guess of the impact of these metrics independent of other changes
(understandably, in my view; I'm not sure I would venture a guess either).

------
thecellardoor
There was this concept of objectives and counter objectives which was meant to
combat Goodhart’s law. The idea was that you define what you want to measure
and then define another metric which acts as regularization for the first. A
good example is if you define a metric of WAU only, you might be motivated to
bring in as many new users each week as possible. Setting a counter metric of
retention makes you take into consideration that it’s important they stay.
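
The WAU-plus-retention pairing can be sketched in a few lines of Python. This is only an illustration: the `(user_id, week)` event format, the function names, and the sample data are all assumptions made up for the example, and "retention" here is taken to mean the share of last week's active users who are active again this week, which is one of several possible definitions.

```python
from collections import defaultdict


def weekly_active_users(events):
    """Map week number -> set of user ids active in that week.

    `events` is an iterable of (user_id, week) pairs (hypothetical format).
    """
    active = defaultdict(set)
    for user_id, week in events:
        active[week].add(user_id)
    return active


def wau(active, week):
    """The metric: number of distinct users active in `week`."""
    return len(active.get(week, set()))


def retention(active, week):
    """The counter-metric: share of the previous week's active users
    who are active again in `week`. Returns None if there is no baseline."""
    previous = active.get(week - 1, set())
    if not previous:
        return None
    return len(previous & active.get(week, set())) / len(previous)


# Made-up data: week 2 has more users than week 1, but almost all are new.
events = [
    ("a", 1), ("b", 1), ("c", 1), ("d", 1),
    ("a", 2), ("e", 2), ("f", 2), ("g", 2), ("h", 2),
]
active = weekly_active_users(events)
print(wau(active, 1), wau(active, 2))  # WAU goes up: 4 5
print(retention(active, 2))            # only 1 of 4 week-1 users returned: 0.25
```

Looking at WAU alone, week 2 is an improvement; the counter-metric shows that three of the four original users left.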

I don’t remember who suggested this - I vaguely remember it was someone at
a16z, but can’t find the original place I read it

~~~
munchbunny
_There was this concept of objectives and counter objectives which was meant
to combat Goodhart’s law._

Metrics and counter-metrics, IIRC originally coined by Julie Zhuo.

It’s a great framework for making sure that you consider Goodhart’s law, but
it’s also only as good as the person thinking about it.

The problem I have seen in practice is that the same person whose job it is to
think of the counter-metric (usually the PM) often looks better and gets
promoted/paid more when the real costs of chasing the KPI stay hidden. PM
leadership doesn’t have the bandwidth to make sure every PM is being rigorous.

I’ve felt this pressure as a PM (I don’t think I gave into the temptation but
that’s not for me to judge) and as a developer I’ve seen my PM invest only
nominal time on the counter-metric, where arguably you should be spending
_more_ time thinking about it than the metric itself. In practice, that
resulted in things like security-through-annoyance-and-unreliability.

I’ve been in those shoes, so I know defining metrics to track business goals
is really hard. But I do think people on the ground can tell when it’s a
matter of difficulty as opposed to a matter of cutting corners or storytelling
spin.

~~~
cutemonster
> security-through-annoyance-and-unreliability

What's that? How does it work / What does it mean?

------
cargo8
I love this, but it struck me in the opening paragraph how similar it is to
the challenges around training AI on higher-level abstraction goals, with tons
of examples of AIs basically "cheating" and passing information discreetly to
itself to yield answers, rather than actually solving the problem as the human
had hoped!

------
jacquesm
You can see this in the Netherlands: people are actively discouraged from
filing reports of criminal activity, resulting in lower crime on paper but
rising (small) crime in real life. Car burglaries, bike thefts, pickpocketing,
shoplifting and so on are strongly subject to this pressure.

------
RangerScience
This is an excellent example of Goodhart's Law in practice.

Every time it comes up, I'm reminded of a Dune quote: "The problem isn't
thinking machines, it's letting the machines do the thinking for you."

IMO the Law comes into effect as people stop also paying attention to the
externalities in favor of "letting the [metric] do the thinking for them."

------
gholap
We don't even need a fictional nail-factory example, because an almost-similar
real example exists: the Backyard Furnaces of China during the Great Leap
Forward [1]

> Pots, pans, and other metal artifacts were requisitioned to supply the
> "scrap" for the furnaces so that the wildly optimistic production targets
> could be met.

[1]
[https://en.wikipedia.org/wiki/Great_Leap_Forward#Backyard_fu...](https://en.wikipedia.org/wiki/Great_Leap_Forward#Backyard_furnaces)

------
amelius
We need to rate the performance of our economy not by how much we produce, but
by how little.

~~~
choward
A better metric would be something related to happiness. If we produce less
and GDP goes down but everyone is happier then who cares?

~~~
Mirioron
There are many ways happiness can be manipulated though. People tend to be
happy when things are improving. The most obvious way to manipulate this
metric is to deliberately hit rock bottom and then slowly improve from there.

Another problem is drugs. It's possible to manipulate people's happiness
rather explicitly with that.

Edit: these are out-there examples, but here's a more realistic one: borrow
money from future generations. Set up a system that will eventually fail, but
provide a lot of benefits/happiness in the present. When the long-term
consequences arrive, politics have already changed.

~~~
sideshowb
Drugs: yup, Huxley's dystopian "Brave New World" scores remarkably high on a
subjective wellbeing scale. I think this is a sort of obvious limit to the
validity of the metric though; it doesn't invalidate the metric within
sensible limits (though we could have a chat about antidepressants and ADHD I
suppose)

On your last point, Wales legislated for wellbeing as a guiding principle in
the Well-being of Future Generations Act, which incorporates that concern.

My other issues with wellbeing are ...

1. it's shaped by expectations. There are well known cases of people who have
horrific life changing accidents yet after adapting to their new lifestyle are
just as happy as before. We can also imagine very well off people being picky
because the champagne and caviar are not up to standard.

2. we know how to measure it at individual level but don't have enough data
on how we as a society think we should trade off between average levels of
wellbeing and inequality of the same.

More on all this on my link in sibling comment. I'm interested to hear
feedback on all the above as we're working on the next stage of this project
now.

------
ChainOfFools
What gets measured gets treasured.

alternate form: The uncountable is of no account

------
amadeuspagel
>But some of these chiefs started to figure out, wait a minute, the person
who's in charge of actually keeping track of the crime in my neighborhood is
me. And so if they couldn’t make crime go down, they just would stop reporting
crime. And they found all these different ways to do it. You could refuse to
take crime reports from victims, you could write down different things than
what had actually happened. You could literally just throw paperwork away. And
so that guy would survive that CompStat meeting, he’d get his promotion, and
then when the next guy showed up, the number that he had to beat was the
number that a cheater had set. And so he had to cheat a little bit more.

Why does the story end here? Why not put someone else in charge of taking
crime reports?

------
dannykwells
While many here would hate it, Goodhart's law is a powerful argument for things
being obtuse, vague and abstract. The less measurable something is, the less it
can be easily hacked.

Also, interestingly, what will stand in the way of AGI - no definable loss
function for these vague cases and scenarios.

~~~
swagasaurus-rex
Approval of managers in charge of payroll is a very obtuse, vague and abstract
thing.

It can also be quite demoralizing and damaging as an incentive structure, even
to very productive employees.

~~~
randomsearch
Approval of managers should, IMO, be the main way employee performance is
measured.

If it’s damaging then you’ve got a bad manager. Either his manager or HR
training is at fault. Both can be prevented with the right systems in place.

------
jmchuster
So the sample story was about how the police achieved their metrics through
unsavory methods because it was too hard/too much work to do it the "correct"
way?

Is the problem then how you go about achieving the metrics? Let's say that you
always pair metrics with an approval process for how you plan to achieve those
metrics. Presumably when the police chief gives a presentation on how he plans
to reduce crime rate by throwing out crime reports, the committee would be
none too pleased.

Does that theoretically handle the majority of cases where enacting metrics
has adverse effects, or does it fall on the other side of 80/20? Or maybe
actors are always so inclined to lie and cheat that there's no point in even
exploring such an approach.

~~~
yew
Presumably the optimal solution would be to prepare fake but easily executed
plans (tuned to appeal to whoever's on the committee) to go with your fake
numbers. But probably no amount of regulating police will help if your
regulators are also chasing metrics (say, to make themselves look better than
their opponent for voters).

Random sampling is one effective (if adversarial) way to validate numbers. For
policing I guess that looks like "secret citizens" reporting crimes and
sending the results back up the chain? But that doesn't solve the second half
of the problem.
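
The "secret citizens" idea is essentially an audit by random sampling, and a rough sketch might look like the following. Everything here is hypothetical: the report ids, the precinct that drops every fourth report, and the function name are invented for illustration. The assumption is that auditors file known test reports and then check a random subset of them against the official record, since verifying each one is costly.

```python
import random


def estimate_recording_rate(planted_ids, official_record, sample_size, seed=None):
    """Estimate the share of filed reports that reach the official record.

    planted_ids:     ids of test reports filed by auditors (hypothetical)
    official_record: ids that actually appear in the published records
    sample_size:     how many planted reports the auditors can afford to check
    """
    rng = random.Random(seed)
    # random.sample needs a sequence, so sort the set into a list first.
    sample = rng.sample(sorted(planted_ids), sample_size)
    recorded = sum(1 for report_id in sample if report_id in official_record)
    return recorded / sample_size


# Made-up scenario: 100 test reports, and a precinct that silently drops
# every fourth one, so the true recording rate is 0.75.
planted = {f"test-{i}" for i in range(100)}
official = {rid for rid in planted if int(rid.split("-")[1]) % 4 != 0}

rate = estimate_recording_rate(planted, official, sample_size=40, seed=1)
print(f"estimated share of reports recorded: {rate:.2f}")
```

The estimate only says how much the numbers are being shaved; as noted above, it does nothing about whoever reads the audit also chasing a metric.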

~~~
hef19898
You optimize what you measure, don't you? It is crucial that any set of
metrics considers this effect. Done right, a metrics system points an
organisation in the "right" direction and sets priorities for everyone. Done
wrong, and the organisation will run, sometimes ruthlessly efficiently, in the
wrong direction.

For policing, just counting the number of arrests and tickets written
(something German police are guilty of to a certain degree) pushes
investigations and police work toward pursuing "easy" crimes, and sometimes
harassing people for minor stuff to meet numbers. No idea what could be a
better way; absence of issues is so hard to quantify and faces the same risk
of optimizing for one thing at the cost of another.

------
mmhsieh
in some sense the history of the SAT examinations is a deep example of
Goodhart at work.

------
fblp
I'm curious what "deeper interventions" look like? What metrics systems work?

~~~
azernik
a) Pick a single measure that is very very hard to game (life expectancy, for
example). But that's non-trivial, even something like life expectancy can be
gamed by deciding who's included in the stats, by changing the start date
(e.g. for cancer diagnosis), by ignoring quality of life, etc.

b) Use a robust mix of different measures that is harder to game. Continuing
the example of the above, one common metric is the "disability-adjusted life
year", which combines quality of life, length of life, economic productivity,
and subjective ratings of happiness

c) Socialize your workers to believe in "the cause", so they are more likely
to do what you mean rather than what you measure (hard to do when you're not a
governmental or a political organization). One of the big motives of early
Soviet purges (back when they just involved kicking someone out of the Party)
was to remove people for whom this didn't work.

d) Not exactly "fair", but often effective - investigate and punish people
when they do counterproductive things to optimize their stats. This was
_another_ reason for said early purges of the CPSU membership rolls.

These are all _hard_.

------
aaron695
I personally don't believe Goodhart's Law overpowers the benefit of the
metric, ever.

It's a quaint factoid, but just as the Soviets would just execute a factory
manager gaming the system for producing pins or spikes, people will see you at
some level gaming the system and call you on it. (and make a TV show like The
Wire)

Just like other factoids (e.g. risk compensation), the effects will be real, but
the factoid is so misused at extreme interpretations that the factoids do more
harm than good.

~~~
mannanj
Instead of downvoting you I thought I'd ask, how likely do you think people do
see you gaming and don't call you on it? I've been in many more situations
where you don't have the power or ability to call someone out on something
because it's more complicated than just "I'll tell on you".

~~~
aaron695
It's your boss not being fooled by your gaming and not promoting you.

It's not something you report.

Like taking a long lunch break.

People aren't blind, they can see what you are doing, people will talk about
it behind your back.

Goodhart's law relies on a mythical assumption everyone in the organisation is
just out to scam it.

It's all measured in real life. If people are evil and out to game metrics why
won't they just game the system when there are no metrics?

A good version of Goodhart's law might be, people will work towards metrics so
sometimes it's worth tweaking metrics to reduce incorrect outcomes, but
sometimes worrying about the inefficiency is worse than the inefficiency
itself.

~~~
kingdomcome50
I think your last paragraph gets to the point.

Goodhart’s law is not about people “gaming” the system. Sure that could be one
of the effects of aligning incentives to metrics, but it’s really about
the dangers of aligning _goals_ to metrics (of course incentives are often
tied to achieving goals).

For most non-trivial systems it can be difficult to correctly identify and
describe all of the variables that affect the system and in what ways. For
example, at a high level a company’s singular goal may be something as
abstract as “make more money”. It’s pretty difficult to just _do_ that, so we
break it down into smaller, more concrete goals that we feel will contribute
to the above. How? Good news! Last quarter we hired a business analyst who
started recording detailed metrics about our business and processes. Now we
can simply look at the metrics from last quarter and try to improve upon some
key numbers.

And this is where Goodhart’s law comes in. You see last quarter we were not
_trying_ to improve upon any specific numbers. Although we can see that we
achieved poor efficiency in our inventory processes last quarter, it turns out
that focusing on improving this metric actually net lost us money because it
created more inefficiency in shipping!

No one has to be “gaming” the system for the above to occur. A system full of
good actors can fall prey to turning metrics into goals (at the expense of the
system as a whole). Goodhart’s law is just a warning. Of course we still want
to measure things!

~~~
babesh
If it isn't a goal (or a subgoal) then why measure it? Groups have many goals
but what usually happens is that certain goals are prioritized.

Unfortunately, because of incentives, the anointed goals (or the metrics) get
over prioritized causing worsening externalities until the goals or metrics
get adjusted to correct the externalities. "Move fast and break things" turns
into "Move fast with stable infrastructure".

Of course this is a dialectic, so expect the goals and metrics to shift again
as new externalities are exploited. The ultimate correcting mechanism is a god
who isn't a slave to the incentive structure and who resets the system by
getting rid of bad actors.

The world is the Matrix.

~~~
toast0
> If it isn't a goal (or a subgoal) then why measure it?

Often the goals are hard to measure, so you measure a proxy. This can work
well if people are actively focused on the goal, and check the measure to see
how they're doing. It stops working if people forget that the measure isn't
the goal, and work to increase the measure.
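
A toy model makes the proxy-versus-goal split concrete. All the numbers and names below are assumptions invented for illustration: a worker has a fixed effort budget, real work raises both the goal and the measure, and "gaming" raises only the measure, but by more per unit of effort.

```python
def outcomes(effort_on_real_work, total_effort=10, gaming_payoff=2.0):
    """Return (goal, proxy) for a given split of a fixed effort budget.

    goal:  what we actually care about; only real work moves it
    proxy: what gets measured; real work moves it, gaming moves it more
    """
    gaming = total_effort - effort_on_real_work
    goal = effort_on_real_work
    proxy = effort_on_real_work + gaming_payoff * gaming
    return goal, proxy


allocations = range(0, 11)  # units of effort spent on real work

# Someone who checks the measure but works toward the goal:
best_for_goal = max(allocations, key=lambda e: outcomes(e)[0])
# Someone who has forgotten the measure isn't the goal:
best_for_proxy = max(allocations, key=lambda e: outcomes(e)[1])

print(best_for_goal, outcomes(best_for_goal))    # 10 (10, 10.0)
print(best_for_proxy, outcomes(best_for_proxy))  # 0 (0, 20.0)
```

In this sketch the proxy-maximizer reports a measure twice as high while delivering none of the goal. With `gaming_payoff` below 1 the two choices coincide, which is the case where measuring a proxy works fine.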

