Goodhart's Law and how systems are shaped by the metrics you chase (whyisthisinteresting.substack.com)
160 points by neonate on July 7, 2020 | 58 comments

One of the best examples of Goodhart's Law I think is what has happened over the past 20 years in sports like gymnastics and figure skating.

The International Skating Union changed the figure skating scoring system after the 2002 Olympics scandal. The goal was to make the system more transparent and objective. Instead of a couple of 6.0 values, each element in a program would get a score based on difficulty and quality of execution.

The end result is that things that are essentially easily measured (like the number of revolutions in a jump) went up in value, while many feel like the overall quality and artistry of skating has suffered greatly. I'm not going to argue about how the "artistry" has suffered, even though I agree; artistry is subjective, and it is supposed to be a sport, after all. What I think is tragic about the scoring system, though, especially in women's figure skating, is that it is greatly biased towards girls, not women. The top "women" skaters these days are a trio of 16 year olds from Russia who excel at some of the quadruple jumps that were rare even among the men just a decade or so ago.

The issue is that it's nearly impossible for a fully developed woman (i.e. with breasts and hips) to do quadruple jumps - the physics just make it extremely difficult. So, sadly in my opinion, it has turned ladies figure skating into girls figure skating. Gone are the old days where you could follow a skater and see her improve and mature over multiple Olympics - now it's basically become a sport of mastering the difficult jumps and quads as quickly as possible until physical development eventually takes over.

I see more of this in society in general now, where with so much data and analytics there is a rush to "measure" everything. But I think it's important to acknowledge what is lost when "subjective" becomes a dirty word.

Thank you for the last paragraph.

I feel that a lot of people have this assumption, unconscious or not, that the world is entirely objective. That everything can be measured, and what isn't measurable doesn't exist. I think this belief (which essentially elevates the scientific method to a religion) is causing a huge amount of grief in the world.

Agreed. We had a doctor who refused to believe that my partner’s symptoms were relieved by a specific procedure - because there were no reports of other people’s symptoms being relieved by this procedure. But by this logic, no such report could ever get recorded!

It is critical that subjective, experiential data is cross referenced with objective data in order to control for errors in both. Both kinds of data are fallible for different reasons.

In my opinion, the erosion of subjectivity from science is probably a driver of the loss of objectivity in politics.

I'd say it is the reverse. Less influence of science in politics is a driver of loss of objectivity in politics.

Well, ISTM that science has had a pretty good run in politics, especially relative to the rest of history - but now, particularly in the US, science based governance is well and truly out the door.

Why is this? In my opinion it’s because science has rejected people’s subjective experience for too long, to the point where science has become just “someone else’s opinion”. I can think of three examples from the past decade, off the top of my head.

Gluten make you feel yukky? Nonsense, said Science, only people with celiac disease can be gluten intolerant... oh wait, there are other types of gluten intolerance? Who knew?

Qualms about GM? Nonsense, said Science, what we’re doing is perfectly safe... oh wait, GM crops can unexpectedly spread into non GM paddocks, ruining a bunch of farmers? Who knew?

Prefer organic food? Nonsense says Science, pesticides are safe and harmless... oh wait, all the bees are dying from Neonicotinoids? Who knew?

It’s not that there is anything wrong with the scientific method, but scientists who run around and demean people’s subjective view of the world - right or wrong - have IMO really contributed to the current crisis of faith in science. People’s subjective opinion and experience must necessarily be taken into account, since that’s part of their lived experience and what we are here to explain. You don’t have to believe everything people tell you, but you do have to listen and be tolerant.

It’s like a special application of Goodhart’s law, “if we haven’t already seen it, it doesn’t exist”. As a huge fan of the scientific method, this attitude makes me very sad.

I think the main reason is that science has simply become too hard to be explained in layman's terms in a way that you can still connect all the dots. Even plenty of scientists are now so narrow in their knowledge that within the same domain they won't be able to keep up with their peers if it isn't their exact specialization.

It's logical: in the renaissance a single individual could still hold 'all of science' in their heads, by the early 1900's that had split up into a whole bunch of domains each of which a single individual could still comprehend in their entirety. By the mid 1970s I don't think any single scientist still had complete command of their domain and that has only gotten worse.

It was bound to happen sooner or later, and this is not a sad thing per se, more of a measure of how far we have come and how quickly we've done that. The scientific method is the most amazing invention we've done, it succeeded where everything else has failed at explaining how the world and in fact much of the universe works.

Everything can be measured. We just don’t have the tech yet.

A better phrasing of the problem is: the world is immensely complicated and our individual and collective intelligences have only just begun to scratch the surface of this complexity.

Once an individual understands this, they realize that almost everything in life cannot be measured or predicted, and that subjectivity is still very important and a completely valid way to navigate life.

> Everything can be measured. We just don’t have the tech yet.

Theory of science shows us that this statement is a belief. Whether you believe it or not can be argued about, but it can never be proven or disproven, much like the question whether God exists.

Personally, I believe the opposite - that there are some things that even in theory, even with the best and most advanced technology, can't be measured.

> I see more of this in society in general now, where with so much data and analytics there is a rush to "measure" everything. But I think it's important to acknowledge what is lost when "subjective" becomes a dirty word.

You may measure as long as you accept the limitations associated with the measure. Measures are man-made. Hence, subject to interpretation. You add that as a † next to the measure and accept the implications of using that as a measure. There's no perfect measure. There are only options and implications.

Well you can argue the same and say that football sucks because it privileges athleticism which is at its peak for 20-something people.

Short, well-written, and interesting.

One question that popped in my mind as I was reading this was, how about alpha, correlation, and volatility in financial markets? Has the behavior of markets changed over time as alpha, correlation, and volatility have become dominant metrics to be chased?

A lot more has changed in the financial markets than just the way performance is measured and which alphas are known.

Some examples (not in chronological order, mostly US equities focused): electronic market makers, electronic order execution (which means large orders are harder to detect now), retail access, the way retail flow is directed around, ZIRP + QE, the rise of indexing, Volcker Rule.

My point is that it would be quite hard to isolate the impact of just how people think of performance.

Thank you. Yes, I agree.

My question was only whether the behavior of the whole system has been shaped by the chasing of these metrics.

Your comment implies that you think the answer is yes, even if you won't venture a guess of the impact of these metrics independent of other changes (understandably, in my view; I'm not sure I would venture a guess either).

There was this concept of objectives and counter objectives which was meant to combat Goodhart’s law. The idea was that you define what you want to measure and then define another metric which acts as regularization for the first. A good example is if you define a metric of WAU only, you might be motivated to bring in as many new users each week as possible. Setting a counter metric of retention makes you take into consideration that it’s important they stay.

I don’t remember who suggested this - I vaguely remember it was someone at a16z, but can’t find the original place I read it
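To make the metric/counter-metric pairing concrete, here is a minimal Python sketch. The event log, the user names, and the definition of retention (share of last week's users who come back this week) are all assumptions for illustration, not anything from the original suggestion.

```python
from collections import defaultdict

# Hypothetical activity log: (user_id, week number) pairs.
events = [
    ("ann", 1), ("bob", 1), ("cat", 1),
    ("ann", 2), ("dan", 2), ("eve", 2), ("fay", 2),
    ("ann", 3), ("gus", 3), ("hal", 3), ("ivy", 3), ("jon", 3),
]

def weekly_active_users(events):
    """Map each week to the set of users active in it."""
    active = defaultdict(set)
    for user, week in events:
        active[week].add(user)
    return dict(active)

def metric_and_counter_metric(events):
    """Return {week: (WAU, retention)}.

    Retention here is an assumed definition: the share of the previous
    week's users who were active again this week (None for the first week).
    """
    active = weekly_active_users(events)
    report = {}
    for week in sorted(active):
        wau = len(active[week])
        previous = active.get(week - 1)
        retention = len(active[week] & previous) / len(previous) if previous else None
        report[week] = (wau, retention)
    return report

report = metric_and_counter_metric(events)
for week, (wau, retention) in report.items():
    print(week, wau, retention)
```

In this made-up data WAU climbs every week (3, 4, 5) while retention drops from one third to one quarter, which is the kind of thing WAU alone would hide.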

> There was this concept of objectives and counter objectives which was meant to combat Goodhart’s law.

Metrics and counter-metrics, IIRC originally coined by Julie Zhuo.

It’s a great framework for making sure that you consider Goodhart’s law, but it’s also only as good as the person thinking about it.

The problem I have seen in practice is that the same person whose job it is to think of the counter-metric (usually the PM) often looks better and gets promoted/paid more when the real costs of chasing the KPI stay hidden. PM leadership doesn’t have the bandwidth to make sure every PM is being rigorous.

I’ve felt this pressure as a PM (I don’t think I gave into the temptation but that’s not for me to judge) and as a developer I’ve seen my PM invest only nominal time on the counter-metric, where arguably you should be spending more time thinking about it than the metric itself. In practice, that resulted in things like security-through-annoyance-and-unreliability.

I’ve been in those shoes, so I know defining metrics to track business goals is really hard. But I do think people on the ground can tell when it’s a matter of difficulty as opposed to a matter of cutting corners or storytelling spin.

> security-through-annoyance-and-unreliability

What's that? How does it work / What does it mean?

I think this approach was also suggested in Andy Grove's High Output Management.

Counter Metrics?

You can see this in the Netherlands: people are actively discouraged from filing reports of criminal activity, resulting in lower crime on paper but rising (small) crime in real life. Car burglaries, bike thefts, pickpocketing, shoplifting and so on are strongly subjected to this pressure.

I love this, but it struck me in the opening paragraph how similar it is to the challenges around training AI on higher-level abstraction goals, with tons of examples of AIs basically "cheating" and passing information discreetly to themselves to yield answers, rather than actually solving the problem as the human had hoped!

This is an excellent example of Goodhart's Law in practice.

Every time it comes up, I'm reminded of a Dune quote: "The problem isn't thinking machines, it's letting the machines do the thinking for you."

IMO the Law comes into effect as people stop also paying attention to the externalities in favor of "letting the [metric] do the thinking for them."

We don't even need a fictional nail-factory example, because an almost-similar real example exists: the Backyard Furnaces of China during the Great Leap Forward [1]

> Pots, pans, and other metal artifacts were requisitioned to supply the "scrap" for the furnaces so that the wildly optimistic production targets could be met.

[1] https://en.wikipedia.org/wiki/Great_Leap_Forward#Backyard_fu...

We need to rate the performance of our economy not by how much we produce, but by how little.

A better metric would be something related to happiness. If we produce less and GDP goes down but everyone is happier then who cares?

There are many ways happiness can be manipulated though. People tend to be happy when things are improving. The most obvious way to manipulate this metric is to deliberately hit rock bottom and then slowly improve from there.

Another problem is drugs. It's possible to manipulate people's happiness rather explicitly with that.

Edit: these are out there examples, but here's a more realistic one: borrow money from future generations. Set up a system that will eventually fail, but provide a lot of benefits/happiness in the present. When the long-term consequences arrive politics have already changed.

Drugs: yup, Huxley's dystopian "Brave new world" scores remarkably high on a subjective wellbeing scale. I think this is a sort of obvious limit to the validity of the metric though; it doesn't invalidate the metric within sensible limits (though we could have a chat about antidepressants and adhd I suppose)

On your last point, Wales legislated for wellbeing as a guiding principle in the Well-being of Future Generations Act, which incorporates that concern.

My other issues with wellbeing are ...

1. it's shaped by expectations. There are well known cases of people who have horrific life changing accidents yet after adapting to their new lifestyle are just as happy as before. We can also imagine very well off people being picky because the champagne and caviar are not up to standard.

2. we know how to measure it at individual level but don't have enough data on how we as a society think we should trade off between average levels of wellbeing and inequality of the same.

More on all this on my link in sibling comment. I'm interested to hear feedback on all the above as we're working on the next stage of this project now.

You can also take a hint from the article and just lie about what people report.

The New Zealand “Wellbeing Budget” comes to mind when discussing GDP alternatives: https://treasury.govt.nz/sites/default/files/2019-05/b19-wel...

Funny enough I recently published something on wellbeing as a metric. Goodhart gets a mention... https://www.mdpi.com/2071-1050/12/8/3180

The OECD Better Life Index is exactly that:


It's an alternative to macro-economic indicators such as GDP that measures quality of life across 11 dimensions (housing, education, work-life balance, health, environment, community,...).

The BLI is a set of metrics created in line with concepts such as Gross National Well-Being.


A big criticism of this methodology is that it's hard to quantify a measure of quality because you can always keep on asking "What constitutes quality?" and "Did I accurately capture quality as the sum of its constituent parameters I measure through quantification?" So, the hard part isn't the measuring itself, but finding common agreement on what gets measured, and whether those metrics match within a conceptual framework that underpins a shared understanding of how society functions.

At face value, GDP doesn't suffer from this problem of perception, because the measure of economic output is represented through an already widely agreed upon representation of economic reality that lends itself to quantification: currency.

However, GDP isn't free from the same criticism because you can always de-construct the relationship between currency and the value it represents. For instance, the U.S. ranks at the top of nations measured by GDP, but on the Gini coefficient - which measures how unequally income or wealth is distributed across a population - the U.S. ranks rather poorly.

https://en.wikipedia.org/wiki/Gini_coefficient https://en.wikipedia.org/wiki/List_of_countries_by_income_eq...
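As a rough illustration of why a total can hide the distribution, here is a minimal Python sketch of the Gini coefficient using the standard sorted-rank formula. The two sample populations are made up; they have the same total but very different spreads.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means perfect equality).

    Uses the sorted-rank form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    with x sorted ascending and i running from 1 to n.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical incomes: both groups sum to 40.
equal = gini([10, 10, 10, 10])      # everyone has the same
concentrated = gini([0, 0, 0, 40])  # one person has everything

print(equal, concentrated)
```

For a finite group of n people the maximum is (n - 1) / n rather than 1, so four people with one holding everything gives 0.75.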

Framing the discussion as a question of "What am I measuring?" is an alternate take on what is in essence Goodhart's Law. I think Goodhart's Law is basically a function of a form of fallacious thinking which is innate to human psychology: it's the preconception that "value" is something that exists outside of human experience.

For instance, it's the idea of gold or money being "valuable" in its own right, regardless of how humans perceive it. Of course, it's the other way around. Value is an umbrella concept that signifies particular meaning or importance which is tied uniquely to the human experience. Value is a fluid concept with a myriad of understandings, definitions and feelings which get entirely steered by context and circumstances. Gold and money are only worth something when humans believe that they are worth something: by attributing value, through externalization.

And so, that's where you arrive at an inescapable conclusion: In order to define value, you have to dismiss the notion that anything - goods and services - has a fixed innate value, and accept that economic value is actually an ever-evolving fluid aggregate function based on a shared understanding and driven by compromise. So, why is this very hard to do? Because externalizing value to goods and services introduces a measure of certainty which human minds tend to crave. Whereas the idea that value is fluid introduces uncertainty, which is something our risk-averse thinking has a rather hard time coping with.

To go back to Goodhart's Law and your comment, overproduction isn't entirely preventable. It usually signifies a sort of irrational way of attributing value that isn't based on any shared understanding of value. In the case of the Soviet nail factory, or the Great Leap Forward metal furnaces: the output served no general purpose for those who produced or consumed it, but rather acted as a marker for external validation of preconceptions held by the incumbent elites at the time.

In your statement, the "we" and the "everyone" are what really matter. Unless you define who these groups are, you risk falling into the same trap Goodhart's Law tries to point out.

Somalia, Congo, and North Korea produce very little. Is this what you had in mind? If not, could you clarify?

More how few people you need to produce all that is needed. Standardised to a 35 hour week (say) to deal with exploitation.

The Covid crisis has shown how few people are actually needed in most western nations and how much activity is largely fluff.

If we were rational we’d be discussing that rather than trying to restore the status quo ante - which is nothing more than chasing a figure.

By the ratio of how much we can do vs the amount of resources extracted from the planet.

What gets measured gets treasured.

alternate form: The uncountable is of no account

>But some of these chiefs started to figure out, wait a minute, the person who's in charge of actually keeping track of the crime in my neighborhood is me. And so if they couldn’t make crime go down, they just would stop reporting crime. And they found all these different ways to do it. You could refuse to take crime reports from victims, you could write down different things than what had actually happened. You could literally just throw paperwork away. And so that guy would survive that CompStat meeting, he’d get his promotion, and then when the next guy showed up, the number that he had to beat was the number that a cheater had set. And so he had to cheat a little bit more.

Why does the story end here? Why not put someone else in charge of taking crime reports?

While many here would hate it, Goodhart's law is a powerful argument for things being obtuse, vague and abstract. The less measurable something is, the less easily it can be hacked.

Also, interestingly, what will stand in the way of AGI - no definable loss function for these vague cases and scenarios.

Approval of managers in charge of payroll is a very obtuse, vague and abstract thing.

It can also be quite demoralizing and damaging as an incentive structure, even to very productive employees.

Approval of managers should, IMO, be the main way employee performance is measured.

If it’s damaging then you’ve got a bad manager. Either his manager or HR training is at fault. Both can be prevented with the right systems in place.

So the sample story was about how the police achieved their metrics through unsavory methods because it was too hard/too much work to do it the "correct" way?

Is the problem then how you go about achieving the metrics? Let's say that you always pair metrics with an approval process for how you plan to achieve those metrics. Presumably when the police chief gives a presentation on how he plans to reduce crime rate by throwing out crime reports, the committee would be none too pleased.

Does that theoretically handle the majority of cases where enacting metrics has adverse effects, or does it fall on the other side of 80/20? Or maybe actors are always so inclined to lie and cheat that there's no point in even exploring such an approach.

Presumably the optimal solution would be to prepare fake but easily executed plans (tuned to appeal to whoever's on the committee) to go with your fake numbers. But probably no amount of regulating police will help if your regulators are also chasing metrics (say, to make themselves look better than their opponent for voters).

Random sampling is one effective (if adversarial) way to validate numbers. For policing I guess that looks like "secret citizens" reporting crimes and sending the results back up the chain? But that doesn't solve the second half of the problem.
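A minimal Python sketch of what such a sampling audit could look like, assuming you can independently verify a small random sample of reports. The record fields, categories, and numbers are all hypothetical; in practice the "verified" value would only exist for the sampled records.

```python
import random

def estimate_misreport_rate(records, sample_size, seed=0):
    """Audit a random sample (without replacement) and return the share
    whose recorded category differs from the independently verified one."""
    rng = random.Random(seed)
    sample = rng.sample(records, sample_size)
    mismatches = sum(1 for r in sample if r["recorded"] != r["verified"])
    return mismatches / sample_size

# Hypothetical data: 1000 incidents, 200 of them downgraded on paper.
records = (
    [{"recorded": "lost property", "verified": "theft"}] * 200
    + [{"recorded": "theft", "verified": "theft"}] * 800
)
# A precinct that records everything accurately, for comparison.
honest = [{"recorded": "theft", "verified": "theft"}] * 1000

rate = estimate_misreport_rate(records, sample_size=100)
print(rate)
```

The estimate from 100 audited reports will land somewhere near the true 20% without anyone having to re-check all 1000, but as noted it only validates the numbers; it says nothing about whether the people reading them are chasing metrics of their own.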

You optimize what you measure, don't you? It is crucial that any set of metrics takes this effect into account. Done right, a metrics system points an organisation in the "right" direction and sets priorities for everyone. Done wrong, the organisation will run, sometimes with ruthless efficiency, in the wrong direction.

For policing, just counting the number of arrests and tickets written (something German police are guilty of to a certain degree) pushes investigations and police work towards pursuing "easy" crimes, and sometimes harassing people for minor stuff to meet numbers. No idea what a better way could be; absence of issues is so hard to quantify, and it faces the same risk of optimizing for one thing at the cost of another.

"My plan is to <blank>." The chief then does <blank>. Sociology is hard, so even if <blank> sounds plausible it doesn't help. However, lying about the paperwork does still help, and because they submitted a plausible plan (that they followed through on) there isn't any investigation. Worse yet, other people might be actively misled by <blank>'s apparent success and fail to implement it in their precinct with positive outcomes.

In some sense the history of the SAT examinations is a deep example of Goodhart at work.

I'm curious what "deeper interventions" look like? What metrics systems work?

a) Pick a single measure that is very very hard to game (life expectancy, for example). But that's non-trivial, even something like life expectancy can be gamed by deciding who's included in the stats, by changing the start date (e.g. for cancer diagnosis), by ignoring quality of life, etc.

b) Use a robust mix of different measures that is harder to game. Continuing the example of the above, one common metric is the "disability-adjusted life year", which combines quality of life, length of life, economic productivity, and subjective ratings of happiness

c) Socialize your workers to believe in "the cause", so they are more likely to do what you mean rather than what you measure (hard to do when you're not a governmental or a political organization). One of the big motives of early Soviet purges (back when they just involved kicking someone out of the Party) was to remove people for whom this didn't work.

d) Not exactly "fair", but often effective - investigate and punish people when they do counterproductive things to optimize their stats. This was another reason for said early purges of the CPSU membership rolls.

These are all hard.

Was mostly referencing back to this on general systems stuff, not a specific metrics system that works. https://whyisthisinteresting.substack.com/p/why-is-this-inte...

In the (probably fictional) Russian Nail Factory example, the customer would reject the nails, if there was a real customer, not a planned economy.

External metrics are harder to game. For example, many car companies pay attention to IIHS safety statistics, JD Power Ratings, etc.

You still need internal metrics, but it's harder to have target drift when you also have external organizations measuring things.

In the NYPD example, you could imagine state and federal law enforcement oversight of the city. Or the city could measure civilian complaints. At some point you have to have good faith; metrics alone don't do anything by themselves.

I personally don't believe Goodhart's Law overpowers the benefit of the metric, ever.

It's a quaint factoid, but just as the Soviets would just execute a factory manager gaming the system for producing pins or spikes, people will see you at some level gaming the system and call you on it. (and make a TV show like The Wire)

Just like other factoids (e.g. risk compensation), the effects will be real, but the factoid is so misused at extreme interpretations that it does more harm than good.

Instead of downvoting you I thought I'd ask, how likely do you think people do see you gaming and don't call you on it? I've been in many more situations where you don't have the power or ability to call someone out on something because it's more complicated than just "I'll tell on you".

It's your boss not being fooled by your gaming and not promoting you.

It's not something you report.

Like taking a long lunch break.

People aren't blind, they can see what you are doing, people will talk about it behind your back.

Goodhart's law relies on a mythical assumption that everyone in the organisation is just out to scam it.

It's all measured in real life. If people are evil and out to game metrics why won't they just game the system when there are no metrics?

A good version of Goodhart's law might be: people will work towards metrics, so sometimes it's worth tweaking metrics to reduce incorrect outcomes, but sometimes worrying about the inefficiency is worse than the inefficiency itself.

I think your last paragraph gets to the point.

Goodhart’s law is not about people “gaming” the system. Sure, that could be one of the effects of aligning incentives to metrics, but it’s really about the dangers of aligning _goals_ to metrics (of course incentives are often tied to achieving goals).

For most non-trivial systems it can be difficult to correctly identify and describe all of the variables that affect the system and in what ways. For example, at a high level a company’s singular goal may be something as abstract as “make more money”. It’s pretty difficult to just _do_ that, so we break it down into smaller, more concrete goals that we feel will contribute to the above. How? Good news! Last quarter we hired a business analyst who started recording detailed metrics about our business and processes. Now we can simply look at the metrics from last quarter and try to improve upon some key numbers.

And this is where Goodhart’s law comes in. You see last quarter we were not _trying_ to improve upon any specific numbers. Although we can see that we achieved poor efficiency in our inventory processes last quarter, it turns out that focusing on improving this metric actually net lost us money because it created more inefficiency in shipping!

No one has to be “gaming” the system for the above to occur. A system full of good actors can fall prey to turning metrics into goals (at the expense of the system as a whole). Goodhart’s law is just a warning. Of course we still want to measure things!

If it isn't a goal (or a subgoal) then why measure it? Groups have many goals but what usually happens is that certain goals are prioritized.

Unfortunately, because of incentives, the anointed goals (or the metrics) get over prioritized causing worsening externalities until the goals or metrics get adjusted to correct the externalities. "Move fast and break things" turns into "Move fast with stable infrastructure".

Of course this is a dialectic, so expect the goals and metrics to shift again as new externalities are exploited. The ultimate correcting mechanism is a god who isn't a slave to the incentive structure and who resets the system by getting rid of bad actors.

The world is the Matrix.

> If it isn't a goal (or a subgoal) then why measure it?

Often the goals are hard to measure, so you measure a proxy. This can work well if people are actively focused on the goal, and check the measure to see how they're doing. It stops working if people forget that the measure isn't the goal, and work to increase the measure.

Your boss has the most to gain from you cheating. He’s probably the reason you are gaming the metrics in the first place. And if you do it sloppily he will fire you.

If everybody's gaming something a little bit, nobody is incentivized to take action on most things, even if someone reports it.

It often takes egregious abuses to get people to care.

(And is this so bad? The end result of "perfectly efficient" is pretty brutal for the individual worker.)

> I personally don't believe Goodhart's Law overpowers the benefit of the metric, ever.

I'll lay two pairs of words against that:

Vietnam War.

Body Count.

I can’t think of a single person who has worked in academia who would have any sympathy with your position. Maybe academia is an outlier.
